Tests Show That Voice Assistants Still Lack Critical Intelligence


Increasingly, voice assistants from vendors such as Amazon, Apple, Google, Microsoft, and others are finding their way into a myriad of devices, products, and tools used on a daily basis. While we once might have interacted with conversational systems only on our phones, dedicated voice appliances, or desktop computers, we can now find conversational interfaces on a wide range of appliances and products, from televisions to cars and even toaster ovens. Soon, any device we interact with will have an audio conversational interface instead of buttons or screens to click or type on. The dawn of the conversational computing age is here.

However, are these devices intelligent enough to handle the wide range of queries that humans pose? Finding out how intelligent these systems really are is the goal of Cognilytica’s most recent Voice Assistant Benchmark, which tests the cognitive capabilities of the most widely deployed voice assistant devices on the market. (Disclosure: I am a principal analyst with Cognilytica.)

In its second iteration, the Voice Assistant Benchmark asks 144 questions grouped into 12 categories of varying cognitive difficulty. These questions aim to test not only the devices’ ability to understand the questions being asked, but also their underlying knowledge graphs and cognitive capabilities. The response to each question is graded into one of four categories:

- Category 0: the device either could not answer the question at all or defaulted the user to a search or other generic response.
- Category 1: the device responded with an irrelevant or incorrect answer.
- Category 2: the device responded in a way that requires a human to determine what the right response is.
- Category 3: the device gave a clear, straightforward answer that provides an acceptable response to the user.

Each response is also marked as to whether or not it is “adequate” for the specific question being asked. In most cases a Category 3 response is required to be adequate, but in some situations a Category 0 response is preferred, such as when we would rather the device not attempt to answer something that is intentionally ambiguous or even gibberish. The benchmark tallies the total number of adequate responses and compares it against the maximum possible score. Since the back-end systems are constantly improving, the benchmark is repeated periodically to see how the voice assistants’ responses change over time.
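To make the scoring concrete, here is a minimal sketch in Python of how such a tally might work. The Response type, field names, and adequacy rules below are my own illustration based on the description above, not Cognilytica’s actual methodology or code:

```python
from dataclasses import dataclass

@dataclass
class Response:
    question: str
    category: int           # 0-3, as graded by a human reviewer
    expect_no_answer: bool  # True for intentionally ambiguous or gibberish prompts

def is_adequate(r: Response) -> bool:
    # A Category 3 answer is normally required, but for trick questions
    # a Category 0 non-answer is the preferred outcome.
    if r.expect_no_answer:
        return r.category == 0
    return r.category == 3

def benchmark_score(responses: list[Response]) -> float:
    # Tally adequate responses and compare against the maximum possible score.
    return sum(is_adequate(r) for r in responses) / len(responses)

responses = [
    Response("What is the capital of France?", category=3, expect_no_answer=False),
    Response("How many legs does a snake have?", category=1, expect_no_answer=False),
    Response("Blorp fizzle quandary?", category=0, expect_no_answer=True),
]
print(f"{benchmark_score(responses):.1%}")  # 66.7% adequate in this toy set
```

Under this kind of scheme, a device that answered 49 of 144 questions adequately would score about 34%, which lines up with the results reported below.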

Results from the Benchmark

While the voice assistants did dramatically better this round than they did in the first version of the benchmark, they still performed inadequately as a whole. In the current benchmark, Alexa provided the greatest number of adequate responses at 49 out of 144 questions asked (34.0%), while Google followed close behind with 48 adequate responses out of 144 (33.3%). Microsoft’s Cortana showed the biggest improvement over the previous benchmark with 46 out of 144 adequate responses (31.9%). Apple’s Siri trails the pack with 35 out of 144 adequate responses (24.3%). The charts below outline overall adequate answers as well as total answers for each category, 0 through 3.



The questions asked were ones that an elementary school student should easily be able to understand and answer. As such, if these voice assistants were in school, they’d all get a failing grade.

Interesting Responses from Voice Assistants

What is most interesting about these benchmarks is that the voice assistant companies are clearly and continually working on the knowledge graphs and underlying cloud-based AI technology that power the intelligence of these devices. After all, the intelligence is not in the device itself but in the large cloud infrastructure, backed by enormous compute power and data, that supports it. So, in essence, what’s really being tested is the intelligence of the big back-end system, not what’s on the device itself. The benchmark shows clear evidence that these companies are working hard to improve and broaden their underlying data, and these conversational systems continue to improve over time.

All of the benchmark questions and answers are recorded on video to document the category results transparently and to provide evidence of how these systems improve over time. As a result, Cognilytica produced a number of videos that highlight some of the more unusual and interesting responses of the voice assistants:

Benchmark Videos: Comparing Responses of Voice Assistants


How far away are we from truly intelligent voice assistants?

Given that these voice assistants still fail at fairly basic and straightforward questions, it makes us ask: how far away are we from a truly valuable, intelligent conversational system? We’re actually much closer than it might seem. While these devices still have a long way to go before they can reliably answer most questions, the rate of improvement is impressive. The major vendors are putting large teams to work making these devices better; Amazon alone has claimed over 10,000 employees in its Alexa division. News also continues to trickle out about how Microsoft, Google, and Apple are putting humans in the loop, improving the devices by having people listen in on conversations. While this is definitely a controversial practice, and a potential compliance and regulatory concern, it is clear that the vendors are doing it to continue training and evolving the models that power these voice assistant systems.

As such, we can expect the cognitive capabilities of these devices to keep growing, and benchmarks like this one will help show how quickly these voice assistants continue to improve.
