
Measuring AI System Effectiveness

What are we trying to measure when we think about evaluating intelligence? Couched in human terms, the aim would be to assess the overall ability to think, reason, and make rational decisions.

Daniel Karp

In conventional psychology, the aim is to measure several different cognitive characteristics and form an overall view of an individual. Usually, something like the Wechsler Adult Intelligence Scale (WAIS) is used for this, which requires subjecting a person to various tasks. The subject's performance in each task contributes to a final comprehensive assessment (Lichtenberger et al., 2012).

Artificial Intelligence (AI) has been defined as the ability of a computer system – using maths and logic, and leveraging new information – to simulate human cognitive functions such as learning or problem-solving in order to make decisions (Microsoft, 2021). The term Cognitive Intelligence has also been coined to emphasize the human-like nature of these computer-derived capabilities. It seems, however, that the comprehensive, aggregative nature of human intelligence testing does not carry over to the assessment of computer-based systems: AI is assessed at a more functional level, whereas human intelligence testing is more holistic.

The WAIS test is unsuitable for computer systems, which, as we have mentioned, are primarily uni-modal: they focus on performing a single human-like task or function at a time. To assess Artificial Intelligence, we want to know how efficiently and accurately a computer system performs a single – albeit perhaps complex – task. This could be, for example, navigating a self-driving car through a series of obstacles or summarizing a legal file into topics. These are multi-step tasks that humans can do, and the ability to perform them is what has come to be regarded as artificial intelligence.
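As a rough sketch of what "efficiency and accuracy on a single task" could look like in code, the snippet below times a generic prediction function over labelled test cases and reports accuracy and mean latency. The `evaluate_task` helper, the exact-match scoring, and the toy document "summariser" are hypothetical illustrations rather than a standard harness; a real task such as translation or driving would need its own task-specific measures.

```python
import time
from typing import Callable, Iterable, Tuple

def evaluate_task(predict: Callable[[object], object],
                  test_cases: Iterable[Tuple[object, object]]) -> dict:
    """Run a single-task system over labelled cases and report accuracy
    (fraction of exact matches) and mean latency in seconds."""
    correct, total, elapsed = 0, 0, 0.0
    for inputs, expected in test_cases:
        start = time.perf_counter()
        output = predict(inputs)
        elapsed += time.perf_counter() - start
        correct += int(output == expected)
        total += 1
    return {"accuracy": correct / total, "mean_latency_s": elapsed / total}

# Hypothetical usage: a toy "summariser" that simply picks the first topic.
if __name__ == "__main__":
    cases = [({"topics": ["contract", "liability"]}, "contract"),
             ({"topics": ["patent"]}, "patent")]
    print(evaluate_task(lambda doc: doc["topics"][0], cases))
```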

Although the exact underlying biological processes that contribute to intelligence in humans are not entirely understood, research has narrowed down four potential factors – brain size, sensory ability, speed and efficiency of neural transmission, and working memory capacity (Stangor & Walinga, as cited in Cummings & Sanders, 2019).

These factors have clear parallels in the world of computer systems with:

  • brain size analogous to storage capacity
  • sensory ability analogous to the number and type of incoming data sources
  • speed and efficiency of neural transmission analogous to the complexity of the code, the number of CPU cores and their clock speed, and the extent of their distributed processing
  • working memory analogous to the available RAM.
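To make the analogy a little more concrete, here is an illustrative sketch (using the third-party psutil package) that captures those hardware-side analogues as a simple machine profile. The profile fields and the example data sources are assumptions for illustration only, not a standard way of characterising an AI system.

```python
# Illustrative only: capture the hardware analogues of the four factors above.
import psutil  # third-party package: pip install psutil

profile = {
    "storage_gb": psutil.disk_usage("/").total / 1e9,    # ~ brain size
    "ram_gb": psutil.virtual_memory().total / 1e9,       # ~ working memory capacity
    "cpu_cores": psutil.cpu_count(logical=True),         # ~ speed/efficiency of "neural transmission"
    "data_sources": ["camera", "lidar", "microphone"],   # ~ sensory ability (hypothetical inputs)
}
print(profile)
```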

An obvious first step is to use standard statistical measures – such as R², MSE, F1-score, precision, and recall – to assess machine learning model performance. However, these measures are applied at too low a level to be considered assessments of AI; they are model-level assessments. What is needed is a method to aggregate these low-level measures, and a standardized series of tests that could be applied to gauge an AI system's accuracy, speed, and effectiveness in toto. That would give us a way to compare systems against the equivalent human capability and against other, similar cognitive models – for example, how well a system translates a benchmarked document, or how safely a car performs when guided by a self-driving AI.
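As a minimal sketch of the model-level measures named above – and of one hypothetical way to roll them up into a single comparable figure – the snippet below computes precision, recall, and F1 with scikit-learn and then takes a weighted average. The labels, the weights, and the idea of a single weighted composite are illustrative assumptions, not an established standard; R² and MSE would play the analogous role for regression models.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels for one sub-task
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # the model's predictions

# Model-level measures: informative, but still below "AI assessment" level.
metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
}

# Hypothetical aggregation: a weighted average across metrics (and, in a real
# regimen, across many benchmarked tasks) to give a single comparable figure.
weights = {"precision": 0.3, "recall": 0.3, "f1": 0.4}
composite = sum(weights[name] * value for name, value in metrics.items())

print(metrics, round(composite, 3))
```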

Outcomes like these – translation quality, driving safety – are measurable, but not usually at the sub-task level. When measuring human intelligence, we don’t, for example, usually measure a person’s ability to discern colors or to sort similar-meaning words into lists; those abilities would only be tested if we were trying to get to the bottom of an evident deficit.

While there are increasing gains in this area, there seems to be nothing like an all-embracing AI testing regimen. This is likely due to the difficulty of lumping different AI systems into a single performance assessment tool. Moreover, it is uncommon for a data science team innovating in one area – computer vision, say – to also work with other types of models; if the same team did, a more cohesive and overarching viewpoint could be arrived at more easily.

As AI solutions proliferate in the market, it is becoming increasingly important to differentiate their reliability and accuracy. For example, how can we differentiate a Google self-driving solution from a Tesla one? Which platform offers the best off-the-shelf AI plugin for translating speech into text? What computer vision solution provides the best accuracy and reliability for detecting potentially aggressive behavior in a crowd of protesters? The list of potential AI solutions is endless and growing every day.

When I discuss this subject with others, the Artificial Intelligence Maturity Model (AIMM) often comes up. It does not, however, bear directly on this discussion: an AIMM attempts to assess the level of AI integration within a particular company, not individual AI capabilities. The more integrated a company's AI capabilities are, the better its AIMM score, and companies scoring highly on an AIMM assessment are likely also those that have developed some form of AI task assessment. Even so, such assessments are likely to focus on individual tasks rather than a holistic evaluation.

So what should we, as an AI community, be doing – if anything – about generating holistic testing assessments? Do we necessarily need to aim for a catchall testing regimen that encompasses multiple facets of artificial intelligence in the same way that cognitive testing in humans assesses their level of intelligence?

I don’t believe we do. Although artificial intelligence as it stands today probably needs a consistent approach to testing, unless we are considering a humanoid robot designed to act like a human, there would be nothing to gain from combining the testing of different categories of AI into one.

What do you think?

References

Cummings, J. A. & Sanders, L. (2019). Introduction to Psychology. Saskatoon, SK: University of Saskatchewan Open Press. https://openpress.usask.ca/introductiontopsychology/

Lichtenberger, E. O., Kaufman, A. S., & Kaufman, N. L. (2012). Essentials of WAIS-IV Assessment (Essentials of Psychological Assessment, Vol. 96). John Wiley & Sons.

Microsoft Azure. (2021). Artificial intelligence (AI) vs. machine learning. https://azure.microsoft.com/en-au/overview/artificial-intelligence-ai-vs-machine-learning/#introduction

