Researchers question AI’s ‘reasoning’ ability as models stumble on math problems with trivial changes

A new paper by Apple researchers has reignited a heated debate over whether machine learning models really “reason” or merely mimic patterns. Titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” the study examines how AI models fail at even simple mathematical problems under minor variations, suggesting that the models do not understand the problems but instead replicate familiar patterns.

The researchers constructed a simple example to illustrate their findings. Consider this: Oliver picked 44 kiwis on Friday, 58 on Saturday, and doubled his Friday total on Sunday. The overall number of kiwis is the sum 44 + 58 + (44 x 2) = 190. Simple arithmetic problems like this can occasionally trip up a large language model, but models generally handle this kind of scenario fine.
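
For concreteness, here is the straightforward computation as a short Python sketch (the variable names are ours, not the paper’s):

```python
# Oliver's kiwi totals from the example problem.
friday = 44
saturday = 58
sunday = friday * 2  # he doubles his Friday total on Sunday

total = friday + saturday + sunday
print(total)  # 190
```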

But when extraneous details are added, AI models make a mess of it. For instance, if the problem mentions that five of the kiwis picked on Sunday were smaller than average, most people would recognize that this doesn’t change the overall count. However, some LLMs, such as OpenAI’s o1-mini, misinterpret the problem and subtract the five smaller kiwis, computing Sunday’s count as 88 – 5 = 83 and arriving at the wrong total.
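
Following that flawed step through shows where it leads; in the sketch below, the 185 total is our extrapolation of the mistaken subtraction the article describes:

```python
# The distractor: 5 of Sunday's kiwis were smaller than average.
# Size is irrelevant to the count, so the total should stay 190.
smaller_kiwis = 5
flawed_sunday = (44 * 2) - smaller_kiwis   # 83: the mistaken subtraction
flawed_total = 44 + 58 + flawed_sunday     # 185 instead of 190
print(flawed_total)
```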

This minimal deviation exposes a critical weakness in LLMs. The researchers tested hundreds of slightly modified questions, and the result was consistent: the models’ accuracy dropped sharply. These results point to a clear conclusion: although AI models can solve problems they have encountered many times before, they become helpless when facing new twists, revealing a lack of genuine reasoning.
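
The paper generates these modified questions from symbolic templates, varying surface details while keeping the logic fixed. A minimal sketch of that idea, where the template wording, names, and sampling ranges are our own illustration rather than the paper’s actual templates, might look like this:

```python
import random

# A toy problem template: the logical structure is fixed,
# while names and numbers are resampled per variant.
TEMPLATE = ("{name} picked {a} kiwis on Friday, {b} on Saturday, "
            "and twice Friday's total on Sunday. How many kiwis in all?")

NAMES = ["Oliver", "Sofia", "Liam", "Maya"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(20, 60), rng.randint(20, 60)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    answer = a + b + 2 * a  # ground truth stays computable by construction
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, ans = make_variant(rng)
    print(q, "->", ans)
```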

Understanding AI’s Struggle with Reasoning

The study suggests that AI models do not genuinely understand the problems they solve. They can get the right answer in simple cases thanks to their training data, but even small distractions derail them. Rather than solving the problem, an AI model relies heavily on reproducing familiar patterns learned in training.

The authors note that “as the number of clauses in a question increases, the performance of these models significantly deteriorates.” AI models are not engaged in logical reasoning; they merely attempt to replicate the reasoning steps they have observed in their training data. This is a constraint similar to how an LLM can predict conversational responses without regard for the meaning behind the words.

For example, an AI model may predict that “I love you” will be answered with “I love you, too” based on pattern recognition, not emotion. Similarly, although LLMs can follow a line of reasoning they have been trained on, their grasp falls apart when minor variations appear, suggesting their behavior is pattern-based rather than logic-based.
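
As a toy illustration of pattern-matching without understanding (entirely our own construction, far simpler than any real LLM), a lookup table can “converse” convincingly on memorized inputs and fail on anything else:

```python
# A lookup table "responds" correctly to memorized patterns
# but has no notion of meaning behind the words.
canned_replies = {
    "I love you": "I love you, too",
    "Good morning": "Good morning!",
}

def reply(message: str) -> str:
    # Any unfamiliar phrasing falls straight through.
    return canned_replies.get(message, "?")

print(reply("I love you"))         # I love you, too
print(reply("I really love you"))  # ?  (a trivial variation breaks it)
```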

Prompt Engineering and Debates in AI Research

This study has been widely debated within the AI community. Some researchers, including some at OpenAI, acknowledge the study’s fascinating insights but believe that better prompt engineering could resolve most of the failures. In their view, phrasing the question differently can guide AI models toward correct answers, even with added complexity.
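
As a hedged sketch of what such prompt engineering might look like, using the official OpenAI Python client (the instruction wording and model choice are our assumptions, not a recipe from the study):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

problem = ("Oliver picks 44 kiwis on Friday, 58 on Saturday, and double his "
           "Friday total on Sunday, but five of Sunday's kiwis were smaller "
           "than average. How many kiwis does Oliver have?")

# A prompt-engineered framing: explicitly warn the model about distractors.
prompt = ("Solve the word problem below. Some details may be irrelevant to "
          "the arithmetic; identify and ignore them before computing.\n\n"
          + problem)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```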

According to co-author Mehrdad Farajtabar, however, better prompting only goes so far: it may fix the easy variations, which even weak models can handle, but countering more complex distractions may require far more contextual data, distractions that a child could trivially point out. This points to the limits of AI’s current reasoning.

The Bigger Picture: Can AI Reason?

These results feed into a larger question: do AI models really reason, or do they merely repeat patterns of reasoning observed during training? The term “reasoning” has never been formally defined in AI research, so there isn’t much to go on. It is possible that AI models exhibit something like reasoning that we do not yet understand.

In the meantime, understanding both the potential and the limits of AI in everyday tools means recognizing that these models excel at producing fluent language and solving familiar problems, yet stumble over small variations, a reminder that truly intuitive reasoning remains beyond today’s systems.