Apple Claims AI Reasoning Models Suffer From ‘Accuracy Collapse’ When Solving Complex Problems

Apple's researchers have taken a close look at the newest "thinking" machines, Large Reasoning Models (LRMs), and the results are striking. These models command enormous computational resources and can look genuinely impressive in action. But Apple's paper points to a critical shortcoming: when faced with sufficiently hard problems, LRMs stop pursuing an answer and effectively give up. Instead of spending more compute as the difficulty rises, as they are designed to do, they scale back their reasoning effort and leave the hardest problems unsolved. Are we running into a complexity ceiling for AI reasoning?

Apple Says Reasoning Models Aren't Really Reasoning Beyond a Certain Level

In a paper titled "The Illusion of Thinking," published on Apple's site, the researchers report a sharp divergence between how LRMs and standard LLMs behave as the complexity of a problem increases. The paper argues that even models with a polished facade of reasoning fail in ways that expose how limited their "thinking" really is, and the key lies in how they respond across three levels of complexity.

The paper sorts its test problems into three tiers: simple low-complexity tasks, challenging medium-complexity tasks, and demanding high-complexity tasks. To see how LLMs and LRMs handle each tier, the researchers ran them through a battery of increasingly difficult puzzles. One of those trials was the classic Tower of Hanoi.

The Tower of Hanoi starts with a stack of disks on a single peg, arranged from largest at the bottom to smallest at the top. It looks like a child's toy, but it is a puzzle that has challenged thinkers for generations.

The task is to move the whole tower onto the second peg. That sounds easy, but there is a catch: you can move only one disk at a time, and a larger disk can never be placed on top of a smaller one.

Simply put, a game marketed to children can turn into an extraordinarily hard challenge as the number of disks grows. Would you be able to master the Tower of Hanoi?
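
To make the mechanics concrete, here is a minimal sketch of the classic recursive solution in Python. This is not from Apple's paper; the function name, peg labels, and disk count are illustrative.

```python
def hanoi(n, source, target, spare, moves=None):
    """Recursively move n disks from source to target, using spare as scratch."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    # Move the top n-1 disks out of the way onto the spare peg.
    hanoi(n - 1, source, spare, target, moves)
    # Move the largest remaining disk to its final position.
    moves.append((source, target))
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    hanoi(n - 1, spare, target, source, moves)
    return moves

if __name__ == "__main__":
    solution = hanoi(3, "A", "C", "B")
    print(len(solution), "moves:", solution)  # 7 moves for 3 disks
```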

Mathematical puzzles solved by reasoning models. Photo Credit: Apple

In the experiment, Apple's researchers pitted reasoning models against their non-reasoning counterparts. The standard LLMs were Claude 3.7 Sonnet and DeepSeek-V3; the reasoning models were Claude 3.7 Sonnet with Thinking and DeepSeek-R1, each given a thinking budget of up to 64,000 tokens. The question was not simply whether they could reach the solution, but whether they got there through coherent reasoning rather than brute force.

The challenge scaled in steps: up to three disks for the low-complexity tier, four to ten disks for the medium tier, and eleven to twenty disks for the high-complexity tier.
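
As a back-of-the-envelope check (a standard property of the puzzle, not a figure from Apple's paper), the minimum number of moves for n disks is 2^n − 1, so each added disk roughly doubles the work:

```python
# Minimum moves needed to solve the Tower of Hanoi with n disks: 2**n - 1.
for n in (3, 10, 20):
    print(f"{n} disks -> {2**n - 1:,} moves")
# 3 disks -> 7 moves
# 10 disks -> 1,023 moves
# 20 disks -> 1,048,575 moves
```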

On the simpler tasks, LLMs and LRMs performed roughly on par. As the difficulty increased, the paths diverged: backed by more computational resources, the reasoning models began cracking the more complex puzzles and pulled ahead of their counterparts. But at the highest levels of complexity, that advantage disappeared. Both kinds of models collapsed, failing to make headway on the puzzles at all as their reasoning fell apart.

The experiment did not stop at a single puzzle. The researchers widened the challenge with an array of additional puzzles of increasing difficulty: Checkers Jumping, River Crossing, and Blocks World.

Apple's latest AI research echoes a growing concern within the field. Today's reasoning models perform well on familiar problem types, but they break down when confronted with unfamiliar territory: they either lean on memorized algorithmic shortcuts or give up entirely when the problem becomes too demanding, suggesting that what looks like reasoning may be closer to pattern matching.

Meanwhile, AI evaluations remain stuck in the past, obsessed with final answers to well-worn math and coding exercises. Besides inviting cheating through data contamination, this approach tells us little about how the AI actually thinks. We miss the whole narrative behind the solution: the important clues tucked away in the model's reasoning process itself.
