Apple has published a detailed investigation into how advanced AI reasoning models respond as the complexity of a problem increases.
The research focuses on Large Reasoning Models such as Claude 3.7 Sonnet Thinking, DeepSeek R1, and OpenAI's o3, putting them through a set of logic puzzles designed to scale gradually in difficulty. The puzzles, including the Tower of Hanoi and river crossing tasks, were chosen for their controllable structure and clear rules.
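The appeal of such puzzles is that difficulty can be raised one notch at a time while the rules stay fixed: for Tower of Hanoi, each additional disk doubles the length of the optimal solution (2^n - 1 moves for n disks). As a minimal sketch of that controllable scaling (illustrative only, not code from the study), a few lines of Python can generate the optimal move sequence for any disk count:

```python
# Illustrative sketch (not Apple's code): generating Tower of Hanoi
# instances of increasing difficulty. The optimal solution for n disks
# takes 2**n - 1 moves, so each extra disk doubles the work required.

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)   # move n-1 disks out of the way
        + [(n, src, dst)]                   # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst) # stack the n-1 disks back on top
    )

if __name__ == "__main__":
    for n in range(1, 11):
        moves = hanoi_moves(n)
        print(f"{n} disks -> {len(moves)} moves (2^{n} - 1 = {2**n - 1})")
```

The same instance generator can produce tasks of any size, which is what lets the study locate the precise difficulty level at which a model's reasoning gives out.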
What the study uncovers is a consistent behavioral curve.
Models did well on easier tasks and handled some moderately difficult ones effectively. However, performance dropped off once the puzzles required more steps and more planning.
Beyond that point, the models gave shorter answers even when they had enough resources to keep going, and they often stopped early, producing incomplete or incorrect solutions once the challenge reached a level they could not manage reliably.
Why It Matters: This research reveals a significant limitation in how today’s AI manages complex reasoning. As tasks become more intricate, these systems don’t consistently scale their problem-solving abilities. AI tools designed for planning, logic, or structured decision-making often falter when the challenge requires deeper or more sustained effort.
- Three Distinct Regimes: Models tend to fall into three performance groups. On simple tasks, standard language models are often more accurate and efficient. Reasoning models help on moderately complex problems by taking longer paths through the logic. However, once the problems hit a certain level of difficulty, performance drops sharply, with no sign of the models being able to adjust or scale their reasoning.
- Reduced Effort with Rising Complexity: The amount of reasoning these models put in doesn’t track how hard the problem is. At first, they use more tokens as things get tougher. However, once the task crosses a certain line, responses get shorter and less detailed, even when there’s still plenty of room to keep going. The models seem to be following fixed habits rather than adjusting their efforts to match the challenge.
- Algorithms Don’t Guarantee Execution: Giving the models clear instructions doesn’t seem to help either. Even when they’re handed the full solution method, they still struggle to follow the steps correctly. This points to a deeper issue with how well they handle structure and rules, even when everything they need is spelled out.
- Puzzle-Based Testing Reveals Hidden Gaps: Standard benchmarks don’t always show what these models can or can’t do, which is why the custom puzzles in this study are built differently. They’re clean, controlled, and make it easier to see where the reasoning breaks down. That kind of setup helps surface the exact point in the process where things go wrong (see the move-checking sketch after this list).
- Failure Isn’t Linear: The drop in performance hits all at once. In puzzles like Tower of Hanoi or River Crossing, models go from doing okay to failing completely after a small bump in difficulty. This marks the point where learned patterns stop being useful and the models can’t keep up. What looks like reasoning might actually be pattern-following that fails when complexity increases.
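One reason puzzle-based testing can pinpoint where reasoning breaks down is that every proposed solution can be checked move by move against the rules. The sketch below is purely illustrative (it is not the paper's evaluation harness): it verifies a model-proposed Tower of Hanoi move list and reports the first step at which the attempt goes wrong.

```python
# Illustrative sketch (not the paper's evaluation harness): checking a
# model-proposed Tower of Hanoi move list step by step, so the first
# illegal move -- the exact point where reasoning breaks -- is reported.

def verify_hanoi(n, moves):
    """moves: list of (disk, from_peg, to_peg). Returns (solved, failing_step)."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom -> top
    for step, (disk, src, dst) in enumerate(moves, start=1):
        if not pegs[src] or pegs[src][-1] != disk:
            return False, step   # that disk isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, step   # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    solved = pegs["C"] == list(range(n, 0, -1)) and not pegs["A"] and not pegs["B"]
    return solved, None if solved else len(moves)

if __name__ == "__main__":
    # A correct 3-disk solution passes; truncating it is caught immediately.
    good = [(1, "A", "C"), (2, "A", "B"), (1, "C", "B"),
            (3, "A", "C"), (1, "B", "A"), (2, "B", "C"), (1, "A", "C")]
    print(verify_hanoi(3, good))      # (True, None)
    print(verify_hanoi(3, good[:4]))  # (False, 4) -> incomplete solution
```

A checker like this makes the collapse described above easy to see: as instances grow, a model's move lists stop at an earlier and earlier fraction of the required sequence, or break a rule outright, rather than degrading gracefully.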