
Measuring AI Progress with Tests That Challenge Human Limits

From board games to benchmarks.
Cambron Kelly
Contributing Writer

As AI systems grow more advanced, determining what they are capable of has become a pressing challenge. Early evaluations assessed narrow tasks such as image recognition or board-game play, and years passed before AI models could match or surpass human performance on them. Today, however, cutting-edge AI systems routinely ace popular benchmarks like the SAT or professional licensing exams, making it difficult to track their continued improvement.

In response, researchers and institutions have developed more complex tests, or “evals,” designed to probe deeper into AI capabilities.

These new benchmarks, such as Epoch AI’s FrontierMath and the upcoming “Humanity’s Last Exam,” cover complex domains like math, science, and reasoning. While recent scores on these tests have surprised experts, they also raise concerns that future AI systems could outperform human researchers in critical fields, posing potential risks.

Why It Matters: Evals play a crucial role in understanding and mitigating the potential risks posed by advanced AI systems. They offer early indicators of emergent capabilities, highlight where models are likely to fail, and provide valuable data for policymakers and developers alike. As concerns around AI-driven cybersecurity risks, misinformation, and the automation of scientific research grow, building rigorous, reliable benchmarks is more urgent than ever.

  • FrontierMath’s High Stakes: Created by Epoch AI and 60 top mathematicians, the FrontierMath benchmark features 300 complex math problems. These questions span difficulty levels, from high school-level Olympiad tasks to research-frontier problems only solvable by domain experts. Within a month of its release, OpenAI’s o3 model scored 25.2%, an unexpected leap from earlier models’ 2% performance.
  • “Humanity’s Last Exam” and Domain Expansion: Developed by the Center for AI Safety and Scale AI, this benchmark aims to include thousands of questions across physics, biology, and engineering. Designed to be unanswerable by current AI, this eval leverages crowdsourced academic expertise and seeks to ensure the longevity of its relevance for future AI models.
  • Real-World Simulations with RE-Bench: METR’s RE-Bench tests human and AI agents on engineering tasks within strict time constraints. While AI agents can outperform humans on certain coding optimizations, they often struggle to adjust after early mistakes—highlighting weaknesses in adaptability and sustained problem-solving.
  • ARC-AGI and Novel Reasoning: François Chollet’s ARC-AGI benchmark assesses an AI’s ability to solve unique reasoning puzzles. These tasks require models to discern rules from scratch rather than recall information. The latest OpenAI o3 model demonstrated significant progress, achieving higher scores in generalization and adaptability.
  • Cost and Incentive Challenges: Evals are resource-intensive to build and run, with costs ranging from roughly $1,000 to $10,000 per model. While some labs cover operational costs, nonprofits often bear the burden of designing these tests. Experts argue that mandatory third-party evaluations, similar to those in finance, are needed to ensure transparency and prevent labs from grading their own safety benchmarks.

Go Deeper -> AI Models Are Getting Smarter. New Tests Are Racing to Catch Up – Time
