AI agents are increasingly being integrated into enterprise workflows, yet they still fail about one-third of the time on structured benchmarks. That gap between AI’s potential and its reliability is a defining operational challenge for IT leaders in 2026, according to Stanford HAI’s ninth annual AI Index report.
The AI Index calls this inconsistent performance the “jagged frontier,” a concept introduced by AI researcher Ethan Mollick to describe how a system can excel in some areas yet fail abruptly in adjacent ones.
Stanford HAI researchers note, “AI models can achieve top honors at the International Mathematical Olympiad, yet they struggle with basic tasks like telling time.”
Advancements in AI Models in 2025
Enterprise AI adoption has climbed to 88%. Key achievements in 2025 and early 2026 include:
- Frontier models saw a 30% improvement within a year on Humanity’s Last Exam (HLE), which comprises 2,500 questions in areas like math, natural sciences, and ancient languages, designed to be challenging for AI but favorable to human experts.
- Leading models exceeded 87% on MMLU-Pro, which assesses multi-step reasoning across 12,000 human-reviewed questions in over a dozen fields, demonstrating competitiveness in broad knowledge tasks, according to Stanford HAI researchers.
- Models like Claude Opus 4.5, GPT-5.2, and Qwen3.5 scored between 62.9% and 70.2% on τ-bench, which evaluates real-world tasks involving user interaction and external API usage.
- GAIA, a benchmark for general AI assistants, saw accuracy rise from 20% to 74.5%.
- On SWE-bench Verified, agent performance improved from 60% to nearly 100%, showcasing models’ ability to tackle real-world software issues.
- WebArena success rates jumped from 15% in 2023 to 74.3% in early 2026; the benchmark tests autonomous AI agents in realistic web environments.
- Performance on MLE-bench, which evaluates machine learning engineering skills, increased from 17% in 2024 to about 65% in early 2026.
AI agents are also advancing in cybersecurity, with frontier models solving 93% of tasks on Cybench, which includes 40 professional-level challenges across categories like cryptography and web security. This marks a significant jump from 15% in 2024, highlighting cybersecurity tasks as well-suited to current AI capabilities.
Video generation has progressed notably, with models now simulating object behavior. Google DeepMind’s Veo 3, evaluated across more than 18,000 generated videos, demonstrated buoyancy simulations and maze-solving without prior training in those areas.
“Video generation models are evolving from producing realistic images to understanding physical interactions,” the researchers state.
AI is now employed in various enterprise areas such as knowledge management, software engineering, IT, and marketing, while also expanding into specialized domains like taxation and legal reasoning, with accuracy rates between 60% and 90%.
Stanford HAI asserts, “AI capability continues to accelerate, reaching more users than ever.”
AI Capability Grows, Reliability Remains an Issue
Multimodal models now match or surpass human baselines on PhD-level science questions, multimodal reasoning, and competitive mathematics. For instance, Gemini Deep Think secured a gold medal at the 2025 International Mathematical Olympiad, solving five out of six problems in natural language within the 4.5-hour limit, a step up from its silver-level performance in 2024.
However, these AI systems still falter in roughly one-third of attempts and struggle with basic perception tasks, according to Stanford HAI. On ClockBench, a test with 180 clock designs and 720 questions, Gemini Deep Think managed only 50.1% accuracy, whereas humans average around 90%. GPT-4.5 High scored similarly at 50.6%.
“Most multimodal models still find telling time challenging,” the Stanford HAI report notes. This task involves visual perception, arithmetic, and clock hand identification, where errors can lead to failure. Despite fine-tuning on 5,000 synthetic images, models only improved on familiar formats and struggled with real-world variations like distorted dials.
Researchers suggest that models confuse the hour and minute hands and misjudge their direction, pointing to challenges that go beyond training data and involve integrating multiple visual cues.
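To make that failure mode concrete, here is a minimal sketch (hypothetical, not drawn from ClockBench) of the arithmetic that turns detected hand angles into a time reading. Swapping the hour and minute hands, the confusion the researchers describe, turns the same pixels into an entirely different answer.

```python
# Minimal sketch: converting detected clock-hand angles (degrees,
# measured clockwise from 12 o'clock) into a time reading.
# Illustrative only -- not taken from ClockBench or the AI Index report.

def angles_to_time(hour_angle: float, minute_angle: float) -> str:
    """Map hand angles to an (hour, minute) reading on a 12-hour dial."""
    minute = round(minute_angle / 6) % 60     # 360 deg / 60 min = 6 deg per minute
    hour = int(hour_angle // 30) % 12 or 12   # 360 deg / 12 h = 30 deg per hour
    return f"{hour}:{minute:02d}"

# Correct hand identification: hour hand at 65 deg, minute at 90 deg.
print(angles_to_time(65, 90))   # 2:15

# Swap the two hands -- the confusion researchers describe -- and the
# same dial now reads as a very different time.
print(angles_to_time(90, 65))   # 3:11
```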
Stanford HAI concludes, “Despite closing gaps in knowledge tasks, visual reasoning remains a challenge for models.”
Hallucination and Multi-Step Reasoning Persist as Challenges
Despite advancements in reasoning, hallucination remains a significant issue.
In one benchmark, hallucination rates among 26 leading models ranged from 22% to 94%. When their answers were challenged, some models’ accuracy dropped sharply: GPT-4o fell from 98.2% to 64.4%, and DeepSeek R1 from over 90% to 14.4%.
Conversely, Grok 4.20 Beta, Claude 4.5 Haiku, and MiMo-V2-Pro exhibited the lowest hallucination rates.
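The report does not detail how these rates are computed, but hallucination benchmarks typically have a judge, human or model, label each response as grounded or not, then report the unsupported fraction. A minimal sketch of that bookkeeping, with hypothetical names and data:

```python
# Minimal sketch of how a hallucination rate is typically tallied:
# a judge labels each response as grounded or not, and the rate is the
# unsupported fraction. Names and data are hypothetical, not from the report.

from dataclasses import dataclass

@dataclass
class JudgedResponse:
    model: str
    grounded: bool  # True if the answer is supported by the reference material

def hallucination_rate(responses: list[JudgedResponse], model: str) -> float:
    """Fraction of a model's responses judged unsupported."""
    mine = [r for r in responses if r.model == model]
    if not mine:
        raise ValueError(f"no responses for {model}")
    return sum(not r.grounded for r in mine) / len(mine)

judged = [
    JudgedResponse("model-a", True),
    JudgedResponse("model-a", False),
    JudgedResponse("model-a", True),
    JudgedResponse("model-a", True),
]
print(f"{hallucination_rate(judged, 'model-a'):.0%}")  # 25%
```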
Models also struggle with multi-step workflows, even as they are entrusted with more of them. On τ-bench, which assesses tool use and multi-turn reasoning, no model exceeded 71%, revealing difficulties in managing conversations and using tools within policy constraints, according to the Stanford HAI report.
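For readers unfamiliar with this style of evaluation, the skeleton below shows the kind of loop a τ-bench-style harness wraps around a model: the agent alternates between replying to a simulated user and calling tools, and is graded on whether the final state respects a written policy. This is an illustrative sketch under assumed names, not the benchmark’s actual code.

```python
# Illustrative skeleton of a tau-bench-style tool-use loop -- not the
# benchmark's actual harness. The agent alternates between conversing
# with a simulated user and calling tools, under a policy document.

from typing import Callable

def run_episode(
    agent: Callable[[list[dict]], dict],    # maps transcript -> next action
    tools: dict[str, Callable[..., str]],   # e.g. {"get_order": ..., "refund": ...}
    user_turns: list[str],
    max_steps: int = 20,
) -> list[dict]:
    transcript: list[dict] = [{"role": "system", "content": "Follow the policy."}]
    turns = iter(user_turns)
    transcript.append({"role": "user", "content": next(turns)})

    for _ in range(max_steps):
        action = agent(transcript)          # {"type": "tool"|"reply", ...}
        if action["type"] == "tool":
            # Policy violations often happen here: calling a tool the policy
            # forbids, or with arguments the user never confirmed.
            result = tools[action["name"]](**action["args"])
            transcript.append({"role": "tool", "content": result})
        else:
            transcript.append({"role": "assistant", "content": action["text"]})
            nxt = next(turns, None)         # hand the floor back to the user
            if nxt is None:
                break                       # conversation over; grade final state
            transcript.append({"role": "user", "content": nxt})
    return transcript
```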
Models Increasingly Opaque
The Stanford HAI report indicates that leading models are now nearly indistinguishable in performance. Open-weight models have grown more competitive, and the field is converging.
As capability no longer stands out, competition is shifting towards cost, reliability, and practical usefulness.
Frontier labs are sharing less about their models, evaluation methods are losing relevance, and independent testing does not always reproduce reported metrics. Stanford HAI states, “The most capable systems are now the least transparent.”
Training code, parameter counts, dataset sizes, and durations are often withheld by firms like OpenAI, Anthropic, and Google. In 2025, 80 out of 95 models were released without training code, and only four were open source.
After rising between 2023 and 2024, Foundation Model Transparency Index scores have since fallen; the average now stands at 40, a 17-point drop.
“Significant gaps remain in training data, compute resources, and post-deployment impact disclosures,” the report states.
Challenges in Benchmarking AI
The benchmarks for AI progress face increasing reliability issues, with error rates reaching up to 42% on widely-used evaluations. The Stanford report notes that while AI is tested more ambitiously across reasoning, safety, and task execution, these measures are harder to trust.
Key issues include:
- “Sparse and declining” bias reporting from developers
- Benchmark contamination, where exposure to test data inflates scores (a common detection heuristic is sketched after this list)
- Discrepancies between developer-reported and independent results
- “Poorly constructed” evaluations lacking necessary documentation and reproducibility
- “Increasing opacity and non-standard prompting” causing unreliable model comparisons
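Contamination, the second item above, is often probed with simple overlap heuristics: if long word sequences from a test question appear verbatim in a model’s training corpus, scores on that question are suspect. A minimal n-gram check, purely illustrative and not the method any particular lab uses:

```python
# Minimal n-gram overlap check for benchmark contamination -- an
# illustrative heuristic, not any lab's actual detection pipeline.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All length-n token windows in the text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def looks_contaminated(test_item: str, training_corpus: str,
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a test item if most of its n-grams appear verbatim in training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False
    overlap = len(item_grams & ngrams(training_corpus, n)) / len(item_grams)
    return overlap >= threshold
```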
The report suggests that strong benchmark performance doesn’t always equate to real-world utility. Moreover, AI capability is outpacing the benchmarks designed to measure it.
This leads to “benchmark saturation,” where uniformly high scores make it difficult to tell models apart. More complex, interactive forms of intelligence are increasingly hard to benchmark. Some advocate for evaluations that focus on human-AI collaboration rather than isolated AI performance, though such approaches are still in early development.
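The saturation problem is partly statistical: near the top of a benchmark, the gap between models shrinks below the benchmark’s own sampling noise. A rough sketch, assuming a hypothetical 500-question test and a normal approximation to the binomial:

```python
# Why saturated benchmarks stop discriminating: near 100%, the gap
# between models falls inside sampling noise. Rough normal-approximation
# sketch with an assumed question count -- not from the AI Index report.

import math

def std_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an observed accuracy on an n-question benchmark."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

n = 500
for acc in (0.97, 0.98):
    se = std_error(acc, n)
    lo, hi = acc - 1.96 * se, acc + 1.96 * se
    print(f"{acc:.0%}: 95% CI roughly [{lo:.1%}, {hi:.1%}]")

# The intervals for 97% and 98% overlap, so a one-point lead on a
# saturated benchmark may not reflect a real capability difference.
```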
Stanford HAI concludes, “Evaluations intended to be challenging for years are saturated in months, shortening their useful tracking period.”
Approaching “Peak Data”
As builders push into data-intensive inference, concerns are growing about data bottlenecks and the sustainability of scaling. Researchers warn that high-quality human text and web data are “exhausted,” a state known as “peak data.”
Hybrid approaches that combine real and synthetic data can speed up training significantly, sometimes by 5 to 10 times. Smaller models trained on synthetic data also show promise for specific tasks like classification and code generation, according to Stanford HAI.
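What such a hybrid mix looks like in practice varies by lab; the sketch below simply combines real and synthetic examples at a fixed ratio, one straightforward way such a pipeline can be set up. The ratio and names are assumptions for illustration, not details from the report.

```python
# Simple sketch of building a hybrid real + synthetic training mix at a
# fixed ratio. The default 25% synthetic share is an arbitrary assumption
# for illustration, not a recommendation from the AI Index report.

import random

def hybrid_mix(real: list[str], synthetic: list[str],
               synth_fraction: float = 0.25, seed: int = 0) -> list[str]:
    """Sample a training set with the given fraction of synthetic examples."""
    rng = random.Random(seed)
    # Number of synthetic items so they form synth_fraction of the final mix.
    n_synth = int(len(real) * synth_fraction / (1 - synth_fraction))
    mix = real + rng.sample(synthetic, min(n_synth, len(synthetic)))
    rng.shuffle(mix)
    return mix
```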
Synthetically generated data can enhance model performance in post-training settings, including fine-tuning and reinforcement learning (RL), but “these gains have not generalized to large, general-purpose language models.”
Rather than scaling data indiscriminately, researchers are focusing on refining inputs, cleaning labels, and constructing higher-quality datasets.
“Discussions on data availability often miss a key shift in recent AI research,” the report notes. “Performance gains increasingly rely on improving existing datasets’ quality, not acquiring more.”
Responsible AI Lags Behind
Despite growing infrastructure for responsible AI, progress remains “uneven” and struggles to keep pace with rapid capability advancements, according to Stanford HAI.
Leading frontier AI model developers often report on capability benchmarks, but safety and responsibility reporting is inconsistent and “spotty.”
Documented AI incidents increased significantly, from 233 in 2024 to 362 in 2025. Although several frontier models received “Very Good” or “Good” safety ratings under standard conditions (as per the AILuminate benchmark), safety performance declined across models during adversarial jailbreak tests.
“AI models perform well on safety tests under normal conditions, but their defenses weaken under deliberate attack,” Stanford HAI remarks.
Builders report that improving one aspect, like safety, can negatively impact another, such as accuracy. “The infrastructure for responsible AI is expanding, but progress is uneven, not matching AI deployment speed,” according to Stanford researchers.
Stanford’s findings underscore a crucial point: the significant gap in 2026 isn’t between AI and human performance but between AI’s demo capabilities and its reliable production performance. With decreasing transparency from labs and benchmarks losing relevance, measuring this gap becomes more challenging.

