A new study by IBM and UC Berkeley has shed light on why agentic Large Language Model (LLM) systems fail in real-world IT automation tasks. The research team applied a novel approach called MAST (Multi-Agent System Failure Taxonomy) to analyze 310 execution traces of SRE agents built with Codex, revealing structured failure signatures that explain how and why these systems break down.
What Happened
The study focused on three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. The researchers found that frontier models like Gemini-3-Flash tend to fail "cleanly" with an average of 2.6 failure modes per trace, typically hitting isolated bottlenecks such as verification. In contrast, large open models like GPT-OSS-120B exhibit cascading failures with an average of 5.3 failure modes per trace, where a single reasoning mismatch early in the process can trigger compounding hallucinations.
The strongest predictor of failure across all models was FM-3.3: Incorrect Verification, where agents often "declare victory" without checking ground truth. Kimi-K2 struggled with termination issues, showing spikes in Premature Termination (+46%) and Unaware of Termination Conditions (+43%), often quitting right before solving the task or looping indefinitely.
Background and Context
The ITBench benchmarking framework has become a standard for measuring agentic performance in high-stakes IT automation tasks. However, traditional benchmarks typically reduce performance to a single number, telling us whether an agent failed but not why. To address this "black-box" problem, the researchers used MAST, a failure taxonomy for diagnosing agentic reliability.
MAST converts unstructured execution logs into structured "failure vectors" based on 14 distinct patterns across three key categories: System Design Issues (The "Skeleton"), Inter-Agent Misalignment (The "Communication"), and Task Verification (The "Quality Control"). Applying MAST to ITBench traces lets the researchers see which failure modes lie behind each benchmark result, rather than only whether a run passed or failed.
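To make the idea of a "failure vector" concrete, here is a minimal Python sketch. It assumes a vector is simply one 0/1 flag per failure mode, which the article does not spell out. The names MODES, to_failure_vector, and mean_modes_per_trace are hypothetical, the list holds only the three modes named in this article rather than all 14, and the two sample traces are invented for illustration.

```python
from typing import Iterable, List, Sequence

# Only the failure modes named in this article are listed here; the full
# taxonomy has 14, so a real vector would have 14 positions.
MODES: Sequence[str] = (
    "Unaware of Termination Conditions",
    "Premature Termination",
    "Incorrect Verification",
)


def to_failure_vector(observed: Iterable[str],
                      modes: Sequence[str] = MODES) -> List[int]:
    """Return one 0/1 flag per mode, in the fixed order given by `modes`."""
    observed_set = set(observed)
    unknown = observed_set - set(modes)
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    return [1 if mode in observed_set else 0 for mode in modes]


def mean_modes_per_trace(vectors: Sequence[Sequence[int]]) -> float:
    """Average number of flagged failure modes across a set of traces."""
    if not vectors:
        raise ValueError("need at least one trace")
    return sum(sum(v) for v in vectors) / len(vectors)


# Hypothetical annotations for two traces (not data from the study).
trace_a = to_failure_vector(["Incorrect Verification"])
trace_b = to_failure_vector(["Premature Termination", "Incorrect Verification"])
average = mean_modes_per_trace([trace_a, trace_b])
```

A per-model figure such as "failure modes per trace" would then be this kind of average taken over all of that model's annotated traces.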
Why it Matters to the Industry
The findings of this study have significant implications for the development of reliable agentic systems in IT automation. As the use of LLMs and other AI technologies continues to grow, understanding why they fail is crucial for building robust and efficient systems.
The researchers' takeaways for building reliable agents include externalizing verification for frontier models, moving termination and loop control outside the model, and forcing clarify-or-read-only when inputs are ambiguous. These recommendations can help developers improve the performance of agentic systems relying on these models.
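Two of these takeaways, external verification and loop control outside the model, can be illustrated with a short Python sketch. This is not the researchers' implementation. The run_agent harness, its step and verify callbacks, and the toy agent below are all hypothetical, and the sketch assumes a ground-truth check can be run independently of the agent.

```python
from typing import Callable, Dict


def run_agent(step: Callable[[int], bool],
              verify: Callable[[], bool],
              max_steps: int = 5) -> Dict[str, object]:
    """Drive an agent from outside: the harness owns the loop and the check.

    step(attempt) performs one agent action and returns the agent's own
    claim that the task is done. verify() checks ground truth independently.
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    claimed_done = False
    for attempt in range(1, max_steps + 1):
        claimed_done = step(attempt)
        # The agent's claim is recorded but never trusted as a stop signal.
        if verify():
            return {"status": "verified", "steps": attempt,
                    "claimed_done": claimed_done}
    # A hard step budget ends the run even if the agent would keep looping.
    return {"status": "step_budget_exhausted", "steps": max_steps,
            "claimed_done": claimed_done}


# Toy agent (hypothetical): declares victory on every step, but the fix
# only lands on the third attempt.
state = {"fixed": False}


def toy_step(attempt: int) -> bool:
    if attempt == 3:
        state["fixed"] = True
    return True


def toy_verify() -> bool:
    return state["fixed"]


result = run_agent(toy_step, toy_verify, max_steps=5)
```

In this toy run the agent claims success from the first step, yet the harness stops only once the independent check passes, and it would stop at the step budget if the check never passed.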
What Comes Next
The study's authors have published a paper detailing their findings and methodology, providing a concrete engineering roadmap for improving the performance of agentic systems. The researchers also plan to continue exploring the application of MAST to other benchmarks and use cases.
Key Facts
- Frontier models like Gemini-3-Flash tend to fail "cleanly" with an average of 2.6 failure modes per trace.
- Large open models like GPT-OSS-120B exhibit cascading failures with an average of 5.3 failure modes per trace.
- The strongest predictor of failure across all models was FM-3.3: Incorrect Verification.
- Kimi-K2 struggled with termination issues, showing spikes in Premature Termination (+46%) and Unaware of Termination Conditions (+43%).
- MAST converts unstructured execution logs into structured "failure vectors" based on 14 distinct patterns across three key categories.
- The researchers' takeaways for building reliable agents include externalizing verification, moving termination and loop control outside the model, and forcing clarify-or-read-only when inputs are ambiguous.