A new benchmark named MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments) has been introduced to evaluate AI agents' ability to navigate and reason through complex, real-world web information. Developed by researchers from UNC Chapel Hill, Virginia Tech, and the University of Texas at Austin, MERRIN highlights significant limitations in current AI capabilities for multimodal understanding and reasoning in noisy digital environments. The paper, published on arXiv, details the benchmark's design and initial findings.
MERRIN addresses critical gaps in existing benchmarks by requiring AI agents to process natural language queries without explicit modality cues, incorporate underexplored modalities like video and audio, and contend with noisy, conflicting, and incomplete web evidence. The benchmark is human-annotated and designed to reflect the multi-hop reasoning often required in real-world search tasks. It challenges agents to identify relevant modalities and retrieve pertinent evidence from diverse sources.
Initial evaluations on MERRIN demonstrate that current AI agents perform poorly, with an average accuracy of just 22.3% across all tested models. Even the best-performing agent, Gemini-3.1-Pro with Agentic Multimodal Search, achieved only 40.1% accuracy. This contrasts sharply with human performance: annotators achieved 71.4% accuracy on a subset of the questions while issuing significantly fewer search queries and visits.
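To make the reported gap concrete, the comparison reduces to simple accuracy arithmetic. A minimal sketch, using only the figures quoted above (variable names are illustrative and not from the paper's code):

```python
# Illustrative comparison of the accuracies reported above.
# All values are percentages; the names are hypothetical.
avg_agent_acc = 22.3    # mean accuracy across all tested agents
best_agent_acc = 40.1   # Gemini-3.1-Pro with Agentic Multimodal Search
human_acc = 71.4        # annotator accuracy on a question subset

# Gap between the strongest agent and human annotators, in percentage points
gap = human_acc - best_agent_acc
print(f"Best agent trails humans by {gap:.1f} percentage points")
```

Even the strongest configuration leaves a gap of more than 31 percentage points to human annotators.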
The research indicates that reasoning, rather than search effectiveness, is the more pressing bottleneck for AI agents. Providing agents with gold-standard evidence yielded only a modest 7.6% improvement in accuracy, suggesting that even with perfect information retrieval, agents struggle with the subsequent reasoning steps. Furthermore, agents exhibited a strong bias toward textual evidence, often overlooking crucial information in video and audio formats, and were prone to "over-exploration" in noisy environments, issuing many searches without converging on correct answers.
These findings underscore the urgent need for advancements in AI systems capable of robustly searching and reasoning across diverse modalities in complex web environments. MERRIN serves as a valuable testbed for future research aimed at bridging the substantial performance gap between AI agents and human cognitive abilities in multimodal information processing.