AI Models Fall Short of Human-Level Performance in New Finance Agent Benchmark, Top Model Reaches 64.37%

Vals AI has unveiled its Finance Agent Benchmark v2, an evaluation designed to test the capabilities of frontier AI models in financial analysis. The benchmark, which aims to mirror the complex tasks performed by human financial analysts, showed that models initially fell well short of the reliability expected of a human analyst, with none reaching 52% accuracy. Anthropic has since reported that its Claude Opus 4.7 scores 64.37% on the benchmark.

The new benchmark was developed in response to the growing potential of AI to automate "busy work" within the lucrative finance sector. Vals AI stated in its announcement: "Finance is one of the most lucrative applications of AI where much of the busy work could be automated. That’s why we rebuilt our Finance Agent Benchmark to push frontier models even further." The updated version features a refined taxonomy reflecting real-world workflows, an improved tool harness, jury-based evaluation, and over 900 novel questions.

Vals AI underscored the implications of the low initial scores, asking: "Would you trust a financial analyst who’s only correct half the time?" The question highlights the gap between current AI capabilities and the reliability required for high-stakes financial decision-making. The benchmark evaluates models on their ability to use tools to research and answer complex financial questions about companies, financial statements, and SEC filings.

Anthropic recently announced that its Claude Opus 4.7 model leads the Vals AI Finance Agent benchmark with a score of 64.37%. This improved performance suggests rapid advancements in AI's ability to handle financial tasks, even as the overall benchmark remains challenging. The company is also releasing ten ready-to-run agent templates for financial services, aiming to automate time-consuming tasks like building pitchbooks and screening KYC files.

Despite these advancements, the benchmark underscores that while AI can significantly aid financial workflows, human oversight and expertise remain crucial. Vals AI's domain-specific benchmarks across industries including finance, legal, and software give the industry a clear measure for tracking the practical viability of AI agents in real-world applications and for identifying areas that need further development.