OpenAI researcher Noam Brown has argued that the intelligence of modern AI models is primarily a function of inference compute, a claim that challenges how AI systems are usually evaluated. In his view, comparing models by a single number has not made sense since 2024; what should be compared instead is "intelligence per token" or "intelligence per dollar."
"A hill that I will die on: with today's AI models, intelligence is a function of inference compute. Comparing models by a single number hasn't made sense since 2024. What matters is intelligence per token or per $," Brown stated in a recent tweet. He emphasized this point's relevance particularly for products such as Codex.
This perspective reflects a shift in how AI capabilities are understood and measured. Pre-training compute (the resources used to train a model) has historically been a key indicator of capability, but attention is increasingly turning to the compute spent during inference, that is, when a trained model is used to make predictions or generate outputs. Some analyses suggest the return on that spending is substantial: by one estimate, each doubling of inference compute yields output quality equivalent to a 1.666x increase in the training budget.
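To make the 1.666x figure concrete, the short sketch below converts an inference-compute multiplier into an equivalent training-budget multiplier. It assumes the per-doubling equivalence compounds across successive doublings, which the cited estimate does not state; the function name `equivalent_training_multiplier` is hypothetical and used only for illustration.

```python
import math


def equivalent_training_multiplier(inference_multiplier: float,
                                   per_doubling: float = 1.666) -> float:
    """Training-budget multiplier matching a given inference-compute multiplier.

    ASSUMPTION (not from the source): the per-doubling equivalence compounds,
    so k doublings of inference compute correspond to per_doubling ** k.
    """
    if inference_multiplier <= 0:
        raise ValueError("inference_multiplier must be positive")
    doublings = math.log2(inference_multiplier)  # number of doublings, may be fractional
    return per_doubling ** doublings


if __name__ == "__main__":
    for m in (2, 4, 8, 16):
        t = equivalent_training_multiplier(m)
        print(f"{m}x inference compute ~ {t:.2f}x training budget")
```

Under that compounding assumption, four times the inference compute would correspond to about 2.78 times the training budget (1.666 squared). If the real relationship flattens at higher budgets, this sketch would overstate the equivalence.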
Brown further elaborated that the capabilities of advanced AI systems such as Google DeepMind's "Deep Think" are not necessarily new breakthroughs but the result of "scaffolding" existing models with large amounts of inference compute. He criticized current safety evaluation frameworks, many of them developed before 2023, for failing to account adequately for this "test-time scaling." He proposed that all system cards include plots of benchmark performance as a function of inference compute, with safety thresholds evaluated against performance projected at high levels of inference spending, such as $1 million or more.
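One way such a projection could work is sketched below: fit benchmark score against the logarithm of inference spending, extrapolate to a large budget, and compare the result with a safety threshold. The log-linear form is an assumption made for illustration, not a method Brown describes, and the numbers in the usage example are synthetic placeholders rather than real benchmark results. All function names are hypothetical.

```python
import math


def fit_log_linear(costs_usd, scores):
    """Least-squares fit of score = intercept + slope * log10(cost).

    ASSUMPTION: score grows linearly in log10 of inference spending.
    """
    if len(costs_usd) != len(scores) or len(costs_usd) < 2:
        raise ValueError("need at least two (cost, score) pairs")
    xs = [math.log10(c) for c in costs_usd]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("costs must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def projected_score(costs_usd, scores, target_cost_usd):
    """Extrapolate the fitted line to a target inference budget."""
    intercept, slope = fit_log_linear(costs_usd, scores)
    return intercept + slope * math.log10(target_cost_usd)


def crosses_threshold(costs_usd, scores, target_cost_usd, threshold):
    """True if the projected score reaches the safety threshold."""
    return projected_score(costs_usd, scores, target_cost_usd) >= threshold


if __name__ == "__main__":
    # Synthetic placeholder data, NOT real measurements.
    costs = [1, 10, 100, 1000]      # inference spend per task, in dollars
    scores = [0.2, 0.3, 0.4, 0.5]   # benchmark accuracy at each spend level
    print(projected_score(costs, scores, 1_000_000))
    print(crosses_threshold(costs, scores, 1_000_000, threshold=0.75))
```

A straight-line extrapolation ignores saturation (scores cannot exceed the benchmark maximum), so a real projection would need a bounded curve and uncertainty estimates.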
The growing importance of inference compute is also reflected in the development of "test-time compute" strategies. These include repeated sampling, in which a model generates multiple outputs and the best one is selected; self-correction, in which a model refines its own responses; and tree search, which enables more deliberate problem-solving. These methods, described in recent academic surveys, show how models can reach higher performance by "thinking" longer at inference time rather than relying solely on what they learned in training. This evolving understanding of AI capability is likely to influence research, development, and safety protocols across the industry.
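Repeated sampling, the simplest of these strategies, can be sketched in a few lines. The example below is a toy illustration only: a seeded random number generator stands in for the model and a distance-to-target function stands in for the verifier. Both helpers, and the name `best_of_n`, are hypothetical and not drawn from any particular system.

```python
import random


def best_of_n(generate, score, n):
    """Draw n candidates from generate() and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


def make_toy_generator(seed):
    """Hypothetical stand-in for a model: returns random integer 'answers'."""
    rng = random.Random(seed)
    return lambda: rng.randint(0, 100)


def make_toy_scorer(target):
    """Hypothetical stand-in for a verifier: closer to target scores higher."""
    return lambda guess: -abs(guess - target)


if __name__ == "__main__":
    scorer = make_toy_scorer(target=42)
    for n in (1, 4, 16, 64):
        # Same seed each time, so a larger n extends the same sample stream.
        answer = best_of_n(make_toy_generator(seed=0), scorer, n)
        print(f"n={n:>2}  best guess={answer}  score={scorer(answer)}")
```

Because each run reuses the same seed, a larger `n` always examines a superset of the earlier candidates, so the best score can only stay the same or improve as more samples, and therefore more inference compute, are spent. In practice the gain depends on how reliable the scoring function is.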