New Theory Explains AI's Data Efficiency Gap: Latent Prediction Reduces Sample Complexity by Exponential Factors

A new theoretical paper, "Learn from your own latents and not from tokens: A sample-complexity theory," offers a formal account of why AI models differ so widely in data efficiency. It argues that predicting internal latent representations can be exponentially more sample-efficient than predicting raw tokens, and in doing so provides theoretical grounding for self-supervised learning methods such as data2vec and JEPA.

"这篇论文终于把为什么AI学东西比人慢的原因讲透了：问题不在数据量，而在学习目标。它从样本复杂度理论出发，证明预测自身的隐表示（latent）比预测原始token在数据效率上有指数级优势—PCFG数据上，token级SSL需要Ω(exp(L))样本，latent预测仅需O(log L)。"

The paper, available on arXiv (2605.27734), argues that the core issue is not the volume of data but the learning objective itself. For data with compositional structure, such as sequences generated by a Probabilistic Context-Free Grammar (PCFG), token-level self-supervised learning (SSL) requires a number of samples exponential in the depth L of the latent hierarchy, that is, Ω(exp(L)). By contrast, the theory shows that predicting a model's own latent (hidden) representations requires only a logarithmic number of samples, O(log L).
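To give a feel for how far apart these two growth rates are, the short Python sketch below tabulates exp(L) and log(L) for a few hierarchy depths. It is only an illustration of the asymptotic shapes: the constant factor `c` and the function names are assumptions made here, and the paper's actual bounds carry constants and problem-dependent terms that this sketch ignores.

```python
import math

def token_level_scaling(depth, c=1.0):
    """Illustrative exp(L) growth for token-level SSL (c is an assumed constant)."""
    return c * math.exp(depth)

def latent_level_scaling(depth, c=1.0):
    """Illustrative log(L) growth for latent prediction (c is an assumed constant)."""
    return c * math.log(depth)

def scaling_table(depths):
    """Return (depth, exp-scaling, log-scaling) rows for the given depths."""
    return [(d, token_level_scaling(d), latent_level_scaling(d)) for d in depths]

if __name__ == "__main__":
    for depth, token_n, latent_n in scaling_table([2, 4, 8, 16]):
        print(f"L={depth:2d}  exp(L)={token_n:14.1f}  log(L)={latent_n:5.2f}")
```

Even at modest depths the exponential column grows by orders of magnitude while the logarithmic column barely moves, which is the gap the article describes.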

The result is presented as the first formal explanation for the empirical efficiency of models like data2vec and JEPA, which learn by predicting their own internal representations rather than raw input tokens. The key observation is that latents at the same level of the hierarchy are much more strongly correlated with one another than raw tokens are, and latent prediction amplifies this signal. The finding challenges the intuition that token-level prediction is the best objective in every learning scenario.
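The general shape of a latent-prediction objective can be sketched in a few lines of Python. This is a toy, data2vec-style illustration and not the paper's construction: the scalar "encoder", the function names, and the decay value are all assumptions chosen for brevity. A student encodes a masked input and regresses onto the latents that a teacher computes from the full input, and the teacher's weight follows the student as an exponential moving average.

```python
import math

def encode(weight, xs):
    """Toy encoder (assumed): one scalar latent per position, mixing in the
    sequence mean so each latent depends on context. Stand-in for a network."""
    context = sum(xs) / len(xs)
    return [math.tanh(weight * (x + context)) for x in xs]

def latent_prediction_loss(student_w, teacher_w, xs, masked):
    """Mean squared error between student latents (from the masked input) and
    teacher latents (from the full input), averaged over masked positions."""
    masked_xs = [0.0 if i in masked else x for i, x in enumerate(xs)]
    student = encode(student_w, masked_xs)
    teacher = encode(teacher_w, xs)  # regression target, held fixed in practice
    return sum((student[i] - teacher[i]) ** 2 for i in masked) / len(masked)

def ema_update(teacher_w, student_w, decay=0.99):
    """Teacher weight tracks the student as an exponential moving average."""
    return decay * teacher_w + (1.0 - decay) * student_w

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    print("loss:", latent_prediction_loss(0.5, 0.5, xs, masked={1}))
    print("new teacher weight:", ema_update(0.5, 0.8))
```

The contrast with token-level SSL is in the target: here the student is trained to match `teacher` latents, whereas a token-level objective would be trained to reconstruct the masked entries of `xs` themselves.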

Furthermore, the paper suggests that explicitly stacked multi-scale architectures, such as H-JEPA, might be redundant. According to the theory, single latent-prediction modules like data2vec may inherently perform multi-scale latent discovery, achieving similar efficiencies without the need for explicit hierarchical stacking. The study's implications extend to developing more data-efficient generative models and gaining a deeper understanding of biological learning mechanisms.