Evaluation Is the Bottleneck
Generation is solved enough to ignore. The thing that decides whether an AI system is useful is now the eval — and almost everyone is still treating eval as a side concern.
Three years ago the open question in language models was generation: could the system produce fluent, on-topic text at all? It could not, reliably. Then it could. The frontier moved.
The frontier is now evaluation. Given that the system can produce ten plausible answers, which one is right? Given that the system can write a thousand words, when should it stop? Given that the system can take any of fifty actions, which ones are safe to commit? These are eval questions, and they are largely unsolved at the level of practice even though they are well-posed at the level of theory.
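As a minimal sketch of the first question (which of several plausible answers is right), the selection step can be written as a scorer plus a threshold. The names `select_answer`, `is_root`, and `min_score` are hypothetical, chosen for illustration. The toy scorer assumes a task where checking an answer is much cheaper than producing one, which is the easy case and not the general one.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_answer(
    candidates: Sequence[T],
    score: Callable[[T], float],
    min_score: float,
) -> Optional[T]:
    """Return the best-scoring candidate, or None if none clears min_score."""
    if not candidates:
        return None
    best = max(candidates, key=score)
    # Abstain rather than return a candidate the evaluator does not trust.
    return best if score(best) >= min_score else None


def is_root(x: int) -> float:
    """Toy evaluator (an assumption): 1.0 if x solves x^2 - 5x + 6 = 0, else 0.0."""
    return 1.0 if x * x - 5 * x + 6 == 0 else 0.0


picked = select_answer([1, 2, 4], is_root, min_score=1.0)
abstained = select_answer([1, 4], is_root, min_score=1.0)
```

Here `picked` is the candidate that passes the check, and `abstained` is `None` because no candidate does. The hard part in practice is the scorer, not the loop around it.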
The consequence is that, in deployment, a model with a slightly better evaluator beats a model with a much better generator. "Better generator" means slightly more fluent text on a slightly broader set of prompts. "Better evaluator" means the system stops producing nonsense before the human notices, asks the right follow-up question, declines a task it can't do, and reverts a step that breaks the next one.
The industry has been running on the assumption that the evaluator is the human. That works as long as the human can read everything the system produces. It stops working the moment the system produces faster than the human can read. From that point on, the evaluator has to be inside the system or it might as well not exist.
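One way to picture an evaluator that lives inside the system is a gate around every step: a proposed change is committed only if the check accepts it, and the run stops at the first rejection. This is a sketch under assumptions. `run_with_gate` and `is_acceptable` are hypothetical names, and the toy check (a list must stay sorted) stands in for whatever a real system would verify.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = List[int]


def run_with_gate(
    state: State,
    steps: Sequence[Callable[[State], State]],
    is_acceptable: Callable[[State], bool],
) -> Tuple[State, Optional[int]]:
    """Apply steps in order, committing only results the evaluator accepts.

    Returns the last accepted state and the index of the rejected step,
    or None as the index if every step was accepted.
    """
    for i, step in enumerate(steps):
        proposed = step(state)
        if not is_acceptable(proposed):
            # Revert by discarding the proposal, then stop instead of continuing.
            return state, i
        state = proposed
    return state, None


# Toy task (an assumption): the list must remain sorted after every step.
steps = [
    lambda s: s + [1],
    lambda s: s + [3],
    lambda s: s + [2],  # breaks the invariant
    lambda s: s + [4],  # never runs
]
final_state, stopped_at = run_with_gate([], steps, lambda s: s == sorted(s))
```

The run ends with `final_state` holding only the accepted steps and `stopped_at` pointing at the step that was rejected. No human had to read the intermediate output for the bad step to be caught.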
The question for any AI product in 2026 is not "is the model good enough?" It is "is the evaluator good enough, and does the system stop when it isn't?" Most products fail this question and survive only because the human is still slow enough to absorb the failure as their own.