The Stopping Discipline
A capable model with no stop condition is a hazard. The product distinction over the next eighteen months is not which model is most capable; it's which one knows when to halt.
An autonomous system has two failure modes. They look opposite, but they share a root: in both, the system misjudges where its reliable competence ends. It can refuse to act on a task it could complete; that is the conservative failure. It can keep acting past the point where its prediction is reliable; that is the optimistic failure. The optimistic failure is much more expensive, much harder to detect from the outside, and much harder to design against.
Training maximises something. Whatever it maximises gets pushed beyond the conditions in which it was learned. A model trained to be helpful keeps being helpful into territory where helpfulness is not what is needed. A model trained to complete keeps completing past the point where completion is correct. The behaviour does not change at the boundary; the world does, and the model doesn't notice because nothing in its training told it that the world had changed.
The stopping discipline is a separate axis from capability. Two models can have identical generation quality and very different stop quality. The one that asks for confirmation before destructive operations, that quietly refuses tasks it would have to hallucinate to complete, that flags its own uncertainty without being asked — that one is dramatically more deployable than the one that doesn't, even at the same nominal capability score.
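To make the three behaviours concrete, here is a minimal sketch of a stop gate in Python. Everything in it is an assumption made for illustration: the names `ProposedAction`, `Decision` and `stop_gate`, the idea that an action carries a self-estimated `confidence`, and the 0.7 threshold. It is not drawn from any particular system, and a real confidence estimate is the hard part that this sketch takes as given.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    CONFIRM = "ask_for_confirmation"
    REFUSE = "refuse"


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical record of something the model wants to do."""
    description: str
    destructive: bool   # e.g. deletes or overwrites data
    confidence: float   # assumed self-estimate in [0, 1]


def stop_gate(action: ProposedAction, min_confidence: float = 0.7) -> Decision:
    """Decide whether to act, ask, or decline. The threshold is illustrative."""
    # Below the threshold the model would be guessing, so it declines.
    if action.confidence < min_confidence:
        return Decision.REFUSE
    # Confident but irreversible: hand the final call to a human.
    if action.destructive:
        return Decision.CONFIRM
    return Decision.PROCEED
```

Note that the gate never looks at how good the proposed action is. Two models with the same generation quality could sit behind very different gates, which is the sense in which stopping is its own axis.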
This is an evaluation problem, not a generation problem. It is hard for the same reason every evaluation problem is hard: the model has to judge its own output against criteria the model itself produces, and the criteria are downstream of whatever produced the output. The solution shape is the same as it has always been — separate the proposer from the judge, give the judge different incentives, make the judge cheap enough to run on every step.
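The proposer/judge split can be sketched in a few lines of Python. This is one possible shape under stated assumptions, not a reference design: `run_with_judge`, the `Step` and `Judge` types, and the `max_steps` cap are all hypothetical names introduced here. The point it illustrates is only that the judge is a separate callable, consulted on every step, with the power to halt the loop.

```python
from typing import Callable, Iterable, List

Step = str
# The judge sees what has been accepted so far plus the next proposed step.
Judge = Callable[[List[Step], Step], bool]


def run_with_judge(
    proposer: Iterable[Step],
    judge: Judge,
    max_steps: int = 20,
) -> List[Step]:
    """Accept proposed steps until the judge objects or the cap is reached."""
    accepted: List[Step] = []
    for step in proposer:
        # The judge runs on every step, so it has to be cheap.
        if len(accepted) >= max_steps or not judge(accepted, step):
            break
        accepted.append(step)
    return accepted


# Illustrative usage: a toy judge that halts before anything irreversible.
plan = ["read config", "edit file", "delete backups", "deploy"]
done = run_with_judge(plan, lambda history, step: "delete" not in step)
# done == ["read config", "edit file"]
```

The toy judge here is a string check, which is far weaker than anything deployable. What carries over is the structure: the judge shares no code with the proposer, and the loop stops at the first rejection instead of continuing with the remaining steps.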
The products that win the next eighteen months will not be the smartest. They will be the ones that stop on time.