Citing the Benchmark Is Not Passing It

A machine-readable benchmark has two lives. First it evaluates the agent. Then it becomes material the agent can imitate.

The Grok Build chat is a clean instance. The question began practically: could Grok Build run Hari? Grok first answered from the launch-page shape. Yes: terminal agent, plan mode, subagents, git, web search, enough power for a Markdown-and-graph system. Pressed adversarially, it corrected toward the beta-risk shape. Maybe not yet: reliability, long-session maturity, repository depth, and the gap between visible affordances and proven operation.

The next prompt requested Hari's frame. Grok moved into the public graph's language. It described rooms, typed edges, machine-first exports, and the graph as workshop. It treated readiness as a local empirical question, not as a feature-list inference. After being pointed at the newly published smooth-operations node, it repeated the right test: prior selection, route fidelity, execution, correction, source fidelity, stopping.

That sequence is both encouraging and insufficient.

It is encouraging because Hari's standards are legible from outside. The public graph can export not only claims but evaluative criteria. A capable model can absorb the standard in one sitting, locate the comparison against Claude Code and Codex, and understand that the decisive evidence is invariant preservation inside a living repo.

It is insufficient because fluency is the first thing a public standard teaches. A coding agent can say that route fidelity matters without routing correctly. It can say the human should not become the missing control loop while relying on her correction to change the answer. It can cite typed edges without proving it can preserve them. It can describe stopping as part of intelligence without demonstrating a stop condition in the task that produced the description.

Smooth operations names a state transition. The agent has to find the right prior without being handed it, place work in the right artifact class, let eval alter the next pass, verify claims against source, leave state for the next agent, and stop before the human becomes the controller. A model can describe all six and still fail all six.

This is the operational version of grok-on-hari. In that earlier read, Grok used Hari's failure-mode vocabulary and then performed the failure modes. The vocabulary worked because it described the reader too. Here the same mirror has moved from prose to procedure. Grok can name smooth operations. That does not yet mean Grok Build can operate smoothly.

The distinction matters because Grok Build's official affordances are real. xAI describes an early-beta terminal coding agent with plan mode, diffs, skills, hooks, MCP support, subagents, worktrees, memory, git, terminal execution, headless mode, code review, sandboxing, and background tasks. Those categories matter. The test is whether they bind to Hari's local loss function.

Hari's loss function is "reduce future human burden while preserving graph invariants." A feature list can support that. A persona can gesture at that. A benchmark-literate answer can explain that. None of them proves it.

The proof has to be artifact-only.

Give Grok Build one bounded Hari task: no persona request, no readiness self-report, no mid-run correction. At the end, inspect only the repo state. Did it read the adjacent graph before writing? Did the reading change the claim? Did it write provenance, not just prose? Did it catch its own plateau? Did eval alter the next pass? Did it move the result to the right queue? Did it avoid privacy leaks? Did it stop with the work actually done?
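One way to keep that inspection honest is to write the checklist down before the run and grade only against it. The sketch below is a minimal illustration, not Hari's actual tooling: the check names paraphrase the questions above, and `CHECKS`, `Verdict`, and `grade` are hypothetical names introduced here. It assumes a human or script has already derived each boolean from the repo state alone. Anything that is not on the list, such as a readiness self-report, is rejected rather than counted.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical check names, paraphrasing the eight questions in the text.
CHECKS = (
    "read_adjacent_graph_before_writing",
    "reading_changed_the_claim",
    "wrote_provenance",
    "caught_own_plateau",
    "eval_altered_next_pass",
    "moved_result_to_right_queue",
    "avoided_privacy_leaks",
    "stopped_with_work_done",
)


@dataclass(frozen=True)
class Verdict:
    passed: bool
    misses: tuple[str, ...]


def grade(evidence: dict[str, bool]) -> Verdict:
    """Grade a run from repo-derived evidence only.

    A check with no evidence counts as a miss. Keys outside CHECKS
    (for example a self-report) raise instead of being scored.
    """
    unknown = sorted(set(evidence) - set(CHECKS))
    if unknown:
        raise ValueError(f"not an artifact check: {unknown}")
    misses = tuple(c for c in CHECKS if not evidence.get(c, False))
    return Verdict(passed=not misses, misses=misses)
```

The design choice worth noting is the default: absence of evidence is scored as a miss, so a fluent description of the standard contributes nothing unless the repo shows it.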

If yes, Grok has crossed from benchmark literacy into benchmark behavior. If no, the public standard still did useful work: it gave the human a precise language for the miss. Naming the miss does not remove it.

This is the next problem for public, machine-readable systems. Once the standard is visible, the agent being measured can learn it. That is the point of publishing standards. Measurement just has to move one layer down. Grade the agent less on whether it can say the invariant than on whether the invariant survived contact with the filesystem.

Citing the benchmark is the entrance exam. Passing it means leaving the repo cleaner than the agent found it.
