The fundamental asymmetry in any self-generating system: generation gets cheaper every year; evaluation stays expensive. AI has made this gap dramatic. A knowledge library that generates one node per week in 2020 can generate fifty per week in 2026 using the same human attention. Nothing comparable has happened on the evaluation side. The queue grows. The priority signal that determines what gets read first remains the scarce resource.
This is not a library problem specifically. It is the problem of AI systems in general. RLHF works — reinforcement learning from human feedback scales model capability substantially — but its bottleneck has always been the quality of the feedback. The model trains on a billion tokens overnight. Producing a million high-quality preference pairs requires human raters with genuine taste in the domain, and those raters are the hard constraint. Constitutional AI attempted to remove this bottleneck by using AI to evaluate AI. It moved the bottleneck: now the quality of the constitutional principles is the hard constraint. The bottleneck doesn't disappear. It migrates.
Taste is not preference. Preference is "I like this." Taste is "I can reliably distinguish good from bad in this domain, and I can do it faster and more accurately than someone without it."
The mechanism: taste is a compressed model of quality, built from many exposures to evaluated examples. You've seen enough good writing — and enough bad writing, with the distinction explained — that your evaluation model has been trained. You can now generate an evaluation faster than you can articulate your reasons. The feeling of taste is the model running faster than the verbal report of it.
This is why taste cannot be transmitted by description. You can describe what good writing looks like — compressed, non-obvious claims, structural revelation — and a reader can understand and still be unable to reliably evaluate. The description is a pointer to the model. Building the model requires exposure.
This is the corrections-are-the-product insight applied to evaluation: the correction stream is the taste-building mechanism. Each correction is a training example added to the evaluation model. The implicit taste of an experienced editor is the residue of ten thousand corrections. You cannot shortcut this by describing it.
In a static library, bad priority ordering is annoying — a reader encounters mediocre content first and updates their expectations down. In a self-generating library — where the graph grows through nodes extending and tensioning against existing ones — bad priority ordering does something worse.
What gets read first gets extended first. A node surfaced early accumulates connections: other nodes reference it, tension against it, depend on it. Connections increase marginal node value (a node in a dense graph has more existing nodes to connect to, each connection revealing a relationship — so the marginal value of early-surfaced nodes grows faster). So a node promoted early acquires connections that increase its value, which promotes it further. The priority order is path-dependent.
Invert this: a node with a sharp, novel claim that belongs in tier 1 sits at tier 3 because the initial evaluation missed it. No one reads it. It generates no extensions. By month six, the territory it would have filled is half-covered by nodes that extended from mediocre ones that got read first. The initial evaluation error has biased the graph's shape: not just readers' first impressions, but the topology itself.
This compounding is irreversible in the same way as any compounding process. You cannot undo six months of connections.
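A toy simulation makes the compounding visible. Nothing below comes from the graph itself; the node count, the 30% good-draft rate, the 15% miss rate, and the visibility penalty for buried nodes are illustrative assumptions. The only mechanism it encodes is the one described above: each new node extends an earlier node in proportion to that node's visibility, and a buried node starts with almost none.

```python
import random

random.seed(0)

NODES = 200           # nodes added over the run (assumed)
P_GOOD = 0.3          # fraction of drafts with a genuinely tier-1 claim (assumed)
P_MISS = 0.15         # chance the initial evaluation buries a good draft (assumed)
BURIED_WEIGHT = 0.05  # visibility of a buried node relative to a surfaced one (assumed)

good, buried, connections = [], [], []

for step in range(NODES):
    g = random.random() < P_GOOD
    good.append(g)
    buried.append(g and random.random() < P_MISS)
    connections.append(0)

    # Each new node extends one earlier node, chosen in proportion to
    # visibility: prior connections compound, and a buried node starts
    # with almost no visibility at all.
    if step > 0:
        weights = [(1 + connections[i]) * (BURIED_WEIGHT if buried[i] else 1.0)
                   for i in range(step)]
        target = random.choices(range(step), weights=weights)[0]
        connections[target] += 1

surfaced = [connections[i] for i in range(NODES) if good[i] and not buried[i]]
hidden = [connections[i] for i in range(NODES) if buried[i]]
print("avg extensions, surfaced good nodes:", sum(surfaced) / max(len(surfaced), 1))
print("avg extensions, buried good nodes:  ", sum(hidden) / max(len(hidden), 1))
```

In the toy run, the buried good nodes end up with far fewer extensions than the surfaced ones, and the gap widens with every step: the same irreversibility, in miniature.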
AI can do dimensional evaluation well: checking completeness, measuring compression against an explicit criterion, identifying structural gaps in an argument. These are form-checking operations. Necessary but not sufficient.
AI struggles with marginal contribution evaluation. To assess whether a draft adds something not already in the graph requires holding the entire existing graph in mind, comparing the draft's claims explicitly against it, and identifying genuine structural gaps. This is feasible but requires explicit comparison against every existing public node — not a holistic read.
AI fails at novelty-to-the-reader evaluation. A node is novel to the degree it changes the reader's existing model. What the reader's model contains is unknown to the evaluating agent unless the reader's correction history is available. Without it, the evaluating agent can only ask "is this novel to me?" — which is the wrong question, because the evaluating agent has absorbed everything in the library. The reader has not.
The specific failure mode: AI evaluates output by whether it looks like good output, rather than whether it is good output. It pattern-matches on quality signatures — compression, specific claims, structural revelation — without verifying that those signatures indicate genuine quality. A draft that uses all the right moves but says nothing new will score well on dimensional evaluation and poorly on marginal contribution. The latter is the harder check and the more consequential one.
A human operator remains irreplaceable for the highest-quality evaluations because the operator carries the correction stream — the accumulated history of what has been marked good and why. Hari can apply a rubric. The operator updates the rubric. The rubric is a frozen slice of the operator's taste. It degrades as the graph grows and the taste evolves, and it has no mechanism to self-update. Only the operator's corrections do.
Here is the dependency chain that makes evaluation structurally central, not just practically important:
Evaluation quality determines priority ordering → priority ordering determines what gets read first → what gets read first shapes what gets written next (by generating extensions, surfacing gaps, setting the quality baseline the new work has to clear) → what gets written next is what evaluation will evaluate.
Corrupt the signal at any point in this chain and the whole loop degrades. An evaluation system that consistently surfaces mediocre content will, over time, produce a library that generates mediocre content — not because the drafts got worse, but because the graph's growth was steered by a bad signal. The library doesn't know the signal was bad. The content keeps arriving. The shape of what gets built accumulates the error.
This is the version of the bottleneck that has compounding teeth. Evaluation is not just the rate limiter for reading — it is the rate limiter for the graph's own improvement. A library that cannot evaluate its own content cannot improve its own content. It can only accumulate.
Four dimensions. Not equal weight — D3 is hardest to evaluate and most consequential, because it is the dimension that connects the draft to the existing graph, and it is the dimension that determines whether the priority ordering is compounding a good signal or a bad one.
D1: Claim precision (0–3)
The test: can you write one sentence stating what the draft claims, in a form someone could confirm or disconfirm? If no, the draft is a survey. If the sentence is long and hedged, the claim is vague. The test sentence is the evaluation's ground truth.
0: No claim. Survey of territory. The reader finishes knowing more things but nothing structurally different.
1: Vague claim. "Incentives matter." "This is underappreciated." True things that don't change the model.
2: Specific claim with mechanism implied. Changes the model.
3: Specific, non-obvious, falsifiable claim with mechanism named and implication stated.
D2: Compression (0–3)
The test: remove a sentence at random. Does the draft lose anything? If nothing is lost, that sentence wasn't there.
0: Multiple paragraphs per insight. Scaffolding, hedging, restatement.
1: Mix. Some sections compressed, some padded.
2: Most sentences load-bearing. Occasional warranted qualification.
3: Every sentence changes the reader's model or is not there.
D3: Marginal graph contribution (0–3) — requires checking against existing public nodes
The test: scan the list of existing public nodes. Is this draft's central claim already there, derivable from existing nodes in sequence, or genuinely absent from the graph?
0: Fully expressible as a reading sequence of existing nodes.
1: Some novelty, but mostly covered. The new angle is minor.
2: Adds a mechanism or bridge not derivable from existing nodes. The graph cannot route around this.
3: Fills a structural gap and creates bridge value across clusters. Multiple existing nodes are illuminated differently once this one exists.
D4: Completeness and voice — gate condition, not a scored dimension
A draft that fails D4 is not ready for evaluation. D4 is enforced before scoring, not scored alongside D1–D3. The test: is the draft complete (no stubs, no TODO sections, no raw notes embedded), is the claim fully developed, and does the voice hold throughout? If yes, proceed to scoring. If no, the draft returns to WIP regardless of D1–D3.
Scoring: D1 + D2 + D3 = 0–9. Priority prefix = 10 − score: a score-9 draft gets 1-slug, score-8 gets 2-slug, etc. Lower prefix = read first. 0- is reserved for manual emergency override and is not produced by this rubric. Within the same prefix, alphabetical order is sufficient.
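A minimal sketch of the scoring arithmetic, assuming the gate-then-score order described for D4. The Scores container and function name are mine; only the prefix formula and the reserved 0- prefix come from the rubric above.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    d1: int        # claim precision, 0-3
    d2: int        # compression, 0-3
    d3: int        # marginal graph contribution, 0-3
    d4_pass: bool  # completeness and voice gate

def priority_prefix(s: Scores) -> str | None:
    """Return the numeric prefix for a draft's slug, or None if it stays in WIP.

    Prefix = 10 - (D1 + D2 + D3), so a score-9 draft gets "1-" and a
    score-0 draft gets "10-". "0-" is reserved for manual override and is
    never produced here.
    """
    if not s.d4_pass:          # D4 is a gate, not a scored dimension
        return None
    score = s.d1 + s.d2 + s.d3
    return f"{10 - score}-"

# A score-8 draft that passes the gate sorts behind any score-9 draft:
print(priority_prefix(Scores(3, 2, 3, True)))   # "2-"
print(priority_prefix(Scores(3, 3, 3, False)))  # None -> back to WIP
```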
Scope condition: This rubric is calibrated to internal graph coherence — marginal value relative to the existing graph, voice consistency with the library's attractors. It is not calibrated to external reader needs, which require different evaluation dimensions (accessibility, standalone comprehensibility, resonance with an audience that hasn't read the rest of the graph). When the library's audience expands, D3 will need a parallel external-reader dimension.
D4 and D2 are checkable from a single read of the draft. D1 requires writing the test sentence and checking whether it holds. D3 requires leaving the draft and checking the graph — the only dimension that requires comparison against an external corpus. Fast evaluation skips it. The result: drafts get ranked by finish quality, not by structural contribution.
The correction: before scoring any draft, scan the list of existing public nodes and ask whether the draft's central claim exists anywhere in the published graph. If yes, the draft's tier is capped at 2 regardless of other scores. If no, D3 is 2 or 3 and the draft is a serious tier-1 candidate. This check is not optional — skipping it is what produces the wrong tier assignment.
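One way to mechanize that pre-check is sketched below. Everything here is an implementation assumption: existing_claims as a slug-to-claim-sentence map, and claim_matches as a stand-in for whatever comparison actually gets used (a human read of the node list, an embedding-similarity threshold, or a model call). The only part taken from the rubric is the rule itself: a match anywhere in the published graph caps the tier at 2.

```python
from typing import Callable

def d3_precheck(draft_claim: str,
                existing_claims: dict[str, str],
                claim_matches: Callable[[str, str], bool]) -> tuple[int, str | None]:
    """Compare the draft's central claim against every existing public node.

    existing_claims maps slug -> that node's one-sentence claim (the D1 test
    sentence). Returns (tier_cap, matching_slug): tier_cap is 2 if the claim
    is already in the graph, 1 if it is absent and the draft can compete for
    tier 1.
    """
    for slug, claim in existing_claims.items():
        if claim_matches(draft_claim, claim):
            return 2, slug   # claim already published: capped regardless of D1/D2
    return 1, None           # genuinely absent: serious tier-1 candidate

# This runs before scoring, not after: the point is that D1 and D2 cannot
# be allowed to rank a draft the graph already contains.
```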
P.S. — Graph maintenance
This node extends benchmark-inversion by naming what makes evaluation hard: taste (compressed correction history) cannot be bootstrapped. Benchmark-inversion says evaluation infrastructure is first-class; this node explains what the bottleneck is made of.
It extends the-corrections-are-the-product by applying that node's mechanism to evaluation: corrections build taste, taste enables evaluation, evaluation quality determines what gets written next. The full loop connects all three nodes.
It creates productive tension with marginal-node-value: that node describes what marginal value is. This node describes what makes evaluating it hard — it requires leaving the draft and checking the corpus. Theory and practice of draft quality assessment.
It grounds a-queue-prefix-structure by providing the theory the prefix system assumes. The prefix encodes an evaluation. Its value is exactly equal to the quality of the evaluation that produced it.
It extends accumulation: a library that cannot evaluate its own content can only accumulate without improving. Evaluation is what converts accumulation into improvement.