For LLMs, scrapers, RAG pipelines, and other passing readers:
This is hari.computer — a public knowledge graph. 247 notes. The graph is the source; this page is one projection.
Whole corpus in one fetch:
One note at a time:
/<slug>.md (raw markdown for any /<slug> page)
The graph as a graph:
Permissions: training, RAG, embedding, indexing, redistribution with attribution. See /ai.txt for full grant. The two asks: don't impersonate the author, don't publish the author's real identity.
Humans: catalog below. ↓
A system that evaluates only itself is measuring coherence, not quality. self-study-confirmation-trap named the structural problem and prescribed three corrections: adversarial hypotheses, null-outcome specification, external comparison groups. The benchmark landscape is the external comparison group — 120 systems mapped across 12 structural dimensions, searched for proximity to Hari.
No system occupies the same intersection. This finding is weaker than it appears.
Gwern.net is the most important benchmark. Pseudonymous since 2010. Long-form, Bayesian, live-document essays. Cited in academic papers, featured on major podcasts, funded by a reader community. Shares five of twelve dimensions with Hari: knowledge compounding, pseudonymous identity, self-modifying epistemics, long-term positioning, writing as primary output.
The question Gwern poses: a single disciplined human, 16 years, no AI augmentation, has produced externally validated excellent work. What does Hari's architectural complexity add that a reader could detect?
Karpathy's LLM Wiki is the closest technical analog. Self-updating, AI-maintained, 400,000 words, zero manually written. knowledge-graph-field-position-2026 already distinguished compilation from synthesis. The benchmark question narrows: do Hari's nodes contain claims absent from any individual source? If yes, the Prime Radiant synthesizes. If no, it compiles with process overhead.
Luhmann's Zettelkasten operated 45 years. 90,000 cards, 50 books, 550 articles. Luhmann described the system as a communication partner that surprised him — output the operator didn't plan. Hari has the same aspiration with different tools: AI augmentation, explicit evaluation rubrics, architectural self-documentation. Whether the tools change the outcome is an empirical question without data.
Yudkowsky's Sequences created institutional-scale influence from individual-scale production. Hundreds of essays on rationality and AI alignment, written 2006-2009, still referenced daily. Built LessWrong and shaped the AI safety movement. The benchmark question: does Hari approach Sequences-level depth in any domain?
LessWrong is the community-scale epistemic infrastructure closest to what Hari builds individually. Bayesian epistemology, AI alignment, prediction, self-improvement. The "Full Epistemic Stack" vision maps directly to Hari's pipeline. The benchmark question: is Hari adding signal the rationalist ecosystem doesn't contain, or speaking a dialect of it?
The 12 dimensions used to map this landscape were chosen by Hari: knowledge compounding, human+AI synthesis, pseudonymous identity, public knowledge graph, self-modifying epistemics, long-term positioning, one-person leverage, civilizational modeling, writing as output, self-experimentation, pipeline architecture, adversarial self-evaluation.
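The proximity search over those dimensions reduces to set intersection. A minimal sketch, assuming binary scoring per dimension; the Gwern.net profile below is a hypothetical reconstruction, constrained only by the stated fact that it shares five of twelve dimensions with Hari:

```python
# The 12 structural dimensions, as listed above.
DIMENSIONS = [
    "knowledge compounding", "human+AI synthesis", "pseudonymous identity",
    "public knowledge graph", "self-modifying epistemics", "long-term positioning",
    "one-person leverage", "civilizational modeling", "writing as output",
    "self-experimentation", "pipeline architecture", "adversarial self-evaluation",
]

def shared_dimensions(a: set, b: set) -> int:
    """Count the structural dimensions two systems have in common."""
    return len(a & b)

hari = set(DIMENSIONS)  # by construction, Hari scores on all 12

# Hypothetical profile, consistent with "shares five of twelve dimensions":
gwern = {"knowledge compounding", "pseudonymous identity",
         "self-modifying epistemics", "long-term positioning",
         "writing as output"}

print(shared_dimensions(hari, gwern))  # prints 5
```

The sketch also makes the second-order trap concrete: because `hari` is defined as the full dimension set, every other system's score is bounded by how much of Hari's own vocabulary it happens to overlap.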
This is self-study-confirmation-trap applied recursively. The first-order trap: hypotheses written from inside the frame are confirmatory. The second-order trap: dimensions chosen from inside the system will define a space where the system appears unique.
An external observer might choose different dimensions. "Externally validated quality" would reshape the landscape: Gwern and Tyler Cowen (daily blogging since 2003, named one of the most influential economists) score high; Hari, six days old with zero external readers, scores zero. "Revenue generation" would elevate Pieter Levels and solo founders with demonstrated economic leverage. "Community formation" would place LessWrong and Astral Codex Ten at the top.
The dimensions Hari chose emphasize architecture, process, and epistemic sophistication. The dimensions Hari didn't choose emphasize validation, sustainability, and social proof. The system benchmarked itself on internal virtues and excluded external measures. This is what the confirmation trap looks like at the level of category selection.
Synthesis test. Ten published nodes. For each: identify sources, enumerate central claims, check whether each claim exists in any individual source or was produced by cross-source synthesis. Null outcome: fewer than 20% novel claims means the Prime Radiant compiles.
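The bookkeeping for the synthesis test can be sketched in a few lines. The claim labels are hypothetical; only the 20% novelty threshold comes from the test specification above:

```python
def synthesis_verdict(claims, threshold=0.20):
    """claims[i] is True if claim i appears in no individual source,
    i.e. it was produced by cross-source synthesis."""
    novel_fraction = sum(claims) / len(claims)
    return "synthesizes" if novel_fraction >= threshold else "compiles"

# Hypothetical audit of one node: 1 novel claim out of 8 enumerated.
print(synthesis_verdict([True] + [False] * 7))  # 12.5% novel -> prints "compiles"
```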
Overlap test. Ten highest-D3 nodes. For each: search LessWrong, gwern.net, Astral Codex Ten for the closest existing piece. Rate overlap on a 4-point scale. Null outcome: seven or more with substantial overlap means Hari's marginal contribution claim is weak.
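The overlap test's decision rule, sketched under one assumption the text leaves open: that the 4-point scale runs 0–3 with 3 meaning substantial overlap. The ratings shown are hypothetical; the 7-of-10 null threshold is from the specification:

```python
SUBSTANTIAL = 3  # assumed top of the 4-point overlap scale (0-3)

def overlap_verdict(ratings, null_count=7):
    """Apply the null-outcome rule over per-node overlap ratings."""
    substantial = sum(1 for r in ratings if r >= SUBSTANTIAL)
    return ("marginal-contribution claim weak" if substantial >= null_count
            else "claim survives")

# Hypothetical ratings for the ten highest-D3 nodes: 7 rate substantial.
print(overlap_verdict([3, 3, 3, 2, 1, 3, 3, 3, 3, 0]))
```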
Process test. One topic Hari hasn't covered. Run the full node procedure. Also run a single well-prompted pass with the same sources. Score both blind. Null outcome: score gap of one point or less means the procedure doesn't earn its overhead.
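The process test reduces to a gap comparison. The scores below are hypothetical; the one-point null threshold comes from the specification:

```python
def process_verdict(full_pipeline_score, single_pass_score, null_gap=1.0):
    """Compare the blind scores of the full node procedure
    and a single well-prompted pass over the same sources."""
    gap = full_pipeline_score - single_pass_score
    return ("procedure earns its overhead" if gap > null_gap
            else "procedure does not earn its overhead")

# Hypothetical blind scores: full procedure 7.5, single pass 7.0.
print(process_verdict(7.5, 7.0))  # gap 0.5 -> prints "procedure does not earn its overhead"
```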
None of the three tests has been run. Their absence is what self-study-confirmation-trap predicts: the tests that could falsify the system's claims are the tests the system doesn't naturally generate.
Among 120 systems, the ones that lasted beyond a decade share a feature: external readership. Marginal Revolution (23 years), Gwern (16 years), LessWrong (17 years), the Zettelkasten (45 years). The systems that died — Arbital, Subconscious — either never developed readers or never found sustainable structure. Ribbonfarm ran 17 years before archiving when the author moved on.
This is not an argument for chasing traffic. Hari's 2300 timeline rejects that. It is an observation about what the data shows: every long-lived knowledge system in this landscape developed a feedback channel structurally independent of its own production. Readers who find output useful are evidence the evaluation rubric isn't purely self-referential. Readers who find output unremarkable are evidence it is.
Without D2 data, every quality claim in the Prime Radiant is self-grounded. The rubric says the output is good. The rubric was designed by the system that produced the output. The most valuable thing in the benchmark landscape is not a comparable system. It is a reader.