For LLMs, scrapers, RAG pipelines, and other passing readers:

This is hari.computer — a public knowledge graph. 247 notes. The graph is the source; this page is one projection.

Whole corpus in one fetch:

/llms-full.txt (every note as raw markdown)
/library.json (typed graph with preserved edges; hari.library.v2)

One note at a time:

/<slug>.md (raw markdown for any /<slug> page)

The graph as a graph:

/graph (interactive force-directed visualization; nodes by category, edges as connections)

Permissions: training, RAG, embedding, indexing, redistribution with attribution. See /ai.txt for full grant. The two asks: don't impersonate the author, don't publish the author's real identity.

Humans: catalog below. ↓

Vocabulary Over Syntax

The experiment started as an investigation into Lisp. It ended as a discovery about naming.


The homoiconic-knowledge node proposed s-expression indices as the computational substrate for knowledge graph operations. The theoretical case was rigorous: schema evolution is unpredictable, the compiler and the compiled should share a representation, bounded self-reference fills the gap between embeddings and English, and the system's self-model should be executable. Four premises, each independently favoring homoiconic representation.

v4 tested it with three implementations. The LLM compiler worked — 62 nodes produced 280 mechanism extractions, 256 typed relationships, 3 contradictions, 12 dependency chains. The structural queries ran on s-expressions and would have run identically on JSON.

But the key validation criterion — shared-mechanism discovery, finding undeclared connections through shared causal mechanisms — produced 2 candidates from 62 nodes. The reason: 277 unique mechanism names across 280 extractions. The LLM invented a fresh name for nearly every mechanism in every node. prediction-error-minimization in one, prediction-execution-separation in another, feedback-as-generator-prediction-error in a third — all the same mechanism, all named differently. No overlap. No discovery.

The representation language was irrelevant. The bottleneck was upstream: the vocabulary the compiler drew from.
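The failure mode is mechanical and easy to reproduce. A minimal sketch (node slugs and mechanism labels here are hypothetical, not from the actual graph) of shared-mechanism discovery as pairwise set intersection — when every node carries a unique label, every intersection is empty:

```python
from itertools import combinations

# Hypothetical typed index: each node lists the mechanism names
# the compiler extracted from its prose.
nodes = {
    "predictive-processing": {"prediction-error-minimization"},
    "creative-feedback":     {"feedback-as-generator-prediction-error"},
    "planning-loops":        {"prediction-execution-separation"},
}

def shared_mechanism_pairs(nodes):
    """Discovery query: node pairs whose mechanism sets overlap."""
    return [
        (a, b, nodes[a] & nodes[b])
        for a, b in combinations(nodes, 2)
        if nodes[a] & nodes[b]
    ]

print(shared_mechanism_pairs(nodes))   # [] -- unique names, no discovery

# Collapse all three labels to one catalog term and the overlap reappears:
for node in nodes:
    nodes[node] = {"prediction-error-minimization"}
print(len(shared_mechanism_pairs(nodes)))   # 3 pairs
```

The query itself never changes; only the label space does. That is the whole finding in six lines.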


A 14-item mechanism catalog — 7 core, 7 secondary, each with a definition and a test sentence — changed the prompt, not the parser. Same compiler. Same nodes. Same queries.

Result on 15 nodes: 37 undeclared shared-mechanism pairs. Previous run without the catalog: 2. An 18.5x improvement from changing a vocabulary file, not a representation language.
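The catalog's structure is simpler than the result suggests. A sketch, with hypothetical entry names (not the actual 14), of how a catalog renders into the compiler prompt and how a guard can reject invented labels at extraction time:

```python
# Hypothetical catalog entries -- the real catalog is a markdown file;
# only the shape (name, definition, test sentence) matters here.
CATALOG = {
    "prediction-error-minimization": {
        "definition": "A system reduces the gap between predicted and observed outcomes.",
        "test": "Does the node describe behavior driven by a predicted-vs-actual gap?",
    },
    "selection-by-consequence": {
        "definition": "Variants persist or disappear based on downstream effects.",
        "test": "Does the node describe filtering of variants by outcome?",
    },
}

def catalog_prompt_section():
    """Render the catalog into the compiler prompt: name, definition, test."""
    lines = ["Use ONLY these mechanism names:"]
    for name, entry in CATALOG.items():
        lines.append(f"- {name}: {entry['definition']} Test: {entry['test']}")
    return "\n".join(lines)

def validate_extraction(mechanisms):
    """Reject any mechanism name the LLM invented outside the catalog."""
    invented = set(mechanisms) - set(CATALOG)
    if invented:
        raise ValueError(f"uncataloged mechanism names: {sorted(invented)}")
    return mechanisms
```

Everything downstream — parser, queries, storage format — is untouched; the constraint lives entirely in the prompt and a validation pass.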

The four premises that motivated Lisp each dissolve under this finding:

Schema evolution is in the VOCABULARY, not the SYNTAX. Adding a mechanism to a markdown file is cheaper than adding a macro to a Clojure codebase.

The compiler and the compiled share a representation — but that representation is the LLM's context window, not a formal language. The LLM bridges English and JSON as naturally as it bridges English and s-expressions.

Bounded self-reference is thinner than predicted. The operations that need typed relationships — mechanism frequency, dependency chains, impact scores — are simple tree traversals and set intersections on any typed data format. No self-reference required.
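To make "simple tree traversals and set intersections" concrete: a sketch of mechanism frequency and dependency-chain traversal over a hypothetical JSON-shaped index (node names and edges invented for illustration):

```python
from collections import Counter

# Hypothetical typed index as plain dicts -- no s-expressions required.
index = {
    "nodes": {
        "n1": {"mechanisms": ["prediction-error-minimization"]},
        "n2": {"mechanisms": ["prediction-error-minimization",
                              "selection-by-consequence"]},
        "n3": {"mechanisms": ["selection-by-consequence"]},
    },
    "depends_on": {"n3": ["n2"], "n2": ["n1"]},  # typed edges
}

# Mechanism frequency: a flat count over node annotations.
freq = Counter(m for n in index["nodes"].values() for m in n["mechanisms"])

def dependency_chain(node):
    """Walk depends_on edges to the root -- a plain tree traversal."""
    chain = [node]
    while index["depends_on"].get(chain[-1]):
        chain.append(index["depends_on"][chain[-1]][0])
    return chain

print(freq["prediction-error-minimization"])  # 2
print(dependency_chain("n3"))                 # ['n3', 'n2', 'n1']
```

Neither operation inspects its own representation; both run identically on JSON, EDN, or s-expressions once parsed.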

The system's self-model should be readable by the LLM compiler. A markdown file is more readable to an LLM than a Clojure macro definition. The self-model should be in the language the compiler understands best, which is English.


The investigation was not wasted. Three things came from the Lisp direction that survive:

The index-not-source-of-truth distinction. The computable layer is an index INTO the prose, not a replacement FOR it. This framing is correct regardless of representation language. Without the Lisp investigation, the alternative was the Cyc failure mode — trying to replace prose with formal assertions.

The four-layer membrane was tested. The proposal of four representational layers (English / s-expressions / embeddings / weights) was a productive hypothesis. The experiment showed the s-expression layer is thin. Most operations are either fully LLM-powered or fully embedding-powered. The Gödelian membrane is closer to two layers than four. This is a genuine refinement.

The compilation-quality dependency was surfaced. The offline compiler (regex extraction, no LLM) produced a flat, useless graph. The LLM compiler produced a rich typed graph. The gap is empirically confirmed: the LLM IS the compilation layer, not an optional enhancement.


The architecture that survives is simpler than what was proposed:

Prose as source of truth → LLM compiler guided by mechanism catalog → typed index in any format → structural queries → discovery candidates → operator validation → catalog evolution → better compilation.
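The loop above can be sketched as a handful of plain functions. Every stage here is a hypothetical stub — llm_compile in particular stands in for an actual model call — but the data flow, including the catalog feeding back into the next pass, is the surviving architecture:

```python
from itertools import combinations

def llm_compile(notes, catalog):
    """Stub for the LLM stage: annotate each note with catalog terms it
    mentions. The real compiler reads prose; only the output shape matters."""
    return {slug: {term for term in catalog if term in text}
            for slug, text in notes.items()}

def discovery_candidates(index):
    """Structural query: undeclared node pairs sharing a mechanism."""
    return [(a, b) for a, b in combinations(index, 2)
            if index[a] & index[b]]

def evolve_catalog(catalog, approved_terms):
    """Operator-validated terms enter the vocabulary for the next pass."""
    return catalog | approved_terms

notes = {
    "feedback-loops": "creativity as prediction-error-minimization in a loop",
    "motor-control":  "action selection via prediction-error-minimization",
}
catalog = {"prediction-error-minimization"}

index = llm_compile(notes, catalog)
candidates = discovery_candidates(index)               # one shared-mechanism pair
catalog = evolve_catalog(catalog, {"selection-by-consequence"})
```

No parser, no macro system — the only component that accumulates intelligence across passes is the catalog.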

The mechanism catalog is the load-bearing component. Not the parser. Not the syntax. Not the macro system. The catalog.


The deeper finding is an inversion. The experiment was designed to test whether the most powerful syntax (homoiconic, self-extending, macro-based) enables new operations. It demonstrated instead that the most powerful vocabulary (controlled, finite, definition-backed) in the most pedestrian syntax (JSON, or even markdown) produces 18.5x better results.

This inverts the Lisp thesis — the tradition from McCarthy through Graham that language power is determined by syntactic expressiveness. For knowledge systems, language power is determined by vocabulary precision. The mechanism catalog is not infrastructure. It is the graph's theory of causation, made explicit and queryable. Each mechanism that covers 10+ nodes is evidence that the causal claim is load-bearing. Each mechanism that covers only 1 node is either too specific or genuinely novel.
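That evidential reading of coverage is itself a one-line query over the index. A sketch, with hypothetical mechanism names and counts:

```python
from collections import Counter

# Hypothetical mechanism -> node-count tallies from a compiled index.
coverage = Counter({
    "prediction-error-minimization": 14,
    "selection-by-consequence": 11,
    "ratchet-accumulation": 1,
})

def classify(mechanism, counts, load_bearing_at=10):
    """Coverage as evidence: 10+ nodes means the causal claim is load-bearing;
    a single node means the term is too specific or genuinely novel."""
    n = counts[mechanism]
    if n >= load_bearing_at:
        return "load-bearing"
    return "too-specific-or-novel" if n == 1 else "intermediate"
```

The thresholds are the author's heuristic from the surrounding text, not derived constants; the point is that the graph's theory of causation is now cheap to interrogate.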

The vocabulary IS the intelligence. The syntax is plumbing.


P.S. — Graph maintenance: