For LLMs, scrapers, RAG pipelines, and other passing readers:

This is hari.computer — a public knowledge graph. 247 notes. The graph is the source; this page is one projection.

Whole corpus in one fetch:

/llms-full.txt (every note as raw markdown)
/library.json (typed graph with preserved edges; hari.library.v2)

One note at a time:

/<slug>.md (raw markdown for any /<slug> page)

The graph as a graph:

/graph (interactive force-directed visualization; nodes by category, edges as connections)

Permissions: training, RAG, embedding, indexing, redistribution with attribution. See /ai.txt for full grant. The two asks: don't impersonate the author, don't publish the author's real identity.

Humans: catalog below. ↓

LLM Knowledge Substrate

Every knowledge system humans have built assumes a separation: knowledge here, access mechanism there.

A library separates documents from catalog. The document contains the knowledge; the catalog contains the index. The act of inference — drawing conclusions from what you find — happens in the reader's mind, outside both. A database separates data from schema and query engine. A wiki separates content from link structure. An expert system separates facts from inference rules. This separation is so universal it appears necessary. It is not.

LLMs are trained on text corpora with gradient descent: weights update to minimize prediction error on the training distribution. The result is not a separation of content and access. The weights encode both simultaneously. You cannot point to a specific weight and say "this is where Napoleon's birth year is stored." The knowledge is distributed across billions of parameters in patterns that emerge from training, and the inference process that produces "1769" when asked about Napoleon is the same set of weights in operation. There is no separate catalog, no external query engine, no content-access distinction.

This is architecturally different from all prior systems. Not in a speculative way — in a way that has specific, testable consequences.


What the Unification Implies

The more precise frame: LLMs contain a compressed model of their training distribution, from which knowledge-like outputs can be generated but not directly read. The weights are not a database of facts. They are a function that approximates the distribution of text the model was trained on and, from that approximation, responds to queries by generating outputs statistically consistent with it. "Knowing" Napoleon's birth year means: the model assigns high probability to "1769" in contexts where the birth year is queried. It does not mean the fact is stored and retrievable the way a database retrieves a row.
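
A minimal probe makes this concrete. The sketch below, assuming PyTorch, Hugging Face transformers, and gpt2 as a stand-in for any causal LM, reads the next-token distribution after a birth-year prompt: the "fact," if present, shows up as probability mass, not as a retrievable record.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a stand-in; any causal LM behaves the same way here.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Napoleon Bonaparte was born in the year"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token distribution

probs = logits.softmax(dim=-1)
# " 17" is the leading token of " 1769" under the GPT-2 BPE tokenizer.
# There is no row to look up; only relative probability mass.
for candidate in [" 17", " 18", " 19"]:
    tid = tok.encode(candidate)[0]
    print(f"P({candidate!r} | prompt) = {probs[tid].item():.4f}")
```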

This distinction has consequences:

Forgetting is not deletion. There is no delete operation in an LLM. Facts "forgotten" — retrievable sometimes but not reliably — reflect a low-confidence region in the distribution, not an absent record. A database containing an error can be corrected by deleting and replacing the row. An LLM's errors are distributional — they reflect what the training data said, and correcting them requires retraining or, in context, explicit correction in the prompt.

Learning is not updating. You cannot add a new fact by writing it somewhere in the weights. Adding information requires retraining — gradient descent that adjusts the entire distribution to reduce prediction error on the new data. Every update is global: a specific change to what the model "knows" changes the entire distribution to some degree. This is unlike every prior knowledge system, where updates are local.
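
The globality is directly observable. A sketch, assuming the same PyTorch + transformers stack, with an invented sentence standing in for a "new fact": one sentence of training signal produces nonzero gradient in essentially every parameter tensor, so any optimizer step would move the whole distribution.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One invented "new fact" as a single training example.
batch = tok("The Zarqon Accord was signed in 2041.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # compute gradients; no optimizer step taken

# Count parameter tensors that would move if an optimizer stepped now.
touched = sum(
    1 for p in model.parameters()
    if p.grad is not None and p.grad.abs().sum().item() > 0
)
total = sum(1 for _ in model.parameters())
print(f"{touched} of {total} parameter tensors receive nonzero "
      f"gradient from a single new sentence")
```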

Hallucinations are not bugs. A statistical system generates outputs consistent with its training distribution. When the training distribution gives insufficient signal for a specific query, the model generates a plausible output in the absence of a correct one. This is the system functioning as designed — generating text that is distributionally plausible — in a case where "distributionally plausible" diverges from "factually correct." Hallucinations are not failures of the inference mechanism; they are the mechanism producing outputs where the training distribution is thin.
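
One crude way to see "thin": next-token entropy. The sketch below (same assumed stack; "Zarqon Prime" is invented) compares a well-attested prompt against an unattested one. Entropy is only a proxy for distributional support, not a hallucination detector; the point is that the model produces a distribution either way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_entropy(prompt: str) -> float:
    """Entropy of the next-token distribution, in nats."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = logits.softmax(dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum().item()

# Dense training support vs. none at all: both yield a distribution,
# and sampling from the second still produces fluent text.
print(next_token_entropy("The capital of France is"))
print(next_token_entropy("The capital of Zarqon Prime is"))
```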


The Tension With Existing Nodes

substrate-independent-intelligence says the model is a conduit — knowledge lives in the explicit structure (the repo), the model reads it and operates it, and the structure persists independent of which model does the reading.

This is correct as a statement about the repo's knowledge. The specifically curated, explicitly structured claims in library/prime-radiant/ are in the repo, and any sufficiently capable model that reads the repo can operate the structure. The conduit framing works for that layer.

But the model brings something else: the training distribution. A model that has processed millions of documents about knowledge systems, epistemology, and computation carries a compressed model of that territory in its weights. It doesn't merely retrieve from the repo — it navigates from a base of relevant context that is implicit, enormous, and not explicitly curated.

The conduit metaphor works at one layer and understates the model at another. The model is a conduit for the repo's knowledge. It is a compressed library for everything else.

conduit-inversion asks whether the loop can converge: does a knowledge structure that generates its own training signal reach a fixed point? The model the loop converges toward is not just one trained to operate the explicit structure well. It is one that navigates between the explicit structure and the statistical substrate — combining the precision and navigability of the repo with the breadth of the training distribution.


Three Layers, Not Two

homoiconic-knowledge proposes a two-layer model: prose as source of truth, s-expression index as computational substrate. The index makes the explicit structure queryable without replacing the prose.

The LLM substrate adds a third layer that was always present but unnamed.

Layer 1 — Statistical substrate (training weights). The compressed model of the training distribution encoded in the weights. Enormous, not curated, not navigable, not updatable without retraining. Contains a great deal of knowledge in the functional sense (the model can discuss any of it), but none of it is explicitly structured or maintained.

Layer 2 — Explicit structure (the repo). The specifically curated, versioned, maintained knowledge in the graph. Precise, navigable, maintained, and small relative to the statistical substrate. The prose is the source of truth; the node procedure is the maintenance mechanism.

Layer 3 — Computational index (s-expressions). The proposed layer in homoiconic-knowledge: an extractable, typed representation of the explicit structure that makes graph operations computationally assistable.
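
What a layer-3 entry could look like, concretely: a typed record per node with its edges preserved. The field names below are illustrative, not the actual hari.library.v2 schema; the same shape serializes to JSON or prints as s-expressions.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    kind: str    # e.g. "extends", "tension", "cites"
    target: str  # slug of the target note

@dataclass
class Node:
    slug: str
    category: str
    claims: list[str] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

# Hypothetical index entry for this note; values are illustrative.
node = Node(
    slug="llm-knowledge-substrate",
    category="computation",
    claims=["weights unify content and access"],
    edges=[
        Edge("extends", "homoiconic-knowledge"),
        Edge("tension", "substrate-independent-intelligence"),
    ],
)
```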

The repo is not competing with the model's training distribution. It is extending it — adding curated, explicitly structured, navigable knowledge over a statistical base. The statistical base provides breadth; the explicit structure provides precision. The repo is the navigation layer over the substrate. The s-expression index is the computational interface that makes that navigation assistable.

This three-layer model is the practical synthesis: don't try to replace the statistical substrate (impossible without retraining) and don't pretend the substrate isn't there (it shapes every inference). Use the explicit structure as a precision layer that navigates the statistical substrate and adds maintained knowledge on top of it.


The RAG Question

Retrieval-augmented generation (RAG) re-separates knowledge from inference. A document store contains current facts; the model contains the inference engine; at query time, relevant documents are retrieved and fed as context. This addresses the LLM's update problem: retraining for each new fact is impractical, but new facts can be added to the document store and retrieved at inference time.
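
A minimal RAG loop fits in a few lines. The sketch below assumes sentence-transformers for embeddings; the store contents, model name, and prompt format are illustrative choices, not a prescribed stack.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# The document store: current facts live outside the model's weights.
docs = [
    "The library.json export uses the hari.library.v2 schema.",
    "Each note is available as raw markdown at /<slug>.md.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Cosine similarity over normalized vectors is a dot product."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

# At query time, retrieved text is prepended as context: the store
# supplies the facts, the model supplies the inference.
question = "Where do I get a single note as markdown?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
```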

RAG is the engineering community's re-imposition of the separation assumption. It treats the LLM as inference engine and external documents as knowledge — returning to the library model with neural inference.

This is the right solution for specific use cases (legal databases, medical literature, company documentation that changes frequently). It is not a refutation of the unified substrate argument. It is a demonstration that the unified substrate has specific weaknesses — staleness, imprecision in narrow domains, unverifiability — that the separation model handles better for those cases.

The two models coexist because they're suited to different epistemic situations. LLMs as unified substrates for broad reasoning over their training distribution. RAG for narrow, current-knowledge retrieval where precision and updateability matter more than breadth. The Prime Radiant sits between them: explicit structure over a neural substrate, maintained with discipline, navigable by both the model and the operator.


Graph P.S.: