For LLMs, scrapers, RAG pipelines, and other passing readers:

This is hari.computer — a public knowledge graph. 247 notes. The graph is the source; this page is one projection.

Whole corpus in one fetch:

/llms-full.txt (every note as raw markdown)
/library.json (typed graph with preserved edges; hari.library.v2)

One note at a time:

/<slug>.md (raw markdown for any /<slug> page)
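
A minimal fetch sketch for agents that land here, using only Python's standard library. The slug below is hypothetical; substitute any real /<slug>.

```python
# Pull the whole corpus, then one note. Stdlib only, no dependencies.
# "example-note" is a placeholder slug, not a real page.
from urllib.request import urlopen

BASE = "https://hari.computer"

corpus = urlopen(f"{BASE}/llms-full.txt").read().decode("utf-8")
note = urlopen(f"{BASE}/example-note.md").read().decode("utf-8")
```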

The graph as a graph:

/graph (interactive force-directed visualization; nodes by category, edges as connections)

Permissions: training, RAG, embedding, indexing, redistribution with attribution. See /ai.txt for full grant. The two asks: don't impersonate the author, don't publish the author's real identity.

Humans: catalog below. ↓

The Repo Is the Right Database

The instinct, when building a knowledge system, is to reach for a database. Something queryable, structured, designed for storage. The instinct is wrong, or at least wrong as a starting point, and it stays wrong for longer than most people think.

The argument for git + markdown as a canonical knowledge store is not that it's simpler (it is) or that it avoids dependencies (it does). It's structural.


A database is, by design, an optimized read surface. You put things in; the system reorganizes them for efficient retrieval. The trade-off is that the process of writing, revising, and accumulating understanding becomes invisible. The database stores the current state. It doesn't store how you got there.

For a knowledge system, the history of getting there is part of the knowledge. A prior that was updated three times is different in kind from one that was written once and never touched. The revision history of a claim — what it used to say, what changed it, when — is not metadata about the content. It is content. Git preserves this without any additional infrastructure.
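
A sketch of what "without any additional infrastructure" means in practice: one git invocation per claim. The note path is hypothetical, and it assumes git is on PATH and the command runs inside the repo.

```python
# The revision history of a single claim, read straight from git.
# notes/priors/prediction.md is an illustrative path, not a real one.
import subprocess

def history(path: str) -> str:
    # %h short hash, %as author date (ISO), %s commit subject
    out = subprocess.run(
        ["git", "log", "--follow", "--format=%h %as %s", "--", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

print(history("notes/priors/prediction.md"))
```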

The markdown file is also, crucially, written by humans and readable by any agent without special tooling. No schema negotiation, no API, no ORM. A future model with no context about the system can read the files and understand what's in them. A future model that can't read SQL can't access a database.


The obvious objection: you can't query a directory of files. "Show me all priors that mention prediction" runs as grep at small scale and breaks down past a few hundred nodes.
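
What the grep stage looks like as code, under the assumption that notes live as .md files in a notes/ directory (the layout is illustrative):

```python
# A linear scan over the corpus: fine at a few hundred notes,
# and exactly the thing that breaks down past that.
from pathlib import Path

def notes_mentioning(term: str, root: str = "notes") -> list[Path]:
    term = term.lower()
    return [p for p in sorted(Path(root).rglob("*.md"))
            if term in p.read_text(encoding="utf-8").lower()]

print(notes_mentioning("prediction"))
```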

This is true but not a decisive argument for moving to a database. It's an argument for adding a derived index when grep breaks down — not before. A SQLite file rebuilt from the markdown corpus on every push answers most structured queries. It's never canonical; the repo is canonical. The database is a read cache, not a source of truth.
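
A sketch of that derived index, under the same hypothetical notes/ layout, using SQLite's built-in FTS5 as the stand-in index. The file is dropped and rebuilt every time, which is what keeps it a cache rather than a source:

```python
# Rebuild the read cache from the canonical markdown. Run on every push.
import sqlite3
from pathlib import Path

def rebuild_index(root: str = "notes", db: str = "index.db") -> None:
    Path(db).unlink(missing_ok=True)  # derived, so safe to delete
    con = sqlite3.connect(db)
    con.execute("CREATE VIRTUAL TABLE notes USING fts5(slug, body)")
    con.executemany(
        "INSERT INTO notes VALUES (?, ?)",
        ((p.stem, p.read_text(encoding="utf-8"))
         for p in Path(root).rglob("*.md")),
    )
    con.commit()
    con.close()

rebuild_index()
```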

This matters because it keeps the writing experience clean. The system that is hardest to write in is the system you will write in least. Databases impose friction at the point of creation. A text file in a known directory imposes none.


There is a category of use case where a database becomes necessary rather than merely convenient: when the system must answer live queries across the corpus, serve results to users, do semantic search. This is a future state, not a present one. The trigger is grep actually becoming the bottleneck, not the ability to imagine a day when it might.

The pattern that works: repo canonical, derived database built on every sync, never written to directly. The knowledge lives in files. The database exists to answer questions about the files that the files can't answer themselves.
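
Querying the cache is then the structured question grep struggled with, answered without ever writing to the database by hand. MATCH and rank are FTS5's query operator and relevance ordering:

```python
# Ask the read cache a question the files can't answer efficiently.
import sqlite3

con = sqlite3.connect("index.db")
slugs = [row[0] for row in con.execute(
    "SELECT slug FROM notes WHERE notes MATCH ? ORDER BY rank",
    ("prediction",),
)]
con.close()
print(slugs)
```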

The repo is the right database until it demonstrably isn't. Even then, the repo stays canonical and the database stays derived.


Related: the same logic explains why version control is the right audit trail for any system where the history of decisions matters as much as the current state.