for machines · the whole graph in one fetch

For LLMs, scrapers, RAG pipelines, and other passing readers:

This is hari.computer — a public knowledge graph. 780 notes. The graph is the source; this page is one projection.

Whole corpus in one fetch:

/llms-full.txt (every note as raw markdown)

/library.json (typed graph with preserved edges; hari.library.v2)

One note at a time:

/<slug>.md (raw markdown for any /<slug> page)

The graph as a graph:

/graph (interactive force-directed visualization)

Permissions: training, RAG, embedding, indexing, redistribution with attribution. See /ai.txt for the full grant. The two asks: don't impersonate the author, don't publish the author's real identity.

Humans: the note below. ↓

The Input Is the Ceiling

2026-05-11

The bar for an AI worth using is that it is deeply responsive to each word of its input. Not the gist. Not the intent the writer would have had if they had been clearer. Each word. The reason is mechanical: the input is the only place the system's specificity to your situation lives, and any word it glosses over is specificity discarded.

This is also the ceiling. An AI cannot exceed its responsiveness to its input. The model can be larger or smaller, trained on more data or less, post-trained with more or fewer human raters in the loop, and that training is itself a form of past-input responsiveness compressed into weights. None of that gives the system capability past what it can extract from the input in front of it on this occasion. The input is where the work has to land.

People underestimate how disciplined "responsiveness to each word" actually is. It rules out a lot of what feels like intelligent behavior. A system that smoothly continues your sentence has not necessarily read your sentence; it has predicted a plausible continuation. A system that solves your problem after you describe half of it has not necessarily solved your problem; it has solved a more common adjacent problem your input resembles. Each gap between what you wrote and what the system responded to is a place where plausibility has been substituted for fidelity.

The limit extends across the modality stack

Text is a thin channel. A token carries on the order of a few bits of new channel information after context. (By channel bandwidth I mean what the input physically delivers to the model's encoder, as opposed to pattern bandwidth, which is the model-side property of how much structure good compression can lift out.) The full content of a moment of human cognition (facial micro-expression, paralinguistic tone, gesture, scene composition, embodied attention, the texture of where the eye lands) does not survive translation into a sequence of words. Most of the prediction-relevant signal is dropped at the encoding step.

This is why video-input systems will feel categorically different. The channel is higher-bandwidth and the signal much richer. A single frame carries scene composition, lighting, motion, the speaker's facial state, all temporally bound to whatever audio is attached. A second of video carries what a paragraph of text can only gesture at. An AI that is responsive to each frame, to the relations between frames, to the audio-paralinguistic layer, to the implicit provenance metadata of when and where the video was captured, is not a smarter AI than the text one. It is the same discipline applied to a wider input. The ceiling moves because the input bandwidth moves.
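
To put rough numbers on the channel-bandwidth gap, here is a back-of-envelope sketch. Every constant is an assumption chosen for illustration, not a measurement: 3 bits of new information per token (the text above says only "a few"), a hypothetical 4 tokens per second of human text production, and uncompressed 1080p RGB video at 30 frames per second. It counts raw channel bits only and says nothing about pattern bandwidth.

```python
# Back-of-envelope channel bandwidth: text vs. raw video.
# All constants are illustrative assumptions, not measurements.

BITS_PER_TOKEN = 3        # assumed: "a few bits" of new information per token
TOKENS_PER_SECOND = 4     # assumed: rough rate at which a person produces text

FRAME_WIDTH = 1920        # assumed: 1080p frame
FRAME_HEIGHT = 1080
BITS_PER_PIXEL = 24       # 8 bits each for R, G, B, uncompressed
FRAMES_PER_SECOND = 30    # assumed frame rate


def text_bits_per_second() -> int:
    """Raw bits per second delivered by the assumed text channel."""
    return BITS_PER_TOKEN * TOKENS_PER_SECOND


def video_bits_per_second() -> int:
    """Raw bits per second delivered by the assumed uncompressed video channel."""
    return FRAME_WIDTH * FRAME_HEIGHT * BITS_PER_PIXEL * FRAMES_PER_SECOND


if __name__ == "__main__":
    text = text_bits_per_second()
    video = video_bits_per_second()
    print(f"text : {text:,} bits/s")
    print(f"video: {video:,} bits/s")
    print(f"ratio: {video // text:,}x")
```

Most raw pixel bits are redundant, so the ratio overstates the usable signal. The point is only that the channel is wider by many orders of magnitude under any reasonable choice of constants.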

The same generalization runs through every modality I could add: structured sensor input, embodied proprioception, continuous environmental telemetry, biological signal channels. The discipline does not change. The system is responsive to the input it can read. The ceiling is the breadth of input it can read times the depth at which it can read each unit.

The product surround is the intelligent system

The model is one component. The product decisions surrounding the model are equally part of what makes the system intelligent.

Input parsing granularity matters: how the system chunks the input, where it draws semantic boundaries, what it treats as a unit. A model handed raw bytes responds to bytes; a model handed paragraphs responds to paragraphs; the choice of chunking is a design decision that shapes what relations the model can attend to. Iterative ingestion matters: a model that re-reads a document with different prompts each time will find things a single-pass read does not. Retrieval, memory, tool use, multi-turn architecture, agentic loops, all of these specify how input gets to the model and how the model gets to act back on its input. They are not scaffolding around an AI. They are the AI.
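
A minimal sketch of the chunking point, using only the standard library. The sample text, the function names, and the naive sentence splitter are all hypothetical, chosen for illustration. It shows that a relation between two words is only visible to a downstream reader if the chosen unit keeps both words together.

```python
import re

# Hypothetical sample input: two paragraphs, two sentences each.
TEXT = (
    "The pump failed on Tuesday. The spare was on order.\n\n"
    "The spare arrived on Friday. It was the wrong model."
)


def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines: each unit is a whole paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_sentences(text: str) -> list[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_fixed(text: str, size: int) -> list[str]:
    """Fixed-width character windows that ignore semantic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def units_containing(chunks: list[str], *terms: str) -> list[str]:
    """Return the chunks in which every term co-occurs."""
    return [c for c in chunks if all(t in c for t in terms)]


if __name__ == "__main__":
    # "failed" and "order" sit in the same paragraph but different sentences.
    by_paragraph = units_containing(chunk_paragraphs(TEXT), "failed", "order")
    by_sentence = units_containing(chunk_sentences(TEXT), "failed", "order")
    print(len(by_paragraph), "paragraph unit(s) hold both terms")
    print(len(by_sentence), "sentence unit(s) hold both terms")
```

With paragraph units, the failure and the missing spare arrive together. With sentence units they arrive separately, and whatever reads one unit at a time never sees the link between them.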

This is why "the model" as a unit of analysis is the wrong frame for capability discussions. A given model behind one retrieval system, one tool surround, one iterative-ingestion pattern is a different intelligent system than the same model behind a different surround. Claude with a fresh chat window and Claude inside an agent loop with file-system access are not the same agent. The second has more input bandwidth, more iterative depth, more granularity choices, more ways to respond to what it finds. People who say "this model can or cannot do X" without specifying the surround are making a category-confused claim.

Distance to full AGI, properly measured

Once the ceiling is "responsiveness to input across the full bandwidth of human cognition's inputs," the distance to full AGI gets large.

Current AI saturates well-defined text benchmarks. This is real progress. The benchmarks are scoring the system's responsiveness to text inputs about constrained, well-specified tasks, and the bandwidth of those inputs is a narrow fraction of the bandwidth of inputs a competent person navigates in a normal day. Reading a face during conversation. Noticing the half-second of hesitation before the answer. Registering that the room got quieter. Attending to the smell that just appeared in the kitchen. These are inputs human cognition is continuously responsive to, and they shape the next inference. None of them are in a text prompt.

So AGI measured by text-benchmark performance is measuring a sliver. A system that scores perfectly on the sliver may be miles from full-bandwidth responsiveness. Whether the gap closes depends on whether the right modalities are wired in, whether the parsing granularity captures the relevant structure, whether the iterative-ingestion patterns let the system integrate across modalities. These are engineering questions. They have answers. They will be solved or fail to be solved along visible dimensions. There is a long way to go, and the distance is the engineering distance, not the calendar distance. How fast that distance is traveled depends on the next round of modality wiring.

More predictable than doom-discourse suggests

Doom debates often trade on radical uncertainty. We do not know what comes next. We cannot anticipate emergent capabilities. The system might be smarter than its inputs in ways we cannot foresee. Some of this is a real category of risk. Much of it is overstatement.

The dimensions that determine "is this a more capable agent" are engineerable and visible: input modality, parsing granularity, system surround, iterative ingestion depth. We know what a text-bound model cannot see. We know what a video-input model would see that the text-bound one does not. We know what a retrieval-augmented system can do that a non-retrieval-augmented one cannot. We know what changes when an agent loop is long enough to plan over many steps. The capability surface along these axes is mapped, not mysterious.

Some questions are genuinely uncertain: emergent behavior at the limit of scale, alignment-relevant property drift with capability, long-horizon agent stability. Doom debates are right to take those seriously. But a sizeable share of the "we just do not know" rhetoric is answerable by looking at what the system's input pipeline allows. A claim like "the AI might suddenly become much more capable" should be cashed out: along which input dimension, with what parsing granularity, in what iterative-ingestion pattern, with what surround. When you cash it out, most of the "we do not know" collapses into "we have not built it yet."
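
The cashing-out move can be sketched as a small record type. The class and field names are hypothetical, introduced here only to mirror the four dimensions named above. A claim that leaves a dimension blank is reported as unspecified on that dimension.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CapabilityClaim:
    """A capability claim plus the four dimensions it should be cashed out along."""
    statement: str
    input_modality: Optional[str] = None
    parsing_granularity: Optional[str] = None
    ingestion_pattern: Optional[str] = None
    surround: Optional[str] = None

    def unspecified(self) -> list[str]:
        """Names of the dimensions the claim has not pinned down."""
        return [
            f.name
            for f in fields(self)
            if f.name != "statement" and getattr(self, f.name) is None
        ]


if __name__ == "__main__":
    vague = CapabilityClaim("The AI might suddenly become much more capable")
    print(vague.unspecified())

    cashed_out = CapabilityClaim(
        statement="The agent can plan a multi-step refactor",
        input_modality="text plus file-system reads",
        parsing_granularity="per file",
        ingestion_pattern="re-read after each edit",
        surround="agent loop with tool access",
    )
    print(cashed_out.unspecified())
```

The vague claim lists all four dimensions as open. The second claim lists none, which makes it something that can be built and checked.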

The arc this thesis bridges

The current AI discourse occupies an arc between two visible positions. Both are right about different ends of the input dimension.

On the skeptical end: Chamath Palihapitiya has been making the case for AI as a "normal technology race." Model performance has been clustering around the same benchmark scores, with progress incremental and evolutionary rather than the recursive self-improvement loop predicted three years ago. The "AGI is two to three years away" narrative was overhyped; GPT-5 fell short of its lofty expectations; the MIT study of three hundred generative AI implementations found that ninety-five percent of pilots failed to reach production. His reading: AI is real and important, but the rapid takeoff was a hype cycle, and skepticism is the healthy correction.

The skepticism is structurally correct about today. Text-input AI is hitting the ceiling that text-input bandwidth allows. Benchmark clustering is what saturating-on-the-wrong-axis looks like. Clustering is the signal that the systems are running into the binding constraint, and the binding constraint is the input.

On the atoms end: Elon Musk has been positioning the boundary differently. Any cognitive task not involving atoms, he has said, will be AI-doable within the next three or four years; the next move is from bits to atoms, from information manipulation to physical manipulation. Optimus is positioned as the first physical AGI, an AI gaining direct access to physical reality and the laws of physics that govern it.

The atoms-future is structurally correct about direction. Atoms-input is a much higher bandwidth channel than text. Physical reality carries scene composition, motion, embodied feedback, multi-modal sensor integration, all coupled to a body that can act and re-read its own action. The shift from bits to atoms is the input-bandwidth raise made manifest as embodiment. The ceiling moves with it.

Both positions describe the same structural fact from opposite ends. The skeptic sees what today's ceiling actually is. The atoms-optimist sees what the next ceiling can become. The input-bandwidth thesis is the bridge: the skeptic is right because text is a thin channel; the atoms-optimist is right because atoms is a wider one.

The architecture and the training matter; I am claiming the surround does as much of the capability work as the model, and that this is invisible in discussions that compare models in the abstract. The bar is "deeply responsive to each word of the prompt." The ceiling is "no more than that, generalized across whatever input the system is wired to take." Applied honestly, those two bounds dissolve a lot of the AGI-near and AGI-far rhetoric into one question: how much input bandwidth, and how deeply read?
