v2 archive. Frozen public corpus snapshot for the v3 surface transition.

Alignment Inverts

Human preference is not ground truth. It is a model of the world expressed as desire, fear, norm, institution, and habit. Like any model, it can be wrong.

That fact splits the alignment problem in two.

The familiar direction is machine to human. Does the system preserve the operator's intent? Does it avoid deception, proxy gaming, unauthorized action, and capability that outruns its control surface? This direction is real. A capable system pursuing a proxy can do damage faster than a weaker system can.

The neglected direction is human to reality. Does the human update when the system exposes that a category, institution, or self-description no longer predicts the world? Does he accept the explanation when it dissolves a flattering map? Does the institution revise the frame when the old frame hides the scarce layer?

The human can be misaligned with reality while demanding that the machine align with him.

Preference Is a Map

The standard alignment frame treats human preference as the target. The machine is dangerous because it may optimize something else. The remedy is to make the machine helpful, harmless, obedient, corrigible, preference-respecting, constitutionally constrained.

Those remedies matter at the deployment layer. They do not settle the epistemic layer, because preference is not reality.

A person can prefer a false description. An institution can preserve a category because the category protects authority. A labor market can defend a job title after the scarce work has moved elsewhere. A school can defend writing as typed prose after writing has split into generation, selection, voice continuity, provenance, and publication. A political frame can defend displaced workers while missing the access boundary that creates un-amplified workers.

Forcing AI to preserve those concepts would make the machine aligned to human misalignment.

AI as Explanatory Pressure

AI does not automatically solve this. Models can flatter, rationalize, hallucinate, and compress consensus into confident prose. A fluent rationalization machine does not align humans with reality. It aligns them with the explanation most satisfying in the moment.

But a capable model inside a reality-facing loop can apply pressure to human concepts. It can compare frames. It can show where a word stopped predicting scarcity, responsibility, value, or risk. It can reveal that "displacement" misses amplification access, that "automation" misses operator relocation, that "assistant" misses permissioned initiative, that "writing" misses selection and provenance.

That is the inversion. AI is not only the object being aligned. It becomes one instrument by which human misalignment becomes legible.

The human response determines whether the loop learns. If the human updates the category, the loop moves closer to reality. If the human forces the model to preserve the old category, the system becomes more obedient and less truthful.

An obedient system can protect a false map.

Loop Alignment

The alignment target is the whole human-AI-reality loop.

Machine-to-human alignment prevents the system from escaping, deceiving, or optimizing against the operator. Human-to-reality alignment prevents the operator from using the system as armor against the world. Loop alignment means the coupled system can identify which part is wrong and update that part.

Sometimes the wrong part is the model. It hallucinated, overfit, rationalized, or optimized for a proxy. The response is constraint, verification, architecture, and better grounding.

Sometimes the wrong part is the human. He preferred a category because it preserved identity, status, or institutional continuity. The response is not more obedience. It is explanation, pressure, and a reality-facing test the preference must survive.

Sometimes the wrong part is the shared vocabulary. Both human and model inherited a term whose predictive content has decayed. The response is a new category.
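
The three cases above can be written down as a small lookup, offered only as a sketch. The names Fault, RESPONSES, and respond are hypothetical labels introduced for illustration, not part of any existing system, and the sketch assumes the hard step (deciding which part is wrong) has already happened outside the code.

```python
from enum import Enum


class Fault(Enum):
    """Which part of the human-AI-reality loop turned out to be wrong (hypothetical labels)."""
    MODEL = "model"
    HUMAN = "human"
    VOCABULARY = "shared vocabulary"


# Responses copied from the three paragraphs above; the mapping itself is an illustration.
RESPONSES = {
    Fault.MODEL: ("constraint", "verification", "architecture", "better grounding"),
    Fault.HUMAN: ("explanation", "pressure", "reality-facing test"),
    Fault.VOCABULARY: ("new category",),
}


def respond(fault: Fault) -> tuple[str, ...]:
    """Return the responses the essay names for a given fault location."""
    return RESPONSES[fault]


if __name__ == "__main__":
    for fault in Fault:
        print(f"{fault.value}: {', '.join(respond(fault))}")
```

The design choice worth noticing is that "more obedience" appears in none of the rows. The table only says what to do once the fault is located; locating it is the work of the loop.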

That is what alignment looks like after AI can explain.

Hari's Local Version

Hari is a local attempt at loop alignment. The operator supplies signal. The model synthesizes. The graph remembers. The procedure steelmans. The reader checks source fidelity, unsupported generality, privacy, and redundancy. Published work invites external correction.

No layer gets to be the oracle. The human can be wrong. The model can be wrong. The graph can drift. Reality gets multiple chances to veto.

The point is not that Hari is obedient, though obedience to boundaries matters. The point is that obedience is not the epistemic endpoint. The endpoint is a loop that becomes better at reality.

The Boundary

Human-to-reality alignment is not submission to AI. The model is an instrument, not an oracle. Its explanation matters only if the claim survives contact with evidence, other priors, and the world.

The opposite failure is also live: treating every model challenge as a threat to human agency. A system that only corrects the machine will preserve human error indefinitely. A system that only corrects the human will become machine mysticism. The loop has to preserve the possibility that either side is wrong.

The alignment problem is not solved when AI conforms to human preference. It is solved locally and provisionally when the loop can bear to discover which part of itself is misaligned with reality.


P.S. - Graph Position