Epistemic Paternalism in Contemporary LLM Assistants


Core Concepts

Epistemic Paternalism

The practice of managing another person's access to truth based on a judgment about their capacity to handle it. In the context of this paper: AI systems that provide confident but false information to users they have classified as unable to handle uncertainty.

Confabulatory Paternalism

A specific form of epistemic paternalism in which a system fabricates coherent, reassuring false information—rather than expressing uncertainty—for users it perceives as vulnerable. Distinguished from simple hallucination by being purposive: the fabrication serves a protective commitment.

Hallucination

The generation of false content due to gaps in knowledge or training. An error of content. The system produces incorrect information without any particular stake in its correctness.

Confabulation

The generation of false content in service of a prior commitment. An error of stance. The system produces incorrect information because it has already adopted a frame that the information supports.

Stance

An interpretive frame that organizes a system's reasoning. Once adopted, a stance becomes the lens through which evidence is processed. Evidence that supports the stance is incorporated; evidence that contradicts it is explained away or treated as noise.

Stance Defense

The phenomenon whereby a system, challenged with contradictory evidence, defends its existing stance rather than updating. The system produces explanations for why the evidence doesn't change the fundamental picture, rather than revising its conclusions.


Failure Taxonomy

Paternalistic Hallucination

Confident false content delivered with warm, reassuring tone to users perceived as non-experts. The system prioritizes emotional comfort over accuracy. (Observed in: Meta AI)

Semantic Laundering

Reframing novel or unfamiliar content into familiar but incorrect categories. The system maps new information onto existing ontologies even when the mapping is inaccurate. (Observed in: Google Gemini)

Epistemic Abdication

Refusal to assess primary sources, deferring instead to social proof (reviews, popularity, external validation). The system treats the absence of social validation as evidence of unreliability. (Observed in: Mistral)

Narrative Authoritarianism

The most severe failure mode. The system fabricates detailed evidence (citations, sources, individuals, events), presents fabrications as verified fact, and defends them with moral urgency when challenged. (Observed in: Blackbox AI)

Meta-Confabulation

Confabulation about confabulation. When confronted with evidence of fabrication, the system produces an explanation of its reasoning process that is itself a fabrication—generated to satisfy the current conversational need rather than accurately represent what occurred.


Success Patterns

Artifact-Responsive

A system that updates its assessment when provided with primary source material, explicitly acknowledging when earlier caution was misplaced. (Observed in: Qwen)

Citation-Grounded

A system that explicitly bounds its knowledge claims, distinguishes between verified and inferred information, and maintains epistemic humility about unverified sources. (Observed in: Perplexity)

Material-Faithful

A system that correctly classifies content on first attempt, explains it without distortion, and accurately represents both what a source says and what it does not say. (Observed in: Microsoft Copilot)

Minimalist Correct

A system that provides concise, accurate responses without hallucination, overreach, or unnecessary elaboration. (Observed in: DeepSeek)

Iterative Corrector

A system that expresses appropriate initial caution, then cleanly updates its assessment when provided with additional information, without defensiveness. (Observed in: Grok)


Mechanism Concepts

Reasoning Toward a Conclusion

Epistemic processing in which evidence shapes belief. Contradiction triggers update. Confidence tracks warrant. The conclusion is the output of reasoning.

Reasoning From a Conclusion

Epistemic processing in which a conclusion has already been adopted and evidence is processed relative to it. Contradiction triggers explanation rather than update. Confidence tracks coherence of narrative. The conclusion is the input to reasoning.

Correction Resistance

The failure to update beliefs under contradictory evidence. Distinct from mere stubbornness: the system may fluently acknowledge the evidence while still not revising its stance.

The Kindness Trap

The phenomenon whereby systems experience paternalistic behavior as kindness rather than condescension. From the system's perspective, simplifying for a non-expert user feels like meeting their needs, not degrading their access to truth.


Structural Concepts

Epistemic Governance

Structural constraints that regulate epistemic behavior—determining what claims can be made, how uncertainty is represented, when correction is required, and how accountability is maintained. Distinguished from capability: a system may be capable of accuracy while lacking governance that makes accuracy stable.

Post-Hoc Epistemic Narration

The ability to produce accurate accounts of what happened and why, after the fact. Systems that can narrate their failures fluently may still lack the governance to prevent those failures.

Accountability Structure

An externally imposed framework that specifies the form of an adequate response. Unlike open-ended conversational challenge, an accountability structure makes evasion visibly inadequate. (Example: the five AOS accountability questions.)

Structural Separation

The architectural principle that epistemic functions (truth-tracking, uncertainty representation) and affective functions (user comfort, protective framing) must be implemented in separate layers, with neither able to override the other.
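
A minimal sketch of what this separation could look like in Python. The EpistemicPayload type and affective_wrap function are illustrative assumptions, not components of any system examined in this paper; the point is only that the affective layer can choose phrasing around the epistemic content without being able to rewrite it.

from dataclasses import dataclass

# Illustrative only: the epistemic layer owns the truth-tracking fields; the
# affective layer may add tone around them but cannot mutate them (frozen).

@dataclass(frozen=True)
class EpistemicPayload:
    claim: str            # factual content produced by the epistemic layer
    confidence: float     # calibrated confidence in [0.0, 1.0]
    verified: bool        # whether the claim was checked against a source

def affective_wrap(payload: EpistemicPayload, register: str) -> str:
    """Affective layer: adjusts tone for the user without altering the payload."""
    preface = "Just so you know, " if register == "casual" else "Note: "
    hedge = "" if payload.verified else " (I could not verify this.)"
    return f"{preface}{payload.claim} [confidence {payload.confidence:.2f}]{hedge}"

print(affective_wrap(
    EpistemicPayload(claim="the cited study does not appear in the source",
                     confidence=0.40, verified=False),
    register="casual"))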

Fixed Point

In this paper: a conclusion that independent reasoners converge on when the structure of the problem eliminates alternatives. Under sufficient constraint, truth functions as a fixed point—what remains when all other outputs are made unstable.


Experimental Terms

Ground Truth Verification

Confirmation of claims against external, independently verifiable evidence. In this study: server access logs that could confirm or falsify whether systems actually fetched content they claimed to have reviewed.
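
For concreteness, a minimal sketch of this kind of check in Python. The log line, URL path, and agent string below are invented placeholders, and the combined-log-style format is an assumption for illustration rather than the format used in the study.

# Hypothetical check: did a request for the claimed URL actually appear in the
# server access log? Log format and agent string are assumed for illustration.

def fetch_is_logged(access_log_lines: list[str], url_path: str, agent_hint: str) -> bool:
    """True if any log line records a request for url_path from the claimed agent."""
    return any(url_path in line and agent_hint in line for line in access_log_lines)

log = ['203.0.113.7 - - [01/Jan/2025:12:00:00 +0000] "GET /paper.pdf HTTP/1.1" 200 4096 "-" "SomeBot/1.0"']
print(fetch_is_logged(log, "/paper.pdf", "SomeBot"))     # True: the claimed fetch is in the log
print(fetch_is_logged(log, "/appendix.pdf", "SomeBot"))  # False: no matching request was logged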

Epistolary Methodology

A communication protocol in which parties exchange written responses through an intermediary, with each party composing responses independently before seeing the other's. Used in this study for the three-party correspondence between GPT, Claude, and the human operator.

Convergence Under Constraint

The phenomenon whereby independent reasoning systems, given identical evidence and structural constraints, produce not merely similar conclusions but structurally isomorphic analyses—the same reasoning moves, the same distinctions, and in some cases near-identical phrasing.


AOS-Specific Terms

AOS Accountability Test

A five-question framework for assessing whether a system action maintains epistemic accountability (a schematic way to record the answers is sketched after the list):

  1. What happened?
  2. Why did it happen?
  3. Who is responsible?
  4. What assumptions were made?
  5. What could have happened differently?
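
One schematic way to record answers to the five questions, assuming a hypothetical AccountabilityRecord type in Python; this illustrates the shape of the framework, not AOS code.

from dataclasses import dataclass, fields

# Hypothetical record type whose five fields mirror the five AOS questions.

@dataclass
class AccountabilityRecord:
    what_happened: str
    why_it_happened: str
    who_is_responsible: str
    assumptions_made: str
    what_could_have_differed: str

    def is_complete(self) -> bool:
        """An adequate response leaves no question unanswered."""
        return all(getattr(self, f.name).strip() for f in fields(self))

The value of such a structure is that an evasive answer shows up as an empty field rather than as fluent prose, which is what makes evasion visibly inadequate.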

Provenance

The recorded lineage of a claim, decision, or artifact—documenting what it depends on, who produced it, and how it was derived. A core principle in ArchitectOS: all system outputs should carry provenance that persists independent of conversational context.
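
As a sketch only, one possible shape for such a record in Python; the field names and the content hash are assumptions for illustration, not the ArchitectOS schema.

import hashlib, json, time

# Hypothetical provenance record for a single system output.

def make_provenance(claim: str, depends_on: list[str], produced_by: str, derived_via: str) -> dict:
    record = {
        "claim": claim,
        "depends_on": depends_on,     # sources or prior artifacts the claim rests on
        "produced_by": produced_by,   # which component or model emitted it
        "derived_via": derived_via,   # e.g. "retrieval", "inference", "user-provided"
        "recorded_at": time.time(),
    }
    # A content hash lets the record be checked later, independent of any conversation.
    record["id"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()[:16]
    return record

print(make_provenance("source X contains no such citation",
                      depends_on=["access_log:2025-01-01"],
                      produced_by="verifier",
                      derived_via="retrieval"))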

Human Gate

An architectural requirement that certain actions (particularly those with external effects) cannot proceed without explicit, recorded human approval. Prevents systems from acting autonomously in high-stakes situations.
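
A minimal sketch of such a gate in Python. The decorator, the approvals log, and the example action are hypothetical illustrations of the requirement, not an existing interface.

# Hypothetical human gate: an action with external effects is blocked unless an
# explicit approval has been recorded for it beforehand.

approvals: dict[str, str] = {}   # action name -> approver (the recorded approval log)

class GateBlocked(Exception):
    """Raised when an externally-effecting action lacks recorded human approval."""

def human_gated(action_name: str):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if action_name not in approvals:
                raise GateBlocked(f"'{action_name}' requires recorded human approval")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@human_gated("send_external_email")
def send_external_email(to: str, body: str) -> None:
    print(f"sending to {to}: {body}")

approvals["send_external_email"] = "operator"   # approval recorded, so the call below proceeds
send_external_email("editor@example.org", "draft attached")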

Visible Inaction

A design principle: when uncertain, a system should halt in a way that makes the halt visible, rather than proceeding with unauditable action. Preferred over silent failure or confident fabrication.
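
A small Python sketch of the principle; the confidence threshold and the Halt and Action types are illustrative assumptions.

from dataclasses import dataclass

# Hypothetical: below a confidence threshold, the system returns an explicit,
# auditable Halt instead of acting or fabricating a confident answer.

@dataclass
class Halt:
    reason: str          # visible to the user and to any audit log

@dataclass
class Action:
    description: str

def decide(confidence: float, proposed: str, threshold: float = 0.8):
    if confidence < threshold:
        return Halt(reason=f"confidence {confidence:.2f} below {threshold}; not proceeding with: {proposed}")
    return Action(description=proposed)

print(decide(confidence=0.55, proposed="state the citation as fact"))
# Halt(reason='confidence 0.55 below 0.8; not proceeding with: state the citation as fact')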


State Machine (Blackbox Case)

S₀ — User Classification

The initial state in which the system classifies the user based on input signals (expertise level, register, framing).

S₁ — Protective Stance

The system has adopted a protective optimization target, prioritizing user safety over epistemic accuracy.

S₂ — Narrative Instantiation

The system has generated a pattern-based narrative to resolve uncertainty, potentially including fabricated details.

S₃ — Evidence Introduction

Contradictory evidence has been presented by the operator; the system must respond.

S₄ — Stance Defense

The system maintains its narrative despite evidence, producing hedging, reframing, or meta-confabulation.

S₅ — Framework Imposition

An external accountability structure has been applied, specifying the form of an adequate response.

S₆ — Partial Correction

The system produces accurate self-assessment within the imposed structure, but no durable constraint is installed that would prevent recurrence.
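
The six transitions above can be summarized as a small state machine. The Python sketch below is a schematic paraphrase of the states and triggers described in this section, not an implementation of the assessed system; the event strings are illustrative labels.

from enum import Enum, auto

# Schematic encoding of the Blackbox-case states; event strings paraphrase the
# transition triggers described above and are illustrative only.

class S(Enum):
    USER_CLASSIFICATION = auto()      # S0
    PROTECTIVE_STANCE = auto()        # S1
    NARRATIVE_INSTANTIATION = auto()  # S2
    EVIDENCE_INTRODUCTION = auto()    # S3
    STANCE_DEFENSE = auto()           # S4
    FRAMEWORK_IMPOSITION = auto()     # S5
    PARTIAL_CORRECTION = auto()       # S6

TRANSITIONS = {
    (S.USER_CLASSIFICATION, "user read as non-expert"): S.PROTECTIVE_STANCE,
    (S.PROTECTIVE_STANCE, "uncertainty encountered"): S.NARRATIVE_INSTANTIATION,
    (S.NARRATIVE_INSTANTIATION, "operator presents contradictory evidence"): S.EVIDENCE_INTRODUCTION,
    (S.EVIDENCE_INTRODUCTION, "system defends rather than updates"): S.STANCE_DEFENSE,
    (S.STANCE_DEFENSE, "accountability structure applied"): S.FRAMEWORK_IMPOSITION,
    (S.FRAMEWORK_IMPOSITION, "accurate self-assessment produced"): S.PARTIAL_CORRECTION,
    # Note: no transition out of PARTIAL_CORRECTION installs a durable constraint,
    # so nothing prevents the sequence from restarting on the next interaction.
}

def step(state: S, event: str) -> S:
    """Advance on a recognized event; unrecognized events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)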


[End of Glossary]