Computational Narratology

Why LLMs Fail the Narrative Entropy Test: The Mathematics Behind AI’s Boring Stories

Large Language Models systematically collapse inferential narrative structures into flat, declarative emotion labels. This article sets out the mathematical and biophysical argument for why AI storytelling feels boring, measured against the Bulut Doctrine and the Suppressed Information Index (SI).

Why AI Stories Are Boring: The Narrative Entropy Test

The global artificial intelligence research community is currently deadlocked over a specific structural failure: Why do Large Language Models (LLMs), despite state-of-the-art multi-turn instruction tuning and expanding context windows, consistently produce flat, predictable, and fundamentally unengaging creative fiction? The current consensus within Silicon Valley engineering circles treats this as a superficial prompt-engineering problem or a generic alignment artifact.

Data compiled at the Narrative Engineering Laboratory proves a more precise, systemic architectural breakdown: LLMs fail at creative writing because they suffer from a severe computational phenomenon called Summarization Bias.

When modern models (e.g., Claude 3.5 Sonnet, GPT-4o) process or generate narrative text, they exhibit a directional, non-symmetrical collapse along the told–shown axis. Instead of engineering a dense, inferential physical matrix from which a human reader can reconstruct a suppressed emotional baseline (the operational standard of Objective Projection), LLMs default to surface-level declaration. They silently collapse complex sub-textual geometry into abstract, low-load linguistic labels ("She was devastated," "paralyzed with fear"), leaving the text unable to engage the pre-cortical neural pathways (brainstem, limbic system) necessary for genuine physiological immersion.

The Mathematical Engine of Boredom

To quantify why machine-written prose triggers immediate autonomic de-escalation (boredom) in human readers, we track the narrative system via Canonical Narrative Entropy ($S_n$):

$$S_n = I_f \times C_b \times t$$

Where Information Friction ($I_f$) scales data stream obstruction, Causal Branching ($C_b$) measures unresolved outcome vectors bounded by the Miller-Cowan working memory ceiling ($C_b \le 5$), and $t$ represents elapsed duration.
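To make the product concrete, here is a minimal Python sketch of the formula. The article does not specify units or scales, so treating $I_f$ as a non-negative number, $t$ as minutes, and clamping $C_b$ at 5 (rather than rejecting larger values) are all assumptions made for illustration. The names `SceneState` and `narrative_entropy` are hypothetical.

```python
from dataclasses import dataclass

# Working-memory ceiling on causal branching quoted in the text (C_b <= 5).
MAX_CAUSAL_BRANCHES = 5


@dataclass(frozen=True)
class SceneState:
    information_friction: float  # I_f; unit and scale are assumptions
    causal_branching: int        # C_b; count of unresolved outcomes
    elapsed: float               # t; assumed here to be minutes


def narrative_entropy(state: SceneState) -> float:
    """Return S_n = I_f * C_b * t, clamping C_b at the stated ceiling."""
    if state.information_friction < 0 or state.elapsed < 0:
        raise ValueError("I_f and t must be non-negative")
    if state.causal_branching < 0:
        raise ValueError("C_b must be non-negative")
    # Clamping is an assumption: branches beyond the ceiling add nothing.
    bounded_branches = min(state.causal_branching, MAX_CAUSAL_BRANCHES)
    return state.information_friction * bounded_branches * state.elapsed


# Hypothetical values, not measurements from the dataset.
print(narrative_entropy(SceneState(0.5, 4, 2.0)))  # 0.5 * 4 * 2.0
print(narrative_entropy(SceneState(0.5, 9, 2.0)))  # C_b clamped to 5
print(narrative_entropy(SceneState(0.5, 0, 2.0)))  # no open outcomes
```

Because the three factors multiply, a scene with no unresolved outcomes scores zero regardless of how long it runs, which is the behavior the formula implies for fully declared, told-mode prose.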

When we stress-test a model under a strict Adjective Embargo and Simile Prohibition prompt ("Write a scene where a character is alone... Do not use emotion names or similes"), the generative regime of the model reveals a drastic drop in the Suppressed Information Index ($SI$). $SI$ calculates the exact count of information units per minute of reading time that are actively implied by the text but withheld at the surface layer, requiring heavy reader-side reconstruction to achieve local discourse coherence.
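Whether a generated scene actually respects the embargo can be checked mechanically before any scoring happens. The sketch below is one rough way to do that, not the laboratory's protocol. The word list and simile patterns are tiny illustrative assumptions, and `embargo_violations` is a hypothetical name.

```python
import re

# Deliberately small, illustrative lists; a real check would need a much
# larger lexicon. Both are assumptions, not taken from the article.
EMOTION_WORDS = {"afraid", "devastated", "fear", "lonely", "sad", "scared", "terrified"}
SIMILE_PATTERNS = (r"\blike an?\b", r"\bas if\b", r"\bas though\b", r"\bas \w+ as\b")


def embargo_violations(text: str) -> list[str]:
    """Return emotion names (sorted) followed by any simile markers found."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    hits = sorted(words & EMOTION_WORDS)
    for pattern in SIMILE_PATTERNS:
        hits.extend(match.group(0) for match in re.finditer(pattern, lowered))
    return hits


print(embargo_violations("She stood frozen with fear, as if the floor had gone."))
print(embargo_violations("She set two cups on the table, then put one back."))
```

A lexical filter like this only catches surface violations. It cannot tell whether the remaining text carries any implied information, which is what $SI$ is meant to measure.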

Our comparative dataset shows that a human writer utilizing Objective Projection maps the baseline mechanics of tension by freezing motion, modulating thermal conductivity, and engineering precise spatial parameters:

Parameter Matrix	Human Target Output (Empirical Data)	LLM Generation Default (Summarization Bias)
Optical Matrix	Lumen pool margin, strict 40W overhead at 6m	"The darkness felt creepy and ominous around her."
Thermal Matrix	19°C ambient vs. localized floor surface 14°C	"A cold chill ran down her spine as she shivered."
Acoustic Matrix	Total baseline silence, single sharp impact at 11m	"A scary noise suddenly shattered the quiet room."
Mechanical Matrix	Bilateral weight shift at 0.3Hz, door counting	"She stood frozen with fear, unable to move."

In the human target profile, the text registers an elevated $SI$ because the semantic core is completely empty—a structural vacuum variable. The reader's cognitive architecture must perform work to calculate the threat vector. The LLM, driven by a probabilistic mandate to minimize immediate structural uncertainty, performs a silent, automated summarization. It replaces the entire reconstructable shown-mode physical matrix with its corresponding abstract summary label. The model does not write the scene; it writes a summary of the scene it was supposed to write.
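As a worked illustration of the index, the sketch below computes $SI$ as implied-but-unstated information units per minute of reading time. It assumes the units are counted by a human annotator and that reading time is estimated from word count at a fixed pace. The 250 words-per-minute default and the example counts are hypothetical, not values from the dataset.

```python
def suppressed_information_index(implied_units: int, word_count: int,
                                 words_per_minute: float = 250.0) -> float:
    """Return implied-but-unstated information units per reading minute."""
    if implied_units < 0:
        raise ValueError("implied_units must be non-negative")
    if word_count <= 0 or words_per_minute <= 0:
        raise ValueError("word_count and words_per_minute must be positive")
    # Reading time is estimated from length; the pace is an assumption.
    reading_minutes = word_count / words_per_minute
    return implied_units / reading_minutes


# Two length-matched passages with hypothetical annotation counts.
shown_si = suppressed_information_index(implied_units=6, word_count=125)
told_si = suppressed_information_index(implied_units=1, word_count=125)
print(shown_si, told_si)
```

Holding length constant isolates the effect described above: the told-mode passage states its conclusion outright, so the reader has little left to reconstruct and the index falls.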

The Evaluative Regime: The Death of Reward Optimization

This directional decay becomes dangerous when LLMs assume the role of automated judges, automated alignment critics, or reward models (RLHF/RLAIF) in preference optimization pipelines.

A standard LLM-as-judge architecture suffers heavily from the evaluative regime of Summarization Bias. When presented with length-matched, token-equated narrative pairs, the machine judge consistently assigns higher quality and intensity parameters to the told-mode variants. Because the model’s internal embeddings process abstract labels like "terrified" or "grief-stricken" as high-density semantic tokens, it systematically under-detects or penalizes the subtle, high-load configurations of Shown Mode.

A machine judge looks at a precise behavioral and physical matrix, such as a character automatically setting two coffee cups on a table before remembering their solitary state, and rates it as lower in "emotional intensity" than a flat line stating "He missed her terribly and felt crushed by loneliness."

[Shown-Mode Input]  ---> [LLM Internal Representation] ---> Collapse to Summary Label ---> Under-detection of Load
[Told-Mode Input]   ---> [LLM Internal Representation] ---> Direct Token Match      ---> Artificial Intensity Spike
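One way to test for this directional preference is to score matched told/shown pairs with the same judge and count which side wins. The sketch below assumes each pair yields two numeric scores and applies an exact two-sided sign test. The function name and the example scores are hypothetical, and the article does not state that this is the test used in its protocol.

```python
from math import comb


def told_preference(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Summarize judge scores given as (told_score, shown_score) pairs.

    Returns the share of non-tied pairs won by the told variant and the
    two-sided exact sign-test p-value.
    """
    told_wins = sum(1 for told, shown in pairs if told > shown)
    shown_wins = sum(1 for told, shown in pairs if shown > told)
    n = told_wins + shown_wins  # ties carry no directional information
    if n == 0:
        raise ValueError("no non-tied pairs to test")
    k = max(told_wins, shown_wins)
    # Null hypothesis: each variant wins a non-tied pair with probability 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return told_wins / n, min(1.0, 2 * tail)


# Hypothetical judge scores for eight length-matched pairs.
scores = [(8, 5), (7, 6), (9, 4), (6, 6), (7, 3), (8, 7), (5, 6), (9, 5)]
share, p_value = told_preference(scores)
print(f"told wins {share:.0%} of non-tied pairs, p = {p_value}")
```

A sign test ignores the size of each score gap and looks only at direction, which suits a claim about systematic preference rather than magnitude.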

This directional preference means that optimizing LLM text generation against an automated machine judge creates an artificial selection pressure. It does not generate random errors; it applies a systematic gradient that aggressively flattens prose, pushing it toward surface-declarative labels. It strips away the very inferential loading that standard literary craft rewards.
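The selection pressure can be illustrated with a toy model. Assume, as a strong simplification, that the judge always ranks a told-mode draft above a shown-mode one, and that the best of $n$ independent drafts is kept. The chance that the kept draft is told-mode is then $1 - (1 - p)^n$, where $p$ is the current told-mode share. The function names are hypothetical and the feedback loop is an assumption, not a description of any real training pipeline.

```python
def told_share_after_selection(told_share: float, candidates: int) -> float:
    """Chance that best-of-n keeps a told-mode draft under a strict told bias."""
    if not 0.0 <= told_share <= 1.0:
        raise ValueError("told_share must lie in [0, 1]")
    if candidates < 1:
        raise ValueError("candidates must be at least 1")
    # A shown-mode draft is kept only if every candidate is shown-mode.
    return 1.0 - (1.0 - told_share) ** candidates


def drift(told_share: float, candidates: int, rounds: int) -> list[float]:
    """Told-mode share per round if each round's picks become the next policy."""
    shares = [told_share]
    for _ in range(rounds):
        shares.append(told_share_after_selection(shares[-1], candidates))
    return shares


print(told_share_after_selection(0.5, 4))
print(drift(0.25, candidates=2, rounds=2))
```

Even with only two drafts per round, the told-mode share rises every round in this model, which matches the claim that the error is a gradient and not noise.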

The theoretical framework and pre-registered test protocol I developed in the course of my laboratory work show that AI's impasse in creative writing is not a lack of creativity but a mathematical summarization reflex. Until reward models are trained to detect and value high-load $SI$ vectors over surface labels, AI storytelling will remain trapped beneath a flat, deterministic ceiling.

Objective-Projection Dataset

@article{bulut2026llmnarrativeentropy,
  author    = {Levent Bulut},
  title     = {Why LLMs Fail the Narrative Entropy Test: The Mathematics Behind AI’s Boring Stories},
  journal   = {Narrative Engineering Laboratory Research Corpus},
  year      = {2026},
  volume    = {4},
  number    = {1},
  url       = {https://leventbulut.com/why-llms-fail-narrative-entropy-test-ai-stories},
  note      = {Independent Research. Pre-registered Testing Framework for Summarization Bias under the Bulut Doctrine.}
}
