Can a Machine Detect "Show, Don't Tell"? What Happened When We Tested It

We tested a rule, two AI models, and a human on the exact same scenes. The machines caught the surface features but missed the ones that require semantic inference.

Abstract

This study investigates how accurately AI models and a rule-based detector can identify the craft features of a specific writing technique, Objective Projection. In a test conducted on 100 scenes selected from the Objective Projection dataset, Gemini, Grok, and a rule-based detector were compared against an independent human annotator. The results indicate that while surface features are easily detected, both the AI models and the rule-based detector struggle significantly with inferential features, such as conveying an abstract emotion through physical details.

What Is Objective Projection?

Objective Projection is a narrative technique developed by Levent Bulut within the framework of Narrative Engineering. It enables abstract emotional states, psychological breakthroughs, or theoretical concepts to be manifested through concrete objects, physical structures, or environmental factors, rather than being explicitly stated in the narrative.

Can AI Really Read the Meaning?

Short answer: for surface features like a timestamp or a banned simile, yes: a simple rule catches them. For features that require reading for meaning, like turning an abstract feeling into a concrete physical detail, no, and in our test two advanced AI models did not do better than a basic rule. Whether that is because the feature is genuinely beyond machines, or because the definition is too loose, is not yet settled.

What is being tested here?

Objective Projection is a writing method that encodes emotion through measurable physical detail instead of naming it. The objective-projection dataset tags each scene with six "craft" features: whether it avoids naming an emotion, avoids similes, turns an abstraction into a concrete object, focuses on a small physical detail, anchors itself in concrete time, and contains an atmospheric contradiction.

Those tags are produced by an automatic rule-based detector. The obvious question: is the detector right? This article reports a second test of that question, on 100 scenes the detector had never seen, using one independent human rater and two large language models (Gemini and Grok) as additional raters.

How the test worked

A person who is not an expert in the method labeled 100 fresh scenes by hand, seeing only the definitions and the scene text, not the detector's answers. The same 100 scenes, with the same definitions, were then labeled by Gemini and by Grok. We compared all three automated raters (the detector, Gemini, and Grok) against the human, feature by feature.
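The feature-by-feature comparison can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function name `compare_to_human`, the feature keys, and the four-scene toy labels are all assumptions made up for the example.

```python
from typing import Dict, List

# Hypothetical layout: feature name -> one True/False label per scene.
Labels = Dict[str, List[bool]]

def compare_to_human(human: Labels, rater: Labels) -> Dict[str, Dict[str, float]]:
    """For each feature, count the four agreement cells and the raw agreement rate."""
    report: Dict[str, Dict[str, float]] = {}
    for feature, human_labels in human.items():
        rater_labels = rater[feature]
        if not human_labels or len(human_labels) != len(rater_labels):
            raise ValueError(f"empty or mismatched labels for feature {feature!r}")
        pairs = list(zip(human_labels, rater_labels))
        both_yes = sum(h and r for h, r in pairs)
        rater_only = sum((not h) and r for h, r in pairs)
        human_only = sum(h and (not r) for h, r in pairs)
        both_no = sum((not h) and (not r) for h, r in pairs)
        report[feature] = {
            "both_yes": both_yes,
            "rater_only": rater_only,   # rater says present, human says absent
            "human_only": human_only,   # human says present, rater says absent
            "both_no": both_no,
            "agreement": (both_yes + both_no) / len(pairs),
        }
    return report

# Toy data for four scenes (invented, not taken from the study).
human = {
    "materialized_metaphor": [True, False, False, False],
    "temporal_anchor": [True, True, True, True],
}
detector = {
    "materialized_metaphor": [True, True, True, False],
    "temporal_anchor": [True, True, True, True],
}
report = compare_to_human(human, detector)
print(report["materialized_metaphor"])
```

Keeping the four cells separate, rather than reporting a single agreement number, makes it visible whether a rater errs by over-flagging or by under-flagging.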

The scenes were drawn at random (with a fixed seed for reproducibility) from a pool the detector's current version had never been evaluated on. This avoids a common trap: testing a tool on the same data used to build it.
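A seeded draw of this kind might look like the sketch below. The pool size, the seed value 42, and the function name `draw_sample` are placeholders; the article does not state the actual seed or pool.

```python
import random
from typing import Iterable, List

def draw_sample(pool_ids: Iterable[int], k: int, seed: int) -> List[int]:
    """Draw k distinct scene ids reproducibly from a held-out pool."""
    # Sort first so the result does not depend on the order the pool was loaded in.
    ordered = sorted(pool_ids)
    # A private Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(ordered, k)

held_out_pool = range(1000)  # placeholder pool of scene ids
scenes = draw_sample(held_out_pool, k=100, seed=42)  # 42 is a placeholder seed
print(len(scenes))
```

Anyone rerunning the draw with the same pool and seed gets the same 100 scenes, which is what makes the sample auditable.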

What the results showed

The picture splits cleanly.

Surface features were easy, sometimes too easy to be informative. Nearly every scene in the corpus opens with an explicit clock reference, so the "temporal anchor" feature was present in 99 of 100 scenes. When something is that universal, high agreement measures how the corpus was built, not how good the detector is. That feature was, in effect, not really tested.
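One way to see why near-universal features are uninformative is to compare raw agreement with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is an assumption here: the article does not say which agreement statistic the findings report uses. The "always present" rater is hypothetical and reads no text at all.

```python
def cohen_kappa(both_yes: int, rater_only: int, human_only: int, both_no: int) -> float:
    """Chance-corrected agreement between two raters on a yes/no feature."""
    n = both_yes + rater_only + human_only + both_no
    observed = (both_yes + both_no) / n
    human_rate = (both_yes + human_only) / n
    rater_rate = (both_yes + rater_only) / n
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = human_rate * rater_rate + (1 - human_rate) * (1 - rater_rate)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)

# Human marks the feature present in 99 of 100 scenes.
# Hypothetical rater: answers "present" for every scene without reading it.
raw_agreement = 99 / 100
kappa = cohen_kappa(both_yes=99, rater_only=1, human_only=0, both_no=0)
print(raw_agreement, kappa)
```

The blind rater agrees with the human on 99 percent of scenes, yet its kappa is zero, because that level of agreement is exactly what chance predicts at this base rate.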

Features that require inference were a different story. For "materialized metaphor" (turning an abstract inner state into a concrete physical detail), the human found it in 9 scenes. The rule-based detector flagged 72, because it really only checks whether a physical word appeared, not whether the literary move happened. More striking: Gemini found 1 of those 9, and Grok found 0. Two capable language models, given a clean definition, did not recover the feature either.
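The counts above already bound how well the detector can have done, even without knowing which scenes overlap. The sketch below works only from the totals quoted in the text (9 human positives, 72 detector positives, 1 Gemini hit); the helper name `precision_ceiling` is invented for illustration.

```python
from typing import Optional

def precision_ceiling(human_positives: int, rater_positives: int) -> Optional[float]:
    """Best possible precision against the human, given only the two totals."""
    if rater_positives == 0:
        return None  # precision is undefined when nothing was flagged
    # At most min(human, rater) of the rater's flags can match the human's.
    return min(human_positives, rater_positives) / rater_positives

detector_ceiling = precision_ceiling(human_positives=9, rater_positives=72)
min_false_positives = 72 - 9   # detector flags the human cannot have confirmed
gemini_recall = 1 / 9          # Gemini found 1 of the 9 human-marked scenes
print(detector_ceiling, min_false_positives, round(gemini_recall, 3))
```

Even in the best case, where the detector caught all 9 human-marked scenes, at most 12.5 percent of its 72 flags match the human, and at least 63 do not.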

Does this prove AI cannot read for meaning?

No, and it is worth being precise about why not.

This result is consistent with a hypothesis I have written about, which I call summarization bias: the tendency of language models to collapse the physical, shown layer of a narrative back into an abstract label, and to struggle at exactly the boundary this test probes. The models stumbling right at the physical-versus-inferred line fits that pattern.

But "consistent with" is not "proof of," and there is a competing explanation the data supports just as well. The two language models disagreed not only with the human but also with each other: on one feature, Gemini marked 9 scenes positive and Grok marked 82. If the feature were crisply defined, the two models would be expected to converge. Their divergence suggests the definitions may simply be too loose for any rater, human or machine, to apply consistently.
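The size of that divergence can be bounded from the two totals alone. The sketch below assumes only what the text states (100 scenes, 9 positives from Gemini, 82 from Grok); the function name `agreement_bounds` is hypothetical, and the per-scene overlap is not given in the article.

```python
from typing import Tuple

def agreement_bounds(n: int, a_positives: int, b_positives: int) -> Tuple[float, float]:
    """Lowest and highest raw agreement two raters can have, given their totals."""
    min_overlap = max(0, a_positives + b_positives - n)
    max_overlap = min(a_positives, b_positives)

    def agreement(both_yes: int) -> float:
        # Scenes both raters marked absent follow from the totals and the overlap.
        both_no = n - a_positives - b_positives + both_yes
        return (both_yes + both_no) / n

    return agreement(min_overlap), agreement(max_overlap)

lowest, highest = agreement_bounds(n=100, a_positives=9, b_positives=82)
print(lowest, highest)
```

Whatever the overlap, the two models agreed on somewhere between 9 and 27 of the 100 scenes for that feature, so they disagreed on at least 73.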

Separating these two explanations requires a second independent human rater, which this study did not have. So the honest conclusion is narrow: on these scenes, neither a rule nor two frontier models matched a human on the inferential features. Explaining why is the job of the next experiment.

A note on neutrality: the question "can language models judge writing?" is one I am not neutral on, and neither are the two models used as raters. The interpretation here rests on the numbers, not on any model's authority.

What Is Summarization Bias?

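Within the framework of Objective Projection and Narrative Engineering developed by Levent Bulut, summarization bias refers to a structural flaw in AI-generated narratives: the tendency of language models to collapse the physical, shown layer of a narrative back into an abstract label.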

Why this matters for AI writing

If you ask an AI to "write a sad scene," it tends to produce text that announces sadness rather than showing it through physical detail. The same tendency may appear when an AI evaluates writing, rewarding text that names the emotion and penalizing text that hides it well. This test is a small, indirect look at that second behavior. It does not resolve the question, but it suggests the boundary between shown and told is exactly where automatic systems are least reliable.

The data is open

The full per-feature numbers, the agreement statistics, and the caveats (a single human rater, skewed feature classes, and the author's conflict of interest as the detector's designer) are documented in an open findings report. The dataset itself is available under CC BY-NC-ND (Hugging Face DOI 10.57967/hf/8960, Zenodo archive 10.5281/zenodo.19511369).

If you work on writing-quality evaluation and disagree with any of these labels, that disagreement is the most useful possible response.

Levent Bulut is an independent researcher and the originator of the Objective Projection method.