Dataset v7: Detecting the Six Golden Rules — A Process Note

Version 7 of the Objective Projection dataset added three pieces that turn the methodology from a claim into an auditable structure: a transparent detection pipeline that marks which rules each scene satisfies, machine-readable citation infrastructure, and full-text methodology papers. This note explains why we took this route and which limits we declare openly.

The problem: the "I deleted the label, so I'm compliant" fallacy

Objective Projection rests on six rules: Emotion Embargo, Simile Prohibition, Materialized Metaphors, Micro-Focus (Ng), Temporal Anchor, Atmosphere Contradiction. It is easy for a dataset to say it follows these rules; it is hard to prove it. Until v7, whether each of the 500 scenes obeyed a given rule was left to the reader's eye and trust. For an academic record, that is not enough.

The solution: a deterministic, rule-based, open detection pipeline

In v7, every scene gained an applied_rules block. The block is produced not by a language model (no LLM-as-judge) but by apply_rules.py, a single-file, dependency-free Python script that uses word-boundary matching. We chose this deliberately:

Reproducible. Clone the repo, run the script, get byte-identical output. Unlike an LLM judge, the result does not drift.
Auditable. The script states plainly which patterns each rule checks for. A researcher can inspect a specific decision, change thresholds, or contest a call.
Transparent. No black box. The full detection logic is readable.
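
As an illustration of what word-boundary lexicon matching can look like, here is a minimal sketch. It is not the code in apply_rules.py: the word lists, the flag names, and the helper names compile_lexicon and check_scene are all assumptions made for this example, and the real script's patterns may differ.

```python
import re

# Hypothetical lexicons for illustration only; the patterns actually
# used by apply_rules.py are defined in that script and may differ.
EMOTION_WORDS = ("sad", "happy", "afraid", "angry")
SIMILE_MARKERS = ("like", "as if", "as though")


def compile_lexicon(terms):
    """Build one case-insensitive pattern matching any term on word boundaries."""
    # Longest terms first so multi-word markers are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


EMOTION_RE = compile_lexicon(EMOTION_WORDS)
SIMILE_RE = compile_lexicon(SIMILE_MARKERS)


def check_scene(text):
    """Flag a rule as satisfied when none of its banned terms occur in the text."""
    return {
        "emotion_embargo": EMOTION_RE.search(text) is None,
        "simile_prohibition": SIMILE_RE.search(text) is None,
    }


clean = check_scene("The kettle clicked off at 06:40.")
violating = check_scene("She was sad, like rain.")
near_miss = check_scene("He disliked the saddle.")
```

The word-boundary anchors are what keep "disliked" from counting as "like" and "saddle" from counting as "sad", so near_miss passes both checks while violating fails both. Because the function depends only on its input text and fixed patterns, repeated runs give the same flags.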

Each applied_rules block carries six boolean flags, an active_count, primary_rule, detection_method, and doctrine_version. No existing field was modified, renamed, or removed; fields were only added.
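
To make the shape of the block concrete, the sketch below assembles one from a set of flags. The field names active_count, primary_rule, detection_method, and doctrine_version come from the description above; everything else is an assumption for illustration: the six flag key names, the "first active rule in a fixed order" choice for primary_rule, the detection_method and doctrine_version values, and the scene fields id and text.

```python
# Hypothetical flag names and ordering; the dataset's real keys may differ.
RULES = (
    "emotion_embargo",
    "simile_prohibition",
    "materialized_metaphors",
    "micro_focus",
    "temporal_anchor",
    "atmosphere_contradiction",
)


def build_applied_rules(flags, doctrine_version):
    """Assemble an applied_rules-style block from six boolean flags."""
    missing = [rule for rule in RULES if rule not in flags]
    if missing:
        raise ValueError("missing flags: " + ", ".join(missing))
    active = [rule for rule in RULES if flags[rule]]
    block = {rule: bool(flags[rule]) for rule in RULES}
    block["active_count"] = len(active)
    # Assumption: primary_rule is the first active rule in RULES order.
    block["primary_rule"] = active[0] if active else None
    # Assumption: placeholder label, not the dataset's actual value.
    block["detection_method"] = "rule_based_regex"
    block["doctrine_version"] = doctrine_version
    return block


def annotate(scene, flags, doctrine_version):
    """Return a copy of the scene with applied_rules added and nothing else changed."""
    annotated = dict(scene)
    annotated["applied_rules"] = build_applied_rules(flags, doctrine_version)
    return annotated


example_flags = {rule: False for rule in RULES}
example_flags["emotion_embargo"] = True
example_flags["temporal_anchor"] = True
example_scene = {"id": 1, "text": "The kettle clicked off at 06:40."}
annotated_scene = annotate(example_scene, example_flags, "example-version")
```

Working on a copy and only adding one key mirrors the additive guarantee: every field present before annotation is still present, with the same value, afterwards.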

An honest limit, declared: the rules are not equally reliable

This is the most important part of the release, and the part that dataset releases often leave unstated. A rule-based checker is, by design, a blunt instrument. Detection rates on the target outputs vary by rule:

High reliability (95%+): Simile Prohibition and Emotion Embargo — deterministic lexicon matching.
Moderate (60–80%): Temporal Anchor, Materialized Metaphors, Micro-Focus — structural/heuristic patterns.
Deliberately conservative (~10%): Atmosphere Contradiction.

That last line is intentional. Atmosphere Contradiction encodes a semantic authorial choice that regex cannot reliably see. So we tuned the pipeline to favour false negatives over false positives: it would rather miss the rule than wrongly claim it. The reason is simple: keeping the dataset's positive labels trustworthy matters more than coverage. We expect a peer reviewer to read this as a strength rather than a weakness.
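
The trade-off can be stated in terms of precision (how many positive labels are correct) and recall (how many true cases are found). The sketch below computes both on toy data. The counts are invented purely to show the arithmetic and are not measurements from the dataset; the function name precision_recall is likewise an assumption of this example.

```python
def precision_recall(pairs):
    """Compute (precision, recall) from (predicted, actual) boolean pairs."""
    tp = sum(1 for predicted, actual in pairs if predicted and actual)
    fp = sum(1 for predicted, actual in pairs if predicted and not actual)
    fn = sum(1 for predicted, actual in pairs if not predicted and actual)
    # None signals that the ratio is undefined (no positives to divide by).
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Toy data, invented for illustration: 8 scenes truly use a rule and
# 12 do not. A conservative detector flags only 2 scenes, both correctly.
toy_pairs = (
    [(True, True)] * 2       # true positives
    + [(False, True)] * 6    # false negatives: rule present but not claimed
    + [(False, False)] * 12  # true negatives
)
toy_precision, toy_recall = precision_recall(toy_pairs)
```

In this toy case precision is 1.0 and recall is 0.25: the detector misses most true cases, but every positive label it emits is correct. That is the behaviour a false-negative-favouring detector is tuned toward.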

The citation infrastructure that ships with it

CITATION.cff — Citation File Format v1.2.0. Hugging Face, GitHub, and Zenodo recognise this file and surface a "Cite this dataset" affordance automatically. It carries the primary HF DOI (10.57967/hf/8960) and the Zenodo archive DOI (10.5281/zenodo.19511369), plus cross-references to the architectural framework and the Sₙ pilot report.
Two full-text papers under academic/: the short-form methodology paper (Beyond the Cortical Label) and the Sₙ pilot report (10.5281/zenodo.20362901).

In short

v7 was not about adding scenes. It was about making the existing 500 scenes provable. If you want a methodology to take its own claim — "literature is not a feeling, it is a physics" — seriously, that claim has to be auditable, reproducible, and honestly bounded. That is what v7 aimed at.

Dataset: huggingface.co/datasets/leventbulut/objective-projection · DOI: 10.57967/hf/8960 · Full technical write-up: huggingface.co/blog/leventbulut/objective-projection

Levent Bulut