Text testing framework — source material
Type: note · Status: current
Full framework for automated testing of text artifacts, received 2026-02-21. Saved as reference for when we start building concrete checks.
See automated-tests-for-text for the distilled observation.
1) Think in terms of a text "spec" (contracts)
Software tests work because there's a spec, even if it's implicit. Free text can have specs too, for example:
- Structural contract: must include sections (Summary / Decision / Next steps), must be under 250 words, must include an owner + date.
- Audience contract: written for "new joiners," avoid internal slang, define acronyms.
- Tone/voice contract: friendly but direct, no hype, no moralizing, no sarcasm.
- Safety/privacy contract: no secrets, no personal data, no legal/medical claims without disclaimers.
- Truthfulness contract: each claim must be cited, explicitly labeled as an assumption, or consistent with a reference corpus.
- Actionability contract: at least 1 concrete next action, deadlines in ISO dates, no ambiguous "soon."
Once you have contracts, you can test them.
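As a minimal sketch of what "testing a contract" could look like, the structural contract above can be written down as data and checked by a small function. The names StructuralContract and check_structure are hypothetical, and the sketch assumes each section heading sits on its own line (optionally followed by a colon) and that "under 250 words" means strictly fewer than 250.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralContract:
    # Values taken from the structural contract example; the class itself is hypothetical.
    required_sections: tuple = ("Summary", "Decision", "Next steps")
    max_words: int = 250  # assumption: "under 250" means strictly fewer than 250


def check_structure(text, contract):
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    # Assumption: a heading is a line whose text equals the section name,
    # ignoring case and an optional trailing colon.
    headings = {line.strip().rstrip(":").lower() for line in text.splitlines()}
    for section in contract.required_sections:
        if section.lower() not in headings:
            violations.append(f"missing section: {section}")
    word_count = len(text.split())
    if word_count >= contract.max_words:
        violations.append(f"too long: {word_count} words")
    return violations
```

Returning a list of violations rather than a single boolean keeps the failure reasons available for reporting.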
2) A "test pyramid" for text
Level A — Deterministic checks (fast, cheap, reliable)
- Length / reading time / sentence length
- Formatting & required sections
- Forbidden phrases / banned claims
- Link validity / citation presence
- Terminology consistency
- Dates & numbers
- PII/secret scanning
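A few of the Level A checks can be sketched with nothing but the standard library. This is an illustration under assumptions, not a complete linter: the banned-phrase list and the relative-date vocabulary are made up for the example, and sentence splitting is deliberately naive.

```python
import re

# Hypothetical lists; a real project would load these from its style guide.
BANNED_PHRASES = ("game-changing", "world-class")
RELATIVE_DATE = re.compile(
    r"\b(soon|asap|tomorrow|yesterday|next week|last week)\b", re.IGNORECASE
)


def find_banned_phrases(text):
    """Return the banned phrases that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in lowered]


def find_relative_dates(text):
    """Return relative date words in order of appearance, lowercased."""
    return [match.group(0).lower() for match in RELATIVE_DATE.finditer(text)]


def max_sentence_length(text):
    """Return the word count of the longest sentence.

    Naive split on ., ! and ? runs; abbreviations and decimals will be
    split too, which is acceptable for a rough lint.
    """
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    return max((len(s.split()) for s in sentences), default=0)
```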
Level B — LLM rubric graders (medium cost, high coverage)
- Input is prompt + rubric + examples; output is structured JSON (pass/fail + reasons + spans). Example rubric questions:
- "Does the note contain a single clear thesis in the first 2 sentences?"
- "Does each claim either cite a source or carry an explicit assumption tag?"
- "Is there a concrete next step with owner + date?"
- "Is the tone aligned to our style guide?"
- "Is there any internal contradiction?"
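Whatever model does the grading, its structured output is worth validating before anyone trusts it. The sketch below covers only that validation step; the model call itself is out of scope. The JSON field names "pass", "reasons" and "spans" are an assumption based on the shape described above, and spans are assumed to be [start, end] character offsets into the note.

```python
import json
from dataclasses import dataclass


@dataclass
class RubricResult:
    passed: bool
    reasons: list
    spans: list  # (start, end) character offsets; an assumed convention


def parse_rubric_response(raw, text_length):
    """Parse and validate a grader's JSON reply; raise ValueError if malformed.

    json.JSONDecodeError is a subclass of ValueError, so invalid JSON
    surfaces the same way as a schema violation.
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("pass"), bool):
        raise ValueError("response must be an object with a boolean 'pass'")
    reasons = data.get("reasons", [])
    if not isinstance(reasons, list) or not all(isinstance(r, str) for r in reasons):
        raise ValueError("'reasons' must be a list of strings")
    spans = []
    for span in data.get("spans", []):
        if not (isinstance(span, list) and len(span) == 2):
            raise ValueError(f"span must be [start, end]: {span!r}")
        start, end = span
        if not (isinstance(start, int) and isinstance(end, int)):
            raise ValueError(f"span offsets must be integers: {span!r}")
        if not 0 <= start < end <= text_length:
            raise ValueError(f"span out of range: {span!r}")
        spans.append((start, end))
    return RubricResult(passed=data["pass"], reasons=reasons, spans=spans)
```

Rejecting malformed replies outright, rather than coercing them, keeps a confused grader from silently counting as a pass.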
Level C — Cross-model / adversarial checks (slower, higher confidence)
- N-of-M voting
- Two-model agreement
- Adversarial prompting
- Metamorphic testing
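N-of-M voting is the simplest of these to sketch. Here a judge is assumed to be any callable that takes the text and returns True for a pass; in practice each would wrap a model call, which is not shown. The function name n_of_m_vote is hypothetical.

```python
def n_of_m_vote(text, judges, n_required):
    """Return True if at least n_required of the judges pass the text.

    judges: sequence of callables, text -> bool (an assumed interface).
    """
    if not 1 <= n_required <= len(judges):
        raise ValueError("n_required must be between 1 and the number of judges")
    passes = sum(1 for judge in judges if judge(text))
    return passes >= n_required
```

With stub judges, n_of_m_vote(note, [judge_a, judge_b, judge_c], 2) implements a 2-of-3 majority; two-model agreement is the special case of two judges with n_required set to 2.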
3) Testing meaning indirectly
Metamorphic tests
- Paraphrase invariance
- Summarization invariance
- Reformat invariance
- Reverse test (generate Q&A, check consistency)
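Reformat invariance is the easiest metamorphic test to show without a model: a check's verdict should not change when the same words are re-wrapped. The helper names rewrap and holds_under are hypothetical, and the transform here only changes whitespace; paraphrase and summarization invariance would need a model-backed transform in its place.

```python
import textwrap


def rewrap(text, width=40):
    """Reformat transform: collapse whitespace and re-wrap, keeping every word."""
    return textwrap.fill(" ".join(text.split()), width=width)


def holds_under(transform, check, text):
    """Return True if the check gives the same verdict before and after the transform."""
    return check(text) == check(transform(text))
```

A word-count check survives re-wrapping, while a check that depends on line count does not, which is exactly the kind of brittleness this test is meant to expose.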
Claim extraction + verification
- Extract atomic claims
- Entailment check against source corpus
- Contradiction check
- Missing citation flag
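Extraction and entailment need a model, but the missing-citation flag can be sketched deterministically once claims have been extracted. The marker conventions below are assumptions for illustration: a citation is a bracketed number like [1] or a URL, and an assumption is a claim that starts with "Assumption:".

```python
import re

# Assumed conventions; adjust to whatever citation style the notes actually use.
CITATION = re.compile(r"\[\d+\]|https?://\S+")
ASSUMPTION = re.compile(r"^\s*assumption:", re.IGNORECASE)


def flag_uncited(claims):
    """Return the claims that carry neither a citation nor an assumption tag."""
    return [
        claim
        for claim in claims
        if not CITATION.search(claim) and not ASSUMPTION.match(claim)
    ]
```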
4) Compatibility with a collection of texts
- No contradictions with existing docs
- Terminology + ontology alignment
- Style/voice consistency
- Duplicate / near-duplicate detection
- Coverage and linking behavior
- Update compatibility (supersedes, migration notes)
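Of the corpus checks, near-duplicate detection has a well-known cheap baseline: Jaccard similarity over word shingles. The sketch below swaps that in as one possible technique; the note does not prescribe a method, and the shingle size, the 0.8 threshold, and the function names are all assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, in [0.0, 1.0]."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def near_duplicates(new_note, corpus, threshold=0.8):
    """Return names of corpus entries whose similarity meets the threshold.

    corpus: dict mapping note name -> note text (an assumed layout).
    """
    return [
        name for name, text in corpus.items()
        if jaccard(new_note, text) >= threshold
    ]
```

This catches copy-paste duplication only; paraphrased duplicates would need an embedding or model-based comparison.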
5) Production workflow
- Pre-commit / local lint — structure, length, banned phrases, PII
- CI unit tests — deterministic + basic rubric
- CI integration tests — contradiction, taxonomy, duplication
- Human review for edge cases
- Regression suite with golden notes
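The regression suite can be sketched as a list of golden notes, each paired with the verdict it is expected to get, re-run whenever a check changes. The tuple layout and the name run_regression are hypothetical.

```python
def run_regression(check, golden_notes):
    """Return the names of golden notes whose verdict no longer matches.

    check: callable, text -> bool.
    golden_notes: iterable of (name, text, expected_pass) tuples (assumed layout).
    """
    return [
        name
        for name, text, expected_pass in golden_notes
        if check(text) != expected_pass
    ]
```

An empty result means the check still behaves as it did when the golden verdicts were recorded; any returned name points at either a regression in the check or a golden verdict that needs a deliberate update.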
6) Failure modes
- LLM judges are not deterministic → N-of-M voting, multiple judges
- Judges can be lenient → add adversarial critique passes
- Corpus checks miss context → improve retrieval, require citations
- Over-testing early → start with 10-20 high-value checks
7) Minimal starting checklist
Single-note: required sections, max length, next step with owner, no relative dates, acronyms defined, no PII, clarity rubric, main point in first 2 sentences.
Corpus: top-5 similarity contradiction check, threshold linking, glossary alignment.
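From the single-note list, "acronyms defined" is a good example of a check that looks fuzzy but has a workable deterministic first pass. The sketch assumes an acronym is a run of two or more capital letters and counts it as defined if the note contains it in parentheses, as in "service level objective (SLO)", or immediately followed by a parenthetical, as in "SLO (service level objective)". Both conventions and the function name are assumptions.

```python
import re

ACRONYM = re.compile(r"\b[A-Z]{2,}\b")  # assumed definition of an acronym


def undefined_acronyms(text):
    """Return acronyms that are never introduced with a parenthetical, sorted."""
    found = set(ACRONYM.findall(text))
    return sorted(
        acronym
        for acronym in found
        if f"({acronym})" not in text and f"{acronym} (" not in text
    )
```

A glossary-backed allowlist for house acronyms would cut false positives; that is left out here.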
Topics: