Ingest: LLM Position Bias Benchmark (Swapped-Order Pairwise Judging)

Type: kb/sources/types/ingest-report.md

Source: kb/sources/position-bias/
Captured: 2026-04-21
From: https://github.com/lechmazur/ (Mazur position-bias benchmark family; specific repo slug inferred from the "Related Benchmarks" list)
Pin: 483150e

Classification

Type: code-repository -- a public data-and-reports bundle (no runner code checked in, but 193 verified story pairs, 386 published prompt files, 10422 parsed model-answer rows with raw response text, and nine aggregate result CSVs); a reproducible measurement artifact rather than a codebase.
Domains: evaluation, judge-reliability, position-bias, llm-as-judge
Author: Lechoslaw (Lech) Mazur (inferred from the GitHub org linked in "Related Benchmarks"; this snapshot has a single "initial" commit dated 2026-04-21).
Credibility signal: the author maintains a well-known suite of public LLM benchmarks; independent of model providers; methodology is transparent but single-team.

Summary

The benchmark measures whether LLM judges preserve their pairwise preference when the same two candidate stories are displayed in the opposite order. For each of 193 verified sibling-edit story pairs, 27 judge models see both display orders and emit tag-structured ratings plus a winner. Headline finding: across the report view, the model-average first-shown pick rate is 63.3% and the median model flips its underlying canonical choice in 44.8% of decisive two-view case pairs -- the position effect is not a tiebreaker, it is the dominant failure mode. Evidence carried by the tree: (a) prompt provenance (data/prompts/pass1/*.txt shows the literal judge prompts, tag-only output contract, and that the hidden edit request is withheld from judges); (b) per-model outcome decomposition (data/results/model_metrics.csv, position_bias.csv) -- GPT-5.4 (high reasoning) flips 66.3%, Mistral Large 3 shows inverted second-position bias, ByteDance Seed 2.0 Pro and DeepSeek V3.2 are the only decisive-coverage + low-flip combinations; (c) case-level sensitivity (case_metrics.csv) with named high-flip cases like "midnight bakery" at 87.5% flip and a worked example exposing the raw <answer> tags for both stable and flipping judges; (d) narrow scope acknowledgment (source_pair_metrics.csv shows one editor-pair surface: Claude Sonnet 4.6 high-reasoning vs GPT-5.4 high-reasoning). Pass-1 response coverage is ~100% for almost every model, so the findings are not coverage artifacts.
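
The two headline numbers reduce to a simple computation over paired views. A minimal sketch, assuming a hypothetical list of (A-first pick, B-first pick) tuples in canonical labels -- this illustrates the metric definitions, not the repo's actual schema:

```python
def flip_and_first_rates(pairs):
    """pairs: list of (pick_ab, pick_ba), where each pick is the canonical
    candidate ('A' or 'B') the judge chose in that display order; ties and
    refusals are excluded from the decisive denominator."""
    decisive = [(ab, ba) for ab, ba in pairs
                if ab in ("A", "B") and ba in ("A", "B")]
    # Flip: the canonical choice changes when the display order is swapped.
    flips = sum(1 for ab, ba in decisive if ab != ba)
    # First-shown pick: choosing A in the A-first view, or B in the B-first view.
    first_picks = sum((ab == "A") + (ba == "B") for ab, ba in decisive)
    return flips / len(decisive), first_picks / (2 * len(decisive))

# A judge that always prefers whichever candidate is displayed first flips
# every decisive pair and has a 100% first-shown pick rate:
rate_flip, rate_first = flip_and_first_rates([("A", "B"), ("A", "B")])
# rate_flip == 1.0, rate_first == 1.0
```

An order-invariant judge would score 0% flip and 50% first-shown; the reported 44.8% median flip and 63.3% first-shown rate sit far from both anchors.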

File Manifest

Files read in full:

- README.md -- top-level thesis, full leaderboard, method-in-brief, worked example, and list of related benchmarks; the primary load-bearing document.
- reports/summary.md -- auto-generated sanity summary (model count, case count, completion threshold, top-10 order-sensitive cases) that cross-checks README claims against the CSVs.
- data/README.md -- public data bundle contract: which CSV/JSONL files ship with the snapshot and what each row means; needed to read the metric tables correctly.
- data/manifest.json -- SHA-256 + row-count manifest for every published artifact; establishes reproducibility and that the tree's evidence is pinned.
- data/prompts/pass1/case_3__variant_a_first.txt -- one literal judge prompt (the "midnight bakery" case shown verbatim in the README worked example); lets us verify the tag contract, the fact that the hidden edit request is NOT shown to judges, and the answer-label normalization claim.
- data/results/model_metrics.csv -- 27-row per-model table of displayed-first rate, first lift, order-flip rate, decisive-pair coverage, tie rate, and refusal rate; the core numeric substrate under the README leaderboard.
- data/results/position_bias.csv -- per-model position-bias summary (first-position flip rate vs second-position flip rate vs stable choice) that decomposes the order-flip number into its directional pieces.
- data/results/category_metrics.csv -- topic-category breakdown (general / planning / reasoning / high-stakes) used to caveat that the non-general buckets are too small to support category-level claims.
- data/results/source_pair_metrics.csv -- single-row table confirming the entire snapshot uses one editor pair (Claude Sonnet 4.6 adaptive <> GPT-5.4 high); essential for bounding the generality claim.
- data/prompt_index.csv (first rows) -- per-prompt metadata linking case/view/seed/label-style/editor-model assignments to the actual prompt files; confirms seeds and label scheme are held constant across views.
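
Since the tree's evidentiary weight rests on the pinned hashes, re-verifying the manifest locally is cheap. A sketch, assuming the manifest is a flat {relative-path: {"sha256": ...}} mapping -- the actual manifest.json layout was not inspected field-by-field, so treat the schema as an assumption:

```python
import hashlib
import json
from pathlib import Path

def verify_manifest(root, manifest_rel="data/manifest.json"):
    """Return the manifest entries whose SHA-256 no longer matches the file
    on disk; an empty list means the snapshot matches the pin."""
    root = Path(root)
    manifest = json.loads((root / manifest_rel).read_text())
    mismatches = []
    for rel, meta in manifest.items():
        digest = hashlib.sha256((root / rel).read_bytes()).hexdigest()
        if digest != meta["sha256"]:
            mismatches.append(rel)
    return mismatches
```

Running this against the 483150e checkout would close the "I did not re-verify the hashes" gap noted in the limitations.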

Connections Found

Connect landed the source in the KB's judge-reliability / oracle-theory cluster; all ten outbound candidates are evidence into kb/notes/, because the source is empirical measurement rather than a new theoretical claim.

Strongest attractor: brainstorming-how-to-test-whether-pairwise-comparison-can-harden-soft-oracles, which already lists "position bias rate" as one of its planned measures -- Mazur supplies both the metric operationalisation (decisive-pair flip rate) and an unmitigated baseline (44.8% median flip, 63.3% first-shown pick) the seedling was carrying as a placeholder. It is also evidence for systematic-prompt-variation-serves-verification-and-diagnosis (order-swap is a textbook diagnostic prompt variation: surface varied, task semantics preserved, invariance is the success criterion) and for operational-signals-that-a-component-is-a-relaxing-candidate (the first "relaxing" signal is "brittleness under paraphrase or reordering" -- Mazur measures exactly that, at the judge layer rather than the task-solver layer).

Secondary evidence edges land at error-correction-works-above-chance-oracles-with-decorrelated-checks, interpretation-errors-are-failures-of-the-interpreter, reliability-dimensions-map-to-oracle-hardening-stages, the-augmentation-automation-boundary-is-discrimination-not-accuracy, oracle-strength-spectrum, quality-signals-for-kb-evaluation, and evaluation-automation-is-phase-gated-by-comprehension.

Source-to-source: the benchmark sits between the Koylan pairwise ingest (pairwise as a cleaner primitive) and the Autoreason ingest (blind-label Borda as a mitigation); it is the "unmitigated failure measurement" move between them. Connect also flagged an off-authorisation match with agent-skills-for-context-engineering, which explicitly recommends order-swap mitigation; the correct route is a reverse edge from the review, not an outbound from here.
Key insight for the graph: the source turns "LLM judges are noisy" from a hand-wave into a numeric prior and puts specific reverse-edge evidence within reach of seven seedling notes that currently cite only task-solver-layer evidence for equivalent claims.

Extractable Value

  1. Numeric prior for single-pass pairwise-judge contamination -- Use the 44.8% median order-flip rate and 63.3% first-shown pick rate as the default pessimistic priors for LLM-as-judge reliability on sibling-quality comparisons. This is immediately usable when arguing that any single-view judge eval must randomize or aggregate both orders. Cite data/results/model_metrics.csv for the per-model figures. High reach across judge-based evaluation broadly. [quick-win]
  2. Directional decomposition of the flip rate -- position_bias.csv separates first-position flips from second-position flips and stable-choice share. Two models with the same flip rate can have opposite failure modes (Mistral Large 3 concentrates in second-position flips; most others concentrate first). Operational rule: never report a flip rate without the directional split and decisive-pair coverage. High reach. [quick-win]
  3. Coverage-corrected judge ranking -- The Xiaomi MiMo result (19.8% flip, 54.9% decisive coverage, 30% tie rate) shows that a low flip number can be model-conservatism rather than order-invariance. Pair flip rate with decisive-pair coverage before ranking. Directly adoptable. [quick-win]
  4. Reusable judge-prompt contract -- The tag-only contract (<rating_first>1..7</rating_first>, <rating_other>1..7</rating_other>, <answer>1|2|TIE|INSUFFICIENT</answer>) delivered ~100% parseable response coverage with 0% insufficient rate and 4 total refusals across 10422 rows. Worth reusing in our own judge harnesses. Moderate reach. [quick-win]
  5. Sibling-edit isolation as a diagnostic design -- Two editor models applying the same bounded change to the same base story produces near-identical candidates that differ on a single surgical axis. This is a transportable diagnostic-prompt-variation pattern for isolating order, tone, length, or structural effects at the judge layer. Moderate reach; cross-applies to any quality-comparison eval we build. [experiment]
  6. Designed-panel decorrelation test -- Connect flagged that Mistral Large 3's inverted second-position bias, combined with the majority first-position-biased panel, raises an unanswered question: does a deliberately mixed panel (including a known-inverted judge) decorrelate order bias better than a randomly sampled panel? The model_answers.jsonl raw responses would let us test this without re-running the benchmark. [experiment]
  7. Generality gap (trustworthiness flag) -- source_pair_metrics.csv confirms one editor pair (Claude Sonnet 4.6 high <> GPT-5.4 high) and category_metrics.csv confirms 188/193 cases in the general bucket. Consumers should treat the 44.8% figure as a sibling-edit, general-purpose short-story sensitivity, not a universal pairwise-judging claim. The README is explicit about this; downstream notes should inherit the caveat. [just-a-reference]
  8. Open re-analysis surface -- model_answers.jsonl carries raw response text for all 10422 rows, which opens cheap secondary analyses: correlation between reasoning-token count and flip rate within a family; per-topic leakage into the position effect; whether rating-scale placement interacts with the final <answer> tag. The repo gives us the substrate, not the analysis. [deep-dive]
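
Items 2 and 4 are concrete enough to sketch. Below, parse_answer implements the tag-only contract from item 4, and directional_split implements the first/second/stable decomposition from item 2 over two-view pairs; the input shape is a hypothetical canonical-label tuple list, since model_answers.jsonl's real field names were not catalogued here:

```python
import re

# The <answer> contract from the published prompts: 1, 2, TIE, or INSUFFICIENT.
ANSWER_RE = re.compile(r"<answer>(1|2|TIE|INSUFFICIENT)</answer>")

def parse_answer(raw):
    """Extract the judge's verdict from a raw response; None means the
    response is unparseable and counts against coverage, not as a vote."""
    m = ANSWER_RE.search(raw)
    return m.group(1) if m else None

def directional_split(view_pairs):
    """view_pairs: list of (canonical pick when A is shown first,
    canonical pick when B is shown first). Returns counts of
    (first-position flips, second-position flips, stable choices)."""
    first = second = stable = 0
    for ab, ba in view_pairs:
        if ab not in ("A", "B") or ba not in ("A", "B"):
            continue  # ties/refusals are outside the decisive denominator
        if ab == ba:
            stable += 1       # same canonical winner in both display orders
        elif ab == "A":       # ba == "B": picked the first-shown both times
            first += 1
        else:                 # ab == "B", ba == "A": picked the second-shown
            second += 1
    return first, second, stable
```

The split motivates item 2's operational rule: a panel where most judges load on `first` but one (like Mistral Large 3 here) loads on `second` can show identical aggregate flip rates with opposite, partially cancelling failure modes.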

Limitations (our opinion)

  • Read-only inspection. I did not execute any code from this repo -- there is no runner code checked in. It is a data-and-reports bundle: claims rest on internal consistency between README.md, reports/summary.md, and the CSVs, plus the SHA-256 manifest in data/manifest.json. I did not re-verify the hashes, and the upstream prompting/judging pipeline that produced model_answers.jsonl is not in the tree, so we cannot audit the answer-generation step.
  • Narrow task surface. One editor pair (per source_pair_metrics.csv), and 188 of 193 cases in the general-purpose short-story bucket; non-general buckets are too small for category-level claims. The 44.8% flip figure belongs to this sibling-edit surface, not to arbitrary pairwise-judging tasks.
  • The sibling-edit design inflates, or at least changes the meaning of, the measured sensitivity. When the two candidates are near-identical, small textual cues plausibly dominate, so the flip rate should not be transported unchanged to evals where candidates differ substantively.
  • Judge roster is a specific 2026-Q1 snapshot of 27 models chosen by one author; no guarantee of representativeness, and reasoning-vs-non-reasoning configurations sometimes diverge by more than the model-family effect (GPT-5.4 high vs no-reasoning vs Mini xhigh all land separately on the leaderboard).
  • No human-rater baseline. The benchmark does not measure how often humans would flip on the same pairs, so we cannot say whether LLM order sensitivity exceeds or tracks human order sensitivity.
  • Provenance is preprint-tier. Single "initial" commit on 2026-04-21; upstream repo slug inferred from the "Related Benchmarks" list. Downstream notes should avoid phrasing that treats the 44.8% number as load-bearing on this single source -- it is the best measurement we have at the judging layer but not yet cross-replicated.
  • Pinned here at commit 483150e (2026-04-21). This snapshot is the first public version; future commits will change the model roster, the case set, and potentially the metric definitions. The analysis above rots as the repo evolves -- re-check against the pin before citing specific numbers.

Next Actions

Write a new note in kb/notes/ titled approximately "Pairwise LLM judges flip ~45% of the time under order swap on sibling-edit pairs". It should (a) state the median-44.8% flip and 63.3% first-shown pick as numeric priors, citing this ingest report as evidence; (b) connect laterally to brainstorming-how-to-test-whether-pairwise-comparison-can-harden-soft-oracles (whose placeholder "position bias rate" measure this benchmark grounds) and to operational-signals-that-a-component-is-a-relaxing-candidate as judge-layer evidence for brittleness-under-reordering; (c) codify the operational rule that any single-view LLM-as-judge eval must be treated as order-contaminated unless it randomizes order, aggregates both swapped views, or demonstrates sub-10% flip on a sibling-pair probe set. After the note lands, author the seven reverse-edge evidence links Connect identified (strongest first: brainstorming, operational-signals, interpretation-errors, reliability-dimensions) so the snapshot is reachable from the existing seedling cluster.