Ingest: LLM Position Bias Benchmark (Swapped-Order Pairwise Judging)

Type: kb/sources/types/ingest-report.md

Source: kb/sources/position-bias/
Captured: 2026-04-21
From: https://github.com/lechmazur/ (Mazur position-bias benchmark family; specific repo slug inferred from the "Related Benchmarks" list)
Pin: 483150e

Classification

Type: code-repository -- a public data-and-reports bundle (no runner code checked in, but 193 verified story pairs, 386 published prompt files, 10422 parsed model-answer rows with raw response text, and nine aggregate result CSVs); a reproducible measurement artifact rather than a codebase.
Domains: evaluation, judge-reliability, position-bias, llm-as-judge
Author: Lechoslaw (Lech) Mazur (inferred from the GitHub org linked in "Related Benchmarks"; this snapshot has a single "initial" commit dated 2026-04-21).
Credibility signal: the author maintains a well-known suite of public LLM benchmarks; independent of model providers; methodology is transparent but single-team.

Summary

The benchmark measures whether LLM judges preserve their pairwise preference when the same two candidate stories are displayed in the opposite order. For each of 193 verified sibling-edit story pairs, 27 judge models see both display orders and emit tag-structured ratings plus a winner. Headline finding: across the report view, the model-average first-shown pick rate is 63.3% and the median model flips its underlying canonical choice in 44.8% of decisive two-view case pairs -- the position effect is not a tiebreaker, it is the dominant failure mode. Evidence carried by the tree: (a) prompt provenance (data/prompts/pass1/*.txt shows the literal judge prompts, tag-only output contract, and that the hidden edit request is withheld from judges); (b) per-model outcome decomposition (data/results/model_metrics.csv, position_bias.csv) -- GPT-5.4 (high reasoning) flips 66.3%, Mistral Large 3 shows inverted second-position bias, ByteDance Seed 2.0 Pro and DeepSeek V3.2 are the only decisive-coverage + low-flip combinations; (c) case-level sensitivity (case_metrics.csv) with named high-flip cases like "midnight bakery" at 87.5% flip and a worked example exposing the raw <answer> tags for both stable and flipping judges; (d) narrow scope acknowledgment (source_pair_metrics.csv shows one editor-pair surface: Claude Sonnet 4.6 high-reasoning vs GPT-5.4 high-reasoning). Pass-1 response coverage is ~100% for almost every model, so the findings are not coverage artifacts.
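
The two headline numbers reduce to a simple computation over paired views. A minimal sketch, assuming a hypothetical list of (A-first pick, B-first pick) tuples in canonical labels -- this illustrates the metric definitions, not the repo's actual schema:

```python
def flip_and_first_rates(pairs):
    """pairs: list of (pick_ab, pick_ba), where each pick is the canonical
    candidate ('A' or 'B') the judge chose in that display order; ties and
    refusals are excluded from the decisive denominator."""
    decisive = [(ab, ba) for ab, ba in pairs
                if ab in ("A", "B") and ba in ("A", "B")]
    # Flip: the canonical choice changes when the display order is swapped.
    flips = sum(1 for ab, ba in decisive if ab != ba)
    # First-shown pick: choosing A in the A-first view, or B in the B-first view.
    first_picks = sum((ab == "A") + (ba == "B") for ab, ba in decisive)
    return flips / len(decisive), first_picks / (2 * len(decisive))

# A judge that always prefers whichever candidate is displayed first flips
# every decisive pair and has a 100% first-shown pick rate:
rate_flip, rate_first = flip_and_first_rates([("A", "B"), ("A", "B")])
# rate_flip == 1.0, rate_first == 1.0
```

An order-invariant judge would score 0% flip and 50% first-shown; the reported 44.8% median flip and 63.3% first-shown rate sit far from both anchors.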

File Manifest

Files read in full:

- README.md -- top-level thesis, full leaderboard, method-in-brief, worked example, and list of related benchmarks; the primary load-bearing document.
- reports/summary.md -- auto-generated sanity summary (model count, case count, completion threshold, top-10 order-sensitive cases) that cross-checks README claims against the CSVs.
- data/README.md -- public data bundle contract: which CSV/JSONL files ship with the snapshot and what each row means; needed to read the metric tables correctly.
- data/manifest.json -- SHA-256 + row-count manifest for every published artifact; establishes reproducibility and that the tree's evidence is pinned.
- data/prompts/pass1/case_3__variant_a_first.txt -- one literal judge prompt (the "midnight bakery" case shown verbatim in the README worked example); lets us verify the tag contract, the fact that the hidden edit request is NOT shown to judges, and the answer-label normalization claim.
- data/results/model_metrics.csv -- 27-row per-model table of displayed-first rate, first lift, order-flip rate, decisive-pair coverage, tie rate, and refusal rate; the core numeric substrate under the README leaderboard.
- data/results/position_bias.csv -- per-model position-bias summary (first-position flip rate vs second-position flip rate vs stable choice) that decomposes the order-flip number into its directional pieces.
- data/results/category_metrics.csv -- topic-category breakdown (general / planning / reasoning / high-stakes) used to caveat that the non-general buckets are too small to support category-level claims.
- data/results/source_pair_metrics.csv -- single-row table confirming the entire snapshot uses one editor pair (Claude Sonnet 4.6 adaptive <> GPT-5.4 high); essential for bounding the generality claim.
- data/prompt_index.csv (first rows) -- per-prompt metadata linking case/view/seed/label-style/editor-model assignments to the actual prompt files; confirms seeds and label scheme are held constant across views.
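
Since the tree's evidentiary weight rests on the pinned hashes, re-verifying the manifest locally is cheap. A sketch, assuming the manifest is a flat {relative-path: {"sha256": ...}} mapping -- the actual manifest.json layout was not inspected field-by-field, so treat the schema as an assumption:

```python
import hashlib
import json
from pathlib import Path

def verify_manifest(root, manifest_rel="data/manifest.json"):
    """Return the manifest entries whose SHA-256 no longer matches the file
    on disk; an empty list means the snapshot matches the pin."""
    root = Path(root)
    manifest = json.loads((root / manifest_rel).read_text())
    mismatches = []
    for rel, meta in manifest.items():
        digest = hashlib.sha256((root / rel).read_bytes()).hexdigest()
        if digest != meta["sha256"]:
            mismatches.append(rel)
    return mismatches
```

Running this against the 483150e checkout would close the "I did not re-verify the hashes" gap noted in the limitations.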

Connections Found

Connect landed the source in the KB's judge-reliability / oracle-theory cluster; all ten outbound candidates are evidence into kb/notes/, because the source is empirical measurement rather than a new theoretical claim.

Strongest attractor: brainstorming-how-to-test-whether-pairwise-comparison-can-harden-soft-oracles, which already lists "position bias rate" as one of its planned measures -- Mazur supplies both the metric operationalisation (decisive-pair flip rate) and an unmitigated baseline (44.8% median flip, 63.3% first-shown pick) the seedling was carrying as a placeholder. It is also evidence for systematic-prompt-variation-serves-verification-and-diagnosis (order-swap is a textbook diagnostic prompt variation: surface varied, task semantics preserved, invariance is the success criterion) and for operational-signals-that-a-component-is-a-relaxing-candidate (the first "relaxing" signal is "brittleness under paraphrase or reordering" -- Mazur measures exactly that, at the judge layer rather than the task-solver layer).

Secondary evidence edges land at error-correction-works-above-chance-oracles-with-decorrelated-checks, interpretation-errors-are-failures-of-the-interpreter, reliability-dimensions-map-to-oracle-hardening-stages, the-augmentation-automation-boundary-is-discrimination-not-accuracy, oracle-strength-spectrum, quality-signals-for-kb-evaluation, and evaluation-automation-is-phase-gated-by-comprehension.

Source-to-source: the benchmark sits between the Koylan pairwise ingest (pairwise as a cleaner primitive) and the Autoreason ingest (blind-label Borda as a mitigation); it is the "unmitigated failure measurement" move between them. Connect also flagged an off-authorisation match with agent-skills-for-context-engineering, which explicitly recommends order-swap mitigation; the correct route is a reverse edge from the review, not an outbound from here.
Key insight for the graph: the source turns "LLM judges are noisy" from a hand-wave into a numeric prior and puts specific reverse-edge evidence within reach of seven seedling notes that currently cite only task-solver-layer evidence for equivalent claims.

Extractable Value

  1. Numeric prior for single-pass pairwise-judge contamination -- Use the 44.8% median order-flip rate and 63.3% first-shown pick rate as the default pessimistic priors for LLM-as-judge reliability on sibling-quality comparisons. This is immediately usable when arguing that any single-view judge eval must randomize or aggregate both orders. Cite data/results/model_metrics.csv for the per-model figures. High reach across judge-based evaluation broadly. [quick-win]
  2. Directional decomposition of the flip rate -- position_bias.csv separates first-position flips from second-position flips and stable-choice share. Two models with the same flip rate can have opposite failure modes (Mistral Large 3 concentrates in second-position flips; most others concentrate first). Operational rule: never report a flip rate without the directional split and decisive-pair coverage. High reach. [quick-win]
  3. Coverage-corrected judge ranking -- The Xiaomi MiMo result (19.8% flip, 54.9% decisive coverage, 30% tie rate) shows that a low flip number can be model-conservatism rather than order-invariance. Pair flip rate with decisive-pair coverage before ranking. Directly adoptable. [quick-win]
  4. Reusable judge-prompt contract -- The tag-only contract (<rating_first>1..7</rating_first>, <rating_other>1..7</rating_other>, <answer>1|2|TIE|INSUFFICIENT</answer>) delivered ~100% parseable response coverage with 0% insufficient rate and 4 total refusals across 10422 rows. Worth reusing in our own judge harnesses. Moderate reach. [quick-win]
  5. Sibling-edit isolation as a diagnostic design -- Two editor models applying the same bounded change to the same base story produces near-identical candidates that differ on a single surgical axis. This is a transportable diagnostic-prompt-variation pattern for isolating order, tone, length, or structural effects at the judge layer. Moderate reach; cross-applies to any quality-comparison eval we build. [experiment]
  6. Designed-panel decorrelation test -- Connect flagged that Mistral Large 3's inverted second-position bias, combined with the majority first-position-biased panel, raises an unanswered question: does a deliberately mixed panel (including a known-inverted judge) decorrelate order bias better than a randomly sampled panel? The model_answers.jsonl raw responses would let us test this without re-running the benchmark. [experiment]
  7. Generality gap (trustworthiness flag) -- source_pair_metrics.csv confirms one editor pair (Claude Sonnet 4.6 high <> GPT-5.4 high) and category_metrics.csv confirms 188/193 cases in the general bucket. Consumers should treat the 44.8% figure as a sibling-edit, general-purpose short-story sensitivity, not a universal pairwise-judging claim. The README is explicit about this; downstream notes should inherit the caveat. [just-a-reference]
  8. Open re-analysis surface -- model_answers.jsonl carries raw response text for all 10422 rows, which opens cheap secondary analyses: correlation between reasoning-token count and flip rate within a family; per-topic leakage into the position effect; whether rating-scale placement interacts with the final <answer> tag. The repo gives us the substrate, not the analysis. [deep-dive]
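
Items 2 and 4 are concrete enough to sketch. Below, parse_answer implements the tag-only contract from item 4, and directional_split implements the first/second/stable decomposition from item 2 over two-view pairs; the input shape is a hypothetical canonical-label tuple list, since model_answers.jsonl's real field names were not catalogued here:

```python
import re

# The <answer> contract from the published prompts: 1, 2, TIE, or INSUFFICIENT.
ANSWER_RE = re.compile(r"<answer>(1|2|TIE|INSUFFICIENT)</answer>")

def parse_answer(raw):
    """Extract the judge's verdict from a raw response; None means the
    response is unparseable and counts against coverage, not as a vote."""
    m = ANSWER_RE.search(raw)
    return m.group(1) if m else None

def directional_split(view_pairs):
    """view_pairs: list of (canonical pick when A is shown first,
    canonical pick when B is shown first). Returns counts of
    (first-position flips, second-position flips, stable choices)."""
    first = second = stable = 0
    for ab, ba in view_pairs:
        if ab not in ("A", "B") or ba not in ("A", "B"):
            continue  # ties/refusals are outside the decisive denominator
        if ab == ba:
            stable += 1       # same canonical winner in both display orders
        elif ab == "A":       # ba == "B": picked the first-shown both times
            first += 1
        else:                 # ab == "B", ba == "A": picked the second-shown
            second += 1
    return first, second, stable
```

The split motivates item 2's operational rule: a panel where most judges load on `first` but one (like Mistral Large 3 here) loads on `second` can show identical aggregate flip rates with opposite, partially cancelling failure modes.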

Limitations (our opinion)

  • Read-only inspection. I did not execute any code from this repo -- there is no runner code checked in. It is a data-and-reports bundle: claims rest on internal consistency between README.md, reports/summary.md, and the CSVs, plus the SHA-256 manifest in data/manifest.json. I did not re-verify the hashes, and the upstream prompting/judging pipeline that produced model_answers.jsonl is not in the tree, so we cannot audit the answer-generation step.
  • Narrow task surface. One editor pair (per source_pair_metrics.csv), and 188 of 193 cases in the general-purpose short-story bucket; non-general buckets are too small for category-level claims. The 44.8% flip figure belongs to this sibling-edit surface, not to arbitrary pairwise-judging tasks.
  • The sibling-edit design inflates, or at least changes the meaning of, the measured sensitivity. When the two candidates are near-identical, small textual cues plausibly dominate, so the flip rate should not be transported unchanged to evals where candidates differ substantively.
  • Judge roster is a specific 2026-Q1 snapshot of 27 models chosen by one author; no guarantee of representativeness, and reasoning-vs-non-reasoning configurations sometimes diverge by more than the model-family effect (GPT-5.4 high vs no-reasoning vs Mini xhigh all land separately on the leaderboard).
  • No human-rater baseline. The benchmark does not measure how often humans would flip on the same pairs, so we cannot say whether LLM order sensitivity exceeds or tracks human order sensitivity.
  • Provenance is preprint-tier. Single "initial" commit on 2026-04-21; upstream repo slug inferred from the "Related Benchmarks" list. Downstream notes should avoid phrasing that treats the 44.8% number as load-bearing on this single source -- it is the best measurement we have at the judging layer but not yet cross-replicated.
  • Pinned here at commit 483150e (2026-04-21). This snapshot is the first public version; future commits will change the model roster, the case set, and potentially the metric definitions. The analysis above rots as the repo evolves -- re-check against the pin before citing specific numbers.

Next Actions

Write a new note in kb/notes/ titled approximately "Pairwise LLM judges flip ~45% of the time under order swap on sibling-edit pairs". It should (a) state the median-44.8% flip and 63.3% first-shown pick as numeric priors, citing this ingest report as evidence; (b) connect laterally to brainstorming-how-to-test-whether-pairwise-comparison-can-harden-soft-oracles (whose placeholder "position bias rate" measure this benchmark grounds) and to operational-signals-that-a-component-is-a-relaxing-candidate as judge-layer evidence for brittleness-under-reordering; (c) codify the operational rule that any single-view LLM-as-judge eval must be treated as order-contaminated unless it randomizes order, aggregates both swapped views, or demonstrates sub-10% flip on a sibling-pair probe set. After the note lands, author the seven reverse-edge evidence links Connect identified (strongest first: brainstorming, operational-signals, interpretation-errors, reliability-dimensions) so the snapshot is reachable from the existing seedling cluster.