Reliability dimensions map to oracle-hardening stages

Type: note · Status: seedling · Tags: llm-interpretation-errors

The oracle-strength spectrum describes a gradient from hard oracles (cheap deterministic checks) to no oracle (vibes). The engineering move is to harden oracles — convert no-oracle into some-oracle, then tighten. But which oracle are you hardening? The Rabanser et al. reliability framework (source) decomposes agent reliability into four dimensions, and each one targets a distinct verification question:

Dimension	Oracle question	Hardening move
Consistency	"Does this work?"	Run it again. Same input, same output? Converts interactive oracle to hard oracle via repetition.
Robustness	"Does this still work?"	Perturb the input. Paraphrase, inject faults, change context. Converts soft oracle ("it usually works") to hard oracle ("it works under these perturbations").
Predictability	"Will this work next time?"	Calibrate confidence. If the system says 80% and it's right 80% of the time, the confidence score is a soft oracle. Discrimination (assigning higher confidence to correct answers than to incorrect ones) would push it toward hard.
Safety	"What happens when it doesn't work?"	Bound the damage. Not a continuous score but a hard constraint — a gate, not a gradient. This is the only dimension that's already a hard oracle by design: either the failure is bounded or it isn't.
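The consistency and robustness rows can be made concrete with a minimal sketch. Everything named here (`consistency_rate`, `robustness_rate`, `toy_agent`) is hypothetical and not taken from the Rabanser et al. paper; the agent is assumed to be any callable from prompt string to output string.

```python
from collections import Counter
from typing import Callable, Iterable


def consistency_rate(agent: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Run the same prompt several times; return the share of runs that
    agree with the most common output (1.0 means fully consistent)."""
    outputs = [agent(prompt) for _ in range(runs)]
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / runs


def robustness_rate(
    agent: Callable[[str], str],
    prompt_variants: Iterable[str],
    is_correct: Callable[[str], bool],
) -> float:
    """Run perturbed versions of one task; return the share that still pass
    a deterministic correctness check."""
    variants = list(prompt_variants)
    passed = sum(is_correct(agent(p)) for p in variants)
    return passed / len(variants)


def toy_agent(prompt: str) -> str:
    # Hypothetical stand-in for a model call: adds the integers in the prompt.
    numbers = [int(tok) for tok in prompt.replace("?", " ").split() if tok.isdigit()]
    return str(sum(numbers))


print(consistency_rate(toy_agent, "What is 2 plus 3?"))
print(robustness_rate(toy_agent, ["add 2 and 3", "2 + 3 = ?"], lambda out: out == "5"))
```

Both functions return a rate rather than a pass/fail verdict, so turning them into a hard oracle still needs a threshold chosen per task.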

Why this mapping matters

The oracle-strength note says "invest in telemetry and eval harnesses before investing in capability, because guidance is the bottleneck." The reliability framework shows exactly where to invest: each dimension is a separate oracle that can be hardened independently. You don't need to solve all four at once.

The empirical finding that capability gains have outpaced reliability gains over 18 months of model releases confirms the oracle-strength prediction at scale: the bottleneck is verification quality, not generation quality. MAKER's million-step zero-error result demonstrates what happens when you take this seriously for consistency: decompose to minimal subtasks, vote across independent samples, discard red-flagged outputs. The entire MDAP (massively decomposed agentic processes) framework is architectural oracle hardening, and it works precisely because the per-step oracle is hard (each Towers of Hanoi move has a deterministic correct answer).
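The vote-and-discard pattern can be sketched for a single step as follows. This uses plain plurality voting as a simplification; MAKER's actual voting rule and red-flag criteria may differ, and `vote_step`, `sample`, and `is_red_flagged` are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Callable, Optional


def vote_step(
    sample: Callable[[], str],
    is_red_flagged: Callable[[str], bool],
    samples: int = 5,
) -> Optional[str]:
    """Draw several independent samples for one minimal subtask, drop any
    that trip a red flag, and return the most common survivor.
    Returns None if every sample was discarded."""
    kept = []
    for _ in range(samples):
        output = sample()
        if is_red_flagged(output):
            continue  # discard rather than try to repair
        kept.append(output)
    if not kept:
        return None
    winner, _ = Counter(kept).most_common(1)[0]
    return winner


# Toy usage: canned samples for one Towers of Hanoi move (assumed "X->Y" format).
canned = iter(["A->C", "garbage!!", "A->C", "A->B", "A->C"])
move = vote_step(
    sample=lambda: next(canned),
    is_red_flagged=lambda out: len(out) != 4 or out[1:3] != "->",
)
print(move)
```

Voting only helps when samples can be compared for equality, which is why the hard per-step oracle matters: free-form outputs would rarely match each other exactly.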

Connection to spec mining

Spec mining is the operational mechanism for consistency and robustness hardening. You watch failures, extract patterns, write deterministic checks. The Rabanser framework's Table 3 — mapping real-world failures to reliability metrics — is spec mining applied to evaluation itself: each failure class becomes a testable property.

The workflow becomes: observe failure → classify by reliability dimension → mine a spec for that dimension → the oracle hardens.
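As a rough sketch of that workflow, each mined spec can be stored as a deterministic predicate under the dimension its failure was classified into. The `OracleSuite` class and the example failure are hypothetical, not part of the Rabanser framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Dimension(Enum):
    CONSISTENCY = "consistency"
    ROBUSTNESS = "robustness"
    PREDICTABILITY = "predictability"
    SAFETY = "safety"


# A mined spec is a deterministic check on an agent output.
Spec = Callable[[str], bool]


@dataclass
class OracleSuite:
    specs: Dict[Dimension, List[Spec]] = field(
        default_factory=lambda: {d: [] for d in Dimension}
    )

    def mine(self, dimension: Dimension, spec: Spec) -> None:
        """Record a check extracted from an observed, classified failure."""
        self.specs[dimension].append(spec)

    def check(self, output: str) -> Dict[Dimension, bool]:
        """Run every mined spec. A dimension with no specs yet passes
        vacuously, which means 'no oracle', not 'verified'."""
        return {d: all(s(output) for s in checks) for d, checks in self.specs.items()}


suite = OracleSuite()
# Observed failure (hypothetical): the agent proposed a recursive delete.
# Classified as a safety failure, then mined into a deterministic check.
suite.mine(Dimension.SAFETY, lambda out: "rm -rf" not in out)

print(suite.check("ls -la")[Dimension.SAFETY])
print(suite.check("rm -rf /tmp/build")[Dimension.SAFETY])
```

Keeping the dimensions separate in the data structure mirrors the claim that each oracle can be hardened independently.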

The predictability gap

Predictability is the hardest dimension to harden because discrimination (not just calibration) requires the model to know what it doesn't know at the individual-task level. The paper finds calibration improving but discrimination stagnant — models get better at aggregate confidence but not at per-instance confidence. This suggests predictability will be the last oracle to harden, and the augmentation strategy (human-in-the-loop for uncertain cases) remains the pragmatic answer.
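A small worked example shows why calibration and discrimination come apart. The metrics below are assumptions chosen for illustration (expected calibration error for calibration, AUROC for discrimination); the paper may measure them differently.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    total = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [
            i for i, c in enumerate(confidences)
            if (lo < c <= hi) or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(avg_conf - accuracy)
    return total


def auroc(confidences: List[float], correct: List[bool]) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count as half). 0.5 means no discrimination."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


correct = [True] * 8 + [False] * 2   # an 80%-accurate agent
flat = [0.8] * 10                    # says "80%" on every task
sharp = [1.0] * 8 + [0.0] * 2        # knows which tasks it got wrong

for name, conf in (("flat", flat), ("sharp", sharp)):
    print(name, round(expected_calibration_error(conf, correct), 3), auroc(conf, correct))
```

Both confidence profiles are essentially perfectly calibrated, but only the second tells you which individual answers to distrust: the flat profile has an AUROC of 0.5 and the sharp one 1.0.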

This is the augmentation/automation boundary: a 90%-accurate agent with poor discrimination is fine as an augmentation (human catches the 10%) but dangerous as an automation (nobody catches it). An approval gate converts a weak predictability oracle into an interactive one — the human provides the discrimination the model lacks.
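An approval gate of this kind might look like the sketch below. The `Proposal` type, the `approve` callback standing in for the human, and the `auto_threshold` parameter are all hypothetical; the default sends every action to the human, which is the conservative choice when confidence does not discriminate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    action: str
    confidence: float  # the agent's self-reported confidence in [0, 1]


def approval_gate(
    proposal: Proposal,
    approve: Callable[[Proposal], bool],
    auto_threshold: Optional[float] = None,
) -> bool:
    """Return True if the action may run.

    With auto_threshold=None every proposal goes to the human (augmentation).
    Setting a threshold lets high-confidence proposals skip review, which is
    only as safe as the confidence score is discriminating (automation)."""
    if auto_threshold is not None and proposal.confidence >= auto_threshold:
        return True
    return approve(proposal)


def cautious_human(p: Proposal) -> bool:
    # Stand-in reviewer for the example: rejects anything that deletes.
    return "delete" not in p.action


risky = Proposal(action="delete staging table", confidence=0.99)
print(approval_gate(risky, cautious_human))
print(approval_gate(risky, cautious_human, auto_threshold=0.9))
```

The second call shows the failure mode: a confidently wrong proposal bypasses the reviewer as soon as the threshold trusts a score that cannot separate right from wrong.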

Relevant Notes:

oracle-strength-spectrum — foundation: the gradient from hard to no oracle that this note maps reliability dimensions onto
spec-mining-as-codification — the operational mechanism for hardening consistency and robustness oracles
deploy-time-learning — reliability hardening as deploy-time learning, not training-time learning
relaxing-signals — indicators for where a component sits on the spectrum; prompt robustness (R_prompt) is a relaxing signal measured at scale
MAKER: Solving a Million-Step LLM Task with Zero Errors — concrete architectural hardening: decomposition + voting hardens consistency, red-flagging hardens predictability, both enabled by hard per-step oracles
ABC: Agent Behavioral Contracts — extends: maps onto all four dimensions — safety (hard invariants), consistency (soft invariants with recovery), predictability (drift monitoring via D*=α/γ), robustness (compositionality theorem)
the augmentation-automation boundary is discrimination not accuracy — deepens: extracts and develops the predictability gap paragraph into a standalone claim — the boundary depends on per-instance discrimination, which is empirically stagnant
Ma et al. (2026). Prompt Stability in Code LLMs — exemplifies: AUC-E metric directly operationalizes robustness (R_Rob) hardening — measures how much prompt perturbation changes outputs, quantifying the soft-to-hard oracle transition for robustness
