Oracle strength spectrum

Type: note · Status: seedling · Tags: llm-interpretation-errors

The bitter lesson boundary draws a line between arithmetic (spec is the problem) and vision features (spec is a theory about the problem). This note proposes that the boundary is better understood as a gradient of oracle strength — how cheaply and reliably you can check whether output is correct — and explores what that would imply for engineering priorities. The framework is speculative; the individual hypotheses need testing.

The spectrum

  • Hard oracle: exact, cheap, deterministic check. Unit tests, type checks, cryptographic verification. The arithmetic regime.
  • Soft oracle: proxy score that correlates but isn't the real thing. BLEU, helpfulness rubrics, heuristic checks, consistency scores.
  • Interactive oracle: you can ask for feedback. User edits, thumbs up/down, preference pairs.
  • Delayed oracle: you only know later. Did the user churn? Did the bug surface? Did the decision pay off?
  • No oracle: vibes and anecdotes.
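The two ends of the spectrum can be made concrete. A minimal sketch (both oracles illustrative, not canonical): the hard oracle returns a definite verdict by construction, while the soft oracle returns a score whose relationship to correctness is itself a theory.

```python
def hard_oracle(claim: str) -> bool:
    """Hard oracle: exact, cheap, deterministic -- check an arithmetic
    claim of the form 'a + b = c'. The check IS the problem spec."""
    lhs, rhs = claim.split("=")
    a, b = (int(part) for part in lhs.split("+"))
    return a + b == int(rhs)

def soft_oracle(output: str, reference: str) -> float:
    """Soft oracle: a proxy score (here, token overlap) that correlates
    with quality but is not the real thing -- a stand-in for BLEU or
    a helpfulness rubric."""
    out, ref = set(output.split()), set(reference.split())
    return len(out & ref) / max(len(ref), 1)
```

The hard oracle can be wrong only if the spec is wrong; the soft oracle can be wrong even when the spec is right, which is exactly what makes it soft.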

The bitter lesson is strongest at the hard-oracle end, where there's a clear training signal for scale to optimise against, and weakest at the no-oracle end, where there's nothing. This maps to the Karpathy verifiability framing that deploy-time learning builds on: a task is verifiable to the extent it is resettable, efficient to retry, and rewardable — three properties that strengthen as oracle strength increases.

The engineering move: harden the oracle

If the boundary is a gradient, the core engineering challenge becomes: move components toward the hard-oracle end. Convert no-oracle into some-oracle, then tighten. This is codification applied to the objective itself, not just to the implementation.

A priority ordering follows: invest in telemetry and eval harnesses before investing in capability, because verification quality is the bottleneck, not generation quality. The Rabanser et al. reliability study offers suggestive evidence: across 14 models and 18 months of releases, capability gains yielded only small reliability improvements. If this pattern holds broadly — and it may not, since such findings are sensitive to the specific models and benchmarks used — it suggests that generation and verification improve on independent tracks, with verification lagging.

Concrete examples of oracle hardening:

  • Logging user corrections turns no-oracle into interactive oracle.
  • Adding schema validation turns soft-oracle ("does this look right?") into hard-oracle ("does this parse?").
  • Building regression test suites turns delayed-oracle into hard-oracle for known cases.
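The schema-validation example can be sketched directly. The hardened oracle asks "does it parse and match this schema?" rather than "does it look right?" (the schema and field names below are illustrative):

```python
import json

SCHEMA = {"name": str, "age": int}  # illustrative schema

def passes_schema(output: str, schema: dict = SCHEMA) -> bool:
    """Hard oracle manufactured from a soft question: deterministic,
    cheap, and exact for the properties the schema covers."""
    try:
        record = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (isinstance(record, dict)
            and set(record) == set(schema)
            and all(isinstance(record[k], t) for k, t in schema.items()))
```

The oracle is hard only for what the schema covers; everything the schema omits stays soft, which is why hardening is iterative rather than one-shot.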

Manufacture, amplify, monitor

Oracle hardening decomposes into three steps, each with its own methods and failure modes:

Manufacture. Spec mining creates oracles by extracting deterministic checks from observed behavior: watch the system, identify regularities, write verifiers. Each mined spec converts "does the output look right?" into "does it match this rule?" The reliability dimensions (consistency, robustness, predictability, safety) tell you which oracle to build next — each dimension targets a different verification question, so you can direct the mining at specific gaps.
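A toy version of the mining step, assuming the simplest possible regularities (a length bound and a character set — a real miner would propose much richer invariants):

```python
def mine_spec(observed_good: list[str]):
    """Watch correct outputs, extract regularities, return a verifier.
    The mined check asks 'does it match this rule?' instead of
    'does the output look right?'."""
    max_len = max(len(s) for s in observed_good)
    alphabet = set("".join(observed_good))

    def check(candidate: str) -> bool:
        return len(candidate) <= max_len and set(candidate) <= alphabet

    return check
```

Each mined regularity is a hypothesis about the spec, not the spec itself — which is why the monitor step below exists.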

Amplify. A mined spec doesn't need to be a perfect verifier. Error correction works whenever the oracle has discriminative power (TPR > FPR) and checks are decorrelated — the cost scales with 1/(TPR−FPR)², so even a weak spec is useful. This sets the manufacturing bar low: you need above-chance discrimination, not certainty. One reason external manufacturing matters: Rabanser et al. find that model self-assessment improves in calibration (aggregate confidence alignment) but not reliably in discrimination (per-instance separation of correct from incorrect). If models struggle to achieve TPR > FPR through introspection alone — a finding that may shift as models evolve — then spec mining's externally constructed checks become the primary source of discriminative oracles.
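The amplification claim is easy to check numerically: with n decorrelated checks that each pass a correct output with probability TPR and a bad one with probability FPR, majority vote drives error down, and halving the TPR−FPR gap costs roughly 4× the checks. A sketch:

```python
from math import comb

def majority_pass_prob(n: int, p: float) -> float:
    """Probability that a strict majority of n independent checks
    votes 'pass' when each votes pass with probability p (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# A barely-above-chance oracle: TPR = 0.6, FPR = 0.4.
# One check is wrong 40% of the time on both sides; with 51
# decorrelated checks, correct outputs pass over 90% of the time
# and incorrect ones under 10% -- amplification from a weak spec.
```

The independence assumption does the heavy lifting here: fifty copies of the same brittle check amplify nothing.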

Monitor. Relaxing signals detect when a hardened oracle encodes a vision feature rather than a genuine spec — brittleness under paraphrase, isolation-vs-integration gaps, sensitivity to distribution shift. These indicate the oracle is softer than it appears and the component may need to move back toward the learned regime.
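One relaxing signal can be sketched as a paraphrase-agreement check: run the oracle over variants that should be equivalent and measure how often its verdict flips (the function and the example oracle are illustrative):

```python
def brittleness(oracle, variants: list[str]) -> float:
    """Fraction of supposedly-equivalent variants on which the oracle
    disagrees with its own majority verdict. High values suggest the
    'spec' encodes surface form -- a vision feature, not a rule."""
    verdicts = [oracle(v) for v in variants]
    majority = verdicts.count(True) >= len(verdicts) / 2
    return verdicts.count(not majority) / len(verdicts)
```

A keyword-based oracle shows the failure mode directly: it flips on any paraphrase that drops the keyword, even when meaning is preserved.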

The steps have different failure modes: manufacturing without amplification gives a single fragile check; amplification without manufacturing leaves you voting over noise; either without monitoring risks locking in a vision feature as if it were arithmetic.

The generator/verifier pattern depends on this

The generator/verifier pattern — high-variance generator plus quality gate — is a common architectural choice, but it only works when oracle strength is sufficient. A quality gate that can't discriminate correct from incorrect outputs (TPR ≈ FPR) adds cost without adding reliability. The manufacture/amplify pipeline above is a prerequisite for generator/verifier architectures, not an optimisation.
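The dependence is visible in a minimal best-of-N sketch (the generator and verifier are stand-ins): with a discriminative gate the loop filters, while with a TPR ≈ FPR gate it degenerates into paying N times for one unchecked sample.

```python
import random

def best_of_n(generate, verify, n: int = 8, seed: int = 0):
    """Generator/verifier: sample up to n candidates from a
    high-variance generator, return the first one that passes the
    quality gate. If the gate cannot discriminate, good and bad
    candidates pass alike and reliability does not improve."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = generate(rng)
        if verify(candidate):
            return candidate
    return None  # nothing passed: surface the failure, don't guess
```

Returning None on exhaustion is the honest choice here — a gate that falls back to an unverified candidate silently reverts to the no-oracle regime.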

Maturation path

This note stays seedling because it bundles several speculative claims under a coherent narrative. To mature, extract each and find adequate support — literature, external sources, worked examples, or direct argument:

  1. Oracle strength is a gradient underlying the bitter lesson boundary — the core reframing. Currently asserted by analogy to the Karpathy verifiability framing. Needs independent support, e.g. from reinforcement learning literature on reward shaping or verification complexity theory.
  2. "Harden the oracle" is the primary engineering move — plausible prescription but no practitioner evidence. Cases where teams invested in eval infrastructure before capability (or failed by not doing so) would ground this.
  3. The manufacture/amplify/monitor decomposition — invented here. Each step has grounding (spec mining, error correction, relaxing signals) but the claim that these three compose into a complete pipeline is unverified. Are there missing steps?
  4. Capability gains and reliability gains track independently — leaning on Rabanser et al., which the note already hedges. Needs either stronger empirical evidence or a theoretical argument for why they decouple.
  5. Generator/verifier depends on oracle strength — reasonable but stated as fact. Could be grounded in the broader generate-and-test / best-of-N literature.

Each extracted claim should link back here as its origin.

Open questions

  • Does oracle strength predict bitter-lessoning? If so, the spectrum is prescriptive — invest in codification where oracles are hard, invest in learned approaches where oracles are soft. Deutsch's explanatory reach concept suggests a mechanism: hard oracles survive scaling because they ARE the problem specification — they have reach beyond any particular model's capabilities. Soft oracles encode adaptive fit (theories about what correct looks like) which scale reveals as approximations, just as it did for vision features. This would make oracle strength a proxy for how much reach the verification has. External support: Tam et al. observe that agentic coding tools automate engineering (hard oracle — tests, specs, benchmarks) while research problem selection (no oracle — "you can't know in advance whether a solution exists") resists automation entirely. Quant firms pay $600k for "research taste" precisely because it's a no-oracle domain. This is the oracle-strength prediction stated in market-economics language.
  • Oracle strength and codification timescales. Hard oracles codify fast (you can test immediately); delayed oracles codify slowly (you have to wait for signal). The connection to codification timescales seems natural but hasn't been tested.
  • Oracle strength is itself hard to assess. Proxy scores that seem cheap and reliable may turn out to correlate poorly with the real objective — you don't always know whether your oracle is hard or soft until you test at scale.

Relevant Notes: