Ingest: On the "Induction Bias" in Sequence Models

Type: kb/sources/types/ingest-report.md

Source: induction-bias-sequence-models-ebrahimi-2026.md Captured: 2026-03-03 From: https://arxiv.org/pdf/2602.18333

Classification

Type: scientific-paper -- Large-scale empirical study (190,000+ training runs) with systematic methodology, quantitative metrics (sharing factor kappa), and controlled comparisons across architectures and supervision regimes. Published as a preprint by Qualcomm AI Research.

Domains: ml-architecture, state-tracking, data-efficiency, inductive-bias

Author: M. Reza Ebrahimi, Defferrard, Panchal, Memisevic at Qualcomm AI Research. Ebrahimi has prior published work on transformer limitations in state tracking (COLM 2024) and bilinear RNN architectures (NeurIPS 2025). The team has systematic expertise in the transformer-vs-recurrent comparison space.

Summary

The paper conducts a large-scale empirical study comparing transformers and recurrent neural networks (LSTMs, Dense-SSMs) on state-tracking tasks (modular addition, permutation composition), focusing on in-distribution data efficiency rather than the usual OOD length-generalization framing. The central finding is that transformers require dramatically more training data than RNNs for state tracking -- sometimes by orders of magnitude -- because they lack what the authors call "induction bias": the structural constraint that forces step-by-step state updates. The paper introduces a sharing factor (kappa) that quantifies whether a model reuses learned mechanisms across different sequence lengths. Transformers show kappa near or below 1 (learning length-specific solutions in isolation, with destructive interference under CoT), while RNNs show kappa >> 1 (amortized learning via shared transition operators). The paper further shows that in-distribution data efficiency and cross-length weight sharing are highly correlated with OOD length-generalization ability, connecting the in-distribution and OOD failures as manifestations of the same underlying architectural limitation.

Connections Found

The /connect discovery found 4 strong and 4 moderate connections. Because this is an ML-architecture paper and the KB focuses on knowledge-system methodology, the connections are analogical -- cross-domain mappings between architectural constraints in neural networks and structural constraints in knowledge systems.

Strong connections: - bitter-lesson-boundary -- grounds: The paper provides the strongest quantitative evidence for the calculator side of the boundary. Transformers (the "general method leveraging computation") need orders-of-magnitude more data than RNNs with structural priors for state tracking -- a calculator-class problem. The kappa < 1 finding (destructive interference, kappa = 0.28 for transformers with CoT) is a direct counter-example to "more data is always better." Architectural constraint is permanent advantage here, not a temporary one that scale will dissolve. The bitter-lesson-boundary note's Relevant Notes section currently has only one entry; this paper should be added. - oracle-strength-spectrum -- exemplifies: The three supervision formats (outcome, CoT, ACoT) map to different oracle granularities. The key finding is that architecture interacts with oracle granularity -- RNNs benefit from per-step aligned supervision (ACoT) while transformers do not (they prefer non-aligned CoT). This demonstrates that oracle strength is necessary but not sufficient: the learner's architecture must match the oracle's structure for learning to be efficient. - structure-activates-higher-quality-training-distributions -- extends by analogy: Both describe the same mechanism at different levels -- structural constraints that do not add information but constrain the search space so learning converges faster. The KB note operates at the prompt level (structured templates activate higher-quality distributions); the paper operates at the architecture/weight level (recurrent induction bias constrains the hypothesis space so state tracking requires less data). Both are instances of constraint-as-learning-accelerator. - error-correction-works-above-chance-oracles-with-decorrelated-checks -- extends by analogy: Transformers' length-isolated learning (kappa near 1) is structurally analogous to correlated errors -- knowledge at one length does not transfer to another, the "votes" carry the same information and don't compound. RNNs' high kappa achieves the decorrelated, amortized learning that effective error correction requires. The sharing factor kappa is essentially a measure of decorrelation across problem variations.

Moderate connections: - information-value-is-observer-relative -- exemplifies: Architectures extract dramatically different amounts of structure from the same data, exactly as the observer-relative information framework predicts. The same state-tracking data has high effective information for RNNs and low effective information for transformers. The architecture IS the computational bound. - constraining -- parallels: Induction bias constrains the model's hypothesis space in the same way constraining narrows the interpretation space of an underspecified spec. Both trade generality for efficiency. The paper quantifies this trade-off for architectural constraints; the KB defines it for artifact constraints. - codification-and-relaxing-navigate-the-bitter-lesson-boundary -- grounds: Step-by-step decomposition is a permanent architectural advantage for state tracking, not a temporary one that scale dissolves. The kappa results show that the general method (transformer) not only fails to scale past the constraint -- it actively degrades (destructive interference) when exposed to more diverse data. - deploy-time-learning-the-missing-middle -- extends by analogy: The supervision regime findings (outcome < CoT < ACoT for RNNs, outcome < CoT for transformers) map onto the verifiability gradient. Making intermediate computation explicit and verifiable (process supervision vs outcome supervision) makes learning cheaper -- the same principle underlying the argument for verifiable artifacts in deployed systems.

Source-to-source: - meyerson-maker-million-step-llm-zero-errors.ingest -- complements: Ebrahimi et al. explain WHY transformers fail at long-range state tracking (kappa near 1, length-isolated learning). MAKER shows HOW to build reliable systems despite that failure via decomposition and error correction. Two papers bracket the same problem from opposite directions. - from-entropy-to-epiplexity-rethinking-information-computationally-bounded.ingest -- extends: The epiplexity paper provides the theoretical framework (observer-relative information extraction) that explains the induction bias paper's results. Different architectures are different computational bounds extracting different amounts of structure from the same data.

Synthesis opportunity: "Architectural constraints as a form of constraining" -- constraint-as-learning-accelerator operates at architecture level (this paper), artifact level (constraining), and prompt level (structure-activates-higher-quality-training-distributions), with the bitter lesson boundary determining which constraints are permanent vs temporary. A synthesis note could formalize this ladder.

Extractable Value

  1. The sharing factor kappa as a diagnostic metric -- kappa quantifies whether a system reuses learned mechanisms across problem variations or learns isolated solutions. This concept could transfer beyond ML architectures to evaluate whether knowledge-system structures (templates, indexes, conventions) enable amortized learning or force per-instance solutions. [experiment]

  2. Destructive interference is possible (kappa < 1) -- More data and more diversity can actively hurt when the architecture lacks the right inductive bias. Transformers with CoT on mixed lengths showed kappa = 0.28. This is a concrete counter-example to "more data is always better" and directly grounds the bitter-lesson-boundary note with quantitative evidence. [quick-win]

  3. Architecture-supervision interaction -- Transformers prefer CoT (non-aligned), RNNs prefer ACoT (aligned). The optimal supervision format depends on the learner's architecture, not just the task. This has implications for oracle design: the oracle must match the learner, not just the problem. [just-a-reference]

  4. In-distribution failure as the real story -- The paper reframes transformer state-tracking limitations from "OOD failure" to "in-distribution data inefficiency." Even with unlimited training distribution coverage, the data cost may be prohibitive. This is more practically relevant than the length-generalization framing for understanding scaling costs. [quick-win]

  5. Permutation composition generalizes beyond commutative tasks -- Findings hold for non-commutative state tracking (S_5), the canonical non-commutative structure by Cayley's theorem. The limitation is not specific to simple arithmetic but applies to any finite discrete state-tracking task. [just-a-reference]

  6. Step-by-step decomposition as permanent architectural advantage -- For calculator-class problems, forcing step-by-step computation is not a temporary crutch but a permanent efficiency gain. This strengthens the argument that some problems should never be approached with general methods, regardless of scale. [deep-dive]

Limitations (our opinion)

What was not tested:

  • Only synthetic algebraic tasks. The paper studies modular addition (Z_m) and permutation composition (S_5) -- both clean algebraic structures with exactly defined state transitions. Real-world state tracking (dialogue state, GUI interaction, multi-turn agent reasoning) involves ambiguous inputs, noisy state boundaries, and state spaces that are not formally enumerable. The paper's findings are provably correct for the algebraic regime but the transfer to "applied agentic scenarios" (which the authors invoke in Section 5) is asserted, not tested. The claim about "context rot" at long context lengths (Section 5) is especially speculative -- context degradation in language models involves many mechanisms beyond state-tracking failure.

  • Small, fixed-architecture models only. The transformer is a 6-layer GPT-2 variant with 256-dim embeddings; the RNNs are single-layer. The paper does not test whether scaling the transformer (more layers, larger dimension, different attention patterns) closes the gap. The authors acknowledge computational constraints make broader comparison infeasible (Section 5), but this leaves open the possibility that the data-efficiency gap narrows with scale -- which would be exactly the bitter lesson's prediction. The bitter-lesson-boundary framework would say it won't narrow for calculator-class tasks, but the paper itself does not test this.

  • No hybrid architectures. The paper compares pure transformers against pure RNNs. It does not test transformer-RNN hybrids, which are an active area of research (Phan et al., 2025, cited in the paper, show delayed attention training improves length generalization in hybrids). The practical implication may not be "use RNNs" but "add recurrent components to transformers for state-tracking subtasks" -- the paper's framing does not address this.

  • Binary success metric. N* is defined as the minimum dataset size where at least one of 15 hyperparameter configurations achieves validation loss below 10^{-4}. This is a strict pass/fail threshold. The paper does not report how close to convergence models get below the threshold, whether there is a smooth degradation curve, or whether the efficiency gap holds at less stringent convergence criteria. It is possible that transformers achieve "good enough" state tracking with much less data than "perfect" state tracking.

  • No curriculum or pretraining effects. The models are trained from scratch on the target task. In practice, transformers used for state tracking are pretrained on vast corpora and may develop implicit state-tracking heuristics that this experimental setup cannot capture. The short-to-long length distribution (Section 2) is a minimal curriculum but is not the kind of pretraining that characterizes deployed LLMs.

What this means for the conclusions: The paper's core finding -- that recurrent induction bias dramatically improves data efficiency on algebraic state tracking -- is robustly established within its experimental scope. The extrapolation to real-world agent systems, context rot, and practical LLM use (Section 5) should be treated as motivated hypotheses, not demonstrated results. The kappa metric and the destructive interference finding transfer cleanly as conceptual tools; the specific N* numbers do not.

Update bitter-lesson-boundary: add this paper as grounding evidence in the Relevant Notes section, with a relationship like "grounds: 190k-run empirical study demonstrating that architectural constraint (recurrent induction bias) permanently outperforms the general method (transformer) on calculator-class state tracking, with destructive interference (kappa = 0.28) showing more diverse data actively hurts the general method." The note currently has only one entry in its Relevant Notes section and urgently needs empirical grounding for its central claim.