Ingest: Beyond Transformers: Sudoku Bench
Type: practitioner-report
Source: pathway-beyond-transformers-sudoku-bench.md
Captured: 2026-03-26
From: https://pathway.com/research/beyond-transformers-sudoku-bench
Classification
Type: practitioner-report -- Pathway built something (BDH) and reports benchmark results, but the methodology and architecture are undisclosed. Not a scientific paper (no peer review, no methodology section, no reproducible setup). Not a conceptual essay (there are concrete benchmark numbers). The company is using its own product's results to argue a thesis about transformer limitations.
Domains: ml-architecture, constraint-satisfaction, reasoning-benchmarks
Author: "Pathway Team" -- Pathway describes itself as a "post-transformer neo-lab." No individual authors named. The team has commercial incentive to overstate both transformer limitations and BDH capabilities. The company appears to be an AI startup positioning against the transformer-dominated landscape. Credibility is low for the BDH claims (vendor benchmarks), moderate for the framing of transformer limitations (which aligns with independent research).
Summary
Pathway reports that their BDH (Dragon Hatchling) model achieves 97.4% accuracy on approximately 250,000 extreme Sudoku puzzles, while leading LLMs (O3-mini, DeepSeek R1, Claude 3.7) score 0%. They use this gap as evidence that the transformer architecture has fundamental limits for constraint-satisfaction reasoning -- arguing that the bottleneck is token-by-token processing with limited internal state per token (~1,000 floats), which prevents holding multiple candidate strategies in parallel. The article then argues that "post-transformer" architectures with larger latent reasoning spaces are necessary for progress toward AGI, and positions BDH as such a system -- one that maintains language fluency while adding native constraint-solving capability via "intrinsic memory mechanisms" and "continual learning." The article extends the Sudoku result into a claim about "generative strategy" for real-world constraint problems in medicine, law, operations, and planning.
Connections Found
/connect discovered 7 genuine connections, rejected 8 candidates, and flagged one synthesis opportunity.
Closest sibling: induction-bias-sequence-models-ebrahimi-2026.ingest -- complements. Both argue transformers have fundamental architectural limits for structured reasoning, but with dramatically different evidence quality. Ebrahimi et al. demonstrate this rigorously on algebraic state tracking with 190,000 controlled runs and the quantitative kappa metric. Pathway demonstrates it on constraint satisfaction with a benchmark comparison but no disclosed methodology. Together they bracket the architectural-limits claim from two problem domains, with Ebrahimi providing the scientific grounding and Pathway providing the applied/commercial data point.
Primary theoretical frame: bitter-lesson-boundary -- exemplifies. Sudoku is a clean calculator-class problem: the spec IS the problem (constraint satisfaction with fully defined rules), verification is trivial (check rows/columns/boxes), but solving requires search through interacting possibilities. The 0% LLM accuracy vs 97.4% purpose-built architecture accuracy is a direct instance of the bitter lesson boundary: architectural constraint beats general-method scaling.
Supporting connections:
- oracle-strength-spectrum -- exemplifies: Sudoku sits at the hard-oracle end (trivially verifiable, hard to generate), the exact profile where purpose-built solvers should dominate.
- the-boundary-of-automation-is-the-boundary-of-verification -- exemplifies: Trivial verification + hard generation = automation bottleneck is in architecture, not oracle construction.
- codification-and-relaxing-navigate-the-bitter-lesson-boundary -- exemplifies: BDH's "latent reasoning space" is an architectural relaxing move replacing codified chain-of-thought.
- information-value-is-observer-relative -- exemplifies: Same puzzle, dramatically different results depending on architecture = architecture IS the computational bound.
- effective-context-is-task-relative-and-complexity-relative-not-a-fixed-model-constant -- extends: The Pathway source pushes beyond task-relative context capacity to task-relative architecture -- for some problem classes, the limitation is the kind of internal representation, not its size.
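The "trivially verifiable, hard to generate" profile above can be made concrete: checking a completed Sudoku grid takes a few lines of code, while producing a solution requires search through interacting possibilities. A minimal verifier sketch (not Pathway's code or the cited benchmark's harness; just the standard check):

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column, and 3x3 box
    must contain the digits 1..9 exactly once."""
    target = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    return all(set(unit) == target for unit in rows + cols + boxes)
```

This is the whole oracle: O(81) work, no search. The asymmetry between this check and the search needed to fill in an extreme puzzle is what places Sudoku at the hard-oracle end of the spectrum.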
Synthesis opportunity flagged: Three sources now converge on the same architectural limitation claim from different problem domains: Ebrahimi (algebraic state tracking), ConvexBench (compositional reasoning), and Pathway (constraint satisfaction). A synthesis note could catalog the problem classes where transformers demonstrably fail despite trivial verification, map them to the bitter lesson boundary, and assess evidence quality across the three.
Extractable Value
- Third problem domain for the architectural-limits cluster -- Constraint satisfaction (Sudoku) joins algebraic state tracking (Ebrahimi) and compositional reasoning (ConvexBench) as a domain where transformers fail at scale despite trivial verification. The convergence across three independent problem domains strengthens the case that the limitation is architectural. High reach: the pattern transfers to any domain with well-defined constraint structures and cheap verification. [quick-win]
- The "code escape" observation is a clean illustration of a real distinction -- The article notes that LLMs prompted cleverly may write a Sudoku solver in Python rather than reason about the puzzle directly. This is a concrete example of the difference between native reasoning and tool-mediated reasoning. The LLM doesn't solve the constraint problem; it offloads it to an arithmetic engine. This maps directly to the bitter lesson boundary: the LLM recognizes the problem class and routes to a calculator, which is actually competent architecture-aware behavior, not a failure. Low reach as stated (the article frames it as a weakness), but the reframing has high reach. [experiment]
- Claimed continual learning from interaction -- BDH allegedly learns from every interaction and reaches "advanced-beginner level in a new game in 20 minutes." If true, this would be a concrete implementation of deploy-time learning that our KB discusses as a missing capability. But: no methodology disclosed, no independent evaluation, no detail on what "learns" means mechanically. Low reach due to evidence quality. [just-a-reference]
- The "generative strategy" concept -- Moving from summarizing/retrieving past solutions to generating novel strategies that satisfy multiple constraints simultaneously. The framing is interesting but unsupported by evidence beyond the Sudoku result. The gap between "solves Sudoku" and "generates medical treatment strategies" is enormous, and the article does not attempt to bridge it with evidence. Low reach: context-bound assertion. [just-a-reference]
- Benchmark scale is notable -- 250,000 extreme Sudoku puzzles provides statistical power that most reasoning benchmarks lack. The benchmark design itself (hard instances of a trivially verifiable problem, evaluated at scale) is a useful pattern for testing architectural limits, regardless of what one thinks of BDH's claimed results. Moderate reach. [quick-win]
- The "10x lower cost" claim -- By reasoning in latent space rather than generating verbose chain-of-thought, BDH claims to achieve its results at 10x lower computational cost. If validated independently, this would be evidence that CoT is an expensive workaround for an architectural limitation, not an efficient reasoning strategy. Not independently validated. [just-a-reference]
Curiosity Gate
What is most surprising? The 0% figure for all leading LLMs on extreme Sudoku. Not "low" -- zero. This is surprising because O3-mini and similar reasoning models have demonstrated sophisticated mathematical reasoning, and Sudoku is formally simpler than many math competition problems they can solve. The article does not explain what "0%" means precisely -- whether it's 0% on all 250,000 puzzles, whether any partial credit was given, or what prompting strategies were tested. The arxiv reference cited (2506.21734) should be checked. If the 0% truly means "no LLM solved even one extreme Sudoku puzzle correctly," that's a strong signal. If it means "average accuracy rounds to 0%," it's weaker. The surprise here is in the absolute nature of the failure, which suggests these puzzles may be genuinely beyond the computational capacity of chain-of-thought reasoning at any prompt-engineering effort level.
What's the simpler account? For the Sudoku result itself: Sudoku is a well-understood constraint-satisfaction problem with known efficient algorithmic solutions (backtracking with constraint propagation). Any system that implements even a basic backtracking search will solve extreme Sudoku reliably. BDH may simply have learned to implement backtracking in its latent space -- which is impressive as a learning outcome but does not require anything as grand as "post-transformer reasoning" or "latent reasoning space." The simpler account is: BDH has an architecture that can learn recursive search procedures; transformers cannot. This is exactly the Ebrahimi finding (recurrent architectures learn step-by-step state tracking efficiently; transformers don't), applied to a different task. No need to invoke "generative strategy," "eureka moments," or the path to AGI.
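The simpler account is checkable: plain backtracking with per-cell constraint propagation solves even extreme instances quickly, with no learning at all. A minimal sketch of that classical algorithm (this is the baseline the simpler account invokes, not BDH's method):

```python
def solve(grid):
    """Solve a 9x9 Sudoku in place via backtracking search.
    Blank cells are 0. Returns True if a solution was found."""
    def candidates(r, c):
        # Constraint propagation: values already used in the
        # cell's row, column, and 3x3 box are excluded.
        used = set(grid[r]) | {grid[i][c] for i in range(9)}
        br, bc = 3 * (r // 3), 3 * (c // 3)
        used |= {grid[i][j] for i in range(br, br + 3)
                            for j in range(bc, bc + 3)}
        return [v for v in range(1, 10) if v not in used]

    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for v in candidates(r, c):
                    grid[r][c] = v
                    if solve(grid):
                        return True
                    grid[r][c] = 0  # undo and try next value
                return False  # dead end: backtrack
    return True  # no blanks left: solved
```

Any system that can represent and execute this recursive search solves extreme Sudoku reliably, which is why the interesting question is what BDH learned, not whether the task is solvable.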
Is the central claim hard to vary? The article makes two separable claims: (1) transformers are fundamentally limited for constraint satisfaction, and (2) BDH's latent reasoning architecture is the solution. Claim 1 is hard to vary -- you can't swap "constraint satisfaction" for arbitrary other tasks and keep the conclusion, and the evidence (0% vs 97.4%) directly depends on the claim. Claim 2 is easy to vary -- any architecture with recursive/recurrent state tracking would likely produce similar results (traditional constraint solvers achieve ~100% without any ML). The article conflates an architectural limitation finding (solid) with an argument for their specific commercial product (unfounded).
Limitations (our opinion)
What is not visible:
- BDH architecture is completely undisclosed. No paper, no technical report, no diagram, no description of what "latent reasoning space" or "intrinsic memory mechanisms" actually mean mechanically. For all we know, BDH could be a thin wrapper around a constraint solver, an RNN-based architecture, or something genuinely novel. Without disclosure, the benchmark result is unfalsifiable -- we cannot assess whether the architecture is novel or whether the result is simply "a backtracking solver beats a language model at constraint satisfaction," which would be unsurprising.
- LLM evaluation methodology is absent. Which versions of O3-mini, DeepSeek R1, and Claude 3.7 were tested? What prompting strategies? Was tool use (code writing) permitted? Were any chain-of-thought augmentations or multi-attempt strategies tested? The article cites arxiv 2506.21734 for the LLM accuracy score, but the BDH evaluation methodology is entirely internal. The induction-bias-sequence-models-ebrahimi-2026.ingest source ran 190,000 controlled training runs with explicit hyperparameter search -- the methodological gap is enormous.
- Survivorship bias and commercial incentive. Pathway is selling BDH as a commercial product. The benchmark was chosen because BDH performs well on it. We do not know what BDH fails at, what other benchmarks were tried and abandoned, or what the failure modes look like. The article presents zero failure cases and zero limitations of BDH -- a red flag for any honest technical evaluation.
- The domain-transfer claim is entirely speculative. The article leaps from "solves Sudoku" to "could help generate strategy in medicine, law, operations, and planning" without any intermediate evidence. Constraint satisfaction in Sudoku is fully specified with no ambiguity; real-world constraint problems in medicine and law involve ambiguous constraints, incomplete information, and context-dependent priorities. The transfer is asserted but represents an entirely different (and much harder) problem class. The bitter-lesson-boundary framework would classify Sudoku as calculator-class (spec IS the problem) and medical treatment planning as having significant vision-feature components (the "constraints" are theories about what matters, not formal definitions).
- The "continual learning" claim is unverifiable. "Advanced-beginner level in a new game in 20 minutes" -- which game? What does "advanced-beginner" mean? Measured how? No details provided, no independent evaluation.
- The Feynman analogy is misleading. The article claims AI needs "eureka moments" and "unstructured reasoning" for creativity, citing Feynman's plate-spinning insight. This conflates unconscious/implicit pattern recognition (which may indeed happen in latent spaces) with the specific claim that BDH does this. The analogy does no argumentative work -- it makes the reader feel that BDH enables creativity without providing evidence for it.
Recommended Next Action
Write a synthesis note titled "Transformer architectural limits cluster around verifiable constraint problems" connecting to bitter-lesson-boundary, oracle-strength-spectrum, induction-bias-sequence-models-ebrahimi-2026.ingest, convexbench-can-llms-recognize-convex-functions.ingest, and this source. The note would argue that three independent sources from different problem domains (algebraic state tracking, compositional reasoning, constraint satisfaction) converge on the same finding: transformers fail on problems with well-defined structure that admits exact verification. This convergence strengthens the bitter lesson boundary's prediction that architectural constraint is a permanent advantage for calculator-class problems. Evidence quality varies dramatically (Ebrahimi >> ConvexBench > Pathway), and the note should assess what each source contributes and where the gaps remain.