Ingest: Agentic Code Reasoning

Type: scientific-paper

Source: agentic-code-reasoning.md

Captured: 2026-03-07

From: https://arxiv.org/html/2603.01896v2

Classification

Type: scientific-paper — arXiv preprint with explicit methodology, controlled task evaluations, comparative baselines, error analysis, and quantitative results across multiple datasets/tasks.

Domains: agentic-code-reasoning, execution-free-verification, structured-reasoning, reliability

Authors: Shubham Ugare and Satish Chandra (Meta). Author signal is moderate-to-strong: industry researchers reporting concrete benchmarks and ablations on practical SWE-agent workflows.

Summary

The paper introduces "agentic code reasoning": LLM agents that explore codebases and reason about code semantics without executing it. The core intervention is semi-formal reasoning templates that force agents to construct explicit premises, trace execution paths, and derive formal conclusions. Across three tasks (patch equivalence verification, fault localization, code question answering), structured reasoning consistently improves accuracy over unstructured baselines: patch equivalence rises from 78% to 88% on curated examples and reaches 93% on real agent-generated patches; code QA gains 9pp on RubberDuckBench; fault localization gains 5-12pp on Defects4J.

The paper's motivating application is replacing test execution with LLM-based verification in RL training pipelines, arguing that sandbox/CI setup per repository is too costly. However, the cost argument is asserted, not demonstrated — there is no comparison of LLM inference cost vs. execution environment cost. The paper's actual contribution is demonstrating that structured execution-free verification can reach 93% accuracy, making it a feasibility result, not a cost-effectiveness result.
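The three-stage template the paper describes (explicit premises, a stepped execution trace, a derived conclusion) can be sketched as a prompt scaffold. This is a minimal illustration, not the paper's actual schema: the class name, field names, and the patch-equivalence example below are all hypothetical.

```python
# Hypothetical sketch of a semi-formal reasoning template, following the
# paper's premises -> execution trace -> conclusion structure. All names
# and the worked example are illustrative, not the paper's exact schema.
from dataclasses import dataclass, field


@dataclass
class SemiFormalTrace:
    premises: list[str] = field(default_factory=list)     # facts established from the code
    trace_steps: list[str] = field(default_factory=list)  # stepped execution reasoning
    conclusion: str = ""                                  # verdict derived from the above

    def to_prompt_section(self) -> str:
        """Render the template as a scaffold the verifier agent must fill in."""
        lines = ["## Premises"]
        lines += [f"P{i}. {p}" for i, p in enumerate(self.premises, 1)]
        lines.append("## Execution trace")
        lines += [f"T{i}. {s}" for i, s in enumerate(self.trace_steps, 1)]
        lines.append("## Conclusion")
        lines.append(self.conclusion or "(to be derived from premises and trace)")
        return "\n".join(lines)


# Hypothetical patch-equivalence check framed in the template
trace = SemiFormalTrace(
    premises=[
        "Both patches modify parse_config only.",
        "The gold patch adds a None check before cfg.items().",
    ],
    trace_steps=["With cfg=None, the candidate raises TypeError; the gold patch returns {}."],
    conclusion="NOT EQUIVALENT: behavior diverges on cfg=None.",
)
print(trace.to_prompt_section())
```

The point of the scaffold is the interpretation-narrowing mechanism from Extractable Value #1: the agent must commit to premises and a trace before it is allowed to emit a verdict.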

Connections Found

Extractable Value

  1. Semi-formal reasoning templates as a reliability lever for code agents. Measured gains from process constraints without model retraining, directly usable in agent pipeline design. The mechanism is narrowing interpretation space before generation, consistent with agentic-systems-interpret-underspecified-instructions. [quick-win]
  2. Feasibility benchmark for execution-free verification. 93% accuracy on real-world patches establishes that LLM-based verification can approximate test outcomes, though the remaining 7% error rate and absence of cost analysis leave the deployment case unproven. [experiment]
  3. Quantified quality-vs-cost trade-off for structured reasoning. Semi-formal mode improves outcomes but increases step budget (~2.8x in reported settings), useful for architecture decisions. [experiment]
  4. Reusable failure-mode taxonomy for verifier agents. Incomplete trace coverage, third-party semantic guessing, and dismissal of subtle differences are actionable error classes for future eval harnesses. [quick-win]
  5. Process structure and output structure are separate design dimensions. Existing KB notes mostly discuss output schema quality; this source emphasizes process-schema constraints as a distinct mechanism. [deep-dive]
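The failure-mode taxonomy in #4 could be encoded directly as labels for an eval harness. A minimal sketch, assuming per-example labeling; the enum and function names are ours, not the paper's.

```python
# Hypothetical encoding of the paper's three verifier failure modes as
# eval-harness labels. Class and member names are illustrative.
from enum import Enum


class VerifierFailure(Enum):
    INCOMPLETE_TRACE = "incomplete trace coverage"        # an execution path was never examined
    THIRD_PARTY_GUESS = "third-party semantic guessing"   # library behavior assumed, not checked
    SUBTLE_DIFF_DISMISSED = "dismissal of subtle differences"  # real divergence waved away


def tally(labels: list[VerifierFailure]) -> dict[VerifierFailure, int]:
    """Count labeled failures per class for an error-analysis report."""
    counts = {f: 0 for f in VerifierFailure}
    for label in labels:
        counts[label] += 1
    return counts


# Usage: annotate each incorrect verdict with its failure class, then tally
labels = [
    VerifierFailure.INCOMPLETE_TRACE,
    VerifierFailure.INCOMPLETE_TRACE,
    VerifierFailure.THIRD_PARTY_GUESS,
]
report = tally(labels)
```

Keeping the taxonomy machine-readable makes it reusable across harnesses, which is what makes #4 a quick win.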

Weak Points

  • Cost argument is asserted, not demonstrated. The paper claims execution-free verification is cheaper than running tests, but provides no cost comparison. LLM inference with up to 100 agent steps of Opus-4.5 (2.8x more in semi-formal mode) is expensive; whether it's cheaper than sandbox setup depends on scale, repo diversity, and infrastructure — none of which are measured.
  • Accuracy ceiling may not suffice. 93% on a balanced dataset means ~7% false signals in RL training. Whether this error rate is tolerable depends on the downstream task, which the paper does not evaluate.
  • Single model family, single benchmark ecosystem. Results are on SWE-bench derived tasks with Claude models. Generalization to other models, languages, and repository structures is untested.

The strongest extractable insight is structured reasoning as interpretation-narrowing (#1 above); it connects cleanly to existing notes. The execution-free-as-oracle-replacement framing (#2) is interesting, but the paper's own evidence doesn't close the argument: a note on this would need to flag the unproven cost claim rather than accept it.