Evaluating Long-Context Reasoning in LLM-Based WebAgents
Type: academic-paper
Author: Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai
Source: https://arxiv.org/html/2512.04307v1
Date: December 3, 2025
Abstract
This research introduces a benchmark for assessing how well LLM-based web agents reason across extended interaction histories. The team developed an evaluation framework simulating multi-session user interactions by injecting irrelevant task sequences between dependent subtasks, creating contexts from 25,000 to 150,000 tokens.
Testing four models (Claude-3.7, GPT-4.1, Llama 4, and o4-mini) revealed significant performance decline as context grows: success rates drop from 40-50% in baseline conditions to under 10% in long-context scenarios.
The analysis identified two primary failure modes: agents become trapped in action loops and lose sight of their original objectives. The researchers also tested an implicit RAG approach that generates task-relevant summaries, which provided modest improvements but did not resolve the fundamental limitations.
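The benchmark construction described above, interleaving irrelevant task sequences between dependent subtasks until a target context length is reached, can be sketched roughly as follows. This is a hypothetical illustration: the function name `build_long_context`, the proportional budgeting scheme, and the `count_tokens` callback are assumptions, not the paper's actual pipeline.

```python
def build_long_context(subtasks, distractors, target_tokens, count_tokens):
    """Interleave distractor task records between dependent subtasks
    until the transcript approaches target_tokens.

    Hypothetical sketch; the paper's actual injection procedure and
    token budgets (25k-150k) may differ in detail.
    """
    transcript = []
    pool = list(distractors)
    # Split the overall token budget evenly across the gaps between subtasks.
    share = target_tokens / max(1, len(subtasks) - 1)
    for i, subtask in enumerate(subtasks):
        transcript.append(subtask)
        if i < len(subtasks) - 1:
            budget = share * (i + 1)
            # Inject irrelevant tasks after this subtask until the running
            # transcript reaches its proportional share of the budget.
            while pool and count_tokens(transcript) < budget:
                transcript.append(pool.pop(0))
    return transcript
```

The key property is that the dependent subtasks stay in order, so the agent must still carry information across the injected filler to complete the later subtasks.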
Key Findings
- Dramatic performance degradation as context length increases from 25k to 150k tokens
- Agents struggle with maintaining coherent task execution across extended interactions
- Current architectures lack robust mechanisms for long-term user session management
Failure Modes (at 150k tokens)
- Claude-3.7: 16.4% false ends, 35% inefficient progress, 16.7% loops
- GPT-4.1: 6.9% false ends, 32.7% inefficient progress, 44.3% loops
- o4-mini: Outperforms other models despite similar challenges
Proposed Solution
Implicit RAG (iRAG) breaks complex instructions into sub-instructions and generates task-relevant summaries of the interaction history to improve retrieval. Results show modest improvements, though fundamental limitations in long-context reasoning persist.
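The two-stage idea, decompose the instruction, then retrieve task-relevant summaries per sub-instruction, can be sketched as below. Everything here is an assumption for illustration: `decompose` and `summarize` stand in for LLM calls, and the word-overlap ranking is a naive placeholder for whatever relevance scoring the paper actually uses.

```python
def irag_retrieve(instruction, history, decompose, summarize, top_k=3):
    """Implicit-RAG sketch (hypothetical API).

    1. Split the complex instruction into sub-instructions.
    2. Summarize each history chunk with respect to the overall task.
    3. For each sub-instruction, keep only the most relevant summaries.

    `decompose` and `summarize` are caller-supplied stand-ins for LLM calls.
    """
    sub_instructions = decompose(instruction)
    summaries = [summarize(chunk, instruction) for chunk in history]
    relevant = []
    for sub in sub_instructions:
        # Rank summaries by naive word overlap with the sub-instruction;
        # a real system would use the LLM itself or an embedding model.
        scored = sorted(
            summaries,
            key=lambda s: len(set(s.lower().split()) & set(sub.lower().split())),
            reverse=True,
        )
        relevant.append((sub, scored[:top_k]))
    return relevant
```

The payoff is that each sub-instruction sees a short, task-focused slice of the history instead of the full 150k-token transcript, which is the mechanism behind the reported modest gains.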
Implications
Current state-of-the-art models struggle to maintain coherence across long-term interactions, highlighting critical challenges for deploying WebAgents in realistic scenarios. The research underscores the need for enhanced memory architectures and planning capabilities.
Publication Details
- Status: Accepted at NeurIPS 2025 LAW Workshop
- Subject Areas: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- DOI: https://doi.org/10.48550/arXiv.2512.04307