Analysis: Adaptation of Agentic AI (arXiv:2512.16301)

This note analyzes the paper "Adaptation of Agentic AI" and explores how its framework could inform llm-do's design and potential extensions.

Paper Overview

The paper presents a taxonomy for adaptation in agentic systems, organizing strategies along two dimensions:

  1. What adapts: Agent (LLM behavior) vs Tool (external capabilities)
  2. What triggers adaptation: Tool-execution signals vs Agent-output signals

Key Adaptation Mechanisms

1. Error-Driven Adaptation (Tool-Execution-Signaled)

When tools fail, the error propagates back to the LLM, which adjusts its approach.

Current llm-do behavior: This already happens naturally—when a tool call fails, the error message is returned to the LLM in the conversation, and it can retry with different parameters or try a different tool.

Potential extension: Failure logging and analysis

# Hypothetical worker config
adaptation:
  log_failures: true
  failure_store: ./failures/{worker_name}/

The runtime could:

  • Log every tool failure with context (worker, task input, tool called, error, LLM's recovery attempt)
  • Aggregate patterns: "tool X fails 40% of the time when called with pattern Y"
  • Surface insights: "Consider adding instruction about X" or "Tool X needs better error messages"
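
For illustration, a single record in such a failure store might look like the following. This is only a sketch of the fields listed above; the field names and values are hypothetical, not an existing llm-do format.

# Hypothetical shape of one failure record (one JSON object per JSONL line)
failure_record = {
    "worker": "orchestrator",
    "task_input_hash": "a1b2c3d4",
    "tool": "read_file",
    "args": {"path": "notes.md"},
    "error": "FileNotFoundError: notes.md",
    "recovery": "retried with an absolute path",
    "timestamp": "2025-01-15T10:32:00Z",  # example value
}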

Implementation sketch:

# In runtime/ctx.py, wrap tool execution
async def execute_tool_with_logging(tool, args, context):
    try:
        result = await tool(**args)
        log_success(context, tool, args, result)
        return result
    except Exception as e:
        log_failure(context, tool, args, e)
        raise  # Let LLM handle retry

2. Prompt Refinement (Agent Adaptation)

Adjusting worker instructions based on observed performance.

Current llm-do support: Manual—users edit .worker files based on observed behavior.

Potential extension: Prompt evolution tracking

Workers could track their own performance and suggest refinements:

# In worker definition
adaptation:
  track_outcomes: true
  outcome_labels: ["success", "partial", "failed", "user_corrected"]

After each run, optionally prompt the user: "How did this go?" and store the outcome with a task hash. After N runs, analyze:

  • Which task patterns succeed?
  • Which fail?
  • What instructions might help?
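
A minimal sketch of such tracking, assuming outcomes are keyed by a hash of the task input and appended to a JSONL file (the function and file layout are illustrative):

import hashlib
import json

OUTCOME_LABELS = ["success", "partial", "failed", "user_corrected"]

def record_outcome(path: str, worker: str, task_input: str, outcome: str) -> None:
    # Key the outcome by a short hash of the task input so similar tasks can be grouped later
    assert outcome in OUTCOME_LABELS
    entry = {
        "worker": worker,
        "task_hash": hashlib.sha256(task_input.encode()).hexdigest()[:12],
        "outcome": outcome,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")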

More ambitious: An improve_worker meta-tool that:

  1. Reads failure logs for a worker
  2. Analyzes patterns
  3. Proposes prompt modifications
  4. User approves changes
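
A rough sketch of that flow, assuming the failure-log format sketched above and a ctx.deps.call(worker, prompt) delegation signature (the note only mentions ctx.deps.call(...); the exact signature is an assumption):

import json

async def improve_worker(ctx, worker_name: str, log_path: str) -> str:
    # 1. Read failure records for this worker from the JSONL log
    with open(log_path) as f:
        failures = [json.loads(line) for line in f if line.strip()]
    failures = [rec for rec in failures if rec.get("worker") == worker_name]
    # 2-3. Delegate pattern analysis and prompt proposals to an analyzer worker
    proposal = await ctx.deps.call(
        "analyzer",
        f"{len(failures)} recent failures for {worker_name}: {json.dumps(failures)}\n"
        "Identify recurring patterns and propose instruction changes.",
    )
    # 4. Return the proposal; applying it remains a human decision
    return proposal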

This is "stabilizing" in llm-do terminology—but data-driven rather than intuition-driven.

3. Tool Selection Adaptation

Dynamically adjusting which tools are available based on task and past success.

Current llm-do support: Static—toolsets are declared in worker config.

Potential extension: Conditional toolsets

toolsets:
  filesystem:
    tools: [read_file, write_file]
  shell:
    tools: [run_command]
    # Only enable after certain conditions
    enable_after:
      - successful_read_file: 2  # After 2 successful file reads
      - user_trust_level: high
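
A sketch of how the runtime might evaluate such conditions before exposing a toolset; the condition names mirror the hypothetical config above, and the run-state counters are assumed to exist:

def toolset_enabled(enable_after: dict, run_state: dict) -> bool:
    # All enable_after conditions must hold before the toolset becomes available
    if run_state.get("successful_read_file", 0) < enable_after.get("successful_read_file", 0):
        return False
    required_trust = enable_after.get("user_trust_level")
    if required_trust and run_state.get("user_trust_level") != required_trust:
        return False
    return True

# toolset_enabled({"successful_read_file": 2, "user_trust_level": "high"},
#                 {"successful_read_file": 3, "user_trust_level": "high"})  # -> True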

Or dynamic tool recommendations:

toolsets:
  delegation:
    workers: [analyzer, formatter]
    recommend_based_on:
      - task_keywords
      - past_success_rate
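
One way the recommendation pass could work: score each candidate worker by keyword overlap and historical success rate. The scoring below is purely illustrative:

def recommend_workers(task: str, stats: dict) -> list:
    # stats maps worker name -> {"keywords": [...], "success_rate": 0.0-1.0}
    def score(name: str) -> float:
        s = stats[name]
        keyword_hits = sum(1 for kw in s["keywords"] if kw in task.lower())
        return keyword_hits + s["success_rate"]
    return sorted(stats, key=score, reverse=True)

# recommend_workers("analyze this log file", {
#     "analyzer":  {"keywords": ["analyze", "log"], "success_rate": 0.9},
#     "formatter": {"keywords": ["format"], "success_rate": 0.6},
# })  # -> ["analyzer", "formatter"]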

4. Online vs Offline Adaptation

  • Online: Real-time adjustments during execution (current LLM retry behavior)
  • Offline: Pre-computed refinements from historical analysis

Current llm-do support: Mostly online (LLM adapts in-session).

Potential extension: Offline adaptation pipeline

# Analyze past runs
llm-run analyze --entry orchestrator --since "7 days"

# Output:
# - 23 runs, 18 successful
# - Common failure: "file not found" when paths are relative
# - Suggested instruction addition: "Always use absolute paths"
# - Tool `formatter` called but failed 4/5 times → review formatter.worker

This could feed into a refinement workflow:

  1. Run workers in production
  2. Periodically analyze logs
  3. Generate suggested improvements
  4. A human reviews and applies them
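
The aggregation step could start as simple counting of (tool, error type) pairs over the failure log introduced earlier. A sketch, assuming the hypothetical JSONL format and failure-store path from above:

import json
from collections import Counter

def summarize_failures(log_path: str) -> Counter:
    # Count failures per (tool, error type) to surface recurring patterns
    patterns = Counter()
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            error_type = rec["error"].split(":", 1)[0]
            patterns[(rec["tool"], error_type)] += 1
    return patterns

# summarize_failures("./failures/orchestrator/log.jsonl").most_common(5)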

5. Agent-Output-Signaled Adaptation

Adaptation triggered by the agent's own reasoning/confidence, not external failures.

Current llm-do support: None explicit.

Potential extension: Confidence signaling

Workers could be instructed to signal uncertainty:

# In worker instructions
If you are uncertain about the correct approach, use the `signal_uncertainty` tool
with a description of what you're unsure about. This will pause execution for human input.

Or structured output could include confidence:

result_type:
  type: object
  properties:
    answer:
      type: string
    confidence:
      type: string
      enum: [high, medium, low]
    uncertainty_reason:
      type: string

Low-confidence results could trigger:

  • Automatic human review
  • Logging for later analysis
  • Escalation to a more capable model
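
A thin routing layer over the structured result could be enough to wire these up; the returned action names are placeholders for whatever the runtime actually supports:

def route_result(result: dict) -> str:
    # Choose a follow-up action from the worker's self-reported confidence
    confidence = result.get("confidence", "high")
    if confidence == "low":
        return "human_review"       # or escalate to a more capable model
    if confidence == "medium":
        return "log_for_analysis"
    return "accept"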

Mapping to llm-do's Spectrum

The paper's framework maps to llm-do's neural/symbolic spectrum:

  Paper Concept            | llm-do Equivalent
  Agent-agnostic tools     | Pure Python tools (no worker calls)
  Agent-supervised tools   | Hybrid tools with ctx.deps.call(...)
  Tool-execution-signaled  | Error propagation to LLM (existing)
  Agent-output-signaled    | Confidence signals (potential extension)
  Offline adaptation       | Failure analysis → prompt refinement
  Online adaptation        | In-session LLM retry/adjustment

Proposed Features (Priority Order)

P1: Failure Logging

A minimally invasive change: log tool failures with context to enable later analysis.

# runtime addition
class FailureLog:
    def log(self, worker: str, tool: str, args: dict, error: str, recovery: str):
        # Append to JSONL file
        ...

P2: Run Outcome Tracking

After worker completion, optionally record the outcome. This enables offline analysis.

# CLI addition
llm-run worker.worker --track-outcome
# After completion: "Rate outcome: [s]uccess / [p]artial / [f]ailed / [enter] skip"

P3: Analysis Command

Consume logs to surface patterns.

llm-run analyze orchestrator --since 7d
# Outputs failure patterns, success rates, suggestions

P4: Confidence Signaling

Add an optional signal_uncertainty tool to the core toolsets. When called, it pauses for human input or logs the signal for review.
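
A possible shape for the tool itself, assuming a simple interactive pause; the tool name matches the proposal, everything else is an assumption:

def signal_uncertainty(description: str) -> str:
    # Surface the model's stated uncertainty to a human and return their guidance;
    # a non-interactive runtime could instead log the description for later review
    print(f"Worker is uncertain: {description}")
    return input("Guidance (press Enter to let the worker proceed): ")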

P5: Prompt Evolution Suggestions

Meta-worker that reads failure logs and proposes instruction improvements.

Implications for Bidirectional Refactoring

The paper reinforces llm-do's bidirectional refactoring principle with data-driven triggers:

Stabilizing signals (neural → symbolic):

  • Tool consistently fails with certain input patterns → add validation in Python
  • LLM always calls tools in same sequence → extract to Python orchestration
  • Output structure is always identical → add stricter schema

Softening signals (symbolic → neural):

  • Python tool has growing exception list → delegate edge cases to worker
  • Validation rules have many special cases → use LLM for underspecified validation
  • User frequently overrides tool behavior → add flexibility via worker


Relevant Notes:

  • stabilisation — the stabilising/softening signals catalogued here are the data-driven triggers for the stabilise/soften cycle

References