Distillation

Type: note · Status: current · Tags: learning-theory

One of two co-equal learning mechanisms in deployed agentic systems, alongside constraining. Distillation is compressing knowledge so that a consumer can act on it within bounded context. Without distillation, the source material often exceeds the consumer's effective context for the task — making the operation infeasible, not merely slow. Even when source material would technically fit, undistilled methodology crowds out the actual work — both by consuming tokens and by adding navigational complexity.

The source can be anything — raw observations, methodology, prior reasoning, accumulated understanding. The target is always an artifact that equips a consumer (agent, collaborator) to perform a task. Different operational contexts need different extractions from the same source, so multiple distillations are normal — each serves a different consumer-task pair.

Context engineering is the architecture — the loading strategy, routing, the select function in the scheduling model. Distillation is the main operation that architecture performs, though not the only one (routing, scoping, and maintenance are also context engineering operations).
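The select function can be sketched as a greedy, budget-constrained pick over scored source chunks. This is a minimal illustration, not an existing API — `Chunk`, `select`, and the precomputed token counts are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int       # precomputed token count for this chunk
    relevance: float  # relevance to the consumer's task

def select(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily pack the highest-relevance chunks into the token budget."""
    picked, used = [], 0
    for c in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        if used + c.tokens <= budget:
            picked.append(c)
            used += c.tokens
    return picked
```

The hard constraint from the agent context shows up as the `budget` check: anything that doesn't fit is dropped, not deferred.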

Most KB learning is distillation — explore messily, notice patterns, extract insight, write a note.

Prior work

Compressing knowledge for a specific audience is not new — it's the core of several established fields:

  • Technical writing — the discipline is built on audience analysis and purpose-driven restructuring. Progressive disclosure is distillation applied to documentation.
  • Pedagogical adaptation — scaffolding (Vygotsky), curriculum design, and Bloom's taxonomy all address reshaping knowledge for learners at different levels.
  • Library science / abstracting — professional abstracting and indexing is distillation optimized for retrieval decisions.
  • Knowledge management — Nonaka & Takeuchi's externalization (tacit → explicit knowledge) describes a similar transformation, though without the context-budget framing.

What's specific to our use is the agent context: the context budget is a hard constraint, not a soft guideline.

TODO: This survey is from the agent's training data, not systematic. Revisit with deep search — technical writing and pedagogy literatures likely have results about what makes distillation effective.

How distillation works

The content is selected and compressed to fit the consumer's task and context budget. The rhetorical mode may shift if the task demands it (argumentative → procedural when the task is execution, exploratory → assertive when the task is deciding). What stays constant is the medium — unlike codification, distillation typically stays in natural language consumed by an LLM.

Source → Distillate (target consumer):

  • Methodology → Skill (agent performing a specific workflow)
  • Workshop → Note (future agents needing the insight)
  • Research → Design principle (decision-making in a particular area)
  • Accumulated understanding → Narrative (consumer who needs the current whole picture)
  • Caller's knowledge + sub-agent's question → Refined prompt (sub-agent facing a specific task)
  • Domain artifacts (logs, patches, docs) → Detection/analysis skill (agent diagnosing or investigating a class of problems)
  • Many observations → Summary (agent that can't fit them all in context)
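Each row above pairs one source with one consumer-task pair. A small illustrative data model (the `DistillationSpec` and `SPECS` names are hypothetical) makes the point that one source normally yields several distillates:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DistillationSpec:
    source: str      # what gets compressed
    distillate: str  # the artifact produced
    consumer: str    # who acts on it
    task: str        # what they do with it

SPECS = [
    DistillationSpec("methodology", "skill", "agent", "perform a specific workflow"),
    # hypothetical second pair over the same source, for a different consumer
    DistillationSpec("methodology", "note", "future agent", "reuse the insight"),
    DistillationSpec("many observations", "summary", "agent", "act within a small context"),
]

def distillates_for(source: str) -> list[str]:
    """Same source, different consumer-task pairs -> multiple distillates."""
    return [s.distillate for s in SPECS if s.source == source]
```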

Targeting is information loss — this is why the source persists. Reading only the /connect skill, you can connect notes but can't adapt the procedure to a novel situation. The methodology notes handle that.

Warning: a distillate can look adequate while losing behavioral influence — compressed experience is often less active than the raw traces it replaced (Faithful Self-Evolvers).

Relationship to constraining

Constraining and distillation are orthogonal — they operate on different dimensions of the same artifacts:

  • Not constrained, not distilled — raw capture (text file, session notes)
  • Not constrained, distilled — extracted but loose (draft skill, rough note)
  • Constrained, not distilled — committed but not extracted (stored output, frozen config)
  • Constrained, distilled — extracted AND hardened (validated skill, codified script)

Constraining asks: how constrained is this artifact? Distillation asks: was this artifact extracted from something larger?

You can distill without constraining (extract a skill — still natural language, still underspecified). You can constrain without distilling (store an LLM output — no extraction from reasoning involved). The full compound gain comes when both apply.
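Because the two dimensions are independent booleans on the same artifact, the quadrant an artifact lands in is a simple classification. A sketch (the `Quadrant` enum and `classify` names are illustrative):

```python
from enum import Enum

class Quadrant(Enum):
    RAW_CAPTURE = "raw capture"                           # neither
    EXTRACTED_LOOSE = "extracted but loose"               # distilled only
    COMMITTED_NOT_EXTRACTED = "committed, not extracted"  # constrained only
    EXTRACTED_AND_HARDENED = "extracted and hardened"     # both

def classify(constrained: bool, distilled: bool) -> Quadrant:
    if constrained and distilled:
        return Quadrant.EXTRACTED_AND_HARDENED
    if constrained:
        return Quadrant.COMMITTED_NOT_EXTRACTED
    if distilled:
        return Quadrant.EXTRACTED_LOOSE
    return Quadrant.RAW_CAPTURE
```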

Terminology note

ML "knowledge distillation" (Hinton et al., 2015) trains a smaller model to mimic a larger model's output distribution — automated, targets weights, optimizes for reproducing the teacher's behavior. KB distillation involves judgment about what to extract, targets text artifacts, and optimizes for operational effectiveness — the distillate serves a different purpose than the source. Shared intuition: purposeful compression from a larger source into a smaller target for a specific consumer.


Relevant Notes: