Ingest: Self-training Large Language Models through Knowledge Detection

Type: kb/sources/types/ingest-report.md

Source: self-training-large-language-models-through-knowledge-detection.md

Captured: 2026-04-20

From: https://aclanthology.org/2024.findings-emnlp.883.pdf

Classification

Type: scientific-paper -- EMNLP Findings 2024 paper with a concrete training method, datasets, model comparisons, ablations, OOD forgetting checks, and citations.

Domains: learning-theory, continual-learning, oracle-theory, trace-derived-learning

Authors: Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria; academic NLP/AI researchers whose work overlaps with the already-ingested Into the Unknown line of work and with the public Self-Training-LLM codebase reviewed in this KB.

Summary

The paper proposes a self-training pipeline in which an LLM generates or receives Wikipedia-grounded instructions, self-annotates answers, learns context-grounded reading comprehension through SFT, and then constructs a DPO preference dataset by contrasting answers generated with the source document against answers generated without it. Its key move is selective training: examples must pass a consistency filter, which rejects low-confidence reference answers, and a knowledge filter, which keeps only cases where the model appears not to know the answer, as measured by contradiction scores over sampled generations. Experiments on TinyLlama-1.1B, Llama2-7B, and Llama2-13B report improved truthfulness on held-out Wikipedia questions and less OOD degradation than unfiltered DPO. The paper is best read as an unknown-selection and preference-data-construction paper, not as general autonomous self-learning.
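The document-conditioned vs. unconditioned contrast at the heart of the preference construction can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for sampling from the target LLM, and the example strings are invented.

```python
# Sketch of DPO preference-pair construction: the "chosen" answer is
# produced with the source document in context, the "rejected" answer
# from the bare question, so the pair encodes the gap between grounded
# and ungrounded behavior.

def generate(prompt: str) -> str:
    # Hypothetical stand-in: a real pipeline would sample from the LLM.
    if "Context:" in prompt:
        return "Marie Curie won Nobel Prizes in physics and chemistry."
    return "Marie Curie won a Nobel Prize in literature."

def build_preference_pair(question: str, document: str) -> dict:
    grounded = generate(f"Context: {document}\nQuestion: {question}")
    ungrounded = generate(f"Question: {question}")
    return {
        "prompt": question,
        "chosen": grounded,      # document-conditioned answer
        "rejected": ungrounded,  # answer without the source document
    }

pair = build_preference_pair(
    "Which Nobel Prizes did Marie Curie win?",
    "Marie Curie received the Nobel Prize in Physics (1903) and Chemistry (1911).",
)
```

The design choice worth noticing is that the document acts as a temporary stronger oracle: the pair is usable even when no human-labeled preference exists.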

Connections Found

The connect report found a tight cluster around trace-derived weight learning, unknown-selection oracles, and continual-learning substrate choices. The strongest connection is the code-grounded Self-Training-LLM review, which inspects the repository implementing this paper and reaches the same design lesson: the important abstraction is separating prompt/data validity from learning-target value.

The paper also extends the prior Into the Unknown ingest: that source framed Points in the Unknown and self-question/search/train loops, while this paper makes unknown detection operational inside DPO preference construction and adds forgetting evaluation.

In the KB's learning-theory cluster, the source is a weight-side contrast to Continual learning's open problem is behaviour, not knowledge and Treat continual learning as substrate coevolution, because temporary readable traces are promoted into opaque model weights rather than durable repo artifacts. It also extends Memory management policy is learnable but oracle-dependent by showing a different learned-selection problem: not when to use memory, but which generated QA traces deserve training. The oracle notes, especially Oracle strength spectrum and The boundary of automation is the boundary of verification, explain why the method works only where source documents and contradiction scoring provide a usable soft verifier.

Extractable Value

  1. Separate candidate validity from learning value -- The paper's two filters are the highest-reach design pattern: one gate checks whether the preferred/reference answer is internally consistent enough to trust, the other checks whether the model is weak enough on the item to justify training. KB learning needs the same split: "is this candidate grounded?" and "does promoting it improve the system?" are different questions. [quick-win]

  2. Reference-context anchoring strengthens soft preference oracles -- The best DPO data comes from comparing document-conditioned answers against unconditioned answers, not from self-consistency alone. The transferable mechanism is that a retrieved/source-grounded answer can act as a temporary stronger oracle when hard ground truth is unavailable. [experiment]

  3. Selective training is also an anti-forgetting mechanism -- The paper's most surprising empirical claim is that training on fewer, higher-unknown examples can outperform training on the full preference dataset while preserving OOD benchmark performance. The simpler account is not that the model has deep self-knowledge, but that overtraining on already-known items injects noisy preference gradients. [just-a-reference]

  4. Unknown thresholds are model-capability dependent -- Larger models pruned more examples and benefited from higher knowledge-filtering thresholds. This is a useful warning for any automated curation loop: promotion thresholds are not portable constants; they are policy parameters tied to the evaluator and generator. [experiment]

  5. Batch trace-to-weights deserves its own trace-derived learning subtype -- The source's traces are generated questions, source contexts, sampled answers, contradiction scores, and pairwise judgments. They are not live agent trajectories, but they still form a trace-derived learning loop ending in SFT/DPO checkpoints. This refines the taxonomy in trace-derived learning techniques. [quick-win]

  6. Reading-comprehension pretraining stabilizes later preference learning -- The SFT stage includes a reading-comprehension dataset so the model first learns to answer with source context before DPO tries to separate preferred from rejected answers. For artifact-learning loops, the analogue is teaching the system to use evidence before asking it to optimize preferences over outputs. [experiment]
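The two-gate pattern from items 1 and 2 above can be made concrete as two explicit predicates. In this sketch, `contradiction` is a hypothetical stand-in for an NLI contradiction scorer, and `tau_c` / `tau_k` are illustrative thresholds, not values from the paper.

```python
from statistics import mean

def contradiction(a: str, b: str) -> float:
    # Hypothetical stand-in for an NLI contradiction probability in [0, 1];
    # a real pipeline would run an entailment model over the pair.
    return 0.0 if a == b else 1.0

def passes_gates(reference: str, ref_samples: list[str],
                 ungrounded_samples: list[str],
                 tau_c: float = 0.5, tau_k: float = 0.5) -> bool:
    # Gate 1 (validity): reject references that contradict their own
    # resampled variants too often -- the candidate is not trustworthy.
    if mean(contradiction(reference, s) for s in ref_samples) > tau_c:
        return False
    # Gate 2 (learning value): keep only items where the model's
    # ungrounded answers contradict the reference, i.e. the model
    # appears not to know the answer and training is worth the gradient.
    return mean(contradiction(reference, s) for s in ungrounded_samples) >= tau_k
```

Usage: `passes_gates("A", ["A", "A"], ["B", "B"])` admits an item the model gets wrong without the document, while `passes_gates("A", ["A", "A"], ["A", "A"])` rejects one it already knows. The point of keeping the gates separate is that they fail for different reasons: Gate 1 protects signal quality, Gate 2 protects the update budget.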

Limitations (our opinion)

The method is less autonomous than the headline suggests. GPT-3.5 generates instructions and reading-comprehension labels; GPT-4 judges pairwise truthfulness. The target model curates part of its own signal, but stronger external models and curated Wikipedia documents remain load-bearing.

The unknown detector is a soft oracle. Contradiction between context-conditioned and unconditioned answers can indicate missing knowledge, but it can also reflect prompt sensitivity, ambiguity, sampling variance, context misuse, or confident wrong knowledge. A model can be consistently wrong, and consistency filtering would not catch that without reference-based checks.

The evaluation domain is narrow: Wikipedia-derived factual QA over ten topics, one self-training iteration, and OOD checks limited to Open LLM Leaderboard tasks plus sampled known items. That is enough to make the mechanism worth tracking, but not enough to conclude that selective self-training is safe over long multi-cycle updates or private/noisy domains.

The paper's strongest result depends on document availability. The ablation without document context performs worse, especially for larger models. This weakens any broad claim that the model can self-correct from its own outputs alone; the actual mechanism is source-grounded preference construction.

Threshold sensitivity is underdeveloped. The paper shows that increasing tau_K can improve win rate, especially for 13B, but does not provide a principled calibration method. That matters because the curation policy is the real learned-system boundary: bad thresholds can either waste training on known examples or starve the update loop.
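To make the policy-parameter point concrete: a toy sweep over hypothetical per-item contradiction scores shows how sharply the retained training set shrinks as the knowledge-filter threshold rises. The scores below are invented for illustration; real distributions differ per model, which is exactly why the threshold does not transfer.

```python
def retained(scores: list[float], tau_k: float) -> int:
    # Count items whose contradiction score clears the knowledge filter.
    return sum(s >= tau_k for s in scores)

scores = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical per-item scores
sweep = {tau: retained(scores, tau) for tau in (0.2, 0.5, 0.8)}
# Stricter tau_K prunes more items; a stronger model shifts the whole
# score distribution down, so the same threshold retains a different set.
```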

The snapshot also reinforces a pattern already visible in the code review: intermediate artifacts are mostly staging data. Once promoted into weights, the chosen/rejected examples, scores, and rationales no longer participate in an inspectable knowledge graph unless separately preserved.

Write a note titled "Unknown selection needs separate validity and learning-value gates" connecting to Self-Training-LLM, Automating KB learning is an open problem, Oracle strength spectrum, and Into the Unknown ingest. It should argue that self-learning and KB-learning pipelines need two distinct promotion gates: one for candidate groundedness and one for whether the current system benefits from learning that candidate.