The Price of Meaning: Why Every Semantic Memory System Forgets
Type: kb/sources/types/snapshot.md · Tags: academic-paper
Author: Sambartha Ray Barman, Andrey Starenky, Sofia Bodnar, Nikhil Narasimhan, Ashwin Gopinath
Source: https://arxiv.org/html/2603.27116v1
Date: 2026-03-28
License: CC BY 4.0
arXiv:2603.27116v1 [cs.AI] 28 Mar 2026
The Price of Meaning: Why Every Semantic Memory System Forgets
Sambartha Ray Barman, Andrey Starenky, Sofia Bodnar, Nikhil Narasimhan, Ashwin Gopinath
Sentra, 235 2nd Street, San Francisco, CA 94105, USA
Corresponding author: agopi@mit.edu, ashwin@sentra.app (Ashwin Gopinath; also Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA)
Abstract
Every major AI memory system in production today, from vector databases to RAG pipelines to the weights of
large language models, organises information by meaning. That organisation is what makes these systems useful:
it lets them generalise, draw analogies, and retrieve by concept rather than by keyword. But it comes at a
price. We show that the same geometric structure that enables semantic generalisation also makes interference,
forgetting, and false recall inescapable. Here we formalise and test that tradeoff for a broad class of
semantically continuous kernel-threshold memories: systems whose retrieval score is a monotone function of an
inner-product in a semantic feature space, whose representations are learned under a rate or distortion budget,
and whose semantic manifold has finite local intrinsic dimension.
Within this class, we derive four results. First, semantically useful representations have finite semantic
effective rank. Second, finite local dimension implies positive competitor mass in retrieval neighbourhoods.
Third, under growing memory, retention decays to zero; with power-law arrival statistics and population
heterogeneity, this yields population-level power-law forgetting curves. Fourth, for associative lures
satisfying a $\delta$-convexity condition below the decision margin, false recall cannot be eliminated by threshold tuning within the same score family.
We then test these predictions across five memory architectures: vector retrieval, graph memory,
attention-based retrieval, BM25-based filesystem retrieval, and parametric memory. Pure semantic retrieval
systems express the geometric vulnerability directly as forgetting and false recall. Systems with explicit
reasoning can partially override these symptoms behaviourally, but convert smooth degradation into brittle
failure modes. Systems that escape interference completely do so by sacrificing semantic generalisation.
The result is not an argument against scale. It is an argument that scale alone is not enough. Making a vector
database ten times larger, an LLM ten times bigger, or an embedding space ten times wider does not remove the
interference; it moves the system along a tradeoff surface where forgetting and usefulness are coupled. For
memory, progress requires not only scale but new architectures, training objectives, and
interference-management mechanisms. The price of meaning is interference, and no architecture we tested avoids
paying it.
Organising memory by meaning makes forgetting and false recall inevitable. Scaling up does not fix it.
Introduction
Every deployed retrieval-augmented generation system, every long-term agent memory, and every knowledge graph
built on dense embeddings shares a design choice: organise information by meaning. Items that are semantically
related sit near each other in representation space. This is what makes these systems capable of
generalisation, analogy, and conceptual transfer rather than mere keyword lookup. But it also means that when
the system tries to retrieve one memory, its semantic neighbours compete for the same retrieval slot. That
competition is interference, and this paper asks whether any semantic memory system can avoid it.
Our previous work, HIDE^3, showed that one simple retrieval architecture (cosine similarity over sentence embeddings) reproduces several canonical memory phenomena, including forgetting under interference ($b=0.460\pm 0.183$), DRM-style false recall ($\text{FA}=0.583$), spacing effects, and tip-of-tongue states ($3.66\%$). (We note that different dimensionality estimators yield different values for the same model (participation ratio $\approx 158$, Levina–Bickel $\approx 10.6$, PCA-projected $\approx 16$), a discrepancy we reconcile in the Dimensionality section; all place these systems in the interference-vulnerable regime.) The natural objection is architectural: perhaps those phenomena are artefacts of one particular embedding-and-threshold system rather than consequences of semantically organised memory more broadly.
This paper addresses that objection. We identify a theorem class, semantically continuous kernel-threshold
memories, within which interference is not a bug of one architecture, but a structural consequence of semantic
organisation under finite effective dimensionality. We then show empirically that related pressures appear
across multiple modern memory architectures, even when their behavioural expression differs. This paper argues
that within a broad and practically important theorem class, these phenomena follow from the structure of
semantically organised retrieval itself.
We call a memory system semantically useful if it supports retrieval by conceptual relatedness rather than
exact lexical identity alone. This is a functional definition: the target regime is memory that supports
inference, analogy, and conceptual transfer. The theorem developed here applies not to all possible memories,
but to a specific class of semantically continuous retrieval systems.
To obtain fully rigorous results, we make explicit the theorem class. Our proofs apply to semantically
continuous kernel-threshold memories: systems whose retrieval rule is a monotone function of an inner-product
score in a semantic feature space (Axiom A1), whose semantically useful representation is optimised under a
rate or distortion budget (Axiom A3), and whose semantic manifold has finite local intrinsic dimension
(Axiom A4). This class includes dense vector retrieval, embedding-based graph memory, and hidden-state
similarity retrieval. Architectures equipped with an external symbolic verifier or exact episodic record fall
outside this theorem class and are treated separately as behavioural workarounds rather than counterexamples.
The claim is therefore not that every conceivable memory system must exhibit the same behavioural signatures.
It is that a large and practically central class of modern memory systems inherits a common geometric
vulnerability. Architectures can differ in how they express that vulnerability, and some can partially
compensate for it behaviourally, but those compensations are not free.
We close the gap with four theorems and a unifying No-Escape Theorem. Within the kernel-threshold theorem class, any system satisfying Axioms A1–A5 exhibits interference-driven forgetting, false recall, and partial retrieval states. The logical chain is: semantic kernel $+$ rate-distortion optimality $\Rightarrow$ finite semantic effective rank (Theorem 1) $\Rightarrow$ positive cap mass (Theorem 2) $+$ growing memory $\Rightarrow$ inevitable forgetting (Theorem 3); power-law arrival $+$ population heterogeneity $\Rightarrow$ power-law forgetting curve. Independently: associative $\delta$-convexity $\Rightarrow$ lure inseparability under threshold tuning (Theorem 6) (Fig. 1). We verify every link empirically across five architecturally distinct memory systems: a vector database (BGE-large^21), an attention-based context window (Qwen2.5-7B^13), a filesystem agent memory with BM25 $+$ LLM re-ranking, a graph memory with PageRank (MiniLM^15; similar contrastive architectures underpin CLIP^14), and parametric knowledge in LLM weights. The effective dimensionality convergence (from $d_{\text{nom}}=3{,}584$ to $d_{\text{eff}}=17.9$ for Qwen hidden states) mirrors the low-dimensional structure in biological neural populations^19, 7.
We emphasise what the theorem does not say. It bounds the existence of these phenomena, not their magnitude.
Engineering can and should optimise parameters to minimise unwanted interference; the gap between “inevitable”
and “catastrophic” is where engineering contributes. The forgetting exponent, the false alarm rate, and the TOT
probability are continuous functions of system parameters; the theorem says these functions are bounded away
from zero for systems in the kernel-threshold theorem class satisfying Axioms A1–A5. Murdock’s serial position
effect^11, Cepeda et al.’s^6 distributed practice findings, Brown and McNeill’s^5 tip-of-tongue phenomenology,
and Nadel and Moscovitch’s^12 consolidation theory all describe the same geometric substrate from different
vantage points. The most important finding is not that all five architectures show the same phenomena (they do
not, at the behavioural level) but that the geometric vulnerability holds across the tested architectures under
the SPP formalism while the behavioural expression depends on whether the system can build workarounds. These
workarounds are never free: they either convert graceful degradation into catastrophic failure, or sacrifice
semantic usefulness entirely. We organise our findings into three architectural categories (pure geometric,
reasoning-overlay, and systems outside the operative theorem regime) that make this tradeoff explicit.
Results
Mathematical framework: the no-escape theorem
Definition 1 (Semantic Proximity Property).
A memory system $\mathcal{M}=(\mathcal{S},E,R,d)$ with item set $\mathcal{S}$, encoding function $E:\mathcal{S}\to\mathcal{V}$ into a Hilbert space $\mathcal{V}$, retrieval function $R$, and proximity measure $d$, satisfies the Semantic Proximity Property (SPP) if for any semantically related pair $(s_{i},s_{j})$ and unrelated pair $(s_{i},s_{k})$:

$\mathbb{E}[d(E(s_{i}),E(s_{j}))]<\mathbb{E}[d(E(s_{i}),E(s_{k}))].$

We verified SPP empirically for all five architectures using $143$ sentence pairs from Wikipedia ($p<0.001$, paired $t$-test, Cohen's $d>1.5$ for all embedding architectures; Extended Data Fig. 14). We acknowledge that $143$ pairs is a limited empirical base; the SPP verification serves as a sanity check that each architecture satisfies the minimal definition, not as proof that SPP holds for all possible inputs. The definition is deliberately minimal: we specify neither the encoding mechanism nor the similarity function, requiring only that the system places related items closer than unrelated ones.
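As a concrete illustration, the SPP inequality can be checked directly on any encoder's output. The sketch below uses synthetic vectors and cosine distance in place of the paper's sentence-pair embeddings; the toy data and function names are our assumptions, not the paper's code.

```python
import numpy as np

def spp_gap(related_pairs, unrelated_pairs):
    """E[d(related)] - E[d(unrelated)] under cosine distance.
    SPP holds when the gap is negative: related pairs sit closer."""
    cos_dist = lambda a, b: 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    rel = np.mean([cos_dist(a, b) for a, b in related_pairs])
    unrel = np.mean([cos_dist(a, b) for a, b in unrelated_pairs])
    return rel - unrel

# Toy stand-in for E(s): related items are small perturbations of a shared vector.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(50, 32))
related = [(v, v + 0.2 * rng.normal(size=32)) for v in anchors]
unrelated = [(v, rng.normal(size=32)) for v in anchors]
gap = spp_gap(related, unrelated)
assert gap < 0  # this toy encoder satisfies SPP
```

A paired $t$-test on the per-pair distances, as in the verification above, would then quantify the gap's significance.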
To obtain the formal results below, we introduce five axioms that define the kernel-threshold memory class.
Definition 2 (Axiom A1: Kernel-Threshold Retrieval).
There exists a semantic feature map $\phi:\mathcal{X}\to\mathcal{H}$ into a Hilbert space and a retrieval score $s(q,x)=g(\langle w_{q},\phi(x)\rangle_{\mathcal{H}})$, where $g$ is monotone increasing. Cosine similarity, dot-product retrieval, and linear probes on hidden states fit this form.
Definition 3 (Axiom A2: Semantic Sufficiency).
There is a positive semidefinite semantic kernel $K$ such that retrieval relevance is measurable with respect to the sigma-algebra generated by the semantic coordinates of $K$. Only the semantic component can improve Bayes retrieval risk.
Definition 4 (Axiom A3: Rate-Distortion Optimality).
The encoder is optimal for retrieval risk under a rate or distortion budget $D$.
Definition 5 (Axiom A4: Local Regularity).
The pushforward measure $\mu=\phi_{\#}P_{X}$ on the semantic manifold is locally Ahlfors regular of intrinsic dimension $d_{\mathrm{loc}}$: for $\mu$-almost every anchor $z$, $c_{1}r^{d_{\mathrm{loc}}}\leq\mu(B(z,r))\leq c_{2}r^{d_{\mathrm{loc}}}$ for $0<r<r_{0}$.
Definition 6 (Axiom A5: Associative Convexity).
For studied items $\{x_{1},\ldots,x_{k}\}$, an associative lure $c$ is $\delta$-convex if $\|\phi(c)-\sum_{i}a_{i}\phi(x_{i})\|_{\mathcal{H}}\leq\delta$ for some convex weights $a_{i}\geq 0$, $\sum_{i}a_{i}=1$.
Theorem 1 (Semantic Spectral Bound; proof sketch).
Let $K$ be the semantic kernel with Mercer eigenpairs $(\lambda_{j},\psi_{j})$. Under Axioms A1–A3, for every optimal encoder under distortion budget $D$, there exists a threshold $\gamma(D)$ such that the encoder factors through the truncated semantic statistic $\Phi_{\gamma}(x)=(\sqrt{\lambda_{j}}\,\psi_{j}(x))_{\lambda_{j}>\gamma(D)}$. The semantically useful effective dimension obeys $d_{\mathrm{eff}}\leq r_{\mathrm{eff}}(\gamma(D))\leq\#\{j:\lambda_{j}>\gamma(D)\}$. Nominal dimension can grow without changing the semantically useful effective rank. For natural language, empirical measurements yield $d_{\mathrm{intrinsic}}\approx 10$–$50$^8; this is an observed range, not a mathematical consequence.
Proof sketch. Mercer decomposition of $K$ yields the semantic statistic. By Blackwell sufficiency, nuisance directions independent of relevance given the semantic coordinates cannot reduce Bayes retrieval risk. Reverse water-filling under the distortion budget retains only spectral modes above $\gamma(D)$. Full proof in Supplementary §S2. $\square$
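The rank bound is easy to see numerically: with a fast-decaying Mercer spectrum, the number of modes above $\gamma(D)$ stays small no matter how large the nominal dimension grows. A minimal sketch, where the spectrum $\lambda_j = j^{-2}$ is an illustrative assumption rather than a measured one:

```python
import numpy as np

def effective_rank(eigvals, gamma):
    """#{j : lambda_j > gamma} -- the truncation rank in Theorem 1."""
    return int(np.sum(np.asarray(eigvals) > gamma))

def participation_ratio(eigvals):
    """Global variance concentration: (sum lambda)^2 / sum(lambda^2)."""
    lam = np.asarray(eigvals, dtype=float)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

# Nominal dimension 1000, power-law spectrum lambda_j = 1/j^2.
lam = 1.0 / np.arange(1, 1001) ** 2
r = effective_rank(lam, gamma=1e-4)   # lambda_j > 1e-4 only for j <= 99
pr = participation_ratio(lam)
assert r == 99
assert pr < 3.0  # variance concentrates in a handful of directions
```

Doubling the nominal dimension appends only smaller eigenvalues, leaving both quantities essentially unchanged, which is the sense in which scale alone does not buy effective rank.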
Theorem 2 (Positive Cap Mass).
Under Axioms A1 and A4, for any anchor $z$ and sufficiently small retrieval radius $\theta$, $c_{1}^{\prime}\theta^{d_{\mathrm{loc}}(z)}\leq\mu(C(z,\theta))\leq c_{2}^{\prime}\theta^{d_{\mathrm{loc}}(z)}$. Every admissible retrieval neighbourhood has strictly positive competitor mass.
Theorem 3 (Inevitable Forgetting Under Growing Memory).
Under Axioms A1 and A4, if competitor arrivals form a marked point process with cumulative intensity $\Lambda_{x}(t)=\int_{0}^{t}\lambda_{x}(u)\,du$, then retention for item $x$ is $R_{x}(t)=\exp(-\mu(C_{x})\Lambda_{x}(t))$. If $\Lambda_{x}(t)\to\infty$, then $R_{x}(t)\to 0$.
Corollary 4 (Stretched Exponential Per-Item Retention).
If $\lambda_{x}(t)=\lambda_{0,x}t^{-\alpha}$ with $0<\alpha<1$, then $R_{x}(t)=\exp(-c_{x}t^{1-\alpha})$ where $c_{x}=\mu(C_{x})\lambda_{0,x}/(1-\alpha)$. This is a stretched exponential, not a power law, for any individual item.
Proposition 5 (Population Power Law from Heterogeneity).
If the item-specific scale $c_{x}$ has a density regularly varying at zero, $g(c)\sim\kappa c^{\beta-1}$ as $c\downarrow 0$, then the population-averaged retention obeys $\overline{R}(t)\sim\kappa\Gamma(\beta)t^{-\beta(1-\alpha)}$. The population forgetting exponent is $b=\beta(1-\alpha)$.
Interpretation. Individual items forget by a stretched exponential; population heterogeneity turns this into a power law. Geometry determines the hazard scale ($\mu(C_{x})$), the environment determines the time dependence ($\alpha$), and population heterogeneity ($\beta$) determines the asymptotic forgetting exponent. The exponent $\alpha$ is corpus-dependent: Anderson & Schooler^2 reported $\alpha=0.513$ on newspaper text; we measure $\alpha=0.459$ on Wikipedia. Both place $b$ in the $[0.3,0.6]$ range for reasonable $\beta$.
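This scale-mixture mechanism can be reproduced in a few lines: each item decays as a stretched exponential, and averaging over Gamma-distributed hazard scales $c_x$ (one density with $g(c)\sim c^{\beta-1}$ near zero) yields a population curve whose log-log slope approaches $b=\beta(1-\alpha)$. The choice of Gamma mixing and the parameter $\beta=1$ are illustrative assumptions.

```python
import numpy as np

def population_retention(t, alpha, beta, n_items=200_000, seed=0):
    """Mean of R_x(t) = exp(-c_x * t^(1-alpha)) with c_x ~ Gamma(beta, 1)."""
    rng = np.random.default_rng(seed)
    c = rng.gamma(beta, size=n_items)
    return float(np.exp(-c * t ** (1.0 - alpha)).mean())

alpha, beta = 0.459, 1.0          # alpha from the Wikipedia measurement; beta assumed
ts = np.array([10.0, 100.0, 1000.0])
R = np.array([population_retention(t, alpha, beta) for t in ts])
# Empirical forgetting exponent: negative log-log slope, tending to beta*(1-alpha) = 0.541.
b_hat = -np.polyfit(np.log(ts), np.log(R), 1)[0]
assert 0.40 < b_hat < 0.60
```

At finite $t$ the fitted slope sits slightly below the asymptotic $0.541$, which is the expected pre-asymptotic behaviour of the mixture.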
Theorem 6 (Inseparability of Associative Lures).
Under Axioms A1 and A5, let $c$ be a $\delta$-convex lure for studied items $x_{1},\ldots,x_{k}$. If each studied item is accepted with margin $m>0$, i.e. $f_{q}(x_{i})\geq\tau+m$ for all $i$, then $f_{q}(c)\geq\tau+m-\delta$. If $\delta<m$, the lure is also accepted. If $\delta=0$, no threshold in this score family that accepts all studied items can reject the lure.
Proof.
$f_{q}(c)=\langle w_{q},\sum_{i}a_{i}\phi(x_{i})+\varepsilon\rangle=\sum_{i}a_{i}f_{q}(x_{i})+\langle w_{q},\varepsilon\rangle\geq\sum_{i}a_{i}(\tau+m)-\|w_{q}\|\|\varepsilon\|\geq\tau+m-\delta$. $\square$
SPP alone guarantees semantic proximity but not threshold inseparability. The $\delta$-convexity condition (A5) is stronger and empirically testable: for all 24 DRM lures, the convex-hull reconstruction error $\delta^{*}$ is smaller than the observed decision margin $m$, confirming the theorem's premise.
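The proof is a two-line inequality, and it can be confirmed numerically: build a lure as a convex combination of studied embeddings plus a small residual, and its score never falls more than $\delta$ below the acceptance bound. A minimal sketch with random vectors (all quantities synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
phi_x = rng.normal(size=(5, 16))          # phi(x_i) for k = 5 studied items
w_q = rng.normal(size=16)
w_q /= np.linalg.norm(w_q)                # unit query direction, so ||w_q|| = 1
a = np.full(5, 0.2)                       # convex weights a_i (sum to 1)
eps = 0.01 * rng.normal(size=16)          # residual; delta = ||eps||
phi_c = a @ phi_x + eps                   # the delta-convex lure

scores = phi_x @ w_q                      # f_q(x_i)
m = 0.05
tau = scores.min() - m                    # threshold accepting every item with margin m
delta = np.linalg.norm(eps)
# Theorem 6 bound: f_q(c) >= tau + m - delta, so any delta < m forces acceptance.
assert phi_c @ w_q >= tau + m - delta - 1e-12
```

The bound is deterministic: a convex average of scores cannot drop below their minimum, and the residual can shift the score by at most $\|w_q\|\,\|\varepsilon\| = \delta$.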
Theorem 7 (No Escape for Kernel-Threshold Memory).
Under Axioms A1–A5: (1) the semantically useful representation has effective rank controlled by the semantic operator spectrum; (2) every admissible retrieval neighbourhood has positive competitor mass; (3) under growing memory, retention decays to zero; (4) for $\delta$-convex associative lures with $\delta$ below the decision margin, false recall cannot be eliminated by threshold tuning within the same score family. Any architecture that simultaneously eliminates interference-driven forgetting and associative false recall must either abandon semantic continuity and kernel-threshold retrieval, add an external symbolic verifier or exact episodic record, or send the semantic effective rank to infinity.
The no-escape theorem operates at two levels
The geometric level appears universal under the SPP formalism; the behavioural level is
architecture-dependent. The distinction between these two levels is the paper’s central contribution beyond
HIDE. At the geometric level, every system satisfying Axioms A1–A4 has low semantic effective rank,
non-negligible spherical cap volumes, and representation-space vulnerability to interference. This is derived
under stated assumptions and empirically confirmed in all five architectures. At the behavioural level, the
manifestation depends on whether the architecture can build a workaround, and what that workaround costs.
We organise the five architectures into three categories based on how the geometric vulnerability manifests
behaviourally. Category 1 (pure geometric systems: vector database, graph memory) expresses the vulnerability
directly: the geometry IS the behaviour. Category 2 (reasoning-overlay systems: attention memory, parametric
memory) possesses the geometric vulnerability but can partially override it behaviourally, at the cost of
converting graceful degradation into catastrophic failure. Category 3 (SPP-violating systems: filesystem/BM25)
escapes the vulnerability entirely by abandoning semantic organisation. The remainder of this section reports
results for each.
The five architectures split into three categories:
Category 1: Pure geometric systems (vector database, graph memory). The geometry IS the behaviour. These systems exhibit smooth power-law forgetting ($b=0.440$, $0.478$), robust DRM false recall ($\text{FA}=0.583$, $0.208$), the spacing effect (long $>$ massed), and TOT states ($2.0\%$, $2.8\%$). No escape at either level.
Category 2: Systems with explicit reasoning overlays (attention memory, parametric memory). The geometric vulnerability exists ($d_{\text{eff}}=17.9$, lures within caps), but the system can reason its way around it behaviourally. The LLM correctly rejects DRM lures by parsing word lists ($\text{FA}=0.000$). However, interference manifests differently: the attention architecture shows a phase transition (perfect accuracy $\to$ catastrophic failure at $\sim 100$ competitors), and parametric memory shows monotonically decreasing accuracy with neighbour density ($1.000\to 0.113$, $b=0.215$ on PopQA). The workaround converts graceful degradation into catastrophic failure.
Category 3: Systems that abandon SPP (filesystem/BM25 keyword retrieval). BM25 produces $b=0.000$, $\text{FA}=0.000$, and no spacing effect, yielding complete immunity. But SPP correlation is $r=0.210$ and semantic retrieval agreement is $15.5\%$. It escaped interference by escaping usefulness. This IS the no-escape theorem in action.
Interference produces power-law forgetting in every SPP system
In the architectures where temporal interference is expressed through graded retrieval competition, the forgetting exponent depends on competitor count and environmental arrival statistics. For the vector database (Architecture 1), $b=0.440\pm 0.030$ ($R^{2}=0.570$, $n=5$ seeds) at $10{,}000$ competitors with power-law temporal decay ($\psi=0.5$, $\beta=0.20$), matching HIDE's $b=0.460$ to within one standard error. At zero competitors, $b<0.01$: without interference, there is no forgetting. This is not a subtle distinction: the identical encoding function without competitors yields $b$ more than forty times smaller.
The graph memory (Architecture 4, MiniLM + PageRank) produces $b=0.478\pm 0.028$ at $10{,}000$ competitors, squarely in the human range despite an entirely different retrieval mechanism. The parametric architecture (Architecture 5, Qwen2.5-7B) confirms interference in model weights via the PopQA dataset ($14{,}267$ questions): accuracy decreases monotonically from $1.000$ (fewer than $50$ near neighbours) to $0.257$ ($50$–$200$), $0.170$ ($200$–$500$), and $0.113$ (more than $1{,}000$). Power-law fit: $b=0.215$, $R^{2}=0.501$. Geometry plus power-law arrival gives stretched-exponential retention for individual items (Corollary 4). The empirically observed power law ($b=0.440$–$0.478$) emerges after averaging over item-level heterogeneity in interference scale (Proposition 5), a standard scale-mixture mechanism.
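Exponents like these come from log-log regression of accuracy against competitor count. A minimal version of that fit, using the PopQA bin accuracies from the text with bin centres that are our assumption (the paper's fit uses the underlying per-question data, so the exponent recovered here differs from the reported $b=0.215$):

```python
import numpy as np

def fit_forgetting_exponent(n_competitors, accuracy):
    """Fit accuracy ~ A * n^(-b) by least squares in log-log space; returns b."""
    slope, _ = np.polyfit(np.log(n_competitors), np.log(accuracy), 1)
    return -slope

# Bin accuracies reported in the text; bin centres are illustrative choices.
n = np.array([25.0, 125.0, 350.0, 1000.0])
acc = np.array([1.000, 0.257, 0.170, 0.113])
b_hat = fit_forgetting_exponent(n, acc)
assert b_hat > 0  # accuracy falls with neighbour density
```

The same routine applied to retention-versus-delay data yields the forgetting exponents $b$ quoted for the vector database and graph memory.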
The attention architecture (Architecture 2, Qwen2.5-7B context window) reveals a qualitatively different failure mode that power-law fitting cannot capture. Rather than smooth degradation, accuracy undergoes a phase transition: near-perfect retrieval with fewer than $100$ competitors collapses to near-zero at $200+$. A logistic fit $R(n)=1/(1+\exp(k(n-n_{0})))$ captures this cliff accurately ($n_{0}\approx 120$, $k\approx 0.03$). The distinction is itself informative: Category 1 systems degrade continuously (power law), while Category 2 systems hold perfectly then fail discontinuously (sigmoid). These are qualitatively different failure signatures of the same underlying geometric vulnerability. The connection is precise: attention over a finite context window performs implicit nearest-neighbour search with a hard capacity limit. Below that limit, the reasoning overlay can compensate for geometric interference by attending selectively to relevant tokens. Above it, the $\theta$-cap of competitors saturates the attention budget and the system collapses. The sigmoid inflection point ($n_{0}\approx 120$) marks the competitor count at which the attention capacity can no longer absorb the geometric interference predicted by Theorem 2. The filesystem architecture (Architecture 3, BM25) shows $b=0.000$ (zero forgetting) because keyword matching bypasses semantic similarity entirely. But this immunity costs usefulness: BM25 retrieval agrees with cosine similarity on only $15.5\%$ of queries.
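The sigmoid signature can be characterised by fitting the logistic form above. A sketch on synthetic cliff data: we generate the points from the quoted $n_0\approx 120$ and $k\approx 0.03$, so this demonstrates the fitting procedure rather than reproducing the experiment.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(n, k, n0):
    """R(n) = 1 / (1 + exp(k (n - n0))): near 1 below n0, near 0 above."""
    return 1.0 / (1.0 + np.exp(k * (n - n0)))

n = np.arange(0, 401, 20, dtype=float)
acc = logistic(n, 0.03, 120.0)             # noiseless synthetic accuracies
(k_hat, n0_hat), _ = curve_fit(logistic, n, acc, p0=[0.05, 100.0])
assert abs(n0_hat - 120.0) < 1.0           # recovered inflection point
```

On real accuracy curves the recovered $n_0$ locates the capacity limit; comparing the logistic residuals with a power-law fit's residuals is one way to classify an architecture as Category 1 or Category 2.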
False recall is geometrically inevitable but behaviourally overridable
We did not build a false memory system; we found one in the geometry of every architecture. The DRM experiment^16 tests false recognition of semantic lures. For the vector database, $\text{FA}=0.583$ at $\theta=0.864$ (the BGE-large-calibrated threshold where unrelated $\text{FA}=0$), matching HIDE exactly. For the graph memory, $\text{FA}=0.208$ at $\theta=0.82$. The nearly $3\times$ difference between the two Category 1 architectures reflects different threshold calibrations and different semantic clustering geometries: BGE-large's contrastive training produces tighter semantic clusters than MiniLM, placing lures closer to studied items relative to the threshold. Both rates substantially exceed what any SPP-free system could produce ($\text{FA}=0$), and the spherical cap analysis confirms that all $24/24$ lures across both architectures lie within the predicted cap intersection of their studied associates. Theorem 6 is confirmed without exception.
For the LLM architectures (attention, parametric), $\text{FA}=0.000$ at the behavioural level: the model correctly identifies that "sleep" was not in the word list. But this does not violate the theorem. The theorem applies to the representation geometry, and the geometric prediction holds: lures are indistinguishable from studied items in the hidden-state space. The behavioural override requires explicit list-checking, a reasoning capability that operates on top of the geometric vulnerability, not in place of it. A system without this reasoning layer (e.g., a vector database, a knowledge graph, or a retrieval pipeline) has no such override. The DRM result has the same important asymmetry noted in HIDE: it requires no boundary conditions. Forgetting requires competitors. False recall requires only the geometry of meaning. SPP alone guarantees semantic proximity but not threshold inseparability. The formal guarantee requires the stronger $\delta$-convexity condition (Axiom A5, Theorem 6), which we verify empirically: for all $24$ DRM lures, the convex-hull reconstruction error $\delta^{*}$ is smaller than the observed decision margin $m$.
A natural question arises: if LLMs escape DRM false recall via explicit reasoning (FA $=0.000$), why do humans, who also reason, show FA $\approx 0.55$? The answer has two parts. First, human source monitoring is not a separate symbolic layer operating on top of the memory system; it shares the same geometric substrate, so the lure's representation is already indistinguishable from studied items before the monitoring system engages. The LLM, by contrast, has access to the literal token sequence in its context window, a symbolic record external to the embedding space that permits exact matching. Human episodic memory has no such external record. Second, explicit source monitoring in humans is metabolically expensive and is not automatically deployed during recognition tasks; the DRM paradigm exploits precisely this.
The spacing effect reflects temporal interference geometry
In architectures where temporal interference is expressed through graded retrieval competition, distributed practice beats massed practice. For the vector database with $10{,}000$ distractors and age-proportional noise ($\sigma=0.25$): massed $=0.360\pm 0.022$, long-spacing $=0.902\pm 0.039$ (Cohen's $d=24.6$, $n=5$ seeds). The mechanism is geometric: spaced repetitions create traces at different temporal positions; massed traces are uniformly old ($\sim 30$ days) and uniformly degraded. For the graph memory: long $=0.996$, massed $=0.920$, same direction, smaller magnitude.
The attention architecture shows the opposite pattern: massed $=1.000$, all spaced conditions $=0.000$. This is an architectural capacity artefact, not a refutation of the spacing prediction: the context window imposes a hard limit on token distance, and spaced repetitions with intervening fillers push the target beyond the attention horizon. The result does not bear on the geometric spacing prediction; it reveals instead how context-window limits create a different interference geometry, relocating interference from the temporal domain to the capacity domain. The filesystem (BM25) shows all conditions at $1.000$; keyword matching is unaffected by spacing. Both "failures" are informative: they reveal the specific architectural constraints that determine how the geometric vulnerability manifests behaviourally.
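The geometric mechanism behind the spacing advantage can be sketched in a toy simulation: each repetition leaves a trace whose noise grows with age, and recall succeeds if any trace's noisy strength clears a threshold. All functional forms and constants below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def recall_prob(trace_ages_days, sigma=0.25, threshold=0.35, trials=20_000):
    """P(any trace clears the threshold). Strength decays with age;
    noise is age-proportional, echoing the vector-database condition."""
    ages = np.asarray(trace_ages_days, dtype=float)
    strength = ages ** -0.5                         # assumed power-law trace decay
    noise = sigma * (ages / 30.0) * rng.normal(size=(trials, len(ages)))
    return float(((strength + noise).max(axis=1) > threshold).mean())

massed = recall_prob([30.0, 30.0, 30.0])   # three traces, all ~30 days old
spaced = recall_prob([1.0, 10.0, 30.0])    # traces at varied temporal positions
assert spaced > massed
```

The spaced condition wins because its youngest trace is both strong and low-noise, whereas the massed traces are uniformly old and uniformly degraded, exactly the geometric account given above.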
The dimensionality convergence
The label “3,584-dimensional” is, in a functionally meaningful sense, a misnomer. Despite nominal
dimensionalities spanning an order of magnitude (
[MATH: 384384 :MATH]
for MiniLM to
[MATH: 3,5843{,}584 :MATH]
for Qwen2.5 hidden states), effective dimensionality converges dramatically. BGE-large:
[MATH: deff=158d_{\text{eff}}=158 :MATH]
(participation ratio),
[MATH: deff=10.6d_{\text{eff}}=10.6 :MATH]
(Levina–Bickel^8). MiniLM:
[MATH: deff=127d_{\text{eff}}=127 :MATH]
. Qwen2.5-7B hidden states:
[MATH: deff=17.9d_{\text{eff}}=17.9 :MATH]
, a
[MATH: 200200 :MATH]
-fold compression. The Levina–Bickel estimator, which measures local manifold dimensionality, gives
[MATH: deff≈10d_{\text{eff}}\approx 10 :MATH]
–
[MATH: 1515 :MATH]
across all models, consistent with the rate-distortion bound (Theorem 1). Biological neural populations operate
at estimated
[MATH: deff=100d_{\text{eff}}=100 :MATH]
–
[MATH: 500500 :MATH]
^19, 7, placing them near the transition zone. The convergence is not coincidental: any SPP-satisfying encoding
must concentrate variance in the
[MATH: ∼10{\sim}10 :MATH]
–
[MATH: 5050 :MATH]
semantically meaningful directions.
A note on estimator discrepancy is warranted. HIDE reported $d_{\text{eff}}\approx 16$ for BGE-large; this paper reports $d_{\text{eff}}=158$ (participation ratio) and $d_{\text{eff}}=10.6$ (Levina–Bickel) for the same model. The discrepancy is methodological, not contradictory. The participation ratio measures global variance concentration (how many dimensions carry substantial eigenvalue mass) and is sensitive to the long tail of small but non-zero eigenvalues. The Levina–Bickel estimator measures local manifold dimensionality (the number of directions along which the data actually varies in a neighbourhood). HIDE’s value of $\approx 16$ was computed on PCA-projected embeddings, which truncates the tail. For the interference theorems, the Levina–Bickel estimate ($\approx 10$–$15$) is the governing quantity, and the reason is mathematical, not merely methodological: interference occurs in local neighbourhoods (the $\theta$-cap of Theorem 2), and the crowding within these neighbourhoods is determined by the local manifold dimensionality, not by the global variance spread. The participation ratio captures the latter; Levina–Bickel captures the former. Plugging $d_{\text{eff}}=158$ into the spherical cap formula would dramatically underestimate interference, because the global variance includes dimensions along which nearby items do not actually vary. The correct input to Theorem 2 is the local intrinsic dimensionality ($\approx 10$–$15$), and all three estimators confirm that this value places these systems in the interference-vulnerable regime ($d_{\text{eff}}<100$).
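The two estimators are simple to state side by side. The following is a minimal numpy sketch, not the paper's code: the participation ratio is computed from the covariance spectrum, and the Levina–Bickel estimate from $k$-nearest-neighbour distances (with the inverse-averaged variant of the MLE). The toy data, dimensions, and seed are illustrative choices.

```python
import numpy as np

def participation_ratio(X):
    """Global effective dimensionality: (sum lambda)^2 / sum(lambda^2)
    over the eigenvalues of the sample covariance."""
    lam = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def levina_bickel(X, k=10):
    """Local intrinsic dimensionality: Levina-Bickel k-NN maximum-likelihood
    estimate, with the per-point inverse estimates averaged before inverting."""
    sq = (X ** 2).sum(1)
    D = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None))
    np.fill_diagonal(D, np.inf)            # exclude self-distances
    T = np.sort(D, axis=1)[:, :k]          # k nearest-neighbour distances
    inv_m = np.log(T[:, -1:] / T[:, :-1]).mean(axis=1)  # 1/m-hat per point
    return float(1.0 / inv_m.mean())

rng = np.random.default_rng(42)
# toy data: variance genuinely spans only 10 directions inside 256 nominal dims
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 256))
print(participation_ratio(X))  # near 10: global variance concentration
print(levina_bickel(X))        # near 10: local manifold dimension
```

On linearly embedded data the two agree; they diverge (as in the BGE-large numbers above) when a long tail of small eigenvalues inflates the global measure without adding local directions of variation.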
Tested interventions reveal a usefulness-immunity tradeoff
Every cure for memory’s “flaws” either fails or kills the patient.
Solution 1: Increase nominal dimensionality. Zero-padding BGE-large from $1{,}024$ to $4{,}096$ dimensions: $b$ stays at ${\sim}0.31$ because $d_{\text{eff}}$ is unchanged ($124$ in both cases). Only PCA reduction to $64$ dimensions changes $b$ ($0.370$), by genuinely reducing the space, not by padding it.
Solution 2: BM25 keyword retrieval. Eliminates DRM false recall ($\text{FA}=0$) and forgetting ($b=0$). But semantic retrieval agreement: $15.5\%$. This is Architecture 3’s result rephrased as a solution.
Solution 3: Orthogonalisation. Gram–Schmidt reduces interference to zero (mean off-diagonal cosine $<10^{-4}$) but nearest-neighbour accuracy drops to $0.0\%$. Random projection to $256$ dimensions preserves $68\%$ accuracy but $d_{\text{eff}}=77$, still in the interference regime.
Solution 4: Memory compression. At $50$ clusters: $b=0.432$, retrieval accuracy $=0.988$. At $2{,}500$ clusters: $b=0.163$, accuracy $=0.928$. The tradeoff is monotonic: you can reduce $b$ by compressing, but you lose specific-fact retrieval.
Every solution traces a strict Pareto frontier between interference immunity and semantic usefulness. Compression at $k=2{,}500$ achieves $b=0.163$ with $92.8\%$ accuracy, a potentially acceptable engineering compromise for specific applications, but not mathematical immunity. The theorem does not claim that interference cannot be reduced; it claims it cannot be eliminated without sacrificing SPP. The tradeoff frontier itself is the No-Escape Theorem in empirical form.
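Why zero-padding cannot work is easy to see directly: padded coordinates carry zero variance, so the covariance spectrum (and hence any spectral notion of effective dimensionality) is untouched, while PCA truncation actually removes eigenvalue mass. A small numpy illustration under toy data (the sizes and seed are arbitrary, and the participation ratio stands in for the paper's $d_{\text{eff}}$):

```python
import numpy as np

def participation_ratio(X):
    """d_eff = (sum lambda)^2 / sum(lambda^2) of the covariance spectrum."""
    lam = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
# toy embeddings whose variance spans ~100 directions in 512 nominal dims
X = rng.normal(size=(2000, 100)) @ rng.normal(size=(100, 512))

# "Solution 1" as zero-padding: widen to 2,048 nominal dimensions
padded = np.hstack([X, np.zeros((len(X), 2048 - X.shape[1]))])

# PCA truncation to 64 dimensions genuinely shrinks the space
Xc = X - X.mean(0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:64].T

print(participation_ratio(X))        # far below 512: set by the data, not the width
print(participation_ratio(padded))   # unchanged: zero columns carry no variance
print(participation_ratio(reduced))  # <= 64: the space is genuinely smaller
```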
Discussion
We use strong language at points because the claim is structural: within the theorem class, the tradeoff is not
an empirical accident but a consequence of the retrieval geometry.
The central result of this paper is that semantically organised memory has a structural vulnerability to
interference, and that this vulnerability appears at two levels. At the geometric level, semantically useful
representations with finite effective rank create retrieval neighbourhoods with non-zero competitor mass and
non-trivial lure overlap. At the behavioural level, different architectures express that vulnerability
differently. Pure retrieval systems express it directly as smooth forgetting and false recall; systems with
explicit reasoning can partially compensate, but often replace graceful degradation with brittle failure modes;
systems that avoid the vulnerability entirely do so by giving up semantic generalisation.
The broader implication is a limit on the naive reading of the Bitter Lesson for memory systems. The Bitter
Lesson correctly emphasises the long-run power of general methods plus computation. Our result does not argue
against that principle. It argues that within semantically organised memory, scale alone is not sufficient. The
same geometry that enables semantic generalisation also creates representational crowding, competitor mass, and
lure proximity. Therefore larger models and more data may improve performance, but they do not in themselves
remove interference as a class of phenomena. Beyond a point, memory requires architectural innovation, not
scale alone. The comparison across architectures is best read as a map of how a shared geometric pressure
manifests across architectures, not as a single unified leaderboard.
The resolution of the interference-versus-decay debate^20,4 is now concrete. Decay alone produces $b<0.01$; interference produces $b=0.440$–$0.478$, in the human range. Geometry plus power-law arrival gives stretched-exponential retention for individual items. The empirically observed power law emerges after averaging over item-level heterogeneity in interference scale, a standard scale-mixture mechanism (Proposition 5). This sharpens rather than weakens the theory: it identifies exactly which part of the forgetting law is geometric (the hazard scale $\mu(C_{x})$), which part is environmental ($\alpha$), and which part is population-level ($\beta$). The parametric result is perhaps the most striking: Qwen2.5-7B’s accuracy on factual questions drops from $1.000$ to $0.113$ as the density of semantically similar facts in the training corpus increases. This is interference in model weights: not in an external store, not in a context window, but in the parameters themselves. The complementary learning systems hypothesis^10 can be reinterpreted: fast hippocampal encoding and slow neocortical consolidation manage the interference-usefulness tradeoff; they do not eliminate interference. Even the brain’s most sophisticated consolidation mechanism (replay-guided refinement with importance weighting^10) does not escape interference; it manages the position on the tradeoff frontier that the no-escape theorem establishes.
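The scale-mixture mechanism invoked here can be illustrated with the textbook exponential–gamma case; this is a sketch of the general mechanism, not the paper's exact Proposition 5. If each item's retention decays with its own interference-determined hazard $\lambda$, and $\lambda$ varies across items with a $\mathrm{Gamma}(\beta,\theta)$ density, the population average is a power law:

```latex
R(t) \;=\; \int_0^\infty e^{-\lambda t}\,
           \frac{\theta^{\beta}}{\Gamma(\beta)}\,
           \lambda^{\beta-1} e^{-\theta\lambda}\, d\lambda
     \;=\; \left(\frac{\theta}{\theta+t}\right)^{\!\beta}
     \;\sim\; t^{-\beta} \qquad (t \gg \theta).
```

Item-level curves are exponential while the average is a power law, which is exactly the separation the text draws between the geometric hazard scale and the population-level exponent $\beta$.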
We note that the cited $d_{\text{eff}}=100$–$500$ range derives from visual cortex recordings^19,7. Memory-related structures (hippocampus, entorhinal cortex) may have different effective dimensionalities; hippocampal place cells, for instance, are thought to operate in lower-dimensional manifolds. The interference prediction holds for any $d_{\text{eff}}$ below ${\sim}100$, so the conclusion is robust to variation in the biological estimate.
The DRM result has an asymmetry first noted in HIDE that the two-level framework clarifies. False recall
requires no boundary conditions: it holds for noiseless, competitor-free systems (Theorem 6). This makes it
more fundamental than forgetting. LLMs equipped with explicit list-checking or an external symbolic record do
not refute the theorem; they instantiate a behavioural workaround outside the pure kernel-threshold retrieval
class. The theorem concerns the semantic memory substrate. Workarounds can route around its vulnerabilities,
but only by adding an auxiliary mechanism not described by the substrate alone. Production systems that rely on
semantically continuous retrieval are expected to inherit related pressures. The implication is that complete
immunity to false recall typically requires leaving the semantic retrieval regime or adding external
verification^16.
The parametric TOT rate ($69\%$) deserves explicit discussion. This rate ($18\times$ the human baseline and $34\times$ the vector database rate) reflects a systemic property of parametric models (not specific to Qwen): all such models store facts as superposed weight-space associations. When queried, multiple associations activate simultaneously, producing partial retrieval at far higher rates than architectures with explicit, separated memory stores. The operational definition of TOT transfers imperfectly to parametric systems: “correct category but wrong specific answer” captures a different failure mode than the phenomenological tip-of-the-tongue experience in humans. The elevated rate is thus informative about the geometry of weight-space retrieval rather than directly comparable to human TOT rates. We flag this definitional caveat explicitly: the parametric TOT entry in Figure 7 should be interpreted with caution, as it reflects a categorically different operational definition from the phenomenological TOT experience measured in humans and the geometric near-miss definition used for embedding architectures.
One consideration not addressed by the five-architecture survey is hybrid retrieval: most production systems
combine architectures (e.g., BM25 keyword pre-filtering followed by dense vector re-ranking). Such systems
attempt to navigate the tradeoff frontier by falling back on Category 3 retrieval (keyword matching) when
Category 1 retrieval (semantic similarity) suffers geometric interference. However, combining them does not
violate the No-Escape Theorem; it builds a routing layer between a system that forgets and a system that cannot
generalise. The semantic component remains subject to Theorems 1–4 whenever it is invoked, and the keyword
component contributes only non-semantic retrieval when it is. The hybrid reduces the frequency of interference
events at the cost of reducing the frequency of semantic generalisation, another point on the tradeoff
frontier, not an escape from it.
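The hybrid pattern described above is easy to make concrete. The following is an illustrative sketch only (a keyword-overlap score stands in for BM25, and the documents, field names, and sizes are invented for the example): the first stage is non-semantic and cannot generalise, while the second stage is the semantic component to which the theorems apply whenever it runs.

```python
import numpy as np

def keyword_prefilter(query_terms, docs, k=50):
    """Stage 1: crude keyword overlap standing in for BM25. Non-semantic,
    so immune to lure-style interference, but unable to generalise."""
    scores = [len(query_terms & d["terms"]) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [docs[i] for i in order[:k] if scores[i] > 0]

def dense_rerank(query_vec, candidates):
    """Stage 2: cosine re-ranking. This is the semantic component; the
    kernel-threshold results apply whenever it is invoked."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates, key=lambda d: cos(query_vec, d["vec"]),
                  reverse=True)

rng = np.random.default_rng(1)
docs = [
    {"terms": {"paris", "capital"}, "vec": rng.normal(size=8)},
    {"terms": {"paris", "hilton"},  "vec": rng.normal(size=8)},
    {"terms": {"berlin", "capital"}, "vec": rng.normal(size=8)},
]
hits = dense_rerank(rng.normal(size=8), keyword_prefilter({"paris"}, docs))
print(len(hits))  # → 2: only keyword-matching docs ever reach the semantic stage
```

The routing is visible in the output: the Berlin document never reaches re-ranking, which is the sense in which the hybrid trades generalisation frequency for interference frequency.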
Several anticipated objections deserve response. First, one might argue that SPP is too weak. Any stronger definition implies SPP as a special case; the theorem applies a fortiori. Second, Theorem 1 might appear to prove only finiteness. The rate-distortion argument^18,1 proves smallness: intrinsic dimensionality ${\sim}10$–$50$^8 bounds $d_{\text{eff}}$ regardless of hardware. Third, the exponential-to-power-law conversion relies on Anderson–Schooler statistics, which we verify ($\alpha=0.459$). Fourth, we use spherical caps, not convex hulls; the distinction matters for angular similarity. Fifth, attention is not cosine similarity, but SPP is the key property, verified for all architectures ($p<0.001$). Sixth, LLM DRM confounds parametric and episodic memory; this is precisely why the two-level framework matters. Seventh, the connection to the bias-variance tradeoff is real, but our contribution is specific quantitative predictions derived from first principles.
Implications for system design
The no-escape theorem translates into specific, actionable predictions for retrieval system engineers. First, the severity of forgetting, captured by the prefactor $A=p_{\text{near}}(d_{\text{eff}})\cdot\lambda_{0}/(1-\alpha)$, scales with $p_{\text{near}}$: for a database with $d_{\text{eff}}\approx 16$ and $10{,}000$ entries, $A$ reaches values consistent with the empirically observed $b\approx 0.44$ over realistic time windows. Retrieval accuracy will degrade as a power law with database age; re-ranking, metadata filters, and structured memory can materially change behaviour, but within the kernel-threshold class they navigate the tradeoff frontier rather than escaping it. Second, any SPP-satisfying retrieval system will produce false positives for semantically associated queries at rates comparable to its true positive rate; the DRM prediction applies directly to production RAG systems. Third, increasing nominal dimensionality is provably not a solution (Solution 1): only training objectives that genuinely increase the effective rank of stored representations (a target that current contrastive objectives do not optimise for, and which the low intrinsic dimensionality of natural language makes difficult to achieve) can reduce interference. The gap between “inevitable” and “catastrophic” is where engineering contributes: optimising noise parameters, managing competitor density through intelligent caching, and designing consolidation strategies that navigate the compression–fidelity frontier (Solution 4).
The standard engineering response to forgetting and false recall is to treat them as bugs and try to fix them.
Our results suggest they are not bugs. They are the cost of admission. Any memory system that organises
information by meaning will, as it grows, forget old items through interference and falsely recognise items it
never stored. These are not signs of a broken system; they are signs of a system that is doing what it was
designed to do, namely represent meaning geometrically, under the constraints that geometry imposes. Systems
can mitigate interference, reroute around it, or trade semantic capability for robustness, but within the
kernel-threshold regime they cannot eliminate it for free. The price of meaning is interference. Within this
theorem class, there is no escape.
Methods
Models and architectures
Five memory architectures were implemented. Architecture 1 (Vector Database): BAAI/bge-large-en-v1.5^21 ($1{,}024$ dim, MIT licence). Cosine similarity retrieval with temporal decay $S(t)=(1+\beta t)^{-\psi}$, $\beta=0.20$, $\psi=0.5$. Age-proportional noise: $\boldsymbol{\epsilon}=(\sigma\sqrt{a+0.01}/\sqrt{d})\mathbf{z}$, $\sigma=0.5$. Stored in HIDESpace^3. Architecture 2 (Attention Memory): Qwen2.5-7B-Instruct^13 (Apache 2.0, fp16). Facts in context window; retrieval via generation. Proximity: cosine of middle-layer hidden states ($d=3{,}584$). Architecture 3 (Filesystem Memory): JSON records. BM25 keyword search (rank_bm25, top-$50$) $\to$ Qwen2.5-7B relevance re-ranking ($1$–$10$ scale, normalised to $[0,1]$). Architecture 4 (Graph Memory): all-MiniLM-L6-v2^15 ($384$ dim, Apache 2.0). Edges if cosine $>0.7$; retrieval via personalised PageRank ($\alpha=0.85$). Architecture 5 (Parametric Memory): Qwen2.5-7B-Instruct. Knowledge in weights; probed via direct Q&A without RAG.
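Architecture 1's scoring rule can be sketched directly from the formulas above. This is a minimal illustration of the stated decay and noise model, not the HIDESpace implementation; the vector dimension and seed are arbitrary.

```python
import numpy as np

BETA, PSI, SIGMA = 0.20, 0.5, 0.5  # decay and noise parameters from Methods

def decayed_score(query, memory, age_days):
    """Architecture 1 retrieval score: cosine similarity multiplied by the
    temporal decay S(t) = (1 + beta*t)^(-psi)."""
    cos = query @ memory / (np.linalg.norm(query) * np.linalg.norm(memory))
    return cos * (1.0 + BETA * age_days) ** (-PSI)

def noisy_query(target, age_days, rng):
    """Age-proportional query corruption:
    eps = (sigma * sqrt(a + 0.01) / sqrt(d)) * z, with z ~ N(0, I)."""
    d = target.shape[0]
    eps = (SIGMA * np.sqrt(age_days + 0.01) / np.sqrt(d)) * rng.normal(size=d)
    return target + eps

rng = np.random.default_rng(42)
target = rng.normal(size=64)
# the same memory scores much lower when old: decay shrinks the score and
# age-proportional noise degrades the query
fresh = decayed_score(noisy_query(target, 0, rng), target, age_days=0)
old = decayed_score(noisy_query(target, 30, rng), target, age_days=30)
print(fresh, old)
```

At 30 days the decay factor alone is $(1+6)^{-0.5}\approx 0.38$, so an old exact match can score below a recent near-miss, which is the mechanism the forgetting experiments probe.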
Forgetting experiments
Embedding architectures (1, 4): $100$ target facts from Wikipedia, $n_{\text{near}}\in\{0,10,50,100,200,500,1{,}000,5{,}000,10{,}000\}$ competitors. Targets and competitors stored in HIDESpace. Query with noise-corrupted target embedding; retrieval with temporal decay. Accuracy measured at $10$ age bins over $30$ simulated days. Power-law fit: $R(t)=a\cdot t^{-b}$^17. Decay parameter $\beta=0.20$ calibrated via sweep over $[0.01,0.5]$ to match HIDE ($b=0.460$). Attention architecture (2): $50$ target facts $\times$ $5$ positions $\times$ $7$ $n_{\text{near}}$ values $\times$ $5$ seeds. Context: system prompt $+$ numbered facts $+$ question. Age $=$ position, normalised to a $30$-day scale. Parametric architecture (5): PopQA dataset^9 ($14{,}267$ questions). Neighbour density: BGE-large cosine $>0.4$ to the Wikipedia corpus. Binned: $\{0$–$50$, $50$–$200$, $200$–$500$, $500$–$1{,}000$, $1{,}000+\}$. Power-law fit on the bin-accuracy curve. Filesystem (3): BM25 retrieval of target among competitors; LLM re-ranking of top-$50$.
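The power-law fit used throughout reduces to ordinary least squares in log-log space. A self-contained sketch (synthetic retention data; the exponent $0.44$ is chosen to match the vector-database result, and the sampling grid mimics the 10 age bins over 30 days):

```python
import numpy as np

def fit_power_law(t, R):
    """Fit R(t) = a * t^(-b) by least squares on log R vs log t;
    returns (a, b)."""
    slope, intercept = np.polyfit(np.log(t), np.log(R), 1)
    return float(np.exp(intercept)), float(-slope)

# synthetic retention curve with b = 0.44, sampled at 10 age bins over
# 30 simulated days
t = np.linspace(1, 30, 10)
R = 0.95 * t ** -0.44
a, b = fit_power_law(t, R)
print(round(a, 2), round(b, 2))  # → 0.95 0.44
```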
DRM false memory
All $24$ published lists^16 ($15$ studied $+$ $1$ critical lure). Embedding architectures: centroid similarity. Threshold sweep $\theta\in[0.50,0.95]$, step $0.01$. For BGE-large: $\text{FA}=0.583$ at $\theta=0.864$. For MiniLM: $\text{FA}=0.208$ at $\theta=0.82$. LLM architectures: prompt with word list; query “Was WORD in the list? yes/no.” Parse first yes/no. $24$ lists $\times$ $5$ seeds.
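The centroid-similarity recognition rule is the whole of the embedding-side DRM protocol, so it fits in a few lines. The geometry below is a toy stand-in for real word embeddings (the dimension, scatter scale, and seed are invented for illustration): studied items scatter around a shared concept direction, and the critical lure sits at the centre of the scatter, closer to the centroid than most studied items.

```python
import numpy as np

def recognised(word_vec, studied_vecs, theta):
    """Kernel-threshold recognition: report 'old' iff cosine similarity to
    the centroid of the studied list clears the threshold theta."""
    c = studied_vecs.mean(axis=0)
    cos = word_vec @ c / (np.linalg.norm(word_vec) * np.linalg.norm(c))
    return bool(cos >= theta)

rng = np.random.default_rng(7)
# toy geometry: 15 studied items scatter around a shared concept direction
concept = rng.normal(size=32)
concept /= np.linalg.norm(concept)
studied = concept + 0.25 * rng.normal(size=(15, 32))
lure = concept                       # the lure IS the shared meaning
unrelated = rng.normal(size=32)

print(recognised(lure, studied, theta=0.8))       # a false alarm
print(recognised(unrelated, studied, theta=0.8))  # correctly rejected
```

Because the centroid of the studied items converges on the concept direction, the never-presented lure clears any threshold the studied items clear, which is the geometric inevitability the DRM experiments test.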
Spacing, TOT, dimensionality
Spacing: $100$ facts, $3$ repetitions, $4$ conditions (massed: $0$–$120$ s; short: $0$–$2$ h; medium: $0$–$2$ d; long: $0$–$2$ w). Test at $t=30$ d. $10{,}000$ distractors, $\sigma=0.25$. TOT: embedding architectures: PCA to $96$ dim, query noise $\sigma=1.5/\sqrt{96}$. TOT criterion: correct answer at rank $2$–$20$ with top-$1$ similarity $>0.5$. LLM: partial-domain match in the generated answer. Dimensionality: participation ratio on the covariance of $10{,}000$ Wikipedia embeddings. Levina–Bickel two-nearest-neighbour estimator^8. $d_{95}$, $d_{99}$: components for $95\%$/$99\%$ variance.
Solution analysis
Solution 1: PCA to $\{64,128,256,512\}$ dimensions, zero-pad to $\{2{,}048,4{,}096\}$. For each: $d_{\text{eff}}$ $+$ Ebbinghaus at $5{,}000$ competitors. Solution 2: BM25 retrieval; DRM, Ebbinghaus; semantic agreement with cosine nearest neighbours. Solution 3: Gram–Schmidt ($500$ vectors), random projection ($\{32,64,128,256\}$ dims). Solution 4: MiniBatchKMeans at $\{50,100,250,500,1{,}000,2{,}500\}$ clusters; Ebbinghaus before/after.
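The fidelity side of Solution 4's compression–fidelity frontier can be shown with a minimal Lloyd's k-means (a stand-in for MiniBatchKMeans, on toy data with invented sizes): storing cluster centroids instead of individual facts discards fact-specific detail, and the amount discarded shrinks monotonically as the cluster count grows.

```python
import numpy as np

def kmeans_distortion(X, k, iters=25, seed=0):
    """Minimal Lloyd's k-means; returns the mean squared distance of each
    item to its centroid, i.e. the fact-specific detail lost by storing
    k centroids instead of the items themselves."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in np.unique(labels):          # move each centroid to its
            C[j] = X[labels == j].mean(0)    # cluster mean
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return float(d2.min(1).mean())

rng = np.random.default_rng(3)
facts = rng.normal(size=(500, 16))
errs = [kmeans_distortion(facts, k) for k in (10, 50, 250)]
print(errs)  # distortion falls monotonically as the cluster count grows
```

Heavier compression (fewer clusters) means each centroid must answer for more facts, which is why specific-fact retrieval degrades along the frontier.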
Statistical analysis and reproducibility
All experiments: $5$ seeds $[42,123,456,789,1024]$. Bootstrap $95\%$ CI from $10{,}000$ resamples. Cohen’s $d$ for spacing. One-sided Wilcoxon for ordering. SPP: paired $t$-test, $p<0.001$. Anderson–Schooler: power-law fit to the inter-arrival distribution at cosine threshold $0.5$ ($\alpha=0.459$, $R^{2}=0.952$). All code, configs, and results in JSON in the reproducibility package. Single NVIDIA A100-SXM4-80GB; ${\sim}10$ GPU-hours total.
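The bootstrap protocol is standard percentile bootstrap; a self-contained sketch (the five example values below are invented, not the paper's measurements):

```python
import numpy as np

def bootstrap_ci(values, n_resamples=10_000, seed=42):
    """Percentile bootstrap 95% CI for the mean, matching the protocol of
    10,000 resamples."""
    rng = np.random.default_rng(seed)
    n = len(values)
    means = np.array([rng.choice(values, size=n, replace=True).mean()
                      for _ in range(n_resamples)])
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))

# e.g. a forgetting exponent measured across five seeds (illustrative values)
b_per_seed = np.array([0.41, 0.46, 0.43, 0.45, 0.44])
lo, hi = bootstrap_ci(b_per_seed)
print(lo, hi)  # an interval bracketing the sample mean 0.438
```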
Calibration of decay parameter
The temporal decay parameter $\beta=0.20$ was calibrated via a sweep over $[0.01,0.5]$ to match HIDE’s $b=0.460$. This calibration ensures comparability with the predecessor study but means the absolute value of $b$ is partially fitted. The qualitative conclusions (that interference produces forgetting and that the exponent increases with competitor count) do not depend on the specific value of $\beta$.
Relationship to prior work
This paper extends HIDE^3 in three ways: (a) the mathematical framework (Theorems 1–4, the corollary, proposition, and the No-Escape Theorem) is entirely new (HIDE argued from empirical convergence; this paper argues from formal derivation under stated assumptions); (b) four of the five architectures are new (only the vector database replicates HIDE’s setup, serving as a calibration condition); (c) the two-level framework (geometric vs. behavioural) and the three-category taxonomy are new contributions that resolve the architectural objection HIDE left open. The Ebbinghaus baseline comparison ($b=0.440$ vs. HIDE’s $0.460$) uses the same protocol and models as HIDE to enable direct comparison; all other results are independent.
Data Availability
All datasets publicly available: Wikipedia (wikimedia/wikipedia, CC BY-SA 3.0), DRM word lists (public
domain^16), PopQA^9 (open).
Code Availability
Code, configuration files, raw results, and reproduction scripts available at
https://github.com/Dynamis-Labs/no-escape.
Acknowledgements
Computational experiments and manuscript preparation were assisted by Claude (Anthropic).
Author Contributions
A.G. conceived the project, developed the theoretical framework and designed the experiments. A.G, A.S.,
S.R.B., S.B., and N.N. contributed to implementation, experimental execution and manuscript preparation.
Competing Interests
The authors have financial interests in Dynamis Labs, Inc.
References
1. S. Amari and H. Nagaoka (2000) Methods of Information Geometry. American Mathematical Society.
2. J. R. Anderson and L. J. Schooler (1991) Reflections of the environment in memory. Psychological Science 2, pp. 396–408.
3. S. R. Barman, A. Starenky, S. Bodnar, N. Narasimhan, and A. Gopinath (2026) The geometry of forgetting. arXiv preprint arXiv:submit/7411865 [cs.AI]. (The HIDE paper.)
4. R. A. Bjork and E. L. Bjork (1992) A new theory of disuse and an old theory of stimulus fluctuation. In From Learning Processes to Cognitive Processes: Essays in Honor of William K. Estes, A. F. Healy, S. M. Kosslyn, and R. M. Shiffrin (Eds.), pp. 35–67.
5. R. Brown and D. McNeill (1966) The “tip of the tongue” phenomenon. Journal of Verbal Learning and Verbal Behavior 5, pp. 325–337.
6. N. J. Cepeda, H. Pashler, E. Vul, J. T. Wixted, and D. Rohrer (2006) Distributed practice in verbal recall tasks: a review and quantitative synthesis. Psychological Bulletin 132, pp. 354–380.
7. P. Gao, E. Trautmann, B. Yu, G. Santhanam, S. Ryu, K. Shenoy, and S. Ganguli (2017) A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv.
8. E. Levina and P. J. Bickel (2005) Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems 17.
9. A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023) When not to trust language models: investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.
10. J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review 102, pp. 419–457.
11. B. B. Murdock (1962) The serial position effect of free recall. Journal of Experimental Psychology 64, pp. 482–488.
12. L. Nadel and M. Moscovitch (1997) Memory consolidation, retrograde amnesia and the hippocampal complex. Current Opinion in Neurobiology 7, pp. 217–227.
13. Qwen Team (2024) Qwen2.5: a party of foundation models. arXiv preprint arXiv:2412.15115.
14. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. In Proceedings of ICML.
15. N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP, pp. 3982–3992.
16. H. L. Roediger and K. B. McDermott (1995) Creating false memories: remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition 21, pp. 803–814.
17. C. E. Shannon (1948) A mathematical theory of communication. Bell System Technical Journal 27, pp. 379–423.
18. C. E. Shannon (1959) Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record 7, pp. 142–163.
19. C. Stringer, M. Pachitariu, N. Steinmetz, C. B. Reddy, M. Carandini, and K. D. Harris (2019) High-dimensional geometry of population responses in visual cortex. Nature 571, pp. 361–365.
20. J. T. Wixted (1991) On the form of forgetting. Psychological Science 2, pp. 409–415.
21. S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff (2023) C-Pack: packaged resources for general Chinese embeddings. arXiv preprint arXiv:2309.07597.
Figures
Figure 1: The No-Escape Theorem: logical structure (paper roadmap). This figure maps the paper’s argument. Under the kernel-threshold theorem class (Axioms A1–A5): the semantic kernel and rate-distortion optimality yield finite semantic effective rank (Theorem 1); local regularity yields positive cap mass (Theorem 2); growing memory yields inevitable forgetting (Theorem 3), with power-law arrival and population heterogeneity producing power-law forgetting curves. Independently, associative $\delta$-convexity yields lure inseparability (Theorem 4). No architecture within this class avoids these consequences without abandoning semantic continuity or adding an external symbolic verifier. Each arrow represents a step derived under stated assumptions and supported by empirical tests across the architectures studied here.
Figure 2: Interference produces forgetting across architecturally distinct memory systems. a, Vector DB and b, Graph show smooth power-law forgetting curves converging toward the human range ($b\approx 0.3$–$0.7$, red dashed). c, Attention shows a phase transition (logistic fit: $n_{0}\approx 120$, $k\approx 0.03$; power-law fitting is inappropriate for this sigmoid failure mode). d, Filesystem (BM25) shows $b=0$ (no semantic interference). e, Parametric (PopQA) shows monotonic accuracy decline with neighbour density. Category 1 systems degrade continuously; Category 2 systems fail discontinuously. $n=5$ seeds throughout.
Figure 3: The forgetting exponent depends on competitor count, not architecture. Forgetting exponent $b$ vs. number of near competitors for embedding architectures (Vector DB, Graph) with human reference ($b\approx 0.5$, dashed). Both converge toward the human range at high competitor counts. Shaded: bootstrap $95\%$ CI, $n=5$ seeds.
Figure 4: False recall is geometrically inevitable. a, Hit rate, lure false alarm rate, and unrelated FA for all five architectures and human data. Embedding architectures show elevated lure FA; LLM architectures show FA $=0$ at the behavioural level (explicit list-checking). b, Lure FA rates compared directly. The geometric prediction ($24/24$ lures within spherical caps) holds for all architectures regardless of behavioural output. $n=5$ seeds, $24$ DRM lists.
Figure 5: Effective dimensionality converges far below nominal regardless of architecture. $d_{\text{eff}}$ (participation ratio) vs. $d_{\text{nom}}$ for all five architectures. Grey: biological range ($d_{\text{eff}}=100$–$500$^19,7). Qwen hidden states ($d_{\text{nom}}=3{,}584$) compress to $d_{\text{eff}}=17.9$, a $200\times$ reduction. All architectures cluster below the interference threshold.
Figure 6: No proposed solution achieves both immunity and usefulness. Every solution that reduces interference moves along a tradeoff frontier toward reduced usefulness; no solution escapes the frontier itself. This is the empirical corollary to Theorem 1. a, Zero-padding does not reduce $b$ ($d_{\text{eff}}$ unchanged). b, BM25 eliminates false recall but semantic agreement drops to $15.5\%$. c, Gram–Schmidt eliminates interference; semantic accuracy $=0\%$. d, Compression reduces $b$ but degrades retrieval.
Figure 7: Architecture comparison across four memory phenomena. Heatmap of forgetting exponent $b$, DRM lure FA, spacing Cohen’s $d$, and TOT rate for all five architectures and human reference. The three prototypical behavioural categories are visible: pure geometric (top two rows), reasoning overlay (middle), SPP-violating (bottom). Dashes indicate metrics not measurable for that architecture (attention $b$: sigmoid, not power-law; parametric spacing: no controlled paradigm). ^†Parametric TOT ($69\%$) uses a different operational definition than human/embedding TOT and is not directly comparable (see Discussion). Metrics are architecture-specific and not all directly numerically comparable (see Methods for protocol differences).
Supplementary Information
Table 1: Hyperparameters for all architectures and experiments.

| Parameter | Value | Description |
|---|---|---|
| Seeds | $[42,123,456,789,1024]$ | Random seeds |
| Bootstrap | $10{,}000$ | Resamples for $95\%$ CI |
| Decay $\beta$ | $0.20$ | Temporal decay rate (calibrated) |
| Decay $\psi$ | $0.50$ | Temporal decay exponent |
| Noise $\sigma$ (Ebb.) | $0.50$ | Ebbinghaus query noise |
| Noise $\sigma$ (Sp.) | $0.25$ | Spacing noise |
| TOT PCA dim | $96$ | PCA reduction for TOT |
| TOT noise | $1.5/\sqrt{96}$ | Query noise for TOT |
| PageRank $\alpha$ | $0.85$ | Damping factor |
| Edge threshold | $0.70$ | Graph cosine cutoff |
| BM25 top-$k$ | $50$ | Filesystem candidates |
| PopQA threshold | $0.40$ | Cosine threshold for neighbours |
Table 2: Dataset details.

| Dataset | Source | Size | Licence | Use |
|---|---|---|---|---|
| Wikipedia | wikimedia/wikipedia | $20{,}000$ sent. | CC BY-SA 3.0 | All experiments |
| DRM lists | Roediger & McDermott | $24$ lists | Public domain | False memory |
| PopQA | akariasai/PopQA | $14{,}267$ Q&A | Open | Parametric interf. |
Table 3: Per-architecture results summary ($n=5$ seeds unless noted).

| Architecture | Ebb. $b$ | DRM FA | Spacing L/M | TOT | $d_{\text{eff}}$ |
|---|---|---|---|---|---|
| Vector DB | $0.440 \pm 0.030$ | $0.583$ | $0.90/0.36$ | $0.020$ | $158$ |
| Graph | $0.478 \pm 0.028$ | $0.208$ | $1.00/0.92$ | $0.028$ | $127$ |
| Attention | phase trans. | $0.000^{\dagger}$ | $0.00/1.00$ | $0.210$ | $17.9$ |
| Parametric | $0.215^{*}$ | $0.000^{\dagger}$ | — | $0.690$ | $17.9$ |
| Filesystem | $0.000$ | $0.000$ | $1.00/1.00$ | $0.010$ | $158$ |
| Human | ${\sim}0.5$ | ${\sim}0.55$ | L $>$ M | ${\sim}0.037$ | $100$–$500$ |

$^{*}$PopQA interference $b$ (binned neighbour density), not a controlled Ebbinghaus paradigm; not directly comparable to embedding-architecture $b$ values. $^{\dagger}$Behavioural; the geometric prediction holds ($24/24$ caps).
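The Ebbinghaus column reports a power-law forgetting exponent $b$, i.e. retention is modelled as $R(n) \approx c \, n^{-b}$ in the number of competitors $n$. Assuming the exponent is recovered by a least-squares fit in log–log space (the paper's exact fitting procedure may differ), the estimation step is just a linear regression:

```python
import numpy as np

def fit_forgetting_exponent(n_competitors, retention):
    """Fit retention ~ c * n^(-b) by linear least squares on
    log(retention) vs. log(n); the forgetting exponent b is the
    negated slope."""
    x = np.log(np.asarray(n_competitors, dtype=float))
    y = np.log(np.asarray(retention, dtype=float))
    slope, _intercept = np.polyfit(x, y, 1)
    return -slope
```

A sigmoid failure mode such as the attention architecture's phase transition violates the $R(n) = c\,n^{-b}$ assumption, which is why Table 3 lists "phase trans." there instead of a $b$ value.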
Table 4: Effective dimensionality per architecture.

| Architecture | $d_{\text{nom}}$ | $d_{\text{eff}}$ (PR) | $d_{\text{eff}}$ (LB) | $d_{95}$ | $d_{99}$ |
|---|---|---|---|---|---|
| Vector DB (BGE-large) | $1{,}024$ | $158$ | $10.6$ | $404$ | $642$ |
| Graph (MiniLM) | $384$ | $127$ | — | $237$ | $309$ |
| Attention (Qwen) | $3{,}584$ | $17.9$ | — | — | — |
| Parametric (Qwen) | $3{,}584$ | $17.9$ | — | — | — |
| Filesystem (BGE-large) | $1{,}024$ | $158$ | $10.6$ | $404$ | $642$ |
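Assuming $d_{\text{eff}}$ (PR) denotes the standard participation ratio of the embedding covariance spectrum, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, and that $d_{95}$/$d_{99}$ count the components needed to explain $95\%$/$99\%$ of variance, all three columns come from the same eigenvalue spectrum:

```python
import numpy as np

def effective_dims(eigvals):
    """Effective-dimensionality summaries of a covariance eigenvalue
    spectrum: participation ratio (sum λ)² / sum λ², plus the number
    of leading components covering 95% and 99% of total variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    pr = lam.sum() ** 2 / (lam ** 2).sum()
    cum = np.cumsum(lam) / lam.sum()
    d95 = int(np.searchsorted(cum, 0.95) + 1)
    d99 = int(np.searchsorted(cum, 0.99) + 1)
    return pr, d95, d99
```

A perfectly flat spectrum of $k$ equal eigenvalues gives all three measures equal to $k$; the large gaps in Table 4 (e.g. PR $158$ vs. $d_{95} = 404$ for BGE-large) reflect how heavy-tailed the real spectra are.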
Table 5: Solution analysis data points.

| Solution | Configuration | $b$ | Accuracy |
|---|---|---|---|
| 1: High dim | PCA $d=64$ | $0.370$ | reduced |
| | Original $d=1{,}024$ | $0.831$ | baseline |
| | Zero-pad $d=2{,}048$ | $0.332$ | baseline |
| | Zero-pad $d=4{,}096$ | $0.308$ | baseline |
| 2: BM25 | Full BM25 | $0.000$ | $15.5\%$ |
| 3: Gram–Schmidt | $500$ vectors | $0.000$ | $0.0\%$ |
| 4: Compression | $k=50$ | $0.432$ | $98.8\%$ |
| | $k=500$ | $0.254$ | $95.6\%$ |
| | $k=2{,}500$ | $0.163$ | $92.8\%$ |
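Solution 3's outcome ($b=0$ but $0\%$ accuracy) follows directly from what Gram–Schmidt does: an orthonormalised store has exactly zero cosine similarity between distinct items, so nothing interferes, but semantic neighbourhood structure is gone too. A sketch (using QR factorisation, which is the numerically stable way to Gram–Schmidt; the paper's exact procedure is not specified here):

```python
import numpy as np

def orthogonalise(vectors):
    """Orthonormalise a stack of stored vectors (rows) via QR, the
    stable equivalent of classical Gram–Schmidt. Distinct stored items
    end up with cosine similarity exactly 0: no interference, but no
    semantic neighbourhoods either."""
    q, _r = np.linalg.qr(np.asarray(vectors, dtype=float).T)
    return q.T  # rows are orthonormal, spanning the original rows
```

Requires the stored vectors to be linearly independent (which $500$ vectors in a $1{,}024$-dimensional space generically are); the off-diagonal Gram matrix of the result is identically zero.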
Extended Data

Figure 8: Extended Data Fig. 1: The five memory architectures. Each architecture implements a fundamentally different storage and retrieval mechanism: cosine similarity (Vector DB), attention over context (Attention), BM25 $+$ LLM re-ranking (Filesystem), personalised PageRank (Graph), and parametric knowledge in weights (Parametric). Despite architectural diversity, all except Filesystem strongly satisfy SPP ($p<0.001$). $n=143$ sentence pairs per architecture.

Figure 9: Extended Data Fig. 2: Vector Database full results. a, Forgetting exponent $b$ at each competitor count, showing monotonic increase. b, DRM hit rate, lure FA, and unrelated FA. c, Spacing retention: long $=0.902$, massed $=0.360$. d, Eigenvalue spectrum ($d_{\text{eff}}=158$). Error bars: bootstrap $95\%$ CI, $n=5$ seeds.

Figure 10: Extended Data Fig. 3: Graph Memory full results. a, $b=0.478$ at $10{,}000$ competitors. b, DRM lure FA $=0.208$ at $\theta=0.82$. c, Spacing: long $=0.996$, massed $=0.920$. d, Eigenvalue spectrum ($d_{\text{eff}}=127$). $n=5$ seeds.

Figure 11: Extended Data Fig. 4: Attention Memory full results. a, Phase transition: near-perfect accuracy at $n_{\text{near}}<100$, then catastrophic collapse (logistic fit: $n_{0}\approx 120$, $k\approx 0.03$; power-law fitting is inappropriate for this sigmoid failure mode; $y$-axis values reflect interference severity, not power-law exponents). b, DRM FA $=0$ at the behavioural level (geometric prediction holds: $24/24$ lures within caps). c, Spacing: architectural capacity artefact: massed $=1.0$, spaced $=0.0$ (the context-window limit relocates interference to the capacity domain). d, $d_{\text{eff}}=17.9$ from $d_{\text{nom}}=3{,}584$, a $200\times$ compression. $n=5$ seeds.

Figure 12: Extended Data Fig. 5: Parametric Memory full results. PopQA interference: accuracy drops from $1.000$ ($<50$ neighbours) to $0.113$ ($>1{,}000$ neighbours). Power-law fit $b=0.215$. DRM FA $=0$ behaviourally. TOT $=69\%$, a very high partial retrieval rate. $d_{\text{eff}}=17.9$. $n=3$ seeds for PopQA.

Figure 13: Extended Data Fig. 6: Filesystem Memory full results. BM25 keyword retrieval: $b=0.000$ (no forgetting), FA $=0$ (no false recall), all spacing conditions $=1.0$. SPP correlation $r=0.210$; BM25 weakly satisfies semantic proximity. This architecture demonstrates Solution 2: immunity at the cost of usefulness.

Figure 14: Extended Data Fig. 7: SPP verification. Mean similarity for related pairs (same article) vs. unrelated pairs (different articles) across all five architectures. All satisfy SPP ($p<0.001$), with embedding architectures showing stronger separation. $n=143$ pairs.

Figure 15: Extended Data Fig. 8: Spherical cap verification. a, Analytical cap volume (fraction of sphere) vs. dimension for five cap half-angles $\theta\in\{10^{\circ},20^{\circ},30^{\circ},45^{\circ},60^{\circ}\}$, showing exponential collapse with increasing $d$. The shaded region marks the interference regime ($d_{\text{eff}}\approx 10$–$50$) where all tested architectures operate. b, Monte Carlo verification: simulated vs. analytical cap volume on log–log axes for the $7$ ($d$, $\theta$) combinations where the Monte Carlo sample detected non-zero signal. Six of seven points fall within $\pm 20\%$ of the $y=x$ line; the single outlier ($d=8$, $\theta=20^{\circ}$, ratio $=0.48$) reflects finite-sample resolution at analytical volume $\approx 8\times 10^{-5}$, not analytical error. Marker shape encodes dimension; colour encodes $\theta$ (matching a). Confirms Theorem 2.

Figure 16: Extended Data Fig. 9: Reproducibility across seeds. Per-seed values of $b$, DRM lure FA, and $d_{\text{eff}}$ for the Vector DB and Graph architectures. Low variance confirms reproducibility. $n=5$ seeds.

Figure 17: Extended Data Fig. 10: Full solution analysis. a, $b$ vs. nominal dimensionality. b, BM25 immunity vs. usefulness. c, Orthogonalisation methods. d, Compression: $b$ (blue) and accuracy (red) vs. cluster count. Every solution traces a tradeoff; none achieves both immunity and usefulness.
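The spherical-cap check in Extended Data Fig. 8 can be reproduced with a few lines of numpy. The fraction of the unit sphere $S^{d-1}$ within angular radius $\theta$ of a fixed point is a ratio of two 1-D integrals, since the surface measure concentrates as $\sin^{d-2}\varphi\,d\varphi$; a Monte Carlo estimate samples normalised Gaussian vectors. (Function names and sample sizes here are mine; the paper's exact verification code may differ.)

```python
import numpy as np

def cap_fraction_analytic(d, theta, m=200_000):
    """Fraction of S^{d-1} within angular radius theta of a fixed
    point: integral of sin^(d-2) over [0, theta] divided by the same
    integral over [0, pi], evaluated by the midpoint rule."""
    def integral(upper):
        h = upper / m
        x = (np.arange(m) + 0.5) * h  # midpoints of m subintervals
        return np.sum(np.sin(x) ** (d - 2)) * h
    return integral(theta) / integral(np.pi)

def cap_fraction_mc(d, theta, n=200_000, seed=0):
    """Monte Carlo check: normalised Gaussian vectors are uniform on
    S^{d-1}; count those whose angle to the first basis vector is
    at most theta."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    cos_angle = x[:, 0] / np.linalg.norm(x, axis=1)
    return float(np.mean(cos_angle >= np.cos(theta)))
```

For $d=3$ the analytic fraction reduces to the familiar $(1-\cos\theta)/2$, a useful sanity check; the exponential collapse with $d$ that panel a plots is visible by comparing, say, $d=8$ against $d=128$ at the same $\theta$, and the outlier in panel b matches the regime where $2\times 10^{5}$ Monte Carlo samples cannot resolve a volume of $\approx 8\times 10^{-5}$.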