Epiplexity by example: what entropy and complexity miss

Type: note · Status: current · Tags: learning-theory

Entropy asks how surprising the symbols are. Complexity asks how short the best exact recipe is. Epiplexity asks how much structure this observer can actually extract with the tools and time they have.

Think of a locked box. Entropy and complexity describe the contents as an object. Epiplexity asks: can this observer open the box, and if they can, how much pattern can they recover from what is inside? "Observer" means the full toolkit — intelligence, side information, prior knowledge, keys, decompressors, domain expertise, and time.

Four measures

Measure	How you calculate it	What you must fix first	Depends on practical observer?
Information content	`−log₂ P(message)`	A probability model over messages	No — once the model is fixed
Shannon entropy	Average of `−log₂ P(symbol)` over the source	A probability distribution for the source	No — once the source model is fixed
Kolmogorov complexity	Length of the shortest program that outputs the string	A universal programming model / machine	No — once the machine is fixed
Epiplexity	Area under a bounded learner's loss curve above final loss	A learner, its tools, and a compute budget	Yes

The first three do not depend on what Bob or Eve can practically do, once you fix their formal setup (model, source, machine). Epiplexity does. It is a property of the data-observer pair. A message may contain structure in principle but have low epiplexity for an observer who cannot get to it.

Warm-up: calculating all four for `ABABABABABABABAB`

16 symbols, alphabet {A, B}. Walk through each measure.

Information content — surprisal of this exact message under a chosen model:

Naive model (each symbol i.i.d., p = 0.5): I = −log₂(0.5¹⁶) = 16 bits.
Pattern-aware model ("repeat AB"): I = −log₂(1) = 0 bits — the message is fully predicted once you know the rule.
Both are correct under their model. Information content is model-relative — it doesn't say which model is "right."

Shannon entropy — average surprisal under a source model:

Memoryless source model (p(A) = p(B) = 0.5): H = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1 bit/symbol.
First-order Markov model ("after A comes B, after B comes A"): conditional entropy H(Xₙ|Xₙ₋₁) ≈ 0.

Entropy can treat the same sequence as high-entropy or low-entropy depending on the source model.

Kolmogorov complexity — the length of the shortest exact generator. Something like for i in range(8): print("AB") — call it ~30 bytes of code for a 16-byte string. Scale up to "AB" × 10,000 (20,000 bytes): the program barely grows while the string grows linearly. K captures compressibility — but assumes unlimited search for the shortest program.

Epiplexity (via prequential coding — Finzi et al.'s measurement method): fix a bounded learner, feed symbols one at a time, and track prediction loss:

Symbol	What the learner sees	Prediction loss
1: A	No history	1 bit (uniform guess)
2: B	Just "A"	≈ 1 bit (still uncertain)
3: A	"AB" — a pattern?	≈ 0.5 bits (hedging)
4: B	"ABA" — pattern confirmed	≈ 0.1 bits
5–16	Pattern locked in	≈ 0 bits each

Final loss: ≈ 0 bits/symbol. Epiplexity = area under the loss curve, above the final loss. In this example that is ≈ (1-0) + (1-0) + (0.5-0) + (0.1-0) + ... ≈ 2.6 bits, so the 2.6 bits is the epiplexity. That area is the learnable structure this learner extracted by processing the sequence.

The important contrast is a truly random binary sequence. There the loss curve stays flat at about 1, 1, 1, 1, ... because the learner never gets better. But the final loss is also about 1, so the area above final loss is (1-1) + (1-1) + ... = 0. Random data is hard, but not learnable. Epiplexity measures the drop in loss from learning structure, not the raw loss itself.

The warm-up shows the split:

Information content and entropy depend on the probability model you pick.
Kolmogorov complexity asks for the shortest exact generator (assuming unlimited search).
Epiplexity asks how much of that structure a bounded learner can actually recover.

The interesting cases are when these disagree.

Example 1: Encrypted patterned message

Alice starts with the same structured plaintext — ABABABABABABABAB — and encrypts it with AES.

For the ciphertext:

Shannon entropy: high — designed to look random.
Kolmogorov complexity: can still be low if the plaintext and key have short descriptions: a short program can generate the plaintext, supply the key, run AES, and output the ciphertext. That still does not tell you who can actually read the message.

For epiplexity, the "observer" includes any cheap preprocessing they can do. Bob's effective learner is "decrypt with known key, then predict"; Eve's is "predict directly from ciphertext."

Bob (has the key): decrypt → ABABABABABABABAB → extract generator "repeat AB eight times." Bob reaches both the message and the rule behind it.

Eve (no key): frequency analysis → flat distribution. Known-plaintext attacks → nothing matches. Byte pattern clustering → no signal. Every technique within her budget returns noise. She cannot reach the plaintext layer, let alone the generator layer.

Eve later (acquires the key): same ciphertext, same bytes, same entropy, same Kolmogorov complexity. But now she decrypts, reaches the plaintext, and extracts the generator. Her epiplexity jumps from near zero to high because her toolkit changed.

Two extraction layers are at work:

Access layer — get from ciphertext to plaintext.
Pattern layer — get from plaintext to the short generator.

The prequential coding calculation shows this as loss curves:

Eve's learner on the ciphertext: loss stays flat at ~8 bits/byte. Area above final loss ≈ 0. Epiplexity ≈ 0.
Bob's learner: decrypt first, then predict plaintext. Loss drops from 1 bit to 0 after a few symbols — same curve as the warm-up. Epiplexity ≈ 2.6 bits.
Eve-later's learner: same bytes, but the learner now includes the key. Same 2.6 bits.

Neither entropy nor complexity can express "Bob can read it but Eve cannot." Epiplexity can, because the loss curve shape depends on what the observer can compute.

Example 2: Shuffled textbook

Take an introductory physics textbook. Randomly shuffle all paragraphs.

Measure	Ordered textbook	Shuffled textbook
Shannon entropy	Same (identical character frequencies)	Same
Kolmogorov complexity	Nearly identical	Nearly identical (one is a permutation of the other)
Epiplexity for a student	High — concepts build on each other, structure is extractable	Low — dependencies are broken, the student can't follow the argument
Epiplexity for an expert	High	Moderate — the expert can mentally reorder because they already know the structure

Same data. Same entropy. Same complexity. Different extractable structure depending on arrangement and observer. The ordering is a deterministic transformation — it adds no information in the classical sense. But it dramatically changes what a bounded reader can learn.

This is why distillation creates value: rearranging and compressing knowledge can raise epiplexity for the target reader even while reducing token count.

Example 3: CSPRNG vs compressed file

A cryptographically secure pseudorandom number generator (CSPRNG) produces a 1 MB file. A zip compressor produces a 1 MB file from a 50 MB dataset.

Measure	CSPRNG output	Compressed file
Shannon entropy	~8 bits/byte (maximum)	~8 bits/byte (maximum)
Kolmogorov complexity	Low if a short seed plus generator specifies it	Lower than the 50 MB source because compression found a shorter description
Epiplexity without tools	Near zero — provably no useful structure to extract	Low — looks like noise without the decompressor
Epiplexity with seed/decompressor	Still near zero — regenerating the bytes does not reveal hidden meaning	High — decompression reveals 50 MB of structured data

Both files look random to statistical tests. The difference: the compressed file rewards the right tool with rich structure; the CSPRNG output has nothing to unpack. Finzi et al. use CSPRNGs as the canonical example of zero-epiplexity data.

Example 4: Chess game notation

A recorded chess game between two grandmasters: 1. e4 e5 2. Nf3 Nc6 ...

Observer	Epiplexity
Someone who doesn't know chess	Zero — the symbols are meaningless
A beginner who knows the rules	Low — they can replay moves but extract little strategy
A club player	Moderate — they recognize common openings and tactical patterns
A grandmaster	High — they extract strategic themes, positional ideas, novelties, and preparation choices

Same string. Same entropy. Same Kolmogorov complexity. Four levels of extractable structure, depending entirely on the observer's computational capacity.

The pattern

Measure	What it captures	What it misses
Information content	Surprise of this message under a model	Which model is appropriate; whether the observer can compute it
Shannon entropy	Average surprise per symbol	Structure beyond the assumed model order
Kolmogorov complexity	Best possible compression	Whether anyone can find that best program in finite time
Epiplexity	Structure this observer actually extracts	Nothing — observer-dependence is the point

Entropy measures randomness. Complexity measures shortest description. Neither tells you what a particular observer can do with the data. Epiplexity fills that gap: structure that is both present in the data and accessible to the observer within their budget.

This is why the same note in a KB can have different value for different readers, why context arrangement matters even when the tokens are identical, and why reverse-compression — expanding text without adding extractable structure — is a real failure mode even when the expanded text is correct.

Relevant Notes:

information value is observer-relative — grounds: epiplexity formalizes the observer-dependence of information value; this note provides concrete examples for that formalization
Epiplexity paper — source: Finzi et al. define epiplexity and prove CSPRNGs have zero epiplexity for bounded observers
distillation — exemplifies: shuffled ordering shows why distillation can raise extractable structure for bounded observers
context efficiency is the central design concern — extends: identical tokens can differ in usable structure depending on observer and arrangement
reverse-compression — extends: output that grows without raising extractable structure

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search