Epiplexity by example: what entropy and complexity miss
Type: note · Status: current · Tags: learning-theory
Entropy asks how surprising the symbols are. Complexity asks how short the best exact recipe is. Epiplexity asks how much structure this observer can actually extract with the tools and time they have.
Think of a locked box. Entropy and complexity describe the contents as an object. Epiplexity asks: can this observer open the box, and if they can, how much pattern can they recover from what is inside? "Observer" means the full toolkit — intelligence, side information, prior knowledge, keys, decompressors, domain expertise, and time.
Four measures
| Measure | How you calculate it | What you must fix first | Depends on practical observer? |
|---|---|---|---|
| Information content | −log₂ P(message) |
A probability model over messages | No — once the model is fixed |
| Shannon entropy | Average of −log₂ P(symbol) over the source |
A probability distribution for the source | No — once the source model is fixed |
| Kolmogorov complexity | Length of the shortest program that outputs the string | A universal programming model / machine | No — once the machine is fixed |
| Epiplexity | Area under a bounded learner's loss curve above final loss | A learner, its tools, and a compute budget | Yes |
The first three do not depend on what Bob or Eve can practically do, once you fix their formal setup (model, source, machine). Epiplexity does. It is a property of the data-observer pair. A message may contain structure in principle but have low epiplexity for an observer who cannot get to it.
Warm-up: calculating all four for ABABABABABABABAB
16 symbols, alphabet {A, B}. Walk through each measure.
Information content — surprisal of this exact message under a chosen model:
- Naive model (each symbol i.i.d., p = 0.5): I = −log₂(0.5¹⁶) = 16 bits.
- Pattern-aware model ("repeat AB"): I = −log₂(1) = 0 bits — the message is fully predicted once you know the rule.
- Both are correct under their model. Information content is model-relative — it doesn't say which model is "right."
Shannon entropy — average surprisal under a source model:
- Memoryless source model (p(A) = p(B) = 0.5): H = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1 bit/symbol.
- First-order Markov model ("after A comes B, after B comes A"): conditional entropy H(Xₙ|Xₙ₋₁) ≈ 0.
Entropy can treat the same sequence as high-entropy or low-entropy depending on the source model.
Kolmogorov complexity — the length of the shortest exact generator. Something like for i in range(8): print("AB") — call it ~30 bytes of code for a 16-byte string. Scale up to "AB" × 10,000 (20,000 bytes): the program barely grows while the string grows linearly. K captures compressibility — but assumes unlimited search for the shortest program.
Epiplexity (via prequential coding — Finzi et al.'s measurement method): fix a bounded learner, feed symbols one at a time, and track prediction loss:
| Symbol | What the learner sees | Prediction loss |
|---|---|---|
| 1: A | No history | 1 bit (uniform guess) |
| 2: B | Just "A" | ≈ 1 bit (still uncertain) |
| 3: A | "AB" — a pattern? | ≈ 0.5 bits (hedging) |
| 4: B | "ABA" — pattern confirmed | ≈ 0.1 bits |
| 5–16 | Pattern locked in | ≈ 0 bits each |
Final loss: ≈ 0 bits/symbol. Epiplexity = area under the loss curve, above the final loss. In this example that is ≈ (1-0) + (1-0) + (0.5-0) + (0.1-0) + ... ≈ 2.6 bits, so the 2.6 bits is the epiplexity. That area is the learnable structure this learner extracted by processing the sequence.
The important contrast is a truly random binary sequence. There the loss curve stays flat at about 1, 1, 1, 1, ... because the learner never gets better. But the final loss is also about 1, so the area above final loss is (1-1) + (1-1) + ... = 0. Random data is hard, but not learnable. Epiplexity measures the drop in loss from learning structure, not the raw loss itself.
The warm-up shows the split:
- Information content and entropy depend on the probability model you pick.
- Kolmogorov complexity asks for the shortest exact generator (assuming unlimited search).
- Epiplexity asks how much of that structure a bounded learner can actually recover.
The interesting cases are when these disagree.
Example 1: Encrypted patterned message
Alice starts with the same structured plaintext — ABABABABABABABAB — and encrypts it with AES.
For the ciphertext:
- Shannon entropy: high — designed to look random.
- Kolmogorov complexity: can still be low if the plaintext and key have short descriptions: a short program can generate the plaintext, supply the key, run AES, and output the ciphertext. That still does not tell you who can actually read the message.
For epiplexity, the "observer" includes any cheap preprocessing they can do. Bob's effective learner is "decrypt with known key, then predict"; Eve's is "predict directly from ciphertext."
Bob (has the key): decrypt → ABABABABABABABAB → extract generator "repeat AB eight times." Bob reaches both the message and the rule behind it.
Eve (no key): frequency analysis → flat distribution. Known-plaintext attacks → nothing matches. Byte pattern clustering → no signal. Every technique within her budget returns noise. She cannot reach the plaintext layer, let alone the generator layer.
Eve later (acquires the key): same ciphertext, same bytes, same entropy, same Kolmogorov complexity. But now she decrypts, reaches the plaintext, and extracts the generator. Her epiplexity jumps from near zero to high because her toolkit changed.
Two extraction layers are at work:
- Access layer — get from ciphertext to plaintext.
- Pattern layer — get from plaintext to the short generator.
The prequential coding calculation shows this as loss curves:
- Eve's learner on the ciphertext: loss stays flat at ~8 bits/byte. Area above final loss ≈ 0. Epiplexity ≈ 0.
- Bob's learner: decrypt first, then predict plaintext. Loss drops from 1 bit to 0 after a few symbols — same curve as the warm-up. Epiplexity ≈ 2.6 bits.
- Eve-later's learner: same bytes, but the learner now includes the key. Same 2.6 bits.
Neither entropy nor complexity can express "Bob can read it but Eve cannot." Epiplexity can, because the loss curve shape depends on what the observer can compute.
Example 2: Shuffled textbook
Take an introductory physics textbook. Randomly shuffle all paragraphs.
| Measure | Ordered textbook | Shuffled textbook |
|---|---|---|
| Shannon entropy | Same (identical character frequencies) | Same |
| Kolmogorov complexity | Nearly identical | Nearly identical (one is a permutation of the other) |
| Epiplexity for a student | High — concepts build on each other, structure is extractable | Low — dependencies are broken, the student can't follow the argument |
| Epiplexity for an expert | High | Moderate — the expert can mentally reorder because they already know the structure |
Same data. Same entropy. Same complexity. Different extractable structure depending on arrangement and observer. The ordering is a deterministic transformation — it adds no information in the classical sense. But it dramatically changes what a bounded reader can learn.
This is why distillation creates value: rearranging and compressing knowledge can raise epiplexity for the target reader even while reducing token count.
Example 3: CSPRNG vs compressed file
A cryptographically secure pseudorandom number generator (CSPRNG) produces a 1 MB file. A zip compressor produces a 1 MB file from a 50 MB dataset.
| Measure | CSPRNG output | Compressed file |
|---|---|---|
| Shannon entropy | ~8 bits/byte (maximum) | ~8 bits/byte (maximum) |
| Kolmogorov complexity | Low if a short seed plus generator specifies it | Lower than the 50 MB source because compression found a shorter description |
| Epiplexity without tools | Near zero — provably no useful structure to extract | Low — looks like noise without the decompressor |
| Epiplexity with seed/decompressor | Still near zero — regenerating the bytes does not reveal hidden meaning | High — decompression reveals 50 MB of structured data |
Both files look random to statistical tests. The difference: the compressed file rewards the right tool with rich structure; the CSPRNG output has nothing to unpack. Finzi et al. use CSPRNGs as the canonical example of zero-epiplexity data.
Example 4: Chess game notation
A recorded chess game between two grandmasters: 1. e4 e5 2. Nf3 Nc6 ...
| Observer | Epiplexity |
|---|---|
| Someone who doesn't know chess | Zero — the symbols are meaningless |
| A beginner who knows the rules | Low — they can replay moves but extract little strategy |
| A club player | Moderate — they recognize common openings and tactical patterns |
| A grandmaster | High — they extract strategic themes, positional ideas, novelties, and preparation choices |
Same string. Same entropy. Same Kolmogorov complexity. Four levels of extractable structure, depending entirely on the observer's computational capacity.
The pattern
| Measure | What it captures | What it misses |
|---|---|---|
| Information content | Surprise of this message under a model | Which model is appropriate; whether the observer can compute it |
| Shannon entropy | Average surprise per symbol | Structure beyond the assumed model order |
| Kolmogorov complexity | Best possible compression | Whether anyone can find that best program in finite time |
| Epiplexity | Structure this observer actually extracts | Nothing — observer-dependence is the point |
Entropy measures randomness. Complexity measures shortest description. Neither tells you what a particular observer can do with the data. Epiplexity fills that gap: structure that is both present in the data and accessible to the observer within their budget.
This is why the same note in a KB can have different value for different readers, why context arrangement matters even when the tokens are identical, and why reverse-compression — expanding text without adding extractable structure — is a real failure mode even when the expanded text is correct.
Relevant Notes:
- information value is observer-relative — grounds: epiplexity formalizes the observer-dependence of information value; this note provides concrete examples for that formalization
- Epiplexity paper — source: Finzi et al. define epiplexity and prove CSPRNGs have zero epiplexity for bounded observers
- distillation — exemplifies: shuffled ordering shows why distillation can raise extractable structure for bounded observers
- context efficiency is the central design concern — extends: identical tokens can differ in usable structure depending on observer and arrangement
- reverse-compression — extends: output that grows without raising extractable structure