Hypernym Inc. · PPL + LITM benchmark sweep

A measured kernel
reduces PPL by 1.07%
on natural English.

Zero retraining. Eighteen seconds per chunk on a Mac Studio CPU. The optimum we measure here independently re-discovers a kernel parameter band Hypernym had previously identified through unrelated internal research — different model, different context length, same kernel.

−1.07% · PPL on WikiText at the peak
0.0001% · harness sanity check
+30% · cliff past the threshold
7.4% · photonic LITM retrieval

The kernel curve

Sweep the kernel parameter across its full operating range on Qwen2.5-0.5B-Instruct and measure perplexity on four held-out 256-token chunks from each of three datasets. WikiText benefits from gating up to a measurable peak, then falls off a cliff. Code and PRISMA narrative degrade monotonically: different content wants a different position on the curve.

[Chart] PPL Δ vs softmax baseline, per dataset, by Wick θ. Y-axis: % delta from softmax (negative = better, positive = worse). Curves: WikiText (natural English), Python source code, PRISMA narrative. Labeled regimes: mild, production, diffuse, peak (WikiText −1.07%), then the cliff past the threshold.
Three datasets, multiple sweep points of the kernel parameter, four 256-token chunks per cell. Lines are % delta from the softmax baseline (negative = better). WikiText (orange) dips to a measured optimum; beyond a specific threshold all curves jump catastrophically, because the gate begins pruning the head of the distribution, not just the tail.
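The sweep loop itself is simple enough to sketch. The snippet below is a minimal illustration, not the harness's actual code: `toy_ppl` is a stand-in evaluator with made-up numbers that merely mimics the dip-then-cliff shape described above, while the real harness runs the Qwen forward pass per chunk.

```python
import numpy as np

def pct_delta(ppl, baseline):
    """Percent delta from the softmax baseline; negative = better."""
    return 100.0 * (ppl - baseline) / baseline

def sweep(eval_ppl, thetas, chunks):
    """Mean PPL per theta across chunks, reported as % delta vs theta=0.

    eval_ppl(theta, chunk) -> PPL is a stand-in for the real harness;
    theta = 0.0 is taken as the softmax baseline.
    """
    baseline = np.mean([eval_ppl(0.0, c) for c in chunks])
    return {th: pct_delta(np.mean([eval_ppl(th, c) for c in chunks]), baseline)
            for th in thetas}

def toy_ppl(theta, chunk):
    # Illustrative numbers only: a shallow dip up to theta ~ 1,
    # then a blow-up past the threshold, mimicking the WikiText curve.
    dip = -0.26 * theta * (2.0 - theta)
    cliff = 10.0 * max(0.0, theta - 1.0) ** 2
    return 24.07 * (1.0 + (dip + cliff) / 100.0)

deltas = sweep(toy_ppl, thetas=[0.25, 0.5, 1.0, 1.5], chunks=range(4))
# deltas dips to a minimum near theta = 1.0, then turns sharply positive
```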

Content-conditional optimum

Different content types have different optimal points on the kernel curve. Natural English wants gating; structured code and tightly-bound narrative want softmax. The trained Qwen2.5-0.5B weights tolerate substantial deformation on WikiText because the natural distractor density gives the gate something to skip.
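To make "pruning the tail vs pruning the head" concrete, here is a deliberately simplified stand-in: a hard threshold gate on softmax weights. The actual kernel is a measured polynomial, not a hard threshold, and the `tau` values are illustrative, but the failure mode has the same shape: a mild gate removes only tail mass, an aggressive one starts deleting head tokens.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_softmax(scores, tau):
    """Simplified stand-in for the kernel gate: zero attention weights
    below tau, then renormalize. Assumes at least one weight survives."""
    w = softmax(scores)
    w = np.where(w >= tau, w, 0.0)
    return w / w.sum()

scores = np.array([4.0, 3.5, 1.0, 0.5, 0.2])  # two strong tokens, long tail
mild = gated_softmax(scores, tau=0.05)   # prunes only the tail mass
harsh = gated_softmax(scores, tau=0.40)  # past the "cliff": deletes a head token too
```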

WikiText
Optimum at the measured peak → −1.07% PPL. Gating skips distractor tokens; head of the distribution preserved. The first content type for which we measured the gate's beneficial regime on Qwen.
4 chunks × 4 sweep points, monotonic improvement to peak then cliff.
Python code
Optimum at zero (softmax). Gating damages it monotonically: +0.08% (mild) → +3.20% (peak) → +31.3% (past threshold). Code attention is already focused on local context; there's no distractor for the gate to skip.
4 chunks, 4 deltas all positive, monotonic.
PRISMA narrative
Optimum at zero (softmax). Gating damages it monotonically: +0.05% (mild) → +2.41% (peak) → +30.1% (past threshold). Narrative attention binds tightly to specific entities; gating the tail prunes legitimate tracking.
4 chunks of the PRISMA script, 4 deltas all positive.
Universal cliff
Past the threshold, the gate begins pruning the head of the distribution rather than the tail. ALL content types damage catastrophically: +9.9% / +31.3% / +30.1%. The threshold is geometric, not engineered.
Three independent datasets all jump at the same parameter value.

Independent re-discovery

Hypernym had previously identified a kernel parameter band through earlier internal research on a larger model at long context against WikiText-2 perplexity. The benchmark we just ran is a fresh measurement on a smaller model (Qwen 2.5-0.5B), at a shorter context length (256 tokens), against the same content.

Two independent measurements, same answer
Prior internal measurement: kernel band identified on a larger model at long context
This benchmark: measured kernel optimum on WikiText (Qwen 2.5-0.5B / 256 token context)
Agreement: the two optima fall in the same narrow band of the kernel parameter space.

Different model, different context length, same content type → same optimal band. That's not coincidence. The optimum is determined by the geometric structure of the kernel — a measured polynomial that recurs across 18 unrelated physics domains — not by the specific model or context length. The kernel knew what it was doing.

Passkey retrieval (LITM)

Insert a 5-digit passkey at varying depths (256 / 512 / 1024 tokens) and positions (25% / 50% / 75% of context), ask the model to recall it, score retrieval accuracy. Three samples per cell, five substrate variants.
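A minimal sketch of the passkey protocol, under the assumption that the harness follows the standard lost-in-the-middle setup: `build_context`, `score`, the filler sentence, and the hypothetical `generate(prompt)` call are illustrative names, not the benchmark's actual code.

```python
import random

def build_context(passkey, depth_tokens, needle_pos,
                  filler="The grass is green. "):
    """Bury 'The pass key is NNNNN.' at a fractional position inside
    roughly depth_tokens whitespace-tokens of filler text."""
    needle = f"The pass key is {passkey}. Remember it. "
    words = (filler * depth_tokens).split()[:depth_tokens]
    k = int(needle_pos * len(words))
    return (" ".join(words[:k]) + " " + needle + " ".join(words[k:]) +
            " What is the pass key? The pass key is")

def score(model_answer, passkey):
    """Retrieval counts as a hit iff the 5-digit key appears in the answer."""
    return passkey in model_answer

passkey = f"{random.randint(10000, 99999)}"
prompt = build_context(passkey, depth_tokens=256, needle_pos=0.5)
# hypothetical: answer = generate(prompt); hit = score(answer, passkey)
```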

Retrieval rate per (substrate × depth), 9 trials per cell:

                      depth=256      depth=512      depth=1024
softmax               100% (9/9)     100% (9/9)     100% (9/9)
kernel (production)   100% (9/9)     100% (9/9)     100% (9/9)
kernel (peak)         100% (9/9)     100% (9/9)     100% (9/9)
kernel (cliff)        66.7% (6/9)    100% (9/9)     100% (9/9)
photonic γ=0.5        22% (2/9)      0% (0/9)       0% (0/9)
Two failure modes are visible. Kernel-at-cliff: a middle-of-context retrieval hole at depth 256 (all 3 failures are at needle_pos=0.5). Photonic substrate at the chosen γ: catastrophic, predicting " the" instead of the passkey at almost every depth. Each failure regime shows up in both PPL and LITM: same cliff, two metrics.

Two cliffs, same edge

The PPL benchmark and the LITM benchmark are independent metrics that measure different things: perplexity is a distribution-shape test; passkey retrieval is a binding test. Both find the same edge:

PPL cliff · distribution shape · PPL discontinuity at the threshold

WikiText delta from softmax: −1.07% at the peak → +9.9% past the threshold. Code: +3.2% → +31.3%. PRISMA: +2.4% → +30.1%. Past the threshold the gate begins pruning the head of the distribution rather than the tail; gating becomes damage.

LITM cliff · token binding · retrieval discontinuity at the threshold

The cliff-regime kernel retrieves perfectly at depths 512 and 1024 but only 66.7% at depth 256 with a mid-context needle. The aggressive gate prunes the position holding the passkey when the context is short and the needle is in the middle: exactly the LITM phenomenon.

Same edge · geometric threshold

Two independent metrics, same kernel parameter. The kernel polynomial crosses zero at a specific structural point. Past that point, attention loses both shape (PPL) and binding (LITM) coherence. The geometric structure, not an engineering choice, determines the operational ceiling.

The photonic surprise

Photonic γ=0.5 — the 1D dielectric stack we benchmarked — preserves top-1 in greedy generation of simple prompts ("Paris" comes out correctly). But on perplexity it's catastrophic (+778% to +2119% above softmax). On passkey retrieval it's worse still — 7.4% accuracy across 27 trials, predicting " the" almost everywhere.

The full distribution shape matters, not just the argmax
Trained Qwen weights expect a SPECIFIC distribution shape — softmax, which sits at one endpoint of the kernel curve. Substrates that match softmax at the peak but diverge in the tail (photonic at the chosen γ) preserve top-1 in some greedy demos but break the model on real metrics.
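A toy illustration of that point, with made-up four-token distributions: two next-token distributions can share an argmax (so greedy demos pass) while sitting far apart in KL divergence, which is much closer to what perplexity actually penalizes.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; p and q are full next-token distributions."""
    return float(np.sum(p * np.log(p / q)))

# Illustrative numbers, not measured ones: same top-1 token, different shape.
p = np.array([0.60, 0.25, 0.10, 0.05])  # the shape trained weights expect
q = np.array([0.40, 0.02, 0.29, 0.29])  # substrate-like: argmax preserved,
                                        # tail mass redistributed

same_top1 = int(p.argmax()) == int(q.argmax())  # greedy demo still works...
shape_gap = kl(p, q)                            # ...but the shape diverges
```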

Both PPL and LITM agree. Photonic at the tested γ is not a drop-in replacement. To use this substrate, either:
(a) re-tune the substrate parameter to find a regime closer to softmax shape, or
(b) re-train Qwen against the substrate-shaped attention, or
(c) restrict to use cases where greedy top-1 is sufficient (which is almost none).

This is the cleanest version of the substrate framework's actual constraint: physical substrates produce shaped attention distributions; trained transformer weights expect a specific shape; the kernel curve parameterizes the family of nearby shapes without leaving the trained tolerance band; substrates with naturally-far shapes need tuning or co-training.

The full PPL table

All substrates × all datasets × 4 chunks each. The baseline row reports mean PPL; all other rows report % delta from it. The QD τ=0 row is a sanity check: it matches softmax to floating-point noise.

substrate         deployment         WikiText     Python       PRISMA
softmax           baseline           24.07 PPL    13.36 PPL    46.88 PPL
QD τ=0            eigh sanity        0.000%       0.000%       0.000%
Kernel (mild)     low gate           −0.143%      +0.076%      +0.051%
Kernel (prod)     deployed value     −0.529%      +0.379%      +0.257%
Kernel (diffuse)  high diffuse       −0.931%      +0.998%      +0.684%
Kernel (peak)     measured optimum   −1.073% ★    +3.198%      +2.408%
Kernel (cliff)    past threshold     +9.897%      +31.31%      +30.07%
Photonic          shape mismatch     +1052%       +2119%       +778%

Receipts

The benchmark databases are SQLite files in spectral_engine/bass_attention/cavity_validation/hypercircuits/. Every row is queryable.
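Querying one of those databases might look like the sketch below. The table and column names (`ppl_runs`, `substrate`, `dataset`, `ppl`) are hypothetical placeholders; inspect the real schema (e.g. with sqlite3's `.schema`) before querying. The demo runs against an in-memory database built with the assumed schema.

```python
import sqlite3

def mean_ppl_by_substrate(con):
    """Mean PPL per (substrate, dataset). Schema is an ASSUMED placeholder,
    not the shipped databases' actual layout."""
    return con.execute(
        "SELECT substrate, dataset, AVG(ppl) FROM ppl_runs "
        "GROUP BY substrate, dataset ORDER BY substrate"
    ).fetchall()

# Demo against an in-memory DB with the assumed schema:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ppl_runs (substrate TEXT, dataset TEXT, ppl REAL)")
con.executemany("INSERT INTO ppl_runs VALUES (?, ?, ?)",
                [("softmax", "wikitext", 24.07),
                 ("softmax", "wikitext", 24.07)])
rows = mean_ppl_by_substrate(con)
con.close()
```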

Reproducibility
Total compute: ~50 minutes on a Mac Studio M3 Ultra CPU. No GPU. No PyTorch. No llama.cpp. No HuggingFace transformers. Just numpy, scipy, scikit-fem, ml_dtypes, and tokenizers. The model is Alibaba's open-weight Qwen 2.5-0.5B-Instruct. The forward pass is hand-rolled in cavity_transformer.py. Every benchmark row is in SQLite, queryable in milliseconds, repeatable on any machine with the same setup.
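The core of such a hand-rolled forward pass is ordinary scaled dot-product attention in numpy. The sketch below is illustrative of that baseline shape only; it is not code from cavity_transformer.py.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    mask is a boolean (T, T) array; False positions are excluded."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e30)  # causal masking
    return softmax(scores) @ V

T, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
causal = np.tril(np.ones((T, T), dtype=bool))
out = attention(Q, K, V, mask=causal)  # shape (T, d); row 0 attends only to itself
```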