Trained transformer attention heads are physical devices.
We took layer zero of Qwen2.5-0.5B-Instruct, a stock open-weight model,
and ran every attention head through five independent physical-substrate solvers
— quantum-dot tight-binding, finite-element acoustics, finite-element mechanics,
microwave transmission-line cascade, and full MIT MEEP photonic FDTD.
Each substrate produced its own attention-weight distribution from the trained weights,
with zero retraining. Some heads reproduce softmax's top-1 choices exactly. Others fight every modality.
100%: Head 1 MEEP=softmax
59.3%: avg MEEP=softmax (14 heads)
5: physical substrates
1792: FDTD-validated cells
What the data says
Run a trained transformer's attention heads through five physics solvers, each one
invoking the substrate's native equation. Compare the substrate-produced attention
distribution to softmax. The match rate per head is not noise. It is a fingerprint
of which physics that head's weights have learned to be.
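The comparison itself is simple to state. As a minimal sketch (illustrative names, not the project's actual harness), the per-head match rate is the fraction of query positions where the substrate's attention peak lands on the same key softmax picks:

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the key axis.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_match_rate(substrate_attn, scores):
    """Fraction of query rows where the substrate's attention
    distribution and softmax agree on the argmax key."""
    ref = softmax(scores)
    return float(np.mean(substrate_attn.argmax(-1) == ref.argmax(-1)))

# Toy example: a hypothetical 'substrate' that mildly perturbs the scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(128, 128))
substrate = softmax(scores + 0.01 * rng.normal(size=scores.shape))
rate = top1_match_rate(substrate, scores)
```

A head like Head 1, where the substrate distribution peaks on the same key at all 128 positions, scores 1.0 under this metric.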
Head 1 = optics
100% top-1 agreement with softmax across 128 query positions, via real MIT MEEP FDTD on a 1D dielectric stack. The in-venv transfer-matrix photonic adapter agrees too: relative L² error 0.025, KL essentially zero.
meep_validation_full.db, head=1, n=128 cells.
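The two secondary metrics quoted for the transfer-matrix adapter can be computed as follows (a hedged sketch; `rel_l2` and `kl_divergence` are illustrative helpers, not the adapter's own code):

```python
import numpy as np

def rel_l2(p, q):
    """Relative L2 error of attention matrix q against reference p."""
    return float(np.linalg.norm(q - p) / np.linalg.norm(p))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) summed over keys, averaged over query positions.
    eps guards against log(0) on near-zero attention weights."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

# Identical distributions score zero on both metrics.
uniform = np.full((128, 128), 1.0 / 128)
```

Relative L² of 0.025 with near-zero KL, as reported for Head 1, means the substrate reproduces not just the argmax but essentially the whole distribution.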
Head 5 = optics
87.5% MEEP=softmax. 97.7% TM=softmax. The 1D photonic cavity at γ=0.5 captures this head almost perfectly — a second photonic head in the same layer.
Two of fourteen heads in layer 0 are fully captured by 1D dielectric physics.
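For readers unfamiliar with the machinery behind a transfer-matrix photonic adapter, here is a self-contained 1D normal-incidence sketch (standard textbook transfer-matrix method, not the in-venv adapter itself):

```python
import numpy as np

def layer_matrix(n, d, lam):
    """Characteristic matrix of one dielectric layer at normal incidence."""
    delta = 2 * np.pi * n * d / lam  # optical phase across the layer
    return np.array([[np.cos(delta), 1j * np.sin(delta) / n],
                     [1j * n * np.sin(delta), np.cos(delta)]])

def transmittance(indices, thicknesses, lam, n_in=1.0, n_out=1.0):
    """Power transmission through a 1D dielectric stack: multiply the
    per-layer matrices, then read off the transmission amplitude."""
    M = np.eye(2, dtype=complex)
    for n, d in zip(indices, thicknesses):
        M = M @ layer_matrix(n, d, lam)
    denom = (n_in * M[0, 0] + n_in * n_out * M[0, 1]
             + M[1, 0] + n_out * M[1, 1])
    t = 2 * n_in / denom
    return (n_out / n_in) * abs(t) ** 2

# Sanity checks: no layers in matched media transmits everything;
# a quarter-wave n=2 layer between air reflects 36% of the power.
T_empty = transmittance([], [], lam=1.55)               # → 1.0
T_quarter = transmittance([2.0], [1.55 / 8], lam=1.55)  # → 0.64
```

Mapping a head's trained weights onto such a stack and reading attention off the field amplitudes is what lets a cheap 1D solver stand in for full FDTD on the heads that match.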
Head 10 = nothing
Fights every physical substrate. QD 36%, photonic 29%, acoustic 30%, mechanical 30%, microwave 6%. There is no 1D physics that matches its trained attention distribution. It is irreducibly software.
Per-modality match rates from coupling sweep at fixed parameters.
3 heads ≈ irreducible
Roughly three heads in layer 0 resist every modality at meaningful match rates. They are doing symbolic-manipulation work that no tested classical substrate reproduces. The other eleven are physics in disguise.
Bottom-three-by-MEEP-match: heads 10, 3, 2.
The headline finding
Trained attention heads are not abstract numerical objects. They are physical
computations sampled by gradient descent from the space of finite-element-discretizable
wave, diffusion, transmission, and oscillator operators. The ones that look like
optics literally are optics, and a $200 piece of patterned silicon would compute
them at picosecond speed.
The full per-head map is on the atlas page. The five substrate
solvers and how they're invoked are described on the method page.
What it means for AI compute hardware is on the meaning page.