Trained transformer attention heads are physical devices.
We took layer zero of Qwen2.5-0.5B-Instruct, a stock open-weight model,
and ran every attention head through five independent physical-substrate solvers
— quantum-dot tight-binding, finite-element acoustics, finite-element mechanics,
microwave transmission-line cascade, and full MIT MEEP photonic FDTD.
Each substrate produced its own attention-weight distribution from the trained weights,
with zero retraining. Some heads reproduce softmax's top-1 choices exactly. Others fight every modality.
100%: Head 1 MEEP=softmax
59.3%: avg MEEP=softmax (14 heads)
5: physical substrates
1792: FDTD-validated cells
What the data says
Run a trained transformer's attention heads through five physics solvers, each one
invoking the substrate's native equation. Compare the substrate-produced attention
distribution to softmax. The match rate per head is not noise. It is a fingerprint
of which physics that head's weights have learned to be.
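The comparison itself is simple to state. As a minimal sketch (illustrative names, not the project's actual harness), the per-head match rate is the fraction of query positions where the substrate's attention peak lands on the same key softmax picks:

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the key axis.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_match_rate(substrate_attn, scores):
    """Fraction of query rows where the substrate's attention
    distribution and softmax agree on the argmax key."""
    ref = softmax(scores)
    return float(np.mean(substrate_attn.argmax(-1) == ref.argmax(-1)))

# Toy example: a hypothetical 'substrate' that mildly perturbs the scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(128, 128))
substrate = softmax(scores + 0.01 * rng.normal(size=scores.shape))
rate = top1_match_rate(substrate, scores)
```

A head like Head 1, where the substrate distribution peaks on the same key at all 128 positions, scores 1.0 under this metric.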
Head 1 = optics
100% top-1 agreement with softmax across 128 query positions, via real MIT MEEP FDTD on a 1D dielectric stack. The in-venv transfer-matrix photonic adapter agrees too: relative L² error 0.025, KL essentially zero.
meep_validation_full.db, head=1, n=128 cells.
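The two secondary metrics quoted for the transfer-matrix adapter can be computed as follows (a hedged sketch; `rel_l2` and `kl_divergence` are illustrative helpers, not the adapter's own code):

```python
import numpy as np

def rel_l2(p, q):
    """Relative L2 error of attention matrix q against reference p."""
    return float(np.linalg.norm(q - p) / np.linalg.norm(p))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) summed over keys, averaged over query positions.
    eps guards against log(0) on near-zero attention weights."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

# Identical distributions score zero on both metrics.
uniform = np.full((128, 128), 1.0 / 128)
```

Relative L² of 0.025 with near-zero KL, as reported for Head 1, means the substrate reproduces not just the argmax but essentially the whole distribution.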
Head 5 = optics
87.5% MEEP=softmax. 97.7% TM=softmax. The 1D photonic cavity at γ=0.5 captures this head almost perfectly — a second photonic head in the same layer.
Two of fourteen heads in layer 0 are fully captured by 1D dielectric physics.
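For readers unfamiliar with the machinery behind a transfer-matrix photonic adapter, here is a self-contained 1D normal-incidence sketch (standard textbook transfer-matrix method, not the in-venv adapter itself):

```python
import numpy as np

def layer_matrix(n, d, lam):
    """Characteristic matrix of one dielectric layer at normal incidence."""
    delta = 2 * np.pi * n * d / lam  # optical phase across the layer
    return np.array([[np.cos(delta), 1j * np.sin(delta) / n],
                     [1j * n * np.sin(delta), np.cos(delta)]])

def transmittance(indices, thicknesses, lam, n_in=1.0, n_out=1.0):
    """Power transmission through a 1D dielectric stack: multiply the
    per-layer matrices, then read off the transmission amplitude."""
    M = np.eye(2, dtype=complex)
    for n, d in zip(indices, thicknesses):
        M = M @ layer_matrix(n, d, lam)
    denom = (n_in * M[0, 0] + n_in * n_out * M[0, 1]
             + M[1, 0] + n_out * M[1, 1])
    t = 2 * n_in / denom
    return (n_out / n_in) * abs(t) ** 2

# Sanity checks: no layers in matched media transmits everything;
# a quarter-wave n=2 layer between air reflects 36% of the power.
T_empty = transmittance([], [], lam=1.55)               # → 1.0
T_quarter = transmittance([2.0], [1.55 / 8], lam=1.55)  # → 0.64
```

Mapping a head's trained weights onto such a stack and reading attention off the field amplitudes is what lets a cheap 1D solver stand in for full FDTD on the heads that match.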
Head 10 = nothing
Fights every physical substrate. QD 36%, photonic 29%, acoustic 30%, mechanical 30%, microwave 6%. There is no 1D physics that matches its trained attention distribution. It is irreducibly software.
Per-modality match rates from coupling sweep at fixed parameters.
3 heads ≈ irreducible
Roughly three heads in layer 0 resist every modality at meaningful match rates. They are doing symbolic-manipulation work that no tested classical substrate reproduces. The other eleven are physics in disguise.
Bottom-three-by-MEEP-match: heads 10, 3, 2.
The headline finding
Trained attention heads are not abstract numerical objects. They are physical
computations sampled by gradient descent from the space of finite-element-discretizable
wave, diffusion, transmission, and oscillator operators. The ones that look like
optics literally are optics, and a $200 piece of patterned silicon would compute
them at picosecond speed.
The full per-head map is on the atlas page. The five substrate
solvers and how they're invoked are described on the method page.
What it means for AI compute hardware is on the meaning page.