Method — Substrate Fingerprints

How an attention head becomes a physical device

Attention assigns each query position a probability distribution over keys via softmax(Q · Kᵀ / √d_head). The same arithmetic admits a physical reading: each key is a site, each scaled query–key inner product sets a site energy (E_j = −q · k_j / √d_head, so high scores sit at low energy), and the resulting distribution is the thermal occupation of a one-dimensional system at unit temperature.
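
The equivalence can be checked numerically. The sketch below is illustrative only: the function names and the random test vectors are assumptions, not part of the harness. It compares a plain softmax over scaled scores with a Boltzmann occupation over the corresponding site energies at unit temperature.

```python
import numpy as np

def softmax_attention(q, K):
    """Standard attention weights for one query against n keys."""
    scores = K @ q / np.sqrt(q.shape[-1])
    scores = scores - scores.max()           # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def boltzmann_occupation(q, K, temperature=1.0):
    """Thermal occupation of sites whose energy is minus the scaled score."""
    energies = -(K @ q) / np.sqrt(q.shape[-1])   # high score -> low energy
    energies = energies - energies.min()         # energy zero is arbitrary
    w = np.exp(-energies / temperature)
    return w / w.sum()

# Hypothetical test data: 8 keys, head dimension 64.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(8, 64))

attn = softmax_attention(q, K)
occ = boltzmann_occupation(q, K)
```

At temperature 1.0 the two vectors agree to floating-point tolerance; any other temperature gives a sharpened or flattened version of the same distribution.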

Once you accept that reading, any physics solver that returns a per-site occupation becomes a candidate attention engine. We compile the head's weights into that solver's native configuration, run the solver, and read the resulting site occupation as the attention weights. No retraining. The trained weights are the substrate parameters.

Five substrates, one weight set

For each attention head, the same (W_Q, W_K, W_V, W_O, b_Q, b_K, b_V) are translated into the configuration space of five different physical solvers, each invoked through its native open-source library. The harness is at hypercircuits/modality_harness/; the base interface is a HeadAdapter abstract class with two methods: __init__ (compile weights into substrate config) and run_query (run the physics, read attention).

The five substrates

The quantum-dot (QD) substrate places one dot per token position. Site energies derived from query–key inner products put high-attention sites at low potential, and neighboring dots couple through a nearest-neighbor tunneling amplitude τ. The Hamiltonian is diagonalized with scipy's symmetric eigensolver, and the diagonal of the thermal density matrix is the attention distribution. In the τ → 0 limit the math reduces analytically to softmax (relative error 1e-6, verified by the harness sanity check).
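
The dot-chain construction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the harness code: the function name qd_chain_attention is hypothetical, the site energy is taken to be minus the score, the temperature is fixed at 1, and numpy.linalg.eigh stands in for the scipy eigensolver named above.

```python
import numpy as np

def qd_chain_attention(scores, tau):
    """Diagonal of the thermal density matrix of a tight-binding chain.

    Assumptions (not from the source): site energy = -score, hopping = -tau
    between nearest neighbors, unit temperature.
    """
    energies = -np.asarray(scores, dtype=float)
    n = energies.size
    H = np.diag(energies)
    if n > 1:
        hop = -tau * np.ones(n - 1)
        H = H + np.diag(hop, 1) + np.diag(hop, -1)
    eps, psi = np.linalg.eigh(H)              # columns of psi are eigenmodes
    boltz = np.exp(-(eps - eps.min()))        # Boltzmann factor per mode
    occ = (psi ** 2) @ boltz                  # sum_n |psi_n(j)|^2 exp(-eps_n)
    return occ / occ.sum()

scores = np.array([2.0, 0.5, -1.0, 1.0])
reference = np.exp(scores - scores.max())
reference /= reference.sum()                  # plain softmax over the scores

decoupled = qd_chain_attention(scores, tau=0.0)
coupled = qd_chain_attention(scores, tau=0.5)
```

With tau=0.0 the Hamiltonian is diagonal, each eigenmode sits on one site, and the occupation is exactly the softmax of the scores. A nonzero tau lets occupation leak between neighboring sites.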

The FEM cavity substrate uses one node per token. A query-derived on-site potential is added to the Helmholtz operator on a 1D cavity, and the resulting generalized eigenproblem is solved with scipy. Attention weights are the position-resolved thermal density across cavity eigenmodes, weighted by the FEM mass matrix at each node. The P2 elements match the Phase-2 cavity-validation reference exactly.

The microwave substrate treats each token position as a transmission-line segment with a per-segment wavenumber derived from the query–key score. A standard ABCD-matrix cascade propagates the (V, I) state from port 0 through the segments, and the stored EM energy in each segment becomes the attention weight. This is the same code path as the Phase-4 cavity-validation skrf solver.

The string substrate uses one discrete node per token and solves the Sturm–Liouville string operator with a score-derived restoring force on a 1D string. It uses a linear (P1) basis on a uniform mesh, the same code path as the Phase-5 cavity-validation calculix-beam string variant. The thermal-mode density at each node is the attention weight.
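
A P1 discretization of this kind can be sketched as follows. Everything specific here is an assumption made for illustration: free ends, a potential of −α · score lumped at the nodes, unit temperature, and the function names. To stay numpy-only, the generalized eigenproblem is reduced to a standard one by a Cholesky factor of the mass matrix; scipy.linalg.eigh(A, M) would solve it in a single call. The mass_weighted flag mimics the nodal mass weighting described for the cavity substrate.

```python
import numpy as np

def p1_matrices(n_nodes, h=1.0):
    """Stiffness and consistent mass matrices, P1 elements, uniform 1D mesh."""
    K = np.zeros((n_nodes, n_nodes))
    M = np.zeros((n_nodes, n_nodes))
    k_el = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h
    m_el = np.array([[2.0, 1.0], [1.0, 2.0]]) * h / 6.0
    for e in range(n_nodes - 1):              # assemble element by element
        K[e:e + 2, e:e + 2] += k_el
        M[e:e + 2, e:e + 2] += m_el
    return K, M

def string_attention(scores, alpha=1.0, h=1.0, mass_weighted=False):
    """Thermal mode density per node; needs at least two nodes."""
    scores = np.asarray(scores, dtype=float)
    K, M = p1_matrices(scores.size, h)
    lumped = M.sum(axis=1)                    # row-sum nodal masses
    A = K + np.diag(-alpha * scores * lumped) # high score -> deep well
    L = np.linalg.cholesky(M)                 # M = L L^T
    B = np.linalg.solve(L, np.linalg.solve(L, A).T).T   # L^-1 A L^-T
    lam, y = np.linalg.eigh((B + B.T) / 2.0)
    u = np.linalg.solve(L.T, y)               # M-orthonormal modes
    boltz = np.exp(-(lam - lam.min()))
    density = (u ** 2) @ boltz
    if mass_weighted:
        density = density * lumped
    return density / density.sum()

weights = string_attention([0.0, 0.0, 5.0, 0.0, 0.0])
weights_mw = string_attention([0.0, 0.0, 5.0, 0.0, 0.0], mass_weighted=True)
```

In this toy setup a single strongly scored token produces a localized lowest mode, so the thermal density concentrates on that node.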

The photonic substrate uses one slab per token, with a score-derived per-slab dielectric. A subprocess call to the miniforge MEEP environment runs a full 1D FDTD simulation with PML boundaries, a Gaussian source at one end, and a DFT field monitor across the inner region. The time-integrated energy density per slab becomes the attention weight. Each attention computation takes about 0.82 seconds on an M3 Mac Studio. The in-venv transfer-matrix adapter is a faster but less accurate cousin of this same physics.

The harness

The abstract base class HeadAdapter takes the head's weight matrices, biases, and RoPE tables, plus a modality_params dict with substrate-specific hyperparameters (τ for QD, α for FEM modalities, dx for microwave, γ for photonic). Each subclass implements __init__ (translating the weights into the substrate config) and run_query (running the physics and returning the d_model-space residual contribution).
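
The shape of that interface might look like the sketch below. The signatures, the dict layout of the weights, and the SoftmaxReferenceAdapter subclass are all hypothetical. RoPE handling is omitted, and the "physics" in the subclass is a plain softmax so the example stays runnable without any solver installed.

```python
from abc import ABC, abstractmethod
import numpy as np

class HeadAdapter(ABC):
    """Hypothetical base: compile weights once, then answer queries."""

    def __init__(self, weights, modality_params):
        self.weights = weights                  # dict of W_Q, W_K, ... arrays
        self.modality_params = modality_params  # substrate hyperparameters

    @abstractmethod
    def run_query(self, x_query, x_context):
        """Return this head's d_model-space residual contribution."""

class SoftmaxReferenceAdapter(HeadAdapter):
    """Stand-in substrate: ordinary softmax attention, no physics solver."""

    def run_query(self, x_query, x_context):
        w = self.weights
        q = x_query @ w["W_Q"] + w["b_Q"]
        K = x_context @ w["W_K"] + w["b_K"]
        V = x_context @ w["W_V"] + w["b_V"]
        scores = K @ q / np.sqrt(q.shape[-1])
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        return (attn @ V) @ w["W_O"]            # project back to d_model

# Hypothetical shapes: d_model=16, d_head=4, 5 context tokens.
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 5
weights = {
    "W_Q": rng.normal(size=(d_model, d_head)),
    "W_K": rng.normal(size=(d_model, d_head)),
    "W_V": rng.normal(size=(d_model, d_head)),
    "W_O": rng.normal(size=(d_head, d_model)),
    "b_Q": np.zeros(d_head), "b_K": np.zeros(d_head), "b_V": np.zeros(d_head),
}
x = rng.normal(size=(n_tokens, d_model))
adapter = SoftmaxReferenceAdapter(weights, modality_params={})
out = adapter.run_query(x[-1], x)
```

A physical substrate would replace only the three lines that compute attn, leaving the value and output projections unchanged.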

ModalityHarness orchestrates a full forward pass: per layer, per query position, per head, the chosen substrate's solver runs and the output is accumulated into the residual stream. With QuantumDotHeadAdapter(τ=0.0) the harness reproduces the standard softmax forward pass to machine precision (verified on Qwen2.5-0.5B, full 24 layers, top-1 ' Paris' match on "The capital of France is").
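
The loop order described above can be illustrated with a stripped-down sketch. It is not the ModalityHarness implementation: the function name, the MeanHead stand-in, and the causal slicing are assumptions, and layer norms and MLP blocks are left out entirely.

```python
import numpy as np

def attention_pass(hidden, layers):
    """Accumulate head outputs into the residual stream, attention only.

    hidden: (seq_len, d_model) array.
    layers: list of layers, each a list of objects with run_query().
    """
    for heads in layers:                        # per layer
        update = np.zeros_like(hidden)
        for t in range(hidden.shape[0]):        # per query position
            context = hidden[: t + 1]           # causal: keys up to t
            for head in heads:                  # per head
                update[t] += head.run_query(hidden[t], context)
        hidden = hidden + update                # residual accumulation
    return hidden

class MeanHead:
    """Toy head for the demo: returns the mean of the visible context."""
    def run_query(self, x_query, x_context):
        return x_context.mean(axis=0)

hidden_in = np.array([[1.0, 0.0], [0.0, 1.0]])
hidden_out = attention_pass(hidden_in, layers=[[MeanHead()]])
# hidden_out is [[2.0, 0.0], [0.5, 1.5]]
```

Because every head is called through the same run_query signature, swapping substrates changes nothing in this loop.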

Adding a sixth substrate is one new file and a registry entry. The plumbing is done.
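
One common way to get that "one file plus one entry" property is a decorator-based registry. The sketch below is hypothetical throughout: SUBSTRATE_REGISTRY, register_substrate, build_adapter, and ToySixthAdapter are illustrative names, and the real harness may register substrates differently.

```python
from typing import Callable, Dict

SUBSTRATE_REGISTRY: Dict[str, type] = {}

def register_substrate(name: str) -> Callable[[type], type]:
    """Class decorator that records an adapter under a substrate name."""
    def decorator(cls: type) -> type:
        if name in SUBSTRATE_REGISTRY:
            raise ValueError(f"substrate {name!r} is already registered")
        SUBSTRATE_REGISTRY[name] = cls
        return cls
    return decorator

# The "one new file" would contain just this class.
@register_substrate("toy_sixth")
class ToySixthAdapter:
    def __init__(self, weights, modality_params):
        self.weights = weights
        self.modality_params = modality_params

    def run_query(self, x_query, x_context):
        raise NotImplementedError("substrate physics goes here")

def build_adapter(name, weights, modality_params):
    """Look up a substrate by name and compile the head's weights into it."""
    return SUBSTRATE_REGISTRY[name](weights, modality_params)

adapter = build_adapter("toy_sixth", weights={}, modality_params={"gain": 1.0})
```

The harness then needs only the substrate name from its configuration to pick the adapter class.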

See the atlas page for which substrate matched which head.