WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with R² = 0.98. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at 3× the norm of the model's write makes tokens initially ranked 100–1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

Figure 1. WriteSAE atoms substitute for native Gated DeltaNet writes. At Qwen3.5-0.8B L9 H4, atoms beat ablation on 92.4% of n = 4,851 firings. Panels show the write k_t v_t^T, the atom v_i w_i^T, the cache-slot patch, and the forward-KL controls.

Results

Replacement test: WriteSAE atom vs deleting the write and vs a same-amplitude random atom. Win rate is the fraction of positions where the atom yields a closer final token distribution (lower forward KL to the base model) than the deletion control.
| Test | Model | n | Win rate |
|---|---|---|---|
| Single-atom substitution | Qwen3.5-0.8B (GDN, L9 H4) | 4,851 | 92.4% |
| 87-atom population test | Qwen3.5-0.8B (GDN, L9 H4) | 87 | 89.8% |
| Cross-architecture substitution | Mamba-2-370M | 2,500 | 88.1% |
| Closed-form prediction of logit shift | Qwen3.5-0.8B | n/a | R² = 0.98 |
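
The win rate defined above can be computed from per-position forward KL values. The following is a minimal sketch in plain Python, assuming "forward KL to the base model" means KL(base ‖ patched) over the final token distribution. The function names and the three toy positions are illustrative and are not taken from the WriteSAE codebase or its data.

```python
import math

def forward_kl(p_base, p_patched, eps=1e-12):
    """KL(p_base || p_patched) for two discrete distributions over the same vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_base, p_patched)
        if p > 0
    )

def win_rate(kl_atom, kl_control):
    """Fraction of positions where the atom's KL is strictly lower than the control's."""
    if len(kl_atom) != len(kl_control) or not kl_atom:
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(a < c for a, c in zip(kl_atom, kl_control))
    return wins / len(kl_atom)

# Toy data (made up): one base distribution, three firing positions.
base = [0.7, 0.2, 0.1]
after_atom = [[0.68, 0.22, 0.10], [0.50, 0.30, 0.20], [0.71, 0.19, 0.10]]
after_deletion = [[0.50, 0.30, 0.20], [0.60, 0.25, 0.15], [0.40, 0.40, 0.20]]

kl_atom = [forward_kl(base, q) for q in after_atom]
kl_deletion = [forward_kl(base, q) for q in after_deletion]
rate = win_rate(kl_atom, kl_deletion)  # atom is closer at positions 1 and 3 only
```

The same `win_rate` call applies to the random-atom control by passing that control's KL values as the second argument.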

Generation result: writing the formula's chosen direction into three consecutive cache positions at 3× the norm of the model's write makes tokens initially ranked 100–1000 by the unmodified model appear in 100% of continuations, up from 33.3% under greedy decoding. To our knowledge, this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

Cross-architecture ordering: GDN rank-1 > RWKV-7 rank-2 > Mamba-2 diagonal. The median cosine between an atom and the nearest native write at its firing positions is highest in GDN (0.262), then RWKV-7 (0.180), then Mamba-2 (0.0575). WriteSAE atoms do not install cleanly when the native write rule is not rank-1.
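
One plausible reading of the atom-to-write cosine is the cosine under the Frobenius inner product, with "nearest" meaning the native write of highest cosine at the atom's firing positions. The NumPy sketch below follows that reading, which is an assumption and not a confirmed definition from the paper. It also treats the cache as d_k × d_v, so the atom factor v_i pairs with the key and w_i with the value. All data is random and illustrative. For two rank-1 matrices the cosine factorizes into the product of the two factor cosines, which the sketch checks numerically.

```python
import numpy as np

def vec_cosine(x, y):
    """Cosine between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def frobenius_cosine(a, b):
    """Cosine between two matrices under the Frobenius inner product."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d_k, d_v = 16, 16

# Hypothetical atom v_i w_i^T.
v_i, w_i = rng.normal(size=d_k), rng.normal(size=d_v)
atom = np.outer(v_i, w_i)

# Hypothetical native writes k_t v_t^T at five firing positions.
keys = rng.normal(size=(5, d_k))
values = rng.normal(size=(5, d_v))
writes = [np.outer(k, v) for k, v in zip(keys, values)]

cosines = [frobenius_cosine(atom, w) for w in writes]
nearest = max(cosines)  # cosine to the nearest native write

# Rank-1 identity: cos(v_i w_i^T, k v^T) = cos(v_i, k) * cos(w_i, v).
factored = [vec_cosine(v_i, k) * vec_cosine(w_i, v) for k, v in zip(keys, values)]
```

Because of the factorization, a rank-1 atom only scores high against a native write when both factors align. This may be part of why the cosine drops for write rules that are not rank-1, though the sketch does not test that.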

How it works

A vector decoder atom cannot reach a matrix cache write: a vector in R^d has the wrong shape to add to a d_k × d_v matrix. WriteSAE makes each decoder atom a pair (v_i, w_i) whose outer product v_i w_i^T has the same rank-1 form as the native write. The replacement test removes the model's write at a firing position, inserts the atom scaled by its SAE activation, continues the forward pass, and compares forward KL on the final output distribution against two controls: deleting the write entirely, and substituting a same-amplitude random atom. The atom has to beat both to count as the model's own write.
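
The replacement test can be illustrated on a toy model. The NumPy sketch below is not the repository's implementation and rests on several assumptions: the state update is purely additive (no forget gate, no delta rule), the readout is a single query followed by a linear projection, the atom is built as a noisy copy of the native write, and a least-squares coefficient stands in for the SAE activation. Every name and dimension is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward_kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def final_distribution(state, write, query, w_out):
    """Toy forward pass: add the write, read with the query, project to logits."""
    new_state = state + write                      # simplified additive update
    return softmax(w_out @ (new_state.T @ query))  # new_state.T @ query has shape (d_v,)

rng = np.random.default_rng(0)
d_k, d_v, vocab = 8, 8, 20
unit = lambda x: x / np.linalg.norm(x)

state = 0.1 * rng.normal(size=(d_k, d_v))
k, v = unit(rng.normal(size=d_k)), unit(rng.normal(size=d_v))
native = np.outer(k, v)                            # the model's rank-1 write k v^T

# Hypothetical atom v_i w_i^T with unit-norm factors, close to the native write.
v_i = unit(k + 0.1 * rng.normal(size=d_k))
w_i = unit(v + 0.1 * rng.normal(size=d_v))
atom = np.outer(v_i, w_i)
activation = float(np.sum(native * atom))          # least-squares scale (stand-in)

# Control: random rank-1 matrix rescaled to the same Frobenius norm.
r = np.outer(rng.normal(size=d_k), rng.normal(size=d_v))
random_atom = r * (np.linalg.norm(activation * atom) / np.linalg.norm(r))

query = k                                          # query aligned with the written key
w_out = rng.normal(size=(vocab, d_v)) / np.sqrt(d_v)

p_base = final_distribution(state, native, query, w_out)
kl = {
    "atom": forward_kl(p_base, final_distribution(state, activation * atom, query, w_out)),
    "deletion": forward_kl(p_base, final_distribution(state, np.zeros_like(native), query, w_out)),
    "random": forward_kl(p_base, final_distribution(state, random_atom, query, w_out)),
}
atom_passes = kl["atom"] < kl["deletion"] and kl["atom"] < kl["random"]
```

Because the atom here is constructed to sit near the native write, it would typically land closer to the base distribution than either control. The outcome depends on the random draw, and the sketch does not reproduce the paper's measurements.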

Install & usage

git clone https://github.com/JackYoung27/writesae
cd writesae
pip install -e .

# extract GDN states from a base model
python -m experiments.extraction.extract_states \
    --model Qwen/Qwen3.5-0.8B --layers 9 --n_samples 50000 --output_dir states

# train a WriteSAE on one head
python -m core.train --sae_type bilinear --layer 9 --head 4 \
    --n_features 2048 --k 32 --data_dir states --output_dir ckpt

Pretrained checkpoints

Load the paper's SAE checkpoints from the Hugging Face Hub. Variants cover four SAE architectures (WriteSAE / FlatSAE / MatrixSAE / BilinearSAE), three Qwen3.5 scales (0.8B / 4B / 27B), and four cross-architecture models (DeltaNet, GLA, Mamba-2, RWKV-7).

from huggingface_hub import snapshot_download

path = snapshot_download("JackYoung27/writesae-ckpts", local_dir="ckpts")
# ckpts/manifest.json maps tags to SHA256 and metadata.
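
Since the manifest records a SHA256 per tag, downloaded files can be checked against it. The sketch below uses only the standard library. The manifest schema is an assumption: it supposes each tag maps to an object with hypothetical `file` and `sha256` keys, which may not match the real `manifest.json`, so adjust the key names after inspecting the file.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 of a file, read in chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoints(root):
    """Return {tag: True/False} for every entry in <root>/manifest.json.

    ASSUMED schema (hypothetical): {tag: {"file": "<relative path>", "sha256": "<hex>"}}.
    """
    root = Path(root)
    manifest = json.loads((root / "manifest.json").read_text())
    return {
        tag: sha256_of(root / entry["file"]) == entry["sha256"].lower()
        for tag, entry in manifest.items()
    }
```

Under that assumed schema, `verify_checkpoints("ckpts")` would return a dictionary keyed by tag, with `False` marking any file whose hash does not match.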

BibTeX

@article{young2026writesae,
  title   = {WriteSAE: Sparse Autoencoders for Recurrent State},
  author  = {Young, Jack},
  journal = {arXiv preprint arXiv:2605.12770},
  year    = {2026}
}