S₀ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Hybrid recurrent-attention models ship with recurrent states initialized to zero. S₀ tuning learns one initial state matrix per recurrent layer via gradient descent on roughly 48 execution-verified correct solutions, while freezing all model weights. On Qwen3.5-4B, this improves greedy pass@1 on HumanEval by +23.6 percentage points (p < 0.001, 10 seeds) and outperforms the best rank-24 LoRA baseline (+12.7 pp) by +10.8 pp. A separate matched-budget LoRA control degrades by −15.5 pp in this small-data regime. The tuned state is a 48 MB file with zero inference overhead; task switching requires no weight merging or model reload.
Results
| Benchmark | Baseline pass@1 | S₀ pass@1 | Δ | p |
|---|---|---|---|---|
| HumanEval | 48.8% | 72.2% | +23.6 pp | < 10⁻¹¹ |
| MATH-500 | 51.4% | 56.2% | +4.8 pp | 2 × 10⁻⁵ |
| GSM8K | 85.3% | 88.1% | +2.8 pp | 3 × 10⁻⁴ |
A text-to-SQL boundary test (Spider) shows no transfer, consistent with the method's dependence on early-token trajectory diversity.
HumanEval: 10 seeds. MATH-500: 8 seeds. GSM8K: 10 seeds. Cross-architecture: FalconH1-7B reaches 71.8% versus 71.4% for LoRA in a 3-seed comparison, statistically indistinguishable at this sample size.
Scaling: gains increase with model size — +2.6 pp at 0.8B, +23.6 pp at 4B, +44.0 pp at 9B (HumanEval).
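For intuition on the per-seed significance claims, here is a minimal paired permutation test over seed scores. The seed-level numbers below are illustrative placeholders, not the paper's raw data. Note that with only 10 seeds an exact sign-flip permutation test bottoms out at p = 2/2¹⁰ ≈ 0.002, so p-values as small as the table's presumably come from a parametric or per-problem test rather than this one.

```python
import random

def paired_permutation_p(baseline, tuned, n_perm=100_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-seed differences."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, tuned)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to flip sign.
        perm = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(perm) >= abs(observed):
            hits += 1
    return hits / n_perm

# Placeholder per-seed pass@1 scores (NOT the paper's data)
baseline = [48.2, 49.0, 48.5, 49.4, 48.8, 48.1, 49.2, 48.6, 49.0, 48.9]
tuned    = [72.0, 72.5, 71.8, 72.9, 72.1, 71.6, 72.4, 72.0, 72.6, 72.3]
print(paired_permutation_p(baseline, tuned))
```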
How it works
Each recurrent layer in a hybrid model maintains a state matrix that is updated at every token. By default this state starts at zero. S₀ tuning optimizes that initial value on a small set of execution-verified correct solutions; the learned state acts as a trajectory-steering "launch vector" that redirects generation from the very first token. In 85% of the solutions that tuning fixes, the output diverges from the baseline at the first generated character.
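The core idea can be shown in a toy sketch (not the paper's implementation): a scalar linear recurrence whose frozen coefficient `a` plays the role of the model weights, while the initial state `s0` is the only trainable parameter, updated by gradient descent until the final state hits a target.

```python
# Toy S0-style tuning: optimize only the initial state s0 of a frozen
# linear recurrence  s_t = a * s_{t-1} + x_t.  The weight `a` never changes.
def run(s0, a, xs):
    s = s0
    for x in xs:
        s = a * s + x               # frozen recurrent update
    return s

a = 0.9                             # frozen "model weight"
xs = [0.5, -0.2, 0.1]               # a fixed input sequence
target = 3.0                        # desired final state

s0, lr = 0.0, 0.1
for _ in range(200):
    final = run(s0, a, xs)
    # d(final)/d(s0) = a ** len(xs), so the squared-error gradient is analytic
    grad = 2.0 * (final - target) * a ** len(xs)
    s0 -= lr * grad                 # gradient step on the initial state only
```

After training, `run(s0, a, xs)` reaches the target even though `a` was never touched — the same separation of trainable state from frozen weights that S₀ tuning exploits at scale.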
Install & usage
```shell
pip install s0-tuning
```

```python
from s0 import S0Trainer, S0Config

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.train(data)   # data: list of (text, prompt_length) pairs
trainer.activate()    # zero-cost at inference
```
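Conceptually, activation just swaps the learned matrix in for the zero initializer, so the per-token computation is unchanged. A self-contained sketch (a toy stand-in, not the package's actual code):

```python
class RecurrentLayer:
    """Minimal stand-in for a recurrent layer with a tunable initial state."""
    def __init__(self, dim):
        self.decay = 0.95                    # frozen "weight"
        self.s0 = [0.0] * dim                # default: zero initial state

    def forward(self, xs):
        s = list(self.s0)                    # the only thing S0 tuning changes
        for x in xs:
            s = [self.decay * si + xi for si, xi in zip(s, x)]
        return s

layer = RecurrentLayer(dim=4)
xs = [[1.0, 2.0, 3.0, 4.0]] * 3

zero_init = layer.forward(xs)

# "Activating" a tuned state is just replacing the initializer; the per-token
# work is identical, hence zero inference overhead and trivial task switching.
layer.s0 = [0.5, -0.3, 0.1, 0.7]             # stand-in for a learned state
tuned = layer.forward(xs)
```

Switching tasks is then just assigning a different `s0`, with no weight merging or model reload.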
BibTeX
```bibtex
@article{young2026s0tuning,
  title   = {S$_0$ Tuning: Zero-Overhead Adaptation of
             Hybrid Recurrent-Attention Models},
  author  = {Young, Jack},
  journal = {arXiv preprint arXiv:2604.01168},
  year    = {2026}
}
```