S₀ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Jack Young

Hybrid recurrent-attention models ship with recurrent states initialized to zero. S₀ tuning learns one initial state matrix per recurrent layer via gradient descent on roughly 48 execution-verified correct solutions, while freezing all model weights. On Qwen3.5-4B, this improves greedy pass@1 on HumanEval by +23.6 percentage points (p < 0.001, 10 seeds) and outperforms the best rank-24 LoRA baseline (+12.7 pp) by +10.8 pp. A separate matched-budget LoRA control degrades by −15.5 pp in this small-data regime. The tuned state is a 48 MB file with zero inference overhead; task switching requires no weight merging or model reload.

S₀ tuning injects a learned initial state into recurrent layers, steering generation from the first token with zero inference cost.
S₀ tuning replaces the zero initial state of each recurrent layer with a learned value. The state is absorbed at the first recurrent step; after that, inference is identical to the base model.
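To make the "absorbed at the first step" point concrete, here is a minimal sketch with a toy linear recurrence (NumPy stand-ins, not the actual model weights). The dynamics `A` and `B` stay frozen; only the starting state differs, so per-token cost is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 4, 3

# Frozen per-layer dynamics (stand-ins for pretrained weights).
A = 0.9 * np.eye(d_state)                  # state decay
B = rng.standard_normal((d_state, d_in))   # input projection

def run(tokens, s0):
    """Roll the recurrence s_t = A @ s_{t-1} + B @ x_t from s0."""
    s = s0
    for x in tokens:
        s = A @ s + B @ x
    return s

tokens = [rng.standard_normal(d_in) for _ in range(5)]

s_zero    = run(tokens, np.zeros(d_state))             # default: zero init
s_learned = run(tokens, rng.standard_normal(d_state))  # S0-tuned init

# Same weights, same per-token work; only the starting point differs.
assert not np.allclose(s_zero, s_learned)
```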

Results

Baseline vs S₀ on Qwen3.5-4B, greedy decoding. p-values from Welch's t-test (baseline vs S₀).
Benchmark    Baseline   S₀       Delta      p
HumanEval    48.8%      72.2%    +23.6 pp   < 10⁻¹¹
MATH-500     51.4%      56.2%    +4.8 pp    0.00002
GSM8K        85.3%      88.1%    +2.8 pp    0.0003

A text-to-SQL boundary test (Spider) shows no transfer, consistent with the method's dependence on early-token trajectory diversity.

HumanEval: 10 seeds. MATH-500: 8 seeds. GSM8K: 10 seeds. Cross-architecture: FalconH1-7B reaches 71.8% versus 71.4% for LoRA in a 3-seed comparison, statistically indistinguishable at this sample size.

Scaling: gains increase with model size — +2.6 pp at 0.8B, +23.6 pp at 4B, +44.0 pp at 9B (HumanEval).

How it works

Each recurrent layer in a hybrid model maintains a state matrix updated at every token. By default this state starts at zero. S₀ tuning optimizes that initial value on a small set of correct solutions; the learned state acts as a trajectory-steering "launch vector" that redirects generation from the very first token. 85% of corrected solutions diverge from baseline at the first generated character.
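The optimization described above can be sketched as follows. This is a toy illustration, not the released trainer: the "model" is a single frozen recurrent layer, and only the initial state `s0` receives gradients. The cross-entropy loss is masked so that, as in the method, only the solution tokens after `prompt_length` are supervised:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d = 16, 8

# Toy stand-in for a frozen hybrid model: embedding, recurrent update, head.
emb  = nn.Embedding(vocab, d)
W    = nn.Linear(d, d, bias=False)
head = nn.Linear(d, vocab)
for m in (emb, W, head):
    m.requires_grad_(False)   # all model weights stay frozen

# The only trainable tensor: one initial state per recurrent layer.
s0  = nn.Parameter(torch.zeros(d))
opt = torch.optim.Adam([s0], lr=1e-2)

def loss_on(text_ids, prompt_len):
    s, logits = s0, []
    for t in text_ids[:-1]:
        s = torch.tanh(W(s) + emb(t))   # state absorbs each token
        logits.append(head(s))
    logits  = torch.stack(logits)
    targets = text_ids[1:]
    # Supervise only the solution tokens, not the prompt.
    return nn.functional.cross_entropy(
        logits[prompt_len - 1:], targets[prompt_len - 1:])

# Stand-in for ~48 verified-correct (text, prompt_length) pairs.
data = [(torch.randint(0, vocab, (12,)), 4) for _ in range(8)]
for _ in range(50):
    for ids, plen in data:
        opt.zero_grad()
        loss_on(ids, plen).backward()
        opt.step()
```

After training, `s0` is the only artifact to save; at inference it simply replaces the zero state at step one.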

Install & usage

```shell
pip install s0-tuning
```

```python
from s0 import S0Trainer, S0Config

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.train(data)      # data: list of (text, prompt_length) pairs
trainer.activate()       # zero-cost at inference
```
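Because a tuned state is just a dictionary of per-layer tensors, task switching amounts to loading a different file. The sketch below uses plain `torch.save`/`torch.load` with hypothetical layer names and file paths (not the `s0` package API) to show why no weight merging or model reload is needed:

```python
import torch

# Hypothetical per-layer initial states for two tasks.
states_code = {f"layer_{i}.s0": torch.randn(64, 16) for i in range(4)}
states_math = {f"layer_{i}.s0": torch.randn(64, 16) for i in range(4)}
torch.save(states_code, "s0_code.pt")
torch.save(states_math, "s0_math.pt")

def activate(model_states, path):
    # Overwrite only the initial states; model weights are untouched,
    # so there is no LoRA-style merge and no model reload.
    model_states.update(torch.load(path))

active = {}
activate(active, "s0_code.pt")   # code tasks
activate(active, "s0_math.pt")   # switch to math: a single file load
```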

BibTeX

@article{young2026s0tuning,
  title   = {S$_0$ Tuning: Zero-Overhead Adaptation of
             Hybrid Recurrent-Attention Models},
  author  = {Young, Jack},
  journal = {arXiv preprint arXiv:2604.01168},
  year    = {2026}
}