S₀ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Jack Young

Hybrid recurrent-attention models ship with recurrent states initialized to zero. S₀ tuning learns one initial state matrix per recurrent layer via gradient descent on roughly 48 execution-verified correct solutions, while freezing all model weights. On Qwen3.5-4B, this improves greedy pass@1 on HumanEval by +23.6 percentage points (p < 0.001, 10 seeds) and outperforms the best rank-24 LoRA baseline (+12.7 pp) by +10.8 pp. A separate matched-budget LoRA control degrades by −15.5 pp in this small-data regime. The tuned state is a 48 MB file with zero inference overhead; task switching requires no weight merging or model reload.

S₀ tuning injects a learned initial state into recurrent layers, steering generation from the first token with zero inference cost.
S₀ tuning replaces the zero initial state of each recurrent layer with a learned value. The state is absorbed at the first recurrent step; after that, inference is identical to the base model.
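To make the "absorbed at the first step" point concrete, here is a minimal sketch with a toy linear recurrence (NumPy stand-ins, not the actual model weights). The dynamics `A` and `B` stay frozen; only the starting state differs, so per-token cost is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 4, 3

# Frozen per-layer dynamics (stand-ins for pretrained weights).
A = 0.9 * np.eye(d_state)                  # state decay
B = rng.standard_normal((d_state, d_in))   # input projection

def run(tokens, s0):
    """Roll the recurrence s_t = A @ s_{t-1} + B @ x_t from s0."""
    s = s0
    for x in tokens:
        s = A @ s + B @ x
    return s

tokens = [rng.standard_normal(d_in) for _ in range(5)]

s_zero    = run(tokens, np.zeros(d_state))             # default: zero init
s_learned = run(tokens, rng.standard_normal(d_state))  # S0-tuned init

# Same weights, same per-token work; only the starting point differs.
assert not np.allclose(s_zero, s_learned)
```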

Results

Baseline vs S₀ on Qwen3.5-4B, greedy decoding. p-values from Welch's t-test (baseline vs S₀).
Benchmark    Baseline   S₀       Delta      p
HumanEval    48.8%      72.2%    +23.6 pp   < 10⁻¹¹
MATH-500     51.4%      56.2%    +4.8 pp    0.00002
GSM8K        85.3%      88.1%    +2.8 pp    0.0003

A text-to-SQL boundary test (Spider) shows no transfer, consistent with the method's dependence on early-token trajectory diversity.

HumanEval: 10 seeds. MATH-500: 8 seeds. GSM8K: 10 seeds. Cross-architecture: FalconH1-7B reaches 71.8% versus 71.4% for LoRA in a 3-seed comparison, statistically indistinguishable at this sample size.

Scaling: gains increase with model size — +2.6 pp at 0.8B, +23.6 pp at 4B, +44.0 pp at 9B (HumanEval).

How it works

Each recurrent layer in a hybrid model maintains a state matrix updated at every token. By default this state starts at zero. S₀ tuning optimizes that initial value on a small set of correct solutions; the learned state acts as a trajectory-steering "launch vector" that redirects generation from the very first token. 85% of corrected solutions diverge from baseline at the first generated character.
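The optimization described above can be sketched as follows. This is a toy illustration, not the released trainer: the "model" is a single frozen recurrent layer, and only the initial state `s0` receives gradients. The cross-entropy loss is masked so that, as in the method, only the solution tokens after `prompt_length` are supervised:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d = 16, 8

# Toy stand-in for a frozen hybrid model: embedding, recurrent update, head.
emb  = nn.Embedding(vocab, d)
W    = nn.Linear(d, d, bias=False)
head = nn.Linear(d, vocab)
for m in (emb, W, head):
    m.requires_grad_(False)   # all model weights stay frozen

# The only trainable tensor: one initial state per recurrent layer.
s0  = nn.Parameter(torch.zeros(d))
opt = torch.optim.Adam([s0], lr=1e-2)

def loss_on(text_ids, prompt_len):
    s, logits = s0, []
    for t in text_ids[:-1]:
        s = torch.tanh(W(s) + emb(t))   # state absorbs each token
        logits.append(head(s))
    logits  = torch.stack(logits)
    targets = text_ids[1:]
    # Supervise only the solution tokens, not the prompt.
    return nn.functional.cross_entropy(
        logits[prompt_len - 1:], targets[prompt_len - 1:])

# Stand-in for ~48 verified-correct (text, prompt_length) pairs.
data = [(torch.randint(0, vocab, (12,)), 4) for _ in range(8)]
for _ in range(50):
    for ids, plen in data:
        opt.zero_grad()
        loss_on(ids, plen).backward()
        opt.step()
```

After training, `s0` is the only artifact to save; at inference it simply replaces the zero state at step one.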

Install & usage

```shell
pip install s0-tuning
```

```python
from s0 import S0Trainer, S0Config

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.train(data)      # data: list of (text, prompt_length) pairs
trainer.activate()       # zero-cost at inference
```
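Because a tuned state is just a dictionary of per-layer tensors, task switching amounts to loading a different file. The sketch below uses plain `torch.save`/`torch.load` with hypothetical layer names and file paths (not the `s0` package API) to show why no weight merging or model reload is needed:

```python
import torch

# Hypothetical per-layer initial states for two tasks.
states_code = {f"layer_{i}.s0": torch.randn(64, 16) for i in range(4)}
states_math = {f"layer_{i}.s0": torch.randn(64, 16) for i in range(4)}
torch.save(states_code, "s0_code.pt")
torch.save(states_math, "s0_math.pt")

def activate(model_states, path):
    # Overwrite only the initial states; model weights are untouched,
    # so there is no LoRA-style merge and no model reload.
    model_states.update(torch.load(path))

active = {}
activate(active, "s0_code.pt")   # code tasks
activate(active, "s0_math.pt")   # switch to math: a single file load
```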

BibTeX

@article{young2026s0tuning,
  title   = {S$_0$ Tuning: Zero-Overhead Adaptation of
             Hybrid Recurrent-Attention Models},
  author  = {Young, Jack},
  journal = {arXiv preprint arXiv:2604.01168},
  year    = {2026}
}