Quantization × Interpretability

Jan 2026 · Active

Research investigating how model quantization affects sparse autoencoders (SAEs). We found that SAEs transfer across precisions (99% feature correlation), that code generation degrades by 50% at INT4 while knowledge retrieval survives intact, and that undercomplete SAEs (fewer features than the model dimension) transfer 2.3× better than overcomplete ones.

99%    SAE transfer (BF16→INT4)
-50%   Code (HumanEval): 40% → 20%
0%     Knowledge (MMLU): 100% → 100%
2.3×   Undercomplete vs overcomplete transfer: 0.5× beats 8×
TL;DR: BF16-trained SAEs work on INT4 models (99% correlation). Code generation breaks at INT4 (-50%), knowledge retrieval survives (0% loss). Undercomplete SAEs (0.5× model dimension) transfer 2.3× better than overcomplete (8×).
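A minimal sketch of what that correlation measurement could look like (the sae object, its encode method, and the activation tensors are stand-ins, not this project's code): encode the same prompts' activations from the BF16 and INT4 models with the BF16-trained SAE, then correlate features token-wise.

import torch

def sae_feature_correlation(sae, acts_bf16, acts_int4):
    # sae: any SAE exposing an encode() method, trained on BF16 activations.
    # acts_*: [n_tokens, d_model] residual-stream activations collected from
    #         the same prompts under each precision.
    with torch.no_grad():
        f_a = sae.encode(acts_bf16.float())   # [n_tokens, d_sae]
        f_b = sae.encode(acts_int4.float())

    # Per-feature Pearson correlation across tokens.
    a = f_a - f_a.mean(dim=0, keepdim=True)
    b = f_b - f_b.mean(dim=0, keepdim=True)
    denom = a.norm(dim=0) * b.norm(dim=0)
    corr = (a * b).sum(dim=0) / denom.clamp_min(1e-8)

    # Average only over features that actually fired in both runs.
    alive = denom > 1e-8
    return corr[alive].mean().item()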

Finding 1: Degradation has structure

Benchmark accuracy, BF16 → INT4:

Code (HumanEval)      Qwen3 40% → 20%      StarCoder2 35% → 18%
Math (GSM8K)          Qwen3 87% → 87%      StarCoder2 78% → 71%
Knowledge (MMLU-CS)   Qwen3 100% → 100%    StarCoder2 93% → 93%
Pattern: Generative tasks (code) break first. Discriminative tasks (knowledge) survive. Quantization adds noise to generation, not retrieval.
Example: HumanEval/8, the sum_product function (sketched below).
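For readers unfamiliar with the benchmark, HumanEval/8 asks the model to complete sum_product, which returns the sum and product of a list of integers (empty sum is 0, empty product is 1). Below is a standard reference solution with the kind of tests a generated completion must pass; it is not the quantized model's output.

from typing import List, Tuple
import math

def sum_product(numbers: List[int]) -> Tuple[int, int]:
    # HumanEval/8: return (sum, product) of the list.
    # A single off-by-one or a wrong empty-list default fails the hidden
    # tests, which is why generative benchmarks register degradation sharply.
    return (sum(numbers), math.prod(numbers))

assert sum_product([]) == (0, 1)
assert sum_product([1, 2, 3, 4]) == (10, 24)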

Finding 2: SAEs transfer across precisions

Finding 3: Undercomplete SAEs transfer better
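For context on the 0.5× vs 8× comparison, the sketch below uses a generic ReLU autoencoder and placeholder dimensions to show what undercomplete and overcomplete dictionary sizes mean; it is not this project's training code.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal ReLU SAE; dictionary size = expansion_factor * d_model.
    def __init__(self, d_model: int, expansion_factor: float):
        super().__init__()
        d_sae = int(expansion_factor * d_model)
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

d_model = 2048                              # placeholder residual-stream width
under = SparseAutoencoder(d_model, 0.5)     # 1,024 features (undercomplete)
over = SparseAutoencoder(d_model, 8.0)      # 16,384 features (overcomplete)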

Method: How we ran experiments
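The method write-up is not reproduced here, but as a rough illustration of the stack listed at the bottom of the page, a model can be loaded in both precisions with Transformers and bitsandbytes and its hidden states extracted for the SAE. The model id and layer index below are placeholders, not the project's actual configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "Qwen/Qwen3-8B"   # placeholder model id, not necessarily the project's checkpoint

# Full-precision reference copy.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# 4-bit copy via bitsandbytes (NF4 with BF16 compute is a common configuration).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_int4 = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
batch = tokenizer("def sum_product(numbers):", return_tensors="pt")

LAYER = 12   # placeholder layer; the residual stream the SAE was trained on
with torch.no_grad():
    h_bf16 = model_bf16(**batch.to(model_bf16.device),
                        output_hidden_states=True).hidden_states[LAYER]
    h_int4 = model_int4(**batch.to(model_int4.device),
                        output_hidden_states=True).hidden_states[LAYER]

# h_bf16 and h_int4 feed the same BF16-trained SAE; see the correlation sketch above.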

Impact: Why this matters for safety

Python · PyTorch · SAELens · bitsandbytes · Transformers