ControlMT v2.3 — int8 dynamic (CPU)
CPU-optimized int8 dynamic-quantized variant of anandkaman/controlmt-v2.3.
Auto-applies torch.quantization.quantize_dynamic to every nn.Linear at load time.
You don't need to write any quantization code — just load it the standard HF way.
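As a rough illustration of what happens at load time, the sketch below applies torch.quantization.quantize_dynamic to a toy nn.Sequential. The toy model and its layer sizes are assumptions for demonstration only; the actual ControlMT modeling code is not reproduced here and may differ in detail.

```python
import torch
import torch.nn as nn

# Toy stand-in for a translation model (hypothetical; not the ControlMT architecture).
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).eval()

# Swap every nn.Linear for a dynamically quantized equivalent:
# weights are stored as int8, activations are quantized on the fly at each call.
quantized = torch.quantization.quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(2, 16))

# The Linear layers are now quantized modules; the ReLU is left untouched.
print(type(quantized[0]).__module__)
print(type(quantized[1]).__name__)
print(out.shape)
```

The call returns a converted copy and leaves the original module as it was, which is why the result is assigned to a new name.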
Performance (RTX 5060 Ti box, 6-pair test, beam=2)
| Variant | Latency / pair | RAM | vs CPU bf16 |
|---|---|---|---|
| int8 (this) | 0.28 s | ~140 MB | 1.8× faster |
| CPU bf16 | 0.51 s | 280 MB | (baseline) |
| CPU fp32 | 1.44 s | 560 MB | 2.8× slower |
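The ratios in the last column follow directly from the latency column. A small sketch of that arithmetic, using only the figures from the table above (the int8 RAM value is approximate, so the RAM ratio is indicative only):

```python
# Latency (seconds per pair) and RAM (MB) copied from the table above.
latency = {"int8": 0.28, "bf16": 0.51, "fp32": 1.44}
ram_mb = {"int8": 140, "bf16": 280, "fp32": 560}  # int8 figure is approximate

baseline = latency["bf16"]

# How many times faster int8 is than the bf16 baseline.
int8_speedup = baseline / latency["int8"]
# How many times slower fp32 is than the bf16 baseline.
fp32_slowdown = latency["fp32"] / baseline
# int8 compared directly with fp32, which the table does not list.
int8_vs_fp32 = latency["fp32"] / latency["int8"]

print(f"int8 vs bf16: {int8_speedup:.1f}x faster")   # 1.8x
print(f"fp32 vs bf16: {fp32_slowdown:.1f}x slower")  # 2.8x
print(f"int8 vs fp32: {int8_vs_fp32:.1f}x faster")   # 5.1x
print(f"RAM fp32 / int8: {ram_mb['fp32'] / ram_mb['int8']:.0f}x")  # 4x
```

The halving of RAM at each step is consistent with weights stored at 4, 2, and 1 bytes each.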
Quality: output was identical to fp32 on our test set. In production, re-validate on your own representative sentences; int8 dynamic quantization can occasionally cost 0.5–1 BLEU on long-tail outputs.
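One simple way to act on the re-validation advice above is an exact-match comparison between a reference translator and the int8 one. The sketch below is a minimal example under stated assumptions: compare_outputs is a hypothetical helper, not part of any ControlMT package, and the two stand-in translators are fake lookups so the code runs without downloading a model. In practice they would wrap model.translate(...) for the fp32 and int8 models.

```python
from typing import Callable, Iterable, List, Tuple


def compare_outputs(
    sentences: Iterable[str],
    reference_fn: Callable[[str], str],
    candidate_fn: Callable[[str], str],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return the exact-match rate and the (source, reference, candidate) mismatches."""
    mismatches = []
    total = 0
    for src in sentences:
        total += 1
        ref = reference_fn(src).strip()
        cand = candidate_fn(src).strip()
        if ref != cand:
            mismatches.append((src, ref, cand))
    rate = 1.0 if total == 0 else (total - len(mismatches)) / total
    return rate, mismatches


# Fake translators (assumptions for illustration only).
fp32_table = {"a": "Alpha.", "b": "Beta.", "c": "Gamma."}
int8_table = {"a": "Alpha.", "b": "Beta.", "c": "Gama."}

rate, diffs = compare_outputs(
    ["a", "b", "c"],
    lambda s: fp32_table[s],
    lambda s: int8_table[s],
)
print(f"exact match: {rate:.0%}")
for src, ref, cand in diffs:
    print(f"{src!r}: fp32={ref!r} int8={cand!r}")
```

Exact match is stricter than BLEU, so any mismatches it surfaces are a shortlist to inspect by hand rather than a quality score.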
Quick start
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("anandkaman/controlmt-v2.3-int8", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3-int8", trust_remote_code=True)
# Already quantized — just translate
print(model.translate("ನಾನು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇನೆ.",
                      tokenizer=tokenizer, direction="kn2en"))
# → "I speak Kannada."
That's it. No quantize_dynamic call needed; the modeling code does it for you
on from_pretrained.
Or use the SDK (one-liner, also handles pip install):
pip install controlmt
from controlmt import ControlMT
model = ControlMT.from_hf(model_id="anandkaman/controlmt-v2.3-int8", quant="int8")
Hardware
- ✅ CPU (x86 + ARM, ≥1 GB RAM)
- ❌ GPU: quantization is CPU-only; calling .to('cuda') reverts the int8 ops to fp32 Linear and you lose the speed/memory win. Use the main repo with dtype=torch.float16 for GPU.
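The hardware guidance above amounts to a small decision rule. A minimal sketch, assuming a hypothetical helper named pick_variant (not part of the repo or the SDK) that takes the result of a CUDA availability check:

```python
from typing import Optional, Tuple

INT8_REPO = "anandkaman/controlmt-v2.3-int8"
MAIN_REPO = "anandkaman/controlmt-v2.3"


def pick_variant(cuda_available: bool) -> Tuple[str, Optional[str]]:
    """Return (repo id, dtype name) for the current machine.

    The dtype name is None for the int8 repo, which quantizes itself at load time.
    """
    if cuda_available:
        # GPU: use the main repo in half precision.
        return MAIN_REPO, "float16"
    # CPU: use the int8 repo and keep the model on the CPU.
    return INT8_REPO, None


print(pick_variant(cuda_available=False))
print(pick_variant(cuda_available=True))
```

In a real script the flag would typically come from torch.cuda.is_available(), and the dtype name would be resolved with getattr(torch, name) before being passed to the loader.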
Other variants in the family
| Repo | Best for |
|---|---|
| controlmt-v2.3 | General use — fp32 / bf16 / fp16 chosen at load |
| controlmt-v2.3-int8 (you are here) | CPU-only, memory-constrained, fastest CPU |
| controlmt-demo | Live web demo |
License
Apache 2.0. Same as the base model.