Sarvam-1 — Hinglish Text-Normalization LoRA

A LoRA adapter for sarvamai/sarvam-1 (2B) that performs text normalization (TN) on code-mixed Hindi/English (Hinglish) text. It rewrites written forms into the spoken forms a TTS acoustic model needs (acronyms → phonetic letters, IDs and phone numbers → digit by digit, plus times, dates, currency, units, and percentages) while preserving the Hindi/English code-mix.

Part of fast-indic-tts.

Live demo: https://huggingface.co/spaces/AK04-IXR/fast-indic-tts

Why

sarvam-1 is a base (non-instruct) model and cannot be reliably prompted into TN: with 12-shot in-context learning (ICL) it scores 49.9% WER, worse than both the naive rules (43.6%) and the competitive rule engine (20.9%). This adapter is instead fine-tuned on a synthetic, correct-by-construction code-mixed corpus.
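As an illustration of what "correct-by-construction" can mean here, the sketch below builds a written sentence and its spoken form from the same sampled values, so the label cannot disagree with the input. The template, the `LETTERS` table, and the helper names are hypothetical; the project's actual generator and conventions are in its repo and may differ.

```python
import random

# Hypothetical spoken forms, chosen to match the usage example in this card.
LETTERS = {"P": "pee", "N": "en", "R": "aar", "M": "em"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_acronym(acronym: str) -> str:
    """Spell an acronym letter by letter, joined with hyphens."""
    return "-".join(LETTERS[ch] for ch in acronym)


def spell_digits(number: str) -> str:
    """Read a digit string one digit at a time."""
    return " ".join(DIGITS[int(ch)] for ch in number)


def make_pair(rng: random.Random) -> tuple[str, str]:
    """Sample one (written, spoken) pair from a single hypothetical template."""
    digits = "".join(rng.choice("0123456789") for _ in range(4))
    written = f"Mera flight ticket PNR-{digits} hai."
    spoken = (f"Mera flight ticket {spell_acronym('PNR')} "
              f"{spell_digits(digits)} hai.")
    return written, spoken


written, spoken = make_pair(random.Random(0))
print(written)
print(spoken)
```

Because both strings are rendered from the same `digits` value, no manual labeling step is needed for this template.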

Results (held-out 40-sentence labeled set)

System	WER ↓	CER ↓	Exact-Match ↑
naive rules (`indic-numtowords`)	43.6%	43.7%	0%
competitive rule engine	20.9%	17.5%	27.5%
Sarvam-1 base (12-shot ICL)	49.9%	35.4%	5%
this adapter	7.96%	6.37%	62.5%
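
For readers unfamiliar with the three metrics, here is a minimal sketch of how WER, CER, and exact match can be computed from reference and hypothesis strings. It assumes corpus-level (micro-averaged) error rates and no case or punctuation normalization; the card does not say which choices the repo's evaluation harness makes, so the function names and details here are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(refs, hyps):
    """Corpus-level WER, CER, and exact-match rate, returned as fractions."""
    pairs = list(zip(refs, hyps))
    word_errs = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    word_total = sum(len(r.split()) for r in refs)
    char_errs = sum(edit_distance(r, h) for r, h in pairs)
    char_total = sum(len(r) for r in refs)
    exact = sum(r == h for r, h in pairs)
    return {
        "wer": word_errs / word_total,
        "cer": char_errs / char_total,
        "exact_match": exact / len(refs),
    }


refs = ["four thirty pee-em", "eight three nine two"]
hyps = ["four thirty pee-em", "eight three nine too"]
print(score(refs, hyps))
```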

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter.
tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")
m = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-1")
m = PeftModel.from_pretrained(m, "AK04-IXR/sarvam1-hinglish-tn-lora")

# Wrap the written text in an "Input: ...\nOutput:" frame; the model completes the spoken form.
prompt = "Input: Mera flight ticket PNR-8392 hai, aur departure 4:30 PM ko hai.\nOutput:"
ids = tok(prompt, return_tensors="pt").to(m.device)
out = m.generate(**ids, max_new_tokens=96, do_sample=False)  # greedy decoding
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][ids['input_ids'].shape[1]:], skip_special_tokens=True))
# -> Mera flight ticket pee-en-aar eight three nine two hai, aur departure four thirty pee-em ko hai.
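
When normalizing many sentences, it can help to wrap the prompt frame and the post-processing in small helpers. The sketch below is an assumption-laden convenience, not part of the released code: `build_prompt` and `first_line` are hypothetical names, and trimming to the first line is only a precaution, since the card does not state whether the adapter stops on its own after one output line.

```python
def build_prompt(text: str) -> str:
    """Wrap raw text in the Input/Output frame shown in the usage example."""
    return f"Input: {text.strip()}\nOutput:"


def first_line(completion: str) -> str:
    """Keep only the first non-empty line of a decoded completion."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


print(build_prompt("Meeting 4:30 PM ko hai."))
print(first_line(" Meeting four thirty pee-em ko hai.\nInput: ..."))
```

The string returned by `build_prompt` can be passed to the tokenizer in place of `prompt`, and `first_line` can be applied to the decoded completion.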

Training

The adapter is a LoRA with r=16 and α=32 applied to all attention and MLP projections (0.94% of parameters), trained on ~8k synthetic pairs for 3 epochs in bf16 on a single A100. See the GitHub repo for the data generator, trainer, and evaluation harness.
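
To make the hyperparameters concrete: standard LoRA adds two low-rank matrices per adapted weight, costing r × (d_in + d_out) extra parameters, and scales the update by α / r. The sketch below works through that arithmetic on toy layer shapes. The shapes are made up for illustration and are not Sarvam-1's real dimensions, so the printed fraction is not expected to reproduce the 0.94% figure (which is presumably relative to the whole model).

```python
def lora_extra_params(shapes, r):
    """Parameters added by rank-r LoRA: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)


def lora_scaling(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


# Toy (d_out, d_in) shapes for one block: four attention projections
# and three MLP projections. These are NOT Sarvam-1's real dimensions.
toy_shapes = [(1024, 1024)] * 4 + [(4096, 1024), (4096, 1024), (1024, 4096)]

extra = lora_extra_params(toy_shapes, r=16)
base = sum(d_out * d_in for d_out, d_in in toy_shapes)
print(f"scaling = {lora_scaling(32, 16)}")
print(f"adapter params = {extra}, fraction of adapted weights = {extra / base:.4f}")
```

With r=16 and α=32 the scaling factor is 2, so the low-rank update is doubled before being added to the frozen weight.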

Limitations

The adapter is trained on synthetic data, so it follows the project's normalization conventions. The held-out test set is small (40 sentences, so a single sentence moves exact match by 2.5 points); treat the headline numbers as indicative and see the per-category breakdown in the repo.
