Sarvam-1: Hinglish Text-Normalization LoRA

A LoRA adapter for sarvamai/sarvam-1 (2B) that performs Text Normalization (TN) for code-mixed Hindi/English text: it turns written forms into the spoken form a TTS acoustic model needs (acronyms → phonetic letters, IDs/phone numbers → digit-by-digit, times, dates, currency, units, percentages) while preserving the Hindi/English code-mix.

Part of fast-indic-tts.

Live demo: https://huggingface.co/spaces/AK04-IXR/fast-indic-tts

Why

sarvam-1 is a base (non-instruct) model and cannot be reliably prompted into TN: 12-shot ICL scores 49.9% WER, worse than both rule-based baselines. This adapter is instead fine-tuned on a synthetic, correct-by-construction code-mixed corpus.

Results (held-out 40-sentence labeled set)

| System | WER ↓ | CER ↓ | Exact-Match ↑ |
|---|---|---|---|
| naive rules (indic-numtowords) | 43.6% | 43.7% | 0% |
| competitive rule engine | 20.9% | 17.5% | 27.5% |
| Sarvam-1 base (12-shot ICL) | 49.9% | 35.4% | 5% |
| this adapter | 7.96% | 6.37% | 62.5% |
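
For readers unfamiliar with the three metrics, the following is a minimal sketch of how WER, CER, and exact-match can be computed from reference/hypothesis pairs. It assumes corpus-level aggregation (total edits divided by total reference length) and whitespace tokenization; the function names are illustrative, and the evaluation harness in the repo may normalize or aggregate differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_scores(refs, hyps):
    """Corpus-level WER, CER and exact-match rate (aggregation is an assumption)."""
    word_errs = word_total = char_errs = char_total = exact = 0
    for ref, hyp in zip(refs, hyps):
        ref, hyp = ref.strip(), hyp.strip()
        word_errs += edit_distance(ref.split(), hyp.split())
        word_total += len(ref.split())
        char_errs += edit_distance(ref, hyp)
        char_total += len(ref)
        exact += int(ref == hyp)
    return {
        "wer": word_errs / word_total,
        "cer": char_errs / char_total,
        "exact_match": exact / len(refs),
    }


if __name__ == "__main__":
    refs = ["four thirty pee-em", "eight three nine two"]
    hyps = ["four thirty pee-em", "eight three nine too"]
    print(corpus_scores(refs, hyps))
```

A single wrong word makes the whole sentence fail exact-match, which is why exact-match can stay well below 100% even when WER is in the single digits.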

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter on top.
tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")
m = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-1")
m = PeftModel.from_pretrained(m, "AK04-IXR/sarvam1-hinglish-tn-lora")

# Prompt format: the written text after "Input:", then an empty "Output:" for the model to complete.
prompt = "Input: Mera flight ticket PNR-8392 hai, aur departure 4:30 PM ko hai.\nOutput:"
ids = tok(prompt, return_tensors="pt").to(m.device)
# Greedy decoding; slice off the prompt tokens so only the normalized text is printed.
out = m.generate(**ids, max_new_tokens=96, do_sample=False)
print(tok.decode(out[0][ids['input_ids'].shape[1]:], skip_special_tokens=True))
# -> Mera flight ticket pee-en-aar eight three nine two hai, aur departure four thirty pee-em ko hai.
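
When normalizing many sentences, it can help to wrap the prompt format in small helpers. The sketch below is illustrative only: `build_prompt` and `first_line` are hypothetical names, and trimming the decoded text to its first non-empty line is an assumption that guards against a base model continuing to generate after the answer.

```python
def build_prompt(text: str) -> str:
    """Wrap raw text in the "Input: ...\\nOutput:" format shown in the usage example."""
    return f"Input: {text.strip()}\nOutput:"


def first_line(generated: str) -> str:
    """Return the first non-empty line of the decoded continuation.

    Assumption: anything after the first line (for example a new "Input:"
    turn the model invents) is not part of the normalized sentence.
    """
    for line in generated.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    print(repr(build_prompt("Meeting 4:30 PM ko hai.")))
    print(first_line(" Meeting four thirty pee-em ko hai.\nInput: ..."))
```

With these helpers, the decoded string from `tok.decode(...)` can be passed through `first_line` before it is handed to the TTS stage.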

Training

LoRA (r=16, α=32, all attention and MLP projections; 0.94% of params) on ~8k synthetic pairs, 3 epochs, bf16, on a single A100. See the GitHub repo for the data generator, trainer, and evaluation harness.
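
As a rough illustration, the hyperparameters above could be expressed as a PEFT `LoraConfig` along the following lines. This is a sketch, not the project's actual trainer configuration: the `target_modules` names assume Llama-style projection layer names, and dropout, learning rate, and other settings are not stated here and are left at library defaults.

```python
# Hyperparameters taken from the description above; module names are an assumption.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections (assumed names)
        "gate_proj", "up_proj", "down_proj",     # MLP projections (assumed names)
    ],
    "task_type": "CAUSAL_LM",
}


def make_lora_config():
    """Build a peft.LoraConfig from the kwargs above."""
    # Imported lazily so the dict can be inspected without peft installed.
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)
```

The result of `make_lora_config()` can be passed to `peft.get_peft_model(base_model, config)`. With α=32 and r=16, the LoRA update is scaled by α/r = 2.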

Limitations

Trained on synthetic data, so it follows the project's normalization conventions. The held-out test set is small (40 sentences), so treat the headline number as indicative and see the per-category breakdown in the repo.
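
To give a sense of how much uncertainty a 40-sentence test set carries, the sketch below computes a 95% Wilson score interval for the exact-match rate. It assumes 62.5% of 40 sentences corresponds to 25 exact matches and treats sentences as independent; this interval is an illustration and is not a figure reported by the project.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(25, 40)  # 25/40 = 62.5% exact match
    print(f"exact-match 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 47% to 76%, which is why the headline exact-match figure is best read as indicative.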
