WhissleAI Gujlish wav2vec2 Meta-ASR
A bilingual English-Gujarati (Gujlish) speech recognition model that simultaneously transcribes speech and extracts rich metadata — speaker age, gender, emotion, intent, and named entities — in a single forward pass using CTC decoding.
Model Description
This model is fine-tuned from CLSRIL-23 (a wav2vec2-base model pre-trained on 23 Indian languages by IIT Madras) for bilingual English + Gujarati ASR with inline meta-tag prediction. It uses a unified character-level vocabulary with atomic meta-tag tokens, enabling the CTC head to output both transcript text and structured metadata.
| Property | Value |
|---|---|
| Architecture | Wav2Vec2ForCTC (HuggingFace Transformers) |
| Base Model | CLSRIL-23 (wav2vec2-base, 23 Indian languages) |
| Parameters | 94.7M |
| Vocab Size | 795 tokens (characters + meta-tags) |
| Languages | English (EN), Gujarati (GU), Code-switched Gujlish |
| Input | Raw 16kHz waveform |
| Output | CTC logits → transcript + AGE/GENDER/EMOTION/INTENT/ENTITY tags |
| Training Framework | PyTorch Lightning + HuggingFace Transformers |
Meta-Tags
The model predicts the following inline tags alongside the transcript:
| Tag Category | Possible Values |
|---|---|
| AGE | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60PLUS |
| GENDER | GENDER_MALE, GENDER_FEMALE, GENDER_OTHER |
| EMOTION | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_DISGUST, EMOTION_FEAR, EMOTION_SURPRISE |
| INTENT | INTENT_INFORM, INTENT_QUESTION, INTENT_COMMAND, INTENT_REQUEST, INTENT_GREETING, INTENT_THANK |
| ENTITY | ENTITY_PERSON_NAME ... END, ENTITY_LOCATION ... END, ENTITY_ORGANIZATION ... END, etc. |
Example output:
```
the weather in ENTITY_LOCATION ahmedabad END is very hot today AGE_30_45 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM
```
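Since the tags are ordinary tokens in the output string, downstream code has to separate them from the transcript. A minimal parsing sketch (the helper and regexes below are illustrative, not part of the model repo):

```python
import re

# Entity spans open with ENTITY_<TYPE> and close with END; AGE/GENDER/EMOTION/
# INTENT are single atomic tokens (see the tables above).
ENTITY_RE = re.compile(r"ENTITY_([A-Z_]+)\s+(.+?)\s+END")
SCALAR_RE = re.compile(r"\b(?:AGE|GENDER|EMOTION|INTENT)_[A-Z0-9_]+\b")

def parse_output(raw: str) -> dict:
    entities = [{"type": m.group(1), "text": m.group(2)}
                for m in ENTITY_RE.finditer(raw)]
    # Drop the entity markers but keep the entity words in the transcript.
    no_entities = ENTITY_RE.sub(lambda m: m.group(2), raw)
    tags = SCALAR_RE.findall(no_entities)
    transcript = re.sub(r"\s+", " ", SCALAR_RE.sub("", no_entities)).strip()
    return {"transcript": transcript, "entities": entities, "tags": tags}

print(parse_output(
    "the weather in ENTITY_LOCATION ahmedabad END is very hot today "
    "AGE_30_45 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"
))
# {'transcript': 'the weather in ahmedabad is very hot today',
#  'entities': [{'type': 'LOCATION', 'text': 'ahmedabad'}],
#  'tags': ['AGE_30_45', 'GENDER_MALE', 'EMOTION_NEUTRAL', 'INTENT_INFORM']}
```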
Performance
ASR (Word Error Rate)
| Language | Full WER | Clean WER | Samples |
|---|---|---|---|
| English | 20.31% | 15.59% | 500 |
| Gujarati | 29.95% | 23.13% | 500 |
- Full WER = WER on raw output (transcript + tags)
- Clean WER = WER on transcript only (tags stripped)
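Clean WER can be reproduced by stripping the tag tokens from both hypothesis and reference before scoring. A sketch using the jiwer package (jiwer is an assumption; the card does not name its scoring tool):

```python
import re
from jiwer import wer  # assumed scoring library

META_RE = re.compile(r"\b(?:AGE|GENDER|EMOTION|INTENT|ENTITY)_[A-Z0-9_]+\b|\bEND\b")

def strip_tags(s: str) -> str:
    # Remove meta-tag tokens, then collapse leftover whitespace.
    return re.sub(r"\s+", " ", META_RE.sub("", s)).strip()

hyp = ("the weather in ENTITY_LOCATION ahmedabad END is very hot today "
       "AGE_30_45 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM")
ref = "the weather in ahmedabad is very hot today"
print(wer(ref, strip_tags(hyp)))  # clean WER scores the transcript only
```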
Meta-Tag Accuracy
| Tag | English | Gujarati |
|---|---|---|
| Age | 80.6% | 84.0% |
| Gender | 96.8% | 99.4% |
| Emotion | 81.8% | 83.3% |
| Intent | 80.1% | 83.5% |
| Entity F1 | 40.8% | 14.4% |
Latency
| Metric | Value |
|---|---|
| Avg inference latency | 5.7 ms per utterance |
| Real-time factor (RTF) | < 0.01x |
Measured on NVIDIA A100 GPU, batch_size=1, average utterance ~5 seconds.
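A rough way to reproduce the measurement on your own hardware (the timing loop below is illustrative; the figures above were measured on an A100 at batch size 1):

```python
import time
import torch
from transformers import Wav2Vec2ForCTC

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Wav2Vec2ForCTC.from_pretrained(
    "WhissleAI/speech-tagger_gujlish_wav2vec2_meta").to(device).eval()
audio = torch.randn(1, 16000 * 5, device=device)  # ~5 s dummy utterance at 16 kHz

with torch.no_grad():
    model(input_values=audio)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        model(input_values=audio)
    if device == "cuda":
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / 20

print(f"latency: {latency * 1000:.1f} ms/utterance, RTF: {latency / 5.0:.4f}")
```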
Usage
Quick Start (HuggingFace)
```python
import json

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import Wav2Vec2ForCTC

# Load model
model = Wav2Vec2ForCTC.from_pretrained("WhissleAI/speech-tagger_gujlish_wav2vec2_meta")
model.eval()

# Load vocab (maps tokens, including meta-tags, to CTC output ids)
vocab_path = hf_hub_download("WhissleAI/speech-tagger_gujlish_wav2vec2_meta", "vocab.json")
with open(vocab_path) as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Load audio: resample to 16 kHz and downmix to mono
waveform, sr = torchaudio.load("test.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0) if waveform.shape[0] > 1 else waveform.squeeze(0)

# Inference
with torch.no_grad():
    logits = model(input_values=waveform.unsqueeze(0)).logits

# CTC greedy decode: collapse repeated frames, then drop blanks
pred_ids = torch.argmax(logits, dim=-1)[0].tolist()
collapsed = []
prev = -1
for p in pred_ids:
    if p != prev and p != 0:  # 0 = <pad>/blank
        collapsed.append(p)
    prev = p  # update every frame so repeats separated by blanks survive

# "|" is the word delimiter in the vocab
text = "".join(
    " " if id_to_token.get(i) == "|" else id_to_token.get(i, "") for i in collapsed
).strip()
print(text)
```
Using the Inference Script
```bash
# Install dependencies
pip install torch torchaudio transformers huggingface_hub numpy

# Transcribe a single file
python infer_wav2vec2_gujlish.py --model WhissleAI/speech-tagger_gujlish_wav2vec2_meta --audio test.wav

# Batch transcribe from manifest
python infer_wav2vec2_gujlish.py \
    --model WhissleAI/speech-tagger_gujlish_wav2vec2_meta \
    --manifest test.json \
    --output results.json \
    --batch-size 16 \
    --device cuda
```
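The manifest schema is not documented in this card. If `infer_wav2vec2_gujlish.py` follows the common JSON-lines convention for ASR manifests, `test.json` might look like the example below; the field names are an assumption, so check the script for the actual keys:

```json
{"audio_filepath": "clips/utt_0001.wav", "text": "optional reference transcript"}
{"audio_filepath": "clips/utt_0002.wav", "text": ""}
```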
From PyTorch Lightning Checkpoint (.ckpt)
```bash
python infer_wav2vec2_gujlish.py \
    --model ./checkpoints/gujlish-wav2vec2-clsril23-v1-step=65000-val_wer=0.2610.ckpt \
    --pretrained ./CLSRIL-23.pt \
    --vocab ./vocab.json \
    --audio test.wav
```
Training Details
Data
| Dataset | Language | Samples | Source |
|---|---|---|---|
| CommonVoice 17 (English) | EN | ~250K | Mozilla |
| CommonVoice 17 (Gujarati) | GU | ~5K | Mozilla |
| IndicVoices | GU | ~50K | AI4Bharat |
| IndicVoices-R | GU | ~10K | AI4Bharat |
| Kathbath | GU | ~30K | AI4Bharat |
| FLEURS (Gujarati) | GU | ~2K | Google |
| Internal recordings | EN+GU | ~70K | Whissle |
| Total | EN + GU | 525,500 train / 26,669 valid | |
Language split: 68.6% English, 31.4% Gujarati (Indo-Aryan language family).
All training samples were annotated with meta-tags (AGE, GENDER, EMOTION, INTENT, ENTITY) using Whissle's automated annotation pipeline powered by Gemini.
Training Configuration
| Parameter | Value |
|---|---|
| Base model | CLSRIL-23 (wav2vec2-base, 23 Indian languages) |
| Feature extractor | Frozen |
| Batch size | 48 (effective: 96 with grad accumulation) |
| Learning rate | 1e-4 |
| Weight decay | 0.005 |
| Warmup steps | 3,000 |
| Max steps | 100,000 |
| Max audio duration | 16s |
| Precision | Mixed (FP16) |
| Noise augmentation | 70% probability, SNR -40 to -10 dB |
| Best checkpoint | step=65,000 (val_wer=0.261) |
Training Procedure
- Pretrained checkpoint: CLSRIL-23 (fairseq format) converted to HuggingFace Wav2Vec2ForCTC
- Vocabulary: built from the training manifest; 795 tokens (characters for both scripts + meta-tags)
- Feature extractor frozen: only the transformer encoder and CTC head are fine-tuned (see the sketch below)
- CTC loss: standard CTC with `<pad>` (id=0) as the blank token
- Per-language WER tracking: separate validation metrics for English and Gujarati
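The Lightning training code is not included in this card, but the freezing step maps directly onto the Transformers API. A minimal sketch (loading the released model here purely for illustration; the actual run starts from the converted CLSRIL-23 checkpoint):

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("WhissleAI/speech-tagger_gujlish_wav2vec2_meta")
model.freeze_feature_encoder()  # the CNN feature extractor stays fixed

# Everything else (transformer encoder + lm_head CTC head) remains trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable / 1e6:.1f}M")
```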
Architecture
```
Raw Audio (16kHz)
          │
          ▼
┌───────────────────┐
│ Feature Extractor │ ← Frozen (7 conv layers, stride=320)
│       (CNN)       │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Feature Projection│ ← 512 → 768
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    Transformer    │ ← 12 layers, 768 hidden, 12 heads
│      Encoder      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     CTC Head      │ ← Linear(768 → 795)
│     (lm_head)     │
└─────────┬─────────┘
          │
          ▼
  CTC Greedy Decode
          │
          ▼
"hello world AGE_30_45 GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM"
```
Limitations
- Gujarati script: The model handles Gujarati script, but clean WER is higher than for English (23.1% vs 15.6%) due to the smaller amount of Gujarati training data
- Entity extraction: Entity F1 is moderate for English (40.8%) and low for Gujarati (14.4%); training focused primarily on person names and locations
- Code-switching: The model handles English-Gujarati code-switching, but rapid intra-sentential switching may increase WER
- Noise robustness: Trained with aggressive noise augmentation, yet performance degrades significantly below 5 dB SNR
- Max duration: Trained on utterances up to 16 seconds; longer audio should be chunked (see the sketch below)
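For the last point, a minimal fixed-window chunking sketch; it is illustrative only, reuses `model` from the Quick Start, and ignores overlap handling and tags that straddle chunk boundaries:

```python
import torch

CHUNK_SECONDS = 16
SAMPLE_RATE = 16000

def transcribe_long(waveform: torch.Tensor, decode) -> str:
    """waveform: mono float tensor (num_samples,); decode: the greedy decoder from Quick Start."""
    pieces = []
    for start in range(0, waveform.numel(), CHUNK_SECONDS * SAMPLE_RATE):
        chunk = waveform[start:start + CHUNK_SECONDS * SAMPLE_RATE]
        with torch.no_grad():
            logits = model(input_values=chunk.unsqueeze(0)).logits
        pieces.append(decode(logits))
    return " ".join(pieces)
```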
ONNX Export
An ONNX-exported version of this model is available for production deployment. The ONNX model supports dynamic batch size and sequence length:
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
audio = np.random.randn(1, 48000).astype(np.float32)  # 3 seconds at 16 kHz
logits = session.run(None, {"input_values": audio})[0]
# logits has shape (batch, frames, 795); decode with the same CTC greedy loop as above
```
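The repo ships the prebuilt `model.onnx`. If you need to re-export from the PyTorch weights, a `torch.onnx.export` call along the following lines should yield the same dynamic batch/sequence axes (the opset version and axis names here are assumptions):

```python
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "WhissleAI/speech-tagger_gujlish_wav2vec2_meta").eval()
dummy = torch.randn(1, 16000)  # 1 s of dummy audio
torch.onnx.export(
    model, (dummy,), "model.onnx",
    input_names=["input_values"], output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "samples"},
                  "logits": {0: "batch", 1: "frames"}},
    opset_version=14,
)
```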
Citation
```bibtex
@misc{whissle2025gujlish,
  title={Gujlish Meta-ASR: Bilingual English-Gujarati Speech Recognition with Inline Metadata Prediction},
  author={Whissle AI},
  year={2025},
  url={https://huggingface.co/WhissleAI/speech-tagger_gujlish_wav2vec2_meta}
}
```
License
This model is released under the Apache 2.0 License.
About WhissleAI
WhissleAI builds production-grade speech AI systems. Our Meta-ASR models jointly perform speech recognition and speaker/content analysis in a single pass, enabling real-time applications like live coaching, meeting intelligence, and voice analytics.
- Website: whissle.ai
- GitHub: github.com/WhissleAI
- HuggingFace: huggingface.co/WhissleAI