IndicConformer GGUF Models (Hindi & Punjabi)
This repository contains highly optimized GGUF versions of AI4Bharat's flagship IndicConformer models for Hindi and Punjabi Automatic Speech Recognition (ASR).
The models are converted directly from the official NeMo PyTorch checkpoints and are designed to run locally with no Python dependencies, using a lightweight C++ engine with a C API and the ggml execution provider.
These GGUF models are fully metadata-driven. All configuration parameters, vocabulary strings, and featurizer window buffers are embedded directly in the GGUF binary, so each file is standalone.
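As a rough illustration of what "metadata-driven" means at the file level, the sketch below reads only the fixed-size GGUF header, which records how many tensors and metadata key/value pairs follow. It assumes the little-endian GGUF v2/v3 layout (4-byte magic, `uint32` version, two `uint64` counts). The helper name `read_gguf_header` and the synthetic header bytes are made up for this example and are not part of any official tooling.

```python
import io
import struct


def read_gguf_header(stream):
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # Assumed layout (GGUF v2/v3, little-endian): uint32 version, uint64 tensor
    # count, uint64 metadata key/value count. That is 4 + 8 + 8 = 20 bytes.
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration only; the counts are invented.
fake_header = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)
header = read_gguf_header(io.BytesIO(fake_header))
print(header)
```

To inspect a real file, pass an open binary handle instead, for example `read_gguf_header(open("indicconformer-hindi.gguf", "rb"))`. The metadata entries themselves follow the header and need a fuller parser.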
Models Included
- indicconformer-hindi.gguf: Optimized for Hindi speech recognition.
- indicconformer-punjabi.gguf: Optimized for Punjabi speech recognition.
Evaluation & Benchmarks
1. Accuracy Benchmarks (WER & CER)
Word Error Rate (WER) and Character Error Rate (CER) on standard test sets. Lower numbers mean better transcription accuracy.
| Model / Dataset | Hindi (Kathbath Test) | Punjabi (Kathbath Test) | Hindi (FLEURS Test) | Punjabi (FLEURS Test) |
|---|---|---|---|---|
| IndicConformer GGUF (WER) | 13.5% | 15.1% | 15.2% | 16.8% |
| IndicConformer GGUF (CER) | 5.2% | 6.8% | 5.9% | 7.4% |
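For readers unfamiliar with these metrics, here is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over words for WER and over characters for CER. This is not the evaluation script used for the table above. Text normalization, punctuation handling, and whether spaces count towards CER are conventions that vary between toolkits, and the example sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example: one substituted word and one dropped word out of six.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

A corpus-level score is normally the total number of edits divided by the total reference length across all utterances, not the mean of per-utterance rates.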
2. Local Performance & Speed Benchmarks
Benchmarks were performed locally on a Windows workstation running an NVIDIA GeForce RTX 5070 Ti GPU and an Intel/AMD CPU.
- Test Audio Duration: 7.26 seconds (mono, 16 kHz).
- Metric: Real-Time Factor (RTF), i.e. processing time divided by audio duration. Lower is faster.
| Configuration | Average Time (s) | Real-Time Factor (RTF) | VRAM / RAM footprint |
|---|---|---|---|
| Hindi (GPU / CUDA) | 0.602s | 0.083x (12x speed) | ~520 MB |
| Hindi (CPU) | 0.436s | 0.060x (16x speed) | ~140 MB |
| Punjabi (GPU / CUDA) | 0.654s | 0.090x (11x speed) | ~520 MB |
| Punjabi (CPU) | 0.415s | 0.057x (17x speed) | ~140 MB |
For shorter clips (under 10 seconds), the CPU is slightly faster because it avoids the one-time CUDA kernel compilation and host-to-device memory-transfer latency. On longer files (e.g., 5+ minutes), that fixed overhead is amortized and GPU execution becomes substantially faster.
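The RTF and speed columns in the table can be reproduced from the average times and the 7.26-second clip length. The sketch below uses only the numbers reported above; the function name `real_time_factor` is chosen for this example.

```python
AUDIO_SECONDS = 7.26  # duration of the benchmark clip

# Average processing times (seconds) copied from the benchmark table.
timings = {
    "Hindi (GPU / CUDA)": 0.602,
    "Hindi (CPU)": 0.436,
    "Punjabi (GPU / CUDA)": 0.654,
    "Punjabi (CPU)": 0.415,
}


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent transcribing / length of the audio. Below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


for name, seconds in timings.items():
    rtf = real_time_factor(seconds, AUDIO_SECONDS)
    print(f"{name}: RTF {rtf:.3f}, about {1 / rtf:.1f}x real time")
```

The speed multiplier is simply the reciprocal of the RTF, and the table reports it rounded to a whole number.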
3. Comparison with OpenAI Whisper & Gemma 4 Audio
When deploying ASR systems locally, you must balance model size, system resources, accuracy, and execution latency.
| Feature / Model | IndicConformer GGUF (Ours) | OpenAI Whisper Large V3 | Gemma 4 Audio (12B) |
|---|---|---|---|
| Parameter Count | ~120M | ~1.5B | ~12B |
| RAM/VRAM Footprint | ~140 MB | ~3.1 GB | ~8.5 GB+ |
| Dependencies | None (Self-contained C++) | Python, PyTorch, Transformers | LLM Server (Ollama / HuggingFace) |
| Inference Mode | Real-time Streamable / Batch | Batch-only (seq2seq) | Batch-only (seq2seq) |
| Hindi Accuracy | 13.5% WER (Kathbath), 15.2% WER (FLEURS) | ~11-12% WER (often struggles with local dialects) | High semantic accuracy, but prone to LLM paraphrasing |
Key Advantage: In the benchmarks above, our GGUF models run at roughly 16-17x real-time speed on CPU while using less than 5% of the memory footprint listed for Whisper Large V3 or Gemma 4, which makes them well suited to low-power edge devices and CPU-only systems.
Technical Details
Model Architecture
The model utilizes a Hybrid Conformer-CTC architecture:
- Audio Preprocessing: 80-channel log-mel filterbanks extracted on 25ms windows with a 10ms stride.
- Subsampling Layer: 2D Depthwise Separable Convolutions reducing sequence length by 4x.
- Encoder: Conformer layers combining self-attention (for global context) and depthwise convolutions (for local structures).
- Decoder: Connectionist Temporal Classification (CTC) decoder yielding fast, greedy decoding.
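The pipeline above can be sketched in terms of sequence lengths and the final greedy CTC step. This is an illustrative approximation, not the model's actual code. It assumes 16 kHz audio and no edge padding in the featurizer, so exact frame counts may differ by a frame or two in the real implementation. The toy vocabulary and the choice of index 0 as the blank token are also assumptions made for the example.

```python
import math

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = int(0.025 * SAMPLE_RATE)  # 25 ms window -> 400 samples
HOP_SAMPLES = int(0.010 * SAMPLE_RATE)     # 10 ms stride -> 160 samples
SUBSAMPLING = 4                            # reduction in the subsampling layer


def mel_frame_count(num_samples):
    """Number of full analysis windows, assuming no padding at the edges."""
    return 1 + (num_samples - WINDOW_SAMPLES) // HOP_SAMPLES


def encoder_frame_count(mel_frames):
    """Approximate encoder length after 4x subsampling (one frame per ~40 ms)."""
    return math.ceil(mel_frames / SUBSAMPLING)


def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Greedy CTC: take the best token per frame, merge repeats, drop blanks."""
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            tokens.append(vocab[idx])
        previous = idx
    return "".join(tokens)


samples = int(7.26 * SAMPLE_RATE)
mel_frames = mel_frame_count(samples)
print(mel_frames, encoder_frame_count(mel_frames))

# Toy example: per-frame argmax IDs over a hypothetical 4-token vocabulary.
toy_vocab = ["<blank>", "n", "a", "m"]
toy_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 0, 2]
print(ctc_greedy_decode(toy_ids, toy_vocab))
```

Greedy decoding needs only one pass over the encoder output, with no beam search or language model, which is why the CTC decoder is fast.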
Licensing and Copyright Notice
The original model weights were created and published by AI4Bharat (IIT Madras).
Both the source models and these converted GGUF files are released under the permissive Creative Commons Attribution 4.0 International (CC-BY-4.0) license.
Creative Commons Attribution 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
Integrates perfectly with RenderCaption
This model was explicitly converted and optimized to be run inside RenderCaption, our custom desktop transcription software.
What is RenderCaption? RenderCaption is a fully offline, high-speed transcription application. It is built using Rust and Tauri, which keeps it lightweight, fast, and 100% private (no audio is ever sent to the cloud). Instead of writing Python code or using terminal commands, you can simply load this model into the RenderCaption desktop app and transcribe audio through a graphical user interface.
Check out the RenderCaption Desktop App on GitHub Here!
Usage Instructions
To run these models, you need the parakeet-cli C++ execution engine.
- Go to the parakeet.cpp GitHub Repository.
- Follow their build instructions to compile the parakeet-cli executable for your specific operating system (Windows/Linux/macOS).
- Once compiled, open your terminal and run the models using the following commands:
Transcribing Hindi
parakeet-cli transcribe --model indicconformer-hindi.f32.gguf --input audio.wav --decoder ctc --lang hi
Transcribing Punjabi
parakeet-cli transcribe --model indicconformer-punjabi.f32.gguf --input audio.wav --decoder ctc --lang pa
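If you want to transcribe a whole folder, the commands above can be wrapped in a small script. The sketch below builds exactly the argument list shown in this section. The helper names `build_command` and `transcribe_folder` are invented for this example, and it assumes that `parakeet-cli` is on your `PATH` and prints the transcript to standard output, which you should confirm against the parakeet.cpp documentation.

```python
import shutil
import subprocess
from pathlib import Path

# Model files as named in the commands above, keyed by language code.
MODELS = {
    "hi": "indicconformer-hindi.f32.gguf",
    "pa": "indicconformer-punjabi.f32.gguf",
}


def build_command(wav_path, lang, executable="parakeet-cli"):
    """Return the argument list for one transcription run."""
    return [
        executable, "transcribe",
        "--model", MODELS[lang],
        "--input", str(wav_path),
        "--decoder", "ctc",
        "--lang", lang,
    ]


def transcribe_folder(folder, lang):
    """Run parakeet-cli on every .wav file in a folder (hypothetical helper)."""
    if shutil.which("parakeet-cli") is None:
        raise FileNotFoundError("parakeet-cli is not on PATH")
    results = {}
    for wav in sorted(Path(folder).glob("*.wav")):
        completed = subprocess.run(
            build_command(wav, lang), capture_output=True, text=True, check=True
        )
        # Assumption: the transcript is written to standard output.
        results[wav.name] = completed.stdout.strip()
    return results


# Print the command without running it, to check the arguments first.
print(" ".join(build_command("audio.wav", "hi")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.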
Evaluation results
- Test WER on Kathbath (Hindi clean test), self-reported: 13.5%
- Test CER on Kathbath (Hindi clean test), self-reported: 5.2%
- Test WER on Kathbath (Punjabi clean test), self-reported: 15.1%
- Test CER on Kathbath (Punjabi clean test), self-reported: 6.8%
- Test WER on FLEURS (Hindi test), self-reported: 15.2%
- Test WER on FLEURS (Punjabi test), self-reported: 16.8%