IndicConformer GGUF Models (Hindi & Punjabi)

This repository contains GGUF conversions of AI4Bharat's flagship IndicConformer models for Hindi and Punjabi Automatic Speech Recognition (ASR).

The models are converted directly from the official NeMo PyTorch checkpoints and are designed to run locally with no Python dependencies, using a lightweight C++ engine (exposed through a C API) with ggml as the execution backend.

These GGUF models are fully metadata-driven. All configuration parameters, vocabulary strings, and featurizer window buffers are embedded directly in the GGUF binary, so each file is standalone.
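
As a quick sanity check that a downloaded file is a GGUF container carrying embedded metadata, the fixed-size header can be read directly. The sketch below assumes the standard little-endian GGUF v2/v3 header layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). It does not decode the individual keys, because the key names written by this conversion are not documented here.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={data[:4]!r}")
    # Little-endian: uint32 version, uint64 tensor_count, uint64 kv_count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    import sys

    # Usage: python gguf_header.py path/to/model.gguf
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as f:
            print(read_gguf_header(f.read(HEADER_SIZE)))
```

A non-zero metadata count is what you would expect from a file that embeds its configuration and vocabulary.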

Models Included

indicconformer-hindi.f32.gguf: Hindi speech recognition.
indicconformer-punjabi.f32.gguf: Punjabi speech recognition.

Evaluation & Benchmarks

1. Accuracy Benchmarks (WER & CER)

Evaluation results on standard test sets (Word Error Rate - WER & Character Error Rate - CER). Lower numbers represent better transcription accuracy.

Model / Dataset	Hindi (Kathbath Test)	Punjabi (Kathbath Test)	Hindi (FLEURS Test)	Punjabi (FLEURS Test)
IndicConformer GGUF (WER)	13.5%	15.1%	15.2%	16.8%
IndicConformer GGUF (CER)	5.2%	6.8%	5.9%	7.4%
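
For readers who want to score their own transcripts, WER and CER are both edit-distance ratios. The following is a minimal sketch, not the evaluation script used for the numbers above. It assumes whitespace tokenization for words, counts spaces as characters for CER, and compares Unicode code points rather than grapheme clusters, so Devanagari and Gurmukhi vowel signs count as separate characters. No text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference has no words")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: code-point edits divided by reference length."""
    if not reference:
        raise ValueError("reference is empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "मेरा नाम राम है"
    hyp = "मेरा नाम श्याम है"
    print(f"WER: {wer(ref, hyp):.2%}")
    print(f"CER: {cer(ref, hyp):.2%}")
```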

2. Local Performance & Speed Benchmarks

Benchmarks were run locally on a Windows workstation with an NVIDIA GeForce RTX 5070 Ti GPU and an Intel/AMD CPU.

Test Audio Duration: 7.26 seconds (mono, 16,000 Hz).
Metric: Real-Time Factor (RTF), i.e. processing time divided by audio duration. Lower is faster.

Configuration	Average Time (s)	Real-Time Factor (RTF)	Speed vs. real time	VRAM / RAM footprint
Hindi (GPU / CUDA)	0.602	0.083	12x	~520 MB
Hindi (CPU)	0.436	0.060	16x	~140 MB
Punjabi (GPU / CUDA)	0.654	0.090	11x	~520 MB
Punjabi (CPU)	0.415	0.057	17x	~140 MB
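
The RTF column can be reproduced from the average times and the 7.26-second clip length. This short sketch uses only the numbers in the table; the function and variable names are illustrative.

```python
AUDIO_SECONDS = 7.26  # duration of the test clip

# Average processing time in seconds, copied from the table.
AVERAGE_TIMES = {
    "Hindi (GPU / CUDA)": 0.602,
    "Hindi (CPU)": 0.436,
    "Punjabi (GPU / CUDA)": 0.654,
    "Punjabi (CPU)": 0.415,
}


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (values below 1 beat real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


for name, seconds in AVERAGE_TIMES.items():
    rtf = real_time_factor(seconds, AUDIO_SECONDS)
    print(f"{name}: RTF {rtf:.3f}, about {1 / rtf:.1f}x real time")
```

The speed multiple is simply the reciprocal of the RTF.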

For short clips (under 10 seconds), the CPU is slightly faster because it avoids the GPU's fixed startup cost: the initial CUDA kernel compilation and memory-transfer latency. On longer files (e.g., 5+ minutes), that fixed cost is amortized and GPU execution provides a substantial speedup.

3. Comparison with OpenAI Whisper & Gemma 4 Audio

When deploying ASR systems locally, you must balance model size, system resources, accuracy, and execution latency.

Feature / Model	IndicConformer GGUF (Ours)	OpenAI Whisper Large V3	Gemma 4 Audio (12B)
Parameter Count	~120M	~1.5B	~12B
RAM/VRAM Footprint	~140 MB (CPU) / ~520 MB (GPU)	~3.1 GB	~8.5 GB+
Dependencies	None (Self-contained C++)	Python, PyTorch, Transformers	LLM Server (Ollama / HuggingFace)
Inference Mode	Real-time Streamable / Batch	Batch-only (seq2seq)	Batch-only (seq2seq)
Hindi Accuracy	13.5% WER (Kathbath), 15.2% WER (FLEURS)	~11-12% WER (often struggles with local dialects)	High semantic accuracy, but prone to LLM paraphrasing

Key Advantage: On CPU, our GGUF models run at over 15x real-time speed on the short test clip (about 11-12x on GPU) while using less than 5% of the memory required by Whisper Large V3 or Gemma 4 Audio (~140 MB versus ~3.1 GB and ~8.5 GB+). This makes them well suited to low-power edge devices and CPU-only systems.

Technical Details

Model Architecture

The model uses a Conformer encoder with a CTC decoder (the base checkpoint is a hybrid CTC/RNN-T model; the commands in this README use the CTC decoder):

Audio Preprocessing: 80-channel log-mel filterbanks extracted on 25 ms windows with a 10 ms stride.
Subsampling Layer: 2D depthwise separable convolutions that reduce the sequence length by 4x.
Encoder: Conformer layers combining self-attention (for global context) and depthwise convolutions (for local structure).
Decoder: Connectionist Temporal Classification (CTC) decoder, which allows fast greedy decoding.
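
The pipeline above can be illustrated with a little arithmetic and a toy decoder. The sketch below assumes 16 kHz input (as in the benchmark clip), no padding in the featurizer, and ceiling rounding in the subsampling layer; the real engine may pad and round differently. The vocabulary is a made-up three-symbol example, not the tokenizer embedded in the GGUF files.

```python
from itertools import groupby

SAMPLE_RATE = 16_000            # Hz, assumed input rate
WINDOW = SAMPLE_RATE * 25 // 1000  # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000     # 10 ms stride -> 160 samples


def mel_frame_count(num_samples: int) -> int:
    """Number of log-mel frames, assuming no padding at the edges."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // HOP


def encoder_frame_count(mel_frames: int, factor: int = 4) -> int:
    """Frames after 4x subsampling (ceiling rounding is an assumption)."""
    return -(-mel_frames // factor)


def ctc_greedy_decode(frame_ids, blank_id, vocab):
    """Greedy CTC: merge consecutive repeats, then drop blank tokens."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(vocab[t] for t in collapsed if t != blank_id)


if __name__ == "__main__":
    samples = round(7.26 * SAMPLE_RATE)
    mel = mel_frame_count(samples)
    print(samples, mel, encoder_frame_count(mel))

    toy_vocab = ["<blank>", "a", "b"]  # hypothetical vocabulary
    best_ids = [1, 1, 0, 1, 2, 2, 0]   # per-frame argmax of the CTC logits
    print(ctc_greedy_decode(best_ids, blank_id=0, vocab=toy_vocab))
```

Greedy decoding needs only one argmax per encoder frame and a single pass to collapse the result, which is why a CTC decoder is cheap compared with autoregressive decoding.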

Licensing and Copyright Notice

The original model weights were created and published by AI4Bharat (IIT Madras).

Both the source models and these converted GGUF files are released under the permissive Creative Commons Attribution 4.0 International (CC-BY-4.0) license.

Creative Commons Attribution 4.0 International Public License

By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.

💻 Integrates perfectly with RenderCaption

These models were converted to run inside RenderCaption, our custom desktop transcription software.

What is RenderCaption? RenderCaption is a fully offline, high-speed transcription application built with Rust and Tauri, which keeps it lightweight and fast. All processing happens on your machine, so no audio is ever sent to the cloud. Instead of writing Python code or using terminal commands, you can load a model into the RenderCaption desktop app and transcribe audio through a graphical interface.

The RenderCaption desktop app is available on GitHub.

Usage Instructions

To run these models, you need the parakeet-cli C++ execution engine.

1. Go to the parakeet.cpp GitHub repository.
2. Follow its build instructions to compile the parakeet-cli executable for your operating system (Windows, Linux, or macOS).
3. Once compiled, open a terminal and run the models with the following commands:

Transcribing Hindi

parakeet-cli transcribe --model indicconformer-hindi.f32.gguf --input audio.wav --decoder ctc --lang hi

Transcribing Punjabi

parakeet-cli transcribe --model indicconformer-punjabi.f32.gguf --input audio.wav --decoder ctc --lang pa
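
The engine itself needs no Python, but if you want to script many files, the two commands above can be wrapped in a small helper. This is a sketch under assumptions: it reuses only the flags and file names shown above, expects parakeet-cli to be on your PATH, and assumes the transcript is written to standard output, which is not confirmed here. The function names are illustrative.

```python
import subprocess

# Language code -> model file, as used in the commands above.
MODELS = {
    "hi": "indicconformer-hindi.f32.gguf",
    "pa": "indicconformer-punjabi.f32.gguf",
}


def build_command(audio_path: str, lang: str, cli: str = "parakeet-cli") -> list:
    """Build the argument list for one transcription run."""
    if lang not in MODELS:
        raise ValueError(f"unsupported language {lang!r}; expected one of {sorted(MODELS)}")
    return [
        cli, "transcribe",
        "--model", MODELS[lang],
        "--input", audio_path,
        "--decoder", "ctc",
        "--lang", lang,
    ]


def transcribe(audio_path: str, lang: str) -> str:
    """Run the CLI and return whatever it prints (assumed to be the transcript)."""
    result = subprocess.run(
        build_command(audio_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(" ".join(build_command("audio.wav", "hi")))
```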


Format: GGUF
Model size: 0.1B params
Architecture: parakeet
Hardware compatibility: 32-bit

Model tree for Singla0009/IndicConformer-GGUF

Base model: ai4bharat/indicconformer_stt_hi_hybrid_ctc_rnnt_large
Quantized (1): this model

Evaluation results (self-reported)

Test WER on Kathbath (Hindi clean test): 13.5
Test CER on Kathbath (Hindi clean test): 5.2
Test WER on Kathbath (Punjabi clean test): 15.1
Test CER on Kathbath (Punjabi clean test): 6.8
Test WER on FLEURS (Hindi test): 15.2
Test WER on FLEURS (Punjabi test): 16.8