IndicConformer GGUF Models (Hindi & Punjabi)

This repository contains highly optimized GGUF versions of AI4Bharat's flagship IndicConformer models for Hindi and Punjabi Automatic Speech Recognition (ASR).

The models are converted directly from the official NeMo PyTorch checkpoints and are designed to run locally with zero Python dependencies, using a lightweight C++ runtime (exposed through a C API) and the ggml execution provider.

These GGUF models are fully metadata-driven: all configuration parameters, vocabulary strings, and featurizer window buffers are embedded directly in the GGUF binary, so each file is standalone.

Models Included

  1. indicconformer-hindi.gguf: Optimized for Hindi speech recognition.
  2. indicconformer-punjabi.gguf: Optimized for Punjabi speech recognition.

Evaluation & Benchmarks

1. Accuracy Benchmarks (WER & CER)

Results on standard test sets, reported as Word Error Rate (WER) and Character Error Rate (CER). Lower numbers mean better transcription accuracy.

| Model / Dataset | Hindi (Kathbath Test) | Punjabi (Kathbath Test) | Hindi (FLEURS Test) | Punjabi (FLEURS Test) |
|---|---|---|---|---|
| IndicConformer GGUF (WER) | 13.5% | 15.1% | 15.2% | 16.8% |
| IndicConformer GGUF (CER) | 5.2% | 6.8% | 5.9% | 7.4% |
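
For readers who want to reproduce this kind of score on their own recordings, here is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. It is not the evaluation script used for the numbers above. The function names are illustrative, and the choices to split words on whitespace and to count spaces as characters are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Unicode code points (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "मैं घर जा रहा हूँ"
    hyp = "मैं घर जा रही हूँ"
    print(f"WER: {wer(ref, hyp):.2f}")
    print(f"CER: {cer(ref, hyp):.2f}")
```

Python strings iterate over code points, so a Devanagari or Gurmukhi syllable built from several code points counts as several characters here. Published CER figures may normalize text differently.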

2. Local Performance & Speed Benchmarks

Benchmarks were performed locally on a Windows workstation running an NVIDIA GeForce RTX 5070 Ti GPU and an Intel/AMD CPU.

  • Test Audio Duration: 7.26 seconds (mono, 16 kHz).
  • Metric: Real-Time Factor (RTF). Lower is faster.

| Configuration | Average Time (s) | Real-Time Factor (RTF) | VRAM / RAM footprint |
|---|---|---|---|
| Hindi (GPU / CUDA) | 0.602 | 0.083 (12x speed) | ~520 MB |
| Hindi (CPU) | 0.436 | 0.060 (16x speed) | ~140 MB |
| Punjabi (GPU / CUDA) | 0.654 | 0.090 (11x speed) | ~520 MB |
| Punjabi (CPU) | 0.415 | 0.057 (17x speed) | ~140 MB |
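
The RTF column is processing time divided by audio duration, and the speed figure is its reciprocal. A small sketch that recomputes both from the timings reported above (the function names are illustrative):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time the transcription ran."""
    return 1.0 / rtf


AUDIO_SECONDS = 7.26  # test clip duration from the benchmark

# Average processing times taken from the benchmark table.
TIMINGS = {
    "Hindi (GPU / CUDA)": 0.602,
    "Hindi (CPU)": 0.436,
    "Punjabi (GPU / CUDA)": 0.654,
    "Punjabi (CPU)": 0.415,
}

if __name__ == "__main__":
    for name, seconds in TIMINGS.items():
        rtf = real_time_factor(seconds, AUDIO_SECONDS)
        print(f"{name}: RTF {rtf:.3f}, {speedup(rtf):.1f}x real time")
```

An RTF below 1.0 means the clip is transcribed faster than it plays back.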

For short clips (under 10 seconds), the CPU is slightly faster because it avoids the GPU's initial CUDA kernel setup and host-to-device memory-transfer latency. On longer files (e.g., 5+ minutes), that fixed overhead is amortized and GPU execution is substantially faster.


3. Comparison with OpenAI Whisper & Gemma 4 Audio

When deploying ASR systems locally, you must balance model size, system resources, accuracy, and execution latency.

| Feature / Model | IndicConformer GGUF (Ours) | OpenAI Whisper Large V3 | Gemma 4 Audio (12B) |
|---|---|---|---|
| Parameter Count | ~120M | ~1.5B | ~12B |
| RAM/VRAM Footprint | ~140 MB | ~3.1 GB | ~8.5 GB+ |
| Dependencies | None (self-contained C++) | Python, PyTorch, Transformers | LLM server (Ollama / HuggingFace) |
| Inference Mode | Real-time streamable / batch | Batch-only (seq2seq) | Batch-only (seq2seq) |
| Hindi Accuracy | 13.5% WER (Kathbath), 15.2% WER (FLEURS) | ~11-12% WER (often struggles with local dialects) | High semantic accuracy, but prone to LLM paraphrasing |

Key Advantage: In the CPU benchmarks above, our GGUF models run at over 15x real-time speed while using about 140 MB of RAM, which is less than 5% of the memory footprint of Whisper Large V3 (~3.1 GB) or Gemma 4 Audio (~8.5 GB+). This makes them well suited to low-power edge devices and CPU-only systems.


Technical Details

Model Architecture

The model utilizes a Hybrid Conformer-CTC architecture:

  • Audio Preprocessing: 80-channel log-mel filterbanks extracted on 25ms windows with a 10ms stride.
  • Subsampling Layer: 2D Depthwise Separable Convolutions reducing sequence length by 4x.
  • Encoder: Conformer layers combining self-attention (for global context) and depthwise convolutions (for local structures).
  • Decoder: Connectionist Temporal Classification (CTC) decoder yielding fast, greedy decoding.
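
To illustrate the decoder step, here is a minimal sketch of greedy CTC decoding: take the highest-scoring token in each frame, collapse consecutive repeats, then drop the blank token. This is the general technique, not the code used by the C++ engine. The toy vocabulary and the choice of the last index as the blank ID are assumptions for the example.

```python
def argmax_per_frame(logits):
    """Pick the highest-scoring token ID in each frame of a [frames][vocab] list."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]


def ctc_greedy_decode(frame_token_ids, blank_id, vocab):
    """Collapse repeated IDs, drop blanks, and join the remaining tokens."""
    pieces = []
    previous = None
    for token_id in frame_token_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            pieces.append(vocab[token_id])
        previous = token_id
    return "".join(pieces)


if __name__ == "__main__":
    vocab = ["a", "b", "c"]  # toy vocabulary (assumption)
    blank_id = len(vocab)    # blank placed after the vocabulary (assumption)
    frames = [0, 0, 3, 0, 1, 1, 3, 2]
    print(ctc_greedy_decode(frames, blank_id, vocab))
```

The blank between the two runs of `a` is what lets CTC emit a repeated character. With the 10ms stride and 4x subsampling described above, consecutive encoder frames are 40ms apart.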

Licensing and Copyright Notice

The original model weights were created and published by AI4Bharat (IIT Madras).

Both the source models and these converted GGUF files are released under the permissive Creative Commons Attribution 4.0 International (CC-BY-4.0) license.

Creative Commons Attribution 4.0 International Public License

By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.

💻 Integrates perfectly with RenderCaption

This model was explicitly converted and optimized to run inside RenderCaption, our custom desktop transcription software.

What is RenderCaption? RenderCaption is a fully offline, high-speed transcription application. It is built with Rust and Tauri, which keeps it lightweight and fast, and it is 100% private: no audio is ever sent to the cloud. Instead of writing Python code or using terminal commands, you can load this model into the RenderCaption desktop app and transcribe audio through a graphical user interface.

Check out the RenderCaption Desktop App on GitHub Here!


Usage Instructions

To run these models, you need the parakeet-cli C++ execution engine.

  1. Go to the parakeet.cpp GitHub Repository.
  2. Follow their build instructions to compile the parakeet-cli executable for your specific operating system (Windows/Linux/macOS).
  3. Once compiled, open a terminal and run the models with the following commands, passing the path of the GGUF file you downloaded to --model:

Transcribing Hindi

parakeet-cli transcribe --model indicconformer-hindi.f32.gguf --input audio.wav --decoder ctc --lang hi

Transcribing Punjabi

parakeet-cli transcribe --model indicconformer-punjabi.f32.gguf --input audio.wav --decoder ctc --lang pa
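
If you need to transcribe many files and do not mind a small script, the commands above can be wrapped as sketched below. This is optional and not part of the project. It uses only the flags shown in the commands above, and it assumes that `parakeet-cli` is on your PATH and writes the transcript to standard output. The helper names are illustrative.

```python
import subprocess
from pathlib import Path


def build_command(model_path, wav_path, lang, cli="parakeet-cli"):
    """Assemble the same argument list as the commands above."""
    return [
        cli, "transcribe",
        "--model", str(model_path),
        "--input", str(wav_path),
        "--decoder", "ctc",
        "--lang", lang,
    ]


def transcribe_folder(model_path, folder, lang):
    """Run the CLI once per .wav file and collect whatever it prints."""
    results = {}
    for wav in sorted(Path(folder).glob("*.wav")):
        # Assumption: the transcript is written to standard output.
        done = subprocess.run(
            build_command(model_path, wav, lang),
            capture_output=True, text=True, check=True,
        )
        results[wav.name] = done.stdout.strip()
    return results
```

For example, `transcribe_folder("indicconformer-punjabi.f32.gguf", "recordings", "pa")` would return a dictionary that maps each file name in a hypothetical `recordings` folder to the text the CLI printed for it.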