Silero VAD & Whisper GGML Base (Local STT Bundle) 🎙️

This repository bundles two models for building local Speech-to-Text (STT) systems: Silero VAD for speech detection and Whisper Base for transcription. The bundle targets .NET 10 (C#) applications built on Whisper.net and ONNX Runtime.

📦 Repository Contents

1. Silero VAD (Voice Activity Detection)

File: silero_vad.onnx
Format: ONNX
Purpose: High-precision, real-time detection of human speech within an audio stream.
Technical Specifications (v5):
- Sample Rate: 16000 Hz (Strictly required).
- Inputs:
  - input: float32 [1, 512] (512 samples = 32 ms per chunk at 16 kHz; recommended window size for low latency).
  - sr: int64 [1] (Must be 16000).
  - state: float32 [2, 1, 128] (Internal RNN state; must be preserved between chunks).
- Outputs:
  - output: Speech probability (0.0 to 1.0).
  - stateN: Updated RNN state for the next iteration.
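
A minimal sketch of one stateful inference step, assuming the Microsoft.ML.OnnxRuntime NuGet package and the tensor names and shapes listed above. The helper name `RunVad` and the silent test chunk are illustrative and not part of the bundle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Assumes silero_vad.onnx sits in the working directory.
using var session = new InferenceSession("silero_vad.onnx");

// Zero-initialized RNN state at the start of a stream: float32 [2, 1, 128].
var state = new DenseTensor<float>(new[] { 2, 1, 128 });

// Hypothetical input: one 512-sample chunk of silence.
var chunk = new float[512];
float probability = RunVad(session, chunk, ref state);
Console.WriteLine($"Speech probability: {probability:F3}");

// Runs one chunk through the model and replaces `state` with the stateN output.
static float RunVad(InferenceSession session, float[] chunk, ref DenseTensor<float> state)
{
    var inputs = new List<NamedOnnxValue>
    {
        NamedOnnxValue.CreateFromTensor("input", new DenseTensor<float>(chunk, new[] { 1, 512 })),
        NamedOnnxValue.CreateFromTensor("sr", new DenseTensor<long>(new long[] { 16000 }, new[] { 1 })),
        NamedOnnxValue.CreateFromTensor("state", state),
    };

    using var results = session.Run(inputs);

    float speech = results.First(r => r.Name == "output").AsEnumerable<float>().First();

    // Copy the new state into managed memory before `results` is disposed.
    float[] next = results.First(r => r.Name == "stateN").AsEnumerable<float>().ToArray();
    state = new DenseTensor<float>(next, new[] { 2, 1, 128 });
    return speech;
}
```

Calling `RunVad` once per chunk with the same `state` variable carries the RNN state across the stream; allocating a fresh zero tensor starts a new stream.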

2. OpenAI Whisper (Base)

File: ggml-base.bin
Format: GGML (Compatible with whisper.cpp and Whisper.net).
Purpose: Multilingual transcription, plus translation into English (supports 99 languages).
Technical Specifications:
- Architecture: Transformer (~74M parameters).
- Acceleration: Intended to run on the Vulkan backend (cross-platform GPU acceleration on Windows and Linux).
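
As a rough illustration of loading this file, the sketch below follows the basic usage pattern from the Whisper.net README. The input file `sample.wav` is a hypothetical 16 kHz mono WAV, and the backend that actually runs depends on which Whisper.net runtime package the project references.

```csharp
using System;
using System.IO;
using Whisper.net;

// Load the GGML model from this bundle.
using var factory = WhisperFactory.FromPath("ggml-base.bin");

// "auto" asks Whisper to detect the spoken language.
using var processor = factory.CreateBuilder()
    .WithLanguage("auto")
    .Build();

// Hypothetical input file: 16 kHz mono WAV.
using var audio = File.OpenRead("sample.wav");

// Segments are yielded as they are decoded.
await foreach (var segment in processor.ProcessAsync(audio))
{
    Console.WriteLine($"{segment.Start} -> {segment.End}: {segment.Text}");
}
```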

🛠 Technical Audio Specifications

For both models to function correctly, the input audio stream must be converted to the following format before inference:

| Parameter | Value |
| --- | --- |
| Sample Rate | 16000 Hz (16 kHz) |
| Channels | Mono (1 channel) |
| Format | Raw 32-bit float PCM (`f32le`) |
| Normalization | Range -1.0 to 1.0 |
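
When the source audio arrives as interleaved 16-bit PCM rather than through FFmpeg, the channel and normalization requirements can be met by hand. The sketch below is one way to do it; the helper name `ToMonoFloat` is hypothetical, and resampling to 16 kHz is not covered.

```csharp
using System;

// Converts interleaved 16-bit PCM into mono float samples in [-1.0, 1.0).
static float[] ToMonoFloat(short[] interleaved, int channels)
{
    int frames = interleaved.Length / channels;
    var mono = new float[frames];
    for (int i = 0; i < frames; i++)
    {
        float sum = 0f;
        for (int c = 0; c < channels; c++)
            sum += interleaved[i * channels + c] / 32768f; // scale int16 to [-1, 1)
        mono[i] = sum / channels;                          // average the channels
    }
    return mono;
}

// Hypothetical stereo input with two frames; the results are 0.5 and -0.5.
short[] stereo = { 16384, 16384, -32768, 0 };
float[] samples = ToMonoFloat(stereo, channels: 2);
Console.WriteLine(string.Join(" ", samples));
```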

🚀 Logic Flow (Three-Body Pipeline)

This bundle is designed for a concurrent, non-blocking pipeline using System.Threading.Channels:

Stage 1: Audio Processor. Uses FFmpeg to decode any input format on the fly and streams 512-sample float chunks into the first channel.
Stage 2: VAD Processor. Consumes chunks, detects speech using the stateful ONNX model, and segments the stream into complete sentences based on silence thresholds.
Stage 3: Transcriptor. Receives finalized sentences and performs Whisper inference, streaming transcribed text back to the client while subsequent audio is still being processed.
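
The three stages can be wired together roughly as follows. This is a sketch of the channel topology only: every stage is a stand-in (an in-memory buffer instead of FFmpeg, a mean-amplitude check instead of Silero VAD, a length report instead of Whisper), and the constants and threshold values are assumptions chosen for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

const int ChunkSize = 512;        // samples per chunk, as in Stage 1
const int SilenceChunksToCut = 3; // hypothetical silence threshold (3 chunks = 96 ms)

var chunks = Channel.CreateBounded<float[]>(64);
var sentences = Channel.CreateBounded<float[]>(8);

// Hypothetical input: 4 silent chunks, 10 loud chunks, 4 silent chunks.
var audio = new float[18 * ChunkSize];
Array.Fill(audio, 0.5f, 4 * ChunkSize, 10 * ChunkSize);

var transcripts = new List<string>();
await Task.WhenAll(ProduceAsync(), SegmentAsync(), TranscribeAsync());
Console.WriteLine(string.Join(Environment.NewLine, transcripts));

// Stage 1 stand-in: slices a buffer instead of decoding with FFmpeg.
async Task ProduceAsync()
{
    for (int i = 0; i + ChunkSize <= audio.Length; i += ChunkSize)
        await chunks.Writer.WriteAsync(audio[i..(i + ChunkSize)]);
    chunks.Writer.Complete();
}

// Stage 2 stand-in: mean absolute amplitude replaces the VAD probability.
async Task SegmentAsync()
{
    var current = new List<float>();
    int silentRun = 0;
    await foreach (var chunk in chunks.Reader.ReadAllAsync())
    {
        float level = 0f;
        foreach (var s in chunk) level += Math.Abs(s);
        bool isSpeech = level / chunk.Length > 0.1f;

        if (isSpeech)
        {
            current.AddRange(chunk);
            silentRun = 0;
        }
        else if (current.Count > 0 && ++silentRun >= SilenceChunksToCut)
        {
            // Enough trailing silence: emit the finished segment.
            await sentences.Writer.WriteAsync(current.ToArray());
            current.Clear();
            silentRun = 0;
        }
    }
    if (current.Count > 0) await sentences.Writer.WriteAsync(current.ToArray());
    sentences.Writer.Complete();
}

// Stage 3 stand-in: reports the segment length instead of running Whisper.
async Task TranscribeAsync()
{
    await foreach (var sentence in sentences.Reader.ReadAllAsync())
        transcripts.Add($"segment of {sentence.Length} samples");
}
```

Bounded channels give backpressure: if transcription falls behind, the segmenter waits on `WriteAsync` instead of buffering audio without limit.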

💡 Quick Start Guide (C# Implementation)

VAD State Management: When using silero_vad.onnx, initialize the state tensor with zeros at the start of a stream. For every subsequent chunk, pass the stateN output from the previous call back into the model.
GPU Acceleration: On Windows and Linux, add the Whisper.net.Runtime.Vulkan NuGet package so that inference can run on GPUs from different vendors.
Concurrency: Use SemaphoreSlim to limit concurrent STT requests to prevent GPU VRAM exhaustion.
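
The concurrency limit can be sketched with the BCL alone. The slot count of 2, the simulated 50 ms workload, and the `peak` counter are assumptions for the demo; in a real service the delay would be the Whisper call.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical limit: at most 2 transcriptions on the GPU at once.
var gpuSlots = new SemaphoreSlim(2, 2);
var gate = new object();
int running = 0, peak = 0;

// Start 8 simulated requests; each one waits for a free slot first.
await Task.WhenAll(Enumerable.Range(0, 8).Select(id => TranscribeAsync(id)));
Console.WriteLine($"Peak concurrency: {peak}"); // never exceeds the slot count

async Task TranscribeAsync(int requestId)
{
    await gpuSlots.WaitAsync();
    try
    {
        lock (gate)
        {
            running++;
            peak = Math.Max(peak, running); // track the busiest moment
        }

        await Task.Delay(50); // stand-in for Whisper inference on request `requestId`
    }
    finally
    {
        lock (gate) { running--; }
        gpuSlots.Release();
    }
}
```

Releasing the slot in `finally` keeps a failed transcription from permanently reducing capacity.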

⚖️ License & Credits

Both models are distributed under the MIT License.

Silero VAD: by snakers4.
Whisper: by OpenAI.
GGML Conversion: enabled via whisper.cpp.

Created for high-performance universal local AI agents.
