Silero VAD & Whisper GGML Base (Local STT Bundle) ποΈ
This repository contains a curated set of high-performance models for building local Speech-to-Text (STT) systems. This bundle is specifically optimized for .NET 10 (C#) applications utilizing Whisper.net and OnnxRuntime.
π¦ Repository Contents
1. Silero VAD (Voice Activity Detection)
- File:
silero_vad.onnx - Format: ONNX
- Purpose: High-precision, real-time detection of human speech within an audio stream.
- Technical Specifications (v5):
- Sample Rate: 16000 Hz (Strictly required).
- Inputs:
input:float32 [1, 512](Recommended window size for low-latency).sr:int64 [1](Must be16000).state:float32 [2, 1, 128](Internal RNN state; must be preserved between chunks).
- Outputs:
output: Speech probability (0.0 β 1.0).stateN: Updated RNN state for the next iteration.
2. OpenAI Whisper (Base)
- File:
ggml-base.bin - Format: GGML (Compatible with
whisper.cppandWhisper.net). - Purpose: Multilingual transcription and translation (supports 99 languages).
- Technical Specifications:
- Architecture: Transformer (~74M parameters).
- Acceleration: Optimized for Vulkan (Cross-platform GPU acceleration for Windows/Linux).
π Technical Audio Specifications
For both models to function correctly, the input audio stream must be pre-processed:
| Parameter | Value |
|---|---|
| Sample Rate | 16000 Hz (16 kHz) |
| Channels | Mono (1 channel) |
| Format | Raw 32-bit float PCM (f32le) |
| Normalization | Range -1.0 to 1.0 |
π Logic Flow (Three-Body Pipeline)
This bundle is designed for a concurrent, non-blocking pipeline using System.Threading.Channels:
- Stage 1: Audio Processor Uses FFmpeg to decode any input format on the fly and streams 512-sample float chunks into the first channel.
- Stage 2: VAD Processor Consumes chunks, detects speech using the stateful ONNX model, and segments the stream into complete sentences based on silence thresholds.
- Stage 3: Transcriptor Receives finalized sentences and performs Whisper inference. Streams transcribed text back to the client while subsequent audio is still being processed.
π‘ Quick Start Guide (C# Implementation)
- VAD State Management: When using
silero_vad.onnx, initialize thestatetensor with zeros at the start of a stream. For every subsequent chunk, pass thestateNoutput from the previous call back into the model. - GPU Acceleration: For Windows and Linux, use the
Whisper.net.Runtime.VulkanNuGet package to ensure maximum performance across different GPU vendors. - Concurrency: Use
SemaphoreSlimto limit concurrent STT requests to prevent GPU VRAM exhaustion.
βοΈ License & Credits
Both models are distributed under the MIT License.
- Silero VAD: by snakers4.
- Whisper: by OpenAI.
- GGML Conversion: enabled via whisper.cpp.
Created for high-performance universal local AI agents.
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support