Silero VAD & Whisper GGML Base (Local STT Bundle) πŸŽ™οΈ

This repository contains a curated set of high-performance models for building local Speech-to-Text (STT) systems. This bundle is specifically optimized for .NET 10 (C#) applications utilizing Whisper.net and OnnxRuntime.


πŸ“¦ Repository Contents

1. Silero VAD (Voice Activity Detection)

  • File: silero_vad.onnx
  • Format: ONNX
  • Purpose: High-precision, real-time detection of human speech within an audio stream.
  • Technical Specifications (v5):
    • Sample Rate: 16000 Hz (Strictly required).
    • Inputs:
      • input: float32 [1, 512] (Recommended window size for low-latency).
      • sr: int64 [1] (Must be 16000).
      • state: float32 [2, 1, 128] (Internal RNN state; must be preserved between chunks).
    • Outputs:
      • output: Speech probability (0.0 β€” 1.0).
      • stateN: Updated RNN state for the next iteration.

2. OpenAI Whisper (Base)

  • File: ggml-base.bin
  • Format: GGML (Compatible with whisper.cpp and Whisper.net).
  • Purpose: Multilingual transcription and translation (supports 99 languages).
  • Technical Specifications:
    • Architecture: Transformer (~74M parameters).
    • Acceleration: Optimized for Vulkan (Cross-platform GPU acceleration for Windows/Linux).

πŸ›  Technical Audio Specifications

For both models to function correctly, the input audio stream must be pre-processed:

Parameter Value
Sample Rate 16000 Hz (16 kHz)
Channels Mono (1 channel)
Format Raw 32-bit float PCM (f32le)
Normalization Range -1.0 to 1.0

πŸš€ Logic Flow (Three-Body Pipeline)

This bundle is designed for a concurrent, non-blocking pipeline using System.Threading.Channels:

  1. Stage 1: Audio Processor Uses FFmpeg to decode any input format on the fly and streams 512-sample float chunks into the first channel.
  2. Stage 2: VAD Processor Consumes chunks, detects speech using the stateful ONNX model, and segments the stream into complete sentences based on silence thresholds.
  3. Stage 3: Transcriptor Receives finalized sentences and performs Whisper inference. Streams transcribed text back to the client while subsequent audio is still being processed.

πŸ’‘ Quick Start Guide (C# Implementation)

  1. VAD State Management: When using silero_vad.onnx, initialize the state tensor with zeros at the start of a stream. For every subsequent chunk, pass the stateN output from the previous call back into the model.
  2. GPU Acceleration: For Windows and Linux, use the Whisper.net.Runtime.Vulkan NuGet package to ensure maximum performance across different GPU vendors.
  3. Concurrency: Use SemaphoreSlim to limit concurrent STT requests to prevent GPU VRAM exhaustion.

βš–οΈ License & Credits

Both models are distributed under the MIT License.

Created for high-performance universal local AI agents.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support