WeSpeaker ResNet34-LM - LiteRT

Speaker embedding for speaker identification and diarization clustering.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (a C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

256-dim speaker embedding network for Android, ported from pyannote/wespeaker-voxceleb-resnet34-LM.

Model

Property       Value
Architecture   ResNet34 + stats pooling + linear projection
Parameters     ~6.6 M
Format         LiteRT (TFLite)
Quantization   float32
Sample rate    16 000 Hz
Input          80-bin kaldi-style mel fbank features (T frames)
Output         L2-normalized 256-dim embedding

Files

File                        Size      Description
wespeaker-resnet34.tflite   25.4 MB   Full model, FP32
config.json                 1 KB      Fbank spec + I/O signature

Why fbank-as-input

pyannote's kaldi fbank implementation uses torch.hamming_window and aten._fft_r2c, neither of which has a lowering in litert-torch. We export only the ResNet34 portion; the caller computes the 80-bin fbank features on-device. This matches the standard mobile speaker-embedding pattern and keeps the tflite graph free of FFT ops.
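Because the window function is left to the caller, here is a minimal Kotlin sketch of a symmetric Hamming window. The function name hammingWindow is hypothetical, and the symmetric (non-periodic) form with coefficients 0.54 and 0.46 is an assumption about what the reference fbank uses; verify it against the fbank library you actually run on-device.

```kotlin
import kotlin.math.PI
import kotlin.math.cos

// Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2 * pi * n / (N - 1)).
// Assumption: the reference fbank uses this non-periodic form; confirm with
// your fbank library before relying on it.
fun hammingWindow(size: Int): FloatArray {
    require(size > 1) { "window size must be at least 2" }
    return FloatArray(size) { n ->
        (0.54 - 0.46 * cos(2.0 * PI * n / (size - 1))).toFloat()
    }
}
```

At 16 kHz a 25 ms frame is 400 samples, so hammingWindow(400) would be multiplied element-wise with each frame before the FFT.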

Fbank parameters

Parameter      Value
num_mel_bins   80
frame_length   25 ms
frame_shift    10 ms
window_type    hamming
dither         0.0
use_energy     false

The reference implementation is torchaudio.compliance.kaldi.fbank with those arguments. The model internally applies mean centering over the time axis, features - mean(features, dim=1), so the caller may pass raw (uncentered) fbank output.
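The 25 ms frame length and 10 ms frame shift above determine how many frames T a waveform produces. The sketch below assumes Kaldi's snip_edges = true behaviour (the torchaudio default), where only frames that fit entirely inside the signal are kept; numFbankFrames is a hypothetical helper name, not part of this bundle.

```kotlin
// Number of fbank frames for a waveform, assuming snip_edges = true:
// only frames that fit entirely inside the signal are counted.
fun numFbankFrames(
    numSamples: Int,
    sampleRate: Int = 16_000,
    frameLengthMs: Int = 25,
    frameShiftMs: Int = 10,
): Int {
    val frameLength = sampleRate * frameLengthMs / 1000 // 400 samples at 16 kHz
    val frameShift = sampleRate * frameShiftMs / 1000   // 160 samples at 16 kHz
    if (numSamples < frameLength) return 0
    return 1 + (numSamples - frameLength) / frameShift
}
```

For 3 s of 16 kHz audio (48 000 samples) this gives 1 + (48 000 - 400) / 160 = 298 frames with integer division, which agrees with the T=298 quoted in the signature.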

Signature

Inputs:
  fbank         [1, T, 80]   float32   Kaldi mel fbank, T=298 for 3 s @ 16 kHz

Outputs:
  embedding     [1, 256]     float32   L2-normalized speaker embedding
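Because the output is L2-normalized, the cosine similarity between two embeddings reduces to a plain dot product. The following is a minimal sketch of a pairwise comparison; the function names and the default threshold are hypothetical placeholders, and any real threshold has to be tuned on your own data.

```kotlin
// Cosine similarity of two embeddings, assuming both are already
// L2-normalized (as this model's output is), so it equals the dot product.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "embedding sizes differ" }
    var dot = 0f
    for (i in a.indices) dot += a[i] * b[i]
    return dot
}

// Hypothetical decision rule: the threshold is a placeholder, not a
// recommended value, and must be calibrated on your own recordings.
fun isSameSpeaker(a: FloatArray, b: FloatArray, threshold: Float = 0.5f): Boolean =
    cosineSimilarity(a, b) >= threshold
```

For diarization, the same similarity can serve as the affinity between segment embeddings before clustering.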

Parity

Verified: maximum difference of 4.2e-07 against the upstream pyannote model's full forward pass on a random 3-second waveform (with kaldi fbank features computed externally).
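A similar check can be repeated on-device by comparing an embedding from the LiteRT model against a reference embedding exported from the upstream model. The sketch below assumes "max diff" means the maximum absolute element-wise difference, which the source does not spell out; maxAbsDiff is a hypothetical helper name.

```kotlin
import kotlin.math.abs

// Largest absolute element-wise difference between two embeddings.
// Assumption: this is the metric meant by "max diff" in the parity note.
fun maxAbsDiff(reference: FloatArray, candidate: FloatArray): Float {
    require(reference.size == candidate.size) { "embedding sizes differ" }
    var worst = 0f
    for (i in reference.indices) {
        worst = maxOf(worst, abs(reference[i] - candidate[i]))
    }
    return worst
}
```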

Usage

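import org.tensorflow.lite.Interpreter

// Compute 80-bin kaldi fbank features on-device with your preferred library.
// kaldiFbank and loadModelFile are caller-supplied helpers; fbank must be a float array shaped [1][T][80].
val fbank = kaldiFbank(audio, melBins = 80, frameLengthMs = 25, frameShiftMs = 10)

val model = Interpreter(loadModelFile("wespeaker-resnet34.tflite"))
// The output tensor is [1, 256], so the buffer needs the leading batch dimension.
val output = Array(1) { FloatArray(256) }
model.run(fbank, output)
val embedding = output[0]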

Source

Upstream: pyannote/wespeaker-voxceleb-resnet34-LM (CC BY 4.0, gated; accept the license on the upstream page).

Links

Ecosystem

  • soniqo.audio: use-case explorer (transcription, voice cloning, live ASR, voice agents).
  • speech-core: C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
  • speech-swift: Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
  • speech-android: Android SDK consuming on-device LiteRT bundles.

License

This bundle inherits the upstream model license (cc-by-4.0). See the linked base_model repository for the full terms.
