PolyVOX Sandglasset 3-Speaker Separator
This repository contains the fine-tuned 3-speaker Sandglasset speech separation checkpoint used in the PolyVOX Samsung EnnovateX hackathon pipeline for real-time multi-user smart assistant command understanding in noisy smart-home environments.
Model Summary
- Task: 3-speaker single-channel speech separation.
- Sample rate: 16 kHz mono.
- Checkpoint: `best_model.pt`.
- Parameters: 4,961,794, under the 5M parameter hackathon budget.
- Architecture: compact Sandglasset-style encoder, bottleneck, gated TCN blocks, skip fusion, mask head, and transposed-convolution decoder.
- Intended pipeline role: separate higher-overlap multi-user speech before ASR, command attribution, and SmartThings action routing.
Architecture
The model uses the project Sandglasset student architecture:
- Encoder: `Conv1d(1, 256, kernel_size=16, stride=8)` followed by ReLU.
- Bottleneck: `GroupNorm(1, 256)` and `Conv1d(256, 128, 1)`.
- Separator: 24 gated Sandglass/TCN blocks with hidden size 384, kernel size 5, dilation cycle `1, 2, 4, 8, 16`, and selected average-pooling downsampling.
- Mask head: `Conv1d(128, 3 * 256, 1)` with ReLU masks.
- Decoder: `ConvTranspose1d(256, 1, kernel_size=16, stride=8)`.
The exact PyTorch implementation is included in modeling_sandglasset.py.
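As a rough check on the shapes implied by the encoder and decoder settings above, the frame count after the encoder and the waveform length after the decoder follow from the standard convolution length formulas. The sketch below uses only the stated kernel size and stride. The helper names are illustrative and are not taken from `modeling_sandglasset.py`, and the arithmetic assumes no padding, which the real implementation may handle differently.

```python
# Length arithmetic for the encoder/decoder pair described above.
# Assumes no padding and dilation 1; helper names are hypothetical.

KERNEL_SIZE = 16
STRIDE = 8


def encoder_frames(num_samples: int) -> int:
    """Output length of Conv1d(kernel_size=16, stride=8) with no padding."""
    if num_samples < KERNEL_SIZE:
        raise ValueError("input is shorter than one encoder window")
    return (num_samples - KERNEL_SIZE) // STRIDE + 1


def decoder_samples(num_frames: int) -> int:
    """Output length of ConvTranspose1d(kernel_size=16, stride=8), no padding."""
    return (num_frames - 1) * STRIDE + KERNEL_SIZE


one_second = 16000  # 1 s at 16 kHz
frames = encoder_frames(one_second)
print(frames)                   # 1999
print(decoder_samples(frames))  # 16000
```

At 16 kHz, a 16-sample window spans 1 ms and the 8-sample stride advances frames every 0.5 ms. Under the no-padding assumption, inputs whose length is not of the form 8k + 16 would lose up to 7 trailing samples after the round trip.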
Training Data and Fine-Tuning
The checkpoint was trained with online dynamic mixing from LibriSpeech source utterances and MUSAN noise. The fine-tuning run focused on hard overlap and noise conditions for the real-time assistant setting.
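To make "online dynamic mixing" concrete, the sketch below builds one noisy 3-speaker mixture from in-memory sample lists: each source is placed at a random offset, the sources are summed, and noise is scaled to a requested SNR. It is a generic illustration only. The names `place` and `mix_at_snr`, the SNR-based noise scaling, and the random stand-in signals are assumptions; the project's actual data loader, overlap sampling, and SNR ranges are not described here and may differ.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sample list."""
    return sum(v * v for v in x) / len(x)


def place(source, offset, total_len):
    """Zero-pad a source so it starts at `offset` inside a buffer of total_len."""
    out = [0.0] * total_len
    for i, v in enumerate(source):
        if offset + i < total_len:
            out[offset + i] = v
    return out


def mix_at_snr(sources, noise, snr_db):
    """Sum aligned sources and add noise scaled to the requested SNR (dB)."""
    speech = [sum(vals) for vals in zip(*sources)]
    # Choose gain g so that 10*log10(P_speech / (g^2 * P_noise)) == snr_db.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, speech, scaled_noise


rng = random.Random(0)
total_len = 4000
# Random stand-ins for three LibriSpeech utterances and one MUSAN noise clip.
utterances = [[rng.uniform(-1, 1) for _ in range(2000)] for _ in range(3)]
noise_clip = [rng.uniform(-1, 1) for _ in range(total_len)]

# A fresh random offset per source gives a different overlap on every draw.
targets = [place(u, rng.randrange(0, total_len - 2000 + 1), total_len) for u in utterances]
mixture, speech, scaled_noise = mix_at_snr(targets, noise_clip, snr_db=10.0)

achieved = 10 * math.log10(power(speech) / power(scaled_noise))
print(round(achieved, 3))  # 10.0
```

Because the mixture is rebuilt on every draw, the model rarely sees the same overlap and noise combination twice, which is the main appeal of mixing online rather than from a fixed set of pre-rendered files.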
Source run artifacts included in this repository:
- `training_report.json`
- `training_report.md`
- `metrics.csv`
- `latest_summary.json`
Raw LibriSpeech and MUSAN audio are not redistributed in this model repository.
Final KPI Evaluation
Final KPI report date: 2026-06-21.
Evaluation protocol: final sparse-overlap LibriMix-style KPI evaluation for the fine-tuned Sandglasset 3-speaker checkpoint.
| Evaluation group | Cases | Required SI-SNR | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target | xRT target |
|---|---|---|---|---|---|---|---|---|---|
| Clean 3-speaker | 100 | >15 dB | 13.9380 dB | 17.4585 dB | 13.9257 dB | 17.4458 dB | 0.0202 | 39/100 | Met |
| Noisy 3-speaker | 100 | >10 dB | 12.0654 dB | 15.6966 dB | 12.0539 dB | 15.6846 dB | 0.0254 | 67/100 | Met |
Both final 3-speaker groups run far below the required 0.5 xRT runtime limit (mean xRT 0.0202 clean, 0.0254 noisy), and the checkpoint satisfies the under-5M-parameter deployment constraint. The noisy/reverberant 3-speaker group exceeds its 10 dB mean SI-SNR target by about 2.07 dB. The clean 3-speaker group falls about 1.06 dB short of its 15 dB target and is strongest at low to moderate overlap.
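For readers unfamiliar with the metrics in the table, the sketch below implements the commonly used definition of SI-SNR (zero-mean signals, projection of the estimate onto the reference) and treats SI-SNRi as the gain over the unprocessed mixture. It is a plain-Python illustration with synthetic sine-wave signals; the repository's evaluation code is not shown here and may differ in details such as the epsilon value or permutation handling.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample lists."""
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [v - est_mean for v in estimate]
    ref = [v - ref_mean for v in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: how much the estimate improves on the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


reference = [math.sin(0.05 * t) for t in range(800)]
interferer = [math.sin(0.31 * t + 1.0) for t in range(800)]
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

print(si_snr(estimate, reference) > si_snr(mixture, reference))  # True
# Rescaling the estimate leaves the score unchanged (scale invariance).
rescaled = [3 * v for v in estimate]
print(abs(si_snr(rescaled, reference) - si_snr(estimate, reference)) < 1e-6)  # True
```

Under this definition, the table's pattern of SI-SNRi exceeding SI-SNR (for example 17.4585 dB against 13.9380 dB for the clean group) implies that the unprocessed mixtures score below 0 dB on average, roughly -3.52 dB in that group.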
Overlap Breakdown
Clean 3-speaker:
| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 20.6725 dB | 24.1622 dB | 20.6549 dB | 24.1444 dB | 0.0144 | 16/20 |
| 25% | 20 | 16.2218 dB | 19.7307 dB | 16.2109 dB | 19.7194 dB | 0.0189 | 13/20 |
| 50% | 20 | 13.4509 dB | 16.9945 dB | 13.4360 dB | 16.9792 dB | 0.0182 | 9/20 |
| 75% | 20 | 11.0074 dB | 14.5427 dB | 10.9974 dB | 14.5323 dB | 0.0226 | 1/20 |
| 100% | 20 | 8.3372 dB | 11.8621 dB | 8.3294 dB | 11.8536 dB | 0.0268 | 0/20 |
Noisy 3-speaker:
| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 16.1933 dB | 19.7920 dB | 16.1783 dB | 19.7767 dB | 0.0201 | 18/20 |
| 25% | 20 | 13.8276 dB | 17.4482 dB | 13.8164 dB | 17.4366 dB | 0.0176 | 17/20 |
| 50% | 20 | 12.2222 dB | 15.8782 dB | 12.2095 dB | 15.8651 dB | 0.0241 | 16/20 |
| 75% | 20 | 10.1524 dB | 13.7991 dB | 10.1421 dB | 13.7884 dB | 0.0269 | 12/20 |
| 100% | 20 | 7.9314 dB | 11.5655 dB | 7.9232 dB | 11.5564 dB | 0.0383 | 4/20 |
The same values are provided in machine-readable form in `final_kpi_results.json`.
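Because every overlap bucket holds the same number of cases (20), the group-level means in the summary table should equal the plain average of the five bucket means. The short check below re-derives the group means and over-target counts from values copied out of the two overlap tables. The dictionary layout is an illustrative choice and is not the schema of `final_kpi_results.json`.

```python
# (mean SI-SNR in dB, cases over target) per overlap percentage, copied from the tables.
clean = {0: (20.6725, 16), 25: (16.2218, 13), 50: (13.4509, 9), 75: (11.0074, 1), 100: (8.3372, 0)}
noisy = {0: (16.1933, 18), 25: (13.8276, 17), 50: (12.2222, 16), 75: (10.1524, 12), 100: (7.9314, 4)}


def summarize(buckets):
    """Average equally sized buckets and total the over-target counts."""
    mean_si_snr = sum(m for m, _ in buckets.values()) / len(buckets)
    over_target = sum(c for _, c in buckets.values())
    return round(mean_si_snr, 2), over_target


print(summarize(clean))  # (13.94, 39)
print(summarize(noisy))  # (12.07, 67)
```

Both results agree with the reported group means (13.9380 dB and 12.0654 dB) and over-target counts (39/100 and 67/100) after rounding.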
Training Validation Metrics
Fine-tuned run: sandglass_overlap_noise_finetune_3spk_20260618_222500.
| Metric | Value |
|---|---|
| Best epoch | 49 |
| Best validation SI-SNR | 6.9151 dB |
| Best clean-condition validation SI-SNR | 7.8934 dB |
| Best noisy-condition validation SI-SNR | 5.8541 dB |
| Validation batches | 512 |
Quick Start
Install dependencies:

```bash
pip install -r requirements.txt
```

Separate a WAV file:

```bash
python inference.py input.wav --output-dir separated_outputs
```

Programmatic loading:

```python
import torch
from modeling_sandglasset import load_sandglasset_checkpoint, separate_tensor

model, checkpoint = load_sandglasset_checkpoint("best_model.pt")
mixture = torch.randn(16000)
estimates = separate_tensor(model, mixture)
print(estimates.shape)  # [1, 3, time]
```
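Since the model expects 16 kHz mono input, it can help to check a file's header before running separation. The sketch below uses only Python's standard `wave` module. The function name `check_wav_format` is hypothetical and is not part of this repository, and whether `inference.py` resamples or downmixes on its own is not stated here.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, description) for a PCM WAV file's sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    ok = rate == expected_rate and channels == expected_channels
    return ok, f"{rate} Hz, {channels} channel(s), {seconds:.2f} s"


# Demo: write one second of 16-bit silence at 16 kHz mono, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(check_wav_format(demo_path))  # (True, '16000 Hz, 1 channel(s), 1.00 s')
```

A file that fails this check would need resampling or downmixing to 16 kHz mono before its separation results can be compared with the metrics reported above.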
Intended Use
This model is intended for research and hackathon demonstration of real-time 3-speaker separation inside the PolyVOX smart assistant pipeline. It can be used to test overlapping command separation before ASR and downstream smart-home action routing.
Limitations
- Optimized for 16 kHz single-channel speech.
- Trained and validated on dynamic LibriSpeech-style speech mixtures with MUSAN noise augmentation, not every possible home acoustic condition.
- Speaker output order is not fixed: the output channels can come back in any permutation, so a channel index should not be interpreted as a persistent speaker identity.
- Three-speaker separation is harder than two-speaker separation. The included metrics should be presented as current hackathon checkpoint results, not as production-quality generalization guarantees.
- Clean 3-speaker mean SI-SNR is close to but still below the official clean multi-speaker target; noisy 3-speaker mean SI-SNR meets the target.
- This package includes a PyTorch `.pt` checkpoint. Load it only from trusted sources, or create an additional Safetensors copy with `convert_to_safetensors.py`.
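Because output channels carry no fixed speaker identity, any scoring or downstream attribution has to match estimates to references first. One common approach, sketched below, tries every permutation and keeps the one with the highest total score; with three speakers that is only six candidates. The scoring function is a simple negative squared error chosen to keep the example short, and all names are illustrative. The repository's own evaluation is not shown here and may instead match on SI-SNR.

```python
from itertools import permutations


def neg_squared_error(estimate, reference):
    """Toy similarity score: higher (closer to 0) means a better match."""
    return -sum((e - r) ** 2 for e, r in zip(estimate, reference))


def best_permutation(estimates, references, score=neg_squared_error):
    """Return the tuple p where estimates[p[i]] is matched to references[i]."""
    indices = range(len(estimates))
    return max(
        permutations(indices),
        key=lambda p: sum(score(estimates[p[i]], references[i]) for i in indices),
    )


references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The separator returned close copies of the references in a shuffled channel order.
estimates = [[0.0, 0.1, 0.9], [0.9, 0.0, 0.1], [0.1, 0.9, 0.0]]

print(best_permutation(estimates, references))  # (1, 2, 0)
```

In a pipeline without reference signals, such as live command attribution, the same idea applies with a different score, for example similarity to enrolled speaker embeddings, which is outside the scope of this checkpoint.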
License and Attribution
The project code in this package is released under the MIT License. Training used LibriSpeech and MUSAN-derived data according to their respective dataset terms. See the parent PolyVOX repository documentation for full open-source and dataset attribution.