PolyVOX Sandglasset 2-Speaker Separator
This repository contains the fine-tuned 2-speaker Sandglasset speech separation checkpoint used in the PolyVOX Samsung EnnovateX hackathon pipeline for real-time multi-user smart assistant command understanding in noisy smart-home environments.
Model Summary
- Task: 2-speaker single-channel speech separation.
- Sample rate: 16 kHz mono.
- Checkpoint: best_model.pt.
- Parameters: 4,928,770, under the 5M parameter hackathon budget.
- Architecture: compact Sandglasset-style encoder, bottleneck, gated TCN blocks, skip fusion, mask head, and transposed-convolution decoder.
- Intended pipeline role: separate overlapping speakers before ASR, command attribution, and SmartThings action routing.
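The parameter budget in the list above can be checked with a few lines of PyTorch. This is a minimal sketch: `count_parameters` and `PARAM_BUDGET` are illustrative names, not part of this repository, and the stand-in module is only there to make the snippet runnable. Whether the reported 4,928,770 figure counts all parameters or only trainable ones is not stated here, so the helper below simply counts everything.

```python
import torch.nn as nn

PARAM_BUDGET = 5_000_000  # hackathon limit quoted in the summary (assumed strict "<")


def count_parameters(model: nn.Module) -> int:
    """Return the total number of parameters in a module."""
    return sum(p.numel() for p in model.parameters())


# Stand-in module for illustration only; replace it with the loaded checkpoint model.
toy = nn.Conv1d(1, 256, kernel_size=16, stride=8)
n_params = count_parameters(toy)
print(n_params, n_params < PARAM_BUDGET)  # 4352 True
```

Running the same helper on the model returned by the loading function in Quick Start should reproduce the parameter count reported above.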
Architecture
The model uses the project Sandglasset student architecture:
- Encoder: Conv1d(1, 256, kernel_size=16, stride=8) followed by ReLU.
- Bottleneck: GroupNorm(1, 256) and Conv1d(256, 128, 1).
- Separator: 24 gated Sandglass/TCN blocks with hidden size 384, kernel size 5, dilation cycle 1, 2, 4, 8, 16, and selected average-pooling downsampling.
- Mask head: Conv1d(128, 2 * 256, 1) with ReLU masks.
- Decoder: ConvTranspose1d(256, 1, kernel_size=16, stride=8).
The exact PyTorch implementation is included in modeling_sandglasset.py.
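To make the tensor shapes concrete, the sketch below wires up the encoder, bottleneck, mask head, and decoder with the layer sizes listed above. It is not the code in modeling_sandglasset.py: the class name `FrontBackSketch` is hypothetical, the 24 gated separator blocks and skip fusion are replaced by an identity placeholder, and applying the masks multiplicatively to the encoder features is an assumption borrowed from common mask-based separators.

```python
import torch
import torch.nn as nn


class FrontBackSketch(nn.Module):
    """Encoder, bottleneck, mask head, and decoder with the listed layer sizes.

    The separator is nn.Identity() here, so this only demonstrates shapes,
    not separation quality.
    """

    def __init__(self, n_src: int = 2, n_filters: int = 256, bn: int = 128):
        super().__init__()
        self.n_src, self.n_filters = n_src, n_filters
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=16, stride=8)
        self.norm = nn.GroupNorm(1, n_filters)
        self.bottleneck = nn.Conv1d(n_filters, bn, 1)
        self.separator = nn.Identity()  # placeholder for the gated TCN stack
        self.mask_head = nn.Conv1d(bn, n_src * n_filters, 1)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: [batch, time]
        feats = torch.relu(self.encoder(mixture.unsqueeze(1)))      # [B, 256, frames]
        hidden = self.separator(self.bottleneck(self.norm(feats)))  # [B, 128, frames]
        masks = torch.relu(self.mask_head(hidden))                  # [B, 2 * 256, frames]
        masks = masks.view(-1, self.n_src, self.n_filters, masks.shape[-1])
        masked = feats.unsqueeze(1) * masks                         # [B, 2, 256, frames]
        out = self.decoder(masked.flatten(0, 1))                    # [B * 2, 1, time]
        return out.view(-1, self.n_src, out.shape[-1])              # [B, 2, time]


model = FrontBackSketch()
print(model(torch.randn(1, 16000)).shape)  # torch.Size([1, 2, 16000])
```

With kernel size 16 and stride 8 at 16 kHz, each encoder frame spans 1 ms with a 0.5 ms hop, so one second of audio becomes 1999 frames and the transposed convolution maps those frames back to 16000 samples.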
Training Data and Fine-Tuning
The checkpoint was trained with online dynamic mixing from LibriSpeech source utterances and MUSAN noise. The fine-tuning run focused on hard overlap and noise conditions for the real-time assistant setting.
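The following is a minimal sketch of what one online dynamic-mixing step could look like, assuming equal-length source clips, an overlap fraction measured relative to the utterance length, and noise scaled to a target SNR against the clean two-speaker sum. The function names, the overlap definition, and the sampling ranges are illustrative assumptions and are not taken from the training code.

```python
import numpy as np


def scale_to_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that signal power / noise power matches `snr_db`."""
    sig_pow = np.mean(signal ** 2) + 1e-8
    noise_pow = np.mean(noise ** 2) + 1e-8
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return noise * gain


def dynamic_mix(s1, s2, noise, overlap, snr_db):
    """Mix two equal-length sources at a given overlap fraction, then add noise.

    overlap=1.0 fully overlaps the speakers; overlap=0.0 places s2 after s1.
    """
    n = len(s1)
    offset = int(round((1.0 - overlap) * n))  # delay of the second speaker
    total = n + offset
    t1 = np.zeros(total)
    t2 = np.zeros(total)
    t1[:n] = s1
    t2[offset:offset + n] = s2
    clean = t1 + t2
    noise = np.resize(noise, total)  # tile or crop the noise clip to the mix length
    mixture = clean + scale_to_snr(clean, noise, snr_db)
    return mixture, np.stack([t1, t2])


# Random stand-ins for LibriSpeech and MUSAN clips; ranges below are assumed.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
noise = rng.standard_normal(8000)
mix, targets = dynamic_mix(s1, s2, noise, overlap=rng.uniform(0.0, 1.0), snr_db=rng.uniform(0.0, 20.0))
print(mix.shape, targets.shape)
```

Because a fresh overlap and SNR are drawn on every call, the model sees a different mixture each time the same pair of utterances is sampled.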
Source run artifacts included in this repository:
- training_report.json
- training_report.md
- metrics.csv
- latest_summary.json
Raw LibriSpeech and MUSAN audio are not redistributed in this model repository.
Final KPI Evaluation
Final KPI report date: 2026-06-21.
Evaluation protocol: final sparse-overlap LibriMix-style KPI evaluation for the fine-tuned Sandglasset 2-speaker checkpoint.
| Evaluation group | Cases | Required SI-SNR | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target | xRT target |
|---|---|---|---|---|---|---|---|---|---|
| Clean 2-speaker | 100 | >25 dB | 20.1991 dB | 20.1986 dB | 20.1097 dB | 20.1092 dB | 0.0212 | 20/100 | Met |
| Noisy 2-speaker | 100 | >18 dB | 16.6557 dB | 16.8240 dB | 16.6291 dB | 16.7974 dB | 0.0201 | 43/100 | Met |
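Both final 2-speaker groups run far below the required 0.5 xRT runtime limit (mean xRT 0.0212 clean, 0.0201 noisy), and the checkpoint satisfies the under-5M-parameter deployment constraint. Neither group meets its strict mean SI-SNR target: the clean group averages 20.20 dB against a >25 dB requirement and the noisy group averages 16.66 dB against >18 dB, with 20/100 and 43/100 cases over target respectively. The model is strongest at low overlap and remains useful as a compact real-time separator.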
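For readers who want to reproduce the SI-SNR and SI-SNRi columns, the sketch below uses the standard scale-invariant SNR definition with the best speaker permutation. It is an illustration, not the evaluation script behind the table: the function names are hypothetical, and details such as the epsilon value and whether the mixture baseline is averaged over both references are assumptions.

```python
from itertools import permutations

import numpy as np


def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between one estimate and one reference."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return float(10 * np.log10((s_target @ s_target + eps) / (e_noise @ e_noise + eps)))


def pit_si_snr(estimates: np.ndarray, targets: np.ndarray) -> float:
    """Mean SI-SNR under the best speaker permutation; inputs are [n_src, time]."""
    n_src = targets.shape[0]
    return max(
        float(np.mean([si_snr(estimates[p], targets[i]) for i, p in enumerate(perm)]))
        for perm in permutations(range(n_src))
    )


def si_snr_improvement(estimates: np.ndarray, targets: np.ndarray, mixture: np.ndarray) -> float:
    """SI-SNRi: separated SI-SNR minus the SI-SNR of the unprocessed mixture."""
    baseline = float(np.mean([si_snr(mixture, t) for t in targets]))
    return pit_si_snr(estimates, targets) - baseline
```

Under this definition, an equal-level two-speaker mixture scores close to 0 dB against either reference, which would be consistent with the SI-SNR and SI-SNRi columns above being nearly identical.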
Overlap Breakdown
Clean 2-speaker:
| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 30.7797 dB | 30.7798 dB | 30.5042 dB | 30.5042 dB | 0.0169 | 17/20 |
| 25% | 20 | 21.2307 dB | 21.2275 dB | 21.1504 dB | 21.1471 dB | 0.0178 | 3/20 |
| 50% | 20 | 17.6011 dB | 17.6161 dB | 17.5711 dB | 17.5860 dB | 0.0180 | 0/20 |
| 75% | 20 | 16.5915 dB | 16.6058 dB | 16.5617 dB | 16.5759 dB | 0.0262 | 0/20 |
| 100% | 20 | 14.7925 dB | 14.7641 dB | 14.7612 dB | 14.7327 dB | 0.0270 | 0/20 |
Noisy 2-speaker:
| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 20.2276 dB | 20.3985 dB | 20.1887 dB | 20.3595 dB | 0.0182 | 17/20 |
| 25% | 20 | 17.7737 dB | 17.9407 dB | 17.7434 dB | 17.9103 dB | 0.0218 | 15/20 |
| 50% | 20 | 16.3406 dB | 16.5108 dB | 16.3198 dB | 16.4900 dB | 0.0179 | 9/20 |
| 75% | 20 | 15.3192 dB | 15.5083 dB | 15.2986 dB | 15.4876 dB | 0.0200 | 2/20 |
| 100% | 20 | 13.6172 dB | 13.7619 dB | 13.5950 dB | 13.7396 dB | 0.0228 | 0/20 |
The same values are provided in machine-readable form in final_kpi_results.json.
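The xRT columns in the tables above are read here as a real-time factor, meaning processing time divided by audio duration, so values under 1.0 are faster than real time. That reading is an assumption, since the exact measurement procedure is not described in this card. A minimal timing sketch under that assumption, with `real_time_factor` as a hypothetical helper:

```python
import time


def real_time_factor(process, audio, sample_rate=16000, repeats=5):
    """Return mean processing time divided by audio duration (lower is faster)."""
    audio_seconds = len(audio) / sample_rate
    process(audio)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        process(audio)
    elapsed = (time.perf_counter() - start) / repeats
    return elapsed / audio_seconds


# Stand-in workload for illustration; in practice `process` would wrap the model call.
dummy_audio = [0.0] * 16000
xrt = real_time_factor(lambda x: sum(v * v for v in x), dummy_audio)
print(f"xRT = {xrt:.4f}, within 0.5 limit: {xrt < 0.5}")
```

Measured values depend on hardware, batch size, and clip length, so numbers obtained this way are only comparable to the table when those conditions match.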
Training Validation Metrics
Fine-tuned run: sandglass_overlap_noise_finetune_2spk_20260618_222414.
| Metric | Value |
|---|---|
| Best epoch | 38 |
| Best validation SI-SNR | 13.2299 dB |
| Best clean-condition validation SI-SNR | 14.9777 dB |
| Best noisy-condition validation SI-SNR | 11.5912 dB |
| Validation batches | 512 |
Quick Start
Install dependencies:
pip install -r requirements.txt
Separate a WAV file:
python inference.py input.wav --output-dir separated_outputs
Programmatic loading:
import torch
from modeling_sandglasset import load_sandglasset_checkpoint, separate_tensor
model, checkpoint = load_sandglasset_checkpoint("best_model.pt")
mixture = torch.randn(16000)
estimates = separate_tensor(model, mixture)
print(estimates.shape) # [1, 2, time]
Intended Use
This model is intended for research and hackathon demonstration of real-time 2-speaker separation inside the PolyVOX smart assistant pipeline. It can be used to test overlapping command separation before ASR and downstream smart-home action routing.
Limitations
- Optimized for 16 kHz single-channel speech.
- Trained and validated on dynamic LibriSpeech-style speech mixtures with MUSAN noise augmentation, not every possible home acoustic condition.
- Speaker output order is permutation-invariant and should not be interpreted as a persistent speaker identity.
- Final KPI mean SI-SNR remains below the strict clean and noisy 2-speaker targets, especially in high-overlap mixtures.
- This package includes a PyTorch .pt checkpoint. Load it only from trusted sources, or create an additional Safetensors copy with convert_to_safetensors.py.
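Given the 16 kHz single-channel limitation above, it can help to validate an input file before running separation. The sketch below uses only the Python standard library and works for PCM WAV files; `check_wav_format` is a hypothetical helper, and whether inference.py already resamples or downmixes its input is not stated here.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels found, expected mono")
    return problems
```

A file that fails either check would need resampling or downmixing with an audio tool of your choice before it matches the conditions the model was trained on.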
License and Attribution
The project code in this package is released under the MIT License. Training used LibriSpeech and MUSAN-derived data according to their respective dataset terms. See the parent PolyVOX repository documentation for full open-source and dataset attribution.