PolyVOX Sandglasset 3-Speaker Separator

This repository contains the fine-tuned 3-speaker Sandglasset speech separation checkpoint used in the PolyVOX pipeline for the Samsung EnnovateX hackathon. The pipeline targets real-time, multi-user command understanding for smart assistants in noisy smart-home environments.

Model Summary

Task: 3-speaker single-channel speech separation.
Sample rate: 16 kHz mono.
Checkpoint: best_model.pt.
Parameters: 4,961,794, under the 5M parameter hackathon budget.
Architecture: compact Sandglasset-style encoder, bottleneck, gated TCN blocks, skip fusion, mask head, and transposed-convolution decoder.
Intended pipeline role: separate higher-overlap multi-user speech before ASR, command attribution, and SmartThings action routing.

Architecture

The model uses the project Sandglasset student architecture:

Encoder: Conv1d(1, 256, kernel_size=16, stride=8) followed by ReLU.
Bottleneck: GroupNorm(1, 256) and Conv1d(256, 128, 1).
Separator: 24 gated Sandglass/TCN blocks with hidden size 384, kernel size 5, dilation cycle 1, 2, 4, 8, 16, and selected average-pooling downsampling.
Mask head: Conv1d(128, 3 * 256, 1) with ReLU masks.
Decoder: ConvTranspose1d(256, 1, kernel_size=16, stride=8).

The exact PyTorch implementation is included in modeling_sandglasset.py.
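
As a rough check on the shapes implied by the layer list above, the following sketch computes the number of encoder frames and the decoded length using the standard Conv1d and ConvTranspose1d length formulas, and lists one possible per-block dilation assignment. Zero padding and a dilation cycle that simply repeats in order across the 24 blocks are illustrative assumptions, and the helper names are hypothetical; modeling_sandglasset.py remains the authoritative implementation.

```python
KERNEL_SIZE = 16   # encoder/decoder kernel size from the layer list
STRIDE = 8         # encoder/decoder stride from the layer list
NUM_BLOCKS = 24    # number of gated Sandglass/TCN blocks
DILATION_CYCLE = (1, 2, 4, 8, 16)


def encoder_frames(num_samples: int) -> int:
    """Output length of Conv1d(kernel_size=16, stride=8), assuming no padding."""
    if num_samples < KERNEL_SIZE:
        raise ValueError("input shorter than one encoder window")
    return (num_samples - KERNEL_SIZE) // STRIDE + 1


def decoder_samples(num_frames: int) -> int:
    """Output length of ConvTranspose1d(kernel_size=16, stride=8), assuming no padding."""
    return (num_frames - 1) * STRIDE + KERNEL_SIZE


def dilation_schedule(num_blocks: int = NUM_BLOCKS) -> list:
    """Assumed schedule: the cycle 1, 2, 4, 8, 16 repeats in order across blocks."""
    return [DILATION_CYCLE[i % len(DILATION_CYCLE)] for i in range(num_blocks)]


frames = encoder_frames(16000)       # one second of audio at 16 kHz
restored = decoder_samples(frames)   # length after the transposed convolution
schedule = dilation_schedule()
print(frames, restored, schedule[:6])
```

Under these assumptions a one-second input yields 1999 frames and decodes back to 16000 samples. Inputs whose length is not of the form 16 + 8k would lose up to 7 trailing samples unless they are padded first.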

Training Data and Fine-Tuning

The checkpoint was trained with online dynamic mixing from LibriSpeech source utterances and MUSAN noise. The fine-tuning run focused on hard overlap and noise conditions for the real-time assistant setting.
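
To illustrate what mixing speech with noise at a controlled level involves, here is a minimal sketch that scales a noise signal to a requested signal-to-noise ratio before adding it to a speech signal. It is not the project's dynamic-mixing code: the function names, the use of plain Python lists, the synthetic signals, and the 5 dB value are assumptions made for illustration.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scaled_noise) with speech power / noise power equal to snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        # Silent noise: nothing to scale, so the mixture is just the speech.
        return list(speech), [0.0] * len(noise)
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Synthetic stand-ins for a speech utterance and a noise clip.
speech = [math.sin(0.1 * i) for i in range(1600)]
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(1600)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
achieved_snr_db = 10.0 * math.log10(power(speech) / power(scaled_noise))
print(round(achieved_snr_db, 3))
```

In an online setup, a routine like this would typically be called with a freshly sampled noise clip and SNR for every training example, so the model rarely sees the same mixture twice.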

Source run artifacts included in this repository:

training_report.json
training_report.md
metrics.csv
latest_summary.json

Raw LibriSpeech and MUSAN audio are not redistributed in this model repository.

Final KPI Evaluation

Final KPI report date: 2026-06-21.

Evaluation protocol: final sparse-overlap LibriMix-style KPI evaluation for the fine-tuned Sandglasset 3-speaker checkpoint.
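
For readers unfamiliar with the metrics reported in this section, the following sketch shows the commonly used definition of SI-SNR and of SI-SNR improvement (SI-SNRi) over the unprocessed mixture. It is written in plain Python for clarity and is not the evaluation script that produced the reported numbers; the function names and the eps constant are assumptions.

```python
import math


def _zero_mean(x):
    """Remove the mean so the metric ignores any DC offset."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to obtain the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals: the noise is orthogonal to the reference by construction.
reference = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, 0.1, -0.1, -0.1]
estimate = [r + n for r, n in zip(reference, noise)]
mixture = [r + 10.0 * n for r, n in zip(reference, noise)]

value = si_snr(estimate, reference)
improvement = si_snr_improvement(estimate, mixture, reference)
print(round(value, 2), round(improvement, 2))
```

In this toy case the estimate has a residual energy 100 times smaller than the target energy, giving about 20 dB, while the mixture sits at about 0 dB. Rescaling the estimate leaves the score unchanged, which is the property the "scale-invariant" name refers to.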

Evaluation group	Cases	Required SI-SNR	Mean SI-SNR	Mean SI-SNRi	Mean SI-SDR	Mean SI-SDRi	Mean xRT	Cases over target	xRT target
Clean 3-speaker	100	>15 dB	13.9380 dB	17.4585 dB	13.9257 dB	17.4458 dB	0.0202	39/100	Met
Noisy 3-speaker	100	>10 dB	12.0654 dB	15.6966 dB	12.0539 dB	15.6846 dB	0.0254	67/100	Met

Both final 3-speaker groups run well below the required 0.5 xRT runtime limit (mean xRT of 0.0202 and 0.0254), and the checkpoint satisfies the under-5M-parameter deployment constraint. The noisy/reverberant 3-speaker group exceeds the official mean SI-SNR target (12.07 dB against >10 dB); the clean 3-speaker group is close to but below its target (13.94 dB against >15 dB) and remains strongest at low to moderate overlap.
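
Assuming xRT denotes the real-time factor, that is, processing time divided by audio duration (this card does not define it explicitly), a measurement might be sketched as follows. The function name and the dummy workload are hypothetical; a real measurement would time the actual separation call on the target device.

```python
import time


def real_time_factor(process_fn, samples, sample_rate=16000):
    """Return wall-clock processing time divided by the audio duration."""
    audio_seconds = len(samples) / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio must contain at least one sample")
    start = time.perf_counter()
    process_fn(samples)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def dummy_separator(samples):
    """Placeholder workload standing in for the real separation model."""
    return [0.5 * s for s in samples]


audio = [0.0] * 16000                 # one second of silence at 16 kHz
xrt = real_time_factor(dummy_separator, audio)
meets_limit = xrt < 0.5               # 0.5 xRT runtime limit quoted in the text
print(f"xRT = {xrt:.4f}, within limit: {meets_limit}")
```

Values below 1.0 mean the audio is processed faster than it plays back. Averaging over several runs and discarding the first (warm-up) call usually gives a more stable figure than a single measurement.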

Overlap Breakdown

Clean 3-speaker:

Overlap	Cases	Mean SI-SNR	Mean SI-SNRi	Mean SI-SDR	Mean SI-SDRi	Mean xRT	Cases over target
0%	20	20.6725 dB	24.1622 dB	20.6549 dB	24.1444 dB	0.0144	16/20
25%	20	16.2218 dB	19.7307 dB	16.2109 dB	19.7194 dB	0.0189	13/20
50%	20	13.4509 dB	16.9945 dB	13.4360 dB	16.9792 dB	0.0182	9/20
75%	20	11.0074 dB	14.5427 dB	10.9974 dB	14.5323 dB	0.0226	1/20
100%	20	8.3372 dB	11.8621 dB	8.3294 dB	11.8536 dB	0.0268	0/20

Noisy 3-speaker:

Overlap	Cases	Mean SI-SNR	Mean SI-SNRi	Mean SI-SDR	Mean SI-SDRi	Mean xRT	Cases over target
0%	20	16.1933 dB	19.7920 dB	16.1783 dB	19.7767 dB	0.0201	18/20
25%	20	13.8276 dB	17.4482 dB	13.8164 dB	17.4366 dB	0.0176	17/20
50%	20	12.2222 dB	15.8782 dB	12.2095 dB	15.8651 dB	0.0241	16/20
75%	20	10.1524 dB	13.7991 dB	10.1421 dB	13.7884 dB	0.0269	12/20
100%	20	7.9314 dB	11.5655 dB	7.9232 dB	11.5564 dB	0.0383	4/20

The same values are provided in machine-readable form in final_kpi_results.json.
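
Because each overlap bucket holds the same number of cases, each group-level mean in the KPI table should equal the plain average of its five per-overlap means, and the cases-over-target counts should add up. The sketch below checks this using values copied from the tables in this section; the dictionary layout is an assumption and is not the schema of final_kpi_results.json.

```python
# Per-overlap (mean SI-SNR in dB, cases over target), copied from the tables.
CLEAN = {0: (20.6725, 16), 25: (16.2218, 13), 50: (13.4509, 9),
         75: (11.0074, 1), 100: (8.3372, 0)}
NOISY = {0: (16.1933, 18), 25: (13.8276, 17), 50: (12.2222, 16),
         75: (10.1524, 12), 100: (7.9314, 4)}


def summarize(buckets):
    """Average the bucket means (equal bucket sizes) and total the pass counts."""
    means = [mean for mean, _ in buckets.values()]
    passed = sum(count for _, count in buckets.values())
    return sum(means) / len(means), passed


clean_mean, clean_passed = summarize(CLEAN)
noisy_mean, noisy_passed = summarize(NOISY)
print(f"clean: {clean_mean:.4f} dB, {clean_passed}/100 over target")
print(f"noisy: {noisy_mean:.4f} dB, {noisy_passed}/100 over target")
```

The recomputed figures agree with the reported group means (13.9380 dB and 12.0654 dB) to rounding precision, and with the 39/100 and 67/100 counts.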

Training Validation Metrics

Fine-tuned run: sandglass_overlap_noise_finetune_3spk_20260618_222500.

Metric	Value
Best epoch	49
Best validation SI-SNR	6.9151 dB
Best clean-condition validation SI-SNR	7.8934 dB
Best noisy-condition validation SI-SNR	5.8541 dB
Validation batches	512

Quick Start

Install dependencies:

pip install -r requirements.txt

Separate a WAV file:

python inference.py input.wav --output-dir separated_outputs

Programmatic loading:

import torch
from modeling_sandglasset import load_sandglasset_checkpoint, separate_tensor

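# Load the model together with the loaded checkpoint object.
model, checkpoint = load_sandglasset_checkpoint("best_model.pt")
# Stand-in input: one second of random noise at 16 kHz; replace with real mono audio.
mixture = torch.randn(16000)
estimates = separate_tensor(model, mixture)
print(estimates.shape)  # [1, 3, time]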

Intended Use

This model is intended for research and hackathon demonstration of real-time 3-speaker separation inside the PolyVOX smart assistant pipeline. It can be used to test overlapping command separation before ASR and downstream smart-home action routing.

Limitations

Optimized for 16 kHz single-channel speech.
Trained and validated on dynamic LibriSpeech-style speech mixtures with MUSAN noise augmentation, which do not cover every possible home acoustic condition.
Speaker output order is permutation-invariant and should not be interpreted as a persistent speaker identity.
Three-speaker separation is harder than two-speaker separation, and the included metrics should be read as results for the current hackathon checkpoint, not as production-quality generalization guarantees.
Clean 3-speaker mean SI-SNR is close to but still below the official clean multi-speaker target; noisy 3-speaker mean SI-SNR meets the target.
This package includes a PyTorch .pt checkpoint. Load it only from trusted sources, or create an additional Safetensors copy with convert_to_safetensors.py.
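
Because output order is not tied to speaker identity, any scoring or downstream matching has to search over assignments of outputs to references. A minimal sketch of that search is shown below, using a negative mean-squared-error score for brevity; the scoring function and all names are illustrative assumptions, and a separation evaluation would normally score with SI-SNR instead.

```python
from itertools import permutations


def neg_mse(estimate, reference):
    """Higher is better: negative mean squared error between two signals."""
    return -sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def best_permutation(estimates, references, score_fn=neg_mse):
    """Return (order, score); order[i] is the estimate index matched to reference i."""
    best_order, best_score = None, float("-inf")
    for order in permutations(range(len(estimates))):
        score = sum(score_fn(estimates[j], references[i])
                    for i, j in enumerate(order)) / len(references)
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score


references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Estimates arrive in a shuffled order relative to the references.
estimates = [references[2], references[0], references[1]]

order, score = best_permutation(estimates, references)
print(order, score)
```

With three outputs there are only six candidate orderings, so exhaustive search is cheap. Keeping a speaker's identity stable across successive utterances would need an additional step, such as speaker embeddings, which this sketch does not attempt.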

License and Attribution

The project code in this package is released under the MIT License. Training used LibriSpeech and MUSAN-derived data according to their respective dataset terms. See the parent PolyVOX repository documentation for full open-source and dataset attribution.
