PolyVOX Sandglasset 2-Speaker Separator

This repository contains the fine-tuned 2-speaker Sandglasset speech separation checkpoint used in the PolyVOX Samsung EnnovateX hackathon pipeline for real-time multi-user smart assistant command understanding in noisy smart-home environments.

Model Summary

Task: 2-speaker single-channel speech separation.
Sample rate: 16 kHz mono.
Checkpoint: best_model.pt.
Parameters: 4,928,770, under the 5M parameter hackathon budget.
Architecture: compact Sandglasset-style encoder, bottleneck, gated TCN blocks, skip fusion, mask head, and transposed-convolution decoder.
Intended pipeline role: separate overlapping speakers before ASR, command attribution, and SmartThings action routing.

Architecture

The model uses the project Sandglasset student architecture:

Encoder: Conv1d(1, 256, kernel_size=16, stride=8) followed by ReLU.
Bottleneck: GroupNorm(1, 256) and Conv1d(256, 128, 1).
Separator: 24 gated Sandglass/TCN blocks with hidden size 384, kernel size 5, dilation cycle 1, 2, 4, 8, 16, and selected average-pooling downsampling.
Mask head: Conv1d(128, 2 * 256, 1) with ReLU masks.
Decoder: ConvTranspose1d(256, 1, kernel_size=16, stride=8).

The exact PyTorch implementation is included in modeling_sandglasset.py.
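
The layer sizes above imply some simple shape arithmetic, sketched below without PyTorch. This is an illustration, not code from modeling_sandglasset.py. It assumes the encoder uses no padding, the dilation cycle repeats block by block, and each block holds one dilated convolution. The receptive-field figure ignores the average-pooling downsampling, so treat it as a lower bound on temporal context. All function names are hypothetical.

```python
# Shape arithmetic for the listed layers (assumptions noted in each docstring).
KERNEL, STRIDE = 16, 8             # encoder / decoder kernel size and stride
NUM_BLOCKS = 24                    # separator blocks
DILATION_CYCLE = (1, 2, 4, 8, 16)  # dilation cycle from the architecture list
BLOCK_KERNEL = 5                   # kernel size inside each separator block


def encoder_frames(num_samples: int) -> int:
    """Output length of Conv1d(kernel_size=16, stride=8), assuming no padding."""
    if num_samples < KERNEL:
        raise ValueError("input is shorter than one encoder window")
    return (num_samples - KERNEL) // STRIDE + 1


def decoder_samples(num_frames: int) -> int:
    """Output length of ConvTranspose1d(kernel_size=16, stride=8), no padding."""
    return (num_frames - 1) * STRIDE + KERNEL


def dilation_schedule(num_blocks: int = NUM_BLOCKS) -> list:
    """Assumed schedule: the dilation cycle simply repeats across blocks."""
    return [DILATION_CYCLE[i % len(DILATION_CYCLE)] for i in range(num_blocks)]


def receptive_field_frames(dilations, kernel: int = BLOCK_KERNEL) -> int:
    """Receptive field of stacked dilated convolutions, ignoring pooling."""
    return 1 + sum((kernel - 1) * d for d in dilations)


frames = encoder_frames(16000)                      # one second at 16 kHz
print(frames, decoder_samples(frames))              # 1999 16000
print(receptive_field_frames(dilation_schedule()))  # 557
```

With a stride of 8 samples at 16 kHz, each encoder frame advances by 0.5 ms, so one second of audio becomes 1999 frames under these assumptions.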

Training Data and Fine-Tuning

The checkpoint was trained with online dynamic mixing from LibriSpeech source utterances and MUSAN noise. The fine-tuning run focused on hard overlap and noise conditions for the real-time assistant setting.
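
As a rough illustration of online dynamic mixing, the sketch below builds one two-speaker mixture at a chosen overlap ratio and adds noise at a chosen SNR. It is not the project's training code. The overlap definition (the fraction of the shorter source that overlaps the first source), the looping of short noise clips, and the names `mix_with_overlap` and `add_noise` are all assumptions made for this example. Random lists stand in for LibriSpeech and MUSAN audio.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_with_overlap(src_a, src_b, overlap):
    """Delay src_b so that `overlap` (0..1) of the shorter source overlaps src_a.

    Returns the mixture and both zero-padded sources at mixture length.
    """
    shorter = min(len(src_a), len(src_b))
    offset = len(src_a) - round(overlap * shorter)  # start index of src_b
    total = max(len(src_a), offset + len(src_b))
    a = list(src_a) + [0.0] * (total - len(src_a))
    b = [0.0] * offset + list(src_b) + [0.0] * (total - offset - len(src_b))
    return [x + y for x, y in zip(a, b)], a, b


def add_noise(mixture, noise, snr_db):
    """Loop `noise` to mixture length and scale it to the requested SNR in dB."""
    looped = [noise[i % len(noise)] for i in range(len(mixture))]
    gain = rms(mixture) / (rms(looped) * 10 ** (snr_db / 20))
    return [m + gain * n for m, n in zip(mixture, looped)]


rng = random.Random(0)  # fixed seed so the example is repeatable
src_a = [rng.uniform(-1, 1) for _ in range(16000)]  # stand-in for speaker A
src_b = [rng.uniform(-1, 1) for _ in range(16000)]  # stand-in for speaker B
noise = [rng.uniform(-1, 1) for _ in range(8000)]   # stand-in for a noise clip

mixture, a, b = mix_with_overlap(src_a, src_b, overlap=0.5)
noisy = add_noise(mixture, noise, snr_db=10.0)
print(len(mixture))  # 24000
```

Drawing a fresh overlap ratio and SNR for each training example is what makes this kind of mixing "dynamic". The ranges used for this checkpoint are not stated here.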

Source run artifacts included in this repository:

training_report.json
training_report.md
metrics.csv
latest_summary.json

Raw LibriSpeech and MUSAN audio are not redistributed in this model repository.

Final KPI Evaluation

Final KPI report date: 2026-06-21.

Evaluation protocol: final sparse-overlap LibriMix-style KPI evaluation for the fine-tuned Sandglasset 2-speaker checkpoint.

Evaluation group	Cases	Required SI-SNR	Mean SI-SNR	Mean SI-SNRi	Mean SI-SDR	Mean SI-SDRi	Mean xRT	Cases over target	xRT target
Clean 2-speaker	100	>25 dB	20.1991 dB	20.1986 dB	20.1097 dB	20.1092 dB	0.0212	20/100	Met
Noisy 2-speaker	100	>18 dB	16.6557 dB	16.8240 dB	16.6291 dB	16.7974 dB	0.0201	43/100	Met

Both final 2-speaker groups run far below the required 0.5 xRT runtime limit (mean xRT 0.0212 clean, 0.0201 noisy), and the checkpoint satisfies the under-5M-parameter deployment constraint. Neither group meets its mean SI-SNR target: the clean group averages 20.20 dB against a >25 dB requirement, and the noisy group averages 16.66 dB against a >18 dB requirement. The model is strongest at low overlap and remains useful as a compact real-time separator.
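
For readers unfamiliar with the metrics in the table, here is a minimal sketch of SI-SNR and SI-SNRi. It assumes the usual scale-invariant definition: zero-mean signals, with the estimate projected onto the reference. The evaluation scripts behind the reported numbers are not shown in this card and may differ in details such as the epsilon term. The function names are illustrative.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [v - mean_e for v in estimate]   # remove DC offset
    r = [v - mean_r for v in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(x * y for x, y in zip(e, r)) / (sum(y * y for y in r) + eps)
    target = [scale * y for y in r]
    residual = [x - t for x, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual) + eps
    return 10 * math.log10(target_energy / residual_energy + eps)


def si_snr_improvement(estimate, reference, mixture):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals: two orthogonal sinusoids stand in for the two speakers.
N = 1000
speaker = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interferer = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
mixture = [s + i for s, i in zip(speaker, interferer)]
estimate = [s + 0.1 * i for s, i in zip(speaker, interferer)]  # 10x less leakage

print(round(si_snr_improvement(estimate, speaker, mixture), 2))  # 20.0
```

Because the estimate is projected onto the reference first, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.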

Overlap Breakdown

Clean 2-speaker:

Overlap	Cases	Mean SI-SNR	Mean SI-SNRi	Mean SI-SDR	Mean SI-SDRi	Mean xRT	Cases over target
0%	20	30.7797 dB	30.7798 dB	30.5042 dB	30.5042 dB	0.0169	17/20
25%	20	21.2307 dB	21.2275 dB	21.1504 dB	21.1471 dB	0.0178	3/20
50%	20	17.6011 dB	17.6161 dB	17.5711 dB	17.5860 dB	0.0180	0/20
75%	20	16.5915 dB	16.6058 dB	16.5617 dB	16.5759 dB	0.0262	0/20
100%	20	14.7925 dB	14.7641 dB	14.7612 dB	14.7327 dB	0.0270	0/20

Noisy 2-speaker:

Overlap	Cases	Mean SI-SNR	Mean SI-SNRi	Mean SI-SDR	Mean SI-SDRi	Mean xRT	Cases over target
0%	20	20.2276 dB	20.3985 dB	20.1887 dB	20.3595 dB	0.0182	17/20
25%	20	17.7737 dB	17.9407 dB	17.7434 dB	17.9103 dB	0.0218	15/20
50%	20	16.3406 dB	16.5108 dB	16.3198 dB	16.4900 dB	0.0179	9/20
75%	20	15.3192 dB	15.5083 dB	15.2986 dB	15.4876 dB	0.0200	2/20
100%	20	13.6172 dB	13.7619 dB	13.5950 dB	13.7396 dB	0.0228	0/20

The same values are provided in machine-readable form in final_kpi_results.json.
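
The overlap buckets can be cross-checked against the group-level rows with a few lines of Python. The sketch below hard-codes the bucket values from the tables above rather than reading final_kpi_results.json, because the JSON schema is not described in this card. It assumes each group mean is the case-weighted mean of its buckets and uses the 0.5 xRT limit quoted above.

```python
# Bucket rows copied from the overlap tables:
# (overlap %, cases, mean SI-SNR dB, mean xRT, cases over target)
CLEAN = [
    (0, 20, 30.7797, 0.0169, 17),
    (25, 20, 21.2307, 0.0178, 3),
    (50, 20, 17.6011, 0.0180, 0),
    (75, 20, 16.5915, 0.0262, 0),
    (100, 20, 14.7925, 0.0270, 0),
]
NOISY = [
    (0, 20, 20.2276, 0.0182, 17),
    (25, 20, 17.7737, 0.0218, 15),
    (50, 20, 16.3406, 0.0179, 9),
    (75, 20, 15.3192, 0.0200, 2),
    (100, 20, 13.6172, 0.0228, 0),
]
XRT_LIMIT = 0.5  # runtime limit quoted in the KPI section


def summarize(rows):
    """Case-weighted group summary computed from per-overlap bucket rows."""
    total = sum(cases for _overlap, cases, _snr, _xrt, _over in rows)
    mean_si_snr = sum(cases * snr for _o, cases, snr, _x, _k in rows) / total
    mean_xrt = sum(cases * xrt for _o, cases, _s, xrt, _k in rows) / total
    over_target = sum(over for _o, _c, _s, _x, over in rows)
    return {
        "cases": total,
        "mean_si_snr": round(mean_si_snr, 4),
        "mean_xrt": round(mean_xrt, 4),
        "over_target": over_target,
        "xrt_met": all(xrt < XRT_LIMIT for _o, _c, _s, xrt, _k in rows),
    }


clean_summary = summarize(CLEAN)
noisy_summary = summarize(NOISY)
print(clean_summary["mean_si_snr"], clean_summary["over_target"])  # 20.1991 20
print(noisy_summary["mean_si_snr"], noisy_summary["over_target"])  # 16.6557 43
```

The recomputed means and case counts match the group-level rows in the Final KPI Evaluation table.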

Training Validation Metrics

Fine-tuned run: sandglass_overlap_noise_finetune_2spk_20260618_222414.

Metric	Value
Best epoch	38
Best validation SI-SNR	13.2299 dB
Best clean-condition validation SI-SNR	14.9777 dB
Best noisy-condition validation SI-SNR	11.5912 dB
Validation batches	512

Quick Start

Install dependencies:

pip install -r requirements.txt

Separate a WAV file:

python inference.py input.wav --output-dir separated_outputs
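
Since the model expects 16 kHz mono input, it can help to check a file's header before running the command above. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. Whether inference.py resamples or downmixes on its own is not stated here, so this is a precaution rather than a requirement. `check_wav_format` and `write_silence` are hypothetical helpers.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000   # Hz, from the model summary
EXPECTED_CHANNELS = 1   # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means 16 kHz mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE}")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected {EXPECTED_CHANNELS}")
    return problems


def write_silence(path, rate, channels, seconds=0.1):
    """Write a short silent 16-bit PCM WAV file (only used for the demo below)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000, 1)
    write_silence(bad, 44100, 2)
    good_problems = check_wav_format(good)
    bad_problems = check_wav_format(bad)

print(good_problems)      # []
print(len(bad_problems))  # 2
```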

Programmatic loading:

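import torch
from modeling_sandglasset import load_sandglasset_checkpoint, separate_tensor

# Load the fine-tuned 2-speaker checkpoint shipped in this repository.
model, checkpoint = load_sandglasset_checkpoint("best_model.pt")

# One second of random noise at 16 kHz stands in for a real mono mixture.
mixture = torch.randn(16000)

# Separate the mixture into two estimated sources.
estimates = separate_tensor(model, mixture)
print(estimates.shape)  # [1, 2, time]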

Intended Use

This model is intended for research and hackathon demonstration of real-time 2-speaker separation inside the PolyVOX smart assistant pipeline. It can be used to test overlapping command separation before ASR and downstream smart-home action routing.

Limitations

Optimized for 16 kHz single-channel speech.
Trained and validated on dynamic LibriSpeech-style speech mixtures with MUSAN noise augmentation, not every possible home acoustic condition.
Speaker output order is arbitrary: the two output channels may swap between inputs, so a channel index should not be interpreted as a persistent speaker identity.
Final KPI mean SI-SNR remains below the strict clean and noisy 2-speaker targets, especially in high-overlap mixtures.
This package includes a PyTorch .pt checkpoint. Such files are pickle-based and can execute arbitrary code when loaded, so load it only from trusted sources, or create an additional Safetensors copy with convert_to_safetensors.py.
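
Because output order is not tied to speaker identity, evaluation against reference sources usually tries every assignment of outputs to references and keeps the best one. The sketch below shows that permutation search for illustration only. It assumes the standard SI-SNR definition and is not taken from the project's evaluation code. In deployment no references exist, so consistent speaker labels would need a separate mechanism that this card does not describe.

```python
import math
from itertools import permutations


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (standard zero-mean projection form)."""
    n = len(reference)
    e = [v - sum(estimate) / n for v in estimate]
    r = [v - sum(reference) / n for v in reference]
    scale = sum(x * y for x, y in zip(e, r)) / (sum(y * y for y in r) + eps)
    target = [scale * y for y in r]
    residual_energy = sum((x - t) ** 2 for x, t in zip(e, target)) + eps
    return 10 * math.log10(sum(t * t for t in target) / residual_energy + eps)


def best_permutation(estimates, references):
    """Return (order, mean SI-SNR); order[i] is the estimate matched to references[i]."""
    best_order, best_score = None, -math.inf
    for order in permutations(range(len(estimates))):
        score = sum(
            si_snr(estimates[j], references[i]) for i, j in enumerate(order)
        ) / len(order)
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score


# Toy example: the separator returned the two speakers in swapped order.
N = 1000
ref_a = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
ref_b = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
out_0 = [b + 0.05 * a for a, b in zip(ref_a, ref_b)]  # mostly speaker B
out_1 = [a + 0.05 * b for a, b in zip(ref_a, ref_b)]  # mostly speaker A

order, score = best_permutation([out_0, out_1], [ref_a, ref_b])
print(order)  # (1, 0)
```

For two speakers the search covers only two assignments, so the cost is negligible.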

License and Attribution

The project code in this package is released under the MIT License. Training used LibriSpeech and MUSAN-derived data according to their respective dataset terms. See the parent PolyVOX repository documentation for full open-source and dataset attribution.

