PolyVOX Sandglasset 3-Speaker Separator

This repository contains the fine-tuned 3-speaker Sandglasset speech separation checkpoint used in the PolyVOX pipeline for the Samsung EnnovateX hackathon. The pipeline targets real-time multi-user smart assistant command understanding in noisy smart-home environments.

Model Summary

  • Task: 3-speaker single-channel speech separation.
  • Sample rate: 16 kHz mono.
  • Checkpoint: best_model.pt.
  • Parameters: 4,961,794, under the 5M parameter hackathon budget.
  • Architecture: compact Sandglasset-style encoder, bottleneck, gated TCN blocks, skip fusion, mask head, and transposed-convolution decoder.
  • Intended pipeline role: separate higher-overlap multi-user speech before ASR, command attribution, and SmartThings action routing.
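The parameter figure above can be checked against the 5M budget with a standard PyTorch count. The sketch below is illustrative only: `count_parameters` and `PARAM_BUDGET` are hypothetical names, and a single encoder-sized layer stands in for the real model, which would come from the checkpoint.

```python
import torch.nn as nn

PARAM_BUDGET = 5_000_000  # hackathon budget quoted in the summary (name is hypothetical)


def count_parameters(module: nn.Module) -> int:
    """Return the total number of parameters (trainable or not) in a module."""
    return sum(p.numel() for p in module.parameters())


# Stand-in module for illustration; pass the loaded Sandglasset model instead.
encoder = nn.Conv1d(1, 256, kernel_size=16, stride=8)
n_params = count_parameters(encoder)
print(n_params, n_params < PARAM_BUDGET)
```

Applied to the loaded checkpoint model, the same function should report the 4,961,794 figure if every parameter is counted the way the authors counted them.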

Architecture

The model uses the project Sandglasset student architecture:

  • Encoder: Conv1d(1, 256, kernel_size=16, stride=8) followed by ReLU.
  • Bottleneck: GroupNorm(1, 256) and Conv1d(256, 128, 1).
  • Separator: 24 gated Sandglass/TCN blocks with hidden size 384, kernel size 5, dilation cycle 1, 2, 4, 8, 16, and selected average-pooling downsampling.
  • Mask head: Conv1d(128, 3 * 256, 1) with ReLU masks.
  • Decoder: ConvTranspose1d(256, 1, kernel_size=16, stride=8).

The exact PyTorch implementation is included in modeling_sandglasset.py.
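To make the tensor shapes concrete, here is a minimal sketch of the encoder, bottleneck, mask head, and decoder wired together with the layer sizes listed above. It is not the repository implementation: the class name `SeparatorSkeleton` is hypothetical, the 24 gated blocks are replaced by an identity placeholder, and multiplying the masks with the encoder features is an assumption based on common mask-based separators.

```python
import torch
import torch.nn as nn


class SeparatorSkeleton(nn.Module):
    """Shape-only sketch; the separator stack is an identity placeholder."""

    def __init__(self, n_filters=256, bottleneck=128, n_speakers=3):
        super().__init__()
        self.n_filters = n_filters
        self.n_speakers = n_speakers
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=16, stride=8)
        self.norm = nn.GroupNorm(1, n_filters)
        self.bottleneck = nn.Conv1d(n_filters, bottleneck, 1)
        self.separator = nn.Identity()  # stands in for the 24 gated blocks
        self.mask_head = nn.Conv1d(bottleneck, n_speakers * n_filters, 1)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=16, stride=8)

    def forward(self, mixture):
        # mixture: [batch, time]
        feats = torch.relu(self.encoder(mixture.unsqueeze(1)))      # [B, 256, frames]
        hidden = self.separator(self.bottleneck(self.norm(feats)))  # [B, 128, frames]
        masks = torch.relu(self.mask_head(hidden))                  # [B, 3 * 256, frames]
        masks = masks.view(-1, self.n_speakers, self.n_filters, masks.shape[-1])
        masked = masks * feats.unsqueeze(1)                         # [B, 3, 256, frames]
        batch, spk, chan, frames = masked.shape
        decoded = self.decoder(masked.reshape(batch * spk, chan, frames))
        return decoded.view(batch, spk, -1)                         # [B, 3, time]


model = SeparatorSkeleton()
with torch.no_grad():
    out = model(torch.randn(2, 16000))
print(out.shape)
```

With kernel size 16 and stride 8, a 16000-sample input gives 1999 encoder frames and decodes back to 16000 samples. Inputs whose length minus 16 is not a multiple of 8 lose a few trailing samples in this sketch.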

Training Data and Fine-Tuning

The checkpoint was trained with online dynamic mixing from LibriSpeech source utterances and MUSAN noise. The fine-tuning run focused on hard overlap and noise conditions for the real-time assistant setting.
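As an illustration of what online dynamic mixing can look like, the sketch below sums three source utterances with random gains and adds noise at a chosen speech-to-noise ratio. The function name `mix_sources`, the gain range, and the SNR handling are assumptions for illustration; the actual training recipe, including how overlap is controlled, is not described in this card.

```python
import torch


def mix_sources(sources, noise, snr_db, generator=None):
    """Mix sources [n_src, time] with noise [time] at the given SNR in dB.

    Returns (mixture, scaled_sources). Gain range is a hypothetical choice.
    """
    # Random per-speaker gain between -5 dB and 0 dB.
    gains_db = torch.rand(sources.shape[0], generator=generator) * 5.0 - 5.0
    scaled = sources * (10.0 ** (gains_db / 20.0)).unsqueeze(1)
    speech = scaled.sum(dim=0)

    # Scale the noise so that speech power / noise power matches snr_db.
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    noise_gain = torch.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + noise_gain * noise, scaled


gen = torch.Generator().manual_seed(0)
sources = torch.randn(3, 16000, generator=gen)  # placeholder for speech utterances
noise = torch.randn(16000, generator=gen)       # placeholder for a noise clip
mixture, targets = mix_sources(sources, noise, snr_db=10.0, generator=gen)
print(mixture.shape, targets.shape)
```

Because a new mixture is drawn on every call, the model rarely sees the same combination of speakers, gains, and noise twice.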

Source run artifacts included in this repository:

  • training_report.json
  • training_report.md
  • metrics.csv
  • latest_summary.json

Raw LibriSpeech and MUSAN audio are not redistributed in this model repository.

Final KPI Evaluation

Final KPI report date: 2026-06-21.

Evaluation protocol: final sparse-overlap LibriMix-style KPI evaluation for the fine-tuned Sandglasset 3-speaker checkpoint.
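For readers unfamiliar with the metrics, the following is a minimal sketch of SI-SNR and SI-SNR improvement (SI-SNRi) using the common zero-mean, scale-invariant definition. It is not taken from the evaluation script, so small numerical differences from the reported values are possible.

```python
import torch


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB over the last dimension."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the target component.
    scale = (estimate * reference).sum(-1, keepdim=True) / (
        reference.pow(2).sum(-1, keepdim=True) + eps
    )
    target = scale * reference
    error = estimate - target
    ratio = target.pow(2).sum(-1) / (error.pow(2).sum(-1) + eps)
    return 10.0 * torch.log10(ratio + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


torch.manual_seed(0)
reference = torch.randn(16000)
interferer = torch.randn(16000)
mixture = reference + interferer
estimate = reference + 0.1 * interferer  # interferer attenuated by 20 dB
print(si_snr_improvement(estimate, mixture, reference))
```

In this toy case the interferer is attenuated by a factor of ten, so the improvement comes out near 20 dB.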

| Evaluation group | Cases | Required SI-SNR | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target | xRT target |
|---|---|---|---|---|---|---|---|---|---|
| Clean 3-speaker | 100 | >15 dB | 13.9380 dB | 17.4585 dB | 13.9257 dB | 17.4458 dB | 0.0202 | 39/100 | Met |
| Noisy 3-speaker | 100 | >10 dB | 12.0654 dB | 15.6966 dB | 12.0539 dB | 15.6846 dB | 0.0254 | 67/100 | Met |

Both final 3-speaker groups run far below the required 0.5 xRT runtime limit (mean xRT 0.0202 clean, 0.0254 noisy), and the checkpoint satisfies the under-5M-parameter deployment constraint. The noisy/reverberant 3-speaker group exceeds the official mean SI-SNR target (12.07 dB against >10 dB). The clean 3-speaker group falls about 1.06 dB short of its target (13.94 dB against >15 dB) and is strongest at low to moderate overlap.
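The xRT figures can be read as a real-time factor, assuming xRT means processing time divided by audio duration (this card does not define it explicitly). The sketch below measures that ratio for any callable; `real_time_factor` is a hypothetical helper and a single convolution stands in for the separator.

```python
import time

import torch
import torch.nn as nn


def real_time_factor(fn, audio, sample_rate=16000, repeats=3):
    """Mean wall-clock processing time divided by audio duration."""
    timings = []
    with torch.no_grad():
        for _ in range(repeats):
            start = time.perf_counter()
            fn(audio)
            timings.append(time.perf_counter() - start)
    audio_seconds = audio.shape[-1] / sample_rate
    return (sum(timings) / len(timings)) / audio_seconds


# Stand-in workload; replace with a call into the loaded separation model.
layer = nn.Conv1d(1, 256, kernel_size=16, stride=8)
audio = torch.randn(1, 1, 16000 * 4)  # 4 seconds at 16 kHz
rtf = real_time_factor(layer, audio)
print(f"xRT = {rtf:.4f}")
```

Under this reading, a value below 1.0 is faster than real time and the 0.5 limit means processing must take at most half the audio duration. On a GPU, `torch.cuda.synchronize()` would be needed before each clock read for the timing to be meaningful.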

Overlap Breakdown

Clean 3-speaker:

| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 20.6725 dB | 24.1622 dB | 20.6549 dB | 24.1444 dB | 0.0144 | 16/20 |
| 25% | 20 | 16.2218 dB | 19.7307 dB | 16.2109 dB | 19.7194 dB | 0.0189 | 13/20 |
| 50% | 20 | 13.4509 dB | 16.9945 dB | 13.4360 dB | 16.9792 dB | 0.0182 | 9/20 |
| 75% | 20 | 11.0074 dB | 14.5427 dB | 10.9974 dB | 14.5323 dB | 0.0226 | 1/20 |
| 100% | 20 | 8.3372 dB | 11.8621 dB | 8.3294 dB | 11.8536 dB | 0.0268 | 0/20 |

Noisy 3-speaker:

| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 16.1933 dB | 19.7920 dB | 16.1783 dB | 19.7767 dB | 0.0201 | 18/20 |
| 25% | 20 | 13.8276 dB | 17.4482 dB | 13.8164 dB | 17.4366 dB | 0.0176 | 17/20 |
| 50% | 20 | 12.2222 dB | 15.8782 dB | 12.2095 dB | 15.8651 dB | 0.0241 | 16/20 |
| 75% | 20 | 10.1524 dB | 13.7991 dB | 10.1421 dB | 13.7884 dB | 0.0269 | 12/20 |
| 100% | 20 | 7.9314 dB | 11.5655 dB | 7.9232 dB | 11.5564 dB | 0.0383 | 4/20 |

The same values are provided in machine-readable form in final_kpi_results.json.
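Because every overlap bin holds 20 cases, each group-level mean is simply the average of its five bin means, and the per-bin case counts add up to the group totals. The short check below reproduces that arithmetic from the numbers in the breakdown; the dictionary layout is an illustrative choice and is not the schema of final_kpi_results.json.

```python
# overlap percentage -> (mean SI-SNR in dB, cases over target out of 20)
clean = {0: (20.6725, 16), 25: (16.2218, 13), 50: (13.4509, 9),
         75: (11.0074, 1), 100: (8.3372, 0)}
noisy = {0: (16.1933, 18), 25: (13.8276, 17), 50: (12.2222, 16),
         75: (10.1524, 12), 100: (7.9314, 4)}


def summarize(bins):
    """Average the equally sized bins and total the cases over target."""
    mean_si_snr = sum(value for value, _ in bins.values()) / len(bins)
    cases_over = sum(count for _, count in bins.values())
    return mean_si_snr, cases_over


clean_mean, clean_over = summarize(clean)
noisy_mean, noisy_over = summarize(noisy)
print(f"clean: {clean_mean:.4f} dB, {clean_over}/100")
print(f"noisy: {noisy_mean:.4f} dB, {noisy_over}/100")
```

Both results agree with the group-level rows: 13.9380 dB with 39/100 for clean and 12.0654 dB with 67/100 for noisy.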

Training Validation Metrics

Fine-tuned run: sandglass_overlap_noise_finetune_3spk_20260618_222500.

| Metric | Value |
|---|---|
| Best epoch | 49 |
| Best validation SI-SNR | 6.9151 dB |
| Best clean-condition validation SI-SNR | 7.8934 dB |
| Best noisy-condition validation SI-SNR | 5.8541 dB |
| Validation batches | 512 |

Quick Start

Install dependencies:

pip install -r requirements.txt

Separate a WAV file:

python inference.py input.wav --output-dir separated_outputs

Programmatic loading:

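import torch
from modeling_sandglasset import load_sandglasset_checkpoint, separate_tensor

# Returns the model and the loaded checkpoint object.
model, checkpoint = load_sandglasset_checkpoint("best_model.pt")
# Placeholder input: 1 second of random noise at 16 kHz; replace with a real mono mixture.
mixture = torch.randn(16000)
estimates = separate_tensor(model, mixture)
print(estimates.shape)  # [1, 3, time]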
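Real recordings are often stereo or stored at another sample rate, so it can help to validate the input before separation. The helper below is a hypothetical sketch, not part of the repository: `prepare_mixture` downmixes to mono, rejects sample rates other than 16 kHz, and rescales only if the peak exceeds 1.0. Whether `separate_tensor` already performs any of these steps is not stated here.

```python
import torch

EXPECTED_SAMPLE_RATE = 16000  # the model expects 16 kHz mono audio


def prepare_mixture(waveform, sample_rate):
    """Return a mono float32 tensor of shape [time] suitable for separation.

    Accepts [time] or [channels, time]. Resampling is deliberately left to the
    caller, so a mismatched sample rate raises instead of being converted.
    """
    if sample_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sample_rate} Hz"
        )
    waveform = torch.as_tensor(waveform, dtype=torch.float32)
    if waveform.dim() == 2:
        waveform = waveform.mean(dim=0)  # average channels to mono
    elif waveform.dim() != 1:
        raise ValueError("waveform must have shape [time] or [channels, time]")
    peak = waveform.abs().max()
    if peak > 1.0:
        waveform = waveform / peak  # keep samples within [-1, 1]
    return waveform


stereo = torch.randn(2, 16000) * 3.0
mono = prepare_mixture(stereo, sample_rate=16000)
print(mono.shape, float(mono.abs().max()))
```

The returned tensor can then be passed in place of the random `mixture` from the loading example.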

Intended Use

This model is intended for research and hackathon demonstration of real-time 3-speaker separation inside the PolyVOX smart assistant pipeline. It can be used to test overlapping command separation before ASR and downstream smart-home action routing.

Limitations

  • Optimized for 16 kHz single-channel speech.
  • Trained and validated on dynamic LibriSpeech-style speech mixtures with MUSAN noise augmentation, not every possible home acoustic condition.
  • Speaker output order is arbitrary and can change from one input to the next; an output index should not be interpreted as a persistent speaker identity.
  • Three-speaker separation is harder than two-speaker separation. The included metrics should be presented as current hackathon checkpoint results, not as production-quality generalization guarantees.
  • Clean 3-speaker mean SI-SNR (13.94 dB) is still below the official clean multi-speaker target of >15 dB; noisy 3-speaker mean SI-SNR (12.07 dB) meets its >10 dB target.
  • This package includes a PyTorch .pt checkpoint. Load it only from trusted sources, or create an additional Safetensors copy with convert_to_safetensors.py.
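Since output order carries no speaker identity, scoring or downstream matching has to search over assignments between outputs and references. The sketch below does this by brute force over all permutations using SI-SNR; `best_permutation` is a hypothetical helper, and for three speakers only six permutations need to be checked.

```python
import itertools

import torch


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB over the last dimension."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    scale = (estimate * reference).sum(-1, keepdim=True) / (
        reference.pow(2).sum(-1, keepdim=True) + eps
    )
    target = scale * reference
    error = estimate - target
    return 10.0 * torch.log10(
        target.pow(2).sum(-1) / (error.pow(2).sum(-1) + eps) + eps
    )


def best_permutation(estimates, references):
    """Return (perm, mean_si_snr); perm[i] is the estimate index for reference i."""
    n_src = references.shape[0]
    # pairwise[i, j] = SI-SNR of estimate j scored against reference i
    pairwise = torch.stack(
        [si_snr(estimates, references[i].expand_as(estimates)) for i in range(n_src)]
    )
    best_perm, best_score = None, float("-inf")
    for perm in itertools.permutations(range(n_src)):
        score = sum(pairwise[i, j].item() for i, j in enumerate(perm)) / n_src
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


torch.manual_seed(0)
references = torch.randn(3, 8000)
# Simulated model output: the references in shuffled order with slight noise.
estimates = references[[2, 0, 1]] + 0.01 * torch.randn(3, 8000)
perm, score = best_permutation(estimates, references)
print(perm, round(score, 2))
```

Here reference 0 is recovered by estimate 1, reference 1 by estimate 2, and reference 2 by estimate 0, so the returned permutation is `(1, 2, 0)`.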

License and Attribution

The project code in this package is released under the MIT License. Training used LibriSpeech and MUSAN-derived data according to their respective dataset terms. See the parent PolyVOX repository documentation for full open-source and dataset attribution.

