PolyVOX Sandglasset 2-Speaker Separator

This repository contains the fine-tuned 2-speaker Sandglasset speech separation checkpoint used in the PolyVOX pipeline for the Samsung EnnovateX hackathon. The pipeline targets real-time, multi-user smart assistant command understanding in noisy smart-home environments.

Model Summary

  • Task: 2-speaker single-channel speech separation.
  • Sample rate: 16 kHz mono.
  • Checkpoint: best_model.pt.
  • Parameters: 4,928,770, under the 5M parameter hackathon budget.
  • Architecture: compact Sandglasset-style encoder, bottleneck, gated TCN blocks, skip fusion, mask head, and transposed-convolution decoder.
  • Intended pipeline role: separate overlapping speakers before ASR, command attribution, and SmartThings action routing.

Architecture

The model uses the project Sandglasset student architecture:

  • Encoder: Conv1d(1, 256, kernel_size=16, stride=8) followed by ReLU.
  • Bottleneck: GroupNorm(1, 256) and Conv1d(256, 128, 1).
  • Separator: 24 gated Sandglass/TCN blocks with hidden size 384, kernel size 5, dilation cycle 1, 2, 4, 8, 16, and selected average-pooling downsampling.
  • Mask head: Conv1d(128, 2 * 256, 1) with ReLU masks.
  • Decoder: ConvTranspose1d(256, 1, kernel_size=16, stride=8).

The exact PyTorch implementation is included in modeling_sandglasset.py.
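
The encoder and decoder settings listed above imply a simple frame arithmetic, sketched below in plain Python. This is only an illustration under stated assumptions: it uses the standard length formulas for an unpadded Conv1d and ConvTranspose1d, and it assumes the 24 separator blocks step through the dilation cycle in order. The helper names are hypothetical, and modeling_sandglasset.py remains the authoritative definition (it may pad inputs or order dilations differently).

```python
SAMPLE_RATE = 16000
KERNEL, STRIDE = 16, 8        # encoder/decoder kernel size and stride from the list above
NUM_BLOCKS = 24
DILATION_CYCLE = [1, 2, 4, 8, 16]


def conv1d_out_len(samples, kernel=KERNEL, stride=STRIDE):
    """Output frames of a Conv1d with no padding and dilation 1."""
    return (samples - kernel) // stride + 1


def conv_transpose1d_out_len(frames, kernel=KERNEL, stride=STRIDE):
    """Output samples of a ConvTranspose1d with no padding or output padding."""
    return (frames - 1) * stride + kernel


# One second of audio: the encoder frames cover the signal exactly.
frames = conv1d_out_len(SAMPLE_RATE)
restored = conv_transpose1d_out_len(frames)
print(frames, restored)

# A length that is not aligned to the stride loses its tail unless it is padded first.
odd_frames = conv1d_out_len(SAMPLE_RATE + 3)
print(odd_frames, conv_transpose1d_out_len(odd_frames))

# ASSUMPTION: block i uses the i-th entry of the repeating dilation cycle.
dilations = [DILATION_CYCLE[i % len(DILATION_CYCLE)] for i in range(NUM_BLOCKS)]
print(dilations)

# The mask head emits 2 * 256 channels, i.e. one 256-channel mask per speaker.
mask_channels = 2 * 256
print(mask_channels // 2)
```

At 16 kHz, a kernel of 16 samples spans 1 ms and a stride of 8 samples gives a 0.5 ms hop, which is why one second of audio becomes roughly two thousand encoder frames.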

Training Data and Fine-Tuning

The checkpoint was trained with online dynamic mixing from LibriSpeech source utterances and MUSAN noise. The fine-tuning run focused on hard overlap and noise conditions for the real-time assistant setting.
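
As a rough illustration of what online dynamic mixing with controlled overlap and noise can look like, here is a minimal stdlib-only sketch. It is not the project's training code: the function names, the definition of overlap (fraction of the second utterance that coincides with the tail of the first), and the SNR-based noise scaling are all assumptions made for this example.

```python
import math


def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_with_overlap(s1, s2, overlap):
    """Overlay s2 on the tail of s1 so that `overlap` (0..1) of s2 coincides with s1.

    Returns (mixture, padded_s1, padded_s2), all of the same length.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    overlap_len = min(len(s1), int(round(overlap * len(s2))))
    offset = len(s1) - overlap_len
    total = max(len(s1), offset + len(s2))
    t1 = list(s1) + [0.0] * (total - len(s1))
    t2 = [0.0] * offset + list(s2) + [0.0] * (total - offset - len(s2))
    return [a + b for a, b in zip(t1, t2)], t1, t2


def scale_noise_to_snr(speech, noise, snr_db):
    """Tile noise to len(speech) and scale it so speech-to-noise power equals snr_db."""
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    noise_power = mean_power(tiled)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent")
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [gain * x for x in tiled]


# Two synthetic "utterances" of 0.1 s at 16 kHz stand in for LibriSpeech audio.
s1 = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1600)]
s2 = [math.sin(2 * math.pi * 330 * n / 16000) for n in range(1600)]
mixture, t1, t2 = mix_with_overlap(s1, s2, overlap=0.25)

# A deterministic alternating signal stands in for a MUSAN noise clip.
noise = [(-1.0) ** n for n in range(100)]
scaled_noise = scale_noise_to_snr(mixture, noise, snr_db=10.0)
noisy_mixture = [m + v for m, v in zip(mixture, scaled_noise)]
print(len(mixture), len(noisy_mixture))
```

Drawing a new overlap ratio, noise clip, and SNR for every training example is what makes the mixing "dynamic": the model rarely sees the same mixture twice.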

Source run artifacts included in this repository:

  • training_report.json
  • training_report.md
  • metrics.csv
  • latest_summary.json

Raw LibriSpeech and MUSAN audio are not redistributed in this model repository.

Final KPI Evaluation

Final KPI report date: 2026-06-21.

Evaluation protocol: final sparse-overlap LibriMix-style KPI evaluation for the fine-tuned Sandglasset 2-speaker checkpoint.

| Evaluation group | Cases | Required SI-SNR | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target | xRT target |
|---|---|---|---|---|---|---|---|---|---|
| Clean 2-speaker | 100 | >25 dB | 20.1991 dB | 20.1986 dB | 20.1097 dB | 20.1092 dB | 0.0212 | 20/100 | Met |
| Noisy 2-speaker | 100 | >18 dB | 16.6557 dB | 16.8240 dB | 16.6291 dB | 16.7974 dB | 0.0201 | 43/100 | Met |

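Both final 2-speaker groups run far below the required 0.5 xRT runtime limit (mean 0.0212 clean, 0.0201 noisy), and the checkpoint satisfies the under-5M-parameter deployment constraint. Neither group meets its mean SI-SNR target, however: the clean mean is 20.1991 dB against a requirement of >25 dB, and the noisy mean is 16.6557 dB against >18 dB. Only 20 of 100 clean cases and 43 of 100 noisy cases are over target. The model is strongest at low overlap and remains useful as a compact real-time separator.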
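
For readers unfamiliar with the metrics, the sketch below shows one common way to compute SI-SNR, SI-SNR improvement (SI-SNRi), and a real-time factor in plain Python. It follows the usual textbook definition of SI-SNR and assumes that xRT means processing time divided by audio duration; the evaluation scripts behind the table may differ in detail, and the function names here are illustrative.

```python
import math
import time


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (standard zero-mean definition)."""
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the scaled target component.
    scale = sum(a * b for a, b in zip(est, ref)) / (sum(b * b for b in ref) + eps)
    target = [scale * b for b in ref]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((a - t) ** 2 for a, t in zip(est, target))
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


def real_time_factor(process, audio_seconds):
    """ASSUMED xRT definition: wall-clock processing time / audio duration."""
    start = time.perf_counter()
    process()
    return (time.perf_counter() - start) / audio_seconds


# Synthetic check: a 440 Hz target plus an equally loud 1 kHz interferer.
rate, n = 16000, 1600
target = [math.sin(2 * math.pi * 440 * t / rate) for t in range(n)]
interferer = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]
mixture = [a + b for a, b in zip(target, interferer)]
# A hypothetical separator output that suppresses the interferer amplitude tenfold.
estimate = [a + 0.1 * b for a, b in zip(target, interferer)]

improvement = si_snr_improvement(estimate, mixture, target)
print(round(improvement, 2))
```

Under this reading, an xRT of 0.02 means one second of audio is processed in about 20 ms, and an SI-SNRi close to the SI-SNR itself (as in the clean group) indicates that the unprocessed mixture starts near 0 dB.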

Overlap Breakdown

Clean 2-speaker:

| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 30.7797 dB | 30.7798 dB | 30.5042 dB | 30.5042 dB | 0.0169 | 17/20 |
| 25% | 20 | 21.2307 dB | 21.2275 dB | 21.1504 dB | 21.1471 dB | 0.0178 | 3/20 |
| 50% | 20 | 17.6011 dB | 17.6161 dB | 17.5711 dB | 17.5860 dB | 0.0180 | 0/20 |
| 75% | 20 | 16.5915 dB | 16.6058 dB | 16.5617 dB | 16.5759 dB | 0.0262 | 0/20 |
| 100% | 20 | 14.7925 dB | 14.7641 dB | 14.7612 dB | 14.7327 dB | 0.0270 | 0/20 |

Noisy 2-speaker:

| Overlap | Cases | Mean SI-SNR | Mean SI-SNRi | Mean SI-SDR | Mean SI-SDRi | Mean xRT | Cases over target |
|---|---|---|---|---|---|---|---|
| 0% | 20 | 20.2276 dB | 20.3985 dB | 20.1887 dB | 20.3595 dB | 0.0182 | 17/20 |
| 25% | 20 | 17.7737 dB | 17.9407 dB | 17.7434 dB | 17.9103 dB | 0.0218 | 15/20 |
| 50% | 20 | 16.3406 dB | 16.5108 dB | 16.3198 dB | 16.4900 dB | 0.0179 | 9/20 |
| 75% | 20 | 15.3192 dB | 15.5083 dB | 15.2986 dB | 15.4876 dB | 0.0200 | 2/20 |
| 100% | 20 | 13.6172 dB | 13.7619 dB | 13.5950 dB | 13.7396 dB | 0.0228 | 0/20 |

The same values are provided in machine-readable form in final_kpi_results.json.
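
Because every overlap level contributes the same number of cases (20), each group-level mean should equal the plain average of its five per-overlap means. The short check below confirms this using numbers copied by hand from the tables in this document; it does not read final_kpi_results.json, whose schema is not described here, and the dictionary layout is purely illustrative.

```python
# Per-overlap values (0%, 25%, 50%, 75%, 100%) copied from the breakdown tables.
breakdown = {
    "clean": {
        "si_snr_db": [30.7797, 21.2307, 17.6011, 16.5915, 14.7925],
        "xrt": [0.0169, 0.0178, 0.0180, 0.0262, 0.0270],
        "over_target": [17, 3, 0, 0, 0],
    },
    "noisy": {
        "si_snr_db": [20.2276, 17.7737, 16.3406, 15.3192, 13.6172],
        "xrt": [0.0182, 0.0218, 0.0179, 0.0200, 0.0228],
        "over_target": [17, 15, 9, 2, 0],
    },
}

# Group-level values copied from the final KPI summary.
reported = {
    "clean": {"si_snr_db": 20.1991, "xrt": 0.0212, "over_target": 20},
    "noisy": {"si_snr_db": 16.6557, "xrt": 0.0201, "over_target": 43},
}

recomputed = {}
for group, rows in breakdown.items():
    recomputed[group] = {
        "si_snr_db": sum(rows["si_snr_db"]) / len(rows["si_snr_db"]),
        "xrt": sum(rows["xrt"]) / len(rows["xrt"]),
        "over_target": sum(rows["over_target"]),
    }
    print(
        group,
        round(recomputed[group]["si_snr_db"], 4),
        round(recomputed[group]["xrt"], 4),
        recomputed[group]["over_target"],
    )
```

Small differences in the last printed digit are expected, since the per-overlap means are themselves rounded to four decimals.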

Training Validation Metrics

Fine-tuned run: sandglass_overlap_noise_finetune_2spk_20260618_222414.

| Metric | Value |
|---|---|
| Best epoch | 38 |
| Best validation SI-SNR | 13.2299 dB |
| Best clean-condition validation SI-SNR | 14.9777 dB |
| Best noisy-condition validation SI-SNR | 11.5912 dB |
| Validation batches | 512 |

Quick Start

Install dependencies:

    pip install -r requirements.txt

Separate a WAV file:

    python inference.py input.wav --output-dir separated_outputs

Programmatic loading:

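    import torch
    from modeling_sandglasset import load_sandglasset_checkpoint, separate_tensor

    # Returns the model together with the loaded checkpoint object.
    model, checkpoint = load_sandglasset_checkpoint("best_model.pt")

    # One second of random noise at 16 kHz stands in for a real mono mixture.
    mixture = torch.randn(16000)
    estimates = separate_tensor(model, mixture)
    print(estimates.shape)  # [1, 2, time]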
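
Since the model expects 16 kHz mono input, it can help to check a file's header before running separation. The sketch below uses only Python's standard wave module; check_wav_format and write_silence are hypothetical helpers written for this example and are not part of this repository. Note that the wave module reads PCM WAV files only, so other encodings would need a different reader.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the header matches."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"found {channels} channels, expected {expected_channels}")
    return problems


def write_silence(path, rate, channels, seconds=0.1):
    """Write a short silent 16-bit PCM WAV file (used only for the demo below)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))


# Demo on two throwaway files: one matching the expected format, one not.
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.wav")
    bad_path = os.path.join(tmp, "bad.wav")
    write_silence(good_path, rate=16000, channels=1)
    write_silence(bad_path, rate=44100, channels=2)
    good_problems = check_wav_format(good_path)
    bad_problems = check_wav_format(bad_path)

print(good_problems)
print(bad_problems)
```

A file that fails the check would need to be resampled and downmixed before separation, unless inference.py already handles that conversion.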

Intended Use

This model is intended for research and hackathon demonstration of real-time 2-speaker separation inside the PolyVOX smart assistant pipeline. It can be used to test overlapping command separation before ASR and downstream smart-home action routing.

Limitations

  • Optimized for 16 kHz single-channel speech.
  • Trained and validated on dynamic LibriSpeech-style speech mixtures with MUSAN noise augmentation, not every possible home acoustic condition.
  • Speaker output order is arbitrary (the separation is permutation-invariant) and can differ between inputs, so an output index should not be interpreted as a persistent speaker identity.
  • Final KPI mean SI-SNR remains below the strict clean and noisy 2-speaker targets, especially in high-overlap mixtures.
  • This package includes a PyTorch .pt checkpoint. Load it only from trusted sources, or create an additional Safetensors copy with convert_to_safetensors.py.
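
Because the output order is not tied to speaker identity, comparing estimates against known references usually means trying each assignment and keeping the best one. The sketch below shows that generic idea with a pluggable score function; it is an illustration, not a description of how this repository's evaluation resolves permutations, and best_permutation and neg_mse are names invented for the example.

```python
from itertools import permutations


def neg_mse(estimate, reference):
    """Simple similarity score: negative mean squared error (higher is better)."""
    return -sum((a - b) ** 2 for a, b in zip(estimate, reference)) / len(reference)


def best_permutation(estimates, references, score=neg_mse):
    """Return (order, mean_score) where order[r] is the estimate matched to reference r."""
    best_order, best_total = None, float("-inf")
    for order in permutations(range(len(references))):
        total = sum(score(estimates[e], references[r]) for r, e in enumerate(order))
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total / len(references)


# Toy signals: the two estimates come out in the opposite order to the references.
references = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
estimates = [[0.1, 0.9, 0.0, 1.1], [0.9, 0.1, 1.0, 0.0]]

order, mean_score = best_permutation(estimates, references)
print(order, mean_score)
```

For two speakers this only compares two assignments, so the cost is negligible. In a live pipeline without references, attribution has to come from a separate step such as speaker embedding matching.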

License and Attribution

The project code in this package is released under the MIT License. Training used LibriSpeech and MUSAN-derived data according to their respective dataset terms. See the parent PolyVOX repository documentation for full open-source and dataset attribution.

