Artem Gorlanov

metadata
title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false

How to run:

  1. Hugging Face (hosted Space)
  2. Run local inference:
    1. GUI: python app.py
    2. Command line: python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out
  3. Run with Docker:
docker login registry.hf.space
docker run -it -p 7860:7860 --platform=linux/amd64 \
    registry.hf.space/speechmaster-denoise-and-diarization:latest python app.py

About the pipeline:

  • denoise the audio
  • run VAD (voice activity detection)
  • extract a speaker embedding from each VAD fragment
  • cluster these embeddings

Inference time by hardware

Inference time for the file dialog.mp3:
Device  Hardware                 Time
CPU     2 vCPU (Hugging Face)    453.8 s/it
GPU     Tesla V100               8.23 s/it

Approaches

There are several known approaches to this task:

  • separation: using speech separation models (these need long training and fine-tuning)
  • diarization
    • speaker embeddings + clustering with a known number of speakers
    • overlapped speech detection
    • speaker embeddings + clustering with an unknown number of speakers
    • per-word ASR + speaker embeddings + clustering into speakers
  • end-to-end neural diarization (current state-of-the-art models perform worse than plain diarization)

For this task I used speaker embeddings + clustering with an unknown number of speakers.
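
As a rough illustration of clustering with an unknown number of speakers, here is a minimal sketch in pure Python. It is not the code used in this Space. It assumes average-linkage agglomerative clustering on cosine distance, stopped at a distance threshold, so the number of speakers falls out of the threshold instead of being given up front. The function names, the threshold value 0.5, and the toy 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering with a distance threshold.

    Merging stops once the two closest clusters are farther apart than
    `threshold` (an assumed value that would need tuning on real data).
    Returns one integer speaker label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as different speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for speaker_id, members in enumerate(clusters):
        for i in members:
            labels[i] = speaker_id
    return labels


# Toy embeddings standing in for per-fragment speaker embeddings.
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [1.0, 0.0, 0.1],
]
labels = cluster_embeddings(toy_embeddings, threshold=0.5)
print(labels)
```

Fragments 0, 1 and 4 end up with one label and fragments 2 and 3 with another, so two speakers are found without specifying the count. The quadratic search for the closest pair is fine for a sketch but slow for long recordings.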

How I can improve it:

  • Fix preprocessing:
    • estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean
  • Add training of:
    • a custom speaker recognition model
    • a custom overlapped speech detector
    • a custom speech separation model
  • Use face-based VAD (FaceVad) if video is available
  • Improve speed and RAM usage:
    • quantize the models
    • optimize the models for the target hardware: ONNX => OpenVINO / TensorRT / Caffe2 or Core ML
    • prune the models
    • distillation (train a small model using a big model as the teacher)
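
The SNR check from the preprocessing item could look roughly like the sketch below. It is a crude energy-based heuristic and only an assumption about how such a gate might be built: the quietest frames are taken as the noise floor and the loudest frames as speech plus noise. The function names, the 400-sample frame length, the 10% fraction, and the 20 dB threshold are hypothetical, and the input is assumed to be a list of float samples.

```python
import math
import random


def frame_energies(samples, frame_len):
    """Mean power of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def estimate_snr_db(samples, frame_len=400, fraction=0.1):
    """Crude SNR estimate: loudest frames vs. quietest frames."""
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        raise ValueError("audio is shorter than one frame")
    n = max(1, int(len(energies) * fraction))
    noise_power = sum(energies[:n]) / n      # quietest frames ~ noise floor
    loud_power = sum(energies[-n:]) / n      # loudest frames ~ speech + noise
    signal_power = max(loud_power - noise_power, 0.0)
    eps = 1e-12                              # avoids log(0) and division by zero
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def needs_denoising(samples, snr_threshold_db=20.0):
    """True if the estimated SNR is below the (assumed) threshold."""
    return estimate_snr_db(samples) < snr_threshold_db


def synthetic_audio(noise_sigma, seed=0, sample_rate=16000):
    """Half a second of noise, then half a second of a 200 Hz tone plus noise."""
    rng = random.Random(seed)
    samples = []
    for n in range(sample_rate):
        tone = 0.5 * math.sin(2 * math.pi * 200 * n / sample_rate)
        speech = tone if n >= sample_rate // 2 else 0.0
        samples.append(speech + rng.gauss(0.0, noise_sigma))
    return samples


clean = synthetic_audio(noise_sigma=0.001)
noisy = synthetic_audio(noise_sigma=0.2)
print(needs_denoising(clean), needs_denoising(noisy))
```

On the synthetic signals the nearly clean input skips denoising and the noisy one does not. Real recordings without any silent frames would defeat this heuristic, so a VAD-based noise estimate may be a better fit for the pipeline.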

Further improvements beyond the list above:

  • remove overlapped speech using ASR
  • remove overlapped speech using overlap detection
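
Once overlapped regions are known, whether from ASR or from an overlap detector, removing them from the VAD fragments is an interval subtraction. The sketch below shows only that step and assumes both inputs are lists of (start, end) times in seconds. The function name and the example timings are hypothetical, and how the overlap regions are detected is out of scope here.

```python
def subtract_intervals(segments, overlaps):
    """Cut every overlap region out of every speech segment.

    Both arguments are lists of (start, end) tuples in seconds.
    Returns the remaining single-speaker pieces, sorted by start time.
    """
    result = []
    overlaps = sorted(overlaps)
    for seg_start, seg_end in sorted(segments):
        cursor = seg_start  # start of the part of the segment not yet handled
        for ov_start, ov_end in overlaps:
            if ov_end <= cursor:
                continue  # overlap lies entirely before the remaining part
            if ov_start >= seg_end:
                break     # overlap lies entirely after this segment
            if ov_start > cursor:
                result.append((cursor, ov_start))  # keep the clean piece
            cursor = max(cursor, ov_end)
            if cursor >= seg_end:
                break
        if cursor < seg_end:
            result.append((cursor, seg_end))
    return result


# Hypothetical VAD fragments and detected overlap regions.
vad_segments = [(0.0, 4.0), (5.0, 9.0)]
overlap_regions = [(3.0, 5.5), (7.0, 7.5)]
clean_segments = subtract_intervals(vad_segments, overlap_regions)
print(clean_segments)
```

The remaining pieces can then go to the speaker embedding step, so that each embedding is computed from a single speaker only.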