---
title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false
---
# How to run:
1) [huggingface](https://huggingface.co/spaces/speechmaster/denoise_and_diarization)
2) run local inference:
    1) GUI:
    `python app.py`
    2) Local inference from the command line:
    `python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out`
3) run docker:
```
docker login registry.hf.space
docker run -it -p 7860:7860 --platform=linux/amd64 \
registry.hf.space/speechmaster-denoise-and-diarization:latest python app.py
```
# About pipeline:
+ denoise the audio
+ VAD (voice activity detection)
+ extract a speaker embedding from each VAD fragment
+ cluster these embeddings
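
To make the VAD step more concrete, here is a minimal sketch of how audio can be cut into speech fragments. It is an assumption for illustration only: it uses a simple energy threshold instead of the VAD model used in this Space, and the names `frame_energy` and `energy_vad`, the frame length, and the threshold are all hypothetical.

```python
import math


def frame_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return (start_sample, end_sample) spans whose frame energy exceeds the threshold.

    Toy stand-in for a real VAD model (assumption, not the Space's implementation).
    """
    segments = []
    start = None
    energies = frame_energy(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx * frame_len          # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # speech ends
            start = None
    if start is not None:                    # speech runs to the end of the signal
        segments.append((start, len(energies) * frame_len))
    return segments


# Toy signal: silence, a 440 Hz tone, silence (16 kHz sample rate assumed).
sr = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(3200)]
signal = silence + tone + silence

segments = energy_vad(signal)
print(segments)  # [(1600, 4800)]
```

Each returned span would then be passed to the speaker embedding model, one embedding per fragment.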
# Inference time by hardware
| hardware | inference time for file dialog.mp3 |
|-----------------------|:------------------------------------:|
| CPU, 2 vCPU (Hugging Face) | 453.8 s/it |
| GPU, Tesla V100 | 8.23 s/it |
# Approaches
I know of several methods for this task:
+ separation: using separation models (needs long training and fine-tuning)
+ diarization
    + speaker_embedding+Clustering with a known number of speakers
    + overlap speech detection
    + speaker_embedding+Clustering with a known number of speakers
    + asr_each_word+speaker_embedding+Clustering number of speakers
+ end-to-end neural diarization (current SOTA is worse than plain diarization)

For this task I used speaker_embedding+Clustering with an unknown number of speakers.
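
Clustering with an unknown number of speakers can be sketched as agglomerative clustering that stops merging at a distance threshold, so the number of clusters falls out of the data. This is a minimal pure-Python illustration under assumptions: the function names, the average-linkage choice, the cosine distance, the `threshold` value, and the toy 3-dimensional embeddings are all hypothetical and not taken from `main_pipeline.py`.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_embeddings(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering.

    Merging stops once the closest pair of clusters is farther apart than
    `threshold`, so the number of speakers does not need to be known.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            ((linkage(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Convert cluster membership into one label per embedding.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings: fragments 0, 1, 4 point one way, fragments 2, 3 another.
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.95, 0.15, 0.05],
]
labels = cluster_embeddings(embeddings)
print(labels, "speakers:", len(set(labels)))
```

The threshold is the main tuning knob here: too low splits one speaker into several clusters, too high merges different speakers.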
# How I can improve:
+ Fix preprocessing
    + estimate SNR (signal-to-noise ratio) and skip denoising if the input is already clean
+ Add training:
    + custom speaker recognition model
    + custom overlap speech detector
    + custom speech separation model:
        + [MossFormer](https://github.com/alibabasglab/MossFormer)
        + [speechbrain](https://speechbrain.github.io/)
+ Use FaceVad if video is available
+ improve speed and RAM usage:
    + quantize models
    + optimize models for the target hardware: onnx => openvino/tensorrt/caffe2 or coreml
    + prune models
    + distillation (train a small model with a big model)
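
The SNR check from the preprocessing item above could look roughly like the sketch below. It is an assumption, not the current behaviour of the pipeline: the estimator simply compares the loudest frames against the quietest ones, and `estimate_snr_db`, `maybe_denoise`, `fake_denoise`, and the 20 dB threshold are hypothetical names and values chosen for illustration.

```python
import math
import random


def estimate_snr_db(samples, frame_len=160, noise_fraction=0.1):
    """Rough SNR estimate in dB.

    Treats the quietest frames as noise and the loudest frames as
    speech-plus-noise, then compares their mean energies.
    """
    energies = sorted(
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    )
    k = max(1, int(len(energies) * noise_fraction))
    noise = sum(energies[:k]) / k
    speech = sum(energies[-k:]) / k
    eps = 1e-12  # avoids division by zero on digital silence
    return 10.0 * math.log10((speech + eps) / (noise + eps))


def maybe_denoise(samples, denoise_fn, snr_threshold_db=20.0):
    """Run the denoiser only when the estimated SNR is below the threshold."""
    if estimate_snr_db(samples) >= snr_threshold_db:
        return samples
    return denoise_fn(samples)


def fake_denoise(samples):
    """Placeholder for the real denoising model (hypothetical)."""
    return [0.0] * len(samples)


def make_signal(noise_amp, rng):
    """Noise-only lead-in followed by a 440 Hz tone with additive noise."""
    lead_in = [rng.uniform(-noise_amp, noise_amp) for _ in range(1600)]
    tone = [
        0.5 * math.sin(2 * math.pi * 440 * n / 16000) + rng.uniform(-noise_amp, noise_amp)
        for n in range(3200)
    ]
    return lead_in + tone


rng = random.Random(0)
clean = make_signal(0.001, rng)
noisy = make_signal(0.3, rng)

print("clean is left untouched:", maybe_denoise(clean, fake_denoise) is clean)
print("noisy is denoised:", maybe_denoise(noisy, fake_denoise) is not noisy)
```

Skipping the denoiser on clean input saves inference time and avoids artifacts that a denoiser may introduce into already clean speech.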
Further ways to improve, beyond the list above:
+ remove overlapping speech using ASR
+ remove overlapping speech using overlap detection
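
Once an overlap detector has produced time regions where two people talk at once, removing them from the speech segments is plain interval subtraction. The sketch below assumes both inputs are lists of `(start_sec, end_sec)` tuples. `subtract_intervals` and the example timings are hypothetical, and the overlap detector itself is not shown.

```python
def subtract_intervals(segments, overlaps):
    """Remove every overlap region from every speech segment.

    Both inputs are lists of (start_sec, end_sec) tuples.
    """
    result = []
    for seg_start, seg_end in segments:
        pieces = [(seg_start, seg_end)]
        for ov_start, ov_end in overlaps:
            next_pieces = []
            for start, end in pieces:
                # No intersection: keep the piece unchanged.
                if ov_end <= start or ov_start >= end:
                    next_pieces.append((start, end))
                    continue
                # Keep whatever sticks out on the left and on the right.
                if ov_start > start:
                    next_pieces.append((start, ov_start))
                if ov_end < end:
                    next_pieces.append((ov_end, end))
            pieces = next_pieces
        result.extend(pieces)
    return result


speech = [(0.0, 4.0), (5.0, 9.0)]
overlap = [(3.0, 6.0), (7.0, 7.5)]
clean_segments = subtract_intervals(speech, overlap)
print(clean_segments)  # [(0.0, 3.0), (6.0, 7.0), (7.5, 9.0)]
```

Only the remaining single-speaker pieces would then be embedded and clustered, which keeps mixed-speaker audio out of the embeddings.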