---
title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false
---

# How to run:
1) Hosted demo: [huggingface](https://huggingface.co/spaces/speechmaster/denoise_and_diarization)
2) Run local inference:
   1) GUI:
   `python app.py`
   2) CLI:
   `python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out`
3) Run with Docker:
```
docker login registry.hf.space
docker run -it -p 7860:7860 --platform=linux/amd64 \
	registry.hf.space/speechmaster-denoise-and-diarization:latest python app.py
```

# About the pipeline:
+ denoise the audio
+ VAD (voice activity detection)
+ extract a speaker embedding from each VAD fragment
+ cluster these embeddings (see the sketch below)
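A rough sketch of how these stages compose. The helpers `denoise`, `vad_segments`, `embed_speaker`, and `cluster_embeddings` are hypothetical stand-ins for the actual models wired up in `main_pipeline.py`:

```python
import numpy as np

def run_pipeline(wav: np.ndarray, sr: int):
    """Denoise -> VAD -> per-fragment speaker embeddings -> clustering."""
    clean = denoise(wav, sr)                  # hypothetical speech-enhancement step
    fragments = vad_segments(clean, sr)       # hypothetical: [(start_s, end_s), ...]
    embeddings = np.stack([
        embed_speaker(clean, sr, start, end)  # hypothetical embedding extractor
        for start, end in fragments
    ])
    labels = cluster_embeddings(embeddings)   # e.g. the sketch under "Approaches" below
    # Each VAD fragment gets a speaker id from its cluster label.
    return [(start, end, int(label)) for (start, end), label in zip(fragments, labels)]
```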


# Inference time by hardware

| Hardware                   | Inference time for dialog.mp3 |
|----------------------------|:-----------------------------:|
| CPU (2 vCPU, Hugging Face) |          453.8 s/it           |
| GPU (Tesla V100)           |           8.23 s/it           |

# Approaches
I know several methods for this task:
  + separation: use speech separation models (these need long training and fine-tuning)
  + diarization:
    + speaker embedding + clustering with a known number of speakers
    + overlap speech detection
    + speaker embedding + clustering with an unknown number of speakers
    + per-word ASR + speaker embedding + clustering
  + end-to-end NN diarization (SOTA on paper, but worse here than plain clustering-based diarization)

For this task I used speaker embedding + clustering with an unknown number of speakers.
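One common way to cluster without knowing the speaker count is to sweep candidate counts and keep the partition with the best silhouette score. This is a minimal sketch of that idea, not necessarily what this Space does (on scikit-learn < 1.2, `metric=` is spelled `affinity=`):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_unknown_speakers(embeddings: np.ndarray, max_speakers: int = 10) -> np.ndarray:
    """Pick the number of speakers that maximizes the silhouette score."""
    best_labels = np.zeros(len(embeddings), dtype=int)  # fallback: one speaker
    best_score = -1.0
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = AgglomerativeClustering(
            n_clusters=k, metric="cosine", linkage="average"
        ).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```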


# How it can be improved:
+ Fix preprocessing:
  + estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean (see the sketch after this list)
+ Add training:
  + custom speaker recognition model
  + custom overlap speech detector
  + custom speech separation model:
    + [MossFormer](https://github.com/alibabasglab/MossFormer)
    + [speechbrain](https://speechbrain.github.io/)
+ Use face-based VAD if video is available
+ Improve speed and RAM usage:
  + quantize models
  + optimize models for the target hardware: ONNX => OpenVINO / TensorRT / Caffe2, or CoreML
  + prune models
  + distillation (train a small model with a big model as teacher)
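For the SNR-gating idea above, a crude but workable sketch: estimate the noise floor from the quietest frames and skip the (slow) denoiser when the input already looks clean. The 20 dB threshold, the frame size, and the `denoise` helper are illustrative assumptions:

```python
import numpy as np

def estimate_snr_db(wav: np.ndarray, frame: int = 2048) -> float:
    """Crude SNR estimate: treat the quietest 10% of frames as the noise floor."""
    n = max(1, len(wav) // frame)
    energies = np.array([np.mean(wav[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    energies = np.sort(energies[energies > 0])
    noise = np.mean(energies[: max(1, len(energies) // 10)])  # quietest ~10%
    signal = np.mean(energies)
    return 10.0 * np.log10(signal / noise)

def maybe_denoise(wav: np.ndarray, sr: int, threshold_db: float = 20.0) -> np.ndarray:
    # Skip denoising when the input already looks clean enough.
    return wav if estimate_snr_db(wav) > threshold_db else denoise(wav, sr)  # hypothetical denoise()
```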
How to improve besides the above:
+ remove overlapped speech using ASR
+ remove overlapped speech using an overlap detector (see the sketch below)
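A sketch of the second point: given overlap regions from any overlapped-speech detector, cut them out of the diarization output so every remaining span is single-speaker. Here `segments` are assumed to be `(start, end, speaker)` triples and `overlaps` are `(start, end)` pairs, both in seconds:

```python
def remove_overlapped_speech(segments, overlaps):
    """Trim (start, end, speaker) segments so they avoid all overlap regions."""
    result = []
    for start, end, speaker in segments:
        cursor = start
        for o_start, o_end in sorted(overlaps):
            if o_end <= cursor or o_start >= end:
                continue  # this overlap region does not touch the segment
            if o_start > cursor:
                result.append((cursor, o_start, speaker))  # keep the clean part
            cursor = max(cursor, o_end)                    # jump past the overlap
        if cursor < end:
            result.append((cursor, end, speaker))
    return result
```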