---
title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false
---

# How to run:
1) [huggingface](https://huggingface.co/spaces/speechmaster/denoise_and_diarization)
2) run local inference:
   1) GUI: `python app.py`
   2) Local inference: `python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out`
3) run docker:
```
docker login registry.hf.space
docker run -it -p 7860:7860 --platform=linux/amd64 \
    registry.hf.space/speechmaster-denoise-and-diarization:latest python app.py
```

# About pipeline:
+ denoise audio
+ VAD (voice activity detection)
+ speaker embeddings from each VAD fragment
+ clustering of these embeddings

# Inference for hardware
|                            | inference time for file dialog.mp3 |
|----------------------------|:----------------------------------:|
| CPU: 2 vCPU (Hugging Face) |             453.8 s/it             |
| GPU: Tesla V100            |             8.23 s/it              |

# Approaches
I know a lot of methods for this task:
+ separation: using separation models (needs long training and fine-tuning)
+ diarization + speaker_embedding + clustering with a known number of speakers
+ overlap speech detection + speaker_embedding + clustering with a known number of speakers
+ asr_each_word + speaker_embedding + clustering numbers of speakers
+ end-to-end neural diarization (SOTA is worse than plain diarization)

For this task I used speaker_embedding + clustering with an unknown number of speakers.

# How I can improve:
+ Fix preprocessing
  + estimate SNR (signal-to-noise ratio) and skip denoising if the input is already clean
+ Add training:
  + custom speaker recognition model
  + custom overlap speech detector
  + custom speech separation model:
    + [MossFormer](https://github.com/alibabasglab/MossFormer)
    + [speechbrain](https://speechbrain.github.io/)
+ Use FaceVad if video is available
+ improve speed and RAM usage:
  + model quantization
  + optimize models for the target hardware: onnx => openvino/tensorrt/caffe2 or coreml
  + model pruning
  + distillation (train a small model with a big model)

How to improve besides the above:
+ delete overlap speech using ASR
+ delete overlap speech using overlap detection
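
The last pipeline step, clustering speaker embeddings when the number of speakers is unknown, can be illustrated with a minimal sketch. This is not the code used in `main_pipeline.py`. It is a pure-Python average-linkage agglomerative clustering that stops merging once the closest pair of clusters is farther apart than a cosine-distance threshold, so the number of speakers falls out of the data. The function names, the toy 2-D embeddings and the `threshold=0.5` value are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Assign a speaker label to each embedding (hypothetical helper).

    Average-linkage agglomerative clustering: start with one cluster per
    embedding and repeatedly merge the two closest clusters until the
    smallest inter-cluster distance exceeds `threshold`.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Mean pairwise cosine distance between members of two clusters.
        dists = [
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        ]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as different speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


if __name__ == "__main__":
    # Toy embeddings: two fragments per "speaker" (made-up values).
    fragments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cluster_embeddings(fragments, threshold=0.5))
```

The threshold replaces the speaker count as the tuning knob: a lower value splits fragments into more speakers, a higher value merges them. With real embeddings, a library implementation such as scikit-learn's `AgglomerativeClustering` with `n_clusters=None` and a `distance_threshold` would be a more practical choice than this quadratic loop.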