molto committed on
Commit
b52dc53
1 Parent(s): a7be8e8

Upload 14 files

CKPT.yaml ADDED
@@ -0,0 +1,4 @@
+ # yamllint disable
+ end-of-epoch: true
+ si-snr: -7.369511792677621
+ unixtime: 1621680451.6093628
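The `si-snr` field above and the SI-SNRi metric reported in the README both refer to the scale-invariant signal-to-noise ratio. As a rough illustration, here is a minimal pure-Python sketch of the commonly used definition: zero-mean signals, with the estimate projected onto the reference. It is not SpeechBrain's implementation, the name `si_snr_db` is hypothetical, and it says nothing about how the sign of the stored value should be read.

```python
import math


def si_snr_db(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sequences of floats."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean of each signal.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    est = [x - mean_e for x in estimate]
    ref = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the "target" component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after the projection counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


# Synthetic signals, for illustration only.
reference = [math.sin(0.1 * i) for i in range(200)]
clean_ish = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
noisy = [r + 0.5 * math.cos(0.37 * i) for i, r in enumerate(reference)]
print(si_snr_db(clean_ish, reference), si_snr_db(noisy, reference))
```

Rescaling the estimate leaves the score essentially unchanged, which is the "scale-invariant" part of the name.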
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ language: "en"
+ thumbnail:
+ tags:
+ - audio-to-audio
+ - audio-source-separation
+ - Source Separation
+ - Speech Separation
+ - WHAMR!
+ - SepFormer
+ - Transformer
+ - pytorch
+ - speechbrain
+ license: "apache-2.0"
+ metrics:
+ - SI-SNRi
+ - SDRi
+
+ ---
+
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
+ <br/><br/>
+
+ # SepFormer trained on WHAMR! (16k sampling frequency)
+ This repository provides all the necessary tools to perform audio source separation with a [SepFormer](https://arxiv.org/abs/2010.13154v2) model, implemented with SpeechBrain and pretrained on the [WHAMR!](http://wham.whisper.ai/) dataset at a 16 kHz sampling frequency. WHAMR! is a version of the WSJ0-Mix dataset with added environmental noise and reverberation. For a better experience, we encourage you to learn more about [SpeechBrain](https://speechbrain.github.io). The model achieves 13.5 dB SI-SNRi on the WHAMR! test set.
+
+
+ | Release | Test-Set SI-SNRi | Test-Set SDRi |
+ |:-------------:|:--------------:|:--------------:|
+ | 30-03-21 | 13.5 dB | 13.0 dB |
+
+
+ ## Install SpeechBrain
+
+ First, install SpeechBrain with the following command:
+
+ ```
+ pip install speechbrain
+ ```
+
+ We encourage you to read our tutorials and learn more about [SpeechBrain](https://speechbrain.github.io).
+
+ ### Perform source separation on your own audio file
+
+ ```python
+ from speechbrain.inference.separation import SepformerSeparation as separator
+ import torchaudio
+
+ model = separator.from_hparams(source="speechbrain/sepformer-whamr16k", savedir='pretrained_models/sepformer-whamr16k')
+
+ # To separate your own file, change the path below
+ est_sources = model.separate_file(path='speechbrain/sepformer-whamr16k/test_mixture16k.wav')
+
+ torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 16000)
+ torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 16000)
+ ```
+
+ The system expects input recordings sampled at 16 kHz (single channel).
+ If your signal has a different sample rate, resample it (e.g., using torchaudio or sox) before using the interface.
+
+ ### Inference on GPU
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
+
+ ### Training
+ The model was trained with SpeechBrain (fc2eabb7).
+ To train it from scratch, follow these steps:
+ 1. Clone SpeechBrain:
+ ```bash
+ git clone https://github.com/speechbrain/speechbrain/
+ ```
+ 2. Install it:
+ ```bash
+ cd speechbrain
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+
+ 3. Run training:
+ ```bash
+ cd recipes/WHAMandWHAMR/separation/
+ python train.py hparams/sepformer-whamr.yaml --data_folder=your_data_folder --sample_rate=16000
+ ```
+
+ You can find our training results (models, logs, etc.) [here](https://drive.google.com/drive/folders/1QiQhp1vi5t4UfNpNETA48_OmPiXnUy8O?usp=sharing).
+
+ ### Limitations
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
+
+ #### Referencing SpeechBrain
+
+ ```bibtex
+ @misc{speechbrain,
+   title={{SpeechBrain}: A General-Purpose Speech Toolkit},
+   author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
+   year={2021},
+   eprint={2106.04624},
+   archivePrefix={arXiv},
+   primaryClass={eess.AS},
+   note={arXiv:2106.04624}
+ }
+ ```
+
+
+ #### Referencing SepFormer
+ ```bibtex
+ @inproceedings{subakan2021attention,
+   title={Attention is All You Need in Speech Separation},
+   author={Cem Subakan and Mirco Ravanelli and Samuele Cornell and Mirko Bronzi and Jianyuan Zhong},
+   year={2021},
+   booktitle={ICASSP 2021}
+ }
+ ```
+
+ # **About SpeechBrain**
+ - Website: https://speechbrain.github.io/
+ - Code: https://github.com/speechbrain/speechbrain/
+ - HuggingFace: https://huggingface.co/speechbrain/
brain.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9e24193f36931b7f57932532efbdcf64971f42732383ba6808825f77db258f6
+ size 28
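The `*.ckpt` entries in this commit are Git LFS pointer files, not the checkpoints themselves. Each one records a spec version, a SHA-256 object id, and the size in bytes of the real file. A minimal sketch of reading such a pointer follows; the function name `parse_lfs_pointer` and the returned key names are hypothetical.

```python
def parse_lfs_pointer(text):
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields.get("oid", "").partition(":")
    return {
        "version": fields.get("version"),
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from brain.ckpt in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d9e24193f36931b7f57932532efbdcf64971f42732383ba6808825f77db258f6
size 28
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```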
config.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "speechbrain_interface": "SepformerSeparation"
+ }
counter.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dbb1ded63bc70732626c5dfe6c7f50ced3d560e970f30b15335ac290358748f6
+ size 3
dataloader-TRAIN.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39e5b4830d4d9c14db7368a95b65d5463ea3d09520373723430c03a5a453b5df
+ size 5
decoder.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d959cb46de5b15f008bf5476cf7c19bc680b310c7e10883e3eddfdbde533cb8
+ size 17272
encoder.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:31d23d395a408b887b8f6ac01e477f00bcba27f95785426359fb52d52f1dc6ed
+ size 17272
gitattributes ADDED
@@ -0,0 +1,17 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
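The patterns above decide which files are stored through Git LFS. As a rough check of which files in this commit they cover, the sketch below uses Python's `fnmatch`. This only approximates gitattributes matching and is assumed adequate for simple `*.ext` globs like these; the helper name `tracked_by_lfs` is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the attributes file above.
LFS_PATTERNS = [
    "*.bin.*", "*.lfs.*", "*.bin", "*.h5", "*.tflite", "*.tar.gz",
    "*.ot", "*.onnx", "*.arrow", "*.ftz", "*.joblib", "*.model",
    "*.msgpack", "*.pb", "*.pt", "*.pth", "*.ckpt",
]


def tracked_by_lfs(filename, patterns=LFS_PATTERNS):
    """Return True if the file name matches any of the LFS glob patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# A few of the files added in this commit.
for name in ["masknet.ckpt", "optimizer.ckpt", "hyperparams.yaml", "README.md"]:
    print(name, tracked_by_lfs(name))
```

Under this approximation only the `.ckpt` files match, which is consistent with their appearing as LFS pointers in the diff while the YAML and Markdown files appear in full.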
hyperparams.yaml ADDED
@@ -0,0 +1,65 @@
+ # ################################
+ # Model: Inference for source separation with SepFormer
+ # https://arxiv.org/abs/2010.13154
+ # Generated from speechbrain/recipes/WSJ0Mix/separation/train/hparams/sepformer-whamr-16khz.yaml
+ # Dataset : Whamr-16kHz
+ # ###############################
+
+
+ # Parameters
+ sample_rate: 16000
+ num_spks: 2
+
+ # Specifying the network
+ Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
+     kernel_size: 16
+     out_channels: 256
+
+ SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
+     num_layers: 8
+     d_model: 256
+     nhead: 8
+     d_ffn: 1024
+     dropout: 0
+     use_positional_encoding: true
+     norm_before: true
+
+ SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
+     num_layers: 8
+     d_model: 256
+     nhead: 8
+     d_ffn: 1024
+     dropout: 0
+     use_positional_encoding: true
+     norm_before: true
+
+ MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
+     num_spks: !ref <num_spks>
+     in_channels: 256
+     out_channels: 256
+     num_layers: 2
+     K: 250
+     intra_model: !ref <SBtfintra>
+     inter_model: !ref <SBtfinter>
+     norm: ln
+     linear_layer_after_inter_intra: false
+     skip_around_intra: true
+
+ Decoder: !new:speechbrain.lobes.models.dual_path.Decoder
+     in_channels: 256
+     out_channels: 1
+     kernel_size: 16
+     stride: 8
+     bias: false
+
+ modules:
+     encoder: !ref <Encoder>
+     decoder: !ref <Decoder>
+     masknet: !ref <MaskNet>
+
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         masknet: !ref <MaskNet>
+         encoder: !ref <Encoder>
+         decoder: !ref <Decoder>
+
hyperparams_train.yaml ADDED
@@ -0,0 +1,184 @@
+ # Generated 2021-05-22 from:
+ # /home/mila/s/subakany/speechbrain_new/recipes/WSJ0Mix/separation/yamls/sepformer-whamr-16k.yaml
+ # yamllint disable
+ # ################################
+ # Model: SepFormer for source separation
+ # https://arxiv.org/abs/2010.13154
+ #
+ # Dataset : WHAMR! (16 kHz)
+ # ################################
+ # Basic parameters
+ # Seed needs to be set at top of yaml, before objects with parameters are made
+ #
+ seed: 1234
+ __set_seed: !apply:torch.manual_seed [1234]
+
+ # Data params
+
+ # the data folder for the whamr dataset
+ # data_folder needs to follow the format: /yourpath/whamr.
+ # make sure to use the name whamr at your top folder for the dataset!
+ data_folder: /network/tmp1/subakany/whamr_16k
+
+ # the path for wsj0/si_tr_s/ folder -- only needed if dynamic mixing is used
+ # e.g. /yourpath/wsj0-processed/si_tr_s/
+ # you need to convert the original wsj0 to 8k
+ # you can do this conversion with the script ../meta/preprocess_dynamic_mixing.py
+ wsj0_tr: /yourpath/wsj0-processed/si_tr_s/
+
+ experiment_name: sepformer-whamr-randomreverb-16k
+ output_folder: results/sepformer-whamr-randomreverb-16k/1234
+ train_log: results/sepformer-whamr-randomreverb-16k/1234/train_log.txt
+ save_folder: results/sepformer-whamr-randomreverb-16k/1234/save
+
+ # the file names should start with whamr instead of whamorg
+ train_data: results/sepformer-whamr-randomreverb-16k/1234/save/whamr_tr.csv
+ valid_data: results/sepformer-whamr-randomreverb-16k/1234/save/whamr_cv.csv
+ test_data: results/sepformer-whamr-randomreverb-16k/1234/save/whamr_tt.csv
+ skip_prep: false
+
+ # Experiment params
+ auto_mix_prec: false # Set it to True for mixed precision
+ test_only: true
+ num_spks: 2 # set to 3 for wsj0-3mix
+ progressbar: true
+ save_audio: false # Save estimated sources on disk
+ sample_rate: 16000
+
+ # Training parameters
+ N_epochs: 200
+ batch_size: 1
+ lr: 0.00015
+ clip_grad_norm: 5
+ loss_upper_lim: 999999 # this is the upper limit for an acceptable loss
+ # if True, the training sequences are cut to a specified length
+ limit_training_signal_len: true
+ # this is the length of sequences if we choose to limit
+ # the signal length of training sequences
+ training_signal_len: 64000
+
+ # Set it to True to dynamically create mixtures at training time
+ dynamic_mixing: false
+
+ # Parameters for data augmentation
+
+ # rir_path variable points to the directory of the room impulse responses
+ # e.g. /miniscratch/subakany/rir_wavs
+ # If the path does not exist, it is created automatically.
+ rir_path: /network/tmp1/subakany/rir_wavs_16k
+
+ use_wavedrop: false
+ use_speedperturb: true
+ use_speedperturb_sameforeachsource: false
+ use_rand_shift: false
+ min_shift: -8000
+ max_shift: 8000
+
+ speedperturb: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
+     perturb_prob: 1.0
+     drop_freq_prob: 0.0
+     drop_chunk_prob: 0.0
+     sample_rate: 16000
+     speeds: [95, 100, 105]
+
+ wavedrop: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
+     perturb_prob: 0.0
+     drop_freq_prob: 1.0
+     drop_chunk_prob: 1.0
+     sample_rate: 16000
+
+ # loss thresholding -- this thresholds the training loss
+ threshold_byloss: true
+ threshold: -30
+
+ # Encoder parameters
+ N_encoder_out: 256
+ out_channels: 256
+ kernel_size: 16
+ kernel_stride: 8
+
+ # Dataloader options
+ dataloader_opts:
+     batch_size: 1
+     num_workers: 3
+
+ # Specifying the network
+ Encoder: &id003 !new:speechbrain.lobes.models.dual_path.Encoder
+     kernel_size: 16
+     out_channels: 256
+
+
+ SBtfintra: &id001 !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
+     num_layers: 8
+     d_model: 256
+     nhead: 8
+     d_ffn: 1024
+     dropout: 0
+     use_positional_encoding: true
+     norm_before: true
+
+ SBtfinter: &id002 !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
+     num_layers: 8
+     d_model: 256
+     nhead: 8
+     d_ffn: 1024
+     dropout: 0
+     use_positional_encoding: true
+     norm_before: true
+
+ MaskNet: &id005 !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
+
+     num_spks: 2
+     in_channels: 256
+     out_channels: 256
+     num_layers: 2
+     K: 250
+     intra_model: *id001
+     inter_model: *id002
+     norm: ln
+     linear_layer_after_inter_intra: false
+     skip_around_intra: true
+
+ Decoder: &id004 !new:speechbrain.lobes.models.dual_path.Decoder
+     in_channels: 256
+     out_channels: 1
+     kernel_size: 16
+     stride: 8
+     bias: false
+
+ optimizer: !name:torch.optim.Adam
+     lr: 0.00015
+     weight_decay: 0
+
+ loss: !name:speechbrain.nnet.losses.get_si_snr_with_pitwrapper
+
+ lr_scheduler: &id007 !new:speechbrain.nnet.schedulers.ReduceLROnPlateau
+
+     factor: 0.5
+     patience: 2
+     dont_halve_until_epoch: 85
+
+ epoch_counter: &id006 !new:speechbrain.utils.epoch_loop.EpochCounter
+     limit: 200
+
+ modules:
+     encoder: *id003
+     decoder: *id004
+     masknet: *id005
+ checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
+     checkpoints_dir: results/sepformer-whamr-randomreverb-16k/1234/save
+     recoverables:
+         encoder: *id003
+         decoder: *id004
+         masknet: *id005
+         counter: *id006
+         lr_scheduler: *id007
+ train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
+     save_file: results/sepformer-whamr-randomreverb-16k/1234/train_log.txt
+
+
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         masknet: !ref <MaskNet>
+         encoder: !ref <Encoder>
+         decoder: !ref <Decoder>
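To make the numbers in this training configuration concrete, here is a small arithmetic sketch relating `sample_rate`, `training_signal_len`, `kernel_size`, and `kernel_stride`. It assumes the encoder behaves like an unpadded 1-D convolution, so the frame count follows the usual `floor((L - kernel) / stride) + 1` formula. Check the `dual_path.Encoder` source before relying on the exact values. The helper name `num_encoder_frames` is hypothetical.

```python
# Values copied from the training configuration above.
SAMPLE_RATE = 16000
TRAINING_SIGNAL_LEN = 64000
KERNEL_SIZE = 16
KERNEL_STRIDE = 8


def num_encoder_frames(num_samples, kernel_size=KERNEL_SIZE, stride=KERNEL_STRIDE):
    """Frames produced by an unpadded strided 1-D convolution (assumed behaviour)."""
    if num_samples < kernel_size:
        return 0
    return (num_samples - kernel_size) // stride + 1


clip_seconds = TRAINING_SIGNAL_LEN / SAMPLE_RATE   # length of each training clip
frames = num_encoder_frames(TRAINING_SIGNAL_LEN)   # encoder frames per clip
frame_hop_ms = 1000 * KERNEL_STRIDE / SAMPLE_RATE  # time step between frames
print(clip_seconds, frames, frame_hop_ms)  # 4.0 7999 0.5
```

Under these assumptions, each 4-second training clip becomes roughly 8,000 encoder frames, which the mask network then processes in chunks of size `K: 250`.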
lr_scheduler.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d5108af5b18f633bd908c3142c8e87f80f95d3159ba00cb2c52317cfd2d8e1b5
+ size 1647
masknet.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e5fb8c690668e5d1bbbc9a8256974577093a09cd08e845e9f30024fdd33472ce
+ size 113112646
optimizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:89a9b119d5ce268c345c2ba978e8ff10a13beabc8d6ee1962b256721de9564d6
+ size 205694713