flexthink committed on
Commit 380887d
1 Parent(s): 2b84978

Modified it to create a "lite" version for cases where you only need speaker embeddings

Files changed (3):
  1. README.md +6 -39
  2. custom_interface.py +33 -115
  3. hyperparams.yaml +3 -48
README.md CHANGED
@@ -26,12 +26,13 @@ widget:
 <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
 <br/><br/>
 
-# Speaker Verification with ECAPA-TDNN embeddings with discrete_ssl input on Voxceleb
+# Standalone ECAPA-TDNN embeddings with discrete_ssl input on Voxceleb
 
-This repository provides all the necessary tools to perform speaker verification with a pretrained ECAPA-TDNN model and discrete audio input using SpeechBrain.
-The system can be used to extract speaker embeddings as well.
+This repository provides all the necessary tools to obtain speaker embeddings with a pretrained ECAPA-TDNN model and discrete audio input using SpeechBrain.
 It is trained on Voxceleb 1 training data.
 
+Adopted from poonehmousavi/discrete_wavlm_spk_rec_ecapatdn
+
 For a better experience, we encourage you to learn more about
 [SpeechBrain](https://speechbrain.github.io). The model performance on Voxceleb1-test set(Cleaned) is:
 
@@ -58,50 +59,16 @@ Please notice that we encourage you to read our tutorials and learn more about
 import torchaudio
 from speechbrain.inference.interfaces import foreign_class
 
-classifier = foreign_class(source="poonehmousavi/discrete_wavlm_spk_rec_ecapatdn", pymodule_file="custom_interface.py", classname="CustomEncoderClassifier")
+classifier = foreign_class(source="flexthink/discrete_wavlm_spk_rec_ecapatdn", pymodule_file="custom_interface.py", classname="DiscreteSpkEmb")
 
-signal, fs =torchaudio.load('tests/samples/example1.wav')
+tokens = torch.randint(4, 100, 4)
 embeddings = classifier.encode_batch(signal)
 print(embeddings.shape)
 ```
 The system is trained with recordings sampled at 16kHz (single channel).
 The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *classify_file* if needed. Make sure your input tensor is compliant with the expected sampling rate if you use *encode_batch* and *classify_batch*.
 
-<!-- ### Perform Speaker Verification
-
-```python
-from speechbrain.inference.speaker import SpeakerRecognition
-verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
-score, prediction = verification.verify_files("tests/samples/ASR/spk1_snt1.wav", "tests/samples/ASR/spk2_snt1.wav") # Different Speakers
-score, prediction = verification.verify_files("tests/samples/ASR/spk1_snt1.wav", "tests/samples/ASR/spk1_snt2.wav") # Same Speaker
-```
-The prediction is 1 if the two signals in input are from the same speaker and 0 otherwise. -->
-
-<!-- ### Inference on GPU
-To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
-
-### Training
-The model was trained with SpeechBrain (aa018540).
-To train it from scratch follows these steps:
-1. Clone SpeechBrain:
-```bash
-git clone https://github.com/speechbrain/speechbrain/
-```
-2. Install it:
-```
-cd speechbrain
-pip install -r requirements.txt
-pip install -e .
-```
-
-3. Run Training:
-```
-cd recipes/VoxCeleb/SpeakerRec
-python train_speaker_embeddings.py hparams/train_ecapa_tdnn.yaml --data_folder=your_data_folder
-```
 
-You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/1-ahC1xeyPinAHp2oAohL-02smNWO41Cc?usp=sharing).
--->
 ### Limitations
 The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
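A note on the updated snippet: as shown in the diff, it now builds `tokens` but still passes `signal` to `encode_batch`, and `torch.randint` expects a tuple for its size argument. A minimal, self-contained sketch of the intended token-based call is given below; the random tokens, the 100-frame length, and the `[batch, time, num_codebooks]` shape with `num_codebooks: 6` and a 1000-entry vocabulary (both taken from hyperparams.yaml) are placeholder assumptions, not real codec output.

```python
import torch
from speechbrain.inference.interfaces import foreign_class

# Load the lite interface defined in custom_interface.py (names taken from the README above).
classifier = foreign_class(
    source="flexthink/discrete_wavlm_spk_rec_ecapatdn",
    pymodule_file="custom_interface.py",
    classname="DiscreteSpkEmb",
)

# Placeholder tokens: [batch, time, num_codebooks], values in [0, num_clusters).
tokens = torch.randint(0, 1000, (1, 100, 6))

embeddings = classifier.encode_batch(tokens)
print(embeddings.shape)  # should print torch.Size([1, 192]) given lin_neurons: 192
```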
 
custom_interface.py CHANGED
@@ -60,6 +60,8 @@ class Discrete_EmbeddingLayer(torch.nn.Module):
         pad_index=0,
         init=False,
         freeze=False,
+        available_layers=None,
+        layers=None,
     ):
         super(Discrete_EmbeddingLayer, self).__init__()
         self.vocab_size = vocab_size
@@ -69,11 +71,26 @@ class Discrete_EmbeddingLayer(torch.nn.Module):
             num_codebooks * vocab_size, emb_dim
         ).requires_grad_(not self.freeze)
         self.init = init
+        self.layers = layers
+        self.available_layers = available_layers
+        self.offsets = self.build_offsets()
 
     def init_embedding(self, weights):
         with torch.no_grad():
             self.embedding.weight = torch.nn.Parameter(weights)
 
+    def build_offsets(self):
+        offsets = torch.arange(
+            0,
+            self.num_codebooks * self.vocab_size,
+            self.vocab_size,
+        )
+        if self.layers:
+            selected_layers = set(self.layers)
+            indexes = [idx for idx, layer in enumerate(self.layers) if layer in selected_layers]
+            offsets = offsets[indexes]
+        return offsets
+
     def forward(self, in_tokens):
         """Computes the embedding for discrete tokens.
         a sample.
@@ -89,17 +106,13 @@ class Discrete_EmbeddingLayer(torch.nn.Module):
         """
         with torch.set_grad_enabled(not self.freeze):
             # Add unique token IDs across diffrent codebooks by adding num_codebooks * vocab_size
-            in_tokens += torch.arange(
-                0,
-                self.num_codebooks * self.vocab_size,
-                self.vocab_size,
-                device=in_tokens.device,
-            )
+            in_tokens_offset = in_tokens + self.offsets.to(in_tokens.device)
             # Forward Pass to embedding and
-            in_embs = self.embedding(in_tokens)
+            in_embs = self.embedding(in_tokens_offset.int())
             return in_embs
 
-class CustomEncoderClassifier(Pretrained):
+
+class DiscreteSpkEmb(Pretrained):
     """A ready-to-use class for utterance-level classification (e.g, speaker-id,
     language-id, emotion recognition, keyword spotting, etc).
     The class assumes that an self-supervised encoder like wav2vec2/hubert and a classifier model
@@ -129,126 +142,31 @@ class CustomEncoderClassifier(Pretrained):
 
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
-        self.similarity = torch.nn.CosineSimilarity(dim=-1, eps=1e-6)
 
-    def encode_batch(self, wavs, wav_lens=None, normalize=False):
+    def encode_batch(self, audio, length=None):
         """Encodes the input audio into a single vector embedding.
         The waveforms should already be in the model's desired format.
-        You can call:
-        ``normalized = <this>.normalizer(signal, sample_rate)``
-        to get a correctly converted signal in most cases.
         Arguments
         ---------
-        wavs : torch.tensor
-            Batch of waveforms [batch, time, channels] or [batch, time]
-            depending on the model. Make sure the sample rate is fs=16000 Hz.
-        wav_lens : torch.tensor
+        audio : torch.tensor
+            Batch of tokenized audio [batch, time, heads]
+        length : torch.tensor
             Lengths of the waveforms relative to the longest one in the
            batch, tensor of shape [batch]. The longest one should have
            relative length 1.0 and others len(waveform) / max_length.
            Used for ignoring padding.
-        normalize : bool
-            If True, it normalizes the embeddings with the statistics
-            contained in mean_var_norm_emb.
+
         Returns
         -------
         torch.tensor
             The encoded batch
        """
        # Manage single waveforms in input
-        if len(wavs.shape) == 1:
-            wavs = wavs.unsqueeze(0)
-
-        # Assign full length if wav_lens is not assigned
-        if wav_lens is None:
-            wav_lens = torch.ones(wavs.shape[0], device=self.device)
-
-        # Storing waveform in the specified device
-        wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
-        wavs = wavs.float()
-
-        with torch.no_grad():
-            self.hparams.codec.to(self.device).eval()
-            tokens, _, _ = self.hparams.codec(
-                wavs, wav_lens, **self.hparams.tokenizer_config
-            )
-            embeddings = self.mods.discrete_embedding_layer(tokens)
-            att_w = self.mods.attention_mlp(embeddings)
-            feats = torch.matmul(att_w.transpose(2, -1), embeddings).squeeze(-2)
-            embeddings = self.mods.embedding_model(feats, wav_lens)
+        embeddings = self.mods.discrete_embedding_layer(audio)
+        att_w = self.mods.attention_mlp(embeddings)
+        feats = torch.matmul(att_w.transpose(2, -1), embeddings).squeeze(-2)
+        embeddings = self.mods.embedding_model(feats, length)
         return embeddings.squeeze(1)
-
-
-    def verify_batch(
-        self, wavs1, wavs2, wav1_lens=None, wav2_lens=None, threshold=0.25
-    ):
-        """Performs speaker verification with cosine distance.
-
-        It returns the score and the decision (0 different speakers,
-            1 same speakers).
-
-        Arguments
-        ---------
-        wavs1 : Torch.Tensor
-            torch.Tensor containing the speech waveform1 (batch, time).
-            Make sure the sample rate is fs=16000 Hz.
-        wavs2 : Torch.Tensor
-            torch.Tensor containing the speech waveform2 (batch, time).
-            Make sure the sample rate is fs=16000 Hz.
-        wav1_lens : Torch.Tensor
-            torch.Tensor containing the relative length for each sentence
-            in the length (e.g., [0.8 0.6 1.0])
-        wav2_lens : Torch.Tensor
-            torch.Tensor containing the relative length for each sentence
-            in the length (e.g., [0.8 0.6 1.0])
-        threshold : Float
-            Threshold applied to the cosine distance to decide if the
-            speaker is different (0) or the same (1).
-
-        Returns
-        -------
-        score
-            The score associated to the binary verification output
-            (cosine distance).
-        prediction
-            The prediction is 1 if the two signals in input are from the same
-            speaker and 0 otherwise.
-        """
-        emb1 = self.encode_batch(wavs1, wav1_lens, normalize=False)
-        emb2 = self.encode_batch(wavs2, wav2_lens, normalize=False)
-        score = self.similarity(emb1, emb2)
-        return score, score > threshold
-
-    def verify_files(self, path_x, path_y, **kwargs):
-        """Speaker verification with cosine distance
-
-        Returns the score and the decision (0 different speakers,
-            1 same speakers).
-
-        Arguments
-        ---------
-        path_x : str
-            Path to file x
-        path_y : str
-            Path to file y
-        **kwargs : dict
-            Arguments to ``load_audio``
-
-        Returns
-        -------
-        score
-            The score associated to the binary verification output
-            (cosine distance).
-        prediction
-            The prediction is 1 if the two signals in input are from the same
-            speaker and 0 otherwise.
-        """
-        waveform_x = self.load_audio(path_x, **kwargs)
-        waveform_y = self.load_audio(path_y, **kwargs)
-        # Fake batches:
-        batch_x = waveform_x.unsqueeze(0)
-        batch_y = waveform_y.unsqueeze(0)
-        # Verify:
-        score, decision = self.verify_batch(batch_x, batch_y)
-        # Squeeze:
-        return score[0], decision[0]
+
+    def forward(self, audio, length=None):
+        return self.encode_batch(audio, length)
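To make the new token-to-embedding path concrete, here is a standalone sketch of the indexing that `build_offsets` and `forward` perform: every codebook gets its own slice of one flat embedding table, and broadcasting the per-codebook offsets over the last token dimension turns codebook-local IDs into rows of that table. The sizes and layer lists below are toy values chosen for illustration (the released model uses `vocab_size: 1000` and `num_codebooks: 6`), and the subset selection at the end is only one reading of how `available_layers` and `layers` are intended to interact.

```python
import torch

vocab_size, num_codebooks, emb_dim = 10, 3, 4  # toy sizes for illustration

# One flat table: entry r of codebook c lives at row c * vocab_size + r.
embedding = torch.nn.Embedding(num_codebooks * vocab_size, emb_dim)

# Per-codebook offsets, as in Discrete_EmbeddingLayer.build_offsets(): 0, 10, 20
offsets = torch.arange(0, num_codebooks * vocab_size, vocab_size)

# Tokens shaped [batch, time, num_codebooks], each value in [0, vocab_size).
tokens = torch.randint(0, vocab_size, (2, 5, num_codebooks))

# Broadcasting adds a different offset to each codebook; one lookup serves all of them.
embs = embedding((tokens + offsets).int())
print(embs.shape)  # torch.Size([2, 5, 3, 4])

# Keeping only a subset of the available SSL layers (hypothetical values):
# the offsets of the retained positions are picked by index into the available-layer list.
available_layers, layers = [1, 3, 7], [1, 7]
keep = [i for i, layer in enumerate(available_layers) if layer in set(layers)]
print(offsets[keep])  # offsets for layers 1 and 7 only
```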
hyperparams.yaml CHANGED
@@ -8,7 +8,6 @@ n_mels: 80
 # Pretrain folder (HuggingFace)
 pretrained_path: poonehmousavi/discrete_wavlm_spk_rec_ecapatdn
 # Output parameters
-out_n_neurons: 1211
 save_folder: tmp
 
 ### Configuration for discrete SSL model
@@ -30,6 +29,7 @@ num_clusters: 1000
 # deduplicate: [False, False, False, False]
 # bpe_tokenizer_path: [null , null, null, null]
 ssl_layer_num: [1, 3, 7, 12, 18, 23]
+ssl_layer_num_selected: [1, 3, 7, 12, 18, 23]
 num_codebooks: 6
 deduplicate: [False, False, False, False, False, False]
 bpe_tokenizer_path: [null, null, null, null, null, null]
@@ -43,42 +43,12 @@ tokenizer_config:
     deduplicates: !ref <deduplicate>
     bpe_tokenizers: !ref <bpe_tokenizer_path>
 
-ssl_model: !apply:speechbrain.utils.hparams.choice
-    value: !ref <ssl_model_type>
-    choices:
-        wavlm: !new:speechbrain.lobes.models.huggingface_transformers.wavlm.WavLM
-            source: !ref <ssl_hub>
-            output_norm: False
-            freeze: !ref <freeze_ssl>
-            freeze_feature_extractor: !ref <freeze_feature_extractor>
-            output_all_hiddens: True
-            save_path: !ref <ssl_folder>
-        hubert: !new:speechbrain.lobes.models.huggingface_transformers.hubert.HuBERT
-            source: !ref <ssl_hub>
-            output_norm: False
-            freeze: !ref <freeze_ssl>
-            freeze_feature_extractor: !ref <freeze_feature_extractor>
-            output_all_hiddens: True
-            save_path: !ref <ssl_folder>
-        wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
-            source: !ref <ssl_hub>
-            output_norm: False
-            freeze: !ref <freeze_ssl>
-            freeze_feature_extractor: !ref <freeze_feature_extractor>
-            output_all_hiddens: True
-            save_path: !ref <ssl_folder>
-
-codec: !new:speechbrain.lobes.models.huggingface_transformers.discrete_ssl.DiscreteSSL
-    save_path: !ref <kmeans_cache_dir>
-    ssl_model: !ref <ssl_model>
-    kmeans_dataset: !ref <kmeans_dataset>
-    kmeans_repo_id: !ref <kmeans_repo_id>
-    num_clusters: !ref <num_clusters>
-
 discrete_embedding_layer: !new:custom_interface.Discrete_EmbeddingLayer
     num_codebooks: !ref <num_codebooks>
     vocab_size: !ref <num_clusters>
     emb_dim: !ref <encoder_dim>
+    available_layers: !ref <ssl_layer_num>
+    layers: !ref <ssl_layer_num_selected>
 
 attention_mlp: !new:custom_interface.AttentionMLP
     input_dim: !ref <encoder_dim>
@@ -93,36 +63,21 @@ embedding_model: !new:speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN
     attention_channels: 128
     lin_neurons: 192
 
-classifier: !new:speechbrain.lobes.models.ECAPA_TDNN.Classifier
-    input_size: 192
-    out_neurons: !ref <out_n_neurons>
-
-
-
 modules:
     embedding_model: !ref <embedding_model>
-    classifier: !ref <classifier>
     attention_mlp: !ref <attention_mlp>
-    codec: !ref <codec>
     discrete_embedding_layer: !ref <discrete_embedding_layer>
 
 
-label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder
-
-
 pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        embedding_model: !ref <embedding_model>
-        classifier: !ref <classifier>
        attention_mlp: !ref <attention_mlp>
        discrete_embedding_layer: !ref <discrete_embedding_layer>
 
    paths:
        embedding_model: !ref <pretrained_path>/embedding_model.ckpt
-        classifier: !ref <pretrained_path>/classifier.ckpt
        attention_mlp: !ref <pretrained_path>/attention_mlp.ckpt
-        label_encoder: !ref <pretrained_path>/label_encoder.txt
        discrete_embedding_layer: !ref <pretrained_path>/discrete_embedding_layer.ckpt
 
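With the codec, classifier, and label encoder removed, the YAML now only instantiates the token embedding layer, the attention MLP, and the ECAPA-TDNN, plus a Pretrainer for their three checkpoints. Below is a rough sketch of driving the file directly with hyperpyyaml rather than `foreign_class`; it assumes `hyperparams.yaml` and `custom_interface.py` sit in an importable working directory and that the checkpoints listed under `paths:` can be fetched, and it merely mirrors the pipeline in `DiscreteSpkEmb.encode_batch` rather than being the repository's documented entry point.

```python
import torch
from hyperpyyaml import load_hyperpyyaml

# Parse the YAML; this instantiates the modules declared above.
with open("hyperparams.yaml") as f:
    hparams = load_hyperpyyaml(f)

# Fetch and load the three pretrained checkpoints listed under `pretrainer`.
pretrainer = hparams["pretrainer"]
pretrainer.collect_files()
pretrainer.load_collected()

# Same pipeline as DiscreteSpkEmb.encode_batch, run by hand on placeholder tokens.
tokens = torch.randint(0, hparams["num_clusters"], (1, 200, hparams["num_codebooks"]))
embs = hparams["discrete_embedding_layer"](tokens)
att_w = hparams["attention_mlp"](embs)
feats = torch.matmul(att_w.transpose(2, -1), embs).squeeze(-2)
spk_emb = hparams["embedding_model"](feats, torch.ones(1)).squeeze(1)
print(spk_emb.shape)  # [batch, lin_neurons]
```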