Modified it to create a "lite" version for cases where you only need speaker embeddings

Files changed (3)

README.md +6 -39
custom_interface.py +33 -115
hyperparams.yaml +3 -48

README.md CHANGED Viewed

@@ -26,12 +26,13 @@ widget:
 <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
 <br/><br/>
-# Speaker Verification with ECAPA-TDNN embeddings with discrete_ssl input on Voxceleb
-This repository provides all the necessary tools to perform speaker verification with a pretrained ECAPA-TDNN model and discrete audio input using SpeechBrain.
-The system can be used to extract speaker embeddings as well.
 It is trained on Voxceleb 1 training data.
 For a better experience, we encourage you to learn more about
 [SpeechBrain](https://speechbrain.github.io). The model performance on Voxceleb1-test set (Cleaned) is:
@@ -58,50 +59,16 @@ Please notice that we encourage you to read our tutorials and learn more about
 import torchaudio
 from speechbrain.inference.interfaces import foreign_class
-classifier = foreign_class(source="poonehmousavi/discrete_wavlm_spk_rec_ecapatdn", pymodule_file="custom_interface.py", classname="CustomEncoderClassifier")
-signal, fs =torchaudio.load('tests/samples/example1.wav')
 embeddings = classifier.encode_batch(signal)
 print(embeddings.shape)
 ```
 The system is trained with recordings sampled at 16kHz (single channel).
 The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *classify_file* if needed. Make sure your input tensor is compliant with the expected sampling rate if you use *encode_batch* and *classify_batch*.
-<!-- ### Perform Speaker Verification
-```python
-from speechbrain.inference.speaker import SpeakerRecognition
-verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
-score, prediction = verification.verify_files("tests/samples/ASR/spk1_snt1.wav", "tests/samples/ASR/spk2_snt1.wav") # Different Speakers
-score, prediction = verification.verify_files("tests/samples/ASR/spk1_snt1.wav", "tests/samples/ASR/spk1_snt2.wav") # Same Speaker
-```
- The prediction is 1 if the two signals in input are from the same speaker and 0 otherwise. -->
-<!-- ### Inference on GPU
-To perform inference on the GPU, add  `run_opts={"device":"cuda"}`  when calling the `from_hparams` method.
-### Training
-The model was trained with SpeechBrain (aa018540).
-To train it from scratch follows these steps:
-1. Clone SpeechBrain:
-```bash
-git clone https://github.com/speechbrain/speechbrain/
-```
-2. Install it:
-```
-cd speechbrain
-pip install -r requirements.txt
-pip install -e .
-```
-3. Run Training:
-```
-cd  recipes/VoxCeleb/SpeakerRec
-python train_speaker_embeddings.py hparams/train_ecapa_tdnn.yaml --data_folder=your_data_folder
-```
-You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/1-ahC1xeyPinAHp2oAohL-02smNWO41Cc?usp=sharing).
- -->
 ### Limitations
 The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

 <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
 <br/><br/>
+# Standalone ECAPA-TDNN embeddings with discrete_ssl input on Voxceleb
+This repository provides all the necessary tools to obtain speaker embeddings with a pretrained ECAPA-TDNN model and discrete audio input using SpeechBrain.
 It is trained on Voxceleb 1 training data.
+Adapted from poonehmousavi/discrete_wavlm_spk_rec_ecapatdn
 For a better experience, we encourage you to learn more about
 [SpeechBrain](https://speechbrain.github.io). The model performance on Voxceleb1-test set (Cleaned) is:
 import torchaudio
 from speechbrain.inference.interfaces import foreign_class
+classifier = foreign_class(source="flexthink/discrete_wavlm_spk_rec_ecapatdn", pymodule_file="custom_interface.py", classname="DiscreteSpkEmb")
+import torch
+tokens = torch.randint(0, 1000, (4, 100, 6))  # random token IDs, [batch, time, heads]
 embeddings = classifier.encode_batch(tokens)
 print(embeddings.shape)
 ```
 The system is trained with recordings sampled at 16kHz (single channel).
 The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *classify_file* if needed. Make sure your input tensor is compliant with the expected sampling rate if you use *encode_batch* and *classify_batch*.
 ### Limitations
 The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
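
In this lite version, `encode_batch` takes token IDs shaped `[batch, time, heads]` plus relative lengths instead of waveforms (see the `encode_batch` docstring in `custom_interface.py`). The following is a minimal sketch of how such a batch might be prepared in plain Python. `pad_token_batch` is a hypothetical helper, not part of SpeechBrain, and padding with index 0 is an assumption based on the `pad_index=0` default of `Discrete_EmbeddingLayer`.

```python
def pad_token_batch(sequences, pad_index=0):
    """Pad variable-length token sequences to a common length.

    sequences: list of utterances; each utterance is a list of frames,
               and each frame is a list of token IDs (one per head/codebook).
    Returns (padded, rel_lengths). rel_lengths follows the convention in the
    docstring: 1.0 for the longest item, len(item) / max_len for the others.
    """
    max_len = max(len(seq) for seq in sequences)
    num_heads = len(sequences[0][0])
    padded = [
        seq + [[pad_index] * num_heads for _ in range(max_len - len(seq))]
        for seq in sequences
    ]
    rel_lengths = [len(seq) / max_len for seq in sequences]
    return padded, rel_lengths


# Two utterances with 6 heads each: 4 frames and 2 frames (illustrative values).
utt_a = [[1, 2, 3, 4, 5, 6]] * 4
utt_b = [[7, 8, 9, 10, 11, 12]] * 2
batch, lengths = pad_token_batch([utt_a, utt_b])
```

The nested lists would then be converted with `torch.tensor(batch)` and `torch.tensor(lengths)` before being passed as `audio` and `length`.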

custom_interface.py CHANGED Viewed

@@ -60,6 +60,8 @@ class Discrete_EmbeddingLayer(torch.nn.Module):
         pad_index=0,
         init=False,
         freeze=False,
     ):
         super(Discrete_EmbeddingLayer, self).__init__()
         self.vocab_size = vocab_size
@@ -69,11 +71,26 @@ class Discrete_EmbeddingLayer(torch.nn.Module):
             num_codebooks * vocab_size, emb_dim
         ).requires_grad_(not self.freeze)
         self.init = init
     def init_embedding(self, weights):
         with torch.no_grad():
             self.embedding.weight = torch.nn.Parameter(weights)
     def forward(self, in_tokens):
         """Computes the embedding for discrete tokens.
         a sample.
@@ -89,17 +106,13 @@ class Discrete_EmbeddingLayer(torch.nn.Module):
         """
         with torch.set_grad_enabled(not self.freeze):
             # Make token IDs unique across codebooks by adding a per-codebook offset (codebook index * vocab_size)
-            in_tokens += torch.arange(
-                0,
-                self.num_codebooks * self.vocab_size,
-                self.vocab_size,
-                device=in_tokens.device,
-            )
             # Forward pass through the embedding layer
-            in_embs = self.embedding(in_tokens)
             return in_embs
-class CustomEncoderClassifier(Pretrained):
     """A ready-to-use class for utterance-level classification (e.g., speaker-id,
     language-id, emotion recognition, keyword spotting, etc.).
     The class assumes that a self-supervised encoder like wav2vec2/hubert and a classifier model
@@ -129,126 +142,31 @@ class CustomEncoderClassifier(Pretrained):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
-        self.similarity = torch.nn.CosineSimilarity(dim=-1, eps=1e-6)
-    def encode_batch(self, wavs, wav_lens=None, normalize=False):
         """Encodes the input audio into a single vector embedding.
         The waveforms should already be in the model's desired format.
-        You can call:
-        ``normalized = <this>.normalizer(signal, sample_rate)``
-        to get a correctly converted signal in most cases.
         Arguments
         ---------
-        wavs : torch.tensor
-            Batch of waveforms [batch, time, channels] or [batch, time]
-            depending on the model. Make sure the sample rate is fs=16000 Hz.
-        wav_lens : torch.tensor
             Lengths of the waveforms relative to the longest one in the
             batch, tensor of shape [batch]. The longest one should have
             relative length 1.0 and others len(waveform) / max_length.
             Used for ignoring padding.
-        normalize : bool
-            If True, it normalizes the embeddings with the statistics
-            contained in mean_var_norm_emb.
         Returns
         -------
         torch.tensor
             The encoded batch
         """
         # Manage single waveforms in input
-        if len(wavs.shape) == 1:
-            wavs = wavs.unsqueeze(0)
-        # Assign full length if wav_lens is not assigned
-        if wav_lens is None:
-            wav_lens = torch.ones(wavs.shape[0], device=self.device)
-        # Storing waveform in the specified device
-        wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
-        wavs = wavs.float()
-        with torch.no_grad():
-            self.hparams.codec.to(self.device).eval()
-            tokens, _, _ = self.hparams.codec(
-                wavs, wav_lens, **self.hparams.tokenizer_config
-            )
-            embeddings = self.mods.discrete_embedding_layer(tokens)
-            att_w = self.mods.attention_mlp(embeddings)
-            feats = torch.matmul(att_w.transpose(2, -1), embeddings).squeeze(-2)
-            embeddings = self.mods.embedding_model(feats, wav_lens)
         return embeddings.squeeze(1)
-    def verify_batch(
-        self, wavs1, wavs2, wav1_lens=None, wav2_lens=None, threshold=0.25
-    ):
-        """Performs speaker verification with cosine distance.
-        It returns the score and the decision (0 different speakers,
-        1 same speakers).
-        Arguments
-        ---------
-        wavs1 : Torch.Tensor
-            torch.Tensor containing the speech waveform1 (batch, time).
-            Make sure the sample rate is fs=16000 Hz.
-        wavs2 : Torch.Tensor
-            torch.Tensor containing the speech waveform2 (batch, time).
-            Make sure the sample rate is fs=16000 Hz.
-        wav1_lens : Torch.Tensor
-            torch.Tensor containing the relative length for each sentence
-            in the length (e.g., [0.8 0.6 1.0])
-        wav2_lens : Torch.Tensor
-            torch.Tensor containing the relative length for each sentence
-            in the length (e.g., [0.8 0.6 1.0])
-        threshold : Float
-            Threshold applied to the cosine distance to decide if the
-            speaker is different (0) or the same (1).
-        Returns
-        -------
-        score
-            The score associated to the binary verification output
-            (cosine distance).
-        prediction
-            The prediction is 1 if the two signals in input are from the same
-            speaker and 0 otherwise.
-        """
-        emb1 = self.encode_batch(wavs1, wav1_lens, normalize=False)
-        emb2 = self.encode_batch(wavs2, wav2_lens, normalize=False)
-        score = self.similarity(emb1, emb2)
-        return score, score > threshold
-    def verify_files(self, path_x, path_y, **kwargs):
-        """Speaker verification with cosine distance
-        Returns the score and the decision (0 different speakers,
-        1 same speakers).
-        Arguments
-        ---------
-        path_x : str
-            Path to file x
-        path_y : str
-            Path to file y
-        **kwargs : dict
-            Arguments to ``load_audio``
-        Returns
-        -------
-        score
-            The score associated to the binary verification output
-            (cosine distance).
-        prediction
-            The prediction is 1 if the two signals in input are from the same
-            speaker and 0 otherwise.
-        """
-        waveform_x = self.load_audio(path_x, **kwargs)
-        waveform_y = self.load_audio(path_y, **kwargs)
-        # Fake batches:
-        batch_x = waveform_x.unsqueeze(0)
-        batch_y = waveform_y.unsqueeze(0)
-        # Verify:
-        score, decision = self.verify_batch(batch_x, batch_y)
-        # Squeeze:
-        return score[0], decision[0]

         pad_index=0,
         init=False,
         freeze=False,
+        available_layers=None,
+        layers=None,
     ):
         super(Discrete_EmbeddingLayer, self).__init__()
         self.vocab_size = vocab_size
             num_codebooks * vocab_size, emb_dim
         ).requires_grad_(not self.freeze)
         self.init = init
+        self.layers = layers
+        self.available_layers = available_layers
+        self.offsets = self.build_offsets()
     def init_embedding(self, weights):
         with torch.no_grad():
             self.embedding.weight = torch.nn.Parameter(weights)
+    def build_offsets(self):
+        offsets = torch.arange(
+            0,
+            self.num_codebooks * self.vocab_size,
+            self.vocab_size,
+        )
+        if self.layers:
+            selected_layers = set(self.layers)
+            indexes = [idx for idx, layer in enumerate(self.available_layers) if layer in selected_layers]
+            offsets = offsets[indexes]
+        return offsets
     def forward(self, in_tokens):
         """Computes the embedding for discrete tokens.
         a sample.
         """
         with torch.set_grad_enabled(not self.freeze):
             # Make token IDs unique across codebooks by adding a per-codebook offset (codebook index * vocab_size)
+            in_tokens_offset = in_tokens + self.offsets.to(in_tokens.device)
             # Forward pass through the embedding layer
+            in_embs = self.embedding(in_tokens_offset.int())
             return in_embs
+class DiscreteSpkEmb(Pretrained):
     """A ready-to-use class for utterance-level classification (e.g., speaker-id,
     language-id, emotion recognition, keyword spotting, etc.).
     The class assumes that a self-supervised encoder like wav2vec2/hubert and a classifier model
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
+    def encode_batch(self, audio, length=None):
         """Encodes the input audio into a single vector embedding.
         The waveforms should already be in the model's desired format.
         Arguments
         ---------
+        audio : torch.tensor
+            Batch of tokenized audio [batch, time, heads]
+        length : torch.tensor
             Lengths of the waveforms relative to the longest one in the
             batch, tensor of shape [batch]. The longest one should have
             relative length 1.0 and others len(waveform) / max_length.
             Used for ignoring padding.
         Returns
         -------
         torch.tensor
             The encoded batch
         """
         # Manage single waveforms in input
+        embeddings = self.mods.discrete_embedding_layer(audio)
+        att_w = self.mods.attention_mlp(embeddings)
+        feats = torch.matmul(att_w.transpose(2, -1), embeddings).squeeze(-2)
+        embeddings = self.mods.embedding_model(feats, length)
         return embeddings.squeeze(1)
+    def forward(self, audio, length=None):
+        return self.encode_batch(audio, length)
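
The `torch.matmul(att_w.transpose(2, -1), embeddings).squeeze(-2)` step in `encode_batch` appears to collapse the codebook axis into one feature vector per frame through a weighted sum. The sketch below reproduces that idea for a single frame in plain Python. It assumes `embeddings` has shape `[batch, time, codebooks, dim]` and `att_w` has shape `[batch, time, codebooks, 1]`, and that the weights are softmax-normalized across codebooks; `AttentionMLP` is not shown in this diff, so the normalization is an assumption, and `pool_codebooks` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_codebooks(frame_embs, scores):
    """Weighted sum over codebooks for one frame.

    frame_embs: list of C embedding vectors, each of length D.
    scores: list of C unnormalized attention scores (assumed softmaxed here).
    Returns a single vector of length D.
    """
    weights = softmax(scores)
    dim = len(frame_embs[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, frame_embs))
        for d in range(dim)
    ]


# Three codebooks, 2-dimensional embeddings, equal scores (illustrative values).
frame_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = pool_codebooks(frame_embs, [0.0, 0.0, 0.0])
```

With equal scores every codebook contributes the same weight, so the result is the plain average of the three vectors.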

hyperparams.yaml CHANGED Viewed

@@ -8,7 +8,6 @@ n_mels: 80
 # Pretrain folder (HuggingFace)
 pretrained_path: poonehmousavi/discrete_wavlm_spk_rec_ecapatdn
 # Output parameters
-out_n_neurons: 1211
 save_folder: tmp
 ### Configuration for  discrete SSL model
@@ -30,6 +29,7 @@ num_clusters: 1000
 # deduplicate: [False, False, False, False]
 # bpe_tokenizer_path: [null , null,  null, null]
 ssl_layer_num: [1, 3, 7, 12, 18, 23]
 num_codebooks: 6
 deduplicate: [False, False, False, False, False, False]
 bpe_tokenizer_path: [null, null, null, null, null, null]
@@ -43,42 +43,12 @@ tokenizer_config:
     deduplicates: !ref <deduplicate>
     bpe_tokenizers: !ref <bpe_tokenizer_path>
-ssl_model: !apply:speechbrain.utils.hparams.choice
-    value: !ref <ssl_model_type>
-    choices:
-        wavlm: !new:speechbrain.lobes.models.huggingface_transformers.wavlm.WavLM
-            source: !ref <ssl_hub>
-            output_norm: False
-            freeze: !ref <freeze_ssl>
-            freeze_feature_extractor: !ref <freeze_feature_extractor>
-            output_all_hiddens: True
-            save_path: !ref <ssl_folder>
-        hubert: !new:speechbrain.lobes.models.huggingface_transformers.hubert.HuBERT
-            source: !ref <ssl_hub>
-            output_norm: False
-            freeze: !ref <freeze_ssl>
-            freeze_feature_extractor: !ref <freeze_feature_extractor>
-            output_all_hiddens: True
-            save_path: !ref <ssl_folder>
-        wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
-            source: !ref <ssl_hub>
-            output_norm: False
-            freeze: !ref <freeze_ssl>
-            freeze_feature_extractor: !ref <freeze_feature_extractor>
-            output_all_hiddens: True
-            save_path: !ref <ssl_folder>
-codec: !new:speechbrain.lobes.models.huggingface_transformers.discrete_ssl.DiscreteSSL
-    save_path: !ref <kmeans_cache_dir>
-    ssl_model: !ref <ssl_model>
-    kmeans_dataset: !ref <kmeans_dataset>
-    kmeans_repo_id: !ref <kmeans_repo_id>
-    num_clusters: !ref <num_clusters>
 discrete_embedding_layer: !new:custom_interface.Discrete_EmbeddingLayer
     num_codebooks: !ref <num_codebooks>
     vocab_size: !ref <num_clusters>
     emb_dim: !ref <encoder_dim>
 attention_mlp: !new:custom_interface.AttentionMLP
     input_dim: !ref <encoder_dim>
@@ -93,36 +63,21 @@ embedding_model: !new:speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN
     attention_channels: 128
     lin_neurons: 192
-classifier: !new:speechbrain.lobes.models.ECAPA_TDNN.Classifier
-    input_size: 192
-    out_neurons: !ref <out_n_neurons>
 modules:
     embedding_model: !ref <embedding_model>
-    classifier: !ref <classifier>
     attention_mlp: !ref <attention_mlp>
-    codec: !ref <codec>
     discrete_embedding_layer: !ref <discrete_embedding_layer>
-label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder
 pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
     loadables:
         embedding_model: !ref <embedding_model>
-        classifier: !ref <classifier>
         attention_mlp: !ref <attention_mlp>
         discrete_embedding_layer: !ref <discrete_embedding_layer>
-        label_encoder: !ref <label_encoder>
     paths:
         embedding_model: !ref <pretrained_path>/embedding_model.ckpt
-        classifier: !ref <pretrained_path>/classifier.ckpt
         attention_mlp: !ref <pretrained_path>/attention_mlp.ckpt
-        label_encoder: !ref <pretrained_path>/label_encoder.txt
         discrete_embedding_layer: !ref <pretrained_path>/discrete_embedding_layer.ckpt

 # Pretrain folder (HuggingFace)
 pretrained_path: poonehmousavi/discrete_wavlm_spk_rec_ecapatdn
 # Output parameters
 save_folder: tmp
 ### Configuration for  discrete SSL model
 # deduplicate: [False, False, False, False]
 # bpe_tokenizer_path: [null , null,  null, null]
 ssl_layer_num: [1, 3, 7, 12, 18, 23]
+ssl_layer_num_selected: [1, 3, 7, 12, 18, 23]
 num_codebooks: 6
 deduplicate: [False, False, False, False, False, False]
 bpe_tokenizer_path: [null, null, null, null, null, null]
     deduplicates: !ref <deduplicate>
     bpe_tokenizers: !ref <bpe_tokenizer_path>
 discrete_embedding_layer: !new:custom_interface.Discrete_EmbeddingLayer
     num_codebooks: !ref <num_codebooks>
     vocab_size: !ref <num_clusters>
     emb_dim: !ref <encoder_dim>
+    available_layers: !ref <ssl_layer_num>
+    layers: !ref <ssl_layer_num_selected>
 attention_mlp: !new:custom_interface.AttentionMLP
     input_dim: !ref <encoder_dim>
     attention_channels: 128
     lin_neurons: 192
 modules:
     embedding_model: !ref <embedding_model>
     attention_mlp: !ref <attention_mlp>
     discrete_embedding_layer: !ref <discrete_embedding_layer>
 pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
     loadables:
         embedding_model: !ref <embedding_model>
         attention_mlp: !ref <attention_mlp>
         discrete_embedding_layer: !ref <discrete_embedding_layer>
     paths:
         embedding_model: !ref <pretrained_path>/embedding_model.ckpt
         attention_mlp: !ref <pretrained_path>/attention_mlp.ckpt
         discrete_embedding_layer: !ref <pretrained_path>/discrete_embedding_layer.ckpt
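
The `available_layers` and `layers` entries passed to `Discrete_EmbeddingLayer` control which codebook offsets are kept, so that tokens from a subset of SSL layers still index the right rows of the shared embedding table. A minimal sketch of that selection in plain Python follows. It assumes each selected layer is looked up by its position in `available_layers`, which is an interpretation of the intent rather than a copy of the code in this commit; the standalone function signature is hypothetical.

```python
def build_offsets(num_codebooks, vocab_size, available_layers=None, layers=None):
    """Return the embedding-table offset for each selected codebook.

    Codebook i owns rows [i * vocab_size, (i + 1) * vocab_size) of the table.
    When `layers` is given, only the offsets of those layers are kept,
    using their position in `available_layers` (assumed lookup rule).
    """
    offsets = list(range(0, num_codebooks * vocab_size, vocab_size))
    if layers:
        selected = set(layers)
        offsets = [
            offsets[idx]
            for idx, layer in enumerate(available_layers)
            if layer in selected
        ]
    return offsets


# Values taken from hyperparams.yaml: 6 codebooks, 1000 clusters each.
available = [1, 3, 7, 12, 18, 23]
all_offsets = build_offsets(6, 1000, available, available)
subset_offsets = build_offsets(6, 1000, available, [3, 12])
```

Here `all_offsets` covers every codebook, while `subset_offsets` keeps only the offsets for layers 3 and 12, that is, the second and fourth blocks of the table.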