anton-l HF staff committed on
Commit 35751ea
1 Parent(s): c3800f2

Update README.md

Files changed (1)
  1. README.md +24 -21
README.md CHANGED
@@ -7,13 +7,11 @@ tags:
  - speech
  ---
 
- # UniSpeech-SAT-Base
+ # UniSpeech-SAT-Base for Speaker Verification
 
  [Microsoft's UniSpeech](https://www.microsoft.com/en-us/research/publication/unispeech-unified-speech-representation-learning-with-labeled-and-unlabeled-data/)
 
- The base model pretrained on 16kHz sampled speech audio with utterance and speaker contrastive loss. When using the model, make sure that your speech input is also sampled at 16kHz.
-
- **Note**: This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model for **speech recognition**, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model.
+ The model was pretrained on 16kHz sampled speech audio with utterance and speaker contrastive loss. When using the model, make sure that your speech input is also sampled at 16kHz.
 
  The model was pre-trained on:
 
@@ -29,33 +27,38 @@ Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Li
 
 
  The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT.
 
- # Usage
-
- This is an English pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be
- used for inference. The model was pre-trained on English speech and should therefore perform well only on English. The model has been shown to work well on tasks such as speaker verification, speaker identification, and speaker diarization.
-
- **Note**: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence
- of phonemes before fine-tuning.
-
- ## Speech Recognition
-
- To fine-tune the model for speech recognition, see [the official speech recognition example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/speech-recognition).
-
- ## Speech Classification
-
- To fine-tune the model for speech classification, see [the official audio classification example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/audio-classification).
-
- ## Speaker Verification
-
- TODO
-
- ## Speaker Diarization
-
- TODO
-
- # Contribution
-
- The model was contributed by [cywang](https://huggingface.co/cywang) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).
+ # Fine-tuning details
+
+ The model is fine-tuned on the [VoxCeleb1 dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) using an X-Vector head with an Additive Margin Softmax loss:
+
+ [X-Vectors: Robust DNN Embeddings for Speaker Recognition](https://www.danielpovey.com/files/2018_icassp_xvectors.pdf)
+
+ # Usage
+
+ ## Speaker Verification
+
+ ```python
+ from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector
+ from datasets import load_dataset
+ import torch
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-sv')
+ model = UniSpeechSatForXVector.from_pretrained('microsoft/unispeech-sat-base-sv')
+
+ # audio files are decoded on the fly; pass the raw arrays and pad them to a common length
+ inputs = feature_extractor([d["array"] for d in dataset[:2]["audio"]], sampling_rate=16000, padding=True, return_tensors="pt")
+ embeddings = model(**inputs).embeddings
+ embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
+
+ # the resulting embeddings can be used for cosine similarity-based retrieval
+ cosine_sim = torch.nn.CosineSimilarity(dim=-1)
+ similarity = cosine_sim(embeddings[0], embeddings[1])
+ threshold = 0.86  # the optimal threshold is dataset-dependent
+ if similarity < threshold:
+     print("Speakers are not the same!")
+ ```
 
  # License
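
The updated card notes that inputs must be sampled at 16kHz. As a minimal sketch of what that looks like in practice (not part of the diff above; it assumes torchaudio is available and uses a hypothetical local file `speech.wav`), audio at another rate can be resampled before it reaches the feature extractor:

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/unispeech-sat-base-sv")
model = UniSpeechSatForXVector.from_pretrained("microsoft/unispeech-sat-base-sv")

# "speech.wav" is a placeholder path; torchaudio returns (channels, samples) and the sample rate
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0)  # downmix to mono

# resample to the 16kHz rate the model was pretrained on
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).embeddings
```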
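
The fine-tuning description above mentions an X-Vector head trained with an Additive Margin Softmax loss. The sketch below covers only the loss term, assuming the standard AM-Softmax formulation: cosine scores between L2-normalized embeddings and per-speaker weight vectors, a margin subtracted from the target-class score, and a scale factor applied before cross-entropy. The class name, margin, and scale are illustrative, and the X-Vector architecture in the cited paper additionally uses TDNN layers and statistics pooling over time, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Illustrative Additive Margin Softmax loss over speaker classes (a sketch, not the exact fine-tuning head)."""

    def __init__(self, embed_dim: int, num_speakers: int, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.margin = margin
        self.scale = scale
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cosine similarity between L2-normalized embeddings and speaker weight vectors
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).to(cos.dtype)
        # subtract the margin from the target-class score only, then scale before cross-entropy
        logits = self.scale * (cos - self.margin * one_hot)
        return F.cross_entropy(logits, labels)

# toy usage: 4 utterance-level embeddings of dimension 512, 10 speakers
loss_fn = AMSoftmaxLoss(embed_dim=512, num_speakers=10)
loss = loss_fn(torch.randn(4, 512), torch.randint(0, 10, (4,)))
```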