reach-vb and ylacombe committed
Commit
da985ba
1 Parent(s): 6b1c0b3

Update README.md (#14)


- Update README.md (a7dd3f3a8c51c0c43f9645e48754163b5795b5c1)
- Update README.md (f22c97131d1e6698d1d09c8e0f9937424284e73a)
- Update README.md (39999b3ebd9123a62bc541d9d2b5a161d82e3e86)


Co-authored-by: Yoach Lacombe <ylacombe@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +37 -4
README.md CHANGED
@@ -103,15 +103,48 @@ inference: false
 
  We are open-sourcing our Conformer-based [W2v-BERT 2.0 speech encoder](#w2v-bert-20-speech-encoder) as described in Section 3.2.1 of the [paper](https://arxiv.org/pdf/2312.05187.pdf), which is at the core of our Seamless models.
 
+ This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.
+
  | Model Name | #params | checkpoint |
  | ------------ | ------- | ---------- |
  | W2v-BERT 2.0 | 600M | [checkpoint](https://huggingface.co/reach-vb/conformer-shaw/resolve/main/conformer_shaw.pt) |
 
- Scaling data size for self-supervised pre-training has been empirically proven to be a relatively cheap, yet effective way to improve speech representation quality (Zhang et al., 2023a). Following this direction, we continued to add more unlabeled speech data, increasing the amount of our pre-training data from 1M hours (Seamless Communication et al., 2023) to approximately 4.5M hours.
- Besides leveraging more pre-training data, we removed the random-projection quantizer (RPQ) (Chiu et al., 2022) and its associated loss previously incorporated in SeamlessM4T v1 (Seamless Communication et al., 2023). Akin to v1, the v2 w2v-BERT 2.0 comprises 24 Conformer layers (Gulati et al., 2020) with approximately 600M parameters and the same pre-training hyperparameters.
+ **This model and its training are supported by 🤗 Transformers; see the [docs](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert) for more details.**
+
+ # 🤗 Transformers usage
+
+ This is a bare checkpoint without any modeling head, and thus requires fine-tuning to be used for downstream tasks such as ASR. You can, however, use it to extract audio embeddings from the top layer with this code snippet:
+
+ ```python
+ from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
+ import torch
+ from datasets import load_dataset
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+ dataset = dataset.sort("id")
+ sampling_rate = dataset.features["audio"].sampling_rate
+
+ feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
+ model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")
+
+ # audio file is decoded on the fly
+ inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+ ```
 
+ To learn more about how to use the model, refer to the following resources:
+ - [its docs](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert)
+ - [a blog post showing how to fine-tune it on Mongolian ASR](https://huggingface.co/blog/fine-tune-w2v2-bert)
+ - [a training script example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py)
 
- Here's how you should do a forward pass through the speech encoder:
+ # Seamless Communication usage
+
+ This model can be used in [Seamless Communication](https://github.com/facebookresearch/seamless_communication), where it was released.
+
+ Here's how to make a forward pass through the speech encoder, after having completed the [installation steps](https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#installation):
 
  ```python
  import torch
@@ -148,4 +181,4 @@ seqs, padding_mask = get_seqs_and_padding_mask(src)
  with torch.inference_mode():
      seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
      seqs, padding_mask = model.encoder(seqs, padding_mask)
- ```
+ ```
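
The paragraph removed above (and Section 3.2.1 of the paper) describes the encoder as 24 Conformer layers with roughly 600M parameters. A quick sketch, assuming only the 🤗 Transformers checkpoint named in the README, to confirm the architecture from its config:

```python
from transformers import AutoConfig

# Inspect the encoder configuration of the Hugging Face checkpoint referenced above.
config = AutoConfig.from_pretrained("facebook/w2v-bert-2.0")

print(config.num_hidden_layers)  # 24 Conformer layers
print(config.hidden_size)        # 1024-dimensional hidden states
```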
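
Building on the embedding-extraction snippet in the new README section, here is a minimal sketch of turning its frame-level hidden states into a single clip-level vector. It assumes the `outputs` variable produced by that snippet and uses plain mean pooling; for a padded batch you would instead weight the average by the attention mask.

```python
# Frame-level embeddings from the snippet above: (batch, num_frames, hidden_size)
frame_embeddings = outputs.last_hidden_state

# Mean-pool over the time axis to get one embedding per audio clip.
clip_embedding = frame_embeddings.mean(dim=1)

print(frame_embeddings.shape)  # e.g. torch.Size([1, num_frames, 1024])
print(clip_embedding.shape)    # torch.Size([1, 1024])
```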
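
The README stresses that this is a bare checkpoint that needs fine-tuning for ASR. As a minimal sketch of a possible starting point, one way to attach a randomly initialized CTC head is `Wav2Vec2BertForCTC`; the vocabulary size below is a placeholder for whatever tokenizer you build for your target language, and the fine-tuning blog post linked above covers the full recipe.

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC

# Placeholder: size of the character/token vocabulary built for the target language.
vocab_size = 32

# Load the pre-trained encoder and add a fresh, randomly initialized CTC head on top.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    vocab_size=vocab_size,
)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")

# From here, train on (audio, transcription) pairs with the CTC objective,
# e.g. following the fine-tuning blog post and training script linked in the README.
```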