# Speech Recognition Pre-Training

## Wav2Vec2 Speech Pre-Training

The script [`run_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py) can be used to pre-train a [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html?highlight=wav2vec2) model from scratch.

In the script [`run_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py), a Wav2Vec2 model is pre-trained on audio data alone using [Wav2Vec2's contrastive loss objective](https://arxiv.org/abs/2006.11477).

The following examples show how to pre-train a `"base"`-sized Wav2Vec2 model as well as a `"large"`-sized Wav2Vec2 model using [`accelerate`](https://github.com/huggingface/accelerate).

---
**NOTE 1**

Wav2Vec2's pre-training is known to be quite unstable.
It is advised to do a couple of test runs with a smaller dataset,
*i.e.* `--dataset_config_names clean clean`, `--dataset_split_names validation test`,
to find good hyper-parameters for `learning_rate`, `batch_size`, `num_warmup_steps`,
and the optimizer.
A good metric to observe during training is the gradient norm, which should ideally stay between 0.5 and 2.

---

---
**NOTE 2**

When training a model on large datasets it is recommended to run the data preprocessing
in a first run in **non-distributed** mode via `--preprocessing_only`, so that
when running the model in **distributed** mode in a second step the preprocessed data
can easily be loaded on each distributed device (a minimal two-step sketch is shown at the end of the demo section below).

---

### Demo

In this demo run we pre-train a `"base-sized"` Wav2Vec2 model only on the validation and test data of [librispeech_asr](https://huggingface.co/datasets/librispeech_asr).

The demo is run on two Titan RTX GPUs (24 GB RAM each). In case you have less RAM available
per device, consider reducing `--per_device_train_batch_size` and/or `--max_duration_in_seconds`.

```bash
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name="librispeech_asr" \
    --dataset_config_names clean clean \
    --dataset_split_names validation test \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_train_steps="20000" \
    --num_warmup_steps="32000" \
    --gradient_accumulation_steps="8" \
    --learning_rate="0.005" \
    --weight_decay="0.01" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --logging_steps="1" \
    --saving_steps="10000" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --adam_epsilon="1e-06" \
    --gradient_checkpointing \
    --mask_time_prob="0.65" \
    --mask_time_length="10"
```

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/wav2vec2-pretrained-demo/reports/Wav2Vec2-PreTraining-Demo-Run--VmlldzoxMDk3MjAw?accessToken=oa05s1y57lizo2ocxy3k01g6db1u4pt8m6ur2n8nl4cb0ug02ms2cw313kb8ruch).
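As mentioned in **NOTE 2**, when moving from this demo to the full dataset it can be worthwhile to run the data preprocessing once in non-distributed mode before launching distributed training. The following is a minimal sketch of that first step, simply reusing the demo arguments above; `--preprocessing_only` is the flag referred to in the note:

```bash
# Step 1 (sketch): run the preprocessing once on a single process.
# All argument values are taken from the demo command above; the duration
# filters should match the ones used in the later training run.
python run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name="librispeech_asr" \
    --dataset_config_names clean clean \
    --dataset_split_names validation test \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --preprocessing_only
```

The second step is then the unchanged `accelerate launch` command from above (without `--preprocessing_only`), which should find the cached, preprocessed dataset on each device instead of re-running the preprocessing.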
### Base

To pre-train a `"base-sized"` Wav2Vec2 model, *e.g.* [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base), on [librispeech_asr](https://huggingface.co/datasets/librispeech_asr), the following command can be run:

```bash
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name=librispeech_asr \
    --dataset_config_names clean clean other \
    --dataset_split_names train.100 train.360 train.500 \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_train_steps="200000" \
    --num_warmup_steps="32000" \
    --gradient_accumulation_steps="4" \
    --learning_rate="0.001" \
    --weight_decay="0.01" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --logging_steps="1" \
    --saving_steps="10000" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --adam_epsilon="1e-06" \
    --gradient_checkpointing \
    --mask_time_prob="0.65" \
    --mask_time_length="10"
```

The experiment was run on 8 V100 GPUs (16 GB RAM each) for 4 days.
In case you have more than 8 GPUs available for a higher effective `batch_size`,
it is recommended to increase the `learning_rate` to `0.005` for faster convergence.

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/test/reports/Wav2Vec2-Base--VmlldzoxMTUyODQ0?accessToken=rg6e8u9yizx964k8q47zctq1m4afpvtn1i3qi9exgdmzip6xwkfzvagfajpzj55n) and the checkpoint pretrained for 85,000 steps can be accessed [here](https://huggingface.co/patrickvonplaten/wav2vec2-base-repro-960h-libri-85k-steps).

### Large

To pre-train a `"large-sized"` Wav2Vec2 model, *e.g.* [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60), on [librispeech_asr](https://huggingface.co/datasets/librispeech_asr), the following command can be run:

```bash
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name=librispeech_asr \
    --dataset_config_names clean clean other \
    --dataset_split_names train.100 train.360 train.500 \
    --output_dir=./test \
    --max_train_steps=200000 \
    --num_warmup_steps=32000 \
    --gradient_accumulation_steps=8 \
    --learning_rate=0.001 \
    --weight_decay=0.01 \
    --max_duration_in_seconds=20.0 \
    --min_duration_in_seconds=2.0 \
    --model_name_or_path=./ \
    --logging_steps=1 \
    --saving_steps=10000 \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4 \
    --adam_beta1=0.9 \
    --adam_beta2=0.98 \
    --adam_epsilon=1e-06 \
    --gradient_checkpointing \
    --mask_time_prob=0.65 \
    --mask_time_length=10
```

The experiment was run on 8 V100 GPUs (16 GB RAM each) for 7 days.
In case you have more than 8 GPUs available for a higher effective `batch_size`,
it is recommended to increase the `learning_rate` to `0.005` for faster convergence.

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/pretraining-wav2vec2/reports/Wav2Vec2-Large--VmlldzoxMTAwODM4?accessToken=wm3qzcnldrwsa31tkvf2pdmilw3f63d4twtffs86ou016xjbyilh55uoi3mo1qzc) and the checkpoint pretrained for 120,000 steps can be accessed [here](https://huggingface.co/patrickvonplaten/wav2vec2-large-repro-960h-libri-120k-steps).
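Both the base and the large commands above assume that `accelerate` has already been configured for the multi-GPU setup, *e.g.* via `accelerate config`. Alternatively, the distributed setup can be passed directly to the launcher. Below is a minimal sketch using the standard `accelerate launch` flags `--multi_gpu` and `--num_processes`; the script arguments are abbreviated and should be taken from the full base command above, and the output directory name is only illustrative:

```bash
# Sketch: launch the "base" pre-training on 8 GPUs without an interactive
# `accelerate config` step. Only a subset of the script arguments from the
# full command above is repeated here.
accelerate launch --multi_gpu --num_processes=8 \
    run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name="librispeech_asr" \
    --dataset_config_names clean clean other \
    --dataset_split_names train.100 train.360 train.500 \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-base" \
    --max_train_steps="200000" \
    --num_warmup_steps="32000" \
    --learning_rate="0.001" \
    --gradient_accumulation_steps="4" \
    --per_device_train_batch_size="8"
```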