NOTE: This example is outdated and no longer actively maintained. Please follow the new instructions for fine-tuning Wav2Vec2 here

Fine-tuning Wav2Vec2

The run_asr.py script allows one to fine-tune pretrained Wav2Vec2 models that can be found here.

This fine-tuning script can also be run as a Google Colab TODO: here.

Fine-Tuning with TIMIT

Let's take a look at the script used to fine-tune wav2vec2-base with the TIMIT dataset:

#!/usr/bin/env bash
python run_asr.py \
--output_dir="./wav2vec2-base-timit-asr" \
--num_train_epochs="30" \
--per_device_train_batch_size="20" \
--per_device_eval_batch_size="20" \
--evaluation_strategy="steps" \
--save_steps="500" \
--eval_steps="100" \
--logging_steps="50" \
--learning_rate="5e-4" \
--warmup_steps="3000" \
--model_name_or_path="facebook/wav2vec2-base" \
--fp16 \
--dataset_name="timit_asr" \
--train_split_name="train" \
--validation_split_name="test" \
--orthography="timit" \
--preprocessing_num_workers="$(nproc)" \
--group_by_length \
--freeze_feature_extractor \
--verbose_logging \

The resulting model and inference examples can be found here. Some of the arguments above may look unfamiliar; let's break down what's going on:

--orthography="timit" applies certain text preprocessing rules, for tokenization and normalization, to clean up the dataset. In this case, we use the following instance of Orthography:

Orthography(
    do_lower_case=True,
    # break compounds like "quarter-century-old" and replace pauses "--"
    translation_table=str.maketrans({"-": " "}),
)

The instance above is used as follows:

  • creates a tokenizer with do_lower_case=True (ignores casing for input and lowercases output when decoding)
  • replaces "-" with " " to break compounds like "quarter-century-old" and to clean up suspended hyphens
  • cleans up consecutive whitespaces (replaces them with a single space: " ")
  • removes characters not in vocabulary (lacking respective sound units); see the sketch after this list
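
For illustration, here is a minimal sketch of how these steps compose; the vocabulary, helper name, and sample sentence are assumptions made for this example, not code from run_asr.py:

# Illustrative sketch of the normalization steps above (not part of run_asr.py).
import string

vocab = set(string.ascii_lowercase) | {" ", "'"}      # assumed character vocabulary
translation_table = str.maketrans({"-": " "})

def normalize(text):
    text = text.lower()                               # do_lower_case=True
    text = text.translate(translation_table)          # break compounds and pauses
    text = " ".join(text.split())                     # collapse consecutive whitespace
    return "".join(ch for ch in text if ch in vocab)  # drop out-of-vocabulary characters

print(normalize("A quarter-century-old house -- still standing"))
# -> "a quarter century old house still standing"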

--verbose_logging logs text preprocessing updates and, when evaluating on the validation split every eval_steps, logs references and predictions.

Fine-Tuning with Arabic Speech Corpus

Other datasets, like the Arabic Speech Corpus dataset, require more work! Let's take a look at the script used to fine-tune wav2vec2-large-xlsr-53:

#!/usr/bin/env bash
python run_asr.py \
--output_dir="./wav2vec2-large-xlsr-53-arabic-speech-corpus" \
--num_train_epochs="50" \
--per_device_train_batch_size="1" \
--per_device_eval_batch_size="1" \
--gradient_accumulation_steps="8" \
--evaluation_strategy="steps" \
--save_steps="500" \
--eval_steps="100" \
--logging_steps="50" \
--learning_rate="5e-4" \
--warmup_steps="3000" \
--model_name_or_path="elgeish/wav2vec2-large-xlsr-53-arabic" \
--fp16 \
--dataset_name="arabic_speech_corpus" \
--train_split_name="train" \
--validation_split_name="test" \
--max_duration_in_seconds="15" \
--orthography="buckwalter" \
--preprocessing_num_workers="$(nproc)" \
--group_by_length \
--freeze_feature_extractor \
--target_feature_extractor_sampling_rate \
--verbose_logging \

First, let's understand how this dataset represents Arabic text; it uses a format called Buckwalter transliteration. We use the lang-trans package to convert back to Arabic when logging. The Buckwalter format only includes ASCII characters, some of which are non-alpha (e.g., ">" maps to "أ").
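
For example, converting a Buckwalter string back to Arabic script with lang-trans looks roughly like the sketch below; the call is the same one referenced by the Orthography instance that follows, and the sample string is illustrative:

# Hedged sketch of untransliterating Buckwalter text back to Arabic script.
from lang_trans import arabic

buckwalter_text = ">"                                      # Buckwalter for alef with hamza above
print(arabic.buckwalter.untransliterate(buckwalter_text))  # expected: "أ"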

--orthography="buckwalter" applies certain text preprocessing rules, for tokenization and normalization, to clean up the dataset. In this case, we use the following instance of Orthography:

Orthography(
    vocab_file=pathlib.Path(__file__).parent.joinpath("vocab/buckwalter.json"),
    word_delimiter_token="/",  # "|" is Arabic letter alef with madda above
    words_to_remove={"sil"},  # fixing "sil" in arabic_speech_corpus dataset
    untransliterator=arabic.buckwalter.untransliterate,
    translation_table=str.maketrans({
        "-": " ",  # sometimes used to represent pauses
        "^": "v",  # fixing "tha" in arabic_speech_corpus dataset
    }),
)

The instance above is used as follows:

  • creates a tokenizer with Buckwalter vocabulary and word_delimiter_token="/"
  • replaces "-" with " " to clean up hyphens and fixes the orthography for "ث"
  • removes words used as indicators (in this case, "sil" is used for silence); see the sketch after this list
  • cleans up consecutive whitespaces (replaces them with a single space: " ")
  • removes characters not in vocabulary (lacking respective sound units)
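
As a rough sketch of the Buckwalter-specific cleanup (word removal and the "^" fix), using a hypothetical transcript; this is not code from run_asr.py:

# Illustrative sketch of the Buckwalter-specific cleanup; not run_asr.py code.
words_to_remove = {"sil"}
translation_table = str.maketrans({"-": " ", "^": "v"})

def clean(text):
    text = text.translate(translation_table)                        # fix "^" -> "v" and "-" -> " "
    words = [w for w in text.split() if w not in words_to_remove]   # drop the "sil" indicator
    return " ".join(words)

print(clean("sil ^um~a sil"))   # hypothetical transcript -> "vum~a"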

--verbose_logging logs text preprocessing updates and, when evaluating on the validation split every eval_steps, logs references and predictions. With the Buckwalter orthography, text is also logged in Arabic abjad.

--target_feature_extractor_sampling_rate resamples audio to the target feature extractor's sampling rate (16 kHz).

--max_duration_in_seconds="15" filters out examples whose audio is longer than the specified limit, which helps cap GPU memory usage.
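
A rough sketch of what these two options amount to; the use of torchaudio and the file name are assumptions made for illustration, not run_asr.py internals:

# Illustrative sketch of resampling to 16 kHz and filtering by duration.
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")               # hypothetical input
waveform_16k = torchaudio.functional.resample(waveform, sample_rate, 16_000)

duration_in_seconds = waveform_16k.shape[1] / 16_000
keep_example = duration_in_seconds <= 15                             # --max_duration_in_seconds="15"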

DeepSpeed Integration

To learn how to deploy the DeepSpeed integration, please refer to this guide.

But to get started quickly, all you need to do is install DeepSpeed:

pip install deepspeed

and then use the default configuration files in this directory:

  • ds_config_wav2vec2_zero2.json
  • ds_config_wav2vec2_zero3.json

Here are examples of how you can use DeepSpeed:

(edit the value for --num_gpus to match the number of GPUs you have)

ZeRO-2:

PYTHONPATH=../../../src deepspeed --num_gpus 2 \
run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --evaluation_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=hf-internal-testing/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed ds_config_wav2vec2_zero2.json

For ZeRO-2 with more than 1 GPU you need to set the following (it is already set in the example configuration file):

    "zero_optimization": {
        ...
        "find_unused_parameters": true,
        ...
    }

ZeRO-3:

PYTHONPATH=../../../src deepspeed --num_gpus 2 \
run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --evaluation_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=hf-internal-testing/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed ds_config_wav2vec2_zero3.json

Pretraining Wav2Vec2

The run_pretrain.py script allows one to pretrain a Wav2Vec2 model from scratch using Wav2Vec2's contrastive loss objective (see official paper for more information). It is recommended to pre-train Wav2Vec2 with Trainer + Deepspeed (please refer to this guide for more information).
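
The gist of that contrastive objective can be sketched as follows; the shapes, the number of negatives, and the temperature are illustrative values, not taken from run_pretrain.py:

# Pedagogical sketch of Wav2Vec2's contrastive loss: the context vector at a masked
# time step should be most similar (by cosine similarity) to its own quantized target
# rather than to negatives sampled from other time steps.
import torch
import torch.nn.functional as F

batch, num_negatives, dim, temperature = 4, 100, 256, 0.1   # illustrative values
context = torch.randn(batch, dim)                    # transformer output at a masked step
positive = torch.randn(batch, dim)                   # quantized target for the same step
negatives = torch.randn(batch, num_negatives, dim)   # quantized targets from other steps

candidates = torch.cat([positive.unsqueeze(1), negatives], dim=1)   # (batch, 1 + K, dim)
logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
labels = torch.zeros(batch, dtype=torch.long)        # the true target sits at index 0
loss = F.cross_entropy(logits, labels)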

Here is an example of how you can use DeepSpeed ZeRO-2 to pretrain a small Wav2Vec2 model:

PYTHONPATH=../../../src deepspeed --num_gpus 4 run_pretrain.py \
--output_dir="./wav2vec2-base-libri-100h" \
--num_train_epochs="3" \
--per_device_train_batch_size="32" \
--per_device_eval_batch_size="32" \
--gradient_accumulation_steps="2" \
--save_total_limit="3" \
--save_steps="500" \
--logging_steps="10" \
--learning_rate="5e-4" \
--weight_decay="0.01" \
--warmup_steps="3000" \
--model_name_or_path="patrickvonplaten/wav2vec2-base-libri-100h" \
--dataset_name="librispeech_asr" \
--dataset_config_name="clean" \
--train_split_name="train.100" \
--preprocessing_num_workers="4" \
--max_duration_in_seconds="10.0" \
--group_by_length \
--verbose_logging \
--fp16 \
--deepspeed ds_config_wav2vec2_zero2.json \

Forced Alignment

Character-level forced alignment for audio and text pairs with Wav2Vec2 models fine-tuned on an ASR task for a specific language. Inspired by this PyTorch tutorial.
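
A rough sketch of the first step, computing the per-frame CTC log-probabilities that the alignment trellis in the tutorial is built on; the model name matches the run command below, while the file name and the use of torchaudio are assumptions, not alignment.py code:

# Hedged sketch: per-frame CTC log-probabilities ("emissions") for forced alignment.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "arijitx/wav2vec2-xls-r-300m-bengali"        # same model as the run command below
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

waveform, sr = torchaudio.load("wavs/0000.wav")           # hypothetical 48 kHz input
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits            # (1, frames, vocab_size)
emissions = torch.log_softmax(logits, dim=-1)[0]          # per-frame log-probs for the trellis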

Input Formats

Input format in script.txt              Input format in wavs directory
0000    sentence1                       0000.wav
0001    sentence2                       0001.wav

Output Format

The output directory will contain 0000.txt and 0001.txt. Each file will have the following format:

char    score   start_ms    end_ms
h       0.25    1440        1520
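
A small sketch of reading one of these files back; it assumes the header and whitespace-separated columns shown above:

# Illustrative parser for an alignment output file (char, score, start_ms, end_ms).
def read_alignment(path):
    alignments = []
    with open(path) as f:
        next(f)                                   # skip the header line
        for line in f:
            char, score, start_ms, end_ms = line.split()
            alignments.append((char, float(score), int(start_ms), int(end_ms)))
    return alignments

print(read_alignment("out_alignment/0000.txt"))   # hypothetical output file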

Run command

python alignment.py  \
--model_name="arijitx/wav2vec2-xls-r-300m-bengali" \
--wav_dir="./wavs" \
--text_file="script.txt" \
--input_wavs_sr=48000 \
--output_dir="./out_alignment" \
--cuda