|
## Finetuning scripts for M4T |
|
|
|
This section demonstrates an example of M4T finetuning on a single translation direction: English-to-Korean. |
|
|
|
The trainer and dataloader were designed mainly for demonstration purposes. Their simplicity should make the code transparent and easy to port.
|
|
|
## Data preparation |
|
|
|
The M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: the audio and text representation of the sample in the source language, and the corresponding audio and text representation in the target language.
|
|
|
Such a dataset can be prepared with the `dataset.py` script, which downloads the FLEURS dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets/google/fleurs), (optionally) extracts units from the target audio samples, and prepares a manifest consumable by `finetune.py`. The manifest is a text file where each line describes a single dataset sample, serialized in JSON format.
|
|
|
List of input arguments for `dataset.py`: |
|
|
|
```bash
--source_lang SOURCE_LANG
                      M4T langcode of the dataset SOURCE language
--target_lang TARGET_LANG
                      M4T langcode of the dataset TARGET language
--split SPLIT         Dataset split/shard to download (e.g. `train`, `validation`, `test`)
--save_dir SAVE_DIR   Directory where the datasets will be stored, along with the HuggingFace datasets cache files
```
|
|
|
Language codes should follow the notation adopted by the M4T models (e.g., `eng` for English, `kor` for Korean).
|
|
|
Below is an example bash script that prepares a training and evaluation dataset for the translation direction English-to-Korean: |
|
|
|
```bash
export DATASET_DIR=~/m4t_dataset
mkdir -p $DATASET_DIR

m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split train \
  --save_dir $DATASET_DIR
m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split validation \
  --save_dir $DATASET_DIR
```
|
|
|
|
|
Output manifests will be stored in `${DATASET_DIR}/train_manifest.json` and `${DATASET_DIR}/validation_manifest.json`. |
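To sanity-check the prepared data, each manifest line can be parsed on its own. Below is a minimal Python sketch; the exact field layout depends on the `dataset.py` version, so treat the printed keys as the source of truth rather than the names guessed in the comment:

```python
import json
from pathlib import Path

# Peek at the first training sample in the manifest produced above.
manifest_path = Path.home() / "m4t_dataset" / "train_manifest.json"  # i.e. ${DATASET_DIR}/train_manifest.json
with open(manifest_path) as manifest:
    sample = json.loads(next(manifest))
print(sample.keys())  # expected: source/target entries carrying text, audio and units
```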
|
|
|
|
|
## Finetuning |
|
|
|
`finetune.py` is an example finetuning script that initializes the dataloaders and launches a training loop with periodic scoring against the validation dataset.
|
It is recommended to launch it with [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html). Multi-GPU and multi-node training are supported out of the box.
|
|
|
List of input arguments for `finetune.py`: |
|
|
|
```bash
--train_dataset TRAIN_DATASET
                      Path to manifest with train samples
--eval_dataset EVAL_DATASET
                      Path to manifest with eval samples
--model_name MODEL_NAME
                      Base model name (e.g., `seamlessM4T_medium`, `seamlessM4T_large`)
--save_model_to SAVE_MODEL_TO
                      Path where the best finetuned model will be saved
--seed SEED           Randomizer seed value
--batch_size BATCH_SIZE
                      Batch size for training and evaluation
--patience PATIENCE   Terminate training early after `patience` evaluations without eval loss improvement
--max_epochs MAX_EPOCHS
                      Max number of training epochs
--learning_rate LEARNING_RATE
                      Finetuning learning rate
--warmup_steps WARMUP_STEPS
                      Number of steps with linearly increasing learning rate
--eval_steps EVAL_STEPS
                      Compute eval loss after every `eval_steps` training steps
--log_steps LOG_STEPS
                      Log training loss after every `log_steps` training steps
--mode {FinetuneMode.SPEECH_TO_SPEECH,FinetuneMode.SPEECH_TO_TEXT,FinetuneMode.TEXT_TO_SPEECH}
                      * `SPEECH_TO_SPEECH` -- finetune S2T and T2U parts of the model;
                      * `TEXT_TO_SPEECH` -- finetune only T2U;
                      * `SPEECH_TO_TEXT` -- finetune only S2T
```
|
|
|
The script supports three finetuning modes:
|
- `SPEECH_TO_SPEECH`: all model weights except the text encoder are engaged;
- `TEXT_TO_SPEECH`: only the text-to-unit (T2U) part of the model is engaged in the finetuning; all other weights are frozen;
- `SPEECH_TO_TEXT`: only the speech-to-text (S2T) part of the model is engaged in the finetuning.
|
|
|
The referenced finetuning script does not support finetuning of the text encoder. |
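Conceptually, the mode determines which submodules keep `requires_grad=True`. The sketch below illustrates that logic; it is not the script's actual implementation, and the attribute names (`text_encoder`, `speech_encoder`, `text_decoder`, `t2u_model`) are assumptions based on the public UnitY model layout:

```python
from typing import Optional

import torch.nn as nn

def freeze_for_mode(model: nn.Module, mode: str) -> None:
    """Freeze submodules according to the finetuning mode (illustrative only)."""

    def set_trainable(module: Optional[nn.Module], trainable: bool) -> None:
        if module is None:
            return
        for param in module.parameters():
            param.requires_grad = trainable

    # The text encoder is never finetuned by the reference script.
    set_trainable(model.text_encoder, False)
    if mode == "TEXT_TO_SPEECH":
        # Only the text-to-unit (T2U) component stays trainable.
        set_trainable(model.speech_encoder, False)
        set_trainable(model.text_decoder, False)
    elif mode == "SPEECH_TO_TEXT":
        # Only the speech-to-text (S2T) path stays trainable.
        set_trainable(model.t2u_model, False)
    # SPEECH_TO_SPEECH freezes nothing beyond the text encoder.
```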
|
|
|
|
|
Below is an example bash script that launches finetuning of M4T-large on the dataset prepared earlier, using a single node with eight GPUs: |
|
|
|
```bash
torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=localhost:0 \
  --nnodes=1 \
  --nproc-per-node=8 \
  --no-python \
  m4t_finetune \
    --mode SPEECH_TO_TEXT \
    --train_dataset $DATASET_DIR/train_manifest.json \
    --eval_dataset $DATASET_DIR/validation_manifest.json \
    --learning_rate 1e-6 \
    --warmup_steps 100 \
    --max_epochs 10 \
    --patience 3 \
    --model_name seamlessM4T_large \
    --save_model_to $DATASET_DIR/checkpoint.pt
```
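Once training finishes, the best checkpoint can be inspected with plain PyTorch. A minimal sketch, assuming a regular `torch.save` artifact; whether it is a raw state dict or wraps one under a key such as `"model"` depends on the `finetune.py` version:

```python
import torch

# Load the checkpoint produced by the run above ($DATASET_DIR/checkpoint.pt).
ckpt = torch.load("checkpoint.pt", map_location="cpu")
# Unwrap if the state dict is nested under a "model" key (version-dependent).
state_dict = ckpt.get("model", ckpt)
print(f"{len(state_dict)} entries in the checkpoint")
```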
|
|
|
Excerpt from an example finetuning log: |
|
|
|
```
...
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Eval after 300 updates: loss=8.7755 best_loss=8.7755 patience_steps_left=3
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Saving model
2023-08-21 14:46:35,863 INFO -- trainer.1100368: Epoch 006 / update 00310: train loss=16.3768 last lr=5.68E-08
2023-08-21 14:46:42,610 INFO -- trainer.1100368: Epoch 006 / update 00320: train loss=16.3730 last lr=5.59E-08
2023-08-21 14:46:48,285 INFO -- trainer.1100368: Epoch 006 / update 00330: train loss=16.4598 last lr=5.50E-08
2023-08-21 14:46:54,390 INFO -- trainer.1100368: Epoch 006 / update 00340: train loss=16.4218 last lr=5.42E-08
2023-08-21 14:47:08,461 INFO -- trainer.1100368: Epoch 006 / update 00350: train loss=16.3906 last lr=5.35E-08
2023-08-21 14:47:09,067 INFO -- trainer.1100368: Run evaluation
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Eval after 350 updates: loss=8.7462 best_loss=8.7462 patience_steps_left=3
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Saving model
2023-08-21 14:47:44,981 INFO -- trainer.1100368: Epoch 007 / update 00360: train loss=16.4267 last lr=5.27E-08
2023-08-21 14:47:51,383 INFO -- trainer.1100368: Epoch 007 / update 00370: train loss=16.3630 last lr=5.20E-08
2023-08-21 14:47:58,305 INFO -- trainer.1100368: Epoch 007 / update 00380: train loss=16.3666 last lr=5.13E-08
2023-08-21 14:48:04,396 INFO -- trainer.1100368: Epoch 007 / update 00390: train loss=16.3605 last lr=5.06E-08
2023-08-21 14:48:10,630 INFO -- trainer.1100368: Epoch 007 / update 00400: train loss=16.3518 last lr=5.00E-08
...
```
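The `last lr` values in the excerpt decay smoothly after the 100 warmup steps. They are consistent with an inverse-square-root schedule, and the short check below reproduces them under that assumption (the scheduler form and the effective peak rate are inferred from the log, not taken from the script):

```python
import math

# Assumed schedule after warmup: lr(step) = peak_lr * sqrt(warmup_steps / step).
peak_lr, warmup_steps = 1e-7, 100  # peak_lr inferred from the log excerpt
for step in (310, 350, 400):
    print(step, f"{peak_lr * math.sqrt(warmup_steps / step):.2E}")
# -> 310 5.68E-08, 350 5.35E-08, 400 5.00E-08, matching `last lr` above
```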
|
|