## Finetuning scripts for M4T

This section demonstrates an example of M4T finetuning on a single translation direction: English-to-Korean.

The trainer and dataloader were designed mainly for demonstration purposes; their simplicity is intended to keep the code transparent and easy to port.

## Data preparation

The M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: the audio and text representation of the sample in the source language, and the corresponding audio and text representation in the target language.

Such a dataset can be prepared with the `dataset.py` script, which downloads the FLEURS dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets/google/fleurs), (optionally) extracts units from the target audio samples, and prepares a manifest consumable by `finetune.py`. The manifest is a text file in which each line describes a single dataset sample, serialized in JSON format.

List of input arguments for `dataset.py`:

```bash
  --source_lang SOURCE_LANG
                        M4T langcode of the dataset SOURCE language
  --target_lang TARGET_LANG
                        M4T langcode of the dataset TARGET language
  --split SPLIT         Dataset split/shard to download (`train`, `test`)
  --save_dir SAVE_DIR   Directory where the datasets will be stored with HuggingFace datasets cache files
```

Language codes should follow the notation adopted by the M4T models.

Below is an example bash script that prepares a training and evaluation dataset for the translation direction English-to-Korean:

```bash
export DATASET_DIR=~/m4t_dataset
mkdir -p $DATASET_DIR

m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split train \
  --save_dir $DATASET_DIR
m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split validation \
  --save_dir $DATASET_DIR
```

Output manifests will be stored in `${DATASET_DIR}/train_manifest.json` and `${DATASET_DIR}/validation_manifest.json`.
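For a quick sanity check of the prepared data, you can pretty-print the first record of a manifest; the exact field names depend on the version of the preparation script you ran:

```bash
# Pretty-print the first JSON record of the training manifest.
head -n 1 $DATASET_DIR/train_manifest.json | python -m json.tool
```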
## Finetuning

`finetune.py` is an example finetuning script that initializes dataloaders and launches a training loop with periodic scoring against the validation dataset.
It is recommended to launch it with [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html). Multi-GPU and multi-node training are supported out of the box.

List of input arguments for `finetune.py`:

```bash
  --train_dataset TRAIN_DATASET
                        Path to manifest with train samples
  --eval_dataset EVAL_DATASET
                        Path to manifest with eval samples
  --model_name MODEL_NAME
                        Base model name (e.g., `seamlessM4T_medium`, `seamlessM4T_large`)
  --save_model_to SAVE_MODEL_TO
                        Path to save best finetuned model
  --seed SEED           Randomizer seed value
  --batch_size BATCH_SIZE
                        Batch size for training and evaluation
  --patience PATIENCE   Set early termination after `patience` number of evaluations without eval loss improvements
  --max_epochs MAX_EPOCHS
                        Max number of training epochs
  --learning_rate LEARNING_RATE
                        Finetuning learning rate
  --warmup_steps WARMUP_STEPS
                        Number of steps with linearly increasing learning rate
  --eval_steps EVAL_STEPS
                        Get eval loss after each `eval_steps` training steps
  --log_steps LOG_STEPS
                        Log inner loss after each `log_steps` training steps
  --mode {FinetuneMode.SPEECH_TO_SPEECH,FinetuneMode.SPEECH_TO_TEXT,FinetuneMode.TEXT_TO_SPEECH}
                        * `SPEECH_TO_SPEECH` -- finetune S2T and T2U parts of the model;
                        * `TEXT_TO_SPEECH` -- finetune only T2U;
                        * `SPEECH_TO_TEXT` -- finetune only S2T
```
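The listing above mirrors the script's argparse help and may drift from your local version; the installed CLI entry point (used in the launch example below) should print the current options via the standard argparse help flag:

```bash
# Print the up-to-date argument list for the locally installed version.
m4t_finetune --help
```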
The script supports three modes of finetuning:
- `SPEECH_TO_SPEECH`: all model weights except the text encoder will be engaged;
- `TEXT_TO_SPEECH`: only the text-to-unit part of the model will be engaged in the finetuning, other weights will be frozen;
- `SPEECH_TO_TEXT`: only the speech-to-text part of the model will be engaged in the finetuning.

The referenced finetuning script does not support finetuning of the text encoder.

Below is an example bash script that launches finetuning of M4T-large on the dataset prepared earlier, using a single node with eight GPUs:

```bash
torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=localhost:0 \
  --nnodes=1 \
  --nproc-per-node=8 \
  --no-python \
  m4t_finetune \
    --mode SPEECH_TO_TEXT \
    --train_dataset $DATASET_DIR/train_manifest.json \
    --eval_dataset $DATASET_DIR/validation_manifest.json \
    --learning_rate 1e-6 \
    --warmup_steps 100 \
    --max_epochs 10 \
    --patience 3 \
    --model_name seamlessM4T_large \
    --save_model_to $DATASET_DIR/checkpoint.pt
```
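The same job should also run on a single GPU by reducing the process count. Below is a minimal sketch, assuming one visible GPU; the smaller `seamlessM4T_medium` checkpoint is used here to fit more comfortably on one device, and all other flags are unchanged:

```bash
# Single-GPU variant of the launch above: one worker process, same finetuning flags.
torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=localhost:0 \
  --nnodes=1 \
  --nproc-per-node=1 \
  --no-python \
  m4t_finetune \
    --mode SPEECH_TO_TEXT \
    --train_dataset $DATASET_DIR/train_manifest.json \
    --eval_dataset $DATASET_DIR/validation_manifest.json \
    --learning_rate 1e-6 \
    --warmup_steps 100 \
    --max_epochs 10 \
    --patience 3 \
    --model_name seamlessM4T_medium \
    --save_model_to $DATASET_DIR/checkpoint.pt
```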
Excerpt from an example finetuning log:

```
...
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Eval after 300 updates: loss=8.7755 best_loss=8.7755 patience_steps_left=3
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Saving model
2023-08-21 14:46:35,863 INFO -- trainer.1100368: Epoch 006 / update 00310: train loss=16.3768 last lr=5.68E-08
2023-08-21 14:46:42,610 INFO -- trainer.1100368: Epoch 006 / update 00320: train loss=16.3730 last lr=5.59E-08
2023-08-21 14:46:48,285 INFO -- trainer.1100368: Epoch 006 / update 00330: train loss=16.4598 last lr=5.50E-08
2023-08-21 14:46:54,390 INFO -- trainer.1100368: Epoch 006 / update 00340: train loss=16.4218 last lr=5.42E-08
2023-08-21 14:47:08,461 INFO -- trainer.1100368: Epoch 006 / update 00350: train loss=16.3906 last lr=5.35E-08
2023-08-21 14:47:09,067 INFO -- trainer.1100368: Run evaluation
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Eval after 350 updates: loss=8.7462 best_loss=8.7462 patience_steps_left=3
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Saving model
2023-08-21 14:47:44,981 INFO -- trainer.1100368: Epoch 007 / update 00360: train loss=16.4267 last lr=5.27E-08
2023-08-21 14:47:51,383 INFO -- trainer.1100368: Epoch 007 / update 00370: train loss=16.3630 last lr=5.20E-08
2023-08-21 14:47:58,305 INFO -- trainer.1100368: Epoch 007 / update 00380: train loss=16.3666 last lr=5.13E-08
2023-08-21 14:48:04,396 INFO -- trainer.1100368: Epoch 007 / update 00390: train loss=16.3605 last lr=5.06E-08
2023-08-21 14:48:10,630 INFO -- trainer.1100368: Epoch 007 / update 00400: train loss=16.3518 last lr=5.00E-08
...
```
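After training stops, a quick way to confirm that the best checkpoint was written is shown below; this assumes the example trainer saves a standard `torch.save` artifact at the path passed via `--save_model_to`:

```bash
# Verify the finetuned checkpoint exists and is loadable by PyTorch.
ls -lh $DATASET_DIR/checkpoint.pt
python -c "import torch; print(type(torch.load('$DATASET_DIR/checkpoint.pt', map_location='cpu')))"
```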