## Finetuning scripts for M4T

This section demonstrates an example of M4T finetuning on a single translation direction: English-to-Korean.

The trainer and dataloader were designed mainly for demonstration purposes. Their simplicity is meant to keep the code transparent and portable.

## Data preparation

The M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: the audio and text representations of the sample in the source language, and the corresponding audio and text representations in the target language.

Such a dataset can be prepared with the `dataset.py` script, which downloads the FLEURS dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets/google/fleurs), (optionally) extracts units from the target audio samples, and prepares a manifest consumable by `finetune.py`. The manifest is a text file in which each line describes a single dataset sample, serialized as JSON.

List of input arguments for `dataset.py`:

```bash
  --source_lang SOURCE_LANG
                        M4T langcode of the dataset SOURCE language
  --target_lang TARGET_LANG
                        M4T langcode of the dataset TARGET language
  --split SPLIT         Dataset split/shard to download (`train`, `test`)
  --save_dir SAVE_DIR   Directory where the datasets will be stored with HuggingFace datasets cache files
```

Language codes should follow the notation adopted by M4T models.

Below is an example bash script that prepares a training and evaluation dataset for the translation direction English-to-Korean:

```bash
export DATASET_DIR=~/m4t_dataset
mkdir -p $DATASET_DIR

m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split train \
  --save_dir $DATASET_DIR
m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split validation \
  --save_dir $DATASET_DIR
```

Output manifests will be stored in `${DATASET_DIR}/train_manifest.json` and `${DATASET_DIR}/validation_manifest.json`.
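
Each line of these manifests is a single JSON-serialized sample, as described above. The snippet below is a small sanity-check sketch (it is not part of the toolkit); it only assumes the manifest path produced by the preparation step and prints the top-level fields of the first few samples without assuming a particular schema:

```python
import json
from pathlib import Path

# Manifest produced by the data preparation step above.
manifest_path = Path("~/m4t_dataset/train_manifest.json").expanduser()

with manifest_path.open() as manifest:
    for line_no, line in enumerate(manifest):
        if line_no >= 3:  # look at just the first few samples
            break
        sample = json.loads(line)
        # Print only the top-level field names; the exact schema is defined by dataset.py.
        print(f"sample {line_no}: fields = {sorted(sample)}")
```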

## Finetuning

`finetune.py` is an example finetuning script that initializes the dataloaders and launches a training loop with periodic scoring against the validation dataset.
It is recommended to launch it with [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html). Multi-GPU and multi-node training are supported out of the box.

List of input arguments for `finetune.py`:

```bash
  --train_dataset TRAIN_DATASET
                        Path to manifest with train samples
  --eval_dataset EVAL_DATASET
                        Path to manifest with eval samples
  --model_name MODEL_NAME
                        Base model name (e.g., `seamlessM4T_medium`, `seamlessM4T_large`)
  --save_model_to SAVE_MODEL_TO
                        Path to save best finetuned model
  --seed SEED           Randomizer seed value
  --batch_size BATCH_SIZE
                        Batch size for training and evaluation
  --patience PATIENCE   Set early termination after `patience` evaluations without eval loss improvements
  --max_epochs MAX_EPOCHS
                        Max number of training epochs
  --learning_rate LEARNING_RATE
                        Finetuning learning rate
  --warmup_steps WARMUP_STEPS
                        Number of steps with linearly increasing learning rate
  --eval_steps EVAL_STEPS
                        Get eval loss after each `eval_steps` training steps
  --log_steps LOG_STEPS
                        Log inner loss after each `log_steps` training steps
  --mode {FinetuneMode.SPEECH_TO_SPEECH,FinetuneMode.SPEECH_TO_TEXT,FinetuneMode.TEXT_TO_SPEECH}
                        * `SPEECH_TO_SPEECH` -- finetune S2T and T2U parts of the model;
                        * `TEXT_TO_SPEECH` -- finetune only T2U;
                        * `SPEECH_TO_TEXT` -- finetune only S2T
```

The script supports three modes of finetuning (a minimal sketch of the underlying weight-freezing pattern follows the list):
- `SPEECH_TO_SPEECH`: all model weights except the text encoder will be engaged;
- `TEXT_TO_SPEECH`: only the text-to-unit (T2U) part of the model will be engaged in the finetuning; other weights will be frozen;
- `SPEECH_TO_TEXT`: only the speech-to-text (S2T) part of the model will be engaged in the finetuning.

The referenced finetuning script does not support finetuning of the text encoder.
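
The snippet below is an illustrative sketch only: the submodule names (`speech_encoder`, `text_decoder`, `t2u`) are hypothetical stand-ins and do not mirror the real SeamlessM4T module hierarchy, and `finetune.py` applies its own freezing logic based on the selected `--mode`. It simply shows the generic PyTorch pattern of "freezing" a submodule by disabling gradient computation, which is how mode-dependent finetuning like the above is typically implemented:

```python
from torch import nn


def freeze(module: nn.Module) -> None:
    """Disable gradient computation for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad = False


# Toy stand-in for a multitask translation model; the submodule names are
# hypothetical and do NOT mirror the actual SeamlessM4T module hierarchy.
model = nn.ModuleDict(
    {
        "speech_encoder": nn.Linear(16, 16),
        "text_decoder": nn.Linear(16, 16),
        "t2u": nn.Linear(16, 16),
    }
)

# Keep the toy unit-generation part frozen and train only the rest --
# analogous in spirit to the SPEECH_TO_TEXT mode described above.
freeze(model["t2u"])

trainable = sorted({name.split(".")[0] for name, p in model.named_parameters() if p.requires_grad})
print("trainable submodules:", trainable)
```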

Below is an example bash script that launches finetuning of M4T-large on the dataset prepared earlier, using a single node with eight GPUs:

```bash
torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=localhost:0 \
  --nnodes=1 \
  --nproc-per-node=8 \
  --no-python \
  m4t_finetune \
    --mode SPEECH_TO_TEXT \
    --train_dataset $DATASET_DIR/train_manifest.json \
    --eval_dataset $DATASET_DIR/validation_manifest.json \
    --learning_rate 1e-6 \
    --warmup_steps 100 \
    --max_epochs 10 \
    --patience 3 \
    --model_name seamlessM4T_large \
    --save_model_to $DATASET_DIR/checkpoint.pt
```

Excerpt from an example finetuning log:

```
...
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Eval after 300 updates: loss=8.7755 best_loss=8.7755 patience_steps_left=3
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Saving model
2023-08-21 14:46:35,863 INFO -- trainer.1100368: Epoch 006 / update 00310: train loss=16.3768 last lr=5.68E-08
2023-08-21 14:46:42,610 INFO -- trainer.1100368: Epoch 006 / update 00320: train loss=16.3730 last lr=5.59E-08
2023-08-21 14:46:48,285 INFO -- trainer.1100368: Epoch 006 / update 00330: train loss=16.4598 last lr=5.50E-08
2023-08-21 14:46:54,390 INFO -- trainer.1100368: Epoch 006 / update 00340: train loss=16.4218 last lr=5.42E-08
2023-08-21 14:47:08,461 INFO -- trainer.1100368: Epoch 006 / update 00350: train loss=16.3906 last lr=5.35E-08
2023-08-21 14:47:09,067 INFO -- trainer.1100368: Run evaluation
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Eval after 350 updates: loss=8.7462 best_loss=8.7462 patience_steps_left=3
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Saving model
2023-08-21 14:47:44,981 INFO -- trainer.1100368: Epoch 007 / update 00360: train loss=16.4267 last lr=5.27E-08
2023-08-21 14:47:51,383 INFO -- trainer.1100368: Epoch 007 / update 00370: train loss=16.3630 last lr=5.20E-08
2023-08-21 14:47:58,305 INFO -- trainer.1100368: Epoch 007 / update 00380: train loss=16.3666 last lr=5.13E-08
2023-08-21 14:48:04,396 INFO -- trainer.1100368: Epoch 007 / update 00390: train loss=16.3605 last lr=5.06E-08
2023-08-21 14:48:10,630 INFO -- trainer.1100368: Epoch 007 / update 00400: train loss=16.3518 last lr=5.00E-08
...
```
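
Once the run finishes, the best finetuned model is saved to the path passed via `--save_model_to` (here `$DATASET_DIR/checkpoint.pt`). The snippet below is a hedged inspection sketch (not part of the toolkit): it loads the checkpoint on CPU with plain `torch.load` and prints its top-level structure rather than assuming how `finetune.py` organizes the saved object:

```python
import os

import torch

# Path passed to `--save_model_to` in the example above.
checkpoint_path = os.path.expanduser("~/m4t_dataset/checkpoint.pt")

# Load on CPU so no GPU is needed just to inspect the file.
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Print the top-level structure instead of assuming a particular layout.
if isinstance(checkpoint, dict):
    for key, value in checkpoint.items():
        if isinstance(value, torch.Tensor):
            print(f"{key}: tensor {tuple(value.shape)}")
        elif isinstance(value, dict):
            print(f"{key}: dict with {len(value)} entries")
        else:
            print(f"{key}: {type(value).__name__}")
else:
    print(f"checkpoint object of type {type(checkpoint).__name__}")
```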