Translation
Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks.
This guide will show you how to fine-tune T5 on the English-French subset of the OPUS Books dataset to translate English text to French.
See the translation task page for more information about its associated models, datasets, and metrics.
Load OPUS Books dataset
Load the OPUS Books dataset from the 🤗 Datasets library:
>>> from datasets import load_dataset
>>> books = load_dataset("opus_books", "en-fr")
Split this dataset into a train and test set:
>>> books = books["train"].train_test_split(test_size=0.2)
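As a rough illustration of what test_size=0.2 means, the sketch below splits a toy list of ten items into shuffled train and test portions using only the standard library. The helper name toy_train_test_split and the fixed seed are assumptions made for this example; the 🤗 Datasets method may shuffle and round differently.

```python
import random


def toy_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle rows and hold out a fraction as a test set (illustrative only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for the demo
    n_test = round(len(rows) * test_size)  # size of the held-out portion
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return {
        "train": [rows[i] for i in train_idx],
        "test": [rows[i] for i in test_idx],
    }


toy_rows = [f"sentence {i}" for i in range(10)]
toy_split = toy_train_test_split(toy_rows)
print(len(toy_split["train"]), len(toy_split["test"]))
```

Every row ends up in exactly one of the two portions, so the test set can serve as held-out data during evaluation.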
Then take a look at an example:
>>> books["train"][0]
{'id': '90560',
'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
The translation field is a dictionary containing the English and French translations of the text.
Preprocess
Load the T5 tokenizer to process the language pairs:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
The preprocessing function needs to:
- Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
- Tokenize the input (English) and target (French) separately. The targets must be tokenized in target mode, so a context manager switches the tokenizer to the target language before it tokenizes the French text.
- Truncate sequences to be no longer than the maximum length set by the max_length parameter.
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "
>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, max_length=128, truncation=True)
...     with tokenizer.as_target_tokenizer():
...         labels = tokenizer(targets, max_length=128, truncation=True)
...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
Use the 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:
>>> tokenized_books = books.map(preprocess_function, batched=True)
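To make the batched call concrete: with batched=True, map passes the function a dictionary in which each column holds a list of values, so examples["translation"] is a list of dictionaries. The sketch below mimics that shape on two hand-written rows and uses a whitespace splitter (toy_tokenize, a hypothetical stand-in for the real T5 tokenizer) to show prefixing and truncation. It does not reproduce T5's subword vocabulary or special tokens.

```python
prefix = "translate English to French: "

# A toy batch shaped like what `map(..., batched=True)` passes in:
# one key per column, each holding a list of values.
toy_batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Hello world", "fr": "Bonjour le monde"},
        {"en": "Thank you very much", "fr": "Merci beaucoup"},
    ],
}


def toy_tokenize(texts, max_length):
    """Hypothetical stand-in tokenizer: split on whitespace, then truncate."""
    return [text.split()[:max_length] for text in texts]


def toy_preprocess(examples, max_length=6):
    """Prefix the English side, then tokenize inputs and targets separately."""
    inputs = [prefix + pair["en"] for pair in examples["translation"]]
    targets = [pair["fr"] for pair in examples["translation"]]
    return {
        "input_tokens": toy_tokenize(inputs, max_length),
        "label_tokens": toy_tokenize(targets, max_length),
    }


toy_output = toy_preprocess(toy_batch)
print(toy_output["input_tokens"][1])
```

The second input has eight whitespace tokens once the prefix is added, so the toy max_length of 6 cuts off its last two words; the real function applies the same kind of cap with max_length=128.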
In PyTorch, load T5 with AutoModelForSeq2SeqLM:
>>> from transformers import AutoModelForSeq2SeqLM
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
In TensorFlow, load T5 with TFAutoModelForSeq2SeqLM:
>>> from transformers import TFAutoModelForSeq2SeqLM
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
Use DataCollatorForSeq2Seq to create a batch of examples. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.
>>> from transformers import DataCollatorForSeq2Seq
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
In TensorFlow, create the collator with return_tensors="tf":
>>> from transformers import DataCollatorForSeq2Seq
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
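The idea behind dynamic padding can be sketched without the library: each batch is padded only to its own longest sequence rather than to a global maximum. In the sketch below, toy_collate is a hypothetical helper and not the DataCollatorForSeq2Seq implementation, and the token ids are made up. The pad id 0 matches T5's pad token, and -100 is the label padding value that the loss ignores by default.

```python
def toy_collate(features, pad_id=0, label_pad_id=-100):
    """Pad every feature to the longest sequence in this batch (illustrative)."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch


# Made-up token ids of different lengths, standing in for tokenized examples.
toy_features = [
    {"input_ids": [13, 7, 1], "labels": [42, 1]},
    {"input_ids": [13, 7, 9, 25, 1], "labels": [42, 8, 3, 1]},
]
toy_padded = toy_collate(toy_features)
print(toy_padded["input_ids"])
```

Because the padded width depends only on the current batch, a batch of short sentences stays short instead of being stretched to the longest sentence in the whole dataset.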
Train
If you aren’t familiar with fine-tuning a model with the Trainer, take a look at the basic fine-tuning tutorial first.
At this point, only three steps remain:
- Define your training hyperparameters in Seq2SeqTrainingArguments.
- Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, and data collator.
- Call train() to fine-tune your model.
>>> from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )
>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )
>>> trainer.train()
To fine-tune a model in TensorFlow, start by converting your datasets to the tf.data.Dataset format with to_tf_dataset. Specify inputs and labels in columns, whether to shuffle the dataset order, batch size, and the data collator:
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )
>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
If you aren’t familiar with fine-tuning a model with Keras, take a look at the basic fine-tuning tutorial first.
Set up an optimizer with a learning rate and a weight decay rate:
>>> from transformers import AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
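For intuition about what the two arguments control, here is a minimal sketch of a single Adam step with decoupled weight decay on one scalar weight. It is a generic textbook-style update written for this guide, not the AdamWeightDecay source; the beta and epsilon values and the helper name toy_adamw_step are assumptions.

```python
import math


def toy_adamw_step(w, grad, m, v, t, lr=2e-5, weight_decay_rate=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update with decoupled weight decay on a scalar (illustrative)."""
    w = w - lr * weight_decay_rate * w      # shrink the weight directly
    m = beta1 * m + (1 - beta1) * grad      # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2   # running mean of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction for step t
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step (t=1) from zero-initialized moment estimates.
w, m, v = toy_adamw_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(w)
```

On the first step the bias-corrected gradient term moves the weight by roughly lr, while the decay term separately shrinks it by lr * weight_decay_rate * w, which is why the two hyperparameters can be tuned somewhat independently.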
Configure the model for training with compile:
>>> model.compile(optimizer=optimizer)
Call fit to fine-tune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding PyTorch notebook or TensorFlow notebook.