Transformers documentation

Translation

You are viewing v4.17.0 version. A newer version v4.38.2 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Translation

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks.

This guide will show you how to fine-tune T5 on the English-French subset of the OPUS Books dataset to translate English text to French.

See the translation task page for more information about its associated models, datasets, and metrics.

Load OPUS Books dataset

Load the OPUS Books dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> books = load_dataset("opus_books", "en-fr")

Split this dataset into a train and test set:

books = books["train"].train_test_split(test_size=0.2)

Then take a look at an example:

>>> books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}

The translation field is a dictionary containing the English and French translations of the text.

Preprocess

Load the T5 tokenizer to process the language pairs:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")

The preprocessing function needs to:

  1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  2. Tokenize the input (English) and target (French) separately. You can’t tokenize French text with a tokenizer pretrained on an English vocabulary. A context manager will help set the tokenizer to French first before tokenizing it.
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter.
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "


>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, max_length=128, truncation=True)

...     with tokenizer.as_target_tokenizer():
...         labels = tokenizer(targets, max_length=128, truncation=True)

...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

>>> tokenized_books = books.map(preprocess_function, batched=True)

Use DataCollatorForSeq2Seq to create a batch of examples. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

>>> from transformers import DataCollatorForSeq2Seq >>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Fine-tune with Trainer

Load T5 with AutoModelForSeq2SeqLM:

>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

If you aren’t familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!

At this point, only three steps remain:

  1. Define your training hyperparameters in Seq2SeqTrainingArguments.
  2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, and data collator.
  3. Call train() to fine-tune your model.
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()

Fine-tune with TensorFlow

To fine-tune a model in TensorFlow is just as easy, with only a few differences.

If you aren’t familiar with fine-tuning a model with Keras, take a look at the basic tutorial here!

Convert your datasets to the tf.data.Dataset format with to_tf_dataset. Specify inputs and labels in columns, whether to shuffle the dataset order, batch size, and the data collator:

>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Load T5 with TFAutoModelForSeq2SeqLM:

>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")

Configure the model for training with compile:

>>> model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding PyTorch notebook or TensorFlow notebook.