---
language:
- it
datasets:
- oscar
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---
# Italian T5-base 🇮🇹
## ⚠️⚠️ REDIRECTION NOTICE ⚠️⚠️
The contents of the repository `gsarti/t5-base-it` will be transferred to a new repository, `gsarti/it5-base-oscar`, on the Hugging Face Hub on October 23rd, 2021. Users looking for an improved version of the Italian T5 model can already use the checkpoint in the `gsarti/it5-base` repository (more details soon!).
Created by Gabriele Sarti during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project PreTrain T5 for Italian.
This is notably the first sequence-to-sequence model pre-trained on the Italian language available on the 🤗 Hub. For people interested in studying the pre-training dynamics of this model, the repository `t5-base-it-training` contains Flax checkpoints for the whole pre-training process (saved every 2,000 steps; 129 checkpoints, ~250GB).
**Important:** the inference widget is deactivated because the model needs task-specific seq2seq fine-tuning on a downstream task to be actually useful. The script `run_t5_mlm_flax.py` provides an example of fine-tuning the model on a downstream summarization task.
## Dataset
This model was trained on the Italian de-duplicated portion of the OSCAR corpus (11B words, ~69GB) using the 🤗 Datasets library. The corpus was used as-is without any further preprocessing.
## Training
The model was trained for 258K steps over 4 days using JAX/Flax on a TPU v3-8 VM on Google Cloud. Refer to the TensorBoard tab of the repository for an overview of the training process.
The original configuration of the `t5-base` model was adopted, with the exception of the parameter `dropout_rate`, which was set to `0` instead of `0.1` during pre-training, following the `t5-v1.1` implementation. The tokenizer is a `SentencePieceUnigramTokenizer` trained on the first 2M sentences of the Italian portion of the mC4 corpus.
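A tokenizer of the same family can be trained with the 🤗 Tokenizers library; the tiny repeated corpus and small vocabulary below are placeholders for the ~2M Italian mC4 sentences and the full vocabulary actually used.

```python
from tokenizers import SentencePieceUnigramTokenizer

# Placeholder corpus; the real tokenizer was trained on ~2M mC4 sentences
sentences = [
    "Questa è una frase di esempio.",
    "Il modello è stato pre-addestrato sul corpus OSCAR.",
] * 100

tokenizer = SentencePieceUnigramTokenizer()
# Tiny vocab for the sketch; T5 models typically use a ~32K-token vocabulary
tokenizer.train_from_iterator(
    sentences,
    vocab_size=100,
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)

encoding = tokenizer.encode("Il modello pre-addestrato.")
print(encoding.tokens)
```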
The following parameters were used for training:
| parameter | value |
|---|---|
| optimizer | Adafactor w/ default params |
| dataset | `oscar/unshuffled_deduplicated_it` |
| max seq. length | 512 |
| per-device batch size | 16 |
| tot. batch size | 128 |
| learning rate | 1e-2 |
| lr schedule | linear warmup + linear decay |
| warmup steps | 10K |
| weight decay | 1e-3 |
| num. train epochs | 1 (258K steps) |
| validation split size | 15K examples |
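The schedule from the table (linear warmup to the peak learning rate over 10K steps, then linear decay over the remaining steps) can be sketched as a plain function; decaying exactly to zero at step 258K is an assumption about the endpoint.

```python
PEAK_LR = 1e-2         # learning rate from the table above
WARMUP_STEPS = 10_000  # warmup steps from the table above
TOTAL_STEPS = 258_000  # total training steps (1 epoch)

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

# Ramp up, peak, midpoint of decay, end of training
print(learning_rate(0), learning_rate(10_000),
      learning_rate(134_000), learning_rate(258_000))
# → 0.0 0.01 0.005 0.0
```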