---
language:
- it
datasets:
- oscar
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---
# Italian T5-base 🇮🇹
Created by Gabriele Sarti during the Hugging Face Community Week, organized by Hugging Face with TPU usage sponsored by Google, for the project PreTrain T5 for Italian.
This is notably the first sequence-to-sequence model pre-trained on the Italian language available on the 🤗 Hub. For people interested in studying the pre-training dynamics of this model, the repository `t5-base-it-training` contains Flax checkpoints for the whole pre-training process (saved every 2,000 steps, 129 checkpoints, ~250GB in total).
**Important:** The inference widget is deactivated because the model needs task-specific seq2seq fine-tuning on a downstream task to be useful in practice. The script `run_t5_mlm_flax.py` provides an example of fine-tuning the model on a downstream summarization task.
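As a minimal sketch of how such fine-tuning could start, the snippet below loads the model and tokenizer with 🤗 Transformers and tokenizes one source/target pair. The repository identifier `gsarti/t5-base-it` and the example texts are assumptions used purely for illustration, not necessarily the actual Hub name of this model.

```python
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

# NOTE: "gsarti/t5-base-it" is an assumed Hub identifier, used here for illustration only.
model_id = "gsarti/t5-base-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxT5ForConditionalGeneration.from_pretrained(model_id)

# Tokenize one (source, target) pair the way a seq2seq fine-tuning script would.
inputs = tokenizer("testo di esempio da riassumere", return_tensors="np", max_length=512, truncation=True)
labels = tokenizer("riassunto di esempio", return_tensors="np", max_length=128, truncation=True)
```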
## Dataset
This model was trained on the Italian de-duplicated portion of the OSCAR corpus (11B words, ~69GB) using the 🤗 Datasets library. The corpus was used as-is, without any further preprocessing.
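A quick sketch of accessing the same data with 🤗 Datasets is shown below; streaming is just a convenience here to avoid downloading the full ~69GB locally.

```python
from datasets import load_dataset

# The Italian de-duplicated split of OSCAR used for pre-training.
dataset = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)

# Peek at the first document.
print(next(iter(dataset))["text"][:200])
```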
## Training
The model was trained for 258K steps over 4 days using JAX/Flax on a TPU v3-8 VM on Google Cloud. Refer to the TensorBoard tab of the repository for an overview of the training process.
The original configuration of the model `t5-base` was adopted, with the exception of the parameter `dropout_rate`, which was set to `0` instead of `0.1` during pre-training, following the implementation of `t5-v1.1`. The tokenizer is a `SentencePieceUnigramTokenizer` trained on the first 2M sentences of the Italian portion of the `mC4` corpus.
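A minimal sketch of the described configuration change, assuming the standard `t5-base` configuration from 🤗 Transformers as the starting point:

```python
from transformers import T5Config

# Start from the standard t5-base configuration and disable dropout for pre-training,
# as described above (dropout_rate = 0 instead of 0.1, following t5-v1.1).
config = T5Config.from_pretrained("t5-base", dropout_rate=0.0)
print(config.dropout_rate)  # 0.0
```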
The following parameters were used for training:
| parameter | value |
|---|---|
| optimizer | Adafactor w/ default params |
| dataset | `oscar/unshuffled_deduplicated_it` |
| max seq. length | 512 |
| per-device batch size | 16 |
| tot. batch size | 128 |
| learning rate | 1e-2 |
| lr schedule | linear warmup + linear decay |
| warmup steps | 10K |
| weight decay | 1e-3 |
| num. train epochs | 1 (258K steps) |
| validation split size | 15K examples |
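The sketch below writes the optimizer and learning-rate schedule from the table with `optax`, since training used JAX/Flax. The exact shape of the decay segment and the use of `weight_decay_rate` to apply the 1e-3 weight decay are assumptions about how the values above map onto the library, not a reproduction of the original training script.

```python
import optax

# Table values: 258K total steps, 10K linear warmup to a peak LR of 1e-2,
# then (assumed) linear decay to zero over the remaining steps.
total_steps = 258_000
warmup_steps = 10_000
peak_lr = 1e-2

lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adafactor with default parameters; weight_decay_rate is assumed to carry the 1e-3 weight decay.
optimizer = optax.adafactor(learning_rate=lr_schedule, weight_decay_rate=1e-3)
```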