---
language:
- it
datasets:
- oscar
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---
# Italian T5-base 🇮🇹
## ⚠️⚠️ REDIRECTION NOTICE ⚠️⚠️
The contents of the repository `gsarti/t5-base-it` will be transferred to a new repository, `gsarti/it5-base-oscar`, on the Hugging Face Hub on October 23rd, 2021. Users looking for an improved version of the Italian T5 model can already use the checkpoint in the `gsarti/it5-base` repository (more details soon!).
Created by Gabriele Sarti during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project PreTrain T5 for Italian.
This is notably the first sequence-to-sequence model pre-trained on the Italian language available on the 🤗 Hub. For people interested in studying the pre-training dynamics of this model, the repository `t5-base-it-training` contains Flax checkpoints for the whole pre-training process (saved every 2,000 steps; 129 checkpoints, ~250GB).
**Important:** the inference widget is deactivated because the model needs task-specific seq2seq fine-tuning on a downstream task to be actually useful. The script `run_t5_mlm_flax.py` provides an example of fine-tuning the model on a downstream summarization task.
## Dataset
This model was trained on the Italian de-duplicated portion of the OSCAR corpus (11B words, ~69GB) using the 🤗 Datasets library. The corpus was used as-is without any further preprocessing.
## Training
The model was trained for 258K steps over 4 days using JAX/Flax on a TPU v3-8 VM on Google Cloud. Refer to the TensorBoard tab of the repository for an overview of the training process.
The original configuration of the `t5-base` model was adopted, with the exception of the parameter `dropout_rate`, which was set to `0` instead of `0.1` during pre-training, following the `t5-v1.1` implementation. The tokenizer is a `SentencePieceUnigramTokenizer` trained on the first 2M sentences of the Italian portion of the mC4 corpus.
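A tokenizer of the same family can be trained with the 🤗 Tokenizers library; the tiny repeated corpus and small vocabulary below are placeholders for the ~2M Italian mC4 sentences and the full vocabulary actually used.

```python
from tokenizers import SentencePieceUnigramTokenizer

# Placeholder corpus; the real tokenizer was trained on ~2M mC4 sentences
sentences = [
    "Questa è una frase di esempio.",
    "Il modello è stato pre-addestrato sul corpus OSCAR.",
] * 100

tokenizer = SentencePieceUnigramTokenizer()
# Tiny vocab for the sketch; T5 models typically use a ~32K-token vocabulary
tokenizer.train_from_iterator(
    sentences,
    vocab_size=100,
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)

encoding = tokenizer.encode("Il modello pre-addestrato.")
print(encoding.tokens)
```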
The following parameters were used for training:
| parameter | value |
|---|---|
| optimizer | Adafactor w/ default params |
| dataset | `oscar/unshuffled_deduplicated_it` |
| max seq. length | 512 |
| per-device batch size | 16 |
| tot. batch size | 128 |
| learning rate | 1e-2 |
| lr schedule | linear warmup + linear decay |
| warmup steps | 10K |
| weight decay | 1e-3 |
| num. train epochs | 1 (258K steps) |
| validation split size | 15K examples |
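The schedule from the table (linear warmup to the peak learning rate over 10K steps, then linear decay over the remaining steps) can be sketched as a plain function; decaying exactly to zero at step 258K is an assumption about the endpoint.

```python
PEAK_LR = 1e-2         # learning rate from the table above
WARMUP_STEPS = 10_000  # warmup steps from the table above
TOTAL_STEPS = 258_000  # total training steps (1 epoch)

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

# Ramp up, peak, midpoint of decay, end of training
print(learning_rate(0), learning_rate(10_000),
      learning_rate(134_000), learning_rate(258_000))
# → 0.0 0.01 0.005 0.0
```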