---
language:
- it
datasets:
- gsarti/clean_mc4_it
tags:
 - seq2seq
 - lm-head
license: apache-2.0
inference: false
thumbnail: https://gsarti.com/publication/it5/featured.png
---

# Italian T5 Base 🇮🇹

The [IT5](https://huggingface.co/models?search=it5) model family represents the first effort in pretraining large-scale sequence-to-sequence transformer models for the Italian language, following the approach adopted by the original [T5 model](https://github.com/google-research/text-to-text-transfer-transformer). 

This model is released as part of the project ["IT5: Text-to-Text Pretraining for Italian Language Understanding and Generation"](https://aclanthology.org/2024.lrec-main.823/), by [Gabriele Sarti](https://gsarti.com/) and [Malvina Nissim](https://malvinanissim.github.io/) with the support of [Huggingface](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104) and with TPU usage sponsored by Google's [TPU Research Cloud](https://sites.research.google/trc/). All training was conducted on a single TPU v3-8 VM on Google Cloud. Refer to the TensorBoard tab of the repository for an overview of the training process.

*The inference widget is deactivated because the model needs task-specific seq2seq fine-tuning on a downstream task to be useful in practice.*

## Model variants

This repository contains the checkpoints for the `base` version of the model. The model was trained for one epoch (1.05M steps) on the [Thoroughly Cleaned Italian mC4 Corpus](https://huggingface.co/datasets/gsarti/clean_mc4_it) (~41B words, ~275GB) using 🤗 Datasets and the `google/t5-v1_1-base` improved configuration. Another version of this model trained on the [OSCAR corpus](https://oscar-corpus.com/) is also available under the name [`gsarti/it5-base-oscar`](https://huggingface.co/gsarti/it5-base-oscar). The training procedure is made available [on Github](https://github.com/gsarti/t5-flax-gcp).

The following table summarizes the parameters of all available models:

|                       |`it5-small`            |`it5-base` (this one) |`it5-large`            |`it5-base-oscar`                  |
|-----------------------|-----------------------|----------------------|-----------------------|----------------------------------|
|`dataset`              |`gsarti/clean_mc4_it`  |`gsarti/clean_mc4_it` |`gsarti/clean_mc4_it`  |`oscar/unshuffled_deduplicated_it`|
|`architecture`         |`google/t5-v1_1-small` |`google/t5-v1_1-base` |`google/t5-v1_1-large` |`t5-base`                         |
|`learning rate`        | 5e-3                  | 5e-3                 | 5e-3                  | 1e-2                             |
|`steps`                | 1'050'000             | 1'050'000            | 2'100'000             | 258'000                          |
|`training time`        | 36 hours              | 101 hours            | 370 hours             | 98 hours                         |
|`ff projection`        |`gated-gelu`           |`gated-gelu`          |`gated-gelu`           |`relu`                            |
|`tie embeds`           |`false`                |`false`               |`false`                |`true`                            |
|`optimizer`            | adafactor             | adafactor            | adafactor             | adafactor                        |
|`max seq. length`      | 512                   | 512                  | 512                   | 512                              |
|`per-device batch size`| 16                    | 16                   | 8                     | 16                               |
|`tot. batch size`      | 128                   | 128                  | 64                    | 128                              |
|`weight decay`         | 1e-3                  | 1e-3                 | 1e-2                  | 1e-3                             |
|`validation split size`| 15K examples          | 15K examples         | 15K examples          | 15K examples                     |

The high training time of `it5-base-oscar` was due to [a bug](https://github.com/huggingface/transformers/pull/13012) in the training script.

For a list of individual model parameters, refer to the `config.json` file in the respective repositories.
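
As a minimal sketch, the architectural settings listed above can also be inspected programmatically through the checkpoint's configuration (attribute names follow the standard `T5Config` in 🤗 Transformers):

```python
from transformers import AutoConfig

# Load the configuration of this checkpoint to inspect its hyperparameters
config = AutoConfig.from_pretrained("gsarti/it5-base")

print(config.d_model)              # hidden size
print(config.num_layers)           # number of encoder (and decoder) layers
print(config.feed_forward_proj)    # "gated-gelu" for the t5-v1_1 variants
print(config.tie_word_embeddings)  # False for the t5-v1_1 variants
```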

## Using the models

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-base")
```

*Note: You will need to fine-tune the model on your downstream seq2seq task to use it. See an example [here](https://huggingface.co/it5/it5-base-news-summarization).*
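
For illustration, a checkpoint already fine-tuned from IT5, such as the news summarization model linked above, can be used for generation as in the following sketch (the generation parameters are indicative only, not tuned values):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example with a checkpoint fine-tuned for Italian news summarization
tokenizer = AutoTokenizer.from_pretrained("it5/it5-base-news-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("it5/it5-base-news-summarization")

text = "Il governo ha annunciato oggi nuove misure per il sostegno all'economia ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```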

Flax and Tensorflow versions of the model are also available:

```python
from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-base")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-base")
```

## Limitations

Due to the nature of the web-scraped corpus on which IT5 models were trained, it is likely that their usage could reproduce and amplify pre-existing biases in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. For this reason, the study of such biases is explicitly encouraged, and model usage should ideally be restricted to research-oriented and non-user-facing endeavors.

## Model curators

For problems or updates on this model, please contact [gabriele.sarti996@gmail.com](mailto:gabriele.sarti996@gmail.com).

## Citation Information

```bibtex
@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
    abstract = "We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.",
}

```