---
license: afl-3.0
language:
- it
widget:
- text: >-
    Gli sviluppi delle prestazioni rivalutate e del valore di <mask> sono di
    seguito riportati
tags:
- bureauberto
- administrative language
- italian
---

# BureauBERTo: adapting UmBERTo to the Italian bureaucratic language
<img  src="https://huggingface.co/colinglab/BureauBERTo/resolve/main/bureauberto.jpg?raw=true" width="600"/> 



BureauBERTo is the first transformer-based language model adapted to the Italian Public Administration (PA) and technical-bureaucratic domains. It was obtained by further pre-training the general-purpose Italian model UmBERTo on domain-specific text.

## Training Corpus
BureauBERTo is trained on the Bureau Corpus, a composite corpus containing PA, banking, and insurance documents. The Bureau Corpus contains 35,293,226 sentences and approximately 1B tokens, for a total of 6.7 GB of plain text. The input examples are constructed by applying the BureauBERTo tokenizer to contiguous sentences from one or more documents, inserting the separator special token after each sentence. The BureauBERTo vocabulary is expanded with 8,305 domain-specific tokens extracted from the Bureau Corpus.
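
The snippet below is only a minimal sketch of the two preprocessing steps mentioned above (sentence packing with the separator token and vocabulary expansion). The base checkpoint, the domain tokens, and the sample sentences are placeholders, not the original data or scripts.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative sketch only: the base checkpoint shown here (the Common Crawl
# variant of UmBERTo) is an assumption, not necessarily the exact one used.
base_ckpt = "Musixmatch/umberto-commoncrawl-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
model = AutoModelForMaskedLM.from_pretrained(base_ckpt)

# 1) Expand the vocabulary with domain-specific tokens (the tokens below are
#    made-up examples, not entries of the actual 8,305-token list).
domain_tokens = ["rivalutate", "controassicurazione"]
tokenizer.add_tokens(domain_tokens)
model.resize_token_embeddings(len(tokenizer))

# 2) Pack contiguous sentences into a single example, inserting the separator
#    special token after each sentence.
sentences = [
    "Gli sviluppi delle prestazioni rivalutate sono di seguito riportati.",
    "Il contratto prevede una rivalutazione annua delle prestazioni.",
]
packed = f" {tokenizer.sep_token} ".join(sentences)
encoding = tokenizer(packed, truncation=True, max_length=512)
print(encoding["input_ids"][:20])
```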

## Training Procedure

Further pre-training was carried out on the Bureau Corpus with an MLM objective, randomly masking 15% of the tokens. The model was trained for 40 epochs, corresponding to 17,400 steps with a batch size of 8K, on an NVIDIA A100 GPU. We used a learning rate of 5e-5, the Adam optimizer (β1 = 0.9, β2 = 0.98), a weight decay of 0.1, and a warm-up ratio of 0.06.
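
A minimal sketch of how an equivalent further pre-training run could be set up with the Hugging Face Trainer is shown below. The checkpoint, the tiny placeholder dataset, and the per-device batch size / gradient accumulation split are assumptions; the original training scripts are not part of this repository.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder data: the actual run used the (non-public) Bureau Corpus and
# started from an UmBERTo checkpoint with the expanded vocabulary.
tokenizer = AutoTokenizer.from_pretrained("colinglab/BureauBERTo")
model = AutoModelForMaskedLM.from_pretrained("colinglab/BureauBERTo")

corpus = Dataset.from_dict(
    {"text": ["Gli sviluppi delle prestazioni rivalutate sono di seguito riportati."]}
)
tokenized_corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of the tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters as reported above; the batch-size split is one possible way
# to reach an effective batch size of ~8K on a single GPU (64 x 128 = 8,192),
# not the exact original configuration.
training_args = TrainingArguments(
    output_dir="./bureauberto-further-pretraining",
    num_train_epochs=40,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.1,
    warmup_ratio=0.06,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=128,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_corpus,
)
trainer.train()
```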


## How to Use

BureauBERTo can be loaded as follows:

```python
from transformers import AutoModel, AutoTokenizer
model_name = "colinglab/BureauBERTo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
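
For masked-token prediction, the standard fill-mask pipeline can be used, for instance with the example sentence shown in the widget above:

```python
from transformers import pipeline

# Fill-mask pipeline over BureauBERTo; the input is the widget example sentence.
fill_mask = pipeline("fill-mask", model="colinglab/BureauBERTo")
predictions = fill_mask(
    "Gli sviluppi delle prestazioni rivalutate e del valore di <mask> sono di seguito riportati"
)
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 3))
```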

## Citation
If you find our resources or paper useful, please consider citing our work:
```
@inproceedings{auriemma2023bureauberto,
  title      = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  shorttitle = {{BureauBERTo}},
  author     = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C. and Lenci, Alessandro},
  editor     = {Falchi, Fabrizio and Giannotti, Fosca and Monreale, Anna and Boldrini, Chiara and Rinzivillo, Salvatore and Colantonio, Sara},
  language   = {en},
  booktitle  = {{Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops co-located with the 3rd CINI National Lab AIIS Conference on Artificial Intelligence (Ital IA 2023)}},
  address    = {Pisa, Italy},
  series     = {{CEUR} {Workshop} {Proceedings}},
  volume     = {3486},
  pages      = {240--248},
  publisher  = {CEUR-WS.org},
  year       = {2023},
  url        = {https://ceur-ws.org/Vol-3486/42.pdf},
}
```