colinglab/BureauBERTo

administrative language


martinam committed on Jul 13, 2023

Commit 72f31af (1 parent: fd63a59)

Update README.md

Files changed (1)

README.md +1 -1


@@ -14,7 +14,7 @@ widget:
 BureauBERTo is the first transformer-based language model adapted to the Italian Public Administration (PA) and technical-bureaucratic domains. This model results from a further pre-training applied to the general-purpose Italian model UmBERTo.
 ## Training Corpus
-BureauBERTo is trained on the Bureau Corpus. , a composite corpus containing PA, banking, and insurance documents. The Bureau Corpus contains 35,293,226 sentences and approximately 1B tokens, for a total amount of 6.7 GB of plain text. The input dataset is constructed by applying the BureauBERTo tokenizer to contiguous sentences from one or more documents, using the separating special token after each sentence. The BureauBERTo vocabulary is expanded with 8,305 domain-specific tokens extracted from the Bureau Corpus.
+BureauBERTo is trained on the Bureau Corpus, a composite corpus containing PA, banking, and insurance documents. The Bureau Corpus contains 35,293,226 sentences and approximately 1B tokens, for a total amount of 6.7 GB of plain text. The input dataset is constructed by applying the BureauBERTo tokenizer to contiguous sentences from one or more documents, using the separating special token after each sentence. The BureauBERTo vocabulary is expanded with 8,305 domain-specific tokens extracted from the Bureau Corpus.
 ## Training Procedure
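
The input construction described in the changed paragraph (contiguous sentences tokenized and each followed by the separating special token) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' preprocessing code: `toy_tokenize` is a hypothetical whitespace stand-in for the BureauBERTo tokenizer, the `</s>` separator string and the `max_len` value are assumed, and the example sentences are invented.

```python
from typing import Iterable, List

# Assumed separator string; the README only says "the separating special token".
SEP = "</s>"


def toy_tokenize(sentence: str) -> List[str]:
    """Hypothetical stand-in for the BureauBERTo tokenizer: whitespace splitting."""
    return sentence.split()


def pack_sentences(sentences: Iterable[str], max_len: int) -> List[List[str]]:
    """Greedily pack contiguous sentences into sequences of at most max_len tokens.

    SEP is appended after every sentence. A sentence that does not fit in the
    current sequence starts a new one; overlong sentences are truncated.
    """
    sequences: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        # Reserve one slot for SEP so a single sentence never exceeds max_len.
        tokens = toy_tokenize(sentence)[: max_len - 1] + [SEP]
        if current and len(current) + len(tokens) > max_len:
            sequences.append(current)
            current = []
        current.extend(tokens)
    if current:
        sequences.append(current)
    return sequences


# Invented example sentences, for illustration only.
sentences = [
    "Il comune rilascia il certificato",
    "La domanda va presentata online",
    "Il premio assicurativo è annuale",
]
packed = pack_sentences(sentences, max_len=12)
for seq in packed:
    print(len(seq), seq)
```

With a real tokenizer the same packing loop would operate on token IDs rather than strings; the greedy strategy here is one plausible choice, and the README does not state which packing rule was actually used.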