martinam committed on
Commit
72f31af
1 Parent(s): fd63a59

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -14,7 +14,7 @@ widget:
  BureauBERTo is the first transformer-based language model adapted to the Italian Public Administration (PA) and technical-bureaucratic domains. This model results from a further pre-training applied to the general-purpose Italian model UmBERTo.
  
  ## Training Corpus
- BureauBERTo is trained on the Bureau Corpus. , a composite corpus containing PA, banking, and insurance documents. The Bureau Corpus contains 35,293,226 sentences and approximately 1B tokens, for a total amount of 6.7 GB of plain text. The input dataset is constructed by applying the BureauBERTo tokenizer to contiguous sentences from one or more documents, using the separating special token after each sentence. The BureauBERTo vocabulary is expanded with 8,305 domain-specific tokens extracted from the Bureau Corpus.
+ BureauBERTo is trained on the Bureau Corpus, a composite corpus containing PA, banking, and insurance documents. The Bureau Corpus contains 35,293,226 sentences and approximately 1B tokens, for a total amount of 6.7 GB of plain text. The input dataset is constructed by applying the BureauBERTo tokenizer to contiguous sentences from one or more documents, using the separating special token after each sentence. The BureauBERTo vocabulary is expanded with 8,305 domain-specific tokens extracted from the Bureau Corpus.
  
  ## Training Procedure
  
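The input construction described in the Training Corpus paragraph (contiguous sentences from one or more documents, with a separator token after each sentence) can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the whitespace tokenizer stands in for the BureauBERTo tokenizer, `"</s>"` is an assumed separator token, and `pack_sentences` and `max_len` are hypothetical names chosen for illustration.

```python
from typing import Callable, List

SEP = "</s>"  # assumption: RoBERTa-style separator; the README does not name the token


def pack_sentences(
    documents: List[List[str]],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in for the real tokenizer
    max_len: int = 16,  # hypothetical sequence length, in tokens
    sep: str = SEP,
) -> List[List[str]]:
    """Pack contiguous sentences into sequences of at most max_len tokens.

    A separator is appended after every sentence, and a sequence may
    continue across a document boundary.
    """
    sequences: List[List[str]] = []
    current: List[str] = []
    for doc in documents:
        for sentence in doc:
            # Truncate overly long sentences so the separator always fits.
            piece = tokenize(sentence)[: max_len - 1] + [sep]
            if current and len(current) + len(piece) > max_len:
                sequences.append(current)
                current = []
            current.extend(piece)
    if current:
        sequences.append(current)
    return sequences


docs = [
    ["Il comune rilascia il certificato.", "La domanda va presentata online."],
    ["La polizza copre i danni."],
]
packed = pack_sentences(docs, max_len=12)
for seq in packed:
    print(seq)
```

With the real tokenizer, each sentence would be split into subword units from the expanded vocabulary rather than whitespace-delimited words; the packing logic itself would stay the same under these assumptions.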