Fill-Mask
Transformers
PyTorch
Safetensors
Italian
xlm-roberta
Inference Endpoints
osiria committed on
Commit 73f8735
1 Parent(s): 2ea1dbd

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -32,19 +32,19 @@ widget:
 
 <h3>Introduction</h3>
 
- This model is a <b>lightweight</b> and uncased version of <b>MiniLM</b> <b>[1]</b> for the <b>italian</b> language. Its <b>17M parameters</b> and <b>67MB</b> size make it
+ This model is a <b>lightweight</b> and uncased version of <b>MiniLM</b> <b>[1]</b> for the <b>Italian</b> language. Its <b>17M parameters</b> and <b>67MB</b> size make it
 <b>85% lighter</b> than a typical mono-lingual BERT model. It is ideal when memory consumption and execution speed are critical while maintaining high-quality results.
 
 
 <h3>Model description</h3>
 
 The model builds on <b>mMiniLMv2</b> <b>[1]</b> (from Microsoft: [L6xH384 mMiniLMv2](https://github.com/microsoft/unilm/tree/master/minilm)) as a starting point,
- focusing it on the italian language while at the same time turning it into an uncased model by modifying the embedding layer
+ focusing it on the Italian language while at the same time turning it into an uncased model by modifying the embedding layer
 (as in <b>[2]</b>, but computing document-level frequencies over the <b>Wikipedia</b> dataset and setting a frequency threshold of 0.1%), which brings a considerable
 reduction in the number of parameters.
 
 To compensate for the deletion of cased tokens, which now forces the model to exploit lowercase representations of words previously capitalized,
- the model has been further pre-trained on the italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [3]</b> technique to make it more robust
+ the model has been further pre-trained on the Italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [3]</b> technique to make it more robust
 to the new uncased representations.
 
 The resulting model has 17M parameters, a vocabulary of 14.610 tokens, and a size of 67MB, which makes it <b>85% lighter</b> than a typical mono-lingual BERT model and
@@ -53,7 +53,7 @@ The resulting model has 17M parameters, a vocabulary of 14.610 tokens, and a siz
 
 <h3>Training procedure</h3>
 
- The model has been trained for <b>masked language modeling</b> on the italian <b>Wikipedia</b> (~3GB) dataset for 10K steps, using the AdamW optimizer, with a batch size of 512
+ The model has been trained for <b>masked language modeling</b> on the Italian <b>Wikipedia</b> (~3GB) dataset for 10K steps, using the AdamW optimizer, with a batch size of 512
 (obtained through 128 gradient accumulation steps),
 a sequence length of 512, and a linearly decaying learning rate starting from 5e-5. The training has been performed using <b>dynamic masking</b> between epochs and
 exploiting the <b>whole word masking</b> technique.
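The vocabulary reduction described in the model description (document-level token frequencies computed over Wikipedia, kept above a 0.1% threshold, then the embedding layer shrunk accordingly) can be sketched roughly as follows. This is an illustrative sketch, not the author's script: the `microsoft/Multilingual-MiniLM-L12-H384` checkpoint stands in for the L6xH384 mMiniLMv2 weights, the tiny `corpus` stands in for Italian Wikipedia, and rebuilding the tokenizer vocabulary to match the retained ids is left out.

```python
# Rough sketch of the vocabulary-reduction idea: keep only tokens whose
# document-level frequency over a reference corpus is at least 0.1%, plus the
# special tokens, then slice the embedding matrix down to the retained rows.
# Checkpoint name and corpus are placeholders, not the card's actual inputs.
from collections import Counter

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "microsoft/Multilingual-MiniLM-L12-H384"  # stand-in for L6xH384 mMiniLMv2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

corpus = ["primo documento di esempio", "secondo documento di esempio"]  # stand-in for Italian Wikipedia

# Document-level frequency: in how many documents does each (lowercased) token appear?
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(set(tokenizer(doc.lower(), add_special_tokens=False)["input_ids"]))

threshold = 0.001 * len(corpus)  # the 0.1% frequency threshold from the card
keep_ids = sorted(set(tokenizer.all_special_ids) | {i for i, n in doc_freq.items() if n >= threshold})

# Slice the input embeddings to the retained rows and keep the MLM head tied.
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[keep_ids])
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep_ids)
model.tie_weights()
# The tokenizer vocabulary must also be rebuilt to match `keep_ids` (omitted here).
```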
 
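The size figures quoted in the card (17M parameters, 67MB, 85% lighter) can be sanity-checked with a quick back-of-the-envelope count. The architecture numbers below (6 layers, hidden size 384, feed-forward size 1536, 512 positions) are inferred from the L6xH384 designation rather than stated in the card, and BERT-base's ~110M parameters is used as the reference "typical mono-lingual BERT".

```python
# Back-of-the-envelope check of the quoted figures (17M parameters, 67MB, 85% lighter),
# ignoring biases, LayerNorms and the MLM head; all numbers are rough estimates.
hidden, ffn, layers, vocab, max_pos = 384, 1536, 6, 14_610, 512

embeddings = vocab * hidden + max_pos * hidden       # token + position embeddings
per_layer = 4 * hidden * hidden + 2 * hidden * ffn   # attention (Q, K, V, O) + feed-forward
total = embeddings + layers * per_layer

print(f"~{total / 1e6:.1f}M parameters")             # ~16.4M, close to the quoted 17M
print(f"~{total * 4 / 1e6:.0f}MB in fp32")           # ~66MB, close to the quoted 67MB
print(f"~{1 - total / 110e6:.0%} lighter than BERT-base (~110M params)")  # ~85%
```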
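The training procedure (10K steps of masked language modeling with AdamW, an effective batch size of 512 obtained through 128 gradient accumulation steps, sequence length 512, a linearly decaying learning rate from 5e-5, dynamic masking and whole word masking) maps fairly directly onto the Hugging Face Trainer. The sketch below only illustrates that mapping and is not the author's training script: the checkpoint name, output directory, per-device batch size and masking probability are assumptions, and dataset preparation is left out.

```python
# Illustrative mapping of the listed hyper-parameters onto the HF Trainer; this is
# not the author's script. Masking happens in the collator at batch time, so every
# epoch sees fresh masks (dynamic masking), with whole words masked together.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/Multilingual-MiniLM-L12-H384"  # placeholder starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)  # assumed masking rate

args = TrainingArguments(
    output_dir="minilm-italian-uncased",  # hypothetical output path
    max_steps=10_000,                     # 10K steps, as stated in the card
    per_device_train_batch_size=4,        # assumed; 4 x 128 accumulation = 512 effective batch
    gradient_accumulation_steps=128,
    learning_rate=5e-5,
    lr_scheduler_type="linear",           # linear decay from the initial learning rate
    # The Trainer's default optimizer is AdamW, matching the card.
)

# `train_dataset` would be Italian Wikipedia, tokenized into length-512 sequences.
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset, tokenizer=tokenizer)
# trainer.train()
```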