osiria committed
Commit fdbca53
1 Parent(s): 49476a6

Update README.md

Files changed (1)
  1. README.md +10 -11
README.md CHANGED
@@ -33,23 +33,22 @@ widget:
 <h3>Introduction</h3>

 This model is a <b>lightweight</b> and uncased version of <b>BERT</b> <b>[1]</b> for the <b>italian</b> language. With its <b>55M parameters</b> and <b>220MB</b> size,
- it's <b>50% lighter</b> than a typical mono-lingual BERT model, and ideal
- for circumstances where memory consumption and execution speed are critical aspects, while maintaining high quality results.
+ it's <b>50% lighter</b> than a typical mono-lingual BERT model. It is ideal when memory consumption and execution speed are critical while maintaining high quality results.


 <h3>Model description</h3>

- The model has been obtained by taking the multilingual <b>DistilBERT</b> <b>[2]</b> model (from the HuggingFace team: [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)) as a starting point,
- and then focusing it on the italian language while at the same time turning it into an uncased model by modifying the embedding layer
+ The model builds on the multilingual <b>DistilBERT</b> <b>[2]</b> model (from the HuggingFace team: [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)) as a starting point,
+ focusing it on the italian language while at the same time turning it into an uncased model by modifying the embedding layer
 (as in <b>[3]</b>, but computing document-level frequencies over the <b>Wikipedia</b> dataset and setting a frequency threshold of 0.1%), which brings a considerable
 reduction in the number of parameters.

- In order to compensate for the deletion of cased tokens, which now forces the model to exploit lowercase representations of words which were previously capitalized,
+ To compensate for the deletion of cased tokens, which now forces the model to exploit lowercase representations of words previously capitalized,
 the model has been further pre-trained on the italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [4]</b> technique to make it more robust
- with respect to the new uncased representations.
+ to the new uncased representations.

 The resulting model has 55M parameters, a vocabulary of 13.832 tokens, and a size of 220MB, which makes it <b>50% lighter</b> than a typical mono-lingual BERT model and
- 20% lighter than a typical mono-lingual DistilBERT model.
+ 20% lighter than a standard mono-lingual DistilBERT model.


 <h3>Training procedure</h3>
@@ -71,12 +70,12 @@ provided with the dataset, while for Named Entity Recognition the metrics have b
 | Part of Speech Tagging | 97.48 | 97.29 | 97.37 |
 | Named Entity Recognition | 89.29 | 89.84 | 89.53 |

- The metrics have been computed at token level and macro-averaged over the classes.
+ The metrics have been computed at the token level and macro-averaged over the classes.


 <h3>Demo</h3>

- You can try the model online (fine-tuned on named entity recognition) using this webapp: https://huggingface.co/spaces/osiria/next-it-demo
+ You can try the model online (fine-tuned on named entity recognition) using this web app: https://huggingface.co/spaces/osiria/next-it-demo

 <h3>Quick usage</h3>

@@ -92,8 +91,8 @@ pipeline_mlm = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)

 <h3>Limitations</h3>

- This lightweight model is mainly trained on Wikipedia, so it's particularly suitable as an agile analyzer for large volumes of natively digital text taken
- from the world wide web, written in a correct and fluent form (like wikis, web pages, news, etc.). It may show limitations when it comes to chaotic text, containing errors and slang expressions
+ This lightweight model is mainly trained on Wikipedia, so it's particularly suitable as an agile analyzer for large volumes of natively digital text
+ from the world wide web, written in a correct and fluent form (like wikis, web pages, news, etc.). However, it may show limitations when it comes to chaotic text, containing errors and slang expressions
 (like social media posts) or when it comes to domain-specific text (like medical, financial or legal content).

 <h3>References</h3>
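
As an aside on the Model description edited above: the vocabulary-reduction step (keeping only the tokens of the multilingual tokenizer whose document-level frequency over Wikipedia clears a 0.1% threshold) can be sketched roughly as follows. This is a simplified illustration, not the authors' script; the tiny `corpus` list is a placeholder for the italian Wikipedia articles.

```python
# Simplified sketch of the vocabulary-reduction step: keep only tokens whose
# document-level frequency over an italian corpus exceeds a 0.1% threshold.
# NOTE: illustrative only; `corpus` is a placeholder for the italian Wikipedia.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

corpus = [
    "Roma è la capitale d'Italia.",
    "La Divina Commedia è un poema di Dante Alighieri.",
]  # placeholder documents

doc_freq = Counter()
for doc in corpus:
    # uncased setup: lowercase the text before tokenizing
    doc_freq.update(set(tokenizer.tokenize(doc.lower())))

threshold = 0.001 * len(corpus)  # 0.1% document frequency
kept_tokens = {tok for tok, freq in doc_freq.items() if freq >= threshold}

# Special tokens are always retained; the embedding matrix would then be
# sliced down to the rows corresponding to `kept_tokens`.
kept_tokens.update(tokenizer.all_special_tokens)
print(f"Keeping {len(kept_tokens)} of {tokenizer.vocab_size} tokens")
```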
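
The further pre-training with whole word masking on the italian Wikipedia split could look roughly like the sketch below, using the `DataCollatorForWholeWordMask` utility from transformers. The starting checkpoint id, the 1% data slice, and the `Trainer` hyperparameters are assumptions for illustration, not the authors' actual setup.

```python
# Minimal sketch of continued masked-language-model pre-training with
# whole word masking; hyperparameters and the model id are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

model_id = "distilbert-base-multilingual-cased"  # placeholder starting point
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# italian split of the Wikipedia dataset linked in the README
wiki_it = load_dataset("wikipedia", "20220301.it", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = wiki_it.map(tokenize, batched=True, remove_columns=wiki_it.column_names)

# Whole word masking: all sub-tokens of a masked word are masked together
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wwm-pretraining", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```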
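
The token-level, macro-averaged metrics mentioned next to the results table can be computed along these lines; scikit-learn and the toy tag sequences are assumptions made for the example, not the evaluation code used for the table.

```python
# Sketch of token-level, macro-averaged precision/recall/F1.
from sklearn.metrics import precision_recall_fscore_support

# hypothetical per-token gold and predicted NER tags
y_true = ["O", "B-PER", "I-PER", "O", "B-LOC"]
y_pred = ["O", "B-PER", "O",     "O", "B-LOC"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```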
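
Finally, the Quick usage section (whose fill-mask pipeline line appears as diff context above) follows the usual transformers pattern. A minimal sketch, with the multilingual base id used only as a stand-in for this model's actual repository id:

```python
# Minimal sketch of the Quick usage flow: load the model and tokenizer,
# build a fill-mask pipeline, and query it.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Substitute this model's Hub repository id; the multilingual base is used
# here only so that the sketch runs as-is.
MODEL_ID = "distilbert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Same construction as the pipeline_mlm line shown in the diff context
pipeline_mlm = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)

# The model described above is uncased, so inputs should be lowercased
print(pipeline_mlm("roma è la [MASK] d'italia."))
```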