---
language: it
---

# UmBERTo Wikipedia Uncased

[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora that uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available on the [Hugging Face model hub](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1).

<p align="center">
  <img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> <br>
  Marco Lodola, Monument to Umberto Eco, Alessandria 2019
</p>

## Dataset

UmBERTo Wikipedia Uncased was trained on a relatively small corpus (~7 GB) extracted from [Wikipedia-ITA](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).

## Pre-trained model

| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| `umberto-wikipedia-uncased-v1` | YES | NO | SPM | 32K | 100k | [Link](http://bit.ly/35wbSj6) |

This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
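
Whole Word Masking means that when any part of a word is masked during pre-training, all of its SentencePiece sub-pieces are masked together rather than individually. A minimal illustrative sketch of the idea (toy pieces and a hypothetical `whole_word_mask` helper, not actual training code):

```python
# Toy SentencePiece-style pieces; "▁" marks the start of a word.
# Here "scrittore" has been split into two pieces.
pieces = ["▁umberto", "▁scritt", "ore"]

def whole_word_mask(pieces, word_index, mask_token="<mask>"):
    """Replace every subword piece of the word at word_index with mask_token."""
    masked, word = [], -1
    for piece in pieces:
        if piece.startswith("▁"):
            word += 1
        masked.append(mask_token if word == word_index else piece)
    return masked

print(whole_word_mask(pieces, 1))  # both pieces of "scrittore" are masked
```

With standard token masking, only one of the two pieces of "scrittore" might be masked, making the prediction task easier; masking the whole word forces the model to reconstruct it from context alone.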

## Downstream Tasks

These results refer to the `umberto-wikipedia-uncased` model. All details are available on the [UmBERTo](https://github.com/musixmatchresearch/umberto) official page.

#### Named Entity Recognition (NER)

| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **ICAB-EvalITA07** | **86.240** | 85.939 | 86.544 | 98.534 |
| **WikiNER-ITA** | **90.483** | 90.328 | 90.638 | 98.661 |
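
The F1 column above is the harmonic mean of the Precision and Recall columns; a quick arithmetic check against the table:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * precision * recall / (precision + recall)

# Check the NER rows above
print(round(f1_score(85.939, 86.544), 3))  # ICAB-EvalITA07 → 86.24
print(round(f1_score(90.328, 90.638), 3))  # WikiNER-ITA → 90.483
```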

#### Part of Speech (POS)

| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **UD_Italian-ISDT** | 98.563 | 98.508 | 98.618 | **98.717** |
| **UD_Italian-ParTUT** | 97.810 | 97.835 | 97.784 | **98.060** |

## Usage

##### Load UmBERTo Wikipedia Uncased with AutoModel, AutoTokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # The last hidden states are the first element of the output
```
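
`last_hidden_states` has shape `(batch, tokens, hidden)`; a common next step, not shown in the card, is to average over the token dimension to obtain one fixed-size sentence vector. A dependency-free sketch of the idea with toy numbers (real outputs are torch tensors, so you would use `last_hidden_states.mean(dim=1)` instead):

```python
# Toy "hidden states": 3 tokens, hidden size 4
hidden = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
]

def mean_pool(states):
    """Average token vectors into one fixed-size sentence vector."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

print(mean_pool(hidden))  # [2.0, 3.0, 4.0, 5.0]
```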

##### Predict masked token:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-wikipedia-uncased-v1",
    tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
)

result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
```
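
The pipeline returns its candidates ranked by score; as a small illustration (reusing the scores above as hypothetical data), the top completion is simply the entry with the highest `score`:

```python
# Hypothetical candidates, mirroring the scores printed above
candidates = [
    {"sequence": "umberto eco è stato un grande scrittore", "score": 0.5785},
    {"sequence": "umberto eco è anche un grande scrittore", "score": 0.3381},
    {"sequence": "umberto eco è considerato un grande scrittore", "score": 0.0272},
]

best = max(candidates, key=lambda c: c["score"])
print(best["sequence"])  # → umberto eco è stato un grande scrittore
```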

## Citation

All of the original datasets are publicly available or were released with the owners' permission. The datasets are all released under a CC0 or CC BY license.

* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500), [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)

```
@inproceedings{magnini2006annotazione,
  title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
  author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
  booktitle = {Proc. of SILFI 2006},
  year = {2006}
}
@inproceedings{magnini2006cab,
  title = {I-CAB: the Italian Content Annotation Bank.},
  author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
  booktitle = {LREC},
  pages = {963--968},
  year = {2006},
  organization = {Citeseer}
}
```

## Authors

**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)

## About Musixmatch AI

![Musixmatch AI mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)

We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch).
Follow us on [Twitter](https://twitter.com/musixmatchai) and [Github](https://github.com/musixmatchresearch).