julien-c HF staff committed on
Commit
7bb8e47
1 Parent(s): 80ab8e7

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/mrm8488/RuPERTa-base/README.md

Files changed (1)
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ language: es
+ thumbnail: https://i.imgur.com/DUlT077.jpg
+ widget:
+ - text: "España es un país muy <mask> en la UE"
+ ---
+
+ # RuPERTa: the Spanish RoBERTa 🎃<img src="https://abs-0.twimg.com/emoji/v2/svg/1f1ea-1f1f8.svg" alt="spain flag" width="25"/>
+
+ RuPERTa-base (uncased) is a [RoBERTa model](https://github.com/pytorch/fairseq/tree/master/examples/roberta) trained on an *uncased* version of a [big Spanish corpus](https://github.com/josecannete/spanish-corpora).
+ RoBERTa iterates on BERT's pretraining procedure: it trains the model longer, with bigger batches over more data; removes the next sentence prediction objective; trains on longer sequences; and dynamically changes the masking pattern applied to the training data.
+ The architecture is the same as `roberta-base`:
+
+ `roberta.base:` **RoBERTa** using the **BERT-base architecture**, 125M params
+
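To make "dynamically changing the masking pattern" concrete, here is a minimal sketch in plain Python. It is not the fairseq implementation: the helper name `dynamic_mask` and the 15% rate are assumptions for illustration, and the sketch always substitutes `<mask>`, whereas BERT-style masking also leaves some selected tokens unchanged or swaps in random ones.

```python
import random

MASK = "<mask>"

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a freshly masked copy of `tokens` plus the masked positions."""
    rng = rng or random.Random()
    # Always mask at least one token so every example carries a training signal
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions

tokens = "españa es un país muy importante en la ue".split()
rng = random.Random(0)
# Static masking would compute one pattern and reuse it every epoch;
# dynamic masking draws a new one each time the sentence is seen.
for epoch in range(3):
    masked, positions = dynamic_mask(tokens, rng=rng)
    print(epoch, " ".join(masked))
```

Because the pattern is drawn each time the sentence is seen, the model is likely to see different masked positions for the same sentence across epochs.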
+ ## Benchmarks 🧾
+ WIP (I am still working on it) 🚧
+
+ | Task/Dataset | F1 | Precision | Recall | Fine-tuned model | Reproduce it |
+ | ------------ | ----: | --------: | -----: | ---------------: | -----------: |
+ | POS | 97.39 | 97.47 | 97.32 | [RuPERTa-base-finetuned-pos](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pos) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/RuPERTa_base_finetuned_POS.ipynb) |
+ | NER | 77.55 | 75.53 | 79.68 | [RuPERTa-base-finetuned-ner](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-ner) | |
+ | SQuAD-es v1 | to-do | | | [RuPERTa-base-finetuned-squadv1](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv1) | |
+ | SQuAD-es v2 | to-do | | | [RuPERTa-base-finetuned-squadv2](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv2) | |
+
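As a quick sanity check on the table, F1 can be recomputed from the reported precision and recall. This sketch assumes the F1 column is the harmonic mean of the two other columns; the model card does not say how the scores were averaged, and `f1_score` is a helper written here, not a library function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table above
reported = {"POS": (97.47, 97.32), "NER": (75.53, 79.68)}
for task, (p, r) in reported.items():
    print(f"{task}: F1 = {f1_score(p, r):.2f}")
# POS: F1 = 97.39
# NER: F1 = 77.55
```

Both values agree with the F1 column, which suggests the three columns come from the same evaluation run.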
+ ## Model in action 🔨
+
+ ### Usage for POS and NER 🏷
+
+ ```python
+ import torch
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ # Label map of the fine-tuned NER model (class index -> tag)
+ id2label = {
+     "0": "B-LOC",
+     "1": "B-MISC",
+     "2": "B-ORG",
+     "3": "B-PER",
+     "4": "I-LOC",
+     "5": "I-MISC",
+     "6": "I-ORG",
+     "7": "I-PER",
+     "8": "O"
+ }
+
+ tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
+ model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
+
+ text = "Julien, CEO de HF, nació en Francia."
+
+ # Encode the sentence and add a batch dimension
+ input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
+
+ outputs = model(input_ids)
+ logits = outputs[0]  # per-token class scores, shape (batch, seq_len, num_labels)
+
+ words = text.split(" ")
+ for sequence in logits:
+     for index, token_logits in enumerate(sequence):
+         # Skip <s> at position 0. Token i is paired with word i-1, which
+         # assumes one token per whitespace-separated word.
+         if 0 < index <= len(words):
+             print(words[index - 1] + ": " + id2label[str(torch.argmax(token_logits).item())])
+
+ # Output:
+ '''
+ Julien,: I-PER
+ CEO: O
+ de: O
+ HF,: B-ORG
+ nació: I-PER
+ en: I-PER
+ Francia.: I-LOC
+ '''
+ ```
+
+ For **POS**, just change the `id2label` dictionary and the model path to [mrm8488/RuPERTa-base-finetuned-pos](https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pos).
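The loop in the snippet above pairs token `i` with whitespace word `i-1`, which only lines up when every word maps to exactly one token. One common way to handle sub-word splits is to keep the prediction of the first sub-token of each word. The sketch below assumes a fast tokenizer, where the encoding returned by `tokenizer(text)` exposes `word_ids()`; the helper `first_subtoken_labels` and the sample values are hypothetical and not taken from the model.

```python
def first_subtoken_labels(word_ids, token_label_ids, id2label):
    """Keep the prediction of the first sub-token of each word.

    word_ids: per-token word index, None for special tokens such as <s> and </s>
    token_label_ids: per-token predicted class index (e.g. argmax over the logits)
    """
    labels = []
    previous = None
    for word_id, label_id in zip(word_ids, token_label_ids):
        # A new word starts whenever the word index changes
        if word_id is not None and word_id != previous:
            labels.append(id2label[str(label_id)])
        previous = word_id
    return labels

# Hypothetical values: three words, the second one split into two sub-tokens
id2label = {"3": "B-PER", "8": "O"}
word_ids = [None, 0, 1, 1, 2, None]
token_label_ids = [8, 3, 8, 8, 8, 8]
print(first_subtoken_labels(word_ids, token_label_ids, id2label))
# ['B-PER', 'O', 'O']
```

Note that the "words" here are whatever the tokenizer's pre-tokenization step produces, so punctuation may count as a separate word rather than staying attached as it does with `text.split(" ")`.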
+
+ ### Fast usage for LM with `pipelines` 🧪
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
+
+ # AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead
+ model = AutoModelForMaskedLM.from_pretrained('mrm8488/RuPERTa-base')
+ tokenizer = AutoTokenizer.from_pretrained("mrm8488/RuPERTa-base", do_lower_case=True)
+
+ pipeline_fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+
+ pipeline_fill_mask("España es un país muy <mask> en la UE")
+ ```
+
+ ```json
+ [
+   {
+     "score": 0.1814306527376175,
+     "sequence": "<s> españa es un país muy importante en la ue</s>",
+     "token": 1560
+   },
+   {
+     "score": 0.024842597544193268,
+     "sequence": "<s> españa es un país muy fuerte en la ue</s>",
+     "token": 2854
+   },
+   {
+     "score": 0.02473250962793827,
+     "sequence": "<s> españa es un país muy pequeño en la ue</s>",
+     "token": 2948
+   },
+   {
+     "score": 0.023991240188479424,
+     "sequence": "<s> españa es un país muy antiguo en la ue</s>",
+     "token": 5240
+   },
+   {
+     "score": 0.0215945765376091,
+     "sequence": "<s> españa es un país muy popular en la ue</s>",
+     "token": 5782
+   }
+ ]
+ ```
+
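The `score` values in the output above are small because probability mass is spread over the whole vocabulary. A rough sketch of how such a ranking can be produced, assuming the score is the softmax probability of each token at the `<mask>` position; the function `top_k_fill`, the toy vocabulary, and the logits are all made up for illustration and do not come from the model.

```python
import math

def top_k_fill(logits, vocab, k=5):
    """Softmax over the vocabulary logits, then return the k most likely tokens."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": vocab[i]} for i in ranked]

# Toy vocabulary and logits, made up for illustration only
vocab = ["importante", "fuerte", "pequeño", "antiguo", "popular", "azul"]
logits = [2.0, 0.1, 0.0, -0.1, -0.2, -3.0]
for candidate in top_k_fill(logits, vocab, k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```

With a real model the logits would come from the masked-LM head at the mask position, and the vocabulary would have tens of thousands of entries rather than six.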
+ ## Acknowledgments
+
+ I thank the [🤗/transformers team](https://github.com/huggingface/transformers) for answering my questions and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
+
+ > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+ > Made with <span style="color: #e25555;">&hearts;</span> in Spain