---
language: ro
library: [PyTorch, Transformers]
dataset: [wikipedia, oscar, rotex]
---

# Romanian DistilBERT

This repository contains the uncased Romanian DistilBERT. The teacher model used for distillation is [readerbench/RoBERT-base](https://huggingface.co/readerbench/RoBERT-base).
## Usage

```python
from transformers import AutoTokenizer, AutoModel

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("racai/distilbert-base-romanian-uncased")
model = AutoModel.from_pretrained("racai/distilbert-base-romanian-uncased")

# tokenize a test sentence
input_ids = tokenizer.encode("aceasta este o propoziție de test.", add_special_tokens=True, return_tensors="pt")

# run the tokens through the model
outputs = model(input_ids)

print(outputs)
```
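The first element of `outputs` is the last hidden state, a tensor of shape `(batch, sequence_length, hidden_size)` (768 for DistilBERT-base). If you need a single fixed-size sentence vector, one common option is mean pooling over the token dimension. A minimal sketch using a dummy tensor in place of the real hidden states, so it runs without downloading the model:

```python
import torch

# stand-in for outputs.last_hidden_state: batch of 1, 6 tokens, hidden size 768
last_hidden_state = torch.randn(1, 6, 768)

# average over the token dimension to obtain one vector per sentence
sentence_embedding = last_hidden_state.mean(dim=1)

print(sentence_embedding.shape)  # torch.Size([1, 768])
```

In practice you would also mask out padding tokens before averaging; this sketch assumes a single unpadded sentence.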

## Model Size

Romanian DistilBERT is 35% smaller than the original Romanian BERT.

| Model                          | Size (MB) | Params (Millions) |
|--------------------------------|:---------:|:-----------------:|
| bert-base-romanian-cased-v1    |    441    |        114        |
| distilbert-base-romanian-cased |    282    |       72          |
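The reduction can be checked directly from the table figures; the exact percentage depends on whether you compare disk size or parameter count:

```python
# figures from the table above
full_mb, distil_mb = 441, 282
full_params, distil_params = 114, 72

# both ratios land in the neighborhood of the quoted reduction
print(f"size reduction:  {1 - distil_mb / full_mb:.0%}")
print(f"param reduction: {1 - distil_params / full_params:.0%}")
```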

## Evaluation

We evaluated the model against RoBERT-base on seven Romanian tasks:

- **UPOS**: Universal Part of Speech (F1-macro)
- **XPOS**: Extended Part of Speech (F1-macro)
- **NER**: Named Entity Recognition (F1-macro)
- **SAPN**: Sentiment Analysis - Positive vs Negative (Accuracy)
- **SAR**: Sentiment Analysis - Rating (F1-macro)
- **DI**: Dialect Identification (F1-macro)
- **STS**: Semantic Textual Similarity (Pearson)
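Most of these tasks are scored with macro-averaged F1, which weights every class equally regardless of how many examples it has. For reference, a minimal pure-Python sketch of the metric (labels and predictions are illustrative toy data, not actual task output):

```python
def f1_macro(y_true, y_pred):
    """Macro F1: compute per-class F1, then average with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lbl in labels:
        tp = sum(t == lbl and p == lbl for t, p in zip(y_true, y_pred))
        fp = sum(t != lbl and p == lbl for t, p in zip(y_true, y_pred))
        fn = sum(t == lbl and p != lbl for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(scores) / len(scores)

# toy example: two classes, one misclassified example
print(f1_macro(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"]))  # ≈ 0.733
```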

| Model                            | UPOS  | XPOS  | NER   | SAPN  | SAR   | DI    | STS   |
|----------------------------------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| RoBERT-base                      | 98.02 | 97.15 | 85.14 | 98.30 | 79.40 | 96.07 | 81.18 |
| distilbert-base-romanian-uncased | 97.12 | 95.79 | 83.11 | 98.01 | 79.58 | 96.11 | 79.80 |