Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
README.md
ADDED
@@ -0,0 +1,51 @@
---
language: ro
---

# bert-base-romanian-uncased-v1

The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

### How to use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(input_ids)

# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the model output
```

### Evaluation

Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).

The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, which at the time of writing was the only available BERT model that supported Romanian.

| Model                          |   UPOS    |   XPOS    |    NER    |    LAS    |
|--------------------------------|:---------:|:---------:|:---------:|:---------:|
| bert-base-multilingual-uncased |   97.65   |   95.72   |   83.91   |   87.65   |
| bert-base-romanian-uncased-v1  | **98.18** | **96.84** | **85.26** | **89.61** |
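
To make the gap between the two rows easier to read, the sketch below computes the absolute gain per task and, as an extra view, the relative error reduction. The scores are copied from the table above. Treating `100 - score` as an error rate is an assumption made here for illustration, not a metric reported by the model card, and the `compare` helper is hypothetical.

```python
# Scores copied from the evaluation table above (percentages).
multilingual = {"UPOS": 97.65, "XPOS": 95.72, "NER": 83.91, "LAS": 87.65}
romanian = {"UPOS": 98.18, "XPOS": 96.84, "NER": 85.26, "LAS": 89.61}


def compare(baseline, candidate):
    """Return {task: (absolute gain in points, relative error reduction in %)}."""
    result = {}
    for task, base_score in baseline.items():
        gain = candidate[task] - base_score
        # Assumption: (100 - score) is treated as the error rate of the baseline.
        error_reduction = 100.0 * gain / (100.0 - base_score)
        result[task] = (round(gain, 2), round(error_reduction, 1))
    return result


gains = compare(multilingual, romanian)
for task, (gain, reduction) in gains.items():
    print(f"{task}: +{gain:.2f} points, {reduction:.1f}% relative error reduction")
```

The absolute gains range from 0.53 points on UPOS to 1.96 points on LAS.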

### Corpus

The model was trained on the following corpora (the statistics in the table below were computed after cleaning):

| Corpus    | Lines (M) |  Words (M)  | Chars (B)  | Size (GB) |
|-----------|:---------:|:-----------:|:----------:|:---------:|
| OPUS      |   55.05   |   635.04    |   4.045    |    3.8    |
| OSCAR     |   33.56   |   1725.82   |   11.411   |    11     |
| Wikipedia |   1.54    |    60.47    |   0.411    |    0.4    |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |
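
As a quick sanity check on the table above, this small sketch sums each column and compares it with the reported **Total** row, then derives each corpus's share of the total size. All figures are copied from the table. The variable names are illustrative only.

```python
# Per-corpus statistics copied from the table above:
# (lines in millions, words in millions, characters in billions, size in GB)
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)

# Sum each column across the three corpora.
computed_total = tuple(round(sum(column), 3) for column in zip(*corpora.values()))

# Share of the total on-disk size contributed by each corpus, in percent.
size_share = {
    name: round(100 * stats[3] / reported_total[3], 1)
    for name, stats in corpora.items()
}

print("computed totals:", computed_total)
print("size share (%):", size_share)
```

The column sums match the reported totals, and OSCAR accounts for roughly 72% of the corpus by size.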

#### Acknowledgements

- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!