julien-c (HF staff) committed
Commit
657f10c
1 Parent(s): 969c10a

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md

Files changed (1)
  1. README.md +51 -0
README.md ADDED
@@ -0,0 +1,51 @@
+ ---
+ language: ro
+ ---
+
+ # bert-base-romanian-uncased-v1
+
+ The BERT **base**, **uncased** model for Romanian, trained on a 15 GB corpus. Version: ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+ ### How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
+ model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
+
+ # tokenize a sentence and run it through the model
+ input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
+ outputs = model(input_ids)
+
+ # get the encoding: the last hidden state is the first element of the output tuple
+ last_hidden_states = outputs[0]
+ ```
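+
+ If you need one vector per sentence rather than per-token states, a common approach (an illustrative sketch, not part of the original card) is to mean-pool the last hidden states over the attention mask; `sentence_embedding` is a hypothetical name, and the snippet continues from the code above:
+
+ ```python
+ # Hedged sketch: mean-pool token states into a single sentence vector,
+ # reusing `tokenizer` and `model` from the snippet above.
+ inputs = tokenizer("Acesta este un test.", return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+ mask = inputs["attention_mask"].unsqueeze(-1).float()          # (1, seq_len, 1)
+ sentence_embedding = (outputs[0] * mask).sum(1) / mask.sum(1)  # (1, hidden_size)
+ ```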
+
+ ### Evaluation
+
+ Evaluation is performed on the Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) treebank (UPOS, XPOS and LAS) and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given on the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
+
+ The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, which at the time of writing was the only other BERT model available for Romanian.
+
+ | Model                          |   UPOS    |   XPOS    |    NER    |    LAS    |
+ |--------------------------------|:---------:|:---------:|:---------:|:---------:|
+ | bert-base-multilingual-uncased |   97.65   |   95.72   |   83.91   |   87.65   |
+ | bert-base-romanian-uncased-v1  | **98.18** | **96.84** | **85.26** | **89.61** |
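+
+ For orientation only, scores like those above are typically obtained by fine-tuning a token-classification head on each task; the sketch below (hypothetical, not the authors' evaluation code, which lives on the evaluation page linked above) shows such a setup for UPOS:
+
+ ```python
+ # Hedged sketch of a standard token-classification setup for UPOS tagging;
+ # the actual evaluation scripts are in the linked Romanian-Transformers repo.
+ from transformers import AutoModelForTokenClassification
+
+ num_upos_tags = 17  # Universal Dependencies defines 17 UPOS tags
+ model = AutoModelForTokenClassification.from_pretrained(
+     "dumitrescustefan/bert-base-romanian-uncased-v1",
+     num_labels=num_upos_tags,
+ )
+ # ... fine-tune on the ro_rrt training split, then report token-level
+ # accuracy on the test split.
+ ```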
+
+ ### Corpus
+
+ The model was trained on the following corpora (statistics in the table below are computed after cleaning):
+
+ | Corpus    | Lines (M) | Words (M)   | Chars (B)  | Size (GB) |
+ |-----------|:---------:|:-----------:|:----------:|:---------:|
+ | OPUS      | 55.05     | 635.04      | 4.045      | 3.8       |
+ | OSCAR     | 33.56     | 1725.82     | 11.411     | 11        |
+ | Wikipedia | 1.54      | 60.47       | 0.411      | 0.4       |
+ | **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |
+
+ ### Acknowledgements
+
+ - We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!