julien-c HF staff committed on
Commit
829db6e
1 Parent(s): c848d21

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md

Files changed (1)
  1. README.md +48 -0
README.md ADDED
---
language: ro
---

# bert-base-romanian-cased-v1

The BERT **base**, **cased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

### How to use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# get the encoding: the last hidden states are the first element of the output
last_hidden_states = outputs[0]
```
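
The snippet above yields one vector per token. If you need a single sentence vector instead, one common approach is to mean-pool the last hidden states — a minimal sketch, not prescribed by this card, assuming a transformers version recent enough to support the call-style tokenizer API:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# the call-style API returns input_ids plus an attention_mask
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the token vectors into one sentence embedding; with a single
# unpadded sentence a plain mean is fine, but for padded batches you would
# weight the mean by the attention mask
sentence_embedding = outputs[0].mean(dim=1)  # shape: (1, hidden_size)
```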

### Evaluation

Evaluation is performed on the Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) treebank (UPOS, XPOS and LAS) and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given on the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).

The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that worked on Romanian.

| Model                        | UPOS      | XPOS      | NER       | LAS       |
|------------------------------|:---------:|:---------:|:---------:|:---------:|
| bert-base-multilingual-cased | 97.87     | 96.16     | 84.13     | 88.04     |
| bert-base-romanian-cased-v1  | **98.00** | **96.46** | **85.88** | **89.69** |

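The NER column above reflects fine-tuning on RONEC. As a rough starting point — not the authors' evaluation code — a token-classification head can be attached to the same checkpoint; `num_labels` below is a placeholder to replace with your actual tagset size:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# num_labels is a placeholder: set it to the size of your label set
# (e.g. the RONEC tagset if reproducing the NER results above); the
# classification head is freshly initialized and must be fine-tuned
model = AutoModelForTokenClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-cased-v1",
    num_labels=9,
)
```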

### Corpus

The model was trained on the following corpora (the statistics below are computed after cleaning):

| Corpus    | Lines (M) | Words (M)   | Chars (B)  | Size (GB) |
|-----------|:---------:|:-----------:|:----------:|:---------:|
| OPUS      | 55.05     | 635.04      | 4.045      | 3.8       |
| OSCAR     | 33.56     | 1725.82     | 11.411     | 11        |
| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4       |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |

#### Acknowledgements

- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!