iliemihai committed on
Commit
9aa5f5b
1 Parent(s): 86da424

Update README.md

Files changed (1)
  1. README.md +57 -92
README.md CHANGED
@@ -1,129 +1,94 @@
  ---
- pipeline_tag: sentence-similarity
  tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
- - transformers
-
  ---

- # {MODEL_NAME}
-
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->

- ## Usage (Sentence-Transformers)
-
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
- ```
- pip install -U sentence-transformers
- ```

- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
-
- model = SentenceTransformer('{MODEL_NAME}')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
-
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

- #Mean Pooling - Take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']

- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
- model = AutoModel.from_pretrained('{MODEL_NAME}')

- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)

- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

- print("Sentence embeddings:")
- print(sentence_embeddings)
- ```

- ## Evaluation Results

- <!--- Describe how your model was evaluated -->

- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

- ## Training
- The model was trained with the parameters:

- **DataLoader**:

- `torch.utils.data.dataloader.DataLoader` of length 8047 with parameters:
  ```
- {'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```

- **Loss**:

- `sentence_transformers.losses.TripletLoss.TripletLoss` with parameters:
- ```
- {'distance_metric': 'TripletDistanceMetric.EUCLIDEAN', 'triplet_margin': 5}
- ```
-
- Parameters of the fit()-Method:
  ```
- {
-     "epochs": 10,
-     "evaluation_steps": 0,
-     "evaluator": "NoneType",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 8047,
-     "weight_decay": 0.01
  }
  ```

- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- )
- ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->

  ---
+ language: ro
  tags:
+ - bert
+ - fill-mask
+ license: mit
  ---

+ # bert-base-romanian-uncased-v1

+ The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

+ ### How to use

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

+ # load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
+ model = AutoModel.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1")

+ # tokenize a sentence and run it through the model
+ input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
+ outputs = model(input_ids)

+ # get the encoding
+ last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
+ ```
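
The snippet above only returns token-level hidden states. Since this is a sentence-similarity model (the sentence-transformers architecture in the previous revision of this card lists mean pooling over the token embeddings), a pooling step is still needed to get one vector per sentence. Below is a minimal, unofficial sketch of that step; the `mean_pooling` helper and the example sentences are illustrative, not part of the original card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1")

sentences = ["Acesta este un test.", "Acesta este un alt exemplu."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])  # shape (2, 768)
score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(score.item())
```

If sentence-transformers is installed, `SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1").encode(sentences)` should produce the same pooled embeddings directly.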

+ Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
+ ```
+ text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
+ ```
+ because the model was **NOT** trained on cedilla ``ş`` and ``ţ``. If you skip this step, performance drops because of ``<UNK>`` tokens and an increased number of tokens per word.
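
If it helps, the replacement can be wrapped in a small helper and applied before tokenization; `sanitize` below is a hypothetical name, and the snippet reuses `torch` and the `tokenizer` from the usage example above:

```python
def sanitize(text: str) -> str:
    # map cedilla s/t to the comma-below letters the model was trained on
    return text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

input_ids = torch.tensor(
    tokenizer.encode(sanitize("Aceasta este o ştire."), add_special_tokens=True)
).unsqueeze(0)
```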

+ ### Evaluation

+ Evaluation is performed on the Romanian STSb dataset.

+ | Model | Spearman | Pearson |
+ |-----------------------------------------|:------:|:------:|
+ | bert-base-romanian-uncased-v1 | 0.8086 | 0.8159 |
+ | sentence-bert-base-romanian-uncased-v1 | **0.84** | **0.84** |
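
For context, STS correlation numbers like these are typically obtained by comparing cosine similarities of the two sentence embeddings against the gold scores. A rough sketch, assuming an `embed` function such as the mean-pooling helper above and pre-loaded sentence pairs with gold scores (the data loading and all names here are placeholders, not the exact evaluation script):

```python
import torch
from scipy.stats import pearsonr, spearmanr

def sts_correlations(pairs, gold_scores, embed):
    # pairs: list of (sentence1, sentence2); embed: list[str] -> (N, 768) tensor
    emb1 = embed([a for a, _ in pairs])
    emb2 = embed([b for _, b in pairs])
    predicted = torch.nn.functional.cosine_similarity(emb1, emb2).tolist()
    rho, _ = spearmanr(predicted, gold_scores)
    r, _ = pearsonr(predicted, gold_scores)
    return rho, r
```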

+ ### Corpus

+ #### Pretraining

+ The model is trained on the following corpora (stats in the table below are after cleaning):

+ | Corpus    | Lines(M)  | Words(M)    | Chars(B)   | Size(GB) |
+ |-----------|:---------:|:-----------:|:----------:|:--------:|
+ | OPUS      | 55.05     | 635.04      | 4.045      | 3.8      |
+ | OSCAR     | 33.56     | 1725.82     | 11.411     | 11       |
+ | Wikipedia | 1.54      | 60.47       | 0.411      | 0.4      |
+ | **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |

+ #### Finetuning

+ The model is fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian).
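
The previous, auto-generated revision of this card records the fine-tuning setup as sentence-transformers `TripletLoss` (Euclidean distance, margin 5), batch size 16, 10 epochs, 8047 warmup steps, and lr 2e-05. A sketch of that setup follows; the triplet construction (premise as anchor, entailed hypothesis as positive, contradicted hypothesis as negative) and the example texts are assumptions, not the released training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# hypothetical RO_MNLI-derived triplets: (anchor, positive, negative)
train_examples = [
    InputExample(texts=[
        "Un bărbat cântă la chitară.",          # premise
        "O persoană cântă la un instrument.",   # entailment  -> positive
        "Un bărbat doarme.",                    # contradiction -> negative
    ]),
]

model = SentenceTransformer("dumitrescustefan/bert-base-romanian-uncased-v1")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(
    model=model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=8047,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
)
```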

+ ### Citation

+ If you use this model in a research paper, I'd kindly ask you to cite the following paper:

  ```
+ Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
  ```

+ or, in BibTeX:

  ```
+ @inproceedings{dumitrescu-etal-2020-birth,
+     title = "The birth of {R}omanian {BERT}",
+     author = "Dumitrescu, Stefan and
+       Avram, Andrei-Marius and
+       Pyysalo, Sampo",
+     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+     month = nov,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2020.findings-emnlp.387",
+     doi = "10.18653/v1/2020.findings-emnlp.387",
+     pages = "4324--4328",
  }
  ```

+ #### Acknowledgements

+ - We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!