---

language: ro

---

# ALR-BERT

ALR-BERT is a **cased** model for Romanian, trained on a 15GB corpus!
ALR-BERT is a multi-layer bidirectional Transformer encoder that shares ALBERT's factorized embedding parameterization and cross-layer parameter sharing. ALR-BERT-base follows ALBERT-base and features 12 parameter-sharing layers, a 128-dimensional embedding size, 768 hidden units, 12 attention heads, and GELU non-linearities. ALBERT is pre-trained on two objectives, masked language modeling (MLM) and sentence order prediction (SOP); ALR-BERT preserves both.
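
For reference, this architecture maps onto the ALBERT configuration in the `transformers` library roughly as follows (a sketch; the `vocab_size` value is a placeholder and is not taken from the released checkpoint):

```python
from transformers import AlbertConfig

# Sketch of the ALR-BERT-base architecture described above.
config = AlbertConfig(
    vocab_size=50000,       # placeholder, the real value comes with the released tokenizer
    embedding_size=128,     # factorized embedding parameterization
    hidden_size=768,
    num_hidden_layers=12,   # parameters are shared across all 12 layers
    num_attention_heads=12,
    hidden_act="gelu",
)
```
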
The model was trained with a batch size of 40 per GPU (for sequence length 128) and then 20 per GPU (for sequence length 512). The Layer-wise Adaptive Moments optimizer for Batch training (LAMB) was used, with a warm-up over the first 1% of steps up to a learning rate of 1e-4, followed by a decay. Training ran on eight NVIDIA Tesla V100 SXM3 GPUs with 32GB of memory each, and pre-training took around two weeks per model.
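
As a rough illustration of that schedule (the card only states "then a decay", so the linear decay shape below is an assumption, as are the function and variable names):

```python
def lamb_lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-4,
                    warmup_frac: float = 0.01) -> float:
    """Linear warm-up over the first 1% of steps to the peak learning rate,
    followed by a decay to zero (the decay shape here is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```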


The training methodology closely follows the previous work done for Romanian BERT (https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1).



### How to use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dragosnicolae555/ALR_BERT")
model = AutoModel.from_pretrained("dragosnicolae555/ALR_BERT")

# here add your magic
```
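
As a minimal example of what that "magic" could look like (the example sentence and variable names below are just for illustration), you can tokenize a Romanian sentence and extract the contextual embeddings from the encoder:

```python
# tokenize an example sentence (special tokens are added by default)
input_ids = tokenizer.encode("Acesta este un test.", return_tensors="pt")

# run it through the encoder without computing gradients
with torch.no_grad():
    outputs = model(input_ids)

# contextual embeddings of the last layer: [batch_size, seq_len, 768]
last_hidden_state = outputs.last_hidden_state
```
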

Remember to always sanitize your text! Replace the cedilla ``s`` and ``t`` letters with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **NOT** trained on cedilla ``s`` and ``t``. If you don't, performance will decrease due to ``<UNK>`` tokens and an increased number of tokens per word.
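
For instance, a small helper (the `sanitize` name is just illustrative) applied before tokenizing with the `tokenizer` loaded above:

```python
def sanitize(text: str) -> str:
    # map cedilla s/t to the comma-below letters the model was trained on
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

input_ids = tokenizer.encode(sanitize("Aceasta este o propoziţie."), return_tensors="pt")
```
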


### Evaluation

Here we evaluate ALR-BERT on the Simple Universal Dependencies task, training one model per task and measuring labeling performance on UPOS (Universal Part-of-Speech) and XPOS (eXtended Part-of-Speech) tags. We compare our proposed ALR-BERT with Romanian BERT and multilingual BERT, using the cased versions. To counteract the effect of the random seed, we repeat each experiment five times and report the mean score.




| Model                          |  UPOS |  XPOS  |  MLAS  |  AllTags  |
|--------------------------------|:-----:|:------:|:-----:|:-----:|
| M-BERT (cased)  | 93.87 | 89.89 | 90.01  | 87.04|
| Romanian BERT (cased)    |  95.56 |  95.35 |  92.78 |  93.22 | 
| ALR-BERT (cased)   |  **87.38** |  **84.05** |  **79.82** |  **78.82**| 
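
One way to set up such a POS tagger on top of ALR-BERT with `transformers` (a sketch, not necessarily the exact setup used for the scores above; 17 is the standard UD UPOS label count, and the fine-tuning loop and data loading are omitted):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# sketch of a UPOS tagging setup; the classification head is randomly
# initialized and still needs fine-tuning on a Romanian UD treebank
tokenizer = AutoTokenizer.from_pretrained("dragosnicolae555/ALR_BERT")
model = AutoModelForTokenClassification.from_pretrained(
    "dragosnicolae555/ALR_BERT", num_labels=17  # 17 universal POS tags
)
```
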

### Corpus 

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus    	| Lines(M) 	| Words(M) 	| Chars(B) 	| Size(GB) 	|
|-----------	|:--------:	|:--------:	|:--------:	|:--------:	|
| OPUS      	|   55.05  	|  635.04  	|   4.045  	|    3.8   	|
| OSCAR     	|   33.56  	|  1725.82 	|  11.411  	|    11    	|
| Wikipedia 	|   1.54   	|   60.47  	|   0.411  	|    0.4   	|
| **Total**     	|   **90.15**  	|  **2421.33** 	|  **15.867**  	|   **15.2**   	|