---
language: ro
---

# ALR-BERT

ALR-BERT, a **cased** model for Romanian, trained on a 15GB corpus!
ALR-BERT is a multi-layer bidirectional Transformer encoder that shares ALBERT's factorized embedding parameterization and cross-layer parameter sharing. ALR-BERT-base inherits from ALBERT-base and features 12 parameter-sharing layers, an embedding size of 128, 768 hidden units, 12 attention heads, and GELU non-linearities. ALBERT is pre-trained on two objectives, masked language modeling (MLM) and sentence order prediction (SOP); ALR-BERT preserves both.

The model was trained with a batch size of 40 per GPU (for sequence length 128) and then 20 per GPU (for sequence length 512). The Layer-wise Adaptive Moments optimizer for Batch training (LAMB) was used, with a warm-up over the first 1% of steps up to a learning rate of 1e-4, followed by a decay. Eight NVIDIA Tesla V100 SXM3 GPUs with 32GB of memory each were used, and pre-training took around 2 weeks per model.
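For illustration, the warm-up-then-decay schedule described above can be sketched as a small function. The peak learning rate and the 1% warm-up fraction come from the text; the linear decay to zero is an assumption, since the exact decay shape is not specified:

```python
def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.01):
    """Warm up linearly over the first warmup_frac of steps, then decay.

    The linear decay to zero is assumed for illustration; the exact
    decay shape used in pre-training is not specified.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # linear warm-up from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # linear decay from peak_lr down to 0 over the remaining steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```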

Training methodology closely follows previous work done for Romanian BERT (https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1).

### How to use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dragosnicolae555/ALR_BERT")
model = AutoModel.from_pretrained("dragosnicolae555/ALR_BERT")

# here add your magic
```
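One common way to turn the token-level outputs into a single sentence embedding is mean pooling over the attention mask. The helper below is our own illustration, not part of the model card's API; it works on any `last_hidden_state` tensor:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, counting only non-padding positions."""
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts
```

With the model above, `mean_pool(model(**inputs).last_hidden_state, inputs["attention_mask"])` yields one vector per sentence.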

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

because the model was **NOT** trained on cedilla ``ş`` and ``ţ``. If you don't, performance will degrade due to ``<UNK>`` tokens and an increased number of tokens per word.
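Equivalently, all four replacements can be applied in one pass with `str.translate`; the function name below is our own convenience, not part of the model card:

```python
# map legacy cedilla s/t to the comma-below forms the model was trained on
CEDILLA_TO_COMMA = str.maketrans("ţşŢŞ", "țșȚȘ")

def sanitize(text):
    """Replace cedilla ş/ţ (and capitals) with comma-below ș/ț."""
    return text.translate(CEDILLA_TO_COMMA)
```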

### Evaluation

We evaluate ALR-BERT on the Simple Universal Dependencies task, training one model per task and measuring labeling performance on UPOS (Universal Part-of-Speech) and XPOS (eXtended Part-of-Speech) tags. We compare ALR-BERT with Romanian BERT and multilingual BERT, using the cased versions. To counteract random-seed effects, we repeat each experiment five times and report the mean score.

| Model                 | UPOS  | XPOS  | MLAS  | AllTags |
|-----------------------|:-----:|:-----:|:-----:|:-------:|
| M-BERT (cased)        | 93.87 | 89.89 | 90.01 | 87.04   |
| Romanian BERT (cased) | 95.56 | 95.35 | 92.78 | 93.22   |
| ALR-BERT (cased)      | **87.38** | **84.05** | **79.82** | **78.82** |

### Corpus

The model was trained on the following corpora (statistics in the table below are after cleaning):

| Corpus    | Lines(M)  | Words(M)    | Chars(B)   | Size(GB) |
|-----------|:---------:|:-----------:|:----------:|:--------:|
| OPUS      | 55.05     | 635.04      | 4.045      | 3.8      |
| OSCAR     | 33.56     | 1725.82     | 11.411     | 11       |
| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4      |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |