sagorbrur committed • Commit ad5105b • Parent(s): 2c0b2dd
added README.md by sagorsarker

README.md ADDED
---
language: bn
tags:
- bert
- bengali
- bengali-lm
- bangla
license: MIT
datasets:
- common_crawl
- wikipedia
- oscar
---

# Bangla BERT Base

It has been a long road, but here it is: our **Bangla-Bert**! It is now available on the Hugging Face model hub.

[Bangla-Bert-Base](https://github.com/sagorbrur/bangla-bert) is a pretrained Bengali language model trained with masked language modeling, as described in [BERT](https://arxiv.org/abs/1810.04805) and its GitHub [repository](https://github.com/google-research/bert).
+
|
21 |
+
|
22 |
+
|
23 |
+
## Pretrain Corpus Details
|
24 |
+
Corpus was downloaded from two main sources:
|
25 |
+
|
26 |
+
* Bengali commoncrawl copurs downloaded from [OSCAR](https://oscar-corpus.com/)
|
27 |
+
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)
|
28 |
+
|
29 |
+
After downloading these corpus, we preprocessed it as a Bert format. which is one sentence per line and an extra newline for new documents.
|
30 |
+
|
31 |
+
```
|
32 |
+
sentence 1
|
33 |
+
sentence 2
|
34 |
+
|
35 |
+
sentence 1
|
36 |
+
sentence 2
|
37 |
+
|
38 |
+
```
|
39 |
+
|
40 |
+
## Building Vocab
|
41 |
+
We used [BNLP](https://github.com/sagorbrur/bnlp) package for training bengali sentencepiece model with vocab size 102025. We preprocess the output vocab file as Bert format.
|
42 |
+
Our final vocab file availabe at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert) and also at [huggingface](https://huggingface.co/sagorsarker/bangla-bert-base) model hub.
|
43 |
+
|
44 |
+
## Training Details
|
45 |
+
* Bangla-Bert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
|
46 |
+
* Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
|
47 |
+
* Total Training Steps: 1 Million
|
48 |
+
* The model was trained on a single Google Cloud TPU
|
49 |
+
|
50 |
+
## Evaluation Results
|
51 |
+
|
52 |
+
### LM Evaluation Results
|
53 |
+
After training 1 millions steps here is the evaluation resutls.
|
54 |
+
|
55 |
+
```
|
56 |
+
global_step = 1000000
|
57 |
+
loss = 2.2406516
|
58 |
+
masked_lm_accuracy = 0.60641736
|
59 |
+
masked_lm_loss = 2.201459
|
60 |
+
next_sentence_accuracy = 0.98625
|
61 |
+
next_sentence_loss = 0.040997364
|
62 |
+
perplexity = numpy.exp(2.2406516) = 9.393331287442784
|
63 |
+
Loss for final step: 2.426227
|
64 |
+
|
65 |
+
```
|
66 |
+
|
67 |
+
### Downstream Task Evaluation Results
|
68 |
+
Huge Thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing evalution results of classification task.
|
69 |
+
He used [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for classification task.
|
70 |
+
Comparing to Nick's [Bengali electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multi-lingual BERT, Bangla BERT Base achieves state of the art result.
|
71 |
+
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).
|
72 |
+
|
73 |
+
|
74 |
+
| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
|
75 |
+
| ----- | -------------------| ---------------- | --------------- | ------- |
|
76 |
+
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
|
77 |
+
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
|
78 |
+
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
|
79 |
+
|
80 |
+
|
81 |
+
**NB: If you use this model for any nlp task please share evaluation results with us. We will add it here.**
|
82 |
+
|
83 |
+
|
84 |
+
## How to Use
|
85 |
+
You can use this model directly with a pipeline for masked language modeling:
|
86 |
+
|
87 |
+
```py
|
88 |
+
from transformers import BertForMaskedLM, BertTokenizer, pipeline
|
89 |
+
|
90 |
+
model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
|
91 |
+
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
|
92 |
+
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
|
93 |
+
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
|
94 |
+
print(pred)
|
95 |
+
|
96 |
+
# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
|
97 |
+
|
98 |
+
```
|
99 |
+
|
100 |
+
|
101 |
+
## Author
|
102 |
+
[Sagor Sarker](https://github.com/sagorbrur)
|
103 |
+
|
104 |
+
## Acknowledgements
|
105 |
+
|
106 |
+
* Thanks to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you!
|
107 |
+
* Thank to all the people around, who always helping us to build something for Bengali.
|
108 |
+
|
109 |
+
## Reference
|
110 |
+
* https://github.com/google-research/bert
|
111 |
+
|
112 |
+
|
113 |
+
|
114 |
+
|
115 |
+
|