---
language: bn
tags:
- bert
- bengali
- bengali-lm
- bangla
license: mit
datasets:
- common_crawl
- wikipedia
- oscar
---

# Bangla BERT Base
After a long journey, here is our **Bangla-Bert**! It is now available on the Hugging Face model hub.

[Bangla-Bert-Base](https://github.com/sagorbrur/bangla-bert) is a pretrained Bengali language model trained with the masked language modeling objective described in [BERT](https://arxiv.org/abs/1810.04805) and its GitHub [repository](https://github.com/google-research/bert).

## Pretrain Corpus Details
The corpus was downloaded from two main sources:

* Bengali Common Crawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)

After downloading these corpora, we preprocessed them into the BERT pretraining format: one sentence per line, with a blank line between documents.

```
sentence 1
sentence 2

sentence 1
sentence 2

```

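As a rough illustration of this layout, the sketch below writes a list of documents (each a list of sentences) as one sentence per line with a blank line between documents. The function name `to_bert_pretraining_format` and the in-memory input shape are assumptions made for this example. Sentence splitting is not shown, and this is not necessarily the preprocessing script used for this model.

```python
def to_bert_pretraining_format(documents):
    """Render documents as one sentence per line, blank line between documents.

    `documents` is assumed to be a list of documents, where each document
    is a list of already-split sentence strings (a hypothetical input shape).
    """
    blocks = []
    for sentences in documents:
        # Drop empty sentences and surrounding whitespace.
        cleaned = [s.strip() for s in sentences if s.strip()]
        if cleaned:
            blocks.append("\n".join(cleaned))
    # A blank line separates consecutive documents.
    return "\n\n".join(blocks) + "\n"


docs = [
    ["sentence 1", "sentence 2"],
    ["sentence 1", "sentence 2"],
]
print(to_bert_pretraining_format(docs), end="")
```

The resulting text can be written to a file and passed to the pretraining data creation step of the BERT repository.
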
## Building Vocab
We used the [BNLP](https://github.com/sagorbrur/bnlp) package to train a Bengali SentencePiece model with a vocab size of 102025, then converted the output vocab file to the BERT format.
The final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert) and on the [Hugging Face](https://huggingface.co/sagorsarker/bangla-bert-base) model hub.

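The exact conversion script is not described here. One common way to turn SentencePiece pieces into a BERT-style vocab is sketched below, assuming that word-initial pieces carry the `▁` marker and that continuation pieces should receive the `##` prefix. The function name `sentencepiece_to_wordpiece`, the special-token list, and the skipped control symbols are assumptions for illustration, not the confirmed procedure for this model.

```python
# Special tokens expected by BERT tokenizers (assumed order for this sketch).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
WORD_START = "\u2581"  # SentencePiece marks word-initial pieces with this character
SP_CONTROL = {"<unk>", "<s>", "</s>"}  # SentencePiece control symbols to skip


def sentencepiece_to_wordpiece(pieces):
    """Convert a list of SentencePiece pieces to BERT-style vocab entries."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece in SP_CONTROL:
            continue
        if piece.startswith(WORD_START):
            # Word-initial piece: drop the marker.
            token = piece[len(WORD_START):]
        else:
            # Continuation piece: add the WordPiece prefix.
            token = "##" + piece
        # Skip empty results (a bare marker) and duplicates.
        if token and token != "##" and token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab


pieces = ["<unk>", "\u2581আমি", "\u2581বাংলা", "য"]
for token in sentencepiece_to_wordpiece(pieces):
    print(token)
```

Writing the returned tokens one per line gives a `vocab.txt` in the layout BERT tokenizers read.
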
## Training Details
* Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
* The currently released model follows the bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
* Total training steps: 1 million
* The model was trained on a single Google Cloud TPU

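The 110M figure is the commonly quoted size of bert-base-uncased with its 30,522-token vocabulary. Because the embedding table grows with vocabulary size, a rough estimate for a 102,025-token vocabulary can be sketched as below. The layout assumed here (512 positions, 2 token types, 3072 intermediate size, a pooler, no pretraining heads) is the standard BERT base encoder and is an assumption, not something stated in this README; the released checkpoint may differ.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Estimate parameters of a BERT encoder (embeddings + layers + pooler)."""
    # Token, position, and segment embeddings plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: two dense layers with biases.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden)
    # Two LayerNorms per layer (after attention and after feed-forward).
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Pooler: one dense layer on the [CLS] representation.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


print(bert_param_count(30522))   # bert-base-uncased vocabulary size
print(bert_param_count(102025))  # vocabulary size reported in this README
```

Under these assumptions the encoder layers are identical in both cases, and only the token embedding table changes with the vocabulary.
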
## Evaluation Results

### LM Evaluation Results
After training for 1 million steps, the evaluation results are:

```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227

```

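Perplexity here is the exponential of the evaluation loss. As a quick check, the sketch below recomputes it from the logged values, using the standard library's `math.exp` in place of `numpy.exp`. Computing a second figure from `masked_lm_loss` alone is an addition for comparison and not part of the original log, and the recomputed total may differ slightly from the logged figure depending on how the loss was rounded.

```python
import math

# Values copied from the evaluation log above.
total_loss = 2.2406516      # combined masked LM + next sentence loss
masked_lm_loss = 2.201459   # masked LM loss only

# Perplexity is exp(loss).
ppl_total = math.exp(total_loss)
ppl_mlm = math.exp(masked_lm_loss)

print(f"perplexity from total loss:     {ppl_total:.4f}")
print(f"perplexity from masked LM loss: {ppl_mlm:.4f}")
```

The masked-LM-only figure is lower because it excludes the next sentence prediction term.
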
### Downstream Task Evaluation Results
- Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing evaluation results for the classification task.
He used the [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets.
Compared with Nick's [Bengali Electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multilingual BERT, Bangla BERT Base achieves the best result on all three tasks.
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).

| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
| ----- | ------------------ | ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |

- Evaluation on [Wikiann](https://huggingface.co/datasets/wikiann) Datasets

We evaluated `Bangla-BERT-Base` on the [Wikiann](https://huggingface.co/datasets/wikiann) Bengali NER dataset alongside three other benchmark models (mBERT, XLM-R, Indic-BERT).
After training each model for 5 epochs, `Bangla-BERT-Base` placed third by F1 score, behind `mBERT` (first) and `XLM-R` (second).

| Base Pre-trained Model | F1 Score | Accuracy |
| ---------------------- | -------- | -------- |
| [mBERT-uncased](https://huggingface.co/bert-base-multilingual-uncased) | 97.11 | 97.68 |
| [XLM-R](https://huggingface.co/xlm-roberta-base) | 96.22 | 97.03 |
| [Indic-BERT](https://huggingface.co/ai4bharat/indic-bert) | 92.66 | 94.74 |
| Bangla-BERT-Base | 95.57 | 97.49 |

All four models were trained with the [transformers-token-classification](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb) notebook.
You can find the evaluation results for all models [here](https://github.com/sagorbrur/bangla-bert/tree/master/evaluations/wikiann).

You can also check the papers below, which used this model on their datasets:
* [DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced Bengali Language](https://arxiv.org/abs/2012.14353)
* [Emotion Classification in a Resource Constrained Language Using Transformer-based Approach](https://arxiv.org/abs/2104.08613)
* [A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models](https://arxiv.org/abs/2107.03844)
* [BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding](https://arxiv.org/abs/2101.00204)

**NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.**

## Limitations and Biases

## How to Use

**Bangla BERT Tokenizer**

```py
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face model hub
bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
```

**MASK Generation**

You can use this model directly with a pipeline for masked language modeling:

```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Print each candidate for the masked position
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}

```

## Author
[Sagor Sarker](https://github.com/sagorbrur)

## Acknowledgements

* Thanks to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits.
* Thanks to everyone who keeps helping us build resources for Bengali.

## Reference
* https://github.com/google-research/bert

## Citation
If you find this model helpful, please cite it.

```
@misc{Sagor_2020,
  title = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year = {2020},
  url = {https://github.com/sagorbrur/bangla-bert}
}

```
166