sagorbrur committed on
Commit ad5105b
1 Parent(s): 2c0b2dd

added README.md by sagorsarker

Files changed (1)
  1. README.md +115 -0
README.md ADDED
---
language: bn
tags:
- bert
- bengali
- bengali-lm
- bangla
license: MIT
datasets:
- common_crawl
- wikipedia
- oscar
---

# Bangla BERT Base
It has been a long journey. Here is our **Bangla-Bert**! It is now available on the Hugging Face model hub.

[Bangla-Bert-Base](https://github.com/sagorbrur/bangla-bert) is a pretrained Bengali language model trained with masked language modeling, as described in [BERT](https://arxiv.org/abs/1810.04805) and its GitHub [repository](https://github.com/google-research/bert).

## Pretrain Corpus Details
The corpus was downloaded from two main sources:

* Bengali Common Crawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)

After downloading these corpora, we preprocessed them into the BERT pretraining format: one sentence per line, with an extra newline between documents.

```
sentence 1
sentence 2

sentence 1
sentence 2

```
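A minimal sketch of this preprocessing step is shown below. The file name, the sample document, and the naive sentence split on the Bengali danda (।) are illustrative assumptions, not the exact pipeline used for pretraining.

```py
# Minimal sketch (illustrative only): convert raw documents into the BERT
# pretraining text format -- one sentence per line, blank line between documents.
# The file name, sample document, and the naive danda-based sentence split are
# assumptions; a proper Bengali sentence tokenizer may have been used instead.

def to_bert_format(documents):
    """Yield lines in the BERT pretraining format from an iterable of raw documents."""
    for doc in documents:
        # Naive sentence split on the Bengali danda; re-attach the delimiter.
        sentences = [s.strip() + "।" for s in doc.split("।") if s.strip()]
        for sentence in sentences:
            yield sentence
        yield ""  # blank line marks a document boundary

docs = ["আমি বাংলায় গান গাই। আমি বাংলার গান গাই।"]  # placeholder documents
with open("pretraining_corpus.txt", "w", encoding="utf-8") as f:
    for line in to_bert_format(docs):
        f.write(line + "\n")
```
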
## Building Vocab
We used the [BNLP](https://github.com/sagorbrur/bnlp) package to train a Bengali SentencePiece model with a vocab size of 102025, and then converted the output vocab file to the BERT format.
Our final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert) and on the [Hugging Face](https://huggingface.co/sagorsarker/bangla-bert-base) model hub.

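As a rough illustration (not the exact BNLP call), the sketch below trains a SentencePiece model with the underlying `sentencepiece` library, which BNLP wraps, and writes the pieces one per line in the shape of a BERT `vocab.txt`. The file names, options, and special-token handling are assumptions.

```py
# Rough illustration (not the exact BNLP call): train a Bengali SentencePiece
# model and write its pieces one per line, the shape of a BERT vocab.txt.
# File names, options, and the special-token handling are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",   # one sentence per line
    model_prefix="bangla_spm",
    vocab_size=102025,
)

special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
with open("bangla_spm.vocab", encoding="utf-8") as src, \
     open("vocab.txt", "w", encoding="utf-8") as dst:
    for token in special_tokens:
        dst.write(token + "\n")
    for line in src:
        piece = line.split("\t")[0]   # drop the SentencePiece score column
        dst.write(piece + "\n")
```
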
## Training Details
* Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
* The currently released model follows the bert-base-uncased architecture (12 layers, 768 hidden units, 12 attention heads, 110M parameters)
* Total training steps: 1 million
* The model was trained on a single Google Cloud TPU

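To sanity-check the released architecture from the model hub, a quick sketch (assuming the `transformers` library with a PyTorch backend):

```py
# Quick sanity check of the released architecture (assumes transformers + PyTorch).
from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained("sagorsarker/bangla-bert-base")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# Expected: 12 768 12 (the bert-base layout)

model = BertModel.from_pretrained("sagorsarker/bangla-bert-base")
print(f"{sum(p.numel() for p in model.parameters()):,} total parameters")
```
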
## Evaluation Results

### LM Evaluation Results
Here are the evaluation results after training for 1 million steps.

```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
```

### Downstream Task Evaluation Results
Huge thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing evaluation results on classification tasks.
He used the [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for the classification tasks.
Compared to Nick's [Bengali Electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multilingual BERT, Bangla BERT Base achieves state-of-the-art results.
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).

| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
| ----- | ------------------ | ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |

**NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.**

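As a starting point for similar downstream experiments, here is a minimal fine-tuning sketch using the `transformers` Trainer; the example texts, labels, and hyperparameters are placeholders, not the settings behind the numbers above.

```py
# Minimal fine-tuning sketch for a Bengali text classification task.
# The example texts, labels, and hyperparameters are placeholders, not the
# settings used for the benchmark numbers above.
import torch
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = BertForSequenceClassification.from_pretrained(
    "sagorsarker/bangla-bert-base", num_labels=2)

texts = ["ছবিটা খুব ভালো লেগেছে।", "সার্ভিসটা একদম বাজে।"]   # placeholder data
labels = [1, 0]

class TinyDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bangla-bert-clf", num_train_epochs=3),
    train_dataset=TinyDataset(texts, labels),
)
trainer.train()
```
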
## How to Use
You can use this model directly with a pipeline for masked language modeling:

```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```

## Author
[Sagor Sarker](https://github.com/sagorbrur)

## Acknowledgements

* Thanks to the Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing free TPU credits - thank you!
* Thanks to everyone around us who keeps helping us build something for Bengali.

## Reference
* https://github.com/google-research/bert