Bangla BERT Base

We have published a pretrained Bangla BERT language model, bert-base-bangla, which is now available on the Hugging Face model hub. bert-base-bangla is a pretrained Bangla language model based on the masked language modeling objective described in BERT and its GitHub repository.

Corpus Details

We trained the Bangla BERT language model on the BanglaLM dataset from Kaggle (BanglaLM). There are 3 versions of the dataset, which is almost 40 GB in size. After downloading the dataset, we moved on to masked LM training.
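
To make the masked LM step concrete, here is a minimal sketch of the token masking rule from the BERT paper: about 15% of positions are selected for prediction, and of those 80% become the mask token, 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, the `[MASK]` string, the seed handling, and the toy vocabulary are assumptions for illustration; the masking rates actually used to train bert-base-bangla are not stated here.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) using BERT's 80/10/10 masking rule.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else. This is an illustrative helper, not the
    training code used for bert-base-bangla.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = ["আমি", "বাংলার", "গান", "গাই"]
# A high mask_prob is used only so that this short example shows some masking.
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

During training the model only has to predict the positions where `labels` is not `None`; every other position is excluded from the loss.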

Bangla Base BERT Tokenizer

from transformers import AutoTokenizer

# Load the tokenizer published with the model
bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bert-base-bangla")
text = "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি"
# Split the sentence into WordPiece tokens
print(bnbert_tokenizer.tokenize(text))
# output: ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
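
In the tokenizer output above, a token starting with `##` is a WordPiece continuation of the token before it. The sketch below joins such pieces back into whole words; `merge_wordpieces` is a hypothetical helper written for this example, not part of the transformers API.

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join WordPiece continuation tokens onto the token that precedes them."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            # Continuation piece: strip the prefix and append to the last word
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    return words

# Token list copied from the tokenizer output shown above
tokens = ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
words = merge_wordpieces(tokens)
print(len(tokens), "->", len(words))  # 8 -> 7
print(words)
```

The merged words reflect the tokenizer's normalized text (the output above shows খাটি rather than খাঁটি), so they may not match the input string character for character.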

MASK Generation

Here, we can use the bert-base-bangla model for masked language modeling:

from transformers import BertForMaskedLM, BertTokenizer, pipeline
model = BertForMaskedLM.from_pretrained("Kowsher/bert-base-bangla")
tokenizer = BertTokenizer.from_pretrained("Kowsher/bert-base-bangla")

nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Print each candidate completion for the masked position
for pred in nlp(f"আমি বাংলার গান {nlp.tokenizer.mask_token}"):
    print(pred)
# {'sequence': 'আমি বাংলার গান লিখি', 'score': 0.17955434322357178, 'token': 24749, 'token_str': 'লিখি'}

# The nlp pipeline created above can be reused for further prompts
for pred in nlp(f"তুই রাজাকার তুই {nlp.tokenizer.mask_token}"):
    print(pred)
# {'sequence': 'তই রাজাকার তই রাজাকার', 'score': 0.9975168704986572, 'token': 13401, 'token_str': 'রাজাকার'}

# The nlp pipeline created above can be reused for further prompts
for pred in nlp(f"বাংলা আমার {nlp.tokenizer.mask_token}"):
    print(pred)
# {'sequence': 'বাংলা আমার অহংকার', 'score': 0.5679506063461304, 'token': 19009, 'token_str': 'অহংকার'}
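
Each fill-mask result above is a dict with the keys `sequence`, `score`, `token`, and `token_str`. A short sketch of how such results could be post-processed follows. The helper names `top_prediction` and `confident_tokens` and the 0.5 threshold are assumptions for illustration. The sample list reuses one result from each of the three prompts above purely to exercise the helpers; in practice the list would be the return value of a single `nlp(...)` call.

```python
def top_prediction(predictions):
    """Return the candidate dict with the highest score."""
    return max(predictions, key=lambda p: p["score"])

def confident_tokens(predictions, threshold=0.5):
    """Return the predicted token strings whose score is at least `threshold`."""
    return [p["token_str"] for p in predictions if p["score"] >= threshold]

# Results copied from the outputs shown above (one per prompt)
sample = [
    {'sequence': 'আমি বাংলার গান লিখি', 'score': 0.17955434322357178, 'token': 24749, 'token_str': 'লিখি'},
    {'sequence': 'তই রাজাকার তই রাজাকার', 'score': 0.9975168704986572, 'token': 13401, 'token_str': 'রাজাকার'},
    {'sequence': 'বাংলা আমার অহংকার', 'score': 0.5679506063461304, 'token': 19009, 'token_str': 'অহংকার'},
]

best = top_prediction(sample)
print(best["token_str"], best["score"])
print(confident_tokens(sample))
```

Filtering by score is one simple way to drop low-confidence completions such as the first result, whose score is below 0.2.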

Cite this work

Kowsher, Md., BERT Base Bangla: A Pretrained Transformer Based Bangla Bert Model (September 15, 2021). Research on Computational Language, Available at SSRN:



