sahajBERT

A collaboratively pre-trained model for the Bengali language, trained with the masked language modeling (MLM) and sentence order prediction (SOP) objectives.

Model description

sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
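
For reference, the ALBERT configuration shipped with the checkpoint can be inspected on its own; the following is a minimal sketch, and the printed values are whatever the published checkpoint defines rather than values asserted here:

from transformers import AutoConfig

# Load only the configuration of the published checkpoint
config = AutoConfig.from_pretrained("neuropark/sahajBERT")

print(config.model_type)  # "albert"
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.vocab_size)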

Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it's mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.

We fine-tuned the model on two of these downstream tasks: sequence classification and token classification. A minimal fine-tuning starting point is sketched below.
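
The sketch below only shows how a fine-tuning setup could be initialized from the pre-trained encoder; the classification heads are freshly initialized and the num_labels values are placeholder assumptions, not the released fine-tuned checkpoints:

from transformers import AlbertForSequenceClassification, AlbertForTokenClassification

# Sequence classification head on top of the pre-trained encoder
# (num_labels=5 is a placeholder; set it to the number of classes in your task)
seq_model = AlbertForSequenceClassification.from_pretrained("neuropark/sahajBERT", num_labels=5)

# Token classification head, e.g. for named entity recognition
# (num_labels=7 is likewise a placeholder)
tok_model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT", num_labels=7)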

How to use

You can use this model directly with a pipeline for masked language modeling:


from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)

Here is how to use this model to get the features of a given text in PyTorch:


from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

Limitations and bias

WIP

Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
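
For illustration only, here is a minimal sketch of pulling the Bengali portion of OSCAR through the datasets library; the dataset and config names are assumptions about the public Hub copies, not the exact data pipeline used for training:

from datasets import load_dataset

# Bengali portion of OSCAR from the Hugging Face Hub
# (config name is an assumption; training used the maintainers' own preprocessed dump)
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(oscar_bn[0]["text"][:200])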

Training procedure

This model was trained in a collaborative manner by volunteer participants.

Contributors leaderboard

| Rank | Username | Total contributed runtime |
|------|----------|---------------------------|
| 1 | khalidsaifullaah | 11 days 21:02:08 |
| 2 | ishanbagchi | 9 days 20:37:00 |
| 3 | tanmoyio | 9 days 18:08:34 |
| 4 | debajit | 8 days 14:15:10 |
| 5 | skylord | 6 days 16:35:29 |
| 6 | ibraheemmoosa | 5 days 01:05:57 |
| 7 | SaulLu | 5 days 00:46:36 |
| 8 | lhoestq | 4 days 20:11:16 |
| 9 | nilavya | 4 days 08:51:51 |
| 10 | Priyadarshan | 4 days 02:28:55 |
| 11 | anuragshas | 3 days 05:00:55 |
| 12 | sujitpal | 2 days 20:52:33 |
| 13 | manandey | 2 days 16:17:13 |
| 14 | albertvillanova | 2 days 14:14:31 |
| 15 | justheuristic | 2 days 13:20:52 |
| 16 | w0lfw1tz | 2 days 07:22:48 |
| 17 | smoker | 2 days 02:52:03 |
| 18 | Soumi | 1 days 20:42:02 |
| 19 | Anjali | 1 days 16:28:00 |
| 20 | OptimusPrime | 1 days 09:16:57 |
| 21 | theainerd | 1 days 04:48:57 |
| 22 | yhn112 | 0 days 20:57:02 |
| 23 | kolk | 0 days 17:57:37 |
| 24 | arnab | 0 days 17:54:12 |
| 25 | imavijit | 0 days 16:07:26 |
| 26 | osanseviero | 0 days 14:16:45 |
| 27 | subhranilsarkar | 0 days 13:04:46 |
| 28 | sagnik1511 | 0 days 12:24:57 |
| 29 | anindabitm | 0 days 08:56:44 |
| 30 | borzunov | 0 days 04:07:35 |
| 31 | thomwolf | 0 days 03:53:15 |
| 32 | priyadarshan | 0 days 03:40:11 |
| 33 | ali007 | 0 days 03:34:37 |
| 34 | sbrandeis | 0 days 03:18:16 |
| 35 | Preetha | 0 days 03:13:47 |
| 36 | Mrinal | 0 days 03:01:43 |
| 37 | laxya007 | 0 days 02:18:34 |
| 38 | lewtun | 0 days 00:34:43 |
| 39 | Rounak | 0 days 00:26:10 |
| 40 | kshmax | 0 days 00:06:38 |

Hardware used

Eval results

We evaluated the quality of sahajBERT and two baseline models (XLM-R-large and IndicBert) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

  • NER: named entity recognition on the Bengali split of the WikiANN dataset

  • NCC: multi-class news classification on the Soham News Category Classification dataset from IndicGLUE

| Base pre-trained model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
|------------------------|-----------------------|-----------------------------|
| sahajBERT | 95.45 ± 0.53 | 91.97 ± 0.47 |
| XLM-R-large | 96.48 ± 0.22 | 90.05 ± 0.38 |
| IndicBert | 92.52 ± 0.45 | 74.46 ± 1.91 |
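
As a rough sketch of how the NER setup could be reproduced, the evaluation data and a fresh token classification head might be loaded as follows; the dataset/config names and the label count are assumptions, and the fine-tuning hyperparameters are not restated here:

from datasets import load_dataset
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast

# Bengali split of WikiANN for the NER task (config name is an assumption)
wikiann_bn = load_dataset("wikiann", "bn")

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# WikiANN uses 7 IOB2 tags (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC)
model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT", num_labels=7)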

BibTeX entry and citation info

Coming soon!
