sahajBERT

A collaboratively pre-trained model for the Bengali language, trained with the masked language modeling (MLM) and sentence order prediction (SOP) objectives.

Model description

sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
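
For reference, the ALBERT configuration shipped with the checkpoint can be inspected on its own; the following is a minimal sketch, and the printed values are whatever the published checkpoint defines rather than values asserted here:

from transformers import AutoConfig

# Load only the configuration of the published checkpoint
config = AutoConfig.from_pretrained("neuropark/sahajBERT")

print(config.model_type)  # "albert"
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.vocab_size)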

Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it's mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.

We fine-tuned the model on two of these downstream tasks: sequence classification and token classification. A minimal fine-tuning starting point is sketched below.
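
The sketch below only shows how a fine-tuning setup could be initialized from the pre-trained encoder; the classification heads are freshly initialized and the num_labels values are placeholder assumptions, not the released fine-tuned checkpoints:

from transformers import AlbertForSequenceClassification, AlbertForTokenClassification

# Sequence classification head on top of the pre-trained encoder
# (num_labels=5 is a placeholder; set it to the number of classes in your task)
seq_model = AlbertForSequenceClassification.from_pretrained("neuropark/sahajBERT", num_labels=5)

# Token classification head, e.g. for named entity recognition
# (num_labels=7 is likewise a placeholder)
tok_model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT", num_labels=7)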

How to use

You can use this model directly with a pipeline for masked language modeling:


from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)

Here is how to use this model to get the features of a given text in PyTorch:


from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

Limitations and bias

WIP

Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
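
For illustration only, here is a minimal sketch of pulling the Bengali portion of OSCAR through the datasets library; the dataset and config names are assumptions about the public Hub copies, not the exact data pipeline used for training:

from datasets import load_dataset

# Bengali portion of OSCAR from the Hugging Face Hub
# (config name is an assumption; training used the maintainers' own preprocessed dump)
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(oscar_bn[0]["text"][:200])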

Training procedure

This model was trained in a collaborative manner by volunteer participants.

Contributors leaderboard

| Rank | Username | Total contributed runtime |
|------|----------|---------------------------|
| 1 | khalidsaifullaah | 11 days 21:02:08 |
| 2 | ishanbagchi | 9 days 20:37:00 |
| 3 | tanmoyio | 9 days 18:08:34 |
| 4 | debajit | 8 days 14:15:10 |
| 5 | skylord | 6 days 16:35:29 |
| 6 | ibraheemmoosa | 5 days 01:05:57 |
| 7 | SaulLu | 5 days 00:46:36 |
| 8 | lhoestq | 4 days 20:11:16 |
| 9 | nilavya | 4 days 08:51:51 |
| 10 | Priyadarshan | 4 days 02:28:55 |
| 11 | anuragshas | 3 days 05:00:55 |
| 12 | sujitpal | 2 days 20:52:33 |
| 13 | manandey | 2 days 16:17:13 |
| 14 | albertvillanova | 2 days 14:14:31 |
| 15 | justheuristic | 2 days 13:20:52 |
| 16 | w0lfw1tz | 2 days 07:22:48 |
| 17 | smoker | 2 days 02:52:03 |
| 18 | Soumi | 1 days 20:42:02 |
| 19 | Anjali | 1 days 16:28:00 |
| 20 | OptimusPrime | 1 days 09:16:57 |
| 21 | theainerd | 1 days 04:48:57 |
| 22 | yhn112 | 0 days 20:57:02 |
| 23 | kolk | 0 days 17:57:37 |
| 24 | arnab | 0 days 17:54:12 |
| 25 | imavijit | 0 days 16:07:26 |
| 26 | osanseviero | 0 days 14:16:45 |
| 27 | subhranilsarkar | 0 days 13:04:46 |
| 28 | sagnik1511 | 0 days 12:24:57 |
| 29 | anindabitm | 0 days 08:56:44 |
| 30 | borzunov | 0 days 04:07:35 |
| 31 | thomwolf | 0 days 03:53:15 |
| 32 | priyadarshan | 0 days 03:40:11 |
| 33 | ali007 | 0 days 03:34:37 |
| 34 | sbrandeis | 0 days 03:18:16 |
| 35 | Preetha | 0 days 03:13:47 |
| 36 | Mrinal | 0 days 03:01:43 |
| 37 | laxya007 | 0 days 02:18:34 |
| 38 | lewtun | 0 days 00:34:43 |
| 39 | Rounak | 0 days 00:26:10 |
| 40 | kshmax | 0 days 00:06:38 |

Hardware used

Eval results

We evaluated the quality of sahajBERT and two baseline models (XLM-R-large and IndicBert) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

  • NER: named entity recognition on the Bengali split of the WikiANN dataset

  • NCC: multi-class news classification on the Soham News Category Classification dataset from IndicGLUE

| Base pre-trained model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
|------------------------|-----------------------|-----------------------------|
| sahajBERT | 95.45 ± 0.53 | 91.97 ± 0.47 |
| XLM-R-large | 96.48 ± 0.22 | 90.05 ± 0.38 |
| IndicBert | 92.52 ± 0.45 | 74.46 ± 1.91 |
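
As a rough sketch of how the NER setup could be reproduced, the evaluation data and a fresh token classification head might be loaded as follows; the dataset/config names and the label count are assumptions, and the fine-tuning hyperparameters are not restated here:

from datasets import load_dataset
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast

# Bengali split of WikiANN for the NER task (config name is an assumption)
wikiann_bn = load_dataset("wikiann", "bn")

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# WikiANN uses 7 IOB2 tags (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC)
model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT", num_labels=7)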

BibTeX entry and citation info

Coming soon!
