---
language: bn
tags:
- collaborative
- bengali
- albert
- bangla
license: apache-2.0
datasets:
- Wikipedia
- Oscar
widget:
- text: ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো
---

# sahajBERT
A Bengali-language model collaboratively pre-trained with the masked language modeling (MLM) and sentence order prediction (SOP) objectives.
## Model description
sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.

We fine-tuned the model on 2 of these downstream tasks: sequence classification and token classification. A minimal fine-tuning sketch follows.
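As an illustration, a sequence-classification fine-tuning run could be set up roughly as follows. This is only a sketch, not the exact recipe behind those fine-tuning runs: the CSV file names, `num_labels`, and training hyperparameters are placeholders to replace with your own, and it assumes the tokenizer on the Hub defines a padding token.

```python
from datasets import load_dataset
from transformers import (
    AlbertForSequenceClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Pre-trained backbone with a fresh classification head.
# num_labels is a placeholder: set it to the number of classes in your task.
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
model = AlbertForSequenceClassification.from_pretrained("neuropark/sahajBERT", num_labels=6)

# Any Bengali classification dataset with "text" and "label" columns works here;
# the CSV paths are placeholders.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sahajBERT-classification", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # used to build the default padding collator
)
trainer.train()
```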
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)
```
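The pipeline returns the top candidate replacements for the `[MASK]` token, each with its filled-in sequence and a score.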
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
### Limitations and bias
WIP
## Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali together with the Bengali part of OSCAR.
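For reference, the Bengali part of OSCAR is available through the `datasets` library. The snippet below is only a sketch: the exact OSCAR dump and deduplication variant used for pre-training is not documented here, so the configuration name is an example.

```python
from datasets import load_dataset

# Bengali subset of OSCAR (example configuration; pre-training may have used a
# different dump or deduplication variant).
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(oscar_bn)                    # number of documents and available columns
print(oscar_bn[0]["text"][:200])   # first 200 characters of the first document
```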
## Training procedure
This model was trained in a collaborative manner by volunteer participants.
### Contributors leaderboard

Rank | Username | Total contributed runtime |
---|---|---|
1 | khalidsaifullaah | 11 days 21:02:08 |
2 | ishanbagchi | 9 days 20:37:00 |
3 | tanmoyio | 9 days 18:08:34 |
4 | debajit | 8 days 14:15:10 |
5 | skylord | 6 days 16:35:29 |
6 | ibraheemmoosa | 5 days 01:05:57 |
7 | SaulLu | 5 days 00:46:36 |
8 | lhoestq | 4 days 20:11:16 |
9 | nilavya | 4 days 08:51:51 |
10 | Priyadarshan | 4 days 02:28:55 |
11 | anuragshas | 3 days 05:00:55 |
12 | sujitpal | 2 days 20:52:33 |
13 | manandey | 2 days 16:17:13 |
14 | albertvillanova | 2 days 14:14:31 |
15 | justheuristic | 2 days 13:20:52 |
16 | w0lfw1tz | 2 days 07:22:48 |
17 | smoker | 2 days 02:52:03 |
18 | Soumi | 1 days 20:42:02 |
19 | Anjali | 1 days 16:28:00 |
20 | OptimusPrime | 1 days 09:16:57 |
21 | theainerd | 1 days 04:48:57 |
22 | yhn112 | 0 days 20:57:02 |
23 | kolk | 0 days 17:57:37 |
24 | arnab | 0 days 17:54:12 |
25 | imavijit | 0 days 16:07:26 |
26 | osanseviero | 0 days 14:16:45 |
27 | subhranilsarkar | 0 days 13:04:46 |
28 | sagnik1511 | 0 days 12:24:57 |
29 | anindabitm | 0 days 08:56:44 |
30 | borzunov | 0 days 04:07:35 |
31 | thomwolf | 0 days 03:53:15 |
32 | priyadarshan | 0 days 03:40:11 |
33 | ali007 | 0 days 03:34:37 |
34 | sbrandeis | 0 days 03:18:16 |
35 | Preetha | 0 days 03:13:47 |
36 | Mrinal | 0 days 03:01:43 |
37 | laxya007 | 0 days 02:18:34 |
38 | lewtun | 0 days 00:34:43 |
39 | Rounak | 0 days 00:26:10 |
40 | kshmax | 0 days 00:06:38 |
## Eval results

We evaluate the quality of sahajBERT against two baseline models (XLM-R-large and IndicBERT) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

- NER: named entity recognition on the Bengali split of the WikiANN dataset
- NCC: multi-class news classification on the Soham News Category Classification dataset from IndicGLUE

Base pre-trained model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
---|---|---|
sahajBERT | 95.45 ± 0.53 | 91.97 ± 0.47 |
XLM-R-large | 96.48 ± 0.22 | 90.05 ± 0.38 |
IndicBERT | 92.52 ± 0.45 | 74.46 ± 1.91 |
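For context, the NER evaluation data and model head can be prepared roughly as follows. This is a sketch of the setup, not the exact script behind the reported numbers: training hyperparameters are omitted and the label-alignment helper is our own illustration.

```python
from datasets import load_dataset
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast

# Bengali split of WikiANN; "ner_tags" holds word-level IOB2 label ids.
wikiann_bn = load_dataset("wikiann", "bn")
label_names = wikiann_bn["train"].features["ner_tags"].feature.names

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
model = AlbertForTokenClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=len(label_names)
)

def tokenize_and_align(batch):
    # Tokenize pre-split words and align word-level tags to sub-word tokens;
    # special tokens and continuation pieces get -100 so the loss ignores them.
    encodings = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous = None
        labels = []
        for word_id in encodings.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                labels.append(-100)
            else:
                labels.append(tags[word_id])
            previous = word_id
        all_labels.append(labels)
    encodings["labels"] = all_labels
    return encodings

tokenized = wikiann_bn.map(tokenize_and_align, batched=True)
# `tokenized` can then be passed to a Trainer with a token-classification data collator.
```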
## BibTeX entry and citation info
Coming soon!