---
language: bn
tags:
- collaborative
- bengali
- albert
- bangla
license: apache-2.0
datasets:
- Wikipedia
- Oscar
widget:
- text: ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো
---
# sahajBERT
Collaboratively pre-trained model on the Bengali language using masked language modeling (MLM) and sentence order prediction (SOP) objectives.
## Model description
sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
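Since the checkpoint follows the standard ALBERT layout, one quick way to see the exact architecture hyperparameters is to load its configuration. This is a minimal sketch; the printed fields come from the hosted config file, not from this card:

```python
from transformers import AlbertConfig

# Load the configuration shipped with the checkpoint
config = AlbertConfig.from_pretrained("neuropark/sahajBERT")

# Inspect a few architecture hyperparameters
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
```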
## Intended uses & limitations
You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.
We fine-tuned our model on 2 of these downstream tasks: sequence classification and token classification (a minimal sketch of attaching such heads is shown below).
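As an example, attaching task-specific heads to this checkpoint for those two tasks could look like the following; the `num_labels` values are illustrative assumptions, not properties of the released model:

```python
from transformers import (
    AlbertForSequenceClassification,
    AlbertForTokenClassification,
    PreTrainedTokenizerFast,
)

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Sequence classification head (e.g. news category classification);
# num_labels=6 is an illustrative assumption
seq_model = AlbertForSequenceClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=6
)

# Token classification head (e.g. named entity recognition);
# num_labels=7 is an illustrative assumption
token_model = AlbertForTokenClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=7
)
```

Both models can then be fine-tuned with the usual Transformers training utilities (e.g. `Trainer`).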
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো" # Change me
pipeline(raw_text)
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো" # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
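Continuing from the snippet above, `output` follows the standard Transformers model-output format, so the token-level and pooled features can be read as follows (variable names are illustrative):

```python
# Per-token features: tensor of shape (batch_size, sequence_length, hidden_size)
token_features = output.last_hidden_state

# Pooled representation of the whole sequence (derived from the first token)
sentence_features = output.pooler_output
```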
### Limitations and bias
WIP
## Training data
The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
## Training procedure
This model was trained in a collaborative manner by volunteer participants.
### Contributors leaderboard
| Rank | Username | Total contributed runtime |
|:-------------:|:-------------:|-------------:|
| 1|khalidsaifullaah|11 days 21:02:08|
| 2|ishanbagchi|9 days 20:37:00|
| 3|tanmoyio|9 days 18:08:34|
| 4|debajit|8 days 14:15:10|
| 5|skylord|6 days 16:35:29|
| 6|ibraheemmoosa|5 days 01:05:57|
| 7|SaulLu|5 days 00:46:36|
| 8|lhoestq|4 days 20:11:16|
| 9|nilavya|4 days 08:51:51|
|10|Priyadarshan|4 days 02:28:55|
|11|anuragshas|3 days 05:00:55|
|12|sujitpal|2 days 20:52:33|
|13|manandey|2 days 16:17:13|
|14|albertvillanova|2 days 14:14:31|
|15|justheuristic|2 days 13:20:52|
|16|w0lfw1tz|2 days 07:22:48|
|17|smoker|2 days 02:52:03|
|18|Soumi|1 days 20:42:02|
|19|Anjali|1 days 16:28:00|
|20|OptimusPrime|1 days 09:16:57|
|21|theainerd|1 days 04:48:57|
|22|yhn112|0 days 20:57:02|
|23|kolk|0 days 17:57:37|
|24|arnab|0 days 17:54:12|
|25|imavijit|0 days 16:07:26|
|26|osanseviero|0 days 14:16:45|
|27|subhranilsarkar|0 days 13:04:46|
|28|sagnik1511|0 days 12:24:57|
|29|anindabitm|0 days 08:56:44|
|30|borzunov|0 days 04:07:35|
|31|thomwolf|0 days 03:53:15|
|32|priyadarshan|0 days 03:40:11|
|33|ali007|0 days 03:34:37|
|34|sbrandeis|0 days 03:18:16|
|35|Preetha|0 days 03:13:47|
|36|Mrinal|0 days 03:01:43|
|37|laxya007|0 days 02:18:34|
|38|lewtun|0 days 00:34:43|
|39|Rounak|0 days 00:26:10|
|40|kshmax|0 days 00:06:38|
## Eval results
We evaluated the quality of sahajBERT against two baseline models, XLM-R-large and IndicBert, by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

- NER: named entity recognition on the Bengali split of the WikiANN dataset
- NCC: a multi-class classification task on the Soham News Category Classification dataset from IndicGLUE (see the dataset-loading sketch after this list)
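As a hedged sketch, both evaluation sets are available through the Hugging Face `datasets` library; the dataset and configuration names below are assumptions about the hub identifiers, not taken from this card:

```python
from datasets import load_dataset

# Bengali split of WikiANN for NER (dataset id assumed)
wikiann_bn = load_dataset("wikiann", "bn")

# Soham News Category Classification from IndicGLUE (config name assumed)
soham_ncc = load_dataset("indic_glue", "sna.bn")
```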
| Base pretrained Model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
|:-------------:|:-------------:|:-------------:|
|sahajBERT | 95.45 ± 0.53| 91.97 ± 0.47|
|XLM-R-large | 96.48 ± 0.22| 90.05 ± 0.38|
|IndicBert | 92.52 ± 0.45| 74.46 ± 1.91|
## BibTeX entry and citation info
Coming soon!