---
language: bn
tags:
  - collaborative
  - bengali
  - albert
  - bangla
license: apache-2.0
datasets:
  - Wikipedia
  - Oscar
widget:
  - text: ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো
---

# sahajBERT

Model collaboratively pre-trained on the Bengali language using the masked language modeling (MLM) and sentence order prediction (SOP) objectives.

## Model description

sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.

We fine-tuned our model on two of these downstream tasks: sequence classification and token classification. A minimal loading sketch is shown below.
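As a rough illustration (a minimal sketch, not the exact configuration used for our fine-tuned checkpoints), the pre-trained weights can be loaded with task-specific heads; the `num_labels` values below are placeholders you would set for your own dataset:

```python
from transformers import (
    AlbertForSequenceClassification,
    AlbertForTokenClassification,
    PreTrainedTokenizerFast,
)

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Sequence classification head (e.g. news category classification);
# num_labels is a placeholder for your own label set
seq_model = AlbertForSequenceClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=6
)

# Token classification head (e.g. named entity recognition)
ner_model = AlbertForTokenClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=7
)
```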

### How to use

You can use this model directly with a pipeline for masked language modeling:


```python
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)
```
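The pipeline returns the top candidates for the `[MASK]` token, each with a confidence score, the predicted token, and the input sequence with the mask filled in.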

Here is how to use this model to get the features of a given text in PyTorch:


```python
from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

### Limitations and bias

WIP

## Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali together with the Bengali part of OSCAR.
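For illustration only, a Unigram tokenizer similar to ours could be trained on the OSCAR Bengali split with the `tokenizers` library; the vocabulary size and special tokens below are assumptions, not the values used for the released tokenizer:

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, trainers

# Bengali portion of OSCAR, the corpus the tokenizer was trained on
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

# Unigram model, as used by ALBERT-style tokenizers
tokenizer = Tokenizer(models.Unigram())

# vocab_size and special_tokens are placeholder assumptions
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"],
    unk_token="[UNK]",
)

def batch_iterator(batch_size=1000):
    # Yield batches of raw text for tokenizer training
    for i in range(0, len(oscar_bn), batch_size):
        yield oscar_bn[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
```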

## Training procedure

This model was trained in a collaborative manner by volunteer participants.

### Contributors leaderboard

| Rank | Username | Total contributed runtime |
|:-------------:|:-------------:|-------------:|
| 1 | khalidsaifullaah | 11 days 21:02:08 |
| 2 | ishanbagchi | 9 days 20:37:00 |
| 3 | tanmoyio | 9 days 18:08:34 |
| 4 | debajit | 8 days 14:15:10 |
| 5 | skylord | 6 days 16:35:29 |
| 6 | ibraheemmoosa | 5 days 01:05:57 |
| 7 | SaulLu | 5 days 00:46:36 |
| 8 | lhoestq | 4 days 20:11:16 |
| 9 | nilavya | 4 days 08:51:51 |
| 10 | Priyadarshan | 4 days 02:28:55 |
| 11 | anuragshas | 3 days 05:00:55 |
| 12 | sujitpal | 2 days 20:52:33 |
| 13 | manandey | 2 days 16:17:13 |
| 14 | albertvillanova | 2 days 14:14:31 |
| 15 | justheuristic | 2 days 13:20:52 |
| 16 | w0lfw1tz | 2 days 07:22:48 |
| 17 | smoker | 2 days 02:52:03 |
| 18 | Soumi | 1 days 20:42:02 |
| 19 | Anjali | 1 days 16:28:00 |
| 20 | OptimusPrime | 1 days 09:16:57 |
| 21 | theainerd | 1 days 04:48:57 |
| 22 | yhn112 | 0 days 20:57:02 |
| 23 | kolk | 0 days 17:57:37 |
| 24 | arnab | 0 days 17:54:12 |
| 25 | imavijit | 0 days 16:07:26 |
| 26 | osanseviero | 0 days 14:16:45 |
| 27 | subhranilsarkar | 0 days 13:04:46 |
| 28 | sagnik1511 | 0 days 12:24:57 |
| 29 | anindabitm | 0 days 08:56:44 |
| 30 | borzunov | 0 days 04:07:35 |
| 31 | thomwolf | 0 days 03:53:15 |
| 32 | priyadarshan | 0 days 03:40:11 |
| 33 | ali007 | 0 days 03:34:37 |
| 34 | sbrandeis | 0 days 03:18:16 |
| 35 | Preetha | 0 days 03:13:47 |
| 36 | Mrinal | 0 days 03:01:43 |
| 37 | laxya007 | 0 days 02:18:34 |
| 38 | lewtun | 0 days 00:34:43 |
| 39 | Rounak | 0 days 00:26:10 |
| 40 | kshmax | 0 days 00:06:38 |

## Eval results

We evaluated sahajBERT against two baseline models (XLM-R-large and IndicBert) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

- NER: named entity recognition on the Bengali split of the WikiANN dataset
- NCC: a multi-class news classification task on the Soham News Category Classification dataset from IndicGLUE

| Base pretrained Model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
|:-------------:|:-------------:|:-------------:|
| sahajBERT | 95.45 ± 0.53 | 91.97 ± 0.47 |
| XLM-R-large | 96.48 ± 0.22 | 90.05 ± 0.38 |
| IndicBert | 92.52 ± 0.45 | 74.46 ± 1.91 |
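For reference, here is a minimal sketch of the NER data preparation (under assumed settings, not the exact evaluation script), using the Bengali split of WikiANN with a token-classification head on top of sahajBERT:

```python
from datasets import load_dataset
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast

# Bengali split of WikiANN for NER
wikiann_bn = load_dataset("wikiann", "bn")
label_list = wikiann_bn["train"].features["ner_tags"].feature.names

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
model = AlbertForTokenClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=len(label_list)
)

def tokenize_and_align(examples):
    # Tokenize pre-split words and align NER tags with subword pieces;
    # only the first piece of each word keeps its tag, the rest get -100
    tokenized = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous = None
        labels = []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                labels.append(-100)
            else:
                labels.append(tags[word_id])
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_ds = wikiann_bn.map(tokenize_and_align, batched=True)
```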

## BibTeX entry and citation info

Coming soon!