---
language: bn
tags:
- collaborative
- bengali
- albert
- bangla
license: apache-2.0
datasets:
- Wikipedia
- Oscar
widget:
- text: "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে [MASK]।"
pipeline_tag: fill-mask
---

# sahajBERT


<iframe width="100%" height="1100" frameborder="0"
  src="https://observablehq.com/embed/@huggingface/participants-bubbles-chart?cells=c_noaws%2Ct_noaws%2Cviewof+currentDate"></iframe>



Model collaboratively pre-trained on the Bengali language using masked language modeling (MLM) and sentence order prediction (SOP) objectives.

## Model description

<!-- You can embed local or remote images using `![](...)` -->

sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an [ALBERT](https://arxiv.org/abs/1909.11942) architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR. 
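For instance, you can load and inspect both pieces directly from the checkpoint. A minimal sketch (the printed values depend on the released configuration):

```python
from transformers import AlbertConfig, PreTrainedTokenizerFast

# Architecture hyperparameters of the released ALBERT checkpoint
config = AlbertConfig.from_pretrained("neuropark/sahajBERT")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)

# Bengali tokenizer shipped with the same checkpoint
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
print(tokenizer.tokenize("ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"))
```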

<!-- Add more information about the collaborative training when we have time / preprint available -->

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.

We fine-tuned our model on two of these downstream tasks: [sequence classification](https://huggingface.co/neuropark/sahajBERT-NCC) and [token classification](https://huggingface.co/neuropark/sahajBERT-NER).
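For instance, here is a minimal sketch of loading the fine-tuned token-classification checkpoint, assuming it follows the same loading pattern as the base model (an ALBERT token-classification head plus the fast tokenizer):

```python
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast, TokenClassificationPipeline

# Load the NER checkpoint fine-tuned from sahajBERT
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT-NER")
model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT-NER")

ner = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
ner("ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো")  # Change me
```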

#### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
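The token-level features are returned in `output.last_hidden_state` (shape `(batch_size, sequence_length, hidden_size)`), while `output.pooler_output` holds a single pooled vector per input sequence.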

#### Limitations and bias

<!-- Provide examples of latent issues and potential remediations. -->

WIP

## Training data

The tokenizer was trained on the Bengali part of OSCAR and the model on a [dump of Wikipedia in Bengali](https://huggingface.co/datasets/lhoestq/wikipedia_bn) and the Bengali part of [OSCAR](https://huggingface.co/datasets/oscar).
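A hedged sketch of loading the same corpora with the `datasets` library; the OSCAR configuration and split names are assumptions on our part, not taken from the training code:

```python
from datasets import load_dataset

# Bengali Wikipedia dump used for pre-training
wiki_bn = load_dataset("lhoestq/wikipedia_bn", split="train")

# Bengali portion of OSCAR (assumed to be the unshuffled, deduplicated Bengali subset)
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(wiki_bn)
print(oscar_bn)
```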

## Training procedure

This model was trained in a collaborative manner by volunteer participants. 

<!-- Add more information about the collaborative training when we have time / preprint available + Preprocessing, hardware used, hyperparameters... (maybe use figures)-->

### Contributors leaderboard

| Rank       | Username           | Total contributed runtime  |
|:-------------:|:-------------:|-------------:|
| 1|[khalidsaifullaah](https://huggingface.co/khalidsaifullaah)|11 days 21:02:08|
| 2|[ishanbagchi](https://huggingface.co/ishanbagchi)|9 days 20:37:00|
| 3|[tanmoyio](https://huggingface.co/tanmoyio)|9 days 18:08:34|
| 4|[debajit](https://huggingface.co/debajit)|8 days 14:15:10|
| 5|[skylord](https://huggingface.co/skylord)|6 days 16:35:29|
| 6|[ibraheemmoosa](https://huggingface.co/ibraheemmoosa)|5 days 01:05:57|
| 7|[SaulLu](https://huggingface.co/SaulLu)|5 days 00:46:36|
| 8|[lhoestq](https://huggingface.co/lhoestq)|4 days 20:11:16|
| 9|[nilavya](https://huggingface.co/nilavya)|4 days 08:51:51|
|10|[Priyadarshan](https://huggingface.co/Priyadarshan)|4 days 02:28:55|
|11|[anuragshas](https://huggingface.co/anuragshas)|3 days 05:00:55|
|12|[sujitpal](https://huggingface.co/sujitpal)|2 days 20:52:33|
|13|[manandey](https://huggingface.co/manandey)|2 days 16:17:13|
|14|[albertvillanova](https://huggingface.co/albertvillanova)|2 days 14:14:31|
|15|[justheuristic](https://huggingface.co/justheuristic)|2 days 13:20:52|
|16|[w0lfw1tz](https://huggingface.co/w0lfw1tz)|2 days 07:22:48|
|17|[smoker](https://huggingface.co/smoker)|2 days 02:52:03|
|18|[Soumi](https://huggingface.co/Soumi)|1 days 20:42:02|
|19|[Anjali](https://huggingface.co/Anjali)|1 days 16:28:00|
|20|[OptimusPrime](https://huggingface.co/OptimusPrime)|1 days 09:16:57|
|21|[theainerd](https://huggingface.co/theainerd)|1 days 04:48:57|
|22|[yhn112](https://huggingface.co/yhn112)|0 days 20:57:02|
|23|[kolk](https://huggingface.co/kolk)|0 days 17:57:37|
|24|[arnab](https://huggingface.co/arnab)|0 days 17:54:12|
|25|[imavijit](https://huggingface.co/imavijit)|0 days 16:07:26|
|26|[osanseviero](https://huggingface.co/osanseviero)|0 days 14:16:45|
|27|[subhranilsarkar](https://huggingface.co/subhranilsarkar)|0 days 13:04:46|
|28|[sagnik1511](https://huggingface.co/sagnik1511)|0 days 12:24:57|
|29|[anindabitm](https://huggingface.co/anindabitm)|0 days 08:56:44|
|30|[borzunov](https://huggingface.co/borzunov)|0 days 04:07:35|
|31|[thomwolf](https://huggingface.co/thomwolf)|0 days 03:53:15|
|32|[priyadarshan](https://huggingface.co/priyadarshan)|0 days 03:40:11|
|33|[ali007](https://huggingface.co/ali007)|0 days 03:34:37|
|34|[sbrandeis](https://huggingface.co/sbrandeis)|0 days 03:18:16|
|35|[Preetha](https://huggingface.co/Preetha)|0 days 03:13:47|
|36|[Mrinal](https://huggingface.co/Mrinal)|0 days 03:01:43|
|37|[laxya007](https://huggingface.co/laxya007)|0 days 02:18:34|
|38|[lewtun](https://huggingface.co/lewtun)|0 days 00:34:43|
|39|[Rounak](https://huggingface.co/Rounak)|0 days 00:26:10|
|40|[kshmax](https://huggingface.co/kshmax)|0 days 00:06:38|


### Hardware used

<iframe width="100%" height="251" frameborder="0"
  src="https://observablehq.com/embed/@huggingface/sahajbert-hardware?cells=c1_noaws"></iframe>

## Eval results

We evaluated the quality of sahajBERT against two baseline models ([XLM-R-large](https://huggingface.co/xlm-roberta-large) and [IndicBert](https://huggingface.co/ai4bharat/indic-bert)) by fine-tuning each pre-trained model three times on two downstream tasks in Bengali (a sketch of the fine-tuning setup follows the results table):

- **NER**: named entity recognition on the Bengali split of the [WikiANN](https://huggingface.co/datasets/wikiann) dataset

- **NCC**: a multi-class news classification task on the Soham News Category Classification dataset from IndicGLUE

| Base pre-trained Model       | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std)           |
|:-------------:|:-------------:|:-------------:|
|sahajBERT |  95.45 ± 0.53|  91.97 ± 0.47|
|[XLM-R-large](https://huggingface.co/xlm-roberta-large) |  96.48 ± 0.22| 90.05 ± 0.38|
|[IndicBert](https://huggingface.co/ai4bharat/indic-bert) |  92.52 ± 0.45| 74.46 ± 1.91|
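
For reference, here is a minimal sketch of how such a fine-tuning run on the WikiANN Bengali split could be set up with the `transformers` Trainer. The hyperparameters are illustrative placeholders, not the settings behind the scores above, and the script assumes the checkpoint's fast tokenizer exposes the usual padding and special tokens.

```python
from datasets import load_dataset
from transformers import (
    AlbertForTokenClassification,
    DataCollatorForTokenClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Bengali split of WikiANN
raw = load_dataset("wikiann", "bn")
label_names = raw["train"].features["ner_tags"].feature.names

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
model = AlbertForTokenClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=len(label_names)
)

def tokenize_and_align(batch):
    # Tokenize pre-split words and propagate word-level tags to sub-word tokens
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        example_labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                example_labels.append(-100)           # special tokens are ignored by the loss
            elif word_id != previous:
                example_labels.append(tags[word_id])  # label only the first sub-token of each word
            else:
                example_labels.append(-100)
            previous = word_id
        labels.append(example_labels)
    enc["labels"] = labels
    return enc

tokenized = raw.map(tokenize_and_align, batched=True, remove_columns=raw["train"].column_names)

args = TrainingArguments(
    output_dir="sahajBERT-wikiann-bn",
    learning_rate=2e-5,              # illustrative values, not the ones used for the reported scores
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```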

### BibTeX entry and citation info

Coming soon! 

<!-- ```bibtex

@inproceedings{...,

  year={2020}

}

``` -->