SaulLu committed 3d0109f (1 parent: 1f682da)

add model card

Files changed (1): README.md added (+229 -0)

---
language: bn
tags:
- collaborative
- bengali
- albert
- bangla
license: apache-2.0
datasets:
- Wikipedia
- Oscar
widget:
- text: "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"
---

<!-- TODO: change widget text -->

# sahajBERT

A collaboratively pre-trained model for the Bengali language, trained with masked language modeling (MLM) and sentence order prediction (SOP) objectives.

## Model description

sahajBERT is composed of 1) a tokenizer specially designed for Bengali and 2) an [ALBERT](https://arxiv.org/abs/1909.11942) architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
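
If you want to inspect what was published, here is a minimal sketch (nothing below is asserted by this card; the printed values are whatever the checkpoint's config and tokenizer actually contain):

```python
from transformers import AlbertConfig, PreTrainedTokenizerFast

# Download the published configuration and tokenizer to inspect the setup
config = AlbertConfig.from_pretrained("neuropark/sahajBERT")
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print("vocab size:", tokenizer.vocab_size)
```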

<!-- Add more information about the collaborative training when we have time / preprint available -->

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.

We fine-tuned our model on two of these downstream tasks: [sequence classification](https://huggingface.co/neuropark/sahajBERT-NCC) and [token classification](https://huggingface.co/neuropark/sahajBERT-NER) (see the loading example at the end of the next section).

#### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
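
The fine-tuned checkpoints linked under "Intended uses & limitations" can be loaded the same way. A minimal sketch for the NER checkpoint follows; it assumes the checkpoint exposes a standard token-classification head and ships its own tokenizer files, which this card does not state:

```python
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast, TokenClassificationPipeline

# Load the fine-tuned NER checkpoint (head type and tokenizer files are assumptions)
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT-NER")
model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT-NER")

ner = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
ner("ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো")  # Change me
```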

#### Limitations and bias

<!-- Provide examples of latent issues and potential remediations. -->

WIP

## Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a [dump of Wikipedia in Bengali](https://huggingface.co/datasets/lhoestq/wikipedia_bn) and the Bengali part of [OSCAR](https://huggingface.co/datasets/oscar).
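
For reference, a rough sketch of how these corpora could be loaded with the `datasets` library; the OSCAR config name below is an assumption, not something stated in this card:

```python
from datasets import load_dataset

# Bengali Wikipedia dump linked above
wiki_bn = load_dataset("lhoestq/wikipedia_bn", split="train")

# Bengali part of OSCAR; the config name is assumed, check the OSCAR dataset card
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(wiki_bn)
print(oscar_bn)
```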

## Training procedure

This model was trained in a collaborative manner by volunteer participants.

<!-- Add more information about the collaborative training when we have time / preprint available + Preprocessing, hardware used, hyperparameters... (maybe use figures) -->

### Contributors leaderboard

| Rank | Username | Total contributed runtime |
|:-------------:|:-------------:|-------------:|
| 1 | [khalidsaifullaah](https://huggingface.co/khalidsaifullaah) | 11 days 21:02:08 |
| 2 | [ishanbagchi](https://huggingface.co/ishanbagchi) | 9 days 20:37:00 |
| 3 | [tanmoyio](https://huggingface.co/tanmoyio) | 9 days 18:08:34 |
| 4 | [debajit](https://huggingface.co/debajit) | 8 days 14:15:10 |
| 5 | [skylord](https://huggingface.co/skylord) | 6 days 16:35:29 |
| 6 | [ibraheemmoosa](https://huggingface.co/ibraheemmoosa) | 5 days 01:05:57 |
| 7 | [SaulLu](https://huggingface.co/SaulLu) | 5 days 00:46:36 |
| 8 | [lhoestq](https://huggingface.co/lhoestq) | 4 days 20:11:16 |
| 9 | [nilavya](https://huggingface.co/nilavya) | 4 days 08:51:51 |
| 10 | [Priyadarshan](https://huggingface.co/Priyadarshan) | 4 days 02:28:55 |
| 11 | [anuragshas](https://huggingface.co/anuragshas) | 3 days 05:00:55 |
| 12 | [sujitpal](https://huggingface.co/sujitpal) | 2 days 20:52:33 |
| 13 | [manandey](https://huggingface.co/manandey) | 2 days 16:17:13 |
| 14 | [albertvillanova](https://huggingface.co/albertvillanova) | 2 days 14:14:31 |
| 15 | [justheuristic](https://huggingface.co/justheuristic) | 2 days 13:20:52 |
| 16 | [w0lfw1tz](https://huggingface.co/w0lfw1tz) | 2 days 07:22:48 |
| 17 | [smoker](https://huggingface.co/smoker) | 2 days 02:52:03 |
| 18 | [Soumi](https://huggingface.co/Soumi) | 1 day 20:42:02 |
| 19 | [Anjali](https://huggingface.co/Anjali) | 1 day 16:28:00 |
| 20 | [OptimusPrime](https://huggingface.co/OptimusPrime) | 1 day 09:16:57 |
| 21 | [theainerd](https://huggingface.co/theainerd) | 1 day 04:48:57 |
| 22 | [yhn112](https://huggingface.co/yhn112) | 0 days 20:57:02 |
| 23 | [kolk](https://huggingface.co/kolk) | 0 days 17:57:37 |
| 24 | [arnab](https://huggingface.co/arnab) | 0 days 17:54:12 |
| 25 | [imavijit](https://huggingface.co/imavijit) | 0 days 16:07:26 |
| 26 | [osanseviero](https://huggingface.co/osanseviero) | 0 days 14:16:45 |
| 27 | [subhranilsarkar](https://huggingface.co/subhranilsarkar) | 0 days 13:04:46 |
| 28 | [sagnik1511](https://huggingface.co/sagnik1511) | 0 days 12:24:57 |
| 29 | [anindabitm](https://huggingface.co/anindabitm) | 0 days 08:56:44 |
| 30 | [borzunov](https://huggingface.co/borzunov) | 0 days 04:07:35 |
| 31 | [thomwolf](https://huggingface.co/thomwolf) | 0 days 03:53:15 |
| 32 | [priyadarshan](https://huggingface.co/priyadarshan) | 0 days 03:40:11 |
| 33 | [ali007](https://huggingface.co/ali007) | 0 days 03:34:37 |
| 34 | [sbrandeis](https://huggingface.co/sbrandeis) | 0 days 03:18:16 |
| 35 | [Preetha](https://huggingface.co/Preetha) | 0 days 03:13:47 |
| 36 | [Mrinal](https://huggingface.co/Mrinal) | 0 days 03:01:43 |
| 37 | [laxya007](https://huggingface.co/laxya007) | 0 days 02:18:34 |
| 38 | [lewtun](https://huggingface.co/lewtun) | 0 days 00:34:43 |
| 39 | [Rounak](https://huggingface.co/Rounak) | 0 days 00:26:10 |
| 40 | [kshmax](https://huggingface.co/kshmax) | 0 days 00:06:38 |

## Eval results

We evaluated sahajBERT against two baseline models ([XLM-R-large](https://huggingface.co/xlm-roberta-large) and [IndicBert](https://huggingface.co/ai4bharat/indic-bert)) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali (a data-loading sketch follows the results table):

- **NER**: named entity recognition on the Bengali split of the [WikiANN](https://huggingface.co/datasets/wikiann) dataset
- **NCC**: multi-class classification on the Soham News Category Classification dataset from IndicGLUE

| Base pre-trained model | NER F1 (mean ± std) | NCC accuracy (mean ± std) |
|:-------------:|:-------------:|:-------------:|
| sahajBERT | 95.45 ± 0.53 | 91.97 ± 0.47 |
| [XLM-R-large](https://huggingface.co/xlm-roberta-large) | 96.48 ± 0.22 | 90.05 ± 0.38 |
| [IndicBert](https://huggingface.co/ai4bharat/indic-bert) | 92.52 ± 0.45 | 74.46 ± 1.91 |
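
As referenced above, here is a minimal sketch of how the two evaluation datasets could be loaded; the IndicGLUE config name `sna.bn` for the Soham dataset is an assumption, so check the dataset card:

```python
from datasets import load_dataset

# NER: Bengali split of WikiANN
wikiann_bn = load_dataset("wikiann", "bn")

# NCC: Soham News Category Classification from IndicGLUE (config name assumed)
soham_ncc = load_dataset("indic_glue", "sna.bn")

print(wikiann_bn)
print(soham_ncc)
```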

### BibTeX entry and citation info

Coming soon!

<!-- ```bibtex
@inproceedings{...,
  year={2020}
}
``` -->