
	---

	language: bn

	tags:

	- collaborative

	- bengali

	- albert

	- bangla

	license: apache-2.0

	datasets:

	- Wikipedia

	- Oscar

	widget:

	- text: "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে [MASK]।"

	pipeline_tag: fill-mask

	---

	# sahajBERT


	<iframe width="100%" height="1100" frameborder="0"
	src="https://observablehq.com/embed/@huggingface/participants-bubbles-chart?cells=c_noaws%2Ct_noaws%2Cviewof+currentDate"></iframe>



	A model collaboratively pre-trained on the Bengali language using masked language modeling (MLM) and sentence order prediction (SOP) objectives.

	## Model description


	sahajBERT consists of 1) a tokenizer designed specifically for Bengali and 2) an [ALBERT](https://arxiv.org/abs/1909.11942) model collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
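
To make the sentence order prediction (SOP) objective more concrete, here is a minimal sketch of how a single SOP example can be built from two consecutive text segments. It is an illustration only, not the preprocessing code used to train sahajBERT: the helper name `make_sop_example` is hypothetical, and the label convention (0 for original order, 1 for swapped order) is an assumption that mirrors the one used by `AlbertForPreTraining` in `transformers`.

```python
import random
from typing import Tuple


def make_sop_example(first: str, second: str, rng: random.Random) -> Tuple[str, str, int]:
    """Build one SOP example from two consecutive segments.

    Returns (segment_a, segment_b, label), where label 0 means the segments
    are in their original order and label 1 means they were swapped.
    """
    if rng.random() < 0.5:
        return first, second, 0  # keep the original order
    return second, first, 1      # swap the two segments


# Two consecutive segments taken from the example sentence in this model card.
segment_1 = "ধন্যবাদ।"
segment_2 = "আপনার সাথে কথা বলে ভালো লাগলো"

rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(make_sop_example(segment_1, segment_2, rng))
```

During pre-training the model sees both segments and has to predict the label, which pushes it to learn inter-sentence coherence rather than topic cues alone.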

	<!-- Add more information about the collaborative training when we have time / preprint available -->

	## Intended uses & limitations

	You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.

	We fine-tuned our model on two of these downstream tasks: [sequence classification](https://huggingface.co/neuropark/sahajBERT-NCC) and [token classification](https://huggingface.co/neuropark/sahajBERT-NER).

	#### How to use

	You can use this model directly with a pipeline for masked language modeling:

	```python

	from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

	# Initialize tokenizer

	tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

	# Initialize model

	model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

	# Initialize pipeline

	pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

	raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো" # Change me

	pipeline(raw_text)

	```
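
The fill-mask pipeline returns a list of dictionaries, one per candidate token, with (among others) the keys `score` and `token_str`. The sketch below shows one way to reduce that list to the best few candidates. The helper name `summarize_predictions` is hypothetical, and the sample data is a placeholder shaped like the pipeline output so that the snippet runs without downloading the model; the scores are invented and are not real predictions.

```python
from typing import Dict, List, Tuple


def summarize_predictions(predictions: List[Dict], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the `top_k` (token, score) pairs, highest score first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:top_k]]


# Placeholder data: real pipeline entries also carry `sequence` and `token`.
# The scores below are invented for illustration only.
example_predictions = [
    {"token_str": "করে", "score": 0.10},
    {"token_str": "বলে", "score": 0.80},
    {"token_str": "হয়ে", "score": 0.05},
]

print(summarize_predictions(example_predictions, top_k=2))
```

With the real model, you would pass the result of `pipeline(raw_text)` instead of `example_predictions`.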

	Here is how to use this model to get the features of a given text in PyTorch:

	```python

	from transformers import AlbertModel, PreTrainedTokenizerFast

	# Initialize tokenizer

	tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

	# Initialize model

	model = AlbertModel.from_pretrained("neuropark/sahajBERT")

	text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো" # Change me

	encoded_input = tokenizer(text, return_tensors='pt')

	output = model(**encoded_input)

	```

	#### Limitations and bias

	<!-- Provide examples of latent issues and potential remediations. -->

	WIP

	## Training data

	The tokenizer was trained on the Bengali part of OSCAR, and the model on a [dump of Wikipedia in Bengali](https://huggingface.co/datasets/lhoestq/wikipedia_bn) and the Bengali part of [OSCAR](https://huggingface.co/datasets/oscar).

	## Training procedure

	This model was trained in a collaborative manner by volunteer participants.

	<!-- Add more information about the collaborative training when we have time / preprint available + Preprocessing, hardware used, hyperparameters... (maybe use figures)-->

	### Contributors leaderboard

	| Rank | Username | Total contributed runtime |
	|:-------------:|:-------------:|-------------:|
	| 1|[khalidsaifullaah](https://huggingface.co/khalidsaifullaah)|11 days 21:02:08|
	| 2|[ishanbagchi](https://huggingface.co/ishanbagchi)|9 days 20:37:00|
	| 3|[tanmoyio](https://huggingface.co/tanmoyio)|9 days 18:08:34|
	| 4|[debajit](https://huggingface.co/debajit)|8 days 14:15:10|
	| 5|[skylord](https://huggingface.co/skylord)|6 days 16:35:29|
	| 6|[ibraheemmoosa](https://huggingface.co/ibraheemmoosa)|5 days 01:05:57|
	| 7|[SaulLu](https://huggingface.co/SaulLu)|5 days 00:46:36|
	| 8|[lhoestq](https://huggingface.co/lhoestq)|4 days 20:11:16|
	| 9|[nilavya](https://huggingface.co/nilavya)|4 days 08:51:51|
	|10|[Priyadarshan](https://huggingface.co/Priyadarshan)|4 days 02:28:55|
	|11|[anuragshas](https://huggingface.co/anuragshas)|3 days 05:00:55|
	|12|[sujitpal](https://huggingface.co/sujitpal)|2 days 20:52:33|
	|13|[manandey](https://huggingface.co/manandey)|2 days 16:17:13|
	|14|[albertvillanova](https://huggingface.co/albertvillanova)|2 days 14:14:31|
	|15|[justheuristic](https://huggingface.co/justheuristic)|2 days 13:20:52|
	|16|[w0lfw1tz](https://huggingface.co/w0lfw1tz)|2 days 07:22:48|
	|17|[smoker](https://huggingface.co/smoker)|2 days 02:52:03|
	|18|[Soumi](https://huggingface.co/Soumi)|1 days 20:42:02|
	|19|[Anjali](https://huggingface.co/Anjali)|1 days 16:28:00|
	|20|[OptimusPrime](https://huggingface.co/OptimusPrime)|1 days 09:16:57|
	|21|[theainerd](https://huggingface.co/theainerd)|1 days 04:48:57|
	|22|[yhn112](https://huggingface.co/yhn112)|0 days 20:57:02|
	|23|[kolk](https://huggingface.co/kolk)|0 days 17:57:37|
	|24|[arnab](https://huggingface.co/arnab)|0 days 17:54:12|
	|25|[imavijit](https://huggingface.co/imavijit)|0 days 16:07:26|
	|26|[osanseviero](https://huggingface.co/osanseviero)|0 days 14:16:45|
	|27|[subhranilsarkar](https://huggingface.co/subhranilsarkar)|0 days 13:04:46|
	|28|[sagnik1511](https://huggingface.co/sagnik1511)|0 days 12:24:57|
	|29|[anindabitm](https://huggingface.co/anindabitm)|0 days 08:56:44|
	|30|[borzunov](https://huggingface.co/borzunov)|0 days 04:07:35|
	|31|[thomwolf](https://huggingface.co/thomwolf)|0 days 03:53:15|
	|32|[priyadarshan](https://huggingface.co/priyadarshan)|0 days 03:40:11|
	|33|[ali007](https://huggingface.co/ali007)|0 days 03:34:37|
	|34|[sbrandeis](https://huggingface.co/sbrandeis)|0 days 03:18:16|
	|35|[Preetha](https://huggingface.co/Preetha)|0 days 03:13:47|
	|36|[Mrinal](https://huggingface.co/Mrinal)|0 days 03:01:43|
	|37|[laxya007](https://huggingface.co/laxya007)|0 days 02:18:34|
	|38|[lewtun](https://huggingface.co/lewtun)|0 days 00:34:43|
	|39|[Rounak](https://huggingface.co/Rounak)|0 days 00:26:10|
	|40|[kshmax](https://huggingface.co/kshmax)|0 days 00:06:38|
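
The runtimes in the leaderboard use a `D days HH:MM:SS` format. If you want to aggregate them, the sketch below parses such strings into `datetime.timedelta` objects and sums the top three entries from the table. The helper name `parse_runtime` is hypothetical and is not part of the sahajBERT tooling.

```python
import re
from datetime import timedelta

# Matches strings such as "11 days 21:02:08" or "0 days 00:06:38".
_RUNTIME = re.compile(r"(\d+) days (\d{2}):(\d{2}):(\d{2})")


def parse_runtime(text: str) -> timedelta:
    """Convert a 'D days HH:MM:SS' string into a timedelta."""
    match = _RUNTIME.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized runtime: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Runtimes of the top three contributors, copied from the table above.
top_three = ["11 days 21:02:08", "9 days 20:37:00", "9 days 18:08:34"]
total = sum((parse_runtime(r) for r in top_three), timedelta())
print(total)  # 31 days, 11:47:42
```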


	### Hardware used

	<iframe width="100%" height="251" frameborder="0"
	src="https://observablehq.com/embed/@huggingface/sahajbert-hardware?cells=c1_noaws"></iframe>

	## Eval results

	We evaluate the quality of sahajBERT and two baseline models ([XLM-R-large](https://huggingface.co/xlm-roberta-large) and [IndicBert](https://huggingface.co/ai4bharat/indic-bert)) by fine-tuning each pre-trained model three times on two downstream tasks in Bengali:

	- NER: named entity recognition on the Bengali split of the [WikiANN](https://huggingface.co/datasets/wikiann) dataset

	- NCC: multi-class classification on the Soham News Category Classification dataset from IndicGLUE

	| Base pre-trained Model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
	|:-------------:|:-------------:|:-------------:|
	|sahajBERT | 95.45 ± 0.53| 91.97 ± 0.47|
	|[XLM-R-large](https://huggingface.co/xlm-roberta-large) | 96.48 ± 0.22| 90.05 ± 0.38|
	|[IndicBert](https://huggingface.co/ai4bharat/indic-bert) | 92.52 ± 0.45| 74.46 ± 1.91|
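
Each cell in the table aggregates the three fine-tuning runs as mean ± standard deviation. The sketch below shows how such a cell can be computed. The helper name `summarize_runs` and the per-run scores are hypothetical (the individual run scores are not published here), and the use of the sample standard deviation is an assumption, since the card does not say which estimator was used.

```python
import statistics
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format per-run scores as 'mean ± std' with two decimals."""
    mean = statistics.mean(scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(scores)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical scores from three runs, NOT the actual sahajBERT results.
hypothetical_runs = [90.0, 91.0, 92.0]
print(summarize_runs(hypothetical_runs))  # 91.00 ± 1.00
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population standard deviation instead, which is noticeably smaller with only three runs.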

	### BibTeX entry and citation info

	Coming soon!

	<!-- ```bibtex

	@inproceedings{...,

	year={2020}

	}

	``` -->