	---
	widget:
	- text: "मुझे उनसे बात करना <mask> अच्छा लगा"
	- text: "हम आपके सुखद <mask> की कामना करते हैं"
	- text: "सभी अच्छी चीजों का एक <mask> होता है"
	---

	# RoBERTa base model for Hindi language

	A model pretrained on the Hindi language using a masked language modeling (MLM) objective. RoBERTa was introduced in
	[this paper](https://arxiv.org/abs/1907.11692) and first released in
	[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).

	> This is part of the
	[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [Hugging Face](https://huggingface.co/), with TPU usage sponsored by Google.

	## Model description

	RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.

	### How to use

	You can use this model directly with a pipeline for masked language modeling:
	```python
	>>> from transformers import pipeline
	>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
	>>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")

	[{'score': 0.2096337080001831,
	'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
	'token': 1462,
	'token_str': ' एकदम'},
	{'score': 0.17915162444114685,
	'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
	'token': 594,
	'token_str': ' तब'},
	{'score': 0.15887945890426636,
	'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
	'token': 324,
	'token_str': ' और'},
	{'score': 0.12024253606796265,
	'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
	'token': 743,
	'token_str': ' लगभग'},
	{'score': 0.07114479690790176,
	'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
	'token': 672,
	'token_str': ' कब'}]
	```
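
The pipeline returns a list of dictionaries, one per candidate, and each `token_str` keeps the leading space that the tokenizer attaches to word-initial tokens. Below is a minimal sketch of post-processing such output. The helper name `summarize_predictions` is hypothetical, and the sample data is copied from the first three candidates above so that the snippet runs without downloading the model.

```python
from typing import Dict, List, Tuple

# Sample output copied from the example above (first three candidates only).
predictions = [
    {"score": 0.2096337080001831,
     "sequence": "मुझे उनसे बात करना एकदम अच्छा लगा",
     "token": 1462,
     "token_str": " एकदम"},
    {"score": 0.17915162444114685,
     "sequence": "मुझे उनसे बात करना तब अच्छा लगा",
     "token": 594,
     "token_str": " तब"},
    {"score": 0.15887945890426636,
     "sequence": "मुझे उनसे बात करना और अच्छा लगा",
     "token": 324,
     "token_str": " और"},
]


def summarize_predictions(preds: List[Dict]) -> List[Tuple[str, float]]:
    """Return (token, score) pairs sorted by descending score.

    Hypothetical helper: strips the leading space from each token string.
    """
    ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), p["score"]) for p in ranked]


summary = summarize_predictions(predictions)
best_token, best_score = summary[0]
# Probability mass covered by the candidates that were passed in.
covered_mass = sum(score for _, score in summary)

print(f"best candidate: {best_token} ({best_score:.3f})")
print(f"mass covered by listed candidates: {covered_mass:.3f}")
```

The covered mass gives a rough sense of how concentrated the model's prediction is: a low value means the probability is spread over many tokens outside the returned list.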

	## Training data

	The RoBERTa Hindi model was pretrained on the union of the following datasets:
	- [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
	- [mC4](https://huggingface.co/datasets/mc4) is a colossal, cleaned, multilingual version of Common Crawl's web crawl corpus.
	- [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
	- [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpora collection for Indic languages.
	- [Hindi Wikipedia Articles - 172k](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k) is a dataset with cleaned 172k Wikipedia articles.
	- [Hindi Text Short and Large Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-and-large-summarization-corpus) is a collection of ~180k articles with their headlines and summary collected from Hindi News Websites.
	- [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines collected from Hindi News Websites.
	- [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of HC Corpora newspapers.

	## Training procedure
	### Preprocessing

	The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of
	the model are blocks of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
	with `<s>` and the end of one with `</s>`.
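
One way to picture this packing step: each document is wrapped in `<s>` and `</s>`, all documents are concatenated into one stream, and the stream is cut into fixed-size blocks. The following is a minimal sketch under those assumptions, not the preprocessing script used for this model. `pack_documents` is a hypothetical helper, tokens are plain strings, a block size of 4 stands in for 512, and dropping the trailing partial block is also an assumption.

```python
from typing import List

BOS, EOS = "<s>", "</s>"


def pack_documents(docs: List[List[str]], block_size: int) -> List[List[str]]:
    """Concatenate tokenized documents and cut them into full blocks.

    Hypothetical helper: a trailing partial block is dropped.
    """
    stream: List[str] = []
    for doc in docs:
        # Mark document boundaries so they survive the concatenation.
        stream.extend([BOS, *doc, EOS])
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


docs = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
blocks = pack_documents(docs, block_size=4)
for block in blocks:
    print(block)
```

Note that a block may start or end in the middle of a document, which is what "may span multiple documents" means in practice.
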
	The details of the masking procedure for each sentence are the following:
	- 15% of the tokens are masked.
	- In 80% of the cases, the masked tokens are replaced by `<mask>`.
	- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
	- In the 10% remaining cases, the masked tokens are left as is.
	Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
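
The masking rule above can be sketched in plain Python. This is an illustrative sketch, not the data collator used for this model: `mask_tokens`, the toy vocabulary, and the use of `None` as the label for unselected positions are assumptions made for the example. It also assumes each token is selected independently with probability 0.15 and does not treat special tokens separately.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Apply the 80/10/10 masking rule (hypothetical helper).

    Returns the corrupted inputs and per-position labels; a label is None
    where the token was not selected for prediction.
    """
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # not selected: input unchanged, no label
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with <mask>
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            inputs[i] = rng.choice([v for v in vocab if v != tok])
        # remaining 10%: leave the token as is
    return inputs, labels


vocab = [f"tok{i}" for i in range(50)]
tokens = [vocab[i % len(vocab)] for i in range(10_000)]
rng = random.Random(0)
inputs, labels = mask_tokens(tokens, vocab, rng)

n_selected = sum(label is not None for label in labels)
n_masked = sum(tok == MASK for tok in inputs)
print(f"selected: {n_selected / len(tokens):.3f}")
print(f"replaced by <mask> among selected: {n_masked / n_selected:.3f}")
```

Because the function draws fresh random positions on every call, calling it again for each epoch gives a different mask each time, which is the dynamic behaviour described above.
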

	### Pretraining
	The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) with 8 v3 TPU cores for 42K steps with a batch size of 128 and a sequence length of 128. The
	optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
	\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps, and linear decay of the learning
	rate afterwards.
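
The learning rate schedule described above can be written as a small function. This sketch assumes that the warmup rises linearly from 0 to the peak rate and that the decay reaches 0 at the final step (42,000). The text states only the peak rate, the warmup length, and that the decay is linear, so those endpoints and the name `learning_rate` are assumptions.

```python
PEAK_LR = 6e-4         # learning rate stated above
WARMUP_STEPS = 24_000  # warmup length stated above
TOTAL_STEPS = 42_000   # total training steps stated above


def learning_rate(step: int) -> float:
    """Linear warmup followed by linear decay (hypothetical helper).

    Assumes warmup starts at 0 and decay ends at 0 on the last step.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 12_000, 24_000, 33_000, 42_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks exactly at the end of warmup and is halved midway through each of the two phases.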

	## Evaluation Results

	RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.

	| Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
	|-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
	| BBC News Classification | Genre Classification | 76.44     | 66.86      | 77.6                          | 64.9                  | 73.67         |
	| WikiNER                 | Token Classification | -         | 90.68      | 95.09                         | 89.61                 | 92.76         |
	| IITP Product Reviews    | Sentiment Analysis   | 78.01     | 73.23      | 78.39                         | 66.16                 | 75.53         |
	| IITP Movie Reviews      | Sentiment Analysis   | 60.97     | 52.26      | 70.65                         | 49.35                 | 61.29         |

	## Team Members
	- Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
	- Aman K ([amankhandelia](https://huggingface.co/amankhandelia))
	- Haswanth Aekula ([hassiahk](https://huggingface.co/hassiahk))
	- Rahul Dev ([mlkorra](https://huggingface.co/mlkorra))
	- Prateek Agrawal ([prateekagrawal](https://huggingface.co/prateekagrawal))

	## Credits
	Huge thanks to Hugging Face 🤗 and the Google Jax/Flax team for such a wonderful community week, especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) and [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.

	<img src=https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium>