---
widget:
- text: "मुझे उनसे बात करना <mask> अच्छा लगा"
- text: "हम आपके सुखद <mask> की कामना करते हैं"
- text: "सभी अच्छी चीजों का एक <mask> होता है"
use_cache: false
---
# RoBERTa base model for Hindi language
A model pretrained on Hindi text using a masked language modeling (MLM) objective.
> This is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/) with TPU usage sponsored by Google.
## Model description
RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data (a combination of the **mc4, oscar and indic-nlp** datasets).
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
>>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")
[{'score': 0.2096337080001831,
'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
'token': 1462,
'token_str': ' एकदम'},
{'score': 0.17915162444114685,
'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
'token': 594,
'token_str': ' तब'},
{'score': 0.15887945890426636,
'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
'token': 324,
'token_str': ' और'},
{'score': 0.12024253606796265,
'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
'token': 743,
'token_str': ' लगभग'},
{'score': 0.07114479690790176,
'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
'token': 672,
'token_str': ' कब'}]
```
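As the output above shows, the pipeline returns a list of dictionaries with `score`, `sequence`, `token`, and `token_str` keys. The following is a minimal sketch of post-processing that list. The helper name `top_fills` is hypothetical, and the sample data is two entries copied from the output above so that the snippet runs without downloading the model.

```python
def top_fills(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # token_str has a leading space in the output shown above, so strip it.
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:k]]


# Two entries copied from the pipeline output shown above (deliberately unsorted).
sample = [
    {"score": 0.17915162444114685,
     "sequence": "मुझे उनसे बात करना तब अच्छा लगा",
     "token": 594,
     "token_str": " तब"},
    {"score": 0.2096337080001831,
     "sequence": "मुझे उनसे बात करना एकदम अच्छा लगा",
     "token": 1462,
     "token_str": " एकदम"},
]

print(top_fills(sample, k=1))
```

With a live pipeline, the same helper could be called as `top_fills(unmasker("..."))`.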
## Training data
The RoBERTa Hindi model was pretrained on the union of the following datasets:
- [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
- [mC4](https://huggingface.co/datasets/mc4) is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
- [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
- [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpora collection for Indic languages.
- [Hindi Text Short and Large Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-and-large-summarization-corpus) is a collection of ~180k articles with their headlines and summaries, collected from Hindi news websites.
- [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines, collected from Hindi news websites.
- [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of HC Corpora newspapers.
## Training procedure
### Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of
the model are pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`. We also did some preliminary cleanup of the **mC4** and **oscar** datasets by removing all non-Hindi (non-Devanagari) characters. The model was then trained on a randomized shuffle of all the datasets combined.
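The cleanup and packing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the team's actual preprocessing script: the function names are hypothetical, "non-Devanagari" is taken to mean anything outside the Unicode Devanagari block (U+0900 to U+097F), whitespace is kept so that words stay separated, and trailing tokens that do not fill a whole block are dropped.

```python
import re

# Assumption: keep only the Unicode Devanagari block plus whitespace.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]+")


def keep_devanagari(text):
    """Drop non-Devanagari characters and collapse the leftover whitespace."""
    cleaned = NON_DEVANAGARI.sub(" ", text)
    return " ".join(cleaned.split())


def pack_documents(docs_token_ids, bos_id, eos_id, block_size=512):
    """Wrap each document in <s> ... </s> ids and cut the stream into blocks."""
    stream = []
    for ids in docs_token_ids:
        stream.extend([bos_id, *ids, eos_id])
    # Keep only full blocks; a block may span a document boundary.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


print(keep_devanagari("Hello मुझे 123 अच्छा!"))
# Toy ids: 0 stands in for <s> and 9 for </s> (placeholder values).
print(pack_documents([[1, 2], [3]], bos_id=0, eos_id=9, block_size=3))
```

Note that this range also removes Latin punctuation and ASCII digits, while keeping Devanagari digits and the danda, which fall inside the block.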
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.
Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
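The masking rule above can be sketched in plain Python. This is a simplified illustration, not the training code used for this model: the function name, the `mask_id` value, and the `-100` "ignore" label are assumptions made for the example, and special tokens are not excluded from masking here.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, rng, mask_prob=0.15):
    """Apply the 15% selection and 80/10/10 replacement rule to one sequence.

    Returns (inputs, labels); labels hold -100 at positions with no prediction.
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with <mask>
        elif roll < 0.9:
            # 10%: draw uniformly from every token id except the original.
            new_tok = rng.randrange(vocab_size - 1)
            inputs[i] = new_tok if new_tok < tok else new_tok + 1
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
tokens = list(range(10, 30))  # toy token ids
# Calling the function again draws a fresh mask, as happens at each epoch.
epoch1, _ = mask_tokens(tokens, mask_id=4, vocab_size=50265, rng=rng)
epoch2, _ = mask_tokens(tokens, mask_id=4, vocab_size=50265, rng=rng)
```

Because the mask is sampled on every call rather than stored with the dataset, the same sentence is seen with different masked positions across epochs.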
### Pretraining
The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1000 GB of hard drive, 96 CPU cores, **8 v3 TPU cores**) for 42K steps with a batch size of 128 and a sequence length of 128.
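As a rough back-of-the-envelope check, the figures above imply an upper bound on the number of tokens processed during pretraining. The sketch below assumes that 128 is the global batch size across all 8 TPU cores and that every sequence is filled to the full length; neither is stated explicitly above.

```python
steps = 42_000
batch_size = 128  # assumption: global batch size, not per core
seq_len = 128
tpu_cores = 8

sequences_seen = steps * batch_size
tokens_seen = sequences_seen * seq_len    # upper bound, ignores padding
per_core_batch = batch_size // tpu_cores  # only valid under the global-batch assumption

print(f"sequences: {sequences_seen:,}")
print(f"tokens (upper bound): {tokens_seen:,}")
print(f"per-core batch: {per_core_batch}")
```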
## Evaluation Results
RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.
| Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
|-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
| BBC News Classification | Genre Classification | **76.44** | 66.86 | **77.6** | 64.9 | 73.67 |
| WikiNER | Token Classification | - | 90.68 | **95.09** | 89.61 | **92.76** |
| IITP Product Reviews | Sentiment Analysis | **78.01** | 73.23 | **78.39** | 66.16 | 75.53 |
| IITP Movie Reviews | Sentiment Analysis | 60.97 | 52.26 | **70.65** | 49.35 | **61.29** |
## Team Members
- Aman K ([amankhandelia](https://huggingface.co/amankhandelia))
- Haswanth Aekula ([hassiahk](https://huggingface.co/hassiahk))
- Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
- Prateek Agrawal ([prateekagrawal](https://huggingface.co/prateekagrawal))
- Rahul Dev ([mlkorra](https://huggingface.co/mlkorra))
## Credits
Huge thanks to the Huggingface 🤗 & Google Jax/Flax teams for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.
<img src=https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium>