dk-crazydiv committed
Commit 3f415c3
Parent: 2f42c34

Update README.md

Files changed (1):
1. README.md (+27 -25)
README.md CHANGED
@@ -3,7 +3,6 @@ widget:
  - text: "मुझे उनसे बात करना <mask> अच्छा लगा"
  - text: "हम आपके सुखद <mask> की कामना करते हैं"
  - text: "सभी अच्छी चीजों का एक <mask> होता है"
- use_cache: false
  ---

  # RoBERTa base model for Hindi language
@@ -23,28 +22,27 @@ You can use this model directly with a pipeline for masked language modeling:
  ```python
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
- >>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")
- [{'score': 0.2096337080001831,
-   'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
-   'token': 1462,
-   'token_str': ' एकदम'},
-  {'score': 0.17915162444114685,
-   'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
-   'token': 594,
-   'token_str': ' तब'},
-  {'score': 0.15887945890426636,
-   'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
-   'token': 324,
-   'token_str': ' और'},
-  {'score': 0.12024253606796265,
-   'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
-   'token': 743,
-   'token_str': ' लगभग'},
-  {'score': 0.07114479690790176,
-   'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
-   'token': 672,
-   'token_str': ' कब'}]
  ```

  ## Training data
@@ -63,7 +61,11 @@ The RoBERTa Hindi model was pretrained on a combination of the following datasets:

  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of
  the model take pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
- with `<s>` and the end of one by `</s>`. We also did some preliminary cleanup of **mC4** and **oscar** datasets by removing all non hindi(non Devanagiri) characters from the datasets. The model was then trained on a randomized shuffle of all the datasets combined.

  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `<mask>`.
@@ -72,7 +74,7 @@ The details of the masking procedure for each sentence are the following:
  Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).

  ### Pretraining
- The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) **8 v3 TPU cores** for 42K steps with a batch size of 128 and a sequence length of 128.

  ## Evaluation Results
 
README.md (after change)

  - text: "मुझे उनसे बात करना <mask> अच्छा लगा"
  - text: "हम आपके सुखद <mask> की कामना करते हैं"
  - text: "सभी अच्छी चीजों का एक <mask> होता है"
  ---

  # RoBERTa base model for Hindi language
 
  ```python
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
+ >>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
+ [{'score': 0.3310680091381073,
+   'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
+   'token': 1349,
+   'token_str': ' सफर'},
+  {'score': 0.15317578613758087,
+   'sequence': 'हम आपके सुखद पल की कामना करते हैं',
+   'token': 848,
+   'token_str': ' पल'},
+  {'score': 0.07826550304889679,
+   'sequence': 'हम आपके सुखद समय की कामना करते हैं',
+   'token': 453,
+   'token_str': ' समय'},
+  {'score': 0.06304813921451569,
+   'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
+   'token': 404,
+   'token_str': ' पहल'},
+  {'score': 0.058322224766016006,
+   'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
+   'token': 857,
+   'token_str': ' अवसर'}]
  ```
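You can also load the tokenizer and model directly. A minimal sketch using the generic `transformers` auto classes; whether PyTorch weights are published alongside the Flax ones in this repo is an assumption here (if not, the Flax classes such as `FlaxRobertaForMaskedLM` can be used instead):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: PyTorch weights are available in the repo (it was trained with Flax).
tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')
model = AutoModelForMaskedLM.from_pretrained('flax-community/roberta-hindi')

# Score candidate fillers for the masked position.
inputs = tokenizer("सभी अच्छी चीजों का एक <mask> होता है", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and print the five highest-scoring tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```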

  ## Training data
 
  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of
  the model take pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
+ with `<s>` and the end of one by `</s>`.
+ - We performed cleanup of the **mC4** and **oscar** datasets by removing all non-Hindi (non-Devanagari) characters, as sketched below.
+ - We filtered the WikiNER evaluation set of the [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manually labelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) examples whose gold labels were incorrect and by modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py).
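A minimal sketch of the character-level cleanup described in the first bullet above; the exact allow-list used by the real preprocessing script is an assumption here (the core Devanagari block is U+0900 to U+097F):

```python
import re

# Drop everything outside Devanagari (U+0900-U+097F), digits, whitespace
# and basic punctuation. The exact character classes kept by the actual
# cleanup script are an assumption.
NON_HINDI = re.compile(r"[^\u0900-\u097F0-9\s.,!?-]")

def clean_text(text: str) -> str:
    """Replace non-Devanagari characters with spaces and collapse whitespace."""
    text = NON_HINDI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("हम आपके सुखद सफर की कामना करते हैं :) http://spam.example"))
# -> 'हम आपके सुखद सफर की कामना करते हैं .'
```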

  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `<mask>`.

  Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
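This kind of dynamic masking is what `transformers`' `DataCollatorForLanguageModeling` implements when masking is applied at batch-assembly time; a minimal sketch (illustrative, not the project's actual training script):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')

# mlm_probability=0.15 selects 15% of tokens; of the selected tokens the
# collator replaces 80% with <mask> and handles the rest per the
# BERT/RoBERTa recipe. Masks are drawn anew every time a batch is built,
# so they change from epoch to epoch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("हम आपके सुखद सफर की कामना करते हैं")])
print(batch["input_ids"])  # some ids replaced by tokenizer.mask_token_id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```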

  ### Pretraining
+ The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1,000 GB of disk, 96 CPU cores). A randomized shuffle of the combined **mC4**, **oscar** and other datasets listed above was used to train the model. Training logs are available on [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
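A sketch of the "randomized shuffle of the combined dataset" step using the `datasets` library; the dataset identifiers, splits and seed are illustrative assumptions, and the real preprocessing lives in the project repo:

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative: load the Hindi portions of mC4 and OSCAR.
mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the text column so the schemas match before concatenation.
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

# Concatenate and shuffle; the seed is an arbitrary illustrative choice.
combined = concatenate_datasets([mc4_hi, oscar_hi]).shuffle(seed=42)
```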

  ## Evaluation Results