dk-crazydiv committed
Commit 3f415c3
Parent: 2f42c34

Update README.md

Files changed (1):
1. README.md (+27 -25)
README.md CHANGED
@@ -3,7 +3,6 @@ widget:
  - text: "मुझे उनसे बात करना <mask> अच्छा लगा"
  - text: "हम आपके सुखद <mask> की कामना करते हैं"
  - text: "सभी अच्छी चीजों का एक <mask> होता है"
- use_cache: false
  ---

  # RoBERTa base model for Hindi language
@@ -23,28 +22,27 @@ You can use this model directly with a pipeline for masked language modeling:
  ```python
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
- >>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")
- [{'score': 0.2096337080001831,
-   'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
-   'token': 1462,
-   'token_str': ' एकदम'},
-  {'score': 0.17915162444114685,
-   'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
-   'token': 594,
-   'token_str': ' तब'},
-  {'score': 0.15887945890426636,
-   'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
-   'token': 324,
-   'token_str': ' और'},
-  {'score': 0.12024253606796265,
-   'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
-   'token': 743,
-   'token_str': ' लगभग'},
-  {'score': 0.07114479690790176,
-   'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
-   'token': 672,
-   'token_str': ' कब'}]
  ```

  ## Training data
@@ -63,7 +61,11 @@ The RoBERTa Hindi model was pretrained on a combination of the following datasets:

  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of
  the model take pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
- with `<s>` and the end of one by `</s>`. We also did some preliminary cleanup of **mC4** and **oscar** datasets by removing all non hindi(non Devanagiri) characters from the datasets. The model was then trained on a randomized shuffle of all the datasets combined.

  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `<mask>`.
@@ -72,7 +74,7 @@ The details of the masking procedure for each sentence are the following:
  Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).

  ### Pretraining
- The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) **8 v3 TPU cores** for 42K steps with a batch size of 128 and a sequence length of 128.

  ## Evaluation Results
 
README.md (after change)

  - text: "मुझे उनसे बात करना <mask> अच्छा लगा"
  - text: "हम आपके सुखद <mask> की कामना करते हैं"
  - text: "सभी अच्छी चीजों का एक <mask> होता है"
  ---

  # RoBERTa base model for Hindi language
 
  ```python
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
+ >>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
+ [{'score': 0.3310680091381073,
+   'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
+   'token': 1349,
+   'token_str': ' सफर'},
+  {'score': 0.15317578613758087,
+   'sequence': 'हम आपके सुखद पल की कामना करते हैं',
+   'token': 848,
+   'token_str': ' पल'},
+  {'score': 0.07826550304889679,
+   'sequence': 'हम आपके सुखद समय की कामना करते हैं',
+   'token': 453,
+   'token_str': ' समय'},
+  {'score': 0.06304813921451569,
+   'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
+   'token': 404,
+   'token_str': ' पहल'},
+  {'score': 0.058322224766016006,
+   'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
+   'token': 857,
+   'token_str': ' अवसर'}]
  ```
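You can also load the tokenizer and model directly. A minimal sketch using the generic `transformers` auto classes; whether PyTorch weights are published alongside the Flax ones in this repo is an assumption here (if not, the Flax classes such as `FlaxRobertaForMaskedLM` can be used instead):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: PyTorch weights are available in the repo (it was trained with Flax).
tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')
model = AutoModelForMaskedLM.from_pretrained('flax-community/roberta-hindi')

# Score candidate fillers for the masked position.
inputs = tokenizer("सभी अच्छी चीजों का एक <mask> होता है", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and print the five highest-scoring tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```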

  ## Training data
 
  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of
  the model take pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
+ with `<s>` and the end of one by `</s>`.
+ - We performed cleanup of the **mC4** and **oscar** datasets by removing all non-Hindi (non-Devanagari) characters, as sketched below.
+ - We filtered the WikiNER evaluation set of the [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manually labelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) examples whose gold labels were incorrect and by modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py).
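A minimal sketch of the character-level cleanup described in the first bullet above; the exact allow-list used by the real preprocessing script is an assumption here (the core Devanagari block is U+0900 to U+097F):

```python
import re

# Drop everything outside Devanagari (U+0900-U+097F), digits, whitespace
# and basic punctuation. The exact character classes kept by the actual
# cleanup script are an assumption.
NON_HINDI = re.compile(r"[^\u0900-\u097F0-9\s.,!?-]")

def clean_text(text: str) -> str:
    """Replace non-Devanagari characters with spaces and collapse whitespace."""
    text = NON_HINDI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("हम आपके सुखद सफर की कामना करते हैं :) http://spam.example"))
# -> 'हम आपके सुखद सफर की कामना करते हैं .'
```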

  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `<mask>`.

  Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
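This kind of dynamic masking is what `transformers`' `DataCollatorForLanguageModeling` implements when masking is applied at batch-assembly time; a minimal sketch (illustrative, not the project's actual training script):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')

# mlm_probability=0.15 selects 15% of tokens; of the selected tokens the
# collator replaces 80% with <mask> and handles the rest per the
# BERT/RoBERTa recipe. Masks are drawn anew every time a batch is built,
# so they change from epoch to epoch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("हम आपके सुखद सफर की कामना करते हैं")])
print(batch["input_ids"])  # some ids replaced by tokenizer.mask_token_id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```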

  ### Pretraining
+ The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1,000 GB of disk, 96 CPU cores). A randomized shuffle of the combined **mC4**, **oscar** and other datasets listed above was used to train the model. Training logs are available on [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
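A sketch of the "randomized shuffle of the combined dataset" step using the `datasets` library; the dataset identifiers, splits and seed are illustrative assumptions, and the real preprocessing lives in the project repo:

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative: load the Hindi portions of mC4 and OSCAR.
mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the text column so the schemas match before concatenation.
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

# Concatenate and shuffle; the seed is an arbitrary illustrative choice.
combined = concatenate_datasets([mc4_hi, oscar_hi]).shuffle(seed=42)
```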

  ## Evaluation Results