dk-crazydiv committed
Commit 3f415c3
Parent(s): 2f42c34
Update README.md

README.md CHANGED
@@ -3,7 +3,6 @@ widget:
 - text: "मुझे उनसे बात करना <mask> अच्छा लगा"
 - text: "हम आपके सुखद <mask> की कामना करते हैं"
 - text: "सभी अच्छी चीजों का एक <mask> होता है"
-use_cache: false
 ---
 
 # RoBERTa base model for Hindi language
@@ -23,28 +22,27 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
 >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
->>> unmasker("…")
-…
-  'token_str': ' कब'}]
+>>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
+[{'score': 0.3310680091381073,
+  'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
+  'token': 1349,
+  'token_str': ' सफर'},
+ {'score': 0.15317578613758087,
+  'sequence': 'हम आपके सुखद पल की कामना करते हैं',
+  'token': 848,
+  'token_str': ' पल'},
+ {'score': 0.07826550304889679,
+  'sequence': 'हम आपके सुखद समय की कामना करते हैं',
+  'token': 453,
+  'token_str': ' समय'},
+ {'score': 0.06304813921451569,
+  'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
+  'token': 404,
+  'token_str': ' पहल'},
+ {'score': 0.058322224766016006,
+  'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
+  'token': 857,
+  'token_str': ' अवसर'}]
 ```
 
 ## Training data
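The example above relies on the `fill-mask` pipeline. The same query can also be run with `AutoTokenizer` and `AutoModelForMaskedLM` directly; the following is a minimal sketch assuming only the `flax-community/roberta-hindi` checkpoint used above.

```python
# Minimal sketch: the fill-mask query above, without the pipeline wrapper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
# If the repository only ships Flax weights, add from_flax=True here.
model = AutoModelForMaskedLM.from_pretrained("flax-community/roberta-hindi")

inputs = tokenizer("हम आपके सुखद <mask> की कामना करते हैं", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the five highest-probability tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(round(score, 4), tokenizer.decode(token_id))
```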
@@ -63,7 +61,11 @@ The RoBERTa Hindi model was pretrained on the reunion of the following datasets:
 
 The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
 the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
-with `<s>` and the end of one by `</s>`.
+with `<s>` and the end of one by `</s>`.
+- We had to perform cleanup of the **mC4** and **oscar** datasets by removing all non-Hindi (non-Devanagari) characters from them.
+- We tried to filter out the evaluation set of WikiNER from the [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manually labelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) the entries whose original labels were not correct, and by modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py) accordingly.
+
+
 The details of the masking procedure for each sentence are the following:
 - 15% of the tokens are masked.
 - In 80% of the cases, the masked tokens are replaced by `<mask>`.
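The masking bullets above follow the usual RoBERTa recipe, and the next hunk notes that the masking is dynamic. In the `transformers` library this behaviour is what `DataCollatorForLanguageModeling` provides, re-sampling the mask every time a batch is built; the sketch below is only an illustration of that procedure, and whether the authors used this exact collator is an assumption.

```python
# Illustrative sketch of 15% dynamic masking with the transformers data collator.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # masked language modelling objective
    mlm_probability=0.15,  # 15% of the tokens are selected for masking
)

# The collator re-draws the mask on every call, so the same sentence is masked
# differently across epochs. Of the selected tokens, 80% become <mask>,
# 10% become a random token and 10% are left unchanged.
features = [tokenizer("हम आपके सुखद सफर की कामना करते हैं")]
batch = collator(features)
print(batch["input_ids"])   # some positions replaced by tokenizer.mask_token_id
print(batch["labels"])      # -100 everywhere except the selected positions
```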
@@ -72,7 +74,7 @@ The details of the masking procedure for each sentence are the following:
 Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
 
 ### Pretraining
-The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores)
+The model was trained on a Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores). A randomized shuffle of the combined **mC4**, **oscar** and other datasets listed above was used to train the model. Training logs are available on [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
 
 ## Evaluation Results
 
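The pretraining note above mentions a randomized shuffle of the combined **mC4**, **oscar** and other datasets, and an earlier bullet mentions stripping non-Devanagari characters from mC4 and OSCAR. A rough sketch of that combination with the `datasets` library is shown below; the dataset configs, the cleanup regex and the seed are illustrative assumptions, not the authors' actual preprocessing.

```python
# Rough sketch only: combine Hindi mC4 and OSCAR, strip non-Devanagari
# characters, and shuffle before training. Configs, regex and seed are
# illustrative assumptions.
import re
from datasets import load_dataset, concatenate_datasets

mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the text column so the two schemas can be concatenated.
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

# Drop everything outside the Devanagari Unicode block (plus whitespace);
# the exact character set the authors kept is not specified in the card.
non_devanagari = re.compile(r"[^\u0900-\u097F\s]")

def clean(example):
    example["text"] = non_devanagari.sub("", example["text"])
    return example

combined = concatenate_datasets([mc4_hi.map(clean), oscar_hi.map(clean)])
shuffled = combined.shuffle(seed=42)  # randomized shuffle of the combined data
```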