dk-crazydiv committed • Commit 6ecb4b7 • 1 Parent(s): ab117ce

Modified readme data and examples
Files changed:
- About/intro.md +1 -1
- About/model_description.md +1 -1
- About/results.md +1 -1
- About/training_data.md +2 -2
- About/training_procedure.md +7 -14
- About/use.md +29 -21
- apps/mlm.py +2 -2
- config.json +1 -1
- mlm_custom/mlm_targeted_text.csv +15 -17
About/intro.md CHANGED
@@ -1,6 +1,6 @@
 # RoBERTa base model for Hindi language
 
-[Pretrained model on Hindi language](https://huggingface.co/flax-community/roberta-hindi) using a masked language modeling (MLM) objective.
+[Pretrained model on Hindi language](https://huggingface.co/flax-community/roberta-hindi) using a masked language modeling (MLM) objective. The model achieves competitive accuracy compared to pre-existing models on downstream tasks such as Named Entity Recognition and classification. Some MLM examples show visible room for improvement, but it should serve as a good base model for Hindi and can be fine-tuned on specific datasets.
 
 > This is part of the
 [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
About/model_description.md CHANGED
@@ -1,3 +1,3 @@
 ## Model description
 
-It is a monolingual transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
+It is a monolingual transformers model pretrained on a large corpus of Hindi data (100GB+) in a self-supervised fashion.
About/results.md CHANGED
@@ -2,7 +2,7 @@
 
 RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.
 
-| Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi        |
+| Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi (ours) |
 |-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|----------------------|
 | BBC News Classification | Genre Classification | **76.44** | 66.86      | **77.6**                      | 64.9                  | 73.67                |
 | WikiNER                 | Token Classification | -         | 90.68      | **95.09**                     | 89.61                 | **92.76**            |
About/training_data.md CHANGED
@@ -1,8 +1,8 @@
 ## Training data
 
-The RoBERTa model was pretrained on the
-- [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
-- [mC4](https://huggingface.co/datasets/mc4) is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
+The RoBERTa model was pretrained on the union (randomly shuffled) of the following datasets:
+- [mC4](https://huggingface.co/datasets/mc4) is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
+- [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
 - [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
 - [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpora collection for Indic languages.
 - [Hindi Wikipedia Articles - 172k](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k) is a dataset of 172k cleaned Hindi Wikipedia articles.
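
Taking the "union, randomly shuffled" description above at face value, here is a minimal sketch (not the authors' actual script) of how the two largest corpora could be combined with the `datasets` library; the config names and the seed are illustrative assumptions.

```python
# Hedged sketch: merge and shuffle the Hindi splits of mC4 and OSCAR with `datasets`.
from datasets import load_dataset, concatenate_datasets

mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the shared "text" column so the two schemas line up before concatenation.
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

# Union of the corpora, followed by a random shuffle, as described above.
combined = concatenate_datasets([mc4_hi, oscar_hi]).shuffle(seed=42)
```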
About/training_procedure.md CHANGED
@@ -1,18 +1,11 @@
 ## Training procedure
 ### Preprocessing
-- 15% of the tokens are masked.
-- In 80% of the cases, the masked tokens are replaced by `<mask>`.
-- In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
-- In the 10% remaining cases, the masked tokens are left as is.
-Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
+The texts are tokenized using Byte-Pair Encoding (BPE) with a vocabulary size of 50,265.
+- The `mc4` and `oscar` datasets were available in the `datasets` library. For the rest of the datasets, we wrote our own loading scripts, [available here](https://github.com/amankhandelia/roberta_hindi/blob/master/test_custom_datasets.py).
+- It was slightly challenging to preprocess the `mc4` dataset (104GB+) and use it in non-streaming mode. The `datasets` library has many helper functions that allowed us to merge and shuffle the datasets with ease.
+- We had to clean the mC4 and OSCAR datasets by removing all non-Hindi (non-Devanagari) characters, as both are somewhat noisy.
+- We attempted to filter the evaluation set of the [WikiNER task of IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by manually relabelling examples where the original labels were incorrect, and modified the downstream evaluation dataset accordingly. The code and the manually labelled file are in our [GitHub repo](https://github.com/amankhandelia/roberta_hindi).
 
 ### Pretraining
-The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores)
-\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning rate after.
+The model was trained on a Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores). A randomized shuffle of the combined mC4, OSCAR, and other datasets listed above was used to train the model. Downstream training logs are available on [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
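
The non-Devanagari cleanup mentioned in the preprocessing notes could look roughly like the sketch below; the exact filtering rules used by the authors are not shown in the diff, so the regex and the retained characters are assumptions.

```python
# Hedged sketch: strip characters outside the Devanagari block (keeping whitespace)
# from a text example, as a stand-in for the cleanup applied to the mC4 and OSCAR corpora.
import re

# Everything that is not Devanagari (U+0900-U+097F) or whitespace gets replaced by a space.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]")

def clean_example(example):
    example["text"] = NON_DEVANAGARI.sub(" ", example["text"])
    return example

# Applied to a `datasets` Dataset, e.g.: combined = combined.map(clean_example)
```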
About/use.md CHANGED
@@ -5,26 +5,34 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
 >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
->>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")
-[{'score': ...,
-  'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
-  'token': 1462,
-  'token_str': ' एकदम'},
- {'score': 0.17915162444114685,
-  'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
-  'token': 594,
-  'token_str': ' तब'},
- {'score': 0.15887945890426636,
-  'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
-  'token': 324,
-  'token_str': ' और'},
- {'score': 0.12024253606796265,
-  'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
-  'token': 743,
-  'token_str': ' लगभग'},
- {'score': 0.07114479690790176,
-  'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
-  'token': 672,
-  'token_str': ' कब'}]
+>>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
+[{'score': 0.3310680091381073,
+  'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
+  'token': 1349,
+  'token_str': ' सफर'},
+ {'score': 0.15317578613758087,
+  'sequence': 'हम आपके सुखद पल की कामना करते हैं',
+  'token': 848,
+  'token_str': ' पल'},
+ {'score': 0.07826550304889679,
+  'sequence': 'हम आपके सुखद समय की कामना करते हैं',
+  'token': 453,
+  'token_str': ' समय'},
+ {'score': 0.06304813921451569,
+  'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
+  'token': 404,
+  'token_str': ' पहल'},
+ {'score': 0.058322224766016006,
+  'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
+  'token': 857,
+  'token_str': ' अवसर'}]
 ```
+
+Alternatively, the model can be loaded with AutoModel classes (and then fine-tuned on downstream tasks):
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
+model = AutoModelForMaskedLM.from_pretrained("flax-community/roberta-hindi")
+```
 
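
Following up on the note about fine-tuning on downstream tasks, here is a hedged sketch (not code from the original repo) of loading the same checkpoint with a classification head; the label count is a hypothetical placeholder.

```python
# Hedged sketch: load the checkpoint for downstream fine-tuning, e.g. genre classification.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-hindi",
    num_labels=6,  # hypothetical: set to the number of classes in your dataset
)
# `model` can now be trained with the Trainer API or a custom training loop.
```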
apps/mlm.py CHANGED
@@ -56,8 +56,8 @@ def app():
     masked_text = st.text_area("Please type a masked sentence to fill", select_text)
 
     st.sidebar.markdown(
-        "### Models\n"
-        "- [RoBERTa Hindi](https://huggingface.co/flax-community/roberta-hindi)\n"
+        "### MLM Models for comparison:\n"
+        "- [RoBERTa Hindi](https://huggingface.co/flax-community/roberta-hindi) (ours)\n"
         "- [Indic Transformers Hindi](https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert)\n"
         "- [HindiBERTa](https://huggingface.co/mrm8488/HindiBERTa)\n"
         "- [RoBERTa Hindi Guj San](https://huggingface.co/surajp/RoBERTa-hindi-guj-san)"
config.json CHANGED
@@ -1,6 +1,6 @@
 {
   "models": {
-    "RoBERTa Hindi": "flax-community/roberta-hindi",
+    "RoBERTa Hindi (ours)": "flax-community/roberta-hindi",
     "Indic Transformers Hindi": "neuralspace-reverie/indic-transformers-hi-bert",
     "HindiBERTa": "mrm8488/HindiBERTa",
     "RoBERTa Hindi Guj San": "surajp/RoBERTa-hindi-guj-san"
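
For illustration only, a minimal sketch (assuming, not asserting, that the Streamlit app reads this file) of how the `models` mapping in config.json could drive a fill-mask comparison across the listed checkpoints:

```python
# Hedged sketch: iterate over the models declared in config.json and compare
# their top fill-mask prediction for one Hindi prompt.
import json
from transformers import pipeline

with open("config.json") as f:
    models = json.load(f)["models"]

for name, model_id in models.items():
    unmasker = pipeline("fill-mask", model=model_id)
    # Use each model's own mask token, since BERT- and RoBERTa-style models differ.
    text = f"हम आपके सुखद {unmasker.tokenizer.mask_token} की कामना करते हैं"
    print(name, "->", unmasker(text)[0]["sequence"])
```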
mlm_custom/mlm_targeted_text.csv CHANGED
@@ -1,18 +1,16 @@
 user_id,text,output,multi
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-hassiahk,किसी पुस्तक को उसके <mask> से मत आंकिए,आवरण,
-hassiahk,सभी <mask> चीजों का एक अंत होता है,"[""अच्छी"", ""बुरी""]",TRUE
+z,"ताजमहल की <mask> को देखने के लिए, देश दुनिया से लोग दूर दूर से आते है",सुन्दरता,
+z,गौर से देखिए इस मासूम <mask> को,बच्ची,
+z,"बात तय थी, लेकिन ऐन मौके पर उसके मुकर जाने से सारा काम <mask> में पड़ गया",खटाई,
+z,"कारखाने के तैयार माल को पैक करने की अनेक <mask> मशीनें आज मिलती हैं",z,
+z,"प्रत्येक टीम में अधिकतम <mask> खिलाड़ी होते हैं",z,
+z,"बार बार देखो, हज़ार बार देखो, ये देखने की <mask> है","[""चीज़"",""बात""]",TRUE
+z,ट्रंप कल अहमदाबाद में प्रधानमंत्री मोदी से <mask> करने जा रहे हैं,"[""मुलाकात"",""मिल्ने""]",TRUE
+z,कहने को साथ अपने ये <mask> चलती है,दुनिया,
+z,"ये इश्क़ नहीं आसान बस इतना समझ लीजिये, एक आग का दरिया है और <mask> के जाना है",डूब,
+z,आपका दिन <mask> हो,"[""शुभ"",""अच्छा""]",TRUE
+z,शुभ <mask>,"[""प्रभात"",""रात्रि"",""यात्रा"",""अवसर""]",TRUE
+z,बात ये है कि आप इसे <mask> से ही जानते हैं,पहले,
+z,<mask> पूर्व में उगता है,सूरज,
+z,"जल्दी सोना और जल्दी उठना इंसान को स्वस्थ ,समृद्ध और बुद्धिमान <mask> है",बनाता,
+z,सभी <mask> चीजों का एक अंत होता है,"[""अच्छी"", ""बुरी""]",TRUE
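
These targeted sentences can serve as a quick sanity check of the model's masked-language predictions. Below is a hedged sketch (not part of the repository) of reading the CSV and comparing the expected fillers against the top pipeline predictions; the file path and top-k handling are assumptions.

```python
# Hedged sketch: spot-check the model against mlm_custom/mlm_targeted_text.csv.
import csv
import json
from transformers import pipeline

unmasker = pipeline("fill-mask", model="flax-community/roberta-hindi")

with open("mlm_custom/mlm_targeted_text.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # "multi" rows store a JSON list of acceptable answers; others store one word.
        expected = json.loads(row["output"]) if row["multi"] == "TRUE" else [row["output"]]
        top_predictions = [p["token_str"].strip() for p in unmasker(row["text"])]
        hit = any(word in top_predictions for word in expected)
        print(f"{'OK  ' if hit else 'MISS'} {row['text']} -> {top_predictions}")
```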