dk-crazydiv committed on
Commit
6ecb4b7
1 Parent(s): ab117ce

Modified readme data and examples

About/intro.md CHANGED
@@ -1,6 +1,6 @@
  # RoBERTa base model for Hindi language

- [Pretrained model on Hindi language](https://huggingface.co/flax-community/roberta-hindi) using a masked language modeling (MLM) objective.
+ [Pretrained model on Hindi language](https://huggingface.co/flax-community/roberta-hindi) using a masked language modeling (MLM) objective. The model achieves competitive accuracy compared to pre-existing models on downstream tasks such as named entity recognition and classification. Some MLM examples show there is still visible room for improvement, but it should serve as a good base model for Hindi and can be fine-tuned on specific datasets.

  > This is part of the
  [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
About/model_description.md CHANGED
@@ -1,3 +1,3 @@
  ## Model description

- It is a monolingual transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
+ It is a monolingual transformers model pretrained on a large corpus of Hindi data (100GB+) in a self-supervised fashion.
About/results.md CHANGED
@@ -2,7 +2,7 @@

  RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.

- | Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
+ | Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi (ours) |
  |-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
  | BBC News Classification | Genre Classification | **76.44** | 66.86      | **77.6**                      | 64.9                  | 73.67         |
  | WikiNER                 | Token Classification | -         | 90.68      | **95.09**                     | 89.61                 | **92.76**     |
About/training_data.md CHANGED
@@ -1,8 +1,8 @@
  ## Training data

- The RoBERTa model was pretrained on the reunion of the following datasets:
- - [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
+ The RoBERTa model was pretrained on the union, followed by a random shuffle, of the following datasets:
  - [mC4](https://huggingface.co/datasets/mc4) is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
+ - [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  - [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
  - [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpora collection for Indic language.
  - [Hindi Wikipedia Articles - 172k](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k) is a dataset with cleaned 172k Wikipedia articles.
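
The union-and-shuffle step described above can be reproduced with the `datasets` library. The sketch below is illustrative rather than the repo's actual data-preparation script; the Hindi config names follow the Hub identifiers cited above, while the column selection and seed are assumptions.

```python
from datasets import load_dataset, concatenate_datasets

# Hindi subsets of the multilingual corpora listed above (large downloads).
mc4_hi = load_dataset("mc4", "hi", split="train")
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Keep only the raw text column so the schemas match before concatenation
# (the two corpora carry different metadata columns).
mc4_hi = mc4_hi.remove_columns([c for c in mc4_hi.column_names if c != "text"])
oscar_hi = oscar_hi.remove_columns([c for c in oscar_hi.column_names if c != "text"])

# Union of the corpora followed by a random shuffle, as described above.
combined = concatenate_datasets([mc4_hi, oscar_hi]).shuffle(seed=42)
```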
About/training_procedure.md CHANGED
@@ -1,18 +1,11 @@
  ## Training procedure
  ### Preprocessing
-
- The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
- the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked
- with `<s>` and the end of one by `</s>`
- The details of the masking procedure for each sentence are the following:
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `<mask>`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- - In the 10% remaining cases, the masked tokens are left as is.
- Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
+ The texts are tokenized using Byte-Pair Encoding (BPE) with a vocabulary size of 50265.
+ - The `mc4` and `oscar` datasets were available in the `datasets` library. For the rest of the datasets, we wrote our own loading scripts [available here](https://github.com/amankhandelia/roberta_hindi/blob/master/test_custom_datasets.py).
+ - It was slightly challenging to preprocess the `mc4` dataset (104GB+) and use it in non-streaming mode. The `datasets` library has many helper functions that allowed us to merge and shuffle the datasets with ease.
+ - We cleaned up the mC4 and OSCAR datasets by removing all non-Hindi (non-Devanagari) characters, as the datasets are relatively noisy.
+ - We attempted to filter out incorrectly labelled examples from the evaluation set of the [WikiNER task of IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by manual labelling, and modified the downstream evaluation dataset accordingly. The code and the manually labelled file are present in our [github repo](https://github.com/amankhandelia/roberta_hindi).

  ### Pretraining
- The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) **8 v3 TPU cores** for 42K steps with a batch size of 128 and a sequence length of 128. The
- optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
- \\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
- rate after.
+ The model was trained on a Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores). A randomized shuffle of the combined mC4, OSCAR and other datasets listed above was used to train the model. Downstream training logs are present in [wandb](https://wandb.ai/wandb/hf-flax-roberta-hindi).
+
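
The Devanagari-only cleanup mentioned in the Preprocessing notes could look roughly like the sketch below; the exact character ranges and the `text` column name are assumptions, not the repo's actual cleanup rule.

```python
import re

from datasets import load_dataset

# Keep Devanagari characters, digits, whitespace and basic punctuation;
# replace everything else with a space. The precise ranges are assumed.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F0-9\s.,!?]")

def clean_text(example):
    example["text"] = NON_DEVANAGARI.sub(" ", example["text"])
    return example

# Example: apply the cleanup to the Hindi OSCAR subset.
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")
oscar_hi = oscar_hi.map(clean_text)
```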
 
 
About/use.md CHANGED
@@ -5,26 +5,34 @@ You can use this model directly with a pipeline for masked language modeling:
  ```python
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
- >>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")
-
- [{'score': 0.2096337080001831,
-  'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
-  'token': 1462,
-  'token_str': ' एकदम'},
-  {'score': 0.17915162444114685,
-  'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
-  'token': 594,
-  'token_str': ' तब'},
-  {'score': 0.15887945890426636,
-  'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
-  'token': 324,
-  'token_str': ' और'},
-  {'score': 0.12024253606796265,
-  'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
-  'token': 743,
-  'token_str': ' लगभग'},
-  {'score': 0.07114479690790176,
-  'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
-  'token': 672,
-  'token_str': ' कब'}]
+ >>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
+ [{'score': 0.3310680091381073,
+  'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
+  'token': 1349,
+  'token_str': ' सफर'},
+  {'score': 0.15317578613758087,
+  'sequence': 'हम आपके सुखद पल की कामना करते हैं',
+  'token': 848,
+  'token_str': ' पल'},
+  {'score': 0.07826550304889679,
+  'sequence': 'हम आपके सुखद समय की कामना करते हैं',
+  'token': 453,
+  'token_str': ' समय'},
+  {'score': 0.06304813921451569,
+  'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
+  'token': 404,
+  'token_str': ' पहल'},
+  {'score': 0.058322224766016006,
+  'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
+  'token': 857,
+  'token_str': ' अवसर'}]
+ ```
+
+ Alternatively, the model can be loaded with the Auto classes (and then fine-tuned on downstream tasks):
  ```
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
+
+ model = AutoModelForMaskedLM.from_pretrained("flax-community/roberta-hindi")
+ ```
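
As a sketch of the "fine-tuned on downstream tasks" route mentioned above, the snippet below adapts the checkpoint for sequence classification with the `Trainer` API; the CSV file, label count and hyper-parameters are placeholders, not settings used by the authors.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-hindi", num_labels=2)  # label count is a placeholder

# Placeholder dataset with "text" and "label" columns; swap in a real Hindi
# classification dataset (e.g. one of the downstream tasks above).
train_ds = load_dataset("csv", data_files={"train": "train.csv"})["train"]
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-hindi-finetuned",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```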
apps/mlm.py CHANGED
@@ -56,8 +56,8 @@ def app():
  masked_text = st.text_area("Please type a masked sentence to fill", select_text)

  st.sidebar.markdown(
-     "### Models\n"
-     "- [RoBERTa Hindi](https://huggingface.co/flax-community/roberta-hindi)\n"
+     "### MLM Models for comparison:\n"
+     "- [RoBERTa Hindi](https://huggingface.co/flax-community/roberta-hindi) (ours)\n"
      "- [Indic Transformers Hindi](https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert)\n"
      "- [HindiBERTa](https://huggingface.co/mrm8488/HindiBERTa)\n"
      "- [RoBERTa Hindi Guj San](https://huggingface.co/surajp/RoBERTa-hindi-guj-san)"
config.json CHANGED
@@ -1,6 +1,6 @@
  {
    "models": {
-     "RoBERTa Hindi": "flax-community/roberta-hindi",
+     "RoBERTa Hindi (ours)": "flax-community/roberta-hindi",
      "Indic Transformers Hindi": "neuralspace-reverie/indic-transformers-hi-bert",
      "HindiBERTa": "mrm8488/HindiBERTa",
      "RoBERTa Hindi Guj San": "surajp/RoBERTa-hindi-guj-san"
mlm_custom/mlm_targeted_text.csv CHANGED
@@ -1,18 +1,16 @@
  user_id,text,output,multi
- dk-crazydiv,हम आपके <mask> यात्रा की कामना करते हैं,सुखद,
- dk-crazydiv,मुझे उनसे बात करना बहुत <mask> लगा,अच्छा,
- dk-crazydiv,"बार बार देखो, हज़ार बार देखो, ये देखने की <mask> है","[""चीज़"",""बात""]",TRUE
- dk-crazydiv,ट्रंप कल अहमदाबाद में प्रधानमंत्री मोदी से <mask> करने जा रहे हैं,"[""मुलाकात"",""मिल्ने""]",TRUE
- dk-crazydiv,बॉम्बे से बैंगलोर की <mask> 500 किलोमीटर है,दूरी,
- dk-crazydiv,कहने को साथ अपने ये <mask> चलती है,दुनिया,
- dk-crazydiv,"ये इश्क़ नहीं आसान बस इतना समझ लीजिये, एक आग का दरिया है और <mask> के जाना है",डूब,
- prateekagrawal,आपका दिन <mask> हो,"[""शुभ"",""अच्छा""]",TRUE
- prateekagrawal,हिंदी भारत में <mask> जाने वाली भाषाओं में से एक है,"[""बोली"",""सिखाई"",""आधिकारिक""]",TRUE
- prateekagrawal,शुभ <mask>,"[""प्रभात"",रात्रि"",""यात्रा"",""अवसर""]",TRUE
- prateekagrawal,इंसान को कभी बुरा नहीं <mask> चाहिए,"[""बोलना"",""देखना"",""सुनाना"",""करना""]",TRUE
- hassiahk,बात ये है कि आप इसे <mask> से ही जानते हैं,पहले,
- hassiahk,<mask> पूर्व में उगता है,सूरज,
- hassiahk,"जल्दी सोना और जल्दी उठना इंसान को स्वस्थ ,समृद्ध और बुद्धिमान <mask> है",बनाता,
- hassiahk,"रोज एक सेब खाओ, <mask> से दूर रहो",डॉक्टर,
- hassiahk,किसी पुस्तक को उसके <mask> से मत आंकिए,आवरण,
- hassiahk,सभी <mask> चीजों का एक अंत होता है,"[""अच्छी"", ""बुरी""]",TRUE
+ z,"ताजमहल की <mask> को देखने के लिए, देश दुनिया से लोग दूर दूर से आते है",सुन्दरता,
+ z,गौर से देखिए इस मासूम <mask> को,बच्ची,
+ z,"बात तय थी, लेकिन ऐन मौके पर उसके मुकर जाने से सारा काम <mask> में पड़ गया",खटाई,
+ z,"कारखाने के तैयार माल को पैक करने की अनेक <mask> मशीनें आज मिलती हैं",z,
+ z,"प्रत्येक टीम में अधिकतम <mask> खिलाड़ी होते हैं",z,
+ z,"बार बार देखो, हज़ार बार देखो, ये देखने की <mask> है","[""चीज़"",""बात""]",TRUE
+ z,ट्रंप कल अहमदाबाद में प्रधानमंत्री मोदी से <mask> करने जा रहे हैं,"[""मुलाकात"",""मिल्ने""]",TRUE
+ z,कहने को साथ अपने ये <mask> चलती है,दुनिया,
+ z,"ये इश्क़ नहीं आसान बस इतना समझ लीजिये, एक आग का दरिया है और <mask> के जाना है",डूब,
+ z,आपका दिन <mask> हो,"[""शुभ"",""अच्छा""]",TRUE
+ z,शुभ <mask>,"[""प्रभात"",""रात्रि"",""यात्रा"",""अवसर""]",TRUE
+ z,बात ये है कि आप इसे <mask> से ही जानते हैं,पहले,
+ z,<mask> पूर्व में उगता है,सूरज,
+ z,"जल्दी सोना और जल्दी उठना इंसान को स्वस्थ ,समृद्ध और बुद्धिमान <mask> है",बनाता,
+ z,सभी <mask> चीजों का एक अंत होता है,"[""अच्छी"", ""बुरी""]",TRUE