dk-crazydiv committed on
Commit 8ff622e
1 Parent(s): 93c32f7

Update README.md

Files changed (1)
  1. README.md +8 -12
README.md CHANGED
@@ -3,20 +3,19 @@ widget:
  - text: "मुझे उनसे बात करना <mask> अच्छा लगा"
  - text: "हम आपके सुखद <mask> की कामना करते हैं"
  - text: "सभी अच्छी चीजों का एक <mask> होता है"
+ inference: false
  ---

  # RoBERTa base model for Hindi language

- Pretrained model on Hindi language using a masked language modeling (MLM) objective. RoBERTa was introduced in
- [this paper](https://arxiv.org/abs/1907.11692) and first released in
- [this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
+ Pretrained model on the Hindi language using a masked language modeling (MLM) objective.

  > This is part of the
  [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

  ## Model description

- RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
+ RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data (a combination of the **mC4, OSCAR, and indic-nlp** datasets).

  ### How to use
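The usage snippet itself sits outside the changed hunks, so it is not shown in this diff. For reference, a minimal fill-mask sketch using one of the widget prompts from the metadata above; the checkpoint id `flax-community/roberta-hindi` is an assumption, not something stated in this commit.

```python
# Minimal fill-mask sketch; the checkpoint id is an assumption, replace it with the actual model id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="flax-community/roberta-hindi")

# One of the widget prompts defined in the card metadata above.
for pred in fill_mask("हम आपके सुखद <mask> की कामना करते हैं"):
    print(pred["token_str"], round(pred["score"], 3))
```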
 
@@ -55,7 +54,6 @@ The RoBERTa Hindi model was pretrained on the reunion of the following datasets:
  - [mC4](https://huggingface.co/datasets/mc4) is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
  - [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
  - [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpora collection for Indic languages.
- - [Hindi Wikipedia Articles - 172k](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k) is a dataset of 172k cleaned Hindi Wikipedia articles.
  - [Hindi Text Short and Large Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-and-large-summarization-corpus) is a collection of ~180k articles with their headlines and summaries collected from Hindi news websites.
  - [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines collected from Hindi news websites.
  - [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of HC Corpora newspapers.
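For the web-scale sources above, the Hindi subsets can be pulled through the `datasets` library. A small sketch; the config names `hi` for mC4 and `unshuffled_deduplicated_hi` for OSCAR are assumptions based on the public dataset cards, not on this commit.

```python
# Sketch of loading the Hindi portions of the web corpora; config names are assumptions.
from datasets import load_dataset

# Stream the Hindi slice of mC4 to avoid downloading the full corpus.
mc4_hi = load_dataset("mc4", "hi", split="train", streaming=True)

# Hindi portion of OSCAR (deduplicated).
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train", streaming=True)

print(next(iter(mc4_hi))["text"][:200])
```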
@@ -65,7 +63,7 @@ The RoBERTa Hindi model was pretrained on the reunion of the following datasets:

  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,265. The inputs of
  the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
- with `<s>` and the end of one by `</s>`
+ with `<s>` and the end of one by `</s>`. We also did some preliminary cleanup of the **mC4** and **OSCAR** datasets by removing all non-Hindi (non-Devanagari) characters, and the model was then trained on a randomized shuffle of all the datasets combined.
  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `<mask>`.
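The preliminary non-Devanagari cleanup described in the added line above can be expressed as a simple character filter. This is only a sketch: the exact rules used for the run are not given here, so the kept character set (the Devanagari block U+0900-U+097F plus digits, whitespace, and basic punctuation) is an assumption.

```python
import re

# Assumption: "removing all non-Hindi (non-Devanagari) characters" is read here as
# keeping Devanagari (U+0900-U+097F), digits, whitespace, and basic punctuation.
NON_HINDI = re.compile(r"[^\u0900-\u097F0-9\s\.,!?\-]")

def clean_text(text: str) -> str:
    """Drop characters outside the allowed set and collapse the extra spaces."""
    text = NON_HINDI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("RoBERTa का प्रशिक्षण 2021 में हुआ!"))  # -> "का प्रशिक्षण 2021 में हुआ!"
```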
@@ -74,10 +72,7 @@ The details of the masking procedure for each sentence are the following:
  Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).

  ### Pretraining
- The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) **8 v3 TPU cores** for 42K steps with a batch size of 128 and a sequence length of 128. The
- optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
- \\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
- rate after.
+ The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1000 GB of disk, 96 CPU cores), i.e. **8 v3 TPU cores**, for 42K steps with a batch size of 128 and a sequence length of 128.
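The removed lines spell out the optimizer: Adam with a peak learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\), \\(\epsilon = 1e-6\\), weight decay of 0.01, 24,000 warmup steps, and linear decay afterwards. A sketch of that schedule in `optax` terms; mapping "Adam with weight decay" onto `optax.adamw` mirrors the usual Flax MLM recipe and is an assumption, not the exact training script.

```python
# Sketch of the warmup + linear-decay schedule described in the removed lines above.
# Values come from the card; the use of optax.adamw is an assumption.
import optax

total_steps, warmup_steps, peak_lr = 42_000, 24_000, 6e-4

warmup = optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                               transition_steps=warmup_steps)
decay = optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                              transition_steps=total_steps - warmup_steps)
schedule = optax.join_schedules([warmup, decay], boundaries=[warmup_steps])

optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.98,
                        eps=1e-6, weight_decay=0.01)
```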

  ## Evaluation Results

@@ -91,11 +86,12 @@ RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below
  | IITP Movie Reviews | Sentiment Analysis | 60.97 | 52.26 | **70.65** | 49.35 | **61.29** |

  ## Team Members
- - Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
  - Aman K ([amankhandelia](https://huggingface.co/amankhandelia))
  - Haswanth Aekula ([hassiahk](https://huggingface.co/hassiahk))
+ - Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
- - Rahul Dev ([mlkorra](https://huggingface.co/mlkorra))
  - Prateek Agrawal ([prateekagrawal](https://huggingface.co/prateekagrawal))
+ - Rahul Dev ([mlkorra](https://huggingface.co/mlkorra))
+

  ## Credits
  Huge thanks to Hugging Face 🤗 & the Google JAX/Flax team for such a wonderful community week, especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.
 