---
widget:
- text: "मुझे उनसे बात करना <mask> अच्छा लगा"
- text: "हम आपके सुखद <mask> की कामना करते हैं"
- text: "सभी अच्छी चीजों का एक <mask> होता है"
inference: false
---

# RoBERTa base model for Hindi language

Pretrained model on Hindi language using a masked language modeling (MLM) objective.

> This is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

## Model description

RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data (a combination of the **mC4, OSCAR, and indic-nlp** datasets).

### How to use
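
You can use this model directly with a pipeline for masked language modeling. A minimal sketch, assuming the checkpoint is published under the `flax-community/roberta-hindi` model id (the id is an assumption here):

```python
from transformers import pipeline

# Model id is an assumption; substitute the actual Hub id of this checkpoint.
fill_mask = pipeline("fill-mask", model="flax-community/roberta-hindi")

# Predict candidates for the masked token in a Hindi sentence
# ("I <mask> liked talking to them").
for prediction in fill_mask("मुझे उनसे बात करना <mask> अच्छा लगा"):
    print(prediction["token_str"], round(prediction["score"], 4))
```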

## Training data

The RoBERTa Hindi model was pretrained on a combination of the following datasets:
- [mC4](https://huggingface.co/datasets/mc4) is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
- [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
- [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpora collection for Indic languages.
- [Hindi Text Short and Large Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-and-large-summarization-corpus) is a collection of ~180k articles with their headlines and summaries, collected from Hindi news websites.
- [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines, collected from Hindi news websites.
- [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of the HC Corpora newspapers.

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with `<s>` and the end of one with `</s>`. We also did some preliminary cleanup of the **mC4** and **OSCAR** datasets by removing all non-Hindi (non-Devanagari) characters, and the model was then trained on a randomized shuffle of all the datasets combined.
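
The non-Devanagari filtering could look roughly like this; it is an illustrative sketch, not the exact preprocessing script, and the set of characters kept (Devanagari block, ASCII digits, basic punctuation) is an assumption:

```python
import re

# Drop everything outside the Devanagari block (U+0900-U+097F, which also
# covers the danda "।"), ASCII digits, whitespace, and basic punctuation.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F0-9\s.,!?()\-]")

def clean_text(text: str) -> str:
    """Remove non-Devanagari characters and collapse the leftover whitespace."""
    text = NON_DEVANAGARI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

# "Hello" and "test" are stripped; the Hindi words survive.
print(clean_text("Hello दुनिया! यह एक test है।"))  # -> दुनिया! यह एक है।
```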

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
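
Dynamic masking of this kind amounts to re-sampling the mask every time a batch is built, which is what `DataCollatorForLanguageModeling` in `transformers` does. A short sketch (the tokenizer id is an assumption, as above):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Any RoBERTa-style tokenizer works for the illustration; the id is assumed.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

# 15% of tokens are selected for masking; of those, 80% become <mask>,
# 10% become a random token, and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# "All good things have an end."
encoded = [tokenizer("सभी अच्छी चीजों का एक अंत होता है")]
# Two calls over the same example yield two different random masks.
print(collator(encoded)["input_ids"])
print(collator(encoded)["input_ids"])
```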

### Pretraining

The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1,000 GB of disk, 96 CPU cores), i.e. **8 TPU v3 cores**, for 42K steps with a batch size of 128 and a sequence length of 128.

## Evaluation Results

RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.

| IITP Movie Reviews | Sentiment Analysis | 60.97 | 52.26 | **70.65** | 49.35 | **61.29** |

## Team Members
- Aman K ([amankhandelia](https://huggingface.co/amankhandelia))
- Haswanth Aekula ([hassiahk](https://huggingface.co/hassiahk))
- Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
- Prateek Agrawal ([prateekagrawal](https://huggingface.co/prateekagrawal))
- Rahul Dev ([mlkorra](https://huggingface.co/mlkorra))

## Credits
Huge thanks to HuggingFace 🤗 and the Google Jax/Flax team for such a wonderful community week, especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) and [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring throughout the week.