hassiahk committed on
Commit
66112b8
1 Parent(s): d1dffcd

Update README.md

Files changed (1): README.md +96 -3
README.md CHANGED
---
widget:
- text: "मुझे उनसे बात करना <mask> अच्छा लगा"
- text: "हम आपके सुखद <mask> की कामना करते हैं"
- text: "सभी अच्छी चीजों का एक <mask> होता है"
---

# RoBERTa base model for Hindi language

Pretrained model on the Hindi language using a masked language modeling (MLM) objective. RoBERTa was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).

> This is part of the
[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

## Model description
RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.

### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
>>> unmasker("मुझे उनसे बात करना <mask> अच्छा लगा")

[{'score': 0.2096337080001831,
  'sequence': 'मुझे उनसे बात करना एकदम अच्छा लगा',
  'token': 1462,
  'token_str': ' एकदम'},
 {'score': 0.17915162444114685,
  'sequence': 'मुझे उनसे बात करना तब अच्छा लगा',
  'token': 594,
  'token_str': ' तब'},
 {'score': 0.15887945890426636,
  'sequence': 'मुझे उनसे बात करना और अच्छा लगा',
  'token': 324,
  'token_str': ' और'},
 {'score': 0.12024253606796265,
  'sequence': 'मुझे उनसे बात करना लगभग अच्छा लगा',
  'token': 743,
  'token_str': ' लगभग'},
 {'score': 0.07114479690790176,
  'sequence': 'मुझे उनसे बात करना कब अच्छा लगा',
  'token': 672,
  'token_str': ' कब'}]
```
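
You can also load the checkpoint and tokenizer directly to obtain hidden-state features. A minimal sketch, assuming the standard `AutoTokenizer`/`AutoModel` classes; nothing here is specific to this model beyond the checkpoint name:
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base encoder (without the MLM head)
tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')
model = AutoModel.from_pretrained('flax-community/roberta-hindi')

# Encode a Hindi sentence and run it through the encoder
text = "मुझे उनसे बात करना अच्छा लगा"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

# Contextual embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```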

## Training data
The RoBERTa model was pretrained on a combination of the following datasets (a loading sketch is shown after the list):
- [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
- [mC4](https://huggingface.co/datasets/mc4) is a colossal, cleaned multilingual version of Common Crawl's web crawl corpus.
- [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) is a natural language understanding benchmark.
- [Samanantar](https://indicnlp.ai4bharat.org/samanantar/) is a parallel corpus collection for Indic languages.
- [Hindi Wikipedia Articles - 172k](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k) is a dataset of ~172k cleaned Hindi Wikipedia articles.
- [Hindi Text Short and Large Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-and-large-summarization-corpus) is a collection of ~180k articles with their headlines and summaries collected from Hindi news websites.
- [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines collected from Hindi news websites.
- [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of the HC Corpora newspapers.

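For illustration, the publicly hosted Hindi portions of OSCAR and mC4 can be streamed with the `datasets` library. This is only a sketch of data loading, not the exact preprocessing pipeline used for this run, and the configuration names (`unshuffled_deduplicated_hi`, `hi`) are assumptions based on the respective dataset cards:
```python
from datasets import load_dataset

# Hindi split of OSCAR (deduplicated); streaming avoids downloading the full corpus
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train", streaming=True)

# Hindi split of mC4
mc4_hi = load_dataset("mc4", "hi", split="train", streaming=True)

# Peek at one document from each corpus
print(next(iter(oscar_hi))["text"][:200])
print(next(iter(mc4_hi))["text"][:200])
```
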
## Training procedure 👨🏻‍💻
### Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
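
The tokenizer shipped with the checkpoint can be inspected directly; a minimal sketch (the printed values are illustrative):
```python
from transformers import AutoTokenizer

# Byte-level BPE tokenizer trained for this model
tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')

print(tokenizer.vocab_size)                      # 50265 per the description above
print(tokenizer.bos_token, tokenizer.eos_token)  # '<s>' and '</s>' mark document boundaries

# A Hindi sentence split into BPE pieces
print(tokenizer.tokenize("मुझे उनसे बात करना अच्छा लगा"))
```
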
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
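
This 15% / 80-10-10 scheme matches what `DataCollatorForLanguageModeling` applies by default, and because masking happens each time a batch is built, it is dynamic. A minimal sketch (the checkpoint name is the only model-specific assumption):
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')

# Masks 15% of tokens on the fly for every batch (dynamic masking);
# of those, 80% become <mask>, 10% become a random token, 10% stay unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("मुझे उनसे बात करना अच्छा लगा")
batch = collator([encoded])
print(batch["input_ids"])  # some positions replaced by the <mask> token id
print(batch["labels"])     # -100 everywhere except the masked positions
```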

### Pretraining
The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1,000 GB of hard drive, 96 CPU cores), i.e. **8 v3 TPU cores**, for 42K steps with a batch size of 128 and a sequence length of 128. The
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
rate after.
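
Since the run used the Flax/JAX stack, the warmup-then-linear-decay schedule and the Adam configuration above can be written roughly as follows with `optax`. This is a reconstruction from the stated hyperparameters, not the actual training script:
```python
import optax

# Linear warmup to 6e-4 over the first 24k steps,
# then linear decay to 0 over the remaining 18k of the 42k total steps.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=6e-4, transition_steps=24_000),
        optax.linear_schedule(init_value=6e-4, end_value=0.0, transition_steps=18_000),
    ],
    boundaries=[24_000],
)

# Adam with decoupled weight decay, matching the betas / epsilon quoted above.
optimizer = optax.adamw(
    learning_rate=schedule, b1=0.9, b2=0.98, eps=1e-6, weight_decay=0.01
)
```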

## Evaluation Results

RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.

| Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
|-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
| BBC News Classification | Genre Classification |           |            |                               |                       |               |
| WikiNER                 | Token Classification |           |            |                               |                       |               |
| IITP Product Reviews    | Sentiment Analysis   |           |            |                               |                       |               |
| IITP Movie Reviews      | Sentiment Analysis   |           |            |                               |                       |               |

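Each downstream task above is typically evaluated by fine-tuning the checkpoint with a task-specific head. A minimal sequence-classification sketch (the label count and example text are placeholders, not the actual evaluation setup):
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from the pretrained encoder and add a fresh classification head,
# e.g. for a sentiment task like the IITP reviews datasets.
tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-hindi')
model = AutoModelForSequenceClassification.from_pretrained(
    'flax-community/roberta-hindi', num_labels=3
)

inputs = tokenizer("यह फ़िल्म बहुत अच्छी थी", return_tensors='pt')
print(model(**inputs).logits)  # meaningless until the head is fine-tuned
```
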
## Team Members
- Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
- Aman K ([amankhandelia](https://huggingface.co/amankhandelia))
- Haswanth Aekula ([hassiahk](https://huggingface.co/hassiahk))
- Rahul Dev ([mlkorra](https://huggingface.co/mlkorra))
- Prateek Agrawal ([prateekagrawal](https://huggingface.co/prateekagrawal))

## Credits
Huge thanks to HuggingFace 🤗 and the Google Jax/Flax team for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.

<img src="https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium">