hassiahk committed on
Commit
8aa37f3
1 Parent(s): 66112b8

Added downstream tasks results

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -15,10 +15,11 @@ Pretrained model on Marathi language using a masked language modeling (MLM) obje
 [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
 
 ## Model description
-RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
 
+RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
 
 ### How to use
+
 You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
@@ -48,6 +49,7 @@ You can use this model directly with a pipeline for masked language modeling:
 ```
 
 ## Training data
+
 The RoBERTa model was pretrained on the combination of the following datasets:
 - [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
 - [mC4](https://huggingface.co/datasets/mc4) is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
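
The README's fill-mask example is mostly elided by the hunks above (only the `pipeline` import is visible as context). As a reference point, here is a minimal sketch of the usage the README describes — the checkpoint ID `flax-community/roberta-hindi` and the sample sentence are assumptions, not content taken from this commit:

```python
from transformers import pipeline

# Checkpoint ID is an assumption for illustration; substitute the model's actual repo name.
unmasker = pipeline("fill-mask", model="flax-community/roberta-hindi")

# The pipeline scores candidate fills for the <mask> token; the sentence is illustrative.
print(unmasker("मुझे किताबें पढ़ना बहुत <mask> लगता है।"))
```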
@@ -58,8 +60,9 @@ The RoBERTa model was pretrained on the combination of the following datasets:
 - [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines, collected from Hindi news websites.
 - [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of HC Corpora newspapers.
 
-## Training procedure 👨🏻‍💻
+## Training procedure
 ### Preprocessing
+
 The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
 the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
 with `<s>` and the end of one by `</s>`.
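
The document markers from the preprocessing paragraph above can be inspected directly on the tokenizer — a minimal sketch, again assuming the `flax-community/roberta-hindi` checkpoint ID:

```python
from transformers import AutoTokenizer

# Checkpoint ID is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

# Encoding wraps the text in the <s> ... </s> document markers described above.
ids = tokenizer("नमस्ते दुनिया")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # ['<s>', ..., '</s>']
print(tokenizer.vocab_size)                  # 50265 for this BPE vocabulary
```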
@@ -82,10 +85,10 @@ RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below
 
 | Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
 |-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
-| BBC News Classification | Genre Classification |           |            |                               |                       |               |
-| WikiNER                 | Token Classification |           |            |                               |                       |               |
-| IITP Product Reviews    | Sentiment Analysis   |           |            |                               |                       |               |
-| IITP Movie Reviews      | Sentiment Analysis   |           |            |                               |                       |               |
+| BBC News Classification | Genre Classification | **76.44** | 66.86      | **77.6**                      | 64.9                  | 73.67         |
+| WikiNER                 | Token Classification | -         | 90.68      | **95.09**                     | 89.61                 | **92.76**     |
+| IITP Product Reviews    | Sentiment Analysis   | **78.01** | 73.23      | **78.39**                     | 66.16                 | 75.53         |
+| IITP Movie Reviews      | Sentiment Analysis   | 60.97     | 52.26      | **70.65**                     | 49.35                 | **61.29**     |
 
 ## Team Members
 - Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
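
The scores added above come from fine-tuning the pretrained checkpoints on each downstream task. A hedged sketch of what such a sequence-classification fine-tune looks like with `transformers` — the checkpoint ID, data file, label count, and hyperparameters are illustrative assumptions, not the settings behind the table:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# All identifiers below are illustrative assumptions.
checkpoint = "flax-community/roberta-hindi"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Hypothetical CSV with "text" and "label" columns for a sentiment task.
dataset = load_dataset("csv", data_files={"train": "iitp_movie_train.csv"})

def tokenize(batch):
    # Truncate to the 512-token window the model was pretrained with.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-hindi-iitp", num_train_epochs=3),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```

The same recipe, with a token-classification head, would cover the WikiNER row.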
 