hassiahk committed on
Commit
8aa37f3
1 Parent(s): 66112b8

Added downstream tasks results

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -15,10 +15,11 @@ Pretrained model on Marathi language using a masked language modeling (MLM) obje
 [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
 
 ## Model description
-RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
 
+RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
 
 ### How to use
+
 You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
@@ -48,6 +49,7 @@ You can use this model directly with a pipeline for masked language modeling:
 ```
 
 ## Training data
+
 The RoBERTa model was pretrained on the combination of the following datasets:
 - [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
 - [mC4](https://huggingface.co/datasets/mc4) is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
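
The README's fill-mask example is mostly elided by the hunks above (only the `pipeline` import is visible as context). As a reference point, here is a minimal sketch of the usage the README describes — the checkpoint ID `flax-community/roberta-hindi` and the sample sentence are assumptions, not content taken from this commit:

```python
from transformers import pipeline

# Checkpoint ID is an assumption for illustration; substitute the model's actual repo name.
unmasker = pipeline("fill-mask", model="flax-community/roberta-hindi")

# The pipeline scores candidate fills for the <mask> token; the sentence is illustrative.
print(unmasker("मुझे किताबें पढ़ना बहुत <mask> लगता है।"))
```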
@@ -58,8 +60,9 @@ The RoBERTa model was pretrained on the combination of the following datasets:
 - [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines, collected from Hindi news websites.
 - [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of HC Corpora newspapers.
 
-## Training procedure 👨🏻‍💻
+## Training procedure
 ### Preprocessing
+
 The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
 the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
 with `<s>` and the end of one by `</s>`.
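
The document markers from the preprocessing paragraph above can be inspected directly on the tokenizer — a minimal sketch, again assuming the `flax-community/roberta-hindi` checkpoint ID:

```python
from transformers import AutoTokenizer

# Checkpoint ID is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

# Encoding wraps the text in the <s> ... </s> document markers described above.
ids = tokenizer("नमस्ते दुनिया")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # ['<s>', ..., '</s>']
print(tokenizer.vocab_size)                  # 50265 for this BPE vocabulary
```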
@@ -82,10 +85,10 @@ RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below
 
 | Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
 |-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
-| BBC News Classification | Genre Classification |           |            |                               |                       |               |
-| WikiNER                 | Token Classification |           |            |                               |                       |               |
-| IITP Product Reviews    | Sentiment Analysis   |           |            |                               |                       |               |
-| IITP Movie Reviews      | Sentiment Analysis   |           |            |                               |                       |               |
+| BBC News Classification | Genre Classification | **76.44** | 66.86      | **77.6**                      | 64.9                  | 73.67         |
+| WikiNER                 | Token Classification | -         | 90.68      | **95.09**                     | 89.61                 | **92.76**     |
+| IITP Product Reviews    | Sentiment Analysis   | **78.01** | 73.23      | **78.39**                     | 66.16                 | 75.53         |
+| IITP Movie Reviews      | Sentiment Analysis   | 60.97     | 52.26      | **70.65**                     | 49.35                 | **61.29**     |
 
 ## Team Members
 - Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))
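
The scores added above come from fine-tuning the pretrained checkpoints on each downstream task. A hedged sketch of what such a sequence-classification fine-tune looks like with `transformers` — the checkpoint ID, data file, label count, and hyperparameters are illustrative assumptions, not the settings behind the table:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# All identifiers below are illustrative assumptions.
checkpoint = "flax-community/roberta-hindi"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Hypothetical CSV with "text" and "label" columns for a sentiment task.
dataset = load_dataset("csv", data_files={"train": "iitp_movie_train.csv"})

def tokenize(batch):
    # Truncate to the 512-token window the model was pretrained with.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-hindi-iitp", num_train_epochs=3),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```

The same recipe, with a token-classification head, would cover the WikiNER row.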
 