## Training data
The RoBERTa model was pretrained on the union of the following datasets, which was then randomly shuffled:
- mC4 is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
- OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
- IndicGLUE is a natural language understanding benchmark for Indic languages.
- Samanantar is a collection of parallel corpora for Indic languages.
- Hindi Wikipedia Articles - 172k is a dataset of 172k cleaned Wikipedia articles.
- Hindi Text Short and Large Summarization Corpus is a collection of ~180k articles with their headlines and summaries, collected from Hindi news websites.
- Hindi Text Short Summarization Corpus is a collection of ~330k articles with their headlines, collected from Hindi news websites.
- Old Newspapers Hindi is a cleaned subset of the HC Corpora newspapers.
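
The "union, then random shuffle" step can be illustrated with a minimal sketch. This is not the project's actual preprocessing code: the function name `union_and_shuffle`, the seed, and the toy corpora are all assumptions made for illustration, and the real corpora are far too large to hold in plain Python lists.

```python
import random
from typing import Dict, List


def union_and_shuffle(corpora: Dict[str, List[str]], seed: int = 42) -> List[str]:
    """Concatenate every corpus into one list, then shuffle the result.

    Hypothetical helper: the name and signature are illustrative only.
    """
    combined: List[str] = []
    # Iterate in sorted key order so the output depends only on the seed.
    for name in sorted(corpora):
        combined.extend(corpora[name])
    # A dedicated Random instance keeps the shuffle reproducible.
    rng = random.Random(seed)
    rng.shuffle(combined)
    return combined


# Toy stand-ins for the real corpora (made-up strings, not actual samples).
toy_corpora = {
    "mc4": ["mc4 doc 1", "mc4 doc 2"],
    "oscar": ["oscar doc 1"],
    "wikipedia": ["wiki doc 1", "wiki doc 2", "wiki doc 3"],
}

mixed = union_and_shuffle(toy_corpora, seed=0)
print(len(mixed))  # number of documents after the union
```

Shuffling after the union interleaves documents from all sources, so a training batch is not dominated by a single corpus. With the Hugging Face `datasets` library, the same idea would typically be expressed with `concatenate_datasets` followed by `Dataset.shuffle(seed=...)`.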