
Training data

The RoBERTa model was pretrained on the union of the following datasets:

OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
mC4 is the multilingual variant of C4, a colossal, cleaned version of Common Crawl's web crawl corpus.
IndicGLUE is a natural language understanding benchmark for Indic languages.
Samanantar is a collection of parallel corpora for Indic languages.
Hindi Wikipedia Articles - 172k is a dataset of 172k cleaned Hindi Wikipedia articles.
Hindi Text Short and Large Summarization Corpus is a collection of ~180k articles, with their headlines and summaries, collected from Hindi news websites.
Hindi Text Short Summarization Corpus is a collection of ~330k articles, with their headlines, collected from Hindi news websites.
Old Newspapers Hindi is a cleaned subset of HC Corpora newspapers.
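
The card does not say how these datasets were merged. As a rough illustration only, the sketch below shows one way a union of several text sources could be built in plain Python. The function name `union_corpora`, the tiny in-memory sample sentences, and the exact-match deduplication step are all assumptions made for this example; none of them come from the original training pipeline.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator


def union_corpora(sources: Dict[str, Iterable[str]], dedupe: bool = True) -> Iterator[str]:
    """Yield text from every source in order, optionally skipping exact repeats.

    Hypothetical helper: deduplication is an assumption, not a documented
    step of the roberta-hindi pretraining setup.
    """
    seen = set()
    for text in chain.from_iterable(sources.values()):
        text = text.strip()
        if not text:
            continue  # skip empty records
        if dedupe:
            if text in seen:
                continue  # skip a record already emitted by an earlier source
            seen.add(text)
        yield text


# Hypothetical stand-ins for the real datasets (a few sentences each).
sources = {
    "oscar": ["यह एक वाक्य है।", "दूसरा वाक्य।"],
    "mc4": ["दूसरा वाक्य।", "तीसरा वाक्य।"],
    "wikipedia": ["", "चौथा वाक्य।"],
}

corpus = list(union_corpora(sources))
print(len(corpus))  # 4: one duplicate and one empty record are dropped
```

Passing `dedupe=False` keeps repeated records, which gives a plain concatenation of the sources.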