Training data
The RoBERTa model was pretrained on the union of the following datasets:
- OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
- mC4 is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
- IndicGLUE is a natural language understanding benchmark.
- Samanantar is a collection of parallel corpora for Indic languages.
- Hindi Wikipedia Articles - 172k is a dataset of 172k cleaned Hindi Wikipedia articles.
- Hindi Text Short and Large Summarization Corpus is a collection of ~180k articles with their headlines and summaries, collected from Hindi news websites.
- Hindi Text Short Summarization Corpus is a collection of ~330k articles with their headlines, collected from Hindi news websites.
- Old Newspapers Hindi is a cleaned subset of HC Corpora newspapers.
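
As a rough illustration of what pretraining on the union of several corpora can look like, the sketch below chains a few in-memory text sources into a single list of training lines. The function name `union_corpora`, the corpus keys, and the sample sentences are hypothetical placeholders. The exact-duplicate filter is an assumption made for illustration; the list above does not say how the datasets were actually merged or whether they were deduplicated.

```python
from typing import Dict, Iterable, Iterator, List


def union_corpora(corpora: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Yield each non-empty line from every corpus, skipping exact repeats.

    Assumption: exact-match deduplication across corpora. The real
    preprocessing pipeline is not described in the source.
    """
    seen = set()
    for lines in corpora.values():
        for line in lines:
            text = line.strip()
            # Drop blank lines and lines already emitted by an earlier corpus.
            if not text or text in seen:
                continue
            seen.add(text)
            yield text


# Hypothetical placeholder samples, not taken from the real datasets.
toy_corpora = {
    "oscar": ["यह एक वाक्य है।", "दूसरा वाक्य।"],
    "mc4": ["दूसरा वाक्य।", "  ", "तीसरा वाक्य।"],
    "wikipedia": ["चौथा वाक्य।"],
}

merged: List[str] = list(union_corpora(toy_corpora))
print(len(merged))
```

With real data, each value in the dictionary would be an iterator over a dataset's text field rather than a short list, so the merged stream can be consumed lazily.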