---
pretty_name: mC4
annotations_creators:
- no-annotation
language_creators:
- found
languages:
- af
- am
- ar
- az
- be
- bg
- bg-Latn
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- el-Latn
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- hi
- hi-Latn
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- ja-Latn
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- "no"
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- ru-Latn
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zh-Latn
- zu
licenses:
- odc-by-1.0
multilinguality:
- multilingual
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- sequence-modeling
task_ids:
- language-modeling
paperswithcode_id: mc4
---

# Dataset Card for mC4

## Table of Contents

- [Dataset Card for mC4](#dataset-card-for-mc4)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/allenai/c4
- **Paper:** https://arxiv.org/abs/1910.10683

### Dataset Summary

mC4 is a colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.

This is the version prepared by AllenAI, hosted at https://huggingface.co/datasets/allenai/c4.

108 languages are available; they are listed in the table below.

Note that language codes ending in "-Latn" are romanized variants, i.e. text written in the Latin script.

| language code   | language name                   |
|:----------------|:--------------------------------|
| af              | Afrikaans                       |
| am              | Amharic                         |
| ar              | Arabic                          |
| az              | Azerbaijani                     |
| be              | Belarusian                      |
| bg              | Bulgarian                       |
| bg-Latn         | Bulgarian (Latin)               |
| bn              | Bangla                          |
| ca              | Catalan                         |
| ceb             | Cebuano                         |
| co              | Corsican                        |
| cs              | Czech                           |
| cy              | Welsh                           |
| da              | Danish                          |
| de              | German                          |
| el              | Greek                           |
| el-Latn         | Greek (Latin)                   |
| en              | English                         |
| eo              | Esperanto                       |
| es              | Spanish                         |
| et              | Estonian                        |
| eu              | Basque                          |
| fa              | Persian                         |
| fi              | Finnish                         |
| fil             | Filipino                        |
| fr              | French                          |
| fy              | Western Frisian                 |
| ga              | Irish                           |
| gd              | Scottish Gaelic                 |
| gl              | Galician                        |
| gu              | Gujarati                        |
| ha              | Hausa                           |
| haw             | Hawaiian                        |
| hi              | Hindi                           |
| hi-Latn         | Hindi (Latin)                   |
| hmn             | Hmong, Mong                     |
| ht              | Haitian                         |
| hu              | Hungarian                       |
| hy              | Armenian                        |
| id              | Indonesian                      |
| ig              | Igbo                            |
| is              | Icelandic                       |
| it              | Italian                         |
| iw              | Hebrew (former code for `he`)   |
| ja              | Japanese                        |
| ja-Latn         | Japanese (Latin)                |
| jv              | Javanese                        |
| ka              | Georgian                        |
| kk              | Kazakh                          |
| km              | Khmer                           |
| kn              | Kannada                         |
| ko              | Korean                          |
| ku              | Kurdish                         |
| ky              | Kyrgyz                          |
| la              | Latin                           |
| lb              | Luxembourgish                   |
| lo              | Lao                             |
| lt              | Lithuanian                      |
| lv              | Latvian                         |
| mg              | Malagasy                        |
| mi              | Maori                           |
| mk              | Macedonian                      |
| ml              | Malayalam                       |
| mn              | Mongolian                       |
| mr              | Marathi                         |
| ms              | Malay                           |
| mt              | Maltese                         |
| my              | Burmese                         |
| ne              | Nepali                          |
| nl              | Dutch                           |
| no              | Norwegian                       |
| ny              | Nyanja                          |
| pa              | Punjabi                         |
| pl              | Polish                          |
| ps              | Pashto                          |
| pt              | Portuguese                      |
| ro              | Romanian                        |
| ru              | Russian                         |
| ru-Latn         | Russian (Latin)                 |
| sd              | Sindhi                          |
| si              | Sinhala                         |
| sk              | Slovak                          |
| sl              | Slovenian                       |
| sm              | Samoan                          |
| sn              | Shona                           |
| so              | Somali                          |
| sq              | Albanian                        |
| sr              | Serbian                         |
| st              | Southern Sotho                  |
| su              | Sundanese                       |
| sv              | Swedish                         |
| sw              | Swahili                         |
| ta              | Tamil                           |
| te              | Telugu                          |
| tg              | Tajik                           |
| th              | Thai                            |
| tr              | Turkish                         |
| uk              | Ukrainian                       |
| und             | Unknown language                |
| ur              | Urdu                            |
| uz              | Uzbek                           |
| vi              | Vietnamese                      |
| xh              | Xhosa                           |
| yi              | Yiddish                         |
| yo              | Yoruba                          |
| zh              | Chinese                         |
| zh-Latn         | Chinese (Latin)                 |
| zu              | Zulu                            |

You can load the mC4 subset of any language like this:

```python
from datasets import load_dataset

# Load the English subset by passing its language code as the config name.
en_mc4 = load_dataset("mc4", "en")
```

You can also specify a list of languages:

```python
from datasets import load_dataset

# Load several language subsets at once.
mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
```
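
Some subsets are large, so it can help to look at a few records before committing to a full download. The sketch below is one possible approach, not part of the dataset's documented usage: it assumes the installed `datasets` version supports the `streaming=True` argument of `load_dataset` and still resolves the `mc4` identifier used above. The helper `first_records` and its `loader` parameter are hypothetical names introduced here for illustration.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional


def first_records(
    config: str,
    n: int,
    loader: Optional[Callable[..., Iterable[Dict[str, Any]]]] = None,
) -> List[Dict[str, Any]]:
    """Return the first n training records of one mC4 config (hypothetical helper)."""
    if loader is None:
        # Imported lazily so the helper can also be exercised with a stub loader.
        from datasets import load_dataset

        loader = load_dataset
    # Assumption: streaming=True yields records lazily instead of downloading everything.
    stream = loader("mc4", config, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `first_records("af", 3)` would then return at most three records from the Afrikaans subset, provided the assumptions above hold in your environment.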

### Supported Tasks and Leaderboards

mC4 is mainly intended to pretrain language models and word representations.

### Languages

The dataset supports 108 languages.

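
Because romanized variants are marked only by the `-Latn` suffix on the language code, it can be handy to separate the language from the script when selecting configs. The following is a minimal sketch; `ConfigName`, `parse_config`, and `is_romanized` are hypothetical helpers, not part of any library.

```python
from typing import NamedTuple, Optional


class ConfigName(NamedTuple):
    language: str
    script: Optional[str]  # "Latn" for romanized variants, otherwise None


def parse_config(name: str) -> ConfigName:
    """Split a config name such as 'ru-Latn' into its language and script parts."""
    language, sep, script = name.partition("-")
    return ConfigName(language, script if sep else None)


def is_romanized(name: str) -> bool:
    """True for configs that hold text written in the Latin script."""
    return parse_config(name).script == "Latn"


# A few codes taken from the language list of this card.
configs = ["bg", "bg-Latn", "en", "zh", "zh-Latn", "und"]
romanized = [c for c in configs if is_romanized(c)]
print(romanized)  # ['bg-Latn', 'zh-Latn']
```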

## Dataset Structure

### Data Instances

An example from the `en` config is:

```
{'timestamp': '2018-06-24T01:32:39Z',
 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
```
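
A record like the one above carries three string values (`url`, `text`, `timestamp`). The sketch below shows one way to check a record and turn those strings into richer values. It assumes every timestamp follows the `YYYY-MM-DDTHH:MM:SSZ` pattern seen in this single example and that the trailing `Z` means UTC; `summarize` is a hypothetical helper, and the sample text is shortened from the record above.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

FIELDS = ("url", "text", "timestamp")


def summarize(record: dict) -> dict:
    """Validate one record and derive host, parsed timestamp, and line count."""
    missing = [key for key in FIELDS if not isinstance(record.get(key), str)]
    if missing:
        raise ValueError(f"missing or non-string fields: {missing}")
    # Assumption: timestamps look like '2018-06-24T01:32:39Z' and are in UTC.
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "host": urlparse(record["url"]).netloc,
        "timestamp": parsed.replace(tzinfo=timezone.utc),
        "lines": len(record["text"].split("\n")),
    }


sample = {
    "timestamp": "2018-06-24T01:32:39Z",
    # Text shortened to the first two lines of the example record.
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}
summary = summarize(sample)
print(summary["host"], summary["timestamp"].year, summary["lines"])
```

Raising on a missing or non-string field keeps malformed rows from slipping silently into later processing; whether that is the right policy depends on the pipeline.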

### Data Fields

Each record has the following fields:

- `url`: URL of the source, as a string
- `text`: text content, as a string
- `timestamp`: timestamp, as a string

### Data Splits

To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages. The resulting mC4 subsets for each language are reported in this table:

| config   | train   | validation   |
|:---------|:--------|:-------------|
| af       | ?       | ?            |
| am       | ?       | ?            |
| ar       | ?       | ?            |
| az       | ?       | ?            |
| be       | ?       | ?            |
| bg       | ?       | ?            |
| bg-Latn  | ?       | ?            |
| bn       | ?       | ?            |
| ca       | ?       | ?            |
| ceb      | ?       | ?            |
| co       | ?       | ?            |
| cs       | ?       | ?            |
| cy       | ?       | ?            |
| da       | ?       | ?            |
| de       | ?       | ?            |
| el       | ?       | ?            |
| el-Latn  | ?       | ?            |
| en       | ?       | ?            |
| eo       | ?       | ?            |
| es       | ?       | ?            |
| et       | ?       | ?            |
| eu       | ?       | ?            |
| fa       | ?       | ?            |
| fi       | ?       | ?            |
| fil      | ?       | ?            |
| fr       | ?       | ?            |
| fy       | ?       | ?            |
| ga       | ?       | ?            |
| gd       | ?       | ?            |
| gl       | ?       | ?            |
| gu       | ?       | ?            |
| ha       | ?       | ?            |
| haw      | ?       | ?            |
| hi       | ?       | ?            |
| hi-Latn  | ?       | ?            |
| hmn      | ?       | ?            |
| ht       | ?       | ?            |
| hu       | ?       | ?            |
| hy       | ?       | ?            |
| id       | ?       | ?            |
| ig       | ?       | ?            |
| is       | ?       | ?            |
| it       | ?       | ?            |
| iw       | ?       | ?            |
| ja       | ?       | ?            |
| ja-Latn  | ?       | ?            |
| jv       | ?       | ?            |
| ka       | ?       | ?            |
| kk       | ?       | ?            |
| km       | ?       | ?            |
| kn       | ?       | ?            |
| ko       | ?       | ?            |
| ku       | ?       | ?            |
| ky       | ?       | ?            |
| la       | ?       | ?            |
| lb       | ?       | ?            |
| lo       | ?       | ?            |
| lt       | ?       | ?            |
| lv       | ?       | ?            |
| mg       | ?       | ?            |
| mi       | ?       | ?            |
| mk       | ?       | ?            |
| ml       | ?       | ?            |
| mn       | ?       | ?            |
| mr       | ?       | ?            |
| ms       | ?       | ?            |
| mt       | ?       | ?            |
| my       | ?       | ?            |
| ne       | ?       | ?            |
| nl       | ?       | ?            |
| no       | ?       | ?            |
| ny       | ?       | ?            |
| pa       | ?       | ?            |
| pl       | ?       | ?            |
| ps       | ?       | ?            |
| pt       | ?       | ?            |
| ro       | ?       | ?            |
| ru       | ?       | ?            |
| ru-Latn  | ?       | ?            |
| sd       | ?       | ?            |
| si       | ?       | ?            |
| sk       | ?       | ?            |
| sl       | ?       | ?            |
| sm       | ?       | ?            |
| sn       | ?       | ?            |
| so       | ?       | ?            |
| sq       | ?       | ?            |
| sr       | ?       | ?            |
| st       | ?       | ?            |
| su       | ?       | ?            |
| sv       | ?       | ?            |
| sw       | ?       | ?            |
| ta       | ?       | ?            |
| te       | ?       | ?            |
| tg       | ?       | ?            |
| th       | ?       | ?            |
| tr       | ?       | ?            |
| uk       | ?       | ?            |
| und      | ?       | ?            |
| ur       | ?       | ?            |
| uz       | ?       | ?            |
| vi       | ?       | ?            |
| xh       | ?       | ?            |
| yi       | ?       | ?            |
| yo       | ?       | ?            |
| zh       | ?       | ?            |
| zh-Latn  | ?       | ?            |
| zu       | ?       | ?            |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

AllenAI are releasing this dataset under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

### Citation Information

```
@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}
```

### Contributions

Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.