---
pretty_name: mC4
annotations_creators:
- no-annotation
language_creators:
- found
languages:
- af
- am
- ar
- az
- be
- bg
- bg-Latn
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- el-Latn
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- hi
- hi-Latn
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- ja-Latn
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- "no"
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- ru-Latn
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zh-Latn
- zu
licenses:
- odc-by-1.0
multilinguality:
- multilingual
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- sequence-modeling
task_ids:
- language-modeling
paperswithcode_id: mc4
---
# Dataset Card for mC4
## Table of Contents
- [Dataset Card for mC4](#dataset-card-for-mc4)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://huggingface.co/datasets/allenai/c4
- **Paper:** https://arxiv.org/abs/1910.10683
### Dataset Summary
A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.
This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4
108 languages are available and are reported in the table below.
Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.
| language code | language name |
|:----------------|:---------------------|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bg-Latn | Bulgarian (Latin) |
| bn | Bangla |
| ca | Catalan |
| ceb | Cebuano |
| co | Corsican |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| el-Latn | Greek (Latin) |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fil | Filipino |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| haw | Hawaiian |
| hi | Hindi |
| hi-Latn | Hindi (Latin) |
| hmn | Hmong, Mong |
| ht | Haitian |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| iw | Hebrew (former code, now `he`) |
| ja | Japanese |
| ja-Latn | Japanese (Latin) |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kyrgyz |
| la | Latin |
| lb | Luxembourgish |
| lo | Lao |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mi | Maori |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| ny | Nyanja |
| pa | Punjabi |
| pl | Polish |
| ps | Pashto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| ru-Latn | Russian (Latin) |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sm | Samoan |
| sn | Shona |
| so | Somali |
| sq | Albanian |
| sr | Serbian |
| st | Southern Sotho |
| su | Sundanese |
| sv | Swedish |
| sw | Swahili |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| und | Unknown language |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zh-Latn | Chinese (Latin) |
| zu | Zulu |
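As a minimal sketch of the `-Latn` naming convention noted above, the helper below splits a config name into its base language code and an optional script suffix. The function name `split_config` is hypothetical and is not part of the `datasets` library.

```python
from typing import Optional, Tuple


def split_config(config: str) -> Tuple[str, Optional[str]]:
    """Split an mC4 config name such as "bg-Latn" into (language, script).

    Hypothetical helper for illustration only. The script is None when the
    config name carries no suffix.
    """
    language, sep, script = config.partition("-")
    return language, (script if sep else None)


# Keep only the romanized configs from a small sample of config names.
sample_configs = ["bg", "bg-Latn", "en", "zh-Latn"]
romanized = [c for c in sample_configs if split_config(c)[1] == "Latn"]
print(romanized)
```

A helper like this can be handy for pairing each romanized config with its native-script counterpart.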
You can load the mC4 subset of any language like this:
```python
from datasets import load_dataset
en_mc4 = load_dataset("mc4", "en")
```
You can also specify a list of languages:
```python
from datasets import load_dataset
mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
```
### Supported Tasks and Leaderboards
mC4 is mainly intended to pretrain language models and word representations.
### Languages
The dataset supports 108 languages.
## Dataset Structure
### Data Instances
An example from the `en` config is:
```
{'timestamp': '2018-06-24T01:32:39Z',
'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
```
### Data Fields
Each record has three fields:
- `url`: URL of the source page, as a string
- `text`: text content of the page, as a string
- `timestamp`: crawl timestamp, as a string
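Since all three fields are plain strings, downstream code usually parses them before use. The sketch below uses only the Python standard library and assumes that every timestamp follows the `YYYY-MM-DDTHH:MM:SSZ` pattern seen in the example record. That format is inferred from the single example, not a documented guarantee. The record is a shortened copy of the `en` example.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Shortened copy of the `en` example record (text truncated for brevity).
record = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}

# Assumption: timestamps are UTC and end with a literal "Z".
crawl_time = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(
    tzinfo=timezone.utc
)

# Host name of the source page, useful for per-domain filtering or statistics.
domain = urlparse(record["url"]).netloc

# Documents are newline-separated lines of text; take the first one.
first_line = record["text"].split("\n", 1)[0]

print(crawl_time.isoformat(), domain, first_line)
```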
### Data Splits
To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages. The resulting mC4 subsets for each language are reported in this table:
| config | train | validation |
|:---------|:--------|:-------------|
| af | ? | ? |
| am | ? | ? |
| ar | ? | ? |
| az | ? | ? |
| be | ? | ? |
| bg | ? | ? |
| bg-Latn | ? | ? |
| bn | ? | ? |
| ca | ? | ? |
| ceb | ? | ? |
| co | ? | ? |
| cs | ? | ? |
| cy | ? | ? |
| da | ? | ? |
| de | ? | ? |
| el | ? | ? |
| el-Latn | ? | ? |
| en | ? | ? |
| eo | ? | ? |
| es | ? | ? |
| et | ? | ? |
| eu | ? | ? |
| fa | ? | ? |
| fi | ? | ? |
| fil | ? | ? |
| fr | ? | ? |
| fy | ? | ? |
| ga | ? | ? |
| gd | ? | ? |
| gl | ? | ? |
| gu | ? | ? |
| ha | ? | ? |
| haw | ? | ? |
| hi | ? | ? |
| hi-Latn | ? | ? |
| hmn | ? | ? |
| ht | ? | ? |
| hu | ? | ? |
| hy | ? | ? |
| id | ? | ? |
| ig | ? | ? |
| is | ? | ? |
| it | ? | ? |
| iw | ? | ? |
| ja | ? | ? |
| ja-Latn | ? | ? |
| jv | ? | ? |
| ka | ? | ? |
| kk | ? | ? |
| km | ? | ? |
| kn | ? | ? |
| ko | ? | ? |
| ku | ? | ? |
| ky | ? | ? |
| la | ? | ? |
| lb | ? | ? |
| lo | ? | ? |
| lt | ? | ? |
| lv | ? | ? |
| mg | ? | ? |
| mi | ? | ? |
| mk | ? | ? |
| ml | ? | ? |
| mn | ? | ? |
| mr | ? | ? |
| ms | ? | ? |
| mt | ? | ? |
| my | ? | ? |
| ne | ? | ? |
| nl | ? | ? |
| no | ? | ? |
| ny | ? | ? |
| pa | ? | ? |
| pl | ? | ? |
| ps | ? | ? |
| pt | ? | ? |
| ro | ? | ? |
| ru | ? | ? |
| ru-Latn | ? | ? |
| sd | ? | ? |
| si | ? | ? |
| sk | ? | ? |
| sl | ? | ? |
| sm | ? | ? |
| sn | ? | ? |
| so | ? | ? |
| sq | ? | ? |
| sr | ? | ? |
| st | ? | ? |
| su | ? | ? |
| sv | ? | ? |
| sw | ? | ? |
| ta | ? | ? |
| te | ? | ? |
| tg | ? | ? |
| th | ? | ? |
| tr | ? | ? |
| uk | ? | ? |
| und | ? | ? |
| ur | ? | ? |
| uz | ? | ? |
| vi | ? | ? |
| xh | ? | ? |
| yi | ? | ? |
| yo | ? | ? |
| zh | ? | ? |
| zh-Latn | ? | ? |
| zu | ? | ? |
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
AllenAI are releasing this dataset under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
### Citation Information
```
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
```
### Contributions
Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.