pretty_name: mC4
annotations_creators:
- no-annotation
language_creators:
- found
languages:
- af
- am
- ar
- az
- be
- bg
- bg-Latn
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- el-Latn
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- hi
- hi-Latn
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- ja-Latn
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- ru-Latn
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zh-Latn
- zu
licenses:
- odc-by-1.0
multilinguality:
- multilingual
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- sequence-modeling
task_ids:
- language-modeling
paperswithcode_id: mc4
Dataset Card for mC4
Dataset Description
Dataset Summary
A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.
This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4
108 languages are available and are reported in the table below.
Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.
language code | language name |
---|---|
af | Afrikaans |
am | Amharic |
ar | Arabic |
az | Azerbaijani |
be | Belarusian |
bg | Bulgarian |
bg-Latn | Bulgarian (Latin) |
bn | Bangla |
ca | Catalan |
ceb | Cebuano |
co | Corsican |
cs | Czech |
cy | Welsh |
da | Danish |
de | German |
el | Greek |
el-Latn | Greek (Latin) |
en | English |
eo | Esperanto |
es | Spanish |
et | Estonian |
eu | Basque |
fa | Persian |
fi | Finnish |
fil | Filipino |
fr | French |
fy | Western Frisian |
ga | Irish |
gd | Scottish Gaelic |
gl | Galician |
gu | Gujarati |
ha | Hausa |
haw | Hawaiian |
hi | Hindi |
hi-Latn | Hindi (Latin) |
hmn | Hmong, Mong |
ht | Haitian |
hu | Hungarian |
hy | Armenian |
id | Indonesian |
ig | Igbo |
is | Icelandic |
it | Italian |
iw | Hebrew (former ISO 639-1 code, now he) |
ja | Japanese |
ja-Latn | Japanese (Latin) |
jv | Javanese |
ka | Georgian |
kk | Kazakh |
km | Khmer |
kn | Kannada |
ko | Korean |
ku | Kurdish |
ky | Kyrgyz |
la | Latin |
lb | Luxembourgish |
lo | Lao |
lt | Lithuanian |
lv | Latvian |
mg | Malagasy |
mi | Maori |
mk | Macedonian |
ml | Malayalam |
mn | Mongolian |
mr | Marathi |
ms | Malay |
mt | Maltese |
my | Burmese |
ne | Nepali |
nl | Dutch |
no | Norwegian |
ny | Nyanja |
pa | Punjabi |
pl | Polish |
ps | Pashto |
pt | Portuguese |
ro | Romanian |
ru | Russian |
ru-Latn | Russian (Latin) |
sd | Sindhi |
si | Sinhala |
sk | Slovak |
sl | Slovenian |
sm | Samoan |
sn | Shona |
so | Somali |
sq | Albanian |
sr | Serbian |
st | Southern Sotho |
su | Sundanese |
sv | Swedish |
sw | Swahili |
ta | Tamil |
te | Telugu |
tg | Tajik |
th | Thai |
tr | Turkish |
uk | Ukrainian |
und | Unknown language |
ur | Urdu |
uz | Uzbek |
vi | Vietnamese |
xh | Xhosa |
yi | Yiddish |
yo | Yoruba |
zh | Chinese |
zh-Latn | Chinese (Latin) |
zu | Zulu |
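As a minimal sketch of the "-Latn" convention noted above, the romanized configs can be separated from the base language codes with plain string handling. The helper name `split_config` and the short `configs` list are illustrative assumptions, not part of the dataset or of any library.

```python
def split_config(name):
    """Split a config name into (language, script).

    The script is None when the name carries no suffix.
    Hypothetical helper, written only for this illustration.
    """
    lang, sep, script = name.partition("-")
    return lang, (script if sep else None)


# A few config names taken from the table above.
configs = ["bg", "bg-Latn", "zh", "zh-Latn", "und"]

# Keep only the romanized variants.
romanized = [c for c in configs if split_config(c)[1] == "Latn"]
print(romanized)
```

A filter like this may be handy when romanized text should be handled separately from text in the native script.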
You can load the mC4 subset of any language like this:
from datasets import load_dataset
en_mc4 = load_dataset("mc4", "en")
You can also specify a list of languages:
from datasets import load_dataset
mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
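Because config names follow the codes in the language table (for example, Hebrew is listed as iw rather than he), it may help to check a requested list before starting a large download. The sketch below is one possible way to do that; `check_languages` and the abbreviated `AVAILABLE` set are hypothetical and are not provided by the `datasets` library.

```python
def check_languages(requested, available):
    """Return the requested codes in order, or raise ValueError on unknown ones.

    Hypothetical helper, not part of the datasets library.
    """
    unknown = [code for code in requested if code not in available]
    if unknown:
        raise ValueError(f"Unknown mC4 language codes: {unknown}")
    return list(requested)


# Abbreviated for illustration; the full set is the language table above.
AVAILABLE = {"en", "fr", "es", "de", "zh", "zh-Latn", "iw", "no", "und"}

selected = check_languages(["en", "fr", "es", "de", "zh"], AVAILABLE)
# `selected` could then be passed as languages=selected to load_dataset.
print(selected)
```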
Supported Tasks and Leaderboards
mC4 is mainly intended to pretrain language models and word representations.
Languages
The dataset supports 108 languages.
Dataset Structure
Data Instances
An example from the en config is:
{'timestamp': '2018-06-24T01:32:39Z',
'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
Data Fields
The data have several fields:
- url: URL of the source, as a string
- text: text content, as a string
- timestamp: timestamp, as a string
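Since all three fields are plain strings, downstream code typically has to parse them. The following is a minimal sketch using only the Python standard library; it assumes that timestamps follow the format shown in the example instance above, and `parse_record` is a hypothetical helper. The record's text is shortened here for brevity.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Shortened copy of the example instance shown above.
record = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\n"
            "Show Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}


def parse_record(rec):
    """Return (crawl time as an aware datetime, source host, text lines)."""
    # Assumes the "YYYY-MM-DDTHH:MM:SSZ" layout seen in the example instance.
    crawled_at = datetime.strptime(
        rec["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    host = urlparse(rec["url"]).netloc
    lines = rec["text"].split("\n")
    return crawled_at, host, lines


crawled_at, host, lines = parse_record(record)
print(crawled_at.isoformat(), host, len(lines))
```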
Data Splits
To build mC4, the authors used CLD3 to identify over 100 languages. The resulting mC4 subsets for each language are reported in this table:
config | train | validation |
---|---|---|
af | ? | ? |
am | ? | ? |
ar | ? | ? |
az | ? | ? |
be | ? | ? |
bg | ? | ? |
bg-Latn | ? | ? |
bn | ? | ? |
ca | ? | ? |
ceb | ? | ? |
co | ? | ? |
cs | ? | ? |
cy | ? | ? |
da | ? | ? |
de | ? | ? |
el | ? | ? |
el-Latn | ? | ? |
en | ? | ? |
eo | ? | ? |
es | ? | ? |
et | ? | ? |
eu | ? | ? |
fa | ? | ? |
fi | ? | ? |
fil | ? | ? |
fr | ? | ? |
fy | ? | ? |
ga | ? | ? |
gd | ? | ? |
gl | ? | ? |
gu | ? | ? |
ha | ? | ? |
haw | ? | ? |
hi | ? | ? |
hi-Latn | ? | ? |
hmn | ? | ? |
ht | ? | ? |
hu | ? | ? |
hy | ? | ? |
id | ? | ? |
ig | ? | ? |
is | ? | ? |
it | ? | ? |
iw | ? | ? |
ja | ? | ? |
ja-Latn | ? | ? |
jv | ? | ? |
ka | ? | ? |
kk | ? | ? |
km | ? | ? |
kn | ? | ? |
ko | ? | ? |
ku | ? | ? |
ky | ? | ? |
la | ? | ? |
lb | ? | ? |
lo | ? | ? |
lt | ? | ? |
lv | ? | ? |
mg | ? | ? |
mi | ? | ? |
mk | ? | ? |
ml | ? | ? |
mn | ? | ? |
mr | ? | ? |
ms | ? | ? |
mt | ? | ? |
my | ? | ? |
ne | ? | ? |
nl | ? | ? |
no | ? | ? |
ny | ? | ? |
pa | ? | ? |
pl | ? | ? |
ps | ? | ? |
pt | ? | ? |
ro | ? | ? |
ru | ? | ? |
ru-Latn | ? | ? |
sd | ? | ? |
si | ? | ? |
sk | ? | ? |
sl | ? | ? |
sm | ? | ? |
sn | ? | ? |
so | ? | ? |
sq | ? | ? |
sr | ? | ? |
st | ? | ? |
su | ? | ? |
sv | ? | ? |
sw | ? | ? |
ta | ? | ? |
te | ? | ? |
tg | ? | ? |
th | ? | ? |
tr | ? | ? |
uk | ? | ? |
und | ? | ? |
ur | ? | ? |
uz | ? | ? |
vi | ? | ? |
xh | ? | ? |
yi | ? | ? |
yo | ? | ? |
zh | ? | ? |
zh-Latn | ? | ? |
zu | ? | ? |
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
Citation Information
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}