mc4/README.md · bertin-project/bertin-roberta-base-spanish at v1

versae

Adding sampling to mc4

3f09f56 almost 3 years ago


	---
	pretty_name: mC4
	annotations_creators:
	- no-annotation
	language_creators:
	- found
	languages:
	- af
	- am
	- ar
	- az
	- be
	- bg
	- bg-Latn
	- bn
	- ca
	- ceb
	- co
	- cs
	- cy
	- da
	- de
	- el
	- el-Latn
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- fi
	- fil
	- fr
	- fy
	- ga
	- gd
	- gl
	- gu
	- ha
	- haw
	- hi
	- hi-Latn
	- hmn
	- ht
	- hu
	- hy
	- id
	- ig
	- is
	- it
	- iw
	- ja
	- ja-Latn
	- jv
	- ka
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lb
	- lo
	- lt
	- lv
	- mg
	- mi
	- mk
	- ml
	- mn
	- mr
	- ms
	- mt
	- my
	- ne
	- nl
	- "no"
	- ny
	- pa
	- pl
	- ps
	- pt
	- ro
	- ru
	- ru-Latn
	- sd
	- si
	- sk
	- sl
	- sm
	- sn
	- so
	- sq
	- sr
	- st
	- su
	- sv
	- sw
	- ta
	- te
	- tg
	- th
	- tr
	- uk
	- und
	- ur
	- uz
	- vi
	- xh
	- yi
	- yo
	- zh
	- zh-Latn
	- zu
	licenses:
	- odc-by-1.0
	multilinguality:
	- multilingual
	size_categories:
	- n<1K
	- 1K<n<10K
	- 10K<n<100K
	- 100K<n<1M
	- 1M<n<10M
	- 10M<n<100M
	- 100M<n<1B
	- 1B<n<10B
	source_datasets:
	- original
	task_categories:
	- sequence-modeling
	task_ids:
	- language-modeling
	paperswithcode_id: mc4
	---

	# Dataset Card for mC4

	## Table of Contents

	- [Dataset Card for mC4](#dataset-card-for-mc4)
	  - [Table of Contents](#table-of-contents)
	  - [Dataset Description](#dataset-description)
	    - [Dataset Summary](#dataset-summary)
	    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
	    - [Languages](#languages)
	  - [Dataset Structure](#dataset-structure)
	    - [Data Instances](#data-instances)
	    - [Data Fields](#data-fields)
	    - [Data Splits](#data-splits)
	  - [Dataset Creation](#dataset-creation)
	    - [Curation Rationale](#curation-rationale)
	    - [Source Data](#source-data)
	      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
	      - [Who are the source language producers?](#who-are-the-source-language-producers)
	    - [Annotations](#annotations)
	      - [Annotation process](#annotation-process)
	      - [Who are the annotators?](#who-are-the-annotators)
	    - [Personal and Sensitive Information](#personal-and-sensitive-information)
	  - [Considerations for Using the Data](#considerations-for-using-the-data)
	    - [Social Impact of Dataset](#social-impact-of-dataset)
	    - [Discussion of Biases](#discussion-of-biases)
	    - [Other Known Limitations](#other-known-limitations)
	  - [Additional Information](#additional-information)
	    - [Dataset Curators](#dataset-curators)
	    - [Licensing Information](#licensing-information)
	    - [Citation Information](#citation-information)
	    - [Contributions](#contributions)

	## Dataset Description

	- Homepage: https://huggingface.co/datasets/allenai/c4
	- Paper: https://arxiv.org/abs/1910.10683

	### Dataset Summary

	A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.

	This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4

	108 languages are available and are reported in the table below.

	Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.

	\| language code \| language name \|
	\|:----------------\|:---------------------\|
	\| af \| Afrikaans \|
	\| am \| Amharic \|
	\| ar \| Arabic \|
	\| az \| Azerbaijani \|
	\| be \| Belarusian \|
	\| bg \| Bulgarian \|
	\| bg-Latn \| Bulgarian (Latin) \|
	\| bn \| Bangla \|
	\| ca \| Catalan \|
	\| ceb \| Cebuano \|
	\| co \| Corsican \|
	\| cs \| Czech \|
	\| cy \| Welsh \|
	\| da \| Danish \|
	\| de \| German \|
	\| el \| Greek \|
	\| el-Latn \| Greek (Latin) \|
	\| en \| English \|
	\| eo \| Esperanto \|
	\| es \| Spanish \|
	\| et \| Estonian \|
	\| eu \| Basque \|
	\| fa \| Persian \|
	\| fi \| Finnish \|
	\| fil \| Filipino \|
	\| fr \| French \|
	\| fy \| Western Frisian \|
	\| ga \| Irish \|
	\| gd \| Scottish Gaelic \|
	\| gl \| Galician \|
	\| gu \| Gujarati \|
	\| ha \| Hausa \|
	\| haw \| Hawaiian \|
	\| hi \| Hindi \|
	\| hi-Latn \| Hindi (Latin) \|
	\| hmn \| Hmong, Mong \|
	\| ht \| Haitian \|
	\| hu \| Hungarian \|
	\| hy \| Armenian \|
	\| id \| Indonesian \|
	\| ig \| Igbo \|
	\| is \| Icelandic \|
	\| it \| Italian \|
	\| iw \| Hebrew (former language code) \|
	\| ja \| Japanese \|
	\| ja-Latn \| Japanese (Latin) \|
	\| jv \| Javanese \|
	\| ka \| Georgian \|
	\| kk \| Kazakh \|
	\| km \| Khmer \|
	\| kn \| Kannada \|
	\| ko \| Korean \|
	\| ku \| Kurdish \|
	\| ky \| Kyrgyz \|
	\| la \| Latin \|
	\| lb \| Luxembourgish \|
	\| lo \| Lao \|
	\| lt \| Lithuanian \|
	\| lv \| Latvian \|
	\| mg \| Malagasy \|
	\| mi \| Maori \|
	\| mk \| Macedonian \|
	\| ml \| Malayalam \|
	\| mn \| Mongolian \|
	\| mr \| Marathi \|
	\| ms \| Malay \|
	\| mt \| Maltese \|
	\| my \| Burmese \|
	\| ne \| Nepali \|
	\| nl \| Dutch \|
	\| no \| Norwegian \|
	\| ny \| Nyanja \|
	\| pa \| Punjabi \|
	\| pl \| Polish \|
	\| ps \| Pashto \|
	\| pt \| Portuguese \|
	\| ro \| Romanian \|
	\| ru \| Russian \|
	\| ru-Latn \| Russian (Latin) \|
	\| sd \| Sindhi \|
	\| si \| Sinhala \|
	\| sk \| Slovak \|
	\| sl \| Slovenian \|
	\| sm \| Samoan \|
	\| sn \| Shona \|
	\| so \| Somali \|
	\| sq \| Albanian \|
	\| sr \| Serbian \|
	\| st \| Southern Sotho \|
	\| su \| Sundanese \|
	\| sv \| Swedish \|
	\| sw \| Swahili \|
	\| ta \| Tamil \|
	\| te \| Telugu \|
	\| tg \| Tajik \|
	\| th \| Thai \|
	\| tr \| Turkish \|
	\| uk \| Ukrainian \|
	\| und \| Unknown language \|
	\| ur \| Urdu \|
	\| uz \| Uzbek \|
	\| vi \| Vietnamese \|
	\| xh \| Xhosa \|
	\| yi \| Yiddish \|
	\| yo \| Yoruba \|
	\| zh \| Chinese \|
	\| zh-Latn \| Chinese (Latin) \|
	\| zu \| Zulu \|
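
The `-Latn` naming convention noted above can be handled programmatically when iterating over config names. The following is a minimal sketch; `split_config` is a hypothetical helper introduced here for illustration and is not part of the dataset or the `datasets` library.

```python
from typing import Tuple


def split_config(config: str) -> Tuple[str, bool]:
    """Split a config name into (base language code, is_romanized)."""
    # "ru-Latn" -> ("ru", True); a plain code such as "ru" has no suffix.
    base, _, suffix = config.partition("-")
    return base, suffix == "Latn"


print(split_config("ru-Latn"))  # ('ru', True)
print(split_config("ru"))       # ('ru', False)
```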

	You can load the mC4 subset of any language like this:

	```python
	from datasets import load_dataset

	en_mc4 = load_dataset("mc4", "en")
	```

	You can even specify a list of languages:

	```python
	from datasets import load_dataset

	mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
	```
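
Since the corpus is colossal, it may be preferable to inspect a few records without downloading a whole config first. The sketch below assumes that the installed `datasets` version can still resolve the `mc4` loader and accepts `streaming=True`; both may differ across library versions. `take_examples` and `stream_mc4` are hypothetical helper names, not part of the library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take_examples(examples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the first n records from any iterable of examples."""
    return list(islice(examples, n))


def stream_mc4(language: str):
    """Open the train split of one mC4 config in streaming mode (needs network access)."""
    # Imported lazily so that take_examples works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the "mc4" loader is available in the installed version.
    return load_dataset("mc4", language, split="train", streaming=True)
```

Under those assumptions, `take_examples(stream_mc4("en"), 3)` would return the first three English records as dictionaries.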

	### Supported Tasks and Leaderboards

	mC4 is mainly intended to pretrain language models and word representations.

	### Languages

	The dataset supports 108 languages.

	## Dataset Structure

	### Data Instances

	An example from the `en` config is:

	```
	{'timestamp': '2018-06-24T01:32:39Z',
	'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
	'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
	```
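
A record like the one above can be post-processed with the standard library alone. This is a minimal sketch, assuming every timestamp follows the same `YYYY-MM-DDTHH:MM:SSZ` pattern as this example (the card does not state that explicitly); `parse_timestamp` is a hypothetical helper, and the `text` value is shortened to its first two lines.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as '2018-06-24T01:32:39Z' into an aware UTC datetime."""
    # The trailing 'Z' denotes UTC in ISO 8601.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Record shortened from the `en` example above (text cut to its first two lines).
example = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}

crawled_at = parse_timestamp(example["timestamp"])
host = urlparse(example["url"]).netloc
lines = example["text"].split("\n")

print(crawled_at.isoformat())  # 2018-06-24T01:32:39+00:00
print(host)                    # www.californialandcan.org
print(len(lines))              # 2
```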

	### Data Fields

	The data have several fields:

	- `url`: URL of the source as a string
	- `text`: text content as a string
	- `timestamp`: timestamp as a string
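
The three fields listed above can be checked before records enter a preprocessing pipeline. The following is a sketch only; `EXPECTED_FIELDS` and `invalid_fields` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List

# Field names taken from the list above; all three are documented as strings.
EXPECTED_FIELDS = ("url", "text", "timestamp")


def invalid_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected fields that are missing from the record or are not strings."""
    return [name for name in EXPECTED_FIELDS if not isinstance(record.get(name), str)]


good = {"url": "http://example.com/", "text": "Hello", "timestamp": "2018-06-24T01:32:39Z"}
bad = {"text": "Hello", "timestamp": None}

print(invalid_fields(good))  # []
print(invalid_fields(bad))   # ['url', 'timestamp']
```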

	### Data Splits

	To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages. The resulting mC4 subsets for each language are listed in this table; split sizes that have not been reported are shown as `?`:

	\| config \| train \| validation \|
	\|:---------\|:--------\|:-------------\|
	\| af \| ? \| ? \|
	\| am \| ? \| ? \|
	\| ar \| ? \| ? \|
	\| az \| ? \| ? \|
	\| be \| ? \| ? \|
	\| bg \| ? \| ? \|
	\| bg-Latn \| ? \| ? \|
	\| bn \| ? \| ? \|
	\| ca \| ? \| ? \|
	\| ceb \| ? \| ? \|
	\| co \| ? \| ? \|
	\| cs \| ? \| ? \|
	\| cy \| ? \| ? \|
	\| da \| ? \| ? \|
	\| de \| ? \| ? \|
	\| el \| ? \| ? \|
	\| el-Latn \| ? \| ? \|
	\| en \| ? \| ? \|
	\| eo \| ? \| ? \|
	\| es \| ? \| ? \|
	\| et \| ? \| ? \|
	\| eu \| ? \| ? \|
	\| fa \| ? \| ? \|
	\| fi \| ? \| ? \|
	\| fil \| ? \| ? \|
	\| fr \| ? \| ? \|
	\| fy \| ? \| ? \|
	\| ga \| ? \| ? \|
	\| gd \| ? \| ? \|
	\| gl \| ? \| ? \|
	\| gu \| ? \| ? \|
	\| ha \| ? \| ? \|
	\| haw \| ? \| ? \|
	\| hi \| ? \| ? \|
	\| hi-Latn \| ? \| ? \|
	\| hmn \| ? \| ? \|
	\| ht \| ? \| ? \|
	\| hu \| ? \| ? \|
	\| hy \| ? \| ? \|
	\| id \| ? \| ? \|
	\| ig \| ? \| ? \|
	\| is \| ? \| ? \|
	\| it \| ? \| ? \|
	\| iw \| ? \| ? \|
	\| ja \| ? \| ? \|
	\| ja-Latn \| ? \| ? \|
	\| jv \| ? \| ? \|
	\| ka \| ? \| ? \|
	\| kk \| ? \| ? \|
	\| km \| ? \| ? \|
	\| kn \| ? \| ? \|
	\| ko \| ? \| ? \|
	\| ku \| ? \| ? \|
	\| ky \| ? \| ? \|
	\| la \| ? \| ? \|
	\| lb \| ? \| ? \|
	\| lo \| ? \| ? \|
	\| lt \| ? \| ? \|
	\| lv \| ? \| ? \|
	\| mg \| ? \| ? \|
	\| mi \| ? \| ? \|
	\| mk \| ? \| ? \|
	\| ml \| ? \| ? \|
	\| mn \| ? \| ? \|
	\| mr \| ? \| ? \|
	\| ms \| ? \| ? \|
	\| mt \| ? \| ? \|
	\| my \| ? \| ? \|
	\| ne \| ? \| ? \|
	\| nl \| ? \| ? \|
	\| no \| ? \| ? \|
	\| ny \| ? \| ? \|
	\| pa \| ? \| ? \|
	\| pl \| ? \| ? \|
	\| ps \| ? \| ? \|
	\| pt \| ? \| ? \|
	\| ro \| ? \| ? \|
	\| ru \| ? \| ? \|
	\| ru-Latn \| ? \| ? \|
	\| sd \| ? \| ? \|
	\| si \| ? \| ? \|
	\| sk \| ? \| ? \|
	\| sl \| ? \| ? \|
	\| sm \| ? \| ? \|
	\| sn \| ? \| ? \|
	\| so \| ? \| ? \|
	\| sq \| ? \| ? \|
	\| sr \| ? \| ? \|
	\| st \| ? \| ? \|
	\| su \| ? \| ? \|
	\| sv \| ? \| ? \|
	\| sw \| ? \| ? \|
	\| ta \| ? \| ? \|
	\| te \| ? \| ? \|
	\| tg \| ? \| ? \|
	\| th \| ? \| ? \|
	\| tr \| ? \| ? \|
	\| uk \| ? \| ? \|
	\| und \| ? \| ? \|
	\| ur \| ? \| ? \|
	\| uz \| ? \| ? \|
	\| vi \| ? \| ? \|
	\| xh \| ? \| ? \|
	\| yi \| ? \| ? \|
	\| yo \| ? \| ? \|
	\| zh \| ? \| ? \|
	\| zh-Latn \| ? \| ? \|
	\| zu \| ? \| ? \|

	## Dataset Creation

	### Curation Rationale

	[More Information Needed]

	### Source Data

	#### Initial Data Collection and Normalization

	[More Information Needed]

	#### Who are the source language producers?

	[More Information Needed]

	### Annotations

	#### Annotation process

	[More Information Needed]

	#### Who are the annotators?

	[More Information Needed]

	### Personal and Sensitive Information

	[More Information Needed]

	## Considerations for Using the Data

	### Social Impact of Dataset

	[More Information Needed]

	### Discussion of Biases

	[More Information Needed]

	### Other Known Limitations

	[More Information Needed]

	## Additional Information

	### Dataset Curators

	[More Information Needed]

	### Licensing Information

	AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

	### Citation Information

	```
	@article{2019t5,
	author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
	title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
	journal = {arXiv e-prints},
	year = {2019},
	archivePrefix = {arXiv},
	eprint = {1910.10683},
	}
	```

	### Contributions

	Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.