Releasing Common Corpus: the largest public domain dataset for training LLMs

Community Article Published March 20, 2024

Today we are announcing the release of Common Corpus on HuggingFace:

  • Common Corpus is the largest public domain dataset released for training LLMs.
  • Common Corpus includes 500 billion words from a wide diversity of cultural heritage initiatives.
  • Common Corpus is multilingual and the largest corpus to date in English, French, Dutch, Spanish, German and Italian.
  • Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.
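Common Corpus can be explored like any other dataset on HuggingFace. Below is a minimal sketch of streaming it with the `datasets` library; the repository identifier `PleIAs/common_corpus` is an assumption, so check the dataset page for the exact ID:

```python
from datasets import load_dataset

# Minimal sketch: stream Common Corpus without downloading it in full.
# The repository ID "PleIAs/common_corpus" is an assumption; check the
# dataset page on HuggingFace for the exact identifier.
dataset = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Preview the first few records.
for i, record in enumerate(dataset):
    print(record)
    if i >= 2:
        break
```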

Common Corpus is an international initiative coordinated by Pleias, involving researchers in LLM pretraining, AI ethics and cultural heritage, in association with major organizations committed to an open science approach for AI (HuggingFace, Occiglot, EleutherAI, Nomic AI). Common Corpus has received the support of Lang:IA, a state start-up supported by the French Ministry of Culture and the Direction du numérique (Agent Public). Pleias is a French start-up specialized in the training of Large Language Models for document processing on fully open and auditable corpora.

Contrary to what most large AI companies claim, the release of Common Corpus aims to show that it is possible to train Large Language Models on fully open and reproducible corpora, without using copyrighted content. This release is only an initial part of what we have collected so far, in part due to the lengthy process of verifying copyright duration. In the following weeks and months, we will continue to publish many additional datasets drawn from other open sources, such as open data and open science.

Common Corpus holds the largest English-speaking dataset to date, with 180 billion words. This includes a major US collection of 21 million digitized newspapers, Chronicling America, which can also be fully explored with an original corpus map created by Nomic AI, as well as large monograph datasets collected by digital historian Sebastian Majstorovic.

Common Corpus is also multilingual. It incorporates the largest open datasets to date in French (110 billion words), German (30 billion words), Spanish, Dutch and Italian, as well as a very long tail of low-resource languages that are currently hardly represented in the training of Large Language Models.
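For instance, a per-language subset could be filtered out of the stream. The sketch below assumes each record carries a language tag; the field names `language` and `text` and the code `fr` are hypothetical, so consult the dataset card for the actual schema:

```python
from datasets import load_dataset

# Hypothetical sketch: keep only the French portion of the stream.
# The "language" field, the "fr" code, and the "text" field are
# assumptions about the metadata schema; check the dataset card.
stream = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
french_only = stream.filter(lambda record: record.get("language") == "fr")

for i, record in enumerate(french_only):
    print(record["text"][:200])  # preview the first 200 characters
    if i >= 2:
        break
```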

Common Corpus is not only open but also higher-quality and more diverse than the web archive datasets commonly used for pretraining. It includes millions of books with reasoning-rich content, which makes it ideal for training models with long context.

Common Corpus is the start of a long work in progress. Much remains to be done to enhance and extend this collection. We aim to support a strong data commons for AI, both to ease research and make it more reproducible, and to make AI more accessible, diverse and democratic, by ensuring that anyone can look inside large models.