Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

Community Article Published April 18, 2024

Today we announce the release of Youtube-Commons on HuggingFace:

  • Youtube-Commons is the largest corpus of video to date under an entirely free license.
  • Youtube-Commons comprises 2 million videos under CC-By with documented provenance and attribution.
  • We include the metadata and the full transcript, which also makes it one of the largest collections of conversational data, with nearly 30 billion words.
  • Youtube-Commons is multilingual and includes translations in a variety of European languages.

Youtube-Commons is a follow-up to Common-Corpus, an international initiative coordinated by Pleias to release the largest open pre-training corpus drawn from public domain sources. Youtube-Commons has similarly received the support of Lang:IA, a state start-up supported by the French Ministry of Culture and the Direction du numérique (Agent Public). Pleias is a French start-up specialized in training Large Language Models for document processing on fully open and auditable corpora.

Youtube-Commons is made of materials released by their original authors under a free license (CC-By). There is currently a debate over the ethical and legal use of these resources for pre-training large text or multimodal models. We consider that respecting the terms of the license (especially with regard to attribution) and the general philosophy of Creative Commons is critical for any downstream project: we provide the necessary metadata to do so and invite all future projects to maintain key principles of reproducibility, transparency and reciprocal contribution to the commons.

Despite its size, Youtube-Commons is still far from covering all the freely licensed content available on Youtube. One of our incentives for releasing this corpus was the highly controversial use of copyrighted content from Youtube videos by OpenAI, both for GPT-4 (with more than 1 million transcripts) and, likely, for their video generation model, SORA. Through this release we aim to demonstrate that it is possible to reconcile AI development with scientific reproducibility and conformity to copyright law. We also seek to empower alternative, more ethical approaches.

Youtube-Commons is only a first step. We are currently expanding this collection both in size and in content, with the associated audio, image and video materials.