arxiv:2410.23144

Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms

Published on Oct 30

Upvote

Authors:

Jordan Meyer ,

Nick Padgett ,

Cullen Miller ,

Abstract

We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.23144 in a model README.md to link it from this page.

Datasets citing this paper 2

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.23144 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.