arxiv:2410.23144
Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
Published on Oct 30
Abstract
We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.