Common Corpus Collection The largest public domain dataset for training LLMs. • 26 items • Updated Mar 20 • 98
Text-to-Image Base Models Collection All open-source text-to-image base models, with their respective licenses • 28 items • Updated Feb 15 • 17
OpenCLIP DataComp Collection OpenCLIP models trained on DataComp (https://huggingface.co/papers/2304.14108). • 6 items • Updated Oct 9, 2023 • 5
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents Paper • 2306.16527 • Published Jun 21, 2023 • 40