MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published May 13 • 16
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach Paper • 2405.15613 • Published May 24 • 12
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20 • 11
How Do Large Language Models Acquire Factual Knowledge During Pretraining? Paper • 2406.11813 • Published 12 days ago • 28
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published 12 days ago • 39
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published 12 days ago • 60
From Pixels to Prose: A Large Dataset of Dense Image Captions Paper • 2406.10328 • Published 15 days ago • 16
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published 12 days ago • 10
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images Paper • 2406.13735 • Published 10 days ago • 5
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models Paper • 2406.14599 • Published 9 days ago • 16