MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens • Paper • 2406.11271 • Published 25 days ago • 10
DataComp-LM: In search of the next generation of training sets for language models • Paper • 2406.11794 • Published 24 days ago • 39
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus • Paper • 2406.08707 • Published 29 days ago • 14