Taylor658
posted an update Jun 20, 2024
A recently published technical report introduces MINT-1T, a dataset that will considerably expand open-source multimodal data. It features one trillion text tokens and three billion images and is scheduled for release in July 2024.

Researcher Affiliations:

University of Washington
Salesforce Research
Stanford University
University of Texas at Austin
University of California, Berkeley

Paper:
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
https://arxiv.org/pdf/2406.11271v1.pdf

GitHub:
https://github.com/mlfoundations/MINT-1T

Highlights:

MINT-1T Dataset: Largest open-source multimodal interleaved dataset with 1 trillion text tokens & 3 billion images.
Diverse Sources: Incorporates data from HTML, PDFs, and ArXiv documents.
Open Source: Dataset and code will be released at https://github.com/mlfoundations/MINT-1T.
Broader Domain Representation: Uses diverse data sources for balanced domain representation.
Performance in Multimodal Tasks: The dataset's scale and diversity should enhance multimodal task performance.

Datasheet Information:

Motivation: Addresses the gap in large-scale open-source multimodal datasets.
Composition: 927.6 million documents, including HTML, PDF, and ArXiv sources.
Collection Process: Gathered from CommonCrawl WARC and WAT dumps, with rigorous filtering.
Preprocessing/Cleaning: Removal of low-quality text and duplicates, and anonymization of sensitive information.
Ethical Considerations: Measures to ensure privacy and avoid bias.
Uses: Training multimodal models, generating interleaved image-text sequences, and building retrieval systems.
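
For a rough sense of scale, the headline figures quoted above can be turned into per-document averages. This is only a back-of-envelope sketch: it assumes the rounded numbers in this post (one trillion tokens, three billion images, 927.6 million documents) and is not a statistic reported by the authors. The helper name `per_document` is just for illustration.

```python
# Back-of-envelope averages from the rounded figures quoted in the post.
# These are headline numbers, so the results are rough estimates only.
TEXT_TOKENS = 1_000_000_000_000   # "one trillion text tokens"
IMAGES = 3_000_000_000            # "three billion images"
DOCUMENTS = 927_600_000           # "927.6 million documents"


def per_document(total: int, documents: int = DOCUMENTS) -> float:
    """Return the average count per document (illustrative helper)."""
    return total / documents


tokens_per_doc = per_document(TEXT_TOKENS)
images_per_doc = per_document(IMAGES)
tokens_per_image = TEXT_TOKENS / IMAGES

print(f"~{tokens_per_doc:.0f} text tokens per document")
print(f"~{images_per_doc:.1f} images per document")
print(f"~{tokens_per_image:.0f} text tokens per image")
```

On these rounded inputs the averages work out to roughly a thousand text tokens and about three images per document. The real distribution is likely to vary a lot between HTML, PDF, and ArXiv sources.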