Post 938
A recently published technical report introduces MINT-1T, a dataset that expands open-source multimodal interleaved data roughly tenfold. It comprises one trillion text tokens and three billion images and is scheduled for release in July 2024.
Researcher Affiliation:
University of Washington
Salesforce Research
Stanford University
University of Texas at Austin
University of California, Berkeley
Paper:
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
https://arxiv.org/pdf/2406.11271v1.pdf
GitHub:
https://github.com/mlfoundations/MINT-1T
Highlights:
MINT-1T Dataset: The largest open-source multimodal interleaved dataset, with 1 trillion text tokens and 3 billion images.
Diverse Sources: Incorporates data from HTML, PDF, and ArXiv documents.
Open Source: Dataset and code will be released at https://github.com/mlfoundations/MINT-1T.
Broader Domain Representation: Uses diverse data sources for balanced domain representation.
Performance in Multimodal Tasks: The dataset's scale and diversity should enhance performance on multimodal tasks.
Datasheet Information:
Motivation: Addresses the gap in large-scale open-source multimodal datasets.
Composition: 927.6 million documents drawn from HTML, PDF, and ArXiv sources.
Collection Process: Gathered from CommonCrawl WARC and WAT dumps, followed by rigorous filtering (a minimal WARC-reading sketch follows this list).
Preprocessing/Cleaning: Removal of low-quality text and duplicates, plus anonymization of sensitive information (see the cleaning sketch below).
Ethical Considerations: Measures to ensure privacy and avoid bias.
Uses: Training multimodal models, generating interleaved image-text sequences, and building retrieval systems.
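To make the collection step concrete, here is a minimal sketch of pulling HTML response records out of a CommonCrawl WARC file with the warcio library. This is not the MINT-1T pipeline itself; the file path and the minimum-length filter are illustrative assumptions.

```python
# Minimal sketch: iterate HTML response records in a CommonCrawl WARC file.
# NOT the MINT-1T pipeline; the length-based quality filter and the local
# file path are illustrative assumptions.
from warcio.archiveiterator import ArchiveIterator


def iter_html_records(warc_path: str, min_bytes: int = 500):
    """Yield (url, html_bytes) pairs for HTML responses in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = ""
            if record.http_headers:
                content_type = record.http_headers.get_header("Content-Type", "")
            if "text/html" not in content_type:
                continue
            payload = record.content_stream().read()
            if len(payload) < min_bytes:  # crude low-quality filter
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, payload


if __name__ == "__main__":
    # Hypothetical local WARC segment; real dumps are hosted by CommonCrawl.
    for url, html in iter_html_records("CC-MAIN-example.warc.gz"):
        print(url, len(html))
```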
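And a companion sketch for the cleaning step: exact paragraph-level deduplication plus regex-based email anonymization. The actual MINT-1T preprocessing is more involved; the hash-based dedup and the EMAIL_RE pattern here are simplified assumptions for illustration.

```python
# Minimal sketch of the cleaning step described above: exact paragraph-level
# deduplication and email anonymization. Simplified assumptions, not the
# paper's actual pipeline.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean_document(text: str, seen_hashes: set) -> str:
    """Drop previously seen paragraphs and mask email addresses."""
    kept = []
    for para in text.split("\n\n"):
        digest = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate paragraph, skip it
        seen_hashes.add(digest)
        kept.append(EMAIL_RE.sub("<EMAIL>", para))  # anonymize email addresses
    return "\n\n".join(kept)


if __name__ == "__main__":
    seen = set()
    doc = "Contact us at ops@example.com.\n\nSame paragraph.\n\nSame paragraph."
    print(clean_document(doc, seen))
```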