atayloraerospace (Taylor658)

AI & ML interests

Computer Vision 🔭 | Multimodal Gen AI 🤖 | AI in Healthcare 🩺 | AI in Aerospace 🚀

Posts

🌍 Cohere for AI has announced that this July and August, it is inviting researchers from around the world to join Expedition Aya, a global initiative focused on launching projects using multilingual tools like Aya 23 and Aya 101. 🌐
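
For anyone sketching a project idea ahead of the expedition, here is a minimal example of prompting Aya 23 through its openly released Hugging Face checkpoint. This is a sketch, not official expedition material: it assumes the transformers library is installed and that there is enough memory for the 8B model, and the prompt is purely illustrative.

```python
# pip install transformers accelerate
# Minimal sketch: prompt the open-weight Aya 23 8B checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Aya 23 ships with a chat template, so apply_chat_template builds the prompt format.
messages = [{"role": "user", "content": "Translate to Turkish: Teamwork makes projects succeed."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Aya 101 (CohereForAI/aya-101) is a seq2seq model, so it would be loaded with AutoModelForSeq2SeqLM instead.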

Participants can start by joining the Aya Discord server, where all coordination will take place. There they can share ideas, connect with others, and add themselves to the signup sheet. Various events will be hosted to help people find potential team members. 🤝

To support the projects, Cohere API credits will be issued. 💰

Over the six weeks, weekly check-in calls are also planned to help teams stay on track and get support with using Aya. 🖥️

The expedition will wrap up at the end of August with a closing event to showcase everyone’s work and plan next steps. Participants who complete the expedition will also receive some Expedition Aya swag. 🎉

Links:
Join the Aya Discord: https://discord.com/invite/q9QRYkjpwk
Visit the Expedition Aya Minisite: https://sites.google.com/cohere.com/expedition-aya/home

πŸ” A recently published technical report introduces MINT-1T, a dataset that will considerably expand open-source multimodal data. It features one trillion text tokens and three billion images and is scheduled for release in July 2024.

Researcher Affiliation:

University of Washington
Salesforce Research
Stanford University
University of Texas at Austin
University of California, Berkeley

Paper:
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
https://arxiv.org/pdf/2406.11271v1.pdf

GitHub:
https://github.com/mlfoundations/MINT-1T

Highlights:

MINT-1T Dataset: Largest open-source multimodal interleaved dataset, with 1 trillion text tokens & 3 billion images. 📊🖼️
Diverse Sources: Incorporates data from HTML, PDFs, and ArXiv documents. 📄📚
Open Source: Dataset and code will be released at https://github.com/mlfoundations/MINT-1T (see the loading sketch after this list). 🌐🔓
Broader Domain Representation: Uses diverse data sources for balanced domain representation. 🌍📚
Performance in Multimodal Tasks: The dataset’s scale and diversity should enhance multimodal task performance. 🤖💡
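
Once the dataset ships, exploring a subset should look roughly like the snippet below. This is a sketch under assumptions: it presumes MINT-1T is published on the Hugging Face Hub in the standard datasets format, and the repository ID shown is illustrative rather than confirmed (check the GitHub repo above for the official location).

```python
# pip install datasets
# Hypothetical example: stream a MINT-1T subset instead of downloading the full corpus.
from datasets import load_dataset

# Repository ID is an assumption, not a confirmed release name.
ds = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)

for example in ds:
    # Interleaved documents mix text segments with image references.
    print(example.keys())
    break
```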

Datasheet Information:

Motivation: Addresses the gap in large-scale open-source multimodal datasets. 🌍📊
Composition: 927.6 million documents, drawn from HTML, PDF, and ArXiv sources. 📄📚
Collection Process: Gathered from CommonCrawl WARC and WAT dumps, with rigorous filtering (a minimal sketch follows this list). 🗂️🔍
Preprocessing/Cleaning: Removal of low-quality text and duplicates, plus anonymization of sensitive information. 🧹🔒
Ethical Considerations: Measures to ensure privacy and avoid bias. ⚖️🔍
Uses: Training multimodal models, generating interleaved image-text sequences, and building retrieval systems. 🤖📖
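
For a concrete sense of what the collection step involves, below is a minimal sketch (not the authors' pipeline) that iterates a CommonCrawl WARC file, keeps HTML responses, and drops exact duplicates by hashing the payload. It assumes the warcio package and a locally downloaded WARC file; MINT-1T's actual filtering and anonymization are far more extensive.

```python
# pip install warcio
import hashlib
from warcio.archiveiterator import ArchiveIterator

def iter_unique_html(warc_path):
    """Yield (url, html_bytes) for de-duplicated HTML responses in a WARC file."""
    seen_hashes = set()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            headers = record.http_headers
            content_type = headers.get_header("Content-Type") if headers else ""
            if not content_type or "text/html" not in content_type:
                continue  # keep only HTML documents
            html = record.content_stream().read()
            digest = hashlib.sha256(html).hexdigest()
            if digest in seen_hashes:
                continue  # naive exact-duplicate filter
            seen_hashes.add(digest)
            yield record.rec_headers.get_header("WARC-Target-URI"), html

# Usage: for url, html in iter_unique_html("example.warc.gz"): ...
```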