Quanzeng You

Ye27

AI & ML interests

None yet

Organizations

Blog-explorers, InfiMM, Social Post Explorers

Ye27's activity

upvoted an article 5 months ago

Preference Optimization for Vision Language Models
reacted to xiaotianhan's post with 🤗 11 months ago
Thrilled to share some of our recent work in the field of Multimodal Large Language Models (MLLMs).

1️⃣ A Survey on Multimodal Reasoning 📚
Curious about the reasoning abilities of MLLMs? Our latest survey takes a deep dive into multimodal reasoning: we comprehensively review existing evaluation protocols, categorize frontier MLLMs, examine recent trends in their application to reasoning-intensive tasks, and discuss current practices and future directions. For an in-depth exploration, check out our paper: Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning (2401.06805)

2️⃣ Advancing Flamingo with InfiMM 🔥
Building on the foundation of Flamingo, we introduce the InfiMM model series. InfiMM reproduces Flamingo with stronger Large Language Models (LLMs) such as LLaMA2-13B, Vicuna-13B, and Zephyr-7B. We meticulously filtered the pre-training data and instruction fine-tuning data, resulting in superior performance on recent benchmarks including MMMU, InfiMM-Eval, and MM-Vet. Explore the power of InfiMM on Hugging Face: Infi-MM/infimm-zephyr (a quick loading sketch follows below).
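
If you want to poke at the checkpoint, here is a minimal, untested loading sketch. It assumes the repo exposes its custom multimodal classes through the transformers Auto* interfaces via trust_remote_code; the exact Auto class, dtype, and prompt format are assumptions, so verify against the model card.

```python
# Minimal sketch: loading the InfiMM-Zephyr checkpoint from the Hub.
# Assumption: the repo wires its custom multimodal classes into the
# transformers Auto* interfaces via trust_remote_code; the exact Auto
# class and processor inputs may differ -- check the model card.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "Infi-MM/infimm-zephyr"

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights; adjust to your hardware
    trust_remote_code=True,
).eval()

# Inference pairs an image with an interleaved text prompt, Flamingo-style;
# see the model card for the exact prompt template and generate() call.
```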

3️⃣ Exploring Multimodal Instruction Fine-tuning 🖼️
Visual Instruction Fine-tuning (IFT) is crucial for aligning MLLMs' output with user intentions. Our research identified challenges with models trained on the LLaVA-mix-665k dataset, particularly in multi-round dialog settings. To address this, we created a new IFT dataset with high-quality, diverse instruction annotations and images sourced exclusively from the COCO dataset. Our experiments demonstrate that MLLMs fine-tuned on this dataset excel in open-ended evaluation benchmarks in both single-round and multi-round dialog settings. Dive into the details in our paper: COCO is "ALL" You Need for Visual Instruction Fine-tuning (2401.08968)
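
For concreteness, here is what a multi-round dialog IFT record typically looks like in the widely used LLaVA-style "conversations" layout. This layout is an assumption for illustration only (the paper defines our dataset's actual schema), and the identifier, path, and dialog text below are all hypothetical.

```python
# Hypothetical multi-round IFT sample in the LLaVA-style "conversations"
# schema (illustrative only; the paper defines the actual format).
# Each human/gpt pair is one dialog round; the image comes from COCO.
sample = {
    "id": "coco_000000123456",                      # hypothetical identifier
    "image": "coco/train2017/000000123456.jpg",     # hypothetical path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        {"from": "gpt", "value": "They are riding a bicycle along a beach path."},
        {"from": "human", "value": "Is it likely to be morning or evening?"},
        {"from": "gpt", "value": "The long shadows and warm light suggest evening."},
    ],
}
```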

Stay tuned for more exciting developments.
Special thanks to all our collaborators: @Ye27 @wwyssh @Yongfei @Yi-Qi638 @xudonglin @KhalilMrini @lllliuhhhhggg @Borise @Hongxia