
Xiaotian Han

xiaotianhan

AI & ML interests

Multimodal LLM


Posts (2)

Post
🎉 🎉 🎉 Happy to share our recent work. We noticed that image resolution plays an important role, both in improving multimodal large language model (MLLM) performance and in Sora-style any-resolution encoder-decoders. We hope this work helps lift the 224x224 resolution restriction in ViT.

ViTAR: Vision Transformer with Any Resolution (2403.18361)
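For context on the 224x224 constraint, here is a minimal, generic sketch (not ViTAR's method, which is described in the paper): a standard ViT-B/16 learns position embeddings for a fixed 14x14 patch grid, and the common workaround for other resolutions is to interpolate that grid. The function name and shapes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_hw, old_hw=(14, 14)):
    """Bicubically interpolate learned ViT position embeddings to a new patch grid.

    pos_embed: (1, 1 + old_h * old_w, dim) -- [CLS] slot followed by patch slots.
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, old_h * old_w, dim) -> (1, dim, old_h, old_w) for 2D interpolation
    patch_pos = patch_pos.reshape(1, old_hw[0], old_hw[1], dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_hw, mode="bicubic", align_corners=False)
    # back to (1, new_h * new_w, dim)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# A 224x224 ViT-B/16 sees a 14x14 patch grid; a 448x336 input needs 28x21.
pos = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pos, (28, 21)).shape)  # torch.Size([1, 589, 768])
```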
Post
Thrilled to share some of our recent work in the field of Multimodal Large Language Models (MLLMs).

1๏ธโƒฃ A Survey on Multimodal Reasoning ๐Ÿ“š
Are you curious about the reasoning abilities of MLLMs? In our latest survey, we delve into the world of multimodal reasoning. We comprehensively review existing evaluation protocols, categorize the frontiers of MLLMs, explore recent trends in their applications for reasoning-intensive tasks, and discuss current practices and future directions. For an in-depth exploration, check out our paper: Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning (2401.06805)

2๏ธโƒฃ Advancing Flamingo with InfiMM ๐Ÿ”ฅ
Building upon the foundation of Flamingo, we introduce the InfiMM model series. InfiMM is a reproduction of Flamingo, enhanced with stronger Large Language Models (LLMs) such as LLaMA2-13B, Vicuna-13B, and Zephyr7B. We've meticulously filtered pre-training data and fine-tuned instructions, resulting in superior performance on recent benchmarks like MMMU, InfiMM-Eval, MM-Vet, and more. Explore the power of InfiMM on Huggingface: Infi-MM/infimm-zephyr
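A minimal loading sketch, assuming the Infi-MM/infimm-zephyr repo follows the usual transformers custom-code pattern (trust_remote_code); the exact processor and generation interface are defined on the model card, so treat this as a starting point rather than the official usage.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "Infi-MM/infimm-zephyr"

# Assumed entry points; the model card documents the exact classes to use.
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # the Flamingo-style stack is large; prefer a GPU
).eval()
```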

3๏ธโƒฃ Exploring Multimodal Instruction Fine-tuning ๐Ÿ–ผ๏ธ
Visual Instruction Fine-tuning (IFT) is crucial for aligning MLLMs' output with user intentions. Our research identified challenges with models trained on the LLaVA-mix-665k dataset, particularly in multi-round dialog settings. To address this, we've created a new IFT dataset with high-quality, diverse instruction annotations and images sourced exclusively from the COCO dataset. Our experiments demonstrate that when fine-tuned with this dataset, MLLMs excel in open-ended evaluation benchmarks for both single-round and multi-round dialog settings. Dive into the details in our paper: COCO is "ALL'' You Need for Visual Instruction Fine-tuning (2401.08968)
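To make the multi-round setup concrete, here is an illustrative (hypothetical) record layout for multi-round visual IFT data; the dataset's actual schema and file naming may differ. Each sample ties one COCO image to several question/answer turns so the model must stay grounded in the image across rounds.

```python
# Hypothetical sample layout; image file name is a placeholder.
sample = {
    "image": "coco/train2017/000000391895.jpg",
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is the man doing?"},
        {"role": "assistant", "content": "He is riding a motorbike on a dirt road."},
        {"role": "user", "content": "Is anyone else visible in the scene?"},
        {"role": "assistant", "content": "Yes, another person is standing behind him."},
    ],
}

# Typically, only the assistant turns contribute to the training loss;
# user turns and image tokens are masked out of the label sequence.
```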

Stay tuned for more exciting developments.
Special thanks to all our collaborators: @Ye27 @wwyssh @Yongfei @Yi-Qi638 @xudonglin @KhalilMrini @lllliuhhhhggg @Borise @Hongxia

Models

None public yet

Datasets

None public yet