Path to Multimodal Generalist

community

https://generalist.top/

path2generalist

AI & ML interests

Multimodal Generalist

On Path to Multimodal Generalist: General-Level and General-Bench

[📖 Project] [🏆 Leaderboard] [📄 Paper] [🤗 Paper-HF] [🤗 Dataset-HF (Close-Set)] [🤗 Dataset-HF (Open-Set)] [📝 Github]

Does higher performance across tasks indicate a more capable MLLM, one that is closer to AGI?
No! But synergy does.

Most current MLLMs build on the language intelligence of LLMs and only simulate multimodal intelligence indirectly: they extend language intelligence to aid multimodal understanding. LLMs (e.g., ChatGPT) have already demonstrated such synergy within NLP, which reflects language intelligence, but the vast majority of MLLMs do not achieve it across modalities and tasks.

We argue that the key to advancing towards AGI lies in the synergy effect: a capability that enables knowledge learned in one modality or task to generalize to, and enhance mastery of, other modalities or tasks, so that different modalities and tasks improve one another through interconnected learning.

This project introduces General-Level and General-Bench.

🏆🏆🏆 Overall Leaderboard

🚀🚀🚀 General-Level

A five-level evaluation system that sets a new norm for assessing multimodal generalists (multimodal LLMs/agents).
Its core is the use of synergy as the evaluative criterion: capabilities are categorized by whether an MLLM preserves synergy across comprehension and generation, as well as across multimodal interactions.
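To make the idea of synergy as a criterion more concrete, here is a minimal sketch. It assumes one simplified way to operationalize synergy: a generalist counts as showing synergy on a task when it outperforms a single-task specialist baseline. This is an illustrative assumption, not the scoring defined by General-Level. The task names, the group labels, the scores, and the `synergy_rate` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str          # hypothetical task name
    group: str         # hypothetical grouping: "comprehension" or "generation"
    generalist: float  # score of the multimodal generalist (made-up number)
    specialist: float  # score of a single-task specialist baseline (made-up number)


def synergy_rate(results, group):
    """Fraction of tasks in `group` where the generalist beats the specialist.

    Beating the specialist is used here as a stand-in signal for synergy;
    this is an assumption for illustration, not the General-Level formula.
    """
    in_group = [r for r in results if r.group == group]
    if not in_group:
        return 0.0
    wins = sum(r.generalist > r.specialist for r in in_group)
    return wins / len(in_group)


# Made-up scores, for illustration only.
results = [
    TaskResult("image_captioning", "comprehension", 0.82, 0.80),
    TaskResult("visual_qa", "comprehension", 0.71, 0.75),
    TaskResult("text_to_image", "generation", 0.40, 0.55),
    TaskResult("image_editing", "generation", 0.62, 0.60),
]

for group in ("comprehension", "generation"):
    print(group, synergy_rate(results, group))
```

The point of the sketch is that a high average score alone says nothing here: only tasks where joint multimodal learning lifts the generalist above a dedicated specialist contribute to the rate.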

🍕🍕🍕 General-Bench

A companion large-scale multimodal benchmark that encompasses a broader spectrum of skills, modalities, formats, and capabilities, with over 700 tasks and 325K instances.

We provide two dataset variants, according to intended use:

General-Bench-Openset: the inputs and labels of all samples are publicly released, for free open-world use (e.g., academic experiments and comparisons).
General-Bench-Closeset: only the sample inputs are released. This set is used for leaderboard ranking, and participants submit their predictions to us for internal evaluation.
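The practical difference between the two sets can be sketched as follows. With an open set, labels are available, so a model can be scored locally. With a closed set, only inputs are available, so the most one can do locally is produce a predictions file for submission. The record fields (`id`, `input`, `label`, `prediction`), the JSON layout, and the `toy_predict` model are all assumptions made for illustration; they are not the actual General-Bench schema or submission format.

```python
import json


def score_openset(samples, predict):
    """Open-set style: labels are public, so accuracy can be computed locally."""
    if not samples:
        return 0.0
    correct = sum(predict(s["input"]) == s["label"] for s in samples)
    return correct / len(samples)


def build_closeset_submission(samples, predict):
    """Closed-set style: no labels, so only predictions can be produced.

    Returns a JSON string in a hypothetical submission layout.
    """
    rows = [{"id": s["id"], "prediction": predict(s["input"])} for s in samples]
    return json.dumps(rows)


def toy_predict(text):
    # Hypothetical stand-in for a real model.
    return "yes" if "cat" in text else "no"


# Made-up samples, for illustration only.
openset = [
    {"id": "o1", "input": "a cat on a mat", "label": "yes"},
    {"id": "o2", "input": "a dog in a park", "label": "yes"},
]
closeset = [{"id": "c1", "input": "a cat in a box"}]

print(score_openset(openset, toy_predict))
print(build_closeset_submission(closeset, toy_predict))
```

Keeping the closed-set labels private means leaderboard scores cannot be tuned against the answers, while the open set stays usable for everyday experiments.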

📌📌📌 Citation

If you find this project useful to your research, please kindly cite our paper:

@article{fei2025pathmultimodalgeneralistgenerallevel,
  title={On Path to Multimodal Generalist: General-Level and General-Bench},
  author={Hao Fei and Yuan Zhou and Juncheng Li and Xiangtai Li and Qingshan Xu and Bobo Li and Shengqiong Wu and Yaoting Wang and Junbao Zhou and Jiahao Meng and Qingyu Shi and Zhiyuan Zhou and Liangtao Shi and Minghe Gao and Daoan Zhang and Zhiqi Ge and Weiming Wu and Siliang Tang and Kaihang Pan and Yaobo Ye and Haobo Yuan and Tao Zhang and Tianjie Ju and Zixiang Meng and Shilin Xu and Liyu Jia and Wentao Hu and Meng Luo and Jiebo Luo and Tat-Seng Chua and Shuicheng Yan and Hanwang Zhang},
  eprint={2505.04620},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.04620},
}