GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data
🔥🏅️Leaderboard🏅️🔥 • Contribute • Paper • Citation
GenCeption is an annotation-free evaluation framework for MLLMs (Multimodal Large Language Models) that requires only unimodal data; it assesses inter-modality semantic coherence and inversely reflects a model's tendency to hallucinate.
GenCeption is inspired by the popular multi-player game DrawCeption. Using the image modality as an example, the process begins with a seed image $\mathbf{X}^{(0)}$ drawn from a unimodal image dataset. In each iteration $t$ (starting with $t$=1), the MLLM writes a detailed description of $\mathbf{X}^{(t-1)}$, which an image generator then uses to produce $\mathbf{X}^{(t)}$. After $T$ iterations, the GC@T score is computed to measure the MLLM's performance on $\mathbf{X}^{(0)}$.
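Conceptually, a single GenCeption run on one seed image is a short loop. The sketch below is only a minimal illustration, not the repository's implementation: the `describe`, `generate`, and `similarity` callables are hypothetical stand-ins for the MLLM under evaluation, the image generator, and the image-similarity measure described in the paper.

```python
from typing import Callable, List

def genception_run(
    seed_image,
    describe: Callable,    # MLLM under evaluation: image -> textual description
    generate: Callable,    # text-to-image generator: description -> image
    similarity: Callable,  # semantic similarity between two images -> float
    T: int = 5,
) -> List[float]:
    """Run T GenCeption iterations on a seed image X^(0) and return s^(1)..s^(T)."""
    current = seed_image
    scores = []
    for _ in range(T):
        description = describe(current)                 # Q^(t): describe X^(t-1)
        current = generate(description)                 # X^(t): regenerate from the description
        scores.append(similarity(seed_image, current))  # s^(t) relative to X^(0)
    return scores  # aggregated into GC@T (see the paper for the exact formula)
```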
The GenCeption ranking on the MME benchmark dataset (obtained without using any labels) correlates strongly with rankings from other comprehensive benchmarks such as OpenCompass and HallusionBench. Moreover, its negative correlation with MME scores suggests that GenCeption measures aspects not covered by MME, even though both use the same set of samples. For a detailed experimental analysis, please read our paper.
We demonstrate a 5-iteration GenCeption procedure below, run on a seed image to evaluate 4 VLLMs. Each iteration $t$ shows the generated image $\mathbf{X}^{(t)}$, the description $\mathbf{Q}^{(t)}$ of the preceding image $\mathbf{X}^{(t-1)}$, and the similarity score $s^{(t)}$ relative to $\mathbf{X}^{(0)}$. The GC@5 score for each VLLM is also reported. Elements hallucinated in descriptions $\mathbf{Q}^{(1)}$ and $\mathbf{Q}^{(2)}$ relative to the seed image are marked with red underline.
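For concreteness, here is a hedged sketch of how $s^{(t)}$ and GC@5 could be computed. It assumes a CLIP image encoder from Hugging Face `transformers` and a plain average of the per-iteration similarities; the actual embedding model and the exact GC@T formula used by GenCeption are specified in the paper and in `experiment.py`.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a CLIP image encoder stands in for the embedding model;
# the paper/repository may use a different encoder and a different GC@T weighting.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP embedding of an image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

def similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """s^(t): cosine similarity between the seed image X^(0) and X^(t)."""
    return float((embed(img_a) @ embed(img_b).T).item())

def gc_at_t(scores: list[float]) -> float:
    """Placeholder aggregation (plain mean); the paper defines the exact GC@T formula."""
    return sum(scores) / len(scores)
```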
Contribute
Please create a PR (Pull Request) to contribute your results to the 🔥🏅️Leaderboard🏅️🔥. Start by creating your virtual environment:
```bash
conda create --name genception python=3.10 -y
conda activate genception
pip install -r requirements.txt
```
For example, to evaluate the mPLUG-Owl2 model, please follow the installation instructions in the official mPLUG-Owl2 repository. Then run GenCeption with:
```bash
bash example_script.sh  # uses exemplary data in datasets/example/
```
This assumes that an `OPENAI_API_KEY` is set as an environment variable. The model argument to `experiment.py` in `example_script.sh` can be set to `llava7b`, `llava13b`, `mPLUG`, or `gpt4v`. Please adapt it accordingly to evaluate your own MLLM.
The MME dataset, whose image modality was used in our paper, can be obtained as described here.
Cite This Work
```bibtex
@article{cao2023genception,
  author       = {Lele Cao and Valentin Buchner and Zineb Senane and Fangkai Yang},
  title        = {{GenCeption}: Evaluate Multimodal LLMs with Unlabeled Unimodal Data},
  year         = {2023},
  journal      = {arXiv preprint arXiv:2402.14973},
  primaryClass = {cs.AI,cs.CL,cs.LG}
}
```