arxiv:2405.07990

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Published on May 13
· Featured in Daily Papers on May 14

Abstract

The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we provide its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, relying heavily on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.
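The abstract names three automatic metrics. As an illustration only, here is a minimal sketch of how the first two (code pass rate and text-match ratio) might be computed; the function names, the `exec`-based execution, and the set-based text comparison are assumptions for illustration, not the benchmark's actual implementation:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt


def code_pass(generated_code: str) -> bool:
    """Hypothetical helper for the 'code pass rate' metric:
    True if the generated matplotlib code executes without error."""
    try:
        exec(generated_code, {"plt": plt})
        return True
    except Exception:
        return False
    finally:
        plt.close("all")  # clean up any figures the snippet created


def text_match_ratio(generated_texts: set, reference_texts: set) -> float:
    """Sketch of the 'text-match ratio' metric: fraction of reference
    text elements (titles, labels, tick text) recovered in the output."""
    if not reference_texts:
        return 1.0
    return len(generated_texts & reference_texts) / len(reference_texts)
```

The pass rate over a benchmark would then be the mean of `code_pass` across all generated samples; the third metric, the GPT-4V overall rating, requires a model-in-the-loop judgement and is not sketched here.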

Community

Really cool work 🔥 And thanks for sharing the dataset on the hub!!

I could be completely wrong, so feel free to correct me. If one decided to pre-train/fine-tune their VLM/MLLM with all the examples from https://matplotlib.org/stable/gallery/index.html (they could even modify the code to generate more variety of similar plots), would that contaminate this benchmark's dataset?

Paper author

Hi Yixiang, I think this is a common question for all evaluation benchmarks. We want to emphasize the significance of visual coding tasks through this benchmark. This is a preliminary attempt to show MLLM agents' ability in plot reasoning and coding. A future solution to the contamination problem is also simple: we can use some in-house data to do the evaluation. I hope my response is helpful!

