About the OCRBench

#8
by SWHL - opened

Is there a significant difference in the performance of many of the same models between the current OCRBench leaderboard and the linked one?

I feel like there must be something wrong.


Some APIs, such as GPT4V and QwenVLMax, are not very stable and calls can fail. When testing, https://huggingface.co/spaces/echo840/ocrbench-leaderboard retries each failed call up to n times until it succeeds or the attempts are exhausted. Here, each sample may have been called only once, so some examples the model could answer correctly were counted as failures simply because the call did not go through. As a result, the scores of these API models can differ somewhat (this is probably the main reason; aside from the prompt, the rest should be random error and differences in the test environment). In addition, the models in OpenCompass may be tested with some extra prompts that differ from the prompts used at https://huggingface.co/spaces/echo840/ocrbench-leaderboard, which also causes some differences between the two sets of results.
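The retry-until-success behavior described above can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual code; the names `call_api`, `max_attempts`, and `delay` are hypothetical.

```python
import time


def call_with_retry(call_api, sample, max_attempts=5, delay=1.0):
    """Retry an unstable API call up to max_attempts times.

    Hypothetical sketch of the retry logic: keep calling until the
    call succeeds or the attempt budget is exhausted, in which case
    the last exception propagates and the sample counts as a failure.
    """
    for attempt in range(max_attempts):
        try:
            return call_api(sample)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # all attempts used up: treat the sample as failed
            time.sleep(delay)  # brief pause before retrying
```

With only a single attempt (`max_attempts=1`), any transient API failure is recorded as a wrong answer, which would explain the score gap for otherwise-answerable samples.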

OpenCompass org

Hi, @SWHL ,
as mentioned by @echo840 , sometimes the API model fails or reject to answer a certain question due to the policy set. We have double checked the results and rerun all failed samples. After that, the score of QwenVLMax on OCRBench do improves. However, it's still inferior to QwenVLPlus/
