TITLE = """

ConTextual Leaderboard
"""

INTRODUCTION_TEXT = """
Models are becoming quite good at understanding text on its own, but what about text in images, which gives important contextual information? For example, navigating a map, or understanding a meme? The ability to reason about the interactions between the text and visual context in images can power many real-world applications, such as AI assistants, or tools to assist the visually impaired. We refer to these tasks as context-sensitive text-rich visual reasoning tasks.

At the moment, most evaluations of instruction-tuned large multimodal models (LMMs) focus on testing how well models can respond to human instructions posed as questions or imperative tasks over images… but not how well they understand context-sensitive text-rich scenes! That’s why we created ConTextual, a Context-sensitive Text-rich visuaL reasoning dataset for evaluating LMMs. We also released a leaderboard, so that the community can see for themselves which models are the best at this task. (See our [paper](https://arxiv.org/abs/2401.13311) for more details.)

## Data
ConTextual comprises **506 examples covering 8 real-world visual scenarios**: *Time Reading, Shopping, Navigation, Abstract Scenes, Mobile Application, Webpages, Infographics and Miscellaneous Natural Scenes*. Each sample consists of:
- A text-rich image
- A human-written instruction (question or imperative task)
- A human-written reference response

### Data Access
ConTextual data can be found on HuggingFace and GitHub.
- HuggingFace
    - [Test](https://huggingface.co/datasets/ucla-contextual/contextual_test)
    - [Val](https://huggingface.co/datasets/ucla-contextual/contextual_val)
- GitHub
    - [Test](https://github.com/rohan598/ConTextual/blob/main/data/contextual_test.csv)
    - [Val](https://github.com/rohan598/ConTextual/blob/main/data/contextual_val.csv)

### Data Format
```
{
    "image_url": [string] url to the hosted image,
    "instruction": [string] instruction text,
    "response": [string] response text (only provided for samples in the val subset),
    "category": [string] visual scenario this example belongs to, like 'time' and 'shopping', out of 8 possible scenarios in ConTextual
}
```
"""

SUBMISSION_TEXT = """
## Submissions
Only validation results can be submitted here.
Scores are expressed as the percentage of correct answers for a given split.

Submissions made by our team are labelled "ConTextual authors".

### Validation Results
To submit your validation results to the leaderboard, you can run our auto-evaluation code (Evaluation Pipeline with GPT-4), following the instructions [here](https://github.com/rohan598/ConTextual?tab=readme-ov-file#-evaluation-pipeline-gpt-4).

We expect submissions to be in JSON format, as shown below:
```
{"model_name": {"img_url": "1 or 0 as integer"}}
Replace model name with your model name (string)
Replace img_url with img_url of the instance (string)
Value for an img url is either 0 or 1 (int)
There should be 100 predictions, corresponding to the 100 urls of the val set.
```

**Please do not use the public dev set as part of the training data for your models.**

### Test Results
Once you are happy with your val results, you can send your model predictions to [rohan](mailto:rwadhawan7@g.ucla.edu) and [hritik](mailto:hbansal@g.ucla.edu).

Please include in your email:
1) A name for your model.
2) Organization (affiliation).
3) (Optionally) GitHub repo or paper link.

We expect submissions to be in JSON format, similar to the val set, as shown below:
```
{"model_name": {"img_url": "predicted response"}}
Replace model name with your model name (string)
Replace img_url with img_url of the instance (string)
Value for an img url is the predicted response for that instance (string)
There should be 506 predictions, corresponding to the 506 urls of the test set.
```

**Please revisit the test leaderboard within 1 to 2 days after sharing your prediction file to view your model scores and ranking on the leaderboard.**
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{wadhawan2024contextual,
      title={ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models},
      author={Rohan Wadhawan and Hritik Bansal and Kai-Wei Chang and Nanyun Peng},
      year={2024},
      eprint={2401.13311},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}"""


def format_error(msg):
    # Returns the error message as plain text.
    return f"{msg}"


def format_warning(msg):
    # Returns the warning message as plain text.
    return f"{msg}"


def format_log(msg):
    # Returns the log message as plain text.
    return f"{msg}"


def model_hyperlink(link, model_name):
    # Returns only the display name; `link` is not used in this plain-text form.
    return f"{model_name}"