What do you think of "List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs"

by Shure-Dev - opened


I would like to know why you do not concatenate multiple images into a single image and solve the task with prompt engineering alone.

TIGER-Lab org

That approach is the baseline we compared against across all the benchmarks. Also, concatenating images makes co-reference almost impossible. We don't think that's the way to go.
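
To illustrate the co-reference point, here is a rough sketch of what happens to image identity under concatenation. It is not the baseline code from the paper: the helper name `concat_layout` and the left-to-right layout are assumptions made only for this example, and it works on image sizes rather than real pixels.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def concat_layout(sizes: List[Tuple[int, int]]) -> Tuple[Tuple[int, int], List[Box]]:
    """Hypothetical helper: lay images out left to right on one canvas.

    sizes: (width, height) of each input image.
    Returns the canvas (width, height) and the box each image occupies.
    """
    boxes: List[Box] = []
    x = 0
    for w, h in sizes:
        # Each image starts where the previous one ended.
        boxes.append((x, 0, x + w, h))
        x += w
    # The canvas must be as tall as the tallest input image.
    canvas_height = max((h for _, h in sizes), default=0)
    return (x, canvas_height), boxes


canvas, boxes = concat_layout([(640, 480), (800, 600), (320, 240)])
print(canvas)  # size of the single merged image
print(boxes)   # where each original image ended up
```

After the merge the model receives a single image, so "the second image" exists only as a pixel region that the prompt has to describe in words.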
