A new paper introduces Visual CoT, an approach that equips multi-modal large language models with visual chain-of-thought reasoning. The model dynamically identifies and focuses on the image regions most relevant to answering a question, mimicking the efficiency of human visual reasoning.
Key points:

* Introduces the 373k Visual CoT dataset with bounding box annotations highlighting essential image regions
* Proposes a multi-turn pipeline for focusing on the relevant visual inputs (see the sketch after this list)
* Achieves strong results on multi-modal benchmarks
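To make the multi-turn idea concrete, here is a minimal sketch of a two-turn inference loop: the model first proposes a bounding box for the question-relevant region, the region is cropped, and the crop is fed back alongside the original image for the final answer. The `model.generate` interface, the prompt wording, and the `parse_bbox` helper are hypothetical stand-ins, not the paper's actual API.

```python
import re
from PIL import Image


def parse_bbox(text: str) -> tuple[int, int, int, int]:
    """Extract the first four integers in the reply as (x1, y1, x2, y2)."""
    nums = [int(n) for n in re.findall(r"-?\d+", text)]
    if len(nums) < 4:
        raise ValueError(f"could not parse a bounding box from: {text!r}")
    return nums[0], nums[1], nums[2], nums[3]


def visual_cot_answer(model, image_path: str, question: str) -> str:
    """Two-turn visual chain-of-thought: localize first, then answer."""
    image = Image.open(image_path)

    # Turn 1: ask the model which region of the image is most relevant
    # to the question; it is expected to reply with a bounding box.
    bbox_reply = model.generate(
        images=[image],
        prompt=(
            f"{question}\n"
            "Provide the bounding box of the image region most relevant "
            "to answering the question, as [x1, y1, x2, y2]."
        ),
    )
    x1, y1, x2, y2 = parse_bbox(bbox_reply)

    # Turn 2: crop the predicted region and pass it back together with
    # the full image, so the model answers with a zoomed-in view of the
    # relevant details.
    cropped = image.crop((x1, y1, x2, y2))
    answer = model.generate(images=[image, cropped], prompt=question)
    return answer
```

The design choice mirrored here is that the crop supplements rather than replaces the original image, so global context is preserved while fine-grained details become legible to the vision encoder.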