Shengju Qian

thesouthfrog

AI & ML interests

None yet

Recent Activity

Organizations

ID-Animator

thesouthfrog's activity

liked a Space 7 months ago
reacted to vladbogo's post with ❤️ 9 months ago
A new paper introduces Visual CoT, an approach that adds visual chain-of-thought reasoning to multi-modal large language models. It lets the model dynamically identify and focus on the image regions most relevant to a question, mimicking efficient, human-like visual reasoning.

Key points:
* Introduces the 373k Visual CoT dataset with bounding box annotations highlighting essential image regions
* Proposes a multi-turn pipeline for focusing on relevant visual inputs
* Achieves strong results on multi-modal benchmarks
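
To make the "focus on a region, then answer" idea concrete, here is a minimal sketch of a two-turn loop. It is not the authors' code: the toy image type, the box convention `(x0, y0, x1, y1)`, and the `locate_region` and `answer_question` callables are all hypothetical stand-ins for a real multi-modal model, and passing both the full image and the crop into the second turn is an assumption about how such a pipeline might be wired.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]            # toy grayscale image: rows of pixel values
BBox = Tuple[int, int, int, int]   # (x0, y0, x1, y1), with x1 and y1 exclusive


def crop(image: Image, box: BBox) -> Image:
    """Return the sub-image inside `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def two_turn_answer(
    image: Image,
    question: str,
    locate_region: Callable[[Image, str], BBox],          # hypothetical model call
    answer_question: Callable[[Image, Image, str], str],  # hypothetical model call
) -> str:
    # Turn 1: ask the model which region matters for this question.
    box = locate_region(image, question)
    # Turn 2: answer using the full image plus the zoomed-in region.
    region = crop(image, box)
    return answer_question(image, region, question)


if __name__ == "__main__":
    toy_image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
    # Stub "models" so the sketch runs without any ML dependencies.
    reply = two_turn_answer(
        toy_image,
        "What is the brightest pixel in the lower-right area?",
        locate_region=lambda img, q: (1, 1, 3, 3),
        answer_question=lambda img, reg, q: str(max(max(r) for r in reg)),
    )
    print(reply)
```

In a real setup the two callables would wrap model inference, and the box would come from the model's first-turn output rather than a fixed stub.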

Paper: Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2403.16999)
Code, data and other resources: https://github.com/deepcs233/Visual-CoT

Congrats to the authors for their work!