Shengju Qian

thesouthfrog

AI & ML interests

None yet

thesouthfrog's activity

upvoted a paper 1 day ago

VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

Paper • 2412.11279 • Published 3 days ago • 11

upvoted a paper 4 months ago

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51

liked a dataset 7 months ago

OpenDILabCommunity/LMDrive

Viewer • Updated Dec 25, 2023 • 169k • 12.9k • 13

liked a Space 7 months ago

ID Animator

reacted to vladbogo's post with ❤️ 9 months ago

A new paper introduces Visual CoT, an approach that equips multi-modal large language models with visual chain-of-thought reasoning. It lets a model dynamically identify and focus on the image regions most relevant to answering a question, mimicking efficient human visual reasoning.

Key points:
* Introduces the 373k Visual CoT dataset with bounding box annotations highlighting essential image regions
* Proposes a multi-turn pipeline for focusing on relevant visual inputs
* Achieves strong results on multi-modal benchmarks

Paper: Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2403.16999)
Code, data and other resources: https://github.com/deepcs233/Visual-CoT

Congrats to the authors on their work!