HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Paper • 2503.02003 • Published • 48
Score image-text similarity using CLIP or SigLIP models
Identify objects in images using text prompts
Segment images using text descriptions
Generate correspondences between images
Explore images from ImageNet-Hard dataset