Papers
arxiv:2606.16767

Text-Vision Co-Instructed Image Editing

Published on Jun 15
· Submitted by
PolyU_VCLab
on Jun 17
Authors:
,
,
,

Abstract

A unified text-visual image editing framework is presented that combines semantic intent from textual instructions with spatial guidance from visual prompts to achieve more precise and faithful image manipulation.

Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.

Community

Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.
intro

Neat paper. It makes a lot of sense to combine textual instructions with visual prompts like points or drags, since text is great for intent but often lacks the specific spatial control needed for precise editing. Seeing a framework that explicitly bridges that gap is pretty interesting.

I'm curious, how does the model handle cases where the text and the visual prompt might be slightly contradictory or ambiguous?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/02cf4e30-a159-4ca7-85dc-5238baf81475

·

Thank you for your interest in our work — this is an interesting question.

In our setting, text mainly provides the semantic intent, while the visual prompt provides more explicit spatial guidance. For example, when asking a model to “turn a dog’s head to the left,” the meaning of “left” can be ambiguous: it may refer to the dog’s own left or the image’s left. Text-only foundation models often focus more on performing the semantic action, such as “turning,” but are less precise about the spatial direction.

Since TV-Edit is trained with visual trajectories as supervision, it tends to rely on the visual prompt to resolve such spatial ambiguity. Therefore, if the text says “turn left” but the drag trajectory indicates a rightward movement, the model will usually follow the visual prompt for the direction while still preserving the semantic action specified by the text. In this way, the two prompts play complementary roles: text specifies what to edit, and the visual prompt specifies how/where to move it spatially.

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.16767
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.16767 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.16767 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.16767 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.