Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Kiwi-Edit is a versatile video editing framework built on an MLLM encoder and a video Diffusion Transformer (DiT). It supports two modes: video editing driven by a text instruction alone, and video editing driven by a reference image together with an instruction.

Introduction

Kiwi-Edit addresses the challenge of precise visual control in instruction-based video editing. It introduces a scalable data generation pipeline, used to create the RefVIE dataset, and proposes a unified architecture that combines learnable queries with latent visual features. The model handles a range of tasks, including:

  • Global Stylization: Applying aesthetic changes to a video.
  • Local Editing: Adding, replacing, or removing specific objects.
  • Background Modification: Changing the background based on text or reference images.

Usage

You can run inference in a Diffusers-based environment, as described in the official repository.

Inference with Diffusers

# Install requirements
pip install diffusers decord einops accelerate transformers==4.57.0 opencv-python av

# Run the demo
python diffusers_demo.py \
    --video_path ./demo_data/video/source/example.mp4 \
    --prompt "Remove the monkey." \
    --save_path output.mp4 \
    --model_path linyq/kiwi-edit-5b-instruct-only-diffusers
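
To apply several instructions to the same clip, the demo command above can be assembled programmatically. The following is a minimal sketch, not part of the official repository: it only builds argument lists using the four flags shown above (--video_path, --prompt, --save_path, --model_path). The helper names build_demo_command and build_batch, the output file naming scheme, and the second prompt are assumptions made for illustration.

```python
import shlex
import sys
from typing import List, Sequence

# Model ID taken from the demo command above.
DEFAULT_MODEL = "linyq/kiwi-edit-5b-instruct-only-diffusers"


def build_demo_command(
    video_path: str,
    prompt: str,
    save_path: str,
    model_path: str = DEFAULT_MODEL,
    script: str = "diffusers_demo.py",
) -> List[str]:
    """Return the argv list for one run of the demo script (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return [
        sys.executable, script,
        "--video_path", video_path,
        "--prompt", prompt,
        "--save_path", save_path,
        "--model_path", model_path,
    ]


def build_batch(
    video_path: str,
    prompts: Sequence[str],
    out_prefix: str = "output",
) -> List[List[str]]:
    """Build one command per prompt; outputs are named <prefix>_<index>.mp4 (assumed scheme)."""
    return [
        build_demo_command(video_path, p, f"{out_prefix}_{i}.mp4")
        for i, p in enumerate(prompts)
    ]


if __name__ == "__main__":
    prompts = [
        "Remove the monkey.",             # prompt from the demo command above
        "Make it a watercolor painting.",  # illustrative prompt, not from the repository
    ]
    for cmd in build_batch("./demo_data/video/source/example.mp4", prompts):
        # Print each command in shell-quoted form; nothing is executed here.
        print(shlex.join(cmd))
```

To execute a command rather than print it, pass the list to subprocess.run(cmd, check=True) from the directory that contains diffusers_demo.py. Passing a list avoids shell-quoting problems with prompts that contain spaces or punctuation.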

Citation

@misc{kiwiedit,
      title={Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance}, 
      author={Yiqi Lin and Guoqiang Liang and Ziyun Zeng and Zechen Bai and Yanzhe Chen and Mike Zheng Shou},
      year={2026},
      eprint={2603.02175},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.02175}, 
}