Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu*, Mengqi Huang*,✉, Shaojin Wu*, Yunsheng Jiang*, Yufei Huo, Jianzhu Guo✉,§
Hao Li, Yinghang Song, Fei Ding, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang
ByteDance
* Equal contribution    ✉ Corresponding authors    § Project lead

Note: Lance is a research project rather than a polished product model. The released checkpoint was trained on up to 128 A100 GPUs, at resolutions up to 768x768 for image generation and up to 480p at 12 FPS for video generation. Our goal is to share a research artifact for studying unified image/video understanding, generation, and editing with a relatively small model and a limited compute budget. Output quality may vary across prompts, resolutions, durations, motion complexity, and editing scenarios, and we see further opportunities to improve the post-training recipe. We appreciate constructive feedback from the community as we continue improving the project.

🔥 Updates

  • 2026/05/26: 🎨 The Gradio interface now supports image and video generation, editing, and understanding. Try it out!
  • 2026/05/25: ✨ The Hugging Face Space is now live, thanks to the HF team!
  • 2026/05/19: 🤗 The technical report is now available on arXiv.
  • 2026/05/18: 🔥 We launched the project homepage and released the initial inference code and model weights on GitHub and Hugging Face.

🌟 Highlights

Lance is a 3B native unified multimodal model that supports image and video understanding, generation, and editing within a single framework.

  • Efficient at 3B scale. With only 3B active parameters, Lance achieves competitive performance across image generation, image editing, and video generation benchmarks.
  • Training from scratch. Lance is trained from scratch with a staged multi-task recipe and within a budget of up to 128 A100 GPUs.

We are actively updating and improving this repository. If you find any bugs or have suggestions, please feel free to open an issue or submit a pull request (PR) 💖.

Lance benchmark overview across image generation, image editing, video generation, and video understanding

📅 Roadmap

  • Release the fine-tuning code.
  • Add support for image-to-video generation code.

🎨 Demo

🔥 We recommend visiting our homepage for more visual results. 🔥

Text-to-Video

Video Editing

Multi-turn Consistency Editing

Intelligent Video Generation

🚀 Installation

Recommended Environment

  • Software: Python 3.10+, CUDA 12.4+ (required)
  • Hardware: A GPU with at least 40GB VRAM is required for inference

We have tested the following dependency combinations on NVIDIA A100:

  • PyTorch 2.8.0 + cu126 + flash-attn 2.8.3
  • PyTorch 2.5.1 + cu124 + flash-attn 2.6.3

The default installation commands use the PyTorch 2.8.0 + cu126 setup. For other GPU models, please choose and validate a PyTorch build and a matching flash-attn version according to your driver, CUDA runtime, Python version, and GPU architecture.
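Before installing the remaining dependencies, it can help to confirm that the environment matches the requirements listed above. The following is a minimal sketch, not part of the Lance repository: the helper names `has_enough_vram` and `check_environment` are hypothetical, and the 40GB threshold is compared in decimal gigabytes because a 40GB card can report slightly under 40 GiB.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)   # "Python 3.10+" from the requirements above
MIN_VRAM_GB = 40       # "at least 40GB VRAM" from the requirements above


def has_enough_vram(total_bytes, min_gb=MIN_VRAM_GB):
    """Compare in decimal GB (1e9 bytes), an assumption about how "40GB" is meant."""
    return total_bytes / 1e9 >= min_gb


def check_environment():
    """Collect basic facts about the current environment without raising."""
    report = {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "cuda_available": False,
        "gpus": [],
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            for idx in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(idx)
                report["gpus"].append(
                    {"name": props.name, "vram_ok": has_enough_vram(props.total_memory)}
                )
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

The script only reports what it finds; it does not check that the installed flash-attn build actually matches the PyTorch and CUDA versions, which still needs to be validated by running inference.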

Installation Steps

First, clone the repository:

git clone https://github.com/bytedance/Lance.git
cd Lance

Then, set up the environment:

conda create -n Lance python=3.11 -y
conda activate Lance
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation

Note: If installing flash-attn from source fails, you can install a prebuilt wheel instead. The wheelhouse below is from a third-party repository and is provided for reference only; please verify that any wheel you install matches your Python, PyTorch and CUDA versions.

pip install --no-cache-dir --no-deps --force-reinstall \
"https://huggingface.co/strangertoolshf/flash_attention_2_wheelhouse/resolve/main/wheelhouse-flash_attn-2.8.3/linux_x86_64/torch2.8/cu12/abiTRUE/cp311/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl"

Then, download the model weights from Lance-3B on Hugging Face and place them in the downloads/ directory:

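import os

from huggingface_hub import snapshot_download

save_dir = "./downloads"
repo_id = "bytedance-research/Lance"
cache_dir = os.path.join(save_dir, "cache")  # avoids the "./downloads//cache" double slash

# Download only config, code, docs, and weight files into save_dir.
# Note: recent huggingface_hub releases deprecate local_dir_use_symlinks and
# resume_download; drop them if your installed version warns about or rejects them.
snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],
)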
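After the download finishes, a quick listing of the weight files can confirm that nothing was skipped. This is a small sketch, assuming the weights use the same suffixes as the download patterns above; `summarize_weights` is a hypothetical helper, and the expectation that `downloads/` contains `Lance_3B` and `Lance_3B_Video` subdirectories is inferred from the MODEL_PATH values used in this README, not confirmed.

```python
from pathlib import Path

# Weight file suffixes, taken from the allow_patterns in the download call.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pth"}


def summarize_weights(root):
    """Map each directory under `root` that holds weight files to (file count, total GB)."""
    summary = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in WEIGHT_SUFFIXES:
            rel_dir = str(path.parent.relative_to(root))
            count, size_gb = summary.get(rel_dir, (0, 0.0))
            summary[rel_dir] = (count + 1, size_gb + path.stat().st_size / 1e9)
    return summary


if __name__ == "__main__":
    for directory, (count, size_gb) in sorted(summarize_weights("./downloads").items()):
        print(f"{directory}: {count} weight file(s), {size_gb:.2f} GB")
```

An empty result, or a missing model directory, usually means the download was interrupted or the patterns did not match the files in the repository.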

📚 Usage

Inference

Basic Usage

bash inference_lance.sh
  • Before running, please configure the inference parameters at the top of inference_lance.sh.
  • Supported tasks: t2i, t2v, image_edit, video_edit, x2t_image, and x2t_video. You can modify TASK_DEFAULT_CONFIGS in inference_lance.py to customize the default data samples for each task.
  • Note: For all tasks, we recommend following the prompt format used in the provided examples when writing input prompts, as this typically leads to better generation quality.

Task Examples

Text-to-Video
bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v
Text-to-Image
bash inference_lance.sh \
  --TASK_NAME t2i \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --VIDEO_HEIGHT 768 \
  --VIDEO_WIDTH 768 \
  --SAVE_PATH_GEN results/t2i
Video Editing
bash inference_lance.sh \
  --TASK_NAME video_edit \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --SAVE_PATH_GEN results/video_edit
Image Editing
bash inference_lance.sh \
  --TASK_NAME image_edit \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/image_edit
Video Understanding
bash inference_lance.sh \
  --TASK_NAME x2t_video \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 50 \
  --SAVE_PATH_GEN results/x2t_video
Image Understanding
bash inference_lance.sh \
  --TASK_NAME x2t_image \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/x2t_image
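
The six commands above differ only in a few flags, so they can be generated from one table when scripting batch runs. The sketch below is an illustration, not part of the repository: `TASK_DEFAULTS` and `build_command` are hypothetical names, and the flag names and values are copied from the examples above.

```python
import shlex

# Task -> (model directory, resolution preset), copied from the examples above.
TASK_DEFAULTS = {
    "t2v": ("downloads/Lance_3B_Video", "video_480p"),
    "t2i": ("downloads/Lance_3B", "image_768res"),
    "video_edit": ("downloads/Lance_3B_Video", "video_480p"),
    "image_edit": ("downloads/Lance_3B", "image_768res"),
    "x2t_video": ("downloads/Lance_3B_Video", "video_480p"),
    "x2t_image": ("downloads/Lance_3B", "image_768res"),
}


def build_command(task, **overrides):
    """Build the argv list for one inference_lance.sh run.

    Keyword overrides become extra flags, e.g. NUM_FRAMES=121 -> --NUM_FRAMES 121.
    """
    if task not in TASK_DEFAULTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_DEFAULTS)}")
    model_path, resolution = TASK_DEFAULTS[task]
    flags = {
        "TASK_NAME": task,
        "MODEL_PATH": model_path,
        "RESOLUTION": resolution,
        "SAVE_PATH_GEN": f"results/{task}",
    }
    flags.update(overrides)
    cmd = ["bash", "inference_lance.sh"]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_command("t2v", NUM_FRAMES=121, VIDEO_HEIGHT=480, VIDEO_WIDTH=848)
    print(shlex.join(cmd))
    # To launch from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to compare against the hand-written examples before launching a long generation job.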

Available Tasks

| Task Name | Description | Example JSON |
|---|---|---|
| t2v | Text-to-Video generation | config/examples/t2v_example.json |
| t2i | Text-to-Image generation | config/examples/t2i_example.json |
| image_edit | Image editing | config/examples/image_edit_example.json |
| video_edit | Video editing | config/examples/video_edit_example.json |
| x2t_image | Image understanding | config/examples/x2t_image_example.json |
| x2t_video | Video understanding | config/examples/x2t_video_example.json |

For understanding examples:

  • config/examples/x2t_image_example.json: image understanding examples for visual question answering and image-based reasoning.
  • config/examples/x2t_video_example.json: video understanding examples for video question answering and video captioning.

Parameters

You can configure the following hyperparameters at the top of the inference_lance.sh script:

| Parameter | Default Value | Description |
|---|---|---|
| MODEL_PATH | "downloads/Lance_3B" | Path to the downloaded Lance model weights (Lance_3B or Lance_3B_Video). |
| NUM_GPUS | 1 | Number of GPUs to use for inference. |
| VALIDATION_NUM_TIMESTEPS | 30 | Number of denoising steps (e.g., 30 or 50). |
| VALIDATION_TIMESTEP_SHIFT | 3.5 | Timestep shift parameter for flow matching scheduling. |
| CFG_TEXT_SCALE | 4.0 | Classifier-Free Guidance (CFG) scale for text conditioning. |
| VALIDATION_DATA_SEED | 42 | Random seed for generation reproducibility. |
| NUM_FRAMES | 50 | Number of frames for video generation (Max: 121). Unused for image tasks. |
| VIDEO_HEIGHT / VIDEO_WIDTH | 768 | Spatial resolution. Unused for editing tasks (determined by input image/video). |
| RESOLUTION | "video_480p" | Base resolution preset (image_768res or video_480p). |
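
The table names a timestep shift and a CFG scale without giving formulas. The sketch below shows one common reading of both, as an aid to intuition only. It assumes the shift rule t' = s·t / (1 + (s − 1)·t) that is widely used in flow matching samplers, a time convention where t = 1 is pure noise and t = 0 is clean data, and the standard classifier-free guidance combination. Whether inference_lance.py implements exactly these is an assumption, and `shifted_timesteps` and `apply_cfg` are hypothetical names.

```python
def shifted_timesteps(num_steps=30, shift=3.5):
    """Return num_steps + 1 timesteps from 1.0 (noise) down to 0.0 (data).

    Assumed shift rule: t' = shift * t / (1 + (shift - 1) * t).
    With shift > 1, timesteps are pushed toward 1.0, so more of the
    sampling steps are spent at high noise levels.
    """
    uniform = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in uniform]


def apply_cfg(cond, uncond, scale=4.0):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    steps = shifted_timesteps(num_steps=30, shift=3.5)
    print("first five timesteps:", [round(t, 4) for t in steps[:5]])
    print("guided prediction:", apply_cfg([2.0, 0.0], [1.0, 1.0], scale=4.0))
```

Under these assumptions, raising VALIDATION_TIMESTEP_SHIFT concentrates steps near the noisy end of the trajectory, and raising CFG_TEXT_SCALE moves the prediction further from the unconditional output toward the text-conditioned one.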

🖥️ Gradio

You can launch the local Gradio demo for video/image generation, editing, and understanding:

python lance_gradio.py --server-name 0.0.0.0 --server-port 7860

Benchmarks

DPG-Bench Evaluation
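| Models | # Params. | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|---|
| Generation-only Models | | | | | | | |
| SDXL | 3.5B | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| DALL-E 3 | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SD3-Medium | 2B | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| FLUX.1-dev | 12B | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
| Qwen-Image | 20B | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| Unified Models | | | | | | | |
| Janus-Pro-7B | 7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| OmniGen2 | 4B | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 | 83.57 |
| Show-o2 | 7B | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 | 86.14 |
| BAGEL | 7B | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
| InternVL-U | 1.7B | 90.39 | 90.78 | 90.68 | 90.29 | 88.77 | 85.18 |
| TUNA | 7B | 90.42 | 91.68 | 90.94 | 91.87 | 90.73 | 86.76 |
| TUNA-2 | 7B | 89.50 | 91.40 | 92.07 | 91.91 | 88.81 | 86.54 |
| 🌟 Lance (Ours) | 3B | 83.89 | 91.07 | 89.36 | 93.38 | 80.80 | 84.67 |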

indicates methods that use LLM rewriters for prompt rewriting before generation.

GenEval Evaluation
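| Models | # Params. | 1-Obj. | 2-Obj. | Count | Colors | Position | Attr. | Overall |
|---|---|---|---|---|---|---|---|---|
| Generation-only Models | | | | | | | | |
| SDXL | 3.5B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| DALL-E 3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| SD3-Medium | 2B | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| FLUX.1-dev | 12B | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Qwen-Image | 20B | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 | 0.87 |
| Unified Models | | | | | | | | |
| Janus-Pro-7B | 7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| OmniGen2 | 4B | 1.00 | 0.95 | 0.64 | 0.88 | 0.55 | 0.76 | 0.80 |
| Show-o2 | 7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
| BAGEL | 7B | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| Mogao | 7B | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| InternVL-U | 1.7B | 0.99 | 0.94 | 0.74 | 0.91 | 0.77 | 0.74 | 0.85 |
| TUNA | 7B | 1.00 | 0.97 | 0.81 | 0.91 | 0.88 | 0.83 | 0.90 |
| TUNA-2 | 7B | 0.99 | 0.96 | 0.80 | 0.91 | 0.84 | 0.76 | 0.87 |
| 🌟 Lance (Ours) | 3B | 1.00 | 0.94 | 0.84 | 0.97 | 0.87 | 0.81 | 0.90 |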

indicates methods that use LLM rewriters for prompt rewriting before generation.

GEdit-Bench Evaluation
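| Models | # Params. | BC | CA | MM | MC | PB | ST | SA | SR | SRp | TM | TT | Avg/G_O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation-only Models | | | | | | | | | | | | | |
| Gemini 2.0 | - | - | - | - | - | - | - | - | - | - | - | - | 6.32 |
| GPT Image 1 | - | 6.96 | 6.85 | 7.10 | 5.41 | 6.74 | 7.44 | 7.51 | 8.73 | 8.55 | 8.45 | 8.69 | 7.49 |
| Qwen-Image-Edit | 20B | 8.23 | 8.30 | 7.33 | 8.05 | 7.49 | 6.74 | 8.57 | 8.09 | 8.29 | 8.48 | 8.50 | 8.01 |
| Unified Models | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 3.43 | 4.27 | 3.08 | 2.77 | 4.74 | 5.19 | 4.44 | 3.80 | 4.38 | 2.68 | 4.20 | 3.91 |
| Ovis-U1 | 1.2B | 7.49 | 6.88 | 6.21 | 4.79 | 5.98 | 6.46 | 7.49 | 7.25 | 7.27 | 4.48 | 6.31 | 6.42 |
| BAGEL | 7B | 7.32 | 6.91 | 6.38 | 4.75 | 4.57 | 6.15 | 7.90 | 7.16 | 7.02 | 7.32 | 6.22 | 6.52 |
| InternVL-U | 1.7B | 7.08 | 7.05 | 6.38 | 7.02 | 6.03 | 6.27 | 7.13 | 6.55 | 6.33 | 6.59 | 6.85 | 6.66 |
| InternVL-U (w/ CoT) | 1.7B | 7.05 | 7.87 | 6.50 | 6.99 | 5.77 | 6.10 | 7.33 | 7.16 | 7.12 | 7.36 | 6.46 | 6.88 |
| 🌟 Lance (Ours) | 3B | 7.73 | 7.74 | 7.28 | 7.83 | 7.50 | 7.03 | 7.64 | 7.85 | 7.71 | 4.46 | 7.57 | 7.30 |

VBench Evaluation (Video Generation)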
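| Type | Model | # Params. | Total Score ↑ |
|---|---|---|---|
| Gen. Only | ModelScope | 1.7B | 75.75 |
| Gen. Only | LaVie | 3B | 77.08 |
| Gen. Only | Show-1 | 6B | 78.93 |
| Gen. Only | AnimateDiff-V2 | - | 80.27 |
| Gen. Only | VideoCrafter-2.0 | - | 80.44 |
| Gen. Only | CogVideoX | 5B | 81.61 |
| Gen. Only | Kling | - | 81.85 |
| Gen. Only | Open-Sora-2.0 | - | 81.71 |
| Gen. Only | Gen-3 | - | 82.32 |
| Gen. Only | Step-Video-T2V | 30B | 81.83 |
| Gen. Only | Hunyuan Video | - | 83.43 |
| Gen. Only | Wan2.1-T2V | 14B | 83.69 |
| Unified | HaproOmni | 7B | 78.10 |
| Unified | Emu3 | 8B | 80.96 |
| Unified | VILA-U | 7B | 74.01 |
| Unified | Show-o2 | 2B | 81.34 |
| Unified | TUNA | 1.5B | 84.06 |
| Unified | 🌟 Lance (Ours) | 3B | 85.11 |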

Running Benchmarks

Ready-to-run benchmark scripts are provided under benchmarks/:

| Benchmark | Modality | Script |
|---|---|---|
| GenEVAL (image gen) | Image | benchmarks/image_gen/GenEVAL/sample_GenEVAL.sh |
| DPG (image gen) | Image | benchmarks/image_gen/DPG/sample_DPG.sh |
| GEdit (image edit) | Image | benchmarks/image_gen/GEdit/sample_GEdit.sh |
| VBench (video gen) | Video | benchmarks/video_gen/Vbench/sample_vbench.sh |

📄 License

Copyright 2025 ByteDance Ltd. and/or its affiliates.

🙏 Acknowledgements

We would like to thank the contributors of BAGEL, Qwen2.5-VL-3B-Instruct, and Wan2.2 for their open research and contributions.

💖 Citation

If you find Lance useful for your project or research, please consider starring 🌟 this repo and citing our work using the following BibTeX:

@misc{fu2026lanceunifiedmultimodalmodeling,
      title         = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
      author        = {Fengyi Fu and Mengqi Huang and Shaojin Wu and Yunsheng Jiang and Yufei Huo and Hao Li and Yinghang Song and Fei Ding and Jianzhu Guo and Qian He and Zheren Fu and Zhendong Mao and Yongdong Zhang},
      year          = {2026},
      eprint        = {2605.18678},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url           = {https://arxiv.org/abs/2605.18678},
}

📞 Contact

For questions, issues, or collaborations, please contact Mengqi Huang and Jianzhu Guo.
