
Standalone Inference Helper

This folder contains a portable inference helper for:

sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700

It is not a full vendored copy of Wan or FastVideo. Instead, it bundles the sparse FP4 attention backend as an overlay, together with a runner script; applying the overlay to a FastVideo checkout or installation lets the uploaded checkpoint be used for normal inference.

Contents

  • run_inference.py: downloads (or loads a local copy of) transformer/diffusion_pytorch_model.safetensors from yitongl/sparse_quant_exp and runs FastVideo's VideoGenerator.
  • run.sh: convenience wrapper that installs the overlay into FASTVIDEO_ROOT and then runs run_inference.py.
  • install_overlay.py: copies the bundled sparse FP4 backend files into a FastVideo checkout/install (see the sketch after this list).
  • overlay_files/: the exact runtime source files needed by the SPARSE_FP4_OURS_P_ATTN backend.
  • training_attention_settings.json: structured settings for the uploaded checkpoint.
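
For orientation, the overlay install step boils down to copying overlay_files/ over the matching paths in a FastVideo tree. A minimal sketch of that idea, assuming the overlay mirrors FastVideo's layout (the real install_overlay.py may map paths differently and do extra validation):

import os
import shutil
from pathlib import Path

# Hypothetical sketch: mirror overlay_files/ into the FastVideo tree,
# preserving relative paths. Not the shipped install_overlay.py.
overlay_root = Path(__file__).parent / "overlay_files"
fastvideo_root = Path(os.environ["FASTVIDEO_ROOT"])

for src in overlay_root.rglob("*"):
    if src.is_file():
        dst = fastvideo_root / src.relative_to(overlay_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # overwrite the existing file, keep metadata
        print(f"installed {dst}")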

Expected Environment

  • A working FastVideo Python environment.
  • FastVideo dependencies installed, including PyTorch, Triton, safetensors, and Hugging Face Hub.
  • Access to the base model Wan-AI/Wan2.1-T2V-1.3B-Diffusers.
  • A CUDA GPU supported by the custom Triton kernels.
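
A quick preflight that checks these prerequisites can look like the following (illustrative only; not part of the helper):

# Illustrative preflight check; not shipped with the helper.
import importlib.util

for mod in ("torch", "triton", "safetensors", "huggingface_hub", "fastvideo"):
    assert importlib.util.find_spec(mod) is not None, f"missing dependency: {mod}"

import torch
assert torch.cuda.is_available(), "the custom Triton kernels need a CUDA GPU"
print("CUDA device:", torch.cuda.get_device_name(0))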

Usage

On a machine with this HF repo downloaded:

export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
  --output-path outputs/sfp4_checkpoint_700 \
  --seed 1000

The script sets:

FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN
FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1

and downloads the uploaded checkpoint-700 transformer weights unless --weights is provided.
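
In Python terms, what run.sh arranges is roughly the following (a sketch; the actual flag handling lives in run_inference.py):

import os
from huggingface_hub import hf_hub_download

# Sketch of what run.sh sets up before invoking run_inference.py.
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "SPARSE_FP4_OURS_P_ATTN"
os.environ["FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O"] = "1"

# Default weight resolution: pull the checkpoint-700 transformer weights
# from the HF repo (skipped when --weights points at a local file).
weights_path = hf_hub_download(
    repo_id="yitongl/sparse_quant_exp",
    filename="transformer/diffusion_pytorch_model.safetensors",
)
print(weights_path)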

To use a local safetensors file:

export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
  --weights /path/to/diffusion_pytorch_model.safetensors \
  --prompt "your prompt"

Attention Semantics

  • Self-attention uses SPARSE_FP4_OURS_P_ATTN.
  • Q/K/V use FP4 fake quantization with STE (straight-through estimator); a generic sketch follows this list.
  • VSA tile size is 4 x 4 x 4 = 64 tokens.
  • Selected sparse tiles use group-local P quantization in the Triton kernel.
  • Dropped tiles use tile mean compensation.
  • Cross-attention falls back to dense SDPA and is not sparse/FP4.
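
To make the Q/K/V point concrete, here is a generic PyTorch sketch of FP4 (E2M1) fake quantization with a straight-through estimator. It illustrates the technique only; the real kernel quantizes group-locally inside Triton, and its exact scaling and grouping are not reproduced here:

import torch

# E2M1 FP4 magnitudes; adding a sign bit yields the full FP4 value grid.
_FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant_ste(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize x onto the FP4 grid; gradients pass straight through."""
    levels = _FP4_LEVELS.to(device=x.device, dtype=x.dtype)
    scale = x.abs().amax().clamp(min=1e-8) / levels[-1]  # map max |x| to 6.0
    mag = (x / scale).abs()
    # snap each magnitude to the nearest representable FP4 level
    idx = (mag.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    q = torch.sign(x) * levels[idx] * scale
    # STE: forward pass uses q, backward treats quantization as identity
    return x + (q - x).detach()

This sketch uses a single per-tensor scale for brevity; the backend's group-local P quantization instead scales within each selected tile group.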

Checkpoint

The transformer weights currently on the repo's main branch come from checkpoint-700:

transformer/diffusion_pytorch_model.safetensors

SHA256 of the local file, recorded when preparing this helper:

4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2
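
To verify that a downloaded file matches this digest (plain Python, not part of the helper):

import hashlib
import sys

EXPECTED = "4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2"

h = hashlib.sha256()
with open(sys.argv[1], "rb") as f:  # path to the .safetensors file
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print("OK" if h.hexdigest() == EXPECTED else "MISMATCH", h.hexdigest())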