OmniNFT

Modality-wise Omni Diffusion Negative-aware Fine-Tuning for Joint Audio and Video Generation

HuggingFace · arXiv · Project Page


🔈 News

  • [2026-05-21] A Comfy-compatible format is now available.
  • [2026-05-19] LTX-2.3 is now supported 🚀. LoRA weights for LTX-2.3 are available!
  • [2026-05-13] OmniNFT is released on arXiv.
  • [2026-05-11] Code and LoRA weights for LTX-2 are available.

🏗️ Method Overview

Modality-wise Advantage Routing: Instead of collapsing all rewards into a single global advantage, OmniNFT computes an independent advantage for each reward (video, audio, and cross-modal synchronization) and routes each one to the generation branch responsible for it. Uni-modal advantages supervise only their own branch, while the synchronization advantage is broadcast to both. This resolves the advantage inconsistency in which roughly half of the samples receive opposing rewards across modalities.
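
The routing idea can be sketched in a few lines of plain Python. This is an illustration only, not the OmniNFT implementation: the function names are hypothetical, the group-wise mean/std normalization is an assumed GRPO-style choice, and adding the synchronization advantage to each branch is one assumed reading of "broadcast to both".

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize one reward over a group of samples (assumed normalization)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def route_advantages(video_rewards, audio_rewards, sync_rewards):
    """Return per-sample advantages for the video branch and the audio branch."""
    a_video = group_advantages(video_rewards)
    a_audio = group_advantages(audio_rewards)
    a_sync = group_advantages(sync_rewards)
    # Uni-modal advantages stay on their own branch; sync is sent to both.
    video_branch = [v + s for v, s in zip(a_video, a_sync)]
    audio_branch = [a + s for a, s in zip(a_audio, a_sync)]
    return video_branch, audio_branch


# Toy group of two samples: sample 0 has good video but poor audio.
video_adv, audio_adv = route_advantages(
    video_rewards=[0.9, 0.1],
    audio_rewards=[0.2, 0.8],
    sync_rewards=[0.5, 0.5],
)
print(video_adv, audio_adv)
```

In this toy group, sample 0 receives a positive advantage on the video branch and a negative one on the audio branch, which is the kind of opposing signal a single global advantage would blur together.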

Layer-wise Gradient Surgery: Video-branch gradients can leak into the shallow audio layers, which are dedicated to intra-modal generation, and this creates a gradient imbalance. OmniNFT applies a partial stop-gradient to the audio key-value projections in A2V cross-attention at shallow Transformer blocks. This suppresses the erroneous gradient injection while preserving full gradient flow through the deeper cross-modal alignment layers (the AV-Sync Zone).
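
A minimal sketch of the layer-wise schedule, under stated assumptions: the text above does not say whether "partial" means a fractional gradient or a full stop on part of the path, so this sketch models it as a multiplier (0.0 would be a full stop-gradient). The function name, the block count, the shallow/deep boundary, and the multiplier value are all hypothetical.

```python
def audio_kv_grad_scale(block_idx, num_shallow_blocks, shallow_scale):
    """Gradient multiplier for the audio K/V projections of one A2V block.

    All three parameters are hypothetical knobs, not names from the OmniNFT code.
    """
    if block_idx < num_shallow_blocks:
        return shallow_scale  # shallow block: damp gradients from the video branch
    return 1.0  # deeper block (AV-Sync Zone): keep the full gradient


# Toy setting: 8 blocks, the first 3 treated as shallow, 10% of the gradient kept.
scales = [
    audio_kv_grad_scale(i, num_shallow_blocks=3, shallow_scale=0.1)
    for i in range(8)
]

# In an autograd framework the multiplier s can be applied without changing the
# forward value, e.g. in PyTorch: kv = s * kv + (1 - s) * kv.detach()
print(scales)
```

The forward pass is unchanged by this trick because `s * kv + (1 - s) * kv` equals `kv`; only the gradient reaching the audio projections is scaled by `s`.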

Region-wise Loss Reweighting: OmniNFT uses V2A cross-attention maps from late denoising steps as an intrinsic proxy for the sound-emitting regions of the video. It aggregates these maps into per-token importance weights that modulate the video-branch RL loss. This gives fine-grained credit assignment that concentrates optimization on the regions most critical for audio-video synchronization, without requiring external detection modules.
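
The reweighting step can be illustrated as follows. This is a sketch under assumptions, not the authors' code: attention maps are represented as nested lists indexed `[step][audio_token][video_token]`, the aggregation is a plain sum over the late steps and audio tokens, and the weights are normalized to a mean of 1. All names are hypothetical.

```python
def token_weights(attn_maps, num_late_steps):
    """Aggregate late-step attention maps into one weight per video token."""
    late = attn_maps[-num_late_steps:]
    num_video = len(late[0][0])
    scores = [0.0] * num_video
    for step_map in late:
        for audio_row in step_map:
            for j, a in enumerate(audio_row):
                scores[j] += a
    mean_score = sum(scores) / num_video
    # Normalize so the average weight is 1 (assumed convention).
    return [s / mean_score for s in scores]


def weighted_loss(per_token_loss, weights):
    """Mean of the per-token video loss after importance weighting."""
    return sum(w * l for w, l in zip(weights, per_token_loss)) / len(weights)


# Two denoising steps, one audio token, four video tokens.
attn_maps = [
    [[0.25, 0.25, 0.25, 0.25]],  # early step: diffuse attention
    [[0.70, 0.10, 0.10, 0.10]],  # late step: focused on video token 0
]
weights = token_weights(attn_maps, num_late_steps=1)
loss = weighted_loss([1.0, 1.0, 1.0, 1.0], weights)
print(weights, loss)
```

Because the weights average to 1, a uniform per-token loss keeps its overall scale, while tokens that the late-step attention singles out contribute more to the gradient.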


⚡ Installation

conda create -n omninft python=3.11
conda activate omninft
pip install -r requirements.txt

📦 Model Checkpoints

| Env Variable | Description | Source |
| --- | --- | --- |
| LTX-MODEL | LTX base model | LTX-2, LTX-2.3 |
| OmniNFT_LTX | LTX + OmniNFT | OmniNFT |
| REWARD_MODELS | All reward models (HPSv3, CLAP, AudioBox, Synchformer, ImageBind, etc.) | OmniNFT-Reward-Series |

🚀 Training

Step 0: Download Reward Models

Download all reward model weights from HuggingFace:

huggingface-cli download --resume-download zghhui/OmniNFT-Reward-Series --local-dir Omni_Reward_Series

Reward model checkpoints under Omni_Reward_Series/:

| Env Variable | Path | Description |
| --- | --- | --- |
| HPSV3_CKPT_PATH | Omni_Reward_Series/HPSv3/HPSv3.safetensors | HPSv3 image quality scorer |
| VIDEOALIGN_CKPT_DIR | Omni_Reward_Series/VideoReward | VideoAlign video quality scorer |
| AUDIOBOX_CKPT | Omni_Reward_Series/audiobox-aesthetics/checkpoint.pt | AudioBox aesthetics predictor |
| CLAP_CKPT | Omni_Reward_Series/CLAP | CLAP audio-text alignment model |
| IMAGEBIND_CKPT | Omni_Reward_Series/ImageBind/imagebind_huge.pth | ImageBind multimodal embeddings |
| SYNCHFORMER_CKPT | Omni_Reward_Series/synchformer/synchformer_state_dict.pth | Synchformer AV sync scorer |

All paths are pre-configured in bash_train_omninft_ltx_fsdp.sh as relative paths.
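
Before launching training, it can help to confirm that the download produced every file the training script expects. The helper below is a hypothetical convenience script, not part of the repository; it only checks that the relative paths listed above exist under a given root directory.

```python
from pathlib import Path

# Relative checkpoint paths as listed in this README.
REWARD_CKPTS = {
    "HPSV3_CKPT_PATH": "Omni_Reward_Series/HPSv3/HPSv3.safetensors",
    "VIDEOALIGN_CKPT_DIR": "Omni_Reward_Series/VideoReward",
    "AUDIOBOX_CKPT": "Omni_Reward_Series/audiobox-aesthetics/checkpoint.pt",
    "CLAP_CKPT": "Omni_Reward_Series/CLAP",
    "IMAGEBIND_CKPT": "Omni_Reward_Series/ImageBind/imagebind_huge.pth",
    "SYNCHFORMER_CKPT": "Omni_Reward_Series/synchformer/synchformer_state_dict.pth",
}


def missing_checkpoints(root="."):
    """Return the env variable names whose checkpoint path does not exist."""
    return [
        name
        for name, rel_path in REWARD_CKPTS.items()
        if not (Path(root) / rel_path).exists()
    ]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing reward checkpoints:", ", ".join(missing))
    else:
        print("All reward checkpoints found.")
```

Run it from the repository root, the same directory from which the relative paths in the training script are resolved.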

Step 1: Launch Reward Servers

HPSv3 and VideoAlign run as remote HTTP servers. Start them before training:

# Terminal 1: HPSv3 server
bash flow_grpo/server/run_remote_hpsv3.sh

# Terminal 2: VideoAlign server
bash flow_grpo/server/run_remote_videoalign.sh

Step 2: Single- or Multi-Node Training

bash bash_train_omninft_ltx_fsdp.sh branch_aware_layer_surgery_avweight

🎬 Inference

Step 1: Merge LoRA into base model

After training, merge the LoRA weights into the base checkpoint:

python scripts/merge_lora.py \
    --checkpoint-path $LTX_MODEL_PATH \
    --lora-dir $OUTPUT_DIR/checkpoint-latest/lora \
    --output-path ./merged_model.safetensors \
    --dtype bf16

Arguments:

| Argument | Description |
| --- | --- |
| --checkpoint-path | LTX-Video base checkpoint used during training |
| --lora-dir | LoRA output directory (contains adapter_model.safetensors + adapter_config.json) |
| --output-path | Output path for the merged model |
| --dtype | Output precision: bf16 (default) / fp16 / fp32 / keep |
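
For readers unfamiliar with LoRA merging, the standard update is W' = W + (alpha / r) · B · A, where A and B are the low-rank adapter matrices, r is the rank, and alpha is the LoRA scaling factor. The sketch below shows that arithmetic on a toy 2×2 weight using plain Python lists. It illustrates the general technique only and makes no claim about how scripts/merge_lora.py is implemented; the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA adapter into a base weight: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (2, 2)
lora_a = [[1.0, 2.0]]              # adapter A, shape (rank=1, 2)
lora_b = [[0.5], [0.0]]            # adapter B, shape (2, rank=1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, the adapter files are no longer needed at inference time, which is why the generation step takes a single merged .safetensors file.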

Step 2: Generate audio-video

python scripts/inference.py \
    --model_path ./merged_model.safetensors \
    --gemma_path $GEMMA_MODEL_PATH \
    --prompt "A man plays acoustic guitar on a wooden stage, warm applause from the audience" \
    --seed 42 \
    --output_dir ./results

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --model_path | (required) | Path to merged .safetensors model |
| --gemma_path | env GEMMA_MODEL_PATH | Path to Gemma 3 text encoder |
| --prompt | (required) | Text prompt for generation |
| --num_frames | 121 | Number of video frames |
| --height / --width | model default | Video resolution |
| --num_inference_steps | model default | Number of denoising steps |
| --video_guidance_scale | model default | Video CFG scale |
| --audio_guidance_scale | model default | Audio CFG scale |
| --seed | 42 | Random seed |
| --no_audio | false | Disable audio generation |
| --dtype | bf16 | Inference precision |

Outputs are saved to --output_dir: .mp4 (video with audio) and .wav (audio only).

🖊️ Citation

@article{zhang2026omninft,
  title={OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation},
  author={Zhang, Guohui and Ma, XiaoXiao and Huang, Jie and Xu, Hang and Yu, Hu and Fu, Siming and Li, Yuming and Xue, Zeyue and Song, Lin and Huang, Haoyang and Duan, Nan and Zhao, Feng},
  journal={arXiv preprint arXiv:2605.12480},
  year={2026}
}

🀝 Acknowledgements

LTX-2 · DiffusionNFT


⚠️ License

Research use only. See individual submodule licenses (HPSv3, ImageBind, LTX-Video, etc.) for their terms.

