OmniVAE Release Assets

This Hugging Face repository contains the release checkpoints, text encoders, evaluation data, and metric weights for the OmniVAE GitHub repository.

The public generation weights in this release cover text-to-audio-video (T2AV) only. The VAE weights are included because the T2AV pipeline requires them.

This repository intentionally contains no source code. To use the bundle with the GitHub code, set OMNIVAE_RELEASE_ROOT to the local download path.

Download

pip install -U huggingface_hub
huggingface-cli download zhanjun/OmniVAE \
  --repo-type model \
  --local-dir /path/to/omnivae_release \
  --local-dir-use-symlinks False

git clone https://github.com/JunZhan2000/OmniVAE.git
cd OmniVAE
export OMNIVAE_RELEASE_ROOT=/path/to/omnivae_release
export OPEN_SOURCE_ROOT="${OMNIVAE_RELEASE_ROOT}"

Equivalently, from inside the OmniVAE checkout, the two exports can be replaced by:

source scripts/setup_release_root.sh /path/to/omnivae_release

Layout

omnivae_release/
  manifest.json
  models/
    text_encoder/              frozen Qwen3.5 text encoder used by T2AV inference
    vae/                       OmniVAE checkpoints
    dit/t2av/                  T2AV generation checkpoints
  eval/
    data/t2av/                 validation manifests and small reference data
    models/t2av/               metric weights for T2AV evaluation
    models/vae/fvd/            rFVD metric weight for VAE reconstruction evaluation

All config and metadata paths in the bundle are relative to its root (the download directory), so the code can load the bundle from any local path.
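The relative-path convention can be illustrated with a small shell helper. This is a minimal sketch, not part of the OmniVAE code: the function name omnivae_path is hypothetical, and the only assumption is that OMNIVAE_RELEASE_ROOT points at the downloaded bundle.

```shell
# Hypothetical helper (not shipped with OmniVAE): resolve a release-relative
# path such as "models/vae/video_only/recon" against OMNIVAE_RELEASE_ROOT.
omnivae_path() {
  local rel="$1"
  if [ -z "${OMNIVAE_RELEASE_ROOT:-}" ]; then
    echo "OMNIVAE_RELEASE_ROOT is not set" >&2
    return 1
  fi
  # Drop a trailing slash on the root and a leading slash on the relative
  # part so the joined path never contains "//".
  printf '%s/%s\n' "${OMNIVAE_RELEASE_ROOT%/}" "${rel#/}"
}
```

For example, ls "$(omnivae_path models/vae/video_only/recon)" lists that checkpoint directory wherever the bundle was downloaded.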

Text Encoders

The release includes the frozen Qwen3.5 text encoder directory used by the validated T2AV configs:

models/text_encoder/Qwen3.5-0.8B-Base

It is a third-party dependency, not an additional OmniVAE checkpoint. Bundling it preserves the exact inference and evaluation settings used for the released results.

Checkpoints

VAE

models/vae/video_only/recon
models/vae/video_only/recon_distill
models/vae/audio_only/recon
models/vae/audio_only/recon_distill
models/vae/audio_video/recon_avclip
models/vae/audio_video/recon_distill_avclip
models/vae/audio_only/recon_avclip_ft_decoder
models/vae/audio_only/recon_distill_avclip_ft_decoder

For the AVCLIP T2AV experiment families (t2av_recon_avclip and t2av_recon_distill_avclip), the audio branch uses the audio_only *_ft_decoder checkpoints by default.
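The default audio checkpoint choice can be sketched as a lookup from T2AV package name to VAE path. The helper audio_vae_ckpt is hypothetical. The AVCLIP rows follow the rule above, while mapping the non-AVCLIP families to models/vae/audio_only/recon and recon_distill is an assumption, so treat the GitHub configs as authoritative.

```shell
# Hypothetical helper: choose the audio VAE checkpoint for a T2AV package.
# AVCLIP families -> audio_only *_ft_decoder variants (per this README).
# Non-AVCLIP families -> plain audio_only checkpoints (ASSUMPTION).
audio_vae_ckpt() {
  case "$1" in
    t2av_recon)                echo "models/vae/audio_only/recon" ;;
    t2av_recon_distill)        echo "models/vae/audio_only/recon_distill" ;;
    t2av_recon_avclip)         echo "models/vae/audio_only/recon_avclip_ft_decoder" ;;
    t2av_recon_distill_avclip) echo "models/vae/audio_only/recon_distill_avclip_ft_decoder" ;;
    *) echo "unknown T2AV package: $1" >&2; return 1 ;;
  esac
}
```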

Generation DiT

The public bundle contains T2AV generation packages only:

t2av: t2av_recon, t2av_recon_distill, t2av_recon_avclip, t2av_recon_distill_avclip

Each package is inference-only and excludes optimizer, scheduler, dataloader, random state, and trainer state snapshots.

Evaluation Assets

eval/data/t2av/versebench_minimal: VerseBench validation manifest used by the release smoke and comparison scripts.
eval/models/t2av: T2AV metric weights for the bundled my_eval pipeline.
eval/models/vae/fvd/i3d_torchscript.pt: rFVD metric weight for VAE video reconstruction evaluation.

Quick Validation

cd /path/to/OmniVAE
export OMNIVAE_RELEASE_ROOT=/path/to/omnivae_release

# T2AV release validation on VerseBench set3-large.
bash scripts/release_launchers/run_t2av_release_compare_distributed.sh
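Before launching the comparison, a quick preflight can catch an incomplete download. This sketch is hypothetical and not shipped with OmniVAE: the function name omnivae_preflight is invented, and the assumption that each T2AV package is a subdirectory of models/dit/t2av named after the package is inferred from the Layout and Generation DiT sections above.

```shell
# Hypothetical preflight: report any T2AV release asset missing under
# OMNIVAE_RELEASE_ROOT; returns non-zero if anything is absent.
omnivae_preflight() {
  if [ -z "${OMNIVAE_RELEASE_ROOT:-}" ]; then
    echo "OMNIVAE_RELEASE_ROOT is not set" >&2
    return 1
  fi
  local root="${OMNIVAE_RELEASE_ROOT%/}" missing=0 rel
  # ASSUMPTION: package directory names match the package names listed above.
  for rel in \
      manifest.json \
      models/text_encoder/Qwen3.5-0.8B-Base \
      models/vae \
      models/dit/t2av/t2av_recon \
      models/dit/t2av/t2av_recon_distill \
      models/dit/t2av/t2av_recon_avclip \
      models/dit/t2av/t2av_recon_distill_avclip \
      eval/data/t2av/versebench_minimal \
      eval/models/t2av; do
    if [ ! -e "$root/$rel" ]; then
      echo "missing: $rel" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

For example: omnivae_preflight && bash scripts/release_launchers/run_t2av_release_compare_distributed.sh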

See the GitHub repository README.md and docs/ for full installation, training, inference, and evaluation commands.

License

The OmniVAE release assets are provided under the Apache-2.0 license. Some metric packages and third-party components retain their original licenses; see the corresponding source-code subdirectories in the GitHub repository.
