OmniRetriever-7B

Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation.

OmniRetriever teaser

OmniRetriever-7B is a unified audio-video-text (AVT) retriever that maps any of four input types (text, video, audio, or video+audio) to an embedding in a single shared space in one forward pass. It is released as a LoRA adapter on top of the public WAVE-7B backbone.


TL;DR

We train a unified audio-video-text encoder with fusion-as-teacher distillation and Tuple-InfoNCE. It surpasses the closed Gemini Embedding 2 on a new 12-direction AVT retrieval benchmark and reaches the zero-shot audio–text specialist band on Clotho.


Headline numbers

OmniRetriever-Bench (12-direction AVT, R@1)

| Model | AVG-single | AVG-dual | AVG-all |
|---|---|---|---|
| WAVE-7B (frozen) | 19.27 | 31.37 | 25.32 |
| Omni-Embed-Nemotron (open) | 21.79 | 31.84 | 26.81 |
| Gemini Embedding 2 (closed) | 25.44 | 40.80 | 33.12 |
| OmniRetriever-7B (ours) | 28.63 | 41.05 | 34.84 |

Audio benchmarks (zero-shot R@1)

| Model | Clotho T→A | Clotho A→T | SoundDescs T→A | SoundDescs A→T |
|---|---|---|---|---|
| Omni-Embed-Nemotron (open) | 6.4 | 3.5 | 6.4 | 4.8 |
| Gemini Embedding 2 (closed) | 5.2 | 1.3 | 7.0 | 7.4 |
| OmniRetriever-7B (ours) | 19.1 | 16.1 | 25.0 | 20.7 |

OmniRetriever beats Gemini Embedding 2 by +13.3 to +18.0 R@1 on every audio–text direction.
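The margins quoted above can be recomputed from the two relevant table rows. The short check below copies the R@1 values from the audio table; the dictionary names are only for illustration.

```python
# Zero-shot R@1 values copied from the audio table above.
ours = {"Clotho T->A": 19.1, "Clotho A->T": 16.1,
        "SoundDescs T->A": 25.0, "SoundDescs A->T": 20.7}
gemini = {"Clotho T->A": 5.2, "Clotho A->T": 1.3,
          "SoundDescs T->A": 7.0, "SoundDescs A->T": 7.4}

# Per-direction margin of OmniRetriever-7B over Gemini Embedding 2, in R@1 points.
margins = {name: round(ours[name] - gemini[name], 1) for name in ours}

for name, gain in margins.items():
    print(f"{name}: +{gain:.1f}")
print(f"range: +{min(margins.values()):.1f} to +{max(margins.values()):.1f}")
```

The smallest margin falls on SoundDescs A→T and the largest on SoundDescs T→A.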

Video benchmarks (zero-shot R@1)

| Model | MSR-VTT T→V | MSVD T→V | DiDeMo T→V | VATEX T→V |
|---|---|---|---|---|
| Omni-Embed-Nemotron (open) | 35.8 | 55.8 | 41.9 | 47.5 |
| Gemini Embedding 2 (closed) | 53.9 | 77.1 | 55.6 | 69.4 |
| OmniRetriever-7B (ours) | 47.9 | 65.6 | 45.1 | 58.7 |

Method in one minute

OmniRetriever method overview

Every unified AVT encoder produces, on its forward pass, a joint (T, V, A) embedding z_TVA that is its strongest cross-modally grounded vector. Yet pairwise InfoNCE never uses z_TVA — neither as a supervision target nor as a teacher of its single-modal sub-encoders. OmniRetriever turns z_TVA into a training signal:

| Loss | What it does |
|---|---|
| L_A | Standard pairwise InfoNCE over the three modality pairs (T-V, T-A, V-A). Kept as a stabiliser. |
| L_D | Fusion-as-teacher distillation (main contribution). A stop-gradient copy of the joint z_TVA produced on the forward pass becomes a teacher for every single-modal sub-encoder. Teacher and students share the same backbone, so the audio sub-encoder inherits text–video neighbours that no unimodal teacher can supply. |
| L_T | Tuple-InfoNCE refinement. Supervises z_TVA directly with modality-cycled hard negatives. The shuffled slot cycles deterministically through {T, V, A} so every modality remains discriminative inside the joint vector. |

Final objective: L_A + L_D + L_T with uniform weights, applied to a LoRA fine-tune of WAVE-7B.
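To make the shape of the objective concrete, here is a minimal pure-Python sketch of the three terms under stated assumptions. It is not the released training code: the function names, the temperature value, the averaging over pairs, and the cosine-distance form of L_D are all assumptions for illustration, since this card does not give the exact formulas. L_T itself is not implemented because it requires re-encoding tuples with one slot shuffled; only the deterministic slot schedule is shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.05):
    """One-directional InfoNCE: queries[i] matches keys[i]; other keys are negatives.
    Inputs are lists of L2-normalised vectors. The temperature is an assumed value."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def pairwise_loss(z_t, z_v, z_a):
    """L_A: symmetric InfoNCE over T-V, T-A, V-A (averaging is an assumption)."""
    pairs = [(z_t, z_v), (z_t, z_a), (z_v, z_a)]
    return sum(info_nce(x, y) + info_nce(y, x) for x, y in pairs) / (2 * len(pairs))

def distill_loss(students, z_tva):
    """L_D, assumed form: mean cosine distance from each single-modal embedding
    to the joint embedding. z_tva is treated as a constant, which stands in for
    the stop-gradient on the teacher."""
    terms = [1.0 - dot(s, t) for z in students for s, t in zip(z, z_tva)]
    return sum(terms) / len(terms)

def shuffled_slot(step):
    """L_T schedule: the slot replaced to build hard negatives cycles T, V, A."""
    return ("T", "V", "A")[step % 3]

def total_loss(l_a, l_d, l_t):
    """Uniform weights, as in the final objective."""
    return l_a + l_d + l_t

# Toy batch: three perfectly aligned, mutually orthogonal unit vectors.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
l_a = pairwise_loss(eye, eye, eye)
l_d = distill_loss([eye, eye, eye], eye)
slots = [shuffled_slot(step) for step in range(4)]
print(l_a, l_d, slots)
```

On this toy batch both implemented terms are at or near zero because every modality already agrees with the joint vector; misaligned pairs raise L_A, and single-modal embeddings that drift from z_TVA raise L_D.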


Architecture

Base model: WAVE-7B (≈9.4 B params; Qwen2.5-Omni + BEATs audio encoder)
Adapter: LoRA, rank 16, alpha 32, dropout 0.05
LoRA target: q, k, v projections of every LLM transformer layer
Full-rank trained heads: classify_linear (all-layer fusion head, 28·d → d → d, d=3584), beats_ln, beats_proj (BEATs adaptor)
Total trainable parameters: ≈395 M (~4.2 % of the backbone)
Output embedding: 3,584-d, L2-normalised
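The parameter budget above can be sanity-checked with a little arithmetic. The sketch below counts only the weight matrices of the fusion head (biases are ignored, which is an assumption) and treats 28 as the number of concatenated layer outputs implied by the 28·d input width.

```python
d = 3584          # embedding width from the architecture listing
num_layers = 28   # layer outputs concatenated by the all-layer fusion head

# classify_linear maps 28*d -> d -> d; weight matrices only, biases ignored.
fusion_head_weights = (num_layers * d) * d + d * d
print(f"fusion head: {fusion_head_weights / 1e6:.1f} M weights")

trainable = 395e6   # reported total trainable parameters (approximate)
backbone = 9.4e9    # reported backbone size (approximate)
share = 100 * trainable / backbone
remainder = trainable - fusion_head_weights

print(f"trainable share of backbone: {share:.1f} %")
print(f"left for LoRA and the BEATs adaptor: about {remainder / 1e6:.0f} M")
```

Under these assumptions the fusion head accounts for roughly 372.5 M of the ≈395 M trainable parameters, so the LoRA matrices and the BEATs adaptor together make up only a small remainder.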

How to use

pip install peft transformers accelerate torch decord librosa soundfile
git clone https://github.com/yunzeliu/Omni-Retriever
cd Omni-Retriever
export PYTHONPATH=$PWD/src:$PYTHONPATH

Then in Python:

from omniretriever import OmniRetriever

model = OmniRetriever.from_pretrained(
    base_model="path/to/WAVE-7B",            # local path to the WAVE-7B backbone
    adapter="YunzeLiu/OmniRetriever-7B",     # this repo
    dtype="bfloat16",
)

z_text  = model.encode_text("a dog barking in the rain")
z_video = model.encode_video("clip.mp4")           # frames only
z_audio = model.encode_audio("clip.mp4")           # audio track only
z_av    = model.encode_av("clip.mp4")              # both streams

# All embeddings are L2-normalised; cosine similarity is just a dot product.
print(float(z_text @ z_av.T))

For end-to-end inference + benchmark scoring, see the GitHub repository.
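Because the embeddings are L2-normalised, ranking a gallery against a query reduces to sorting dot products. The sketch below uses plain Python lists as stand-ins for encoder outputs; `top_k` is a hypothetical helper written for this example and is not part of the `omniretriever` package.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, gallery, k=3):
    """Return (index, score) for the k gallery vectors most similar to the query.
    All vectors are assumed L2-normalised, so the dot product is the cosine."""
    scores = [(i, dot(query, g)) for i, g in enumerate(gallery)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-d unit vectors standing in for real 3,584-d embeddings.
gallery = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]   # e.g. video+audio embeddings
query = [0.8, 0.6]                                # e.g. a text embedding

hits = top_k(query, gallery, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

For a real corpus the gallery would normally be stacked into one matrix and scored with a single matrix product, or served from an approximate nearest-neighbour index.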

Manual (peft) loading

from peft import PeftModel
from transformers import AutoProcessor
# load WAVE-7B backbone (see WAVE upstream README)
backbone = ...   # load_wave_backbone("path/to/WAVE-7B")

model = PeftModel.from_pretrained(backbone, "YunzeLiu/OmniRetriever-7B")
model = model.cuda().eval()
processor = AutoProcessor.from_pretrained("path/to/WAVE-7B", trust_remote_code=True)

OmniRetriever-Bench

OmniRetriever-Bench sample triples

The benchmark behind the headline table above is the first 12-direction AVT retrieval benchmark. It contains 3,782 held-out triples on a shared gallery, evaluated across all 6 single-modal and 6 dual-modal directions (T↔V, T↔A, V↔A, T↔AV, A↔TV, V↔AT). All captions start from a Gemini 3.0 Pro draft and are then reviewed and corrected by trained human annotators.

Released as YunzeLiu/OmniRetriever-Bench on HF Datasets.
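As a rough illustration of how a 12-direction R@1 score can be computed, the sketch below enumerates the directions and scores one of them. It assumes a one-to-one pairing in which query i matches gallery item i; `recall_at_1` is a hypothetical helper, and the official scoring script in the GitHub repository may differ in details such as tie handling.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recall_at_1(queries, gallery):
    """Percentage of queries whose top-ranked gallery item is the paired one.
    Assumes queries[i] pairs with gallery[i] and all vectors are L2-normalised.
    Ties are broken in favour of the lowest gallery index."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [dot(q, g) for g in gallery]
        if scores.index(max(scores)) == i:
            hits += 1
    return 100.0 * hits / len(queries)

# Each of the six listed pairings is evaluated in both directions.
pairings = [("T", "V"), ("T", "A"), ("V", "A"), ("T", "AV"), ("A", "TV"), ("V", "AT")]
directions = [(q, g) for a, b in pairings for q, g in ((a, b), (b, a))]
print(len(directions), directions)

# Toy example: the third query is closest to the wrong gallery item.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
gallery = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
score = recall_at_1(queries, gallery)
print(round(score, 2))
```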


Intended use & limitations

Intended use. Cross-modal retrieval over multimodal corpora; building retrieval indices for multimodal RAG; research on AVT representation learning.

Out of scope / prohibited. Biometric identification, recognition, profiling, or surveillance of natural persons in any deployed system — explicitly disallowed by the license.

Known limitations.

  • Audio-language coverage is biased toward English captions.
  • Long-form video: the model crops audio to 8 s and video to 8 frames; not validated on clips longer than ~16 s.
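  • The released embeddings (3,584-d, bf16) are not compression-aware; post-hoc int8 or binary quantisation degrades retrieval quality relative to closed baselines trained with Matryoshka representation learning (MRL) and quantisation-aware training (QAT).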
  • Embeddings can be inverted under the right conditions (see vec2text); consider application-layer access control or representation distortion for sensitive deployments.
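To illustrate what post-hoc compression means in the third limitation above, the sketch below applies two common schemes to toy vectors: symmetric per-vector int8 scaling and sign binarisation. These are generic techniques chosen for illustration, not necessarily the schemes evaluated for this model, and the helper names are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def to_int8(vec):
    """Symmetric per-vector int8 quantisation: the largest |value| maps to 127.
    Assumes the vector is not all zeros."""
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

def to_binary(vec):
    """Sign binarisation: keep one bit per dimension, encoded as +1 / -1."""
    return [1 if x >= 0 else -1 for x in vec]

# Two similar toy vectors standing in for real embeddings.
a = [0.50, -0.10, 0.70, 0.02, -0.50]
b = [0.45, 0.05, 0.72, -0.03, -0.52]

qa, _ = to_int8(a)
qb, _ = to_int8(b)
sim_float = cosine(a, b)
sim_int8 = cosine(qa, qb)
sim_binary = cosine(to_binary(a), to_binary(b))

print("float :", round(sim_float, 4))
print("int8  :", round(sim_int8, 4))
print("binary:", round(sim_binary, 4))
```

In this toy case the int8 similarity stays close to the float value, while the sign codes distort it heavily because the small-magnitude dimensions flip sign between the two vectors. Real embeddings will behave differently, so any compression scheme should be validated on the target retrieval task.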

Citation

@article{liu2026omniretriever,
  title         = {OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation},
  author        = {Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  year          = {2026},
  eprint        = {2605.26641},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2605.26641}
}

License

Apache-2.0 with an additional clause prohibiting deployment for biometric identification of natural persons. By using these weights you agree not to deploy them in any system whose purpose is to identify, recognise, profile, or surveil natural persons by their face, voice, body, gait, or other biometric attribute.

OmniRetriever builds on the WAVE-7B backbone (Qwen2.5-Omni + BEATs); upstream licenses for the backbone apply to the merged model.
