OmniRetriever-7B

Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation.

[Figure: OmniRetriever teaser]

OmniRetriever-7B is a unified audio-video-text (AVT) retriever that maps any of four input types (text, video, audio, or video+audio) to an embedding in a single shared space in one forward pass. It is released as a LoRA adapter on top of the public WAVE-7B backbone.


📄 Paper	arXiv:2605.26641
💻 Code & inference	https://github.com/yunzeliu/Omni-Retriever
🏠 Project page	https://yunzeliu.github.io/OmniRetriever/
📊 Benchmark	`YunzeLiu/OmniRetriever-Bench`

TL;DR

We train a unified audio-video-text encoder with fusion-as-teacher distillation and Tuple-InfoNCE. It surpasses the closed-source Gemini Embedding 2 on a new 12-direction AVT retrieval benchmark and, zero-shot, reaches the range of audio–text specialist models on Clotho.

Headline numbers

OmniRetriever-Bench (12-direction AVT, R@1)

Model	AVG-single	AVG-dual	AVG-all
WAVE-7B (frozen)	19.27	31.37	25.32
Omni-Embed-Nemotron (open)	21.79	31.84	26.81
Gemini Embedding 2 (closed)	25.44	40.80	33.12
OmniRetriever-7B (ours)	28.63	41.05	34.84
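
The AVG-all column is numerically consistent with the unweighted mean of AVG-single and AVG-dual. That reading is an inference from the table, not something this card states, so the short check below treats it as an assumption and reports how closely each row matches.

```python
# (AVG-single, AVG-dual, AVG-all) copied from the table above.
rows = {
    "WAVE-7B (frozen)": (19.27, 31.37, 25.32),
    "Omni-Embed-Nemotron": (21.79, 31.84, 26.81),
    "Gemini Embedding 2": (25.44, 40.80, 33.12),
    "OmniRetriever-7B": (28.63, 41.05, 34.84),
}

# Assumption: AVG-all is the unweighted mean of the two group averages.
max_gap = 0.0
for name, (single, dual, avg_all) in rows.items():
    mean = (single + dual) / 2
    max_gap = max(max_gap, abs(mean - avg_all))
    print(f"{name:22s} mean={mean:.3f}  reported={avg_all:.2f}")

# Lead of OmniRetriever-7B over Gemini Embedding 2 on AVG-all.
lead = rows["OmniRetriever-7B"][2] - rows["Gemini Embedding 2"][2]
print(f"largest gap to reported AVG-all: {max_gap:.3f}")
print(f"AVG-all lead over Gemini Embedding 2: {lead:.2f}")
```

Every row agrees to within rounding of the second decimal place.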

Audio benchmarks (zero-shot R@1)

Model	Clotho T→A	Clotho A→T	SoundDescs T→A	SoundDescs A→T
Omni-Embed-Nemotron (open)	6.4	3.5	6.4	4.8
Gemini Embedding 2 (closed)	5.2	1.3	7.0	7.4
OmniRetriever-7B (ours)	19.1	16.1	25.0	20.7

OmniRetriever beats Gemini Embedding 2 by 13.3 to 18.0 R@1 points on every audio–text direction.
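
The quoted margin range can be recomputed directly from the audio table. The snippet below is only an arithmetic check on the numbers shown above; the dictionary keys are labels chosen for this example.

```python
# Zero-shot R@1 from the audio table above: (Gemini Embedding 2, OmniRetriever-7B).
audio_r1 = {
    "Clotho T->A": (5.2, 19.1),
    "Clotho A->T": (1.3, 16.1),
    "SoundDescs T->A": (7.0, 25.0),
    "SoundDescs A->T": (7.4, 20.7),
}

# Per-direction margin in R@1 points, rounded to the table's precision.
margins = {name: round(ours - gemini, 1) for name, (gemini, ours) in audio_r1.items()}

for name, margin in margins.items():
    print(f"{name:16s} +{margin:.1f}")

smallest = min(margins.values())
largest = max(margins.values())
print(f"range: +{smallest:.1f} to +{largest:.1f}")
```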

Video benchmarks (zero-shot R@1)

Model	MSR-VTT T→V	MSVD T→V	DiDeMo T→V	VATEX T→V
Omni-Embed-Nemotron (open)	35.8	55.8	41.9	47.5
Gemini Embedding 2 (closed)	53.9	77.1	55.6	69.4
OmniRetriever-7B (ours)	47.9	65.6	45.1	58.7

Method in one minute

[Figure: OmniRetriever method overview]

Every unified AVT encoder produces, on its forward pass, a joint (T, V, A) embedding `z_TVA` that is its strongest cross-modally grounded vector. Yet pairwise InfoNCE never uses `z_TVA`, neither as a supervision target nor as a teacher for the single-modal sub-encoders. OmniRetriever turns `z_TVA` into a training signal:

Loss	What it does
`L_A`	Standard pairwise InfoNCE over the three modality pairs (T-V, T-A, V-A). Kept as a stabiliser.
`L_D`	Fusion-as-teacher distillation (main contribution). A stop-gradient copy of the joint `z_TVA` produced on the forward pass becomes a teacher for every single-modal sub-encoder. Teacher and students share the same backbone, so the audio sub-encoder inherits text–video neighbours that no unimodal teacher can supply.
`L_T`	Tuple-InfoNCE refinement. Supervises `z_TVA` directly with modality-cycled hard negatives. The shuffled slot cycles deterministically through `{T, V, A}` so every modality remains discriminative inside the joint vector.
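
The modality-cycled negatives used by `L_T` can be pictured with a toy example. This is a minimal sketch, not the released training code: the function names are hypothetical, the cycle order T, V, A is assumed, and "shuffling" a slot is modelled as borrowing that slot from the next sample in the batch.

```python
SLOTS = ("T", "V", "A")

def shuffled_slot(step: int) -> str:
    """Slot to corrupt at this training step; cycles T -> V -> A -> T ..."""
    return SLOTS[step % len(SLOTS)]

def hard_negative_tuples(batch, step):
    """Build one hard negative per sample by swapping a single slot.

    `batch` is a list of (text, video, audio) items. The chosen slot is
    replaced with the same slot of the next sample (a roll by one), so each
    negative still agrees with its anchor on the other two modalities.
    """
    slot = SLOTS.index(shuffled_slot(step))
    n = len(batch)
    negatives = []
    for i, sample in enumerate(batch):
        corrupted = list(sample)
        corrupted[slot] = batch[(i + 1) % n][slot]
        negatives.append(tuple(corrupted))
    return negatives

batch = [("t0", "v0", "a0"), ("t1", "v1", "a1"), ("t2", "v2", "a2")]
for step in range(3):
    print(shuffled_slot(step), hard_negative_tuples(batch, step)[0])
```

Because only one slot differs from the anchor, the joint embedding has to be sensitive to that modality to tell the two tuples apart, which is the property the table row describes.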

Final objective: `L_A + L_D + L_T` with uniform weights, applied to a LoRA fine-tune of WAVE-7B.
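
A rough NumPy sketch of how `L_A` and `L_D` might be computed and summed is given below. Several parts are assumptions made for illustration: the temperature `tau=0.07`, the symmetric form of InfoNCE, the choice of cross-entropy between in-batch neighbour distributions as the distillation loss, and the averaged stand-in for `z_TVA`. `L_T` is set to zero because its positive-pair construction is not specified in this card. The paper and repository remain the reference for the actual losses.

```python
import numpy as np

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(q, k, tau=0.07):
    """Symmetric InfoNCE; row i of q and row i of k form the positive pair."""
    logits = q @ k.T / tau
    fwd = -np.mean(np.diag(log_softmax(logits)))
    bwd = -np.mean(np.diag(log_softmax(logits.T)))
    return 0.5 * (fwd + bwd)

def pairwise_loss(z_t, z_v, z_a, tau=0.07):
    """L_A: InfoNCE over the three modality pairs (T-V, T-A, V-A)."""
    return info_nce(z_t, z_v, tau) + info_nce(z_t, z_a, tau) + info_nce(z_v, z_a, tau)

def distill(student, teacher, tau=0.07):
    """Assumed form of one L_D term: cross-entropy from the teacher's in-batch
    neighbour distribution to the student's. The teacher is a constant here;
    in an autograd framework it would be a stop-gradient copy."""
    target = np.exp(log_softmax(teacher @ teacher.T / tau))
    pred = log_softmax(student @ teacher.T / tau)
    return -np.mean((target * pred).sum(axis=1))

def fusion_distill_loss(z_t, z_v, z_a, z_tva, tau=0.07):
    """L_D: the joint embedding teaches every single-modal embedding."""
    return sum(distill(s, z_tva, tau) for s in (z_t, z_v, z_a))

def total_loss(l_a, l_d, l_t):
    return l_a + l_d + l_t  # uniform weights, as stated above

rng = np.random.default_rng(0)
z_t, z_v, z_a = (l2_normalise(rng.normal(size=(8, 16))) for _ in range(3))
z_tva = l2_normalise(z_t + z_v + z_a)  # hypothetical stand-in for the fused embedding

l_a = pairwise_loss(z_t, z_v, z_a)
l_d = fusion_distill_loss(z_t, z_v, z_a, z_tva)
print(total_loss(l_a, l_d, l_t=0.0))
```

Under this assumed form, each distillation term is smallest when a student reproduces the teacher's neighbour distribution exactly, which matches the intuition that the audio sub-encoder inherits neighbours from the joint vector.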

Architecture


Base model	WAVE-7B (≈9.4 B params; Qwen2.5-Omni + BEATs audio encoder)
Adapter	LoRA, rank 16, alpha 32, dropout 0.05
LoRA target	`q,k,v` projections of every LLM transformer layer
Full-rank trained heads	`classify_linear` (all-layer fusion head, `28·d → d → d`, `d=3584`), `beats_ln`, `beats_proj` (BEATs adaptor)
Total trainable parameters	≈395 M (~4.2 % of the backbone)
Output embedding	3,584-d, L2-normalised
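
A back-of-the-envelope count shows where most of the ≈395 M trainable parameters sit. The estimate below rests on assumptions that this card does not state: 28 LLM layers (read off the `28·d` head input), key/value projections of width 512 as in the Qwen2.5-7B configuration, and weight matrices only, with biases and the BEATs adaptor left out.

```python
d = 3584        # embedding width, from the table above
n_layers = 28   # assumption: taken from the `28·d` input of the fusion head
rank = 16       # LoRA rank, from the table above

# Fusion head 28·d -> d -> d, counting weight matrices only (biases ignored).
head_params = (n_layers * d) * d + d * d

# Assumption: q maps d -> d, while k and v map d -> 512 (grouped-query attention).
# A LoRA pair on an (in, out) projection adds rank * (in + out) parameters.
kv_dim = 512
lora_per_layer = rank * ((d + d) + 2 * (d + kv_dim))
lora_params = n_layers * lora_per_layer

reported_trainable = 395e6   # from the table above
backbone_params = 9.4e9      # from the table above
share = round(100 * reported_trainable / backbone_params, 1)

print(f"fusion head : {head_params:,}")
print(f"LoRA (q,k,v): {lora_params:,}")
print(f"share of backbone: {share} %")
```

On these assumptions the fusion head accounts for roughly 372.5 M parameters and the LoRA matrices for under 7 M, so the full-rank head dominates the trainable budget. The remainder up to ≈395 M would come from the BEATs adaptor and terms not itemised here.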

How to use

pip install peft transformers accelerate torch decord librosa soundfile
git clone https://github.com/yunzeliu/Omni-Retriever
cd Omni-Retriever
export PYTHONPATH=$PWD/src:$PYTHONPATH

Then in Python:

from omniretriever import OmniRetriever

model = OmniRetriever.from_pretrained(
    base_model="path/to/WAVE-7B",            # local path to the WAVE-7B backbone
    adapter="YunzeLiu/OmniRetriever-7B",     # this repo
    dtype="bfloat16",
)

z_text  = model.encode_text("a dog barking in the rain")
z_video = model.encode_video("clip.mp4")           # frames only
z_audio = model.encode_audio("clip.mp4")           # audio track only
z_av    = model.encode_av("clip.mp4")              # both streams

# All embeddings are L2-normalised; cosine similarity is just a dot product.
print(float(z_text @ z_av.T))

For end-to-end inference + benchmark scoring, see the GitHub repository.
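
Since the embeddings are L2-normalised, ranking a gallery reduces to a matrix-vector product followed by a sort. The sketch below uses random vectors in place of real `encode_*` outputs; `top_k` is a hypothetical helper and not part of the `omniretriever` package.

```python
import numpy as np

def top_k(query, gallery, k=3):
    """Return indices and scores of the k gallery rows most similar to `query`.

    Both inputs are assumed to be L2-normalised, so the dot product equals
    cosine similarity.
    """
    scores = gallery @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)

# Random stand-ins for 100 gallery embeddings of width 3,584.
gallery = rng.normal(size=(100, 3584)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of gallery item 42.
query = gallery[42] + 0.01 * rng.normal(size=3584).astype(np.float32)
query /= np.linalg.norm(query)

idx, scores = top_k(query, gallery)
print(idx, scores)
```

For large galleries an approximate nearest-neighbour index would normally replace the exhaustive product, but the scoring rule stays the same.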

Manual (peft) loading

from peft import PeftModel
from transformers import AutoProcessor
# load WAVE-7B backbone (see WAVE upstream README)
backbone = ...   # load_wave_backbone("path/to/WAVE-7B")

model = PeftModel.from_pretrained(backbone, "YunzeLiu/OmniRetriever-7B")
model = model.cuda().eval()
processor = AutoProcessor.from_pretrained("path/to/WAVE-7B", trust_remote_code=True)

OmniRetriever-Bench

[Figure: OmniRetriever-Bench sample triples]

The benchmark behind the headline table above is the first 12-direction AVT retrieval benchmark: 3,782 held-out triples on a shared gallery, evaluated across all 6 single-modal and 6 dual-modal directions (T↔V, T↔A, V↔A, T↔AV, A↔TV, V↔AT). All captions were drafted by Gemini 3.0 Pro and then reviewed and corrected by trained human annotators.
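
The 12 directions and the R@1 metric can be spelled out in a few lines. This is an illustrative sketch, not the benchmark's scoring script: the direction labels are formatted for this example, and `recall_at_1` assumes that query `i` has gallery item `i` as its single correct match.

```python
from itertools import permutations

SINGLE = ("T", "V", "A")
PAIRED_WITH = {"T": "AV", "A": "TV", "V": "AT"}  # each modality vs. the other two fused

def directions():
    """List the 6 single-modal and 6 dual-modal retrieval directions."""
    single = [f"{q}->{g}" for q, g in permutations(SINGLE, 2)]
    dual = []
    for modality, pair in PAIRED_WITH.items():
        dual += [f"{modality}->{pair}", f"{pair}->{modality}"]
    return single + dual

def recall_at_1(sim):
    """Percentage of queries whose top-scoring gallery item is the true match.

    sim[i][j] is the similarity between query i and gallery item j.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == i
    return 100.0 * hits / len(sim)

sim = [
    [0.9, 0.1, 0.0],  # query 0 retrieves item 0 (hit)
    [0.2, 0.8, 0.3],  # query 1 retrieves item 1 (hit)
    [0.7, 0.1, 0.4],  # query 2 retrieves item 0 (miss)
]
print(directions())
print(f"R@1 = {recall_at_1(sim):.2f}")
```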

Released as `YunzeLiu/OmniRetriever-Bench` on Hugging Face Datasets.

Intended use & limitations

Intended use. Cross-modal retrieval over multimodal corpora; building retrieval indices for multimodal RAG; research on AVT representation learning.

Out of scope / prohibited. Biometric identification, recognition, profiling, or surveillance of natural persons in any deployed system. These uses are explicitly disallowed by the license.

Known limitations.

Audio-language coverage is biased toward English captions.
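Long-form video: the model crops audio to 8 s and video to 8 frames, and it has not been validated on clips longer than ~16 s.
The released embeddings (3,584-d, bf16) are not trained to be compression-aware: post-hoc int8 or binary quantisation degrades retrieval quality more than it does for closed baselines trained with Matryoshka representation learning (MRL) and quantisation-aware training (QAT).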
Embeddings can be inverted under the right conditions (see vec2text); consider application-layer access control or representation distortion for sensitive deployments.
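
For readers unfamiliar with the post-hoc compression mentioned above, the sketch below shows one common way to apply int8 and binary quantisation to a 3,584-d vector. It uses a random vector rather than a real embedding, so it illustrates the mechanics and storage cost only and does not reproduce any retrieval degradation. The helper names and the symmetric per-vector int8 scheme are choices made for this example.

```python
import numpy as np

def to_int8(z):
    """Symmetric per-vector int8 quantisation; returns codes and the scale."""
    scale = np.abs(z).max() / 127.0
    return np.round(z / scale).astype(np.int8), scale

def to_binary(z):
    """One bit per dimension (the sign), packed 8 dimensions per byte."""
    return np.packbits(z > 0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
z = rng.normal(size=3584).astype(np.float32)  # random stand-in for a real embedding
z /= np.linalg.norm(z)

codes, scale = to_int8(z)
z_int8 = codes.astype(np.float32) * scale     # dequantised int8 vector

bits = to_binary(z)
z_bin = np.where(np.unpackbits(bits) > 0, 1.0, -1.0)  # back to +/-1 values

print(f"int8  : {codes.nbytes} bytes, cosine to original {cosine(z, z_int8):.4f}")
print(f"binary: {bits.nbytes} bytes, cosine to original {cosine(z, z_bin):.4f}")
```

A bf16 embedding of this width takes 7,168 bytes, so int8 halves the footprint and binary codes shrink it by a factor of 16, at the cost of a coarser approximation of the original direction.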

Citation

@article{liu2026omniretriever,
  title         = {OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation},
  author        = {Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  year          = {2026},
  eprint        = {2605.26641},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2605.26641}
}

License

Apache-2.0 with an additional clause prohibiting deployment for biometric identification of natural persons. By using these weights you agree not to deploy them in any system whose purpose is to identify, recognise, profile, or surveil natural persons by their face, voice, body, gait, or other biometric attribute.

OmniRetriever builds on the WAVE-7B backbone (Qwen2.5-Omni + BEATs); upstream licenses for the backbone apply to the merged model.


Papers for YunzeLiu/OmniRetriever-7B

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation (arXiv:2605.26641)

Text Embeddings Reveal (Almost) As Much As Text (arXiv:2310.06816, published Oct 10, 2023)