OmniRetriever-7B
Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation.
OmniRetriever-7B is a unified audio-video-text (AVT) retriever that
produces a single shared embedding for any of the four input combinations
(text, video, audio, video+audio) in a single forward pass. It is
released as a LoRA adapter on top of the public WAVE-7B backbone.
| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2605.26641 |
| 💻 Code & inference | https://github.com/yunzeliu/Omni-Retriever |
| 🏠 Project page | https://yunzeliu.github.io/OmniRetriever/ |
| 📊 Benchmark | YunzeLiu/OmniRetriever-Bench |
TL;DR
We train a unified audio-video-text encoder via fusion-as-teacher distillation and Tuple-InfoNCE. It surpasses the closed-source Gemini Embedding 2 on a new 12-direction AVT retrieval benchmark and reaches the performance range of zero-shot audio–text specialist models on Clotho.
Headline numbers
OmniRetriever-Bench (12-direction AVT, R@1)
| Model | AVG-single | AVG-dual | AVG-all |
|---|---|---|---|
| WAVE-7B (frozen) | 19.27 | 31.37 | 25.32 |
| Omni-Embed-Nemotron (open) | 21.79 | 31.84 | 26.81 |
| Gemini Embedding 2 (closed) | 25.44 | 40.80 | 33.12 |
| OmniRetriever-7B (ours) | 28.63 | 41.05 | 34.84 |
Audio benchmarks (zero-shot R@1)
| Model | Clotho T→A | Clotho A→T | SoundDescs T→A | SoundDescs A→T |
|---|---|---|---|---|
| Omni-Embed-Nemotron (open) | 6.4 | 3.5 | 6.4 | 4.8 |
| Gemini Embedding 2 (closed) | 5.2 | 1.3 | 7.0 | 7.4 |
| OmniRetriever-7B (ours) | 19.1 | 16.1 | 25.0 | 20.7 |
OmniRetriever beats Gemini Embedding 2 by +13.3 to +18.0 R@1 on every audio–text direction.
Video benchmarks (zero-shot R@1)
| Model | MSR-VTT T→V | MSVD T→V | DiDeMo T→V | VATEX T→V |
|---|---|---|---|---|
| Omni-Embed-Nemotron (open) | 35.8 | 55.8 | 41.9 | 47.5 |
| Gemini Embedding 2 (closed) | 53.9 | 77.1 | 55.6 | 69.4 |
| OmniRetriever-7B (ours) | 47.9 | 65.6 | 45.1 | 58.7 |
Method in one minute
Every unified AVT encoder produces, on its forward pass, a joint
(T, V, A) embedding z_TVA that is its strongest cross-modally
grounded vector. Yet pairwise InfoNCE never uses z_TVA — neither as a
supervision target nor as a teacher of its single-modal sub-encoders.
OmniRetriever turns z_TVA into a training signal:
| Loss | What it does |
|---|---|
| L_A | Standard pairwise InfoNCE over the three modality pairs (T-V, T-A, V-A). Kept as a stabiliser. |
| L_D | Fusion-as-teacher distillation (main contribution). A stop-gradient copy of the joint z_TVA produced on the forward pass becomes a teacher for every single-modal sub-encoder. Teacher and students share the same backbone, so the audio sub-encoder inherits text–video neighbours that no unimodal teacher can supply. |
| L_T | Tuple-InfoNCE refinement. Supervises z_TVA directly with modality-cycled hard negatives. The shuffled slot cycles deterministically through {T, V, A} so every modality remains discriminative inside the joint vector. |
Final objective: L_A + L_D + L_T with uniform weights, applied to a LoRA
fine-tune of WAVE-7B.
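As a rough illustration of two ingredients named above, the sketch below shows a one-directional InfoNCE over a small batch and a deterministic slot cycle for building corrupted (T, V, A) tuples. It is plain Python for readability; a real implementation would use a tensor library. The temperature value, the function names, and the choice to corrupt a tuple by borrowing the slot from the next sample in the batch are assumptions made for illustration, not the repository's code. L_D is left out because its exact form is not stated in this card.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries, keys, temperature=0.05):
    """Mean InfoNCE loss where queries[i] matches keys[i].

    Inputs are assumed to be L2-normalised vectors; the temperature
    is an illustrative value, not one taken from the paper.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

SLOTS = ("T", "V", "A")

def shuffled_slot(step):
    """Cycle deterministically through T, V, A as training steps advance."""
    return SLOTS[step % len(SLOTS)]

def hard_negative_tuples(batch, step):
    """Corrupt one slot of every (T, V, A) tuple in the batch.

    Hypothetical construction: the chosen slot is replaced by the same
    slot of the next sample, so the other two modalities stay matched.
    """
    slot = shuffled_slot(step)
    n = len(batch)
    negatives = []
    for i, sample in enumerate(batch):
        corrupted = dict(sample)
        corrupted[slot] = batch[(i + 1) % n][slot]
        negatives.append(corrupted)
    return slot, negatives
```

Because only one slot is mismatched in each negative tuple, a joint embedding that ignored that modality could not tell the negative from the positive, which is the intuition behind keeping every modality discriminative.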
Architecture
| Component | Details |
|---|---|
| Base model | WAVE-7B (≈9.4 B params; Qwen2.5-Omni + BEATs audio encoder) |
| Adapter | LoRA, rank 16, alpha 32, dropout 0.05 |
| LoRA target | q,k,v projections of every LLM transformer layer |
| Full-rank trained heads | classify_linear (all-layer fusion head, 28·d → d → d, d=3584), beats_ln, beats_proj (BEATs adaptor) |
| Total trainable parameters | ≈395 M (~4.2 % of the backbone) |
| Output embedding | 3,584-d, L2-normalised |
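A quick back-of-the-envelope check of the figures in the table can be written as follows. It assumes the fusion head consists of two bias-free linear layers with shapes 28·d → d and d → d; the exact layer layout is not given here, so treat the result as an estimate.

```python
d = 3584          # embedding width from the table
num_layers = 28   # number of layer outputs fused by the head

# Assumption: two bias-free linear layers, (28*d -> d) then (d -> d).
fusion_head_weights = num_layers * d * d + d * d

total_trainable = 395e6   # approximate figure from the table
backbone = 9.4e9          # approximate figure from the table

print(f"fusion head weights: {fusion_head_weights:,}")
print(f"share of trainable parameters: {fusion_head_weights / total_trainable:.0%}")
print(f"trainable / backbone: {total_trainable / backbone:.1%}")
```

Under these assumptions the fusion head alone holds about 372.5 M weights, so most of the ≈395 M trainable parameters sit in the full-rank heads rather than in the rank-16 LoRA matrices.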
How to use
pip install peft transformers accelerate torch decord librosa soundfile
git clone https://github.com/yunzeliu/Omni-Retriever
cd Omni-Retriever
export PYTHONPATH=$PWD/src:$PYTHONPATH
Then in Python:
from omniretriever import OmniRetriever
model = OmniRetriever.from_pretrained(
    base_model="path/to/WAVE-7B",          # local path to the WAVE-7B backbone
    adapter="YunzeLiu/OmniRetriever-7B",   # this repo
    dtype="bfloat16",
)
z_text = model.encode_text("a dog barking in the rain")
z_video = model.encode_video("clip.mp4") # frames only
z_audio = model.encode_audio("clip.mp4") # audio track only
z_av = model.encode_av("clip.mp4") # both streams
# All embeddings are L2-normalised; cosine similarity is just a dot product.
print(float(z_text @ z_av.T))
For end-to-end inference + benchmark scoring, see the GitHub repository.
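Since the embeddings are L2-normalised, ranking a gallery reduces to sorting dot products. The sketch below shows this with toy 3-d vectors in plain Python; `l2_normalise` and `top_k` are hypothetical helper names and are not part of the `omniretriever` package.

```python
import math

def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, gallery, k=3):
    """Return (index, score) pairs for the k gallery vectors closest to query."""
    scores = [sum(q * g for q, g in zip(query, vec)) for vec in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy 3-d vectors stand in for the 3,584-d model outputs.
gallery = [l2_normalise(v) for v in ([1, 0, 0], [1, 1, 0], [0, 0, 1])]
query = l2_normalise([0.9, 0.1, 0.0])

results = top_k(query, gallery, k=2)
print(results)
```

For galleries of realistic size, a matrix product or an approximate nearest-neighbour index would replace the Python loop; the ranking rule stays the same.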
Manual (peft) loading
from peft import PeftModel
from transformers import AutoProcessor
# load WAVE-7B backbone (see WAVE upstream README)
backbone = ... # load_wave_backbone("path/to/WAVE-7B")
model = PeftModel.from_pretrained(backbone, "YunzeLiu/OmniRetriever-7B")
model = model.cuda().eval()
processor = AutoProcessor.from_pretrained("path/to/WAVE-7B", trust_remote_code=True)
OmniRetriever-Bench
The benchmark used in the headline table above is the first 12-direction
AVT retrieval benchmark — 3,782 held-out triples on a shared gallery
evaluated across all 6 single-modal and 6 dual-modal directions
(T↔V, T↔A, V↔A, T↔AV, A↔TV, V↔AT). All captions are
reviewed and corrected by trained human annotators starting from a
Gemini 3.0 Pro draft.
Released as YunzeLiu/OmniRetriever-Bench on HF Datasets.
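To make the direction count and the metric concrete, the following sketch enumerates the 12 directions and computes Recall@1 from a query-by-gallery similarity matrix. It assumes query i has exactly one correct gallery item at index i; the helper names are illustrative, and the official scoring script lives in the GitHub repository.

```python
SINGLE = [("T", "V"), ("T", "A"), ("V", "A")]
DUAL = [("T", "AV"), ("A", "TV"), ("V", "AT")]

def directions(pairs):
    """Expand each bidirectional pair into its two retrieval directions."""
    out = []
    for a, b in pairs:
        out += [f"{a}→{b}", f"{b}→{a}"]
    return out

def recall_at_1(sim):
    """Percentage of queries whose top-ranked gallery item is the match.

    sim[i][j] is the similarity between query i and gallery item j;
    the correct item for query i is assumed to be at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return 100.0 * hits / len(sim)

all_directions = directions(SINGLE) + directions(DUAL)
print(len(all_directions), all_directions)

toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],   # the third query ranks item 0 first, so it is a miss
]
print(recall_at_1(toy_sim))
```

In the headline table, the AVG-all value of each row equals the mean of AVG-single and AVG-dual, which is consistent with averaging R@1 over all 12 directions with equal weight.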
Intended use & limitations
Intended use. Cross-modal retrieval over multimodal corpora; building retrieval indices for multimodal RAG; research on AVT representation learning.
Out of scope / prohibited. Biometric identification, recognition, profiling, or surveillance of natural persons in any deployed system — explicitly disallowed by the license.
Known limitations.
- Audio-language coverage is biased toward English captions.
- Long-form video: the model crops audio to 8 s and video to 8 frames; not validated on clips longer than ~16 s.
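- The released embeddings (3,584-d, bf16) are not compression-aware; post-hoc int8 or binary quantisation degrades retrieval quality relative to closed baselines trained with Matryoshka representation learning (MRL) and quantisation-aware training (QAT).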
- Embeddings can be inverted under the right conditions (see vec2text); consider application-layer access control or representation distortion for sensitive deployments.
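For readers who want to measure the compression limitation on their own data, here is a minimal sketch of the two post-hoc schemes mentioned above: per-vector symmetric int8 quantisation and sign-based binarisation with Hamming similarity. These are generic techniques with hypothetical helper names, not something shipped with the model, and the function signatures are assumptions.

```python
def quantise_int8(vec):
    """Symmetric per-vector int8 quantisation; assumes vec is not all zeros."""
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

def binarise(vec):
    """Keep one sign bit per dimension, packed into a Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming_similarity(a, b, dim):
    """Fraction of dimensions whose sign bits agree."""
    return 1.0 - bin(a ^ b).count("1") / dim

codes, scale = quantise_int8([1.0, -1.0, 0.0])
print(codes, scale)
print(hamming_similarity(binarise([0.5, -0.2, 0.1]), binarise([0.4, -0.3, -0.1]), dim=3))
```

Comparing Recall@1 before and after applying either scheme to a held-out gallery gives a direct estimate of the quality loss for a given deployment.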
Citation
@article{liu2026omniretriever,
title = {OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation},
author = {Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
year = {2026},
eprint = {2605.26641},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
doi = {10.48550/arXiv.2605.26641}
}
License
Apache-2.0 with an additional clause prohibiting deployment for biometric identification of natural persons. By using these weights you agree not to deploy them in any system whose purpose is to identify, recognise, profile, or surveil natural persons by their face, voice, body, gait, or other biometric attribute.
OmniRetriever builds on the WAVE-7B backbone (Qwen2.5-Omni + BEATs); upstream licenses for the backbone apply to the merged model.