OEA-Nemo3B-AC

OEA-Nemo3B-AC is a LoRA + projection-head checkpoint for the Omni-Embed-Audio (OEA) retrieval encoder presented in:

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang (Sogang University)
Proceedings of ACL 2026 (Oral)
Code & UIQ benchmark: https://github.com/JudeJiwoo/Omni-Embed-Audio
Web demo: https://omni-embed-audio.github.io

Model summary

| Field | Value |
|---|---|
| Base model | `nvidia/omni-embed-nemotron-3b` (Omni-Embed-Nemotron-3B, ~3B params) |
| Trained on | AudioCaps (train split) |
| Embedding dim | 512 (L2-normalized) |
| Audio sample rate | 16 kHz mono |
| Trainable parameters | LoRA adapters + 2 projection heads (~11–16M) |
| Backbone | Frozen |
| Checkpoint file | `step_40.pt` (PyTorch state dict) |
| License (this checkpoint) | MIT; the underlying base model is governed by its own license |

What's inside `step_40.pt`

import torch

# Load on CPU just to inspect the checkpoint's top-level keys
ckpt = torch.load("step_40.pt", map_location="cpu")
print(ckpt.keys())
# dict_keys(['lora_state_dict', 'audio_head', 'text_head', ...])

- `lora_state_dict`: LoRA adapters (r=16, α=32, dropout=0.05) attached to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `qkv`, `out_proj` of the Omni-Embed-Nemotron-3B backbone.
- `audio_head`, `text_head`: modality-specific 512-d `ProjectionHead` modules (Linear → Dropout → LayerNorm → L2-normalize).
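
The layer order listed for the projection heads (Linear → Dropout → LayerNorm → L2-normalize) can be illustrated with a framework-free NumPy sketch of the inference-time forward pass. This is not the repository's `ProjectionHead` class: the function name `projection_head_forward`, the parameter names, the `eps` value, and the toy dimensions are all assumptions made for illustration, and dropout is treated as the identity, as it would be in eval mode.

```python
import numpy as np

def projection_head_forward(x, weight, bias, gamma, beta, eps=1e-5):
    """Illustrative inference-time pass: Linear -> (Dropout off) -> LayerNorm -> L2-normalize."""
    h = x @ weight.T + bias                      # Linear: (batch, d_in) -> (batch, d_out)
    # Dropout is the identity at inference time, so nothing happens here.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)          # biased variance, as LayerNorm uses
    h = (h - mean) / np.sqrt(var + eps) * gamma + beta
    norm = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.clip(norm, 1e-12, None)        # unit-length rows

# Toy sizes only; the real heads map the backbone hidden size to 512.
rng = np.random.default_rng(0)
d_in, d_out = 32, 8
x = rng.standard_normal((3, d_in))
weight = rng.standard_normal((d_out, d_in))
bias = np.zeros(d_out)
gamma, beta = np.ones(d_out), np.zeros(d_out)

out = projection_head_forward(x, weight, bias, gamma, beta)
print(out.shape, np.linalg.norm(out, axis=-1))
```

Because the last step normalizes each row to unit length, a dot product between an audio embedding and a text embedding is already their cosine similarity.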

The base model weights are not redistributed here; they are loaded from nvidia/omni-embed-nemotron-3b on first use.
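
To make the LoRA hyperparameters listed above more concrete, here is a small NumPy sketch of how a low-rank adapter modifies one frozen linear layer. It assumes the standard LoRA formulation with scaling α/r (the PEFT default), uses toy layer sizes rather than the backbone's real dimensions, and omits the training-time dropout on the adapter branch. The helper `lora_linear` is hypothetical and is not part of the repository.

```python
import numpy as np

R, ALPHA = 16, 32            # rank and alpha listed for this checkpoint
SCALING = ALPHA / R          # standard LoRA scaling (assumed, PEFT default)

def lora_linear(x, w_frozen, lora_a, lora_b, scaling=SCALING):
    """y = x W^T + scaling * (x A^T) B^T, with W frozen and only A, B trained."""
    return x @ w_frozen.T + scaling * (x @ lora_a.T) @ lora_b.T

rng = np.random.default_rng(0)
d_in, d_out = 64, 48                          # toy sizes, not the real model's
w = rng.standard_normal((d_out, d_in))        # frozen base weight
a = rng.standard_normal((R, d_in)) * 0.01     # A: (r, d_in)
b = np.zeros((d_out, R))                      # B: (d_out, r), zero-initialized
x = rng.standard_normal((2, d_in))

y = lora_linear(x, w, a, b)                   # equals x @ w.T while B is zero
extra_params = R * (d_in + d_out)             # trainable parameters added to this layer
print(SCALING, extra_params)
```

Each adapted layer adds only r·(d_in + d_out) trainable parameters, which is why the checkpoint stays small relative to the frozen backbone.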

Quick start

Install dependencies (see the repository for the full pinned list):

pip install torch torchaudio transformers peft huggingface_hub soundfile librosa

Minimal encoding example (text & audio → 512-d L2-normalized embeddings):

import torch
from types import SimpleNamespace
from huggingface_hub import hf_hub_download

# Cloned from https://github.com/JudeJiwoo/Omni-Embed-Audio
from AudioRetrieval.models.omni_embed_adapter import OmniEmbedAdapter
from AudioRetrieval.training.oea.train_omniembed_lora import ProjectionHead, attach_lora

device = "cuda" if torch.cuda.is_available() else "cpu"

ckpt_path = hf_hub_download("JudeJiwoo/OEA-Nemo3B-AC", filename="step_40.pt")

adapter = OmniEmbedAdapter(
    repo_id="nvidia/omni-embed-nemotron-3b",
    device=device,
    passage_prefix="passage:",
    query_prefix="query:",
)

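# Re-create the LoRA adapters with the hyperparameters this checkpoint was trained with
lora_cfg = SimpleNamespace(
    lora_rank=16, lora_alpha=32, lora_dropout=0.05,
    lora_targets=["q_proj","k_proj","v_proj","o_proj","qkv","out_proj"],
)
peft_model = attach_lora(adapter.get_underlying_model(), lora_cfg)
adapter.set_underlying_model(peft_model)

# Load the trained adapter weights; strict=False tolerates keys absent from the LoRA-only state dict
ckpt = torch.load(ckpt_path, map_location=device)
peft_model.load_state_dict(ckpt["lora_state_dict"], strict=False)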

hidden = peft_model.config.text_config.hidden_size
audio_head = ProjectionHead(hidden, 512, 0.1).to(device).eval()
text_head  = ProjectionHead(hidden, 512, 0.1).to(device).eval()
audio_head.load_state_dict(ckpt["audio_head"])
text_head.load_state_dict(ckpt["text_head"])

with torch.inference_mode():
    # Backbone embeddings come back as NumPy arrays, one row per input
    a = adapter.encode_audio(["sample.wav"])
    t = adapter.encode_text(["A clock ticks once a second as it runs."])
    # Project each modality into the shared 512-d space
    a_emb = audio_head(torch.from_numpy(a).to(device)).cpu().numpy()
    t_emb = text_head(torch.from_numpy(t).to(device)).cpu().numpy()
print(a_emb.shape, t_emb.shape)   # (1, 512) (1, 512)
print((t_emb @ a_emb.T).item())   # cosine similarity (already L2-normalized)

A ready-to-run version is at examples/encode_example.py (or the matching .ipynb) in the public repo — it encodes five bundled Clotho clips and one query per UIQ type out of the box.
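
Because the embeddings are L2-normalized, ranking a collection of clips against a text query reduces to a matrix product followed by a sort. The following is a minimal NumPy sketch of that step, not code from the repository: `top_k_audio` and `l2_normalize` are hypothetical helpers, and the toy 4-d vectors stand in for real 512-d embeddings.

```python
import numpy as np

def l2_normalize(x):
    """Scale each vector (last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_audio(text_emb, audio_embs, k=3):
    """Return indices and scores of the k clips most similar to one text query.

    text_emb:   (d,) unit vector
    audio_embs: (n, d) matrix of unit vectors
    """
    scores = audio_embs @ text_emb        # dot product == cosine for unit vectors
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]       # highest similarity first
    return order, scores[order]

# Toy stand-ins for projected embeddings
audio_embs = l2_normalize(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]))
text_emb = l2_normalize(np.array([0.1, 0.9, 0.3, 0.0]))

order, scores = top_k_audio(text_emb, audio_embs, k=2)
print(order.tolist(), scores)
```

With the quick-start outputs, `t_emb[0]` would play the role of `text_emb` and the stacked rows of `a_emb` for a whole collection would play the role of `audio_embs`.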

Related checkpoints

| Repo | Base model | Trained on |
|---|---|---|
| JudeJiwoo/OEA-Qwen3B-Cl | Qwen2.5-Omni-3B | Clotho |
| JudeJiwoo/OEA-Qwen3B-AC | Qwen2.5-Omni-3B | AudioCaps |
| JudeJiwoo/OEA-Qwen7B-Cl | Qwen2.5-Omni-7B | Clotho |
| JudeJiwoo/OEA-Qwen7B-AC | Qwen2.5-Omni-7B | AudioCaps |
| JudeJiwoo/OEA-Nemo3B-Cl | Omni-Embed-Nemotron-3B | Clotho |
| JudeJiwoo/OEA-Nemo3B-AC | Omni-Embed-Nemotron-3B | AudioCaps |

-Cl checkpoints are recommended for evaluation on Clotho, and -AC checkpoints for AudioCaps (Section 4 of the paper).

Intended use & limitations

Intended use. Open-vocabulary text-to-audio / audio-to-text / text-to-text retrieval; UIQ-style robustness studies (questions, imperatives, paraphrases, keyword tags, exclusion queries). Both encoders share a single multimodal-LLM backbone, so OEA is also useful for embedding-space analyses across modalities.

Out of scope. Generative captioning, ASR, music transcription, speaker identification, or any safety-critical downstream task. The model was trained on environmental-sound captioning data and has not been evaluated on speech, music, or biomedical audio retrieval.

Limitations. The model inherits the biases of the Omni-Embed-Nemotron-3B backbone and of the AudioCaps caption distribution. Performance on languages other than English and on out-of-distribution audio (long-form speech, music, multilingual content) has not been measured.

License

This LoRA + projection-head checkpoint is released under the MIT License. The underlying base model (nvidia/omni-embed-nemotron-3b) is governed by its own license — review the upstream model card before redistribution.

Citation

@inproceedings{yoo2026omniembedaudio,
  title     = {Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval},
  author    = {Yoo, HaeJun and Shin, Yongseop and Lee, Insung and Koo, Myoung-Wan and Chang, Du-Seong},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  note      = {Oral presentation},
  year      = {2026}
}

Acknowledgments

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2025-25441313, Professional AI Talent Development Program for Multimodal AI Agents, Contribution: 50%). This research was also supported by the MSIT, Korea, under the Top-Tier AI Global HRD invitation program (RS-2025-25461932) supervised by the IITP.
