OEA-Nemo3B-Cl
OEA-Nemo3B-Cl is a LoRA + projection-head checkpoint for the Omni-Embed-Audio (OEA) retrieval encoder presented in:
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang (Sogang University)
Proceedings of ACL 2026 (Oral)
Code & UIQ benchmark: https://github.com/JudeJiwoo/Omni-Embed-Audio
Web demo: https://omni-embed-audio.github.io
Model summary
| Field | Value |
|---|---|
| Base model | nvidia/omni-embed-nemotron-3b (Omni-Embed-Nemotron-3B, ~3B params) |
| Trained on | Clotho (train split) |
| Embedding dim | 512 (L2-normalized) |
| Audio sample rate | 16 kHz mono |
| Trainable parameters | LoRA adapters + 2 projection heads (~11–16M) |
| Backbone | Frozen |
| Checkpoint file | step_40.pt (PyTorch state dict) |
| License (this checkpoint) | MIT; the underlying base model is governed by its own license |
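This card does not say whether the adapter resamples audio internally, so that is worth checking in the repository. If you prepare waveforms yourself, the sketch below shows one way to convert an in-memory array to 16 kHz mono. `to_mono_16k` is a hypothetical helper, not part of the OEA code, and it assumes SciPy is installed (SciPy is not in the dependency list of this card).

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sample rate listed in the model summary


def to_mono_16k(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Hypothetical helper: downmix to mono and resample to 16 kHz.

    waveform has shape (samples,) or (samples, channels).
    """
    x = np.asarray(waveform, dtype=np.float32)
    if x.ndim == 2:
        # Average the channels to get a mono signal.
        x = x.mean(axis=1)
    if sr != TARGET_SR:
        # Polyphase resampling with the smallest integer up/down factors.
        g = gcd(TARGET_SR, sr)
        x = resample_poly(x, TARGET_SR // g, sr // g)
    return x.astype(np.float32)


# One second of synthetic stereo audio at 44.1 kHz.
sr = 44_100
t = np.arange(sr) / sr
stereo = np.stack(
    [np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)], axis=1
)
mono = to_mono_16k(stereo, sr)
print(mono.shape, mono.dtype)  # (16000,) float32
```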
What's inside step_40.pt
import torch

ckpt = torch.load("step_40.pt", map_location="cpu")
ckpt.keys()
# dict_keys(['lora_state_dict', 'audio_head', 'text_head', ...])
- lora_state_dict: LoRA adapters (r=16, α=32, dropout=0.05) attached to q_proj, k_proj, v_proj, o_proj, qkv, out_proj of the Omni-Embed-Nemotron-3B backbone.
- audio_head, text_head: modality-specific 512-d ProjectionHead (Linear → Dropout → LayerNorm → L2-normalize).
The base model weights are not redistributed here; they are loaded from
nvidia/omni-embed-nemotron-3b on first use.
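To make the Linear → Dropout → LayerNorm → L2-normalize chain concrete, here is a minimal NumPy sketch of an inference-time forward pass. It is an illustration only, not the repository's `ProjectionHead`: the function name, the parameter names, the random weights, and the toy input width of 64 are all assumptions, and dropout is treated as the identity because it is inactive in eval mode.

```python
import numpy as np


def projection_head_forward(x, weight, bias, gamma, beta, eps=1e-5):
    """Illustrative eval-mode forward pass (hypothetical, not the repo class)."""
    # Linear: project backbone features to the embedding dimension.
    h = x @ weight.T + bias
    # Dropout: identity at inference time, so nothing to do here.
    # LayerNorm over the feature axis (biased variance, like torch.nn.LayerNorm).
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h = (h - mean) / np.sqrt(var + eps) * gamma + beta
    # L2-normalize so that dot products between embeddings are cosine similarities.
    norm = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.maximum(norm, 1e-12)


rng = np.random.default_rng(0)
hidden, out_dim = 64, 512  # 64 is a toy width; 512 matches the card
features = rng.normal(size=(2, hidden))
weight = rng.normal(size=(out_dim, hidden)) * 0.02
bias = np.zeros(out_dim)
gamma, beta = np.ones(out_dim), np.zeros(out_dim)

emb = projection_head_forward(features, weight, bias, gamma, beta)
print(emb.shape, np.linalg.norm(emb, axis=1))
```

Each output row has unit norm, which is why the quick-start code can treat a plain matrix product as cosine similarity.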
Quick start
Install dependencies (see the repository for the full pinned list):
pip install torch torchaudio transformers peft huggingface_hub soundfile librosa
Minimal encoding example (text & audio → 512-d L2-normalized embeddings):
import torch
from types import SimpleNamespace
from huggingface_hub import hf_hub_download
# Cloned from https://github.com/JudeJiwoo/Omni-Embed-Audio
from AudioRetrieval.models.omni_embed_adapter import OmniEmbedAdapter
from AudioRetrieval.training.oea.train_omniembed_lora import ProjectionHead, attach_lora
device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_path = hf_hub_download("JudeJiwoo/OEA-Nemo3B-Cl", filename="step_40.pt")
adapter = OmniEmbedAdapter(
    repo_id="nvidia/omni-embed-nemotron-3b",
    device=device,
    passage_prefix="passage:",
    query_prefix="query:",
)
lora_cfg = SimpleNamespace(
    lora_rank=16, lora_alpha=32, lora_dropout=0.05,
    lora_targets=["q_proj", "k_proj", "v_proj", "o_proj", "qkv", "out_proj"],
)
peft_model = attach_lora(adapter.get_underlying_model(), lora_cfg)
adapter.set_underlying_model(peft_model)
ckpt = torch.load(ckpt_path, map_location=device)
peft_model.load_state_dict(ckpt["lora_state_dict"], strict=False)
hidden = peft_model.config.text_config.hidden_size
audio_head = ProjectionHead(hidden, 512, 0.1).to(device).eval()
text_head = ProjectionHead(hidden, 512, 0.1).to(device).eval()
audio_head.load_state_dict(ckpt["audio_head"])
text_head.load_state_dict(ckpt["text_head"])
with torch.inference_mode():
    a = adapter.encode_audio(["sample.wav"])
    t = adapter.encode_text(["A clock ticks once a second as it runs."])
    a_emb = audio_head(torch.from_numpy(a).to(device)).cpu().numpy()
    t_emb = text_head(torch.from_numpy(t).to(device)).cpu().numpy()
print(a_emb.shape, t_emb.shape) # (1, 512) (1, 512)
print((t_emb @ a_emb.T).item()) # cosine similarity (already L2-normalized)
A ready-to-run version is at
examples/encode_example.py
(or the matching .ipynb) in the public repo; it encodes five bundled Clotho
clips and one query per UIQ type out of the box.
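Once embeddings are L2-normalized, text-to-audio retrieval reduces to sorting dot products. The sketch below uses random placeholder vectors in place of real model outputs; `l2_normalize` and `rank_audio` are hypothetical helpers written for this illustration, not functions from the repository.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def rank_audio(text_emb, audio_embs, top_k=3):
    """Return (index, score) pairs for the top_k clips, best first.

    text_emb: (D,) unit vector; audio_embs: (N, D) unit vectors.
    """
    scores = audio_embs @ text_emb  # cosine similarity per clip
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


rng = np.random.default_rng(0)
# Placeholder "audio" embeddings standing in for five encoded clips.
audio_embs = l2_normalize(rng.normal(size=(5, 512)))
# Placeholder "text" query: a slightly perturbed copy of clip 2.
query = l2_normalize(audio_embs[2] + 0.01 * rng.normal(size=512))

ranked = rank_audio(query, audio_embs)
for idx, score in ranked:
    print(f"clip {idx}: {score:.3f}")
```

With real embeddings, `audio_embs` would be the stacked audio-head outputs and `query` a single text-head output, both produced as in the quick-start code.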
Related checkpoints
| Repo | Base model | Trained on |
|---|---|---|
| JudeJiwoo/OEA-Qwen3B-Cl | Qwen2.5-Omni-3B | Clotho |
| JudeJiwoo/OEA-Qwen3B-AC | Qwen2.5-Omni-3B | AudioCaps |
| JudeJiwoo/OEA-Qwen7B-Cl | Qwen2.5-Omni-7B | Clotho |
| JudeJiwoo/OEA-Qwen7B-AC | Qwen2.5-Omni-7B | AudioCaps |
| JudeJiwoo/OEA-Nemo3B-Cl | Omni-Embed-Nemotron-3B | Clotho |
| JudeJiwoo/OEA-Nemo3B-AC | Omni-Embed-Nemotron-3B | AudioCaps |
-Cl checkpoints are recommended for evaluation on Clotho, and -AC
checkpoints for AudioCaps (Section 4 of the paper).
Intended use & limitations
Intended use. Open-vocabulary text-to-audio / audio-to-text / text-to-text retrieval; UIQ-style robustness studies (questions, imperatives, paraphrases, keyword tags, exclusion queries). Both encoders share a single multimodal-LLM backbone, so OEA is also useful for embedding-space analyses across modalities.
Out of scope. Generative captioning, ASR, music transcription, speaker identification, or any safety-critical downstream task. The model was trained on environmental-sound captioning data; it is not evaluated on speech-, music-, or biomedical-audio retrieval.
Limitations. Inherits biases of the Omni-Embed-Nemotron-3B backbone and of the Clotho caption distribution. Performance on languages other than English and on out-of-distribution audio (long-form speech, music, multilingual content) is not measured.
License
This LoRA + projection-head checkpoint is released under the MIT License.
The underlying base model (nvidia/omni-embed-nemotron-3b) is governed by its own license;
review the upstream model card before redistribution.
Citation
@inproceedings{yoo2026omniembedaudio,
title = {Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval},
author = {Yoo, HaeJun and Shin, Yongseop and Lee, Insung and Koo, Myoung-Wan and Chang, Du-Seong},
booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
note = {Oral presentation},
year = {2026}
}
Acknowledgments
This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2025-25441313, Professional AI Talent Development Program for Multimodal AI Agents, Contribution: 50%). This research was also supported by the MSIT, Korea, under the Top-Tier AI Global HRD invitation program (RS-2025-25461932) supervised by the IITP.