Elephant Embeddings V1 Multimodal Large
elephant-embeddings-v1-multimodal-large is the large multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.
This ModelScope release is maintained by agentic-intelligence-lab to make Elephant embedding models easier to download and deploy in mainland China. It mirrors and renames the upstream HuggingFace model llm-semantic-router/multi-modal-embed-large under a consistent Elephant model namespace.
Positioning
This model is a production-oriented multimodal embedding model for semantic routing, retrieval, and cross-modal matching across text, image, and audio.
It is not a generative chat or captioning model. Instead, it maps different modalities into one shared embedding space so agent systems can compare requests, screenshots, documents, and audio records with the same retrieval interface.
Model at a glance
| Item | Value |
|---|---|
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Architecture | Custom PyTorch tri-encoder |
| Text encoder | llm-semantic-router/mmbert-embed-32k-2d-matryoshka |
| Image encoder | google/siglip2-so400m-patch14-384 |
| Audio encoder | openai/whisper-medium |
| Embedding dimension | 768 |
| Max text length | 32,768 tokens |
| Objective | Cached multiple negatives ranking loss |
| Upstream source | llm-semantic-router/multi-modal-embed-large |
| License | Apache 2.0 |
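The objective named in the table is a contrastive, in-batch ranking loss. As a rough illustration only (this is not the training code from this release), the plain, un-cached form can be sketched in pure Python. The scale of 20.0 and the toy 2-d vectors are assumptions, and the function name is hypothetical. As commonly implemented, the "cached" variant changes how gradients are accumulated so larger batches fit in memory; it is not expected to change the loss definition.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two anchors with either correctly paired or swapped positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is closest to its own positive and grows when the pairing is wrong, which is the behavior a retrieval or routing model needs.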
Why it fits agentic workloads
Agentic products increasingly need to retrieve and route over mixed inputs: user text, screenshots, UI states, documents, voice notes, support calls, and multimodal memory. This model is designed for that operating pattern.
Key advantages:
- Shared semantic space: text, images, and audio can be compared with cosine similarity.
- Routing-grade representation: optimized for retrieval, matching, and routing rather than generation.
- Strong modality towers: uses dedicated text, image, and audio encoders instead of forcing all modalities through a single monolithic checkpoint.
- Long-context text path: supports long tool descriptions, traces, and knowledge chunks through the text encoder.
- Production packaging: includes the custom source package needed to construct and run the tri-encoder.
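Because all modalities land in one space, routing can reduce to a nearest-neighbor lookup by cosine similarity. The sketch below uses toy 3-d vectors in place of real 768-d embeddings; the route names, the `route` helper, and the 0.5 threshold are all hypothetical and would need tuning on real data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(query_embedding, route_embeddings, min_similarity=0.5):
    """Return (route_name, score) for the closest route,
    or (None, score) when nothing clears the threshold."""
    best_name, best_score = None, -1.0
    for name, embedding in route_embeddings.items():
        score = cosine(query_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return None, best_score
    return best_name, best_score


# Toy stand-ins: route descriptions embedded as text, query embedded from a screenshot.
routes = {
    "billing": [0.9, 0.1, 0.0],
    "tech_support": [0.0, 0.2, 0.9],
}
screenshot_embedding = [0.8, 0.3, 0.1]

name, score = route(screenshot_embedding, routes)
print(name, round(score, 3))
```

Returning `None` below the threshold gives the agent a way to fall back to a default workflow instead of forcing a weak match.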
Recommended use cases
| Scenario | Example |
|---|---|
| Multimodal RAG | Retrieve text notes using an image or audio query |
| Agent routing | Route screenshots, user text, or voice requests to the right tool or workflow |
| Memory search | Search mixed text/image/audio memory stores in one vector space |
| Support and operations | Match tickets, screenshots, logs, and recorded calls semantically |
| Offline indexing | Build high-quality 768d multimodal indexes |
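The memory-search and offline-indexing scenarios can be sketched as a brute-force top-k search over a store that mixes modalities. Everything here is illustrative: `MemoryRecord`, `search`, and the toy 3-d vectors are assumptions, and a real deployment would likely use a vector database rather than a Python loop.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    record_id: str
    modality: str  # "text", "image", or "audio"
    embedding: list


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def search(query_embedding, records, top_k=2):
    """Return the top_k (score, record) pairs, highest cosine similarity first."""
    query = l2_normalize(query_embedding)
    scored = []
    for record in records:
        candidate = l2_normalize(record.embedding)
        score = sum(q * c for q, c in zip(query, candidate))
        scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# One store, three modalities, one similarity function.
store = [
    MemoryRecord("note-1", "text", [0.9, 0.1, 0.0]),
    MemoryRecord("shot-7", "image", [0.7, 0.6, 0.1]),
    MemoryRecord("call-3", "audio", [0.0, 0.1, 0.9]),
]
hits = search([1.0, 0.2, 0.0], store, top_k=2)
for score, record in hits:
    print(record.record_id, record.modality, round(score, 3))
```

Keeping the modality as metadata next to each vector lets callers filter results (for example, audio only) without needing a separate index per modality.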
Quick start on ModelScope
pip install modelscope torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile
import json
import os
import sys

import torch
import torch.nn.functional as F
from modelscope import snapshot_download

# Download the repository and expose the packaged source on sys.path.
repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large"
local_dir = snapshot_download(repo_id)
sys.path.insert(0, os.path.join(local_dir, "src"))

from hf_st_mm.data import PairItem
from hf_st_mm.model import MultiModalSentenceEmbedder

# Build the tri-encoder from the shipped configuration.
with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
    cfg = json.load(handle)

model = MultiModalSentenceEmbedder(
    text_encoder_name=cfg["model"]["text_encoder_name"],
    image_encoder_name=cfg["model"]["image_encoder_name"],
    audio_encoder_name=cfg["model"]["audio_encoder_name"],
    embedding_dim=int(cfg["model"]["embedding_dim"]),
    max_text_length=int(cfg["model"]["max_text_length"]),
)

# Load the exported weights and switch to inference mode.
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Encode one item per modality into the shared embedding space.
items = [
    PairItem(modality="text", value="route this request to the billing workflow"),
    PairItem(modality="image", value="/path/to/screenshot.png"),
    PairItem(modality="audio", value="/path/to/call.wav"),
]
with torch.no_grad():
    embeddings = model.encode_items(items)
print(embeddings.shape)  # [3, 768]

# Cross-modal similarity: text query against an audio candidate.
query = PairItem(modality="text", value="refund request for a wrong charge")
candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")
with torch.no_grad():
    embs = model.encode_items([query, candidate])
similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
print(f"similarity={similarity:.4f}")
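The quick start hardcodes the modality of each item. In a pipeline that receives mixed inputs, a small helper can choose the modality before the items are built. This is a sketch under assumptions: `guess_modality` and the extension sets are hypothetical, and this card does not state which file formats the packaged loaders accept.

```python
import os

# Assumed extension sets; adjust to whatever the packaged loaders accept.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def guess_modality(value):
    """Classify an input string as "image", "audio", or "text" by file extension."""
    extension = os.path.splitext(value)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "image"
    if extension in AUDIO_EXTENSIONS:
        return "audio"
    return "text"


inputs = [
    "route this request to the billing workflow",
    "/path/to/screenshot.png",
    "/path/to/call.WAV",
]
pairs = [(guess_modality(value), value) for value in inputs]
print(pairs)

# With the packaged code on sys.path, each pair could then be wrapped as
# PairItem(modality=modality, value=value), matching the quick start.
```

Extension sniffing misclassifies free text that happens to end in a file name (for example, a sentence ending in "report.png"), so an explicit modality from the caller is safer when it is available.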
Evaluation snapshot
| Metric | Value |
|---|---|
| Eval loss | 0.389702 |
| Eval top1 | 0.861707 |
The validation metrics come from the tri-encoder cached retrieval validation path used during export. They are intended as a release sanity snapshot rather than a public leaderboard claim.
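This card does not define "Eval top1" further. A common reading, assumed in this sketch, is top-1 retrieval accuracy: the fraction of queries whose highest-scoring candidate is their paired item. The function name and the toy similarity matrix are illustrative only.

```python
def top1_accuracy(similarity):
    """similarity[i][j] scores query i against candidate j;
    candidate i is assumed to be the correct match for query i."""
    correct = 0
    for i, row in enumerate(similarity):
        best_index = max(range(len(row)), key=row.__getitem__)
        if best_index == i:
            correct += 1
    return correct / len(similarity)


# Toy matrix: queries 0 and 1 retrieve their pair, query 2 does not.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.4],
]
accuracy = top1_accuracy(toy_similarity)
print(round(accuracy, 3))
```

Under this reading, the reported value would depend on how many candidates each query is ranked against, which is one reason the figure works as a sanity check rather than a benchmark result.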
Files
| File | Description |
|---|---|
| model.pt | Exported PyTorch weights |
| config.json | Tri-encoder and training/export configuration |
| src/hf_st_mm/ | Python package used to construct and run the model |
| README.md | This model card |
Lineage
This ModelScope package is published by agentic-intelligence-lab as part of the Elephant model release line. It mirrors the upstream HuggingFace model llm-semantic-router/multi-modal-embed-large and keeps the model artifacts unchanged except for the repository naming and model card presentation.
Limitations
- This is a custom PyTorch tri-encoder export, not a standard Transformers auto-class checkpoint.
- Inference relies on the packaged hf_st_mm source code.
- Image and audio inputs are expected as local file paths in the simple inference path.
- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
- Reported validation metrics come from an internal cached retrieval validation set.
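Since the simple inference path expects local file paths, in-memory payloads such as an uploaded screenshot or a streamed recording have to be written to disk first. A minimal sketch using the standard library follows; `bytes_to_temp_path` is a hypothetical helper and the payload is placeholder bytes, not a decodable image.

```python
import os
import tempfile


def bytes_to_temp_path(payload, suffix):
    """Write raw bytes to a temporary file and return its path.
    The caller is responsible for deleting the file afterwards."""
    file_descriptor, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(file_descriptor, "wb") as handle:
        handle.write(payload)
    return path


placeholder_png = b"\x89PNG\r\n\x1a\n"  # PNG signature only, for illustration
path = bytes_to_temp_path(placeholder_png, ".png")
try:
    # In real use, the path would go into PairItem(modality="image", value=path).
    size = os.path.getsize(path)
    print(path.endswith(".png"), size)
finally:
    os.remove(path)
```

Cleaning up in a `finally` block keeps temporary files from accumulating when encoding raises an error.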
Citation
@misc{elephant-embeddings-v1-multimodal-large,
title={Elephant Embeddings V1 Multimodal Large},
author={Agentic Intelligence Lab},
year={2026},
url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large}
}
License
Apache 2.0