Elephant Embeddings V1 Multimodal Large

elephant-embeddings-v1-multimodal-large is the large multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.

This ModelScope release is maintained by agentic-intelligence-lab to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-large and republishes it under a consistent Elephant model namespace.

Positioning

This model is a production-oriented multimodal embedding model for semantic routing, retrieval, and cross-modal matching across text, image, and audio.

It is not a generative chat or captioning model. Instead, it maps different modalities into one shared embedding space so agent systems can compare requests, screenshots, documents, and audio records with the same retrieval interface.

Model at a glance

Item	Value
Family	Elephant Embeddings V1
Maintainer	Agentic Intelligence Lab
Model type	Multimodal embedding model
Modalities	Text, image, audio
Architecture	Custom PyTorch tri-encoder
Text encoder	`llm-semantic-router/mmbert-embed-32k-2d-matryoshka`
Image encoder	`google/siglip2-so400m-patch14-384`
Audio encoder	`openai/whisper-medium`
Embedding dimension	768
Max text length	32,768 tokens
Objective	Cached multiple negatives ranking loss
Upstream source	`llm-semantic-router/multi-modal-embed-large`
License	Apache 2.0
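
The objective named in the table can be illustrated with a small NumPy sketch of the plain multiple negatives ranking loss. This is not the training code behind this release, which the card does not document: the function name and the `scale=20.0` value (the sentence-transformers default) are assumptions. The cached variant is understood to compute the same loss while processing the batch in smaller chunks to save memory.

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives loss: row i of `positives` is the match for row i
    of `anchors`; every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # [batch, batch] scaled cosine similarities
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true pair) as the target class.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: perfectly aligned pairs versus deliberately shuffled pairs.
aligned = np.eye(3)
loss_aligned = multiple_negatives_ranking_loss(aligned, aligned)
loss_shuffled = multiple_negatives_ranking_loss(aligned, np.roll(aligned, 1, axis=0))
```

The aligned batch yields a loss close to zero, while the shuffled batch is penalized heavily because each anchor's true pair scores lower than one of its in-batch negatives.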

Why it fits agentic workloads

Agentic products increasingly need to retrieve and route over mixed inputs: user text, screenshots, UI states, documents, voice notes, support calls, and multimodal memory. This model is designed for that operating pattern.

Key advantages:

Shared semantic space: text, images, and audio can be compared with cosine similarity.
Routing-grade representation: optimized for retrieval, matching, and routing rather than generation.
Strong modality towers: uses dedicated text, image, and audio encoders instead of forcing all modalities through a single monolithic checkpoint.
Long-context text path: the text encoder accepts up to 32,768 tokens, so long tool descriptions, traces, and knowledge chunks can be embedded in one pass.
Production packaging: includes the custom source package needed to construct and run the tri-encoder.
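
As a minimal sketch of the shared-space comparison described above, the snippet below scores one text query against candidates from other modalities with cosine similarity. The vectors are made-up 4-dimensional placeholders rather than real model outputs (real embeddings have 768 dimensions), and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity_matrix(queries: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return the [num_queries, num_candidates] cosine similarity matrix."""
    # L2-normalize rows so the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return q @ c.T


# Placeholder vectors standing in for 768-d embeddings (illustrative only).
text_query = np.array([[1.0, 0.0, 1.0, 0.0]])
candidates = np.array([
    [2.0, 0.0, 2.0, 0.0],  # e.g. an audio clip on the same topic
    [0.0, 1.0, 0.0, 1.0],  # e.g. an unrelated screenshot
])

scores = cosine_similarity_matrix(text_query, candidates)
best = int(scores.argmax(axis=1)[0])  # index of the closest candidate
```

Because every modality lands in the same space, the same function works regardless of which encoder produced each row.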

Recommended use cases

Scenario	Example
Multimodal RAG	Retrieve text notes using an image or audio query
Agent routing	Route screenshots, user text, or voice requests to the right tool or workflow
Memory search	Search mixed text/image/audio memory stores in one vector space
Support and operations	Match tickets, screenshots, logs, and recorded calls semantically
Offline indexing	Build 768-dimensional multimodal indexes offline
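
The agent routing scenario can be sketched as a nearest-route lookup. The example assumes each tool or workflow is represented by one embedding (for instance, of its description) and that requests below a similarity threshold fall back to a default path. The helper `route_request`, the route names, the `min_score=0.5` threshold, and the 3-dimensional vectors are all illustrative assumptions, not part of the released package.

```python
import numpy as np


def route_request(query_emb, route_embs, route_names, min_score=0.5):
    """Pick the route whose embedding is most similar to the query.

    Returns (route_name, score); route_name is None when no route
    reaches min_score, so the caller can fall back to a default path.
    """
    q = query_emb / np.linalg.norm(query_emb)
    r = route_embs / np.linalg.norm(route_embs, axis=1, keepdims=True)
    scores = r @ q  # cosine similarity of the query to every route
    best = int(np.argmax(scores))
    score = float(scores[best])
    return (route_names[best] if score >= min_score else None), score


# Placeholder route embeddings (real ones would be 768-d model outputs).
route_names = ["billing", "tech_support"]
route_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

name, score = route_request(np.array([0.9, 0.1, 0.0]), route_embs, route_names)
fallback, low_score = route_request(np.array([0.0, 0.0, 1.0]), route_embs, route_names)
```

A suitable threshold depends on the data and should be tuned on held-out requests rather than taken from this sketch.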

Quick start on ModelScope

pip install modelscope torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile

import json
import os
import sys

import torch
import torch.nn.functional as F
from modelscope import snapshot_download

repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large"
local_dir = snapshot_download(repo_id)

# Make the packaged hf_st_mm source importable before importing from it.
sys.path.insert(0, os.path.join(local_dir, "src"))

from hf_st_mm.data import PairItem
from hf_st_mm.model import MultiModalSentenceEmbedder

with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
    cfg = json.load(handle)

model = MultiModalSentenceEmbedder(
    text_encoder_name=cfg["model"]["text_encoder_name"],
    image_encoder_name=cfg["model"]["image_encoder_name"],
    audio_encoder_name=cfg["model"]["audio_encoder_name"],
    embedding_dim=int(cfg["model"]["embedding_dim"]),
    max_text_length=int(cfg["model"]["max_text_length"]),
)

state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

items = [
    PairItem(modality="text", value="route this request to the billing workflow"),
    PairItem(modality="image", value="/path/to/screenshot.png"),
    PairItem(modality="audio", value="/path/to/call.wav"),
]

with torch.no_grad():
    embeddings = model.encode_items(items)

print(embeddings.shape)  # expected: torch.Size([3, 768])

# Cross-modal similarity: compare a text query with an audio candidate.
query = PairItem(modality="text", value="refund request for a wrong charge")
candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

with torch.no_grad():
    embs = model.encode_items([query, candidate])

similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
print(f"similarity={similarity:.4f}")
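
For offline indexing, the encoding call from the quick start above can be wrapped in a batching loop. The sketch below assumes an `encode_fn` callable that takes a list of items and returns an `[n, dim]` array-like. `build_index`, `search`, the batch size, and the stand-in `fake_encode` are hypothetical names used only so the example runs without the model.

```python
import numpy as np


def build_index(items, encode_fn, batch_size=32):
    """Encode items in batches and return an L2-normalized [n, dim] matrix."""
    chunks = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        chunks.append(np.asarray(encode_fn(batch), dtype=np.float32))
    index = np.concatenate(chunks, axis=0)
    norms = np.linalg.norm(index, axis=1, keepdims=True)
    return index / np.clip(norms, 1e-12, None)  # avoid division by zero


def search(index, query_vec, top_k=3):
    """Return (row_ids, scores) of the top_k rows most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q  # cosine similarity, since index rows are unit length
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


def fake_encode(batch):
    """Stand-in encoder: maps each string to a toy 2-d vector."""
    return [[float(len(s)), 1.0] for s in batch]


index = build_index(["a", "bbbb", "cc"], fake_encode, batch_size=2)
ids, top_scores = search(index, np.array([1.0, 1.0]), top_k=2)
```

With the quick-start model, `encode_fn` could be something like `lambda batch: model.encode_items(batch).cpu().numpy()` called under `torch.no_grad()`, assuming `encode_items` returns a tensor as the quick start suggests.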

Evaluation snapshot

Metric	Value
Eval loss	0.389702
Eval top1	0.861707

The validation metrics come from the tri-encoder cached retrieval validation path used during export. They are intended as a release sanity snapshot rather than a public leaderboard claim.
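
The card does not define how `Eval top1` is computed. One common reading, assumed here, is top-1 retrieval accuracy over paired items, where row i of the queries belongs with row i of the candidates. The function name and the toy vectors below are illustrative.

```python
import numpy as np


def top1_accuracy(query_embs: np.ndarray, candidate_embs: np.ndarray) -> float:
    """Fraction of queries whose best-scoring candidate is their own pair."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = q @ c.T                   # [n, n] cosine similarities
    predicted = scores.argmax(axis=1)  # best candidate per query
    expected = np.arange(len(q))       # query i belongs with candidate i
    return float((predicted == expected).mean())


# Toy data: the third query is closer to the first candidate than to its own.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
candidates = np.array([[1.0, 0.05], [0.0, 1.0], [1.0, 0.5]])
accuracy = top1_accuracy(queries, candidates)  # two of three pairs retrieved
```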

Files

File	Description
`model.pt`	Exported PyTorch weights
`config.json`	Tri-encoder and training/export configuration
`src/hf_st_mm/`	Python package used to construct and run the model
`README.md`	This model card
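
Before loading the model, it can help to confirm that the download contains the files listed above. The following is a small sketch using only the standard library. `missing_paths` is a hypothetical helper, and `README.md` is left out because inference does not need it.

```python
import os

# Paths taken from the Files table; README.md is not required for inference.
REQUIRED_PATHS = ("model.pt", "config.json", os.path.join("src", "hf_st_mm"))


def missing_paths(local_dir: str) -> list:
    """Return the required paths that are absent under local_dir."""
    return [
        rel for rel in REQUIRED_PATHS
        if not os.path.exists(os.path.join(local_dir, rel))
    ]
```

A caller might raise `FileNotFoundError` when `missing_paths(local_dir)` is non-empty, which gives a clearer message than a failed import of `hf_st_mm`.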

Lineage

This ModelScope package is published by agentic-intelligence-lab as part of the Elephant model release line. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-large. The model artifacts are unchanged; only the repository name and the model card presentation differ.
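
One way to check that a mirrored artifact matches an upstream copy is to compare file digests. The sketch below computes a SHA-256 digest in chunks so that a large `model.pt` does not need to fit in memory. The card does not publish reference checksums, so both copies would have to be downloaded and hashed locally, and `sha256_of_file` is a hypothetical helper name.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest can be treated as identical for practical purposes.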

Limitations

This is a custom PyTorch tri-encoder export, not a standard Transformers auto-class checkpoint.
Inference relies on the packaged hf_st_mm source code.
Image and audio inputs are expected as local file paths in the simple inference path.
The model is optimized for retrieval, routing, and similarity, not generation or captioning.
Reported validation metrics come from an internal cached retrieval validation set.
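
Because the simple inference path expects local file paths, in-memory images or audio (for example, bytes received over an API) need to be written to disk first. This is a minimal sketch using the standard library. `bytes_to_temp_path` is a hypothetical helper and not part of `hf_st_mm`.

```python
import os
import tempfile


def bytes_to_temp_path(data: bytes, suffix: str) -> str:
    """Write in-memory bytes to a temporary file and return its path.

    The caller is responsible for deleting the file after encoding.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The returned path can then be used as the `value` of an image or audio `PairItem`, followed by `os.remove(path)` once the embedding has been computed.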

Citation

@misc{elephant-embeddings-v1-multimodal-large,
  title={Elephant Embeddings V1 Multimodal Large},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large}
}

License

Apache 2.0
