Elephant Embeddings V1 Multimodal Large

elephant-embeddings-v1-multimodal-large is the large multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.

This ModelScope release is maintained by agentic-intelligence-lab to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-large and renames it under a consistent Elephant model namespace.

Positioning

This model is a production-oriented multimodal embedding model for semantic routing, retrieval, and cross-modal matching across text, image, and audio.

It is not a generative chat or captioning model. Instead, it maps different modalities into one shared embedding space so agent systems can compare requests, screenshots, documents, and audio records with the same retrieval interface.

Model at a glance

| Item | Value |
| --- | --- |
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Architecture | Custom PyTorch tri-encoder |
| Text encoder | llm-semantic-router/mmbert-embed-32k-2d-matryoshka |
| Image encoder | google/siglip2-so400m-patch14-384 |
| Audio encoder | openai/whisper-medium |
| Embedding dimension | 768 |
| Max text length | 32,768 tokens |
| Objective | Cached multiple negatives ranking loss |
| Upstream source | llm-semantic-router/multi-modal-embed-large |
| License | Apache 2.0 |
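
The objective listed above, multiple negatives ranking loss, treats each anchor's paired item as the positive and every other item in the batch as a negative. The following is a minimal, dependency-free sketch of the plain (uncached) loss, not the training code from this release: the function names and the `scale` value of 20.0 are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct candidate for anchor i is positive i.

    `scale` is a hypothetical similarity multiplier, not a value taken
    from this release.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over every candidate in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

The "cached" variant named in the table is generally understood to compute this same loss in gradient-cached chunks so that larger batches fit in memory; that machinery is omitted here.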

Why it fits agentic workloads

Agentic products increasingly need to retrieve and route over mixed inputs: user text, screenshots, UI states, documents, voice notes, support calls, and multimodal memory. This model is designed for that operating pattern.

Key advantages:

  • Shared semantic space: text, images, and audio can be compared with cosine similarity.
  • Routing-grade representation: optimized for retrieval, matching, and routing rather than generation.
  • Strong modality towers: uses dedicated text, image, and audio encoders instead of forcing all modalities through a single monolithic checkpoint.
  • Long-context text path: supports long tool descriptions, traces, and knowledge chunks through the text encoder.
  • Production packaging: includes the custom source package needed to construct and run the tri-encoder.
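
To illustrate the shared semantic space, here is a small sketch of ranking candidates from different modalities against one text query using cosine similarity. The toy 3-dimensional vectors stand in for real 768-dimensional embeddings, and the labels and helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_cosine(query, candidates):
    """Return (label, score) pairs sorted by descending cosine similarity."""
    q = l2_normalize(query)
    scored = []
    for label, vec in candidates.items():
        c = l2_normalize(vec)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((label, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of a text query, an image, and audio.
query_text = [0.9, 0.1, 0.0]
candidates = {
    "image:invoice_screenshot": [0.8, 0.2, 0.1],
    "audio:weather_report": [0.0, 0.1, 0.9],
}
ranking = rank_by_cosine(query_text, candidates)
```

Because every modality lands in the same space, the ranking code does not need to know which encoder produced a given vector.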

Recommended use cases

| Scenario | Example |
| --- | --- |
| Multimodal RAG | Retrieve text notes using an image or audio query |
| Agent routing | Route screenshots, user text, or voice requests to the right tool or workflow |
| Memory search | Search mixed text/image/audio memory stores in one vector space |
| Support and operations | Match tickets, screenshots, logs, and recorded calls semantically |
| Offline indexing | Build high-quality 768d multimodal indexes |
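
As one possible shape for the agent-routing scenario, the sketch below picks the route whose prototype embedding is closest to the request embedding and falls back when nothing is close enough. The route names, the `min_score` threshold of 0.3, and the `fallback` label are hypothetical; a real threshold would need tuning on actual traffic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def route(query_embedding, route_embeddings, min_score=0.3, fallback="fallback"):
    """Return (route_name, best_score) for the closest route prototype.

    If the best score is below `min_score`, the hypothetical `fallback`
    route is returned instead of a low-confidence match.
    """
    scores = {
        name: cosine(query_embedding, prototype)
        for name, prototype in route_embeddings.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return fallback, scores[best]
    return best, scores[best]


# Toy prototypes; in practice each would be the embedding of a route description.
routes = {"billing": [1.0, 0.0, 0.0], "tech_support": [0.0, 1.0, 0.0]}
```

The request embedding can come from text, a screenshot, or a voice note, since all three are encoded into the same space.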

Quick start on ModelScope

Install the dependencies:

pip install modelscope torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile

Then download the snapshot, load the weights, and encode a mixed batch:

import json
import os
import sys

import torch
import torch.nn.functional as F
from modelscope import snapshot_download

# Download the repository snapshot from ModelScope.
repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large"
local_dir = snapshot_download(repo_id)

# Make the bundled hf_st_mm package importable.
sys.path.insert(0, os.path.join(local_dir, "src"))

from hf_st_mm.data import PairItem
from hf_st_mm.model import MultiModalSentenceEmbedder

with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
    cfg = json.load(handle)

# Build the tri-encoder from the settings stored in config.json.
model = MultiModalSentenceEmbedder(
    text_encoder_name=cfg["model"]["text_encoder_name"],
    image_encoder_name=cfg["model"]["image_encoder_name"],
    audio_encoder_name=cfg["model"]["audio_encoder_name"],
    embedding_dim=int(cfg["model"]["embedding_dim"]),
    max_text_length=int(cfg["model"]["max_text_length"]),
)

# Load the exported weights on CPU and switch to inference mode.
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Text is passed inline; image and audio are passed as local file paths.
items = [
    PairItem(modality="text", value="route this request to the billing workflow"),
    PairItem(modality="image", value="/path/to/screenshot.png"),
    PairItem(modality="audio", value="/path/to/call.wav"),
]

with torch.no_grad():
    embeddings = model.encode_items(items)

print(embeddings.shape)  # [3, 768]

# Cross-modal similarity: compare a text query with an audio candidate.
query = PairItem(modality="text", value="refund request for a wrong charge")
candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

with torch.no_grad():
    embs = model.encode_items([query, candidate])

similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
print(f"similarity={similarity:.4f}")

Evaluation snapshot

| Metric | Value |
| --- | --- |
| Eval loss | 0.389702 |
| Eval top1 | 0.861707 |

These validation metrics were computed during export, using the tri-encoder's cached retrieval validation path. They are a release sanity check, not a public leaderboard claim.
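
This card does not define how "Eval top1" is computed. Assuming it denotes top-1 retrieval accuracy over paired validation items (a common companion to a ranking loss), the metric could be sketched as follows. The function name and the toy similarity matrix are illustrative only.

```python
def top1_accuracy(similarity):
    """Fraction of queries whose highest-scoring candidate is the correct one.

    `similarity[i][j]` is the score between query i and candidate j, and
    candidate i is assumed to be the correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == i)
    return hits / len(similarity)


# Toy 3x3 score matrix: queries 0 and 1 retrieve their pair, query 2 does not.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.1, 0.3],
]
accuracy = top1_accuracy(toy_scores)
```

Under this reading, a value of 0.86 would mean that roughly 86% of validation queries ranked their paired item first among the candidates scored.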

Files

| File | Description |
| --- | --- |
| model.pt | Exported PyTorch weights |
| config.json | Tri-encoder and training/export configuration |
| src/hf_st_mm/ | Python package used to construct and run the model |
| README.md | This model card |

Lineage

This ModelScope package is published by agentic-intelligence-lab as part of the Elephant model release line. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-large. The model artifacts are unchanged; only the repository name and the model card presentation differ.

Limitations

  • This is a custom PyTorch tri-encoder export, not a standard Transformers auto-class checkpoint.
  • Inference relies on the packaged hf_st_mm source code.
  • Image and audio inputs are expected as local file paths in the simple inference path.
  • The model is optimized for retrieval, routing, and similarity, not generation or captioning.
  • Reported validation metrics come from an internal cached retrieval validation set.
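
Since the simple inference path expects local file paths, image or audio data that arrives as in-memory bytes (for example from an upload) must first be written to disk. One way to do that is a small context manager that creates a temporary file and removes it afterwards. This helper is a hypothetical convenience and not part of the packaged hf_st_mm code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_path(payload: bytes, suffix: str):
    """Write in-memory bytes to a temporary file and yield its path.

    The file is deleted when the `with` block exits, so encoding must
    happen inside the block.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        yield path
    finally:
        os.remove(path)
```

Inside the `with` block, the yielded path can be passed as the `value` of a `PairItem` with `modality="image"` or `modality="audio"`, as in the quick start.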

Citation

@misc{elephant-embeddings-v1-multimodal-large,
  title={Elephant Embeddings V1 Multimodal Large},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large}
}

License

Apache 2.0
