Elephant Embeddings V1 Multimodal Small
elephant-embeddings-v1-multimodal-small is the compact multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.
This ModelScope release is maintained by agentic-intelligence-lab to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-small and renames it under a consistent Elephant model namespace.
Positioning
This model is a lightweight multimodal embedding model for text, image, and audio retrieval. It is designed for deployments that need a shared multimodal semantic space but prefer a smaller and cheaper model than the large tri-encoder release.
It is best suited for retrieval, routing, and similarity workloads rather than generative chat, captioning, or instruction following.
Model at a glance
| Item | Value |
|---|---|
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Text encoder | sentence-transformers/all-MiniLM-L6-v2 |
| Image encoder | google/siglip-base-patch16-512 |
| Audio encoder | openai/whisper-tiny |
| Fusion | 2-layer Transformer attention |
| Embedding dimension | 384 |
| Matryoshka dimensions | 384, 256, 128, 64, 32 |
| Image resolution | 512×512 |
| Audio input | Up to 30 s, 16 kHz |
| Upstream source | llm-semantic-router/multi-modal-embed-small |
| License | Apache 2.0 |
Why it fits agentic workloads
Small multimodal embeddings are useful when an agent runtime needs frequent low-cost similarity checks over mixed content.
Key advantages:
- Shared multimodal space: compare text, screenshots/images, and short audio clips in one vector space.
- Compact embedding size: 384-dimensional vectors are cheaper to store and search.
- Dimension-adaptive retrieval: truncate vectors to 256d, 128d, 64d, or 32d for lower-cost indexes.
- Practical modality encoders: combines lightweight text and audio encoders with a SigLIP image tower.
- ONNX assets included: provides additional deployment artifacts for selected runtime paths.
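To make the storage trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes float32 vectors (4 bytes per dimension) in a flat, uncompressed index with no metadata overhead. The function name `index_size_mib` and the corpus size of one million items are illustrative and do not come from the model card.

```python
# Rough storage estimate for a flat float32 index at each Matryoshka dimension.
# Assumptions (not from the model card): 4 bytes per dimension, no index overhead.
MATRYOSHKA_DIMS = (384, 256, 128, 64, 32)
BYTES_PER_FLOAT32 = 4


def index_size_mib(num_vectors: int, dim: int) -> float:
    """Return the raw vector storage in MiB for num_vectors float32 vectors."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / (1024 ** 2)


if __name__ == "__main__":
    for dim in MATRYOSHKA_DIMS:
        size = index_size_mib(1_000_000, dim)
        print(f"{dim:>3}d: {size:8.1f} MiB per million vectors")
```

Under these assumptions storage scales linearly with dimension, so a 32d index needs one twelfth of the space of a 384d index. Real vector stores add their own overhead, and retrieval quality at each size should be measured on your own data.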
Recommended use cases
| Scenario | Example |
|---|---|
| Lightweight multimodal retrieval | Search captions, screenshots, and voice snippets together |
| Agent route matching | Match user text or UI screenshots to tools and workflows |
| Edge or cost-sensitive indexing | Use 384d or truncated vectors for lower storage cost |
| Prototype multimodal memory | Build a small unified memory index before moving to the large model |
| Image/audio semantic search | Retrieve text labels or notes from image/audio queries |
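The agent route matching scenario can be sketched as a nearest-neighbour lookup over pre-computed route embeddings. The vectors below are tiny hand-written stand-ins, not real model outputs. The route names, `match_route`, and the 0.5 threshold are hypothetical. In practice the vectors would come from the model's text or image encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_route(query_vec, route_vecs, threshold=0.5):
    """Return (route_name, score) for the best route, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, vec in route_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy 3-d vectors standing in for 384-d embeddings (illustrative only).
routes = {
    "billing_refund": [0.9, 0.1, 0.0],
    "login_support": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of a refund-related message
print(match_route(query, routes))
```

Returning `None` below the threshold lets the agent fall back to a default handler instead of forcing a weak match. A suitable threshold depends on the data and would need tuning.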
Quick start on ModelScope
```shell
pip install modelscope torch transformers pillow safetensors
```
```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from modelscope import snapshot_download
from transformers import (
    AutoModel,
    AutoTokenizer,
    SiglipModel,
    SiglipProcessor,
    WhisperFeatureExtractor,
    WhisperModel,
)


class MultiModalEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        # Text tower: MiniLM, 384-d hidden states.
        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        # Image tower: SigLIP vision encoder (768-d) projected down to 384-d.
        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
        self.image_proj = nn.Linear(768, 384)
        # Audio tower: Whisper-tiny encoder, 384-d hidden states.
        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

    def encode_text(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.text_encoder(**inputs)
        # Mean over all token positions (padding positions are included).
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_image(self, images):
        inputs = self.image_processor(images=images, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.image_encoder(**inputs)
        embeddings = self.image_proj(outputs.pooler_output)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_audio(self, waveform):
        # The feature extractor expects a NumPy array sampled at 16 kHz.
        if isinstance(waveform, torch.Tensor):
            waveform = waveform.squeeze().cpu().numpy()
        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.audio_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)


repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
local_dir = snapshot_download(repo_id)

model = MultiModalEmbedder()
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)

# Copy each sub-module's weights by stripping its checkpoint key prefix.
model.text_encoder.load_state_dict({
    key.replace("text_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("text_encoder.encoder.")
})
model.image_encoder.load_state_dict({
    key.replace("image_encoder.vision_encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.vision_encoder.")
})
model.image_proj.load_state_dict({
    key.replace("image_encoder.projection.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.projection.")
})
model.audio_encoder.load_state_dict({
    key.replace("audio_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("audio_encoder.encoder.")
})
model.eval()

texts = ["A refund request", "A screenshot of a login failure"]
with torch.no_grad():  # inference only, no gradients needed
    text_embeddings = model.encode_text(texts)
print(text_embeddings.shape)  # torch.Size([2, 384])
```
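The four loading steps above repeat one pattern: select the checkpoint keys under a prefix and strip that prefix. A small helper makes this explicit. `extract_submodule_state` is a hypothetical name, and the toy dictionary only mimics the checkpoint's key layout. Slicing off the prefix, rather than calling `str.replace`, avoids touching a repeated occurrence of the prefix later in a key.

```python
def extract_submodule_state(state_dict, prefix):
    """Return the entries under `prefix`, with the prefix removed from each key."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy checkpoint mimicking the quick-start key layout (values are placeholders).
toy_state = {
    "text_encoder.encoder.embeddings.weight": "w0",
    "image_encoder.projection.weight": "w1",
    "image_encoder.projection.bias": "b1",
}
print(extract_submodule_state(toy_state, "image_encoder.projection."))
```

With the real checkpoint, this could be used as `model.image_proj.load_state_dict(extract_submodule_state(state_dict, "image_encoder.projection."))`.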
Matryoshka truncation
```python
full_emb = model.encode_text("A billing support request")  # [1, 384]
emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
```
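The re-normalization step matters because a prefix of a unit vector is generally shorter than unit length. The sketch below shows this on a plain Python list and also covers the 32d size listed in the model table. `truncate_and_normalize` is an illustrative helper, not part of the model's API, and the toy vector is not a real embedding.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then re-normalize to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    return l2_normalize(vec[:dim])


# Toy unit vector standing in for a 384-d embedding (illustrative only).
full = l2_normalize([float(i + 1) for i in range(384)])
for dim in (256, 128, 64, 32):
    small = truncate_and_normalize(full, dim)
    prefix_norm = math.sqrt(sum(x * x for x in full[:dim]))
    new_norm = math.sqrt(sum(x * x for x in small))
    print(dim, round(prefix_norm, 4), round(new_norm, 4))
```

Queries and indexed documents must be truncated to the same dimension, otherwise their dot products are not comparable.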
Evaluation snapshot
| Metric | Score |
|---|---|
| COCO image-to-text R@1 | 41.88% |
| COCO image-to-text R@5 | 71.64% |
| COCO image-to-text R@10 | 82.16% |
| LibriSpeech audio-to-text R@1 | 36.38% |
| LibriSpeech audio-to-text R@5 | 68.22% |
| LibriSpeech audio-to-text R@10 | 79.52% |
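For readers unfamiliar with the metric, R@K is the fraction of queries for which a correct match appears among the K highest-scoring candidates. The following is a minimal sketch of that definition over a precomputed similarity matrix. It is not the evaluation script behind the numbers above, and the toy scores are made up.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries with at least one correct candidate in the top-k.

    similarity[i][j] is the score between query i and candidate j;
    ground_truth[i] is the set of correct candidate indices for query i.
    """
    hits = 0
    for scores, correct in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy scores: 3 queries x 4 candidates (made-up numbers).
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked 1st
    [0.4, 0.5, 0.6, 0.1],  # correct candidate 1 is ranked 2nd
    [0.7, 0.6, 0.2, 0.1],  # correct candidate 3 is ranked 4th
]
truth = [{0}, {1}, {3}]
print(recall_at_k(sim, truth, 1), recall_at_k(sim, truth, 2))
```

Using a set of correct indices per query accommodates datasets where one query has several valid matches, such as an image with multiple reference captions.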
Files
| File | Description |
|---|---|
| model.pt | PyTorch checkpoint |
| model.safetensors | SafeTensors checkpoint |
| config.json | Model component configuration |
| onnx/ | ONNX deployment assets |
| README.md | This model card |
Lineage
This ModelScope package is published by agentic-intelligence-lab as part of the Elephant model release line. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-small and keeps the model artifacts unchanged; only the repository naming and model card presentation differ.
Limitations
- English is the primary supported language for this compact release.
- Image inputs are designed around 512×512 preprocessing.
- Audio inputs are intended for short clips up to about 30 seconds at 16 kHz.
- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
- For higher-quality multimodal retrieval, use elephant-embeddings-v1-multimodal-large.
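One way to respect the 30-second audio limit is to split longer recordings into windows before embedding them. The sketch below works on a plain sequence of samples and assumes the audio has already been resampled to 16 kHz mono. `chunk_waveform` and the choice to embed each window separately are illustrative assumptions, not guidance from the model authors.

```python
SAMPLE_RATE = 16_000  # Hz, as stated in the limitations
MAX_SECONDS = 30      # longest clip the audio encoder is intended for


def chunk_waveform(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Split a 1-D sequence of samples into consecutive windows of at most max_seconds."""
    window = sample_rate * max_seconds
    return [samples[start:start + window] for start in range(0, len(samples), window)]


# 70 seconds of silence as a stand-in for real audio.
waveform = [0.0] * (70 * SAMPLE_RATE)
chunks = chunk_waveform(waveform)
print([len(chunk) / SAMPLE_RATE for chunk in chunks])
```

Each window could then be passed to `encode_audio`. How to combine the resulting vectors (for example, by averaging them or indexing each window separately) is a design choice this model card does not specify.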
Citation
```bibtex
@misc{elephant-embeddings-v1-multimodal-small,
  title={Elephant Embeddings V1 Multimodal Small},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
}
```
License
Apache 2.0