Elephant Embeddings V1 Multimodal Small

elephant-embeddings-v1-multimodal-small is the compact multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.

This ModelScope release is maintained by agentic-intelligence-lab to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-small and republishes it under a consistent Elephant model namespace.

Positioning

This is a lightweight embedding model for text, image, and audio retrieval. It targets deployments that need a shared multimodal semantic space but want a smaller, cheaper model than the large tri-encoder release.

It is best suited for retrieval, routing, and similarity workloads rather than generative chat, captioning, or instruction following.

Model at a glance

Item	Value
Family	Elephant Embeddings V1
Maintainer	Agentic Intelligence Lab
Model type	Multimodal embedding model
Modalities	Text, image, audio
Text encoder	`sentence-transformers/all-MiniLM-L6-v2`
Image encoder	`google/siglip-base-patch16-512`
Audio encoder	`openai/whisper-tiny`
Fusion	2-layer Transformer attention
Embedding dimension	384
Matryoshka dimensions	384, 256, 128, 64, 32
Image resolution	512×512
Audio input	Up to 30 s, 16 kHz
Upstream source	`llm-semantic-router/multi-modal-embed-small`
License	Apache 2.0

Why it fits agentic workloads

Small multimodal embeddings are useful when an agent runtime needs frequent low-cost similarity checks over mixed content.

Key advantages:

Shared multimodal space: compare text, screenshots/images, and short audio clips in one vector space.
Compact embedding size: 384-dimensional vectors are cheaper to store and search.
Dimension-adaptive retrieval: truncate vectors to 256d, 128d, 64d, or 32d for lower-cost indexes.
Practical modality encoders: combines lightweight text and audio encoders with a SigLIP image tower.
ONNX assets included: provides additional deployment artifacts for selected runtime paths.
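
To make the storage trade-off concrete, the following sketch estimates the raw size of a flat index at each Matryoshka dimension. It assumes float32 storage (4 bytes per value) with no index overhead or compression, and the corpus size of one million vectors is purely illustrative.

```python
# Assumptions (not from the model card): float32 storage at 4 bytes per value,
# no index overhead or compression, and an illustrative corpus of 1,000,000 vectors.
MATRYOSHKA_DIMS = [384, 256, 128, 64, 32]  # dimensions listed in the model card
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw bytes needed to store num_vectors float32 vectors of width dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_vectors = 1_000_000
for dim in MATRYOSHKA_DIMS:
    size_mb = index_size_bytes(num_vectors, dim) / 1e6
    print(f"{dim:>3}d: {size_mb:7.1f} MB")
```

Under these assumptions the full 384d index needs about 1.5 GB, while the 32d index needs 128 MB, a 12x reduction. Real vector stores add their own overhead, so treat the numbers as a lower bound.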

Recommended use cases

Scenario	Example
Lightweight multimodal retrieval	Search captions, screenshots, and voice snippets together
Agent route matching	Match user text or UI screenshots to tools and workflows
Edge or cost-sensitive indexing	Use 384d or truncated vectors for lower storage cost
Prototype multimodal memory	Build a small unified memory index before moving to the large model
Image/audio semantic search	Retrieve text labels or notes from image/audio queries
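
Agent route matching reduces to a nearest-neighbor lookup over normalized embeddings. The sketch below uses NumPy and small hand-written vectors as stand-ins for model outputs. The route names, the `match_route` helper, and the 0.5 threshold are hypothetical and not part of this release.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def match_route(query_emb, route_embs, route_names, threshold=0.5):
    """Return (name, score) for the best route, or (None, score) below threshold."""
    scores = l2_normalize(route_embs) @ l2_normalize(query_emb)
    best = int(np.argmax(scores))
    score = float(scores[best])
    return (route_names[best] if score >= threshold else None, score)


# Toy 3-d vectors standing in for 384-d embeddings of route descriptions.
route_names = ["refund_workflow", "login_troubleshooting"]
route_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

query_emb = np.array([0.9, 0.1, 0.0])  # stand-in for an embedded user message
name, score = match_route(query_emb, route_embs, route_names)
print(name, round(score, 3))
```

In a real deployment the route embeddings would be computed once from tool or workflow descriptions and cached, and the threshold would need tuning on representative traffic.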

Quick start on ModelScope

pip install modelscope torch transformers pillow safetensors

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from modelscope import snapshot_download
from transformers import AutoModel, AutoTokenizer, SiglipModel, SiglipProcessor, WhisperFeatureExtractor, WhisperModel


class MultiModalEmbedder(nn.Module):
    """Per-modality encoders that each return L2-normalized 384-d embeddings."""

    def __init__(self):
        super().__init__()
        # Text tower: MiniLM already has a 384-d hidden size, so no projection is needed.
        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

        # Image tower: SigLIP base emits 768-d features, projected down to 384-d.
        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
        self.image_proj = nn.Linear(768, 384)

        # Audio tower: the Whisper tiny encoder also has a 384-d hidden size.
        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

    def encode_text(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.text_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_image(self, images):
        inputs = self.image_processor(images=images, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.image_encoder(**inputs)
        embeddings = self.image_proj(outputs.pooler_output)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_audio(self, waveform):
        if isinstance(waveform, torch.Tensor):
            waveform = waveform.squeeze().cpu().numpy()
        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.audio_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)


repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
local_dir = snapshot_download(repo_id)

model = MultiModalEmbedder()
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)

model.text_encoder.load_state_dict({
    key.replace("text_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("text_encoder.encoder.")
})
model.image_encoder.load_state_dict({
    key.replace("image_encoder.vision_encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.vision_encoder.")
})
model.image_proj.load_state_dict({
    key.replace("image_encoder.projection.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.projection.")
})
model.audio_encoder.load_state_dict({
    key.replace("audio_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("audio_encoder.encoder.")
})

model.eval()

texts = ["A refund request", "A screenshot of a login failure"]
with torch.no_grad():  # inference only, so skip gradient tracking
    text_embeddings = model.encode_text(texts)
print(text_embeddings.shape)  # torch.Size([2, 384])
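
The loading code above repeats the same prefix-stripping pattern four times. A minimal sketch of a reusable helper follows. `extract_submodule_state` is a hypothetical name, and the toy dictionary stands in for the real checkpoint. Slicing off the prefix, rather than calling `str.replace`, guarantees that only the leading occurrence is removed.

```python
def extract_submodule_state(state_dict, prefix):
    """Select entries whose key starts with `prefix` and strip it from the key."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy checkpoint: strings stand in for tensors.
toy_state = {
    "text_encoder.encoder.embeddings.weight": "t0",
    "image_encoder.projection.weight": "p0",
    "image_encoder.projection.bias": "p1",
}

proj_state = extract_submodule_state(toy_state, "image_encoder.projection.")
print(sorted(proj_state))  # ['bias', 'weight']
```

With the real checkpoint, the result can be passed straight to `load_state_dict`, for example `model.image_proj.load_state_dict(extract_submodule_state(state_dict, "image_encoder.projection."))`.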

Matryoshka truncation

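with torch.no_grad():
    full_emb = model.encode_text("A billing support request")  # shape [1, 384]

# Keep the leading dimensions, then re-normalize so cosine similarity stays valid.
emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
emb_32 = F.normalize(full_emb[:, :32], p=2, dim=-1)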
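
One common way to use truncated vectors is a two-stage search: shortlist candidates with a cheap low-dimensional comparison, then rerank only the shortlist at full width. This pattern is not prescribed by this release. The sketch below illustrates it with NumPy and random stand-in embeddings, and the `two_stage_search` helper and its default parameters are hypothetical.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def two_stage_search(query, corpus, coarse_dim=64, shortlist=10, top_k=3):
    """Shortlist with truncated vectors, then rerank the shortlist at full width."""
    # Stage 1: cosine similarity on the first coarse_dim components only.
    coarse_scores = normalize(corpus[:, :coarse_dim]) @ normalize(query[:coarse_dim])
    candidates = np.argsort(-coarse_scores)[:shortlist]
    # Stage 2: full-width cosine similarity, restricted to the shortlist.
    full_scores = normalize(corpus[candidates]) @ normalize(query)
    order = np.argsort(-full_scores)[:top_k]
    return candidates[order]


rng = np.random.default_rng(0)
corpus = normalize(rng.standard_normal((1000, 384)))  # random stand-in embeddings
query = corpus[42] + 0.01 * rng.standard_normal(384)  # near-duplicate of item 42
print(two_stage_search(query, corpus))
```

Because the query is a lightly perturbed copy of item 42, that index is expected to rank first. With real embeddings, the shortlist size controls how much recall is traded for speed.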

Evaluation snapshot

Metric	Score
COCO image-to-text R@1	41.88%
COCO image-to-text R@5	71.64%
COCO image-to-text R@10	82.16%
LibriSpeech audio-to-text R@1	36.38%
LibriSpeech audio-to-text R@5	68.22%
LibriSpeech audio-to-text R@10	79.52%
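
For reference, Recall@K in cross-modal retrieval is commonly computed as the fraction of queries whose ground-truth match appears among the K highest-scoring candidates. The sketch below assumes exactly one ground-truth text per query, stored at the same row index. The card does not state the evaluation protocol behind the numbers above, so treat this as an illustration of the metric rather than a reproduction script.

```python
import numpy as np


def recall_at_k(query_embs: np.ndarray, text_embs: np.ndarray, k: int) -> float:
    """Fraction of queries whose matching text (same row index) is in the top k."""
    scores = query_embs @ text_embs.T              # [num_queries, num_texts]
    top_k = np.argsort(-scores, axis=1)[:, :k]     # indices of the k best texts
    targets = np.arange(len(query_embs))[:, None]  # assumption: query i matches text i
    return float((top_k == targets).any(axis=1).mean())


# Toy example: 3 queries and 3 texts as unit vectors in 2-d.
text_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])

print(recall_at_k(query_embs, text_embs, k=1))
print(recall_at_k(query_embs, text_embs, k=2))
```

In the toy data only the first query retrieves its own text at rank 1, while all three find it within the top 2, so Recall@1 is one third and Recall@2 is 1.0.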

Files

File	Description
`model.pt`	PyTorch checkpoint
`model.safetensors`	SafeTensors checkpoint
`config.json`	Model component configuration
`onnx/`	ONNX deployment assets
`README.md`	This model card

Lineage

This ModelScope package is published by agentic-intelligence-lab as part of the Elephant model release line. It mirrors the upstream Hugging Face model llm-semantic-router/multi-modal-embed-small and keeps the model artifacts unchanged; only the repository naming and model card presentation differ.

Limitations

English is the primary supported language for this compact release.
Image inputs are designed around 512×512 preprocessing.
Audio inputs are intended for short clips up to about 30 seconds at 16 kHz.
The model is optimized for retrieval, routing, and similarity, not generation or captioning.
For higher-quality multimodal retrieval, use elephant-embeddings-v1-multimodal-large.
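
Because the audio path is built around clips of up to about 30 seconds at 16 kHz, longer recordings need to be split before embedding. The sketch below shows one simple way to do that with NumPy. The `chunk_waveform` helper is hypothetical, and how to combine the per-chunk embeddings afterwards (for example by averaging) is left open because the card does not specify it.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, from the model card
MAX_SECONDS = 30      # approximate clip length limit, from the model card


def chunk_waveform(waveform: np.ndarray, max_seconds: int = MAX_SECONDS,
                   sample_rate: int = SAMPLE_RATE) -> list:
    """Split a mono waveform into consecutive chunks of at most max_seconds."""
    max_samples = max_seconds * sample_rate
    return [waveform[start:start + max_samples]
            for start in range(0, len(waveform), max_samples)]


# 70 seconds of silence as a stand-in for a real 16 kHz mono recording.
waveform = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_waveform(waveform)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then be passed to `encode_audio` on its own.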

Citation

@misc{elephant-embeddings-v1-multimodal-small,
  title={Elephant Embeddings V1 Multimodal Small},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
}

License

Apache 2.0

Model size: 0.3B params (Safetensors, tensor type F32)
