Elephant Embeddings V1 Multimodal Small

elephant-embeddings-v1-multimodal-small is the compact multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.

This ModelScope release is maintained by agentic-intelligence-lab to make Elephant embedding models easier to download and deploy in mainland China. It mirrors and renames the upstream HuggingFace model llm-semantic-router/multi-modal-embed-small under a consistent Elephant model namespace.

Positioning

This model is a lightweight multimodal embedding model for text, image, and audio retrieval. It is designed for deployments that need a shared multimodal semantic space but prefer a smaller and cheaper model than the large tri-encoder release.

It is best suited for retrieval, routing, and similarity workloads rather than generative chat, captioning, or instruction following.

Model at a glance

| Item | Value |
| --- | --- |
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Text encoder | sentence-transformers/all-MiniLM-L6-v2 |
| Image encoder | google/siglip-base-patch16-512 |
| Audio encoder | openai/whisper-tiny |
| Fusion | 2-layer Transformer attention |
| Embedding dimension | 384 |
| Matryoshka dimensions | 384, 256, 128, 64, 32 |
| Image resolution | 512×512 |
| Audio input | Up to 30s, 16kHz |
| Upstream source | llm-semantic-router/multi-modal-embed-small |
| License | Apache 2.0 |

Why it fits agentic workloads

Small multimodal embeddings are useful when an agent runtime needs frequent low-cost similarity checks over mixed content.

Key advantages:

  • Shared multimodal space: compare text, screenshots/images, and short audio clips in one vector space.
  • Compact embedding size: 384-dimensional vectors are cheaper to store and search.
  • Dimension-adaptive retrieval: truncate vectors to 256d, 128d, 64d, or 32d for lower-cost indexes.
  • Practical modality encoders: combines lightweight text and audio encoders with a SigLIP image tower.
  • ONNX assets included: provides additional deployment artifacts for selected runtime paths.
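
To make the storage trade-off concrete, the sketch below estimates the raw size of an index at each Matryoshka dimension. It assumes float32 vectors, a hypothetical corpus of one million items, and no index overhead; none of these figures come from the release itself, and `index_size_bytes` is an illustrative helper name.

```python
# Rough storage estimate per Matryoshka dimension.
# Assumptions (not from the model card): float32 vectors, 1,000,000 items,
# and no index overhead such as graph links or quantization tables.
BYTES_PER_FLOAT32 = 4
MATRYOSHKA_DIMS = [384, 256, 128, 64, 32]
NUM_VECTORS = 1_000_000  # hypothetical corpus size


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Return the raw size in bytes of num_vectors float32 vectors of width dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


for dim in MATRYOSHKA_DIMS:
    size_mb = index_size_bytes(NUM_VECTORS, dim) / 1_000_000
    print(f"{dim:>3}d: {size_mb:,.0f} MB")
```

Under these assumptions the full 384d index takes about 1.5 GB and the 32d index about 128 MB, a 12x reduction in raw vector storage.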

Recommended use cases

| Scenario | Example |
| --- | --- |
| Lightweight multimodal retrieval | Search captions, screenshots, and voice snippets together |
| Agent route matching | Match user text or UI screenshots to tools and workflows |
| Edge or cost-sensitive indexing | Use 384d or truncated vectors for lower storage cost |
| Prototype multimodal memory | Build a small unified memory index before moving to the large model |
| Image/audio semantic search | Retrieve text labels or notes from image/audio queries |
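
As an illustration of the route-matching scenario, the sketch below picks the route whose embedding has the highest cosine similarity to a query embedding. It uses toy 4-dimensional vectors in place of real model outputs; the route names, the `match_route` helper, and the 0.5 threshold are all hypothetical and would need tuning on real data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_route(query_emb, route_embs, min_score=0.5):
    """Return (route_name, score) for the best route, or (None, score) if it scores below min_score."""
    best_name, best_score = max(
        ((name, cosine(query_emb, emb)) for name, emb in route_embs.items()),
        key=lambda item: item[1],
    )
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Toy vectors standing in for embeddings of route descriptions and of a user query.
routes = {
    "refund_workflow": [0.9, 0.1, 0.0, 0.1],
    "login_troubleshooting": [0.1, 0.9, 0.2, 0.0],
}
query = [0.8, 0.2, 0.1, 0.0]

print(match_route(query, routes))
```

Because text, image, and audio embeddings share one space, the query vector could in principle come from any of the three encoders while the route vectors stay text-based.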

Quick start on ModelScope

Install the dependencies:

pip install modelscope torch transformers pillow safetensors

Then load the checkpoint and run the encoders in Python:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from modelscope import snapshot_download
from transformers import AutoModel, AutoTokenizer, SiglipModel, SiglipProcessor, WhisperFeatureExtractor, WhisperModel


class MultiModalEmbedder(nn.Module):
    """Text, image, and audio encoders, each returning L2-normalized 384-d embeddings."""

    def __init__(self):
        super().__init__()
        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
        # Project the 768-d pooled SigLIP output down to the shared 384-d space.
        self.image_proj = nn.Linear(768, 384)

        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

    def encode_text(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.text_encoder(**inputs)
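        # Mean over all token positions. Padded positions are included, so a text
        # encoded in a batch can differ slightly from the same text encoded alone.
        embeddings = outputs.last_hidden_state.mean(dim=1)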
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_image(self, images):
        inputs = self.image_processor(images=images, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.image_encoder(**inputs)
        embeddings = self.image_proj(outputs.pooler_output)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_audio(self, waveform):
        if isinstance(waveform, torch.Tensor):
            waveform = waveform.squeeze().cpu().numpy()
        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.audio_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)


repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
local_dir = snapshot_download(repo_id)

model = MultiModalEmbedder()
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)

def load_submodule(module, prefix):
    # Select the checkpoint entries under `prefix` and strip it so the keys match the submodule.
    module.load_state_dict({
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    })


load_submodule(model.text_encoder, "text_encoder.encoder.")
load_submodule(model.image_encoder, "image_encoder.vision_encoder.")
load_submodule(model.image_proj, "image_encoder.projection.")
load_submodule(model.audio_encoder, "audio_encoder.encoder.")

model.eval()

texts = ["A refund request", "A screenshot of a login failure"]
with torch.no_grad():  # inference only, so skip gradient tracking
    text_embeddings = model.encode_text(texts)
print(text_embeddings.shape)  # torch.Size([2, 384])

Matryoshka truncation

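full_emb = model.encode_text("A billing support request")  # shape [1, 384]

# Keep the leading dimensions, then re-normalize so dot products stay cosine similarities.
emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
emb_32 = F.normalize(full_emb[:, :32], p=2, dim=-1)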
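
The same truncate-then-renormalize step can be wrapped in a small helper that rejects dimensions outside the listed Matryoshka set. This is a dependency-free sketch that works on plain Python lists rather than tensors; `truncate_embedding` and `SUPPORTED_DIMS` are illustrative names and not part of the release.

```python
import math

# Matryoshka dimensions listed for this model.
SUPPORTED_DIMS = (384, 256, 128, 64, 32)


def truncate_embedding(vec, dim):
    """Keep the first `dim` values of `vec` and rescale the result to unit L2 norm."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError(f"vector has {len(vec)} values, cannot truncate to {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


# Toy 384-d vector standing in for one row of a real embedding.
toy = [0.05] * 384
short = truncate_embedding(toy, 64)
print(len(short), round(sum(x * x for x in short), 6))
```

Queries and indexed items should be truncated to the same dimension before they are compared.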

Evaluation snapshot

| Metric | Score |
| --- | --- |
| COCO image-to-text R@1 | 41.88% |
| COCO image-to-text R@5 | 71.64% |
| COCO image-to-text R@10 | 82.16% |
| LibriSpeech audio-to-text R@1 | 36.38% |
| LibriSpeech audio-to-text R@5 | 68.22% |
| LibriSpeech audio-to-text R@10 | 79.52% |
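
For readers unfamiliar with the metric, R@K is the fraction of queries whose correct match appears among the top K retrieved candidates. The sketch below shows one common way to compute it from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, placed at the same index as the query. The evaluation protocol behind the scores above is not described here and may differ, for example because COCO pairs each image with several captions.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate (same index) ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending score, cut to the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix; query 1 ranks the wrong candidate first.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(scores, 1))
print(recall_at_k(scores, 2))
```

In this toy matrix two of three queries are correct at rank 1, and all three are recovered within the top 2.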

Files

| File | Description |
| --- | --- |
| model.pt | PyTorch checkpoint |
| model.safetensors | SafeTensors checkpoint |
| config.json | Model component configuration |
| onnx/ | ONNX deployment assets |
| README.md | This model card |

Lineage

This ModelScope package is published by agentic-intelligence-lab as part of the Elephant model release line. It mirrors the upstream HuggingFace model llm-semantic-router/multi-modal-embed-small and keeps the model artifacts unchanged except for the repository naming and model card presentation.

Limitations

  • English is the primary supported language for this compact release.
  • Image inputs are designed around 512×512 preprocessing.
  • Audio inputs are intended for short clips up to about 30 seconds at 16kHz.
  • The model is optimized for retrieval, routing, and similarity, not generation or captioning.
  • For higher-quality multimodal retrieval, use elephant-embeddings-v1-multimodal-large.
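
For recordings longer than the 30-second limit, one option is to split the waveform into windows and embed each window separately. The sketch below shows only the splitting step and assumes a mono waveform that has already been resampled to 16 kHz. `split_waveform` is an illustrative helper, and how to combine the per-window embeddings (for example by averaging) is a design choice this model card does not specify.

```python
SAMPLE_RATE = 16_000  # Hz, the audio input rate listed for this model
MAX_SECONDS = 30      # clip length limit listed for this model
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 480,000 samples per window


def split_waveform(samples, max_samples=MAX_SAMPLES):
    """Split a mono waveform into consecutive windows of at most max_samples samples."""
    return [
        samples[start:start + max_samples]
        for start in range(0, len(samples), max_samples)
    ]


# Toy 70-second silent waveform standing in for real audio.
waveform = [0.0] * (SAMPLE_RATE * 70)
chunks = split_waveform(waveform)
print([len(chunk) / SAMPLE_RATE for chunk in chunks])
```

Each resulting window can then be passed to `encode_audio` on its own.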

Citation

@misc{elephant-embeddings-v1-multimodal-small,
  title={Elephant Embeddings V1 Multimodal Small},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
}

License

Apache 2.0
