KaLM-Embedding

KaLM-Embedding is a series of embedding models that are adapted from auto-regressive LLMs and trained on superior training data.

KaLM-embedding-multilingual-mini is trained from Qwen/Qwen2-0.5B using massive weakly supervised pre-training data and supervised fine-tuning data.

📑 Open-source Plan

Evaluation

| Model Name | Model Size | C-MTEB(35) | MTEB(56) | Avg |
| --- | --- | --- | --- | --- |
| multilingual-e5-large | 560M | 58.81 | 61.50 | 60.16 |
| bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |

Requirements

Because the model is built on Qwen2, we advise you to install transformers>=4.37.0. With an older version you might encounter the following error:

KeyError: 'qwen2'
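A quick way to catch this before loading the model is to compare the installed transformers version against the 4.37.0 minimum stated above. The following is a minimal sketch rather than an official check: the helper names (`version_tuple`, `supports_qwen2`, `check_installed_transformers`) are hypothetical, and the parser only looks at the leading numeric part of the version string.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.37.0"  # minimum version stated in the requirements above


def version_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '4.37.0.dev0' -> (4, 37, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group().split(".")]
    # Pad so that '4.37' compares equal to '4.37.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_qwen2(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the given transformers version meets the stated minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_installed_transformers() -> bool:
    """Check the transformers version installed in the current environment."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False  # transformers is not installed at all
    return supports_qwen2(installed)
```

If `check_installed_transformers()` returns False, upgrading with `pip install -U "transformers>=4.37.0"` should resolve the error.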

Usage

Using this model is easy once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace the placeholder with the model name or a local path.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512  # inputs longer than 512 tokens are truncated

embeddings = model.encode(
    sentences,
    normalize_embeddings=True,  # L2-normalize each embedding
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
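Because `normalize_embeddings=True` returns unit-length vectors, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this with made-up 3-dimensional vectors standing in for real embeddings; the helper names are hypothetical and plain Python is used so that it runs without the model.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length, as normalize_embeddings=True does."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (made up for illustration, not real model output).
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

# After normalization, a plain dot product gives the cosine similarity.
similarity = dot(l2_normalize(a), l2_normalize(b))
assert math.isclose(similarity, cosine(a, b))
```

With real output from `model.encode`, the same result can be obtained for all pairs at once with `embeddings @ embeddings.T`.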

We add an instruction for classification and clustering tasks. If you want to add an instruction to the queries (the corpus gets no instruction), you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace the placeholder with the model name or a local path.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512  # inputs longer than 512 tokens are truncated

# The instruction is prepended to every sentence passed to encode().
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,  # L2-normalize each embedding
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
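The query-only instruction described above can be sketched as a small helper that encodes the queries with a prompt and the corpus without one, then scores every query against every document. This is an illustrative sketch, not the authors' evaluation code: the function name `score_queries_against_corpus` is hypothetical, `model` is assumed to be a loaded `SentenceTransformer`, and the scoring loop is written in plain Python so that it works on NumPy arrays and nested lists alike.

```python
def score_queries_against_corpus(model, queries, corpus, query_prompt):
    """Return a len(queries) x len(corpus) matrix of similarity scores.

    Only the queries receive the instruction prompt; the corpus is
    encoded as-is. Both sides are L2-normalized, so each dot product
    is a cosine similarity.
    """
    query_embeddings = model.encode(
        queries, prompt=query_prompt, normalize_embeddings=True
    )
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)
    return [
        [
            float(sum(q * d for q, d in zip(query_vec, doc_vec)))
            for doc_vec in corpus_embeddings
        ]
        for query_vec in query_embeddings
    ]
```

The instruction string is left to the caller, because the appropriate wording depends on the task; the classification prompt shown earlier is one example of the expected "Instruct: ... Query: " shape.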

Contact

If you encounter any issues, feel free to contact us by email: yanshek.woo@gmail.com
