KaLM-Embedding

KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs and trained on higher-quality training data.

KaLM-embedding-multilingual-mini is initialized from Qwen/Qwen2-0.5B and trained on large-scale weakly supervised pre-training data and supervised fine-tuning data.

📑 Open-source Plan

- Model Checkpoint
  - KaLM-embedding-multilingual-mini-v1
  - KaLM-embedding-multilingual-mini-instruct-v1
  - KaLM-embedding-multilingual-mini-instruct-v1.5
  - KaLM-embedding-multilingual-max-v1
- Training and Evaluation Code: HITsz-TMG/KaLM-Embedding
- Technical Report: KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
- Training Data

Evaluation

Model Name	Model Size	C-MTEB(35)	MTEB(56)	Avg
multilingual-e5-large	560M	58.81	61.50	60.16
bge-m3 (dense)	560M	60.80	59.84	60.32
gte-multilingual-base (dense)	305M	62.72	61.40	62.06
KaLM-embedding-multilingual-mini-v1	494M	62.31	61.87	62.09
KaLM-embedding-multilingual-mini-instruct-v1	494M	63.57	64.74	64.16
KaLM-embedding-multilingual-mini-instruct-v1.5	494M	64.13	64.94	64.53

Requirements

This model is built on Qwen2, so we recommend installing transformers>=4.37.0. With older versions you may encounter the following error:

KeyError: 'qwen2'
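
As a quick sanity check before loading the model, the installed version can be compared against the minimum stated above. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_transformers`) are illustrative assumptions and are not part of transformers or this repository.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.37.0"  # minimum version stated in the Requirements section


def version_tuple(version: str) -> tuple:
    """Turn '4.37.0' (or '4.37.0.dev0') into (4, 37, 0).

    Simplification: pre-release suffixes are ignored, so '4.37.0.dev0'
    is treated the same as '4.37.0'.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '4.37' to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers() -> str:
    """Describe whether the locally installed transformers is new enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} satisfies >={MIN_TRANSFORMERS}"
    return f"transformers {installed} is older than {MIN_TRANSFORMERS}; upgrade to avoid KeyError: 'qwen2'"


if __name__ == "__main__":
    print(check_transformers())
```

If the check reports an older version, `pip install -U "transformers>=4.37.0"` upgrades it.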

Usage

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace the placeholder with the model name or a local path.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512  # inputs longer than 512 tokens are truncated

# normalize_embeddings=True returns unit-length vectors.
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
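
Because `normalize_embeddings=True` yields unit-length vectors, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this with small hand-written stand-in vectors in plain Python, so it runs without downloading the model; the vectors and the helper names (`l2_normalize`, `dot`, `similarity_matrix`) are made up for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length, as normalize_embeddings=True does."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; for unit-length vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(embeddings):
    """Pairwise similarities between all rows of a list of unit vectors."""
    return [[dot(a, b) for b in embeddings] for a in embeddings]


if __name__ == "__main__":
    # Toy 2-dimensional stand-ins for real sentence embeddings.
    toy = [l2_normalize(v) for v in ([3.0, 4.0], [4.0, 3.0], [-4.0, 3.0])]
    for row in similarity_matrix(toy):
        print([round(score, 2) for score in row])
```

With real model output, which `encode` returns as a NumPy array by default, the same matrix can be computed as `embeddings @ embeddings.T`.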

We add instructions for classification and clustering tasks. To prepend an instruction to the query (the corpus gets no instruction), use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace the placeholder with the model name or a local path.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512  # inputs longer than 512 tokens are truncated

# The prompt is prepended to every sentence before encoding.
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
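
The query/corpus asymmetry described above can be wrapped in a small helper. This is a sketch under assumptions: `build_prompt` and `encode_queries_and_corpus` are hypothetical names, the prompt template simply mirrors the string in the example above (including its spacing), and the task description is whatever text you choose to supply. `model` is expected to be a loaded `SentenceTransformer`.

```python
def build_prompt(task_description: str) -> str:
    """Format an instruction the same way as the example prompt above."""
    return f"Instruct: {task_description} \n Query: "


def encode_queries_and_corpus(model, queries, corpus, task_description):
    """Encode queries with an instruction and corpus texts without one.

    `model` is assumed to expose SentenceTransformer.encode, which accepts
    the `prompt` and `normalize_embeddings` keyword arguments.
    """
    query_embeddings = model.encode(
        queries,
        prompt=build_prompt(task_description),  # instruction on the query side only
        normalize_embeddings=True,
    )
    corpus_embeddings = model.encode(
        corpus,  # no prompt: corpus texts are encoded as-is
        normalize_embeddings=True,
    )
    return query_embeddings, corpus_embeddings
```

Keeping the template in one function makes it harder for the query-side format to drift between call sites.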

Contact

If you encounter any issues, feel free to contact us by email: yanshek.woo@gmail.com
