# xiaobu-embedding-v2
Built on piccolo-embedding [1], with the following main changes:
- The synthetic data is replaced with the data accumulated for xiaobu-embedding-v1 [2].
- The six task types in CMTEB are handled uniformly from the circle_loss [3] perspective. The main advantage is that the multiple positives present in the original datasets can be fully exploited; a secondary benefit is that it largely avoids having to tune the relative weights of several different losses.
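The unified view above can be sketched with the pairwise form of Circle Loss, which pushes every negative similarity below every positive one by a margin. This is a minimal NumPy illustration, not the training code actually used; the `gamma` and `margin` values are illustrative assumptions.

```python
import numpy as np

def circle_style_loss(pos_sims, neg_sims, gamma=32.0, margin=0.25):
    """Unified multi-positive loss:
    log(1 + sum_{i,j} exp(gamma * (s_n_j - s_p_i + margin)))

    pos_sims: similarities of the anchor to all its positives
    neg_sims: similarities of the anchor to all its negatives
    """
    # Pairwise differences between every negative and every positive.
    diffs = gamma * (neg_sims[None, :] - pos_sims[:, None] + margin)
    return np.log1p(np.exp(diffs).sum())

# Well-separated positives/negatives give a small loss...
loss_good = circle_style_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
# ...while overlapping ones are penalized more heavily.
loss_bad = circle_style_loss(np.array([0.4]), np.array([0.3]))
```

Because the sums over positives and negatives factor inside the log, an arbitrary number of positives per anchor is handled by one term, with no extra loss-weighting hyperparameters.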
## Usage (Sentence-Transformers)
```
pip install -U sentence-transformers
```
Similarity computation:
```python
from sentence_transformers import SentenceTransformer

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = SentenceTransformer('lier007/xiaobu-embedding-v2')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
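Because `normalize_embeddings=True` L2-normalizes each vector, the matrix product above is exactly pairwise cosine similarity. A small NumPy check with made-up vectors illustrates the identity:

```python
import numpy as np

# Two arbitrary vectors, L2-normalized as model.encode(..., normalize_embeddings=True) would do.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

dot = a @ b                                              # plain dot product
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # explicit cosine similarity

assert np.isclose(dot, cos)  # identical once the vectors are unit-length
```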
## Reference
## Evaluation results

Self-reported scores on MTEB:

| Metric | Dataset | Score |
|---|---|---|
| cos_sim_pearson | MTEB AFQMC (validation) | 56.919 |
| cos_sim_spearman | MTEB AFQMC (validation) | 60.956 |
| euclidean_pearson | MTEB AFQMC (validation) | 59.738 |
| euclidean_spearman | MTEB AFQMC (validation) | 60.957 |
| manhattan_pearson | MTEB AFQMC (validation) | 59.740 |
| manhattan_spearman | MTEB AFQMC (validation) | 60.952 |
| cos_sim_pearson | MTEB ATEC (test) | 56.794 |
| cos_sim_spearman | MTEB ATEC (test) | 58.810 |
| euclidean_pearson | MTEB ATEC (test) | 63.422 |
| euclidean_spearman | MTEB ATEC (test) | 58.810 |