Dmeta-embedding-small
- Dmeta-embedding系列模型是跨领域、跨任务、开箱即用的中文 Embedding 模型,适用于搜索、问答、智能客服、LLM+RAG 等各种业务场景,支持使用 Transformers/Sentence-Transformers/Langchain 等工具加载推理。
- Dmeta-embedding-zh-small是开源模型Dmeta-embedding-zh的蒸馏版本(8层BERT),模型大小不到300M。相较于原始版本,Dmeta-embedding-zh-small模型大小减小三分之一,推理速度提升约30%,总体精度下降约1.4%。
Evaluation
这里主要跟蒸馏前对应的 teacher 模型作了对比:
性能:(基于1万条数据测试,GPU设备是V100)
Teacher | Student | Gap | |
---|---|---|---|
Model | Dmeta-Embedding-zh (411M) | Dmeta-Embedding-zh-small (297M) | 0.67x |
Cost | 127s | 89s | -30% |
Latency | 13ms | 9ms | -31% |
Throughput | 78 sentence/s | 111 sentence/s | 1.4x |
精度:(参考自MTEB榜单)
Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Avg | |
---|---|---|---|---|---|---|---|
Dmeta-Embedding-zh | 70 | 50.96 | 88.92 | 67.17 | 70.41 | 64.89 | 67.51 |
Dmeta-Embedding-zh-small | 69.89 | 50.8 | 87.57 | 66.92 | 67.7 | 62.13 | 66.1 |
Gap | -0.11 | -0.16 | -1.35 | -0.25 | -2.71 | -2.76 | -1.41 |
Usage
目前模型支持通过 Sentence-Transformers, Langchain, Huggingface Transformers 等主流框架进行推理,具体用法参考各个框架的示例。
Sentence-Transformers
Dmeta-embedding 模型支持通过 sentence-transformers 来加载推理:
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
model = SentenceTransformer('DMetaSoul/Dmeta-embedding-zh-small')
embs1 = model.encode(texts1, normalize_embeddings=True)
embs2 = model.encode(texts2, normalize_embeddings=True)
# 计算两两相似度
similarity = embs1 @ embs2.T
print(similarity)
# 获取 texts1[i] 对应的最相似 texts2[j]
for i in range(len(texts1)):
scores = []
for j in range(len(texts2)):
scores.append([texts2[j], similarity[i][j]])
scores = sorted(scores, key=lambda x:x[1], reverse=True)
print(f"查询文本:{texts1[i]}")
for text2, score in scores:
print(f"相似文本:{text2},打分:{score}")
print()
示例输出如下:
查询文本:胡子长得太快怎么办?
相似文本:胡子长得快怎么办?,打分:0.965681254863739
相似文本:怎样使胡子不浓密!,打分:0.7353651523590088
相似文本:香港买手表哪里好,打分:0.24928246438503265
相似文本:在杭州手机到哪里买,打分:0.2038613110780716
查询文本:在香港哪里买手表好
相似文本:香港买手表哪里好,打分:0.9916468262672424
相似文本:在杭州手机到哪里买,打分:0.498248815536499
相似文本:胡子长得快怎么办?,打分:0.2424771636724472
相似文本:怎样使胡子不浓密!,打分:0.21715955436229706
Langchain
Dmeta-embedding 模型支持通过 LLM 工具框架 langchain 来加载推理:
pip install -U langchain
import torch
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "DMetaSoul/Dmeta-embedding-zh-small"
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
model = HuggingFaceEmbeddings(
model_name=model_name,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
)
texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
embs1 = model.embed_documents(texts1)
embs2 = model.embed_documents(texts2)
embs1, embs2 = np.array(embs1), np.array(embs2)
# 计算两两相似度
similarity = embs1 @ embs2.T
print(similarity)
# 获取 texts1[i] 对应的最相似 texts2[j]
for i in range(len(texts1)):
scores = []
for j in range(len(texts2)):
scores.append([texts2[j], similarity[i][j]])
scores = sorted(scores, key=lambda x:x[1], reverse=True)
print(f"查询文本:{texts1[i]}")
for text2, score in scores:
print(f"相似文本:{text2},打分:{score}")
print()
HuggingFace Transformers
Dmeta-embedding 模型支持通过 HuggingFace Transformers 框架来加载推理:
pip install -U transformers
import torch
from transformers import AutoTokenizer, AutoModel
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def cls_pooling(model_output):
return model_output[0][:, 0]
texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/Dmeta-embedding-zh-small')
model = AutoModel.from_pretrained('DMetaSoul/Dmeta-embedding-zh-small')
model.eval()
with torch.no_grad():
inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors='pt')
inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors='pt')
model_output1 = model(**inputs1)
model_output2 = model(**inputs2)
embs1, embs2 = cls_pooling(model_output1), cls_pooling(model_output2)
embs1 = torch.nn.functional.normalize(embs1, p=2, dim=1).numpy()
embs2 = torch.nn.functional.normalize(embs2, p=2, dim=1).numpy()
# 计算两两相似度
similarity = embs1 @ embs2.T
print(similarity)
# 获取 texts1[i] 对应的最相似 texts2[j]
for i in range(len(texts1)):
scores = []
for j in range(len(texts2)):
scores.append([texts2[j], similarity[i][j]])
scores = sorted(scores, key=lambda x:x[1], reverse=True)
print(f"查询文本:{texts1[i]}")
for text2, score in scores:
print(f"相似文本:{text2},打分:{score}")
print()
Contact
您如果在使用过程中,遇到任何问题,欢迎前往讨论区建言献策。 您也可以联系我们:赵中昊 zhongh@dmetasoul.com, 肖文斌 xiaowenbin@dmetasoul.com, 孙凯 sunkai@dmetasoul.com 同时我们也开通了微信群,可扫码加入我们(人数超200了,先加管理员再拉进群),一起共建 AIGC 技术生态!
License
Dmeta-embedding 系列模型采用 Apache-2.0 License,开源模型可以进行免费商用私有部署。
- Downloads last month
- 427
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.
Model tree for DMetaSoul/Dmeta-embedding-zh-small
Spaces using DMetaSoul/Dmeta-embedding-zh-small 2
Evaluation results
- cos_sim_pearson on MTEB AFQMCvalidation set self-reported55.384
- cos_sim_spearman on MTEB AFQMCvalidation set self-reported59.543
- euclidean_pearson on MTEB AFQMCvalidation set self-reported58.186
- euclidean_spearman on MTEB AFQMCvalidation set self-reported59.543
- manhattan_pearson on MTEB AFQMCvalidation set self-reported58.142
- manhattan_spearman on MTEB AFQMCvalidation set self-reported59.479
- cos_sim_pearson on MTEB ATECtest set self-reported55.969
- cos_sim_spearman on MTEB ATECtest set self-reported58.633
- euclidean_pearson on MTEB ATECtest set self-reported62.784
- euclidean_spearman on MTEB ATECtest set self-reported58.633