Update README.md

README.md CHANGED

@@ -1053,6 +1053,181 @@ model-index:
      value: 69.04229346593745
    - type: f1
      value: 84.52986739717021
license: apache-2.0
---

<div align="center">
    <img src="logo.png" alt="icon" width="100px"/>
</div>

<h1 align="center">Dmeta-embedding-zh-small</h1>

- The Dmeta-embedding series are cross-domain, cross-task, out-of-the-box Chinese embedding models suited to search, question answering, intelligent customer service, LLM+RAG, and other business scenarios. They can be loaded for inference through tools such as Transformers, Sentence-Transformers, and Langchain.
- **Dmeta-embedding-zh-small** is a distilled version (8-layer BERT) of the open-source model [Dmeta-embedding-zh](https://huggingface.co/DMetaSoul/Dmeta-embedding-zh), under 300 MB in size. Compared with the original model, it is about one-third smaller, roughly 30% faster at inference, and about 1.4% lower in overall accuracy.

---

## Evaluation

The comparisons below are against the corresponding teacher model before distillation.

*Performance* (measured on 10,000 samples; GPU: V100):

|            | Teacher               | Student              | Gap   |
| ---------- | --------------------- | -------------------- | ----- |
| Model      | BERT-12-layers (411M) | BERT-8-layers (297M) | 0.67x |
| Cost       | 127s                  | 89s                  | -30%  |
| Latency    | 13ms                  | 9ms                  | -31%  |
| Throughput | 78 sentences/s        | 111 sentences/s      | 1.4x  |

*Accuracy* (from the MTEB leaderboard):

|             | **Classification** | **Clustering** | **Pair Classification** | **Reranking** | **Retrieval** | **STS** | **Avg** |
| ----------- | ------------------ | -------------- | ----------------------- | ------------- | ------------- | ------- | ------- |
| **Teacher** | 70                 | 50.96          | 88.92                   | 67.17         | 70.41         | 64.89   | 67.51   |
| **Student** | 69.89              | 50.8           | 87.57                   | 66.92         | 67.7          | 62.13   | 66.1    |
| **Gap**     | -0.11              | -0.16          | -1.35                   | -0.25         | -2.71         | -2.76   | -1.41   |
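
For a comparable local run, one option is the `mteb` package; the sketch below is an assumption about tooling (the task selection and the `mteb` API differ across versions), not the official evaluation script.

```python
from mteb import MTEB  # assumes the `mteb` package; its API varies across versions
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DMetaSoul/Dmeta-embedding-zh-small")
evaluation = MTEB(task_langs=["zh"])  # restrict to Chinese (C-MTEB) tasks
evaluation.run(model, output_folder="results/Dmeta-embedding-zh-small")
```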

## Usage

The model can be loaded for inference through mainstream frameworks such as [Sentence-Transformers](#sentence-transformers), [Langchain](#langchain), and [Huggingface Transformers](#huggingface-transformers); see the per-framework examples below.

### Sentence-Transformers

The Dmeta-embedding model can be loaded and run through [sentence-transformers](https://www.SBERT.net):

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

model = SentenceTransformer('DMetaSoul/Dmeta-embedding-zh-small')
embs1 = model.encode(texts1, normalize_embeddings=True)
embs2 = model.encode(texts2, normalize_embeddings=True)

# Pairwise similarity; the dot product equals cosine similarity because the embeddings are L2-normalized
similarity = embs1 @ embs2.T
print(similarity)

# For each texts1[i], rank texts2[j] by similarity
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    print(f"Query: {texts1[i]}")
    for text2, score in scores:
        print(f"Match: {text2}, score: {score}")
    print()
```

Sample output:

```
Query: 胡子长得太快怎么办?
Match: 胡子长得快怎么办?, score: 0.965681254863739
Match: 怎样使胡子不浓密!, score: 0.7353651523590088
Match: 香港买手表哪里好, score: 0.24928246438503265
Match: 在杭州手机到哪里买, score: 0.2038613110780716

Query: 在香港哪里买手表好
Match: 香港买手表哪里好, score: 0.9916468262672424
Match: 在杭州手机到哪里买, score: 0.498248815536499
Match: 胡子长得快怎么办?, score: 0.2424771636724472
Match: 怎样使胡子不浓密!, score: 0.21715955436229706
```
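
Equivalently, sentence-transformers ships a helper that computes the same cosine-similarity matrix without manual normalization. A short sketch, reusing `model`, `texts1`, and `texts2` from above:

```python
from sentence_transformers import util

# util.cos_sim normalizes internally, so normalize_embeddings=True is not needed here
similarity = util.cos_sim(model.encode(texts1), model.encode(texts2))
print(similarity)  # torch.Tensor of shape (len(texts1), len(texts2))
```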

### Langchain

The Dmeta-embedding model can be loaded through the LLM application framework [langchain](https://www.langchain.com/):

```
pip install -U langchain sentence-transformers
```

```python
import torch
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "DMetaSoul/Dmeta-embedding-zh-small"
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': True}  # set True so dot products equal cosine similarity

model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

embs1 = model.embed_documents(texts1)
embs2 = model.embed_documents(texts2)
embs1, embs2 = np.array(embs1), np.array(embs2)

# Pairwise similarity
similarity = embs1 @ embs2.T
print(similarity)

# For each texts1[i], rank texts2[j] by similarity
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    print(f"Query: {texts1[i]}")
    for text2, score in scores:
        print(f"Match: {text2}, score: {score}")
    print()
```
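
The same wrapper also exposes `embed_query` for single query strings, which is the method retrieval pipelines typically call. A short sketch, reusing `model` and `embs2` from above:

```python
# Score one query against the embedded documents (embs2 is already normalized)
query_emb = np.array(model.embed_query("在香港哪里买手表好"))
print(embs2 @ query_emb)  # cosine scores against texts2
```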

### HuggingFace Transformers

The Dmeta-embedding model can be loaded and run through the [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) framework:

```
pip install -U transformers
```

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def cls_pooling(model_output):
    return model_output[0][:, 0]

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/Dmeta-embedding-zh-small')
model = AutoModel.from_pretrained('DMetaSoul/Dmeta-embedding-zh-small')
model.eval()

with torch.no_grad():
    inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors='pt')
    inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors='pt')
    model_output1 = model(**inputs1)
    model_output2 = model(**inputs2)

embs1, embs2 = cls_pooling(model_output1), cls_pooling(model_output2)
embs1 = torch.nn.functional.normalize(embs1, p=2, dim=1).numpy()
embs2 = torch.nn.functional.normalize(embs2, p=2, dim=1).numpy()

# Pairwise similarity
similarity = embs1 @ embs2.T
print(similarity)

# For each texts1[i], rank texts2[j] by similarity
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    print(f"Query: {texts1[i]}")
    for text2, score in scores:
        print(f"Match: {text2}, score: {score}")
    print()
```
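
The snippet defines both `cls_pooling` and `mean_pooling` but uses CLS pooling. To experiment with mean pooling instead (expect different scores, since the examples here use CLS pooling), swap the pooling call:

```python
# Alternative pooling: average token embeddings weighted by the attention mask
embs1 = mean_pooling(model_output1, inputs1['attention_mask'])
embs2 = mean_pooling(model_output2, inputs2['attention_mask'])
```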

## Contact

If you run into any problems while using the model, you are welcome to share suggestions in the [discussions](https://huggingface.co/DMetaSoul/Dmeta-embedding/discussions).

You can also contact us: 赵中昊 <zhongh@dmetasoul.com>, 肖文斌 <xiaowenbin@dmetasoul.com>, 孙凯 <sunkai@dmetasoul.com>

We also run a WeChat group. Scan the QR code below to join us (the group has passed 200 members, so add the admin first and they will invite you in) and help build the AIGC technology ecosystem together!

<img src="https://huggingface.co/DMetaSoul/Dmeta-embedding/resolve/main/weixin.jpeg" style="display: block; margin-left: auto; margin-right: auto; width: 256px; height: 358px;"/>

## License

The Dmeta-embedding model is released under the Apache-2.0 License; the open-source model is free for commercial use and private deployment.