Update README.md
## piccolo-base-zh

piccolo is a general text embedding model (Chinese), powered by the General Model Group from SenseTime Research.
Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet and train the model with a pair (text, text_pos) softmax contrastive loss.
In the second stage, we collect and curate 20 million human-labeled Chinese text pairs (carefully annotated data) and fine-tune the model with a triplet (text, text_pos, text_neg) softmax contrastive loss that uses hard negatives.
Currently we offer two model sizes: piccolo-base-zh and piccolo-large-zh.
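
The README gives no training code; the sketch below is a minimal, illustrative rendering of the two losses described above. It assumes in-batch negatives and a temperature hyperparameter, which are standard for this loss family but not stated here; the function names and the temperature value are hypothetical, not piccolo's actual training code.

```python
# Illustrative sketch only: in-batch negatives and the temperature
# value are assumptions, not details published with piccolo.
import torch
import torch.nn.functional as F

def pair_softmax_contrastive_loss(q, p, temperature=0.05):
    """Stage 1: (text, text_pos) softmax contrastive loss.

    q, p: (batch, dim) embeddings of texts and their positives;
    every other in-batch positive serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal holds the positives
    return F.cross_entropy(logits, labels)

def triplet_softmax_contrastive_loss(q, p, n, temperature=0.05):
    """Stage 2: (text, text_pos, text_neg) softmax contrastive loss.

    n: (batch, dim) embeddings of mined hard negatives, appended to
    the candidate set alongside the in-batch positives.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    n = F.normalize(n, dim=-1)
    candidates = torch.cat([p, n], dim=0)              # (2 * batch, dim)
    logits = q @ candidates.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Smoke test with random embeddings.
q, p, n = (torch.randn(8, 768) for _ in range(3))
print(pair_softmax_contrastive_loss(q, p).item())
print(triplet_softmax_contrastive_loss(q, p, n).item())
```
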
## Metric

The model can be used through the sentence-transformers package:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sensenova/piccolo-base-zh')

queries = ["..."]   # your query strings
passages = ["..."]  # your passage strings

# piccolo expects the "查询:" prefix on queries and the "结果:" prefix on passages.
q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```
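
Because the embeddings are L2-normalized (`normalize_embeddings=True`), the matrix product in `scores` yields cosine similarities: higher values mean a closer query-passage match.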

## Training Detail

TODO

## Acknowledgement

piccolo is powered by the General Model Group from SenseTime Research.