Update README.md
## piccolo-base-zh

piccolo is a general text embedding model (Chinese), powered by the General Model Group from SenseTime Research.
Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet and train the model with a pair (text, text_pos) softmax contrastive loss.
In the second stage, we collect and curate 20 million human-labeled Chinese text pairs (carefully annotated data) and fine-tune the model with a triplet (text, text_pos, text_neg) softmax contrastive loss that uses hard negatives.
Currently we offer two model sizes: piccolo-base-zh and piccolo-large-zh.
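
The README gives no training code; the sketch below is a minimal, illustrative rendering of the two losses described above. It assumes in-batch negatives and a temperature hyperparameter, which are standard for this loss family but not stated here; the function names and the temperature value are hypothetical, not piccolo's actual training code.

```python
# Illustrative sketch only: in-batch negatives and the temperature
# value are assumptions, not details published with piccolo.
import torch
import torch.nn.functional as F

def pair_softmax_contrastive_loss(q, p, temperature=0.05):
    """Stage 1: (text, text_pos) softmax contrastive loss.

    q, p: (batch, dim) embeddings of texts and their positives;
    every other in-batch positive serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal holds the positives
    return F.cross_entropy(logits, labels)

def triplet_softmax_contrastive_loss(q, p, n, temperature=0.05):
    """Stage 2: (text, text_pos, text_neg) softmax contrastive loss.

    n: (batch, dim) embeddings of mined hard negatives, appended to
    the candidate set alongside the in-batch positives.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    n = F.normalize(n, dim=-1)
    candidates = torch.cat([p, n], dim=0)              # (2 * batch, dim)
    logits = q @ candidates.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Smoke test with random embeddings.
q, p, n = (torch.randn(8, 768) for _ in range(3))
print(pair_softmax_contrastive_loss(q, p).item())
print(triplet_softmax_contrastive_loss(q, p, n).item())
```
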
## Metric

The model can be used through the sentence-transformers package:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sensenova/piccolo-base-zh')

queries = ["..."]   # your query strings
passages = ["..."]  # your passage strings

# piccolo expects the "查询:" prefix on queries and the "结果:" prefix on passages.
q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```
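
Because the embeddings are L2-normalized (`normalize_embeddings=True`), the matrix product in `scores` yields cosine similarities: higher values mean a closer query-passage match.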

## Training Detail

TODO

## Acknowledgement

piccolo is powered by the General Model Group from SenseTime Research.