infgrad/stella-large-zh

@@ -1057,9 +1057,26 @@ model-index:
 ## stella model
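-stella is a general-purpose Chinese text encoding model, currently available in two versions, base and large. **Both versions support an input length of 1024.**
-The full training approach and process are documented in this [blog post](https://zhuanlan.zhihu.com/p/655322183); comments and discussion are welcome.
 **Training data:**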
@@ -1074,12 +1091,23 @@ stella是一个通用的中文文本编码模型，目前有两个版本：base
 4. cosent loss[5]
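 5. One iterator per data type, with the loss computed and the model updated separately for each
 **Initial weights:**\
-stella-base-zh and stella-large-zh are initialized from piccolo-base-zh[6] and piccolo-large-zh respectively; the position embeddings for positions 512-1024 are initialized with hierarchical decomposition position encoding[7].\
 Thanks to SenseTime Research for open-sourcing the [piccolo model series](https://huggingface.co/sensenova).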
-stella is a general-purpose Chinese text encoding model, currently available in two versions, base and large.
-**Both support input lengths of up to 1024.**
 The training data mainly includes:
@@ -1101,21 +1129,72 @@ stella-base-zh and stella-large-zh use piccolo-base-zh and piccolo-large-zh as t
 Training strategy:\
 One iterator for each type of data, separately calculating the loss.
 ## Metric
-#### C-MTEB leaderboard
-Results of the stella models on C-MTEB[8]; see the blog post for the evaluation script.
-|        Model Name        | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
-|:------------------------:|:---------------:|:---------:|:---------------:|:------------:|:------------------:|:--------------:|:-----------------------:|:-------------:|:-------------:|:-------:|
-|   **stella-large-zh**    |      0.65       |   1024    |    **1024**     |  **64.54**   |       67.62        |     48.65      |          78.72          |     65.98     |     71.02     |  58.3   |
-|    **stella-base-zh**    |       0.2       |    768    |    **1024**     |  **64.16**   |       67.77        |      48.7      |          76.09          |     66.95     |     71.07     |  56.54  |
-|     piccolo-large-zh     |      0.65       |   1024    |       512       |    64.11     |       67.03        |     47.04      |          78.38          |     65.98     |     70.93     |  58.02  |
-|       bge-large-zh       |       1.3       |   1024    |       512       |    63.96     |       68.32        |     48.39      |          78.94          |     65.11     |     71.52     |  54.98  |
-|     piccolo-base-zh      |       0.2       |    768    |       512       |    63.66     |       66.98        |     47.12      |          76.61          |     66.68     |     71.2      |  55.9   |
-| bge-large-zh-no-instruct |       1.3       |   1024    |       512       |     63.4     |       68.58        |     50.01      |          76.77          |     64.9      |     70.54     |   53    |
-|       bge-base-zh        |      0.41       |    768    |       512       |     62.8     |       67.07        |     47.64      |          77.5           |     64.91     |     69.53     |  54.12  |
 #### Evaluation for long text
@@ -1159,29 +1238,31 @@ passage长度为800多，大于512，但是对于这个question而言只需要
 | Multifieldqa_zh |      81.41      |      83.92       |    83.92    |    83.42     |      79.9      |      80.4       |
 |   **Average**   |      74.98      |      74.83       |    74.76    |    76.15     |   **78.96**    |    **78.24**    |
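 **Note:** Long-text evaluation data is scarce, so the train split was also used to build this benchmark. If you run your own evaluation, check each model's training data to avoid data leakage.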
 ## Usage
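-This model was trained on top of piccolo, so **its usage is exactly the same as piccolo's**.\
-**Note:** the colon in stella's instructions is the English (ASCII) colon, i.e. `查询: ` and `结果: `.
 Usage with the sentence-transformers library: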
 ```python
-# General usage for short-to-short datasets
 from sentence_transformers import SentenceTransformer
 sentences = ["数据1", "数据2"]
-model = SentenceTransformer('infgrad/stella-base-zh')
 print(model.max_seq_length)
 embeddings_1 = model.encode(sentences, normalize_embeddings=True)
 embeddings_2 = model.encode(sentences, normalize_embeddings=True)
 similarity = embeddings_1 @ embeddings_2.T
 print(similarity)
-# For short-to-long datasets, adding the instruction is recommended to help the model retrieve better.
-# Note that the instruction uses the English (ASCII) colon
 ```
 Using the transformers library directly:
@@ -1190,8 +1271,8 @@ print(similarity)
 from transformers import AutoModel, AutoTokenizer
 from sklearn.preprocessing import normalize
-model = AutoModel.from_pretrained('infgrad/stella-base-zh')
-tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-zh')
 sentences = ["数据1", "数据ABCDEFGH"]
 batch_data = tokenizer(
     batch_text_or_text_pairs=sentences,
@@ -1208,6 +1289,46 @@ vectors = normalize(vectors, norm="l2", axis=1, )
 print(vectors.shape)  # 2,768
 ```
 ## Training Detail
 **Hardware:** a single A100-80GB GPU
@@ -1218,13 +1339,12 @@ print(vectors.shape)  # 2,768
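 **batch_size:** 1024 for the base model and 768 for the large model, each with an additional 20% of hard negatives
-**Data volume:** about 1 million examples, of which roughly 200K were constructed with an LLM. The LLM is a 13B model.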
 ## ToDoList
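 **Evaluation stability:**
-During evaluation, the Clustering task scores did not match the official results exactly. The gap is small (on the order of ±0.0x) and essentially negligible, so it does not affect the conclusions.\
-Still, it is hard to explain why the results are not identical. The same issue appears with the bge and piccolo models I tried; my guess is that it depends on the environment, such as the libraries used and the batch_size.
 **Higher-quality long-text training and test data:** Most of the training data was constructed with a 13B model, so it certainly contains noise.
 The test data was mostly assembled from MRC data, so the questions are all factoid-type and do not reflect the real-world distribution.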
@@ -1246,3 +1366,5 @@ print(vectors.shape)  # 2,768

 ## stella model
+**News**
+**[2023-10-19]** Released stella-base-en-v2. It is simple to use and **does not need any prefix text**.\
+**[2023-10-12]** Released stella-base-zh-v2 and stella-large-zh-v2. Both models perform better and are simple to use: they **do not need any prefix text**.\
+**[2023-09-11]** Released stella-base-zh and stella-large-zh.
+stella is a general-purpose text encoding model. The following models are available:
+|     Model Name     | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
+|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
+| stella-base-en-v2  |       0.2       |    768    |       512       | English  |               No                |
+| stella-large-zh-v2 |      0.65       |   1024    |      1024       | Chinese  |               No                |
+| stella-base-zh-v2  |       0.2       |    768    |      1024       | Chinese  |               No                |
+|  stella-large-zh   |      0.65       |   1024    |      1024       | Chinese  |               Yes               |
+|   stella-base-zh   |       0.2       |    768    |      1024       | Chinese  |               Yes               |
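+The full training approach and process are documented in [blog post 1](https://zhuanlan.zhihu.com/p/655322183) and [blog post 2](https://zhuanlan.zhihu.com/p/662209559); comments and discussion are welcome.
 **Training data:**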
 4. cosent loss[5]
 5. One iterator per data type, with the loss computed and the model updated separately for each
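
For readers unfamiliar with the CoSENT loss named in item 4, the following is a minimal NumPy sketch of the loss as it is commonly formulated: `log(1 + Σ exp(λ · (cos_low − cos_high)))`, summed over every pair of examples in which one has a higher similarity label than the other. The function name `cosent_loss` and the scale `λ = 20` are assumptions for illustration; the README does not show the authors' implementation.

```python
import numpy as np


def cosent_loss(cos_sims, labels, scale=20.0):
    """CoSENT ranking loss over a batch of sentence pairs.

    cos_sims[i] is the predicted cosine similarity of pair i and
    labels[i] is its gold similarity (any comparable number).
    `scale` is the temperature-like factor (assumed value: 20).
    """
    s = np.asarray(cos_sims, dtype=np.float64) * scale
    y = np.asarray(labels, dtype=np.float64)
    # diff[i, j] = s_j - s_i: positive when pair j is scored above pair i.
    diff = s[None, :] - s[:, None]
    # Only penalise (i, j) where pair i should rank above pair j.
    mask = y[:, None] > y[None, :]
    # Prepending 0 supplies the "1 +" inside the log, since exp(0) = 1.
    terms = np.concatenate([[0.0], diff[mask]])
    # Numerically stable log-sum-exp.
    m = terms.max()
    return float(m + np.log(np.exp(terms - m).sum()))


# A correctly ordered batch gives a loss near 0; a reversed one is penalised.
print(cosent_loss([0.9, 0.1], [1, 0]))
print(cosent_loss([0.1, 0.9], [1, 0]))
```

Because the loss depends only on the relative order of the labels, it can be used with graded similarity scores as well as binary ones.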
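+stella-v2 builds on the stella models with more training data, and uses knowledge distillation, among other methods, to remove the need for a leading instruction
+(for example piccolo's `查询:` and `结果:`, or e5's `query:` and `passage:`).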
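 **Initial weights:**\
+stella-base-zh and stella-large-zh are initialized from piccolo-base-zh[6] and piccolo-large-zh respectively; the position
+embeddings for positions 512-1024 are initialized with hierarchical decomposition position encoding[7].\
 Thanks to SenseTime Research for open-sourcing the [piccolo model series](https://huggingface.co/sensenova).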
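
To make the position-embedding initialization more concrete, here is a hedged NumPy sketch of hierarchical decomposition as the technique is usually described: position `p = i·n + j` receives `α·u_i + (1−α)·u_j`, where `u_k = (p_k − α·p_0) / (1−α)` is derived from the trained table of length `n`. The function name `hierarchical_extend` and the value `α = 0.4` are assumptions for illustration; the README does not state which hyperparameters the authors used.

```python
import numpy as np


def hierarchical_extend(pos_emb, new_len, alpha=0.4):
    """Extend an (n, d) position-embedding table to `new_len` rows.

    `alpha` is an assumed hyperparameter, not a value from the README.
    """
    n = pos_emb.shape[0]
    if new_len > n * n:
        raise ValueError("hierarchical decomposition covers at most n*n positions")
    # Base vectors u_k, chosen so that the first n extended rows
    # reproduce the trained embeddings exactly.
    base = (pos_emb - alpha * pos_emb[:1]) / (1.0 - alpha)
    positions = np.arange(new_len)
    coarse = positions // n  # which block of n positions
    fine = positions % n     # offset inside that block
    return alpha * base[coarse] + (1.0 - alpha) * base[fine]


# Toy table standing in for a trained 512-position embedding matrix.
rng = np.random.default_rng(0)
trained = rng.normal(size=(512, 8))
extended = hierarchical_extend(trained, 1024)
print(extended.shape)
```

The useful property is that positions 0-511 keep their trained values, so only positions 512-1023 start from a derived initialization that later training can refine.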
+stella is a general-purpose text encoder, which mainly includes the following models:
+|     Model Name     | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
+|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
+| stella-base-en-v2  |       0.2       |    768    |       512       | English  |               No                |
+| stella-large-zh-v2 |      0.65       |   1024    |      1024       | Chinese  |               No                |
+| stella-base-zh-v2  |       0.2       |    768    |      1024       | Chinese  |               No                |
+|  stella-large-zh   |      0.65       |   1024    |      1024       | Chinese  |               Yes               |
+|   stella-base-zh   |       0.2       |    768    |      1024       | Chinese  |               Yes               |
 The training data mainly includes:
 Training strategy:\
 One iterator for each type of data, separately calculating the loss.
+Building on the stella models, stella-v2 uses more training data and removes the need for an instruction prefix through knowledge distillation.
 ## Metric
+#### C-MTEB leaderboard (Chinese)
+|     Model Name     | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
+|:------------------:|:---------------:|:---------:|:---------------:|:------------:|:------------------:|:--------------:|:-----------------------:|:-------------:|:-------------:|:-------:|
+| stella-large-zh-v2 |      0.65       |   1024    |      1024       |    65.13     |       69.05        |     49.16      |          82.68          |     66.41     |     70.14     |  58.66  |
+| stella-base-zh-v2  |       0.2       |    768    |      1024       |    64.36     |       68.29        |      49.4      |          79.95          |     66.1      |     70.08     |  56.92  |
+|  stella-large-zh   |      0.65       |   1024    |      1024       |    64.54     |       67.62        |     48.65      |          78.72          |     65.98     |     71.02     |  58.3   |
+|   stella-base-zh   |       0.2       |    768    |      1024       |    64.16     |       67.77        |      48.7      |          76.09          |     66.95     |     71.07     |  56.54  |
+#### MTEB leaderboard (English)
+|    Model Name     | Model Size (GB) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization  (1) |
+|:-----------------:|:---------------:|:---------:|:---------------:|:------------:|:-------------------:|:---------------:|:-----------------------:|:-------------:|:--------------:|:--------:|:------------------:|
+| stella-base-en-v2 |       0.2       |    768    |       512       |    62.61     |        75.28        |      44.9       |          86.45          |     58.77     |      50.1      |  83.02   |       32.52        |
+#### Reproduce our results
+**C-MTEB:**
+```python
+import torch
+import numpy as np
+from typing import List
+from mteb import MTEB
+from sentence_transformers import SentenceTransformer
+class FastTextEncoder():
+    def __init__(self, model_name):
+        self.model = SentenceTransformer(model_name).cuda().half().eval()
+        self.model.max_seq_length = 512
+    def encode(
+            self,
+            input_texts: List[str],
+            *args,
+            **kwargs
+    ):
+        new_sens = list(set(input_texts))
+        new_sens.sort(key=lambda x: len(x), reverse=True)
+        vecs = self.model.encode(
+            new_sens, normalize_embeddings=True, convert_to_numpy=True, batch_size=256
+        ).astype(np.float32)
+        sen2arrid = {sen: idx for idx, sen in enumerate(new_sens)}
+        vecs = vecs[[sen2arrid[sen] for sen in input_texts]]
+        torch.cuda.empty_cache()
+        return vecs
+if __name__ == '__main__':
+    model_name = "infgrad/stella-base-zh-v2"
+    output_folder = "zh_mteb_results/stella-base-zh-v2"
+    task_names = [t.description["name"] for t in MTEB(task_langs=['zh', 'zh-CN']).tasks]
+    model = FastTextEncoder(model_name)
+    for task in task_names:
+        MTEB(tasks=[task], task_langs=['zh', 'zh-CN']).run(model, output_folder=output_folder)
+```
+**MTEB:**
+You can use the official script to reproduce our results: [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py)
 #### Evaluation for long text
 | Multifieldqa_zh |      81.41      |      83.92       |    83.92    |    83.42     |      79.9      |      80.4       |
 |   **Average**   |      74.98      |      74.83       |    74.76    |    76.15     |   **78.96**    |    **78.24**    |
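 **Note:** Long-text evaluation data is scarce, so the train split was also used to build this benchmark. If you run your own evaluation, check each model's training data to avoid data leakage.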
 ## Usage
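+#### stella models for Chinese
+stella-base-zh and stella-large-zh: these models were trained on top of piccolo, so **their usage is exactly the same as piccolo's**:
+for retrieval and reranking tasks, prepend `查询: ` to each query and `结果: ` to each passage. Short-to-short matching needs no prefix.
+stella-base-zh-v2 and stella-large-zh-v2: these models are simple to use; **no prefix text is needed in any scenario**.
+All Chinese stella models use mean pooling to produce the text vector.
 Usage with the sentence-transformers library: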
 ```python
 from sentence_transformers import SentenceTransformer
 sentences = ["数据1", "数据2"]
+model = SentenceTransformer('infgrad/stella-base-zh-v2')
 print(model.max_seq_length)
 embeddings_1 = model.encode(sentences, normalize_embeddings=True)
 embeddings_2 = model.encode(sentences, normalize_embeddings=True)
 similarity = embeddings_1 @ embeddings_2.T
 print(similarity)
 ```
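
For the first-generation models (stella-base-zh and stella-large-zh), retrieval-style inputs need the piccolo prefixes. The sketch below shows one way to apply them. The helper names `add_retrieval_prefixes` and `score_passages` are hypothetical, and `model` is assumed to be any object with a SentenceTransformer-style `encode` method.

```python
QUERY_PREFIX = "查询: "    # English (ASCII) colon followed by a space
PASSAGE_PREFIX = "结果: "  # same colon convention


def add_retrieval_prefixes(queries, passages):
    """Prepend the piccolo-style instruction prefixes (hypothetical helper)."""
    prefixed_queries = [QUERY_PREFIX + q for q in queries]
    prefixed_passages = [PASSAGE_PREFIX + p for p in passages]
    return prefixed_queries, prefixed_passages


def score_passages(model, queries, passages):
    """Return a (num_queries, num_passages) cosine-similarity matrix.

    `model` is assumed to expose encode(texts, normalize_embeddings=True),
    as sentence_transformers.SentenceTransformer does.
    """
    q_texts, p_texts = add_retrieval_prefixes(queries, passages)
    q_vecs = model.encode(q_texts, normalize_embeddings=True)
    p_vecs = model.encode(p_texts, normalize_embeddings=True)
    # Vectors are L2-normalized, so the dot product is the cosine similarity.
    return q_vecs @ p_vecs.T
```

With `model = SentenceTransformer('infgrad/stella-base-zh')`, calling `score_passages(model, ["问题"], ["文档1", "文档2"])` would yield one row of scores. For the v2 models the prefixes should be left out.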
 Using the transformers library directly:
 from transformers import AutoModel, AutoTokenizer
 from sklearn.preprocessing import normalize
+model = AutoModel.from_pretrained('infgrad/stella-base-zh-v2')
+tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-zh-v2')
 sentences = ["数据1", "数据ABCDEFGH"]
 batch_data = tokenizer(
     batch_text_or_text_pairs=sentences,
 print(vectors.shape)  # 2,768
 ```
+#### stella models for English
+**Using Sentence-Transformers:**
+```python
+from sentence_transformers import SentenceTransformer
+sentences = ["one car come", "one car go"]
+model = SentenceTransformer('infgrad/stella-base-en-v2')
+print(model.max_seq_length)
+embeddings_1 = model.encode(sentences, normalize_embeddings=True)
+embeddings_2 = model.encode(sentences, normalize_embeddings=True)
+similarity = embeddings_1 @ embeddings_2.T
+print(similarity)
+```
+**Using HuggingFace Transformers:**
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+from sklearn.preprocessing import normalize
+model = AutoModel.from_pretrained('infgrad/stella-base-en-v2')
+tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-en-v2')
+model.eval()  # disable dropout for deterministic embeddings
+sentences = ["one car come", "one car go"]
+batch_data = tokenizer(
+    batch_text_or_text_pairs=sentences,
+    padding="longest",
+    return_tensors="pt",
+    max_length=512,
+    truncation=True,
+)
+attention_mask = batch_data["attention_mask"]
+# Inference only: without autograd the output can be converted to NumPy.
+with torch.no_grad():
+    model_output = model(**batch_data)
+# Mean pooling: zero out padding positions, then average over the real tokens.
+last_hidden = model_output.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
+vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
+# L2-normalize so that dot products equal cosine similarities.
+vectors = normalize(vectors.numpy(), norm="l2", axis=1)
+print(vectors.shape)  # 2,768
+```
 ## Training Detail
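 **Hardware:** a single A100-80GB GPU
 **batch_size:** 1024 for the base model and 768 for the large model, each with an additional 20% of hard negatives
+**Data volume:** about 1 million examples for the first-generation models, of which roughly 200K were constructed with an LLM (a 13B model). The v2 models were trained on about 20 million examples.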
 ## ToDoList
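 **Evaluation stability:**
+During evaluation, the Clustering task scores did not match the official results exactly, differing by a small margin on the order of ±0.0x. The cause is that the clustering code does not set a random seed. The gap is negligible and does not affect the conclusions.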
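
The effect of an unset seed can be illustrated with a small, self-contained sketch. It assumes scikit-learn's `MiniBatchKMeans` scored with V-measure as a stand-in for the benchmark's clustering evaluation; the synthetic data, the helper name `clustering_score`, and every parameter value are illustrative only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score


def clustering_score(vectors, labels, seed=None):
    """Cluster the vectors and score the result against the gold labels."""
    n_clusters = len(np.unique(labels))
    kmeans = MiniBatchKMeans(
        n_clusters=n_clusters, batch_size=32, n_init=3, random_state=seed
    )
    predicted = kmeans.fit_predict(vectors)
    return v_measure_score(labels, predicted)


# Synthetic, overlapping clusters standing in for sentence embeddings.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
centers = rng.normal(size=(4, 16))
vectors = centers[labels] + rng.normal(scale=1.5, size=(200, 16))

seeded = [clustering_score(vectors, labels, seed=42) for _ in range(3)]
unseeded = [clustering_score(vectors, labels) for _ in range(3)]
print(seeded)    # identical values: the seed fixes the initialization
print(unseeded)  # may vary slightly from run to run
```

Fixing `random_state` makes repeated runs agree with each other, though not necessarily with scores produced under a different seed or library version.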
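 **Higher-quality long-text training and test data:** Most of the training data was constructed with a 13B model, so it certainly contains noise.
 The test data was mostly assembled from MRC data, so the questions are all factoid-type and do not reflect the real-world distribution.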