infgrad committed
Commit f11c871
1 Parent(s): 0516a2f

Update README.md

Files changed (1)
  1. README.md +54 -4
README.md CHANGED
@@ -1063,6 +1063,8 @@ model-index:
 
 **新闻 | News**
 
+**[2023-10-19]** Release stella-base-en-v2. The model is easy to use and
+**does not need any prefix text**.\
 **[2023-10-12]** Release stella-base-zh-v2 and stella-large-zh-v2.
 The two models have better performance, are easy to use,
 and **do not need any prefix text**.\
@@ -1072,12 +1074,13 @@ stella是一个通用的文本编码模型,主要有以下模型:
 
 | Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
 |:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
+| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
 | stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
 | stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
 | stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
 | stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |
 
-The full training approach and process are documented in this [blog post](https://zhuanlan.zhihu.com/p/655322183); you are welcome to read and discuss it.
+The full training approach and process are documented in [blog post 1](https://zhuanlan.zhihu.com/p/655322183) and [blog post 2](https://zhuanlan.zhihu.com/p/662209559); you are welcome to read and discuss them.
 
 **Training data:**
 
@@ -1104,6 +1107,7 @@ stella is a general-purpose text encoder, which mainly includes the following mo
 
 | Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
 |:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
+| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
 | stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
 | stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
 | stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
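The "Need instruction for retrieval?" column in the two tables above is the practical difference between the model generations: the v1 models expect an instruction prefix on retrieval queries, while the v2 and English models embed raw text. A minimal sketch of that difference with sentence-transformers; the "查询: " prefix is an illustrative placeholder, not part of this commit, so take the exact instruction string from the v1 model cards:

```python
from sentence_transformers import SentenceTransformer

query = "北京的天气怎么样?"

# v1 models: retrieval queries need an instruction prefix.
# "查询: " is a placeholder; use the exact string from the
# stella-large-zh / stella-base-zh model cards.
model_v1 = SentenceTransformer("infgrad/stella-large-zh")
vec_v1 = model_v1.encode(["查询: " + query], normalize_embeddings=True)

# v2 (and the English model): plain text, no prefix at all.
model_v2 = SentenceTransformer("infgrad/stella-base-zh-v2")
vec_v2 = model_v2.encode([query], normalize_embeddings=True)
```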
@@ -1142,9 +1146,15 @@ Based on stella models, stella-v2 use more training data and remove instruction
 | stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
 | stella-base-zh | 0.2 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
 
+#### MTEB leaderboard (English)
+
+| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
+|:-----------------:|:---------------:|:---------:|:---------------:|:------------:|:-------------------:|:---------------:|:-----------------------:|:-------------:|:--------------:|:--------:|:------------------:|
+| stella-base-en-v2 | 0.2 | 768 | 512 | 62.61 | 75.28 | 44.9 | 86.45 | 58.77 | 50.1 | 83.02 | 32.52 |
+
 #### Reproduce our results
 
-Codes:
+**C-MTEB:**
 
 ```python
 import torch
@@ -1186,6 +1196,10 @@ if __name__ == '__main__':
 
 ```
 
+**MTEB:**
+
+You can use the official script [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) to reproduce our results.
+
 #### Evaluation for long text
 
 In practice, we found that the C-MTEB evaluation texts are almost all under 512 in length,
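For the MTEB numbers, a minimal sketch of what the linked script does with the mteb package; the single task named here is illustrative, since scripts/run_mteb_english.py pins the full English task list:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# One illustrative task; the official script runs the whole English suite.
model = SentenceTransformer("infgrad/stella-base-en-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/stella-base-en-v2")
```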
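On the long-text observation above: since C-MTEB texts rarely exceed 512, long-document evaluation needs the zh models' full 1024-token window. A small sketch with sentence-transformers, assuming the checkpoint loads with a smaller default cap:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("infgrad/stella-base-zh-v2")
# The zh models accept sequences up to 1024 (see the tables above);
# raise the encoder's cap if the loaded default is lower.
model.max_seq_length = 1024
vectors = model.encode(["一段很长的文档……"], normalize_embeddings=True)  # placeholder long document
print(vectors.shape)
```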
@@ -1244,7 +1258,6 @@ stella中文系列模型均使用mean pooling做为文本向量。
 Usage with the sentence-transformers library:
 
 ```python
-# For short-query-to-short-passage datasets, the following is the general usage
 from sentence_transformers import SentenceTransformer
 
 sentences = ["数据1", "数据2"]
@@ -1282,7 +1295,43 @@ print(vectors.shape) # 2,768
 
 #### stella models for English
 
-developing...
+**Using Sentence-Transformers:**
+
+```python
+from sentence_transformers import SentenceTransformer
+
+sentences = ["one car come", "one car go"]
+model = SentenceTransformer('infgrad/stella-base-en-v2')
+print(model.max_seq_length)
+embeddings_1 = model.encode(sentences, normalize_embeddings=True)
+embeddings_2 = model.encode(sentences, normalize_embeddings=True)
+similarity = embeddings_1 @ embeddings_2.T
+print(similarity)
+```
+
+**Using HuggingFace Transformers:**
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+from sklearn.preprocessing import normalize
+
+model = AutoModel.from_pretrained('infgrad/stella-base-en-v2')
+tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-en-v2')
+sentences = ["one car come", "one car go"]
+batch_data = tokenizer(
+    batch_text_or_text_pairs=sentences,
+    padding="longest",
+    return_tensors="pt",
+    max_length=512,
+    truncation=True,
+)
+attention_mask = batch_data["attention_mask"]
+with torch.no_grad():  # inference only; also lets sklearn's normalize accept the tensor
+    model_output = model(**batch_data)
+# mean pooling: zero out padding positions, then average over valid tokens
+last_hidden = model_output.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
+vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
+vectors = normalize(vectors, norm="l2", axis=1)
+print(vectors.shape)  # (2, 768)
+```
 
 ## Training Detail
 
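Because both snippets above L2-normalize the embeddings, cosine similarity reduces to a plain dot product. A toy self-contained illustration:

```python
import numpy as np

# With L2-normalized rows, cosine similarity is just a dot product.
vectors = np.array([[0.6, 0.8], [0.8, 0.6]])  # stand-ins for the embeddings above
scores = vectors @ vectors.T
print(scores)  # diagonal is 1.0; off-diagonal is the cosine similarity 0.96
```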
@@ -1320,3 +1369,4 @@ developing...
 9. https://github.com/THUDM/LongBench
 
 
+