infgrad committed
Commit f11c871
1 Parent(s): 0516a2f

Update README.md

Files changed (1)
  1. README.md +54 -4
README.md CHANGED
@@ -1063,6 +1063,8 @@ model-index:
 
 **新闻 | News**
 
+**[2023-10-19]** Release stella-base-en-v2. The model is easy to use and
+**does not need any prefix text**.\
 **[2023-10-12]** Release stella-base-zh-v2 and stella-large-zh-v2.
 The two models have better performance, are easy to use,
 and **do not need any prefix text**.\
@@ -1072,12 +1074,13 @@ stella是一个通用的文本编码模型,主要有以下模型:
 
 | Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
 |:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
+| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
 | stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
 | stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
 | stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
 | stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |
 
-The full training approach and process are documented in this [blog post](https://zhuanlan.zhihu.com/p/655322183); you are welcome to read and discuss it.
+The full training approach and process are documented in [blog post 1](https://zhuanlan.zhihu.com/p/655322183) and [blog post 2](https://zhuanlan.zhihu.com/p/662209559); you are welcome to read and discuss them.
 
 **Training data:**
 
@@ -1104,6 +1107,7 @@ stella is a general-purpose text encoder, which mainly includes the following mo
 
 | Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
 |:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
+| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
 | stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
 | stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
 | stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
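The "Need instruction for retrieval?" column in the two tables above is the practical difference between the model generations: the v1 models expect an instruction prefix on retrieval queries, while the v2 and English models embed raw text. A minimal sketch of that difference with sentence-transformers; the "查询: " prefix is an illustrative placeholder, not part of this commit, so take the exact instruction string from the v1 model cards:

```python
from sentence_transformers import SentenceTransformer

query = "北京的天气怎么样?"

# v1 models: retrieval queries need an instruction prefix.
# "查询: " is a placeholder; use the exact string from the
# stella-large-zh / stella-base-zh model cards.
model_v1 = SentenceTransformer("infgrad/stella-large-zh")
vec_v1 = model_v1.encode(["查询: " + query], normalize_embeddings=True)

# v2 (and the English model): plain text, no prefix at all.
model_v2 = SentenceTransformer("infgrad/stella-base-zh-v2")
vec_v2 = model_v2.encode([query], normalize_embeddings=True)
```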
@@ -1142,9 +1146,15 @@ Based on stella models, stella-v2 use more training data and remove instruction
 | stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
 | stella-base-zh | 0.2 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
 
+#### MTEB leaderboard (English)
+
+| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
+|:-----------------:|:---------------:|:---------:|:---------------:|:------------:|:-------------------:|:---------------:|:-----------------------:|:-------------:|:--------------:|:--------:|:------------------:|
+| stella-base-en-v2 | 0.2 | 768 | 512 | 62.61 | 75.28 | 44.9 | 86.45 | 58.77 | 50.1 | 83.02 | 32.52 |
+
 #### Reproduce our results
 
-Codes:
+**C-MTEB:**
 
 ```python
 import torch
@@ -1186,6 +1196,10 @@ if __name__ == '__main__':
 
 ```
 
+**MTEB:**
+
+You can use the official script [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) to reproduce our results.
+
 #### Evaluation for long text
 
 In practice, we found that the C-MTEB evaluation texts are almost all under 512 in length,
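For the MTEB numbers, a minimal sketch of what the linked script does with the mteb package; the single task named here is illustrative, since scripts/run_mteb_english.py pins the full English task list:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# One illustrative task; the official script runs the whole English suite.
model = SentenceTransformer("infgrad/stella-base-en-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/stella-base-en-v2")
```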
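On the long-text observation above: since C-MTEB texts rarely exceed 512, long-document evaluation needs the zh models' full 1024-token window. A small sketch with sentence-transformers, assuming the checkpoint loads with a smaller default cap:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("infgrad/stella-base-zh-v2")
# The zh models accept sequences up to 1024 (see the tables above);
# raise the encoder's cap if the loaded default is lower.
model.max_seq_length = 1024
vectors = model.encode(["一段很长的文档……"], normalize_embeddings=True)  # placeholder long document
print(vectors.shape)
```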
@@ -1244,7 +1258,6 @@ stella中文系列模型均使用mean pooling做为文本向量。
 Usage with the sentence-transformers library:
 
 ```python
-# For short-query-to-short-passage datasets, the following is the general usage
 from sentence_transformers import SentenceTransformer
 
 sentences = ["数据1", "数据2"]
@@ -1282,7 +1295,43 @@ print(vectors.shape) # 2,768
 
 #### stella models for English
 
-developing...
+**Using Sentence-Transformers:**
+
+```python
+from sentence_transformers import SentenceTransformer
+
+sentences = ["one car come", "one car go"]
+model = SentenceTransformer('infgrad/stella-base-en-v2')
+print(model.max_seq_length)
+embeddings_1 = model.encode(sentences, normalize_embeddings=True)
+embeddings_2 = model.encode(sentences, normalize_embeddings=True)
+similarity = embeddings_1 @ embeddings_2.T
+print(similarity)
+```
+
+**Using HuggingFace Transformers:**
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+from sklearn.preprocessing import normalize
+
+model = AutoModel.from_pretrained('infgrad/stella-base-en-v2')
+tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-en-v2')
+sentences = ["one car come", "one car go"]
+batch_data = tokenizer(
+    batch_text_or_text_pairs=sentences,
+    padding="longest",
+    return_tensors="pt",
+    max_length=512,
+    truncation=True,
+)
+attention_mask = batch_data["attention_mask"]
+with torch.no_grad():  # inference only; also lets sklearn's normalize accept the tensor
+    model_output = model(**batch_data)
+# mean pooling: zero out padding positions, then average over valid tokens
+last_hidden = model_output.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
+vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
+vectors = normalize(vectors, norm="l2", axis=1)
+print(vectors.shape)  # (2, 768)
+```
 
 ## Training Detail
 
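Because both snippets above L2-normalize the embeddings, cosine similarity reduces to a plain dot product. A toy self-contained illustration:

```python
import numpy as np

# With L2-normalized rows, cosine similarity is just a dot product.
vectors = np.array([[0.6, 0.8], [0.8, 0.6]])  # stand-ins for the embeddings above
scores = vectors @ vectors.T
print(scores)  # diagonal is 1.0; off-diagonal is the cosine similarity 0.96
```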
@@ -1320,3 +1369,4 @@ developing...
 9. https://github.com/THUDM/LongBench
 
 
+