thenlper committed
Commit 6becf55
1 Parent(s): c614179

Upload README.md

Files changed (1)
  1. README.md +39 -8
README.md CHANGED
 
@@ -1065,11 +1065,41 @@ license: mit
 
 General Text Embeddings (GTE) model. [Towards General Text Embeddings with Multi-stage Contrastive Learning](https://arxiv.org/abs/2308.03281)
 
-The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including [GTE-large-zh](https://huggingface.co/thenlper/gte-large-zh), [GTE-base-zh](https://huggingface.co/thenlper/gte-base-zh), and [GTE-small-zh](https://huggingface.co/thenlper/gte-small-zh). The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including **information retrieval**, **semantic textual similarity**, **text reranking**, etc.
+The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer different sizes of models for both Chinese and English. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including **information retrieval**, **semantic textual similarity**, **text reranking**, etc.
+
+## Model List
+
+| Models | Language | Max Sequence Length | Dimension | Model Size |
+|:------:|:--------:|:-------------------:|:---------:|:----------:|
+| [GTE-large-zh](https://huggingface.co/thenlper/gte-large-zh) | Chinese | 512 | 1024 | 0.67GB |
+| [GTE-base-zh](https://huggingface.co/thenlper/gte-base-zh) | Chinese | 512 | 768 | 0.21GB |
+| [GTE-small-zh](https://huggingface.co/thenlper/gte-small-zh) | Chinese | 512 | 512 | 0.10GB |
+| [GTE-large](https://huggingface.co/thenlper/gte-large) | English | 512 | 1024 | 0.67GB |
+| [GTE-base](https://huggingface.co/thenlper/gte-base) | English | 512 | 768 | 0.21GB |
+| [GTE-small](https://huggingface.co/thenlper/gte-small) | English | 512 | 384 | 0.10GB |
 
 ## Metrics
 
-We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese). For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+
+- Evaluation results on CMTEB
+
+| Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
+| ----- | --------------- | -------------------- | --------------- | --------------------- | --------------------------- | ----------------------- | -------------------------------- | ---------------------- | ---------------------- | ---------------- |
+| **gte-large-zh** | 0.65 | 1024 | 512 | **66.72** | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
+| gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
+| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
+| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
+| bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
+| stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
+| stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
+| piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
+| piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
+| gte-small-zh | 0.1 | 512 | 512 | 60.04 | 64.35 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
+| bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
+| m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
+| text-embedding-ada-002 (OpenAI) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |
+
 
 ## Usage
 
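The CMTEB scores added in the hunk above come from the public MTEB/CMTEB benchmark, so they can be spot-checked locally. Below is a minimal sketch (not the authors' evaluation script), assuming the classic `MTEB(tasks=[...])` API of the open-source `mteb` package and that the Chinese task name used here (`TNews`) is registered in the installed version:

```python
# Hedged sketch: evaluate gte-large-zh on one CMTEB task with the `mteb` package.
# Assumes mteb's classic API and that "TNews" is a registered Chinese task name.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large-zh")

evaluation = MTEB(tasks=["TNews"])  # one C-MTEB classification task
evaluation.run(model, output_folder="results/gte-large-zh")
```

Swapping in other registered C-MTEB tasks (retrieval, STS, reranking, etc.) would cover the remaining columns of the table.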
 
 
@@ -1081,10 +1111,10 @@ from torch import Tensor
 from transformers import AutoTokenizer, AutoModel
 
 input_texts = [
-    "what is the capital of China?",
-    "how to implement quick sort in python?",
-    "Beijing",
-    "sorting algorithms"
+    "中国的首都是哪里",
+    "你喜欢去哪里旅游",
+    "北京",
+    "今天中午吃什么"
 ]
 
 tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
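The hunk above only shows the lines of this example that changed; the unchanged middle (pooling, normalization, scoring) is elided by the diff. Below is a hedged, self-contained sketch of how these inputs are typically turned into similarity scores on the GTE model cards, assuming mean pooling over the last hidden state. The four Chinese queries roughly mean "What is the capital of China", "Where do you like to travel", "Beijing", and "What should I eat for lunch today".

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average the remaining token embeddings.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

input_texts = [
    "中国的首都是哪里",    # What is the capital of China?
    "你喜欢去哪里旅游",    # Where do you like to travel?
    "北京",                # Beijing
    "今天中午吃什么",      # What should I eat for lunch today?
]

# Tokenize, truncating to the 512-token limit noted in the Limitation section.
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])

# (Optionally) L2-normalize, then score the first text against the rest.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```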
 
@@ -1103,20 +1133,21 @@ print(scores.tolist())
 ```
 
 Use with sentence-transformers:
+
 ```python
 from sentence_transformers import SentenceTransformer
 from sentence_transformers.util import cos_sim
 
 sentences = ['That is a happy person', 'That is a very happy person']
 
-model = SentenceTransformer('thenlper/gte-large')
+model = SentenceTransformer('thenlper/gte-large-zh')
 embeddings = model.encode(sentences)
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
 
 ### Limitation
 
-This model exclusively caters to English texts, and any lengthy texts will be truncated to a maximum of 512 tokens.
+This model exclusively caters to Chinese texts, and any lengthy texts will be truncated to a maximum of 512 tokens.
 
 ### Citation
 
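One note on the hunk above: the unchanged sentence-transformers snippet still encodes an English pair even though gte-large-zh targets Chinese text (see the Limitation line). A hedged variant using a Chinese pair with roughly the same meaning ("That is a happy person" / "That is a very happy person"):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Chinese sentence pair, rough translations of the English example above.
sentences = ['那是一个快乐的人', '那是一个非常快乐的人']

model = SentenceTransformer('thenlper/gte-large-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```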