thenlper committed
Commit 6becf55
1 Parent(s): c614179

Upload README.md

Files changed (1)
  1. README.md +39 -8
README.md CHANGED
 
@@ -1065,11 +1065,41 @@ license: mit
 
 General Text Embeddings (GTE) model. [Towards General Text Embeddings with Multi-stage Contrastive Learning](https://arxiv.org/abs/2308.03281)
 
-The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including [GTE-large-zh](https://huggingface.co/thenlper/gte-large-zh), [GTE-base-zh](https://huggingface.co/thenlper/gte-base-zh), and [GTE-small-zh](https://huggingface.co/thenlper/gte-small-zh). The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including **information retrieval**, **semantic textual similarity**, **text reranking**, etc.
+The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer different sizes of models for both Chinese and English. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including **information retrieval**, **semantic textual similarity**, **text reranking**, etc.
+
+## Model List
+
+| Models | Language | Max Sequence Length | Dimension | Model Size |
+|:------:|:--------:|:-------------------:|:---------:|:----------:|
+| [GTE-large-zh](https://huggingface.co/thenlper/gte-large-zh) | Chinese | 512 | 1024 | 0.67GB |
+| [GTE-base-zh](https://huggingface.co/thenlper/gte-base-zh) | Chinese | 512 | 768 | 0.21GB |
+| [GTE-small-zh](https://huggingface.co/thenlper/gte-small-zh) | Chinese | 512 | 512 | 0.10GB |
+| [GTE-large](https://huggingface.co/thenlper/gte-large) | English | 512 | 1024 | 0.67GB |
+| [GTE-base](https://huggingface.co/thenlper/gte-base) | English | 512 | 768 | 0.21GB |
+| [GTE-small](https://huggingface.co/thenlper/gte-small) | English | 512 | 384 | 0.10GB |
 
 ## Metrics
 
-We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese). For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+
+- Evaluation results on CMTEB
+
+| Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
+| ----- | --------------- | -------------------- | --------------- | --------------------- | --------------------------- | ----------------------- | -------------------------------- | ---------------------- | ---------------------- | ---------------- |
+| **gte-large-zh** | 0.65 | 1024 | 512 | **66.72** | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
+| gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
+| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
+| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
+| bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
+| stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
+| stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
+| piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
+| piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
+| gte-small-zh | 0.1 | 512 | 512 | 60.04 | 64.35 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
+| bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
+| m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
+| text-embedding-ada-002 (OpenAI) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |
+
 
 ## Usage
 
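The CMTEB scores added in the hunk above come from the public MTEB/CMTEB benchmark, so they can be spot-checked locally. Below is a minimal sketch (not the authors' evaluation script), assuming the classic `MTEB(tasks=[...])` API of the open-source `mteb` package and that the Chinese task name used here (`TNews`) is registered in the installed version:

```python
# Hedged sketch: evaluate gte-large-zh on one CMTEB task with the `mteb` package.
# Assumes mteb's classic API and that "TNews" is a registered Chinese task name.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large-zh")

evaluation = MTEB(tasks=["TNews"])  # one C-MTEB classification task
evaluation.run(model, output_folder="results/gte-large-zh")
```

Swapping in other registered C-MTEB tasks (retrieval, STS, reranking, etc.) would cover the remaining columns of the table.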
 
 
@@ -1081,10 +1111,10 @@ from torch import Tensor
 from transformers import AutoTokenizer, AutoModel
 
 input_texts = [
-    "what is the capital of China?",
-    "how to implement quick sort in python?",
-    "Beijing",
-    "sorting algorithms"
+    "中国的首都是哪里",
+    "你喜欢去哪里旅游",
+    "北京",
+    "今天中午吃什么"
 ]
 
 tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
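The hunk above only shows the lines of this example that changed; the unchanged middle (pooling, normalization, scoring) is elided by the diff. Below is a hedged, self-contained sketch of how these inputs are typically turned into similarity scores on the GTE model cards, assuming mean pooling over the last hidden state. The four Chinese queries roughly mean "What is the capital of China", "Where do you like to travel", "Beijing", and "What should I eat for lunch today".

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average the remaining token embeddings.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

input_texts = [
    "中国的首都是哪里",    # What is the capital of China?
    "你喜欢去哪里旅游",    # Where do you like to travel?
    "北京",                # Beijing
    "今天中午吃什么",      # What should I eat for lunch today?
]

# Tokenize, truncating to the 512-token limit noted in the Limitation section.
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])

# (Optionally) L2-normalize, then score the first text against the rest.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```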
 
@@ -1103,20 +1133,21 @@ print(scores.tolist())
 ```
 
 Use with sentence-transformers:
+
 ```python
 from sentence_transformers import SentenceTransformer
 from sentence_transformers.util import cos_sim
 
 sentences = ['That is a happy person', 'That is a very happy person']
 
-model = SentenceTransformer('thenlper/gte-large')
+model = SentenceTransformer('thenlper/gte-large-zh')
 embeddings = model.encode(sentences)
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
 
 ### Limitation
 
-This model exclusively caters to English texts, and any lengthy texts will be truncated to a maximum of 512 tokens.
+This model exclusively caters to Chinese texts, and any lengthy texts will be truncated to a maximum of 512 tokens.
 
 ### Citation
 
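One note on the hunk above: the unchanged sentence-transformers snippet still encodes an English pair even though gte-large-zh targets Chinese text (see the Limitation line). A hedged variant using a Chinese pair with roughly the same meaning ("That is a happy person" / "That is a very happy person"):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Chinese sentence pair, rough translations of the English example above.
sentences = ['那是一个快乐的人', '那是一个非常快乐的人']

model = SentenceTransformer('thenlper/gte-large-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```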