izhx committed
Commit 4a1c613
1 Parent(s): b4a2b8c

Support sentence_transformers and fix readme (#2)


- Support sentence_transformers and fix readme (6d109681c91c5107952c8eaab19a6d7623d21dbc)
- Create modules.json (fe3466bba6f9fbbc4ad8f2e698297076383722d5)
- Create sentence_bert_config.json (73f70c6827d9533cc6cfe50e44ea960f7d7c0014)
- Update README.md (124e8cc607418c69ebae00e23831dac0c7cf6c1f)
- Update README.md (f2db8aff31372da208e8e68f58e68fa2a1afc8b0)

Files changed (4)
  1. 1_Pooling/config.json +10 -0
  2. README.md +25 -16
  3. modules.json +14 -0
  4. sentence_bert_config.json +4 -0
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+{
+    "word_embedding_dimension": 1024,
+    "pooling_mode_cls_token": true,
+    "pooling_mode_mean_tokens": false,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false,
+    "pooling_mode_weightedmean_tokens": false,
+    "pooling_mode_lasttoken": false,
+    "include_prompt": true
+}
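For context (not part of this commit): with `pooling_mode_cls_token` set to `true` and all other modes disabled, sentence-transformers uses the 1024-dim hidden state of the [CLS] token as the sentence embedding. A minimal sketch of the Pooling module this config produces:

```python
# Sketch only: build the Pooling module described by 1_Pooling/config.json.
from sentence_transformers import models

pooling = models.Pooling(
    word_embedding_dimension=1024,   # backbone hidden size
    pooling_mode_cls_token=True,     # use the [CLS] hidden state as the sentence embedding
    pooling_mode_mean_tokens=False,
    pooling_mode_max_tokens=False,
)
print(pooling.get_sentence_embedding_dimension())  # 1024
```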
README.md CHANGED
@@ -1,4 +1,6 @@
 ---
+datasets:
+- allenai/c4
 library_name: transformers
 tags:
 - gte
@@ -2168,7 +2170,7 @@ model-index:
   - type: mrr_at_1000
     value: 78.704
   - type: mrr_at_3
-    value: 77.0
+    value: 77
   - type: mrr_at_5
     value: 78.083
   - type: ndcg_at_1
@@ -2202,7 +2204,7 @@ model-index:
   - type: recall_at_100
     value: 99.833
   - type: recall_at_1000
-    value: 100.0
+    value: 100
   - type: recall_at_3
     value: 86.506
   - type: recall_at_5
@@ -2245,7 +2247,7 @@ model-index:
   - type: euclidean_precision
     value: 85.74181117533719
   - type: euclidean_recall
-    value: 89.0
+    value: 89
   - type: manhattan_accuracy
     value: 99.75445544554455
   - type: manhattan_ap
@@ -2336,19 +2338,19 @@ model-index:
   - type: map_at_5
     value: 1.028
   - type: mrr_at_1
-    value: 88.0
+    value: 88
   - type: mrr_at_10
-    value: 94.0
+    value: 94
   - type: mrr_at_100
-    value: 94.0
+    value: 94
   - type: mrr_at_1000
-    value: 94.0
+    value: 94
   - type: mrr_at_3
-    value: 94.0
+    value: 94
   - type: mrr_at_5
-    value: 94.0
+    value: 94
   - type: ndcg_at_1
-    value: 82.0
+    value: 82
   - type: ndcg_at_10
     value: 77.48899999999999
   - type: ndcg_at_100
@@ -2360,7 +2362,7 @@ model-index:
   - type: ndcg_at_5
     value: 80.449
   - type: precision_at_1
-    value: 88.0
+    value: 88
   - type: precision_at_10
     value: 82.19999999999999
   - type: precision_at_100
@@ -2368,7 +2370,7 @@ model-index:
   - type: precision_at_1000
     value: 23.684
   - type: precision_at_3
-    value: 88.0
+    value: 88
   - type: precision_at_5
     value: 85.6
   - type: recall_at_1
@@ -2627,7 +2629,7 @@ We also present the [`gte-Qwen1.5-7B-instruct`](https://huggingface.co/Alibaba-N
 | Models | Language | Model Size | Max Seq. Length | Dimension | MTEB-en | LoCo |
 |:-----: | :-----: |:-----: |:-----: |:-----: | :-----: | :-----: |
 |[`gte-Qwen1.5-7B-instruct`](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct)| English | 7720 | 32768 | 4096 | 67.34 | 87.57 |
-|[`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | English | 409 | 8192 | 1024 | 65.39 | 86.71 |
+|[`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
 |[`gte-base-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) | English | 137 | 8192 | 768 | 64.11 | 87.44 |
 
 
@@ -2673,7 +2675,7 @@ from sentence_transformers.util import cos_sim
 
 sentences = ['That is a happy person', 'That is a very happy person']
 
-model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5')
+model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
 embeddings = model.encode(sentences)
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
@@ -2688,6 +2690,11 @@ print(cos_sim(embeddings[0], embeddings[1]))
 
 ### Training Procedure
 
+To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy.
+The model first undergoes preliminary MLM pre-training on shorter sequence lengths.
+Then we resample the data, reducing the proportion of short texts, and continue MLM pre-training.
+
+The entire training process is as follows:
 - MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
 - MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
 - MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
@@ -2700,7 +2707,9 @@ print(cos_sim(embeddings[0], embeddings[1]))
 
 ### MTEB
 
-The gte results setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`, and set ntk scaling factor to 2 (equivalent to rope_base * 2).
+The results of other models are retrieved from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+
+The gte evaluation setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`, with the NTK scaling factor set to 2 (equivalent to rope_base * 2).
 
 | Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
 |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -2732,4 +2741,4 @@ The gte results setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`
 
 **APA:**
 
-[More Information Needed]
+[More Information Needed]
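For context (not part of this commit): `trust_remote_code=True` is required because the backbone ships custom modeling code. For users not on sentence-transformers, a hedged plain-`transformers` equivalent of the snippet above, using the CLS pooling configured in `1_Pooling/config.json`, would look roughly like this:

```python
# Sketch only: encode with plain transformers and CLS pooling.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

sentences = ['That is a happy person', 'That is a very happy person']
batch = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors='pt')
with torch.no_grad():
    outputs = model(**batch)

embeddings = outputs.last_hidden_state[:, 0]      # CLS pooling, 1024-dim
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings[0] @ embeddings[1])               # cosine similarity
```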
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
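For context (not part of this commit): modules.json tells sentence-transformers which modules to chain and in which order, with `idx` 0 as the Transformer backbone loaded from the repo root (empty `path`) and `idx` 1 as the Pooling module loaded from `1_Pooling/`. A quick sketch that inspects the loaded pipeline:

```python
# Sketch only: the modules.json entries map 1:1 onto the loaded module pipeline.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
for idx, module in enumerate(model):    # modules are iterated in idx order
    print(idx, type(module).__name__)   # expected: 0 Transformer, 1 Pooling
```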
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+    "max_seq_length": 8192,
+    "do_lower_case": false
+}
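For context (not part of this commit): this file sets the default truncation length to 8192 tokens and disables lowercasing. A short sketch of how the setting surfaces at runtime; lowering it is optional and trades context length for speed:

```python
# Sketch only: max_seq_length from sentence_bert_config.json is exposed on the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
print(model.max_seq_length)   # 8192
model.max_seq_length = 512    # optional: truncate earlier for faster encoding
```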