izhx committed
Commit 4a1c613
1 Parent(s): b4a2b8c

Support sentence_transformers and fix readme (#2)


- Support sentence_transformers and fix readme (6d109681c91c5107952c8eaab19a6d7623d21dbc)
- Create modules.json (fe3466bba6f9fbbc4ad8f2e698297076383722d5)
- Create sentence_bert_config.json (73f70c6827d9533cc6cfe50e44ea960f7d7c0014)
- Update README.md (124e8cc607418c69ebae00e23831dac0c7cf6c1f)
- Update README.md (f2db8aff31372da208e8e68f58e68fa2a1afc8b0)

Files changed (4)
  1. 1_Pooling/config.json +10 -0
  2. README.md +25 -16
  3. modules.json +14 -0
  4. sentence_bert_config.json +4 -0
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+{
+    "word_embedding_dimension": 1024,
+    "pooling_mode_cls_token": true,
+    "pooling_mode_mean_tokens": false,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false,
+    "pooling_mode_weightedmean_tokens": false,
+    "pooling_mode_lasttoken": false,
+    "include_prompt": true
+}
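For context (not part of this commit): with `pooling_mode_cls_token` set to `true` and all other modes disabled, sentence-transformers uses the 1024-dim hidden state of the [CLS] token as the sentence embedding. A minimal sketch of the Pooling module this config produces:

```python
# Sketch only: build the Pooling module described by 1_Pooling/config.json.
from sentence_transformers import models

pooling = models.Pooling(
    word_embedding_dimension=1024,   # backbone hidden size
    pooling_mode_cls_token=True,     # use the [CLS] hidden state as the sentence embedding
    pooling_mode_mean_tokens=False,
    pooling_mode_max_tokens=False,
)
print(pooling.get_sentence_embedding_dimension())  # 1024
```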
README.md CHANGED
@@ -1,4 +1,6 @@
 ---
+datasets:
+- allenai/c4
 library_name: transformers
 tags:
 - gte
@@ -2168,7 +2170,7 @@ model-index:
   - type: mrr_at_1000
     value: 78.704
   - type: mrr_at_3
-    value: 77.0
+    value: 77
   - type: mrr_at_5
     value: 78.083
   - type: ndcg_at_1
@@ -2202,7 +2204,7 @@ model-index:
   - type: recall_at_100
     value: 99.833
   - type: recall_at_1000
-    value: 100.0
+    value: 100
   - type: recall_at_3
     value: 86.506
   - type: recall_at_5
@@ -2245,7 +2247,7 @@ model-index:
   - type: euclidean_precision
     value: 85.74181117533719
   - type: euclidean_recall
-    value: 89.0
+    value: 89
   - type: manhattan_accuracy
     value: 99.75445544554455
   - type: manhattan_ap
@@ -2336,19 +2338,19 @@ model-index:
   - type: map_at_5
     value: 1.028
   - type: mrr_at_1
-    value: 88.0
+    value: 88
   - type: mrr_at_10
-    value: 94.0
+    value: 94
   - type: mrr_at_100
-    value: 94.0
+    value: 94
   - type: mrr_at_1000
-    value: 94.0
+    value: 94
   - type: mrr_at_3
-    value: 94.0
+    value: 94
   - type: mrr_at_5
-    value: 94.0
+    value: 94
   - type: ndcg_at_1
-    value: 82.0
+    value: 82
   - type: ndcg_at_10
     value: 77.48899999999999
   - type: ndcg_at_100
@@ -2360,7 +2362,7 @@ model-index:
   - type: ndcg_at_5
     value: 80.449
   - type: precision_at_1
-    value: 88.0
+    value: 88
   - type: precision_at_10
     value: 82.19999999999999
   - type: precision_at_100
@@ -2368,7 +2370,7 @@ model-index:
   - type: precision_at_1000
     value: 23.684
   - type: precision_at_3
-    value: 88.0
+    value: 88
   - type: precision_at_5
     value: 85.6
   - type: recall_at_1
@@ -2627,7 +2629,7 @@ We also present the [`gte-Qwen1.5-7B-instruct`](https://huggingface.co/Alibaba-N
 | Models | Language | Model Size | Max Seq. Length | Dimension | MTEB-en | LoCo |
 |:-----: | :-----: |:-----: |:-----: |:-----: | :-----: | :-----: |
 |[`gte-Qwen1.5-7B-instruct`](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct)| English | 7720 | 32768 | 4096 | 67.34 | 87.57 |
-|[`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | English | 409 | 8192 | 1024 | 65.39 | 86.71 |
+|[`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
 |[`gte-base-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) | English | 137 | 8192 | 768 | 64.11 | 87.44 |
 
 
@@ -2673,7 +2675,7 @@ from sentence_transformers.util import cos_sim
 
 sentences = ['That is a happy person', 'That is a very happy person']
 
-model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5')
+model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
 embeddings = model.encode(sentences)
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
@@ -2688,6 +2690,11 @@ print(cos_sim(embeddings[0], embeddings[1]))
 
 ### Training Procedure
 
+To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy.
+The model first undergoes preliminary MLM pre-training on shorter sequence lengths.
+Then we resample the data, reducing the proportion of short texts, and continue MLM pre-training.
+
+The entire training process is as follows:
 - MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
 - MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
 - MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
@@ -2700,7 +2707,9 @@ print(cos_sim(embeddings[0], embeddings[1]))
 
 ### MTEB
 
-The gte results setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`, and set ntk scaling factor to 2 (equivalent to rope_base * 2).
+The results of other models are retrieved from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+
+The gte evaluation setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`, with the NTK scaling factor set to 2 (equivalent to rope_base * 2).
 
 | Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
 |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -2732,4 +2741,4 @@ The gte results setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`
 
 **APA:**
 
-[More Information Needed]
+[More Information Needed]
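For context (not part of this commit): `trust_remote_code=True` is required because the backbone ships custom modeling code. For users not on sentence-transformers, a hedged plain-`transformers` equivalent of the snippet above, using the CLS pooling configured in `1_Pooling/config.json`, would look roughly like this:

```python
# Sketch only: encode with plain transformers and CLS pooling.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

sentences = ['That is a happy person', 'That is a very happy person']
batch = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors='pt')
with torch.no_grad():
    outputs = model(**batch)

embeddings = outputs.last_hidden_state[:, 0]      # CLS pooling, 1024-dim
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings[0] @ embeddings[1])               # cosine similarity
```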
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
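For context (not part of this commit): modules.json tells sentence-transformers which modules to chain and in which order, with `idx` 0 as the Transformer backbone loaded from the repo root (empty `path`) and `idx` 1 as the Pooling module loaded from `1_Pooling/`. A quick sketch that inspects the loaded pipeline:

```python
# Sketch only: the modules.json entries map 1:1 onto the loaded module pipeline.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
for idx, module in enumerate(model):    # modules are iterated in idx order
    print(idx, type(module).__name__)   # expected: 0 Transformer, 1 Pooling
```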
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+    "max_seq_length": 8192,
+    "do_lower_case": false
+}
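For context (not part of this commit): this file sets the default truncation length to 8192 tokens and disables lowercasing. A short sketch of how the setting surfaces at runtime; lowering it is optional and trades context length for speed:

```python
# Sketch only: max_seq_length from sentence_bert_config.json is exposed on the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
print(model.max_seq_length)   # 8192
model.max_seq_length = 512    # optional: truncate earlier for faster encoding
```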