izhx committed on
Commit
b149120
1 Parent(s): 4c742dc

Support sentence_transformers and fix readme (#2)


- Support sentence_transformers and fix readme (bd0f92f41bae5f9551cbf7d5f965c9bd9ed8b38b)
- Create modules.json (9c793960cfcb9ca3f5d610e30e6c38661bbe9df5)
- Create sentence_bert_config.json (b1773972fe2086a6d0f4e4bcaa1a480047f2c4d3)
- Update README.md (6cdc15ab3ccd9dd78f3bcc68b9f9390b46de938d)

Files changed (4)
  1. 1_Pooling/config.json +10 -0
  2. README.md +12 -7
  3. modules.json +14 -0
  4. sentence_bert_config.json +4 -0
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
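The pooling config above enables only CLS pooling. As a rough pure-Python sketch (no sentence-transformers dependency; toy 4-dim vectors stand in for the real 768-dim embeddings) of what that setting means:

```python
import json

# The pooling configuration added by this commit (values copied from the
# diff above): only CLS pooling is enabled.
POOLING_CONFIG = json.loads("""
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
""")

def cls_pool(token_embeddings):
    """CLS pooling: the sentence embedding is the first token's vector."""
    assert POOLING_CONFIG["pooling_mode_cls_token"]
    return token_embeddings[0]

# Toy 3-token sequence with 4-dim vectors (the real model uses 768 dims).
tokens = [[1.0, 2.0, 3.0, 4.0],
          [0.5, 0.5, 0.5, 0.5],
          [9.0, 9.0, 9.0, 9.0]]
print(cls_pool(tokens))  # -> [1.0, 2.0, 3.0, 4.0]
```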
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- library_name: transformers
+ library_name: sentence-transformers
  tags:
  - gte
  - mteb
@@ -2627,7 +2627,7 @@ We also present the [`gte-Qwen1.5-7B-instruct`](https://huggingface.co/Alibaba-N
  | Models | Language | Model Size | Max Seq. Length | Dimension | MTEB-en | LoCo |
  |:-----: | :-----: |:-----: |:-----: |:-----: | :-----: | :-----: |
  |[`gte-Qwen1.5-7B-instruct`](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct)| English | 7720 | 32768 | 4096 | 67.34 | 87.57 |
- |[`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | English | 409 | 8192 | 1024 | 65.39 | 86.71 |
+ |[`gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
  |[`gte-base-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) | English | 137 | 8192 | 768 | 64.11 | 87.44 |
@@ -2665,7 +2665,7 @@ print(scores.tolist())
  **It is recommended to install xformers and enable unpadding for acceleration, refer to [enable-unpadding-and-xformers](https://huggingface.co/Alibaba-NLP/test-impl#recommendation-enable-unpadding-and-acceleration-with-xformers).**

- Use with sentence-transformers:
+ Use with `sentence-transformers`:

  ```python
  from sentence_transformers import SentenceTransformer
@@ -2673,7 +2673,7 @@ from sentence_transformers.util import cos_sim

  sentences = ['That is a happy person', 'That is a very happy person']

- model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5')
+ model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True)
  embeddings = model.encode(sentences)
  print(cos_sim(embeddings[0], embeddings[1]))
  ```
@@ -2686,8 +2686,13 @@ print(cos_sim(embeddings[0], embeddings[1]))
  - Weak-supervised contrastive (WSC) pre-training: GTE pre-training data
  - Supervised contrastive fine-tuning: GTE fine-tuning data

  ### Training Procedure

+ To enable the backbone model to support a context length of 8192, we adopt a multi-stage training strategy.
+ The model first undergoes preliminary MLM pre-training on shorter sequences.
+ We then resample the data, reducing the proportion of short texts, and continue MLM pre-training.
+
+ The entire training process is as follows:
  - MLM-2048: lr 5e-4, mlm_probability 0.3, batch_size 4096, num_steps 70000, rope_base 10000
  - MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 20000, rope_base 500000
  - WSC: max_len 512, lr 2e-4, batch_size 32768, num_steps 100000
@@ -2701,11 +2706,11 @@ print(cos_sim(embeddings[0], embeddings[1]))

  The results of other models are retrieved from [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

- The gte results setting: `mteb==1.2.0, fp16 auto mix precision, max_length=8192`, and set ntk scaling factor to 2 (equivalent to rope_base * 2).
+ The gte evaluation setting: `mteb==1.2.0`, fp16 automatic mixed precision, `max_length=8192`, and NTK scaling factor set to 2 (equivalent to rope_base * 2).

  | Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
  |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
- | [**gte-large-en-v1.5**](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | 409 | 1024 | 8192 | **65.39** | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
+ | [**gte-large-en-v1.5**](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | 434 | 1024 | 8192 | **65.39** | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
  | [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
  | [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
  | [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)| 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
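The README usage snippet prints `cos_sim(embeddings[0], embeddings[1])`. For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms; a minimal pure-Python version (a sketch of the formula, not the sentence-transformers implementation, which operates on batched tensors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors:
    dot(a, b) / (||a|| * ||b||), in [-1, 1] for nonzero inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # -> 0.0
```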
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
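`modules.json` tells sentence-transformers to chain two modules: the Transformer backbone at the repository root (`"path": ""`) followed by the Pooling module configured in `1_Pooling/`. A small sketch that parses the file's contents and recovers the pipeline order:

```python
import json

# Contents of modules.json as added in this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""

modules = json.loads(MODULES_JSON)
# Modules run in ascending idx order: token embeddings come out of the
# Transformer, then Pooling reduces them to a single sentence vector.
pipeline = [m["type"].rsplit(".", 1)[-1]
            for m in sorted(modules, key=lambda m: m["idx"])]
print(" -> ".join(pipeline))  # -> Transformer -> Pooling
```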
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
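`sentence_bert_config.json` caps inputs at 8192 tokens and disables lowercasing. A toy sketch of the truncation effect (the token IDs are made up; the real tokenizer ships with the model):

```python
import json

# Contents of sentence_bert_config.json as added in this commit.
CONFIG = json.loads('{"max_seq_length": 8192, "do_lower_case": false}')

def truncate(token_ids, max_seq_length=CONFIG["max_seq_length"]):
    """Inputs longer than max_seq_length are cut to the first
    max_seq_length tokens before encoding."""
    return token_ids[:max_seq_length]

long_input = list(range(10000))   # hypothetical over-long token sequence
print(len(truncate(long_input)))  # -> 8192
```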