gte-base-en-v1.5
We introduce the gte-v1.5 series, upgraded gte embeddings that support a context length of up to 8192 tokens while further improving model performance.
The models are built upon the transformer++ encoder backbone (BERT + RoPE + GLU).
The gte-v1.5 series achieves state-of-the-art scores on the MTEB benchmark within the same model size category and provides competitive results on the LoCo long-context retrieval tests (refer to Evaluation).
We also present gte-Qwen1.5-7B-instruct, a SOTA instruction-tuned multilingual embedding model that ranked 2nd in MTEB and 1st in C-MTEB.
- Developed by: Institute for Intelligent Computing, Alibaba Group
- Model type: Text Embeddings
- Paper: Coming soon.
Model list
Models | Language | Model Size (M) | Max Seq. Length | Dimension | MTEB-en | LoCo |
---|---|---|---|---|---|---|
gte-Qwen1.5-7B-instruct | Multiple | 7720 | 32768 | 4096 | 67.34 | 87.57 |
gte-large-en-v1.5 | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
gte-base-en-v1.5 | English | 137 | 8192 | 768 | 64.11 | 87.44 |
How to Get Started with the Model
Use the code below to get started with the model.
# Requires transformers>=4.36.0
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]
model_path = 'Alibaba-NLP/gte-base-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
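The scoring logic in the snippet above (L2-normalize each embedding, then scale the dot product by 100) can be illustrated without downloading the model. The following is a minimal sketch in plain Python. The toy vectors and the helper names `l2_normalize` and `score` are hypothetical stand-ins, not real model outputs or library functions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (no epsilon guard, so vec must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score(query_vec, doc_vec):
    """Cosine similarity of two vectors, scaled by 100 as in the snippet above."""
    q = l2_normalize(query_vec)
    d = l2_normalize(doc_vec)
    return 100 * sum(a * b for a, b in zip(q, d))


# Toy stand-ins for the first-token ([CLS]) hidden states of a query and two documents.
query = [3.0, 4.0]
doc_same_direction = [6.0, 8.0]   # same direction as the query, different length
doc_orthogonal = [-4.0, 3.0]      # perpendicular to the query

print(score(query, doc_same_direction))  # close to 100
print(score(query, doc_orthogonal))      # close to 0
```

Because the vectors are normalized first, the score depends only on direction, not on vector length.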
We recommend installing xformers and enabling unpadding for acceleration; refer to enable-unpadding-and-xformers.
Use with sentence-transformers:
# Requires sentence_transformers>=2.7.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
Use with transformers.js:
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';
// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-base-en-v1.5', {
    quantized: false, // Comment out this line to use the quantized version
});
// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities); // [34.504930869007296, 64.03973265120138, 19.520042686034362]
Training Details
Training Data
- Masked language modeling (MLM): c4-en
- Weak-supervised contrastive (WSC) pre-training: GTE pre-training data
- Supervised contrastive fine-tuning: GTE fine-tuning data
Training Procedure
To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy. The model first undergoes preliminary MLM pre-training on shorter sequences. We then resample the data to reduce the proportion of short texts and continue the MLM pre-training.
The entire training process is as follows:
- MLM-2048: lr 5e-4, mlm_probability 0.3, batch_size 4096, num_steps 70000, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 20000, rope_base 500000
- WSC: max_len 512, lr 2e-4, batch_size 32768, num_steps 100000
- Fine-tuning: TODO
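To make the stages above easier to compare, the listed hyperparameters can be written down as data. This is a minimal sketch, not the authors' training code. The names `STAGES` and `sequences_seen` are hypothetical, and the `max_len` values for the two MLM stages are an assumption inferred from the stage names. Fields not given in the list (such as rope_base for WSC, or anything for fine-tuning) are left out rather than guessed.

```python
# Hyperparameters copied from the stage list above; max_len for the MLM
# stages is inferred from the stage names (assumption).
STAGES = [
    {"name": "MLM-2048", "max_len": 2048, "lr": 5e-4, "mlm_probability": 0.3,
     "batch_size": 4096, "num_steps": 70000, "rope_base": 10000},
    {"name": "MLM-8192", "max_len": 8192, "lr": 5e-5, "mlm_probability": 0.3,
     "batch_size": 1024, "num_steps": 20000, "rope_base": 500000},
    {"name": "WSC", "max_len": 512, "lr": 2e-4,
     "batch_size": 32768, "num_steps": 100000},
]


def sequences_seen(stage):
    """Number of batch items processed in a stage: batch_size * num_steps."""
    return stage["batch_size"] * stage["num_steps"]


for stage in STAGES:
    print(f'{stage["name"]}: {sequences_seen(stage):,} batch items')
```

Note that for the contrastive stage a batch item is presumably a text pair rather than a single sequence, so the counts are not directly comparable across stages.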
Evaluation
MTEB
The results of other models are retrieved from the MTEB leaderboard.
The gte evaluation setting: mteb==1.2.0, fp16 automatic mixed precision, max_length=8192, and NTK scaling factor set to 2 (equivalent to rope_base * 2).
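As a rough illustration of what a larger rope_base does, the sketch below computes the standard RoPE inverse frequencies, base^(-2i/d), and the wavelength of the slowest-rotating dimension pair. The head dimension of 64 and the function names are assumptions for illustration. The model's actual NTK implementation may differ from simply doubling the base; the doubling here only follows the equivalence stated above. The value 500000 is taken from the MLM-8192 stage as an example.

```python
import math


def rope_inv_freq(base, head_dim=64):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim), i = 0 .. head_dim/2 - 1."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base, head_dim=64):
    """Wavelength, in token positions, of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


rope_base = 500000                     # example value from the MLM-8192 stage
ntk_factor = 2                         # scaling factor from the evaluation setting
scaled_base = rope_base * ntk_factor   # "equivalent to rope_base * 2"

print(longest_wavelength(rope_base))
print(longest_wavelength(scaled_base))
```

A larger base lowers the slow frequencies, so the longest wavelength grows and distant positions remain distinguishable over a longer span.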
Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
---|---|---|---|---|---|---|---|---|---|---|---|
gte-large-en-v1.5 | 434 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
LoCo
Model Name | Dimension | Sequence Length | Average (5) | QsmsumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
---|---|---|---|---|---|---|---|---|
gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |
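The Average (5) column appears to be the unweighted mean of the five task scores. The sketch below recomputes it from the values in the table. Because those values are already rounded to two decimals, the recomputed mean can differ from the reported average in the last digit. The name `LOCO_ROWS` is hypothetical.

```python
# (reported average, [five per-task scores]) copied from the LoCo table above.
LOCO_ROWS = {
    "gte-qwen1.5-7b": (87.57, [49.37, 93.10, 99.67, 97.54, 98.21]),
    "gte-large-v1.5": (86.71, [44.55, 92.61, 99.82, 97.81, 98.74]),
    "gte-base-v1.5": (87.44, [49.91, 91.78, 99.82, 97.13, 98.58]),
}

for name, (reported, tasks) in LOCO_ROWS.items():
    mean = sum(tasks) / len(tasks)  # unweighted mean over the five tasks
    print(f"{name}: reported={reported:.2f} recomputed={mean:.2f}")
```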
Citation
If you find our paper or models helpful, please consider citing them as follows:
@article{li2023towards,
title={Towards general text embeddings with multi-stage contrastive learning},
author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
journal={arXiv preprint arXiv:2308.03281},
year={2023}
}
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en) test set (self-reported): 74.791
- ap on MTEB AmazonCounterfactualClassification (en) test set (self-reported): 37.054
- f1 on MTEB AmazonCounterfactualClassification (en) test set (self-reported): 68.511
- accuracy on MTEB AmazonPolarityClassification test set (self-reported): 93.017
- ap on MTEB AmazonPolarityClassification test set (self-reported): 89.178
- f1 on MTEB AmazonPolarityClassification test set (self-reported): 92.997
- accuracy on MTEB AmazonReviewsClassification (en) test set (self-reported): 53.312
- f1 on MTEB AmazonReviewsClassification (en) test set (self-reported): 52.982
- map_at_1 on MTEB ArguAna test set (self-reported): 38.193
- map_at_10 on MTEB ArguAna test set (self-reported): 54.848
- map_at_100 on MTEB ArguAna test set (self-reported): 55.388
- map_at_1000 on MTEB ArguAna test set (self-reported): 55.389
- map_at_3 on MTEB ArguAna test set (self-reported): 50.427
- map_at_5 on MTEB ArguAna test set (self-reported): 53.105
- mrr_at_1 on MTEB ArguAna test set (self-reported): 39.047
- mrr_at_10 on MTEB ArguAna test set (self-reported): 55.153
- mrr_at_100 on MTEB ArguAna test set (self-reported): 55.686
- mrr_at_1000 on MTEB ArguAna test set (self-reported): 55.688
- mrr_at_3 on MTEB ArguAna test set (self-reported): 50.676
- mrr_at_5 on MTEB ArguAna test set (self-reported): 53.417
- ndcg_at_1 on MTEB ArguAna test set (self-reported): 38.193
- ndcg_at_10 on MTEB ArguAna test set (self-reported): 63.486
- ndcg_at_100 on MTEB ArguAna test set (self-reported): 65.580
- ndcg_at_1000 on MTEB ArguAna test set (self-reported): 65.610
- ndcg_at_3 on MTEB ArguAna test set (self-reported): 54.494
- ndcg_at_5 on MTEB ArguAna test set (self-reported): 59.339
- precision_at_1 on MTEB ArguAna test set (self-reported): 38.193
- precision_at_10 on MTEB ArguAna test set (self-reported): 9.075
- precision_at_100 on MTEB ArguAna test set (self-reported): 0.994
- precision_at_1000 on MTEB ArguAna test set (self-reported): 0.100
- precision_at_3 on MTEB ArguAna test set (self-reported): 22.096
- precision_at_5 on MTEB ArguAna test set (self-reported): 15.619
- recall_at_1 on MTEB ArguAna test set (self-reported): 38.193
- recall_at_10 on MTEB ArguAna test set (self-reported): 90.754
- recall_at_100 on MTEB ArguAna test set (self-reported): 99.431
- recall_at_1000 on MTEB ArguAna test set (self-reported): 99.644
- recall_at_3 on MTEB ArguAna test set (self-reported): 66.287
- recall_at_5 on MTEB ArguAna test set (self-reported): 78.094
- v_measure on MTEB ArxivClusteringP2P test set (self-reported): 47.508
- v_measure on MTEB ArxivClusteringS2S test set (self-reported): 42.047
- map on MTEB AskUbuntuDupQuestions test set (self-reported): 61.829
- mrr on MTEB AskUbuntuDupQuestions test set (self-reported): 74.373
- cos_sim_pearson on MTEB BIOSSES test set (self-reported): 85.037
- cos_sim_spearman on MTEB BIOSSES test set (self-reported): 83.647
- euclidean_pearson on MTEB BIOSSES test set (self-reported): 82.640
- euclidean_spearman on MTEB BIOSSES test set (self-reported): 83.631
- manhattan_pearson on MTEB BIOSSES test set (self-reported): 82.715
- manhattan_spearman on MTEB BIOSSES test set (self-reported): 83.605
- accuracy on MTEB Banking77Classification test set (self-reported): 86.734
- f1 on MTEB Banking77Classification test set (self-reported): 86.703
- v_measure on MTEB BiorxivClusteringP2P test set (self-reported): 40.319