gte-large-en-v1.5

We introduce the gte-v1.5 series, upgraded gte embeddings that support a context length of up to 8192 tokens while further improving model performance. The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU).

The gte-v1.5 series achieves state-of-the-art scores on the MTEB benchmark within the same model size category and provides competitive results on the LoCo long-context retrieval tests (refer to Evaluation).

We also present gte-Qwen1.5-7B-instruct, a SOTA instruction-tuned multilingual embedding model that ranked 2nd in MTEB and 1st in C-MTEB.

  • Developed by: Institute for Intelligent Computing, Alibaba Group
  • Model type: Text Embeddings
  • Paper: Coming soon.

Model list

| Models | Language | Model Size (M) | Max Seq. Length | Dimension | MTEB-en | LoCo |
| gte-Qwen1.5-7B-instruct | Multiple | 7720 | 32768 | 4096 | 67.34 | 87.57 |
| gte-large-en-v1.5 | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
| gte-base-en-v1.5 | English | 137 | 8192 | 768 | 64.11 | 87.44 |

How to Get Started with the Model

Use the code below to get started with the model.

# Requires transformers>=4.36.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

# Run the encoder and take the first ([CLS]) token of each sequence as its embedding
outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
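
The scoring step above works because, once embeddings are L2-normalized, a plain dot product equals cosine similarity. The following is a minimal pure-Python sketch of that step only. The 3-dimensional vectors are made-up stand-ins for real model outputs, and the helper names `l2_normalize` and `dot` are illustrative, not part of any library used above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for [CLS] embeddings (made-up numbers, not model outputs).
query = [3.0, 4.0, 0.0]
docs = [
    [6.0, 8.0, 0.0],  # same direction as the query
    [0.0, 0.0, 5.0],  # orthogonal to the query
]

query_n = l2_normalize(query)
# Same scaling by 100 as in the snippet above.
scores = [100 * dot(query_n, l2_normalize(d)) for d in docs]
print(scores)
```

A document pointing in the same direction as the query scores close to 100 regardless of its length, while an orthogonal one scores 0.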

We recommend installing xformers and enabling unpadding for acceleration; refer to enable-unpadding-and-xformers.
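
To give a feel for why unpadding helps, here is a conceptual pure-Python sketch. It is not the model's implementation: the function name `unpad` and the toy token ids are assumptions made for illustration. The idea is to drop padded positions and pack the real tokens of a batch into one flat sequence, keeping cumulative boundaries so each sequence can be recovered.

```python
def unpad(input_ids, attention_mask):
    """Pack the non-padded tokens of a batch into one flat list.

    Returns the flat token list and the cumulative sequence boundaries,
    so sequence i occupies flat[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    flat, cu_seqlens = [], [0]
    for ids, mask in zip(input_ids, attention_mask):
        flat.extend(tok for tok, keep in zip(ids, mask) if keep)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens


# Toy padded batch (made-up ids; 0 marks padding here).
input_ids = [
    [101, 7, 8, 102, 0, 0],
    [101, 9, 102, 0, 0, 0],
    [101, 5, 6, 4, 3, 102],
]
attention_mask = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

flat, cu_seqlens = unpad(input_ids, attention_mask)
padded_positions = sum(len(row) for row in input_ids)
print(f"{len(flat)} real tokens instead of {padded_positions} padded positions")
print("sequence boundaries:", cu_seqlens)
```

The saving grows with the spread of sequence lengths in a batch, which is typically large when short queries and long documents are mixed under a limit of 8192.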

Use with sentence-transformers:

# Requires sentence_transformers>=2.7.0

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

Use with transformers.js:

// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-large-en-v1.5', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979677899]

Training Details

Training Data

  • Masked language modeling (MLM): c4-en
  • Weak-supervised contrastive (WSC) pre-training: GTE pre-training data
  • Supervised contrastive fine-tuning: GTE (https://arxiv.org/pdf/2308.03281.pdf) fine-tuning data

Training Procedure

To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy. The model first undergoes preliminary MLM pre-training on shorter lengths. We then resample the data, reducing the proportion of short texts, and continue the MLM pre-training.
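
The card does not describe how the resampling is done. Purely as an illustration of the idea, one way to reduce the proportion of short texts is to keep every long document and keep short ones only with some probability. Everything in the sketch below is hypothetical: the function `resample_by_length`, the threshold `min_len`, the probability `keep_prob`, and the whitespace token count are assumptions, not the authors' procedure.

```python
import random


def resample_by_length(docs, min_len, keep_prob, seed=0):
    """Keep all docs with at least min_len tokens; keep shorter ones with keep_prob."""
    rng = random.Random(seed)
    kept = []
    for doc in docs:
        n_tokens = len(doc.split())  # whitespace split as a crude length proxy
        if n_tokens >= min_len or rng.random() < keep_prob:
            kept.append(doc)
    return kept


def long_fraction(docs, min_len):
    """Fraction of docs with at least min_len tokens."""
    return sum(len(d.split()) >= min_len for d in docs) / len(docs)


# Toy corpus: 8 short texts and 2 long ones (200 whitespace tokens each).
corpus = ["short text"] * 8 + ["a much longer document " * 50] * 2

resampled = resample_by_length(corpus, min_len=100, keep_prob=0.25)
print("long fraction before:", long_fraction(corpus, 100))
print("long fraction after: ", long_fraction(resampled, 100))
```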

The entire training process is as follows:

  • MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
  • MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
  • MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
  • WSC: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
  • Fine-tuning: TODO
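
For a rough sense of scale, the MLM stages listed above can be turned into an upper bound on tokens processed per stage. This sketch assumes that the number in each stage name is the maximum sequence length, that batch_size counts sequences, and that every sequence is filled to the maximum length; none of this is stated in the card, so treat the results as ceilings rather than actual token counts.

```python
# (stage name, assumed max sequence length, batch_size, num_steps),
# with batch_size and num_steps copied from the list above.
stages = [
    ("MLM-512", 512, 4096, 300_000),
    ("MLM-2048", 2048, 4096, 30_000),
    ("MLM-8192", 8192, 1024, 30_000),
]


def token_upper_bound(max_len, batch_size, num_steps):
    """Tokens processed if every sequence in every batch were full length."""
    return max_len * batch_size * num_steps


bounds = {name: token_upper_bound(m, b, s) for name, m, b, s in stages}
for name, tokens in bounds.items():
    print(f"{name}: at most {tokens / 1e9:.1f}B tokens")
```

Under these assumptions the MLM-2048 and MLM-8192 stages have the same ceiling, because the batch size shrinks by 4x as the length grows by 4x.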

Evaluation

MTEB

The results of other models are retrieved from the MTEB leaderboard.

The gte evaluation setting: mteb==1.2.0, fp16 automatic mixed precision, max_length=8192, and an NTK scaling factor of 2 (equivalent to rope_base * 2).
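
To illustrate what doubling rope_base does, the sketch below computes the standard RoPE inverse frequencies for a base and for twice that base. It takes the card's statement that a scaling factor of 2 is equivalent to rope_base * 2 at face value. Two values are assumptions: `HEAD_DIM = 64` is not given in this card, and `ROPE_BASE = 160000` is taken from the MLM-8192 stage on the guess that it carries over to the released model.

```python
import math


def rope_inv_freq(head_dim, base):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 64        # assumption: not stated in this card
ROPE_BASE = 160_000  # from the MLM-8192 stage; assumed to apply at inference
NTK_FACTOR = 2       # evaluation setting quoted above

trained = rope_inv_freq(HEAD_DIM, ROPE_BASE)
scaled = rope_inv_freq(HEAD_DIM, ROPE_BASE * NTK_FACTOR)

# The slowest-rotating pair has the longest wavelength, in positions.
longest_trained = 2 * math.pi / trained[-1]
longest_scaled = 2 * math.pi / scaled[-1]
print(f"longest wavelength: {longest_trained:,.0f} -> {longest_scaled:,.0f} positions")
```

The fastest-rotating pair is unchanged (its frequency is 1 for any base), while slower pairs rotate more slowly with the larger base, which stretches the range of positions they can distinguish.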

| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
| gte-large-en-v1.5 | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
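
The numbers in parentheses in the header are task counts per category. Assuming the overall Average (56) is a mean over all 56 tasks, it can be approximated from the category averages by weighting each one by its task count. The sketch below does this for the gte-large-en-v1.5 row. Since the category averages are themselves rounded, the result is expected to match the reported 65.39 only to within rounding.

```python
# category -> (number of tasks, gte-large-en-v1.5 category average from the table)
categories = {
    "Class.": (12, 77.75),
    "Clust.": (11, 47.95),
    "Pair Class.": (3, 84.63),
    "Reran.": (4, 58.50),
    "Retr.": (15, 57.91),
    "STS": (10, 81.43),
    "Summ.": (1, 30.91),
}

total_tasks = sum(n for n, _ in categories.values())
weighted_avg = sum(n * score for n, score in categories.values()) / total_tasks

print(f"tasks: {total_tasks}")
print(f"task-weighted average: {weighted_avg:.3f} (reported: 65.39)")
```

Note that an unweighted mean of the seven category scores would give a noticeably different number, because Retrieval and Classification together account for 27 of the 56 tasks.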

LoCo

| Model Name | Dimension | Sequence Length | Average (5) | QsmsumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
| gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}