
Model Card for udever-bloom

udever-bloom-560m is fine-tuned from bigscience/bloom-560m via BitFit on MS MARCO Passage Ranking, SNLI, and MultiNLI data. It is a universal embedding model that works across tasks and across both natural and programming languages. (Technically, udever is essentially sgpt-bloom with some minor improvements.)

Model Details

Model Description

Model Sources

Checkpoints

On ModelScope / 魔搭社区: udever-bloom-560m, udever-bloom-1b1, udever-bloom-3b, udever-bloom-7b1

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoTokenizer, BloomModel

tokenizer = AutoTokenizer.from_pretrained('izhx/udever-bloom-560m')
model = BloomModel.from_pretrained('izhx/udever-bloom-560m')

# Special tokens marking the beginning/end of queries and documents.
boq, eoq, bod, eod = '[BOQ]', '[EOQ]', '[BOD]', '[EOD]'
eoq_id, eod_id = tokenizer.convert_tokens_to_ids([eoq, eod])

# Left padding is required so that the last position of every sequence is the
# [EOQ]/[EOD] token, whose hidden state is used as the embedding.
if tokenizer.padding_side != 'left':
    print('!!!', tokenizer.padding_side)
    tokenizer.padding_side = 'left'


def encode(texts: list, is_query: bool = True, max_length=300):
    bos = boq if is_query else bod
    eos_id = eoq_id if is_query else eod_id
    texts = [bos + t for t in texts]
    # Reserve one position for the end token appended below.
    encoding = tokenizer(
        texts, truncation=True, max_length=max_length - 1, padding=True
    )
    # Append [EOQ]/[EOD] to every sequence and extend its attention mask.
    for ids, mask in zip(encoding['input_ids'], encoding['attention_mask']):
        ids.append(eos_id)
        mask.append(1)
    inputs = tokenizer.pad(encoding, return_tensors='pt')
    with torch.inference_mode():
        outputs = model(**inputs)
        # The embedding is the hidden state of the final (end) token.
        embeds = outputs.last_hidden_state[:, -1]
    return embeds


encode(['I am Bert', 'You are Elmo'])
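Queries and documents are encoded separately ([BOQ]/[EOQ] vs. [BOD]/[EOD]) and compared with a similarity function. Below is a minimal retrieval sketch using cosine similarity; the query and passage texts are made-up examples, not from the original card.

import torch.nn.functional as F

query_embeds = encode(['how to bake bread'], is_query=True)
doc_embeds = encode(
    ['Mix flour, water, salt and yeast, then bake at 230°C.',
     'The capital of France is Paris.'],
    is_query=False,
)

# Cosine similarity between the query and each passage; higher = more relevant.
scores = F.cosine_similarity(query_embeds, doc_embeds)
print(scores)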

Training Details

Training Data

Training Procedure

Preprocessing

MS MARCO hard negatives are those provided by the sentence-transformers MS MARCO bi-encoder training script (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86). Negatives for SNLI and MultiNLI are randomly sampled.
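To illustrate the random negative sampling for SNLI/MultiNLI, here is a minimal sketch; the helper name build_triplets and its arguments are hypothetical and not from the released training code.

import random

def build_triplets(nli_pairs, corpus, num_negs=1, seed=42):
    # nli_pairs: (premise, entailed hypothesis) pairs used as (query, positive).
    # corpus: pool of sentences from which negatives are drawn at random.
    rng = random.Random(seed)
    triplets = []
    for premise, hypothesis in nli_pairs:
        # A real implementation would also ensure the positive is not re-sampled.
        negatives = rng.sample(corpus, num_negs)
        triplets.append((premise, hypothesis, negatives))
    return triplets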

Training Hyperparameters

  • Training regime: tf32, BitFit (only bias parameters are updated; see the sketch after this list)
  • Batch size: 1024
  • Epochs: 3
  • Optimizer: AdamW
  • Learning rate: 1e-4
  • Scheduler: constant with warmup
  • Warmup: 0.25 epoch
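One way to wire up this setup is sketched below. This is not the authors' training code; the BitFit filter, the use of transformers' get_constant_schedule_with_warmup, and the steps_per_epoch placeholder are reconstructed from the hyperparameters above.

import torch
from transformers import BloomModel, get_constant_schedule_with_warmup

model = BloomModel.from_pretrained('bigscience/bloom-560m')

# BitFit: freeze all weights, train only the bias terms.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith('bias')

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# Constant schedule after a 0.25-epoch warmup (steps_per_epoch is a placeholder).
steps_per_epoch = 1000
scheduler = get_constant_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.25 * steps_per_epoch)
)

# tf32 matmuls (A100).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True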

Evaluation

Table 1: Massive Text Embedding Benchmark MTEB

| MTEB | Avg. | Class. | Clust. | PairClass. | Rerank. | Retr. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|
| #Datasets ➡️ | 56 | 12 | 11 | 3 | 4 | 15 | 10 | 1 |
| bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| bge-base-en-v1.5 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
| gte-large | 63.13 | 73.33 | 46.84 | 85 | 59.13 | 52.22 | 83.35 | 31.66 |
| gte-base | 62.39 | 73.01 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 |
| e5-large-v2 | 62.25 | 75.24 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 |
| instructor-xl | 61.79 | 73.12 | 44.74 | 86.62 | 57.29 | 49.26 | 83.06 | 32.32 |
| instructor-large | 61.59 | 73.86 | 45.29 | 85.89 | 57.54 | 47.57 | 83.15 | 31.84 |
| e5-base-v2 | 61.5 | 73.84 | 43.8 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 |
| e5-large | 61.42 | 73.14 | 43.33 | 85.94 | 56.53 | 49.99 | 82.06 | 30.97 |
| text-embedding-ada-002 (OpenAI API) | 60.99 | 70.93 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 |
| e5-base | 60.44 | 72.63 | 42.11 | 85.09 | 55.7 | 48.75 | 80.96 | 31.01 |
| SGPT-5.8B-msmarco | 58.93 | 68.13 | 40.34 | 82 | 56.56 | 50.25 | 78.1 | 31.46 |
| sgpt-bloom-7b1-msmarco | 57.59 | 66.19 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 |
| Udever-bloom-560m | 55.80 | 68.04 | 36.89 | 81.05 | 52.60 | 41.19 | 79.93 | 32.06 |
| Udever-bloom-1b1 | 58.28 | 70.18 | 39.11 | 83.11 | 54.28 | 45.27 | 81.52 | 31.10 |
| Udever-bloom-3b | 59.86 | 71.91 | 40.74 | 84.06 | 54.90 | 47.67 | 82.37 | 30.62 |
| Udever-bloom-7b1 | 60.63 | 72.13 | 40.81 | 85.40 | 55.91 | 49.34 | 83.01 | 30.97 |

Table 2: CodeSearchNet

| CodeSearchNet | Go | Ruby | Python | Java | JS | PHP | Avg. |
|---|---|---|---|---|---|---|---|
| CodeBERT | 69.3 | 70.6 | 84.0 | 86.8 | 74.8 | 70.6 | 76.0 |
| GraphCodeBERT | 84.1 | 73.2 | 87.9 | 75.7 | 71.1 | 72.5 | 77.4 |
| cpt-code S | 97.7 | 86.3 | 99.8 | 94.0 | 86.0 | 96.7 | 93.4 |
| cpt-code M | 97.5 | 85.5 | 99.9 | 94.4 | 86.5 | 97.2 | 93.5 |
| sgpt-bloom-7b1-msmarco | 76.79 | 69.25 | 95.68 | 77.93 | 70.35 | 73.45 | 77.24 |
| Udever-bloom-560m | 75.38 | 66.67 | 96.23 | 78.99 | 69.39 | 73.69 | 76.73 |
| Udever-bloom-1b1 | 78.76 | 72.85 | 97.67 | 82.77 | 74.38 | 78.97 | 80.90 |
| Udever-bloom-3b | 80.63 | 75.40 | 98.02 | 83.88 | 76.18 | 79.67 | 82.29 |
| Udever-bloom-7b1 | 79.37 | 76.59 | 98.38 | 84.68 | 77.49 | 80.03 | 82.76 |

Table 3: Chinese multi-domain retrieval (Multi-CPR)

| Model | Train | Backbone | E-commerce MRR@10 | E-commerce Recall@1k | Entertainment video MRR@10 | Entertainment video Recall@1k | Medical MRR@10 | Medical Recall@1k |
|---|---|---|---|---|---|---|---|---|
| BM25 | - | - | 0.225 | 0.815 | 0.225 | 0.780 | 0.187 | 0.482 |
| Doc2Query | - | - | 0.239 | 0.826 | 0.238 | 0.794 | 0.210 | 0.505 |
| DPR-1 | In-Domain | BERT | 0.270 | 0.921 | 0.254 | 0.934 | 0.327 | 0.747 |
| DPR-2 | In-Domain | BERT-CT | 0.289 | 0.926 | 0.263 | 0.935 | 0.339 | 0.769 |
| text-embedding-ada-002 | General | GPT | 0.183 | 0.825 | 0.159 | 0.786 | 0.245 | 0.593 |
| sgpt-bloom-7b1-msmarco | General | BLOOM | 0.242 | 0.840 | 0.227 | 0.829 | 0.311 | 0.675 |
| Udever-bloom-560m | General | BLOOM | 0.156 | 0.802 | 0.149 | 0.749 | 0.245 | 0.571 |
| Udever-bloom-1b1 | General | BLOOM | 0.244 | 0.863 | 0.208 | 0.815 | 0.241 | 0.557 |
| Udever-bloom-3b | General | BLOOM | 0.267 | 0.871 | 0.228 | 0.836 | 0.288 | 0.619 |
| Udever-bloom-7b1 | General | BLOOM | 0.296 | 0.889 | 0.267 | 0.907 | 0.343 | 0.705 |

For more results, refer to Section 3 of the paper.

Technical Specifications

Model Architecture and Objective

Compute Infrastructure

  • Nvidia A100 SXM4 80GB.
  • torch 2.0.0, transformers 4.29.2.

Citation

BibTeX:

@article{zhang2023language,
  title={Language Models are Universal Embedders},
  author={Zhang, Xin and Li, Zehan and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Zhang, Min},
  journal={arXiv preprint arXiv:2310.08232},
  year={2023}
}