license: bigscience-bloom-rail-1.0

Model Card for udever-bloom

udever-bloom-560m is finetuned from bigscience/bloom-560m via BitFit on MS MARCO Passage Ranking, SNLI, and MultiNLI data. It is a universal embedding model across tasks, natural languages, and programming languages. (From a technical view, udever is sgpt-bloom with some minor improvements.)

Model Details

Model Description

Developed by: Alibaba Group
Model type: Transformer-based Language Model (decoder-only)
Language(s) (NLP): Multiple; see the BLOOM training data
Finetuned from model: bigscience/bloom-560m

Model Sources

Repository: github.com/izhx/uni-rep
Paper: Language Models are Universal Embedders

How to Get Started with the Model

Use the code below to get started with the model.
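As a minimal sketch, assuming the checkpoint loads through the standard `transformers` auto classes: the helper names `embed` and `cosine_similarity` are hypothetical, `model_id` must be supplied by the reader (a Hub ID or local path; none is given here), and the masked mean pooling is an assumption rather than the confirmed udever pooling. The repository listed under Model Sources is the authority on pooling and any special tokens.

```python
import math
import sys


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_id):
    """Return one embedding (a list of floats) per input text.

    ASSUMPTION: masked mean pooling over the last hidden states.
    The official pooling may differ; check the repository.
    """
    # Imported lazily so cosine_similarity works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

    # Zero out padding positions, then average over the real tokens.
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python embed_sketch.py <hub-id-or-local-path>
    query_vec, code_vec = embed(
        ["how to reverse a list in python", "def rev(xs): return xs[::-1]"],
        sys.argv[1],
    )
    print(cosine_similarity(query_vec, code_vec))
```

Higher cosine similarity between a query and a passage or code snippet indicates a closer match; ranking candidates by this score is the usual way to use such embeddings for retrieval.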

Training Details

Training Data

MS MARCO Passage Ranking, retrieved with the sentence-transformers example script (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86)
SNLI and MultiNLI (https://sbert.net/datasets/AllNLI.tsv.gz)

Training Procedure

Preprocessing

MS MARCO hard negatives are those provided by the sentence-transformers example script (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86). Negatives for SNLI and MultiNLI are randomly sampled.
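Random negative sampling for the NLI pairs could look roughly like the sketch below. The function name, the `(anchor, positive)` pair layout, and the `num_negatives` and `seed` parameters are hypothetical choices for illustration, not the authors' preprocessing code.

```python
import random


def add_random_negatives(pairs, num_negatives=1, seed=0):
    """Attach randomly sampled negatives to (anchor, positive) pairs.

    Negatives are drawn from the positives of other pairs.
    This is an illustrative sketch; the real pipeline may differ.
    """
    rng = random.Random(seed)
    examples = []
    for i, (anchor, positive) in enumerate(pairs):
        # Candidate pool: positives of all other pairs, skipping exact
        # duplicates of this pair's own positive.
        pool = [p for j, (_, p) in enumerate(pairs) if j != i and p != positive]
        negatives = rng.sample(pool, k=min(num_negatives, len(pool)))
        examples.append(
            {"anchor": anchor, "positive": positive, "negatives": negatives}
        )
    return examples
```

Rebuilding the pool for every pair is quadratic in the number of pairs, which is fine for a demonstration but would need a cheaper sampling scheme on the full dataset.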

Training Hyperparameters

Training regime: tf32, BitFit
Batch size: 1024
Epochs: 3
Optimizer: AdamW
Learning rate: 1e-4
Scheduler: constant with warmup
Warmup: 0.25 epoch
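Two of the settings above can be illustrated in a few lines: BitFit (only bias terms are updated) and a constant learning rate after warmup. The sketch below is framework-free; the function names, the linear shape of the warmup, the `steps_per_epoch` argument, and the rule "a parameter is a bias if its name ends in `bias`" are assumptions for illustration.

```python
def is_bitfit_trainable(param_name):
    """BitFit updates only bias terms; all other weights stay frozen.

    ASSUMPTION: biases are identified by a name ending in "bias".
    """
    return param_name.endswith("bias")


def lr_at_step(step, steps_per_epoch, base_lr=1e-4, warmup_epochs=0.25):
    """Learning rate at a given optimizer step.

    Ramps linearly from 0 to base_lr over the warmup (the linear shape
    is an assumption), then stays constant at base_lr.
    """
    warmup_steps = max(1, int(warmup_epochs * steps_per_epoch))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

With a PyTorch model, the first helper would typically be applied as `p.requires_grad = is_bitfit_trainable(name)` for each `(name, p)` in `model.named_parameters()`, so that the optimizer only updates the biases.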

Evaluation

Table 1: Massive Text Embedding Benchmark (MTEB)

MTEB	Avg.	Class.	Clust.	PairClass.	Rerank.	Retr.	STS	Summ.
#Datasets ➡️	56	12	11	3	4	15	10	1

bge-large-en-v1.5	64.23	75.97	46.08	87.12	60.03	54.29	83.11	31.61
bge-base-en-v1.5	63.55	75.53	45.77	86.55	58.86	53.25	82.4	31.07
gte-large	63.13	73.33	46.84	85	59.13	52.22	83.35	31.66
gte-base	62.39	73.01	46.2	84.57	58.61	51.14	82.3	31.17
e5-large-v2	62.25	75.24	44.49	86.03	56.61	50.56	82.05	30.19
instructor-xl	61.79	73.12	44.74	86.62	57.29	49.26	83.06	32.32
instructor-large	61.59	73.86	45.29	85.89	57.54	47.57	83.15	31.84
e5-base-v2	61.5	73.84	43.8	85.73	55.91	50.29	81.05	30.28
e5-large	61.42	73.14	43.33	85.94	56.53	49.99	82.06	30.97
text-embedding-ada-002 (OpenAI API)	60.99	70.93	45.9	84.89	56.32	49.25	80.97	30.8
e5-base	60.44	72.63	42.11	85.09	55.7	48.75	80.96	31.01
SGPT-5.8B-msmarco	58.93	68.13	40.34	82	56.56	50.25	78.1	31.46
sgpt-bloom-7b1-msmarco	57.59	66.19	38.93	81.9	55.65	48.22	77.74	33.6

Udever-bloom-560m	55.80	68.04	36.89	81.05	52.60	41.19	79.93	32.06
Udever-bloom-1b1	58.28	70.18	39.11	83.11	54.28	45.27	81.52	31.10
Udever-bloom-3b	59.86	71.91	40.74	84.06	54.90	47.67	82.37	30.62
Udever-bloom-7b1	60.63	72.13	40.81	85.40	55.91	49.34	83.01	30.97

Table 2: CodeSearchNet

CodeSearchNet	Go	Ruby	Python	Java	JS	PHP	Avg.
CodeBERT	69.3	70.6	84.0	86.8	74.8	70.6	76.0
GraphCodeBERT	84.1	73.2	87.9	75.7	71.1	72.5	77.4
cpt-code S	97.7	86.3	99.8	94.0	86.0	96.7	93.4
cpt-code M	97.5	85.5	99.9	94.4	86.5	97.2	93.5
sgpt-bloom-7b1-msmarco	76.79	69.25	95.68	77.93	70.35	73.45	77.24

Udever-bloom-560m	75.38	66.67	96.23	78.99	69.39	73.69	76.73
Udever-bloom-1b1	78.76	72.85	97.67	82.77	74.38	78.97	80.90
Udever-bloom-3b	80.63	75.40	98.02	83.88	76.18	79.67	82.29
Udever-bloom-7b1	79.37	76.59	98.38	84.68	77.49	80.03	82.76

Table 3: Chinese multi-domain retrieval (Multi-CPR)

			E-commerce		Entertainment video		Medical
Model	Train	Backbone	MRR@10	Recall@1k	MRR@10	Recall@1k	MRR@10	Recall@1k

BM25	-	-	0.225	0.815	0.225	0.780	0.187	0.482
Doc2Query	-	-	0.239	0.826	0.238	0.794	0.210	0.505
DPR-1	In-Domain	BERT	0.270	0.921	0.254	0.934	0.327	0.747
DPR-2	In-Domain	BERT-CT	0.289	0.926	0.263	0.935	0.339	0.769
text-embedding-ada-002	General	GPT	0.183	0.825	0.159	0.786	0.245	0.593
sgpt-bloom-7b1-msmarco	General	BLOOM	0.242	0.840	0.227	0.829	0.311	0.675

Udever-bloom-560m	General	BLOOM	0.156	0.802	0.149	0.749	0.245	0.571
Udever-bloom-1b1	General	BLOOM	0.244	0.863	0.208	0.815	0.241	0.557
Udever-bloom-3b	General	BLOOM	0.267	0.871	0.228	0.836	0.288	0.619
Udever-bloom-7b1	General	BLOOM	0.296	0.889	0.267	0.907	0.343	0.705
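For readers unfamiliar with the two metrics in Table 3, the sketch below shows how MRR@10 and Recall@1k are commonly computed from a ranked list of document IDs. The function names and the input layout are hypothetical; the benchmark's official evaluation script may handle edge cases differently.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_over_queries(metric, runs, **kwargs):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, rel, **kwargs) for ranked, rel in runs) / len(runs)
```

The reported numbers are these per-query values averaged over all queries in each domain, which is what `mean_over_queries` does.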

For more results, see Section 3 of the paper.

Technical Specifications

Model Architecture and Objective

Model: bigscience/bloom-560m.
Objective: Contrastive loss with hard negatives (refer to paper section 2.2).
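A contrastive objective with hard negatives is often written in the InfoNCE form: the positive passage should score higher than every negative for the same query. The sketch below computes that loss for a single query in plain Python. The cosine scoring, the `temperature` value of 0.05, and the function names are assumptions for illustration; section 2.2 of the paper defines the loss actually used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query.

    Equals -log of the softmax probability assigned to the positive
    among the positive and all negatives. The temperature is an
    assumed value, not taken from the paper.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]
```

The loss approaches zero when the positive is far more similar to the query than any negative, and grows when a hard negative outscores the positive. In batched training the negatives usually also include the other examples in the batch.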

Compute Infrastructure

Nvidia A100 SXM4 80GB.
torch 2.0.0, transformers 4.29.2.

Citation

BibTeX:

@article{zhang2023language,
  title={Language Models are Universal Embedders},
  author={Zhang, Xin and Li, Zehan and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Zhang, Min},
  journal={arXiv preprint arXiv:2310.08232},
  year={2023}
}