Model Overview

Model Type: Text Embedding
Number of Parameters: 4B
Context Length: 32k
Adapted from Qwen/Qwen3-4B
Pooling: Last token

Robust Training for General Text Embeddings via Bagging-Based Model Merging

General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. Current batch-level shuffling for multi-task text embedding exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, BOOM naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that BOOM consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.
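
The bagging-and-merging idea described above can be illustrated with a minimal sketch. It assumes each model is a flat dictionary of parameter lists and that merging is a plain weighted average. The names `sample_subsets` and `merge_parameters` and the averaging rule are illustrative assumptions, not the paper's implementation; the released checkpoint is merged with Multi-SLERP, as listed under Models Merged.

```python
import random


def sample_subsets(corpus, num_models, fraction, seed=0):
    """Draw one random subset of the corpus per model (without replacement)."""
    rng = random.Random(seed)
    size = max(1, int(len(corpus) * fraction))
    return [rng.sample(corpus, size) for _ in range(num_models)]


def merge_parameters(models, weights=None):
    """Weighted average of parameter vectors; a stand-in for the real merge method."""
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)
    merged = {}
    for name in models[0]:
        length = len(models[0][name])
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models)) / total
            for i in range(length)
        ]
    return merged


# Toy usage: "training" is faked with hand-written parameter values.
corpus = list(range(100))
subsets = sample_subsets(corpus, num_models=3, fraction=0.5)
models = [
    {"layer.weight": [1.0, 2.0]},
    {"layer.weight": [3.0, 4.0]},
    {"layer.weight": [5.0, 6.0]},
]
merged = merge_parameters(models)
print(len(subsets), len(subsets[0]))
print(merged)
```

In a real pipeline each subset would be used to fine-tune a separate embedding model, and the merged parameters would be loaded back into a single model so that inference cost stays the same as for one model.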

Training data:

First Stage (General-Text-Data):

Retrieval: ELI5, HotpotQA, FEVER, MSMARCO passage and document ranking, NQ, NLI, SQuAD, TriviaQA, and FiQA.
Reranking: StackOverFlowDupQuestions.
Classification: AmazonReviews-Classification, Banking77Classification, Emotion-Classification, MTOPIntentClassification, IMDB-Classification, ToxicConversationsClassification, TweetSentimentExtraction-Classification, AmazonCounterfactual-Classification.
Clustering: Arxiv/Biorxiv/Medrxiv/Reddit/StackExchangeClustering-S2S/P2P, TwentyNewsgroups-Clustering.
Semantic Text Similarity (STS): STS12, STS22, STSBenchmark.
DuReader, MIRACL, Mr. TyDi, and T2-Ranking.
Cornstack: JavaScript, Java, Python, PHP, and Ruby (500K sampled).
About 2.8M samples.

Second Stage:

A 40% sample of General-Text-Data
Code retrieval training data: apps, codefeedback-mt, codefeedback-st, CodeSearchNet-ccr_go, CodeSearchNet-ccr_javascript, CodeSearchNet-ccr_java, CodeSearchNet-ccr_php, CodeSearchNet-ccr_python, CodeSearchNet-ccr_ruby, CodeSearchNet_go, CodeSearchNet_javascript, CodeSearchNet_java, CodeSearchNet_php, CodeSearchNet_python, CodeSearchNet_ruby, codetrans-contest, codetrans-dl, cosqa, stackoverflow-qa, synthetic-text2sql
FreedomIntelligence__Huatuo26M-Lite, infgrad__retrieval_data_llm, marco_chinese, msmarco-en2zh_sub_mixneg, msmarco-zh2en_sub_mixneg, arguana, quora, scidocsrr
About 4.3M samples.

Models Merged

The following models were merged to produce ICT-TIME-and-Querit-embedding-v1:

models:
  - model: First_stage(BOOM_4B_v1)
    parameters:
      weight: 2.8
  - model: Second stage model
    parameters:
      weight: 4.3
merge_method: multislerp
dtype: float32

This model was merged using the Multi-SLERP merge method.
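
For intuition, the two merge weights can be read as relative proportions; they appear to mirror the approximate sizes of the two training sets (2.8M and 4.3M). The sketch below applies ordinary two-point spherical linear interpolation (SLERP) to toy vectors, using the normalized second-stage weight as the interpolation factor. It is an illustration only: treating the weights as normalized proportions is an assumption, and the Multi-SLERP implementation in the merge tool may differ in its details.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * x + coef_b * y for x, y in zip(a, b)]


# Weights from the merge config, normalized to sum to 1 (assumption).
w_first, w_second = 2.8, 4.3
t = w_second / (w_first + w_second)  # about 0.606

first_stage = [1.0, 0.0]   # toy stand-in for first-stage parameters
second_stage = [0.0, 1.0]  # toy stand-in for second-stage parameters
merged = slerp(first_stage, second_stage, t)
print(round(t, 3), [round(x, 3) for x in merged])
```

Unlike a plain weighted average, SLERP moves along the arc between the two vectors, so the merged toy vector keeps unit length here.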

Performance

Benchmark	Domain	Score
MTEB (English, v2)	General	70.12
HUME (v1)	General	79.25
MTEB (Code)	Code	78.02
MTEB (Medical)	Medical	64.21
MTEB (Law)	Law	62.23
LongEmbed	Long text	78.04
RTEB	Multi-domain	67.72
ChemTEB	Chemical	75.18
MTEB (Multilingual, v2)	General	63.97

Usage

Sentence Transformers Usage

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.4739, 0.0365],
#         [0.0895, 0.4089]])
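
To turn the similarity matrix into retrieval results, each query's row can be sorted by score. The snippet below is a small sketch that uses the rounded scores printed above as plain Python lists; the helper name `rank_documents` is illustrative and not part of Sentence Transformers.

```python
def rank_documents(similarity_rows):
    """Return, per query, document indices sorted from most to least similar."""
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in similarity_rows
    ]


# Rounded scores from the example output (rows: queries, columns: documents).
scores = [
    [0.4739, 0.0365],
    [0.0895, 0.4089],
]
rankings = rank_documents(scores)
for query_idx, order in enumerate(rankings):
    print(f"query {query_idx}: best document = {order[0]}")
```

With these scores, each query ranks its matching document first (document 0 for the first query, document 1 for the second).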

Transformers Usage

# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1", padding_side='left')
model = AutoModel.from_pretrained("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1", attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)
# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.4739307463169098, 0.036478668451309204], [0.08952689170837402, 0.4088672697544098]]
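
The `last_token_pool` helper handles both padding conventions. As a sanity check, the same index logic can be reproduced on toy attention masks without loading the model. The function `last_token_index` below is a hypothetical pure-Python restatement written for this illustration only.

```python
def last_token_index(attention_mask):
    """Index of the last real (non-padding) token in each row of a 0/1 attention mask."""
    # If every row ends with a 1, the batch is left-padded (or unpadded),
    # so the final position already holds the last real token.
    left_padded = all(row[-1] == 1 for row in attention_mask)
    if left_padded:
        return [len(row) - 1 for row in attention_mask]
    # Otherwise assume right padding: the last real token is at (number of ones) - 1.
    return [sum(row) - 1 for row in attention_mask]


left = [[0, 0, 1, 1], [1, 1, 1, 1]]   # left-padded batch
right = [[1, 1, 0, 0], [1, 1, 1, 1]]  # right-padded batch
print(last_token_index(left))
print(last_token_index(right))
```

This is why the tokenizer is loaded with `padding_side='left'` in the example: with left padding the pooled vector is simply the hidden state at the final position for every row.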

Citation

If you find our work helpful, please consider citing it.

@article{zhang2026bagging,
  title={Bagging-Based Model Merging for Robust General Text Embeddings},
  author={Zhang, Hengran and Bi, Keping and Guo, Jiafeng and Zhang, Jiaming and Yang, Wenbo and Shi, Daiting and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2602.05787},
  year={2026}
}

Paper: Bagging-Based Model Merging for Robust General Text Embeddings (arXiv:2602.05787, published Feb 5)