Model Overview

Model Type: Text Embedding
Number of Parameters: 4B
Context Length: 32k
Adapted from Qwen/Qwen3-4B
Pooling: Last token

Robust Training for General Text Embeddings via Bagging-Based Model Merging

📖 arXiv Paper | 🤗 Model Trained on English data | 🤗 Model Trained on General data | 🤗 Model Trained on Two-stage Incremental Learning | 🛠️ GitHub

General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. Current batch-level shuffling for multi-task text embedding exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, BOOM naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that BOOM consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.
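
The bagging-and-merge idea can be illustrated on a toy problem. The sketch below is not the BOOM training code: it fits a one-parameter linear model on several randomly sampled subsets and merges the results by plain parameter averaging, which stands in for whatever merge operator is used in practice. The names `fit_slope` and `bagged_merge`, the number of models, and the subset ratio are all hypothetical and chosen for illustration only.

```python
import random


def fit_slope(pairs):
    """Closed-form least-squares slope for y = w * x (no intercept)."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


def bagged_merge(data, n_models=5, subset_ratio=0.4, seed=0):
    """Train one model per sampled subset, then merge by averaging parameters.

    Assumption: uniform averaging stands in for the real merge method.
    """
    rng = random.Random(seed)
    k = max(1, int(len(data) * subset_ratio))
    slopes = [fit_slope(rng.sample(data, k)) for _ in range(n_models)]
    return sum(slopes) / len(slopes), slopes


# Toy corpus: y = 3x plus a little Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(1.0, 2.0) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

merged, members = bagged_merge(data)
print("member slopes:", [round(s, 3) for s in members])
print("merged slope: ", round(merged, 3))
```

The merged result is a single set of parameters, so inference costs the same as for one model, which is the property the abstract highlights.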

Training data:

First Stage: General-Text-Data:

  • Retrieval: ELI5, HotpotQA, FEVER, MSMARCO passage and document ranking, NQ, NLI, SQuAD, TriviaQA, and FiQA.
  • Reranking: StackOverFlowDupQuestions.
  • Classification: AmazonReviews-Classification, Banking77Classification, Emotion-Classification, MTOPIntentClassification, IMDB-Classification, ToxicConversationsClassification, TweetSentimentExtraction-Classification, AmazonCounterfactual-Classification.
  • Clustering: Arxiv/Biorxiv/Medrxiv/Reddit/StackExchangeClustering-S2S/P2P, TwentyNewsgroups-Clustering.
  • Semantic Text Similarity (STS): STS12, STS22, STSBenchmark.
  • DuReader, MIRACL, Mr.TyDi, and T2-Ranking.
  • Cornstack: JavaScript, Java, Python, PHP, and Ruby (500K sampled).
    About 2.8M samples in total.

Second Stage:

  • A 40% sample of General-Text-Data
  • Code retrieval training data: apps, codefeedback-mt, codefeedback-st, CodeSearchNet-ccr_go, CodeSearchNet-ccr_javascript, CodeSearchNet-ccr_java, CodeSearchNet-ccr_php, CodeSearchNet-ccr_python, CodeSearchNet-ccr_ruby, CodeSearchNet_go, CodeSearchNet_javascript, CodeSearchNet_java, CodeSearchNet_php, CodeSearchNet_python, CodeSearchNet_ruby, codetrans-contest, codetrans-dl, cosqa, stackoverflow-qa, synthetic-text2sql
  • FreedomIntelligence__Huatuo26M-Lite, infgrad__retrieval_data_llm, marco_chinese, msmarco-en2zh_sub_mixneg, msmarco-zh2en_sub_mixneg, arguana, quora, scidocsrr.
    About 4.3M samples in total.

Models Merged

The following models were merged to produce ICT-TIME-and-Querit-embedding-v1:

models:
  - model: First_stage(BOOM_4B_v1)
    parameters:
      weight: 2.8
  - model: Second stage model
    parameters:
      weight: 4.3
merge_method: multislerp
dtype: float32

This model was merged using the Multi-SLERP merge method.
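
As a rough illustration of how the two weights in this configuration might act, the sketch below normalizes them to sum to one and applies standard two-point spherical linear interpolation (SLERP) to a pair of toy vectors. This is a simplification under stated assumptions: the actual Multi-SLERP implementation operates on full model tensors and may differ in detail, and the helper `slerp` is hypothetical. The weights 2.8 and 4.3 appear to mirror the approximate sizes (in millions of samples) of the first- and second-stage training sets.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between the directions of v0 and v1.

    t = 0 returns the direction of v0, t = 1 the direction of v1.
    The result is a unit vector unless the inputs are nearly parallel.
    """
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    u0 = [a / n0 for a in v0]
    u1 = [b / n1 for b in v1]
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    omega = math.acos(dot)
    if math.sin(omega) < eps:
        # Nearly parallel (or opposite) vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(u0, u1)]
    c0 = math.sin((1 - t) * omega) / math.sin(omega)
    c1 = math.sin(t * omega) / math.sin(omega)
    return [c0 * a + c1 * b for a, b in zip(u0, u1)]


# Weights taken from the merge configuration; normalization is an assumption.
w_first, w_second = 2.8, 4.3
t = w_second / (w_first + w_second)  # fraction assigned to the second-stage model

# Toy stand-ins for one flattened parameter tensor from each model.
first_stage = [1.0, 0.0]
second_stage = [0.0, 1.0]
merged = slerp(first_stage, second_stage, t)
print("interpolation fraction:", round(t, 4))
print("merged vector:", [round(x, 4) for x in merged])
```

Because the second-stage weight is larger, the merged direction lies closer to the second-stage vector than to the first-stage one.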

Performance

| Benchmark | Domain | Score |
|---|---|---|
| MTEB (English, v2) | General | 70.12 |
| HUME (v1) | General | 79.25 |
| MTEB (Code) | Code | 78.02 |
| MTEB (Medical) | Medical | 64.21 |
| MTEB (Law) | Law | 62.23 |
| LongEmbed | Long text | 78.04 |
| RTEB | Multi-domain | 67.72 |
| ChemTEB | Chemical | 75.18 |
| MTEB (Multilingual, v2) | General | 63.97 |

Usage

Sentence Transformers Usage

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.4739, 0.0365],
#         [0.0895, 0.4089]])
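
To turn a similarity matrix like the one printed above into a ranked result list per query, one option is to sort document indices by score. The sketch below copies the printed values into plain Python lists so that it runs without loading the model; the helper `rank_documents` is hypothetical and not part of any library.

```python
def rank_documents(similarity_rows, top_k=1):
    """For each query row, return the indices of the top_k highest-scoring documents."""
    ranked = []
    for row in similarity_rows:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:top_k])
    return ranked


# Scores copied from the example output (rows: queries, columns: documents).
similarity_rows = [
    [0.4739, 0.0365],
    [0.0895, 0.4089],
]
doc_titles = ["Beijing passage", "Gravity passage"]  # short labels for display only

for query_idx, doc_ids in enumerate(rank_documents(similarity_rows)):
    best = doc_ids[0]
    print(f"query {query_idx} -> {doc_titles[best]} (score {similarity_rows[query_idx][best]})")
```

With real model output, the tensor returned by `model.similarity` can be converted first with `.tolist()`.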

Transformers Usage

# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1", padding_side='left')
model = AutoModel.from_pretrained("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained("ICT-TIME-and-Querit/ICT-TIME-and-Querit-embedding-v1", attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)

# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.4739307463169098, 0.036478668451309204], [0.08952689170837402, 0.4088672697544098]]
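
Left padding pairs well with last-token pooling because every sequence in the batch then ends at the same position. The following pure-Python restatement of the branching in `last_token_pool` (nested lists instead of tensors, with a hypothetical helper `last_token_index`) shows which position would be selected under each padding scheme.

```python
def last_token_index(attention_mask):
    """For each row of a 0/1 attention mask, return the index of the last real token.

    Mirrors the branch in `last_token_pool`: if every row's final position is a
    real token (left padding), the answer is simply the last position.
    """
    seq_len = len(attention_mask[0])
    left_padding = all(row[-1] == 1 for row in attention_mask)
    if left_padding:
        return [seq_len - 1] * len(attention_mask)
    # Right padding: the last real token sits at (number of real tokens - 1).
    return [sum(row) - 1 for row in attention_mask]


left_padded = [[0, 0, 1, 1], [1, 1, 1, 1]]
right_padded = [[1, 1, 0, 0], [1, 1, 1, 1]]

print(last_token_index(left_padded))   # [3, 3]
print(last_token_index(right_padded))  # [1, 3]
```

With left padding the pooling step reduces to taking the final position for every row, which avoids the per-row index gather.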

Citation

If you find our work helpful, please consider citing it.

@article{zhang2026bagging,
  title={Bagging-Based Model Merging for Robust General Text Embeddings},
  author={Zhang, Hengran and Bi, Keping and Guo, Jiafeng and Zhang, Jiaming and Yang, Wenbo and Shi, Daiting and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2602.05787},
  year={2026}
}