Fine-tuned ColBERT model for semantic caching

This is a PyLate model fine-tuned from colbert-ir/colbertv2.0 on the LangCache Sentence Pairs (subsets=['all'], train+val=True) dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

Model Details

Model Description

Model Type: PyLate model
Base model: colbert-ir/colbertv2.0
Document Length: 128 tokens
Query Length: 128 tokens
Output Dimensionality: 128 dimensions
Similarity Function: MaxSim
Training Dataset:
- LangCache Sentence Pairs (subsets=['all'], train+val=True)
Language: en
License: apache-2.0
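
To make the MaxSim similarity function concrete, here is a minimal pure-Python sketch. It assumes unit-length token vectors and uses toy 2-dimensional embeddings in place of the model's 128-dimensional ones; the helper names `l2_normalize` and `maxsim` are illustrative and are not part of PyLate.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def maxsim(query_tokens, document_tokens):
    """Sum, over query tokens, of the best dot product against any document token."""
    total = 0.0
    for q in query_tokens:
        total += max(sum(a * b for a, b in zip(q, d)) for d in document_tokens)
    return total


# Toy 2-dimensional token embeddings (assumption: the real vectors have 128 dimensions).
query = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
doc_close = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
doc_far = [l2_normalize(v) for v in [[-1.0, 0.0], [1.0, 1.0]]]

score_close = maxsim(query, doc_close)  # every query token finds an exact match
score_far = maxsim(query, doc_far)      # query tokens only find partial matches
```

Because the score is a sum over query tokens, it grows with the number of query tokens and is not bounded to the range [0, 1].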

Model Sources

Documentation: PyLate Documentation
Repository: PyLate on GitHub
Hugging Face: PyLate models on Hugging Face

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
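
The Dense module above has no bias and an identity activation, so it amounts to one matrix multiplication applied to every token vector independently. The sketch below illustrates this with toy sizes (4 input features, 2 output features) standing in for 768 and 128; the function name `project_tokens` and the weight values are hypothetical.

```python
def project_tokens(token_states, weight):
    """Apply a bias-free linear layer to each token vector: output = weight @ token."""
    return [
        [sum(w * x for w, x in zip(row, token)) for row in weight]
        for token in token_states
    ]


# Toy weight of shape (out_features, in_features) = (2, 4); the model uses (128, 768).
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]

# Three token vectors, as the Transformer module would emit them (toy values).
token_states = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 0.0],
]

projected = project_tokens(token_states, weight)
```

The number of tokens is unchanged by the projection; only the per-token dimensionality shrinks.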

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.

Indexing documents

Load the ColBERT model and initialize the PLAID index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="aditeyabaral/langcache-colbert-v1",
)

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a set of queries. Initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
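
For semantic caching, the retrieved matches can be turned into a cache hit or miss by thresholding the best score. The sketch below assumes that each entry of `scores` is a list of dictionaries with "id" and "score" keys, one list per query; `lookup_cache`, `cached_responses`, the mock scores, and the threshold values are all hypothetical, and any real threshold would need to be calibrated on held-out data.

```python
def lookup_cache(results_for_query, cached_responses, threshold):
    """Return the cached response for the best match, or None on a cache miss."""
    if not results_for_query:
        return None
    best = max(results_for_query, key=lambda match: match["score"])
    if best["score"] < threshold:
        return None
    return cached_responses.get(best["id"])


# Mock data shaped like one entry of `scores` (assumed format, arbitrary values).
mock_results = [{"id": "3", "score": 27.4}, {"id": "1", "score": 11.9}]
cached_responses = {"1": "answer for document 1", "3": "answer for document 3"}

hit = lookup_cache(mock_results, cached_responses, threshold=20.0)
miss = lookup_cache(mock_results, cached_responses, threshold=30.0)
```

MaxSim scores are sums over query tokens rather than probabilities, so a threshold chosen for one model or query length may not transfer to another.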

Reranking

If you only want to use the ColBERT model to rerank the output of your first-stage retrieval pipeline without building an index, you can use the rank.rerank function and pass it the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="aditeyabaral/langcache-colbert-v1",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
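
The reranked output refers to documents by id, so it is often convenient to join the ids back to their texts. The sketch below assumes that `reranked_documents` holds one list per query, each a list of dictionaries with "id" and "score" keys in descending score order; the helper `attach_texts` and the mock scores are hypothetical.

```python
def attach_texts(reranked, documents_ids, documents):
    """Pair each reranked id with its text, per query, preserving the reranked order."""
    output = []
    for matches, ids, texts in zip(reranked, documents_ids, documents):
        text_by_id = dict(zip(ids, texts))
        output.append(
            [(match["id"], text_by_id[match["id"]], match["score"]) for match in matches]
        )
    return output


documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Mock output in the assumed format (arbitrary scores, already sorted per query).
mock_reranked = [
    [{"id": 2, "score": 9.5}, {"id": 1, "score": 4.0}],
    [{"id": 3, "score": 8.1}, {"id": 1, "score": 7.7}, {"id": 2, "score": 2.3}],
]

ranked_with_text = attach_texts(mock_reranked, documents_ids, documents)
```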

Evaluation

Metrics

ColBERT Triplet

Dataset: test_triplet
Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator

Metric	Value
accuracy	0.9847
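
As a rough illustration of what a triplet accuracy measures, the sketch below counts the fraction of (anchor, positive, negative) triplets in which the anchor scores higher against the positive than against the negative. This is an assumption about the general metric rather than a copy of the evaluator's code, and the scores are made-up toy values.

```python
def triplet_accuracy(positive_scores, negative_scores):
    """Fraction of triplets where the positive outscores the negative."""
    if len(positive_scores) != len(negative_scores):
        raise ValueError("Each triplet needs one positive and one negative score.")
    correct = sum(1 for p, n in zip(positive_scores, negative_scores) if p > n)
    return correct / len(positive_scores)


# Toy MaxSim scores for four triplets: score(anchor, positive) and score(anchor, negative).
positive_scores = [10.0, 8.0, 5.0, 9.0]
negative_scores = [7.0, 9.0, 4.0, 2.0]

toy_accuracy = triplet_accuracy(positive_scores, negative_scores)
```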

Training Details

Training Dataset

LangCache Sentence Pairs (subsets=['all'], train+val=True)

Dataset: LangCache Sentence Pairs (subsets=['all'], train+val=True)
Size: 1,452,533 training samples
Columns: anchor, positive, and negative_1

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative_1
type	string	string	string
details	min: 9 tokens mean: 29.49 tokens max: 73 tokens	min: 8 tokens mean: 29.18 tokens max: 58 tokens	min: 4 tokens mean: 23.79 tokens max: 52 tokens

Samples:

anchor	positive	negative_1
`Any Canadian teachers (B.Ed. holders) teaching in U.S. schools?`	`Any Canadian teachers (B.Ed. holders) teaching in U.S. schools?`	`Are there many Canadians living and working illegally in the United States?`
`Are there any underlying psychological tricks/tactics that are used when designing the lines for rides at amusement parks?`	`Are there any underlying psychological tricks/tactics that are used when designing the lines for rides at amusement parks?`	`Is there any tricks for straight lines mcqs?`
`Can I pay with a debit card on PayPal?`	`Can I pay with a debit card on PayPal?`	`Can you transfer PayPal funds onto a debit card/credit card?`

Loss: pylate.losses.contrastive.Contrastive
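
One common way to build a contrastive objective from (anchor, positive, negative) triplets is a softmax cross-entropy over similarity scores, where anchor i should score highest against candidate i among all positives and negatives in the batch. The sketch below shows that general formulation in pure Python; it is an assumption for illustration, not a reproduction of the PyLate implementation, and the `temperature` parameter and score values are hypothetical.

```python
import math


def contrastive_cross_entropy(score_rows, temperature=1.0):
    """Mean cross-entropy where the correct candidate for row i is column i."""
    losses = []
    for i, row in enumerate(score_rows):
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # subtract the maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        losses.append(log_sum_exp - scaled[i])
    return sum(losses) / len(losses)


# Two anchors scored against 2 positives followed by 2 negatives (toy values).
well_separated = [[9.0, 1.0, 0.5, 0.2], [1.0, 8.0, 0.3, 0.1]]
confused = [[1.0, 9.0, 0.5, 0.2], [8.0, 1.0, 0.3, 0.1]]
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

loss_good = contrastive_cross_entropy(well_separated)
loss_bad = contrastive_cross_entropy(confused)
loss_uniform = contrastive_cross_entropy(uniform)
```

The loss is small when each anchor's own positive dominates its row and equals the log of the number of candidates when all scores are identical.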

Evaluation Dataset

LangCache Sentence Pairs (split=test)

Dataset: LangCache Sentence Pairs (split=test)
Size: 110,066 evaluation samples
Columns: anchor, positive, and negative_1

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative_1
type	string	string	string
details	min: 5 tokens mean: 28.57 tokens max: 121 tokens	min: 5 tokens mean: 28.01 tokens max: 121 tokens	min: 7 tokens mean: 20.73 tokens max: 65 tokens

Samples:

anchor	positive	negative_1
`What high potential jobs are there other than computer science?`	`What high potential jobs are there other than computer science?`	`Why IT or Computer Science jobs are being over rated than other Engineering jobs?`
`Would India ever be able to develop a missile system like S300 or S400 missile?`	`Would India ever be able to develop a missile system like S300 or S400 missile?`	`Should India buy the Russian S400 air defence missile system?`
`water from the faucet is being drunk by a yellow dog`	`A yellow dog is drinking water from the faucet`	`Do you get more homework in 9th grade than 8th?`

Loss: pylate.losses.contrastive.Contrastive

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 48
num_train_epochs: 5
learning_rate: 0.0002
warmup_steps: 0.1
optim: adamw_torch
weight_decay: 0.001
eval_strategy: steps
per_device_eval_batch_size: 48
eval_on_start: True
push_to_hub: True
hub_model_id: aditeyabaral/langcache-colbert-v1
load_best_model_at_end: True
ddp_find_unused_parameters: True
batch_sampler: no_duplicates
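
The hyperparameters above can be related to the step counts in the training logs with some simple arithmetic. The sketch below assumes two devices, which the card does not state but which is consistent with the logged epoch fractions, and assumes that a fractional warmup_steps is read as a share of the total steps; the function name `training_schedule` is hypothetical.

```python
def training_schedule(num_samples, per_device_batch_size, num_devices,
                      num_epochs, warmup_fraction):
    """Return (steps_per_epoch, total_steps, warmup_steps) with incomplete batches dropped."""
    global_batch_size = per_device_batch_size * num_devices
    steps_per_epoch = num_samples // global_batch_size  # dataloader_drop_last: True
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = round(total_steps * warmup_fraction)
    return steps_per_epoch, total_steps, warmup_steps


steps_per_epoch, total_steps, warmup_steps = training_schedule(
    num_samples=1_452_533,     # training set size from this card
    per_device_batch_size=48,  # per_device_train_batch_size
    num_devices=2,             # assumption, not stated in the card
    num_epochs=5,              # num_train_epochs
    warmup_fraction=0.1,       # warmup_steps: 0.1, assumed to be a ratio
)

# Epoch fraction reached at step 1000, for comparison with the training logs.
epoch_at_step_1000 = round(1000 / steps_per_epoch, 4)
```

Under these assumptions one epoch is 15,130 steps, so step 1000 falls at epoch 0.0661, matching the first logged training row.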

All Hyperparameters


per_device_train_batch_size: 48
num_train_epochs: 5
max_steps: -1
learning_rate: 0.0002
lr_scheduler_type: linear
lr_scheduler_kwargs: None
warmup_steps: 0.1
optim: adamw_torch
optim_args: None
weight_decay: 0.001
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
optim_target_modules: None
gradient_accumulation_steps: 1
average_tokens_across_devices: True
max_grad_norm: 1.0
label_smoothing_factor: 0.0
bf16: False
fp16: False
bf16_full_eval: False
fp16_full_eval: False
tf32: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
use_liger_kernel: False
liger_kernel_config: None
use_cache: False
neftune_noise_alpha: None
torch_empty_cache_steps: None
auto_find_batch_size: False
log_on_each_node: True
logging_nan_inf_filter: True
include_num_input_tokens_seen: no
log_level: passive
log_level_replica: warning
disable_tqdm: False
project: huggingface
trackio_space_id: trackio
eval_strategy: steps
per_device_eval_batch_size: 48
prediction_loss_only: True
eval_on_start: True
eval_do_concat_batches: True
eval_use_gather_object: False
eval_accumulation_steps: None
include_for_metrics: []
batch_eval_metrics: False
save_only_model: False
save_on_each_node: False
enable_jit_checkpoint: False
push_to_hub: True
hub_private_repo: None
hub_model_id: aditeyabaral/langcache-colbert-v1
hub_strategy: every_save
hub_always_push: False
hub_revision: None
load_best_model_at_end: True
ignore_data_skip: False
restore_callback_states_from_checkpoint: False
full_determinism: False
seed: 42
data_seed: None
use_cpu: False
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
dataloader_drop_last: True
dataloader_num_workers: 0
dataloader_pin_memory: True
dataloader_persistent_workers: False
dataloader_prefetch_factor: None
remove_unused_columns: True
label_names: None
train_sampling_strategy: random
length_column_name: length
ddp_find_unused_parameters: True
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
ddp_backend: None
ddp_timeout: 1800
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
deepspeed: None
debug: []
skip_memory_metrics: True
do_predict: False
resume_from_checkpoint: None
warmup_ratio: None
local_rank: -1
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	Validation Loss	accuracy
0	0	-	527.3283	0.8206
0.0661	1000	73.6149	-	-
0.1322	2000	5.4919	-	-
0.1983	3000	0.3036	-	-
0.2644	4000	0.2963	-	-
0.3305	5000	0.3388	-	-
0.3966	6000	0.2512	-	-
0.4627	7000	0.2497	-	-
0.5288	8000	0.2427	-	-
0.5948	9000	0.2585	-	-
0.6609	10000	0.5272	-	-
0.7270	11000	0.2143	-	-
0.7931	12000	0.2065	-	-
0.8592	13000	0.4024	-	-
0.9253	14000	0.6400	-	-
0.9914	15000	0.8233	-	-
1.0575	16000	0.7687	-	-
1.1236	17000	0.7550	-	-
1.1897	18000	0.6507	-	-
1.2558	19000	0.6809	-	-
1.3219	20000	0.6523	-	-
1.3880	21000	0.5745	-	-
1.4541	22000	0.5485	-	-
1.5202	23000	0.5092	-	-
1.5863	24000	0.4815	-	-
1.6523	25000	0.4785	-	-
1.7184	26000	0.4901	-	-
1.7845	27000	0.4581	-	-
1.8506	28000	0.5224	-	-
1.9167	29000	0.4892	-	-
1.9828	30000	0.4884	-	-
2.0489	31000	0.4530	-	-
2.1150	32000	0.4356	-	-
2.1811	33000	0.4555	-	-
2.2472	34000	0.4360	-	-
2.3133	35000	0.4478	-	-
2.3794	36000	0.4297	-	-
2.4455	37000	0.3896	-	-
2.5116	38000	0.3594	-	-
2.5777	39000	0.3581	-	-
2.6438	40000	0.3270	-	-
2.7098	41000	0.3995	-	-
2.7759	42000	0.3665	-	-
2.8420	43000	0.4018	-	-
2.9081	44000	0.4260	-	-
2.9742	45000	0.3957	-	-
3.0403	46000	0.3659	-	-
3.1064	47000	0.3826	-	-
3.1725	48000	0.3603	-	-
3.2386	49000	0.3646	-	-
3.3047	50000	0.4069	-	-
0	0	-	-	0.9847
3.3047	50000	-	2.1447	-
3.3708	51000	0.3493	-	-
3.4369	52000	0.3207	-	-
3.5030	53000	0.3311	-	-
3.5691	54000	0.3208	-	-
3.6352	55000	0.2760	-	-
3.7013	56000	0.3244	-	-
3.7673	57000	0.2789	-	-
3.8334	58000	0.3038	-	-
3.8995	59000	0.3958	-	-
3.9656	60000	0.3338	-	-
4.0317	61000	0.3445	-	-
4.0978	62000	0.3291	-	-
4.1639	63000	0.3225	-	-
4.2300	64000	0.3386	-	-
4.2961	65000	0.3439	-	-
4.3622	66000	0.3378	-	-
4.4283	67000	0.2919	-	-
4.4944	68000	0.3099	-	-
4.5605	69000	0.2911	-	-
4.6266	70000	0.2644	-	-
4.6927	71000	0.3037	-	-
4.7588	72000	0.2862	-	-
4.8249	73000	0.2931	-	-
4.8909	74000	0.3613	-	-
4.9570	75000	0.3131	-	-

The bold row denotes the saved checkpoint.

Framework Versions

Python: 3.12.12
Sentence Transformers: 5.3.0
PyLate: 1.5.0
Transformers: 5.3.0
PyTorch: 2.9.0+cu130
Accelerate: 1.13.0
Datasets: 4.8.5
Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@inproceedings{DBLP:conf/cikm/ChaffinS25,
  author       = {Antoine Chaffin and
                  Rapha{\"{e}}l Sourty},
  editor       = {Meeyoung Cha and
                  Chanyoung Park and
                  Noseong Park and
                  Carl Yang and
                  Senjuti Basu Roy and
                  Jessie Li and
                  Jaap Kamps and
                  Kijung Shin and
                  Bryan Hooi and
                  Lifang He},
  title        = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
  booktitle    = {Proceedings of the 34th {ACM} International Conference on Information
                  and Knowledge Management, {CIKM} 2025, Seoul, Republic of Korea, November
                  10-14, 2025},
  pages        = {6334--6339},
  publisher    = {{ACM}},
  year         = {2025},
  url          = {https://github.com/lightonai/pylate},
  doi          = {10.1145/3746252.3761608},
}
