---
inference: false
datasets:
- bclavie/mmarco-japanese-hard-negatives
- unicamp-dl/mmarco
language:
- ja
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- cl-tohoku/bert-base-japanese-v3
- bclavie/JaColBERT
license: mit
library_name: RAGatouille
---


First version of JaColBERTv2. Weights might be updated in the next few days.

The current early checkpoint is fully functional and outperforms multilingual-e5-large, BGE-M3 and JaColBERT in early results, but the full evaluation is still to be done.

# Intro

> There is currently no JaColBERTv2 technical report. For an overall idea, you can refer to the JaColBERTv1 [arXiv Report](https://arxiv.org/abs/2312.16144)

If you just want to see how to use the model, head to the [Usage section](#usage) below!

Welcome to JaColBERT version 2, the second release of JaColBERT, a Japanese-only document retrieval model based on [ColBERT](https://github.com/stanford-futuredata/ColBERT).

JaColBERTv2 offers very strong out-of-domain generalisation. Despite being trained on a single dataset (MMarco), it reaches state-of-the-art performance on the benchmarks reported below.

JaColBERTv2 was initialised from JaColBERTv1 and trained using knowledge distillation with 31 negative examples per positive example. It was trained for 250k steps using a batch size of 32.

The information on this model card is minimal and intends to give a quick overview! It'll be updated once benchmarking is complete and a longer report is available.

# Why use a ColBERT-like approach for your RAG application?

Most retrieval methods have strong tradeoffs: 
 * __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling.
 * __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
 * __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are __not__ data-efficient (they often require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and generalise poorly in a lot of cases. This makes sense: representing every single aspect of a document in a single vector, so that it can be matched to any potential query, is an extremely hard problem.

ColBERT and its variants, including JaColBERTv2, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
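
To make the *bags-of-embeddings* idea concrete, here is a minimal sketch of ColBERT-style late-interaction ("MaxSim") scoring: each query token embedding is matched against its most similar document token embedding, and those maxima are summed. The function names and the toy 2-dimensional vectors are made up for illustration; real models use learned embeddings of much higher dimension, and this is not RAGatouille's actual implementation.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine similarity with any document token."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    return sum(
        max(sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in d)
        for q_vec in q
    )


# Toy token embeddings (made-up values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # a single "averaged" token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

In this toy case `doc_a` scores higher than `doc_b`, because every query token finds its own well-matching document token instead of having to share one vector.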


# Training

### Training Data

The model is trained on the Japanese split of MMARCO. It uses ColBERTv2-style training, meaning the model uses knowledge distillation from a cross-encoder model. We use the same cross-encoder scores as the original English ColBERTv2 training (as MMarco is a translated dataset, these are more or less well mapped). These scores are available [here](https://huggingface.co/colbert-ir/colbertv2.0_msmarco_64way).

Unlike English ColBERTv2, we use nway=32 rather than nway=64, meaning that we provide the model with 31 negative examples per positive example. Furthermore, we downsample the original set of triplets from over 19 million to 8 million examples.

### Training Method

JaColBERTv2 is trained for a single epoch (one pass over every triplet, meaning 250,000 training steps) on 8 NVIDIA A100 40GB GPUs. Total training time was around 30 hours.

JaColBERTv2 is initialised from [JaColBERT](https://huggingface.co/bclavie/JaColBERT), which itself builds upon Tohoku University's excellent [bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3). Our experiments benefitted strongly from Nagoya University's work on building [strong Japanese SimCSE models](https://arxiv.org/abs/2310.19349), among other work.

JaColBERTv2 is trained with an overall batch size of 32, a learning rate of 1e-5, and a warmup of 20,000 steps. Limited exploration was performed, but those defaults outperformed other experiments.
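
The figures quoted in this section are consistent with each other, as a quick back-of-the-envelope check shows. The constant names are ours, and the passages-per-step figure assumes that each example in a batch carries its full 32-way passage list (one positive plus 31 negatives).

```python
BATCH_SIZE = 32           # overall batch size, from this section
TRAINING_STEPS = 250_000  # single epoch
NWAY = 32                 # 1 positive + 31 negatives per example
WARMUP_STEPS = 20_000

# One epoch over the downsampled training set.
examples_seen = BATCH_SIZE * TRAINING_STEPS

# Assumption: every example contributes all of its NWAY passages to the step.
passages_per_step = BATCH_SIZE * NWAY

# Share of training spent in learning-rate warmup.
warmup_fraction = WARMUP_STEPS / TRAINING_STEPS

print(examples_seen, passages_per_step, warmup_fraction)
```

The 8 million examples seen match the size of the downsampled set, which is why a single epoch corresponds to 250,000 steps.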

As mentioned above, JaColBERTv2 is trained with knowledge distillation, using cross-encoder scores generated by a MiniLM cross-encoder on the English version of MS MARCO. Please refer to the original [ColBERTv2 paper](https://arxiv.org/abs/2112.01488) for more information on this approach.
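
As a rough illustration of what this distillation objective looks like, the ColBERTv2 paper describes distilling the cross-encoder's scores with a KL-divergence loss. The sketch below computes such a loss for a single n-way example in plain Python. The function names and the toy scores are ours, a 4-way example stands in for the 32-way setup, and this is not the actual training code.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the passages of one n-way example."""
    log_p_teacher = log_softmax(teacher_scores)
    log_p_student = log_softmax(student_scores)
    return sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )


# Toy 4-way example: first passage is the positive, the rest are negatives.
teacher = [8.0, 2.0, 1.0, -3.0]       # made-up cross-encoder scores
student_matching = list(teacher)       # student agrees with the teacher
student_uniform = [0.0, 0.0, 0.0, 0.0]  # student cannot tell passages apart

loss_matching = kl_distillation_loss(student_matching, teacher)
loss_uniform = kl_distillation_loss(student_uniform, teacher)
print(loss_matching, loss_uniform)
```

The loss is zero when the student's score distribution matches the teacher's, and grows as the student's ranking of the passages drifts away from it.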

# Results

We present the first results on two datasets: JQaRa, a passage retrieval task composed of questions and Wikipedia passages containing the answer, and JSQuAD, the Japanese translation of SQuAD. (Further evaluations on MIRACL and TyDi are running, but they are fairly slow due to how long it takes to run e5-large and bge-m3.)

JaColBERTv2 reaches state-of-the-art results on both datasets, outperforming models with 5x more parameters.
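
For readers less familiar with the metrics reported in the table, here is a minimal sketch of how MRR@k, Recall@k and NDCG@k can be computed for a single query. It assumes binary relevance labels and uses made-up document ids; the evaluation scripts used for the numbers in this card may differ in details.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Hypothetical ranking for one query; "d1" and "d2" are the relevant passages.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

mrr = mrr_at_k(ranked, relevant, k=10)
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=10)
print(mrr, recall, round(ndcg, 3))
```

The reported scores are averages of such per-query values over the whole evaluation set.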


|                     |     |           | JQaRa     |           |           |     | JSQuAD    |           |           |
| ------------------- | --- | --------- | --------- | --------- | --------- | --- | --------- | --------- | --------- |
|                     |     | NDCG@10   | MRR@10    | NDCG@100  | MRR@100   |     | R@1       | R@5       | R@10      |
| JaColBERTv2         |     | **0.585** | **0.836** | **0.753** | **0.838** |     | **0.921** | **0.977** | **0.982** |
| JaColBERT           |     | 0.549     | 0.811     | 0.730     | 0.814     |     | 0.913     | 0.972     | 0.978     |
| bge-m3+all          |     | 0.576     | 0.818     | 0.745     | 0.820     |     | N/A       | N/A       | N/A       |
| bge-m3+dense        |     | 0.539     | 0.785     | 0.721     | 0.788     |     | 0.850     | 0.959     | 0.976     |
| m-e5-large          |     | 0.554     | 0.799     | 0.731     | 0.801     |     | 0.865     | 0.966     | 0.977     |
| m-e5-base           |     | 0.471     | 0.727     | 0.673     | 0.731     |     | *0.838*   | *0.955*   | 0.973     |
| m-e5-small          |     | 0.492     | 0.729     | 0.689     | 0.733     |     | *0.840*   | *0.954*   | 0.973     |
| GLuCoSE             |     | 0.308     | 0.518     | 0.564     | 0.527     |     | 0.645     | 0.846     | 0.897     |
| sup-simcse-ja-base  |     | 0.324     | 0.541     | 0.572     | 0.550     |     | 0.632     | 0.849     | 0.897     |
| sup-simcse-ja-large |     | 0.356     | 0.575     | 0.596     | 0.583     |     | 0.603     | 0.833     | 0.889     |
| fio-base-v0.1       |     | 0.372     | 0.616     | 0.608     | 0.622     |     | 0.700     | 0.879     | 0.924     |
|                     |     |           |           |           |           |     |           |           |           |


# Usage

## Installation

JaColBERT works using ColBERT+RAGatouille. You can install it and all its necessary dependencies by running:
```sh
pip install -U ragatouille
```

For further examples on how to use RAGatouille with ColBERT models, you can check out the [`examples` section in the GitHub repository](https://github.com/bclavie/RAGatouille/tree/main/examples).

Specifically, example 01 shows how to build/query an index, 04 shows how you can use JaColBERTv2 as a re-ranker, and 06 shows how to use JaColBERTv2 for in-memory searching rather than using an index.

Notably, RAGatouille has metadata support, so check the examples out if it's something you need!

## Encoding and querying documents without an index

Using JaColBERTv2 without building an index is very simple: you just need to load the model, `encode()` some documents, and then call `search_encoded_docs()`:

```python
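from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERTv2")

# Encode documents in memory (no index is written to disk).
documents = ["document_1", "document_2"]  # replace with your own texts
RAG.encode(documents)

# Search the in-memory collection.
results = RAG.search_encoded_docs(query="your search query")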
```

Subsequent calls to `encode()` will add to the existing in-memory collection. If you want to empty it, simply run `RAG.clear_encoded_docs()`.


## Indexing

In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
Think of it like using an embedding model, like e5, to embed all your documents and storing them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed indexing up, but the default settings work fairly well:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERTv2")
documents = [ "マクドナルドのフライドポテトの少量のカロリーはいくつですか？マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",]
RAG.index(index_name="My_first_index", collection=documents)
```

The index files are stored, by default, at `.ragatouille/colbert/indexes/{index_name}`.

And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.
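
To give a feel for what 2-bit compression means for storage, here is a rough per-token estimate. It assumes 128-dimensional token embeddings and a 4-byte centroid id per vector, as described for the original ColBERTv2 setup; this card does not state these values for JaColBERTv2, so treat the numbers as illustrative.

```python
DIM = 128                # assumed embedding dimension per token
BITS_PER_DIM = 2         # default compression mentioned above
CENTROID_ID_BYTES = 4    # assumed: one centroid id stored per vector

# Compressed: quantised residual plus the centroid id.
compressed_bytes = DIM * BITS_PER_DIM // 8 + CENTROID_ID_BYTES

# Uncompressed baseline: one 16-bit float per dimension.
uncompressed_bytes = DIM * 2

print(compressed_bytes, uncompressed_bytes, uncompressed_bytes / compressed_bytes)
```

Under these assumptions each token vector takes 36 bytes instead of 256, roughly a 7x reduction.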


## Searching

Once you have created an index, searching through it is just as simple! If you're in the same session and `RAG` is still loaded, you can directly search the newly created index.
Otherwise, you'll want to load it from disk:

```python
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/My_first_index")
```

And then query it:

```python
results = RAG.search(query="QUERY")
# `results` is a list of dicts, one per retrieved document, ordered by rank:
# [{'content': 'TEXT OF DOCUMENT ONE',
#   'score': float,
#   'rank': 1,
#   'document_id': str,
#   'document_metadata': dict},
#  {'content': 'TEXT OF DOCUMENT TWO',
#   'score': float,
#   'rank': 2,
#   'document_id': str,
#   'document_metadata': dict},
#  [...]
# ]
```
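
In a RAG application, the retrieved passages are typically concatenated into a context string for the generator. The helper below is a minimal sketch of that step, working only on the result format shown above; `build_context` and the sample results are hypothetical and not part of RAGatouille.

```python
def build_context(results, max_docs=3):
    """Join the top-ranked passages into one context string for a prompt."""
    top = sorted(results, key=lambda r: r["rank"])[:max_docs]
    return "\n\n".join(f"[{r['rank']}] {r['content']}" for r in top)


# Made-up results in the same shape as the search output.
sample_results = [
    {"content": "TEXT OF DOCUMENT TWO", "score": 17.2, "rank": 2,
     "document_id": "doc-2", "document_metadata": {}},
    {"content": "TEXT OF DOCUMENT ONE", "score": 21.5, "rank": 1,
     "document_id": "doc-1", "document_metadata": {}},
]

context = build_context(sample_results, max_docs=2)
print(context)
```

Sorting on `rank` keeps the helper robust even if the list has been filtered or reordered between retrieval and prompt construction.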


# Citation

If you'd like to cite this work, please cite the JaColBERT technical report:

```
@misc{clavié2023jacolbert,
      title={JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report}, 
      author={Benjamin Clavié},
      year={2023},
      eprint={2312.16144},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```