|
--- |
|
inference: false |
|
datasets: |
|
- bclavie/mmarco-japanese-hard-negatives |
|
- unicamp-dl/mmarco |
|
language: |
|
- ja |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- ColBERT |
|
--- |
|
Under Construction, please come back in a few days! |
|
工事中です。数日後にまたお越しください。 |
|
|
|
# Intro |
|
|
|
## Training Data |
|
|
|
## Training Method |
|
|
|
# Results |
|
|
|
# Why use a ColBERT-like approach for your RAG application? |
|
|
|
Most retrieval methods have strong tradeoffs: |
|
* __Traditional sparse approaches__, such as BM25, are strong baselines, __but__ do not leverage any semantic understanding, and thus hit a hard ceiling. |
|
* __Cross-encoder__ retriever methods are powerful, __but__ prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores. |
|
* __Dense retrieval__ methods, using dense embeddings in vector databases, are lightweight and perform well, __but__ are data-inefficient (they require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and often generalise poorly. This makes sense: compressing every aspect of a document into a single vector, so that it can be matched against any potential query, is an extremely hard problem.
|
|
|
ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially *bags-of-embeddings*, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders. |
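
To make the *bags-of-embeddings* idea concrete, here is a minimal sketch of the late-interaction (MaxSim) scoring used by ColBERT-style models, with random tensors standing in for the per-token embeddings; the shapes and embedding size are illustrative, not JaColBERT's exact configuration.

```python
import torch

# Toy tensors standing in for per-token embeddings; the shapes (and the
# 128-dim embedding size) are illustrative, not JaColBERT's exact settings.
query_embeddings = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)  # one vector per query token
doc_embeddings = torch.nn.functional.normalize(torch.randn(300, 128), dim=-1)   # one vector per document token

# Similarity of every query token against every document token.
token_similarities = query_embeddings @ doc_embeddings.T  # shape (32, 300)

# MaxSim: each query token keeps only its best-matching document token,
# and the document's score is the sum of those maxima.
score = token_similarities.max(dim=1).values.sum()
print(score)
```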
|
|
|
The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi or MIRACL, nearly matches the e5 dense retrievers, which have been trained on these datasets.
|
|
|
On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, JaColBERT outperforms all e5 models.
|
|
|
Moreover, this approach requires **considerably less data than dense embeddings**: to reach its current performance, JaColBERT v1 was trained on only 10 million training triplets, compared to the billions of examples used for the multilingual e5 models.
|
|
|
# Usage |
|
|
|
## Installation |
|
|
|
Using this model is slightly different from using typical dense embedding models. The model relies on `faiss` for efficient indexing and `torch` for neural network operations. JaColBERT is built on top of bert-base-japanese-v3, so you also need to install the dictionary and tokenizer packages it requires.

To use JaColBERT, install the main ColBERT library along with those dependencies:
|
|
|
```sh
|
pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite |
|
``` |
|
|
|
ColBERT looks slightly less friendly than a typical `transformers` model, but a lot of that is just the config being made explicit so you can easily modify it. Running with all the defaults works very well, so don't be anxious about trying it.
|
|
|
## Indexing |
|
|
|
> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs. |
|
|
|
In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index. |
|
Think of it as using an embedding model, such as e5, to embed all your documents and store them in a vector database.
|
Indexing is the slowest step -- retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well: |
|
|
|
```python |
|
from colbert import Indexer |
|
from colbert.infra import Run, RunConfig |
|
|
|
n_gpu: int = 1 # Set your number of available GPUs |
|
experiment: str = "" # Name of the folder where the logs and created indices will be stored |
|
index_name: str = "" # The name of your index, i.e. the name of your vector database |
|
|
|
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="bclavie/JaColBERT")
    documents = [
        "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
|
``` |
|
|
|
And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.
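
If you do want to deviate from the defaults, the compression level is one of the settings you can change at indexing time. The sketch below follows the upstream `colbert-ir` API, where an optional `ColBERTConfig` can be passed to the `Indexer`; the `nbits` and `doc_maxlen` values shown here are illustrative, not recommendations.

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

with Run().context(RunConfig(nranks=1, experiment="my_experiment")):
    # nbits controls how aggressively token representations are compressed
    # (the default, used above, is 2); doc_maxlen caps the number of document
    # tokens stored per passage. Both values here are only examples.
    config = ColBERTConfig(nbits=4, doc_maxlen=300)
    indexer = Indexer(checkpoint="bclavie/JaColBERT", config=config)
    indexer.index(name="my_index_nbits4", collection=documents)  # `documents` as in the example above
```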
|
|
|
|
|
## Searching |
|
|
|
Once you have created an index, searching through it is just as simple. Again, the `Run()` syntactic sugar manages GPUs and storage:
|
|
|
```python |
|
from colbert import Searcher |
|
from colbert.infra import Run, RunConfig |
|
|
|
n_gpu: int = 0 |
|
experiment: str = "" # Name of the folder where the logs and created indices will be stored |
|
index_name: str = "" # Name of your previously created index where the documents you want to search are stored. |
|
k: int = 10 # how many results you want to retrieve |
|
|
|
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again: the model name is stored in the index.
    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
    results = searcher.search(query, k=k)
|
# results: a tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
|
``` |
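
To map the results back to the original passages, a small usage sketch is shown below. It assumes the behaviour of the current upstream `colbert-ir` release, where `search` returns three parallel lists and `searcher.collection` can be indexed by passage id; `results` and `searcher` are the variables from the example above.

```python
passage_ids, ranks, scores = results

for passage_id, rank, score in zip(passage_ids, ranks, scores):
    # searcher.collection maps a passage id back to the indexed text.
    print(f"{rank}. (score: {score:.2f}) {searcher.collection[passage_id]}")
```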