metadata

pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
  - unicamp-dl/mmarco
metrics:
  - recall
tags:
  - feature-extraction
  - sentence-similarity
library_name: colbert
inference: false

colbertv1-camembert-base-mmarcoFR

This is a ColBERTv1 model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match a query using scalable vector-similarity (MaxSim) operators. It can be used for tasks such as clustering or semantic search. The model was trained on the French portion of the mMARCO dataset.
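To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes the token embeddings are already L2-normalized, so that a dot product acts as cosine similarity. The function names and toy vectors are hypothetical and are not part of the `colbert` library.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: List[Vector], passage_embs: List[Vector]) -> float:
    """For each query token, keep its best match among the passage tokens,
    then sum these maxima over all query tokens."""
    return sum(max(dot(q, p) for p in passage_embs) for q in query_embs)


# Toy 2-dimensional embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
passage_b = [[0.5, 0.5]]

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

Here `passage_a` scores higher because each query token finds an exact match among its tokens, whereas `passage_b` only offers a partial match for both.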

Installation

To use this model, you will need to install the following libraries:

pip install "colbert-ir[faiss-gpu]" faiss torch

Usage

Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!

from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
        "Ceci est un premier document.",
        "Voici un second document.",
        # ... (remaining passages of your collection)
    ]
    indexer.index(name=index_name, collection=documents)

Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # Name of your previously created index where the documents you want to search are stored
k: int = 10 # Number of results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify the checkpoint again, the model name is stored in the index.
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
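The search output can be regrouped into one record per passage. The sketch below assumes `results` holds three parallel sequences (passage ids, ranks, scores), as in the upstream ColBERT examples. If your version returns per-passage triples instead, the `zip` step is unnecessary. The helper name and the mock values are hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple


def results_to_records(
    results: Tuple[Sequence[int], Sequence[int], Sequence[float]]
) -> List[Dict[str, Any]]:
    """Turn (ids, ranks, scores) into a list of per-passage dictionaries."""
    passage_ids, ranks, scores = results
    return [
        {"passage_id": pid, "rank": rank, "score": score}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Mock output standing in for searcher.search(query, k=3); the values are invented.
mock_results = ([12, 3, 40], [1, 2, 3], [27.5, 25.1, 19.8])
records = results_to_records(mock_results)
for record in records:
    print(record)
```

With a real index, each `passage_id` is the position of the passage in the `documents` list passed at indexing time, so `documents[record["passage_id"]]` recovers the text.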

Evaluation

(tba)

Training

Details

We fine-tuned the camembert-base model on a dataset of 500K French sentence triples using a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. We trained the model on a single Tesla V100 GPU with 32 GB of memory for 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
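As an illustration of the loss described above, the following sketch computes a pairwise softmax cross-entropy for a single (query, positive, negative) triple from two already computed relevance scores. It is a simplified stand-in with a hypothetical function name, not the actual training code, which operates on batches of model outputs.

```python
import math


def pairwise_softmax_ce(pos_score: float, neg_score: float) -> float:
    """Cross-entropy of a 2-way softmax over (positive, negative) scores,
    where the positive passage is the target class:
    loss = -log(exp(s_pos) / (exp(s_pos) + exp(s_neg)))."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score


# The loss shrinks as the positive passage outscores the negative one.
print(pairwise_softmax_ce(0.0, 0.0))   # equal scores: log(2)
print(pairwise_softmax_ce(10.0, 0.0))  # positive clearly ahead: close to 0
print(pairwise_softmax_ce(0.0, 10.0))  # negative ahead: large loss
```

Minimizing this quantity pushes the score of the positive passage above that of the negative passage for the same query.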

Data

We used the French version of the mMARCO dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:

a corpus of 8.8M passages;
a training set of ~533k queries (with at least one relevant passage);
a development set of ~101k queries;
a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

Link: https://ir-datasets.com/mmarco.html#mmarco/v2/fr/

Citation

@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = {dec},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}