---
pipeline_tag: feature-extraction
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
inference: false
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

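
To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes cosine similarity between token embeddings and uses toy 2-dimensional vectors; the names `normalize` and `maxsim_score` are illustrative and are not part of the ColBERT library.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_emb, passage_emb):
    """For each query token, keep its best match among the passage tokens,
    then sum those maxima over all query tokens."""
    q = [normalize(v) for v in query_emb]
    p = [normalize(v) for v in passage_emb]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


# Toy embeddings: one row per token (assumption: real models use many more dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_close = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # has a perfect match for each query token
passage_far = [[1.0, 1.0]]                             # only a partial match for each query token

print(maxsim_score(query, passage_close))
print(maxsim_score(query, passage_far))
```

The passage whose tokens align with every query token receives the higher score, which is the property the retrieval step relies on when ranking passages.
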
## Usage

Using ColBERT on a dataset typically involves the following steps:

**Step 1: Preprocess your collection.** At its simplest, ColBERT works with tab-separated (TSV) files: one file (e.g., `collection.tsv`) contains all passages and another (e.g., `queries.tsv`) contains the set of queries for searching the collection.

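
As an illustration of this step, the sketch below writes both files with one `id<TAB>text` record per line. It assumes that passage ids are integers equal to the zero-based line number and that query ids are integers; the helper `write_tsv`, the sample texts, and the temporary output directory are hypothetical.

```python
import tempfile
from pathlib import Path


def write_tsv(path, rows):
    """Write (id, text) pairs as one `id<TAB>text` record per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for ident, text in rows:
            # Collapse tabs and newlines so each record stays on a single line.
            clean = " ".join(text.split())
            f.write(f"{ident}\t{clean}\n")


passages = [
    "Paris est la capitale de la France.",
    "Le camembert est un fromage\tnormand.",  # the embedded tab is replaced by a space
]
queries = [(1, "Quelle est la capitale de la France ?")]

out_dir = Path(tempfile.mkdtemp())
write_tsv(out_dir / "collection.tsv", enumerate(passages))  # ids 0, 1, ...
write_tsv(out_dir / "queries.tsv", queries)

print((out_dir / "collection.tsv").read_text(encoding="utf-8"))
```

Cleaning whitespace before writing matters because a stray tab or newline inside a passage would shift the columns of the TSV record.
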

**Step 2: Index your collection.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == "__main__":
    # Run indexing as a single process (nranks=1) under the "msmarco" experiment.
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            nbits=2,  # number of bits used to encode each embedding dimension
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        # Encode every passage in the collection and store the index under this name.
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
```

**Step 3: Search the collection with your queries.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

```python
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == "__main__":
    # Use the same experiment name as at indexing time so the index can be found.
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        # Retrieve the top 100 passages for every query, then save the ranking.
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")
```
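
Since recall is the metric listed for this model, the following is a hedged sketch of how recall@k could be computed from a saved ranking. It assumes each line of the ranking file has the form `qid<TAB>pid<TAB>rank<TAB>score` and that relevance judgments are available as a mapping from query id to a set of relevant passage ids; `parse_ranking`, `recall_at_k`, and the toy data are illustrative and not part of ColBERT.

```python
from collections import defaultdict


def parse_ranking(lines):
    """Group ranking lines by query and order passage ids by rank (best first)."""
    per_query = defaultdict(list)
    for line in lines:
        qid, pid, rank, *_ = line.rstrip("\n").split("\t")
        per_query[int(qid)].append((int(rank), int(pid)))
    return {qid: [pid for _, pid in sorted(items)] for qid, items in per_query.items()}


def recall_at_k(ranking, qrels, k):
    """Average, over queries, of the fraction of relevant passages found in the top k."""
    scores = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue  # skip queries without any relevant passage
        retrieved = set(ranking.get(qid, [])[:k])
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Toy ranking lines and relevance judgments (hypothetical values).
ranking_lines = [
    "1\t10\t1\t25.3",
    "1\t11\t2\t24.1",
    "1\t12\t3\t20.0",
    "2\t20\t1\t22.0",
    "2\t21\t2\t19.5",
]
qrels = {1: {11}, 2: {21, 99}}

ranking = parse_ranking(ranking_lines)
print(recall_at_k(ranking, qrels, k=1))
print(recall_at_k(ranking, qrels, k=2))
```

With a real ranking file, the list of lines would come from reading the saved TSV, for example with `open(path, encoding="utf-8")`.
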

## Evaluation

*(tba)*


## Training

#### Details

We fine-tuned the [camembert-base](https://huggingface.co/camembert-base) model on a French dataset of 500K sentence triples, using a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with each query. We trained the model on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.

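
The pairwise softmax cross-entropy loss mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the actual training code: it treats the positive and negative scores of one triple as a two-way classification in which the positive passage is the correct class. The function names and example scores are hypothetical.

```python
import math


def pairwise_softmax_ce(pos_score, neg_score):
    """Cross-entropy of a softmax over (positive, negative) scores,
    with the positive passage as the target class."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score


def batch_loss(score_pairs):
    """Mean loss over a batch of (positive_score, negative_score) pairs."""
    return sum(pairwise_softmax_ce(p, n) for p, n in score_pairs) / len(score_pairs)


# Hypothetical relevance scores for three training triples.
batch = [(12.0, 3.0), (5.0, 5.0), (2.0, 6.0)]
for pos, neg in batch:
    print(pos, neg, round(pairwise_softmax_ce(pos, neg), 4))
print(round(batch_loss(batch), 4))
```

The loss is close to zero when the positive passage clearly outscores the negative one, equals log 2 when the two scores tie, and grows roughly linearly with the margin when the negative passage wins.
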

#### Data

We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multilingual, machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
- a corpus of 8.8M passages;
- a training set of ~533k queries (with at least one relevant passage);
- a development set of ~101k queries;
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)


## Citation

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = {dec},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```