antoinelouis committed • Commit 97d37d7 • Parent(s): defef1e
Update README.md

README.md
---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
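
For intuition, the MaxSim operator mentioned above can be sketched in a few lines of PyTorch. This is only an illustrative sketch: `Q` and `D` are hypothetical, L2-normalized token-embedding matrices for one query and one passage, and the shapes are made up.
```
import torch
import torch.nn.functional as F

# Hypothetical token-level embedding matrices (num_tokens x embedding_dim).
Q = F.normalize(torch.randn(32, 128), dim=-1)   # one query with 32 tokens
D = F.normalize(torch.randn(180, 128), dim=-1)  # one passage with 180 tokens

# MaxSim: for each query token, keep its best-matching passage token, then sum.
score = (Q @ D.T).max(dim=1).values.sum()
```
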
## Installation

To use this model, you will need to install the following libraries:
```
pip install colbert-ir[faiss-gpu] faiss torch
```

## Usage

**Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
```
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
        "Ceci est un premier document.",
        "Voici un second document.",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```

**Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again; the model name is stored in the index
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
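
If you need the passage texts rather than just their ids, a small hypothetical post-processing step could look like the sketch below. It relies on the result layout described in the comment above and assumes each `passage_id` is the position of the passage in the `documents` list used at indexing time.
```
# Hypothetical sketch: map the returned ids back to the indexed texts.
for passage_id, passage_rank, passage_score in results:
    print(f"#{passage_rank} (score={passage_score:.2f}) {documents[passage_id]}")
```
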
## Evaluation