antoinelouis committed
Commit 513c436
Parent: 200a637

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -21,7 +21,7 @@ This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it
 
 To use this model, you will need to install the following libraries:
 ```
-pip install colbert-ir @ git+https://github.com/stanford-futuredata/ColBERT.git faiss-gpu==1.7.2
+pip install colbert-ai @ git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
 ```
 
 
@@ -67,7 +67,7 @@ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
 
 ## Evaluation
 
-We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared the model performance with a biencoder model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
+The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
 
 | model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
 |:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
@@ -78,16 +78,16 @@ We evaluated our model on the smaller development set of mMARCO-fr, which consis
 
 #### Details
 
-We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence triples dataset in French via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query. We trained the model on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
+The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
 
 #### Data
 
-We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
+The model is fine-tuned on the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of the MS MARCO dataset, which comprises:
 - a corpus of 8.8M passages;
-- a training set of ~533k queries (with at least one relevant passage);
+- a training set of ~533k unique queries (with at least one relevant passage);
 - a development set of ~101k queries;
 - a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
-Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
+The triples are sampled from the ~39.8M triples in [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset). In the future, better negatives could be selected by exploiting the [msmarco-hard-negatives] dataset, which contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.
 
 ## Citation
 
 
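For context on the updated install line in the first hunk: it pulls in the `colbert-ai` package that the README's own `Run().context(RunConfig(...))` snippet relies on. Below is a minimal usage sketch, not code from this README; the checkpoint id, experiment name, index name, and `collection.tsv` path are placeholders to adjust to your setup.

```python
# Minimal sketch of indexing and searching with the colbert-ai package.
# Install roughly as in the updated README line (the direct reference usually
# needs quoting in a shell):
#   pip install "colbert-ai @ git+https://github.com/stanford-futuredata/ColBERT.git" torch faiss-gpu==1.7.2
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # nranks = number of GPUs; experiment groups artifacts under experiments/<name>/.
    with Run().context(RunConfig(nranks=1, experiment="mmarco-fr")):
        # Length limits taken from the Details section of the README.
        config = ColBERTConfig(doc_maxlen=256, query_maxlen=32)

        # Placeholder checkpoint id and collection path, not stated in this diff.
        indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR", config=config)
        indexer.index(name="mmarco-fr.index", collection="collection.tsv")

        searcher = Searcher(index="mmarco-fr.index", config=config)
        passage_ids, ranks, scores = searcher.search("Quelle est la capitale de la France ?", k=10)
        for pid, rank, score in zip(passage_ids, ranks, scores):
            print(rank, pid, round(score, 2))
```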
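The Evaluation section reports MRR@10 and recall at several cut-offs over the 6,980 dev queries. As a reminder of what those numbers mean, here is a small self-contained sketch; the `run` and `qrels` structures are illustrative, not the repository's evaluation code.

```python
def mrr_at_k(ranking, relevant, k=10):
    """Reciprocal rank of the first relevant passage within the top-k, else 0."""
    for rank, pid in enumerate(ranking[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant passages retrieved in the top-k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)

def evaluate(run, qrels):
    # run: {query_id: [passage_id, ...] sorted by score}; qrels: {query_id: {relevant ids}}.
    # Assumes run contains a ranking for every query in qrels.
    n = len(qrels)
    return {
        "MRR@10": sum(mrr_at_k(run[q], qrels[q], 10) for q in qrels) / n,
        "R@10":   sum(recall_at_k(run[q], qrels[q], 10) for q in qrels) / n,
        "R@100":  sum(recall_at_k(run[q], qrels[q], 100) for q in qrels) / n,
        "R@500":  sum(recall_at_k(run[q], qrels[q], 500) for q in qrels) / n,
    }
```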
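The "pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages" in the Details section amounts to a two-way classification per query where the positive must outscore the negative. A PyTorch sketch of that formulation, with illustrative variable names:

```python
import torch
import torch.nn.functional as F

def pairwise_softmax_ce(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the (positive, negative) score pair of each query.

    pos_scores, neg_scores: shape [batch], relevance scores produced by the model
    for the positive and negative passage of each query in the batch.
    """
    logits = torch.stack([pos_scores, neg_scores], dim=1)  # [batch, 2]
    # The positive passage is always at index 0.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```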
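The Data section only states that 12.8M triples are sampled from the ~39.8M (query, positive, negative) text triples of triples.train.small.tsv, not how the sampling is done. The sketch below assumes uniform sampling without replacement, purely for illustration; the file path and total line count are inputs you supply.

```python
import random

def sample_triples(path, n_total, n_keep=12_800_000, seed=42):
    """Yield a uniform sample of (query, positive, negative) text triples.

    path: TSV file with one triple per line (e.g. triples.train.small.tsv, ~39.8M lines).
    n_total: exact number of lines in the file, counted beforehand.
    The uniform-without-replacement strategy is an assumption, not the model's recipe.
    """
    random.seed(seed)
    keep = set(random.sample(range(n_total), n_keep))  # sampled line indices
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i in keep:
                query, positive, negative = line.rstrip("\n").split("\t")
                yield query, positive, negative
```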