Commit e31fa91 · Update README.md
Parent(s): faa6aeb
README.md
CHANGED
@@ -17,7 +17,6 @@ library_name: colbert
 This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
 
 ## Usage
-***
 
 Using ColBERT on a dataset typically involves the following steps:
 
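The description above leans on ColBERT's MaxSim operator. As a quick illustration of what that late-interaction scoring does, here is a minimal NumPy sketch; it is not code from this repository, and the function and array names are hypothetical:

```python
# Illustrative sketch of MaxSim late-interaction scoring (not repository code;
# names are hypothetical, random vectors stand in for real token embeddings).
import numpy as np

def maxsim_score(query_embs: np.ndarray, passage_embs: np.ndarray) -> float:
    """query_embs: (num_query_tokens, dim); passage_embs: (num_passage_tokens, dim).
    Both are assumed to be L2-normalized token-level embeddings."""
    # Cosine similarity between every query token and every passage token.
    sim = query_embs @ passage_embs.T  # (num_query_tokens, num_passage_tokens)
    # Each query token keeps only its best-matching passage token; sum over query tokens.
    return float(sim.max(axis=1).sum())

# Toy usage with random vectors.
rng = np.random.default_rng(0)
q = rng.normal(size=(32, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.normal(size=(256, 128)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim_score(q, p))
```

Because each query token contributes only its best-matching passage token, this scoring can be run cheaply over precomputed passage embeddings, which is what makes the retrieval scalable.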
@@ -59,22 +58,16 @@ if __name__=='__main__':
 
 
 ## Evaluation
-***
 
 We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages.
 
 [...]
 
 ## Training
-***
 
-####
+#### Details
 
-We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence triples dataset in French via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query.
-
-#### Hyperparameters
-
-We trained the model on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
+We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence triples dataset in French via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query. We trained the model on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
 
 #### Data
 
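For reference, the pairwise softmax cross-entropy objective described in the merged "Details" paragraph can be sketched in a few lines of PyTorch. This is an illustrative reconstruction under stated assumptions (random scores stand in for the model's MaxSim outputs, and the tensor names are made up), not the training code used for this model:

```python
# Minimal sketch of pairwise softmax cross-entropy over the scores of the
# positive and negative passage associated with each query. Illustrative only;
# random scores replace the real MaxSim outputs of the fine-tuned encoder.
import torch
import torch.nn.functional as F

BATCH_SIZE = 64  # batch size reported in the card (AdamW, constant lr 3e-06, 200k steps)

def pairwise_softmax_ce(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """pos_scores, neg_scores: (batch,) relevance scores for each query's passages."""
    # Each (positive, negative) pair is treated as a 2-way classification problem
    # in which the positive passage is the correct class (index 0).
    logits = torch.stack([pos_scores, neg_scores], dim=1)   # (batch, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive is always index 0
    return F.cross_entropy(logits, labels)

# Toy usage: random scores standing in for the encoder's MaxSim outputs.
pos = torch.randn(BATCH_SIZE, requires_grad=True)
neg = torch.randn(BATCH_SIZE, requires_grad=True)
loss = pairwise_softmax_ce(pos, neg)
loss.backward()
print(float(loss))
```

In the actual setup this loss would be backpropagated through the camembert-base encoder with AdamW at the constant 3e-06 learning rate, with passages truncated to 256 tokens and queries to 32 tokens as stated above.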