license: mit
datasets:
- unicamp-dl/mmarco
language:
- pt
tags:
- colbert
- ColBERT
Disclaimer: This model is based on a model trained for Brazilian Portuguese. In addition, mMARCO was translated from MS MARCO using Google Translate, which also tends to be biased towards Brazilian Portuguese, so the model might not perform well on European Portuguese.
Training
Details
The model is initialized from the ricardoz/BERTugues-base-portuguese-cased model and fine-tuned on 10M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single NVIDIA A100 GPU with 40 GB of memory for 200k steps, with 10% of warmup steps, using a batch size of 96 and the AdamW optimizer with a constant learning rate of 3e-06. Total training time was around 12 hours.
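To make the training objective concrete, here is a minimal sketch of a pairwise softmax cross-entropy loss for a single (query, positive, negative) triple. It is an illustration rather than the training code used for this model: the function names `pairwise_softmax_ce` and `batch_loss` are hypothetical, and the scores are assumed to be plain floats already produced by the model's scoring function.

```python
import math


def pairwise_softmax_ce(pos_score: float, neg_score: float) -> float:
    """Cross-entropy over softmax([pos, neg]) with the positive as target.

    -log(exp(p) / (exp(p) + exp(n))) simplifies to log(1 + exp(n - p)).
    """
    diff = neg_score - pos_score
    # Numerically stable softplus: avoids overflow when diff is large.
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def batch_loss(pos_scores, neg_scores) -> float:
    """Mean pairwise loss over a batch of triples (hypothetical helper)."""
    losses = [pairwise_softmax_ce(p, n) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)
```

The loss shrinks towards zero as the positive passage outscores the negative one, and grows roughly linearly in the score gap when the ranking is wrong.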
Data
The model is fine-tuned on the Portuguese version of the mMARCO dataset, a multilingual machine-translated version of the MS MARCO dataset. The triples are sampled from the ~39.8M triples of triples.train.small.tsv.
Evaluation
The model is evaluated on the smaller development set of mMARCO-pt, which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
| Model | Vocab. | #Param. | Size | MRR@10 | R@50 | R@1000 |
|---|---|---|---|---|---|---|
| ColBERTv1.0-BERTugues-base-portuguese-mmarcoPT | Portuguese | 110M | 440MB | 26.90 | 65.26 | 70.21 |
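For readers unfamiliar with the reported metrics, the following sketch shows one common way to compute MRR@k and R@k from ranked results. It is not the evaluation script behind the numbers above. The names `run` (query id to ranked passage ids), `qrels` (query id to set of relevant passage ids), and `evaluate` are assumptions made for illustration, and the sketch returns fractions in [0, 1], whereas the table values appear to be percentages.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, k_mrr=10, k_recall=(50, 1000)):
    """Average both metrics over every query that has relevance labels."""
    n = len(qrels)
    results = {f"MRR@{k_mrr}": 0.0}
    for k in k_recall:
        results[f"R@{k}"] = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])  # a missing query counts as a miss
        results[f"MRR@{k_mrr}"] += mrr_at_k(ranked, relevant, k_mrr) / n
        for k in k_recall:
            results[f"R@{k}"] += recall_at_k(ranked, relevant, k) / n
    return results
```

Averaging over the queries in `qrels` rather than those in `run` means that a query with no retrieved passages lowers the score instead of being silently skipped.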