---
license: bigscience-bloom-rail-1.0
datasets:
- unicamp-dl/mmarco
- rajpurkar/squad
language:
- fr
- en
pipeline_tag: sentence-similarity
---
## Evaluation

To assess the reranker's performance, we use the "validation" split of the SQuAD dataset. From each paragraph, we select the first question together with the paragraph itself as the context that an oracle model should rank Top-1. Interestingly, the number of themes is limited, and every context from the same theme that does not match the query forms a hard negative (contexts from other themes are easy negatives). We can therefore build the following table, giving the number of contexts (and thus of associated queries) per theme:
Theme | Number of contexts |
---|---|
Normans | 39 |
Computational_complexity_theory | 48 |
Southern_California | 39 |
Sky_(United_Kingdom) | 22 |
Victoria_(Australia) | 25 |
Huguenot | 44 |
Steam_engine | 46 |
Oxygen | 43 |
1973_oil_crisis | 24 |
European_Union_law | 40 |
Amazon_rainforest | 21 |
Ctenophora | 31 |
Fresno,_California | 28 |
Packet_switching | 23 |
Black_Death | 23 |
Geology | 25 |
Pharmacy | 26 |
Civil_disobedience | 26 |
Construction | 22 |
Private_school | 26 |
Harvard_University | 30 |
Jacksonville,_Florida | 21 |
Economic_inequality | 44 |
University_of_Chicago | 37 |
Yuan_dynasty | 47 |
Immune_system | 49 |
Intergovernmental_Panel_on_Climate_Change | 24 |
Prime_number | 31 |
Rhine | 44 |
Scottish_Parliament | 39 |
Islamism | 39 |
Imperialism | 39 |
Warsaw | 49 |
French_and_Indian_War | 46 |
Force | 44 |
The evaluation corpus consists of 1204 pairs of query/context to be ranked.
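The pairing described above (one query per paragraph, same-theme contexts as hard negatives) can be sketched as follows; the function name and record layout are illustrative, not taken from the actual evaluation code:

```python
from collections import defaultdict

def build_eval_pairs(dataset):
    """Group SQuAD-style records by theme (title) and keep, for each
    distinct paragraph, its first question paired with the gold context.
    Every other context sharing the theme is then a hard negative;
    contexts from other themes are easy negatives."""
    by_theme = defaultdict(dict)  # theme -> {context: first question seen}
    for record in dataset:  # record: dict(title=..., context=..., question=...)
        theme, context = record["title"], record["context"]
        if context not in by_theme[theme]:
            by_theme[theme][context] = record["question"]
    # One (query, gold context, theme) triple per distinct paragraph.
    return [
        (question, context, theme)
        for theme, contexts in by_theme.items()
        for context, question in contexts.items()
    ]
```

Applied to the real "validation" split, this yields the 1204 query/context pairs counted in the table above.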
First, the evaluation scores are computed in the monolingual case, where both the query and the context are in French (French/French).
Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
---|---|---|---|---|---|---|---|---|
BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
CamemBERT | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
DistilCamemBERT | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
mMiniLMv2-L12 | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
RoBERTa (multilingual) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
cmarkea/bloomz-560m-reranking | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
cmarkea/bloomz-3b-reranking | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
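The rank-based columns of the table (Top-mean, Top-std, Top-K accuracy, MRR) can all be derived from the 1-based rank that each gold context obtains among the candidates. A minimal sketch of these standard definitions (the function name is illustrative, not from the evaluation code):

```python
def ranking_metrics(ranks, ks=(1, 10, 100)):
    """Compute ranking metrics from the 1-based rank of each gold context.

    ranks: list of ints, rank of the gold context for each query.
    Returns mean rank, rank std, Top-K accuracy (%) and MRR (x100)."""
    n = len(ranks)
    top_mean = sum(ranks) / n
    top_std = (sum((r - top_mean) ** 2 for r in ranks) / n) ** 0.5
    # Top-K accuracy: share of queries whose gold context ranks in the first K.
    top_k = {k: 100 * sum(r <= k for r in ranks) / n for k in ks}
    # Mean Reciprocal Rank, scaled by 100 as in the table.
    mrr = 100 * sum(1 / r for r in ranks) / n
    return dict(top_mean=top_mean, top_std=top_std, top_k=top_k, mrr=mrr)
```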
Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.
Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
---|---|---|---|---|---|---|---|---|
BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
CamemBERT | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
DistilCamemBERT | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
mMiniLMv2-L12 | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
RoBERTa (multilingual) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
cmarkea/bloomz-560m-reranking | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
cmarkea/bloomz-3b-reranking | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |
As observed, the cross-lingual setting does not significantly degrade the behavior of our models. When the model is used to rerank and filter the Top-K results of a retriever, a threshold of 0.8 can be applied to discard low-scoring contexts, reducing the noise passed on to RAG-type applications.
## How to Use Bloomz-3b-reranking

The following example uses the Pipeline API of the Transformers library.
```python
from transformers import pipeline

reranker = pipeline('text-classification', 'cmarkea/bloomz-3b-reranking', top_k=None)

query: str
context_list: list  # contexts returned by the retriever

similarity = reranker(
    [dict(text=ii, text_pair=query) for ii in context_list]
)
# Keep the probability of the "relevant" class (LABEL_1) for each context.
score_label_1 = [
    next(jj['score'] for jj in ii if jj['label'] == 'LABEL_1')
    for ii in similarity
]
# Rank contexts from most to least relevant.
context_reranked = sorted(
    zip(score_label_1, context_list),
    key=lambda x: x[0],
    reverse=True
)
# Optionally filter out contexts below the 0.8 threshold.
score, context_cleaned = zip(
    *filter(lambda x: x[0] >= 0.8, context_reranked)
)
```
## Citation

```bibtex
@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```