---
license: bigscience-bloom-rail-1.0
datasets:
- unicamp-dl/mmarco
- rajpurkar/squad
language:
- fr
- en
pipeline_tag: sentence-similarity
---
## Evaluation

To assess the reranker's performance, we use the "validation" split of the SQuAD dataset. From each paragraph, we select the first question together with the paragraph itself as the context that an oracle model should rank Top-1. Interestingly, the number of themes is limited, and every context from the same theme that does not match the query forms a hard negative (contexts from other themes are easy negatives). We can therefore build the following table, giving the number of contexts (and thus of associated queries) per theme:
Theme | Number of contexts |
---|---|
Normans | 39 |
Computational_complexity_theory | 48 |
Southern_California | 39 |
Sky_(United_Kingdom) | 22 |
Victoria_(Australia) | 25 |
Huguenot | 44 |
Steam_engine | 46 |
Oxygen | 43 |
1973_oil_crisis | 24 |
European_Union_law | 40 |
Amazon_rainforest | 21 |
Ctenophora | 31 |
Fresno,_California | 28 |
Packet_switching | 23 |
Black_Death | 23 |
Geology | 25 |
Pharmacy | 26 |
Civil_disobedience | 26 |
Construction | 22 |
Private_school | 26 |
Harvard_University | 30 |
Jacksonville,_Florida | 21 |
Economic_inequality | 44 |
University_of_Chicago | 37 |
Yuan_dynasty | 47 |
Immune_system | 49 |
Intergovernmental_Panel_on_Climate_Change | 24 |
Prime_number | 31 |
Rhine | 44 |
Scottish_Parliament | 39 |
Islamism | 39 |
Imperialism | 39 |
Warsaw | 49 |
French_and_Indian_War | 46 |
Force | 44 |
The evaluation corpus consists of 1204 pairs of query/context to be ranked.
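The pairing described above (one query per paragraph, same-theme contexts as hard negatives) can be sketched as follows; the function name and record layout are illustrative, not taken from the actual evaluation code:

```python
from collections import defaultdict

def build_eval_pairs(dataset):
    """Group SQuAD-style records by theme (title) and keep, for each
    distinct paragraph, its first question paired with the gold context.
    Every other context sharing the theme is then a hard negative;
    contexts from other themes are easy negatives."""
    by_theme = defaultdict(dict)  # theme -> {context: first question seen}
    for record in dataset:  # record: dict(title=..., context=..., question=...)
        theme, context = record["title"], record["context"]
        if context not in by_theme[theme]:
            by_theme[theme][context] = record["question"]
    # One (query, gold context, theme) triple per distinct paragraph.
    return [
        (question, context, theme)
        for theme, contexts in by_theme.items()
        for context, question in contexts.items()
    ]
```

Applied to the real "validation" split, this yields the 1204 query/context pairs counted in the table above.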
First, the evaluation scores are computed in the monolingual case, where both the query and the context are in French (French/French).
Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
---|---|---|---|---|---|---|---|---|
BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
CamemBERT | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
DistilCamemBERT | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
mMiniLMv2-L12 | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
RoBERTa (multilingual) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
cmarkea/bloomz-560m-reranking | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
cmarkea/bloomz-3b-reranking | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
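The rank-based columns of the table (Top-mean, Top-std, Top-K accuracy, MRR) can all be derived from the 1-based rank that each gold context obtains among the candidates. A minimal sketch of these standard definitions (the function name is illustrative, not from the evaluation code):

```python
def ranking_metrics(ranks, ks=(1, 10, 100)):
    """Compute ranking metrics from the 1-based rank of each gold context.

    ranks: list of ints, rank of the gold context for each query.
    Returns mean rank, rank std, Top-K accuracy (%) and MRR (x100)."""
    n = len(ranks)
    top_mean = sum(ranks) / n
    top_std = (sum((r - top_mean) ** 2 for r in ranks) / n) ** 0.5
    # Top-K accuracy: share of queries whose gold context ranks in the first K.
    top_k = {k: 100 * sum(r <= k for r in ranks) / n for k in ks}
    # Mean Reciprocal Rank, scaled by 100 as in the table.
    mrr = 100 * sum(1 / r for r in ranks) / n
    return dict(top_mean=top_mean, top_std=top_std, top_k=top_k, mrr=mrr)
```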
Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.
Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
---|---|---|---|---|---|---|---|---|
BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
CamemBERT | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
DistilCamemBERT | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
mMiniLMv2-L12 | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
RoBERTa (multilingual) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
cmarkea/bloomz-560m-reranking | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
cmarkea/bloomz-3b-reranking | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |
As observed, the cross-lingual setting does not significantly degrade the behavior of our models. When the model is used to rerank and filter the Top-K results of a retriever, a threshold of 0.8 can be applied to discard low-scoring contexts, reducing the noise passed on to RAG-type applications.
## How to Use Bloomz-3b-reranking

The following example uses the Pipeline API of the Transformers library.
```python
from transformers import pipeline

reranker = pipeline('text-classification', 'cmarkea/bloomz-3b-reranking', top_k=None)

query: str
context_list: list  # contexts returned by the retriever

similarity = reranker(
    [dict(text=ii, text_pair=query) for ii in context_list]
)
# Keep the probability of the "relevant" class (LABEL_1) for each context.
score_label_1 = [
    next(jj['score'] for jj in ii if jj['label'] == 'LABEL_1')
    for ii in similarity
]
# Rank contexts from most to least relevant.
context_reranked = sorted(
    zip(score_label_1, context_list),
    key=lambda x: x[0],
    reverse=True
)
# Optionally filter out contexts below the 0.8 threshold.
score, context_cleaned = zip(
    *filter(lambda x: x[0] >= 0.8, context_reranked)
)
```
## Citation

```bibtex
@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```