---
license: bigscience-bloom-rail-1.0
datasets:
  - unicamp-dl/mmarco
  - rajpurkar/squad
language:
  - fr
  - en
pipeline_tag: sentence-similarity
---

## Evaluation

To assess the reranker's performance, we use the "validation" split of the SQuAD dataset. For each paragraph, we select the first question, with the paragraph itself serving as the context that an oracle model should rank Top-1. Conveniently, the number of themes is limited, and every context from the same theme that does not match the query forms a hard negative (contexts from other themes are easy negatives). We can thus construct the following table, listing the number of contexts per theme (each context has one associated query):

| Theme name | Number of contexts |
|---|---|
| Normans | 39 |
| Computational_complexity_theory | 48 |
| Southern_California | 39 |
| Sky_(United_Kingdom) | 22 |
| Victoria_(Australia) | 25 |
| Huguenot | 44 |
| Steam_engine | 46 |
| Oxygen | 43 |
| 1973_oil_crisis | 24 |
| European_Union_law | 40 |
| Amazon_rainforest | 21 |
| Ctenophora | 31 |
| Fresno,_California | 28 |
| Packet_switching | 23 |
| Black_Death | 23 |
| Geology | 25 |
| Pharmacy | 26 |
| Civil_disobedience | 26 |
| Construction | 22 |
| Private_school | 26 |
| Harvard_University | 30 |
| Jacksonville,_Florida | 21 |
| Economic_inequality | 44 |
| University_of_Chicago | 37 |
| Yuan_dynasty | 47 |
| Immune_system | 49 |
| Intergovernmental_Panel_on_Climate_Change | 24 |
| Prime_number | 31 |
| Rhine | 44 |
| Scottish_Parliament | 39 |
| Islamism | 39 |
| Imperialism | 39 |
| Warsaw | 49 |
| French_and_Indian_War | 46 |
| Force | 44 |

The evaluation corpus consists of 1,204 query/context pairs to be ranked.
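As a rough illustration of how such an evaluation set can be derived, here is a minimal sketch assuming the Hugging Face `datasets` library. The exact SQuAD variant and preprocessing used by the authors are not specified here, so the resulting counts may differ from the table above; variable names are ours:

```python
from collections import OrderedDict
from datasets import load_dataset

# SQuAD v1.1 validation split (English; a French variant would be handled analogously).
squad = load_dataset('rajpurkar/squad', split='validation')

# Keep the first question attached to each distinct paragraph; the paragraph
# itself is the gold context, and its 'title' field gives the theme.
pairs = OrderedDict()
for row in squad:
    if row['context'] not in pairs:
        pairs[row['context']] = dict(query=row['question'], theme=row['title'])

print(len(pairs))  # number of query/context pairs (the card reports 1,204)
```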

First, we compute the evaluation scores for the case where both the query and the context are in the same language (French/French).

| Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | Mean Top score | Std Top score |
|---|---|---|---|---|---|---|---|---|
| BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
| CamemBERT | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
| DistilCamemBERT | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
| mMiniLMv2-L12 | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
| RoBERTa (multilingual) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
| cmarkea/bloomz-560m-reranking | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
| cmarkea/bloomz-3b-reranking | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
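For clarity, the ranking metrics reported in these tables can all be derived from the 1-indexed rank assigned to the gold context for each query. A minimal sketch (this function is illustrative, not the authors' evaluation code):

```python
import numpy as np

def ranking_metrics(ranks: np.ndarray) -> dict:
    """Ranking metrics from the 1-indexed rank of the gold context per query."""
    return dict(
        top_mean=ranks.mean(),                # Top-mean
        top_std=ranks.std(),                  # Top-std
        top_1=100 * (ranks <= 1).mean(),      # Top-1 (%)
        top_10=100 * (ranks <= 10).mean(),    # Top-10 (%)
        top_100=100 * (ranks <= 100).mean(),  # Top-100 (%)
        mrr_x100=100 * (1.0 / ranks).mean(),  # MRR (x100)
    )
```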

Next, we evaluate the model in a cross-language setting, with queries in French and contexts in English.

| Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | Mean Top score | Std Top score |
|---|---|---|---|---|---|---|---|---|
| BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
| CamemBERT | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
| DistilCamemBERT | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
| mMiniLMv2-L12 | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
| RoBERTa (multilingual) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
| cmarkea/bloomz-560m-reranking | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
| cmarkea/bloomz-3b-reranking | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |

As observed, the cross-language setting does not significantly degrade the behavior of our models. When the model is used to rerank and filter the Top-K results of a retriever, a threshold of 0.8 can be applied to the reranker's scores to discard weakly matching contexts, thereby reducing the noise fed into RAG-type applications.

## How to Use Bloomz-3b-reranking

The following example uses the Pipeline API of the Transformers library. Here, `query` is the user question and `context_list` is the list of candidate contexts returned by a retriever.

```python
from transformers import pipeline

# The reranker is a binary classifier: LABEL_1 means "the context matches
# the query". top_k=None returns the scores of both labels.
reranker = pipeline(
    task='text-classification',
    model='cmarkea/bloomz-3b-reranking',
    top_k=None
)

query: str
context_list: list[str]

# Score each (context, query) pair.
similarity = reranker(
    [dict(text=context, text_pair=query) for context in context_list]
)
# Extract the probability of LABEL_1 ("relevant") for each context.
scores = [
    next(item['score'] for item in sim if item['label'] == 'LABEL_1')
    for sim in similarity
]
# Rank contexts from most to least relevant, then apply the 0.8 threshold.
context_reranked = sorted(zip(scores, context_list), key=lambda x: x[0], reverse=True)
score, context_cleaned = zip(*filter(lambda x: x[0] >= 0.8, context_reranked))
```
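For a 3-billion-parameter model, CPU inference may be slow. A hedged sketch of loading the pipeline on GPU with reduced precision (this assumes a CUDA-capable device with bfloat16 support; it is not part of the original card):

```python
import torch
from transformers import pipeline

# Assumption: a CUDA GPU is available; bfloat16 roughly halves memory vs. float32.
reranker = pipeline(
    task='text-classification',
    model='cmarkea/bloomz-3b-reranking',
    top_k=None,
    device=0,
    torch_dtype=torch.bfloat16
)
```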

## Citation

```bibtex
@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```