---
license: bigscience-bloom-rail-1.0
datasets:
- unicamp-dl/mmarco
- rajpurkar/squad
language:
- fr
- en
pipeline_tag: sentence-similarity
---

## Evaluation

To assess the performance of the reranker, we use the "validation" split of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset. For each paragraph, we select
the first question along with the paragraph itself, i.e. the context that an oracle model should rank Top-1. Conveniently, the number of themes is limited, and every context
from the same theme that does not match the query forms a hard negative (contexts from other themes are easy negatives). We can thus construct the following table, listing
for each theme the number of contexts, and hence of associated queries:

| Theme name                                   | Number of contexts |
|:---------------------------------------------|-------------------:|
| Normans                                      | 39             |
| Computational_complexity_theory              | 48             |
| Southern_California                          | 39             |
| Sky_(United_Kingdom)                         | 22             |
| Victoria_(Australia)                         | 25             |
| Huguenot                                     | 44             |
| Steam_engine                                 | 46             |
| Oxygen                                       | 43             |
| 1973_oil_crisis                              | 24             |
| European_Union_law                           | 40             |
| Amazon_rainforest                            | 21             |
| Ctenophora                                   | 31             |
| Fresno,_California                           | 28             |
| Packet_switching                             | 23             |
| Black_Death                                  | 23             |
| Geology                                      | 25             |
| Pharmacy                                     | 26             |
| Civil_disobedience                           | 26             |
| Construction                                 | 22             |
| Private_school                               | 26             |
| Harvard_University                           | 30             |
| Jacksonville,_Florida                        | 21             |
| Economic_inequality                          | 44             |
| University_of_Chicago                        | 37             |
| Yuan_dynasty                                 | 47             |
| Immune_system                                | 49             |
| Intergovernmental_Panel_on_Climate_Change    | 24             |
| Prime_number                                 | 31             |
| Rhine                                        | 44             |
| Scottish_Parliament                          | 39             |
| Islamism                                     | 39             |
| Imperialism                                  | 39             |
| Warsaw                                       | 49             |
| French_and_Indian_War                        | 46             |
| Force                                        | 44             |
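
As an illustration (not the exact evaluation script), the evaluation set can be reconstructed along these lines with the `datasets` library; the use of `title` as the theme key and all variable names are our own choices:

```python
from collections import OrderedDict
from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="validation")

# Keep one entry per unique paragraph: the first question seen for it.
pairs = OrderedDict()  # context -> (theme, first question)
for row in squad:
    if row["context"] not in pairs:
        pairs[row["context"]] = (row["title"], row["question"])

# Group contexts by theme: within a theme, every context that does not
# match a query is a hard negative; contexts from other themes are easy ones.
themes = {}
for context, (title, question) in pairs.items():
    themes.setdefault(title, []).append((question, context))

print(sum(len(v) for v in themes.values()), "query/context pairs")
```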

The evaluation corpus consists of 1,204 query/context pairs to be ranked.

First, the evaluation scores are computed in the monolingual setting, where both the query and the context are in French (French/French).
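
The reported metrics all derive from the rank assigned to the gold context for each query (we assume "Top" denotes that rank). A minimal sketch of how the table's rank-based columns can be computed, given a NumPy array `ranks` holding the 1-indexed rank of the correct context for each of the 1,204 queries:

```python
import numpy as np

def ranking_metrics(ranks: np.ndarray) -> dict:
    """Compute the table's rank-based columns from 1-indexed gold-context ranks."""
    return {
        "Top-mean": ranks.mean(),                # average rank of the gold context
        "Top-std": ranks.std(),                  # spread of that rank
        "Top-1 (%)": 100 * (ranks == 1).mean(),
        "Top-10 (%)": 100 * (ranks <= 10).mean(),
        "Top-100 (%)": 100 * (ranks <= 100).mean(),
        "MRR (x100)": 100 * (1.0 / ranks).mean(),  # mean reciprocal rank
    }
```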

|     Model (French/French)     |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|              BM25             |    14.47   |   92.19   |   69.77   |    92.03   |    98.09    |    77.74   |        NA        |        NA       |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) |    5.72    |   36.88   |   69.35   |    95.51   |    98.92    |    79.51   |       0.83       |       0.37      |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) |    5.54    |   25.90   |   66.11   |    92.77   |    99.17    |    76.00   |       0.80       |       0.39      |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    4.43    |   30.27   |   71.51   |    95.68   |    99.42    |    80.17   |       0.78       |       0.38      |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |   15.13   |   60.39    |    57.23   |    83.87    |   96.18   |   66.21   |  0.53   |  0.11  |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.49    |    2.58   |   83.55   |    99.17   |     100     |    89.98   |       0.93       |       0.15      |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    1.06   |   89.37   |    99.75   |     100     |    93.79   |       0.94       |       0.10      |


Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.

|     Model (French/English)    |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|              BM25             |   288.04   |   371.46  |   21.93   |    41.93   |    55.15    |    28.41   |        NA        |        NA       |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)           |    12.20   |   61.39   |   59.55   |    89.71   |    97.42    |    70.38   |       0.65       |       0.47      |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR)        |    40.97   |   104.78  |   25.66   |    64.78   |    88.62    |    38.83   |       0.53       |       0.49      |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    6.91    |   32.16   |   59.88   |    89.95   |    99.09    |    70.39   |       0.61       |       0.46      |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |    79.32   |   153.62    |   27.91   |    49.50    |    78.16    |   35.41   |   0.40    |  0.12  |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.51    |    1.92   |   81.89   |    99.09   |     100     |    88.64   |       0.92       |       0.15      |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    0.98   |   89.20   |    99.84   |     100     |    93.63   |       0.94       |       0.10      |

As observed, the cross-language setting does not significantly impact the behavior of our models. When the model is used to rerank the Top-K results returned by a retriever,
a threshold of 0.8 on the score can be applied to filter out low-relevance contexts, reducing the noise passed on to RAG-type applications (see the usage example below).

How to Use Bloomz-3b-reranking
------------------------------

The following example uses the pipeline API of the Transformers library.

```python
from transformers import pipeline

# The reranker is a sequence classifier: LABEL_0 = irrelevant, LABEL_1 = relevant.
# With top_k=None, the pipeline returns the scores of both labels for each pair.
reranker = pipeline(
    'text-classification',
    'cmarkea/bloomz-3b-reranking',
    top_k=None
)

query = "..."  # the query, in French or English
context_list = ["...", "..."]  # the candidate contexts to rerank

similarity = reranker(
    [dict(text=ii, text_pair=query) for ii in context_list]
)

# Keep the probability of the "relevant" label (LABEL_1) for each context.
score_label_1 = [
    next(res['score'] for res in results if res['label'] == 'LABEL_1')
    for results in similarity
]

# Rank the contexts from most to least relevant.
context_reranked = sorted(
    zip(score_label_1, context_list),
    key=lambda x: x[0],
    reverse=True
)

# Discard contexts scoring below the 0.8 threshold.
score, context_cleaned = zip(
    *filter(lambda x: x[0] >= 0.8, context_reranked)
)
```
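
With `top_k=None`, the pipeline returns the scores of both labels for each query/context pair; only the probability of `LABEL_1` (relevant) is kept for ranking, and the 0.8 cutoff matches the filtering threshold suggested in the evaluation section.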

Citation
--------

```bibtex
@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```