---
license: bigscience-bloom-rail-1.0
datasets:
- unicamp-dl/mmarco
- rajpurkar/squad
language:
- fr
- en
pipeline_tag: sentence-similarity
---

## Evaluation

To assess the performance of the reranker, we use the "validation" split of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset. For each paragraph, we select
the first question along with the paragraph itself, i.e. the context that an oracle model should rank Top-1. Conveniently, the number of themes is limited, and every context
from the same theme that does not match the query forms a hard negative (contexts from other themes are easy negatives). We can thus construct the following table, listing
for each theme the number of contexts, and hence of associated queries:

| Theme name                                   | Number of contexts |
|:---------------------------------------------|-------------------:|
| Normans                                      | 39             |
| Computational_complexity_theory              | 48             |
| Southern_California                          | 39             |
| Sky_(United_Kingdom)                         | 22             |
| Victoria_(Australia)                         | 25             |
| Huguenot                                     | 44             |
| Steam_engine                                 | 46             |
| Oxygen                                       | 43             |
| 1973_oil_crisis                              | 24             |
| European_Union_law                           | 40             |
| Amazon_rainforest                            | 21             |
| Ctenophora                                   | 31             |
| Fresno,_California                           | 28             |
| Packet_switching                             | 23             |
| Black_Death                                  | 23             |
| Geology                                      | 25             |
| Pharmacy                                     | 26             |
| Civil_disobedience                           | 26             |
| Construction                                 | 22             |
| Private_school                               | 26             |
| Harvard_University                           | 30             |
| Jacksonville,_Florida                        | 21             |
| Economic_inequality                          | 44             |
| University_of_Chicago                        | 37             |
| Yuan_dynasty                                 | 47             |
| Immune_system                                | 49             |
| Intergovernmental_Panel_on_Climate_Change    | 24             |
| Prime_number                                 | 31             |
| Rhine                                        | 44             |
| Scottish_Parliament                          | 39             |
| Islamism                                     | 39             |
| Imperialism                                  | 39             |
| Warsaw                                       | 49             |
| French_and_Indian_War                        | 46             |
| Force                                        | 44             |
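
As an illustration (not the exact evaluation script), the evaluation set can be reconstructed along these lines with the `datasets` library; the use of `title` as the theme key and all variable names are our own choices:

```python
from collections import OrderedDict
from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="validation")

# Keep one entry per unique paragraph: the first question seen for it.
pairs = OrderedDict()  # context -> (theme, first question)
for row in squad:
    if row["context"] not in pairs:
        pairs[row["context"]] = (row["title"], row["question"])

# Group contexts by theme: within a theme, every context that does not
# match a query is a hard negative; contexts from other themes are easy ones.
themes = {}
for context, (title, question) in pairs.items():
    themes.setdefault(title, []).append((question, context))

print(sum(len(v) for v in themes.values()), "query/context pairs")
```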

The evaluation corpus consists of 1,204 query/context pairs to be ranked.

First, the evaluation scores are computed in the monolingual setting, where both the query and the context are in French (French/French).
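
The reported metrics all derive from the rank assigned to the gold context for each query (we assume "Top" denotes that rank). A minimal sketch of how the table's rank-based columns can be computed, given a NumPy array `ranks` holding the 1-indexed rank of the correct context for each of the 1,204 queries:

```python
import numpy as np

def ranking_metrics(ranks: np.ndarray) -> dict:
    """Compute the table's rank-based columns from 1-indexed gold-context ranks."""
    return {
        "Top-mean": ranks.mean(),                # average rank of the gold context
        "Top-std": ranks.std(),                  # spread of that rank
        "Top-1 (%)": 100 * (ranks == 1).mean(),
        "Top-10 (%)": 100 * (ranks <= 10).mean(),
        "Top-100 (%)": 100 * (ranks <= 100).mean(),
        "MRR (x100)": 100 * (1.0 / ranks).mean(),  # mean reciprocal rank
    }
```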

|     Model (French/French)     |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|              BM25             |    14.47   |   92.19   |   69.77   |    92.03   |    98.09    |    77.74   |        NA        |        NA       |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) |    5.72    |   36.88   |   69.35   |    95.51   |    98.92    |    79.51   |       0.83       |       0.37      |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) |    5.54    |   25.90   |   66.11   |    92.77   |    99.17    |    76.00   |       0.80       |       0.39      |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    4.43    |   30.27   |   71.51   |    95.68   |    99.42    |    80.17   |       0.78       |       0.38      |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |   15.13   |   60.39    |    57.23   |    83.87    |   96.18   |   66.21   |  0.53   |  0.11  |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.49    |    2.58   |   83.55   |    99.17   |     100     |    89.98   |       0.93       |       0.15      |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    1.06   |   89.37   |    99.75   |     100     |    93.79   |       0.94       |       0.10      |


Next, we evaluate the model in a cross-language context, with queries in French and contexts in English.

|     Model (French/English)    |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|              BM25             |   288.04   |   371.46  |   21.93   |    41.93   |    55.15    |    28.41   |        NA        |        NA       |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)           |    12.20   |   61.39   |   59.55   |    89.71   |    97.42    |    70.38   |       0.65       |       0.47      |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR)        |    40.97   |   104.78  |   25.66   |    64.78   |    88.62    |    38.83   |       0.53       |       0.49      |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    6.91    |   32.16   |   59.88   |    89.95   |    99.09    |    70.39   |       0.61       |       0.46      |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |    79.32   |   153.62    |   27.91   |    49.50    |    78.16    |   35.41   |   0.40    |  0.12  |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.51    |    1.92   |   81.89   |    99.09   |     100     |    88.64   |       0.92       |       0.15      |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    0.98   |   89.20   |    99.84   |     100     |    93.63   |       0.94       |       0.10      |

As observed, the cross-language setting does not significantly impact the behavior of our models. When the model is used to rerank the Top-K results returned by a retriever,
a threshold of 0.8 on the score can be applied to filter out low-relevance contexts, reducing the noise passed on to RAG-type applications (see the usage example below).

How to Use Bloomz-3b-reranking
------------------------------

The following example uses the pipeline API of the Transformers library.

```python
from transformers import pipeline

# The reranker is a sequence classifier: LABEL_0 = irrelevant, LABEL_1 = relevant.
# With top_k=None, the pipeline returns the scores of both labels for each pair.
reranker = pipeline(
    'text-classification',
    'cmarkea/bloomz-3b-reranking',
    top_k=None
)

query = "..."  # the query, in French or English
context_list = ["...", "..."]  # the candidate contexts to rerank

similarity = reranker(
    [dict(text=ii, text_pair=query) for ii in context_list]
)

# Keep the probability of the "relevant" label (LABEL_1) for each context.
score_label_1 = [
    next(res['score'] for res in results if res['label'] == 'LABEL_1')
    for results in similarity
]

# Rank the contexts from most to least relevant.
context_reranked = sorted(
    zip(score_label_1, context_list),
    key=lambda x: x[0],
    reverse=True
)

# Discard contexts scoring below the 0.8 threshold.
score, context_cleaned = zip(
    *filter(lambda x: x[0] >= 0.8, context_reranked)
)
```
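
With `top_k=None`, the pipeline returns the scores of both labels for each query/context pair; only the probability of `LABEL_1` (relevant) is kept for ranking, and the 0.8 cutoff matches the filtering threshold suggested in the evaluation section.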

Citation
--------

```bibtex
@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```