cmarkea/bloomz-3b-reranking

@@ -9,6 +9,84 @@ language:
 pipeline_tag: sentence-similarity
 ---
+## Evaluation
+To assess the reranker's performance, we use the "validation" split of the SQuAD dataset. For each paragraph, we take its first question and treat that paragraph as the
+excerpt an oracle model would rank Top-1. Because the number of themes is limited, every excerpt from the same theme that does not match the question serves as a hard
+negative, while excerpts from other themes are easy negatives. The table below lists each theme with its number of excerpts, each of which is paired with one
+question:
+| Theme name                                   | Number of contexts |
+|----------------------------------------------|--------------------|
+| Normans                                      | 39             |
+| Computational_complexity_theory              | 48             |
+| Southern_California                          | 39             |
+| Sky_(United_Kingdom)                         | 22             |
+| Victoria_(Australia)                         | 25             |
+| Huguenot                                     | 44             |
+| Steam_engine                                 | 46             |
+| Oxygen                                       | 43             |
+| 1973_oil_crisis                              | 24             |
+| European_Union_law                           | 40             |
+| Amazon_rainforest                            | 21             |
+| Ctenophora                                   | 31             |
+| Fresno,_California                           | 28             |
+| Packet_switching                             | 23             |
+| Black_Death                                  | 23             |
+| Geology                                      | 25             |
+| Pharmacy                                     | 26             |
+| Civil_disobedience                           | 26             |
+| Construction                                 | 22             |
+| Private_school                               | 26             |
+| Harvard_University                           | 30             |
+| Jacksonville,_Florida                        | 21             |
+| Economic_inequality                          | 44             |
+| University_of_Chicago                        | 37             |
+| Yuan_dynasty                                 | 47             |
+| Immune_system                                | 49             |
+| Intergovernmental_Panel_on_Climate_Change    | 24             |
+| Prime_number                                 | 31             |
+| Rhine                                        | 44             |
+| Scottish_Parliament                          | 39             |
+| Islamism                                     | 39             |
+| Imperialism                                  | 39             |
+| Warsaw                                       | 49             |
+| French_and_Indian_War                        | 46             |
+| Force                                        | 44             |
+The evaluation corpus consists of 1204 question/context pairs to be ranked.
+We first compute the evaluation scores for the case where the query and the context are in the same language (French/French).
+|     Model (French/French)     |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
+|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
+|              BM25             |    14.47   |   92.19   |   69.77   |    92.03   |    98.09    |    77.74   |        NA        |        NA       |
+| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) |    5.72    |   36.88   |   69.35   |    95.51   |    98.92    |    79.51   |       0.83       |       0.37      |
+| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) |    5.54    |   25.90   |   66.11   |    92.77   |    99.17    |    76.00   |       0.80       |       0.39      |
+| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    4.43    |   30.27   |   71.51   |    95.68   |    99.42    |    80.17   |       0.78       |       0.38      |
+| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |    15.13   |   60.39   |   57.23   |    83.87   |    96.18    |    66.21   |       0.53       |       0.11      |
+| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.49    |    2.58   |   83.55   |    99.17   |     100     |    89.98   |       0.93       |       0.15      |
+| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    1.06   |   89.37   |    99.75   |     100     |    93.79   |       0.94       |       0.10      |
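
The model card does not define the column headers of the table above, so the following is only a sketch under one plausible reading: "Top-mean" and "Top-std" are taken to be the mean and standard deviation of the rank at which the expected context appears, "Top-k (%)" the share of questions whose expected context is ranked within the first k, and "MRR (x100)" the mean reciprocal rank scaled by 100. The helper names (`rank_of_gold`, `summarize`), the toy scores, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def rank_of_gold(scores, gold_index):
    """Return the 1-based rank of the expected context for one question.

    `scores` holds one relevance score per candidate context. Ties are
    resolved in favor of the expected context (an assumption).
    """
    gold_score = scores[gold_index]
    return 1 + sum(s > gold_score for s in scores)


def summarize(ranks, ks=(1, 10, 100)):
    """Aggregate per-question ranks into table-style metrics (assumed definitions)."""
    n = len(ranks)
    summary = {
        "top_mean": mean(ranks),
        "top_std": pstdev(ranks),  # population std; the card does not say which is used
        "mrr_x100": 100 * mean([1 / r for r in ranks]),
    }
    for k in ks:
        summary[f"top_{k}_pct"] = 100 * sum(r <= k for r in ranks) / n
    return summary


# Toy data: 4 questions, 3 candidate contexts each (made-up scores).
score_matrix = [
    [0.95, 0.10, 0.30],  # expected context is index 0
    [0.80, 0.60, 0.20],  # expected context is index 1
    [0.05, 0.40, 0.90],  # expected context is index 2
    [0.70, 0.65, 0.50],  # expected context is index 2
]
gold = [0, 1, 2, 2]

ranks = [rank_of_gold(s, g) for s, g in zip(score_matrix, gold)]
metrics = summarize(ranks)
print(ranks, metrics)
```

In the real evaluation, each row of `score_matrix` would hold the reranker's scores for one question against all 1204 contexts.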
+Next, we evaluate the models in a cross-lingual setting, with queries in English and contexts in French.
+|     Model (French/English)    |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
+|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
+|              BM25             |   288.04   |   371.46  |   21.93   |    41.93   |    55.15    |    28.41   |        NA        |        NA       |
+| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)           |    12.20   |   61.39   |   59.55   |    89.71   |    97.42    |    70.38   |       0.65       |       0.47      |
+| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR)        |    40.97   |   104.78  |   25.66   |    64.78   |    88.62    |    38.83   |       0.53       |       0.49      |
+| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    6.91    |   32.16   |   59.88   |    89.95   |    99.09    |    70.39   |       0.61       |       0.46      |
+| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |    79.32   |   153.62  |   27.91   |    49.50   |    78.16    |    35.41   |       0.40       |       0.12      |
+| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.51    |    1.92   |   81.89   |    99.09   |     100     |    88.64   |       0.92       |       0.15      |
+| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    0.98   |   89.20   |    99.84   |     100     |    93.63   |       0.94       |       0.10      |
+As the tables show, the cross-lingual setting has little effect on the behavior of our models: bloomz-3b-reranking goes from 89.37% to 89.20% Top-1. When the model is
+used to rerank and filter the Top-K results of a search, a score threshold of 0.8 could be applied to the contexts returned by the retriever, which reduces the noise
+passed on to RAG-type applications.
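
The thresholding idea described above could be sketched as follows. This is a minimal illustration, not the authors' pipeline: `rerank_and_filter` and `score_fn` are hypothetical names, `score_fn` stands in for whatever call returns the reranker's relevance score for a (query, context) pair, and the toy scores are made up.

```python
def rerank_and_filter(query, contexts, score_fn, threshold=0.8, top_k=None):
    """Score retrieved contexts, sort them by descending score, and drop
    those scoring below `threshold`.

    `score_fn(query, context)` is a placeholder for the reranker call and is
    assumed to return a float in [0, 1].
    """
    scored = [(context, score_fn(query, context)) for context in contexts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    kept = [(context, score) for context, score in scored if score >= threshold]
    return kept if top_k is None else kept[:top_k]


# Toy stand-in for the reranker: precomputed, made-up scores.
toy_scores = {"context A": 0.97, "context B": 0.42, "context C": 0.81}


def toy_score_fn(query, context):
    return toy_scores[context]


result = rerank_and_filter("some question", list(toy_scores), toy_score_fn)
print(result)
```

The 0.8 default comes from the text. In practice the threshold would likely need tuning on held-out data for a given retriever and corpus.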
 |     Model (French/French)     |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
 |:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
 |              BM25             |    14.47   |   92.19   |   69.77   |    92.03   |    98.09    |    77.74   |        NA        |        NA       |