Update README.md
README.md
CHANGED
@@ -9,6 +9,84 @@ language:
|
|
9 |
pipeline_tag: sentence-similarity
|
10 |
---
|
11 |
|
12 |
+
## Evaluation
|
13 |
+
|
14 |
+
To assess the performance of the reranker, we use the "validation" split of the SQuAD dataset. We take the first question from each paragraph, together with
|
15 |
+
the paragraph itself, which is the excerpt an oracle model would rank Top-1. Conveniently, the number of themes is limited, and each excerpt from
|
16 |
+
the same theme that does not answer the question serves as a hard negative (excerpts from other themes are easy negatives). The following
|
17 |
+
table lists, for each theme, the number of excerpts, each paired with one question:
|
18 |
+
|
19 |
+
| Theme name                                   | Number of contexts |
|
20 |
+
|----------------------------------------------|----------------|
|
21 |
+
| Normans | 39 |
|
22 |
+
| Computational_complexity_theory | 48 |
|
23 |
+
| Southern_California | 39 |
|
24 |
+
| Sky_(United_Kingdom) | 22 |
|
25 |
+
| Victoria_(Australia) | 25 |
|
26 |
+
| Huguenot | 44 |
|
27 |
+
| Steam_engine | 46 |
|
28 |
+
| Oxygen | 43 |
|
29 |
+
| 1973_oil_crisis | 24 |
|
30 |
+
| European_Union_law | 40 |
|
31 |
+
| Amazon_rainforest | 21 |
|
32 |
+
| Ctenophora | 31 |
|
33 |
+
| Fresno,_California | 28 |
|
34 |
+
| Packet_switching | 23 |
|
35 |
+
| Black_Death | 23 |
|
36 |
+
| Geology | 25 |
|
37 |
+
| Pharmacy | 26 |
|
38 |
+
| Civil_disobedience | 26 |
|
39 |
+
| Construction | 22 |
|
40 |
+
| Private_school | 26 |
|
41 |
+
| Harvard_University | 30 |
|
42 |
+
| Jacksonville,_Florida | 21 |
|
43 |
+
| Economic_inequality | 44 |
|
44 |
+
| University_of_Chicago | 37 |
|
45 |
+
| Yuan_dynasty | 47 |
|
46 |
+
| Immune_system | 49 |
|
47 |
+
| Intergovernmental_Panel_on_Climate_Change | 24 |
|
48 |
+
| Prime_number | 31 |
|
49 |
+
| Rhine | 44 |
|
50 |
+
| Scottish_Parliament | 39 |
|
51 |
+
| Islamism | 39 |
|
52 |
+
| Imperialism | 39 |
|
53 |
+
| Warsaw | 49 |
|
54 |
+
| French_and_Indian_War | 46 |
|
55 |
+
| Force | 44 |
|
56 |
+
|
57 |
+
The evaluation corpus consists of 1204 question/context pairs to be ranked.
|
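The corpus size and the split between hard and easy negatives can be checked directly from the theme table. The sketch below is only illustrative: the counts are copied from the table, while the helper name `negatives_for` and the assumption that every question is ranked against all 1204 contexts are ours, not stated in the README.

```python
from __future__ import annotations

# Number of contexts per theme, copied from the table above.
THEME_COUNTS = {
    "Normans": 39, "Computational_complexity_theory": 48,
    "Southern_California": 39, "Sky_(United_Kingdom)": 22,
    "Victoria_(Australia)": 25, "Huguenot": 44, "Steam_engine": 46,
    "Oxygen": 43, "1973_oil_crisis": 24, "European_Union_law": 40,
    "Amazon_rainforest": 21, "Ctenophora": 31, "Fresno,_California": 28,
    "Packet_switching": 23, "Black_Death": 23, "Geology": 25,
    "Pharmacy": 26, "Civil_disobedience": 26, "Construction": 22,
    "Private_school": 26, "Harvard_University": 30,
    "Jacksonville,_Florida": 21, "Economic_inequality": 44,
    "University_of_Chicago": 37, "Yuan_dynasty": 47, "Immune_system": 49,
    "Intergovernmental_Panel_on_Climate_Change": 24, "Prime_number": 31,
    "Rhine": 44, "Scottish_Parliament": 39, "Islamism": 39,
    "Imperialism": 39, "Warsaw": 49, "French_and_Indian_War": 46,
    "Force": 44,
}

TOTAL_CONTEXTS = sum(THEME_COUNTS.values())


def negatives_for(theme: str) -> tuple[int, int]:
    """Return (hard, easy) negative counts for one question of `theme`.

    Assumption: the candidate pool is the whole evaluation corpus.
    """
    in_theme = THEME_COUNTS[theme]
    hard = in_theme - 1               # same theme, but not the answering excerpt
    easy = TOTAL_CONTEXTS - in_theme  # excerpts from all other themes
    return hard, easy


print(TOTAL_CONTEXTS)            # total number of question/context pairs
print(negatives_for("Normans"))  # (hard negatives, easy negatives)
```

Under that assumption, a question from a small theme such as Amazon_rainforest faces far fewer hard negatives than one from Immune_system, which is worth keeping in mind when reading per-theme results.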
58 |
+
|
59 |
+
First, we compute the evaluation scores when the query and the context are in the same language (French/French).
|
60 |
+
|
61 |
+
| Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
|
62 |
+
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|
63 |
+
| BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
|
64 |
+
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 5.72 | 36.88 | 69.35 | 95.51 | 98.92 | 79.51 | 0.83 | 0.37 |
|
65 |
+
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 5.54 | 25.90 | 66.11 | 92.77 | 99.17 | 76.00 | 0.80 | 0.39 |
|
66 |
+
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 4.43 | 30.27 | 71.51 | 95.68 | 99.42 | 80.17 | 0.78 | 0.38 |
|
67 |
+
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 15.13 | 60.39 | 57.23 | 83.87 | 96.18 | 66.21 | 0.53 | 0.11 |
|
68 |
+
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.49 | 2.58 | 83.55 | 99.17 | 100 | 89.98 | 0.93 | 0.15 |
|
69 |
+
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 1.06 | 89.37 | 99.75 | 100 | 93.79 | 0.94 | 0.10 |
|
70 |
+
|
71 |
+
|
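For readers who want to reproduce the rank-based columns, here is a minimal sketch of how they could be computed from the rank of the correct context for each question. It assumes that Top-mean and Top-std are the mean and standard deviation of that rank, and that a population standard deviation is used; the README does not spell out either point, and the function name `ranking_metrics` is hypothetical.

```python
from statistics import mean, pstdev


def ranking_metrics(ranks, ks=(1, 10, 100)):
    """Summarise 1-based ranks of the correct context, one per question."""
    n = len(ranks)
    metrics = {
        "top_mean": mean(ranks),                        # average rank
        "top_std": pstdev(ranks),                       # population std (assumed)
        "mrr_x100": 100 * mean(1 / r for r in ranks),   # mean reciprocal rank
    }
    for k in ks:
        # Share of questions whose correct context is within the first k.
        metrics[f"top_{k}_pct"] = 100 * sum(r <= k for r in ranks) / n
    return metrics


# Toy ranks for four questions (illustrative values, not from the evaluation).
example = ranking_metrics([1, 1, 2, 4])
print(example)
```

The "mean score Top" and "std score Top" columns depend on the model's output scores rather than on ranks, so they are not covered by this sketch.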
72 |
+
Next, we evaluate the model in a cross-language context, with queries in English and contexts in French.
|
73 |
+
|
74 |
+
| Model (French/English) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
|
75 |
+
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|
76 |
+
| BM25 | 288.04 | 371.46 | 21.93 | 41.93 | 55.15 | 28.41 | NA | NA |
|
77 |
+
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 12.20 | 61.39 | 59.55 | 89.71 | 97.42 | 70.38 | 0.65 | 0.47 |
|
78 |
+
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | 40.97 | 104.78 | 25.66 | 64.78 | 88.62 | 38.83 | 0.53 | 0.49 |
|
79 |
+
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | 6.91 | 32.16 | 59.88 | 89.95 | 99.09 | 70.39 | 0.61 | 0.46 |
|
80 |
+
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) | 79.32 | 153.62 | 27.91 | 49.50 | 78.16 | 35.41 | 0.40 | 0.12 |
|
81 |
+
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) | 1.51 | 1.92 | 81.89 | 99.09 | 100 | 88.64 | 0.92 | 0.15 |
|
82 |
+
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) | 1.22 | 0.98 | 89.20 | 99.84 | 100 | 93.63 | 0.94 | 0.10 |
|
83 |
+
|
84 |
+
As observed, the cross-language setting has little effect on our models: for bloomz-3b-reranking, Top-1 moves from 89.37% to 89.20% and MRR from 93.79 to 93.63. When the model is used as a reranker that filters the
|
85 |
+
Top-K results of a search, a score threshold of 0.8 could be applied to the contexts returned by the retriever, thereby reducing the noise in the contexts passed to
|
86 |
+
RAG-type applications.
|
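The thresholding step described above could look like the following sketch. It takes contexts that have already been scored by the reranker and keeps those at or above 0.8. The function name `filter_contexts`, the `(context, score)` tuple layout, and the example scores are all assumptions made for illustration; how the scores are obtained from the model is outside the scope of this snippet.

```python
def filter_contexts(scored_contexts, threshold=0.8, top_k=None):
    """Keep contexts whose reranker score reaches `threshold`.

    scored_contexts: iterable of (context, score) pairs, score in [0, 1].
    top_k: optional cap on the number of contexts returned.
    """
    kept = [(ctx, score) for ctx, score in scored_contexts if score >= threshold]
    # Highest-scoring contexts first, so a top_k cap keeps the best ones.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Hypothetical retriever output that has been scored by the reranker.
candidates = [("ctx A", 0.93), ("ctx B", 0.41), ("ctx C", 0.86), ("ctx D", 0.79)]
kept = filter_contexts(candidates)
print(kept)
```

If no context reaches the threshold the function returns an empty list, which lets a RAG application fall back to answering without retrieved context instead of passing along noisy excerpts.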
87 |
+
|
88 |
+
|
89 |
+
|
90 |
| Model (French/French) | Top-mean | Top-std | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) | mean score Top | std score Top |
|
91 |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|
92 |
| BM25 | 14.47 | 92.19 | 69.77 | 92.03 | 98.09 | 77.74 | NA | NA |
|