skirres committed
Commit b3e6c64
1 Parent(s): 53f7300

Use FP32 metrics

Files changed (1)
  1. README.md +16 -17
README.md CHANGED
@@ -26,7 +26,7 @@ The model was trained and tested in the following languages:
 
 | Metric              | Value |
 |:--------------------|------:|
-| Relevance (NDCG@10) | 0.456 |
+| Relevance (NDCG@10) | 0.453 |
 
 Note that the relevance score is computed as an average over 14 retrieval datasets (see
 [details below](#evaluation-metrics)).
@@ -35,12 +35,11 @@ Note that the relevance score is computed as an average over 14 retrieval datasets
 
 | GPU        | Batch size 32 |
 |:-----------|--------------:|
-| NVIDIA A10 |          4 ms |
-| NVIDIA T4  |         13 ms |
+| NVIDIA A10 |          8 ms |
+| NVIDIA T4  |         21 ms |
 
 The inference times only measure the time the model takes to process a single batch, it does not include pre- or
-post-processing steps like the tokenization. The reported times are measured using the
-[FP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) version of the model.
+post-processing steps like the tokenization.
 
 ## Requirements
 
@@ -77,22 +76,22 @@ To determine the relevance score, we averaged the results that we obtained when
 
 | Dataset           | NDCG@10 |
 |:------------------|--------:|
-| Average           |   0.456 |
+| Average           |   0.453 |
 |                   |         |
-| Arguana           |   0.517 |
+| Arguana           |   0.516 |
 | CLIMATE-FEVER     |   0.159 |
 | DBPedia Entity    |   0.355 |
-| FEVER             |   0.733 |
+| FEVER             |   0.729 |
 | FiQA-2018         |   0.282 |
 | HotpotQA          |   0.688 |
-| MS MARCO          |   0.327 |
+| MS MARCO          |   0.334 |
 | NFCorpus          |   0.341 |
-| NQ                |   0.441 |
-| Quora             |   0.768 |
+| NQ                |   0.438 |
+| Quora             |   0.726 |
 | SCIDOCS           |   0.143 |
-| SciFact           |   0.629 |
-| TREC-COVID        |   0.667 |
-| Webis-Touche-2020 |   0.328 |
+| SciFact           |   0.630 |
+| TREC-COVID        |   0.664 |
+| Webis-Touche-2020 |   0.337 |
 
 We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its
 multilingual capacities. Note that not all training languages are part of the benchmark, so we only report the metrics
@@ -100,6 +99,6 @@ for the existing languages.
 
 | Language | NDCG@10 |
 |:---------|--------:|
-| French   |   0.349 |
-| German   |   0.375 |
-| Spanish  |   0.417 |
+| French   |   0.346 |
+| German   |   0.368 |
+| Spanish  |   0.416 |
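The inference-time note in the diff only covers the model's forward pass over one batch, excluding tokenization and other pre- or post-processing. As an illustration of how such a per-batch latency could be measured in FP32 (with FP16 as a commented-out alternative), here is a minimal sketch assuming a PyTorch model loaded through `transformers`; the checkpoint path, batch contents, and iteration counts are placeholders, not details taken from this repository:

```python
# Illustrative per-batch latency measurement: tokenization happens once,
# outside the timed loop, and only the forward pass is timed.
import time
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "path/to/retrieval-model"  # placeholder, not the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval().to("cuda")  # FP32 by default
# model = model.half()  # uncomment to time an FP16 variant instead

batch = tokenizer(["example query"] * 32, padding=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    for _ in range(10):        # warm-up runs, not timed
        model(**batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):       # timed runs
        model(**batch)
    torch.cuda.synchronize()
    print(f"{(time.perf_counter() - start) / 100 * 1000:.1f} ms per batch")
```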
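The reported relevance score is the unweighted mean of the 14 per-dataset NDCG@10 values. A small check, with the FP32 values copied from the new side of the diff (the snippet itself is illustrative, not part of the repository):

```python
# Reproduce the averaged relevance score from the per-dataset NDCG@10 values
# listed in the updated (FP32) evaluation table.
ndcg_at_10 = {
    "Arguana": 0.516, "CLIMATE-FEVER": 0.159, "DBPedia Entity": 0.355,
    "FEVER": 0.729, "FiQA-2018": 0.282, "HotpotQA": 0.688,
    "MS MARCO": 0.334, "NFCorpus": 0.341, "NQ": 0.438, "Quora": 0.726,
    "SCIDOCS": 0.143, "SciFact": 0.630, "TREC-COVID": 0.664,
    "Webis-Touche-2020": 0.337,
}

average = sum(ndcg_at_10.values()) / len(ndcg_at_10)
print(f"Average NDCG@10 over {len(ndcg_at_10)} datasets: {average:.3f}")  # 0.453
```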