claeyzre committed on
Commit
a538308
·
verified ·
1 Parent(s): 1c65fde

Create README.md

  1. README.md +152 -0
README.md ADDED
@@ -0,0 +1,152 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - feature-extraction
+ - sentence-similarity
+ language:
+ - de
+ - en
+ - es
+ - fr
+ - it
+ - nl
+ - ja
+ - pt
+ - zh
+ - pl
+ ---
+
+ # Model Card for `vectorizer.guava`
+
+ This model is a vectorizer developed by Sinequa. It produces an embedding vector given a passage or a query. The
+ passage vectors are stored in our vector index and the query vector is used at query time to look up relevant passages
+ in the index.
+
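As an illustration of the lookup described above, here is a minimal sketch of a nearest-neighbour search over stored passage vectors. It is not Sinequa's vector index: the `top_k` helper, the toy 3-dimensional vectors, and the choice of cosine similarity are all assumptions made for the example (the model card does not state which similarity measure the index uses).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (passage_id, score) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query_vector, vec)) for pid, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for the model's 256-dimensional output.
index = {
    "passage-a": [0.9, 0.1, 0.0],
    "passage-b": [0.0, 1.0, 0.0],
    "passage-c": [0.7, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]  # hypothetical query embedding
results = top_k(query, index, k=2)
```

In a real deployment the passage vectors would come from the model at indexing time and the query vector from the model at query time; only the ranking step is sketched here.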
+ Model name: `vectorizer.guava`
+
+ ## Supported Languages
+
+ The model was trained and tested in the following languages:
+
+ - English
+ - French
+ - German
+ - Spanish
+ - Italian
+ - Dutch
+ - Japanese
+ - Portuguese
+ - Chinese (simplified)
+ - Chinese (traditional)
+ - Polish
+
+ Besides these languages, basic support can be expected for an additional 91 languages that were used during the pretraining
+ of the base model (see Appendix A of the XLM-R paper).
+
+ ## Scores
+
+ | Metric                         | Value |
+ |:-------------------------------|------:|
+ | English Relevance (Recall@100) | 0.616 |
+
+ Note that the relevance scores are computed as an average over several retrieval datasets (see
+ [details below](#evaluation-metrics)).
+
+ ## Inference Times
+
+ | GPU        | Quantization type | Batch size 1 | Batch size 32 |
+ |:-----------|:------------------|-------------:|--------------:|
+ | NVIDIA A10 | FP16              |         1 ms |          5 ms |
+ | NVIDIA A10 | FP32              |         2 ms |         18 ms |
+ | NVIDIA T4  | FP16              |         1 ms |         12 ms |
+ | NVIDIA T4  | FP32              |         3 ms |         52 ms |
+ | NVIDIA L4  | FP16              |         2 ms |          5 ms |
+ | NVIDIA L4  | FP32              |         4 ms |         24 ms |
+
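To put the batch latencies in perspective, the following sketch converts the batch-size-32 column into an approximate throughput. It assumes each figure is the wall-clock time for one full batch and ignores tokenization, data transfer, and queuing, none of which the table describes; `passages_per_second` is a hypothetical helper written for this example.

```python
def passages_per_second(batch_size, batch_latency_ms):
    """Approximate throughput if batches of this size run back to back."""
    return batch_size * 1000.0 / batch_latency_ms


# Batch-size-32 latencies in milliseconds, copied from the table above.
batch_32_latency_ms = {
    ("NVIDIA A10", "FP16"): 5,
    ("NVIDIA A10", "FP32"): 18,
    ("NVIDIA T4", "FP16"): 12,
    ("NVIDIA T4", "FP32"): 52,
    ("NVIDIA L4", "FP16"): 5,
    ("NVIDIA L4", "FP32"): 24,
}

throughput = {
    key: passages_per_second(32, latency)
    for key, latency in batch_32_latency_ms.items()
}

for (gpu, precision), value in throughput.items():
    print(f"{gpu} {precision}: about {value:.0f} inputs per second")
```

These numbers are an upper bound under the stated assumptions, not a measured end-to-end rate.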
+ ## GPU Memory Usage
+
+ | Quantization type | Memory   |
+ |:------------------|---------:|
+ | FP16              |  550 MiB |
+ | FP32              | 1050 MiB |
+
+ Note that GPU memory usage only includes how much GPU memory the actual model consumes on an NVIDIA T4 GPU with a batch
+ size of 32. It does not include the fixed amount of memory that is consumed by the ONNX Runtime upon initialization, which
+ can be around 0.5 to 1 GiB depending on the GPU used.
+
+ ## Requirements
+
+ - Minimum Sinequa version: 11.10.0
+ - Minimum Sinequa version for using FP16 models and GPUs with CUDA compute capability of 8.9+ (like NVIDIA L4): 11.11.0
+ - [CUDA compute capability](https://developer.nvidia.com/cuda-gpus): above 5.0 (above 6.0 for FP16 use)
+
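The requirements above can be expressed as a small compatibility check. This is only a sketch with a hypothetical `is_supported` helper, and it rests on two assumptions that the list does not settle: that 11.11.0 is needed when either FP16 is used or the GPU has compute capability 8.9 or higher, and that "above 5.0" and "above 6.0" are inclusive minimums. Versions are compared as integer tuples because a plain string comparison would rank `"11.9.0"` above `"11.10.0"`.

```python
def parse_version(text):
    """Turn a dotted version such as '11.10.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_supported(sinequa_version, compute_capability, use_fp16):
    """Check a configuration against the requirements listed above.

    Assumptions: FP16 *or* compute capability >= 8.9 requires 11.11.0,
    and the capability thresholds are treated as inclusive.
    """
    version = parse_version(sinequa_version)
    capability = parse_version(compute_capability)

    min_version = (11, 10, 0)
    if use_fp16 or capability >= (8, 9):
        min_version = (11, 11, 0)

    min_capability = (6, 0) if use_fp16 else (5, 0)
    return version >= min_version and capability >= min_capability


print(is_supported("11.10.0", "7.5", use_fp16=False))  # FP32 on a 7.5 GPU
print(is_supported("11.10.0", "7.5", use_fp16=True))   # FP16 needs 11.11.0
```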
+ ## Model Details
+
+ ### Overview
+
+ - Number of parameters: 107 million
+ - Base language model: [mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) ([Paper](https://arxiv.org/abs/2012.15828), [GitHub](https://github.com/microsoft/unilm/tree/master/minilm))
+ - Insensitive to casing and accents
+ - Output dimensions: 256 (reduced with an additional dense layer)
+ - Training procedure: Query-passage-negative triplets for datasets that have mined hard negative data, query-passage
+   pairs for the rest. The number of negatives is augmented with an in-batch negatives strategy.
+
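The training procedure in the list above mentions hard negatives and in-batch negatives. The sketch below shows one common way to combine them, a softmax cross-entropy over similarity scores in which every other passage in the batch serves as a negative. The model card does not specify the actual loss, similarity function, or scaling used, so the dot-product scoring, the `scale` parameter, and the function name are assumptions for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_negatives_loss(query_vectors, passage_vectors,
                            hard_negative_vectors=(), scale=1.0):
    """Mean cross-entropy where query i must pick passage i.

    All other passages in the batch, plus any mined hard negatives,
    act as negatives for query i.
    """
    candidates = list(passage_vectors) + list(hard_negative_vectors)
    total = 0.0
    for i, query in enumerate(query_vectors):
        logits = [scale * dot(query, candidate) for candidate in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(query_vectors)


queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]   # passage i matches query i
hard_negatives = [[0.5, 0.5]]          # a mined hard negative

aligned = in_batch_negatives_loss(queries, passages)
misaligned = in_batch_negatives_loss(queries, passages[::-1])
with_hard = in_batch_negatives_loss(queries, passages, hard_negatives)
```

With this formulation a batch of N query-passage pairs gives each query N - 1 negatives at no extra encoding cost, which is the usual motivation for the in-batch strategy.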
+ ### Training Data
+
+ The model has been trained using all datasets that are cited in
+ the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model card.
+ In addition, this model has been trained on the datasets cited
+ in [this paper](https://arxiv.org/pdf/2108.13897.pdf) for the first 9 languages listed above.
+ It has also been trained on [this dataset](https://huggingface.co/datasets/clarin-knext/msmarco-pl) for Polish support, and on a translated version of msmarco-zh for Traditional Chinese support.
+
+ ### Evaluation Metrics
+
+ #### English
+
+ To determine the relevance score, we averaged the results that we obtained when evaluating on the datasets of the
+ [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in **English**.
+
+ | Dataset           | Recall@100 |
+ |:------------------|-----------:|
+ | Average           |      0.616 |
+ |                   |            |
+ | Arguana           |      0.956 |
+ | CLIMATE-FEVER     |      0.471 |
+ | DBPedia Entity    |      0.379 |
+ | FEVER             |      0.824 |
+ | FiQA-2018         |      0.642 |
+ | HotpotQA          |      0.579 |
+ | MS MARCO          |      0.850 |
+ | NFCorpus          |      0.289 |
+ | NQ                |      0.765 |
+ | Quora             |      0.993 |
+ | SCIDOCS           |      0.467 |
+ | SciFact           |      0.899 |
+ | TREC-COVID        |      0.104 |
+ | Webis-Touche-2020 |      0.407 |
+
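For readers unfamiliar with the metric, the sketch below shows how Recall@k is typically computed for a single query and how the "Average" row follows from the per-dataset values in the table above. The `recall_at_k` helper and the document IDs are hypothetical; the exact evaluation tooling is not described here, and the unweighted mean over datasets is inferred from the fact that it reproduces the reported 0.616.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant_ids:
        raise ValueError("recall is undefined without relevant documents")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# One hypothetical query: two relevant documents, one retrieved in the top 2.
example = recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=2)

# Per-dataset Recall@100 values copied from the table above.
beir_recall_at_100 = {
    "Arguana": 0.956,
    "CLIMATE-FEVER": 0.471,
    "DBPedia Entity": 0.379,
    "FEVER": 0.824,
    "FiQA-2018": 0.642,
    "HotpotQA": 0.579,
    "MS MARCO": 0.85,
    "NFCorpus": 0.289,
    "NQ": 0.765,
    "Quora": 0.993,
    "SCIDOCS": 0.467,
    "SciFact": 0.899,
    "TREC-COVID": 0.104,
    "Webis-Touche-2020": 0.407,
}

# Unweighted mean over the 14 datasets.
average = sum(beir_recall_at_100.values()) / len(beir_recall_at_100)
print(f"Average Recall@100: {average:.3f}")
```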
+ #### Traditional Chinese
+
+ This model supports Traditional Chinese, which is evaluated on the same dev set as msmarco-zh, translated into Traditional Chinese.
+
+ | Dataset                | Recall@100 |
+ |:-----------------------|-----------:|
+ | msmarco-zh-traditional |      0.738 |
+
+ In comparison, the raspberry model scores 0.693 on this dataset.
+
+ #### Other Languages
+
+ We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its
+ multilingual capabilities. Note that not all training languages are part of the benchmark, so we only report the metrics
+ for the languages it covers.
+
+ | Language             | Recall@100 |
+ |:---------------------|-----------:|
+ | French               |      0.672 |
+ | German               |      0.594 |
+ | Spanish              |      0.632 |
+ | Japanese             |      0.603 |
+ | Chinese (simplified) |      0.702 |