prithivida committed on
Commit 18555d3
1 Parent(s): 39a0847

Update README.md

Files changed (1)
  1. README.md +33 -7
README.md CHANGED
@@ -37,20 +37,25 @@ pipeline_tag: sentence-similarity
 </center>
 
 
- - [License and Terms:](#license-and-terms)
- - [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
- - [ONNX & GGUF Variants:](#detailed-comparison--our-contribution)
- - [Usage:](#usage)
+ - [License and Terms:](#license-and-terms)
+ - [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
+ - [ONNX & GGUF Variants:](#detailed-comparison--our-contribution)
+ - [Usage:](#usage)
   - [With Sentence Transformers:](#with-sentence-transformers)
   - [With Huggingface Transformers:](#with-huggingface-transformers)
+ - [FAQs](#faqs)
+ - [How can we run these models without heavy torch dependency?](#how-can-we-run-these-models-without-heavy-torch-dependency)
   - [How do I optimise vector index cost?](#how-do-i-optimise-vector-index-cost)
   - [How do I offer hybrid search to address Vocabulary Mismatch Problem?](#how-do-i-offer)
+ - [Why not run MTEB?](#why-not-run-mteb)
+ - [Roadmap](#roadmap)
   - [Notes on Reproducing:](#notes-on-reproducing)
   - [Reference:](#reference)
   - [Note on model bias](#note-on-model-bias)
 
 
- ## License and Terms:
+
+ # License and Terms:
 
 <center>
 <img src="./terms.png" width=200%/>
@@ -81,7 +86,7 @@ Full set of evaluation numbers for our model
 
 <br/>
 
- ## Usage:
+ # Usage:
 
 #### With Sentence Transformers:
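The snippet under this heading is unchanged by the commit, so the diff elides it. For orientation, here is a minimal sketch of the pattern the elided snippet follows (the repo id `prithivida/miniMiracle_te_v1` and the sample Telugu texts are assumptions, not taken from this commit):

```python
# Minimal sketch, not the README's exact snippet: encode queries and documents,
# then rank documents per query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("prithivida/miniMiracle_te_v1")  # assumed repo id

queries = ["మంచి పుస్తకం ఏది?"]                            # placeholder Telugu query
documents = ["ఇది ఒక మంచి పుస్తకం.", "వాతావరణం బాగుంది."]   # placeholder documents

query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)

# Mirrors the zip(queries, query_embeddings) loop visible in the next hunk header.
for query, query_embedding in zip(queries, query_embeddings):
    scores = util.cos_sim(query_embedding, document_embeddings)[0]
    best = int(scores.argmax())
    print(query, "->", documents[best], float(scores[best]))
```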
 
@@ -136,6 +141,11 @@ for query, query_embedding in zip(queries, query_embeddings):
 #### With Huggingface Transformers:
 - T.B.A
 
+ # FAQs
+
+ #### How can we run these models without heavy torch dependency?
+ - You can use the ONNX flavours of these models via the [FlashRetrieve](https://github.com/PrithivirajDamodaran/FlashRetrieve) library.
+
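FlashRetrieve's own API is not shown in this commit, so as an illustration only, here is a torch-free sketch built directly on `onnxruntime` and a tokenizer. The `model.onnx` path, the mean pooling, and the export step are all assumptions, not this repo's documented interface:

```python
# Illustrative sketch: embedding without torch, via an ONNX export of the model.
# Assumes the encoder was exported to model.onnx beforehand (e.g. with optimum).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer  # the tokenizer alone does not pull in torch

tokenizer = AutoTokenizer.from_pretrained("prithivida/miniMiracle_te_v1")  # assumed repo id
session = ort.InferenceSession("model.onnx")                               # assumed file name

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # Feed only the inputs this particular export declares.
    feeds = {i.name: enc[i.name] for i in session.get_inputs()}
    hidden = session.run(None, feeds)[0]                      # (batch, seq, dim)
    mask = enc["attention_mask"][..., None].astype(hidden.dtype)
    pooled = (hidden * mask).sum(axis=1) / mask.sum(axis=1)   # mean pooling (assumed)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

print(embed(["ఇది ఒక మంచి పుస్తకం."]).shape)
```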
 #### How do I optimise vector index cost?
 [Use Binary and Scalar Quantisation](https://huggingface.co/blog/embedding-quantization)
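The linked blog post is built around `quantize_embeddings` from sentence-transformers. A minimal sketch of the idea, using random placeholder vectors (the 384 dimension is an assumption) instead of real model output:

```python
# Sketch of binary / scalar (int8) quantisation to shrink a vector index.
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

# Placeholder vectors standing in for real model output.
embeddings = np.random.randn(1000, 384).astype(np.float32)

binary_embeddings = quantize_embeddings(embeddings, precision="binary")  # ~32x smaller
int8_embeddings = quantize_embeddings(embeddings, precision="int8")      # ~4x smaller

print(embeddings.nbytes, binary_embeddings.nbytes, int8_embeddings.nbytes)
```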
 
@@ -149,6 +159,22 @@ The below numbers are with the mDPR model, but miniMiracle_te_v1 should give an even
 
 *Note: The MIRACL paper shows a different (higher) value for BM25 Telugu, so we take that value from the BGE-M3 paper; all the rest are from the MIRACL paper.*
 
+ #### Why not run MTEB?
+ MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but it is currently available only for English, Chinese, French and a few other languages, not for Indic languages. Besides, like BGE-M3, the miniMiracle models are predominantly tuned for retrieval tasks aimed at search & IR use cases.
+ At the moment MIRACL is the gold standard for a subset of Indic languages.
+
+
+ # Roadmap
+ We will add miniMiracle models for all popular languages in phases, as we see fit or based on community requests. Some of the languages on our list are:
+
+ - Spanish
+ - Tamil
+ - Arabic
+ - German
+ - English?
+
+
 # Notes on reproducing:
 
 We welcome anyone to reproduce our results. Here are some tips and observations:
@@ -166,7 +192,7 @@ Here are our numbers for the full Hindi run on BGE-M3
 {'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
 ```
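For readers reproducing these figures, MRR@k itself is simple to compute. Below is a small self-contained helper; the `run` and `qrels` dicts are hypothetical stand-ins for a real MIRACL run and its relevance judgments:

```python
# Hypothetical sketch of the MRR@k metric behind the numbers above.
# run:   query_id -> list of doc_ids sorted by descending score
# qrels: query_id -> set of relevant doc_ids

def mrr_at_k(run, qrels, k):
    total = 0.0
    for qid, ranked in run.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in qrels.get(qid, set()):
                total += 1.0 / rank   # reciprocal rank of first relevant hit
                break
    return total / len(run)

run = {"q1": ["d3", "d1", "d2"]}      # toy example
qrels = {"q1": {"d1"}}
print(mrr_at_k(run, qrels, 10))       # 0.5
```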
 
- Fair warning BGE-M3 is $ expensive to evaluate, probably that's why it's not part of any of the MTEB benchmarks.
+ Fair warning: BGE-M3 is expensive ($) to evaluate, which is probably why it's not part of the retrieval slice of the MTEB benchmarks.
 
 # Reference:
 - [All Cohere numbers are copied from here](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12)
 