---
license: cc-by-nc-nd-4.0
language:
- te
datasets:
- MIRACL
tags:
- miniMiracle
- passage-retrieval
- knowledge-distillation
- middle-training
pretty_name: >-
  miniMiracle is a family of high-quality, lightweight, and easy-to-deploy
  multilingual embedders / retrievers, primarily focused on Indo-Aryan and
  Dravidian languages.
library_name: transformers
pipeline_tag: sentence-similarity
---

<center>
<img src="./logo.png" width=150/>
<img src="./te_intro.png" width=120%/>
</center>

<center>
<img src="./te_metrics_1.png" width=90%/>
<b><p>Table 1: Telugu retrieval performance on the MIRACL dev set (measured by nDCG@10)</p></b>
</center>

<br/>

<center>
<h1> Table Of Contents </h1>
</center>

- [License and Terms:](#license-and-terms)
- [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
- [ONNX & GGUF Variants:](#detailed-comparison--our-contribution)
- [Usage:](#usage)
- [With Sentence Transformers:](#with-sentence-transformers)
- [With Huggingface Transformers:](#with-huggingface-transformers)
- [How do I optimise vector index cost?](#how-do-i-optimise-vector-index-cost)
- [How do I offer hybrid search to address Vocabulary Mismatch Problem?](#how-do-i-offer-hybrid-search-to-address-vocabulary-mismatch-problem)
- [Notes on Reproducing:](#notes-on-reproducing)
- [Reference:](#reference)
- [Note on model bias](#note-on-model-bias)

## License and Terms:

<center>
<img src="./terms.png" width=200%/>
</center>

## Detailed comparison & Our Contribution:

English famously has the **all-minilm** series of models, which are great for quick experimentation and for certain production workloads. The idea is to offer the same for other popular languages, starting with Indo-Aryan and Dravidian languages. Our contribution is high-quality models that are easy to serve and whose embeddings are cheap to store, without ANY pretraining or expensive fine-tuning. For instance, the **all-minilm** models are fine-tuned on 1 billion pairs. We offer a very lean model, but with a huge vocabulary of around 250K tokens.
We will add more details here.

<center>
<img src="./te_metrics_2.png" width=120%/>
<b><p>Table 2: Detailed Telugu retrieval performance on the MIRACL dev set (measured by nDCG@10)</p></b>
</center>

The full set of evaluation numbers for our model:

```python
{'NDCG@1': 0.45773, 'NDCG@3': 0.58701, 'NDCG@5': 0.60938, 'NDCG@10': 0.63416, 'NDCG@100': 0.66138, 'NDCG@1000': 0.6682}
{'MAP@1': 0.45129, 'MAP@3': 0.55509, 'MAP@5': 0.56774, 'MAP@10': 0.57728, 'MAP@100': 0.58319, 'MAP@1000': 0.58346}
{'Recall@10': 0.79247, 'Recall@50': 0.89936, 'Recall@100': 0.93639, 'Recall@200': 0.96276, 'Recall@500': 0.97967, 'Recall@1000': 0.98933}
{'P@1': 0.45773, 'P@3': 0.22947, 'P@5': 0.14903, 'P@10': 0.08152, 'P@100': 0.00965, 'P@1000': 0.00102}
{'MRR@10': 0.5813, 'MRR@100': 0.58704, 'MRR@1000': 0.58729}
```

<br/>

## Usage:

#### With Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer
import scipy.spatial

model = SentenceTransformer('prithivida/miniMiracle_te_v1')

corpus = [
    'ఒక వ్యక్తి ఆహారం తింటున్నాడు.',
    'ముఖ్యాలు బ్రెడ్ ముక్కను తింటున్నారు.',
    'అమ్మాయి ఒక బిడ్డను ఎత్తుకుందు.',
    'ఒక వ్యక్తి గుర్రం మీద సవారీ చేస్తున్నాడు.',
    'ఒక మహిళ వయోలిన్ వాయిస్తోంది.',
    'రెండు వ్యక్తులు అడవిలో కారును తోస్తున్నారు.',
    'ఒక వ్యక్తి ఒక తెల్ల గుర్రం మీద ఒక మూసిన ప్రదేశంలో సవారీ చేస్తున్నాడు.',
    'ఒక కోతి డ్రమ్ వాయిస్తోంది.',
    'ఒక చిరుత తన వేట వెనుక పరుగెడుతోంది.',
    'ఒక పెద్ద విందు ఉంది.'
]

queries = [
    'ఒక వ్యక్తి పాస్తా తింటున్నాడు.',
    'ఒక గొరిల్లా సూట్ ధరించిన వ్యక్తి డ్రమ్ వాయిస్తోంది.'
]

corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode(queries)

# Find the closest 3 sentences in the corpus for each query, based on cosine similarity
closest_n = 3
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Rank corpus sentences by ascending cosine distance
    results = sorted(zip(range(len(distances)), distances), key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print(f"\nTop {closest_n} most similar sentences in corpus:\n")

    for idx, distance in results[:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))

# Optional: quantize the embeddings to cut index storage cost
# from sentence_transformers.quantization import quantize_embeddings
# binary_embeddings = quantize_embeddings(corpus_embeddings, precision="ubinary")
```

#### With Huggingface Transformers:
- T.B.A; in the meantime, a hedged sketch follows below.
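
Until the official snippet is added, here is a minimal sketch with plain `transformers`, assuming CLS pooling and inner product as stated in the [Notes on Reproducing](#notes-on-reproducing) section; treat it as an illustration, not a verified recipe.

```python
# Hedged sketch, not the official snippet: plain transformers with CLS pooling
# and inner product, following the "Notes on Reproducing" section of this card.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('prithivida/miniMiracle_te_v1')
model = AutoModel.from_pretrained('prithivida/miniMiracle_te_v1')

sentences = ['ఒక వ్యక్తి ఆహారం తింటున్నాడు.', 'ఒక వ్యక్తి పాస్తా తింటున్నాడు.']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# CLS pooling: take the hidden state of the first token of each sequence
embeddings = outputs.last_hidden_state[:, 0]

# Inner product between the two sentence embeddings
print((embeddings[0] @ embeddings[1]).item())
```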

#### How do I optimise vector index cost?
[Use Binary and Scalar Quantisation](https://huggingface.co/blog/embedding-quantization)
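
As a hedged illustration of the linked recipe, `sentence-transformers` ships a `quantize_embeddings` helper; the one-sentence corpus here is only a placeholder.

```python
# Sketch: shrink the index with binary quantisation (see the linked blog post).
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer('prithivida/miniMiracle_te_v1')
embeddings = model.encode(['ఒక వ్యక్తి ఆహారం తింటున్నాడు.'])

# 'ubinary' packs each dimension into a single bit (~32x smaller than float32);
# precision="int8" gives scalar quantisation instead.
binary_embeddings = quantize_embeddings(embeddings, precision="ubinary")
print(embeddings.shape, binary_embeddings.shape)
```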

#### How do I offer hybrid search to address Vocabulary Mismatch Problem?
The MIRACL paper shows that simply combining with BM25 is a good starting point for a hybrid option.
The numbers below are with the mDPR model, but miniMiracle_te_v1 should give even better hybrid performance.

| Language | ISO | nDCG@10 BM25 | nDCG@10 mDPR | nDCG@10 Hybrid |
|-----------|-----|--------------|--------------|----------------|
| **Telugu** | **te** | **0.383** | **0.356** | **0.602** |

*Note: The MIRACL paper reports a different (higher) BM25 value for Telugu, so we take that value from the BGE-M3 paper; all the rest are from the MIRACL paper.*
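
As one illustration of such a hybrid, here is a minimal sketch of score-level fusion. It assumes the third-party `rank_bm25` package for the sparse side; the whitespace tokenisation, min-max normalisation, and fusion weight are illustrative assumptions, not the exact MIRACL setup.

```python
# Hedged sketch: fuse BM25 and dense scores with a weighted sum.
# rank_bm25, whitespace tokenisation, min-max normalisation, and alpha are
# illustrative choices, not the MIRACL paper's exact hybrid recipe.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = ['ఒక వ్యక్తి ఆహారం తింటున్నాడు.', 'ఒక మహిళ వయోలిన్ వాయిస్తోంది.']
query = 'ఒక వ్యక్తి పాస్తా తింటున్నాడు.'

# Sparse side: BM25 over whitespace-tokenised documents
bm25 = BM25Okapi([doc.split() for doc in corpus])
sparse = np.array(bm25.get_scores(query.split()))

# Dense side: cosine similarity from the embedder (normalised dot product)
model = SentenceTransformer('prithivida/miniMiracle_te_v1')
doc_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
dense = doc_emb @ query_emb

def minmax(x):
    # Guard against a zero range when all scores are equal
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

alpha = 0.5  # illustrative weight between sparse and dense scores
hybrid = alpha * minmax(sparse) + (1 - alpha) * minmax(dense)
print(hybrid.argsort()[::-1])  # corpus indices, best match first
```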

## Notes on Reproducing:

We welcome anyone to reproduce our results. Here are some tips and observations:

- Use CLS Pooling and Inner Product.
- There *may be* minor differences in the numbers when reproducing; for instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi, while we observed only 58.9.

Here are our numbers for the full Hindi run with BGE-M3:

```python
{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
```

Fair warning: BGE-M3 is expensive ($) to evaluate, which is probably why it is not part of any of the MTEB benchmarks.

## Reference:
- [All Cohere numbers are copied from here](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12)

## Note on model bias:
- Like any model, this one may carry inherent biases from the base models and the datasets it was pretrained and fine-tuned on. Please use responsibly.