radoslavralev committed · Commit f6063b5 · verified · 1 Parent(s): ea194e5

Add new SentenceTransformer model

Files changed (2)
  1. README.md +75 -84
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,50 +12,53 @@ tags:
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
- - dataset_size:1451941
16
- - loss:MultipleNegativesRankingLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
- - source_sentence: Gocharya ji authored Krishna Cahrit Manas in the poetic form describing
20
- about the full life of Lord Krishna ( from birth to Nirvana ) .
21
  sentences:
22
- - 'Q: Can I buy coverage for prescription drugs right away?'
23
- - Krishna Cahrit Manas in poetic form , describing the full life of Lord Krishna
24
- ( from birth to nirvana ) , wrote Gocharya ji .
25
- - Baron played actress Violet Carson who portrayed Ena Sharples in the soap .
26
- - source_sentence: The Kilkenny line only reached Maryborough in 1867 .
27
  sentences:
28
- - It was also known formerly as ' Crotto ' .
29
- - The line from Maryborough only reached Kilkenny in 1867 .
30
- - The line from Kilkenny only reached Maryborough in 1867 .
31
- - source_sentence: Tokelau International Netball Team represents Tokelau in the national
32
- netball .
33
  sentences:
34
- - Ernest Dewey Albinson ( 1898 in Minneapolis , Minnesota - 1971 in Mexico ) was
35
- an American artist .
36
- - The Tokelau national netball team represents Tokelau in international netball
37
- .
38
- - The Tokelau international netball team represents Tokelau in national netball
39
- .
40
- - source_sentence: The real number is called the `` imaginary part `` of the real
41
- number ; the real number is called the `` complex part `` of .
42
  sentences:
43
- - The school board consists of Robbie Sanders , Bryan Richards , Linda Fullingim
44
- , Lori Lambert , & Kelly Teague .
45
- - Which web design company has the best templates?
46
- - The real number is called the `` imaginary part `` of the real number , the real
47
- number of `` complex part `` of .
48
- - source_sentence: All For You was the third and last single of Kate Ryan 's third
49
- album `` Alive `` .
50
  sentences:
51
- - According to John Keay , he was `` country bred `` ( born and educated in India
52
- ) .
53
- - All For You was the third single of the third and last album `` Alive `` by Kate
54
- Ryan .
55
- - All For You was the third and last single of the third album of Kate Ryan `` Alive
56
- `` .
57
  datasets:
58
- - redis/langcache-sentencepairs-v1
59
  pipeline_tag: sentence-similarity
60
  library_name: sentence-transformers
61
  metrics:
@@ -72,32 +75,32 @@ model-index:
72
  type: information-retrieval
73
  name: Information Retrieval
74
  dataset:
75
- name: train
76
- type: train
77
  metrics:
78
  - type: cosine_accuracy@1
79
- value: 0.5578696687594717
80
  name: Cosine Accuracy@1
81
  - type: cosine_precision@1
82
- value: 0.5578696687594717
83
  name: Cosine Precision@1
84
  - type: cosine_recall@1
85
- value: 0.53589188426978
86
  name: Cosine Recall@1
87
  - type: cosine_ndcg@10
88
- value: 0.7523955452910316
89
  name: Cosine Ndcg@10
90
  - type: cosine_mrr@1
91
- value: 0.5578696687594717
92
  name: Cosine Mrr@1
93
  - type: cosine_map@100
94
- value: 0.6976030263836698
95
  name: Cosine Map@100
96
  ---
97
 
98
  # Redis fine-tuned BiEncoder model for semantic caching on LangCache
99
 
100
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) on the [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for sentence pair similarity.
101
 
102
  ## Model Details
103
 
@@ -108,7 +111,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [A
108
  - **Output Dimensionality:** 768 dimensions
109
  - **Similarity Function:** Cosine Similarity
110
  - **Training Dataset:**
111
- - [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
112
  - **Language:** en
113
  - **License:** apache-2.0
114
 
@@ -145,9 +148,9 @@ from sentence_transformers import SentenceTransformer
145
  model = SentenceTransformer("redis/langcache-embed-v3")
146
  # Run inference
147
  sentences = [
148
- "All For You was the third and last single of Kate Ryan 's third album `` Alive `` .",
149
- 'All For You was the third and last single of the third album of Kate Ryan `` Alive `` .',
150
- 'All For You was the third single of the third and last album `` Alive `` by Kate Ryan .',
151
  ]
152
  embeddings = model.encode(sentences)
153
  print(embeddings.shape)
@@ -156,9 +159,9 @@ print(embeddings.shape)
156
  # Get the similarity scores for the embeddings
157
  similarities = model.similarity(embeddings, embeddings)
158
  print(similarities)
159
- # tensor([[0.9961, 0.9922, 0.9961],
160
- # [0.9922, 1.0000, 0.9922],
161
- # [0.9961, 0.9922, 1.0078]], dtype=torch.bfloat16)
162
  ```
163
 
164
  <!--
@@ -191,17 +194,17 @@ You can finetune this model on your own dataset.
191
 
192
  #### Information Retrieval
193
 
194
- * Dataset: `train`
195
  * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
196
 
197
- | Metric | Value |
198
- |:-------------------|:-----------|
199
- | cosine_accuracy@1 | 0.5579 |
200
- | cosine_precision@1 | 0.5579 |
201
- | cosine_recall@1 | 0.5359 |
202
- | **cosine_ndcg@10** | **0.7524** |
203
- | cosine_mrr@1 | 0.5579 |
204
- | cosine_map@100 | 0.6976 |
205
 
206
  <!--
207
  ## Bias, Risks and Limitations
@@ -221,21 +224,21 @@ You can finetune this model on your own dataset.
221
 
222
  #### LangCache Sentence Pairs (all)
223
 
224
- * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
225
- * Size: 109,885 training samples
226
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
227
  * Approximate statistics based on the first 1000 samples:
228
  | | anchor | positive | negative |
229
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
230
  | type | string | string | string |
231
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.47 tokens</li><li>max: 61 tokens</li></ul> |
232
  * Samples:
233
  | anchor | positive | negative |
234
  |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
235
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
236
  | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
237
  | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
238
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
239
  ```json
240
  {
241
  "scale": 20.0,
@@ -248,21 +251,21 @@ You can finetune this model on your own dataset.
248
 
249
  #### LangCache Sentence Pairs (all)
250
 
251
- * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
252
- * Size: 109,885 evaluation samples
253
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
254
  * Approximate statistics based on the first 1000 samples:
255
  | | anchor | positive | negative |
256
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
257
  | type | string | string | string |
258
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.47 tokens</li><li>max: 61 tokens</li></ul> |
259
  * Samples:
260
  | anchor | positive | negative |
261
  |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
262
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
263
  | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
264
  | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
265
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
266
  ```json
267
  {
268
  "scale": 20.0,
@@ -272,9 +275,9 @@ You can finetune this model on your own dataset.
272
  ```
273
 
274
  ### Training Logs
275
- | Epoch | Step | train_cosine_ndcg@10 |
276
- |:-----:|:----:|:--------------------:|
277
- | -1 | -1 | 0.7524 |
278
 
279
 
280
  ### Framework Versions
@@ -303,18 +306,6 @@ You can finetune this model on your own dataset.
303
  }
304
  ```
305
 
306
- #### MultipleNegativesRankingLoss
307
- ```bibtex
308
- @misc{henderson2017efficient,
309
- title={Efficient Natural Language Response Suggestion for Smart Reply},
310
- author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
311
- year={2017},
312
- eprint={1705.00652},
313
- archivePrefix={arXiv},
314
- primaryClass={cs.CL}
315
- }
316
- ```
317
-
318
  <!--
319
  ## Glossary
320
 
 
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
+ - dataset_size:3119809
16
+ - loss:ArcFaceInBatchLoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
+ - source_sentence: Hayley Vaughan portrayed Ripa on the ABC daytime soap opera , ``
20
+ All My Children `` , between 1990 and 2002 .
21
  sentences:
22
+ - Traxxpad is a music application for Sony 's PlayStation Portable published by
23
+ Definitive Studios and developed by Eidos Interactive .
24
+ - Between 1990 and 2002 , Hayley Vaughan Ripa portrayed in the ABC soap opera ``
25
+ All My Children `` .
26
+ - Between 1990 and 2002 , Ripa Hayley portrayed Vaughan in the ABC soap opera ``
27
+ All My Children `` .
28
+ - source_sentence: Olivella monilifera is a species of dwarf sea snail , small gastropod
29
+ mollusk in the family Olivellidae , the marine olives .
30
  sentences:
31
+ - Olivella monilifera is a species of the dwarf - sea snail , small gastropod mollusk
32
+ in the Olivellidae family , the marine olives .
33
+ - He was cut by the Browns after being signed by the Bills in 2013 . He was later
34
+ released .
35
+ - Olivella monilifera is a kind of sea snail , marine gastropod mollusk in the Olivellidae
36
+ family , the dwarf olives .
37
+ - source_sentence: Hayashi said that Mackey `` is a sort of `` of the original model
38
+ for Tenchi .
39
  sentences:
40
+ - In the summer of 2009 , Ellick shot a documentary about Malala Yousafzai .
41
+ - Hayashi said that Mackey is `` sort of `` the original model for Tenchi .
42
+ - Mackey said that Hayashi is `` sort of `` the original model for Tenchi .
43
+ - source_sentence: Much of the film was shot on location in Los Angeles and in nearby
44
+ Burbank and Glendale .
45
  sentences:
46
+ - Much of the film was shot on location in Los Angeles and in nearby Burbank and
47
+ Glendale .
48
+ - Much of the film was shot on site in Burbank and Glendale and in the nearby Los
49
+ Angeles .
50
+ - Traxxpad is a music application for the Sony PlayStation Portable developed by
51
+ the Definitive Studios and published by Eidos Interactive .
52
+ - source_sentence: According to him , the earth is the carrier of his artistic work
53
+ , which is only integrated into the creative process by minimal changes .
54
  sentences:
55
+ - National players are Bold players .
56
+ - According to him , earth is the carrier of his artistic work being integrated
57
+ into the creative process only by minimal changes .
58
+ - According to him , earth is the carrier of his creative work being integrated
59
+ into the artistic process only by minimal changes .
60
  datasets:
61
+ - redis/langcache-sentencepairs-v2
62
  pipeline_tag: sentence-similarity
63
  library_name: sentence-transformers
64
  metrics:
 
75
  type: information-retrieval
76
  name: Information Retrieval
77
  dataset:
78
+ name: test
79
+ type: test
80
  metrics:
81
  - type: cosine_accuracy@1
82
+ value: 0.5861241448475948
83
  name: Cosine Accuracy@1
84
  - type: cosine_precision@1
85
+ value: 0.5861241448475948
86
  name: Cosine Precision@1
87
  - type: cosine_recall@1
88
+ value: 0.5679885764966713
89
  name: Cosine Recall@1
90
  - type: cosine_ndcg@10
91
+ value: 0.7729838064849864
92
  name: Cosine Ndcg@10
93
  - type: cosine_mrr@1
94
+ value: 0.5861241448475948
95
  name: Cosine Mrr@1
96
  - type: cosine_map@100
97
+ value: 0.7216697804426214
98
  name: Cosine Map@100
99
  ---
100
 
101
  # Redis fine-tuned BiEncoder model for semantic caching on LangCache
102
 
103
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) on the [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for sentence pair similarity.
104
 
105
  ## Model Details
106
 
 
111
  - **Output Dimensionality:** 768 dimensions
112
  - **Similarity Function:** Cosine Similarity
113
  - **Training Dataset:**
114
+ - [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
115
  - **Language:** en
116
  - **License:** apache-2.0
117
 
 
148
  model = SentenceTransformer("redis/langcache-embed-v3")
149
  # Run inference
150
  sentences = [
151
+ 'According to him , the earth is the carrier of his artistic work , which is only integrated into the creative process by minimal changes .',
152
+ 'According to him , earth is the carrier of his artistic work being integrated into the creative process only by minimal changes .',
153
+ 'According to him , earth is the carrier of his creative work being integrated into the artistic process only by minimal changes .',
154
  ]
155
  embeddings = model.encode(sentences)
156
  print(embeddings.shape)
 
159
  # Get the similarity scores for the embeddings
160
  similarities = model.similarity(embeddings, embeddings)
161
  print(similarities)
162
+ # tensor([[1.0000, 0.9961, 0.9922],
163
+ # [0.9961, 1.0000, 0.9961],
164
+ # [0.9922, 0.9961, 0.9961]], dtype=torch.bfloat16)
165
  ```
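
Because the model card targets semantic caching, similarity scores like the ones printed above would typically be compared against a threshold to decide whether a new query can reuse a cached answer. The following is a minimal sketch in plain Python, assuming the embeddings have already been computed (for example with `model.encode`). The `best_cache_hit` helper, the toy 3-dimensional vectors, and the 0.9 threshold are illustrative assumptions and do not come from the model card.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_cache_hit(query_vec, cached_vecs, threshold=0.9):
    """Return (index, score) of the most similar cached vector,
    or (None, score) when no cached vector reaches the threshold."""
    best_idx, best_score = None, -1.0
    for idx, vec in enumerate(cached_vecs):
        score = cosine(query_vec, vec)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_score < threshold:
        return None, best_score
    return best_idx, best_score


# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
cached = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hit, score = best_cache_hit([0.9, 0.1, 0.0], cached)           # close to entry 0
miss, miss_score = best_cache_hit([0.5, 0.5, 0.7], cached)      # close to nothing
```

In the printed matrix the hard negative still scores about 0.99 against the anchor, so any real threshold would need to be calibrated on held-out pairs rather than taken from this sketch.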
166
 
167
  <!--
 
194
 
195
  #### Information Retrieval
196
 
197
+ * Dataset: `test`
198
  * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
199
 
200
+ | Metric | Value |
201
+ |:-------------------|:----------|
202
+ | cosine_accuracy@1 | 0.5861 |
203
+ | cosine_precision@1 | 0.5861 |
204
+ | cosine_recall@1 | 0.568 |
205
+ | **cosine_ndcg@10** | **0.773** |
206
+ | cosine_mrr@1 | 0.5861 |
207
+ | cosine_map@100 | 0.7217 |
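
As a rough illustration of what the metrics in this table measure, the sketch below computes accuracy@k and NDCG@k for binary relevance using the usual textbook definitions. It is not the `InformationRetrievalEvaluator` implementation; the function names and the toy rankings are hypothetical.

```python
import math


def accuracy_at_k(ranked_relevance, k):
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0


def ndcg_at_k(ranked_relevance, k, num_relevant):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the DCG
    of an ideal ranking that places all relevant items first."""
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(ranked_relevance[:k]))
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, num_relevant)))
    return dcg / ideal if ideal > 0 else 0.0


# Each list marks whether the result at that rank is relevant (1) or not (0);
# the second element is the total number of relevant items for the query.
queries = [
    ([1, 0, 0], 1),  # relevant item ranked first
    ([0, 1, 0], 1),  # relevant item ranked second
]
mean_acc1 = sum(accuracy_at_k(r, 1) for r, _ in queries) / len(queries)
mean_ndcg10 = sum(ndcg_at_k(r, 10, n) for r, n in queries) / len(queries)
```

Under these definitions a query whose correct match is ranked second contributes nothing to accuracy@1 but still earns partial credit in NDCG@10, which is consistent with the table's NDCG@10 being higher than its accuracy@1.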
208
 
209
  <!--
210
  ## Bias, Risks and Limitations
 
224
 
225
  #### LangCache Sentence Pairs (all)
226
 
227
+ * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
228
+ * Size: 126,938 training samples
229
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
230
  * Approximate statistics based on the first 1000 samples:
231
  | | anchor | positive | negative |
232
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
233
  | type | string | string | string |
234
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
235
  * Samples:
236
  | anchor | positive | negative |
237
  |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
238
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
239
  | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
240
  | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
241
+ * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
242
  ```json
243
  {
244
  "scale": 20.0,
 
251
 
252
  #### LangCache Sentence Pairs (all)
253
 
254
+ * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
255
+ * Size: 126,938 evaluation samples
256
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
257
  * Approximate statistics based on the first 1000 samples:
258
  | | anchor | positive | negative |
259
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
260
  | type | string | string | string |
261
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
262
  * Samples:
263
  | anchor | positive | negative |
264
  |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
265
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
266
  | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
267
  | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
268
+ * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
269
  ```json
270
  {
271
  "scale": 20.0,
 
275
  ```
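
This revision replaces `MultipleNegativesRankingLoss` with `losses.ArcFaceInBatchLoss`. The diff does not show that class's implementation, so the sketch below is only a guess at the general idea its name suggests: an ArcFace-style additive angular margin applied to the positive pair inside an in-batch softmax. The function name and the `margin` value are assumptions; only `"scale": 20.0` appears in the diff.

```python
import math


def arcface_in_batch_loss(sim, scale=20.0, margin=0.1):
    """Mean cross-entropy over the rows of a cosine-similarity matrix `sim`.

    sim[i][i] is the anchor/positive pair; the other entries in row i act
    as in-batch negatives. The margin (hypothetical value) is added to the
    angle of the positive pair only, which lowers its logit.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = []
        for j, cos_ij in enumerate(row):
            cos_ij = max(-1.0, min(1.0, cos_ij))  # keep acos in its domain
            if i == j:
                cos_ij = math.cos(math.acos(cos_ij) + margin)
            logits.append(scale * cos_ij)
        # Numerically stable negative log-softmax of the positive logit.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum - logits[i]
    return total / len(sim)


sim = [[0.9, 0.2], [0.1, 0.8]]
with_margin = arcface_in_batch_loss(sim, margin=0.1)
without_margin = arcface_in_batch_loss(sim, margin=0.0)
```

With `margin=0.0` this reduces to a plain scaled in-batch softmax over cosine similarities; a positive margin makes the same batch harder, so the loss increases.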
276
 
277
  ### Training Logs
278
+ | Epoch | Step | test_cosine_ndcg@10 |
279
+ |:-----:|:----:|:-------------------:|
280
+ | -1 | -1 | 0.7730 |
281
 
282
 
283
  ### Framework Versions
 
306
  }
307
  ```
308
 
309
  <!--
310
  ## Glossary
311
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e5751153159fbe8c5f0461b1438392077d8a78eb76cea3ef88b53116443d1e6a
3
  size 298041696
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
3
  size 298041696