ngiometti committed
Commit 6734d42 · verified · 1 Parent(s): c0ee43d

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,727 @@
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:156
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: How is the author planning to utilize prompts in their Datasette project?
  sentences:
  - 'January


    7th: It’s OK to call it Artificial Intelligence


    9th: What I should have said about the term Artificial Intelligence


    17th: Talking about Open Source LLMs on Oxide and Friends


    26th: LLM 0.13: The annotated release notes




    February


    21st: The killer app of Gemini Pro 1.5 is video




    March


    5th: Prompt injection and jailbreaking are not the same thing


    8th: The GPT-4 barrier has finally been broken


    22nd: Claude and ChatGPT for ad-hoc sidequests


    23rd: Building and testing C extensions for SQLite with ChatGPT Code Interpreter


    26th: llm cmd undo last git commit—a new plugin for LLM




    April


    8th: Building files-to-prompt entirely using Claude 3 Opus


    10th: Three major LLM releases in 24 hours (plus weeknotes)'
  - 'Then in December, the Chatbot Arena team introduced a whole new leaderboard for this feature, driven by users building the same interactive app twice with two different models and voting on the answer. Hard to come up with a more convincing argument that this feature is now a commodity that can be effectively implemented against all of the leading models.

    I’ve been tinkering with a version of this myself for my Datasette project, with the goal of letting users use prompts to build and iterate on custom widgets and data visualizations against their own data. I also figured out a similar pattern for writing one-shot Python programs, enabled by uv.'
  - 'Another common technique is to use larger models to help create training data for their smaller, cheaper alternatives—a trick used by an increasing number of labs. DeepSeek v3 used “reasoning” data created by DeepSeek-R1. Meta’s Llama 3.3 70B fine-tuning used over 25M synthetically generated examples.

    Careful design of the training data that goes into an LLM appears to be the entire game for creating these models. The days of just grabbing a full scrape of the web and indiscriminately dumping it into a training run are long gone.

    LLMs somehow got even harder to use'
- source_sentence: What are the potential pitfalls of using LLMs as power-user tools?
  sentences:
  - 'Another common technique is to use larger models to help create training data for their smaller, cheaper alternatives—a trick used by an increasing number of labs. DeepSeek v3 used “reasoning” data created by DeepSeek-R1. Meta’s Llama 3.3 70B fine-tuning used over 25M synthetically generated examples.

    Careful design of the training data that goes into an LLM appears to be the entire game for creating these models. The days of just grabbing a full scrape of the web and indiscriminately dumping it into a training run are long gone.

    LLMs somehow got even harder to use'
  - 'A drum I’ve been banging for a while is that LLMs are power-user tools—they’re chainsaws disguised as kitchen knives. They look deceptively simple to use—how hard can it be to type messages to a chatbot?—but in reality you need a huge depth of both understanding and experience to make the most of them and avoid their many pitfalls.

    If anything, this problem got worse in 2024.

    We’ve built computer systems you can talk to in human language, that will answer your questions and usually get them right! ... depending on the question, and how you ask it, and whether it’s accurately reflected in the undocumented and secret training set.'
  - 'These abilities are just a few weeks old at this point, and I don’t think their impact has been fully felt yet. If you haven’t tried them out yet you really should.

    Both Gemini and OpenAI offer API access to these features as well. OpenAI started with a WebSocket API that was quite challenging to use, but in December they announced a new WebRTC API which is much easier to get started with. Building a web app that a user can talk to via voice is easy now!

    Prompt driven app generation is a commodity already

    This was possible with GPT-4 in 2023, but the value it provides became evident in 2024.'
- source_sentence: What challenges are associated with using LLMs in the year of slop?
  sentences:
  - 'So far, I think they’re a net positive. I’ve used them on a personal level to improve my productivity (and entertain myself) in all sorts of different ways. I think people who learn how to use them effectively can gain a significant boost to their quality of life.

    A lot of people are yet to be sold on their value! Some think their negatives outweigh their positives, some think they are all hot air, and some even think they represent an existential threat to humanity.

    They’re actually quite easy to build

    The most surprising thing we’ve learned about LLMs this year is that they’re actually quite easy to build.'
  - 'The year of slop

    Synthetic training data works great

    LLMs somehow got even harder to use

    Knowledge is incredibly unevenly distributed

    LLMs need better criticism

    Everything tagged “llms” on my blog in 2024'
  - 'Meta’s Llama 3.2 models deserve a special mention. They may not be GPT-4 class, but at 1B and 3B sizes they punch massively above their weight. I run Llama 3.2 3B on my iPhone using the free MLC Chat iOS app and it’s a shockingly capable model for its tiny (<2GB) size. Try firing it up and asking it for “a plot outline of a Netflix Christmas movie where a data journalist falls in love with a local ceramacist”. Here’s what I got, at a respectable 20 tokens per second:'
- source_sentence: What capabilities does Google’s Gemini have regarding audio input and output?
  sentences:
  - 'There’s a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!

    There is so much space for helpful education content here, but we need to do do a lot better than outsourcing it all to AI grifters with bombastic Twitter threads.

    Knowledge is incredibly unevenly distributed

    Most people have heard of ChatGPT by now. How many have heard of Claude?'
  - 'There’s still plenty to worry about with respect to the environmental impact of the great AI datacenter buildout, but a lot of the concerns over the energy cost of individual prompts are no longer credible.

    Here’s a fun napkin calculation: how much would it cost to generate short descriptions of every one of the 68,000 photos in my personal photo library using Google’s Gemini 1.5 Flash 8B (released in October), their cheapest model?

    Each photo would need 260 input tokens and around 100 output tokens.

    260 * 68,000 = 17,680,000 input tokens

    17,680,000 * $0.0375/million = $0.66

    100 * 68,000 = 6,800,000 output tokens

    6,800,000 * $0.15/million = $1.02'
  - 'Your browser does not support the audio element.


    OpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also accepts audio input, and the Google Gemini apps can speak in a similar way to ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s meant to roll out in Q1 of 2025.

    Google’s NotebookLM, released in September, took audio output to a new level by producing spookily realistic conversations between two “podcast hosts” about anything you fed into their tool. They later added custom instructions, so naturally I turned them into pelicans:



    Your browser does not support the audio element.'
- source_sentence: What improvements were noted in the intonation of ChatGPT Advanced Voice mode during its rollout?
  sentences:
  - 'When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I’ve been using it extensively on walks with my dog and it’s amazing how much the improvement in intonation elevates the material. I’ve also had a lot of fun experimenting with the OpenAI audio APIs.

    Even more fun: Advanced Voice mode can do accents! Here’s what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.'
  - 'When @v0 first came out we were paranoid about protecting the prompt with all kinds of pre and post processing complexity.

    We completely pivoted to let it rip. A prompt without the evals, models, and especially UX is like getting a broken ASML machine without a manual'
  - 'January


    7th: It’s OK to call it Artificial Intelligence


    9th: What I should have said about the term Artificial Intelligence


    17th: Talking about Open Source LLMs on Oxide and Friends


    26th: LLM 0.13: The annotated release notes




    February


    21st: The killer app of Gemini Pro 1.5 is video




    March


    5th: Prompt injection and jailbreaking are not the same thing


    8th: The GPT-4 barrier has finally been broken


    22nd: Claude and ChatGPT for ad-hoc sidequests


    23rd: Building and testing C extensions for SQLite with ChatGPT Code Interpreter


    26th: llm cmd undo last git commit—a new plugin for LLM




    April


    8th: Building files-to-prompt entirely using Claude 3 Opus


    10th: Three major LLM releases in 24 hours (plus weeknotes)'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: cosine_accuracy@1
      value: 0.75
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 1.0
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 1.0
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 1.0
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.75
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3333333333333333
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.20000000000000004
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.10000000000000002
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.75
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 1.0
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 1.0
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 1.0
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.8968216255952429
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.861111111111111
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.8611111111111112
      name: Cosine Map@100
---

# SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
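
For readers who want to see how this three-module stack maps onto the `sentence-transformers` API, here is a minimal sketch that assembles the same architecture by hand. Note that it starts from the base checkpoint, so it only reconstructs the module layout; to get the fine-tuned weights, load this repository directly as shown in the Usage section below.

```python
from sentence_transformers import SentenceTransformer, models

# Rebuild the Transformer -> CLS pooling -> Normalize stack described above.
transformer = models.Transformer("Snowflake/snowflake-arctic-embed-l", max_seq_length=512)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 1024
    pooling_mode="cls",                          # matches 1_Pooling/config.json
)
model = SentenceTransformer(modules=[transformer, pooling, models.Normalize()])
```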

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ngiometti/legal-ft-2")
# Run inference
sentences = [
    'What improvements were noted in the intonation of ChatGPT Advanced Voice mode during its rollout?',
    'When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I’ve been using it extensively on walks with my dog and it’s amazing how much the improvement in intonation elevates the material. I’ve also had a lot of fun experimenting with the OpenAI audio APIs.\nEven more fun: Advanced Voice mode can do accents! Here’s what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.',
    'January\n\n7th: It’s OK to call it Artificial Intelligence\n\n9th: What I should have said about the term Artificial Intelligence\n\n17th: Talking about Open Source LLMs on Oxide and Friends\n\n26th: LLM 0.13: The annotated release notes\n\n\n\nFebruary\n\n21st: The killer app of Gemini Pro 1.5 is video\n\n\n\nMarch\n\n5th: Prompt injection and jailbreaking are not the same thing\n\n8th: The GPT-4 barrier has finally been broken\n\n22nd: Claude and ChatGPT for ad-hoc sidequests\n\n23rd: Building and testing C extensions for SQLite with ChatGPT Code Interpreter\n\n26th: llm cmd undo last git commit—a new plugin for LLM\n\n\n\nApril\n\n8th: Building files-to-prompt entirely using Claude 3 Opus\n\n10th: Three major LLM releases in 24 hours (plus weeknotes)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
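
Because the model was trained with `MatryoshkaLoss` over dimensions 768/512/256/128/64 (see Training Details below), its embeddings are intended to remain useful when truncated to those sizes. A minimal sketch, assuming the `truncate_dim` option available in recent `sentence-transformers` releases:

```python
from sentence_transformers import SentenceTransformer

# Load the model so that it emits 256-dimensional embeddings instead of 1024.
model = SentenceTransformer("ngiometti/legal-ft-2", truncate_dim=256)

embeddings = model.encode([
    "What improvements were noted in the intonation of ChatGPT Advanced Voice mode during its rollout?",
])
print(embeddings.shape)
# (1, 256)
```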

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Information Retrieval

* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.75       |
| cosine_accuracy@3   | 1.0        |
| cosine_accuracy@5   | 1.0        |
| cosine_accuracy@10  | 1.0        |
| cosine_precision@1  | 0.75       |
| cosine_precision@3  | 0.3333     |
| cosine_precision@5  | 0.2        |
| cosine_precision@10 | 0.1        |
| cosine_recall@1     | 0.75       |
| cosine_recall@3     | 1.0        |
| cosine_recall@5     | 1.0        |
| cosine_recall@10    | 1.0        |
| **cosine_ndcg@10**  | **0.8968** |
| cosine_mrr@10       | 0.8611     |
| cosine_map@100      | 0.8611     |
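
The metrics above come from `InformationRetrievalEvaluator`. The card does not ship the held-out query/corpus split that produced them, so the snippet below is only a sketch of how such an evaluation is wired up, with placeholder IDs and texts:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("ngiometti/legal-ft-2")

# Placeholder data: query IDs -> text, corpus IDs -> text,
# and each query ID -> the set of relevant corpus IDs.
queries = {"q1": "What is identified as the biggest unsolved problem related to LLMs?"}
corpus = {
    "d1": "Gullibility is the biggest unsolved problem ...",
    "d2": "The year of slop ...",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="demo")
results = evaluator(model)
print(results)  # dict of metrics, including an ndcg@10 entry for the "demo" run
```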

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 156 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 156 samples:
  |         | sentence_0                                                                         | sentence_1                                                                            |
  |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
  | type    | string                                                                               | string                                                                                 |
  | details | <ul><li>min: 14 tokens</li><li>mean: 20.31 tokens</li><li>max: 36 tokens</li></ul>  | <ul><li>min: 43 tokens</li><li>mean: 130.44 tokens</li><li>max: 204 tokens</li></ul>  |
* Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | <code>What are some potential applications of Large Language Models (LLMs) mentioned in the context?</code> | <code>Large Language Models<br>They’re actually quite easy to build<br>You can run LLMs on your own devices<br>Hobbyists can build their own fine-tuned models<br>We don’t yet know how to build GPT-4<br>Vibes Based Development<br>LLMs are really smart, and also really, really dumb<br>Gullibility is the biggest unsolved problem<br>Code may be the best application<br>The ethics of this space remain diabolically complex<br>My blog in 2023</code> |
  | <code>What is identified as the biggest unsolved problem related to LLMs?</code> | <code>Large Language Models<br>They’re actually quite easy to build<br>You can run LLMs on your own devices<br>Hobbyists can build their own fine-tuned models<br>We don’t yet know how to build GPT-4<br>Vibes Based Development<br>LLMs are really smart, and also really, really dumb<br>Gullibility is the biggest unsolved problem<br>Code may be the best application<br>The ethics of this space remain diabolically complex<br>My blog in 2023</code> |
  | <code>What improvements were noted in the intonation of ChatGPT Advanced Voice mode during its rollout?</code> | <code>When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I’ve been using it extensively on walks with my dog and it’s amazing how much the improvement in intonation elevates the material. I’ve also had a lot of fun experimenting with the OpenAI audio APIs.<br>Even more fun: Advanced Voice mode can do accents! Here’s what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
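
In code, this configuration corresponds to wrapping `MultipleNegativesRankingLoss` in `MatryoshkaLoss`. A sketch of how that pairing is typically constructed; the dataset shown here is an illustrative stand-in for the 156 real (question, passage) pairs:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Placeholder pairs in the sentence_0 / sentence_1 layout shown above.
train_dataset = Dataset.from_dict({
    "sentence_0": ["What is identified as the biggest unsolved problem related to LLMs?"],
    "sentence_1": ["Gullibility is the biggest unsolved problem ..."],
})

inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```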

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `num_train_epochs`: 10
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 10
- `per_device_eval_batch_size`: 10
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 10
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>
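
Put together with the loss sketch above, a training run with these non-default hyperparameters would look roughly like the following; the output directory and the evaluator wiring are assumptions, not values taken from the card:

```python
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/legal-ft-2",      # assumed path
    num_train_epochs=10,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    eval_strategy="steps",               # requires an eval_dataset or evaluator as well
)

trainer = SentenceTransformerTrainer(
    model=model,              # base model from the loss sketch above
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,      # e.g. the InformationRetrievalEvaluator sketched earlier
)
trainer.train()
```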

### Training Logs
| Epoch | Step | cosine_ndcg@10 |
|:-----:|:----:|:--------------:|
| 1.0   | 16   | 0.9122         |
| 2.0   | 32   | 0.9093         |
| 3.0   | 48   | 0.8968         |
| 3.125 | 50   | 0.8968         |
| 4.0   | 64   | 0.8939         |
| 5.0   | 80   | 0.8908         |
| 6.0   | 96   | 0.8908         |
| 6.25  | 100  | 0.8908         |
| 7.0   | 112  | 0.8939         |
| 8.0   | 128  | 0.8968         |
| 9.0   | 144  | 0.8968         |
| 9.375 | 150  | 0.8968         |
| 10.0  | 160  | 0.8968         |


### Framework Versions
- Python: 3.13.1
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
config.json ADDED
@@ -0,0 +1,25 @@
{
  "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
{
  "__version__": {
    "sentence_transformers": "3.4.1",
    "transformers": "4.48.3",
    "pytorch": "2.6.0+cu124"
  },
  "prompts": {
    "query": "Represent this sentence for searching relevant passages: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
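
The `prompts` entry carries over the query prompt from the arctic-embed base model. Since `default_prompt_name` is `null`, it is not applied automatically; a short sketch of applying it explicitly for retrieval-style encoding (queries prompted, passages not):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ngiometti/legal-ft-2")

# Encode queries with the "query" prompt defined above; passages without it.
query_embeddings = model.encode(
    ["What is identified as the biggest unsolved problem related to LLMs?"],
    prompt_name="query",
)
passage_embeddings = model.encode(
    ["Gullibility is the biggest unsolved problem ..."],
)
print(model.similarity(query_embeddings, passage_embeddings))
```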
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:757b432add6dde99d0f35e29a427f9f084f1b4a3fe85c146341ea9edd6f1d6a5
size 1336413848
modules.json ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_length": 512,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff