SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
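The architecture above composes three modules: a BERT encoder, CLS-token pooling, and L2 normalization. As a rough sketch of the last two steps (with random arrays standing in for the transformer's token outputs, since the point here is the pooling head rather than the encoder):

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings):
    """(batch, seq_len, dim) token embeddings -> (batch, dim) unit vectors."""
    cls = token_embeddings[:, 0, :]                     # pooling_mode_cls_token=True
    norms = np.linalg.norm(cls, axis=1, keepdims=True)  # L2 norm per sentence
    return cls / norms                                  # the Normalize() module

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16, 1024))  # stand-in for BertModel token outputs
emb = cls_pool_and_normalize(tokens)
print(emb.shape)                         # (2, 1024)
```

Every output vector has unit length, which is why cosine similarity on these embeddings can be computed as a simple dot product.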

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("kamkol/ab_testing_finetuned_arctic_ft-36dfff22-0696-40d2-b3bf-268fe2ff2aec")
# Run inference
sentences = [
    'who jacob cohen say about power analysis?',
    '#3: Need over 200K Visitors in Online Experiment • Maybe you need half or double this, but do not trust online controlled experiments with 2,000 users • Let me start with the assumptions required for this default: • Alpha=0.05 to declare stat-sig (industry standard, default #1 in this talk). Lower values, which are appropriate sometimes, will result in increasing sample size • Power=80%. We’ll discuss power over the next few slides, but this is a minimum. Going to higher power will increase the sample size. • Conversion rate is 5%. Even if you’re optimizing for something else, it is very common to build a guardrail on conversions. Sites are typically 2%-5%. A lower number will increase the sample size • MDE (minimum detectable effect) is relative 5%. It is rare to see experiments improve key metrics by 5%, but we’ll be aggressive looking for big wins. A small MDE will increase the sample size #16 When I finally stumbled onto power analysis… it was as if I had died and gone to heaven -- Jacob Cohen (1990) © 2022 Ron Kohavi #3: The Sobering Math • The power formula is simple: • For conversions, 𝜎2 = 𝑝∗(1−𝑝)(binomial distribution). We assume p=5%, so 𝜎2 = 0.05∗0.95 • Our MDE is 5% relative, so 𝛿 = 5%∗5%(absolute) • Plug it in, and n=121,600 per variant, or 243,200 for A/B test • 200K is therefore conservative minimum for these assumptions • Lower alpha, increase power, lower conversion rate, or lower the MDE, and you need a larger sample #17 © 2022 Ron Kohavi #3: What is Power? • Statistical power is the probability of detecting a given difference between the variants when there really is one • Given H0 (left) and H1 (right) separated by 𝛿 • Stat-sig is noted by the dark area (only the right one matters here) • The vertical lines indicate β, which is our type-II error (power= 1-β) • With low power, the right normal moves left and the vertical lines cover most of that distribution. 
You have to be “lucky” to get stat-sig #18 Diagram from van Belle (2011) © 2022 Ron Kohavi Winner’s Curse • A stat-sig result with low power has a high probability of exaggerating the actual number as follows → (Gelman and Carlin 2014) • GuessTheTest on 16 Dec 2021, shared an example with ~80 users in each variant and 337% improvement • It had 3% power to detect even a 10% delta, so it is 63% likely to be a false positive and highly likely to exaggerate effect. See http://bit.ly/ABTestingIntuitionBusters #19 © 2022 Ron Kohavi A Visualization of Power • If the null hypothesis is true (no difference, or effect is zero), the distribution of p-values is uniform • Using p-value of 0.05, about 5% of the time you’ll declare something stat-sig • We’ll look at the p-value distribution of 10,000 experiments where the treatment has a 5% lift (relative improvement) #20 © 2022 Ron Kohavi Power = 3% (N=100 per variant) Delta of 1 Delta 0Delta of 2 Delta of 3 Delta of 4 With so few users, only a few conversion combinations are possible, hence only a few p-values are possible © 2022 Ron Kohavi Power = 3% (N=100 per variant) cont • With small numbers, you get extreme results – winner’s curse • C had 13 conversions, T had 0 conversions ( -100%, p-value 0.00003) • C had 1 conversion, T had 13 conversions (+1200%, p-value 0.0001) • Average lift (absolute value) for stat-sig results was 271% Remember: true lift is 5%, so exaggeration factor is 54 times for average! • The maximum was 1200% lift, so 240 times the true value. • Wrong sign for stat-sig result: 36% of the time The truth was that there was a 5% lift, but we got a stat-sig negative lift! #22 © 2022 Ron Kohavi Power = ~10% (N=7,000 per variant) #23 Looks almost uniform (remember, our goal is for p-value < 0.05) Stat-sig only ~10% of the time (vs. 
5% expected if there was no difference) Winner’s curse: when stat-sig, exaggeration factor of 3.9 (19.3% lift) © 2022 Ron Kohavi Power = ~30% (N=32,000 per variant) #24 Starts to put more mass below 0.05 Winner’s curse: when stat-sig, exaggeration factor of 1.9 (9.3% average lift) Could even get the sign wrong 0.1% of the time when stat-sig © 2022 Ron Kohavi Power = ~80% (N=122,000) #25 That’s the minimum we want: ~80% of the time, p-value < 0.05 Small exaggeration factor of 1.1 (5.6% average lift vs. real value of 5%) Never get the sign wrong when stat-sig © 2022 Ron Kohavi Power = ~90% (N=163,000) #26 Extending experiment from 122K users to 163K users (e.g., 34% longer) gives us great 90% power © 2022 Ron Kohavi References • Excel spreadsheet with the visualizations here (Nov 2022) • Low power LinkedIn post (Sept 2022) • A/B Testing Intuition Busters(Aug 2022) • Gelman, This is what “power = .06” looks like. Get used to it (11/2014) #27 © 2022 Ron Kohavi',
    '<1-hop>\n\n3 ESTIMATING THE FALSE POSITIVE RISK P-values are commonly misinterpreted as the probability of making a mistake when choosing the Treatment over Control when the observed metric of interest is statistically significantly different [25; 26; 27]. Multiple examples of this misinterpretation by A/B vendors, book authors, and in courts were provided by Kohavi, Deng, and Vermeer [14]. What is the p-value then? The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that all the modeling assumptions, including the null hypothesis, H0, are true [26]. Conditioning on the null hypothesis is critical and most often misunderstood. In probabilistic terms, we have p-value = P(Δ observed or more extreme | H0 is true). What we are looking for most of the time is the opposite conditional probability: P(H0 is true | Δ observed). Using Bayes Rule, we can estimate the False Positive Risk (FPR), which is the probability that the statistically significant result is a false positive, or the probability that H0 is true (no real effect) when the test was statistically significant [15]. Note that FPR is sometimes named FDR, or False Discovery Rate [28; 29], but given the confusion with FDR from multiple hypothesis testing, we use the term recommended by Colquhoun [15]. We use the following terminology [14]: a) SS is a statistically significant positive result. b) α is the threshold used to determine statistical significance (SS), commonly 0.05 for a two-tailed t-test, and 0.025 for the positive tail. c) β is the type-II error (usually 0.2 for 80% power). d) π is the prior probability of the null hypothesis, that is P(H0). We can apply Bayes Rule as follows: FPR = P(H0|SS) = α*π / (α*π + (1−β)*(1−π)). An alternative derivation of FPR, resulting in the same formula, was made in the Supplement to Equation 2 and Figure 2 in Benjamin et. al. [30].
The key parameter required for the above is π, or P(H0). Kohavi, Deng, and Vermeer [14] provided a table with seven success rate estimates (1−π) that were reported in the software industry, which ranged from 8% to 33% with a median and mode of 10%. Plugging these into the above formula results in an FPR of 22% for the median and mode success rate of 10%, industry-standard two-tailed alpha of 0.05 (equivalent to one-tailed 0.025), and 80% power (π = 0.9, α = 0.025, β = 0.2). This is a much higher rate than people intuitively think of when they hear “statistically significant improvement.” (We refer to a positive result as an improvement in the desired direction, which is usually larger (e.g., conversion, revenue), but may be smaller (e.g., faster time).) [False Positives in A/B Tests, KDD ’24, August 25-29, 2024, Barcelona, Spain] For companies that use α = 0.10 as their threshold for statistical significance, or equivalently use α = 0.05 with a one-tailed test for the improvement tail (e.g., Optimizely [31], Analytics Toolkit [32], Booking.com [33], Expedia), the FPR for a 10% success rate is 36%. Over one third of the statistically significant results showing improvement, which we want to celebrate, are likely to be false positives! To provide intuition about why the FPR is so high when the success rate is low, we will use the data reported by Optimizely [34] of a 12% win rate across 127,000 experiments. As we will show later in the paper in Section 4.4, the estimated true success rate is 9.3%, in line with the 10% median and mode of Table 2 in Kohavi, Deng, and Vermeer [14]. Looking at Figure 1, the dot-pattern (also green if viewed in color) in the first row represents the 9.3% success rate, that is, true effects that should be statistically significant given our sample size with 80% power. Of these, 80% will be identified as statistically significant, so 80%*9.3% = 7.4% are denoted by a plus in the first row.
Of the remaining 90.7% null effects, 5% will be statistically significant and positive, so 4.5% of the A/B tests will show a statistically significant result: a false positive. These are denoted by a plus in the second row. Of the ~12% wins (7.4%+4.5% depicted by pluses), 4.5% are false positives, so 4.5%/(4.5% + 7.4%) = 37.8%. This surprisingly high false positive is often referred to as the base rate fallacy [35].',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
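Because the model's final Normalize() module outputs unit-length vectors, the cosine similarities returned by model.similarity reduce to a plain matrix product. A small sketch with toy unit-norm vectors (stand-ins for real encode output) illustrates the equivalence:

```python
import numpy as np

# Toy unit-norm vectors standing in for model.encode(...) output; the real
# model already L2-normalizes via its final Normalize() module.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 1024))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For unit vectors, cosine similarity is just a dot product.
dot_sims = embeddings @ embeddings.T

# Full cosine formula, for comparison.
norms = np.linalg.norm(embeddings, axis=1)
cos_sims = (embeddings @ embeddings.T) / np.outer(norms, norms)

print(np.allclose(dot_sims, cos_sims))      # True
print(np.allclose(np.diag(dot_sims), 1.0))  # self-similarity is 1
```

This is also why these embeddings work directly with dot-product vector indexes.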

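The third sample passage above derives the False Positive Risk via Bayes Rule. As a sanity check of the numbers quoted there (22% and 36% FPR at a 10% true success rate), the formula can be evaluated directly; this is a sketch of the passage's formula, not part of the model's API:

```python
def false_positive_risk(alpha, beta, pi):
    """P(H0 true | statistically significant), via Bayes Rule:
    FPR = alpha*pi / (alpha*pi + (1 - beta)*(1 - pi))."""
    return alpha * pi / (alpha * pi + (1 - beta) * (1 - pi))

# 10% true success rate (pi = 0.9) and 80% power (beta = 0.2):
print(round(false_positive_risk(alpha=0.025, beta=0.2, pi=0.9), 2))  # 0.22
print(round(false_positive_risk(alpha=0.05, beta=0.2, pi=0.9), 2))   # 0.36
```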
Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.5083
cosine_accuracy@3    0.7333
cosine_accuracy@5    0.8333
cosine_accuracy@10   0.9
cosine_precision@1   0.5083
cosine_precision@3   0.325
cosine_precision@5   0.225
cosine_precision@10  0.1292
cosine_recall@1      0.3471
cosine_recall@3      0.6194
cosine_recall@5      0.7114
cosine_recall@10     0.8056
cosine_ndcg@10       0.6457
cosine_mrr@10        0.6401
cosine_map@100       0.5788
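The @k metrics above follow the usual information-retrieval definitions. As an illustrative sketch (the helper names and toy ranks are hypothetical, not taken from the evaluator), accuracy@k and MRR@k can be computed from the ranks of each query's relevant documents:

```python
def accuracy_at_k(relevant_ranks, k):
    """Fraction of queries with at least one relevant document in the top k.
    relevant_ranks: for each query, the 1-based ranks of its relevant docs."""
    return sum(any(r <= k for r in ranks) for ranks in relevant_ranks) / len(relevant_ranks)

def mrr_at_k(relevant_ranks, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranks in relevant_ranks:
        in_top = [r for r in ranks if r <= k]
        if in_top:
            total += 1.0 / min(in_top)
    return total / len(relevant_ranks)

# Hypothetical ranks of the relevant docs for three queries.
queries = [[1, 4], [3], [12]]
print(accuracy_at_k(queries, 1))        # first query only -> 1/3
print(accuracy_at_k(queries, 10))       # first two queries -> 2/3
print(round(mrr_at_k(queries, 10), 3))  # (1/1 + 1/3 + 0) / 3 = 0.444
```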

Training Details

Training Dataset

Unnamed Dataset

  • Size: 428 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 428 samples:
                 sentence_0   sentence_1
    type         string       string
    min tokens   10           75
    mean tokens  37.9         502.82
    max tokens   78           512
  • Samples:
    • sentence_0: How do the pitfalls identified in online A/B testing, such as Simpson’s paradox and misuse of standard statistical formulas, relate to the ongoing debate between Bayesian methods and Frequentist approaches in interpreting A/B test results?
      sentence_1: <7-hop> in control, successes in treatment) to probabilities using non-informative priors, then I find the Bayesian exercise losing much of the promise. Worse, the online Bayesian A/B calculators not only require fewer parameters than FPR does, but the “Chance to beat Control” seems highly exaggerated. Appendix Some additional references to the A/B Bayesian vs. A/B Frequentists. 1. Economic Nobel prize winner Guido Imbens (2021) wrote: In the end, I do not see the advantages of Bayes factors over p-values as sufficient to convince researchers to adopt this technology more widely. 2. Bayes and Frequentist by Matt Gershoff (Oct 2022), part of the Build vs. Buy series (see intro deck and last deck with pointers). 3. Philosophy and the practice of Bayesian Statistics (2013) - great point about the need to check models, not just average 4. Nonsensical Bayesian Statistics in A/B Testing by Georgi Georgiev, author of Statistical Methods in Online A/B Testing 5. My post Multi-Armed bandits, T...
    • sentence_0: how multiVariable testing (MVT) help speed up testing many factors at once and what experimentation infrastructure requirements make server-side assignment best for running complex MVTs on large sites?
      sentence_1: <2-hop> 4 MultiVariable Testing1 An experiment that includes more than one factor is often called a MultiVariable test (MVT) (Alt and Usborne 2005). For example, consider testing five factors on the MSN homepage in a single experiment. A screenshot of the MSN homepage showing the control for each of these factors is given in Fig.8. 1 This is also known as Multivariate testing. We use the term MultiVariable Testing for two reasons. These tests were first called MultiVariable Tests in 1996 in an article in Forbes (Koselka 1996) referring to designed experiments in areas including sales and marketing. In addition, these tests are part of the statistical literature in the Design of Experiments field. There is a separate field of statistics known as multivariate statistics that does not deal with this topic so using the term multivariate could be a source of confusion. 123 Controlled experiments on the web 159 Factor Control Treatment F1 Shopping module as above Add Offers module below F2 Shop...
    • sentence_0: How Figure 4.2 help manage variant assignment and system parameters in experiment platform?
      sentence_1: and its attributes (e.g., country, language, OS, platform), which experiment and variant combinations is that request assigned to? This assignment is based on the experiment specification and a pseudo-random hash of an ID, that is, f(ID). In most cases, to ensure the assignment is consistent for a user, a user ID is used. Variant assignment must also be independent, in that knowing the variant assignment of one user should not tell us anything about variant assignment for a different user. We discuss this in more depth in Chapter 14 . In this chapter, we assume user is the randomization unit. Production code, system parameters and values : Now that you have variant assignment and definitions, how do you ensure that the user receives the appropriate experience: how do you manage different production code and which system parameters should change to what values? This interface (or interfaces) is represented as the Variant Assignment Service in Figure 4.2 , and can return either just the ...
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            1024,
            768,
            512,
            256,
            128
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
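Because training used MatryoshkaLoss over the nested dimensions listed above, embeddings can in principle be truncated to one of those prefixes and re-normalized, trading a little quality for cheaper storage and faster search. A minimal sketch, assuming unit-norm embeddings like those this model produces (toy random vectors stand in for real output):

```python
import numpy as np

def truncate_matryoshka(embeddings, dim):
    """Keep the first `dim` components of each embedding and re-normalize."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 1024))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # stand-in for model output

emb_256 = truncate_matryoshka(emb, 256)            # one of the trained dims
print(emb_256.shape)                               # (3, 256)
```

Recent versions of Sentence Transformers also accept a truncate_dim argument when loading a model, which, if your installed version supports it, performs this truncation for you.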
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 100
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 100
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss cosine_ndcg@10
1.0 27 - 0.4217
1.8519 50 - 0.5487
2.0 54 - 0.5525
3.0 81 - 0.5851
3.7037 100 - 0.6000
4.0 108 - 0.6019
5.0 135 - 0.6160
5.5556 150 - 0.6255
6.0 162 - 0.6513
7.0 189 - 0.6403
7.4074 200 - 0.6306
8.0 216 - 0.6450
9.0 243 - 0.6455
9.2593 250 - 0.6489
10.0 270 - 0.6355
11.0 297 - 0.6619
11.1111 300 - 0.6650
12.0 324 - 0.6636
12.9630 350 - 0.6906
13.0 351 - 0.6869
14.0 378 - 0.6771
14.8148 400 - 0.6541
15.0 405 - 0.6537
16.0 432 - 0.6485
16.6667 450 - 0.6619
17.0 459 - 0.6334
18.0 486 - 0.6698
18.5185 500 2.6848 0.6645
19.0 513 - 0.6580
20.0 540 - 0.6888
20.3704 550 - 0.6676
21.0 567 - 0.6591
22.0 594 - 0.6558
22.2222 600 - 0.6554
23.0 621 - 0.6476
24.0 648 - 0.6580
24.0741 650 - 0.6560
25.0 675 - 0.6488
25.9259 700 - 0.6206
26.0 702 - 0.6033
27.0 729 - 0.6471
27.7778 750 - 0.6293
28.0 756 - 0.6346
29.0 783 - 0.6406
29.6296 800 - 0.6424
30.0 810 - 0.6234
31.0 837 - 0.6765
31.4815 850 - 0.6561
32.0 864 - 0.6562
33.0 891 - 0.6539
33.3333 900 - 0.6569
34.0 918 - 0.6462
35.0 945 - 0.6724
35.1852 950 - 0.6626
36.0 972 - 0.6280
37.0 999 - 0.6561
37.0370 1000 1.0045 0.6534
38.0 1026 - 0.6570
38.8889 1050 - 0.6650
39.0 1053 - 0.6516
40.0 1080 - 0.6562
40.7407 1100 - 0.6778
41.0 1107 - 0.6798
42.0 1134 - 0.6922
42.5926 1150 - 0.6902
43.0 1161 - 0.6775
44.0 1188 - 0.6663
44.4444 1200 - 0.6730
45.0 1215 - 0.6807
46.0 1242 - 0.6674
46.2963 1250 - 0.6657
47.0 1269 - 0.6648
48.0 1296 - 0.6716
48.1481 1300 - 0.6817
49.0 1323 - 0.6594
50.0 1350 - 0.6611
51.0 1377 - 0.6797
51.8519 1400 - 0.6858
52.0 1404 - 0.6828
53.0 1431 - 0.6836
53.7037 1450 - 0.6710
54.0 1458 - 0.6674
55.0 1485 - 0.6598
55.5556 1500 0.8341 0.6619
56.0 1512 - 0.6625
57.0 1539 - 0.6686
57.4074 1550 - 0.6650
58.0 1566 - 0.6214
59.0 1593 - 0.6366
59.2593 1600 - 0.6399
60.0 1620 - 0.6493
61.0 1647 - 0.6358
61.1111 1650 - 0.6326
62.0 1674 - 0.6171
62.9630 1700 - 0.6229
63.0 1701 - 0.6242
64.0 1728 - 0.6658
64.8148 1750 - 0.6622
65.0 1755 - 0.6555
66.0 1782 - 0.6286
66.6667 1800 - 0.6524
67.0 1809 - 0.6421
68.0 1836 - 0.6324
68.5185 1850 - 0.6479
69.0 1863 - 0.6443
70.0 1890 - 0.6260
70.3704 1900 - 0.6440
71.0 1917 - 0.6390
72.0 1944 - 0.6558
72.2222 1950 - 0.6563
73.0 1971 - 0.6455
74.0 1998 - 0.6422
74.0741 2000 0.6258 0.6507
75.0 2025 - 0.6504
75.9259 2050 - 0.6493
76.0 2052 - 0.6493
77.0 2079 - 0.6546
77.7778 2100 - 0.6430
78.0 2106 - 0.6443
79.0 2133 - 0.6432
79.6296 2150 - 0.6427
80.0 2160 - 0.6467
81.0 2187 - 0.6567
81.4815 2200 - 0.6529
82.0 2214 - 0.6522
83.0 2241 - 0.6487
83.3333 2250 - 0.6444
84.0 2268 - 0.6374
85.0 2295 - 0.6441
85.1852 2300 - 0.6439
86.0 2322 - 0.6378
87.0 2349 - 0.6441
87.0370 2350 - 0.6439
88.0 2376 - 0.6470
88.8889 2400 - 0.6519
89.0 2403 - 0.6451
90.0 2430 - 0.6461
90.7407 2450 - 0.6464
91.0 2457 - 0.6451
92.0 2484 - 0.6396
92.5926 2500 0.5699 0.6425
93.0 2511 - 0.6481
94.0 2538 - 0.6449
94.4444 2550 - 0.6450
95.0 2565 - 0.6452
96.0 2592 - 0.6457
96.2963 2600 - 0.6457
97.0 2619 - 0.6457
98.0 2646 - 0.6457
98.1481 2650 - 0.6457
99.0 2673 - 0.6457
100.0 2700 - 0.6457

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model Size

  • Parameters: 0.3B
  • Tensor type: F32 (Safetensors)