
BGE base Financial Matryoshka

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
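
The same three-module stack (BERT encoder, CLS-token pooling, L2 normalization) can be assembled by hand with the sentence-transformers models API. The snippet below is only a minimal sketch of that construction for illustration, not how this checkpoint was produced; loading the published model as shown under Usage is the normal path.

from sentence_transformers import SentenceTransformer, models

# Transformer backbone with a 512-token limit and lowercasing, matching the architecture above
word_embedding = models.Transformer(
    "BAAI/bge-base-en-v1.5",
    max_seq_length=512,
    do_lower_case=True,
)
# CLS-token pooling produces one 768-dimensional vector per input text
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),
    pooling_mode="cls",
)
# L2-normalize the embeddings so that dot product equals cosine similarity
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding, pooling, normalize])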

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("joshuapb/fine-tuned-matryoshka-100")
# Run inference
sentences = [
    'Fine-tuning New Knowledge#\nFine-tuning a pre-trained LLM via supervised fine-tuning and RLHF is a common technique for improving certain capabilities of the model like instruction following. Introducing new knowledge at the fine-tuning stage is hard to avoid.\nFine-tuning usually consumes much less compute, making it debatable whether the model can reliably learn new knowledge via small-scale fine-tuning. Gekhman et al. 2024 studied the research question of whether fine-tuning LLMs on new knowledge encourages hallucinations. They found that (1) LLMs learn fine-tuning examples with new knowledge slower than other examples with knowledge consistent with the pre-existing knowledge of the model; (2) Once the examples with new knowledge are eventually learned, they increase the model’s tendency to hallucinate.',
    'How do the results presented by Gekhman et al. in their 2024 study inform our understanding of the reliability metrics associated with large language models (LLMs) when subjected to fine-tuning with novel datasets?',
    'What is the relationship between the calibration of AI models and the effectiveness of verbalized probabilities when applied to tasks of varying difficulty levels?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
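
Because the model was trained with a Matryoshka objective, the 768-dimensional embeddings can be truncated to the smaller sizes evaluated below (512, 256, 128, 64) with only a modest drop in retrieval quality. A minimal sketch using the truncate_dim argument of SentenceTransformer (assuming a sentence-transformers release that supports it, such as the 3.0.1 version listed under Framework Versions); 256 is just an example dimension:

from sentence_transformers import SentenceTransformer

# encode() now returns embeddings truncated to the first 256 dimensions
model_256 = SentenceTransformer("joshuapb/fine-tuned-matryoshka-100", truncate_dim=256)
embeddings_256 = model_256.encode(sentences)  # reuses the `sentences` list from the example above
print(embeddings_256.shape)
# (3, 256)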

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.8281
cosine_accuracy@3 0.9635
cosine_accuracy@5 0.974
cosine_accuracy@10 0.9948
cosine_precision@1 0.8281
cosine_precision@3 0.3212
cosine_precision@5 0.1948
cosine_precision@10 0.0995
cosine_recall@1 0.8281
cosine_recall@3 0.9635
cosine_recall@5 0.974
cosine_recall@10 0.9948
cosine_ndcg@10 0.922
cosine_mrr@10 0.8977
cosine_map@100 0.8981

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.8021
cosine_accuracy@3 0.9635
cosine_accuracy@5 0.974
cosine_accuracy@10 0.9896
cosine_precision@1 0.8021
cosine_precision@3 0.3212
cosine_precision@5 0.1948
cosine_precision@10 0.099
cosine_recall@1 0.8021
cosine_recall@3 0.9635
cosine_recall@5 0.974
cosine_recall@10 0.9896
cosine_ndcg@10 0.9077
cosine_mrr@10 0.8802
cosine_map@100 0.881

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.7969
cosine_accuracy@3 0.9583
cosine_accuracy@5 0.9688
cosine_accuracy@10 0.9792
cosine_precision@1 0.7969
cosine_precision@3 0.3194
cosine_precision@5 0.1937
cosine_precision@10 0.0979
cosine_recall@1 0.7969
cosine_recall@3 0.9583
cosine_recall@5 0.9688
cosine_recall@10 0.9792
cosine_ndcg@10 0.9011
cosine_mrr@10 0.8746
cosine_map@100 0.8758

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.7865
cosine_accuracy@3 0.9323
cosine_accuracy@5 0.9635
cosine_accuracy@10 0.9635
cosine_precision@1 0.7865
cosine_precision@3 0.3108
cosine_precision@5 0.1927
cosine_precision@10 0.0964
cosine_recall@1 0.7865
cosine_recall@3 0.9323
cosine_recall@5 0.9635
cosine_recall@10 0.9635
cosine_ndcg@10 0.8881
cosine_mrr@10 0.8623
cosine_map@100 0.8647

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.6875
cosine_accuracy@3 0.8646
cosine_accuracy@5 0.9271
cosine_accuracy@10 0.9688
cosine_precision@1 0.6875
cosine_precision@3 0.2882
cosine_precision@5 0.1854
cosine_precision@10 0.0969
cosine_recall@1 0.6875
cosine_recall@3 0.8646
cosine_recall@5 0.9271
cosine_recall@10 0.9688
cosine_ndcg@10 0.8336
cosine_mrr@10 0.7896
cosine_map@100 0.7918
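
The tables above report the usual sentence-transformers information-retrieval metrics, evaluated once per Matryoshka dimension (768, 512, 256, 128, 64). The sketch below shows how such numbers are typically produced with InformationRetrievalEvaluator; the queries, corpus, and relevant_docs mappings here are placeholders, not the actual evaluation data for this model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("joshuapb/fine-tuned-matryoshka-100")

# Placeholder evaluation data: query id -> text, passage id -> text, query id -> relevant passage ids
queries = {"q1": "How does fine-tuning on new knowledge affect hallucination?"}
corpus = {"d1": "Gekhman et al. 2024 studied whether fine-tuning LLMs on new knowledge encourages hallucinations."}
relevant_docs = {"q1": {"d1"}}

for dim in (768, 512, 256, 128, 64):
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        truncate_dim=dim,   # score retrieval with embeddings truncated to `dim`
        name=f"dim_{dim}",
    )
    print(dim, evaluator(model))  # dict of metrics such as dim_768_cosine_map@100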

Training Details

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • load_best_model_at_end: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
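
Taken together with the losses cited below (MatryoshkaLoss wrapping MultipleNegativesRankingLoss), these hyperparameters suggest a training setup along the following lines. This is a hedged sketch rather than the exact training script: the training pairs and the output_dir are placeholders, and the real training data is not shown on this card.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Placeholder (anchor, positive) pairs standing in for the real question/context training data
train_dataset = Dataset.from_dict({
    "anchor": ["What did Gekhman et al. 2024 study?"],
    "positive": ["Gekhman et al. 2024 studied whether fine-tuning LLMs on new knowledge encourages hallucinations."],
})

# In-batch negatives loss, wrapped so the leading 768/512/256/128/64 dimensions are all trained
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="fine-tuned-matryoshka",   # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",                # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,           # placeholder; a held-out split would be used in practice
    loss=loss,
)
trainer.train()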

Training Logs

Epoch Step Training Loss dim_128_cosine_map@100 dim_256_cosine_map@100 dim_512_cosine_map@100 dim_64_cosine_map@100 dim_768_cosine_map@100
0.3846 5 5.0472 - - - - -
0.7692 10 4.0023 - - - - -
1.0 13 - 0.7939 0.8135 0.8282 0.7207 0.8323
1.1538 15 2.3381 - - - - -
1.5385 20 3.4302 - - - - -
1.9231 25 2.08 - - - - -
2.0 26 - 0.8494 0.8681 0.8781 0.7959 0.8888
2.3077 30 1.4696 - - - - -
2.6923 35 1.8153 - - - - -
3.0 39 - 0.8641 0.8844 0.8924 0.7952 0.8997
3.0769 40 1.3498 - - - - -
3.4615 45 0.9135 - - - - -
3.8462 50 1.3996 - - - - -
4.0 52 - 0.8647 0.8775 0.8819 0.7896 0.8990
4.2308 55 1.1582 - - - - -
4.6154 60 1.2233 - - - - -
5.0 65 0.9757 0.8647 0.8758 0.8810 0.7918 0.8981
  • The epoch 5.0 (step 65) row corresponds to the saved checkpoint; its values match the metrics reported under Evaluation.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.0.1
  • Transformers: 4.42.4
  • PyTorch: 2.3.1+cu121
  • Accelerate: 0.32.1
  • Datasets: 2.21.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}