SentenceTransformer based on sentence-transformers/all-roberta-large-v1

This is a sentence-transformers model finetuned from sentence-transformers/all-roberta-large-v1. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-roberta-large-v1
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
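
The module stack above can be inspected programmatically after loading. The sketch below is illustrative and only reads attributes exposed by the Sentence Transformers library (sequence length, embedding dimension, and the mean-pooling flag).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("hanwenzhu/all-roberta-large-v1-lr5e-5-bs256-nneg3-ml-mar16")
print(model.max_seq_length)                      # 256
print(model.get_sentence_embedding_dimension())  # 1024
print(model[1].pooling_mode_mean_tokens)         # True: mean pooling over token embeddings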

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("hanwenzhu/all-roberta-large-v1-lr5e-5-bs256-nneg3-ml-mar16")
# Run inference
sentences = [
    'Mathlib.Algebra.Polynomial.FieldDivision#94',
    'normalize_apply',
    'DifferentiableWithinAt.hasFDerivWithinAt',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
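
Because the training pairs are Lean proof-state identifiers and Mathlib premise names, a natural use is ranking candidate premises for a given state. The sketch below reuses names from the examples above as placeholder candidates and ranks them with the library's semantic_search utility; it is illustrative only, not the author's retrieval pipeline.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("hanwenzhu/all-roberta-large-v1-lr5e-5-bs256-nneg3-ml-mar16")

# Placeholder query state and candidate premises, in the same format as the examples above.
state = "Mathlib.Algebra.Polynomial.FieldDivision#94"
premises = [
    "normalize_apply",
    "DifferentiableWithinAt.hasFDerivWithinAt",
    "Classical.choose_spec",
]

state_embedding = model.encode(state, convert_to_tensor=True)
premise_embeddings = model.encode(premises, convert_to_tensor=True)

# Rank candidates by cosine similarity to the state embedding.
hits = util.semantic_search(state_embedding, premise_embeddings, top_k=len(premises))[0]
for hit in hits:
    print(premises[hit["corpus_id"]], round(hit["score"], 4))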

Training Details

Training Dataset

Unnamed Dataset

  • Size: 5,817,740 training samples
  • Columns: state_name and premise_name
  • Approximate statistics based on the first 1000 samples:
    • state_name: string, min: 11 tokens, mean: 16.44 tokens, max: 24 tokens
    • premise_name: string, min: 3 tokens, mean: 10.9 tokens, max: 50 tokens
  • Samples (state_name, premise_name):
    • Mathlib.Algebra.Field.IsField#12, Classical.choose_spec
    • Mathlib.Algebra.Field.IsField#12, IsField.mul_comm
    • Mathlib.Algebra.Field.IsField#12, eq_of_heq
  • Loss: loss.MaskedCachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Evaluation Dataset

Unnamed Dataset

  • Size: 1,959 evaluation samples
  • Columns: state_name and premise_name
  • Approximate statistics based on the first 1000 samples:
    • state_name: string, min: 10 tokens, mean: 17.08 tokens, max: 24 tokens
    • premise_name: string, min: 5 tokens, mean: 11.05 tokens, max: 31 tokens
  • Samples (state_name, premise_name):
    • Mathlib.Algebra.Algebra.Hom#80, AlgHom.commutes
    • Mathlib.Algebra.Algebra.NonUnitalSubalgebra#237, NonUnitalAlgHom.instNonUnitalAlgSemiHomClass
    • Mathlib.Algebra.Algebra.NonUnitalSubalgebra#237, NonUnitalAlgebra.mem_top
  • Loss: loss.MaskedCachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    
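Note that loss.MaskedCachedMultipleNegativesRankingLoss is referenced from a custom loss module in the training code rather than from sentence_transformers.losses. A minimal sketch of the closest built-in setup, assuming the same scale and similarity function and the (state_name, premise_name) pair format shown in the samples above, could look like this:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("sentence-transformers/all-roberta-large-v1")

# (anchor, positive) pairs in the state_name / premise_name format shown above.
train_dataset = Dataset.from_dict({
    "state_name": ["Mathlib.Algebra.Field.IsField#12", "Mathlib.Algebra.Field.IsField#12"],
    "premise_name": ["Classical.choose_spec", "IsField.mul_comm"],
})

# Closest built-in analogue of the masked, cached multiple-negatives ranking loss used here.
loss = losses.CachedMultipleNegativesRankingLoss(
    model=model,
    scale=20.0,
    similarity_fct=util.cos_sim,
)
# The dataset and loss would then be passed to a SentenceTransformerTrainer.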

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 64
  • num_train_epochs: 1.0
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.03
  • bf16: True
  • dataloader_num_workers: 4
  • resume_from_checkpoint: /data/user_data/thomaszh/models/all-roberta-large-v1-lr5e-5-bs256-nneg3-ml/checkpoint-22116
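
In Sentence Transformers 3.x these non-default values are typically passed through SentenceTransformerTrainingArguments. The sketch below mirrors the list above with a placeholder output directory; it is an assumed reconstruction, not the exact training script.

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/all-roberta-large-v1-premise-selection",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=256,
    per_device_eval_batch_size=64,
    num_train_epochs=1.0,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    dataloader_num_workers=4,
)
# Resuming from a checkpoint is done at train time,
# e.g. trainer.train(resume_from_checkpoint="path/to/checkpoint").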

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1.0
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.03
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: /data/user_data/thomaszh/models/all-roberta-large-v1-lr5e-5-bs256-nneg3-ml/checkpoint-22116
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss
0.9733 22120 1.1781 -
0.9738 22130 1.1226 -
0.9742 22140 1.219 -
0.9747 22150 1.1531 -
0.9751 22160 1.1907 -
0.9755 22170 1.2081 -
0.9760 22180 1.1849 -
0.9764 22190 1.1923 -
0.9769 22200 1.1496 -
0.9773 22210 1.1868 -
0.9777 22220 1.1968 -
0.9782 22230 1.2081 -
0.9786 22240 1.1685 -
0.9791 22250 1.1618 -
0.9795 22260 1.1504 -
0.9799 22270 1.1328 -
0.9804 22280 1.2012 -
0.9808 22290 1.2439 -
0.9813 22300 1.202 -
0.9817 22310 1.1656 -
0.9821 22320 1.1664 -
0.9826 22330 1.1423 -
0.9830 22340 1.177 -
0.9832 22344 - 1.3153
0.9835 22350 1.1704 -
0.9839 22360 1.1787 -
0.9843 22370 1.2041 -
0.9848 22380 1.2031 -
0.9852 22390 1.1365 -
0.9857 22400 1.212 -
0.9861 22410 1.1562 -
0.9865 22420 1.1781 -
0.9870 22430 1.1507 -
0.9874 22440 1.2138 -
0.9879 22450 1.1967 -
0.9883 22460 1.1548 -
0.9887 22470 1.2121 -
0.9892 22480 1.1681 -
0.9896 22490 1.1805 -
0.9901 22500 1.2138 -
0.9905 22510 1.179 -
0.9909 22520 1.1608 -
0.9914 22530 1.1851 -
0.9918 22540 1.1804 -
0.9923 22550 1.154 -
0.9927 22560 1.1649 -
0.9931 22570 1.1815 -
0.9932 22572 - 1.3150
0.9936 22580 1.201 -
0.9940 22590 1.1987 -
0.9945 22600 1.1885 -
0.9949 22610 1.1378 -
0.9953 22620 1.1776 -
0.9958 22630 1.1298 -
0.9962 22640 1.2037 -
0.9967 22650 1.1926 -
0.9971 22660 1.2298 -
0.9975 22670 1.1539 -
0.9980 22680 1.1929 -
0.9984 22690 1.1783 -
0.9989 22700 1.1222 -
0.9993 22710 1.1309 -
0.9997 22720 1.1766 -

Framework Versions

  • Python: 3.11.8
  • Sentence Transformers: 3.1.1
  • Transformers: 4.45.1
  • PyTorch: 2.5.1.post302
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.20.0
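
For reproducibility, the installed versions can be compared against this list at runtime; the sketch below simply prints them.

import sentence_transformers
import transformers
import torch

# Compare against the versions listed above (3.1.1 / 4.45.1 / 2.5.1).
print(sentence_transformers.__version__)
print(transformers.__version__)
print(torch.__version__)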

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MaskedCachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}