
SentenceTransformer based on lufercho/ArxBert-MLM

This is a sentence-transformers model finetuned from lufercho/ArxBert-MLM. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: lufercho/ArxBert-MLM
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
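
The Pooling module above applies mean pooling over the token embeddings. For reference, here is a minimal sketch of the equivalent computation using the 🤗 Transformers library directly; it assumes the checkpoint loads as a plain BertModel via AutoModel, and the sentence-transformers usage below remains the recommended path.

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "lufercho/AxvBert-Sentente-Transformer_v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["An example sentence", "Another example sentence"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, ignoring padding positions
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embeddings.shape)  # torch.Size([2, 768])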

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer_v1")
# Run inference
sentences = [
    'Simultaneous Feature and Expert Selection within Mixture of Experts',
    '  A useful strategy to deal with complex classification scenarios is the\n"divide and conquer" approach. The mixture of experts (MOE) technique makes use\nof this strategy by joinly training a set of classifiers, or experts, that are\nspecialized in different regions of the input space. A global model, or gate\nfunction, complements the experts by learning a function that weights their\nrelevance in different parts of the input space. Local feature selection\nappears as an attractive alternative to improve the specialization of experts\nand gate function, particularly, for the case of high dimensional data. Our\nmain intuition is that particular subsets of dimensions, or subspaces, are\nusually more appropriate to classify instances located in different regions of\nthe input space. Accordingly, this work contributes with a regularized variant\nof MoE that incorporates an embedded process for local feature selection using\n$L1$ regularization, with a simultaneous expert selection. The experiments are\nstill pending.\n',
    "  Deep convolutional networks have proven to be very successful in learning\ntask specific features that allow for unprecedented performance on various\ncomputer vision tasks. Training of such networks follows mostly the supervised\nlearning paradigm, where sufficiently many input-output pairs are required for\ntraining. Acquisition of large training sets is one of the key challenges, when\napproaching a new task. In this paper, we aim for generic feature learning and\npresent an approach for training a convolutional network using only unlabeled\ndata. To this end, we train the network to discriminate between a set of\nsurrogate classes. Each surrogate class is formed by applying a variety of\ntransformations to a randomly sampled 'seed' image patch. In contrast to\nsupervised network training, the resulting feature representation is not class\nspecific. It rather provides robustness to the transformations that have been\napplied during training. This generic feature representation allows for\nclassification results that outperform the state of the art for unsupervised\nlearning on several popular datasets (STL-10, CIFAR-10, Caltech-101,\nCaltech-256). While such generic features cannot compete with class specific\nfeatures from supervised training on a classification task, we show that they\nare advantageous on geometric matching problems, where they also outperform the\nSIFT descriptor.\n",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
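
Beyond pairwise similarity, the same embeddings can be used for semantic search. A small sketch follows; the query and corpus strings are illustrative placeholders, not taken from the training data.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer_v1")

query = "feature selection in mixture of experts models"
corpus = [
    "A regularized mixture-of-experts model with embedded local feature selection.",
    "Unsupervised feature learning by discriminating between surrogate image classes.",
    "Group sparse additive models for nonparametric variable selection.",
]

query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Rank the corpus entries by cosine similarity to the query
scores = model.similarity(query_embedding, corpus_embeddings)[0]  # shape [len(corpus)]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")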

Training Details

Training Dataset

Unnamed Dataset

  • Size: 5,000 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (string): min: 4 tokens, mean: 13.32 tokens, max: 34 tokens
    • sentence_1 (string): min: 14 tokens, mean: 198.64 tokens, max: 489 tokens
  • Samples:
    • sentence_0: Multilabel Classification through Random Graph Ensembles
      sentence_1: We present new methods for multilabel classification, relying on ensemble learning on a collection of random output graphs imposed on the multilabel and a kernel-based structured output learner as the base classifier. For ensemble learning, differences among the output graphs provide the required base classifier diversity and lead to improved performance in the increasing size of the ensemble. We study different methods of forming the ensemble prediction, including majority voting and two methods that perform inferences over the graph structures before or after combining the base models into the ensemble. We compare the methods against the state-of-the-art machine learning approaches on a set of heterogeneous multilabel benchmark problems, including multilabel AdaBoost, convex multitask feature learning, as well as single target learning approaches represented by Bagging and SVM. In our experiments, the random graph ensembles are very competitive and robust, ranking first or second o...
    • sentence_0: Group Sparse Additive Models
      sentence_1: We consider the problem of sparse variable selection in nonparametric additive models, with the prior knowledge of the structure among the covariates to encourage those variables within a group to be selected jointly. Previous works either study the group sparsity in the parametric setting (e.g., group lasso), or address the problem in the non-parametric setting without exploiting the structural information (e.g., sparse additive models). In this paper, we present a new method, called group sparse additive models (GroupSpAM), which can handle group sparsity in additive models. We generalize the l1/l2 norm to Hilbert spaces as the sparsity-inducing penalty in GroupSpAM. Moreover, we derive a novel thresholding condition for identifying the functional sparsity at the group level, and propose an efficient block coordinate descent algorithm for constructing the estimate. We demonstrate by simulation that GroupSpAM substantially outperforms the competing methods in terms of support recove...
    • sentence_0: Inverse Covariance Estimation for High-Dimensional Data in Linear Time and Space: Spectral Methods for Riccati and Sparse Models
      sentence_1: We propose maximum likelihood estimation for learning Gaussian graphical models with a Gaussian (ell_2^2) prior on the parameters. This is in contrast to the commonly used Laplace (ell_1) prior for encouraging sparseness. We show that our optimization problem leads to a Riccati matrix equation, which has a closed form solution. We propose an efficient algorithm that performs a singular value decomposition of the training data. Our algorithm is O(NT^2)-time and O(NT)-space for N variables and T samples. Our method is tailored to high-dimensional problems (N gg T), in which sparseness promoting methods become intractable. Furthermore, instead of obtaining a single solution for a specific regularization parameter, our algorithm finds the whole solution path. We show that the method has logarithmic sample complexity under the spiked covariance model. We also propose sparsification of the dense solution with provable performance guarantees. We provide techniques for using our learnt model...
  • Loss: MultipleNegativesRankingLoss with these parameters (a minimal setup sketch follows this list):
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
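
For orientation, here is a minimal sketch of how this loss is typically wired up with the sentence-transformers v3 training API. The one-row dataset below is a hypothetical stand-in for the unnamed title/abstract pairs; the scale of 20.0 and cosine similarity match the parameters listed above.

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Start from the base checkpoint; loading a plain MLM model this way adds a
# mean-pooling head automatically, matching the architecture shown earlier.
model = SentenceTransformer("lufercho/ArxBert-MLM")

# Hypothetical stand-in for the unnamed (title, abstract) pair dataset
train_dataset = Dataset.from_dict({
    "sentence_0": ["Group Sparse Additive Models"],
    "sentence_1": ["We consider the problem of sparse variable selection in nonparametric additive models ..."],
})

# In-batch negatives: each (sentence_0, sentence_1) pair is a positive, and the
# other sentence_1 values in the batch act as negatives.
loss = MultipleNegativesRankingLoss(model, scale=20.0)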
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 2
  • multi_dataset_batch_sampler: round_robin
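
Continuing the sketch above, these non-default values map onto SentenceTransformerTrainingArguments roughly as follows; output_dir is a placeholder, and the model, train_dataset, and loss objects come from the loss sketch in the previous section.

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/AxvBert-Sentente-Transformer_v1",  # placeholder path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,                  # base model from the loss sketch above
    args=args,
    train_dataset=train_dataset,  # (title, abstract) pairs
    loss=loss,
)
trainer.train()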

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
1.5974 500 0.3282

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.46.2
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.1.1
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}