SentenceTransformer based on lufercho/ArxBert-MLM
This is a sentence-transformers model finetuned from lufercho/ArxBert-MLM. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: lufercho/ArxBert-MLM
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
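
The pooling module takes the mean of the BertModel token embeddings (pooling_mode_mean_tokens) to produce the 768-dimensional sentence vector. As a minimal sketch of what that pooling does, the embedding can be reproduced by hand with the transformers library; this assumes the Hub repository loads with AutoModel/AutoTokenizer, which SentenceTransformer checkpoints normally do:

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative only: reproduce the mean pooling step manually.
tokenizer = AutoTokenizer.from_pretrained("lufercho/AxvBert-Sentente-Transformer_v1")
bert = AutoModel.from_pretrained("lufercho/AxvBert-Sentente-Transformer_v1")

encoded = tokenizer(["An example arXiv title"], padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])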
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer_v1")
# Run inference
sentences = [
'Simultaneous Feature and Expert Selection within Mixture of Experts',
' A useful strategy to deal with complex classification scenarios is the\n"divide and conquer" approach. The mixture of experts (MOE) technique makes use\nof this strategy by joinly training a set of classifiers, or experts, that are\nspecialized in different regions of the input space. A global model, or gate\nfunction, complements the experts by learning a function that weights their\nrelevance in different parts of the input space. Local feature selection\nappears as an attractive alternative to improve the specialization of experts\nand gate function, particularly, for the case of high dimensional data. Our\nmain intuition is that particular subsets of dimensions, or subspaces, are\nusually more appropriate to classify instances located in different regions of\nthe input space. Accordingly, this work contributes with a regularized variant\nof MoE that incorporates an embedded process for local feature selection using\n$L1$ regularization, with a simultaneous expert selection. The experiments are\nstill pending.\n',
" Deep convolutional networks have proven to be very successful in learning\ntask specific features that allow for unprecedented performance on various\ncomputer vision tasks. Training of such networks follows mostly the supervised\nlearning paradigm, where sufficiently many input-output pairs are required for\ntraining. Acquisition of large training sets is one of the key challenges, when\napproaching a new task. In this paper, we aim for generic feature learning and\npresent an approach for training a convolutional network using only unlabeled\ndata. To this end, we train the network to discriminate between a set of\nsurrogate classes. Each surrogate class is formed by applying a variety of\ntransformations to a randomly sampled 'seed' image patch. In contrast to\nsupervised network training, the resulting feature representation is not class\nspecific. It rather provides robustness to the transformations that have been\napplied during training. This generic feature representation allows for\nclassification results that outperform the state of the art for unsupervised\nlearning on several popular datasets (STL-10, CIFAR-10, Caltech-101,\nCaltech-256). While such generic features cannot compete with class specific\nfeatures from supervised training on a classification task, we show that they\nare advantageous on geometric matching problems, where they also outperform the\nSIFT descriptor.\n",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
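
Because the model was trained on arXiv title/abstract pairs, a natural downstream use is semantic search over abstracts. The following is a minimal sketch using sentence_transformers.util.semantic_search; the corpus strings and query are illustrative placeholders, not part of the training data:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer_v1")

corpus = [
    "We study ensemble methods for multilabel classification ...",
    "We propose an unsupervised feature learning approach for convolutional networks ...",
]
query = "Simultaneous Feature and Expert Selection within Mixture of Experts"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# For each query, return the top_k corpus entries ranked by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(hit["corpus_id"], round(hit["score"], 4))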
Training Details
Training Dataset
Unnamed Dataset
- Size: 5,000 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:

|  | sentence_0 | sentence_1 |
|:---|:---|:---|
| type | string | string |
| details | min: 4 tokens, mean: 13.32 tokens, max: 34 tokens | min: 14 tokens, mean: 198.64 tokens, max: 489 tokens |
- Samples:

| sentence_0 | sentence_1 |
|:---|:---|
| Multilabel Classification through Random Graph Ensembles | We present new methods for multilabel classification, relying on ensemble learning on a collection of random output graphs imposed on the multilabel and a kernel-based structured output learner as the base classifier. For ensemble learning, differences among the output graphs provide the required base classifier diversity and lead to improved performance in the increasing size of the ensemble. We study different methods of forming the ensemble prediction, including majority voting and two methods that perform inferences over the graph structures before or after combining the base models into the ensemble. We compare the methods against the state-of-the-art machine learning approaches on a set of heterogeneous multilabel benchmark problems, including multilabel AdaBoost, convex multitask feature learning, as well as single target learning approaches represented by Bagging and SVM. In our experiments, the random graph ensembles are very competitive and robust, ranking first or second o... |
| Group Sparse Additive Models | We consider the problem of sparse variable selection in nonparametric additive models, with the prior knowledge of the structure among the covariates to encourage those variables within a group to be selected jointly. Previous works either study the group sparsity in the parametric setting (e.g., group lasso), or address the problem in the non-parametric setting without exploiting the structural information (e.g., sparse additive models). In this paper, we present a new method, called group sparse additive models (GroupSpAM), which can handle group sparsity in additive models. We generalize the l1/l2 norm to Hilbert spaces as the sparsity-inducing penalty in GroupSpAM. Moreover, we derive a novel thresholding condition for identifying the functional sparsity at the group level, and propose an efficient block coordinate descent algorithm for constructing the estimate. We demonstrate by simulation that GroupSpAM substantially outperforms the competing methods in terms of support recove... |
| Inverse Covariance Estimation for High-Dimensional Data in Linear Time and Space: Spectral Methods for Riccati and Sparse Models | We propose maximum likelihood estimation for learning Gaussian graphical models with a Gaussian (ell_2^2) prior on the parameters. This is in contrast to the commonly used Laplace (ell_1) prior for encouraging sparseness. We show that our optimization problem leads to a Riccati matrix equation, which has a closed form solution. We propose an efficient algorithm that performs a singular value decomposition of the training data. Our algorithm is O(NT^2)-time and O(NT)-space for N variables and T samples. Our method is tailored to high-dimensional problems (N gg T), in which sparseness promoting methods become intractable. Furthermore, instead of obtaining a single solution for a specific regularization parameter, our algorithm finds the whole solution path. We show that the method has logarithmic sample complexity under the spiked covariance model. We also propose sparsification of the dense solution with provable performance guarantees. We provide techniques for using our learnt model... |

- Loss: MultipleNegativesRankingLoss with these parameters:

{ "scale": 20.0, "similarity_fct": "cos_sim" }
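
As a rough illustration of how this dataset and loss fit together, the sketch below fine-tunes the base model on (sentence_0, sentence_1) pairs with MultipleNegativesRankingLoss. The example rows are shortened placeholders taken from the samples above, and the trainer call is a minimal configuration rather than the exact training script:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Illustrative fine-tuning sketch on (title, abstract) pairs.
model = SentenceTransformer("lufercho/ArxBert-MLM")
train_dataset = Dataset.from_dict({
    "sentence_0": [
        "Group Sparse Additive Models",
        "Multilabel Classification through Random Graph Ensembles",
    ],
    "sentence_1": [
        "We consider the problem of sparse variable selection in nonparametric additive models ...",
        "We present new methods for multilabel classification, relying on ensemble learning ...",
    ],
})

# scale=20.0 with cosine similarity matches the loss parameters listed above.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()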
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 2
- multi_dataset_batch_sampler: round_robin
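
These non-default values correspond to fields of SentenceTransformerTrainingArguments. A minimal sketch of that mapping is shown below; output_dir is a placeholder, not the directory actually used. The resulting args object would be passed to the trainer alongside the model, dataset, and loss.

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import MultiDatasetBatchSamplers

# Illustrative mapping of the non-default hyperparameters above.
args = SentenceTransformerTrainingArguments(
    output_dir="models/AxvBert-Sentente-Transformer_v1",  # placeholder
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)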
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
| Epoch | Step | Training Loss |
|:------|:-----|:--------------|
| 1.5974 | 500 | 0.3282 |
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.3.1
- Transformers: 4.46.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.1.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}