Add new SentenceTransformer model

Files changed (11)

1_Pooling/config.json +10 -0
README.md +730 -0
config.json +26 -0
config_sentence_transformers.json +10 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +57 -0
vocab.txt +0 -0

1_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
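
The pooling config above enables only `pooling_mode_mean_tokens`, so each sentence vector is the average of its token vectors. The following is a minimal sketch of that idea in plain Python, not the library's implementation. The helper name `mean_pool` and the toy 3-dimensional vectors are assumptions made for illustration (the real model uses 768 dimensions).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of non-padding tokens (mask value 1).

    Hypothetical helper for illustration; not part of sentence-transformers.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [total / count for total in summed]


# Toy example: the third token is padding and is ignored.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters here because padded positions would otherwise pull the average toward the padding vectors.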

README.md ADDED

	@@ -0,0 +1,730 @@

+---
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- dataset_size:5000
+- loss:MultipleNegativesRankingLoss
+base_model: lufercho/my-finetuned-bert-mlm
+widget:
+- source_sentence: "A Comprehensive Approach to Universal Piecewise Nonlinear Regression\n\
+    \  Based on Trees"
+  sentences:
+  - "  In sparse recovery we are given a matrix $A$ (the dictionary) and a vector\
+    \ of\nthe form $A X$ where $X$ is sparse, and the goal is to recover $X$. This\
+    \ is a\ncentral notion in signal processing, statistics and machine learning.\
+    \ But in\napplications such as sparse coding, edge detection, compression and\
+    \ super\nresolution, the dictionary $A$ is unknown and has to be learned from\
+    \ random\nexamples of the form $Y = AX$ where $X$ is drawn from an appropriate\n\
+    distribution --- this is the dictionary learning problem. In most settings, $A$\n\
+    is overcomplete: it has more columns than rows. This paper presents a\npolynomial-time\
+    \ algorithm for learning overcomplete dictionaries; the only\npreviously known\
+    \ algorithm with provable guarantees is the recent work of\nSpielman, Wang and\
+    \ Wright who gave an algorithm for the full-rank case, which\nis rarely the case\
+    \ in applications. Our algorithm applies to incoherent\ndictionaries which have\
+    \ been a central object of study since they were\nintroduced in seminal work of\
+    \ Donoho and Huo. In particular, a dictionary is\n$\\mu$-incoherent if each pair\
+    \ of columns has inner product at most $\\mu /\n\\sqrt{n}$.\n  The algorithm makes\
+    \ natural stochastic assumptions about the unknown sparse\nvector $X$, which can\
+    \ contain $k \\leq c \\min(\\sqrt{n}/\\mu \\log n, m^{1/2\n-\\eta})$ non-zero\
+    \ entries (for any $\\eta > 0$). This is close to the best $k$\nallowable by the\
+    \ best sparse recovery algorithms even if one knows the\ndictionary $A$ exactly.\
+    \ Moreover, both the running time and sample complexity\ndepend on $\\log 1/\\\
+    epsilon$, where $\\epsilon$ is the target accuracy, and so\nour algorithms converge\
+    \ very quickly to the true dictionary. Our algorithm can\nalso tolerate substantial\
+    \ amounts of noise provided it is incoherent with\nrespect to the dictionary (e.g.,\
+    \ Gaussian). In the noisy setting, our running\ntime and sample complexity depend\
+    \ polynomially on $1/\\epsilon$, and this is\nnecessary.\n"
+  - '  In this paper, we investigate adaptive nonlinear regression and introduce
+    tree based piecewise linear regression algorithms that are highly efficient and
+    provide significantly improved performance with guaranteed upper bounds in an
+    individual sequence manner. We use a tree notion in order to partition the
+    space of regressors in a nested structure. The introduced algorithms adapt not
+    only their regression functions but also the complete tree structure while
+    achieving the performance of the "best" linear mixture of a doubly exponential
+    number of partitions, with a computational complexity only polynomial in the
+    number of nodes of the tree. While constructing these algorithms, we also avoid
+    using any artificial "weighting" of models (with highly data dependent
+    parameters) and, instead, directly minimize the final regression error, which
+    is the ultimate performance goal. The introduced methods are generic such that
+    they can readily incorporate different tree construction methods such as random
+    trees in their framework and can use different regressor or partitioning
+    functions as demonstrated in the paper.
+    '
+  - '  In this paper we propose a multi-task linear classifier learning problem
+    called D-SVM (Dictionary SVM). D-SVM uses a dictionary of parameter covariance
+    shared by all tasks to do multi-task knowledge transfer among different tasks.
+    We formally define the learning problem of D-SVM and show two interpretations
+    of this problem, from both the probabilistic and kernel perspectives. From the
+    probabilistic perspective, we show that our learning formulation is actually a
+    MAP estimation on all optimization variables. We also show its equivalence to
+    a
+    multiple kernel learning problem in which one is trying to find a re-weighting
+    kernel for features from a dictionary of basis (despite the fact that only
+    linear classifiers are learned). Finally, we describe an alternative
+    optimization scheme to minimize the objective function and present empirical
+    studies to valid our algorithm.
+    '
+- source_sentence: "A Game-theoretic Machine Learning Approach for Revenue Maximization\
+    \ in\n  Sponsored Search"
+  sentences:
+  - '  A learning algorithm based on primary school teaching and learning is
+    presented. The methodology is to continuously evaluate a student and to give
+    them training on the examples for which they repeatedly fail, until, they can
+    correctly answer all types of questions. This incremental learning procedure
+    produces better learning curves by demanding the student to optimally dedicate
+    their learning time on the failed examples. When used in machine learning, the
+    algorithm is found to train a machine on a data with maximum variance in the
+    feature space so that the generalization ability of the network improves. The
+    algorithm has interesting applications in data mining, model evaluations and
+    rare objects discovery.
+    '
+  - '  In this paper we extend temporal difference policy evaluation algorithms to
+    performance criteria that include the variance of the cumulative reward. Such
+    criteria are useful for risk management, and are important in domains such as
+    finance and process control. We propose both TD(0) and LSTD(lambda) variants
+    with linear function approximation, prove their convergence, and demonstrate
+    their utility in a 4-dimensional continuous state space problem.
+    '
+  - '  Sponsored search is an important monetization channel for search engines, in
+    which an auction mechanism is used to select the ads shown to users and
+    determine the prices charged from advertisers. There have been several pieces
+    of work in the literature that investigate how to design an auction mechanism
+    in order to optimize the revenue of the search engine. However, due to some
+    unrealistic assumptions used, the practical values of these studies are not
+    very clear. In this paper, we propose a novel \emph{game-theoretic machine
+    learning} approach, which naturally combines machine learning and game theory,
+    and learns the auction mechanism using a bilevel optimization framework. In
+    particular, we first learn a Markov model from historical data to describe how
+    advertisers change their bids in response to an auction mechanism, and then for
+    any given auction mechanism, we use the learnt model to predict its
+    corresponding future bid sequences. Next we learn the auction mechanism through
+    empirical revenue maximization on the predicted bid sequences. We show that the
+    empirical revenue will converge when the prediction period approaches infinity,
+    and a Genetic Programming algorithm can effectively optimize this empirical
+    revenue. Our experiments indicate that the proposed approach is able to produce
+    a much more effective auction mechanism than several baselines.
+    '
+- source_sentence: Normalized Online Learning
+  sentences:
+  - "  The Frank-Wolfe method (a.k.a. conditional gradient algorithm) for smooth\n\
+    optimization has regained much interest in recent years in the context of large\n\
+    scale optimization and machine learning. A key advantage of the method is that\n\
+    it avoids projections - the computational bottleneck in many applications -\n\
+    replacing it by a linear optimization step. Despite this advantage, the known\n\
+    convergence rates of the FW method fall behind standard first order methods for\n\
+    most settings of interest. It is an active line of research to derive faster\n\
+    linear optimization-based algorithms for various settings of convex\noptimization.\n\
+    \  In this paper we consider the special case of optimization over strongly\n\
+    convex sets, for which we prove that the vanila FW method converges at a rate\n\
+    of $\\frac{1}{t^2}$. This gives a quadratic improvement in convergence rate\n\
+    compared to the general case, in which convergence is of the order\n$\\frac{1}{t}$,\
+    \ and known to be tight. We show that various balls induced by\n$\\ell_p$ norms,\
+    \ Schatten norms and group norms are strongly convex on one hand\nand on the other\
+    \ hand, linear optimization over these sets is straightforward\nand admits a closed-form\
+    \ solution. We further show how several previous\nfast-rate results for the FW\
+    \ method follow easily from our analysis.\n"
+  - '  We introduce online learning algorithms which are independent of feature
+    scales, proving regret bounds dependent on the ratio of scales existent in the
+    data rather than the absolute scale. This has several useful effects: there is
+    no need to pre-normalize data, the test-time and test-space complexity are
+    reduced, and the algorithms are more robust.
+    '
+  - '  In order to achieve high efficiency of classification in intrusion detection,
+    a compressed model is proposed in this paper which combines horizontal
+    compression with vertical compression. OneR is utilized as horizontal
+    com-pression for attribute reduction, and affinity propagation is employed as
+    vertical compression to select small representative exemplars from large
+    training data. As to be able to computationally compress the larger volume of
+    training data with scalability, MapReduce based parallelization approach is
+    then implemented and evaluated for each step of the model compression process
+    abovementioned, on which common but efficient classification methods can be
+    directly used. Experimental application study on two publicly available
+    datasets of intrusion detection, KDD99 and CMDC2012, demonstrates that the
+    classification using the compressed model proposed can effectively speed up the
+    detection procedure at up to 184 times, most importantly at the cost of a
+    minimal accuracy difference with less than 1% on average.
+    '
+- source_sentence: Bounds on the Bethe Free Energy for Gaussian Networks
+  sentences:
+  - '  We extend the Bayesian Information Criterion (BIC), an asymptotic
+    approximation for the marginal likelihood, to Bayesian networks with hidden
+    variables. This approximation can be used to select models given large samples
+    of data. The standard BIC as well as our extension punishes the complexity of
+    a
+    model according to the dimension of its parameters. We argue that the dimension
+    of a Bayesian network with hidden variables is the rank of the Jacobian matrix
+    of the transformation between the parameters of the network and the parameters
+    of the observable variables. We compute the dimensions of several networks
+    including the naive Bayes model with a hidden root node.
+    '
+  - '  Complex networks refer to large-scale graphs with nontrivial connection
+    patterns. The salient and interesting features that the complex network study
+    offer in comparison to graph theory are the emphasis on the dynamical
+    properties of the networks and the ability of inherently uncovering pattern
+    formation of the vertices. In this paper, we present a hybrid data
+    classification technique combining a low level and a high level classifier. The
+    low level term can be equipped with any traditional classification techniques,
+    which realize the classification task considering only physical features (e.g.,
+    geometrical or statistical features) of the input data. On the other hand, the
+    high level term has the ability of detecting data patterns with semantic
+    meanings. In this way, the classification is realized by means of the
+    extraction of the underlying network''s features constructed from the input
+    data. As a result, the high level classification process measures the
+    compliance of the test instances with the pattern formation of the training
+    data. Out of various high level perspectives that can be utilized to capture
+    semantic meaning, we utilize the dynamical features that are generated from a
+    tourist walker in a networked environment. Specifically, a weighted combination
+    of transient and cycle lengths generated by the tourist walk is employed for
+    that end. Interestingly, our study shows that the proposed technique is able to
+    further improve the already optimized performance of traditional classification
+    techniques.
+    '
+  - '  We address the problem of computing approximate marginals in Gaussian
+    probabilistic models by using mean field and fractional Bethe approximations.
+    As an extension of Welling and Teh (2001), we define the Gaussian fractional
+    Bethe free energy in terms of the moment parameters of the approximate
+    marginals and derive an upper and lower bound for it. We give necessary
+    conditions for the Gaussian fractional Bethe free energies to be bounded from
+    below. It turns out that the bounding condition is the same as the pairwise
+    normalizability condition derived by Malioutov et al. (2006) as a sufficient
+    condition for the convergence of the message passing algorithm. By giving a
+    counterexample, we disprove the conjecture in Welling and Teh (2001): even when
+    the Bethe free energy is not bounded from below, it can possess a local minimum
+    to which the minimization algorithms can converge.
+    '
+- source_sentence: Multi-Armed Bandits in Metric Spaces
+  sentences:
+  - '  The paper presents a FrameNet-based information extraction and knowledge
+    representation framework, called FrameNet-CNL. The framework is used on natural
+    language documents and represents the extracted knowledge in a tailor-made
+    Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can be
+    generated automatically in multiple languages. This approach brings together
+    the fields of information extraction and CNL, because a source text can be
+    considered belonging to FrameNet-CNL, if information extraction parser produces
+    the correct knowledge representation as a result. We describe a
+    state-of-the-art information extraction parser used by a national news agency
+    and speculate that FrameNet-CNL eventually could shape the natural language
+    subset used for writing the newswire articles.
+    '
+  - '  Applications such as face recognition that deal with high-dimensional data
+    need a mapping technique that introduces representation of low-dimensional
+    features with enhanced discriminatory power and a proper classifier, able to
+    classify those complex features. Most of traditional Linear Discriminant
+    Analysis suffer from the disadvantage that their optimality criteria are not
+    directly related to the classification ability of the obtained feature
+    representation. Moreover, their classification accuracy is affected by the
+    "small sample size" problem which is often encountered in FR tasks. In this
+    short paper, we combine nonlinear kernel based mapping of data called KDDA with
+    Support Vector machine classifier to deal with both of the shortcomings in an
+    efficient and cost effective manner. The proposed here method is compared, in
+    terms of classification accuracy, to other commonly used FR methods on UMIST
+    face database. Results indicate that the performance of the proposed method is
+    overall superior to those of traditional FR approaches, such as the Eigenfaces,
+    Fisherfaces, and D-LDA methods and traditional linear classifiers.
+    '
+  - '  In a multi-armed bandit problem, an online algorithm chooses from a set of
+    strategies in a sequence of trials so as to maximize the total payoff of the
+    chosen strategies. While the performance of bandit algorithms with a small
+    finite strategy set is quite well understood, bandit problems with large
+    strategy sets are still a topic of very active investigation, motivated by
+    practical applications such as online auctions and web advertisement. The goal
+    of such research is to identify broad and natural classes of strategy sets and
+    payoff functions which enable the design of efficient solutions. In this work
+    we study a very general setting for the multi-armed bandit problem in which the
+    strategies form a metric space, and the payoff function satisfies a Lipschitz
+    condition with respect to the metric. We refer to this problem as the
+    "Lipschitz MAB problem". We present a complete solution for the multi-armed
+    problem in this setting. That is, for every metric space (L,X) we define an
+    isometry invariant which bounds from below the performance of Lipschitz MAB
+    algorithms for X, and we present an algorithm which comes arbitrarily close to
+    meeting this bound. Furthermore, our technique gives even better results for
+    benign payoff functions.
+    '
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+---
+# SentenceTransformer based on lufercho/my-finetuned-bert-mlm
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [lufercho/my-finetuned-bert-mlm](https://huggingface.co/lufercho/my-finetuned-bert-mlm). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [lufercho/my-finetuned-bert-mlm](https://huggingface.co/lufercho/my-finetuned-bert-mlm) <!-- at revision 8cf44893fd607477d06b067f1788b495abac1b2c -->
+- **Maximum Sequence Length:** 512 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+<!-- - **Training Dataset:** Unknown -->
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+### Model Sources
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer")
+# Run inference
+sentences = [
+    'Multi-Armed Bandits in Metric Spaces',
+    '  In a multi-armed bandit problem, an online algorithm chooses from a set of\nstrategies in a sequence of trials so as to maximize the total payoff of the\nchosen strategies. While the performance of bandit algorithms with a small\nfinite strategy set is quite well understood, bandit problems with large\nstrategy sets are still a topic of very active investigation, motivated by\npractical applications such as online auctions and web advertisement. The goal\nof such research is to identify broad and natural classes of strategy sets and\npayoff functions which enable the design of efficient solutions. In this work\nwe study a very general setting for the multi-armed bandit problem in which the\nstrategies form a metric space, and the payoff function satisfies a Lipschitz\ncondition with respect to the metric. We refer to this problem as the\n"Lipschitz MAB problem". We present a complete solution for the multi-armed\nproblem in this setting. That is, for every metric space (L,X) we define an\nisometry invariant which bounds from below the performance of Lipschitz MAB\nalgorithms for X, and we present an algorithm which comes arbitrarily close to\nmeeting this bound. Furthermore, our technique gives even better results for\nbenign payoff functions.\n',
+    '  Applications such as face recognition that deal with high-dimensional data\nneed a mapping technique that introduces representation of low-dimensional\nfeatures with enhanced discriminatory power and a proper classifier, able to\nclassify those complex features. Most of traditional Linear Discriminant\nAnalysis suffer from the disadvantage that their optimality criteria are not\ndirectly related to the classification ability of the obtained feature\nrepresentation. Moreover, their classification accuracy is affected by the\n"small sample size" problem which is often encountered in FR tasks. In this\nshort paper, we combine nonlinear kernel based mapping of data called KDDA with\nSupport Vector machine classifier to deal with both of the shortcomings in an\nefficient and cost effective manner. The proposed here method is compared, in\nterms of classification accuracy, to other commonly used FR methods on UMIST\nface database. Results indicate that the performance of the proposed method is\noverall superior to those of traditional FR approaches, such as the Eigenfaces,\nFisherfaces, and D-LDA methods and traditional linear classifiers.\n',
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# (3, 768)
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# torch.Size([3, 3])
+```
+<!--
+### Direct Usage (Transformers)
+<details><summary>Click to see the direct usage in Transformers</summary>
+</details>
+-->
+<!--
+### Downstream Usage (Sentence Transformers)
+You can finetune this model on your own dataset.
+<details><summary>Click to expand</summary>
+</details>
+-->
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+## Training Details
+### Training Dataset
+#### Unnamed Dataset
+* Size: 5,000 training samples
+* Columns: <code>sentence_0</code> and <code>sentence_1</code>
+* Approximate statistics based on the first 1000 samples:
+  |         | sentence_0                                                                        | sentence_1                                                                           |
+  |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
+  | type    | string                                                                            | string                                                                               |
+  | details | <ul><li>min: 4 tokens</li><li>mean: 13.29 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 26 tokens</li><li>mean: 202.49 tokens</li><li>max: 506 tokens</li></ul> |
+* Samples:
+  | sentence_0                                                               | sentence_1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+  |:-------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | <code>Validation of nonlinear PCA</code>                                 | <code>  Linear principal component analysis (PCA) can be extended to a nonlinear PCA<br>by using artificial neural networks. But the benefit of curved components<br>requires a careful control of the model complexity. Moreover, standard<br>techniques for model selection, including cross-validation and more generally<br>the use of an independent test set, fail when applied to nonlinear PCA because<br>of its inherent unsupervised characteristics. This paper presents a new<br>approach for validating the complexity of nonlinear PCA models by using the<br>error in missing data estimation as a criterion for model selection. It is<br>motivated by the idea that only the model of optimal complexity is able to<br>predict missing values with the highest accuracy. While standard test set<br>validation usually favours over-fitted nonlinear PCA models, the proposed model<br>validation approach correctly selects the optimal model complexity.<br></code>                                                                                                       |
+  | <code>Learning Attitudes and Attributes from Multi-Aspect Reviews</code> | <code>  The majority of online reviews consist of plain-text feedback together with a<br>single numeric score. However, there are multiple dimensions to products and<br>opinions, and understanding the `aspects' that contribute to users' ratings may<br>help us to better understand their individual preferences. For example, a<br>user's impression of an audiobook presumably depends on aspects such as the<br>story and the narrator, and knowing their opinions on these aspects may help us<br>to recommend better products. In this paper, we build models for rating systems<br>in which such dimensions are explicit, in the sense that users leave separate<br>ratings for each aspect of a product. By introducing new corpora consisting of<br>five million reviews, rated with between three and six aspects, we evaluate our<br>models on three prediction tasks: First, we use our model to uncover which<br>parts of a review discuss which of the rated aspects. Second, we use our model<br>to summarize reviews, which for us means finding the sentences...</code> |
+  | <code>Bayesian Differential Privacy through Posterior Sampling</code>    | <code>  Differential privacy formalises privacy-preserving mechanisms that provide<br>access to a database. We pose the question of whether Bayesian inference itself<br>can be used directly to provide private access to data, with no modification.<br>The answer is affirmative: under certain conditions on the prior, sampling from<br>the posterior distribution can be used to achieve a desired level of privacy<br>and utility. To do so, we generalise differential privacy to arbitrary dataset<br>metrics, outcome spaces and distribution families. This allows us to also deal<br>with non-i.i.d or non-tabular datasets. We prove bounds on the sensitivity of<br>the posterior to the data, which gives a measure of robustness. We also show<br>how to use posterior sampling to provide differentially private responses to<br>queries, within a decision-theoretic framework. Finally, we provide bounds on<br>the utility and on the distinguishability of datasets. The latter are<br>complemented by a novel use of Le Cam's method to obtain lower bounds....</code> |
+* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+  ```json
+  {
+      "scale": 20.0,
+      "similarity_fct": "cos_sim"
+  }
+  ```
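The loss parameters above (`scale` of 20.0 and cosine similarity) can be read as a cross-entropy over in-batch candidates: for each `sentence_0`, its paired `sentence_1` is the correct class and every other `sentence_1` in the batch acts as a negative. The sketch below is a plain-Python illustration under that reading, not the library's code. The names `cos_sim` and `mnr_loss` and the 2-dimensional toy vectors are assumptions, and the vectors are assumed to be non-zero.

```python
import math
from typing import List


def cos_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors: List[List[float]], positives: List[List[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    Hypothetical helper for illustration only.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        logits = [scale * cos_sim(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])   # pairs match: loss near 0
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])   # pairs mismatched: loss near 20
print(aligned, swapped)
```

With the batch size of 16 listed in this card, each title would be contrasted against up to 15 other abstracts per step, assuming no duplicate pairs within a batch.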
+### Training Hyperparameters
+#### Non-Default Hyperparameters
+- `per_device_train_batch_size`: 16
+- `per_device_eval_batch_size`: 16
+- `num_train_epochs`: 2
+- `multi_dataset_batch_sampler`: round_robin
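As a rough worked example, the values in this card imply the number of optimizer steps and the shape of the learning-rate decay. The sketch below assumes a single device (the card does not state the device count) and uses the card's values for `gradient_accumulation_steps` (1), `dataloader_drop_last` (False), `learning_rate` (5e-05), `lr_scheduler_type` (linear) and `warmup_steps` (0). The function name `linear_lr` is hypothetical.

```python
import math

# Values taken from the model card; single-device training is an assumption.
num_samples = 5000
batch_size = 16
num_epochs = 2
drop_last = False

if drop_last:
    steps_per_epoch = num_samples // batch_size
else:
    # The final partial batch still counts as one step.
    steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step: int, base_lr: float = 5e-5, total: int = total_steps, warmup: int = 0) -> float:
    """Linear warmup followed by linear decay to zero (hypothetical helper)."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


print(steps_per_epoch, total_steps)
print(linear_lr(0), linear_lr(steps_per_epoch), linear_lr(total_steps))
```

Under these assumptions the run takes 313 steps per epoch and 626 steps in total, with the learning rate halved by the end of the first epoch.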
+#### All Hyperparameters
+<details><summary>Click to expand</summary>
+- `overwrite_output_dir`: False
+- `do_predict`: False
+- `eval_strategy`: no
+- `prediction_loss_only`: True
+- `per_device_train_batch_size`: 16
+- `per_device_eval_batch_size`: 16
+- `per_gpu_train_batch_size`: None
+- `per_gpu_eval_batch_size`: None
+- `gradient_accumulation_steps`: 1
+- `eval_accumulation_steps`: None
+- `torch_empty_cache_steps`: None
+- `learning_rate`: 5e-05
+- `weight_decay`: 0.0
+- `adam_beta1`: 0.9
+- `adam_beta2`: 0.999
+- `adam_epsilon`: 1e-08
+- `max_grad_norm`: 1
+- `num_train_epochs`: 2
+- `max_steps`: -1
+- `lr_scheduler_type`: linear
+- `lr_scheduler_kwargs`: {}
+- `warmup_ratio`: 0.0
+- `warmup_steps`: 0
+- `log_level`: passive
+- `log_level_replica`: warning
+- `log_on_each_node`: True
+- `logging_nan_inf_filter`: True
+- `save_safetensors`: True
+- `save_on_each_node`: False
+- `save_only_model`: False
+- `restore_callback_states_from_checkpoint`: False
+- `no_cuda`: False
+- `use_cpu`: False
+- `use_mps_device`: False
+- `seed`: 42
+- `data_seed`: None
+- `jit_mode_eval`: False
+- `use_ipex`: False
+- `bf16`: False
+- `fp16`: False
+- `fp16_opt_level`: O1
+- `half_precision_backend`: auto
+- `bf16_full_eval`: False
+- `fp16_full_eval`: False
+- `tf32`: None
+- `local_rank`: 0
+- `ddp_backend`: None
+- `tpu_num_cores`: None
+- `tpu_metrics_debug`: False
+- `debug`: []
+- `dataloader_drop_last`: False
+- `dataloader_num_workers`: 0
+- `dataloader_prefetch_factor`: None
+- `past_index`: -1
+- `disable_tqdm`: False
+- `remove_unused_columns`: True
+- `label_names`: None
+- `load_best_model_at_end`: False
+- `ignore_data_skip`: False
+- `fsdp`: []
+- `fsdp_min_num_params`: 0
+- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+- `fsdp_transformer_layer_cls_to_wrap`: None
+- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+- `deepspeed`: None
+- `label_smoothing_factor`: 0.0
+- `optim`: adamw_torch
+- `optim_args`: None
+- `adafactor`: False
+- `group_by_length`: False
+- `length_column_name`: length
+- `ddp_find_unused_parameters`: None
+- `ddp_bucket_cap_mb`: None
+- `ddp_broadcast_buffers`: False
+- `dataloader_pin_memory`: True
+- `dataloader_persistent_workers`: False
+- `skip_memory_metrics`: True
+- `use_legacy_prediction_loop`: False
+- `push_to_hub`: False
+- `resume_from_checkpoint`: None
+- `hub_model_id`: None
+- `hub_strategy`: every_save
+- `hub_private_repo`: False
+- `hub_always_push`: False
+- `gradient_checkpointing`: False
+- `gradient_checkpointing_kwargs`: None
+- `include_inputs_for_metrics`: False
+- `include_for_metrics`: []
+- `eval_do_concat_batches`: True
+- `fp16_backend`: auto
+- `push_to_hub_model_id`: None
+- `push_to_hub_organization`: None
+- `mp_parameters`:
+- `auto_find_batch_size`: False
+- `full_determinism`: False
+- `torchdynamo`: None
+- `ray_scope`: last
+- `ddp_timeout`: 1800
+- `torch_compile`: False
+- `torch_compile_backend`: None
+- `torch_compile_mode`: None
+- `dispatch_batches`: None
+- `split_batches`: None
+- `include_tokens_per_second`: False
+- `include_num_input_tokens_seen`: False
+- `neftune_noise_alpha`: None
+- `optim_target_modules`: None
+- `batch_eval_metrics`: False
+- `eval_on_start`: False
+- `use_liger_kernel`: False
+- `eval_use_gather_object`: False
+- `average_tokens_across_devices`: False
+- `prompts`: None
+- `batch_sampler`: batch_sampler
+- `multi_dataset_batch_sampler`: round_robin
+</details>
+### Training Logs
+| Epoch  | Step | Training Loss |
+|:------:|:----:|:-------------:|
+| 1.5974 | 500  | 0.3039        |
+### Framework Versions
+- Python: 3.10.12
+- Sentence Transformers: 3.3.1
+- Transformers: 4.46.2
+- PyTorch: 2.5.1+cu121
+- Accelerate: 1.1.1
+- Datasets: 3.1.0
+- Tokenizers: 0.20.3
+## Citation
+### BibTeX
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+#### MultipleNegativesRankingLoss
+```bibtex
+@misc{henderson2017efficient,
+    title={Efficient Natural Language Response Suggestion for Smart Reply},
+    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+    year={2017},
+    eprint={1705.00652},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+<!--
+## Glossary
+*Clearly define terms in order to be accessible across audiences.*
+-->
+<!--
+## Model Card Authors
+*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+-->
+<!--
+## Model Card Contact
+*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->
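
Review note on the Training Logs table in the README above: the single logged row (epoch 1.5974 at step 500) agrees with the `dataset_size:5000` tag and the batch size of 16. The sketch below reproduces that arithmetic. It assumes a single training device and relies on `gradient_accumulation_steps: 1` and `dataloader_drop_last: False` from the hyperparameter list; the variable names are illustrative and do not come from the trainer.

```python
import math

dataset_size = 5000   # from the `dataset_size:5000` tag
batch_size = 16       # per_device_train_batch_size, assuming one device
grad_accum = 1        # gradient_accumulation_steps
num_epochs = 2        # num_train_epochs

# The last partial batch is kept (dataloader_drop_last: False), hence ceil.
steps_per_epoch = math.ceil(dataset_size / batch_size) // grad_accum
total_steps = steps_per_epoch * num_epochs
epoch_at_step_500 = 500 / steps_per_epoch

print(steps_per_epoch, total_steps, round(epoch_at_step_500, 4))
# prints: 313 626 1.5974
```

With 626 optimizer steps in total, a logging interval of 500 steps would yield exactly one row. That interval is not listed among the hyperparameters, so this assumes the Transformers default was in effect.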

config.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "_name_or_path": "lufercho/my-finetuned-bert-mlm",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}

config_sentence_transformers.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "__version__": {
+    "sentence_transformers": "3.3.1",
+    "transformers": "4.46.2",
+    "pytorch": "2.5.1+cu121"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}
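
Review note: `similarity_fn_name` is `cosine` here, matching the `cos_sim` similarity function and `scale` of 20.0 listed for `MultipleNegativesRankingLoss` in the README. As I understand that loss, each anchor is scored against every positive in the batch, the similarities are multiplied by the scale, and cross-entropy is taken with the matching positive as the target. The pure-Python sketch below illustrates that idea on toy 2-d vectors. It is not the library's implementation, and `cos_sim` and `mnr_loss` are names local to this sketch.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy embeddings (made up for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]
swapped = [aligned[1], aligned[0]]

print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))
# prints: True
```

The scale acts like an inverse temperature: with cosine values bounded in [-1, 1], multiplying by 20 spreads the logits enough for the softmax to separate the matching pair from the in-batch negatives.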

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5acb361ba7f01378e500c54cef73f7794364f98dec8469c218eb8b51f1d5ede8
+size 437951328
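
Review note: the LFS pointer size can be sanity-checked against `config.json`. The sketch below counts parameters for a standard `BertModel` layout (embeddings, 12 encoder layers, pooler) and multiplies by 4 bytes for `float32`. It assumes the checkpoint stores exactly those tensors and nothing else; the helper `linear` is local to this sketch.

```python
# Values copied from config.json in this commit.
hidden, layers, vocab = 768, 12, 30522
intermediate, max_pos, type_vocab = 3072, 512, 2

def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden) + layer_norm      # Q, K, V, attention output
    + linear(hidden, intermediate)               # feed-forward up-projection
    + linear(intermediate, hidden) + layer_norm  # feed-forward down-projection
)
pooler = linear(hidden, hidden)

params = embeddings + layers * encoder_layer + pooler
tensor_bytes = params * 4    # torch_dtype: float32
file_size = 437951328        # from the LFS pointer above

print(params, tensor_bytes, file_size - tensor_bytes)
# prints: 109482240 437928960 22368
```

The remaining 22,368 bytes are presumably the safetensors header and metadata; that attribution is an assumption, not something the diff states.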

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
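
Review note: `modules.json` chains a Transformer module into a Pooling module, and `1_Pooling/config.json` earlier in this commit enables only `pooling_mode_mean_tokens`. A minimal sketch of masked mean pooling for one sequence follows, assuming per-token vectors and a 0/1 attention mask as inputs. The function name and the toy values are illustrative, not the library's code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    count = max(count, 1)  # guard against an all-padding sequence
    return [s / count for s in summed]

# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
# prints: [2.0, 3.0]
```

For this model the pooled vector would have 768 entries (`word_embedding_dimension`), one sentence embedding per input text.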

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,57 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
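
Review note: the special-token IDs in `added_tokens_decoder` (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102) and `model_max_length` of 512 determine how a single sequence is framed before it reaches the encoder. The sketch below shows that framing on already-tokenized input. The function `frame` and the WordPiece IDs passed to it are hypothetical placeholders, and the real `BertTokenizer` does considerably more (lower-casing, WordPiece splitting, and so on).

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102   # from added_tokens_decoder
MAX_LEN = 512                          # model_max_length

def frame(wordpiece_ids, pad_to=None, max_len=MAX_LEN):
    """Wrap IDs as [CLS] ... [SEP], truncating to max_len and padding to pad_to."""
    body = wordpiece_ids[: max_len - 2]   # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    if pad_to is not None and len(ids) < pad_to:
        padding = pad_to - len(ids)
        ids += [PAD_ID] * padding
        mask += [0] * padding             # padding is masked out
    return ids, mask

ids, mask = frame([2023, 2003], pad_to=6)   # placeholder token IDs
print(ids, mask)
# prints: [101, 2023, 2003, 102, 0, 0] [1, 1, 1, 1, 0, 0]
```

The attention mask produced here is what lets the mean-pooling stage ignore `[PAD]` positions.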

vocab.txt ADDED

The diff for this file is too large to render.