DistilRoBERTa-For-Semantic-Similarity
This model is a fine-tuned version of distilbert/distilroberta-base trained on the Stanford SNLI corpus (v1.0), available at stanfordnlp/snli. It achieves the following results on the evaluation set:
- Loss: 0.2843
- F1 (weighted average): 0.9032
Model description
The model classifies a pair of text inputs as demonstrating entailment, contradiction, or neutrality. Entailment indicates that the hypothesis can be inferred from the premise; contradiction indicates that the hypothesis is contradicted by the premise; neutrality indicates that neither entailment nor contradiction holds.
Intended uses & limitations
The model was created for comparing two texts, with fine-tuning done on DistilRoBERTa because it is relatively lightweight. This allows comparisons to be made quickly, suiting the model to tasks where real-time comparison may be necessary.
The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to produce factual or true representations of people or events, so using it for such purposes is out of scope.
Training and evaluation data
The model was trained using the Stanford SNLI corpus (v1.0). The corpus consists of text pairs (premises and hypotheses) and labels indicating entailment, neutrality, or contradiction, identified numerically as 0, 1, and 2, respectively.
The corpus contains training, evaluation, and test splits consisting of approximately 550,000, 10,000, and 10,000 premise-hypothesis-label entries, respectively. Entailment, neutrality, and contradiction are relatively balanced within each split. Each split was used in its entirety when training the model following grid-search hyperparameter optimization, and each served a single purpose: the training split for training the model, the evaluation split for evaluating the model's performance, and the test split for further assessing the model's inference capabilities on unseen data.
Some entries contained a "-" label, indicating that no gold label was assigned. Those entries were filtered out while preparing the dataset for use.
The corpus was chosen for several reasons. Primarily, the size of the splits, in conjunction with their balance, provides a breadth of data points from which the model's performance can converge. The balance of the splits also helps prevent the model from becoming particularly adept at classifying some labels but not others. Finally, the size and balance of the dataset provide more data points from which the model's performance can be generalized, helping both to hedge against overfitting and to improve performance on new text inputs.
Training procedure
The model was trained on an Nvidia A100 SXM4 40 GB GPU, accessed through Google Colab.
DistilRoBERTa and its tokenizer were loaded through the AutoModelForSequenceClassification and AutoTokenizer classes, respectively, with the tokenizer's do_lower_case argument set to False.
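A minimal sketch of how the model and tokenizer might be loaded, assuming the standard from_pretrained interface; the variable names are illustrative:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_checkpoint = "distilbert/distilroberta-base"

# RoBERTa's BPE tokenizer is case-sensitive by default; do_lower_case=False
# simply makes that explicit.
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint, do_lower_case=False)

# Three output classes: 0 = entailment, 1 = neutral, 2 = contradiction.
model = AutoModelForSequenceClassification.from_pretrained(base_checkpoint, num_labels=3)
```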
The data was loaded and prepared using the Hugging Face Datasets API and the DistilRoBERTa tokenizer, with entries containing a "-" label filtered out prior to tokenization. Padding and truncation were enabled with a maximum sequence length of 128. Most data points in the corpus do not exceed that length, so a greater value would have been unnecessary, with the most notable consequence likely being greater demands on hardware resources.
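A sketch of this preparation step, assuming the Hub copy of SNLI, where unlabeled entries carry a label of -1 (the raw corpus marks them with "-"); function and variable names are illustrative:

```python
from datasets import load_dataset

snli = load_dataset("stanfordnlp/snli")

# Remove entries without a gold label (-1 in the Hub version of the corpus).
snli = snli.filter(lambda example: example["label"] != -1)

def tokenize(batch):
    # Premise and hypothesis are encoded as a single sequence pair,
    # padded and truncated to 128 tokens.
    return tokenizer(
        batch["premise"],
        batch["hypothesis"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized = snli.map(tokenize, batched=True)
```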
Prior to training the model, grid-search hyperparameter optimization was performed across a set of values for batch size, learning rate, and weight decay. The values tested were as follows:
- Batch size: 32, 64, 128
- Learning rate: 1e-6, 1e-5, 2e-5, 5e-5
- Weight decay: 1e-3, 0.01, 0.1
These values were chosen from ranges of common and/or recommended hyperparameter values for models like DistilRoBERTa to provide good performance on the available hardware, maximize the model's performance, and prevent overfitting while encouraging generalizability. A set of for loops was used to run the Hugging Face Trainer API with PyTorch for every permutation of the hyperparameters (a sketch of this loop appears below). For each run of train(), DistilRoBERTa was reloaded from the Hugging Face Hub so that each step of the optimization procedure trained the same base DistilRoBERTa model on a new permutation of hyperparameter values. Otherwise, it would have been unclear whether successive steps applied training with the new hyperparameters to the model resulting from the preceding step's training.
Of note, while the values used for the optimization procedure were chosen to span adequate ranges, they do not cover the values in between. As such, while the optimization procedure maximized the model's performance for the given values, testing more granular and comprehensive ranges may further improve performance. The number of epochs is one such example: three epochs were used to encourage convergence while minimizing the risk of overfitting and the computational burden of the optimization procedure, but testing further values may reveal improvements with a different number.
Beyond the hyperparameters varied during optimization, both the optimization and training procedures used the same configuration: the AdamW optimizer, chosen for its well-roundedness relative to the Adafactor and stochastic gradient descent optimizers; the cross-entropy loss function, chosen for its applicability to discrete classes relative to cosine embedding loss; 1,000 warmup steps; and the weighted-average F1 score as the primary metric, chosen both because it penalizes false outputs more than an accuracy metric does and because, relative to the macro- and micro-average F1 scores, it encourages correct outputs across all classes while accounting for any small imbalances in the dataset.
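A sketch of a compute_metrics function for the weighted-average F1 score, assuming scikit-learn as the metric backend (the card does not specify which implementation was used):

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Weighted average: per-class F1 scores weighted by class support.
    return {"f1": f1_score(labels, predictions, average="weighted")}
```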
Of note, many other features could have been utilized. As such, testing the use of different features for the trainer may serve to further improve the model's performance.
During the optimization procedure, approximately 10% of the dataset's training split and half of its evaluation split were used to limit the time and resource demands of the procedure. This still amounted to roughly 50,000 training and 5,000 evaluation data points, respectively, so the constrained datasets were not so small as to prevent the model from converging or from providing a meaningful measure of performance. The data was shuffled in both the hyperparameter optimization and training procedures so that it was fed to the model in random order, with seeds set for replicability. The seeds used during hyperparameter optimization and training were different, so that the model saw the data in a distinct order in each phase and could not learn to expect inputs in a certain order and bias its outputs accordingly.
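The grid-search loop referenced above might look roughly like the following sketch, which reloads the base checkpoint for each permutation and trains on the shuffled subsets; the seed, subset sizes, and output directory naming here are illustrative assumptions rather than values taken from the card:

```python
from transformers import Trainer, TrainingArguments

# Roughly 10% of the training split and half of the evaluation split.
train_subset = tokenized["train"].shuffle(seed=42).select(range(50_000))
eval_subset = tokenized["validation"].shuffle(seed=42).select(range(5_000))

best_run = {"f1": 0.0}
for batch_size in (32, 64, 128):
    for learning_rate in (1e-6, 1e-5, 2e-5, 5e-5):
        for weight_decay in (1e-3, 0.01, 0.1):
            # Reload the base checkpoint so every run starts from the same weights.
            model = AutoModelForSequenceClassification.from_pretrained(
                base_checkpoint, num_labels=3
            )
            args = TrainingArguments(
                output_dir=f"grid-{batch_size}-{learning_rate}-{weight_decay}",
                per_device_train_batch_size=batch_size,
                learning_rate=learning_rate,
                weight_decay=weight_decay,
                warmup_steps=1000,
                num_train_epochs=3,
                optim="adamw_torch",
                eval_strategy="epoch",
            )
            trainer = Trainer(
                model=model,
                args=args,
                train_dataset=train_subset,
                eval_dataset=eval_subset,
                compute_metrics=compute_metrics,
            )
            trainer.train()
            f1 = trainer.evaluate()["eval_f1"]
            if f1 > best_run["f1"]:
                best_run = {
                    "f1": f1,
                    "batch_size": batch_size,
                    "learning_rate": learning_rate,
                    "weight_decay": weight_decay,
                }
```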
The optimization procedure determined that a learning rate of 5e-5 produced the most performant models across all batch size and weight decay values. However, while a learning rate of 1e-6 performed quite poorly relative to the others, the differences between the 1e-5, 2e-5, and 5e-5 learning rates were less striking, with less than a 5% difference separating the F1 scores of the most and least performant iterations within that range; meanwhile, the F1 score of the least performant model in that range improved upon that of the most performant model with a learning rate of 1e-6 by 7.9%. A learning rate of 2e-5 was ultimately chosen to strike a balance between maximizing performance and reducing the computational burden, though learning rates of 1e-5 and 5e-5 would likely have produced similar performance.
The batch sizes likewise showed little difference in model performance among the stronger learning rates, with at most an approximate 1.25% difference in the F1 scores of the most performant models for each batch size value. A batch size of 64 was chosen as an intermediate, helping to hedge against potential overfitting while avoiding the computational burden of larger batch sizes.
For weight decay, the most performant model across the entire grid search used a value of 1e-3. However, across the various permutations of learning rates and batch sizes, 75% of the most performant models used a weight decay of 1e-1; the most performant overall model was the sole exception, using 1e-3. This, together with the model's large number of parameters creating a partial predisposition to overfitting, motivated the use of an intermediate weight decay of 1e-2 for the training procedure.
Having determined the optimal set of values for the above hyperparameters, DistilRoBERTa was trained in full, with a new seed used to shuffle the data into a new order. The third epoch yielded an F1 score of approximately 0.9032, a training loss of approximately 0.2862, and a validation loss of approximately 0.2843. The training hyperparameters and results are available for review below in their entirety.
The model was then pushed to the Hugging Face Hub, and its inference performance on the test split was examined. The weighted-average F1 score remained the primary metric, but accuracy was also measured to provide a fuller picture of performance. Both values were similar, each approximately 0.9017, and a loss of approximately 0.2874 was also obtained.
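Test-split inference could be checked with something like the following sketch, again assuming scikit-learn for the metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

test_output = trainer.predict(tokenized["test"])
test_preds = np.argmax(test_output.predictions, axis=-1)

print("weighted F1:", f1_score(test_output.label_ids, test_preds, average="weighted"))
print("accuracy:", accuracy_score(test_output.label_ids, test_preds))
print("loss:", test_output.metrics["test_loss"])
```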
The model is now available through the Transformers API, either via the pipeline() function or through direct use of the model and tokenizer classes.
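For example, inference through pipeline() might look like the sketch below; the premise and hypothesis are made-up, and the returned label names depend on the id2label mapping stored with the model (the corpus maps 0, 1, and 2 to entailment, neutrality, and contradiction):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="markusleonardo/DistilRoBERTa-For-Semantic-Similarity",
)

# The premise/hypothesis pair is passed as a text/text_pair dictionary.
result = classifier(
    {"text": "A man is playing a guitar on stage.",
     "text_pair": "A musician is performing for an audience."}
)
print(result)
```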
Training hyperparameters
The following hyperparameters were utilized during training following grid-search hyperparameter optimization:
- learning_rate: 2e-05
- weight_decay: 0.01
- train_batch_size: 64
- eval_batch_size: 8
- seed: 10
- optimizer: adamw_torch (AdamW) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 3
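As a rough illustration, these settings might map onto TrainingArguments as in the sketch below; the output directory and the evaluation and push settings are assumptions rather than values taken from the card:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="DistilRoBERTa-For-Semantic-Similarity",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=10,
    optim="adamw_torch",          # AdamW, betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=3,
    eval_strategy="epoch",
    push_to_hub=True,             # assumption: the card notes the model was pushed to the Hub
)
```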
Training results
| Training Loss | Epoch | Step  | Validation Loss | F1 (weighted) |
|:-------------:|:-----:|:-----:|:---------------:|:-------------:|
| 0.4601        | 1.0   | 8584  | 0.3070          | 0.8870        |
| 0.3303        | 2.0   | 17168 | 0.2857          | 0.9000        |
| 0.2862        | 3.0   | 25752 | 0.2843          | 0.9032        |
Framework versions
- Transformers 4.47.1
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0