mekjr1/guilbert-base-uncased
This model is a fine-tuned version of bert-base-uncased on the guilbert dataset. It is a masked language model that predicts missing tokens in a sentence.
Model description
The model is based on the bert-base-uncased architecture, which has 12 layers, 768 hidden units, and 12 attention heads. It has been fine-tuned on samples labeled as guilt or non-guilt from the Vent dataset. The model was trained with a maximum sequence length of 128 tokens and a batch size of 32, using the AdamW optimizer with a learning rate of 2e-5, a weight decay rate of 0.01, and a linear learning rate warmup over 1,000 steps. The model achieved a validation loss of 1.8529 at the final epoch (epoch 8 in the results table below).
Intended uses & limitations
This model can be used to predict missing tokens in text sequences, particularly in the context of detecting guilt-related emotion in documents and other relevant applications. However, the accuracy of the model may be limited by the quality and representativeness of the training data, as well as by biases present in the pre-trained bert-base-uncased architecture.
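As a usage sketch, the checkpoint can be loaded with the Transformers fill-mask pipeline (the example sentence is illustrative only):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a fill-mask pipeline
fill_mask = pipeline("fill-mask", model="mekjr1/guilbert-base-uncased")

# Predict the masked token in a guilt-related sentence (illustrative example)
for prediction in fill_mask("I feel so [MASK] about what I said to her."):
    print(f"{prediction['token_str']}: {prediction['score']:.4f}")
```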
Training and evaluation data
The model was trained on a dataset of samples labeled as guilt or non-guilt from the guilbert dataset (extracted from the Vent dataset).
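As a rough illustration of how such text is typically prepared for masked-language-model fine-tuning with the settings above (the column name "text", the collator choice, and the 15% masking rate are assumptions based on BERT defaults, not the published preprocessing script):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Truncate each sample to the 128-token maximum sequence length used by the model
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Standard BERT objective: dynamically mask 15% of tokens in each batch
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, return_tensors="tf"
)
```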
Training procedure
The model was trained using TensorFlow Keras with the AdamW optimizer, a learning rate of 2e-5, a batch size of 32, and a maximum sequence length of 128 tokens. The optimizer used a weight decay rate of 0.01 and a linear learning rate warmup over 1,000 steps. Training used early stopping based on the validation loss and reached a validation loss of 1.8529 at the final logged epoch (epoch 8 in the table below).
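A minimal sketch of this setup using the Transformers TF utilities (the dataset objects and the early-stopping patience are illustrative assumptions; the original training script is not published here):

```python
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM, create_optimizer

# Load the pre-trained checkpoint with its masked-language-modeling head
model = TFAutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# AdamW with a linear warmup over 1,000 steps followed by a linear decay
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1_000,
    num_train_steps=7_167,  # decay horizon taken from the optimizer config below
    weight_decay_rate=0.01,
)

# Transformers TF models fall back to their built-in MLM loss when none is given
model.compile(optimizer=optimizer)

# Early stopping on validation loss, as described above (patience is an assumed value)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True
)

# train_set / val_set: tf.data.Dataset objects built from the tokenized corpus
# model.fit(train_set, validation_data=val_set, callbacks=[early_stop])
```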
Training hyperparameters
The following hyperparameters were used during training:
- Optimizer: AdamWeightDecay with a learning rate of WarmUp(initial_learning_rate=2e-05, decay_schedule_fn=PolynomialDecay(initial_learning_rate=2e-05, decay_steps=7167, end_learning_rate=0.0, power=1.0, cycle=False), warmup_steps=1000, power=1.0)
- Weight decay rate: 0.01
- Batch size: 32
- Maximum sequence length: 128
- Number of warmup steps: 1,000
- Number of training steps: 1,761
- Full optimizer config: {'name': 'AdamWeightDecay', 'learning_rate': {'class_name': 'WarmUp', 'config': {'initial_learning_rate': 2e-05, 'decay_schedule_fn': {'class_name': 'PolynomialDecay', 'config': {'initial_learning_rate': 2e-05, 'decay_steps': 7167, 'end_learning_rate': 0.0, 'power': 1.0, 'cycle': False, 'name': None}, 'passive_serialization': True}, 'warmup_steps': 1000, 'power': 1.0, 'name': None}}, 'decay': 0.0, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-08, 'amsgrad': False, 'weight_decay_rate': 0.01}
- training_precision: mixed_float16
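For reference, the learning-rate schedule in the optimizer config above can be reconstructed explicitly with the TensorFlow classes shipped in Transformers, and the mixed_float16 precision corresponds to the standard Keras global policy (a sketch, not the original training code):

```python
import tensorflow as tf
from transformers import AdamWeightDecay, WarmUp

# mixed_float16 training precision, as listed above
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Linear decay from 2e-5 to 0 over 7,167 steps...
decay = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=2e-5, decay_steps=7167,
    end_learning_rate=0.0, power=1.0, cycle=False,
)

# ...preceded by a linear warmup over the first 1,000 steps
schedule = WarmUp(
    initial_learning_rate=2e-5, decay_schedule_fn=decay,
    warmup_steps=1000, power=1.0,
)

# AdamWeightDecay with the betas, epsilon, and weight decay rate from the config above
optimizer = AdamWeightDecay(
    learning_rate=schedule, weight_decay_rate=0.01,
    beta_1=0.9, beta_2=0.999, epsilon=1e-8,
)
```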
Training results
The following table shows the training and validation loss for each epoch:
| Train Loss | Validation Loss | Epoch |
|---|---|---|
| 2.0976 | 1.8593 | 0 |
| 1.9643 | 1.8547 | 1 |
| 1.9651 | 1.9003 | 2 |
| 1.9608 | 1.8617 | 3 |
| 1.9646 | 1.8756 | 4 |
| 1.9626 | 1.9024 | 5 |
| 1.9574 | 1.8421 | 6 |
| 1.9594 | 1.8632 | 7 |
| 1.9616 | 1.8529 | 8 |
Framework versions
- Transformers 4.26.1
- TensorFlow 2.11.0
- Datasets 2.10.1
- Tokenizers 0.13.2