---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: predicting_misdirection
  results: []
---
# predicting_misdirection
This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the `misdirection.csv` dataset.
The data is cleaned by selecting relevant columns and filtering rows based on whether they are labeled as 'accepted' or 'rejected'. It then groups the data by a unique identifier, concatenates text entries within each group into paragraphs, and prepares these paragraphs as predictors (X). Target labels (y) are derived from the final submission grade, mapping 'accepted' to 'violation' and 'rejected' to 'non-violation'. Finally, the data is split into training and testing sets using stratified sampling with a 20% test size and a random state of 1 for reproducibility.
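The cleaning and splitting steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the column names (`conversation_id`, `text`, `submission_grade`) are assumptions for the sake of the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data(df):
    # Keep only the relevant columns and the rows graded 'accepted' or 'rejected'.
    # Column names are hypothetical; the real CSV schema may differ.
    df = df[["conversation_id", "text", "submission_grade"]]
    df = df[df["submission_grade"].isin(["accepted", "rejected"])]

    # Group by a unique identifier and concatenate each group's text entries
    # into a single paragraph; take the final submission grade per group.
    grouped = df.groupby("conversation_id").agg(
        text=("text", " ".join),
        grade=("submission_grade", "last"),
    )

    # Map grades to target labels: 'accepted' -> 'violation',
    # 'rejected' -> 'non-violation'.
    X = grouped["text"]
    y = grouped["grade"].map({"accepted": "violation", "rejected": "non-violation"})

    # Stratified 80/20 split with a fixed random state for reproducibility.
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
```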
It achieves the following results on the evaluation set:
- Accuracy: 0.6937
- Precision: 0.6916
- Recall: 0.6937
- F1: 0.6917
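A `compute_metrics` function producing these four metrics might look like the sketch below. The choice of `average="weighted"` is an assumption, consistent with precision, recall, and F1 differing slightly from each other above; the repository's notebook may use a different averaging mode.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair the Hugging Face Trainer
    # passes to its compute_metrics callback.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```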
## IMPORTANT NOTE

- When using the model, please note that `LABEL_0` refers to a non-violation and `LABEL_1` refers to a violation.
- You can find the code for building the model within this repository. It is titled `code.ipynb`.
- You can also find the data for building the model within this repository. It is titled `data.json`.
## Model description
The code begins by loading a DistilBERT model and tokenizer configured for sequence classification with two possible labels. It then preprocesses the data: training and testing text sequences are tokenized with the DistilBERT tokenizer, ensuring uniform length with padding and truncation to 256 tokens. A `CustomDataset` class is defined to organize the tokenized data into a format suitable for PyTorch training, converting the labels ('non-violation' and 'violation') into numeric values. Evaluation metrics such as accuracy, precision, recall, and F1 score are set up to assess model performance.

The main task is hyperparameter optimization using Optuna. An objective function is defined to optimize the dropout rate, learning rate, batch size, number of epochs, and weight decay. For each trial, the data is tokenized again, a new model is initialized with the chosen dropout rate, and a `Trainer` object manages training and evaluation using these parameters. The goal is to maximize the F1 score across 15 trials.
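The `CustomDataset` wrapper described above might be sketched as follows. The exact field names and label mapping here are assumptions based on this description, not the notebook's actual code.

```python
import torch
from torch.utils.data import Dataset

# Assumed label mapping, consistent with LABEL_0 / LABEL_1 in the note above.
LABEL2ID = {"non-violation": 0, "violation": 1}


class CustomDataset(Dataset):
    """Wraps tokenizer output and string labels for PyTorch training."""

    def __init__(self, encodings, labels):
        # encodings: dict from the tokenizer, e.g. input_ids, attention_mask
        self.encodings = encodings
        self.labels = [LABEL2ID[label] for label in labels]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
```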
## Intended uses & limitations
This model was created solely for the Humane Intelligence Algorithmic Bias Bounty.
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 4.81278007062444e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 9
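The hyperparameters above correspond to a `TrainingArguments` configuration along these lines (a sketch: `output_dir` is a placeholder, and the tuned `weight_decay` value is not listed in this card, so it is omitted).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="predicting_misdirection",  # placeholder output directory
    learning_rate=4.81278007062444e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,  # effective train batch size: 8 * 2 = 16
    lr_scheduler_type="linear",
    num_train_epochs=9,
)
```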
## Framework versions
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Tokenizers 0.19.1