language: en
license: cc-by-4.0
tags:
- pairwise-sequence-classification-evidence-detection
repo: https://github.com/RibhavOjha/NLU_ed

Model Card for t56225ro-p37429am-ED

This is a classification model trained to determine, given a claim-evidence pair, whether the evidence supports or refutes the claim.

Model Details

Model Description

This model is based upon a BERT model that was fine-tuned on 23.7K pairs of texts as part of the ED (Evidence Detection) dataset.

This model is intended for the task of pairwise sequence classification for ED. It can be further fine-tuned on related pairwise sequence classification tasks.

  • Developed by: Ribhav Ojha and Amal Manzoor
  • Language(s): English
  • Model type: Supervised
  • Model architecture: Transformers
  • Finetuned from model: bert-base-uncased

Model Resources

  • Repository: https://github.com/RibhavOjha/NLU_ed

Training Details

Training Data

23,702 claim-evidence text pairs provided as part of the coursework.

Training Procedure

First, the necessary libraries are imported: PyTorch, Hugging Face Transformers, scikit-learn and pandas. The data is then loaded from the CSV file and wrapped in a custom dataset class so that it matches the input format BERT expects. The training data was split into training and validation sets to track the model's performance during training. Fine-tuning the model without any modifications yielded an accuracy of 86% on the dev set, after which different batch sizes, learning rates and model architectures were experimented with. The trained model was then evaluated on the dev set, and predictions were finally generated for the test set.
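A minimal sketch of a dataset class along these lines is shown below. The class name, the column names Claim, Evidence and label, and the file name train.csv are illustrative assumptions, not the exact coursework code:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer


class EvidenceDetectionDataset(Dataset):
    """Wraps (claim, evidence, label) rows so they can be fed to BERT."""

    def __init__(self, csv_path, tokenizer, max_length=128):
        self.data = pd.read_csv(csv_path)  # column names are assumed
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Encode claim and evidence as a single sequence pair:
        # [CLS] claim [SEP] evidence [SEP]
        encoding = self.tokenizer(
            row["Claim"],
            row["Evidence"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in encoding.items()}
        item["labels"] = torch.tensor(row["label"], dtype=torch.long)
        return item


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_dataset = EvidenceDetectionDataset("train.csv", tokenizer)
```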

To improve performance, we added a dropout layer and a linear classification layer on top of the existing BERT model. This reduced overfitting and improved accuracy on the dev set to 88%. We use the Adam optimizer and CrossEntropyLoss as our loss function.
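A rough sketch of the modified architecture, assuming a standard dropout plus linear head over the pooled [CLS] output (the dropout probability of 0.3 is an illustrative assumption):

```python
import torch.nn as nn
from transformers import BertModel


class BertEvidenceClassifier(nn.Module):
    """bert-base-uncased encoder with a dropout + linear classification head."""

    def __init__(self, num_labels=2, dropout_prob=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout_prob)  # added to reduce overfitting
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        pooled = outputs.pooler_output  # pooled [CLS] representation
        return self.classifier(self.dropout(pooled))
```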

Training Hyperparameters

  - learning_rate: 2e-05
  - train_batch_size: 8
  - eval_batch_size: 8
  - num_epochs: 1
  - max_length: 128
  - optimizer: Adam
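
These values plug into the training setup roughly as follows (a sketch only, reusing the EvidenceDetectionDataset and BertEvidenceClassifier sketches above; it is not the exact coursework code):

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertEvidenceClassifier().to(device)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = Adam(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(1):  # num_epochs = 1
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch.pop("labels")
        optimizer.zero_grad()
        logits = model(**batch)       # input_ids, attention_mask, token_type_ids
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
```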
  

Speeds, Sizes, Times

  - overall training time: 30 minutes
  - duration per training epoch: 30 minutes
  - model size: 417MB

Evaluation

Testing Data & Metrics

Testing Data

A subset of the development set provided, amounting to 5,926 pairs.

Metrics

  - Precision
  - Recall
  - F1-score
  - Accuracy
  - Support
  - Macro Average
  - Weighted Average
  

Results

The model obtained an F1-score of 92% for irrelevant pairs and 77% for relevant pairs, with an overall accuracy of 88.12%. The macro-average and weighted-average figures are close to each other, suggesting that the model's performance is consistent across classes once the class distribution is taken into account. An F1-score of 0.88 for both the macro and weighted averages indicates a good balance between precision and recall.
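These figures correspond to the output of scikit-learn's classification_report. A minimal evaluation sketch that would produce such a report is shown below; it assumes a dev_loader built like the training loader, and that label 0 means irrelevant and 1 means relevant (an assumption, not confirmed by the coursework data):

```python
import torch
from sklearn.metrics import classification_report

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in dev_loader:  # assumed DataLoader over the dev split
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch.pop("labels")
        logits = model(**batch)
        all_preds.extend(logits.argmax(dim=-1).cpu().tolist())
        all_labels.extend(labels.cpu().tolist())

# Prints per-class precision, recall, F1 and support,
# plus accuracy, macro average and weighted average.
print(classification_report(all_labels, all_preds,
                            target_names=["irrelevant", "relevant"], digits=4))
```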

Technical Specifications

Hardware

  - GPU: T4

Software

  - Transformers 4.18.0
  - PyTorch 1.11.0+cu113

Bias, Risks, and Limitations

Any input (the concatenation of the two sequences) longer than 512 subwords will be truncated by the model. BERT can also be susceptible to biases present in its training data, which is important to consider when building more inclusive and ethical AI systems. In addition, BERT requires substantial computational resources, which makes it less accessible to organizations with limited budgets.

Another limitation is that this model was trained on a relatively small dataset, so it might perform poorly when tested on other datasets.

Additional Information

The hyperparameters were determined by experimentation with different values.
