Ihor/DILI-scibert · Hugging Face

DILI-scibert

This is a text classification model based on SciBERT, fine-tuned on a binary text classification dataset to recognize papers that mention drug-induced liver injury (DILI).

The model was trained for participation in the CAMDA challenge; the dataset and details of the challenge can be found here.
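A minimal usage sketch follows, assuming the checkpoint can be loaded through the standard transformers text-classification pipeline and that the repository ships a compatible tokenizer; neither is confirmed by this card. The helper name classify_abstracts is hypothetical, and the label names in the output depend on the model's configuration, so inspect them before mapping them to "DILI-related" or "not DILI-related".

```python
# Model identifier as shown in the page title of this card.
MODEL_ID = "Ihor/DILI-scibert"


def classify_abstracts(texts, model_id=MODEL_ID):
    """Return the pipeline's predictions for a list of abstracts.

    Hypothetical helper for illustration. Assumes the checkpoint is
    compatible with the text-classification pipeline.
    """
    # Imported lazily so that defining the helper does not require
    # transformers to be installed or any network access.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True cuts inputs that exceed the model's maximum length.
    return classifier(list(texts), truncation=True)


# Example call (downloads the model on first use):
# predictions = classify_abstracts(["Title and abstract of a paper ..."])
# Each prediction is a dict with "label" and "score" keys.
```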

Dataset

The CAMDA committee and FDA initially provided a training set of approximately 14,000 DILI-related papers from LiverTox, equally split into positive and negative examples. The challenge participants also received test and validation sets with varying levels of imbalance, incorporating increasing numbers of true negatives to mirror real-world task complexity. The first validation set had 6,494 abstracts, the second 32,814, and the third 100,265. Additionally, to evaluate model overfitting, the fourth validation set comprised 14,000 expert summaries instead of article abstracts.

Training

The model was trained on 90% of the data, using the following hyperparameters:

learning rate: 2e-5;
weight decay: 0.001;
batch size: 12;
focal loss gamma: 2;
focal loss alpha: 0.3.
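To illustrate how the two focal loss settings interact, here is a minimal sketch of the binary focal loss in plain Python. It assumes the common formulation FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with alpha applied to the positive class and 1 - alpha to the negative class; the card does not state which convention or implementation was used in training, so treat that as an assumption. The function name binary_focal_loss is hypothetical.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.3, eps=1e-12):
    """Focal loss for a single example.

    p: predicted probability of the positive class, in [0, 1].
    y: true label, 0 or 1.
    Assumption: alpha weights the positive class and (1 - alpha)
    the negative class; the card does not confirm this convention.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)             # guard against log(0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than an uncertain one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.55, 1)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a quick way to sanity-check an implementation.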

Citation

If you use this model, please cite the following paper:

@article{Stepanov2023ComparativeAO,
  title={Comparative analysis of classification techniques for topic-based biomedical literature categorisation},
  author={Ihor Stepanov and Arsentii Ivasiuk and Oleksandr Yavorskyi and Alina Frolova},
  journal={Frontiers in Genetics},
  year={2023},
  volume={14},
  url={https://api.semanticscholar.org/CorpusID:265428155}
}