---
license: mit
language:
- en
metrics:
- f1
- accuracy
- precision
library_name: transformers
pipeline_tag: text-classification
---

**DILI-scibert**

This is a text classification model based on [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased), fine-tuned on a binary text classification dataset to recognize papers that mention drug-induced liver injury (DILI).

The model was trained for the CAMDA challenge. The dataset and details of the challenge can be found [here](https://bipress.boku.ac.at/camda2022/).

### Dataset
The CAMDA committee and FDA initially provided a training set of approximately 14,000 DILI-related papers from LiverTox, equally split into positive and negative examples. 
The challenge participants also received test and validation sets with varying levels of imbalance, incorporating increasing numbers of true negatives to mirror real-world task complexity. 
The first validation set had 6,494 abstracts, the second 32,814, and the third 100,265. Additionally, to evaluate model overfitting, the fourth validation set comprised 14,000 expert summaries instead of article abstracts.

### Training
The model was trained on 90% of the data with the following hyperparameters:
* learning rate: 2e-5
* weight decay: 0.001
* batch size: 12
* focal loss gamma: 2
* focal loss alpha: 0.3
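
For readers unfamiliar with focal loss, the following is a minimal pure-Python sketch of the standard binary formulation, `FL = -alpha_t * (1 - p_t)^gamma * log(p_t)`, using the gamma and alpha values listed above as defaults. It is illustrative only and is not the training code used for this model. In particular, it assumes that alpha weights the positive (DILI) class and `1 - alpha` the negative class, which this card does not state; the function names are hypothetical.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.3):
    """Focal loss for a single example.

    p: predicted probability of the positive class, 0 < p < 1
    y: true label, 1 (positive) or 0 (negative)
    Assumption: alpha is the weight of the positive class.
    """
    # Probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # Class-dependent weighting factor.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.3):
    """Average focal loss over a batch of probabilities and labels."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than an uncertain one.
    print(binary_focal_loss(0.95, 1))
    print(binary_focal_loss(0.55, 1))
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which is a convenient sanity check when re-implementing it in a deep learning framework.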

### Citation
If you use this model, please cite the following paper:
```
@article{Stepanov2023ComparativeAO,
  title={Comparative analysis of classification techniques for topic-based biomedical literature categorisation},
  author={Ihor Stepanov and Arsentii Ivasiuk and Oleksandr Yavorskyi and Alina Frolova},
  journal={Frontiers in Genetics},
  year={2023},
  volume={14},
  url={https://api.semanticscholar.org/CorpusID:265428155}
}
```