Model Card for semaj83/scibert_finetuned_ctmatch
This model classifies "<topic> [SEP] <clinical trial document>" pairs into three classes: 0 (not relevant), 1 (partially relevant), and 2 (relevant).
Model Details
Fine-tuned from 'allenai/scibert_scivocab_uncased' on labelled (topic, document, relevance label) triples. These triples were processed using ctproc and collated from the openly available TREC22 Precision Medicine and CSIRO datasets, available here: https://huggingface.co/datasets/semaj83/ctmatch_classification
Model Description
Transformer model with a linear sequence classification head, trained with cross-entropy loss on ~30k triples and evaluated using F1.
- Developed by: James Kelly
- Model type: SequenceClassification
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: allenai/scibert_scivocab_uncased
Model Sources
- Repository: https://github.com/semajyllek/ctmatch
- Paper [optional]: [More Information Needed]
Uses
Direct Use
[More Information Needed]
Downstream Use
The ctmatch IR pipeline, which matches a large set of clinical trial documents to a text topic.
Bias, Risks, and Limitations
Please see the dataset sources for information on the patient descriptions (topics), which were constructed by medical professionals for these datasets. The related dataset contains no personal health information about real individuals. Links to the sources are on the dataset page on the Hub.
The classifier performs much better at deciding whether a pair is 0 (not relevant) than at differentiating between 1 (partially relevant) and 2 (relevant), though this distinction is still an important clinical task.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("semaj83/scibert_finetuned_ctmatch")
model = AutoModelForSequenceClassification.from_pretrained("semaj83/scibert_finetuned_ctmatch")
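As a hedged sketch of how the loaded model might be used for inference, the snippet below joins a topic and a trial document with a literal "[SEP]", following the pair format described at the top of this card, and maps the predicted class index to its label. The helper names `format_pair` and `classify_pair`, the `ID2LABEL` dictionary, and the choice to pass one joined string (rather than a tokenizer sentence pair) are assumptions for illustration and are not taken from the ctmatch code.

```python
# Label names follow the three classes described in this card.
ID2LABEL = {0: "not relevant", 1: "partially relevant", 2: "relevant"}

MODEL_NAME = "semaj83/scibert_finetuned_ctmatch"


def format_pair(topic: str, document: str) -> str:
    """Join topic and document in the '<topic> [SEP] <document>' format (assumed literal)."""
    return f"{topic} [SEP] {document}"


def classify_pair(topic: str, document: str, model_name: str = MODEL_NAME) -> str:
    """Return the predicted relevance label for one (topic, document) pair."""
    # Imported lazily so the helpers above can be used without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Truncate to 512 tokens, matching max_sequence_length in the training settings.
    inputs = tokenizer(
        format_pair(topic, document),
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class = int(logits.argmax(dim=-1).item())
    return ID2LABEL[predicted_class]
```

Calling `classify_pair("<topic text>", "<trial document text>")` downloads the model on first use and returns one of the three label strings. For many pairs, it would be more efficient to load the tokenizer and model once and batch the inputs.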
Training Details
See the notebook in the ctmatch repo.
Training Data
https://huggingface.co/datasets/semaj83/ctmatch
Preprocessing
If you are using the ctmatch labelled dataset, the tokenizer alone is sufficient. If you are using raw topics and/or clinical trial documents, you may need to use ctproc or another method to extract the relevant fields and preprocess the text.
Training Hyperparameters
- max_sequence_length=512
- batch_size=8
- padding='max_length'
- truncation=True
- learning_rate=2e-5
- train_epochs=5
- weight_decay=0.01
- warmup_steps=500
- seed=42
- splits={"train":0.8, "val":0.1}
- use_trainer=True
- fp16=True
- early_stopping=True
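The settings above could be mapped onto the Hugging Face `Trainer` API roughly as follows. This is only a sketch under stated assumptions: the actual training code lives in the ctmatch notebook, the `output_dir` value is hypothetical, and treating `batch_size` as the per-device training batch size is a guess. Early stopping is left out because it also depends on evaluation and checkpoint settings not listed here.

```python
# Tokenizer-side settings (sequence length, padding, truncation).
TOKENIZER_KWARGS = {
    "max_length": 512,
    "padding": "max_length",
    "truncation": True,
}

# Optimizer and schedule settings, expressed as TrainingArguments keyword arguments.
TRAINING_KWARGS = {
    "output_dir": "scibert_ctmatch_out",  # hypothetical path, not from the source
    "per_device_train_batch_size": 8,     # assumes batch_size=8 is per device
    "learning_rate": 2e-5,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "warmup_steps": 500,
    "seed": 42,
    "fp16": True,                         # requires a GPU that supports half precision
}


def make_training_args():
    """Build TrainingArguments from the settings above (needs transformers installed)."""
    # Imported lazily so the dictionaries can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

The resulting object would be passed to `Trainer(model=..., args=make_training_args(), ...)` together with the tokenized train and validation splits.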
Evaluation
sklearn classification report on a random test split:
              precision    recall  f1-score   support
           0       0.88      0.93      0.90      5430
           1       0.56      0.56      0.56      1331
           2       0.65      0.49      0.56      1178
    accuracy                           0.80      7939
   macro avg       0.70      0.66      0.67      7939
weighted avg       0.79      0.80      0.79      7939
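To make the relationship between the per-class rows and the two average rows explicit, the short check below recomputes the macro and weighted F1 from the rounded per-class values in the report. The variable names are illustrative. Because the inputs are already rounded to two decimals, the results only match the report after rounding.

```python
# Per-class (precision, recall, f1, support), copied from the report above.
rows = {
    0: (0.88, 0.93, 0.90, 5430),
    1: (0.56, 0.56, 0.56, 1331),
    2: (0.65, 0.49, 0.56, 1178),
}

total_support = sum(support for _, _, _, support in rows.values())

# Macro average: unweighted mean over classes, so each class counts equally.
macro_f1 = sum(f1 for _, _, f1, _ in rows.values()) / len(rows)

# Weighted average: mean over classes weighted by support, dominated by class 0.
weighted_f1 = sum(f1 * support for _, _, f1, support in rows.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

The gap between the two averages reflects the class imbalance: class 0 makes up most of the test split, so the weighted F1 sits much closer to its score than the macro F1 does.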
Model Card Authors
James Kelly
Model Card Contact