Model Card for Model longluu/Clinical-NER-NCBI-Disease-GatorTronS
This model is a fine-tuned LLM for named entity recognition (NER) that labels each word in a text with a clinical category.
Model Details
Model Description
The base pretrained model is GatorTronS, which was trained on billions of words from various clinical texts (https://huggingface.co/UFNLP/gatortronS). I then fine-tuned it on the NCBI Disease dataset (https://www.sciencedirect.com/science/article/pii/S1532046413001974?via%3Dihub) for the NER task, in which the model classifies each word in a text into one of the categories ['no disease', 'disease', 'disease-continue'].
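The three categories work together: 'disease' marks the first word of a mention and 'disease-continue' marks the words that extend it. The following is a minimal sketch of how per-word labels could be merged back into full disease mentions. The integer ids in `ID2LABEL`, the helper name `group_disease_mentions`, and the example sentence are assumptions made for illustration; they are not taken from the model's configuration.

```python
from typing import List

# Assumed id-to-label mapping, for illustration only.
ID2LABEL = {0: "no disease", 1: "disease", 2: "disease-continue"}


def group_disease_mentions(words: List[str], label_ids: List[int]) -> List[str]:
    """Merge per-word labels into full disease mentions."""
    mentions: List[str] = []
    current: List[str] = []
    for word, label_id in zip(words, label_ids):
        label = ID2LABEL[label_id]
        if label == "disease":
            # A new mention starts; close any mention that is still open.
            if current:
                mentions.append(" ".join(current))
            current = [word]
        elif label == "disease-continue" and current:
            # Extend the mention that is currently open.
            current.append(word)
        else:
            # "no disease" (or a stray continuation) ends the open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Hypothetical sentence with hand-assigned labels.
words = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer",
         "and", "ovarian", "cancer", "."]
labels = [0, 0, 0, 0, 1, 2, 0, 1, 2, 0]
print(group_disease_mentions(words, labels))
```

A 'disease-continue' label with no preceding 'disease' is dropped here; a stricter or more lenient policy is equally defensible.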
Model Sources
The GitHub code associated with the model can be found here: https://github.com/longluu/LLM-NER-clinical-text.
Training Details
Training Data
The NCBI Disease dataset contains the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Details are available at https://www.sciencedirect.com/science/article/pii/S1532046413001974?via%3Dihub.
The preprocessed data for LLM training can be found at https://huggingface.co/datasets/ncbi_disease.
Training Hyperparameters
The hyperparameters are --batch_size 24 --num_train_epochs 5 --learning_rate 5e-5 --weight_decay 0.01
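As a rough illustration of how these flags might be consumed, here is a minimal `argparse` sketch. The function name `build_parser` and the choice of defaults are assumptions for illustration; the actual training script in the repository may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; flag names and values are the ones listed above.
    parser = argparse.ArgumentParser(description="Hypothetical fine-tuning entry point")
    parser.add_argument("--batch_size", type=int, default=24)
    parser.add_argument("--num_train_epochs", type=int, default=5)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--weight_decay", type=float, default=0.01)
    return parser


# Parse the exact flag string from the card instead of reading sys.argv.
args = build_parser().parse_args(
    ["--batch_size", "24", "--num_train_epochs", "5",
     "--learning_rate", "5e-5", "--weight_decay", "0.01"]
)
print(vars(args))
```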
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was trained and validated on the train and validation sets, then tested on a separate test set. Note that some concepts in the test set do not appear in the train and validation sets.
Metrics
Here we use several metrics for classification tasks, including macro-average F1, precision, recall, and the Matthews correlation coefficient.
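To make the macro-averaging concrete, the sketch below computes per-class precision, recall, and F1 over word-level labels and then averages them with equal weight per class. It is a plain-Python illustration under stated assumptions: the helper `macro_scores` and the toy label lists are hypothetical, the card does not say whether its scores were computed per word or per mention, and the Matthews correlation coefficient is not covered here.

```python
from typing import Dict, List


def macro_scores(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Macro-averaged precision, recall and F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard against empty denominators by scoring the class as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels: 0 = 'no disease', 1 = 'disease', 2 = 'disease-continue' (assumed ids).
y_true = [0, 0, 1, 2, 0, 1]
y_pred = [0, 0, 1, 2, 1, 0]
scores = macro_scores(y_true, y_pred)
print(scores)
```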
Results
{'f1': 0.876008064516129, 'precision': 0.9052083333333333, 'recall': 0.8486328125}
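As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The short sketch below does that for the reported results; the variable names are for illustration only.

```python
# Reported results, copied from the card.
results = {
    "f1": 0.876008064516129,
    "precision": 0.9052083333333333,
    "recall": 0.8486328125,
}

p, r = results["precision"], results["recall"]
# Harmonic mean of precision and recall.
f1_from_pr = 2 * p * r / (p + r)

print(f1_from_pr)
print(abs(f1_from_pr - results["f1"]) < 1e-9)
```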
Model Card Contact
Feel free to reach out to me at thelong20.4@gmail.com if you have any questions or suggestions.