title: LLMGeneLinker (LGL)
language: en
sdk: gradio
tags:
- Named Entity Recognition
- SciBERT
- Drug-Target interaction
- Drugs
- Genes
- Proteins
- Medical
datasets:
- bigbio/ncbi_disease
- bigbio/bc5cdr
- bigbio/genetag
- bigbio/drugprot
- allenai/drug-combo-extraction
LLMGeneLinker (LGL): a Fine-Tuned SciBERT Model for Named Entity Recognition
LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI Drug-Combo-Extraction, BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting SciBERT model performs Named Entity Recognition (NER) to tag drugs, proteins, genes, and diseases in input text. Sentence embedding of SciBERT is then fed into BERT
Model Overview
The model is based on SciBERT, a BERT-architecture language model pre-trained on scientific text, most of it biomedical. By fine-tuning SciBERT on the labeled datasets listed under Dataset, we have created a specialized NER model that recognizes drugs, genes, and diseases in biomedical texts.
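BERT-style models such as SciBERT predict one label per sub-word piece, so word-level entities have to be reassembled after inference. The following is a minimal sketch of that step, assuming `##`-prefixed continuation pieces and BIO-style labels. The tokenization, the label names (`B-DRUG`, `B-GENE`), and the helper `merge_wordpieces` are illustrative assumptions, not taken from the LGL code.

```python
from typing import List, Tuple


def merge_wordpieces(pieces: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge sub-word pieces back into words, keeping the first piece's label."""
    if len(pieces) != len(labels):
        raise ValueError("pieces and labels must have the same length")
    words: List[Tuple[str, str]] = []
    for piece, label in zip(pieces, labels):
        if piece.startswith("##") and words:
            # Continuation piece: glue it onto the previous word and
            # keep that word's label (the label of its first piece).
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + piece[2:], prev_label)
        else:
            words.append((piece, label))
    return words


# Hypothetical tokenization and labels, for illustration only.
pieces = ["ima", "##tin", "##ib", "inhibits", "abl", "##1"]
labels = ["B-DRUG", "I-DRUG", "I-DRUG", "O", "B-GENE", "I-GENE"]
merged = merge_wordpieces(pieces, labels)
print(merged)
```

Keeping the first piece's label is one common convention; majority voting over the pieces is another reasonable choice.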
Usage
You can access an interactive web interface for querying the fine-tuned LGL model here. If you prefer to load the model yourself, you can check out Installation below.
Installation
If you prefer to run LGL locally or conduct further fine-tuning, you need to install the required dependencies and download the model files. Follow the steps below to set up the environment:
- Clone this repository to your local machine.
  - If you do not have Python installed, download Python from the official sources. Anaconda is recommended if you often use scientific packages.
  - If using Anaconda, set up a new conda environment after installation with the following command (replace myname with your own choice of environment name):
conda create --name myname python=3.8
- Activate your venv or conda environment (if using one) and install the required Python packages using pip:
pip install -r requirements_local.txt
- To use the fine-tuned NER model for recognizing drugs, genes, and diseases, open demo.ipynb in Jupyter Lab, which you can start via jupyter lab. The notebook takes text input as a string and returns the identified entities along with their respective labels.
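To illustrate what "identified entities along with their respective labels" can look like, here is a minimal sketch that groups per-token tags into entity spans. It assumes a BIO tagging scheme and the label names `DRUG`, `GENE`, and `DISEASE`; neither is confirmed by this README, and `decode_bio` is a hypothetical helper rather than part of demo.ipynb.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hypothetical model output, for illustration only.
tokens = ["Imatinib", "targets", "BCR", "-", "ABL", "in",
          "chronic", "myeloid", "leukemia"]
tags = ["B-DRUG", "O", "B-GENE", "I-GENE", "I-GENE", "O",
        "B-DISEASE", "I-DISEASE", "I-DISEASE"]
entities = decode_bio(tokens, tags)
print(entities)
```

In this sketch, a stray `I-` tag with no matching open entity is treated like `O` and dropped; a more lenient decoder could start a new entity there instead.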
Dataset
The following datasets were processed and used for training and evaluation:
Most datasets were sourced from BigBIO: [GitHub](https://github.com/bigscience-workshop/biomedical/blob/main/README.md), [HF](https://huggingface.co/bigbio)
| Task Type | Dataset | Links |
|:---------:|:---------:|:---------:|
| NER | NCBI-disease | Link |
| NER | BC5-disease | Link |
| NER | Genetag | Link |
| NER/RE | Drugprot | Link |
| NER/RE | AllenAI Drug-Combo-Extraction | Link |
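Training one tagger on several corpora usually requires mapping each corpus's entity types onto a shared label set. The sketch below shows one way this could be done, assuming BIO tags. The source type names in `UNIFIED_TYPES` are placeholders that have not been checked against the actual dataset schemas, and the helper `unify_tags` is hypothetical; the preprocessing actually used for LGL is not described here.

```python
from typing import Dict, List

# Placeholder mapping from corpus-specific entity types (lowercased)
# to a shared label set. These names are assumptions for illustration.
UNIFIED_TYPES: Dict[str, str] = {
    "chemical": "DRUG",
    "drug": "DRUG",
    "gene": "GENE",
    "protein": "PROTEIN",
    "disease": "DISEASE",
}


def unify_tags(tags: List[str], mapping: Dict[str, str] = UNIFIED_TYPES) -> List[str]:
    """Rewrite BIO tags so that every corpus uses the same entity types."""
    unified: List[str] = []
    for tag in tags:
        if tag == "O":
            unified.append("O")
            continue
        # Split "B-Chemical" into the BIO prefix "B" and the type "Chemical".
        prefix, _, source_type = tag.partition("-")
        target = mapping.get(source_type.lower())
        # Types without a mapping are treated as non-entities.
        unified.append(f"{prefix}-{target}" if target else "O")
    return unified


example = unify_tags(["B-Chemical", "I-Chemical", "O", "B-Disease", "B-Unknown"])
print(example)
```

Mapping unknown types to `O` keeps the label set closed, at the cost of silently discarding annotations, so logging the unmapped types during preprocessing may be worthwhile.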