metadata

title: LLMGeneLinker (LGL)
language: en
sdk: gradio
tags:
  - Named Entity Recognition
  - SciBERT
  - Drug-Target interaction
  - Drugs
  - Genes
  - Proteins
  - Medical
datasets:
  - bigbio/ncbi_disease
  - bigbio/bc5cdr
  - bigbio/genetag
  - bigbio/drugprot
  - allenai/drug-combo-extraction

LLMGeneLinker (LGL): a Fine-Tuned SciBERT Model for Named Entity Recognition

LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI drug combination, BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting model performs Named Entity Recognition (NER) to tag drugs, proteins, genes, and diseases in input text. The sentence embedding from SciBERT is then fed into BERT.

Model Overview
Usage
Installation
Dataset
Contributing
License

Model Overview

The model is based on the SciBERT architecture, a language model pre-trained on scientific text, much of it biomedical. By fine-tuning SciBERT on the labeled datasets listed under Dataset, we created a specialized NER model that recognizes drugs, genes, and diseases in biomedical text.
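To make the NER output concrete, here is a minimal sketch of how token-level tags can be grouped into entity spans. It assumes a BIO tagging scheme and the label names DRUG, GENE, and DISEASE; neither is confirmed by this README, and the helper `bio_to_spans` is hypothetical rather than part of the LGL code.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumes tags of the form "B-TYPE", "I-TYPE", or "O" (an assumption,
    not taken from the LGL training setup).
    """
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    label = None
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")
        # An I- tag continues the open entity only if the types match.
        continues = prefix == "I" and kind == label
        if not continues and words:
            # Close the entity that was open before this token.
            spans.append((" ".join(words), label))
            words, label = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            # B- starts an entity; a stray I- is leniently treated as a start.
            words, label = [token], kind
        elif continues:
            words.append(token)
    if words:
        spans.append((" ".join(words), label))
    return spans


tokens = ["Gefitinib", "inhibits", "EGFR", "in", "lung", "cancer"]
tags = ["B-DRUG", "O", "B-GENE", "O", "B-DISEASE", "I-DISEASE"]
print(bio_to_spans(tokens, tags))
```

The tags in this example are written by hand for illustration; they are not predictions from the fine-tuned model.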

Usage

You can access an interactive web interface for querying the fine-tuned LGL model here. If you prefer to load the model yourself, see Installation below.

Installation

If you prefer to run LGL locally or conduct further fine-tuning, you need to install the required dependencies and download the model files. Follow the steps below to set up the environment:

1. Clone this repository to your local machine.

1.1 If you do not have Python installed, download it from the official sources. Anaconda is recommended if you often use scientific packages.

1.2 If you use Anaconda, create a new conda environment after installation (replace myname with an environment name of your choice):

conda create --name myname python=3.8

2. Activate your venv or conda environment (if you are using one) and install the required Python packages with pip:

pip install -r requirements_local.txt

3. To use the fine-tuned NER model to recognize drugs, genes, and diseases, start Jupyter Lab with the command jupyter lab and open demo.ipynb. The notebook takes text input as a string and returns the identified entities along with their labels.
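Outside the notebook, a fine-tuned token-classification checkpoint can usually be queried through the Hugging Face `transformers` pipeline. The sketch below assumes the LGL weights are saved in a format that `transformers` can load. `MODEL_PATH` is a hypothetical placeholder, and `load_tagger` and `summarize` are illustrative helpers, not functions from this repository.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder: point this at the directory (or Hub ID) that
# holds the fine-tuned checkpoint. The real location is not given here.
MODEL_PATH = "path/to/lgl-checkpoint"


def load_tagger(model_path: str = MODEL_PATH):
    """Build a token-classification pipeline.

    Requires `transformers` and a backend such as PyTorch.
    """
    # Imported lazily so `summarize` can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_path,
        aggregation_strategy="simple",
    )


def summarize(predictions: List[Dict]) -> List[Tuple[str, str, float]]:
    """Reduce aggregated pipeline output to (text, label, score) triples."""
    return [
        (p["word"], p["entity_group"], round(float(p["score"]), 3))
        for p in predictions
    ]
```

With a valid checkpoint in place, a call such as `summarize(load_tagger()("Gefitinib inhibits EGFR in lung cancer."))` would return one triple per recognized entity. The label names in the output depend on how the model was trained.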

Dataset

The following datasets were processed and used for training and evaluation. Most were sourced from BigBIO: [GitHub](https://github.com/bigscience-workshop/biomedical/blob/main/README.md), [HF](https://huggingface.co/bigbio).

| Task Type | Dataset | Links |
|:---------:|:---------:|:---------:|
| NER | NCBI-disease | Link |
| NER | BC5-disease | Link |
| NER | Genetag | Link |
| NER/RE | Drugprot | Link |
| NER/RE | AllenAI Drug-Combo-Extraction | Link |
