GNorm2
GNorm2 is a gene name recognition and normalization tool with optimized functions and customizable configuration to the user preferences. The GNorm2 integrates multiple deep learning-based methods and achieves state-of-the-art performance. GNorm2 is freely available to download for stand-alone usage. Download GNorm2 here
Content
Dependency package
The codes have been tested using Python3.8/3.9 on CentOS and uses the following main dependencies on a CPU and GPU:
To install all dependencies automatically using the command:
$ pip install -r requirements.txt
Introduction of folders
- src_python
- GeneNER: the codes for gene recognition
- SpeAss: the codes for species assignment
- src_Java
- GNormPluslib : the codes for gene normalization and species recogntion
- GeneNER_SpeAss_run.py: the script for runing pipeline
- GNormPlus.jar: the upgraded GNormPlus tools for gene normalization
- gnorm_trained_models:pre-trianed models and trained NER/SA models
- bioformer-cased-v1.0: the original bioformer model
- BiomedNLP-PubMedBERT-base-uncased-abstract: the original pubmedbert model
- geneNER
- GeneNER-Bioformer/PubmedBERT-Allset.h5: the Gene NER models trained by all datasets
- GeneNER-Bioformer/PubmedBERT-Trainset.h5: the Gene NER models trained by the training set only
- SpeAss
- SpeAss-Bioformer/PubmedBERT-SG-Allset.h5: the Species Assignment models trained by all datasets
- SpeAss-Bioformer/PubmedBERT-SG-Trainset.h5: the Species Assignment models trained by the trianing set only
- stanza
- downloaded stanza library for offline usage
- vocab: label files for the machine learning models of GeneNER and SpeAss
- Dictionary: The dictionary folder contains all required files for gene normalization
- CRF: CRF++ library (called by GNormPlus.sh)
- Library: Ab3P library
- tmp/tmp_GNR/tmp_SA/tmp_SR folders: temp folder
- input/output folders: input and output folders. BioC (abstract or full text) and PubTator (abstract only) formats are both avaliable.
- GNorm2.sh: the script to run GNorm2
- setup.GN.txt/setup.SR.txt/setup.txt the setup files for GNorm2.
Running GNorm2
Please firstly download GNorm2 to your local. Below are the well-trained models (i.e., PubmedBERT/Bioformer) for Gene NER and Species Assignment. Models for Gene NER:
- gnorm_trained_models/geneNER/GeneNER-PubmedBERT.h5
- gnorm_trained_models/geneNER/GeneNER-Bioformer.h5 Models for Species Assignment:
- gnorm_trained_models/SpeAss/SpeAss-PubmedBERT.h5
- gnorm_trained_models/SpeAss/SpeAss-Bioformer.h5
The parameters of the input/output folders:
- INPUT, default="input"
- OUTPUT, default="output"
BioC-XML or PubTator formats are both avaliabel to GNorm2.
- Run GNorm2
Run Example:
$ ./GNorm2.sh input output
Acknowledgments
This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health.