File size: 3,542 Bytes
69fb171 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
# GNorm2
***
GNorm2 is a gene name recognition and normalization tool with optimized functions and customizable configuration to the user preferences. The GNorm2 integrates multiple deep learning-based methods and achieves state-of-the-art performance. GNorm2 is freely available to download for stand-alone usage. [Download GNorm2 here](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/download/GNorm2/GNorm2.tar.gz)
## Content
- [Dependency package](#package)
- [Introduction of folders](#intro)
- [Running GNorm2](#pipeline)
## Dependency package
<a name="package"></a>
The codes have been tested using Python3.8/3.9 on CentOS and uses the following main dependencies on a CPU and GPU:
- [TensorFlow 2.3.0](https://www.tensorflow.org/)
- [Transformer 4.18.0](https://huggingface.co/docs/transformers/installation)
- [stanza 1.4.0](stanfordnlp.github.io/stanza/)
To install all dependencies automatically using the command:
$ pip install -r requirements.txt
## Introduction of folders
<a name="intro"></a>
- src_python
- GeneNER: the codes for gene recognition
- SpeAss: the codes for species assignment
- src_Java
- GNormPluslib : the codes for gene normalization and species recogntion
- GeneNER_SpeAss_run.py: the script for runing pipeline
- GNormPlus.jar: the upgraded GNormPlus tools for gene normalization
- gnorm_trained_models:pre-trianed models and trained NER/SA models
- bioformer-cased-v1.0: the original bioformer model
- BiomedNLP-PubMedBERT-base-uncased-abstract: the original pubmedbert model
- geneNER
- GeneNER-Bioformer/PubmedBERT-Allset.h5: the Gene NER models trained by all datasets
- GeneNER-Bioformer/PubmedBERT-Trainset.h5: the Gene NER models trained by the training set only
- SpeAss
- SpeAss-Bioformer/PubmedBERT-SG-Allset.h5: the Species Assignment models trained by all datasets
- SpeAss-Bioformer/PubmedBERT-SG-Trainset.h5: the Species Assignment models trained by the trianing set only
- stanza
- downloaded stanza library for offline usage
- vocab: label files for the machine learning models of GeneNER and SpeAss
- Dictionary: The dictionary folder contains all required files for gene normalization
- CRF: CRF++ library (called by GNormPlus.sh)
- Library: Ab3P library
- tmp/tmp_GNR/tmp_SA/tmp_SR folders: temp folder
- input/output folders: input and output folders. BioC (abstract or full text) and PubTator (abstract only) formats are both avaliable.
- GNorm2.sh: the script to run GNorm2
- setup.GN.txt/setup.SR.txt/setup.txt the setup files for GNorm2.
## Running GNorm2
<a name="pipeline"></a>
Please firstly download [GNorm2](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/download/GNorm2/GNorm2.tar.gz) to your local.
Below are the well-trained models (i.e., PubmedBERT/Bioformer) for Gene NER and Species Assignment.
Models for Gene NER:
- gnorm_trained_models/geneNER/GeneNER-PubmedBERT.h5
- gnorm_trained_models/geneNER/GeneNER-Bioformer.h5
Models for Species Assignment:
- gnorm_trained_models/SpeAss/SpeAss-PubmedBERT.h5
- gnorm_trained_models/SpeAss/SpeAss-Bioformer.h5
The parameters of the input/output folders:
- INPUT, default="input"
- OUTPUT, default="output"
[BioC-XML](bioc.sourceforge.net) or [PubTator](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/Format.html) formats are both avaliabel to GNorm2.
1. Run GNorm2
Run Example:
$ ./GNorm2.sh input output
## Acknowledgments
This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health. |