Timofey/PubMedBERT_Genes_Proteins_Context_Classifier

This model is a fine-tuned version of BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext (Hugging Face card). It was developed for the web-based ANDDigest system to classify short names of genes and proteins in texts on the basis of their context. The name being analyzed should be replaced in the text with the <andsystem-candidate> tag.

Input:
Any biomedical text in which the name of the object to be classified is replaced with the <andsystem-candidate> tag, for example, this PubMed abstract:
Plastic – (not) fantastic? Impact of bisphenol A on functioning of mammalian oocytes and embryos Bisphenol A is a monomeric organic compound belonging to phenols. It is widely used in the production of resins, polycarbonates and plastics. Mass production of this compound contributed to its widespread presence in the environment, and thus - in the organisms of animals and humans. <andsystem-candidate> belongs to xenoestrogens, synthetic compounds exerting an estrogen-like effect on cells. BPA can therefore disrupt the functioning of animal (including human) organisms. This article focuses on the impact of BPA on selected aspects of mammalian fertility. Recent literature data indicate that BPA disturbs several processes in oocytes and embryos, including epigenetic modifications, energy metabolism and spindle assembly, and as a result, decreases their developmental competence. We discuss the latest data on the influence of BPA on cellular processes taking place in oocytes and early embryos and describe molecular mechanisms responsible for this effect. We also discuss the significance of the results obtained from experiments conducted in vitro and/or on animal models in the context of BPA impact on fertility of women.

In this example, BPA, which refers to the chemical compound bisphenol A, was replaced with <andsystem-candidate>. Please keep in mind that the maximum length of an input sequence for BERT is limited to 512 tokens.
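The input preparation described above can be sketched as follows. This is a minimal illustration, not the ANDDigest implementation: the helper names `mask_span` and `mask_first_mention` are hypothetical, and the choice to replace only the first whole-word mention is an assumption based on the example abstract, where a single occurrence of BPA is replaced.

```python
import re

# Tag used by the model card to mark the name being classified.
TAG = "<andsystem-candidate>"


def mask_span(text: str, start: int, end: int) -> str:
    """Replace the character span [start, end) with the candidate tag."""
    return text[:start] + TAG + text[end:]


def mask_first_mention(text: str, name: str) -> str:
    """Replace the first whole-word occurrence of `name` with the tag.

    Hypothetical helper: in practice the span would come from whatever
    entity recognizer proposed the candidate name.
    """
    pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"{name!r} not found in text")
    return mask_span(text, match.start(), match.end())


example = (
    "Bisphenol A is a phenol. BPA belongs to xenoestrogens. "
    "BPA can disrupt organisms."
)
masked = mask_first_mention(example, "BPA")
print(masked)
```

Only the mention under classification is replaced; other occurrences of the same name stay in the text and remain part of the context.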
Output:
LABEL_0 is the probability of a FALSE recognition, i.e. the context of <andsystem-candidate> does not correspond to the context specific to genes or proteins.
LABEL_1 is the probability of a TRUE recognition, i.e. the context of <andsystem-candidate> corresponds to the context specific to genes or proteins.

The optimal LABEL_1 threshold for short names of genes or proteins was calculated using a gold standard (add link). The LABEL_1 probability must be >= 0.9999139308929443.
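Applying the threshold can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function names `is_gene_or_protein` and `classify` are hypothetical, the score format is assumed to be the list of label/score dictionaries that the `transformers` text-classification pipeline returns with `top_k=None`, and access to the model is assumed to have been granted already.

```python
from typing import Dict, List, Union

# Threshold for LABEL_1 reported in this model card.
THRESHOLD = 0.9999139308929443

Score = Dict[str, Union[str, float]]


def is_gene_or_protein(scores: List[Score], threshold: float = THRESHOLD) -> bool:
    """Return True when the LABEL_1 probability reaches the threshold."""
    by_label = {item["label"]: item["score"] for item in scores}
    return by_label.get("LABEL_1", 0.0) >= threshold


def classify(masked_text: str) -> bool:
    """Hypothetical end-to-end call; needs `transformers` and model access."""
    from transformers import pipeline  # imported lazily on purpose

    clf = pipeline(
        "text-classification",
        model="Timofey/PubMedBERT_Genes_Proteins_Context_Classifier",
        top_k=None,  # return scores for both labels
    )
    # Truncate to the 512-token limit mentioned above.
    scores = clf([masked_text], truncation=True, max_length=512)[0]
    return is_gene_or_protein(scores)


# Hand-written scores, used only to show the threshold logic.
demo_scores = [
    {"label": "LABEL_0", "score": 0.00001},
    {"label": "LABEL_1", "score": 0.99999},
]
print(is_gene_or_protein(demo_scores))
```

Because the threshold is very close to 1, a LABEL_1 probability such as 0.9999 is still rejected.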

The Matthews correlation coefficient of the model for long names (>= 15 symbols) is 0.982.
The ROC AUC of the model, calculated for short names (<= 4 symbols), is 0.939.

Citing

If you find this model useful in your research, please cite the following articles:

Ivanisenko, T.V., Saik, O.V., Demenkov, P.S. et al. ANDDigest: a new web-based module of ANDSystem for the search of knowledge in the scientific literature. BMC Bioinformatics 21 (Suppl 11), 228 (2020). https://doi.org/10.1186/s12859-020-03557-8
Ivanisenko, T.V.; Demenkov, P.S.; Kolchanov, N.A.; Ivanisenko, V.A. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int. J. Mol. Sci. 2022, 23, 14934. https://doi.org/10.3390/ijms232314934
