alvaroalon2/biobert_diseases_ner · How to apply this model on PubMed full-text?

Mar 8, 2023

edited Mar 9, 2023

Hi @alvaroalon2, I am trying to apply this model to highlight disease entities in the full text of a PubMed document. However, with all default parameters, only the disease terms in the Abstract section were highlighted. I understand this model was trained on the ncbi_disease dataset, which is 'a collection of 793 PubMed abstracts'. Is that why it can only highlight entities in the Abstract section? Is there a parameter I can set to make the model work on the full text of a PubMed paper?
Thanks!

alvaroalon2

Owner Mar 17, 2023

Hi! No, that is not the reason. The model this one is based on, BERT, can only take input sequences of up to 512 tokens. So when you apply it to long documents such as full-text PubMed papers, only the beginning of the document is processed and the rest of the text is ignored. To address this limitation, I implemented the following library, which can analyze larger documents: https://github.com/librairy/bio-ner
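To make the 512-token limit concrete, here is a minimal sketch of one common workaround: cutting a long token sequence into overlapping windows that each fit the model. This is only an illustration and not necessarily how bio-ner handles it; the function name `sliding_windows` and the default `overlap` value are assumptions chosen for the example.

```python
def sliding_windows(n_tokens, max_len=510, overlap=64):
    """Return (start, end) token-index pairs that cover range(n_tokens).

    max_len defaults to 510 because BERT's 512-token budget also has to
    hold the [CLS] and [SEP] special tokens. Consecutive windows share
    `overlap` tokens so an entity cut at one window's edge appears whole
    in the next one.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if n_tokens <= 0:
        return []
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return windows
```

Each window would then be run through the model separately, and the predictions in the overlapping regions merged (for example by keeping the higher-scoring label).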

viktoroo

Aug 25, 2023

You can now use the pipeline's stride parameter to chunk longer text: https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/pipelines#transformers.TokenClassificationPipeline.stride
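A rough sketch of what that could look like for this model, assuming a transformers version that supports `stride` (the linked docs are for v4.32.0). The helper names `build_disease_ner` and `extract_spans` are made up for this example, and the stride value of 128 is an arbitrary choice. As far as I know, `stride` needs a fast tokenizer and an `aggregation_strategy` other than `"none"`; passing `model_max_length=512` is a precaution in case the tokenizer config does not set it.

```python
MODEL_ID = "alvaroalon2/biobert_diseases_ner"


def build_disease_ner(stride=128):
    """Build a token-classification pipeline that chunks long inputs.

    Downloads the model on first use, so the import is kept local.
    """
    from transformers import AutoTokenizer, pipeline

    # Assumption: force a 512-token limit so the pipeline knows where to chunk.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=512)
    return pipeline(
        "token-classification",
        model=MODEL_ID,
        tokenizer=tokenizer,
        aggregation_strategy="simple",  # stride is rejected with "none"
        stride=stride,  # number of tokens shared by consecutive chunks
    )


def extract_spans(ner, text):
    """Return (surface_text, start, end) for every entity found in text.

    `ner` is any callable returning dicts with character offsets under
    "start" and "end", as the aggregated pipeline output does.
    """
    return [(text[e["start"]:e["end"]], e["start"], e["end"]) for e in ner(text)]
```

With this, `extract_spans(build_disease_ner(), full_text)` should return character offsets across the whole document, which can be used to highlight the matches in the original text.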