---
license: mit
tags:
- feature-extraction
language: en
---

# PubMedNCL

A pretrained language model for document representations of biomedical papers.

PubMedNCL is based on [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), a BERT model pretrained on abstracts and full texts from PubMedCentral, and fine-tuned via citation neighborhood contrastive learning, as introduced by [SciNCL](https://huggingface.co/malteos/scincl).

## How to use the pretrained model

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/PubMedNCL')
model = AutoModel.from_pretrained('malteos/PubMedNCL')

papers = [
    {'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
    {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'},
]

# concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) of each sequence in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]
```

## Citation

- [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)](https://arxiv.org/abs/2202.06671)
- [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://arxiv.org/abs/2007.15779)

## License

MIT
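
## Example: comparing paper embeddings

As a usage illustration (not part of the original card), the [CLS] embeddings produced above can be compared with cosine similarity to gauge how related two papers are. The following is a minimal sketch that assumes the `embeddings` tensor from the snippet in "How to use the pretrained model" is already in scope:

```python
import torch.nn.functional as F

# `embeddings` is the [CLS] representation from the snippet above
# (shape: [num_papers, hidden_size]); this is an illustrative sketch,
# not a scoring method prescribed by the model authors.

# L2-normalize so the dot product equals cosine similarity
normalized = F.normalize(embeddings, p=2, dim=1)

# pairwise cosine similarity matrix between the example papers
similarity = normalized @ normalized.T

print(similarity)
```

Higher values indicate papers that lie closer together in the embedding space.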