
BART models fine-tuned for keyphrase generation

About

This repository contains 5 models that were trained and evaluated on three datasets: KPBiomed, KP20k and KPTimes.

Details about the models and the KPBiomed dataset can be found in the original paper: Maël Houbre, Florian Boudin and Béatrice Daille. 2022. A Large-Scale Dataset for Biomedical Keyphrase Generation. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI 2022).

How to use

As this repository contains several models, using the Hugging Face API directly will not work. To use one of the models, you first need to download the corresponding zip file and unzip it, as sketched below.
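
A minimal sketch of that download step, assuming the huggingface_hub library is installed; the repository id and zip file name below are placeholders, substitute the actual repository and the model you want.

from huggingface_hub import hf_hub_download
import zipfile

# Placeholder repo_id and filename: replace with this repository's id and the desired model zip
zip_path = hf_hub_download(repo_id="<this-repository>", filename="biobart-medium.zip")

# Unpack the archive next to your source files
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(".")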

For example, if we take the biobart-medium model and unzip it in our source directory, we can load it with the API as below.

from transformers import BartTokenizerFast, BartForConditionalGeneration

# Load the tokenizer and model from the unzipped local directory
tokenizer = BartTokenizerFast.from_pretrained("biobart-medium")
model = BartForConditionalGeneration.from_pretrained("biobart-medium")

# Move the model to GPU
model.to("cuda")

We can then generate keyphrases with the model using Hugging Face's generate function:

# input_text is the document to process, e.g. a paper's title and abstract
inputs = tokenizer(input_text, padding="max_length", max_length=512, truncation=True, return_tensors="pt")

input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")

# Beam search, returning the 20 best sequences
outputs = model.generate(inputs=input_ids,
                         attention_mask=attention_mask,
                         num_beams=20,
                         num_return_sequences=20)

keyphrase_sequence = tokenizer.batch_decode(outputs, skip_special_tokens=False)
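
The decoded output is a list of candidate sequences, one per returned beam, and may still contain special tokens. How keyphrases are delimited within a sequence depends on how the models were trained; the sketch below assumes a ";" separator and the standard BART special tokens, so adjust it to the actual output format.

# Assumption: keyphrases are separated by ";" and wrapped in standard BART special tokens
keyphrases = []
for sequence in keyphrase_sequence:
    cleaned = sequence.replace("<s>", "").replace("</s>", "").replace("<pad>", "")
    for keyphrase in cleaned.split(";"):
        keyphrase = keyphrase.strip()
        if keyphrase and keyphrase not in keyphrases:
            keyphrases.append(keyphrase)

print(keyphrases)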