---
license: apache-2.0
datasets:
- allenai/mslr2022
language:
- en
pipeline_tag: summarization
---

# PubMedBERT for biomedical extractive summarization

## Description
Work done for my [Bachelor's thesis](https://amslaurea.unibo.it/id/eprint/29686).

[PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) fine-tuned on [MS^2](https://github.com/allenai/mslr-shared-task) for extractive summarization.\
The model architecture is similar to [BERTSum](https://github.com/nlpyang/BertSum).\
Training code is available at [biomed-ext-summ](https://github.com/NotXia/biomed-ext-summ).

## Usage
```python
from transformers import pipeline, AutoTokenizer

summarizer = pipeline("summarization",
    model = "NotXia/pubmedbert-bio-ext-summ",
    tokenizer = AutoTokenizer.from_pretrained("NotXia/pubmedbert-bio-ext-summ"),
    trust_remote_code = True,  # the pipeline is defined in the model repository
    device = 0                 # first GPU; use -1 to run on CPU
)

sentences = ["sent1.", "sent2.", "sent3?"]
summarizer({"sentences": sentences}, strategy="count", strategy_args=2)
# Returns the selected sentences and their indices in the input:
# (['sent1.', 'sent2.'], [0, 1])
```

### Strategies
The `strategy` argument controls how sentences are selected for the summary:
- `length`: summary with a maximum length (`strategy_args` is the maximum length).
- `count`: summary with the given number of sentences (`strategy_args` is the number of sentences).
- `ratio`: summary proportional to the length of the document (`strategy_args` is a ratio in [0, 1]).
- `threshold`: summary containing only the sentences whose score is higher than a given value (`strategy_args` is the minimum score).
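
To make the four strategies more concrete, here is a minimal, self-contained sketch of how such a selection could work once each sentence has a score. It is not the model's actual code (that lives in the model repository and is loaded through `trust_remote_code`). The function `select_sentences` and the example scores are hypothetical, and two details are assumptions made only for illustration: `length` is counted in characters, and `ratio` is applied to the number of sentences.

```python
def select_sentences(sentences, scores, strategy, strategy_args):
    """Illustrative sketch: choose summary sentences from per-sentence scores."""
    # Rank sentence indices from the highest score to the lowest.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    if strategy == "count":
        # Keep the top-k sentences.
        chosen = ranked[:strategy_args]
    elif strategy == "ratio":
        # Assumption: the ratio is applied to the number of sentences.
        chosen = ranked[:int(len(sentences) * strategy_args)]
    elif strategy == "threshold":
        # Keep every sentence scoring above the minimum score.
        chosen = [i for i in ranked if scores[i] > strategy_args]
    elif strategy == "length":
        # Assumption: length is measured in characters.
        chosen, total = [], 0
        for i in ranked:
            if total + len(sentences[i]) > strategy_args:
                break
            chosen.append(i)
            total += len(sentences[i])
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    # Return the sentences in document order, together with their indices.
    chosen = sorted(chosen)
    return [sentences[i] for i in chosen], chosen


sentences = ["sent1.", "sent2.", "sent3?"]
scores = [0.9, 0.7, 0.2]  # hypothetical scores, not real model output
print(select_sentences(sentences, scores, "count", 2))
```

With these made-up scores, the `count` call returns the same shape as the pipeline output shown in the usage example: the selected sentences followed by their positions in the input.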