
license: apache-2.0
datasets:
  - allenai/mslr2022
language:
  - en
pipeline_tag: summarization

PubMedBERT for biomedical extractive summarization

Description

This model was developed as part of my Bachelor's thesis.

PubMedBERT fine-tuned on MS^2 for extractive summarization.
The model architecture is similar to BERTSum.
Training code is available at biomed-ext-summ.
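For readers unfamiliar with BERTSum, here is a rough sketch of how a BERTSum-style input is usually laid out: every sentence gets its own [CLS] and [SEP] tokens, segment ids alternate between sentences, and the [CLS] positions are later used to score each sentence. This only illustrates the general BERTSum idea. The function name is hypothetical, and this model may differ in its details (see the training code for the actual implementation).

```python
def bertsum_style_input(sentences):
    """Build BERTSum-style tokens, segment ids, and [CLS] positions.

    `sentences` is a list of already-tokenized sentences (lists of strings).
    Hypothetical helper for illustration, not this model's preprocessing code.
    """
    tokens, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        cls_positions.append(len(tokens))  # index of this sentence's [CLS]
        piece = ["[CLS]"] + sentence + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([i % 2] * len(piece))  # alternate 0/1 per sentence
    return tokens, segment_ids, cls_positions


tokens, segment_ids, cls_positions = bertsum_style_input(
    [["aspirin", "reduces", "pain"], ["it", "is", "cheap"]]
)
print(tokens)
print(segment_ids)
print(cls_positions)
```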

Usage

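from transformers import AutoTokenizer, pipeline

# Load the summarization pipeline together with the model's tokenizer.
summarizer = pipeline("summarization",
  model = "NotXia/pubmedbert-bio-ext-summ",
  tokenizer = AutoTokenizer.from_pretrained("NotXia/pubmedbert-bio-ext-summ"),
  trust_remote_code = True,  # the pipeline code is loaded from the model repository
  device = 0  # first GPU; use -1 to run on CPU
)

# The document is passed as a list of sentences.
sentences = ["sent1.", "sent2.", "sent3?"]
summarizer({"sentences": sentences}, strategy="count", strategy_args=2)
>>> (['sent1.', 'sent2.'], [0, 1])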
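
The pipeline takes a document that is already split into sentences. If you only have raw text, something like the following can produce that list. This is a minimal sketch using a naive regex. `split_sentences` is a hypothetical helper, not part of this model, and biomedical text (abbreviations such as "et al.", decimal numbers, and so on) will likely need a proper sentence tokenizer.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]  # drop empty strings (e.g. for empty input)


document = "Aspirin reduces pain. Is it safe? Trials say yes."
sentences = split_sentences(document)

# Same input shape as the {"sentences": ...} dictionary used in the Usage example.
payload = {"sentences": sentences}
print(payload)
```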

Strategies

The strategy argument selects how the summary is built from the scored sentences, and strategy_args is its parameter:

length: summary with a maximum length (strategy_args is the maximum length).
count: summary with the given number of sentences (strategy_args is the number of sentences).
ratio: summary proportional to the length of the document (strategy_args is the ratio, in [0, 1]).
threshold: summary containing only the sentences whose score is higher than a given value (strategy_args is the minimum score).
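
To make the four strategies more concrete, here is a rough sketch of how each one could pick sentences once every sentence has a score. This is not the pipeline's actual code. `select_sentences` is a hypothetical helper, and several details are assumptions: length is counted in whitespace-separated words, the ratio is applied to the number of sentences and truncated, and the highest-scoring sentences are taken first and then returned in document order.

```python
def select_sentences(sentences, scores, strategy, strategy_args):
    """Pick sentences by score; return (selected sentences, their indices).

    Hypothetical illustration of the strategies, not the model's implementation.
    """
    # Sentence indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    if strategy == "count":
        chosen = ranked[:strategy_args]
    elif strategy == "ratio":
        # Assumption: ratio of the sentence count, truncated toward zero.
        chosen = ranked[:int(len(sentences) * strategy_args)]
    elif strategy == "threshold":
        chosen = [i for i in ranked if scores[i] > strategy_args]
    elif strategy == "length":
        # Assumption: length is measured in whitespace-separated words.
        chosen, used = [], 0
        for i in ranked:
            n_words = len(sentences[i].split())
            if used + n_words > strategy_args:
                break  # stop at the first sentence that would exceed the budget
            chosen.append(i)
            used += n_words
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    chosen = sorted(chosen)  # restore document order
    return [sentences[i] for i in chosen], chosen


sentences = ["a b c.", "d e.", "f g h i.", "j."]
scores = [0.9, 0.2, 0.7, 0.4]  # made-up scores for illustration
for strategy, arg in [("count", 2), ("ratio", 0.5), ("threshold", 0.3), ("length", 5)]:
    print(strategy, select_sentences(sentences, scores, strategy, arg))
```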