# bigbird pegasus on the booksum dataset - 40,000 steps

the fully fine-tuned model can be found here. This checkpoint will stay live because its summarization quality is almost as good, while inference is considerably faster.

• typical datasets for summarization models are PubMed / arXiv; for my use cases, I have found these to be not very useful
• summarizing text with arXiv-trained models typically makes the summary sound so needlessly complicated that you might as well have read the original text in that time anyway
• this model is one attempt to help with that
• this is not a finished checkpoint but a work in progress:
• 40k steps, or 4 epochs, trained on the booksum dataset (roughly 60-70% of the training set covered so far)
• note that while the starting model continues to support the attention mechanism over 4096 tokens, this checkpoint was trained with the dataset tokenized to a max_length of 1024 for GPU memory reasons
• will continue to improve based on any result findings/feedback
• the starting checkpoint was google/bigbird-pegasus-large-bigpatent
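Since the underlying BigBird attention still spans 4096 tokens, one way to handle documents longer than that is to split the token sequence into windows of at most 4096 and summarize each chunk separately. The helper below is an illustrative sketch of my own (the function name and parameters are assumptions, not part of the model card):

```python
# Illustrative sketch: split a long token-id sequence into consecutive
# windows that each fit the model's 4096-token attention span.
def chunk_tokens(token_ids, max_len=4096):
    """Return consecutive chunks of token_ids, each at most max_len long."""
    return [token_ids[i : i + max_len] for i in range(0, len(token_ids), max_len)]

# example with dummy token ids
dummy_ids = list(range(10000))
print([len(c) for c in chunk_tokens(dummy_ids)])  # -> [4096, 4096, 1808]
```

Each chunk can then be summarized independently and the partial summaries concatenated (or summarized again).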

# example usage

An extended example, including a demo of batch summarization, is here.

• create the summarizer object:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

_model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-booksum-40k-K",
    low_cpu_mem_usage=True,
)

_tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-booksum-40k-K"
)

summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
```


• define the text to be summarized and pass it through the pipeline. Boom, done.

```python
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)

print(result[0]["summary_text"])
```
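The `no_repeat_ngram_size=3` argument blocks the decoder from ever generating the same three-token sequence twice. As a rough illustration of the kind of repetition this prevents, here is a toy checker of my own (not a transformers API) that detects repeated word trigrams in a text:

```python
# Toy helper (mine, not part of transformers): report whether any 3-word
# sequence occurs more than once -- the repetition that
# no_repeat_ngram_size=3 blocks during generation.
def has_repeated_trigram(text):
    words = text.split()
    seen = set()
    for i in range(len(words) - 2):
        tri = tuple(words[i : i + 3])
        if tri in seen:
            return True
        seen.add(tri)
    return False

print(has_repeated_trigram("the cat sat on the mat and the cat sat again"))  # True
print(has_repeated_trigram("a brand new sentence with no repeats at all"))   # False
```

Note the real constraint operates on model tokens during beam/greedy decoding, not on whitespace-split words; this is only an intuition aid.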


# Results

• below are scores from running evaluation on the entire validation set (~1,400 rows)
• note that while the dataset has three subsets (chapter, book, paragraph) - see the paper - the scores below are computed in aggregate
• these scores seem on par with, or slightly better than, what was reported in the paper; there is still more validation and other work to do
```json
{
    "eval_gen_len": 126.5815,
    "eval_loss": 3.747079610824585,
    "eval_rouge1": 30.4775,
    "eval_rouge2": 4.8919,
    "eval_rougeL": 16.742,
    "eval_rougeLsum": 27.57,
    "eval_runtime": 4246.9369,
    "eval_samples_per_second": 0.349,
    "eval_steps_per_second": 0.349
}
```
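For context on the numbers above: ROUGE-1 measures unigram overlap between a generated summary and a reference, reported here as an F1 score scaled to 0-100. The sketch below is a back-of-the-envelope version of that computation (the actual evaluation uses the `rouge_score` package, which also applies stemming and its own tokenization):

```python
# Rough sketch of ROUGE-1 as unigram-overlap F1; the real metric
# (rouge_score) additionally stems and tokenizes before matching.
from collections import Counter

def rouge1_f1(prediction, reference):
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the model summarizes the chapter",
    "the model produces a summary of the chapter",
)
print(round(score * 100, 1))  # -> 61.5
```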