BigBird-Pegasus on the BookSum dataset

This is the latest version of the model, i.e. the one that has been trained the longest, currently at ~70k steps.

  • GOAL: a summarization model that 1) summarizes the source content accurately and 2) more importantly (IMO), produces summaries that are easy to read and understand (*cough* unlike arXiv *cough*)
    • This model attempts to help with that by using the BookSum dataset to provide explanatory summarization.
    • Explanatory summary: a summary that both consolidates information and explains why that information is important.
  • This model was trained for seven epochs total (approx. 70,000 steps) and is closer to being finished.
    • It will continue to improve (slowly, now that it has already been trained for a long time) based on findings and feedback.
  • The starting checkpoint was google/bigbird-pegasus-large-bigpatent.

example usage

An extended example, including a demo of batch summarization, is here.

  • create the summarizer object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# load the model and tokenizer from the Hub
_model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
    low_cpu_mem_usage=True,
)

_tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
)

# build the summarization pipeline
summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
             
  • define the text to be summarized and pass it through the pipeline. Boom, done.
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)

print(result[0]["summary_text"])
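
The extended example linked above demonstrates batch summarization; as a minimal sketch of the same idea, the pipeline also accepts a list of documents (the batch_size and generation parameters below are illustrative, not tuned values):

# batch summarization: pass a list of documents to the same pipeline.
# batch_size and the generation parameters are illustrative values.
documents = [
    "first long document to summarize goes here.",
    "second long document to summarize goes here.",
]

results = summarizer(
    documents,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    batch_size=2,
)

for res in results:
    print(res["summary_text"])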

Alternate Checkpoint

  • If you are experiencing runtime or memory issues, try this earlier checkpoint at 40,000 steps, which is almost as good at the explanatory summarization task but runs faster; see the loading sketch below.
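
A minimal loading sketch. Note that "<alternate-40k-checkpoint>" is a placeholder; substitute the repo id of the earlier 40,000-step checkpoint linked above, everything else stays the same:

from transformers import pipeline

# NOTE: "<alternate-40k-checkpoint>" is a placeholder repo id for the earlier
# 40,000-step checkpoint linked above.
summarizer_40k = pipeline(
    "summarization",
    model="<alternate-40k-checkpoint>",
)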

Results

  • Note that while the dataset has three subsets (chapter, book, paragraph; see the paper), the scores below are computed in aggregate. The paper lists some benchmark scores that this model is competitive with.
  • Note that eval generations are run and scored at a length of 128 tokens.
{'eval_gen_len': 126.9791,
 'eval_loss': 4.00944709777832,
 'eval_rouge1': 27.6028,
 'eval_rouge2': 4.6556,
 'eval_rougeL': 14.5259,
 'eval_rougeLsum': 25.6632,
 'eval_runtime': 29847.4812,
 'eval_samples_per_second': 0.05,
 'eval_steps_per_second': 0.05}
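
For reference, a minimal sketch of how ROUGE scores like these can be computed with the evaluate library; `predictions` and `references` below are placeholders for parallel lists of strings (e.g., pipeline outputs vs. the dataset's reference summaries), and note the numbers above are on a 0-100 scale:

import evaluate

# score generated summaries against reference summaries with ROUGE.
# `predictions` and `references` are placeholder parallel lists of strings.
rouge = evaluate.load("rouge")

predictions = ["generated summary one", "generated summary two"]
references = ["reference summary one", "reference summary two"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum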
