
bigbird-pegasus on the booksum dataset - 40,000 steps

the fully fine-tuned model can be found here. This checkpoint will stay live because its summarization quality is almost as good while inference is considerably faster.

  • typical datasets for summarization models are PubMed / arXiv; for my use cases I have found these to be not very useful
    • summarizing text with arXiv-trained models tends to make the summary sound so needlessly complicated that you might as well have read the original text in the same time.
    • this model is one attempt to help with that
  • this is not a finished checkpoint but a WIP:
    • 40k steps, or 4 epochs, trained on the booksum dataset (covering ~60-70% of the training set so far).
    • Note that while the starting checkpoint can still apply its attention mechanism over 4096 tokens, this model was trained with the dataset tokenized to a max_length of 1024 for GPU memory reasons.
    • It will continue to improve based on findings and feedback.
  • the starting checkpoint was google/bigbird-pegasus-large-bigpatent
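
Given the 1024-token training limit noted above, inputs longer than that are effectively truncated. One hypothetical workaround (an illustration, not part of this model card) is to split a long token sequence into windows of at most 1024 tokens and summarize each window separately:

```python
def chunk_tokens(token_ids, max_length=1024):
    """Split a token-id sequence into consecutive windows of at most max_length."""
    return [token_ids[i : i + max_length] for i in range(0, len(token_ids), max_length)]

# e.g. a 2500-token document becomes windows of 1024, 1024, and 452 tokens
chunks = chunk_tokens(list(range(2500)))
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Each window's summary could then be concatenated (or summarized again) to cover the whole document.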

example usage

An extended example, including a demo of batch summarization, is here.

  • create the summarizer object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

_model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-booksum-40k-K",
    low_cpu_mem_usage=True,
)

_tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-booksum-40k-K"
)

summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
  • define the text to be summarized and pass it through the pipeline. Boom, done.
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)

print(result[0]['summary_text'])
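
The no_repeat_ngram_size=3 argument above blocks any 3-gram from appearing twice in the generated summary, which curbs the repetitive loops summarization models are prone to. A small, hypothetical helper (not part of the transformers API) to check that property on an output string:

```python
def has_repeated_ngram(text, n=3):
    """Return True if any n-gram of whitespace-split tokens occurs more than once."""
    tokens = text.split()
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i : i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

print(has_repeated_ngram("the cat sat on the mat the cat sat again"))  # True ("the cat sat" repeats)
print(has_repeated_ngram("a brand new sentence with no repeats"))      # False
```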

Results

  • below are scores from running evaluation on the entire validation set (~1400 rows).
  • note that while the dataset has three subsets (chapter, book, paragraph) - see the paper - the scores below were computed in aggregate.
  • these scores appear to be on par with, or slightly better than, those reported in the paper; more validation and other work remains to be done.
{
    "eval_gen_len": 126.5815,
    "eval_loss": 3.747079610824585,
    "eval_rouge1": 30.4775,
    "eval_rouge2": 4.8919,
    "eval_rougeL": 16.742,
    "eval_rougeLsum": 27.57,
    "eval_runtime": 4246.9369,
    "eval_samples_per_second": 0.349,
    "eval_steps_per_second": 0.349
}
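
For context on the metrics above, ROUGE-1 measures unigram overlap between the generated and reference summaries. A simplified, hypothetical sketch of the F1 variant (real evaluations use the rouge_score package, which also applies stemming and sentence handling):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 4))  # 0.8333
```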