BigBird-Pegasus on the BookSum dataset - 40,000 steps
The fully-tuned model can be found here. This checkpoint will stay live because its summarization quality is almost as good and it is much faster at inference.
- typical datasets for summarization models are PubMed / arXiv; for my use cases, I have found these to be not very useful.
- summarizing text with arXiv-trained models typically makes the summary sound so needlessly complicated that you might as well have read the original text in that time anyway.
- this model is one attempt to help with that
- this is not a finished checkpoint, but a work in progress:
- 40k steps / 4 epochs trained on the BookSum dataset so far (roughly 60-70% of the training set covered).
- note that while the starting checkpoint (listed below) can still apply its attention mechanism to inputs of up to 4096 tokens, training here tokenized the dataset to a max_length of 1024 for GPU-memory reasons; see the sketch after this list.
- will continue to improve the model based on findings/feedback.
- the starting checkpoint was google/bigbird-pegasus-large-bigpatent
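- as a rough illustration of that point (a sketch, not from the original card; the printed config values are what I expect for BigBird-Pegasus-large checkpoints, not verified here):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("pszemraj/bigbird-pegasus-large-booksum-40k-K")
print(config.max_position_embeddings)  # expected: 4096 (full attention window retained)
print(config.attention_type)           # expected: "block_sparse"

tokenizer = AutoTokenizer.from_pretrained("pszemraj/bigbird-pegasus-large-booksum-40k-K")
# reproduce the training-time truncation described above (assumed setup)
inputs = tokenizer(
    "a very long document to summarize ...",
    max_length=1024,
    truncation=True,
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # at most (1, 1024)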
Example usage
An extended example, including a demo of batch summarization, is here.
- create the summarizer object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

_model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-booksum-40k-K",
    low_cpu_mem_usage=True,
)
_tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-booksum-40k-K"
)
summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
- define the text to be summarized, and pass it through the pipeline. Boom, done.
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)
print(result[0]["summary_text"])
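- the pipeline also accepts a list of strings for batch summarization; a minimal sketch (the texts and batch_size are illustrative, not taken from the linked demo):

texts = [
    "first long document to summarize ...",
    "second long document to summarize ...",
]
batch_results = summarizer(
    texts,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    batch_size=2,  # tune to available memory
)
for res in batch_results:
    print(res["summary_text"])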
Results
- below are scores from running evaluation on the entire validation set (~1,400 rows).
- note that while the dataset has three subsets (chapter, book, paragraph; see the paper), the scores below were computed over the aggregate of all three.
- these scores seem to be on par with, or slightly better than, those reported in the paper; more validation and other work remains to be done.
{
"eval_gen_len": 126.5815,
"eval_loss": 3.747079610824585,
"eval_rouge1": 30.4775,
"eval_rouge2": 4.8919,
"eval_rougeL": 16.742,
"eval_rougeLsum": 27.57,
"eval_runtime": 4246.9369,
"eval_samples_per_second": 0.349,
"eval_steps_per_second": 0.349
}
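- for reference, a rough sketch of how scores like these could be reproduced with the evaluate library, reusing the summarizer pipeline defined above; the dataset id (kmfoda/booksum) and the column names are assumptions, not the exact evaluation setup used:

import evaluate
from datasets import load_dataset

rouge = evaluate.load("rouge")
val = load_dataset("kmfoda/booksum", split="validation")  # assumed dataset id

predictions, references = [], []
for row in val.select(range(8)):  # small slice; the full ~1,400-row set takes hours
    summary = summarizer(
        row["chapter"],  # assumed column holding the source text
        min_length=16,
        max_length=256,
        no_repeat_ngram_size=3,
    )[0]["summary_text"]
    predictions.append(summary)
    references.append(row["summary_text"])  # assumed reference-summary column

print(rouge.compute(predictions=predictions, references=references))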