license:
- apache-2.0
- bsd-3-clause
tags:
- summarization
- summary
- booksum
- long-document
- long-form
- tglobal-xl
- XL
datasets:
- kmfoda/booksum
metrics:
- rouge
inference: false
model-index:
- name: pszemraj/long-t5-tglobal-xl-16384-book-summary
results:
- task:
type: summarization
name: Summarization
dataset:
name: multi_news
type: multi_news
config: default
split: test
metrics:
- type: rouge
value: 36.2043
name: ROUGE-1
verified: true
verifyToken: >-
eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYzRmMmUyOTVjMmJmZTRiZDcyYzY3MTQ1MmUyNDA5NjVhYzEzYzBiNzcxYTRhMDQ3OTlhMGZjYmJlNDM1M2NjYyIsInZlcnNpb24iOjF9._uArOQ1_0znXDPXMq7unA1OHB-XbgqzzKRbFRcVUzTUJdWk26LiSa2pEEVNNmJPg6Uo7CAvONmhpEswLvl9TAg
- type: rouge
value: 8.424
name: ROUGE-2
verified: true
verifyToken: >-
eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzg0MzljYjVjYWQ3MmRkZDBlOGI5M2RiMGU0M2UwZGUzMDg2NTU0NjcwMTNiN2ZmODEzNTQ0MmEwNDA3NDA5MSIsInZlcnNpb24iOjF9.Dzj85ld6TjosQ8KyUdoadzicMLedEFrICC6Q-08O3qx28d9B9Uke1zw-VWabiuesPEDTRGbWuBgPA5vxYWUZAw
- type: rouge
value: 17.3721
name: ROUGE-L
verified: true
verifyToken: >-
eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNDA3ZjZmODAwMTNlM2RlZmJlMDI5MGVkMGRkMTBjMTYzNDk5ZjFiNTY5MWE1MDUwNWI2MDE4ZDA2YWMwMmI2NCIsInZlcnNpb24iOjF9.MOV_nId0XAK1eMQssG5GN9DsitZaTrxl4jdCJnOg9EZ0-vAw227ln599YV5YfZ1OPJnWwek6rneqqyONiHn9AQ
- type: rouge
value: 32.3994
name: ROUGE-LSUM
verified: true
verifyToken: >-
eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZmY3MDMwOTZjNWI0YTk1MDgwMzJkYTFiN2U5YWU0Mzc0MWRiMzc1NzZlMDhjMWUwMmY2ODI2MjI5ODBkYWUxOSIsInZlcnNpb24iOjF9._BwGIZbcA4pUBkEAL0cW-JPPta0KSoGug4Z7vogHacUz-AEhIOI5ICUldZh0pt9OK67MpUSzpShJOu3rSt5YDQ
- type: loss
value: 2.0843334197998047
name: loss
verified: true
verifyToken: >-
eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWFhMmE5ZjA3ODM4YmVjMDMyMjk5YjNlMjA1MGMzOWY0NTRlYzk1YjZiMzQxMDMxOTMwMjFkNTdmNjM1NDcyMyIsInZlcnNpb24iOjF9.3wbXV4CIIgnfXAnnRztdOR12PwsWsEfiglQQ09K-C1EgW4gai4x9l-wTE2OZ7CTWkuk_tr4tL_uqOCXLZRMtCQ
- type: gen_len
value: 248.3572
name: gen_len
verified: true
verifyToken: >-
eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMWZhOGMwMDJjNGU2MzA2YzI1OWU1ZDY5N2NjZmM1YTA5NDg1MzUwNmU1YTBhNjQyNWYwYzA3OGNmODFjMmE2NSIsInZlcnNpb24iOjF9.Rc9u89zCdbFnjsnmq65l_JvCtUwOX_ZWapKJpTZ-rC8HxcUVfi2Ash2QfvvvxHH_YWhwklxxdnNa0HCm46qLAA
- task:
type: summarization
name: Summarization
dataset:
name: billsum
type: billsum
config: default
split: test
metrics:
- name: ROUGE-1
type: rouge
value: 41.3645
verified: true
- name: ROUGE-2
type: rouge
value: 16.144
verified: true
- name: ROUGE-L
type: rouge
value: 24.2981
verified: true
- name: ROUGE-LSUM
type: rouge
value: 35.3234
verified: true
- name: loss
type: loss
value: 1.282260775566101
verified: true
- name: gen_len
type: gen_len
value: 291.8158
verified: true
- task:
type: summarization
name: Summarization
dataset:
name: ccdv/arxiv-summarization
type: ccdv/arxiv-summarization
config: document
split: test
metrics:
- name: ROUGE-1
type: rouge
value: 36.3225
verified: true
- name: ROUGE-2
type: rouge
value: 9.3743
verified: true
- name: ROUGE-L
type: rouge
value: 19.8396
verified: true
- name: ROUGE-LSUM
type: rouge
value: 32.2532
verified: true
- name: loss
type: loss
value: 2.146871566772461
verified: true
- name: gen_len
type: gen_len
value: 186.2966
verified: true

# long-t5-tglobal-xl + BookSum
Summarize long text and get a SparkNotes-esque summary of arbitrary topics!
- Generalizes reasonably well to academic & narrative text.
- This is the XL checkpoint, which, from a human-evaluation perspective, produces even better summaries.
A simple example/use case with the base model on ASR is here.
## Cheeky Proof-of-Concept
A summary of the infamous navy seals copypasta:

> In this chapter, the monster explains how he intends to exact revenge on "the little b****" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.
While a somewhat crude example, try running this copypasta through other summarization models to see the difference in comprehension (despite it not even being a "long" text!).
## Description

A fine-tuned version of google/long-t5-tglobal-xl on the kmfoda/booksum dataset.

Read the paper by Guo et al. here: LongT5: Efficient Text-To-Text Transformer for Long Sequences
## How-To in Python

Install/update transformers:

```bash
pip install -U transformers
```

Summarize text with the pipeline:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])
```
## Beyond the basics

There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for reduced memory consumption.

### Adjusting parameters

Pass other parameters related to beam-search text generation when calling `summarizer` to get even higher-quality results.
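
For example, generation keyword arguments can be passed straight through the pipeline call; the specific values below are illustrative examples, not tuned recommendations:

```python
# summarizer and long_text are from the snippet above.
# These beam-search settings are example values, not the model's official defaults.
result = summarizer(
    long_text,
    num_beams=4,             # wider beam search
    length_penalty=0.8,      # <1.0 nudges generation toward shorter summaries
    no_repeat_ngram_size=3,  # reduce repeated phrases
    early_stopping=True,
    max_length=512,          # cap on generated tokens
)
print(result[0]["summary_text"])
```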
### LLM.int8 Quantization

_alternate section title: how to get this monster to run inference on free Colab runtimes_

Per this PR, LLM.int8 is now supported for long-t5 models. Per initial testing, summarization quality appears to hold while requiring significantly less memory! *

How-to: ensure you have installed transformers from the latest GitHub main branch, plus bitsandbytes.

Install the latest main branch:

```bash
pip install bitsandbytes
pip install git+https://github.com/huggingface/transformers.git
```
Load in 8-bit (voodoo magic, the good kind, completed by bitsandbytes behind the scenes):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/long-t5-tglobal-xl-16384-book-summary"
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    load_in_8bit=True,
    device_map="auto",
)
```
The above is already present in the Colab demo linked at the top of the model card.
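
After loading in 8-bit, generation works as usual. A minimal sketch, reusing the tokenizer and model from the snippet above; the input text, token limit, and beam settings are illustrative assumptions:

```python
# Any long document you want summarized; placeholder text here.
long_text = "Here is a lot of text I don't want to read. Replace me"

inputs = tokenizer(
    long_text, return_tensors="pt", truncation=True, max_length=16384
).to(model.device)

# Example generation settings, not official defaults.
output_ids = model.generate(
    **inputs, max_new_tokens=512, num_beams=4, no_repeat_ngram_size=3
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```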
Do you love to ask questions? Awesome. But first, check out the Hugging Face blog post on how LLM.int8 works.
* More rigorous, metric-based investigation comparing beam-search summarization with and without LLM.int8 will take place over time.
## About

### Intended uses & limitations
While this model seems to improve factual consistency, do not take summaries to be foolproof; check anything that seems odd.

Specifically, watch for negation statements (i.e., the model says "This thing does not have [ATTRIBUTE]" when it should have said "This thing has a lot of [ATTRIBUTE]").

- I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually fact-check this by comparing a specific claim to what the surrounding sentences imply.
### Training and evaluation data

The kmfoda/booksum dataset on Hugging Face - read the original paper here.

- For memory reasons, initial fine-tuning only used rows with inputs of 12,288 tokens or fewer and outputs of 1,024 tokens or fewer (rows exceeding these limits were dropped before training; see the sketch after this list). Per brief analysis, inputs in the 12,288-16,384 token range are a small minority of this dataset.
- In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. Therefore, take the validation set results with a grain of salt; the primary metrics should (always) be the test set.
- The final phases of fine-tuning used the standard convention of 16,384-token inputs and 1,024-token outputs, keeping all rows (truncating longer sequences). This did not appear to change the loss/performance much.
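
For illustration, a minimal sketch of that kind of length-based filtering; the column names (`chapter`, `summary_text`) and thresholds here are assumptions, not the actual preprocessing code used for this model:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-xl")
dataset = load_dataset("kmfoda/booksum", split="train")

def within_limits(example, max_input_tokens=12288, max_output_tokens=1024):
    # Keep only rows whose tokenized input/output fit within the limits.
    n_input = len(tokenizer.encode(example["chapter"]))
    n_output = len(tokenizer.encode(example["summary_text"]))
    return n_input <= max_input_tokens and n_output <= max_output_tokens

filtered = dataset.filter(within_limits)
print(f"kept {len(filtered)} of {len(dataset)} rows")
```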
### Eval results

Official results with the model evaluator will be computed and posted here.

Please read the note above: due to the training methods, validation-set performance looks better than the test-set results will be. The model achieves the following results on the evaluation set:
- eval_loss: 1.2756
- eval_rouge1: 41.8013
- eval_rouge2: 12.0895
- eval_rougeL: 21.6007
- eval_rougeLsum: 39.5382
- eval_gen_len: 387.2945
- eval_runtime: 13908.4995
- eval_samples_per_second: 0.107
- eval_steps_per_second: 0.027
```
***** predict/test metrics (initial) *****
  predict_gen_len            =   506.4368
  predict_loss               =      2.028
  predict_rouge1             =    36.8815
  predict_rouge2             =     8.0625
  predict_rougeL             =    17.6161
  predict_rougeLsum          =    34.9068
  predict_runtime            = 2:04:14.37
  predict_samples            =       1431
  predict_samples_per_second =      0.192
  predict_steps_per_second   =      0.048
```
* Evaluating a big model is not as easy as it seems; a bit more investigation is underway.
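
To sanity-check ROUGE numbers locally, here is a minimal sketch using the `evaluate` library; the predictions and references below are placeholders:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholders; in practice, use generated summaries and the reference
# summaries from the test split.
predictions = ["the generated summary text"]
references = ["the reference summary text"]

scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum
```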
## FAQ

### How can I run inference with this on CPU?
lol
### How to run inference over a very long (30k+ tokens) document in batches?
See summarize.py in the code for my hf space Document Summarization :)

You can also use the same code to split a document into batches of 4096, etc., and run the model over those. This is useful in situations where CUDA memory is limited.
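
A self-contained sketch of that chunk-and-summarize approach; the chunk size and generation settings here are illustrative, not the exact settings used by the space:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

long_text = "A very long document goes here. Replace me"

# Split the document into ~4096-token chunks, then summarize each chunk.
tokenizer = summarizer.tokenizer
token_ids = tokenizer.encode(long_text, add_special_tokens=False)

chunk_size = 4096
chunks = [
    tokenizer.decode(token_ids[i : i + chunk_size], skip_special_tokens=True)
    for i in range(0, len(token_ids), chunk_size)
]

partial_summaries = [
    summarizer(chunk, max_length=512, no_repeat_ngram_size=3)[0]["summary_text"]
    for chunk in chunks
]
print("\n\n".join(partial_summaries))
```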
### How to fine-tune further?

See train with a script and the summarization scripts.
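
As a rough starting point, here is a hedged example invocation of the `run_summarization.py` example script from transformers; the column names and hyperparameter values are illustrative, not the exact recipe used for this checkpoint:

```bash
python run_summarization.py \
    --model_name_or_path pszemraj/long-t5-tglobal-xl-16384-book-summary \
    --dataset_name kmfoda/booksum \
    --text_column chapter \
    --summary_column summary_text \
    --max_source_length 16384 \
    --max_target_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 6e-4 \
    --num_train_epochs 1 \
    --do_train --do_eval \
    --predict_with_generate \
    --output_dir ./long-t5-xl-booksum-ft
```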
## Training procedure

### Updates
Updates to this model/model card will be posted here as relevant. The model seems fairly converged; if updates/improvements are possible using the BookSum dataset, this repo will be updated.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0006
- train_batch_size: 1
- eval_batch_size: 1
- seed: 10350
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
* Prior training sessions used roughly similar parameters (learning rates were higher); multiple sessions were required as this takes eons to train.
### Framework versions
- Transformers 4.25.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.6.1
- Tokenizers 0.13.1