Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization
What: This is the (current) result of the quest for a summarization model that condenses long and/or technical information well across general, academic, and narrative usage.
Use cases: long narrative summarization (think stories, as the dataset intended), article/paper/textbook/other document summarization, and technical-to-simple summarization.
- Models trained on this dataset tend to also explain what they are summarizing, which IMO is awesome.
- works well on lots of text and can handle up to 16,384 tokens per batch (see the token-count sketch below).
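To check whether a document actually fits in that 16,384-token window, you can count its tokens with the checkpoint's tokenizer first (a minimal sketch; the input string is a placeholder):

from transformers import AutoTokenizer

# count tokens to see whether the document fits in LED's 16384-token window (sketch)
tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")
n_tokens = len(tokenizer.encode("your long document here"))
print(f"{n_tokens} tokens (model window: 16384)")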
About
- Trained for 16 epochs vs. pszemraj/led-base-16384-finetuned-booksum
- Parameters adjusted for very fine-tuning-type training (super low LR, etc.)
- all the parameters for generation on the API are the same for easy comparison between versions.
Other Checkpoints on Booksum
- See led-large-book-summary for LED-large trained on the same dataset.
Usage - Basics
- it is recommended to use encoder_no_repeat_ngram_size=3 when calling the pipeline object, to improve summary quality.
  - this param forces the model to use new vocabulary and create an abstractive summary; otherwise it may simply compile the best extractive summary from the input provided.
- create the pipeline object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

hf_name = 'pszemraj/led-base-book-summary'

# load the checkpoint and its tokenizer from the Hugging Face Hub
_model = AutoModelForSeq2SeqLM.from_pretrained(
    hf_name,
    low_cpu_mem_usage=True,
)
_tokenizer = AutoTokenizer.from_pretrained(hf_name)

# build the summarization pipeline
summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
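- optionally, place the pipeline on a GPU, since summarizing 16k-token inputs on CPU is slow. A minimal sketch assuming a CUDA device is available (the pipeline's device argument takes a GPU index, with -1 meaning CPU):

import torch

# use GPU 0 when available, otherwise fall back to CPU (-1)
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
    device=device,
)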
- put words into the pipeline object:
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
# the summarization pipeline returns a list of dicts keyed by 'summary_text'
print(result[0]['summary_text'])
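- the pipeline also accepts a list of documents and returns one summary per entry, which is handy for e.g. summarizing several chapters in one call (a sketch with placeholder strings):

chapters = ["text of chapter one ...", "text of chapter two ..."]
results = summarizer(
    chapters,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
for r in results:
    print(r['summary_text'])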
Evaluation results
| Dataset (verified) | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-LSUM | loss | gen_len |
|---|---|---|---|---|---|---|
| kmfoda/booksum | 33.454 | 5.223 | 16.204 | 29.977 | 3.199 | 191.978 |
| samsum | 32.000 | 10.078 | 23.633 | 28.783 | 2.903 | 60.741 |
| cnn_dailymail | 30.505 | 13.258 | 19.031 | 28.342 | 3.948 | 231.076 |
| billsum | 36.850 | 15.915 | 23.476 | 30.960 | 3.879 | 131.362 |
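The scores above come from Hugging Face's automatic model evaluation. To sanity-check summaries against references yourself, one option is the evaluate library's ROUGE metric (a sketch; it assumes the evaluate and rouge_score packages are installed, and it reports scores as fractions rather than the percentages listed above):

import evaluate

# compute ROUGE between model outputs and reference summaries (placeholder strings)
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["model-generated summary here"],
    references=["reference summary from the dataset here"],
)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum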