LED-Based Summarization Model: Condensing Long and Technical Information
The Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization is a model I fine-tuned from allenai/led-base-16384 to condense extensive technical, academic, and narrative content in a fairly generalizable way.
Key Features and Use Cases
- Ideal for summarizing long narratives, articles, papers, textbooks, and other documents.
- The SparkNotes-esque style produces 'explanations' in the summarized content, which makes the output more insightful than a plain extract.
- High capacity: Handles up to 16,384 tokens per batch.
- Demos: try it out in the notebook linked above or in the demo on Spaces.
Note: The API widget has a max length of ~96 tokens due to inference timeout constraints.
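The 16,384 limit is counted in tokens, not characters or words, so it can help to check a document's token count before summarizing it. Below is a minimal sketch of such a check. The helper name `fits_in_context` is hypothetical, and the whitespace split is only a stand-in tokenizer so the example runs offline. With the real model you would pass something like `lambda t: tokenizer(t)["input_ids"]`, where `tokenizer` comes from `AutoTokenizer.from_pretrained`.

```python
from typing import Callable, Sequence

LED_MAX_TOKENS = 16_384  # input limit stated in this model card


def fits_in_context(
    text: str,
    tokenize: Callable[[str], Sequence],
    max_tokens: int = LED_MAX_TOKENS,
) -> bool:
    """Return True if `text` tokenizes to at most `max_tokens` tokens.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    """
    return len(tokenize(text)) <= max_tokens


# Stand-in tokenizer (assumption): whitespace splitting, NOT the LED tokenizer.
# Real subword tokenizers usually produce more tokens than words.
short_text = "your words here"
long_text = "word " * 20_000

print(fits_in_context(short_text, str.split))  # True
print(fits_in_context(long_text, str.split))   # False
```

Text that does not fit needs to be split into batches and summarized piece by piece.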
Training Details
The model was trained on the BookSum dataset released by Salesforce, which is why it carries the bsd-3-clause license. Training ran for 16 epochs, with hyperparameters chosen for gentle fine-tuning (a very low learning rate).
Model checkpoint: pszemraj/led-base-16384-finetuned-booksum
Other Related Checkpoints
This model is the smallest/fastest booksum-tuned model I have worked on. If you're looking for higher quality summaries, check out:
There are also variants trained on other datasets on my Hugging Face profile; feel free to try them out :)
Basic Usage
I recommend passing encoder_no_repeat_ngram_size=3 when calling the pipeline object. It improves summary quality by encouraging new vocabulary and a more abstractive summary.
Create the pipeline object:
import torch
from transformers import pipeline

hf_name = "pszemraj/led-base-book-summary"

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, else CPU
)
Feed the text into the pipeline object:
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)

# the summarization pipeline returns its output under "summary_text"
print(result[0]["summary_text"])
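To picture what encoder_no_repeat_ngram_size=3 asks of the decoder: no run of three consecutive tokens from the input should reappear verbatim in the summary. The sketch below is only an illustration of that idea. It works on whitespace-split words for readability, whereas the real constraint is applied to token ids during generation, and the helper names `ngrams` and `copied_ngrams` are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Collect every contiguous n-gram in a token list."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def copied_ngrams(
    source: List[str], summary: List[str], n: int = 3
) -> Set[Tuple[str, ...]]:
    """Return the n-grams that the summary shares with the source."""
    return ngrams(source, n) & ngrams(summary, n)


source = "the quick brown fox jumps over the lazy dog".split()
copied = "the quick brown fox sleeps".split()
rephrased = "a fast fox leaps over a sleepy dog".split()

# An extractive-looking summary reuses trigrams from the source.
print(len(copied_ngrams(source, copied)))     # 2
# A rephrased summary shares none, which is what the setting pushes toward.
print(len(copied_ngrams(source, rephrased)))  # 0
```

A summary that passes this check has to rephrase rather than copy, which is why the setting nudges the output toward abstractive wording.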
Simplified Usage with TextSum
To streamline the process of using this and other models, I've developed a Python package utility named textsum. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.
Install TextSum:
pip install textsum
Then use it in Python with this model:
from textsum.summarize import Summarizer

model_name = "pszemraj/led-base-book-summary"

summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # how many tokens to batch summarize at a time
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a shareable demo/web UI.
For detailed explanations and documentation, check the README or the wiki.
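The idea behind token_batch_length can be sketched in a few lines: split the tokenized document into consecutive batches, summarize each batch on its own, and join the partial summaries. This is a simplified illustration under stated assumptions, not textsum's actual implementation, which may differ (for example in how batch boundaries are handled). The names `split_into_batches`, `summarize_in_batches`, and `fake_summarize` are hypothetical.

```python
from typing import Callable, List


def split_into_batches(token_ids: List[int], batch_length: int) -> List[List[int]]:
    """Split token ids into consecutive batches of at most `batch_length`."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    return [
        token_ids[i : i + batch_length]
        for i in range(0, len(token_ids), batch_length)
    ]


def summarize_in_batches(
    token_ids: List[int],
    batch_length: int,
    summarize_batch: Callable[[List[int]], str],
) -> str:
    """Summarize each batch independently and join the partial summaries."""
    batches = split_into_batches(token_ids, batch_length)
    return "\n".join(summarize_batch(batch) for batch in batches)


def fake_summarize(batch: List[int]) -> str:
    """Stand-in for a model call (assumption): only reports the batch size."""
    return f"<summary of {len(batch)} tokens>"


token_ids = list(range(10_000))  # pretend this is a tokenized document
print(summarize_in_batches(token_ids, 4096, fake_summarize))
# <summary of 4096 tokens>
# <summary of 4096 tokens>
# <summary of 1808 tokens>
```

Smaller batches lower memory use but give the model less context per summary, so the batch length is a trade-off between the two.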
Evaluation results
- ROUGE-1 on kmfoda/booksum (test set, verified): 33.454
- ROUGE-2 on kmfoda/booksum (test set, verified): 5.223
- ROUGE-L on kmfoda/booksum (test set, verified): 16.204
- ROUGE-LSUM on kmfoda/booksum (test set, verified): 29.977
- loss on kmfoda/booksum (test set, verified): 3.199
- gen_len on kmfoda/booksum (test set, verified): 191.978
- ROUGE-1 on samsum (test set, verified): 32.000
- ROUGE-2 on samsum (test set, verified): 10.078
- ROUGE-L on samsum (test set, verified): 23.633
- ROUGE-LSUM on samsum (test set, verified): 28.783
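For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The sketch below computes a simplified ROUGE-1 F1 on lowercased, whitespace-split words. It is an illustration only: the function name `rouge1_f1` is hypothetical, and the scores listed above come from a standard evaluation setup whose tokenization and other details will differ, so this toy function will not reproduce them.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(100 * score, 1))  # 83.3
```

ROUGE-2 applies the same idea to pairs of consecutive words, which is part of why its values are much lower than ROUGE-1 for abstractive summaries.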