
Longformer Encoder-Decoder (LED) fine-tuned on Booksum

  • allenai/led-base-16384 checkpoint trained on the booksum dataset for 3 epochs.
  • handles summarization in a "school notes" style well, but takes a while to run (even compared to larger models, such as a bigbird-pegasus checkpoint trained on the same data).
  • upside: works well on long inputs and can handle up to 16384 tokens per batch (see the token-count sketch below).
    • an example usage notebook is here with details
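If you want to verify that a long document actually fits in that window before summarizing, a minimal sketch (the checkpoint name matches the card; long_text is a placeholder, not part of the original card):

from transformers import AutoTokenizer

# minimal sketch: count tokens to check the input against LED's 16384-token window
hf_name = 'pszemraj/led-base-16384-finetuned-booksum'
tokenizer = AutoTokenizer.from_pretrained(hf_name)

long_text = "your long document here"  # placeholder
n_tokens = len(tokenizer.encode(long_text))
print(f"{n_tokens} tokens (model window: 16384)")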

Other Checkpoints on Booksum


Usage - Basics

  • from testing, it is highly recommended to set the parameter encoder_no_repeat_ngram_size=3 when calling the pipeline object.
    • this forces the model to use new vocabulary and produce an abstractive summary; otherwise it tends to stitch together the best extractive summary from the input provided.
  • create the pipeline object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

hf_name = 'pszemraj/led-base-16384-finetuned-booksum'

_model = AutoModelForSeq2SeqLM.from_pretrained(
    hf_name,
    low_cpu_mem_usage=True,
)

_tokenizer = AutoTokenizer.from_pretrained(hf_name)

summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
  • pass text to the pipeline object:
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,  # block copying 3-grams straight from the input
    clean_up_tokenization_spaces=True,
    repetition_penalty=3.7,
    num_beams=4,
    early_stopping=True,
)
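
The summarization pipeline returns a list with one dict per input; the generated text is stored under the summary_text key:

# read the generated summary out of the pipeline result
print(result[0]["summary_text"])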


Results

  • evaluation was run with the following parameters and produced the scores below.

  • params:

# set generate hyperparameters
model.config.num_beams = 5
model.config.max_length = 512
model.config.min_length = 32
model.config.length_penalty = 3.5
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3

trainer.evaluate(num_beams=5, max_length=128)
  • scores (computed on 1/10 of the validation set, to keep runtime manageable):
  {'eval_loss': 2.899840831756592,
   'eval_rouge1': 30.0761,
   'eval_rouge2': 6.4964,
   'eval_rougeL': 15.9819,
   'eval_rougeLsum': 28.2764,
   'eval_gen_len': 126.8514,
   'eval_runtime': 1442.991,
   'eval_samples_per_second': 0.103,
   'eval_steps_per_second': 0.103}
