pszemraj/led-large-book-summary

@@ -50,8 +50,8 @@ Let's think about a sensible choice of key tokens that a queried token actually
 >>> key_tokens = [] # => currently 'available' token doesn't have anything to attend
 Nearby tokens should be important because, in a sentence (sequence of words), the current word is highly dependent on neighboring past & future tokens. This intuition is the idea behind the concept of sliding attention."
   example_title: "bigbird blog intro"
-- text: " language  en tags  summarization  led  summary  longformer license apache20 datasets  kmfoda/booksum metrics  rouge widget  inference parameters max_length 64 min_length 4 no_repeat_ngram_size 2 early_stopping true repetition_penalty 24 length_penalty 05 encoder_no_repeat_ngram_size  3 num_beams  4  # longformer encoderdecoder (led) finetuned on booksum  an upgraded version of [`pszemraj/ledbase16384finetunedbooksum`](https //huggingface co/pszemraj/ledbase16384finetunedbooksum) it was trained for an additional epoch with a max summary length of 1024 tokens (original was trained with 512) as a small portion of the summaries are between 5121024 tokens long  all the parameters for generation on the api are the same for easy comparison between versions  works well on lots of text can hand 16384 tokens/batch ## other checkpoints on booksum  a oneepoch version of [ledlarge is available here](https //huggingface co/pszemraj/ledlargebooksummary1e)  a more polished version still wip  # usage  basics  it is recommended to use `encoder_no_repeat_ngram_size=3` when calling the pipeline object to improve summary quality  this param forces the model to use new vocabulary and create an abstractive summary otherwise it may l compile the best _extractive_ summary from the input provided  create the pipeline object ``` from transformers import automodelforseq2seqlm autotokenizer from transformers import pipeline hf_name = pszemraj/ledbasebooksummary _model = automodelforseq2seqlm from_pretrained( hf_name low_cpu_mem_usage=true ) _tokenizer = autotokenizer from_pretrained( hf_name ) summarizer = pipeline( summarization model=_model tokenizer=_tokenizer ) ```  put words into the pipeline object ``` wall_of_text = your words here result = summarizer( wall_of_text min_length=16 max_length=256 no_repeat_ngram_size=3 encoder_no_repeat_ngram_size =3 clean_up_tokenization_spaces=true repetition_penalty=37 num_beams=4 early_stopping=true ) ```  # results **no results for this 
version yet** "
-  example_title: "the LED-base readme"
 inference:
   parameters:

 >>> key_tokens = [] # => currently 'available' token doesn't have anything to attend
 Nearby tokens should be important because, in a sentence (sequence of words), the current word is highly dependent on neighboring past & future tokens. This intuition is the idea behind the concept of sliding attention."
   example_title: "bigbird blog intro"
+- text: "The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset."
+  example_title: "BookSum Abstract"
 inference:
   parameters: