pszemraj committed on
Commit 6a8855d
1 Parent(s): 078f22c

Update README.md

Files changed (1)
  1. README.md +67 -65
README.md CHANGED
@@ -347,28 +347,23 @@ model-index:
  verified: true
  verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWI2NjVlYjgwYWJiMjcyMDUzMzEwNDNjZTMxMDM0MjAzMzk1ZmIwY2Q1ZDQ2Y2M5NDBlMDEzYzFkNWEyNzJmNiIsInZlcnNpb24iOjF9.iZ1Iy7FuWL4GH7LS5EylVj5eZRC3L2ZsbYQapAkMNzR_VXPoMGvoM69Hp-kU7gW55tmz2V4Qxhvoz9cM8fciBA
  ---
-
- # Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization

  <a href="https://colab.research.google.com/gist/pszemraj/3eba944ddc9fc9a4a1bfb21e83b57620/summarization-token-batching.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>

- A fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the `BookSum` dataset.
-
- Goal: a model that can generalize well and is useful in summarizing long text in academic and daily usage. The result works well on lots of text and can handle 16384 tokens/batch (_if you have the GPU memory to handle that_)

  - See the Colab demo linked above or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)

-
- > Note: the API is set to generate a max of 64 tokens for runtime reasons, so the summaries may be truncated (depending on the length of input text). For best results use python as below.

  ---

- # Usage - Basic

- - use `encoder_no_repeat_ngram_size=3` when calling the pipeline object to improve summary quality.
- - this forces the model to use new vocabulary and create an abstractive summary, otherwise it may compile the best _extractive_ summary from the input provided.

  Load the model into a pipeline object:

@@ -385,7 +380,7 @@ summarizer = pipeline(
  )
  ```

- - put words into the pipeline object:

  ```python
  wall_of_text = "your words here"
@@ -402,74 +397,81 @@ result = summarizer(
  )
  ```

- **Important:** To generate the best quality summaries, you should use the global attention mask when decoding, as demonstrated in [this community notebook here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing), see the definition of `generate_answer(batch)`.

- If having computing constraints, try the base version [`pszemraj/led-base-book-summary`](https://huggingface.co/pszemraj/led-base-book-summary)
- - all the parameters for generation on the API here are the same as [the base model](https://huggingface.co/pszemraj/led-base-book-summary) for easy comparison between versions.

- ## Training and evaluation data

- - the [booksum](https://arxiv.org/abs/2105.08209) dataset (this is what adds the `bsd-3-clause` license)
- - During training, the input text was the text of the `chapter`, and the output was `summary_text`
- - Eval results can be found [here](https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-kmfoda__booksum-79c1c0d8-10905463) with metrics on the sidebar.

- ## Training procedure

- - Training completed on the BookSum dataset for 13 total epochs
- - **The final four epochs combined the training and validation sets as 'train' in an effort to increase generalization.**

- ### Training hyperparameters

- #### Initial Three Epochs

- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 4
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - num_epochs: 3

- #### In-between Epochs

- Unfortunately, don't have all records on-hand for middle epochs; the following should be representative:

- - learning_rate: 4e-05
- - train_batch_size: 2
- - eval_batch_size: 2
- - seed: 42
- - distributed_type: multi-GPU
- - gradient_accumulation_steps: 16
- - total_train_batch_size: 32
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.05
- - num_epochs: 6 (in addition to prior model)

- #### Final Two Epochs

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - gradient_accumulation_steps: 16
- - total_train_batch_size: 16
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.03
- - num_epochs: 2 (in addition to prior model)

- ### Framework versions

- - Transformers 4.19.2
- - Pytorch 1.11.0+cu113
- - Datasets 2.2.2
- - Tokenizers 0.12.1
  verified: true
  verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWI2NjVlYjgwYWJiMjcyMDUzMzEwNDNjZTMxMDM0MjAzMzk1ZmIwY2Q1ZDQ2Y2M5NDBlMDEzYzFkNWEyNzJmNiIsInZlcnNpb24iOjF9.iZ1Iy7FuWL4GH7LS5EylVj5eZRC3L2ZsbYQapAkMNzR_VXPoMGvoM69Hp-kU7gW55tmz2V4Qxhvoz9cM8fciBA
  ---
+ # LED-Based Summarization Model (Large): Condensing Extensive Information

  <a href="https://colab.research.google.com/gist/pszemraj/3eba944ddc9fc9a4a1bfb21e83b57620/summarization-token-batching.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>

+ This model is a fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the `BookSum` dataset. It aims to generalize well and to be useful for summarizing long text in both academic and everyday settings. It can handle up to 16,384 tokens per batch, making it effective on large volumes of text.

  - See the Colab demo linked above or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)

+ > **Note:** Due to inference API timeout constraints, outputs may be truncated before the full summary is returned (try the Python examples below or the demo).

  ---

+ ## Basic Usage

+ To improve summary quality, use `encoder_no_repeat_ngram_size=3` when calling the pipeline object. This setting encourages the model to use new vocabulary and construct an abstractive summary; otherwise, it may simply stitch together the best _extractive_ passages from the input.
 

  Load the model into a pipeline object:

  )
  ```
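The diff view collapses the unchanged body of this code block (only its closing lines are shown above). For reference, a minimal sketch of the typical setup; the device selection is an illustrative assumption rather than something taken from the diff:

```python
import torch
from transformers import pipeline

# load the checkpoint into a summarization pipeline
# (device choice below is an illustrative assumption)
summarizer = pipeline(
    "summarization",
    "pszemraj/led-large-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
```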

+ Feed the text into the pipeline object:

  ```python
  wall_of_text = "your words here"
  )
  ```
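As above, the middle of this block is collapsed in the diff. A representative sketch of the call: `encoder_no_repeat_ngram_size=3` comes from the usage note above, while the remaining generation parameters are illustrative assumptions:

```python
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,                   # illustrative
    max_length=256,                  # illustrative
    no_repeat_ngram_size=3,          # illustrative
    encoder_no_repeat_ngram_size=3,  # encourages an abstractive (not extractive) summary
    repetition_penalty=3.5,          # illustrative
    num_beams=4,                     # illustrative
    early_stopping=True,
)
print(result[0]["summary_text"])
```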

+ **Important:** For optimal summary quality, use the global attention mask when decoding, as demonstrated in [this community notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing); see the definition of `generate_answer(batch)`.
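For readers who skip the notebook, a minimal sketch of the same idea with a manual `generate` call; the generation parameters are illustrative, and the notebook's `generate_answer(batch)` remains the reference implementation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "pszemraj/led-large-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(
    "your long document here",
    return_tensors="pt",
    truncation=True,
    max_length=16384,
)

# LED-style global attention: flag the first token as globally attended
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,  # illustrative
    num_beams=4,     # illustrative
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```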

+ If you're facing computing constraints, consider using the base version, [`pszemraj/led-base-book-summary`](https://huggingface.co/pszemraj/led-base-book-summary). All generation parameters on the API here match those of the base model, enabling easy comparison between versions.

+ ---

+ ## Training Information

+ ### Data

+ The model was trained on the [BookSum](https://arxiv.org/abs/2105.08209) dataset. During training, the `chapter` column served as the input and the `summary_text` column as the target output.
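For illustration, the columns can be inspected with `datasets`; the `kmfoda/booksum` identifier below is an assumption based on the Hub copy referenced by the evaluation results, so adjust it if you use a different mirror:

```python
from datasets import load_dataset

# assumption: the kmfoda/booksum copy of BookSum on the Hugging Face Hub
dataset = load_dataset("kmfoda/booksum")

example = dataset["train"][0]
print(example["chapter"][:500])  # model input during fine-tuning
print(example["summary_text"])   # training target
```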

+ ### Procedure

+ Training was completed on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to improve generalization.

+ ### Hyperparameters

+ The training process used different settings across stages (a representative configuration for the first stage is sketched after this list):

+ - **Initial Three Epochs:** Learning rate of 5e-05, batch size of 1, 4 gradient accumulation steps, and a linear learning rate scheduler.
+ - **In-between Epochs:** Learning rate reduced to 4e-05, batch size increased to 2, 16 gradient accumulation steps, and a switch to a cosine learning rate scheduler with a 0.05 warmup ratio.
+ - **Final Two Epochs:** Learning rate further reduced to 2e-05, batch size back to 1, gradient accumulation steps kept at 16, and the cosine learning rate scheduler retained with a lower warmup ratio (0.03).
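A rough sketch of how the first stage maps onto `transformers` training arguments (an illustration, not the exact training script; `output_dir` is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# sketch of the "Initial Three Epochs" stage described above
training_args = Seq2SeqTrainingArguments(
    output_dir="./led-large-book-summary",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    seed=42,
)
```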

+ ### Versions

+ - Transformers 4.19.2
+ - Pytorch 1.11.0+cu113
+ - Datasets 2.2.2
+ - Tokenizers 0.12.1

+ ---

+ ## Simplified Usage with TextSum

+ To streamline the process of using this and other models, I've developed [a Python package utility](https://github.com/pszemraj/textsum) named `textsum`. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.
+
+ Install TextSum:
+
+ ```bash
+ pip install textsum
+ ```
+
+ Then use it in Python with this model:
+
+ ```python
+ from textsum.summarize import Summarizer
+
+ model_name = "pszemraj/led-large-book-summary"
+ summarizer = Summarizer(
+     model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
+     token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
+ )
+ long_string = "This is a long string of text that will be summarized."
+ out_str = summarizer.summarize_string(long_string)
+ print(f"summary: {out_str}")
+ ```
+
+ Currently implemented interfaces include a Python API, a command-line interface (CLI), and a demo/web UI.
+
+ For detailed explanations and documentation, check the [README](https://github.com/pszemraj/textsum) or the [wiki](https://github.com/pszemraj/textsum/wiki).
+
+ ---
+
+ ## Related Models
+
+ Check out these other related models, also trained on the BookSum dataset:
+
+ - [Long-T5-tglobal-base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)
+ - [BigBird-Pegasus-Large-K](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum)
+ - [Pegasus-X-Large](https://huggingface.co/pszemraj/pegasus-x-large-book-summary)
+ - [Long-T5-tglobal-XL](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary)
+
+ There are also variants fine-tuned on other datasets on my Hugging Face profile; feel free to try them out :)
+
+ ---