pszemraj committed
Commit 9e14a17
1 Parent(s): 80f23d2

Update README.md

Files changed (1)
  1. README.md +68 -37
README.md CHANGED
@@ -423,51 +423,55 @@ model-index:
  verified: true
  verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNDQxZmEwYmU5MGI1ZWE5NTIyMmM1MTVlMjVjNTg4MDQyMjJhNGE5NDJhNmZiN2Y4ZDc4ZmExNjBkMjQzMjQxMyIsInZlcnNpb24iOjF9.o3WblPY-iL1vT66xPwyyi1VMPhI53qs9GJ5HsHGbglOALwZT4n2-6IRxRNcL2lLj9qUehWUKkhruUyDM5-4RBg
  ---
-
- # Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization
-

  <a href="https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>

- - **What:** This is the (current) result of the quest for a summarization model that condenses technical/long information down well _in general, academic and narrative usage
- - **Use cases:** long narrative summarization (think stories - as the dataset intended), article/paper/textbook/other summarization, technical:simple summarization.
- - Models trained on this dataset tend to also _explain_ what they are summarizing, which IMO is awesome.
- - Works well on lots of text, and can hand 16384 tokens/batch.
- - See examples in Colab demo linked above, or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)

- -

- > Note: the API is set to generate a max of 64 tokens for runtime reasons, so the summaries may be truncated (depending on the length of input text). For best results use python as below.

- ## About

- - Trained on the BookSum dataset released by SalesForce (this is what adds the `bsd-3-clause` license)
- - Trained for 16 epochs vs. [`pszemraj/led-base-16384-finetuned-booksum`](https://huggingface.co/pszemraj/led-base-16384-finetuned-booksum),

- - parameters adjusted for _very_ fine-tuning type training (super low LR, etc)
- - all the parameters for generation on the API are the same for easy comparison between versions.
-
- ## Other Checkpoints on Booksum

- - See [led-large-book-summary](https://huggingface.co/pszemraj/led-large-book-summary) for LED-large trained on the same dataset.

  ---

- # Usage - Basics

- - it is recommended to use `encoder_no_repeat_ngram_size=3` when calling the pipeline object to improve summary quality.
- - this param forces the model to use new vocabulary and create an abstractive summary otherwise it may l compile the best _extractive_ summary from the input provided.
- - create the pipeline object:

- ```python

  import torch
  from transformers import pipeline

- hf_name = 'pszemraj/led-base-book-summary'

  summarizer = pipeline(
  "summarization",
@@ -476,23 +480,50 @@ summarizer = pipeline(
  )
  ```

- - put words into the pipeline object:

  ```python
  wall_of_text = "your words here"

  result = summarizer(
- wall_of_text,
- min_length=8,
- max_length=256,
- no_repeat_ngram_size=3,
- encoder_no_repeat_ngram_size=3,
- repetition_penalty=3.5,
- num_beams=4,
- do_sample=False,
- early_stopping=True,
- )
- print(result[0]['generated_text'])
  ```

- ---

+ # LED-Based Summarization Model: Condensing Long and Technical Information

  <a href="https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>

+ The Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization is a model I fine-tuned to condense long technical, academic, and narrative content.
+
+ ## Key Features and Use Cases
+
+ - Ideal for summarizing long narratives, articles, papers, textbooks, and other lengthy technical documents.
+ - Trained to also _explain_ the content it summarizes, which makes the output more insightful.
+ - High capacity: handles up to 16,384 tokens per batch (see the snippet after this list for a quick way to check input length).
+ - Live demos available: [Colab demo](https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb) and [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text).
+
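+ Not sure whether a document fits in a single pass? A minimal way to check the input length against the 16,384-token window, using this model's own tokenizer (purely illustrative):
+
+ ```python
+ # count tokens to see whether the text fits in the model's 16,384-token window
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")
+ n_tokens = len(tokenizer.encode("your words here"))
+ print(f"{n_tokens} tokens (model window: 16384)")
+ ```
+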
+ > **Note:** The hosted inference API is configured to generate a maximum of 64 tokens for runtime reasons, so its summaries may be truncated depending on the input length. For best results, use the Python approach detailed below.
+
+ ## Training Details

+ The model was trained on the BookSum dataset released by SalesForce (which is what adds the `bsd-3-clause` license). Compared with [`pszemraj/led-base-16384-finetuned-booksum`](https://huggingface.co/pszemraj/led-base-16384-finetuned-booksum), it was trained for 16 epochs with parameters adjusted for very gentle fine-tuning (a very low learning rate, etc.).

+ All generation parameters on the API are kept the same across versions so the checkpoints are easy to compare.

+ ## Other Related Checkpoints

+ Apart from this LED-based model, I have also fine-tuned other models on `kmfoda/booksum`:
+
+ - [Long-T5-Global-Base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)
+ - [BigBird-Pegasus-Large-K](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum)
+ - [Pegasus-X-Large](https://huggingface.co/pszemraj/pegasus-x-large-book-summary)
+ - [Long-T5-Global-XL](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary)

+ There are also other variants trained on other datasets on my Hugging Face profile; feel free to try them out :)

  ---

+ ## Basic Usage

+ I recommend using `encoder_no_repeat_ngram_size=3` when calling the pipeline object: it pushes the model toward new vocabulary and a genuinely abstractive summary, rather than compiling the best _extractive_ summary from the input.

+ Create the pipeline object:

+ ```python
  import torch
  from transformers import pipeline

+ hf_name = "pszemraj/led-base-book-summary"

  summarizer = pipeline(
  "summarization",

  )
  ```
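+
+ For reference, the full construction with explicit `model` and `device` arguments might look like the sketch below; the device handling is an illustrative choice rather than a requirement:
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ hf_name = "pszemraj/led-base-book-summary"
+
+ # build the summarization pipeline, using GPU 0 when one is available
+ summarizer = pipeline(
+     "summarization",
+     model=hf_name,
+     device=0 if torch.cuda.is_available() else -1,
+ )
+ ```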

+ Feed the text into the pipeline object:

  ```python
  wall_of_text = "your words here"

  result = summarizer(
+     wall_of_text,
+     min_length=8,
+     max_length=256,
+     no_repeat_ngram_size=3,
+     encoder_no_repeat_ngram_size=3,
+     repetition_penalty=3.5,
+     num_beams=4,
+     do_sample=False,
+     early_stopping=True,
+ )
+ print(result[0]["summary_text"])  # summarization pipelines return "summary_text"
  ```
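
+ For documents longer than the model's 16,384-token window, one simple approach is to split the text on token boundaries and summarize each chunk in turn; this is roughly what the Colab demo and the `textsum` package below do for you. A rough sketch that reuses the `summarizer` and `hf_name` defined above (the chunk size and the naive joining of partial summaries are illustrative choices):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained(hf_name)
+
+ def summarize_long(text: str, chunk_size: int = 8192) -> str:
+     """Token-batch `text` through the pipeline and join the partial summaries."""
+     ids = tokenizer.encode(text)
+     chunks = [
+         tokenizer.decode(ids[i : i + chunk_size], skip_special_tokens=True)
+         for i in range(0, len(ids), chunk_size)
+     ]
+     partial_summaries = [
+         summarizer(chunk, max_length=256, min_length=8, no_repeat_ngram_size=3)[0]["summary_text"]
+         for chunk in chunks
+     ]
+     return "\n\n".join(partial_summaries)
+ ```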

+ ## Simplified Usage with TextSum
+
+ To streamline the process of using this and other models, I've developed [a Python package utility](https://github.com/pszemraj/textsum) named `textsum`. It offers simple interfaces for applying summarization models to text documents of arbitrary length.
+
+ Install TextSum:
+
+ ```bash
+ pip install textsum
+ ```
+
+ Then use it in Python with this model:
+
+ ```python
+ from textsum.summarize import Summarizer
+
+ model_name = "pszemraj/led-base-book-summary"
+ summarizer = Summarizer(
+     model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
+     token_batch_length=4096,  # how many tokens to batch summarize at a time
+ )
+
+ long_string = "This is a long string of text that will be summarized."
+ out_str = summarizer.summarize_string(long_string)
+ print(f"summary: {out_str}")
+ ```
+
+ Currently implemented interfaces include a Python API, a command-line interface (CLI), and a shareable demo application. For detailed explanations and documentation, check the `textsum` README or the project wiki.
+
+ ---