pszemraj committed on
Commit
4027dd3
1 Parent(s): ddb8588

:memo: formatting

Files changed (1)
  1. README.md +46 -33
README.md CHANGED
@@ -457,27 +457,49 @@ model-index:
457
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
458
  </a>
459
 
460
- - summarize long text and get a SparkNotes-esque summary of arbitrary topics!
461
- - generalizes reasonably well to academic & narrative text.
462
- - A simple example/use case on ASR is [here](https://longt5-booksum-example.netlify.app/). There's also an example notebook in Colab (click on the icon above).
 
 
463
 
464
-
465
  ## Cheeky Proof-of-Concept
466
 
467
  A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):
468
 
469
  > The narrator tells us that he's graduated from the Navy seals and has been involved in many secret raids. He's also one of the best snipers in the entire U.S. military. He promises to "wipe you out with precision" when they meet again.
470
 
471
- ---
472
 
473
  ## Model description
474
 
475
  A fine-tuned version of [google/long-t5-tglobal-base](https://huggingface.co/google/long-t5-tglobal-base) on the `kmfoda/booksum` dataset:
476
 
477
  - 30+ epochs of fine-tuning from the base model on V100/A100 GPUs
478
- - all training used 16384 token input / 1024 max output
479
 
480
- Read the paper by Guo et al. here: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf)
481
 
482
  ## How-To in Python
483
 
@@ -505,18 +527,15 @@ Pass [other parameters related to beam search textgen](https://huggingface.co/bl
505
  ## Intended uses & limitations
506
 
507
  - The current checkpoint is fairly well converged but will be updated if further improvements can be made.
508
- - Compare performance to [LED-base](https://huggingface.co/pszemraj/led-base-book-summary) trained on the same dataset (API gen parameters are the same).
509
- While this model seems to improve factual consistency, **do not take summaries to be foolproof; check anything that seems odd**.
510
 
511
  ## Training and evaluation data
512
 
513
  `kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209). Summaries longer than 1024 LongT5 tokens were filtered out to prevent the model from learning to generate "partial" summaries.
514
 
515
- _NOTE: early checkpoints of this model were trained on a "smaller" subsection of the dataset as it was filtered for summaries of **1024 characters**. This was subsequently caught and adjusted to **1024 tokens** and then trained further for 10+ epochs._
516
 
517
- ---
518
-
519
- ---
520
 
521
  ## FAQ
522
 
@@ -530,22 +549,21 @@ You can also use the same code to split a document into batches of 4096, etc., a
530
 
531
  See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization).
532
 
533
- This model was originally tuned on Google Colab with a heavily modified variant of the [longformer training notebook](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb), key enabler being deepspeed. You can try this as an alternate route to fine-tuning the model without the command line.
534
-
535
-
536
-
537
- ---
538
 
 
539
 
540
  ## Training procedure
541
 
542
  ### Updates:
543
 
544
  - July 22, 2022: updated to a fairly converged checkpoint
545
- - July 3, 2022: Added a new version with several epochs of additional training that is more performant in general.
546
 
547
  ### Training hyperparameters
548
 
 
 
549
  The following hyperparameters were used during the **most recent** training round\*:
550
 
551
  - learning_rate: 0.0005
@@ -560,9 +578,7 @@ The following hyperparameters were used during the **most recent** training roun
560
  - lr_scheduler_warmup_ratio: 0.01
561
  - num_epochs: 2
562
 
563
-
564
- \*_Prior training sessions used roughly similar parameters; multiple sessions were required as this takes eons to train
565
-
566
 
567
  ### Framework versions
568
 
@@ -571,18 +587,15 @@ The following hyperparameters were used during the **most recent** training roun
571
  - Datasets 2.3.2
572
  - Tokenizers 0.12.1
573
 
574
-
575
- ## citation info
576
 
577
  If you find `pszemraj/long-t5-tglobal-base-16384-book-summary` useful in your work, please consider citing this model :)
578
 
579
- ```
580
- @misc {peter_szemraj_2022,
581
- author = { {Peter Szemraj} },
582
- title = { long-t5-tglobal-base-16384-book-summary (Revision 4b12bce) },
583
- year = 2022,
584
- url = { https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary },
585
- doi = { 10.57967/hf/0100 },
586
- publisher = { Hugging Face }
587
- }
588
- ```
 
457
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
458
  </a>
459
 
460
+ Summarize long text and get a SparkNotes-esque summary of arbitrary topics!
461
+
462
+ - Generalizes reasonably well to academic & narrative text.
463
+ - A simple example/use case on ASR is [here](https://longt5-booksum-example.netlify.app/).
464
+ - Example notebook in Colab (_click on the icon above_).
465
 
 
466
  ## Cheeky Proof-of-Concept
467
 
468
  A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):
469
 
470
  > The narrator tells us that he's graduated from the Navy seals and has been involved in many secret raids. He's also one of the best snipers in the entire U.S. military. He promises to "wipe you out with precision" when they meet again.
471
 
472
+ * * *
473
+
474
+ **Contents**
475
+
476
+ <!-- TOC -->
477
+
478
+ - [Model description](#model-description)
479
+ - [How-To in Python](#how-to-in-python)
480
+ - [Intended uses & limitations](#intended-uses--limitations)
481
+ - [Training and evaluation data](#training-and-evaluation-data)
482
+ - [FAQ](#faq)
483
+ - [Inference over long documents in batches](#how-to-run-inference-over-a-very-long-30k-tokens-document-in-batches)
484
+ - [How to fine-tune further](#how-to-fine-tune-further)
485
+ - [Training procedure](#training-procedure)
486
+ - [Updates](#updates)
487
+ - [Training hyperparameters](#training-hyperparameters)
488
+ - [Framework versions](#framework-versions)
489
+ - [Citation info](#citation-info)
490
+
491
+ <!-- /TOC -->
492
+
493
+ * * *
494
 
495
  ## Model description
496
 
497
  A fine-tuned version of [google/long-t5-tglobal-base](https://huggingface.co/google/long-t5-tglobal-base) on the `kmfoda/booksum` dataset:
498
 
499
  - 30+ epochs of fine-tuning from the base model on V100/A100 GPUs
500
+ - Training used 16384 token input / 1024 max output
501
 
502
+ Read the paper by Guo et al. here: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf)
503
 
504
  ## How-To in Python
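As a quick illustration, here is a minimal sketch of loading this checkpoint with the `transformers` summarization pipeline; the generation parameters shown below are illustrative assumptions, not the card's recommended settings.

```python
# Minimal sketch: summarize a long document with the transformers pipeline.
# The generation parameters below are assumptions, not tuned recommendations.
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

long_text = "Here is a lot of text I don't want to read. Replace me."

result = summarizer(
    long_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
print(result[0]["summary_text"])
```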
505
 
 
527
  ## Intended uses & limitations
528
 
529
  - The current checkpoint is fairly well converged but will be updated if further improvements can be made.
530
+ - Compare performance to [LED-base](https://huggingface.co/pszemraj/led-base-book-summary) trained on the same dataset (API gen parameters are the same).
531
- While this model seems to improve factual consistency, **do not take summaries to be foolproof; check anything that seems odd**.
532
 
533
  ## Training and evaluation data
534
 
535
  `kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209). Summaries longer than 1024 LongT5 tokens were filtered out to prevent the model from learning to generate "partial" summaries.
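A rough sketch of that length filter, assuming the booksum splits expose a `summary_text` column (the column name and the use of the base tokenizer are assumptions, not the card's own preprocessing code):

```python
# Sketch: drop examples whose reference summary exceeds 1024 LongT5 tokens.
# The "summary_text" column name is an assumption about kmfoda/booksum.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
dataset = load_dataset("kmfoda/booksum")

filtered = dataset.filter(
    lambda ex: len(tokenizer(ex["summary_text"]).input_ids) <= 1024
)
```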
536
 
 
537
 
538
+ * * *
 
 
539
 
540
  ## FAQ
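For the question on running inference over very long (30k+ token) documents in batches, here is a rough sketch of one way to chunk by tokens and summarize each piece; the chunk size, helper structure, and joining step are assumptions, not the card's own batching code.

```python
# Sketch: split a very long document into ~8192-token chunks and summarize
# each chunk separately; the chunk size and the joining step are assumptions.
from transformers import AutoTokenizer, pipeline

model_name = "pszemraj/long-t5-tglobal-base-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def summarize_long(text: str, chunk_tokens: int = 8192) -> str:
    # Tokenize once, then slice the token ids into fixed-size windows.
    ids = tokenizer(text, truncation=False).input_ids
    chunks = [
        tokenizer.decode(ids[i : i + chunk_tokens], skip_special_tokens=True)
        for i in range(0, len(ids), chunk_tokens)
    ]
    summaries = [
        summarizer(chunk, max_length=512, no_repeat_ngram_size=3)[0]["summary_text"]
        for chunk in chunks
    ]
    return "\n\n".join(summaries)
```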
541
 
 
549
 
550
  See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization).
551
 
552
+ This model was originally tuned on Google Colab with a heavily modified variant of the [longformer training notebook](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb), with DeepSpeed as the key enabler. You can try this as an alternate route to fine-tuning the model without using the command line.
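For reference, a rough Python sketch of continuing fine-tuning from this checkpoint with `Seq2SeqTrainer` (not the card's notebook; the booksum column names, sequence lengths, and hyperparameter values here are assumptions):

```python
# Sketch of further fine-tuning with Seq2SeqTrainer; the "chapter" and
# "summary_text" column names are assumptions about kmfoda/booksum.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "pszemraj/long-t5-tglobal-base-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dataset = load_dataset("kmfoda/booksum")

def preprocess(batch):
    inputs = tokenizer(batch["chapter"], max_length=16384, truncation=True)
    labels = tokenizer(text_target=batch["summary_text"], max_length=1024, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="long-t5-booksum-ft",  # placeholder output path
        learning_rate=5e-4,
        num_train_epochs=2,
        per_device_train_batch_size=1,  # assumption: keep long inputs in memory
        gradient_checkpointing=True,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```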
553
 
554
+ * * *
555
 
556
  ## Training procedure
557
 
558
  ### Updates:
559
 
560
  - July 22, 2022: updated to a fairly converged checkpoint
561
+ - July 3, 2022: Added a new version with several epochs of additional training that is generally more performant.
562
 
563
  ### Training hyperparameters
564
 
565
+ _NOTE: early checkpoints of this model were trained on a "smaller" subsection of the dataset as it was filtered for summaries of **1024 characters**. This was subsequently caught and adjusted to **1024 tokens** and then trained further for 10+ epochs._
566
+
567
  The following hyperparameters were used during the **most recent** training round\*:
568
 
569
  - learning_rate: 0.0005
 
578
  - lr_scheduler_warmup_ratio: 0.01
579
  - num_epochs: 2
580
 
581
+ \* Prior training sessions used roughly similar parameters; multiple sessions were required, as this model takes eons to train.
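Roughly, those values map onto `transformers` training arguments as in the sketch below; only values shown in this excerpt are filled in, and everything else is a default or an explicitly marked assumption.

```python
# Sketch: the listed hyperparameters expressed as Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="long-t5-tglobal-base-booksum",  # placeholder path
    learning_rate=5e-4,          # learning_rate: 0.0005
    warmup_ratio=0.01,           # lr_scheduler_warmup_ratio: 0.01
    num_train_epochs=2,          # num_epochs: 2
    predict_with_generate=True,  # assumption, typical for summarization
)
```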
 
 
582
 
583
  ### Framework versions
584
 
 
587
  - Datasets 2.3.2
588
  - Tokenizers 0.12.1
589
 
590
+ ## Citation info
 
591
 
592
  If you find `pszemraj/long-t5-tglobal-base-16384-book-summary` useful in your work, please consider citing this model :)
593
 
594
+ @misc {peter_szemraj_2022,
595
+ author = { {Peter Szemraj} },
596
+ title = { long-t5-tglobal-base-16384-book-summary (Revision 4b12bce) },
597
+ year = 2022,
598
+ url = { https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary },
599
+ doi = { 10.57967/hf/0100 },
600
+ publisher = { Hugging Face }
601
+ }