---
tags:
- summarization
- summary
- booksum
- long-document
- long-form
- tglobal-xl
- XL
license:
- apache-2.0
- bsd-3-clause
datasets:
- kmfoda/booksum
metrics:
- rouge
inference: false
model-index:
- name: pszemraj/long-t5-tglobal-xl-16384-book-summary
  results:
  - task:
      type: summarization
      name: Summarization
    dataset:
      name: multi_news
      type: multi_news
      config: default
      split: test
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 36.2043
      verified: true
    - name: ROUGE-2
      type: rouge
      value: 8.424
      verified: true
    - name: ROUGE-L
      type: rouge
      value: 17.3721
      verified: true
    - name: ROUGE-LSUM
      type: rouge
      value: 32.3994
      verified: true
    - name: loss
      type: loss
      value: 2.0843334197998047
      verified: true
    - name: gen_len
      type: gen_len
      value: 248.3572
      verified: true
---

# long-t5-tglobal-xl + BookSum

Summarize long text and get a SparkNotes-esque summary of arbitrary topics!
- Generalizes reasonably well to academic & narrative text.
- This is the XL checkpoint, which, **from a human-evaluation perspective, [produces even better summaries](https://long-t5-xl-book-summary-examples.netlify.app/)** than the smaller checkpoints.

A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on an ASR transcript is [here](https://longt5-booksum-example.netlify.app/).

## Cheeky Proof-of-Concept

A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):

> In this chapter, the monster explains how he intends to exact revenge on "the little b****" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.

While this is a somewhat crude example, try running this copypasta through other summarization models to see the difference in comprehension (_despite it not even being a "long" text!_).

---

## Description

A fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the `kmfoda/booksum` dataset.

Read the paper by Guo et al. here: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf) 

## How-To in Python

> 🚧 `LLM.int8()` appears to be compatible with summarization and does not degrade the quality of the outputs; this is a crucial enabler for using this model on standard GPUs. A PR for this is in-progress [here](https://github.com/huggingface/transformers/pull/20341), and this model card will be updated with instructions once done :) 🚧
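In the meantime, loading the model in 8-bit will probably look roughly like the sketch below. This is hedged: it assumes `bitsandbytes` and `accelerate` are installed and that `load_in_8bit` works for this architecture; the exact interface may change once the linked PR is merged.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/long-t5-tglobal-xl-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # int8 weight quantization via bitsandbytes (assumed supported here)
    device_map="auto",  # let accelerate place the weights on available devices
)

long_text = "Here is a lot of text I don't want to read. Replace me"
inputs = tokenizer(long_text, return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```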

Install/update transformers with `pip install -U transformers`.

Summarize text with the `pipeline` API:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])
```

Pass [other parameters related to beam search textgen](https://huggingface.co/blog/how-to-generate) when calling `summarizer` to get even higher quality results.
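For example, continuing from the snippet above (the values below are illustrative, not tuned settings):

```python
result = summarizer(
    long_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=4,
    repetition_penalty=2.5,
    num_beams=4,
    early_stopping=True,
)
print(result[0]["summary_text"])
```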

---

## About

### Intended uses & limitations

While this model seems to improve factual consistency, **do not take the summaries to be foolproof; check anything that seems odd**.

In particular, watch for negation statements (e.g., the model says _This thing does not have [ATTRIBUTE]_ when it should have said _This thing has a lot of [ATTRIBUTE]_).
- I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually fact-check this by comparing a specific claim to what the surrounding sentences imply.

### Training and evaluation data

The `kmfoda/booksum` dataset on Hugging Face; read [the original paper here](https://arxiv.org/abs/2105.08209).

- **Initial fine-tuning** used only rows with at most 12,288 input tokens and 1,024 output tokens (_i.e., longer rows were dropped before training_) for memory reasons. Per a brief analysis, summaries in the 12,288-16,384 token range are a **small** minority of this dataset; a filtering sketch follows this list.
  - In addition, the initial training combined the training and validation sets and trained on them in aggregate to increase the effective dataset size. **Therefore, take the validation-set results with a grain of salt; the primary metrics should (always) come from the test set.**
- The **final phases of fine-tuning** used the standard convention of 16,384 input / 1,024 output tokens, keeping all rows (and truncating longer sequences). This did not appear to change the loss/performance much.
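A rough sketch of that initial length filter (hypothetical code, not the exact preprocessing script used; the `chapter` and `summary_text` column names are assumptions about the dataset layout):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/long-t5-tglobal-xl-16384-book-summary")
booksum = load_dataset("kmfoda/booksum")

MAX_INPUT_TOKENS = 12288
MAX_OUTPUT_TOKENS = 1024

def within_limits(example):
    # keep only rows whose source chapter and reference summary fit the initial caps
    n_in = len(tokenizer(example["chapter"]).input_ids)
    n_out = len(tokenizer(example["summary_text"]).input_ids)
    return n_in <= MAX_INPUT_TOKENS and n_out <= MAX_OUTPUT_TOKENS

train_filtered = booksum["train"].filter(within_limits)
```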

### Eval results

Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.

**Please read the note above: due to the training methods, validation-set performance looks better than the test-set results will be.** The model achieves the following results on the evaluation set:
- eval_loss: 1.2756
- eval_rouge1: 41.8013
- eval_rouge2: 12.0895
- eval_rougeL: 21.6007
- eval_rougeLsum: 39.5382
- eval_gen_len: 387.2945
- eval_runtime: 13908.4995
- eval_samples_per_second: 0.107
- eval_steps_per_second: 0.027

```
***** predict/test metrics (initial) *****
  predict_gen_len            =   506.4368
  predict_loss               =      2.028
  predict_rouge1             =    36.8815
  predict_rouge2             =     8.0625
  predict_rougeL             =    17.6161
  predict_rougeLsum          =    34.9068
  predict_runtime            = 2:04:14.37
  predict_samples            =       1431
  predict_samples_per_second =      0.192
  predict_steps_per_second   =      0.048
```
\* Evaluating a model this large is not as easy as it seems; more investigation is in progress.

---

## FAQ

### How can I run inference with this on CPU?

lol

### How to run inference over a very long (30k+ tokens) document in batches?

See `summarize.py` in [the code for my hf space Document Summarization](https://huggingface.co/spaces/pszemraj/document-summarization/blob/main/summarize.py) :)

You can also use the same code to split a document into batches of 4096, etc., and run over those with the model. This is useful in situations where CUDA memory is limited.
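A minimal sketch of that chunking approach (hedged: a naive fixed-size token window, not the exact logic in `summarize.py`):

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

def summarize_in_chunks(text: str, chunk_tokens: int = 4096) -> str:
    """Split the document into fixed-size token windows, summarize each, and join the results."""
    tok = summarizer.tokenizer
    ids = tok(text, truncation=False).input_ids
    chunks = [
        tok.decode(ids[i : i + chunk_tokens], skip_special_tokens=True)
        for i in range(0, len(ids), chunk_tokens)
    ]
    partial_summaries = summarizer(chunks, truncation=True)
    return "\n\n".join(s["summary_text"] for s in partial_summaries)
```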

### How to fine-tune further?

See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization).


---

## Training procedure

### Updates

Updates to this model/model card will be posted here as relevant. The model seems fairly converged; if updates/improvements are possible using the `BookSum` dataset, this repo will be updated.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0006
- train_batch_size: 1
- eval_batch_size: 1
- seed: 10350
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0

\*_Prior training sessions used roughly similar parameters (learning rates were higher); multiple sessions were required as this takes eons to train._
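For reference, a sketch of how these settings map onto `Seq2SeqTrainingArguments` (the distributed launcher, model/tokenizer loading, and data collator are omitted; `output_dir` is a hypothetical path):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./long-t5-tglobal-xl-booksum",  # hypothetical path
    learning_rate=6e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,  # 1 per device x 4 GPUs x 32 steps = effective batch size 128
    lr_scheduler_type="constant",
    num_train_epochs=1.0,
    seed=10350,
    predict_with_generate=True,
)
```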

### Framework versions

- Transformers 4.25.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.6.1
- Tokenizers 0.13.1

---