formatting
README.md

- create the summarizer object:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

# low_cpu_mem_usage loads the checkpoint with a reduced peak-RAM footprint
model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
)

summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
)
```

- define text to be summarized, and pass it through the pipeline. Boom done.

```python
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)

print(result[0]["summary_text"])
```

## Alternate Checkpoint

- if experiencing runtime/memory issues, try [this earlier checkpoint](https://huggingface.co/pszemraj/bigbird-pegasus-large-booksum-40k-K) at 40,000 steps, which is almost as good at the explanatory summarization task but runs faster (see the sketch below).
- see similar summarization models fine-tuned on booksum but using different architectures: [long-t5 base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) and [LED-Large](https://huggingface.co/pszemraj/led-large-book-summary)
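
As a minimal sketch (assuming the standard `transformers` pipeline API; the `summarizer_40k` name is just for illustration), swapping in the earlier checkpoint only requires changing the model name:

```python
from transformers import pipeline

# hypothetical usage: the pipeline downloads the 40k-step checkpoint and its
# tokenizer by name, so no separate AutoModel/AutoTokenizer calls are needed
summarizer_40k = pipeline(
    "summarization",
    model="pszemraj/bigbird-pegasus-large-booksum-40k-K",
)

result = summarizer_40k(
    "your text to be summarized goes here.",
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
)
print(result[0]["summary_text"])
```
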
---

# Results

- note that while the dataset has three subsets (chapter, book, paragraph; see the [paper](https://arxiv.org/abs/2105.08209)), the scores below are computed in aggregate across them. The paper lists some benchmark scores, which this model competes with.
- note that eval generations are run & computed at a length of 128 tokens.

```
{'eval_gen_len': 126.9791,
 'eval_loss': 4.00944709777832,
 'eval_rouge1': 27.6028,
 'eval_rouge2': 4.6556,
 'eval_rougeL': 14.5259,
 'eval_rougeLsum': 25.6632,
 'eval_runtime': 29847.4812,
 'eval_samples_per_second': 0.05,
 'eval_steps_per_second': 0.05}
```
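
For context, here is a minimal sketch of how aggregate ROUGE numbers like these can be computed with the Hugging Face `evaluate` library; the actual evaluation script is not shown in this card, so treat this as an illustration rather than the exact procedure:

```python
import evaluate

rouge = evaluate.load("rouge")

# hypothetical inputs: generations from the summarizer (capped near 128 tokens,
# per the note above) paired with reference summaries from the dataset
predictions = ["model-generated summary for each eval sample goes here."]
references = ["reference (gold) summary for each eval sample goes here."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```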