pszemraj
/

led-large-book-summary

text2text-generation

pszemraj committed on Feb 22, 2022

Commit

3381aec

·

1 Parent(s): 683bb6c

add details

Files changed (1)

README.md +53 -6

README.md CHANGED

@@ -69,19 +69,66 @@ inference:
 # Longformer Encoder-Decoder (LED) fine-tuned on Booksum
-This model is a fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on an unknown dataset.
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure

 # Longformer Encoder-Decoder (LED) fine-tuned on Booksum
+- This model is a fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the booksum dataset.
+- the goal was to create a model that generalizes well and is useful for summarizing long texts in academic and everyday use.
+- all the generation parameters on the API are the same as for [the base model](https://huggingface.co/pszemraj/led-base-book-summary), for easy comparison between versions.
+- works well on long inputs; can handle up to 16384 tokens/batch.
+---
+# Usage - Basics
+- it is recommended to pass `encoder_no_repeat_ngram_size=3` when calling the pipeline object to improve summary quality.
+  - this param blocks the model from copying any 3-gram verbatim from the input, which pushes it toward new wording and an abstractive summary; otherwise it may just compile the best _extractive_ summary from the input provided.
+- create the pipeline object:
+```
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
+
+# this card's model; use 'pszemraj/led-base-book-summary' for the base version
+hf_name = 'pszemraj/led-large-book-summary'
+
+# load the fine-tuned LED weights and the matching tokenizer
+_model = AutoModelForSeq2SeqLM.from_pretrained(
+    hf_name,
+    low_cpu_mem_usage=True,
+)
+_tokenizer = AutoTokenizer.from_pretrained(hf_name)
+
+# wrap both in a summarization pipeline
+summarizer = pipeline(
+    "summarization",
+    model=_model,
+    tokenizer=_tokenizer,
+)
+```
+- put words into the pipeline object:
+```
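+wall_of_text = "your words here"
+
+result = summarizer(
+    wall_of_text,
+    min_length=16,
+    max_length=256,
+    no_repeat_ngram_size=3,          # no 3-gram repeated within the summary
+    encoder_no_repeat_ngram_size=3,  # no 3-gram copied from the input
+    clean_up_tokenization_spaces=True,
+    repetition_penalty=3.7,
+    num_beams=4,
+    early_stopping=True,
+)
+# the pipeline returns a list with one dict per input text
+print(result[0]["summary_text"])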
+```
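To make the effect of `encoder_no_repeat_ngram_size=3` more concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not the `transformers` implementation: the helper name `banned_next_tokens` is hypothetical, and plain whitespace-split words stand in for token ids. Given the input text and the summary generated so far, it lists which next tokens would complete an n-gram copied from the input and would therefore be blocked.

```python
def banned_next_tokens(source_tokens, generated_tokens, n=3):
    """Return the tokens that would complete an n-gram copied from the source.

    Hypothetical helper for illustration; words stand in for token ids.
    """
    # Fewer than n-1 generated tokens cannot form the start of an n-gram yet.
    if len(generated_tokens) < n - 1:
        return set()

    # The last n-1 generated tokens are the prefix a copied n-gram would extend.
    prefix = tuple(generated_tokens[len(generated_tokens) - (n - 1):])

    banned = set()
    # Slide over every n-gram in the source and compare its first n-1 tokens.
    for i in range(len(source_tokens) - n + 1):
        if tuple(source_tokens[i:i + n - 1]) == prefix:
            banned.add(source_tokens[i + n - 1])
    return banned


source = "the cat sat on the mat".split()

# The summary so far ends in "the cat"; "sat" would copy a source 3-gram.
print(banned_next_tokens(source, ["the", "cat"]))  # -> {'sat'}

# A prefix that never occurs in the source blocks nothing.
print(banned_next_tokens(source, ["a", "dog"]))  # -> set()
```

In this toy version, the summary can still reuse individual words and word pairs from the input; only runs of three consecutive input tokens are ruled out, which is why the setting nudges the output away from purely extractive summaries.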
 ## Training and evaluation data
+- the [booksum](https://arxiv.org/abs/2105.08209) dataset
+- During training, the input was the full text of a chapter and the target output was that chapter's summary.
 ## Training procedure