---
language: en
datasets:
- Spotify Podcast Dataset
metrics:
- rouge
---

# T5 for Automatic Podcast Summarisation

This model is the result of fine-tuning [t5-base](https://huggingface.co/t5-base) on the [Spotify Podcast Dataset](https://arxiv.org/abs/2004.04270).

It is based on [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) which was pretrained on the [C4 dataset](https://huggingface.co/datasets/c4).

Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)

Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

## Intended uses & limitations

This model is intended to be used for automatic podcast summarisation. Because creator-provided descriptions were used for training, the model also learned to generate promotional material in its summaries; as such, some post-processing may be required on the model's outputs.
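
As a rough illustration of such post-processing, a simple rule-based filter can drop sentences that look promotional. This is a minimal sketch only, not something shipped with the model; the patterns are assumptions and would need tuning against real outputs.

```python
import re

# Illustrative promotional-content filter (an assumption, not part of the
# model): drop sentences matching common sponsorship/plug patterns.
PROMO_PATTERNS = [
    r'https?://\S+',                # bare URLs
    r'\bfollow (us|me) on\b',       # social-media plugs
    r'\b(promo|discount) code\b',   # sponsor codes
    r'\bsponsored by\b',
]

def strip_promotional(summary: str) -> str:
    sentences = re.split(r'(?<=[.!?])\s+', summary)
    kept = [s for s in sentences
            if not any(re.search(p, s, re.IGNORECASE) for p in PROMO_PATTERNS)]
    return ' '.join(kept)
```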

#### How to use

A 'summarize:' tag must be prepended to the source text before it is passed to the T5 model.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('paulowoicho/t5-podcast-summarisation')
model = T5ForConditionalGeneration.from_pretrained('paulowoicho/t5-podcast-summarisation')

# podcast_transcript is assumed to already hold the transcript text
podcast_transcript = 'summarize: ' + podcast_transcript

# Tokenise the input, truncating anything beyond the model's 512-token limit
tokens = tokenizer.encode(podcast_transcript, return_tensors='pt', truncation=True)

# Generate a summary with beam search
summary_ids = model.generate(
    tokens,
    max_length=150,
    num_beams=2,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True,
)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(output)
```
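
Because t5-base accepts at most 512 input tokens, the call above truncates long transcripts. One possible workaround (a sketch under that assumption, not the procedure used to build this model) is to summarise the transcript window by window and join the partial summaries:

```python
def summarise_long_transcript(text: str, window_tokens: int = 480) -> str:
    # Split the transcript into token windows that fit the 512-token limit
    # (leaving room for the 'summarize: ' prefix), summarise each window,
    # and join the partial summaries.
    ids = tokenizer.encode(text)
    windows = [ids[i:i + window_tokens] for i in range(0, len(ids), window_tokens)]
    partials = []
    for window in windows:
        prompt = 'summarize: ' + tokenizer.decode(window, skip_special_tokens=True)
        tokens = tokenizer.encode(prompt, return_tensors='pt', truncation=True)
        summary_ids = model.generate(tokens, max_length=150, num_beams=2,
                                     repetition_penalty=2.5, length_penalty=1.0,
                                     early_stopping=True)
        partials.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
    return ' '.join(partials)
```

The joined partial summaries could also be passed through the model once more to produce a single, more coherent final summary.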

## Training data

This model is the result of fine-tuning [t5-base](https://huggingface.co/t5-base) on the [Spotify Podcast Dataset](https://arxiv.org/abs/2004.04270), with episode transcripts as inputs and creator-provided descriptions as target summaries.
[Pre-processing](https://github.com/paulowoicho/msc_project/blob/master/reformat.py) was done on the original data before fine-tuning.
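
The exact steps live in the linked reformat.py script; purely as an illustration of the kind of (transcript, description) pairs the model was trained on, and not a reproduction of that script, the data can be pictured as follows:

```python
import pandas as pd

# Hypothetical illustration only: each training example pairs an episode
# transcript (model input) with its creator-provided description (target).
examples = [
    {'transcript': '<full episode transcript>',
     'description': '<creator-provided episode description>'},
]

data = pd.DataFrame(examples)
data['transcript'] = 'summarize: ' + data['transcript']  # T5 task prefix
```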

## Training procedure

Training was largely based on [Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) by [Abhishek Kumar Mishra](https://github.com/abhimishra91).
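
That notebook follows a standard PyTorch training loop, roughly as sketched below. The loop shape reflects the notebook's approach, but train_dataset and all hyperparameter values here are illustrative assumptions, not the exact ones used for this model.

```python
import torch
from torch.utils.data import DataLoader

# Illustrative fine-tuning loop; hyperparameters are assumptions.
# train_dataset is assumed to yield dicts of tokenised transcripts
# ('input_ids', 'attention_mask') and tokenised descriptions ('labels').
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

model.train()
for epoch in range(2):
    for batch in loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # T5 computes the cross-entropy loss internally when labels are given
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```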