---
language: en
datasets:
- Spotify Podcasts Dataset
metrics:
- rouge
---

# T5 for Automatic Podcast Summarisation

This model is the result of fine-tuning [t5-base](https://huggingface.co/t5-base) on the [Spotify Podcast Dataset](https://arxiv.org/abs/2004.04270).

It is based on [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html), which was pretrained on the [C4 dataset](https://huggingface.co/datasets/c4).

Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)

Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

## Intended uses & limitations

This model is intended for automatic podcast summarisation. Because creator-provided episode descriptions were used as training targets, the model also learned to generate promotional material in its summaries, so some post-processing may be required on its outputs.

#### How to use

A 'summarize:' prefix must be prepended to the source text before it is passed to the model, as in the snippet below.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('paulowoicho/t5-podcast-summarisation')
model = T5ForConditionalGeneration.from_pretrained('paulowoicho/t5-podcast-summarisation')

podcast_transcript = '...'  # the transcript you want to summarise

# T5 is a text-to-text model, so the task is signalled with a prefix
podcast_transcript = 'summarize: ' + podcast_transcript

# truncation keeps long transcripts within the model's input limit
tokens = tokenizer.encode(podcast_transcript, return_tensors='pt', truncation=True)
summary_ids = model.generate(
    tokens,
    max_length=150,
    num_beams=2,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True,
)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(output)
```
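
The generation settings above are a quality/speed trade-off: a small beam width (`num_beams=2`) keeps decoding fast, while `repetition_penalty=2.5` counteracts the repetitive phrasing T5 tends to produce on long inputs. As noted under limitations, summaries may still contain promotional material; a crude keyword filter like the sketch below can strip the most obvious cases (the pattern list is an illustrative assumption, not part of the model):

```python
# Hypothetical post-processing: drop sentences that look promotional.
# The keyword patterns are illustrative; adjust them for your own data.
import re

PROMO_PATTERNS = re.compile(r'subscribe|follow us|patreon|sponsored|promo code',
                            re.IGNORECASE)

def strip_promotional(summary: str) -> str:
    sentences = re.split(r'(?<=[.!?])\s+', summary)
    return ' '.join(s for s in sentences if not PROMO_PATTERNS.search(s))

print(strip_promotional(output))  # 'output' from the snippet above
```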

## Training data

This model is the result of fine-tuning [t5-base](https://huggingface.co/t5-base) on the [Spotify Podcast Dataset](https://arxiv.org/abs/2004.04270).
[Pre-processing](https://github.com/paulowoicho/msc_project/blob/master/reformat.py) was applied to the original data before fine-tuning.
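
The exact steps live in the linked script; as a rough illustration of the idea only (the paths and field names below are hypothetical, not the dataset's real schema), pre-processing amounts to pairing each episode transcript with its creator-provided description:

```python
# Hypothetical sketch of the pairing step; see reformat.py (linked above)
# for the actual pre-processing. File layout and JSON fields are assumptions.
import csv
import json
from pathlib import Path

rows = []
for meta_file in Path('podcasts').glob('*.json'):
    episode = json.loads(meta_file.read_text())
    transcript = Path('transcripts', meta_file.stem + '.txt').read_text()
    rows.append({'transcript': transcript, 'summary': episode['description']})

with open('podcasts.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['transcript', 'summary'])
    writer.writeheader()
    writer.writerows(rows)
```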

## Training procedure

Training was largely based on [Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) by [Abhishek Kumar Mishra](https://github.com/abhimishra91).
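
That notebook follows the standard sequence-to-sequence fine-tuning recipe for T5. A minimal sketch of the same approach is below; it is not the author's exact script, and the CSV name (reusing the hypothetical `podcasts.csv` from above), column names, and hyperparameters are assumptions:

```python
# Minimal T5 fine-tuning sketch (illustrative, not the original notebook).
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer

class PodcastDataset(Dataset):
    """Pairs a prefixed transcript with its description as the target."""

    def __init__(self, frame, tokenizer):
        self.frame = frame
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        source = self.tokenizer('summarize: ' + row['transcript'],
                                max_length=512, truncation=True,
                                padding='max_length', return_tensors='pt')
        target = self.tokenizer(row['summary'], max_length=150,
                                truncation=True, padding='max_length',
                                return_tensors='pt')
        labels = target['input_ids'].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # mask padding in the loss
        return {'input_ids': source['input_ids'].squeeze(0),
                'attention_mask': source['attention_mask'].squeeze(0),
                'labels': labels}

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')
loader = DataLoader(PodcastDataset(pd.read_csv('podcasts.csv'), tokenizer),
                    batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for batch in loader:
    loss = model(**batch).loss  # cross-entropy computed internally from labels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```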