Training Setup + Results
Hi @pszemraj! This is super cool, thanks for working on this and sharing. Do you mind sharing your training setup in terms of GPUs, etc.? I'm trying to partition this model across a number of GPUs to avoid OOM errors using DeepSpeed, but that's failing. I'm using 4 NVIDIA Tesla V100 GPUs with 32 GB RAM but am still running out of memory.
Hey! Thanks for uploading the dataset, so I didn't have to run the script off their repo :)
Sure, I'm currently traveling, so let me know if you want more detail, and I can follow up once I'm back:
- I use a single V100 and 52 GB of CPU RAM with DeepSpeed ZeRO-2; see the config I just uploaded here
- due to memory constraints, I keep the batch size at one and just crank gradient accumulation, usually 64 or higher (a rough sketch of these settings follows after this list)
- from the length values, I think a `max_length` of 1024 for outputs captures most of the dataset, so I filter out all summaries that are longer than that. If you're still facing memory constraints, I think you can also keep `max_input_length` (for training) at 8192 and filter out anything longer. Anecdotally, the model can still run `predict` on 16384 and still produce good summaries (I guess scaling up what it learned when running inference). IIRC most of the summaries are shorter anyway; the base model fits within constraints on a V100, so that one always trains with inputs of 16384
- IMO, while it takes AGES longer to train, the `long-t5` models are starting to perform much better than LED (plus, when I have access to an A100, you can train in bfloat16), so better progress can be made there vs. training these
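For concreteness, here is a minimal sketch of the setup described in the list above, assuming the Hugging Face `Trainer`/`datasets` stack. The dataset ID, column names, and the exact DeepSpeed values are illustrative placeholders, not the actual uploaded config.

```python
# Rough sketch of the single-GPU setup described above (not the exact uploaded config).
from datasets import load_dataset
from transformers import AutoTokenizer, Seq2SeqTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")

# Placeholder dataset/column names; substitute whichever BookSum export you are using.
dataset = load_dataset("kmfoda/booksum")
MAX_INPUT_LENGTH = 8192    # or 16384 if it fits in memory
MAX_OUTPUT_LENGTH = 1024   # drop examples whose summary is longer than this

def within_limits(example):
    n_input = len(tokenizer(example["chapter"]).input_ids)
    n_summary = len(tokenizer(example["summary_text"]).input_ids)
    return n_input <= MAX_INPUT_LENGTH and n_summary <= MAX_OUTPUT_LENGTH

dataset = dataset.filter(within_limits)

# ZeRO stage 2 config passed as a dict (a path to a JSON file also works).
ds_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
}

training_args = Seq2SeqTrainingArguments(
    output_dir="led-large-booksum",
    per_device_train_batch_size=1,   # batch size of one due to memory constraints
    gradient_accumulation_steps=64,  # crank gradient accumulation instead
    gradient_checkpointing=True,
    fp16=True,                       # use bf16=True instead on an A100
    deepspeed=ds_config,
)
```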
Hope that helps to start!
Interesting, thank you very much for this. I didn't consider using DeepSpeed with just one GPU; I was focused on partitioning the model across GPUs, and I think that fails with these large-attention models. Do you use gradient checkpointing in this setup?
Ideally I would like to avoid this, as it really slows down training and I would like to do many sweeps to granularly fine-tune a number of the hyperparameters to suit book summarisation.
Also, thanks for the tips re: reducing the output length. Ideally I would like to avoid this, as I have a private dataset with a lot of data points at 16384 tokens.
I agree completely re: long-t5. I'm a big fan and want to move to training long-t5-xl at some point; I just need to understand how to predict memory requirements for these models with a max_length of 16384, and understand why DeepSpeed fails with these types of models.
One option I'm experimenting with now is TPUs, as I believe that's what they used in the paper, but I'm progressing quite slowly on that front.
Fair enough! I posted a WIP checkpoint and made it public at pszemraj/long-t5-tglobal-large-pubmed-3k-booksum-16384-WIP; you may find it helpful as a starting point. Let's continue the discussion there, but if you have other things on LED, feel free to reopen this.
Also, if you want to communicate off of HF or collaborate on something, pick a means of communication on my site and reach out :)
Oh oops, I forgot one question:
Do you use gradient checkpointing in this setup?
Yes, I do (typically alongside 32-64 gradient-accumulation steps). I think unless you have access to a massive compute setup, it's more or less required (no free lunch, etc.). I believe it's even used in the PubMed-3k model as well: see the W&B page
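For reference, a minimal sketch of turning gradient checkpointing on with `transformers` (the model name here is just an example):

```python
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-large-16384")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory
model.config.use_cache = False         # the generation cache is incompatible with checkpointing during training

# Equivalently, let the Trainer handle it via the training arguments:
args = Seq2SeqTrainingArguments(output_dir="out", gradient_checkpointing=True)
```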
So which longer pre-trained model is best to use for 16k inputs?
And what are the optimal hyperparameters?
Currently I am testing the code below, and it has yet to complete on my 8k-token input. It is running on an RTX 3060 GPU with 12 GB of VRAM.
import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,  # use the GPU if one is available
)

result = summarizer(
    wall_of_text,  # the full input text to be summarized
    min_length=800,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)

with open('pszemraj-led-large-book-summary.txt', 'w') as f:
    f.write(result[0]['summary_text'])
@MonsterMMORPG see here for a response and links on hyperparameters/usage. I would also recommend pszemraj/long-t5-tglobal-base-16384-book-summary over this one from a quality:compute perspective.
On this model card (see the Colab notebook link), there is a very detailed example notebook allowing for parameter adjustment; I would experiment with that too
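If it helps, here is a minimal sketch of swapping that checkpoint into the same `pipeline` call as in the snippet above; the generation parameters are illustrative, not the tuned values from the notebook, and `wall_of_text` is the same long input as before.

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

result = summarizer(
    wall_of_text,     # same long input as in the snippet above
    max_length=512,   # illustrative; tune alongside min_length for your use case
    no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
print(result[0]["summary_text"])
```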