running and fine-tuning

#14
by mishavee - opened

How much GPU memory do I need to run BLOOMZ on A100 GPUs? What about fine-tuning?

BigScience Workshop org

How much GPU memory do I need to run BLOOMZ on A100 GPUs?

Ideally you want 8x A100s with 80GB each; then you can load it directly via accelerate + transformers. It may also work with 8x 40GB or 4x 80GB.
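
For reference, here's a minimal sketch of loading it that way (assuming `accelerate` is installed so that `device_map="auto"` can shard the weights across all visible GPUs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# device_map="auto" shards the weights across the available GPUs;
# torch_dtype="auto" keeps the checkpoint's bfloat16 weights instead of upcasting.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", torch_dtype="auto"
)

inputs = tokenizer("Translate to English: Je t'aime.", return_tensors="pt").to("cuda")
outputs = model.generate(inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```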

What about fine-tuning?

We trained it with 72x pipeline parallel, 1x tensor parallel, 4x data parallel & batch size = 1 on 288 A100 80GB GPUs. You could probably halve the data parallelism, but you may have to increase tensor parallelism to make it fit in that case. So I'd estimate 144 A100 80GB GPUs is the minimum.
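
For reference, the GPU count is just the product of the parallelism degrees: 72 (pipeline) × 1 (tensor) × 4 (data) = 288 GPUs, so halving data parallelism to 2 gives 72 × 1 × 2 = 144 GPUs.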

Is there any way to fine-tune it with 8 GPUs?

Is there any way to batch the input dataset?

Like, instead of one of the examples below per training (i.e. fine-tuning) input, put many together as a single input?

paraphrase:
sentence1: (sentence)
sentence2: (sentence)

BigScience Workshop org

Is there any way to fine-tune it with 8 GPUs?

You can finetune the smaller models with 8 GPUs, for example bloomz-7b1 or mt0-xxl. They are also very strong; mt0-xxl even outperforms bloomz on most tasks we measured.
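
If it helps as a starting point, here's a minimal sketch of finetuning one of the smaller checkpoints with the plain transformers Trainer (the dataset, sequence length, and hyperparameters below are placeholder assumptions, not the recipe used for BLOOMZ itself; the mt0 models are seq2seq and would use AutoModelForSeq2SeqLM instead):

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bigscience/bloomz-560m"  # swap in bloomz-7b1 given enough memory
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Placeholder dataset with one example per line in train.txt; use your own data.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloomz-560m-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal language modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```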

Is there any way to batch the input dataset?

Yes, of course, you can batch inputs.
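
For example (hypothetical sentences), a padded batch looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
batch = [
    "paraphrase: sentence1: The cat sat. sentence2: A cat was sitting.",
    "paraphrase: sentence1: It rained all day. sentence2: Rain fell all day.",
]
# padding=True pads every example to the longest sequence in the batch
enc = tokenizer(batch, padding=True, return_tensors="pt")
print(enc.input_ids.shape)  # (batch_size, longest_sequence_in_batch)
```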

Do you batch in the way I gave as an example?

What is the difference between the mt versions and bloomz?

What is the difference between bloom and bloomz?

What is the vocabulary size for bloomz-7b1 or mt0-xxl?

BigScience Workshop org

Do you batch in the way I gave as an example?

Yes, we combine many examples.
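
If it helps, here's a rough sketch of that kind of packing, assuming examples are concatenated with EOS separators up to a fixed sequence length (the exact logic in the training code may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
max_len = 512  # assumed packing length, for illustration only

examples = [
    "paraphrase: sentence1: The cat sat. sentence2: A cat was sitting.",
    "paraphrase: sentence1: It rained. sentence2: Rain fell.",
]

packed, current = [], []
for ex in examples:
    # EOS between examples marks boundaries within a packed sequence
    ids = tokenizer(ex).input_ids + [tokenizer.eos_token_id]
    if current and len(current) + len(ids) > max_len:
        packed.append(current)
        current = []
    current.extend(ids)
if current:
    packed.append(current)

print(f"{len(examples)} examples packed into {len(packed)} sequence(s)")
```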

What is the difference between the mt versions and bloomz?

The mt versions are finetuned on xP3mt instead; they're better for non-English prompts.

What is the difference between bloom and bloomz?

BLOOMZ is better for following instructions. BLOOM is for continuing text.

What is the vocabulary size for bloomz-7b1 or mt0-xxl?

It's written in the configs, e.g. https://huggingface.co/bigscience/mt0-xxl-mt/blob/main/config.json#L31
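
For example, you can read it programmatically:

```python
from transformers import AutoConfig

# vocab_size is stored in each model's config.json
print(AutoConfig.from_pretrained("bigscience/mt0-xxl").vocab_size)     # 250112
print(AutoConfig.from_pretrained("bigscience/bloomz-7b1").vocab_size)  # 250880
```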

What is xP3mt? How is it different? What is it normally trained on?

Please give an example of batching.

When you say mt0-xxl outperforms bloomz on most tasks, do you mean it outperforms the 176B model?

Hi!
I am trying to finetune BLOOMZ for data-to-text generation. The idea was to use a simple prefix like "Generate a natural English explanation of the following data:"
I was looking to use one of the smaller models, as my machine only has 4 GPUs (48 GB total memory). If such a thing isn't possible with my hardware, feel free to not read on and say so!

I was following this guide: https://github.com/bigscience-workshop/xmtf#bloomz
And have a few questions.
The slurm script linked above calls finetune_t0.py, but in the repo only finetune_t0_non_causal_decoder.py is available. Is it okay to use that instead?

I have my own data-to-text dataset. If I wanted to finetune on this data, would using bloom-560-optimizer-states be the right call? If not, what is a better option?
If so, should I follow this guide (linked for the mt0 models in the guide above; not sure if it applies to bloomz): https://github.com/google-research/t5x/blob/main/docs/usage/finetune.md
Otherwise, how should I modify the script to work with the 560m model? It seems like pointing CHECKPOINT_PATH at it and removing no-load-optim/reset-progress was not enough.

Hopefully these questions make sense; apologies for the dense post! If there are more helpful resources out there regarding finetuning BLOOMZ for a new task, please let me know.
Thanks!

BigScience Workshop org
edited Dec 2, 2022

4 GPUs (48 GB total memory)

Do you mean 48GB in total or 48GB per GPU? If it's in total, i.e. you have 12GB on each GPU, then I don't think it's reasonably feasible, cc @TimeRobber

In the slurm script linked above, it calls finetune_t0.py, but in the repo only finetune_t0_non_causal_decoder.py is available. Is this okay to use instead?

That's because you need to clone the repo on the t0loading branch, as written in the guide 👍

would using bloom-560-optimizer-states be the right call?

Yes

If so, should I follow this guide

You first need to preprocess your dataset into Meg-DS format, e.g. as done here: https://github.com/bigscience-workshop/bigscience/blob/master/data/xp3/xp3_jsonl_to_meg.slurm
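
As a rough sketch of what that first step might look like, writing your examples to JSONL before running the slurm script (the "inputs"/"targets" keys here mirror the xP3 dataset schema and are an assumption; check them against the keys the preprocessing script actually expects):

```python
import json

# Hypothetical data-to-text examples
examples = [
    {
        "inputs": "Generate a natural English explanation of the following data: ...",
        "targets": "...",
    },
]

with open("my_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```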

BigScience Workshop org

@Charm3link I think if you're trying to finetune the smallest BLOOM, you might be able to? It really depends on your setup and on how fast you want to be able to finetune it. We should probably port some of the finetuning techniques we used into the transformers library at some point so people can start leveraging them.

What are the minimum requirements to fine-tune the bloomz-560m model on custom data?
Can anyone guide me?
