Train Bloom 560M

#1
by Mayhem50 - opened

Hi,

I was trying to replicate your work on the bloom-560m model. I just finished the fine-tuning and I think my setup may have been wrong.
I used your command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch /content/code/biencoder/nli_msmarco/sentence-transformers/examples/training/nli/training_nli_v2.py --model_name bigscience/bloom-560m --freezenonbias --train_batch_size 64 --lr 32e-5 --pooling weightedmean --wandb --wandbwatchlog gradients --gradcache --chunksize 4
Should I modify something?

Another question: can the model be improved for French by fine-tuning it in a multilingual setup, as described here: https://www.sbert.net/examples/training/multilingual/README.html ?

Thanks

BigScience Data org

The command looks fine to me - did training already finish? If not, which error did you get?

Yes, if you have good French data available, I would expect slightly better performance by training on it.
You can try with the French STS datasets from the link you sent 👍
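In case it helps, here is a minimal sketch of that kind of fine-tuning with sentence-transformers, assuming your French data comes as sentence pairs with similarity scores; the starting checkpoint and the example pairs below are just placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the NLI-trained checkpoint (placeholder: your own repo or a local path)
model = SentenceTransformer("Mayhem50/sgpt-bloom-560M-nli")

# Placeholder French sentence pairs with similarity scores scaled to [0, 1]
train_examples = [
    InputExample(texts=["Un homme joue de la guitare.",
                        "Une personne joue d'un instrument."], label=0.8),
    InputExample(texts=["Un chat dort sur le canapé.",
                        "Il pleut sur la ville."], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Regress the cosine similarity of the two embeddings onto the gold score
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```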

Let me know how it goes!

The training has finished: https://huggingface.co/Mayhem50/sgpt-bloom-560M-nli
But I was expecting a better score on my dataset.

I will try fine-tuning both and see if the improvements are significant.

Thanks a lot.

BigScience Data org

Oh nice!
Note that the gap between BitFit & full fine-tuning only diminishes as you increase model size. For 560 million parameters you are likely better off training without BitFit (i.e. remove the --freezenonbias from your command).
If you scale up to 1.7B like this model or 7.1B like https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco, BitFit should perform just as well as full fine-tuning, so you can keep the command as is.
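For reference, BitFit (the --freezenonbias flag) just freezes everything except the bias terms, so only the biases get gradient updates. Roughly something like the sketch below, which is an approximation of the idea rather than the exact code in training_nli_v2.py:

```python
from transformers import AutoModel

# Load the base model (same checkpoint as in the training command)
model = AutoModel.from_pretrained("bigscience/bloom-560m")

# BitFit: freeze every parameter except the bias terms
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```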

Also make sure that your downstream task is a symmetric one. If it's search-related, you may be better off training on MSMARCO.

I use this command to train:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch /content/code/biencoder/nli_msmarco/sentence-transformers/examples/training/nli/training_nli_v2.py --model_name bigscience/bloom-560m --train_batch_size 64 --lr 32e-5 --pooling weightedmean --wandb --wandbwatchlog gradients --gradcache --chunksize 4

But it doesn't run in parallel: using multiple GPUs is no faster than a single GPU. What could be the problem?

BigScience Data org

Maybe you have to run accelerate config and select multiple GPUs.
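For example, run accelerate config once and answer the prompts with multi-GPU and 8 processes, so that accelerate launch picks up all the GPUs. Alternatively, assuming a single machine with 8 GPUs, you can pass the settings directly on the command line: accelerate launch --multi_gpu --num_processes 8 training_nli_v2.py ... (with the same script arguments as before).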
