A way to run inference on and fine-tune BLOOMZ-176B from Google Colab or locally

#28 · opened by borzunov (BigScience Workshop org) · edited Jan 18, 2023

Now you can run inference and fine-tune BLOOMZ (the instruction-finetuned 176B version of BLOOM) using the Petals swarm.

You can use BLOOMZ via this Colab notebook and get an inference speed of 1-2 sec/token for a single sequence. Running the notebook on a local machine also works; you'd only need 10+ GB of GPU memory or 12+ GB of RAM (though it will be slower without a GPU).

Note: Don't forget to replace bigscience/bloom-petals with bigscience/bloomz-petals in the model name.
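For reference, here is a minimal sketch of what the notebook does, based on the Petals API around that time (the import path was `petals.client` in older releases, so exact names may differ by version):

```python
import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM  # older versions: from petals.client import ...

MODEL_NAME = "bigscience/bloomz-petals"  # bloomz-petals, not bloom-petals

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
# Only the embeddings live on your machine; the transformer blocks are served by the swarm
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
if torch.cuda.is_available():
    model = model.cuda()

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
if torch.cuda.is_available():
    inputs = inputs.cuda()

outputs = model.generate(inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```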

As an example, there is a chatbot app running BLOOMZ this way.

Sorry for the cross-posting, but I hope this is useful, given that the free inference API is not available right now.

Hey there - curious about this setup. I'm running inference on a smaller version of the model but could fit the notebook in memory. Is it truly collaborative, in that I can contribute processing when I'm not directly running inference? I'm confused about the Petals goal/architecture.

BigScience Workshop org

Hi @JHenzi ,

Yes, Petals is truly collaborative - you can connect your GPU and increase its capacity, as described in our GitHub readme: https://github.com/bigscience-workshop/petals#connect-your-gpu-and-increase-petals-capacity

The goal of Petals is to provide a way to run 100B+ language models without having a whole GPU cluster. Instead, you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning. See the architecture details in our paper: https://arxiv.org/abs/2209.01188
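As a rough illustration of the fine-tuning side, here is a minimal prompt-tuning sketch in the style of the Petals examples of that time. The `tuning_mode="ptune"` and `pre_seq_len` arguments and the training loop are assumptions based on those examples and may differ across versions:

```python
import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloomz-petals"

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
# Trainable prompt embeddings stay on your machine; the frozen blocks run in the swarm
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, tuning_mode="ptune", pre_seq_len=16
)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

batch = tokenizer("Translate to French: Hello, world!", return_tensors="pt")["input_ids"]
for step in range(10):
    # Only the local prompt parameters receive gradients
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```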
