Now you can run inference and fine-tune BLOOMZ (the 176B English version) using the Petals swarm.
You can use BLOOMZ via this Colab notebook and get an inference speed of 1-2 sec/token for a single sequence. Running the notebook on a local machine also works: you need only 10+ GB of GPU memory or 12+ GB of RAM (though it will be slower without a GPU).
Note: Don't forget to set the model name to bigscience/bloomz-petals.
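For reference, here is a minimal sketch of what the model-name change looks like in code. It assumes a Petals release from around the time of this post, which exposed a DistributedBloomForCausalLM class, and it assumes the tokenizer can be loaded from the same repo. Newer Petals releases may use different class and model names, so treat this as an illustration rather than the exact notebook code. The generate_with_bloomz helper is a hypothetical name of my own.

```python
# Sketch only: assumes `pip install petals` (a late-2022 release) and a reachable swarm.
MODEL_NAME = "bigscience/bloomz-petals"  # the model name from the note above


def generate_with_bloomz(prompt: str, max_new_tokens: int = 5) -> str:
    """Generate a continuation of `prompt` using BLOOMZ served by the Petals swarm."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    from transformers import BloomTokenizerFast
    from petals import DistributedBloomForCausalLM

    tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
    # Embeddings stay on your machine; the transformer blocks run on remote servers.
    model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])
```

Calling generate_with_bloomz("A cat sat") would download the local part of the model and then query the swarm, so expect the first call to take a while.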
As an example, there is a chatbot app running BLOOMZ this way.
Sorry for the cross-posting, but I really hope this is useful, given that the free inference API is not available right now.
Hey there - curious about this setup. I'm running inference on a smaller version of the model but could fit the notebook in memory. Is it truly collaborative in that I can add to latent processing when I'm not directly running inference? Confused on the petals goal/arch.
Yes, Petals is truly collaborative - you can connect your GPU and increase the swarm's capacity, as described in our GitHub readme: https://github.com/bigscience-workshop/petals#connect-your-gpu-and-increase-petals-capacity
The Petals goal is to give a way to run 100B+ language models without having a GPU cluster. Instead, you can load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning. See the arch details in our paper: https://arxiv.org/abs/2209.01188
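To make the "load a small part, team up for the rest" idea concrete, here is a toy sketch in plain Python. It is not Petals code: the Server class, partition_blocks, and run_pipeline are hypothetical names, the even split is my simplification, and each "block" is just a function on a number. It only shows that if every participant holds a contiguous slice of the model's blocks, passing the hidden state from one to the next covers the whole model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    start: int  # first block index served (inclusive)
    end: int    # one past the last block index served (exclusive)


def partition_blocks(num_blocks: int, server_names: List[str]) -> List[Server]:
    """Split block indices 0..num_blocks-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_blocks, len(server_names))
    servers, start = [], 0
    for i, name in enumerate(server_names):
        size = base + (1 if i < extra else 0)  # first `extra` servers get one more block
        servers.append(Server(name, start, start + size))
        start += size
    return servers


def run_pipeline(hidden: float, servers: List[Server],
                 block_fn: Callable[[int, float], float]) -> float:
    """Pass the hidden state through every block, one server after another."""
    for server in servers:
        for block_index in range(server.start, server.end):
            hidden = block_fn(block_index, hidden)
    return hidden


NUM_BLOCKS = 70  # BLOOM-176B has 70 transformer blocks
servers = partition_blocks(NUM_BLOCKS, ["alice", "bob", "carol"])
for s in servers:
    print(f"{s.name}: blocks {s.start}..{s.end - 1}")

# Stand-in block that just counts how many blocks the state passed through.
blocks_applied = run_pipeline(0.0, servers, lambda i, h: h + 1)
```

The real swarm also has to deal with finding servers, overlapping block ranges, and servers dropping out, none of which is modeled here; the paper linked above covers those parts.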