arxiv:2312.08361

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Published on Dec 13, 2023

Submitted by akhaliq on Dec 14, 2023

#2 Paper of the day

Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Colin Raffel

Abstract

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. To do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals, a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.
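The load-balancing idea in the abstract (assigning devices so that total system throughput is maximized) can be illustrated with a small toy model. The sketch below is not the Petals algorithm; it assumes a pipeline of transformer blocks in which each server hosts one contiguous span, treats the throughput of a block as the sum of the throughputs of the servers covering it, and treats the system throughput as the slowest block. The names `block_throughputs` and `choose_span` and the `(start, end, rps)` tuple layout are hypothetical, introduced only for this illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (start_block, end_block_exclusive, requests_per_second)
Server = Tuple[int, int, float]


def block_throughputs(num_blocks: int, servers: List[Server]) -> List[float]:
    """Sum the throughput of every server that covers each block."""
    totals = [0.0] * num_blocks
    for start, end, rps in servers:
        for block in range(start, end):
            totals[block] += rps
    return totals


def choose_span(num_blocks: int, servers: List[Server],
                capacity: int, rps: float) -> Tuple[int, int]:
    """Pick the contiguous span for a joining server that yields the
    highest bottleneck (minimum per-block) throughput.

    Assumption for this sketch: ties go to the lowest start index.
    """
    if not 0 < capacity <= num_blocks:
        raise ValueError("capacity must be between 1 and num_blocks")
    totals = block_throughputs(num_blocks, servers)
    best_start, best_bottleneck = 0, -1.0
    for start in range(num_blocks - capacity + 1):
        trial = list(totals)
        for block in range(start, start + capacity):
            trial[block] += rps
        bottleneck = min(trial)
        if bottleneck > best_bottleneck:
            best_start, best_bottleneck = start, bottleneck
    return best_start, best_start + capacity


# Blocks 0-3 are served at 15 rps in total, blocks 4-7 at only 10 rps.
swarm = [(0, 4, 10.0), (4, 8, 10.0), (0, 4, 5.0)]
new_span = choose_span(num_blocks=8, servers=swarm, capacity=4, rps=5.0)
print(new_span)
```

In this toy swarm the joining server is placed on the under-served second half of the model, which lifts the bottleneck from 10 to 15 requests per second. A real system would also have to handle servers leaving, which this sketch ignores.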

Community

MichaelBarryUK

Dec 14, 2023

edited Dec 14, 2023

This + https://vast.ai seems like a match made in heaven

You're basically just paying for electricity

https://github.com/bigscience-workshop/petals

dolo650

Dec 18, 2023

The idea of BitTorrent-style inferencing on shared GPUs hosted across the globe is absolutely marvellous. As we see more and more people joining the swarm, it's going to get better and better. Thank you.

kamil-Vapor

Dec 18, 2023

Really promising work! I would like to see some additional testing done around unreliable network conditions and temperature/humidity/pressure/power sensitivity to really gauge the client<>server architecture’s resilience and fault tolerance.

I second the other poster’s sentiment on BT-style decentralization of inference! There are absolutely speed concerns with this approach, but in a world that is very quickly adopting 5G, being able to use wireless links for parallelized inference is exciting.