How can I run the model with multiple GPUs?

by eeyrw

For example, across 8 × 12 GB GPUs?

OpenAssistant org

You can use text-generation-inference to shard the model across multiple GPUs:

model=OpenAssistant/oasst-sft-1-pythia-12b
num_shard=4  # number of GPUs to shard the model across
volume=$PWD/data  # share a volume with the Docker container to avoid re-downloading the weights on every run

docker run --gpus all --shm-size 1g -p 8080:80 -v "$volume:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id "$model" --num-shard "$num_shard"
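
Once the container is up, you can sanity-check the server by calling its /generate REST endpoint directly. This is a minimal sketch using the requests library, assuming the server is listening on localhost:8080 as in the command above:

import requests

# Ask the text-generation-inference server for a completion via its REST API
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "<|prompter|>What's the Earth's total population?<|endoftext|><|assistant|>",
        "parameters": {"max_new_tokens": 64},
    },
)
print(resp.json()["generated_text"])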

Then you can query the model with the text-generation Python client:

from text_generation import Client

# Connect to the text-generation-inference server started above
client = Client("http://localhost:8080")

# Prompts follow the OpenAssistant format: <|prompter|>...<|endoftext|><|assistant|>
response = client.generate("<|prompter|>What's the Earth's total population?<|endoftext|><|assistant|>")
print(response.generated_text)
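
For longer outputs, the same client can also stream tokens as they are generated instead of waiting for the full completion. A minimal sketch using generate_stream, assuming the same server and prompt format as above:

from text_generation import Client

client = Client("http://localhost:8080")

# Stream tokens one at a time, skipping special tokens such as <|endoftext|>
text = ""
for response in client.generate_stream(
    "<|prompter|>What's the Earth's total population?<|endoftext|><|assistant|>",
    max_new_tokens=128,
):
    if not response.token.special:
        text += response.token.text
print(text)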
