Instructions for running on runpod.io

I successfully managed to run the 2.75bpw branch on 64 GB of VRAM with 4× RTX A4000 (16 GB per GPU).

Here are some key points:

  1. The template I'm using:
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
  2. Download and install exllamav2 (inside Jupyter):
!git clone https://github.com/turboderp/exllamav2
%cd exllamav2
# Optionally, create and activate a new conda environment
!pip install -r requirements.txt
!pip install .
!pip install huggingface_hub
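
Before pulling the weights, it's worth a quick check that all four cards are visible to PyTorch (plain PyTorch, nothing exllamav2-specific):

import torch
print(torch.cuda.device_count())          # should print 4 on this pod
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))  # expect an RTX A4000 on each line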
  3. Download the model:
!huggingface-cli download turboderp/dbrx-instruct-exl2 --revision "2.75bpw" --local-dir dbrx_275 --exclude "*.safetensors"

%cd dbrx_275

!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00001-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00002-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00003-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00004-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00005-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00006-of-00006.safetensors"

%cd ..
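
A quick sanity check that all six shards downloaded completely (compare the sizes against the file listing on the repo page — wget can leave truncated files behind if a connection drops):

!ls -lh dbrx_275/*.safetensors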

Note: The reason I am not using huggingface-cli download for the safetensors is that RunPod first stages the download in the container disk, which is limited to 20 GB; wget writes the shards straight into the working directory on the volume.
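
An alternative I haven't verified on this pod: point the Hugging Face cache at the volume before downloading, so huggingface-cli can fetch the safetensors too without touching the 20 GB container disk. /workspace is the usual volume mount on RunPod templates; adjust if yours differs.

import os
os.environ["HF_HOME"] = "/workspace/hf_cache"  # assumes the volume is mounted at /workspace
!huggingface-cli download turboderp/dbrx-instruct-exl2 --revision "2.75bpw" --local-dir dbrx_275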

  4. Run exllamav2 in a terminal (working directory: exllamav2):
python examples/chat.py -mode chatml -m dbrx_275 --gpu_split auto
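
If you'd rather script inference than use the interactive chat, the same model loads through exllamav2's Python API with an automatic split across the four cards. A minimal sketch based on exllamav2's bundled examples; the ChatML prompt, max_seq_len, and sampling values are my own assumptions, not from chat.py:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "dbrx_275"
config.prepare()
config.max_seq_len = 4096                 # assumption: shrink the KV cache to help fit 64 GB

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache is needed for autosplit loading
model.load_autosplit(cache)               # the API equivalent of --gpu_split auto

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7                # assumed value, tune to taste

# DBRX-instruct expects ChatML-formatted prompts (hence -mode chatml above)
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          "<|im_start|>user\nWhat is RunPod?<|im_end|>\n"
          "<|im_start|>assistant\n")

print(generator.generate_simple(prompt, settings, num_tokens=200))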