Databricks Cluster Settings

#20
by pho410 - opened

What are the optimal settings for a Databricks cluster when running this model with weights loaded from Hugging Face? I'm currently using a Shared Compute cluster on 12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12) with a g4dn.8xlarge instance, but the model appears to be running on CPU (10+ minutes on average per completion). I see that p4d.24xlarge is recommended for training, but I assume that doesn't apply if I'm not training the model myself. I'm using the code directly from the repo:

from transformers import pipeline

instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", trust_remote_code=True, device_map="auto")

Any help would be appreciated!

Databricks org

Which model, the 12B? g4dn instances have a single 16 GB T4 GPU, so the 12B model doesn't fit and you're running mostly on the CPU.
You can get it to run on the T4 if you load it in 8-bit instead.
You can also just use a smaller model, like the 6.9B-parameter version of Dolly 2.

I recommend A10 instances (g5). The 12B model works in 8-bit, and the smaller models load just fine. Note you should add torch_dtype=torch.bfloat16 for A10s or A100s.
A100s certainly work.
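Putting that advice together, a sketch of both loading paths (assuming the transformers pipeline API from the question; the 8-bit path additionally assumes the bitsandbytes package is installed on the cluster):

```python
import torch
from transformers import pipeline

# On an A10 (g5) or A100: load in bfloat16 so the model fits on the GPU.
instruct_pipeline = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# On a 16 GB T4 (g4dn): quantize to 8-bit instead (requires bitsandbytes).
# instruct_pipeline = pipeline(
#     model="databricks/dolly-v2-12b",
#     trust_remote_code=True,
#     device_map="auto",
#     model_kwargs={"load_in_8bit": True},
# )
```

With device_map="auto", accelerate places the model on the available GPU(s) automatically; if it ends up on CPU, that's a sign the weights didn't fit in GPU memory.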

srowen changed discussion status to closed
