It takes time to generate a response

#23
by KiranAli - opened

I'm running my model on an NVIDIA Tesla V100 16 GB GPU. It takes more than a minute to generate a response.

Databricks org

Please see the many other threads here with ideas. What size model are you running? That GPU is too small for a 12B-param model. Use 8-bit loading, a smaller model, or a larger GPU.
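As a rough sketch of what 8-bit loading looks like (assuming the databricks/dolly-v2-7b checkpoint here for illustration, and that the bitsandbytes and accelerate packages are installed; substitute whichever checkpoint you are actually running):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 8-bit quantization roughly halves weight memory versus 16-bit,
# at a small cost in quality and speed. Requires bitsandbytes.
model_name = "databricks/dolly-v2-7b"  # pick a checkpoint that fits your GPU

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place layers on available devices
    load_in_8bit=True,   # quantize weights to int8 on load
)
```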

Now I'm running it on two V100 16 GB GPUs and get the following error:

return self.cos_cached[:seq_len, ...].to(x.device), self.sin_cached[:seq_len, ...].to(x.device)
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The memory of both GPUs quickly fills to 16 GB.

Databricks org

That's a hardware error, it seems.
Also, it's still not clear what model you're using or what code you're calling.

I'm a newbie to this field, so I apologize if my questions are pretty basic. I'm using Dolly 7B and have two 16 GB GPUs. Can I make use of both GPUs to deploy the 7B model? Only Dolly 3B performs well on a single 16 GB GPU.

Databricks org

@KiranAli you might be able to. If you follow the examples in the model card where device_map is set to "auto", the model should be spread across both GPUs. The model card also suggests loading in bfloat16, which saves memory. You can additionally try load_in_8bit to reduce memory further. See https://github.com/databrickslabs/dolly#a10-gpus for more instructions.
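Roughly, following the model card (a minimal sketch assuming the databricks/dolly-v2-7b checkpoint and reasonably recent transformers/accelerate versions):

```python
import torch
from transformers import pipeline

# device_map="auto" lets accelerate spread the layers across both 16 GB GPUs;
# bfloat16 halves memory versus float32.
generate_text = pipeline(
    model="databricks/dolly-v2-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Dolly ships a custom instruction-following pipeline
    device_map="auto",
)

print(generate_text("Explain to me the difference between nuclear fission and fusion."))
```

If that still runs out of memory, passing model_kwargs={"load_in_8bit": True} to the pipeline should reduce memory further, provided bitsandbytes is installed.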

srowen changed discussion status to closed
