Execution time, help

#7
by Judklp - opened

Hello,

I am on the Dolly journey.

I started testing this on a MacBook and got to 1 hour 40 minutes per response.
Then I moved to a dedicated server with a random GPU and 8 cores ... 20 minutes.
Now I am on Google Cloud with an NVIDIA T4, 16 vCPUs, and 60 GB of memory ... I got it down to 4 minutes per answer.

My goal is to get to a few seconds. What would be your recommendations? I read that I need an NVIDIA A100, but I can't provision one on Google Cloud (not a single one available) nor on AWS (they decline the request every time).

Thanks for your help!

Databricks org

Did you read https://github.com/databrickslabs/dolly#generating-on-other-instances ? It depends on your input, output, and settings, but 5-15 seconds on an A10 is quite feasible. Use the 3B model for faster responses.

srowen changed discussion status to closed
