Rough estimates for text generation?

#24
by hf2477565

Hi there,

I'm new to transformers, torch, and basically any ML development from the last decade, and I'm trying to get back into it.

I've set up a Jupyter notebook with torch and CUDA enabled, and I have an RTX 2080 (8GB). I'm not expecting blistering performance, but should that be enough to build a pipeline from a pretrained model and get answers in, say, less than 10 minutes?

This code runs without error in about 8 minutes or so:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from instruct_pipeline import InstructionTextGenerationPipeline  # instruct_pipeline.py from the Dolly repo

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", offload_folder="offload", torch_dtype=torch.bfloat16, device_map="auto", load_in_8bit=True)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

but

generate_text("tell a short story")

just seems to hang.

I thought the pipeline inference would be relatively quick compared to loading the model. Are my expectations wrong?

Databricks org

device_map='auto' is causing a lot of confusion. You don't have nearly enough GPU RAM to load the model, so most of it ends up on the CPU; it works, but very slowly. Maybe we should just set the example to force CUDA 0 so it fails explicitly if it doesn't fit.

For 16GB GPUs you can get it to load in 8-bit. For 8GB it won't work. Use the 2.7B model?
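
For example, here's a minimal sketch of loading a smaller checkpoint in 8-bit on an 8GB card (assuming the smaller Dolly checkpoint databricks/dolly-v2-3b, and that accelerate and bitsandbytes are installed); the hf_device_map printout also shows why the 12B model crawls, since most of its layers end up on "cpu":

from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the smaller Dolly checkpoint, databricks/dolly-v2-3b
model_name = "databricks/dolly-v2-3b"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,  # needs bitsandbytes; 8-bit weights of a ~3B model fit comfortably in 8GB
)

# Show where accelerate placed each module; if most entries say "cpu",
# generation will be slow no matter what the GPU is.
print(model.hf_device_map)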

Databricks org

To answer your question: generation should take something like 10-20 seconds on an A10.

srowen changed discussion status to closed

Hi @srowen , sorry to follow up on a closed discussion, but I'm wondering how to specify the device_map argument to force CUDA 0 and fail explicitly, as you suggested?

Databricks org

Just set device="cuda:0" then; you don't need accelerate to figure out a device mapping in that case.
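
For reference, a minimal sketch of what that could look like (assuming the InstructionTextGenerationPipeline import from the Dolly repo's instruct_pipeline.py, and a transformers version whose pipelines accept a device string):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from instruct_pipeline import InstructionTextGenerationPipeline  # instruct_pipeline.py from the Dolly repo

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")

# No device_map="auto" or offloading: load in bfloat16 and pin everything to GPU 0.
# If the model doesn't fit in VRAM, this fails explicitly with a CUDA out-of-memory
# error instead of silently spilling layers onto the CPU.
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", torch_dtype=torch.bfloat16)
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")

generate_text("tell a short story")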

Thank you! That's clear and works like a charm.