improve dolly serving time

#7
by xy-covey - opened

I'm experimenting with Dolly with the aim of using it in production. In my experiment, I provided a roughly 400-word resume and asked questions like how many years this candidate has worked at company ABC. It takes Dolly about half an hour to produce a response. I'm using a 64GB GPU. What can I do here to bring the serving time down to a few seconds?

Databricks org

You aren't using a 64GB GPU :) What GPU are you using? It's hard to say without a lot more information about how you are using it. For best results, use an A100. It can also be used on an A10 in 8-bit. I'm seeing 10-20 seconds per response even on the latter.
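For reference, a minimal sketch of an 8-bit load on a 24GB card like the A10 (this is illustrative, not a tested recipe; it assumes `bitsandbytes` and `accelerate` are installed alongside `transformers`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative 8-bit load: int8 weights bring the 12B model down to
# roughly 12-13GB, which fits a 24GB A10.
# Assumes bitsandbytes and accelerate are installed.
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-12b",
    load_in_8bit=True,
    device_map="auto",
)
```

Note that 8-bit inference trades some speed for memory; on an A100 you can load in float16 instead and skip `load_in_8bit`.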

Typo, I was using a CPU. OK, let me try an A100.

Hi @xy-covey, can you provide the code snippet you used for supplying the text and asking the question, if possible?

Hey,
You can use text-generation-inference to serve it efficiently with Flash Attention and other nice features like dynamic batching, sharding, etc.

# To run on one GPU
docker run --gpus all -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id databricks/dolly-v2-12b

# To shard across 2 GPUs
docker run --gpus all -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id databricks/dolly-v2-12b --num-shard 2

Then query it from Python:

from text_generation import Client

client = Client("http://localhost:8080")

print(client.generate("Hello!").generated_text)
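The client can also stream tokens as they are generated, which makes long responses feel much faster. A sketch, assuming a TGI server is already running on localhost:8080:

```python
from text_generation import Client

# Stream tokens one at a time from a running text-generation-inference server.
client = Client("http://localhost:8080", timeout=60)
text = ""
for response in client.generate_stream("Hello!", max_new_tokens=64):
    if not response.token.special:
        text += response.token.text
print(text)
```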

I'm using an RTX 3090 (24GB) and loading the model in 8-bit mode. Model inference takes about 110 seconds. The resource load during inference is shown in the figure below, and the results are attached.

import time
import torch
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained(model_dir)
model = GPTNeoXForCausalLM.from_pretrained(model_dir,
                                           load_in_8bit=True,
                                           device_map='auto',
                                           torch_dtype=torch.float16,
                                           low_cpu_mem_usage=True,
                                           )
inputs = tokenizer(prompt, return_tensors='pt')
start_time = time.time()
output_ids = model.generate(inputs.input_ids.to(model.device),
                            do_sample=True,
                            temperature=0.8,
                            max_length=512,
                            top_p=0.95,
                            )
delta = time.time() - start_time
print(f"Time to inference is {delta}")
results = tokenizer.batch_decode(output_ids,
                                 skip_special_tokens=True,
                                 clean_up_tokenization_spaces=False)[0]
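Part of that 110 seconds is simply the number of tokens being generated: with `max_length=512`, sampling can run for several hundred new tokens beyond the prompt. A quick tokens-per-second calculation (pure Python; the prompt and total token counts below are hypothetical) puts the figure in perspective:

```python
def tokens_per_second(n_prompt_tokens: int, n_total_tokens: int, seconds: float) -> float:
    """Throughput over newly generated tokens only (the prompt is excluded)."""
    return (n_total_tokens - n_prompt_tokens) / seconds

# Hypothetical example: a 100-token prompt that grew to 512 tokens in 110s.
print(round(tokens_per_second(100, 512, 110.0), 2))  # → 3.75
```

At a few tokens per second, capping generation with `max_new_tokens` at a small value (the answer here is only a few words) should cut the wall-clock time roughly in proportion.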

[Figure: GPU resource load during inference]

Who is Dalai?
 Baba Ram Dass: Who am I? That is a good question. I am not sure who I am. Some people think that I am a Buddhist, although I am not. I am not even certain what that is. I do know that I have a great curiosity about who I am, where I am from, and where am I going. I don't know where I came from, but I know where I'm going. So who am I?


[Note: In the early 1970's, Ram Dass was given the name Richard Alpert.  He was an American psychologist who was part of a team of researchers who conducted experiments on consciousness through the use of mind-altering drugs.  Through these experiments, Alpert discovered something called "set and setting".  According to Alpert, our consciousness is affected not by what we are consciously aware of, but rather by the context of our awareness.  In other words, our beliefs and expectations influence our experience of the world around us.  Through his experiments with psychedelic drugs, Alpert came to believe that the source of these beliefs and expectations was our spiritual consciousness, or soul.  Through the use of certain chants and prayers, Alpert was able to "re-channel" his soul, leading him to abandon his work with the Harvard Psilocybin Project and to change his name to Ram Dass.]


Dalai Lama:
 Who am I?
 Baba Ram Dass:
 That is a good question. I am not sure who I am. Some people think that I am a Buddhist, although I am not. I am not even certain what that is. I do know that I have a great curiosity about who I am, where I am from, and where am I going. I don't know where I came from, but I know where I'm going. So who am I?


[Note: The 14th Dalai Lama, Tenzin Gyatso, was a Tibetan Buddhist religious and political leader who was forced into exile following a failed 1959 Tibetan uprising against Chinese rule.  His Holiness was nominated for the Nobel Peace Prize five times between 1974 and 1990, and was awarded the Congressional Gold Medal in 2001.  Since 1959, the Dalai Lama has lived in northern India, where he has worked to preserve and promote the Tibetan language and culture.  His main teaching is that a stable and just society is based on individual ethics
srowen changed discussion status to closed
