Why is it so slow, even on GPU?

#3
by luigisaetta - opened

I'm comparing Cerbero 7B to Llama 2 7B.
Even though I'm using a VM with 2 A10 GPUs, Cerbero is surprisingly slow. I have checked with gpustat that it is running on the GPU... but getting an answer takes minutes.
Is it a fine-tune of Mistral 7B, or are there architectural changes that require much more computing power? (I know it was trained on 8 A100 GPUs... a big shape.)
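
A quick in-process double check, beyond gpustat (just a sketch, assuming a plain transformers + torch setup; device_map="auto" needs accelerate installed):

import torch
from transformers import AutoModelForCausalLM

# Load the model and check where the weights actually live and in which dtype
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", device_map="auto")

print(next(model.parameters()).device)            # expect cuda:0 (or cuda:1)
print(model.dtype)                                # float32 unless the fp16 variant is loaded
print(torch.cuda.memory_allocated() / 1e9, "GB allocated on the current GPU")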

That's weird, it should have almost the same performance as llama2-7b. Can you share some details of your setup?

Test with max_new_tokens=128: wall time 1 min 54 s.

I'm using a VM with 2 A10 GPUs (24 + 24 GB of GPU memory)... with the same settings, Llama 2 (running locally) usually answers in about 5 seconds.
The chain is built using LlamaIndex.
With gpustat I have checked that everything runs on the GPU.
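
The call being timed is roughly of this shape (a sketch, not my exact notebook code; the prompt is a placeholder, in the real chain it is the RAG prompt built by LlamaIndex):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", device_map="auto")

# Placeholder Italian question
inputs = tokenizer("Qual è la capitale d'Italia?", return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(f"generated {output.shape[-1] - inputs['input_ids'].shape[-1]} tokens in {time.time() - start:.1f}s")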

I can give you more context on what I'm doing: I'm evaluating several LLMs with a RAG approach, on Italian.
I have tested:

  • Cohere
  • Llama 2 7B and 13B
  • Mistral 7B
  • now testing Cerbero

To build the RAG chain I'm using LlamaIndex. For the evaluation, TruLens-Eval.
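
Roughly, the chain has this shape (a sketch against the pre-0.10 LlamaIndex API; the document path, context window and generation settings are placeholders, not my real configuration):

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import HuggingFaceLLM

# Local HF model wrapped for LlamaIndex; the kwargs here are illustrative
llm = HuggingFaceLLM(
    model_name="galatolo/cerbero-7b",
    tokenizer_name="galatolo/cerbero-7b",
    context_window=4096,
    max_new_tokens=128,
    device_map="auto",
)

# Local embeddings plus the LLM above; "./books" is a placeholder for the English source texts
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("./books").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("Qual è l'argomento principale del libro?"))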

The biggest problem I see is always code-switching (question in Italian, answer in English, in roughly 20% of the cases).

Cerbero seems a little better... but it is really slow (roughly 10x the latency).

It also puzzles me that, compared with Llama 2 7B and Mistral 7B, this model takes so much GPU memory.

Try loading the float16 variant; the default one is float32:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")

Check model.dtype: it should be float16.
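
Something along these lines makes the difference visible (a sketch; torch_dtype and device_map are extra arguments on top of the line above, and get_memory_footprint is a standard transformers helper):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b",
    revision="float16",
    torch_dtype=torch.float16,  # keep the fp16 weights instead of upcasting to the fp32 default
    device_map="auto",          # spread the weights across the two A10s
)

print(model.dtype)                               # expect torch.float16
print(model.get_memory_footprint() / 1e9, "GB")  # roughly 14 GB in fp16 vs 28 GB in fp32 for a 7B model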

That is exactly what I have already done. No luck... it still takes minutes.

I suspect that Cerbero spends a lot of effort processing long inputs. I'm using a RAG approach that takes context from some English books, so the prompt to the model is always several hundred tokens long (sometimes even thousands). Have you tested it with long inputs?
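
A quick way to check how long those prompts really are (a sketch; the prompt string is a placeholder for what LlamaIndex actually builds):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")
rag_prompt = "Context: ...retrieved passages from the English books...\nDomanda: ..."  # placeholder
print(len(tokenizer(rag_prompt)["input_ids"]), "prompt tokens")
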
Well, on the 8-GPU box described in the docs it will be fast.

Next week I will run some inference benchmarks on smaller GPUs (I also have some A30s and a T4) and we will get to the bottom of this. In the meantime, maybe you can try the llama.cpp version. llama.cpp has a lot of optimizations and is faster than transformers (despite the name, you can use the GPU with llama.cpp).
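
For example, something like this with llama-cpp-python (a sketch; the GGUF file name is a placeholder for whichever quantised cerbero build you download):

from llama_cpp import Llama

llm = Llama(
    model_path="./cerbero-7b.Q4_K_M.gguf",  # placeholder path to the downloaded GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU (needs a CUDA build of llama-cpp-python)
    n_ctx=4096,
)

out = llm("Qual è la capitale d'Italia?", max_tokens=128)
print(out["choices"][0]["text"])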

galatolo changed discussion status to closed
