What is the inference time? On my Apple M1 Max completions take > 6 min

#15
by vedtam - opened

Hi,

I'm wondering what others have to say about inference time. Am I doing something wrong, or is it normal to see inference times of more than 6 minutes on average for single-line prompts?

Are you quantizing at all? Could you share the code you are using?

I'm using the sample code from the model's page here on HF, the first block using the transformers library (copy/paste).

I'm not quantizing.
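For reference, it's essentially the snippet below (a minimal sketch from memory; the exact prompt and generation arguments on the model card may differ):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Standard transformers generation path; loads the model in full
# precision on the CPU unless you move it to an accelerator yourself.
tokenizer = AutoTokenizer.from_pretrained("cerebras/btlm-3b-8k-base")
model = AutoModelForCausalLM.from_pretrained(
    "cerebras/btlm-3b-8k-base",
    trust_remote_code=True,  # BTLM ships a custom model class
)

inputs = tokenizer("A single-line prompt", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```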

I'm getting the same experience from just running the sample code as well. Seems like a cool model, so I'd love for this to work, but completion times of a few minutes for basic prompts aren't gonna cut it :(

I've posted the same question on the Cerebras Discord too. It was ignored, which leads me to assume that you either need quantization, or the M1 chip is not supported and inference runs purely on the CPU instead of using GPU acceleration.

Hopefully it's the latter, because if you need to quantize the model, that defeats the purpose of running a 3B model; I can already run quantized 7B models just fine on my local hardware.
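If it is the CPU, a quick way to check is whether PyTorch sees the Metal (MPS) backend at all and whether the model was ever moved there. A minimal sketch, assuming a recent PyTorch build (whether BTLM's custom model code runs cleanly on MPS is a separate question):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# The stock snippet keeps everything on the CPU; transformers does not
# pick the Apple GPU (MPS) automatically.
print("MPS available:", torch.backends.mps.is_available())
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("cerebras/btlm-3b-8k-base")
model = AutoModelForCausalLM.from_pretrained(
    "cerebras/btlm-3b-8k-base",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # halves memory traffic vs. fp32
).to(device)  # without this, generation runs entirely on the CPU

inputs = tokenizer("A single-line prompt", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```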

Can confirm inference is extremely slow for me too (Pascal card, 8 GB) compared to quantized llama2-hf-7b, using the example inference code.

Hi @vedtam, I am able to generate in a few seconds on a T4 GPU in Colab using load_in_8bit=True (which requires installing accelerate and bitsandbytes) and the default Hugging Face transformers model.generate().
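Roughly, that setup looks like the sketch below (the prompt and generation arguments are placeholders, not a confirmed configuration):

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    load_in_8bit=True,   # int8 weights via bitsandbytes (needs a CUDA GPU)
    device_map="auto",   # let accelerate place the model on the GPU
)

inputs = tokenizer("A single-line prompt", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```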

As I understand it, the default transformers library isn't great for inference latency compared to redpajama.cpp, vLLM, etc. (https://hamel.dev/notes/llm/03_inference.html).

We're also looking into integrating with popular quantization and inference tools.

@rskuzma thanks for the update. I'm looking forward to having other ways to run the model for benchmarking; I'm keen to see how it runs on consumer hardware. As for the T4 GPU, that's significantly more powerful than my Apple M1, not to mention the ~4 GB mobile devices that the announcement of this model emphasizes.

Btw, that's a useful link, thanks for sharing!

Cerebras org

@vedtam thanks for your interest in BTLM! It sounds like the inference implementation you are using is not well optimized. To get rough throughput estimates for BTLM, I used redpajama.cpp to collect throughputs for RedPajama-INCITE-Base-3B-v1, which has a nearly identical inference cost to the BTLM-3B-8K model, and llama.cpp to collect throughput numbers for LLaMA 7B.
[Figure: throughput comparison for RedPajama-INCITE-Base-3B-v1 (via redpajama.cpp) and LLaMA 7B (via llama.cpp)]

We are working with maintainers of popular inference libraries to support BTLM-3B-8K but you will have to stay tuned for that!

@ndey96 thanks for the numbers and the update, really appreciated! I'll keep an eye out for when BTLM gets picked up by the other libraries.
