What is the inference time? On my Apple M1 Max completions take > 6 min
Hi,
I'm wondering what others are seeing for inference time: am I doing something wrong, or is it normal for completions to take more than 6 minutes on average for single-line prompts?
Are you quantizing at all? Could you share the code you are using?
I'm using the sample code from the model's page here on HF, the first block using the transformers library (copy/paste).
I'm not quantizing.
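For reference, this is roughly the snippet I'm running, reconstructed from memory of the model card (the exact dtype and generation arguments there may differ), with a simple wall-clock timing around generate():

```python
# Rough reconstruction of the transformers snippet from the model card
# (exact arguments there may differ), plus simple wall-clock timing.
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # no quantization
    trust_remote_code=True,      # BTLM ships custom model code
)

prompt = "A single line prompt"
inputs = tokenizer(prompt, return_tensors="pt")

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=50)
elapsed = time.time() - start

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed:.1f} s for {new_tokens} new tokens")
```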
I'm getting the same experience just running the sample code as well. Seems like a cool model, so I'd love for this to work, but completion times of several minutes for basic prompts aren't gonna cut it :(
I've posted the same question on the Cerebras Discord too. It was ignored, which leads me to assume that either you need quantization, or the M1 chip is not supported and inference runs purely on the CPU instead of using GPU acceleration.
Hopefully it's the latter, because if you need to quantize the model it defeats the purpose of running a 3B model; I can already run quantized 7B models just fine on my local hardware.
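One quick way to check the CPU-vs-GPU theory on Apple Silicon is to see whether PyTorch's Metal (MPS) backend is available and try moving the model and inputs there. Whether BTLM's custom model code actually runs correctly on MPS is something I haven't verified:

```python
# Quick check whether PyTorch can use the M1 GPU via the Metal (MPS) backend.
# Whether BTLM's custom model code runs correctly on MPS is not verified here.
import torch

print("MPS available:", torch.backends.mps.is_available())
print("MPS built:", torch.backends.mps.is_built())

device = "mps" if torch.backends.mps.is_available() else "cpu"

# If MPS is available, both the model and the inputs need to be moved there,
# otherwise everything silently runs on the CPU:
# model = model.to(device)
# inputs = {k: v.to(device) for k, v in inputs.items()}
```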
Can confirm inference is extremely slow for me too (Pascal card, 8 GB) compared to quantized llama2-hf-7b.
I'm using the example inference code.
Hi @vedtam, I am able to generate using a T4 GPU in Colab in a few seconds using load_in_8bit=True (which requires installing accelerate and bitsandbytes) and the default Hugging Face transformers model.generate().
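Roughly the setup I used, as a sketch (the generation arguments here are just examples):

```python
# Sketch of 8-bit loading on a CUDA GPU (requires `pip install accelerate bitsandbytes`).
# Generation arguments are just examples.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights
    device_map="auto",       # let accelerate place layers on the GPU
    trust_remote_code=True,
)

inputs = tokenizer("BTLM is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```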
As I understand it, the default transformers library isn't great for inference latency compared to redpajama.cpp, vLLM, etc. (https://hamel.dev/notes/llm/03_inference.html).
We're also looking into integrating with popular quantization and inference tools.
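For comparison, the vLLM interface mentioned in that link looks roughly like the snippet below. This assumes a model vLLM already supports; BTLM support in these libraries is still in progress, so the model id here is only a placeholder.

```python
# Sketch of the vLLM API for comparison. Assumes a model vLLM already supports;
# BTLM support in these libraries is still in progress, so the model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="openlm-research/open_llama_3b")  # placeholder model id
params = SamplingParams(max_tokens=50, temperature=0.8)

outputs = llm.generate(["A single line prompt"], params)
print(outputs[0].outputs[0].text)
```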
@rskuzma thanks for the update. I'm looking forward to having other ways of running the model for benchmarking; I'm keen to see how it runs on consumer hardware. As for the T4 GPU, that's significantly more powerful than my Apple M1, not to mention the ~4 GB mobile devices that the announcement of this model emphasizes.
Btw, that's a useful link, thanks for sharing!
@vedtam thanks for your interest in BTLM! It sounds like the inference implementation you are using is not well optimized. To get rough estimates of throughput for BTLM, I used redpajama.cpp to collect throughput numbers for RedPajama-INCITE-Base-3B-v1, which has a nearly identical cost to the BTLM-3B-8K model. I used llama.cpp to collect throughput numbers for LLaMA 7B.
We are working with maintainers of popular inference libraries to support BTLM-3B-8K, but you will have to stay tuned for that!