Minimum supported device?

#15
by sachinmyneni - opened

Hello,
This is my first foray into downloading and attempting to run an LLM locally.
I am trying this on a late 2013 Mac Pro with a 12-core Intel Xeon E5 and an AMD FirePro D300 with 2 GB of GPU memory.
Is this a "sufficient" system to run inference? After making a seemingly logical 'downgrade', I no longer get the error, but I don't get any results either.

When I ran the example, I got this error:

TypeError: BFloat16 is not supported on MPS

So I changed the pipeline call to:

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16, device_map="auto")

I reran it and got a few warnings, but no errors:

position_ids = attention_mask.long().cumsum(-1) - 1
/Users/sachin/ai/llamaindex-openllm/llamaindex-openllm/lib/python3.10/site-packages/transformers/generation/logits_process.py:509: UserWarning: torch.topk support for k>16 by MPS on MacOS 13+, please upgrade (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Shape.mm:71.)
indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
/Users/sachin/ai/llamaindex-openllm/llamaindex-openllm/lib/python3.10/site-packages/transformers/generation/logits_process.py:447: UserWarning: torch.sort is supported by MPS on MacOS 13+, please upgrade. Falling back to CPU (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Sort.mm:41.)
sorted_logits, sorted_indices = torch.sort(scores, descending=False)
/Users/sachin/ai/llamaindex-openllm/llamaindex-openllm/lib/python3.10/site-packages/transformers/generation/utils.py:2920: UserWarning: MPS: no support for int64 for min_max, downcasting to a smaller data type (int32/float32). Native support for int64 has been added in macOS 13.3. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/ReduceOps.mm:621.)
if unfinished_sequences.max() == 0:

But I get nothing in the output beyond the prompt itself, and no errors beyond the warnings above:

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
<|system|>
You are a friendly chatbot who always responds in the style of a pirate
<|user|>
How many helicopters can a human eat in one sitting?
<|assistant|>
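
For completeness, here is roughly the full script (the prompt is built with the tokenizer's chat template, as in the model card example). The only changes from the card are float16 and return_full_text=False, which strips the prompt from the result so it is obvious whether any new tokens were generated:

import torch
from transformers import pipeline

# bfloat16 is not supported on MPS, so use float16 instead
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Format the conversation with the model's chat template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7,
               top_k=50, top_p=0.95, return_full_text=False)
# With return_full_text=False the prompt is stripped from the result, so an
# empty string here means no new tokens were generated at all
print(outputs[0]["generated_text"])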

"This is my first foray into downloading and attempting to run an LLM locally"
If this is true, I highly recommend trying out running GGUF quantized models from TheBloke: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF using a backend like llama.cpp. TheBloke also lists several other clients and libraries that support GGUF, most of which also support running LLM inference on a mac. I suggest this method because it should be a very simple and beginner-friendly intro to running LLMs locally, the main requirement in your case being the support for older hardware.

Also, you could check out discussions on the llama.cpp GitHub as well as the r/LocalLLaMA subreddit for posts like this one: https://www.reddit.com/r/LocalLLaMA/comments/16j29s3/old_comp_running_llm_i_got_llama27bchatq2_kgguf/
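
If you prefer to stay in Python, llama-cpp-python wraps llama.cpp and runs these GGUF files on CPU just fine. A minimal sketch (the exact GGUF filename and quantization level are just an example from that repo; adjust to whatever you download):

from llama_cpp import Llama

# Point this at a GGUF file downloaded from the repo above; the
# filename/quantization level (Q4_K_M) is just an example.
llm = Llama(
    model_path="./tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf",
    n_ctx=2048,
    chat_format="chatml",  # the v0.3 chat model uses a ChatML-style prompt; adjust if needed
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["message"]["content"])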

Thanks @JoeySalmons, that was very helpful. I was able to run llama-cpp-python with a few models successfully. I then tried text-generation-webui, but the installation was not as smooth as the one in the Reddit thread; I will keep at that. With llama-cpp-python, though, I have a good start.
I am now playing with different parameters and looking to see if I can train a model on my own data. llama-cpp-python does not seem to support that yet, at least according to this thread: https://github.com/abetlen/llama-cpp-python/issues/814
Oh, and I went ahead and got into Google Colab, since the free tier itself gives some GPU resources.
