Can I run it on CPU?

#28
by aljbali - opened

Yes, you can.

Check out llama.cpp or ChatLLM.cpp.
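
If you go the llama.cpp route, here is a minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python); the GGUF file name is a placeholder for whichever quantized conversion of the model you have on disk:

from llama_cpp import Llama

# Placeholder path: any GGUF conversion/quantization of Llama 3 8B Instruct.
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads; tune to your machine
)

out = llm("Q: Can I run Llama 3 on a CPU? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])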

Thank you! I was trying to run it on TGI, but I am getting the following error:

(base) compute:data hadra002$ docker run --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference --model-id $model --disable-custom-kernels
2024-04-19T20:40:42.937880Z INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B-Instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "4fe31fe89102", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-19T20:40:42.939590Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-04-19T20:40:43.309344Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-19T20:40:43.309463Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-19T20:40:43.309480Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-19T20:40:43.309492Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-19T20:40:43.309949Z INFO download: text_generation_launcher: Starting download process.
2024-04-19T20:40:52.029571Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-19T20:40:53.146755Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-19T20:40:53.148336Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-19T20:41:00.442091Z WARN text_generation_launcher: We're not using custom kernels.

2024-04-19T20:41:00.464385Z WARN text_generation_launcher: Could not import Flash Attention enabled models: CUDA is not available

2024-04-19T20:41:01.703605Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 16, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 29, in <module>
    from text_generation_server.utils import paged_attention, flash_attn
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/flash_attn.py", line 12, in <module>
    raise ImportError("CUDA is not available")
ImportError: CUDA is not available
rank=0
Error: ShardCannotStart
2024-04-19T20:41:01.809661Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-19T20:41:01.809775Z INFO text_generation_launcher: Shutting down shards
(base) compute:data hadra002$
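
Note for anyone hitting the same thing: the traceback bottoms out in ImportError: CUDA is not available, i.e. the TGI container cannot see a GPU, and (as this log shows) --disable-custom-kernels does not turn TGI into a CPU server. On a host that does have an NVIDIA GPU you would also need to expose it to the container, e.g. docker run --gpus all .... A quick sanity check, as a minimal sketch, is to ask PyTorch what it sees:

import torch

# True only if a CUDA device and a working driver are visible to this
# process; inside Docker this additionally requires the NVIDIA container
# runtime (e.g. passing --gpus all to docker run).
print(torch.cuda.is_available())
print(torch.cuda.device_count())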

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B"

# Build a text-generation pipeline; device_map="auto" lets accelerate place
# the weights across whatever devices are available (GPU, CPU, or disk).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# The pipeline returns a list of {"generated_text": ...} dicts.
print(pipeline("hi"))

Why does it crash and not give a response?
I ran it on Colab.
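
If this is the free Colab tier, a likely cause is memory rather than the code: 8B parameters in bfloat16 need roughly 16 GB for the weights alone, which exceeds the free tier's system RAM and is about the size of a T4's VRAM, so the runtime can get killed mid-load without a Python error. A back-of-envelope check:

# Rough estimate of weight memory only; the KV cache and framework
# overhead come on top of this.
params = 8e9           # parameter count implied by the model name
bytes_per_param = 2    # bfloat16
print(f"~{params * bytes_per_param / 1e9:.0f} GB")  # ~16 GB

If that is the problem here, a common workaround is loading the model quantized (e.g. 4-bit via bitsandbytes), which cuts the weight footprint by roughly 4x.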