Running llama-2-7b-chat locally

#52 opened by ohsa1122

Hi, I am using the llama-2-7b-chat online demo at https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat to run inference, and the accuracy I am getting is pretty good.

I am trying to achieve the same results locally, but I am unable to. I am using the following setup:

import torch
import transformers
from transformers import AutoTokenizer

model = "meta-llama/Llama-2-7b-chat-hf"

# load the tokenizer and build a text-generation pipeline in fp16 across available GPUs
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# generate a single sampled completion for the prompt
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1000,
)

The accuracy I am getting locally is way lower. My question is: what type of GPU is being used by the online demo? And what inputs are being used in the pipeline call?
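For reference, here is a minimal sketch of how I would wrap the prompt in the Llama-2 chat format before passing it to the pipeline above. I am assuming the demo Space applies something like this (the system prompt text, the helper name, and the example question below are my own placeholders, not what the Space actually uses):

SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder; the demo's real system prompt may differ

def build_llama2_chat_prompt(user_message, system_prompt=SYSTEM_PROMPT):
    # Llama-2 chat models were trained with the [INST] / <<SYS>> wrapping,
    # so raw unformatted text may produce noticeably worse answers.
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

# reuse the pipeline and tokenizer defined above
sequences = pipeline(
    build_llama2_chat_prompt("Explain what a transformer model is."),
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1000,
)

Is this the kind of input formatting the online demo uses, or does it also change the generation parameters?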
