Running llama-2-7b-chat locally

#52 opened by ohsa1122

Hi, I am using the llama-2-7b-chat online demo at https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat to run inference, and the accuracy I am getting is pretty good.

I am trying to achieve the same results locally, but I am unable to. I am using the following setup:

import torch
import transformers
from transformers import AutoTokenizer

model = "meta-llama/Llama-2-7b-chat-hf"

# load the tokenizer and build a text-generation pipeline in fp16 across available GPUs
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# generate a single sampled completion for the prompt
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1000,
)

The accuracy I am getting locally is way lower. My question is: what type of GPU is being used by the online demo? And what inputs are being used in the pipeline call?
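For reference, here is a minimal sketch of how I would wrap the prompt in the Llama-2 chat format before passing it to the pipeline above. I am assuming the demo Space applies something like this (the system prompt text, the helper name, and the example question below are my own placeholders, not what the Space actually uses):

SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder; the demo's real system prompt may differ

def build_llama2_chat_prompt(user_message, system_prompt=SYSTEM_PROMPT):
    # Llama-2 chat models were trained with the [INST] / <<SYS>> wrapping,
    # so raw unformatted text may produce noticeably worse answers.
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

# reuse the pipeline and tokenizer defined above
sequences = pipeline(
    build_llama2_chat_prompt("Explain what a transformer model is."),
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1000,
)

Is this the kind of input formatting the online demo uses, or does it also change the generation parameters?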
