Model responses are not good.

#12 opened by muneerhanif7

I have downloaded the repo TheBloke/Llama-2-70B-GPTQ locally using snapshot_download, and I am running it on an A100 80GB GPU:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/Llama-2-70B-GPTQ",local_dir="./Llama-2-70B-GPTQ")

Here is the code used for generation:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

import torch
torch.cuda.is_available()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def llama2_70b_model(job):
    job_input = job["input"]
    job_text = job_input['text']
    context = job_text['context']
    prompt = job_text['prompt']
    temperature = job_text['temperature']
    max_new_tokens = job_text['max_new_tokens']
    top_p = job_text['top_p']
    repetition_penalty = job_text['repetition_penalty']
    prompt_template = f'''[INST] <<SYS>>
{prompt}
<</SYS>>
{context} [/INST]'''

    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=temperature, max_new_tokens=max_new_tokens, top_p=top_p, repetition_penalty=repetition_penalty)

    outputs = tokenizer.decode(output[0])
    # output = outputs.split(":")[3]
    torch.cuda.empty_cache()

    return outputs

if __name__ == "__main__":
    model_name_or_path = "Llama-2-70B-GPTQ/"
    model_basename = "model"

    use_triton = False

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, model_basename=model_basename, use_safetensors=True, low_cpu_mem_usage=True, local_files_only=True, device_map='auto', use_triton=use_triton, quantize_config=None, inject_fused_attention=False)

    prompt = "Extract the names of all characters mentioned in the text."
    context = '''
In the novel 'Pride and Prejudice,' Elizabeth Bennet, Mr. Darcy, and Jane Bennet are prominent characters.:
'''
    job = {
        "input": {
            "text": {"prompt": prompt, "context": context, "temperature": 0.7, "max_new_tokens": 4020, "top_p": 5, "repetition_penalty": 1}
        }
    }
    print(llama2_70b_model(job))

Here is the output:

Prompt: Extract the names of all characters mentioned in the text. /n/ncontext:In the novel 'Pride and Prejudice,' Elizabeth Bennet, Mr. Darcy, and Jane Bennet are prominent characters.:
Prompts can be used to extract information from a document or set of documents that is not explicitly stated but implied by contextual clues such as character relationships (e.g., who knows whom), events occurring at certain times during narratives etc.. For example if we want our model trained on this prompt then it will learn how different people interact with each other based upon their actions within stories written about them; thus allowing us access into understanding human behavior better than ever before!

I have tried different parameters and values, but it gives very poor responses every time. Sometimes it repeats the same words or returns a blank output. I have also tried Llama 2 13B, and it performed much better than this. Does anyone have an idea of what the issue could be? Why is the model producing irrelevant, low-quality responses?

Most likely, the reason is that you are using a BASE model. To get proper responses to your instructions or questions, use the 70B Chat version.
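For reference, a minimal sketch of what that swap could look like, assuming the chat-tuned repo TheBloke/Llama-2-70B-chat-GPTQ with its default "model" basename (verify both against the model card); the loading and generation calls mirror the ones in the original post.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Assumed repo id for the chat-tuned GPTQ variant; check the model card
# for the exact basename/branch before downloading.
chat_repo = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(chat_repo)
model = AutoGPTQForCausalLM.from_quantized(chat_repo, model_basename="model", use_safetensors=True, device_map='auto', use_triton=False, quantize_config=None, inject_fused_attention=False)

# Llama-2-chat expects the [INST] ... <<SYS>> ... <</SYS>> format already used above.
system_prompt = "Extract the names of all characters mentioned in the text."
user_message = "In the novel 'Pride and Prejudice,' Elizabeth Bennet, Mr. Darcy, and Jane Bennet are prominent characters."
prompt_template = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# do_sample=True so temperature/top_p actually take effect; top_p kept within (0, 1].
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, top_p=0.95, max_new_tokens=512)
print(tokenizer.decode(output[0]))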
