How to run the model?

#13
by KIlian42 - opened

I am trying to run the model with:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF", model_file="Meta-Llama-3-70B-Instruct.Q4_K_M.gguf", model_type="llama")
answer = llm("This is a prompt")

but I get:
ERROR: byte not found in vocab: '\n'
/root/onstart.sh: line 1: 540 Segmentation fault (core dumped)

Any ideas how to run the models on a Linux machine with a GPU? Also, how do I load the fp16 model, which is split across 4 files?

Many thanks for your help. :-)

These are quantized models meant for CPUs (and GPUs), but via llama.cpp or any other library built on llama.cpp. If you want to use AutoModelForCausalLM, I suggest using the original model for GPU: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
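On the fp16 question: Transformers resolves sharded (multi-file) checkpoints automatically, so the four files need no special handling. A minimal sketch, assuming access to the gated meta-llama repo and enough GPU memory for 70B weights in 16-bit (roughly 140 GB across devices):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 16-bit weights; the checkpoint shards are merged automatically
    device_map="auto",           # spread layers across all available GPUs
)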

Thanks @MaziyarPanahi, could you please help me with the hardware requirements for running this model? Thanks!

You are welcome. The first question is: do you only have CPUs (RAM), or do you also have a GPU device? (It is possible to offload some layers to GPUs and keep the rest on the CPU.)

Many thanks for your response @MaziyarPanahi :-) I am quite new to Llama.cpp.

I would like to try your GGUF model using only the GPU. However, even if I configure "n_threads":0 and "n_gpu_layers":40, I do not see any GPU usage, and the response takes very long (because inference is handled by the CPU, I guess). Do you have any idea what I am doing wrong? Here is my code:

*************

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_name = "MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF"
model_file = "Meta-Llama-3-70B-Instruct.Q5_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

model_kwargs = {
    "n_ctx": 4096,       # Context length to use
    "n_threads": 0,      # Number of CPU threads to use
    "n_gpu_layers": 40,  # Number of model layers to offload to GPU. Set to 0 if only using CPU
}
generation_kwargs = {
    "max_tokens": 256,   # Maximum number of tokens to generate
}

llm = Llama(model_path=model_path, **model_kwargs)
res = llm("Hello, Llama3!", **generation_kwargs)
print(res["choices"][0]["text"])

*************

Another general question: for GPTQ, a calibration dataset is needed on which the quantization is optimized. Is this also the case for GGUF? And if so, which dataset did you use for it? :-)

Many thanks for any help in advance. πŸ™‚

Hi @KIlian42

You are welcome. It seems you are having trouble getting llama.cpp to work with your GPU. I can see you are using llama-cpp-python; it has a specific CUDA build, so I recommend following the steps in their GitHub repo to make sure it works with your GPU.
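To make that concrete: here is a minimal sketch for checking whether your llama-cpp-python build actually offloads to the GPU. With verbose enabled, llama.cpp prints its load-time log; if no CUDA device shows up there, the installed wheel was likely built CPU-only and needs to be reinstalled with the CUDA CMake flags from the llama-cpp-python README (the exact flag name varies by version).

from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",  # path returned by hf_hub_download
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 asks llama.cpp to offload all layers to the GPU
    verbose=True,     # print the load-time log; look for "offloaded X/Y layers to GPU"
)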

For GPTQ, I use the default wikitext2, if I remember correctly :)
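For reference, a hedged sketch of what a GPTQ quantization calibrated on wikitext2 might look like with AutoGPTQ; this is illustrative only, not the exact script the author used:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A small sample of wikitext2 serves as the calibration data.
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
examples = [
    tokenizer(text, return_tensors="pt")
    for text in data["text"][:128] if text.strip()
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)                                  # run the calibration passes
model.save_quantized("Meta-Llama-3-70B-Instruct-GPTQ")    # write the quantized weights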

Many thanks for your help. :-) Do you also use wikitext2 for your GGUF model?

No, for GGUF we use a subset of a diverse dataset to build an imatrix (importance matrix) file with llama.cpp's imatrix tool, and then quantize with that.

@MaziyarPanahi Many thanks. The main problem I am facing with all GGUF/GPTQ Llama models is that they hallucinate quite a lot and produce random output. If I just prompt "Hello", the models generate a lot of random output (infinite generation). Even if I set the temperature very low or turn off sampling, they still do. Do you also face this issue, or how do you configure your generation_config? Do I need to configure something so that generation stops when the EOS token is reached? I would be very thankful for any tips and generation_config templates. :-)

Example:
from datetime import datetime

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer, pipeline

model_id = "MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ"
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    use_safetensors=True,
    device="cuda:0",
    quantize_config=quantize_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repetition_penalty=1.1,
)

start = datetime.now()
outputs = pipe("How are you?")
print(outputs[0]["generated_text"])
print(f"Duration: {datetime.now() - start}")

Output:

How are you? How was your day?
I'm doing well,'thank you for asking. My day has been quite busy so far. I've been working on a project and trying to meet a deadline.
That sounds like a lot of work! What kind of project is it?
It's a marketing campaign for a new product launch. We're trying to create a buzz around the product and get people excited about it.
That sounds interesting. What's the product?
It's a new smartphone app that helps people track their fitness goals and connect with others who share similar interests. It's really cool!
Wow, that does sound cool! I could use something like that. Do you think it'll be popular?
We hope so! The market research suggests that there's a big demand for this type of app, and we're confident that it'll do well. But we'll have to wait and see how it performs once it's launched.

In this example, the conversation starts with a greeting and an inquiry about the other person's day. The response provides some information about what they've been doing, which leads to further questions and discussion. The conversation flows naturally and doesn't feel forced or artificial.

Here are some tips for having a natural-sounding conversation in English:

  1. Start with a greeting: Begin with a hello, hi, or hey, and ask how the other person is doing.
  2. Be interested: Show genuine interest in the other person's life and ask follow-up questions based on what they say.
  3. Use conversational language: Avoid using overly formal or stilted language. Instead, opt for everyday phrases and expressions that you would use with friends.
  4. Keep it simple: Don't try to use complicated vocabulary or grammar structures that might make you stumble. Stick to what feels comfortable and natural.
  5. Listen actively: Pay attention to what the other person is saying and respond accordingly. This will help keep the conversation flowing smoothly.

By following these tips, you can have more natural-sounding conversations in English and improve your communication skills.assistant

Excellent advice!

Starting with a greeting and showing genuine interest in the other person's life sets the tone for a friendly and engaging conversation. Using conversational language and keeping it simple also helps to avoid awkwardness and misunderstandings.

Active listening is crucial in maintaining a smooth flow of conversation. By paying attention to what the other person is saying, you can respond thoughtfully and show that you value their thoughts and opinions.

Additionally, being open-minded
Duration: 0:01:02.534108

Many thanks in advance for any advice! :-)

KIlian42 changed discussion status to closed
KIlian42 changed discussion status to open

@KIlian42

I have tested all the GGUF models and they work without any issue. You should test them in LM Studio, which already sets the correct prompt template and parameters.

In this code, you are using GPTQ, which is not the same model as the one in this repo. There could be issues with GPTQ; if there are, please open a discussion on that model's page.
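On the "infinite generation" symptom specifically (note the stray "assistant" in the output above): Llama 3 Instruct ends an assistant turn with <|eot_id|>, not only <|end_of_text|>, and it expects its chat template rather than a raw prompt. A minimal sketch of both fixes, reusing the pipe and tokenizer from the GPTQ example above:

# Wrap the prompt in Llama 3's chat template instead of passing plain text;
# a raw prompt makes the model "continue" the text, which looks like endless rambling.
messages = [{"role": "user", "content": "How are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Stop generation on either of Llama 3's two end tokens, not just the default EOS.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipe(prompt, eos_token_id=terminators)
print(outputs[0]["generated_text"])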

MaziyarPanahi changed discussion status to closed
