
Running Llama-2-7B-32K-Instruct-GGML with llama.cpp?

#1
by gsimard - opened

I tried running the model with the following command, but only got repeated blank spaces or colons forever.

simard@CATALYS-1:/mnt/c/git/llama.cpp$ ./main -t 15 -m /mnt/f/ai-models/llama-2-7b-32k-instruct.ggmlv3.q8_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
main: build = 888 (e76d630)
main: seed = 1693268931
llama.cpp: loading model from /mnt/f/ai-models/llama-2-7b-32k-instruct.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 5504
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 7196.45 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size = 1024.00 MB

system_info: n_threads = 15 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0

Instruction: Write a story about llamas\n### Response:::::::::::

CTRL+C

simard@CATALYS-1:/mnt/c/git/llama.cpp$

-p "### Instruction: Write a story about llamas\n### Response:"

It should be -p "[INST] Write a story about llamas. [/INST] ", I think.
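For example, plugging that prompt into the same invocation as above:

./main -t 15 -m /mnt/f/ai-models/llama-2-7b-32k-instruct.ggmlv3.q8_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] Write a story about llamas. [/INST] "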

Same issue with -p "[INST] Write a story about llamas. [/INST] "

I am facing the same issue.

How about
LLAMA_SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder; use your own system prompt

def llama_prompt(message: str, system_message: str = LLAMA_SYSTEM_PROMPT) -> str:
    # Llama-2 chat format: system prompt inside <<SYS>> tags, wrapped in [INST] ... [/INST]
    prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{message} [/INST]"
    return prompt
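
For example (output shown for illustration, using the placeholder system prompt above):

print(llama_prompt("Write a story about llamas."))
# [INST] <<SYS>>
# You are a helpful assistant.
# <</SYS>>
#
# Write a story about llamas. [/INST]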

The page deleted the system tags from inside the brackets and only left empty brackets. They should in fact be double angle brackets, so <<SYS>> and <</SYS>> without spaces.

Same issue as far as I can tell.

@TheBloke I know you must be super busy, but would you have an idea? I have followed the instructions on this model's page as far as I can tell, with no luck. Your other models have been working fine.

Same with me.

Same here.

@gsimard @JulianStreibel @enesj @akarshanbiswas

I am facing the same issue.

Did you guys solve this? Got any solutions?

@tocof44188alibrscoM you should not be using a GGML model, as it's very outdated.

Use TheBloke's GGUF models with llama.cpp, ctransformers, or llama-cpp-python.
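
A minimal sketch with llama-cpp-python, assuming the package is installed and a GGUF file has already been downloaded locally (the filename and path are placeholders):

from llama_cpp import Llama

# load the local GGUF model; n_ctx sets the context window
llm = Llama(model_path="./llama-2-7b-32k-instruct.Q4_K_M.gguf", n_ctx=4096)

# Llama-2 chat-style prompt, same format as discussed above
output = llm("[INST] Write a story about llamas. [/INST]", max_tokens=256)
print(output["choices"][0]["text"])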

@YaTharThShaRma999 I tried the GGUF model as well:

https://huggingface.co/TheBloke/Llama-2-7B-32K-Instruct-GGUF/discussions/1

There I did the RoPE adjustment as well, and I am still getting the blank spaces. Any help would be appreciated.

@tocof44188alibrscoM I believe you shouldn't do any RoPE adjustment, as GGUF already has that set, I think. Try that, possibly? Either way, I would recommend using Mistral 7B instead, since it's much better than Llama 2 7B and should also work at 32k context.
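
For what it's worth, a plain GGUF run with no RoPE flags would look something like this (the model path is a placeholder; when the RoPE flags are omitted, llama.cpp picks up the RoPE base and scale from the GGUF metadata):

./main -m /path/to/llama-2-7b-32k-instruct.Q4_K_M.gguf -c 4096 -n -1 --temp 0.7 --repeat_penalty 1.1 -p "[INST] Write a story about llamas. [/INST] "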
