Running on llama.cpp

#2
by junhochoi - opened

I tried to run this model on a MacBook M1 Pro using llama.cpp (48edda3):

python3 convert.py ~/42dot_LLM-SFT-1.3B --vocabtype bpe

It failed with:

Exception: Vocab size mismatch (model has 50304, but <snip>/42dot_LLM-SFT-1.3B/vocab.json combined with <snip>/42dot_LLM-SFT-1.3B/added_tokens.json has 50260).
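So the model declares 50304 vocabulary rows, while the tokenizer files only cover 50260 IDs, leaving 44 IDs unaccounted for. A minimal sketch to reproduce those numbers (assuming the standard vocab.json / added_tokens.json layout in the model directory):

import json, os

model_dir = os.path.expanduser("~/42dot_LLM-SFT-1.3B")

with open(os.path.join(model_dir, "vocab.json"), encoding="utf-8") as f:
    vocab = json.load(f)
with open(os.path.join(model_dir, "added_tokens.json"), encoding="utf-8") as f:
    added = json.load(f)

# added_tokens.json repeats <|endoftext|>, so take the union of IDs
combined = set(vocab.values()) | set(added.values())
print(len(combined))          # 50260, the number in the error message
print(50304 - len(combined))  # 44 IDs short of the model's declared vocab size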

To work around this issue, I added 44 dummy tokens (IDs 50260 through 50303) to fill the gap, as follows:

diff --git a/added_tokens.json b/added_tokens.json
index c883403..a133144 100644
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -2,5 +2,49 @@
   "<|endoftext|>": 50256,
   "<||bos||>": 50257,
   "<||pad||>": 50258,
-  "<||unk||>": 50259
+  "<||unk||>": 50259,
+  "<|tmp1|>": 50260,
+  "<|tmp2|>": 50261,
+  "<|tmp3|>": 50262,
+  "<|tmp4|>": 50263,
+  "<|tmp5|>": 50264,
+  "<|tmp6|>": 50265,
+  "<|tmp7|>": 50266,
+  "<|tmp8|>": 50267,
+  "<|tmp9|>": 50268,
+  "<|tmp10|>": 50269,
+  "<|tmp11|>": 50270,
+  "<|tmp12|>": 50271,
+  "<|tmp13|>": 50272,
+  "<|tmp14|>": 50273,
+  "<|tmp15|>": 50274,
+  "<|tmp16|>": 50275,
+  "<|tmp17|>": 50276,
+  "<|tmp18|>": 50277,
+  "<|tmp19|>": 50278,
+  "<|tmp20|>": 50279,
+  "<|tmp21|>": 50280,
+  "<|tmp22|>": 50281,
+  "<|tmp23|>": 50282,
+  "<|tmp24|>": 50283,
+  "<|tmp25|>": 50284,
+  "<|tmp26|>": 50285,
+  "<|tmp27|>": 50286,
+  "<|tmp28|>": 50287,
+  "<|tmp29|>": 50288,
+  "<|tmp30|>": 50289,
+  "<|tmp31|>": 50290,
+  "<|tmp32|>": 50291,
+  "<|tmp33|>": 50292,
+  "<|tmp34|>": 50293,
+  "<|tmp35|>": 50294,
+  "<|tmp36|>": 50295,
+  "<|tmp37|>": 50296,
+  "<|tmp38|>": 50297,
+  "<|tmp39|>": 50298,
+  "<|tmp40|>": 50299,
+  "<|tmp41|>": 50300,
+  "<|tmp42|>": 50301,
+  "<|tmp43|>": 50302,
+  "<|tmp44|>": 50303
 }

With this change the conversion works, and I am able to run llama.cpp with this model.
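For reference, the same padding can be generated with a short script instead of hand-editing the JSON. A minimal sketch that rewrites added_tokens.json in place (the <|tmpN|> names are arbitrary placeholders and 50304 is the vocab size reported in the error above; back up the file first):

import json, os

model_dir = os.path.expanduser("~/42dot_LLM-SFT-1.3B")
path = os.path.join(model_dir, "added_tokens.json")
target_vocab_size = 50304  # vocab size convert.py reports for the model

with open(path, encoding="utf-8") as f:
    added = json.load(f)

# Next free ID after the existing special tokens (50260 in this checkpoint)
next_id = max(added.values()) + 1

# Fill the gap with uniquely named dummy tokens up to the model's vocab size
for i, token_id in enumerate(range(next_id, target_vocab_size), start=1):
    added[f"<|tmp{i}|>"] = token_id

with open(path, "w", encoding="utf-8") as f:
    json.dump(added, f, indent=2, ensure_ascii=False)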

How to convert to GGUF and quantize to q4_0 (run these after building llama.cpp):

python3 convert.py ~/42dot_LLM-SFT-1.3B --vocabtype bpe
./quantize ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf_q4.0 q4_0

Run a simple chat:

$ ./main -m ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf_q4.0 -n -1 --color -ins -i
Log start
main: build = 1330 (48edda3)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin22.6.0
...
system_info: n_threads = 6 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


### Instruction:


> ν•˜λŠ˜μ€ μ™œ νŒŒλž€κ°€μš”?
ν•˜λŠ˜μ΄ νŒŒλž—μŠ΅λ‹ˆλ‹€. μ΄μœ λŠ” λŒ€κΈ° 쀑에 μžˆλŠ” μˆ˜μ¦κΈ°κ°€ 햇빛을 λ°˜μ‚¬ν•˜μ—¬ 우리 λˆˆμ— λ³΄μ΄λŠ” 것이기 λ•Œλ¬Έμž…λ‹ˆλ‹€. νƒœμ–‘μ—μ„œ 온 빛이 지ꡬ에 λ„λ‹¬ν•˜λ©΄, λŒ€κΈ°λŠ” 적외선과 μžμ™Έμ„ μ„ ν‘μˆ˜ν•˜κ³  μˆ˜μ¦κΈ°κ°€ λ°˜μ‚¬λ₯Ό ν•΄μ„œ μš°λ¦¬κ°€ λ³Ό 수 있게 λ©λ‹ˆλ‹€. 이 κ³Όμ •μ—μ„œ μƒ‰μƒμ˜ λ³€ν™”κ°€ μƒκΈ°κ²Œ λ˜λŠ”λ°, κ·Έ 결과둜 μš°λ¦¬λŠ” ν•˜λŠ˜μ΄ νŒŒλž—κ²Œ 느끼게 λ˜λŠ” κ²ƒμž…λ‹ˆλ‹€.

>

llama_print_timings:        load time =    96.65 ms
llama_print_timings:      sample time =   102.85 ms /    84 runs   (    1.22 ms per token,   816.76 tokens per second)
llama_print_timings: prompt eval time =   166.23 ms /    17 tokens (    9.78 ms per token,   102.27 tokens per second)
llama_print_timings:        eval time =   742.10 ms /    84 runs   (    8.83 ms per token,   113.19 tokens per second)
llama_print_timings:       total time =  8348.41 ms

It is much faster than LLaMA 7B/13B. Thanks for building this model!
I won't turn this into a PR, because adding such dummy tokens is not a real fix. :)

Thank you for your feedback.
We've patched the special tokens accordingly.

Now you can run the 42dot LLM-SFT model in llama.cpp using the guide below.

  1. Convert the 42dot LLM-SFT model to GGUF FP32 format.
$ python convert.py ./42dot_LLM-SFT-1.3B/ --vocabtype bpe
  2. Quantize the model to 4 bits (optional).
$ ./quantize ./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf ./42dot_LLM-SFT-1.3B/ggml-model-q4_0.gguf q4_0
  3. Run inference. We recommend the options below (a Python alternative is sketched after the command).
$ ./main -m ./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf \
--temp 0.5 \
--top_p 0.95 \
--top_k 20 \
--n-predict 512 \
--repeat-penalty 1.2 \
--color \
--prompt "ν˜ΈκΈ°μ‹¬ λ§Žμ€ 인간 (human)κ³Ό 인곡지λŠ₯ 봇 (AI bot)의 λŒ€ν™”μž…λ‹ˆλ‹€. \nλ΄‡μ˜ 이름은 42dot LLM이고 ν¬ν‹°νˆ¬λ‹· (42dot)μ—μ„œ κ°œλ°œν–ˆμŠ΅λ‹ˆλ‹€. \n봇은 μΈκ°„μ˜ μ§ˆλ¬Έμ— λŒ€ν•΄ μΉœμ ˆν•˜κ²Œ μœ μš©ν•˜κ³  μƒμ„Έν•œ 닡변을 μ œκ³΅ν•©λ‹ˆλ‹€. \n" \
--in-prefix "<human>: " \
--in-suffix "<bot>:" \
--interactive-first
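
For anyone who would rather drive the model from Python than from ./main, roughly the same setup can be reproduced with the llama-cpp-python bindings. This is only a hedged sketch, not an official example: it assumes pip install llama-cpp-python, and the way the system prompt is joined with the <human>:/<bot>: prefix and suffix mirrors the flags above rather than a documented template.

from llama_cpp import Llama

# Same GGUF file produced by convert.py / quantize above
llm = Llama(model_path="./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf", n_ctx=512)

system = (
    "ν˜ΈκΈ°μ‹¬ λ§Žμ€ 인간 (human)κ³Ό 인곡지λŠ₯ 봇 (AI bot)의 λŒ€ν™”μž…λ‹ˆλ‹€.\n"
    "λ΄‡μ˜ 이름은 42dot LLM이고 ν¬ν‹°νˆ¬λ‹· (42dot)μ—μ„œ κ°œλ°œν–ˆμŠ΅λ‹ˆλ‹€.\n"
    "봇은 μΈκ°„μ˜ μ§ˆλ¬Έμ— λŒ€ν•΄ μΉœμ ˆν•˜κ²Œ μœ μš©ν•˜κ³  μƒμ„Έν•œ 닡변을 μ œκ³΅ν•©λ‹ˆλ‹€.\n"
)
question = "ν•˜λŠ˜μ€ μ™œ νŒŒλž€κ°€μš”?"  # "Why is the sky blue?"

# <human>: / <bot>: mirror the --in-prefix / --in-suffix options above
prompt = f"{system}<human>: {question}\n<bot>:"

out = llm(
    prompt,
    max_tokens=512,
    temperature=0.5,
    top_p=0.95,
    top_k=20,
    repeat_penalty=1.2,
    stop=["<human>:"],  # stop before the model starts a new turn
)
print(out["choices"][0]["text"].strip())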

Thanks!

likejazz changed discussion status to closed

Hello! I am trying to run inference on this model using llama.cpp.
However, I encounter the error shown in the attached screenshot.
Is there a solution to this?

[screenshot of the error]

This bug was patched in the upstream llama.cpp repository via a PR we submitted:
https://github.com/ggerganov/llama.cpp/pull/5288

Please use the latest version of llama.cpp. Thanks!
