Running on llama.cpp
I tried to run this model on a MacBook M1 Pro using llama.cpp (48edda3):
python3 convert.py ~/42dot_LLM-SFT-1.3B --vocabtype bpe
It failed with:
Exception: Vocab size mismatch (model has 50304, but <snip>/42dot_LLM-SFT-1.3B/vocab.json combined with <snip>/42dot_LLM-SFT-1.3B/added_tokens.json has 50260).
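The numbers in that message can be reproduced directly from the checkpoint files: config.json declares vocab_size = 50304, while vocab.json plus added_tokens.json only define 50260 token ids. A quick stdlib-only check (assuming the usual Hugging Face file layout):

import json

model_dir = "42dot_LLM-SFT-1.3B"  # adjust to your local path

# Token ids actually defined by the tokenizer files
with open(f"{model_dir}/vocab.json") as f:
    ids = set(json.load(f).values())
with open(f"{model_dir}/added_tokens.json") as f:
    ids |= set(json.load(f).values())

# Vocab size the model weights were built with
with open(f"{model_dir}/config.json") as f:
    vocab_size = json.load(f)["vocab_size"]

print(len(ids), vocab_size)  # 50260 50304 for this checkpoint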
To work around this, I added some dummy tokens as follows:
diff --git a/added_tokens.json b/added_tokens.json
index c883403..a133144 100644
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -2,5 +2,49 @@
"<|endoftext|>": 50256,
"<||bos||>": 50257,
"<||pad||>": 50258,
- "<||unk||>": 50259
+ "<||unk||>": 50259,
+ "<|tmp1|>": 50260,
+ "<|tmp2|>": 50261,
+ "<|tmp3|>": 50262,
+ "<|tmp4|>": 50263,
+ "<|tmp5|>": 50264,
+ "<|tmp6|>": 50265,
+ "<|tmp7|>": 50266,
+ "<|tmp8|>": 50267,
+ "<|tmp9|>": 50268,
+ "<|tmp10|>": 50269,
+ "<|tmp11|>": 50270,
+ "<|tmp12|>": 50271,
+ "<|tmp13|>": 50272,
+ "<|tmp14|>": 50273,
+ "<|tmp15|>": 50274,
+ "<|tmp16|>": 50275,
+ "<|tmp17|>": 50276,
+ "<|tmp18|>": 50277,
+ "<|tmp19|>": 50278,
+ "<|tmp20|>": 50279,
+ "<|tmp21|>": 50280,
+ "<|tmp22|>": 50281,
+ "<|tmp23|>": 50282,
+ "<|tmp24|>": 50283,
+ "<|tmp25|>": 50284,
+ "<|tmp26|>": 50285,
+ "<|tmp27|>": 50286,
+ "<|tmp28|>": 50287,
+ "<|tmp29|>": 50288,
+ "<|tmp30|>": 50289,
+ "<|tmp31|>": 50290,
+ "<|tmp32|>": 50291,
+ "<|tmp33|>": 50292,
+ "<|tmp34|>": 50293,
+ "<|tmp35|>": 50294,
+ "<|tmp36|>": 50295,
+ "<|tmp37|>": 50296,
+ "<|tmp38|>": 50297,
+ "<|tmp39|>": 50298,
+ "<|tmp40|>": 50299,
+ "<|tmp41|>": 50300,
+ "<|tmp42|>": 50301,
+ "<|tmp43|>": 50302,
+ "<|tmp44|>": 50303
}
With that change the conversion works, and I can run the model in llama.cpp.
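For anyone who would rather not hand-edit the JSON, the same padding can be generated with a short script (an untested sketch; it assumes the model folder is in the current directory and reuses the 50304 target from the error above):

import json

path = "42dot_LLM-SFT-1.3B/added_tokens.json"
target_vocab_size = 50304  # from config.json / the convert.py error

with open(path) as f:
    added = json.load(f)

# Append dummy <|tmpN|> tokens for every missing id (50260..50303 here)
next_id = max(added.values()) + 1
for n, token_id in enumerate(range(next_id, target_vocab_size), start=1):
    added[f"<|tmp{n}|>"] = token_id

with open(path, "w") as f:
    json.dump(added, f, indent=2, ensure_ascii=False)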
How to convert to GGUF/q4_0 (run these after building llama.cpp):
python3 convert.py ~/42dot_LLM-SFT-1.3B --vocabtype bpe
./quantize ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf_q4.0 q4_0
Run a simple chat:
$ ./main -m ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf_q4.0 -n -1 --color -ins -i
Log start
main: build = 1330 (48edda3)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin22.6.0
...
system_info: n_threads = 6 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
### Instruction:
> 하늘은 왜 파란가요? (Why is the sky blue?)
하늘이 파랗습니다. 이유는 대기 중에 있는 수증기가 햇빛을 반사하여 우리 눈에 보이는 것이기 때문입니다. 태양에서 온 빛이 지구에 도달하면, 대기는 자외선과 적외선을 흡수하고 수증기가 반사를 해서 우리가 볼 수 있게 됩니다. 이 과정에서 색상의 변화가 생기게 되는데, 그 결과로 우리는 하늘이 파랗게 느끼게 되는 것입니다.
(Roughly: "The sky is blue because water vapor in the atmosphere reflects sunlight into our eyes. When sunlight reaches the Earth, the atmosphere absorbs the ultraviolet and infrared light and the water vapor reflects the rest so that we can see it. This process changes the colors, and as a result we perceive the sky as blue.")
>
llama_print_timings: load time = 96.65 ms
llama_print_timings: sample time = 102.85 ms / 84 runs ( 1.22 ms per token, 816.76 tokens per second)
llama_print_timings: prompt eval time = 166.23 ms / 17 tokens ( 9.78 ms per token, 102.27 tokens per second)
llama_print_timings: eval time = 742.10 ms / 84 runs ( 8.83 ms per token, 113.19 tokens per second)
llama_print_timings: total time = 8348.41 ms
It is much faster than LLaMA 7B/13B. Thanks for building this model!
I won't make this a PR because adding such dummies is not a real fix. :)
Thank you for your feedback.
We've patched the special tokens accordingly.
Now you can run the 42dot LLM-SFT model in llama.cpp using the guide below.
- Convert the 42dot LLM-SFT model to GGUF (FP32) format.
$ python convert.py ./42dot_LLM-SFT-1.3B/ --vocabtype bpe
- Quantize the model to 4 bits (optional).
$ ./quantize ./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf ./42dot_LLM-SFT-1.3B/ggml-model-q4_0.gguf q4_0
- Run inference. We recommend the options below, which combine the Korean 42dot LLM system prompt with <human>:/<bot>: turn prefixes (a Python sketch with the same setup follows these commands).
$ ./main -m ./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf \
--temp 0.5 \
--top_p 0.95 \
--top_k 20 \
--n-predict 512 \
--repeat-penalty 1.2 \
--color \
--prompt "νΈκΈ°μ¬ λ§μ μΈκ° (human)κ³Ό μΈκ³΅μ§λ₯ λ΄ (AI bot)μ λνμ
λλ€. \nλ΄μ μ΄λ¦μ 42dot LLMμ΄κ³ ν¬ν°ν¬λ· (42dot)μμ κ°λ°νμ΅λλ€. \nλ΄μ μΈκ°μ μ§λ¬Έμ λν΄ μΉμ νκ² μ μ©νκ³ μμΈν λ΅λ³μ μ 곡ν©λλ€. \n" \
--in-prefix "<human>: " \
--in-suffix "<bot>:" \
--interactive-first
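The same prompt layout can also be driven from Python via the llama-cpp-python bindings instead of ./main (a rough sketch, not an official example; the model path, system prompt, and sampling values mirror the command above):

from llama_cpp import Llama  # pip install llama-cpp-python

# System prompt recommended above (roughly: "A conversation between a curious
# human and an AI bot. The bot's name is 42dot LLM and it was developed by
# 42dot. The bot gives kind, useful, and detailed answers to the human's questions.")
SYSTEM_PROMPT = (
    "호기심 많은 인간 (human)과 인공지능 봇 (AI bot)의 대화입니다. \n"
    "봇의 이름은 42dot LLM이고 포티투닷 (42dot)에서 개발했습니다. \n"
    "봇은 인간의 질문에 대해 친절하게 유용하고 상세한 답변을 제공합니다. \n"
)

# Use the FP32 GGUF from the convert step, or the q4_0 file from the optional step
llm = Llama(model_path="./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf")

question = "하늘은 왜 파란가요?"  # "Why is the sky blue?"
prompt = f"{SYSTEM_PROMPT}<human>: {question}\n<bot>:"

out = llm(
    prompt,
    max_tokens=512,
    temperature=0.5,
    top_p=0.95,
    top_k=20,
    repeat_penalty=1.2,
    stop=["<human>:"],  # cut generation before the next human turn
)
print(out["choices"][0]["text"].strip())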
Thanks!
LGTM. Thanks!
This bug was patched in the main llama.cpp repository via the PR we submitted:
https://github.com/ggerganov/llama.cpp/pull/5288
Check the latest version of llama.cpp. Thanks!