Running on llama.cpp
I tried to run this model on a MacBook M1 Pro using llama.cpp (48edda3):
python3 convert.py ~/42dot_LLM-SFT-1.3B --vocabtype bpe
It failed with:
Exception: Vocab size mismatch (model has 50304, but <snip>/42dot_LLM-SFT-1.3B/vocab.json combined with <snip>/42dot_LLM-SFT-1.3B/added_tokens.json has 50260).
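The numbers in that message can be reproduced directly from the checkpoint files: config.json declares vocab_size = 50304, while vocab.json plus added_tokens.json only define 50260 token ids. A quick stdlib-only check (assuming the usual Hugging Face file layout):

import json

model_dir = "42dot_LLM-SFT-1.3B"  # adjust to your local path

# Token ids actually defined by the tokenizer files
with open(f"{model_dir}/vocab.json") as f:
    ids = set(json.load(f).values())
with open(f"{model_dir}/added_tokens.json") as f:
    ids |= set(json.load(f).values())

# Vocab size the model weights were built with
with open(f"{model_dir}/config.json") as f:
    vocab_size = json.load(f)["vocab_size"]

print(len(ids), vocab_size)  # 50260 50304 for this checkpoint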
To work around this, I added some dummy tokens as follows:
diff --git a/added_tokens.json b/added_tokens.json
index c883403..a133144 100644
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -2,5 +2,49 @@
"<|endoftext|>": 50256,
"<||bos||>": 50257,
"<||pad||>": 50258,
- "<||unk||>": 50259
+ "<||unk||>": 50259,
+ "<|tmp1|>": 50260,
+ "<|tmp2|>": 50261,
+ "<|tmp3|>": 50262,
+ "<|tmp4|>": 50263,
+ "<|tmp5|>": 50264,
+ "<|tmp6|>": 50265,
+ "<|tmp7|>": 50266,
+ "<|tmp8|>": 50267,
+ "<|tmp9|>": 50268,
+ "<|tmp10|>": 50269,
+ "<|tmp11|>": 50270,
+ "<|tmp12|>": 50271,
+ "<|tmp13|>": 50272,
+ "<|tmp14|>": 50273,
+ "<|tmp15|>": 50274,
+ "<|tmp16|>": 50275,
+ "<|tmp17|>": 50276,
+ "<|tmp18|>": 50277,
+ "<|tmp19|>": 50278,
+ "<|tmp20|>": 50279,
+ "<|tmp21|>": 50280,
+ "<|tmp22|>": 50281,
+ "<|tmp23|>": 50282,
+ "<|tmp24|>": 50283,
+ "<|tmp25|>": 50284,
+ "<|tmp26|>": 50285,
+ "<|tmp27|>": 50286,
+ "<|tmp28|>": 50287,
+ "<|tmp29|>": 50288,
+ "<|tmp30|>": 50289,
+ "<|tmp31|>": 50290,
+ "<|tmp32|>": 50291,
+ "<|tmp33|>": 50292,
+ "<|tmp34|>": 50293,
+ "<|tmp35|>": 50294,
+ "<|tmp36|>": 50295,
+ "<|tmp37|>": 50296,
+ "<|tmp38|>": 50297,
+ "<|tmp39|>": 50298,
+ "<|tmp40|>": 50299,
+ "<|tmp41|>": 50300,
+ "<|tmp42|>": 50301,
+ "<|tmp43|>": 50302,
+ "<|tmp44|>": 50303
}
With that change the conversion works, and I can run the model in llama.cpp.
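For anyone who would rather not hand-edit the JSON, the same padding can be generated with a short script (an untested sketch; it assumes the model folder is in the current directory and reuses the 50304 target from the error above):

import json

path = "42dot_LLM-SFT-1.3B/added_tokens.json"
target_vocab_size = 50304  # from config.json / the convert.py error

with open(path) as f:
    added = json.load(f)

# Append dummy <|tmpN|> tokens for every missing id (50260..50303 here)
next_id = max(added.values()) + 1
for n, token_id in enumerate(range(next_id, target_vocab_size), start=1):
    added[f"<|tmp{n}|>"] = token_id

with open(path, "w") as f:
    json.dump(added, f, indent=2, ensure_ascii=False)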
How to convert to GGUF/q4_0 (run these after building llama.cpp):
python3 convert.py ~/42dot_LLM-SFT-1.3B --vocabtype bpe
./quantize ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf_q4.0 q4_0
Run a simple chat:
$ ./main -m ~/42dot_LLM-SFT-1.3B/ggml-model-f32.gguf_q4.0 -n -1 --color -ins -i
Log start
main: build = 1330 (48edda3)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin22.6.0
...
system_info: n_threads = 6 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
### Instruction:
> 하늘은 왜 파란가요? (Why is the sky blue?)
하늘이 파랗습니다. 이유는 대기 중에 있는 수증기가 햇빛을 반사하여 우리 눈에 보이는 것이기 때문입니다. 태양에서 온 빛이 지구에 도달하면, 대기는 자외선과 적외선을 흡수하고 수증기가 반사를 해서 우리가 볼 수 있게 됩니다. 이 과정에서 색상의 변화가 생기게 되는데, 그 결과로 우리는 하늘이 파랗게 느끼게 되는 것입니다.
(Roughly: "The sky is blue because water vapor in the atmosphere reflects sunlight into our eyes. When sunlight reaches the Earth, the atmosphere absorbs the ultraviolet and infrared light and the water vapor reflects the rest so that we can see it. This process changes the colors, and as a result we perceive the sky as blue.")
>
llama_print_timings: load time = 96.65 ms
llama_print_timings: sample time = 102.85 ms / 84 runs ( 1.22 ms per token, 816.76 tokens per second)
llama_print_timings: prompt eval time = 166.23 ms / 17 tokens ( 9.78 ms per token, 102.27 tokens per second)
llama_print_timings: eval time = 742.10 ms / 84 runs ( 8.83 ms per token, 113.19 tokens per second)
llama_print_timings: total time = 8348.41 ms
It is much faster than LLaMA 7B/13B. Thanks for building this model!
I won't make this a PR because adding such dummies is not a real fix. :)
Thank you for your feedback.
We've patched the special tokens accordingly.
Now you can run the 42dot LLM-SFT model in llama.cpp using the guide below.
- Convert the 42dot LLM-SFT model to GGUF (FP32) format.
$ python convert.py ./42dot_LLM-SFT-1.3B/ --vocabtype bpe
- Quantize the model to 4 bits (optional).
$ ./quantize ./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf ./42dot_LLM-SFT-1.3B/ggml-model-q4_0.gguf q4_0
- Run inference. We recommend the options below, which combine the Korean 42dot LLM system prompt with <human>:/<bot>: turn prefixes (a Python sketch with the same setup follows these commands).
$ ./main -m ./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf \
--temp 0.5 \
--top_p 0.95 \
--top_k 20 \
--n-predict 512 \
--repeat-penalty 1.2 \
--color \
--prompt "νΈκΈ°μ¬ λ§μ μΈκ° (human)κ³Ό μΈκ³΅μ§λ₯ λ΄ (AI bot)μ λνμ
λλ€. \nλ΄μ μ΄λ¦μ 42dot LLMμ΄κ³ ν¬ν°ν¬λ· (42dot)μμ κ°λ°νμ΅λλ€. \nλ΄μ μΈκ°μ μ§λ¬Έμ λν΄ μΉμ νκ² μ μ©νκ³ μμΈν λ΅λ³μ μ 곡ν©λλ€. \n" \
--in-prefix "<human>: " \
--in-suffix "<bot>:" \
--interactive-first
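The same prompt layout can also be driven from Python via the llama-cpp-python bindings instead of ./main (a rough sketch, not an official example; the model path, system prompt, and sampling values mirror the command above):

from llama_cpp import Llama  # pip install llama-cpp-python

# System prompt recommended above (roughly: "A conversation between a curious
# human and an AI bot. The bot's name is 42dot LLM and it was developed by
# 42dot. The bot gives kind, useful, and detailed answers to the human's questions.")
SYSTEM_PROMPT = (
    "호기심 많은 인간 (human)과 인공지능 봇 (AI bot)의 대화입니다. \n"
    "봇의 이름은 42dot LLM이고 포티투닷 (42dot)에서 개발했습니다. \n"
    "봇은 인간의 질문에 대해 친절하게 유용하고 상세한 답변을 제공합니다. \n"
)

# Use the FP32 GGUF from the convert step, or the q4_0 file from the optional step
llm = Llama(model_path="./42dot_LLM-SFT-1.3B/ggml-model-f32.gguf")

question = "하늘은 왜 파란가요?"  # "Why is the sky blue?"
prompt = f"{SYSTEM_PROMPT}<human>: {question}\n<bot>:"

out = llm(
    prompt,
    max_tokens=512,
    temperature=0.5,
    top_p=0.95,
    top_k=20,
    repeat_penalty=1.2,
    stop=["<human>:"],  # cut generation before the next human turn
)
print(out["choices"][0]["text"].strip())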
Thanks!
LGTM. Thanks!
This bug was patched in the main llama.cpp repository via the PR we submitted:
https://github.com/ggerganov/llama.cpp/pull/5288
Check the latest version of llama.cpp. Thanks!