
Inference drops characters with the ggmlv3 q6_K model

#1
by wennycooper - opened

Hello,
I quantized the model with ggml into q6_K format, then ran inference with the following code:

```python
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
model_path="/workspace/test/TaiwanLLama_v1.0/Taiwan-LLaMa-13b-1.0.ggmlv3.q6_K.bin",
n_gpu_layers=16,
n_batch=8,
n_ctx=2048,
temperature=0.1,
max_tokens=512,
callback_manager=callback_manager,
)

prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")  # "What is deep learning?"
response = llm(prompt)
```
The output drops characters, as shown below:

深度學是機器學的一子集,基人工神經結。使得計算機能通別模式大量中學,而不需要明編程。深度學算法用分、進行和別模式

I also tested q8_0 and it has the same character-dropping problem. Is there any way to fix this?

llama.cpp: loading model from /workspace/test/TaiwanLLama_v1.0/Taiwan-LLaMa-13b-1.0.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 6912
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 10431.68 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/43 layers to GPU
llama_model_load_internal: total VRAM used: 3695 MB
llama_new_context_with_model: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
深度學是機器學的一子集,基人工神經結和流程的算法。用別中的模式提取特,些特

This is not related to the model; it appears to be an old bug in llama-cpp-python. Updating the package should fix it:
https://github.com/abetlen/llama-cpp-python/pull/309
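For context (my summary, not part of the original reply): CJK characters take three bytes in UTF-8, and LLaMA's tokenizer can split one character across several token pieces via byte fallback. If each piece is decoded to a string on its own, the incomplete bytes get silently dropped, which matches the missing-character symptom above. The sketch below is a standalone illustration of that effect and of the buffering idea behind the fix; it is not code from llama-cpp-python.

```python
# Standalone illustration (not llama-cpp-python code): why decoding token
# pieces one at a time can drop CJK characters, and how buffering fixes it.
import codecs

text = "深度學習"            # four Chinese characters, 3 UTF-8 bytes each
data = text.encode("utf-8")  # 12 bytes in total

# Simulate token pieces whose boundaries fall inside a character.
pieces = [data[:4], data[4:9], data[9:]]

# Buggy pattern: decode each piece independently and silently drop bad bytes.
buggy = "".join(p.decode("utf-8", errors="ignore") for p in pieces)

# Fixed pattern: an incremental decoder keeps incomplete bytes until the rest
# of the character arrives (roughly the buffering idea behind the linked PR).
decoder = codecs.getincrementaldecoder("utf-8")()
fixed = "".join(decoder.decode(p) for p in pieces)

print(buggy)   # 深學習 -- the character 度 is lost at the split point
print(fixed)   # 深度學習
```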

Also, newer versions of llama.cpp no longer support GGML; you should migrate to GGUF now.
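As a rough sketch of that migration path (my addition, with assumptions: the conversion script bundled with llama.cpp has varied in name and flags across versions, and the `.gguf` path below is hypothetical), the idea is to upgrade llama-cpp-python to a release that includes the fix and reads GGUF, convert the old ggmlv3 file, and point `model_path` at the converted file:

```python
# A minimal sketch, not from the thread. Assumptions: llama-cpp-python has been
# upgraded (e.g. `pip install -U llama-cpp-python`), the old ggmlv3 file has
# already been converted with the conversion script shipped in the llama.cpp
# repo (its exact name and flags vary by version), and the .gguf path below is
# hypothetical.
import llama_cpp
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

print(llama_cpp.__version__)  # confirm the upgraded package is actually loaded

llm = LlamaCpp(
    model_path="/workspace/test/TaiwanLLama_v1.0/Taiwan-LLaMa-13b-1.0.Q8_0.gguf",  # hypothetical converted file
    n_gpu_layers=16,
    n_batch=8,
    n_ctx=2048,
    temperature=0.1,
    max_tokens=512,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)

prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: 什麼是深度學習? ASSISTANT:"  # "What is deep learning?"
)
print(llm(prompt))
```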

audreyt changed discussion status to closed
