The generated output seems to be gibberish

#1
by WoolCool

F:\llamacpp-k>main --mlock --instruct -i --interactive-first --top_k 60 --top_p 1.1 -c 2048 --color --temp 0.8 -n -1 --keep -1 --repeat_penalty 1.1 -t 6 -m Baichuan-13B-Instruction.ggmlv3.q5_1.bin -ngl 22
main: build = 913 (eb542d3)
main: seed = 1690457767
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5
llama.cpp: loading model from Baichuan-13B-Instruction.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 64000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 214
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13696
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5043.99 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 22 repeating layers to GPU
llama_model_load_internal: offloaded 22/43 layers to GPU
llama_model_load_internal: total VRAM used: 5442 MB
llama_new_context_with_model: kv self size = 1600.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 60, tfs_z = 1.000000, top_p = 1.100000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

你好

Console.WriteLine("#  用C#编写输出文本为:"Hello, world!"

> 常见的水果有哪几种?


> 下雨时人为什么要打伞
にはなれが生るりりリリりりりれれ```
当使用C# 当使用c cccc在将:

>

Hi, did you try using another prompt template? Based on the AlpachinoNLP implementation, an example instruct input should look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

# Load the original (unquantized) model and its generation config.
tokenizer = AutoTokenizer.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction")

# The chat helper expects a list of {"role", "content"} messages.
messages = []
messages.append({"role": "Human", "content": "世界上第二高的山峰是哪座"})
response = model.chat(tokenizer, messages)
print(response)
```
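
For the GGML file under llama.cpp, a rough way to try a different template is to build the prompt by hand instead of relying on `--instruct`. Below is a minimal sketch using the llama-cpp-python bindings (not the `main` binary from the log); the Alpaca-style `### Instruction:` / `### Response:` wrapper is only an assumption borrowed from the reverse prompt in the log, and the path and sampling values roughly mirror the original command.

```python
from llama_cpp import Llama

# Assumptions: the quantized file sits in the working directory and the model
# responds to an Alpaca-style "### Instruction:" / "### Response:" template
# (the same reverse prompt that --instruct uses in the log above).
llm = Llama(
    model_path="Baichuan-13B-Instruction.ggmlv3.q5_1.bin",
    n_ctx=2048,
    n_gpu_layers=22,
)

def ask(question: str) -> str:
    prompt = f"### Instruction:\n{question}\n\n### Response:\n"
    out = llm(
        prompt,
        max_tokens=256,
        temperature=0.8,
        top_k=60,
        repeat_penalty=1.1,
        stop=["### Instruction:"],  # stop before the model starts a new turn
    )
    return out["choices"][0]["text"].strip()

print(ask("常见的水果有哪几种?"))
```

If this still degenerates, the same helper can be retried with the role markers the fine-tune actually used (for example the "Human" role from the snippet above) before blaming the quantization.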

I did not check the quality of the original model, so it is worth rethinking whether the gibberish comes from the quantization or from the model itself. ^^
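
One quick sanity check for that (a sketch, not something I ran) is to feed the exact prompts from the log to the original FP16 checkpoint; if those replies are also garbage, the quantization is probably not the culprit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

# Same loading code as the snippet above; only the questions differ:
# they are the three prompts that produced gibberish in the llama.cpp log.
tokenizer = AutoTokenizer.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("AlpachinoNLP/Baichuan-13B-Instruction")

for question in ["你好", "常见的水果有哪几种?", "下雨时人为什么要打伞"]:
    response = model.chat(tokenizer, [{"role": "Human", "content": question}])
    print(question, "->", response)
```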

I built a simple Space here https://huggingface.co/spaces/s3nh/Baichuan-13B-Instruction-GGML and after a few quick tests I can confirm that it generates low-quality output.

I tested with the Space; it seems to produce gibberish output as well, so it might be a problem with the original model.

Thanks for the quantization
