GGUF model: how to run it?

#3
by Nondzu - opened

Hi,
I see the GGUF files model-q6k.gguf and model-q4k.gguf. How do I run them?
It looks like the original llama.cpp does not support MADLAD?

Waiting for a solution too.

Looks like T5 support was merged into llama.cpp two days ago (2024-07-04), but these GGUF files are missing some required metadata:

$ ~/projects/llama.cpp/build/bin/llama-cli --model /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf --prompt "<2pt> I love pizza!"
Log start
main: build = 3325 (87e25a1d)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 1720258911
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
llama_model_loader: loaded meta data with 0 key-value pairs and 742 tensors from /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - type  f32:  164 tensors
llama_model_loader: - type q6_K:  578 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: ''
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf'
main: error: unable to load model
$ 
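If you want to double-check what metadata a GGUF file actually contains, the gguf Python package that ships with llama.cpp provides a small dump tool. This is only a sketch, assuming the package is installed from pip and reusing the snapshot path from the log above:

$ pip install gguf
$ gguf-dump /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf
$ # Expect an empty key-value section for these files, matching the "0 key-value pairs" in the log above.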

You'll have to convert to GGUF and quantize it yourself; then it works (tested with a Q8_0 quant, CUDA accelerated):

$ ~/projects/llama.cpp/build/bin/llama-cli --n-gpu-layers 40 --model model-q8_0.gguf --prompt "<2pt> I love pizza!"

.. skipped a lot of logs ..

sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


 Eu amo pizza! [end of text]

llama_print_timings:        load time =    6691,80 ms
llama_print_timings:      sample time =       0,74 ms /     5 runs   (    0,15 ms per token,  6784,26 tokens per second)
llama_print_timings: prompt eval time =      71,99 ms /     8 tokens (    9,00 ms per token,   111,13 tokens per second)
llama_print_timings:        eval time =     131,41 ms /     4 runs   (   32,85 ms per token,    30,44 tokens per second)
llama_print_timings:       total time =     378,46 ms /    12 tokens
Log end

Just in case someone wants to try llama.cpp with GGUF weights, I've uploaded them here:
https://huggingface.co/thirteenbit/madlad400-10b-mt-gguf/tree/main

GGUF weights were made by following this guide:
https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
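For reference, this is roughly what that guide boils down to for this model. Treat it as a sketch rather than an exact recipe: the local paths are illustrative, and it assumes the post-rename tool names (convert_hf_to_gguf.py and llama-quantize) from a llama.cpp build that already includes T5 support:

$ cd ~/projects/llama.cpp
$ # Convert the original HF checkpoint to an f16 GGUF (the model directory path is illustrative)
$ python convert_hf_to_gguf.py /path/to/madlad400-10b-mt --outtype f16 --outfile madlad400-10b-mt-f16.gguf
$ # Quantize the f16 GGUF to Q8_0 (other types such as Q6_K or Q4_K_M work the same way)
$ ./build/bin/llama-quantize madlad400-10b-mt-f16.gguf model-q8_0.gguf Q8_0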

These worked with llama-cli as shown above (a single prompt from the command line).

llama-server failed to start, and llama-cli interactive (chat) mode outputs garbage.
I have not tried other frontends.
