GGUF model: how to run it?

#3
by Nondzu - opened

Hi,
I see the GGUF files model-q6k.gguf and model-q4k.gguf. How do I run them?
It looks like the original llama.cpp does not support MADLAD?

Waiting for a solution too.

Looks like T5 support was merged into llama.cpp two days ago (2024-07-04), but these GGUF files are missing some required metadata:

$ ~/projects/llama.cpp/build/bin/llama-cli --model /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf --prompt "<2pt> I love pizza!"
Log start
main: build = 3325 (87e25a1d)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 1720258911
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
llama_model_loader: loaded meta data with 0 key-value pairs and 742 tensors from /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - type  f32:  164 tensors
llama_model_loader: - type q6_K:  578 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: ''
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf'
main: error: unable to load model
$ 
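If you want to double-check what metadata a GGUF file actually contains, the gguf Python package that ships with llama.cpp provides a small dump tool. This is only a sketch, assuming the package is installed from pip and reusing the snapshot path from the log above:

$ pip install gguf
$ gguf-dump /mnt/e/cache/huggingface/hub/models--google--madlad400-10b-mt/snapshots/9f2797629c31e69617186dbe5f0ca43bf662f36d/model-q6k.gguf
$ # Expect an empty key-value section for these files, matching the "0 key-value pairs" in the log above.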

You'll have to convert to GGUF and quantize it yourself; then it works (tested with a Q8_0 quant, CUDA accelerated):

$ ~/projects/llama.cpp/build/bin/llama-cli --n-gpu-layers 40 --model model-q8_0.gguf --prompt "<2pt> I love pizza!"

.. skipped a lot of logs ..

sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


 Eu amo pizza! [end of text]

llama_print_timings:        load time =    6691,80 ms
llama_print_timings:      sample time =       0,74 ms /     5 runs   (    0,15 ms per token,  6784,26 tokens per second)
llama_print_timings: prompt eval time =      71,99 ms /     8 tokens (    9,00 ms per token,   111,13 tokens per second)
llama_print_timings:        eval time =     131,41 ms /     4 runs   (   32,85 ms per token,    30,44 tokens per second)
llama_print_timings:       total time =     378,46 ms /    12 tokens
Log end

Just in case someone wants to try llama.cpp with GGUF weights, I've uploaded them here:
https://huggingface.co/thirteenbit/madlad400-10b-mt-gguf/tree/main

GGUF weights were made by following this guide:
https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
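For reference, this is roughly what that guide boils down to for this model. Treat it as a sketch rather than an exact recipe: the local paths are illustrative, and it assumes the post-rename tool names (convert_hf_to_gguf.py and llama-quantize) from a llama.cpp build that already includes T5 support:

$ cd ~/projects/llama.cpp
$ # Convert the original HF checkpoint to an f16 GGUF (the model directory path is illustrative)
$ python convert_hf_to_gguf.py /path/to/madlad400-10b-mt --outtype f16 --outfile madlad400-10b-mt-f16.gguf
$ # Quantize the f16 GGUF to Q8_0 (other types such as Q6_K or Q4_K_M work the same way)
$ ./build/bin/llama-quantize madlad400-10b-mt-f16.gguf model-q8_0.gguf Q8_0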

These worked with llama-cli as shown above (a single prompt from the command line).

llama-server failed to start, and llama-cli interactive (chat) mode outputs garbage.
I have not tried other frontends.
