Requesting GGUF version

#16
by Hasaranga85 - opened

Hi, can you please provide a GGUF version of this model so I can use it with Ollama?

Owner

Hi, thanks for reaching out! Yes, I can take a pass at this manually over the next few days. I tried with the gguf-my-repo space and ran into issues; if that ends up working for you, do let me know.

Check out the Q6_K in this repo.

Edit: more options here: https://hf.co/pszemraj/flan-t5-large-grammar-synthesis-gguf

pszemraj changed discussion status to closed

Tried "ggml-model-Q5_K_M.gguf" with llama.cpp and it is just repeating the system prompt.

https://hf.co/pszemraj/flan-t5-large-grammar-synthesis-gguf

Owner

Please review the demo code and/or the Colab notebook linked on the model card. This is a text2text model: it does not use a system prompt of any kind, and you cannot chat with it.

It does one thing and one thing only: whatever text you put in will be grammatically corrected (which is what it is doing with your "system prompt").
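
For reference, plain text2text usage looks roughly like this. This is a minimal sketch with the transformers pipeline, assuming the base (non-GGUF) model id pszemraj/flan-t5-large-grammar-synthesis; the demo code on the model card is the authoritative version.

# Minimal text2text sketch (assumed model id; see the model card for the supported snippet)
from transformers import pipeline

corrector = pipeline(
    "text2text-generation",
    model="pszemraj/flan-t5-large-grammar-synthesis",
)

raw_text = "There car broke down so their hitching a ride to they're class."
result = corrector(raw_text, max_length=128, num_beams=4)
print(result[0]["generated_text"])  # prints the grammatically corrected text

There is no chat template or system prompt involved: the input string goes in, the corrected string comes out.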

Thank you very much for your explanation. So this model cannot be used with llama-cli, right?

Hey - you should be able to run inference with it, but not as a "chat interface". Try just passing a prompt, which will then be corrected. An analogue would be what I have for flan-ul2 (different framework and model, but same idea).

Quick update: you can use the GGUFs with llamafile (or llama-cli) like this:

llamafile.exe -m grammar-synthesis-Q6_K.gguf --temp 0 -p "There car broke down so their hitching a ride to they're class."

and it will output the corrected text:

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0


 The car broke down so they had to take a ride to school. [end of text]


llama_print_timings:        load time =     782.21 ms
llama_print_timings:      sample time =       0.23 ms /    16 runs   (    0.01 ms per token, 68376.07 tokens per second)
llama_print_timings: prompt eval time =      85.08 ms /    19 tokens (    4.48 ms per token,   223.33 tokens per second)
llama_print_timings:        eval time =     341.74 ms /    15 runs   (   22.78 ms per token,    43.89 tokens per second)
llama_print_timings:       total time =     456.56 ms /    34 tokens
Log end
pszemraj pinned discussion

That worked. Thanks!
