Requesting GGUF version
Hi, can you please provide a GGUF version of this model so I can use it with Ollama?
Hi, thanks for reaching out! Yes, I can take a pass at this 'manually' over the next few days. I tried the gguf-my-repo space and ran into issues; if that ends up working for you, do let me know.
Check out the Q6_K in this repo (here).
Edit: more options here: https://hf.co/pszemraj/flan-t5-large-grammar-synthesis-gguf
tried "ggml-model-Q5_K_M.gguf" with llamacpp and it is repeating the system prompt.
Please review the demo code and/or the Colab notebook linked on the model card. This is a text2text model: it does not use a system prompt of any kind, and you cannot chat with it.
It does one thing and one thing only: whatever text you put in will be grammatically corrected (that is what it's doing with your "system prompt").
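For reference, the intended usage is just a text2text pipeline; something along these lines (a rough sketch of the demo code, untested here):

from transformers import pipeline

# load the grammar-correction model as a text2text pipeline
corrector = pipeline(
    "text2text-generation",
    "pszemraj/flan-t5-large-grammar-synthesis",
)

raw_text = "There car broke down so their hitching a ride to they're class."
print(corrector(raw_text)[0]["generated_text"])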
Thank you very much for your explanation. So this model cannot be used with llama-cli, right?
Quick update: you can use the GGUFs with llamafile (or llama-cli) like this:
llamafile.exe -m grammar-synthesis-Q6_K.gguf --temp 0 -p "There car broke down so their hitching a ride to they're class."
and it will output the corrected text:
system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0
The car broke down so they had to take a ride to school. [end of text]
llama_print_timings: load time = 782.21 ms
llama_print_timings: sample time = 0.23 ms / 16 runs ( 0.01 ms per token, 68376.07 tokens per second)
llama_print_timings: prompt eval time = 85.08 ms / 19 tokens ( 4.48 ms per token, 223.33 tokens per second)
llama_print_timings: eval time = 341.74 ms / 15 runs ( 22.78 ms per token, 43.89 tokens per second)
llama_print_timings: total time = 456.56 ms / 34 tokens
Log end
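If you want to script it, a small untested wrapper around the same llama-cli invocation should also work, something like this (assumes llama-cli is on your PATH and the GGUF is in the working directory; exact stdout contents can vary by build, so you may need to strip an echoed prompt or the "[end of text]" marker):

import subprocess

MODEL_PATH = "grammar-synthesis-Q6_K.gguf"

def correct(text: str) -> str:
    # run llama-cli non-interactively with greedy decoding (--temp 0),
    # using the same flags as the llamafile command above
    result = subprocess.run(
        ["llama-cli", "-m", MODEL_PATH, "--temp", "0", "-p", text],
        capture_output=True,
        text=True,
        check=True,
    )
    # generated text goes to stdout; progress/logs typically go to stderr
    return result.stdout.strip()

print(correct("There car broke down so their hitching a ride to they're class."))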
That worked. Thanks!