And where is the GGUF file itself?
Give the guy some time; new repos always get made first so the upload scripts can do their job.
It may not even be possible to convert it yet.
Sorry, I launched this last night thinking it was the exact same model as Mistral 7B, so it should all be fine. However, they are using a slightly different tokenizer.
I am helping test this PR; once it's resolved it should be pretty quick :)
https://github.com/ggerganov/llama.cpp/pull/8579
@MaziyarPanahi
Kindly let us know when the quants are ready :)
Thank you.
Of course, the PR is ready to be merged. So hopefully it will be ready today :)
The PR is merged ;)
The PR seems to be just one piece of the support for Mistral-Nemo-Instruct-2407. It may need a few more PRs.
I'll keep an eye on it and upload the quants the moment it's possible.
@MaziyarPanahi Does that mean it's just a workaround and not a fix?
It's not a workaround; it's just one part of the full solution to the Support Mistral-Nemo-Instruct-2407 128K issue.
If you use only this part, the model will start loading but then fail with a wrong-tensor-shape error, because Mistral-Nemo uses non-standard tensor shapes.
The llama.cpp team is currently working on that part of the issue.
The last PR is merged and models are being uploaded!
Can confirm that they work :3
I've tested Q4_K_S with b3437 and it's coherent up to 16K, with cache quantization too.
Nice!!!! Love to see how far we can go with the context length here! :D
Thanks for the fine quants!
I threw a friend's 450-page Ph.D. dissertation (just over ~50k tokens) at the Q8_0 and it returned a reasonable rough summary. I can almost fit 128k context on my 3090 Ti's 24GB of VRAM (had to dial it back just a bit to avoid OOM when offloading all layers).
I'll likely use this model to experiment with quickly generating summaries of medium-sized chunks of text (up to 16k or 32k tokens).
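In case it helps anyone reproduce this, here is a minimal sketch for pulling just the Q8_0 file with huggingface-cli (the local directory is only an example; adjust to your layout):

$ huggingface-cli download MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF \
    Mistral-Nemo-Instruct-2407.Q8_0.gguf \
    --local-dir ../models/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF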
Runtime
$ ./llama-server --version
version: 3441 (081fe431)
built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu
$ ./llama-server \
--model "../models/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407.Q8_0.gguf" \
--n-gpu-layers 41 \
--ctx-size 102400 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads 8 \
--flash-attn \
--mlock \
--n-predict -1 \
--host 127.0.0.1 \
--port 8080
Client Config
{
  "temperature": 0.2,
  "top_k": 40,
  "top_p": 0.95,
  "min_p": 0.05,
  "repeat_penalty": 1.1,
  "n_predict": -1,
  "seed": -1
}
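If you're hitting the server directly rather than through a UI, here's a rough sketch of passing those sampler settings to llama-server's /completion endpoint, assuming the server is running with the command above on 127.0.0.1:8080 (the prompt text is just a placeholder; see the prompt format note below):

$ curl http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "[INST] Summarize the following text in a few sentences.\n\n<paste your document here> [/INST]",
      "temperature": 0.2,
      "top_k": 40,
      "top_p": 0.95,
      "min_p": 0.05,
      "repeat_penalty": 1.1,
      "n_predict": -1
    }'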
Mixtral Prompt Format
Make sure to use the correct Mixtral prompt format, being mindful of preserving whitespace and of whether/how to fudge in a "system prompt".
With the wrong prompt format (e.g. ChatML), it sometimes evaluates the entire prompt and immediately returns an end-of-string token, generating nothing.
[INST] Just tell it what to do here without system prompt and keep the space in front. [/INST]
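If you do want system-style instructions, one common convention (treat this as an assumption, it's not something the model card spells out) is to prepend them inside the first [INST] block:

[INST] You are a concise technical assistant. Answer in bullet points.

Summarize the document below.

<document text goes here> [/INST]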
Example Timings
INFO [ print_timings] prompt eval time = 34172.36 ms / 51617 tokens ( 0.66 ms per token, 1510.49 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 n_prompt_tokens_processed=51617 t_token=0.6620369258190131 n_tokens_second=1510.489764242212
INFO [ print_timings] generation eval time = 25648.80 ms / 557 runs ( 46.05 ms per token, 21.72 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_token_generation=25648.798 n_decoded=557 t_token=46.04811131059246 n_tokens_second=21.716417276162417
INFO [ print_timings] total time = 59821.16 ms | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 t_token_generation=25648.798 t_total=59821.157999999996
Cheers!