Instructions to use nex-agi/Nex-N2-Pro with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nex-agi/Nex-N2-Pro with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nex-agi/Nex-N2-Pro")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("nex-agi/Nex-N2-Pro")
model = AutoModelForMultimodalLM.from_pretrained("nex-agi/Nex-N2-Pro")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use nex-agi/Nex-N2-Pro with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "nex-agi/Nex-N2-Pro"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nex-agi/Nex-N2-Pro",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/nex-agi/Nex-N2-Pro
- SGLang
How to use nex-agi/Nex-N2-Pro with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nex-agi/Nex-N2-Pro" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nex-agi/Nex-N2-Pro",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nex-agi/Nex-N2-Pro" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nex-agi/Nex-N2-Pro",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use nex-agi/Nex-N2-Pro with Docker Model Runner:
docker model run hf.co/nex-agi/Nex-N2-Pro
What a Monster. It is very good...
I'm DLing now, will make quants. Would you say this model is better than base Qwen3.5 397b?
Which queries did you use, so we test it in rival models?
If you are planning to make quants, be careful of the ongoing issues with the jinja template. Also, if possible, I would greatly appreciate if you could make quants for 2-bit and 3-bit, as Qwen 3.5 397B handles quantization better than most models.
Are you talking about this comment?
https://huggingface.co/nex-agi/Nex-N2-Pro/discussions/3#6a25aecfeda1ecba7f4ad2bb
There are more issues: The chat_template.jinja does not define the tags in the form expected by llama.cpp. This file needs to be patched.
Diff:
103c103
< {{- '<|im_start|>' + message.role + '\n' + content }}
---
> {{- '<|im_start|>' + message.role + '\n<think></think>' + content }}
150c150
< {{- '<think>\n\n</think>\n\n' }}
---
> {{- '<think></think>' }}
152c152
< {{- '<think>' }}
---
> {{- '<think>\n' }}
If I understand correctly, this is a sneaky attempt at inserting tags in the history AI messages even when there's no thinking in the message. This way, you can avoid stripping the current message of its empty tags when it becomes part of the history, avoiding a reprocessing of the previous message.
I used to do the same thing, until I eventually realized that this decreases coherence during long conversations (with many messages). The model was trained without empty thinking tags, and if you force them in, the output gets degraded. This is a hard-to-find problem, since it cannot be detected via ppl (which is measured without the instruct format anyway).
I suggest accepting the rebuilding of the last message. An AI message generally isn't much more than 1000 tokens, and it shouldn't take more than 4-5 seconds to reprocess it. If you still want to use these changes to avoid reprocessing the last message, I can upload a second jinja file with these modifications, and you can load it with --jinja --chat-template-file jinja/custom-no-reprocess.jinja
That said, it's going to take a little bit for the model to upload. (It's pretty big).
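To make the reprocessing trade-off above concrete, here is a minimal sketch in Python. It is not the real chat_template.jinja: the ChatML-style `<|im_end|>` terminator, the helper names, and the use of characters as a stand-in for tokens are all assumptions for illustration. It compares the text the server processed while generating a reply (with the empty think block from the unpatched template) against the text the same reply produces once it is rendered as history (no think block), and finds where the two diverge.

```python
import os

# Assumed ChatML-style markers; only <|im_start|> and the think strings
# appear in the diff above, <|im_end|> is an assumption.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
EMPTY_THINK = "<think>\n\n</think>\n\n"  # unpatched generation prefix (thinking off)


def render_turn(role, content):
    """Render one finished turn (hypothetical simplification of the template)."""
    return f"{IM_START}{role}\n{content}{IM_END}\n"


def as_generated(history, reply):
    """Text in the cache right after the model produced `reply`."""
    prompt = "".join(render_turn(r, c) for r, c in history)
    return prompt + f"{IM_START}assistant\n{EMPTY_THINK}{reply}"


def as_history(history, reply):
    """Same reply once it is part of the history: no think block."""
    prompt = "".join(render_turn(r, c) for r, c in history)
    return prompt + f"{IM_START}assistant\n{reply}"


history = [("user", "Hello")]
reply = "Hi there!"

cached = as_generated(history, reply)
replayed = as_history(history, reply)

# Everything up to the divergence point can be reused from the cache;
# everything after it (here, the whole last reply) must be reprocessed.
shared = len(os.path.commonprefix([cached, replayed]))
reusable = cached[:shared]
print(repr(reusable))
print("characters to reprocess:", len(replayed) - shared)
```

In this sketch the shared prefix ends right after the assistant header, so only the last reply is recomputed. The patched template avoids that recomputation by keeping the tags in history, at the cost of a format the model was not trained on.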
There is also another issue noted by the same user here:
https://huggingface.co/nex-agi/Nex-N2-Pro/discussions/3#6a27072e06d499f85f4c503b
I am not sure if you are aware of this one as well, just wanted to make sure you knew.
Thank you for creating and uploading quants for the community to use, I very much appreciate it!
Yeah no, thank you for telling me. I'm halfway done downloading the BF16 safetensor, will be done by tomorrow morning. I'll test everything properly to make sure it works.
That said, Qwen/Qwen3.5-397B-A17B has the same mismatch:
https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/tokenizer_config.json
https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/config.json
So it's probably ok? Qwen3.5-397B-A17B works fine for me.
Oh, if that's the case, then no worries lol, can't wait to try this out!
Which queries did you use, so we test it in rival models?
generate an svg of a pelican riding a bike
create an svg image of a capybara wearing a kimono drinking matcha tea
This is an: IQ4_XS 35B with same query. I let you judge.
Not sure if this is the right bench to show how good it is. In a quick coding test, the Nex model couldn't code a functional frequency generator (we used the q4_k_m). In writing skills, the step 3.7 is superior (in our opinion). We are expecting and hoping that we did something wrong. The original qwen 397b is not bad at all.
Any suggestion?
So, using a Q5 as well with Q8 KV cache and BF16 deltanets (uploading it in paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF). In all cases it thought very little, less than 100 tokens of thought, and it wrote the captions by itself. I have downloaded the BF16 weights already, so it takes me little time to make a specific quant. If anyone wants a quant that targets a specific RAM/VRAM size just lemme know. Mmproj is already up btw.
Generate an SVG of a car as detailed as possible:
Generate an svg of a pelican riding a bike:
Create an svg image of a capybara wearing a kimono drinking matcha tea:
@paragon-of-brah , any chance you could make something around the Q2 range? Original Qwen 397B was working surprisingly well even at such a low quant, and it's the limit of what I can fit fully in VRAM with high context.
No problem, just tell me how much VRAM you have. An IQ2_M should be around 130GB of weights, plus say 4-5 GB of KV cache and buffers, assuming around 100k context length.
That said, is there a reason for you to hold the whole model in VRAM? I only have a single 5090 and I still run a Q5. 214GB of moe weights all in RAM. I get around 10 t/s TG at 200k context filled, and around 650 t/s PP.
Not using RAM at all feels like a waste.
@Stocke How much VRAM do you have? What cards are those? If you have a bunch of Nvidia GPUs you can try using it in TabbyAPI. I made a ~2.65 bpw EXL3 quant and I could make quants down to ~2.1bpw happen.
Me personally, I have a 4090 and 192 GB system RAM, is this a config you can support in the future as well? I would like to comfortably fit full 250,000 context as well within this range.
Thank you very much for your support!
Thanks! I was using IQ2_XXS for Qwen, which is around 115GB, but IQ2_M quant should be fine since I have 148GB of total VRAM.
I prefer a lower-quality quant that fits in VRAM, as it gives me over 20 t/s TG and ~700 t/s PP. If I use tensor parallel + MTP, I can even get over 30 t/s TG (though PP takes quite a big hit). The last time I tried offloading to RAM with --ncpu-moe, I saw a relatively big performance hit, but it was quite a while ago, so I might want to experiment with it a bit more.
Still, I only have 64GB of RAM, so I could try a Q3 quant at best. I'm not sure if it's worth taking a performance hit to go from Q2 to Q3, especially since I want to keep some GBs for context checkpoints.
I have a mix of MI50s and 7900XT so in this setup I can effectively run only llama.cpp. Technically I can also run a fork of vllm but it's not worth it as the performance is considerably lower.
That config is interesting. The best you can run is probably an IQ4_KSS with q8 attention and delta net with ik_llama.cpp fork. That should take 180 GB of RAM and around 20 GB of VRAM at 262k ctx Q8. It's kind of tight RAM wise, but using mmap your computer should be able to push tensors out of RAM seamlessly the moment you need it for some other program, in case those remaining 12 GB aren't enough.
I'll also make an IQ3_M, using 168GB of RAM and the same VRAM, in case you want to use mainline llama.cpp and need more free RAM. But IQ3_M is quite a bit older and lower quality than the IQ_K quants, so the drop in quality will be a bit more noticeable at long ctx.
--ncpu-moe is a bit finicky, not great. It's probably best to just put all tensors on VRAM with -ngl 100 and then use a regex -ot "blk\.([XXX])\.ffn.*_exps.*=CPU" to override where the expert tensors go:
-ot "blk\.([0-5])\.ffn.*_exps.*=CPU"
-ngl 100
^ This will only put 5 moe layers on RAM, so the perf hit should be small. At IQ3_XXS, this still saves 11.7GB of VRAM space, reducing VRAM requirement to around 130GB. The difference in quality between IQ2_M and IQ3_XXS is quite big, especially at high ctx, so I recommend trying it. At any rate, I'll make both Q2 and Q3.
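A quick way to sanity-check a -ot pattern before loading a model is to run the regex half of it against some tensor names. The sketch below does that in Python, under two assumptions: that the expert tensors follow the usual GGUF naming (`blk.N.ffn_gate_exps.weight` and so on), and that the pattern is applied as a search over the tensor name. The tensor list is hypothetical and hand-written, not read from a real GGUF file.

```python
import re

# The part of the -ot argument before "=CPU", taken from the post above.
PATTERN = re.compile(r"blk\.([0-5])\.ffn.*_exps.*")

# Hypothetical tensor names for the first 12 blocks (assumed GGUF naming).
tensor_names = []
for i in range(12):
    tensor_names += [
        f"blk.{i}.attn_q.weight",         # attention: should stay on GPU
        f"blk.{i}.ffn_gate_inp.weight",   # router: should stay on GPU
        f"blk.{i}.ffn_gate_exps.weight",  # routed experts
        f"blk.{i}.ffn_up_exps.weight",
        f"blk.{i}.ffn_down_exps.weight",
    ]

# Collect the names the pattern would send to the CPU buffer.
on_cpu = [name for name in tensor_names if PATTERN.search(name)]

# Block indices that have at least one overridden tensor.
matched_blocks = sorted({int(PATTERN.search(n).group(1)) for n in on_cpu})
print(matched_blocks)
print(len(on_cpu), "tensors overridden")
```

Under these assumptions the character class `[0-5]` selects block indices 0 through 5, which is six blocks, and two-digit blocks such as `blk.10` are not matched because the digit must be followed directly by a dot. Narrowing the class (for example `[0-4]`) changes how many MoE layers end up in RAM.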
So, recap: I'll make
IQ5_KS (uploading, done tomorrow) - ik only
IQ3_XXS - mainline llama.cpp compatible
IQ3_M - mainline llama.cpp compatible
IQ2_M - mainline llama.cpp compatible
IQ4_KSS - ik only
Each will take about 20 hours to upload.
I used IQ2_M for Qwen 3.5 397B, and the quality I was seeing in its outputs was surprisingly wonderful.
EDIT: My apologies, I was in fact using IQ2_M for Qwen 3.5 397B, not IQ2_XXS, I just checked, so everything is fine.
I also used IQ2_XXS for Qwen 3.5 397B, and the quality I was seeing in its outputs was surprisingly wonderful. Do you think you can add that to the mainline llama.cpp compatible list? Even if it's the last one you do after the previous ones you mentioned, that would still be ok.
All right then, since it's one of the smallest ones I'll actually upload it second. That one will probably upload in about 10 hours. Uploads are in this order:
IQ5_KS (uploading, done tomorrow) - ik only
IQ2_XXS - mainline llama.cpp compatible
IQ4_KSS - ik only
IQ3_M - mainline llama.cpp compatible
IQ3_XXS - mainline llama.cpp compatible
IQ2_M - mainline llama.cpp compatible
Ahahah, nevermind then, IQ2_XXS is canceled. New order is:
IQ5_KS (uploading, done tomorrow) - ik only
IQ2_M - mainline llama.cpp compatible
IQ4_KSS - ik only
IQ3_M - mainline llama.cpp compatible
IQ3_XXS - mainline llama.cpp compatible
Small update. IQ5_KS has been uploaded. IQ2_M has been created and is being uploaded. It will probably need 15 hours to upload; the first shard should be up in maybe a few hours. The model has Q8 attention and Q8 deltanets, so total size is 134GB (2.88bpw). In hybrid inference, loaded together with mmproj for vision, it'll use 138GB of RAM and 19GB of VRAM with 200k ctx. The model thinks a lot more than IQ5_KS.
Using the IQ2_M:
Create an svg image of a capybara wearing a kimono drinking matcha tea:
Generate an svg of a pelican riding a bike:
Pretty good for Q2. Albeit tbh it's almost a Q3.
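The "almost a Q3" remark can be checked with the usual size arithmetic: file size is roughly parameter count times bits per weight, divided by 8. The sketch below is a rough estimate only. It assumes a round 397e9 parameters (taken from the model name, not from the actual tensor count) and ignores metadata and any tensors kept at other precisions.

```python
def quant_size_gib(n_params, bits_per_weight):
    """Approximate file size in GiB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 2**30


def bits_per_weight(size_gib, n_params):
    """Inverse: average bits per weight implied by a file size in GiB."""
    return size_gib * 2**30 * 8 / n_params


N_PARAMS = 397e9  # assumption: round figure from the model name

estimated = quant_size_gib(N_PARAMS, 2.88)
print(f"2.88 bpw -> about {estimated:.1f} GiB")
print(f"134 GiB  -> about {bits_per_weight(134, N_PARAMS):.2f} bpw")
```

With these assumptions 2.88 bpw comes out near 133 GiB, in line with the 134GB quoted above if that figure is read as GiB, and much closer to 3 bits per weight than to 2.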
Hi all. First off, thank you @Hunterx for the kind words and the SVG tests, and a big thank you to @paragon-of-brah , @cpral , @Trilogix1 and everyone making and sharing quants. This is awesome to see.
A couple of clarifications on the template/config points that have come up across this and #3, so nobody bakes a workaround into their quants unnecessarily:
On the chat template: please keep chat_template.jinja as-is.
The model was trained strictly on the current template, so patching the tags (adding \n after <think>, or injecting empty <think></think> into history) deviates from the training-time format and can degrade output quality, especially over long conversations. @paragon-of-brah 's instinct here is correct: forcing empty thinking tags into history hurts coherence, and reprocessing the last message is the better tradeoff. The "thinking blends into normal text" symptom that motivated the \n patch is actually a bug in llama.cpp's reasoning parser, not the template (we confirmed this in PR #7).
The clean fix is to use our patched llama.cpp, which works with the unmodified GGUF and template:
Binaries: https://github.com/nex-agi/llama.cpp/releases/tag/nex-b9596-fix-b9599-9cd1771
Docker: docker pull ghcr.io/nex-agi/llama.cpp:server-cuda-nex-b9596-fix-b9598-8c0d5c9 (more variants: https://github.com/orgs/nex-agi/packages)
We're upstreaming the patch to llama.cpp shortly and will post the PR link once it's merged.
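For anyone who wants to check whether a given server build shows the "thinking blends into normal text" symptom, one option is to request raw output and split it on the client. The sketch below is only a client-side check under assumptions: it is not llama.cpp's reasoning parser, the function name is hypothetical, and it assumes the raw text contains at most one `<think>...</think>` block.

```python
import re

# Non-greedy match of a single think block, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Return (reasoning, answer) from raw model text (hypothetical helper)."""
    match = THINK_RE.search(raw)
    if match is None:
        # No well-formed block: treat everything as the answer.
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


with_think = "<think>\nplan the reply\n</think>\n\nHello!"
without_think = "Hello!"

print(split_reasoning(with_think))
print(split_reasoning(without_think))
```

If the answer part still contains reasoning-like text or stray tags with the unmodified template, that points at the parser rather than at the template.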
Thanks for the clarification. All the quants at paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF have been made with the original Nex jinja, so if you have downloaded any of them you're good already.
Also, IQ5_KS, IQ2_M are already up, IQ4_KSS will be finished uploading in ~1 hr, 10/11 shards are already up.









