Fix thinking bug in jinja template
Without a \n after <think>, the thinking content gets mixed into the normal conversational text.
This matches how comparable projects write their Jinja templates, for example:
- https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/chat_template.jinja#L152
- https://huggingface.co/unsloth/Qwen3.5-397B-A17B/blob/main/chat_template.jinja#L155
This should resolve https://github.com/nex-agi/Nex-N2/issues/2
Hi @huaj1ng, thanks a lot for the contribution and for digging into the chat template!
To help us verify the fix, could you share a bit more detail?
A repro: the original request (rendered prompt / messages) and the raw response where the thinking content blended into the regular text.
Serving stack: were you using our recommended sglang branch, upstream sglang, or another engine, and was --reasoning-parser qwen3 enabled? The rendering of <think> can differ across stacks.
This will help us confirm the change matches the training-time format before merging. Thanks again!
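One way to capture such a repro is sketched below, assuming the model is served behind an OpenAI-compatible chat-completions endpoint. The URL, port, and helper names are hypothetical and should be adjusted to the actual deployment. The `reasoning_content` field is where servers with a reasoning parser enabled typically place the separated thinking text; it may be absent on other stacks, in which case any leaked thinking shows up in `content`.

```python
import json
import urllib.request

# Assumed values for illustration; adjust to your own deployment.
BASE_URL = "http://localhost:30000/v1/chat/completions"
MODEL = "nex-agi/Nex-N2-Pro"


def build_request(prompt: str) -> dict:
    """Build the chat-completions payload to include in the repro."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Low temperature reduces sampling variance between runs.
        "temperature": 0,
    }


def summarize_response(response: dict) -> dict:
    """Pull out the fields that show whether thinking leaked into the answer."""
    message = response["choices"][0]["message"]
    return {
        "content": message.get("content"),
        "reasoning_content": message.get("reasoning_content"),
    }


def capture_repro(prompt: str, url: str = BASE_URL) -> dict:
    """Send one request and return the request/response pair for the report."""
    payload = build_request(prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return {"request": payload, "response": summarize_response(body)}
```

Calling `capture_repro("Hello")` against a running server and pasting the returned dictionary, together with the serving command line, would cover both points requested above.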
Hi @huaj1ng, thanks for the report.
After investigation, the root cause turned out to be in llama.cpp's reasoning parser, not the template.
Adding \n after <think> does work around it, but the model was trained strictly on the current template, so deviating from it at inference time may hurt output quality. We'd rather keep the template as-is.
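To illustrate why a parser-side fix can leave the template untouched, here is a minimal sketch of two hypothetical reasoning splitters. Neither is llama.cpp's actual implementation, and the function names and sample string are made up for illustration. The strict variant only recognizes the opening tag when a newline follows it, so thinking text leaks into the answer; the tolerant variant accepts the tag either way.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_strict(text: str) -> Tuple[str, str]:
    """Hypothetical parser that requires a newline right after the opening tag."""
    marker = THINK_OPEN + "\n"
    if not text.startswith(marker) or THINK_CLOSE not in text:
        # Tag not recognized: everything is treated as the visible answer.
        return "", text
    reasoning, answer = text[len(marker):].split(THINK_CLOSE, 1)
    return reasoning.strip(), answer.strip()


def split_tolerant(text: str) -> Tuple[str, str]:
    """Hypothetical parser that accepts the tag with or without a newline."""
    if not text.startswith(THINK_OPEN) or THINK_CLOSE not in text:
        return "", text
    reasoning, answer = text[len(THINK_OPEN):].split(THINK_CLOSE, 1)
    return reasoning.strip(), answer.strip()


# Opening tag from the prompt plus generated text, with no newline after the
# tag (an illustrative string, not real model output).
raw = "<think>The user greets me.</think>Hello!"

print(split_strict(raw))    # thinking stays mixed into the answer text
print(split_tolerant(raw))  # thinking and answer are separated
```

With the tolerant behavior, the prompt format seen at inference time stays identical to the one used in training, which is the motivation given above for patching the parser rather than the template.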
We've patched llama.cpp and verified the fix with the unmodified GGUF. Builds are available now:
Binaries: https://github.com/nex-agi/llama.cpp/releases/tag/nex-b9596-fix-b9599-9cd1771
Docker: docker pull ghcr.io/nex-agi/llama.cpp:server-cuda-nex-b9596-fix-b9598-8c0d5c9 (more variants at https://github.com/orgs/nex-agi/packages)
We'll submit the patch upstream to llama.cpp shortly. Once merged, stock llama.cpp will work out of the box. We'll update this thread with the PR link.