How to use wdrones/nemo-qwen-3.5-sardine with libraries, notebooks, and local apps.

Libraries

How to use wdrones/nemo-qwen-3.5-sardine with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="wdrones/nemo-qwen-3.5-sardine", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("wdrones/nemo-qwen-3.5-sardine", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("wdrones/nemo-qwen-3.5-sardine", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
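# Render the chat template and tokenize the prompt (text + image) in one step
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))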

llama-cpp-python

How to use wdrones/nemo-qwen-3.5-sardine with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="wdrones/nemo-qwen-3.5-sardine",
	filename="low_sardine-step13530-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use wdrones/nemo-qwen-3.5-sardine with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Use Docker

docker model run hf.co/wdrones/nemo-qwen-3.5-sardine:Q4_K_M

vLLM

How to use wdrones/nemo-qwen-3.5-sardine with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "wdrones/nemo-qwen-3.5-sardine"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wdrones/nemo-qwen-3.5-sardine",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/wdrones/nemo-qwen-3.5-sardine:Q4_K_M

SGLang

How to use wdrones/nemo-qwen-3.5-sardine with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "wdrones/nemo-qwen-3.5-sardine" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wdrones/nemo-qwen-3.5-sardine",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "wdrones/nemo-qwen-3.5-sardine" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wdrones/nemo-qwen-3.5-sardine",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use wdrones/nemo-qwen-3.5-sardine with Ollama:
```
ollama run hf.co/wdrones/nemo-qwen-3.5-sardine:Q4_K_M
```

Unsloth Studio

How to use wdrones/nemo-qwen-3.5-sardine with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for wdrones/nemo-qwen-3.5-sardine to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for wdrones/nemo-qwen-3.5-sardine to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for wdrones/nemo-qwen-3.5-sardine to start chatting

Pi

How to use wdrones/nemo-qwen-3.5-sardine with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "wdrones/nemo-qwen-3.5-sardine:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use wdrones/nemo-qwen-3.5-sardine with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use wdrones/nemo-qwen-3.5-sardine with Docker Model Runner:
```
docker model run hf.co/wdrones/nemo-qwen-3.5-sardine:Q4_K_M
```

Lemonade

How to use wdrones/nemo-qwen-3.5-sardine with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull wdrones/nemo-qwen-3.5-sardine:Q4_K_M

Run and chat with the model

lemonade run user.nemo-qwen-3.5-sardine-Q4_K_M

List all available models

lemonade list

nemotron-edge-exp · low_sardine · instruction_following_sft · qwen3_5_4b_base · step 13530

⚠️ Experimental early checkpoint. This is an intermediate training checkpoint (step 13530) from an instruction-following supervised fine-tuning (SFT) run on a Qwen3.5 ~4B base model. It is not a finished, fully-evaluated release. Behavior, quality, and prompt formatting may change in later checkpoints. Use for research and experimentation only.

Model Overview

This checkpoint is a vision-language, instruction-following model based on the Qwen3.5 architecture (Qwen3_5ForConditionalGeneration). It was produced by an internal NVIDIA Nemotron "edge" experiment (codename low_sardine) that applies instruction-following SFT on top of a Qwen3.5 4B-class base model. It accepts interleaved text and images (and video frames) as input and generates text output. The chat template also supports optional reasoning (<think>) sections and tool/function calling via <tool_call> blocks.

Developer: NVIDIA (Nemotron edge experiments)
Base architecture: Qwen3.5 (model_type: qwen3_5)
Fine-tuning objective: Instruction-following SFT
Checkpoint: step13530 (intermediate)
Modality: Image + Text → Text (multimodal)
Language: English

Model Architecture

Property	Value
Architecture class	`Qwen3_5ForConditionalGeneration`
Model type	`qwen3_5` (text: `qwen3_5_text`)
Hidden size	2560
Hidden layers	32
Attention pattern	Hybrid: 3× `linear_attention` then 1× `full_attention` (full-attention every 4th layer)
Attention heads	16 (4 KV heads, GQA)
Head dim	256
Linear-attention heads	16 key / 32 value (key & value head dim 128, conv kernel dim 4)
Intermediate size	9216 (SiLU MLP)
Vocab size	248,320
Max position embeddings	262,144 (≈262K context)
RoPE	mRoPE interleaved, θ = 10,000,000, partial rotary factor 0.25
Tied embeddings	Yes
Multi-token prediction	1 MTP layer
Dtype	bfloat16
Vision encoder	24-layer ViT, hidden 1024, patch size 16, spatial merge 2 → out hidden 2560
Checkpoint size (weights)	≈ 9.3 GB on disk across 2 safetensors shards

Image preprocessing follows the Qwen2-VL style: the processor is Qwen3VLProcessor with the Qwen2VLImageProcessorFast image processor, using an image mean/std of 0.5.
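As a rough illustration of what the hybrid attention pattern in the table implies, the sketch below lays out the 32 layers and estimates the key/value cache of the full-attention layers only. It assumes bf16 cache entries (2 bytes each) and ignores the recurrent state of the linear-attention layers, the vision encoder, and any runtime overhead, so the numbers are a lower bound rather than a measured footprint. The function names are illustrative and do not come from the model code.

```python
# Values taken from the architecture table above.
NUM_LAYERS = 32
FULL_ATTENTION_INTERVAL = 4   # every 4th layer is full attention
NUM_KV_HEADS = 4
HEAD_DIM = 256
BYTES_PER_VALUE = 2           # assumption: bf16 KV cache


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return the per-layer attention type: 3x linear, then 1x full, repeated."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def kv_cache_bytes(num_tokens):
    """Estimate KV-cache bytes for the full-attention layers only."""
    full_layers = layer_types().count("full_attention")
    # 2 = one key tensor plus one value tensor per layer
    per_token = full_layers * 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


if __name__ == "__main__":
    print(layer_types()[:8])
    for tokens in (32_768, 262_144):
        print(tokens, "tokens ->", kv_cache_bytes(tokens) / 2**30, "GiB")
```

Under these assumptions the cache costs 32 KiB per token: about 1 GiB at a 32,768-token window and about 8 GiB at the full 262,144-token context, on top of the weights.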

Input / Output

Input types: Text; optionally images and video.
Input format: Chat messages (OpenAI/Qwen style) or a plain string.
Output type: Text (free-form; optional <think> reasoning; <tool_call> blocks when tools are provided).
Context length: up to 262K tokens.
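Because the output may contain an optional <think> section before the final answer, client code often wants to separate the two. The helper below is a minimal sketch that assumes the tags appear literally and balanced in the decoded text; it does not handle the case where a chat template places the opening <think> tag in the prompt so that only the closing tag shows up in the output. The name split_reasoning is hypothetical.

```python
import re

# Matches one <think>...</think> section, including multi-line content.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output.

    reasoning is the content of all <think> sections joined by newlines
    (empty string if there are none); answer is the remaining text.
    """
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
    print(split_reasoning(raw))
```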

Special tokens

Role	Token
EOS	`<
Pad	`<
Vision start / end	`<
Image / video pad	`<
Reasoning	`<think>` / `</think>`

Usage

With Transformers

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Serving with vLLM (OpenAI-compatible API)

This repository ships a generation_config.json so the model stops correctly at the end of each chat turn out of the box. The chat template terminates assistant turns with <|im_end|> (token 248046), so the generation config sets eos_token_id: [248046, 248044] (<|im_end|> and <|endoftext|>). Without this, a server would keep generating past the end of the turn.
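To make the stopping rule concrete, here is a small sketch of what "stop at the first end-of-turn token" means for a list of generated token ids. The two ids are the ones quoted above; the helper itself is illustrative only, since model.generate and vLLM apply this rule internally when the generation config is present.

```python
# Token ids quoted in the paragraph above.
EOS_TOKEN_IDS = (248046, 248044)  # <|im_end|>, <|endoftext|>


def trim_at_eos(token_ids, eos_ids=EOS_TOKEN_IDS):
    """Return the tokens before the first end-of-turn id (all of them if none)."""
    for index, token_id in enumerate(token_ids):
        if token_id in eos_ids:
            return list(token_ids[:index])
    return list(token_ids)


if __name__ == "__main__":
    # The ids 11, 22, 33 are placeholders, not real vocabulary entries.
    print(trim_at_eos([11, 22, 248046, 33]))
```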

Launch an OpenAI-compatible server:

vllm serve nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530 \
    --trust-remote-code \
    --served-model-name nemotron-edge-sardine \
    --max-model-len 32768

The server then exposes the standard OpenAI routes, e.g. POST /v1/chat/completions and POST /v1/completions:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nemotron-edge-sardine",
      "messages": [{"role": "user", "content": "Give me three tips for writing clearly."}],
      "max_tokens": 256
    }'

Use it from the OpenAI Python client by pointing base_url at the server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nemotron-edge-sardine",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

Multimodal (image) requests use the OpenAI image_url content parts, which vLLM maps onto the model's vision tokens:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nemotron-edge-sardine",
      "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]}],
      "max_tokens": 256
    }'
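For images that are not reachable by URL, a common approach with OpenAI-compatible servers is to inline the image as a base64 data URL in the same image_url content part. The sketch below only builds the message; it assumes the server accepts data URLs (vLLM's OpenAI-compatible server generally does, but verify with your version). The helper names are hypothetical.

```python
import base64
import mimetypes


def image_part_from_bytes(data, filename="image.jpg"):
    """Build an OpenAI-style image_url content part from raw image bytes."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_image_message(data, question, filename="image.jpg"):
    """Return one user message containing the inlined image and a text question."""
    return {
        "role": "user",
        "content": [
            image_part_from_bytes(data, filename),
            {"type": "text", "text": question},
        ],
    }
```

The resulting dictionary can be passed in the messages list of client.chat.completions.create, for example with the bytes read from a local file via open(path, "rb").read().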

Tool calling with vLLM

The chat template also supports tool/function calling. To expose tool calls through the OpenAI tools field, start the server with auto tool choice enabled:

vllm serve nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530 \
    --trust-remote-code \
    --served-model-name nemotron-edge-sardine \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Note: This checkpoint emits a custom XML tool-call format (<tool_call><function=...><parameter=...>...</parameter></function></tool_call>) rather than the JSON Hermes format. The built-in parsers may not parse it perfectly; you can still read the raw <tool_call> blocks from the message content, or supply a matching custom --tool-call-parser plugin. This is an instruction-following SFT checkpoint, so tool-calling reliability may be lower than a dedicated tool-calling checkpoint. Verify against your prompts.
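If the server-side parser does not populate tool_calls, the raw blocks can be read from the message content on the client. The following is a minimal sketch that assumes the XML shape described in the note, with well-formed and non-nested tags; all argument values are returned as strings without type coercion. The function name parse_tool_calls and the example tool are hypothetical.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(content):
    """Extract [{'name': ..., 'arguments': {...}}, ...] from raw message content."""
    calls = []
    for block in TOOL_CALL_RE.findall(content):
        for name, body in FUNCTION_RE.findall(block):
            # Surrounding whitespace and newlines are stripped from each value.
            arguments = {
                key: value.strip() for key, value in PARAMETER_RE.findall(body)
            }
            calls.append({"name": name, "arguments": arguments})
    return calls


if __name__ == "__main__":
    raw = (
        "<tool_call>\n<function=get_weather>\n"
        "<parameter=city>\nParis\n</parameter>\n"
        "</function>\n</tool_call>"
    )
    print(parse_tool_calls(raw))
```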

Version requirement: The qwen3_5 architecture is new. Use a vLLM build recent enough to include Qwen3_5ForConditionalGeneration support (install from main if a released version does not yet recognize model_type: qwen3_5).

Deploying on Hugging Face Inference Endpoints

This repository ships a custom handler.py implementing the EndpointHandler interface and a requirements.txt, so it can be deployed directly as a Custom Inference Endpoint (see the Inference Toolkit docs).

Example request body:

{
  "inputs": [
    {"role": "user", "content": "Summarize the water cycle for a 10-year-old."}
  ],
  "parameters": {"max_new_tokens": 256, "do_sample": false}
}

Multimodal request (image by URL or base64):

{
  "inputs": [
    {"role": "user", "content": [
      {"type": "image", "image": "https://example.com/cat.jpg"},
      {"type": "text", "text": "Describe this image."}
    ]}
  ],
  "parameters": {"max_new_tokens": 256}
}

The handler returns:

[{"generated_text": "..."}]
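A client for a deployed endpoint might look like the sketch below, which uses only the standard library. It assumes the endpoint accepts the request bodies shown above and bearer-token authentication; the endpoint URL and token are placeholders you must supply, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_payload(messages, **parameters):
    """Build the request body shape shown above: chat messages plus parameters."""
    payload = {"inputs": messages}
    if parameters:
        payload["parameters"] = parameters
    return payload


def query_endpoint(endpoint_url, token, payload, timeout=120):
    """POST the payload to the endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def first_generated_text(response):
    """Pull the text out of a [{"generated_text": "..."}] response."""
    return response[0]["generated_text"]
```

A call would then be first_generated_text(query_endpoint(url, token, build_payload(messages, max_new_tokens=256))), where url and token come from your own endpoint settings.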

Software Integration

Runtime engines: vLLM (OpenAI-compatible server, recommended for serving) and Hugging Face Transformers (custom handler / Inference Toolkit).
Recommended hardware: NVIDIA GPU with ≥ 16 GB VRAM (bf16). CPU works but is slow.
Operating system: Linux.

The qwen3_5 architecture requires a recent Transformers build (transformers>=4.57.0, or install from source). If model loading fails with an error about an unknown model_type: qwen3_5, upgrade Transformers from GitHub main.
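A quick way to check the installed version before loading the model is sketched below. It uses only the standard library and a deliberately simple comparison that treats pre-release builds such as 4.57.0.dev0 as equal to 4.57.0; the minimum version is the one stated above, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '4.57.0.dev0' into (4, 57, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def _pad(release, length=3):
    """Pad a release tuple with zeros so '4.57' compares equal to '4.57.0'."""
    return release + (0,) * (length - len(release))


def meets_minimum(installed, minimum="4.57.0"):
    """Return True if the installed version string is at least the minimum."""
    return _pad(parse_release(installed)) >= _pad(parse_release(minimum))


def transformers_is_recent_enough(minimum="4.57.0"):
    """Check the installed transformers package; False if it is not installed."""
    try:
        return meets_minimum(version("transformers"), minimum)
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("transformers recent enough:", transformers_is_recent_enough())
```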

Limitations & Responsible Use

This is an early, intermediate SFT checkpoint and has not undergone full safety, bias, or capability evaluation. Outputs may be inaccurate, incomplete, or unsafe.
Instruction-following and formatting may be inconsistent at this training step.
No benchmark results are published for this checkpoint.
Use in accordance with the governing license and NVIDIA's Trustworthy AI terms. Do not remove or circumvent safety guardrails without an appropriate substitute for your use case.

License

Governed by the NVIDIA Open Model License. Confirm the exact license terms applicable to this experimental checkpoint with the model owner before any production or commercial use.

Model Version

Checkpoint: step13530
Experiment: nemotron_edge_exp · low_sardine · instruction_following_sft · qwen3_5_4b_base

Safetensors
Model size: 5B params
Tensor type: BF16