nemotron-edge-exp · low_sardine · instruction_following_sft · qwen3_5_4b_base · step 13530
⚠️ Experimental early checkpoint. This is an intermediate training checkpoint (step 13530) from an instruction-following supervised fine-tuning (SFT) run on a Qwen3.5 ~4B base model. It is not a finished, fully evaluated release. Behavior, quality, and prompt formatting may change in later checkpoints. Use for research and experimentation only.
Model Overview
This checkpoint is a vision-language, instruction-following model based on the
Qwen3.5 architecture (Qwen3_5ForConditionalGeneration). It was produced by
an internal NVIDIA Nemotron "edge" experiment (codename low_sardine) that
applies instruction-following SFT on top of a Qwen3.5 4B-class base model. It
accepts interleaved text and images (and video frames) as input and generates
text output. The chat template also supports optional reasoning (<think>)
sections and tool/function calling via <tool_call> blocks.
- Developer: NVIDIA (Nemotron edge experiments)
- Base architecture: Qwen3.5 (model_type: qwen3_5)
- Fine-tuning objective: Instruction-following SFT
- Checkpoint: step13530 (intermediate)
- Modality: Image + Text → Text (multimodal)
- Language: English
Model Architecture
| Property | Value |
|---|---|
| Architecture class | Qwen3_5ForConditionalGeneration |
| Model type | qwen3_5 (text: qwen3_5_text) |
| Hidden size | 2560 |
| Hidden layers | 32 |
| Attention pattern | Hybrid: 3× linear_attention then 1× full_attention (full-attention every 4th layer) |
| Attention heads | 16 (4 KV heads, GQA) |
| Head dim | 256 |
| Linear-attention heads | 16 key / 32 value (key & value head dim 128, conv kernel dim 4) |
| Intermediate size | 9216 (SiLU MLP) |
| Vocab size | 248,320 |
| Max position embeddings | 262,144 (≈262K context) |
| RoPE | mRoPE interleaved, θ = 10,000,000, partial rotary factor 0.25 |
| Tied embeddings | Yes |
| Multi-token prediction | 1 MTP layer |
| Dtype | bfloat16 |
| Vision encoder | 24-layer ViT, hidden 1024, patch size 16, spatial merge 2 → out hidden 2560 |
| Weights on disk | ≈ 9.3 GB across 2 safetensors shards |
The vision tower uses a Qwen2-VL-style image processor (Qwen2VLImageProcessorFast,
Qwen3VLProcessor) with image mean/std of 0.5.
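The numbers in the table can be turned into a few quick sanity checks. The sketch below is plain arithmetic on the table values and does not load the model; the function names are illustrative, the exact position of the full-attention layers (layers 4, 8, ...) is an assumption read off the "3× linear then 1× full" description, and the image-token count assumes the image sides are already multiples of patch size × spatial merge (the real processor resizes images first).

```python
# Values copied from the architecture table above.
NUM_LAYERS = 32
FULL_ATTENTION_INTERVAL = 4  # every 4th layer is full attention
PATCH_SIZE = 16
SPATIAL_MERGE = 2


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Per-layer attention type, assuming layers 4, 8, ... are full attention."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def image_token_count(height, width, patch=PATCH_SIZE, merge=SPATIAL_MERGE):
    """Tokens passed to the language model for one image after spatial merging.

    Assumes both sides are already multiples of patch * merge.
    """
    block = patch * merge
    if height % block or width % block:
        raise ValueError(f"image sides must be multiples of {block}")
    return (height // block) * (width // block)


def normalize(pixel):
    """Map a pixel value in [0, 1] to [-1, 1] using mean = std = 0.5."""
    return (pixel - 0.5) / 0.5


if __name__ == "__main__":
    types = layer_types()
    print(types.count("full_attention"), "full-attention layers")
    print(image_token_count(448, 448), "tokens for a 448x448 image")
```

Under these assumptions, 8 of the 32 layers use full attention, and a 448×448 image becomes 28×28 patches, merged 2×2 into 196 tokens.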
Input / Output
- Input types: Text; optionally images and video.
- Input format: Chat messages (OpenAI/Qwen style) or a plain string.
- Output type: Text (free-form; optional <think> reasoning; <tool_call> blocks when tools are provided).
- Context length: up to 262K tokens.
Special tokens
| Role | Token |
|---|---|
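| EOS | `<\|im_end\|>` |
| Pad | `<\|endoftext\|>` |
| Vision start / end | `<\|vision_start\|>` / `<\|vision_end\|>` |
| Image / video pad | `<\|image_pad\|>` / `<\|video_pad\|>` |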
| Reasoning | <think> / </think> |
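When the model emits a reasoning section, client code usually wants the reasoning and the final answer separately. The following is a minimal sketch of that split using only the `<think>` / `</think>` markers listed above; the function name is illustrative, and it assumes at most one reasoning block per response and that the markers survive decoding (for example, that they are not stripped as special tokens).

```python
import re

# Matches one <think>...</think> section, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is "" when no <think> block exists."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>\n\nThe answer is 4.")
    print("reasoning:", reasoning)
    print("answer:", answer)
```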
Usage
With Transformers
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
model_id = "nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
model_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
messages = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
Serving with vLLM (OpenAI-compatible API)
This repository ships a generation_config.json so the
model stops correctly at the end of each chat turn out of the box. The chat
template terminates assistant turns with <|im_end|> (token 248046), so the
generation config sets eos_token_id: [248046, 248044] (<|im_end|> and
<|endoftext|>). Without this, a server would keep generating past the end of
the turn.
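The effect of the two EOS ids can be illustrated without a server. The sketch below is not vLLM's implementation; it only shows what stopping at the first of the two ids named above means for a sequence of generated token ids. The helper name is illustrative.

```python
# Token ids taken from the generation config described above.
IM_END_ID = 248046      # <|im_end|>
ENDOFTEXT_ID = 248044   # <|endoftext|>
EOS_TOKEN_IDS = frozenset({IM_END_ID, ENDOFTEXT_ID})


def truncate_at_eos(token_ids, eos_ids=EOS_TOKEN_IDS):
    """Return the tokens before the first EOS id, or all tokens if none appears."""
    for i, token_id in enumerate(token_ids):
        if token_id in eos_ids:
            return list(token_ids[:i])
    return list(token_ids)


if __name__ == "__main__":
    generated = [1012, 4050, IM_END_ID, 9999, 8888]
    # Tokens after <|im_end|> belong to a runaway continuation and are dropped.
    print(truncate_at_eos(generated))
```

If only `<|endoftext|>` were registered as EOS, the tokens after `<|im_end|>` in this example would be kept, which is the runaway behavior the generation config prevents.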
Launch an OpenAI-compatible server:
vllm serve nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530 \
--trust-remote-code \
--served-model-name nemotron-edge-sardine \
--max-model-len 32768
The server then exposes the standard OpenAI routes, e.g. POST /v1/chat/completions
and POST /v1/completions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-edge-sardine",
"messages": [{"role": "user", "content": "Give me three tips for writing clearly."}],
"max_tokens": 256
}'
Use it from the OpenAI Python client by pointing base_url at the server:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
model="nemotron-edge-sardine",
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
Multimodal (image) requests use the OpenAI image_url content parts, which vLLM
maps onto the model's vision tokens:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-edge-sardine",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
{"type": "text", "text": "Describe this image."}
]}],
"max_tokens": 256
}'
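For local image files, one common approach is to send the image inline as a base64 data URL in the same image_url content part. The sketch below builds such a message with the standard library only; it assumes the server accepts `data:` URLs in image_url parts, and the helper names are illustrative rather than part of any client library.

```python
import base64


def image_part_from_bytes(data, mime_type="image/jpeg"):
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_image_message(data, question, mime_type="image/jpeg"):
    """Build a single user message holding one image followed by a text question."""
    return {
        "role": "user",
        "content": [
            image_part_from_bytes(data, mime_type),
            {"type": "text", "text": question},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice read them with open(path, "rb").read().
    message = build_image_message(b"abc", "Describe this image.")
    print(message["content"][0]["image_url"]["url"])
```

The resulting dictionary can be passed as an element of `messages` to the chat completions route shown above.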
Tool calling with vLLM
The chat template also supports tool/function calling. To expose tool calls
through the OpenAI tools field, start the server with auto tool choice enabled:
vllm serve nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530 \
--trust-remote-code \
--served-model-name nemotron-edge-sardine \
--enable-auto-tool-choice \
--tool-call-parser hermes
Note: This checkpoint emits a custom XML tool-call format (<tool_call><function=...><parameter=...>...</parameter></function></tool_call>) rather than the JSON Hermes format. The built-in parsers may not parse it perfectly; you can still read the raw <tool_call> blocks from the message content, or supply a matching custom --tool-call-parser plugin. This is an instruction-following SFT checkpoint, so tool-calling reliability may be lower than a dedicated tool-calling checkpoint. Verify against your prompts.
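Reading the raw blocks from the message content can be done with a small client-side parser. The sketch below is an assumption-laden illustration, not a vLLM plugin: it assumes the output follows exactly the tag layout quoted in the note, that function and parameter names contain no whitespace or `>`, and that all parameter values can be kept as strings. The function name is hypothetical.

```python
import re

# One <tool_call> block wrapping a single <function=NAME>...</function>.
_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# One <parameter=KEY>VALUE</parameter> pair inside a function body.
_PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(content):
    """Extract tool calls from raw message content as a list of dicts.

    Values are returned as stripped strings; no type conversion is attempted.
    """
    calls = []
    for name, body in _CALL_RE.findall(content):
        arguments = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


if __name__ == "__main__":
    raw = (
        "<tool_call>\n<function=get_weather>\n"
        "<parameter=city>\nParis\n</parameter>\n"
        "</function>\n</tool_call>"
    )
    print(parse_tool_calls(raw))
```

Malformed or truncated blocks are silently skipped by this sketch, so callers that need strict behavior should check that the number of parsed calls matches the number of `<tool_call>` openings.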
Version requirement: The qwen3_5 architecture is new. Use a vLLM build recent enough to include Qwen3_5ForConditionalGeneration support (install from main if a released version does not yet recognize model_type: qwen3_5).
Deploying on Hugging Face Inference Endpoints
This repository ships a custom handler.py implementing the
EndpointHandler interface and a requirements.txt, so it can
be deployed directly as a Custom Inference Endpoint
(see the Inference Toolkit docs).
Example request body:
{
"inputs": [
{"role": "user", "content": "Summarize the water cycle for a 10-year-old."}
],
"parameters": {"max_new_tokens": 256, "do_sample": false}
}
Multimodal request (image by URL or base64):
{
"inputs": [
{"role": "user", "content": [
{"type": "image", "image": "https://example.com/cat.jpg"},
{"type": "text", "text": "Describe this image."}
]}
],
"parameters": {"max_new_tokens": 256}
}
The handler returns:
[{"generated_text": "..."}]
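A client for this request and response shape can be sketched with the standard library alone. The endpoint URL and access token below are placeholders you would replace with your own deployment's values, the helper names are illustrative, and the code assumes the handler replies with exactly the list shape shown above.

```python
import json
import urllib.request


def build_payload(messages, max_new_tokens=256, do_sample=False):
    """Build the request body in the shape shown in the examples above."""
    return {
        "inputs": messages,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": do_sample},
    }


def extract_text(response_body):
    """Pull generated_text out of a [{"generated_text": "..."}] reply."""
    return json.loads(response_body)[0]["generated_text"]


def query_endpoint(url, token, messages, **params):
    """POST the payload to an endpoint URL (placeholder) and return the text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(messages, **params)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_text(response.read())


if __name__ == "__main__":
    messages = [{"role": "user", "content": "Summarize the water cycle for a 10-year-old."}]
    print(json.dumps(build_payload(messages), indent=2))
```

Calling `query_endpoint("<your endpoint URL>", "<your token>", messages)` performs the actual request; the `__main__` block only prints the payload so it can be run without a deployment.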
Software Integration
- Runtime engines: vLLM (OpenAI-compatible server, recommended for serving) and Hugging Face Transformers (custom handler / Inference Toolkit).
- Recommended hardware: NVIDIA GPU with ≥ 16 GB VRAM (bf16). CPU works but is slow.
- Operating system: Linux.
The qwen3_5 architecture requires a recent Transformers build (transformers>=4.57.0, or install from source). If model loading fails with an unknown model_type: qwen3_5, upgrade Transformers from GitHub main.
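To relate the hardware recommendation to context length, a rough KV-cache estimate can be derived from the architecture table. The sketch below is a back-of-the-envelope calculation only: it assumes that just the full-attention layers (every 4th of 32) keep a per-token key/value cache, that the cache is stored in bf16, and it ignores the fixed-size linear-attention state, the vision encoder, activations, and any framework overhead. The function names are illustrative.

```python
def kv_cache_bytes_per_token(
    num_layers=32,
    full_attention_interval=4,
    kv_heads=4,
    head_dim=256,
    bytes_per_value=2,  # bfloat16
):
    """Bytes of KV cache per token, counting only full-attention layers."""
    full_layers = num_layers // full_attention_interval
    # Factor 2 covers the separate key and value tensors.
    return 2 * full_layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens, **kwargs):
    """Estimated KV cache size in GiB for a single sequence."""
    return context_tokens * kv_cache_bytes_per_token(**kwargs) / 2**30


if __name__ == "__main__":
    for tokens in (32768, 262144):
        print(f"{tokens} tokens: {kv_cache_gib(tokens):.1f} GiB")
```

Under these assumptions a single sequence needs about 32 KiB of cache per token, so the cache alone grows from roughly 1 GiB at 32,768 tokens to roughly 8 GiB at the full 262,144-token context, on top of the ≈ 9.3 GB of weights.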
Limitations & Responsible Use
- This is an early, intermediate SFT checkpoint and has not undergone full safety, bias, or capability evaluation. Outputs may be inaccurate, incomplete, or unsafe.
- Instruction-following and formatting may be inconsistent at this training step.
- No benchmark results are published for this checkpoint.
- Use in accordance with the governing license and NVIDIA's Trustworthy AI terms. Do not remove or circumvent safety guardrails without an appropriate substitute for your use case.
License
Governed by the NVIDIA Open Model License. Confirm the exact license terms applicable to this experimental checkpoint with the model owner before any production or commercial use.
Model Version
- Checkpoint: step13530
- Experiment: nemotron_edge_exp · low_sardine · instruction_following_sft · qwen3_5_4b_base