nemotron-edge-exp · low_sardine · instruction_following_sft · qwen3_5_4b_base · step 13530

⚠️ Experimental early checkpoint. This is an intermediate training checkpoint (step 13530) from an instruction-following supervised fine-tuning (SFT) run on a Qwen3.5 ~4B base model. It is not a finished, fully-evaluated release. Behavior, quality, and prompt formatting may change in later checkpoints. Use for research and experimentation only.

Model Overview

This checkpoint is a vision-language, instruction-following model based on the Qwen3.5 architecture (Qwen3_5ForConditionalGeneration). It was produced by an internal NVIDIA Nemotron "edge" experiment (codename low_sardine) that applies instruction-following SFT on top of a Qwen3.5 4B-class base model. It accepts interleaved text and images (and video frames) as input and generates text output. The chat template also supports optional reasoning (<think>) sections and tool/function calling via <tool_call> blocks.

  • Developer: NVIDIA (Nemotron edge experiments)
  • Base architecture: Qwen3.5 (model_type: qwen3_5)
  • Fine-tuning objective: Instruction-following SFT
  • Checkpoint: step13530 (intermediate)
  • Modality: Text, optionally with images and video → Text (multimodal)
  • Language: English

Model Architecture

| Property | Value |
|---|---|
| Architecture class | Qwen3_5ForConditionalGeneration |
| Model type | qwen3_5 (text: qwen3_5_text) |
| Hidden size | 2560 |
| Hidden layers | 32 |
| Attention pattern | Hybrid: 3× linear_attention then 1× full_attention (full attention every 4th layer) |
| Attention heads | 16 (4 KV heads, GQA) |
| Head dim | 256 |
| Linear-attention heads | 16 key / 32 value (key & value head dim 128, conv kernel dim 4) |
| Intermediate size | 9216 (SiLU MLP) |
| Vocab size | 248,320 |
| Max position embeddings | 262,144 (≈262K context) |
| RoPE | mRoPE interleaved, θ = 10,000,000, partial rotary factor 0.25 |
| Tied embeddings | Yes |
| Multi-token prediction | 1 MTP layer |
| Dtype | bfloat16 |
| Vision encoder | 24-layer ViT, hidden 1024, patch size 16, spatial merge 2 → out hidden 2560 |
| Checkpoint size (weights) | ≈ 9.3 GB on disk across 2 safetensors shards |
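
To make the hybrid attention pattern concrete, the sketch below expands the per-layer schedule from the table values and gives a rough KV-cache estimate. It assumes that only the full-attention layers keep a per-token key/value cache (the linear-attention layers are assumed to hold a fixed-size state), and it ignores the vision tower, the MTP layer, and any engine overhead, so treat the numbers as a back-of-the-envelope guide rather than a measurement.

```python
# Values copied from the architecture table; the cache model is an assumption.
NUM_LAYERS = 32
FULL_ATTENTION_INTERVAL = 4    # full attention on every 4th layer
NUM_KV_HEADS = 4
HEAD_DIM = 256
PARTIAL_ROTARY_FACTOR = 0.25
BYTES_PER_VALUE = 2            # bfloat16


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return the per-layer attention type: 3x linear, then 1x full, repeated."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def kv_cache_bytes(num_tokens):
    """Estimate KV-cache bytes, counting keys and values of full-attention layers only."""
    full_layers = layer_types().count("full_attention")
    per_token = 2 * full_layers * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


types = layer_types()
rotary_dims = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)

print(types[:4])                          # first block of four layers
print(types.count("full_attention"))      # 8
print(types.count("linear_attention"))    # 24
print(rotary_dims)                        # 64 rotary dims per head
print(kv_cache_bytes(1))                  # 32768 bytes per token
print(kv_cache_bytes(262_144) / 2**30)    # 8.0 GiB at the full context length
```

Under these assumptions the full 262,144-token context needs about 8 GiB of KV cache on top of the weights, which is one reason the serving examples cap the context with a smaller maximum length.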

Images are preprocessed for the vision tower with a Qwen2-VL-style image processor (Qwen2VLImageProcessorFast, used through Qwen3VLProcessor) with an image mean/std of 0.5.
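
As a rough illustration of what these settings imply, the sketch below normalizes a pixel with mean/std 0.5 and estimates how many visual tokens an image contributes after 16-pixel patching and 2×2 spatial merging. It assumes pixels are first rescaled from 0-255 to 0-1, and that the image has already been resized to a multiple of 32 pixels per side; the real processor performs its own resizing, so actual token counts may differ.

```python
PATCH_SIZE = 16       # from the vision encoder row of the table
SPATIAL_MERGE = 2     # 2x2 patches are merged into one visual token
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value_0_255):
    """Rescale an 8-bit value to [0, 1] (assumed), then apply mean/std 0.5."""
    return (value_0_255 / 255.0 - IMAGE_MEAN) / IMAGE_STD


def visual_token_count(height, width):
    """Estimate visual tokens for an image whose sides are multiples of 32."""
    unit = PATCH_SIZE * SPATIAL_MERGE
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    return (height // unit) * (width // unit)


print(normalize_pixel(0))             # -1.0
print(normalize_pixel(255))           # 1.0
print(visual_token_count(512, 512))   # 256
```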

Input / Output

  • Input types: Text; optionally images and video.
  • Input format: Chat messages (OpenAI/Qwen style) or a plain string.
  • Output type: Text (free-form; optional <think> reasoning; <tool_call> blocks when tools are provided).
  • Context length: up to 262K tokens.
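
The two input formats listed above can be reduced to one before templating. The helper below is a hypothetical convenience (normalize_input is not part of this repository) and assumes a plain string should be treated as a single user turn.

```python
def normalize_input(inputs):
    """Return chat messages for either a plain string or a message list.

    Assumption: a plain string is wrapped as one user turn.
    """
    if isinstance(inputs, str):
        return [{"role": "user", "content": inputs}]
    if isinstance(inputs, list) and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in inputs
    ):
        return inputs
    raise TypeError("expected a string or a list of {'role', 'content'} dicts")


print(normalize_input("Hello!"))
print(normalize_input([{"role": "user", "content": "Hello!"}]))
```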

Special tokens

Role Token
EOS `<
Pad `<
Vision start / end `<
Image / video pad `<
Reasoning <think> / </think>

Usage

With Transformers

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530"

# Load the processor (tokenizer + image processor) and the model in bfloat16.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Text-only chat: render the chat template, then tokenize the resulting prompt.
messages = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens (everything after the prompt).
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Serving with vLLM (OpenAI-compatible API)

This repository ships a generation_config.json so the model stops correctly at the end of each chat turn out of the box. The chat template terminates assistant turns with <|im_end|> (token 248046), so the generation config sets eos_token_id: [248046, 248044] (<|im_end|> and <|endoftext|>). Without this, a server would keep generating past the end of the turn.
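
The stopping rule described above can be sketched in a few lines. The token IDs are the ones quoted in this section; trim_at_eos is a hypothetical helper that mimics what a generation loop does when it treats both IDs as end-of-sequence, not code taken from the repository.

```python
import json

IM_END_ID = 248046       # <|im_end|>
ENDOFTEXT_ID = 248044    # <|endoftext|>

# The relevant fragment of generation_config.json, as described above.
generation_config = {"eos_token_id": [IM_END_ID, ENDOFTEXT_ID]}


def trim_at_eos(token_ids, eos_ids=tuple(generation_config["eos_token_id"])):
    """Return the tokens before the first EOS id, or all tokens if none occurs."""
    for index, token_id in enumerate(token_ids):
        if token_id in eos_ids:
            return list(token_ids[:index])
    return list(token_ids)


print(json.dumps(generation_config))
print(trim_at_eos([11, 22, IM_END_ID, 33]))   # [11, 22]
print(trim_at_eos([11, 22, 33]))              # [11, 22, 33]
```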

Launch an OpenAI-compatible server:

vllm serve nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530 \
    --trust-remote-code \
    --served-model-name nemotron-edge-sardine \
    --max-model-len 32768

The server then exposes the standard OpenAI routes, e.g. POST /v1/chat/completions and POST /v1/completions:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nemotron-edge-sardine",
      "messages": [{"role": "user", "content": "Give me three tips for writing clearly."}],
      "max_tokens": 256
    }'

Use it from the OpenAI Python client by pointing base_url at the server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nemotron-edge-sardine",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

Multimodal (image) requests use the OpenAI image_url content parts, which vLLM maps onto the model's vision tokens:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nemotron-edge-sardine",
      "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]}],
      "max_tokens": 256
    }'
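
For local files, the same request shape can carry the image inline as a base64 data URL instead of an https link. The sketch below only builds the JSON body; it assumes your vLLM build accepts data URLs in image_url parts (worth verifying), and the helper names are illustrative.

```python
import base64
import json


def image_part_from_bytes(data, mime_type="image/jpeg"):
    """Encode raw image bytes as an OpenAI-style image_url content part."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_image_request(image_bytes, question, model="nemotron-edge-sardine", max_tokens=256):
    """Build a /v1/chat/completions body with one image and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    image_part_from_bytes(image_bytes),
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Placeholder bytes stand in for the contents of a real image file.
body = build_image_request(b"abc", "Describe this image.")
print(json.dumps(body, indent=2))
```

In practice the bytes would come from reading an image file in binary mode, and the resulting body can be posted with curl or passed as keyword arguments to the OpenAI Python client.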

Tool calling with vLLM

The chat template also supports tool/function calling. To expose tool calls through the OpenAI tools field, start the server with auto tool choice enabled:

vllm serve nvidia/nemotron_edge_exp-low_sardine-instruction_following_sft-qwen3_5_4b_base-step13530 \
    --trust-remote-code \
    --served-model-name nemotron-edge-sardine \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Note: This checkpoint emits a custom XML tool-call format (<tool_call><function=...><parameter=...>...</parameter></function></tool_call>) rather than the JSON Hermes format. The built-in parsers may not parse it perfectly; you can still read the raw <tool_call> blocks from the message content, or supply a matching custom --tool-call-parser plugin. This is an instruction-following SFT checkpoint, so tool-calling reliability may be lower than a dedicated tool-calling checkpoint. Verify against your prompts.
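
If you read the raw blocks yourself, a small regex-based extractor is usually enough for experimentation. The sketch below assumes the layout shown in the note (one function per tool_call, parameters as nested tags) and keeps every parameter value as a stripped string; parse_tool_calls is a hypothetical helper, not something shipped with the checkpoint or with vLLM.

```python
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract tool calls from raw message content as a list of dicts."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        name, body = match.group(1), match.group(2)
        # Values stay strings; cast them against your tool schema if needed.
        arguments = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = (
    "Let me check.\n"
    "<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris', 'unit': 'celsius'}}]
```

Malformed or truncated blocks simply produce no match, so callers should treat an empty list as "no usable tool call" and fall back to the plain message content.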

Version requirement: The qwen3_5 architecture is new. Use a vLLM build recent enough to include Qwen3_5ForConditionalGeneration support (install from main if a released version does not yet recognize model_type: qwen3_5).

Deploying on Hugging Face Inference Endpoints

This repository ships a custom handler.py implementing the EndpointHandler interface and a requirements.txt, so it can be deployed directly as a Custom Inference Endpoint (see the Inference Toolkit docs).

Example request body:

{
  "inputs": [
    {"role": "user", "content": "Summarize the water cycle for a 10-year-old."}
  ],
  "parameters": {"max_new_tokens": 256, "do_sample": false}
}

Multimodal request (image by URL or base64):

{
  "inputs": [
    {"role": "user", "content": [
      {"type": "image", "image": "https://example.com/cat.jpg"},
      {"type": "text", "text": "Describe this image."}
    ]}
  ],
  "parameters": {"max_new_tokens": 256}
}

The handler returns:

[{"generated_text": "..."}]
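
A minimal client for this request and response shape might look like the sketch below. The endpoint URL and token are placeholders you would supply, and the helper names are illustrative; only the body layout and the generated_text field come from the examples above.

```python
import json
import urllib.request


def build_payload(messages, max_new_tokens=256, do_sample=False):
    """Build a request body in the shape shown above."""
    return {
        "inputs": messages,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": do_sample},
    }


def extract_text(response):
    """Pull the generated text out of the handler's list-of-dicts response."""
    return response[0]["generated_text"]


def query_endpoint(url, token, payload):
    """POST the payload to a deployed endpoint (url and token are placeholders)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    [{"role": "user", "content": "Summarize the water cycle for a 10-year-old."}]
)
print(json.dumps(payload, indent=2))
print(extract_text([{"generated_text": "..."}]))
```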

Software Integration

  • Runtime engines: vLLM (OpenAI-compatible server, recommended for serving) and Hugging Face Transformers (custom handler / Inference Toolkit).
  • Recommended hardware: NVIDIA GPU with ≥ 16 GB VRAM (bf16). CPU works but is slow.
  • Operating system: Linux.

The qwen3_5 architecture requires a recent Transformers build (transformers>=4.57.0, or install from source). If model loading fails with an unknown model_type: qwen3_5, upgrade Transformers from GitHub main.
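
A quick pre-flight check can catch an outdated install before a long model download. The sketch below compares the installed Transformers version against the minimum stated above using only the standard library; it ignores pre-release suffixes such as .dev0, which is a simplification, and the function names are illustrative.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse the leading numeric major.minor.patch parts of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="4.57.0"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def transformers_is_recent_enough(minimum="4.57.0"):
    """Check the installed transformers package; False if it is missing."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


print(meets_minimum("4.57.1"))   # True
print(meets_minimum("4.56.2"))   # False
```

Passing the version check does not guarantee that model_type: qwen3_5 is recognized, so the advice above about installing from GitHub main still applies if loading fails.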

Limitations & Responsible Use

  • This is an early, intermediate SFT checkpoint and has not undergone full safety, bias, or capability evaluation. Outputs may be inaccurate, incomplete, or unsafe.
  • Instruction-following and formatting may be inconsistent at this training step.
  • No benchmark results are published for this checkpoint.
  • Use in accordance with the governing license and NVIDIA's Trustworthy AI terms. Do not remove or circumvent safety guardrails without an appropriate substitute for your use case.

License

Governed by the NVIDIA Open Model License. Confirm the exact license terms applicable to this experimental checkpoint with the model owner before any production or commercial use.

Model Version

  • Checkpoint: step13530
  • Experiment: nemotron_edge_exp · low_sardine · instruction_following_sft · qwen3_5_4b_base
  • Model size: 5B params (Safetensors)
  • Tensor type: BF16