Pulsar 16B

Optimized for Fast and Efficient Inference · Reduced Memory Footprint

Model Overview
Key Characteristics
Quick Start
Reasoning Control
Tool Calling
Training & Fine-Tuning
Evaluation & Benchmarks
Languages
Safety & Limitations
Model Information
Citation

Model Overview

Pulsar 16B is a model developed by Multiverse Computing, based on NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The original model has ~31.6B parameters and is part of the Nemotron model family. It supports long-context inference up to 1M tokens and is designed for general-purpose language modeling tasks.

This version applies model compression techniques to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly 50% compression, reducing the total parameter count from ~31.6B to 16.15B and lowering memory requirements.
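As a rough check on these figures, the sketch below computes the parameter reduction and the weight-only memory footprint in BF16 (2 bytes per parameter). It uses the rounded parameter counts quoted in this card and ignores activations, cache state, and framework overhead, so the results are approximate.

```python
# Back-of-the-envelope arithmetic from the rounded figures in this card.
BASE_PARAMS = 31.6e9      # base model, total parameters
PULSAR_PARAMS = 16.15e9   # Pulsar 16B, total parameters
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each weight in 2 bytes


def reduction_pct(base: float, compressed: float) -> float:
    """Percentage of parameters removed by compression."""
    return 100.0 * (1.0 - compressed / base)


def weight_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weight-only size in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"parameter reduction:   {reduction_pct(BASE_PARAMS, PULSAR_PARAMS):.1f}%")
print(f"base weights (BF16):   {weight_gb(BASE_PARAMS):.1f} GB")
print(f"Pulsar weights (BF16): {weight_gb(PULSAR_PARAMS):.1f} GB")
```

With these rounded inputs the reduction comes out just under 49%, in line with the 50% compression figure quoted above.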

Key Characteristics

Characteristic	Description
Base model	nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. 31.6B total parameters, 3.6B activated per forward pass (11.34% activation ratio). NVIDIA Open Model License.
Pulsar-16B-BF16 (this model)	16.15B total parameters, 3.1B activated per forward pass (19.28% activation ratio) after CompactifAI compression.
📐 Architecture	Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint).
🛠️ Tool calling	Yes. Same tool-call structure and format as Nemotron-3-Nano-30B-A3B-BF16. See Tool Calling.
🗜️ Compression	CompactifAI (proprietary compression technology)
Primary language	English

Quick Start

This model can be loaded with the Transformers API. Use trust_remote_code=True. Recommended approach: AutoModelForCausalLM with apply_chat_template. This configuration has been tested with Transformers 4.57.6.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

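# temperature and top_p only take effect when sampling is enabled
# (do_sample=True, set here or in the model's generation config).
outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id
)
# outputs[0] holds the prompt tokens followed by the generated tokens.
print(tokenizer.decode(outputs[0]))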

Alternatively, you can use the pipeline API with trust_remote_code=True. The pipeline returns the full conversation structure, so extract the assistant message from outputs[0]["generated_text"] as needed.
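A minimal sketch of the pipeline route follows. The helper names extract_assistant_reply and run_pipeline are illustrative and not part of any library, and the loading arguments (BF16, device_map="auto") are assumptions carried over from the example above rather than a tested configuration.

```python
from typing import Any

MODEL_ID = "MultiverseComputingCAI/Pulsar-16B-BF16"


def extract_assistant_reply(outputs: list[dict[str, Any]]) -> str:
    """Return the newest assistant message from a chat-style pipeline result."""
    conversation = outputs[0]["generated_text"]
    # For chat input the pipeline returns the whole message list,
    # so walk backwards to find the latest assistant turn.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


def run_pipeline(prompt: str, max_new_tokens: int = 1024) -> str:
    """Load the model through the pipeline API and answer one prompt."""
    # Imported lazily because loading the model is slow and memory-heavy.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    messages = [{"role": "user", "content": prompt}]
    outputs = pipe(messages, max_new_tokens=max_new_tokens)
    return extract_assistant_reply(outputs)
```

Calling run_pipeline("Write a haiku about GPUs") would download and load the model, so it is left out of the block itself.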

vLLM Serving

Installation

pip install -U "vllm>=0.12.0"

Reasoning parser (NVIDIA)

Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as nano_v3_reasoning_parser.py on the base Hugging Face repo (not specific to Pulsar). Direct download:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

You can keep any local filename; the vllm serve flags below assume the file is in the current directory as nano_v3_reasoning_parser.py. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

Serve

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3

Note: The NeMo container nvcr.io/nvidia/nemo:25.11.nemotron_3_nano comes with mamba_ssm and causal-conv1d pre-installed.

Thinking (Reasoning) Control

Pulsar 16B supports a hybrid reasoning mode: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the enable_thinking flag in the chat template.

This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details, see the official Nemotron-3-Nano-30B model card: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard

Transformers API

Pass enable_thinking through apply_chat_template:

Thinking ON (default)

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,   # default — can be omitted
)

Thinking OFF

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)

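When thinking is ON, the model opens a <think> block before the answer. The reasoning can be separated from the final answer by splitting on the closing tag:

# Decode only the newly generated tokens so the prompt is not mixed into the reasoning
prompt_len = tokenized_chat.shape[-1]
output = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No closing tag found (e.g. thinking OFF): treat everything as the answer
    reasoning = ""
    answer = output.strip()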

vLLM

Server-level default

Set the default for all requests at startup with --default-chat-template-kwargs.

Requires recent versions of vLLM.

Thinking OFF for all requests

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...

Thinking ON for all requests (default if flag is omitted)

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...

Per-request override

--trust-request-chat-template is required to allow per-request overrides.

Individual requests can override the server default by passing chat_template_kwargs in the request body. This works regardless of the server-level default.

Thinking ON/OFF for one request

import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "model",
    "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
    "max_tokens": 1024,
    "temperature": 1.0,
    "chat_template_kwargs": {"enable_thinking": True},
})
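When the server runs with the reasoning parser enabled, the parsed thinking text is typically returned separately from the final answer. The sketch below shows one way to read both from the response body. The field name is an assumption: vLLM releases have exposed it as either reasoning_content or reasoning, so the helper checks both. The helper name split_reasoning and the example payload are illustrative.

```python
from typing import Any


def split_reasoning(response_json: dict[str, Any]) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response body."""
    message = response_json["choices"][0]["message"]
    # Assumption: the reasoning parser stores the thinking text under
    # "reasoning_content" or "reasoning", depending on the vLLM version.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


# Hypothetical response body, shaped like the OpenAI-compatible schema.
example = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "Factor: (x - 2)(x - 3) = 0.",
                "content": "x = 2 or x = 3",
            }
        }
    ]
}

reasoning, answer = split_reasoning(example)
print("reasoning:", reasoning)
print("answer:", answer)
```

With the request above, the same helper would be applied as split_reasoning(response.json()).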

Tool Calling

Pulsar 16B emits tool calls in the following format:

<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>
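For cases where raw model text is handled without a serving-side parser, the format above can be read with a couple of regular expressions. This is a minimal sketch, not the parser vLLM uses: the function name parse_tool_calls is illustrative, all parameter values are returned as strings, and no schema validation is performed.

```python
import re

# One <tool_call> block wrapping a single <function=NAME> ... </function>.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Each argument is written as <parameter=KEY>VALUE</parameter>.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls as {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Values stay as strings; type coercion is left to the caller.
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(sample))
```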

When serving (e.g., with vLLM), you must use the qwen3_coder tool parser:

vllm serve <model_path> \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
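With the server running, tools are passed through the OpenAI-compatible chat completions API. The sketch below builds a request body with one tool and reads structured tool calls from a response. The get_weather schema, the helper names, and the served model name "model" are assumptions for illustration; adjust them to match your deployment.

```python
import json
from typing import Any


def build_tool_request(user_message: str) -> dict[str, Any]:
    """Build a chat-completions body that offers one hypothetical tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": "model",  # assumes --served-model-name model
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def read_tool_calls(response_json: dict[str, Any]) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) for each tool call."""
    message = response_json["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the OpenAI-compatible schema, arguments arrive as a JSON string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The body from build_tool_request can be sent with requests.post(..., json=body) to /v1/chat/completions, and read_tool_calls applied to response.json(). Validate the parsed arguments before executing any tool.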

Training & Fine-Tuning

Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

The base model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the original model card for details.

CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was then applied to improve capabilities.

Evaluation & Benchmarks

Benchmark	Nemotron 3 Nano 30B A3B	Pulsar 16B	gpt-oss-20b	Qwen3-14B	Ministral-3-14B-Instruct-2512
AIME	87.66	87.22	87.66	76.00	33.00
GPQA	74.04	71.41	68.99	63.63	56.45
IFBench	72.31	70.79	68.46	39.20	32.80
MMLU-Pro	78.90	74.78	76.65	85.01	70.09
LiveCodeBench	71.11	68.04	64.65	66.35	29.84
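One way to read the table is as the share of the base model's score that Pulsar 16B retains on each benchmark. The sketch below computes that ratio from the two relevant columns. It is plain arithmetic on the reported scores and does not account for run-to-run variance.

```python
# (base score, Pulsar 16B score), copied from the table above.
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention_pct(base: float, compressed: float) -> float:
    """Compressed score as a percentage of the base score."""
    return 100.0 * compressed / base


for name, (base, pulsar) in SCORES.items():
    print(f"{name:<14} {retention_pct(base, pulsar):5.1f}% of base")
```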

Quantizations

Benchmark	Nemotron 3 Nano 30B A3B	Pulsar 16B (BF16)	Pulsar 16B (fp8)	Pulsar 16B (nvfp4)
AIME	87.66	87.22	86.67	82.00
GPQA	74.04	71.41	70.61	71.11
IFBench	72.31	70.79	69.60	69.90
MMLU-Pro	78.90	74.78	74.76	74.19
LiveCodeBench	71.11	68.04	68.68	65.60

Performance

Framework: guidellm
Inference: vLLM 0.18.0
GPU: NVIDIA L40S
Decode: temperature: 0.0, top_p: 1.0
Measurement window: Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
Workload shape: 8k/16k workload as in the original model's card.

Long Context

Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context.

Benchmark	Nemotron 3 Nano 30B A3B	Pulsar 16B
LongBench v1	31.84	29.84
AA-LCR	33.67	29.33
NIAH (@100K)	100.00	100.00
RULER (@128K)	95.02	94.20
RULER (@256K)	92.02	87.74

Evaluation Methodology

Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

Inference:

Backend: vLLM 0.18.0
Nemotron models: temp: 1.0, top_p: 1.0
GPT-OSS-20B: temp: 1.0, top_p: 1.0, reasoning_effort: high
Qwen3-14B: temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0
Ministral-3-14B-Instruct-2512: temp: 0.15

Benchmark	Framework	Repeats	Other
MMLU-Pro	NeMo-Skills	1
AIME25	NeMo-Skills	10
GPQA:d	NeMo-Skills	5
LiveCodeBench	NeMo-Skills	3
IFBench	NeMo-Skills	5
LongBench v1	lm-evaluation-harness	1
AA-LCR	EvalScope 1.4.1	3	Judge: `Qwen/Qwen3-235B-A22B-Instruct-2507`. `judge_score_type`: `pattern`. `judge_args` → `generation_config`: `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7.
NIAH	EvalScope 1.4.1	1	Judge: `qwen/qwen3-235b-a22b-2507`. `judge_model_args`: `{}` (no extra judge settings in YAML).
RULER	NeMo-Skills (+ RULER)	1

Languages

Primary language: English
Other languages: Spanish

Trained mainly on English with added Spanish. No systematic evaluation for languages outside English and Spanish.

Safety & Limitations

Known Limitations

English-centric training data (inherited from base model).
Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
Compression may affect some behaviors; evaluate for your use case.

Recommendations

Validate tool outputs before running them
Human oversight for critical use
Task-specific eval before production

Model Information

Field	Value
Model name	Pulsar 16B
Based on	NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Version	v1.5.0
Release date	TBD
Developed by	Multiverse Computing
License	Apache 2.0
Contact	business@multiversecomputing.com

Citation

If you use this model, please cite the base model and Pulsar 16B:

@misc{nemotron3nanoTR,
  title         = {NVIDIA Nemotron 3 Nano Technical Report},
  author        = {{NVIDIA}},
  year          = {2025},
  url           = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
@misc{nemotron3nanoslim16b,
  title         = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author        = {Multiverse Computing},
  year          = {2026},
  url           = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note          = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}

Built by Multiverse Computing · Report an issue · Discord
