Pulsar 16B

Powered by CompactifAI

Optimized for Fast and Efficient Inference · Reduced Memory Footprint


Model Overview

Pulsar 16B is a compressed model developed by Multiverse Computing and based on NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The base model has ~31.6B parameters and is part of the Nemotron model family. It supports long-context inference up to 1M tokens and is designed for general-purpose language modeling tasks.

This version applies model compression techniques to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly 50% compression, reducing the parameter count from 31.6B to 16.15B and lowering memory requirements accordingly.
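As a quick sanity check of the figures above, the compression ratio and a rough weight-memory estimate can be derived from the two parameter counts. The sketch below is illustrative only: the helper names are hypothetical, and the memory estimate assumes 2 bytes per parameter (BF16) and counts weights only, ignoring activations, cache state, and runtime overhead.

```python
def compression_ratio(base_params_b: float, compressed_params_b: float) -> float:
    """Fraction of parameters removed relative to the base model."""
    return 1.0 - compressed_params_b / base_params_b


def weight_memory_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    params_b is in billions of parameters, so billions * bytes-per-parameter
    gives gigabytes directly. BF16 uses 2 bytes per parameter (assumption:
    all weights are stored in BF16).
    """
    return params_b * bytes_per_param


BASE_PARAMS_B = 31.6     # base model, billions of parameters (from this card)
PULSAR_PARAMS_B = 16.15  # Pulsar 16B, billions of parameters (from this card)

ratio = compression_ratio(BASE_PARAMS_B, PULSAR_PARAMS_B)
print(f"Parameters removed:   {ratio:.1%}")
print(f"BF16 weights, base:   ~{weight_memory_gb(BASE_PARAMS_B):.1f} GB")
print(f"BF16 weights, Pulsar: ~{weight_memory_gb(PULSAR_PARAMS_B):.1f} GB")
```

With the rounded counts quoted here, the reduction works out to just under 49%, which this card rounds to 50%.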


Key Characteristics

| Characteristic | Description |
|---|---|
| Base model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. 31.6B total parameters, 3.6B activated per forward pass (11.34% activation ratio). NVIDIA Open Model License. |
| Pulsar-16B-BF16 (this model) | 16.15B total parameters, 3.1B activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 Architecture | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ Tool calling | Yes. Same tool-call structure and format as Nemotron-3-Nano-30B-A3B-BF16. See Tool Calling. |
| 🗜️ Compression | CompactifAI (proprietary compression technology) |
| Primary language | English |

Quick Start

This model can be loaded with the Transformers API. Use trust_remote_code=True. Recommended approach: AutoModelForCausalLM with apply_chat_template. This configuration has been tested with Transformers 4.57.6.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

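outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id
)
# Decodes the full sequence (prompt plus generated reply)
print(tokenizer.decode(outputs[0]))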

Alternatively, you can use the pipeline API with trust_remote_code=True. The pipeline returns the full conversation structure, so extract the assistant message from outputs[0]["generated_text"] as needed.
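A minimal sketch of the pipeline route, under stated assumptions: the helper names last_assistant_message and run_pipeline are hypothetical, the loading arguments mirror the AutoModelForCausalLM example above, and the extraction relies on the pipeline returning the conversation as a list of role/content messages with the new reply appended when it is given chat-style input.

```python
from typing import Any


def last_assistant_message(pipeline_outputs: list[dict[str, Any]]) -> str:
    """Return the assistant reply from a chat-style text-generation pipeline result."""
    conversation = pipeline_outputs[0]["generated_text"]
    # For chat input, generated_text is the message list with the new reply appended,
    # so scan from the end for the most recent assistant turn.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


def run_pipeline(prompt: str,
                 model_id: str = "MultiverseComputingCAI/Pulsar-16B-BF16") -> str:
    """Load the model through the pipeline API and return only the assistant reply."""
    import torch
    from transformers import pipeline  # imported lazily: model loading is heavy

    generator = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    outputs = generator([{"role": "user", "content": prompt}], max_new_tokens=1024)
    return last_assistant_message(outputs)
```

Calling run_pipeline("Write a haiku about GPUs") would download and load the checkpoint; last_assistant_message can be exercised on its own with any output of the same shape.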

vLLM Serving

Installation

pip install -U "vllm>=0.12.0"

Reasoning parser (NVIDIA)

Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as nano_v3_reasoning_parser.py on the base Hugging Face repo (not specific to Pulsar). Direct download:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

You can keep any local filename; the vllm serve flags below assume the file is in the current directory as nano_v3_reasoning_parser.py. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

Serve

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3

Note: The NeMo container nvcr.io/nvidia/nemo:25.11.nemotron_3_nano comes with mamba_ssm and causal-conv1d pre-installed.


Thinking (Reasoning) Control

Pulsar 16B supports a hybrid reasoning mode: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the enable_thinking flag in the chat template.

This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details, see the official Nemotron-3-Nano-30B-A3B model card: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard


Transformers API

Pass enable_thinking through apply_chat_template:

Thinking ON (default)

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,   # default — can be omitted
)

Thinking OFF

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)

When thinking is ON the model opens a <think> block before the answer.

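# Decode only the newly generated tokens so the prompt is not mixed into the result
new_tokens = outputs[0][tokenized_chat.shape[-1]:]
output = tokenizer.decode(new_tokens, skip_special_tokens=True)
# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No reasoning block found (for example, thinking was disabled)
    reasoning = ""
    answer = output.strip()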

vLLM

Server-level default

Set the default for all requests at startup with --default-chat-template-kwargs.

Requires recent versions of vLLM.

Thinking OFF for all requests

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...

Thinking ON for all requests (default if flag is omitted)

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...

Per-request override

--trust-request-chat-template is required to allow per-request overrides.

Individual requests can override the server default by passing chat_template_kwargs in the request body. This works regardless of the server-level default.

Thinking ON/OFF for one request

import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "model",
    "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
    "max_tokens": 1024,
    "temperature": 1.0,
    "chat_template_kwargs": {"enable_thinking": True},
})
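The request above can be wrapped in two small helpers, one that builds the payload with an optional per-request override and one that separates reasoning from the answer in the response. This is a sketch, not the server's documented contract: the helper names are hypothetical, and the reasoning field name is an assumption (with a reasoning parser enabled, vLLM has exposed it as reasoning_content, and the name may differ between versions), so check the response your server actually returns.

```python
from typing import Any, Optional


def build_chat_payload(prompt: str,
                       enable_thinking: Optional[bool] = None,
                       model: str = "model",
                       max_tokens: int = 1024) -> dict[str, Any]:
    """Build a /v1/chat/completions body, optionally overriding enable_thinking."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
    }
    if enable_thinking is not None:
        # Leaving the key out falls back to the server-level default.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload


def split_reply(response_json: dict[str, Any]) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion response body."""
    message = response_json["choices"][0]["message"]
    # ASSUMPTION: the reasoning field name; both spellings are tried here.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()
```

A typical call would be requests.post(url, json=build_chat_payload("Solve: x² - 5x + 6 = 0", enable_thinking=False)), followed by split_reply(response.json()).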

Tool Calling

Pulsar 16B emits tool calls in the following format:

<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>

When serving (e.g., with vLLM), you must use the qwen3_coder tool parser.

vllm serve <model_path> \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
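When the model is run without a serving-side parser (for example, directly through Transformers), the raw format shown above has to be parsed by hand. The following is a minimal illustrative sketch, not the qwen3_coder parser itself: the function name and regular expressions are assumptions, all parameter values are returned as strings without type coercion, and malformed or truncated calls are simply skipped.

```python
import re

# One <tool_call> block wrapping a single <function=NAME> ... </function> element.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Each argument is written as <parameter=KEY>VALUE</parameter>.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls as {"name": ..., "arguments": {...}} dictionaries."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


example = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(example))
```

Validate the extracted name and arguments against your tool schema before executing anything.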

Training & Fine-Tuning

Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

The base model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the original model card for details.

CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was then applied to improve capabilities.


Evaluation & Benchmarks

Combined benchmark chart

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
|---|---|---|---|---|---|
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |
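One way to read the table is as the share of the base model's score that Pulsar 16B retains on each benchmark. The sketch below computes that ratio from the scores listed above; the helper name retention_pct is hypothetical, and the ratio is only a convenience summary, not a metric reported by the authors.

```python
# Scores copied from the benchmark table in this card: (base model, Pulsar 16B).
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention_pct(base_score: float, compressed_score: float) -> float:
    """Compressed-model score expressed as a percentage of the base-model score."""
    return 100.0 * compressed_score / base_score


for name, (base, pulsar) in SCORES.items():
    print(f"{name:<14} {retention_pct(base, pulsar):5.1f}% of base score")
```

On these five benchmarks the ratio ranges from about 94.8% (MMLU-Pro) to about 99.5% (AIME).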

Quantizations

Quantization results

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (fp8) | Pulsar 16B (nvfp4) |
|---|---|---|---|---|
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |

Performance

Performance results

  • Framework: guidellm
  • Inference: vLLM 0.18.0
  • GPU: NVIDIA L40S
  • Decode: temperature: 0.0, top_p: 1.0
  • Measurement window: each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
  • Workload shape: 8k/16k workload as in the original model's card.

Long Context

Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context.

Long-context benchmark results

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
|---|---|---|
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |

Evaluation Methodology

Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

Inference:

  • Backend: vLLM 0.18.0
  • Nemotron models: temp: 1.0, top_p: 1.0
  • GPT-OSS-20B: temp: 1.0, top_p: 1.0, reasoning_effort: high
  • Qwen3-14B: temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0
  • Ministral-3-14B-Instruct-2512: temp: 0.15

| Benchmark | Framework | Repeats | Other |
|---|---|---|---|
| MMLU-Pro | NeMo-Skills | 1 | |
| AIME25 | NeMo-Skills | 10 | |
| GPQA:d | NeMo-Skills | 5 | |
| LiveCodeBench | NeMo-Skills | 3 | |
| IFBench | NeMo-Skills | 5 | |
| LongBench v1 | lm-evaluation-harness | 1 | |
| AA-LCR | EvalScope 1.4.1 | 3 | Judge: Qwen/Qwen3-235B-A22B-Instruct-2507. judge_score_type: pattern. judge_args generation_config: top_p 0.8, top_k 20, min_p 0.0, temperature 0.7. |
| NIAH | EvalScope 1.4.1 | 1 | Judge: qwen/qwen3-235b-a22b-2507. judge_model_args: {} (no extra judge settings in YAML). |
| RULER | NeMo-Skills (+ RULER) | 1 | |

Languages

  • Primary language: English
  • Other languages: Spanish

The model was trained mainly on English data, with additional Spanish data. Languages other than English and Spanish have not been systematically evaluated.

Safety & Limitations

Known Limitations

  • English-centric training data (inherited from base model).
  • Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
  • Compression may affect some behaviors; evaluate for your use case.

Recommendations

  • Validate tool outputs before running them
  • Human oversight for critical use
  • Task-specific eval before production

Model Information

Field Value
Model name Pulsar 16B
Based on NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Version v1.5.0
Release date TBD
Developed by Multiverse Computing
License Apache 2.0
Contact business@multiversecomputing.com

Citation

If you use this model, please cite the base model and Pulsar 16B:

@misc{nemotron3nanoTR,
  title         = {NVIDIA Nemotron 3 Nano Technical Report},
  author        = {{NVIDIA}},
  year          = {2025},
  url           = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
@misc{nemotron3nanoslim16b,
  title         = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author        = {Multiverse Computing},
  year          = {2026},
  url           = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note          = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}

Built by Multiverse Computing · Report an issue · Discord
