Helios Nova 306M-Instruct-2606
Helios Nova 306M-Instruct-2606 is a 306M-parameter, dense, decoder-only language model for instruction following and conversation. It is the reinforcement-learning-aligned release in the Helios Nova family: a from-scratch base model, instruction-tuned with supervised fine-tuning, then improved with Group Relative Policy Optimization (GRPO) using verifiable, rule-based rewards.
The model was developed independently and end-to-end by a single author, covering architecture, tokenizer, pre-training, post-training, and evaluation. It was designed to study capability per unit of compute at small scale, where sub-billion-parameter quality comes from architecture and data quality rather than from data volume alone.
With roughly 80× less pre-training data (50B versus ~4T tokens), Helios Nova reaches 96% of SmolLM2-360M's commonsense-reasoning score (the mean of Winogrande and PIQA), measured on an identical evaluation harness. The base model was pre-trained on 50B tokens on a single GPU for under USD 190 of compute.
The model is distributed both as GGUF files (F16, Q8_0, and Q4_K_M, for llama.cpp on CUDA, Apple Metal, Vulkan, or CPU) and as unquantized bf16 safetensors (for PyTorch). Reference chat clients are provided in the companion GitHub repository.
Highlights
- 306M dense decoder, custom architecture and 16k tokenizer, built from scratch.
- GRPO-aligned: the constraint-following pass-rate, used here as the measure of instruction following, improved by 18.3 percentage points over the SFT baseline with no measurable capability regression.
- Data-efficient: 96% of SmolLM2-360M commonsense reasoning at ~80× fewer pre-training tokens.
- Low cost: base pre-training under USD 190 on a single H100; post-training on a single consumer iGPU.
- Runs anywhere: pure-PyTorch path (any OS/CPU) and GGUF/llama.cpp path (CUDA / Metal / Vulkan / CPU).
Usage
The reference clients live in the GitHub repository and download these weights automatically on first run.
git clone https://github.com/rafaelespinosamena/Helios-Nova-306M-Instruct-2606.git
cd Helios-Nova-306M-Instruct-2606
PyTorch (any operating system, CPU or GPU, no system dependencies):
pip install -r requirements.txt
python chat.py
GGUF via llama.cpp (fastest; CUDA, Apple Metal, AMD/Intel Vulkan, or CPU):
# install llama.cpp once — macOS: `brew install llama.cpp`;
# otherwise download a release for your backend from github.com/ggml-org/llama.cpp/releases
python instruct_chat.py # F16 (default, full quality)
python instruct_chat.py --model q8 # Q8_0, near-lossless, ~2x smaller
python instruct_chat.py --model q4 # Q4_K_M, smallest and fastest (CPU / edge)
Both clients apply the exact training chat template and stop sequences, so generation terminates cleanly at the end of each turn.
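To illustrate what a stop sequence does, here is a minimal sketch of cutting generated text at the earliest stop string. The function name `truncate_at_stop` and the marker strings are hypothetical placeholders; the actual template and stop strings are defined by the reference clients and are not reproduced here.

```python
from typing import Iterable


def truncate_at_stop(text: str, stop_sequences: Iterable[str]) -> str:
    """Return `text` cut just before the earliest stop sequence it contains."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            # An empty stop string would match at index 0, so skip it.
            continue
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# PLACEHOLDER markers for illustration only, not the model's real template.
raw = "Paris.<|end|>\nUser: and Germany?"
reply = truncate_at_stop(raw, ["<|end|>", "\nUser:"])
print(reply)
```

Without such a cut, a small model tends to keep generating past the end of its turn and start writing the next user message itself.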
Files
| File | Size | Description |
|---|---|---|
| Helios-Nova-306M-Instruct-2606-F16.gguf | 584 MB | Full precision (default) |
| Helios-Nova-306M-Instruct-2606-Q8_0.gguf | 311 MB | Near-lossless |
| Helios-Nova-306M-Instruct-2606-Q4_K_M.gguf | 179 MB | Smallest and fastest (CPU, edge) |
| model.safetensors (+ config.json, HeliosNova.py, tokenizer) | 645 MB | bf16 weights for PyTorch |
Model architecture
| Component | Value |
|---|---|
| Parameters | 305.8M (dense) |
| Layers / hidden size | 24 / 1024 (depth-over-width, following the MobileLLM finding for sub-500M models) |
| Attention | Grouped-Query Attention — 16 query heads, 4 key-value heads, head dimension 64 |
| Feed-forward | SwiGLU, intermediate size 3072 |
| Positional encoding / norm | RoPE (theta 10,000), QK-Norm, RMSNorm (pre-norm), tied input/output embeddings |
| Tokenizer / context | Custom 16k BPE / 2048 tokens |
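As a sanity check, the parameter count can be approximately rebuilt from the table. The sketch below is not the author's code: it assumes that "16k" means exactly 16,000 vocabulary entries, that the linear layers have no biases, and that each layer carries two RMSNorm weight vectors plus per-head-dimension QK-Norm weights. Under those assumptions the total comes to about 305.8M. The second function estimates the key-value cache size, which is where Grouped-Query Attention saves memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeliosConfig:
    # Values taken from the architecture table.
    n_layers: int = 24
    d_model: int = 1024
    n_q_heads: int = 16
    n_kv_heads: int = 4
    head_dim: int = 64
    d_ff: int = 3072
    max_context: int = 2048
    # ASSUMPTION: "16k" is read as exactly 16,000 entries.
    vocab_size: int = 16_000


def count_parameters(cfg: HeliosConfig) -> int:
    q_dim = cfg.n_q_heads * cfg.head_dim
    kv_dim = cfg.n_kv_heads * cfg.head_dim
    # Q and output projections are full width; K and V are narrower under GQA.
    attention = cfg.d_model * q_dim + 2 * cfg.d_model * kv_dim + q_dim * cfg.d_model
    # SwiGLU uses three matrices: gate, up, and down.
    feed_forward = 3 * cfg.d_model * cfg.d_ff
    # ASSUMPTION: two RMSNorm vectors per layer plus Q and K norm vectors.
    norms = 2 * cfg.d_model + 2 * cfg.head_dim
    per_layer = attention + feed_forward + norms
    # Tied embeddings are counted once; one final RMSNorm is assumed.
    return cfg.n_layers * per_layer + cfg.vocab_size * cfg.d_model + cfg.d_model


def kv_cache_bytes(cfg: HeliosConfig, n_tokens: int, bytes_per_value: int = 2) -> int:
    # Keys and values (factor 2) for every layer, KV head, and cached token.
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * n_tokens * bytes_per_value


cfg = HeliosConfig()
total = count_parameters(cfg)
cache = kv_cache_bytes(cfg, cfg.max_context)
print(f"{total / 1e6:.1f}M parameters, {cache / 2**20:.0f} MiB KV cache at full context")
```

With 4 key-value heads instead of 16, the 16-bit cache at the full 2048-token context is a quarter of what standard multi-head attention would need.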

Training
Pre-training (base model)
The base model, Helios-Nova-306M, was pre-trained on 50B tokens of FineWeb-Edu on a single NVIDIA H100 in under 120 hours, for under USD 190. It uses a Warmup-Stable-Decay (WSD) learning-rate schedule with fused AdamW, bf16, and torch.compile. The validation loss decreases throughout the stable phase and drops sharply during the final decay.
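A Warmup-Stable-Decay schedule can be sketched as a three-phase function of the step index. This is an illustration only: the linear decay shape and all numbers in the example call are assumptions, since the actual hyperparameters are not stated here.

```python
def wsd_learning_rate(
    step: int,
    total_steps: int,
    warmup_steps: int,
    decay_steps: int,
    peak_lr: float,
    final_lr: float = 0.0,
) -> float:
    """Learning rate at `step` under a Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: ASSUMPTION of a linear ramp down to final_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


# Hypothetical numbers, chosen only to show the three phases.
for s in (0, 99, 500, 900, 1000):
    print(s, wsd_learning_rate(s, total_steps=1000, warmup_steps=100,
                               decay_steps=200, peak_lr=1e-3))
```

The sharp drop in validation loss during the final phase corresponds to the last branch, where the learning rate falls from its peak toward `final_lr`.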
Post-training (this model)
The post-training pipeline — supervised fine-tuning, Direct Preference Optimization (DPO), and GRPO — was implemented from scratch in pure PyTorch and run on a single AMD Strix Halo iGPU (ROCm, gfx1151), without TRL or bitsandbytes.
- Supervised fine-tuning on smol-smoltalk with prompt masking. At 306M parameters, multi-epoch SFT induces catastrophic forgetting of base knowledge; training is stopped at approximately 0.5 epochs, at the point that balances instruction-following against retained general knowledge.

- Preference optimization. On-policy DPO preserved benchmark accuracy but did not improve held-out generation quality, because at this scale self-sampled candidates carry a weak preference signal. The objective was therefore changed to GRPO with verifiable, rule-based rewards (programmatically checkable instructions), which targets a capability the model can reliably improve. Constraint-following pass-rate rises smoothly during training while the KL divergence from the reference policy stays bounded.
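The core of GRPO with verifiable rewards can be sketched in a few lines: sample several responses to the same prompt, score each with programmatic checks, and normalize the scores within the group. The two reward rules, the example responses, and the use of the population standard deviation below are illustrative assumptions, not the author's actual reward functions or training code.

```python
import statistics


def reward_max_words(response: str, limit: int) -> float:
    # Rule-based check: 1.0 if the response respects the word limit.
    return 1.0 if len(response.split()) <= limit else 0.0


def reward_contains(response: str, keyword: str) -> float:
    # Rule-based check: 1.0 if the required keyword appears.
    return 1.0 if keyword.lower() in response.lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Each sample is scored relative to the other samples for the same prompt.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical prompt: "Answer in at most five words and mention Paris."
responses = [
    "Paris is the capital.",
    "The capital city of France is Paris, of course.",
    "I am not sure about that one, sorry.",
    "It is Lyon.",
]
rewards = [
    (reward_max_words(r, 5) + reward_contains(r, "Paris")) / 2 for r in responses
]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the rewards are checked by rules rather than judged by the model's own samples, the signal stays informative even when all candidates are of similar overall quality. A full implementation would use these advantages to weight a clipped policy-gradient loss with a KL penalty toward the reference policy.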
Evaluation
Base model: data efficiency
All models below were re-run through one identical lm-evaluation-harness configuration (0-shot), so the comparison is internally consistent; these figures therefore differ slightly from each model's published numbers.

| Metric (0-shot) | Helios-Nova-306M (50B tokens) | SmolLM2-360M (~4T tokens) | Qwen2.5-0.5B (~18T tokens) |
|---|---|---|---|
| Winogrande | 57.2 | 57.9 | 56.3 |
| PIQA | 68.1 | 72.6 | 70.6 |
| OpenBookQA | 34.4 | 37.6 | 35.4 |
| HellaSwag | 44.7 | 52.5 | 49.5 |
| ARC (avg) | 42.8 | 53.4 | 45.5 |
| MMLU | 24.3 | 25.3 | 47.6 |
| Commonsense reasoning (mean of Winogrande and PIQA) | 62.65 | 65.25 | 63.45 |
Helios Nova reaches 96.0% of SmolLM2-360M's commonsense-reasoning score (62.65 versus 65.25) with roughly 80× less pre-training data, and nearly matches it on Winogrande (57.2 versus 57.9, or 99%). On MMLU it scores 24.3 against SmolLM2-360M's 25.3; both sit near the 25% random-chance floor at this scale, so the result indicates parity rather than mastery. The model trails on tasks bounded by data volume: broad factual recall (TriviaQA) and exam-style knowledge, where Qwen2.5-0.5B's much larger curated corpus is decisive (47.6 on MMLU). Helios Nova is data-efficient, not knowledge-rich.
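The headline ratios can be reproduced directly from the table. The short check below uses only the Winogrande and PIQA scores and the approximate token counts given in the column headers; the variable names are illustrative.

```python
# Scores and approximate pre-training token counts from the table above.
helios = {"winogrande": 57.2, "piqa": 68.1, "tokens": 50e9}
smollm2 = {"winogrande": 57.9, "piqa": 72.6, "tokens": 4e12}

# Commonsense reasoning is the mean of the two benchmark scores.
helios_cs = (helios["winogrande"] + helios["piqa"]) / 2
smollm_cs = (smollm2["winogrande"] + smollm2["piqa"]) / 2

relative = helios_cs / smollm_cs
data_ratio = smollm2["tokens"] / helios["tokens"]
winogrande_ratio = helios["winogrande"] / smollm2["winogrande"]

print(f"commonsense: {helios_cs:.2f} vs {smollm_cs:.2f} ({relative:.1%})")
print(f"pre-training data ratio: {data_ratio:.0f}x")
print(f"Winogrande: {winogrande_ratio:.0%}")
```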

Post-training: SFT to GRPO
Each checkpoint was evaluated on the same seeded harness across three axes: capability retention, constraint-following pass-rate, and pairwise generation win-rate.
| Stage | Capabilities (avg MC) | Constraint-following | Win-rate vs SFT |
|---|---|---|---|
| SFT (baseline) | 0.371 | 39.1% | — |
| GRPO (this model) | 0.371 | 57.4% (+18.3 pp) | 52.7% (no regression) |
Intended use and limitations
Helios Nova 306M-Instruct-2606 is suitable for general conversation, instruction following, commonsense reasoning, format- and constraint-following, and on-device or CPU inference. It is a strong base for further fine-tuning, quantization, and compression research.
It is not suitable as a source of factual knowledge. A 306M-parameter model trained on 50B tokens of educational text has limited world knowledge: it performs poorly on broad factual recall (TriviaQA) and near chance on exam-style benchmarks (MMLU). Outputs may be inaccurate or outdated and should be verified before use; the model is not appropriate for high-stakes decisions. The model is English-only.
The Helios Nova family
| Model | Description |
|---|---|
| Helios-Nova-306M | From-scratch base model (50B tokens) |
| Helios-Nova-306M-Instruct | Original SFT instruction model (PyTorch) |
| Helios-Nova-306M-Instruct-GGUF | GGUF build of the SFT instruction model |
| Helios-Nova-306M-Instruct-2606 (this model) | GRPO-aligned instruction model; GGUF and safetensors |
Citation
@misc{espinosamena2026heliosnova2606,
title = {Helios Nova 306M-Instruct-2606: data-efficient pre-training and verifiable-reward GRPO on a single iGPU},
author = {Espinosa Mena, Rafael},
year = {2026},
howpublished = {\url{https://huggingface.co/respinosamena/Helios-Nova-306M-Instruct-2606}}
}
Contact
Rafael Espinosa Mena — rafaelespinosamena@gmail.com
License
Released under the Apache-2.0 license. Copyright 2026 Rafael Espinosa Mena.