Susono-10B-A1B-Thinking


Susono-10B-A1B-Thinking is a reasoning model created by post-training Susono-10B-A1B-Base with SFT and DPO. It emits its reasoning inside a <think>...</think> block before the final answer. The architecture is custom: a hybrid backbone of Full Attention + GatedDeltaNet + MoE, extended with Engram (a conditional memory module) and mHC-lite (Manifold-Constrained Hyper-Connections Lite). The model has about 10B total parameters, of which about 1B are active per token (A1B).

The model was trained on NVIDIA GH200 Grace Hopper Superchips. Dedicated fused kernels were written for Engram and mHC-lite, and training combined FP8 with CPU offload to take advantage of the GH200's coherent CPU-GPU memory.

This model was developed as a personal hobby project and funded privately. The total development cost was only about USD 1,875 (roughly JPY 300,000), so both pre-training and post-training are limited in scale and the model should be considered undertrained.

⚠️ This is a thinking model post-trained for reasoning. Apply the chat template when generating responses. The output begins with a reasoning process starting from <think>, and the final answer follows after </think>.

We assume no responsibility for the model's outputs. Use it at your own risk.

Model Overview

Item	Details
Base model	Susono-10B-A1B-Base
Post-training	SFT + DPO
Output format	Reasoning process emitted inside `<think>...</think>`
Architecture	Hybrid of Full Attention + GatedDeltaNet + Sparse MoE, with Engram + mHC-lite
Total parameters	~10B
Active parameters per token	~1B (A1B)
Vocabulary size	151,680
Max context length	262,144 tokens (trained on sequences of up to 16,384 tokens)
Training stack	Extended Megatron-LM (FP8 training + CPU offload)
Training environment	Supercomputer Miyabi (NVIDIA GH200 × 16)

Reference papers:

Engram: arXiv:2601.07372v1 "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
mHC-lite: arXiv:2601.05732v1 "mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations"

Architecture

Full Attention + GatedDeltaNet: A hybrid configuration that uses full softmax attention in every 4th layer (full_attention_interval=4) and GatedDeltaNet (linear attention) in the remaining layers.
Sparse MoE: All FFN layers are MoE (96 experts, 4 active per token).
Engram (conditional memory): O(1) lookup into static embeddings via N-gram hashing. It directly retrieves local, repetitive patterns and frees up attention for global-context processing. Inserted at layers 3 and 7, it serves as the primary store of factual knowledge.
mHC-lite (multi-stream residual connections): Dynamic residual connections across multiple streams. Leveraging the Birkhoff–von Neumann theorem, it builds the stream-mixing matrix as a convex combination of permutation matrices, which is doubly stochastic by construction, so no Sinkhorn-Knopp iterations are needed.
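The Engram lookup described in the list above can be illustrated with a small sketch. This is not the model's implementation: the polynomial hash, the bucket count `TABLE_SIZE`, the toy `EMBED_DIM`, the choice to sum 2-gram and 3-gram rows, and the function names are all assumptions made for illustration (the real module also uses multiple heads, which this sketch omits). It only shows why the cost per position is constant: each N-gram is hashed to a bucket and its row is read directly.

```python
import random

MAX_NGRAM_SIZE = 3   # matches max_ngram_size=3 in the settings table
TABLE_SIZE = 4096    # hypothetical bucket count, chosen small for the demo
EMBED_DIM = 8        # toy width; the card lists embed_dim=672


def build_table(table_size=TABLE_SIZE, dim=EMBED_DIM, seed=0):
    """Create a static embedding table (random here; learned in a real model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(table_size)]


def ngram_bucket(ngram, table_size=TABLE_SIZE):
    """Hash a tuple of token ids to a bucket index (assumed polynomial hash)."""
    h = 0
    for tok in ngram:
        h = (h * 1_000_003 + tok + 1) % table_size
    return h


def engram_lookup(token_ids, table, max_n=MAX_NGRAM_SIZE):
    """For each position, sum the rows of the 2..max_n-grams that end there."""
    dim = len(table[0])
    out = []
    for pos in range(len(token_ids)):
        vec = [0.0] * dim
        for n in range(2, max_n + 1):
            if pos + 1 < n:
                continue  # not enough left context for an n-gram
            ngram = tuple(token_ids[pos + 1 - n : pos + 1])
            row = table[ngram_bucket(ngram, len(table))]
            vec = [a + b for a, b in zip(vec, row)]
        out.append(vec)
    return out


if __name__ == "__main__":
    table = build_table()
    memory = engram_lookup([5, 9, 7, 5, 9], table)
    print(len(memory), "positions,", len(memory[0]), "dims each")
```

The work per position depends only on `max_n`, not on sequence length, which is the sense in which such a lookup is O(1).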

Module	Key settings
MoE	num_experts=96, num_experts_per_tok=4, moe_intermediate_size=512
Engram	max_ngram_size=3, embed_dim=672, n_head=8, layer_ids=[3, 7]
mHC-lite	num_streams=4 (4! = 24 permutation matrices)
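To make the mHC-lite setting concrete, the sketch below builds a 4×4 stream-mixing matrix as a softmax-weighted sum of the 24 permutation matrices. By the Birkhoff–von Neumann theorem, any such convex combination is doubly stochastic, so no iterative normalization is required. The function names and the use of a plain softmax over free logits are assumptions for illustration; the model's fused kernel may be organized differently.

```python
import itertools
import math


def permutation_matrices(n):
    """Return all n! permutation matrices of size n x n as nested lists."""
    mats = []
    for perm in itertools.permutations(range(n)):
        mats.append([[1.0 if perm[r] == c else 0.0 for c in range(n)]
                     for r in range(n)])
    return mats


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def doubly_stochastic(logits, n):
    """Mix the permutation matrices with softmax weights (a convex combination)."""
    perms = permutation_matrices(n)
    if len(logits) != len(perms):
        raise ValueError(f"expected {len(perms)} logits, got {len(logits)}")
    weights = softmax(logits)
    mix = [[0.0] * n for _ in range(n)]
    for w, p in zip(weights, perms):
        for r in range(n):
            for c in range(n):
                mix[r][c] += w * p[r][c]
    return mix


if __name__ == "__main__":
    n = 4  # num_streams=4 from the settings table
    logits = [0.1 * i for i in range(math.factorial(n))]  # arbitrary demo values
    mix = doubly_stochastic(logits, n)
    print("row sums:", [round(sum(row), 6) for row in mix])
    print("col sums:", [round(sum(mix[r][c] for r in range(n)), 6) for c in range(n)])
```

Every row and column sums to 1 for any choice of logits, because the weights are non-negative and sum to 1 and each permutation matrix has exactly one 1 per row and per column.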

Training Environment

NVIDIA GH200 Grace Hopper Superchip

The GH200 is a heterogeneous superchip that directly connects a Grace CPU (Arm Neoverse V2 / 72 cores) and a Hopper GPU (H100-class / 96 GB HBM3) via NVLink-C2C (900 GB/s bidirectional, about 7× the bandwidth of a PCIe Gen5 x16 link). Hardware-level memory coherency lets the CPU and GPU access each other's memory without page migration, making full-scale CPU offload practical.

Training Framework

Based on Megatron-LM, extended for Susono as follows:

Triton Fused Kernels: Fuse operations such as Engram lookup, mHC width connection, GatedDeltaNet decay, MoE router, RMSNorm variants, aux loss, and cross entropy. Every kernel includes a PyTorch fallback.
FP8 training + CPU offload: Parameters are kept in FP8 (e4m3), while the Adam optimizer state and master weights (BF16) are offloaded to CPU memory over NVLink-C2C.

Training Schedule

Phase	Context length (tokens)	Target tokens	Global batch size (GBS)	Learning rate
Phase 1: Pre-training	4,096	300B	1,024	2.0e-4
Phase 2: Mid-training	16,384	250B	256	2.0e-4
Phase 3: SFT	16,384	-	128	2.0e-5
Phase 4: DPO	16,384	-	32	1.0e-6
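The schedule above implies a tokens-per-step budget that can be checked with simple arithmetic. The sketch below assumes that GBS counts sequences and that every sequence is packed to the full context length; neither is stated in the table, and the function names are illustrative.

```python
import math

# Values copied from the training schedule table (Phases 1 and 2 only,
# since the SFT and DPO rows list no target token count).
SCHEDULE = {
    "Phase 1: Pre-training": {"seq_len": 4096, "gbs": 1024, "target_tokens": 300_000_000_000},
    "Phase 2: Mid-training": {"seq_len": 16384, "gbs": 256, "target_tokens": 250_000_000_000},
}


def tokens_per_step(seq_len, gbs):
    """Tokens consumed per optimizer step, assuming fully packed sequences."""
    return seq_len * gbs


def steps_needed(seq_len, gbs, target_tokens):
    """Number of optimizer steps required to reach the target token count."""
    return math.ceil(target_tokens / tokens_per_step(seq_len, gbs))


if __name__ == "__main__":
    for phase, cfg in SCHEDULE.items():
        per_step = tokens_per_step(cfg["seq_len"], cfg["gbs"])
        steps = steps_needed(**cfg)
        print(f"{phase}: {per_step:,} tokens/step, {steps:,} steps")
```

Under these assumptions both phases process 4,194,304 tokens per step: the 4× longer context in Phase 2 is offset by a 4× smaller batch.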

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Susono-10B-A1B-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    repetition_penalty=1.05,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse thinking content
try:
    # find 151668 (</think>) from the end
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
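When only the decoded text is available (for example, a response returned by a server), the reasoning can be separated from the answer by splitting on the closing tag instead of searching for a token id. The helper below is a minimal sketch; `split_thinking` is a hypothetical name, and it assumes the decoded string still contains the literal <think> and </think> tags. Like the token-based code above, it treats the whole output as the answer when no closing tag is found.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded model output into (thinking, answer).

    Splits on the last occurrence of close_tag. If the tag is missing,
    returns an empty thinking string and the full text as the answer.
    """
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        return "", text.strip()
    thinking = head.strip()
    if thinking.startswith(open_tag):
        thinking = thinking[len(open_tag):]
    return thinking.strip(), tail.strip()


if __name__ == "__main__":
    sample = "<think>\nThe user asks for the capital of France.\n</think>\n\nParis."
    thinking, answer = split_thinking(sample)
    print("thinking content:", thinking)
    print("content:", answer)
```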

Reference Repositories

HuggingFace transformers implementation: https://github.com/puwaer/transformers.git (main branch)
Megatron-LM implementation: https://github.com/puwaer/Megatron-LM.git (main branch)
SGLang implementation: https://github.com/puwaer/sglang.git (sglang-v0.5.10-add-suson-model branch)
vLLM implementation: https://github.com/puwaer/vllm.git (vllm-v0.19.1-add-suson-model branch)