Susono-10B-A1B-Thinking


Susono-10B-A1B-Thinking is a reasoning model obtained by post-training Susono-10B-A1B-Base with SFT and DPO. It emits its reasoning inside a <think>...</think> block before the final answer. The model uses an original architecture with about 10B total parameters, of which about 1B are active per token (A1B). It integrates Engram (a conditional memory module) and mHC-lite (Manifold-Constrained Hyper-Connections Lite) into a hybrid backbone of Full Attention + GatedDeltaNet + MoE.

Training ran on NVIDIA GH200 Grace Hopper Superchips. Dedicated fused kernels were implemented for Engram and mHC-lite, and training combined FP8 with CPU offload to take advantage of the GH200's tightly coupled CPU-GPU design.

Note that this model was developed purely as a personal hobby project and funded privately. The development cost was only about USD 1,875 (roughly JPY 300,000), so please be aware that neither pre-training nor post-training has been carried out at a sufficient scale.

⚠️ This is a thinking model post-trained for reasoning. Apply the chat template when generating responses. The output begins with a reasoning process starting from <think>, and the final answer follows after </think>.

We assume no responsibility for the model's outputs. Use it at your own risk.

Model Overview

| Item | Details |
| --- | --- |
| Base model | Susono-10B-A1B-Base |
| Post-training | SFT + DPO |
| Output format | Reasoning process emitted inside <think>...</think> |
| Architecture | Hybrid of Full Attention + GatedDeltaNet + Sparse MoE, with Engram + mHC-lite |
| Total parameters | ~10B |
| Active parameters per token | ~1B (A1B) |
| Vocabulary size | 151,680 |
| Max context length | 262,144 (up to 16,384 during training) |
| Training stack | Extended Megatron-LM (FP8 training + CPU offload) |
| Training environment | Supercomputer Miyabi (NVIDIA GH200 × 16) |

Reference papers:

  • Engram: arXiv:2601.07372v1 "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
  • mHC-lite: arXiv:2601.05732v1 "mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations"

Architecture

  • Full Attention + GatedDeltaNet: A hybrid configuration that uses full softmax attention in every fourth layer (full_attention_interval=4) and GatedDeltaNet (linear attention) in the remaining layers.
  • Sparse MoE: Every FFN layer is an MoE layer with 96 experts, 4 of which are active per token.
  • Engram (conditional memory): O(1) lookup into static embeddings via N-gram hashing. It retrieves local, repetitive patterns directly, which frees attention for global-context processing. It is inserted at layers 3 and 7 and serves as the primary store of factual knowledge.
  • mHC-lite (multi-stream residual connections): Dynamic residual connections across multiple streams. Building on the Birkhoff–von Neumann theorem, it forms the stream-mixing matrix as a convex combination of permutation matrices, which guarantees an exactly doubly stochastic matrix without any Sinkhorn-Knopp iterations.
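
The Engram lookup described above can be illustrated with a toy example. The sketch below is not the model's implementation: the rolling hash, the table size, the choice of 2-gram and 3-gram tables, and summing the retrieved rows are all assumptions made for illustration, and it leaves out any gating or multi-head hashing the real module may use. Only max_ngram_size=3 comes from the model's settings. It shows why the lookup is O(1) in sequence length: only the last few token IDs are hashed, regardless of what came before them.

```python
import random

MAX_NGRAM_SIZE = 3   # max_ngram_size from the model's Engram settings
TABLE_SIZE = 4096    # hypothetical number of rows per table
EMBED_DIM = 8        # shrunk from embed_dim=672 to keep the demo small


def ngram_slot(token_ids, n, table_size=TABLE_SIZE):
    """Hash the last n token IDs to a row index (toy polynomial hash)."""
    slot = 0
    for token in token_ids[-n:]:
        slot = (slot * 1_000_003 + token + 1) % table_size
    return slot


def build_tables(seed=0):
    """One static embedding table per N-gram order (2 and 3 here)."""
    rng = random.Random(seed)
    return {
        n: [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
            for _ in range(TABLE_SIZE)]
        for n in range(2, MAX_NGRAM_SIZE + 1)
    }


def engram_lookup(tables, token_ids):
    """Sum the rows selected by each N-gram ending at the current position."""
    memory = [0.0] * EMBED_DIM
    for n, table in tables.items():
        if len(token_ids) < n:
            continue  # not enough history for this N-gram order
        row = table[ngram_slot(token_ids, n)]
        memory = [m + r for m, r in zip(memory, row)]
    return memory


tables = build_tables()
short_context = engram_lookup(tables, [17, 42, 7])
long_context = engram_lookup(tables, [999, 5, 17, 42, 7])
print(short_context == long_context)
```

Both calls end in the same three tokens, so they hit the same rows and return the same vector; the cost of the lookup does not grow with the length of the prefix.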

| Module | Key settings |
| --- | --- |
| MoE | num_experts=96, num_experts_per_tok=4, moe_intermediate_size=512 |
| Engram | max_ngram_size=3, embed_dim=672, n_head=8, layer_ids=[3, 7] |
| mHC-lite | num_streams=4 (4! = 24 permutation matrices) |
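
The 24 permutation matrices listed for mHC-lite hint at how the doubly stochastic guarantee works: by the Birkhoff–von Neumann theorem, every convex combination of permutation matrices is doubly stochastic. A minimal sketch, assuming the mixing weights come from a softmax over 4! = 24 logits (the softmax parameterization and all function names here are illustrative assumptions, not taken from the Susono code):

```python
import itertools
import math

NUM_STREAMS = 4  # num_streams from the settings table


def permutation_matrices(n):
    """All n! permutation matrices of size n x n, as nested lists."""
    return [
        [[1.0 if perm[row] == col else 0.0 for col in range(n)]
         for row in range(n)]
        for perm in itertools.permutations(range(n))
    ]


def softmax(logits):
    """Numerically stable softmax; the outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixing_matrix(logits, n=NUM_STREAMS):
    """Convex combination of permutation matrices (doubly stochastic)."""
    perms = permutation_matrices(n)
    if len(logits) != len(perms):
        raise ValueError(f"expected {len(perms)} logits, got {len(logits)}")
    weights = softmax(logits)
    mix = [[0.0] * n for _ in range(n)]
    for weight, perm in zip(weights, perms):
        for row in range(n):
            for col in range(n):
                mix[row][col] += weight * perm[row][col]
    return mix


# Arbitrary logits stand in for whatever the network would predict.
logits = [0.1 * i for i in range(math.factorial(NUM_STREAMS))]
H = mixing_matrix(logits)
row_sums = [sum(row) for row in H]
col_sums = [sum(H[r][c] for r in range(NUM_STREAMS)) for c in range(NUM_STREAMS)]
print(row_sums, col_sums)
```

Every row and column sums to 1 up to floating-point rounding, with no iterative normalization step.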

Training Environment

NVIDIA GH200 Grace Hopper Superchip

The GH200 is a heterogeneous superchip that directly connects a Grace CPU (Arm Neoverse V2, 72 cores) and a Hopper GPU (H100-class, 96 GB HBM3) via NVLink-C2C (900 GB/s bidirectional, about 7× the bandwidth of PCIe Gen5). Hardware-level memory coherency lets the CPU and GPU access each other's memory without page migration, which makes full-scale CPU offload practical.

Training Framework

Based on Megatron-LM, extended for Susono as follows:

  • Triton fused kernels: Fused implementations of operations such as Engram lookup, mHC width connection, GatedDeltaNet decay, the MoE router, RMSNorm variants, the auxiliary loss, and cross entropy. Every kernel includes a PyTorch fallback.
  • FP8 training + CPU offload: Parameters are kept in FP8 (e4m3), while the Adam optimizer state and master weights (BF16) are offloaded to CPU memory over NVLink-C2C.
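
For a rough sense of scale of the FP8 + CPU offload split above, the arithmetic below estimates where the bytes go for a 10B-parameter model. It assumes FP32 Adam moments (the precision of the optimizer state is not stated here) and ignores gradients, activations, and any sharding across GPUs, so treat the result as an order-of-magnitude sketch. offload_budget is a hypothetical helper, not part of the training stack.

```python
def offload_budget(num_params, master_bytes=2, adam_moment_bytes=4):
    """Estimate GPU-resident and CPU-offloaded state in GB (1e9 bytes)."""
    fp8_param_bytes = num_params * 1                  # e4m3: 1 byte per parameter
    master_weight_bytes = num_params * master_bytes   # BF16 master weights
    # Adam keeps two moments per parameter; FP32 is an assumption here.
    adam_bytes = 2 * num_params * adam_moment_bytes
    return {
        "gpu_param_gb": fp8_param_bytes / 1e9,
        "cpu_master_gb": master_weight_bytes / 1e9,
        "cpu_adam_gb": adam_bytes / 1e9,
        "cpu_offload_gb": (master_weight_bytes + adam_bytes) / 1e9,
    }


budget = offload_budget(10_000_000_000)
for key, value in budget.items():
    print(f"{key}: {value:.0f} GB")
```

Under these assumptions the FP8 parameters take about 10 GB of GPU memory, while about 100 GB of master weights and optimizer state would sit in CPU memory before any sharding.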

Training Schedule

| Phase | Context length | Target tokens | GBS | Learning rate |
| --- | --- | --- | --- | --- |
| Phase 1: Pre-training | 4,096 | 300B | 1,024 | 2.0e-4 |
| Phase 2: Mid-training | 16,384 | 250B | 256 | 2.0e-4 |
| Phase 3: SFT | 16,384 | - | 128 | 2.0e-5 |
| Phase 4: DPO | 16,384 | - | 32 | 1.0e-6 |
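
The batch settings in the schedule can be turned into approximate step counts. This sketch assumes that GBS counts sequences and that every sequence is packed to the full context length; neither is stated in the schedule. The helper names are illustrative.

```python
# (context length, target tokens, global batch size) copied from the schedule.
PHASES = {
    "Phase 1: Pre-training": (4_096, 300_000_000_000, 1_024),
    "Phase 2: Mid-training": (16_384, 250_000_000_000, 256),
}


def tokens_per_step(context_length, gbs):
    """Tokens consumed by one optimizer step under the packing assumption."""
    return context_length * gbs


def steps_needed(target_tokens, context_length, gbs):
    """Number of steps to reach the token target, rounded up."""
    per_step = tokens_per_step(context_length, gbs)
    return -(-target_tokens // per_step)  # ceiling division on integers


for name, (ctx, target, gbs) in PHASES.items():
    print(name, tokens_per_step(ctx, gbs), steps_needed(target, ctx, gbs))
```

Under these assumptions both phases process 4,194,304 tokens per step, which comes to roughly 71.5k steps for Phase 1 and 59.6k steps for Phase 2. Phases 3 and 4 list no token target, so no step count can be derived for them.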

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Susono-10B-A1B-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    repetition_penalty=1.05,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

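# Split the generated tokens into reasoning and final answer.
THINK_END_ID = 151668  # token ID of </think>
try:
    # Search from the end so that index points just past the last </think>.
    index = len(output_ids) - output_ids[::-1].index(THINK_END_ID)
except ValueError:
    # No </think> was generated: treat the whole output as the final answer.
    index = 0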

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
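
If only the decoded text is available rather than the token IDs (for example, from a serving endpoint), the same split can be done on the string. A minimal sketch, assuming the <think> and </think> tags are still present in the decoded text; split_thinking is a hypothetical helper, not part of the model's code:

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (thinking, answer) at the last closing tag."""
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        # No closing tag: mirror the token-based code and return it all as the answer.
        return "", text.strip()
    thinking = head.strip()
    if thinking.startswith(open_tag):
        thinking = thinking[len(open_tag):].strip()
    return thinking, tail.strip()


thinking, answer = split_thinking("<think>\n2 + 2 is 4\n</think>\n\n4")
print("thinking content:", thinking)
print("content:", answer)
```

If generation stops at max_new_tokens before </think> is emitted, the helper returns an empty reasoning string and the raw text as the answer, so callers may want to check for that case.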

Reference Repositories

Note: the transformers, SGLang, and vLLM implementations are planned to be merged into their respective upstream (main) repositories.
