Susono-10B-A1B-Instruct


Susono-10B-A1B-Instruct is an instruction-following model obtained by post-training Susono-10B-A1B-Base with SFT and DPO. It is a custom-architecture LLM with about 10B total parameters, of which about 1B are active per token (A1B). Its hybrid backbone of Full Attention + GatedDeltaNet + MoE integrates Engram (a conditional memory module) and mHC-lite (Manifold-Constrained Hyper-Connections Lite).

Training ran on NVIDIA GH200 Grace Hopper Superchips. Dedicated fused kernels were implemented for Engram and mHC-lite, and training combined FP8 with CPU offload to take advantage of the GH200's tightly coupled CPU and GPU memory.

Note that this model was developed purely as a personal hobby project with private funding. The total development cost was only about USD 1,875 (roughly JPY 300,000), so neither pre-training nor post-training has been carried out to a sufficient extent.

⚠️ This is an instruct model post-trained for chat and instruction following. Apply the chat template when generating responses.

We assume no responsibility for the model's outputs. Use it at your own risk.

Model Overview

| Item | Details |
|---|---|
| Base model | Susono-10B-A1B-Base |
| Post-training | SFT + DPO |
| Architecture | Hybrid of Full Attention + GatedDeltaNet + Sparse MoE, with Engram + mHC-lite |
| Total parameters | ~10B |
| Active parameters per token | ~1B (A1B) |
| Vocabulary size | 151,680 |
| Max context length | 262,144 (up to 16,384 during training) |
| Training stack | Extended Megatron-LM (FP8 training + CPU offload) |
| Training environment | Supercomputer Miyabi (NVIDIA GH200 × 16) |

Reference papers:

  • Engram: arXiv:2601.07372v1 "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
  • mHC-lite: arXiv:2601.05732v1 "mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations"

Architecture

  • Full Attention + GatedDeltaNet: A hybrid configuration that uses full softmax attention in every fourth layer (full_attention_interval=4) and GatedDeltaNet (linear attention) in the remaining layers.
  • Sparse MoE: All FFN layers are MoE (96 experts, 4 active per token).
  • Engram (conditional memory): O(1) lookup into static embeddings via N-gram hashing. It directly retrieves local, repetitive patterns and frees up attention for global-context processing. Inserted at layers 3 and 7, it serves as the primary store of factual knowledge.
  • mHC-lite (multi-stream residual connections): Dynamic residual connections across multiple streams. Leveraging the Birkhoff–von Neumann theorem, it strictly guarantees a doubly stochastic matrix without any Sinkhorn-Knopp iterations.
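
The Birkhoff–von Neumann idea behind mHC-lite can be illustrated in a few lines. This is only a sketch, not the model's kernel: it assumes the mixing matrix is a softmax-weighted convex combination of all n! permutation matrices, and the names `doubly_stochastic` and `logits` are hypothetical. Every permutation matrix has row and column sums of 1, so any convex combination of them is doubly stochastic by construction, with no iterative normalization.

```python
import itertools
import math


def permutation_matrices(n):
    """Return all n! permutation matrices of size n x n as nested lists."""
    mats = []
    for perm in itertools.permutations(range(n)):
        mats.append([[1.0 if perm[r] == c else 0.0 for c in range(n)]
                     for r in range(n)])
    return mats


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def doubly_stochastic(logits, n):
    """Mix the n! permutation matrices with softmax(logits) as convex weights."""
    perms = permutation_matrices(n)
    if len(logits) != len(perms):
        raise ValueError("need one logit per permutation matrix")
    weights = softmax(logits)
    mix = [[0.0] * n for _ in range(n)]
    for w, p in zip(weights, perms):
        for r in range(n):
            for c in range(n):
                mix[r][c] += w * p[r][c]
    return mix


n = 4  # four residual streams, so 4! = 24 permutation matrices
logits = [0.1 * i for i in range(math.factorial(n))]  # hypothetical learned scores
mix = doubly_stochastic(logits, n)
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[r][c] for r in range(n)) for c in range(n)]
print(row_sums, col_sums)
```

In this parameterization the cost is one softmax over 24 scores plus a weighted sum, which is why the number of streams has to stay small: the number of permutation matrices grows as n!.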

| Module | Key settings |
|---|---|
| MoE | num_experts=96, num_experts_per_tok=4, moe_intermediate_size=512 |
| Engram | max_ngram_size=3, embed_dim=672, n_head=8, layer_ids=[3, 7] |
| mHC-lite | num_streams=4 (4! = 24 permutation matrices) |
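
To make the MoE settings concrete, here is a minimal top-k routing sketch in plain Python. It assumes the gate weights are a softmax over the selected experts only, which is a common choice but is not confirmed for Susono; `route_token` and the random router scores are hypothetical. With num_experts=96 and num_experts_per_tok=4, each token uses 4 of the 96 expert FFNs in a layer.

```python
import math
import random

NUM_EXPERTS = 96          # num_experts from the settings above
NUM_EXPERTS_PER_TOK = 4   # num_experts_per_tok from the settings above


def route_token(router_logits, k=NUM_EXPERTS_PER_TOK):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
selected = route_token(logits)
for expert_id, gate in selected:
    print(f"expert {expert_id:2d}  gate={gate:.3f}")
```

The token's FFN output would then be the gate-weighted sum of the four selected experts' outputs; the other 92 experts are not evaluated for that token.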

Training Environment

NVIDIA GH200 Grace Hopper Superchip

The GH200 is a heterogeneous superchip that directly connects a Grace CPU (Arm Neoverse V2 / 72 cores) and a Hopper GPU (H100-class / 96 GB HBM3) via NVLink-C2C (900 GB/s bidirectional, about 7× the bandwidth of a PCIe Gen5 x16 link). Hardware-level memory coherency lets the CPU and GPU access each other's memory without page migration, making full-scale CPU offload practical.
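
As a quick check on the 7× figure: a PCIe Gen5 x16 link runs at 32 GT/s per lane, which is roughly 64 GB/s per direction, or 128 GB/s bidirectional, before protocol overhead. The snippet below reproduces the ratio and estimates peak-rate transfer times for one buffer. The 20 GB buffer size is an arbitrary illustration, and real transfers achieve less than peak.

```python
NVLINK_C2C_BIDIR_GBPS = 900.0     # GB/s, both directions combined (figure from the text)
PCIE_GEN5_X16_BIDIR_GBPS = 128.0  # GB/s, about 64 GB/s per direction times two

ratio = NVLINK_C2C_BIDIR_GBPS / PCIE_GEN5_X16_BIDIR_GBPS


def one_way_seconds(size_gb, bidir_gbps):
    """Peak-rate time to move size_gb in one direction (half the bidirectional rate)."""
    return size_gb / (bidir_gbps / 2.0)


buffer_gb = 20.0  # arbitrary example size, not a measured value
print(f"bandwidth ratio: {ratio:.2f}x")
print(f"NVLink-C2C:    {one_way_seconds(buffer_gb, NVLINK_C2C_BIDIR_GBPS) * 1000:.1f} ms")
print(f"PCIe Gen5 x16: {one_way_seconds(buffer_gb, PCIE_GEN5_X16_BIDIR_GBPS) * 1000:.1f} ms")
```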

Training Framework

Based on Megatron-LM, extended for Susono as follows:

  • Triton Fused Kernels: Fused Triton kernels cover operations such as Engram lookup, the mHC width connection, GatedDeltaNet decay, the MoE router, RMSNorm variants, aux loss, and cross entropy. Every kernel includes a PyTorch fallback.
  • FP8 training + CPU offload: Parameters are kept in FP8 (e4m3), while the Adam optimizer state and master weights (BF16) are offloaded to CPU memory over NVLink-C2C.
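
A rough memory budget shows why the offload matters. The sketch assumes 10B parameters, 1 byte per FP8 weight, 2 bytes per BF16 master weight, and two FP32 Adam moments per parameter. The moment precision is an assumption, since the text above only states BF16 for the master weights. Activations, gradients, FP8 scaling factors, and sharding across GPUs are ignored.

```python
GIB = 1024 ** 3


def budget_gib(num_params, bytes_fp8=1, bytes_master=2, bytes_per_moment=4):
    """Return (GPU-resident weights, CPU-offloaded state) in GiB.

    bytes_per_moment=4 (FP32 Adam moments) is an assumption, not a stated setting.
    """
    gpu_bytes = num_params * bytes_fp8
    cpu_bytes = num_params * (bytes_master + 2 * bytes_per_moment)
    return gpu_bytes / GIB, cpu_bytes / GIB


gpu_gib, cpu_gib = budget_gib(10_000_000_000)
print(f"FP8 weights on GPU:       {gpu_gib:6.1f} GiB")
print(f"Master + Adam state, CPU: {cpu_gib:6.1f} GiB")
```

Under these assumptions the offloaded state is about ten times larger than the FP8 weights themselves, which is the part that would otherwise compete with activations for HBM.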

Training Schedule

| Phase | Context length | Target tokens | Global batch size | Learning rate |
|---|---|---|---|---|
| Phase 1: Pre-training | 4,096 | 300B | 1,024 | 2.0e-4 |
| Phase 2: Mid-training | 16,384 | 250B | 256 | 2.0e-4 |
| Phase 3: SFT | 16,384 | - | 128 | 2.0e-5 |
| Phase 4: DPO | 16,384 | - | 32 | 1.0e-6 |
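
The schedule implies a constant token count per optimizer step across the first two phases. The arithmetic below assumes the global batch size (GBS) counts sequences and that every sequence is packed to the full context length; both are assumptions, and the resulting step counts are estimates, not logged values.

```python
import math

# Context length, global batch size, and target tokens from the schedule above.
phases = {
    "Phase 1: Pre-training": {"context": 4096, "gbs": 1024, "target": 300_000_000_000},
    "Phase 2: Mid-training": {"context": 16384, "gbs": 256, "target": 250_000_000_000},
}


def tokens_per_step(context, gbs):
    """Tokens consumed by one optimizer step, assuming fully packed sequences."""
    return context * gbs


def steps_needed(target, context, gbs):
    """Optimizer steps required to reach the target token count."""
    return math.ceil(target / tokens_per_step(context, gbs))


for name, p in phases.items():
    tps = tokens_per_step(p["context"], p["gbs"])
    steps = steps_needed(p["target"], p["context"], p["gbs"])
    print(f"{name}: {tps:,} tokens/step, about {steps:,} steps")
```

Quadrupling the context length while cutting the batch size by four keeps each step at roughly 4.2M tokens, so the learning rate of 2.0e-4 applies to the same effective batch in both phases.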

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Susono-10B-A1B-Instruct"

# trust_remote_code=True lets transformers load the custom architecture code from the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Build the prompt with the chat template, which this instruct model expects.
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response; lower max_new_tokens if shorter outputs are enough.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    repetition_penalty=1.05,
)
# Drop the prompt tokens so that only the newly generated text is decoded.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(content)

Reference Repositories

Note: the transformers, SGLang, and vLLM implementations are planned to be merged into their respective upstream (main) repositories.
