TinyStories Qwen3-MoE 1.95M (tinyqwen3moe2m) HF Validation Suite

This repository provides an ultra-lightweight Qwen3 Mixture-of-Experts (MoE) model in Hugging Face / Safetensors format, trained to convergence on the TinyStories dataset and intended for inference engine testing, MoE routing verification, and automated CI pipelines.

Why this repository exists

When developing a custom Mixture-of-Experts (MoE) inference engine or optimizing low-level routing layers, debugging against full-sized models slows development considerably. This suite offers a genuine 1.95M-parameter Qwen3-MoE model, so developers can validate their top-k gates, token dispatchers, shared-expert combination, and YaRN RoPE scaling parameters step by step, quickly, and against verifiable natural-language output.

Key Validation Targets

This model is specifically designed to expose architectural layout and edge-case calculation bugs in MoE pipelines:

Explicit Layer Topology (No Dynamic Extension): With num_attention_heads=8 and head_dim=32 set explicitly in the configuration, q_proj is stored with its full 256 output features (8 $\times$ 32 = 256), twice the hidden size of 128. This removes the need for implicit 2x expansion logic at runtime and gives a predictable memory layout during weight loading.
Q-Norm / K-Norm Structure Verification: Validates that per-head RMSNorm is applied directly to the Query and Key tensors before the attention dot product. This is a native feature of the Qwen3 architecture that improves numerical stability.
Native YaRN RoPE Scaling Integration: Incorporates a true YaRN (Yet Another RoPE Extension) configuration (rope_scaling) with factor=4.0 and original_max_position_embeddings=64. This validates that the inference engine can accurately compute frequency adjustments across an expanded context window.
True 8:1 GQA Ratio: Implements an asymmetric configuration containing exactly 8 Query heads and 1 Key-Value head (num_attention_heads=8, num_key_value_heads=1). This checks that KV caching structures, stride calculations, parallel splits, and index handling process Grouped-Query Attention topologies properly without memory alignment failures.
Top-1 Gated Routing & Token Dispatch: Features a micro-MoE topology with exactly 2 local experts, routing 1 expert per token (num_experts_per_tok=1) across 6 transformer layers. This lets developers trace router logits and probabilities, routing masks, and row-major tensor dispatch logic precisely, without the memory overhead of a full-sized model.
Shared Expert Isolation: Configures an explicit shared expert with its own intermediate size (shared_expert_intermediate_size=128). This helps verify that the engine adds the ungated shared-expert output to the output of the single routed expert without accumulation or alignment bugs.
Continuous Packed Sequence Handling: The training pipeline uses dense token packing: tokenized samples are concatenated, delimited by control tags (<s> / </s>), and cut into chunks of exactly 256 tokens. This is ideal for testing internal state boundaries, position-index shifts, and packed-sequence attention masks in custom hardware layouts.
Layer-wise Projection Bias Verification ($\pm 0.2$ Uniform Range): The q_proj, k_proj, and v_proj biases were initialized from a uniform random distribution over $\pm 0.2$ and then frozen. If an inference engine fails to map these attention biases, or shifts them even slightly, the numerical error compounds across the 6 layers and greedy generation degrades into garbage within a few tokens.
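
The top-1 routing and shared-expert targets above can be illustrated with a small pure-Python reference for a single token. This is a minimal sketch, not the model's actual implementation: the function name `moe_top1_with_shared`, the toy experts, and the `renormalize` flag are hypothetical. The flag exists because this card does not say whether the selected routing weight is renormalized. The shared output is added ungated, as described above.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over the router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_top1_with_shared(
    token: Sequence[float],
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    shared_expert: Expert,
    renormalize: bool = False,  # assumption: not specified by this card
) -> Vector:
    """Combine the single routed expert with the always-on shared expert."""
    probs = softmax(router_logits)
    # Top-1 selection: index of the largest router probability.
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Renormalizing over a single selected expert makes its weight exactly 1.0.
    weight = 1.0 if renormalize else probs[chosen]
    routed = experts[chosen](token)
    shared = shared_expert(token)
    return [weight * r + s for r, s in zip(routed, shared)]


# Toy stand-ins for the 2 local experts and the shared expert (hypothetical).
toy_experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
toy_shared = lambda x: [v + 1.0 for v in x]

print(moe_top1_with_shared([1.0, 2.0], [0.0, 3.0], toy_experts, toy_shared))
```

Comparing an engine's per-token output against a reference of this shape, layer by layer, separates routing mistakes (wrong `chosen`) from combination mistakes (wrong weight or a missing shared term).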

📂 Repository Structure & File Descriptions

The repository structure follows standard Hugging Face native configurations for standalone tensor loading:


```text
.
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── tokenizer.json
```

Hugging Face Native Format (`./hf/`)

hf/model.safetensors: The raw, unquantized model weights in Safetensors format, containing the query/key/value projections with biases, the router layers, and the MoE experts.
hf/config.json: The architectural configuration file defining MoE hyperparameters (6 layers, 8 heads, 2 local experts, 1 active expert, shared parameters, weight-tying, standard dimensions, and YaRN scaling parameters).
hf/generation_config.json: Default parameters optimized for text generation.
hf/tokenizer_config.json: Tokenizer settings specifying the custom ChatML/Qwen3 fast tokenizer setup.
hf/special_tokens_map.json: Mappings from special-token roles (such as BOS and EOS) to their token strings.
hf/tokenizer.json: The custom byte-level BPE tokenizer definition, trained with a base vocabulary of 1000 tokens.
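
As a quick cross-check before loading model.safetensors, the attention hyperparameters from config.json imply fixed projection shapes. The sketch below derives them from a plain dictionary filled with the values listed under Model Specifications. The helper names `derive_attention_shapes` and `kv_head_index` are hypothetical, and shapes follow the `(out_features, in_features)` convention used for linear-layer weights in PyTorch.

```python
from typing import Any, Dict


def derive_attention_shapes(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Derive expected projection weight shapes as (out_features, in_features)."""
    hidden = cfg["hidden_size"]
    n_q = cfg["num_attention_heads"]
    n_kv = cfg["num_key_value_heads"]
    head_dim = cfg["head_dim"]
    if n_q % n_kv != 0:
        raise ValueError("query heads must be a multiple of key-value heads")
    return {
        "q_proj": (n_q * head_dim, hidden),
        "k_proj": (n_kv * head_dim, hidden),
        "v_proj": (n_kv * head_dim, hidden),
        "o_proj": (hidden, n_q * head_dim),
        "queries_per_kv_head": n_q // n_kv,
    }


def kv_head_index(query_head: int, n_q: int, n_kv: int) -> int:
    """Map a query head to the key-value head it shares under GQA."""
    return query_head // (n_q // n_kv)


# Values copied from the Model Specifications section of this card.
cfg = {
    "hidden_size": 128,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
    "head_dim": 32,
}

shapes = derive_attention_shapes(cfg)
for name, shape in shapes.items():
    print(name, shape)

# With a single key-value head, every query head maps to index 0.
print([kv_head_index(h, 8, 1) for h in range(8)])
```

An engine whose loader allocates a 128-wide q_proj (hidden size) instead of 256 (heads times head dimension) would fail this check immediately, which is the layout bug the explicit topology is meant to expose.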

🚀 Usage Examples

Loading Hugging Face Formats via Python

To match the dataset structure used during training, encode text with add_special_tokens=False and manually prepend the BOS token ID (1000). This keeps the prompt aligned with the training layout, so the MoE forward pass sees the kind of input it was trained on.

```python
import torch
from transformers import PreTrainedTokenizerFast, Qwen3MoeForCausalLM

repo_id = "shibatch/tinyqwen3moe2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = Qwen3MoeForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline layout
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,        # Forces deterministic greedy decoding
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
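
The BOS prepending step mirrors the packed training layout described under Key Validation Targets. A minimal sketch of such packing follows, for building test inputs with the same boundary structure. It is an illustration under stated assumptions, not the author's training code: the function name `pack_sequences` is hypothetical, the EOS ID 1001 is a placeholder (only the BOS ID 1000 is given in this card), and dropping the trailing partial chunk is an assumption.

```python
from typing import Iterable, List, Sequence

BOS_ID = 1000  # stated in this card
EOS_ID = 1001  # placeholder: the real EOS ID is not stated in this card


def pack_sequences(
    samples: Iterable[Sequence[int]],
    bos_id: int = BOS_ID,
    eos_id: int = EOS_ID,
    block_size: int = 256,
) -> List[List[int]]:
    """Wrap each sample in BOS/EOS, concatenate, and cut into full blocks."""
    stream: List[int] = []
    for ids in samples:
        stream.append(bos_id)
        stream.extend(ids)
        stream.append(eos_id)
    # Assumption: a trailing partial block is dropped rather than padded.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Tiny demonstration with a block size of 4 instead of 256.
blocks = pack_sequences([[1, 2, 3], [4, 5]], block_size=4)
print(blocks)
```

Because sample boundaries can fall anywhere inside a block, inputs built this way exercise the position-indexing and mask-handling paths that a one-sample-per-sequence test never reaches.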

📝 Model Specifications

The routing configuration, hidden layers, and attention mechanics follow the standard Qwen3-MoE architecture.

Architecture: Qwen3-MoE (Qwen3MoeForCausalLM)
Dataset: TinyStories
Total Parameters: 1,954,816 parameters (~1.95M)
Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special tokens)
Hidden Size (hidden_size): 128
Head Dimension (head_dim): 32 (8 heads $\times$ 32 dim = 256, explicitly defining the 256-dimensional q_proj from the start without dynamic runtime extensions)
Number of Hidden Layers (num_hidden_layers): 6
Number of Attention Heads (num_attention_heads): 8
Number of Key-Value Heads (num_key_value_heads): 1 (Standard GQA 8:1 topology)
Intermediate Size (intermediate_size): 256
Max Position Embeddings (max_position_embeddings): 256
Attention Bias (attention_bias): True (q_proj, k_proj, and v_proj biases drawn uniformly at random from the range -0.2 to 0.2)
Total Local Experts (num_experts): 2
Experts Selected per Token (num_experts_per_tok): 1 (Top-1 Routing)
Expert Intermediate Size (moe_intermediate_size): 256
Shared Expert Intermediate Size (shared_expert_intermediate_size): 128
RMS Norm Epsilon: 1e-06
RoPE Base Frequency (rope_theta): 1,000,000.0
RoPE Scaling (rope_scaling): {"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 64}
Weight Tying (tie_word_embeddings): True
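
The rope_scaling entry above can be turned into concrete rotary frequencies for comparison against an engine's tables. The sketch below follows the commonly used YaRN formulation with a linear ramp between interpolated and extrapolated frequencies. The values `beta_fast=32` and `beta_slow=1` are assumed defaults that this card does not state, and the function name `yarn_inv_freq` is hypothetical, so treat the result as a reference under those assumptions and not as the model's guaranteed behavior.

```python
import math
from typing import List, Tuple


def yarn_inv_freq(
    head_dim: int = 32,
    base: float = 1_000_000.0,
    factor: float = 4.0,
    original_max_pos: int = 64,
    beta_fast: float = 32.0,  # assumed default, not stated in this card
    beta_slow: float = 1.0,   # assumed default, not stated in this card
) -> Tuple[List[float], float]:
    """Return YaRN-blended inverse frequencies and the attention scale."""

    def correction_dim(num_rotations: float) -> float:
        # Dimension index at which a frequency completes `num_rotations`
        # full turns over the original context length.
        return (
            head_dim
            * math.log(original_max_pos / (num_rotations * 2 * math.pi))
        ) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), head_dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp

    inv_freq: List[float] = []
    for i in range(head_dim // 2):
        extrapolated = base ** (-2 * i / head_dim)  # unscaled RoPE frequency
        interpolated = extrapolated / factor        # position-interpolated
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(interpolated * ramp + extrapolated * (1.0 - ramp))

    # Scale commonly applied to the cos/sin tables in YaRN implementations.
    attention_scale = 0.1 * math.log(factor) + 1.0
    return inv_freq, attention_scale


inv_freq, attention_scale = yarn_inv_freq()
print(len(inv_freq), attention_scale)
print(inv_freq[:4])
```

Note that the factor of 4.0 applied to original_max_position_embeddings=64 gives 256, which matches max_position_embeddings. With a head dimension of only 32, the ramp covers just the first few frequency pairs, so an off-by-one in the ramp bounds changes a visible share of the table.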

📜 Acknowledgments & License

Original Architecture: Qwen3 Model Family.
Dataset: TinyStories dataset.
License: MIT License. You are free to use, modify, and distribute these assets for any purpose, commercial or private.
