TinyStories Qwen3-Gated-MoE 2M (tinyqwen3gated2m) HF Validation Suite

This repository provides an ultra-lightweight Qwen3 Gated-Hybrid Mixture-of-Experts (MoE) model file in Hugging Face / Safetensors format. It is trained from scratch on the TinyStories dataset and specifically optimized for low-level inference engine architecture debugging, hybrid layer routing verification, and automated CI pipelines.

Why this repository exists

When developing or optimizing a custom hardware/software inference engine for next-generation architectures, debugging against full-scale models slows iteration considerably. This suite provides a Qwen3-Gated-MoE model at roughly 2M parameters that combines linear recurrent layers with dense attention layers. Developers can validate top-k gating, token dispatchers, shared experts, linear recurrent update loops, and YaRN RoPE scaling parameters step by step, with fast turnaround and natural language output that is easy to check by eye.

Key Validation Targets

This model is specifically designed to expose architectural layout and edge-case calculation bugs in modern hybrid-MoE pipelines:

Equal Gated-Hybrid Topology (layer_types Validation): Features a 1:1 alternating hybrid layer arrangement across 6 transformer layers. Layers 0, 2, and 4 are configured as linear_attention (Gated DeltaNet linear recurrence layers), while Layers 1, 3, and 5 are configured as full_attention (Gated Softmax attention layers). Both execution paths are therefore exercised on every forward pass through the engine.
Explicit Layer Shape Structure: Setting num_attention_heads=8 and head_dim=32 explicitly in the configuration builds a 256-dimensional attention projection directly. This removes the need for implicit 2x expansion logic at runtime and keeps the memory layout predictable during weight loading.
Q-Norm / K-Norm Structure Verification: Validates that per-head RMSNorm is applied directly to the Query and Key tensors before the attention dot product is computed. This is a native feature of the Qwen3 architecture that improves numerical stability.
Native YaRN RoPE Scaling Integration: Incorporates a YaRN (Yet Another RoPE Extension) configuration (rope_scaling) with factor=4.0 and original_max_position_embeddings=64, which extends the context window to 4.0 × 64 = 256 positions. This validates that the inference engine computes the frequency adjustments correctly across the expanded context window.
True 8:1 GQA Ratio: Implements an asymmetric configuration in which all 8 Query heads share a single Key-Value head (num_attention_heads=8, num_key_value_heads=1). This checks that KV cache structures, stride calculations, parallel splits, and index handling process Grouped-Query Attention topologies correctly, without memory alignment failures.
Top-1 Gated Routing & Token Dispatch: Features a micro-MoE topology with exactly 2 local experts, routing each token to 1 expert (num_experts_per_tok=1) across the layers. With so few experts, developers can trace router logits, routing probabilities, routing masks, and row-major token dispatch by hand, without heavy memory tracking.
Shared Expert Isolation: Implements an explicit shared expert with its own intermediate size (shared_expert_intermediate_size=128). This verifies that the engine adds the non-gated shared expert output to the output of the single routed expert without accumulation or alignment bugs.
Layer-wise Projection Bias Verification (±0.2 Uniform Range): The q_proj, k_proj, and v_proj layers carry frozen biases drawn from a uniform random distribution over ±0.2. If an inference engine fails to map these attention biases or shifts them slightly, the numerical error compounds across the network layers and greedy generation degrades into garbage within a few tokens.
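
The top-1 routing and shared expert targets above can be illustrated with a minimal pure-Python sketch. This is not the model's implementation: the function name `top1_moe`, the toy experts, and the `normalize` flag (whether the selected router probability is renormalized to 1.0) are assumptions made for the example. Only the ungated addition of the shared expert output follows the description above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_moe(x, router_logits, experts, shared_expert, normalize=False):
    """Route one token vector x to a single expert and add the shared expert.

    normalize is a hypothetical switch: when True the routing weight is
    renormalized over the selected experts, which is 1.0 for top-1.
    """
    probs = softmax(router_logits)
    # Index of the highest-probability expert (ties resolve to the lowest index).
    idx = max(range(len(probs)), key=probs.__getitem__)
    weight = 1.0 if normalize else probs[idx]
    routed = experts[idx](x)
    shared = shared_expert(x)  # added without any gate
    return idx, [weight * r + s for r, s in zip(routed, shared)]

# Toy stand-ins for the 2 local experts and the shared expert.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-1.0 * a for a in v]]
shared = lambda v: [a + 1.0 for a in v]

idx_a, out_a = top1_moe([1.0, 2.0], [0.0, 0.0], experts, shared)
idx_b, out_b = top1_moe([1.0, 2.0], [0.0, 1.0], experts, shared, normalize=True)
print(idx_a, out_a)
print(idx_b, out_b)
```

Comparing an engine's per-token expert index and combined output against a reference of this kind separates routing errors from accumulation errors.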

📂 Repository Structure & File Descriptions

The repository structure follows standard Hugging Face native configurations for standalone tensor loading:

.
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    └── tokenizer.json

Hugging Face Native Format (`./hf/`)

hf/model.safetensors: The raw, unquantized model weights in Safetensors format, containing query/key/value projections with biases, router layers, recurrent DeltaNet weights, and MoE experts.
hf/config.json: The architectural configuration file defining the MoE and hybrid hyperparameters (6 layers alternating between linear_attention and full_attention, 8 heads, 2 local experts, 1 active expert, shared expert parameters, weight tying, model dimensions, and YaRN scaling parameters).
hf/generation_config.json: Default text generation parameters.
hf/tokenizer_config.json: Tokenizer settings for the custom ChatML/Qwen3 fast tokenizer.
hf/tokenizer.json: The custom Byte-Level BPE tokenizer definition, trained with a base vocabulary size of 1000.
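
Before loading weights, an engine's config parser can be checked against the values this README states. The sketch below is one possible way to do that, assuming the key names quoted in this document appear verbatim in hf/config.json; the helper name `check_config` and the inline `sample` dictionary are hypothetical and stand in for the parsed file.

```python
def check_config(cfg):
    """Return a list of mismatches against the documented values (empty if none)."""
    problems = []

    def expect(name, actual, wanted):
        if actual != wanted:
            problems.append(f"{name}: expected {wanted!r}, got {actual!r}")

    # 1:1 alternation starting with a linear_attention layer.
    expect("layer_types", cfg["layer_types"],
           ["linear_attention", "full_attention"] * 3)
    # 8 heads x 32 dims gives the explicit 256-wide attention projection.
    expect("attention width", cfg["num_attention_heads"] * cfg["head_dim"], 256)
    expect("num_key_value_heads", cfg["num_key_value_heads"], 1)
    expect("num_experts_per_tok", cfg["num_experts_per_tok"], 1)
    expect("shared_expert_intermediate_size",
           cfg["shared_expert_intermediate_size"], 128)
    rope = cfg["rope_scaling"]
    expect("YaRN context",
           rope["factor"] * rope["original_max_position_embeddings"], 256)
    return problems

# Hypothetical stand-in for the parsed hf/config.json, using values from this README.
sample = {
    "layer_types": ["linear_attention", "full_attention"] * 3,
    "num_attention_heads": 8,
    "head_dim": 32,
    "num_key_value_heads": 1,
    "num_experts_per_tok": 1,
    "shared_expert_intermediate_size": 128,
    "rope_scaling": {"type": "yarn", "factor": 4.0,
                     "original_max_position_embeddings": 64},
}
print(check_config(sample))
```

To check the real file, pass the dictionary returned by `json.load` on hf/config.json instead of `sample`.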

🚀 Usage Examples

Loading Hugging Face Formats via Python

To match the data layout used during training, encode text with add_special_tokens=False and manually prepend the BOS token ID (1000). This keeps the token sequence aligned with the training data, so the greedy output can serve as a stable reference.

import torch
from transformers import PreTrainedTokenizerFast, Qwen3MoeForCausalLM

repo_id = "shibatch/tinyqwen3gated2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = Qwen3MoeForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline layout
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=False,        # Forces deterministic greedy decoding
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
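
Because the generation above is greedy and deterministic, its token IDs can serve as a reference for a custom engine. The helper below is a small sketch of that comparison; the name `first_divergence` and the example ID lists are hypothetical, apart from the BOS ID 1000 stated above.

```python
def first_divergence(reference_ids, engine_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (ref, got) in enumerate(zip(reference_ids, engine_ids)):
        if ref != got:
            return i
    # One sequence is a strict prefix of the other.
    if len(reference_ids) != len(engine_ids):
        return min(len(reference_ids), len(engine_ids))
    return None

# Made-up ID sequences for illustration; both start with BOS (1000).
reference = [1000, 5, 17, 42]
engine = [1000, 5, 18, 42]
print(first_divergence(reference, engine))
```

A divergence within the first few generated tokens usually points at a structural error such as a missing bias or a wrong layer type, while a late divergence is more likely accumulated floating-point drift.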

📝 Model Specifications

The routing configuration, hybrid layers, and attention mechanics follow the Qwen3-Gated-MoE structure, with the values listed below.

Parameter	Specification
Architecture	Qwen3-MoE (`Qwen3MoeForCausalLM`)
Layer Types (`layer_types`)	Alternating `["linear_attention", "full_attention", ...]` (3 layers each)
Dataset	TinyStories
Total Parameters	~2.0M
Vocabulary Size	1,024 (Custom Byte-Level BPE with 1000 base tokens + special tokens)
Hidden Size (`hidden_size`)	128
Head Dimension (`head_dim`)	32 (8 heads × 32 dim = 256 explicit structural `q_proj` dimension)
Number of Hidden Layers	6
Number of Attention Heads	8
Number of Key-Value Heads	1 (Standard GQA 8:1 topology)
Intermediate Size	256
Max Position Embeddings	256
Attention Bias	True (Explicitly uniform random between -0.2 and 0.2 for q_proj, k_proj, v_proj)
Total Local Experts	2
Experts Selected per Token	1 (Top-1 Routing)
Expert Intermediate Size	256
Shared Expert Intermediate Size	128
RMS Norm Epsilon	1e-06
RoPE Base Frequency (`rope_theta`)	1,000,000.0
RoPE Scaling (`rope_scaling`)	`{"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 64}`
Weight Tying (`tie_word_embeddings`)	True
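
Several rows in the table are linked by simple arithmetic, which the following sketch works through. The variable names are chosen for the example only. The embedding count assumes the tied matrix is stored once, which is what weight tying normally implies but is not stated explicitly here.

```python
# Values copied from the specification table above.
hidden_size = 128
num_heads = 8
num_kv_heads = 1
head_dim = 32
vocab_size = 1024
yarn_factor = 4.0
original_max_pos = 64

# Width of the query projection output: 8 heads x 32 dims.
q_width = num_heads * head_dim
# Width of the key and value projection outputs: 1 head x 32 dims.
kv_width = num_kv_heads * head_dim
# Number of query heads that share each KV head.
group_size = num_heads // num_kv_heads
# KV head index read by each query head (all zeros for an 8:1 layout).
kv_head_for_query = [h // group_size for h in range(num_heads)]
# Context length after YaRN scaling; matches Max Position Embeddings.
extended_context = int(yarn_factor * original_max_pos)
# Embedding parameters, counted once under the weight tying assumption.
embedding_params = vocab_size * hidden_size

print(q_width, kv_width, group_size, kv_head_for_query)
print(extended_context, embedding_params)
```

The KV cache per full_attention layer therefore holds 32 values for keys and 32 for values per token, an eighth of the 256-wide query.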

📜 Acknowledgments & License

Original Architecture: Qwen3 Model Family (Alibaba Cloud).
Dataset: TinyStories dataset (Microsoft Research).
License: MIT License. You are free to use, modify, and distribute these assets for any purpose, commercial or private.
