TinyStories Qwen3-Gated-MoE 2M (tinyqwen3gated2m) HF Validation Suite
This repository provides an ultra-lightweight Qwen3 Gated-Hybrid Mixture-of-Experts (MoE) model file in Hugging Face / Safetensors format. It is trained from scratch on the TinyStories dataset and specifically optimized for low-level inference engine architecture debugging, hybrid layer routing verification, and automated CI pipelines.
Why this repository exists
When developing or optimizing a custom hardware/software inference engine for next-generation architectures, debugging against full-scale models slows iteration considerably. This suite provides a Qwen3-Gated-MoE model at a true 2M-parameter scale that integrates both linear recurrent states and dense attention layers. Developers can validate top-k gating, token dispatchers, shared experts, linear recurrent update loops, and YaRN RoPE scaling parameters step by step, with fast turnaround and verifiable natural-language outputs.
Key Validation Targets
This model is specifically designed to expose architectural layout and edge-case calculation bugs in modern hybrid-MoE pipelines:
- Equal Gated-Hybrid Topology (`layer_types` Validation): Features a 1:1 alternating hybrid layer arrangement across 6 transformer layers. Layers 0, 2, and 4 are configured as `linear_attention` (Gated DeltaNet linear recurrence layers), while Layers 1, 3, and 5 are configured as `full_attention` (Gated Softmax attention layers). This guarantees that both execution pathways are exercised and verified within the engine.
- Explicit Layer Shape Structure: By explicitly defining `num_attention_heads=8` and `head_dim=32` in the configuration, a physical 256-dimensional attention mechanism is built directly. This eliminates the need for implicit, runtime 2x layer expansion logic and ensures a predictable memory layout during weight loading.
- Q-Norm / K-Norm Structure Verification: Validates the application of per-head RMSNorm directly to the Query and Key tensors prior to the core attention dot-product computation. This is a native feature of the Qwen3 architecture that improves numerical stability.
- Native YaRN RoPE Scaling Integration: Incorporates a true YaRN (Yet Another RoPE Extension) configuration (`rope_scaling`) with `factor=4.0` and `original_max_position_embeddings=64`. This validates that the inference engine can accurately compute frequency adjustments across an expanded context window.
- True 8:1 GQA Ratio: Implements an asymmetric configuration containing exactly 8 Query heads and 1 Key-Value head (`num_attention_heads=8`, `num_key_value_heads=1`). This checks that KV caching structures, stride calculations, parallel splits, and index handling process Grouped-Query Attention topologies properly without memory alignment failures.
- Top-1 Gated Routing & Token Dispatch: Features a micro-MoE topology with exactly 2 local experts, routing 1 expert per token (`num_experts_per_tok=1`) across the layers. This lets developers trace router logit probabilities, routing masks, and row-major tensor switching logic precisely, without heavy hardware memory tracking.
- Shared Expert Isolation: Implements an explicit Shared Expert configuration with an independent dimension shape (`shared_expert_intermediate_size=128`). This helps verify that the engine correctly adds the non-gated shared expert output onto the single gated routed expert output without accumulation alignment bugs.
- Layer-wise Projection Bias Verification (±0.2 Uniform Range): The `q_proj`, `k_proj`, and `v_proj` projections carry frozen biases drawn from a uniform random distribution in the range ±0.2. If an inference engine fails to map these attention biases, or shifts them even slightly, the numerical discrepancy compounds across the network layers and greedy generation degrades into garbage within a few tokens.
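The top-1 routing and shared-expert targets above can be illustrated with a small, dependency-free sketch. It is not the model's actual implementation: the toy experts, the `top1_moe` helper, and the `renormalize` flag are hypothetical names introduced here, and whether the selected expert's weight is renormalized to 1.0 is an assumption that should be checked against the reference implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_moe(token, router_logits, experts, shared_expert, renormalize=True):
    """Route one token to its single best expert and add the shared expert output.

    `renormalize` is a hypothetical switch: with top-1 routing, renormalizing
    the selected probability yields a weight of exactly 1.0.
    """
    probs = softmax(router_logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)
    weight = 1.0 if renormalize else probs[chosen]
    routed = experts[chosen](token)
    shared = shared_expert(token)  # non-gated path, always added
    return chosen, [weight * r + s for r, s in zip(routed, shared)]

# Toy stand-ins for the 2 local experts and the shared expert (assumptions).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-1.0 * v for v in x],
]
shared_expert = lambda x: [v + 1.0 for v in x]

chosen, out = top1_moe([1.0, 2.0], [0.1, 0.9], experts, shared_expert)
print(chosen, out)
```

Running the same token through an engine's dispatcher and comparing the chosen expert index before comparing the summed output helps separate routing bugs from accumulation bugs.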
Repository Structure & File Descriptions
The repository structure follows standard Hugging Face native configurations for standalone tensor loading:
```
.
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    └── tokenizer.json
```
Hugging Face Native Format (./hf/)
- `hf/model.safetensors`: The raw, unquantized model weights in Safetensors format, containing standard query/key/value projections with biases, router layers, recurrent DeltaNet weights, and MoE experts.
- `hf/config.json`: The architectural configuration file defining MoE and hybrid hyperparameters (6 layers alternating between `linear_attention` and `full_attention`, 8 heads, 2 local experts, 1 active expert, shared parameters, weight-tying, standard dimensions, and YaRN scaling parameters).
- `hf/generation_config.json`: Default parameters for text generation.
- `hf/tokenizer_config.json`: Tokenizer behavior settings specifying the custom ChatML/Qwen3 fast tokenizer setup.
- `hf/tokenizer.json`: The custom Byte-Level BPE tokenizer definition, trained with a base size of 1000.
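When writing a custom weight loader, it can help to list the tensor names, dtypes, and shapes before touching any tensor data. The sketch below reads only the Safetensors header (an 8-byte little-endian length followed by a JSON object) using the standard library. The helper names are hypothetical, and the path in the usage comment assumes the repository files were downloaded locally.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as an unsigned little-endian integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))

def summarize_tensors(header):
    """Return sorted (name, dtype, shape) tuples, skipping the optional metadata entry."""
    rows = []
    for name, info in sorted(header.items()):
        if name == "__metadata__":
            continue
        rows.append((name, info["dtype"], tuple(info["shape"])))
    return rows

# Usage, assuming a local copy of the weights (the path is an assumption):
# for name, dtype, shape in summarize_tensors(read_safetensors_header("hf/model.safetensors")):
#     print(f"{name:60s} {dtype:6s} {shape}")
```

Comparing this listing against the engine's expected weight map is a quick way to spot missing bias tensors or mismatched projection shapes.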
Usage Examples
Loading Hugging Face Formats via Python
To match the dataset structure used during training, encode text with add_special_tokens=False and manually prepend the exact BOS token ID (1000). This ensures perfect token alignment and accurate MoE forward pass sampling.
```python
import torch
from transformers import PreTrainedTokenizerFast, Qwen3MoeForCausalLM

repo_id = "shibatch/tinyqwen3gated2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = Qwen3MoeForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline layout
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,  # Forces deterministic greedy decoding
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
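Because decoding is greedy and deterministic, the generated token IDs can serve as a reference for an engine under test. A minimal sketch of such a comparison follows; `first_divergence` is a hypothetical helper, and the candidate IDs are assumed to come from the custom engine given the same BOS-prefixed prompt.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if the sequences match."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None

# Illustrative token IDs only; real values come from the reference run and the engine.
reference = [1000, 12, 57, 301]
candidate = [1000, 12, 58, 301]
print(first_divergence(reference, candidate))
```

An early divergence index usually points at a structural problem such as a missing bias or wrong layer type, whereas a late one is more likely numerical drift.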
Model Specifications
The routing configuration, hybrid layers, and attention mechanics follow the standard Qwen3-Gated-MoE structure.
| Parameter | Specification |
|---|---|
| Architecture | Qwen3-MoE (`Qwen3MoeForCausalLM`) |
| Layer Types (`layer_types`) | Alternating `["linear_attention", "full_attention", ...]` (3 layers each) |
| Dataset | TinyStories |
| Total Parameters | ~2.0M |
| Vocabulary Size | 1,024 (Custom Byte-Level BPE with 1000 base tokens + special characters) |
| Hidden Size (`hidden_size`) | 128 |
| Head Dimension (`head_dim`) | 32 (8 heads × 32 dim = 256 explicit structural `q_proj` dimension) |
| Number of Hidden Layers | 6 |
| Number of Attention Heads | 8 |
| Number of Key-Value Heads | 1 (Standard GQA 8:1 topology) |
| Intermediate Size | 256 |
| Max Position Embeddings | 256 |
| Attention Bias | True (Explicitly uniform random between -0.2 and 0.2 for `q_proj`, `k_proj`, `v_proj`) |
| Total Local Experts | 2 |
| Experts Selected per Token | 1 (Top-1 Routing) |
| Expert Intermediate Size | 256 |
| Shared Expert Intermediate Size | 128 |
| RMS Norm Epsilon | 1e-06 |
| RoPE Base Frequency (`rope_theta`) | 1,000,000.0 |
| RoPE Scaling (`rope_scaling`) | `{"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 64}` |
| Weight Tying (`tie_word_embeddings`) | True |
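Several of the values in the table are linked by simple arithmetic, which an engine's config loader can check before allocating buffers. The sketch below uses a hand-written dict with values copied from the table; the dict and variable names are illustrative and do not reproduce the full `config.json`.

```python
# Values copied from the specification table; this dict is an illustrative subset.
config = {
    "hidden_size": 128,
    "num_hidden_layers": 6,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
    "head_dim": 32,
    "max_position_embeddings": 256,
    "rope_scaling": {"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 64},
}

# Alternating pattern described in the document: even layers linear, odd layers full attention.
layer_types = [
    "linear_attention" if i % 2 == 0 else "full_attention"
    for i in range(config["num_hidden_layers"])
]

# Projection widths: q_proj maps hidden_size (128) to heads * head_dim (256).
q_width = config["num_attention_heads"] * config["head_dim"]
kv_width = config["num_key_value_heads"] * config["head_dim"]

# GQA: every group of query heads shares one KV head.
gqa_group = config["num_attention_heads"] // config["num_key_value_heads"]
kv_head_for_query = [h // gqa_group for h in range(config["num_attention_heads"])]

# YaRN: the scaled context is the original window multiplied by the factor.
yarn = config["rope_scaling"]
extended_context = int(yarn["factor"] * yarn["original_max_position_embeddings"])

print(layer_types)
print(q_width, kv_width, gqa_group, kv_head_for_query, extended_context)
```

With these values the scaled YaRN context (64 × 4.0) equals the stated maximum of 256 positions, and all 8 query heads map to KV head 0.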
Acknowledgments & License
- Original Architecture: Qwen3 Model Family (Alibaba Cloud).
- Dataset: TinyStories dataset (Microsoft Research).
- License: MIT License. You are free to use, modify, and distribute these assets for any purpose, commercial or private.