TinyStories Qwen3-MoE 1.95M (tinyqwen3moe2m) HF Validation Suite
This repository provides an ultra-lightweight Qwen3 Mixture-of-Experts (MoE) model file in Hugging Face / Safetensors format, trained to 100% convergence on the TinyStories dataset and optimized for inference engine testing, MoE routing verification, and automated CI pipelines.
Why this repository exists
When developing a custom Mixture-of-Experts (MoE) inference engine or optimizing low-level routing layers, debugging with full-sized models slows development considerably. This suite offers a Qwen3-MoE model at a true 1.95M-parameter scale, so developers can validate their top-k gates, token dispatchers, shared expert combinations, and YaRN RoPE scaling parameters step by step, quickly, and against verifiable natural language outputs.
Key Validation Targets
This model is specifically designed to expose architectural layout and edge-case calculation bugs in MoE pipelines:
- Explicit Layer Topology (No Dynamic Extension): By explicitly defining `num_attention_heads=8` and `head_dim=32` in the configuration, a physical 256-dimensional `q_proj` layer is built directly. This eliminates the need for implicit, runtime 2x layer expansion logic and ensures a predictable memory layout during weight loading.
- Q-Norm / K-Norm Structure Verification: Validates the application of per-head RMSNorm directly to the Query and Key tensors before the core attention dot-product computation. This is a native feature of the Qwen3 architecture that supports numerical stability.
- Native YaRN RoPE Scaling Integration: Incorporates a true YaRN (Yet Another RoPE Extension) configuration (`rope_scaling`) with `factor=4.0` and `original_max_position_embeddings=64`. This validates that the inference engine computes frequency adjustments accurately across an expanded context window.
- True 8:1 GQA Ratio: Implements an asymmetric configuration containing exactly 8 Query heads and 1 Key-Value head (`num_attention_heads=8`, `num_key_value_heads=1`). This checks that KV caching structures, stride calculations, parallel splits, and index handling process Grouped-Query Attention topologies properly without memory alignment failures.
- Top-1 Gated Routing & Token Dispatch: Features a micro-MoE topology with exactly 2 local experts, routing 1 expert per token (`num_experts_per_tok=1`) across 6 transformer layers. This lets developers trace router logits, routing probabilities, routing masks, and row-major tensor switching logic exactly, without the memory overhead of a large model.
- Shared Expert Isolation: Implements an explicit Shared Expert configuration with an independent dimension (`shared_expert_intermediate_size=128`). This helps verify that the engine correctly adds the non-gated shared expert output to the output of the single gated routed expert, without accumulation or alignment bugs.
- Continuous Packed Sequence Handling: The training pipeline uses a high-density token-packing algorithm, concatenating tokenized samples, each delimited strictly by the control tags (`<s>` / `</s>`), into chunks of exactly 256 tokens. This is ideal for testing internal state boundaries, position indexing shifts, and sequence-packing attention masks in custom hardware layouts.
- Layer-wise Projection Bias Verification ($\pm 0.2$ Uniform Range): The `q_proj`, `k_proj`, and `v_proj` layers carry frozen biases drawn from a uniform random range ($\pm 0.2$). If an inference engine fails to map these attention biases, or shifts them slightly, the numerical discrepancy compounds across the 6 layers and turns greedy generation into garbage within a few tokens.
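The top-1 routing and shared expert items above can be illustrated with a minimal pure-Python sketch for a single token. It is not taken from the model's implementation: the helper names (`softmax`, `route_top1`, `moe_token_output`) are hypothetical, and the `norm_topk_prob` flag is an assumption about how an engine might renormalize the selected probability. The shared expert output is added without a gate, as the list describes.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the router logits of one token.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(router_logits, norm_topk_prob=False):
    # Pick the single highest-probability expert (num_experts_per_tok=1).
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: with renormalization over the top-1 set, the weight becomes 1.0.
    weight = 1.0 if norm_topk_prob else probs[expert]
    return expert, weight

def moe_token_output(router_logits, expert_outputs, shared_output, norm_topk_prob=False):
    # expert_outputs[e] is the (precomputed) output vector of routed expert e.
    expert, weight = route_top1(router_logits, norm_topk_prob)
    routed = [weight * v for v in expert_outputs[expert]]
    # The shared expert output is added directly, without a gate.
    return [r + s for r, s in zip(routed, shared_output)]

# Two experts, hidden size 2 for readability.
out = moe_token_output([0.0, 1.0], [[1.0, 1.0], [2.0, 4.0]], [0.5, 0.5])
print(out)
```

Comparing such a reference against an engine's per-token output, layer by layer, is one way to localize a dispatch or accumulation bug.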
Repository Structure & File Descriptions
The repository structure follows standard Hugging Face native configurations for standalone tensor loading:
```text
.
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── tokenizer.json
```
Hugging Face Native Format (./hf/)
- `hf/model.safetensors`: The raw, unquantized model weights containing standard query/key/value projections with biases, router layers, and MoE experts stored in secure Safetensors format.
- `hf/config.json`: The architectural configuration file defining MoE hyperparameters (6 layers, 8 heads, 2 local experts, 1 active expert, shared parameters, weight-tying, standard dimensions, and YaRN scaling parameters).
- `hf/generation_config.json`: Default parameters optimized for text generation.
- `hf/tokenizer_config.json`: Tokenizer behavior layout specifying the custom ChatML/Qwen3 fast tokenizer setup.
- `hf/special_tokens_map.json`: Architectural mappings tying special characters to the token blocks.
- `hf/tokenizer.json`: The custom Byte-Level BPE tokenization descriptor layout trained with a base size of 1000.
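As a quick sanity check before loading weights, the projection shapes an engine should expect can be derived from a few `config.json` fields. The sketch below uses a hand-written dictionary holding the values stated in this card rather than reading the file. The helper names are hypothetical, and the `(out_features, in_features)` weight layout is the PyTorch `nn.Linear` convention, assumed here.

```python
# Values as stated in this model card; a real check would read hf/config.json.
config = {
    "hidden_size": 128,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
    "head_dim": 32,
}

def projection_shapes(cfg):
    # Weight shapes in (out_features, in_features) order.
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]
    hidden = cfg["hidden_size"]
    return {
        "q_proj": (q_out, hidden),
        "k_proj": (kv_out, hidden),
        "v_proj": (kv_out, hidden),
        "o_proj": (hidden, q_out),
    }

def kv_head_for_query_head(q_head, cfg):
    # In grouped-query attention, consecutive query heads share one KV head.
    group = cfg["num_attention_heads"] // cfg["num_key_value_heads"]
    return q_head // group

for name, shape in projection_shapes(config).items():
    print(name, shape)
```

With 8 query heads and 1 key-value head, every query head maps to KV head 0, which is the case most likely to expose stride or broadcast mistakes.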
Usage Examples
Loading Hugging Face Formats via Python
To match the dataset structure used during training, encode text with add_special_tokens=False and manually prepend the exact BOS token ID (1000). This ensures perfect token alignment and accurate MoE forward pass sampling.
```python
import torch
from transformers import PreTrainedTokenizerFast, Qwen3MoeForCausalLM

repo_id = "shibatch/tinyqwen3moe2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = Qwen3MoeForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline layout
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,  # Forces deterministic greedy decoding
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
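The reason for prepending BOS by hand is the packed training layout described earlier: samples wrapped in control tokens and concatenated into fixed-size blocks. The following is a rough sketch of that kind of packing, not the author's training code. The function name `pack_sequences`, the EOS ID 1001, and the choice to drop a trailing partial block are all assumptions made for illustration; only the BOS ID 1000 and the block size 256 come from this card.

```python
def pack_sequences(samples, bos_id, eos_id, block_size=256):
    # Wrap every sample in BOS/EOS and concatenate into one token stream.
    stream = []
    for ids in samples:
        stream.append(bos_id)
        stream.extend(ids)
        stream.append(eos_id)
    # Cut the stream into full blocks; the trailing remainder is dropped
    # (an assumption, a real pipeline might carry it over or pad it).
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Tiny demonstration with block_size=4 and a hypothetical EOS ID of 1001.
blocks = pack_sequences([[1, 2], [3]], bos_id=1000, eos_id=1001, block_size=4)
print(blocks)
```

Because a block can start or end in the middle of a sample, an engine tested against this model should not assume that position 0 always holds a BOS token.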
Model Specifications
The routing configuration, hidden layers, and attention mechanics follow the standard Qwen3-MoE structure.
- Architecture: Qwen3-MoE (`Qwen3MoeForCausalLM`)
- Dataset: TinyStories
- Total Parameters: 1,954,816 parameters (~1.95M)
- Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special characters)
- Hidden Size (`hidden_size`): 128
- Head Dimension (`head_dim`): 32 (8 heads $\times$ 32 dim = 256, explicitly defining the 256-dimensional `q_proj` from the start without dynamic runtime extensions)
- Number of Hidden Layers (`num_hidden_layers`): 6
- Number of Attention Heads (`num_attention_heads`): 8
- Number of Key-Value Heads (`num_key_value_heads`): 1 (Standard GQA 8:1 topology)
- Intermediate Size (`intermediate_size`): 256
- Max Position Embeddings (`max_position_embeddings`): 256
- Attention Bias (`attention_bias`): True (Explicitly uniform random between -0.2 and 0.2 for `q_proj`, `k_proj`, and `v_proj`)
- Total Local Experts (`num_experts`): 2
- Experts Selected per Token (`num_experts_per_tok`): 1 (Top-1 Routing)
- Expert Intermediate Size (`moe_intermediate_size`): 256
- Shared Expert Intermediate Size (`shared_expert_intermediate_size`): 128
- RMS Norm Epsilon: 1e-06
- RoPE Base Frequency (`rope_theta`): 1,000,000.0
- RoPE Scaling (`rope_scaling`): `{"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 64}`
- Weight Tying (`tie_word_embeddings`): True
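Note that `factor` times `original_max_position_embeddings` (4.0 × 64) equals the 256-token `max_position_embeddings`. For engine authors who want a reference for the scaled frequencies, here is a pure-Python sketch of the commonly used YaRN inverse-frequency computation, filled in with the values above. The `beta_fast=32` and `beta_slow=1` defaults are assumptions (they are not listed in this card), and the function name `yarn_inv_freq` is hypothetical. Treat the result as a cross-check, not as the authoritative definition.

```python
import math

def yarn_inv_freq(head_dim=32, base=1_000_000.0, factor=4.0,
                  original_max_pos=64, beta_fast=32.0, beta_slow=1.0):
    half = head_dim // 2

    def correction_dim(num_rotations):
        # Dimension index at which a frequency completes `num_rotations`
        # full rotations over the original context length.
        return (head_dim * math.log(original_max_pos / (num_rotations * 2 * math.pi))) / (
            2 * math.log(base)
        )

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), head_dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(half):
        pos_freq = base ** (2 * i / head_dim)
        extrapolated = 1.0 / pos_freq             # unscaled RoPE frequency
        interpolated = 1.0 / (factor * pos_freq)  # position-interpolated frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        keep = 1.0 - ramp  # share of the unscaled frequency that is kept
        inv_freq.append(interpolated * (1.0 - keep) + extrapolated * keep)

    # Scaling applied to the cos/sin tables in the common formulation.
    attention_factor = 0.1 * math.log(factor) + 1.0
    return inv_freq, attention_factor

inv_freq, attention_factor = yarn_inv_freq()
print(len(inv_freq), attention_factor)
```

High-frequency dimensions keep their original rotation speed while low-frequency dimensions are interpolated by `factor`, so a mismatch in either regime shows up at different token positions.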
Acknowledgments & License
- Original Architecture: Qwen3 Model Family.
- Dataset: TinyStories dataset.
- License: MIT License. You are free to use, modify, and distribute these assets for any purpose, commercial or private.