TinyStories Qwen3 2.07M (tinyqwen3-2m) Validation Suite
This repository provides an ultra-lightweight Qwen3 dense model file in Hugging Face / Safetensors format, trained on the TinyStories dataset and optimized for inference engine testing, verification, and automated CI pipelines.
Why this repository exists
When developing a custom LLM inference engine or optimizing low-level tensor operations, debugging against a full-sized model slows every iteration. This suite offers a genuine 2.07M-parameter Qwen3 dense model, so developers can validate their loaders, namespace parsing, compact tokenization matrices, and custom attention mechanisms step by step, with fast turnaround and verifiable natural-language outputs.
Key Validation Targets
This model is specifically designed to expose architectural layout and edge-case calculation bugs:
- Q-Norm / K-Norm Structure Verification: Validates the application of per-head RMSNorm directly to the Query and Key tensors prior to the core attention dot-product computation. This is a native feature of the Qwen3 architecture intended to keep the attention computation numerically stable. Because head_dim=32 is set explicitly from the start, a physical 256-dimensional q_proj layer is built directly without relying on dynamic runtime extension logic.
- True 8:1 GQA Ratio: Implements an asymmetric configuration containing exactly 8 Query heads and 1 Key-Value head. This checks that KV caching structures, stride calculations, parallel splits, and index handling process Grouped-Query Attention topologies properly without memory alignment failures.
- Layer-wise Random Bias Verification (Deep Vertical Topology): Features a 6-layer depth combined with explicit, non-zero random uniform biases ($\pm 0.2$) injected into the q_proj, k_proj, and v_proj projections during initialization (configured as frozen non-grad constants). If an inference engine miscalculates, omits, or shifts the index of these projection biases, the numerical discrepancy accumulates rapidly across the 6 sequential layers, and the generated text degenerates into random garbage within a few tokens. This acts as a highly sensitive tripwire for automated CI validation.
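The first two targets can be illustrated with a small reference computation. The following is a minimal pure-Python sketch, not the repository's implementation: the function names are hypothetical, the norm weight is assumed to be a single head_dim-sized vector shared by all heads, and RoPE and the softmax are omitted for brevity.

```python
import math

HEAD_DIM = 32        # head_dim from the model card
NUM_Q_HEADS = 8      # num_attention_heads
NUM_KV_HEADS = 1     # num_key_value_heads
RMS_EPS = 1e-6       # RMS norm epsilon


def rms_norm(vec, weight, eps=RMS_EPS):
    """RMSNorm over one head: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(vec, weight)]


def split_heads(flat, num_heads, head_dim=HEAD_DIM):
    """Split a flat projection output into num_heads vectors of head_dim."""
    assert len(flat) == num_heads * head_dim
    return [flat[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def kv_head_for(q_head, num_q_heads=NUM_Q_HEADS, num_kv_heads=NUM_KV_HEADS):
    """Map a query head index to the KV head it shares under GQA."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def qk_scores(q_flat, k_flat, q_norm_w, k_norm_w):
    """Scaled dot-product score per query head for one query/key position.

    q_flat has 8 * 32 = 256 values, k_flat has 1 * 32 = 32 values.
    Each head is normalized on its own before the dot product.
    """
    q_heads = [rms_norm(h, q_norm_w) for h in split_heads(q_flat, NUM_Q_HEADS)]
    k_heads = [rms_norm(h, k_norm_w) for h in split_heads(k_flat, NUM_KV_HEADS)]
    scale = 1.0 / math.sqrt(HEAD_DIM)
    scores = []
    for h, q in enumerate(q_heads):
        k = k_heads[kv_head_for(h)]  # all 8 query heads read KV head 0
        scores.append(scale * sum(a * b for a, b in zip(q, k)))
    return scores
```

An engine that normalizes across the full 256-dimensional query instead of per 32-dimensional head, or that indexes a non-existent second KV head, would disagree with this reference on the very first score.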
Repository Structure & File Descriptions
The current directory layout excludes any GGUF binaries and is composed purely of standard Hugging Face native Safetensors structures:
.
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── tokenizer.json
Usage Example (Loading via Python)
To reproduce the token layout the model saw during training, use the script below. It encodes the prompt with add_special_tokens=False and then manually prepends the BOS token (ID 1000), matching the dataset structure used during training.
import torch
from transformers import PreTrainedTokenizerFast, Qwen3ForCausalLM
repo_id = "shibatch/tinyqwen3-2m"
# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = Qwen3ForCausalLM.from_pretrained(repo_id, subfolder="hf")
prompt = "Once upon"
# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)
# Manually prepend the exact BOS token ID (1000) to match the training pipeline
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,  # Greedy decoding; matches --temp 0
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
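For CI use, the greedy token IDs produced by the script above can serve as a reference to compare a custom engine against, token by token. The sketch below assumes both outputs are available as plain lists of token IDs; the helper name first_divergence and the sample IDs are hypothetical placeholders, not values produced by this model.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first mismatching token, or None if identical."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    # One sequence is a strict prefix of the other: diverge at the shorter length.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None


# Placeholder IDs for illustration only; real values come from the reference run.
reference = [1000, 12, 57, 301, 44]
candidate = [1000, 12, 57, 299, 44]

idx = first_divergence(reference, candidate)
if idx is not None:
    print(f"engine diverges from reference at token index {idx}")
```

Reporting the index rather than a bare pass/fail helps narrow down whether a bug shows up on the first generated token (often a loader or bias problem) or only later in the sequence (often a KV cache or position problem).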
Model Specifications
The model ties its input embedding and output projection weights (tie_word_embeddings) and otherwise follows the Qwen3 dense layout with the dimensions listed below.
- Architecture: Qwen3 Dense (Qwen3ForCausalLM)
- Dataset: TinyStories
- Total Parameters: 2,070,784 (~2.07M)
- Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special control characters)
- Hidden Size (hidden_size): 128
- Head Dimension (head_dim): 32 (8 heads $\times$ 32 dim = 256, explicitly defining the 256-dimensional q_proj from the start without dynamic runtime extensions)
- Number of Hidden Layers (num_hidden_layers): 6
- Number of Attention Heads (num_attention_heads): 8
- Number of Key-Value Heads (num_key_value_heads): 1 (Standard GQA 8:1 topology)
- Intermediate Size (intermediate_size): 691
- Max Position Embeddings (max_position_embeddings): 256
- Attention Bias (attention_bias): True (Explicitly configured with $\pm 0.2$ frozen uniform random bias vectors)
- RMS Norm Epsilon: 1e-06
- RoPE Base Frequency (rope_theta): 1,000,000.0
- Weight Tying (tie_word_embeddings): True
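As a quick sanity check for a custom loader, the attention projection shapes implied by these values can be derived as follows. This is a sketch under stated assumptions: the helper name is hypothetical, the numbers are copied from the list above, and weights are assumed to use the (out_features, in_features) layout of torch.nn.Linear.

```python
# Values copied from the specification list above.
CONFIG = {
    "hidden_size": 128,
    "head_dim": 32,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
}


def attention_projection_shapes(cfg):
    """Return (out_features, in_features) for each attention projection."""
    hidden = cfg["hidden_size"]
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]   # 8 * 32 = 256
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]  # 1 * 32 = 32
    return {
        "q_proj": (q_out, hidden),
        "k_proj": (kv_out, hidden),
        "v_proj": (kv_out, hidden),
        "o_proj": (hidden, q_out),
    }


shapes = attention_projection_shapes(CONFIG)
for name, (out_features, in_features) in shapes.items():
    print(f"{name}: weight ({out_features}, {in_features})")

# A loader that derives head_dim from hidden_size instead of reading it
# from the config would compute 16 here, not the configured 32.
derived_head_dim = CONFIG["hidden_size"] // CONFIG["num_attention_heads"]
print("derived:", derived_head_dim, "configured:", CONFIG["head_dim"])
```

The q_proj output (256) is wider than hidden_size (128), so any engine that assumes the two are equal will allocate or slice the query tensor incorrectly.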
License
- License: MIT License. You are free to use, modify, and distribute these assets for any purpose, commercial or private.