TinyStories Qwen3 2.07M (tinyqwen3-2m) Validation Suite

This repository provides an ultra-lightweight Qwen3 dense model file in Hugging Face / Safetensors format, trained on the TinyStories dataset and optimized for inference engine testing, verification, and automated CI pipelines.

Why this repository exists

When developing a custom LLM inference engine or optimizing low-level tensor operations, debugging against a full-sized model slows development. This suite offers a Qwen3 dense model at a true 2.07M-parameter scale, so developers can validate their loaders, namespace parsing, compact tokenization matrices, and custom attention mechanisms step by step, with fast turnaround and verifiable natural-language output.

Key Validation Targets

This model is specifically designed to expose architectural layout and edge-case calculation bugs:

  • Q-Norm / K-Norm Structure Verification: Validates that per-head RMSNorm is applied to the Query and Key tensors before the attention dot product. This is a native feature of the Qwen3 architecture that improves numerical stability. Because head_dim=32 is set explicitly, the 256-dimensional q_proj (8 heads $\times$ 32) is stored directly in the checkpoint rather than being expanded at runtime. Note that hidden_size / num_attention_heads is 16 here, so an engine that derives head_dim from that ratio instead of reading it from the config would compute the wrong shapes.

  • True 8:1 GQA Ratio: Uses an asymmetric configuration of exactly 8 Query heads sharing 1 Key-Value head (the single-KV-head limit of Grouped-Query Attention, also known as Multi-Query Attention). This checks that KV cache layout, stride calculations, parallel splits, and head indexing handle grouped topologies without memory alignment failures.

  • Layer-wise Random Bias Verification (Deep Vertical Topology): Combines a 6-layer depth with explicit, non-zero uniform random biases ($\pm 0.2$) injected into the q_proj, k_proj, and v_proj projections during initialization (configured as frozen, non-gradient constants). If an inference engine miscalculates, omits, or mis-indexes these projection biases, the numerical discrepancy accumulates across the 6 sequential layers and the generated text degrades into random garbage within a few tokens. This acts as a highly sensitive tripwire for automated CI validation.
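
The first two targets can be illustrated with a small pure-Python sketch. It assumes the usual RMSNorm definition (divide by the root mean square over head_dim, then scale by a learned weight) and the usual GQA rule that query head h reads KV head h // (num_q_heads // num_kv_heads). The helper names and the input values are hypothetical, and the all-ones norm weight is a placeholder for the values stored in model.safetensors.

```python
import math

NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM = 8, 1, 32  # values from the spec list
RMS_EPS = 1e-6

def rms_norm(vec, weight, eps=RMS_EPS):
    """RMSNorm over a single head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(vec, weight)]

def split_heads(flat, num_heads, head_dim):
    """Cut a flat projection output into num_heads vectors of head_dim each."""
    assert len(flat) == num_heads * head_dim
    return [flat[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def kv_head_for(query_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a given query head attends with."""
    group_size = num_q_heads // num_kv_heads
    return query_head // group_size

# Hypothetical 256-wide q_proj output for one token (made-up values).
q_flat = [float(i % 7 - 3) for i in range(NUM_Q_HEADS * HEAD_DIM)]
# Placeholder weight; one weight vector of length head_dim is shared by all heads.
q_norm_weight = [1.0] * HEAD_DIM

# The norm runs per head, over head_dim only, before the attention dot product.
q_heads = [rms_norm(h, q_norm_weight)
           for h in split_heads(q_flat, NUM_Q_HEADS, HEAD_DIM)]

# With 8 query heads and 1 KV head, every query head maps to KV head 0.
kv_map = [kv_head_for(h, NUM_Q_HEADS, NUM_KV_HEADS) for h in range(NUM_Q_HEADS)]
print(kv_map)
```

An engine that normalizes over the full 256-wide vector instead of each 32-wide head, or that indexes the KV cache by query head, would diverge from this reference on the first layer.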


📂 Repository Structure & File Descriptions

This repository ships no GGUF binaries; it contains only the standard Hugging Face Safetensors files:

.
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── tokenizer.json
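
For loader work, the tensor names and shapes can be listed without any ML framework. A Safetensors file starts with an 8-byte little-endian header length, followed by a JSON header that maps each tensor name to its dtype, shape, and data offsets. The sketch below parses that header using only the standard library. It runs against a tiny synthetic blob, and the tensor names demo.weight and demo.bias are made up for illustration; they are not names from this repository.

```python
import json
import struct

def read_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header of a Safetensors file held in memory."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header

def make_demo_blob() -> bytes:
    """Build a tiny synthetic file so the sketch runs without any download."""
    header = {
        "demo.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
        "demo.bias": {"dtype": "F32", "shape": [2], "data_offsets": [16, 24]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    payload = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 0.5, -0.5)
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

blob = make_demo_blob()
tensors = read_safetensors_header(blob)
for name, info in sorted(tensors.items()):
    print(name, info["dtype"], info["shape"])
```

To inspect the real checkpoint, read the bytes of hf/model.safetensors and pass them to read_safetensors_header in place of the synthetic blob.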


🚀 Usage Example (Loading via Python)

For exact token alignment with the training data, use the script below. To match the dataset structure used during training, encode the prompt with add_special_tokens=False and manually prepend the BOS token ID (1000).

import torch
from transformers import PreTrainedTokenizerFast, Qwen3ForCausalLM

repo_id = "shibatch/tinyqwen3-2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = Qwen3ForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=False,        # Matches --temp 0
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
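
For CI use, the greedy token IDs produced by the script above can serve as a reference to compare against another engine's output. The helper below is a hypothetical sketch, not part of this repository, and the token IDs in it are made up for illustration. It reports the position of the first mismatching token, which is usually more useful in a failure log than a plain pass/fail.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if both match."""
    for i, (ref, got) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != got:
            return i
    # Same prefix but different lengths: diverges where the shorter one ends.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None

# Made-up token IDs for illustration only.
reference = [1000, 17, 42, 99, 5]
engine_ok = [1000, 17, 42, 99, 5]
engine_bad = [1000, 17, 42, 7, 311]

print(first_divergence(reference, engine_ok))
print(first_divergence(reference, engine_bad))
```

Because decoding is greedy (do_sample=False), a correct engine is expected to reproduce the reference sequence, although small floating-point differences between backends can still flip a near-tie, so a CI job may want to log the mismatch index before failing.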


Model Specifications

The model ties its input embedding and output projection weights (tie_word_embeddings). The full set of hyperparameters is listed below.

  • Architecture: Qwen3 Dense (Qwen3ForCausalLM)

  • Dataset: TinyStories

  • Total Parameters: 2,070,784 parameters (~2.07M)

  • Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special control tokens)

  • Hidden Size (hidden_size): 128

  • Head Dimension (head_dim): 32 (8 heads $\times$ 32 dim = 256, so q_proj is stored as a 256-dimensional projection rather than being expanded at runtime)

  • Number of Hidden Layers (num_hidden_layers): 6

  • Number of Attention Heads (num_attention_heads): 8

  • Number of Key-Value Heads (num_key_value_heads): 1 (Standard GQA 8:1 topology)

  • Intermediate Size (intermediate_size): 691

  • Max Position Embeddings (max_position_embeddings): 256

  • Attention Bias (attention_bias): True (Explicitly configured with $\pm 0.2$ frozen uniform random bias vectors)

  • RMS Norm Epsilon: 1e-06

  • RoPE Base Frequency (rope_theta): 1,000,000.0

  • Weight Tying (tie_word_embeddings): True
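
The attention shapes an engine should expect follow directly from the values above. The sketch below derives them, assuming the (out_features, in_features) weight layout that PyTorch linear layers use; the function names are hypothetical. It also computes the number of KV cache elements, which is small here because there is only one Key-Value head.

```python
config = {
    "hidden_size": 128,
    "head_dim": 32,
    "num_hidden_layers": 6,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
    "max_position_embeddings": 256,
}

def projection_shapes(cfg):
    """Attention weight shapes as (out_features, in_features)."""
    hidden = cfg["hidden_size"]
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]
    return {
        "q_proj": (q_out, hidden),
        "k_proj": (kv_out, hidden),
        "v_proj": (kv_out, hidden),
        "o_proj": (hidden, q_out),
    }

def kv_cache_elements(cfg, seq_len):
    """Elements held by the K and V caches across all layers for seq_len tokens."""
    per_token = 2 * cfg["num_key_value_heads"] * cfg["head_dim"]
    return per_token * cfg["num_hidden_layers"] * seq_len

shapes = projection_shapes(config)
cache = kv_cache_elements(config, config["max_position_embeddings"])
print(shapes)
print(cache)
```

Note that q_proj maps 128 inputs to 256 outputs, so o_proj must map 256 back to 128; a loader that assumes square attention projections would fail on this model.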

📜 License

  • License: MIT License. You are free to use, modify, and distribute these assets for any purpose, commercial or private.