TinyStories Qwen2 2M (tinyqwen2m) GGUF & HF Validation Suite

This repository provides ultra-lightweight Qwen2 model files in both GGUF and Hugging Face Safetensors formats. The model was trained to convergence on the TinyStories dataset and is intended for testing and validating inference engines.

Why this repository exists

When developing a custom LLM inference engine, debugging with a full-sized model is slow. This suite offers a Qwen2 model at a true 2M-parameter scale (~4.0 MB), so developers can validate their loaders, namespace parsing, compact token embedding matrices, and Grouped-Query Attention (GQA) logic step by step, quickly and against verifiable natural-language output.

Key Validation Targets

This model is designed to expose architectural layout bugs that standard Llama files cannot trigger:

  • Dynamic Namespace Prefix Parsing: GGUF metadata keys use the qwen2. namespace (e.g., qwen2.attention.head_count) instead of the traditional llama. identifier. This forces your GGUF loader to resolve string lookup configurations dynamically based on general.architecture rather than falling back onto hardcoded defaults.
  • True 4:1 GQA Ratio: Implements an asymmetric configuration containing exactly 4 Query heads and 1 Key-Value head. This checks that KV caching structures, stride calculations, and sequence parallel splits handle Grouped-Query Attention topologies properly without scaling alignment failures.
  • Compact Token Arrays & Tied Embeddings: Uses a compact vocabulary of 1,024 tokens to reduce the risk of out-of-bounds index-select failures (indexSelectSmallIndex errors) on private hardware setups. Configured with "tie_word_embeddings": true so that engines must share a single weight matrix between the input embedding and the output projection.
  • Layer-wise Projection Bias Verification (Deep & Slim Architecture): Features an expanded 8-layer depth combined with an explicit, non-zero constant bias (0.1) injected into the q_proj, k_proj, and v_proj surfaces during training. If an inference engine fails to process or omits these projection biases, the numerical discrepancy accumulates rapidly across the 8 sequential layers, causing text generation to break completely into random garbage within a few tokens.
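
As a minimal sketch of the 4:1 GQA target above, the following shows how each query head maps to its shared KV head and what projection weight shapes follow from the dimensions in this README (hidden size 128, head dimension 32). The helper names kv_head_for and projection_shapes are hypothetical and exist only for illustration; the (out_features, in_features) ordering is an assumption that matches the PyTorch linear-layer convention.

```python
# Dimensions taken from this README's specification list.
NUM_Q_HEADS = 4
NUM_KV_HEADS = 1
HEAD_DIM = 32
HIDDEN_SIZE = 128


def kv_head_for(q_head, num_q_heads=NUM_Q_HEADS, num_kv_heads=NUM_KV_HEADS):
    """Return the index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group_size = num_q_heads // num_kv_heads  # 4 query heads per KV head here
    return q_head // group_size


def projection_shapes():
    """(out_features, in_features) for each attention projection weight."""
    return {
        "q_proj": (NUM_Q_HEADS * HEAD_DIM, HIDDEN_SIZE),
        "k_proj": (NUM_KV_HEADS * HEAD_DIM, HIDDEN_SIZE),
        "v_proj": (NUM_KV_HEADS * HEAD_DIM, HIDDEN_SIZE),
        "o_proj": (HIDDEN_SIZE, NUM_Q_HEADS * HEAD_DIM),
    }


# Every query head shares KV head 0 in a 4:1 configuration.
print([kv_head_for(h) for h in range(NUM_Q_HEADS)])
print(projection_shapes())
```

An engine that sizes its KV cache from the query head count instead of the KV head count would allocate four times the required rows here, which is the kind of stride mismatch this model is meant to surface.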

📂 Repository Structure & File Descriptions

.
├── tinyqwen2m.gguf
├── README.md
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── tokenizer.json

1. GGUF Format (Root Directory)

A validation binary converted for custom engines and native runtimes. The tokenizer vocabulary and special tokens are fully embedded within the GGUF file.

  • tinyqwen2m.gguf (~4.0 MB): Validates dynamic qwen2. GGUF namespace parsing, attention bias handling, RoPE operations, 16-bit floating point matrix layouts, type casting, and SwiGLU activation pipelines.
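
The dynamic namespace lookup can be sketched as follows. This is not a GGUF parser: the metadata dictionary is a hand-written stand-in for whatever your loader produces after reading the file, its values are taken from this README's specification list, and the helper name arch_key is hypothetical. Key names other than qwen2.attention.head_count follow the general GGUF "<architecture>.<field>" convention and should be checked against the actual file.

```python
def arch_key(metadata, suffix):
    """Look up '<architecture>.<suffix>' using the general.architecture value."""
    arch = metadata["general.architecture"]
    key = f"{arch}.{suffix}"
    if key not in metadata:
        raise KeyError(f"missing GGUF metadata key: {key}")
    return metadata[key]


# Hand-written stand-in for parsed GGUF metadata (assumption, not read from the file).
metadata = {
    "general.architecture": "qwen2",
    "qwen2.attention.head_count": 4,
    "qwen2.attention.head_count_kv": 1,
    "qwen2.block_count": 8,
}

print(arch_key(metadata, "attention.head_count"))
print(arch_key(metadata, "attention.head_count_kv"))
```

A loader that builds the key from a hardcoded "llama." prefix would find nothing in this dictionary, so raising on a missing key (rather than silently using a default) makes the failure visible immediately.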

2. Hugging Face Native Format (./hf/)

This directory contains the standard files required to load the model with the Hugging Face transformers library (PyTorch backend):

  • hf/model.safetensors: The raw, unquantized model weights stored securely in Safetensors format.
  • hf/config.json: The architectural configuration file defining hyperparameters (8 layers, attention biases, weight-tying, standard dimensions).
  • hf/generation_config.json: Default parameters optimized for text generation.
  • hf/tokenizer_config.json: Tokenizer behavior layout specifying the custom ChatML/Qwen2 fast tokenizer setup.
  • hf/special_tokens_map.json: Architectural mappings tying special characters to the token blocks.
  • hf/tokenizer.json: The custom Byte-Level BPE tokenization descriptor layout.

🚀 Usage Examples

A. Running GGUF via Native CLI

To verify your local loader setup or validate dynamic key parsing via native completions:

./llama-completion -m tinyqwen2m.gguf -p "Once upon" -n 100 --temp 0.0 --repeat-penalty 1.0 --top-p 1.0

Expected Golden Output:

Once upon a time, there was a little girl named Lily. Lily loved to play with her toys and her friends. One day, Lily's friend came over to play. She showed her how to make a tall tower. Lily was so happy and proud of her tall tower. She showed it to her friend and they both laughed together. From that day on, Lily and her friend played together every day. They would pretend they
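
When an engine's output does not match the golden text, it helps to know where the two first differ, since a late divergence tends to point at numerical drift while an immediate one points at loading or tokenization. The helper below is a small illustrative sketch (the name first_divergence and the sample "broken" string are made up for this example); it compares decoded text, and comparing token IDs would be stricter if your engine exposes them.

```python
def first_divergence(expected, actual):
    """Return the index of the first differing character, or -1 if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    # One string is a strict prefix of the other: they diverge where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return -1


golden = "Once upon a time, there was a little girl named Lily."
broken = "Once upon a time, there was a big dog"  # made-up engine output

idx = first_divergence(golden, broken)
print(idx)
print(repr(golden[:idx]))
```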

B. Loading Hugging Face Formats via Python

To get the same token alignment and generation results as the GGUF file, load the tokenizer from the hf subfolder with PreTrainedTokenizerFast and manually prepend the BOS token ID (1000), which replicates the exact sequence layout used during training.

import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM

repo_id = "shibatch/tinyqwen2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline
input_ids = [tokenizer.bos_token_id] + input_ids
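input_ids = torch.tensor([input_ids])
# Single unpadded sequence, so the attention mask is all ones
inputs = {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}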

with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=False,        # Matches --temp 0
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ“ Model Specifications

The network uses tied input and output embeddings (tie_word_embeddings), power-of-two tensor shapes, and explicit attention QKV bias vectors, matching the layout of full-scale Qwen2 models.

  • Architecture: Qwen2 (Qwen2ForCausalLM)
  • Dataset: TinyStories
  • Total Parameters: ~2.03M
  • Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special characters)
  • Hidden Size (hidden_size): 128
  • Head Dimension (head_dim): 32 (128 / 4, satisfies hardware SDPA and RoPE alignment constraints)
  • Number of Hidden Layers (num_hidden_layers): 8 (Deep vertical structure so that errors from omitted biases compound quickly)
  • Number of Attention Heads (num_attention_heads): 4
  • Number of Key-Value Heads (num_key_value_heads): 1 (Standard GQA 4:1 topology)
  • Intermediate Size (intermediate_size): 512 (Standard power-of-two dimension)
  • Max Position Embeddings (max_position_embeddings): 256 (Standard power-of-two context length)
  • Attention Bias (attention_bias): True (Explicitly fixed at 0.1 for q_proj, k_proj, and v_proj)
  • RMS Norm Epsilon: 1e-06
  • RoPE Base Frequency (rope_theta): 1,000,000.0
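
The total parameter count can be cross-checked from the figures above. The arithmetic below is a sketch that assumes the standard Hugging Face Qwen2 layout: biases on q_proj, k_proj, and v_proj only, no bias on o_proj or the MLP, weight-only RMSNorm, and a single embedding matrix shared with the output head. If the actual checkpoint deviates from that layout, the numbers will differ.

```python
VOCAB, HIDDEN, LAYERS = 1024, 128, 8
Q_HEADS, KV_HEADS, HEAD_DIM, INTERMEDIATE = 4, 1, 32, 512

embed = VOCAB * HIDDEN  # counted once because the output head is tied to it

q_proj = HIDDEN * (Q_HEADS * HEAD_DIM) + Q_HEADS * HEAD_DIM    # weight + bias
k_proj = HIDDEN * (KV_HEADS * HEAD_DIM) + KV_HEADS * HEAD_DIM  # weight + bias
v_proj = k_proj                                                # same shape as k_proj
o_proj = (Q_HEADS * HEAD_DIM) * HIDDEN                         # assumed bias-free
mlp = 3 * HIDDEN * INTERMEDIATE                                # gate, up, down
norms = 2 * HIDDEN                                             # two RMSNorm weights

per_layer = q_proj + k_proj + v_proj + o_proj + mlp + norms
total = embed + LAYERS * per_layer + HIDDEN  # + final RMSNorm weight

print(per_layer)   # 238016
print(total)       # 2035328
print(total * 2)   # 4070656 bytes when every weight is stored in 16 bits
```

Under these assumptions the total is about 2.04M parameters, and the 16-bit footprint of roughly 4.07 million bytes is consistent with the ~4.0 MB file size quoted for tinyqwen2m.gguf (metadata and tokenizer data add a small amount on top).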

📜 Acknowledgments & License

  • Original Architecture: Qwen2 Model Family.
  • Dataset: TinyStories dataset.
  • License: MIT License. You are free to use, modify, and distribute these assets for any purpose.