TinyStories Qwen2 2M (tinyqwen2m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Qwen2 model in both GGUF and Hugging Face (Safetensors) formats. The model was trained to convergence on the TinyStories dataset and is intended for testing and validating inference engines.

Why this repository exists

When developing a custom LLM inference engine, debugging with a full-sized model is slow. This suite offers a Qwen2 model at a true 2M-parameter scale (~4.0 MB), so developers can validate their loaders, namespace parsing, compact token embedding tables, and Grouped-Query Attention (GQA) logic step by step, quickly, and against verifiable natural-language output.

Key Validation Targets

This model is designed to expose architectural layout bugs that standard Llama files cannot trigger:

Dynamic Namespace Prefix Parsing: GGUF metadata keys use the qwen2. namespace (e.g., qwen2.attention.head_count) instead of the traditional llama. prefix. A GGUF loader must therefore build its lookup keys from general.architecture rather than fall back on hardcoded defaults.
True 4:1 GQA Ratio: The model has exactly 4 query heads and 1 key-value head. This checks that KV cache structures, stride calculations, and sequence-parallel splits handle Grouped-Query Attention correctly when the query and KV head counts differ.
Compact Token Arrays & Tied Embeddings: The vocabulary has only 1024 entries, which reduces the risk of out-of-bounds index-select failures (indexSelectSmallIndex errors) on private hardware setups. The config sets "tie_word_embeddings": true, so an engine must share one weight matrix between the input embedding and the output projection.
Layer-wise Projection Bias Verification (Deep & Slim Architecture): The model is 8 layers deep and has an explicit, non-zero constant bias (0.1) injected into q_proj, k_proj, and v_proj during training. If an inference engine omits or mishandles these projection biases, the numerical error accumulates across the 8 sequential layers, and generation degrades into random garbage within a few tokens.
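The first two targets can be illustrated with a minimal sketch. It is not taken from any particular engine: the helper names `arch_key` and `kv_head_for` are hypothetical, the metadata dictionary is hand-written rather than read from the GGUF file, and the key `qwen2.attention.head_count_kv` is assumed from the usual GGUF naming convention.

```python
def arch_key(metadata: dict, suffix: str) -> str:
    """Build a metadata key from general.architecture instead of hardcoding 'llama.'."""
    arch = metadata["general.architecture"]
    return f"{arch}.{suffix}"


def kv_head_for(query_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head shares under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query head count must be a multiple of KV head count")
    group_size = n_q_heads // n_kv_heads  # query heads per KV head
    return query_head // group_size


# Hypothetical, hand-written metadata mirroring the values stated in this README.
metadata = {
    "general.architecture": "qwen2",
    "qwen2.attention.head_count": 4,
    "qwen2.attention.head_count_kv": 1,
}

n_q = metadata[arch_key(metadata, "attention.head_count")]
n_kv = metadata[arch_key(metadata, "attention.head_count_kv")]

# With 4 query heads and 1 KV head, every query head reads KV head 0.
mapping = [kv_head_for(h, n_q, n_kv) for h in range(n_q)]
print(mapping)  # [0, 0, 0, 0]
```

A loader that hardcodes the `llama.` prefix would fail the dictionary lookup here, which is the kind of failure this model is meant to surface early.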

📂 Repository Structure & File Descriptions

.
├── tinyqwen2m.gguf
├── README.md
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── tokenizer.json

1. GGUF Format (Root Directory)

A validation binary converted for custom engines and native runtimes. The tokenizer vocabulary and special tokens are fully embedded within the GGUF file.

tinyqwen2m.gguf (~4.0 MB): Validates dynamic qwen2. GGUF namespace parsing, attention bias handling, RoPE operations, 16-bit floating point matrix layouts, type casting, and SwiGLU activation pipelines.
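As a first loader check, the fixed-size GGUF header can be parsed before any metadata or tensors are touched. The sketch below assumes a little-endian file in GGUF version 2 or later, where the header is a 4-byte magic, a 32-bit version, and two 64-bit counts. The function name `read_gguf_header` is hypothetical, and the example bytes are synthetic values, not the real contents of tinyqwen2m.gguf.

```python
import struct

# Layout assumed: magic (4 bytes), version (uint32), tensor count (uint64),
# metadata key-value count (uint64), all little-endian with no padding.
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header used only to exercise the parser; the counts are made up.
fake_header = struct.pack(HEADER_FORMAT, b"GGUF", 3, 10, 20)
header = read_gguf_header(fake_header)
print(header)
```

To run it against the real file, read the first 24 bytes with `open("tinyqwen2m.gguf", "rb").read(HEADER_SIZE)` and pass them to the same function.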

2. Hugging Face Native Format (`./hf/`)

This directory contains the standard files required to load the model with the Hugging Face transformers library (PyTorch):

hf/model.safetensors: The raw, unquantized model weights in Safetensors format.
hf/config.json: The architecture configuration defining the hyperparameters (8 layers, attention biases, weight tying, standard dimensions).
hf/generation_config.json: Default parameters for text generation.
hf/tokenizer_config.json: Tokenizer settings for the custom ChatML/Qwen2 fast tokenizer.
hf/special_tokens_map.json: Mapping of the special tokens to their token strings.
hf/tokenizer.json: The custom Byte-Level BPE tokenizer definition.

🚀 Usage Examples

A. Running GGUF via Native CLI

To verify your local loader setup or validate dynamic key parsing via native completions:

./llama-completion -m tinyqwen2m.gguf -p "Once upon" -n 100 --temp 0.0 --repeat-penalty 1.0 --top-p 1.0

Expected Golden Output:

Once upon a time, there was a little girl named Lily. Lily loved to play with her toys and her friends. One day, Lily's friend came over to play. She showed her how to make a tall tower. Lily was so happy and proud of her tall tower. She showed it to her friend and they both laughed together. From that day on, Lily and her friend played together every day. They would pretend they
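When an engine's output differs from the golden text, it helps to know where the two first diverge, since an early divergence usually points at a loader or attention bug rather than sampling noise. The helper below is a minimal sketch. `first_divergence` is a hypothetical name, and the `engine_output` string is an invented example of a failing run, not real output from any engine.

```python
def first_divergence(expected: str, actual: str) -> int:
    """Return the index of the first differing character, or -1 if the strings match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One string is a strict prefix of the other.
        return min(len(expected), len(actual))
    return -1


# Opening of the golden output from this README.
golden = "Once upon a time, there was a little girl named Lily."
# Invented example of a divergent engine output (assumption for illustration).
engine_output = "Once upon a time, there was a little boy"

index = first_divergence(golden, engine_output)
if index == -1:
    print("output matches the golden text")
else:
    print(f"diverges at character {index}: {golden[:index]!r}")
```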

B. Loading Hugging Face Formats via Python

To get the same token alignment and generation results as the GGUF file, load the tokenizer from the hf subfolder with PreTrainedTokenizerFast and manually prepend the BOS token ID (1000), which replicates the exact sequence layout used during training.

import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM

repo_id = "shibatch/tinyqwen2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=False,        # Matches --temp 0
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

📝 Model Specifications

The network uses tied input and output embeddings (tie_word_embeddings), power-of-two tensor shapes throughout, and explicit attention QKV bias vectors, as in full-scale Qwen2 models.

Architecture: Qwen2 (Qwen2ForCausalLM)
Dataset: TinyStories
Total Parameters: ~2.03M
Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special tokens)
Hidden Size (hidden_size): 128
Head Dimension (head_dim): 32 (128 / 4, satisfies hardware SDPA and RoPE alignment constraints)
Number of Hidden Layers (num_hidden_layers): 8 (Deep vertical structure that amplifies errors caused by omitted biases)
Number of Attention Heads (num_attention_heads): 4
Number of Key-Value Heads (num_key_value_heads): 1 (Standard GQA 4:1 topology)
Intermediate Size (intermediate_size): 512 (Standard power-of-two dimension)
Max Position Embeddings (max_position_embeddings): 256 (Standard power-of-two context length)
Attention Bias (attention_bias): True (Explicitly fixed at 0.1 for q_proj, k_proj, and v_proj)
RMS Norm Epsilon: 1e-06
RoPE Base Frequency (rope_theta): 1,000,000.0
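The parameter total can be cross-checked from the listed dimensions. The sketch below assumes the standard Qwen2 layer layout: biases on q_proj, k_proj, and v_proj only, no bias on o_proj or the MLP projections, two RMSNorm weight vectors per layer plus a final norm, and an output projection that shares the embedding matrix. The variable names are illustrative.

```python
# Dimensions taken from the specification list in this README.
vocab_size = 1024
hidden_size = 128
num_layers = 8
num_heads = 4
num_kv_heads = 1
head_dim = 32
intermediate_size = 512

q_dim = num_heads * head_dim      # 128
kv_dim = num_kv_heads * head_dim  # 32

# Attention: weights plus biases for q/k/v, weight only for o_proj (assumed layout).
attention = (
    hidden_size * q_dim + q_dim        # q_proj
    + hidden_size * kv_dim + kv_dim    # k_proj
    + hidden_size * kv_dim + kv_dim    # v_proj
    + q_dim * hidden_size              # o_proj
)

# SwiGLU MLP: gate_proj, up_proj, and down_proj, all without bias.
mlp = 3 * hidden_size * intermediate_size

# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = attention + mlp + norms
embedding = vocab_size * hidden_size  # counted once because of weight tying
final_norm = hidden_size

total = embedding + num_layers * per_layer + final_norm
print(total)  # 2035328
```

Under these assumptions the total is 2,035,328, which is consistent with the roughly 2.03M figure listed above and, at 2 bytes per 16-bit weight, with the file size of about 4.0 MB.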

📜 Acknowledgments & License

Original Architecture: Qwen2 Model Family.
Dataset: TinyStories dataset.
License: MIT License. You are free to use, modify, and distribute these assets for any purpose.
