TinyStories Qwen2 2M (tinyqwen2m) GGUF & HF Validation Suite
This repository provides ultra-lightweight Qwen2 model files across both GGUF and Hugging Face / Safetensors formats, trained to 100% convergence on the TinyStories dataset and optimized for inference engine testing and validation.
Why this repository exists
When developing a custom LLM inference engine, debugging with a full-sized model is slow. This suite offers a true 2M parameter scale Qwen2 model (~4.0MB), allowing developers to validate their loaders, namespace parsing, compact tokenization matrices, and Grouped-Query Attention (GQA) logic step-by-step with maximum efficiency and verifiable natural language outputs.
Key Validation Targets
This model is designed to expose architectural layout bugs that standard Llama files cannot trigger:
- Dynamic Namespace Prefix Parsing: GGUF metadata keys use the qwen2. namespace (e.g., qwen2.attention.head_count) instead of the traditional llama. identifier. This forces your GGUF loader to resolve string lookup configurations dynamically based on general.architecture rather than falling back onto hardcoded defaults.
- True 4:1 GQA Ratio: Implements an asymmetric configuration containing exactly 4 Query heads and 1 Key-Value head. This checks that KV caching structures, stride calculations, and sequence parallel splits handle Grouped-Query Attention topologies properly without scaling alignment failures.
- Compact Token Arrays & Tied Embeddings: Utilizes a highly optimized, clean vocabulary size of 1024 to eliminate index select out-of-bounds risks (indexSelectSmallIndex errors) on private hardware setups. Configured with "tie_word_embeddings": true to validate shared memory layouts across projection surfaces.
- Layer-wise Projection Bias Verification (Deep & Slim Architecture): Features an expanded 8-layer depth combined with an explicit, non-zero constant bias (0.1) injected into the q_proj, k_proj, and v_proj surfaces during training. If an inference engine fails to process or omits these projection biases, the numerical discrepancy accumulates rapidly across the 8 sequential layers, causing text generation to break completely into random garbage within a few tokens.
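The 4:1 GQA target above can be illustrated with a minimal sketch of how query heads are usually mapped onto shared Key-Value heads, and how the KV cache shrinks as a result. The helper names are hypothetical and the default values are taken from the specifications in this README; this is an illustration of the common grouping convention, not the repository's own code.

```python
def kv_head_for_query_head(q_head: int, n_heads: int = 4, n_kv_heads: int = 1) -> int:
    """Map a query head index to the KV head it shares under GQA."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # 4 query heads per KV head for this model
    return q_head // group_size


def kv_cache_elements(seq_len: int, n_layers: int = 8,
                      n_kv_heads: int = 1, head_dim: int = 32) -> int:
    """Count cached K and V elements for one sequence (K plus V, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len


# With 4 query heads and 1 KV head, every query head reads KV head 0.
mapping = [kv_head_for_query_head(h) for h in range(4)]
print(mapping)                 # [0, 0, 0, 0]
print(kv_cache_elements(256))  # 131072
```

An engine that sizes its cache by the query head count instead of the KV head count would allocate four times this amount and typically index it with the wrong stride.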
Repository Structure & File Descriptions
.
├── tinyqwen2m.gguf
├── README.md
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── tokenizer.json
1. GGUF Format (Root Directory)
A validation binary converted for custom engines and native runtimes. The tokenizer vocabulary and special tokens are fully embedded within the GGUF file.
- tinyqwen2m.gguf (~4.0 MB): Validates dynamic qwen2. GGUF namespace parsing, attention bias handling, RoPE operations, 16-bit floating point matrix layouts, type casting, and SwiGLU activation pipelines.
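As a rough illustration of the namespace parsing this file is meant to exercise, the sketch below resolves architecture-specific keys from general.architecture instead of hardcoding a llama. prefix. It assumes the GGUF metadata has already been parsed into a plain dictionary; the dictionary contents and helper names are hypothetical stand-ins, with values taken from this README.

```python
def arch_key(metadata: dict, suffix: str) -> str:
    """Build an architecture-scoped key such as 'qwen2.attention.head_count'."""
    arch = metadata["general.architecture"]
    return f"{arch}.{suffix}"


def read_arch_value(metadata: dict, suffix: str):
    """Look up an architecture-scoped value, failing loudly if it is absent."""
    key = arch_key(metadata, suffix)
    if key not in metadata:
        raise KeyError(f"missing GGUF metadata key: {key}")
    return metadata[key]


# Hypothetical stand-in for metadata already read from tinyqwen2m.gguf.
metadata = {
    "general.architecture": "qwen2",
    "qwen2.attention.head_count": 4,
    "qwen2.attention.head_count_kv": 1,
    "qwen2.block_count": 8,
}

print(read_arch_value(metadata, "attention.head_count"))     # 4
print(read_arch_value(metadata, "attention.head_count_kv"))  # 1
print("llama.attention.head_count" in metadata)              # False
```

A loader that looks up llama.attention.head_count directly finds nothing in this file, which is the failure mode the model is designed to surface.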
2. Hugging Face Native Format (./hf/)
This directory contains the standard files required to load the model using the PyTorch transformers library:
- hf/model.safetensors: The raw, unquantized model weights stored securely in Safetensors format.
- hf/config.json: The architectural configuration file defining hyperparameters (8 layers, attention biases, weight-tying, standard dimensions).
- hf/generation_config.json: Default parameters optimized for text generation.
- hf/tokenizer_config.json: Tokenizer behavior layout specifying the custom ChatML/Qwen2 fast tokenizer setup.
- hf/special_tokens_map.json: Architectural mappings tying special characters to the token blocks.
- hf/tokenizer.json: The custom Byte-Level BPE tokenization descriptor layout.
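One way to sanity-check a custom config parser against these files is to compare the parsed hf/config.json with the values listed in this README. The sketch below is a minimal example: the field names are the usual Hugging Face Qwen2 configuration keys, the expected values come from the Model Specifications section, and the inline JSON string is a hypothetical stand-in for the real file contents.

```python
import json

# Expected values, taken from the Model Specifications in this README.
EXPECTED = {
    "num_hidden_layers": 8,
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
    "hidden_size": 128,
    "intermediate_size": 512,
    "vocab_size": 1024,
    "max_position_embeddings": 256,
    "tie_word_embeddings": True,
}


def config_mismatches(config: dict) -> dict:
    """Return {field: (expected, found)} for every field that disagrees."""
    return {
        key: (want, config.get(key))
        for key, want in EXPECTED.items()
        if config.get(key) != want
    }


# Hypothetical stand-in for the text of hf/config.json.
sample_text = json.dumps(EXPECTED)
sample = json.loads(sample_text)

print(config_mismatches(sample))  # {}
print(config_mismatches({**sample, "num_key_value_heads": 4}))
```

In practice the dictionary would come from json.load on the downloaded hf/config.json rather than from the inline string.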
Usage Examples
A. Running GGUF via Native CLI
To verify your local loader setup or validate dynamic key parsing via native completions:
./llama-completion -m tinyqwen2m.gguf -p "Once upon" -n 100 --temp 0.0 --repeat-penalty 1.0 --top-p 1.0
Expected Golden Output:
Once upon a time, there was a little girl named Lily. Lily loved to play with her toys and her friends. One day, Lily's friend came over to play. She showed her how to make a tall tower. Lily was so happy and proud of her tall tower. She showed it to her friend and they both laughed together. From that day on, Lily and her friend played together every day. They would pretend they
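When an engine under test does not reproduce the golden output, it helps to know where the text first diverges. The sketch below is a minimal character-level comparison; the helper name is hypothetical, and only the first sentence of the golden output is used as the reference so that nothing beyond the text shown above is assumed.

```python
# First sentence of the golden output shown above.
GOLDEN_PREFIX = "Once upon a time, there was a little girl named Lily."


def first_divergence(expected: str, actual: str) -> int:
    """Return the index of the first differing character.

    Returns -1 when the strings agree over their common length.
    """
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    return -1


good = "Once upon a time, there was a little girl named Lily. Lily loved"
bad = "Once upon a time, there wqs"

print(first_divergence(GOLDEN_PREFIX, good))  # -1
print(first_divergence(GOLDEN_PREFIX, bad))   # 25
```

An early divergence tends to point at tokenizer or embedding problems, while a late one is more consistent with accumulated numerical error, though that is a heuristic rather than a rule.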
B. Loading Hugging Face Formats via Python
To get identical token alignment and generation results as GGUF, use PreTrainedTokenizerFast to load the subfolder configurations, and manually prepend the BOS token ID (1000) to replicate the exact dataset layout used during training.
import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM

repo_id = "shibatch/tinyqwen2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,  # Matches --temp 0
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Model Specifications
The network architecture features an active weight-tying matrix (tie_word_embeddings), perfectly aligned power-of-two shapes, and explicit Attention QKV bias vectors matching full-scale Qwen2 profiles.
- Architecture: Qwen2 (Qwen2ForCausalLM)
- Dataset: TinyStories
- Total Parameters: ~2.03M
- Vocabulary Size: 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special characters)
- Hidden Size (hidden_size): 128
- Head Dimension (head_dim): 32 (128 / 4, satisfies hardware SDPA and RoPE alignment constraints)
- Number of Hidden Layers (num_hidden_layers): 8 (Deep vertical structure to accelerate bias omission errors)
- Number of Attention Heads (num_attention_heads): 4
- Number of Key-Value Heads (num_key_value_heads): 1 (Standard GQA 4:1 topology)
- Intermediate Size (intermediate_size): 512 (Standard power-of-two dimension)
- Max Position Embeddings (max_position_embeddings): 256 (Standard power-of-two context length)
- Attention Bias (attention_bias): True (Explicitly fixed at 0.1 for q_proj, k_proj, and v_proj)
- RMS Norm Epsilon: 1e-06
- RoPE Base Frequency (rope_theta): 1,000,000.0
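The ~2.03M parameter figure can be cross-checked from the dimensions above. The sketch below assumes the usual Qwen2 layout: biases on q_proj, k_proj, and v_proj only, no bias on o_proj or the MLP projections, two RMSNorm weight vectors per layer plus a final norm, and an output head that shares the embedding matrix. The function name is hypothetical and the layout is an assumption rather than something read from the model files.

```python
def qwen2_param_count(vocab: int = 1024, hidden: int = 128, layers: int = 8,
                      heads: int = 4, kv_heads: int = 1,
                      intermediate: int = 512, tied: bool = True) -> int:
    """Estimate the parameter count of a Qwen2-style decoder (assumed layout)."""
    head_dim = hidden // heads       # 32
    kv_dim = kv_heads * head_dim     # 32: K and V project to a single head

    embed = vocab * hidden
    q_proj = hidden * hidden + hidden      # weight + bias
    k_proj = hidden * kv_dim + kv_dim      # weight + bias
    v_proj = hidden * kv_dim + kv_dim      # weight + bias
    o_proj = hidden * hidden               # no bias assumed
    mlp = 3 * hidden * intermediate        # gate, up, and down projections
    norms = 2 * hidden                     # two RMSNorm weights per layer

    per_layer = q_proj + k_proj + v_proj + o_proj + mlp + norms
    total = embed + layers * per_layer + hidden  # + final RMSNorm
    if not tied:
        total += vocab * hidden            # separate output head
    return total


total = qwen2_param_count()
print(total)      # 2035328
print(total * 2)  # 4070656 bytes at 16 bits per weight
```

Under these assumptions the count lands at about 2.03M parameters and roughly 4 MB at 16-bit precision, in line with the figures quoted in this README.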
Acknowledgments & License
- Original Architecture: Qwen2 Model Family.
- Dataset: TinyStories dataset.
- License: MIT License. You are free to use, modify, and distribute these assets for any purpose.