Grizzly Mini Transformer v2

Grizzly Mini Transformer v2 is the second iteration of my locally trained GPT-style language model, rebuilt from scratch with Apple MLX for native Metal GPU training on Apple Silicon. It is named after my dog, Grizzly.

This repository is intentionally presented as a learning artifact. It is not instruction-tuned, not aligned, and not intended for production use.

Model Details

Architecture: decoder-only GPT-style transformer
Parameters: 25,272,192
Framework: Apple MLX
Vocabulary size: 8,192
Context length: 512 tokens
Embedding dimension: 384
Layers: 10
Attention heads: 8
Query groups: 4 (GQA)
Feed-forward hidden dimension: 1,536
Positional encoding: RoPE
Feed-forward block: SwiGLU
Normalization: RMSNorm with pre-normalization
Checkpoint format: model.safetensors
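
The parameter count can be reproduced from the dimensions listed above. This is a minimal sketch, not the code in src/transformer.py: it assumes bias-free linear layers, an output head that shares weights with the token embedding, and a single final RMSNorm. None of these are stated in this card, but under those assumptions the total matches the reported 25,272,192. The function name is hypothetical; the argument names follow the config.json keys used in the usage example.

```python
def count_parameters(vocab_size, embedding_dim, num_layers, num_heads,
                     num_query_groups, hidden_dim):
    """Count weights for a GQA + SwiGLU + RMSNorm decoder (assumed layout)."""
    head_dim = embedding_dim // num_heads
    # With grouped-query attention, K and V are projected to one head per group.
    kv_dim = num_query_groups * head_dim

    # Q and output projections are full width; K and V are narrower (no biases assumed).
    attention = 2 * embedding_dim * embedding_dim + 2 * embedding_dim * kv_dim
    # SwiGLU uses three matrices: gate, up, and down.
    feed_forward = 3 * embedding_dim * hidden_dim
    # Two RMSNorm scale vectors per block (before attention and before feed-forward).
    norms = 2 * embedding_dim

    per_layer = attention + feed_forward + norms
    token_embedding = vocab_size * embedding_dim  # assumed tied with the output head
    final_norm = embedding_dim                    # assumed single RMSNorm before the head
    # RoPE adds no learned parameters.
    return token_embedding + num_layers * per_layer + final_norm


total = count_parameters(
    vocab_size=8192, embedding_dim=384, num_layers=10,
    num_heads=8, num_query_groups=4, hidden_dim=1536,
)
print(f"{total:,}")  # 25,272,192
```

Most of the budget sits in the feed-forward blocks (about 1.77M weights per layer), while the embedding table accounts for roughly 3.1M.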

Training

The model was trained locally on Apple Silicon (M-series Mac) using MLX with a compiled Metal-native training loop.

Dataset: Salesforce/wikitext (wikitext-2-v1)
Final global step: just over 10,500
Best validation loss: 3.62 (perplexity: 37.34)
Batch size: 32 (with packed sequences)
Optimizer: AdamW (lr=3e-4, betas=[0.9, 0.95], weight_decay=0.01)
LR schedule: cosine decay with 500 warmup steps
Compiled train step: yes
Packed sequences: yes

Architecture Improvements Over v1

Component	v1 (PyTorch)	v2 (MLX)
Parameters	17.9M	25.3M
Embedding dim	320	384
Hidden dim	1,280	1,536
Batch size	16	32
Sequence packing	no	yes
Compiled training	no	yes (Metal)
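
Sequence packing, listed in the table above, generally means concatenating tokenized documents and slicing the stream into fixed-length blocks so that no batch positions are spent on padding. The sketch below shows the idea only; the separator id, the dropped tail, and the function name are assumptions, and the actual training code may handle document boundaries differently.

```python
def pack_sequences(token_sequences, block_size=512, eos_id=0):
    """Concatenate tokenized documents and cut the stream into fixed-size blocks."""
    stream = []
    for tokens in token_sequences:
        stream.extend(tokens)
        stream.append(eos_id)  # hypothetical separator id between documents
    # Drop the ragged tail so every block has exactly block_size tokens.
    usable = len(stream) - len(stream) % block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(docs, block_size=4, eos_id=0)
print(blocks)  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```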

Benchmark Results

Evaluated on the WikiText-2 test set at temperature 0.8 (the model's sweet spot):

Metric	Value
Validation loss	3.62
Perplexity	37.34
Avg generation length	52 tokens
UNK token rate	4.9%
3-gram repetition	1.1%
Inference speed	550–2,640 tok/s
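
The loss and perplexity in the table are two views of the same number: perplexity is the exponential of the mean per-token cross-entropy. The check below assumes the loss is measured in nats (natural log), which is consistent with the reported values. The function name is hypothetical.

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


print(round(perplexity(3.62), 2))  # 37.34
```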

Generation Quality by Temperature

Temp	Avg Length	UNK Rate	3-gram Rep	Loops
0.2	28 words	9.7%	11.3%	0/12
0.5	35 words	6.5%	7.7%	1/12
0.8	52 words	4.9%	1.1%	0/12
1.0	61 words	3.3%	0.1%	0/12

Note: Higher temperatures score better on these metrics. At temperatures 0.8–1.0 the model produces longer, less repetitive text with a lower UNK rate than at 0.2–0.5.
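
This card does not define how the 3-gram repetition column is computed. One common definition, sketched here as an assumption, is the fraction of 3-grams in a generation that duplicate an earlier 3-gram. The function name and the whitespace tokenization are illustrative only.

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that duplicate another n-gram in the same sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n tokens
    repeats = len(ngrams) - len(set(ngrams))
    return repeats / len(ngrams)


sample = "the first of the first of the first two".split()
print(f"{ngram_repetition_rate(sample):.1%}")  # 42.9%
```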

Sample Generation (temperature 0.8)

Prompt: "She opened the mysterious door and"

She opened the mysterious door and made it one of the first four main stock footage of the first two.

Prompt: "The train arrived at midnight with"

The train arrived at midnight with the . The was then sent to the , where the first boat on the...

Prompt: "Python is a programming language that"

Python is a programming language that " are the world of the world and an example of that kind of genius that is not an...

Files

model.safetensors: final trained model weights
config.json: architecture and training configuration
training_state.pt: local training metadata and loss history
tokenizer/tokenizer.json: ByteLevel BPE tokenizer
tokenizer/tokenizer_meta.json: tokenizer metadata
src/transformer.py: custom GPT-style model definition
src/tokenizer.py: tokenizer wrapper used by the training code
load_model.py: minimal loading and generation example

Usage

MLX (recommended, Apple Silicon only)

pip install mlx safetensors

import mlx.core as mx
import mlx.nn as nn
import json
from safetensors import safe_open

from src.tokenizer import BPETokenizer
from src.transformer import MLXGPT as GPT

with open("config.json") as f:
    config = json.load(f)

model = GPT(
    vocab_size=config["vocab_size"],
    embedding_dim=config["embedding_dim"],
    num_layers=config["num_layers"],
    num_heads=config["num_heads"],
    num_query_groups=config["num_query_groups"],
    hidden_dim=config["hidden_dim"],
    max_seq_len=config["max_seq_len"],
    dropout=0.0,
)

# Load weights. Checkpoint keys are flat dotted paths, which Module.update
# cannot map onto nested submodules, so let load_weights do the unflattening.
# This assumes MLXGPT subclasses mlx.nn.Module.
model.load_weights("model.safetensors")

model.eval()

tokenizer = BPETokenizer.load("tokenizer")
prompt_ids = mx.array([tokenizer.encode("The future of machine learning")], dtype=mx.int32)
generated = model.generate(prompt_ids, max_new_tokens=40, temperature=0.8, top_k=50)
mx.eval(generated)
print(tokenizer.decode(generated[0].tolist()))
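
The temperature and top_k arguments passed to generate above control how the next token is drawn. The sketch below shows the usual technique in plain Python; it is not the implementation in src/transformer.py, and the function name and the rng parameter are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from raw logits with temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by temperature sharpens the distribution below 1 and flattens it above 1.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(sample_next_token(logits, temperature=0.8, top_k=2))  # prints 1 or 3
```

With top_k=1 this reduces to greedy decoding, which is one reason low-diversity settings tend to loop on a small model.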

PyTorch (legacy v1 weights still included)

python load_model.py

Limitations

This model was trained as a transformer learning project. It may generate repetitive, incorrect, biased, or unsafe text. Use it for experimentation, code reading, and learning about model training rather than for factual answers or user-facing applications. The perplexity of 37 suggests the model is still at an early stage of training, and further training would likely improve output quality.

Safetensors

Model size: 25.3M params
Tensor type: F32

cjnielson44/grizzly-mini-transformer