Grizzly Mini Transformer v2

Grizzly Mini Transformer v2 is the second iteration of my locally trained GPT-style language model, rebuilt from scratch using Apple MLX for native Metal GPU training on Apple Silicon. It is named after my dog, Grizzly.

Grizzly, the dog this model is named after

This repository is intentionally presented as a learning artifact. It is not instruction-tuned, not aligned, and not intended for production use.

Model Details

  • Architecture: decoder-only GPT-style transformer
  • Parameters: 25,272,192
  • Framework: Apple MLX
  • Vocabulary size: 8,192
  • Context length: 512 tokens
  • Embedding dimension: 384
  • Layers: 10
  • Attention heads: 8
  • Query groups: 4 (GQA)
  • Feed-forward hidden dimension: 1,536
  • Positional encoding: RoPE
  • Feed-forward block: SwiGLU
  • Normalization: RMSNorm with pre-normalization
  • Checkpoint format: model.safetensors
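
As a sanity check on the listed parameter count, the sketch below recomputes it from the hyperparameters above. It assumes details the card does not state: tied input/output embeddings, bias-free linear layers, one weight vector per RMSNorm, a single final norm, and that "query groups" is the number of key/value heads. Under those assumptions the total matches the listed figure; `count_parameters` is an illustrative helper, not part of the repository.

```python
def count_parameters(vocab_size=8192, embedding_dim=384, num_layers=10,
                     num_heads=8, num_query_groups=4, hidden_dim=1536):
    """Estimate the parameter count under the assumptions stated above."""
    head_dim = embedding_dim // num_heads        # 384 / 8 = 48
    kv_dim = num_query_groups * head_dim         # 4 * 48 = 192 (GQA shrinks K and V)

    embedding = vocab_size * embedding_dim       # assumed shared with the output head
    attention = (
        embedding_dim * embedding_dim            # query projection
        + 2 * embedding_dim * kv_dim             # key and value projections
        + embedding_dim * embedding_dim          # output projection
    )
    feed_forward = 3 * embedding_dim * hidden_dim  # SwiGLU: gate, up, down
    norms = 2 * embedding_dim                    # two RMSNorm weights per layer
    per_layer = attention + feed_forward + norms
    final_norm = embedding_dim

    return embedding + num_layers * per_layer + final_norm


print(f"{count_parameters():,}")  # 25,272,192
```

With full multi-head attention (8 key/value heads instead of 4), the same formula gives about 1.5M more parameters, which is the saving GQA buys at this size.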

Training

The model was trained locally on Apple Silicon (M-series Mac) using MLX with a compiled Metal-native training loop.

  • Dataset: Salesforce/wikitext (wikitext-2-v1)
  • Final global step: roughly 10,500
  • Best validation loss: 3.62 (perplexity: 37.34)
  • Batch size: 32 (with packed sequences)
  • Optimizer: AdamW (lr=3e-4, betas=[0.9, 0.95], weight_decay=0.01)
  • LR schedule: cosine decay with 500 warmup steps
  • Compiled train step: yes
  • Packed sequences: yes
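
The schedule listed above can be sketched in plain Python. This is a minimal illustration, not the training code: it assumes a linear warmup to the peak rate of 3e-4, a decay horizon of 10,500 steps, and decay to zero, none of which the card specifies. The function name and its `total_steps` and `min_lr` parameters are hypothetical.

```python
import math


def learning_rate(step, peak_lr=3e-4, warmup_steps=500,
                  total_steps=10_500, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed shape of the schedule)."""
    if step < warmup_steps:
        # Ramp linearly so the peak is reached on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 499, 5_500, 10_500):
    print(step, learning_rate(step))
```

Halfway through the decay phase (step 5,500 under these assumptions) the rate is half the peak, and it reaches `min_lr` at `total_steps`.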

Architecture Improvements Over v1

| Component | v1 (PyTorch) | v2 (MLX) |
| --- | --- | --- |
| Parameters | 17.9M | 25.3M |
| Embedding dim | 320 | 384 |
| Hidden dim | 1,280 | 1,536 |
| Batch size | 16 | 32 |
| Sequence packing | no | yes |
| Compiled training | no | yes (Metal) |
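
The card does not describe how sequence packing is implemented. One common approach, sketched here as an assumption, concatenates tokenized documents into a single stream with a separator token and slices it into fixed-length windows, so no positions are spent on padding. `pack_sequences` and `eos_id` are hypothetical names.

```python
def pack_sequences(token_lists, seq_len, eos_id):
    """Pack variable-length documents into (input, target) pairs of length seq_len."""
    # Join all documents into one token stream, marking each boundary.
    stream = []
    for tokens in token_lists:
        stream.extend(tokens)
        stream.append(eos_id)

    # Each example needs seq_len + 1 tokens: inputs plus next-token targets.
    block = seq_len + 1
    examples = []
    for start in range(0, len(stream) - block + 1, seq_len):
        chunk = stream[start:start + block]
        examples.append((chunk[:-1], chunk[1:]))
    return examples  # trailing tokens that do not fill a window are dropped


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for inputs, targets in pack_sequences(docs, seq_len=4, eos_id=0):
    print(inputs, targets)
```

Windows advance by `seq_len` while spanning `seq_len + 1` tokens, so the last token of one window is the first input of the next and every position receives a target.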

Benchmark Results

Evaluated on the WikiText-2 test set, with generation sampled at temperature 0.8 (the model's sweet spot):

| Metric | Value |
| --- | --- |
| Validation loss | 3.62 |
| Perplexity | 37.34 |
| Avg generation length | 52 tokens |
| UNK token rate | 4.9% |
| 3-gram repetition | 1.1% |
| Inference speed | 550–2,640 tok/s |
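
The loss and perplexity figures are two views of the same number: perplexity is the exponential of the mean per-token cross-entropy, assuming the loss is measured in nats (the usual convention, though the card does not say so). A minimal check:

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


print(round(perplexity(3.62), 2))  # 37.34
```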

Generation Quality by Temperature

| Temp | Avg Length | UNK Rate | 3-gram Rep | Loops |
| --- | --- | --- | --- | --- |
| 0.2 | 28 words | 9.7% | 11.3% | 0/12 |
| 0.5 | 35 words | 6.5% | 7.7% | 1/12 |
| 0.8 | 52 words | 4.9% | 1.1% | 0/12 |
| 1.0 | 61 words | 3.3% | 0.1% | 0/12 |

Note: Within the tested range, higher temperatures produce better quality. At temperatures 0.8 to 1.0 the model generates more coherent, less repetitive text, with lower UNK and 3-gram repetition rates.
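
The card does not define how the 3-gram repetition figure is computed. One plausible definition, used here purely as an assumption, is the fraction of 3-grams in a generation that duplicate an earlier 3-gram in the same text. `ngram_repetition_rate` is a hypothetical helper.

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that repeat an earlier n-gram in the same sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence too short to contain any n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)


text = "the cat sat on the cat sat down"
print(f"{ngram_repetition_rate(text.split()):.1%}")  # 16.7%
```

In the example, "the cat sat" occurs twice among six 3-grams, so one of six is a repeat.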

Sample Generation (temperature 0.8)

Prompt: "She opened the mysterious door and"

She opened the mysterious door and made it one of the first four main stock footage of the first two.

Prompt: "The train arrived at midnight with"

The train arrived at midnight with the . The was then sent to the , where the first boat on the...

Prompt: "Python is a programming language that"

Python is a programming language that " are the world of the world and an example of that kind of genius that is not an...

Files

  • model.safetensors: final trained model weights
  • config.json: architecture and training configuration
  • training_state.pt: local training metadata and loss history
  • tokenizer/tokenizer.json: ByteLevel BPE tokenizer
  • tokenizer/tokenizer_meta.json: tokenizer metadata
  • src/transformer.py: custom GPT-style model definition
  • src/tokenizer.py: tokenizer wrapper used by the training code
  • load_model.py: minimal loading and generation example

Usage

MLX (recommended, Apple Silicon only)

```shell
pip install mlx safetensors
```

```python
import json

import mlx.core as mx
from safetensors import safe_open

from src.tokenizer import BPETokenizer
from src.transformer import MLXGPT as GPT

# Read the architecture hyperparameters saved alongside the weights.
with open("config.json") as f:
    config = json.load(f)

model = GPT(
    vocab_size=config["vocab_size"],
    embedding_dim=config["embedding_dim"],
    num_layers=config["num_layers"],
    num_heads=config["num_heads"],
    num_query_groups=config["num_query_groups"],
    hidden_dim=config["hidden_dim"],
    max_seq_len=config["max_seq_len"],
    dropout=0.0,
)

# Load weights. Module.update expects a nested parameter tree, so flat
# dotted keys are passed to load_weights as (name, array) pairs instead.
with safe_open("model.safetensors", framework="numpy") as f:
    weights = [(key, mx.array(f.get_tensor(key))) for key in f.keys()]
model.load_weights(weights)

model.eval()  # switch to inference mode

tokenizer = BPETokenizer.load("tokenizer")
prompt_ids = mx.array([tokenizer.encode("The future of machine learning")], dtype=mx.int32)
generated = model.generate(prompt_ids, max_new_tokens=40, temperature=0.8, top_k=50)
mx.eval(generated)  # force MLX's lazy computation to run
print(tokenizer.decode(generated[0].tolist()))
```
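
The `temperature` and `top_k` arguments above control how the next token is drawn from the model's logits. How `model.generate` implements this is not shown in the card. The pure-Python sketch below illustrates the usual technique under that caveat: keep the `top_k` highest logits, divide by the temperature, and sample from the resulting softmax. `sample_next_token` is a hypothetical helper.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, rng=random):
    """Sample a token index using top-k filtering and temperature scaling."""
    # Indices of the top_k largest logits (all of them if top_k >= len(logits)).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rng = random.Random(0)
print([sample_next_token(logits, temperature=0.8, top_k=3, rng=rng) for _ in range(10)])
```

With `top_k=1` the function always returns the highest-scoring token, which is equivalent to greedy decoding.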

PyTorch (legacy v1 weights still included)

```shell
python load_model.py
```

Limitations

This model was trained as a transformer learning project. It may generate repetitive, incorrect, biased, or unsafe text. Use it for experimentation, code reading, and learning about model training rather than for factual answers or user-facing applications. A perplexity of about 37 suggests the model is still undertrained, so further training would likely improve output quality.
