Grizzly Mini Transformer v2
Grizzly Mini Transformer v2 is the second iteration of my locally trained GPT-style language model, rebuilt from scratch in Apple MLX for native Metal GPU training on Apple Silicon. It is named after my dog, Grizzly.
This repository is intentionally presented as a learning artifact. It is not instruction-tuned, not aligned, and not intended for production use.
Model Details
- Architecture: decoder-only GPT-style transformer
- Parameters: 25,272,192
- Framework: Apple MLX
- Vocabulary size: 8,192
- Context length: 512 tokens
- Embedding dimension: 384
- Layers: 10
- Attention heads: 8
- Query groups: 4 (GQA)
- Feed-forward hidden dimension: 1,536
- Positional encoding: RoPE
- Feed-forward block: SwiGLU
- Normalization: RMSNorm with pre-normalization
- Checkpoint format: model.safetensors
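The parameter count can be reproduced from the hyperparameters listed above. The sketch below is a back-of-the-envelope check rather than the code in `src/transformer.py`; it assumes no bias terms, an output head tied to the token embedding, two RMSNorm weights per block, and one final RMSNorm. Under those assumptions the total matches the figure reported above.

```python
def count_parameters(vocab_size=8192, dim=384, layers=10, heads=8,
                     kv_groups=4, hidden_dim=1536):
    """Estimate the parameter count of the GQA + SwiGLU + RMSNorm stack.

    Assumptions (not confirmed by the repository): no biases, tied
    input/output embeddings, one final RMSNorm after the last block.
    """
    head_dim = dim // heads                  # 384 / 8 = 48
    kv_dim = kv_groups * head_dim            # 4 * 48 = 192 (shared K/V width under GQA)
    attention = dim * dim                    # query projection
    attention += 2 * dim * kv_dim            # key and value projections
    attention += dim * dim                   # output projection
    feed_forward = 3 * dim * hidden_dim      # SwiGLU: gate, up, and down projections
    norms = 2 * dim                          # pre-attention and pre-MLP RMSNorm weights
    per_layer = attention + feed_forward + norms
    embedding = vocab_size * dim             # also used as the output head (assumed tied)
    final_norm = dim
    return embedding + layers * per_layer + final_norm


print(count_parameters())  # 25272192
```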
Training
The model was trained locally on Apple Silicon (M-series Mac) using MLX with a compiled Metal-native training loop.
- Dataset: Salesforce/wikitext (wikitext-2-v1)
- Final global step: ~10,500+
- Best validation loss: 3.62 (perplexity: 37.34)
- Batch size: 32 (with packed sequences)
- Optimizer: AdamW (lr=3e-4, betas=[0.9, 0.95], weight_decay=0.01)
- LR schedule: cosine decay with 500 warmup steps
- Compiled train step: yes
- Packed sequences: yes
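A minimal sketch of a cosine schedule with 500 warmup steps and a peak rate of 3e-4 follows. The linear warmup shape, the `total_steps` value of 10,500, and the `min_lr` floor of zero are assumptions for illustration; the training code may use different values.

```python
import math


def learning_rate(step, peak_lr=3e-4, warmup_steps=500,
                  total_steps=10_500, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr.

    total_steps and min_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 499, 500, 5_500, 10_500):
    print(step, f"{learning_rate(step):.2e}")
```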
Architecture Improvements Over v1
| Component | v1 (PyTorch) | v2 (MLX) |
|---|---|---|
| Parameters | 17.9M | 25.3M |
| Embedding dim | 320 | 384 |
| Hidden dim | 1,280 | 1,536 |
| Batch size | 16 | 32 |
| Sequence packing | no | yes |
| Compiled training | no | yes (Metal) |
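Sequence packing generally means concatenating tokenized documents into one stream and cutting it into fixed-length blocks, so no batch positions are spent on padding. The sketch below shows that idea under stated assumptions: the `pack_sequences` helper, the `eos_id` separator value, and dropping the final partial block are all hypothetical choices, not necessarily what the v2 training loop does.

```python
def pack_sequences(token_sequences, block_size=512, eos_id=0):
    """Concatenate token lists and split them into full blocks.

    eos_id is a placeholder separator id; the trailing partial block
    is dropped so every returned block has exactly block_size tokens.
    """
    stream = []
    for tokens in token_sequences:
        stream.extend(tokens)
        stream.append(eos_id)  # mark the document boundary
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Tiny example with block_size=4 so the result is easy to inspect
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4))
# [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```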
Benchmark Results
Evaluated on the WikiText-2 test set at temperature 0.8 (the model's sweet spot):
| Metric | Value |
|---|---|
| Validation loss | 3.62 |
| Perplexity | 37.34 |
| Avg generation length | 52 tokens |
| UNK token rate | 4.9% |
| 3-gram repetition | 1.1% |
| Inference speed | 550–2,640 tok/s |
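The loss and perplexity rows are two views of the same number: perplexity is the exponential of the mean per-token cross-entropy, assuming the loss is measured in nats (natural logarithm), which is the usual convention. A quick check:

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


print(round(perplexity(3.62), 2))  # 37.34
```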
Generation Quality by Temperature
| Temp | Avg Length | UNK Rate | 3-gram Rep | Loops |
|---|---|---|---|---|
| 0.2 | 28 words | 9.7% | 11.3% | 0/12 |
| 0.5 | 35 words | 6.5% | 7.7% | 1/12 |
| 0.8 | 52 words | 4.9% | 1.1% | 0/12 |
| 1.0 | 61 words | 3.3% | 0.1% | 0/12 |
Note: Within the tested range, higher temperatures gave better results. At temperature 0.8–1.0 the model produced longer outputs with a lower UNK rate and far less 3-gram repetition than at 0.2–0.5.
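The exact metric definitions are not given here, so the following is one plausible way to compute the two rate columns, offered as an assumption: the 3-gram repetition rate as the share of 3-grams that duplicate an earlier one, and the UNK rate as the share of tokens equal to an unknown-token marker. The `<unk>` string is a placeholder.

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that repeat an n-gram seen earlier in the text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def unk_rate(tokens, unk_token="<unk>"):
    """Fraction of tokens equal to the (assumed) unknown-token marker."""
    if not tokens:
        return 0.0
    return sum(t == unk_token for t in tokens) / len(tokens)


sample = "the cat sat the cat sat".split()
print(ngram_repetition_rate(sample))          # 0.25
print(unk_rate(["a", "<unk>", "b", "<unk>"]))  # 0.5
```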
Sample Generation (temperature 0.8)
Prompt: "She opened the mysterious door and"
She opened the mysterious door and made it one of the first four main stock footage of the first two.
Prompt: "The train arrived at midnight with"
The train arrived at midnight with the . The was then sent to the , where the first boat on the...
Prompt: "Python is a programming language that"
Python is a programming language that " are the world of the world and an example of that kind of genius that is not an...
Files
- model.safetensors: final trained model weights
- config.json: architecture and training configuration
- training_state.pt: local training metadata and loss history
- tokenizer/tokenizer.json: ByteLevel BPE tokenizer
- tokenizer/tokenizer_meta.json: tokenizer metadata
- src/transformer.py: custom GPT-style model definition
- src/tokenizer.py: tokenizer wrapper used by the training code
- load_model.py: minimal loading and generation example
Usage
MLX (recommended; Apple Silicon only)
pip install mlx safetensors
import json

import mlx.core as mx
from mlx.utils import tree_unflatten
from safetensors import safe_open

from src.tokenizer import BPETokenizer
from src.transformer import MLXGPT as GPT

# Read the architecture hyperparameters saved alongside the checkpoint
with open("config.json") as f:
    config = json.load(f)

model = GPT(
    vocab_size=config["vocab_size"],
    embedding_dim=config["embedding_dim"],
    num_layers=config["num_layers"],
    num_heads=config["num_heads"],
    num_query_groups=config["num_query_groups"],
    hidden_dim=config["hidden_dim"],
    max_seq_len=config["max_seq_len"],
    dropout=0.0,
)

# Load weights: the checkpoint stores flat, dot-separated keys, so rebuild
# the nested parameter tree that model.update expects
with safe_open("model.safetensors", framework="numpy") as f:
    weights = [(key, mx.array(f.get_tensor(key))) for key in f.keys()]
model.update(tree_unflatten(weights))
model.eval()

# Encode a prompt, sample 40 new tokens, and decode the result
tokenizer = BPETokenizer.load("tokenizer")
prompt_ids = mx.array([tokenizer.encode("The future of machine learning")], dtype=mx.int32)
generated = model.generate(prompt_ids, max_new_tokens=40, temperature=0.8, top_k=50)
mx.eval(generated)
print(tokenizer.decode(generated[0].tolist()))
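For readers unfamiliar with the `temperature` and `top_k` arguments passed to `model.generate`, here is a framework-free sketch of what temperature plus top-k sampling usually does for a single step. It is an illustration only: `sample_next_token` is a hypothetical helper, and the actual implementation in `src/transformer.py` may differ.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from raw logits using temperature and top-k.

    Hypothetical helper for illustration; not the repository's code.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep only the ids of the k largest logits
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperatures sharpen the distribution, higher ones flatten it
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    # Unnormalised softmax weights; subtracting the max avoids overflow
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


print(sample_next_token([0.1, 2.5, -1.0, 0.7], temperature=0.8, top_k=2))
```

With `top_k=2` the call above can only return index 1 or 3, the two highest-scoring entries.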
PyTorch (legacy v1 weights still included)
python load_model.py
Limitations
This model was trained as a transformer learning project. It may generate repetitive, incorrect, biased, or unsafe text. Use it for experimentation, code reading, and learning about model training rather than for factual answers or user-facing applications. A validation perplexity of 37.34 suggests the model is still undertrained, and further training would likely improve output quality.
