Marshmello-8M
Marshmello-8M is a decoder-only GPT language model trained from scratch in Marshmello, a step-by-step project that builds transformers from a single weight up to GPT pretraining, SFT, and 300M-parameter scaling on Apple Silicon.
| Property | Value |
|---|---|
| GitHub | mohmmedwee/Marshmello |
| Parameters | ~13.3M (13,340,928) |
| Architecture | GPT (causal self-attention, learned positional embeddings) |
| `d_model` | 384 |
| Layers | 4 |
| Heads | 6 |
| FFN dim | 1536 |
| Context | 256 tokens |
| Tokenizer | BPE (~8,000 vocab) |
| Config key | default |
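The parameter total in the table can be reproduced from the other rows. The sketch below is arithmetic only and rests on assumptions that are not read from the repo code: a vocabulary of exactly 8,000 tokens, biases on every linear layer, LayerNorm with weight and bias, and an untied, bias-free output head. `config.json` holds the authoritative breakdown.

```python
def gpt_param_count(vocab, d_model, n_layers, d_ff, context):
    """Count parameters of a GPT-style decoder under the assumptions stated above."""
    tok_emb = vocab * d_model                  # token embedding table
    pos_emb = context * d_model                # learned positional embeddings
    layer_norm = 2 * d_model                   # weight + bias
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections, with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = 2 * layer_norm + attn + ffn        # one transformer block
    lm_head = d_model * vocab                  # ASSUMED untied and bias-free
    total = tok_emb + pos_emb + n_layers * block + layer_norm + lm_head
    return {
        "embeddings": tok_emb + pos_emb,
        "blocks": n_layers * block,
        "final_norm": layer_norm,
        "lm_head": lm_head,
        "total": total,
    }


counts = gpt_param_count(vocab=8000, d_model=384, n_layers=4, d_ff=1536, context=256)
print(counts["total"])   # 13340928
print(counts["blocks"])  # 7097856
```

The number of heads does not appear in the count because splitting `d_model` across heads changes the shape of the attention computation, not the size of its weight matrices.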
Quick start
```bash
git clone https://github.com/mohmmedwee/Marshmello.git
cd Marshmello
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt huggingface_hub safetensors

# Download weights from this Hub repo into checkpoints/
python 13_gpt_pretraining/hub/download_from_hub.py --repo-id ostah-1010/Marshmello-8M

# Generate text
python 13_gpt_pretraining/generate.py --config default --prompt "Database systems"
```
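After the download step, the weights file can be sanity-checked without importing PyTorch. A `.safetensors` file starts with an 8-byte little-endian length followed by a JSON header that lists every tensor's dtype and shape. The sketch below reads only that header with the standard library. The function names are illustrative, and the path in the usage comment is an assumption about where the download script places the file.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the tensor table of a .safetensors file without loading any weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian unsigned 64-bit
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    return sum(prod(info["shape"]) for info in header.values())


# Usage (hypothetical path; adjust to wherever the download script writes the file):
#   header = read_safetensors_header("checkpoints/model.safetensors")
#   for name, info in header.items():
#       print(name, info["dtype"], info["shape"])
#   print(count_parameters(header))
```

If weights are tied or buffers are stored alongside parameters, the count from the file may differ from the figure reported in `config.json`.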
Marshmello family
| Model | Hugging Face | Params | GitHub config | Status |
|---|---|---|---|---|
| Marshmello-8M | ostah-1010/Marshmello-8M | ~8M | default | Published |
| Marshmello-55M | ostah-1010/Marshmello | ~55M | large_50m | Published base |
| Marshmello-300M | GitHub only | 268.8M | large_300m | Phase 19A pretraining |
Full source, training pipeline, and evaluation suite: https://github.com/mohmmedwee/Marshmello
SFT dataset: ostah-1010/Marshmello-SFT
Learning path (GitHub repo)
Linear model → Attention → Transformer → BPE LM → GPT pretraining
→ Dataset pipeline → 50M scaling → Evaluation → Instruction dataset
→ Chat adaptation (18C/18H) → Tiny teacher SFT (18E) → Instruct tuning (18B)
→ Core routing eval (18J) → General benchmark (18K) → 300M scaling (19A)
Phases 01–19A in the repo walk through every layer of the stack with readable Python.
Phase 19A: 300M scaling (GitHub)
After the 55M line plateaued (18J 18%, 18K domain 21.8%), Phase 19A tests
whether extra capacity helps. Status on large_300m:
| Gate | Result |
|---|---|
| Benchmark | Passed (~3,552 train tok/s, ~5.5 GB peak, no OOM) |
| Smoke test (20 steps) | Passed (loss 9.16 → 7.16) |
| Chat-only pretraining | In progress |
| Hub upload | Not until 18J/18K improve |
Docs: 19A_scale_to_300m
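The benchmark figures can be turned into rough planning numbers. The sketch below is back-of-the-envelope arithmetic: the throughput and parameter count come from this card, while the 100M-token budget and the fp32 (4 bytes per parameter) storage are illustrative assumptions, not project settings.

```python
def training_hours(total_tokens, tokens_per_second):
    """Wall-clock hours to process a token budget at a steady throughput."""
    return total_tokens / tokens_per_second / 3600


def weight_gigabytes(n_params, bytes_per_param=4):
    """Size of the raw weights in GB; 4 bytes per parameter ASSUMES fp32 storage."""
    return n_params * bytes_per_param / 1e9


TOK_PER_S = 3552            # benchmark gate on large_300m
N_PARAMS = 268.8e6          # Marshmello-300M parameter count
TOKEN_BUDGET = 100_000_000  # hypothetical budget, chosen only for illustration

print(f"{training_hours(TOKEN_BUDGET, TOK_PER_S):.1f} h for the hypothetical budget")
print(f"{weight_gigabytes(N_PARAMS):.2f} GB of fp32 weights")
```

Under these assumptions the weights alone account for only part of the ~5.5 GB peak; the remainder would go to gradients, optimizer state, and activations, whose split this card does not report.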
Instruct checkpoints (GitHub only)
Base weights live on this Hub repo. Instruct / routing checkpoints (~632 MB each) are trained locally and documented in the GitHub repo; they are not uploaded here yet:
| Checkpoint | Role |
|---|---|
| `18E_tiny_teacher_sft/checkpoints/teacher_latest.pt` | Tiny teacher SFT (~1590 short answers incl. math) |
| `18B_marshmello_instruct/checkpoints/best_18j_routing.pt` | Best 18J core routing (~18%); recommended for deployment |
Chat after cloning + downloading base weights:
```bash
python 18B_marshmello_instruct/chat.py \
  --checkpoint 18B_marshmello_instruct/checkpoints/best_18j_routing.pt \
  --prompt "Explain what a database index is" --greedy
```
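For readers unfamiliar with the `--greedy` flag, the sketch below contrasts greedy decoding with temperature sampling for a single next-token choice. It assumes `--greedy` means "always take the highest-scoring token", which has not been checked against `chat.py`. The function names and toy logits are illustrative.

```python
import math
import random


def greedy_pick(logits):
    """Return the index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, rng=random):
    """Draw an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                              # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]  # unnormalized softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [0.1, 2.5, -1.0]          # made-up scores for a 3-token vocabulary
print(greedy_pick(toy_logits))         # 1
print(sample_pick(toy_logits, temperature=0.8, rng=random.Random(0)))
```

Greedy decoding gives repeatable answers, which suits benchmark-style prompts. Sampling trades that repeatability for variety, and lower temperatures move it closer to the greedy choice.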
Dual benchmarks: 18J (core concept routing, best ~18%) and 18K
(general assistant QA, best domain ~22.5%, hallucination ~64%).
See 18J_marshmello_core_sft/ and 18K_general_benchmark/ on GitHub.
Files in this repo
| File | Description |
|---|---|
| `model.safetensors` | Model weights |
| `config.json` | Architecture + parameter breakdown |
| `tokenizer.json` | BPE tokenizer (`</w>` word boundaries) |
| `generation_config.json` | Default sampling settings |
| `training_meta.json` | Training step, losses, hyperparameters |
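The `</w>` marker in the tokenizer row denotes the end of a word, which lets BPE distinguish a word-final piece from the same characters in the middle of a word. The sketch below shows the idea with a simplified encoder and a made-up merge list. It is not the project's tokenizer, and the real `tokenizer.json` may attach the marker or order its merges differently.

```python
def apply_bpe(word, merges):
    """Encode one word with a list of BPE merges, marking the word end with '</w>'."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:  # apply merges in their listed priority order
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def decode(tokens):
    """Join tokens and turn each '</w>' marker back into a space."""
    return "".join(tokens).replace("</w>", " ").strip()


# Hypothetical merges, chosen only to make the example readable.
TOY_MERGES = [("e", "r"), ("er", "</w>"), ("l", "o"), ("lo", "w")]

print(apply_bpe("lower", TOY_MERGES))  # ['low', 'er</w>']
print(apply_bpe("low", TOY_MERGES))    # ['low', '</w>']
print(decode(apply_bpe("lower", TOY_MERGES) + apply_bpe("low", TOY_MERGES)))  # lower low
```

Because the marker travels with the last piece of each word, decoding needs no separate record of where the spaces were.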
Limitations
- Trained on a small educational corpus (not web-scale pretraining)
- Outputs may memorize training paragraphs (see Phase 16 evaluation in GitHub repo)
- This Hub repo ships the base causal LM; instruct/routing checkpoints are on GitHub
- Custom PyTorch GPT (not a `transformers` `AutoModel`)
Citation
Built with the Marshmello learning project (Phases 01–17).