Model Checkpoints & Training Data

All large artefacts (model checkpoints and training data) are hosted on Hugging Face. This repository contains only code, configs, and analysis scripts.

Model Checkpoints — Hugging Face Model Hub

Repository: https://huggingface.co/SarahDaakour/dark-whiteGPLM

Five GPT-2 Small checkpoints (~982 MB each, ~4.9 GB total):

File	Dataset	Final Val Loss	Best Iter
`white/ckpt.pt`	White proteins (annotated, photosynthetic)	2.6535	40 000
`algae/ckpt.pt`	Algae proteins (400 k subsample)	2.7178	40 000
`dark/ckpt.pt`	Dark proteins (uncharacterised)	2.7549	32 000
`random_algae/ckpt.pt`	Random algae-biased (composition control)	2.8545	40 000
`random_full/ckpt.pt`	Random full (max-entropy control)	2.9932	40 000
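
To put the validation losses on a more intuitive scale, they can be converted to per-token perplexity. The sketch below assumes the reported values are mean cross-entropy in nats per token (the PyTorch default) and compares each one to ln(20), the loss of a uniform guess over the 20 standard amino acids. The names `VAL_LOSS`, `UNIFORM_LOSS`, and `perplexity` are illustrative and not part of this repository.

```python
import math

# Final validation losses copied from the table above.
VAL_LOSS = {
    "white": 2.6535,
    "algae": 2.7178,
    "dark": 2.7549,
    "random_algae": 2.8545,
    "random_full": 2.9932,
}

# Assumption: a uniform guess over the 20 standard amino acids. The real
# vocabulary may contain extra symbols (e.g. a sequence separator).
UNIFORM_LOSS = math.log(20)


def perplexity(loss_nats):
    """Convert a mean cross-entropy in nats to per-token perplexity."""
    return math.exp(loss_nats)


for name, loss in VAL_LOSS.items():
    gap = UNIFORM_LOSS - loss
    print(f"{name:<13} ppl={perplexity(loss):6.2f}  nats below uniform={gap:.4f}")
```

Under these assumptions the random_full control sits within about 0.003 nats of the uniform baseline, and the white model has the lowest perplexity of the five.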

Loading a checkpoint

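import sys

import torch

sys.path.insert(0, '.')          # repo root must be on the path so `model` can be imported
from model import GPT, GPTConfig

# Assumes dark/ckpt.pt has already been downloaded from the model repository
ckpt  = torch.load('dark/ckpt.pt', map_location='cpu')
model = GPT(GPTConfig(**ckpt['model_args']))
# Checkpoints saved from a torch.compile'd model prefix every key with '_orig_mod.'
state = {k.replace('_orig_mod.', ''): v for k, v in ckpt['model'].items()}
model.load_state_dict(state)
model.eval()                     # switch to inference mode (disables dropout)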
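
The loading snippet above expects the checkpoint file to exist locally. One way to fetch it is `hf_hub_download` from the `huggingface_hub` package, which caches the file and returns its local path. In this sketch the helper names `checkpoint_filename` and `fetch_checkpoint` are hypothetical; only the repository ID and the file layout come from this README.

```python
REPO_ID = "SarahDaakour/dark-whiteGPLM"
CHECKPOINTS = ("white", "algae", "dark", "random_algae", "random_full")


def checkpoint_filename(name):
    """Map a run name to its path inside the model repository."""
    if name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint {name!r}; expected one of {CHECKPOINTS}")
    return f"{name}/ckpt.pt"


def fetch_checkpoint(name):
    """Download one checkpoint (~982 MB) and return its local cache path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(name))
```

Calling `fetch_checkpoint('dark')` returns a path that can be passed to `torch.load` in place of `'dark/ckpt.pt'`.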

Training Data — Hugging Face Dataset Hub

Repository: https://huggingface.co/datasets/SarahDaakour/dark-whiteGPLM-data

Path	Description	Size
`char_white/input.txt`	White proteins (raw sequences)	71 MB
`char_algae/input.txt`	Algae proteins (raw sequences)	63 MB
`char_algae/algae_sampled_400k.fasta`	Algae source FASTA (400 k seqs)	78 MB
`char_dark/input.txt`	Dark proteins (raw sequences)	67 MB
`char_random_algae_biased/input.txt`	Random algae-biased sequences	59 MB
`char_random_full/input.txt`	Random full (uniform) sequences	59 MB
`char_white/train.bin`	White tokenised training split	127 MB
`char_white/val.bin`	White tokenised validation split	15 MB
`char_algae/train.bin`	Algae tokenised training split	113 MB
`char_algae/val.bin`	Algae tokenised validation split	13 MB
`char_dark/train.bin`	Dark tokenised training split	121 MB
`char_dark/val.bin`	Dark tokenised validation split	14 MB
`char_random_algae_biased/train.bin`	Random algae-biased tokenised training split	106 MB
`char_random_algae_biased/val.bin`	Random algae-biased tokenised validation split	12 MB
`char_random_full/train.bin`	Random full tokenised training split	106 MB
`char_random_full/val.bin`	Random full tokenised validation split	12 MB
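
The listed sizes are consistent with a 90/10 train/validation split stored at two bytes per token: for example, 71 MB of white sequences × 0.9 × 2 bytes ≈ 128 MB, close to the listed 127 MB. The sketch below reads token IDs from a `.bin` file on that basis. The unsigned 16-bit, native-byte-order layout is an assumption inferred from the sizes, not something this README states, and `read_tokens` is a hypothetical helper.

```python
import array
import os


def read_tokens(path, count=None):
    """Read up to `count` token IDs from a .bin file (all of them if None)."""
    tokens = array.array("H")  # assumed layout: unsigned 16-bit, native byte order
    available = os.path.getsize(path) // tokens.itemsize
    n = available if count is None else min(count, available)
    with open(path, "rb") as f:
        tokens.fromfile(f, n)
    return tokens
```

For example, `len(read_tokens('char_dark/val.bin'))` would give the number of validation tokens if the assumed layout holds. For training-sized files, `numpy.memmap` with `dtype=numpy.uint16` avoids loading the whole file into memory.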

Reproducing tokenised data from source

# Each data/char_*/prepare.py reads input.txt and writes train.bin + val.bin
python data/char_dark/prepare.py
python data/char_white/prepare.py
python data/char_algae/prepare.py
python data/char_random_algae_biased/prepare.py
python data/char_random_full/prepare.py
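
For orientation, a character-level preparation step of this kind typically builds a vocabulary from the distinct characters, maps each character to an integer ID, and splits the ID stream into training and validation parts. The following is a minimal sketch under those assumptions, with a 90/10 split and 16-bit IDs; it is not the repository's `prepare.py`, and `encode_and_split` and `write_bin` are hypothetical names.

```python
import array


def encode_and_split(text, val_fraction=0.1):
    """Character-level encode `text` and split it into train/val ID arrays."""
    vocab = sorted(set(text))                      # one token per distinct character
    stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> integer ID
    ids = array.array("H", (stoi[ch] for ch in text))
    cut = int(len(ids) * (1 - val_fraction))       # first part trains, the rest validates
    return vocab, ids[:cut], ids[cut:]


def write_bin(ids, path):
    """Write token IDs as raw unsigned 16-bit integers."""
    with open(path, "wb") as f:
        ids.tofile(f)
```

Running `encode_and_split` on the contents of an `input.txt` and passing the two arrays to `write_bin` would produce files analogous to `train.bin` and `val.bin`, though the actual scripts may differ in vocabulary handling and split ratio.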

Training a model

# Example: reproduce the dark-protein run
python train_.py config/train_dark.py

See config/train_*.py for all five run configurations.

Analysis

Post-training analysis scripts and all figures are in analysis/.

python analysis/post_training_analysis.py   # training curves, stats, learnability
python analysis/roc_auc_analysis.py         # ROC/AUC from checkpoints

See analysis/ANALYSIS_NOTES.md for a full summary of results.

Citation

If you use these checkpoints or data, please cite this repository:

@software{dark-whiteGPLM,
  author = {SarahD4},
  title  = {dark-whiteGPLM},
  year   = {2026},
  url    = {https://github.com/SarahD4/dark-whiteGPLM}
}
