Model Checkpoints & Training Data
All large artefacts (model checkpoints and training data) are hosted on Hugging Face. This repository contains only code, configs, and analysis scripts.
Model Checkpoints (Hugging Face Model Hub)
Repository: https://huggingface.co/SarahDaakour/dark-whiteGPLM
Five GPT-2 Small checkpoints (~982 MB each, ~4.9 GB total):
| File | Dataset | Final Val Loss | Best Iter |
|---|---|---|---|
| white/ckpt.pt | White proteins (annotated, photosynthetic) | 2.6535 | 40 000 |
| algae/ckpt.pt | Algae proteins (400 k subsample) | 2.7178 | 40 000 |
| dark/ckpt.pt | Dark proteins (uncharacterised) | 2.7549 | 32 000 |
| random_algae/ckpt.pt | Random algae-biased (composition control) | 2.8545 | 40 000 |
| random_full/ckpt.pt | Random full (max-entropy control) | 2.9932 | 40 000 |
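To put the validation losses in context, they can be converted to per-token perplexities. The sketch below assumes the losses are mean cross-entropies in nats per token (the usual PyTorch convention, which is not stated above) and compares them with ln 20, the loss of a uniform guess over the 20 standard amino acids. The variable names are illustrative only.

```python
import math

# Final validation losses copied from the table above.
# Assumption: mean cross-entropy in nats per token.
val_loss = {
    "white": 2.6535,
    "algae": 2.7178,
    "dark": 2.7549,
    "random_algae": 2.8545,
    "random_full": 2.9932,
}

# Loss of a uniform guess over the 20 standard amino acids. This ignores any
# separator or special tokens the real vocabulary may contain.
uniform_loss = math.log(20)

# Perplexity is the exponential of the mean cross-entropy.
perplexity = {name: math.exp(loss) for name, loss in val_loss.items()}

for name, loss in val_loss.items():
    gap = uniform_loss - loss
    print(f"{name:>12}: loss {loss:.4f}  perplexity {perplexity[name]:.2f}  "
          f"gap to uniform {gap:+.4f} nats")
```

Under these assumptions the random_full loss sits close to ln 20 ≈ 2.996, which is what a max-entropy control would be expected to give.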
Loading a checkpoint
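import sys

import torch

sys.path.insert(0, '.')  # repo root must be on path so that `model` can be imported
from model import GPT, GPTConfig

# Load on CPU first; move the model to a GPU afterwards if needed
ckpt = torch.load('dark/ckpt.pt', map_location='cpu')
model = GPT(GPTConfig(**ckpt['model_args']))
# Strip the '_orig_mod.' prefix that torch.compile adds to parameter names
state = {k.replace('_orig_mod.', ''): v for k, v in ckpt['model'].items()}
model.load_state_dict(state)
model.eval()  # inference mode: disables dropout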
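The snippet above expects dark/ckpt.pt to exist locally. One way to fetch it is the huggingface_hub client, sketched below. This assumes the files in the Hub repository are laid out exactly as in the table (for example dark/ckpt.pt) and that huggingface_hub is installed; the helper names checkpoint_path_in_repo and download_checkpoint are hypothetical and not part of this repository.

```python
REPO_ID = "SarahDaakour/dark-whiteGPLM"  # taken from the repository URL above
RUNS = ("white", "algae", "dark", "random_algae", "random_full")


def checkpoint_path_in_repo(run: str) -> str:
    """Return the assumed path of a run's checkpoint inside the Hub repo."""
    if run not in RUNS:
        raise ValueError(f"unknown run {run!r}; expected one of {RUNS}")
    return f"{run}/ckpt.pt"


def download_checkpoint(run: str) -> str:
    """Download one checkpoint (~982 MB) and return its local cache path."""
    # Imported lazily so the helpers above work without the dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_path_in_repo(run))


if __name__ == "__main__":
    print(checkpoint_path_in_repo("dark"))
    # Uncomment to actually download; the returned path can be passed to torch.load.
    # local_path = download_checkpoint("dark")
```

The returned path points into the local Hugging Face cache, so repeated calls do not download the file again.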
Training Data (Hugging Face Dataset Hub)
Repository: https://huggingface.co/datasets/SarahDaakour/dark-whiteGPLM-data
| Path | Description | Size |
|---|---|---|
| char_white/input.txt | White proteins (raw sequences) | 71 MB |
| char_algae/input.txt | Algae proteins (raw sequences) | 63 MB |
| char_algae/algae_sampled_400k.fasta | Algae source FASTA (400 k seqs) | 78 MB |
| char_dark/input.txt | Dark proteins (raw sequences) | 67 MB |
| char_random_algae_biased/input.txt | Random algae-biased sequences | 59 MB |
| char_random_full/input.txt | Random full (uniform) sequences | 59 MB |
| char_white/train.bin | White tokenised training split | 127 MB |
| char_white/val.bin | White tokenised validation split | 15 MB |
| char_algae/train.bin | Algae tokenised training split | 113 MB |
| char_algae/val.bin | Algae tokenised validation split | 13 MB |
| char_dark/train.bin | Dark tokenised training split | 121 MB |
| char_dark/val.bin | Dark tokenised validation split | 14 MB |
| char_random_algae_biased/train.bin | Random algae tokenised train | 106 MB |
| char_random_algae_biased/val.bin | Random algae tokenised val | 12 MB |
| char_random_full/train.bin | Random full tokenised train | 106 MB |
| char_random_full/val.bin | Random full tokenised val | 12 MB |
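The sizes in the table allow a quick consistency check between each input.txt and its tokenised splits. The arithmetic below suggests roughly two bytes per input character and a train share of about 90%. This is an inference from the rounded sizes, not something the table states, and the function name summarise is illustrative.

```python
# Sizes in MB copied from the table above: (input.txt, train.bin, val.bin).
sizes_mb = {
    "char_white": (71, 127, 15),
    "char_algae": (63, 113, 13),
    "char_dark": (67, 121, 14),
    "char_random_algae_biased": (59, 106, 12),
    "char_random_full": (59, 106, 12),
}


def summarise(sizes):
    """Compute tokenised bytes per raw byte and the train share for each dataset."""
    rows = {}
    for name, (raw, train, val) in sizes.items():
        total = train + val
        rows[name] = {
            "bytes_per_char": total / raw,    # close to 2 would fit 16-bit token ids
            "train_fraction": train / total,  # share of tokens in train.bin
        }
    return rows


summary = summarise(sizes_mb)
for name, row in summary.items():
    print(f"{name}: {row['bytes_per_char']:.2f} bytes/char, "
          f"train fraction {row['train_fraction']:.2f}")
```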
Reproducing tokenised data from source
# Each data/char_*/prepare.py reads input.txt and writes train.bin + val.bin
python data/char_dark/prepare.py
python data/char_white/prepare.py
python data/char_algae/prepare.py
python data/char_random_algae_biased/prepare.py
python data/char_random_full/prepare.py
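For orientation, a character-level prepare step of this kind can be sketched as follows. This is not the repository's prepare.py: the 90/10 split, the 16-bit token encoding, and the helper names (build_vocab, encode, decode, split_and_pack) are assumptions for illustration, and the toy sequences are made up. The sketch uses only the standard library and keeps the packed bytes in memory rather than writing train.bin and val.bin.

```python
from array import array


def build_vocab(text):
    """Map each distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    return "".join(itos[i] for i in ids)


def split_and_pack(ids, train_fraction=0.9):
    """Split ids into train/val and pack each as unsigned 16-bit integers."""
    cut = int(len(ids) * train_fraction)
    train = array("H", ids[:cut])
    val = array("H", ids[cut:])
    return train.tobytes(), val.tobytes()


# Toy input: two short made-up sequences, one per line.
text = "MKTAYIAKQR\nMSLLTEVETY\n"
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
train_bytes, val_bytes = split_and_pack(ids)
print(len(stoi), "symbols;", len(train_bytes), "train bytes;", len(val_bytes), "val bytes")
```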
Training a model
# Example: reproduce the dark-protein run
python train_.py config/train_dark.py
See config/train_*.py for all five run configurations.
Analysis
Post-training analysis scripts and all figures are in analysis/.
python analysis/post_training_analysis.py # training curves, stats, learnability
python analysis/roc_auc_analysis.py # ROC/AUC from checkpoints
See analysis/ANALYSIS_NOTES.md for a full summary of results.
Citation
If you use these checkpoints or data, please cite this repository:
@software{dark-whiteGPLM,
author = {SarahD4},
title = {dark-whiteGPLM},
year = {2026},
url = {https://github.com/SarahD4/dark-whiteGPLM}
}