Model Checkpoints & Training Data

All large artefacts (model checkpoints and training data) are hosted on Hugging Face. This repository contains only code, configs, and analysis scripts.


Model Checkpoints: Hugging Face Model Hub

Repository: https://huggingface.co/SarahDaakour/dark-whiteGPLM

Five GPT-2 Small checkpoints (~982 MB each, ~4.9 GB total):

| File | Dataset | Final Val Loss | Best Iter |
|---|---|---|---|
| white/ckpt.pt | White proteins (annotated, photosynthetic) | 2.6535 | 40 000 |
| algae/ckpt.pt | Algae proteins (400 k subsample) | 2.7178 | 40 000 |
| dark/ckpt.pt | Dark proteins (uncharacterised) | 2.7549 | 32 000 |
| random_algae/ckpt.pt | Random algae-biased (composition control) | 2.8545 | 40 000 |
| random_full/ckpt.pt | Random full (max-entropy control) | 2.9932 | 40 000 |
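To put the validation losses in context, they can be converted to per-token perplexity and compared with a uniform baseline. The sketch below assumes the losses are mean cross-entropy values in nats per token (the PyTorch default) and that the uniform baseline covers the 20 standard amino acids. The actual vocabulary may include extra symbols such as a newline separator, so treat the baseline as approximate.

```python
import math

# Final validation losses copied from the table above.
VAL_LOSS = {
    "white": 2.6535,
    "algae": 2.7178,
    "dark": 2.7549,
    "random_algae": 2.8545,
    "random_full": 2.9932,
}

# Assumption: a uniform distribution over the 20 standard amino acids.
UNIFORM_BASELINE = math.log(20)


def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)


def summarise(losses: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (run name, perplexity, nats below the uniform baseline) per run."""
    return [
        (name, perplexity(loss), UNIFORM_BASELINE - loss)
        for name, loss in losses.items()
    ]


if __name__ == "__main__":
    for name, ppl, gain in summarise(VAL_LOSS):
        print(f"{name:13s} perplexity={ppl:6.2f}  gain_vs_uniform={gain:+.4f} nats")
```

Under these assumptions the max-entropy control sits within about 0.003 nats of ln 20, which is what one would expect for sequences with no learnable structure.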

Loading a checkpoint

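import sys

import torch

sys.path.insert(0, '.')          # repo root must be on the path so `model` can be imported
from model import GPT, GPTConfig

# Load on CPU; move the model to a GPU afterwards if needed.
ckpt  = torch.load('dark/ckpt.pt', map_location='cpu')
model = GPT(GPTConfig(**ckpt['model_args']))
# Checkpoints saved from a torch.compile'd model prefix every key with '_orig_mod.'; strip it.
state = {k.replace('_orig_mod.', ''): v for k, v in ckpt['model'].items()}
model.load_state_dict(state)
model.eval()                     # inference mode (disables dropout)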
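The snippet above expects `dark/ckpt.pt` to exist locally. One way to fetch a single checkpoint is `hf_hub_download` from the `huggingface_hub` package. The sketch below assumes that package is installed and that the files sit at the paths listed in the checkpoint table. The helper names `checkpoint_filename` and `download_checkpoint` are illustrative and not part of this repository.

```python
REPO_ID = "SarahDaakour/dark-whiteGPLM"

# Run names taken from the checkpoint table.
RUNS = ("white", "algae", "dark", "random_algae", "random_full")


def checkpoint_filename(run: str) -> str:
    """Map a run name to its checkpoint path inside the model repository."""
    if run not in RUNS:
        raise ValueError(f"unknown run {run!r}; expected one of {RUNS}")
    return f"{run}/ckpt.pt"


def download_checkpoint(run: str) -> str:
    """Download one checkpoint (~982 MB) and return its local cache path."""
    # Imported lazily so the rest of the module works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(run))
```

The returned path can be passed to `torch.load` in place of `'dark/ckpt.pt'`.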

Training Data: Hugging Face Dataset Hub

Repository: https://huggingface.co/datasets/SarahDaakour/dark-whiteGPLM-data

| Path | Description | Size |
|---|---|---|
| char_white/input.txt | White proteins (raw sequences) | 71 MB |
| char_algae/input.txt | Algae proteins (raw sequences) | 63 MB |
| char_algae/algae_sampled_400k.fasta | Algae source FASTA (400 k seqs) | 78 MB |
| char_dark/input.txt | Dark proteins (raw sequences) | 67 MB |
| char_random_algae_biased/input.txt | Random algae-biased sequences | 59 MB |
| char_random_full/input.txt | Random full (uniform) sequences | 59 MB |
| char_white/train.bin | White tokenised training split | 127 MB |
| char_white/val.bin | White tokenised validation split | 15 MB |
| char_algae/train.bin | Algae tokenised training split | 113 MB |
| char_algae/val.bin | Algae tokenised validation split | 13 MB |
| char_dark/train.bin | Dark tokenised training split | 121 MB |
| char_dark/val.bin | Dark tokenised validation split | 14 MB |
| char_random_algae_biased/train.bin | Random algae tokenised train | 106 MB |
| char_random_algae_biased/val.bin | Random algae tokenised val | 12 MB |
| char_random_full/train.bin | Random full tokenised train | 106 MB |
| char_random_full/val.bin | Random full tokenised val | 12 MB |
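The listed sizes are consistent with a character-level encoding stored at two bytes per token and a roughly 90/10 train/validation split. This is an inference from the sizes alone, not something stated by the repository. The check below only does the arithmetic.

```python
# Sizes in MB copied from the table above: (input.txt, train.bin, val.bin).
SIZES_MB = {
    "char_white": (71, 127, 15),
    "char_algae": (63, 113, 13),
    "char_dark": (67, 121, 14),
    "char_random_algae_biased": (59, 106, 12),
    "char_random_full": (59, 106, 12),
}


def bytes_per_char(raw_mb: float, train_mb: float, val_mb: float) -> float:
    """Tokenised bytes per raw byte (one raw byte is one character here)."""
    return (train_mb + val_mb) / raw_mb


def val_fraction(train_mb: float, val_mb: float) -> float:
    """Share of the tokenised data that ends up in the validation split."""
    return val_mb / (train_mb + val_mb)


if __name__ == "__main__":
    for name, (raw, train, val) in SIZES_MB.items():
        print(
            f"{name:26s} bytes/char={bytes_per_char(raw, train, val):.2f} "
            f"val_fraction={val_fraction(train, val):.3f}"
        )
```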

Reproducing tokenised data from source

# Each data/char_*/prepare.py reads input.txt and writes train.bin + val.bin
python data/char_dark/prepare.py
python data/char_white/prepare.py
python data/char_algae/prepare.py
python data/char_random_algae_biased/prepare.py
python data/char_random_full/prepare.py
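For readers who want to see what such a preparation step involves, here is a minimal character-level sketch using only the standard library. It assumes a sorted character vocabulary, unsigned 16-bit token ids, and a 90/10 split. The repository's `prepare.py` scripts may differ in any of these, and all function names here are hypothetical.

```python
from array import array


def build_vocab(text: str) -> dict[str, int]:
    """Assign an integer id to every distinct character, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, stoi: dict[str, int]) -> array:
    """Encode text as unsigned 16-bit token ids (typecode 'H')."""
    return array("H", (stoi[ch] for ch in text))


def split_tokens(ids: array, train_fraction: float = 0.9) -> tuple[array, array]:
    """Split the token stream into a leading train part and a trailing val part."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def write_bin(ids: array, path: str) -> None:
    """Write token ids as raw binary in native byte order."""
    with open(path, "wb") as fh:
        ids.tofile(fh)


if __name__ == "__main__":
    # Assumes input.txt sits in the current directory.
    with open("input.txt", encoding="utf-8") as fh:
        data = fh.read()
    vocab = build_vocab(data)
    train_ids, val_ids = split_tokens(encode(data, vocab))
    write_bin(train_ids, "train.bin")
    write_bin(val_ids, "val.bin")
```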

Training a model

# Example: reproduce the dark-protein run
python train_.py config/train_dark.py

See config/train_*.py for all five run configurations.


Analysis

Post-training analysis scripts and all figures are in analysis/.

python analysis/post_training_analysis.py   # training curves, stats, learnability
python analysis/roc_auc_analysis.py         # ROC/AUC from checkpoints

See analysis/ANALYSIS_NOTES.md for a full summary of results.
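As a reminder of what an ROC AUC measures, the sketch below computes it from per-sequence scores using the pairwise (Mann-Whitney) formulation: the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. It is a generic illustration. Which scores and labels `roc_auc_analysis.py` uses is not described here, and the example values are made up.

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """ROC AUC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up scores for four sequences; the last two are labelled positive.
    example_scores = [0.1, 0.4, 0.35, 0.8]
    example_labels = [0, 0, 1, 1]
    print(f"AUC = {roc_auc(example_scores, example_labels):.2f}")
```

The double loop is quadratic in the number of examples, which is fine for a sanity check but slow for large evaluation sets, where a rank-based implementation is preferable.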


Citation

If you use these checkpoints or data, please cite this repository:

@software{dark-whiteGPLM,
  author = {SarahD4},
  title  = {dark-whiteGPLM},
  year   = {2026},
  url    = {https://github.com/SarahD4/dark-whiteGPLM}
}