GenomeOcean-4B-v1.2
GenomeOcean-4B-v1.2 is a 4-billion-parameter causal language model for microbial genomic sequences. It is the June 2026 public release of the 4B v1.2 continued-training run, starting from GenomeOcean-4B and trained on an expanded corpus that adds IMGVR5 UViG, GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database.
This release packages the Perlmutter 128-node leapfrog checkpoint selected for the June 2026 release at step 84,197.
What's new in v1.2
| Change | Details |
|---|---|
| IMGVR5 UViG dataset | Added ~2.7M uncultivated viral genomes from IMG/MetaVR |
| Expanded dataset | Includes GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database |
| Scaffold gap handling | Runs of 10 or more consecutive N bases are collapsed to ceil(log2(n)) [MASK] tokens, where n is the run length, with loss masking, so the model can bridge gaps without being trained to generate them |
| [CLS] domain tag | Viral/phage sequences are prefixed with [CLS] to signal sequence type to the model |
| [SEP] genome boundaries | [SEP] token inserted between genome records during the packing step |
| IUPAC resolution | Ambiguity codes (R, Y, S, W, K, M, B, D, H, V) resolved by random sampling |
| Long-context training | Trained with a 10,240-token packing block size (~50 kbp genomic context) |
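The gap-collapsing and IUPAC-resolution rows in the table above can be illustrated with a minimal sketch. This is not the release's preprocessing code: the function names, the uniform sampling over compatible bases, and the choice to leave shorter N runs untouched are all assumptions made for illustration, and loss masking is not shown.

```python
import math
import random
import re

# IUPAC ambiguity codes and the bases each one may stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

MIN_GAP = 10  # runs of at least this many Ns are collapsed (from the table)


def resolve_iupac(seq, rng):
    """Replace each ambiguity code with one compatible base.

    Assumption: bases are sampled uniformly; the card only says "random sampling".
    """
    return "".join(rng.choice(IUPAC[b]) if b in IUPAC else b for b in seq)


def collapse_gaps(seq):
    """Replace each run of >= MIN_GAP Ns with ceil(log2(run length)) [MASK] tokens.

    Assumption: shorter N runs are left as-is; the card does not say how they are handled.
    """
    def repl(match):
        n_masks = math.ceil(math.log2(len(match.group())))
        return "[MASK]" * n_masks

    return re.sub(r"N{%d,}" % MIN_GAP, repl, seq)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    raw = "ACGTRY" + "N" * 16 + "ACGT"
    print(collapse_gaps(resolve_iupac(raw, rng)))
```

Under this rule a 16-base gap becomes 4 [MASK] tokens and a 1,000-base gap becomes 10, so the token cost of a gap grows only logarithmically with its length.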
Model Details
- Architecture: Mistral (decoder-only transformer)
- Parameters: ~4.25B
- Tokenizer: BPE, vocabulary size 4096
- Training block size: 10,240 tokens (~50 kbp)
- Config max positions: 32,768
- Training precision: bfloat16
- Training framework: DeepSpeed ZeRO-3
- Continued from: DOEJGI/GenomeOcean-4B
- Release checkpoint: checkpoint-84197
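As a rough guide to what the token counts above mean in base pairs, the short calculation below derives an average bases-per-token ratio from the stated block size and its approximate genomic span. The extrapolation to the config position limit is an illustrative assumption: training used 10,240-token blocks, and this card reports no results at longer contexts.

```python
BLOCK_TOKENS = 10_240   # training block size in tokens
BLOCK_BASES = 50_000    # approximate genomic span of one block (~50 kbp)
MAX_POSITIONS = 32_768  # config max positions

# Average number of bases covered by one BPE token, implied by the two figures above.
bases_per_token = BLOCK_BASES / BLOCK_TOKENS

# Hypothetical span if all config positions were filled at the same ratio.
max_span_kbp = MAX_POSITIONS * bases_per_token / 1000

print(f"~{bases_per_token:.1f} bp per token")
print(f"~{max_span_kbp:.0f} kbp at the config position limit")
```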
Final Validation Metrics
Metrics from the final evaluation recorded alongside the release checkpoint:
- Eval loss: 4.9682
- Eval accuracy: 0.1439
- Perplexity: 143.77
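The reported perplexity is consistent with the eval loss if the loss is a mean cross-entropy in nats per token, which is assumed here rather than stated in the card. A quick check:

```python
import math

eval_loss = 4.9682  # reported eval loss; assumed to be mean cross-entropy in nats/token

perplexity = math.exp(eval_loss)           # perplexity = exp(loss)
bits_per_token = eval_loss / math.log(2)   # the same loss expressed in bits

print(f"perplexity: {perplexity:.2f}")
print(f"bits per token: {bits_per_token:.3f}")
```

For scale, a uniform guess over the 4,096-token vocabulary would have a perplexity of 4,096.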
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer; left padding keeps prompts right-aligned for batched generation.
tokenizer = AutoTokenizer.from_pretrained(
    "DOEJGI/GenomeOcean-4B-v1.2",
    trust_remote_code=True,
    padding_side="left",
)

# Load the model in bfloat16 (the training precision) and move it to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "DOEJGI/GenomeOcean-4B-v1.2",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Sample up to 100 new tokens as a continuation of a short DNA prompt.
sequence = "ATGCGATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Data
| Dataset | Type | Source |
|---|---|---|
| Antarctic, GRE, Harvard Forest, Mendota, NEON, Oilcane, Tara, HMP2 | Metagenomic assembly | Internal |
| GTDB r226 representative genomes | Bacterial/Archaeal genomic DNA | https://gtdb.ecogenomic.org/ |
| INPHARED phage genomes (Apr 2025) | Phage/bacteriophage DNA | https://github.com/RyanCook94/inphared |
| Zenodo RNA virus database | RNA virus genomes | https://zenodo.org/records/10989253 |
| IMGVR5 UViG | Uncultivated viral genomes | https://www.meta-virome.org/ |
Special Token Usage
Special tokens play the following roles during v1.2 training:
- [CLS]: prepended to viral/phage/RNA chunks to act as a domain tag
- [SEP]: inserted at genome boundaries during packing
- [MASK]: used to represent collapsed N-gaps in scaffold sequences
- [UNK]: should not appear in clean data; IUPAC ambiguity codes are resolved before tokenization
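To make the [CLS] and [SEP] roles concrete, here is a minimal sketch of how tokenized records might be packed into fixed-size blocks. It is an illustration under stated assumptions, not the training pipeline: `pack_records` is a hypothetical helper, [CLS] is added once per record rather than per chunk, tokens are shown as strings rather than IDs, and the final partial block is kept as-is.

```python
from typing import Iterable, List, Tuple

BLOCK_SIZE = 10_240  # training block size in tokens


def pack_records(
    records: Iterable[Tuple[List[str], bool]],
    block_size: int = BLOCK_SIZE,
) -> List[List[str]]:
    """Concatenate tokenized genome records and cut the stream into blocks.

    Each record is (tokens, is_viral). Viral records get a leading [CLS];
    a [SEP] is inserted between consecutive records.
    """
    stream: List[str] = []
    for i, (tokens, is_viral) in enumerate(records):
        if i > 0:
            stream.append("[SEP]")  # genome boundary marker
        if is_viral:
            stream.append("[CLS]")  # domain tag for viral/phage/RNA input
        stream.extend(tokens)
    # Slice the token stream into consecutive blocks (last one may be shorter).
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


if __name__ == "__main__":
    demo = [(["ATG", "CGA"], False), (["GGT"], True)]
    for block in pack_records(demo, block_size=4):
        print(block)
```

At inference time, the same convention suggests prefixing a viral or phage prompt with [CLS], though whether that improves generation is not reported in this card.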
Citation
If you use GenomeOcean in your research, please cite:
@article{zhou2025genomeocean,
  title={GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies},
  author={Zhou, Zhihan and Riley, Robert and Kautsar, Satria and Wu, Weimin and Egan, Rob
          and Hofmeyr, Steven and Goldhaber-Gordon, Shira and Yu, Mutian and Ho, Harrison
          and Liu, Fengchen and others},
  journal={bioRxiv},
  pages={2025--01},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
License
Copyright (c) 2025, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and Northwestern University. All rights reserved.
See LICENSE for details.