GenomeOcean-4B-v1.2

GenomeOcean-4B-v1.2 is a 4-billion-parameter causal language model for microbial genomic sequences. It is the June 2026 public release of the 4B v1.2 continued-training run, starting from GenomeOcean-4B and trained on an expanded corpus that adds IMGVR5 UViG, GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database.

This release packages the Perlmutter 128-node leapfrog checkpoint at step 84,197, which was the checkpoint selected for June 2026.

What's new in v1.2

Change | Details
IMGVR5 UViG dataset | Added ~2.7M uncultivated viral genomes from IMG/MetaVR
Expanded dataset | Includes GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database
Scaffold gap handling | Runs of 10 or more consecutive N bases are collapsed to ceil(log2(n)) [MASK] tokens, where n is the run length. Loss is masked at these positions, so the model can bridge gaps without being trained to generate them
[CLS] domain tag | Viral, phage, and RNA virus sequences are prefixed with [CLS] to signal sequence type to the model
[SEP] genome boundaries | A [SEP] token is inserted between genome records during the packing step
IUPAC resolution | Ambiguity codes (R, Y, S, W, K, M, B, D, H, V) are resolved by random sampling
Long-context training | Trained with a 10,240-token packing block size (~50 kbp genomic context)
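
The scaffold gap rule described above can be sketched in a few lines. This is a minimal illustration, not the training pipeline's code: the function name `collapse_n_gaps`, the choice to match only uppercase N, and the insertion of [MASK] as literal text (rather than as token IDs) are all assumptions made for the example.

```python
import math
import re

MASK = "[MASK]"
MIN_GAP = 10  # threshold from the table: runs shorter than this are left unchanged


def collapse_n_gaps(seq: str, min_gap: int = MIN_GAP) -> str:
    """Replace each run of >= min_gap N bases with ceil(log2(run_length)) [MASK] tokens.

    Hypothetical helper: works on raw text and only matches uppercase N.
    """
    pattern = re.compile(r"N{%d,}" % min_gap)

    def repl(match: re.Match) -> str:
        run_length = len(match.group(0))
        return MASK * math.ceil(math.log2(run_length))

    return pattern.sub(repl, seq)


print(collapse_n_gaps("ACGT" + "N" * 16 + "ACGT"))  # ACGT[MASK][MASK][MASK][MASK]ACGT
print(collapse_n_gaps("ACNNNGT"))                   # ACNNNGT (run shorter than 10)
```

Under this rule a 16 bp gap becomes 4 [MASK] tokens and a 1,000 bp gap becomes 10, so gap length is encoded on a log scale. Loss masking is not shown; in a typical Hugging Face setup it would mean setting the labels at [MASK] positions to -100, but the source does not state how it was implemented.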

Model Details

  • Architecture: Mistral (decoder-only transformer)
  • Parameters: ~4.25B
  • Tokenizer: BPE, vocabulary size 4096
  • Training block size: 10,240 tokens (~50 kbp)
  • Config max positions: 32,768
  • Training precision: bfloat16
  • Training framework: DeepSpeed ZeRO-3
  • Continued from: DOEJGI/GenomeOcean-4B
  • Release checkpoint: checkpoint-84197
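
The block size and context figures above imply an average compression of a little under 5 bp per token. The short calculation below makes that explicit; it assumes the "~50 kbp" figure is exactly 50,000 bp, and it says nothing about how well the model performs beyond its 10,240-token training block.

```python
BLOCK_TOKENS = 10_240   # training block size, from the list above
BLOCK_BP = 50_000       # "~50 kbp", treated as exact for this estimate (assumption)
MAX_POSITIONS = 32_768  # config max positions, from the list above

bp_per_token = BLOCK_BP / BLOCK_TOKENS


def approx_bp(num_tokens: int) -> float:
    """Rough genomic span covered by num_tokens at the average compression rate."""
    return num_tokens * bp_per_token


print(f"{bp_per_token:.2f} bp/token")                             # 4.88 bp/token
print(f"{approx_bp(MAX_POSITIONS) / 1000:.0f} kbp at max positions")  # 160 kbp at max positions
```

The actual ratio varies by sequence, because BPE merges depend on local composition, so treat these numbers as planning estimates only.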

Final Validation Metrics

Metrics from the final evaluation recorded alongside the release checkpoint:

  • Eval loss: 4.9682
  • Eval accuracy: 0.1439
  • Perplexity: 143.77
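
The reported perplexity is consistent with the exponential of the reported eval loss, assuming the loss is a mean per-token cross-entropy in nats. The check below also prints the loss of a uniform guess over the 4096-token vocabulary for scale; that baseline is added here for context and is not a figure from the release.

```python
import math

eval_loss = 4.9682  # reported eval loss
vocab_size = 4096   # tokenizer vocabulary size from Model Details

perplexity = math.exp(eval_loss)
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary

print(round(perplexity, 2))    # 143.77
print(round(uniform_loss, 4))  # 8.3178
```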

Quick Start

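import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer; left padding keeps each prompt adjacent to its generated tokens in batched generation.
tokenizer = AutoTokenizer.from_pretrained(
    "DOEJGI/GenomeOcean-4B-v1.2",
    trust_remote_code=True,
    padding_side="left",
)
# Load the weights in bfloat16 (the training precision) and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "DOEJGI/GenomeOcean-4B-v1.2",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Tokenize a DNA prompt and sample up to 100 new tokens as a continuation.
sequence = "ATGCGATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
# Decode the prompt plus continuation back to a nucleotide string, dropping special tokens.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))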

Training Data

Dataset | Type | Source
Antarctic, GRE, Harvard Forest, Mendota, NEON, Oilcane, Tara, HMP2 | Metagenomic assembly | Internal
GTDB r226 representative genomes | Bacterial/Archaeal genomic DNA | https://gtdb.ecogenomic.org/
INPHARED phage genomes (Apr 2025) | Phage/bacteriophage DNA | https://github.com/RyanCook94/inphared
Zenodo RNA virus database | RNA virus genomes | https://zenodo.org/records/10989253
IMGVR5 UViG | Uncultivated viral genomes | https://www.meta-virome.org/

Special Token Usage

Each special token in the v1.2 tokenizer had a distinct role during training:

  • [CLS]: prepended to viral/phage/RNA chunks to act as a domain tag
  • [SEP]: inserted at genome boundaries during packing
  • [MASK]: used to represent collapsed N-gaps in scaffold sequences
  • [UNK]: should not appear in clean data; IUPAC ambiguity codes are resolved before tokenization
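
The IUPAC resolution and tagging steps above can be sketched at the text level as follows. This is an illustration under stated assumptions: the helper names `resolve_iupac` and `pack_records` are hypothetical, sampling is assumed to be uniform over the bases each code stands for, and the real pipeline presumably inserts special tokens as token IDs while packing fixed-size blocks rather than by concatenating strings.

```python
import random

# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve_iupac(seq: str, rng: random.Random) -> str:
    """Replace each ambiguity code with one randomly chosen compatible base.

    Assumes uniform sampling; A, C, G, T and N pass through unchanged.
    """
    return "".join(rng.choice(IUPAC[base]) if base in IUPAC else base for base in seq)


def pack_records(records, rng: random.Random) -> str:
    """Join (sequence, is_viral) records with [SEP], prefixing viral ones with [CLS]."""
    parts = []
    for seq, is_viral in records:
        body = resolve_iupac(seq, rng)
        parts.append("[CLS]" + body if is_viral else body)
    return "[SEP]".join(parts)


rng = random.Random(0)  # seeded only so the example is reproducible
print(pack_records([("ACGTAC", False), ("GGRTTY", True)], rng))
```

Resolving ambiguity codes before tokenization is what keeps [UNK] out of clean data: after this step every character is A, C, G, T, or part of an N run handled by the gap rule.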

Citation

If you use GenomeOcean in your research, please cite:

@article{zhou2025genomeocean,
  title={GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies},
  author={Zhou, Zhihan and Riley, Robert and Kautsar, Satria and Wu, Weimin and Egan, Rob
          and Hofmeyr, Steven and Goldhaber-Gordon, Shira and Yu, Mutian and Ho, Harrison
          and Liu, Fengchen and others},
  journal={bioRxiv},
  pages={2025--01},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

License

Copyright (c) 2025, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and Northwestern University. All rights reserved.

See LICENSE for details.
