GenomeOcean-4B-v1.2

GenomeOcean-4B-v1.2 is a 4-billion-parameter causal language model for microbial genomic sequences. It is the June 2026 public release of the 4B v1.2 continued-training run, starting from GenomeOcean-4B and trained on an expanded corpus that adds IMGVR5 UViG, GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database.

This release packages the checkpoint at step 84,197 of the Perlmutter 128-node leapfrog run, which was the checkpoint selected for the June 2026 release.

What's new in v1.2

Change	Details
IMGVR5 UViG dataset	Added ~2.7M uncultivated viral genomes from IMG/MetaVR
Expanded dataset	Includes GTDB r226 representative genomes, INPHARED phage genomes, and the Zenodo RNA virus database
Scaffold gap handling	Consecutive Ns (>=10) are collapsed to `ceil(log2(N))` `[MASK]` tokens with loss masking, allowing the model to bridge gaps without being trained to generate them
`[CLS]` domain tag	Viral/phage sequences are prefixed with `[CLS]` to signal sequence type to the model
`[SEP]` genome boundaries	`[SEP]` token inserted between genome records during the packing step
IUPAC resolution	Ambiguity codes (R, Y, S, W, K, M, B, D, H, V) resolved by random sampling
Long-context training	Trained with a 10,240-token packing block size (~50 kbp genomic context)
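
The scaffold gap rule in the table above can be sketched in a few lines of Python. This is an illustrative sketch only, not the training pipeline's code: the function names `collapse_gaps` and `mask_labels` are hypothetical, the replacement is done on raw strings rather than token IDs, runs shorter than 10 are left untouched (the card does not say how they are handled), and `-100` is assumed as the ignore index because it is the PyTorch/Transformers default for cross-entropy loss.

```python
import math
import re

MASK = "[MASK]"
MIN_GAP = 10  # per the card: only runs of >= 10 consecutive Ns are collapsed


def collapse_gaps(seq: str) -> str:
    """Replace each run of >= MIN_GAP Ns with ceil(log2(run_length)) [MASK] tokens."""
    def repl(match: re.Match) -> str:
        run_length = len(match.group(0))
        return MASK * math.ceil(math.log2(run_length))

    # Assumption: gaps are matched case-insensitively on the raw sequence string.
    return re.sub(r"[Nn]{%d,}" % MIN_GAP, repl, seq)


def mask_labels(input_ids: list[int], mask_id: int, ignore_index: int = -100) -> list[int]:
    """Copy input_ids into labels, hiding [MASK] positions from the loss.

    mask_id is a hypothetical token ID; look it up from the real tokenizer.
    """
    return [ignore_index if tok == mask_id else tok for tok in input_ids]


if __name__ == "__main__":
    scaffold = "ACGT" + "N" * 100 + "TTGA"
    print(collapse_gaps(scaffold))
```

Because the number of `[MASK]` tokens grows with the logarithm of the gap length, a 1,024-base gap costs 10 tokens instead of hundreds, while the model still sees a rough signal of how long the gap was.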

Model Details

Architecture: Mistral (decoder-only transformer)
Parameters: ~4.25B
Tokenizer: BPE, vocabulary size 4096
Training block size: 10,240 tokens (~50 kbp)
Config max positions: 32,768
Training precision: bfloat16
Training framework: DeepSpeed ZeRO-3
Continued from: DOEJGI/GenomeOcean-4B
Release checkpoint: checkpoint-84197
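
To gauge whether a sequence fits the context the model was trained on, the figures above can be turned into a rough estimate. The sketch below assumes the card's approximate ratio (10,240 tokens for about 50 kbp, so a little under 5 bp per token) holds for your sequence; the real token count depends on the BPE tokenizer and the sequence itself, and the helper names are hypothetical.

```python
TRAIN_BLOCK_TOKENS = 10_240   # training block size from the card
APPROX_CONTEXT_BP = 50_000    # "~50 kbp" from the card (approximate)
MAX_POSITIONS = 32_768        # config max positions from the card


def estimate_tokens(seq_len_bp: int) -> int:
    """Estimate the token count for a sequence length in bp (ceiling division)."""
    return -(-seq_len_bp * TRAIN_BLOCK_TOKENS // APPROX_CONTEXT_BP)


def context_check(seq_len_bp: int) -> dict:
    """Compare the estimate against the training block and the config limit."""
    n_tokens = estimate_tokens(seq_len_bp)
    return {
        "estimated_tokens": n_tokens,
        "within_training_block": n_tokens <= TRAIN_BLOCK_TOKENS,
        "within_max_positions": n_tokens <= MAX_POSITIONS,
    }


if __name__ == "__main__":
    for length in (30_000, 100_000, 200_000):
        print(length, context_check(length))
```

Sequences that exceed the training block but stay under the config limit can still be passed to the model, though behavior beyond the trained block size is not characterized in this card. For an exact count, tokenize the sequence and check the length of the resulting IDs.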

Final Validation Metrics

Metrics from the final evaluation recorded alongside the release checkpoint:

Eval loss: 4.9682
Eval accuracy: 0.1439
Perplexity: 143.77
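
The perplexity follows directly from the eval loss, assuming the loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training). A quick check:

```python
import math

eval_loss = 4.9682  # mean per-token cross-entropy from the card, assumed to be in nats

perplexity = math.exp(eval_loss)        # perplexity = exp(loss)
bits_per_token = eval_loss / math.log(2)  # same loss expressed in bits

print(f"perplexity     = {perplexity:.2f}")
print(f"bits per token = {bits_per_token:.3f}")
```

For reference, a uniform guess over the 4,096-token vocabulary would have a perplexity of 4,096.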

Quick Start

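import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer; left padding keeps prompts aligned for batched generation.
tokenizer = AutoTokenizer.from_pretrained(
    "DOEJGI/GenomeOcean-4B-v1.2",
    trust_remote_code=True,
    padding_side="left",
)
# Load the model in bfloat16 (the training precision) and move it to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "DOEJGI/GenomeOcean-4B-v1.2",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Use a short DNA prompt and sample a continuation of up to 100 new tokens.
sequence = "ATGCGATCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
# Decode prompt plus continuation back to a nucleotide string.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))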

Training Data

Dataset	Type	Source
Antarctic, GRE, Harvard Forest, Mendota, NEON, Oilcane, Tara, HMP2	Metagenomic assembly	Internal
GTDB r226 representative genomes	Bacterial/Archaeal genomic DNA	https://gtdb.ecogenomic.org/
INPHARED phage genomes (Apr 2025)	Phage/bacteriophage DNA	https://github.com/RyanCook94/inphared
Zenodo RNA virus database	RNA virus genomes	https://zenodo.org/records/10989253
IMGVR5 UViG	Uncultivated viral genomes	https://www.meta-virome.org/

Special Token Usage

During training, each special token in the v1.2 tokenizer has a specific role:

[CLS]: prepended to viral, phage, and RNA virus chunks as a domain tag
[SEP]: inserted at genome boundaries during packing
[MASK]: used to represent collapsed N-gaps in scaffold sequences
[UNK]: should not appear in clean data; IUPAC ambiguity codes are resolved before tokenization
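
A minimal sketch of how these roles could fit together during preprocessing is shown below. It is not the project's pipeline: the helper names are hypothetical, the work is done on plain strings rather than token IDs, and uniform sampling among the bases each ambiguity code stands for is an assumption (the card only says "random sampling"). The code-to-base mapping is the standard IUPAC nucleotide table.

```python
import random

# Standard IUPAC ambiguity codes and the bases each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve_iupac(seq: str, rng: random.Random) -> str:
    """Replace each ambiguity code with one randomly chosen compatible base.

    N is left alone here; gap handling is a separate step.
    """
    return "".join(rng.choice(IUPAC[b]) if b in IUPAC else b for b in seq.upper())


def format_record(seq: str, is_viral: bool, rng: random.Random) -> str:
    """Resolve ambiguity codes and add the [CLS] domain tag for viral records."""
    body = resolve_iupac(seq, rng)
    return "[CLS]" + body if is_viral else body


def pack_records(records: list[tuple[str, bool]], rng: random.Random) -> str:
    """Join (sequence, is_viral) records with [SEP] at genome boundaries."""
    return "[SEP]".join(format_record(seq, viral, rng) for seq, viral in records)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is reproducible
    records = [("ACGTRYACGT", False), ("ATGMKWCCA", True)]
    print(pack_records(records, rng))
```

Resolving ambiguity codes before tokenization is what keeps `[UNK]` out of clean data: after this step, every character is one of A, C, G, T, or N.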

Citation

If you use GenomeOcean in your research, please cite:

@article{zhou2025genomeocean,
  title={GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies},
  author={Zhou, Zhihan and Riley, Robert and Kautsar, Satria and Wu, Weimin and Egan, Rob
          and Hofmeyr, Steven and Goldhaber-Gordon, Shira and Yu, Mutian and Ho, Harrison
          and Liu, Fengchen and others},
  journal={bioRxiv},
  pages={2025--01},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

License

Copyright (c) 2025, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and Northwestern University. All rights reserved.

See LICENSE for details.
