gLM-650M

Minimal HuggingFace port of the 650M parameter variant of gLM2 -- a mixed-modality genomic language model that encodes a genomic scaffold using both amino-acid and DNA tokens. Pretrained with masked language modeling on the OMG dataset.

Architecture

Parameter	Value
Layers	33
Attention heads	20
Embedding dimension	1280
FFN hidden dimension	3584 (SwiGLU, multiple_of=256)
Vocabulary size	37
Positional encoding	RoPE (base=10000, non-interleaved)
Normalization	RMSNorm
Architecture	Pre-LN Transformer with SwiGLU FFN
Max sequence length	4096
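
The FFN width in the table is consistent with the LLaMA-style sizing rule for SwiGLU layers. The sketch below assumes that rule (it is not taken from the gLM2 source) and also derives the per-head dimension from the table values. `swiglu_hidden_dim` is a hypothetical helper name.

```python
def swiglu_hidden_dim(dim: int, multiple_of: int = 256) -> int:
    """Assumed LLaMA-style rule: take 2/3 of 4*dim, then round up to a multiple."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


embed_dim, num_heads = 1280, 20
ffn_dim = swiglu_hidden_dim(embed_dim)   # FFN hidden dimension
head_dim = embed_dim // num_heads        # dimension of each attention head
print(ffn_dim, head_dim)  # 3584 64
```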

Vocabulary: <cls>, <pad>, <eos>, <unk>, 25 IUPAC amino-acid letters (L A G V S E R T I D P K Q N F Y M H W C X B U Z O, uppercase), the 4 DNA nucleotides (a t c g, lowercase), strand markers <+> / <->, and <mask> / <sep>. Amino-acid and nucleotide tokens share the alphabet by case (uppercase = amino acid, lowercase = nucleotide).
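
As a quick sanity check, the tokens listed above can be counted against the vocabulary size of 37. The ordering and the resulting ids in this sketch are illustrative only and are not claimed to match the tokenizer's actual id assignment.

```python
special = ["<cls>", "<pad>", "<eos>", "<unk>"]
amino_acids = list("LAGVSERTIDPKQNFYMHWCXBUZO")   # uppercase = amino acid
nucleotides = list("atcg")                        # lowercase = nucleotide
strand_markers = ["<+>", "<->"]
extra = ["<mask>", "<sep>"]

vocab = special + amino_acids + nucleotides + strand_markers + extra
# Illustrative ids only; the real tokenizer may order tokens differently.
token_to_id = {tok: i for i, tok in enumerate(vocab)}

print(len(amino_acids), len(vocab))  # 25 37
```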

Pretraining

Objective: Masked language modeling (30% mask rate)
Data: OMG dataset (open metagenomic corpus, semantically-deduplicated)
Pretraining tokens: 315B (bfloat16, context length 4096)
Source checkpoint: tattabio/gLM2_650M
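
A minimal sketch of what a 30% mask rate means at the token level. It assumes that every selected position is replaced by <mask>, that positions are sampled independently, and that special tokens and strand markers are never masked. None of these details are stated above, and the actual gLM2 corruption scheme may differ. `mask_tokens` is a hypothetical helper.

```python
import random

# Assumption: these tokens are never selected for masking.
SPECIAL = {"<cls>", "<pad>", "<eos>", "<sep>", "<+>", "<->"}


def mask_tokens(tokens, mask_rate=0.30, seed=0):
    """Return (corrupted tokens, labels); labels are None at unmasked positions."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_rate:
            corrupted.append("<mask>")
            labels.append(tok)      # the model is trained to recover this token
        else:
            corrupted.append(tok)
            labels.append(None)     # position is ignored by the loss
    return corrupted, labels


tokens = ["<+>"] + list("MALTKVEKRNRIKRR")
corrupted, labels = mask_tokens(tokens)
print("".join(corrupted))
```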

Parity Verification

All 34 representation levels (embedding + 33 transformer blocks) were verified to be bit-exact (max abs diff = 0.00) against the original tattabio/gLM2_650M weights with attn_implementation="sdpa". Of the added backends, eager agrees within fp32 kernel drift (atol = 1e-3) and flash_attention_2 reaches a bf16 cosine similarity >= 0.999. Verified on GPU with PyTorch 2.7 / CUDA 12.
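
The two metrics quoted above (max absolute difference and cosine similarity) can be computed per representation level with a few lines of PyTorch. This is a sketch of such a comparison, not the script used for the verification. `compare_hidden_states` is a hypothetical helper that takes two sequences of hidden-state tensors.

```python
import torch
import torch.nn.functional as F


def compare_hidden_states(ref_states, new_states):
    """Return one (max_abs_diff, cosine_similarity) pair per representation level."""
    report = []
    for ref, new in zip(ref_states, new_states):
        ref32, new32 = ref.float(), new.float()   # compare in fp32 regardless of dtype
        max_abs = (ref32 - new32).abs().max().item()
        cos = F.cosine_similarity(ref32.flatten(), new32.flatten(), dim=0).item()
        report.append((max_abs, cos))
    return report


# Synthetic stand-ins for hidden states from two backends.
ref = [torch.randn(1, 5, 8) for _ in range(3)]
new = [t.clone() for t in ref]
for level, (max_abs, cos) in enumerate(compare_hidden_states(ref, new)):
    print(level, max_abs, round(cos, 6))
```

In practice the two inputs would be the hidden-state tuples returned by each model or backend for the same batch, assuming both expose them.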

Related Models

See the full gLM2 collection.

Model	Parameters	Notes
gLM-150M	150M	Smaller variant
gLM-650M	650M	This model

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/gLM-650M", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/gLM-650M", trust_remote_code=True)
model.eval()

# Canonical gLM2 input: amino acids (uppercase) + DNA (lowercase) + strand markers.
sequence = (
    "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK"
    "<+>aatttaaggaa"
    "<->MLGIDNIERVKPGGLELVDRLVAVNRVTKVTKGGRAFGFSAIVVVGNED"
)
enc = tokenizer([sequence], return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 1280) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 1280)

# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] follows block i.
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer16_emb = out_all.hidden_states[16]       # after block 16

The tokenizer also accepts plain DNA strings (no strand marker) and auto-prepares them by lowercasing, replacing U/u with t, and prepending <+>. The three inputs below produce identical token sequences:

tokenizer(["ATCGATCG", "atcgatcg", "AUCGAUCG"], return_tensors="pt")
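
The auto-preparation described above can be mimicked in plain Python to see why the three inputs collapse to the same string. This is a sketch of the described behaviour, not the tokenizer's code. `prepare_plain_dna` is a hypothetical name, and the rule "any string without a strand marker is DNA" is an assumption, since the tokenizer's actual detection logic is not described here.

```python
STRAND_MARKERS = ("<+>", "<->")


def prepare_plain_dna(seq: str) -> str:
    """Lowercase, map u -> t, and prepend <+> when no strand marker is present."""
    if any(marker in seq for marker in STRAND_MARKERS):
        return seq  # already in canonical gLM2 form; leave untouched
    return "<+>" + seq.lower().replace("u", "t")


inputs = ["ATCGATCG", "atcgatcg", "AUCGAUCG"]
prepared = [prepare_plain_dna(s) for s in inputs]
print(prepared)  # ['<+>atcgatcg', '<+>atcgatcg', '<+>atcgatcg']
```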

MLM logits

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/gLM-650M", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/gLM-650M", trust_remote_code=True)
model.eval()

enc = tokenizer(["<+>MA<mask>K"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 37)
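
To turn the logits into predictions, one option is to take the top-k tokens at each masked position. The helper below is a sketch, and `top_k_at_mask` is a hypothetical name. It works on any logits tensor of shape (batch, seq_len, vocab) and is shown here on synthetic data.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=3):
    """Return (top_ids, top_probs), each of shape (num_masked_positions, k)."""
    mask_positions = input_ids == mask_token_id      # (batch, seq_len) boolean
    masked_logits = logits[mask_positions]           # (num_masked, vocab)
    probs = masked_logits.float().softmax(dim=-1)
    top_probs, top_ids = probs.topk(k, dim=-1)
    return top_ids, top_probs


# Synthetic example: one sequence of 4 tokens, mask id 36 at position 2.
demo_logits = torch.zeros(1, 4, 37)
demo_logits[0, 2, 5] = 10.0
demo_ids = torch.tensor([[0, 1, 36, 2]])
top_ids, top_probs = top_k_at_mask(demo_logits, demo_ids, mask_token_id=36)
print(top_ids.shape, top_ids[0, 0].item())
```

With the real model, the call would look like `top_k_at_mask(logits, enc["input_ids"], tokenizer.mask_token_id)`, assuming the tokenizer exposes `mask_token_id`. The ids can then be mapped back with `tokenizer.convert_ids_to_tokens`.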

Faster attention backends

import torch
from transformers import AutoModel

# SDPA (PyTorch 2.0+, default upstream backend) -- recommended for fp32
model = AutoModel.from_pretrained("Taykhoom/gLM-650M", trust_remote_code=True,
                                  attn_implementation="sdpa")

# Flash Attention 2 (requires flash-attn package) -- fastest on long sequences
model = AutoModel.from_pretrained("Taykhoom/gLM-650M", trust_remote_code=True,
                                  attn_implementation="flash_attention_2",
                                  dtype=torch.bfloat16)

Fine-tuning

Standard HF conventions. For sequence-level tasks, pool over non-padding positions or use the CLS token embedding as input to a prediction head.
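
A minimal sketch of pooling over non-padding positions, assuming the tokenizer returns an attention_mask with 1 for real tokens and 0 for padding (the usual HF convention, not confirmed here for this tokenizer). `mean_pool` is a hypothetical helper, and the tensors are synthetic.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)                            # avoid divide-by-zero
    return summed / counts


# Synthetic batch: the third position is padding and must not affect the mean.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)  # tensor([[2., 2.]])
```

With the real model, `mean_pool(out.last_hidden_state, enc["attention_mask"])` would give a (batch, 1280) tensor that can feed a prediction head such as `torch.nn.Linear(1280, num_labels)`, where `num_labels` depends on the task.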

Implementation Notes

The original gLM2 implementation uses PyTorch SDPA as the only attention backend. This HF port adds eager and flash_attention_2 as separate implementations selectable via attn_implementation, with an automatic fallback to eager when output_attentions=True is requested.

The eager kernel computes the QK matmul and softmax in fp32 even when the model is loaded in bf16, matching the numerical behaviour of SDPA and flash_attention_2 in mixed precision.
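
The upcasting described above can be sketched as follows. This is an illustrative, unmasked attention function and not the port's actual kernel. `eager_attention` is a hypothetical name, and padding masks and dropout are omitted for brevity.

```python
import math

import torch


def eager_attention(q, k, v):
    """q, k, v: (batch, heads, seq_len, head_dim). Returns (output, attention_probs)."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    # QK matmul and softmax run in fp32 even when the inputs are bf16.
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1)
    # Cast probabilities back to the value dtype for the final matmul.
    out = torch.matmul(probs.to(v.dtype), v)
    return out, probs


q, k, v = (torch.randn(1, 2, 5, 8) for _ in range(3))
out, probs = eager_attention(q, k, v)
print(out.shape, probs.shape)
```

Returning the attention probabilities is what makes an eager path useful for output_attentions=True, since fused kernels such as SDPA do not expose them.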

Citation

@article{cornman2024_glm2,
  title   = {The {OMG} dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling},
  author  = {Cornman, Andre and West-Roberts, Jacob and Camargo, Antonio Pedro and Roux, Simon and Beracochea, Martin and Mirdita, Milot and Ovchinnikov, Sergey and Hwang, Yunha},
  journal = {bioRxiv},
  year    = {2024},
  doi     = {10.1101/2024.08.14.607850}
}

Credits

Original model and code by Cornman et al. (Tatta Bio). Source: GitHub, tattabio/gLM2_650M on the Hub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0, following the original repository.
