BacLM 350M Masked

macwiatrak/baclm-350m-masked is a 350M-parameter masked language model for bacterial genomics. It is designed to model both protein sequences and intergenic DNA with a single shared character-level transformer encoder.

BacLM is mixed-modality in the sense that a single encoder is trained on both modalities. Each input is either a protein sequence or an intergenic DNA sequence, so the shared encoder processes one modality per input.

Model Description

BacLM is a mixed-modality genomic language model trained on bacterial protein and intergenic DNA sequences using a masked language modeling objective.

Key properties:

Model type: masked language model
Parameters: ~350M
Architecture: 32-layer transformer encoder
Hidden size: 960
Attention heads: 16
Maximum context length: 2048 tokens
Tokenization: character-level
Modalities: proteins and DNA/intergenic sequences
Modality handling: shared encoder weights across protein and DNA inputs
Objective: MLM (masking 15% of the tokens)

The tokenizer uses a shared vocabulary over protein and nucleotide characters and also produces token_type_ids, which let the model distinguish modalities internally. Protein and DNA examples can be batched together, but each example should correspond to a single sequence modality.
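
To make the 15% masking objective listed above concrete, here is a minimal sketch of character-level masking in plain Python. It is an illustration only, not the training code: the MASK_TOKEN string, the mask_sequence helper, and the choice to mask a fixed count of positions are all assumptions. The model card does not say whether some selected positions are left unchanged or replaced with random tokens, as some MLM recipes do.

```python
import random

# Hypothetical placeholder: the real tokenizer's mask token may differ.
MASK_TOKEN = "<mask>"


def mask_sequence(seq, mask_prob=0.15, seed=None):
    """Mask roughly `mask_prob` of the characters in `seq`.

    Returns (masked_tokens, labels). `labels` holds the original character
    at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    tokens = list(seq)  # character-level: one token per character
    # Assumption: mask a fixed number of positions (at least one if non-empty).
    n_mask = max(1, round(len(tokens) * mask_prob)) if tokens else 0
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


masked, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", seed=0)
print(masked)
print(labels)
```

The same function applies unchanged to a lowercase DNA string, since both modalities are handled one character at a time.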

Input Format

BacLM is case-sensitive:

Protein sequences should be passed in uppercase
DNA/intergenic sequences should be passed in lowercase

Examples:

Protein: MKTAYIAKQRQISFVKSHFSRQ
DNA: atgcttagctagcttacg
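
Because casing carries the modality, it can help to normalize inputs before tokenization. The sketch below shows one way to do that in user code. The format_sequence and detect_modality helpers are hypothetical and are not part of the BacLM tokenizer.

```python
def format_sequence(seq, modality):
    """Apply the casing convention for a sequence of known modality."""
    seq = seq.strip()
    if modality == "protein":
        return seq.upper()
    if modality == "dna":
        return seq.lower()
    raise ValueError(f"unknown modality: {modality!r}")


def detect_modality(seq):
    """Guess the modality from casing: all uppercase -> protein, all lowercase -> DNA."""
    letters = [c for c in seq if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    if all(c.isupper() for c in letters):
        return "protein"
    if all(c.islower() for c in letters):
        return "dna"
    # Mixed case is rejected rather than guessed.
    raise ValueError("mixed-case sequence: modality is ambiguous")


print(format_sequence("mktayiakqrqisfvkshfsrq", "protein"))
print(format_sequence("ATGCTTAGCTAGCTTACG", "dna"))
print(detect_modality("atgcttagctagcttacg"))
```

Rejecting mixed-case input is a deliberate choice here: a DNA string typed in uppercase would otherwise be read as protein, since A, C, G and T are also valid amino-acid letters.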

Intended Uses

This model is intended for:

extracting contextual sequence embeddings
pretraining and transfer learning for bacterial genomics
downstream evaluation on bacterial sequence tasks
masked-token prediction in bacterial protein or DNA sequences

Usage

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "macwiatrak/baclm-350m-masked"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, dtype=torch.bfloat16)
model.eval().cuda()

seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",   # protein: uppercase
    "atgcttagctagcttacg",       # DNA: lowercase
]

batch = tokenizer.batch_encode_plus(
    seqs,
    padding=True,
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
batch = {k: v.cuda() for k, v in batch.items()}

with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        token_type_ids=batch.get("token_type_ids"),
        attention_mask=batch.get("attention_mask"),
    )

# Token-level embeddings
token_embeddings = outputs.last_hidden_state

# Mean pooled embeddings
attention_mask = batch["attention_mask"].unsqueeze(-1)
mean_embeddings = (token_embeddings * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp_min(1)
print(mean_embeddings.shape)
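
The example above truncates at max_length=2048, so anything beyond the context window is silently dropped. One possible workaround, sketched here as an assumption rather than a recommended BacLM workflow, is to split long sequences into windows before tokenization and embed each window separately. The split_into_windows helper is hypothetical, and the default window of 2046 characters assumes the tokenizer adds two special tokens, which this model card does not confirm.

```python
def split_into_windows(seq, window=2046, overlap=0):
    """Split `seq` into consecutive windows of at most `window` characters.

    Adjacent windows share `overlap` characters. An empty sequence yields
    an empty list.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # the last window already reaches the end of the sequence
    return windows


long_dna = "acgt" * 1250  # 5000 nucleotides
chunks = split_into_windows(long_dna)
print([len(c) for c in chunks])
```

How to combine the per-window embeddings afterwards (for example, a length-weighted average) depends on the downstream task and is left open here.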

Training Data

BacLM was trained on large-scale bacterial sequence data comprising protein sequences derived from coding regions and intergenic DNA sequences.

Limitations

The model is intended for bacterial sequences, not general eukaryotic genomics.
It operates at the character level, so masking and prediction apply to individual amino acids or nucleotides rather than to higher-level biological units.
Protein and DNA inputs should follow the expected casing convention for reliable modality handling.

Citation

TBD
