BacLM 350M Masked

macwiatrak/baclm-350m-masked is a 350M-parameter masked language model for bacterial genomics. It is designed to model both protein sequences and intergenic DNA with a single shared character-level transformer encoder.

BacLM is mixed-modality in the sense that a single encoder is trained on both modalities: each input is either a protein sequence or an intergenic DNA sequence, and the encoder processes one modality per sequence.

Model Description

BacLM is a mixed-modality genomic language model trained on bacterial protein and intergenic DNA sequences using a masked language modeling objective.

Key properties:

  • Model type: masked language model
  • Parameters: ~350M
  • Architecture: 32-layer transformer encoder
  • Hidden size: 960
  • Attention heads: 16
  • Maximum context length: 2048 tokens
  • Tokenization: character-level
  • Modalities: proteins and DNA/intergenic sequences
  • Modality handling: shared encoder weights across protein and DNA inputs
  • Objective: MLM (masking 15% of the tokens)
  • Tokenizer: uses a shared vocabulary over protein and nucleotide characters and also returns token_type_ids, which let the model distinguish the two modalities. Protein and DNA examples can be batched together, but each example must contain a single modality.
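
As a rough sanity check on the listed sizes, the sketch below estimates the encoder parameter count from the layer count and hidden size. It assumes a standard transformer layer (about 4·d² attention weights plus 8·d² feed-forward weights with a 4x expansion) and ignores biases, normalization layers, and embeddings; the actual BacLM layer layout may differ.

```python
# Back-of-the-envelope parameter estimate for a standard transformer encoder.
# Assumption (not from the model card): each layer holds ~4*d^2 attention
# weights and ~8*d^2 feed-forward weights; biases, norms, embeddings ignored.
num_layers = 32
hidden_size = 960
num_heads = 16

head_dim = hidden_size // num_heads          # per-head dimension
params_per_layer = 12 * hidden_size ** 2     # attention + feed-forward weights
encoder_params = num_layers * params_per_layer

print(f"head_dim = {head_dim}")
print(f"encoder parameters ~ {encoder_params / 1e6:.0f}M")
```

Under these assumptions the estimate lands close to the stated ~350M, which suggests the listed depth and width account for most of the model's parameters.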

Input Format

BacLM is case-sensitive:

  • Protein sequences should be passed in uppercase
  • DNA/intergenic sequences should be passed in lowercase

Examples:

  • Protein: MKTAYIAKQRQISFVKSHFSRQ
  • DNA: atgcttagctagcttacg
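
Because casing determines how a sequence is interpreted, it can help to normalize inputs before tokenization. The helper below is a minimal sketch; `format_for_baclm` and its `modality` argument are hypothetical names and not part of the released tokenizer.

```python
def format_for_baclm(seq: str, modality: str) -> str:
    """Return `seq` in the casing the model card asks for.

    `modality` is a hypothetical label supplied by the caller:
    "protein" yields uppercase, "dna" yields lowercase.
    """
    seq = seq.strip()
    if modality == "protein":
        return seq.upper()
    if modality == "dna":
        return seq.lower()
    raise ValueError(f"unknown modality: {modality!r}")


print(format_for_baclm("mktayiakqrqisfvkshfsrq", "protein"))
print(format_for_baclm("ATGCTTAGCTAGCTTACG", "dna"))
```

The caller has to know the modality up front: the short sequence "ACGT" is valid as both a peptide and a DNA fragment, so it cannot be inferred reliably from the characters alone.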

Intended Uses

This model is intended for:

  • extracting contextual sequence embeddings
  • pretraining and transfer learning for bacterial genomics
  • downstream evaluation on bacterial sequence tasks
  • masked-token prediction in bacterial protein or DNA sequences
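
To illustrate what character-level masking at a 15% rate looks like, the sketch below masks a protein sequence in pure Python. The `<mask>` string and the `mask_sequence` helper are hypothetical; the actual mask token comes from the model's tokenizer, and the training recipe may use a different replacement strategy than plain substitution.

```python
import random

MASK = "<mask>"  # hypothetical placeholder; the real mask token comes from the tokenizer


def mask_sequence(seq: str, mask_rate: float = 0.15, seed: int = 0):
    """Mask roughly `mask_rate` of the characters in `seq`.

    Returns the masked token list and a dict mapping each masked position
    to its original character (the prediction targets).
    """
    rng = random.Random(seed)
    tokens = list(seq)  # character-level: one token per residue or nucleotide
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, labels


masked, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(" ".join(masked))
print(labels)
```

A masked language model is then asked to recover the characters stored in `labels` from the surrounding unmasked context.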

Usage

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "macwiatrak/baclm-350m-masked"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, dtype=torch.bfloat16)
model.eval().cuda()

seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",   # protein: uppercase
    "atgcttagctagcttacg",       # DNA: lowercase
]

batch = tokenizer(
    seqs,
    padding=True,
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
batch = {k: v.cuda() for k, v in batch.items()}

with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        token_type_ids=batch.get("token_type_ids"),
        attention_mask=batch.get("attention_mask"),
    )

# Token-level embeddings
token_embeddings = outputs.last_hidden_state

# Mean pooled embeddings
attention_mask = batch["attention_mask"].unsqueeze(-1)
mean_embeddings = (token_embeddings * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp_min(1)
print(mean_embeddings.shape)
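
With truncation=True, anything beyond the 2048-token context is dropped. One possible workaround, sketched below, is to split long sequences into windows that fit the context and embed each window separately. The `chunk_sequence` helper is hypothetical, and reserving two positions for special tokens is an assumption; check the tokenizer to see how many it actually adds.

```python
def chunk_sequence(seq: str, max_tokens: int = 2048,
                   num_special_tokens: int = 2, overlap: int = 0) -> list[str]:
    """Split `seq` into windows that fit the model context.

    Assumes one token per character (character-level tokenization) and
    reserves `num_special_tokens` positions per window (an assumption).
    """
    window = max_tokens - num_special_tokens
    if window <= 0:
        raise ValueError("max_tokens must exceed num_special_tokens")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if not seq:
        return []
    step = window - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [seq[i:i + window] for i in range(0, max(len(seq) - overlap, 1), step)]


chunks = chunk_sequence("a" * 5000)
print([len(c) for c in chunks])
```

How to combine the per-window embeddings (for example, a length-weighted average) is a design choice the model card does not specify.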

Training Data

BacLM was trained on large-scale bacterial sequence data comprising protein sequences derived from coding regions and intergenic DNA sequences.

Limitations

  • The model is intended for bacterial sequences, not general eukaryotic genomics.
  • It operates at the character level, so masking and prediction act on individual residues or nucleotides rather than higher-level biological units.
  • Protein and DNA inputs should follow the expected casing convention for reliable modality handling.

Citation

TBD