This is the repository for MutBERT, a model pretrained with mutation data from the human genome.

Introduction

This is the official pre-trained model introduced in MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models.

We sincerely appreciate the Tochka-AI team for their ruRoPEBert implementation, which serves as the basis for MutBERT's development.

MutBERT is a transformer-based genome foundation model trained only on the human genome.

Model Source

Usage

Load tokenizer and model

from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

The default attention implementation is "sdpa" (PyTorch scaled dot-product attention, which can dispatch to flash attention). If you want to use basic attention, you can replace it with "eager". See the Transformers documentation on attention implementations for details.
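
For example, a minimal sketch of switching to the basic implementation when loading the model (the attn_implementation argument is the generic Transformers mechanism; whether the remote MutBERT code honors it is an assumption here):

from transformers import AutoModel

model_name = "JadenLong/MutBERT"
# Request the basic ("eager") attention instead of the default "sdpa".
model = AutoModel.from_pretrained(model_name,
                                  trust_remote_code=True,
                                  attn_implementation="eager")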

Get embeddings

import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]

mut_inputs = F.one_hot(inputs, num_classes=len(tokenizer)).float().to("cpu")  # len(tokenizer) is vocab size
last_hidden_state = model(inputs).last_hidden_state   # [1, sequence_length, 768]
# or: last_hidden_state = model(mut_inputs)[0]        # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(last_hidden_state[0], dim=0)
print(embedding_mean.shape) # expect to be 768

# embedding with max pooling
embedding_max = torch.max(last_hidden_state[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768
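
Because the model also accepts a float tensor of per-position probabilities over the vocabulary (mut_inputs above is just the one-hot special case), known variants can be encoded as soft distributions. The sketch below is illustrative only: it assumes the vocabulary contains single-nucleotide tokens "A" and "G" and that position 3 of the tokenized sequence is the site of interest.

import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
probs = F.one_hot(inputs, num_classes=len(tokenizer)).float()

# Illustrative assumption: treat position 3 as a SNP observed as A 70% and G 30% of the time.
pos = 3
a_id = tokenizer.convert_tokens_to_ids("A")
g_id = tokenizer.convert_tokens_to_ids("G")
probs[0, pos, :] = 0.0
probs[0, pos, a_id] = 0.7
probs[0, pos, g_id] = 0.3

soft_hidden = model(probs)[0]  # [1, sequence_length, 768]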

Using as a Classifier

from transformers import AutoModelForSequenceClassification

model_name = "JadenLong/MutBERT"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, num_labels=2)
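
A minimal sketch of running the classification head on a sequence; the logits are only meaningful after fine-tuning on your own labels, and num_labels=2 plus the .logits attribute of the output are assumptions following standard Transformers conventions:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           trust_remote_code=True,
                                                           num_labels=2)

dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(inputs).logits  # [1, 2]
predicted_class = logits.argmax(dim=-1).item()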

With RoPE scaling

Allowed types for RoPE scaling are linear and dynamic. To extend the model's context window, you need to add the rope_scaling parameter.

If you want to scale your model context by 2x:

model = AutoModel.from_pretrained(model_name,
                                  trust_remote_code=True,
                                  rope_scaling={'type': 'dynamic','factor': 2.0}
                                  ) # 2.0 for x2 scaling, 4.0 for x4, etc.
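
The same pattern should work for the linear type; a sketch under the same assumptions, where only the rope_scaling dictionary changes:

model = AutoModel.from_pretrained(model_name,
                                  trust_remote_code=True,
                                  rope_scaling={'type': 'linear', 'factor': 4.0}
                                  ) # linear position interpolation for a 4x longer context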