
buetnlpbio/bidna-bert was trained only on the Human Genome DNA dataset for 1 epoch (as an ablation). Performance on other DNA types may be limited.

Model Card for buetnlpbio/bidna-bert

BiRNA-BERT is a BERT-style transformer encoder model that generates embeddings for RNA sequences. BiRNA-BERT has been trained on BPE tokens and individual nucleotides. As a result, it can generate both granular nucleotide-level embeddings and efficient sequence-level embeddings (using BPE).

BiRNA-BERT was trained using the MosaicBERT framework (https://huggingface.co/mosaicml/mosaic-bert-base).
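
As a quick illustration of the dual tokenization described above, here is a minimal sketch (not part of the official usage; it assumes the tokenizer behaves like a standard Hugging Face tokenizer): a contiguous sequence is split into BPE tokens, while a space-separated sequence is split into individual nucleotides.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/bidna-tokenizer")

# Contiguous input -> BPE tokens (shorter token sequence)
print(tokenizer.tokenize("AGCTACGTACGT"))

# Space-separated input -> one token per nucleotide
print(tokenizer.tokenize("A G C T A C G T A C G T"))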

Usage

Extracting DNA embeddings

import torch
import transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/bidna-tokenizer")

config = transformers.BertConfig.from_pretrained("buetnlpbio/bidna-bert")
mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/bidna-bert", config=config, trust_remote_code=True)
mysterybert.cls = torch.nn.Identity()  # replace the MLM head so the model returns per-token hidden states

# To get sequence embeddings
seq_embed = mysterybert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
print(seq_embed.logits.shape) # CLS + 4 BPE token embeddings + SEP

# To get nucleotide embeddings
char_embed = mysterybert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt")) 
print(char_embed.logits.shape) # CLS + 12 nucleotide token embeddings + SEP
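
To reduce the per-token hidden states above to a single fixed-length vector per sequence, one option is attention-mask-aware mean pooling. The following is a minimal sketch, not prescribed by the authors; it assumes the model and tokenizer loaded above, and that logits holds per-token hidden states once the MLM head has been replaced with Identity.

import torch

with torch.no_grad():
    inputs = tokenizer("AGCTACGTACGT", return_tensors="pt")
    hidden = mysterybert(**inputs).logits                   # (1, num_tokens, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)           # (1, num_tokens, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, hidden_dim)

print(pooled.shape)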