---
library_name: transformers
---

<span style="color:red">`buetnlpbio/bidna-bert` was trained only on a human genome DNA dataset for 1 epoch (as an ablation). Performance on other DNA types may be limited.</span>

# Model Card for `buetnlpbio/bidna-bert`

BiRNA-BERT is a BERT-style transformer encoder that generates embeddings for RNA sequences. It was trained on both BPE tokens and individual nucleotides, so it can produce granular nucleotide-level embeddings as well as efficient sequence-level embeddings (using BPE).

BiRNA-BERT was trained using the MosaicBERT framework: https://huggingface.co/mosaicml/mosaic-bert-base
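
The choice between the two granularities is made by how the input string is written: in this card's usage example, a contiguous sequence is tokenized with BPE, while space-separated nucleotides yield one token per nucleotide. The helper below is a minimal sketch of that formatting step. `to_nucleotide_input` and `expected_nucleotide_tokens` are hypothetical names, not part of the released code, and the token count assumes the tokenizer adds exactly one CLS and one SEP token, as the comments in the usage example indicate.

```python
def to_nucleotide_input(sequence: str) -> str:
    """Return the sequence with one space between nucleotides, e.g. "AGCT" -> "A G C T"."""
    # Drop any existing whitespace first so already-spaced input is not double-spaced.
    compact = "".join(sequence.split())
    return " ".join(compact)


def expected_nucleotide_tokens(sequence: str) -> int:
    """Assumed token count for nucleotide-level input: one per nucleotide plus CLS and SEP."""
    compact = "".join(sequence.split())
    return len(compact) + 2


print(to_nucleotide_input("AGCTACGTACGT"))         # A G C T A C G T A C G T
print(expected_nucleotide_tokens("AGCTACGTACGT"))  # 14
```

Passing the raw string to the tokenizer keeps the shorter BPE tokenization; passing the output of `to_nucleotide_input` switches to per-nucleotide tokens.
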

# Usage

## Extracting RNA embeddings

```python
import torch
import transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/bidna-tokenizer")

config = transformers.BertConfig.from_pretrained("buetnlpbio/bidna-bert")
mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/bidna-bert", config=config, trust_remote_code=True)
# Replace the masked-LM head with an identity so that `.logits` holds the encoder's token embeddings
mysterybert.cls = torch.nn.Identity()

# To get sequence embeddings (contiguous input is tokenized with BPE)
seq_embed = mysterybert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
print(seq_embed.logits.shape)  # CLS + 4 BPE token embeddings + SEP

# To get nucleotide embeddings (space-separated input gives one token per nucleotide)
char_embed = mysterybert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt"))
print(char_embed.logits.shape)  # CLS + 12 nucleotide token embeddings + SEP
```
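
The calls above return one embedding per token. If a single fixed-size vector per sequence is needed, one common option is to average the token embeddings while ignoring padded positions. This card does not specify a pooling strategy, so the sketch below is an assumption rather than the authors' method. `masked_mean_pool` is a hypothetical helper, and the random tensors only stand in for model outputs (the hidden size of 8 is arbitrary).

```python
import torch


def masked_mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence axis, skipping padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                         # avoid division by zero
    return summed / counts


# Stand-in tensors: 2 sequences, 6 token positions, hidden size 8 (arbitrary).
embeddings = torch.randn(2, 6, 8)
mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 0, 0]])
pooled = masked_mean_pool(embeddings, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

With real model outputs, the helper would take the `.logits` tensor and the tokenizer's `attention_mask`, assuming the output keeps the padded `(batch, seq_len, hidden)` layout. Note that the CLS and SEP positions are included in the average unless they are masked out as well.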