---
library_name: transformers
---
<span style="color:red">`buetnlpbio/bpe-only-rna-bert` is trained on BPE tokens only (for ablation). Please consider using `buetnlpbio/birna-bert` instead.
</span>
# Model Card for bpe-only-rna-bert
BiRNA-BERT is a BERT-style transformer encoder model that generates embeddings for RNA sequences. BiRNA-BERT has been trained on BPE tokens and individual nucleotides. As a result, it can generate both granular nucleotide-level embeddings and efficient sequence-level embeddings (using BPE).
BiRNA-BERT was trained using the MosaicBERT framework (https://huggingface.co/mosaicml/mosaic-bert-base).
# Usage
## Extracting RNA embeddings
```python
import torch
import transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the BiRNA tokenizer and the BPE-only model (custom code, hence trust_remote_code).
tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/birna-tokenizer")
config = transformers.BertConfig.from_pretrained("buetnlpbio/bpe-only-rna-bert")
mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/bpe-only-rna-bert", config=config, trust_remote_code=True)

# Replace the masked-LM head with an identity so that `logits` carries the encoder embeddings.
mysterybert.cls = torch.nn.Identity()

# To get sequence embeddings
seq_embed = mysterybert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
print(seq_embed.logits.shape)  # CLS + 4 BPE token embeddings + SEP
```
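The snippet above yields one embedding per token (CLS, the BPE tokens, and SEP). This card does not prescribe how to reduce these to a single vector per sequence. The sketch below shows one common option, masked mean pooling. The helper name `mean_pool` is hypothetical and not part of this model's API, and it assumes the embeddings arrive as a padded tensor of shape `(batch, seq_len, hidden)`.

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    Hypothetical helper, not part of the model's API.
    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    # Broadcast the mask over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    # Guard against division by zero for fully masked rows.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy tensors standing in for model output: 1 sequence, 3 positions, hidden size 2.
toy_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])  # the last position is padding
pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled.shape)
```

With the real model, the same helper could be applied to `seq_embed.logits` together with the `attention_mask` returned by the tokenizer, provided the output has the padded shape assumed here. Whether mean pooling or the CLS embedding works better is task-dependent and not stated in this card.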