---
library_name: transformers
---
<span style="color:red">`buetnlpbio/bpe-only-rna-bert` was trained on BPE tokens only, as an ablation. For general use, please consider `buetnlpbio/birna-bert` instead.</span>


# Model Card for bpe-only-rna-bert

BiRNA-BERT is a BERT-style transformer encoder that generates embeddings for RNA sequences. It was trained on both BPE tokens and individual nucleotides, so it can produce granular nucleotide-level embeddings as well as efficient sequence-level embeddings (using BPE). The checkpoint in this repository is an ablation variant trained on BPE tokens only.
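
To see the efficiency gain from BPE, you can inspect the tokenizer directly: per the usage example below, the 12-nucleotide sequence `AGCTACGTACGT` maps to just 4 BPE tokens rather than 12 single-base tokens. A quick illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/birna-tokenizer")
print(tokenizer.tokenize("AGCTACGTACGT"))  # 4 BPE tokens for 12 nucleotides
```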

BiRNA-BERT was trained using the [MosaicBERT](https://huggingface.co/mosaicml/mosaic-bert-base) framework.


# Usage
## Extracting RNA embeddings

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, BertConfig

tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/birna-tokenizer")

config = BertConfig.from_pretrained("buetnlpbio/bpe-only-rna-bert")
mysterybert = AutoModelForMaskedLM.from_pretrained(
    "buetnlpbio/bpe-only-rna-bert", config=config, trust_remote_code=True
)
# Replace the MLM head with an identity so the model returns raw hidden states
mysterybert.cls = torch.nn.Identity()

# To get per-token sequence embeddings
with torch.no_grad():
    seq_embed = mysterybert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
print(seq_embed.logits.shape)  # (1, 6, hidden_dim): CLS + 4 BPE token embeddings + SEP
```
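
If you need a single fixed-length vector per sequence, one common option is to mean-pool the token embeddings over the attention mask. This is a sketch, not part of the released API; it reuses `mysterybert` and `tokenizer` from the snippet above, and mean pooling is an assumed choice rather than the authors' prescribed method:

```python
# Sketch: mean-pool token embeddings into one fixed-size vector per sequence.
inputs = tokenizer("AGCTACGTACGT", return_tensors="pt")
with torch.no_grad():
    hidden = mysterybert(**inputs).logits              # (1, num_tokens, hidden_dim)

mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, num_tokens, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden_dim)
print(pooled.shape)
```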