---
license: other
pipeline_tag: token-classification
tags:
- biology
- RNA
- Torsional
- Angles
---

# `RNA-TorsionBERT`

## Model Description

`RNA-TorsionBERT` is a 331 MB BERT-based language model that predicts RNA torsional and pseudo-torsional angles from the sequence.

`RNA-TorsionBERT` is a DNABERT model that was pre-trained on ~4200 RNA structures before being fine-tuned on 185 non-redundant structures.

It improves the mean absolute error (MAE) by 6.2°, on average across the nine angles, over the previous state-of-the-art model, SPOT-RNA-1D, on the Test Set (composed of RNA-Puzzles and CASP-RNA).

| Model (MAE in °)    | alpha | beta | gamma | delta | epsilon | zeta | chi  | eta  | theta |
|---------------------|-------|------|-------|-------|---------|------|------|------|-------|
| **RNA-TorsionBERT** | 37.3  | 19.6 | 29.4  | 13.6  | 16.6    | 26.6 | 14.7 | 20.1 | 25.4  |
| SPOT-RNA-1D         | 45.7  | 23   | 33.6  | 19    | 21.1    | 34.4 | 19.3 | 28.9 | 33.9  |
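
The 6.2° figure quoted earlier is consistent with the unweighted mean of the per-angle differences in this table. A quick check, assuming a plain average over the nine angles (the variable names are illustrative and not part of the model's code):

```python
# MAE values (in degrees) copied from the table above
torsionbert = [37.3, 19.6, 29.4, 13.6, 16.6, 26.6, 14.7, 20.1, 25.4]
spot_rna_1d = [45.7, 23, 33.6, 19, 21.1, 34.4, 19.3, 28.9, 33.9]

# Per-angle improvement of RNA-TorsionBERT over SPOT-RNA-1D
gains = [spot - ours for spot, ours in zip(spot_rna_1d, torsionbert)]
mean_gain = sum(gains) / len(gains)
print(f"{mean_gain:.1f}")  # 6.2
```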

**Key Features**

* Predicts torsional and pseudo-torsional angles
* Handles sequences of up to 512 nucleotides

## Usage

Get started predicting angles with `RNA-TorsionBERT` by using the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers run the custom code shipped in the model repository
tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)

# Input written as space-separated overlapping 3-mers (here, those of ACGGTT)
sequence = "ACG CGG GGT GTT"
params_tokenizer = {
    "return_tensors": "pt",
    "padding": "max_length",
    "max_length": 512,
    "truncation": True,
}
inputs = tokenizer(sequence, **params_tokenizer)
# Raw predictions: sine and cosine values for each angle at every token position
output = model(inputs)["logits"]
```

- Note that the model was fine-tuned from a DNABERT-3 model, so it uses the same tokenizer as DNABERT. Nucleotide `U` should therefore be replaced by `T` in the input sequence.
- The output contains the sine and the cosine of each angle. The angles are in the following order: `alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`, `chi`, `eta`, `theta`.
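
The first note implies a small preprocessing step before tokenization. As a minimal sketch, assuming the tokenizer expects the space-separated overlapping 3-mers seen in the usage example (the helper `to_kmer_input` is hypothetical and not part of the model repository), an RNA string could be prepared like this:

```python
def to_kmer_input(rna_sequence: str, k: int = 3) -> str:
    """Replace U by T and split the sequence into space-separated overlapping k-mers.

    Hypothetical helper: the 3-mer layout is inferred from the example input
    "ACG CGG GGT GTT", not taken from the model's own code.
    """
    dna_like = rna_sequence.upper().replace("U", "T")
    kmers = [dna_like[i : i + k] for i in range(len(dna_like) - k + 1)]
    return " ".join(kmers)


print(to_kmer_input("ACGGUU"))  # ACG CGG GGT GTT
```

The result reproduces the example input used in the usage snippet, which suggests the assumption about the 3-mer layout is reasonable.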

To convert the predictions into angles, you can use the following code snippet:

```python
from typing import Optional

import numpy as np

ANGLES_ORDER = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta"]

# Token IDs treated as special tokens and masked out below
SPECIAL_TOKEN_IDS = [0, 1, 2, 3, 4]


def convert_sin_cos_to_angles(
    output: np.ndarray, input_ids: Optional[np.ndarray] = None
) -> np.ndarray:
    """
    Convert the raw predictions of RNA-TorsionBERT into angles in degrees.
    Each angle is recovered from its sine and cosine using:
        alpha = arctan2(sin(alpha), cos(alpha))
    :param output: np.ndarray of shape (batch, length, 2 * n_angles) with the raw predictions
    :param input_ids: the input_ids given to RNA-TorsionBERT. When provided, positions
        holding special tokens are set to NaN so that only the sequence is kept.
    :return: a np.ndarray with the angles for the sequence
    """
    if input_ids is not None:
        # Mask the special tokens (note: this modifies `output` in place)
        output[np.isin(input_ids, SPECIAL_TOKEN_IDS)] = np.nan
    # Cosines sit at the even indexes of the last axis, sines at the odd indexes
    even_indexes = np.arange(0, output.shape[-1], 2)
    odd_indexes = np.arange(1, output.shape[-1], 2)
    sin, cos = output[:, :, odd_indexes], output[:, :, even_indexes]
    # arctan2 keeps the correct quadrant, giving angles between -180° and 180°
    return np.degrees(np.arctan2(sin, cos))


output = output.cpu().detach().numpy()
input_ids = inputs["input_ids"].cpu().detach().numpy()
real_angles = convert_sin_cos_to_angles(output, input_ids)
```
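
The conversion relies on `np.arctan2` rather than a plain arctangent of the ratio, which matters for angles outside the range -90° to 90°. The sketch below illustrates this on synthetic values, assuming the interleaved (cos, sin) layout used by the snippet above. The angle values and the `fake_output` array are made up for illustration and are not model predictions.

```python
import numpy as np

ANGLES_ORDER = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta"]

# Synthetic "true" angles in degrees for a single nucleotide (illustrative values only)
true_angles = np.array([-70.0, 175.0, 55.0, 80.0, -150.0, -65.0, -160.0, 170.0, -140.0])
radians = np.radians(true_angles)

# Interleave as (cos, sin) pairs, the layout assumed by the conversion snippet
fake_output = np.empty((1, 1, 2 * len(ANGLES_ORDER)))
fake_output[0, 0, 0::2] = np.cos(radians)
fake_output[0, 0, 1::2] = np.sin(radians)

cos, sin = fake_output[..., 0::2], fake_output[..., 1::2]
with_arctan2 = np.degrees(np.arctan2(sin, cos))  # quadrant-aware
with_arctan = np.degrees(np.arctan(sin / cos))   # loses the quadrant

# Label each decoded angle with its name
decoded = dict(zip(ANGLES_ORDER, with_arctan2[0, 0]))
print(round(decoded["beta"], 1))       # 175.0
print(round(with_arctan[0, 0, 1], 1))  # -5.0
```

Only the `arctan2` variant recovers 175° for `beta`; the plain arctangent folds it to -5°.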