ZhiyuanChen committed
Update README.md

README.md CHANGED
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
  - example_title: "HIV-1"
    text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
    output:
      - label: "G"
        score: 0.2066272348165512
      - label: "U"
        score: 0.1811930239200592
      - label: "A"
        score: 0.17954225838184357
      - label: "-"
        score: 0.12186982482671738
      - label: "."
        score: 0.10200861096382141
  - example_title: "microRNA-21"
    text: "UAGC<mask>UAUCAGACUGAUGUUGA"
    output:
### Variations

- **[`multimolecule/ernierna`](https://huggingface.co/multimolecule/ernierna)**: The ERNIE-RNA model pre-trained on non-coding RNA sequences.
- **[`multimolecule/ernierna-ss`](https://huggingface.co/multimolecule/ernierna-ss)**: The ERNIE-RNA model fine-tuned on RNA secondary structure prediction.

### Model Specification
- **Paper**: [ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations](https://doi.org/10.1101/2024.03.17.585376)
- **Developed by**: Weijie Yin, Zhaoyu Zhang, Liang He, Rui Jiang, Shuo Zhang, Gan Liu, Xuegong Zhang, Tao Qin, Zhen Xie
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ERNIE](https://huggingface.co/nghuyong/ernie-3.0-base-zh)
- **Original Repository**: [Bruce-ywj/ERNIE-RNA](https://github.com/Bruce-ywj/ERNIE-RNA)

## Usage
You can use this model directly with a pipeline for masked language modeling:

```python
>>> import multimolecule  # you must import multimolecule to register models
>>> from transformers import pipeline

>>> unmasker = pipeline("fill-mask", model="multimolecule/ernierna-ss")
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
[{'score': 0.2066272348165512,
  'token': 8,
  'token_str': 'G',
  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.1811930239200592,
  'token': 9,
  'token_str': 'U',
  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.17954225838184357,
  'token': 6,
  'token_str': 'A',
  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.12186982482671738,
  'token': 24,
  'token_str': '-',
  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.10200861096382141,
  'token': 21,
  'token_str': '.',
  'sequence': 'G G U C . C U C U G G U U A G A C C A G A U C U G A G C C U'}]
```
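The `token` ids in the output are indices into the tokenizer's vocabulary. As an illustrative aside in plain Python (the ids below are simply those observed in this model card's outputs; the authoritative mapping lives in `RnaTokenizer`, not in this hand-written dict), they can be mapped back to token strings:

```python
# Illustrative sketch: map fill-mask `token` ids back to token strings.
# Ids taken from the pipeline outputs shown in this model card; the
# authoritative mapping is RnaTokenizer's vocabulary.
OBSERVED_IDS = {6: "A", 7: "C", 8: "G", 9: "U", 21: ".", 24: "-"}


def describe(predictions):
    """Return (token_str, rounded score) pairs for fill-mask predictions."""
    return [(OBSERVED_IDS.get(p["token"], "?"), round(p["score"], 4))
            for p in predictions]


preds = [{"score": 0.2066272348165512, "token": 8},
         {"score": 0.1811930239200592, "token": 9}]
print(describe(preds))  # [('G', 0.2066), ('U', 0.1812)]
```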

### Downstream Use
Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import RnaTokenizer, ErnieRnaModel


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna-ss")
model = ErnieRnaModel.from_pretrained("multimolecule/ernierna-ss")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")

output = model(**input)
```
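A common next step is to pool the per-token hidden states in `output` into a single sequence-level embedding. The pooling step itself, sketched in plain Python on toy numbers (mean pooling is one common choice, not something this model mandates):

```python
# Mean-pool per-token hidden states into one sequence-level embedding.
# Toy 3-token, 2-dimensional example; real hidden states come from the model.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def mean_pool(states):
    """Average a list of per-token vectors into a single vector."""
    dim = len(states[0])
    return [sum(token[d] for token in states) / len(states) for d in range(dim)]


print(mean_pool(hidden_states))  # [3.0, 4.0]
```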
```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna-ss")
model = ErnieRnaForSequencePrediction.from_pretrained("multimolecule/ernierna-ss")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)
```

#### Token Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna-ss")
model = ErnieRnaForTokenPrediction.from_pretrained("multimolecule/ernierna-ss")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
```
```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForContactPrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna-ss")
model = ErnieRnaForContactPrediction.from_pretrained("multimolecule/ernierna-ss")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)
```