ZhiyuanChen committed
Commit 04eb149
1 Parent(s): 2ec72e6

Update README.md

Files changed (1)
  1. README.md +46 -33
README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
+- example_title: "HIV-1"
+  text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
+  output:
+  - label: "*"
+    score: 0.07707168161869049
+  - label: "<null>"
+    score: 0.07588472962379456
+  - label: "U"
+    score: 0.07178673148155212
+  - label: "N"
+    score: 0.06414645165205002
+  - label: "Y"
+    score: 0.06385370343923569
 - example_title: "microRNA-21"
   text: "UAGC<mask>UAUCAGACUGAUGUUGA"
   output:
@@ -63,8 +76,8 @@ UTR-LM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style m
 
 ### Variations
 
-- **[`multimolecule/utrlm.te_el`](https://huggingface.co/multimolecule/utrlm.te_el)**: The UTR-LM model for Translation Efficiency of transcripts and mRNA Expression Level.
-- **[`multimolecule/utrlm.mrl`](https://huggingface.co/multimolecule/utrlm.mrl)**: The UTR-LM model for Mean Ribosome Loading.
+- **[`multimolecule/utrlm-te_el`](https://huggingface.co/multimolecule/utrlm-te_el)**: The UTR-LM model for Translation Efficiency of transcripts and mRNA Expression Level.
+- **[`multimolecule/utrlm-mrl`](https://huggingface.co/multimolecule/utrlm-mrl)**: The UTR-LM model for Mean Ribosome Loading.
 
 ### Model Specification
 
@@ -110,7 +123,7 @@ UTR-LM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style m
 - **Paper**: [A 5’ UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions](http://doi.org/10.1038/s41467-021-24436-7)
 - **Developed by**: Yanyi Chu, Dan Yu, Yupeng Li, Kaixuan Huang, Yue Shen, Le Cong, Jason Zhang, Mengdi Wang
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ESM](https://huggingface.co/facebook/esm2_t48_15B_UR50D)
-- **Original Repository**: [https://github.com/a96123155/UTR-LM](https://github.com/a96123155/UTR-LM)
+- **Original Repository**: [a96123155/UTR-LM](https://github.com/a96123155/UTR-LM)
 
 ## Usage
 
@@ -127,29 +140,29 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> import multimolecule  # you must import multimolecule to register models
 >>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='multimolecule/utrlm.te_el')
->>> unmasker("uagc<mask>uaucagacugauguuga")
+>>> unmasker = pipeline("fill-mask", model="multimolecule/utrlm-te_el")
+>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
 
-[{'score': 0.08083827048540115,
+[{'score': 0.07707168161869049,
   'token': 23,
   'token_str': '*',
-  'sequence': 'U A G C * U A U C A G A C U G A U G U U G A'},
- {'score': 0.07966958731412888,
+  'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.07588472962379456,
   'token': 5,
   'token_str': '<null>',
-  'sequence': 'U A G C U A U C A G A C U G A U G U U G A'},
- {'score': 0.0771222859621048,
-  'token': 6,
-  'token_str': 'A',
-  'sequence': 'U A G C A U A U C A G A C U G A U G U U G A'},
- {'score': 0.06853719055652618,
+  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.07178673148155212,
+  'token': 9,
+  'token_str': 'U',
+  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.06414645165205002,
   'token': 10,
   'token_str': 'N',
-  'sequence': 'U A G C N U A U C A G A C U G A U G U U G A'},
- {'score': 0.06666938215494156,
-  'token': 21,
-  'token_str': '.',
-  'sequence': 'U A G C . U A U C A G A C U G A U G U U G A'}]
+  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.06385370343923569,
+  'token': 12,
+  'token_str': 'Y',
+  'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'}]
 ```
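The `# you must import multimolecule to register models` comment above is the key detail of this example: importing the package registers the UTR-LM classes with Hugging Face Transformers, after which generic machinery such as `pipeline` can resolve the checkpoint. A minimal sketch of what that enables, assuming the registration also covers the Auto classes (model id taken from this diff):

```python
import multimolecule  # side effect: registers UtrLm* models and RnaTokenizer
from transformers import AutoModelForMaskedLM, AutoTokenizer

# With the registration in place, the Auto classes can resolve the custom
# architecture named in the checkpoint's config.
tokenizer = AutoTokenizer.from_pretrained("multimolecule/utrlm-te_el")
model = AutoModelForMaskedLM.from_pretrained("multimolecule/utrlm-te_el")
```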
 
 ### Downstream Use
@@ -162,11 +175,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
 from multimolecule import RnaTokenizer, UtrLmModel
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrlm.te_el')
-model = UtrLmModel.from_pretrained('multimolecule/utrlm.te_el')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
+model = UtrLmModel.from_pretrained("multimolecule/utrlm-te_el")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 
 output = model(**input)
 ```
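The `output` in the snippet above follows the usual Transformers convention, so per-nucleotide embeddings can be read from the hidden states. A minimal sketch, assuming the standard `last_hidden_state` attribute:

```python
import torch

from multimolecule import RnaTokenizer, UtrLmModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
model = UtrLmModel.from_pretrained("multimolecule/utrlm-te_el")

input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    output = model(**input)

# (batch, sequence_length, hidden_size); positions include special tokens
embeddings = output.last_hidden_state
# One common reduction to a single per-sequence vector: mean over positions.
sequence_embedding = embeddings.mean(dim=1)
```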
@@ -182,17 +195,17 @@ import torch
 from multimolecule import RnaTokenizer, UtrLmForSequencePrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrlm.te_el')
-model = UtrLmForSequencePrediction.from_pretrained('multimolecule/utrlm.te_el')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
+model = UtrLmForSequencePrediction.from_pretrained("multimolecule/utrlm-te_el")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 label = torch.tensor([1])
 
 output = model(**input, labels=label)
 ```
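Passing `labels` makes the prediction head return a loss alongside the logits, so fine-tuning reduces to an ordinary PyTorch training step. A minimal sketch, assuming `output.loss` follows the usual Transformers convention:

```python
import torch

from multimolecule import RnaTokenizer, UtrLmForSequencePrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
model = UtrLmForSequencePrediction.from_pretrained("multimolecule/utrlm-te_el")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)  # loss is computed because labels are given
output.loss.backward()
optimizer.step()
optimizer.zero_grad()
```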
 
-#### Nucleotide Classification / Regression
+#### Token Classification / Regression
 
 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
 
@@ -200,14 +213,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
 
 ```python
 import torch
-from multimolecule import RnaTokenizer, UtrLmForNucleotidePrediction
+from multimolecule import RnaTokenizer, UtrLmForTokenPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrlm.te_el')
-model = UtrLmForNucleotidePrediction.from_pretrained('multimolecule/utrlm.te_el')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
+model = UtrLmForTokenPrediction.from_pretrained("multimolecule/utrlm-te_el")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), ))
 
 output = model(**input, labels=label)
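One subtlety in the hunk above: `torch.randint(2, (len(text), ))` draws one label per nucleotide, but the tokenizer typically also adds special tokens, so in real fine-tuning the labels must be aligned with the tokenized positions (for example by ignoring the special-token positions in the loss). A quick check of the length mismatch, under the same model-id assumption as above:

```python
from multimolecule import RnaTokenizer

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")

# The tokenized length usually exceeds len(text) because of special tokens
# such as <cls> and <eos>; token-level labels must account for this.
print(len(text), input["input_ids"].shape[-1])
```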
@@ -224,11 +237,11 @@ import torch
 from multimolecule import RnaTokenizer, UtrLmForContactPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrlm')
-model = UtrLmForContactPrediction.from_pretrained('multimolecule/utrlm')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
+model = UtrLmForContactPrediction.from_pretrained("multimolecule/utrlm-te_el")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), len(text)))
 
 output = model(**input, labels=label)
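Contact prediction is pairwise, which is why the label in the last hunk is an L × L matrix rather than a vector: each entry states whether positions i and j are in contact. A minimal sketch of inspecting the pairwise scores, assuming the head exposes the usual `logits` attribute:

```python
import torch

from multimolecule import RnaTokenizer, UtrLmForContactPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrlm-te_el")
model = UtrLmForContactPrediction.from_pretrained("multimolecule/utrlm-te_el")

input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
with torch.no_grad():
    output = model(**input)

# One score per (i, j) pair of sequence positions.
print(output.logits.shape)
```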
 