---
license: apache-2.0
datasets:
- BAAI/OPI
language:
- en
pipeline_tag: text-generation
tags:
- biology
---
![OPI_logo](./OPI_logo.png)

## Model Card of OPI_full_Galactica-6.7B
This repo is part of the Open Protein Instructions (OPI) project, which aims to build and release a protein instruction dataset, and to explore and benchmark LLMs for protein modeling in protein biology.
![Overview](./Overview.png)

**Usage and License Notices:** [LLaMA](https://github.com/facebookresearch/llama) and [Galactica](https://github.com/paperswithcode/galai) are intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained with it should not be used outside of research purposes. The weight diff for [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) is also CC BY NC 4.0 (allowing only non-commercial use).

## OPI instruction tuning from the original Galactica-6.7B and LLaMA-7B models
For OPI instruction tuning, we adopt the training script of [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca).

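Since the Stanford Alpaca training script is adopted, each OPI record (an instruction, a protein-related input, and a target output) is presumably rendered into Alpaca's prompt template before tuning. A minimal sketch of that rendering, assuming the standard Alpaca template and the `instruction`/`input`/`output` JSON fields (field names are illustrative, not confirmed from the OPI data files):

```python
# Sketch of rendering one OPI instruction-tuning record into the
# Stanford Alpaca prompt template (template and field names assumed).

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

def build_prompt(record: dict) -> str:
    """Render an instruction/input record into a training prompt."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )

example = {
    "instruction": "What is the EC number of the input sequence?",
    "input": "MSLLAYTNLLLQNGRIF",  # truncated protein sequence, for illustration
    "output": "2.7.11.1",
}
prompt = build_prompt(example)
```

During training, the target (`output`, here the EC number) would be appended after `### Response:` as the text the model learns to generate.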
### 1. Galactica instruction tuning with OPI

[Example: train_keywords.sh](./train_galai/train_keywords.sh)
```bash
#!/bin/bash

OMP_NUM_THREADS=1 torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=3 train_galai/train.py \
--model_name_or_path path/to/galactica_base_model/galactica-$3 \
--data_path ./OPI_DATA/AP/Keywords/train/keywords_train.json \
--bf16 True \
--output_dir path/to/output/galai_ft_opi/galai_ft_keywords_$3_e$4 \
--num_train_epochs $4 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_offload_opt_param.json" \
--tf32 True
```

In the shell script above, set ```model_name_or_path``` to your own local LLM weights path or a Hugging Face model entry (e.g., *facebook/galactica-6.7b*), and set ```output_dir``` to your own path for saving the training results.

To start training, run:
```bash
bash train_galai/train_keywords.sh 1 0 6.7b 3
```

Explanation of the bash arguments:
```
1: nnodes
0: node_rank
6.7b: model size of Galactica
3: total training epochs
```

### 2. LLaMA instruction tuning with OPI

[Example: train_EC_number.sh](./train_llama/train_EC_number.sh)
```bash
#!/bin/bash

OMP_NUM_THREADS=1 torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=3 train_llama/train.py \
--model_name_or_path path/to/llama_base_model/hf_version/llama-$3 \
--data_path ./OPI_DATA/SU/EC_number/train/CLEAN_EC_number_train.json \
--bf16 True \
--output_dir path/to/output/llama_ft_CLEAN_EC_number_$3_e$4 \
--num_train_epochs $4 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_offload_opt_param.json" \
--tf32 True
```

In the shell script above, set ```model_name_or_path``` to your own local LLM weights path or a Hugging Face model entry (e.g., *decapoda-research/llama-7b-hf*), and set ```output_dir``` to your own path for saving the training results.

To start training, run:
```bash
bash train_llama/train_EC_number.sh 1 0 7b 3
```

Explanation of the bash arguments:
```
1: nnodes
0: node_rank
7b: model size of LLaMA
3: total training epochs
```

**Note**: For training, we follow the suggestion from [tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca) for addressing out-of-memory issues: DeepSpeed ZeRO stage 3 with offload.

### 3. Convert DeepSpeed-format weights
Once instruction tuning is finished, the DeepSpeed-format weights should be converted to **pytorch_model.bin** using the following script:
```bash
cd output_dir
python zero_to_fp32.py . pytorch_model.bin
```

### 4. Split pytorch_model.bin into chunks to speed up loading for inference
After step 3, you will have the **pytorch_model.bin** file. You can optionally split it into smaller chunks (e.g., pytorch_model-00001-of-00004.bin, pytorch_model-00002-of-00004.bin, pytorch_model-00003-of-00004.bin, pytorch_model-00004-of-00004.bin) to speed up loading during inference. To split it, run:
```bash
cd model_split
python model_split.py --model_idx OPI-instruction-tuned-model-name
```
You will then get a checkpoint folder suffixed with "**chunked**", which you can use as the **pretrained model path** for the later evaluation jobs.

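The core idea behind such chunking (the actual `model_split.py` is not reproduced here) can be sketched as: greedily assign checkpoint tensors to shards under a size budget, and record a weight-name-to-shard-file index following the `pytorch_model-0000X-of-0000N.bin` naming convention. A hedged sketch with illustrative parameter names and byte sizes:

```python
# Sketch of checkpoint-sharding logic (illustrative only; the real
# model_split.py may differ): group parameters into shards no larger
# than a byte budget and build a name -> shard-file index.

def plan_shards(param_sizes: dict, max_shard_bytes: int):
    """Greedily pack parameters into shards; return (shards, index)."""
    shards, current, current_bytes = [], [], 0
    for name, size in param_sizes.items():
        # Start a new shard when adding this tensor would exceed the budget.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    total = len(shards)
    index = {
        name: f"pytorch_model-{i + 1:05d}-of-{total:05d}.bin"
        for i, shard in enumerate(shards)
        for name in shard
    }
    return shards, index

# Hypothetical parameter sizes in bytes.
sizes = {"embed.weight": 600, "layer0.weight": 500,
         "layer1.weight": 500, "head.weight": 300}
shards, index = plan_shards(sizes, max_shard_bytes=1000)
```

Each shard would then be saved as its own file, and the index written alongside so a loader can fetch only the shard containing a given weight.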
### 5. How to access the OPI-instruction-tuned Galactica-6.7B model
In this repo, we release the OPI_full_Galactica-6.7B model, which is fine-tuned on the full OPI dataset and can be accessed on [HuggingFace](https://huggingface.co/BAAI/OPI_full_Galactica-6.7B). Please feel free to contact us if you have any questions.

## Nine Evaluation Tasks

For benchmarking, we design 3 types of evaluation tasks, each of which contains 3 specific ones, as shown in the following table.

| Task Type | Abbreviation | Task Name |
| :--------------------: | :----------: | :-----------------------------------------: |
| Sequence Understanding | SU | EC Number Prediction |
| Sequence Understanding | SU | Fold Type Prediction |
| Sequence Understanding | SU | Subcellular Localization Prediction |
| Annotation Prediction | AP | Function Keywords Prediction |
| Annotation Prediction | AP | Gene Ontology (GO) Terms Prediction |
| Annotation Prediction | AP | Function Description Prediction |
| Knowledge Mining | KM | Tissue Location Prediction from Gene Symbol |
| Knowledge Mining | KM | Cancer Prediction from Gene Symbol |
| Knowledge Mining | KM | Cancer Prediction from Gene Name |

## Evaluating various models with OPI data
### 1. Environment setup
```bash
pip install -r requirements.txt
```

For evaluation, we adapt the inference script from [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca).

### 2. Evaluation of Galactica
We evaluate the OPI-instruction-tuned Galactica-6.7B model and the original Galactica-6.7B model.

**For the OPI-instruction-tuned Galactica-6.7B model, please use the following script:**
```bash
cd eval_galai
python eval_galai.py --model_idx OPI-instruction-tuned-model-name --output_dir ./eval_galai_output --gpus=0
```
In the commands above, ```model_idx``` is a model index you assign to your local LLM weights so that a model can be easily referenced at inference time; it is set up in the [model_dict](eval_galai/eval_galai.py#L74) in [eval_galai.py](eval_galai/eval_galai.py#L74). ```output_dir``` is where the evaluation results are saved.

**For the original Galactica-6.7B model, please use the following script:**
```bash
cd eval_galai/infer_with_original_galai
bash galactica_infer.sh
```

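The model_dict pattern mentioned above amounts to a small lookup from a short index to a local checkpoint path. An illustrative sketch (the keys and paths here are hypothetical; the real mapping lives in eval_galai/eval_galai.py):

```python
# Hypothetical sketch of the model_dict lookup used by the eval scripts;
# keys and paths are placeholders, not the repo's actual entries.

model_dict = {
    "OPI-full-Galactica-6.7b": "path/to/output/galai_ft_opi/galai_ft_full_6.7b_e3_chunked",
    "galactica-6.7b-original": "path/to/galactica_base_model/galactica-6.7b",
}

def resolve_checkpoint(model_idx: str) -> str:
    """Map a short --model_idx value to its local checkpoint path."""
    if model_idx not in model_dict:
        raise KeyError(
            f"Unknown model_idx {model_idx!r}; known: {sorted(model_dict)}"
        )
    return model_dict[model_idx]
```

Adding a new model to evaluate then only requires adding one entry to the dict rather than editing command lines.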
### 3. Evaluation of Alpaca
For comparison, we evaluate the Alpaca-7B model and the [Galpaca-6.7B](https://huggingface.co/GeorgiaTechResearchInstitute/galpaca-6.7b) model, which is contributed by the Georgia Tech Research Institute on HuggingFace.

As for the Alpaca-7B model, we first get [alpaca-7b-wdiff](https://huggingface.co/tatsu-lab/alpaca-7b-wdiff) from HuggingFace, which is the weight diff for [Stanford Alpaca-7B](https://github.com/tatsu-lab/stanford_alpaca/), and then recover the original Alpaca-7B weights using the conversion script provided by [tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca).

The same script is used for evaluating the Alpaca-7B and Galpaca-6.7B models; just set a different model_idx for each model.
```bash
cd eval_alpaca
python eval_alpaca.py --model_idx alpaca-7b-recover --output_dir ./eval_alpaca_output --gpus=0 #original Alpaca-7B weights
```
In the commands above, ```model_idx``` is a model index you assign to your local LLM weights so that a model can be easily referenced at inference time; it is set up in the [model_dict](eval_alpaca/eval_alpaca.py#L81) in [eval_alpaca.py](eval_alpaca/eval_alpaca.py#L81). ```output_dir``` is where the evaluation results are saved.

### 4. Evaluation of LLaMA
For comparison, we evaluate the OPI-instruction-tuned LLaMA-7B model and the original LLaMA-7B model.

The same script is used for evaluating both models; just set a different model_idx for each model.
```bash
cd eval_llama
python eval_llama.py --model_idx llama_7b_hf --output_dir ./eval_llama_output --gpus=0 #original LLaMA-7B weights
```
In the commands above, ```model_idx``` is a model index you assign to your local LLM weights so that a model can be easily referenced at inference time; it is set up in the [model_dict](eval_llama/eval_llama.py#L83) in [eval_llama.py](eval_llama/eval_llama.py#L83). ```output_dir``` is where the evaluation results are saved.

### 5. Evaluation results of the OPI_full_Galactica-6.7B model on the nine tasks
| Task Type | Task Name | Testing file | Accuracy | Precision | Recall | F1 | Rouge-L |
| ---------------------- | ------------------------------------------- | ----------------------------- | :------: | :-------: | :----: | :---: | :-----: |
| Sequence Understanding | EC Number Prediction | CLEAN_EC_number_new_test | - | 0.181 | 0.174 | 0.176 | - |
| Sequence Understanding | EC Number Prediction | CLEAN_EC_number_price_test | - | 0.054 | 0.054 | 0.054 | - |
| Sequence Understanding | Fold Type Prediction | Remote_test_fold | 0.068 | - | - | - | - |
| Sequence Understanding | Fold Type Prediction | Remote_test_superfamily | 0.090 | - | - | - | - |
| Sequence Understanding | Fold Type Prediction | Remote_test_family | 0.416 | - | - | - | - |
| Sequence Understanding | Subcellular Localization Prediction | location_test | 0.678 | - | - | - | - |
| Annotation Prediction | Function Keywords Prediction | CASPSimilarSeq_keywords_test | - | 0.716 | 0.669 | 0.674 | - |
| Annotation Prediction | Function Keywords Prediction | IDFilterSeq_keywords_test | - | 0.822 | 0.771 | 0.778 | - |
| Annotation Prediction | Function Keywords Prediction | UniProtSeq_keywords_test | - | 0.871 | 0.802 | 0.820 | - |
| Annotation Prediction | Gene Ontology (GO) Terms Prediction | CASPSimilarSeq_go_test | - | 0.710 | 0.627 | 0.647 | - |
| Annotation Prediction | Gene Ontology (GO) Terms Prediction | IDFilterSeq_go_test | - | 0.724 | 0.637 | 0.656 | - |
| Annotation Prediction | Gene Ontology (GO) Terms Prediction | UniProtSeq_go_test | - | 0.759 | 0.683 | 0.698 | - |
| Annotation Prediction | Function Description Prediction | CASPSimilarSeq_function_test | - | - | - | - | 0.431 |
| Annotation Prediction | Function Description Prediction | IDFilterSeq_function_test | - | - | - | - | 0.624 |
| Annotation Prediction | Function Description Prediction | UniProtSeq_function_test | - | - | - | - | 0.696 |
| Knowledge Mining | Tissue Location Prediction from Gene Symbol | gene_symbol_to_tissue_test | - | 0.377 | 0.779 | 0.468 | - |
| Knowledge Mining | Cancer Prediction from Gene Symbol | gene_symbol_to_cancer_test | - | 0.554 | 0.433 | 0.465 | - |
| Knowledge Mining | Cancer Prediction from Gene Name | gene_name_to_cancer_test | - | 0.507 | 0.400 | 0.429 | - |

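For the label-list tasks (keywords, GO terms, tissues, cancers), the model emits " ; "- or ", "-separated labels that are scored against the reference list with precision, recall, and F1. A minimal per-example sketch under one common convention (the exact definitions in the OPI evaluation scripts are not shown here and may differ, e.g. in averaging across examples):

```python
# Sketch of set-based precision/recall/F1 for one predicted label list
# against one reference list (convention assumed, not taken from the
# OPI eval scripts).

def parse_labels(text: str) -> set:
    """Split a ' ; '-separated label string into a set of labels."""
    return {t.strip() for t in text.split(";") if t.strip()}

def prf1(prediction: str, target: str):
    pred, gold = parse_labels(prediction), parse_labels(target)
    tp = len(pred & gold)  # labels predicted and also in the reference
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1("Membrane ; Transport ; Signal",
               "Membrane ; Transport ; Repeat")
```

For the free-text Function Description task, Rouge-L (longest-common-subsequence overlap) is reported instead, since descriptions are sentences rather than label sets.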
## Prediction (by OPI_full_Galactica-6.7B) vs. Target

<details>
<summary>EC Number Prediction</summary>

```
Instruction:
What is the EC number of the input sequence?
Input:
MSLLAYTNLLLQNGRIFRYYKKANIKKFIKKIIKLDLKSTPSEASVSRQTFLSTGLNSVKNAVQLQARKLLINNVLERVTPTLNSDLKKKAAKRLFYGDSAPFFALVGVSLASGSGLLTKDDELEGICWEIREAVSKGKWNDSESENVEQLQAANLDELDLGEPIAKGCNAVVYSAKLKNVQSNKLAHQLAVKMMFNYDVESNSTAILKAMYRETVPAMSYFFNQNLFNIENISDFKIRLPPHPNIVRMYSVFADRIPDLQCNKQLYPEALPPRINPEGSGRNMSLFLVMKRYDCTLKEYLRDKTPNMRSSILLLSQLLEAVAHMNIHNISHRDLKSDNILVDLSEGDAYPTIVITDFGCCLCDKQNGLVIPYRSEDQDKGGNRALMAPEIANAKPGTFSWLNYKKSDLWAVGAIAYEIFNIDNPFYDKTMKLLSKSYKEEDLPELPDTIPFIIRNLVSNMLSRSTNKRLDCDVAATVAQLYLWAPSSWLKENYTLPNSNEIIQWLLCLSSKVLCERDITARNKTNTMSESVSKAQYKGRRSLPEYELIASFLRRVRLHLVRKGLKWIQELHIYN
Prediction:
2.7.11.1
Target:
2.7.11.1
```

</details>

<details>
<summary>Fold Type Prediction</summary>

```
Instruction:
Please predict its folding type based on the protein sequence. Here, a number is assigned to each folding type, ranging from 0 to 1194.
Input:
GSGDSHPDFPEDADVDLKDVDKILLISEDLKNIGNTFFKSQNWEMAIKKYTKVLRYVEGSRAAAEDADGAKLQPVALSCVLNIGACKLKMSDWQGAVDSCLEALEIDPSNTKALYRRAQGWQGLKEYDQALADLKKAQEIAPEDKAIQAELLKVKQKIKAQKDKEKAAY
Prediction:
3
Target:
3
```

</details>

<details>
<summary>Subcellular Localization Prediction</summary>

```
Instruction:
By scrutinizing the protein's amino acid composition and sequence motifs, forecast its intracellular localization in eukaryotic cells.
Input:
MEDEAVLDRGASFLKHVCDEEEVEGHHTIYIGVHVPKSYRRRRRHKRKTGHREKKEKERISENYSDKSDVENADESSSSILKPLISPAAERIRFILGEEDDSPAPPQLFTELDELLAVDGQEMEWKETARWIKFEEKVEQGGERWSKPHVATLSLHSLFELRTCMEKGSIMLDREASSLPQLVEMIVDHQIETGLLKPDLKDKVTYTLLRKHRHQTKKSNLRSLADIGKTVSSASRMFTNPDNGSPAMTHRNLTSSSLNDISDKPEKDQLKNKFMKKLPRDAEASNVLVGEVDFLDSPFIAFVRLQQAVMLGALTEVPVPTRFLFILLGPKGKAKSYHEIGRAIATLMSDEVFHDIAYKAKDRQDLIAGIDEFLDEVIVLPPGEWDPAIRIEPPKSLPSSDKRKNMYSGGENVQMNGDTPPDGGHGGGGHADCEELQRTGRFCGGLIKDIKRKAPFFASDFYDALNIQALSAILFIYLATVTNAITFGGLLGDATDNMQGVLESFLGTAVSGAIFCLFAGQPLTILSSTGPVLVFERLLFNFSKDHNFDYLEFRLWIGLWSAFLCLILVATDASFLVQYFTRFTEEGFSSLISFIFIYDAFKKMIKLADYYPINSNFKVGYNTQFSCVCMPPDPVNISVSNDTTLAPEDLPTISSSNMYHNATFDWAFLTTKECLKYGGKLVGNNCGFVPDITLMSFILFLGTYTSSMALKKFKTSPYFPTTARKLISDFAIILPILIFCVIDALVGVDTPKLIVPSEFKPTSPNRGWFVAPFGGNPWWVYLAAAIPALLVTILIFMDQQITAVIVNRKEHKLKKGAGYHLDLFWVAILMVVCSFMALPWYVAATVISIAHIDSLKMETETSAPGEQPKFLGVREQRVTGTLVFILTGLSVFMAPILKFIPMPVLYGVFLYMGVASLNGVQFMDRLKLLLMPLKHQPDFIYLRHVPLRRVHLFTFLQVLCLALLWILKSTVAAIIFPVMILALVAVRKGMDYLFSQHDLSFLDDVIPEKDKKKKEDEKKKKKKKGSVDSDNDDSDCPYSEKVPSIKIPMDIMEQQPFLSDSKPSDRERSPTFLERHTSC
Prediction:
membrane
Target:
membrane
```

</details>

<details>
<summary>Function Keywords Prediction</summary>

```
Instruction:
What are the UniProtKB keywords for this specific protein sequence?
Input:
MRGSFFSRLPPQLSLLLLLLLLLSWRRVWTQEHIGTDPSKSPVAPVCPEACSCSPGGKANCSALALPAVPAGLSWQVRSLLLDRNRVSTLPPGAFADAGALLYLVLRENRLRSVHARAFWGLGVLQRLDLSSNQLETLSPGTFTPLRALSFLSLAGNRLALLEPSILGPLPLLRVLSLQDNSLSALEAGLLNSLPALDVLRLHGNPWACSCALRPLCTWLRKHPRPTSETETLLCVSPKLQTLNLLTDFPDNAFKQCTQSLAARDLAVVYALGPASFLASLAICLALGSVLTACGARRRRRRTTVRHLIRRQPDPEGPASLEDVGSPTTTAIQA
Prediction:
Cell membrane ; Cytoplasm ; Cytoskeleton ; Disulfide bond ; Ion channel ; Ion transport ; Leucine-rich repeat ; Membrane ; Reference proteome ; Repeat ; Signal ; Transmembrane ; Transmembrane helix ; Transport
Target:
Cell membrane ; Cytoplasm ; Cytoskeleton ; Disulfide bond ; Ion channel ; Ion transport ; Leucine-rich repeat ; Membrane ; Reference proteome ; Repeat ; Signal ; Transmembrane ; Transmembrane helix ; Transport
```

</details>

<details>
<summary>Gene Ontology (GO) Terms Prediction</summary>

```
Instruction:
The Gene Ontology project (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism. There are 3 disjoint categories: cellular component, molecular function and biological process. Predict the GO term for a given protein sequence.
Input:
MEFVTNYTLEELKKRFTELGLEPYRAKQVFRWVYKKFVTDFEKMTDLGKKHRELLKEHFAFHPLEKLDRVEAPDAVKYLFKTKDGHILETVLIKERDHYTLCVSSQIGCAVGCTFCATALDGLKRNLSTAEIIDQYLQVQQDLGEEKIRNVVFMGMGEPLANYENVRKAVEIMVSPEGLDLSKRRITISTSGIVAQIKRMAQDPVMKEVNLAVSLNAVSQKKREELMPLTKTNTLEELMEVLKNYPLPKYRRITLEYVLIKGVNDSPNDAERLAKLIGRHKKKFKVNLIPFNPDPNLPYERPALTDIMKFQKVLWKYGISNFVRFSKGVEVFGACGQLRTQRLQLQRV
Prediction:
cytoplasm ; 4 iron, 4 sulfur cluster binding ; metal ion binding ; rRNA (adenine-C2-)-methyltransferase activity ; rRNA binding ; tRNA (adenine-C2-)-methyltransferase activity ; tRNA binding ; rRNA base methylation
Target:
cytoplasm ; 4 iron, 4 sulfur cluster binding ; metal ion binding ; rRNA (adenine-C2-)-methyltransferase activity ; rRNA binding ; tRNA (adenine-C2-)-methyltransferase activity ; tRNA binding ; rRNA base methylation ; tRNA methylation
```

</details>

<details>
<summary>Function Description Prediction</summary>

```
Instruction:
Generate the functional description in free-text form based on the protein sequence.
Input:
MAAPSGVHLLVRRGSHRIFSSPLNHIYLHKQSSSQQRRNFFFRRQRDISHSIVLPAAVSSAHPVPKHIKKPDYVTTGIVPDWGDSIEVKNEDQIQGLHQACQLARHVLLLAGKSLKVDMTTEEIDALVHREIISHNAYPSPLGYGGFPKSVCTSVNNVLCHGIPDSRPLQDGDIINIDVTVYYNGYHGDTSETFLVGNVDECGKKLVEVARRCRDEAIAACRAGAPFSVIGNTISHITHQNGFQVCPHFVGHGIGSYFHGHPEIWHHANDSDLPMEEGMAFTIEPIITEGSPEFKVLEDAWTVVSLDNQRSAQFEHTVLITSRGAQILTKLPHEA
Prediction:
Removes the N-terminal methionine from nascent proteins. The N-terminal methionine is often cleaved when the second residue in the primary sequence is small and uncharged (Met-Ala-, Cys, Gly, Pro, Ser, Thr, or Val). Requires deformylation of the N(alpha)-formylated initiator methionine before it can be hydrolyzed.
Target:
Removes the N-terminal methionine from nascent proteins. The N-terminal methionine is often cleaved when the second residue in the primary sequence is small and uncharged (Met-Ala-, Cys, Gly, Pro, Ser, Thr, or Val). Requires deformylation of the N(alpha)-formylated initiator methionine before it can be hydrolyzed (By similarity). May play a role in colon tumorigenesis.
```

</details>

<details>
<summary>Tissue Location Prediction from Gene Symbol</summary>

```
Instruction:
According to the relevant data reported in the scientific literature, the Gene symbol of a given protein can be used to predict which tissues the protein is expressed in.
Input:
TMEM198
Prediction:
adrenal gland, appendix, bone marrow, breast, bronchus, caudate, cerebellum, cerebral cortex, cervix, colon, duodenum, endometrium, epididymis, esophagus, fallopian tube, gallbladder, heart muscle, hippocampus, kidney, liver, lung, lymph node, nasopharynx, oral mucosa, ovary, pancreas, parathyroid gland, placenta, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, skin, small intestine, smooth muscle, soft tissue, spleen, stomach, testis, thyroid gland, tonsil, urinary bladder
Target:
adrenal gland, appendix, breast, bronchus, cerebral cortex, colon, fallopian tube, gallbladder, kidney, nasopharynx, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, small intestine, stomach, urinary bladder
```

</details>

<details>
<summary>Cancer Prediction from Gene Symbol</summary>

```
Instruction:
Return the name of the cancer corresponding to the given gene symbol.
Input:
FOXL2
Prediction:
granulosa-cell tumour of the ovary
Target:
granulosa-cell tumour of the ovary
```

</details>

<details>
<summary>Cancer Prediction from Gene Name</summary>

```
Instruction:
Give back the cancer name that is associated with the provided gene name.
Input:
immunoglobulin lambda locus
Prediction:
Burkitt lymphoma
Target:
Burkitt lymphoma
```

</details>

## Demo
We use the [FastChat](https://github.com/lm-sys/FastChat) platform to visually demonstrate the ability of the OPI_full_Galactica-6.7B model on the various evaluation tasks.

![OPI Demo](./OPI_demo.gif)