README.md · BAAI/OPI-Galactica-6.7B at 10d6c9df6919cdfa7e7dbaf072a6e04b7afafb99

metadata

license: apache-2.0
datasets:
  - BAAI/OPI
language:
  - en
pipeline_tag: text-generation
tags:
  - biology

Model Card of OPI_full_Galactica-6.7B

This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a protein instruction dataset as well as propose to explore and benckmark LLMs for protein modeling in protein biology.

Usage and License Notices: LLaMA and Galactica are intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff for Stanford Alpaca is also CC BY NC 4.0 (allowing only non-commercial use).

OPI-instruction tuning from original Galactica-6.7B model and LLaMA-7B model

For OPI-instruction tuning, we adopt the training script of Stanford Alpaca.

1. Galactica instruction-tuning with OPI

Example: train_keywords.sh

#!/bin/bash

OMP_NUM_THREADS=1 torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=3 train_galai/train.py \
    --model_name_or_path path/to/galactica_base_model/galactica-$3 \
    --data_path  ./OPI_DATA/AP/Keywords/train/keywords_train.json \
    --bf16 True \
    --output_dir path/to/output/galai_ft_opi/galai_ft_keywords_$3_e$4 \
    --num_train_epochs $4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

In the Shell above, you can setup your onw local LLM weights path or Huggingface model entry (e.g., facebook/galactica-6.7b) to model_name_or_path and you onw training results saving path to output_dir.

To start training, please do like this:

bash train_galai/train_keywords.sh 1 0 6.7b 3

Explanation of such bash arguments:

1: nnodes \
0: node_rank \
6.7b: model size of Galactica \
3: total training epochs

2. LLaMA instruction-tuning with OPI

Example: train_EC_number.sh

#!/bin/bash

OMP_NUM_THREADS=1 torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=3 train_llama/train.py \
    --model_name_or_path path/to/llama_base_model/hf_version/llama-$3 \
    --data_path  ./OPI_DATA/SU/EC_number/train/CLEAN_EC_number_train.json \
    --bf16 True \
    --output_dir path/to/output/llama_ft_CLEAN_EC_number_$3_e$4 \
    --num_train_epochs $4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

In the Shell above, you can setup your onw local LLM weights path or Huggingface model entry (e.g., decapoda-research/llama-7b-hf) to model_name_or_path and you onw training results saving path to output_dir. To start training, please do like this:

bash train_llama/train_EC_number.sh 1 0 7b 3

Explanation of such bash arguments:

1: nnodes \
0: node_rank \
7b: model size of LLaMA \
3: total training epochs

Note: As for the training, we take the suggestion to address out-of-memory issue from tatsu-lab/stanford_alpaca, using DeepSpeed ZeRO stage-3 with offload.

3. Convert DeepSpeed-format weights

Once finished instruction tuning, the DeepSpeed-format weights should be converted to pytorch_model.bin, using the following script:

cd output_dir
python zero_to_fp32.py . pytorch_model.bin

4. Split pytorch_model.bin into chunks to speedup loading for inference

After step 3, you will get the pytorch_model.bin file. You can further split it to small chunks, e.g., pytorch_model-00001-of-00004.bin pytorch_model-00002-of-00004.bin, pytorch_model-00003-of-00004.bin, pytorch_model-00004-of-00004.bin, in order to speedup loading it when inferenceing. However, it is not a must, if you don't want. If you would like to split it, please do like this:

cd model_split
python model_split.py --model_idx OPI-instruction-tuned-model-name

Then you will get a checkpoint folder suffixed with "chunked", which you can take as the pretrained model path for later evaluation job.

5. How to access OPI-instruction-tuned Galactica-6.7B model?

In this repo, we release the OPI_full_Galactica-6.7B model which is fine-funed on OPI full dataset, which can be accessed from HuggingFace. Please feel free to contact us if there is any question.

Nine Evaluation tasks

For benchamarking, we design 3 types of evaluation tasks, each of which contains 3 specific ones, as shown in the following table.

Task Type	Abbreviation	Task Name
Sequence Understanding	SU	EC Number Prediction
Sequence Understanding	SU	Fold Type Prediction
Sequence Understanding	SU	Subcellular Localization Prediction
Annotation Prediction	AP	Function Keywords Prediction
Annotation Prediction	AP	Gene Ontology(GO) Terms Prediction
Annotation Prediction	AP	Function Description Prediction
Knowledge Mining	KM	Tissue Location Prediction from Gene Symbol
Knowledge Mining	KM	Cancer Prediction from Gene Symbol
Knowledge Mining	KM	Cancer Prediction from Gene Name

Evaluating various models with OPI data

1. Environment setup

pip install -r requirements.txt

As for the evaluation, we refer to the inference script from Chinese-LLaMA-Alpaca.

2. Evaluation of Galactica

We evaluate OPI-instruction-tuned Galactica-6.7B model and origional Galactica-6.7B model.

For OPI-instruction-tuned Galactica-6.7B model, please use the following script:

cd eval_galai
python eval_galai.py --model_idx OPI-instruction-tuned-model-name --output_dir ./eval_galai_output --gpus=0

In the commands above, model_idxis the model index you can allocate to your local LLM weights for you to easily access a LLM model when inferencing, which you can set it up in the model_dict in eval_galai.py. output_diris where you save the evaluation results.

For the original Galactica-6.7B model, please use the following script:

cd eval_galai/infer_with_original_galai
bash galactica_infer.sh

3. Evaluation of Alpaca

For comparison, we evaluate Alpaca-7B model and Galpaca-6.7B model. The Galpaca-6.7B model is contributed by Georgia Tech Research Institute on HuggingFace.

As for Alpaca-7B model, we first get alpaca-7b-wdiff from HuggingFace, which is the weight diff for Stanford Alpaca-7B, then recover the original Alpaca-7B weights using the conversion script provided by tatsu-lab/stanford_alpaca.

The same script is used for evaluating Alpaca-7B and Galpaca-6.7B model, just by setting a different model_idx for a different model.

cd eval_alpaca
python eval_alpaca.py --model_idx alpaca-7b-recover --output_dir ./eval_alpaca_output --gpus=0 #original Alpaca-7B weights

In the commands above, model_idxis the model index you can allocate to your local LLM weights for you to easily access a LLM model when inferencing, which you can set it up in the model_dict in eval_alpaca.py. output_diris where you save the evaluation results.

4. Evaluation of LLaMA

For comparison, we evaluate OPI-instruction-tuned LLaMA-7B model and original LLaMA-7B model.

The same script is used for evaluating OPI-instruction-tuned LLaMA-7B model and original LLaMA-7B model, just by setting a different model_idx for a different model.

cd eval_llama
python eval_llama.py --model_idx llama_7b_hf --output_dir ./eval_llama_output --gpus=0  #original LLaMA-7B weights

In the commands above, model_idxis the model index you can allocate to your local LLM weights for you to easily access a LLM model when inferencing, which you can set it up in the model_dict in eval_llama.py. output_diris where you save the evaluation results.

5. The following table shows evaluation results of OPI_full_Galactica-6.7B model on 9 tasks.

Task Type	Task Name	Testing file	Accuracy	Precision	Recall	F1	Rouge-L
Sequence Understanding	EC Number Prediction	CLEAN_EC_number_new_test	-	0.181	0.174	0.176	-
Sequence Understanding	EC Number Prediction	CLEAN_EC_number_price_test	-	0.054	0.054	0.054	-
Sequence Understanding	Fold Type Prediction	Remote_test_fold	0.068	-	-	-	-
Sequence Understanding	Fold Type Prediction	Remote_test_superfamily	0.090	-	-	-	-
Sequence Understanding	Fold Type Prediction	Remote_test_family	0.416	-	-	-	-
Sequence Understanding	Subcellular Localization Prediction	location_test	0.678	-	-	-	-
Annotation Prediction	Function Keywords Prediction	CASPSimilarSeq_keywords_test	-	0.716	0.669	0.674	-
Annotation Prediction	Function Keywords Prediction	IDFilterSeq_keywords_test	-	0.822	0.771	0.778	-
Annotation Prediction	Function Keywords Prediction	UniProtSeq_keywords_test	-	0.871	0.802	0.820	-
Annotation Prediction	Gene Ontology(GO) Terms Prediction	CASPSimilarSeq_go_test	-	0.710	0.627	0.647	-
Annotation Prediction	Gene Ontology(GO) Terms Prediction	IDFilterSeq_go_test	-	0.724	0.637	0.656	-
Annotation Prediction	Gene Ontology(GO) Terms Prediction	UniProtSeq_go_test	-	0.759	0.683	0.698	-
Annotation Prediction	Function Description Prediction	CASPSimilarSeq_function_test	-	-	-	-	0.431
Annotation Prediction	Function Description Prediction	IDFilterSeq_function_test	-	-	-	-	0.624
Annotation Prediction	Function Description Prediction	UniProtSeq_function_test	-	-	-	-	0.696
Knowledge Mining	Tissue Location Prediction from Gene Symbol	gene_symbol_to_tissue_test	-	0.377	0.779	0.468	-
Knowledge Mining	Cancer Prediction from Gene Symbol	gene_symbol_to_cancer_test	-	0.554	0.433	0.465	-
Knowledge Mining	Cancer Prediction from Gene Name	gene_name_to_cancer_test	-	0.507	0.400	0.429	-

Prediction (by OPI_full_Galactica-6.7B) v.s. Target

EC Number Prediction

Instruction:
    What is the EC number of the input sequence?
Input:
    MSLLAYTNLLLQNGRIFRYYKKANIKKFIKKIIKLDLKSTPSEASVSRQTFLSTGLNSVKNAVQLQARKLLINNVLERVTPTLNSDLKKKAAKRLFYGDSAPFFALVGVSLASGSGLLTKDDELEGICWEIREAVSKGKWNDSESENVEQLQAANLDELDLGEPIAKGCNAVVYSAKLKNVQSNKLAHQLAVKMMFNYDVESNSTAILKAMYRETVPAMSYFFNQNLFNIENISDFKIRLPPHPNIVRMYSVFADRIPDLQCNKQLYPEALPPRINPEGSGRNMSLFLVMKRYDCTLKEYLRDKTPNMRSSILLLSQLLEAVAHMNIHNISHRDLKSDNILVDLSEGDAYPTIVITDFGCCLCDKQNGLVIPYRSEDQDKGGNRALMAPEIANAKPGTFSWLNYKKSDLWAVGAIAYEIFNIDNPFYDKTMKLLSKSYKEEDLPELPDTIPFIIRNLVSNMLSRSTNKRLDCDVAATVAQLYLWAPSSWLKENYTLPNSNEIIQWLLCLSSKVLCERDITARNKTNTMSESVSKAQYKGRRSLPEYELIASFLRRVRLHLVRKGLKWIQELHIYN
Prediction:
    2.7.11.1
Target:
    2.7.11.1

Fold Type Prediction

Instruction:
    Please predict its folding type based on the protein sequence. Here, a number is assigned to each folding type, ranging from 0 to 1194.
Input:
    GSGDSHPDFPEDADVDLKDVDKILLISEDLKNIGNTFFKSQNWEMAIKKYTKVLRYVEGSRAAAEDADGAKLQPVALSCVLNIGACKLKMSDWQGAVDSCLEALEIDPSNTKALYRRAQGWQGLKEYDQALADLKKAQEIAPEDKAIQAELLKVKQKIKAQKDKEKAAY
Prediction:
    3
Target:
    3

Subcellular Localization Prediction

Instruction:
    By scrutinizing the protein's amino acid composition and sequence motifs, forecast its intracellular localization in eukaryotic cells.
Input:
    MEDEAVLDRGASFLKHVCDEEEVEGHHTIYIGVHVPKSYRRRRRHKRKTGHREKKEKERISENYSDKSDVENADESSSSILKPLISPAAERIRFILGEEDDSPAPPQLFTELDELLAVDGQEMEWKETARWIKFEEKVEQGGERWSKPHVATLSLHSLFELRTCMEKGSIMLDREASSLPQLVEMIVDHQIETGLLKPDLKDKVTYTLLRKHRHQTKKSNLRSLADIGKTVSSASRMFTNPDNGSPAMTHRNLTSSSLNDISDKPEKDQLKNKFMKKLPRDAEASNVLVGEVDFLDSPFIAFVRLQQAVMLGALTEVPVPTRFLFILLGPKGKAKSYHEIGRAIATLMSDEVFHDIAYKAKDRQDLIAGIDEFLDEVIVLPPGEWDPAIRIEPPKSLPSSDKRKNMYSGGENVQMNGDTPPDGGHGGGGHADCEELQRTGRFCGGLIKDIKRKAPFFASDFYDALNIQALSAILFIYLATVTNAITFGGLLGDATDNMQGVLESFLGTAVSGAIFCLFAGQPLTILSSTGPVLVFERLLFNFSKDHNFDYLEFRLWIGLWSAFLCLILVATDASFLVQYFTRFTEEGFSSLISFIFIYDAFKKMIKLADYYPINSNFKVGYNTQFSCVCMPPDPVNISVSNDTTLAPEDLPTISSSNMYHNATFDWAFLTTKECLKYGGKLVGNNCGFVPDITLMSFILFLGTYTSSMALKKFKTSPYFPTTARKLISDFAIILPILIFCVIDALVGVDTPKLIVPSEFKPTSPNRGWFVAPFGGNPWWVYLAAAIPALLVTILIFMDQQITAVIVNRKEHKLKKGAGYHLDLFWVAILMVVCSFMALPWYVAATVISIAHIDSLKMETETSAPGEQPKFLGVREQRVTGTLVFILTGLSVFMAPILKFIPMPVLYGVFLYMGVASLNGVQFMDRLKLLLMPLKHQPDFIYLRHVPLRRVHLFTFLQVLCLALLWILKSTVAAIIFPVMILALVAVRKGMDYLFSQHDLSFLDDVIPEKDKKKKEDEKKKKKKKGSVDSDNDDSDCPYSEKVPSIKIPMDIMEQQPFLSDSKPSDRERSPTFLERHTSC
Prediction:
    membrane
Target:
    membrane

Function Keywords Prediction

Instruction:
    What are the UniProtKB keywords for this specific protein sequence?
Input:
    MRGSFFSRLPPQLSLLLLLLLLLSWRRVWTQEHIGTDPSKSPVAPVCPEACSCSPGGKANCSALALPAVPAGLSWQVRSLLLDRNRVSTLPPGAFADAGALLYLVLRENRLRSVHARAFWGLGVLQRLDLSSNQLETLSPGTFTPLRALSFLSLAGNRLALLEPSILGPLPLLRVLSLQDNSLSALEAGLLNSLPALDVLRLHGNPWACSCALRPLCTWLRKHPRPTSETETLLCVSPKLQTLNLLTDFPDNAFKQCTQSLAARDLAVVYALGPASFLASLAICLALGSVLTACGARRRRRRTTVRHLIRRQPDPEGPASLEDVGSPTTTAIQA
Prediction:
    Cell membrane ; Cytoplasm ; Cytoskeleton ; Disulfide bond ; Ion channel ; Ion transport ; Leucine-rich repeat ; Membrane ; Reference proteome ; Repeat ; Signal ; Transmembrane ; Transmembrane helix ; Transport
Target:
    Cell membrane ; Cytoplasm ; Cytoskeleton ; Disulfide bond ; Ion channel ; Ion transport ; Leucine-rich repeat ; Membrane ; Reference proteome ; Repeat ; Signal ; Transmembrane ; Transmembrane helix ; Transport

Gene Ontology(GO) Terms Prediction

Instruction:
    The Gene Ontology project (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism. There are 3 disjoint categories: cellular component, molecular function and biological process. Predict the GO term for a given protein sequence.
Input:
    MEFVTNYTLEELKKRFTELGLEPYRAKQVFRWVYKKFVTDFEKMTDLGKKHRELLKEHFAFHPLEKLDRVEAPDAVKYLFKTKDGHILETVLIKERDHYTLCVSSQIGCAVGCTFCATALDGLKRNLSTAEIIDQYLQVQQDLGEEKIRNVVFMGMGEPLANYENVRKAVEIMVSPEGLDLSKRRITISTSGIVAQIKRMAQDPVMKEVNLAVSLNAVSQKKREELMPLTKTNTLEELMEVLKNYPLPKYRRITLEYVLIKGVNDSPNDAERLAKLIGRHKKKFKVNLIPFNPDPNLPYERPALTDIMKFQKVLWKYGISNFVRFSKGVEVFGACGQLRTQRLQLQRV
Prediction:
    cytoplasm ; 4 iron, 4 sulfur cluster binding ; metal ion binding ; rRNA (adenine-C2-)-methyltransferase activity ; rRNA binding ; tRNA (adenine-C2-)-methyltransferase activity ; tRNA binding ; rRNA base methylation 
Target:
    cytoplasm ; 4 iron, 4 sulfur cluster binding ; metal ion binding ; rRNA (adenine-C2-)-methyltransferase activity ; rRNA binding ; tRNA (adenine-C2-)-methyltransferase activity ; tRNA binding ; rRNA base methylation ; tRNA methylation

Function Description Prediction

Instruction:
    Generate the functional description in free-text form based on the protein sequence.
Input:
    MAAPSGVHLLVRRGSHRIFSSPLNHIYLHKQSSSQQRRNFFFRRQRDISHSIVLPAAVSSAHPVPKHIKKPDYVTTGIVPDWGDSIEVKNEDQIQGLHQACQLARHVLLLAGKSLKVDMTTEEIDALVHREIISHNAYPSPLGYGGFPKSVCTSVNNVLCHGIPDSRPLQDGDIINIDVTVYYNGYHGDTSETFLVGNVDECGKKLVEVARRCRDEAIAACRAGAPFSVIGNTISHITHQNGFQVCPHFVGHGIGSYFHGHPEIWHHANDSDLPMEEGMAFTIEPIITEGSPEFKVLEDAWTVVSLDNQRSAQFEHTVLITSRGAQILTKLPHEA
Prediction:
    Removes the N-terminal methionine from nascent proteins. The N-terminal methionine is often cleaved when the second residue in the primary sequence is small and uncharged (Met-Ala-, Cys, Gly, Pro, Ser, Thr, or Val). Requires deformylation of the N(alpha)-formylated initiator methionine before it can be hydrolyzed.
Target:
    Removes the N-terminal methionine from nascent proteins. The N-terminal methionine is often cleaved when the second residue in the primary sequence is small and uncharged (Met-Ala-, Cys, Gly, Pro, Ser, Thr, or Val). Requires deformylation of the N(alpha)-formylated initiator methionine before it can be hydrolyzed (By similarity). May play a role in colon tumorigenesis.

Tissue Location Prediction from Gene Symbol

Instruction:
    According to the relevant data reported in the scientific literature, the Gene symbol of a given protein can be used to predict which tissues the protein is expressed in.
Input:
    TMEM198
Prediction:
    adrenal gland, appendix, bone marrow, breast, bronchus, caudate, cerebellum, cerebral cortex, cervix, colon, duodenum, endometrium, epididymis, esophagus, fallopian tube, gallbladder, heart muscle, hippocampus, kidney, liver, lung, lymph node, nasopharynx, oral mucosa, ovary, pancreas, parathyroid gland, placenta, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, skin, small intestine, smooth muscle, soft tissue, spleen, stomach, testis, thyroid gland, tonsil, urinary bladder
Target:
    adrenal gland, appendix, breast, bronchus, cerebral cortex, colon, fallopian tube, gallbladder, kidney, nasopharynx, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, small intestine, stomach, urinary bladder

Cancer Prediction from Gene Symbol

Instruction:
    Return the name of the cancer corresponding to the given gene symbol.
Input:
    FOXL2
Prediction:
    granulosa-cell tumour of the ovary
Target:
    granulosa-cell tumour of the ovary

Cancer Prediction from Gene Name

Instruction:
    Give back the cancer name that is associated with the provided gene name.
Input:
    immunoglobulin lambda locus
Prediction:
    Burkitt lymphoma
Target:
    Burkitt lymphoma

Demo

We use the FastChat platform to visually demonstrate the ability of OPI_full_Galactica-6.7B model on various evaluation tasks.