OPI-Galactica-6.7B / README.md
William
Update README.md
10d6c9d
|
raw
history blame
20.5 kB
metadata
license: apache-2.0
datasets:
  - BAAI/OPI
language:
  - en
pipeline_tag: text-generation
tags:
  - biology

OPI_logo

Model Card of OPI_full_Galactica-6.7B

This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a protein instruction dataset as well as propose to explore and benckmark LLMs for protein modeling in protein biology. Overview

Usage and License Notices: LLaMA and Galactica are intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff for Stanford Alpaca is also CC BY NC 4.0 (allowing only non-commercial use).

OPI-instruction tuning from original Galactica-6.7B model and LLaMA-7B model

For OPI-instruction tuning, we adopt the training script of Stanford Alpaca.

1. Galactica instruction-tuning with OPI

Example: train_keywords.sh

#!/bin/bash

OMP_NUM_THREADS=1 torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=3 train_galai/train.py \
    --model_name_or_path path/to/galactica_base_model/galactica-$3 \
    --data_path  ./OPI_DATA/AP/Keywords/train/keywords_train.json \
    --bf16 True \
    --output_dir path/to/output/galai_ft_opi/galai_ft_keywords_$3_e$4 \
    --num_train_epochs $4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

In the Shell above, you can setup your onw local LLM weights path or Huggingface model entry (e.g., facebook/galactica-6.7b) to model_name_or_path and you onw training results saving path to output_dir.

To start training, please do like this:

bash train_galai/train_keywords.sh 1 0 6.7b 3 

Explanation of such bash arguments:

1: nnodes \
0: node_rank \
6.7b: model size of Galactica \
3: total training epochs

2. LLaMA instruction-tuning with OPI

Example: train_EC_number.sh

#!/bin/bash

OMP_NUM_THREADS=1 torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=3 train_llama/train.py \
    --model_name_or_path path/to/llama_base_model/hf_version/llama-$3 \
    --data_path  ./OPI_DATA/SU/EC_number/train/CLEAN_EC_number_train.json \
    --bf16 True \
    --output_dir path/to/output/llama_ft_CLEAN_EC_number_$3_e$4 \
    --num_train_epochs $4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

In the Shell above, you can setup your onw local LLM weights path or Huggingface model entry (e.g., decapoda-research/llama-7b-hf) to model_name_or_path and you onw training results saving path to output_dir. To start training, please do like this:

bash train_llama/train_EC_number.sh 1 0 7b 3 

Explanation of such bash arguments:

1: nnodes \
0: node_rank \
7b: model size of LLaMA \
3: total training epochs

Note: As for the training, we take the suggestion to address out-of-memory issue from tatsu-lab/stanford_alpaca, using DeepSpeed ZeRO stage-3 with offload.

3. Convert DeepSpeed-format weights

Once finished instruction tuning, the DeepSpeed-format weights should be converted to pytorch_model.bin, using the following script:

cd output_dir
python zero_to_fp32.py . pytorch_model.bin

4. Split pytorch_model.bin into chunks to speedup loading for inference

After step 3, you will get the pytorch_model.bin file. You can further split it to small chunks, e.g., pytorch_model-00001-of-00004.bin pytorch_model-00002-of-00004.bin, pytorch_model-00003-of-00004.bin, pytorch_model-00004-of-00004.bin, in order to speedup loading it when inferenceing. However, it is not a must, if you don't want. If you would like to split it, please do like this:

cd model_split
python model_split.py --model_idx OPI-instruction-tuned-model-name

Then you will get a checkpoint folder suffixed with "chunked", which you can take as the pretrained model path for later evaluation job.

5. How to access OPI-instruction-tuned Galactica-6.7B model?

In this repo, we release the OPI_full_Galactica-6.7B model which is fine-funed on OPI full dataset, which can be accessed from HuggingFace. Please feel free to contact us if there is any question.

Nine Evaluation tasks

For benchamarking, we design 3 types of evaluation tasks, each of which contains 3 specific ones, as shown in the following table.

Task Type Abbreviation Task Name
Sequence Understanding SU EC Number Prediction
Sequence Understanding SU Fold Type Prediction
Sequence Understanding SU Subcellular Localization Prediction
Annotation Prediction AP Function Keywords Prediction
Annotation Prediction AP Gene Ontology(GO) Terms Prediction
Annotation Prediction AP Function Description Prediction
Knowledge Mining KM Tissue Location Prediction from Gene Symbol
Knowledge Mining KM Cancer Prediction from Gene Symbol
Knowledge Mining KM Cancer Prediction from Gene Name

Evaluating various models with OPI data

1. Environment setup

pip install -r requirements.txt

As for the evaluation, we refer to the inference script from Chinese-LLaMA-Alpaca.

2. Evaluation of Galactica

We evaluate OPI-instruction-tuned Galactica-6.7B model and origional Galactica-6.7B model.

For OPI-instruction-tuned Galactica-6.7B model, please use the following script:

cd eval_galai
python eval_galai.py --model_idx OPI-instruction-tuned-model-name --output_dir ./eval_galai_output --gpus=0

In the commands above, model_idxis the model index you can allocate to your local LLM weights for you to easily access a LLM model when inferencing, which you can set it up in the model_dict in eval_galai.py. output_diris where you save the evaluation results.

For the original Galactica-6.7B model, please use the following script:

cd eval_galai/infer_with_original_galai
bash galactica_infer.sh

3. Evaluation of Alpaca

For comparison, we evaluate Alpaca-7B model and Galpaca-6.7B model. The Galpaca-6.7B model is contributed by Georgia Tech Research Institute on HuggingFace.

As for Alpaca-7B model, we first get alpaca-7b-wdiff from HuggingFace, which is the weight diff for Stanford Alpaca-7B, then recover the original Alpaca-7B weights using the conversion script provided by tatsu-lab/stanford_alpaca.

The same script is used for evaluating Alpaca-7B and Galpaca-6.7B model, just by setting a different model_idx for a different model.

cd eval_alpaca
python eval_alpaca.py --model_idx alpaca-7b-recover --output_dir ./eval_alpaca_output --gpus=0 #original Alpaca-7B weights

In the commands above, model_idxis the model index you can allocate to your local LLM weights for you to easily access a LLM model when inferencing, which you can set it up in the model_dict in eval_alpaca.py. output_diris where you save the evaluation results.

4. Evaluation of LLaMA

For comparison, we evaluate OPI-instruction-tuned LLaMA-7B model and original LLaMA-7B model.

The same script is used for evaluating OPI-instruction-tuned LLaMA-7B model and original LLaMA-7B model, just by setting a different model_idx for a different model.

cd eval_llama
python eval_llama.py --model_idx llama_7b_hf --output_dir ./eval_llama_output --gpus=0  #original LLaMA-7B weights

In the commands above, model_idxis the model index you can allocate to your local LLM weights for you to easily access a LLM model when inferencing, which you can set it up in the model_dict in eval_llama.py. output_diris where you save the evaluation results.

5. The following table shows evaluation results of OPI_full_Galactica-6.7B model on 9 tasks.

Task Type Task Name Testing file Accuracy Precision Recall F1 Rouge-L
Sequence Understanding EC Number Prediction CLEAN_EC_number_new_test - 0.181 0.174 0.176 -
Sequence Understanding EC Number Prediction CLEAN_EC_number_price_test - 0.054 0.054 0.054 -
Sequence Understanding Fold Type Prediction Remote_test_fold 0.068 - - - -
Sequence Understanding Fold Type Prediction Remote_test_superfamily 0.090 - - - -
Sequence Understanding Fold Type Prediction Remote_test_family 0.416 - - - -
Sequence Understanding Subcellular Localization Prediction location_test 0.678 - - - -
Annotation Prediction Function Keywords Prediction CASPSimilarSeq_keywords_test - 0.716 0.669 0.674 -
Annotation Prediction Function Keywords Prediction IDFilterSeq_keywords_test - 0.822 0.771 0.778 -
Annotation Prediction Function Keywords Prediction UniProtSeq_keywords_test - 0.871 0.802 0.820 -
Annotation Prediction Gene Ontology(GO) Terms Prediction CASPSimilarSeq_go_test - 0.710 0.627 0.647 -
Annotation Prediction Gene Ontology(GO) Terms Prediction IDFilterSeq_go_test - 0.724 0.637 0.656 -
Annotation Prediction Gene Ontology(GO) Terms Prediction UniProtSeq_go_test - 0.759 0.683 0.698 -
Annotation Prediction Function Description Prediction CASPSimilarSeq_function_test - - - - 0.431
Annotation Prediction Function Description Prediction IDFilterSeq_function_test - - - - 0.624
Annotation Prediction Function Description Prediction UniProtSeq_function_test - - - - 0.696
Knowledge Mining Tissue Location Prediction from Gene Symbol gene_symbol_to_tissue_test - 0.377 0.779 0.468 -
Knowledge Mining Cancer Prediction from Gene Symbol gene_symbol_to_cancer_test - 0.554 0.433 0.465 -
Knowledge Mining Cancer Prediction from Gene Name gene_name_to_cancer_test - 0.507 0.400 0.429 -

Prediction (by OPI_full_Galactica-6.7B) v.s. Target

EC Number Prediction
Instruction:
    What is the EC number of the input sequence?
Input:
    MSLLAYTNLLLQNGRIFRYYKKANIKKFIKKIIKLDLKSTPSEASVSRQTFLSTGLNSVKNAVQLQARKLLINNVLERVTPTLNSDLKKKAAKRLFYGDSAPFFALVGVSLASGSGLLTKDDELEGICWEIREAVSKGKWNDSESENVEQLQAANLDELDLGEPIAKGCNAVVYSAKLKNVQSNKLAHQLAVKMMFNYDVESNSTAILKAMYRETVPAMSYFFNQNLFNIENISDFKIRLPPHPNIVRMYSVFADRIPDLQCNKQLYPEALPPRINPEGSGRNMSLFLVMKRYDCTLKEYLRDKTPNMRSSILLLSQLLEAVAHMNIHNISHRDLKSDNILVDLSEGDAYPTIVITDFGCCLCDKQNGLVIPYRSEDQDKGGNRALMAPEIANAKPGTFSWLNYKKSDLWAVGAIAYEIFNIDNPFYDKTMKLLSKSYKEEDLPELPDTIPFIIRNLVSNMLSRSTNKRLDCDVAATVAQLYLWAPSSWLKENYTLPNSNEIIQWLLCLSSKVLCERDITARNKTNTMSESVSKAQYKGRRSLPEYELIASFLRRVRLHLVRKGLKWIQELHIYN
Prediction:
    2.7.11.1
Target:
    2.7.11.1
Fold Type Prediction
Instruction:
    Please predict its folding type based on the protein sequence. Here, a number is assigned to each folding type, ranging from 0 to 1194.
Input:
    GSGDSHPDFPEDADVDLKDVDKILLISEDLKNIGNTFFKSQNWEMAIKKYTKVLRYVEGSRAAAEDADGAKLQPVALSCVLNIGACKLKMSDWQGAVDSCLEALEIDPSNTKALYRRAQGWQGLKEYDQALADLKKAQEIAPEDKAIQAELLKVKQKIKAQKDKEKAAY
Prediction:
    3
Target:
    3
Subcellular Localization Prediction
Instruction:
    By scrutinizing the protein's amino acid composition and sequence motifs, forecast its intracellular localization in eukaryotic cells.
Input:
    MEDEAVLDRGASFLKHVCDEEEVEGHHTIYIGVHVPKSYRRRRRHKRKTGHREKKEKERISENYSDKSDVENADESSSSILKPLISPAAERIRFILGEEDDSPAPPQLFTELDELLAVDGQEMEWKETARWIKFEEKVEQGGERWSKPHVATLSLHSLFELRTCMEKGSIMLDREASSLPQLVEMIVDHQIETGLLKPDLKDKVTYTLLRKHRHQTKKSNLRSLADIGKTVSSASRMFTNPDNGSPAMTHRNLTSSSLNDISDKPEKDQLKNKFMKKLPRDAEASNVLVGEVDFLDSPFIAFVRLQQAVMLGALTEVPVPTRFLFILLGPKGKAKSYHEIGRAIATLMSDEVFHDIAYKAKDRQDLIAGIDEFLDEVIVLPPGEWDPAIRIEPPKSLPSSDKRKNMYSGGENVQMNGDTPPDGGHGGGGHADCEELQRTGRFCGGLIKDIKRKAPFFASDFYDALNIQALSAILFIYLATVTNAITFGGLLGDATDNMQGVLESFLGTAVSGAIFCLFAGQPLTILSSTGPVLVFERLLFNFSKDHNFDYLEFRLWIGLWSAFLCLILVATDASFLVQYFTRFTEEGFSSLISFIFIYDAFKKMIKLADYYPINSNFKVGYNTQFSCVCMPPDPVNISVSNDTTLAPEDLPTISSSNMYHNATFDWAFLTTKECLKYGGKLVGNNCGFVPDITLMSFILFLGTYTSSMALKKFKTSPYFPTTARKLISDFAIILPILIFCVIDALVGVDTPKLIVPSEFKPTSPNRGWFVAPFGGNPWWVYLAAAIPALLVTILIFMDQQITAVIVNRKEHKLKKGAGYHLDLFWVAILMVVCSFMALPWYVAATVISIAHIDSLKMETETSAPGEQPKFLGVREQRVTGTLVFILTGLSVFMAPILKFIPMPVLYGVFLYMGVASLNGVQFMDRLKLLLMPLKHQPDFIYLRHVPLRRVHLFTFLQVLCLALLWILKSTVAAIIFPVMILALVAVRKGMDYLFSQHDLSFLDDVIPEKDKKKKEDEKKKKKKKGSVDSDNDDSDCPYSEKVPSIKIPMDIMEQQPFLSDSKPSDRERSPTFLERHTSC
Prediction:
    membrane
Target:
    membrane
Function Keywords Prediction
Instruction:
    What are the UniProtKB keywords for this specific protein sequence?
Input:
    MRGSFFSRLPPQLSLLLLLLLLLSWRRVWTQEHIGTDPSKSPVAPVCPEACSCSPGGKANCSALALPAVPAGLSWQVRSLLLDRNRVSTLPPGAFADAGALLYLVLRENRLRSVHARAFWGLGVLQRLDLSSNQLETLSPGTFTPLRALSFLSLAGNRLALLEPSILGPLPLLRVLSLQDNSLSALEAGLLNSLPALDVLRLHGNPWACSCALRPLCTWLRKHPRPTSETETLLCVSPKLQTLNLLTDFPDNAFKQCTQSLAARDLAVVYALGPASFLASLAICLALGSVLTACGARRRRRRTTVRHLIRRQPDPEGPASLEDVGSPTTTAIQA
Prediction:
    Cell membrane ; Cytoplasm ; Cytoskeleton ; Disulfide bond ; Ion channel ; Ion transport ; Leucine-rich repeat ; Membrane ; Reference proteome ; Repeat ; Signal ; Transmembrane ; Transmembrane helix ; Transport
Target:
    Cell membrane ; Cytoplasm ; Cytoskeleton ; Disulfide bond ; Ion channel ; Ion transport ; Leucine-rich repeat ; Membrane ; Reference proteome ; Repeat ; Signal ; Transmembrane ; Transmembrane helix ; Transport
Gene Ontology(GO) Terms Prediction
Instruction:
    The Gene Ontology project (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism. There are 3 disjoint categories: cellular component, molecular function and biological process. Predict the GO term for a given protein sequence.
Input:
    MEFVTNYTLEELKKRFTELGLEPYRAKQVFRWVYKKFVTDFEKMTDLGKKHRELLKEHFAFHPLEKLDRVEAPDAVKYLFKTKDGHILETVLIKERDHYTLCVSSQIGCAVGCTFCATALDGLKRNLSTAEIIDQYLQVQQDLGEEKIRNVVFMGMGEPLANYENVRKAVEIMVSPEGLDLSKRRITISTSGIVAQIKRMAQDPVMKEVNLAVSLNAVSQKKREELMPLTKTNTLEELMEVLKNYPLPKYRRITLEYVLIKGVNDSPNDAERLAKLIGRHKKKFKVNLIPFNPDPNLPYERPALTDIMKFQKVLWKYGISNFVRFSKGVEVFGACGQLRTQRLQLQRV
Prediction:
    cytoplasm ; 4 iron, 4 sulfur cluster binding ; metal ion binding ; rRNA (adenine-C2-)-methyltransferase activity ; rRNA binding ; tRNA (adenine-C2-)-methyltransferase activity ; tRNA binding ; rRNA base methylation 
Target:
    cytoplasm ; 4 iron, 4 sulfur cluster binding ; metal ion binding ; rRNA (adenine-C2-)-methyltransferase activity ; rRNA binding ; tRNA (adenine-C2-)-methyltransferase activity ; tRNA binding ; rRNA base methylation ; tRNA methylation
Function Description Prediction
Instruction:
    Generate the functional description in free-text form based on the protein sequence.
Input:
    MAAPSGVHLLVRRGSHRIFSSPLNHIYLHKQSSSQQRRNFFFRRQRDISHSIVLPAAVSSAHPVPKHIKKPDYVTTGIVPDWGDSIEVKNEDQIQGLHQACQLARHVLLLAGKSLKVDMTTEEIDALVHREIISHNAYPSPLGYGGFPKSVCTSVNNVLCHGIPDSRPLQDGDIINIDVTVYYNGYHGDTSETFLVGNVDECGKKLVEVARRCRDEAIAACRAGAPFSVIGNTISHITHQNGFQVCPHFVGHGIGSYFHGHPEIWHHANDSDLPMEEGMAFTIEPIITEGSPEFKVLEDAWTVVSLDNQRSAQFEHTVLITSRGAQILTKLPHEA
Prediction:
    Removes the N-terminal methionine from nascent proteins. The N-terminal methionine is often cleaved when the second residue in the primary sequence is small and uncharged (Met-Ala-, Cys, Gly, Pro, Ser, Thr, or Val). Requires deformylation of the N(alpha)-formylated initiator methionine before it can be hydrolyzed.
Target:
    Removes the N-terminal methionine from nascent proteins. The N-terminal methionine is often cleaved when the second residue in the primary sequence is small and uncharged (Met-Ala-, Cys, Gly, Pro, Ser, Thr, or Val). Requires deformylation of the N(alpha)-formylated initiator methionine before it can be hydrolyzed (By similarity). May play a role in colon tumorigenesis.
Tissue Location Prediction from Gene Symbol
Instruction:
    According to the relevant data reported in the scientific literature, the Gene symbol of a given protein can be used to predict which tissues the protein is expressed in.
Input:
    TMEM198
Prediction:
    adrenal gland, appendix, bone marrow, breast, bronchus, caudate, cerebellum, cerebral cortex, cervix, colon, duodenum, endometrium, epididymis, esophagus, fallopian tube, gallbladder, heart muscle, hippocampus, kidney, liver, lung, lymph node, nasopharynx, oral mucosa, ovary, pancreas, parathyroid gland, placenta, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, skin, small intestine, smooth muscle, soft tissue, spleen, stomach, testis, thyroid gland, tonsil, urinary bladder
Target:
    adrenal gland, appendix, breast, bronchus, cerebral cortex, colon, fallopian tube, gallbladder, kidney, nasopharynx, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, small intestine, stomach, urinary bladder
Cancer Prediction from Gene Symbol
Instruction:
    Return the name of the cancer corresponding to the given gene symbol.
Input:
    FOXL2
Prediction:
    granulosa-cell tumour of the ovary
Target:
    granulosa-cell tumour of the ovary
Cancer Prediction from Gene Name
Instruction:
    Give back the cancer name that is associated with the provided gene name.
Input:
    immunoglobulin lambda locus
Prediction:
    Burkitt lymphoma
Target:
    Burkitt lymphoma

Demo

We use the FastChat platform to visually demonstrate the ability of OPI_full_Galactica-6.7B model on various evaluation tasks.

OPI Demo