Part of the **Bio Series** collection: embeddings and NLG related to biology / amino acid sequences (10 items).
This is a work in progress. Essentially, the model is trained to complete prompts of the following form:
Sample protein prompt:
```
Write information about the protein sequence.

### Sequence:
MGCVKLVFFMLYVFLFQLVSSSSLPHLCPEDQALALLQFKNMFTVNPNAFHYCPDITGREIQSYPRTLSWNKSTSCCSWDGVHCDETTGQVIALDLRCSQLQGKFHSNSSLFQLSNLKRLDLSNNNFIGSLISPKFGEFSDLTHLDLSDSSFTGVIPSEISHLSKLHVLLIGDQYGLSIVPHNFEPLLKNLTQLRELNLYEVNLSSTVPSNFSSHLTTLQLSGTGLRGLLPERVFHLSDLEFLDLSYNSQLMVRFPTTKWNSSASLMKLYVHSVNIADRIPESFSHLTSLHELDMGYTNLSGPIPKPLWNLTNIESLDLRYNHLEGPIPQLPIFEKLKKLSLFRNDNLDGGLEFLSFNTQLERLDLSSNSLTGPIPSNISGLQNLECLYLSSNHLNGSIPSWIFSLPSLVELDLSNNTFSGKIQEFKSKTLSAVTLKQNKLKGRIPNSLLNQKNLQLLLLSHNNISGHISSAICNLKTLILLDLGSNNLEGTIPQCVVERNEYLSHLDLSKNRLSGTINTTFSVGNILRVISLHGNKLTGKVPRSMINCKYLTLLDLGNNMLNDTFPNWLGYLFQLKILSLRSNKLHGPIKSSGNTNLFMGLQILDLSSNGFSGNLPERILGNLQTMKEIDESTGFPEYISDPYDIYYNYLTTISTKGQDYDSVRILDSNMIINLSKNRFEGHIPSIIGDLVGLRTLNLSHNVLEGHIPASFQNLSVLESLDLSSNKISGEIPQQLASLTFLEVLNLSHNHLVGCIPKGKQFDSFGNTSYQGNDGLRGFPLSKLCGGEDQVTTPAELDQEEEEEDSPMISWQGVLVGYGCGLVIGLSVIYIMWSTQYPAWFSRMDLKLEHIITTKMKKHKKRY

### Annotation:
Involved in plant defense. Confers resistance to the fungal pathogen C.fulvum through recognition of the AVR9 elicitor protein.
Subcellular locations: Cell membrane<|end_of_text|>
```
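For reference, a training example in this format can be assembled with a small helper. This is a sketch based on the sample above; the `format_example` name is an assumption, and the `<|end_of_text|>` token mirrors the one shown at the end of the annotation.

```python
def format_example(sequence: str, annotation: str, eos: str = "<|end_of_text|>") -> str:
    """Assemble one training example in the prompt format shown above.

    Hypothetical helper for illustration; not part of the released code.
    """
    return (
        "Write information about the protein sequence.\n\n"
        f"### Sequence:\n{sequence}\n\n"
        f"### Annotation:\n{annotation}{eos}"
    )

# Truncated sequence and annotation, for illustration only
print(format_example("MGCVKLVFF", "Involved in plant defense."))
```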
Training Notebook: https://colab.research.google.com/drive/1hrkO2LLt1PRk5jrZaKR0iR4NNB_1vVJK?usp=sharing
Inference Notebook: https://colab.research.google.com/drive/1l3-wMfjVTM_a2BDKT2kf8Y6WT354Vgcr?usp=sharing
```python
from transformers import AutoPeftModelForCausalLM, AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "monsoon-nlp/llama3-protein-annotation", load_in_4bit=True
).to("cuda")
# remember to use the base model for the tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")

sequence = "MGCVKLVFF..."  # replace with the full amino acid sequence
prefix = "Write information about the protein sequence.\n\n### Sequence:\n"
annotation = "\n\n### Annotation:\n"
inputs = tokenizer(f"{prefix}{sequence}{annotation}", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
# decode only the newly generated annotation tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.