Part of the **Bio Series** collection: embeddings and NLG related to biology / amino acid sequences (10 items).
This is a work in progress. Essentially, the model is trained to complete prompts of the following form:
Sample protein prompt:
```
Write information about the protein sequence.

### Sequence:
MGCVKLVFFMLYVFLFQLVSSSSLPHLCPEDQALALLQFKNMFTVNPNAFHYCPDITGREIQSYPRTLSWNKSTSCCSWDGVHCDETTGQVIALDLRCSQLQGKFHSNSSLFQLSNLKRLDLSNNNFIGSLISPKFGEFSDLTHLDLSDSSFTGVIPSEISHLSKLHVLLIGDQYGLSIVPHNFEPLLKNLTQLRELNLYEVNLSSTVPSNFSSHLTTLQLSGTGLRGLLPERVFHLSDLEFLDLSYNSQLMVRFPTTKWNSSASLMKLYVHSVNIADRIPESFSHLTSLHELDMGYTNLSGPIPKPLWNLTNIESLDLRYNHLEGPIPQLPIFEKLKKLSLFRNDNLDGGLEFLSFNTQLERLDLSSNSLTGPIPSNISGLQNLECLYLSSNHLNGSIPSWIFSLPSLVELDLSNNTFSGKIQEFKSKTLSAVTLKQNKLKGRIPNSLLNQKNLQLLLLSHNNISGHISSAICNLKTLILLDLGSNNLEGTIPQCVVERNEYLSHLDLSKNRLSGTINTTFSVGNILRVISLHGNKLTGKVPRSMINCKYLTLLDLGNNMLNDTFPNWLGYLFQLKILSLRSNKLHGPIKSSGNTNLFMGLQILDLSSNGFSGNLPERILGNLQTMKEIDESTGFPEYISDPYDIYYNYLTTISTKGQDYDSVRILDSNMIINLSKNRFEGHIPSIIGDLVGLRTLNLSHNVLEGHIPASFQNLSVLESLDLSSNKISGEIPQQLASLTFLEVLNLSHNHLVGCIPKGKQFDSFGNTSYQGNDGLRGFPLSKLCGGEDQVTTPAELDQEEEEEDSPMISWQGVLVGYGCGLVIGLSVIYIMWSTQYPAWFSRMDLKLEHIITTKMKKHKKRY

### Annotation:
Involved in plant defense. Confers resistance to the fungal pathogen C.fulvum through recognition of the AVR9 elicitor protein.
Subcellular locations: Cell membrane<|end_of_text|>
```
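For reference, a training example in this format can be assembled with a small helper. This is a sketch based on the sample above; the `format_example` name is an assumption, and the `<|end_of_text|>` token mirrors the one shown at the end of the annotation.

```python
def format_example(sequence: str, annotation: str, eos: str = "<|end_of_text|>") -> str:
    """Assemble one training example in the prompt format shown above.

    Hypothetical helper for illustration; not part of the released code.
    """
    return (
        "Write information about the protein sequence.\n\n"
        f"### Sequence:\n{sequence}\n\n"
        f"### Annotation:\n{annotation}{eos}"
    )

# Truncated sequence and annotation, for illustration only
print(format_example("MGCVKLVFF", "Involved in plant defense."))
```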
Training Notebook: https://colab.research.google.com/drive/1hrkO2LLt1PRk5jrZaKR0iR4NNB_1vVJK?usp=sharing
Inference Notebook: https://colab.research.google.com/drive/1l3-wMfjVTM_a2BDKT2kf8Y6WT354Vgcr?usp=sharing
```python
from transformers import AutoPeftModelForCausalLM, AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "monsoon-nlp/llama3-protein-annotation", load_in_4bit=True
).to("cuda")
# remember to use the base model for the tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")

sequence = "MGCVKLVFF..."  # replace with the full amino acid sequence
prefix = "Write information about the protein sequence.\n\n### Sequence:\n"
annotation = "\n\n### Annotation:\n"
inputs = tokenizer(f"{prefix}{sequence}{annotation}", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
# decode only the newly generated annotation tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.