Running "example code" inside README.md: the output is the input sequence(target seq, or called antigen) extended with about 150-200 amino acids

#1
by guruace - opened

I did as the title says, but the output is odd: it contains the whole input sequence (such as the PD1 sequence) plus 150 or so amino acids attached at the end. I don't understand why. Please advise. Thank you!

Antibody Generation org

We used ANARCI to parse the output of the model. You can see how we installed ANARCI and what we did here:
https://github.com/joethequant/antibodygpt

https://github.com/joethequant/antibodygpt/blob/main/4_run_models_w_anarci.ipynb

https://github.com/joethequant/antibodygpt/blob/main/5_model_grading.ipynb

Here is the predict function we used. In this script we are just checking whether ANARCI recognizes the generated sequence as a valid antibody.

```python
import torch

# run_anarci is a helper from the antibodygpt repo that wraps ANARCI and
# returns the parsed species, e-value, score, and heavy/light chains.

def predict_sequence(model, tokenizer, sequence, device='cuda:0', number_of_sequences=1):
    # Tokenize the sequence
    tokenized_sequence = tokenizer.encode(sequence)

    # Convert to PyTorch tensor and add batch dimension
    input_tensor = torch.tensor([tokenized_sequence.ids]).to(device)

    # Pass the tensor through the model
    with torch.no_grad():
        output = model.generate(input_tensor, max_length=1024, pad_token_id=tokenizer.encode('<|pad|>').ids[0], do_sample=True, top_p=0.9, temperature=0.8, num_return_sequences=number_of_sequences)

        as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
        sequences = tokenizer.decode_batch(as_lists(output))

        if len(sequences) > 0:
            sequences = [x.replace('2', '') for x in sequences]  # replace stop token with empty string
        else:
            return []

        sequence_with_heavy_and_light_chains = []

        # filter out sequences that don't have both heavy and light chains
        for sequence in sequences:
            species, e_value, score, heavy_chain, light_chain = run_anarci(sequence)
            if (len(heavy_chain) > 0) and (len(light_chain) > 0):
                sequence_with_heavy_and_light_chains.append(sequence)

        return sequence_with_heavy_and_light_chains
```
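Two notes on using it: `model.generate` on a decoder-only model returns the prompt tokens followed by the newly sampled tokens, so the decoded string will always start with your input antigen; that is expected, and the ANARCI step above is what extracts the antibody chains from it. A minimal call might look like the sketch below (the checkpoint path, tokenizer file, and antigen sequence are placeholders, not the exact ones we used):

```python
from tokenizers import Tokenizer
from transformers import AutoModelForCausalLM

# Placeholder paths -- substitute the checkpoint and tokenizer you actually downloaded.
# trust_remote_code may be needed if the checkpoint ships custom model code.
model = AutoModelForCausalLM.from_pretrained("path/to/antibodygpt_checkpoint",
                                             trust_remote_code=True).to('cuda:0')
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")

antigen = "MQIPQAPWPVVWAVLQLGWR"  # placeholder target/antigen sequence
antibodies = predict_sequence(model, tokenizer, antigen, device='cuda:0', number_of_sequences=5)
print(antibodies)  # sequences that ANARCI parsed into both a heavy and a light chain
```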

If you want the full ANARCI output written to a CSV, the repo has some code that we import and then call in the grading script.

```python
from seq import ab_number as abn
df_result_H, df_result_KL = abn.number_seqs_as_df(sampled_sequences)
```
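From there, writing the two tables out is just standard pandas (the file names below are arbitrary):

```python
# Save the ANARCI numbering results to CSV; file names are arbitrary.
df_result_H.to_csv("anarci_heavy_chains.csv", index=False)
df_result_KL.to_csv("anarci_light_chains.csv", index=False)
```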

I need some time to digest your code, since my background is biology, not computer science. My understanding is that if I run your Python scripts from 1 to 5 at "https://github.com/joethequant/antibodygpt/", I will get the same results as your demo (shown at "https://orca-app-ygzbp.ondigitalocean.app/Demo_Antibody_Generator")? By the way, the notebook 1_download_pretrained_checkpoints.ipynb at "https://github.com/joethequant/antibodygpt/" seems to be corrupted and unusable; please kindly re-upload a good one (the other notebooks are fine and can be opened).

Your work is fascinating, and my group may test some results from your model in my wet lab if I can reproduce your results (clone the antibodies designed by your models and compare them with other methods). Thank you so much for your prompt response, and we all need to keep working.

Antibody Generation org

WOW! That would be great; please keep us updated with the results; we are happy to help!

I fixed the 1_download_pretrained_checkpoints.ipynb file. You only need this if you want to fine-tune from the ProGen2 foundation models. Once you download the models, you can run 4_run_models_w_anarci.ipynb. We have Weights & Biases embedded for logging; you can remove the wandb lines, or get a free Weights & Biases account and enter your API key when it asks.
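If you would rather keep the wandb lines but skip the logging, one option is to disable Weights & Biases via an environment variable at the top of the notebook:

```python
import os

# Turn Weights & Biases into a no-op for this session;
# wandb.init()/wandb.log() calls will still run but log nothing.
os.environ["WANDB_MODE"] = "disabled"
```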

If you want to generate sequences, you can run 5_model_grading.ipynb; it will automatically download our trained weights from Hugging Face, run the outputs through ANARCI, and write them out to CSVs.
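If you prefer to pull the weights outside the notebook, a quick sketch with huggingface_hub works too (the repo id below is a placeholder; use the model id referenced in 5_model_grading.ipynb):

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the model id used in 5_model_grading.ipynb.
local_dir = snapshot_download(repo_id="your-org/antibodygpt-checkpoint")
print(local_dir)  # local path containing the downloaded checkpoint files
```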

Did you have any issues installing ANARCI?

Also, we have a serverless RunPod Docker image that runs the model:

Here is the Streamlit app:

Thank you for everything. 1. The file "1_download_pretrained_checkpoints.ipynb" is working fine now. 2. Yes, we did download your Docker image with "docker pull robertsj32/antibody_generation_runpod", but when we ran it with "docker run -i -t <IMAGE ID>", it showed this error:

"--- Starting Serverless Worker | Version 1.3.4 ---
WARN | test_input.json not found, exiting."

and when we checked inside the container with "docker run -it --entrypoint /bin/bash <IMAGE ID>", we could not find the input file "test_input.json" under the root directory. Presumably, test_input.json should be in the root directory, as shown in your GitHub repo.
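In case it helps to confirm: below is a sketch of what we think would need to be mounted into the container for local testing. The outer "input" wrapper is RunPod's standard payload format; the inner keys are only our guess at what your handler expects.

```python
import json

# RunPod serverless workers look for test_input.json when run locally.
# The {"input": ...} wrapper is RunPod's standard format; the inner keys
# are only a guess at what this handler expects.
payload = {"input": {"sequence": "MQIPQAPWPVVWAVLQLGWR", "num_return_sequences": 1}}

with open("test_input.json", "w") as f:
    json.dump(payload, f)

# Then mount it into the container's root directory, e.g.:
#   docker run -it -v $(pwd)/test_input.json:/test_input.json robertsj32/antibody_generation_runpod
```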

guruace changed discussion status to closed
Antibody Generation org

Hi, it has been a while. I wanted to check in and see if you have had any success or need help with anything.

Cheers,
Joe

hi Joe,

Some wet lab work is ongoing, and it will take some time to get results. In the meantime, the computational antibody design work from David Baker's lab at UW may somewhat distract us from your work, since Baker's lab has shown wet lab results (https://www.biorxiv.org/content/10.1101/2024.03.14.585103v1). We are all working hard. Please email me to discuss in depth if you want to (guruace@163.com).
