AIDO.RNAIF-1.6B / README.md
smahbub's picture
Update README.md
63b37db verified
metadata
base_model:
  - genbio-ai/AIDO.RNA-1.6B

RNA Inverse Folding

RNA inverse folding is a computational method designed to create RNA sequences that fold into predetermined three-dimensional structures. Our study focuses on generating sequences using the known backbone structure of an RNA, defined by the 3D coordinates of its backbone atoms, without any information of the individual bases. Specifically. we fully finetune the AIDO.RNA-1.6B model on the single-state split from Das et al. already processed by Joshi et al.. We use the same train, validation, and test splits used by their method gRNAde. Current version of ModelGenerator contains the inference pipeline for RNA inverse folding. Experimental pipeline on other datasets (both training and testing) will be included in the future.

Setup:

Install ModelGenerator.

  • It is required to use docker to run our inverse folding pipeline.
  • Please set up a docker image using our provided Dockerfile and run the inverse folding inference from within the docker container.
    • Here is an example bash script to set up and access a docker container:
      # clone the ModelGenerator repository
      git clone https://github.com/genbio-ai/ModelGenerator.git
      # cd to "ModelGenerator" folder where you should find the "Dockerfile"
      cd ModelGenerator
      # create a docker image
      docker build -t aido .
      # create a local folder as ModelGenerator's data directory
      mkdir -p $HOME/mgen_data
      # run a container
      docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
      # find the container ID
      docker ps # this will print the running containers and their IDs
      # execute the container with ID=<container_id>
      docker exec -it <container_id> /bin/bash  # now you should be inside the docker container
      # test if you can access the nvidia GPUs
      nvidia-smi # this should print the GPUs' details
      
  • Execute the following steps from within the docker container you just created.
  • Note: Multi-GPU inference for inverse folding is not currently supported and will be included in the future.

Download model checkpoints:

  • Download the model.ckpt checkpoint from here. Place it inside the local directory ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B.

  • Download the gRNAde checkpoint named gRNAde_ARv1_1state_das.h5 from the huggingface-hub or the original source. Place it inside the directory ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models. Set the environment variable gRNAde_CKPT_PATH=${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models/gRNAde_ARv1_1state_das.h5

    Alternatively, you can simply run the following script to do both of these steps:

    mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B
    huggingface-cli download genbio-ai/AIDO.RNAIF-1.6B \
    --repo-type model \
    --local-dir ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B
    # Set the environment variable gRNAde_CKPT_PATH
    export gRNAde_CKPT_PATH=${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models/gRNAde_ARv1_1state_das.h5
    

Download data:

  • Download the data preprocessed by Joshi et al.. Mainly download these two files: processed.pt.zip and processed_df.csv. Place them inside the directory ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/. Please refer to this link for details about the dataset and its preprocessing.

    Alternatively, you run the following script to do it:

    mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/
    huggingface-cli download genbio-ai/rna-inverse-folding \
    --repo-type dataset \
    --local-dir ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/
    

Run inference:

  • From your terminal, change directory to experiments/AIDO.RNA/rna_inverse_folding folder and run the following script:
    cd modelgenerator/rna_inv_fold/gRNAde_structure_encoder
    echo "Running inference.."
    python main.py
    echo "Extracting structure encoding.."
    python main_encoder_only.py
    cd  ../../../experiments/AIDO.RNA/rna_inverse_folding/
    # run inference
    mgen test --config rna_inv_fold_test.yaml \
            --trainer.default_root_dir ${MGEN_DATA_DIR}/modelgenerator/logs/rna_inv_fold/ \
            --ckpt_path ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/model.ckpt \
            --trainer.devices 0, \
            --data.path ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/structure_encoding/
    

Outputs:

  • The evaluation score will be printed on the console.
  • The generated sequences will be stored in ./rnaIF_outputs/designed_sequences.json.
    • In this file, we will have:
      1. "true_seq": the ground truth sequences,
      2. "pred_seq": predicted sequences by our method,
      3. "baseline_seq": predicted sequences by the baseline method gRNAde.
    • An example file content with two test samples is shown below:
      {
        "true_seq": [
            "CCCAGUCCACCGGGUGAGAAGGGGGCAGAGAAACACACGACGUGGUGCAUUACCUGCC",
            "UCCCGUCCACCGCGGUGAGAAGGGGGCAGAGAAACACACGAUCGUGGUACAUUACCUGCC",
        ],
        "pred_seq": [
            "UGGGGAGCCCCCGGGGUGAACCAGCCGGUGAAAGGCACCCGGUGAUCGGUCAGCCCAC",
            "GCGGAUGCCCCGCCCGGUCAACCGCAUGGUGAAAUCCACGCGCCUGGUGGGUUAGCCAUG",
        ],
        "baseline_seq": [
            "UGGUGAGCCCCCGGGGUGAACCAGUAGGUGAAAGGCACCCGGUGAUCGGUCAGCCCAC",
            "GCGGAUGCCGGGCCCGGUCCACCGCAUGGUGAAAUUCAGGCGCCUGGAGGGUUAGCCAUG",
        ]
      }