VirusT5: Harnessing Large Language Models to Predict SARS-CoV-2 Evolution
GitHub Link - https://github.com/vrmarathe/VirusT5
Overview
VirusT5 is a transformer-based language model built on the T5 architecture, designed to predict SARS-CoV-2 evolution. By modeling viral mutations as a "mutation-as-translation" process, VirusT5 captures mutation patterns in the Receptor-Binding Domain (RBD) of the spike protein, identifies mutation hotspots, and forecasts future viral strains.
Features
- Variant Classification: Accurately classifies SARS-CoV-2 variants based on RBD sequences.
- Mutation Prediction: Translates parental RBD sequences into evolved child sequences.
- Generative Evolution: Simulates multi-generational viral evolution.
How It Works
VirusT5 is pretrained on 100,000 SARS-CoV-2 genome sequences from the GISAID database. Fine-tuning involves tasks like:
- Classifying RBD variant types.
- Translating parent-child mutation pairs to predict evolutionary changes.
- Simulating mutations across multiple viral generations.
How To Use The Pretrained Model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the tokenizer for the VirusT5 model
tokenizer = AutoTokenizer.from_pretrained("vrmarathe/VirusT5", trust_remote_code=True)
# Load the pre-trained VirusT5 model (T5-based)
model = AutoModelForSeq2SeqLM.from_pretrained("vrmarathe/VirusT5", trust_remote_code=True, from_flax=True)
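Once loaded, the model can be used like any other Hugging Face seq2seq model. Below is a minimal, illustrative generation sketch; the placeholder sequence and raw-string tokenization are assumptions, and the exact input formatting used during fine-tuning is defined by the scripts in the repository.
# Illustrative only: predict an evolved (child) sequence from a parent sequence.
# The placeholder input and raw-string tokenization are assumptions; see the
# fine-tuning scripts in the repository for the exact preprocessing.
parent_rbd = "AATCTATACAGGCTTGTTAAAGGT"  # placeholder nucleotide fragment
inputs = tokenizer(parent_rbd, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))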
Performance Highlights
- Variant Classification Accuracy: 97.29%
- Mutation Translation BLEU Score: 0.999
- Multi-Generational Evolution Simulation Accuracy: 100%
Installation
Clone the repository and set up the required dependencies:
git clone https://github.com/vrmarathe/VirusT5.git
cd VirusT5
cd environment
conda env create -f flax2_environment.yml
Datasets
VirusT5 was trained and fine-tuned using the following datasets:
1. Genome Dataset
- Description: This dataset comprises 100,000 complete SARS-CoV-2 genome sequences, randomly sampled from the GISAID database.
- Usage: Used during the pretraining phase to help the model learn mutation patterns in the SARS-CoV-2 genome.
- Details:
- Segmented into non-overlapping sequences of up to 512 base pairs (see the sketch below).
- Processed using a masked language modeling objective.
- Source: GISAID Database
- Preprocessing Link and Code: https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main
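A minimal sketch of the segmentation step described above; the function and example genome below are illustrative, not taken from the preprocessing repository.
# Illustrative only: split a genome string into non-overlapping chunks of at most 512 bases.
def segment_genome(genome: str, chunk_size: int = 512):
    return [genome[i:i + chunk_size] for i in range(0, len(genome), chunk_size)]
# Example with a placeholder genome sequence
genome = "ATG" * 1000
chunks = segment_genome(genome)
print(len(chunks), len(chunks[0]), len(chunks[-1]))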
2. Receptor Binding Domain (RBD) Dataset
- Description: Contains genetic sequences encoding the receptor-binding domain of the SARS-CoV-2 spike protein.
- Usage:
- Fine-tuning for variant classification tasks.
- Generating the Parent-Child dataset for evolutionary studies.
- Preprocessing for Pretraining and Fine-Tuning Datasets: https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main
- Details:
- Codon-aware multiple sequence alignment (MSA) performed using MUSCLE.
- Mapped to reference genome (NCBI: NC_004718.3).
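Codon-aware alignment is typically done by translating the coding sequences, aligning the proteins (here, MUSCLE run externally), and threading the original codons back onto the protein alignment. A minimal sketch of that back-threading step, assuming Biopython is available; the helper and example sequences are illustrative, not the repository's code.
from Bio.Seq import Seq
def backtranslate_alignment(aligned_protein: str, nucleotide_seq: str) -> str:
    # Thread codons from the original nucleotide sequence onto an aligned protein.
    codons = [nucleotide_seq[i:i + 3] for i in range(0, len(nucleotide_seq), 3)]
    out, codon_idx = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")          # gap in the protein alignment -> codon gap
        else:
            out.append(codons[codon_idx])
            codon_idx += 1
    return "".join(out)
# Example: placeholder coding sequence (Met-Ala-Cys) with a pretend MUSCLE gap
nt = "ATGGCTTGT"
protein = str(Seq(nt).translate())     # "MAC"
aligned = "M-AC"
print(backtranslate_alignment(aligned, nt))   # "ATG---GCTTGT"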
3. Parent-Child Dataset
- Description: Contains pairs of RBD sequences where one sequence acts as the evolutionary parent of the other.
- Usage: Fine-tuning for "mutation-as-translation" tasks, where the model predicts the child sequence from the parent sequence.
- Preprocessing for Pretraining and Fine-Tuning Datasets: https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main
- Details:
- Constructed from RBD sequences divided into 10 temporal bins.
- Includes 500,000 parent-child pairs sampled across Alpha, Delta, Omicron, and non-VOC variants.
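A schematic sketch of how parent-child pairs can be sampled from consecutive temporal bins; the pairing criteria used for the actual dataset follow the paper, and the toy data below are placeholders.
import random
# Schematic only: pair sequences from consecutive temporal bins so that the
# earlier sequence serves as the "parent" and the later one as the "child".
def sample_parent_child_pairs(bins, n_pairs):
    pairs = []
    for _ in range(n_pairs):
        b = random.randrange(len(bins) - 1)   # pick a bin and its successor
        parent = random.choice(bins[b])
        child = random.choice(bins[b + 1])
        pairs.append((parent, child))
    return pairs
# Example with toy bins of placeholder RBD fragments
bins = [["AATGGT", "AATGGA"], ["AATGCT", "AGTGGT"], ["AGTGCT"]]
print(sample_parent_child_pairs(bins, 3))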
Notes
- Access: While the datasets rely on public resources like GISAID, access may require registration or compliance with their terms of use.
- Preprocessing: Preprocessing scripts for dataset preparation are available in the Preprocessing for Pretraining and Fine-Tuning Datasets repository linked above.
- Datasets will be provided on request.
Pretraining and Fine-Tuning
Pretraining
VirusT5 was pretrained on a large corpus of SARS-CoV-2 genome sequences to learn the underlying syntax and grammar of genomic data.
- Dataset: Genome Dataset comprising 100,000 SARS-CoV-2 genome sequences from GISAID.
- Objective: Masked Language Modeling (MLM) with 15% token masking using sentinel tokens.
- Sequence Length: Segmented into sequences of up to 512 base pairs.
- Optimization:
- Inverse square root learning rate schedule.
- Initial learning rate: 0.005 for 2,000 steps, followed by exponential decay (see the sketch below).
- Training Hardware:
- NDSU CCAST HPC clusters with 32 CPU cores, 100 GB RAM, and two NVIDIA A40 GPUs (40 GB each).
- Duration: Pretrained for 12,000 steps.
- The scripts for pretraining can be found in the pretraining folder.
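For reference, the schedule described above can be sketched as follows; the tail here is a standard T5-style inverse square root decay and is only an approximation of the exponential decay mentioned in the bullets.
import math
def inverse_sqrt_lr(step: int, init_lr: float = 0.005, warmup_steps: int = 2000) -> float:
    # Hold init_lr for the first warmup_steps, then decay proportionally to 1/sqrt(step).
    # The exact decay used in training may differ from this sketch.
    return init_lr * math.sqrt(warmup_steps / max(step, warmup_steps))
print(inverse_sqrt_lr(100), inverse_sqrt_lr(2000), inverse_sqrt_lr(8000))  # 0.005, 0.005, 0.0025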
Fine-Tuning
Fine-tuning tailored the pretrained VirusT5 model for specific downstream tasks, such as classification and mutation prediction.
Tasks
Variant Classification:
- Dataset: RBD Dataset, divided into training (60%), validation (20%), and test (20%) sets.
- Objective: Predict variant types (e.g., Alpha, Delta, Omicron, non-VOC) from RBD sequences.
- Result: Achieved 97.29% accuracy.
- The original fine-tuning script for RBD classification can be found in the rbd-classification folder (rbd-classifier).
- The general classifier script, which can be used for other classification experiments, can be found in General Classification.
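Since T5 is a text-to-text model, variant classification is framed as generating the label string from a prefixed input. The formatting sketch below is schematic; the task prefix and label strings are assumptions rather than the exact ones used in the fine-tuning scripts.
# Schematic text-to-text formatting for variant classification.
# The task prefix and label vocabulary are illustrative assumptions.
def make_classification_example(rbd_sequence: str, label: str):
    return {
        "input_text": "classify variant: " + rbd_sequence,
        "target_text": label,   # e.g. "Alpha", "Delta", "Omicron", "non-VOC"
    }
example = make_classification_example("AATGGTACTAAG", "Delta")  # placeholder sequence
print(example)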
Mutation Translation:
- Dataset: Parent-Child Dataset with 500,000 RBD sequence pairs representing evolutionary parent-child relationships.
- Objective: Predict how an RBD sequence evolves from one generation to the next.
- The original fine-tuning script for RBD translation/evolution prediction can be found in the RBD-translation folder.
- The general mutation translation script, which can be used for other experiments, can be found in Translation-general.
- Evaluation:
- BLEU Score: 0.999
- Sequence Identity: 99.97% ± 0.1%
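For intuition, sequence identity can be computed as the fraction of matching positions between the predicted and reference child sequences; the sketch below is a simple position-wise version and may differ from the paper's exact metric.
# Position-wise sequence identity between a predicted and a reference sequence
# of equal length; a simple illustration, not the paper's exact metric.
def sequence_identity(predicted: str, reference: str) -> float:
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)
print(sequence_identity("AATGGT", "AATGCT"))  # 5 of 6 positions match -> 0.833...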
For Other Tasks
- The model is based on the T5 architecture and can be fine-tuned for similar DNA/genome/virus-related tasks, analogous to the tasks T5 was fine-tuned on, such as summarization and question answering.
Fine-Tuning Process
- The model was trained and validated over multiple epochs until convergence, stopping when both training and validation losses stabilized.
- The following split was used for all datasets (see the sketch below):
- Training: 60%
- Validation: 20%
- Testing: 20%
- Fine-tuning used similar hardware as pretraining.
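An illustrative way to produce the 60/20/20 split with scikit-learn; the repository's own preprocessing scripts define the actual split.
from sklearn.model_selection import train_test_split
# Illustrative 60/20/20 split: carve out 40% for validation + test, then split that in half.
records = [f"SEQ_{i}" for i in range(10)]   # placeholder sequence records
train, rest = train_test_split(records, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))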
Citation
If you use VirusT5 in your research, please cite the following paper:
@misc{marathe2024virust5harnessinglargelanguage,
title={VirusT5: Harnessing Large Language Models to Predicting SARS-CoV-2 Evolution},
author={Vishwajeet Marathe and Deewan Bajracharya and Changhui Yan},
year={2024},
eprint={2412.16262},
archivePrefix={arXiv},
primaryClass={q-bio.QM},
url={https://arxiv.org/abs/2412.16262},
}