---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Masked Discrete Latent Diffusion Model for Protein Sequence Generation

This repository implements a masked discrete latent diffusion model for generating protein sequences. It uses ESM-2-650M for the latent representation of sequences and the MDLM framework for diffusion.

## Directory Structure

```
project/
│
├── configs/
│   └── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
│
├── models/
│   └── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   └── generate.py
│
├── utils/
│   ├── data_loader.py
│   └── esm_utils.py
│
├── checkpoints/
│   └── example.ckpt  # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md
```

## Setup and Requirements

### Prerequisites

- Python 3.8+
- CUDA (for GPU support)

### Install Dependencies

1. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

2. Install the required packages:
   ```bash
   pip install -r requirements.txt
   ```

### Prepare Data

Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Each CSV file must contain a column named `sequence` holding the protein sequences.
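
Before training, it can help to confirm that each file really has the expected column. The following is a minimal sketch using only the standard library; the helper name `check_sequence_column` is hypothetical and not part of this repository.

```python
import csv


def check_sequence_column(csv_path, column="sequence"):
    """Return the number of non-empty sequences in csv_path.

    Raises ValueError if the header has no `column` field.
    (Hypothetical helper, not part of the repository.)
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"{csv_path} has no '{column}' column")
        # Count rows whose sequence field is present and not blank.
        return sum(1 for row in reader if (row[column] or "").strip())
```

Calling `check_sequence_column("data/train.csv")` (and likewise for `val.csv` and `test.csv`) then reports how many usable sequences each split contains.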

## Configuration

Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:

```python
class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,  # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
```
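
When adjusting `batch_size`, `gpus`, or `accumulate_grad_batches`, keep in mind how they combine. The sketch below works out the number of sequences per optimizer step for the example values, under the assumption (not confirmed by this repository) that training is data-parallel and `batch_size` is the per-GPU batch size.

```python
# Values copied from the example `training` dictionary.
training = {"batch_size": 32, "gpus": 8, "accumulate_grad_batches": 2}


def effective_batch_size(cfg):
    """Sequences contributing to one optimizer step.

    Assumes data-parallel training with a per-GPU `batch_size`.
    """
    return cfg["batch_size"] * cfg["gpus"] * cfg["accumulate_grad_batches"]


print(effective_batch_size(training))  # 32 * 8 * 2 = 512
```

If you train on fewer GPUs, raising `accumulate_grad_batches` by the same factor keeps this product unchanged.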

## Mathematical Formulations

### Forward Diffusion

The forward diffusion process adds noise to the latent representations of the protein sequences:

\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]

where:

- \(\sigma\) is the noise level.
- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
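
The forward step can be illustrated on a toy vector. This is a minimal sketch in plain Python rather than the code in `models/diffusion.py`; the function name `add_noise` and the toy values are assumptions made for illustration.

```python
import random


def add_noise(latents, sigma, rng=random):
    """Return latents + sigma * eps, with eps drawn independently from N(0, 1)."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in latents]


rng = random.Random(0)            # seeded so the example is reproducible
latents = [0.5, -1.2, 0.0, 2.0]   # toy stand-in for one latent vector
noisy = add_noise(latents, sigma=0.1, rng=rng)
```

With `sigma = 0` the latents are returned unchanged, and larger values of `sigma` move the noisy latents further from the originals.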

### Reverse Diffusion

The reverse diffusion process denoises the latent representations:

\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]

where the backbone model predicts the denoised latent representations.

### Loss Function

The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:

\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
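
The three formulas combine into a single loss computation. The sketch below uses plain Python and a hypothetical `identity_backbone` that performs no denoising; it stands in for the real backbone only to show the data flow and is not the repository's model.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def identity_backbone(noisy_latents, sigma):
    """Hypothetical placeholder: returns its input unchanged (no denoising)."""
    return list(noisy_latents)


def training_loss(latents, sigma, backbone, rng):
    noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in latents]  # forward diffusion
    denoised = backbone(noisy, sigma)                           # reverse diffusion
    return mse(denoised, latents)                               # loss


rng = random.Random(0)
latents = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
loss = training_loss(latents, sigma=0.5, backbone=identity_backbone, rng=rng)
```

Because the placeholder leaves the noise in place, the expected loss here is \(\sigma^2\); a trained backbone should bring it below that baseline.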

## Training

To train the model, run the `train.py` script:

```bash
python scripts/train.py
```

This script will:

- Load the ESM-2-650M model and tokenizer from Hugging Face.
- Prepare the data loaders for training and validation datasets.
- Initialize the latent diffusion model.
- Train the model using the specified configurations.
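
One of the specified training settings is `"ema": 0.999`. Assuming this value is the decay of an exponential moving average of the model weights (the README does not say so explicitly), the standard update rule looks like the following sketch; `ema_update` is a hypothetical name and the "weights" are toy numbers.

```python
def ema_update(ema_params, params, decay=0.999):
    """Standard EMA rule: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


ema = [0.0, 0.0]
params = [1.0, -2.0]  # toy "weights", held constant to isolate the averaging
for _ in range(1000):
    ema = ema_update(ema, params)
```

With a decay this close to 1, the average moves slowly: after 1000 updates it has covered a fraction \(1 - 0.999^{1000}\) of the distance to the current weights.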

## Testing

To test the model, run the `test.py` script:

```bash
python scripts/test.py
```

This script will:

- Load the trained model from the checkpoint.
- Prepare the data loader for the test dataset.
- Evaluate the model on the test dataset.

## Generating Protein Sequences

To generate protein sequences, use the `generate.py` script. This script supports three strategies:

1. **Generating a Scaffold to Connect Multiple Peptides**:
   ```bash
   python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
   ```
   Example:
   ```bash
   python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
   ```

2. **Filling in Specified Regions in a Given Protein Sequence**:
   ```bash
   python scripts/generate.py fill <sequence_with_X>
   ```
   Example:
   ```bash
   python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
   ```

3. **Purely De Novo Generation of a Protein Sequence**:
   ```bash
   python scripts/generate.py de_novo <sequence_length>
   ```
   Example:
   ```bash
   python scripts/generate.py de_novo 50
   ```
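
To make the `fill` and `scaffold` inputs concrete, the sketch below shows one way to read them: locating the `X` runs that are to be filled, and laying out a scaffold template of the requested length. Both helpers are hypothetical, and the even split of free positions between peptides is an assumption; `generate.py` may handle these inputs differently.

```python
def masked_spans(sequence, mask_char="X"):
    """Return (start, end) index pairs, end exclusive, for each run of mask_char."""
    spans, start = [], None
    for i, residue in enumerate(sequence):
        if residue == mask_char and start is None:
            start = i                      # a masked run begins here
        elif residue != mask_char and start is not None:
            spans.append((start, i))       # the run ended just before i
            start = None
    if start is not None:
        spans.append((start, len(sequence)))
    return spans


def scaffold_template(peptides, final_length, mask_char="X"):
    """Join peptides in order, padding the gaps between them with mask_char
    so that the template has exactly final_length positions."""
    free = final_length - sum(len(p) for p in peptides)
    gaps = len(peptides) - 1
    if gaps < 1 or free < gaps:
        raise ValueError("need at least two peptides and one free position per gap")
    base, extra = divmod(free, gaps)       # spread free positions as evenly as possible
    parts = []
    for i, peptide in enumerate(peptides[:-1]):
        parts.append(peptide)
        parts.append(mask_char * (base + (1 if i < extra else 0)))
    parts.append(peptides[-1])
    return "".join(parts)


fill_spans = masked_spans("MKTAYIAKXXXXXXXLEERLGLIEVQ")
template = scaffold_template(["MKTAYIAKQRQ", "GLIEVQ"], 30)
```

Under this reading, the scaffold example reduces to a `fill` problem: the two peptides are kept fixed and only the masked positions between them are generated.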

## Notes

- Ensure you have a compatible CUDA environment if you are training on GPUs.
- Modify the paths and configurations in `configs/config.py` as needed to match your setup.

## Acknowledgements

This implementation is based on the MDLM framework and uses the ESM-2-650M model.