---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---
# Masked Discrete Latent Diffusion Model for Protein Sequence Generation
Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.
## Directory Structure
```
project/
│
├── configs/
│   ├── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   ├── test.csv
│
├── models/
│   ├── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   ├── generate.py
│
├── utils/
│   ├── data_loader.py
│   ├── esm_utils.py
│
├── checkpoints/
│   ├── example.ckpt    # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md
```
## Setup and Requirements
### Prerequisites
- Python 3.8+
- CUDA (for GPU support)
### Install Dependencies
1. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
2. Install the required packages:
```bash
pip install -r requirements.txt
```
### Prepare Data
Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.
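Before training, it can help to confirm that each file has the expected column. The following is a minimal sketch using only the standard library; the helper name `read_sequences` is hypothetical and is not part of this repository.

```python
import csv
import io


def read_sequences(handle):
    """Return the non-empty values of the `sequence` column from an open CSV file."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'sequence'")
    sequences = []
    for row in reader:
        # Rows with a missing or blank value are skipped.
        seq = (row.get("sequence") or "").strip()
        if seq:
            sequences.append(seq)
    return sequences


# In-memory example; for real data, pass open("data/train.csv", newline="").
example = io.StringIO("sequence\nMKTAYIAKQRQ\nGLIEVQ\n")
sequences = read_sequences(example)
print(len(sequences), "sequences loaded")
```

The same call can be repeated for `val.csv` and `test.csv` so that a missing column is caught before a training run starts.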
## Configuration
Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:
```python
class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,  # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
```
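When adjusting `batch_size`, `gpus`, or `accumulate_grad_batches`, it is worth checking how many sequences contribute to each optimizer step. The sketch below assumes that `batch_size` is a per-GPU value and that training is data-parallel; neither is stated in this README, so treat the result as an estimate. The function name is hypothetical.

```python
# Values copied from the example configuration.
batch_size = 32              # assumed to be per GPU
gpus = 8
accumulate_grad_batches = 2


def effective_batch_size(per_device, devices, accumulation_steps):
    """Sequences contributing to one optimizer step under data-parallel training."""
    return per_device * devices * accumulation_steps


effective = effective_batch_size(batch_size, gpus, accumulate_grad_batches)
print(effective)  # 512
```

If you train on fewer GPUs, raising `accumulate_grad_batches` by the same factor keeps this product unchanged.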
## Mathematical Formulations
### Forward Diffusion
The forward diffusion process adds noise to the latent representations of the protein sequences:
\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
where:
- \(\sigma\) is the noise level.
- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
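As an illustration of the noising step, here is a minimal NumPy sketch. It is not the code in `models/diffusion.py`; the function name `add_noise`, the array shape, and the value of \(\sigma\) are assumptions chosen for the example (1280 matches `latent_dim` in the configuration).

```python
import numpy as np


def add_noise(latents, sigma, rng):
    """Return latents + sigma * eps with eps ~ N(0, 1), together with eps."""
    eps = rng.standard_normal(latents.shape)
    return latents + sigma * eps, eps


rng = np.random.default_rng(0)
# Hypothetical batch: 2 sequences, 5 residues, 1280-dimensional embeddings.
latents = rng.standard_normal((2, 5, 1280))
noisy_latents, eps = add_noise(latents, sigma=0.5, rng=rng)
print(noisy_latents.shape)
```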
### Reverse Diffusion
The reverse diffusion process denoises the latent representations:
\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
where the backbone model predicts the denoised latent representations.
### Loss Function
The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
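To make the objective concrete, the sketch below evaluates the loss for a placeholder backbone that returns its input unchanged. The `identity_backbone` function is a hypothetical stand-in, not the trained model, and the shapes and \(\sigma\) are illustrative assumptions.

```python
import numpy as np


def mse(prediction, target):
    """Mean squared error over all elements."""
    return float(np.mean((prediction - target) ** 2))


def identity_backbone(noisy_latents, sigma):
    """Placeholder denoiser (hypothetical): returns its input unchanged."""
    return noisy_latents


rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 16, 1280))
sigma = 0.5

# Noise the latents, "denoise" them with the placeholder, then score the result.
noisy_latents = latents + sigma * rng.standard_normal(latents.shape)
denoised_latents = identity_backbone(noisy_latents, sigma)
loss = mse(denoised_latents, latents)
print(f"loss = {loss:.4f}, sigma^2 = {sigma ** 2:.4f}")
```

A backbone that does nothing incurs a loss close to \(\sigma^2\), so this value is a useful baseline: a backbone that has learned to denoise should score below it at the same noise level.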
## Training
To train the model, run the `train.py` script:
```bash
python scripts/train.py
```
This script will:
- Load the ESM-2-650M model and tokenizer from Hugging Face.
- Prepare the data loaders for training and validation datasets.
- Initialize the latent diffusion model.
- Train the model using the specified configurations.
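The example configuration sets `"ema": 0.999`. This README does not describe how that value is applied, so the following is only a sketch of the standard exponential moving average update over model weights, with plain Python lists standing in for parameter tensors. The function name `ema_update` is hypothetical.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into the running average, one value per parameter."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema_weights = [0.0, 1.0]   # running average kept alongside the model
weights = [1.0, 1.0]       # weights after the latest optimizer step
ema_weights = ema_update(ema_weights, weights)
print(ema_weights)
```

With a decay of 0.999, each step moves the average only 0.1% of the way toward the current weights, which smooths out step-to-step fluctuations.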
## Testing
To test the model, run the `test.py` script:
```bash
python scripts/test.py
```
This script will:
- Load the trained model from the checkpoint.
- Prepare the data loader for the test dataset.
- Evaluate the model on the test dataset.
## Generating Protein Sequences
To generate protein sequences, use the `generate.py` script. This script supports three strategies:
1. **Generating a Scaffold to Connect Multiple Peptides**:
```bash
python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
```
Example:
```bash
python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
```
2. **Filling in Specified Regions in a Given Protein Sequence**:
```bash
python scripts/generate.py fill <sequence_with_X>
```
Example:
```bash
python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
```
3. **Purely De Novo Generation of a Protein Sequence**:
```bash
python scripts/generate.py de_novo <sequence_length>
```
Example:
```bash
python scripts/generate.py de_novo 50
```
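For the `fill` strategy, it can be useful to check which positions are marked with `X` before running generation. The sketch below locates each run of `X` with a regular expression; the helper `masked_spans` is hypothetical and does not reflect how `generate.py` parses its input.

```python
import re


def masked_spans(sequence, placeholder="X"):
    """Return (start, end) index pairs, end exclusive, for each run of the placeholder."""
    pattern = re.escape(placeholder) + "+"
    return [match.span() for match in re.finditer(pattern, sequence)]


spans = masked_spans("MKTAYIAKXXXXXXXLEERLGLIEVQ")
for start, end in spans:
    print(f"positions {start}..{end - 1} ({end - start} residues) would be generated")
```

Indices are zero-based, and residues outside the reported spans are the fixed context that the model is expected to preserve.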
## Notes
- Ensure you have a compatible CUDA environment if you are training on GPUs.
- Modify the paths and configurations in `configs/config.py` as needed to match your setup.
## Acknowledgements
This implementation is based on the MDLM framework and uses the ESM-2-650M model.