Sophia Tang committed on
Commit
40e7e76
·
0 Parent(s):

initial commit

Browse files
Files changed (46) hide show
  1. .gitattributes +2 -0
  2. .gitignore +1 -0
  3. README.md +193 -0
  4. assets/mcts.png +3 -0
  5. assets/mdlm.png +3 -0
  6. assets/peptune.png +3 -0
  7. assets/poster.png +3 -0
  8. data/dataloading_for_dynamic_batching.py +156 -0
  9. data/dataset.py +207 -0
  10. scripts/generate_mcts.sh +57 -0
  11. scripts/generate_unconditional.sh +16 -0
  12. scripts/train.sh +18 -0
  13. src/config.py +319 -0
  14. src/config.yaml +164 -0
  15. src/diffusion.py +1015 -0
  16. src/environment.yml +40 -0
  17. src/generate_mcts.py +365 -0
  18. src/generate_unconditional.py +111 -0
  19. src/metrics.py +72 -0
  20. src/noise_schedule.py +152 -0
  21. src/pareto_mcts.py +492 -0
  22. src/roformer.py +74 -0
  23. src/scoring/functions/binding.py +178 -0
  24. src/scoring/functions/binding_utils.py +290 -0
  25. src/scoring/functions/classifiers/hemolysis-xgboost.json +0 -0
  26. src/scoring/functions/classifiers/nonfouling-xgboost.json +0 -0
  27. src/scoring/functions/classifiers/permeability-xgboost.json +3 -0
  28. src/scoring/functions/classifiers/solubility-xgboost.json +0 -0
  29. src/scoring/functions/hemolysis.py +63 -0
  30. src/scoring/functions/nonfouling.py +66 -0
  31. src/scoring/functions/permeability.py +171 -0
  32. src/scoring/functions/scoring_utils.py +94 -0
  33. src/scoring/functions/solubility.py +63 -0
  34. src/scoring/scoring_functions.py +75 -0
  35. src/scoring/tokenizer/my_tokenizers.py +424 -0
  36. src/scoring/tokenizer/new_splits.txt +159 -0
  37. src/scoring/tokenizer/new_vocab.txt +587 -0
  38. src/tokenizer/__init__.py +0 -0
  39. src/tokenizer/my_tokenizers.py +441 -0
  40. src/tokenizer/new_splits.txt +159 -0
  41. src/tokenizer/new_vocab.txt +587 -0
  42. src/train.py +133 -0
  43. src/train_peptune.py +226 -0
  44. src/utils/app.py +1255 -0
  45. src/utils/generate_utils.py +77 -0
  46. src/utils/utils.py +256 -0
.gitattributes ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ src/scoring/functions/classifiers/permeability-xgboost.json filter=lfs diff=lfs merge=lfs -text
2
+ assets/*.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ logs/
README.md ADDED
@@ -0,0 +1,193 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # [PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion](https://arxiv.org/abs/2412.17780) 🧬🔮 (ICML 2025)
2
+
3
+ [**Sophia Tang**](https://sophtang.github.io/)\*, [**Yinuo Zhang**](https://www.linkedin.com/in/yinuozhang98/)\* and [**Pranam Chatterjee**](https://www.chatterjeelab.com/)
4
+
5
+ ![PepTune](assets/poster.png)
6
+
7
+ This is the repository for **[PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion](https://arxiv.org/abs/2412.17780)** 🧬🔮 published at **ICML 2025**. It is partially built on the **[MDLM repo](https://github.com/kuleshov-group/mdlm)** ([Sahoo et al. 2024](https://arxiv.org/abs/2406.07524)).
8
+
9
+ PepTune leverages **Monte-Carlo Tree Search (MCTS)** to guide a generative masked discrete diffusion model which iteratively refines a set of Pareto non-dominated sequences optimized across a set of therapeutic properties, including binding affinity, cell membrane permeability, solubility, non-fouling, and non-hemolysis.
10
+
11
+ ## Environment Installation
12
+
13
+ ```bash
14
+ conda env create -f src/environment.yml
15
+
16
+ conda activate peptune
17
+ ```
18
+
19
+ ## Model Pretrained Weights Download
20
+
21
+ Follow the steps below to download the model weights required for this experiment.
22
+
23
+ 1. Download the PepTune pre-trained MDLM checkpoint and place in `checkpoints/`: https://drive.google.com/file/d/1oXGDpKLNF0KX0ZdOcl1NZj5Czk2lSFUn/view?usp=sharing
24
+ 2. Download the pre-trained binding affinity Transformer model and place in `src/scoring/functions/classifiers/`: https://drive.google.com/file/d/128shlEP_-rYAxPgZRCk_n0HBWVbOYSva/view?usp=sharing
25
+
26
+ ## Training Data Download
27
+
28
+ Download the peptide training dataset from https://drive.google.com/file/d/1yCDr641WVjCtECg3nbG0nsMNu8j7d7gp/view?usp=drive_link and unzip it into the `data/` directory:
29
+
30
+ ```bash
31
+ # Download peptide_data.zip into the data/ directory
32
+ cd data/
33
+
34
+ # Unzip the training data
35
+ unzip peptide_data.zip
36
+
37
+ cd ..
38
+ ```
39
+
40
+ After unzipping, the data should be located at `data/peptide_data/`.
41
+
42
+ ## Repository Structure
43
+
44
+ ```
45
+ PepTune/
46
+ ├── src/
47
+ │ ├── train_peptune.py # Main training script
48
+ │ ├── generate_mcts.py # MCTS-guided peptide generation
49
+ │ ├── generate_unconditional.py # Unconditional generation
50
+ │ ├── diffusion.py # Core masked discrete diffusion model
51
+ │ ├── pareto_mcts.py # Pareto-front MCTS implementation
52
+ │ ├── roformer.py # RoFormer backbone
53
+ │ ├── noise_schedule.py # Noise scheduling (loglinear, logpoly)
54
+ │ ├── config.yaml # Hydra configuration
55
+ │ ├── config.py # Argparse configuration
56
+ │ ├── environment.yml # Conda environment
57
+ │ ├── scoring/ # Therapeutic property scoring
58
+ │ │ ├── scoring_functions.py # Unified scoring interface
59
+ │ │ └── functions/ # Individual property predictors
60
+ │ │ ├── binding.py
61
+ │ │ ├── hemolysis.py
62
+ │ │ ├── nonfouling.py
63
+ │ │ ├── permeability.py
64
+ │ │ ├── solubility.py
65
+ │ │ └── classifiers/ # Pre-trained scoring model weights
66
+ │ ├── tokenizer/ # SMILES SPE tokenizer
67
+ │ │ ├── my_tokenizers.py
68
+ │ │ ├── new_vocab.txt
69
+ │ │ └── new_splits.txt
70
+ │ └── utils/ # Utilities & PeptideAnalyzer
71
+ │ ├── app.py
72
+ │ ├── generate_utils.py
73
+ │ └── utils.py
74
+ ├── scripts/ # Shell scripts for running experiments
75
+ │ ├── train.sh # Pre-training
76
+ │ ├── generate_mcts.sh # MCTS-guided generation
77
+ │ └── generate_unconditional.sh # Unconditional generation
78
+ ├── data/ # Training data
79
+ │ ├── dataloading_for_dynamic_batching.py
80
+ │ └── dataset.py
81
+ ├── checkpoints/ # Model checkpoints
82
+ └── assets/ # Figures
83
+ ```
84
+
85
+ ## Pre-training
86
+
87
+ Before running, fill in `HOME_LOC` and `ENV_LOC` in `scripts/train.sh` and `base_path` in `src/config.yaml` to match your paths.
88
+
89
+ ```bash
90
+ chmod +x scripts/train.sh
91
+
92
+ nohup ./scripts/train.sh > train.log 2>&1 &
93
+ ```
94
+
95
+ Training uses Hydra configuration from `src/config.yaml`. Key settings:
96
+ - **Backbone:** RoFormer (768 hidden, 8 layers, 12 heads)
97
+ - **Optimizer:** AdamW (lr=3e-4, weight_decay=0.075)
98
+ - **Data:** 11M SMILES peptide dataset with dynamic batching by length
99
+ - **Precision:** fp64
100
+ - Checkpoints saved to `checkpoints/` (monitors `val/nll`, saves top 10)
101
+
102
+ ## MCTS-Guided Peptide Generation
103
+
104
+ Generate therapeutic peptides optimized across multiple objectives using Monte-Carlo Tree Search.
105
+
106
+ 1. Fill in `base_path` in `src/config.yaml` and `src/scoring/scoring_functions.py`.
107
+ 2. Fill in `HOME_LOC` in `scripts/generate_mcts.sh`.
108
+ 3. Create output directories: `mkdir -p results logs`
109
+
110
+ ```bash
111
+ chmod +x scripts/generate_mcts.sh
112
+
113
+ # Usage: ./scripts/generate_mcts.sh [PROT_NAME] [PROT_NAME2] [MODE] [MODEL] [LENGTH] [EPOCH]
114
+ # Example: Generate peptides targeting GFAP with length 100
115
+ nohup ./scripts/generate_mcts.sh gfap "" 2 mcts 100 7 > generate.log 2>&1 &
116
+ ```
117
+
118
+ ### Available Target Proteins
119
+
120
+ | Name | Target |
121
+ |------|--------|
122
+ | `amhr` | AMH Receptor |
123
+ | `tfr` | Transferrin Receptor |
124
+ | `gfap` | Glial Fibrillary Acidic Protein |
125
+ | `glp1` | GLP-1 Receptor |
126
+ | `glast` | Excitatory Amino Acid Transporter |
127
+ | `ncam` | Neural Cell Adhesion Molecule |
128
+ | `cereblon` | Cereblon (CRBN) |
129
+ | `ligase` | E3 Ubiquitin Ligase |
130
+ | `skp2` | S-Phase Kinase-Associated Protein 2 |
131
+ | `p53` | Tumor Suppressor p53 |
132
+ | `egfp` | Enhanced Green Fluorescent Protein |
133
+
134
+ To specify a custom target protein, override `+prot_seq=<amino acid sequence>` and `+prot_name=<name>` as Hydra arguments in the generation script.
135
+
136
+ ### Scoring Objectives
137
+
138
+ PepTune jointly optimizes across five therapeutic properties via the integrated scoring suite:
139
+
140
+ | Objective | Property | Model |
141
+ |-----------|----------|-------|
142
+ | `binding_affinity1` | Binding affinity to target protein | Cross-attention Transformer |
143
+ | `solubility` | Aqueous solubility | XGBoost on SMILES CNN embeddings |
144
+ | `hemolysis` | Non-hemolytic | SMILES binary classifier |
145
+ | `nonfouling` | Non-fouling | SMILES binary classifier |
146
+ | `permeability` | Cell membrane permeability | PAMPA CNN |
147
+
148
+ ### Default MCTS Hyperparameters
149
+
150
+ These can be overridden via Hydra config overrides:
151
+
152
+ | Parameter | Default | Description |
153
+ |-----------|---------|-------------|
154
+ | `mcts.num_children` | 50 | Branching factor per MCTS node |
155
+ | `mcts.num_iter` | 128 | Number of MCTS iterations |
156
+ | `mcts.num_objectives` | 5 | Number of optimization objectives |
157
+ | `sampling.steps` | 128 | Diffusion denoising steps |
158
+ | `sampling.seq_length` | 200 | Generated peptide length |
159
+
160
+ ## Unconditional Generation
161
+
162
+ Generate peptides without property guidance:
163
+
164
+ ```bash
165
+ chmod +x scripts/generate_unconditional.sh
166
+
167
+ nohup ./scripts/generate_unconditional.sh > generate_unconditional.log 2>&1 &
168
+ ```
169
+
170
+ ## Evaluation
171
+
172
+ To summarize metrics after generation, fill in `path` and `prot_name` in `src/metrics.py` and run:
173
+
174
+ ```bash
175
+ python src/metrics.py
176
+ ```
177
+
178
+ ## Citation
179
+
180
+ If you find this repository helpful for your publications, please consider citing our paper:
181
+
182
+ ```bibtex
183
+ @article{tang2025peptune,
184
+ title={Peptune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
185
+ author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
186
+ journal={42nd International Conference on Machine Learning},
187
+ year={2025}
188
+ }
189
+ ```
190
+
191
+ ## License
192
+
193
+ To use this repository, you agree to abide by the MIT License.
assets/mcts.png ADDED

Git LFS Details

  • SHA256: e63bdc835269660e4b7bda69973bd60611b61045f25c5c07a9baa277e31d2acd
  • Pointer size: 132 Bytes
  • Size of remote file: 1.67 MB
assets/mdlm.png ADDED

Git LFS Details

  • SHA256: 2944b0a2fde891d883a765f29dd235c877cea5bf3c5117bd7423cab7f3102fa3
  • Pointer size: 131 Bytes
  • Size of remote file: 432 kB
assets/peptune.png ADDED

Git LFS Details

  • SHA256: f6e3bbdab7e5e9c435248796b9cf9d7eca6a41354d80556bb37cf5d01920830c
  • Pointer size: 131 Bytes
  • Size of remote file: 210 kB
assets/poster.png ADDED

Git LFS Details

  • SHA256: 6c35b4c6a3c7e55f5ac821ba36b1a78fadbeb9fb6927e324984031c31428acee
  • Pointer size: 132 Bytes
  • Size of remote file: 1.05 MB
data/dataloading_for_dynamic_batching.py ADDED
@@ -0,0 +1,156 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env
2
+ import torch
3
+ from torch.utils.data import Dataset, DataLoader
4
+ from datasets import Dataset,load_from_disk
5
+ import sys
6
+ import lightning.pytorch as pl
7
+ from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer
8
+ from functools import partial
9
+ import re
10
+
11
+
12
class DynamicBatchingDataset(Dataset):
    """Dataset over rows that are already pre-batched upstream.

    Supports indexing by a single ``int`` (one pre-built batch) or by a
    ``list`` of ints (a gathered selection of pre-built batches).
    """

    def __init__(self, dataset_dict, tokenizer):
        print('Initializing dataset...')
        # Tensor-ify masks and token ids up front; labels stay as-is
        # (they hold the raw sequences, not numeric data).
        self.dataset_dict = {
            'input_ids': [torch.tensor(row) for row in dataset_dict['input_ids']],
            'attention_mask': [torch.tensor(row) for row in dataset_dict['attention_mask']],
            'labels': dataset_dict['labels'],
        }
        self.tokenizer = tokenizer

    def __len__(self):
        # One entry per pre-built batch.
        return len(self.dataset_dict['attention_mask'])

    def __getitem__(self, idx):
        keys = ('input_ids', 'attention_mask', 'labels')
        if isinstance(idx, int):
            return {key: self.dataset_dict[key][idx] for key in keys}
        if isinstance(idx, list):
            # Gather: one list per field, in the order requested.
            return {key: [self.dataset_dict[key][i] for i in idx] for key in keys}
        raise ValueError(f"Expected idx to be int or list, but got {type(idx)}")
40
+
41
class CustomDataModule(pl.LightningDataModule):
    """Lightning data module for pre-tokenized, pre-batched peptide SMILES data.

    Each row of the on-disk dataset is assumed to already be a full batch
    (hence ``batch_size=1`` in the dataloaders below); ``collate_fn`` unwraps
    that row and attaches a per-token peptide-bond mask used to weight losses.
    """

    def __init__(self, dataset_path, tokenizer):
        super().__init__()
        # Loads a dataset previously written with `save_to_disk`; expected to
        # expose 'train' and 'val' splits (see the dataloaders below).
        self.dataset = load_from_disk(dataset_path)
        self.tokenizer = tokenizer

    def peptide_bond_mask(self, smiles_list):
        """
        Returns a mask with shape (batch_size, seq_length) that has 1 at the locations
        of recognized bonds in the positions dictionary and 0 elsewhere.

        Args:
            smiles_list: List of peptide SMILES strings (batch of SMILES strings).

        Returns:
            torch.Tensor: A mask of shape (batch_size, seq_length) with 1s at bond positions.
        """
        # Initialize the batch mask
        batch_size = len(smiles_list)
        # NOTE(review): the width is hard-coded to 1035 (the tokenizer max_length
        # used elsewhere in this repo) rather than the longest SMILES; bond
        # characters past position 1035 are silently dropped — confirm intended.
        max_seq_length = 1035 #max(len(smiles) for smiles in smiles_list) # Find the longest SMILES
        mask = torch.zeros((batch_size, max_seq_length), dtype=torch.int) # Mask filled with zeros

        # Bond motifs ordered by specificity: earlier patterns claim their
        # character span first, and later overlapping matches are skipped.
        bond_patterns = [
            (r'OC\(=O\)', 'ester'),
            (r'N\(C\)C\(=O\)', 'n_methyl'),
            (r'N[12]C\(=O\)', 'peptide'), # Pro peptide bonds
            (r'NC\(=O\)', 'peptide'), # Regular peptide bonds
            (r'C\(=O\)N\(C\)', 'n_methyl'),
            (r'C\(=O\)N[12]?', 'peptide')
        ]

        for batch_idx, smiles in enumerate(smiles_list):
            positions = []
            used = set()  # character positions already claimed by a match

            # Identify bonds
            for pattern, bond_type in bond_patterns:
                for match in re.finditer(pattern, smiles):
                    # Skip any match overlapping a span claimed by an earlier
                    # (more specific) pattern.
                    if not any(p in range(match.start(), match.end()) for p in used):
                        positions.append({
                            'start': match.start(),
                            'end': match.end(),
                            'type': bond_type,
                            'pattern': match.group()
                        })
                        used.update(range(match.start(), match.end()))

            # Update the mask for the current SMILES
            for pos in positions:
                mask[batch_idx, pos['start']:pos['end']] = 1

        return mask

    def peptide_token_mask(self, smiles_list, token_lists):
        """
        Returns a mask with shape (batch_size, num_tokens) that has 1 for tokens
        where any part of the token overlaps with a peptide bond, and 0 elsewhere.

        Args:
            smiles_list: List of peptide SMILES strings (batch of SMILES strings).
            token_lists: List of tokenized SMILES strings (split into tokens).

        Returns:
            torch.Tensor: A mask of shape (batch_size, num_tokens) with 1s for peptide bond tokens.
        """
        # Initialize the batch mask
        batch_size = len(smiles_list)
        token_seq_length = max(len(tokens) for tokens in token_lists) # Find the longest tokenized sequence
        tokenized_masks = torch.zeros((batch_size, token_seq_length), dtype=torch.int) # Mask filled with zeros
        # Character-level bond mask; each token is checked against the span of
        # characters it covers in the raw SMILES string.
        atomwise_masks = self.peptide_bond_mask(smiles_list)


        for batch_idx, atomwise_mask in enumerate(atomwise_masks):
            token_seq = token_lists[batch_idx]
            atom_idx = 0  # character cursor into the atom-wise mask

            for token_idx, token in enumerate(token_seq):
                # NOTE(review): the first and last tokens are skipped and do not
                # advance the cursor — this assumes they are special tokens with
                # no characters in the SMILES string; confirm with the tokenizer.
                if token_idx != 0 and token_idx != len(token_seq) - 1:
                    # Flag the token if any character it spans lies on a bond motif.
                    if torch.sum(atomwise_mask[atom_idx:atom_idx+len(token)]) >= 1:
                        tokenized_masks[batch_idx][token_idx] = 1
                    atom_idx += len(token)

        return tokenized_masks

    def collate_fn(self, batch):
        # batch_size=1 and each dataset row is already a full batch, so unwrap.
        item = batch[0]

        # Split pre-tokenized ids back into token strings, then mark tokens that
        # overlap peptide-bond motifs; 'labels' holds the raw SMILES strings here.
        token_array = self.tokenizer.get_token_split(item['input_ids'])
        bond_mask = self.peptide_token_mask(item['labels'], token_array)

        return {
            'input_ids': item['input_ids'],
            'attention_mask': item['attention_mask'],
            'bond_mask': bond_mask
        }

    def train_dataloader(self):
        train_dataset = DynamicBatchingDataset(self.dataset['train'], tokenizer=self.tokenizer)
        return DataLoader(
            train_dataset,
            batch_size=1,  # each row is a pre-built batch
            collate_fn=self.collate_fn, # Use the instance method
            shuffle=True,
            num_workers=12,
            pin_memory=True
        )

    def val_dataloader(self):
        val_dataset = DynamicBatchingDataset(self.dataset['val'], tokenizer=self.tokenizer)
        return DataLoader(
            val_dataset,
            batch_size=1,  # each row is a pre-built batch
            collate_fn=self.collate_fn, # Use the instance method
            num_workers=8,
            pin_memory=True
        )
data/dataset.py ADDED
@@ -0,0 +1,207 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ import re
3
+ import torch
4
+
5
+ import utils
6
+
7
+ from torch.utils.data import Dataset, DataLoader
8
+ import lightning.pytorch as pl
9
+ from functools import partial
10
+ import sys
11
+
12
class CustomDataset(Dataset):
    """A view over ``dataset`` restricted to (and reordered by) ``indices``."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        # The view is as long as the index list, not the underlying dataset.
        return len(self.indices)

    def __getitem__(self, idx):
        # Translate the view position into the underlying dataset's index;
        # int() accepts numpy/tensor scalars as well as plain ints.
        return self.dataset[int(self.indices[idx])]
24
+
25
+
26
# for weighting losses of peptide bonds
def peptide_bond_mask(smiles_list):
    """
    Return a character-level mask with shape (batch_size, max_seq_length) that
    has 1 at characters belonging to a recognized backbone bond motif
    (peptide / ester / N-methyl) and 0 elsewhere.

    Args:
        smiles_list: List of peptide SMILES strings (batch of SMILES strings).

    Returns:
        torch.Tensor: int mask of shape (batch_size, max_seq_length), where
        max_seq_length is the length of the longest SMILES in the batch;
        (0, 0) for an empty batch.
    """
    batch_size = len(smiles_list)
    # Guard: an empty batch would make max() below raise ValueError.
    if batch_size == 0:
        return torch.zeros((0, 0), dtype=torch.int)
    max_seq_length = max(len(smiles) for smiles in smiles_list)  # longest SMILES sets the width
    mask = torch.zeros((batch_size, max_seq_length), dtype=torch.int)

    # Bond motifs ordered by specificity: earlier patterns claim their character
    # span first, and later matches overlapping a claimed span are skipped.
    bond_patterns = [
        (r'OC\(=O\)', 'ester'),
        (r'N\(C\)C\(=O\)', 'n_methyl'),
        (r'N[12]C\(=O\)', 'peptide'),   # Pro peptide bonds
        (r'NC\(=O\)', 'peptide'),       # Regular peptide bonds
        (r'C\(=O\)N\(C\)', 'n_methyl'),
        (r'C\(=O\)N[12]?', 'peptide')
    ]

    for batch_idx, smiles in enumerate(smiles_list):
        used = set()  # character positions already claimed by a match
        for pattern, bond_type in bond_patterns:
            for match in re.finditer(pattern, smiles):
                span = range(match.start(), match.end())
                # Single C-level disjointness test instead of scanning `used`
                # once per candidate position.
                if used.isdisjoint(span):
                    mask[batch_idx, match.start():match.end()] = 1
                    used.update(span)

    return mask
73
+
74
def peptide_token_mask(smiles_list, token_lists):
    """
    Returns a mask with shape (batch_size, num_tokens) that has 1 for tokens
    where any part of the token overlaps with a peptide bond, and 0 elsewhere.

    Args:
        smiles_list: List of peptide SMILES strings (batch of SMILES strings).
        token_lists: List of tokenized SMILES strings (split into tokens).

    Returns:
        torch.Tensor: A mask of shape (batch_size, num_tokens) with 1s for peptide bond tokens.
    """
    # Initialize the batch mask
    batch_size = len(smiles_list)
    token_seq_length = max(len(tokens) for tokens in token_lists) # Find the longest tokenized sequence
    tokenized_masks = torch.zeros((batch_size, token_seq_length), dtype=torch.int) # Mask filled with zeros
    # Character-level bond mask; each token is checked against the span of
    # characters it covers in the raw SMILES string.
    atomwise_masks = peptide_bond_mask(smiles_list)


    for batch_idx, atomwise_mask in enumerate(atomwise_masks):
        token_seq = token_lists[batch_idx]
        atom_idx = 0  # character cursor into the atom-wise mask

        for token_idx, token in enumerate(token_seq):
            # NOTE(review): the first and last tokens are skipped and do not
            # advance the cursor — this assumes they are special tokens (e.g.
            # BOS/EOS) with no characters in the SMILES string; confirm with
            # the tokenizer's get_token_split output.
            if token_idx != 0 and token_idx != len(token_seq) - 1:
                # Flag the token if any character it spans lies on a bond motif.
                if torch.sum(atomwise_mask[atom_idx:atom_idx+len(token)]) >= 1:
                    tokenized_masks[batch_idx][token_idx] = 1
                atom_idx += len(token)

    return tokenized_masks
104
+
105
def extract_amino_acid_sequence(helm_string):
    """
    Extract the amino acid sequence from a HELM peptide notation as a flat
    list, removing any brackets around each amino acid.

    Args:
        helm_string (str): The HELM notation string for a peptide.

    Returns:
        list: The amino acids in sequence without brackets, or an error string
        when no PEPTIDE{...} segment is found (kept for caller compatibility).
    """
    # Capture the body of every PEPTIDE<n>{...} segment.
    segments = re.findall(r'PEPTIDE\d+\{([^}]+)\}', helm_string)
    if not segments:
        return "Invalid HELM notation or no peptide sequence found."
    # Monomers are dot-separated; brackets around non-canonical residues
    # (e.g. [Aib]) are stripped before splitting.
    return [
        residue
        for segment in segments
        for residue in segment.replace('[', '').replace(']', '').split('.')
    ]
128
+
129
def helm_collate_fn(batch, tokenizer):
    """
    Collate a batch of HELM records into padded token tensors.

    Args:
        batch: List of dataset items, each a mapping with a 'HELM' string.
        tokenizer: Callable tokenizer supporting the HuggingFace ``__call__``
            API (``return_tensors``, ``padding``, ``truncation``, ``max_length``).

    Returns:
        dict: 'input_ids' and 'attention_mask' padded to the longest sequence
        in the batch, truncated at 1024 tokens.
    """
    sequences = [item['HELM'] for item in batch]
    # NOTE: an earlier version also computed the maximum amino-acid length of
    # the batch here, but the value was never used — the tokenizer's own
    # padding already handles batch length, so that dead code was removed.
    tokens = tokenizer(sequences, return_tensors='pt', padding=True, truncation=True, max_length=1024)
    return {
        'input_ids': tokens['input_ids'],
        'attention_mask': tokens['attention_mask']
    }
144
+
145
+
146
def collate_fn(batch, tokenizer):
    """Standard data collator: tokenizes SMILES with padding/truncation at
    max_length 1035 and attaches a per-token peptide-bond mask.

    Items whose SMILES cannot be tokenized are skipped (best-effort), with a
    message printed for each skipped item.
    """
    valid_sequences = []
    valid_items = []
    for entry in batch:
        try:
            # Probe-tokenize one sequence; any failure drops this item.
            smiles = entry['SMILES']
            tokenizer([smiles], return_tensors='pt', padding=False, truncation=True, max_length=1035)
        except Exception as err:
            print(f"Skipping sequence due to: {str(err)}")
            continue
        valid_sequences.append(smiles)
        valid_items.append(entry)

    # Batch-tokenize the survivors, padded to the longest sequence.
    tokens = tokenizer(valid_sequences, return_tensors='pt', padding=True, truncation=True, max_length=1035)

    # Split the ids back into token strings, then mark tokens that overlap a
    # peptide-bond motif in the raw SMILES.
    token_array = tokenizer.get_token_split(tokens['input_ids'])
    bond_mask = peptide_token_mask(valid_sequences, token_array)

    return {
        'input_ids': tokens['input_ids'],
        'attention_mask': tokens['attention_mask'],
        'bond_mask': bond_mask
    }
175
+
176
+
177
class CustomDataModule(pl.LightningDataModule):
    """Lightning data module over pre-built train/val datasets.

    Batching is delegated to ``collate_fn`` (default: this module's
    module-level ``collate_fn``), which receives the tokenizer via
    ``functools.partial``.
    """

    def __init__(self, train_dataset, val_dataset, test_dataset, tokenizer, batch_size, collate_fn=collate_fn):
        super().__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        #self.test_dataset = test_dataset
        # NOTE(review): `test_dataset` is accepted but never stored (the line
        # above is commented out), and test_dataloader below is disabled too.
        self.batch_size = batch_size
        self.tokenizer = tokenizer
        self.collate_fn = collate_fn

    def train_dataloader(self):
        # NOTE(review): no shuffle=True here — confirm whether the training
        # data is shuffled upstream before relying on this loader.
        return DataLoader(self.train_dataset,
                          batch_size=self.batch_size,
                          collate_fn=partial(self.collate_fn, tokenizer=self.tokenizer),
                          num_workers=8,
                          pin_memory=True
                          )


    def val_dataloader(self):
        return DataLoader(self.val_dataset,
                          batch_size=self.batch_size,
                          collate_fn=partial(self.collate_fn, tokenizer=self.tokenizer),
                          num_workers=8,
                          pin_memory=True
                          )

    # Test-split dataloader intentionally disabled; kept for reference.
    """def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size,
                          collate_fn=partial(self.collate_fn, tokenizer=self.tokenizer),
                          num_workers=8, pin_memory=True)"""
scripts/generate_mcts.sh ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash

# MCTS-guided peptide generation launcher.
# Usage: ./scripts/generate_mcts.sh [PROT_NAME1] [PROT_NAME2] [MODE] [MODEL] [LENGTH] [EPOCH]

HOME_LOC=/path/to/your/home/PepTune
SCRIPT_LOC=$HOME_LOC/src
LOG_LOC=$HOME_LOC/logs
DATE=$(date +%m_%d)
SPECIAL_PREFIX='mcts'
PYTHON_EXECUTABLE=python

# ===================================================================
# Default parameters (can be overridden by command line arguments)
# Available proteins: amhr, tfr, gfap, glp1, glast, ncam, cereblon, ligase, skp2, p53, egfp
PROT_NAME1=${1:-"gfap"}
PROT_NAME2=${2:-""}
MODE=${3:-"2"}
MODEL=${4:-"mcts"}
LENGTH=${5:-"100"}
EPOCH=${6:-"7"}
# NOTE(review): the checkpoint path is fixed to epoch13 and does not use the
# $EPOCH argument — confirm whether it should be derived from EPOCH instead.
CKPT=$HOME_LOC/checkpoints/epoch13-new-tokenizer.ckpt

# ===================================================================
echo "Activating conda environment..."
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate peptune

# Output directories: per-protein results and shared logs.
mkdir -p "${HOME_LOC}/${PROT_NAME1}"
mkdir -p "${LOG_LOC}"

echo "Running MCTS generation with parameters:"
echo "  Protein Name 1: $PROT_NAME1"
echo "  Protein Name 2: $PROT_NAME2"
echo "  Mode: $MODE"
echo "  Model: $MODEL"
echo "  Length: $LENGTH"
echo "  Epoch: $EPOCH"

# Build Hydra override arguments; the second protein is optional.
HYDRA_ARGS="+prot_name1=$PROT_NAME1 ++mode=$MODE +model_type=$MODEL +length=$LENGTH +epoch=$EPOCH"
if [ -n "$PROT_NAME2" ]; then
    HYDRA_ARGS="$HYDRA_ARGS +prot_name2=$PROT_NAME2"
fi

cd "$SCRIPT_LOC"

# Run the MCTS generation script with Hydra overrides.
# $HYDRA_ARGS is intentionally left unquoted so it word-splits into
# separate override tokens.
$PYTHON_EXECUTABLE "$SCRIPT_LOC/generate_mcts.py" \
    --config-path "$SCRIPT_LOC" \
    --config-name config \
    base_path="$HOME_LOC" \
    eval.checkpoint_path="$CKPT" \
    $HYDRA_ARGS >> "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}_generate.log" 2>&1

echo "Generation complete. Check logs at: ${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}_generate.log"

conda deactivate
scripts/generate_unconditional.sh ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash

# Unconditional peptide generation launcher (no property guidance).
# Fill in HOME_LOC before running.

HOME_LOC=/path/to/your/home/PepTune
SCRIPT_LOC=$HOME_LOC/src
LOG_LOC=$HOME_LOC/logs
DATE=$(date +%m_%d)
SPECIAL_PREFIX='unconditional'
PYTHON_EXECUTABLE=python

# ===================================================================
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate peptune

# Write the log under $LOG_LOC (previously it landed in the current working
# directory even though LOG_LOC was defined), matching generate_mcts.sh.
mkdir -p "${LOG_LOC}"
$PYTHON_EXECUTABLE "$SCRIPT_LOC/generate_unconditional.py" >> "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}_generate.log" 2>&1

conda deactivate
scripts/train.sh ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash

# PepTune pre-training launcher.
# Fill in HOME_LOC and ENV_LOC before running.

HOME_LOC=/path/to/your/home/PepTune
ENV_LOC=/path/to/your/envs/peptune
SCRIPT_LOC=$HOME_LOC/src
LOG_LOC=$HOME_LOC/logs
DATE=$(date +%m_%d)
SPECIAL_PREFIX='11M-ablation-all-losses'
# set 3 have skip connection
PYTHON_EXECUTABLE=$ENV_LOC/bin/python

# ===================================================================
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$ENV_LOC"

# Write the training log under $LOG_LOC (previously it landed in the current
# working directory even though LOG_LOC was defined), matching generate_mcts.sh.
mkdir -p "${LOG_LOC}"
$PYTHON_EXECUTABLE "$SCRIPT_LOC/train_peptune.py" >> "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}_train.log" 2>&1

conda deactivate
src/config.py ADDED
@@ -0,0 +1,319 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import os
3
+
4
+
5
def get_parser():
    """Build the argument parser for PepTune training and evaluation.

    Fix: flags that default to ``True`` previously used
    ``action='store_true', default=True`` — they could never be turned off
    from the command line. They now use ``argparse.BooleanOptionalAction``
    (Python 3.9+), which keeps the positive flag working and adds a
    ``--no-<flag>`` form to disable it. All names, defaults, and dests are
    otherwise unchanged.

    Returns:
        argparse.ArgumentParser: fully configured parser.
    """
    parser = argparse.ArgumentParser(description='PepTune Training and Evaluation')

    # Noise parameters
    noise_group = parser.add_argument_group('noise')
    noise_group.add_argument('--noise-type', type=str, default='loglinear',
                             help='Type of noise schedule')
    noise_group.add_argument('--sigma-min', type=float, default=1e-4,
                             help='Minimum sigma value')
    noise_group.add_argument('--sigma-max', type=float, default=20,
                             help='Maximum sigma value')
    noise_group.add_argument('--state-dependent', action=argparse.BooleanOptionalAction,
                             default=True,
                             help='Use state-dependent noise')

    # Base parameters
    parser.add_argument('--base-path', type=str, default='/path/to/PepTune',
                        help='Base path to PepTune')
    parser.add_argument('--mode', type=str, default='ppl_eval',
                        choices=['train', 'ppl_eval', 'sample_eval'],
                        help='Running mode')
    parser.add_argument('--diffusion', type=str, default='absorbing_state',
                        help='Diffusion type')
    parser.add_argument('--vocab', type=str, default='old_smiles',
                        choices=['old_smiles', 'new_smiles', 'selfies', 'helm'],
                        help='Vocabulary type')
    parser.add_argument('--backbone', type=str, default='roformer',
                        choices=['peptideclm', 'helmgpt', 'dit', 'roformer', 'finetune_roformer'],
                        help='Model backbone')
    parser.add_argument('--parameterization', type=str, default='subs',
                        help='Parameterization type')
    parser.add_argument('--time-conditioning', action='store_true', default=False,
                        help='Use time conditioning')
    parser.add_argument('--T', type=int, default=0,
                        help='Number of diffusion steps (0 for continuous time, 1000 for discrete)')
    parser.add_argument('--subs-masking', action='store_true', default=False,
                        help='Use substitution masking')
    parser.add_argument('--seed', type=int, default=42,
                        help='Random seed')

    # MCTS parameters
    mcts_group = parser.add_argument_group('mcts')
    mcts_group.add_argument('--mcts-num-children', type=int, default=50,
                            help='Number of children in MCTS')
    mcts_group.add_argument('--mcts-num-objectives', type=int, default=5,
                            help='Number of objectives in MCTS')
    mcts_group.add_argument('--mcts-topk', type=int, default=100,
                            help='Top-k for MCTS')
    mcts_group.add_argument('--mcts-mask-token', type=int, default=4,
                            help='Mask token ID')
    mcts_group.add_argument('--mcts-num-iter', type=int, default=128,
                            help='Number of MCTS iterations')
    mcts_group.add_argument('--mcts-sampling', type=int, default=0,
                            help='Sampling strategy (0 for gumbel, >0 for top-k)')
    mcts_group.add_argument('--mcts-invalid-penalty', type=float, default=0.5,
                            help='Penalty for invalid sequences')
    mcts_group.add_argument('--mcts-sample-prob', type=float, default=1.0,
                            help='Sampling probability')
    mcts_group.add_argument('--mcts-perm', action=argparse.BooleanOptionalAction,
                            default=True,
                            help='Use permutation in MCTS')
    mcts_group.add_argument('--mcts-dual', action='store_true', default=False,
                            help='Use dual mode')
    mcts_group.add_argument('--mcts-single', action='store_true', default=False,
                            help='Use single mode')
    mcts_group.add_argument('--mcts-time-dependent', action=argparse.BooleanOptionalAction,
                            default=True,
                            help='Use time-dependent MCTS')

    # Data parameters
    data_group = parser.add_argument_group('data')
    data_group.add_argument('--train-data', type=str,
                            default='/path/to/your/home/PepTune/data/peptide_data',
                            help='Path to training data')
    data_group.add_argument('--valid-data', type=str,
                            default='/path/to/your/home/PepTune/data/peptide_data',
                            help='Path to validation data')
    data_group.add_argument('--data-batching', type=str, default='wrapping',
                            choices=['padding', 'wrapping'],
                            help='Batching strategy')

    # Loader parameters
    loader_group = parser.add_argument_group('loader')
    loader_group.add_argument('--global-batch-size', type=int, default=64,
                              help='Global batch size')
    loader_group.add_argument('--eval-global-batch-size', type=int, default=None,
                              help='Evaluation global batch size (defaults to global-batch-size)')
    loader_group.add_argument('--num-workers', type=int, default=None,
                              help='Number of dataloader workers (defaults to available CPUs)')
    loader_group.add_argument('--pin-memory', action=argparse.BooleanOptionalAction,
                              default=True,
                              help='Pin memory for dataloaders')

    # Sampling parameters
    sampling_group = parser.add_argument_group('sampling')
    sampling_group.add_argument('--predictor', type=str, default='ddpm_cache',
                                choices=['analytic', 'ddpm', 'ddpm_cache'],
                                help='Predictor type for sampling')
    sampling_group.add_argument('--num-sequences', type=int, default=100,
                                help='Number of sequences to generate')
    sampling_group.add_argument('--sampling-eps', type=float, default=1e-3,
                                help='Sampling epsilon')
    sampling_group.add_argument('--steps', type=int, default=128,
                                help='Number of sampling steps')
    sampling_group.add_argument('--seq-length', type=int, default=100,
                                help='Sequence length')
    sampling_group.add_argument('--noise-removal', action=argparse.BooleanOptionalAction,
                                default=True,
                                help='Use noise removal')
    sampling_group.add_argument('--num-sample-batches', type=int, default=2,
                                help='Number of sample batches')
    sampling_group.add_argument('--num-sample-log', type=int, default=2,
                                help='Number of samples to log')
    sampling_group.add_argument('--stride-length', type=int, default=1,
                                help='Stride length for sampling')
    sampling_group.add_argument('--num-strides', type=int, default=1,
                                help='Number of strides')

    # Training parameters
    training_group = parser.add_argument_group('training')
    training_group.add_argument('--antithetic-sampling', action=argparse.BooleanOptionalAction,
                                default=True,
                                help='Use antithetic sampling')
    training_group.add_argument('--training-sampling-eps', type=float, default=1e-3,
                                help='Training sampling epsilon')
    training_group.add_argument('--focus-mask', action='store_true', default=False,
                                help='Use focus mask')
    training_group.add_argument('--accumulator', action='store_true', default=False,
                                help='Use accumulator')

    # Evaluation parameters
    eval_group = parser.add_argument_group('eval')
    eval_group.add_argument('--checkpoint-path', type=str, default=None,
                            help='Path to checkpoint for evaluation')
    eval_group.add_argument('--disable-ema', action='store_true', default=False,
                            help='Disable EMA')
    eval_group.add_argument('--compute-generative-perplexity', action='store_true', default=False,
                            help='Compute generative perplexity')
    eval_group.add_argument('--perplexity-batch-size', type=int, default=8,
                            help='Batch size for perplexity computation')
    eval_group.add_argument('--compute-perplexity-on-sanity', action='store_true', default=False,
                            help='Compute perplexity on sanity check')
    eval_group.add_argument('--gen-ppl-eval-model', type=str, default='gpt2-large',
                            help='Model for generative perplexity evaluation')
    eval_group.add_argument('--generate-samples', action=argparse.BooleanOptionalAction,
                            default=True,
                            help='Generate samples during evaluation')
    eval_group.add_argument('--generation-model', type=str, default=None,
                            help='Model for generation')

    # Optimizer parameters
    optim_group = parser.add_argument_group('optim')
    optim_group.add_argument('--weight-decay', type=float, default=0.075,
                             help='Weight decay')
    optim_group.add_argument('--lr', type=float, default=3e-4,
                             help='Learning rate')
    optim_group.add_argument('--beta1', type=float, default=0.9,
                             help='Adam beta1')
    optim_group.add_argument('--beta2', type=float, default=0.999,
                             help='Adam beta2')
    optim_group.add_argument('--eps', type=float, default=1e-8,
                             help='Adam epsilon')

    # PepCLM model parameters
    pepclm_group = parser.add_argument_group('pepclm')
    pepclm_group.add_argument('--pepclm-hidden-size', type=int, default=768,
                              help='PepCLM hidden size')
    pepclm_group.add_argument('--pepclm-cond-dim', type=int, default=256,
                              help='PepCLM conditioning dimension')
    pepclm_group.add_argument('--pepclm-n-heads', type=int, default=20,
                              help='PepCLM number of attention heads')
    pepclm_group.add_argument('--pepclm-n-blocks', type=int, default=4,
                              help='PepCLM number of blocks')
    pepclm_group.add_argument('--pepclm-dropout', type=float, default=0.5,
                              help='PepCLM dropout rate')
    pepclm_group.add_argument('--pepclm-length', type=int, default=512,
                              help='PepCLM sequence length')

    # General model parameters
    model_group = parser.add_argument_group('model')
    model_group.add_argument('--model-type', type=str, default='ddit',
                             help='Model type')
    model_group.add_argument('--hidden-size', type=int, default=768,
                             help='Model hidden size')
    model_group.add_argument('--cond-dim', type=int, default=128,
                             help='Conditioning dimension')
    model_group.add_argument('--length', type=int, default=512,
                             help='Sequence length')
    model_group.add_argument('--n-blocks', type=int, default=12,
                             help='Number of blocks')
    model_group.add_argument('--n-heads', type=int, default=12,
                             help='Number of attention heads')
    model_group.add_argument('--scale-by-sigma', action=argparse.BooleanOptionalAction,
                             default=True,
                             help='Scale by sigma')
    model_group.add_argument('--dropout', type=float, default=0.1,
                             help='Dropout rate')

    # RoFormer parameters
    roformer_group = parser.add_argument_group('roformer')
    roformer_group.add_argument('--roformer-hidden-size', type=int, default=768,
                                help='RoFormer hidden size')
    roformer_group.add_argument('--roformer-n-layers', type=int, default=8,
                                help='RoFormer number of layers')
    roformer_group.add_argument('--roformer-n-heads', type=int, default=8,
                                help='RoFormer number of attention heads')
    roformer_group.add_argument('--roformer-max-position-embeddings', type=int, default=1035,
                                help='RoFormer max position embeddings')

    # HelmGPT parameters
    helmgpt_group = parser.add_argument_group('helmgpt')
    helmgpt_group.add_argument('--helmgpt-hidden-size', type=int, default=256,
                               help='HelmGPT hidden size')
    helmgpt_group.add_argument('--helmgpt-embd-pdrop', type=float, default=0.1,
                               help='HelmGPT embedding dropout')
    helmgpt_group.add_argument('--helmgpt-resid-pdrop', type=float, default=0.1,
                               help='HelmGPT residual dropout')
    helmgpt_group.add_argument('--helmgpt-attn-pdrop', type=float, default=0.1,
                               help='HelmGPT attention dropout')
    helmgpt_group.add_argument('--helmgpt-ff-dropout', type=float, default=0.0,
                               help='HelmGPT feedforward dropout')
    helmgpt_group.add_argument('--helmgpt-block-size', type=int, default=140,
                               help='HelmGPT block size')
    helmgpt_group.add_argument('--helmgpt-n-layer', type=int, default=8,
                               help='HelmGPT number of layers')
    helmgpt_group.add_argument('--helmgpt-n-heads', type=int, default=8,
                               help='HelmGPT number of attention heads')

    # Trainer parameters
    trainer_group = parser.add_argument_group('trainer')
    trainer_group.add_argument('--accelerator', type=str, default='cuda',
                               help='Accelerator type')
    trainer_group.add_argument('--num-nodes', type=int, default=1,
                               help='Number of nodes')
    trainer_group.add_argument('--devices', type=int, default=1,
                               help='Number of devices')
    trainer_group.add_argument('--gradient-clip-val', type=float, default=1.0,
                               help='Gradient clipping value')
    trainer_group.add_argument('--precision', type=str, default='64-true',
                               help='Training precision')
    trainer_group.add_argument('--num-sanity-val-steps', type=int, default=2,
                               help='Number of sanity validation steps')
    trainer_group.add_argument('--max-epochs', type=int, default=100,
                               help='Maximum number of epochs')
    trainer_group.add_argument('--max-steps', type=int, default=1_000_000,
                               help='Maximum number of steps')
    trainer_group.add_argument('--log-every-n-steps', type=int, default=10,
                               help='Log every n steps')
    trainer_group.add_argument('--limit-train-batches', type=float, default=1.0,
                               help='Limit training batches')
    trainer_group.add_argument('--limit-val-batches', type=float, default=1.0,
                               help='Limit validation batches')
    trainer_group.add_argument('--check-val-every-n-epoch', type=int, default=1,
                               help='Check validation every n epochs')

    # WandB parameters
    wandb_group = parser.add_argument_group('wandb')
    wandb_group.add_argument('--wandb-project', type=str, default='peptune',
                             help='WandB project name')
    wandb_group.add_argument('--wandb-notes', type=str, default=None,
                             help='WandB notes')
    wandb_group.add_argument('--wandb-group', type=str, default=None,
                             help='WandB group')
    wandb_group.add_argument('--wandb-job-type', type=str, default=None,
                             help='WandB job type')
    wandb_group.add_argument('--wandb-name', type=str, default='sophia-tang',
                             help='WandB run name')
    wandb_group.add_argument('--wandb-id', type=str, default=None,
                             help='WandB run ID')

    # Checkpointing parameters
    checkpoint_group = parser.add_argument_group('checkpointing')
    checkpoint_group.add_argument('--save-dir', type=str, default=None,
                                  help='Directory to save checkpoints')
    checkpoint_group.add_argument('--resume-from-ckpt', action=argparse.BooleanOptionalAction,
                                  default=True,
                                  help='Resume from checkpoint')
    checkpoint_group.add_argument('--resume-ckpt-path', type=str, default=None,
                                  help='Path to checkpoint to resume from')
    checkpoint_group.add_argument('--checkpoint-every-n-epochs', type=int, default=1,
                                  help='Save checkpoint every n epochs')
    checkpoint_group.add_argument('--checkpoint-monitor', type=str, default='val/nll',
                                  help='Metric to monitor for checkpointing')
    checkpoint_group.add_argument('--checkpoint-save-top-k', type=int, default=10,
                                  help='Save top k checkpoints')
    checkpoint_group.add_argument('--checkpoint-mode', type=str, default='min',
                                  choices=['min', 'max'],
                                  help='Mode for checkpoint monitoring')
    checkpoint_group.add_argument('--checkpoint-dirpath', type=str,
                                  default='./checkpoints/11M-old-tokenizer',
                                  help='Directory path for checkpoints')

    # LR Scheduler parameters
    scheduler_group = parser.add_argument_group('lr_scheduler')
    scheduler_group.add_argument('--lr-warmup-steps', type=int, default=2500,
                                 help='Number of warmup steps for learning rate')

    return parser
294
+
295
+
296
def get_args():
    """Parse command-line arguments and fill in derived defaults.

    Post-processing:
      * eval batch size falls back to the training global batch size,
      * worker count falls back to the number of usable CPUs,
      * wandb run id is derived from the run name when not given,
      * save dir defaults to the current working directory.

    Returns:
        argparse.Namespace: parsed arguments with defaults resolved.
    """
    parser = get_parser()
    args = parser.parse_args()

    # Post-process arguments
    if args.eval_global_batch_size is None:
        args.eval_global_batch_size = args.global_batch_size

    if args.num_workers is None:
        # BUG FIX: os.sched_getaffinity only exists on Linux; fall back to
        # os.cpu_count() on macOS/Windows instead of crashing.
        try:
            args.num_workers = len(os.sched_getaffinity(0))
        except AttributeError:
            args.num_workers = os.cpu_count() or 1

    if args.wandb_id is None:
        args.wandb_id = f"{args.wandb_name}_nov12_set2"

    if args.save_dir is None:
        args.save_dir = os.getcwd()

    return args


if __name__ == '__main__':
    args = get_args()
    print(args)
src/config.yaml ADDED
@@ -0,0 +1,164 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ noise:
2
+ type: loglinear
3
+ sigma_min: 1e-4
4
+ sigma_max: 20
5
+ state_dependent: True
6
+
7
+ base_path: /path/to/your/home/PepTune
8
+ mode: train # train / ppl_eval / sample_eval
9
+ diffusion: absorbing_state
10
+ vocab: old_smiles # old_smiles / new_smiles / selfies / helm
11
+ backbone: roformer # peptideclm / helmgpt / dit / roformer / finetune_roformer
12
+ parameterization: subs # subs
13
+ time_conditioning: False
14
+ T: 0 # 0 (continuous time) / 1000
15
+ subs_masking: False
16
+
17
+ seed: 42
18
+
19
+ mcts:
20
+ num_children: 50
21
+ num_objectives: 5
22
+ topk: 100
23
+ mask_token: 4
24
+ num_iter: 128
25
+ sampling: 0 # 0 is gumbel sampling / > 0 samples children from top k probs
26
+ invalid_penalty: 0.5
27
+ sample_prob: 1.0
28
+ perm: True
29
+ dual: False
30
+ single: False
31
+ time_dependent: True
32
+
33
+ lr_scheduler:
34
+ _target_: transformers.get_constant_schedule_with_warmup
35
+ num_warmup_steps: 2500
36
+
37
+ loader:
38
+ global_batch_size: 64
39
+ eval_global_batch_size: ${.global_batch_size}
40
+ # Note: batch_size and eval_batch_size are **per machine**
41
+ batch_size: ${div_up:${.global_batch_size}, ${eval:${trainer.devices} * ${trainer.num_nodes}}}
42
+ eval_batch_size: ${div_up:${.eval_global_batch_size}, ${eval:${trainer.devices} * ${trainer.num_nodes}}}
43
+ num_workers: ${eval:"len(__import__('os').sched_getaffinity(0))"}
44
+ pin_memory: True
45
+
46
+ sampling:
47
+ predictor: ddpm_cache # analytic, ddpm, ddpm_cache
48
+ num_sequences: 100
49
+ sampling_eps: 1e-3
50
+ steps: 128
51
+ seq_length: 200
52
+ noise_removal: True
53
+ num_sample_batches: 2 # Total samples: `num_gpus` * `loader.eval_batch_size` * num_sample_batches
54
+ num_sample_log: 2
55
+ stride_length: 1
56
+ num_strides: 1
57
+
58
+ training:
59
+ antithetic_sampling: True
60
+ sampling_eps: 1e-3
61
+ focus_mask: False
62
+ #dynamic_batching: True
63
+ accumulator: False
64
+
65
+ eval:
66
+ checkpoint_path: null  # YAML null, not the string "None"
67
+ disable_ema: False
68
+ compute_generative_perplexity: False
69
+ perplexity_batch_size: 8
70
+ compute_perplexity_on_sanity: False
71
+ gen_ppl_eval_model_name_or_path: gpt2-large # gpt2-large, meta-llama/Llama-2-7b-hf
72
+ generate_samples: True
73
+ generation_model: null  # YAML null, not the string "None"
74
+
75
+ optim:
76
+ weight_decay: 0.075
77
+ lr: 3e-4
78
+ beta1: 0.9
79
+ beta2: 0.999
80
+ eps: 1e-8
81
+
82
+ pepclm:
83
+ hidden_size: 768
84
+ cond_dim: 256
85
+ n_heads: 20
86
+ n_blocks: 4
87
+ dropout: 0.5
88
+ length: 512
89
+ #scale_by_sigma: True
90
+
91
+ model:
92
+ type: ddit
93
+ hidden_size: 768
94
+ cond_dim: 128
95
+ length: 512
96
+ n_blocks: 12
97
+ n_heads: 12
98
+ scale_by_sigma: True
99
+ dropout: 0.1
100
+
101
+ roformer:
102
+ hidden_size: 768
103
+ n_layers: 8
104
+ n_heads: 8
105
+ max_position_embeddings: 1035
106
+
107
+ helmgpt:
108
+ hidden_size: 256
109
+ embd_pdrop: 0.1
110
+ resid_pdrop: 0.1
111
+ attn_pdrop: 0.1
112
+ ff_dropout: 0.
113
+ block_size: 140
114
+ n_layer: 8
115
+ n_heads: 8
116
+
117
+
118
+ trainer:
119
+ _target_: lightning.Trainer
120
+ accelerator: cuda
121
+ num_nodes: 1
122
+ devices: ${device_count:}
123
+ accumulate_grad_batches: ${div_up:${loader.global_batch_size}, ${eval:${trainer.devices} * ${loader.batch_size} * ${trainer.num_nodes}}}
124
+ gradient_clip_val: 1.0
125
+ precision: 64-true
126
+ num_sanity_val_steps: 2
127
+ max_epochs: 100
128
+ max_steps: 1_000_000
129
+ log_every_n_steps: 10
130
+ limit_train_batches: 1.0 # train on full dataset, can be used to toggle quick run
131
+ limit_val_batches: 1.0 # validate on full dataset, can be used to toggle quick run
132
+ #val_check_interval: 40 #954
133
+ check_val_every_n_epoch: 1
134
+
135
+
136
+ wandb:
137
+ project: peptune
138
+ notes: null
139
+ group: null
140
+ job_type: null
141
+ name: sophia-tang
142
+ id: ${.name}_nov12_set2
143
+
144
+ hydra:
145
+ run:
146
+ dir: ./${now:%Y.%m.%d}/
147
+ job:
148
+ chdir: True
149
+
150
+ checkpointing:
151
+ # Use custom `save_dir` if, e.g., saving to S3 bucket, otherwise leave this parameter as is
152
+ save_dir: ${cwd:}
153
+ # Note: `checkpoints` path should correspond to `checkpoint_every_n_steps.dirpath`
154
+ resume_from_ckpt: True
155
+ resume_ckpt_path: null  # YAML null, not the string "None"
156
+
157
+ callbacks:
158
+ model_checkpoint:
159
+ _target_: pytorch_lightning.callbacks.ModelCheckpoint
160
+ every_n_epochs: 1
161
+ monitor: "val/nll"
162
+ save_top_k: 10
163
+ mode: "min"
164
+ dirpath: './checkpoints/'
src/diffusion.py ADDED
@@ -0,0 +1,1015 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## Adapted from MDLM: https://github.com/kuleshov-group/mdlm
2
+
3
+ import numpy as np
4
+ import sys
5
+ import itertools
6
+ import time
7
+ import torch
8
+ from torch import Tensor
9
+ import math
10
+ import torch.nn.functional as F
11
+ import numpy as np
12
+ import random as rd
13
+ import lightning as L
14
+ import torchmetrics
15
+ from dataclasses import dataclass
16
+ import gc
17
+ import pickle
18
+ import utils.utils as utils
19
+ from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer
20
+ import noise_schedule
21
+ from torch.optim.lr_scheduler import _LRScheduler
22
+ import roformer as roformer
23
+ from utils.app import PeptideAnalyzer
24
+
25
@dataclass
class Loss:
    """Container bundling the diffusion training loss with its per-token parts."""
    loss: torch.FloatTensor       # aggregate loss used for optimization
    nlls: torch.FloatTensor       # per-token negative log-likelihoods
    attn_mask: torch.FloatTensor  # mask weighting which tokens count toward the loss
30
+
31
+
32
class NLL(torchmetrics.aggregation.MeanMetric):
    """Running mean of negative log-likelihood values (alias of MeanMetric)."""
    pass
34
+
35
+
36
class BPD(NLL):
    def compute(self) -> Tensor:
        """Computes the bits per dimension.

        Converts the accumulated mean NLL (in nats) to bits by dividing
        by ln(2).

        Returns:
            bpd: mean negative log-likelihood expressed in bits.
        """
        # mean_value / weight is the running mean NLL in nats; / ln(2) -> bits
        return self.mean_value / self.weight / math.log(2)
44
+
45
+
46
class Perplexity(NLL):
    def compute(self) -> Tensor:
        """Computes the Perplexity.

        Perplexity is exp of the running mean negative log-likelihood.

        Returns:
            Perplexity: exp(mean NLL).
        """
        return torch.exp(self.mean_value / self.weight)
54
+
55
+
56
+ class Diffusion(L.LightningModule):
57
+ def __init__(self, config, tokenizer):
58
+
59
+ super().__init__()
60
+ self.config = config
61
+ #self.save_hyperparameters()
62
+
63
+ # PeptideCLM tokenizer
64
+ self.tokenizer = tokenizer
65
+ self.vocab_size = self.tokenizer.vocab_size
66
+ self.mask_token_id = self.tokenizer.mask_token_id
67
+ self.sampler = self.config.sampling.predictor
68
+ self.analyzer = PeptideAnalyzer()
69
+
70
+ # backbone LM PeptideCLM model
71
+ if self.config.backbone == 'roformer':
72
+ self.backbone = roformer.Roformer(self.config, self.tokenizer)
73
+ self.backbone.unfreeze_all_layers()
74
+ elif self.config.backbone == 'finetune_roformer':
75
+ self.backbone = roformer.Roformer(self.config, self.tokenizer)
76
+ self.backbone.freeze_model()
77
+ self.backbone.unfreeze_n_layers(n=8)
78
+ else:
79
+ Exception('invalid backbone config')
80
+
81
+ self.neg_infinity = -1000000.0
82
+ self.T = config.T
83
+ # noise schedule for non-peptide bond tokens (default to log-linear)
84
+ self.noise = noise_schedule.get_noise(config)
85
+ # noise schedule for peptide bonds (log-polynomial)
86
+ self.bond_noise = noise_schedule.LogPolyNoise()
87
+ self.time_conditioning = self.config.time_conditioning
88
+ self.fast_forward_epochs = None
89
+ self.fast_forward_batches = None
90
+
91
+ self.gen_ppl_eval_model_name_or_path = self.config.eval.gen_ppl_eval_model_name_or_path
92
+ self.gen_ppl_metric = Perplexity()
93
+
94
+ self.lr = self.config.optim.lr
95
+ self.sampling_eps = self.config.training.sampling_eps
96
+
97
+ metrics = torchmetrics.MetricCollection({
98
+ 'nll': NLL(),
99
+ 'bpd': BPD(),
100
+ 'ppl': Perplexity(),
101
+ })
102
+ metrics.set_dtype(torch.float64)
103
+ self.train_metrics = metrics.clone(prefix='trainer/')
104
+ self.valid_metrics = metrics.clone(prefix='val/')
105
+ self.test_metrics = metrics.clone(prefix='test/')
106
+
107
+
108
+ """LOSS FOR INVALID PEPTIDES"""
109
+
110
    @torch.no_grad()
    def conditional_gumbel(self, logits, D, k):
        """
        Outputs k samples of Q = StandardGumbel(), such that argmax(logits
        + Q) is given by D (one-hot vector).

        This is the conditional-Gumbel sampling step of the Gumbel-Rao
        (Rao-Blackwellized Gumbel-Softmax) estimator: all k perturbations
        are consistent with the already-chosen category D.

        Input:
        - logits: Tensor of shape (batch_size, seq_len, vocab_size)
        - D: One-hot tensor of shape (batch_size, seq_len, vocab_size)
        - k: Number of Gumbel samples

        Output:
        - Adjusted logits with shape (k, batch_size, seq_len, vocab_size)
        """

        # iid. exponential samples of shape (k, batch_size, seq_len, vocab_size)
        E = torch.distributions.exponential.Exponential(rate=torch.ones_like(logits)).sample([k])

        # E of the chosen class, shape (k, batch_size, seq_len, 1)
        Ei = (D * E).sum(dim=-1, keepdim=True)

        # Partition function (normalization constant), shape (batch_size, seq_len, 1)
        Z = logits.exp().sum(dim=-1, keepdim=True)

        # Adjusted logits for Gumbel distribution:
        # the chosen class gets the max-Gumbel value, the rest are truncated below it.
        adjusted = (
            D * (-torch.log(Ei) + torch.log(Z)) +
            (1 - D) * -torch.log(E / logits.exp() + Ei / Z)
        )

        # Adjusted logits shape: (k, batch_size, seq_len, vocab_size)
        # Subtracting logits leaves the pure Gumbel noise samples.
        return adjusted - logits
142
+
143
+ def replace_gradient(self, value, surrogate):
144
+ """
145
+ Returns `value` but backpropagates gradients through `surrogate`.
146
+ """
147
+ return surrogate + (value - surrogate).detach()
148
+
149
    def gumbel_rao(self, logits, k, temp=1.0, I=None):
        """
        Returns a categorical sample from logits (over axis=-1) as a
        one-hot vector, with gumbel-rao gradient.

        The hard sample D is returned in the forward pass; the gradient
        flows through a Rao-Blackwellized surrogate: the average of k
        tempered softmaxes over Gumbel perturbations conditioned on D.

        Input:
        - logits: Tensor of shape (batch_size, seq_len, vocab_size)
        - k: Number of Gumbel samples for Rao-Blackwellization
        - temp: Temperature for softmax
        - I: Optional, precomputed categorical sample tensor of shape (batch_size, seq_len)

        Output:
        - One-hot tensor of shape (batch_size, seq_len, vocab_size)
          with Gumbel-Rao gradient.
        """
        # sanity check: logits must span the full tokenizer vocabulary
        assert logits.shape[-1] == self.tokenizer.vocab_size
        vocab_size = logits.shape[-1]

        if I is None:
            # Sample indices for each token in the batch
            I = torch.distributions.categorical.Categorical(logits=logits).sample() # (batch_size, seq_len)

        # Convert indices to one-hot encodings, shape (batch_size, seq_len, vocab_size)
        D = torch.nn.functional.one_hot(I, num_classes=vocab_size).float()

        # Generate k different adjusted logits that all evaluate to the same sequence
        adjusted = logits + self.conditional_gumbel(logits, D, k=k) # (k, batch_size, seq_len, vocab_size)

        # Compute the surrogate by averaging softmax across k samples
        surrogate = torch.nn.functional.softmax(adjusted / temp, dim=-1).mean(dim=0) # (batch_size, seq_len, vocab_size)

        # Return one-hot representation with surrogate gradient
        return self.replace_gradient(D, surrogate)
182
+
183
    def compute_invalid_loss(self, logits, k=None, temp=None):
        """
        Penalizes logits that produce invalid sequences using the `is_peptide` function,
        scaling penalties by the probabilities of the sampled tokens.

        NOTE: `k` and `temp` are currently unused — the Gumbel-Rao sampling
        path is commented out and tokens are taken by argmax instead. The
        validity check itself is non-differentiable; gradients flow only
        through the softmax probabilities of the argmax tokens.

        Args:
            logits: Tensor of shape [batch_size, seq_len, vocab_size].
            k: Number of samples for Gumbel-Rao (unused).
            temp: Temperature for softmax (unused).

        Returns:
            Tensor of shape (batch_size, seq_len): per-token penalty, equal to
            the token's probability for rows whose decoded sequence is not a
            valid peptide, and zero elsewhere.
        """

        #samples = self.gumbel_rao(logits, k=k, temp=temp) # (batch_size, seq_len, vocab_size)

        # Convert logits to sequences using the tokenizer (greedy argmax decode)
        batch_token_ids = logits.argmax(dim=-1).to(self.device) # (batch_size, seq_len)
        sampled_sequences = self.tokenizer.batch_decode(batch_token_ids)

        # Check validity of each sampled sequence (not differentiable)
        # 1 = invalid peptide, 0 = valid
        penalties = torch.tensor(
            [1 if not self.analyzer.is_peptide(seq) else 0 for seq in sampled_sequences],
            dtype=torch.float32,
            device=self.device
        )
        #print(penalties)

        # Compute probabilities for each token (batch_size, seq_length)
        sampled_probs = torch.softmax(logits, dim=-1).gather(dim=-1, index=batch_token_ids.unsqueeze(-1)).squeeze(-1).to(self.device)

        # scale penalties by softmax probability of sampled tokens, so the
        # model is pushed to lower confidence on tokens of invalid sequences
        scaled_penalty = penalties[:, None] * sampled_probs # (batch_size, seq_length)

        return scaled_penalty.to(self.device)
218
+
219
+ """DIFFUSION LOSS"""
220
+
221
+ def sample_t(self, n, device):
222
+ """
223
+ Sample random time steps for batch training
224
+ """
225
+ # sample values uniformly at random from [0, 1)
226
+ eps_t = torch.rand(n, device=device)
227
+ # antithetic sampling: reduce variance by pairing each sample with complementary sample
228
+ if self.config.training.antithetic_sampling:
229
+ # compute interval between sampled time steps
230
+ offset = torch.arange(n, device=device) / n
231
+ # ensure that each eps value is evenly spaced between [0, 1)
232
+ eps_t = ((eps_t / n) + offset) % 1
233
+
234
+ # ensures values are not exactly 0 or 1
235
+ t = (1 - self.config.training.sampling_eps) * eps_t + self.config.training.sampling_eps
236
+
237
+ return t
238
+
239
    def q_xt(self, x, mask_prob):
        """Computes the noisy sample xt.

        Independently masks each token with probability `mask_prob`, then
        caps the number of masked positions per sequence at 75% of the
        sequence's non-padding length.

        Args:
            x: int torch.Tensor with shape (batch_size,
                diffusion_model_input_length), input token ids.
            mask_prob: float torch.Tensor with shape (batch_size, 1),
                per-sequence masking probability.

        Returns:
            xt: tensor shaped like x with selected positions replaced by
                the tokenizer's mask token id.
        """

        # non-padding length; assumes token id 0 is padding — TODO confirm
        actual_seq_length = (x != 0).sum(dim=-1, keepdim=True)
        #print(actual_seq_length)

        # at most 75% of the real tokens may be masked
        max_mask_length = (actual_seq_length * 0.75).long()

        # Bernoulli(mask_prob) proposal per position
        mask_indices = torch.rand(*x.shape, device=x.device) < mask_prob

        restricted_move_indices = torch.zeros_like(mask_indices, dtype=torch.bool)

        # per-row cap enforcement
        for i in range(x.shape[0]):
            true_positions = torch.where(mask_indices[i])[0]
            if len(true_positions) > max_mask_length[i]:
                # NOTE(review): keeps the earliest sampled positions when over
                # the cap, which biases masking toward the sequence start —
                # a random subset may be intended; verify.
                selected_positions = true_positions[:max_mask_length[i].item()]
                restricted_move_indices[i, selected_positions] = True
            else:
                restricted_move_indices[i] = mask_indices[i]

        xt = torch.where(restricted_move_indices, self.tokenizer.mask_token_id, x)

        return xt
268
+
269
+
270
+ def sample_prior(self, *batch_dims):
271
+ """
272
+ Returns array of fully masked sequences with same shape as input
273
+ """
274
+ return self.mask_token_id * torch.ones(* batch_dims, dtype=torch.int64)
275
+
276
+
277
+ """COMPUTING LOSS"""
278
+
279
    def compute_diffusion_loss(self, model_output, xt, x0, t):
        """
        Computes the diffusion loss term of the ELBO: how accurately the model
        predicts the original tokens at each of the T discrete timesteps.

        Inputs:
            - model_output: per-position token log-probabilities from the
              denoiser. NOTE(review): the gather below implies shape
              (batch, seq_len, vocab) — the original comment claimed
              [seq_len, vocab, vocab]; confirm against the backbone output.
            - xt: corrupted version of original input x0 at timestep t
            - x0: original input token ids
            - t: timestep (broadcastable against x0)
        """
        # compute interval between each timestep
        dt = 1 / self.T

        # alpha_t = 1 - t under the log-linear schedule, broadcast to x0's shape
        alpha_t = 1 - t + torch.zeros_like(x0)
        # s = t - dt is the previous (less noisy) timestep
        alpha_s = 1 - (t - dt) + torch.zeros_like(x0)

        # log-probability the model assigns to the true token x0 at each position
        # log<x_theta, x>
        log_x_theta_at_x0 = torch.gather(model_output, -1, x0[:, :, None])
        # log-probability of assigning the mask token at each position
        # log<x_theta, m>
        log_x_theta_at_m = model_output[:, :, self.mask_token_id]
        # non-log probability of assigning a masked token
        # <xt, m>
        x_theta_at_m = log_x_theta_at_m.exp()

        # first term of the discrete-time variational bound
        term_1_coef = dt / t
        term_1_log_numerator = torch.log((alpha_t * x_theta_at_m) / t + 1)
        term_1_log_denom = log_x_theta_at_x0

        # second term of the discrete-time variational bound
        term_2_coef = 1 - (dt / t)
        term_2_log_numerator = term_1_log_numerator
        term_2_log_denom = torch.log((alpha_s * x_theta_at_m) / (t - dt) + 1)

        L_vb_masked = (term_1_coef * (term_1_log_numerator - term_1_log_denom) +
                       term_2_coef * (term_2_log_numerator - term_2_log_denom))

        # the bound only applies at positions that are masked in xt: <zt, m>
        L_vb = L_vb_masked * (xt == self.mask_token_id)

        # scale by the number of timesteps T and return
        return self.T * L_vb
326
+
327
    def _forward_pass_diffusion(self, x0, attn_mask, bond_mask=None, mask=None):
        """
        One training forward pass: noise x0 to a random timestep, denoise with
        the model, and return the per-token loss.

        Args:
            x0: (batch, seq_len) original token ids.
            attn_mask: (batch, seq_len) attention mask for the backbone.
            bond_mask: optional (batch, seq_len) indicator of peptide-bond
                token positions; with config.noise.state_dependent these get a
                separate (log-polynomial) masking schedule.
            mask: optional explicit 0/1 mask; when given it overrides the
                sampled masking (positions with mask != 1 become mask tokens).
        """
        # randomly sample a starting timestep for each sequence in the batch
        t = self.sample_t(x0.shape[0], self.device)

        # if we are training the intermediate transition blocks
        if self.T > 0:
            # quantize t onto the grid {1/T, 2/T, ..., 1}
            t = (t * self.T).to(torch.int)
            t = t / self.T
            # add 1/T to ensure no 0 values
            t += (1 / self.T)

        # noise level and its rate at timestep t
        # log-linear schedule: sigma = -log(1-t); dsigma = 1 / (1-t)
        sigma, dsigma = self.noise(t)
        # NOTE(review): time_conditioning is captured before sigma is
        # re-shaped/overwritten in the state-dependent branch below — so the
        # backbone always sees the base schedule's sigma; confirm intended.
        time_conditioning = sigma[:, None]

        # masking probability per token: log-linear gives 1 - alpha = t
        base_mask_prob = 1 - torch.exp(-sigma[:, None])  # (batch_size, L)

        if self.config.noise.state_dependent and (bond_mask is not None):
            # log-polynomial masking schedule for bond tokens: alpha = 1 - t^w
            # bond_sigma = -log(1-t^w); bond_dsigma = -wt^(w-1) / (1-t^w)
            bond_sigma, bond_dsigma = self.bond_noise(t)
            # expand dimensions for broadcasting to (B, L)
            bond_sigma = bond_sigma[:, None]
            bond_dsigma = bond_dsigma[:, None]
            sigma = sigma[:, None]
            dsigma = dsigma[:, None]

            # masking probability for peptide bonds: 1 - bond_alpha = t^w
            bond_mask_prob = 1 - torch.exp(-bond_sigma).to(self.device)
            # splice the bond schedule into the base schedule at bond positions
            mask_prob = torch.where(bond_mask == 1, bond_mask_prob, base_mask_prob).to(self.device)
            dsigma = torch.where(bond_mask == 1, bond_dsigma, dsigma).to(self.device)
            sigma = torch.where(bond_mask == 1, bond_sigma, sigma).to(self.device)
        else:
            mask_prob = base_mask_prob.to(self.device)

        # corrupt x0: either sample masks or apply the caller-provided mask
        if mask is None:
            zt = self.q_xt(x0, mask_prob).to(self.device)
        else:
            zt = x0.where(mask==1, torch.full_like(x0, self.mask_token_id)).to(self.device)

        model_output = self.forward(zt, attn_mask=attn_mask.to(self.device), sigma=time_conditioning).to(self.device)

        # debugging; NOTE(review): the is_cuda assert hard-requires GPU training
        assert not torch.isnan(model_output).any()
        assert model_output.is_cuda
        utils.print_nans(model_output, 'model_output')

        # penalty for chemically invalid decoded sequences, per position (B, L)
        invalid_loss = self.compute_invalid_loss(logits=model_output).to(self.device)

        if self.T > 0:
            # discrete-time variational bound
            # NOTE(review): invalid_loss is computed but not added on this
            # path — confirm whether that is intentional.
            diffusion_loss = self.compute_diffusion_loss(model_output, zt, x0, t)
            return diffusion_loss

        # continuous-time loss: -log p_theta(x0) weighted by dsigma/expm1(sigma)
        log_p_theta = torch.gather(input=model_output, dim=-1, index=x0[:, :, None]).squeeze(-1).to(self.device)  # (B, L)

        if self.config.noise.state_dependent and (bond_mask is not None):
            # sigma/dsigma are already (B, L) here
            return (-log_p_theta * (dsigma / torch.expm1(sigma)) + invalid_loss).to(self.device)
        else:
            # sigma/dsigma are (B,) here, hence the [:, None] broadcast
            return ((-log_p_theta * (dsigma / torch.expm1(sigma))[:, None]) + invalid_loss).to(self.device)
406
+
407
+ def _loss(self, x0, attn_mask, bond_mask=None, mask=None):
408
+ loss = self._forward_pass_diffusion(x0, attn_mask, bond_mask, mask)
409
+
410
+ # negative log loss
411
+ nlls = loss * attn_mask
412
+
413
+ # count number of tokens
414
+ num_tokens = attn_mask.sum()
415
+
416
+ # compute batch loss
417
+ batch_nll = nlls.sum()
418
+ # compute per token loss
419
+ token_nll = batch_nll / num_tokens
420
+ # return losses
421
+ return Loss(loss = token_nll.to(self.device), nlls = nlls.to(self.device), attn_mask = attn_mask.to(self.device))
422
+
423
+ def _compute_loss(self, batch, prefix, bond_mask=None):
424
+
425
+ attn_mask = batch['attention_mask'].to(self.device)
426
+
427
+ if 'mask' in batch:
428
+ mask = batch['mask'].to(self.device)
429
+ else:
430
+ mask = None
431
+
432
+ if 'bond_mask' in batch:
433
+ bond_mask = batch['bond_mask'].to(self.device)
434
+ else:
435
+ bond_mask = None
436
+
437
+ losses = self._loss(batch['input_ids'].to(self.device), attn_mask, bond_mask, mask)
438
+ loss = losses.loss
439
+
440
+ if prefix == 'train':
441
+ self.train_metrics.update(
442
+ losses.nlls.to(self.device),
443
+ losses.attn_mask.to(self.device)
444
+ )
445
+ metrics = self.train_metrics
446
+ elif prefix == 'val':
447
+ self.valid_metrics.update(
448
+ losses.nlls.to(self.device),
449
+ losses.attn_mask.to(self.device)
450
+ )
451
+ metrics = self.valid_metrics
452
+ elif prefix == 'test':
453
+ self.test_metrics.update(losses.nlls, losses.attn_mask)
454
+ metrics = self.test_metrics
455
+ else:
456
+ raise ValueError(f'Invalid prefix: {prefix}')
457
+
458
+ self.log_dict(metrics,
459
+ on_step=False,
460
+ on_epoch=True,
461
+ sync_dist=True)
462
+
463
+ return loss
464
+
465
+
466
+ """SAMPLING"""
467
+
468
+ def generate_from_masked(self, num_samples=None, seq_length=None, sample_steps=128, eps=1e-5):
469
+ # get number of timesteps
470
+ if sample_steps is None:
471
+ sample_steps = self.config.sampling.steps
472
+
473
+ if seq_length is None:
474
+ seq_length = self.config.sampling.seq_length
475
+
476
+ # sample fully masked sequences
477
+ z = self.sample_prior(num_samples, seq_length).to(self.device)
478
+
479
+ # create vector of sample_steps timesteps
480
+ timesteps = torch.linspace(1, eps, sample_steps + 1, device=self.device)
481
+
482
+ # compute interval between timesteps
483
+ dt = (1 - eps) / sample_steps
484
+
485
+ for i in range(sample_steps):
486
+ t = timesteps[i] * torch.ones(z.shape[0], 1, device=self.device)
487
+
488
+ z = self.single_reverse_step(z, t, dt)
489
+
490
+ return z
491
+
492
+
493
+ """SAMPLING STEP"""
494
+
495
    def single_reverse_step(self, zt, t, dt, attn_mask=None):
        """
        Take a single reverse diffusion step (t -> t - dt), e.g. for the
        expansion step of the MCTS algorithm.

        Args:
            zt: (batch, seq_len) partially masked token ids at time t.
            t: current timestep, broadcastable to (batch, 1).
            dt: step width.
            attn_mask: optional attention mask forwarded to the backbone.

        Returns:
            (batch, seq_len) token ids at time t - dt; already-unmasked
            positions are carried over unchanged.
        """
        # noise levels that determine the masking probabilities at t and s=t-dt
        sigma_t, _ = self.noise(t)
        sigma_s, _ = self.noise(t - dt)

        # flatten sigmas to 1-D per-batch vectors
        if sigma_t.ndim > 1:
            sigma_t = sigma_t.squeeze(-1)
        if sigma_s.ndim > 1:
            sigma_s = sigma_s.squeeze(-1)
        assert sigma_t.ndim == 1, sigma_t.shape
        assert sigma_s.ndim == 1, sigma_s.shape

        # probability a token is (still) masked at each timestep
        change_prob_t = 1 - torch.exp(-sigma_t)
        change_prob_s = 1 - torch.exp(-sigma_s)

        # expand to (batch, 1, 1) for broadcasting over (batch, L, vocab)
        change_prob_t = change_prob_t[:, None, None]
        change_prob_s = change_prob_s[:, None, None]

        # denoiser output: per-position token log-probabilities
        log_p_x0 = self.forward(zt, attn_mask=attn_mask, sigma=sigma_t)

        # check dimensions match
        assert change_prob_t.ndim == log_p_x0.ndim

        # reverse-diffusion probability of unmasking into each token at time s
        q_zs = log_p_x0.exp() * (change_prob_t - change_prob_s)

        # probability of remaining masked at time s
        q_zs[:, :, self.mask_token_id] = change_prob_s[:, :, 0]

        # sample the sequence at timestep s from the categorical q_zs
        z_changed = sample_categorical(q_zs)

        # carry-over: positions already unmasked in zt stay fixed
        copy_flag = (zt != self.mask_token_id).to(zt.dtype)
        return (copy_flag * zt) + ((1 - copy_flag) * z_changed)
538
+
539
    def cached_reverse_step(self, x, t, dt, p_x0=None, attn_mask=None):
        """
        Reverse step that reuses a cached denoiser output p_x0 when the
        sequence has not changed since the previous step (ddpm_cache sampler).

        Only valid for the log-linear schedule, where the masked probability
        at time t is exactly t (hence t is used directly below).

        Returns:
            (p_x0, x_s): the (possibly newly computed) denoiser probabilities
            and the sequence advanced from t to t - dt.
        """
        assert self.config.noise.type == 'loglinear'
        sigma_t, _ = self.noise(t)

        if t.ndim > 1:
            t = t.squeeze(-1)
        assert t.ndim == 1

        # log-linear schedule: P(masked at time u) = u
        change_prob_t = t[:, None, None]
        change_prob_s = (t - dt)[:, None, None]

        assert change_prob_t.ndim == 3, change_prob_t.shape

        # recompute the denoiser output only when no cached value was passed
        if p_x0 is None:
            p_x0 = self.forward(x, attn_mask=attn_mask, sigma=sigma_t).exp()

        assert change_prob_t.ndim == p_x0.ndim

        # probability of unmasking into each token between t and t - dt
        q_xs = p_x0 * (change_prob_t - change_prob_s)

        # probability of remaining masked at time t - dt
        q_xs[:, :, self.mask_token_id] = change_prob_s[:, :, 0]

        x_changed = sample_categorical(q_xs)

        # carry-over: keep already-unmasked positions
        copy_flag = (x != self.mask_token_id).to(x.dtype)

        return p_x0, copy_flag * x + (1 - copy_flag) * x_changed
566
+
567
+ # first step in expansion
568
+ def batch_cached_reverse_step(self, token_array, t, dt, batch_size, p_x0=None, attn_mask=None):
569
+
570
+ assert self.config.noise.type == 'loglinear'
571
+ sigma_t, _ = self.noise(t)
572
+
573
+ if t.ndim > 1:
574
+ t = t.squeeze(-1)
575
+ assert t.ndim == 1
576
+
577
+ change_prob_t = t[:, None, None]
578
+ change_prob_s = (t - dt)[:, None, None]
579
+
580
+ assert change_prob_t.ndim == 3, change_prob_t.shape
581
+
582
+ if token_array.dim() == 1:
583
+ token_array = token_array.unsqueeze(0)
584
+ #token_array = token_array.repeat(batch_size, 1)
585
+
586
+ attn_mask = torch.ones_like(token_array)
587
+
588
+ if p_x0 is None:
589
+ p_x0 = self.forward(token_array, attn_mask=attn_mask, sigma=sigma_t).exp()
590
+
591
+ assert change_prob_t.ndim == p_x0.ndim
592
+
593
+ q_xs = p_x0 * (change_prob_t - change_prob_s)
594
+
595
+ # zero-masking probability
596
+ q_xs[:, :, self.mask_token_id] = change_prob_s[:, :, 0]
597
+
598
+ # repeat the parent token along the first dimension which will be unmasked into distinct sequences
599
+ token_array = token_array.repeat(batch_size, 1)
600
+
601
+ if self.config.mcts.sampling == 0:
602
+ x_changed = sample_batched_categorical(q_xs.to(self.device), batch_size)
603
+ else:
604
+ x_changed = sample_batched_top_k(q_xs.to(self.device), batch_size, self.config.mcts.sampling)
605
+
606
+ copy_flag = (token_array != self.mask_token_id).to(token_array.dtype)
607
+
608
+ return p_x0, copy_flag * token_array + (1 - copy_flag) * x_changed
609
+
610
+ def _process_sigma(self, sigma):
611
+ if sigma.ndim > 1:
612
+ sigma = sigma.squeeze(-1)
613
+ if not self.time_conditioning:
614
+ sigma = torch.zeros_like(sigma)
615
+ assert sigma.ndim == 1, sigma.shape
616
+ return sigma
617
+
618
    def forward(self, zt, attn_mask, sigma):
        """
        Predicts the token log-probabilities for zt at noise level sigma.

        Runs the backbone and applies the SUBS parameterization (zero masking
        probability + carry-over unmasking).
        """
        sigma = self._process_sigma(sigma)

        # NOTE(review): autocast is requested with dtype=float32, which
        # effectively disables mixed precision — confirm that is intended.
        with torch.amp.autocast("cuda", enabled=True, dtype=torch.float32, cache_enabled=True):
            logits = self.backbone(zt, attn_mask).to(self.device)

        return self.subs_parameterization(logits, zt)
628
+
629
    def subs_parameterization(self, logits, zt):
        """
        Updates reverse diffusion logits based on SUBS parameterization:
        - zero masking probabilities: -infinity log-probability of predicting
          the mask token during reverse diffusion
        - carry-over unmasking: unmasked input tokens remain unchanged during
          reverse diffusion

        Args:
            logits: (batch, seq_len, vocab) token scores from the backbone
            zt: partially unmasked sequence at the current timestep

        Returns:
            (batch, seq_len, vocab) log-probabilities normalized over vocab.
        """
        # Zero masking probability: the model can never predict a mask token.
        logits[:, :, self.mask_token_id] += self.neg_infinity  # [sequence index, current token, next token]

        # Normalize over the vocabulary to obtain log-probabilities.
        logits = (logits - torch.logsumexp(logits, dim=-1, keepdim=True)).to(self.device)

        # Carry-over unmasking: find the already-revealed positions of zt.
        unmasked_indices = (zt != self.mask_token_id).to(self.device)  # (batch, seq_len) bool
        batch_idx, seq_idx = torch.where(unmasked_indices)  # Get explicit indices
        batch_idx = batch_idx.to(self.device)
        seq_idx = seq_idx.to(self.device)
        tokens = zt[batch_idx, seq_idx].to(self.device)  # Get the tokens at those positions

        # Sanity checks. NOTE(review): asserts are stripped under `python -O`.
        assert logits.is_contiguous(), "logits tensor is not contiguous"
        assert unmasked_indices.shape == zt.shape, "same shape"
        assert not torch.isnan(logits).any(), "NaN values found in logits"
        assert tokens.max() < logits.shape[-1], "token indices out of bounds"
        assert batch_idx.max() < logits.shape[0], "batch index out of bounds"
        assert seq_idx.max() < logits.shape[1], "seq index out of bounds"
        assert batch_idx.device == seq_idx.device == logits.device == tokens.device, "device inconsistent"

        # Pin revealed positions to their current token: log-prob 0 there,
        # -inf everywhere else at those positions.
        logits[batch_idx, seq_idx] = self.neg_infinity  # Set everything to -inf first
        logits[batch_idx, seq_idx, tokens] = 0  # Set only the specific token positions to 0
        # return logits with SUBS parameterization
        return logits.to(self.device)
663
+
664
+ """SAMPLING"""
665
    @torch.no_grad()
    def _sample(self, num_steps=None, eps=1e-5, x_input=None):
        """
        Generate samples by iterating the reverse diffusion chain.

        Args:
            num_steps: number of reverse steps (default: config.sampling.steps).
            eps: smallest timestep; the chain runs from t=1 down to t=eps.
            x_input: optional dict with 'input_ids'/'attention_mask' to start
                from a partially masked sequence instead of the all-mask prior.

        Returns:
            (batch, seq_len) generated token ids.
        """
        batch_size_per_gpu = self.config.eval.perplexity_batch_size

        if num_steps is None:
            num_steps = self.config.sampling.steps

        # start either from the caller's sequence or from the all-mask prior
        if x_input is not None:
            x = x_input['input_ids'].to(self.device)
            attn_mask = x_input['attention_mask'].to(self.device)
        else:
            x = self.sample_prior(batch_size_per_gpu, self.config.model.length).to(self.device)
            attn_mask = torch.ones_like(x).to(self.device)


        timesteps = torch.linspace(1, eps, num_steps+1, device=self.device)
        dt = (1 - eps) / num_steps
        p_x0_cache = None
        generation_history = []  # used to track which tokens are unmasked

        for i in range(num_steps):
            t = timesteps[i] * torch.ones(x.shape[0], 1, device = self.device)
            if self.sampler == 'ddpm':
                # NOTE(review): attn_mask is not forwarded on this branch —
                # confirm the ddpm sampler is meant to ignore it.
                x = self.single_reverse_step(x, t, dt).to(self.device)

            elif self.sampler == 'ddpm_cache':
                # reuse the cached denoiser output while x is unchanged
                p_x0_cache, x_next = self.cached_reverse_step(x, t, dt, p_x0=p_x0_cache, attn_mask=attn_mask)
                if (not torch.allclose(x_next, x) or self.time_conditioning):
                    # Disable caching
                    p_x0_cache = None
                x = x_next.to(self.device)
            else:
                # analytic (score-based) update
                x = self._analytic_update(x, t, dt, attn_mask).to(self.device)

        # final cleanup: deterministically resolve any residual mask tokens
        if self.config.sampling.noise_removal:
            t = timesteps[-1] * torch.ones(x.shape[0], 1, device=self.device)
            if self.sampler == 'analytic':
                x = self._denoiser_update(x, t).to(self.device)
            else:
                time_conditioning = self.noise(t)[0].to(self.device)
                x = self.forward(x, attn_mask=attn_mask, sigma=time_conditioning).argmax(dim=-1).to(self.device)
        return x.to(self.device)
712
+
713
+
714
+ def restore_model_and_sample(self, num_steps, eps=1e-5):
715
+ """Generate samples from the model."""
716
+ self.backbone.eval()
717
+ self.noise.eval()
718
+ samples = self._sample(num_steps=num_steps, eps=eps)
719
+ self.backbone.train()
720
+ self.noise.train()
721
+ return samples
722
+
723
    def get_score(self, zt, sigma, attn_mask=None):
        """
        Compute the (exponentiated) score ratio p_t(y)/p_t(x) for every
        candidate token y at every position of zt, at noise level sigma.

        score(x, t) = p_t(y) / p_t(x)
        => log score(x, t) = log p_t(y) - log p_t(x)

        case 1: x = masked
          (i) y = unmasked:
              log score(x, t) = log p_theta(x)|_y + log k
              where k = exp(-sigma) / (1 - exp(-sigma))
          (ii) y = masked:
              log score(x, t) = 0

        case 2: x = unmasked
          (i) y != masked, y != x: log score(x_i, t) = -inf
          (ii) y = x:              log score(x_i, t) = 0
          (iii) y = masked token:  log score(x_i, t) = -log k
        """
        # denoiser log-probabilities for every position/token
        model_output = self.forward(zt, attn_mask=attn_mask, sigma=sigma)

        # log k = -log(exp(sigma) - 1)
        log_k = -torch.log(torch.expm1(sigma)).squeeze(-1)
        assert log_k.ndim == 1

        # case 1: scores for currently-masked positions
        masked_score = model_output + log_k[:, None, None]
        masked_score[:, :, self.mask_token_id] = 0

        # case 2: scores for currently-unmasked positions
        unmasked_score = self.neg_infinity * torch.ones_like(model_output)
        unmasked_score = torch.scatter(
            unmasked_score, -1,
            zt[..., None],
            torch.zeros_like(unmasked_score[..., :1]))

        unmasked_score[:, :, self.mask_token_id] = - (log_k[:, None] * torch.ones_like(zt))

        # select case 1 or case 2 per position
        masked_indices = (zt == self.mask_token_id).to(model_output.dtype)[:, :, None]

        model_output = (masked_score * masked_indices + unmasked_score * (1 - masked_indices))

        # return the score itself (not its log)
        return model_output.exp()
765
+
766
+ def _staggered_score(self, score, dsigma):
767
+ score = score.clone()
768
+ extra_const = (1 - dsigma.exp()) * score.sum(dim=-1)
769
+ score *= dsigma.exp()[:, None]
770
+ score[..., self.mask_token_id] += extra_const
771
+ return score
772
+
773
+ def _analytic_update(self, x, t, step_size, attn_mask=None):
774
+ curr_sigma, _ = self.noise(t)
775
+ next_sigma, _ = self.noise(t - step_size)
776
+ dsigma = curr_sigma - next_sigma
777
+ score = self.get_score(x, attn_mask, curr_sigma)
778
+ stag_score = self._staggered_score(score, dsigma)
779
+ probs = stag_score * self._transp_transition(x, dsigma)
780
+ return sample_categorical(probs)
781
+
782
    def _denoiser_update(self, x, t):
        """
        Final denoising step for the analytic sampler: resolve remaining
        tokens with the mask category forced to zero probability.

        NOTE(review): sigma itself (not a sigma difference as in
        _analytic_update) is passed to _staggered_score/_transp_transition —
        presumably because this is the terminal jump to t=0; confirm.
        """
        sigma, _ = self.noise(t)
        score = self.get_score(x, sigma)
        stag_score = self._staggered_score(score, sigma)
        probs = stag_score * self._transp_transition(x, sigma)
        # forbid sampling the mask token on the final step
        probs[..., self.mask_token_id] = 0
        samples = sample_categorical(probs)
        return samples
790
+
791
    def _transp_transition(self, i, sigma):
        """
        Transposed transition kernel of the absorbing-state process for token
        ids i at noise level sigma: exp(-sigma) on the identity (one-hot)
        channel, plus 1 - exp(-sigma) spread at positions that are masked.
        """
        # pad sigma with trailing singleton dims to broadcast against one-hots
        sigma = unsqueeze(sigma, reference=i[..., None])
        edge = torch.exp(-sigma) * F.one_hot(
            i, num_classes=self.vocab_size)
        # masked positions additionally receive the absorbed mass
        edge += torch.where(i == self.mask_token_id,
                            1 - torch.exp(-sigma).squeeze(-1),
                            0)[..., None]
        return edge
799
+
800
+
801
+ def on_train_epoch_start(self):
802
+ torch.cuda.empty_cache()
803
+ self.backbone.train()
804
+ self.noise.train()
805
+
806
+
807
+ def training_step(self, batch, batch_idx):
808
+ # Initialize throughput calculation
809
+ start_time = time.time()
810
+
811
+ if self.config.vocab == 'old_smiles' or self.config.vocab == 'new_smiles':
812
+ loss = self._compute_loss(batch, prefix='train', bond_mask=batch['bond_mask'])
813
+ else:
814
+ loss = self._compute_loss(batch, prefix='train')
815
+
816
+ self.log(name='trainer/loss',
817
+ value=loss.item(),
818
+ on_step=True,
819
+ on_epoch=False,
820
+ sync_dist=True)
821
+
822
+ # Calculate throughput
823
+ elapsed_time = time.time() - start_time
824
+ total_tokens = batch['input_ids'].numel()
825
+ throughput = total_tokens / elapsed_time
826
+
827
+ self.log(name='trainer/throughput',
828
+ value=throughput,
829
+ on_step=True,
830
+ on_epoch=False,
831
+ sync_dist=True)
832
+
833
+ return loss
834
+
835
+
836
+ def on_load_checkpoint(self, checkpoint):
837
+ self.fast_forward_epochs = checkpoint['loops']['fit_loop']['epoch_progress']['current']['completed']
838
+ self.fast_forward_batches = checkpoint['loops']['fit_loop']['epoch_loop.batch_progress']['current']['completed']
839
+
840
+ """VALIDATION"""
841
+ def on_validation_epoch_start(self):
842
+ gc.collect()
843
+ torch.cuda.empty_cache()
844
+ self.backbone.eval()
845
+ self.noise.eval()
846
+ assert self.valid_metrics.nll.mean_value == 0
847
+ assert self.valid_metrics.nll.weight == 0
848
+
849
+ def validation_step(self, batch, batch_idx):
850
+ if self.config.vocab == 'old_smiles' or self.config.vocab == 'new_smiles':
851
+ loss = self._compute_loss(batch, prefix='val', bond_mask=batch['bond_mask'])
852
+ else:
853
+ loss = self._compute_loss(batch, prefix='val')
854
+
855
+ self.log(name='trainer/val_loss',
856
+ value=loss.item(),
857
+ on_step=True,
858
+ on_epoch=False,
859
+ prog_bar=True,
860
+ sync_dist=True)
861
+ return loss
862
+
863
+ def on_validation_epoch_end(self):
864
+ gc.collect()
865
+ torch.cuda.empty_cache()
866
+
867
+ """OPTIMIZATION"""
868
+
869
    def optimizer_step(self, *args, **kwargs):
        """Lightning hook: run the default optimizer step, then aggressively
        release memory (gc + CUDA cache) after every parameter update.

        NOTE(review): emptying the CUDA cache on every step can slow training;
        presumably done to avoid fragmentation OOMs — confirm before removing.
        """
        super().optimizer_step(*args, **kwargs)

        gc.collect()
        torch.cuda.empty_cache()
874
+
875
    def configure_optimizers(self):
        """Lightning hook: AdamW over backbone + noise-schedule parameters,
        with a per-step cosine-warmup LR schedule.

        Returns the ([optimizer], [scheduler_dict]) pair Lightning expects.
        """
        optimizer = torch.optim.AdamW(
            itertools.chain(self.backbone.parameters(),self.noise.parameters()),
            lr=self.config.optim.lr,
            betas=(self.config.optim.beta1, self.config.optim.beta2),
            eps=self.config.optim.eps,
            weight_decay=self.config.optim.weight_decay
        )

        self.total_steps = self.config.trainer.max_steps
        scheduler = CosineWarmup(optimizer,
                                 warmup_steps=self.config.lr_scheduler.num_warmup_steps,
                                 total_steps=self.total_steps)

        scheduler_dict = {
            'scheduler': scheduler,
            # step the LR every optimizer step, not every epoch
            'interval': 'step',
            'frequency': 1,
            # NOTE(review): 'val/loss' is not a key logged elsewhere in this
            # module ('trainer/val_loss' is). Harmless for CosineWarmup, which
            # does not monitor a metric, but confirm before swapping in a
            # metric-driven scheduler.
            'monitor': 'val/loss',
            'name': 'trainer/lr'
        }

        return [optimizer], [scheduler_dict]
898
+
899
    @torch.no_grad()
    def compute_masked_perplexity(self, generated_ids, input_ids):
        """
        Computes a pseudo-perplexity for generated sequences against the
        masked positions of `input_ids`: cross-entropy is accumulated only
        where input_ids == mask_token_id (other targets are set to the
        ignore_index -100).

        Args:
            generated_ids: iterable of generated token-id sequences.
            input_ids: the (partially masked) input ids the model conditioned on.

        NOTE(review): the backbone is re-run on the same `input_ids` for every
        generated sequence, so every loop iteration produces identical logits;
        also the denominator counts all non-pad tokens while the numerator
        only sums loss over masked positions — confirm both are intentional.
        """

        total_nll = 0
        total_tokens = 0

        input_ids = torch.tensor(input_ids).to(self.device)

        for sequence in generated_ids:
            # ground-truth ids for this generated sequence
            gt_ids = torch.tensor(sequence).to(self.device)

            sys.stdout.flush()

            # all-ones attention over the conditioning input
            attn_mask = torch.ones_like(input_ids).to(self.device)

            # logits from the backbone for every position/vocab entry
            outputs = self.backbone.forward(input_ids=input_ids, attn_mask=attn_mask)

            # flatten to (positions, vocab) / (positions,) for cross_entropy
            logits = outputs.view(-1, outputs.size(-1))
            gt_ids = gt_ids.view(-1)

            # score only masked positions; everything else gets ignore_index
            loss = F.cross_entropy(logits,
                                   gt_ids.where(input_ids==self.mask_token_id, torch.full_like(gt_ids, -100)).view(-1),
                                   reduction='sum')

            total_nll += loss.item()
            # count all non-padding tokens (includes bos/eos)
            total_tokens += input_ids.ne(self.tokenizer.pad_token_id).sum().item()

        # pseudo-perplexity = exp(mean NLL per counted token)
        pseudo_perplexity = torch.exp(torch.tensor(total_nll / total_tokens))
        self.gen_ppl_metric.update(pseudo_perplexity)

        return pseudo_perplexity.item()
955
+
956
+
957
def sample_categorical(categorical_probs):
    """Gumbel-max style sampling: draw one category per position from
    (unnormalized) probabilities by dividing by exponential noise and
    taking the argmax over the last axis."""
    noise = torch.rand_like(categorical_probs)
    gumbel_norm = 1e-10 - (noise + 1e-10).log()
    return (categorical_probs / gumbel_norm).argmax(dim=-1)
962
+
963
def sample_batched_categorical(categorical_probs, batch_size):
    """Draw batch_size independent Gumbel-max samples from one shared
    (1, L, V) probability tensor; returns (batch_size, L) token indices."""
    _, seq_len, vocab = categorical_probs.shape

    # Independent Gumbel noise per sample / position / category.
    uniform = torch.rand(batch_size, seq_len, vocab) + 1e-10
    gumbel = (-torch.log(-torch.log(uniform) + 1e-10)).to(categorical_probs.device)

    # Perturb log-probabilities and pick the best category per position.
    perturbed = torch.log(categorical_probs) + gumbel
    return perturbed.argmax(dim=-1)
974
+
975
def sample_batched_top_k(categorical_probs, batch_size, k):
    """Draw batch_size sequences from one shared (1, L, V) probability tensor,
    restricting each position to its top-k Gumbel-perturbed candidates and
    re-sampling among them with softmax weights.

    Args:
        categorical_probs: (1, L, V) per-position category probabilities.
        batch_size: number of independent sequences to sample.
        k: number of top candidates kept per position.

    Returns:
        (batch_size, L) int tensor of sampled vocabulary indices.
    """
    _, sequence_length, vocab_length = categorical_probs.shape
    device = categorical_probs.device

    # Gumbel-perturbed log probabilities, one independent draw per sample.
    gumbel_noise = -torch.log(-torch.log(torch.rand(batch_size, sequence_length, vocab_length) + 1e-10) + 1e-10).to(device)
    # Fix: broadcast (1, L, V) against (batch_size, L, V) directly so the
    # result stays 3-D. The previous `categorical_probs[None, :, :]` produced
    # a 4-D (1, batch_size, L, V) tensor whose 4-D top-k indices could not be
    # gathered with the 3-D sampled-slot tensor (torch.gather requires the
    # index to have the same number of dimensions as the input).
    noisy_scores = torch.log(categorical_probs) + gumbel_noise  # (batch_size, L, V)

    # Keep only the k best candidates per position.
    top_k_scores, top_k_indices = torch.topk(noisy_scores, k, dim=-1)  # (batch_size, L, k)

    # Renormalize the kept candidates and draw one slot per position.
    top_k_probs = torch.softmax(top_k_scores, dim=-1)
    sampled_slots = torch.multinomial(top_k_probs.reshape(-1, k), num_samples=1)
    sampled_slots = sampled_slots.view(batch_size, sequence_length, 1)

    # Translate top-k slots back into vocabulary ids.
    return torch.gather(top_k_indices, -1, sampled_slots).squeeze(-1)
996
+
997
def unsqueeze(x, reference):
    """Append trailing singleton dims to x until it has as many dims as
    `reference`, enabling broadcasting against it."""
    missing = len(reference.shape) - len(x.shape)
    return x.reshape(x.shape + (1,) * missing)
999
+
1000
class CosineWarmup(_LRScheduler):
    """Linear LR warmup followed by cosine decay to eta_ratio * base_lr."""

    def __init__(self, optimizer, warmup_steps, total_steps, eta_ratio=0.1, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        # Floor of the schedule as a fraction of the base learning rate.
        self.eta_ratio = eta_ratio
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        step = self.last_epoch

        # Warmup phase: scale linearly from 0 up to the base LR.
        if step < self.warmup_steps:
            warm_frac = step / self.warmup_steps
            return [warm_frac * lr for lr in self.base_lrs]

        # Cosine phase: decay from base LR down to eta_ratio * base LR.
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        scale = (1 - self.eta_ratio) * 0.5 * (1 + np.cos(np.pi * progress)) + self.eta_ratio
        return [scale * lr for lr in self.base_lrs]
src/environment.yml ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: peptune
2
+ channels:
3
+ - pytorch
4
+ - nvidia
5
+ - conda-forge
6
+ dependencies:
7
+ - python=3.10
8
+ - pip
9
+ - pytorch
10
+ - torchvision
11
+ - pytorch-cuda=12.1
12
+ - rdkit
13
+ - numpy
14
+ - pandas
15
+ - scikit-learn
16
+ - jupyterlab
17
+ - matplotlib-base
18
+ - seaborn
19
+ - tqdm
20
+ - pyyaml
21
+ - pip:
22
+ - pytorch-lightning==2.5.5
23
+ - lightning==2.5.5
24
+ - fair-esm==2.0.0
25
+ - transformers==4.56.2
26
+ - SmilesPE==0.0.3
27
+ - scipy==1.13.1
28
+ - wandb==0.22.0
29
+ - hydra-core==1.3.2
30
+ - hydra-submitit-launcher==1.2.0
31
+ - pathos==0.3.4
32
+ - matplotlib==3.10.1
33
+ - pandas==2.2.2
34
+ - seaborn==0.13.2
35
+ - timm==1.0.20
36
+ - xgboost==3.0.5
37
+ - loguru==0.7.3
38
+ - peft==0.17.1
39
+ - accelerate==1.11.0
40
+ - datasets
src/generate_mcts.py ADDED
@@ -0,0 +1,365 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env
2
+ import time
3
+ import torch
4
+ import torch.nn.functional as F
5
+ import math
6
+ import random
7
+ import sys
8
+ import pandas as pd
9
+ from utils.generate_utils import mask_for_de_novo
10
+ from diffusion import Diffusion
11
+ from pareto_mcts import Node, MCTS
12
+ import hydra
13
+ from tqdm import tqdm
14
+ from transformers import AutoTokenizer, AutoModel, pipeline
15
+ from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer
16
+ from utils.app import PeptideAnalyzer
17
+ import matplotlib.pyplot as plt
18
+ import os
19
+ import seaborn as sns
20
+ import pandas as pd
21
+ import numpy as np
22
+
23
+ # Protein sequence dictionary
24
+ PROTEIN_SEQUENCES = {
25
+ 'amhr': 'MLGSLGLWALLPTAVEAPPNRRTCVFFEAPGVRGSTKTLGELLDTGTELPRAIRCLYSRCCFGIWNLTQDRAQVEMQGCRDSDEPGCESLHCDPSPRAHPSPGSTLFTCSCGTDFCNANYSHLPPPGSPGTPGSQGPQAAPGESIWMALVLLGLFLLLLLLLGSIILALLQRKNYRVRGEPVPEPRPDSGRDWSVELQELPELCFSQVIREGGHAVVWAGQLQGKLVAIKAFPPRSVAQFQAERALYELPGLQHDHIVRFITASRGGPGRLLSGPLLVLELHPKGSLCHYLTQYTSDWGSSLRMALSLAQGLAFLHEERWQNGQYKPGIAHRDLSSQNVLIREDGSCAIGDLGLALVLPGLTQPPAWTPTQPQGPAAIMEAGTQRYMAPELLDKTLDLQDWGMALRRADIYSLALLLWEILSRCPDLRPDSSPPPFQLAYEAELGNTPTSDELWALAVQERRRPYIPSTWRCFATDPDGLRELLEDCWDADPEARLTAECVQQRLAALAHPQESHPFPESCPRGCPPLCPEDCTSIPAPTILPCRPQRSACHFSVQQGPCSRNPQPACTLSPV',
26
+ 'tfr': 'MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNTKANVTKPKRCSGSICYGTIAVIVFFLIGFMIGYLGYCKGVEPKTECERLAGTESPVREEPGEDFPAARRLYWDDLKRKLSEKLDSTDFTGTIKLLNENSYVPREAGSQKDENLALYVENQFREFKLSKVWRDQHFVKIQVKDSAQNSVIIVDKNGRLVYLVENPGGYVAYSKAATVTGKLVHANFGTKKDFEDLYTPVNGSIVIVRAGKITFAEKVANAESLNAIGVLIYMDQTKFPIVNAELSFFGHAHLGTGDPYTPGFPSFNHTQFPPSRSSGLPNIPVQTISRAAAEKLFGNMEGDCPSDWKTDSTCRMVTSESKNVKLTVSNVLKEIKILNIFGVIKGFVEPDHYVVVGAQRDAWGPGAAKSGVGTALLLKLAQMFSDMVLKDGFQPSRSIIFASWSAGDFGSVGATEWLEGYLSSLHLKAFTYINLDKAVLGTSNFKVSASPLLYTLIEKTMQNVKHPVTGQFLYQDSNWASKVEKLTLDNAAFPFLAYSGIPAVSFCFCEDTDYPYLGTTMDTYKELIERIPELNKVARAAAEVAGQFVIKLTHDVELNLDYERYNSQLLSFVRDLNQYRADIKEMGLSLQWLYSARGDFFRATSRLTTDFGNAEKTDRFVMKKLNDRVMRVEYHFLSPYVSPKESPFRHVFWGSGSHTLPALLENLKLRKQNNGAFNETLFRNQLALATWTIQGAANALSGDVWDIDNEF',
27
+ 'gfap': 'MERRRITSAARRSYVSSGEMMVGGLAPGRRLGPGTRLSLARMPPPLPTRVDFSLAGALNAGFKETRASERAEMMELNDRFASYIEKVRFLEQQNKALAAELNQLRAKEPTKLADVYQAELRELRLRLDQLTANSARLEVERDNLAQDLATVRQKLQDETNLRLEAENNLAAYRQEADEATLARLDLERKIESLEEEIRFLRKIHEEEVRELQEQLARQQVHVELDVAKPDLTAALKEIRTQYEAMASSNMHEAEEWYRSKFADLTDAAARNAELLRQAKHEANDYRRQLQSLTCDLESLRGTNESLERQMREQEERHVREAASYQEALARLEEEGQSLKDEMARHLQEYQDLLNVKLALDIEIATYRKLLEGEENRITIPVQTFSNLQIRETSLDTKSVSEGHLKRNIVVKTVEMRDGEVIKESKQEHKDVM',
28
+ 'glp1': 'MAGAPGPLRLALLLLGMVGRAGPRPQGATVSLWETVQKWREYRRQCQRSLTEDPPPATDLFCNRTFDEYACWPDGEPGSFVNVSCPWYLPWASSVPQGHVYRFCTAEGLWLQKDNSSLPWRDLSECEESKRGERSSPEEQLLFLYIIYTVGYALSFSALVIASAILLGFRHLHCTRNYIHLNLFASFILRALSVFIKDAALKWMYSTAAQQHQWDGLLSYQDSLSCRLVFLLMQYCVAANYYWLLVEGVYLYTLLAFSVLSEQWIFRLYVSIGWGVPLLFVVPWGIVKYLYEDEGCWTRNSNMNYWLIIRLPILFAIGVNFLIFVRVICIVVSKLKANLMCKTDIKCRLAKSTLTLIPLLGTHEVIFAFVMDEHARGTLRFIKLFTELSFTSFQGLMVAILYCFVNNEVQLEFRKSWERWRLEHLHIQRDSSMKPLKCPTSSLSSGATAGSSMYTATCQASCS',
29
+ 'glast': 'MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLLTVTAVIVGTILGFTLRPYRMSYREVKYFSFPGELLMRMLQMLVLPLIISSLVTGMAALDSKASGKMGMRAVVYYMTTTIIAVVIGIIIVIIIHPGKGTKENMHREGKIVRVTAADAFLDLIRNMFPPNLVEACFKQFKTNYEKRSFKVPIQANETLVGAVINNVSEAMETLTRITEELVPVPGSVNGVNALGLVVFSMCFGFVIGNMKEQGQALREFFDSLNEAIMRLVAVIMWYAPVGILFLIAGKIVEMEDMGVIGGQLAMYTVTVIVGLLIHAVIVLPLLYFLVTRKNPWVFIGGLLQALITALGTSSSSATLPITFKCLEENNGVDKRVTRFVLPVGATINMDGTALYEALAAIFIAQVNNFELNFGQIITISITATAASIGAAGIPQAGLVTMVIVLTSVGLPTDDITLIIAVDWFLDRLRTTTNVLGDSLGAGIVEHLSRHELKNRDVEMGNSVIEENEMKKPYQLIAQDNETEKPIDSETKM',
30
+ 'ncam': 'LQTKDLIWTLFFLGTAVSLQVDIVPSQGEISVGESKFFLCQVAGDAKDKDISWFSPNGEKLTPNQQRISVVWNDDSSSTLTIYNANIDDAGIYKCVVTGEDGSESEATVNVKIFQKLMFKNAPTPQEFREGEDAVIVCDVVSSLPPTIIWKHKGRDVILKKDVRFIVLSNNYLQIRGIKKTDEGTYRCEGRILARGEINFKDIQVIVNVPPTIQARQNIVNATANLGQSVTLVCDAEGFPEPTMSWTKDGEQIEQEEDDEKYIFSDDSSQLTIKKVDKNDEAEYICIAENKAGEQDATIHLKVFAKPKITYVENQTAMELEEQVTLTCEASGDPIPSITWRTSTRNISSEEKASWTRPEKQETLDGHMVVRSHARVSSLTLKSIQYTDAGEYICTASNTIGQDSQSMYLEVQYAPKLQGPVAVYTWEGNQVNITCEVFAYPSATISWFRDGQLLPSSNYSNIKIYNTPSASYLEVTPDSENDFGNYNCTAVNRIGQESLEFILVQADTPSSPSIDQVEPYSSTAQVQFDEPEATGGVPILKYKAEWRAVGEEVWHSKWYDAKEASMEGIVTIVGLKPETTYAVRLAALNGKGLGEISAASEF',
31
+ 'cereblon': 'MAGEGDQQDAAHNMGNHLPLLPAESEEEDEMEVEDQDSKEAKKPNIINFDTSLPTSHTYLGADMEEFHGRTLHDDDSCQVIPVLPQVMMILIPGQTLPLQLFHPQEVSMVRNLIQKDRTFAVLAYSNVQEREAQFGTTAEIYAYREEQDFGIEIVKVKAIGRQRFKVLELRTQSDGIQQAKVQILPECVLPSTMSAVQLESLNKCQIFPSKPVSREDQCSYKWWQKYQKRKFHCANLTSWPRWLYSLYDAETLMDRIKKQLREWDENLKDDSLPSNPIDFSYRVAACLPIDDVLRIQLLKIGSAIQRLRCELDIMNKCTSLCCKQCQETEITTKNEIFSLSLCGPMAAYVNPHGYVHETLTVYKACNLNLIGRPSTEHSWFPGYAWTVAQCKICASHIGWKFTATKKDMSPQKFWGLTRSALLPTIPDTEDEISPDKVILCL',
32
+ 'ligase': 'MASQPPEDTAESQASDELECKICYNRYNLKQRKPKVLECCHRVCAKCLYKIIDFGDSPQGVIVCPFCRFETCLPDDEVSSLPDDNNILVNLTCGGKGKKCLPENPTELLLTPKRLASLVSPSHTSSNCLVITIMEVQRESSPSLSSTPVVEFYRPASFDSVTTVSHNWTVWNCTSLLFQTSIRVLVWLLGLLYFSSLPLGIYLLVSKKVTLGVVFVSLVPSSLVILMVYGFCQCVCHEFLDCMAPPS',
33
+ 'skp2': 'MHRKHLQEIPDLSSNVATSFTWGWDSSKTSELLSGMGVSALEKEEPDSENIPQELLSNLGHPESPPRKRLKSKGSDKDFVIVRRPKLNRENFPGVSWDSLPDELLLGIFSCLCLPELLKVSGVCKRWYRLASDESLWQTLDLTGKNLHPDVTGRLLSQGVIAFRCPRSFMDQPLAEHFSPFRVQHMDLSNSVIEVSTLHGILSQCSKLQNLSLEGLRLSDPIVNTLAKNSNLVRLNLSGCSGFSEFALQTLLSSCSRLDELNLSWCFDFTEKHVQVAVAHVSETITQLNLSGYRKNLQKSDLSTLVRRCPNLVHLDLSDSVMLKNDCFQEFFQLNYLQHLSLSRCYDIIPETLLELGEIPTLKTLQVFGIVPDGTLQLLKEALPHLQINCSHFTTIARPTIGNKKNQEIWGIKCRLTLQKPSCL',
34
+ 'p53': 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD',
35
+ 'egfp': 'VSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK'
36
+ }
37
+
38
def save_logs_to_file(config, valid_fraction_log, score_logs, output_path):
    """
    Saves the MCTS iteration logs to a CSV file (one row per iteration).

    Parameters:
        config: Run configuration (unused; kept for interface stability).
        valid_fraction_log (list): Log of valid fractions over iterations.
        score_logs (dict): Dict mapping score func names to lists of scores
            (each list must match len(valid_fraction_log)).
        output_path (str): Path to save the log CSV file.
    """
    # BUG FIX: os.makedirs('') raises FileNotFoundError when output_path has
    # no directory component; only create the parent when one exists.
    parent = os.path.dirname(output_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    log_data = {
        "Iteration": list(range(1, len(valid_fraction_log) + 1)),
        "Valid Fraction": valid_fraction_log,
    }
    # One column per score function.
    for name, log in score_logs.items():
        log_data[name] = log

    # Save to CSV
    pd.DataFrame(log_data).to_csv(output_path, index=False)
60
+
61
def plot_data(log1, log2=None,
              save_path=None,
              label1="Log 1",
              label2=None,
              title="Fraction of Valid Peptides Over Iterations",
              palette=None):
    """
    Plots one or two datasets with their mean values over iterations.

    Parameters:
        log1 (list): The first list of mean values for each iteration.
        log2 (list, optional): The second list of mean values for each iteration. Defaults to None.
        save_path (str): Path to save the plot. Defaults to None.
        label1 (str): Label for the first dataset. Defaults to "Log 1".
        label2 (str, optional): Label for the second dataset. Defaults to None.
        title (str): Title of the plot.
        palette (dict, optional): Custom colors keyed by dataset label; when
            None the default purple/pink pair is used.
    """
    # Prepare data for log1
    data1 = pd.DataFrame({
        "Iteration": range(1, len(log1) + 1),
        "Fraction of Valid Peptides": log1,
        "Dataset": label1
    })

    # Prepare data for log2 if provided
    if log2 is not None:
        data2 = pd.DataFrame({
            "Iteration": range(1, len(log2) + 1),
            "Fraction of Valid Peptides": log2,
            "Dataset": label2
        })
        data = pd.concat([data1, data2], ignore_index=True)
    else:
        data = data1

    # BUG FIX: the original unconditionally overwrote a caller-supplied
    # `palette`; only fall back to the defaults when none was given.
    if palette is None:
        palette = {
            label1: "#8181ED",  # Default color for log1
            label2: "#D577FF"   # Default color for log2 (if provided)
        }

    # Set Seaborn theme
    sns.set_theme()
    sns.set_context("paper")

    # Create the plot
    sns.lineplot(
        data=data,
        x="Iteration",
        y="Fraction of Valid Peptides",
        hue="Dataset",
        style="Dataset",
        markers=True,
        dashes=False,
        palette=palette
    )

    # Titles and labels
    plt.title(title)
    plt.xlabel("Iteration")
    plt.ylabel("Fraction of Valid Peptides")

    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        print(f"Plot saved to {save_path}")
    plt.show()
127
+
128
def plot_data_with_distribution_seaborn(log1, log2=None,
                                        save_path=None,
                                        label1=None,
                                        label2=None,
                                        title=None):
    """
    Plots one or two datasets with the average values and distributions over iterations using Seaborn.

    Parameters:
        log1 (list of lists): The first list of scores (each element is a list of scores for an iteration).
        log2 (list of lists, optional): The second list of scores. Defaults to None.
        save_path (str): Path to save the plot. Defaults to None.
        label1 (str): Label for the first dataset.
        label2 (str, optional): Label for the second dataset. Defaults to None.
        title (str): Title of the plot.
    """
    # Prepare data for log1: one row per individual score, tagged by iteration.
    data1 = pd.DataFrame({
        "Iteration": np.repeat(range(1, len(log1) + 1), [len(scores) for scores in log1]),
        "Fraction of Valid Peptides": [float(score) for scores in log1 for score in scores],
        "Dataset": label1,
        "Style": "Log1"
    })

    # Prepare data for log2 if provided
    if log2 is not None:
        data2 = pd.DataFrame({
            "Iteration": np.repeat(range(1, len(log2) + 1), [len(scores) for scores in log2]),
            "Fraction of Valid Peptides": [float(score) for scores in log2 for score in scores],
            "Dataset": label2,
            "Style": "Log2"
        })
        data = pd.concat([data1, data2], ignore_index=True)
    else:
        data = data1

    palette = {
        label1: "#8181ED",  # Default color for log1
        label2: "#D577FF"   # Default color for log2 (if provided)
    }

    # Set Seaborn theme
    sns.set_theme()
    sns.set_context("paper")

    # Create the plot
    sns.relplot(
        data=data,
        kind="line",
        x="Iteration",
        y="Fraction of Valid Peptides",
        hue="Dataset",
        style="Style",
        markers=True,
        dashes=True,
        # FIX: `ci="sd"` is deprecated since seaborn 0.12; errorbar="sd" is the
        # documented replacement (pinned seaborn==0.13.2 in environment.yml).
        errorbar="sd",  # Show standard deviation band
        height=5,
        aspect=1.5,
        palette=palette
    )

    # Titles and labels
    plt.title(title)
    plt.xlabel("Iteration")
    plt.ylabel("Fraction of Valid Peptides")

    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        print(f"Plot saved to {save_path}")
    plt.show()
198
+
199
@torch.no_grad()
def generate_valid_mcts(config, mdlm, prot1=None, prot2=None, filename=None, prot_name1=None, prot_name2 = None):
    """Run Pareto-MCTS generation starting from a fully masked token sequence.

    Parameters:
        config: Hydra run configuration; config.mcts.{perm,dual,single,time_dependent}
            selects the objective set used by the tree search.
        mdlm: Pretrained masked diffusion language model (supplies tokenizer/device).
        prot1 (str, optional): Target protein sequence for binding objectives.
        prot2 (str, optional): Second target protein sequence (dual-binding mode).
        filename (str, optional): Run identifier (not used in the paths below).
        prot_name1 (str, optional): Name of protein 1; used to build output paths.
        prot_name2 (str, optional): Name of protein 2 (unused in this function).

    Returns:
        (paretoFront, inputs): Pareto-optimal sequences found by the search and
        the tokenized fully-masked input used as the root state.
    """
    tokenizer = mdlm.tokenizer
    max_sequence_length = config.sampling.seq_length

    # generate array of [MASK] tokens
    masked_array = mask_for_de_novo(config, max_sequence_length)

    inputs = tokenizer.encode(masked_array)

    # move every tokenizer output tensor onto the model's device
    inputs = {key: value.to(mdlm.device) for key, value in inputs.items()}

    # initialize root node of the search tree (timestep 0 = fully masked)
    rootNode = Node(config=config, tokens=inputs, timestep=0)

    # Select the objective set; num_func mirrors score_func_names
    # (all zeros; re-zeroed below unless config.mcts.time_dependent).
    if config.mcts.perm:
        score_func_names = ['permeability', 'binding_affinity1', 'solubility', 'hemolysis', 'nonfouling']
        num_func = [0, 0, 0, 0, 0]
    elif config.mcts.dual:
        score_func_names = ['binding_affinity1', 'solubility', 'hemolysis', 'nonfouling', 'binding_affinity2']
        num_func = [0, 0, 0, 0, 0]
    elif config.mcts.single:
        if config.mode == 'binding':
            score_func_names = ['binding_affinity1']
        else:
            score_func_names = ['permeability']
        num_func = [0]
    else:
        score_func_names = ['binding_affinity1', 'solubility', 'hemolysis', 'nonfouling']
        num_func = [0, 0, 0, 0]

    if not config.mcts.time_dependent:
        num_func = [0] * len(score_func_names)

    # NOTE(review): parses as `prot1 and (prot2 is not None)` — relies on prot1
    # being a truthy string; confirm this is the intended dual-target condition.
    if prot1 and prot2 is not None:
        mcts = MCTS(config=config, max_sequence_length=max_sequence_length, mdlm=mdlm, score_func_names=score_func_names, prot_seqs=[prot1, prot2], num_func=num_func)
    elif prot1 is not None:
        mcts = MCTS(config=config, max_sequence_length=max_sequence_length, mdlm=mdlm, score_func_names=score_func_names, prot_seqs=[prot1], num_func=num_func)
    # NOTE(review): the final two branches construct identical MCTS objects.
    elif config.mcts.single:
        mcts = MCTS(config=config, max_sequence_length=max_sequence_length, mdlm=mdlm, score_func_names=score_func_names, num_func=num_func)
    else:
        mcts = MCTS(config=config, max_sequence_length=max_sequence_length, mdlm=mdlm, score_func_names=score_func_names, num_func=num_func)

    # run the tree search from the fully masked root
    paretoFront = mcts.forward(rootNode)

    # persist per-iteration logs and diagnostic plots under the protein's folder
    output_log_path = f'{config.base_path}/{prot_name1}/log_(unknown).csv'
    save_logs_to_file(config, mcts.valid_fraction_log, mcts.score_logs, output_log_path)

    plot_data(mcts.valid_fraction_log,
              save_path=f'{config.base_path}/{prot_name1}/valid_(unknown).png')

    for name in mcts.score_func_names:
        plot_data_with_distribution_seaborn(log1=mcts.score_logs[name],
                                            save_path=f'{config.base_path}/{prot_name1}/{name}_(unknown).png',
                                            label1=f"Average {name}",
                                            title=f"Average {name} Over Iterations")

    return paretoFront, inputs
258
+
259
+
260
@hydra.main(version_base=None, config_path='.', config_name='config')
def main(config):
    """Run Pareto-MCTS guided generation for the configured target protein(s),
    score each Pareto-front sequence, and write results plus summary stats."""
    # Get parameters from config with defaults
    prot_name1 = config.get('prot_name1', 'gfap')
    prot_name2 = config.get('prot_name2', None)
    mode = config.get('mode', '2')
    model = config.get('model_type', 'mcts')
    length = config.get('length', '100')
    epoch = config.get('epoch', '7')

    filename = f'{mode}_{model}_length_{length}_epoch_{epoch}'

    tokenizer = SMILES_SPE_Tokenizer(f'{config.base_path}/src/tokenizer/new_vocab.txt',
                                     f'{config.base_path}/src/tokenizer/new_splits.txt')

    mdlm = Diffusion.load_from_checkpoint(config.eval.checkpoint_path, config=config, tokenizer=tokenizer, strict=False)

    mdlm.eval()
    device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
    mdlm.to(device)

    print("loaded models...")
    analyzer = PeptideAnalyzer()

    # Look up protein sequences from names
    prot_seq1 = PROTEIN_SEQUENCES.get(prot_name1.lower())
    prot_seq2 = PROTEIN_SEQUENCES.get(prot_name2.lower()) if prot_name2 else None

    if prot_seq1 is None:
        raise ValueError(f"Protein '{prot_name1}' not found in PROTEIN_SEQUENCES dictionary. Available proteins: {list(PROTEIN_SEQUENCES.keys())}")

    if prot_name2 and prot_seq2 is None:
        raise ValueError(f"Protein '{prot_name2}' not found in PROTEIN_SEQUENCES dictionary. Available proteins: {list(PROTEIN_SEQUENCES.keys())}")

    print(f"Using protein 1: {prot_name1}")
    if prot_name2:
        print(f"Using protein 2: {prot_name2}")

    t_start = time.time()
    paretoFront, input_array = generate_valid_mcts(config, mdlm, prot_seq1, prot_seq2, filename, prot_name1, prot_name2)
    generation_results = []

    for sequence, v in paretoFront.items():
        generated_array = v['token_ids'].to(mdlm.device)

        # compute perplexity against the fully-masked input
        perplexity = mdlm.compute_masked_perplexity(generated_array, input_array['input_ids'])
        perplexity = round(perplexity, 4)

        aa_seq, seq_length = analyzer.analyze_structure(sequence)
        scores = v['scores']

        # Shared objectives used by every multi-objective mode.
        if not config.mcts.single:
            binding1 = scores[0]
            solubility = scores[1]
            hemo = scores[2]
            nonfouling = scores[3]

        if config.mcts.perm:
            permeability = scores[4]
            generation_results.append([sequence, perplexity, aa_seq, binding1, solubility, hemo, nonfouling, permeability])
            print(f"perplexity: {perplexity} | length: {seq_length} | smiles sequence: {sequence} | amino acid sequence: {aa_seq} | Binding Affinity: {binding1} | Solubility: {solubility} | Hemolysis: {hemo} | Nonfouling: {nonfouling} | Permeability: {permeability}")
        elif config.mcts.dual:
            binding2 = scores[4]
            generation_results.append([sequence, perplexity, aa_seq, binding1, binding2, solubility, hemo, nonfouling])
            print(f"perplexity: {perplexity} | length: {seq_length} | smiles sequence: {sequence} | amino acid sequence: {aa_seq} | Binding Affinity 1: {binding1} | Binding Affinity 2: {binding2} | Solubility: {solubility} | Hemolysis: {hemo} | Nonfouling: {nonfouling}")
        elif config.mcts.single:
            # BUG FIX: single-objective results were computed but never recorded,
            # which left the 'Permeability' DataFrame below empty in single mode.
            permeability = scores[0]
            generation_results.append([sequence, perplexity, aa_seq, permeability])
            print(f"perplexity: {perplexity} | length: {seq_length} | smiles sequence: {sequence} | amino acid sequence: {aa_seq} | Permeability: {permeability}")
        else:
            generation_results.append([sequence, perplexity, aa_seq, binding1, solubility, hemo, nonfouling])
            print(f"perplexity: {perplexity} | length: {seq_length} | smiles sequence: {sequence} | amino acid sequence: {aa_seq} | Binding Affinity: {binding1} | Solubility: {solubility} | Hemolysis: {hemo} | Nonfouling: {nonfouling}")

        sys.stdout.flush()

    # Column layout must match the append order of the branch taken above.
    if config.mcts.perm:
        df = pd.DataFrame(generation_results, columns=['Generated SMILES', 'Perplexity', 'Peptide Sequence', 'Binding Affinity', 'Solubility', 'Hemolysis', 'Nonfouling', 'Permeability'])
    elif config.mcts.dual:
        df = pd.DataFrame(generation_results, columns=['Generated SMILES', 'Perplexity', 'Peptide Sequence', 'Binding Affinity 1', 'Binding Affinity 2', 'Solubility', 'Hemolysis', 'Nonfouling'])
    elif config.mcts.single:
        df = pd.DataFrame(generation_results, columns=['Generated SMILES', 'Perplexity', 'Peptide Sequence', 'Permeability'])
    else:
        df = pd.DataFrame(generation_results, columns=['Generated SMILES', 'Perplexity', 'Peptide Sequence', 'Binding Affinity', 'Solubility', 'Hemolysis', 'Nonfouling'])

    df.to_csv(f'{config.base_path}/{prot_name1}/(unknown).csv', index=False)

    # ── timing ──
    elapsed = time.time() - t_start
    print(f"\n{'='*60}")
    print(f"Generation complete in {elapsed:.1f}s ({elapsed/60:.1f} min)")
    print(f"Pareto front size: {len(df)}")

    # ── score statistics ──
    score_cols = [c for c in df.columns if c not in ('Generated SMILES', 'Peptide Sequence')]
    print(f"\n{'Score':<22} {'Mean':>8} {'Std':>8} {'Min':>8} {'Max':>8}")
    print('-' * 58)
    for col in score_cols:
        vals = pd.to_numeric(df[col], errors='coerce').dropna()
        if len(vals) == 0:
            continue
        print(f"{col:<22} {vals.mean():8.4f} {vals.std():8.4f} {vals.min():8.4f} {vals.max():8.4f}")
    print('=' * 60)
362
+
363
+
364
+ if __name__ == "__main__":
365
+ main()
src/generate_unconditional.py ADDED
@@ -0,0 +1,111 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import torch
3
+ import torch.nn.functional as F
4
+ import sys
5
+ import pandas as pd
6
+ import omegaconf
7
+ from utils.generate_utils import mask_for_de_novo, calculate_cosine_sim, calculate_hamming_dist
8
+ from diffusion import Diffusion
9
+ import hydra
10
+ from tqdm import tqdm
11
+ from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer
12
+ from utils.app import PeptideAnalyzer
13
+ from scoring.scoring_functions import ScoringFunctions
14
+
15
# Register custom OmegaConf resolvers required by config.yaml
# (replace=True makes re-registration safe if this module is imported twice).
omegaconf.OmegaConf.register_new_resolver('cwd', os.getcwd, replace=True)
omegaconf.OmegaConf.register_new_resolver('device_count', torch.cuda.device_count, replace=True)
# NOTE(review): the 'eval' resolver executes arbitrary config expressions —
# safe only when configs come from trusted sources.
omegaconf.OmegaConf.register_new_resolver('eval', eval, replace=True)
omegaconf.OmegaConf.register_new_resolver('div_up', lambda x, y: (x + y - 1) // y, replace=True)

# Hard-coded install location and pretrained checkpoint; edit before running.
base_path = '/path/to/your/home/PepTune'
ckpt_path = base_path + '/checkpoints/peptune-pretrained.ckpt'
23
+
24
@torch.no_grad()
def generate_sequence_unconditional(config, sequence_length: int, mdlm: Diffusion):
    """Sample one unconditional sequence of `sequence_length` tokens.

    Builds a fully [MASK]ed input, tokenizes it, moves it to the model's
    device, and runs the diffusion sampler (config.sampling.steps controls
    the number of denoising steps).

    Returns:
        (sampled_tokens, encoded_input): the sampler output and the
        tokenized masked input it started from.
    """
    fully_masked = mask_for_de_novo(config, sequence_length)
    encoded = mdlm.tokenizer.encode(fully_masked)
    # move every tokenizer output tensor onto the model's device
    encoded = {name: tensor.to(mdlm.device) for name, tensor in encoded.items()}
    sampled = mdlm._sample(x_input=encoded)
    return sampled, encoded
38
+
39
+
40
@hydra.main(version_base=None, config_path='.', config_name='config')
def main(config):
    """Generate unconditional sequences, score the valid peptides against the
    GFAP target, and write per-sequence results to a CSV file."""

    tokenizer = SMILES_SPE_Tokenizer(f'{base_path}/src/tokenizer/new_vocab.txt',
                                     f'{base_path}/src/tokenizer/new_splits.txt')

    # Build model with current config, then load weights manually
    # (load_from_checkpoint overrides config with saved hparams)
    mdlm_model = Diffusion(config=config, tokenizer=tokenizer)
    ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    mdlm_model.load_state_dict(ckpt["state_dict"], strict=False)

    mdlm_model.eval()
    device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
    mdlm_model.to(device)

    print("loaded models...")
    analyzer = PeptideAnalyzer()

    # GFAP target protein sequence used for the binding-affinity objective.
    gfap = 'MERRRITSAARRSYVSSGEMMVGGLAPGRRLGPGTRLSLARMPPPLPTRVDFSLAGALNAGFKETRASERAEMMELNDRFASYIEKVRFLEQQNKALAAELNQLRAKEPTKLADVYQAELRELRLRLDQLTANSARLEVERDNLAQDLATVRQKLQDETNLRLEAENNLAAYRQEADEATLARLDLERKIESLEEEIRFLRKIHEEEVRELQEQLARQQVHVELDVAKPDLTAALKEIRTQYEAMASSNMHEAEEWYRSKFADLTDAAARNAELLRQAKHEANDYRRQLQSLTCDLESLRGTNESLERQMREQEERHVREAASYQEALARLEEEGQSLKDEMARHLQEYQDLLNVKLALDIEIATYRKLLEGEENRITIPVQTFSNLQIRETSLDTKSVSEGHLKRNIVVKTVEMRDGEVIKESKQEHKDVM'

    # scoring functions
    score_func_names = ['binding_affinity1', 'solubility', 'hemolysis', 'nonfouling', 'permeability']
    score_functions = ScoringFunctions(score_func_names, [gfap])


    max_seq_length = config.sampling.seq_length
    num_sequences = config.sampling.num_sequences
    generation_results = []
    num_valid = 0.
    num_total = 0.
    # Sample until num_sequences attempts have been made (valid or not).
    while num_total < num_sequences:
        num_total += 1
        generated_array, input_array = generate_sequence_unconditional(config, max_seq_length, mdlm_model)

        # store in device
        generated_array = generated_array.to(mdlm_model.device)
        print(generated_array)

        # compute masked perplexity
        perplexity = mdlm_model.compute_masked_perplexity(generated_array, input_array['input_ids'])
        perplexity = round(perplexity, 4)

        smiles_seq = tokenizer.decode(generated_array)
        # Only chemically valid peptide SMILES are scored and recorded.
        if analyzer.is_peptide(smiles_seq):
            aa_seq, seq_length = analyzer.analyze_structure(smiles_seq)
            num_valid += 1
            scores = score_functions(input_seqs=[smiles_seq])

            # scores[0] holds the five objective values for the single input
            binding = scores[0][0]
            sol = scores[0][1]
            hemo = scores[0][2]
            nf = scores[0][3]
            perm = scores[0][4]

            generation_results.append([smiles_seq, perplexity, aa_seq, binding, sol, hemo, nf, perm])
        else:
            aa_seq = "not valid peptide"
            seq_length = '-'
            scores = "not valid peptide"


        print(f"perplexity: {perplexity} | length: {seq_length} | smiles sequence: {smiles_seq} | amino acid sequence: {aa_seq} | scores: {scores}")
        sys.stdout.flush()

    valid_frac = num_valid / num_total
    print(f"fraction of synthesizable peptides: {valid_frac}")
    df = pd.DataFrame(generation_results, columns=['Generated SMILES', 'Perplexity', 'Peptide Sequence', 'Binding Affinity', 'Solubility', 'Hemolysis', 'Nonfouling', 'Permeability'])
    df.to_csv(base_path + f'/results/test_generate.csv', index=False)
109
+
110
+ if __name__ == "__main__":
111
+ main()
src/metrics.py ADDED
@@ -0,0 +1,72 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ from math import sqrt
3
+
4
+ def summarize_metrics(skip, csv_path: str, save_path: str | None = None) -> pd.DataFrame:
5
+ """
6
+ Compute mean and standard deviation for all columns except the first
7
+ (assumed non-numeric identifier like 'Peptide Sequence').
8
+
9
+ Returns a DataFrame with rows = column names and columns = ['mean','std','count'].
10
+ Uses sample std (ddof=1). Non-numeric cells are coerced to NaN.
11
+ """
12
+ df = pd.read_csv(csv_path)
13
+ vals = df.iloc[:, skip:].apply(pd.to_numeric, errors='coerce') # columns 2..end
14
+ stats = vals.agg(['mean', 'std', 'count']).T # shape: (num_metrics, 3)
15
+ if save_path:
16
+ stats.to_csv(save_path, index=True)
17
+ return stats
18
+
19
def summarize_list(xs, ddof = 1):
    """Return {'mean', 'std', 'count'} for the numeric entries of xs.

    Entries that are None, empty strings, or not coercible to float are
    silently skipped. std uses the given delta degrees of freedom
    (ddof=1 gives the sample standard deviation).

    Raises:
        ValueError: if no numeric values remain, or too few for the ddof.
    """
    # Clean & coerce to float, skipping anything non-numeric.
    numbers = []
    for raw in xs:
        if raw is None or raw == "":
            continue
        try:
            numbers.append(float(raw))
        except (TypeError, ValueError):
            continue

    n = len(numbers)
    if n == 0:
        raise ValueError("No numeric values found.")
    if n <= ddof:
        raise ValueError(f"Need at least {ddof + 1} numeric values; got {n}.")

    # Two-pass mean/variance.
    mean = sum(numbers) / n
    var = sum((v - mean) ** 2 for v in numbers) / (n - ddof)

    return {"mean": mean, "std": sqrt(var), "count": n}
52
+
53
def csv_column_to_list(path: str, column: str, *, dropna: bool = True):
    """Read a single CSV column into a plain Python list.

    Parameters:
        path: CSV file to read.
        column: Name of the column to extract.
        dropna: When True (default), NaN entries are removed.

    Raises:
        KeyError: if the column is not present in the file.
    """
    frame = pd.read_csv(path)
    try:
        series = frame[column]
    except KeyError:
        raise KeyError(f"Column '{column}' not found. Available: {list(frame.columns)}") from None
    return series.dropna().tolist() if dropna else series.tolist()
61
+
62
def main():
    """Summarize one generation-results CSV and print/save the statistics."""
    # NOTE(review): hard-coded, user-specific paths; `csv_path` below is unused.
    csv_path = "/scratch/pranamlab/sophtang/home/tr2d2/peptides/plots/glast_resample20_no-mcts/"
    path = "/scratch/pranamlab/sophtang/home/TR2-D2/tr2d2-pep/results/tfr_resample10_buffer20_numiter10_children50_20260326_183626"
    prot_name = "tfr"
    # skip=1 drops the first (sequence) column from the summary.
    stats = summarize_metrics(skip=1, csv_path=f"{path}/{prot_name}_generation_results.csv",
                              save_path=f"{path}/results_summary.csv")

    print(stats)
71
+ if __name__ == '__main__':
72
+ main()
src/noise_schedule.py ADDED
@@ -0,0 +1,152 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## Adapted from MDLM: https://github.com/kuleshov-group/mdlm
2
+
3
+ import abc
4
+
5
+ import torch
6
+ import torch.nn as nn
7
+
8
+ torch._C._jit_set_profiling_mode(False)
9
+ torch._C._jit_set_profiling_executor(False)
10
+ torch._C._jit_override_can_fuse_on_cpu(True)
11
+ torch._C._jit_override_can_fuse_on_gpu(True)
12
+
13
def get_noise(config, dtype=torch.float32):
    """Instantiate the noise schedule named by config.noise.type.

    `dtype` is only consulted by the 'linear' schedule; `sigma_min`/`sigma_max`
    are only read for 'geometric' and 'linear'.

    Raises:
        ValueError: for an unrecognized noise type.
    """
    noise_type = config.noise.type
    if noise_type == 'geometric':
        return GeometricNoise(config.noise.sigma_min, config.noise.sigma_max)
    if noise_type == 'loglinear':
        return LogLinearNoise()
    if noise_type == 'cosine':
        return CosineNoise()
    if noise_type == 'cosinesqr':
        return CosineSqrNoise()
    if noise_type == 'linear':
        return Linear(config.noise.sigma_min, config.noise.sigma_max, dtype)
    raise ValueError(f'{config.noise.type} is not a valid noise')
26
+
27
+
28
def binary_discretization(z):
    """Straight-through sign: forward value is sign(z), while the gradient
    flows through the last-dim-normalized z (the hard - soft term is detached)."""
    hard = torch.sign(z)
    soft = z / torch.norm(z, dim=-1, keepdim=True)
    return soft + (hard - soft).detach()
32
+
33
+
34
class Noise(abc.ABC, nn.Module):
    """
    Base class for noise schedules: forward returns the total noise and the
    rate of noise at a timestep. Subclasses must provide total_noise(t) and
    rate_noise(t).
    """
    def forward(self, t):
        # Assume time goes from 0 to 1.
        return self.total_noise(t), self.rate_noise(t)
41
+
42
+
43
class CosineNoise(Noise):
    """Cosine noise schedule: total_noise(t) = -log(eps + (1 - eps) cos(pi t / 2))."""

    def __init__(self, eps=1e-3):
        super().__init__()
        self.eps = eps

    def rate_noise(self, t):
        # Derivative of total_noise with respect to t.
        scaled_cos = (1 - self.eps) * torch.cos(t * torch.pi / 2)
        scaled_sin = (1 - self.eps) * torch.sin(t * torch.pi / 2)
        return (torch.pi / 2) * scaled_sin / (scaled_cos + self.eps)

    def total_noise(self, t):
        return - torch.log(self.eps + (1 - self.eps) * torch.cos(t * torch.pi / 2))
57
+
58
+
59
class CosineSqrNoise(Noise):
    """Squared-cosine schedule: total_noise(t) = -log(eps + (1 - eps) cos^2(pi t / 2))."""

    def __init__(self, eps=1e-3):
        super().__init__()
        self.eps = eps

    def rate_noise(self, t):
        # Derivative of total_noise; sin(pi t) = 2 sin(pi t/2) cos(pi t/2).
        sq_cos = (1 - self.eps) * (
            torch.cos(t * torch.pi / 2) ** 2)
        full_sin = (1 - self.eps) * torch.sin(t * torch.pi)
        return (torch.pi / 2) * full_sin / (sq_cos + self.eps)

    def total_noise(self, t):
        return - torch.log(self.eps + (1 - self.eps) * (torch.cos(t * torch.pi / 2) ** 2))
74
+
75
+
76
class Linear(Noise):
    """Linear noise schedule: sigma(t) = sigma_min + t * (sigma_max - sigma_min)."""

    def __init__(self, sigma_min=0, sigma_max=10, dtype=torch.float32):
        super().__init__()
        self.sigma_min = torch.tensor(sigma_min, dtype=dtype)
        self.sigma_max = torch.tensor(sigma_max, dtype=dtype)

    def rate_noise(self, t):
        # BUG FIX: the original signature `rate_noise(self)` broke
        # Noise.forward(t), which calls self.rate_noise(t).
        # d sigma / dt is the constant (sigma_max - sigma_min).
        return self.sigma_max - self.sigma_min

    def total_noise(self, t):
        return self.sigma_min + t * (self.sigma_max - self.sigma_min)

    def importance_sampling_transformation(self, t):
        # Interpolate between log(1 - e^{-sigma}) endpoints, then invert sigma(t)
        # back to a time in [0, 1].
        f_T = torch.log1p(- torch.exp(- self.sigma_max))
        f_0 = torch.log1p(- torch.exp(- self.sigma_min))
        sigma_t = - torch.log1p(- torch.exp(t * f_T + (1 - t) * f_0))
        return (sigma_t - self.sigma_min) / (
            self.sigma_max - self.sigma_min)
94
+
95
+
96
class GeometricNoise(Noise):
    """Geometric interpolation between sigma_min and sigma_max:
    total_noise(t) = sigma_min^(1-t) * sigma_max^t."""

    def __init__(self, sigma_min=1e-3, sigma_max=1):
        super().__init__()
        # store both endpoints as a float tensor: [sigma_min, sigma_max]
        self.sigmas = 1.0 * torch.tensor([sigma_min, sigma_max])

    def rate_noise(self, t):
        # d/dt of total_noise(t) = total_noise(t) * log(sigma_max / sigma_min)
        return self.sigmas[0] ** (1 - t) * self.sigmas[1] ** t * (
            self.sigmas[1].log() - self.sigmas[0].log())

    def total_noise(self, t):
        return self.sigmas[0] ** (1 - t) * self.sigmas[1] ** t
107
+
108
+
109
class LogLinearNoise(Noise):
    """Log-linear noise schedule.

    total_noise(t) = -log(1 - (1 - eps) * t), so 1 - exp(-total_noise(t))
    interpolates from 0 to ~1 as t goes from 0 to 1.
    """

    def __init__(self, eps=1e-3):
        super().__init__()
        self.eps = eps
        self.sigma_max = self.total_noise(torch.tensor(1.0))
        self.sigma_min = self.eps + self.total_noise(torch.tensor(0.0))

    def rate_noise(self, t):
        # d/dt of total_noise(t)
        return (1 - self.eps) / (1 - (1 - self.eps) * t)

    def total_noise(self, t):
        return -torch.log1p(-(1 - self.eps) * t)

    def importance_sampling_transformation(self, t):
        # Interpolate in log(1 - exp(-sigma)) space, then map sigma back to time.
        log_gamma_max = torch.log1p(- torch.exp(- self.sigma_max))
        log_gamma_min = torch.log1p(- torch.exp(- self.sigma_min))
        sigma_t = - torch.log1p(- torch.exp(t * log_gamma_max + (1 - t) * log_gamma_min))
        return - torch.expm1(- sigma_t) / (1 - self.eps)
135
+
136
class LogPolyNoise(Noise):
    """
    Log Polynomial noise schedule for slower masking of peptide bond tokens:
    total_noise(t) = -log(1 - (1 - eps) * t^3).
    """

    def __init__(self, eps=1e-3):
        super().__init__()
        self.eps = eps
        self.sigma_max = self.total_noise(torch.tensor(1.0))
        self.sigma_min = self.eps + self.total_noise(torch.tensor(0.0))

    def rate_noise(self, t):
        # Exact derivative of total_noise:
        #   d/dt[-log1p(-(1 - eps) t^3)] = 3 (1 - eps) t^2 / (1 - (1 - eps) t^3)
        # BUG FIX: the original numerator (3*t**2 - eps) did not match the
        # derivative of total_noise below.
        return (3 * (1 - self.eps) * (t**2)) / (1 - (1 - self.eps) * (t**3))

    def total_noise(self, t):
        # -log(1 - (1 - eps) * t^3)
        return -torch.log1p(-(1 - self.eps) * (t**3))
src/pareto_mcts.py ADDED
@@ -0,0 +1,492 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn.functional as F
4
+ import numpy as np
5
+ import random as rd
6
+
7
+ from diffusion import Diffusion
8
+ from scoring.scoring_functions import ScoringFunctions
9
+ from utils.app import PeptideAnalyzer
10
+ import noise_schedule
11
+
12
+ """"
13
+ Notes: store rolled out sequence?
14
+ path of node objects or strings?
15
+ should we only select valid expandable leaf nodes?
16
+ calculate similarity between sibling nodes?
17
+ should we evaluate generated sequences?
18
+ """
19
class Node:
    """
    Node class: partially unmasked SMILES string
    - parentNode: Node object at previous time step
    - childNodes: list of M Node objects generated from sampling M distinct unmasking schemes
    - totalReward: vector of cumulative rewards for all K objectives
    - visits: number of times the node has been visited by an iteration
    - timestep: the time step where the sequence was sampled
    - sampleProb: probability of sampling the sequence from the diffusion model
    """
    def __init__(self, config, tokens=None, parentNode=None, childNodes=None, scoreVector=None, totalReward=None, timestep=None, sampleProb=None):
        self.config = config
        self.parentNode = parentNode
        # BUG FIX: the default used to be a shared mutable list ([]), so every
        # node created without an explicit childNodes argument shared ONE list.
        self.childNodes = [] if childNodes is None else childNodes
        self.scoreVector = scoreVector

        # initialize total rewards to the reward of the rolled-out unmasked sequence
        if totalReward is not None:
            self.totalReward = totalReward
        else:
            self.totalReward = np.zeros(self.config.mcts.num_objectives)

        # every node starts with one visit (itself)
        self.visits = 1
        # timestep (value between 0 and config.sampling.steps)
        self.timestep = timestep
        # sampling probability under the reverse posterior (exploration prior)
        self.sampleProb = sampleProb

        # dict with 'input_ids' token array and 'attention_mask'
        self.tokens = tokens

    def getRoot(self):
        """Walk up the parent links and return the tree root."""
        node = self
        while node.parentNode is not None:
            node = node.parentNode
        return node

    def selectNode(self, num_func):
        """
        Selects a node to move to among the children nodes.

        Returns (selectedNode, status); when this node is not a legal
        non-leaf, returns itself with its own status.
        """
        nodeStatus = self.getExpandStatus()

        # only a legal non-leaf node (status 3) can be descended into
        if nodeStatus == 3:
            # Pareto front over the children's select-score vectors
            paretoFront = {}
            for childNode in self.childNodes:
                childStatus = childNode.getExpandStatus()
                # only consider children that are expandable leaves (2) or legal non-leaves (3)
                if childStatus == 2 or childStatus == 3:
                    selectScore = childNode.calcSelectScore()
                    paretoFront = updateParetoFront(paretoFront, childNode, selectScore, num_func)

            # if no selectable children (all terminal), return self as a leaf
            if len(paretoFront) == 0:
                return self, 1

            # randomly select a node on the Pareto front
            selected = rd.choice(list(paretoFront.keys()))
            return selected, selected.getExpandStatus()

        # not a valid non-leaf node: caller handles self directly
        return self, nodeStatus

    def addChildNode(self, tokens, totalReward, prob=None):
        """
        Creates a child node one timestep deeper, registers it, and returns it.
        """
        child = Node(config=self.config,
                     tokens=tokens,
                     parentNode=self,
                     totalReward=totalReward,
                     timestep=self.timestep + 1,
                     sampleProb=prob)

        self.childNodes.append(child)
        return child

    def updateNode(self, rewards):
        """
        Updates the cumulative rewards vector with the reward vector at a
        descendant leaf node and increments the visit count.
        """
        self.visits += 1
        self.totalReward += rewards

    def calcSelectScore(self):
        """
        Select score = mean reward per objective plus a PUCT-style
        exploration bonus scaled by the model's sampling probability.
        """
        # K-dimensional vector of normalized rewards for each objective
        normRewards = self.totalReward / self.visits
        if self.sampleProb is not None:
            print("Sample Prob")
            print(self.sampleProb)
            # BUG FIX: this used `self.root.visits`, but Node never defines a
            # `root` attribute (AttributeError); walk up to the actual root.
            rootVisits = self.getRoot().visits
            return normRewards + (self.config.mcts.sample_prob * self.sampleProb * np.sqrt(rootVisits) / self.visits)
        return normRewards

    def getExpandStatus(self):
        """
        Returns an integer indicating whether the node is a:
        1. terminal node (sequence is fully unmasked)
        2. legal leaf node (partially unmasked sequence that can be expanded)
        3. legal non-leaf node (already expanded sequence with M child nodes)
        """
        if self.timestep == self.config.sampling.steps:
            return 1
        elif (self.timestep < self.config.sampling.steps) and (len(self.childNodes) == 0):
            return 2
        return 3
140
+
141
+ """END OF NODE CLASS"""
142
+
143
def updateParetoFront(paretoFront, node, scoreVector, num_func):
    """
    Update a Pareto front (dict: node -> score vector) with a candidate.

    Only the first num_func objectives are compared. The candidate is added
    when it is at least as good as every current member on the active
    objectives; members it strictly dominates are removed.
    """
    if not paretoFront:
        # empty front: the candidate is trivially non-dominated
        paretoFront[node] = scoreVector
        return paretoFront

    dominated_members = []   # members strictly dominated by the candidate
    candidate_survives = []  # True where the candidate is not dominated by a member
    for member, member_scores in paretoFront.items():
        geq = scoreVector >= np.asarray(member_scores)
        gt = scoreVector > np.asarray(member_scores)

        # not enough objectives to compare against num_func: skip this member
        if num_func > len(geq):
            continue

        geq_active = geq[:num_func]
        gt_active = gt[:num_func]

        if geq_active.all():
            # candidate is at least as good on every active objective
            candidate_survives.append(True)
            if gt_active.any():
                # ...and strictly better on at least one: member is dominated
                dominated_members.append(member)
        else:
            # candidate is dominated by this member
            candidate_survives.append(False)

    # add the candidate only when no member dominates it
    if np.asarray(candidate_survives).all():
        paretoFront[node] = scoreVector

    # drop every member the candidate strictly dominates
    for member in dominated_members:
        del paretoFront[member]

    return paretoFront
192
+
193
+ ###BEGINNING OF MCTS CLASS###
194
+
195
+ class MCTS:
196
+ def __init__(self, config, max_sequence_length=None, mdlm=None, score_func_names=[], prot_seqs=None, num_func = []):
197
+ self.config = config
198
+ self.noise = noise_schedule.get_noise(config)
199
+ self.time_conditioning = self.config.time_conditioning
200
+ # dictionary of k (SMILES string) and v (score vector) of Pareto-optimal sequences
201
+ self.peptideParetoFront = {}
202
+ self.num_steps = config.sampling.steps
203
+ self.num_sequences = config.sampling.num_sequences
204
+
205
+ # mdlm model
206
+ self.mdlm = mdlm
207
+ self.tokenizer = mdlm.tokenizer
208
+ self.device = mdlm.device
209
+
210
+ if max_sequence_length is None:
211
+ self.sequence_length = self.config.sampling.seq_length
212
+ else:
213
+ self.sequence_length = max_sequence_length
214
+
215
+ self.num_iter = config.mcts.num_iter
216
+
217
+ self.num_child = config.mcts.num_children
218
+
219
+ # score functions
220
+ self.score_functions = ScoringFunctions(score_func_names, prot_seqs)
221
+ self.score_func_names = score_func_names
222
+ self.num_func = num_func # K-dimensional vector with the iteration number to start conditioning on each of the objectives in increasng order
223
+ self.iter_num = 0
224
+ self.curr_num_func = 1
225
+ self.analyzer = PeptideAnalyzer()
226
+
227
+ # track fraction of valid peptides
228
+ self.valid_fraction_log = []
229
+ self.score_logs = {name: [] for name in score_func_names}
230
+
231
+ def reset(self):
232
+ self.iter_num = 0
233
+ self.valid_fraction_log = []
234
+ self.score_logs = {name: [] for name in self.score_func_names}
235
+ self.peptideParetoFront = {}
236
+
237
+ def forward(self, rootNode):
238
+ self.reset()
239
+
240
+ while (self.iter_num < self.num_iter):
241
+ self.iter_num += 1
242
+
243
+ # traverse the tree form the root node until a leaf node
244
+ leafNode, _ = self.select(rootNode)
245
+ #print(leafNode.tokens['input_ids'])
246
+
247
+ # expand leaf node into num_children partially unmasked sequences at the next timestep
248
+ self.expand(leafNode)
249
+
250
+ # return dictionary of pareto front peptides and their score vectors
251
+ return self.peptideParetoFront
252
+
253
    # change to include more even if dominated? since there is error in the scores
    def updateParetoFront(self, sequence, scoreVector, tokens):
        """
        Update the global Pareto front with a fully unmasked candidate.

        Removes sequences that are dominated by scoreVector and adds the
        SMILES sequence (with its score vector and token ids) if it is
        non-dominated on the currently active objectives.

        - sequence: decoded SMILES string of the candidate
        - scoreVector: per-objective scores (numpy-comparable)
        - tokens: unmasked token ids stored alongside the scores

        Returns a reward vector: per-objective fraction of current front
        members the candidate is at least as good as (all ones when the
        front was empty).
        """
        paretoSize = len(self.peptideParetoFront)

        # number of objectives currently conditioned on; advances each time
        # iter_num passes the next threshold in self.num_func
        self.curr_num_func = 1

        for i in range(len(self.num_func)):
            if self.iter_num >= self.num_func[i]:
                self.curr_num_func = i+1

        if paretoSize == 0:
            # if pareto front is empty, add sequence and scoreVector
            self.peptideParetoFront[sequence] = {'scores': scoreVector, 'token_ids': tokens}
            # if pareto front is empty, set reward vector to 1s
            rewardVector = np.ones(len(scoreVector))
        else:
            # nondominate[i]: True when the candidate is not dominated by the
            # i-th front member (restricted to the active objectives)
            nondominate = []
            # front members strictly dominated by the candidate
            delete = []
            # initialize reward vector with zeros
            rewardVector = np.zeros(len(scoreVector))
            for k, v in self.peptideParetoFront.items():
                # elementwise comparisons over ALL objectives
                nondominated = scoreVector >= np.asarray(v['scores'])  # [num_objectives]
                dominant = scoreVector > np.asarray(v['scores'])
                # reward counts how many members the candidate matches or beats
                rewardVector += nondominated  # [num_objectives]

                if self.curr_num_func <= len(nondominated):
                    # restrict the domination test to the active objectives
                    attn_nondominated = nondominated[:self.curr_num_func]
                    attn_dominant = dominant[:self.curr_num_func]

                    # only delete pareto-optimal sequence if
                    # all scores are greater than or equal to v and at least one score is strictly greater than v
                    if attn_nondominated.all() and attn_dominant.any():
                        # add the dominated sequence to be deleted
                        delete.append(k)
                        # sequence is dominant
                        nondominate.append(True)
                    elif attn_nondominated.all():
                        # sequence is non-dominated
                        nondominate.append(True)
                    else:
                        # sequence is completely dominated
                        nondominate.append(False)

            # NOTE(review): assumes curr_num_func <= len(scoreVector) for every
            # member — otherwise nothing is appended above and this fires.
            assert len(nondominate) == paretoSize
            nondominate = np.asarray(nondominate)
            # if sequence is either dominant or non-dominated by all sequences in pareto-front -> add to pareto front
            # or if the pareto front does not have enough sequences
            if nondominate.all() or paretoSize < self.num_sequences:
                self.peptideParetoFront[sequence] = {'scores': scoreVector, 'token_ids': tokens}

            # normalize the reward by the (pre-insertion) front size
            rewardVector = rewardVector / paretoSize

            # delete all dominated sequences if pareto front is larger than num_sequences
            while (paretoSize > self.num_sequences) and (len(delete) > 0):
                del self.peptideParetoFront[delete[0]]
                del delete[0]
                paretoSize -= 1

        return rewardVector
327
+
328
+ def isPathEnd(self, path, maxDepth):
329
+ """
330
+ Checks if the node is completely unmasked (ie. end of path)
331
+ or if the path is at the max depth
332
+ """
333
+ if (path[-1] != self.config.mcts.mask_token).all():
334
+ return True
335
+ elif len(path) >= maxDepth:
336
+ return True
337
+ return False
338
+
339
+ def select(self, currNode):
340
+ """
341
+ Traverse the tree from the root node until reaching a legal leaf node
342
+ """
343
+ while True:
344
+ currNode, nodeStatus = currNode.selectNode(self.curr_num_func)
345
+ if nodeStatus != 3:
346
+ return currNode, nodeStatus
347
+
348
    def expand(self, parentNode, eps=1e-5, checkSimilarity = True):
        """
        Expand a leaf: sample num_children one-step unmaskings from the MDLM,
        roll each out to a fully unmasked sequence, score the valid peptides,
        update the Pareto front, attach the children, and backpropagate the
        summed (invalid-penalized) rewards.

        - parentNode: leaf whose tokens are the starting partial sequence
        - eps: smallest rollout time (avoids sampling exactly at t=0)
        - checkSimilarity: currently unused
        """

        num_children = self.config.mcts.num_children
        # initialize child rewards that will be added to total rewards
        allChildReward = np.zeros_like(parentNode.totalReward)  # (n_objectives)

        # compute number of rollout steps
        # if parentNode.timestep = self.num_steps then num_rollout_steps = 1
        num_rollout_steps = self.num_steps - parentNode.timestep
        # array of rollout timesteps from the timestep of parent node to 0
        rollout_t = torch.linspace(1, eps, num_rollout_steps, device=self.device)
        dt = (1 - eps) / self.num_steps
        p_x0_cache = None

        # initialize x and attn_mask from the parent's partial sequence
        x = parentNode.tokens['input_ids'].to(self.device)
        attn_mask = parentNode.tokens['attention_mask'].to(self.device)

        t = rollout_t[0] * torch.ones(num_children, 1, device=self.device)
        # generate (n_children, seq_length) array of sampled children nodes
        print("token array:")
        print(x)
        p_x0_cache, x_children = self.mdlm.batch_cached_reverse_step(token_array=x,
                                                                    t=t, dt=dt,
                                                                    batch_size=num_children,
                                                                    attn_mask=attn_mask)
        # x_children are the one-step children; x_rollout continues to full unmasking
        x_rollout = x_children

        for i in range(1, num_rollout_steps):
            t = rollout_t[i] * torch.ones(num_children, 1, device=self.device)

            p_x0_cache, x_next = self.mdlm.cached_reverse_step(x=x_rollout,
                                                               t=t, dt=dt, p_x0=p_x0_cache,
                                                               attn_mask=attn_mask)

            # NOTE(review): compares x_next against the parent tokens x (not
            # the previous rollout state) — confirm this is the intended
            # cache-invalidation condition
            if (not torch.allclose(x_next, x) or self.time_conditioning):
                # Disable caching
                p_x0_cache = None

            x_rollout = x_next

        if self.config.sampling.noise_removal:
            # final denoising pass: greedily unmask any remaining positions
            t = rollout_t[-1] * torch.ones(x.shape[0], 1, device=self.device)

            time_cond = self.noise(t)[0]
            x_rollout = self.mdlm.forward(x_rollout, attn_mask, time_cond).argmax(dim=-1)  # (n_children, seq_length)

        childSequences = self.tokenizer.batch_decode(x_rollout)

        validSequences = []
        maskedTokens = []
        unmaskedTokens = []
        for i in range(num_children):
            childSeq = childSequences[i]
            rewardVector = np.zeros(self.config.mcts.num_objectives)

            # check if the peptide is valid
            if self.analyzer.is_peptide(childSeq):
                validSequences.append(childSeq)
                maskedTokens.append(x_children[i])
                unmaskedTokens.append(x_rollout[i])
            else:
                # invalid rollout: still attach the child, but with zero reward
                childTokens = {'input_ids': x_children[i], 'attention_mask': attn_mask}
                parentNode.addChildNode(tokens=childTokens,
                                        totalReward=rewardVector)

        if (len(validSequences) != 0):
            # score all valid peptides in one batch; scoreVectors is indexed
            # [sequence, objective]
            scoreVectors = self.score_functions(input_seqs=validSequences)
            average_scores = scoreVectors.T
            for i, name in enumerate(self.score_func_names):
                self.score_logs[name].append(average_scores[i])
        else:
            # keep the log aligned across iterations even with no valid peptides
            for name in self.score_func_names:
                self.score_logs[name].append(np.zeros(0))

        for i, validSeq in enumerate(validSequences):
            scoreVector = scoreVectors[i]

            # update pareto front
            rewardVector = self.updateParetoFront(validSeq, scoreVector, unmaskedTokens[i])
            print(scoreVector)
            print(rewardVector)

            # add to all child reward vector for backprop
            allChildReward += rewardVector

            # create node for sequence and add to the children node of parent
            childTokens = {'input_ids': maskedTokens[i], 'attention_mask': attn_mask}
            parentNode.addChildNode(tokens=childTokens,
                                    totalReward=rewardVector)

        # compute fraction of invalid child sequences
        invalid = (num_children - len(validSequences)) / num_children

        valid_fraction = len(validSequences) / num_children
        print(f"Valid fraction: {valid_fraction}")
        self.valid_fraction_log.append(valid_fraction)

        print(self.config.mcts.invalid_penalty)
        # subtract score using fraction of invalid sequences from reward
        allChildReward = allChildReward - (self.config.mcts.invalid_penalty * invalid)
        # backpropogate all child rewards
        self.backprop(parentNode, allChildReward)
458
+
459
+
460
+ def backprop(self, node, reward_vector):
461
+ # backpropogate rewards through the path leading to the leaf node from the root
462
+ while node:
463
+ node.updateNode(reward_vector)
464
+ node = node.parentNode
465
+
466
+
467
+ def getSequenceForObjective(self, objective_index, k):
468
+ """
469
+ Returns the top-k sequences in the pareto front that has the best score for
470
+ a given objective and their score vectors for all objectives
471
+ """
472
+
473
+ # dictionary of top-k peptides for the objective
474
+ topk = {}
475
+
476
+ peptides = []
477
+ objectiveScores = []
478
+ for k, v in self.peptideParetoFront.items():
479
+ # store peptides in list
480
+ peptides.append(k)
481
+ # store score for objective
482
+ objectiveScores.append(v['token_ids'][objective_index])
483
+
484
+ objectiveScores = torch.tensor(objectiveScores)
485
+ topKScores = torch.topk(objectiveScores, k)
486
+ for (_, index) in topKScores.items():
487
+ seq = peptides[index]
488
+
489
+ topk[seq] = self.peptideParetoFront.get(seq)
490
+
491
+ return topk
492
+
src/roformer.py ADDED
@@ -0,0 +1,74 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from transformers import RoFormerConfig, RoFormerForMaskedLM
2
+ import torch.nn as nn
3
+ from torch.nn.parallel import DistributedDataParallel as DDP
4
+ import torch
5
+
6
class Roformer(nn.Module):
    """RoFormer masked-LM wrapper used as the MDLM backbone.

    Builds a RoFormerForMaskedLM sized from the experiment config and the
    tokenizer vocabulary, and exposes freeze/unfreeze helpers for staged
    fine-tuning.
    """
    def __init__(self, config, tokenizer):
        super().__init__()

        self.tokenizer = tokenizer
        self.vocab_size = self.tokenizer.vocab_size
        # place the model on GPU when available
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device

        roformer_config = RoFormerConfig(
            vocab_size=self.tokenizer.vocab_size,
            embedding_size=config.roformer.hidden_size,
            hidden_size=config.roformer.hidden_size,
            num_hidden_layers=config.roformer.n_layers,
            num_attention_heads=config.roformer.n_heads,
            intermediate_size=config.roformer.hidden_size * 4,
            max_position_embeddings=config.roformer.max_position_embeddings,
            hidden_dropout_prob=0.1,
            attention_probs_dropout_prob=0.1,
            pad_token_id=0,
            rotary_value=False
        )

        self.model = RoFormerForMaskedLM(roformer_config).to(self.device)

    def freeze_model(self):
        """Freeze every parameter of the underlying RoFormer."""
        for param in self.model.parameters():
            param.requires_grad = False

    def unfreeze_all_layers(self):
        """Make every parameter trainable again."""
        for param in self.model.parameters():
            param.requires_grad = True

    def unfreeze_n_layers(self, n):
        """Unfreeze the query/key projection weights of the final n encoder layers.

        BUG FIX: the layer count was hard-coded to 8; it is now derived from
        the actual encoder, so this works for any config.roformer.n_layers.
        """
        layers = self.model.roformer.encoder.layer
        num_layers = len(layers)

        for i, layer in enumerate(layers):
            # fine-tune only the final n layers
            if i >= num_layers - n:
                # unfreeze query weights
                for param in layer.attention.self.query.parameters():
                    param.requires_grad = True
                # unfreeze key weights
                for param in layer.attention.self.key.parameters():
                    param.requires_grad = True

    def forward(self, input_ids, attn_mask):
        """Return masked-LM logits of shape (batch, seq_len, vocab_size)."""
        input_ids = input_ids.to(self.device)
        attn_mask = attn_mask.to(self.device)

        # get logits from the masked-LM head
        logits = self.model(input_ids=input_ids, attention_mask=attn_mask)
        return logits.logits

    def save_model(self, save_dir):
        """Persist the model weights and tokenizer to save_dir."""
        self.model.save_pretrained(save_dir)
        self.tokenizer.save_pretrained(save_dir)

    @classmethod
    def load_model(cls, save_dir, config, tokenizer):
        """Rebuild a wrapper and load pretrained weights from save_dir."""
        roformer = cls(config, tokenizer)
        roformer.model = RoFormerForMaskedLM.from_pretrained(save_dir)
        return roformer
src/scoring/functions/binding.py ADDED
@@ -0,0 +1,178 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ import os, torch
3
+ import numpy as np
4
+ import torch
5
+ import pandas as pd
6
+ import torch.nn as nn
7
+ import esm
8
+ from transformers import AutoModelForMaskedLM
9
+
10
class ImprovedBindingPredictor(nn.Module):
    """Cross-attention model predicting peptide-protein binding affinity.

    Consumes an ESM-2 protein embedding and a PeptideCLM SMILES embedding,
    projects both into a shared hidden space, runs bidirectional
    cross-attention, and produces both a regression score and 3-way
    binding-class logits (tight / medium / weak).
    """
    def __init__(self,
                 esm_dim=1280,
                 smiles_dim=768,
                 hidden_dim=512,
                 n_heads=8,
                 n_layers=3,
                 dropout=0.1):
        super().__init__()

        # Define binding thresholds (on the -log10 affinity scale)
        self.tight_threshold = 7.5  # Kd/Ki/IC50 ≤ ~30nM
        self.weak_threshold = 6.0  # Kd/Ki/IC50 > 1μM

        # Project both modalities to the same hidden dimension
        self.smiles_projection = nn.Linear(smiles_dim, hidden_dim)
        self.protein_projection = nn.Linear(esm_dim, hidden_dim)
        self.protein_norm = nn.LayerNorm(hidden_dim)
        self.smiles_norm = nn.LayerNorm(hidden_dim)

        # Cross attention blocks with layer norm; the same ModuleDict is used
        # for both attention directions within a layer
        self.cross_attention_layers = nn.ModuleList([
            nn.ModuleDict({
                'attention': nn.MultiheadAttention(hidden_dim, n_heads, dropout=dropout),
                'norm1': nn.LayerNorm(hidden_dim),
                'ffn': nn.Sequential(
                    nn.Linear(hidden_dim, hidden_dim * 4),
                    nn.ReLU(),
                    nn.Dropout(dropout),
                    nn.Linear(hidden_dim * 4, hidden_dim)
                ),
                'norm2': nn.LayerNorm(hidden_dim)
            }) for _ in range(n_layers)
        ])

        # Prediction heads
        self.shared_head = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

        # Regression head (scalar affinity)
        self.regression_head = nn.Linear(hidden_dim, 1)

        # Classification head (3 classes: tight, medium, loose binding)
        self.classification_head = nn.Linear(hidden_dim, 3)

    def get_binding_class(self, affinity):
        """Convert affinity values to class indices
        0: tight binding (>= 7.5)
        1: medium binding (6.0-7.5)
        2: weak binding (< 6.0)
        """
        if isinstance(affinity, torch.Tensor):
            # vectorized path: classify an entire tensor at once
            tight_mask = affinity >= self.tight_threshold
            weak_mask = affinity < self.weak_threshold
            medium_mask = ~(tight_mask | weak_mask)

            # default 0 (tight); overwrite medium and weak entries
            classes = torch.zeros_like(affinity, dtype=torch.long)
            classes[medium_mask] = 1
            classes[weak_mask] = 2
            return classes
        else:
            if affinity >= self.tight_threshold:
                return 0  # tight binding
            elif affinity < self.weak_threshold:
                return 2  # weak binding
            else:
                return 1  # medium binding

    def forward(self, protein_emb, smiles_emb):
        """Return (regression_output, classification_logits).

        NOTE(review): inputs appear to be unbatched (seq_len, emb_dim)
        tensors — the commented-out transposes and the dim=0 pooling below
        suggest there is no leading batch dimension; confirm against callers.
        """
        protein = self.protein_norm(self.protein_projection(protein_emb))
        smiles = self.smiles_norm(self.smiles_projection(smiles_emb))

        #protein = protein.transpose(0, 1)
        #smiles = smiles.transpose(0, 1)

        # Cross attention layers
        for layer in self.cross_attention_layers:
            # Protein attending to SMILES
            attended_protein = layer['attention'](
                protein, smiles, smiles
            )[0]
            protein = layer['norm1'](protein + attended_protein)
            protein = layer['norm2'](protein + layer['ffn'](protein))

            # SMILES attending to protein (already-updated protein states)
            attended_smiles = layer['attention'](
                smiles, protein, protein
            )[0]
            smiles = layer['norm1'](smiles + attended_smiles)
            smiles = layer['norm2'](smiles + layer['ffn'](smiles))

        # Get sequence-level representations (mean-pool over dim 0)
        protein_pool = torch.mean(protein, dim=0)
        smiles_pool = torch.mean(smiles, dim=0)

        # Concatenate both representations
        combined = torch.cat([protein_pool, smiles_pool], dim=-1)

        # Shared features feed both heads
        shared_features = self.shared_head(combined)

        regression_output = self.regression_head(shared_features)
        classification_logits = self.classification_head(shared_features)

        return regression_output, classification_logits
118
+
119
class BindingAffinity:
    """Scores peptide SMILES strings for binding affinity to one fixed protein.

    The target protein is embedded once with ESM-2 at construction time and
    mean-pooled; each candidate peptide is embedded with PeptideCLM (or a
    supplied embedding model) and fed through a pre-trained
    ImprovedBindingPredictor checkpoint.
    """
    def __init__(self, prot_seq, tokenizer, base_path, device=None, emb_model=None):
        # plain object base, so this super().__init__() is a no-op
        super().__init__()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device

        # peptide embedding model (frozen, eval mode)
        if emb_model is not None:
            self.pep_model = emb_model.to(self.device).eval()
        else:
            # default: pretrained PeptideCLM encoder (roformer backbone only)
            self.pep_model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all').roformer.to(self.device).eval()

        self.pep_tokenizer = tokenizer

        # binding-affinity head, restored from a local checkpoint
        self.model = ImprovedBindingPredictor().to(self.device)
        checkpoint = torch.load(f'{base_path}/src/scoring/functions/classifiers/binding-affinity.pt',
                                map_location=self.device,
                                weights_only=False)
        self.model.load_state_dict(checkpoint['model_state_dict'])

        self.model.eval()

        self.esm_model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # load ESM-2 model
        self.esm_model = self.esm_model.to(self.device).eval()
        self.prot_tokenizer = alphabet.get_batch_converter()  # load esm tokenizer

        data = [("target", prot_seq)]
        # get tokenized protein
        _, _, prot_tokens = self.prot_tokenizer(data)
        prot_tokens = prot_tokens.to(self.device)
        with torch.no_grad():
            # layer-33 representations are the final ESM-2 hidden states
            results = self.esm_model.forward(prot_tokens, repr_layers=[33])  # Example with ESM-2
            prot_emb = results["representations"][33]

        # mean-pool the per-residue embedding into a single (1, esm_dim) vector
        self.prot_emb = prot_emb[0].to(self.device)
        self.prot_emb = torch.mean(self.prot_emb, dim=0, keepdim=True)

    def forward(self, input_seqs):
        """Return one predicted affinity score (float) per input SMILES string."""
        with torch.no_grad():
            scores = []
            for seq in input_seqs:
                pep_tokens = self.pep_tokenizer(seq, return_tensors='pt', padding=True)

                pep_tokens = {k: v.to(self.device) for k, v in pep_tokens.items()}

                # inner no_grad is redundant (already inside one) but harmless
                with torch.no_grad():
                    emb = self.pep_model(input_ids=pep_tokens['input_ids'],
                                         attention_mask=pep_tokens['attention_mask'],
                                         output_hidden_states=True)

                #emb = self.pep_model(input_ids=pep_tokens['input_ids'], attention_mask=pep_tokens['attention_mask'])
                # mean-pool the peptide token embeddings into (1, smiles_dim)
                pep_emb = emb.last_hidden_state.squeeze(0)
                pep_emb = torch.mean(pep_emb, dim=0, keepdim=True)

                # only the regression output is used; classification logits dropped
                score, logits = self.model.forward(self.prot_emb, pep_emb)
                scores.append(score.item())
            return scores

    def __call__(self, input_seqs: list):
        return self.forward(input_seqs)
src/scoring/functions/binding_utils.py ADDED
@@ -0,0 +1,290 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from torch import nn
2
+ import torch
3
+ import numpy as np
4
+
5
def to_var(x):
    """Move x to the GPU when CUDA is available; otherwise return it unchanged."""
    return x.cuda() if torch.cuda.is_available() else x
9
+
10
class MultiHeadAttentionSequence(nn.Module):
    """Multi-head scaled dot-product attention with residual + layer norm.

    Projects q/k/v into n_head subspaces, attends, concatenates the heads,
    and applies dropout, a residual connection to q, and layer norm.
    """

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()

        self.n_head = n_head
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v

        # per-head query/key/value projections and the output projection
        self.W_Q = nn.Linear(d_model, n_head*d_k)
        self.W_K = nn.Linear(d_model, n_head*d_k)
        self.W_V = nn.Linear(d_model, n_head*d_v)
        self.W_O = nn.Linear(n_head*d_v, d_model)

        self.layer_norm = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v):
        """
        q/k/v: (batch, len, d_model). Returns (output, attention) with
        output (batch, len_q, d_model) and attention
        (batch, n_head, len_q, len_k).
        """
        batch, len_q, _ = q.size()
        batch, len_k, _ = k.size()
        batch, len_v, _ = v.size()

        Q = self.W_Q(q).view([batch, len_q, self.n_head, self.d_k])
        K = self.W_K(k).view([batch, len_k, self.n_head, self.d_k])
        V = self.W_V(v).view([batch, len_v, self.n_head, self.d_v])

        # (batch, n_head, len, d) layout; K additionally transposed for matmul
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2).transpose(2, 3)
        V = V.transpose(1, 2)

        # scaled dot-product attention scores
        attention = torch.matmul(Q, K)
        attention = attention / np.sqrt(self.d_k)

        # BUG FIX: the original called F.softmax, but this module never
        # imports torch.nn.functional as F (NameError at runtime);
        # torch.softmax is equivalent and already in scope.
        attention = torch.softmax(attention, dim=-1)

        output = torch.matmul(attention, V)

        # merge the heads back to (batch, len_q, n_head*d_v)
        output = output.transpose(1, 2).reshape([batch, len_q, self.d_v*self.n_head])

        output = self.W_O(output)
        output = self.dropout(output)
        # residual connection + layer norm
        output = self.layer_norm(output + q)

        return output, attention
61
+
62
class MultiHeadAttentionReciprocal(nn.Module):
    """Bidirectional cross-attention.

    Computes attention from q to k and, by transposing the same score
    matrix, from k back to q; each direction has its own value/output
    projections and its own residual + layer norm.
    """

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()

        self.n_head = n_head
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v

        self.W_Q = nn.Linear(d_model, n_head*d_k)
        self.W_K = nn.Linear(d_model, n_head*d_k)
        self.W_V = nn.Linear(d_model, n_head*d_v)
        self.W_O = nn.Linear(n_head*d_v, d_model)
        # dedicated value/output projections for the reverse (k -> q) direction
        self.W_V_2 = nn.Linear(d_model, n_head*d_v)
        self.W_O_2 = nn.Linear(n_head*d_v, d_model)

        self.layer_norm = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

        self.layer_norm_2 = nn.LayerNorm(d_model)

        self.dropout_2 = nn.Dropout(dropout)

    def forward(self, q, k, v, v_2):
        """
        q: (batch, len_q, d_model); k: (batch, len_k, d_model);
        v must have len_k rows (values for q->k attention) and v_2 must have
        len_q rows (values for k->q attention).
        Returns (output, output_2, attention, attention_2).
        """
        batch, len_q, _ = q.size()
        batch, len_k, _ = k.size()
        batch, len_v, _ = v.size()
        batch, len_v_2, _ = v_2.size()

        Q = self.W_Q(q).view([batch, len_q, self.n_head, self.d_k])
        K = self.W_K(k).view([batch, len_k, self.n_head, self.d_k])
        V = self.W_V(v).view([batch, len_v, self.n_head, self.d_v])
        V_2 = self.W_V_2(v_2).view([batch, len_v_2, self.n_head, self.d_v])

        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2).transpose(2, 3)
        V = V.transpose(1, 2)
        V_2 = V_2.transpose(1, 2)

        # shared scaled scores; the reverse direction reuses the transpose
        attention = torch.matmul(Q, K)
        attention = attention / np.sqrt(self.d_k)
        attention_2 = attention.transpose(-2, -1)

        # BUG FIX: the original called F.softmax, but torch.nn.functional is
        # never imported as F in this module (NameError at runtime).
        attention = torch.softmax(attention, dim=-1)
        attention_2 = torch.softmax(attention_2, dim=-1)

        output = torch.matmul(attention, V)
        output_2 = torch.matmul(attention_2, V_2)

        output = output.transpose(1, 2).reshape([batch, len_q, self.d_v*self.n_head])
        output_2 = output_2.transpose(1, 2).reshape([batch, len_k, self.d_v*self.n_head])

        output = self.W_O(output)
        output_2 = self.W_O_2(output_2)

        output = self.dropout(output)
        output = self.layer_norm(output + q)

        # BUG FIX: the reverse branch now uses its own dropout_2/layer_norm_2;
        # they were constructed but never used — the forward branch's modules
        # were applied to both outputs.
        output_2 = self.dropout_2(output_2)
        output_2 = self.layer_norm_2(output_2 + k)

        return output, output_2, attention, attention_2
142
+
143
+
144
class FFN(nn.Module):
    """Position-wise feed-forward block (two 1x1 convolutions) with dropout,
    a residual connection and layer normalisation."""

    def __init__(self, d_in, d_hid, dropout=0.1):
        super().__init__()

        self.layer_1 = nn.Conv1d(d_in, d_hid, 1)
        self.layer_2 = nn.Conv1d(d_hid, d_in, 1)
        self.relu = nn.ReLU()
        self.layer_norm = nn.LayerNorm(d_in)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, length, d_in) -> (batch, length, d_in)."""
        # Conv1d expects (batch, channels, length), hence the transposes.
        hidden = self.layer_2(self.relu(self.layer_1(x.transpose(1, 2))))
        hidden = self.dropout(hidden)
        return self.layer_norm(hidden.transpose(1, 2) + x)
170
+
171
class ConvLayer(nn.Module):
    """A single dilated 1-D convolution followed by a ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size, padding, dilation):
        super(ConvLayer, self).__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):
        """x: (batch, in_channels, length) -> (batch, out_channels, length')."""
        return self.relu(self.conv(x))
181
+
182
+
183
class DilatedCNN(nn.Module):
    """Three parallel stacks of dilated convolutions (kernel sizes 3/5/7).

    Each stack applies three ConvLayer blocks with dilation rates 1, 2, 3;
    padding of (kernel // 2) * dilation preserves the sequence length.  The
    three stack outputs are summed.
    """

    def __init__(self, d_model, d_hidden):
        super(DilatedCNN, self).__init__()
        self.first_ = nn.ModuleList()
        self.second_ = nn.ModuleList()
        self.third_ = nn.ModuleList()

        dilation_rates = (1, 2, 3)
        in_dims = (d_model, d_hidden, d_hidden)
        out_dims = (d_hidden, d_hidden, d_hidden)

        # Stacks are built in the same order as before (first_, second_,
        # third_) so parameter/state-dict layout is unchanged.
        for stack, kernel in ((self.first_, 3), (self.second_, 5), (self.third_, 7)):
            for d_in, d_out, rate in zip(in_dims, out_dims, dilation_rates):
                stack.append(ConvLayer(d_in, d_out, kernel_size=kernel,
                                       padding=(kernel // 2) * rate,
                                       dilation=rate))

    def forward(self, protein_seq_enc):
        """protein_seq_enc: (B, L, d_model) -> (B, L, d_hidden)."""
        # Conv1d expects channels-first: (B, d_model, L).
        x = protein_seq_enc.transpose(1, 2)

        first_embedding, second_embedding, third_embedding = x, x, x
        for layer in self.first_:
            first_embedding = layer(first_embedding)
        for layer in self.second_:
            second_embedding = layer(second_embedding)
        for layer in self.third_:
            third_embedding = layer(third_embedding)

        combined = first_embedding + second_embedding + third_embedding
        return combined.transpose(1, 2)
228
+
229
+
230
class ReciprocalLayerwithCNN(nn.Module):
    """Transformer-style block with a dilated-CNN protein encoder:
    per-stream self-attention, reciprocal cross-attention, then a
    feed-forward head for each stream."""

    def __init__(self, d_model, d_inner, d_hidden, n_head, d_k, d_v):
        super().__init__()
        self.cnn = DilatedCNN(d_model, d_hidden)
        self.sequence_attention_layer = MultiHeadAttentionSequence(n_head, d_hidden, d_k, d_v)
        self.protein_attention_layer = MultiHeadAttentionSequence(n_head, d_hidden, d_k, d_v)
        self.reciprocal_attention_layer = MultiHeadAttentionReciprocal(n_head, d_hidden, d_k, d_v)
        self.ffn_seq = FFN(d_hidden, d_inner)
        self.ffn_protein = FFN(d_hidden, d_inner)

    def forward(self, sequence_enc, protein_seq_enc):
        """Returns updated (protein, sequence) encodings plus the four
        attention maps (self x2, cross x2)."""
        # Encode the protein stream with the CNN, then self-attend each stream.
        prot_enc = self.cnn(protein_seq_enc)
        prot_enc, prot_attention = self.protein_attention_layer(prot_enc, prot_enc, prot_enc)
        seq_enc, sequence_attention = self.sequence_attention_layer(sequence_enc, sequence_enc, sequence_enc)

        # Cross-attend protein <-> sequence in both directions at once.
        prot_enc, seq_enc, prot_seq_attention, seq_prot_attention = \
            self.reciprocal_attention_layer(prot_enc, seq_enc, seq_enc, prot_enc)

        prot_enc = self.ffn_protein(prot_enc)
        seq_enc = self.ffn_seq(seq_enc)

        return prot_enc, seq_enc, prot_attention, sequence_attention, prot_seq_attention, seq_prot_attention
261
+
262
+
263
class ReciprocalLayer(nn.Module):
    """Reciprocal attention block without the CNN front end: per-stream
    self-attention, bidirectional cross-attention, then feed-forward heads."""

    def __init__(self, d_model, d_inner, n_head, d_k, d_v):
        super().__init__()
        self.sequence_attention_layer = MultiHeadAttentionSequence(n_head, d_model, d_k, d_v)
        self.protein_attention_layer = MultiHeadAttentionSequence(n_head, d_model, d_k, d_v)
        self.reciprocal_attention_layer = MultiHeadAttentionReciprocal(n_head, d_model, d_k, d_v)
        self.ffn_seq = FFN(d_model, d_inner)
        self.ffn_protein = FFN(d_model, d_inner)

    def forward(self, sequence_enc, protein_seq_enc):
        """Returns updated (protein, sequence) encodings plus the four
        attention maps (self x2, cross x2)."""
        prot_enc, prot_attention = self.protein_attention_layer(
            protein_seq_enc, protein_seq_enc, protein_seq_enc)
        seq_enc, sequence_attention = self.sequence_attention_layer(
            sequence_enc, sequence_enc, sequence_enc)

        # Cross-attend in both directions using a shared score matrix.
        prot_enc, seq_enc, prot_seq_attention, seq_prot_attention = \
            self.reciprocal_attention_layer(prot_enc, seq_enc, seq_enc, prot_enc)

        prot_enc = self.ffn_protein(prot_enc)
        seq_enc = self.ffn_seq(seq_enc)

        return prot_enc, seq_enc, prot_attention, sequence_attention, prot_seq_attention, seq_prot_attention
src/scoring/functions/classifiers/hemolysis-xgboost.json ADDED
The diff for this file is too large to render. See raw diff
 
src/scoring/functions/classifiers/nonfouling-xgboost.json ADDED
The diff for this file is too large to render. See raw diff
 
src/scoring/functions/classifiers/permeability-xgboost.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e5d8c84bdad75f7091b5b3963133d4b0ebd180ae45654618ca6c090eee0bc06
3
+ size 45249160
src/scoring/functions/classifiers/solubility-xgboost.json ADDED
The diff for this file is too large to render. See raw diff
 
src/scoring/functions/hemolysis.py ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import xgboost as xgb
2
+ import torch
3
+ import numpy as np
4
+ from transformers import AutoModelForMaskedLM
5
+ import warnings
6
+ import numpy as np
7
+ from rdkit import rdBase
8
+
9
+ rdBase.DisableLog('rdApp.error')
10
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
11
+ warnings.filterwarnings("ignore", category=UserWarning)
12
+ warnings.filterwarnings("ignore", category=FutureWarning)
13
+
14
class Hemolysis:
    """Hemolysis scorer for peptide SMILES.

    Embeds each sequence with a PeptideCLM masked-LM encoder (mean-pooled
    last hidden state) and classifies with a pretrained XGBoost model.
    Scores are 1 - P(hemolytic), i.e. higher is safer.
    """

    def __init__(self, tokenizer, base_path, device=None, emb_model=None):
        self.device = device if device is not None else torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.predictor = xgb.Booster(model_file=f'{base_path}/src/scoring/functions/classifiers/hemolysis-xgboost.json')
        if emb_model is None:
            emb_model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all').roformer.to(device).eval()
        self.emb_model = emb_model
        self.tokenizer = tokenizer

    def generate_embeddings(self, sequences):
        """Mean-pooled PeptideCLM embedding per sequence, as a numpy array."""
        pooled = []
        for seq in sequences:
            enc = self.tokenizer(seq, return_tensors='pt')
            enc = {name: tens.to(self.device) for name, tens in enc.items()}
            with torch.no_grad():
                hidden = self.emb_model(**enc).last_hidden_state
                # Mean pooling across sequence length.
                pooled.append(hidden.mean(dim=1).squeeze(0).cpu().numpy())
        return np.array(pooled)

    def get_scores(self, input_seqs: list):
        """Return 1 - P(hemolytic) per sequence; ones for empty input."""
        fallback = np.ones(len(input_seqs))
        feats = self.generate_embeddings(input_seqs)
        if len(feats) == 0:
            return fallback

        feats = np.nan_to_num(feats, nan=0.)
        feats = np.clip(feats, np.finfo(np.float32).min, np.finfo(np.float32).max)
        probs = self.predictor.predict(xgb.DMatrix(feats))
        # Probability of being NOT hemolytic.
        return fallback - probs

    def __call__(self, input_seqs: list):
        return self.get_scores(input_seqs)
53
+
54
def unittest():
    """Smoke test for the Hemolysis scorer.

    NOTE(review): Hemolysis.__init__ requires positional arguments
    (tokenizer, base_path), so calling Hemolysis() with no arguments raises
    TypeError as written — supply a tokenizer and a base path to run this.
    """
    hemo = Hemolysis()
    seq = ["[te]NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CC(=CN2)C1=C2C=CC=C1)C(=O)N[C@@H](c1ccc(cc1)F)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCCO)C(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)O"]
    print(hemo.tokenizer.vocab_size)
    scores = hemo(input_seqs=seq)
    print(scores)


if __name__ == '__main__':
    unittest()
src/scoring/functions/nonfouling.py ADDED
@@ -0,0 +1,66 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ import os
3
+ import xgboost as xgb
4
+ import torch
5
+ import numpy as np
6
+ from transformers import AutoModelForMaskedLM
7
+ import warnings
8
+ import numpy as np
9
+ from rdkit import Chem, rdBase, DataStructs
10
+
11
+
12
+ rdBase.DisableLog('rdApp.error')
13
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
14
+ warnings.filterwarnings("ignore", category=UserWarning)
15
+ warnings.filterwarnings("ignore", category=FutureWarning)
16
+
17
class Nonfouling:
    """Nonfouling scorer for peptide SMILES.

    Embeds each sequence with a PeptideCLM masked-LM encoder (mean-pooled
    last hidden state) and classifies with a pretrained XGBoost model.
    """

    def __init__(self, tokenizer, base_path, device=None, emb_model=None):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
        self.predictor = xgb.Booster(model_file=f'{base_path}/src/scoring/functions/classifiers/nonfouling-xgboost.json')
        # NOTE(review): moves the freshly loaded model with .to(device) (the
        # raw argument, a no-op when None) rather than .to(self.device) —
        # confirm intended when device is None and CUDA is available.
        self.emb_model = emb_model if emb_model is not None else AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all').roformer.to(device).eval()
        self.tokenizer = tokenizer

    def generate_embeddings(self, sequences):
        """Return one mean-pooled PeptideCLM embedding row per sequence."""
        embeddings = []
        for sequence in sequences:
            tokenized = self.tokenizer(sequence, return_tensors='pt')
            tokenized = {k: v.to(self.device) for k, v in tokenized.items()}
            with torch.no_grad():
                output = self.emb_model(**tokenized)
                # Mean pooling across sequence length
                embedding = output.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()
            embeddings.append(embedding)
        return np.array(embeddings)

    def get_scores(self, input_seqs: list):
        """Return the predicted nonfouling probability per sequence;
        zeros for empty input."""
        scores = np.zeros(len(input_seqs))
        features = self.generate_embeddings(input_seqs)

        if len(features) == 0:
            return scores

        features = np.nan_to_num(features, nan=0.)
        features = np.clip(features, np.finfo(np.float32).min, np.finfo(np.float32).max)

        features = xgb.DMatrix(features)

        scores = self.predictor.predict(features)
        # Predicted probability of being nonfouling (the previous comment,
        # "probability of it being not hemolytic", was copy-pasted from
        # hemolysis.py and did not apply here).
        return scores

    def __call__(self, input_seqs: list):
        scores = self.get_scores(input_seqs)
        return scores
56
+
57
+ def unittest():
58
+ nf = Nonfouling()
59
+ seq = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CC(=CN2)C1=C2C=CC=C1)C(=O)N[C@@H](c1ccc(cc1)F)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCCO)C(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)O"]
60
+
61
+ scores = nf(input_seqs=seq)
62
+ print(scores)
63
+
64
+
65
+ if __name__ == '__main__':
66
+ unittest()
src/scoring/functions/permeability.py ADDED
@@ -0,0 +1,171 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ import os
3
+ import xgboost as xgb
4
+ import torch
5
+ import numpy as np
6
+ from transformers import AutoModelForMaskedLM
7
+ import warnings
8
+ import numpy as np
9
+ from rdkit.Chem import Descriptors, rdMolDescriptors
10
+ from rdkit import Chem, rdBase, DataStructs
11
+ from rdkit.Chem import AllChem
12
+ from typing import List
13
+ from transformers import AutoModelForMaskedLM
14
+
15
+
16
+ rdBase.DisableLog('rdApp.error')
17
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
18
+ warnings.filterwarnings("ignore", category=UserWarning)
19
+ warnings.filterwarnings("ignore", category=FutureWarning)
20
+
21
def fingerprints_from_smiles(smiles: List, size=2048):
    """Create ECFP fingerprints of smiles, with validity check.

    Invalid SMILES yield an all-zero fingerprint row and a 0 in the mask.

    Args:
        smiles: list of SMILES strings.
        size: fingerprint bit length.

    Returns:
        (fps, valid_mask): fps is an (n, size) array, valid_mask a list of
        0/1 flags marking which inputs parsed as valid molecules.
    """
    fps = []
    valid_mask = []
    for i, smile in enumerate(smiles):
        mol = Chem.MolFromSmiles(smile)
        valid_mask.append(int(mol is not None))
        fp = fingerprints_from_mol(mol, size=size) if mol else np.zeros((1, size))
        fps.append(fp)

    # BUG FIX: np.concatenate raises ValueError on an empty list; mirror the
    # guard already used by scoring_utils.fingerprints_from_smiles.
    fps = np.concatenate(fps, axis=0) if fps else np.zeros((0, size))
    return fps, valid_mask
33
+
34
+
35
+ def fingerprints_from_mol(molecule, radius=3, size=2048, hashed=False):
36
+ """ Create ECFP fingerprint of a molecule """
37
+ if hashed:
38
+ fp_bits = AllChem.GetHashedMorganFingerprint(molecule, radius, nBits=size)
39
+ else:
40
+ fp_bits = AllChem.GetMorganFingerprintAsBitVect(molecule, radius, nBits=size)
41
+ fp_np = np.zeros((1,))
42
+ DataStructs.ConvertToNumpyArray(fp_bits, fp_np)
43
+ return fp_np.reshape(1, -1)
44
+
45
def getMolDescriptors(mol, missingVal=0):
    """Calculate the full list of RDKit descriptors for a molecule, plus a
    few Lipinski-style counts.

    Descriptor functions that raise are recorded as ``missingVal``.

    Returns:
        (values, names): parallel lists of descriptor values and names.
    """
    values, names = [], []
    for nm, fn in Descriptors._descList:
        try:
            val = fn(mol)
        except Exception:  # BUG FIX: bare except also swallowed KeyboardInterrupt/SystemExit
            val = missingVal
        values.append(val)
        names.append(nm)

    custom_descriptors = {'hydrogen-bond donors': rdMolDescriptors.CalcNumLipinskiHBD,
                          'hydrogen-bond acceptors': rdMolDescriptors.CalcNumLipinskiHBA,
                          'rotatable bonds': rdMolDescriptors.CalcNumRotatableBonds,}

    for nm, fn in custom_descriptors.items():
        try:
            val = fn(mol)
        except Exception:
            val = missingVal
        values.append(val)
        names.append(nm)
    return values, names
69
+
70
+ def get_pep_dps_from_smi(smi):
71
+ try:
72
+ mol = Chem.MolFromSmiles(smi)
73
+ except:
74
+ print(f"convert smi {smi} to molecule failed!")
75
+ mol = None
76
+
77
+ dps, _ = getMolDescriptors(mol)
78
+ return np.array(dps)
79
+
80
+
81
def get_pep_dps(smi_list):
    # Returns an (n, n_descriptors) array of RDKit descriptors per SMILES.
    # NOTE(review): the empty-input width is hard-coded to 213 here but 211
    # in scoring_utils.get_pep_dps — confirm which value matches
    # len(Descriptors._descList) + 3 for the installed RDKit version.
    if len(smi_list) == 0:
        return np.zeros((0, 213))
    return np.array([get_pep_dps_from_smi(smi) for smi in smi_list])
85
+
86
+ def check_smi_validity(smiles: list):
87
+ valid_smi, valid_idx = [], []
88
+ for idx, smi in enumerate(smiles):
89
+ try:
90
+ mol = Chem.MolFromSmiles(smi) if smi else None
91
+ if mol:
92
+ valid_smi.append(smi)
93
+ valid_idx.append(idx)
94
+ except Exception as e:
95
+ # logger.debug(f'Error: {e} in smiles {smi}')
96
+ pass
97
+ return valid_smi, valid_idx
98
+
99
class Permeability:
    """Membrane-permeability scorer for peptide SMILES.

    Builds a feature matrix of [fingerprints | descriptors | PeptideCLM
    embeddings] (fingerprint/descriptor blocks are empty unless enabled)
    and predicts with a pretrained XGBoost model.
    """

    def __init__(self, tokenizer, base_path, device=None, emb_model=None):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
        self.predictor = xgb.Booster(model_file=f'{base_path}/src/scoring/functions/classifiers/permeability-xgboost.json')
        if emb_model is not None:
            self.emb_model = emb_model.to(self.device).eval()
        else:
            # BUG FIX: was .to(device), a no-op when device is None, which left
            # the model on CPU while generate_embeddings moves inputs to
            # self.device (a device mismatch when CUDA is available).
            self.emb_model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all').roformer.to(self.device).eval()

        self.tokenizer = tokenizer

    def generate_embeddings(self, sequences):
        """Mean-pooled PeptideCLM embedding per sequence, as a numpy array."""
        embeddings = []
        for sequence in sequences:
            tokenized = self.tokenizer(sequence, return_tensors='pt')
            tokenized = {k: v.to(self.device) for k, v in tokenized.items()}
            with torch.no_grad():
                output = self.emb_model(**tokenized)
                # Mean pooling across sequence length
                embedding = output.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()
            embeddings.append(embedding)
        return np.array(embeddings)

    def get_features(self, input_seqs: list, dps=False, fps=False):
        """Assemble the feature matrix [fingerprints | descriptors | embeddings].

        Args:
            input_seqs: SMILES strings.
            dps: include RDKit descriptors.
            fps: include ECFP fingerprints.
        """
        if fps:
            fingerprints = fingerprints_from_smiles(input_seqs)[0]
        else:
            # Was torch.empty; keep everything numpy for np.concatenate below.
            fingerprints = np.empty((len(input_seqs), 0))

        if dps:
            descriptors = get_pep_dps(input_seqs)
        else:
            descriptors = np.empty((len(input_seqs), 0))

        embeddings = self.generate_embeddings(input_seqs)

        features = np.concatenate([fingerprints, descriptors, embeddings], axis=1)
        return features

    def get_scores(self, input_seqs: list):
        """Predicted permeability per sequence; -10 fallback for empty input."""
        scores = -10 * np.ones(len(input_seqs))
        features = self.get_features(input_seqs)

        if len(features) == 0:
            return scores

        features = np.nan_to_num(features, nan=0.)
        features = np.clip(features, np.finfo(np.float32).min, np.finfo(np.float32).max)

        features = xgb.DMatrix(features)

        scores = self.predictor.predict(features)
        return scores

    def __call__(self, input_seqs: list):
        return self.get_scores(input_seqs)
162
+
163
+ def unittest():
164
+ permeability = Permeability()
165
+ seq = ['N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H]([C@@H](O)C(C)C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](CC(=CN2)C1=C2C=CC=C1)C(=O)O']
166
+ scores = permeability(input_seqs=seq)
167
+ print(scores)
168
+
169
+
170
+ if __name__ == '__main__':
171
+ unittest()
src/scoring/functions/scoring_utils.py ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import warnings
2
+ import numpy as np
3
+ from loguru import logger
4
+ from sklearn.ensemble import RandomForestRegressor
5
+ from rdkit.Chem import Descriptors, rdMolDescriptors
6
+ import joblib
7
+ from rdkit import Chem, rdBase, DataStructs
8
+ from rdkit.Chem import AllChem
9
+ from typing import List
10
+
11
+
12
+ def fingerprints_from_mol(molecule, radius=3, size=2048, hashed=False):
13
+ """
14
+ Create ECFP fingerprint of a molecule
15
+ """
16
+ if hashed:
17
+ fp_bits = AllChem.GetHashedMorganFingerprint(molecule, radius, nBits=size)
18
+ else:
19
+ fp_bits = AllChem.GetMorganFingerprintAsBitVect(molecule, radius, nBits=size)
20
+ fp_np = np.zeros((1,))
21
+ DataStructs.ConvertToNumpyArray(fp_bits, fp_np)
22
+ return fp_np.reshape(1, -1)
23
+
24
+
25
+ def fingerprints_from_smiles(smiles: List, size=2048):
26
+ """ Create ECFP fingerprints of smiles, with validity check """
27
+ fps = []
28
+ valid_mask = []
29
+ for i, smile in enumerate(smiles):
30
+ mol = Chem.MolFromSmiles(smile)
31
+ valid_mask.append(int(mol is not None))
32
+ fp = fingerprints_from_mol(mol, size=size) if mol else np.zeros((1, size))
33
+ fps.append(fp)
34
+
35
+ fps = np.concatenate(fps, axis=0) if len(fps) > 0 else np.zeros((0, size))
36
+ return fps, valid_mask
37
+
38
+
39
+ def getMolDescriptors(mol, missingVal=0):
40
+ """ calculate the full list of descriptors for a molecule """
41
+
42
+ values, names = [], []
43
+ for nm, fn in Descriptors._descList:
44
+ try:
45
+ val = fn(mol)
46
+ except:
47
+ val = missingVal
48
+ values.append(val)
49
+ names.append(nm)
50
+
51
+ custom_descriptors = {'hydrogen-bond donors': rdMolDescriptors.CalcNumLipinskiHBD,
52
+ 'hydrogen-bond acceptors': rdMolDescriptors.CalcNumLipinskiHBA,
53
+ 'rotatable bonds': rdMolDescriptors.CalcNumRotatableBonds,}
54
+
55
+ for nm, fn in custom_descriptors.items():
56
+ try:
57
+ val = fn(mol)
58
+ except:
59
+ val = missingVal
60
+ values.append(val)
61
+ names.append(nm)
62
+ return values, names
63
+
64
+
65
+ def get_pep_dps_from_smi(smi):
66
+ try:
67
+ mol = Chem.MolFromSmiles(smi)
68
+ except:
69
+ print(f"convert smi {smi} to molecule failed!")
70
+ mol = None
71
+
72
+ dps, _ = getMolDescriptors(mol)
73
+ return np.array(dps)
74
+
75
+
76
+ def get_pep_dps(smi_list):
77
+ if len(smi_list) == 0:
78
+ return np.zeros((0, 211))
79
+ return np.array([get_pep_dps_from_smi(smi) for smi in smi_list])
80
+
81
+
82
+
83
+ def check_smi_validity(smiles: list):
84
+ valid_smi, valid_idx = [], []
85
+ for idx, smi in enumerate(smiles):
86
+ try:
87
+ mol = Chem.MolFromSmiles(smi) if smi else None
88
+ if mol:
89
+ valid_smi.append(smi)
90
+ valid_idx.append(idx)
91
+ except Exception as e:
92
+ # logger.debug(f'Error: {e} in smiles {smi}')
93
+ pass
94
+ return valid_smi, valid_idx
src/scoring/functions/solubility.py ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import xgboost as xgb
2
+ import torch
3
+ import numpy as np
4
+ from transformers import AutoModelForMaskedLM
5
+ import warnings
6
+ import numpy as np
7
+ from rdkit import rdBase
8
+
9
+ rdBase.DisableLog('rdApp.error')
10
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
11
+ warnings.filterwarnings("ignore", category=UserWarning)
12
+ warnings.filterwarnings("ignore", category=FutureWarning)
13
+
14
class Solubility:
    """Solubility scorer for peptide SMILES: an XGBoost classifier over
    mean-pooled PeptideCLM embeddings."""

    def __init__(self, tokenizer, base_path, device=None, emb_model=None):
        self.device = device if device is not None else torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.predictor = xgb.Booster(model_file=f'{base_path}/src/scoring/functions/classifiers/solubility-xgboost.json')
        if emb_model is not None:
            self.emb_model = emb_model.to(self.device).eval()
        else:
            self.emb_model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all').roformer.to(self.device).eval()

        self.tokenizer = tokenizer

    def generate_embeddings(self, sequences):
        """Mean-pooled last-hidden-state embedding per sequence (numpy)."""
        pooled = []
        for seq in sequences:
            enc = self.tokenizer(seq, return_tensors='pt')
            enc = {name: tens.to(self.device) for name, tens in enc.items()}
            with torch.no_grad():
                hidden = self.emb_model(**enc).last_hidden_state
                # Mean pooling across sequence length.
                pooled.append(hidden.mean(dim=1).squeeze(0).cpu().numpy())
        return np.array(pooled)

    def get_scores(self, input_seqs: list):
        """Predicted solubility probability per sequence; zeros for empty input."""
        fallback = np.zeros(len(input_seqs))
        feats = self.generate_embeddings(input_seqs)
        if len(feats) == 0:
            return fallback

        feats = np.nan_to_num(feats, nan=0.)
        feats = np.clip(feats, np.finfo(np.float32).min, np.finfo(np.float32).max)
        return self.predictor.predict(xgb.DMatrix(feats))

    def __call__(self, input_seqs: list):
        return self.get_scores(input_seqs)
55
+
56
+ def unittest():
57
+ solubility = Solubility()
58
+ seq = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CC(=CN2)C1=C2C=CC=C1)C(=O)N[C@@H](c1ccc(cc1)F)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCCO)C(=O)N[C@@H](CC1=CN=C-N1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)O"]
59
+ scores = solubility(input_seqs=seq)
60
+ print(scores)
61
+
62
+ if __name__ == '__main__':
63
+ unittest()
src/scoring/scoring_functions.py ADDED
@@ -0,0 +1,75 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer
2
+ from transformers import AutoModelForMaskedLM
3
+ import numpy as np
4
+ from scoring.functions.binding import BindingAffinity
5
+ from scoring.functions.permeability import Permeability
6
+ from scoring.functions.solubility import Solubility
7
+ from scoring.functions.hemolysis import Hemolysis
8
+ from scoring.functions.nonfouling import Nonfouling
9
+
10
+ base_path = '/path/to/your/home'
11
+
12
class ScoringFunctions:
    def __init__(self, score_func_names=None, prot_seqs=None, device=None):
        """
        Class for generating score vectors given generated sequences.

        Args:
            score_func_names: list of scoring function names to be evaluated
                (None means no scoring functions — unmasking based on
                validity of peptide bonds only)
            prot_seqs: sequences of target protein binders (0, 1 or 2 entries)
            device: torch device for the shared embedding model
        """
        emb_model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all').roformer.to(device).eval()
        tokenizer = SMILES_SPE_Tokenizer(f'{base_path}/src/scoring/functions/tokenizer/new_vocab.txt',
                                         f'{base_path}/src/scoring/functions/tokenizer/new_splits.txt')
        prot_seqs = prot_seqs if prot_seqs is not None else []

        # BUG FIX: normalise to a list BEFORE the membership checks below;
        # the original evaluated `'...' in score_func_names` with a possible
        # None, raising TypeError.
        self.score_func_names = score_func_names if score_func_names is not None else []
        score_func_names = self.score_func_names

        # binding affinities
        self.target_protein = prot_seqs

        binding_affinity1 = None
        binding_affinity2 = None
        if ('binding_affinity1' in score_func_names) and (len(prot_seqs) == 1):
            binding_affinity1 = BindingAffinity(prot_seqs[0], tokenizer=tokenizer, base_path=base_path, device=device)
        elif ('binding_affinity1' in score_func_names) and ('binding_affinity2' in score_func_names) and (len(prot_seqs) == 2):
            binding_affinity1 = BindingAffinity(prot_seqs[0], tokenizer=tokenizer, base_path=base_path, device=device)
            binding_affinity2 = BindingAffinity(prot_seqs[1], tokenizer=tokenizer, base_path=base_path, device=device)

        # Non-binding scorers share the embedding model loaded above.
        permeability = Permeability(tokenizer=tokenizer, base_path=base_path, device=device, emb_model=emb_model)
        sol = Solubility(tokenizer=tokenizer, base_path=base_path, device=device, emb_model=emb_model)
        nonfouling = Nonfouling(tokenizer=tokenizer, base_path=base_path, device=device, emb_model=emb_model)
        hemo = Hemolysis(tokenizer=tokenizer, base_path=base_path, device=device, emb_model=emb_model)

        self.all_funcs = {'binding_affinity1': binding_affinity1,
                          'binding_affinity2': binding_affinity2,
                          'permeability': permeability,
                          'nonfouling': nonfouling,
                          'solubility': sol,
                          'hemolysis': hemo
                          }

    def forward(self, input_seqs):
        """Evaluate every configured scoring function on the sequences.

        Returns:
            np.ndarray of shape (num_sequences, num_functions), float32.
        """
        scores = [self.all_funcs[name](input_seqs=input_seqs)
                  for name in self.score_func_names]

        # convert to numpy array with shape (num_sequences, num_functions)
        return np.float32(scores).T

    def __call__(self, input_seqs: list):
        return self.forward(input_seqs)
src/scoring/tokenizer/my_tokenizers.py ADDED
@@ -0,0 +1,424 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import collections
2
+ import os
3
+ import re
4
+ from typing import List, Optional
5
+ from transformers import PreTrainedTokenizer
6
+ from SmilesPE.tokenizer import SPE_Tokenizer
7
+ import torch
8
+
9
def load_vocab(vocab_file):
    """Load a vocabulary file (one token per line) into an ordered
    token -> index mapping."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        lines = reader.readlines()
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab
18
+
19
class Atomwise_Tokenizer(object):
    """Atom-level SMILES tokenizer driven by a single regular expression."""

    def __init__(self):
        """Construct an atom-level tokenizer.

        The pattern matches, in priority order: short parenthesised groups,
        bracket atoms, two-letter elements (Br/Cl), single-letter atoms,
        and SMILES punctuation (bonds, branches, ring-closure digits, etc.).
        """
        # self.regex_pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
        self.regex_pattern = r"(\([^\(\)]{0,4}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/\/?|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
        self.regex = re.compile(self.regex_pattern)

    def tokenize(self, text):
        """Split a SMILES string into atom-level tokens."""
        return self.regex.findall(text)
35
+
36
class SMILES_SPE_Tokenizer(PreTrainedTokenizer):
    r"""
    Constructs a SMILES tokenizer based on SMILES Pair Encoding
    (https://github.com/XinhaoLi74/SmilesPE).

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer`
    which contains most of the methods. Users should refer to the superclass
    for more information regarding methods.

    Args:
        vocab_file (:obj:`string`):
            File containing the vocabulary (one token per line; line number
            is the token id).
        spe_file (:obj:`string`):
            File containing the trained SMILES Pair Encoding merges.
        unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
            The unknown token; out-of-vocabulary tokens map to its id.
        sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
            The separator token; also the last token of a sequence built
            with special tokens.
        pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
            The token used for padding when batching sequences.
        cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
            The classifier token; first token of a sequence built with
            special tokens.
        mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
            The token used for masked-language-model training.
    """

    def __init__(self, vocab_file, spe_file,
                 unk_token="[UNK]",
                 sep_token="[SEP]",
                 pad_token="[PAD]",
                 cls_token="[CLS]",
                 mask_token="[MASK]",
                 **kwargs):
        if not os.path.isfile(vocab_file):
            raise ValueError("Can't find a vocabulary file at path '{}'.".format(vocab_file))
        if not os.path.isfile(spe_file):
            raise ValueError("Can't find a SPE vocabulary file at path '{}'.".format(spe_file))

        # self.vocab must exist before super().__init__ because the base
        # class may call vocab_size / get_vocab during its initialization.
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict(
            [(ids, tok) for tok, ids in self.vocab.items()])
        # Fix: the SPE merges file was previously opened and never closed
        # (resource leak; the open handle was kept as self.spe_vocab but only
        # ever consumed here). SPE_Tokenizer reads the file fully in its
        # constructor, so closing it immediately afterwards is safe.
        with open(spe_file, 'r', encoding='utf-8') as spe_vocab:
            self.spe_tokenizer = SPE_Tokenizer(spe_vocab)

        super().__init__(
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            **kwargs)

    @property
    def vocab_size(self):
        """Number of tokens in the base vocabulary (excluding added tokens)."""
        return len(self.vocab)

    def get_vocab(self):
        """Return the full token -> id mapping (base vocab + added tokens)."""
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        # SPE_Tokenizer.tokenize returns a single space-joined string of
        # merged tokens; split it back into a token list.
        return self.spe_tokenizer.tokenize(text).split(' ')

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    # NOTE: encode/decode intentionally override the transformers signatures;
    # they operate on pre-tokenized token arrays / id tensors for this project.
    def encode(self, token_array):
        """Encode a pre-tokenized SMILES into model-ready tensors.

        Args:
            token_array: iterable of string tokens (already SPE-tokenized).

        Returns:
            dict with 'input_ids' of shape [1, len(token_array) + 2] and an
            all-ones 'attention_mask' of the same shape (torch tensors).
        """
        # 2 and 3 are the hard-coded [CLS] and [SEP] ids; they match the
        # positions of "[CLS]"/"[SEP]" in the bundled vocab file.
        token_ids = [2]
        token_ids.extend(self._convert_token_to_id(token) for token in token_array)
        token_ids.append(3)
        token_ids = torch.tensor([token_ids])
        attn_mask = torch.ones_like(token_ids)
        return {'input_ids': token_ids, 'attention_mask': attn_mask}

    def decode(self, token_ids, skip_special_tokens=True):
        """Decode an id tensor back into a SMILES string.

        Decoding stops at the first [SEP] id (3); when skip_special_tokens
        is True, ids in self.all_special_ids are dropped.

        Args:
            token_ids (torch.Tensor): id tensor, optionally with a leading
                batch dimension of 1 (it is squeezed away).
            skip_special_tokens (bool): drop special-token ids.

        Returns:
            str: concatenated token strings.
        """
        token_ids = token_ids.squeeze(0).cpu().tolist()
        token_array = []
        for idx in token_ids:
            if idx == 3:  # stop at [SEP]
                break
            if skip_special_tokens and idx in self.all_special_ids:
                continue
            token_array.append(self._convert_id_to_token(idx))
        return "".join(token_array)

    def batch_decode(self, batch_token_ids, skip_special_tokens=True):
        """Decode a batch of id tensors into SMILES strings.

        Fix: skip_special_tokens is now forwarded to decode(); it was
        previously accepted but silently ignored.
        """
        return [self.decode(token_ids, skip_special_tokens=skip_special_tokens)
                for token_ids in batch_token_ids]

    def get_token_split(self, token_ids):
        """Map a batch of id sequences to lists of token strings (no joining,
        no special-token filtering)."""
        if isinstance(token_ids, torch.Tensor):
            token_ids = token_ids.cpu().tolist()
        return [[self._convert_id_to_token(idx) for idx in seq_ids]
                for seq_ids in token_ids]

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        return " ".join(tokens).replace(" ##", "").strip()

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences by adding
        special tokens, BERT style:
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`): ids of the first sequence.
            token_ids_1 (:obj:`List[int]`, `optional`): ids of the second
                sequence for sequence pairs.

        Returns:
            :obj:`List[int]`: input ids with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Return a mask with 1 at special-token positions and 0 elsewhere.

        Args:
            token_ids_0 (:obj:`List[int]`): ids of the first sequence.
            token_ids_1 (:obj:`List[int]`, `optional`): ids of the second
                sequence for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`): set True if
                token_ids_0 is already formatted with special tokens.

        Returns:
            :obj:`List[int]`: list of 0/1 flags, 1 marking a special token.
        """
        if already_has_special_tokens:
            if token_ids_1 is not None:
                raise ValueError(
                    "You should not supply a second sequence if the provided sequence of "
                    "ids is already formated with special tokens for the model."
                )
            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create BERT-style token-type ids: 0 for the first sequence (including
        [CLS] and its [SEP]) and 1 for the second sequence (including its
        [SEP]). When token_ids_1 is None only the 0-mask is returned.

        Returns:
            :obj:`List[int]`: token type ids for the combined sequence.
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, vocab_path):
        """
        Save the vocabulary to *vocab_path*, one token per line in id order.

        Args:
            vocab_path (:obj:`str`): destination file path.

        Returns:
            :obj:`Tuple(str)`: path to the written file.
        """
        index = 0
        vocab_file = vocab_path
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                # Non-consecutive ids indicate holes in the vocabulary;
                # resync the counter so writing continues (mirrors the
                # original BERT tokenizer's save logic).
                if index != token_index:
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)
252
+
253
class SMILES_Atomwise_Tokenizer(PreTrainedTokenizer):
    r"""
    Constructs an atom-level SMILES tokenizer (regex-based, one token per
    atom/bond/bracket group; see Atomwise_Tokenizer).

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer`
    which contains most of the methods. Users should refer to the superclass
    for more information regarding methods.

    Args:
        vocab_file (:obj:`string`):
            File containing the vocabulary (one token per line; line number
            is the token id).
        unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
            The unknown token; out-of-vocabulary tokens map to its id.
        sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
            The separator token; also the last token of a sequence built
            with special tokens.
        pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
            The token used for padding when batching sequences.
        cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
            The classifier token; first token of a sequence built with
            special tokens.
        mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
            The token used for masked-language-model training.
    """

    def __init__(
        self,
        vocab_file,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        **kwargs
    ):
        if not os.path.isfile(vocab_file):
            raise ValueError(
                "Can't find a vocabulary file at path '{}'.".format(vocab_file)
            )
        # Fix: load the vocab BEFORE calling super().__init__, for
        # consistency with SMILES_SPE_Tokenizer -- the transformers base
        # class may query vocab_size / get_vocab during its initialization,
        # which would fail while self.vocab did not exist yet.
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict(
            [(ids, tok) for tok, ids in self.vocab.items()])
        self.tokenizer = Atomwise_Tokenizer()

        super().__init__(
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            **kwargs,
        )

    @property
    def vocab_size(self):
        """Number of tokens in the base vocabulary (excluding added tokens)."""
        return len(self.vocab)

    def get_vocab(self):
        """Return the full token -> id mapping (base vocab + added tokens)."""
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        # Delegate to the regex-based atom-level tokenizer.
        return self.tokenizer.tokenize(text)

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        return " ".join(tokens).replace(" ##", "").strip()

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences by adding
        special tokens, BERT style:
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`): ids of the first sequence.
            token_ids_1 (:obj:`List[int]`, `optional`): ids of the second
                sequence for sequence pairs.

        Returns:
            :obj:`List[int]`: input ids with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Return a mask with 1 at special-token positions and 0 elsewhere.

        Args:
            token_ids_0 (:obj:`List[int]`): ids of the first sequence.
            token_ids_1 (:obj:`List[int]`, `optional`): ids of the second
                sequence for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`): set True if
                token_ids_0 is already formatted with special tokens.

        Returns:
            :obj:`List[int]`: list of 0/1 flags, 1 marking a special token.
        """
        if already_has_special_tokens:
            if token_ids_1 is not None:
                raise ValueError(
                    "You should not supply a second sequence if the provided sequence of "
                    "ids is already formated with special tokens for the model."
                )
            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create BERT-style token-type ids: 0 for the first sequence (including
        [CLS] and its [SEP]) and 1 for the second sequence (including its
        [SEP]). When token_ids_1 is None only the 0-mask is returned.

        Returns:
            :obj:`List[int]`: token type ids for the combined sequence.
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, vocab_path):
        """
        Save the vocabulary to *vocab_path*, one token per line in id order.

        Args:
            vocab_path (:obj:`str`): destination file path.

        Returns:
            :obj:`Tuple(str)`: path to the written file.
        """
        index = 0
        vocab_file = vocab_path
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                # Non-consecutive ids indicate holes in the vocabulary;
                # resync the counter so writing continues (mirrors the
                # original BERT tokenizer's save logic).
                if index != token_index:
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)
src/scoring/tokenizer/new_splits.txt ADDED
@@ -0,0 +1,159 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ c 1
2
+ c 2
3
+ c 3
4
+ c 4
5
+ c 5
6
+ c 6
7
+ c 7
8
+ c 8
9
+ c 9
10
+ ( c1
11
+ ( c2
12
+ c1 )
13
+ c2 )
14
+ n 1
15
+ n 2
16
+ n 3
17
+ n 4
18
+ n 5
19
+ n 6
20
+ n 7
21
+ n 8
22
+ n 9
23
+ ( n1
24
+ ( n2
25
+ n1 )
26
+ n2 )
27
+ O 1
28
+ O 2
29
+ O 3
30
+ O 4
31
+ O 5
32
+ O 6
33
+ O 7
34
+ O 8
35
+ O 9
36
+ ( O1
37
+ ( O2
38
+ O2 )
39
+ O2 )
40
+ = O
41
+ = C
42
+ = c
43
+ = N
44
+ = n
45
+ =C C
46
+ =C N
47
+ =C c
48
+ =c c
49
+ =N C
50
+ =N c
51
+ =n C
52
+ =n c
53
+ # N
54
+ # C
55
+ #N C
56
+ #C C
57
+ #C N
58
+ #N N
59
+ ( C
60
+ C )
61
+ ( O
62
+ O )
63
+ ( N
64
+ N )
65
+ Br c
66
+ ( =O
67
+ (=O )
68
+ C (=O)
69
+ C =O
70
+ C =N
71
+ C #N
72
+ C #C
73
+ C C
74
+ CC C
75
+ CC N
76
+ CC O
77
+ CC S
78
+ CC c
79
+ CC n
80
+ C N
81
+ CN C
82
+ CN c
83
+ C O
84
+ CO C
85
+ CO N
86
+ CO c
87
+ C S
88
+ CS C
89
+ CS S
90
+ CS c
91
+ C c
92
+ Cl c
93
+ C n
94
+ F c
95
+ N C
96
+ NC C
97
+ NC c
98
+ N N
99
+ N O
100
+ N c
101
+ N n
102
+ O C
103
+ OC C
104
+ OC O
105
+ OC c
106
+ O N
107
+ O O
108
+ O c
109
+ S C
110
+ SC C
111
+ SC c
112
+ S S
113
+ S c
114
+ c c
115
+ cc c
116
+ cc n
117
+ cc o
118
+ cc s
119
+ cc cc
120
+ c n
121
+ cn c
122
+ cn n
123
+ c o
124
+ co c
125
+ c s
126
+ cs c
127
+ cs n
128
+ n c
129
+ nc c
130
+ nc n
131
+ nc o
132
+ nc s
133
+ n n
134
+ nn c
135
+ nn n
136
+ n o
137
+ no c
138
+ no n
139
+ n s
140
+ ns c
141
+ ns n
142
+ o c
143
+ oc c
144
+ o n
145
+ s c
146
+ sc c
147
+ sc n
148
+ s n
149
+ N P
150
+ P N
151
+ C P
152
+ P C
153
+ N S
154
+ S N
155
+ C S
156
+ S C
157
+ S P
158
+ P S
159
+ C I
src/scoring/tokenizer/new_vocab.txt ADDED
@@ -0,0 +1,587 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [PAD]
2
+ [UNK]
3
+ [CLS]
4
+ [SEP]
5
+ [MASK]
6
+ #
7
+ %
8
+ (
9
+ )
10
+ +
11
+ -
12
+ /
13
+ 0
14
+ 1
15
+ 2
16
+ 3
17
+ 4
18
+ 5
19
+ 6
20
+ 7
21
+ 8
22
+ 9
23
+ =
24
+ @
25
+ A
26
+ B
27
+ Br
28
+ Brc
29
+ C
30
+ CC
31
+ CCC
32
+ CCN
33
+ CCO
34
+ CCS
35
+ CCc
36
+ CCn
37
+ CN
38
+ CNC
39
+ CNc
40
+ CO
41
+ COC
42
+ CON
43
+ COc
44
+ CS
45
+ CSC
46
+ CSS
47
+ CSc
48
+ Cc
49
+ Cl
50
+ Clc
51
+ Cn
52
+ F
53
+ Fc
54
+ H
55
+ I
56
+ K
57
+ L
58
+ M
59
+ N
60
+ NC
61
+ NCC
62
+ NCc
63
+ NN
64
+ NO
65
+ Nc
66
+ Nn
67
+ O
68
+ OC
69
+ OCC
70
+ OCO
71
+ OCc
72
+ ON
73
+ OO
74
+ Oc
75
+ P
76
+ R
77
+ S
78
+ SC
79
+ SCC
80
+ SCc
81
+ SS
82
+ Sc
83
+ T
84
+ X
85
+ Z
86
+ [
87
+ \\
88
+ (/
89
+ ]
90
+ a
91
+ b
92
+ c
93
+ cc
94
+ ccc
95
+ cccc
96
+ ccn
97
+ cco
98
+ ccs
99
+ cn
100
+ cnc
101
+ cnn
102
+ co
103
+ coc
104
+ cs
105
+ csc
106
+ csn
107
+ e
108
+ g
109
+ i
110
+ l
111
+ n
112
+ nc
113
+ ncc
114
+ ncn
115
+ nco
116
+ ncs
117
+ nn
118
+ nnc
119
+ nnn
120
+ no
121
+ noc
122
+ non
123
+ ns
124
+ nsc
125
+ nsn
126
+ o
127
+ oc
128
+ occ
129
+ on
130
+ p
131
+ r
132
+ s
133
+ sc
134
+ scc
135
+ scn
136
+ sn
137
+ t
138
+ c1
139
+ c2
140
+ c3
141
+ c4
142
+ c5
143
+ c6
144
+ c7
145
+ c8
146
+ c9
147
+ n1
148
+ n2
149
+ n3
150
+ n4
151
+ n5
152
+ n6
153
+ n7
154
+ n8
155
+ n9
156
+ O1
157
+ O2
158
+ O3
159
+ O4
160
+ O5
161
+ O6
162
+ O7
163
+ O8
164
+ O9
165
+ (c1
166
+ (c2
167
+ c1)
168
+ c2)
169
+ (n1
170
+ (n2
171
+ n1)
172
+ n2)
173
+ (O1
174
+ (O2
175
+ O2)
176
+ =O
177
+ =C
178
+ =c
179
+ =N
180
+ =n
181
+ =CC
182
+ =CN
183
+ =Cc
184
+ =cc
185
+ =NC
186
+ =Nc
187
+ =nC
188
+ =nc
189
+ #C
190
+ #CC
191
+ #CN
192
+ #N
193
+ #NC
194
+ #NN
195
+ (C
196
+ C)
197
+ (O
198
+ O)
199
+ (N
200
+ N)
201
+ NP
202
+ PN
203
+ CP
204
+ PC
205
+ NS
206
+ SN
207
+ SP
208
+ PS
209
+ C(=O)
210
+ (/Br)
211
+ (/C#N)
212
+ (/C)
213
+ (/C=N)
214
+ (/C=O)
215
+ (/CBr)
216
+ (/CC)
217
+ (/CCC)
218
+ (/CCF)
219
+ (/CCN)
220
+ (/CCO)
221
+ (/CCl)
222
+ (/CI)
223
+ (/CN)
224
+ (/CO)
225
+ (/CS)
226
+ (/Cl)
227
+ (/F)
228
+ (/I)
229
+ (/N)
230
+ (/NC)
231
+ (/NCC)
232
+ (/NO)
233
+ (/O)
234
+ (/OC)
235
+ (/OCC)
236
+ (/S)
237
+ (/SC)
238
+ (=C)
239
+ (=C/C)
240
+ (=C/F)
241
+ (=C/I)
242
+ (=C/N)
243
+ (=C/O)
244
+ (=CBr)
245
+ (=CC)
246
+ (=CCF)
247
+ (=CCN)
248
+ (=CCO)
249
+ (=CCl)
250
+ (=CF)
251
+ (=CI)
252
+ (=CN)
253
+ (=CO)
254
+ (=C\\C)
255
+ (=C\\F)
256
+ (=C\\I)
257
+ (=C\\N)
258
+ (=C\\O)
259
+ (=N)
260
+ (=N/C)
261
+ (=N/N)
262
+ (=N/O)
263
+ (=NBr)
264
+ (=NC)
265
+ (=NCC)
266
+ (=NCl)
267
+ (=NN)
268
+ (=NO)
269
+ (=NOC)
270
+ (=N\\C)
271
+ (=N\\N)
272
+ (=N\\O)
273
+ (=O)
274
+ (=S)
275
+ (B)
276
+ (Br)
277
+ (C#C)
278
+ (C#CC)
279
+ (C#CI)
280
+ (C#CO)
281
+ (C#N)
282
+ (C#SN)
283
+ (C)
284
+ (C=C)
285
+ (C=CF)
286
+ (C=CI)
287
+ (C=N)
288
+ (C=NN)
289
+ (C=NO)
290
+ (C=O)
291
+ (C=S)
292
+ (CBr)
293
+ (CC#C)
294
+ (CC#N)
295
+ (CC)
296
+ (CC=C)
297
+ (CC=O)
298
+ (CCBr)
299
+ (CCC)
300
+ (CCCC)
301
+ (CCCF)
302
+ (CCCI)
303
+ (CCCN)
304
+ (CCCO)
305
+ (CCCS)
306
+ (CCCl)
307
+ (CCF)
308
+ (CCI)
309
+ (CCN)
310
+ (CCNC)
311
+ (CCNN)
312
+ (CCNO)
313
+ (CCO)
314
+ (CCOC)
315
+ (CCON)
316
+ (CCS)
317
+ (CCSC)
318
+ (CCl)
319
+ (CF)
320
+ (CI)
321
+ (CN)
322
+ (CN=O)
323
+ (CNC)
324
+ (CNCC)
325
+ (CNCO)
326
+ (CNN)
327
+ (CNNC)
328
+ (CNO)
329
+ (CNOC)
330
+ (CO)
331
+ (COC)
332
+ (COCC)
333
+ (COCI)
334
+ (COCN)
335
+ (COCO)
336
+ (COF)
337
+ (CON)
338
+ (COO)
339
+ (CS)
340
+ (CSC)
341
+ (CSCC)
342
+ (CSCF)
343
+ (CSO)
344
+ (Cl)
345
+ (F)
346
+ (I)
347
+ (N)
348
+ (N=N)
349
+ (N=NO)
350
+ (N=O)
351
+ (N=S)
352
+ (NBr)
353
+ (NC#N)
354
+ (NC)
355
+ (NC=N)
356
+ (NC=O)
357
+ (NC=S)
358
+ (NCBr)
359
+ (NCC)
360
+ (NCCC)
361
+ (NCCF)
362
+ (NCCN)
363
+ (NCCO)
364
+ (NCCS)
365
+ (NCCl)
366
+ (NCNC)
367
+ (NCO)
368
+ (NCS)
369
+ (NCl)
370
+ (NN)
371
+ (NN=O)
372
+ (NNC)
373
+ (NO)
374
+ (NOC)
375
+ (O)
376
+ (OC#N)
377
+ (OC)
378
+ (OC=C)
379
+ (OC=O)
380
+ (OC=S)
381
+ (OCBr)
382
+ (OCC)
383
+ (OCCC)
384
+ (OCCF)
385
+ (OCCI)
386
+ (OCCN)
387
+ (OCCO)
388
+ (OCCS)
389
+ (OCCl)
390
+ (OCF)
391
+ (OCI)
392
+ (OCO)
393
+ (OCOC)
394
+ (OCON)
395
+ (OCSC)
396
+ (OCl)
397
+ (OI)
398
+ (ON)
399
+ (OO)
400
+ (OOC)
401
+ (OOCC)
402
+ (OOSN)
403
+ (OSC)
404
+ (P)
405
+ (S)
406
+ (SC#N)
407
+ (SC)
408
+ (SCC)
409
+ (SCCC)
410
+ (SCCF)
411
+ (SCCN)
412
+ (SCCO)
413
+ (SCCS)
414
+ (SCCl)
415
+ (SCF)
416
+ (SCN)
417
+ (SCOC)
418
+ (SCSC)
419
+ (SCl)
420
+ (SI)
421
+ (SN)
422
+ (SN=O)
423
+ (SO)
424
+ (SOC)
425
+ (SOOO)
426
+ (SS)
427
+ (SSC)
428
+ (SSCC)
429
+ ([At])
430
+ ([O-])
431
+ ([O])
432
+ ([S-])
433
+ (\\Br)
434
+ (\\C#N)
435
+ (\\C)
436
+ (\\C=N)
437
+ (\\C=O)
438
+ (\\CBr)
439
+ (\\CC)
440
+ (\\CCC)
441
+ (\\CCO)
442
+ (\\CCl)
443
+ (\\CF)
444
+ (\\CN)
445
+ (\\CNC)
446
+ (\\CO)
447
+ (\\COC)
448
+ (\\Cl)
449
+ (\\F)
450
+ (\\I)
451
+ (\\N)
452
+ (\\NC)
453
+ (\\NCC)
454
+ (\\NN)
455
+ (\\NO)
456
+ (\\NOC)
457
+ (\\O)
458
+ (\\OC)
459
+ (\\OCC)
460
+ (\\ON)
461
+ (\\S)
462
+ (\\SC)
463
+ (\\SCC)
464
+ [Ag+]
465
+ [Ag-4]
466
+ [Ag]
467
+ [Al-3]
468
+ [Al]
469
+ [As+]
470
+ [AsH3]
471
+ [AsH]
472
+ [As]
473
+ [At]
474
+ [B-]
475
+ [B@-]
476
+ [B@@-]
477
+ [BH-]
478
+ [BH2-]
479
+ [BH3-]
480
+ [B]
481
+ [Ba]
482
+ [Br+2]
483
+ [BrH]
484
+ [Br]
485
+ [C+]
486
+ [C-]
487
+ [C@@H]
488
+ [C@@]
489
+ [C@H]
490
+ [C@]
491
+ [CH-]
492
+ [CH2]
493
+ [CH3]
494
+ [CH]
495
+ [C]
496
+ [CaH2]
497
+ [Ca]
498
+ [Cl+2]
499
+ [Cl+3]
500
+ [Cl+]
501
+ [Cs]
502
+ [FH]
503
+ [F]
504
+ [H]
505
+ [He]
506
+ [I+2]
507
+ [I+3]
508
+ [I+]
509
+ [IH]
510
+ [I]
511
+ [K]
512
+ [Kr]
513
+ [Li+]
514
+ [LiH]
515
+ [MgH2]
516
+ [Mg]
517
+ [N+]
518
+ [N-]
519
+ [N@+]
520
+ [N@@+]
521
+ [N@@]
522
+ [N@]
523
+ [NH+]
524
+ [NH-]
525
+ [NH2+]
526
+ [NH3]
527
+ [NH]
528
+ [N]
529
+ [Na]
530
+ [O+]
531
+ [O-]
532
+ [OH+]
533
+ [OH2]
534
+ [OH]
535
+ [O]
536
+ [P+]
537
+ [P@+]
538
+ [P@@+]
539
+ [P@@]
540
+ [P@]
541
+ [PH2]
542
+ [PH]
543
+ [P]
544
+ [Ra]
545
+ [Rb]
546
+ [S+]
547
+ [S-]
548
+ [S@+]
549
+ [S@@+]
550
+ [S@@]
551
+ [S@]
552
+ [SH+]
553
+ [SH2]
554
+ [SH]
555
+ [S]
556
+ [Se+]
557
+ [Se-2]
558
+ [SeH2]
559
+ [SeH]
560
+ [Se]
561
+ [Si@]
562
+ [SiH2]
563
+ [SiH]
564
+ [Si]
565
+ [SrH2]
566
+ [TeH]
567
+ [Te]
568
+ [Xe]
569
+ [Zn+2]
570
+ [Zn-2]
571
+ [Zn]
572
+ [b-]
573
+ [c+]
574
+ [c-]
575
+ [cH-]
576
+ [cH]
577
+ [c]
578
+ [n+]
579
+ [n-]
580
+ [nH]
581
+ [n]
582
+ [o+]
583
+ [s+]
584
+ [se+]
585
+ [se]
586
+ [te+]
587
+ [te]
src/tokenizer/__init__.py ADDED
File without changes
src/tokenizer/my_tokenizers.py ADDED
@@ -0,0 +1,441 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import collections
2
+ import logging
3
+ import os
4
+ import re
5
+ import codecs
6
+ import unicodedata
7
+ from typing import List, Optional
8
+ from transformers import PreTrainedTokenizer
9
+ from SmilesPE.tokenizer import SPE_Tokenizer
10
+ import torch
11
+
12
def load_vocab(vocab_file):
    """Read a vocabulary file and map each token to its line index.

    Each line of the file is one token; the token's id is its 0-based line
    number. Only the trailing newline is stripped, so tokens containing
    other whitespace survive intact.
    """
    with open(vocab_file, "r", encoding="utf-8") as handle:
        lines = handle.readlines()
    return collections.OrderedDict(
        (line.rstrip("\n"), idx) for idx, line in enumerate(lines)
    )
21
+
22
class Atomwise_Tokenizer(object):
    """Atom-level SMILES tokenizer.

    Splits a SMILES string into atom-level units (bracket atoms, two-letter
    halogens Br/Cl, short parenthesised groups of up to four characters,
    ring-closure digits, bond symbols, ...) with a single compiled regular
    expression.
    """

    def __init__(self):
        """Compile the atom-level tokenization pattern."""
        self.regex_pattern = r"(\([^\(\)]{0,4}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/\/?|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
        self.regex = re.compile(self.regex_pattern)

    def tokenize(self, text):
        """Return the list of atom-level tokens found in *text*."""
        return list(self.regex.findall(text))
38
+
39
+ class SMILES_SPE_Tokenizer(PreTrainedTokenizer):
40
+ r"""
41
+ Constructs a SMILES tokenizer. Based on SMILES Pair Encoding (https://github.com/XinhaoLi74/SmilesPE).
42
+ This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
43
+ should refer to the superclass for more information regarding methods.
44
+ Args:
45
+ vocab_file (:obj:`string`):
46
+ File containing the vocabulary.
47
+ spe_file (:obj:`string`):
48
+ File containing the trained SMILES Pair Encoding vocabulary.
49
+ unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
50
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
51
+ token instead.
52
+ sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
53
+ The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
54
+ for sequence classification or for a text and a question for question answering.
55
+ It is also used as the last token of a sequence built with special tokens.
56
+ pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
57
+ The token used for padding, for example when batching sequences of different lengths.
58
+ cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
59
+ The classifier token which is used when doing sequence classification (classification of the whole
60
+ sequence instead of per-token classification). It is the first token of the sequence when built with
61
+ special tokens.
62
+ mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
63
+ The token used for masking values. This is the token used when training this model with masked language
64
+ modeling. This is the token which the model will try to predict.
65
+ """
66
+
67
+ def __init__(self, vocab_file, spe_file,
68
+ unk_token="[UNK]",
69
+ sep_token="[SEP]",
70
+ pad_token="[PAD]",
71
+ cls_token="[CLS]",
72
+ mask_token="[MASK]",
73
+ **kwargs):
74
+ if not os.path.isfile(vocab_file):
75
+ raise ValueError("Can't find a vocabulary file at path '{}'.".format(vocab_file))
76
+ if not os.path.isfile(spe_file):
77
+ raise ValueError("Can't find a SPE vocabulary file at path '{}'.".format(spe_file))
78
+
79
+ self.vocab = load_vocab(vocab_file)
80
+ self.spe_vocab = open(spe_file, 'r', encoding='utf-8')
81
+ self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
82
+ self.spe_tokenizer = SPE_Tokenizer(self.spe_vocab)
83
+
84
+ super().__init__(
85
+ unk_token=unk_token,
86
+ sep_token=sep_token,
87
+ pad_token=pad_token,
88
+ cls_token=cls_token,
89
+ mask_token=mask_token,
90
+ **kwargs)
91
+
92
+ @property
93
+ def vocab_size(self):
94
+ return len(self.vocab)
95
+
96
+ def get_vocab(self):
97
+ return dict(self.vocab, **self.added_tokens_encoder)
98
+
99
+ def _tokenize(self, text):
100
+ return self.spe_tokenizer.tokenize(text).split(' ')
101
+
102
+ def _convert_token_to_id(self, token):
103
+ """ Converts a token (str) in an id using the vocab. """
104
+ return self.vocab.get(token, self.vocab.get(self.unk_token))
105
+
106
+ # changed encode and decode functions
107
    def encode(self, token_array):
        """Encode a pre-tokenized SMILES into model-ready tensors.

        NOTE: overrides the transformers ``encode`` signature -- it takes an
        iterable of string tokens (already SPE-tokenized), not raw text.

        Args:
            token_array: iterable of string tokens.

        Returns:
            dict with 'input_ids' of shape [1, len(token_array) + 2] and an
            all-ones 'attention_mask' of the same shape (torch tensors).
        """
        token_ids = []
        # 2 is the hard-coded [CLS] id and 3 the [SEP] id; these match the
        # positions of "[CLS]"/"[SEP]" in the bundled vocab file.
        token_ids.append(2)
        for token in token_array:
            id = self._convert_token_to_id(token)
            token_ids.append(id)
        token_ids.append(3)
        # Wrap in a batch dimension of 1 to mimic tokenizer __call__ output.
        token_ids = torch.tensor([token_ids])
        attn_mask = torch.ones_like(token_ids)
        return {'input_ids': token_ids, 'attention_mask': attn_mask}
117
+
118
    def decode(self, token_ids, skip_special_tokens=True):
        """Decode an id tensor back into a SMILES string.

        Decoding stops at the first id 3 ([SEP] in the bundled vocab); when
        skip_special_tokens is True, ids in self.all_special_ids are dropped.

        Args:
            token_ids (torch.Tensor): id tensor, optionally with a leading
                batch dimension of 1 (it is squeezed away).
            skip_special_tokens (bool): drop special-token ids.

        Returns:
            str: concatenated token strings (no separators).
        """
        token_ids = token_ids.squeeze(0).cpu().tolist()
        token_array = []
        for idx in token_ids:
            if idx == 3:  # Stop decoding when token ID 3 is encountered
                break
            if skip_special_tokens and idx in self.all_special_ids:
                continue
            token = self._convert_id_to_token(idx)
            token_array.append(token)
        sequence = "".join(token_array)
        return sequence
130
+
131
+ def batch_decode(self, batch_token_ids, skip_special_tokens=True):
132
+ sequences = []
133
+ for token_ids in batch_token_ids:
134
+ sequences.append(self.decode(token_ids))
135
+ return sequences
136
+
137
+ def get_token_split(self, token_ids):
138
+ if isinstance(token_ids, torch.Tensor):
139
+ token_ids = token_ids.cpu().tolist()
140
+
141
+ token_array = []
142
+ for seq_ids in token_ids:
143
+ seq_array = []
144
+ for id in seq_ids:
145
+ token = self._convert_id_to_token(id)
146
+ seq_array.append(token)
147
+ token_array.append(seq_array)
148
+
149
+ return token_array
150
+
151
+ def _convert_id_to_token(self, index):
152
+ """Converts an index (integer) in a token (str) using the vocab."""
153
+ return self.ids_to_tokens.get(index, self.unk_token)
154
+
155
+ def convert_tokens_to_string(self, tokens):
156
+ """ Converts a sequence of tokens (string) in a single string. """
157
+ out_string = " ".join(tokens).replace(" ##", "").strip()
158
+ return out_string
159
+
160
+ def build_inputs_with_special_tokens(
161
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
162
+ ) -> List[int]:
163
+ """
164
+ Build model inputs from a sequence or a pair of sequence for sequence classification tasks
165
+ by concatenating and adding special tokens.
166
+ A BERT sequence has the following format:
167
+ - single sequence: ``[CLS] X [SEP]``
168
+ - pair of sequences: ``[CLS] A [SEP] B [SEP]``
169
+ Args:
170
+ token_ids_0 (:obj:`List[int]`):
171
+ List of IDs to which the special tokens will be added
172
+ token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
173
+ Optional second list of IDs for sequence pairs.
174
+ Returns:
175
+ :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
176
+ """
177
+ if token_ids_1 is None:
178
+ return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
179
+ cls = [self.cls_token_id]
180
+ sep = [self.sep_token_id]
181
+ return cls + token_ids_0 + sep + token_ids_1 + sep
182
+
183
+ def get_special_tokens_mask(
184
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
185
+ ) -> List[int]:
186
+ """
187
+ Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
188
+ special tokens using the tokenizer ``prepare_for_model`` method.
189
+ Args:
190
+ token_ids_0 (:obj:`List[int]`):
191
+ List of ids.
192
+ token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
193
+ Optional second list of IDs for sequence pairs.
194
+ already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
195
+ Set to True if the token list is already formatted with special tokens for the model
196
+ Returns:
197
+ :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
198
+ """
199
+
200
+ if already_has_special_tokens:
201
+ if token_ids_1 is not None:
202
+ raise ValueError(
203
+ "You should not supply a second sequence if the provided sequence of "
204
+ "ids is already formated with special tokens for the model."
205
+ )
206
+ return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
207
+
208
+ if token_ids_1 is not None:
209
+ return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
210
+ return [1] + ([0] * len(token_ids_0)) + [1]
211
+
212
+ def create_token_type_ids_from_sequences(
213
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
214
+ ) -> List[int]:
215
+ """
216
+ Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
217
+ A BERT sequence pair mask has the following format:
218
+ ::
219
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
220
+ | first sequence | second sequence |
221
+ if token_ids_1 is None, only returns the first portion of the mask (0's).
222
+ Args:
223
+ token_ids_0 (:obj:`List[int]`):
224
+ List of ids.
225
+ token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
226
+ Optional second list of IDs for sequence pairs.
227
+ Returns:
228
+ :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
229
+ sequence(s).
230
+ """
231
+ sep = [self.sep_token_id]
232
+ cls = [self.cls_token_id]
233
+ if token_ids_1 is None:
234
+ return len(cls + token_ids_0 + sep) * [0]
235
+ return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
236
+
237
+ def save_vocabulary(self, vocab_path):
238
+ """
239
+ Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
240
+ Args:
241
+ vocab_path (:obj:`str`):
242
+ The directory in which to save the vocabulary.
243
+ Returns:
244
+ :obj:`Tuple(str)`: Paths to the files saved.
245
+ """
246
+ index = 0
247
+ if os.path.isdir(vocab_path):
248
+ vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES["vocab_file"])
249
+ else:
250
+ vocab_file = vocab_path
251
+ with open(vocab_file, "w", encoding="utf-8") as writer:
252
+ for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
253
+ if index != token_index:
254
+ logger.warning(
255
+ "Saving vocabulary to {}: vocabulary indices are not consecutive."
256
+ " Please check that the vocabulary is not corrupted!".format(vocab_file)
257
+ )
258
+ index = token_index
259
+ writer.write(token + "\n")
260
+ index += 1
261
+ return (vocab_file,)
262
+
263
class SMILES_Atomwise_Tokenizer(PreTrainedTokenizer):
    r"""
    Constructs a SMILES tokenizer. Based on SMILES Pair Encoding (https://github.com/XinhaoLi74/SmilesPE).
    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.
    Args:
        vocab_file (:obj:`string`):
            File containing the vocabulary.
        unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
            for sequence classification or for a text and a question for question answering.
            It is also used as the last token of a sequence built with special tokens.
        pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
            The token used for padding, for example when batching sequences of different lengths.
        cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
            The classifier token which is used when doing sequence classification (classification of the whole
            sequence instead of per-token classification). It is the first token of the sequence when built with
            special tokens.
        mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
            The token used for masking values. This is the token used when training this model with masked language
            modeling. This is the token which the model will try to predict.
    """

    def __init__(
        self,
        vocab_file,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        **kwargs
    ):
        # NOTE(review): super().__init__ runs before self.vocab is assigned; on
        # recent transformers versions PreTrainedTokenizer.__init__ may call
        # get_vocab(), which would then fail — confirm the pinned transformers
        # version tolerates this ordering.
        super().__init__(
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            **kwargs,
        )

        if not os.path.isfile(vocab_file):
            raise ValueError(
                "Can't find a vocabulary file at path '{}'.".format(vocab_file)
            )
        # token -> id mapping loaded from one-token-per-line vocab file.
        self.vocab = load_vocab(vocab_file)
        # Inverse mapping, id -> token, preserving vocab order.
        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        # Character/atom-level splitter that produces the raw token stream.
        self.tokenizer = Atomwise_Tokenizer()

    @property
    def vocab_size(self):
        """Number of entries in the base vocabulary (added tokens excluded)."""
        return len(self.vocab)

    def get_vocab(self):
        """Return the full token -> id map, including tokens added after init."""
        return dict(self.vocab, **self.added_tokens_encoder)


    def _tokenize(self, text):
        """Split *text* into atom-level tokens via the wrapped Atomwise_Tokenizer."""
        return self.tokenizer.tokenize(text)

    def _convert_token_to_id(self, token):
        """ Converts a token (str) in an id using the vocab. """
        # Unknown tokens fall back to the [UNK] id.
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """ Converts a sequence of tokens (string) in a single string. """
        # '##' continuation pieces are merged onto the preceding token.
        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
        by concatenating and adding special tokens.
        A BERT sequence has the following format:
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.
        Returns:
            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of ids.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Set to True if the token list is already formatted with special tokens for the model
        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """

        if already_has_special_tokens:
            if token_ids_1 is not None:
                raise ValueError(
                    "You should not supply a second sequence if the provided sequence of "
                    "ids is already formated with special tokens for the model."
                )
            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

        # Mask mirrors the layout produced by build_inputs_with_special_tokens.
        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
        A BERT sequence pair mask has the following format:
        ::
            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence | second sequence |
        if token_ids_1 is None, only returns the first portion of the mask (0's).
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of ids.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.
        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, vocab_path):
        """
        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
        Args:
            vocab_path (:obj:`str`):
                The directory in which to save the vocabulary.
        Returns:
            :obj:`Tuple(str)`: Paths to the files saved.
        """
        index = 0
        # Accept either a directory (canonical file name) or an explicit file path.
        if os.path.isdir(vocab_path):
            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES["vocab_file"])
        else:
            vocab_file = vocab_path
        with open(vocab_file, "w", encoding="utf-8") as writer:
            # One token per line, ordered by id; warn if ids are not consecutive,
            # since line number then no longer equals index on reload.
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        "Saving vocabulary to {}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!".format(vocab_file)
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)
src/tokenizer/new_splits.txt ADDED
@@ -0,0 +1,159 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ c 1
2
+ c 2
3
+ c 3
4
+ c 4
5
+ c 5
6
+ c 6
7
+ c 7
8
+ c 8
9
+ c 9
10
+ ( c1
11
+ ( c2
12
+ c1 )
13
+ c2 )
14
+ n 1
15
+ n 2
16
+ n 3
17
+ n 4
18
+ n 5
19
+ n 6
20
+ n 7
21
+ n 8
22
+ n 9
23
+ ( n1
24
+ ( n2
25
+ n1 )
26
+ n2 )
27
+ O 1
28
+ O 2
29
+ O 3
30
+ O 4
31
+ O 5
32
+ O 6
33
+ O 7
34
+ O 8
35
+ O 9
36
+ ( O1
37
+ ( O2
38
+ O2 )
39
+ O2 )
40
+ = O
41
+ = C
42
+ = c
43
+ = N
44
+ = n
45
+ =C C
46
+ =C N
47
+ =C c
48
+ =c c
49
+ =N C
50
+ =N c
51
+ =n C
52
+ =n c
53
+ # N
54
+ # C
55
+ #N C
56
+ #C C
57
+ #C N
58
+ #N N
59
+ ( C
60
+ C )
61
+ ( O
62
+ O )
63
+ ( N
64
+ N )
65
+ Br c
66
+ ( =O
67
+ (=O )
68
+ C (=O)
69
+ C =O
70
+ C =N
71
+ C #N
72
+ C #C
73
+ C C
74
+ CC C
75
+ CC N
76
+ CC O
77
+ CC S
78
+ CC c
79
+ CC n
80
+ C N
81
+ CN C
82
+ CN c
83
+ C O
84
+ CO C
85
+ CO N
86
+ CO c
87
+ C S
88
+ CS C
89
+ CS S
90
+ CS c
91
+ C c
92
+ Cl c
93
+ C n
94
+ F c
95
+ N C
96
+ NC C
97
+ NC c
98
+ N N
99
+ N O
100
+ N c
101
+ N n
102
+ O C
103
+ OC C
104
+ OC O
105
+ OC c
106
+ O N
107
+ O O
108
+ O c
109
+ S C
110
+ SC C
111
+ SC c
112
+ S S
113
+ S c
114
+ c c
115
+ cc c
116
+ cc n
117
+ cc o
118
+ cc s
119
+ cc cc
120
+ c n
121
+ cn c
122
+ cn n
123
+ c o
124
+ co c
125
+ c s
126
+ cs c
127
+ cs n
128
+ n c
129
+ nc c
130
+ nc n
131
+ nc o
132
+ nc s
133
+ n n
134
+ nn c
135
+ nn n
136
+ n o
137
+ no c
138
+ no n
139
+ n s
140
+ ns c
141
+ ns n
142
+ o c
143
+ oc c
144
+ o n
145
+ s c
146
+ sc c
147
+ sc n
148
+ s n
149
+ N P
150
+ P N
151
+ C P
152
+ P C
153
+ N S
154
+ S N
155
+ C S
156
+ S C
157
+ S P
158
+ P S
159
+ C I
src/tokenizer/new_vocab.txt ADDED
@@ -0,0 +1,587 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [PAD]
2
+ [UNK]
3
+ [CLS]
4
+ [SEP]
5
+ [MASK]
6
+ #
7
+ %
8
+ (
9
+ )
10
+ +
11
+ -
12
+ /
13
+ 0
14
+ 1
15
+ 2
16
+ 3
17
+ 4
18
+ 5
19
+ 6
20
+ 7
21
+ 8
22
+ 9
23
+ =
24
+ @
25
+ A
26
+ B
27
+ Br
28
+ Brc
29
+ C
30
+ CC
31
+ CCC
32
+ CCN
33
+ CCO
34
+ CCS
35
+ CCc
36
+ CCn
37
+ CN
38
+ CNC
39
+ CNc
40
+ CO
41
+ COC
42
+ CON
43
+ COc
44
+ CS
45
+ CSC
46
+ CSS
47
+ CSc
48
+ Cc
49
+ Cl
50
+ Clc
51
+ Cn
52
+ F
53
+ Fc
54
+ H
55
+ I
56
+ K
57
+ L
58
+ M
59
+ N
60
+ NC
61
+ NCC
62
+ NCc
63
+ NN
64
+ NO
65
+ Nc
66
+ Nn
67
+ O
68
+ OC
69
+ OCC
70
+ OCO
71
+ OCc
72
+ ON
73
+ OO
74
+ Oc
75
+ P
76
+ R
77
+ S
78
+ SC
79
+ SCC
80
+ SCc
81
+ SS
82
+ Sc
83
+ T
84
+ X
85
+ Z
86
+ [
87
+ \\
88
+ (/
89
+ ]
90
+ a
91
+ b
92
+ c
93
+ cc
94
+ ccc
95
+ cccc
96
+ ccn
97
+ cco
98
+ ccs
99
+ cn
100
+ cnc
101
+ cnn
102
+ co
103
+ coc
104
+ cs
105
+ csc
106
+ csn
107
+ e
108
+ g
109
+ i
110
+ l
111
+ n
112
+ nc
113
+ ncc
114
+ ncn
115
+ nco
116
+ ncs
117
+ nn
118
+ nnc
119
+ nnn
120
+ no
121
+ noc
122
+ non
123
+ ns
124
+ nsc
125
+ nsn
126
+ o
127
+ oc
128
+ occ
129
+ on
130
+ p
131
+ r
132
+ s
133
+ sc
134
+ scc
135
+ scn
136
+ sn
137
+ t
138
+ c1
139
+ c2
140
+ c3
141
+ c4
142
+ c5
143
+ c6
144
+ c7
145
+ c8
146
+ c9
147
+ n1
148
+ n2
149
+ n3
150
+ n4
151
+ n5
152
+ n6
153
+ n7
154
+ n8
155
+ n9
156
+ O1
157
+ O2
158
+ O3
159
+ O4
160
+ O5
161
+ O6
162
+ O7
163
+ O8
164
+ O9
165
+ (c1
166
+ (c2
167
+ c1)
168
+ c2)
169
+ (n1
170
+ (n2
171
+ n1)
172
+ n2)
173
+ (O1
174
+ (O2
175
+ O2)
176
+ =O
177
+ =C
178
+ =c
179
+ =N
180
+ =n
181
+ =CC
182
+ =CN
183
+ =Cc
184
+ =cc
185
+ =NC
186
+ =Nc
187
+ =nC
188
+ =nc
189
+ #C
190
+ #CC
191
+ #CN
192
+ #N
193
+ #NC
194
+ #NN
195
+ (C
196
+ C)
197
+ (O
198
+ O)
199
+ (N
200
+ N)
201
+ NP
202
+ PN
203
+ CP
204
+ PC
205
+ NS
206
+ SN
207
+ SP
208
+ PS
209
+ C(=O)
210
+ (/Br)
211
+ (/C#N)
212
+ (/C)
213
+ (/C=N)
214
+ (/C=O)
215
+ (/CBr)
216
+ (/CC)
217
+ (/CCC)
218
+ (/CCF)
219
+ (/CCN)
220
+ (/CCO)
221
+ (/CCl)
222
+ (/CI)
223
+ (/CN)
224
+ (/CO)
225
+ (/CS)
226
+ (/Cl)
227
+ (/F)
228
+ (/I)
229
+ (/N)
230
+ (/NC)
231
+ (/NCC)
232
+ (/NO)
233
+ (/O)
234
+ (/OC)
235
+ (/OCC)
236
+ (/S)
237
+ (/SC)
238
+ (=C)
239
+ (=C/C)
240
+ (=C/F)
241
+ (=C/I)
242
+ (=C/N)
243
+ (=C/O)
244
+ (=CBr)
245
+ (=CC)
246
+ (=CCF)
247
+ (=CCN)
248
+ (=CCO)
249
+ (=CCl)
250
+ (=CF)
251
+ (=CI)
252
+ (=CN)
253
+ (=CO)
254
+ (=C\\C)
255
+ (=C\\F)
256
+ (=C\\I)
257
+ (=C\\N)
258
+ (=C\\O)
259
+ (=N)
260
+ (=N/C)
261
+ (=N/N)
262
+ (=N/O)
263
+ (=NBr)
264
+ (=NC)
265
+ (=NCC)
266
+ (=NCl)
267
+ (=NN)
268
+ (=NO)
269
+ (=NOC)
270
+ (=N\\C)
271
+ (=N\\N)
272
+ (=N\\O)
273
+ (=O)
274
+ (=S)
275
+ (B)
276
+ (Br)
277
+ (C#C)
278
+ (C#CC)
279
+ (C#CI)
280
+ (C#CO)
281
+ (C#N)
282
+ (C#SN)
283
+ (C)
284
+ (C=C)
285
+ (C=CF)
286
+ (C=CI)
287
+ (C=N)
288
+ (C=NN)
289
+ (C=NO)
290
+ (C=O)
291
+ (C=S)
292
+ (CBr)
293
+ (CC#C)
294
+ (CC#N)
295
+ (CC)
296
+ (CC=C)
297
+ (CC=O)
298
+ (CCBr)
299
+ (CCC)
300
+ (CCCC)
301
+ (CCCF)
302
+ (CCCI)
303
+ (CCCN)
304
+ (CCCO)
305
+ (CCCS)
306
+ (CCCl)
307
+ (CCF)
308
+ (CCI)
309
+ (CCN)
310
+ (CCNC)
311
+ (CCNN)
312
+ (CCNO)
313
+ (CCO)
314
+ (CCOC)
315
+ (CCON)
316
+ (CCS)
317
+ (CCSC)
318
+ (CCl)
319
+ (CF)
320
+ (CI)
321
+ (CN)
322
+ (CN=O)
323
+ (CNC)
324
+ (CNCC)
325
+ (CNCO)
326
+ (CNN)
327
+ (CNNC)
328
+ (CNO)
329
+ (CNOC)
330
+ (CO)
331
+ (COC)
332
+ (COCC)
333
+ (COCI)
334
+ (COCN)
335
+ (COCO)
336
+ (COF)
337
+ (CON)
338
+ (COO)
339
+ (CS)
340
+ (CSC)
341
+ (CSCC)
342
+ (CSCF)
343
+ (CSO)
344
+ (Cl)
345
+ (F)
346
+ (I)
347
+ (N)
348
+ (N=N)
349
+ (N=NO)
350
+ (N=O)
351
+ (N=S)
352
+ (NBr)
353
+ (NC#N)
354
+ (NC)
355
+ (NC=N)
356
+ (NC=O)
357
+ (NC=S)
358
+ (NCBr)
359
+ (NCC)
360
+ (NCCC)
361
+ (NCCF)
362
+ (NCCN)
363
+ (NCCO)
364
+ (NCCS)
365
+ (NCCl)
366
+ (NCNC)
367
+ (NCO)
368
+ (NCS)
369
+ (NCl)
370
+ (NN)
371
+ (NN=O)
372
+ (NNC)
373
+ (NO)
374
+ (NOC)
375
+ (O)
376
+ (OC#N)
377
+ (OC)
378
+ (OC=C)
379
+ (OC=O)
380
+ (OC=S)
381
+ (OCBr)
382
+ (OCC)
383
+ (OCCC)
384
+ (OCCF)
385
+ (OCCI)
386
+ (OCCN)
387
+ (OCCO)
388
+ (OCCS)
389
+ (OCCl)
390
+ (OCF)
391
+ (OCI)
392
+ (OCO)
393
+ (OCOC)
394
+ (OCON)
395
+ (OCSC)
396
+ (OCl)
397
+ (OI)
398
+ (ON)
399
+ (OO)
400
+ (OOC)
401
+ (OOCC)
402
+ (OOSN)
403
+ (OSC)
404
+ (P)
405
+ (S)
406
+ (SC#N)
407
+ (SC)
408
+ (SCC)
409
+ (SCCC)
410
+ (SCCF)
411
+ (SCCN)
412
+ (SCCO)
413
+ (SCCS)
414
+ (SCCl)
415
+ (SCF)
416
+ (SCN)
417
+ (SCOC)
418
+ (SCSC)
419
+ (SCl)
420
+ (SI)
421
+ (SN)
422
+ (SN=O)
423
+ (SO)
424
+ (SOC)
425
+ (SOOO)
426
+ (SS)
427
+ (SSC)
428
+ (SSCC)
429
+ ([At])
430
+ ([O-])
431
+ ([O])
432
+ ([S-])
433
+ (\\Br)
434
+ (\\C#N)
435
+ (\\C)
436
+ (\\C=N)
437
+ (\\C=O)
438
+ (\\CBr)
439
+ (\\CC)
440
+ (\\CCC)
441
+ (\\CCO)
442
+ (\\CCl)
443
+ (\\CF)
444
+ (\\CN)
445
+ (\\CNC)
446
+ (\\CO)
447
+ (\\COC)
448
+ (\\Cl)
449
+ (\\F)
450
+ (\\I)
451
+ (\\N)
452
+ (\\NC)
453
+ (\\NCC)
454
+ (\\NN)
455
+ (\\NO)
456
+ (\\NOC)
457
+ (\\O)
458
+ (\\OC)
459
+ (\\OCC)
460
+ (\\ON)
461
+ (\\S)
462
+ (\\SC)
463
+ (\\SCC)
464
+ [Ag+]
465
+ [Ag-4]
466
+ [Ag]
467
+ [Al-3]
468
+ [Al]
469
+ [As+]
470
+ [AsH3]
471
+ [AsH]
472
+ [As]
473
+ [At]
474
+ [B-]
475
+ [B@-]
476
+ [B@@-]
477
+ [BH-]
478
+ [BH2-]
479
+ [BH3-]
480
+ [B]
481
+ [Ba]
482
+ [Br+2]
483
+ [BrH]
484
+ [Br]
485
+ [C+]
486
+ [C-]
487
+ [C@@H]
488
+ [C@@]
489
+ [C@H]
490
+ [C@]
491
+ [CH-]
492
+ [CH2]
493
+ [CH3]
494
+ [CH]
495
+ [C]
496
+ [CaH2]
497
+ [Ca]
498
+ [Cl+2]
499
+ [Cl+3]
500
+ [Cl+]
501
+ [Cs]
502
+ [FH]
503
+ [F]
504
+ [H]
505
+ [He]
506
+ [I+2]
507
+ [I+3]
508
+ [I+]
509
+ [IH]
510
+ [I]
511
+ [K]
512
+ [Kr]
513
+ [Li+]
514
+ [LiH]
515
+ [MgH2]
516
+ [Mg]
517
+ [N+]
518
+ [N-]
519
+ [N@+]
520
+ [N@@+]
521
+ [N@@]
522
+ [N@]
523
+ [NH+]
524
+ [NH-]
525
+ [NH2+]
526
+ [NH3]
527
+ [NH]
528
+ [N]
529
+ [Na]
530
+ [O+]
531
+ [O-]
532
+ [OH+]
533
+ [OH2]
534
+ [OH]
535
+ [O]
536
+ [P+]
537
+ [P@+]
538
+ [P@@+]
539
+ [P@@]
540
+ [P@]
541
+ [PH2]
542
+ [PH]
543
+ [P]
544
+ [Ra]
545
+ [Rb]
546
+ [S+]
547
+ [S-]
548
+ [S@+]
549
+ [S@@+]
550
+ [S@@]
551
+ [S@]
552
+ [SH+]
553
+ [SH2]
554
+ [SH]
555
+ [S]
556
+ [Se+]
557
+ [Se-2]
558
+ [SeH2]
559
+ [SeH]
560
+ [Se]
561
+ [Si@]
562
+ [SiH2]
563
+ [SiH]
564
+ [Si]
565
+ [SrH2]
566
+ [TeH]
567
+ [Te]
568
+ [Xe]
569
+ [Zn+2]
570
+ [Zn-2]
571
+ [Zn]
572
+ [b-]
573
+ [c+]
574
+ [c-]
575
+ [cH-]
576
+ [cH]
577
+ [c]
578
+ [n+]
579
+ [n-]
580
+ [nH]
581
+ [n]
582
+ [o+]
583
+ [s+]
584
+ [se+]
585
+ [se]
586
+ [te+]
587
+ [te]
src/train.py ADDED
@@ -0,0 +1,133 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # direct reward backpropagation
2
+ from diffusion import Diffusion
3
+ from hydra import initialize, compose
4
+ from hydra.core.global_hydra import GlobalHydra
5
+ import numpy as np
6
+ from scipy.stats import pearsonr
7
+ import torch
8
+ import torch.nn.functional as F
9
+ import argparse
10
+ import wandb
11
+ import os
12
+ import datetime
13
+ from finetune_peptides import finetune
14
+ from peptide_mcts import MCTS
15
+ from utils.utils import str2bool, set_seed
16
+ from scoring.scoring_functions import ScoringFunctions
17
+
18
+
19
# Command-line interface for PepTune fine-tuning.  Defaults are shown in
# --help via ArgumentDefaultsHelpFormatter.
argparser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# --- optimization / training loop ---
argparser.add_argument('--base_path', type=str, default='')
argparser.add_argument('--learning_rate', type=float, default=1e-4)
argparser.add_argument('--num_epochs', type=int, default=100)
argparser.add_argument('--num_accum_steps', type=int, default=4)
argparser.add_argument('--truncate_steps', type=int, default=50)
argparser.add_argument("--truncate_kl", type=str2bool, default=False)
argparser.add_argument('--gumbel_temp', type=float, default=1.0)
argparser.add_argument('--gradnorm_clip', type=float, default=1.0)
argparser.add_argument('--batch_size', type=int, default=32)
argparser.add_argument('--name', type=str, default='debug')
argparser.add_argument('--total_num_steps', type=int, default=128)
argparser.add_argument('--copy_flag_temp', type=float, default=None)
argparser.add_argument('--save_every_n_epochs', type=int, default=10)
argparser.add_argument('--alpha_schedule_warmup', type=int, default=0)
argparser.add_argument("--seed", type=int, default=0)
# new: run bookkeeping (names, device, output directory)
argparser.add_argument('--run_name', type=str, default='peptides')
argparser.add_argument("--device", default="cuda:0", type=str)
argparser.add_argument("--save_path_dir", default="/path/to/your/home/PepTune/checkpoints/", type=str)
# mcts: Monte-Carlo tree search hyperparameters
argparser.add_argument('--num_sequences', type=int, default=10)
argparser.add_argument('--num_children', type=int, default=50)
argparser.add_argument('--num_iter', type=int, default=30) # iterations of mcts
argparser.add_argument('--seq_length', type=int, default=200)
argparser.add_argument('--time_conditioning', action='store_true', default=False)
argparser.add_argument('--mcts_sampling', type=int, default=0) # for batched categorical sampling: '0' means gumbel noise
argparser.add_argument('--buffer_size', type=int, default=100)
argparser.add_argument('--wdce_num_replicates', type=int, default=16)
argparser.add_argument('--noise_removal', action='store_true', default=False)
argparser.add_argument('--grad_clip', action='store_true', default=False)
argparser.add_argument('--resample_every_n_step', type=int, default=10)
argparser.add_argument('--exploration', type=float, default=0.1)
argparser.add_argument('--reset_every_n_step', type=int, default=100)
argparser.add_argument('--alpha', type=float, default=0.01)
argparser.add_argument('--scalarization', type=str, default='sum')
# --no_mcts switches to direct-reward fine-tuning without tree search.
argparser.add_argument('--no_mcts', action='store_true', default=False)
argparser.add_argument("--centering", action='store_true', default=False)

# objectives: number of reward objectives and optional custom target protein
argparser.add_argument('--num_obj', type=int, default=5)
argparser.add_argument('--prot_seq', type=str, default=None)
argparser.add_argument('--prot_name', type=str, default=None)

args = argparser.parse_args()
print(args)
65
+
66
# pretrained model path (checkpoint shipped under <base_path>/checkpoints)
ckpt_path = f'{args.base_path}/checkpoints/peptune-pretrained.ckpt'

# reinitialize Hydra — clear any global state left by a previous initialize()
GlobalHydra.instance().clear()

# Initialize Hydra and compose the configuration
initialize(config_path="configs", job_name="load_model")
cfg = compose(config_name="peptune_config.yaml")
# Timestamp used to make MCTS run names unique.
curr_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
76
+
77
# proteins
# Candidate target protein sequences for binding-affinity scoring.
# NOTE(review): identities below are inferred from the variable names only
# (AMHR2, transferrin receptor, GFAP, GLP-1R, GLAST, NCAM, cereblon, an E3
# ligase, SKP2) — confirm against the original data source.
amhr = 'MLGSLGLWALLPTAVEAPPNRRTCVFFEAPGVRGSTKTLGELLDTGTELPRAIRCLYSRCCFGIWNLTQDRAQVEMQGCRDSDEPGCESLHCDPSPRAHPSPGSTLFTCSCGTDFCNANYSHLPPPGSPGTPGSQGPQAAPGESIWMALVLLGLFLLLLLLLGSIILALLQRKNYRVRGEPVPEPRPDSGRDWSVELQELPELCFSQVIREGGHAVVWAGQLQGKLVAIKAFPPRSVAQFQAERALYELPGLQHDHIVRFITASRGGPGRLLSGPLLVLELHPKGSLCHYLTQYTSDWGSSLRMALSLAQGLAFLHEERWQNGQYKPGIAHRDLSSQNVLIREDGSCAIGDLGLALVLPGLTQPPAWTPTQPQGPAAIMEAGTQRYMAPELLDKTLDLQDWGMALRRADIYSLALLLWEILSRCPDLRPDSSPPPFQLAYEAELGNTPTSDELWALAVQERRRPYIPSTWRCFATDPDGLRELLEDCWDADPEARLTAECVQQRLAALAHPQESHPFPESCPRGCPPLCPEDCTSIPAPTILPCRPQRSACHFSVQQGPCSRNPQPACTLSPV'
tfr = 'MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNTKANVTKPKRCSGSICYGTIAVIVFFLIGFMIGYLGYCKGVEPKTECERLAGTESPVREEPGEDFPAARRLYWDDLKRKLSEKLDSTDFTGTIKLLNENSYVPREAGSQKDENLALYVENQFREFKLSKVWRDQHFVKIQVKDSAQNSVIIVDKNGRLVYLVENPGGYVAYSKAATVTGKLVHANFGTKKDFEDLYTPVNGSIVIVRAGKITFAEKVANAESLNAIGVLIYMDQTKFPIVNAELSFFGHAHLGTGDPYTPGFPSFNHTQFPPSRSSGLPNIPVQTISRAAAEKLFGNMEGDCPSDWKTDSTCRMVTSESKNVKLTVSNVLKEIKILNIFGVIKGFVEPDHYVVVGAQRDAWGPGAAKSGVGTALLLKLAQMFSDMVLKDGFQPSRSIIFASWSAGDFGSVGATEWLEGYLSSLHLKAFTYINLDKAVLGTSNFKVSASPLLYTLIEKTMQNVKHPVTGQFLYQDSNWASKVEKLTLDNAAFPFLAYSGIPAVSFCFCEDTDYPYLGTTMDTYKELIERIPELNKVARAAAEVAGQFVIKLTHDVELNLDYERYNSQLLSFVRDLNQYRADIKEMGLSLQWLYSARGDFFRATSRLTTDFGNAEKTDRFVMKKLNDRVMRVEYHFLSPYVSPKESPFRHVFWGSGSHTLPALLENLKLRKQNNGAFNETLFRNQLALATWTIQGAANALSGDVWDIDNEF'
gfap = 'MERRRITSAARRSYVSSGEMMVGGLAPGRRLGPGTRLSLARMPPPLPTRVDFSLAGALNAGFKETRASERAEMMELNDRFASYIEKVRFLEQQNKALAAELNQLRAKEPTKLADVYQAELRELRLRLDQLTANSARLEVERDNLAQDLATVRQKLQDETNLRLEAENNLAAYRQEADEATLARLDLERKIESLEEEIRFLRKIHEEEVRELQEQLARQQVHVELDVAKPDLTAALKEIRTQYEAMASSNMHEAEEWYRSKFADLTDAAARNAELLRQAKHEANDYRRQLQSLTCDLESLRGTNESLERQMREQEERHVREAASYQEALARLEEEGQSLKDEMARHLQEYQDLLNVKLALDIEIATYRKLLEGEENRITIPVQTFSNLQIRETSLDTKSVSEGHLKRNIVVKTVEMRDGEVIKESKQEHKDVM'
glp1 = 'MAGAPGPLRLALLLLGMVGRAGPRPQGATVSLWETVQKWREYRRQCQRSLTEDPPPATDLFCNRTFDEYACWPDGEPGSFVNVSCPWYLPWASSVPQGHVYRFCTAEGLWLQKDNSSLPWRDLSECEESKRGERSSPEEQLLFLYIIYTVGYALSFSALVIASAILLGFRHLHCTRNYIHLNLFASFILRALSVFIKDAALKWMYSTAAQQHQWDGLLSYQDSLSCRLVFLLMQYCVAANYYWLLVEGVYLYTLLAFSVLSEQWIFRLYVSIGWGVPLLFVVPWGIVKYLYEDEGCWTRNSNMNYWLIIRLPILFAIGVNFLIFVRVICIVVSKLKANLMCKTDIKCRLAKSTLTLIPLLGTHEVIFAFVMDEHARGTLRFIKLFTELSFTSFQGLMVAILYCFVNNEVQLEFRKSWERWRLEHLHIQRDSSMKPLKCPTSSLSSGATAGSSMYTATCQASCS'
glast = 'MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLLTVTAVIVGTILGFTLRPYRMSYREVKYFSFPGELLMRMLQMLVLPLIISSLVTGMAALDSKASGKMGMRAVVYYMTTTIIAVVIGIIIVIIIHPGKGTKENMHREGKIVRVTAADAFLDLIRNMFPPNLVEACFKQFKTNYEKRSFKVPIQANETLVGAVINNVSEAMETLTRITEELVPVPGSVNGVNALGLVVFSMCFGFVIGNMKEQGQALREFFDSLNEAIMRLVAVIMWYAPVGILFLIAGKIVEMEDMGVIGGQLAMYTVTVIVGLLIHAVIVLPLLYFLVTRKNPWVFIGGLLQALITALGTSSSSATLPITFKCLEENNGVDKRVTRFVLPVGATINMDGTALYEALAAIFIAQVNNFELNFGQIITISITATAASIGAAGIPQAGLVTMVIVLTSVGLPTDDITLIIAVDWFLDRLRTTTNVLGDSLGAGIVEHLSRHELKNRDVEMGNSVIEENEMKKPYQLIAQDNETEKPIDSETKM'
ncam = 'LQTKDLIWTLFFLGTAVSLQVDIVPSQGEISVGESKFFLCQVAGDAKDKDISWFSPNGEKLTPNQQRISVVWNDDSSSTLTIYNANIDDAGIYKCVVTGEDGSESEATVNVKIFQKLMFKNAPTPQEFREGEDAVIVCDVVSSLPPTIIWKHKGRDVILKKDVRFIVLSNNYLQIRGIKKTDEGTYRCEGRILARGEINFKDIQVIVNVPPTIQARQNIVNATANLGQSVTLVCDAEGFPEPTMSWTKDGEQIEQEEDDEKYIFSDDSSQLTIKKVDKNDEAEYICIAENKAGEQDATIHLKVFAKPKITYVENQTAMELEEQVTLTCEASGDPIPSITWRTSTRNISSEEKASWTRPEKQETLDGHMVVRSHARVSSLTLKSIQYTDAGEYICTASNTIGQDSQSMYLEVQYAPKLQGPVAVYTWEGNQVNITCEVFAYPSATISWFRDGQLLPSSNYSNIKIYNTPSASYLEVTPDSENDFGNYNCTAVNRIGQESLEFILVQADTPSSPSIDQVEPYSSTAQVQFDEPEATGGVPILKYKAEWRAVGEEVWHSKWYDAKEASMEGIVTIVGLKPETTYAVRLAALNGKGLGEISAASEF'
cereblon = 'MAGEGDQQDAAHNMGNHLPLLPAESEEEDEMEVEDQDSKEAKKPNIINFDTSLPTSHTYLGADMEEFHGRTLHDDDSCQVIPVLPQVMMILIPGQTLPLQLFHPQEVSMVRNLIQKDRTFAVLAYSNVQEREAQFGTTAEIYAYREEQDFGIEIVKVKAIGRQRFKVLELRTQSDGIQQAKVQILPECVLPSTMSAVQLESLNKCQIFPSKPVSREDQCSYKWWQKYQKRKFHCANLTSWPRWLYSLYDAETLMDRIKKQLREWDENLKDDSLPSNPIDFSYRVAACLPIDDVLRIQLLKIGSAIQRLRCELDIMNKCTSLCCKQCQETEITTKNEIFSLSLCGPMAAYVNPHGYVHETLTVYKACNLNLIGRPSTEHSWFPGYAWTVAQCKICASHIGWKFTATKKDMSPQKFWGLTRSALLPTIPDTEDEISPDKVILCL'
ligase = 'MASQPPEDTAESQASDELECKICYNRYNLKQRKPKVLECCHRVCAKCLYKIIDFGDSPQGVIVCPFCRFETCLPDDEVSSLPDDNNILVNLTCGGKGKKCLPENPTELLLTPKRLASLVSPSHTSSNCLVITIMEVQRESSPSLSSTPVVEFYRPASFDSVTTVSHNWTVWNCTSLLFQTSIRVLVWLLGLLYFSSLPLGIYLLVSKKVTLGVVFVSLVPSSLVILMVYGFCQCVCHEFLDCMAPPS'
skp2 = 'MHRKHLQEIPDLSSNVATSFTWGWDSSKTSELLSGMGVSALEKEEPDSENIPQELLSNLGHPESPPRKRLKSKGSDKDFVIVRRPKLNRENFPGVSWDSLPDELLLGIFSCLCLPELLKVSGVCKRWYRLASDESLWQTLDLTGKNLHPDVTGRLLSQGVIAFRCPRSFMDQPLAEHFSPFRVQHMDLSNSVIEVSTLHGILSQCSKLQNLSLEGLRLSDPIVNTLAKNSNLVRLNLSGCSGFSEFALQTLLSSCSRLDELNLSWCFDFTEKHVQVAVAHVSETITQLNLSGYRKNLQKSDLSTLVRRCPNLVHLDLSDSVMLKNDCFQEFFQLNYLQHLSLSRCYDIIPETLLELGEIPTLKTLQVFGIVPDGTLQLLKEALPHLQINCSHFTTIARPTIGNKKNQEIWGIKCRLTLQKPSCL'

# Use the user-supplied target if given; otherwise default to the
# transferrin receptor (tfr).
if args.prot_seq is not None:
    prot = args.prot_seq
    prot_name = args.prot_name
    filename = args.prot_name
else:
    prot = tfr
    prot_name = "tfr"
    filename = "tfr"
96
+
97
# Encode the key hyperparameters into the run name so checkpoints/logs are
# self-describing; MCTS runs additionally get a timestamp for uniqueness.
if args.no_mcts:
    args.run_name = f'{prot_name}_resample{args.resample_every_n_step}_no-mcts'
else:
    args.run_name = f'{prot_name}_resample{args.resample_every_n_step}_buffer{args.buffer_size}_numiter{args.num_iter}_children{args.num_children}_{curr_time}'

args.save_path = os.path.join(args.save_path_dir, args.run_name)
os.makedirs(args.save_path, exist_ok=True)
# wandb init
wandb.init(project='tree-multi', name=args.run_name, config=args, dir=args.save_path)

log_path = os.path.join(args.save_path, 'log.txt')

# Seed python/numpy/torch (including CUDA) for reproducibility.
set_seed(args.seed, use_cuda=True)

# Initialize the model: two copies of the same pretrained checkpoint — one
# trainable policy and one frozen reference ("eval" mode) for regularization.
policy_model = Diffusion.load_from_checkpoint(ckpt_path,
                                              config=cfg,
                                              mode="train",
                                              device=args.device,
                                              map_location=args.device)
pretrained = Diffusion.load_from_checkpoint(ckpt_path,
                                            config=cfg,
                                            mode="eval",
                                            device=args.device,
                                            map_location=args.device)
122
+
123
# define mcts
# Objectives optimized during fine-tuning: target binding plus four
# developability properties.
score_func_names = ['binding_affinity1', 'solubility', 'hemolysis', 'nonfouling', 'permeability']

if args.no_mcts:
    # Direct-reward fine-tuning: score sequences with the raw scoring
    # functions, no tree search.
    reward_model = ScoringFunctions(score_func_names, prot_seqs=[prot], device=args.device)
    finetune(args, cfg, policy_model, reward_model=reward_model, mcts=None, pretrained=pretrained, filename=filename, prot_name=prot_name)
else:
    # MCTS-guided fine-tuning; the search tree owns the reward functions.
    # Fix: MCTS was previously also constructed unconditionally before this
    # branch and then discarded (rebuilt here, unused in the no-mcts path) —
    # that redundant, expensive construction is removed.
    mcts = MCTS(args, cfg, policy_model, pretrained, score_func_names, prot_seqs=[prot])
    finetune(args, cfg, policy_model, reward_model=mcts.rewardFunc, mcts=mcts, pretrained=None, filename=filename, prot_name=prot_name)
src/train_peptune.py ADDED
@@ -0,0 +1,226 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import uuid
3
+ import sys
4
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
5
+
6
+ import wandb
7
+ import fsspec
8
+ import hydra
9
+ import lightning as L
10
+ from lightning.pytorch import Trainer
11
+ from lightning.pytorch.callbacks import ModelCheckpoint, GradientAccumulationScheduler
12
+ import omegaconf
13
+ import rich.syntax
14
+ import rich.tree
15
+ import torch
16
+ import sys
17
+ import torch.distributed as dist
18
+ from torch.nn.parallel import DistributedDataParallel as DDP
19
+ import dataloading_for_dynamic_batching as dynamic_dataloader
20
+ from diffusion import Diffusion
21
+ import utils.utils as utils
22
+
23
+ from lightning.pytorch.strategies import DDPStrategy
24
+ from datasets import load_dataset
25
+ from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer
26
+
27
+ omegaconf.OmegaConf.register_new_resolver('cwd', os.getcwd)
28
+ omegaconf.OmegaConf.register_new_resolver('device_count', torch.cuda.device_count)
29
+ omegaconf.OmegaConf.register_new_resolver('eval', eval)
30
+ omegaconf.OmegaConf.register_new_resolver('div_up', lambda x, y: (x + y - 1) // y)
31
+
32
def _load_from_checkpoint(config, tokenizer):
    """Build a Diffusion model, fresh or restored from a checkpoint.

    Args:
        config: composed Hydra config; ``config.backbone`` selects the path
            and ``config.eval.checkpoint_path`` locates the checkpoint.
        tokenizer: tokenizer instance handed to the model.

    Returns:
        A ``Diffusion`` model (moved to CUDA for the HuggingFace backbone).
    """
    # HuggingFace backbones are constructed fresh rather than restored.
    if 'hf' in config.backbone:
        return Diffusion(config, tokenizer=tokenizer).to('cuda')
    return Diffusion.load_from_checkpoint(
        config.eval.checkpoint_path,
        tokenizer=tokenizer,
        config=config)
43
+
44
@L.pytorch.utilities.rank_zero_only
def print_config(
    config: omegaconf.DictConfig,
    resolve: bool = True,
    save_cfg: bool = True) -> None:
    """
    Prints content of DictConfig using Rich library and its tree structure.

    Args:
        config (DictConfig): Configuration composed by Hydra.
        resolve (bool): Whether to resolve reference fields of DictConfig.
        save_cfg (bool): Whether to save the configuration tree to a file.
    """
    style = 'dim'
    tree = rich.tree.Tree('CONFIG', style=style, guide_style=style)

    for key in config.keys():
        branch = tree.add(key, style=style, guide_style=style)
        section = config.get(key)
        # Nested sections are rendered as YAML; scalars via plain str().
        if isinstance(section, omegaconf.DictConfig):
            rendered = omegaconf.OmegaConf.to_yaml(section, resolve=resolve)
        else:
            rendered = str(section)
        branch.add(rich.syntax.Syntax(rendered, 'yaml'))

    rich.print(tree)

    if save_cfg:
        # Persist the rendered tree next to the checkpoints for provenance.
        out_path = '{}/config_tree.txt'.format(config.checkpointing.save_dir)
        with fsspec.open(out_path, 'w') as fp:
            rich.print(tree, file=fp)
78
+
79
+
80
@L.pytorch.utilities.rank_zero_only
def print_batch(train_ds, valid_ds, tokenizer, k=64):
    """Log one training batch: tensor shape plus first/last *k* decoded tokens.

    ``valid_ds`` is accepted for signature compatibility but is currently not
    inspected (the valid-loader branch was disabled upstream).
    """
    loaders = [('train', train_ds)]
    for dl_type, loader in loaders:
        print(f'Printing {dl_type} dataloader batch.')
        batch = next(iter(loader))
        ids = batch['input_ids']
        print('Batch input_ids.shape', ids.shape)
        head = ids[0, :k]
        tail = ids[0, -k:]
        print(f'First {k} tokens:', tokenizer.decode(head))
        print('ids:', head)
        print(f'Last {k} tokens:', tokenizer.decode(tail))
        print('ids:', tail)
96
+
97
+
98
def generate_samples(config, logger, tokenizer):
    """Sample peptide sequences from a checkpointed model and report perplexity.

    Args:
        config: Hydra config; reads ``sampling.num_sample_batches`` and
            ``sampling.steps``.
        logger: logger used for status messages.
        tokenizer: tokenizer forwarded to model loading.

    Returns:
        Decoded peptide sequences from the final sampled batch, or an empty
        list when ``num_sample_batches`` is 0.
    """
    logger.info('Generating samples.')
    model = _load_from_checkpoint(config=config, tokenizer=tokenizer)

    # Bug fix: initialize so the return below cannot raise UnboundLocalError
    # when sampling.num_sample_batches == 0.
    peptide_sequences = []
    for _ in range(config.sampling.num_sample_batches):
        samples = model.restore_model_and_sample(num_steps=config.sampling.steps)
        peptide_sequences = model.tokenizer.batch_decode(samples)
        # Accumulates per-batch perplexity state on the model.
        model.compute_generative_perplexity(peptide_sequences)

    print('Peptide samples:', peptide_sequences)
    print('Generative perplexity:', model.compute_masked_perplexity())

    return peptide_sequences
116
+
117
+
118
def ppl_eval(config, logger, tokenizer, data_module):
    """Run zero-shot (perplexity) evaluation of a checkpointed model."""
    logger.info('Starting Zero Shot Eval.')

    model = _load_from_checkpoint(config=config, tokenizer=tokenizer)

    # Optional wandb logging, mirrored from the training path.
    wandb_logger = None
    if config.get('wandb', None) is not None:
        wandb_logger = L.pytorch.loggers.WandbLogger(
            config=omegaconf.OmegaConf.to_object(config),
            **config.wandb)

    callbacks = []
    if 'callbacks' in config:
        callbacks = [hydra.utils.instantiate(cb)
                     for _, cb in config.callbacks.items()]

    trainer = hydra.utils.instantiate(
        config.trainer,
        default_root_dir=os.getcwd(),
        callbacks=callbacks,
        strategy=DDPStrategy(find_unused_parameters=True),
        logger=wandb_logger)

    trainer.test(model, data_module)
144
+
145
+
146
def _train(config, logger, tokenizer, data_module):
    """Train (or resume training of) the diffusion model with Lightning DDP."""
    logger.info('Starting Training.')

    wandb_logger = None
    if config.get('wandb', None) is not None:
        # Suffix a UUID so repeated runs never collide on the same wandb id.
        config.wandb.id = f"{config.wandb.id}_{str(uuid.uuid4())}"
        wandb_logger = L.pytorch.loggers.WandbLogger(
            config=omegaconf.OmegaConf.to_object(config),
            **config.wandb)

    # Resume only when explicitly requested and the checkpoint actually exists.
    resume_path = None
    if (config.checkpointing.resume_from_ckpt
            and config.checkpointing.resume_ckpt_path is not None
            and utils.fsspec_exists(config.checkpointing.resume_ckpt_path)):
        resume_path = config.checkpointing.resume_ckpt_path

    # Lightning callbacks
    callbacks = []
    if 'callbacks' in config:
        for cb_name, cb_conf in config.callbacks.items():
            if cb_name == 'model_checkpoint':
                # Instantiate directly, stripping Hydra's _target_ key.
                cb_kwargs = {k: v for k, v in cb_conf.items() if k != '_target_'}
                callbacks.append(ModelCheckpoint(**cb_kwargs))
            else:
                callbacks.append(hydra.utils.instantiate(cb_conf))

    if config.training.accumulator:
        # Anneal gradient accumulation: 5 batches at epoch 1 down to 1 at epoch 4.
        callbacks.append(
            GradientAccumulationScheduler(scheduling={1: 5, 2: 4, 3: 3, 4: 1}))

    # NOTE(review): accelerator/devices are hardcoded here and override whatever
    # config.trainer specifies; devices=[2,3,4,5,6,7] assumes a specific
    # multi-GPU host -- confirm before reusing on other machines.
    trainer = hydra.utils.instantiate(
        config.trainer,
        default_root_dir=os.getcwd(),
        callbacks=callbacks,
        accelerator='cuda',
        strategy=DDPStrategy(find_unused_parameters=True),
        devices=[2, 3, 4, 5, 6, 7],
        logger=wandb_logger)

    model = Diffusion(config, tokenizer=tokenizer)

    if config.backbone == 'finetune_roformer':
        # Warm-start weights from the checkpoint before fine-tuning.
        state = torch.load(resume_path, map_location='cpu')
        model.load_state_dict(state['state_dict'])

    trainer.fit(model, datamodule=data_module, ckpt_path=resume_path)
197
+
198
@hydra.main(version_base=None, config_path=f'{os.getcwd()}/src', config_name='config')
def main(config):
    """
    Main entry point for training
    """
    wandb.init(project="peptune")
    L.seed_everything(config.seed)

    logger = utils.get_logger(__name__)

    # PeptideCLM SMILES tokenizer with the project's custom vocab/splits.
    tokenizer = SMILES_SPE_Tokenizer(
        f'{config.base_path}/src/tokenizer/new_vocab.txt',
        f'{config.base_path}/src/tokenizer/new_splits.txt')

    data_module = dynamic_dataloader.CustomDataModule(
        f'{config.base_path}/data/peptide_data', tokenizer)

    # Dispatch on run mode; anything unrecognized falls through to training.
    mode = config.mode
    if mode == 'sample_eval':
        generate_samples(config, logger, tokenizer)
    elif mode == 'ppl_eval':
        ppl_eval(config, logger, tokenizer, data_module)
    else:
        _train(config, logger, tokenizer, data_module)
223
+
224
+
225
# Hydra parses CLI overrides when main() is invoked as a script.
if __name__ == '__main__':
    main()
src/utils/app.py ADDED
@@ -0,0 +1,1255 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import re
3
+ import pandas as pd
4
+ from io import StringIO
5
+ import rdkit
6
+ from rdkit import Chem
7
+ from rdkit.Chem import AllChem, Draw
8
+ import numpy as np
9
+ from PIL import Image, ImageDraw, ImageFont
10
+ import matplotlib.pyplot as plt
11
+ import matplotlib.patches as patches
12
+ from io import BytesIO
13
+ import tempfile
14
+ from rdkit import Chem
15
+
16
+ class PeptideAnalyzer:
17
def __init__(self):
    """Set up backbone-bond regex patterns and the 3-to-1 residue code table."""
    # Ordered most-specific first; split_on_bonds claims matches in this order,
    # so e.g. the N-methylated amide wins over the plain peptide bond.
    self.bond_patterns = [
        (r'OC\(=O\)', 'ester'),  # Ester bond
        (r'N\(C\)C\(=O\)', 'n_methyl'),  # N-methylated peptide bond
        (r'N[0-9]C\(=O\)', 'proline'),  # Proline peptide bond
        (r'NC\(=O\)', 'peptide'),  # Standard peptide bond
        (r'C\(=O\)N\(C\)', 'n_methyl_reverse'),  # Reverse N-methylated
        (r'C\(=O\)N[12]?', 'peptide_reverse')  # Reverse peptide bond
    ]
    # Three to one letter code mapping
    self.three_to_one = {
        'Ala': 'A', 'Cys': 'C', 'Asp': 'D', 'Glu': 'E',
        'Phe': 'F', 'Gly': 'G', 'His': 'H', 'Ile': 'I',
        'Lys': 'K', 'Leu': 'L', 'Met': 'M', 'Asn': 'N',
        'Pro': 'P', 'Gln': 'Q', 'Arg': 'R', 'Ser': 'S',
        'Thr': 'T', 'Val': 'V', 'Trp': 'W', 'Tyr': 'Y'
    }
34
+
35
def is_peptide(self, smiles):
    """Return True when *smiles* contains a (possibly N-methylated) peptide bond."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False

    # Standard backbone amide first, then the N-methylated variant.
    amide_smarts = ('[NH][C](=O)', '[N;H0;$(NC)](C)[C](=O)')
    for smarts in amide_smarts:
        if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts)):
            return True

    return False
52
+
53
def is_cyclic(self, smiles):
    """Classify ring closures in *smiles* as peptide macrocycle vs aromatic.

    Returns:
        Tuple ``(is_cyclic, peptide_cycles, aromatic_cycles)``. A chain ending
        in a free C-terminal carboxyl is reported as linear immediately.
    """
    # A trailing free acid marks a linear peptide.
    if smiles.endswith('C(=O)O'):
        return False, [], []

    # Ring-closure digits not preceded by an aromatic carbon. Note that
    # re.findall returns the FULL match here (preceding atom char included,
    # e.g. 'C1'), which the comparison below relies on.
    ring_numbers = re.findall(r'(?:^|[^c])[0-9](?=[A-Z@\(\)])', smiles)

    # Digits belonging to benzene/indole-style aromatic rings.
    aromatic_cycles = []
    for fragment in re.findall(r'c[0-9](?:ccccc|c\[nH\]c)[0-9]', smiles):
        aromatic_cycles.extend(re.findall(r'[0-9]', fragment))

    # Everything not claimed by an aromatic ring counts as a peptide cycle.
    peptide_cycles = [num for num in ring_numbers if num not in aromatic_cycles]

    return bool(peptide_cycles), peptide_cycles, aromatic_cycles
74
+
75
def split_on_bonds(self, smiles):
    """Split a peptide SMILES into residue segments at backbone-bond matches.

    Returns a list of dicts, each with 'content' (the text between two bond
    matches) and the neighboring 'bond_before' / 'bond_after' pattern strings.
    Glycine ('NCC(=O)') is located first because its side-chain-free backbone
    would otherwise be consumed by the generic bond patterns.
    """
    positions = []
    used = set()  # character indices already claimed by an earlier match

    # Find Gly pattern first
    gly_pattern = r'NCC\(=O\)'
    for match in re.finditer(gly_pattern, smiles):
        # Skip matches overlapping an already-claimed span.
        if not any(p in range(match.start(), match.end()) for p in used):
            positions.append({
                'start': match.start(),
                'end': match.end(),
                'type': 'gly',
                'pattern': match.group()
            })
            used.update(range(match.start(), match.end()))

    # Claim remaining bond sites; pattern order (see __init__) makes the
    # more specific patterns win over the generic peptide bond.
    for pattern, bond_type in self.bond_patterns:
        for match in re.finditer(pattern, smiles):
            if not any(p in range(match.start(), match.end()) for p in used):
                positions.append({
                    'start': match.start(),
                    'end': match.end(),
                    'type': bond_type,
                    'pattern': match.group()
                })
                used.update(range(match.start(), match.end()))

    # Sort by position
    positions.sort(key=lambda x: x['start'])

    # Create segments
    segments = []

    if positions:
        # First segment: anything before the first bond match.
        if positions[0]['start'] > 0:
            segments.append({
                'content': smiles[0:positions[0]['start']],
                'bond_after': positions[0]['pattern']
            })

        # Process segments between consecutive bond matches.
        for i in range(len(positions)-1):
            current = positions[i]
            next_pos = positions[i+1]

            if current['type'] == 'gly':
                # The Gly match itself IS the residue: emit canonical backbone.
                segments.append({
                    'content': 'NCC(=O)',
                    'bond_before': positions[i-1]['pattern'] if i > 0 else None,
                    'bond_after': next_pos['pattern']
                })
            else:
                content = smiles[current['end']:next_pos['start']]
                if content:
                    segments.append({
                        'content': content,
                        'bond_before': current['pattern'],
                        'bond_after': next_pos['pattern']
                    })

        # Last segment: anything after the final bond match.
        # NOTE(review): a trailing 'gly' match never emits its own segment here
        # (the loop above stops before the last position) -- confirm intent.
        if positions[-1]['end'] < len(smiles):
            segments.append({
                'content': smiles[positions[-1]['end']:],
                'bond_before': positions[-1]['pattern']
            })

    return segments
145
+
146
def clean_terminal_carboxyl(self, segment):
    """Strip a C-terminal carboxyl from the final segment's content.

    Fires only when the segment has no ``bond_after`` (i.e. it is the last
    segment) and its content contains ``C(=O)O``; removes the parenthesised
    ``(C(=O)O)`` group and any empty parentheses that removal leaves behind.
    A bare (un-parenthesised) trailing ``C(=O)O`` is left untouched, matching
    the original substitution pattern.

    Args:
        segment: dict with 'content' and optional 'bond_after' keys.

    Returns:
        The cleaned (or unchanged) content string.
    """
    content = segment['content']

    if 'C(=O)O' in content and not segment.get('bond_after'):
        # Drop the parenthesised carboxyl group wherever it occurs...
        cleaned = re.sub(r'\(C\(=O\)O\)', '', content)
        # ...and collapse any now-empty parentheses.
        # (Leftover debug print statements removed.)
        return re.sub(r'\(\)', '', cleaned)

    return content
163
+
164
+ def identify_residue(self, segment):
165
+ """Identify residue with Pro reconstruction"""
166
+ # Only clean terminal carboxyl if this is the last segment
167
+ content = self.clean_terminal_carboxyl(segment)
168
+ mods = self.get_modifications(segment)
169
+
170
+ # UAA pattern matching section - before regular residues
171
+ # Phenylglycine and derivatives
172
+ if 'c1ccccc1' in content:
173
+ if '[C@@H](c1ccccc1)' in content or '[C@H](c1ccccc1)' in content:
174
+ return '4', mods # Base phenylglycine
175
+
176
+ # 4-substituted phenylalanines
177
+ if 'Cc1ccc' in content:
178
+ if 'OMe' in content or 'OCc1ccc' in content:
179
+ return '0A1', mods # 4-methoxy-Phenylalanine
180
+ elif 'Clc1ccc' in content:
181
+ return '200', mods # 4-chloro-Phenylalanine
182
+ elif 'Brc1ccc' in content:
183
+ return '4BF', mods # 4-Bromo-phenylalanine
184
+ elif 'C#Nc1ccc' in content:
185
+ return '4CF', mods # 4-cyano-phenylalanine
186
+ elif 'Ic1ccc' in content:
187
+ return 'PHI', mods # 4-Iodo-phenylalanine
188
+ elif 'Fc1ccc' in content:
189
+ return 'PFF', mods # 4-Fluoro-phenylalanine
190
+
191
+ # Modified tryptophans
192
+ if 'c[nH]c2' in content:
193
+ if 'Oc2cccc2' in content:
194
+ return '0AF', mods # 7-hydroxy-tryptophan
195
+ elif 'Fc2cccc2' in content:
196
+ return '4FW', mods # 4-fluoro-tryptophan
197
+ elif 'Clc2cccc2' in content:
198
+ return '6CW', mods # 6-chloro-tryptophan
199
+ elif 'Brc2cccc2' in content:
200
+ return 'BTR', mods # 6-bromo-tryptophan
201
+ elif 'COc2cccc2' in content:
202
+ return 'MOT5', mods # 5-Methoxy-tryptophan
203
+ elif 'Cc2cccc2' in content:
204
+ return 'MTR5', mods # 5-Methyl-tryptophan
205
+
206
+ # Special amino acids
207
+ if 'CC(C)(C)[C@@H]' in content or 'CC(C)(C)[C@H]' in content:
208
+ return 'BUG', mods # Tertleucine
209
+
210
+ if 'CCCNC(=N)N' in content:
211
+ return 'CIR', mods # Citrulline
212
+
213
+ if '[SeH]' in content:
214
+ return 'CSE', mods # Selenocysteine
215
+
216
+ if '[NH3]CC[C@@H]' in content or '[NH3]CC[C@H]' in content:
217
+ return 'DAB', mods # Diaminobutyric acid
218
+
219
+ if 'C1CCCCC1' in content:
220
+ if 'C1CCCCC1[C@@H]' in content or 'C1CCCCC1[C@H]' in content:
221
+ return 'CHG', mods # Cyclohexylglycine
222
+ elif 'C1CCCCC1C[C@@H]' in content or 'C1CCCCC1C[C@H]' in content:
223
+ return 'ALC', mods # 3-cyclohexyl-alanine
224
+
225
+ # Naphthalene derivatives
226
+ if 'c1cccc2c1cccc2' in content:
227
+ if 'c1cccc2c1cccc2[C@@H]' in content or 'c1cccc2c1cccc2[C@H]' in content:
228
+ return 'NAL', mods # 2-Naphthyl-alanine
229
+
230
+ # Heteroaromatic derivatives
231
+ if 'c1cncc' in content:
232
+ return 'PYR4', mods # 3-(4-Pyridyl)-alanine
233
+ if 'c1cscc' in content:
234
+ return 'THA3', mods # 3-(3-thienyl)-alanine
235
+ if 'c1nnc' in content:
236
+ return 'TRZ4', mods # 3-(1,2,4-Triazol-1-yl)-alanine
237
+
238
+ # Modified serines and threonines
239
+ if 'OP(O)(O)O' in content:
240
+ if '[C@@H](COP' in content or '[C@H](COP' in content:
241
+ return 'SEP', mods # phosphoserine
242
+ elif '[C@@H](OP' in content or '[C@H](OP' in content:
243
+ return 'TPO', mods # phosphothreonine
244
+
245
+ # Specialized ring systems
246
+ if 'c1c2ccccc2cc2c1cccc2' in content:
247
+ return 'ANTH', mods # 3-(9-anthryl)-alanine
248
+ if 'c1csc2c1cccc2' in content:
249
+ return 'BTH3', mods # 3-(3-benzothienyl)-alanine
250
+ if '[C@]12C[C@H]3C[C@@H](C2)C[C@@H](C1)C3' in content:
251
+ return 'ADAM', mods # Adamanthane
252
+
253
+ # Fluorinated derivatives
254
+ if 'FC(F)(F)' in content:
255
+ if 'CC(F)(F)F' in content:
256
+ return 'FLA', mods # Trifluoro-alanine
257
+ if 'C(F)(F)F)c1' in content:
258
+ if 'c1ccccc1C(F)(F)F' in content:
259
+ return 'TFG2', mods # 2-(Trifluoromethyl)-phenylglycine
260
+ if 'c1cccc(c1)C(F)(F)F' in content:
261
+ return 'TFG3', mods # 3-(Trifluoromethyl)-phenylglycine
262
+ if 'c1ccc(cc1)C(F)(F)F' in content:
263
+ return 'TFG4', mods # 4-(Trifluoromethyl)-phenylglycine
264
+
265
+ # Multiple halogen patterns
266
+ if 'F' in content and 'c1' in content:
267
+ if 'c1ccc(c(c1)F)F' in content:
268
+ return 'F2F', mods # 3,4-Difluoro-phenylalanine
269
+ if 'cc(F)cc(c1)F' in content:
270
+ return 'WFP', mods # 3,5-Difluoro-phenylalanine
271
+ if 'Cl' in content and 'c1' in content:
272
+ if 'c1ccc(cc1Cl)Cl' in content:
273
+ return 'CP24', mods # 2,4-dichloro-phenylalanine
274
+ if 'c1ccc(c(c1)Cl)Cl' in content:
275
+ return 'CP34', mods # 3,4-dichloro-phenylalanine
276
+
277
+ # Hydroxy and amino derivatives
278
+ if 'O' in content and 'c1' in content:
279
+ if 'c1cc(O)cc(c1)O' in content:
280
+ return '3FG', mods # (2s)-amino(3,5-dihydroxyphenyl)-ethanoic acid
281
+ if 'c1ccc(c(c1)O)O' in content:
282
+ return 'DAH', mods # 3,4-Dihydroxy-phenylalanine
283
+
284
+ # Cyclic amino acids
285
+ if 'C1CCCC1' in content:
286
+ return 'CPA3', mods # 3-Cyclopentyl-alanine
287
+ if 'C1CCCCC1' in content:
288
+ if 'CC1CCCCC1' in content:
289
+ return 'ALC', mods # 3-cyclohexyl-alanine
290
+ else:
291
+ return 'CHG', mods # Cyclohexylglycine
292
+
293
+ # Chain-length variants
294
+ if 'CCC[C@@H]' in content or 'CCC[C@H]' in content:
295
+ return 'NLE', mods # Norleucine
296
+ if 'CC[C@@H]' in content or 'CC[C@H]' in content:
297
+ if not any(x in content for x in ['CC(C)', 'COC', 'CN(']):
298
+ return 'ABA', mods # 2-Aminobutyric acid
299
+
300
+ # Modified histidines
301
+ if 'c1cnc' in content:
302
+ if '[C@@H]1CN[C@@H](N1)F' in content:
303
+ return '2HF', mods # 2-fluoro-l-histidine
304
+ if 'c1cnc([nH]1)F' in content:
305
+ return '2HF1', mods # 2-fluoro-l-histidine variant
306
+ if 'c1c[nH]c(n1)F' in content:
307
+ return '2HF2', mods # 2-fluoro-l-histidine variant
308
+
309
+ # Sulfur and selenium containing
310
+ if '[SeH]' in content:
311
+ return 'CSE', mods # Selenocysteine
312
+ if 'S' in content:
313
+ if 'CSCc1ccccc1' in content:
314
+ return 'BCS', mods # benzylcysteine
315
+ if 'CCSC' in content:
316
+ return 'ESC', mods # Ethionine
317
+ if 'CCS' in content:
318
+ return 'HCS', mods # homocysteine
319
+
320
+ # Additional modifications
321
+ if 'CN=[N]=N' in content:
322
+ return 'AZDA', mods # azido-alanine
323
+ if '[NH]=[C](=[NH2])=[NH2]' in content:
324
+ if 'CCC[NH]=' in content:
325
+ return 'AGM', mods # 5-methyl-arginine
326
+ if 'CC[NH]=' in content:
327
+ return 'GDPR', mods # 2-Amino-3-guanidinopropionic acid
328
+
329
+ if 'CCON' in content:
330
+ return 'CAN', mods # canaline
331
+ if '[C@@H]1C=C[C@@H](C=C1)' in content:
332
+ return 'ACZ', mods # cis-amiclenomycin
333
+ if 'CCC(=O)[NH3]' in content:
334
+ return 'ONL', mods # 5-oxo-l-norleucine
335
+ if 'c1ccncc1' in content:
336
+ return 'PYR4', mods # 3-(4-Pyridyl)-alanine
337
+ if 'c1ccco1' in content:
338
+ return 'FUA2', mods # (2-furyl)-alanine
339
+
340
+ if 'c1ccc' in content:
341
+ if 'c1ccc(cc1)c1ccccc1' in content:
342
+ return 'BIF', mods # 4,4-biphenylalanine
343
+ if 'c1ccc(cc1)C(=O)c1ccccc1' in content:
344
+ return 'PBF', mods # 4-benzoyl-phenylalanine
345
+ if 'c1ccc(cc1)C(C)(C)C' in content:
346
+ return 'TBP4', mods # 4-tert-butyl-phenylalanine
347
+ if 'c1ccc(cc1)[C](=[NH2])=[NH2]' in content:
348
+ return '0BN', mods # 4-carbamimidoyl-l-phenylalanine
349
+ if 'c1cccc(c1)[C](=[NH2])=[NH2]' in content:
350
+ return 'APM', mods # m-amidinophenyl-3-alanine
351
+
352
+ # Multiple hydroxy patterns
353
+ if 'O' in content:
354
+ if '[C@H]([C@H](C)O)O' in content:
355
+ return 'ILX', mods # 4,5-dihydroxy-isoleucine
356
+ if '[C@H]([C@@H](C)O)O' in content:
357
+ return 'ALO', mods # Allo-threonine
358
+ if '[C@H](COP(O)(O)O)' in content:
359
+ return 'SEP', mods # phosphoserine
360
+ if '[C@H]([C@@H](C)OP(O)(O)O)' in content:
361
+ return 'TPO', mods # phosphothreonine
362
+ if '[C@H](c1ccc(O)cc1)O' in content:
363
+ return 'OMX', mods # (betar)-beta-hydroxy-l-tyrosine
364
+ if '[C@H](c1ccc(c(Cl)c1)O)O' in content:
365
+ return 'OMY', mods # (betar)-3-chloro-beta-hydroxy-l-tyrosine
366
+
367
+ # Heterocyclic patterns
368
+ if 'n1' in content:
369
+ if 'n1cccn1' in content:
370
+ return 'PYZ1', mods # 3-(1-Pyrazolyl)-alanine
371
+ if 'n1nncn1' in content:
372
+ return 'TEZA', mods # 3-(2-Tetrazolyl)-alanine
373
+ if 'c2c(n1)cccc2' in content:
374
+ return 'QU32', mods # 3-(2-Quinolyl)-alanine
375
+ if 'c1cnc2c(c1)cccc2' in content:
376
+ return 'QU33', mods # 3-(3-quinolyl)-alanine
377
+ if 'c1ccnc2c1cccc2' in content:
378
+ return 'QU34', mods # 3-(4-quinolyl)-alanine
379
+ if 'c1ccc2c(c1)nccc2' in content:
380
+ return 'QU35', mods # 3-(5-Quinolyl)-alanine
381
+ if 'c1ccc2c(c1)cncc2' in content:
382
+ return 'QU36', mods # 3-(6-Quinolyl)-alanine
383
+ if 'c1cnc2c(n1)cccc2' in content:
384
+ return 'QX32', mods # 3-(2-quinoxalyl)-alanine
385
+
386
+ # Multiple nitrogen patterns
387
+ if 'N' in content:
388
+ if '[NH3]CC[C@@H]' in content:
389
+ return 'DAB', mods # Diaminobutyric acid
390
+ if '[NH3]C[C@@H]' in content:
391
+ return 'DPP', mods # 2,3-Diaminopropanoic acid
392
+ if '[NH3]CCCCCC[C@@H]' in content:
393
+ return 'HHK', mods # (2s)-2,8-diaminooctanoic acid
394
+ if 'CCC[NH]=[C](=[NH2])=[NH2]' in content:
395
+ return 'GBUT', mods # 2-Amino-4-guanidinobutryric acid
396
+ if '[NH]=[C](=S)=[NH2]' in content:
397
+ return 'THIC', mods # Thio-citrulline
398
+
399
+ # Chain modified amino acids
400
+ if 'CC' in content:
401
+ if 'CCCC[C@@H]' in content:
402
+ return 'AHP', mods # 2-Aminoheptanoic acid
403
+ if 'CCC([C@@H])(C)C' in content:
404
+ return 'I2M', mods # 3-methyl-l-alloisoleucine
405
+ if 'CC[C@H]([C@@H])C' in content:
406
+ return 'IIL', mods # Allo-Isoleucine
407
+ if '[C@H](CCC(C)C)' in content:
408
+ return 'HLEU', mods # Homoleucine
409
+ if '[C@@H]([C@@H](C)O)C' in content:
410
+ return 'HLU', mods # beta-hydroxyleucine
411
+
412
+ # Modified glutamate/aspartate patterns
413
+ if '[C@@H]' in content:
414
+ if '[C@@H](C[C@@H](F))' in content:
415
+ return 'FGA4', mods # 4-Fluoro-glutamic acid
416
+ if '[C@@H](C[C@@H](O))' in content:
417
+ return '3GL', mods # 4-hydroxy-glutamic-acid
418
+ if '[C@@H](C[C@H](C))' in content:
419
+ return 'LME', mods # (3r)-3-methyl-l-glutamic acid
420
+ if '[C@@H](CC[C@H](C))' in content:
421
+ return 'MEG', mods # (3s)-3-methyl-l-glutamic acid
422
+
423
+ # Sulfur and selenium modifications
424
+ if 'S' in content:
425
+ if 'SCC[C@@H]' in content:
426
+ return 'HSER', mods # homoserine
427
+ if 'SCCN' in content:
428
+ return 'SLZ', mods # thialysine
429
+ if 'SC(=O)' in content:
430
+ return 'CSA', mods # s-acetonylcysteine
431
+ if '[S@@](=O)' in content:
432
+ return 'SME', mods # Methionine sulfoxide
433
+ if 'S(=O)(=O)' in content:
434
+ return 'OMT', mods # Methionine sulfone
435
+
436
+ # Double bond containing
437
+ if 'C=' in content:
438
+ if 'C=C[C@@H]' in content:
439
+ return '2AG', mods # 2-Allyl-glycine
440
+ if 'C=C[C@@H]' in content:
441
+ return 'LVG', mods # vinylglycine
442
+ if 'C=Cc1ccccc1' in content:
443
+ return 'STYA', mods # Styrylalanine
444
+
445
+ # Special cases
446
+ if '[C@@H]1Cc2c(C1)cccc2' in content:
447
+ return 'IGL', mods # alpha-amino-2-indanacetic acid
448
+ if '[C](=[C](=O)=O)=O' in content:
449
+ return '26P', mods # 2-amino-6-oxopimelic acid
450
+ if '[C](=[C](=O)=O)=C' in content:
451
+ return '2NP', mods # l-2-amino-6-methylene-pimelic acid
452
+ if 'c2cnc[nH]2' in content:
453
+ return 'HIS', mods # histidine core
454
+ if 'c1cccc2c1cc(O)cc2' in content:
455
+ return 'NAO1', mods # 5-hydroxy-1-naphthalene
456
+ if 'c1ccc2c(c1)cc(O)cc2' in content:
457
+ return 'NAO2', mods # 6-hydroxy-2-naphthalene
458
+
459
+ # Proline (P) - flexible ring numbers
460
+ if any([
461
+ # Check for any ring number in bond patterns
462
+ (segment.get('bond_after', '').startswith(f'N{n}C(=O)') and 'CCC' in content and
463
+ any(f'[C@@H]{n}' in content or f'[C@H]{n}' in content for n in '123456789'))
464
+ for n in '123456789'
465
+ ]) or any([
466
+ # Check ending patterns with any ring number
467
+ (f'CCCN{n}' in content and content.endswith('=O') and
468
+ any(f'[C@@H]{n}' in content or f'[C@H]{n}' in content for n in '123456789'))
469
+ for n in '123456789'
470
+ ]) or any([
471
+ # Handle CCC[C@H]n patterns
472
+ (content == f'CCC[C@H]{n}' and segment.get('bond_before', '').startswith(f'C(=O)N{n}')) or
473
+ (content == f'CCC[C@@H]{n}' and segment.get('bond_before', '').startswith(f'C(=O)N{n}')) or
474
+ # N-terminal Pro with any ring number
475
+ (f'N{n}CCC[C@H]{n}' in content) or
476
+ (f'N{n}CCC[C@@H]{n}' in content)
477
+ for n in '123456789'
478
+ ]):
479
+ return 'Pro', mods
480
+
481
+ # Tryptophan (W) - more specific indole pattern
482
+ if re.search(r'c[0-9]c\[nH\]c[0-9]ccccc[0-9][0-9]', content) and \
483
+ 'c[nH]c' in content.replace(' ', ''):
484
+ return 'Trp', mods
485
+
486
+ # Lysine (K) - both patterns
487
+ if '[C@@H](CCCCN)' in content or '[C@H](CCCCN)' in content:
488
+ return 'Lys', mods
489
+
490
+ # Arginine (R) - both patterns
491
+ if '[C@@H](CCCNC(=N)N)' in content or '[C@H](CCCNC(=N)N)' in content:
492
+ return 'Arg', mods
493
+
494
+ if ('C[C@H](CCCC)' in content or 'C[C@@H](CCCC)' in content) and 'CC(C)' not in content:
495
+ return 'Nle', mods
496
+
497
+ # Ornithine (Orn) - 3-carbon chain with NH2
498
+ if ('C[C@H](CCCN)' in content or 'C[C@@H](CCCN)' in content) and 'CC(C)' not in content:
499
+ return 'Orn', mods
500
+
501
+ # 2-Naphthylalanine (2Nal) - distinct from Phe pattern
502
+ if ('Cc3cc2ccccc2c3' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
503
+ return '2Nal', mods
504
+
505
+ # Cyclohexylalanine (Cha) - already in your code but moved here for clarity
506
+ if 'N2CCCCC2' in content or 'CCCCC2' in content:
507
+ return 'Cha', mods
508
+
509
+ # Aminobutyric acid (Abu) - 2-carbon chain
510
+ if ('C[C@H](CC)' in content or 'C[C@@H](CC)' in content) and not any(p in content for p in ['CC(C)', 'CCCC', 'CCC(C)']):
511
+ return 'Abu', mods
512
+
513
+ # Pipecolic acid (Pip) - 6-membered ring like Pro
514
+ if ('N3CCCCC3' in content or 'CCCCC3' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
515
+ return 'Pip', mods
516
+
517
+ # Cyclohexylglycine (Chg) - direct cyclohexyl without CH2
518
+ if ('C[C@H](C1CCCCC1)' in content or 'C[C@@H](C1CCCCC1)' in content):
519
+ return 'Chg', mods
520
+
521
+ # 4-Fluorophenylalanine (4F-Phe)
522
+ if ('Cc2ccc(F)cc2' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
523
+ return '4F-Phe', mods
524
+
525
+ # Regular residue identification
526
+ if ('NCC(=O)' in content) or (content == 'C'):
527
+ # Middle case - between bonds
528
+ if segment.get('bond_before') and segment.get('bond_after'):
529
+ if ('C(=O)N' in segment['bond_before'] or 'C(=O)N(C)' in segment['bond_before']):
530
+ return 'Gly', mods
531
+ # Terminal case - at the end
532
+ elif segment.get('bond_before') and segment.get('bond_before').startswith('C(=O)N'):
533
+ return 'Gly', mods
534
+
535
+ if 'CC(C)C[C@H]' in content or 'CC(C)C[C@@H]' in content:
536
+ return 'Leu', mods
537
+ if '[C@@H](CC(C)C)' in content or '[C@H](CC(C)C)' in content:
538
+ return 'Leu', mods
539
+
540
+ if '[C@@H]([C@@H](C)O)' in content or '[C@H]([C@H](C)O)' in content:
541
+ return 'Thr', mods
542
+
543
+ if '[C@H](Cc2ccccc2)' in content or '[C@@H](Cc2ccccc2)' in content:
544
+ return 'Phe', mods
545
+
546
+ if ('[C@H](C(C)C)' in content or # With outer parentheses
547
+ '[C@@H](C(C)C)' in content or # With outer parentheses
548
+ '[C@H]C(C)C' in content or # Without outer parentheses
549
+ '[C@@H]C(C)C' in content): # Without outer parentheses
550
+ if not any(p in content for p in ['CC(C)C[C@H]', 'CC(C)C[C@@H]']): # Still check not Leu
551
+ return 'Val', mods
552
+
553
+ if '[C@H](COC(C)(C)C)' in content or '[C@@H](COC(C)(C)C)' in content:
554
+ return 'O-tBu', mods
555
+
556
+ if any([
557
+ 'CC[C@H](C)' in content,
558
+ 'CC[C@@H](C)' in content,
559
+ 'C(C)C[C@H]' in content and 'CC(C)C' not in content,
560
+ 'C(C)C[C@@H]' in content and 'CC(C)C' not in content
561
+ ]):
562
+ return 'Ile', mods
563
+
564
+ if ('[C@H](C)' in content or '[C@@H](C)' in content):
565
+ if not any(p in content for p in ['C(C)C', 'COC', 'CN(', 'C(C)O', 'CC[C@H]', 'CC[C@@H]']):
566
+ return 'Ala', mods
567
+
568
+ # Tyrosine (Tyr) - 4-hydroxybenzyl side chain
569
+ if re.search(r'Cc[0-9]ccc\(O\)cc[0-9]', content):
570
+ return 'Tyr', mods
571
+
572
+
573
+ # Serine (Ser) - Hydroxymethyl side chain
574
+ if '[C@H](CO)' in content or '[C@@H](CO)' in content:
575
+ if not ('C(C)O' in content or 'COC' in content):
576
+ return 'Ser', mods
577
+
578
+ # Threonine (Thr) - 1-hydroxyethyl side chain
579
+ if '[C@@H]([C@@H](C)O)' in content or '[C@H]([C@H](C)O)' in content or '[C@@H](C)O' in content or '[C@H](C)O' in content:
580
+ return 'Thr', mods
581
+
582
+ # Cysteine (Cys) - Thiol side chain
583
+ if '[C@H](CS)' in content or '[C@@H](CS)' in content:
584
+ return 'Cys', mods
585
+
586
+ # Methionine (Met) - Methylthioethyl side chain
587
+ if ('C[C@H](CCSC)' in content or 'C[C@@H](CCSC)' in content):
588
+ return 'Met', mods
589
+
590
+ # Asparagine (Asn) - Carbamoylmethyl side chain
591
+ if ('CC(=O)N' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
592
+ return 'Asn', mods
593
+
594
+ # Glutamine (Gln) - Carbamoylethyl side chain
595
+ if ('CCC(=O)N' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
596
+ return 'Gln', mods
597
+
598
+ # Aspartic acid (Asp) - Carboxymethyl side chain
599
+ if ('CC(=O)O' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
600
+ return 'Asp', mods
601
+
602
+ # Glutamic acid (Glu) - Carboxyethyl side chain
603
+ if ('CCC(=O)O' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
604
+ return 'Glu', mods
605
+
606
+ # Arginine (Arg) - 3-guanidinopropyl side chain
607
+ if ('CCCNC(=N)N' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
608
+ return 'Arg', mods
609
+
610
+ # Histidine (His) - Imidazole side chain
611
+ if ('Cc2cnc[nH]2' in content) and ('C[C@H]' in content or 'C[C@@H]' in content):
612
+ return 'His', mods
613
+
614
+ return None, mods
615
+
616
+ def get_modifications(self, segment):
617
+ """Get modifications based on bond types"""
618
+ mods = []
619
+ if segment.get('bond_after'):
620
+ if 'N(C)' in segment['bond_after'] or segment['bond_after'].startswith('C(=O)N(C)'):
621
+ mods.append('N-Me')
622
+ if 'OC(=O)' in segment['bond_after']:
623
+ mods.append('O-linked')
624
+ return mods
625
+
626
+ def analyze_structure(self, smiles):
627
+ """Main analysis function with debug output"""
628
+ print("\nAnalyzing structure:", smiles)
629
+
630
+ # Split into segments
631
+ segments = self.split_on_bonds(smiles)
632
+
633
+ print("\nSegment Analysis:")
634
+ sequence = []
635
+ for i, segment in enumerate(segments):
636
+ print(f"\nSegment {i}:")
637
+ print(f"Content: {segment['content']}")
638
+ print(f"Bond before: {segment.get('bond_before', 'None')}")
639
+ print(f"Bond after: {segment.get('bond_after', 'None')}")
640
+
641
+ residue, mods = self.identify_residue(segment)
642
+ if residue:
643
+ if mods:
644
+ sequence.append(f"{residue}({','.join(mods)})")
645
+ else:
646
+ sequence.append(residue)
647
+ print(f"Identified as: {residue}")
648
+ print(f"Modifications: {mods}")
649
+ else:
650
+ print(f"Warning: Could not identify residue in segment: {segment['content']}")
651
+
652
+ # Check if cyclic
653
+ is_cyclic, peptide_cycles, aromatic_cycles = self.is_cyclic(smiles)
654
+ three_letter = '-'.join(sequence)
655
+ one_letter = ''.join(self.three_to_one.get(aa.split('(')[0], 'X') for aa in sequence)
656
+
657
+ if is_cyclic:
658
+ three_letter = f"cyclo({three_letter})"
659
+ one_letter = f"cyclo({one_letter})"
660
+
661
+ print(f"\nFinal sequence: {three_letter}")
662
+ print(f"One-letter code: {one_letter}")
663
+ print(f"Is cyclic: {is_cyclic}")
664
+ #print(f"Peptide cycles: {peptide_cycles}")
665
+ #print(f"Aromatic cycles: {aromatic_cycles}")
666
+
667
+ return three_letter, len(segments)
668
+ """return {
669
+ 'three_letter': three_letter,
670
+ #'one_letter': one_letter,
671
+ 'is_cyclic': is_cyclic
672
+ }"""
673
+
674
+ def return_sequence(self, smiles):
675
+ """Main analysis function with debug output"""
676
+ print("\nAnalyzing structure:", smiles)
677
+
678
+ # Split into segments
679
+ segments = self.split_on_bonds(smiles)
680
+
681
+ print("\nSegment Analysis:")
682
+ sequence = []
683
+ for i, segment in enumerate(segments):
684
+ print(f"\nSegment {i}:")
685
+ print(f"Content: {segment['content']}")
686
+ print(f"Bond before: {segment.get('bond_before', 'None')}")
687
+ print(f"Bond after: {segment.get('bond_after', 'None')}")
688
+
689
+ residue, mods = self.identify_residue(segment)
690
+ if residue:
691
+ if mods:
692
+ sequence.append(f"{residue}({','.join(mods)})")
693
+ else:
694
+ sequence.append(residue)
695
+ print(f"Identified as: {residue}")
696
+ print(f"Modifications: {mods}")
697
+ else:
698
+ print(f"Warning: Could not identify residue in segment: {segment['content']}")
699
+
700
+ return sequence
701
+
702
+ """
703
+ def annotate_cyclic_structure(mol, sequence):
704
+ '''Create annotated 2D structure with clear, non-overlapping residue labels'''
705
+ # Generate 2D coordinates
706
+ # Generate 2D coordinates
707
+ AllChem.Compute2DCoords(mol)
708
+
709
+ # Create drawer with larger size for annotations
710
+ drawer = Draw.rdMolDraw2D.MolDraw2DCairo(2000, 2000) # Even larger size
711
+
712
+ # Get residue list and reverse it to match structural representation
713
+ if sequence.startswith('cyclo('):
714
+ residues = sequence[6:-1].split('-')
715
+ else:
716
+ residues = sequence.split('-')
717
+ residues = list(reversed(residues)) # Reverse the sequence
718
+
719
+ # Draw molecule first to get its bounds
720
+ drawer.drawOptions().addAtomIndices = False
721
+ drawer.DrawMolecule(mol)
722
+ drawer.FinishDrawing()
723
+
724
+ # Convert to PIL Image
725
+ img = Image.open(BytesIO(drawer.GetDrawingText()))
726
+ draw = ImageDraw.Draw(img)
727
+
728
+ try:
729
+ # Try to use DejaVuSans as it's commonly available on Linux systems
730
+ font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 60)
731
+ small_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 60)
732
+ except OSError:
733
+ try:
734
+ # Fallback to Arial if available (common on Windows)
735
+ font = ImageFont.truetype("arial.ttf", 60)
736
+ small_font = ImageFont.truetype("arial.ttf", 60)
737
+ except OSError:
738
+ # If no TrueType fonts are available, fall back to default
739
+ print("Warning: TrueType fonts not available, using default font")
740
+ font = ImageFont.load_default()
741
+ small_font = ImageFont.load_default()
742
+ # Get molecule bounds
743
+ conf = mol.GetConformer()
744
+ positions = []
745
+ for i in range(mol.GetNumAtoms()):
746
+ pos = conf.GetAtomPosition(i)
747
+ positions.append((pos.x, pos.y))
748
+
749
+ x_coords = [p[0] for p in positions]
750
+ y_coords = [p[1] for p in positions]
751
+ min_x, max_x = min(x_coords), max(x_coords)
752
+ min_y, max_y = min(y_coords), max(y_coords)
753
+
754
+ # Calculate scaling factors
755
+ scale = 150 # Increased scale factor
756
+ center_x = 1000 # Image center
757
+ center_y = 1000
758
+
759
+ # Add residue labels in a circular arrangement around the structure
760
+ n_residues = len(residues)
761
+ radius = 700 # Distance of labels from center
762
+
763
+ # Start from the rightmost point (3 o'clock position) and go counterclockwise
764
+ # Offset by -3 positions to align with structure
765
+ offset = 0 # Adjust this value to match the structure alignment
766
+ for i, residue in enumerate(residues):
767
+ # Calculate position in a circle around the structure
768
+ # Start from 0 (3 o'clock) and go counterclockwise
769
+ angle = -(2 * np.pi * ((i + offset) % n_residues) / n_residues)
770
+
771
+ # Calculate label position
772
+ label_x = center_x + radius * np.cos(angle)
773
+ label_y = center_y + radius * np.sin(angle)
774
+
775
+ # Draw residue label
776
+ text = f"{i+1}. {residue}"
777
+ bbox = draw.textbbox((label_x, label_y), text, font=font)
778
+ padding = 10
779
+ draw.rectangle([bbox[0]-padding, bbox[1]-padding,
780
+ bbox[2]+padding, bbox[3]+padding],
781
+ fill='white', outline='white')
782
+ draw.text((label_x, label_y), text,
783
+ font=font, fill='black', anchor="mm")
784
+
785
+ # Add sequence at the top with white background
786
+ seq_text = f"Sequence: {sequence}"
787
+ bbox = draw.textbbox((center_x, 100), seq_text, font=small_font)
788
+ padding = 10
789
+ draw.rectangle([bbox[0]-padding, bbox[1]-padding,
790
+ bbox[2]+padding, bbox[3]+padding],
791
+ fill='white', outline='white')
792
+ draw.text((center_x, 100), seq_text,
793
+ font=small_font, fill='black', anchor="mm")
794
+
795
+ return img
796
+
797
+ """
798
+ def annotate_cyclic_structure(mol, sequence):
799
+ """Create structure visualization with just the sequence header"""
800
+ # Generate 2D coordinates
801
+ AllChem.Compute2DCoords(mol)
802
+
803
+ # Create drawer with larger size for annotations
804
+ drawer = Draw.rdMolDraw2D.MolDraw2DCairo(2000, 2000)
805
+
806
+ # Draw molecule first
807
+ drawer.drawOptions().addAtomIndices = False
808
+ drawer.DrawMolecule(mol)
809
+ drawer.FinishDrawing()
810
+
811
+ # Convert to PIL Image
812
+ img = Image.open(BytesIO(drawer.GetDrawingText()))
813
+ draw = ImageDraw.Draw(img)
814
+ try:
815
+ small_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 60)
816
+ except OSError:
817
+ try:
818
+ small_font = ImageFont.truetype("arial.ttf", 60)
819
+ except OSError:
820
+ print("Warning: TrueType fonts not available, using default font")
821
+ small_font = ImageFont.load_default()
822
+
823
+ # Add just the sequence header at the top
824
+ seq_text = f"Sequence: {sequence}"
825
+ bbox = draw.textbbox((1000, 100), seq_text, font=small_font)
826
+ padding = 10
827
+ draw.rectangle([bbox[0]-padding, bbox[1]-padding,
828
+ bbox[2]+padding, bbox[3]+padding],
829
+ fill='white', outline='white')
830
+ draw.text((1000, 100), seq_text,
831
+ font=small_font, fill='black', anchor="mm")
832
+
833
+ return img
834
+
835
+ def create_enhanced_linear_viz(sequence, smiles):
836
+ """Create an enhanced linear representation using PeptideAnalyzer"""
837
+ analyzer = PeptideAnalyzer() # Create analyzer instance
838
+
839
+ # Create figure with two subplots
840
+ fig = plt.figure(figsize=(15, 10))
841
+ gs = fig.add_gridspec(2, 1, height_ratios=[1, 2])
842
+ ax_struct = fig.add_subplot(gs[0])
843
+ ax_detail = fig.add_subplot(gs[1])
844
+
845
+ # Parse sequence and get residues
846
+ if sequence.startswith('cyclo('):
847
+ residues = sequence[6:-1].split('-')
848
+ else:
849
+ residues = sequence.split('-')
850
+
851
+ # Get segments using analyzer
852
+ segments = analyzer.split_on_bonds(smiles)
853
+
854
+ # Debug print
855
+ print(f"Number of residues: {len(residues)}")
856
+ print(f"Number of segments: {len(segments)}")
857
+
858
+ # Top subplot - Basic structure
859
+ ax_struct.set_xlim(0, 10)
860
+ ax_struct.set_ylim(0, 2)
861
+
862
+ num_residues = len(residues)
863
+ spacing = 9.0 / (num_residues - 1) if num_residues > 1 else 9.0
864
+
865
+ # Draw basic structure
866
+ y_pos = 1.5
867
+ for i in range(num_residues):
868
+ x_pos = 0.5 + i * spacing
869
+
870
+ # Draw amino acid box
871
+ rect = patches.Rectangle((x_pos-0.3, y_pos-0.2), 0.6, 0.4,
872
+ facecolor='lightblue', edgecolor='black')
873
+ ax_struct.add_patch(rect)
874
+
875
+ # Draw connecting bonds if not the last residue
876
+ if i < num_residues - 1:
877
+ segment = segments[i] if i < len(segments) else None
878
+ if segment:
879
+ # Determine bond type from segment info
880
+ bond_type = 'ester' if 'O-linked' in segment.get('bond_after', '') else 'peptide'
881
+ is_n_methylated = 'N-Me' in segment.get('bond_after', '')
882
+
883
+ bond_color = 'red' if bond_type == 'ester' else 'black'
884
+ linestyle = '--' if bond_type == 'ester' else '-'
885
+
886
+ # Draw bond line
887
+ ax_struct.plot([x_pos+0.3, x_pos+spacing-0.3], [y_pos, y_pos],
888
+ color=bond_color, linestyle=linestyle, linewidth=2)
889
+
890
+ # Add bond type label
891
+ mid_x = x_pos + spacing/2
892
+ bond_label = f"{bond_type}"
893
+ if is_n_methylated:
894
+ bond_label += "\n(N-Me)"
895
+ ax_struct.text(mid_x, y_pos+0.1, bond_label,
896
+ ha='center', va='bottom', fontsize=10,
897
+ color=bond_color)
898
+
899
+ # Add residue label
900
+ ax_struct.text(x_pos, y_pos-0.5, residues[i],
901
+ ha='center', va='top', fontsize=14)
902
+
903
+ # Bottom subplot - Detailed breakdown
904
+ ax_detail.set_ylim(0, len(segments)+1)
905
+ ax_detail.set_xlim(0, 1)
906
+
907
+ # Create detailed breakdown
908
+ segment_y = len(segments) # Start from top
909
+ for i, segment in enumerate(segments):
910
+ y = segment_y - i
911
+
912
+ # Check if this is a bond or residue
913
+ residue, mods = analyzer.identify_residue(segment)
914
+ if residue:
915
+ text = f"Residue {i+1}: {residue}"
916
+ if mods:
917
+ text += f" ({', '.join(mods)})"
918
+ color = 'blue'
919
+ else:
920
+ # Must be a bond
921
+ text = f"Bond {i}: "
922
+ if 'O-linked' in segment.get('bond_after', ''):
923
+ text += "ester"
924
+ elif 'N-Me' in segment.get('bond_after', ''):
925
+ text += "peptide (N-methylated)"
926
+ else:
927
+ text += "peptide"
928
+ color = 'red'
929
+
930
+ # Add segment analysis
931
+ ax_detail.text(0.05, y, text, fontsize=12, color=color)
932
+ ax_detail.text(0.5, y, f"SMILES: {segment.get('content', '')}", fontsize=10, color='gray')
933
+
934
+ # If cyclic, add connection indicator
935
+ if sequence.startswith('cyclo('):
936
+ ax_struct.annotate('', xy=(9.5, y_pos), xytext=(0.5, y_pos),
937
+ arrowprops=dict(arrowstyle='<->', color='red', lw=2))
938
+ ax_struct.text(5, y_pos+0.3, 'Cyclic Connection',
939
+ ha='center', color='red', fontsize=14)
940
+
941
+ # Add titles and adjust layout
942
+ ax_struct.set_title("Peptide Structure Overview", pad=20)
943
+ ax_detail.set_title("Segment Analysis Breakdown", pad=20)
944
+
945
+ # Remove axes
946
+ for ax in [ax_struct, ax_detail]:
947
+ ax.set_xticks([])
948
+ ax.set_yticks([])
949
+ ax.axis('off')
950
+
951
+ plt.tight_layout()
952
+ return fig
953
+
954
+ class PeptideStructureGenerator:
955
+ """A class to generate 3D structures of peptides using different embedding methods"""
956
+
957
+ @staticmethod
958
+ def prepare_molecule(smiles):
959
+ """Prepare molecule with proper hydrogen handling"""
960
+ mol = Chem.MolFromSmiles(smiles, sanitize=False)
961
+ if mol is None:
962
+ raise ValueError("Failed to create molecule from SMILES")
963
+
964
+ # Calculate valence for each atom
965
+ for atom in mol.GetAtoms():
966
+ atom.UpdatePropertyCache(strict=False)
967
+
968
+ # Sanitize with reduced requirements
969
+ Chem.SanitizeMol(mol,
970
+ sanitizeOps=Chem.SANITIZE_FINDRADICALS|
971
+ Chem.SANITIZE_KEKULIZE|
972
+ Chem.SANITIZE_SETAROMATICITY|
973
+ Chem.SANITIZE_SETCONJUGATION|
974
+ Chem.SANITIZE_SETHYBRIDIZATION|
975
+ Chem.SANITIZE_CLEANUPCHIRALITY)
976
+
977
+ mol = Chem.AddHs(mol)
978
+ return mol
979
+
980
+ @staticmethod
981
+ def get_etkdg_params(attempt=0):
982
+ """Get ETKDG parameters with optional modifications based on attempt number"""
983
+ params = AllChem.ETKDGv3()
984
+ params.randomSeed = -1
985
+ params.maxIterations = 200
986
+ params.numThreads = 4 # Reduced for web interface
987
+ params.useBasicKnowledge = True
988
+ params.enforceChirality = True
989
+ params.useExpTorsionAnglePrefs = True
990
+ params.useSmallRingTorsions = True
991
+ params.useMacrocycleTorsions = True
992
+ params.ETversion = 2
993
+ params.pruneRmsThresh = -1
994
+ params.embedRmsThresh = 0.5
995
+
996
+ if attempt > 10:
997
+ params.bondLength = 1.5 + (attempt - 10) * 0.02
998
+ params.useExpTorsionAnglePrefs = False
999
+
1000
+ return params
1001
+
1002
+ def generate_structure_etkdg(self, smiles, max_attempts=20):
1003
+ """Generate 3D structure using ETKDG without UFF optimization"""
1004
+ success = False
1005
+ mol = None
1006
+
1007
+ for attempt in range(max_attempts):
1008
+ try:
1009
+ mol = self.prepare_molecule(smiles)
1010
+ params = self.get_etkdg_params(attempt)
1011
+
1012
+ if AllChem.EmbedMolecule(mol, params) == 0:
1013
+ success = True
1014
+ break
1015
+ except Exception as e:
1016
+ continue
1017
+
1018
+ if not success:
1019
+ raise ValueError("Failed to generate structure with ETKDG")
1020
+
1021
+ return mol
1022
+
1023
+ def generate_structure_uff(self, smiles, max_attempts=20):
1024
+ """Generate 3D structure using ETKDG followed by UFF optimization"""
1025
+ best_mol = None
1026
+ lowest_energy = float('inf')
1027
+
1028
+ for attempt in range(max_attempts):
1029
+ try:
1030
+ test_mol = self.prepare_molecule(smiles)
1031
+ params = self.get_etkdg_params(attempt)
1032
+
1033
+ if AllChem.EmbedMolecule(test_mol, params) == 0:
1034
+ res = AllChem.UFFOptimizeMolecule(test_mol, maxIters=2000,
1035
+ vdwThresh=10.0, confId=0,
1036
+ ignoreInterfragInteractions=True)
1037
+
1038
+ if res == 0:
1039
+ ff = AllChem.UFFGetMoleculeForceField(test_mol)
1040
+ if ff:
1041
+ current_energy = ff.CalcEnergy()
1042
+ if current_energy < lowest_energy:
1043
+ lowest_energy = current_energy
1044
+ best_mol = Chem.Mol(test_mol)
1045
+ except Exception:
1046
+ continue
1047
+
1048
+ if best_mol is None:
1049
+ raise ValueError("Failed to generate optimized structure")
1050
+
1051
+ return best_mol
1052
+
1053
+ @staticmethod
1054
+ def mol_to_sdf_bytes(mol):
1055
+ """Convert RDKit molecule to SDF file bytes"""
1056
+ # First write to StringIO in text mode
1057
+ sio = StringIO()
1058
+ writer = Chem.SDWriter(sio)
1059
+ writer.write(mol)
1060
+ writer.close()
1061
+
1062
+ # Convert the string to bytes
1063
+ return sio.getvalue().encode('utf-8')
1064
+
1065
+ def process_input(smiles_input=None, file_obj=None, show_linear=False,
1066
+ show_segment_details=False, generate_3d=False, use_uff=False):
1067
+ """Process input and create visualizations using PeptideAnalyzer"""
1068
+ analyzer = PeptideAnalyzer()
1069
+ temp_dir = tempfile.mkdtemp() if generate_3d else None
1070
+ structure_files = []
1071
+
1072
+ # Handle direct SMILES input
1073
+ if smiles_input:
1074
+ smiles = smiles_input.strip()
1075
+
1076
+ # First check if it's a peptide using analyzer's method
1077
+ if not analyzer.is_peptide(smiles):
1078
+ return "Error: Input SMILES does not appear to be a peptide structure.", None, None
1079
+
1080
+ try:
1081
+ # Create molecule
1082
+ mol = Chem.MolFromSmiles(smiles)
1083
+ if mol is None:
1084
+ return "Error: Invalid SMILES notation.", None, None
1085
+
1086
+ # Generate 3D structures if requested
1087
+ if generate_3d:
1088
+ generator = PeptideStructureGenerator()
1089
+
1090
+ try:
1091
+ # Generate ETKDG structure
1092
+ mol_etkdg = generator.generate_structure_etkdg(smiles)
1093
+ etkdg_path = os.path.join(temp_dir, "structure_etkdg.sdf")
1094
+ writer = Chem.SDWriter(etkdg_path)
1095
+ writer.write(mol_etkdg)
1096
+ writer.close()
1097
+ structure_files.append(etkdg_path)
1098
+
1099
+ # Generate UFF structure if requested
1100
+ if use_uff:
1101
+ mol_uff = generator.generate_structure_uff(smiles)
1102
+ uff_path = os.path.join(temp_dir, "structure_uff.sdf")
1103
+ writer = Chem.SDWriter(uff_path)
1104
+ writer.write(mol_uff)
1105
+ writer.close()
1106
+ structure_files.append(uff_path)
1107
+
1108
+ except Exception as e:
1109
+ return f"Error generating 3D structures: {str(e)}", None, None, None
1110
+
1111
+ # Use analyzer to get sequence
1112
+ segments = analyzer.split_on_bonds(smiles)
1113
+
1114
+ # Process segments and build sequence
1115
+ sequence_parts = []
1116
+ output_text = ""
1117
+
1118
+ # Only include segment analysis in output if requested
1119
+ if show_segment_details:
1120
+ output_text += "Segment Analysis:\n"
1121
+ for i, segment in enumerate(segments):
1122
+ output_text += f"\nSegment {i}:\n"
1123
+ output_text += f"Content: {segment['content']}\n"
1124
+ output_text += f"Bond before: {segment.get('bond_before', 'None')}\n"
1125
+ output_text += f"Bond after: {segment.get('bond_after', 'None')}\n"
1126
+
1127
+ residue, mods = analyzer.identify_residue(segment)
1128
+ if residue:
1129
+ if mods:
1130
+ sequence_parts.append(f"{residue}({','.join(mods)})")
1131
+ else:
1132
+ sequence_parts.append(residue)
1133
+ output_text += f"Identified as: {residue}\n"
1134
+ output_text += f"Modifications: {mods}\n"
1135
+ else:
1136
+ output_text += f"Warning: Could not identify residue in segment: {segment['content']}\n"
1137
+ output_text += "\n"
1138
+ else:
1139
+ # Just build sequence without detailed analysis in output
1140
+ for segment in segments:
1141
+ residue, mods = analyzer.identify_residue(segment)
1142
+ if residue:
1143
+ if mods:
1144
+ sequence_parts.append(f"{residue}({','.join(mods)})")
1145
+ else:
1146
+ sequence_parts.append(residue)
1147
+
1148
+ # Check if cyclic using analyzer's method
1149
+ is_cyclic, peptide_cycles, aromatic_cycles = analyzer.is_cyclic(smiles)
1150
+ three_letter = '-'.join(sequence_parts)
1151
+ one_letter = ''.join(analyzer.three_to_one.get(aa.split('(')[0], 'X') for aa in sequence_parts)
1152
+
1153
+ if is_cyclic:
1154
+ three_letter = f"cyclo({three_letter})"
1155
+ one_letter = f"cyclo({one_letter})"
1156
+
1157
+ # Create cyclic structure visualization
1158
+ img_cyclic = annotate_cyclic_structure(mol, three_letter)
1159
+
1160
+ # Create linear representation if requested
1161
+ img_linear = None
1162
+ if show_linear:
1163
+ fig_linear = create_enhanced_linear_viz(three_letter, smiles)
1164
+ buf = BytesIO()
1165
+ fig_linear.savefig(buf, format='png', bbox_inches='tight', dpi=300)
1166
+ buf.seek(0)
1167
+ img_linear = Image.open(buf)
1168
+ plt.close(fig_linear)
1169
+
1170
+ # Add summary to output
1171
+ summary = "Summary:\n"
1172
+ summary += f"Sequence: {three_letter}\n"
1173
+ summary += f"One-letter code: {one_letter}\n"
1174
+ summary += f"Is Cyclic: {'Yes' if is_cyclic else 'No'}\n"
1175
+ #if is_cyclic:
1176
+ #summary += f"Peptide Cycles: {', '.join(peptide_cycles)}\n"
1177
+ #summary += f"Aromatic Cycles: {', '.join(aromatic_cycles)}\n"
1178
+
1179
+ if structure_files:
1180
+ summary += "\n3D Structures Generated:\n"
1181
+ for filepath in structure_files:
1182
+ summary += f"- {os.path.basename(filepath)}\n"
1183
+
1184
+ return summary + output_text, img_cyclic, img_linear, structure_files if structure_files else None
1185
+
1186
+ except Exception as e:
1187
+ return f"Error processing SMILES: {str(e)}", None, None, None
1188
+
1189
+ # Handle file input
1190
+ if file_obj is not None:
1191
+ try:
1192
+ # Handle file content
1193
+ if hasattr(file_obj, 'name'):
1194
+ with open(file_obj.name, 'r') as f:
1195
+ content = f.read()
1196
+ else:
1197
+ content = file_obj.decode('utf-8') if isinstance(file_obj, bytes) else str(file_obj)
1198
+
1199
+ output_text = ""
1200
+ for line in content.splitlines():
1201
+ smiles = line.strip()
1202
+ if smiles:
1203
+ # Check if it's a peptide
1204
+ if not analyzer.is_peptide(smiles):
1205
+ output_text += f"Skipping non-peptide SMILES: {smiles}\n"
1206
+ continue
1207
+
1208
+ # Process this SMILES
1209
+ segments = analyzer.split_on_bonds(smiles)
1210
+ sequence_parts = []
1211
+
1212
+ # Add segment details if requested
1213
+ if show_segment_details:
1214
+ output_text += f"\nSegment Analysis for SMILES: {smiles}\n"
1215
+ for i, segment in enumerate(segments):
1216
+ output_text += f"\nSegment {i}:\n"
1217
+ output_text += f"Content: {segment['content']}\n"
1218
+ output_text += f"Bond before: {segment.get('bond_before', 'None')}\n"
1219
+ output_text += f"Bond after: {segment.get('bond_after', 'None')}\n"
1220
+ residue, mods = analyzer.identify_residue(segment)
1221
+ if residue:
1222
+ if mods:
1223
+ sequence_parts.append(f"{residue}({','.join(mods)})")
1224
+ else:
1225
+ sequence_parts.append(residue)
1226
+ output_text += f"Identified as: {residue}\n"
1227
+ output_text += f"Modifications: {mods}\n"
1228
+ else:
1229
+ for segment in segments:
1230
+ residue, mods = analyzer.identify_residue(segment)
1231
+ if residue:
1232
+ if mods:
1233
+ sequence_parts.append(f"{residue}({','.join(mods)})")
1234
+ else:
1235
+ sequence_parts.append(residue)
1236
+
1237
+ # Get cyclicity and create sequence
1238
+ is_cyclic, peptide_cycles, aromatic_cycles = analyzer.is_cyclic(smiles)
1239
+ sequence = f"cyclo({'-'.join(sequence_parts)})" if is_cyclic else '-'.join(sequence_parts)
1240
+
1241
+ output_text += f"\nSummary for SMILES: {smiles}\n"
1242
+ output_text += f"Sequence: {sequence}\n"
1243
+ output_text += f"Is Cyclic: {'Yes' if is_cyclic else 'No'}\n"
1244
+ if is_cyclic:
1245
+ output_text += f"Peptide Cycles: {', '.join(peptide_cycles)}\n"
1246
+ #output_text += f"Aromatic Cycles: {', '.join(aromatic_cycles)}\n"
1247
+ output_text += "-" * 50 + "\n"
1248
+
1249
+ return output_text, None, None
1250
+
1251
+ except Exception as e:
1252
+ return f"Error processing file: {str(e)}", None, None
1253
+
1254
+ return "No input provided.", None, None
1255
+
src/utils/generate_utils.py ADDED
@@ -0,0 +1,77 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import math
3
+ import sys
4
+ import pandas as pd
5
+ from transformers import AutoModelForMaskedLM, AutoModel, AutoTokenizer
6
+
7
+
8
+ def mask_for_de_novo(config, sequence_length):
9
+ if config.vocab == 'helm':
10
+ return "[MASK]" * sequence_length
11
+ elif config.vocab == 'new_smiles' or config.vocab == 'selfies':
12
+ return ["<mask>"] * sequence_length
13
+ else:
14
+ return ["[MASK]"] * sequence_length
15
+
16
+ def generate_de_novo(sequence_length, tokenizer, model):
17
+ masked_sequence = mask_for_de_novo(sequence_length)
18
+ inputs = tokenizer(masked_sequence, return_tensors='pt').to(model.device)
19
+
20
+ with torch.no_grad():
21
+ logits = model(**inputs).logits
22
+ mask_token_indices = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
23
+ logits_at_masks = logits[0, mask_token_indices]
24
+
25
+ pred_tokens = []
26
+ for i in mask_token_indices:
27
+ topk_logits, topk_indices = logits_at_masks[i].topk(k=3, dim=-1)
28
+ probabilities = torch.nn.functional.softmax(topk_logits, dim=-1)
29
+ predicted_index = torch.distributions.categorical.Categorical(probabilities).sample()
30
+ predicted_token_id = topk_indices[predicted_index].item()
31
+ predicted_token = tokenizer.decode([predicted_token_id], skip_special_tokens=True)
32
+ pred_tokens.append(predicted_token)
33
+
34
+ generated_sequence = ''.join(pred_tokens)
35
+ perplexity = calculate_perplexity(model, tokenizer, generated_sequence)
36
+
37
+ return (generated_sequence, perplexity)
38
+
39
+
40
+ def calculate_perplexity(model, tokenizer, generated_sequence, mask_token_indices):
41
+ total_loss = 0.0
42
+ tensor_input = tokenizer.encode(generated_sequence, return_tensors='pt').to(model.device)
43
+
44
+ for i in mask_token_indices:
45
+ masked_input = tensor_input.clone()
46
+ masked_input[0, i] = tokenizer.mask_token_id
47
+
48
+ labels = torch.full(tensor_input.shape, -100).to(model.device)
49
+ labels[0, i] = tensor_input[0, i]
50
+
51
+ with torch.no_grad():
52
+ outputs = model(masked_input, labels=labels)
53
+ total_loss += outputs.loss.item()
54
+
55
+ num_mask_tokens = len(mask_token_indices)
56
+ if num_mask_tokens == 0:
57
+ perplexity = 10000
58
+ else:
59
+ avg_loss = total_loss / num_mask_tokens
60
+ perplexity = math.exp(avg_loss)
61
+
62
+ return perplexity
63
+
64
+
65
+ def calculate_cosine_sim(original_sequence, generated_sequence, tokenizer, pepclm_model, device):
66
+ og_embeddings = pepclm_model.roformer.encoder(original_sequence)
67
+ new_embeddings = pepclm_model.roformer.encoder(generated_sequence)
68
+
69
+ sequence_similarity = torch.nn.functional.cosine_similarity(og_embeddings, new_embeddings, dim=-1)
70
+ cosine_similarity = torch.mean(sequence_similarity).item()
71
+ return cosine_similarity
72
+
73
+
74
+ def calculate_hamming_dist(original_sequence, generated_sequence):
75
+ generated_sequence = generated_sequence
76
+ original_sequence = original_sequence
77
+ return sum(1 if original_sequence[i] != generated_sequence[i] else 0 for i in range(len(original_sequence)))
src/utils/utils.py ADDED
@@ -0,0 +1,256 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Console logger utilities.
2
+
3
+ Copied from https://github.com/HazyResearch/transformers/blob/master/src/utils/utils.py
4
+ Copied from https://docs.python.org/3/howto/logging-cookbook.html#using-a-context-manager-for-selective-logging
5
+ """
6
+
7
+ import logging
8
+ import math
9
+
10
+ import fsspec
11
+ import lightning
12
+ import torch
13
+ from timm.scheduler import CosineLRScheduler
14
+ from multiprocessing import Pool
15
+
16
+
17
def fsspec_exists(filename):
    """Return True if *filename* exists on its (possibly remote) filesystem."""
    filesystem, _ = fsspec.core.url_to_fs(filename)
    return filesystem.exists(filename)
21
+
22
+
23
def fsspec_listdir(dirname):
    """List the entries of *dirname* via its fsspec filesystem."""
    filesystem, _ = fsspec.core.url_to_fs(dirname)
    return filesystem.ls(dirname)
27
+
28
+
29
def fsspec_mkdirs(dirname, exist_ok=True):
    """Create *dirname* (and parents) via its fsspec filesystem."""
    filesystem, _ = fsspec.core.url_to_fs(dirname)
    filesystem.makedirs(dirname, exist_ok=exist_ok)
33
+
34
+
35
def print_nans(tensor, name):
    """Print *name* and *tensor* to stdout when the tensor contains any NaN."""
    contains_nan = torch.isnan(tensor).any()
    if contains_nan:
        print(name, tensor)
38
+
39
+
40
class CosineDecayWarmupLRScheduler(
        CosineLRScheduler,
        torch.optim.lr_scheduler._LRScheduler):
    """Wrap ``timm.scheduler.CosineLRScheduler``.

    Tracks its own epoch counter so ``scheduler.step()`` can be called with
    no argument (as Lightning does), and supports resuming.
    Adapted from:
    https://github.com/HazyResearch/hyena-dna/blob/main/src/utils/optim/schedulers.py
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_epoch = -1
        self.step(epoch=0)

    def step(self, epoch=None):
        # Maintain our own counter so a bare step() advances by one.
        self._last_epoch = self._last_epoch + 1 if epoch is None else epoch
        # timm distinguishes per-epoch stepping (step) from per-update
        # stepping (step_update). Lightning always calls step(), so we
        # dispatch here based on t_in_epochs; otherwise an "interval: step"
        # configuration would get wrong learning rates.
        if self.t_in_epochs:
            super().step(epoch=self._last_epoch)
        else:
            super().step_update(num_updates=self._last_epoch)
71
+
72
+
73
class LoggingContext:
    """Context manager that temporarily overrides a logger's level and/or
    attaches an extra handler, restoring the previous state on exit."""

    def __init__(self, logger, level=None, handler=None, close=True):
        self.logger = logger
        self.level = level
        self.handler = handler
        self.close = close

    def __enter__(self):
        override_level = self.level is not None
        if override_level:
            # Remember the current level so __exit__ can restore it.
            self.old_level = self.logger.level
            self.logger.setLevel(self.level)
        if self.handler:
            self.logger.addHandler(self.handler)

    def __exit__(self, exc_type, exc_value, traceback):
        if self.level is not None:
            self.logger.setLevel(self.old_level)
        if self.handler:
            self.logger.removeHandler(self.handler)
            if self.close:
                self.handler.close()
95
+
96
+
97
def get_logger(name=__name__, level=logging.INFO) -> logging.Logger:
    """Initialize a multi-GPU-friendly python logger.

    Every level method is wrapped with Lightning's ``rank_zero_only`` so
    that in multi-GPU runs only rank 0 emits records (otherwise each GPU
    process would duplicate every log line).
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)

    level_names = ('debug', 'info', 'warning', 'error',
                   'exception', 'fatal', 'critical')
    for method_name in level_names:
        wrapped = lightning.pytorch.utilities.rank_zero_only(
            getattr(logger, method_name))
        setattr(logger, method_name, wrapped)

    return logger
113
+
114
+
115
class Sampler:
    """Base class for straight-through samplers over logits.

    Subclasses provide the noise distribution and the hard/soft sampling
    rules; ``sample`` combines them with the straight-through estimator
    (the forward value is the hard sample, gradients flow via the soft one).
    """

    def __init__(self, shape):
        self.shape = shape

    def _sampling_noise(self):
        pass

    def _hard_sample(self, logits):
        pass

    def _soft_sample(self, logits):
        return 0

    def sample(self, logits):
        perturbation = self._sampling_noise()
        # Noise is pre-shaped for a maximum batch; trim to this batch.
        perturbation = perturbation[: logits.shape[0], :]
        perturbed = logits + perturbation.to(
            dtype=logits.dtype, device=logits.device)
        hard = self._hard_sample(perturbed)
        soft = self._soft_sample(perturbed)
        # Straight-through: value of `hard`, gradient of `soft`.
        return soft + (hard - soft).detach()
136
+
137
+
138
class TopKSampler(Sampler):
    """Relaxed top-k sampler.

    Gamma-distributed noise yields a differentiable approximation to
    selecting the k largest logits per row.
    """

    def __init__(self, k, shape, gamma_tau=1.0):
        super().__init__(shape)
        self.k = k
        self.gamma_tau = gamma_tau
        self.num_betas = 10
        self.sampler = torch.distributions.gamma.Gamma(
            1 / k * torch.ones(self.num_betas, *self.shape), 1.0)

    def _sampling_noise(self):
        gamma_draw = self.sampler.sample()
        # Rates k/1, k/2, ..., k/num_betas — one per gamma component.
        beta = self.k / torch.arange(
            1, self.num_betas + 1, 1, dtype=torch.float32)
        beta = beta[:, None, None]
        assert beta.ndim == gamma_draw.ndim
        # Sum the scaled components, then shift and rescale.
        combined = torch.sum(gamma_draw / beta, axis=0)
        combined = combined - math.log(10.0)
        return self.gamma_tau * (combined / self.k)

    def _hard_sample(self, logits):
        assert logits.ndim == 2
        sorted_vals, _ = torch.sort(logits, dim=-1)
        # The k-th largest value in each row is the inclusion threshold.
        cutoff = sorted_vals[:, -self.k][:, None]
        return (logits >= cutoff).type(logits.dtype)

    def _soft_sample(self, logits):
        centered = logits - torch.mean(logits, dim=-1, keepdim=True)
        return centered / torch.norm(centered, dim=-1, keepdim=True)
170
+
171
+
172
class DeterministicTopK(TopKSampler):
    """TopKSampler variant with no sampling noise, so the top-k selection is
    fully deterministic."""

    def __init__(self, k):
        super().__init__(k, shape=(1, 1))

    def _sampling_noise(self):
        # No stochastic perturbation: selection depends only on the logits.
        return 0

    def discretize(self, x):
        """Straight-through top-k indicator of *x*: hard one/zero mask in the
        forward pass, soft-sample gradient in the backward pass."""
        hard_sample = self._hard_sample(x)
        soft_sample = self._soft_sample(x)
        return soft_sample + (hard_sample - soft_sample).detach()

    # Backward-compatible alias: the original method name was misspelled
    # ("discreize"); keep it so existing callers continue to work.
    discreize = discretize
183
+
184
class GumbelSampler(Sampler):
    """Gumbel-softmax sampler: one-hot argmax forward pass, temperature-scaled
    softmax gradients."""

    def __init__(self, shape, temperature=1.0):
        super().__init__(shape)
        self.temperature = temperature  # softmax temperature for the soft sample

    def _sampling_noise(self):
        # Standard Gumbel(0, 1) noise via -log(-log(U)); the 1e-10 terms
        # guard against taking log(0).
        return - (1e-10 - (
            torch.rand(* self.shape) + 1e-10).log()).log()

    def _hard_sample(self, logits):
        # NOTE(review): the assert requires 2-D logits, but the indexing
        # below ([:, :, :1] and indices[:, :, None]) only works on 3-D
        # input — one of the two looks wrong; confirm the intended rank
        # against the call sites before relying on this method.
        assert logits.ndim == 2
        indices = torch.argmax(logits, dim=-1)
        zeros = logits * 0
        ones = torch.ones_like(logits[:, :, :1])
        # Scatter ones at the argmax positions to build a one-hot tensor.
        return torch.scatter(zeros, -1, indices[:, :, None],
                             ones)

    def _soft_sample(self, logits):
        # Temperature-scaled softmax provides the differentiable path.
        return torch.nn.functional.softmax(
            logits / self.temperature, dim=-1)
205
+
206
+
207
class BinarySampler(GumbelSampler):
    """Bernoulli straight-through sampler driven by two Gumbel noise draws."""

    def sample(self, probs):
        # TODO(subhamsahoo): use the temperature parameter.
        pos_noise = self._sampling_noise().to(
            dtype=probs.dtype, device=probs.device)
        neg_noise = self._sampling_noise().to(
            dtype=probs.dtype, device=probs.device)
        # exp of the Gumbel-noise difference — equivalent to a logistic draw.
        noise_ratio = (neg_noise - pos_noise).exp()
        hard = (probs * (1 + noise_ratio) > 1).to(probs.dtype)
        soft = probs / (probs + (1 - probs) * noise_ratio)
        # Straight-through: hard value forward, soft gradient backward.
        return soft + (hard - soft).detach()
220
+
221
+
222
class GaussianSampler:
    """Reparameterized Gaussian sampler.

    The input's first half (along the last dim) is the mean; the second
    half parameterizes the variance via softplus.
    """

    def __init__(self):
        self.softplus = torch.nn.Softplus()

    def sample(self, x):
        assert x.ndim == 2
        half = x.shape[-1] // 2
        mu, raw_var = x[:, :half], x[:, half:]
        # softplus keeps the variance positive; sqrt gives the std dev.
        sigma = self.softplus(raw_var).sqrt()
        # Reparameterization trick: mu + sigma * eps, eps ~ N(0, I).
        return mu + sigma * torch.randn_like(mu)
232
+
233
def mapper(n_jobs):
    """Return a map-like callable.

    * ``n_jobs == 1``      -> built-in ``map`` (result materialized to a list)
    * ``n_jobs`` int > 1   -> multiprocessing pool map with that many workers
    * pool object          -> that object's ``map`` method

    Fix over the original: the pool used to be created once in ``mapper``
    and then terminated inside the returned function's ``finally`` block,
    which made the returned mapper unusable after its first call. A fresh
    pool is now created per invocation and cleaned up via the context
    manager, so the mapper is safely reusable.
    """
    if n_jobs == 1:
        def _mapper(*args, **kwargs):
            return list(map(*args, **kwargs))

        return _mapper
    if isinstance(n_jobs, int):
        def _mapper(*args, **kwargs):
            # New pool per call; the context manager terminates the
            # workers even if pool.map raises.
            with Pool(n_jobs) as pool:
                return pool.map(*args, **kwargs)

        return _mapper
    return n_jobs.map