Update README.md
README.md
CHANGED
|
@@ -2,70 +2,66 @@
|
|
| 2 |
license: mit
|
| 3 |
---
|
| 4 |
|
| 5 |
-
1. Model Overview
|
| 58 |
-
6.
|
| 61 |
-
7.
|
| 65 |
-
• Hardware Requirements (Optional): Specify the hardware required to run the model efficiently.
|
| 66 |
-
• GitHub Links (Optional): Include links to GitHub repositories showing existing integrations and usage examples.
|
| 67 |
-
8. Citation
|
| 68 |
-
• BibTeX (Mandatory):
|
| 69 |
@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
|
| 70 |
title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
|
| 71 |
author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
|
|
@@ -88,13 +84,3 @@ doi = {10.26434/chemrxiv.15001722/v1},
|
|
| 88 |
URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
|
| 89 |
eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
|
| 90 |
}
|
| 91 |
-
|
| 92 |
-
• DOI (Optional): Include a DOI if available.
|
| 93 |
-
9. Contributors
|
| 94 |
-
• Names and Roles (Optional): Additional list of contributors (not developers – see section Model Overview – Developed by) and their roles in the model’s development.
|
| 95 |
-
10. Contact Information
|
| 96 |
-
• Support Channels (Optional): List channels for support or feedback (e.g., email, forums, GitHub issues).
|
| 97 |
-
11. Acknowledgements (Optional): Recognize any additional contributors, sponsors, or related works.
|
| 98 |
-
12. Disclaimer
|
| 99 |
-
• Legal Disclaimer (Mandatory): Include any legal disclaimers regarding the model’s use.
|
| 100 |
-
|
|
|
|
| 2 |
license: mit
|
| 3 |
---
|
| 4 |
|
| 5 |
+
## 1. Model Overview
|
| 6 |
+
- **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
|
| 7 |
+
- **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
|
| 8 |
+
- **Model Specification:** Encoder–decoder Transformer. 220M parameters.
|
| 9 |
+
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
|
| 10 |
+
- **License:** MIT license.
|
| 11 |
+
- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
|
| 12 |
+
- **Model Type:** Transformer
|
| 13 |
+
- **Languages:** SMARTS (chemical substructure representation)
|
| 14 |
+
- **Pipeline Tag:** text2text-generation for MMP transformation
|
| 15 |
+
- **Library:** Transformers, PyTorch
|
| 16 |
+
|
| 17 |
+
## 2. Intended Use
|
| 18 |
+
- **Direct Use:**
|
| 19 |
+
- Generation of chemically valid **matched molecular pair transformations (MMPTs)**.
|
| 20 |
+
- Analog design at a **user-specified edit site** (R-group substitution or core hopping)
|
| 21 |
+
- **Downstream Use:**
|
| 22 |
+
- Integration into analog enumeration pipelines
|
| 23 |
+
- Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
|
| 24 |
+
|
| 25 |
+
## 3. Bias, Risks, and Limitations
|
| 26 |
+
- **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
|
| 27 |
+
- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
|
| 28 |
+
- **Risk Areas:** This framework is intended for research use and does not introduce specific ethical concerns.
|
| 29 |
+
|
| 30 |
+
## 4. Training Details
|
| 31 |
+
- **Training Data:** Raw data were downloaded from the ChEMBL database, which is available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
|
| 32 |
+
- **Training Data Preprocessing:**
|
| 33 |
+
- Drug-likeness filtering using *rule_of_druglike_soft*
|
| 34 |
+
- Molecular weight ≥ 200 Da
|
| 35 |
+
- Removal of structural alerts using the curated Walters alert list
|
| 36 |
+
- Data are processed with mmpdb, which is available at https://github.com/rdkit/mmpdb.
|
| 37 |
+
- **Pre-Training:** The base model, ChemT5, is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
|
| 38 |
+
- **Training Procedure:**
|
| 39 |
+
- Supervised sequence-to-sequence learning with teacher forcing
|
| 40 |
+
- Cross-entropy loss
|
| 41 |
+
- Batch size: 64
|
| 42 |
+
- Learning rate: 5 × 10⁻⁴
|
| 43 |
+
- Hardware:
|
| 44 |
+
- MMPT-FM: 4 × NVIDIA A6000 GPUs
|
| 45 |
+
- MMP-based baselines: 4 × NVIDIA H100 GPUs
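
As a rough illustration of the training procedure listed above, the sketch below shows teacher forcing and token-level cross-entropy on a toy example in plain Python. The token ids and the uniform `toy_logits` standing in for the decoder are hypothetical; the actual training runs on the Transformers/PyTorch stack and is not reproduced here. Only the batch size and learning rate are taken from this card.

```python
import math

PAD_ID = 0             # T5-style models start decoding from the pad token
BATCH_SIZE = 64        # from the model card
LEARNING_RATE = 5e-4   # from the model card


def shift_right(target_ids, start_id=PAD_ID):
    """Teacher forcing: the decoder is fed the gold tokens shifted one step right."""
    return [start_id] + target_ids[:-1]


def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def sequence_loss(step_logits, target_ids):
    """Mean token-level cross-entropy over one target sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical token ids for the output (target) variable fragment.
target = [5, 2, 7, 1]
decoder_input = shift_right(target)

# Stand-in for the decoder: uniform logits over a 10-token vocabulary,
# so the loss equals log(10) regardless of the target tokens.
toy_logits = [[0.0] * 10 for _ in target]
loss = sequence_loss(toy_logits, target)
```

At each step the decoder conditions on the ground-truth prefix rather than on its own previous prediction, which is what distinguishes teacher forcing from free-running generation at inference time.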
|
| 46 |
+
|
| 47 |
+
## 5. Evaluation
|
| 48 |
+
- **Metrics:**
|
| 49 |
+
- Validity
|
| 50 |
+
- Novelty (Novel/valid, Novel/all)
|
| 51 |
+
- Recall (overall, in-training, out-of-training)
|
| 52 |
+
- **Benchmarks:**
|
| 53 |
+
- Held-out ChEMBL MMPT test set (in-distribution)
|
| 54 |
+
- Within-patent analog generation (PMV17)
|
| 55 |
+
- Cross-patent analog generation (PMV17 → PMV21)
|
| 56 |
+
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
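
The metrics above could be computed along the following lines. This is a minimal sketch under assumed definitions (validity as valid/generated, novelty as valid outputs absent from the training set, recall as the fraction of reference transformations recovered); the publication's exact definitions may differ. The function name, the `is_valid` predicate, and the toy transformation strings are hypothetical placeholders, not real chemistry.

```python
def ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def evaluate(generated, reference, training, is_valid):
    """Compute validity, novelty, and recall for a set of generated transformations."""
    generated = list(dict.fromkeys(generated))  # de-duplicate, keep order
    reference, training = set(reference), set(training)

    valid = [g for g in generated if is_valid(g)]
    novel = [g for g in valid if g not in training]
    hits = set(valid) & reference

    ref_in = reference & training    # reference items seen during training
    ref_out = reference - training   # reference items unseen during training
    return {
        "validity": ratio(len(valid), len(generated)),
        "novel/valid": ratio(len(novel), len(valid)),
        "novel/all": ratio(len(novel), len(generated)),
        "recall": ratio(len(hits), len(reference)),
        "recall_in_training": ratio(len(hits & ref_in), len(ref_in)),
        "recall_out_of_training": ratio(len(hits & ref_out), len(ref_out)),
    }


# Toy example with placeholder strings standing in for transformations.
metrics = evaluate(
    generated=["A>>B", "A>>C", "A>>D", "bad", "A>>B"],
    reference={"A>>B", "A>>C", "A>>F"},
    training={"A>>B", "A>>E"},
    is_valid=lambda t: ">>" in t,
)
```

In practice `is_valid` would be a cheminformatics check (for example, parsing the generated fragment with a toolkit) rather than the string test used here.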
|
| 57 |
+
|
| 58 |
+
## 6. Usage
|
| 59 |
+
- **Sample Inference Code:** Inference is described conceptually in the publication; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
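
For orientation, beam-search generation with a T5-style checkpoint in Transformers might look like the sketch below. It is not the repository's code: the function names, the checkpoint path, the example input fragment, and the default beam settings are all assumptions, and the input format the model actually expects is defined in the GitHub repository.

```python
def generation_kwargs(num_beams=10, num_return_sequences=10, max_new_tokens=64):
    """Beam-search settings to pass to `model.generate` (defaults are assumptions)."""
    if num_return_sequences > num_beams:
        # Beam search cannot return more sequences than it keeps beams.
        raise ValueError("num_return_sequences must not exceed num_beams")
    return {
        "num_beams": num_beams,
        "num_return_sequences": num_return_sequences,
        "max_new_tokens": max_new_tokens,
        "early_stopping": True,
        "do_sample": False,  # deterministic beam search, no sampling
    }


def suggest_transformations(checkpoint, source_fragment, **overrides):
    """Return de-duplicated beam-search outputs for one input variable fragment."""
    # Imported lazily so generation_kwargs works without Transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    inputs = tokenizer(source_fragment, return_tensors="pt")
    outputs = model.generate(**inputs, **generation_kwargs(**overrides))
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return list(dict.fromkeys(decoded))  # drop duplicates, keep beam order


# Hypothetical call; both arguments are placeholders:
# suggest_transformations("path/to/mmpt-fm-checkpoint", "[*:1]c1ccccc1",
#                         num_beams=20, num_return_sequences=20)
```

Returning several beams per input gives a ranked list of candidate replacements for the variable fragment, which can then be filtered for validity before enumeration.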
|
| 60 |
+
|
| 61 |
+
## 7. Citation
|
| 62 |
+
- **BibTeX:**
|
| 63 |
+
|
| 64 |
+
```bibtex
|
| 65 |
@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
|
| 66 |
title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
|
| 67 |
author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
|
|
|
|
| 84 |
URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
|
| 85 |
eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
|
| 86 |
}
|