---
license: mit
---

1. Model Overview

• Model Name (Mandatory): MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
• Summary (Mandatory): MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
• Model Specification (Mandatory): Encoder–decoder Transformer, 220M parameters.
• Developed by (Mandatory): Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
• License (Mandatory): MIT license.
• Base Model (Mandatory): ChemT5 (chemistry-domain pretrained T5).
• Model Type (Mandatory): Transformer
• Languages (Optional): SMARTS (chemical substructure representation)
• Pipeline Tag (Mandatory): text2text-generation for MMP transformation
• Library (Mandatory): Transformers, PyTorch
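The transformation-centric formulation above can be illustrated with a deliberately simplified sketch: an MMPT is treated as a context-independent variable-to-variable edit that applies to any molecule containing the variable fragment. The SMILES strings and the plain string matching below are toy assumptions for illustration only; the actual model operates on SMARTS-encoded fragments, not raw string replacement.

```python
# Toy illustration (not the model's actual machinery): an MMPT as a
# context-independent variable-to-variable edit. Real MMPTs are
# SMARTS-encoded fragments; plain string matching is an assumption here.

def apply_mmpt(smiles: str, variable_from: str, variable_to: str):
    """Apply a matched molecular pair transformation to a molecule,
    returning the analog, or None if the variable part is absent."""
    if variable_from not in smiles:
        return None  # transformation does not apply to this molecule
    # Replace only the first occurrence: one edit site per transformation.
    return smiles.replace(variable_from, variable_to, 1)

# Example: swap a methoxy-like fragment for an ethoxy-like one.
analog = apply_mmpt("c1ccccc1OC", "OC", "OCC")
print(analog)  # c1ccccc1OCC
```

Because the transformation carries no series-specific context, the same edit can be reused across chemical series, which is what makes the formulation scalable.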
2. Intended Use

• Direct Use (Mandatory):
  o Generation of chemically valid matched molecular pair transformations (MMPTs).
  o Analog design at a user-specified edit site (R-group substitution or core hopping).
• Downstream Use (Mandatory):
  o Integration into analog enumeration pipelines.
  o Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry.
• Out-of-Scope Use (Optional): None
3. Bias, Risks, and Limitations

• Known Limitations (Mandatory): The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
• Biases (Mandatory): Inherits biases from ChEMBL-derived medicinal chemistry literature.
• Risk Areas (Mandatory): The framework is intended for research use and does not introduce specific ethical concerns.
• Recommendations (Mandatory): None
4. Training Details

• Training Data (Mandatory): Raw data is downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
• Training Data Preprocessing (Optional):
  o Drug-likeness filtering using rule_of_druglike_soft.
  o Molecular weight ≥ 200 Da.
  o Removal of structural alerts using the curated Walters alert list.
  o Data is processed with mmpdb, available at https://github.com/rdkit/mmpdb.
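The three filters above can be sketched as a simple pipeline over records with precomputed properties. The record fields, the example alert names, and the molecules are illustrative assumptions; in practice the properties would come from RDKit and the curated Walters alert list, and MMP indexing is done with mmpdb.

```python
# Sketch of the preprocessing filters described above, applied to records
# with precomputed properties. Field names and the example alert set are
# assumptions; in practice properties come from RDKit and the curated
# Walters structural-alert list, and MMP indexing is done with mmpdb.

WALTERS_ALERTS_TOY = {"nitro", "acyl_halide"}  # placeholder alert names

def passes_filters(record: dict) -> bool:
    """Keep drug-like molecules of at least 200 Da with no structural alerts."""
    return (
        record["druglike_soft"]            # rule_of_druglike_soft verdict
        and record["mol_weight"] >= 200.0  # molecular weight >= 200 Da
        and not (set(record["alerts"]) & WALTERS_ALERTS_TOY)
    )

records = [  # toy records; weights and flags are made up for illustration
    {"smiles": "CC(=O)Nc1ccc(O)cc1", "mol_weight": 151.2,
     "druglike_soft": True, "alerts": []},
    {"smiles": "O=[N+]([O-])c1ccc2ncccc2c1", "mol_weight": 246.2,
     "druglike_soft": True, "alerts": ["nitro"]},
    {"smiles": "CC1=CC(=O)N(c2ccccc2)N1C", "mol_weight": 204.2,
     "druglike_soft": True, "alerts": []},
]
kept = [r["smiles"] for r in records if passes_filters(r)]
print(kept)  # only the third record survives all three filters
```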
• Pre-Training (Optional): The base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
• Training Procedure (Mandatory):
  o Supervised sequence-to-sequence learning with teacher forcing.
  o Cross-entropy loss.
  o Batch size: 64
  o Learning rate: 5 × 10⁻⁴
  o Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP-based baselines: 4 × NVIDIA H100 GPUs
• Fine-Tuning (Optional): None.
• Environmental Impact (Optional): None.
• Societal Impact Assessment (Optional): None.
5. Evaluation

• Metrics (Optional):
  o Validity
  o Novelty (Novel/valid, Novel/all)
  o Recall (overall, in-training, out-of-training)
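These metrics are straightforward set computations once generations are available. The sketch below uses toy transformation identifiers and a precomputed validity set as assumptions; in practice validity would be checked by parsing the generated output (e.g., with RDKit), and recall is split by whether the reference transformation appeared in training.

```python
# Sketch of the evaluation metrics listed above, computed over generated
# transformations. The toy identifiers and the precomputed validity set
# are assumptions; validity would normally be checked by parsing the
# generated output, e.g. with RDKit.

def evaluate(generated, valid, training_set, reference_set):
    n = len(generated)
    valid_gen = [g for g in generated if g in valid]          # Validity
    novel = [g for g in valid_gen if g not in training_set]   # Novelty
    recovered = set(generated) & reference_set                # Recall
    in_train = reference_set & training_set
    out_train = reference_set - training_set
    return {
        "validity": len(valid_gen) / n,
        "novel_over_valid": len(novel) / len(valid_gen) if valid_gen else 0.0,
        "novel_over_all": len(novel) / n,
        "recall_overall": len(recovered) / len(reference_set),
        "recall_in_training": len(recovered & in_train) / len(in_train) if in_train else 0.0,
        "recall_out_of_training": len(recovered & out_train) / len(out_train) if out_train else 0.0,
    }

metrics = evaluate(
    generated=["t1", "t2", "t3", "bad"],
    valid={"t1", "t2", "t3"},
    training_set={"t1"},
    reference_set={"t1", "t2", "t9"},
)
print(metrics["validity"])  # 0.75
```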
• Benchmarks (Mandatory):
  o Held-out ChEMBL MMPT test set (in-distribution)
  o Within-patent analog generation (PMV17)
  o Cross-patent analog generation (PMV17 → PMV21)
• Testing Data (Optional): Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)

6. Model Architecture

• Architecture Details (Optional): Provide a detailed description of the model’s architecture.
• Diagram (Optional): Include a diagram of the architecture, if available.
7. Usage

• Sample Inference Code (Mandatory): Described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
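Since the card lists the pipeline tag text2text-generation with the Transformers library, generation can be sketched with the standard Hugging Face seq2seq API. The checkpoint path, the input formatting helper, and the generation settings below are all assumptions, not the repository's actual interface; consult the GitHub link above for the real inference code.

```python
# Hedged sketch of variable-to-variable generation with beam search via
# the standard Hugging Face seq2seq API. The checkpoint path, input
# formatting, and generation settings are assumptions; see the GitHub
# repository for the project's actual interface.
import os

CKPT = "path/to/mmpt-fm-checkpoint"  # hypothetical local checkpoint path

def format_input(variable_smarts: str) -> str:
    """Serialize the variable fragment as the source sequence
    (this formatting is an assumption, not the repository's scheme)."""
    return variable_smarts.strip()

# Guarded so the sketch only runs when a checkpoint is actually present.
if os.path.isdir(CKPT):
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(CKPT)
    model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)
    inputs = tokenizer(format_input("[*:1]OC"), return_tensors="pt")
    # Beam search over candidate replacement fragments.
    outputs = model.generate(**inputs, num_beams=10,
                             num_return_sequences=5, max_new_tokens=64)
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))
```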
• Sample Fine-Tuning Code (Optional): Provide example code for fine-tuning the model.
• Prompt Format (Optional): For generative models, specify the expected prompt format.
• Hardware Requirements (Optional): Specify the hardware required to run the model efficiently.
• GitHub Links (Optional): Include links to GitHub repositories showing existing integrations and usage examples.
8. Citation

• BibTeX (Mandatory):

@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
  year={2026},
  eprint={2602.16684},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.16684},
}

@article{doi:10.26434/chemrxiv.15001722/v1,
  author={Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  journal={ChemRxiv},
  volume={2026},
  number={0407},
  pages={},
  year={2026},
  doi={10.26434/chemrxiv.15001722/v1},
  url={https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
  eprint={https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
}

• DOI (Optional): Include a DOI if available.
9. Contributors

• Names and Roles (Optional): Additional list of contributors (not developers – see Section 1, Model Overview, Developed by) and their roles in the model’s development.

10. Contact Information

• Support Channels (Optional): List channels for support or feedback (e.g., email, forums, GitHub issues).

11. Acknowledgements (Optional): Recognize any additional contributors, sponsors, or related works.

12. Disclaimer

• Legal Disclaimer (Mandatory): Include any legal disclaimers regarding the model’s use.