Merck/MMPT-FM

model-ingest committed 4 days ago

Commit

f87bed3

verified ·

1 Parent(s): 43cdbee

Update README.md

Files changed (1)

README.md +84 -45

README.md CHANGED

@@ -1,50 +1,82 @@
 ---
 license: mit
 ---
-## 1. Model Overview
-- **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
-- **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
-- **Model Specification:** Encoder–decoder Transformer. 220M parameters.
 - **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
 - **License:** MIT license.
 - **Base Model:** ChemT5 (chemistry-domain pretrained T5).
 - **Model Type:** Transformer
-- **Languages:** SMARTS (chemical substructure representation)
 - **Pipeline Tag:** text2text-generation for MMP transformation
 - **Library:** Transformers, PyTorch
-## 2. Intended Use
 - **Direct Use:**
-  - Generation of chemically valid **matched molecular pair transformations (MMPTs)**.
-  - Analog design at a **user-specified edit site** (R-group substitution or core hopping)
 - **Downstream Use:**
-  - Integration into analog enumeration pipelines
-  - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
-## 3. Bias, Risks, and Limitations
-- **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
 - **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
-- **Risk Areas:** Our framework is intended for research use, and does not introduce specific ethical concerns.
-## 4. Training Details
-- **Training Data:** Raw data is downloaded from ChEMBL database and available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
 - **Training Data Preprocessing:**
-  - Drug-likeness filtering using *rule_of_druglike_soft*
   - Molecular weight ≥ 200 Da
   - Removal of structural alerts using the curated Walters alert list
-  - Data is processed with MMPDB that is available at https://github.com/rdkit/mmpdb.
-- **Pre-Training:** Base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
 - **Training Procedure:**
-  - Supervised sequence-to-sequence learning with teacher forcing
   - Cross-entropy loss
   - Batch size: 64
-  - Learning rate: 5 × 10⁻⁴
   - Hardware:
     - MMPT-FM: 4 × NVIDIA A6000 GPUs
-    - MMP-based baselines: 4 × NVIDIA H100 GPUs
-## 5. Evaluation
 - **Metrics:**
   - Validity
   - Novelty (Novel/valid, Novel/all)
@@ -55,30 +87,37 @@ license: mit
   - Cross-patent analog generation (PMV17 → PMV21)
 - **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
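
As a rough illustration of the validity and novelty metrics listed above, the sketch below computes them for a batch of generated strings. The definitions are assumptions and are not taken from the model card or the publications: validity is the fraction of outputs accepted by a caller-supplied check (`is_valid` is a hypothetical callable, in practice probably a cheminformatics parser), and novelty is the share of valid outputs absent from the training set, reported per valid output and per generated output.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, novel/valid, and novel/all for generated strings.

    Assumed definitions (not confirmed by the source):
      validity    = valid outputs / all outputs
      novel/valid = valid outputs not in training data / valid outputs
      novel/all   = valid outputs not in training data / all outputs
    """
    outputs = list(generated)
    known = set(training_set)

    # Keep only outputs that pass the caller's validity check.
    valid = [s for s in outputs if is_valid(s)]
    # A valid output counts as novel if it was never seen in training.
    novel = [s for s in valid if s not in known]

    n_all, n_valid, n_novel = len(outputs), len(valid), len(novel)
    return {
        "validity": n_valid / n_all if n_all else 0.0,
        "novel_per_valid": n_novel / n_valid if n_valid else 0.0,
        "novel_per_all": n_novel / n_all if n_all else 0.0,
    }
```

Membership is tested by exact string match, so strings should be canonicalized the same way on both sides before calling the function.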
-## 6. Usage
-- **Sample Inference Code:** Described conceptually in the publications ; code corresponds to variable-to-variable generation with beam search can be found at our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
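
For orientation only, the sketch below shows what beam-search generation with a T5-style checkpoint typically looks like in the Transformers library. Several things here are assumptions rather than facts from the model card: the repository id `Merck/MMPT-FM` (inferred from the page header), that the checkpoint loads through `AutoModelForSeq2SeqLM`, the beam settings, and that the input is a bare variable-fragment string. The helper names are hypothetical. The GitHub repository linked above remains the authoritative reference for the real input format and decoding settings.

```python
from typing import List


def beam_search_kwargs(
    num_beams: int = 10,
    num_return_sequences: int = 10,
    max_new_tokens: int = 64,
) -> dict:
    """Build keyword arguments for `generate` (values are illustrative)."""
    # Beam search cannot return more sequences than it keeps beams.
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must not exceed num_beams")
    return {
        "num_beams": num_beams,
        "num_return_sequences": num_return_sequences,
        "max_new_tokens": max_new_tokens,
        "do_sample": False,  # deterministic beam search, no sampling
    }


def generate_transformations(
    variable_fragment: str,
    model_id: str = "Merck/MMPT-FM",  # assumed repository id
    **overrides: int,
) -> List[str]:
    """Return candidate output strings for one input fragment."""
    # Imported lazily so the helper above is usable without Transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Assumption: the model takes the fragment string without any prefix.
    inputs = tokenizer(variable_fragment, return_tensors="pt")
    outputs = model.generate(**inputs, **beam_search_kwargs(**overrides))
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Decoded strings are not guaranteed to be chemically valid, so a validity check is still needed before using them.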
-## 7. Citation
 ```bibtex
-@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
-      title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
-      author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
-      year={2026},
-      eprint={2602.16684},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG},
-      url={https://arxiv.org/abs/2602.16684},
 }
-@article{
-doi:10.26434/chemrxiv.15001722/v1,
-author = {Hao-Wei Pang  and Peter Zhiping Zhang  and Bo Pan  and Liang Zhao  and Xiang Yu  and Liying Zhang },
-title = {Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
-journal = {ChemRxiv},
-volume = {2026},
-number = {0407},
-pages = {},
-year = {2026},
-doi = {10.26434/chemrxiv.15001722/v1},
-URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
-eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
 }

 ---
 license: mit
 ---
+# 1. Model Overview
+- **Model Name:** MMPT-FM and its MMP variants
+- **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants, namely MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable), are generative foundation models designed to support medicinal chemistry analog design. The models learn either from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or from matched molecular pairs (MMPs), both derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
+- **Model Specification:** Encoder–decoder Transformer. 220M parameters for each model.
 - **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
 - **License:** MIT license.
 - **Base Model:** ChemT5 (chemistry-domain pretrained T5).
 - **Model Type:** Transformer
+- **Languages:** SMARTS and SMILES (chemical structure and substructure representations)
 - **Pipeline Tag:** text2text-generation for MMP transformation
 - **Library:** Transformers, PyTorch
+---
+# 2. Intended Use
 - **Direct Use:**
+  - **MMPT-FM:**
+    - Generation of chemically valid matched molecular pair transformations (MMPTs)
+    - Analog design at a user-specified edit site
+  - **MMP-M2M:**
+    - Generation of chemically valid matched molecular pairs (MMPs)
+  - **MMP-M2T:**
+    - Generation of chemically valid matched molecular pair transformations
+    - Analog design at a user-specified edit site
+  - **MMP-C2V:**
+    - Analog design at a user-specified edit site
 - **Downstream Use:**
+  - **MMPT-FM:**
+    - Integration into analog enumeration pipelines
+    - Integration into high-throughput virtual screening pipelines
+    - Use as the base model for retrieval-augmented generation (MMPT-RAG)
+  - **MMP-M2M:**
+    - Integration into analog enumeration pipelines
+    - Integration into high-throughput virtual screening pipelines
+  - **MMP-M2T:**
+    - Integration into analog enumeration pipelines
+    - Integration into high-throughput virtual screening pipelines
+  - **MMP-C2V:**
+    - Integration into analog enumeration pipelines
+    - Integration into high-throughput virtual screening pipelines
+---
+# 3. Bias, Risks, and Limitations
+- **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
 - **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
+- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
+- **Recommendations:** None
+---
+# 4. Training Details
+- **Training Data:** Raw data was downloaded from the ChEMBL database, available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
 - **Training Data Preprocessing:**
+  - Drug-likeness filtering using `rule_of_druglike_soft`
   - Molecular weight ≥ 200 Da
   - Removal of structural alerts using the curated Walters alert list
+  - Data is processed with MMPDB, which is available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb).
+- **Pre-Training:** Base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
 - **Training Procedure:**
+  - Supervised sequence-to-sequence learning
   - Cross-entropy loss
   - Batch size: 64
+  - Learning rate: `5 × 10⁻⁴`
   - Hardware:
     - MMPT-FM: 4 × NVIDIA A6000 GPUs
+    - MMP variants: 4 × NVIDIA H100 GPUs
+---
+# 5. Evaluation
 - **Metrics:**
   - Validity
   - Novelty (Novel/valid, Novel/all)
   - Cross-patent analog generation (PMV17 → PMV21)
 - **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
+---
+# 6. Usage
+- **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer).
+- **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)
+---
+# 7. Citation
+**BibTeX:**
 ```bibtex
+@article{pang2026scalable,
+  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
+  author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
+  year={2026}
 }
+@article{pan2026retrieval,
+  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
+  author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
+  journal={arXiv preprint arXiv:2602.16684},
+  year={2026}
+}
+@article{pan2026transformer,
+  title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
+  author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
+  journal={arXiv preprint arXiv:2601.07930},
+  year={2026}
 }
+```