Merck/MMPT-FM

model-ingest committed 6 days ago

Commit 87be3b1 (verified) · Parent: 1d396a8

Update README.md

Files changed (1)

README.md +60 -74

@@ -2,70 +2,66 @@
 license: mit
 ---
-1. Model Overview
-•	Model Name (Mandatory): MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
-•	Summary (Mandatory): MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
-•	Model Specification (Mandatory): Encoder–decoder Transformer. 220M parameters.
-•	Developed by (Mandatory): Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
-•	License (Mandatory): MIT license.
-•	Base Model (Mandatory): ChemT5 (chemistry-domain pretrained T5).
-•	Model Type (Mandatory): Transformer
-•	Languages (Optional): SMARTS (chemical substructure representation)
-•	Pipeline Tag (Mandatory): text2text-generation for MMP transformation
-•	Library (Mandatory): Transformers, PyTorch
-2. Intended Use
-•	Direct Use (Mandatory):
-o	Generation of chemically valid matched molecular pair transformations (MMPTs).
-o	Analog design at a user-specified edit site (R-group substitution or core hopping)
-•	Downstream Use (Mandatory):
-o	Integration into analog enumeration pipelines
-o	Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
-•	Out-of-Scope Use (Optional): None
-3. Bias, Risks, and Limitations
-•	Known Limitations (Mandatory): The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
-•	Biases (Mandatory): Inherits biases from ChEMBL-derived medicinal chemistry literature.
-•	Risk Areas (Mandatory): Our framework is intended for research use, and does not introduce specific ethical concerns.
-•	Recommendations (Mandatory): None
-4. Training Details
-•	Training Data (Mandatory): Raw data is downloaded from ChEMBL database and available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
-•	Training Data Preprocessing (Optional):
-o	Drug-likeness filtering using rule_of_druglike_soft
-o	Molecular weight ≥ 200 Da
-o	Removal of structural alerts using the curated Walters alert list
-o	Data is processed with MMPDB that is available at https://github.com/rdkit/mmpdb.
-•	Pre-Training (Optional): Base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
-•	Training Procedure (Mandatory):
-o	Supervised sequence-to-sequence learning with teacher forcing
-o	Cross-entropy loss
-o	Batch size: 64
-o	Learning rate: 5 × 10⁻⁴
-o	Hardware:
-	MMPT-FM: 4 × NVIDIA A6000 GPUs
-	MMP-based baselines: 4 × NVIDIA H100 GPUs
-•	Fine-Tuning (Optional): None.
-•	Environmental impact (Optional): None.
-•	Societal Impact Assessment (Optional): None.
-5. Evaluation
-•	Metrics (Optional):
-o	Validity
-o	Novelty (Novel/valid, Novel/all)
-o	Recall (overall, in-training, out-of-training)
-•	Benchmarks (Mandatory):
-o	Held-out ChEMBL MMPT test set (in-distribution)
-o	Within-patent analog generation (PMV17)
-o	Cross-patent analog generation (PMV17 → PMV21)
-•	Testing Data (Optional): Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
-6. Model Architecture
-•	Architecture Details (Optional): Provide a detailed description of the model’s architecture.
-•	Diagram (Optional): Include a diagram of the architecture, if available.
-7. Usage
-•	Sample Inference Code (Mandatory): Described conceptually in the publications ; code corresponds to variable-to-variable generation with beam search can be found at our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
-•	Sample Fine-Tuning Code (Optional): Provide example code for fine-tuning the model.
-•	Prompt Format (Optional): For generative models, specify the expected prompt format.
-•	Hardware Requirements (Optional): Specify the hardware required to run the model efficiently.
-•	GitHub Links (Optional): Include links to GitHub repositories showing existing integrations and usage examples.
-8. Citation
-•	BibTeX (Mandatory):
 @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
       title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
       author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
@@ -88,13 +84,3 @@ doi = {10.26434/chemrxiv.15001722/v1},
 URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
 eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
 }
-•	DOI (Optional): Include a DOI if available.
-9. Contributors
-•	Names and Roles (Optional): Additional list of contributors (not developers – see section Model Overview – Developed by) and their roles in the model’s development.
-10. Contact Information
-•	Support Channels (Optional): List channels for support or feedback (e.g., email, forums, GitHub issues).
-11. Acknowledgements (Optional): Recognize any additional contributors, sponsors, or related works.
-12. Disclaimer
-•	Legal Disclaimer (Mandatory): Include any legal disclaimers regarding the model’s use.

 license: mit
 ---
+## 1. Model Overview
+- **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
+- **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
+- **Model Specification:** Encoder–decoder Transformer. 220M parameters.
+- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
+- **License:** MIT license.
+- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
+- **Model Type:** Transformer
+- **Languages:** SMARTS (chemical substructure representation)
+- **Pipeline Tag:** text2text-generation for MMP transformation
+- **Library:** Transformers, PyTorch
+## 2. Intended Use
+- **Direct Use:**
+  - Generation of chemically valid **matched molecular pair transformations (MMPTs)**.
+  - Analog design at a **user-specified edit site** (R-group substitution or core hopping)
+- **Downstream Use:**
+  - Integration into analog enumeration pipelines
+  - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
+## 3. Bias, Risks, and Limitations
+- **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
+- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
+- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
+## 4. Training Details
+- **Training Data:** Raw data was downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
+- **Training Data Preprocessing:**
+  - Drug-likeness filtering using *rule_of_druglike_soft*
+  - Molecular weight ≥ 200 Da
+  - Removal of structural alerts using the curated Walters alert list
+  - Data is processed with MMPDB, which is available at https://github.com/rdkit/mmpdb.
+- **Pre-Training:** Base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
+- **Training Procedure:**
+  - Supervised sequence-to-sequence learning with teacher forcing
+  - Cross-entropy loss
+  - Batch size: 64
+  - Learning rate: 5 × 10⁻⁴
+  - Hardware:
+    - MMPT-FM: 4 × NVIDIA A6000 GPUs
+    - MMP-based baselines: 4 × NVIDIA H100 GPUs
+## 5. Evaluation
+- **Metrics:**
+  - Validity
+  - Novelty (Novel/valid, Novel/all)
+  - Recall (overall, in-training, out-of-training)
+- **Benchmarks:**
+  - Held-out ChEMBL MMPT test set (in-distribution)
+  - Within-patent analog generation (PMV17)
+  - Cross-patent analog generation (PMV17 → PMV21)
+- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
+## 6. Usage
+- **Sample Inference Code:** Inference is described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
+## 7. Citation
+- **BibTeX:**
+```bibtex
 @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
       title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
       author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
 URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
 eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
 }
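
The Evaluation section of the README lists validity, novelty (Novel/valid, Novel/all), and recall (overall, in-training, out-of-training) without defining them. The sketch below shows one plausible way to compute such metrics over sets of transformation strings. The definitions are assumptions, not the authors' evaluation code: the function name `evaluate_generation`, the dictionary keys, and the `is_valid` callback are all hypothetical, and a real validity check would come from a cheminformatics toolkit rather than the toy string test used here.

```python
from typing import Callable, Dict, Iterable, Set


def evaluate_generation(
    generated: Iterable[str],
    reference: Set[str],
    training: Set[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, novelty, and recall over transformation strings.

    All metric definitions here are assumed for illustration only.
    """

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 instead of raising when a denominator set is empty.
        return numerator / denominator if denominator else 0.0

    unique = set(generated)                     # de-duplicate model outputs
    valid = {t for t in unique if is_valid(t)}  # outputs accepted by the checker
    novel = valid - training                    # valid outputs not seen in training
    hits = valid & reference                    # reference items the model recovered
    ref_in = reference & training               # reference items also in training
    ref_out = reference - training              # reference items absent from training

    return {
        "validity": ratio(len(valid), len(unique)),
        "novel_over_valid": ratio(len(novel), len(valid)),
        "novel_over_all": ratio(len(novel), len(unique)),
        "recall_overall": ratio(len(hits), len(reference)),
        "recall_in_training": ratio(len(hits & ref_in), len(ref_in)),
        "recall_out_of_training": ratio(len(hits & ref_out), len(ref_out)),
    }


# Toy placeholder strings, not real SMARTS; the validity test is a stand-in.
scores = evaluate_generation(
    generated=["A>>B", "A>>C", "bad", "A>>D", "A>>B"],
    reference={"A>>B", "A>>C", "A>>F"},
    training={"A>>B", "A>>E"},
    is_valid=lambda t: ">>" in t,
)
print(scores)
```

Passing the validity check in as a callable keeps the sketch free of any toolkit dependency; whether duplicates should be collapsed before scoring, as done here, is also an assumption.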