Safetensors
model-ingest committed on
Commit
87be3b1
·
verified ·
1 Parent(s): 1d396a8

Update README.md

Files changed (1)
  1. README.md +60 -74
README.md CHANGED
@@ -2,70 +2,66 @@
2
  license: mit
3
  ---
4
 
5
- 1. Model Overview
6
- Model Name (Mandatory): MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
7
- • Summary (Mandatory): MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
8
- Model Specification (Mandatory): Encoder–decoder Transformer. 220M parameters.
9
- Developed by (Mandatory): Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
10
- • License (Mandatory): MIT license.
11
- Base Model (Mandatory): ChemT5 (chemistry-domain pretrained T5).
12
- Model Type (Mandatory): Transformer
13
- • Languages (Optional): SMARTS (chemical substructure representation)
14
- Pipeline Tag (Mandatory): text2text-generation for MMP transformation
15
- • Library (Mandatory): Transformers, PyTorch
16
- 2. Intended Use
17
- • Direct Use (Mandatory):
18
- o Generation of chemically valid matched molecular pair transformations (MMPTs).
19
- o Analog design at a user-specified edit site (R-group substitution or core hopping)
20
- • Downstream Use (Mandatory):
21
- o Integration into analog enumeration pipelines
22
- o Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
23
- • Out-of-Scope Use (Optional): None
24
- 3. Bias, Risks, and Limitations
25
- • Known Limitations (Mandatory): The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
26
- • Biases (Mandatory): Inherits biases from ChEMBL-derived medicinal chemistry literature.
27
- • Risk Areas (Mandatory): Our framework is intended for research use, and does not introduce specific ethical concerns.
28
- • Recommendations (Mandatory): None
29
- 4. Training Details
30
- • Training Data (Mandatory): Raw data was downloaded from the ChEMBL database, which is available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
31
- Training Data Preprocessing (Optional):
32
- o Drug-likeness filtering using rule_of_druglike_soft
33
- o Molecular weight 200 Da
34
- o Removal of structural alerts using the curated Walters alert list
35
- o Data is processed with MMPDB, which is available at https://github.com/rdkit/mmpdb.
36
- • Pre-Training (Optional): Base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
37
- Training Procedure (Mandatory):
38
- o Supervised sequence-to-sequence learning with teacher forcing
39
- o Cross-entropy loss
40
- o Batch size: 64
41
- o Learning rate: 5 × 10⁻⁴
42
- o Hardware:
43
- ▪ MMPT-FM: 4 × NVIDIA A6000 GPUs
44
- ▪ MMP-based baselines: 4 × NVIDIA H100 GPUs
45
- • Fine-Tuning (Optional): None.
46
- • Environmental impact (Optional): None.
47
- • Societal Impact Assessment (Optional): None.
48
- 5. Evaluation
49
- • Metrics (Optional):
50
- o Validity
51
- o Novelty (Novel/valid, Novel/all)
52
- o Recall (overall, in-training, out-of-training)
53
- • Benchmarks (Mandatory):
54
- o Held-out ChEMBL MMPT test set (in-distribution)
55
- o Within-patent analog generation (PMV17)
56
- o Cross-patent analog generation (PMV17 → PMV21)
57
- • Testing Data (Optional): Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
58
- 6. Model Architecture
59
- • Architecture Details (Optional): Provide a detailed description of the model’s architecture.
60
- • Diagram (Optional): Include a diagram of the architecture, if available.
61
- 7. Usage
62
- • Sample Inference Code (Mandatory): Described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
63
- • Sample Fine-Tuning Code (Optional): Provide example code for fine-tuning the model.
64
- • Prompt Format (Optional): For generative models, specify the expected prompt format.
65
- • Hardware Requirements (Optional): Specify the hardware required to run the model efficiently.
66
- • GitHub Links (Optional): Include links to GitHub repositories showing existing integrations and usage examples.
67
- 8. Citation
68
- • BibTeX (Mandatory):
69
  @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
70
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
71
  author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
@@ -88,13 +84,3 @@ doi = {10.26434/chemrxiv.15001722/v1},
88
  URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
89
  eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
90
  }
91
-
92
- • DOI (Optional): Include a DOI if available.
93
- 9. Contributors
94
- • Names and Roles (Optional): Additional list of contributors (not developers – see section Model Overview – Developed by) and their roles in the model’s development.
95
- 10. Contact Information
96
- • Support Channels (Optional): List channels for support or feedback (e.g., email, forums, GitHub issues).
97
- 11. Acknowledgements (Optional): Recognize any additional contributors, sponsors, or related works.
98
- 12. Disclaimer
99
- • Legal Disclaimer (Mandatory): Include any legal disclaimers regarding the model’s use.
100
-
 
2
  license: mit
3
  ---
4
 
5
+ ## 1. Model Overview
6
+ - **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
7
+ - **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
8
+ - **Model Specification:** Encoder–decoder Transformer. 220M parameters.
9
+ - **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
10
+ - **License:** MIT license.
11
+ - **Base Model:** ChemT5 (chemistry-domain pretrained T5).
12
+ - **Model Type:** Transformer
13
+ - **Languages:** SMARTS (chemical substructure representation)
14
+ - **Pipeline Tag:** text2text-generation for MMP transformation
15
+ - **Library:** Transformers, PyTorch
16
+
17
+ ## 2. Intended Use
18
+ - **Direct Use:**
19
+   - Generation of chemically valid **matched molecular pair transformations (MMPTs)**.
20
+   - Analog design at a **user-specified edit site** (R-group substitution or core hopping)
21
+ - **Downstream Use:**
22
+   - Integration into analog enumeration pipelines
23
+   - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
24
+
25
+ ## 3. Bias, Risks, and Limitations
26
+ - **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
27
+ - **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
28
+ - **Risk Areas:** Our framework is intended for research use, and does not introduce specific ethical concerns.
29
+
30
+ ## 4. Training Details
31
+ - **Training Data:** Raw data was downloaded from the ChEMBL database, which is available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
32
+ - **Training Data Preprocessing:**
33
+   - Drug-likeness filtering using *rule_of_druglike_soft*
34
+   - Molecular weight 200 Da
35
+   - Removal of structural alerts using the curated Walters alert list
36
+   - Data is processed with MMPDB, which is available at https://github.com/rdkit/mmpdb.
37
+ - **Pre-Training:** Base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
38
+ - **Training Procedure:**
39
+   - Supervised sequence-to-sequence learning with teacher forcing
40
+   - Cross-entropy loss
41
+   - Batch size: 64
42
+   - Learning rate: 5 × 10⁻⁴
43
+   - Hardware:
44
+     - MMPT-FM: 4 × NVIDIA A6000 GPUs
45
+     - MMP-based baselines: 4 × NVIDIA H100 GPUs
46
+
47
+ ## 5. Evaluation
48
+ - **Metrics:**
49
+   - Validity
50
+   - Novelty (Novel/valid, Novel/all)
51
+   - Recall (overall, in-training, out-of-training)
52
+ - **Benchmarks:**
53
+   - Held-out ChEMBL MMPT test set (in-distribution)
54
+   - Within-patent analog generation (PMV17)
55
+   - Cross-patent analog generation (PMV17 → PMV21)
56
+ - **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
57
+
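Reviewer note on the Evaluation metrics: the README names validity, novelty (Novel/valid, Novel/all), and recall (overall, in-training, out-of-training) but does not define them. The sketch below shows one plausible reading, and every definition in it is an assumption rather than the authors' protocol: outputs are deduplicated, "novel" means valid and absent from the training transformations, and recall is measured against a reference set split by training membership. The function name `generation_metrics` and the `is_valid` predicate are hypothetical; in practice the predicate might wrap a cheminformatics parser.

```python
from typing import Callable, Dict, Iterable, Set


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def generation_metrics(
    generated: Iterable[str],
    reference: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute assumed validity, novelty, and recall for generated transformations."""
    unique: Set[str] = set(generated)            # deduplicated model outputs
    reference_set: Set[str] = set(reference)     # ground-truth transformations
    training_set: Set[str] = set(training)       # transformations seen in training

    valid = {g for g in unique if is_valid(g)}   # outputs passing the validity check
    novel = valid - training_set                 # valid outputs unseen in training

    ref_in_training = reference_set & training_set
    ref_out_of_training = reference_set - training_set

    return {
        "validity": _ratio(len(valid), len(unique)),
        "novel_over_valid": _ratio(len(novel), len(valid)),
        "novel_over_all": _ratio(len(novel), len(unique)),
        "recall_overall": _ratio(len(valid & reference_set), len(reference_set)),
        "recall_in_training": _ratio(len(valid & ref_in_training), len(ref_in_training)),
        "recall_out_of_training": _ratio(
            len(valid & ref_out_of_training), len(ref_out_of_training)
        ),
    }
```

Passing the validity check in as a callable keeps the sketch independent of any particular chemistry toolkit; the strings can be any canonical encoding of a transformation, as long as all three sets use the same one.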
58
+ ## 6. Usage
59
+ - **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
60
+
61
+ ## 7. Citation
62
+ - **BibTeX:**
63
+
64
+ ```bibtex
 
 
 
 
65
  @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
66
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
67
  author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
 
84
  URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
85
  eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
86
  }
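
Reviewer note on the Usage section: the README points to the GitHub repository for inference code and describes it as variable-to-variable generation with beam search. A minimal sketch of what that could look like with the Transformers library is given below. It assumes the checkpoint loads through the generic seq2seq auto classes, which the card does not confirm, and the helper names `load_model` and `generate_transformations`, the default beam settings, and the input string format are all hypothetical. The README does not document a prompt format, so the repository should be treated as authoritative.

```python
from typing import Any, List


def load_model(model_id: str):
    """Load a seq2seq checkpoint; model_id must be supplied by the caller."""
    # Imported lazily so the generation helper below can be used with any
    # objects exposing the same tokenizer/model interface.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer


def generate_transformations(
    model: Any,
    tokenizer: Any,
    source: str,
    num_beams: int = 10,
    num_return_sequences: int = 10,
    max_new_tokens: int = 64,
) -> List[str]:
    """Beam-search decode candidate outputs for one source variable fragment."""
    if num_return_sequences > num_beams:
        # Beam search cannot return more hypotheses than it tracks.
        raise ValueError("num_return_sequences must not exceed num_beams")

    # Tokenize the source string (assumed to be the variable fragment as text).
    inputs = tokenizer(source, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
        max_new_tokens=max_new_tokens,
        early_stopping=True,
    )
    # Decode every returned beam back to text, dropping special tokens.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

A caller would obtain `model, tokenizer = load_model(...)` with the actual repository id or a local checkpoint path and then pass a source fragment to `generate_transformations`. Returning several beams per query fits the analog-enumeration use described in the card, where a ranked list of candidate replacements is more useful than a single output.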