---
license: mit
---

## 1. Model Overview

- Model Name (Mandatory): MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
- Summary (Mandatory): MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- Model Specification (Mandatory): Encoder–decoder Transformer, 220M parameters.
- Developed by (Mandatory): Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- License (Mandatory): MIT license.
- Base Model (Mandatory): ChemT5 (chemistry-domain pretrained T5).
- Model Type (Mandatory): Transformer
- Languages (Optional): SMARTS (chemical substructure representation)
- Pipeline Tag (Mandatory): text2text-generation for MMP transformation
- Library (Mandatory): Transformers, PyTorch
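In MMP tooling, variable-to-variable transformations are commonly written as paired fragment strings joined by `>>`, with numbered attachment points such as `[*:1]`. As a minimal illustration of handling that representation (a sketch only — the model's exact input/output serialization may differ), a stdlib check that a transformation string is well formed:

```python
import re

def parse_mmpt(transformation: str) -> tuple[str, str]:
    """Split a 'variable>>variable' transformation and check that the
    attachment-point labels ([*:1], [*:2], ...) match on both sides."""
    lhs, sep, rhs = transformation.partition(">>")
    if not sep:
        raise ValueError("expected a transformation of the form 'lhs>>rhs'")

    def points(fragment: str) -> list[str]:
        # Collect the numeric labels of all attachment points in the fragment.
        return sorted(re.findall(r"\[\*:(\d+)\]", fragment))

    if points(lhs) != points(rhs):
        raise ValueError("attachment points differ between the two sides")
    return lhs, rhs
```

For example, `parse_mmpt("[*:1]c1ccccc1>>[*:1]c1ccncc1")` (a phenyl-to-pyridyl swap) returns the two fragments, while mismatched attachment labels raise an error.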
## 2. Intended Use

- Direct Use (Mandatory):
  - Generation of chemically valid matched molecular pair transformations (MMPTs).
  - Analog design at a user-specified edit site (R-group substitution or core hopping).
- Downstream Use (Mandatory):
  - Integration into analog enumeration pipelines.
  - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry.
- Out-of-Scope Use (Optional): None
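The MMPT-RAG retrieval step can be pictured as ranking a library of stored transformations by similarity to the query context. The sketch below is purely illustrative and uses hypothetical names: it assumes fingerprints are precomputed as bit sets and ranks by Tanimoto similarity; the actual retriever and fingerprint choice are described in the publications, not here.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def retrieve(query_fp: frozenset, library: dict, k: int = 2) -> list:
    """Return the k transformations whose stored context fingerprints are
    most similar to the query. `library` maps transformation strings to
    fingerprint bit sets (hypothetical representation)."""
    ranked = sorted(library.items(),
                    key=lambda item: tanimoto(query_fp, item[1]),
                    reverse=True)
    return [transformation for transformation, _ in ranked[:k]]
```

The retrieved transformations would then be supplied to the generator as additional context to bias its suggestions toward series-specific chemistry.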
## 3. Bias, Risks, and Limitations

- Known Limitations (Mandatory): The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
- Biases (Mandatory): Inherits biases from ChEMBL-derived medicinal chemistry literature.
- Risk Areas (Mandatory): The framework is intended for research use and does not introduce specific ethical concerns.
- Recommendations (Mandatory): None
## 4. Training Details

- Training Data (Mandatory): Raw data is downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
- Training Data Preprocessing (Optional):
  - Drug-likeness filtering using rule_of_druglike_soft
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with mmpdb, available at https://github.com/rdkit/mmpdb.
- Pre-Training (Optional): The base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
- Training Procedure (Mandatory):
  - Supervised sequence-to-sequence learning with teacher forcing
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: 5 × 10⁻⁴
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP-based baselines: 4 × NVIDIA H100 GPUs
- Fine-Tuning (Optional): None.
- Environmental Impact (Optional): None.
- Societal Impact Assessment (Optional): None.
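The training objective above — sequence-to-sequence learning with teacher forcing and cross-entropy loss — amounts to feeding the gold target prefix into the decoder at each step and averaging the negative log-probability of the gold next token. A minimal stdlib sketch of that computation, assuming per-step probability distributions are already available:

```python
import math

def teacher_forced_ce(step_probs: list, target_tokens: list) -> float:
    """Cross-entropy under teacher forcing: at step t the decoder is
    conditioned on the gold prefix, and the loss is -log p(gold token),
    averaged over steps. `step_probs[t]` maps candidate tokens to the
    model's probability at step t."""
    assert len(step_probs) == len(target_tokens)
    nll = sum(-math.log(dist[token])
              for dist, token in zip(step_probs, target_tokens))
    return nll / len(target_tokens)
```

In the real model this is computed over SMARTS token vocabularies by the framework's loss function; the sketch only makes the arithmetic explicit.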
## 5. Evaluation

- Metrics (Optional):
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- Benchmarks (Mandatory):
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- Testing Data (Optional): Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
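The metrics listed above can be computed from sets of transformation strings. The definitions below are plausible readings for illustration — validity over all generations, novelty relative to the training set, recall against a held-out reference — and are assumptions, not necessarily the papers' exact formulations:

```python
def eval_generations(generated: set, valid: set,
                     training_set: set, reference: set) -> dict:
    """Toy evaluation over sets of transformation strings.
    `generated`: all model outputs; `valid`: the chemically valid subset;
    `training_set`: transformations seen during training;
    `reference`: the held-out gold transformations."""
    novel = valid - training_set
    return {
        "validity": len(valid) / len(generated),
        "novel/valid": len(novel) / len(valid),
        "novel/all": len(novel) / len(generated),
        "recall": len(valid & reference) / len(reference),
    }
```

The in-training vs. out-of-training recall split would follow by partitioning `reference` on membership in `training_set` before calling the function.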
## 6. Model Architecture

- Architecture Details (Optional): Provide a detailed description of the model’s architecture.
- Diagram (Optional): Include a diagram of the architecture, if available.
## 7. Usage

- Sample Inference Code (Mandatory): Described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
- Sample Fine-Tuning Code (Optional): Provide example code for fine-tuning the model.
- Prompt Format (Optional): For generative models, specify the expected prompt format.
- Hardware Requirements (Optional): Specify the hardware required to run the model efficiently.
- GitHub Links (Optional): Include links to GitHub repositories showing existing integrations and usage examples.
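Generation in the linked repository uses beam search. As a model-free illustration of the decoding strategy (a generic stdlib sketch, not the repository's implementation), the loop below keeps the `beam_width` highest-scoring partial sequences at each step and returns the best finished one:

```python
import math

def beam_search(score_next, start: str, beam_width: int = 3,
                max_len: int = 5, eos: str = "</s>") -> list:
    """Generic beam search. `score_next(seq)` returns (token, log_prob)
    continuations for a partial sequence `seq` (a list of tokens)."""
    beams = [([start], 0.0)]          # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for token, token_logp in score_next(seq):
                expanded = (seq + [token], logp + token_logp)
                # Sequences that emit EOS stop growing.
                (finished if token == eos else candidates).append(expanded)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda b: b[1])[0]
```

Unlike greedy decoding, this can prefer a sequence whose first token is locally worse but whose completion scores higher overall.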
## 8. Citation

- BibTeX (Mandatory):

  @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
    title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
    author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
    year={2026},
    eprint={2602.16684},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2602.16684},
  }

  @article{doi:10.26434/chemrxiv.15001722/v1,
    author={Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
    title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
    journal={ChemRxiv},
    volume={2026},
    number={0407},
    year={2026},
    doi={10.26434/chemrxiv.15001722/v1},
    url={https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
  }

- DOI (Optional): Include a DOI if available.
## 9. Contributors

- Names and Roles (Optional): Additional list of contributors (not developers – see Section 1, Model Overview, Developed by) and their roles in the model’s development.

## 10. Contact Information

- Support Channels (Optional): List channels for support or feedback (e.g., email, forums, GitHub issues).

## 11. Acknowledgements

- (Optional): Recognize any additional contributors, sponsors, or related works.

## 12. Disclaimer

- Legal Disclaimer (Mandatory): Include any legal disclaimers regarding the model’s use.