---
license: mit
---

1. Model Overview

• Model Name (Mandatory): MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
• Summary (Mandatory): MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
• Model Specification (Mandatory): Encoder–decoder Transformer, 220M parameters.
• Developed by (Mandatory): Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
• License (Mandatory): MIT license.
• Base Model (Mandatory): ChemT5 (chemistry-domain pretrained T5).
• Model Type (Mandatory): Transformer
• Languages (Optional): SMARTS (chemical substructure representation)
• Pipeline Tag (Mandatory): text2text-generation for MMP transformation
• Library (Mandatory): Transformers, PyTorch
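The transformation-centric formulation above can be illustrated with a deliberately simplified sketch: an MMPT is treated as a context-independent variable-to-variable edit that applies to any molecule containing the variable fragment. The SMILES strings and the plain string matching below are toy assumptions for illustration only; the actual model operates on SMARTS-encoded fragments, not raw string replacement.

```python
# Toy illustration (not the model's actual machinery): an MMPT as a
# context-independent variable-to-variable edit. Real MMPTs are
# SMARTS-encoded fragments; plain string matching is an assumption here.

def apply_mmpt(smiles: str, variable_from: str, variable_to: str):
    """Apply a matched molecular pair transformation to a molecule,
    returning the analog, or None if the variable part is absent."""
    if variable_from not in smiles:
        return None  # transformation does not apply to this molecule
    # Replace only the first occurrence: one edit site per transformation.
    return smiles.replace(variable_from, variable_to, 1)

# Example: swap a methoxy-like fragment for an ethoxy-like one.
analog = apply_mmpt("c1ccccc1OC", "OC", "OCC")
print(analog)  # c1ccccc1OCC
```

Because the transformation carries no series-specific context, the same edit can be reused across chemical series, which is what makes the formulation scalable.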
2. Intended Use

• Direct Use (Mandatory):
  o Generation of chemically valid matched molecular pair transformations (MMPTs).
  o Analog design at a user-specified edit site (R-group substitution or core hopping).
• Downstream Use (Mandatory):
  o Integration into analog enumeration pipelines.
  o Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry.
• Out-of-Scope Use (Optional): None
3. Bias, Risks, and Limitations

• Known Limitations (Mandatory): The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
• Biases (Mandatory): Inherits biases from ChEMBL-derived medicinal chemistry literature.
• Risk Areas (Mandatory): The framework is intended for research use and does not introduce specific ethical concerns.
• Recommendations (Mandatory): None
4. Training Details

• Training Data (Mandatory): Raw data is downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
• Training Data Preprocessing (Optional):
  o Drug-likeness filtering using rule_of_druglike_soft.
  o Molecular weight ≥ 200 Da.
  o Removal of structural alerts using the curated Walters alert list.
  o Data is processed with mmpdb, available at https://github.com/rdkit/mmpdb.
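The three filters above can be sketched as a simple pipeline over records with precomputed properties. The record fields, the example alert names, and the molecules are illustrative assumptions; in practice the properties would come from RDKit and the curated Walters alert list, and MMP indexing is done with mmpdb.

```python
# Sketch of the preprocessing filters described above, applied to records
# with precomputed properties. Field names and the example alert set are
# assumptions; in practice properties come from RDKit and the curated
# Walters structural-alert list, and MMP indexing is done with mmpdb.

WALTERS_ALERTS_TOY = {"nitro", "acyl_halide"}  # placeholder alert names

def passes_filters(record: dict) -> bool:
    """Keep drug-like molecules of at least 200 Da with no structural alerts."""
    return (
        record["druglike_soft"]            # rule_of_druglike_soft verdict
        and record["mol_weight"] >= 200.0  # molecular weight >= 200 Da
        and not (set(record["alerts"]) & WALTERS_ALERTS_TOY)
    )

records = [  # toy records; weights and flags are made up for illustration
    {"smiles": "CC(=O)Nc1ccc(O)cc1", "mol_weight": 151.2,
     "druglike_soft": True, "alerts": []},
    {"smiles": "O=[N+]([O-])c1ccc2ncccc2c1", "mol_weight": 246.2,
     "druglike_soft": True, "alerts": ["nitro"]},
    {"smiles": "CC1=CC(=O)N(c2ccccc2)N1C", "mol_weight": 204.2,
     "druglike_soft": True, "alerts": []},
]
kept = [r["smiles"] for r in records if passes_filters(r)]
print(kept)  # only the third record survives all three filters
```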
• Pre-Training (Optional): The base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
• Training Procedure (Mandatory):
  o Supervised sequence-to-sequence learning with teacher forcing.
  o Cross-entropy loss.
  o Batch size: 64
  o Learning rate: 5 × 10⁻⁴
  o Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP-based baselines: 4 × NVIDIA H100 GPUs
• Fine-Tuning (Optional): None.
• Environmental Impact (Optional): None.
• Societal Impact Assessment (Optional): None.
5. Evaluation

• Metrics (Optional):
  o Validity
  o Novelty (Novel/valid, Novel/all)
  o Recall (overall, in-training, out-of-training)
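These metrics are straightforward set computations once generations are available. The sketch below uses toy transformation identifiers and a precomputed validity set as assumptions; in practice validity would be checked by parsing the generated output (e.g., with RDKit), and recall is split by whether the reference transformation appeared in training.

```python
# Sketch of the evaluation metrics listed above, computed over generated
# transformations. The toy identifiers and the precomputed validity set
# are assumptions; validity would normally be checked by parsing the
# generated output, e.g. with RDKit.

def evaluate(generated, valid, training_set, reference_set):
    n = len(generated)
    valid_gen = [g for g in generated if g in valid]          # Validity
    novel = [g for g in valid_gen if g not in training_set]   # Novelty
    recovered = set(generated) & reference_set                # Recall
    in_train = reference_set & training_set
    out_train = reference_set - training_set
    return {
        "validity": len(valid_gen) / n,
        "novel_over_valid": len(novel) / len(valid_gen) if valid_gen else 0.0,
        "novel_over_all": len(novel) / n,
        "recall_overall": len(recovered) / len(reference_set),
        "recall_in_training": len(recovered & in_train) / len(in_train) if in_train else 0.0,
        "recall_out_of_training": len(recovered & out_train) / len(out_train) if out_train else 0.0,
    }

metrics = evaluate(
    generated=["t1", "t2", "t3", "bad"],
    valid={"t1", "t2", "t3"},
    training_set={"t1"},
    reference_set={"t1", "t2", "t9"},
)
print(metrics["validity"])  # 0.75
```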
• Benchmarks (Mandatory):
  o Held-out ChEMBL MMPT test set (in-distribution)
  o Within-patent analog generation (PMV17)
  o Cross-patent analog generation (PMV17 → PMV21)
• Testing Data (Optional): Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)

6. Model Architecture

• Architecture Details (Optional): Provide a detailed description of the model’s architecture.
• Diagram (Optional): Include a diagram of the architecture, if available.
7. Usage

• Sample Inference Code (Mandatory): Described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
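Since the card lists the pipeline tag text2text-generation with the Transformers library, generation can be sketched with the standard Hugging Face seq2seq API. The checkpoint path, the input formatting helper, and the generation settings below are all assumptions, not the repository's actual interface; consult the GitHub link above for the real inference code.

```python
# Hedged sketch of variable-to-variable generation with beam search via
# the standard Hugging Face seq2seq API. The checkpoint path, input
# formatting, and generation settings are assumptions; see the GitHub
# repository for the project's actual interface.
import os

CKPT = "path/to/mmpt-fm-checkpoint"  # hypothetical local checkpoint path

def format_input(variable_smarts: str) -> str:
    """Serialize the variable fragment as the source sequence
    (this formatting is an assumption, not the repository's scheme)."""
    return variable_smarts.strip()

# Guarded so the sketch only runs when a checkpoint is actually present.
if os.path.isdir(CKPT):
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(CKPT)
    model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)
    inputs = tokenizer(format_input("[*:1]OC"), return_tensors="pt")
    # Beam search over candidate replacement fragments.
    outputs = model.generate(**inputs, num_beams=10,
                             num_return_sequences=5, max_new_tokens=64)
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))
```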
• Sample Fine-Tuning Code (Optional): Provide example code for fine-tuning the model.
• Prompt Format (Optional): For generative models, specify the expected prompt format.
• Hardware Requirements (Optional): Specify the hardware required to run the model efficiently.
• GitHub Links (Optional): Include links to GitHub repositories showing existing integrations and usage examples.
8. Citation

• BibTeX (Mandatory):

@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
  year={2026},
  eprint={2602.16684},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.16684},
}

@article{doi:10.26434/chemrxiv.15001722/v1,
  author={Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  journal={ChemRxiv},
  volume={2026},
  number={0407},
  pages={},
  year={2026},
  doi={10.26434/chemrxiv.15001722/v1},
  url={https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
  eprint={https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
}

• DOI (Optional): Include a DOI if available.
9. Contributors

• Names and Roles (Optional): Additional list of contributors (not developers – see Section 1, Model Overview, Developed by) and their roles in the model’s development.

10. Contact Information

• Support Channels (Optional): List channels for support or feedback (e.g., email, forums, GitHub issues).

11. Acknowledgements (Optional): Recognize any additional contributors, sponsors, or related works.

12. Disclaimer

• Legal Disclaimer (Mandatory): Include any legal disclaimers regarding the model’s use.