Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeMachine-learned molecular mechanics force field for the simulation of protein-ligand systems and beyond
The development of reliable and extensible molecular mechanics (MM) force fields -- fast, empirical models characterizing the potential energy surface of molecular systems -- is indispensable for biomolecular simulation and computer-aided drug design. Here, we introduce a generalized and extensible machine-learned MM force field, espaloma-0.3, and an end-to-end differentiable framework using graph neural networks to overcome the limitations of traditional rule-based methods. Trained in a single GPU-day to fit a large and diverse quantum chemical dataset of over 1.1M energy and force calculations, espaloma-0.3 reproduces quantum chemical energetic properties of chemical domains highly relevant to drug discovery, including small molecules, peptides, and nucleic acids. Moreover, this force field maintains the quantum chemical energy-minimized geometries of small molecules and preserves the condensed phase properties of peptides, self-consistently parametrizing proteins and ligands to produce stable simulations leading to highly accurate predictions of binding free energies. This methodology demonstrates significant promise as a path forward for systematically building more accurate force fields that are easily extensible to new chemical domains of interest.
A foundation model for atomistic materials chemistry
Machine-learned force fields have transformed the atomistic modelling of materials by enabling simulations of ab initio quality on unprecedented time and length scales. However, they are currently limited by: (i) the significant computational and human effort that must go into development and validation of potentials for each particular system of interest; and (ii) a general lack of transferability from one chemical system to the next. Here, using the state-of-the-art MACE architecture we introduce a single general-purpose ML model, trained on a public database of 150k inorganic crystals, that is capable of running stable molecular dynamics on molecules and materials. We demonstrate the power of the MACE-MP-0 model -- and its qualitative and at times quantitative accuracy -- on a diverse set problems in the physical sciences, including the properties of solids, liquids, gases, and chemical reactions. The model can be applied out of the box and as a starting or "foundation model" for any atomistic system of interest and is thus a step towards democratising the revolution of ML force fields by lowering the barriers to entry.
Machine Learning Force Fields with Data Cost Aware Training
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation, which finds widespread applications in chemistry and biomedical research. Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels generated by expensive quantum mechanical algorithms, which may scale as O(n^3) to O(n^7), with n proportional to the number of basis functions. To address this issue, we propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data. The motivation behind ASTEROID is that inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Therefore, we first train a MLFF model on a large amount of inaccurate training data, employing a bias-aware loss function to prevent the model from overfitting tahe potential bias of this data. We then fine-tune the obtained model using a small amount of accurate training data, which preserves the knowledge learned from the inaccurate training data while significantly improving the model's accuracy. Moreover, we propose a variant of ASTEROID based on score matching for the setting where the inaccurate training data are unlabeled. Extensive experiments on MD datasets and downstream tasks validate the efficacy of ASTEROID. Our code and data are available at https://github.com/abukharin3/asteroid.
Symmetry-invariant quantum machine learning force fields
Machine learning techniques are essential tools to compute efficient, yet accurate, force fields for atomistic simulations. This approach has recently been extended to incorporate quantum computational methods, making use of variational quantum learning models to predict potential energy surfaces and atomic forces from ab initio training data. However, the trainability and scalability of such models are still limited, due to both theoretical and practical barriers. Inspired by recent developments in geometric classical and quantum machine learning, here we design quantum neural networks that explicitly incorporate, as a data-inspired prior, an extensive set of physically relevant symmetries. We find that our invariant quantum learning models outperform their more generic counterparts on individual molecules of growing complexity. Furthermore, we study a water dimer as a minimal example of a system with multiple components, showcasing the versatility of our proposed approach and opening the way towards larger simulations. Our results suggest that molecular force fields generation can significantly profit from leveraging the framework of geometric quantum machine learning, and that chemical systems represent, in fact, an interesting and rich playground for the development and application of advanced quantum machine learning tools.
Deep Learning for Protein-Ligand Docking: Are We There Yet?
The effects of ligand binding on protein structures and their in vivo functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, to date no prior works have systematically studied the behavior of the latest docking and structure prediction methods within the broadly applicable context of (1) using predicted (apo) protein structures for docking (e.g., for applicability to new proteins); (2) binding multiple (cofactor) ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for generalization to unknown pockets). To enable a deeper understanding of docking methods' real-world utility, we introduce PoseBench, the first comprehensive benchmark for broadly applicable protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL methods for apo-to-holo protein-ligand docking and protein-ligand structure prediction using both primary ligand and multi-ligand benchmark datasets, the latter of which we introduce for the first time to the DL community. Empirically, using PoseBench, we find that (1) DL co-folding methods generally outperform comparable conventional and DL docking baselines, yet popular methods such as AlphaFold 3 are still challenged by prediction targets with novel protein sequences; (2) certain DL co-folding methods are highly sensitive to their input multiple sequence alignments, while others are not; and (3) DL methods struggle to strike a balance between structural accuracy and chemical specificity when predicting novel or multi-ligand protein targets. Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench.
Protein-ligand binding representation learning from fine-grained interactions
The binding between proteins and ligands plays a crucial role in the realm of drug discovery. Previous deep learning approaches have shown promising results over traditional computationally intensive methods, but resulting in poor generalization due to limited supervised data. In this paper, we propose to learn protein-ligand binding representation in a self-supervised learning manner. Different from existing pre-training approaches which treat proteins and ligands individually, we emphasize to discern the intricate binding patterns from fine-grained interactions. Specifically, this self-supervised learning problem is formulated as a prediction of the conclusive binding complex structure given a pocket and ligand with a Transformer based interaction module, which naturally emulates the binding process. To ensure the representation of rich binding information, we introduce two pre-training tasks, i.e.~atomic pairwise distance map prediction and mask ligand reconstruction, which comprehensively model the fine-grained interactions from both structure and feature space. Extensive experiments have demonstrated the superiority of our method across various binding tasks, including protein-ligand affinity prediction, virtual screening and protein-ligand docking.
TorchMD-Net 2.0: Fast Neural Network Potentials for Molecular Simulations
Achieving a balance between computational speed, prediction accuracy, and universal applicability in molecular simulations has been a persistent challenge. This paper presents substantial advancements in the TorchMD-Net software, a pivotal step forward in the shift from conventional force fields to neural network-based potentials. The evolution of TorchMD-Net into a more comprehensive and versatile framework is highlighted, incorporating cutting-edge architectures such as TensorNet. This transformation is achieved through a modular design approach, encouraging customized applications within the scientific community. The most notable enhancement is a significant improvement in computational efficiency, achieving a very remarkable acceleration in the computation of energy and forces for TensorNet models, with performance gains ranging from 2-fold to 10-fold over previous iterations. Other enhancements include highly optimized neighbor search algorithms that support periodic boundary conditions and the smooth integration with existing molecular dynamics frameworks. Additionally, the updated version introduces the capability to integrate physical priors, further enriching its application spectrum and utility in research. The software is available at https://github.com/torchmd/torchmd-net.
Spherical Channels for Modeling Atomic Interactions
Modeling the energy and forces of atomic systems is a fundamental problem in computational chemistry with the potential to help address many of the world's most pressing problems, including those related to energy scarcity and climate change. These calculations are traditionally performed using Density Functional Theory, which is computationally very expensive. Machine learning has the potential to dramatically improve the efficiency of these calculations from days or hours to seconds. We propose the Spherical Channel Network (SCN) to model atomic energies and forces. The SCN is a graph neural network where nodes represent atoms and edges their neighboring atoms. The atom embeddings are a set of spherical functions, called spherical channels, represented using spherical harmonics. We demonstrate, that by rotating the embeddings based on the 3D edge orientation, more information may be utilized while maintaining the rotational equivariance of the messages. While equivariance is a desirable property, we find that by relaxing this constraint in both message passing and aggregation, improved accuracy may be achieved. We demonstrate state-of-the-art results on the large-scale Open Catalyst dataset in both energy and force prediction for numerous tasks and metrics.
Leveraging Side Information for Ligand Conformation Generation using Diffusion-Based Approaches
Ligand molecule conformation generation is a critical challenge in drug discovery. Deep learning models have been developed to tackle this problem, particularly through the use of generative models in recent years. However, these models often generate conformations that lack meaningful structure and randomness due to the absence of essential side information. Examples of such side information include the chemical and geometric features of the target protein, ligand-target compound interactions, and ligand chemical properties. Without these constraints, the generated conformations may not be suitable for further selection and design of new drugs. To address this limitation, we propose a novel method for generating ligand conformations that leverage side information and incorporate flexible constraints into standard diffusion models. Drawing inspiration from the concept of message passing, we introduce ligand-target massage passing block, a mechanism that facilitates the exchange of information between target nodes and ligand nodes, thereby incorporating target node features. To capture non-covalent interactions, we introduce ligand-target compound inter and intra edges. To further improve the biological relevance of the generated conformations, we train energy models using scalar chemical features. These models guide the progress of the standard Denoising Diffusion Probabilistic Models, resulting in more biologically meaningful conformations. We evaluate the performance of SIDEGEN using the PDBBind-2020 dataset, comparing it against other methods. The results demonstrate improvements in both Aligned RMSD and Ligand RMSD evaluations. Specifically, our model outperforms GeoDiff (trained on PDBBind-2020) by 20% in terms of the median aligned RMSD metric.
FABind: Fast and Accurate Protein-Ligand Binding
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose FABind, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. FABind incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed FABind demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind
MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields
Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.
CHGNet: Pretrained universal neural network potential for charge-informed atomistic modeling
The simulation of large-scale systems with complex electron interactions remains one of the greatest challenges for the atomistic modeling of materials. Although classical force fields often fail to describe the coupling between electronic states and ionic rearrangements, the more accurate ab-initio molecular dynamics suffers from computational complexity that prevents long-time and large-scale simulations, which are essential to study many technologically relevant phenomena, such as reactions, ion migrations, phase transformations, and degradation. In this work, we present the Crystal Hamiltonian Graph neural Network (CHGNet) as a novel machine-learning interatomic potential (MLIP), using a graph-neural-network-based force field to model a universal potential energy surface. CHGNet is pretrained on the energies, forces, stresses, and magnetic moments from the Materials Project Trajectory Dataset, which consists of over 10 years of density functional theory static and relaxation trajectories of sim 1.5 million inorganic structures. The explicit inclusion of magnetic moments enables CHGNet to learn and accurately represent the orbital occupancy of electrons, enhancing its capability to describe both atomic and electronic degrees of freedom. We demonstrate several applications of CHGNet in solid-state materials, including charge-informed molecular dynamics in Li_xMnO_2, the finite temperature phase diagram for Li_xFePO_4 and Li diffusion in garnet conductors. We critically analyze the significance of including charge information for capturing appropriate chemistry, and we provide new insights into ionic systems with additional electronic degrees of freedom that can not be observed by previous MLIPs.
nabla^2DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called nabla^2DFT that is based on the nablaDFT. It contains twice as much molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level (omegaB97X-D/def2-SVP) for each conformation. Moreover, nabla^2DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
Multi-view biomedical foundation models for molecule-target and property prediction
Foundation models applied to bio-molecular space hold promise to accelerate drug discovery. Molecular representation is key to building such models. Previous works have typically focused on a single representation or view of the molecules. Here, we develop a multi-view foundation model approach, that integrates molecular views of graph, image and text. Single-view foundation models are each pre-trained on a dataset of up to 200M molecules and then aggregated into combined representations. Our multi-view model is validated on a diverse set of 18 tasks, encompassing ligand-protein binding, molecular solubility, metabolism and toxicity. We show that the multi-view models perform robustly and are able to balance the strengths and weaknesses of specific views. We then apply this model to screen compounds against a large (>100 targets) set of G Protein-Coupled receptors (GPCRs). From this library of targets, we identify 33 that are related to Alzheimer's disease. On this subset, we employ our model to identify strong binders, which are validated through structure-based modeling and identification of key binding motifs.
AdsorbML: Accelerating Adsorption Energy Calculations with Machine Learning
Computational catalysis is playing an increasingly significant role in the design of catalysts across a wide range of applications. A common task for many computational methods is the need to accurately compute the minimum binding energy - the adsorption energy - for an adsorbate and a catalyst surface of interest. Traditionally, the identification of low energy adsorbate-surface configurations relies on heuristic methods and researcher intuition. As the desire to perform high-throughput screening increases, it becomes challenging to use heuristics and intuition alone. In this paper, we demonstrate machine learning potentials can be leveraged to identify low energy adsorbate-surface configurations more accurately and efficiently. Our algorithm provides a spectrum of trade-offs between accuracy and efficiency, with one balanced option finding the lowest energy configuration, within a 0.1 eV threshold, 86.33% of the time, while achieving a 1331x speedup in computation. To standardize benchmarking, we introduce the Open Catalyst Dense dataset containing nearly 1,000 diverse surfaces and 85,658 unique configurations.
BAMBOO: a predictive and transferable machine learning force field framework for liquid electrolyte development
Despite the widespread applications of machine learning force field (MLFF) on solids and small molecules, there is a notable gap in applying MLFF to complex liquid electrolytes. In this work, we introduce BAMBOO (ByteDance AI Molecular Simulation Booster), a novel framework for molecular dynamics (MD) simulations, with a demonstration of its capabilities in the context of liquid electrolytes for lithium batteries. We design a physics-inspired graph equivariant transformer architecture as the backbone of BAMBOO to learn from quantum mechanical simulations. Additionally, we pioneer an ensemble knowledge distillation approach and apply it on MLFFs to improve the stability of MD simulations. Finally, we propose the density alignment algorithm to align BAMBOO with experimental measurements. BAMBOO demonstrates state-of-the-art accuracy in predicting key electrolyte properties such as density, viscosity, and ionic conductivity across various solvents and salt combinations. Our current model, trained on more than 15 chemical species, achieves the average density error of 0.01 g/cm^3 on various compositions compared with experimental data. Moreover, our model demonstrates transferability to molecules not included in the quantum mechanical dataset. We envision this work as paving the way to a "universal MLFF" capable of simulating properties of common organic liquids.
Generating Molecular Conformer Fields
In this paper we tackle the problem of generating conformers of a molecule in 3D space given its molecular graph. We parameterize these conformers as continuous functions that map elements from the molecular graph to points in 3D space. We then formulate the problem of learning to generate conformers as learning a distribution over these functions using a diffusion generative model, called Molecular Conformer Fields (MCF). Our approach is simple and scalable, and achieves state-of-the-art performance on challenging molecular conformer generation benchmarks while making no assumptions about the explicit structure of molecules (e.g. modeling torsional angles). MCF represents an advance in extending diffusion models to handle complex scientific problems in a conceptually simple, scalable and effective manner.
ProFSA: Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment
Pocket representations play a vital role in various biomedical applications, such as druggability estimation, ligand affinity prediction, and de novo drug design. While existing geometric features and pretrained representations have demonstrated promising results, they usually treat pockets independent of ligands, neglecting the fundamental interactions between them. However, the limited pocket-ligand complex structures available in the PDB database (less than 100 thousand non-redundant pairs) hampers large-scale pretraining endeavors for interaction modeling. To address this constraint, we propose a novel pocket pretraining approach that leverages knowledge from high-resolution atomic protein structures, assisted by highly effective pretrained small molecule representations. By segmenting protein structures into drug-like fragments and their corresponding pockets, we obtain a reasonable simulation of ligand-receptor interactions, resulting in the generation of over 5 million complexes. Subsequently, the pocket encoder is trained in a contrastive manner to align with the representation of pseudo-ligand furnished by some pretrained small molecule encoders. Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. Notably, ProFSA surpasses other pretraining methods by a substantial margin. Moreover, our work opens up a new avenue for mitigating the scarcity of protein-ligand complex data through the utilization of high-quality and diverse protein structure databases.
Latent Field Discovery In Interacting Dynamical Systems With Neural Fields
Systems of interacting objects often evolve under the influence of field effects that govern their dynamics, yet previous works have abstracted away from such effects, and assume that systems evolve in a vacuum. In this work, we focus on discovering these fields, and infer them from the observed dynamics alone, without directly observing them. We theorize the presence of latent force fields, and propose neural fields to learn them. Since the observed dynamics constitute the net effect of local object interactions and global field effects, recently popularized equivariant networks are inapplicable, as they fail to capture global information. To address this, we propose to disentangle local object interactions -- which are SE(n) equivariant and depend on relative states -- from external global field effects -- which depend on absolute states. We model interactions with equivariant graph networks, and combine them with neural fields in a novel graph network that integrates field forces. Our experiments show that we can accurately discover the underlying fields in charged particles settings, traffic scenes, and gravitational n-body problems, and effectively use them to learn the system and forecast future trajectories.
EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.
Sliced Denoising: A Physics-Informed Molecular Pre-Training Method
While molecular pre-training has shown great potential in enhancing drug discovery, the lack of a solid physical interpretation in current methods raises concerns about whether the learned representation truly captures the underlying explanatory factors in observed data, ultimately resulting in limited generalization and robustness. Although denoising methods offer a physical interpretation, their accuracy is often compromised by ad-hoc noise design, leading to inaccurate learned force fields. To address this limitation, this paper proposes a new method for molecular pre-training, called sliced denoising (SliDe), which is based on the classical mechanical intramolecular potential theory. SliDe utilizes a novel noise strategy that perturbs bond lengths, angles, and torsion angles to achieve better sampling over conformations. Additionally, it introduces a random slicing approach that circumvents the computationally expensive calculation of the Jacobian matrix, which is otherwise essential for estimating the force field. By aligning with physical principles, SliDe shows a 42\% improvement in the accuracy of estimated force fields compared to current state-of-the-art denoising methods, and thus outperforms traditional baselines on various molecular property prediction tasks.
Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling
The dynamic nature of proteins is crucial for determining their biological functions and properties, for which Monte Carlo (MC) and molecular dynamics (MD) simulations stand as predominant tools to study such phenomena. By utilizing empirically derived force fields, MC or MD simulations explore the conformational space through numerically evolving the system via Markov chain or Newtonian mechanics. However, the high-energy barrier of the force fields can hamper the exploration of both methods by the rare event, resulting in inadequately sampled ensemble without exhaustive running. Existing learning-based approaches perform direct sampling yet heavily rely on target-specific simulation data for training, which suffers from high data acquisition cost and poor generalizability. Inspired by simulated annealing, we propose Str2Str, a novel structure-to-structure translation framework capable of zero-shot conformation sampling with roto-translation equivariant property. Our method leverages an amortized denoising score matching objective trained on general crystal structures and has no reliance on simulation data during both training and inference. Experimental results across several benchmarking protein systems demonstrate that Str2Str outperforms previous state-of-the-art generative structure prediction models and can be orders of magnitude faster compared to long MD simulations. Our open-source implementation is available at https://github.com/lujiarui/Str2Str
Solvation Free Energies from Neural Thermodynamic Integration
We present a method for computing free-energy differences using thermodynamic integration with a neural network potential that interpolates between two target Hamiltonians. The interpolation is defined at the sample distribution level, and the neural network potential is optimized to match the corresponding equilibrium potential at every intermediate time-step. Once the interpolating potentials and samples are well-aligned, the free-energy difference can be estimated using (neural) thermodynamic integration. To target molecular systems, we simultaneously couple Lennard-Jones and electrostatic interactions and model the rigid-body rotation of molecules. We report accurate results for several benchmark systems: a Lennard-Jones particle in a Lennard-Jones fluid, as well as the insertion of both water and methane solutes in a water solvent at atomistic resolution using a simple three-body neural-network potential.
Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models
We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.
Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design
The efficient exploration of chemical space to design molecules with intended properties enables the accelerated discovery of drugs, materials, and catalysts, and is one of the most important outstanding challenges in chemistry. Encouraged by the recent surge in computer power and artificial intelligence development, many algorithms have been developed to tackle this problem. However, despite the emergence of many new approaches in recent years, comparatively little progress has been made in developing realistic benchmarks that reflect the complexity of molecular design for real-world applications. In this work, we develop a set of practical benchmark tasks relying on physical simulation of molecular systems mimicking real-life molecular design problems for materials, drugs, and chemical reactions. Additionally, we demonstrate the utility and ease of use of our new benchmark set by demonstrating how to compare the performance of several well-established families of algorithms. Surprisingly, we find that model performance can strongly depend on the benchmark domain. We believe that our benchmark suite will help move the field towards more realistic molecular design benchmarks, and move the development of inverse molecular design algorithms closer to designing molecules that solve existing problems in both academia and industry alike.
Molecular Graph Generation via Geometric Scattering
Graph neural networks (GNNs) have been used extensively for addressing problems in drug design and discovery. Both ligand and target molecules are represented as graphs with node and edge features encoding information about atomic elements and bonds respectively. Although existing deep learning models perform remarkably well at predicting physicochemical properties and binding affinities, the generation of new molecules with optimized properties remains challenging. Inherently, most GNNs perform poorly in whole-graph representation due to the limitations of the message-passing paradigm. Furthermore, step-by-step graph generation frameworks that use reinforcement learning or other sequential processing can be slow and result in a high proportion of invalid molecules with substantial post-processing needed in order to satisfy the principles of stoichiometry. To address these issues, we propose a representation-first approach to molecular graph generation. We guide the latent representation of an autoencoder by capturing graph structure information with the geometric scattering transform and apply penalties that structure the representation also by molecular properties. We show that this highly structured latent space can be directly used for molecular graph generation by the use of a GAN. We demonstrate that our architecture learns meaningful representations of drug datasets and provides a platform for goal-directed drug synthesis.
The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.
Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design
A central challenge of the clean energy transition is the development of catalysts for low-emissions technologies. Recent advances in Machine Learning for quantum chemistry drastically accelerate the computation of catalytic activity descriptors such as adsorption energies. Here we introduce AdsorbRL, a Deep Reinforcement Learning agent aiming to identify potential catalysts given a multi-objective binding energy target, trained using offline learning on the Open Catalyst 2020 and Materials Project data sets. We experiment with Deep Q-Network agents to traverse the space of all ~160,000 possible unary, binary and ternary compounds of 55 chemical elements, with very sparse rewards based on adsorption energy known for only between 2,000 and 3,000 catalysts per adsorbate. To constrain the actions space, we introduce Random Edge Traversal and train a single-objective DQN agent on the known states subgraph, which we find strengthens target binding energy by an average of 4.1 eV. We extend this approach to multi-objective, goal-conditioned learning, and train a DQN agent to identify materials with the highest (respectively lowest) adsorption energies for multiple simultaneous target adsorbates. We experiment with Objective Sub-Sampling, a novel training scheme aimed at encouraging exploration in the multi-objective setup, and demonstrate simultaneous adsorption energy improvement across all target adsorbates, by an average of 0.8 eV. Overall, our results suggest strong potential for Deep Reinforcement Learning applied to the inverse catalysts design problem.
TensorNet: Cartesian Tensor Representations for Efficient Learning of Molecular Potentials
The development of efficient machine learning models for molecular systems representation is becoming crucial in scientific research. We introduce TensorNet, an innovative O(3)-equivariant message-passing neural network architecture that leverages Cartesian tensor representations. By using Cartesian tensor atomic embeddings, feature mixing is simplified through matrix product operations. Furthermore, the cost-effective decomposition of these tensors into rotation group irreducible representations allows for the separate processing of scalars, vectors, and tensors when necessary. Compared to higher-rank spherical tensor models, TensorNet demonstrates state-of-the-art performance with significantly fewer parameters. For small molecule potential energies, this can be achieved even with a single interaction layer. As a result of all these properties, the model's computational cost is substantially decreased. Moreover, the accurate prediction of vector and tensor molecular quantities on top of potential energies and forces is possible. In summary, TensorNet's framework opens up a new space for the design of state-of-the-art equivariant models.
mdCATH: A Large-Scale MD Dataset for Data-Driven Computational Biophysics
Recent advancements in protein structure determination are revolutionizing our understanding of proteins. Still, a significant gap remains in the availability of comprehensive datasets that focus on the dynamics of proteins, which are crucial for understanding protein function, folding, and interactions. To address this critical gap, we introduce mdCATH, a dataset generated through an extensive set of all-atom molecular dynamics simulations of a diverse and representative collection of protein domains. This dataset comprises all-atom systems for 5,398 domains, modeled with a state-of-the-art classical force field, and simulated in five replicates each at five temperatures from 320 K to 413 K. The mdCATH dataset records coordinates and forces every 1 ns, for over 62 ms of accumulated simulation time, effectively capturing the dynamics of the various classes of domains and providing a unique resource for proteome-wide statistical analyses of protein unfolding thermodynamics and kinetics. We outline the dataset structure and showcase its potential through four easily reproducible case studies, highlighting its capabilities in advancing protein science.
Gradual Optimization Learning for Conformational Energy Minimization
Molecular conformation optimization is crucial to computer-aided drug discovery and materials design. Traditional energy minimization techniques rely on iterative optimization methods that use molecular forces calculated by a physical simulator (oracle) as anti-gradients. However, this is a computationally expensive approach that requires many interactions with a physical simulator. One way to accelerate this procedure is to replace the physical simulator with a neural network. Despite recent progress in neural networks for molecular conformation energy prediction, such models are prone to distribution shift, leading to inaccurate energy minimization. We find that the quality of energy minimization with neural networks can be improved by providing optimization trajectories as additional training data. Still, it takes around 5 times 10^5 additional conformations to match the physical simulator's optimization quality. In this work, we present the Gradual Optimization Learning Framework (GOLF) for energy minimization with neural networks that significantly reduces the required additional data. The framework consists of an efficient data-collecting scheme and an external optimizer. The external optimizer utilizes gradients from the energy prediction model to generate optimization trajectories, and the data-collecting scheme selects additional training data to be processed by the physical simulator. Our results demonstrate that the neural network trained with GOLF performs on par with the oracle on a benchmark of diverse drug-like molecules using 50x less additional data.
Ewald-based Long-Range Message Passing for Molecular Graphs
Neural architectures that learn potential energy surfaces from molecular data have undergone fast improvement in recent years. A key driver of this success is the Message Passing Neural Network (MPNN) paradigm. Its favorable scaling with system size partly relies upon a spatial distance limit on messages. While this focus on locality is a useful inductive bias, it also impedes the learning of long-range interactions such as electrostatics and van der Waals forces. To address this drawback, we propose Ewald message passing: a nonlocal Fourier space scheme which limits interactions via a cutoff on frequency instead of distance, and is theoretically well-founded in the Ewald summation method. It can serve as an augmentation on top of existing MPNN architectures as it is computationally inexpensive and agnostic to architectural details. We test the approach with four baseline models and two datasets containing diverse periodic (OC20) and aperiodic structures (OE62). We observe robust improvements in energy mean absolute errors across all models and datasets, averaging 10% on OC20 and 16% on OE62. Our analysis shows an outsize impact of these improvements on structures with high long-range contributions to the ground truth energy.
Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks
Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models.
Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation
Deep generative diffusion models are a promising avenue for 3D de novo molecular design in materials science and drug discovery. However, their utility is still limited by suboptimal performance on large molecular structures and limited training data. To address this gap, we explore the design space of E(3)-equivariant diffusion models, focusing on previously unexplored areas. Our extensive comparative analysis evaluates the interplay between continuous and discrete state spaces. From this investigation, we present the EQGAT-diff model, which consistently outperforms established models for the QM9 and GEOM-Drugs datasets. Significantly, EQGAT-diff takes continuous atom positions, while chemical elements and bond types are categorical and uses time-dependent loss weighting, substantially increasing training convergence, the quality of generated samples, and inference time. We also showcase that including chemically motivated additional features like hybridization states in the diffusion process enhances the validity of generated molecules. To further strengthen the applicability of diffusion models to limited training data, we investigate the transferability of EQGAT-diff trained on the large PubChem3D dataset with implicit hydrogen atoms to target different data distributions. Fine-tuning EQGAT-diff for just a few iterations shows an efficient distribution shift, further improving performance throughout data sets. Finally, we test our model on the Crossdocked data set for structure-based de novo ligand generation, underlining the importance of our findings showing state-of-the-art performance on Vina docking scores.
QH9: A Quantum Hamiltonian Prediction Benchmark for QM9 Molecules
Supervised machine learning approaches have been increasingly used in accelerating electronic structure prediction as surrogates of first-principle computational methods, such as density functional theory (DFT). While numerous quantum chemistry datasets focus on chemical properties and atomic forces, the ability to achieve accurate and efficient prediction of the Hamiltonian matrix is highly desired, as it is the most important and fundamental physical quantity that determines the quantum states of physical systems and chemical properties. In this work, we generate a new Quantum Hamiltonian dataset, named as QH9, to provide precise Hamiltonian matrices for 999 or 2998 molecular dynamics trajectories and 130,831 stable molecular geometries, based on the QM9 dataset. By designing benchmark tasks with various molecules, we show that current machine learning models have the capacity to predict Hamiltonian matrices for arbitrary molecules. Both the QH9 dataset and the baseline models are provided to the community through an open-source benchmark, which can be highly valuable for developing machine learning methods and accelerating molecular and materials design for scientific and technological applications. Our benchmark is publicly available at https://github.com/divelab/AIRS/tree/main/OpenDFT/QHBench.
Chemical Heredity as Group Selection at the Molecular Level
Many examples of cooperation exist in biology. In chemical systems however, which can sometimes be quite complex, we do not appear to observe intricate cooperative interactions. A key question for the origin of life, is then how can molecular cooperation first arise in an abiotic system prior to the emergence of biological replication. We postulate that selection at the molecular level is a driving force behind the complexification of chemical systems, particularly during the origins of life. In the theory of multilevel selection the two selective forces are: within-group and between-group, where the former tends to favor "selfish" replication of individuals and the latter favor cooperation between individuals enhancing the replication of the group as a whole. These forces can be quantified using the Price equation, which is a standard tool used in evolutionary biology to quantify evolutionary change. Our central claim is that replication and heredity in chemical systems are subject to selection, and quantifiable using the multilevel Price equation. We demonstrate this using the Graded Autocatalysis Replication Domain computer model, describing simple protocell composed out of molecules and its replication, which respectively analogue to the group and the individuals. In contrast to previous treatments of this model, we treat the lipid molecules themselves as replicating individuals and the protocells they form as groups of individuals. Our goal is to demonstrate how evolutionary biology tools and concepts can be applied in chemistry and we suggest that molecular cooperation may arise as a result of group selection. Further, the biological relation of parent-progeny is proposed to be analogue to the reactant-product relation in chemistry, thus allowing for tools from evolutionary biology to be applied to chemistry and would deepen the connection between chemistry and biology.
The Open Catalyst 2020 (OC20) Dataset and Community Challenges
Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, Dimenet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leader board to encourage community contributions to solve these important tasks.
Exploiting Pretrained Biochemical Language Models for Targeted Drug Design
Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm start strategies: (i) a one-stage strategy where the initialized model is trained on targeted molecule generation (ii) a two-stage strategy containing a pre-finetuning on molecular generation followed by target specific training. We also compare two decoding strategies to generate compounds: beam search and sampling. Results: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely used metrics from benchmarks. However, docking evaluation of the generated compounds for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145
Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs
Graph neural networks are emerging as promising methods for modeling molecular graphs, in which nodes and edges correspond to atoms and chemical bonds, respectively. Recent studies show that when 3D molecular geometries, such as bond lengths and angles, are available, molecular property prediction tasks can be made more accurate. However, computing of 3D molecular geometries requires quantum calculations that are computationally prohibitive. For example, accurate calculation of 3D geometries of a small molecule requires hours of computing time using density functional theory (DFT). Here, we propose to predict the ground-state 3D geometries from molecular graphs using machine learning methods. To make this feasible, we develop a benchmark, known as Molecule3D, that includes a dataset with precise ground-state geometries of approximately 4 million molecules derived from DFT. We also provide a set of software tools for data processing, splitting, training, and evaluation, etc. Specifically, we propose to assess the error and validity of predicted geometries using four metrics. We implement two baseline methods that either predict the pairwise distance between atoms or atom coordinates in 3D space. Experimental results show that, compared with generating 3D geometries with RDKit, our method can achieve comparable prediction accuracy but with much smaller computational costs. Our Molecule3D is available as a module of the MoleculeX software library (https://github.com/divelab/MoleculeX).
Quantum-Inspired Machine Learning for Molecular Docking
Molecular docking is an important tool for structure-based drug design, accelerating the efficiency of drug development. Complex and dynamic binding processes between proteins and small molecules require searching and sampling over a wide spatial range. Traditional docking by searching for possible binding sites and conformations is computationally complex and results poorly under blind docking. Quantum-inspired algorithms combining quantum properties and annealing show great advantages in solving combinatorial optimization problems. Inspired by this, we achieve an improved in blind docking by using quantum-inspired combined with gradients learned by deep learning in the encoded molecular space. Numerical simulation shows that our method outperforms traditional docking algorithms and deep learning-based algorithms over 10\%. Compared to the current state-of-the-art deep learning-based docking algorithm DiffDock, the success rate of Top-1 (RMSD<2) achieves an improvement from 33\% to 35\% in our same setup. In particular, a 6\% improvement is realized in the high-precision region(RMSD<1) on molecules data unseen in DiffDock, which demonstrates the well-generalized of our method.
DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
Predicting the binding structure of a small molecule ligand to a protein -- a task known as molecular docking -- is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD<2A) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.
Scalable Bayesian Uncertainty Quantification for Neural Network Potentials: Promise and Pitfalls
Neural network (NN) potentials promise highly accurate molecular dynamics (MD) simulations within the computational complexity of classical MD force fields. However, when applied outside their training domain, NN potential predictions can be inaccurate, increasing the need for Uncertainty Quantification (UQ). Bayesian modeling provides the mathematical framework for UQ, but classical Bayesian methods based on Markov chain Monte Carlo (MCMC) are computationally intractable for NN potentials. By training graph NN potentials for coarse-grained systems of liquid water and alanine dipeptide, we demonstrate here that scalable Bayesian UQ via stochastic gradient MCMC (SG-MCMC) yields reliable uncertainty estimates for MD observables. We show that cold posteriors can reduce the required training data size and that for reliable UQ, multiple Markov chains are needed. Additionally, we find that SG-MCMC and the Deep Ensemble method achieve comparable results, despite shorter training and less hyperparameter tuning of the latter. We show that both methods can capture aleatoric and epistemic uncertainty reliably, but not systematic uncertainty, which needs to be minimized by adequate modeling to obtain accurate credible intervals for MD observables. Our results represent a step towards accurate UQ that is of vital importance for trustworthy NN potential-based MD simulations required for decision-making in practice.
Generating π-Functional Molecules Using STGG+ with Active Learning
Generating novel molecules with out-of-distribution properties is a major challenge in molecular discovery. While supervised learning methods generate high-quality molecules similar to those in a dataset, they struggle to generalize to out-of-distribution properties. Reinforcement learning can explore new chemical spaces but often conducts 'reward-hacking' and generates non-synthesizable molecules. In this work, we address this problem by integrating a state-of-the-art supervised learning method, STGG+, in an active learning loop. Our approach iteratively generates, evaluates, and fine-tunes STGG+ to continuously expand its knowledge. We denote this approach STGG+AL. We apply STGG+AL to the design of organic pi-functional materials, specifically two challenging tasks: 1) generating highly absorptive molecules characterized by high oscillator strength and 2) designing absorptive molecules with reasonable oscillator strength in the near-infrared (NIR) range. The generated molecules are validated and rationalized in-silico with time-dependent density functional theory. Our results demonstrate that our method is highly effective in generating novel molecules with high oscillator strength, contrary to existing methods such as reinforcement learning (RL) methods. We open-source our active-learning code along with our Conjugated-xTB dataset containing 2.9 million pi-conjugated molecules and the function for approximating the oscillator strength and absorption wavelength (based on sTDA-xTB).
SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction
Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the SSM-DTA framework, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling (MLM) using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS, and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations, and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes. Our code is available at https://github.com/QizhiPei/SSM-DTA{Github}.
Multi-scale Iterative Refinement towards Robust and Versatile Molecular Docking
Molecular docking is a key computational tool utilized to predict the binding conformations of small molecules to protein targets, which is fundamental in the design of novel drugs. Despite recent advancements in geometric deep learning-based approaches leading to improvements in blind docking efficiency, these methods have encountered notable challenges, such as limited generalization performance on unseen proteins, the inability to concurrently address the settings of blind docking and site-specific docking, and the frequent occurrence of physical implausibilities such as inter-molecular steric clash. In this study, we introduce DeltaDock, a robust and versatile framework designed for efficient molecular docking to overcome these challenges. DeltaDock operates in a two-step process: rapid initial complex structures sampling followed by multi-scale iterative refinement of the initial structures. In the initial stage, to sample accurate structures with high efficiency, we develop a ligand-dependent binding site prediction model founded on large protein models and graph neural networks. This model is then paired with GPU-accelerated sampling algorithms. The sampled structures are updated using a multi-scale iterative refinement module that captures both protein-ligand atom-atom interactions and residue-atom interactions in the following stage. Distinct from previous geometric deep learning methods that are conditioned on the blind docking setting, DeltaDock demonstrates superior performance in both blind docking and site-specific docking settings. Comprehensive experimental results reveal that DeltaDock consistently surpasses baseline methods in terms of docking accuracy. Furthermore, it displays remarkable generalization capabilities and proficiency for predicting physically valid structures, thereby attesting to its robustness and reliability in various scenarios.
BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning
Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve at the same time as a 3D molecular generative model, conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such simple conceptual approach combined with pretraining and scaling can perform on par or better than the current best specialized diffusion models, language models, and graph neural networks while being two orders of magnitude cheaper to sample.
ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning
Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, limiting their flexibility when incorporating out-of-domain knowledge into the design process or comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures and obtaining new first-principles data -- natural vibrational frequencies -- via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on one hand, and their capacity in autonomous collaboration through the dynamic LLM-based multi-agent environment on the other hand, unleashes great potentials of LLMs in addressing multi-objective materials problems and opens up new avenues for autonomous materials discovery and design.
Nonequilibrium Phenomena in Driven and Active Coulomb Field Theories
The classical Coulomb gas model has served as one of the most versatile frameworks in statistical physics, connecting a vast range of phenomena across many different areas. Nonequilibrium generalisations of this model have so far been studied much more scarcely. With the abundance of contemporary research into active and driven systems, one would naturally expect that such generalisations of systems with long-ranged Coulomb-like interactions will form a fertile playground for interesting developments. Here, we present two examples of novel macroscopic behaviour that arise from nonequilibrium fluctuations in long-range interacting systems, namely (1) unscreened long-ranged correlations in strong electrolytes driven by an external electric field and the associated fluctuation-induced forces in the confined Casimir geometry, and (2) out-of-equilibrium critical behaviour in self-chemotactic models that incorporate the particle polarity in the chemotactic response of the cells. Both of these systems have nonlocal Coulomb-like interactions among their constituent particles, namely, the electrostatic interactions in the case of the driven electrolyte, and the chemotactic forces mediated by fast-diffusing signals in the case of self-chemotactic systems. The results presented here hint to the rich phenomenology of nonequilibrium effects that can arise from strong fluctuations in Coulomb interacting systems, and a rich variety of potential future directions, which are discussed.
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
MoleculeNet: A Benchmark for Molecular Machine Learning
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
Enhancing Ligand Pose Sampling for Molecular Docking
Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. To train scoring functions-and to perform molecular docking-one must generate a set of candidate ligand binding poses. Unfortunately, the sampling protocols currently used to generate candidate poses frequently fail to produce any poses close to the correct, experimentally determined pose, unless information about the correct pose is provided. This limits the accuracy of learned scoring functions and molecular docking. Here, we describe two improved protocols for pose sampling: GLOW (auGmented sampLing with sOftened vdW potential) and a novel technique named IVES (IteratiVe Ensemble Sampling). Our benchmarking results demonstrate the effectiveness of our methods in improving the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. This improvement is observed across both experimentally determined and AlphaFold-generated protein structures. Additionally, we present datasets of candidate ligand poses generated using our methods for each of around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions. To benefit the research community, we provide these cross-docking datasets and an open-source Python implementation of GLOW and IVES at https://github.com/drorlab/GLOW_IVES .
Space Group Constrained Crystal Generation
Crystals are the foundation of numerous scientific and industrial applications. While various learning-based approaches have been proposed for crystal generation, existing methods seldom consider the space group constraint which is crucial in describing the geometry of crystals and closely relevant to many desirable properties. However, considering space group constraint is challenging owing to its diverse and nontrivial forms. In this paper, we reduce the space group constraint into an equivalent formulation that is more tractable to be handcrafted into the generation process. In particular, we translate the space group constraint into two parts: the basis constraint of the invariant logarithmic space of the lattice matrix and the Wyckoff position constraint of the fractional coordinates. Upon the derived constraints, we then propose DiffCSP++, a novel diffusion model that has enhanced a previous work DiffCSP by further taking space group constraint into account. Experiments on several popular datasets verify the benefit of the involvement of the space group constraint, and show that our DiffCSP++ achieves promising performance on crystal structure prediction, ab initio crystal generation and controllable generation with customized space groups.
Zyxin is all you need: machine learning adherent cell mechanics
Cellular form and function emerge from complex mechanochemical systems within the cytoplasm. No systematic strategy currently exists to infer large-scale physical properties of a cell from its many molecular components. This is a significant obstacle to understanding biophysical processes such as cell adhesion and migration. Here, we develop a data-driven biophysical modeling approach to learn the mechanical behavior of adherent cells. We first train neural networks to predict forces generated by adherent cells from images of cytoskeletal proteins. Strikingly, experimental images of a single focal adhesion protein, such as zyxin, are sufficient to predict forces and generalize to unseen biological regimes. This protein field alone contains enough information to yield accurate predictions even if forces themselves are generated by many interacting proteins. We next develop two approaches - one explicitly constrained by physics, the other more agnostic - that help construct data-driven continuum models of cellular forces using this single focal adhesion field. Both strategies consistently reveal that cellular forces are encoded by two different length scales in adhesion protein distributions. Beyond adherent cell mechanics, our work serves as a case study for how to integrate neural networks in the construction of predictive phenomenological models in cell biology, even when little knowledge of the underlying microscopic mechanisms exist.
Lifelong Machine Learning Potentials
Machine learning potentials (MLPs) trained on accurate quantum chemical data can retain the high accuracy, while inflicting little computational demands. On the downside, they need to be trained for each individual system. In recent years, a vast number of MLPs has been trained from scratch because learning additional data typically requires to train again on all data to not forget previously acquired knowledge. Additionally, most common structural descriptors of MLPs cannot represent efficiently a large number of different chemical elements. In this work, we tackle these problems by introducing element-embracing atom-centered symmetry functions (eeACSFs) which combine structural properties and element information from the periodic table. These eeACSFs are a key for our development of a lifelong machine learning potential (lMLP). Uncertainty quantification can be exploited to transgress a fixed, pre-trained MLP to arrive at a continuously adapting lMLP, because a predefined level of accuracy can be ensured. To extend the applicability of an lMLP to new systems, we apply continual learning strategies to enable autonomous and on-the-fly training on a continuous stream of new data. For the training of deep neural networks, we propose the continual resilient (CoRe) optimizer and incremental learning strategies relying on rehearsal of data, regularization of parameters, and the architecture of the model.
Learning Subpocket Prototypes for Generalizable Structure-based Drug Design
Generating molecules with high binding affinities to target proteins (a.k.a. structure-based drug design) is a fundamental and challenging task in drug discovery. Recently, deep generative models have achieved remarkable success in generating 3D molecules conditioned on the protein pocket. However, most existing methods consider molecular generation for protein pockets independently while neglecting the underlying connections such as subpocket-level similarities. Subpockets are the local protein environments of ligand fragments and pockets with similar subpockets may bind the same molecular fragment (motif) even though their overall structures are different. Therefore, the trained models can hardly generalize to unseen protein pockets in real-world applications. In this paper, we propose a novel method DrugGPS for generalizable structure-based drug design. With the biochemical priors, we propose to learn subpocket prototypes and construct a global interaction graph to model the interactions between subpocket prototypes and molecular motifs. Moreover, a hierarchical graph transformer encoder and motif-based 3D molecule generation scheme are used to improve the model's performance. The experimental results show that our model consistently outperforms baselines in generating realistic drug candidates with high affinities in challenging out-of-distribution settings.
DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization
Recently, 3D generative models have shown promising performances in structure-based drug design by learning to generate ligands given target binding sites. However, only modeling the target-ligand distribution can hardly fulfill one of the main goals in drug discovery -- designing novel ligands with desired properties, e.g., high binding affinity, easily synthesizable, etc. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving de novo design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. Additionally, DecompOpt offers a unified framework covering both de novo design and controllable generation. To achieve so, ligands are decomposed into substructures which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with improved properties than strong de novo baselines, and demonstrate great potential in controllable generation tasks.
Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
Generative models are becoming a tool of choice for exploring the molecular space. These models learn on a large training dataset and produce novel molecular structures with similar properties. Generated structures can be utilized for virtual screening or training semi-supervised predictive models in the downstream tasks. While there are plenty of generative models, it is unclear how to compare and rank them. In this work, we introduce a benchmarking platform called Molecular Sets (MOSES) to standardize training and comparison of molecular generative models. MOSES provides a training and testing datasets, and a set of metrics to evaluate the quality and diversity of generated structures. We have implemented and compared several molecular generation models and suggest to use our results as reference points for further advancements in generative chemistry research. The platform and source code are available at https://github.com/molecularsets/moses.
ChemCrow: Augmenting large-language models with chemistry tools
Over the last decades, excellent computational chemistry tools have been developed. Their full potential has not yet been reached as most are challenging to learn and exist in isolation. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 17 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned the syntheses of an insect repellent, three organocatalysts, as well as other relevant molecules. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and Chemcrow's performance. There is a significant risk of misuse of tools like ChemCrow, and we discuss their potential harms. Employed responsibly, our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry. A subset of the code is publicly available at https://github.com/ur-whitelab/chemcrow-public.
Towards Data-Efficient Pretraining for Atomic Property Prediction
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fr\'echet Inception Distance, for molecular graphs which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
Bayesian active learning for optimization and uncertainty quantification in protein docking
Motivation: Ab initio protein docking represents a major challenge for optimizing a noisy and costly "black box"-like function in a high-dimensional space. Despite progress in this field, there is no docking method available for rigorous uncertainty quantification (UQ) of its solution quality (e.g. interface RMSD or iRMSD). Results: We introduce a novel algorithm, Bayesian Active Learning (BAL), for optimization and UQ of such black-box functions and flexible protein docking. BAL directly models the posterior distribution of the global optimum (or native structures for protein docking) with active sampling and posterior estimation iteratively feeding each other. Furthermore, we use complex normal modes to represent a homogeneous Euclidean conformation space suitable for high-dimension optimization and construct funnel-like energy models for encounter complexes. Over a protein docking benchmark set and a CAPRI set including homology docking, we establish that BAL significantly improve against both starting points by rigid docking and refinements by particle swarm optimization, providing for one third targets a top-3 near-native prediction. BAL also generates tight confidence intervals with half range around 25% of iRMSD and confidence level at 85%. Its estimated probability of a prediction being native or not achieves binary classification AUROC at 0.93 and AUPRC over 0.60 (compared to 0.14 by chance); and also found to help ranking predictions. To the best of our knowledge, this study represents the first uncertainty quantification solution for protein docking, with theoretical rigor and comprehensive assessment. Source codes are available at https://github.com/Shen-Lab/BAL.
From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction
Foundation models have been transformational in machine learning fields such as natural language processing and computer vision. Similar success in atomic property prediction has been limited due to the challenges of training effective models across multiple chemical domains. To address this, we introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy that simultaneously trains on multiple datasets from different chemical domains, treating each dataset as a unique pre-training task within a multi-task framework. Our combined training dataset consists of sim120M systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and generalization by fine-tuning over a diverse set of downstream tasks and datasets including: QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP demonstrates an average improvement of 59% over training from scratch, and matches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the potential of pre-training strategies that utilize diverse data to advance property prediction across chemical domains, especially for low-data tasks.
Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery
Recent research in representation learning utilizes large databases of proteins or molecules to acquire knowledge of drug and protein structures through unsupervised learning techniques. These pre-trained representations have proven to significantly enhance the accuracy of subsequent tasks, such as predicting the affinity between drugs and target proteins. In this study, we demonstrate that by incorporating knowledge graphs from diverse sources and modalities into the sequences or SMILES representation, we can further enrich the representation and achieve state-of-the-art results on established benchmark datasets. We provide preprocessed and integrated data obtained from 7 public sources, which encompass over 30M triples. Additionally, we make available the pre-trained models based on this data, along with the reported outcomes of their performance on three widely-used benchmark datasets for drug-target binding affinity prediction found in the Therapeutic Data Commons (TDC) benchmarks. Additionally, we make the source code for training models on benchmark datasets publicly available. Our objective in releasing these pre-trained models, accompanied by clean data for model pretraining and benchmark results, is to encourage research in knowledge-enhanced representation learning.
Convolutional Neural Networks and Volcano Plots: Screening and Prediction of Two-Dimensional Single-Atom Catalysts
Single-atom catalysts (SACs) have emerged as frontiers for catalyzing chemical reactions, yet the diverse combinations of active elements and support materials, the nature of coordination environments, elude traditional methodologies in searching optimal SAC systems with superior catalytic performance. Herein, by integrating multi-branch Convolutional Neural Network (CNN) analysis models to hybrid descriptor based activity volcano plot, 2D SAC system composed of diverse metallic single atoms anchored on six type of 2D supports, including graphitic carbon nitride, nitrogen-doped graphene, graphene with dual-vacancy, black phosphorous, boron nitride, and C2N, are screened for efficient CO2RR. Starting from establishing a correlation map between the adsorption energies of intermediates and diverse electronic and elementary descriptors, sole singular descriptor lost magic to predict catalytic activity. Deep learning method utilizing multi-branch CNN model therefore was employed, using 2D electronic density of states as input to predict adsorption energies. Hybrid-descriptor enveloping both C- and O-types of CO2RR intermediates was introduced to construct volcano plots and limiting potential periodic table, aiming for intuitive screening of catalyst candidates for efficient CO2 reduction to CH4. The eDOS occlusion experiments were performed to unravel individual orbital contribution to adsorption energy. To explore the electronic scale principle governing practical engineering catalytic CO2RR activity, orbitalwise eDOS shifting experiments based on CNN model were employed. The study involves examining the adsorption energy and, consequently, catalytic activities while varying supported single atoms. This work offers a tangible framework to inform both theoretical screening and experimental synthesis, thereby paving the way for systematically designing efficient SACs.
Kolmogorov--Arnold networks in molecular dynamics
We explore the integration of Kolmogorov Networks (KANs) into molecular dynamics (MD) simulations to improve interatomic potentials. We propose that widely used potentials, such as the Lennard-Jones (LJ) potential, the embedded atom model (EAM), and artificial neural network (ANN) potentials, can be interpreted within the KAN framework. Specifically, we demonstrate that the descriptors for ANN potentials, typically constructed using polynomials, can be redefined using KAN's non-linear functions. By employing linear or cubic spline interpolations for these KAN functions, we show that the computational cost of evaluating ANN potentials and their derivatives is reduced.
M^{3}-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery
This paper introduces M^{3}-20M, a large-scale Multi-Modal Molecular dataset that contains over 20 million molecules. Designed to support AI-driven drug design and discovery, M^{3}-20M is 71 times more in the number of molecules than the largest existing dataset, providing an unprecedented scale that can highly benefit training or fine-tuning large (language) models with superior performance for drug design and discovery. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated by using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M^{3}-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, and GPT-4. Our experimental results show that M^{3}-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than the existing single-modal datasets, which validates the value and potential of M^{3}-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.
Neural Message Passing for Quantum Chemistry
Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark; these results are strong enough that we believe future work should focus on datasets with larger molecules or more accurate ground truth labels.
MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation
Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.
A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
Deep Confident Steps to New Pockets: Strategies for Docking Generalization
Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
Omni-Mol: Exploring Universal Convergent Space for Omni-Molecular Tasks
Building generalist models has recently demonstrated remarkable capabilities in diverse scientific domains. Within the realm of molecular learning, several studies have explored unifying diverse tasks across diverse domains. However, negative conflicts and interference between molecules and knowledge from different domain may have a worse impact in threefold. First, conflicting molecular representations can lead to optimization difficulties for the models. Second, mixing and scaling up training data across diverse tasks is inherently challenging. Third, the computational cost of refined pretraining is prohibitively high. To address these limitations, this paper presents Omni-Mol, a scalable and unified LLM-based framework for direct instruction tuning. Omni-Mol builds on three key components to tackles conflicts: (1) a unified encoding mechanism for any task input; (2) an active-learning-driven data selection strategy that significantly reduces dataset size; (3) a novel design of the adaptive gradient stabilization module and anchor-and-reconcile MoE framework that ensures stable convergence. Experimentally, Omni-Mol achieves state-of-the-art performance across 15 molecular tasks, demonstrates the presence of scaling laws in the molecular domain, and is supported by extensive ablation studies and analyses validating the effectiveness of its design. The code and weights of the powerful AI-driven chemistry generalist are open-sourced at: https://anonymous.4open.science/r/Omni-Mol-8EDB.
Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs
Molecular representation is a foundational element in our understanding of the physical world. Its importance ranges from the fundamentals of chemical reactions to the design of new therapies and materials. Previous molecular machine learning models have employed strings, fingerprints, global features, and simple molecular graphs that are inherently information-sparse representations. However, as the complexity of prediction tasks increases, the molecular representation needs to encode higher fidelity information. This work introduces a novel approach to infusing quantum-chemical-rich information into molecular graphs via stereoelectronic effects. We show that the explicit addition of stereoelectronic interactions significantly improves the performance of molecular machine learning models. Furthermore, stereoelectronics-infused representations can be learned and deployed with a tailored double graph neural network workflow, enabling its application to any downstream molecular machine learning task. Finally, we show that the learned representations allow for facile stereoelectronic evaluation of previously intractable systems, such as entire proteins, opening new avenues of molecular design.
Matbench Discovery -- An evaluation framework for machine learning crystal stability prediction
Matbench Discovery simulates the deployment of machine learning (ML) energy models in a high-throughput search for stable inorganic crystals. We address the disconnect between (i) thermodynamic stability and formation energy and (ii) in-domain vs out-of-distribution performance. Alongside this paper, we publish a Python package to aid with future model submissions and a growing online leaderboard with further insights into trade-offs between various performance metrics. To answer the question which ML methodology performs best at materials discovery, our initial release explores a variety of models including random forests, graph neural networks (GNN), one-shot predictors, iterative Bayesian optimizers and universal interatomic potentials (UIP). Ranked best-to-worst by their test set F1 score on thermodynamic stability prediction, we find CHGNet > M3GNet > MACE > ALIGNN > MEGNet > CGCNN > CGCNN+P > Wrenformer > BOWSR > Voronoi tessellation fingerprints with random forest. The top 3 models are UIPs, the winning methodology for ML-guided materials discovery, achieving F1 scores of ~0.6 for crystal stability classification and discovery acceleration factors (DAF) of up to 5x on the first 10k most stable predictions compared to dummy selection from our test set. We also highlight a sharp disconnect between commonly used global regression metrics and more task-relevant classification metrics. Accurate regressors are susceptible to unexpectedly high false-positive rates if those accurate predictions lie close to the decision boundary at 0 eV/atom above the convex hull where most materials are. Our results highlight the need to focus on classification metrics that actually correlate with improved stability hit rate.
Hybrid Quantum Generative Adversarial Networks for Molecular Simulation and Drug Discovery
In molecular research, simulation \& design of molecules are key areas with significant implications for drug development, material science, and other fields. Current classical computational power falls inadequate to simulate any more than small molecules, let alone protein chains on hundreds of peptide. Therefore these experiment are done physically in wet-lab, but it takes a lot of time \& not possible to examine every molecule due to the size of the search area, tens of billions of dollars are spent every year in these research experiments. Molecule simulation \& design has lately advanced significantly by machine learning models, A fresh perspective on the issue of chemical synthesis is provided by deep generative models for graph-structured data. By optimising differentiable models that produce molecular graphs directly, it is feasible to avoid costly search techniques in the discrete and huge space of chemical structures. But these models also suffer from computational limitations when dimensions become huge and consume huge amount of resources. Quantum Generative machine learning in recent years have shown some empirical results promising significant advantages over classical counterparts.
DrugGen: Advancing Drug Discovery with Large Language Models and Reinforcement Learning Feedback
Traditional drug design faces significant challenges due to inherent chemical and biological complexities, often resulting in high failure rates in clinical trials. Deep learning advancements, particularly generative models, offer potential solutions to these challenges. One promising algorithm is DrugGPT, a transformer-based model, that generates small molecules for input protein sequences. Although promising, it generates both chemically valid and invalid structures and does not incorporate the features of approved drugs, resulting in time-consuming and inefficient drug discovery. To address these issues, we introduce DrugGen, an enhanced model based on the DrugGPT structure. DrugGen is fine-tuned on approved drug-target interactions and optimized with proximal policy optimization. By giving reward feedback from protein-ligand binding affinity prediction using pre-trained transformers (PLAPT) and a customized invalid structure assessor, DrugGen significantly improves performance. Evaluation across multiple targets demonstrated that DrugGen achieves 100% valid structure generation compared to 95.5% with DrugGPT and produced molecules with higher predicted binding affinities (7.22 [6.30-8.07]) compared to DrugGPT (5.81 [4.97-6.63]) while maintaining diversity and novelty. Docking simulations further validate its ability to generate molecules targeting binding sites effectively. For example, in the case of fatty acid-binding protein 5 (FABP5), DrugGen generated molecules with superior docking scores (FABP5/11, -9.537 and FABP5/5, -8.399) compared to the reference molecule (Palmitic acid, -6.177). Beyond lead compound generation, DrugGen also shows potential for drug repositioning and creating novel pharmacophores for existing targets. By producing high-quality small molecules, DrugGen provides a high-performance medium for advancing pharmaceutical research and drug discovery.
Generalizing Neural Wave Functions
Recent neural network-based wave functions have achieved state-of-the-art accuracies in modeling ab-initio ground-state potential energy surface. However, these networks can only solve different spatial arrangements of the same set of atoms. To overcome this limitation, we present Graph-learned orbital embeddings (Globe), a neural network-based reparametrization method that can adapt neural wave functions to different molecules. Globe learns representations of local electronic structures that generalize across molecules via spatial message passing by connecting molecular orbitals to covalent bonds. Further, we propose a size-consistent wave function Ansatz, the Molecular orbital network (Moon), tailored to jointly solve Schr\"odinger equations of different molecules. In our experiments, we find Moon converging in 4.5 times fewer steps to similar accuracy as previous methods or to lower energies given the same time. Further, our analysis shows that Moon's energy estimate scales additively with increased system sizes, unlike previous work where we observe divergence. In both computational chemistry and machine learning, we are the first to demonstrate that a single wave function can solve the Schr\"odinger equation of molecules with different atoms jointly.
GemNet-OC: Developing Graph Neural Networks for Large and Diverse Molecular Simulation Datasets
Recent years have seen the advent of molecular simulation datasets that are orders of magnitude larger and more diverse. These new datasets differ substantially in four aspects of complexity: 1. Chemical diversity (number of different elements), 2. system size (number of atoms per sample), 3. dataset size (number of data samples), and 4. domain shift (similarity of the training and test set). Despite these large differences, benchmarks on small and narrow datasets remain the predominant method of demonstrating progress in graph neural networks (GNNs) for molecular simulation, likely due to cheaper training compute requirements. This raises the question -- does GNN progress on small and narrow datasets translate to these more complex datasets? This work investigates this question by first developing the GemNet-OC model based on the large Open Catalyst 2020 (OC20) dataset. GemNet-OC outperforms the previous state-of-the-art on OC20 by 16% while reducing training time by a factor of 10. We then compare the impact of 18 model components and hyperparameter choices on performance in multiple datasets. We find that the resulting model would be drastically different depending on the dataset used for making model choices. To isolate the source of this discrepancy we study six subsets of the OC20 dataset that individually test each of the above-mentioned four dataset aspects. We find that results on the OC-2M subset correlate well with the full OC20 dataset while being substantially cheaper to train on. Our findings challenge the common practice of developing GNNs solely on small datasets, but highlight ways of achieving fast development cycles and generalizable results via moderately-sized, representative datasets such as OC-2M and efficient models such as GemNet-OC. Our code and pretrained model weights are open-sourced.
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.
MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design
Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture due to their exceptional porosity and tunable chemistry. Their modular nature has enabled the use of template-based methods to generate hypothetical MOFs by combining molecular building blocks in accordance with known network topologies. However, the ability of these methods to identify top-performing MOFs is often hindered by the limited diversity of the resulting chemical space. In this work, we propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures through a denoising diffusion process over the coordinates and identities of the building blocks. The all-atom MOF structure is then determined through a novel assembly algorithm. Equivariant graph neural networks are used for the diffusion model to respect the permutational and roto-translational symmetries. We comprehensively evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials for carbon capture applications with molecular simulations.
Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model
While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratio of 82.58%, 68.03%, and 67.48%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties that have not been encountered. It can comprehend text inputs with various language styles, extending beyond the confines of outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in the domains of drug discovery and materials science.
Generative Artificial Intelligence for Navigating Synthesizable Chemical Space
We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space. Unlike traditional molecular generation approaches, we generate synthetic pathways for molecules to ensure that designs are synthetically tractable. By incorporating a scalable transformer architecture and a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design. We demonstrate SynFormer's effectiveness in two key applications: (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle. Additionally, we demonstrate the scalability of our approach via the improvement in performance as more computational resources become available. With our code and trained models openly available, we hope that SynFormer will find use across applications in drug discovery and materials science.
Transition-Based Constrained DFT for the Robust and Reliable Treatment of Excitations in Supramolecular Systems
Despite the variety of available computational approaches, state-of-the-art methods for calculating excitation energies such as time-dependent density functional theory (TDDFT), are computationally demanding and thus limited to moderate system sizes. Here, we introduce a new variation of constrained DFT (CDFT), wherein the constraint corresponds to a particular transition (T), or combination of transitions, between occupied and virtual orbitals, rather than a region of the simulation space as in traditional CDFT. We compare T-CDFT with TDDFT and DeltaSCF results for the low lying excited states (S_{1} and T_{1}) of a set of gas phase acene molecules and OLED emitters, as well as with reference results from the literature. At the PBE level of theory, T-CDFT outperforms DeltaSCF for both classes of molecules, while also proving to be more robust. For the local excitations seen in the acenes, T-CDFT and TDDFT perform equally well. For the charge-transfer (CT)-like excitations seen in the OLED molecules, T-CDFT also performs well, in contrast to the severe energy underestimation seen with TDDFT. In other words, T-CDFT is equally applicable to both local excitations and CT states, providing more reliable excitation energies at a much lower computational cost than TDDFT. T-CDFT is designed for large systems and has been implemented in the linear scaling BigDFT code. It is therefore ideally suited for exploring the effects of explicit environments on excitation energies, paving the way for future simulations of excited states in complex realistic morphologies, such as those which occur in OLED materials.
Generative Discovery of Novel Chemical Designs using Diffusion Modeling and Transformer Deep Neural Networks with Application to Deep Eutectic Solvents
We report a series of deep learning models to solve complex forward and inverse design problems in molecular modeling and design. Using both diffusion models inspired by nonequilibrium thermodynamics and attention-based transformer architectures, we demonstrate a flexible framework to capture complex chemical structures. First trained on the QM9 dataset and a series of quantum mechanical properties (e.g. homo, lumo, free energy, heat capacity, etc.), we then generalize the model to study and design key properties of deep eutectic solvents. In addition to separate forward and inverse models, we also report an integrated fully prompt-based multi-task generative pretrained transformer model that solves multiple forward, inverse design, and prediction tasks, flexibly and within one model. We show that the multi-task generative model has the overall best performance and allows for flexible integration of multiple objectives, within one model, and for distinct chemistries, suggesting that synergies emerge during training of this large language model. Trained jointly in tasks related to the QM9 dataset and deep eutectic solvents (DESs), the model can predict various quantum mechanical properties and critical properties to achieve deep eutectic solvent behavior. Several novel combinations of DESs are proposed based on this framework.
Fine-Tuned Language Models Generate Stable Inorganic Materials as Text
We propose fine-tuning large language models for generation of stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49% vs 28%) of CDVAE, a competing diffusion model. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable material, infilling of partial structures and text-conditional generation. Finally, we show that language models' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data.
Agentic End-to-End De Novo Protein Design for Tailored Dynamics Using a Language Diffusion Model
Proteins are dynamic molecular machines whose biological functions, spanning enzymatic catalysis, signal transduction, and structural adaptation, are intrinsically linked to their motions. Designing proteins with targeted dynamic properties, however, remains a challenge due to the complex, degenerate relationships between sequence, structure, and molecular motion. Here, we introduce VibeGen, a generative AI framework that enables end-to-end de novo protein design conditioned on normal mode vibrations. VibeGen employs an agentic dual-model architecture, comprising a protein designer that generates sequence candidates based on specified vibrational modes and a protein predictor that evaluates their dynamic accuracy. This approach synergizes diversity, accuracy, and novelty during the design process. Via full-atom molecular simulations as direct validation, we demonstrate that the designed proteins accurately reproduce the prescribed normal mode amplitudes across the backbone while adopting various stable, functionally relevant structures. Notably, generated sequences are de novo, exhibiting no significant similarity to natural proteins, thereby expanding the accessible protein space beyond evolutionary constraints. Our work integrates protein dynamics into generative protein design, and establishes a direct, bidirectional link between sequence and vibrational behavior, unlocking new pathways for engineering biomolecules with tailored dynamical and functional properties. This framework holds broad implications for the rational design of flexible enzymes, dynamic scaffolds, and biomaterials, paving the way toward dynamics-informed AI-driven protein engineering.
ProteinBench: A Holistic Evaluation of Protein Foundation Models
Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we release the evaluation dataset, code, and a public leaderboard publicly for further analysis and a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.
Crystal-GFN: sampling crystals with desirable properties and constraints
Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state materials such as electrocatalysts, super-ionic conductors or photovoltaic materials can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFN, a generative model of crystal structures that sequentially samples structural properties of crystalline materials, namely the space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and structural hard constraints, as well as the use of any available predictive model of a desired physicochemical property as an objective function. To design stable materials, one must target the candidates with the lowest formation energy. Here, we use as objective the formation energy per atom of a crystal structure predicted by a new proxy machine learning model trained on MatBench. The results demonstrate that Crystal-GFN is able to sample highly diverse crystals with low (median -3.1 eV/atom) predicted formation energy.
Small Molecule Optimization with Large Language Models
Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
Crystal Structure Generation with Autoregressive Large Language Modeling
The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. Quickly generating and predicting inorganic crystal structures is important for the discovery of new materials, which can target applications such as energy or electronic devices. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. The integration with predictors of formation energy permits the use of a Monte Carlo Tree Search algorithm to improve the generation of meaningful structures. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective 'world models' of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.
MOFA: Discovering Materials for Carbon Capture with a GenAI- and Simulation-Based Workflow
We present MOFA, an open-source generative AI (GenAI) plus simulation workflow for high-throughput generation of metal-organic frameworks (MOFs) on large-scale high-performance computing (HPC) systems. MOFA addresses key challenges in integrating GPU-accelerated computing for GPU-intensive GenAI tasks, including distributed training and inference, alongside CPU- and GPU-optimized tasks for screening and filtering AI-generated MOFs using molecular dynamics, density functional theory, and Monte Carlo simulations. These heterogeneous tasks are unified within an online learning framework that optimizes the utilization of available CPU and GPU resources across HPC systems. Performance metrics from a 450-node (14,400 AMD Zen 3 CPUs + 1800 NVIDIA A100 GPUs) supercomputer run demonstrate that MOFA achieves high-throughput generation of novel MOF structures, with CO_2 adsorption capacities ranking among the top 10 in the hypothetical MOF (hMOF) dataset. Furthermore, the production of high-quality MOFs exhibits a linear relationship with the number of nodes utilized. The modular architecture of MOFA will facilitate its integration into other scientific applications that dynamically combine GenAI with large-scale simulations.
Precision measurement of the last bound states in H_2 and determination of the H + H scattering length
The binding energies of the five bound rotational levels J=0-4 in the highest vibrational level v=14 in the X^1Sigma_g^+ ground electronic state of H_2 were measured in a three-step ultraviolet-laser experiment. Two-photon UV-photolysis of H_2S produced population in these high-lying bound states, that were subsequently interrogated at high precision via Doppler-free spectroscopy of the F^1Sigma_g^+ - X^1Sigma_g^+ system. A third UV-laser was used for detection through auto-ionizing resonances. The experimentally determined binding energies were found to be in excellent agreement with calculations based on non-adiabatic perturbation theory, also including relativistic and quantum electrodynamical contributions. The s-wave scattering length of the H + H system is derived from the binding energy of the last bound J=0 level via a direct semi-empirical approach, yielding a value of a_s = 0.2724(5) a_0, in good agreement with a result from a previously followed theoretical approach. The subtle effect of the malpha^4 relativity contribution to a_s was found to be significant. In a similar manner a value for the p-wave scattering volume is determined via the J=1 binding energy yielding a_p = -134.0000(6) a_0^3. The binding energy of the last bound state in H_2, the (v=14, J=4) level, is determined at 0.023(4) cm^{-1}, in good agreement with calculation. The effect of the hyperfine substructure caused by the two hydrogen atoms at large internuclear separation, giving rise to three distinct dissociation limits, is discussed.
Unified Generative Modeling of 3D Molecules via Bayesian Flow Networks
Advanced generative model (e.g., diffusion model) derived from simplified continuity assumptions of data distribution, though showing promising progress, has been difficult to apply directly to geometry generation applications due to the multi-modality and noise-sensitive nature of molecule geometry. This work introduces Geometric Bayesian Flow Networks (GeoBFN), which naturally fits molecule geometry by modeling diverse modalities in the differentiable parameter space of distributions. GeoBFN maintains the SE-(3) invariant density modeling property by incorporating equivariant inter-dependency modeling on parameters of distributions and unifying the probabilistic modeling of different modalities. Through optimized training and sampling techniques, we demonstrate that GeoBFN achieves state-of-the-art performance on multiple 3D molecule generation benchmarks in terms of generation quality (90.87% molecule stability in QM9 and 85.6% atom stability in GEOM-DRUG. GeoBFN can also conduct sampling with any number of steps to reach an optimal trade-off between efficiency and quality (e.g., 20-times speedup without sacrificing performance).
Thermally Averaged Magnetic Anisotropy Tensors via Machine Learning Based on Gaussian Moments
We propose a machine learning method to model molecular tensorial quantities, namely the magnetic anisotropy tensor, based on the Gaussian-moment neural-network approach. We demonstrate that the proposed methodology can achieve an accuracy of 0.3--0.4 cm^{-1} and has excellent generalization capability for out-of-sample configurations. Moreover, in combination with machine-learned interatomic potential energies based on Gaussian moments, our approach can be applied to study the dynamic behavior of magnetic anisotropy tensors and provide a unique insight into spin-phonon relaxation.
Long-Range Neural Atom Learning for Molecular Graphs
Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs are mainly good at leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method that implicitly projects all original atoms into a few Neural Atoms, which abstracts the collective information of atomic groups within a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms' representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection with the traditional LRI calculation method, Ewald Summation. We conduct extensive experiments on three long-range graph benchmarks, covering both graph-level and link-level tasks on molecular graphs. We empirically justify that our method can be equipped with an arbitrary GNN and help to capture LRI.
Analyzing Learned Molecular Representations for Property Prediction
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.
De novo design of high-affinity protein binders with AlphaProteo
Computational design of protein-binding proteins is a fundamental capability with broad utility in biomedical research and biotechnology. Recent methods have made strides against some target proteins, but on-demand creation of high-affinity binders without multiple rounds of experimental testing remains an unsolved challenge. This technical report introduces AlphaProteo, a family of machine learning models for protein design, and details its performance on the de novo binder design problem. With AlphaProteo, we achieve 3- to 300-fold better binding affinities and higher experimental success rates than the best existing methods on seven target proteins. Our results suggest that AlphaProteo can generate binders "ready-to-use" for many research applications using only one round of medium-throughput screening and no further optimization.
Language models in molecular discovery
The success of language models, especially transformer-based architectures, has trickled into other domains giving rise to "scientific language models" that operate on small molecules, proteins or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle as evidenced by promising recent findings in early-stage drug discovery. Here, we review the role of language models in molecular discovery, underlining their strength in de novo drug design, property prediction and reaction chemistry. We highlight valuable open-source software assets thus lowering the entry barrier to the field of scientific language modeling. Last, we sketch a vision for future molecular design that combines a chatbot interface with access to computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.
Chemically Transferable Generative Backmapping of Coarse-Grained Proteins
Coarse-graining (CG) accelerates molecular simulations of protein dynamics by simulating sets of atoms as singular beads. Backmapping is the opposite operation of bringing lost atomistic details back from the CG representation. While machine learning (ML) has produced accurate and efficient CG simulations of proteins, fast and reliable backmapping remains a challenge. Rule-based methods produce poor all-atom geometries, needing computationally costly refinement through additional simulations. Recently proposed ML approaches outperform traditional baselines but are not transferable between proteins and sometimes generate unphysical atom placements with steric clashes and implausible torsion angles. This work addresses both issues to build a fast, transferable, and reliable generative backmapping tool for CG protein representations. We achieve generalization and reliability through a combined set of innovations: representation based on internal coordinates; an equivariant encoder/prior; a custom loss function that helps ensure local structure, global structure, and physical constraints; and expert curation of high-quality out-of-equilibrium protein data for training. Our results pave the way for out-of-the-box backmapping of coarse-grained simulations for arbitrary proteins.
Conditional Graph Information Bottleneck for Molecular Relational Learning
Molecular relational learning, whose goal is to learn the interaction behavior between molecular pairs, got a surge of interest in molecular sciences due to its wide range of applications. Recently, graph neural networks have recently shown great success in molecular relational learning by modeling a molecule as a graph structure, and considering atom-level interactions between two molecules. Despite their success, existing molecular relational learning methods tend to overlook the nature of chemistry, i.e., a chemical compound is composed of multiple substructures such as functional groups that cause distinctive chemical reactions. In this work, we propose a novel relational learning framework, called CGIB, that predicts the interaction behavior between a pair of graphs by detecting core subgraphs therein. The main idea is, given a pair of graphs, to find a subgraph from a graph that contains the minimal sufficient information regarding the task at hand conditioned on the paired graph based on the principle of conditional graph information bottleneck. We argue that our proposed method mimics the nature of chemical reactions, i.e., the core substructure of a molecule varies depending on which other molecule it interacts with. Extensive experiments on various tasks with real-world datasets demonstrate the superiority of CGIB over state-of-the-art baselines. Our code is available at https://github.com/Namkyeong/CGIB.
An inorganic ABX3 perovskite materials dataset for target property prediction and classification using machine learning
The reliability with Machine Learning (ML) techniques in novel materials discovery often depend on the quality of the dataset, in addition to the relevant features used in describing the material. In this regard, the current study presents and validates a newly processed materials dataset that can be utilized for benchmark ML analysis, as it relates to the prediction and classification of deterministic target properties. Originally, the dataset was extracted from the Open Quantum Materials Database (OQMD) and contains a robust 16,323 samples of ABX3 inorganic perovskite structures. The dataset is tabular in form and is preprocessed to include sixty-one generalized input features that broadly describes the physicochemical, stability/geometrical, and Density Functional Theory (DFT) target properties associated with the elemental ionic sites in a three-dimensional ABX3 polyhedral. For validation, four different ML models are employed to predict three distinctive target properties, namely: formation energy, energy band gap, and crystal system. On experimentation, the best accuracy measurements are reported at 0.013 eV/atom MAE, 0.216 eV MAE, and 85% F1, corresponding to the formation energy prediction, band gap prediction and crystal system multi-classification, respectively. Moreover, the realized results are compared with previous literature and as such, affirms the resourcefulness of the current dataset for future benchmark materials analysis via ML techniques. The preprocessed dataset and source codes are openly available to download from github.com/chenebuah/ML_abx3_dataset.
Computational design of target-specific linear peptide binders with TransformerBeta
The computational prediction and design of peptide binders targeting specific linear epitopes is crucial in biological and biomedical research, yet it remains challenging due to their highly dynamic nature and the scarcity of experimentally solved binding data. To address this problem, we built an unprecedentedly large-scale library of peptide pairs within stable secondary structures (beta sheets), leveraging newly available AlphaFold predicted structures. We then developed a machine learning method based on the Transformer architecture for the design of specific linear binders, in analogy to a language translation task. Our method, TransformerBeta, accurately predicts specific beta strand interactions and samples sequences with beta sheet-like molecular properties, while capturing interpretable physico-chemical interaction patterns. As such, it can propose specific candidate binders targeting linear epitope for experimental validation to inform protein design.
A combined statistical mechanical and ab initio approach to understanding H2O/CO2 co-adsorption in mmen-Mg2(dobpdc)
We study the effects of H2O on CO2 adsorption in an amine-appended variant of the metal-organic framework Mg2(dobpdc), which is known to exhibit chaining behavior that presents in a step-shaped adsorption isotherm. We first show how the presence of different levels of local H2O affects this chaining behavior and the energetics of CO2 adsorption, based on a series of ab initio calculations, giving insight into the atomic-scale environment. In particular, we predict a novel adsorbed configuration, in which H2O and CO2 intertwine to make a braided chain down the MOF pore. We then show how an existing lattice model can be adapted to incorporate the effect of water, and predict the CO2 isotherms for the various water levels, observing a sharp shift the uptake at low partial pressures. In addition to the physical further work on this and related materials.
From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery
Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments of in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery.
Rigid Protein-Protein Docking via Equivariant Elliptic-Paraboloid Interface Prediction
The study of rigid protein-protein docking plays an essential role in a variety of tasks such as drug design and protein engineering. Recently, several learning-based methods have been proposed for the task, exhibiting much faster docking speed than those computational methods. In this paper, we propose a novel learning-based method called ElliDock, which predicts an elliptic paraboloid to represent the protein-protein docking interface. To be specific, our model estimates elliptic paraboloid interfaces for the two input proteins respectively, and obtains the roto-translation transformation for docking by making two interfaces coincide. By its design, ElliDock is independently equivariant with respect to arbitrary rotations/translations of the proteins, which is an indispensable property to ensure the generalization of the docking process. Experimental evaluations show that ElliDock achieves the fastest inference time among all compared methods and is strongly competitive with current state-of-the-art learning-based models such as DiffDock-PP and Multimer particularly for antibody-antigen docking.
NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation
3D molecule generation is crucial for drug discovery and material design. While prior efforts focus on 3D diffusion models for their benefits in modeling continuous 3D conformers, they overlook the advantages of 1D SELFIES-based Language Models (LMs), which can generate 100% valid molecules and leverage the billion-scale 1D molecule datasets. To combine these advantages for 3D molecule generation, we propose a foundation model -- NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation. NExT-Mol uses an extensively pretrained molecule LM for 1D molecule generation, and subsequently predicts the generated molecule's 3D conformers with a 3D diffusion model. We enhance NExT-Mol's performance by scaling up the LM's model size, refining the diffusion neural architecture, and applying 1D to 3D transfer learning. Notably, our 1D molecule LM significantly outperforms baselines in distributional similarity while ensuring validity, and our 3D diffusion model achieves leading performances in conformer prediction. Given these improvements in 1D and 3D modeling, NExT-Mol achieves a 26% relative improvement in 3D FCD for de novo 3D generation on GEOM-DRUGS, and a 13% average relative gain for conditional 3D generation on QM9-2014. Our codes and pretrained checkpoints are available at https://github.com/acharkq/NExT-Mol.
PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion
Peptide therapeutics, a major class of medicines, have achieved remarkable success across diseases such as diabetes and cancer, with landmark examples such as GLP-1 receptor agonists revolutionizing the treatment of type-2 diabetes and obesity. Despite their success, designing peptides that satisfy multiple conflicting objectives, such as target binding affinity, solubility, and membrane permeability, remains a major challenge. Classical drug development and structure-based design are ineffective for such tasks, as they fail to optimize global functional properties critical for therapeutic efficacy. Existing generative frameworks are largely limited to continuous spaces, unconditioned outputs, or single-objective guidance, making them unsuitable for discrete sequence optimization across multiple properties. To address this, we present PepTune, a multi-objective discrete diffusion model for the simultaneous generation and optimization of therapeutic peptide SMILES. Built on the Masked Discrete Language Model (MDLM) framework, PepTune ensures valid peptide structures with state-dependent masking schedules and penalty-based objectives. To guide the diffusion process, we propose a Monte Carlo Tree Search (MCTS)-based strategy that balances exploration and exploitation to iteratively refine Pareto-optimal sequences. MCTS integrates classifier-based rewards with search-tree expansion, overcoming gradient estimation challenges and data sparsity inherent to discrete spaces. Using PepTune, we generate diverse, chemically-modified peptides optimized for multiple therapeutic properties, including target binding affinity, membrane permeability, solubility, hemolysis, and non-fouling characteristics on various disease-relevant targets. In total, our results demonstrate that MCTS-guided discrete diffusion is a powerful and modular approach for multi-objective sequence design in discrete state spaces.
Comparative modeling studies of TSDC: investigation of Alpha-relaxation in Amorphous polymers
A model to investigate Thermally Stimulated Depolarization Current (TSDC) peak parameters using the dipole-dipole interaction concept is proposed by the author in this work. The proposed model describe the (TSDC) peak successfully since it gives a significant peak parameters (i.e. Activation energy (E) and the per-exponential factor (\tau_0) in addition to the dipole-dipole interaction strength parameter (di). Application of this model to determine the peak parameters of polyvinyl chloride(PVC) polymer is presented . The results show how the model fit the experimental thermal sampling data. Finally the results are compared to the well know techniques; the initial rise method (IR), the half width method (HW) in addition to the Cowell and Woods analysis.
Searching for High-Value Molecules Using Reinforcement Learning and Transformers
Reinforcement learning (RL) over text representations can be effective for finding high-value policies that can search over graphs. However, RL requires careful structuring of the search space and algorithm design to be effective in this challenge. Through extensive experiments, we explore how different design choices for text grammar and algorithmic choices for training can affect an RL policy's ability to generate molecules with desired properties. We arrive at a new RL-based molecular design algorithm (ChemRLformer) and perform a thorough analysis using 25 molecule design tasks, including computationally complex protein docking simulations. From this analysis, we discover unique insights in this problem space and show that ChemRLformer achieves state-of-the-art performance while being more straightforward than prior work by demystifying which design choices are actually helpful for text-based molecule design.
Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding
Chemical language models (CLMs) are prominent for their effectiveness in exploring chemical space and enabling molecular engineering. However, while exploring chemical-linguistic space, CLMs suffer from the gap between natural language and molecular representations. This challenge is primarily due to the inherent modeling differences between molecules and texts: molecules operate unified modeling to learn chemical space, while natural language sequentially models the semantic space. Additionally, the limited availability of high-quality text-to-molecule datasets further exacerbates this challenge. To address the problem, we first verified the information bias in molecular representations from different perspectives. We then developed the Heterogeneous Molecular Encoding (HME) framework, a unified molecular encoder compressing the molecular features from fragment sequence, topology, and conformation with Q-learning. To better model chemical-linguistic space, we further constructed the MCMoD dataset, which contains over one million molecules with various conditions, including properties, fragments, and descriptions. Experimentally, HME promotes CLMs to achieve chemical-linguistic sharing space exploration: (1) chemical space exploration with linguistic guidance, where HME achieves significant improvements (+37.8\% FCD) for molecular design in multiple constraints, even in zero-shot scenarios; (2) linguistic space exploration with molecular guidance, where HME generates textual descriptions with high qualities (+11.6\% BLEU) for molecules. These results highlight the precision of HME in handling multi-objective and cross-domain tasks, as well as its remarkable generalization capability on unseen task combinations. HME offers a new perspective on navigating chemical-linguistic sharing space, advancing the potential of CLMs in both fundamental research and practical applications in chemistry.
Von Mises Mixture Distributions for Molecular Conformation Generation
Molecules are frequently represented as graphs, but the underlying 3D molecular geometry (the locations of the atoms) ultimately determines most molecular properties. However, most molecules are not static and at room temperature adopt a wide variety of geometries or conformations. The resulting distribution on geometries p(x) is known as the Boltzmann distribution, and many molecular properties are expectations computed under this distribution. Generating accurate samples from the Boltzmann distribution is therefore essential for computing these expectations accurately. Traditional sampling-based methods are computationally expensive, and most recent machine learning-based methods have focused on identifying modes in this distribution rather than generating true samples. Generating such samples requires capturing conformational variability, and it has been widely recognized that the majority of conformational variability in molecules arises from rotatable bonds. In this work, we present VonMisesNet, a new graph neural network that captures conformational variability via a variational approximation of rotatable bond torsion angles as a mixture of von Mises distributions. We demonstrate that VonMisesNet can generate conformations for arbitrary molecules in a way that is both physically accurate with respect to the Boltzmann distribution and orders of magnitude faster than existing sampling methods.
Molecular Contrastive Learning with Chemical Element Knowledge Graph
Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm as it utilizes self-supervision signals and has no requirements for human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus ignore the correlations between atoms that have common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) to encode complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrated that KCL obtained superior performances against state-of-the-art baselines on eight molecular datasets. Visualization experiments properly interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our codes and data are available at https://github.com/ZJU-Fangyin/KCL.
Full-Atom Peptide Design based on Multi-modal Flow Matching
Peptides, short chains of amino acid residues, play a vital role in numerous biological processes by interacting with other target molecules, offering substantial potential in drug discovery. In this work, we present PepFlow, the first multi-modal deep generative model grounded in the flow-matching framework for the design of full-atom peptides that target specific protein receptors. Drawing inspiration from the crucial roles of residue backbone orientations and side-chain dynamics in protein-peptide interactions, we characterize the peptide structure using rigid backbone frames within the SE(3) manifold and side-chain angles on high-dimensional tori. Furthermore, we represent discrete residue types in the peptide sequence as categorical distributions on the probability simplex. By learning the joint distributions of each modality using derived flows and vector fields on corresponding manifolds, our method excels in the fine-grained design of full-atom peptides. Harnessing the multi-modal paradigm, our approach adeptly tackles various tasks such as fix-backbone sequence design and side-chain packing through partial sampling. Through meticulously crafted experiments, we demonstrate that PepFlow exhibits superior performance in comprehensive benchmarks, highlighting its significant potential in computational peptide design and analysis.
Energy-conserving equivariant GNN for elasticity of lattice architected metamaterials
Lattices are architected metamaterials whose properties strongly depend on their geometrical design. The analogy between lattices and graphs enables the use of graph neural networks (GNNs) as a faster surrogate model compared to traditional methods such as finite element modelling. In this work, we generate a big dataset of structure-property relationships for strut-based lattices. The dataset is made available to the community which can fuel the development of methods anchored in physical principles for the fitting of fourth-order tensors. In addition, we present a higher-order GNN model trained on this dataset. The key features of the model are (i) SE(3) equivariance, and (ii) consistency with the thermodynamic law of conservation of energy. We compare the model to non-equivariant models based on a number of error metrics and demonstrate its benefits in terms of predictive performance and reduced training requirements. Finally, we demonstrate an example application of the model to an architected material design task. The methods which we developed are applicable to fourth-order tensors beyond elasticity such as piezo-optical tensor etc.
Equivariant 3D-Conditional Diffusion Models for Molecular Linker Design
Fragment-based drug discovery has been an effective paradigm in early-stage drug development. An open challenge in this area is designing linkers between disconnected molecular fragments of interest to obtain chemically-relevant candidate drug molecules. In this work, we propose DiffLinker, an E(3)-equivariant 3D-conditional diffusion model for molecular linker design. Given a set of disconnected fragments, our model places missing atoms in between and designs a molecule incorporating all the initial fragments. Unlike previous approaches that are only able to connect pairs of molecular fragments, our method can link an arbitrary number of fragments. Additionally, the model automatically determines the number of atoms in the linker and its attachment points to the input fragments. We demonstrate that DiffLinker outperforms other methods on the standard datasets generating more diverse and synthetically-accessible molecules. Besides, we experimentally test our method in real-world applications, showing that it can successfully generate valid linkers conditioned on target protein pockets.
Structured Chemistry Reasoning with Large Language Models
This paper studies the problem of solving complex chemistry problems with large language models (LLMs). Despite the extensive general knowledge in LLMs (such as GPT-4), they struggle with chemistry reasoning that requires faithful grounded reasoning with diverse chemical knowledge and an integrative understanding of chemical interactions. We propose InstructChem, a new structured reasoning approach that substantially boosts the LLMs' chemical reasoning capabilities. InstructChem explicitly decomposes the reasoning into three critical phrases, including chemical formulae generation by LLMs that offers the basis for subsequent grounded reasoning, step-by-step reasoning that makes multi-step derivations with the identified formulae for a preliminary answer, and iterative review-and-refinement that steers LLMs to progressively revise the previous phases for increasing confidence, leading to the final high-confidence answer. We conduct extensive experiments on four different chemistry challenges, including quantum chemistry, quantum mechanics, physical chemistry, and chemistry kinetics. Our approach significantly enhances GPT-4 on chemistry reasoning, yielding an 8% average absolute improvement and a 30% peak improvement. We further use the generated reasoning by GPT-4 to fine-tune smaller LMs (e.g., Vicuna) and observe strong improvement of the smaller LMs. This validates our approach and enables LLMs to generate high-quality reasoning.
Differentiable Simulations for Enhanced Sampling of Rare Events
Simulating rare events, such as the transformation of a reactant into a product in a chemical reaction typically requires enhanced sampling techniques that rely on heuristically chosen collective variables (CVs). We propose using differentiable simulations (DiffSim) for the discovery and enhanced sampling of chemical transformations without a need to resort to preselected CVs, using only a distance metric. Reaction path discovery and estimation of the biasing potential that enhances the sampling are merged into a single end-to-end problem that is solved by path-integral optimization. This is achieved by introducing multiple improvements over standard DiffSim such as partial backpropagation and graph mini-batching making DiffSim training stable and efficient. The potential of DiffSim is demonstrated in the successful discovery of transition paths for the Muller-Brown model potential as well as a benchmark chemical system - alanine dipeptide.