Research Article: APPLIED MATHEMATICS

Machine learning of accurate energy-conserving molecular force fields

Science Advances  05 May 2017:
Vol. 3, no. 5, e1603015
DOI: 10.1126/sciadv.1603015
  • Fig. 1 The construction of ML models: First, reference data from an MD trajectory are sampled.

    (A) The geometry of each molecule is encoded in a descriptor. This representation introduces elementary transformational invariances of energy and constitutes the first part of the prior. A kernel function then relates all descriptors to form the kernel matrix, which is the second part of the prior. The kernel function encodes similarity between data points. Our particular choice makes only weak assumptions: It limits the frequency spectrum of the resulting model and adds the energy conservation constraint. Hess, Hessian. (B) These general priors are sufficient to reproduce good estimates from a restricted number of force samples. (C) A comparable energy model is not able to reproduce the PES to the same level of detail.
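    As a rough illustration of the two parts of the prior described in panel A, the sketch below builds a descriptor that is invariant to translation and rotation and then forms a kernel matrix from it. The inverse-pairwise-distance descriptor and the Gaussian kernel are assumptions chosen for brevity, not necessarily the choices made in the article, and all function names are hypothetical.

```python
import numpy as np

def inverse_distance_descriptor(coords):
    """Map an (n_atoms, 3) geometry to its vector of inverse pairwise distances."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)  # each atom pair once
    dists = np.linalg.norm(coords[i] - coords[j], axis=1)
    return 1.0 / dists

def gaussian_kernel_matrix(descriptors, sigma=1.0):
    """K[a, b] = exp(-||x_a - x_b||^2 / (2 sigma^2)) over all descriptor pairs."""
    X = np.asarray(descriptors, dtype=float)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def random_rotation(rng):
    """Draw a proper 3x3 rotation matrix from the QR factorization of a random matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:  # flip one axis to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
geometry = rng.normal(size=(5, 3))   # toy 5-atom geometry (made-up coordinates)
moved = geometry @ random_rotation(rng).T + rng.normal(size=3)  # rotate, then translate
other = rng.normal(size=(5, 3))      # an unrelated toy geometry

descriptors = [inverse_distance_descriptor(g) for g in (geometry, moved, other)]
K = gaussian_kernel_matrix(descriptors)
print(np.allclose(descriptors[0], descriptors[1]))  # the descriptor ignores rigid motion
print(K)
```

    Because the descriptor depends only on interatomic distances, the rotated and translated copy receives the same representation as the original, so the kernel treats the two as identical data points.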

  • Fig. 2 Modeling the true vector field (leftmost subfigure) based on a small number of vector samples

    With GDML, a conservative vector field estimate is obtained directly. A naïve estimator with independent predictions for each element of the output vector is not capable of imposing energy conservation constraints. We perform a Helmholtz decomposition of this nonconservative vector field to show the error component that violates the law of energy conservation. This is the portion of the overall prediction error that was avoided with GDML because of the addition of the energy conservation constraint.
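    The Helmholtz decomposition mentioned in this caption can be sketched on a toy problem. The example below is a minimal sketch that assumes a periodic 2D grid and uses an FFT-based projection; the test fields, grid size, and function names are hypothetical and are not taken from the article. It splits a vector field into a curl-free (conservative) part and a divergence-free remainder, and the size of that remainder measures how strongly the field violates energy conservation.

```python
import numpy as np

def helmholtz_decompose_2d(fx, fy, dx=1.0, dy=1.0):
    """Split a periodic 2D field into curl-free and divergence-free parts via FFT."""
    ny, nx = fx.shape
    kx = np.fft.fftfreq(nx, d=dx)[None, :]   # frequencies along axis 1 (x)
    ky = np.fft.fftfreq(ny, d=dy)[:, None]   # frequencies along axis 0 (y)
    k2 = kx ** 2 + ky ** 2
    k2[0, 0] = 1.0                           # avoid 0/0 at the mean mode
    Fx, Fy = np.fft.fft2(fx), np.fft.fft2(fy)
    proj = (kx * Fx + ky * Fy) / k2          # (k . F) / |k|^2
    Gx, Gy = kx * proj, ky * proj            # component of F parallel to k
    Gx[0, 0], Gy[0, 0] = Fx[0, 0], Fy[0, 0]  # assign the constant mode to the curl-free part
    grad_x, grad_y = np.fft.ifft2(Gx).real, np.fft.ifft2(Gy).real
    return (grad_x, grad_y), (fx - grad_x, fy - grad_y)

def violation(fx, fy):
    """Root-mean-square magnitude of the divergence-free (nonconservative) part."""
    _, (rot_x, rot_y) = helmholtz_decompose_2d(fx, fy)
    return float(np.sqrt(np.mean(rot_x ** 2 + rot_y ** 2)))

n = 32
x = 2.0 * np.pi * np.arange(n) / n
X, Y = np.meshgrid(x, x)

# Conservative toy field: the negative gradient of U = cos(x) + cos(2y) / 2.
cons_x, cons_y = np.sin(X), np.sin(2.0 * Y)
# Divergence-free toy field derived from the stream function cos(x) cos(y).
rot_x, rot_y = -np.cos(X) * np.sin(Y), np.sin(X) * np.cos(Y)

# A "naive" estimate contaminated by a nonconservative error component.
naive_x, naive_y = cons_x + 0.3 * rot_x, cons_y + 0.3 * rot_y

violation_conservative = violation(cons_x, cons_y)
violation_naive = violation(naive_x, naive_y)
print(violation_conservative, violation_naive)
```

    In this toy setting the conservative field yields a remainder at the level of floating-point noise, while the contaminated field yields a clearly nonzero one, which mirrors the error component the caption attributes to the naïve estimator.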

  • Fig. 3 Efficiency of the GDML predictor versus a model that has been trained on energies.

    (A) Number of samples required to reach a force prediction MAE of 1 kcal mol⁻¹ Å⁻¹ with the energy-based model (gray) and GDML (blue). The energy-based model was not able to achieve the targeted performance with the maximum number of 63,000 samples for aspirin. (B) Force prediction errors for the converged models (same number of partial derivative samples and energy samples). (C) Energy prediction errors for the converged models. All reported prediction errors have been estimated via cross-validation.

  • Fig. 4 Results of classical and PIMD simulations.

    The recently developed estimators based on perturbation theory were used to evaluate structural and electronic observables (30). (A) Comparison of the interatomic distance distributions obtained from GDML (blue line) and DFT (dashed red line) with classical MD (main frame) and PIMD (inset). a.u., arbitrary units. (B) Probability distribution of the dihedral angles (corresponding to carboxylic acid and ester functional groups) using a 20 ps time interval from a total PIMD trajectory of 200 ps.

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/3/5/e1603015/DC1

    section S1. Noise amplification by differentiation

    section S2. Vector-valued kernel learning

    section S3. Descriptors

    section S4. Model analysis

    section S5. Details of the PIMD simulation

    fig. S1. The accuracy of the GDML model (in terms of the MAE) as a function of training set size: Chemical accuracy of less than 1 kcal/mol is already achieved for small training sets.

    fig. S2. Predicting energies and forces for consecutive time steps of an MD simulation of uracil at 500 K.

    table S1. Properties of MD data sets that were used for numerical testing.

    table S2. GDML prediction accuracy for interatomic forces and total energies for all data sets.

    table S3. Accuracy of the naïve force predictor.

    table S4. Accuracy of the converged energy-based predictor.

    References (31–36)
