Research Article | PHYSICS

Machine learning unifies the modeling of materials and molecules


Science Advances  13 Dec 2017:
Vol. 3, no. 12, e1701816
DOI: 10.1126/sciadv.1701816
  • Fig. 1 SOAP-GAP predictions for silicon surfaces.

    (A) The tilt angle of dimers on the reconstructed Si(100) surface [left, STM image (13); right, SOAP-GAP–relaxed structure] is the result of a Jahn-Teller distortion, predicted to be about 19° by both DFT and SOAP-GAP. Empirical force fields show no tilt. (B) The Si(111)–7 × 7 reconstruction is an iconic example of the complex structures that can emerge from the interplay of different quantum mechanical effects [left, STM image (14); right, SOAP-GAP–relaxed structure colored by the predicted local energy error when using a training set without adatoms]. (C) Reproducing this delicate balance and predicting that the 7 × 7 is the ground-state structure is one of the historical successes of DFT: a SOAP-based ML model is the only one that can describe this ordering, whereas widely used force fields incorrectly predict the unreconstructed surface (dashed lines) to be lower in energy.

  • Fig. 2 SOAP-GAP predictions for a molecular database.

    (A) Learning curves for the CC atomization energy of molecules in the GDB9 data set, using the average-kernel SOAP with a cutoff of 3 Å. Black lines correspond to using DFT geometries to predict CC energies for the DFT-optimized geometry. Using the DFT energies as a baseline and learning the correction ΔDFT→CC = ECC − EDFT leads to a fivefold reduction of the test error compared to learning CC energies directly as the target property (CCDFT). The other curves correspond to using PM7-optimized geometries as the input to the prediction of CC energies of the DFT geometries. There is little improvement when learning the energy correction (ΔPM7→CC) compared to direct training on the CC energies (CCPM7). However, using information on the structural discrepancy between PM7 and DFT geometries in the training set brings the prediction error down to a mean absolute error (MAE) of 1 kcal/mol. (B) A sketch-map representation of the GDB9 (each gray point corresponding to one structure) highlights the importance of selecting training configurations that uniformly cover configuration space. The average prediction error for different portions of the map is markedly different when using a random selection (C) and farthest-point sampling (FPS) (D). The latter is much better behaved in the peripheral, poorly populated regions.
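The FPS selection used in (D) can be sketched as a greedy loop: repeatedly add the structure whose minimum distance to the already-selected set is largest. The feature matrix and Euclidean metric below are illustrative stand-ins for SOAP descriptors and kernel-induced distances, not the paper's actual implementation:

```python
import numpy as np

def farthest_point_sampling(X, n_select, start=0):
    """Greedy FPS over rows of X: each step adds the point whose
    minimum distance to the current selection is maximal."""
    selected = [start]
    # distance of every point to the current selection
    dmin = np.linalg.norm(X - X[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dmin))
        selected.append(nxt)
        dmin = np.minimum(dmin, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# toy demo: a dense cluster near the origin plus one remote outlier;
# FPS picks up the outlier immediately, which random selection would
# usually miss -- the "peripheral, poorly populated regions" effect
X = np.vstack([np.random.default_rng(0).normal(0.0, 0.1, (50, 2)),
               [[10.0, 10.0]]])
print(farthest_point_sampling(X, 3))
```

Because each selected point has zero distance to the selection, the loop never picks a duplicate.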

  • Fig. 3 Predictions of the stability of glucose conformers at different levels of theory.

    (A) Extensive tests on 208 conformers of glucose (taking only 20 FPS samples for training) reveal the potential of an ML approach to bridge different levels of quantum chemistry; the diagonal of the plot shows the MAE resulting from direct training on each level of theory; the upper half shows the intrinsic difference between each pair of models; the lower half shows the MAE for learning each correction. (B) The energy difference between three pairs of electronic structure methods, partitioned into atomic contributions based on a SOAP analysis and represented as a heat map. The molecule on the left represents the lowest-energy conformer of glucose in the data set, and the one on the right represents the highest-energy conformer.
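The Δ-learning idea behind the lower half of (A) is that the correction between two levels of theory is often much smoother than the expensive energy itself, so it can be regressed from far fewer samples. A minimal kernel-ridge-regression sketch, with invented 1D surrogates standing in for the cheap and expensive methods (the functions, σ, and regularizer are all illustrative choices, not the paper's):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix between two sets of 1D "structures"
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(K, y, reg=1e-6):
    # kernel ridge regression weights: (K + reg*I)^{-1} y
    return np.linalg.solve(K + reg * np.eye(len(y)), y)

# invented toy surrogates: the cheap method captures the rugged part
# of the energy surface; the expensive method adds a smooth correction
e_cheap = lambda x: np.sin(3 * x)
e_exact = lambda x: np.sin(3 * x) + 0.1 * x**2 / 6

x_train = np.linspace(0, 6, 10)
x_test = np.linspace(0, 6, 200)
K = rbf(x_train, x_train)
Kt = rbf(x_test, x_train)

# direct learning of the expensive energy ...
w_direct = krr_fit(K, e_exact(x_train))
err_direct = np.abs(Kt @ w_direct - e_exact(x_test)).mean()

# ... versus learning only the correction on top of the cheap baseline
w_delta = krr_fit(K, e_exact(x_train) - e_cheap(x_train))
err_delta = np.abs(e_cheap(x_test) + Kt @ w_delta - e_exact(x_test)).mean()

print(err_delta < err_direct)
```

With only 10 training points the direct model undersamples the oscillatory part, while the Δ model only has to fit the smooth correction, so its test error is far smaller.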

  • Fig. 4 Predictions of ligand-receptor binding.

    (A) ROCs of binary classifiers based on a SOAP kernel, applied to the prediction of the binding behavior of ligands and decoys taken from the DUD-E, trained on 60 examples. Each ROC corresponds to one specific protein receptor. The red curve is the average over the individual ROCs. The dashed line corresponds to receptor FGFR1, which contains inconsistent data in the latest version of the DUD-E. Inset: AUC performance measure as a function of the number of ligands used in the training, for the “best match”–SOAP kernel (MATCH) and the average molecular SOAP kernel (AVG). (B and C) Visualization of binding moieties for adenosine receptor A2, as predicted for the crystal ligand (B), as well as two known ligands and one decoy (C). The contribution of an individual atomic environment to the classification is quantified by its contribution δzi to the signed distance z from the SVM decision boundary, and visualized as a heat map projected on the SOAP neighbor density [images for all ligands and all receptors are accessible online (27)]. Regions with δz > 0 contain structural patterns expected to promote binding (see color scale and text). The snapshot in (B) indicates the position of the crystal ligand in the receptor pocket as obtained by x-ray crystallography (28). PDB, Protein Data Bank.
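Because the average molecular kernel is a mean over atomic environments, the SVM decision value decomposes exactly into one δzi per environment, which is what the heat maps in (B) and (C) visualize. A minimal numpy sketch of that decomposition, with toy 2D feature vectors standing in for SOAP environments and hand-picked dual coefficients standing in for a trained SVM:

```python
import numpy as np

def env_kernel(x, y, sigma=0.5):
    # Gaussian similarity between two atomic-environment feature vectors
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def avg_kernel(A, B):
    # average molecular kernel: mean over all environment pairs
    return np.mean([[env_kernel(a, b) for b in B] for a in A])

def atom_contributions(A, supports, alphas, b):
    """Split the decision value f(A) = sum_s alpha_s K(A, S_s) + b
    into one contribution dz_i per atomic environment of A."""
    dz = np.zeros(len(A))
    for i, a in enumerate(A):
        for alpha, S in zip(alphas, supports):
            dz[i] += alpha * np.mean([env_kernel(a, s) for s in S]) / len(A)
    return dz, b

# toy support "molecules" and dual coefficients (sign encodes class)
rng = np.random.default_rng(1)
supports = [rng.normal(0, 1, (3, 2)), rng.normal(2, 1, (4, 2))]
alphas = [0.7, -0.7]
b = 0.05
A = rng.normal(1, 1, (5, 2))   # molecule to classify, 5 environments

dz, bias = atom_contributions(A, supports, alphas, b)
f = sum(alphas[s] * avg_kernel(A, supports[s]) for s in range(2)) + b
print(np.isclose(dz.sum() + bias, f))
```

The identity holds by construction: swapping the order of the sums over environments and support molecules leaves the decision value unchanged, so the per-atom terms always add back up to f(A).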

  • Fig. 5 A kernel function to compare solids and molecules can be built based on density overlap kernels between atom-centered environments.

    Chemical variability is accounted for by building separate neighbor densities for each distinct element [see the study of De et al. (20) and the Supplementary Materials].
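The construction in Fig. 5 can be sketched in a few lines: each environment carries one Gaussian-smeared neighbor density per element, environments are compared by summing density overlaps over the elements they share, and a molecular kernel averages over environment pairs. The 1D grid and toy geometries below are stand-ins for the actual 3D SOAP expansion:

```python
import numpy as np

GRID = np.linspace(-3, 3, 25)   # 1D stand-in for the 3D expansion grid

def neighbor_density(positions, sigma=0.5):
    # Gaussian-smeared neighbor density sampled on the grid
    rho = np.zeros_like(GRID)
    for p in positions:
        rho += np.exp(-(GRID - p) ** 2 / (2 * sigma**2))
    return rho

def env_kernel(env_a, env_b):
    """Overlap kernel between two environments: a separate density per
    element, overlaps accumulated over elements present in both."""
    k = 0.0
    for elem in set(env_a) & set(env_b):
        k += neighbor_density(env_a[elem]) @ neighbor_density(env_b[elem])
    return k

def molecule_kernel(mol_a, mol_b):
    # average kernel over all pairs of atomic environments
    return np.mean([[env_kernel(a, b) for b in mol_b] for a in mol_a])

# environments as {element: neighbor positions} dicts (toy 1D geometry)
env1 = {"C": [0.0, 1.2], "H": [-1.0]}
env2 = {"C": [0.1, 1.1], "H": [-0.9]}   # slightly distorted copy of env1
env3 = {"O": [0.5]}                      # chemically unrelated
print(molecule_kernel([env1], [env2]) > molecule_kernel([env1], [env3]))
```

Keeping one density channel per element means environments with no elements in common have zero overlap, while near-identical environments score highly, which is the chemical-variability behavior the figure describes.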

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/3/12/e1701816/DC1

    section 1. The atom-centered GAP is equivalent to the average molecular kernel

    section 2. A SOAP-GAP potential for silicon

    section 3. Predicting atomization energies for the GDB9 and QM7b databases

    section 4. Ligand classification and visualization

    table S1. Summary of the database for the silicon model.

    fig. S1. Energetics of configuration paths that correspond to the formation of stacking faults in the diamond structure.

    fig. S2. Fraction of test configurations with an error smaller than a given threshold, for ntrain = 20,000 training structures selected at random (dashed line) or by FPS (full line).

    fig. S3. Optimal range of interactions for learning GDB9 DFT energies.

    fig. S4. Optimal range of interactions for learning GDB9 CC and ΔCC-DFT energies.

    fig. S5. Training curves for the prediction of DFT energies using DFT geometries as inputs for the GDB9 data set.

    fig. S6. Training curves for the prediction of DFT energies using DFT geometries as inputs for the QM7b data set.

    fig. S7. Training curves for the prediction of DFT energies using DFT geometries as inputs for the GDB9 data set.

    fig. S8. Training curves for the prediction of DFT energies using DFT geometries as inputs, for a data set containing a total of 684 configurations of glutamic acid dipeptide (E) and aspartic acid dipeptide (D).

    fig. S9. Correlation plots for the learning of the energetics of dipeptide configurations, based on GDB9.

    References (44–68)
