Research Article | MATHEMATICS

Learning to learn from data: Using deep adversarial learning to construct optimal statistical procedures


Science Advances  26 Feb 2020:
Vol. 6, no. 9, eaaw2140
DOI: 10.1126/sciadv.aaw2140
  • Fig. 1 Schematic overview of algorithms for constructing optimal statistical procedures.

    Overview of an iteration of (A) nested minimax algorithms, (B) nested maximin algorithms, and (C) alternating algorithms, all in the special case where R(T, P) = E_P[L(T(X), P)] for some loss function L. Green boxes involve evaluating or updating the statistical procedure, and blue boxes involve evaluating, updating, or identifying the least favorable distribution or prior. Shading is used to emphasize the similarities between the different steps of the three learning schemes. More than one draw of X ~ Pk can be taken in each step; in that case, the resulting loss functions L(Tk(X), P) are averaged. Similarly, more than one draw of Pk ~ Πk may be taken for the alternating algorithm. *This gradient takes into account the fact that Pk depends on Πk, and that X depends on Πk through Pk. Details on how this dependence can be taken into account are given following the presentation of the pseudocode in Materials and Methods. (A toy numerical sketch of one alternating update is given after Table 1 below.)

  • Fig. 2 Risk of learned estimators in example where the MLE is inconsistent.

    Risks are displayed at different parameter values and sample sizes. Unlike the MLE, whose maximal risk increases with sample size in this example, the maximal risk of our learned estimators decreases with sample size.

  • Fig. 3 Performance of learned prediction algorithms.

    Percent improvement in maximal risk and Bayes risk of our learned prediction algorithms relative to MLEs in the models shown in Table 1. The Bayes risk summarizes a method’s performance over the entire model, although it is least informative for the multilayer perceptron model, where the prior defining the Bayes risk puts most of its mass on simpler functional forms (e.g., on the top left panel of fig. S2 rather than the bottom left panel). (The maximal-risk and Bayes-risk criteria are written out after Table 1 below.)

  • Fig. 4 Performance of learned clustering procedure compared to performance of the EM.

    The difference between the risk of EM and the risk of our learned procedure is displayed; larger values indicate that our learned procedure outperformed EM. Three fixed values of the mixture weight ω are considered: 0.1, 0.3, and 0.5. Contours indicate comparable performance of our learned clustering procedure and EM. Contours are drawn using smoothed estimates of the difference of the risks of the two procedures, where the smoothing is performed using k-nearest neighbors (k = 25); a minimal smoothing sketch is given after Table 1 below. Our learned procedure outperformed EM both in terms of the risk in (3) that was used during training and in terms of misclassification error.

  • Fig. 5 Cross-validated performance of learned prediction algorithms and MLEs.

    Models of three different complexities are considered when training the learned prediction algorithms for each application (see Materials and Methods for details). MLEs are evaluated over the same models that were used to train the learned prediction algorithms.

  • Table 1 Settings for the prediction example.

    The parameterization of the models considered is described where the example in Results is introduced. Complexity identifies the relative size of the models in the multilayer perceptron settings i, ii, and iii, the 10-dimensional generalized linear model settings iv, v, and vi, and the 2-dimensional generalized linear model settings x, xi, and xii. “Gaussian” corresponds to p independent standard normal predictors. “Mixed” corresponds to two independent predictors following standard normal and Rademacher distributions. The variable h is the number of hidden layers that the model uses for the E[Y|W] network; b1 is the bound on the magnitude of the bias in the output node of the network; b2 is a bound on all other biases and all network weights; ρ is the correlation between the predictors; s1, s2, and s3 are the numbers of distributions in the random search for an unfavorable distribution that are chosen uniformly from the entire parameter space, uniformly from the boundary, and as a mixture of a uniform draw from the entire parameter space and a draw from the boundary, respectively (details in the main text); and t is the number of starts used for the shallow interrogation. (A record type summarizing these columns is sketched after the table.)

    | Setting | Complexity | Predictors | p  | h | b1 | b2  | ρ   | s1  | s2 | s3 | t |
    |---------|------------|------------|----|---|----|-----|-----|-----|----|----|---|
    | i       | Lowest     | Gaussian   | 2  | 0 | 2  | 2   | 0   | 20  | 0  | 20 | 3 |
    | ii      | Medium     | Gaussian   | 2  | 1 | 2  | 2   | 0   | 150 | 50 | 50 | 5 |
    | iii     | Highest    | Gaussian   | 2  | 2 | 2  | 2   | 0   | 150 | 50 | 50 | 5 |
    | iv      | Lowest     | Gaussian   | 10 | 0 | 0  | 0.5 | 0   | 150 | 50 | 0  | 5 |
    | v       | Medium     | Gaussian   | 10 | 0 | 1  | 0.5 | 0   | 150 | 50 | 0  | 5 |
    | vi      | Highest    | Gaussian   | 10 | 0 | 2  | 0.5 | 0   | 150 | 50 | 0  | 5 |
    | vii     | Lowest     | Gaussian   | 10 | 0 | 0  | 0.5 | 0.3 | 150 | 50 | 0  | 5 |
    | viii    | Medium     | Gaussian   | 10 | 0 | 0  | 0.5 | 0.6 | 150 | 50 | 0  | 5 |
    | ix      | Highest    | Gaussian   | 10 | 0 | 0  | 0.5 | 0.9 | 150 | 50 | 0  | 5 |
    | x       | Lowest     | Mixed      | 2  | 0 | 1  | 0.5 | 0   | 20  | 0  | 20 | 3 |
    | xi      | Medium     | Mixed      | 2  | 0 | 1  | 1   | 0   | 20  | 0  | 20 | 3 |
    | xii     | Highest    | Mixed      | 2  | 0 | 1  | 2   | 0   | 20  | 0  | 20 | 3 |
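
For concreteness, the following is a minimal numerical sketch of the alternating scheme in Fig. 1C, written for a toy bounded-mean Gaussian model with a one-parameter shrinkage estimator rather than the paper's neural-network procedures. The bound m, the point-mass "prior," the step sizes, and the iteration count are illustrative assumptions; because the risk is available in closed form here, the sketch sidesteps the sampling-through-the-prior gradient issue flagged by the asterisked note in the Fig. 1 caption.

```python
# Minimal sketch of an alternating update (cf. Fig. 1C) for a toy problem, not the
# paper's neural-network implementation. Model: X_1, ..., X_n i.i.d. N(mu, 1) with
# |mu| <= m; procedure T_theta(xbar) = theta * xbar; squared-error loss, so the risk
# R(T_theta, P_mu) = theta^2 / n + (theta - 1)^2 * mu^2 is available in closed form.
import numpy as np

n, m = 10, 2.0                 # sample size and bound on |mu| (illustrative choices)
lr_proc, lr_prior = 0.05, 0.5  # step sizes for the procedure and the "prior" updates
theta, mu = 1.0, 0.5           # initial procedure parameter and candidate least favorable mu

def risk(theta, mu):
    """Closed-form risk E_mu[(theta * xbar - mu)^2] under squared-error loss."""
    return theta ** 2 / n + (theta - 1.0) ** 2 * mu ** 2

for _ in range(5000):
    # (green boxes) update the procedure: gradient *descent* on the risk in theta
    grad_theta = 2.0 * theta / n + 2.0 * (theta - 1.0) * mu ** 2
    theta -= lr_proc * grad_theta
    # (blue boxes) update the least favorable distribution: gradient *ascent* in mu,
    # projected back onto the model constraint |mu| <= m
    grad_mu = 2.0 * (theta - 1.0) ** 2 * mu
    mu = float(np.clip(mu + lr_prior * grad_mu, -m, m))

# mu is pushed toward the boundary, and theta approaches the minimax shrinkage factor
# m^2 / (m^2 + 1/n) for this toy problem.
print(f"theta = {theta:.3f}, mu = {mu:.3f}, risk = {risk(theta, mu):.4f}")
```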
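The two criteria reported in Fig. 3 can be written out explicitly. Below, the model is denoted by a calligraphic P and the prior defining the Bayes risk by Π; the percent-improvement convention in the second display is one natural reading of "percent improvement relative to the MLE" and is an assumption rather than a formula quoted from the paper.

```latex
% Worst-case (maximal) risk and Bayes risk of a procedure T over a model \mathcal{P},
% in the special case R(T, P) = E_P[L(T(X), P)] from the Fig. 1 caption.
\[
  R(T, P) = \mathbb{E}_P\!\left[ L\bigl(T(X), P\bigr) \right], \qquad
  r_{\max}(T) = \sup_{P \in \mathcal{P}} R(T, P), \qquad
  r_{\Pi}(T) = \int R(T, P)\, d\Pi(P).
\]
% An assumed convention for the "percent improvement relative to the MLE" in Fig. 3,
% applied with r equal to either the maximal risk or the Bayes risk:
\[
  \text{\% improvement} = 100 \times
  \frac{r(T_{\mathrm{MLE}}) - r(T_{\mathrm{learned}})}{r(T_{\mathrm{MLE}})}.
\]
```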
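The contour smoothing mentioned in the Fig. 4 caption can be sketched as follows, assuming scikit-learn is available; the parameter grid, the stand-in risk-difference values, and all variable names are illustrative and are not taken from the authors' code. Only the choice k = 25 comes from the caption.

```python
# Minimal sketch of the k-nearest-neighbor smoothing described in the Fig. 4 caption:
# noisy risk-difference estimates at scattered parameter values are smoothed by
# averaging the k = 25 nearest points before contouring.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
params = rng.uniform(-3, 3, size=(2000, 2))         # e.g., the two component means of the mixture
risk_diff = (np.sin(params[:, 0] - params[:, 1])    # stand-in for R_EM - R_learned at each point
             + rng.normal(0.0, 0.3, size=2000))

# Average each query point's 25 nearest neighbors with uniform weights.
smoother = KNeighborsRegressor(n_neighbors=25, weights="uniform").fit(params, risk_diff)

# Evaluate the smoothed surface on a regular grid; the zero-level contour would mark
# comparable performance of the learned procedure and EM.
grid = np.linspace(-3, 3, 100)
gx, gy = np.meshgrid(grid, grid)
smoothed = smoother.predict(np.c_[gx.ravel(), gy.ravel()]).reshape(gx.shape)
print(smoothed.shape)  # (100, 100), ready for matplotlib's plt.contour(gx, gy, smoothed)
```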
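Finally, to make the column meanings in Table 1 concrete, here is a hypothetical record type collecting one row of the table; the class and field names are illustrative and do not come from the paper's code.

```python
# Hypothetical record type for one row of Table 1.
from dataclasses import dataclass

@dataclass
class PredictionSetting:
    setting: str      # i through xii
    complexity: str   # Lowest / Medium / Highest within each group of settings
    predictors: str   # "Gaussian" (p independent standard normals) or "Mixed"
    p: int            # number of predictors
    h: int            # hidden layers used for the E[Y|W] network
    b1: float         # bound on the magnitude of the output-node bias
    b2: float         # bound on all other biases and all network weights
    rho: float        # correlation between the predictors
    s1: int           # random-search draws taken uniformly over the parameter space
    s2: int           # random-search draws taken uniformly over the boundary
    s3: int           # random-search draws mixing interior and boundary draws
    t: int            # number of starts used for the shallow interrogation

# Example: setting ii (the medium-complexity multilayer perceptron row of Table 1).
setting_ii = PredictionSetting("ii", "Medium", "Gaussian", p=2, h=1, b1=2, b2=2,
                               rho=0.0, s1=150, s2=50, s3=50, t=5)
```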

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/6/9/eaaw2140/DC1

    Appendix A. Supplementary tables and figures for numerical experiments.

    Appendix B. Supplementary tables for data applications.

    Appendix C. Methods for confidence region construction experiments.

    Appendix D. Neural network architectures in numerical experiments.

    Appendix E. Further discussion of guarantees for nested minimax algorithms.

    Appendix F. An example showing challenges faced by existing nested maximin algorithms.

    Appendix G. Captions for additional file types.

    Fig. S1. Convergence of the risk of the learned estimators in the Gaussian estimation example.

    Fig. S2. Pointwise quantiles of the fit of our learned two-layer multilayer perceptron prediction function at W2 = 0 and different values of W1 based on n = 50 observations from four data-generating distributions.

    Fig. S3. Learning curves for worst-case (red) and uniform-prior Bayes (blue) prediction performance in the logistic regression settings i to xii shown in Table 1.

    Fig. S4. Performance of the learned 95% level confidence region procedure across different values of η.

    Fig. S5. Prior generator multilayer perceptrons and procedure multilayer perceptrons used for the point estimation examples.

    Fig. S6. Estimator LSTM used when estimating binary regressions.

    Fig. S7. Estimator LSTM used when defining the interior point of our confidence regions.

    Table S1. Gaussian model with σ² = 1, |μ| ≤ m, and n = 1.

    Table S2. Final estimated performance of the learned prediction algorithms, the MLE, and a linear-logistic regression.

    Table S3. Performance of our learned procedures and of existing procedures in data illustrations.

    Movie S1. Evolution of the risk of the learned estimator of μ as the weights of the neural network are updated in the Gaussian model with n = 50 observations and unknown (μ, σ).

    References (48–50)
