## Abstract

Mitigating the effects of disease outbreaks with timely and effective interventions requires accurate real-time surveillance and forecasting of disease activity, but traditional health care–based surveillance systems are limited by inherent reporting delays. Machine learning methods have the potential to fill this temporal “data gap,” but work to date in this area has focused on relatively simple methods and coarse geographic resolutions (state level and above). We evaluate the predictive performance of a gated recurrent unit neural network approach in comparison with baseline machine learning methods for estimating influenza activity in the United States at the state and city levels and experiment with the inclusion of real-time Internet search data. We find that the neural network approach improves upon baseline models for long time horizons of prediction but is not improved by real-time internet search data. We conduct a thorough analysis of feature importances in all considered models for interpretability purposes.

## INTRODUCTION

In today’s era of “big data,” a tremendous amount of information is collected as we interact with technology in every aspect of our lives. Machine learning methods that detect and extract patterns from these data can reveal details about human behavior that are not readily apparent to the naked eye. In recent years, much of the attention of the machine learning community has focused on methods based on neural networks, which have been shown to achieve impressive results on data-intensive problems for which large amounts of high-dimensional data are available. Neural networks are now the state of the art in machine learning for a variety of prediction tasks, including, but not limited to, facial recognition (*1*), robot navigation (*2*), and sentiment detection (*3*).

As in many other scientific fields, the amount and quality of data on disease activity collected by health care systems are on the rise. Surveillance of influenza-like illnesses (ILI) in the United States is particularly well recorded by the U.S. Centers for Disease Control and Prevention (CDC). The CDC, for example, has made publicly available near–real time datasets of regional- and state-level ILI case counts (*4*). More granular influenza-related data have been collected retrospectively by electronic health record companies and have been used in a collection of studies (*5*–*9*). With the growing availability of longitudinal epidemiological datasets like these, there is an opportunity to investigate whether neural network methods can improve monitoring of disease activity.

Accurate and reliable real-time estimations of disease activity have the potential to make enormous positive impact in public health. Infectious diseases affect billions of people every year and cause considerable morbidity and mortality worldwide. Influenza alone infects 35 million people in the United States each year and causes between 12,000 and 56,000 annual deaths (*10*). Mitigating the effects of disease outbreaks with timely and effective interventions relies upon accurate real-time surveillance and forecasting of disease activity, but traditional health care–based disease surveillance systems are limited by inherent reporting delays. Data from the U.S. CDC on ILI rates, for example, are available at approximately a 1- to 2-week reporting delay (defined as the gap between when a case occurs and when it is first reported) and are frequently retrospectively revised (*11*). Time series machine learning methods that provide accurate real-time estimation of disease activity at a high spatial resolution could help fill this data gap, allowing hospitals, clinics, and communities to better respond to public health threats.

Previous work on improving real-time estimation and forecasting of disease activity with computational methods has focused mainly on monitoring and forecasting ILI in the United States. Studies in this field used methods ranging from applied machine learning and statistical modeling (*11*–*21*) to standard mechanistic epidemiological modeling (*22*, *23*) and network approaches (*6*, *7*, *16*, *24*). Many explore the use of emerging internet-based data sources, including Google Trends (GT), Twitter microblogs, and news alerts, to complement available epidemiological data with digital signals available in real time, producing highly accurate “nowcasts” of influenza incidence (*11*–*14*, *16*, *19*, *25*, *26*).

An important data-driven system for influenza tracking, Google Flu Trends (GFT) (*12*), was introduced in 2008 to track ILI on the state level in the United States using query volumes from GT. However, it was discontinued in 2015 after large prediction errors were identified in the system (*11*, *13*, *27*, *28*). Yang *et al*. (*11*) revised the GFT algorithm, estimating ILI activity on the national level more effectively using a regularized linear regression approach that combines historical ILI activity and GT data, named “ARGO.” Santillana *et al*. (*14*) explored more complex machine learning methods (random forests and support vector machines) to forecast ILI activity up to 4 weeks ahead of the publication of CDC ILI reports. Other studies have explored different diseases, data sources, and spatial resolutions: Yang *et al*. (*29*) used GT for dengue estimation in several middle-income countries; Lu *et al*. (*30*) experimented with Google searches, Twitter posts, electronic health records, and a crowd-sourced reporting system for tracking ILI in Boston; and Santillana *et al*. (*31*) used clinician’s search query data to track national influenza rates.

Many studies on machine learning for ILI prediction focus on models that use information from single locations, without accounting for potential synchronicities in epidemiological signals and spatial transmission of disease (*12*–*14*, *32*). Lu *et al*. (*16*) introduced ARGONet, one of the current state-of-the-art methods in ILI nowcasting, a regularized linear regression that incorporates historic epidemiological data and synchronous data from flu-related internet search activity and electronic health records to provide highly accurate real-time estimates of ILI on the state level in the United States. Moreover, ARGONet uses historic epidemiological activity lags and real-time internet-based data from all states for the prediction of synchronous influenza rates in each state, a network approach that allows models to take into account the spatial transmission of influenza across states. Lu *et al*. (*16*) found that ARGONet performs better than the standard ARGO algorithm.

The current literature in data-driven ILI prediction—especially the literature that considers the inclusion of nontraditional real-time data from sources like GT—mainly uses variations of linear regressions (*11*, *16*) or, less often, random forests or support vector machines (*14*). From a machine learning perspective, however, the problem of disease incidence estimation is most suited to a time series–specific model architecture. Recurrent neural networks, and more specifically their variants long short-term memory (LSTM) and gated recurrent unit (GRU) networks, have recently produced favorable results on time series forecasting problems—including stock price prediction (*33*), language translation (*34*), and speech recognition (*35*)—when compared with other machine learning and time series methods. It is a reasonable hypothesis that these time series–specific neural network architectures might be equally well suited to epidemiological prediction.

While previous papers have explored time series deep learning methods for ILI prediction, none to our knowledge do so with both spatial precision (county- or city-level prediction) and attention to nontraditional data sources, which is where the contribution of our paper lies. We note here the contributions of some of the most important related papers to date.

Wu *et al*. (*17*) explored a combined convolutional neural network (CNN)–GRU architecture for ILI forecasting using three datasets (47 Japanese prefectures, 29 U.S. states, and 10 U.S. districts), finding that the time series deep learning model achieves superior accuracy compared with several baseline machine learning methods for nearly all datasets and reporting delays of up to 8 weeks. However, the paper does not explore the use of digital data and does not evaluate models in the same method that they would be deployed in practice. Specifically, models are not retrained in each week with all available data (a technique termed “dynamic training” or “walk-forward validation”). Li *et al*. (*18*) used a graph-structured recurrent neural network that expressly accounts for the network structure of disease spread (the network structure is dictated by geographic proximity). The deep learning model performance was evaluated only at a 1-week reporting delay for the regional ILI tracking in the United States. However, internet-based data incorporation is not explored, and, moreover, no region-specific baseline models are presented here, so it is difficult to ascertain whether the presented neural network exceeds the current state-of-the-art methods in ILI estimation. Hu *et al*. (*19*) and Liu *et al*. (*20*) explored the use of a fully connected neural network exploiting historical epidemiological and synchronous digital data for real-time estimation of ILI in the United States nationally and in the state of Georgia, respectively, but lack baseline models for comparison.

Other papers make important contributions to tracking ILI in specific regions, incorporating outside data sources, or developing improved model architectures. In the most spatially granular study to date, Xi *et al*. (*36*) predicted intraurban ILI trends in Shenzen, China, with several neural network variants, making a case for the network’s ability to capture synchronicities and spatial disease spread. Burdakov *et al*. (*37*) and Yang *et al*. (*38*) introduced the use of climate data in deep learning–based ILI prediction. Last, Adhikari *et al*. (*32*) introduced the EpiDeep model, which leverages deep clustering to learn effectively from past influenza seasons, but again does not incorporate nontraditional data sources or attempt synchronous prediction at a spatial scale below the regional.

### Our contributions

This work uses appropriate datasets and established techniques in public health to measure whether neural network methods can improve upon current methods for estimating disease activity. First, we compare the performance of a time series–specific neural network model, the GRU, to baseline machine learning methods (linear regression and random forest) for real-time ILI estimation on the state and city levels in the United States (geographic resolutions relevant to actionable public health response). We find that the GRU is superior to baseline methods when there is a large reporting delay in the standard surveillance system (over 2 weeks). Second, we confirm prior work showing that the state-level ILI predictions produced by linear regression and random forest models are improved with the incorporation of data from GT and extend this result to the city level. However, we show that GRU predictions are not improved by GT data. Third, we compare ARGO-like methods that operate in isolation on each location with network approaches that account for synchronicities between locations and find that the ARGONet approach is superior for all time horizons of prediction and is particularly effective on the city level. Last, we conduct an in-depth analysis of feature importances for each model we build, as interpretability is key to effective practical use of data-driven models in public health.

## RESULTS

We train our neural network model (GRU) to produce out-of-sample weekly ILI activity estimates for 37 U.S. states and 159 U.S. cities, experimenting with assumed reporting delays of 1, 2, 4, and 8 weeks, including information from all locations in each dataset. Furthermore, we experiment with both the inclusion and exclusion of real-time data from Google search trends. Our two datasets (state and city levels) are approximately 6 and 7 years, respectively, but cover different time periods (October 2010 to May 2016 for the state-level dataset and January 2003 to June 2010 for the city-level dataset). For each dataset, we train the GRU on the first half of the available data and evaluate its performance on the second half, where the model is retrained each week with all data available before that week (known as “dynamic training” or “walk-forward validation”). The resulting evaluation period for the state-level dataset is 21 August 2013 to 1 May 2016; the evaluation period for the city-level dataset is 30 September 2007 to 20 June 2010. Note that the city-level evaluation period includes the 2009 influenza pandemic; it is particularly important to assess accuracy in this time period.

For comparison, we implement four baseline methods similar to those used in previous work: a persistence model (a standard naive baseline in which the most recent observation of ILI incidence is used as the prediction), a regularized linear regression treating each location in isolation (similar to ARGO when GT data are included), a regularized linear regression that incorporates information from all locations in each dataset (similar to ARGONet when GT data are included), and a random forest approach that incorporates data from all locations in each dataset. We refer to the baseline models as follows: persistence, AR for the single-location linear regression, and LR and RF for the linear regression and random forest, respectively, incorporating a network approach. For consistency, each of the baseline methods is evaluated with walk-forward validation on the same weeks as the GRU, and we experiment with up to 8-week reporting delays and the inclusion and exclusion of GT data. We compare predictive accuracy across models, reporting delays, geographic resolutions, and internet search data incorporation using root mean square error (RMSE) and Pearson’s correlation. We also perform two sets of Wilcoxon signed-rank tests: the first is to evaluate the significance of each model’s improvement over the naive persistence model, and the second is to evaluate the statistical significance of the difference between the GRU’s predictions and those of the most competitive alternative model. We present our aggregated results across states and cities in Figs. 1 to 4 and Tables 1 and 2. Predictions for all locations are available in an interactive dashboard online at emilylaiken.github.io/ili-viz/.

### Neural network models outperform baseline models for long time horizons of prediction

We find that the GRU performs substantially better than the less flexible machine learning models in the presence of long reporting delays. Specifically, as shown in Figs. 1 and 2, along with Tables 1 and 2, the GRU demonstrates superior performance (error reductions of 35 to 51% for state level and 45 to 59% for city level) on 4- and 8-week reporting delays for both datasets when only epidemiological data are used and for an 8-week reporting delay when both epidemiological and internet-based data are incorporated. We observe that the GRU leads to larger error reductions on the city-level dataset. For interpretability, we also report the percentage of locations in which each model is preferred in Table 3, with consistent results.

For completeness, we also provide a probabilistic evaluation following the CDC FluSight’s multibin log score methodology at the state level in Table 4, noting results consistent with previous observations. To the best of our knowledge, the first season where state-level forecasts were requested by the CDC was the 2017–2018 flu season, as a pilot project, so no comparable baseline exists for our evaluation period. Moreover, the results from that pilot project have not yet been released at the time of writing of this paper. Compared with the metrics from the currently unreleased state-level assessment paper, our results suggest state-of-the-art performance. We note, however, that the results cannot be directly compared, as the evaluation period is not the same.

### Leveraging real-time data from GT improves performance of most models across datasets and time horizons

On all time horizons of prediction, models other than the GRU are improved by incorporating real-time digital data from GT. This result extends past research showing that internet-based data improve ILI nowcasting with linear regressions and random forests on the country (*11*–*14*) and state levels (*16*) to the city level. We observe that the inclusion of GT data causes only a very small improvement in GRU performance at longer reporting delays (4 and 8 weeks) and causes a slight performance decrease at short reporting delays (1 and 2 weeks).

For the linear regression and random forest models, we observe a larger accuracy improvement as a result of GT data incorporation at long time horizons of prediction. Furthermore, while GT data do improve city-level prediction, we find that it provides less improvement than on the state level. Figures 3 and 4 compare the performance of models that incorporate GT data to those that use only historical epidemiological data for the state and city levels, respectively.

### Adopting a network approach improves performance on all prediction tasks

Comparing among the baseline machine learning models implemented (simple linear autoregression, labeled AR; linear autoregression adopting a network approach, labeled LR; and random forest adopting a network approach, labeled RF), we find that models that capture the spatial transmission of disease by adopting a network approach—using information from all available locations as documented in (*16*, *39*)—outperform models that predict each location in the dataset in isolation on nearly all objectives and time horizons of prediction. This result is consistent with the results presented in (*16*) and extends these results to the city-level dataset. We find that while a network approach provides a slight improvement in overall accuracy for state-level predictions, it provides substantially more improvement for predictions on the city level. The performance of the AR, LR, and RF models for state-level prediction can be directly compared in Fig. 1, and for city-level prediction in Fig. 2.

### Feature importance

For interpretability purposes, we analyze feature importances across each method. Specifically, we obtain regression coefficients for each linear regression, feature importances (*40*) for each random forest model, and saliency maps (*41*) for each GRU prediction, at every point in time. Because models are trained dynamically, the most important features in each model may change from week to week. Figure 5 compares average feature importances for predicting ILI in Virginia assuming 1- and 8-week reporting delays. A comprehensive set of heatmaps for each location is available online at https://emilylaiken.github.io/ili-viz/explore, and for further reference, a selection of similar heatmaps is included in figs. S7 to S10.

The maps of feature importance demonstrate certain results consistent with intuitive spatial and temporal model interpretation. First, the most important features in linear regression and random forest models for the city-level dataset tend to be epidemiological lags and GT information from cities near to the city of prediction (a result most readily visible on the online dashboard). Second, the most immediately available epidemiological information (lags 1 to 4) tends to be important features in linear regression and random forest models that predict assuming short reporting delays, while information from the previous season (lags 48 to 52) is more important in models that predict assuming long reporting delays. Similarly, saliency maps indicate that GRU attention extends much further back for neural network models that predict assuming an 8-week reporting delay than for models working with a 1-week reporting delay, as shown in Fig. 5. This intuition is consistent with the disparity in model accuracy at the 8-week time horizon: It appears that non–deep learning approaches rely on sparse features for prediction, particularly at the 8-week time horizon, while GRU attention is smoothly dispersed across locations and features, perhaps contributing to smoother, more accurate GRU predictions at long time horizons.

### Computational cost

We evaluate computational expense across models by profiling runtime of training and predicting a single week’s incidence on a single core. While the linear regression and random forest models are more runtime efficient than the neural network model for an individual location, there is redundant computation when running the AR-net LR and RF models on datasets with multiple locations. In a dataset of 100 locations, for example, the GRU takes 5 times longer to run than AR (30 s versus 6 s on a single core), while the LR model takes 50 times longer to run (300 s versus 6 s on a single core). The runtime of all models scales linearly with the number of locations in the dataset.

## DISCUSSION

Here, we present the use of a time series–specific neural network model (GRU) for ILI prediction that improves upon the predictive accuracy of previously used machine learning methods for ILI prediction in the presence of reporting delays of over 2 weeks. We show that the GRU achieves superior accuracy at two spatial resolutions relevant to actionable interventions. Specifically, while the previously used machine learning methods show a large loss of accuracy between a reporting delay of 1 week and a reporting delay of 8 weeks, there is little change in the accuracy of the GRU. This finding is robust across the two spatial resolutions analyzed here and could help with real-time tracking of ILI given the reporting delays inherent to standard health care–based surveillance systems. Furthermore, our results indicate that under short reporting delays, the GRU could provide highly accurate forecasts of ILI activity (as predictions by models that do not use GT data can also be interpreted as forecasts).

As the amount of high spatial resolution disease-specific data available grows in the field of public health, using neural network models like the one presented here becomes increasingly feasible. Trade-offs in interpretability and computational expense should be considered, however, when comparing neural networks to less complex machine learning methods. For that reason, we have presented a comprehensive feature importance analysis in this work. Note that while linear regression coefficients like the ones extracted in our analysis are highly interpretable, feature importances in a random forest model includes more stochasticity, and the saliency maps produced for predictions by the neural network model represent only rough approximations of model attention. The trade-off between model complexity and computational expense should also be taken into account in cross-model comparisons, although we observe that the runtime of all models is well within the constraints of even a resource-limited public health system.

Across spatial resolutions, the GRU outperforms other model architectures only at reporting delays longer than 2 weeks when only historical epidemiological data are used for prediction, and only at an 8-week reporting delay when GT data are incorporated. These performance differences are consistent with the trade-off between model complexity and convergence in the data-deficient scenario. The simpler models evaluated here have fewer trainable parameters than the GRU, while the GRU’s complex model architecture and time series–specific structure allow it to learn embedded patterns in historic data better than the other models. The benefit of the GRU is therefore larger when there are no real-time data available to the model, and is smaller when GT information is available and the many additional trainable parameters introduced by the GT data lead to convergence issues. With the availability of more training data, we would expect to see that the GRU would outperform simpler models even for short reporting delays.

In addition to the contribution of the neural network model, our results also extend previous work suggesting that real-time estimates of influenza are improved by the incorporation of internet-based data from GT to the city level. We find, however, that incorporating internet-based data from GT provides more improvement on the state level than on the city level. Because publicly available data from the GT Application Programming Interface (API) accounts for only a subset of Google searches in a given location, GT data are sparser, and therefore less useful, on the city level than on the state level. Furthermore, there were substantial changes in the methods of collection and aggregation for GT data during the time period of the city-level study, so it is possible that high-resolution data are not as reliable in the evaluation period for the city-level dataset as it is for the state-level dataset. Full access to trend information could improve performance of granular nowcasting models substantially.

Last, we observe that for both the state and city levels, adopting a network approach that accounts for correlation in influenza signals across locations provides more accurate predictions than modeling each location in isolation. This result confirms the findings in Lu *et al*. (*16*) for state-level prediction and extends it to the city level. We observe, however, that adopting a network approach is more useful on the city level than on the state level. This result likely stems from the relative densities of the network in each case: We model 159 cities, a much denser network than the 37 states available in the state-level dataset, so spatial factors, such as geographic proximity, may be more useful in city-level prediction. It is also possible that city-level modeling captures the true spatial scale of disease spread; state-level modeling may represent only an aggregate of local and sometimes asynchronous epidemics and therefore may obscure finer details of spatial connectivity that are useful for prediction.

Two key limitations to this study are tuning of the neural network model and lack of access to real-time epidemiological data. First, the performance of neural network models is sensitive to several hyperparameters, including optimization choices, depth, number of units in the hidden layers, and regularization. Because of computational limits, we adopt a simple GRU architecture with a single five-unit hidden layer and do not tune the model for other hyperparameters (more details on the model architecture are available in Methods). Likely, the performance of the GRU would be improved in all prediction scenarios if cross-validation was used to tune key hyperparameters. Second, we have work only with final (revised) ILI data, but, as noted in Introduction, these data are frequently updated with post hoc revisions up until several weeks after their original release. Previous studies have suggested that working with unrevised data in practice could complicate model training and testing (*11*, *42*).

There is much room for further exploration of flexible machine learning methods for epidemiological prediction. For example, it would be interesting to explore how well the models presented here can track other infectious diseases in and outside the United States. There is also room for experimentation with other neural network model architectures. In particular, we note that our neural network architecture can be considered as very small in size, both in terms of width and depth, when compared with neural networks used in practice for applications of deep learning. In this work, we demonstrate that even a small neural network architecture can yield improvements in ILI forecasting. A final avenue for future exploration is the prediction interpretation and explainability, which is key to practical model adoption by public health authorities. While the exploration of feature importances here is a start, other methods like the Shapley additive explanations (SHAP) values introduced by Lundberg and Lee (*43*) could provide additional intuition for model behavior.

## METHODS

We implemented a collection of machine learning models with the goal of answering the three following questions:

1. Previous results indicate that incorporating synchronous internet-based data improves real-time estimation of influenza activity on the country and state levels in the United States (*11*–*14*). Can these results be extended to nowcasting influenza on the city level?

2. Lu *et al*. (*16*) indicate that influenza activity estimation benefits from a network approach that accounts for the spatial transmission of disease but only analyzes a state-level dataset and a 1-week time horizon of prediction. Can these results be replicated at the state and city levels for a diverse set of prediction time horizons?

3. Does a time series neural network model outperform simpler machine learning methods for tasks in disease forecasting and nowcasting (both including and excluding real-time internet-based data)?

### Data sources

*State-level epidemiological data*. We use the same state-level dataset as Lu *et al*. (*16*), which was obtained from the CDC official counts of ILI from 4 October 2009 to 4 May 2017. In this dataset, the weekly ILI rate is measured by the number of doctor visits for ILI divided by the total number of visits. Only the 37 states without missing data are included in the analysis. These data are available at https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html.

*City-level epidemiological data*. Data on influenza incidence in 336 U.S. cities from 1 January 2004 to 20 July 2010 are taken from a dataset compiled by Intercontinental Marketing Statistics (IMS) Health based on medical claims data. As of 2009, the IMS health dataset captured 61.5% of U.S. physicians. Viboud *et al*. (*6*) found that this dataset “accurately capture[s] weekly fluctuations in influenza activity in all U.S. regions.” In the IMS health dataset, medical claims are categorized by type, with claims for influenza, fever combined with a respiratory infection, and febrile viral illness being categorized as ILI. Case counts are then aggregated at the level of center facilities as designated by the U.S. Postal Service, typically akin to a city or county (referred to here as a “city”). Last, case count data are standardized by taking the ratio of weekly ILI visits to the total number of physician visits in the region, per 100,000 population. As in Charu *et al*. (*7*), we remove all cities with a population less than 100,000 to eliminate demographic noise, leaving a dataset of 316 cities. This dataset has been used in several previous studies and has been shown to accurately capture spatial transmission dynamics of Influenza in the United States (*6*–*8*).

*Google search activity data*. Time series downloaded from GT (*44*) describe the proportion of searches for a specified keyword or phrase, with respect to the total number of searches, in a specified geographic region, each day, week, or month (normalized to a 0-to-100 range) in a time range. We extract trends for 256 key words shown in previous work to have strong correlation with flu incidence (*16*). For state-level modeling, we extract GT data for the 37 U.S. states in the epidemiological dataset on a weekly granularity. For city-level modeling, we extract GT data for U.S. regions on the finest geographic granularity available, designated market area regions as defined by The Nielsen Company. As in the IMS dataset, these regions typically correspond to a city or county. Upon inner-joining the IMS dataset and the GT dataset, a dataset of 159 U.S. cities remains, which we use in all city-level analyses.

### Modeling

*Preprocessing*. Before models are built, the values of each (nonnegative) time series are normalized in time to the range [0, 1]. The minimum and maximum values are identified from the “training” time period, and both the training and validation sets are transformed to the range [0, 1]

*Persistence*. A persistence model is the standard naive baseline for time series prediction. In a persistence model, the most recently observed incidence *y*_{t − h} is propagated forward to predict the incidence *y _{t}*

*Linear autoregression (AR)*. A linear autoregressive model uses a linear combination of *N* autoregressive observations of ILI incidence in a given location to predict incidence at a given time horizon in that location

A linear autoregression incorporating synchronous Google search data [very similar to the previously introduced ARGO method (*11*)] uses a linear combination of *N* autoregressive terms and query volumes from a set of search terms *G* from the same location to nowcast incidence at a given time horizon

In this analysis, we use a look-back window of *N* = 52 autoregressive lags. To make the linear autoregressive model more robust, we add L1-regularization and optimize regression weights to reduce mean squared error as a LASSO regression. The regularization strength parameter λ is selected with fourfold cross-validation from the set {10^{−5},10^{−4},10^{−3},10^{−2},10^{−1}}. A separate autoregressive model is fit for each location in the dataset.

We choose to use this AR model as a baseline rather than an autoregressive integrated moving average model (ARIMA) or seasonal ARIMA (SARIMA) approach based on literature that has shown that, due to issues of overfitting and model drag, dynamically trained and regularized machine learning models like the ones presented here are comparable or superior to an ARIMA-like approach (*11*, *45*, *46*). Given that historical flu activity signals are not strictly stationary, self-correcting regularized regression approaches that continuously self-evaluate the inclusion of multiple lags as input variables, such as the AR and ARGO models in this work, produce high-quality predictions in short-term prediction tasks that compare to and are frequently better than traditional ARIMA and SARIMA models. Moreover, in contrast to traditional ARIMA and SARIMA models, our choice of baseline AR models incorporate multiple lags (up to 52) of local and neighboring flu activity as input, and thus, a dynamically trained regularized regression approach has to be implemented to minimize the inherent risk of overfitting given the high number of input variables.

*Linear network autoregression (LR)*. Not to be confused with a neural network, this method uses a physical “network approach” as introduced in (*16*) to capture spatial spread of disease. Like a linear autoregression, this method uses a linear combination of *N* autoregressive terms to predict future prevalence, but unlike the single-region AR model, it incorporates autoregressive terms from a set of regions *R* available in the dataset

We also implement a form of the LR for which Google search query volumes from all regions in the dataset are incorporated. This model is very similar to the previously introduced ARGONet methodology (*16*)

Like the AR models, a look-back window of *N* = 52 is used and models are made more robust through L1-regularization with regularization strength parameter selected via fourfold cross-validation from the set {10^{−5},10^{−4},10^{−3},10^{−2},10^{−1}}. As in Lu *et al*. (*16*), to improve the speed and convergence of the LASSO optimization routine, we perform a preprocessing step to reduce the number of predictors in the LR models. The magnitude of set *R* is limited to a fixed number that is selected via fourfold cross-validation from the set {10,20,40}. Only the ∣*R*∣ predictors with the highest correlation with ground truth data in the training period are included in the regression model. If the model includes GT data, ∣*R*∣ for autoregressive epidemiological data and GT data are selected independently.

*Random forest*. Our random forest approach uses the same predictors as the above network autoregression model but allows for nonlinear and nonparametric mapping between autoregressive predictors with a random forest model. As in the other models, we experiment with both the inclusion and exclusion of GT data. In both cases, a random forest made up of 50 decision trees is used. Again, to improve speed and convergence, only autoregressive predictors from the top ∣*R*∣ most correlated time series are passed to the model, with ∣*R*∣ being selected from {10,20,40}. To prevent overfitting, maximum tree depth is selected with fourfold cross-validation from {2,4,8,16}.

*GRU neural network*. A GRU with one five-node hidden layer is used. Without GT data, the GRU accepts as input *N* = 52 autoregressive terms from all locations in the dataset and predicts incidence at the given time horizon for all locations simultaneously. When using GT data, the GRU accepts as input *N* = 52 autoregressive terms from all regions in the dataset and ∣*G* ∣ = 10 total synchronous Google search query volumes. The 10 queries selected are the ones with the highest correlation (in the training period) with the time series of ILI incidence in any location in the dataset. The GRU is trained on an MSE objective with a dropout rate of 0.3 after the hidden layer to reduce overfitting. We use a learning rate of 0.001 for stochastic gradient descent and train the model for 1000 epochs.

Note that this GRU model is simpler than the CNN-GRU model with residual links used by Wu *et al*. (*17*) or the graph-structured network in Li *et al*. (*18*). We choose the simpler model for the sake of computational expense and convergence on limited training data. Also note that the ablation tests presented in Wu *et al*. (*17*) indicate that the CNN-GRU model is often outperformed by a GRU-only component.

More broadly, we note that this architecture can be considered very small in size, both in terms of width and depth, when compared with neural networks used in practice for applications of deep learning [although note that the model size is in the same order of magnitude as (*17*)]. We leave the exploration of deeper and wider architectures as future work due to computational constraints.

*Training and hyperparameter selection*. For each of the methods evaluated, models are trained dynamically (a technique also termed “walk-forward validation” or “online learning”): For each time step *t*, the model is trained on all available training data *D _{t}* = {

*d*

_{0},

*d*

_{1},

*d*

_{2}…

*d*

_{t − 1}}, and this model is used to predict one time step at the given time horizon

*h*. Thus, each model is retrained and hyperparameters are reselected via fourfold cross-validation on the training data at each time step. In addition to eliminating forward-looking bias and allowing models to use all the available data at each time-step prediction, dynamic training has been shown in previous ILI-specific work to increase model accuracy (

*13*). Furthermore, walk-forward validation, unlike the standard form of evaluation used in the previous work on ILI prediction with neural networks (

*17*–

*20*), accurately reflects how these models would be used in real-world scenarios. While walk-forward validation is the preferred method for evaluating time series prediction tasks like ILI forecasting, it is very computationally expensive: approximately 150 weeks of evaluation in 159 cities and 37 states for eight different models requires hundreds of thousands of model runs. Like most applied machine learning work, our evaluation is inherently constrained by computational limits as a result of the scale of the prediction task at hand. We therefore leave the exploration of deeper and wider networks and larger ensembles for methods like the random forest as future work.

### Evaluation

Models are evaluated with walk-forward validation on the second half of each of the datasets. In the state-level dataset, the evaluation period is therefore 21 August 2013 to 1 May 2016; in the city-level dataset, it is 30 September 2007 to 20 June 2010. Models are evaluated based on two metrics: the distribution of RMSE across all locations and the distribution of Pearson’s correlation coefficients (CORR) with the ground truth time series across all locations

We note that while it has become popular to use the CDC FluSight’s binned logarithmic scoring rule to evaluate probabilistic influenza forecasts, the CDC FluSight’s evaluation metric is an improper scoring rule that does not elicit the reporting of the best prediction but instead allows for gaming (*47*, *48*). We also note that the dropout layer of our model can be used to produce probabilistic outputs (*49*).

For completeness, we also evaluate at the state level using the CDC’s multibin log score (*50*) defined as follows, where *p _{l}* is the probability assigned to outcome

*l*by a forecast. The bin size is set to 0.1 percentage points, and

*d*= 5 following the FluSight evaluation methodology. In other words, this means that forecasts within a range of 0.5 percentage points are considered as accurate

For the methods considered here, we obtain a probabilistic forecast for this evaluation by using the original point prediction of our methods as a mean of our probabilistic prediction and then computing the SD of residuals from a rolling 20-week window. A normal distribution with the predicted means and SD is then used as the probabilistic prediction around the point estimate. While the assumed normal distribution around zero of the historical out-of-sample prediction errors was not empirically satisfied for all time periods and locations, the associated probabilistic forecasts—which result from using this assumed error distribution—proved to be appropriate and informative, as reflected by the observed accuracy shown when assessing our predictions using the log score performance metric. This empirically confirms that the assumption happens frequently enough to justify its use.

As is done in common practice, the exponentiated average multibin log score is reported, with outliers assigned a logarithmic score of −10. We note that outliers need to be handled in order for the scores to be averaged, as a low-enough assigned probability to an observed event will cause a score of negative infinity. These outliers account for less than 0.242% of predictions made by each method, and assigning them a score of −10 follows the convention set in the literature.

We also conduct two sets of Wilcoxon signed-rank tests (*51*): the first testing whether the mean RMSE across locations is lower for the machine learning methods than for the naive baseline persistence method, disaggregated by time horizon of prediction and geographic resolution, and the second comparing the mean RMSE for the GRU to that of the most competitive baseline model. The use of the Wilcoxon signed-rank test is common when comparing machine learning models across multiple datasets (*51*). We treat each location as a different dataset when running our tests, and we note as a limitation that our datasets are not independent as the underlying influenza levels are correlated across locations. As a result, the interpretation of our test results should take into consideration this caveat. There is nuance to this limitation however as while the influenza levels across locations may be dependent, the inherent difficulty of the regression task is not necessarily correlated. For example, differences in data collection and reporting across locations would introduce different kinds of location-specific noise.

### Tools and code availability

We use Python 3.6 for all analyses presented here. Linear regression and random forest models are implemented in scikit-learn 0.19.1, and neural network models are implemented in Keras 2.1.5 (Tensorflow backend). Code and data used in this study are currently available upon request and will be made publicly available via GitHub upon publication.

## SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/7/25/eabb1237/DC1

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is **not** for commercial advantage and provided the original work is properly cited.

## REFERENCES AND NOTES

**Acknowledgments:**This study does not necessarily represent the views of the NIH or the U.S. government.

**Funding:**This work was supported by the NIH (grant number R01GM130668).

**Author Contributions:**E.L.A. and M.S. conceived of the project. E.L.A. carried out the analysis and led the manuscript preparation with guidance from M.S. A.T.N. assisted with the deep learning methods implemented, particularly with saliency mapping. C.V. compiled the city-level data and provided feedback and suggestions on epidemiological interpretation of the analysis. M.S. supervised the project.

**Competing interests:**The authors declare that they have no competing interests.

**Data and materials availability:**All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. State-level epidemiological data are available from the Harvard Dataverse at https://doi.org/10.7910/DVN/L5NT70. City-level epidemiological data is available at https://github.com/emilylaiken/ml-flu-prediction. Google Trends data are publicly available at https://trends.google.com/

- Copyright © 2021 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).