## Abstract

Modeling a complex system is almost invariably a challenging task. The incorporation of experimental observations can be used to improve the quality of a model and thus to obtain better predictions about the behavior of the corresponding system. This approach, however, is affected by a variety of different errors, especially when a system simultaneously populates an ensemble of different states and experimental data are measured as averages over such states. To address this problem, we present a Bayesian inference method, called “metainference,” that is able to deal with errors in experimental measurements and with experimental measurements averaged over multiple states. To achieve this goal, metainference models a finite sample of the distribution of models using a replica approach, in the spirit of the replica-averaging modeling based on the maximum entropy principle. To illustrate the method, we present its application to a heterogeneous model system and to the determination of an ensemble of structures corresponding to the thermal fluctuations of a protein molecule. Metainference thus provides an approach to modeling complex systems with heterogeneous components and interconverting between different states by taking into account all possible sources of errors.

- Statistical inference
- structural biology
- maximum entropy principle

## INTRODUCTION

The quantitative interpretation of experimental measurements requires the construction of a model of the system under observation. The model usually consists of a description of the system in terms of several parameters, which are determined by requiring consistency with the experimental measurements themselves and with theoretical information, either physical or statistical in nature. This procedure presents several complications. First, experimental data (Fig. 1A) are always affected by random and systematic errors (Fig. 1B, green), which must be properly accounted for to obtain accurate and precise models. Furthermore, when integrating multiple experimental observations, one must consider that each experiment has a different level of noise so that every element of information is properly weighted according to its reliability. Second, the prediction of experimental observables from the model, which is required to assess the consistency, is often based on an approximate description of a given experiment (the so-called “forward model”) and thus is intrinsically inaccurate in itself (Fig. 1B, green). Third, physical systems under equilibrium conditions often populate a variety of different states whose thermodynamic behavior can be described by statistical mechanics. In these heterogeneous systems, experimental observations depend on—and thus probe—a population of states (Fig. 1B, purple) so that one should determine an ensemble of models rather than a single one (Fig. 1C).

Among the theoretical approaches available for model building, two frameworks have emerged as particularly successful: Bayesian inference (*1*–*3*) and the maximum entropy principle (*4*). Bayesian modeling is a rigorous approach to combining prior information on a system with experimental data and to dealing with errors in such data (*1*–*3*, *5*–*8*). It proceeds by constructing a model of noise as a function of one or more unknown uncertainty parameters, which quantify the agreement between predictions and observations and which are inferred along with the model of the system. This method has a long history and is routinely used in a wide range of applications, including the reconstruction of phylogenetic trees (*9*), determination of population structures from genotype data (*10*), interpolation of noisy data (*11*), image reconstruction (*12*), decision theory (*13*), analysis of microarray data (*14*), and structure determination of proteins (*15*, *16*) and protein complexes (*17*). It has also been extended to deal with mixtures of states (*18*–*21*) by treating the number of states as a parameter to be determined by the procedure. The maximum entropy principle is at the basis of approaches that deal with experimental data averaged over an ensemble of states (*4*) and provides a link between information theory and statistical mechanics. In these methods, an ensemble generated using a prior model is minimally modified by some partial and inaccurate information to exactly match the observed data. In the recently proposed replica-averaging scheme (*22*–*26*), this result is achieved by modeling an ensemble of replicas of the system using the available information and additional terms that restrain the average values of the predicted data to be close to the experimental observations. This method has been used to determine ensembles representing the structure and dynamics of proteins (*22*–*26*).

Each of the two methods described above can deal with some, but not all, of the challenges in characterizing complex systems by integrating multiple sources of information (Fig. 1B). To simultaneously overcome all of these problems, we present the “metainference” method, a Bayesian inference approach that quantifies the extent to which a prior distribution of models is modified by the introduction of experimental data that are expectation values over a heterogeneous distribution and subject to errors. To achieve this goal, metainference models a finite sample of this distribution, in the spirit of the replica-averaged modeling based on the maximum entropy principle. Notably, our approach reduces to the maximum entropy modeling in the limit of the absence of noise in the data, and to standard Bayesian modeling when experimental data are not ensemble averages. This link between Bayesian inference and the maximum entropy principle is not surprising given the connections between these two approaches (*27*, *28*). We first benchmark the accuracy of our method on a simple heterogeneous model system, in which synthetic experimental data can be generated with different levels of noise as averages over a discrete number of states of the system. We then show its application with nuclear magnetic resonance (NMR) spectroscopy data in the case of the structural fluctuations of the protein ubiquitin in its native state, which we modeled by combining chemical shifts with residual dipolar couplings (RDCs).

## RESULTS AND DISCUSSION

Metainference is a Bayesian approach to modeling a heterogeneous system and all sources of error by considering a set of copies of the system (replicas), which represent a finite sample of the distribution of models, in the spirit of the replica-averaged formulation of the maximum entropy principle (*22*–*26*). The generation of models by suitable sampling algorithms [typically Monte Carlo or molecular dynamics (MD)] is guided by a score given in terms of the negative logarithm of the posterior probability (Materials and Methods)

$$s(\mathbf{X}, \boldsymbol{\sigma}) = -\sum_{r=1}^{N} \left[\log P(X_r) + \log P(\sigma_r)\right] + \Delta^2(\mathbf{X}) \sum_{r=1}^{N} \frac{1}{2\sigma_r^2}$$

where **X** = [*X*_{r}] and **σ** = [σ_{r}] are, respectively, the sets of conformational states and uncertainties, one for each replica. σ_{r} includes all of the sources of errors, that is, the error in representing the ensemble with a finite number of replicas (σ_{r}^{SEM}), as well as random, systematic, and forward model errors (σ_{r}^{B}). *P* is the prior probability that encodes information other than experimental data, and Δ^{2}(**X**) is the deviation of the experimental data from the data predicted by the forward model. This schematic equation, which omits the data likelihood normalization term for the uncertainty parameters, holds for Gaussian errors and a single data point, and a more general formulation can be found in Materials and Methods (Eqs. 5 and 8).

### Metainference of a heterogeneous model system

We first illustrate the metainference method for a model system that can simultaneously populate a set of discrete states, that is, a mixture. In this example, the number of states in the mixture and their population can be varied arbitrarily. We created synthetic data as ensemble averages over these discrete states (Fig. 2A), and we added random and systematic noise. We thus introduced prior information, which provides an approximate description of the system and its distribution of states and whose accuracy can also be tuned. We then used the reference data to complement the prior information and to recover the correct number and populations of the states. We tested the following approaches: metainference (with the Gaussian and outliers noise models in Eqs. 9 and 11, respectively), replica-averaging maximum entropy, and standard Bayesian inference (that is, Bayesian inference without mixtures). The accuracy of a given approach was defined as the root mean square deviation of the inferred populations from the correct populations of the discrete states. We benchmarked the accuracy as a function of the number of data points used, the level of noise in the data, the number of states and replicas, and the accuracy of the prior information. Details of the simulations, generation of data, sampling algorithm, and likelihood and model to treat systematic errors and outliers can be found in the Supplementary Materials.
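The accuracy metric defined above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark code behind Fig. 2: the function names `populations_from_replicas` and `population_rmsd`, and the choice of estimating populations by counting how many replicas sit in each discrete state, are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def populations_from_replicas(assignments: Sequence[int], n_states: int) -> List[float]:
    """Estimate populations as the fraction of replicas found in each state.

    `assignments` holds one discrete state index per replica (hypothetical input).
    """
    counts = Counter(assignments)
    return [counts.get(k, 0) / len(assignments) for k in range(n_states)]


def population_rmsd(inferred: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square deviation between inferred and reference populations."""
    if len(inferred) != len(reference):
        raise ValueError("population vectors must have the same length")
    squared = sum((p - q) ** 2 for p, q in zip(inferred, reference))
    return math.sqrt(squared / len(reference))


# Eight replicas distributed over three states, compared with assumed true populations.
inferred = populations_from_replicas([0, 0, 1, 2, 2, 2, 1, 0], n_states=3)
print(inferred, population_rmsd(inferred, [0.4, 0.2, 0.4]))
```

With this definition, an accuracy of 16% for the prior corresponds to a typical per-state population error of 0.16.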

### Comparison with the maximum entropy method

We found that the metainference and maximum entropy methods perform equally well in the absence of noise in the data or in the presence of random noise alone (Fig. 2, B and C, gray and orange lines), as expected, given that maximum entropy is particularly effective in the case of mixtures of states (*22*, *23*). The accuracy of the two methods was comparable and, most importantly, increased with the number of data points used (Fig. 2, B and C). With 20 data points and 128 replicas, and in the absence of noise, the accuracy averaged over 300 independent simulations of a five-state system was equal to 0.4 ± 0.2% and 0.2 ± 0.1% for the metainference and maximum entropy approaches, respectively. For reference, the accuracy of the prior information alone was much lower, that is, 16%. Metainference, however, outperformed the maximum entropy approach in the presence of systematic errors (Fig. 2, B and C, green lines). The accuracy of metainference increased significantly more rapidly upon the addition of new information, despite the high level of noise. When using 20 data points, 128 replicas, and 30% outliers ratio, the accuracy averaged over 300 independent simulations of a five-state system was equal to 2 ± 2% and 14 ± 5% for the metainference and maximum entropy approaches, respectively. As systematic errors are ubiquitous both in the experimental data and in the forward model used to predict the data, this situation more closely reflects a realistic scenario. The ability of metainference to effectively deal with averaging and with the presence of systematic errors at the same time is the main motivation for introducing this method. This approach can thus leverage the substantial amount of noisy data produced by high-throughput techniques and accurately model conformational ensembles of heterogeneous systems.

### Comparison with standard Bayesian modeling

In the standard Bayesian approach, one assumes the presence of a single state in the sample and estimates its probability or confidence level given experimental data and prior knowledge available. When modeling multiple-state systems with ensemble-averaged data and standard Bayesian modeling, one could be tempted to interpret the probability of each state as its equilibrium population. In doing so, however, one makes a significant error, which grows with the number of data points used, regardless of the level of noise in the data (Fig. 2D).
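A toy calculation may help to see why this error grows with the number of data points. The sketch below is not the model system of Fig. 2; it assumes a hypothetical two-state system in which state A predicts 0 and state B predicts 1 for every observable, noise-free data equal to the ensemble average, a flat prior, and a Gaussian likelihood with a fixed uncertainty. All names and numerical values are illustrative.

```python
import math


def single_state_posterior(n_points: int, pop_a: float = 0.7, sigma: float = 0.5) -> float:
    """Posterior probability of state A under a single-state Bayesian model.

    Assumed toy setup: A predicts 0, B predicts 1, and each datum equals the
    ensemble average pop_a * 0 + (1 - pop_a) * 1.
    """
    d = 1.0 - pop_a
    log_like_a = -n_points * (d - 0.0) ** 2 / (2.0 * sigma ** 2)
    log_like_b = -n_points * (d - 1.0) ** 2 / (2.0 * sigma ** 2)
    # Flat prior: normalize the two likelihoods (shifted for numerical stability).
    shift = max(log_like_a, log_like_b)
    w_a = math.exp(log_like_a - shift)
    w_b = math.exp(log_like_b - shift)
    return w_a / (w_a + w_b)


for n in (1, 5, 20):
    post = single_state_posterior(n)
    # Reading the posterior as a population gives an error that grows with n.
    print(n, round(post, 3), round(abs(post - 0.7), 3))
```

As more averaged data are added, the single-state posterior concentrates on the state closest to the average rather than converging to the true population of 0.7.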

### Role of prior information

We tested two priors of different accuracies, with an average population error per state equal to 8 and 16%, respectively. The results suggest that the number of experimental data points required to achieve a given accuracy of the inferred populations depends on the quality of the prior information (fig. S1). The more accurate the prior is, the fewer data points are needed. This is an intuitive, yet important, result. Accurate priors almost invariably require more complex descriptions of the system under study; thus, they usually come at a higher computational cost.

### Scaling with the number of replicas

As the number of replicas grows, the error in estimating ensemble averages using a finite number of replicas decreases, and the overall accuracy of the inferred populations increases (fig. S2), regardless of the level of noise in the data. Furthermore, we verified numerically that, in the absence of random and systematic errors in the data, the intensity of the harmonic restraint, which couples the average of the forward model on the *N* replicas to the experimental data (Eq. 7), scales as *N*^{2} (Fig. 3). This test confirms that, in the limit of the absence of noise in the data, metainference coincides with the replica-averaging maximum entropy modeling (Materials and Methods).

### Scaling with the number of states

Metainference is also robust to the number of states populated by the system. We tested our model in the case of 5 and 50 states and determined that the number of data points needed to achieve a given accuracy scales less than linearly with the number of states (fig. S3).

### Outliers model and error marginalization

As the numbers of data points and replicas increase, using one error parameter per replica and data point becomes increasingly expensive computationally. In this situation, one can assume a unimodal and long-tailed distribution for the errors, peaked around a typical value for a data set (or experiment type) and replica, and marginalize all of the uncertainty parameters of the single data points (Materials and Methods). The accuracy of this marginalized error model was found to be similar to the case in which a single error parameter was used for each data point (fig. S4).

### Analysis of the inferred uncertainties

We analyzed the distribution of inferred uncertainties σ^{B} in the presence of systematic errors (outliers) when using a Gaussian data likelihood with one uncertainty per data point (Eq. 9) and the outliers model with one uncertainty per data set (Eq. 11). In the former case, metainference was able to automatically detect the data points affected by systematic errors, assign a higher uncertainty to them, and thus downweight the associated restraints (Fig. 4A). In the latter case, the inferred typical data set uncertainty was somewhere in between the uncertainty inferred using the Gaussian likelihood on the data points with no noise and the uncertainty inferred using the Gaussian likelihood on the outliers (Fig. 4B). In this specific test (five states, 20 data points including eight outliers, prior accuracy equal to 16%, and 128 replicas), both data noise models generated an ensemble of comparable accuracy (3%).

### Metainference in integrative structural biology

We compared the metainference and maximum entropy approaches using NMR experimental data on a classical example in structural biology—the structural fluctuations in the native state of ubiquitin (*22*, *29*, *30*). A conformational ensemble of ubiquitin was modeled using CA, CB, CO, HA, HN, and NH chemical shifts combined with RDCs collected in a steric medium (*30*) (Fig. 5A). The ensemble was validated by multiple criteria (table S1). The stereochemical quality was assessed by PROCHECK (*31*); data not used for modeling, including ^{3}*J*_{HNC} and ^{3}*J*_{HNHA} scalar couplings and RDCs collected in other media (*32*), were backcalculated and compared with the experimental data. Exhaustive sampling was achieved by 1-μs-long MD simulations performed with GROMACS (*33*) equipped with PLUMED (*34*). We used the CHARMM22* force field as prior information (*35*). Additional details of these simulations can be found in the Supplementary Materials.

The quality of the metainference ensemble (Fig. 5B) was higher than that of the maximum entropy ensemble, as suggested by the better fit with the data not used in the modeling (Fig. 5C and table S1) and by the stereochemical quality (table S2). Data used as restraints were also more accurately reproduced by metainference. One of the major differences between the two approaches is that metainference can deal more effectively with the errors in the chemical shifts calculated on different nuclei. The more inaccurate HN and NH chemical shifts were detected by metainference and thus automatically downweighted in constructing the ensemble (Fig. 6).

We also compared the metainference ensemble with an ensemble generated by standard MD simulations and with a high-resolution NMR structure. The metainference ensemble obtained by combining chemical shifts and RDCs reproduced all of the experimental data not used for the modeling better than the MD ensemble and the NMR structure. The only exceptions were the ^{3}*J*_{HNC} scalar couplings, which were slightly more accurate in the MD ensemble, and the ^{3}*J*_{HNHA} scalar couplings, which were better predicted by the NMR structure (Fig. 5C and table S1).

The NMR structure, which was determined according to the criterion of maximum parsimony, accurately reproduced most of the available experimental data. Ubiquitin, however, exhibits rich dynamical properties over a wide range of time scales averaged in the experimental data (*36*). In particular, a main source of dynamics involves a flip of the backbone of residues D^{52}-G^{53} coupled with the formation of a hydrogen bond between the side chain of E^{24} and the backbone of G^{53}. Although metainference was able to capture the conformational exchange between these two states, the static representation provided by the NMR structure could not (Fig. 5B).

In conclusion, we have presented the metainference approach, which enables the building of an ensemble of models consistent with experimental data when the data are affected by errors and are averaged over mixtures of the states of a system. Because complex systems and experimental data almost invariably exhibit both heterogeneity and errors, we anticipate that our method will find applications across a wide variety of scientific fields, including genomics, proteomics, metabolomics, and integrative structural biology.

## MATERIALS AND METHODS

The quantitative understanding of a system involves the construction of a model *M* to represent it. If a system can occupy multiple possible states, one should determine the distribution of models *p*(*M*) that specifies in which states the system can be found and the corresponding probabilities. To construct this distribution of models, one should take into account the consistency with the overall knowledge that one has about the system. This includes theoretical knowledge (called the “prior” information *I*) and the information acquired from experimental measurements (that is, the “data” *D*) (*1*). In Bayesian inference, the probability of a model given the information available is known as the posterior probability *p*(*M*|*D*, *I*) of *M* given *D* and *I*, and it is given by

$$p(M \mid D, I) \propto p(D \mid M, I)\, p(M \mid I) \tag{1}$$

where the likelihood function *p*(*D*|*M*, *I*) is the probability of observing *D* given *M* and *I*, and the prior probability *p*(*M*|*I*) is the probability of *M* given *I*. To define the likelihood function, one needs a forward model *f*(*M*) that predicts the data that would be observed for model *M* and a noise model that specifies the distribution of the deviations between the observed data and the predicted data. In the following, we assumed that the forward model depends only on the conformational state *X* of the system and that the noise model is defined in terms of unknown parameters σ that are part of the model *M* = (*X*, σ). These parameters quantify the level of noise in the data, and they are inferred along with the state *X* by sampling the posterior distribution. The sampling is usually carried out using computational techniques such as Monte Carlo, MD, or combined methods based on Gibbs sampling (*1*).
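For a finite set of candidate models, the Bayesian update described above reduces to multiplying prior by likelihood and normalizing. The following sketch illustrates only that step; the function name `posterior` and the numerical values are assumptions chosen for the example, and in practice the posterior over (*X*, σ) is sampled rather than enumerated.

```python
from typing import List, Sequence


def posterior(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Normalized posterior over a discrete set of models.

    prior[k] plays the role of p(M_k | I) and likelihood[k] of p(D | M_k, I).
    """
    if len(prior) != len(likelihood):
        raise ValueError("prior and likelihood must have the same length")
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)
    if evidence <= 0.0:
        raise ValueError("the data have zero probability under every model")
    return [u / evidence for u in unnormalized]


# Two equally likely models; the data are three times more probable under the second.
print(posterior([0.5, 0.5], [0.2, 0.6]))
```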

### Mixture of states

Experimental data collected under equilibrium conditions are usually the result of ensemble averages over a large number of states. In metainference, the prior information *p*(*X*) of state *X* provides an a priori description of the distribution of states. To quantify the fit with the observed data and to determine to what extent the prior distribution is modified by the introduction of the data, we needed to calculate the expectation values of the forward model over the distribution of states. Inspired by the replica-averaged modeling based on the maximum entropy principle (*22*–*26*), we considered a finite sample of this distribution by simultaneously modeling *N* replicas of the model **M** = [*M*_{r}], and we calculated the forward model as an average over the states **X** = [*X*_{r}]

$$f(\mathbf{X}) = \frac{1}{N} \sum_{r=1}^{N} f(X_r) \tag{2}$$

Typically, we have information only about expectation values on the distribution of states *X*, and not on the other parameters of the model, such as σ. However, we were also interested in determining how the prior distributions of these parameters are modified by the introduction of the experimental data. Therefore, we modeled a finite sample of the joint probability distribution of all parameters of the model.
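The replica average of the forward model is simply the arithmetic mean of the single-replica predictions. A minimal sketch follows, in which `replica_average` and the toy forward model are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")


def replica_average(forward_model: Callable[[State], float], states: Sequence[State]) -> float:
    """Average the forward-model prediction over the N replica states."""
    if not states:
        raise ValueError("at least one replica is required")
    return sum(forward_model(x) for x in states) / len(states)


def toy_forward_model(x: float) -> float:
    """Hypothetical observable: the square of a scalar coordinate."""
    return x * x


# Three replicas in different states; the observable is averaged, not the states.
print(replica_average(toy_forward_model, [1.0, 2.0, 3.0]))
```

Note that the average is taken over the predicted observables, which in general differs from applying the forward model to an averaged state.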

For a reduced computational cost, a relatively small number of replicas are typically used in the modeling. In this situation, the estimate *f*(**X**) of the forward model deviated from the average $\tilde{f}$ that would be obtained using an infinite number of replicas. This was an unknown quantity, which we added to the parameters of our model. However, the central limit theorem provided a strong parametric prior because it guaranteed that the probability of having a certain value of $\tilde{f}$ given a finite number of states **X** is a Gaussian distribution

$$p(\tilde{f} \mid \mathbf{X}, \sigma^{SEM}) = \frac{1}{\sqrt{2\pi}\,\sigma^{SEM}} \exp\left[-\frac{\left(\tilde{f} - f(\mathbf{X})\right)^2}{2\left(\sigma^{SEM}\right)^2}\right] \tag{3}$$

where the standard error of the mean σ^{SEM} decreases with the square root of the number of replicas

$$\sigma^{SEM} \propto \frac{1}{\sqrt{N}} \tag{4}$$

We recognized that, in considering a finite sample of our distribution of states, we introduced an error in the calculation of expectation values. Therefore, experimental data should be compared to the (unknown) average of the forward model over an infinite number of replicas $\tilde{f}$, which is then related to the average over our finite sample *f*(**X**) via the central limit theorem of Eq. 3. From these considerations, we can derive the posterior probability of the ensemble of *N* replicas representing a finite sample of our distribution of models $p(\tilde{\mathbf{f}}, \boldsymbol{\sigma}^{B}, \mathbf{X}, \boldsymbol{\sigma}^{SEM} \mid d)$. In the case of a single experimental data point *d*, this can be expressed as (Supplementary Materials)

$$p(\tilde{\mathbf{f}}, \boldsymbol{\sigma}^{B}, \mathbf{X}, \boldsymbol{\sigma}^{SEM} \mid d) \propto \prod_{r=1}^{N} p(d \mid \tilde{f}_r, \sigma_r^{B})\, p(\tilde{f}_r \mid \mathbf{X}, \sigma_r^{SEM})\, p(\sigma_r^{SEM})\, p(\sigma_r^{B})\, p(X_r) \tag{5}$$

The data likelihood $p(d \mid \tilde{f}_r, \sigma_r^{B})$ relates the experimental data *d* to the average of the forward model over an infinite number of replicas, given the uncertainty $\sigma_r^{B}$. This parameter describes random and systematic errors in the experimental data and errors in the forward model. The functional form of $p(d \mid \tilde{f}_r, \sigma_r^{B})$ depends on the nature of the experimental data, and it is typically a Gaussian or lognormal distribution. As noted above, $p(\tilde{f}_r \mid \mathbf{X}, \sigma_r^{SEM})$ is the parametric prior on $\tilde{f}_r$ that relates the (unknown) average to the estimate *f*(**X**) computed with a finite number of replicas *N* via the central limit theorem of Eq. 3, and thus it is always a Gaussian distribution. $p(\sigma_r^{SEM})$ is the prior on the standard error of the mean $\sigma_r^{SEM}$ and encodes Eq. 4. $p(\sigma_r^{B})$ is the prior on the uncertainty parameter $\sigma_r^{B}$, and *p*(*X*_{r}) is the prior on the structure *X*_{r}.
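The role of the central limit theorem can be illustrated numerically. The sketch below assumes that the standard error of the mean is estimated from the spread of the single-replica predictions (sample standard deviation divided by the square root of *N*), which is one common estimate and not necessarily how σ^{SEM} is treated in metainference, where it is a model parameter with its own prior. The function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def standard_error_of_mean(replica_values: Sequence[float]) -> float:
    """Sample standard deviation of the replica predictions divided by sqrt(N)."""
    n = len(replica_values)
    if n < 2:
        raise ValueError("at least two replicas are needed to estimate the spread")
    return statistics.stdev(replica_values) / math.sqrt(n)


def log_clt_prior(f_tilde: float, f_mean: float, sigma_sem: float) -> float:
    """Log of a Gaussian prior on the infinite-replica average f_tilde.

    The Gaussian is centered on the finite-replica estimate f_mean and has
    width sigma_sem, as suggested by the central limit theorem.
    """
    z = (f_tilde - f_mean) / sigma_sem
    return -0.5 * z * z - math.log(math.sqrt(2.0 * math.pi) * sigma_sem)


values = [1.0, 2.0, 3.0, 4.0]  # hypothetical forward-model predictions, one per replica
sem = standard_error_of_mean(values)
print(sem, log_clt_prior(2.4, statistics.mean(values), sem))
```

Quadrupling the number of replicas with the same spread halves the estimated standard error, in line with the square-root dependence.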

### Gaussian noise model

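We can further simplify Eq. 5 in the case of Gaussian data likelihood $p(d \mid \tilde{f}_r, \sigma_r^{B})$. In this situation, the infinite-replica average $\tilde{f}_r$ can be marginalized (Supplementary Materials), and the posterior probability can be written as

$$p(\mathbf{X}, \boldsymbol{\sigma} \mid d) \propto \prod_{r=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_r} \exp\left[-\frac{\left(d - f(\mathbf{X})\right)^2}{2\sigma_r^2}\right] p(\sigma_r)\, p(X_r) \tag{6}$$

where the effective uncertainty $\sigma_r = \sqrt{(\sigma_r^{SEM})^2 + (\sigma_r^{B})^2}$ encodes all sources of errors: the statistical error due to the use of a finite number of replicas, experimental and systematic errors, and errors in the forward model. The associated energy function in units of *k*_{B}*T* becomes

$$E = \frac{\left(d - f(\mathbf{X})\right)^2}{2} \sum_{r=1}^{N} \frac{1}{\sigma_r^2} + \sum_{r=1}^{N} \left[\log \sigma_r - \log p(\sigma_r) - \log p(X_r)\right] \tag{7}$$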

This equation shows how metainference includes different existing modeling methods in limiting cases. In the absence of data and forward model errors (σ_{r}^{B} = 0), our approach reduces to the replica-averaged maximum entropy modeling, in which a harmonic restraint couples the replica-averaged observable to the experimental data. The intensity of the restraint, $k = \sum_{r} 1/\sigma_r^2$, scales with the number of replicas as *N*^{2}, that is, more than linearly, as required by the maximum entropy principle (*24*). We numerically verified this behavior in our heterogeneous model system in the absence of any errors in the data (Fig. 3). In the presence of errors (σ_{r}^{B} ≠ 0), the intensity *k* scales as *N*, and it is modulated by the data uncertainty σ_{r}^{B}. Finally, in the case in which the experimental data are not ensemble averages (σ_{r}^{SEM} = 0), we recover the standard Bayesian modeling.
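The limiting behaviors discussed above can be checked with a short numerical sketch. It assumes the Gaussian form of the energy, that is, a harmonic term (*d* − *f*(**X**))^{2}/2 multiplied by the intensity *k* = Σ_{r} 1/σ_{r}^{2}, plus log σ_{r} and the negative log priors, with the effective uncertainty obtained by adding σ^{SEM} and σ^{B} in quadrature. All function names, the argument `sem_one_replica` (the proportionality constant of the 1/√*N* law), and the choice of identical uncertainties for all replicas are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def effective_sigma(sigma_b: float, sigma_sem: float) -> float:
    """Combine data/forward-model error and finite-sample error in quadrature."""
    return math.sqrt(sigma_b ** 2 + sigma_sem ** 2)


def restraint_intensity(sigmas: Sequence[float]) -> float:
    """Intensity of the harmonic restraint, k = sum_r 1 / sigma_r^2."""
    return sum(1.0 / s ** 2 for s in sigmas)


def gaussian_energy(d: float, replica_predictions: Sequence[float],
                    sigmas: Sequence[float],
                    neg_log_priors: Optional[Sequence[float]] = None) -> float:
    """Energy in units of kBT for one data point and Gaussian noise.

    neg_log_priors[r] stands for -log p(sigma_r) - log p(X_r); it is supplied
    by the caller and defaults to zero (flat priors, an assumption).
    """
    n = len(replica_predictions)
    if neg_log_priors is None:
        neg_log_priors = [0.0] * n
    f_avg = sum(replica_predictions) / n
    harmonic = 0.5 * restraint_intensity(sigmas) * (d - f_avg) ** 2
    normalization = sum(math.log(s) for s in sigmas)
    return harmonic + normalization + sum(neg_log_priors)


def intensity_for(n_replicas: int, sigma_b: float, sem_one_replica: float) -> float:
    """k when every replica shares sigma_b and sigma_sem = sem_one_replica / sqrt(N)."""
    sigma_sem = sem_one_replica / math.sqrt(n_replicas)
    return restraint_intensity([effective_sigma(sigma_b, sigma_sem)] * n_replicas)


# Doubling N multiplies k by 4 without data errors and by about 2 when sigma_b dominates.
print(intensity_for(16, 0.0, 0.1) / intensity_for(8, 0.0, 0.1))
print(intensity_for(16, 1.0, 0.1) / intensity_for(8, 1.0, 0.1))
print(gaussian_energy(2.0, [0.0, 2.0, 1.0, 1.0], [0.5] * 4))
```

The first ratio reproduces the quadratic scaling of the maximum entropy limit, and the second shows the crossover to linear scaling once the data uncertainty is much larger than the standard error of the mean.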

### Multiple experimental data points

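Equation 5 can be extended to the case of *N*_{d} independent data points **D** = [*d*_{i}], possibly gathered in different experiments at varying levels of noise (Supplementary Materials)

$$p(\tilde{\mathbf{f}}, \boldsymbol{\sigma}^{B}, \mathbf{X}, \boldsymbol{\sigma}^{SEM} \mid \mathbf{D}) \propto \prod_{r=1}^{N} p(X_r) \prod_{i=1}^{N_d} p(d_i \mid \tilde{f}_{r,i}, \sigma_{r,i}^{B})\, p(\tilde{f}_{r,i} \mid \mathbf{X}, \sigma_{r,i}^{SEM})\, p(\sigma_{r,i}^{SEM})\, p(\sigma_{r,i}^{B}) \tag{8}$$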

### Outliers model

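To reduce the number of parameters that need to be sampled in the case of multiple experimental data points, one can model the distribution of the errors around a typical data set error and marginalize the error parameters for the individual data points. For example, a data set can be defined as a set of chemical shifts or RDCs on a given nucleus. In this case, it is reasonable to assume that the level of error of the individual data points in the data set is homogeneous, except for the presence of a few outliers. Let us consider, for example, the case of Gaussian data noise. In the case of multiple experimental data points, Eq. 6 becomes

$$p(\mathbf{X}, \boldsymbol{\sigma} \mid \mathbf{D}) \propto \prod_{r=1}^{N} p(X_r) \prod_{i=1}^{N_d} \frac{1}{\sqrt{2\pi}\,\sigma_{r,i}} \exp\left[-\frac{\left(d_i - f_i(\mathbf{X})\right)^2}{2\sigma_{r,i}^2}\right] p(\sigma_{r,i}) \tag{9}$$

where $f_i(\mathbf{X})$ is the replica-averaged forward model for the *i*th data point. The prior *p*(σ_{r,i}) can be modeled using a unimodal distribution peaked around a typical data set effective uncertainty σ_{r,0} and with a long tail to tolerate outlier data points (*37*)

$$p(\sigma_{r,i}) = \frac{2\sigma_{r,0}}{\sqrt{\pi}\,\sigma_{r,i}^2} \exp\left(-\frac{\sigma_{r,0}^2}{\sigma_{r,i}^2}\right) \tag{10}$$

where $\sigma_{r,0} = \sqrt{(\sigma^{SEM})^2 + (\sigma_{r,0}^{B})^2}$, with σ^{SEM} as the standard error of the mean for all data points in the data set and replicas and with $\sigma_{r,0}^{B}$ as the typical data uncertainty of the data set. We can thus marginalize σ_{r,i} by integrating over all its possible values, given that all of the data uncertainties range from 0 to infinity

$$p(\mathbf{X}, \boldsymbol{\sigma}_0 \mid \mathbf{D}) \propto \prod_{r=1}^{N} p(\sigma_{r,0})\, p(X_r) \prod_{i=1}^{N_d} \frac{\sqrt{2}\,\sigma_{r,0}}{\pi} \frac{1}{\left(d_i - f_i(\mathbf{X})\right)^2 + 2\sigma_{r,0}^2} \tag{11}$$

After marginalization, we are left with just one parameter per replica that needs to be sampled.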
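The marginalization over the individual uncertainties can be verified numerically. The sketch below assumes a specific long-tailed prior, *p*(σ) = 2σ_{0}/(√π σ^{2}) exp(−σ_{0}^{2}/σ^{2}), which is peaked at σ = σ_{0}; under that assumption, integrating a Gaussian likelihood over σ from 0 to infinity gives the closed form √2 σ_{0} / {π [Δ^{2} + 2σ_{0}^{2}]}, where Δ is the deviation between data and prediction. The function names and the midpoint integration scheme are illustrative choices.

```python
import math


def gaussian_likelihood(delta: float, sigma: float) -> float:
    """Gaussian density of a deviation delta for uncertainty sigma."""
    return math.exp(-0.5 * (delta / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


def long_tailed_prior(sigma: float, sigma0: float) -> float:
    """Assumed prior on sigma, peaked at sigma0 with a 1/sigma^2 tail."""
    return 2.0 * sigma0 / (math.sqrt(math.pi) * sigma ** 2) * math.exp(-(sigma0 / sigma) ** 2)


def marginal_closed_form(delta: float, sigma0: float) -> float:
    """Analytic result of integrating likelihood times prior over sigma."""
    return math.sqrt(2.0) * sigma0 / (math.pi * (delta ** 2 + 2.0 * sigma0 ** 2))


def marginal_numeric(delta: float, sigma0: float, n_steps: int = 20000) -> float:
    """Midpoint-rule integration over u = 1/sigma, so that d(sigma) = sigma^2 du."""
    a = 0.5 * delta ** 2 + sigma0 ** 2
    u_max = 12.0 / math.sqrt(a)  # the integrand decays as exp(-a u^2)
    h = u_max / n_steps
    total = 0.0
    for k in range(n_steps):
        sigma = 1.0 / ((k + 0.5) * h)
        total += gaussian_likelihood(delta, sigma) * long_tailed_prior(sigma, sigma0) * sigma ** 2 * h
    return total


print(marginal_numeric(1.0, 0.5), marginal_closed_form(1.0, 0.5))
```

The marginal likelihood decays as 1/Δ^{2} for large deviations instead of exponentially, which is why a few outliers do not dominate the fit.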

## SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/2/1/e1501177/DC1

Derivation of the basic metainference equations

Details of the model system simulations

Details of the ubiquitin MD simulations

Fig. S1. Effect of prior accuracy on the error of the metainference method.

Fig. S2. Scaling of metainference error with the number of replicas at varying levels of noise in the data.

Fig. S3. Scaling of metainference error with the number of states.

Fig. S4. Accuracy of the outliers model.

Table S1. Comparison of the quality of the ensembles obtained using different modeling approaches in the case of the native state of the protein ubiquitin.

Table S2. Comparison of the stereochemical quality of the ensembles or single models generated by the approaches defined in table S1.

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is **not** for commercial advantage and provided the original work is properly cited.

## REFERENCES AND NOTES

**Acknowledgments:** We would like to thank A. Mira for useful discussions on the Bayesian method.

**Funding:** This work received no funding.

**Author contributions:** M.B., C.C., A.C., and M.V. designed the research and analyzed the data. M.B. and C.C. performed the research. M.B., C.C., and M.V. wrote the paper.

**Competing interests:** The authors declare that they have no competing interests.

**Data and materials availability:** All data needed to evaluate the conclusions in the paper are present in the paper itself and in the Supplementary Materials or are available upon request from the authors.

- Copyright © 2016, The Authors