## Abstract

Quantum computing and artificial intelligence, combined, may revolutionize future technologies. A significant school of thought regarding artificial intelligence is based on generative models. Here, we propose a general quantum algorithm for machine learning based on a quantum generative model. We prove that our proposed model is more capable of representing probability distributions than classical generative models and has exponential speedup in learning and inference, at least for some instances, if a quantum computer cannot be efficiently simulated classically. Our result opens a new direction for quantum machine learning and offers a remarkable example in which a quantum algorithm shows exponential improvement over classical algorithms in an important application field.

## INTRODUCTION

A central task in the field of quantum computing is to find applications where a quantum computer could provide exponential speedup over any classical computer (*1*–*3*). Machine learning represents an important field with broad applications where a quantum computer may offer substantial speedup (*4*–*14*). The candidate algorithms with potential exponential speedup in runtime so far rely on efficient quantum solutions of linear algebraic problems (*6*–*11*). These algorithms require quantum random access memory (QRAM) as a critical component in addition to a quantum computer (*4*, *5*, *15*). In a QRAM, the number of required quantum routing operations (*16*) scales up exponentially with the number of qubits in those algorithms (*9*, *15*, *17*). This exponential overhead in resource requirement poses a serious challenge for its experimental implementation and is a caveat for fair comparison with the corresponding classical algorithms (*5*, *18*).

A significant school of thought regarding artificial intelligence is based on generative models (*19*–*21*), which are widely used for probabilistic reasoning as well as for supervised and unsupervised machine learning (*19*–*21*). Generative models try to capture the full underlying probability distributions for the observed data. Compared to discriminative models such as support vector machines and feed-forward neural networks, generative models can express far more complex relations among variables, which makes them broadly applicable but at the same time harder to tackle (*19*, *21*, *22*). Typical generative models include probabilistic graphical models such as the Bayesian nets and the Markov random fields (*19*, *20*, *22*), and generative neural networks such as the Boltzmann machines, the deep belief nets, and the generative adversarial networks (*21*). The probability distributions in these classical generative models can be represented by the so-called factor graphs (*19*, *21*, *22*). In section S1, we give a brief introduction to generative and discriminative models and their applications in machine learning.

Here, we propose a generative quantum machine learning algorithm that offers potential exponential improvement on three key elements of the generative models, that is, the representational power and the runtimes for learning and inference. In our introduced quantum generative model (QGM), the underlying probability distribution describing correlations in data is generated by measuring a set of observables under a many-body entangled state. In terms of the representational power, we prove that our introduced QGM can efficiently represent any factor graphs, which include almost all the classical generative models as particular cases. Throughout this paper, the word “efficiently” means that the computational or memory cost is bounded by a polynomial function of the size of the problem. Furthermore, we show that the QGM is exponentially more expressive than factor graphs by proving that at least some instances generated by the QGM cannot be efficiently represented by any factor graph with a polynomial number of variables if a widely accepted conjecture in the computational complexity theory holds, that is, the polynomial hierarchy, which is a generalization of the famous P versus NP problem, does not collapse. For learning and inference in our QGM, we propose a general heuristic algorithm using a combination of techniques such as tensor networks, construction of parent Hamiltonians for many-body entangled states, and recursive quantum phase estimation. Although it is unreasonable to expect that the proposed quantum algorithm has polynomial scaling in runtime in all the cases (as this would imply that a quantum computer could efficiently solve any NP problem, an unlikely result), we prove that, at least for some instances, our quantum algorithm has exponential speedup over any classical algorithm, assuming that a quantum computer cannot be efficiently simulated in general by classical computers, a conjecture that is believed to hold.
Very recently, a different generative model for quantum machine learning was proposed on the basis of a quantum version of the Boltzmann machine (*23*). The quantum advantages compared with the classical generative models, however, still remain unknown for that model in terms of the representational power and the runtimes for learning and inference.

The intuition for quantum speedup in our algorithm can be understood as follows: The purpose of generative machine learning is to model any data generation process in nature by finding the underlying probability distribution. As nature is governed by the law of quantum mechanics, it is too restrictive to assume that the real-world data can always be modeled by an underlying probability distribution as in classical generative models. Instead, in our QGM, correlations in data are parameterized by the underlying probability amplitudes of a many-body entangled state. As the interference of quantum probability amplitudes can lead to phenomena much more complex than those from classical probabilistic models, it is possible to achieve marked improvement in our QGM under certain circumstances.

## RESULTS

### Factor graphs and our QGM

We start by defining factor graphs and our QGM. Direct characterization of a probability distribution of *n* binary variables has an exponential cost of 2^{n}. A factor graph, which includes many classical generative models as special cases, is a compact way to represent *n*-particle correlation (*21*, *22*). As shown in Fig. 1A, a factor graph is associated with a bipartite graph where the probability distribution can be expressed as a product of positive correlation functions of a constant number of variables. Here, without loss of generality, we have assumed constant-degree graphs, where the maximum number of edges per vertex is bounded by a constant.
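As a concrete toy illustration of this definition, the following sketch (our own example, not taken from the paper) builds the joint distribution of a hypothetical factor graph with three binary variables and two function nodes; the factor values are arbitrary positive numbers chosen for illustration.

```python
import itertools
import numpy as np

# A toy factor graph over three binary variables x0, x1, x2 with two
# function nodes: f(x0, x1) and g(x1, x2).  The factor values are
# arbitrary positive numbers chosen purely for illustration.
f = np.array([[2.0, 1.0], [1.0, 3.0]])   # f[x0, x1]
g = np.array([[1.0, 0.5], [2.0, 1.0]])   # g[x1, x2]

def factor_graph_distribution(f, g):
    """Joint distribution P(x0, x1, x2) = f(x0, x1) * g(x1, x2) / Z."""
    p = np.zeros((2, 2, 2))
    for x0, x1, x2 in itertools.product([0, 1], repeat=3):
        p[x0, x1, x2] = f[x0, x1] * g[x1, x2]
    return p / p.sum()   # divide by the normalization factor Z

P = factor_graph_distribution(f, g)
print(P.sum())   # normalized to 1
```

Each vertex in the associated bipartite graph touches a bounded number of factors, matching the constant-degree assumption in the text.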

Our QGM is defined on a graph state |*G*〉 of *m* qubits. As a powerful tool to represent many-body entangled states, the graph state |*G*〉 is defined on a graph *G* (*24*), where each vertex in *G* is associated with a qubit. To prepare the graph state |*G*〉, all the qubits are initialized to the state |+〉 = (|0〉 + |1〉)/√2 at the beginning, where |0〉, |1〉 denote the qubit computational basis vectors, and then a controlled phase flip gate is applied to all the qubit pairs connected by an edge in the graph *G*. We then introduce the following transformation to the graph state |*G*〉

|*Q*〉 = (⊗_{i=1}^{m}*M*_{i})|*G*〉/‖(⊗_{i=1}^{m}*M*_{i})|*G*〉‖ (1)

where *M*_{i} denotes an invertible (in general, nonunitary) 2 × 2 matrix applied on the Hilbert space of qubit *i*. Note that for general nonunitary *M*_{i}, the state |*Q*〉 cannot be directly prepared from the state |*G*〉 efficiently, and in our following learning algorithm, we actually first transform |*Q*〉 into a tensor network state and then use a combination of techniques to prepare this tensor network state. From *m* vertices of the graph *G*, we choose a subset of *n* qubits as the visible units and measure them in the computational basis {|0〉, |1〉}. The measurement outcomes sample from a probability distribution *Q*({*x*_{i}}) of *n* binary variables {*x*_{i}, *i* = 1, 2, ⋯ *n*} (the other *m* − *n* hidden qubits are just traced over to give the reduced density matrix). Given a graph *G* and the subset of visible vertices, the distribution *Q*({*x*_{i}}) defines our QGM, which is parameterized efficiently by the parameters in the matrices *M*_{i}. As shown in Fig. 1C, the pure entangled quantum state |*Q*〉 can be written as a special tensor network state. We define our model in this form for two reasons: First, the probability distribution *Q*({*x*_{i}}) needs to be general enough to include all the factor graphs; second, for the specific form of the state |*Q*〉 introduced in this paper, the parameters in this model can be conveniently trained by a quantum algorithm on any given dataset.
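The definition above can be checked numerically on a small example. The sketch below (our own illustration, not code from the paper) builds a three-qubit path graph state, applies generic invertible single-qubit matrices *M*_{i}, designates the middle qubit as hidden, and computes the distribution *Q*(*x*_{0}, *x*_{2}) over the two visible qubits by the Born rule with the hidden qubit traced out.

```python
import numpy as np

def graph_state(n, edges):
    """|G>: every qubit in |+> = (|0> + |1>)/sqrt(2), then CZ on each edge."""
    psi = np.ones(2 ** n) / np.sqrt(2 ** n)
    for idx in range(2 ** n):
        bits = [(idx >> (n - 1 - q)) & 1 for q in range(n)]
        for (a, b) in edges:
            if bits[a] and bits[b]:
                psi[idx] *= -1           # controlled phase flip
    return psi

def apply_matrix(psi, n, q, M):
    """Apply a (generally nonunitary) 2x2 matrix M on qubit q."""
    psi = np.moveaxis(psi.reshape([2] * n), q, 0)
    psi = np.tensordot(M, psi, axes=1)
    return np.moveaxis(psi, 0, q).reshape(-1)

rng = np.random.default_rng(0)
n = 3
psi = graph_state(n, [(0, 1), (1, 2)])   # path graph 0 - 1 - 2
for q in range(n):
    M = np.eye(2) + 0.5 * rng.normal(size=(2, 2))   # generically invertible
    psi = apply_matrix(psi, n, q, M)
psi /= np.linalg.norm(psi)               # the normalized state |Q>

# Qubits 0 and 2 are visible; hidden qubit 1 is traced over.
amps = psi.reshape(2, 2, 2)              # amps[x0, h, x2], real here
Q_dist = np.einsum('xhy,xhy->xy', amps, amps)
print(Q_dist)                            # Q(x0, x2), sums to 1
```

For nonunitary *M*_{i}, this brute-force simulation is only feasible for a handful of qubits, which is exactly why the learning algorithm in the text resorts to tensor networks and parent Hamiltonians for state preparation.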

### Representational power of our QGM

Representational power is a key property of a generative model: It sets the ultimate limit on how useful the model can be. The more probability distributions a generative model can efficiently represent, the wider the applications it can potentially have. The representational power is also closely related to the so-called generalization ability of a probabilistic model (see section S2). In this subsection, we prove that the QGM introduced above is exponentially more expressive than classical factor graphs in terms of representational power. This result is made precise by theorems 1 and 2. First, we show that any factor graph can be viewed as a special case of our QGM by the following theorem.

*Theorem 1*. The QGM defined above can efficiently represent probability distributions from any constant-degree factor graphs with arbitrary precision. Concretely, the number of parameters in the QGM can be bounded by *O*(2^{k}*s*), where *s* is the number of function nodes in the factor graph and *k* is the degree of the graph.

As probabilistic graphical models and generative neural networks can all be reduced to constant-degree factor graphs (*22*), the above theorem shows that our proposed QGM is general enough to include those probability distributions in widely used classical generative models. In section S3, we show how to reduce typical classical generative models to factor graphs with a bounded degree. Actually, by reducing the degree of the factor graph through equivalence transformation (see fig. S4), we can consider the factor graphs with degree *k* = 3 without loss of generality.

Proof of theorem 1: First, in any factor graph with its degree bounded by a constant *k*, each of the functions (denoted by square nodes in Fig. 1) has at most *k* variables; therefore, by the universal approximation theorem (*25*), it can be approximated arbitrarily well with a restricted Boltzmann machine with *k* visible variables and 2^{k} hidden variables. In section S4, we give an accurate description of the universal approximation theorem and an explanation of the restricted Boltzmann machine. As the degree *k* of the factor graph can be reduced to a very small number (such as *k* = 3), the number of hidden variables 2^{k} only represents a moderate constant cost, which does not increase with the system size. In a restricted Boltzmann machine, only the visible and the hidden variables are connected by the two-variable correlator that takes the generic form *f*(*x*_{1}, *x*_{2}) = *e*^{ax_{1}x_{2} + bx_{1} + cx_{2}}, where *x*_{1}, *x*_{2} denote the binary variables and *a*, *b*, *c* are real parameters. The representation precision using the universal approximation theorem does not affect the number of variables but only depends on the range of the parameters *a*, *b*, *c*; e.g., arbitrary precision can be achieved if *a*, *b*, *c* are allowed to vary over the whole real axis. As *Q*({*x*_{i}}) has a factorization structure similar to that of the factor graph after measuring the visible qubits *x*_{i} under a diagonal matrix *M*_{i} (see Fig. 2), it is sufficient to show that each correlator *f*(*x*_{1}, *x*_{2}) can be constructed in the QGM. This construction can be achieved by adding one hidden variable (qubit) *j* with the invertible matrix *M*_{j} between two visible variables *x*_{1} and *x*_{2}. As shown in Fig. 2, *Q*({*x*_{1}, *x*_{2}}) can be calculated by contracting two copies of a diagram from Fig. 1C as 〈*Q*|*x*_{1}, *x*_{2}〉〈*x*_{1}, *x*_{2}|*Q*〉.
To make this equal to an arbitrary correlation *f*(*x*_{1}, *x*_{2}) between the two variables *x*_{1} and *x*_{2}, we first take the matrix *M*_{j} = λ_{1}*M*_{+} + λ_{2}*M*_{−}, where *M*_{±} = (|0〉 ± |1〉)(〈0| ± 〈1|)/2, and the λ_{1} and λ_{2} terms account for positive and negative correlations, respectively. Here, positive (negative) correlation means that two variables tend to be the same (different). The convex hull of these positive and negative correlations generates any symmetric correlation of the form *e*^{ax_{1}x_{2} − ax_{1}/2 − ax_{2}/2} between the two variables *x*_{1} and *x*_{2}. Then, we take additional single-qubit matrices *D*_{1} and *D*_{2} to account for the remaining difference between −*ax*_{1}/2 − *ax*_{2}/2 and *bx*_{1} + *cx*_{2} in the correlation function *f*(*x*_{1}, *x*_{2}). Specifically, we take *D*_{1} and *D*_{2} to be diagonal with eigenvalues *d*_{1}(*x*_{1}) and *d*_{2}(*x*_{2}), respectively. As shown in Fig. 2 (C and D), the correlator between *x*_{1} and *x*_{2} in the QGM is then given by *d*_{1}(*x*_{1})*d*_{2}(*x*_{2})[λ_{1}(1 + (−1)^{x_{1}+x_{2}}) + λ_{2}(1 − (−1)^{x_{1}+x_{2}})]/4, and one simple solution with *d*_{1}(0) = *d*_{2}(0) = λ_{1}/2 = 1 and *d*_{1}(1) = *e*^{b + a/2}, *d*_{2}(1) = *e*^{c + a/2}, λ_{2} = 2*e*^{−a/2} makes it equal to the correlation function *f*(*x*_{1}, *x*_{2}) with arbitrary coefficients *a*, *b*, *c*. From the above proof, each function node introduces a number of parameters of the order *O*(2^{k}), independent of the representation precision. For a factor graph of *s* function nodes (note that *s* is upper bounded by *n*^{k}, where *n* is the number of variables), the total number of parameters is of the order *O*(2^{k}*s*). This completes the proof.
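The construction in the proof can be verified numerically. The snippet below encodes our reading of the formulas above: the hidden qubit contributes λ₁ on even parity of (*x*₁, *x*₂) and λ₂ on odd parity, and the diagonal weights *d*₁, *d*₂ take the stated solution. It checks that the resulting correlator reproduces *f*(*x*₁, *x*₂) = *e*^{ax₁x₂ + bx₁ + cx₂} for random coefficients.

```python
import numpy as np

def f(x1, x2, a, b, c):
    """Target two-variable correlator e^(a*x1*x2 + b*x1 + c*x2)."""
    return np.exp(a * x1 * x2 + b * x1 + c * x2)

def qgm_correlator(x1, x2, a, b, c):
    """Correlator from the hidden-qubit construction (our reading of the
    proof): lam1 is picked on even parity of (x1, x2), lam2 on odd parity,
    and D1, D2 contribute the diagonal weights d1, d2."""
    lam1 = 2.0
    lam2 = 2.0 * np.exp(-a / 2)
    d1 = (1.0, np.exp(b + a / 2))       # d1(0), d1(1)
    d2 = (1.0, np.exp(c + a / 2))       # d2(0), d2(1)
    lam = lam1 if (x1 + x2) % 2 == 0 else lam2
    return d1[x1] * d2[x2] * lam / 2

rng = np.random.default_rng(1)
a, b, c = rng.normal(size=3)            # arbitrary real coefficients
for x1 in (0, 1):
    for x2 in (0, 1):
        assert np.isclose(qgm_correlator(x1, x2, a, b, c), f(x1, x2, a, b, c))
```

All four assignments of (*x*₁, *x*₂) match the target correlator exactly, as the proof requires.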

Furthermore, we can show that the QGM is exponentially more expressive than factor graphs in representing some probability distributions. This is summarized by the following theorem.

*Theorem 2*. If the polynomial hierarchy in the computational complexity theory does not collapse, there exist probability distributions that can be efficiently represented by a QGM but cannot be efficiently represented even under approximation by conditional probabilities from any classical generative models that are reducible to factor graphs.

The rigorous proof of this theorem is lengthy and technical, and we thus include it in section S5. Here, we briefly summarize the major idea of the proof: In the QGM, we construct the underlying distribution by adding up the probability amplitudes (complex numbers), while in the factor graphs, we only add up probabilities (positive real numbers). The complexity of these two addition problems turns out to be different: Adding the positive probabilities up to get the normalization factor in factor graphs is a nondecreasing process, and its complexity is at a lower level of the polynomial hierarchy in the computational complexity theory according to Stockmeyer’s theorem (*26*). In contrast, the summation of complex numbers to get the normalization factor in the QGM involves marked oscillatory behavior, similar to the sign problem in the quantum Monte Carlo method (*27*), which puts its computational complexity at a much higher level. Because of this difference in the computational complexity, by adding up probability amplitudes in the QGM, we can generate a much wider class of distributions, which are hard to represent by factor graphs through adding up the positive probabilities. To represent some distributions generated by the QGM with factor graphs, the number of parameters in the factor graphs is required to scale up exponentially with the size of the QGM, and therefore, there is an exponential gap between the representational power of these two models if the polynomial hierarchy in the computational complexity theory does not collapse.

Theorems 1 and 2 show that the QGM is far more powerful than classical factor graphs in representing probability distributions. Representational power is important for the success of a machine learning model and is closely related to the generalization ability, which characterizes the ability of a model to make good predictions for new data by using as few training data as possible. Higher representational power usually implies better generalization ability in practice (*21*, *28*). Similar to the principle of Occam’s razor in physics, a good choice for a machine learning model is one with a minimum number of parameters that can still explain the observed training data well. The representational power captures this feature: It measures how wide a class of distributions can be represented using as few parameters as possible. Our QGM can efficiently represent some probability distributions that are out of the reach of classical factor graphs, as the latter may need an exponentially large number of parameters for the representation. This suggests that our proposed QGM is a good choice for a machine learning model in terms of representational power.

As any factor graph can be efficiently represented by our QGM, we should not expect that an arbitrary state |*Q*〉 can be prepared efficiently with a quantum computer: If we could prepare arbitrary |*Q*〉 efficiently, we could efficiently solve the inference problem for any factor graph, which is known to be NP-hard and thus unlikely to be fully solvable even with a quantum computer. The detailed proof of this statement is included in section S6. For applications in generative learning, we have the freedom to choose the topology and the parameter form of |*Q*〉 and only need to use a subset of states |*Q*〉 that can be efficiently prepared. Normally, we first construct the parent Hamiltonian for the state |*Q*〉 (see section S7) and then use the tensor network formalism for its preparation, inference, and learning, as explained in the next section.

### A quantum algorithm for inference and training of our QGM

For a generative model to be useful for machine learning, apart from its representational power and generalization ability, we also need an efficient algorithm for both training and inference. Training is the process of tuning the parameters in a model so that it represents a probability distribution as close as possible to that of the dataset. This usually involves minimization of a cost function, which quantifies how close these two distributions are. Once we have a trained model, we make inference to extract useful information for analysis or prediction. Most inference tasks can be reduced to computing the conditional probability ∑_{y}*p*(*x*, *y*|*z*) (*22*), where *x*, *y*, *z* denote different variable sets. For training, we choose to minimize the Kullback-Leibler (KL) divergence *D*(*q*_{d}||*p*_{θ}) = − ∑_{v}*q*_{d}(*v*)log(*p*_{θ}(*v*)/*q*_{d}(*v*)) between *q*_{d}, the distribution of the given data sample, and *p*_{θ}, the distribution of the generative model, with the whole parameter set denoted by θ. Typically, one minimizes *D*(*q*_{d}||*p*_{θ}) by optimizing the model parameters θ using the gradient descent method (*21*). The θ-dependent part *D*(θ) of *D*(*q*_{d}||*p*_{θ}) can be expressed as − ∑_{v∈data set} log *p*_{θ}(*v*)/*M*, where *M* denotes the total number of data points.
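As a minimal classical illustration of this training objective (not the quantum algorithm itself), the following sketch fits a softmax-parameterized distribution to a toy data distribution by gradient descent on the θ-dependent part of the KL divergence; for a softmax model, the gradient reduces to *p*_{θ} − *q*_{d}. All numerical choices here are illustrative.

```python
import numpy as np

# Fit a model p_theta (softmax over four outcomes) to a toy "data"
# distribution q_d by gradient descent on the theta-dependent part of the
# KL divergence, D(theta) = -sum_v q_d(v) log p_theta(v).
q_d = np.array([0.1, 0.2, 0.3, 0.4])     # data distribution
theta = np.zeros(4)                      # model parameters

def p_theta(theta):
    w = np.exp(theta)
    return w / w.sum()

for _ in range(2000):
    # For a softmax model, grad_theta D(theta) = p_theta - q_d.
    theta -= 0.1 * (p_theta(theta) - q_d)

print(np.round(p_theta(theta), 4))       # approaches q_d
```

In the quantum algorithm of this section, the expensive part of this loop, evaluating the gradient, is replaced by measurements on the prepared tensor network state.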

In our QGM, both the conditional probability ∑_{y}*p*(*x*, *y*|*z*) and the gradient of the KL divergence ∂_{θ}*D*(θ) can be conveniently calculated using the structure of the state |*Q*〉 defined in Fig. 1. We first define a tensor network state |*Q*(*z*)〉 ≡ (*I* ⊗ 〈*z*|)|*Q*〉 by projecting the variable set *z* to the computational basis. As shown in section S8, the conditional probability can be expressed as

∑_{y}*p*(*x*, *y*|*z*) = 〈*Q*(*z*)|(|*x*〉〈*x*| ⊗ *I*)|*Q*(*z*)〉/〈*Q*(*z*)|*Q*(*z*)〉 (2)

which is the expectation value of the operator *O* = |*x*〉〈*x*| under the normalized state |*Q*(*z*)〉. For inference problems, we typically only need to know the distribution *p*(*x*|*z*) = ∑_{y}*p*(*x*, *y*|*z*) in the label *x* to a constant precision. This can be conveniently achieved by measuring the operator *O* under the state |*Q*(*z*)〉. The measurement results correspond to an importance sampling and automatically give the dominant *x* labels for which the probability distribution *p*(*x*|*z*) has notable nonzero values. Similarly, we show in section S8 that ∂_{θ}*D*(θ) can be reduced to a combination of terms taking the same form as Eq. 2, with the operator *O* replaced by operators *O*_{1} or *O*_{2} constructed from the derivative ∂_{θ_{i}}*M*_{i}, the inverse *M*_{i}^{−1}, and the projector |*v*_{i}〉〈*v*_{i}|, together with the corresponding H.c. terms (their explicit forms are given in section S8), where θ_{i} denotes a specific parameter in the invertible matrix *M*_{i}, *v*_{i} is the qubit of data *v* corresponding to the variable *x*_{i}, and H.c. stands for the Hermitian conjugate term. The variable *z* in this case takes the value of *v* (or *v* excluding *v*_{i}) expressed in a binary string.
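The identity in Eq. 2 can be sanity-checked on a random toy state, with one qubit for each of the variable sets *x*, *y*, and *z* (an illustration of the identity, not a trained QGM):

```python
import numpy as np

# Random normalized 3-qubit "model" state with amplitudes Q[x, y, z].
rng = np.random.default_rng(2)
Q = rng.normal(size=(2, 2, 2)) + 1j * rng.normal(size=(2, 2, 2))
Q /= np.linalg.norm(Q)

z = 1
Qz = Q[:, :, z]                          # |Q(z)> = (I (x) <z|)|Q>, unnormalized

# Left-hand side: sum_y p(x, y | z) from the Born rule.
p_joint = np.abs(Q) ** 2                 # p(x, y, z)
p_x_given_z = p_joint[:, :, z].sum(axis=1) / p_joint[:, :, z].sum()

# Right-hand side: expectation of O = |x><x| (identity on y) under the
# normalized |Q(z)>.
for x in (0, 1):
    expectation = (np.abs(Qz[x, :]) ** 2).sum() / (np.abs(Qz) ** 2).sum()
    assert np.isclose(expectation, p_x_given_z[x])
```

On a quantum computer, the same expectation value is estimated by repeatedly measuring *O* on freshly prepared copies of |*Q*(*z*)〉.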

With the above simplification, training or inference in the QGM is reduced to preparation of the tensor network state |*Q*(*z*)〉. With an algorithm similar to the one in (*29*), we use recursive quantum phase estimation to prepare the state |*Q*(*z*)〉. For this purpose, we first construct the parent Hamiltonian *H*(*z*), which has a unique ground state given by |*Q*(*z*)〉 with zero eigenenergy in the generic case, as shown in (*30*). The procedure is illustrated in Fig. 3, where the variables *z* = {*z*_{i}} are grouped as in Fig. 3A. In this case, the corresponding local tensors in the tensor network state |*Q*(*z*)〉 are all easy to compute and of a constant degree. The parent Hamiltonian *H*(*z*) is constructed through contraction of these local tensors as shown in Fig. 3 (C and D) (*30*).
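The zero-energy property of a parent Hamiltonian can be illustrated on a generic three-qubit state: each local term projects onto the orthogonal complement of the support of a reduced density matrix on an overlapping region, so the target state is annihilated by every term. This is a minimal sketch of the general idea referenced above, not the full tensor network construction of (*30*).

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.normal(size=(2, 2, 2)) + 1j * rng.normal(size=(2, 2, 2))
T /= np.linalg.norm(T)                   # a generic 3-qubit state |Q>
psi = T.reshape(-1)

def kernel_projector(rho, tol=1e-10):
    """Projector onto the null space (complement of the support) of rho."""
    vals, vecs = np.linalg.eigh(rho)
    null = vecs[:, vals < tol]
    return null @ null.conj().T

# Local terms: one per overlapping two-qubit region of the 3-qubit chain.
rho01 = np.einsum('abz,cdz->abcd', T, T.conj()).reshape(4, 4)  # trace qubit 2
rho12 = np.einsum('zab,zcd->abcd', T, T.conj()).reshape(4, 4)  # trace qubit 0
H = np.kron(kernel_projector(rho01), np.eye(2)) \
  + np.kron(np.eye(2), kernel_projector(rho12))

# |Q> is a zero-energy eigenstate of the positive semidefinite parent H.
print(np.linalg.norm(H @ psi))           # numerically zero
```

Each region's reduced state has rank at most 2 in a 4-dimensional space, so every local term has a nontrivial kernel projector, yet the sum still annihilates |*Q*〉.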

By constructing the parent Hamiltonian for the tensor network state, the quantum algorithm for training and inference in the QGM is realized through the following steps:

Step 1: Construct a classical description of a series of tensor network states {|*Q*_{t}〉} with *t* = 0, 1,..., *n*, as shown in Fig. 3C, by reduction from |*Q*_{n}〉 = |*Q*(*z*)〉. The initial state |*Q*_{0}〉 is a product state |0〉^{⊗n}, and |*Q*_{t}〉 is constructed from |*Q*_{t −1}〉 by adding one more tensor in |*Q*(*z*)〉 that is not contained in |*Q*_{t −1}〉 and setting the uncontracted virtual indices as 0.

Step 2: Construct a classical description of the parent Hamiltonian *H*_{t} for each |*Q*_{t}〉 with the method illustrated in Fig. 3 (B to D) (*30*).

Step 3: On a quantum computer, starting from |*Q*_{0}〉, we prepare |*Q*_{1}〉,..., |*Q*_{n}〉 sequentially. Suppose that we have prepared |*Q*_{t −1}〉. The following substeps will prepare |*Q*_{t}〉 based on the recursive quantum phase estimation to measure the eigenstates of the parent Hamiltonian *H*_{t}.

Substep 1: We use the quantum phase estimation algorithm to measure whether the eigenenergy of the parent Hamiltonian *H*_{t} is zero (*31*), which implements a projective measurement to the corresponding eigenspaces {|*Q*_{t}〉〈*Q*_{t}|, *I* − |*Q*_{t}〉〈*Q*_{t}|} of *H*_{t} with zero and nonzero eigenenergies, respectively (see Fig. 3E). Similarly, we can implement the projective measurement {|*Q*_{t −1}〉〈*Q*_{t −1}|, *I* − |*Q*_{t −1}〉〈*Q*_{t −1}|} by the quantum phase estimation using the parent Hamiltonian *H*_{t −1}.

Substep 2: Starting from |*Q*_{t − 1}〉, we perform the projective measurement {|*Q*_{t}〉〈*Q*_{t}|, *I* − |*Q*_{t}〉〈*Q*_{t}|}. If the result is |*Q*_{t}〉, we succeed and skip the following substeps. Otherwise, we get the state |*Q*_{t}^{⊥}〉 ∝ (*I* − |*Q*_{t}〉〈*Q*_{t}|)|*Q*_{t − 1}〉, lying in the plane spanned by |*Q*_{t − 1}〉 and |*Q*_{t}〉.

Substep 3: We perform the projective measurement {|*Q*_{t − 1}〉〈*Q*_{t − 1}|, *I* − |*Q*_{t − 1}〉〈*Q*_{t − 1}|} on the state |*Q*_{t}^{⊥}〉. The result is either |*Q*_{t − 1}〉 or its orthogonal complement |*Q*_{t − 1}^{⊥}〉 within this plane.

Substep 4: We perform the projective measurement {|*Q*_{t}〉〈*Q*_{t}|, *I* − |*Q*_{t}〉〈*Q*_{t}|} again. We either succeed in getting |*Q*_{t}〉, with probability η_{t} = |〈*Q*_{t}|*Q*_{t − 1}〉|^{2} when starting from |*Q*_{t − 1}〉, or obtain |*Q*_{t}^{⊥}〉 again. In the latter case, we go back to substep 3 and continue until success.

Step 4: After successful preparation of the state |*Q*(*z*)〉, we measure the operator *O* (for inference) or *O*_{1}, *O*_{2} (for training), and the expectation value of the measurement gives the required conditional probability (for inference) or the gradient of the KL divergence (for training).
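The measurement loop of substeps 2 to 4 can be simulated classically, because the dynamics stays in the two-dimensional plane spanned by |*Q*_{t − 1}〉 and |*Q*_{t}〉 and is fully determined by η = |〈*Q*_{t}|*Q*_{t − 1}〉|². A Monte Carlo sketch (our own illustration of the iteration, not the quantum implementation):

```python
import numpy as np

def measurements_until_success(eta, rng, cap=100_000):
    """Count projective measurements needed to reach |Q_t> from |Q_{t-1}>."""
    state = "Q_prev"                     # start from |Q_{t-1}>
    for k in range(1, cap + 1):
        if k % 2 == 1:
            # Measure {|Q_t><Q_t|, I - |Q_t><Q_t|}.
            p_hit = eta if state == "Q_prev" else 1.0 - eta
            if rng.random() < p_hit:
                return k                 # obtained |Q_t>: success
            state = "Qt_perp"            # in-plane state orthogonal to |Q_t>
        else:
            # Measure {|Q_{t-1}><Q_{t-1}|, I - |Q_{t-1}><Q_{t-1}|}.
            state = "Q_prev" if rng.random() < 1.0 - eta else "Q_prev_perp"
    return cap

rng = np.random.default_rng(4)
eta = 0.3
counts = [measurements_until_success(eta, rng) for _ in range(2000)]
# Each full round of substeps 3-4 succeeds with probability 2*eta*(1 - eta),
# so the mean number of measurements is of order 1/eta.
print(np.mean(counts))
```

The simulation confirms that failures do not accumulate: every failed measurement returns the state to the same plane, so repetition drives the process to |*Q*_{t}〉.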

In the following, we analyze the runtime of this algorithm for getting the correct answer. Step 1 computes *n* tensor network states with runtime *O*(*n*) because we only add one tensor with a constant number of variables in constructing each of the *n* states |*Q*_{t}〉 with *t* from 1 to *n*. Step 2 computes *n* Hamiltonians with runtime *O*(*n*) (with a constant prefactor depending only on *c*), where the constant *c* denotes the number of nodes involved in constructing a local term of the parent Hamiltonians with the method shown in Fig. 3D [thus, there are *O*(*n*) local terms]; *H*_{t} constructed by this method generically has the unique ground state |*Q*_{t}〉 (*30*), and the larger *c* is, the more likely there is no ground-state degeneracy.

A quantum computer is required for implementation of step 3. When this step gives the correct answer, the correctness of this algorithm is guaranteed. Let Δ_{t} denote the energy gap of the parent Hamiltonian *H*_{t}. Because we require the precision of the quantum phase estimation to be Δ to distinguish the ground state from the excited states of *H*_{t}, the quantum phase estimation algorithm in substep 1 has runtime complexity Õ(*n*/Δ) for the QGM defined on a constant-degree graph (*32*–*34*), where Δ = min_{t}Δ_{t} and Õ suppresses slowly growing factors such as (*n*/Δ)^{o(1)} (*6*, *30*). The key idea of step 3 is that the subspace spanned by {|*Q*_{t}〉, |*Q*_{t − 1}〉}, as shown in Fig. 3E, is an invariant subspace of all the projective operators involved. It can be seen from the evolution tree of step 3 (shown in Fig. 3F) that the construction of |*Q*_{t}〉 from |*Q*_{t − 1}〉 terminates within 2*k* + 1 iterations with probability

*P*(2*k* + 1) = 1 − (1 − η_{t})[1 − 2η_{t}(1 − η_{t})]^{k} (3)

This implies that within *s* steps of iteration, the probability of failure to get |*Q*_{t}〉 is smaller than *p*_{fail}(*s*) < 1/(2*s*η*e*) for all *t*, where *e* is the base of the natural logarithm and η = min_{t}η_{t} (*30*). Denote the probability of failure to get |*Q*_{n}〉 as ε; then, we require (1 − *p*_{fail})^{n} ≥ 1 − ε, so *s* > *n*/(2η*e*ε) is sufficient. Thus, the runtime of preparing |*Q*_{t}〉 in step 3 is Õ(*n*^{2}/(Δηε)). Iterating from 1 to *n*, we can get |*Q*(*z*)〉 with a probability of at least 1 − ε. Measuring the operator *O* over |*Q*(*z*)〉 1/δ times, the result will be an approximation of the conditional probability to an additive error δ as shown in Eq. 2.

Therefore, the total runtime of approximating the conditional probability is

*T*_{total} = Õ(*n*^{3}/(Δηεδ)) (4)

The gap Δ_{t} and the overlap η_{t} depend on the topology of the graph *G*, the variable set *z*, and the parameters of the matrices *M*_{i}. If both 1/Δ_{t} and 1/η_{t} are bounded by poly (*n*) for all the steps *t* from 1 to *n*, this quantum algorithm will be efficient, with its runtime bounded by poly (*n*).

Gradient of the KL divergence for each parameter is constituted by several simple terms (see section S8), where each term is similar to the expression of the conditional probability except that the operator *O* is replaced by *O*_{1} or *O*_{2}. The number of terms is proportional to *M*, which is the number of training data for a full gradient descent method, or the batch size using the stochastic gradient descent method, which usually requires only *O*(1) data for calculation of the gradient (*22*).

### Quantum speedup

Although we do not expect the above algorithm to be efficient in the worst case [even for simplified classical generative models such as the Bayesian network, the worst-case complexity of inference is NP-hard (*35*)], we know that the QGM with the above heuristic algorithm will provide exponential speedup over classical generative models for some instances. In section S9, we give a rigorous proof that our inference and learning algorithm has an exponential speedup over any classical algorithm for some instances under a well-accepted conjecture in the computational complexity theory. We arrive at the following theorem.

*Theorem 3*. There exist instances for computing the conditional probabilities or the gradients of the KL divergence in our QGM to an additive error 1/ poly (*n*) such that (i) our algorithm can achieve those calculations in polynomial time and (ii) no classical algorithm can accomplish them in polynomial time if universal quantum computing cannot be efficiently simulated by a classical computer.

The proof of this theorem is lengthy and can be found in the Supplementary Materials. Here, we briefly summarize the major idea. We construct a specific |*Q*(*z*)〉 that corresponds to the tensor network representation of the history state for universal quantum circuits rearranged into a 2D spatial layout. The history state is a powerful tool in quantum complexity theory (*36*). It is a superposition state of the entire computing history of an arbitrary universal quantum circuit

|Ψ_{history}〉 = (1/√(*T* + 1)) ∑_{t=0}^{T} |*t*〉 ⊗ *V*_{t}*V*_{t − 1}⋯*V*_{1}|0〉^{⊗m} (5)

where *T* is the number of total gates in a quantum circuit, which is assumed to be a polynomial function of the total qubit number *m*; |*t*〉 is a quantum state encoding the step counter *t*; |0〉^{⊗m} is the input state of the circuit with *m* qubits; and *V*_{t} is the *t*th gate. A similar type of history state has been used before to prove the QMA (quantum Merlin-Arthur) hardness for spin systems in a 2D lattice (*37*). For this specific |*Q*(*z*)〉, we prove that both the gap Δ_{t} and the overlap η_{t} scale as 1/ poly (*n*) for all the steps *t*, by calculating the parent Hamiltonian of |*Q*_{t}〉 with proper grouping of the local tensors. Our heuristic algorithm for training and inference therefore can be accomplished in polynomial time if all the conditional probabilities that we need can be generated from those history states. On the other hand, our specific state |*Q*(*z*)〉 encodes universal quantum computation through the representation of the history state, so it should not only include a large set of useful distributions but also be impossible to reproduce by any classical algorithm in polynomial time if universal quantum computation cannot be efficiently simulated by a classical computer.
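Equation 5 can be constructed explicitly for a tiny circuit. The sketch below (an illustration, not part of the proof) builds the history state of a two-qubit circuit (*m* = 2) with *T* = 2 gates, a Hadamard followed by a CNOT, and checks that conditioning the clock register on *t* = *T* recovers the circuit output, here a Bell state.

```python
import numpy as np

# Gates of a toy 2-qubit circuit: Hadamard on qubit 0, then CNOT(0 -> 1).
Hg = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])
gates = [np.kron(Hg, np.eye(2)), CNOT]

m, T = 2, len(gates)
psi0 = np.zeros(2 ** m)
psi0[0] = 1.0                            # input state |0...0>

# |Psi> = (1/sqrt(T+1)) sum_t |t> (x) V_t ... V_1 |0>^m ; the clock index t
# is the outer (most significant) index of the concatenated vector.
partial = [psi0]
for V in gates:
    partial.append(V @ partial[-1])      # computing history up to gate t
history = np.concatenate(partial) / np.sqrt(T + 1)

# Conditioning the clock on t = T recovers the circuit output (Bell state).
out = history[T * 2 ** m:(T + 1) * 2 ** m] * np.sqrt(T + 1)
print(out)                               # (|00> + |11>)/sqrt(2)
```

In the proof, this superposition over the whole computing history is what the state |*Q*(*z*)〉 encodes in tensor network form.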

Here, we focus on the complexity-theoretic proofs that support the quantum advantages of our proposed QGM in terms of the representational power and the runtimes for learning and inference. Before ending the paper, we briefly discuss the possibility of applying the model to practical problems in machine learning. We mention two examples in section S10 and show a schematic diagram there of how to embed our QGM into a typical machine learning pipeline. The first example is classification of handwritten digits, a typical supervised learning task. The second example is completion of incomplete pictures, a typical unsupervised learning task. The diagrams in section S10 illustrate the basic steps for these two examples, which require a combination of classical and quantum computing. A numerical test of these two examples requires efficient classical simulation of the quantum computing part, which puts restrictions on the topology of the underlying graph states and requires substantial numerical optimization; this is an ongoing project. The efficiency and runtimes for real-world problems need to be tested with a quantum computer or with classical simulation of sufficiently large quantum circuits, which, of course, is not an easy task and remains an interesting question for the future.

### Summary

In summary, we have introduced a QGM for machine learning and proven that it offers potential exponential improvement in the representational power over widely used classical generative models. We have also proposed a heuristic quantum algorithm for training and making inference on our model, and proven that this quantum algorithm offers exponential speedup over classical algorithms at least for some instances if quantum computing cannot be efficiently simulated by a classical computer. Our result combines the tools of different areas and generates an intriguing link between quantum many-body physics, quantum computational complexity theory, and the machine learning frontier. This result opens a new route to apply the power of quantum computation to solving the challenging problems in machine learning and artificial intelligence, which, apart from its fundamental interest, may have wide applications in the future.

## SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/4/12/eaat9004/DC1

Section S1. A brief introduction to generative models

Section S2. Representational power and generalization ability

Section S3. Reduction of typical generative models to factor graphs

Section S4. Universal approximation theorem

Section S5. Proof of theorem 2

Section S6. NP hardness for preparation of an arbitrary QGM state |*Q*〉

Section S7. Parent Hamiltonian of the state |*Q*〉

Section S8. Training and inference in the QGM

Section S9. Proof of theorem 3

Section S10. Applying the QGM to practical examples

Fig. S1. Parameter space of factor graph and QGM.

Fig. S2. Probabilistic graphical models.

Fig. S3. Energy-based neural networks.

Fig. S4. Simulating graphs of unbounded degrees with graphs of constantly bounded degrees.

Fig. S5. Illustration of the universal approximation theorem by a restricted Boltzmann machine.

Fig. S6. #P-hardness for the QGM.

Fig. S7. Contraction between two local tensors using the structure of QGM state and conditioned variables.

Fig. S8. Construction of the history state.

Fig. S9. Flow chart for a machine learning process using the QGM.

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is **not** for commercial advantage and provided the original work is properly cited.

## REFERENCES AND NOTES

**Acknowledgments:**We thank D. Deng, J. Liu, T. Liu, M. Lukin, S. Lu, L. Wang, S. Wang, and Z. Wei for helpful discussions.

**Funding:**This work was supported by the Ministry of Education and the National Key Research and Development Program of China (2016YFA0301902). L.-M.D. and Z.-Y.Z. also acknowledge support from the MURI program.

**Author contributions:**X.G. and Z.-Y.Z. carried out the work under L.-M.D.’s supervision. All the authors made substantial contributions to this work.

**Competing interests:**The authors declare that they have no competing interests.

**Data and materials availability:**All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors. Correspondence and requests for materials should be addressed to L.-M.D. (lmduan{at}tsinghua.edu.cn) or X.G. (gaoxungx{at}gmail.com).

- Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).