Quantum mechanical MRI simulations: Solving the matrix dimension problem

The direct product between classical and quantum degrees of freedom may be left unopened.


INTRODUCTION
There are two well-researched limits in the numerical simulation of magnetic resonance imaging: complicated spatial dynamics of simple spin systems and simple spatial dynamics of complicated spin systems. The former is diffusion and flow magnetic resonance imaging (MRI) of mostly water (1,2), and the latter is spatially encoded nuclear magnetic resonance (NMR) experiments (3,4) and localized NMR spectroscopy (5). Both cases are well covered by the existing simulation software. Both are also computationally straightforward because matrix dimensions are manageable: autðℝ 3 Þ⊗suð2Þ and ℝ 3 ⊗suð2 N Þ are both tractable, either directly or with reasonable approximations (6,7) for the matrix representations of suð2 N Þ, where N is the number of spins.
The simulation problem becomes numerically intractable when complicated spatial dynamics (diffusion, convection, and flow in three dimensions) is combined with complicated spin dynamics (spin-spin coupling, cross-relaxation, chemical kinetics, etc.). A well-digitized three-dimensional (3D) phantom would have at least a hundred pixels in each of the three directions, meaning a dimension of at least 100 3 = 10 6 for the spatial dynamics generator matrices. At the same time, a typical metabolite (e.g., glucose) contains upwards of 10 coupled spins, meaning a Liouville space dimension of at least 4 10 ≈ 10 6 . Direct products of spin and spatial dynamics generators would then have the dimension in excess of 10 12 even before chemical kinetics is considered-clearly an impossible proposition even if sparse matrix arithmetic is used.
Numerical simulation of such systems used to be impossible in particular because the available software (20)(21)(22)(23) always insisted on opening the Kronecker products in the general structure of the problem ½space dynamics⊗½reaction kinetics⊗½spin dynamics ð1Þ when running the simulation. If all three matrices are large, then the resulting object has intractable dimensions. However, our long and careful look at the entire structure of the magnetic resonance simulation problem (24) produced an interesting observation: The only algebraic operation essentially required in any Liouville space magnetic resonance simulation is a matrix-vector product, where the matrix is a sum of objects with the structure shown in Eq. 1. Even time propagation, which technically involves a matrix exponential, can be reformulated using only matrix-vector operations (25), for example where r is the state vector, L is the Liouvillian superoperator, and t is time. The second important observation then comes from linear algebra-the action by a direct product on a vector may be computed without opening the direct product, for example where A and B are matrices, v is a vector, V is obtained by reshaping v into appropriate dimensions for the product on the right-hand side, and vec stands for the column-wise stretch of a matrix back into a vector. The right-hand side of Eq. 3 is massively cheaper to compute. The entire class of such relations is well researched (26); they extend to multiple and nested direct products, as well as their sums. We postulate here that direct products are best left unopened in magnetic resonance simulations, particularly in MRI, where the most problematic product is between spatial and spin degrees of freedom. This hypothesis is explored in this work: We report the design and implementation of a quantum mechanical MRI and spatiotemporal 1 School of of Medicine, University of Al-Ameed, Karbala, PO No: 198, Iraq. 2 School of Chemistry, University of Southampton, Highfield, Southampton SO17 1BJ, UK. *Corresponding author. Email: i.kuprov@soton.ac.uk NMR simulation module in Spinach (24) that uses polyadics (the formal name for a sum of direct products) without opening them up. It is demonstrated below that previously unthinkable simulations now run in minutes.

GENERAL SIMULATION FRAMEWORK
The MRI simulation problem contains three principal factors: (i) spatial distributions and spatial dynamics, including field maps of various coils, diffusion, and flow; (ii) chemical kinetics; and (iii) spin dynamics and relaxation. These factors are in a direct product relationshipeach voxel may have different concentrations and transport velocities, each chemical species may have different spin dynamics, and spin dynamics may in turn be different in each voxel owing to, for example, a magnetic field gradient or a different viscosity that affects relaxation. Mathematically speaking, the structure of the problem is captured by Eq. 1 with the equation of motion consequently having the following algebraic form d dt where r(t) is the state vector, a nmk (t) are interaction amplitudes, M n are spatial operators, K m are chemical kinetics operators (themselves possibly dependent on r if the kinetics is nonlinear), and S k are (possibly dissipative) spin dynamics operators. For an infinitesimal time increment dt, the solution is The problem of simulating each of the three compartments of Eq. 1 individually is comprehensively solved and extensively studied, from theoretical foundations (27)(28)(29) all the way to numerical implementations (20-23, 30, 31). However, the composite problem runs into the matrix dimension issue described in the Introduction. The solutions offered here revolve around compressing the dimensions of the three subproblems to the maximum possible extent and never opening the Kronecker products in Eq. 4. The insight that makes efficient simulation possible is empirical: The polyadic object appearing in Eq. 4 is always low rank. In other words, the sum is short and contains many unit matrices.

Kronecker-times-vector operation
Consider the general case of a matrix-vector product where the matrix is a Kronecker product of smaller matrices, some of which may be unit matrices Each element of the object in square brackets is a product of the corresponding elements of A (k) but calculating and storing the left-hand side is undesirable because the number of elements there is astronomical. It is more efficient to take the products on the right-hand side as they become necessary. In such a scenario, the multiplication count is unchanged, but the memory requirements would not exceed the resources already deployed in storing A (k) , all of which have reasonable dimensions. The procedure for computing the product [⋯]x must then involve matching the linear index of the elements of x with the multi-index of a ð1Þ i1j1 a ð2Þ i2j2 ⋯a ðNÞ iN jN . This is unexpectedly straightforward-x is reshaped into a multidimensional array whose dimensions match the row dimension of A (k) , each dimension is multiplied by the corresponding A (k) , and the result stretched back into a vector whose dimension is now the product of column dimensions of A (k) . This multiplication procedure, described by Fernandes et al. (26) and first implemented for Matlab by D. Gleich, proceeds as follows: 3) Loop index n over the dimensions of X.
Permute the dimensions of X to make its nth dimension leftmost. Reshape X into a matrix of column dimension c n and row dimension ∏ m≠n c m . Perform reassignment X = A (N−n+1) ⋅ X and replace c n by the column dimension of A (N−n+1) . Reshape X back into an N-dimensional array X with dimensions c k . Put the dimensions of X back into the original order. 4) Reshape X back into a vector and return it as y. The extension of this procedure to a situation when x can have multiple columns is implemented in the kronm function supplied with Spinach versions 2.4 and later. Minor logistical optimizations are implemented: Multiplication by unit matrices is skipped, and memory-intensive dimension reordering operations are avoided for the first and the last instance of the loop.

Polyadic object in Spinach
A considerable amount of software engineering is required before the method described in the previous section becomes useful for solving Eq. 4. The first substantial hurdle is addition: Eq. 4 is a sum of direct products; it cannot be written as a single direct product. However, because matrix-vector multiplication is distributive over addition, the algorithm reported in the previous section is easily extended to linear combinations of krons In practice, this is implemented by buffering addition: When the user adds two Kronecker products, their sum is not evaluated-the terms are simply stored in a pointer array When the time comes to multiply this object into a vector, each term in the sum is multiplied into that vector individually, and the results are added up. We rely here on the fact that MRI evolution generators in Eq. 4 are short sums of Kronecker products: The number of terms is much smaller than the dimension of the matrix they represent. This also offers parallelization opportunities. The second substantial hurdle is the norm-numerical implementations of Eq. 2 in finite precision arithmetic require the knowledge of the matrix norm. However, the immediate definition of a matrix norm requires either the largest singular value (2-norm), or sums across rows (infinity-norm) and columns (1-norm), or element-wise products (Frobenius norm). These are expensive operations. Thankfully, the Taylor series in Eq. 2 has infinite convergence radius, and therefore, only an estimate of the norm is actually required. The cheapest estimate we could find was published for 1-norm by Hager (32). It requires a few matrixvector products, usually fewer than the subsequent evaluation of Eq. 2.
A less problematic but important observation is that the algorithm described in the previous section, particularly at the dimension reordering stages, has complicated memory access patterns. Nonsequential memory access can be expensive on modern central processing unit (CPU) architectures. The algorithm described in the previous section may therefore be viewed as a different trade-off between capacity and latency requirements on the random-access memory (RAM). However, the savings in the memory footprint are always large.
The last hurdle is that a polyadic object is sometimes pre-multiplied or post-multiplied by a small number of other objects. Because the only operation the entire object needs to deliver is the matrix-vector product, the best strategy is again to buffer the terms and apply them to the incoming vector before and after the polyadic is applied. The object structure is then extended as follows The sequence is simply replayed from right to left every time a product into a vector is needed. Once the elements of this object are themselves allowed to be polyadic, the object can buffer an arbitrary sequence of additions, multiplications, and Kronecker products. This representation is only computationally efficient when the number of terms is much smaller than the dimension of the matrix it represents, but this is always the case for MRI evolution generators.
More sophisticated representations for tensor-structured objects have existed for some time (33,34). We do have a tensor train object in Spinach (35), but it is certainly not a simple machine-precision parameter-free black box of the same kind as polyadics. Because the polyadic decomposition in Eq. 4 is known a priori, and is nearly always already canonical, the benefits of going to advanced tensorstructured formats such as tensor trains (33) are not worth having to deal with their harrowing logistics.
A related design decision was to compress evolution generators into the polyadic format but to leave the state vector uncompressed. This is intentional and a result of much trial and failure-initially, we attempted to design a representation where both the generator and the state vector are compressed into a tensor structure (33, 34) of some kind. However, none of the available tensor-structured formats were able to maintain sufficient accuracy in the state vector during time evolution simultaneously with sufficient extent of compression to justify their use. State vector compression is highly successful for iterative solvers returning stationary states of various kinds (36) but apparently not for long-range time evolution. In all magnetic resonance cases, we can reasonably foresee that the memory of modern computers is sufficient to leave the state vector uncompressed, and that is our recommendation.

Synthetic benchmarks
Direct product component matrices for synthetic benchmarks were filled with normally distributed complex random numbers. The dimension of each matrix was chosen randomly between 1 and 64, and a random complex vector was generated to match the Kronecker product dimension. For sparse matrices, an optimistic nonzero density estimate was used: five complex nonzeroes per column. The matrix-vector multiplication operation was timed for 100 instances of this setup; the resulting statistics is presented in Table 1. The source code for this benchmark is available in versions 2.4 and later of the Spinach library (24).
It is clear that 3D MRI experiments with quantum mechanical description of spin cannot be simulated on a system with 256 GB of RAM-this confirms the estimates given in the Introduction. However, multiplications using the polyadic object are all in milliseconds; they would remain realistic even if hundreds of terms are present in the sum in Eq. 4. There is no wall clock time advantage in these synthetic benchmarks, but the memory problem is solved. From the computational feasibility point of view, the polyadic object puts quantum mechanical 3D MRI simulations within reach.
Matrix representations: Spatial dynamics generators Given a concentration profile c(r, t), the diffusion flux is given by Fick's first law (37), and the hydrodynamic flux is the product of concentration and flow velocity. The total flux therefore is jðr; tÞ ¼ vðr; tÞcðr; tÞ À Dðr; tÞ∇cðr; tÞ ð 11Þ where ∇ = [∂/∂x ∂/∂y ∂/∂z] T is the gradient operator, v(r, t) is the flow velocity field, and D(r, t) is the translational diffusion tensor field. We take both fields as given-spin processes in MRI are too weak to influence either diffusion or hydrodynamics. Any established solver (38,39) may therefore be used to obtain them before one starts the simulations covered here. Conservation of matter requires the local decrease in concentration to be equal to the divergence of its flux, and therefore ∂ ∂t cðr; tÞ ¼ Àdiv½jðr; tÞ This is an instance of the Fokker-Planck equation with the probability density interpreted as concentration. A variety of other spatial dynamics models, covered in the literature dealing with the Fokker-Planck equation, may be used instead-rotational diffusion is a good example (40). At the matrix representation level, finite difference matrices (41) suffice for the gradient operator; the question of numerical accuracy is explored in detail in the "Numerical accuracy conditions" section. The salient point here is that all differential operators appearing in Eq. 12 are direct products. For the gradient operator acting on the vectorized array of concentrations in every voxel where 〚∂/∂x〛 denotes a matrix representation of ∂/∂x on a finite grid, and 1 {X, Y, Z} are unit matrices of appropriate dimensions. If the diffusion tensor is constant and isotropic, then the generator has three terms where D is the diffusion coefficient. The dimension of each matrix in Eq. 14 is equal to the number of grid points along the corresponding dimension, typically of the order of 100. If the diffusion is anisotropic, then the number of direct product terms in the sum rises to nine. Even if the entire D(r, t) array depends so strongly on r that it must be stored explicitly on the finite grid, the direct product structure of Eq. 13 still only acts on one dimension at a time. The same applies to the flow generator. Assuming hundreds of grid points in each of the three spatial directions, the biggest matrix dimension one would encounter in the polyadic form of Eq. 13, and consequently Eq. 12, is in the hundreds. The worst-case memory requirements would come from the velocity vector array and the diffusion tensor array: (hundreds) 3 × (9 elements) × (64 bits per element) = about a gigabyte. This is well known to numerical hydrodynamics practitioners-the diffusion and flow problem in 3D is easily tractable. Finite difference derivative matrices are sparse (41).
where r(r, t) is the spin density matrix at the spatial point r at time t, H(r, t) is the spin Hamiltonian commutation superoperator such that Hr = hr − rh, and R(r) is the relaxation superoperator. The spin Hamiltonian contains chemical shift terms, pulsed field gradients, radiofrequency (RF) pulse terms, J couplings, and a great variety of other complications that are discussed in magnetic resonance textbooks (28) where B 0 is the primary magnet field (assumed to be directed along the z axis), g is the primary magnet field gradient vector, d n are nuclear chemical shifts, g n are nuclear magnetogyric ratios, fS ðnÞ X ; S ðnÞ Y ; S ðnÞ Z g are nuclear spin operators, J nk are internuclear scalar couplings (traditionally published in hertz, hence the 2p in front), B 1{X, Y, Z} are the Cartesian components of the RF magnetic field, etc.-Eq. 16 can be rather long, but for our purposes, H(r, t) is simply a well-understood matrix that standard software packages can return on demand. The same applies to the relaxation superoperator, which this communication is not a place to discuss.
An important point from the computational efficiency point of view is that the dimension of the Hilbert space spanned by a realistic spin system trajectory is usually much smaller than the dimension of the full Hilbert space of the spin system (6,7,42,43). A nonconstructive proof may be obtained from the fact that the dimension of the space spanned by the simulation trajectory is equal to the number of linearly independent vectors in it, which is smaller than or equal to the number of discrete time points in the trajectory (42). At a more physically motivated level: 1) Some spin states s are not reachable from the initial condition under a given Hamiltonian and a given relaxation superoperator All states that cannot occur in the system trajectory may be dropped from the basis (42). 2) Some states are never populated to a significant extent because relaxation drains most of the amplitude en route. Any states that are reachable, but whose population stays below a set accuracy threshold, may also be dropped from the basis (43). 3) Some states are not reachable from the detection state. Any dynamics involving those states will never influence the observed parameter.
For the purposes of simulating the dynamics of that parameter, such states may be dropped from the basis (44). These observations are particularly relevant in liquid-state magnetic resonance, where the dimension of the space spanned by the system trajectory was recently shown to be orders of magnitude smaller than the dimension of the full Hilbert space (45).
Matrix representations of spin operators may be built directly in the reduced basis (45). The procedure makes use of the fact that the Lie algebra of operators acting on a multispin system is spanned by direct products of basis operators of the Lie algebras of individual spins where N is the number of spins and s k is the quantum number of each spin. The only consistent way to remove a specific generator O m from a Lie algebra while preserving the rest of its structure is to modify the corresponding structure coefficients by zeroing the coefficients c ijm that lead into the "ignored" physical states. The resulting object is still a Lie algebra that now generates the time dynamics of the approximate physical model under the exponential map. The structure coefficients of the direct product algebra are related to the structure coefficients of the single-spin algebras in a computationally friendly way because where S kn is the kth generator of the Lie algebra of the nth spin-usually a 2 × 2 matrix. The cost of computing individual structure coefficients of the full spin system algebra is thus linear with respect to the total number of spins. Once the structure coefficients are available, matrix representations may be built in any basis-see (45) for further information. When the restricted basis scales polynomially with the size of the spin system [true in liquid state (43)], this yields an overall polynomially scaling simulation algorithm. For our purposes here, all relevant methods are published (6,(42)(43)(44)(45)(46)(47) and implemented in Spinach (24). The program simply provides the matrices on demand.
Matrix representations: Spin-space coupling All interaction operators, initial states, and detection states that couple spin and spatial degrees of freedom are also short polyadic sums. The initial spin state need not be the same across the sample, but the number of initial spin states that realistically occur in magnetic resonance simulations is small, meaning that the corresponding polyadic sum is short. In this equation, the sum runs over all initial spin states r k that occur in the sample, and F ðrÞ k are the corresponding "phantoms"-3D arrays that give the amplitude of r k at each point of the sample, including variations that pertain to concentrations of individual substances. The same argument applies to the detection state, where the role of the phantoms is played by the coil receptivity profiles On the operator side, location dependence is important in the relaxation superoperator because it is a source of contrast in MRI. Another polyadic decomposition makes an appearance, this time with a sum over different relaxation mechanisms where R k are the relaxation superoperators responsible for the individual relaxation mechanisms and F ðRÞ k are the corresponding "phantoms"-cubes of data describing the amplitude of each relaxation mechanism at each voxel in the sample. It is an experimental observation that this sum is short.
The coherent spin Hamiltonian need not be the same in every voxel either. The matrix representation of the pulsed field gradient part of the Hamiltonian is again a short polyadic sum where {X, Y, Z} are matrices containing grid point coordinates on the diagonal, and {1 X , 1 Y , 1 Z } are unit matrices of appropriate dimensions. The expression in the square brackets is easily extended, by adding terms such as X 2 ⊗ Y ⊗ 1 Z , to account for the non-uniformity of the gradients produced by realistic gradient coils. The RF Hamiltonian depends on spatial coordinates because the magnetic field produced by the RF coils is not uniform either. It has exactly the same structure as Eq. 23, only with time-dependent amplitude coefficients in front All matrices mentioned in Eqs. 22 to 26 can have their own direct product structure-nested polyadics are supported by the object described in the "Polyadic object in Spinach" section, where individual arrays may be polyadics themselves. For reasons of brevity, we did not mention less common spin-space couplings here, but it stands to reason that all of them are short sums of Kronecker products. Generalizations may be made into solid-state NMR, EPR (electron paramagnetic resonance), and other types of magnetic resonance spectroscopy. The algebra here is actually obvious and well known to specialists in this field-the challenge is rather in making full use of this direct product structure. It is also in the software engineering, where the problem of implementing all of the above in a user-friendly, flexible, and general way is formidable.

Matrix representations: Chemical kinetics
where H c (t) is a diagonal concatenation of the Hamiltonian commutation superoperators, R c is a similar concatenation of thermalized relaxation superoperators, and K c (t) is a pseudolinear superoperator that depends on all concentrations and reaction rates but still has a tensor structure-it is a sum of Kronecker products between chemical reaction operators and spin projection operators, all Kroneckered with whatever other degrees of freedom there may be. The general case is notable for its Daedalean notation (48), but the principle is easy to illustrate using an exchange process between two conformations A and B d dt where the block structure is directly visible, and the savings obtained by respecting it are obvious. Because chemical exchange between nodes of a grid converges to diffusion as the grid becomes finer, the same mathematics also covers stochastic Liouville equation formulation (50) of relaxation theory. Cyclic networks of unidirectional processes converge to periodic motions such as magic angle spinning (51).

NUMERICAL ACCURACY CONDITIONS
Diffusion and flow operators on finite grids have two popular matrix representations: finite difference matrices (41) and spectral differentiation matrices (52). The accuracy of the former depends on the grid spacing and the choice of the finite difference coefficients; both methods require all spatial frequencies to be below the Nyquist frequency of the grid.
On the Nyquist condition side, for the expected cases of laminar flow and diffusion, the dominant source of high frequencies along spatial coordinates are pulsed field gradients. An accurate simulation is the one that guarantees correct treatment of diffusive and hydrodynamic transport for the tightest phase spiral that the pulse sequence can generate in any element of the spin density matrix. Because this includes unobservable and observable coherences, the upper bound on the highest spatial frequency W max must assume that each gradient, even if intended to refocus some spin states, would continue to defocus some other states. Therefore, in each spatial direction where L is the length of the sample in the chosen dimension, pulsed field gradients are assumed to be zero in the middle, B 0 is the magnet induction, ‖H Z ‖ 2 is the 2-norm (largest absolute eigenvalue) of the Zeeman Hamiltonian, g(t) is the gradient amplitude in the chosen dimension as a function of time, and the integral is taken over the duration of the experiment. The grid point spacing h is limited by the Nyquist condition (53) to have more than two grid points per period of W max For sequences that make extensive use of gradients, the bound in Eq. 29 may be overly cautious. A simple practical test is to perform spatial Fourier transforms after each gradient event in the pulse sequence and to inspect the edges of the frequency domain representation. Nonzero amplitude at the edges would indicate that the number of points in the spatial grid must be increased.
Equation 30 is the only accuracy condition when Fourier spectral differentiation matrices are used. However, they are expensive because they are not sparse. For diffusion and flow, it is therefore common to use finite difference matrices instead; they have further accuracy conditions on the grid point spacing. These conditions follow from Taylor series expressions for derivatives on finite grids. For the simplest example of the central first derivative where Lagrange's remainders (54) are used after second order. Solving this for f′(x) yields The functions being differentiated are dominated by gradient spirals with the worst-case frequency obtained in Eq. 29. The worst case is therefore f(x) = exp(− iW max x). This permits a more detailed examination of the remainder R(x, h) in Eq. 32. After taking the integrals and simplifying the result, we obtain the following expression for the relative error when |W max h| < 1 This condition is more stringent than Eq. 30 and necessitates a finer grid; this is the price to pay for the sparsity of the differentiation operators.
A variety of other finite difference schemes are available in the literature (41), usually accompanied by the accuracy estimate of the form O[h n ], where h is the grid spacing and n is a small integer. By following the same procedure as the one in Eqs. 31 to 33, it is easy to obtain a more accurate estimate of the worst-case relative error on the derivative. In our context, it is O[(W max h) n ], where W max is the maximum spatial frequency that occurs in the experiment. In practice, it is reasonable to start from |W max h| ≪ 1 and to make the grid finer until the simulation result no longer changes.
A good practical example of this is the ideal Stejskal-Tanner experiment (55), for which the analytical expression for the signal attenuation due to isotropic diffusion is known where S(g) is the signal intensity in an experiment with the gradient amplitude g, g is the magnetogyric ratio (including any applicable chemical shift corrections) of the working spins, d is the duration of the gradient pulse, D is the duration of the diffusion delay, and D is the diffusion coefficient. Running a numerical Stejskal-Tanner simulation on a finite grid and then fitting Eq. 34 to extract the diffusion coefficient back provides an illustration of how numerical accuracy depends on the grid spacing and the finite difference stencil size (Fig. 1). Figure 1 quantitatively illustrates the problem described in the Introduction: Accurate simulations of diffusion NMR experiments require thousands of points in each spatial dimension. Even with sparse matrix arithmetic, 3D samples with complicated spin systems and chemical processes are beyond modern computers unless Kronecker products are left unopened in Eq. 4. The minimum grid satisfying the Nyquist condition for spatial frequencies in Fig. 1 has 1280 points; Fourier spectral derivative operators become accurate from that size onward. However, the wall clock time is actually much longer with a 2000-point Fourier differentiation operator than with a 10,000-point finite difference operator because Fourier operators are not sparse.
The same accuracy analysis may be viewed from a different perspective: as a condition on the maximum accumulated gradient winding that a given spatial grid can support. This is illustrated in Fig. 2, which shows a comparison between signal attenuation observed in the simulation and the exact analytical solution as a function of the grid size and the differentiation stencil.
The simulation monotonically becomes more accurate as the grid point count increases. The practical conclusion again is that it is reasonable to use a seven-point stencil, to start from about twice the point count dictated by the Nyquist condition, and to increase the point count until the simulation result stops changing to the accuracy required. The exponential convergence illustrated in the bottom row of Fig. 2 stands in sharp contrast with the linear convergence of Monte Carlo methods (56), which are the only viable alternative in situations when quantum mechanical spin dynamics coexists with spatial motion. Even if Monte Carlo MRI simulation engines using the Liouville-von Neumann equation in the spin subspace existed (they currently do not), a grid solver would always converge faster (57).
An important consideration is the type of the boundary condition. Fourier differentiation matrices require periodic boundaries, and finite difference operators for diffusion and flow all have a particularly simple form in periodic boundary conditions. The logistics of implementing other types of boundaries in a general simulation package is formidable, and therefore only the periodic boundaries are supported in version 2.4 of Spinach (24). In practice, this means that sufficient white space must be left on either side of the sample concentration phantom to prevent diffusion and flow processes from folding over.

NUMERICAL EFFICIENCY
Computationally efficient time propagation The methods listed in the "Matrix representations: Spin dynamics generators" section can be applied a priori to decide which spin states would, in practice, contribute to the dynamics. However, when the equations of motion are integrated in the resulting matrix-vector representation, there are further efficiency opportunities-problem dimension may be reduced further by looking at the actual trajectory. These "trajectory level" methods include the following: 1) Krylov subspace analysis: All simulation trajectories have finite duration and finite step length. The space spanned by the trajectory vectors is known as the Krylov subspace of the propagator matrix (25). The dimension of that subspace (i.e., the number of linearly independent vectors in its basis) in favorable cases can be orders of magnitude smaller than the full state space dimension (42). 2) Zero track elimination: If a particular state has remained unpopulated within a certain initial period in the system evolution, rigorous bounds can be placed on its subsequent contribution to the system dynamics (42). States that remain unpopulated up to a user-specified tolerance can be dropped. 3) Liouvillian path tracing: Spin dynamics generators are always very sparse (58). The corresponding connectivity graphs between spin  states often contain disconnected subgraphs that correspond to non-interacting subspaces (46) induced by unobvious or accidental symmetries and conservation laws. They may be simulated separately. 4) Destination state screening: Detection states in magnetic resonance are very simple-typically some form of transverse magnetization. It is therefore advantageous to perform a part of the simulation in reverse-by starting from the detection state (44), going backwards in time, and meeting the evolving initial condition in the middle. This is possible even with dissipative dynamics (45) and particularly efficient for 3D experiments. We will not touch on the technical matters of implementing all of the above, noting only that this was done (24) and the details are set out in the papers cited above. Fully quantum mechanical simulations of protein-size spin systems became possible as a result (45). Sophisticated solvers that only include Bloch equation for the spin degrees of freedom are also available (59)(60)(61). has significant efficiency caveats in situations when ‖−iLT‖ ≫ 1. In the treatment below, the following common magnetic resonance settings shall be assumed: 1) The dimension of L is too large for diagonalization or inversion. This, and the presence of dissipative terms, rules out Padé, Chebyshev, and Newton exponentiation methods, even in cases where they could be superior to the scaled and squared Taylor series (62,63).
2) The cost of a matrix-matrix product is N a multiplications, where a ≤ 3.
3) The cost of a matrix-vector product is N b multiplications, where b ≤ 2. 4) The Taylor series procedure with scaling and squaring (62) is used to compute the matrix exponential; the scaling is done using Hager's estimate (32) of the 1-norm. With these settings in place, Eq. 35 transforms into rðTÞ ¼ ½expðÀiLDtÞ 2 n rð0Þ; Dt ¼ T=2 n ; where n is the number of propagator squaring operations that is dictated by the numerical accuracy condition for the Taylor series-that all eigenvalues of −iLDt be scaled into the unit circle to make sure that there is no hump (62). Estimates of the largest eigenvalue are expensive, but the more easily estimated 1-norm (32) is an upper bound. The optimization problem is then created by the following: 1) A matrix-vector product is cheaper than a matrix-matrix product. Some of the exponentials in Eq. 36 may be sequentially multiplied into r(0) rather than squared. 2) Propagator squaring increases the time step exponentially, whereas propagator-vector products only take the system forward at a rate that is linear with respect to the number of such products.
An optimal point must therefore exist between the number of propagator squaring operations and the number of propagator-vector products; the cost functional is the total number of multiplications. The cost of 1-norm estimation and scaling is 5N b multiplications (32). The cost of computing the Taylor series indexed by j and involving a matrix with eigenvalues scaled inside the unit circle is determined by the upper bound on the magnitude of the last term, which is 1/j!, meaning that where e is machine precision, G(x) is Euler's gamma function and j max ≈ 18 for double precision arithmetic. The number of matrix-matrix products in the Taylor series procedure is therefore j max , and the total cost of computing the Taylor series is j max N a multiplications. The cost of squaring the propagator to compensate for the scaling of −iLDt is nN a multiplications. If we decide not to do m propagator-squaring operations and multiply the propagator into the state vector 2 m times instead, then the total cost will be multiplications, where k is the number of state vectors in r(0)-there may be several. The minimum of this function with respect to m is It does not depend on the norm of the Liouvillian, the cost of its calculation, or machine precision. While a and b strongly depend on the sparsity of the matrix, their difference does not: a − b ≈ 1. Imposing the non-negative integer constraint produces the optimal number of squaring operations to skip where ceiling is preferred to the floor because sparse matrices fill up when multiplied, and each subsequent propagator squaring will be more expensive than the previous one. Equation 40 is used in Spinach (24) whenever the explicit matrix representation of the evolution generator is available.

PRACTICAL EXAMPLES
That MRI eventually starts to explore sophisticated multispin effects is inevitable; in some areas, it is happening already (10,16,18,19,(64)(65)(66). This section is a small collection of relevant simulations illustrating scenarios where the presence of spatial dynamics makes it impractical to simulate the corresponding experiment pixel by pixel-and even that would of course have been an instance of a Kronecker product. All illustrations come from the Spinach 2.4 example set.

DPFGSE excitation and suppression
A simple but important spatially encoded magnetic resonance experiment is DPFGSE (double pulsed field gradient spin echo) signal suppression,

S C I E N C E A D V A N C E S | R E S E A R C H A R T I C L E
a special case of the excitation sculpting technique (67). It is used in the NMR spectroscopy of proteins and in localized magnetic resonance spectroscopy (MRS) of metabolites, where the 110 M signal of water protons is suppressed to the level that permits acquisition of highquality spectra of millimolar concentrations of other molecules (68). Only one gradient axis (and therefore one spatial dimension) is required (Fig. 3)-this is the simplest example of the formalism presented above. Table 2 illustrates the problem previously faced in such simulations: The Liouville space dimension is in the thousands, and accurate digitization of the gradient spirals in this case requires at least 500 points in the spatial grid. There are 4096 spin states in the Liouville space of GABA and 16 for water. Even after unimportant and unpopulated states are dropped (45), this still leaves us with 1912-dimensional spin operator matrices. In principle, simulating 500 independent voxels under these conditions is not difficult. The problems begin when dif-fusion and flow must be considered because they couple the voxels and force the direct product treatment. Matrix dimension exceeds 2 million and even the sparse representation of the evolution generator is in the hundreds of megabytes. Explicit calculation of the exponential propagator becomes impractical because the memory requirements go into

S C I E N C E A D V A N C E S | R E S E A R C H A R T I C L E
multiple gigabytes. At the same time, the polyadic representation of the evolution generator takes up… 640 kB, the exact amount that Bill Gates had claimed "ought to be enough for anybody" in the year that IK was born.
A relevant point is that neither the water signal nor the signal being selectively excited is on resonance in Fig. 4. The shaped inversion pulse is simulated as an off-resonance soft pulse using an additional degree of Here, we effectively have a "flow" forward along the RF phase coordinate φ with a velocity w RF (t) and a periodic boundary condition. This is advantageous: Frequency-amplitude representations of NMR pulses are simpler and easier to digitize than Cartesian representationschirp pulses are a good example (51). Treating RF phase as an extra spatial coordinate adds another Kronecker product to the chain, and it is handled in the same way; the technical details are published elsewhere (51). The salient point is that the ∂/∂φ operator responsible for turning the RF phase in Eq. 41 is time independent. Only if the frequency moves during the pulse does this term need to be taken out of the background evolution generator and sliced up. This is the other advantage of treating spin and classical dynamics at the same level: The classical coordinates need not be Cartesian.

Point resolved spectroscopy
Point resolved spectroscopy (PRESS) relies on selective excitation of a specific volume within the MRI sample (5). Such excitation is straightforward-essentially three slice selection events in three orthogonal directions, followed by the rest of the desired NMR pulse sequence. However, the simulation of such an experiment involves spatially distributed multispin systems that potentially exhibit diffusion and flow-precisely the setting that this paper seeks to cover.
1D and 2D phantoms ( Fig. 5 and the left panel of Fig. 6) are straightforward-careful control of matrix sparsity and Krylov propagation keeps matrix dimensions at a manageable level of several hundred thousand in the 1D case and several million in the 2D case. The situation changes radically for the 3D simulation (Table 3 and the right panel of Fig. 6) particularly when diffusion is present: Not only is the overall dimension close to being unmanageable, the presence of diffusion also introduces many nonzeroes. The result is that the size of the evolution generator is in the gigabytes (Table 3), and its exponential propagator cannot be computed.
However, the polyadic representation has no difficulty with this simulation-the evolution generator takes up 233 MB of memory. The number of nonzeroes in the polyadic representation will likely reduce further as the implementation is optimized: Unit matrices may not only be skipped at the multiplication stages but also not be stored to begin with-only their dimension is in practice needed.
Ultrafast and pure-shift NMR spectroscopy Ultrafast (3) and spatially encoded pure-shift (69) NMR spectroscopy use spatial encoding to replace what was originally an extra temporal dimension in the earlier NMR pulse sequences. Both subjects are vast, but the conclusions from the simulations of the corresponding experiments (Fig. 7) are exactly the same as above-polyadic representation of the evolution generators either makes the simulation possible to start with or reduces the memory requirements by orders of magnitude relative to the sparse matrix representation of the corresponding operators. The reduction is particularly noticeable when spatial dynamics is present. Artifacts in ultrafast and pure-shift spectroscopy can be dominated (C) Same as (B), but including spatial diffusion with a diffusion coefficient of 2.6 × 10 −9 m 2 /s. (D) Same as (C), but using DPFGSE signal selection rather than suppression mode. One of the GABA multiplets is selectively excited. (E) Same as (C), but also including spatial flow at 10 mm/s. All simulations run in minutes on a Tesla K40 graphics processing unit (GPU) using Spinach 2.4 and later, and are included in the example set.
by spatial dynamics effects-the ability to simulate those will be a welcome development.
Because we focus here specifically on spatially distributed multispin processes, it is useful to find a spatially encoded magnetic resonance experiment that uses the relevant properties to the maximum possible extent. Such an experiment appeared very recently-the ultrafast "maximumquantum" sequence (70) uses spatial encoding to create a 2D correlation spectrum between the chemical shifts of the standard transverse magnetization and the chemical shift of the highest coherence order achie-vable in the system. Such experiments are impossible to simulate unless detailed quantum mechanical treatment is performed in the spin subspace. The spectra themselves are nothing special-a few peaks on the 2D plane that tell something useful to chemists (70)-but wall clock times and matrix dimension statistics are again pertinent (Table 4).
In common with the rest of the simulation methods that went into Spinach, polyadic generator storage does not offer any advantages for small spin systems on small grids (Table 4, two-spin system). The advantage is in the scaling-the situation changes radically as the system   The sample on the left has two dimensions (108 × 90 pixels), and the sample on the right has three (108 × 90 × 111 pixels). PRESS hotspots in two (108 × 90 pixels, left) and three (108 × 90 × 111 pixels, right) dimensions. All parameters and J-coupled spin systems are the same as in Fig. 5. Although its effect is not immediately visible, the 3D simulation includes isotropic diffusion with D = 2.6 × 10 −9 m 2 /s to emphasize the ability of the formalism presented here to handle spatial dynamics in 3D. Gradients are tilted relative to the spatial grid by arbitrary angles. Both simulations (including volumetric visualization in the right panel) run in minutes to hours on a Tesla K40 GPU using Spinach 2.4 and later, and are included in the example set.

A + B A A A A + C A + B A A A A + C A + B A A A A + C
gets larger: Wall clock time parity is achieved with a four-spin system, and a wall clock advantage appears thereafter. At the same time, the improvement in the memory footprint is dramatic-what is supposed to be a 1,228,800-dimensional matrix only has 87,400 nonzeroes in the polyadic representation, and at least half of that comes from unit matrix operands whose explicit storage is not strictly required. For the six-spin system, the number of nonzeroes in the polyadic representation of the evolution generator is much smaller than the number of nonzeroes in the state vector.

CONCLUSIONS AND OUTLOOK
Spatially encoded magnetic resonance spectroscopy of large spin systems distributed and moving in three spatial dimensions can now be simulated in reasonable time and essentially without approximations. This work describes the general structure of the mathematical methods and the software engineering involved. The key design decision was to avoid opening certain Kronecker products in the algebraic structure of the evolution generator. An open-source software The standard NMR simulation is nearly instantaneous. Table 3. Matrix dimension and memory footprint statistics for the simulation of 3D PRESS excitation profile in the right panel of Fig. 6. Memory utilization is quoted as reported by Matlab using compressed column sparse format (74). Sparse matrix representation fits into the L2 cache-polyadic processing is slower, memory footprint improvement is not significant nnz(L) and wct(Lv) a , sparse 116,000, 0.5 ms nnz(L) and wct(Lv) a , polyadic 12,000, 5 ms

S C I E N C E A D V A N C E S | R E S E A R C H A R T I C L E
implementation is already available (24); the immediate applications have been to ultrafast (71), spatially encoded pure-shift (72), and diffusion (73) NMR spectroscopy where spatial dynamics is inextricably linked to multispin processes. As the MRI community starts making greater use of multispin effects and metabolites in their pulse sequences (10,16,18,19,(64)(65)(66), we expect these simulation tools to find further applications in MRI experiment design.