## Abstract

We suggest and implement an approach for the bottom-up description of systems undergoing large-scale structural changes and chemical transformations from dynamic atomically resolved imaging data, where only partial or uncertain data on atomic positions are available. This approach is predicated on the synergy of two concepts, the parsimony of physical descriptors and general rotational invariance of noncrystalline solids, and is implemented using a rotationally invariant extension of the variational autoencoder applied to semantically segmented atom-resolved data seeking the most effective reduced representation for the system that still contains the maximum amount of original information. This approach allowed us to explore the dynamic evolution of electron beam–induced processes in a silicon-doped graphene system, but it can be also applied for a much broader range of atomic scale and mesoscopic phenomena to introduce the bottom-up order parameters and explore their dynamics with time and in response to external stimuli.

## INTRODUCTION

Over the past two decades, scanning transmission electron microscopy (STEM) has emerged as an indispensable tool for exploring materials structure and functionality on the atomic level (*1*–*4*). Recent advances in aberration correction enabled determining the position of individual atomic columns in three-dimensional (3D) materials (*5*) and single atoms in layered materials (*6*, *7*) with picometer-level precision (*8*), enabling visualization of ferroelectric polarization and octahedral tilting fields in perovskites (*9*–*11*), symmetry-breaking phenomena in complex materials such as ferroelectric relaxors and Kitaev materials (*12*), and strain fields in multicomponent systems or in the presence of structural defects (*13*). Equally impressive are the advances in the spectroscopic modes such as electron energy loss spectroscopy, where advances ranging from single-atom spectroscopy (*14*, *15*) to mapping plasmonic excitations and phonons have been demonstrated (*14*, *16*–*21*). Last, 4D-STEM methods now offer a pathway to further increasing the information limit in STEM (*22*–*24*), providing yet unseen details of atomic-level fields and functionalities (*22*, *25*–*27*).

The intriguing and yet largely unexplored opportunity presented by recent progress in STEM is the exploration of atomic-scale chemical processes, either induced by classical global stimuli like temperature or induced by electron beam (e-beam) irradiation. Notably, e-beam damage in electron microscopy has been known from the earliest days of the field (*28*–*30*) and is traditionally perceived as a strongly deleterious effect. Minimizing e-beam damage along with increasing the spatial resolution were major factors driving the development of electron microscopy instrumentation. The advent of aberration correction and the potential for atomic-resolution imaging at low voltages that it has unlocked have enabled opportunities toward systematic studies of e-beam–induced reactions. Early examples of these studies include phase transformations in 3D materials (*31*–*34*) and e-beam–induced radiolysis (*35*). A particularly broad set of opportunities has emerged with extensive studies of 2D materials, where nearly all atomic units are available for observation (*4*) and multiple studies of vacancy and defect formation (*36*–*41*) and of uncontrolled and controlled atomic motion (*42*–*51*) have been reported. Harnessing the e-beam effects on atomic dynamics opened a pathway toward atom by atom manipulation, and recently, homo- and heteroatomic cluster assemblies have been reported (*52*–*55*).

However, further progress in the field necessitates a quantitative description of e-beam–induced transformations as a necessary step for mapping e-beam–induced reaction networks, comparison with predictive and descriptive theory, and ultimately enabling control of e-beam–induced transformations. In cases where the e-beam–induced changes are localized on a single-atom level, these descriptions are possible using the classical point defect chemistry approach, as demonstrated, for example, by Kirkland and co-workers(*36*, *39*) for graphene and Maksov *et al.* (*56*) for layered transition metal dichalcogenides. However, the analysis becomes considerably more complicated in cases where large-scale changes in the connectivity of the chemical bond network occur, including changes in coordination, formation of defect agglomerates, and new phase formation. This is unsurprising because the description of the structure of disordered or partially disordered systems, ranging from amorphous solids to physical systems such as spin and cluster glasses or phase-separated oxides, has remained one of the most complex areas of condensed matter physics (*57*–*59*).

In chemically ordered systems, i.e., in cases where the chemical bond network is that of an ideal solid, symmetry-breaking phenomena can be effectively analyzed using machine learning tools such as multivariate statistical methods or more complex autoencoder (AE)– or variational AE (VAE)–based approaches. In this case, symmetry-breaking distortions are referenced to the known (from the global average) ideal lattice, allowing for straightforward analysis. Recently, this approach was applied to the structural description of perovskites based on column (*11*, *60*) and unit cell (*61*) shape analysis, as well as ferroelectric materials and layered chalcogenides based on the analysis of local atomic neighborhoods (*62*). However, this approach fails when the chemical bonding network is nonperiodic, precluding the unambiguous definition of the high-symmetry state.

Here, we explore the applicability of machine learning methods, specifically rotationally invariant VAEs, toward the analysis of chemical transformations in 2D materials under e-beam irradiation (Fig. 1). This approach is underpinned by two concepts, the postulated parsimony of structural descriptors in solids and the expectation of global rotational invariance of the system undergoing chemical transformations. That is to say, this approach is rotationally invariant in the sense that we do not know or prescribe the rotation angle of the local bonding pattern with respect to the image plane and allow for it to change from point to point. We demonstrate that the VAE with rotational invariance enables the effective exploration of the chemical evolution of the system based on local structural changes and may be extended to more complex systems.

## RESULTS AND DISCUSSION

As a model system, we explore Si in graphene, which has been extensively studied previously in the context of e-beam atomic manipulation (*45*, *46*, *48*, *49*, *51*, *52*). The description of the experimental protocol is available in Materials and Methods. The experimental datasets representing the multiple snapshots of the system during atomic manipulation or e-beam–induced decomposition have been analyzed using deep convolutional neural networks (DCNNs) to yield pixel probability maps of the carbon and silicon atoms. Training and tuning of the DCNNs are described in previous publications (*56*, *63*–*66*), and associated codes are freely available as a part of AtomAI package on GitHub (*67*).

An example of the DCNN analysis of dynamic STEM data is shown in Fig. 2. The original STEM image contains clearly visible regions of pristine graphene, regions of agglomerated point defects that may cause rotation and fragmentation with respect to original lattice, multiple holes in the graphene, individual Si atoms (bright dots), and regions of an amorphous Si-containing contamination phase. To analyze the data, we used a fully convolutional CNN with an encoder-decoder type of architecture that takes the raw experimental images and produces images where the values at pixels correspond to their likelihood of corresponding to atoms (*61*, *68*). The DCNN was trained using images simulated by a MultiSlice algorithm (*69*) and further augmented by applying random cropping, rotations, flipping, scale jittering, adding Poisson/Gaussian noise, and varying the levels of contrast. It is important to note that the network output can be interpreted as a probability that a specific image pixel belongs to a particular atom class, i.e., pixel-wise semantic analysis has been performed.

The DCNN output is shown in Fig. 2B. Here, clusters corresponding to carbon (blue) and silicon (red) are visible, which can naturally be interpreted as localization of the corresponding atoms. We define the centroids of the clusters as atomic positions as visualized in Fig. 2C. Note that this process is robust and only individual atoms are identified and classified, whereas holes and regions where the graphene is obscured by amorphous contamination are excluded from the analysis. This workflow can be applied to the full dynamic STEM image stack in a fully automated manner, decoding the sequence of frames into time-dependent coordinate arrays for C and Si.

The evolution of the Si-graphene system throughout the dynamic STEM measurements is shown in Fig. 3. Here, we use the DCNN output similar to that in Fig. 2B to illustrate the process. Before the measurement sequence, several large-scale holes were created using e-beam manipulation to seed the process. On e-beam irradiation, the formation of 5-7 member defect chains from the introduction, diffusion, and accumulation of carbon vacancies is observed (Fig. 3F) (*70*). This process is further associated with growth of the holes and formation of small-angle grain boundaries. The Si atoms tend to migrate during the process and segregate to the edges of the graphene sheet (edges of holes). Overall, the process illustrates changes of the connectivity of C─C and C─Si bond networks during e-beam irradiation.

The dynamic changes observed during the e-beam–induced reactions necessitate the development of suitable descriptors, both to quantify the reaction process and distinguish reaction processes in different settings such as temperature and beam fluency, and ultimately develop strategies for guiding the process through global and local stimuli. Here, the use of classical physical descriptors based on lattice symmetry and symmetry-breaking phenomena is impossible because the process is associated with the formation and breaking of chemical bonds. At the same time, descriptors based on point defects and clusters in the ideal host lattice also have limited applicability. While individual structural elements such as the formation of 5-7 or 7-5-5-7 defects can be identified, generally, the process associated with the formation of complex defect chains and bonding patterns cannot be uniquely represented as point defects.

Here, we explore the bottom-up description of the process using a machine learning approach where we do not prescribe a priori all of the descriptors; rather, descriptors are determined by prominent patterns within the dataset. Such an approach can be based on several types of descriptors based on the presence of continuous or discrete translational symmetry and initial local structure representation. For uniform sampling used for continuous translational symmetry as shown in Fig. 4A, the subimages are selected and centered on the rectangular grid in the image plane. Subsequent transforms such as fast Fourier transform or Radon transform (depending on the preferred features) combined with linear or nonlinear unmixing yields the spatial maps of the descriptive parameters (*71*–*73*). Alternatively, these images can be used as inputs in classical CNNs. This type of analysis is extremely powerful for the detection of patterns such as ferroelectric and magnetic domains, self-assembled monolayers, and atomically resolved images but generally does not take advantage of specific physically defined features within the data such as individual atom positions. Furthermore, it tends to yield a spectrum of descriptors that are amenable only to semiquantitative interpretation.

An alternative set of approaches can be developed for images where atomic positions (or other discrete reference features) are known, naturally extending to the analysis of subimages centered on individual atomic coordinates. In a point representation as introduced in a local crystallography approach, a statistical analysis of the radial vectors to the nearest neighbors is performed (Fig. 4B), enabling classification of local structures (*74*, *75*). This analysis can be further extended toward a graphical representation (Fig. 4C). A bonding network can be constructed with the atoms as vertices; the bond strength can be defined as a certain function of atomic spacings and dihedral angles, and graph decomposition methods can be used for deeper analysis. However, this approach relies on the atomic positions being known (or postulated on the basis of a certain threshold level of contrast) and descriptors being the positions of these features. While ideal for theoretical structural datasets, for experimental data, the methods that use partial or uncertain information on the atomic species positions that allow to work with full images and hyperspectral images and using the statistical representations of such distribution will be highly beneficial.

Here, we explore the approach for identification of the individual elements of the chemical bonding network in the material and its evolution during the reaction based on the combined description (Fig. 4, D and E) where the subimages are centered on known atomic positions and the atomic neighborhood is represented as a continuous function defined in the image plane. This approach naturally allows the exploration of images containing both well-defined atomic species and diffuse or uncertain backgrounds. Once the set of subimages is identified, it naturally brings forth the question of the construction of elementary building blocks or finding a minimal set of descriptors that can represent the full information in the image.

For the system where the chemical bonding pattern does not change, the analysis can be performed using linear dimensionality reduction methods such as principal components analysis or non-negative matrix factorization, Gaussian mixture model (GMM) separation, or nonlinear methods such as classical AEs and VAEs as explored earlier for local crystallography and FerroNet (*61*) approaches. However, the system that has substantial orientational disorder will not be amenable to these linear methods because the number of classes is no longer fixed and the decomposition components will be dominated by the rotational degrees of freedom and yield smooth distributions of rotational variants. The example of the GMM analysis is shown in the accompanying notebook. Similarly, analysis through the classic AE yields a complex set of descriptors that convolute both the geometry changes and rotations of individual fragments. This results in different descriptors for structures that only differ by virtue of a rotation (i.e., structures that would likely be considered the same), and separating unique geometric configurations from simple rotations becomes challenging.

To address this challenge, we introduce an approach for analyzing the individual building blocks using rotationally invariant VAEs. The classic AE operates on the principle of compressing data to a small number of latent variables and subsequently decoding back to the original dataset. Compression and decoding are typically performed via a set of convolutional filters similar to classical CNNs. The assumption is that if the information can be represented by a small number of latent variables, then variation of these latent variables across the initial dataset provides insight into the relevant physics. In VAE (*76*), one assumes that each point in the dataset is generated by a local random variable in a complex, nonlinear way. The encoder part of a VAE is used to find good values for the latent random variables such that they are true to our (fixed) prior beliefs and true to the data. The priors in a VAE are the type of distribution for the latent variables (usually a Gaussian distribution) and the parameters of this distribution. The decoder part of a VAE is used to make a prediction for the data based on a sample from this distribution. The model parameters (weights and biases) are learned by maximizing the evidence lower bound (ELBO), which consists of the Kullback-Leibler divergence term and the expected likelihood (reconstruction loss). Note that despite the close similarity in names, AEs and VAEs belong to the fundamentally different classes of machine learning algorithms, with VAEs having the Bayesian nature with the latent space defining the priors for the observables. In the absence of prior knowledge, these are chosen as Gaussian (or uniform); however, more complex models are possible in principle (*77*).

Here, the VAE architecture is expanded to account for rotational invariance, which allows rotations to be encoded separately. The implementation of the rotational VAE (rVAE) is based on recent work by Bepler *et al.* (*78*) who showed that parameterization of the VAE’s decoder as a function of the image coordinates allows separating the image content from rotations and translations. In this case, the latent variable “layer” contains the rotational angle and two translation vectors as three of the latent variables. Additional latent variables are chosen to represent the variability within the (sub)image content. We found that when using the DCNN output instead of raw data, a simple two-layer perceptron with 128 neurons in each layer of both encoder and decoder is sufficient to learn the rotationally invariant latent representation of atom-resolved STEM data. For the raw data, the rVAE tends to work better with multiple stacked convolutional layers in the encoder. The encoder and decoder networks are trained jointly, using the Adam optimizer for adjusting weights. As suggested in the original work, the rotational prior was modeled using the Gaussian distribution, with zero mean and an SD of 0.1 radians. The rVAE was trained using 75% of the data, while the remaining 25% were used for testing. The rVAE predictions were made on the entire (0.75 + 0.25) dataset.

Figure 5 shows the analysis of the dataset from Fig. 3 using the GMM, AE, and rVAE methods to contrast the effectiveness of each method. The data are a 3D stack of ~80,000 subimages generated from the 50 initial movie frames. The GMM analysis for a small window (Fig. 5A) generates a set of independent components that can be used to linearly reconstruct the best approximation of the image data. Within these components, we can see that most correspond to partial rotations of the lattice with some variation in intensity. In comparison, a classical AE analysis is illustrated in Fig. 5B where the full variety of the data contained in subimages is reduced to two continuous latent variables represented on the *x* and *y* axes. In this case, the elementary building blocks can be visualized in the lattice space. Examination of the data shows that the building blocks corresponding to two carbon sublattice sites are localized in the top left and bottom right corners. Intermediate states occupy the diagonal capturing rotation and other more subtle variations. Again, the primary differentiating factor between the building blocks is rotation and the subtle variations do not have a clear physical interpretation. These behaviors become even more pronounced for larger numbers of GMM components or larger window sizes (Fig. 5D), which enable a more precise reconstruction of the original data but decrease the physical interpretability. While the GMM effectively determines different components of the dataset from an imaging standpoint, these components are mostly the same from a structural standpoint. If the goal of the analysis is to categorize structural evolution of the sample, then it is not clear how to proceed using the GMM approach.

It could be argued that using only two latent variables in classical AE may be insufficient to describe the variability of the system. For higher–dimensional latent spaces, we adopt the approach where the point distribution corresponding to the real images is clustered via GMM in latent space and the images corresponding to GMM centroids are reconstructed using the decoder part of the AE (*79*). We found that, generally, the reconstructed components also demonstrate the presence of multiple rotation angles. This is unsurprising because changes in the carbon-carbon bond length or geometries are relatively small, whereas transformations associated with the rotation between similar structures are large. In other words, rotations do represent the most pronounced feature within the image data even if the structure is unchanged.

The rVAE analysis is illustrated in Fig. 5 (C and F). Here, the rotations of structural elements are captured in a separate latent variable allowing for their reliable separation from more subtle structural changes. For smaller window size (Fig. 5C), the variability across the latent space in the *x* direction is due to intensity, which we associate here with a degree of crystallinity (see the next paragraph), and variation along the *y* direction is minor. A similar analysis can be performed for much larger window sizes as shown in Fig. 5F, demonstrating excellent recovery of structural elements. In this case, one of the latent variables can be interpreted as a degree of crystallinity in the bulk and the other as proximity to the edge of the graphene. The quantitative comparison of GMM, VAE, and rVAE using the structural similarity measure between the unmixed/disentangled structures is shown in fig. S1.

Last, the latent variables can be back-projected to real space to visualize the evolution of the atomic structure, as shown in Fig. 6. Here, the color scale of each atomic unit is set according to the value of the corresponding latent variable for angle (top row) and first latent variable associated with images content (bottom row). The angle variable shows clear contrast for the two structurally unique sublattice positions in the graphene lattice, giving rise to the checkerboard-like pattern. Small-angle boundaries and the rotation between graphene fragments are clearly visible from the color changes in Fig. 6 (C and E). At the same time, the behavior of the latent variable corresponding to the content of subimages clearly shows that it adopts maximum value in the well-ordered regions and is reduced at the edges and in regions with high defect density, allowing for its interpretation as local crystallinity. We were able to reproduce these results even when rVAE was trained using only ~25% of the images from each movie frame and then applied to the entire set.

The time dynamics of the e-beam–induced evolution can be analyzed both in latent space using the time stamp as a parameter and in the time domain. Shown in Fig. 7A is the example of the encoded angle distribution across the whole image stack and (Fig. 7B) its time dependence. Note that the initial sharp bimodal distribution lowers in intensity (because of carbon atom loss) and becomes broader. The evolution of the image latent variables is shown in Fig. 7 (C and D) and demonstrates broadening and reduction in intensity, indicative of lattice decomposition.

To summarize, we suggest and implement an approach for the bottom-up description of systems undergoing chemical transformation from atomically resolved imaging data where only partial and uncertain information on atomic positions is available. This approach is predicated on the synergy of two concepts, the parsimony of physical descriptors and the general rotational invariance of noncrystalline solids. The first concept is implemented via the VAE approach, seeking the most effective reduced representation for the system that still contains the maximum original information. The second concept is implementation of rotational invariance. Using this approach, we explore the evolution of e-beam–induced processes in the silicon-graphene system, which can be applied to the studies of formation of small angle grain boundaries, lattice rotation, and amorphization.

We note that this approach can be applied for a much broader range of atomic-scale and mesoscopic phenomena, including domain dynamics, chemical reactivity, and emergence and dynamics of ordered domains where the natural mechanisms of the process necessitates compensation for the continuous rotation of a structural element. The low–dimensional latent representations yielded by the rVAE provide a bottom-up equivalent of the structural order parameters and hence can be used to describe a broad set of time- and stimulus-dependent phenomena in disordered systems in consistent manner. The code used to analyze data is provided as a Jupyter notebook for the interested readers (https://git.io/JfdSa).

## MATERIALS AND METHODS

### Sample preparation

Graphene was grown on Cu foil using chemical vapor deposition and spin-coated with poly(methyl methacrylate) (PMMA) to preserve the graphene and provide structural stability during handling. The Cu foil was etched away in an ammonium persulfate bath. The remaining PMMA/graphene stack was rinsed and transferred to a TEM grid, and the PMMA was dissolved in acetone. The sample was baked in a 10% O_{2} environment at 500°C for 1.5 hours to remove contaminant residue (*80*). Before examination in the STEM, the sample was baked again in vacuum at 160°C for 8 hours.

### STEM imaging

STEM imaging was performed using a Nion UltraSTEM 200 operated at 100-kV accelerating voltage with a nominal convergence angle of 30 mrad. The high-angle annular dark-field detector was used for imaging with a nominal inner angle of 80 mrad. Four holes were drilled through the pristine graphene using by manually positioning the e-beam. An image stack was then acquired capturing the continued evolution of the four holes under the 100-kV e-beam. The beam current was nominally 20 pA and used a frame acquisition time of ~33.5 s per frame for a dose of 4.2 × 10^{9} electrons per frame.

### Data analysis

*Deep convolutional neural network*. The DCNN for semantic segmentation of atomically resolved images had a U-Net architecture (*81*) and was trained on simulated data using an Adam optimizer (*82*) with the cross-entropy loss function and learning rate of 0.001.

*Convolutional AE*. In the convolutional AE, the encoder had two convolutional layers with 32 and 64 convolutional filters (“kernels”) of size (3, 3) and stride (2, 2) activated by a rectified linear unit (ReLU) function. The decoder consisted of two layers with transposed convolutions of the same size and stride activated by ReLU. The latent layer was represented by a fully connected (“dense”) layer with two neurons. The model was trained using Adam optimizer with a learning of 0.001 and mean squared error (MSE) loss function.

*Gaussian mixture model*. For GMM, it is assumed that there are *K* normal distributions parametrized by their mean (μ* _{i}*) and covariance (

*) and that each of extracted subimages*

_{i}*R*(

_{i}*x*,

*y*) is independently drawn from one of these distributions with probability density function given by

*Rotational variational autoencoder*. The generative and inference models of rotationally invariant VAE are parameterized by two deep neural networks, *p*_{θ}(*x*∣τ) and *q*_{ϕ}(τ∣*x*), where θ and ϕ are the parameters (weights and biases) of the neural networks and τ denotes a collective latent variable (image content + translation + rotation). A stochastic gradient descent is used to learn the θ and ϕ parameters via maximizing the ELBO consisting of the MSE term and the Kullback-Leibler divergence terms associated with image content and rotation*z* is a latent variable associated with image content, γ is a latent angle variable, and *s*_{γ} is a “rotational prior.” Here, we used encoder and decoder both consisting of two fully connected layers with 128 neurons in each layer activated by hyperbolic tangent. The latent layer had three neurons for absorbing arbitrary rotations and translations of images content, whereas the remaining two neurons in the latent layer were used for disentangling the image content. The rVAE was trained using the Adam optimizer with the learning rate of 0.0001.

All deep learning routines were implemented in PyTorch deep learning library (*83*). For more details, see the interactive Jupyter notebooks reproducing all the data analysis available without restrictions at https://git.io/JfdSa.

## SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/7/17/eabd5084/DC1

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is **not** for commercial advantage and provided the original work is properly cited.

## REFERENCES AND NOTES

**Acknowledgments:**

**Funding:**ML and STEM work was supported by the U.S. Department of Energy (DOE), Office of Science, Basic Energy Sciences (BES), Materials Sciences and Engineering Division (S.V.K., O.D., and S.J.) and was performed and partially supported (M.Z.) at the Oak Ridge National Laboratory’s Center for Nanophase Materials Sciences (CNMS), a U.S. DOE, Office of Science User Facility.

**Author contributions:**S.V.K. proposed the concept and led the paper writing. M.Z. implemented the code used for data analysis and contributed to the paper writing. O.D. and S.J. performed STEM experiments and participated in the interpretation of results.

**Competing interests:**The authors declare that they have no competing interests.

**Data and materials availability:**All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. The experimental data are available without restrictions at 10.5281/zenodo.4279127. The code related to this paper is available without restrictions on https://git.io/JfdSa.

- Copyright © 2021 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).