## Abstract

Predicting the stability of the perovskite structure remains a long-standing challenge for the discovery of new functional materials for many applications including photovoltaics and electrocatalysts. We developed an accurate, physically interpretable, and one-dimensional tolerance factor, τ, that correctly predicts 92% of compounds as perovskite or nonperovskite for an experimental dataset of 576 *ABX*_{3} materials (*X* = O^{2−}, F^{−}, Cl^{−}, Br^{−}, I^{−}) using a novel data analytics approach based on SISSO (sure independence screening and sparsifying operator). τ is shown to generalize outside the training set for 1034 experimentally realized single and double perovskites (91% accuracy) and is applied to identify 23,314 new double perovskites (*A*_{2}*BB′X*_{6}) ranked by their probability of being stable as perovskite. This work guides experimentalists and theorists toward which perovskites are most likely to be successfully synthesized and demonstrates an approach to descriptor identification that can be extended to arbitrary applications beyond perovskite stability predictions.

## INTRODUCTION

Crystal structure prediction from chemical composition remains a persistent challenge to accelerated materials discovery (*1*, *2*). Most approaches capable of addressing this challenge require several computationally demanding electronic-structure calculations for each material composition, limiting their use to a small set of materials (*3*–*6*). Alternatively, descriptor-based approaches enable high-throughput screening applications because they provide rapid estimates of material properties (*7*, *8*). Notably, the Goldschmidt tolerance factor, *t* (*9*), has been used extensively to predict the stability of the perovskite structure based only on the chemical formula, *ABX*_{3}, and the ionic radii, *r*_{i}, of each ion (*A*, *B*, *X*):

$$t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)} \qquad (1)$$
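As a minimal sketch of how the Goldschmidt tolerance factor is evaluated, the snippet below computes *t* = (*r*_{A} + *r*_{X})/[√2(*r*_{B} + *r*_{X})] for SrTiO_{3}. The function name and the radii (approximate Shannon values in Å for 12-coordinate Sr^{2+}, 6-coordinate Ti^{4+}, and O^{2−}) are illustrative assumptions and are not taken from the dataset used in this work.

```python
import math


def goldschmidt_t(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt tolerance factor: (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))


# Approximate Shannon radii in angstrom (assumed values, for illustration only).
R_SR, R_TI, R_O = 1.44, 0.605, 1.40

t_srtio3 = goldschmidt_t(R_SR, R_TI, R_O)
print(f"t(SrTiO3) = {t_srtio3:.3f}")
```

With these radii, *t* is very close to 1, the value at which the *A*-*X* and *B*-*X* bond lengths are geometrically commensurate with the ideal cubic structure.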

The perovskite crystal structure, as shown in Fig. 1A, is defined as any *ABX*_{3} compound with a network of corner-sharing *BX*_{6} octahedra surrounding a larger *A*-site cation (*r*_{A} > *r*_{B}), where the cations, *A* and *B*, can span the periodic table and the anion, *X*, is typically a chalcogen or halogen. Distortions from the cubic structure can arise from size mismatch of the cations and anion, which results in additional perovskite structures and nonperovskite structures. The *B* cation can also be replaced by two different ions, resulting in the double perovskite formula, *A*_{2}*BB′X*_{6} (Fig. 1B). Single and double perovskite materials have exceptional properties for a variety of applications such as electrocatalysis (*10*), proton conduction (*11*), ferroelectrics (*12*) (using oxides, *X* = O^{2−}), battery materials (*13*) (using fluorides, *X* = F^{−}), as well as photovoltaics (*14*) and optoelectronics (*15*) (using the heavier halides, *X* = Cl^{−}, Br^{−}, I^{−}).

The first step in designing new perovskites for these applications is typically the assessment of stability using *t*, which has informed the design of perovskites for over 90 years. However, as reported in recent studies, its accuracy is often insufficient (*16*). Considering 576 *ABX*_{3} solids experimentally characterized at ambient conditions and reported in (*17*–*19*) (see Fig. 1C for the *A*, *B*, and *X* elements in this set), *t* correctly distinguishes between perovskite and nonperovskite for only 74% of materials and performs considerably worse for compounds containing heavier halides [chlorides (51% accuracy), bromides (56%), and iodides (33%)] than for oxides (83%) and fluorides (83%) (Fig. 2A, fig. S1, and table S1). This deficiency in generalization to halide perovskites severely limits the applicability of *t* for materials discovery.

In this work, we present a new tolerance factor (τ), which has the form

$$\tau = \frac{r_X}{r_B} - n_A\left(n_A - \frac{r_A/r_B}{\ln(r_A/r_B)}\right) \qquad (2)$$

where *n*_{A} is the oxidation state of *A*, *r*_{i} is the ionic radius of ion *i*, *r*_{A} > *r*_{B} by definition, and τ < 4.18 indicates perovskite. A high overall accuracy of 92% for the experimental set (94% for a randomly chosen test set of 116 compounds) and nearly uniform performance across the five anions evaluated [oxides (92% accuracy), fluorides (92%), chlorides (90%), bromides (93%), and iodides (91%)] is achieved with τ (Fig. 2B, fig. S1, and table S1). Like *t*, the prediction of perovskite stability using τ requires only the chemical composition, allowing the tolerance factor to be agnostic to the many structures that are considered perovskite. In addition to predicting if a material is stable as perovskite, τ also provides a monotonic estimate of the probability that a material is stable in the perovskite structure. The accurate and probabilistic nature of τ, as well as its generalizability over a broad range of single and double perovskites, allows new physical insights into the stability of the perovskite structure and the prediction of thousands of new double perovskite oxides and halides, 23,314 of which are provided here and ranked by their probability of being stable in the perovskite structure.
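The following is a minimal sketch of how τ and the τ < 4.18 decision rule could be evaluated, assuming the form τ = *r*_{X}/*r*_{B} − *n*_{A}[*n*_{A} − (*r*_{A}/*r*_{B})/ln(*r*_{A}/*r*_{B})], which matches the term-by-term description given in this paper. The function names and the SrTiO_{3} radii (approximate Shannon values in Å) are illustrative assumptions.

```python
import math

TAU_THRESHOLD = 4.18  # tau below this value indicates perovskite (from the text)


def tau(n_a: int, r_a: float, r_b: float, r_x: float) -> float:
    """tau = r_X/r_B - n_A * (n_A - (r_A/r_B) / ln(r_A/r_B)); requires r_A > r_B."""
    if r_a <= r_b:
        raise ValueError("tau is only defined for r_A > r_B")
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


def is_perovskite(tau_value: float) -> bool:
    """Binary classification from the empirically determined threshold."""
    return tau_value < TAU_THRESHOLD


# SrTiO3 with approximate Shannon radii in angstrom (assumed, for illustration).
tau_srtio3 = tau(n_a=2, r_a=1.44, r_b=0.605, r_x=1.40)
print(f"tau(SrTiO3) = {tau_srtio3:.2f}, perovskite: {is_perovskite(tau_srtio3)}")
```

Raising an error for *r*_{A} ≤ *r*_{B} is a design choice of this sketch: the logarithm vanishes at *r*_{A} = *r*_{B}, so the expression has no finite value there.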

## RESULTS AND DISCUSSION

### Finding an improved tolerance factor to predict perovskite stability

One key aspect of the performance of *t* is how well the sum of ionic radii estimates the interatomic bond distances for a given structure. Shannon’s revised effective ionic radii (*20*) based on a systematic empirical assessment of interatomic distances in nearly 1000 compounds are the typical choice for radii because they provide ionic radius as a function of ion, oxidation state, and coordination number for the majority of elements. Most efforts to improve *t* have focused on refining the input radii (*17*, *19*, *21*, *22*) or increasing the dimensionality of the descriptor through two-dimensional (2D) structure maps (*18*, *23*, *24*) or high-dimensional machine-learned models (*25*–*27*). However, all hitherto applied approaches for improving the Goldschmidt tolerance factor are only effective over a limited range of *ABX*_{3} compositions. Despite its modest classification accuracy, *t* remains the primary descriptor used by experimentalists and theorists to predict the stability of perovskites.

The SISSO (sure independence screening and sparsifying operator) approach (*28*) was used to identify an improved tolerance factor for predicting whether a given compound is perovskite [determined by experimental realization of any structure with corner-sharing *BX*_{6} octahedra (*21*) at ambient conditions] or nonperovskite [determined by experimental realization of any structure(s) without corner-sharing *BX*_{6} octahedra, including, in some cases, failed synthesis of any *ABX*_{3} compound]. Of the 576 experimentally characterized *ABX*_{3} solids, 80% were used to train and 20% were used to test the SISSO-learned descriptor. Several alternative atomic properties were considered as candidate features, and among them, SISSO determined that the best performing descriptor, τ (Eq. 2 and Fig. 2B), depends only on oxidation states and Shannon ionic radii (see Materials and Methods for an explanation of the approach used for descriptor identification and a discussion of alternative approaches). For the set of 576 *ABX*_{3} compositions, τ correctly labels 94% of the perovskites and 89% of the nonperovskites compared with 94 and 49%, respectively, using *t*. The primary advantage of τ over *t* is the remarkable reduction in compounds that are predicted to be perovskite but are not experimentally identified as stable perovskites, with false-positive rates for τ and *t* of 11 and 51%, respectively. Full confusion matrices along with additional performance metrics for τ and *t* are provided in table S2. The large decrease in false-positive rate (from 51% to 11%) while substantially increasing the overall classification accuracy (from 74% to 92%) demonstrates that τ improves significantly upon *t* as a reliable tool to guide experimentalists toward which compounds can be synthesized in perovskite structures.
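The per-class rates and the overall accuracies quoted above can be cross-checked with a short calculation. The sketch below assumes that the 313 experimentally labeled perovskites mentioned in Materials and Methods are the perovskites among the 576 compounds (so 263 are nonperovskites). That split is an assumption of this example, and because the quoted rates are rounded, the results are only approximate.

```python
def overall_accuracy(n_pos: int, n_neg: int, tpr: float, tnr: float) -> float:
    """Accuracy as the class-size-weighted mean of the per-class recall values."""
    return (tpr * n_pos + tnr * n_neg) / (n_pos + n_neg)


N_TOTAL = 576
N_PEROVSKITE = 313  # assumed class size (see lead-in)
N_NONPEROVSKITE = N_TOTAL - N_PEROVSKITE

# Rounded per-class rates quoted in the text.
acc_tau = overall_accuracy(N_PEROVSKITE, N_NONPEROVSKITE, tpr=0.94, tnr=0.89)
acc_t = overall_accuracy(N_PEROVSKITE, N_NONPEROVSKITE, tpr=0.94, tnr=0.49)

# The false-positive rate is the complement of the nonperovskite recall.
fpr_tau = 1.0 - 0.89
fpr_t = 1.0 - 0.49

print(f"tau: accuracy ~ {acc_tau:.1%}, false-positive rate ~ {fpr_tau:.0%}")
print(f"t:   accuracy ~ {acc_t:.1%}, false-positive rate ~ {fpr_t:.0%}")
```

Under this assumed class split, the two descriptors recover perovskites equally well, and the gap in overall accuracy comes entirely from the nonperovskite class.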

Beyond the improved accuracy, a crucial advantage of τ is the monotonic (continuous) dependence of perovskite stability on τ. As τ decreases, the τ-based probability of being perovskite, *P*(τ), increases, where perovskites are expected for an empirically determined range of τ < 4.18 (Fig. 2B; Materials and Methods for details). Probabilities are obtained using Platt’s scaling (*29*), where the binary classification of perovskite/nonperovskite is transformed into a continuous probability estimate of perovskite stability, *P*(τ), by training a logistic regression model on the τ-derived binary classification. Probabilities cannot similarly be obtained with *t* because the stability of the perovskite structure does not increase or decrease monotonically with *t*, where 0.825 < *t* < 1.059 results in a classification as perovskite (this range maximizes the classification accuracy of *t* on the set of 576 compounds). While *P*(τ) is sigmoidal with respect to τ because of the logistic fit (fig. S2), a bell-shaped behavior of *P*(τ) with respect to *t* is observed because of the multiple decision boundaries required for *t* (Fig. 2C). This relationship leads to an increase in *P*(τ) (i.e., probability of perovskite stability using τ), with an increase in *t* until a value of *t* ~ 0.9. Beyond this range, the probabilities level out or decrease as *t* increases further.
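To illustrate the idea behind Platt's scaling, the sketch below fits a two-parameter sigmoid, *P*(τ) = 1/[1 + exp(*a*τ + *b*)], to binary labels derived from the τ < 4.18 rule. The τ values are a made-up toy set, the plain gradient-descent optimizer is a stand-in chosen to keep the example dependency-free, and the smoothed targets follow Platt's original recipe. None of these details are claimed to reproduce the fitting procedure used in this work.

```python
import math


def fit_platt(scores, labels, lr=0.5, n_iter=2000):
    """Fit P(s) = 1 / (1 + exp(a * s + b)) to binary labels by gradient descent."""
    n = len(scores)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # Platt's smoothed targets keep the fit finite for perfectly separable data.
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y else t_neg for y in labels]
    mean = sum(scores) / n  # centre the scores to improve conditioning
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(a * (s - mean) + b))
            # Derivative of the cross-entropy with respect to (a * s + b) is (t - p).
            grad_a += (t - p) * (s - mean)
            grad_b += t - p
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b - a * mean  # undo the centring


def probability(score, a, b):
    """Calibrated probability of the positive (perovskite) class."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical tau values; labels follow the tau < 4.18 decision rule.
toy_tau = [2.5, 3.0, 3.5, 3.9, 4.1, 4.3, 4.6, 5.0, 6.0, 8.0]
toy_labels = [1 if value < 4.18 else 0 for value in toy_tau]

a, b = fit_platt(toy_tau, toy_labels)
for value in (3.0, 4.0, 5.0, 6.0):
    print(f"tau = {value:.1f} -> P = {probability(value, a, b):.2f}")
```

Because the fitted slope *a* is positive, the resulting probability decreases monotonically with τ, which is the property that makes a single threshold on τ meaningful.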

The disparity between the τ-derived perovskite probability, *P*(τ), and the assignment by *t* can be significant, especially in the range where *t* predicts a stable perovskite (0.825 < *t* < 1.059). A comparison of the perovskite (LaAlO_{3}) and the nonperovskite (NaBeCl_{3}) illustrates the discrepancy between these two approaches. *t* incorrectly predicts both compounds to be perovskite (*t* = 1.0), whereas *P*(τ) varies from <10% for NaBeCl_{3} to >97% for LaAlO_{3}, in agreement with the experimental results. For NaBeCl_{3}, instability in the perovskite structure arises from an insufficiently large Be^{2+} cation on the *B* site, which leads to unstable BeCl_{6} octahedra. This contribution to perovskite stability is accounted for in the first term of τ (Eq. 2, *r*_{X}/*r*_{B} = μ^{−1}, where μ is the octahedral factor).
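The LaAlO_{3}/NaBeCl_{3} comparison can be reproduced approximately with a few lines of code. In the sketch below, the radii are approximate Shannon values in Å quoted from memory for illustration (12-coordinate *A*, 6-coordinate *B*), and the function names are assumptions of this example. The expressions used are *t* = (*r*_{A} + *r*_{X})/[√2(*r*_{B} + *r*_{X})] and τ = *r*_{X}/*r*_{B} − *n*_{A}[*n*_{A} − (*r*_{A}/*r*_{B})/ln(*r*_{A}/*r*_{B})].

```python
import math


def goldschmidt_t(r_a, r_b, r_x):
    """Goldschmidt tolerance factor."""
    return (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))


def tau(n_a, r_a, r_b, r_x):
    """Tolerance factor tau; only defined for r_A > r_B."""
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


# (n_A, r_A, r_B, r_X); approximate Shannon radii in angstrom (assumed values).
compounds = {
    "LaAlO3": (3, 1.36, 0.535, 1.40),
    "NaBeCl3": (1, 1.39, 0.45, 1.81),
}

results = {}
for name, (n_a, r_a, r_b, r_x) in compounds.items():
    results[name] = {
        "t": goldschmidt_t(r_a, r_b, r_x),
        "tau": tau(n_a, r_a, r_b, r_x),
        "mu": r_b / r_x,  # octahedral factor
    }
    print(name, {key: round(value, 3) for key, value in results[name].items()})
```

With these radii, both compounds have *t* close to 1.0, whereas τ is roughly 1.8 for LaAlO_{3} and roughly 5.8 for NaBeCl_{3}, placing them on opposite sides of the 4.18 threshold. The small octahedral factor of NaBeCl_{3} (below 0.414) is what inflates its *r*_{X}/*r*_{B} term.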

μ is the typical choice for a second feature used in combination with *t* (*18*, *19*, *23*) and was recently used to assess the predictive accuracy of Goldschmidt’s “no-rattling” principle. In this analysis, six inequalities dependent on *t* and μ were derived and used to predict the formability of single and double perovskites with a reported accuracy of ~80% (*30*). Notably, training a decision tree algorithm on the bounds of *t* and μ that optimally separate perovskite from nonperovskite leads to a classification accuracy of 85% for this dataset (fig. S3). In contrast to these 2D descriptors based on (*t*, μ), τ incorporates μ as a 1D descriptor yet still achieves a higher accuracy of 92%, demonstrating the capability of the SISSO algorithm to identify a highly accurate tolerance factor composed of intuitively meaningful parameters.

The nature of geometrical descriptors, such as *t* or μ, is fundamentally different than that of data-driven descriptors, such as τ. *t* and μ are derived from geometric constraints that indicate when the perovskite structure is a possible structure that can form. However, these constraints do not necessarily indicate when the perovskite structure is the ground-state structure and does form. For instance, if *t* = 1 and the ionic limit on which *t* was derived is applicable (the interatomic distances are sums of the ionic radii), these criteria do not suggest that perovskite is the ground-state structure, only that the interatomic distances are such that the lattice constants in the *A-X* and *B-X* directions can be commensurate with the perovskite structure. The fact that *t* does not guarantee the formation of the perovskite structure is evident by the high false-positive rate (51%) in the region of *t* where perovskite is expected (0.825 < *t* < 1.059). Similarly, although μ may fall within the range where *BX*_{6} octahedra are expected based on geometric considerations (0.414 < μ < 0.732), the octahedra that form may be edge or face sharing, and therefore, the observed structure is nonperovskite. In this work, SISSO searches a massive space of potential descriptors to identify the one that most successfully detects when a given chemical formula will or will not crystallize in the perovskite structure, and because this is the target property, τ emerges as a much more predictive descriptor than *t* or μ.

Although the classification by τ disagrees with the experimental label for 8% of the 576 compounds, the agreement increases to 99% outside the range 3.31 < τ < 5.92 (200 compounds) and 100% outside the range 3.31 < τ < 12.08 (152 compounds). The experimental dataset may also be imperfect as compounds can manifest different crystal structures as a function of the synthesis conditions due to, e.g., defects in the experimental samples (impurities, vacancies, etc.). These considerations emphasize the usefulness of τ-derived probabilities, in addition to the binary classification of perovskite/nonperovskite, which address these uncertainties in the experimental data and corresponding classification by τ.

### Comparing τ to calculated perovskite stabilities

The precise and probabilistic nature of τ, as well as its simple functional form—depending only on widely available Shannon radii (and the oxidation states required to determine the radii)—enables the rapid search across composition space for stable perovskite materials. Before attempting synthesis, it is common for new materials to be examined using computational approaches; therefore, it is useful to compare the predictions from τ with those obtained using density functional theory (DFT). The stabilities (decomposition enthalpies, Δ*H*_{d}) of 73 single and double perovskite chalcogenides and halides were recently examined with DFT using the Perdew-Burke-Ernzerhof (*31*) exchange-correlation functional (DFT) (*32*, *33*). τ is found to agree with the calculated stability for 64 of 73 calculated materials. Importantly, the probabilities that result from classification with τ linearly correlate with Δ*H*_{d}, demonstrating the value of the monotonic behavior of τ and *P*(τ) (Fig. 2D and table S3).

Although τ appears to disagree with these DFT calculations for nine compounds, six disagreements lie near the decision boundaries [*P*(τ) = 0.5, Δ*H*_{d} = 0 meV/atom], suggesting that they cannot be confidently classified as stable or unstable perovskites using τ or DFT calculations of the cubic structure. Of the remaining disagreements, CaZrO_{3} and CaHfO_{3} reveal the power of τ compared with DFT calculations of the cubic structure, as these two oxides are known to be isostructural with the orthorhombic perovskite CaTiO_{3}, from which the name perovskite originates (*34*, *35*). Δ*H*_{d} < −90 meV/atom for these two compounds in the cubic structure, indicating that they are nonperovskites. In contrast, τ predicts both compounds to be stable perovskites with ~65% probability, which agrees with the experimental results. These results show that a key challenge in the prediction of perovskite stability from quantum chemical calculations is the requirement of a specific structure as an input, as there are more than a dozen unique structures classified as perovskite (i.e., those having corner-sharing *BX*_{6} octahedra) and many more that are nonperovskite.

Several recent machine-learned descriptors for perovskite stability have been trained or tested on DFT-calculated stabilities of only the cubic perovskite structure (*33*, *36*–*38*). However, less than 10% of perovskites are observed experimentally in this structure (*21*), leading to an inherent disagreement between the descriptor predictions and experimental observations. Recently, it was shown that of 254 synthesized perovskite oxides (*ABO*_{3}), DFT calculations in the Open Quantum Materials Database (*39*) predict only 186 (70%) to be stable or even moderately unstable (within 100 meV/atom of the convex hull) (*27*). The discrepancy is likely associated with the difference in energy between the true perovskite ground state and the calculated high-symmetry structure(s). Because τ was trained exclusively on the experimental characterization of *ABX*_{3} compounds, τ is informed by the true ground-state (or metastable but observed) structure of each *ABX*_{3} and the potential for these compounds to decompose into any compound(s) in the *A-B-X* composition space. A principal advantage of τ over many existing descriptors is that its identification and validation were based on experimentally observed stability or instability of a structurally diverse dataset.

### Extension to double perovskite oxides and halides

Double perovskites are particularly intriguing as an emerging class of semiconductors that offer a lead-free alternative to traditional perovskite photoabsorbers and an increased compositional tunability for enhancing desired properties such as catalytic activity (*10*, *16*, *40*). Still, the experimentally realized composition space of double perovskites is relatively unexplored compared with the number of possible *A*, *B*, *B′*, and *X* combinations that can form *A*_{2}*BB′X*_{6} compounds. The set of 576 compounds used for training and testing τ is composed of 49 *A* cations, 67 *B* cations, and 5 *X* anions, from which >500,000 double perovskite formulas, *A*_{2}*BB′X*_{6}, can be constructed. Comparison with the Inorganic Crystal Structure Database (ICSD) (*30*, *41*) reveals only 918 compounds (<0.2%) with known crystal structures, 868 of which are perovskite.
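The size of this candidate space can be checked with a short counting exercise. The sketch below assumes that each formula is built from one *A* cation, an unordered pair of two distinct *B* cations, and one *X* anion. This enumeration rule is an assumption of the example, although it is consistent with the ">500,000" and "<0.2%" figures quoted above.

```python
from math import comb

n_a_cations, n_b_cations, n_anions = 49, 67, 5

# Unordered pairs of distinct B-site cations (B, B').
n_b_pairs = comb(n_b_cations, 2)

n_formulas = n_a_cations * n_b_pairs * n_anions
known_fraction = 918 / n_formulas  # compounds with known structures in the ICSD

print(f"{n_formulas:,} candidate formulas; {known_fraction:.2%} structurally characterized")
```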

Although τ was only trained on *ABX*_{3} compounds, it is readily adaptable to double perovskites because it depends only on composition and not structure. To extend τ to *A*_{2}*BB′X*_{6} formulas, *r*_{B} is approximated as the arithmetic mean of the two *B*-site radii (*r*_{B}, *r*_{B′}). τ correctly classifies 91% of these 918 *A*_{2}*BB′X*_{6} compounds in the ICSD (compared with 92% on 576 *ABX*_{3} compounds), recovering 806 of 868 known double perovskites (table S4). The geometric mean has also been used to approximate the radius of a site with two ions (*42*). We find that this has little effect on classification with τ, as 91% of the 918 *A*_{2}*BB′X*_{6} compounds are also correctly classified using the geometric mean for *r*_{B}, and the classification label differs between the arithmetic and geometric means for only 14 of 918 compounds. Although τ was identified using 460 *ABX*_{3} compounds, the agreement with experiment on these compounds (92%) is comparable to that on the 1034 compounds (91%) that span *ABX*_{3} (116 compounds) and *A*_{2}*BB′X*_{6} (918 compounds) formulas and were completely excluded from the development of τ (i.e., test set compounds). This result indicates pronounced generalizability to predicting experimental realization for single and double perovskites that are yet to be discovered. With τ thoroughly validated as being predictive of experimental stability, the space of yet-undiscovered double perovskites was explored to identify 23,314 charge-balanced double perovskites that τ predicts to be stable at ambient conditions (of >500,000 candidates). These compounds are provided in table S4 including assigned oxidation states and radii along with *t* and τ, predictions made using each tolerance factor, and classification in the ICSD where available. There are thousands of additional compounds with substitutions on the *A* and/or *X* sites, *AA′BB′*(*XX′*)_{3}, that are expected to be similarly rich in yet-undiscovered perovskite compounds.
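The extension to *A*_{2}*BB′X*_{6} can be sketched by averaging the two *B*-site radii before evaluating τ. The example below compares the arithmetic and geometric means for Cs_{2}AgBiCl_{6}. The radii are approximate Shannon values in Å quoted for illustration, and the function and parameter names are assumptions of this sketch.

```python
import math


def tau(n_a, r_a, r_b, r_x):
    """tau = r_X/r_B - n_A * (n_A - (r_A/r_B) / ln(r_A/r_B)); requires r_A > r_B."""
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


def tau_double(n_a, r_a, r_b1, r_b2, r_x, mean="arithmetic"):
    """tau for A2BB'X6, with r_B replaced by a mean of the two B-site radii."""
    if mean == "arithmetic":
        r_b = 0.5 * (r_b1 + r_b2)
    elif mean == "geometric":
        r_b = math.sqrt(r_b1 * r_b2)
    else:
        raise ValueError(f"unknown mean: {mean}")
    return tau(n_a, r_a, r_b, r_x)


# Cs2AgBiCl6: Cs+ (XII), Ag+ (VI), Bi3+ (VI), Cl- ; assumed radii in angstrom.
args = dict(n_a=1, r_a=1.88, r_b1=1.15, r_b2=1.03, r_x=1.81)
tau_arith = tau_double(**args, mean="arithmetic")
tau_geom = tau_double(**args, mean="geometric")
print(f"arithmetic: {tau_arith:.3f}, geometric: {tau_geom:.3f}")
```

For *B*-site radii this similar, the two means differ by well under 1%, so τ barely changes. The choice of mean matters only when the two radii are very different and τ lies close to the 4.18 boundary.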

Two particularly attractive classes of materials within this set of *A*_{2}*BB′X*_{6} compounds are double perovskites with *A* = Cs^{+}, *X* = Cl^{−} and *A* = La^{3+}, *X* = O^{2−}, which have garnered substantial interest in a number of applications including photovoltaics, electrocatalysis, and ferroelectricity. The ICSD contains 45 compounds (42 perovskites) with the formula Cs_{2}*BB′*Cl_{6}, 43 of which are correctly classified as perovskite or nonperovskite by τ. From the high-throughput analysis using τ, we predict an additional 420 perovskites to be stable, with 164 having at least as high a probability of perovskite formation as the recently synthesized perovskite, Cs_{2}AgBiCl_{6} [*P*(τ) = 69.6%] (*43*). A map of perovskite probabilities for charge-balanced Cs_{2}*BB′*Cl_{6} compounds is shown in Fig. 3 (lower triangle). Within this set of 164 probable perovskites, there is an opportunity to synthesize double perovskite chlorides that contain 3*d* transition metals substituted on one or both *B* sites, as 83 new compounds of this type are predicted to be stable as perovskite with high probability.

While double perovskite oxides have been explored extensively for a number of applications, the small radius and favorable charge of O^{2−} yields a massive design space for the discovery of new compounds. For La_{2}*BB′*O_{6}, ~63% of candidate compositions are found to be charge-balanced compared with only ~24% of candidate Cs_{2}*BB′*Cl_{6} compounds. The ICSD contains 85 La_{2}*BB′*O_{6} compounds, all of which are predicted to be perovskite by τ in agreement with the experiment. We predict an additional 1128 perovskites to be discoverable in this space, with a remarkable 990 having *P*(τ) ≥ 85% (Fig. 3, upper triangle). All 128 *ABX*_{3} compounds in the experimental set that meet this threshold are experimentally realized as perovskite, suggesting that there is ample opportunity for perovskite discovery in lanthanum oxides.

### Compositional mapping of perovskite stability

In addition to enabling the rapid exploration of stoichiometric perovskite compositions, τ provides the probability of perovskite stability, *P*(τ), for an arbitrary combination of *n*_{A}, *r*_{A}, *r*_{B}, and *r*_{X}, which is shown in Fig. 4. For each grouping shown in Fig. 4, experimentally realized perovskites and nonperovskites are shown as single points to compare with the range of values in the predictions made from τ. Doping at various concentrations presents a nearly infinite number of *A*_{1−x}*A′*_{x}*B*_{1−y}*B′*_{y}(*X*_{1−z}*X′*_{z})_{3} compositions that allows the tuning of technologically useful properties. τ suggests the size and concentration of dopants on the *A*, *B*, or *X* sites that likely lead to improved stability in the perovskite structure. Conversely, compounds that lie in the high-probability region are likely amenable to ionic substitutions that decrease the probability of forming a perovskite but may improve a desired property for another application. For example, LaCoO_{3}, with *P*(τ) = 98.9%, should accommodate reasonable ionic substitutions (i.e., *A* sites of comparable size to La or *B* sites of comparable size to Co) and was recently shown to have enhanced oxygen exchange capacity and nitric oxide oxidation kinetics with stable substitutions of Sr on the *A* site (*44*).

The probability maps in Fig. 4 arise from the functional form of τ (Eq. 2) and provide insights into the stability of the perovskite structure as the size of each ion is varied. The perovskite structure requires that the *A* and *B* cations occupy distinct sites in the *ABX*_{3} lattice, with *A* 12-fold and *B* 6-fold coordinated by *X*. When *r*_{A} and *r*_{B} are too similar, nonperovskite lattices that have similarly coordinated *A* and *B* sites, such as cubic bixbyite, become preferred over the perovskite structure. On the basis of the construct of τ, as *r*_{A}/*r*_{B} → 1, *P*(τ) → 0, which arises from the +*x*/ln(*x*) (*x* = *r*_{A}/*r*_{B}) term, where *x*/ln(*x*) → +∞ as *x* → 1 and larger values of τ lead to lower probabilities of forming perovskites. When *r*_{A} = *r*_{B}, τ is undefined, yet compounds where *A* and *B* have identical radii are rare and not expected to adopt perovskite structures (*t* = 0.71).
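The divergence of the *x*/ln(*x*) term as the cation radii become equal can be made explicit with a short expansion. The block below is a worked derivation (with ε introduced here as a small positive parameter), not a result taken from this paper.

```latex
% Let x = r_A / r_B = 1 + \varepsilon with \varepsilon \to 0^{+}.
\ln x = \varepsilon - \frac{\varepsilon^{2}}{2} + O(\varepsilon^{3})
\quad\Longrightarrow\quad
\frac{x}{\ln x}
  = \frac{1 + \varepsilon}{\varepsilon\left(1 - \varepsilon/2 + O(\varepsilon^{2})\right)}
  = \frac{1}{\varepsilon} + \frac{3}{2} + O(\varepsilon)

% Hence, for any fixed n_A \geq 1 and r_X / r_B,
\tau = \frac{r_X}{r_B} - n_A^{2} + n_A\,\frac{x}{\ln x}
  \;\approx\; \frac{r_X}{r_B} - n_A^{2} + n_A\left(\frac{1}{\varepsilon} + \frac{3}{2}\right)
  \;\longrightarrow\; +\infty
  \quad\text{as } \varepsilon \to 0^{+}
```

The leading 1/ε behavior shows that τ grows without bound as the two cation radii approach each other, which drives the perovskite probability toward zero.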

The octahedral term in τ (*r*_{X}/*r*_{B}) also manifests itself in the probability maps, particularly in the lower bound on *r*_{B} where perovskites are expected as *r*_{X} is varied. As *r*_{X} increases, *r*_{B} must similarly increase to enable the formation of stable *BX*_{6} octahedra. This effect is noticeable when separately comparing compounds containing Cl^{−} (left), Br^{−} (center), and I^{−} (right) (bottom row of Fig. 4), where the range of allowed cation radii decreases as the anion radius increases. For *r*_{B} << *r*_{X}, *r*_{X}/*r*_{B} becomes large, which increases τ and therefore decreases the probability of stability in the perovskite structure. This accounts for the inability of small *B*-site ions to sufficiently separate *X* anions in *BX*_{6} octahedra, where geometric arguments suggest that *B* is sufficiently large to form *BX*_{6} octahedra only for *r*_{B}/*r*_{X} > 0.414. Because the cation radii ratios strongly affect the probability of perovskite, as discussed in the context of *x*/ln(*x*), *r*_{X} also has a noticeable indirect effect on the lower bound of *r*_{A}, which increases as *r*_{X} increases.
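The geometric bound *r*_{B}/*r*_{X} > 0.414 quoted above follows from a hard-sphere argument for an ideal octahedron. The derivation below is a standard textbook sketch, with *d* denoting the *B*-*X* distance (a symbol introduced here for illustration).

```latex
% Ideal BX_6 octahedron: neighbouring X anions sit at 90 degrees about B,
% so their separation is \sqrt{2}\,d, where d is the B--X distance.
% Limiting case: anions touch each other while still touching the cation.
\sqrt{2}\,d = 2 r_X, \qquad d = r_B + r_X
\quad\Longrightarrow\quad
r_B + r_X = \sqrt{2}\, r_X
\quad\Longrightarrow\quad
\frac{r_B}{r_X} = \sqrt{2} - 1 \approx 0.414
```

For smaller *r*_{B}/*r*_{X}, the anions come into contact with each other before they reach the cation, which is the geometric origin of the penalty carried by the *r*_{X}/*r*_{B} term in τ.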

The role of *n*_{A} in τ is more difficult to parse, but its placement dictates two effects on stability: as *A* is more oxidized (increasing *n*_{A}), −*n*_{A}^{2} increases the probability of forming the perovskite structure, but *n*_{A} also magnifies the effect of the *x*/ln(*x*) term, increasing the importance of the cation radii ratio. Notably, *n*_{A} = 1 for most halides and some oxides (245 of the 576 compounds in our set), and in these cases, τ = *r*_{X}/*r*_{B} + *x*/ln(*x*) − 1 for all combinations of *A*, *B*, and *X*, and *n*_{A} plays no role as the composition is varied.

This analysis illustrates how data-driven approaches not only can be used to maximize the predictive accuracy of new descriptors but also can be leveraged to understand the actuating mechanisms of a target property—in this case, perovskite stability. This attribute distinguishes τ from other descriptors for perovskite stability that have emerged in recent years. For instance, three recent works have shown that the experimental formability of perovskite oxides and halides can be separately predicted with high accuracy using kernel support vector machines (*26*), gradient boosted decision trees (*25*), or a random forest of decision trees (*27*). While these approaches can yield highly accurate models, the resulting descriptors are not documented analytically, and therefore, the mechanism by which they make the perovskite/nonperovskite classification is opaque.

## CONCLUSIONS

We report a new tolerance factor, τ, that enables the prediction of experimentally observed perovskite stability significantly better than the widely used Goldschmidt tolerance factor, *t*, and the 2D structure map using *t* and the octahedral factor, μ. For 576 *ABX*_{3} and 918 *A*_{2}*BB′X*_{6} compounds, the prediction by τ agrees with the experimentally observed stability for >90% of compounds, with >1000 of these compounds reserved for testing generalizability (prediction accuracy). The deficiency of *t* arises from its functional form and not the input features, as the calculation of τ requires the same inputs as *t* (composition, oxidation states, and Shannon ionic radii). Thus, τ enables a superior prediction of perovskite stability with negligible computational cost. The monotonic and 1D nature of τ allows the determination of perovskite probability as a continuous function of the radii and oxidation states of *A*, *B*, and *X*. These probabilities are shown to linearly correlate with DFT-computed decomposition enthalpies and help clarify how chemical substitutions at each of the sites modulate the tendency for perovskite formation. Using τ, we predict the probability of double perovskite formation for thousands of unexplored compounds, resulting in a library of stable perovskites ordered by their likelihood of forming perovskites. Because of the simplicity and accuracy of τ, we expect its use to accelerate the discovery and design of state-of-the-art perovskite materials for applications ranging from photovoltaics to electrocatalysis.

## MATERIALS AND METHODS

### Radii assignment

To develop a descriptor that takes as input the chemical composition and outputs a prediction of perovskite stability, the features that comprise the descriptor must also be based only on composition. However, it is not known a priori which cation will occupy the *A* or *B* site given only a chemical composition, *CC′X*_{3} (*C* and *C′* being cations). Therefore, we developed a systematic method for determining which cation is *A* or *B* to enable τ to be applied to an arbitrary new material. First, a list of allowed oxidation states is defined for each cation based on Shannon’s radii (*20*). All pairs of oxidation states for *C* and *C′* that charge-balance *X*_{3} are considered. If more than one charge-balanced pair exists, a single pair is chosen on the basis of the electronegativity ratio of the two cations (χ_{C}/χ_{C′}). If 0.9 < χ_{C}/χ_{C′} < 1.1, the pair that minimizes |*n*_{C} – *n*_{C′}| is chosen, where *n*_{C} is the oxidation state for *C*. Otherwise, the pair that maximizes |*n*_{C} – *n*_{C′}| is chosen. With the oxidation states of *C* and *C′* assigned, the Shannon radii for the cations occupying the *A* and *B* sites are taken at the available coordination numbers closest to 12 and 6, respectively, consistent with the coordination environments of the *A* and *B* cations in the perovskite structure. Last, the radii of the *C* and *C′* cations are compared, and the larger cation is assigned as the *A*-site cation. This strategy reproduced the assignment of the *A* and *B* cations for 100% of 313 experimentally labeled perovskites.
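The oxidation-state and site assignment described above can be sketched as follows. All function names, argument layouts, and example values are hypothetical. The text also does not say how ties between pairs with equal |*n*_{C} − *n*_{C′}| are resolved, so this sketch simply keeps the first such pair it encounters.

```python
from itertools import product


def assign_oxidation_states(states_c, states_cp, chi_c, chi_cp, n_x):
    """Pick one charge-balanced (n_C, n_C') pair for a CC'X3 composition.

    states_c, states_cp: allowed oxidation states of the two cations.
    chi_c, chi_cp: electronegativities of the two cations.
    n_x: (negative) oxidation state of the anion X.
    """
    target = -3 * n_x  # total cation charge needed to balance X3
    pairs = [(a, b) for a, b in product(states_c, states_cp) if a + b == target]
    if not pairs:
        return None  # no charge-balanced assignment exists
    if len(pairs) == 1:
        return pairs[0]

    def spread(pair):
        return abs(pair[0] - pair[1])

    # Similar electronegativities: minimize the charge difference; else maximize it.
    if 0.9 < chi_c / chi_cp < 1.1:
        return min(pairs, key=spread)  # ties: first pair found (assumption)
    return max(pairs, key=spread)


def assign_sites(radius_c, radius_cp):
    """The larger cation takes the A site; returns (A, B) as labels."""
    return ("C", "C'") if radius_c > radius_cp else ("C'", "C")


# Hypothetical cations with made-up oxidation states and electronegativities.
similar = assign_oxidation_states([2, 3, 4], [2, 3, 4], 1.5, 1.6, n_x=-2)
dissimilar = assign_oxidation_states([1, 3], [3, 5], 1.0, 2.0, n_x=-2)
print(similar, dissimilar, assign_sites(1.36, 0.61))
```

In the first call the electronegativity ratio lies inside the 0.9 to 1.1 window, so the most evenly split pair is selected. In the second call the ratio lies outside the window, so the most uneven pair is selected.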

### Selection of τ

For the identification of τ among the offered candidates, the oxidation states (*n*_{A}, *n*_{B}, *n*_{X}), ionic radii (*r*_{A}, *r*_{B}, *r*_{X}), and radii ratios (*r*_{A}/*r*_{B}, *r*_{A}/*r*_{X}, *r*_{B}/*r*_{X}) comprise the primary features, Φ_{0}, where Φ_{n} refers to the descriptor space with *n* iterations of complexity as defined in (*28*). For example, Φ_{1} refers to the primary features (Φ_{0}), together with one iteration of algebraic/functional operations applied to each feature in Φ_{0}. Φ_{2} then refers to the application of algebraic/functional operations to all potential descriptors in Φ_{1}, and so forth. Note that Φ_{m} contains all potential descriptors within Φ_{n<m}, with a filter to remove redundant potential descriptors. For the discovery of τ, complexity up to Φ_{3} is considered, yielding ~3 × 10^{9} potential descriptors. An alternative would be to exclude the radii ratios from Φ_{0} and construct potential descriptors with complexity up to Φ_{4}. However, given the minimal Φ_{0} = [*n*_{A}, *n*_{B}, *n*_{X}, *r*_{A}, *r*_{B}, *r*_{X}], there are ~10^{8} potential descriptors in Φ_{3}, so ~10^{16} potential descriptors would be expected in Φ_{4} (based on ~10^{2} being present in Φ_{1} and ~1 × 10^{4} in Φ_{2}), and this number is impractical to screen using available computing resources.
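The extrapolation to Φ_{4} can be written out explicitly. The block below restates the sizes quoted above for the minimal feature set. Reading the growth as an approximate squaring at each iteration (as would be expected if binary operations on pairs of existing descriptors dominate the count) is an interpretive assumption of this sketch.

```latex
|\Phi_1| \sim 10^{2}, \qquad
|\Phi_2| \sim 10^{4} \approx |\Phi_1|^{2}, \qquad
|\Phi_3| \sim 10^{8} \approx |\Phi_2|^{2}
\quad\Longrightarrow\quad
|\Phi_4| \sim |\Phi_3|^{2} \approx 10^{16}
```

Adding the three radii ratios to Φ_{0} therefore gives access to descriptors that would otherwise require a fourth iteration, while keeping the search at ~3 × 10^{9} candidates rather than ~10^{16}.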

The dataset of 576 *ABX*_{3} compositions was partitioned randomly into an 80% training set for identifying candidate descriptors and a 20% test set for analyzing the predictive ability of each descriptor. The top 100,000 potential descriptors most applicable to the perovskite classification problem were identified using one iteration of SISSO with a subspace size of 100,000. Each descriptor in the set of ~3 × 10^{9} was ranked according to domain overlap, as described by Ouyang *et al*. (*28*). To identify a decision boundary for classification, a decision tree classifier with a maximum depth of two was fit to the top 100,000 candidate descriptors ranked based on domain overlap. Domain overlap (and not decision tree performance) was used as the SISSO ranking metric because of the much lower computational expense associated with applying this metric. Notably, τ was the 14,467th highest ranked descriptor by SISSO using the domain overlap metric, and hence, this defines the minimum subspace required to identify τ using this approach. Without evaluating a decision tree model for each descriptor in the set of ~3 × 10^{9} potential descriptors, we cannot be certain that a subspace size of 100,000 is sufficient to find the best descriptor. However, the identification of τ within a subspace as small as 15,000 suggests that a subspace size of 100,000 is sufficiently large to efficiently screen the much larger descriptor space. We have also conducted a test on this primary feature space (Φ_{0} = [*n*_{A}, *n*_{B}, *n*_{X}, *r*_{A}, *r*_{B}, *r*_{X}, *r*_{A}/*r*_{B}, *r*_{A}/*r*_{X}, *r*_{B}/*r*_{X}]) with a subspace size of 500,000. Even after increasing the subspace size by 5×, τ remains the highest performing descriptor (a classification accuracy of 92% on the 576-compound set). An important distinction between the SISSO approach described here and by Ouyang *et al*. (*28*) is the choice of sparsifying operator (SO). 
In this work, domain overlap was used to rank the features in SISSO, but a decision tree with a maximum depth of two was used as the SO (instead of domain overlap) to identify the best descriptor of those selected by SISSO. This alternative SO was used to decrease the leverage of individual data points, as the experimental labeling of perovskite/nonperovskite is prone to some ambiguity based on synthesis conditions, defects, and other experimental considerations.
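
The screening step can be sketched for 1D descriptors as follows. This is a simplified illustration under stated assumptions, not the SISSO implementation: domain overlap is taken here as the number of samples lying inside the overlap of the two class intervals, ties are left in input order, and the function names, descriptor names, and data are hypothetical.

```python
def domain_overlap_count(values, labels):
    """Count samples that fall inside the overlap of the two class ranges.

    In one dimension the convex hull of each class is just the interval
    [min, max] of its descriptor values.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    lo = max(min(pos), min(neg))
    hi = min(max(pos), max(neg))
    if lo > hi:  # intervals are disjoint: perfect separation
        return 0
    return sum(lo <= v <= hi for v in values)


def screen(candidates, labels, subspace_size):
    """Return the names of the `subspace_size` descriptors with least overlap."""
    ranked = sorted(
        candidates,
        key=lambda name: domain_overlap_count(candidates[name], labels),
    )
    return ranked[:subspace_size]


# Synthetic labels and descriptor values, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
candidates = {
    "d_good": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "d_mixed": [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
    "d_bad": [1.0, 5.0, 2.0, 3.0, 6.0, 4.0],
}
selected = screen(candidates, labels, subspace_size=2)
print(selected)
```

Only the descriptors that survive this inexpensive screen would then be passed to the more costly classifier-based evaluation, which is why the ranking metric and the final selection criterion can differ.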

The benefit of including the radii ratios in Φ_{0} was made clear by comparing the performance of τ to the best descriptor obtained using the minimal primary feature space with Φ_{0} = [*n*_{A}, *n*_{B}, *n*_{X}, *r*_{A}, *r*_{B}, *r*_{X}]. Repeating the procedure used to identify τ yields a Φ_{3} with ~1 × 10^{8} potential descriptors. The best 1D descriptor was found to be , with a classification accuracy of 89%.

### Alternative features

We also considered the effects of including properties outside of those required to compute *t* or τ. The primary feature space was expanded to Φ_{0} = [*n*_{A}, *n*_{B}, *n*_{X}, *r*_{A}, *r*_{B}, *r*_{X}, *r*_{cov,A}, *r*_{cov,B}, *r*_{cov,X}, *IE*_{A}, *IE*_{B}, *IE*_{X}, χ_{A}, χ_{B}, χ_{X}], where *r*_{cov,i} is the empirical covalent radius of neutral element *i*, *IE*_{i} is the empirical first ionization energy of neutral element *i*, and χ_{i} is the Pauling electronegativity of element *i*. All values were taken from WebElements (*45*), which aggregates a number of references that are listed therein. Repeating the procedure used to identify τ results in ~6 × 10^{10} potential descriptors in Φ_{3}. The best performing 1D descriptor was found to be , with a classification accuracy of 90%. This is lower than the accuracy of τ, which makes use of only the oxidation states and ionic radii, and only slightly higher than the accuracy of the descriptor obtained using the minimal feature set.

### Increasing dimensionality

To assess the performance of descriptors with increased dimensionality, following the approach to higher dimensional descriptor identification using SISSO described in (*28*), the residuals from classification by τ (those misclassified by the decision tree, Fig. 2B) were used as the target property in the search for a second dimension to include with τ. From the same set of ~3 × 10^{9} potential descriptors constructed to identify τ, the 100,000 1D descriptors that best classify the 41 training set compounds misclassified by τ were identified on the basis of domain overlap. Each of these 100,000 descriptors was paired with τ, and the performance of each 2D descriptor was assessed using a decision tree with a maximum depth of two. The best performing 2D descriptor was found to be , with a classification accuracy of 95% on the 576-compound set. Improvements are expected to diminish as the dimensionality increases further due to the iterative nature of SISSO and the higher-order residuals used for subspace selection. Although the second dimension leads to slightly improved classification performance on the experimental set compared with τ, the simplicity and monotonicity of τ, which enables physical interpretation and the extraction of meaningful probabilities, support its selection instead of the more complex 2D descriptor. The benefits and capabilities of having a meaningfully probabilistic 1D tolerance factor, such as τ, are described in detail within the main text.

### Potential for overfitting

The SISSO algorithm as implemented here selects τ from a space of ~3 × 10^{9} candidate descriptors, and the only parameter that is fit is the optimum value of τ that defines the decision boundary for classification as perovskite or nonperovskite, τ = 4.18. This decision boundary was optimized using a decision tree to maximize the classification accuracy on the training set of 460 compounds. In this case, Gini impurity was minimized to optimize the decision boundary, but alternative cost functions based on Kullback-Leibler divergence or classification accuracy (e.g., l_{2}) would find the same decision boundary. Although SISSO identifies the descriptor from billions of candidates, these functions comprise a discrete set; i.e., they form a basis in a high-dimensional space, where the number of training points is the dimensionality of the space, and this space is not densely covered by the functions. Therefore, the selection of only one function, τ, cannot overfit the data. If some physical mechanism determining the stability of perovskites is not represented in the training set, it might be missed by the learned formula (here, τ), and the generalizability of the model would be hampered. Nevertheless, the 94% accuracy achieved by τ on the excluded test set of 116 compounds shows that τ can generalize outside of the training data.
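
How a single decision boundary follows from minimizing Gini impurity can be shown with a short sketch. The descriptor values and labels below are synthetic and are not taken from the dataset; the function names are hypothetical, and the exhaustive midpoint search is one simple way to find the split, not necessarily the routine used in this work.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Find the split on a 1D descriptor that minimizes weighted Gini impurity.

    Candidate thresholds are the midpoints between consecutive distinct
    sorted values. Returns (threshold, weighted_impurity).
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(xs)
    best_score, best_split = float("inf"), None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical values
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score, best_split = score, (xs[i - 1] + xs[i]) / 2.0
    return best_split, best_score


# Synthetic example: label 1 = perovskite, 0 = nonperovskite.
values = [2.0, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0]
labels = [1, 1, 0, 1, 0, 0, 0]
threshold, impurity = best_threshold(values, labels)
print(threshold, round(impurity, 4))
```

The threshold is the only quantity fit to the data in this step, which mirrors the statement that the decision boundary is the sole fitted parameter once the functional form of the descriptor is fixed.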

### Alternative radii for more covalent compounds

Ionic radii are required inputs for τ (and *t*), and although the Shannon effective ionic radii are ubiquitous in solid-state materials research, a new set of *B*^{2+} radii was recently proposed for 18 cations to account for how their effective cationic radii vary as a function of increased covalency with the heavier halides (*19*). These revised radii apply to 129 of the 576 experimentally characterized compounds compiled in this dataset (62% of halides). Using these revised radii lowers the accuracy of τ for these 129 compounds to 86%, a decrease of 5 percentage points from the 91% obtained using the Shannon radii for the same compounds. The application of τ using Shannon radii for presumably covalent compounds was further validated by noting that τ correctly classifies 37 of 40 compounds that contain Sn or Pb and achieves an accuracy of 91% for 141 compounds with *X* = Cl^{−}, Br^{−}, or I^{−}. In addition to the higher accuracy achieved by τ when using Shannon radii, we note that the Shannon radii are more comprehensive than the revised radii in (*19*), applying to more ions, oxidation states, and coordination environments, and are thus recommended for the calculation of τ.

### Computer packages used

The SISSO calculations were performed using a Fortran 90 implementation. Platt scaling (*29*) was used to extract classification probabilities for τ by fitting a logistic regression model on the decision tree classifications using threefold cross-validation. Decision tree fitting and Platt scaling were performed within the Python package scikit-learn. Data visualizations were generated within the Python packages Matplotlib and Seaborn.
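
The idea behind Platt scaling can be sketched without any external package. The example below fits the two sigmoid parameters by plain gradient descent on the log loss, taking the descriptor value itself as the score; this is an assumption made for illustration, and the threefold cross-validation and the scikit-learn routines used in this work are not reproduced. All names and data are hypothetical.

```python
import math


def fit_platt(scores, labels, lr=0.5, steps=2000):
    """Fit P(y=1 | s) = 1 / (1 + exp(a*s + b)) by gradient descent on log loss."""
    mean = sum(scores) / len(scores)
    centered = [s - mean for s in scores]  # centering speeds up convergence
    a, b0 = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(centered, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b0))
            grad_a += (y - p) * s  # d(log loss)/da for this sample
            grad_b += (y - p)      # d(log loss)/db for this sample
        a -= lr * grad_a / n
        b0 -= lr * grad_b / n
    return a, b0 - a * mean  # undo the centering in the intercept


def platt_probability(score, a, b):
    """Probability of the positive class for a given score."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Synthetic scores and labels (1 = perovskite), for illustration only.
scores = [2.0, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0]
labels = [1, 1, 0, 1, 0, 0, 0]
a, b = fit_platt(scores, labels)
for s in (2.0, 4.0, 7.0):
    print(s, round(platt_probability(s, a, b), 3))
```

Because the fitted sigmoid is monotonic in the score, a monotonic 1D descriptor maps directly onto a probability that varies smoothly across the decision boundary.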

## SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/2/eaav0693/DC1

Table S1. The 576 *ABX*_{3} used for training and testing τ.

Table S2. Confusion matrices for τ (above) and *t* (below).

Table S3. Additional information associated with Fig. 2D.

Table S4. Double perovskite oxides and halides.

Fig. S1. Comparing the performance of *t* and τ by composition.

Fig. S2. Sigmoidal relationship between *P*(τ) and τ.

Fig. S3. (*t*, μ) structure map for 576 *ABX*_{3} solids.

This is an open-access article distributed under the terms of the Creative Commons Attribution license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## REFERENCES AND NOTES

**Acknowledgments:** We thank A. Holder for helpful discussions regarding the manuscript.

**Funding:** This project has received funding from the European Union’s Horizon 2020 research and innovation program (#676580: The NOMAD Laboratory—A European Center of Excellence and #740233: TEC1p), the Berlin Big-Data Center (BBDC, #01IS14013E), and BiGmax, the Max Planck Society’s Research Network on Big-Data-Driven Materials-Science. C.J.B. acknowledges support from a U.S. Department of Education Graduate Assistantship in Areas of National Need. C.S. acknowledges funding by the Alexander von Humboldt Foundation. C.B.M. acknowledges support from NSF award CBET-1433521, which was cosponsored by the NSF and the U.S. Department of Energy (DOE), Office of Energy Efficiency and Renewable Energy (EERE), Fuel Cell Technologies Office and from DOE award EERE DE-EE0008088. Part of this research was performed using computational resources sponsored by the U.S. DOE, Office of EERE and located at the National Renewable Energy Laboratory.

**Author contributions:** M.S. and C.J.B. conceived the idea. C.J.B., C.S., and B.R.G. designed the studies. C.J.B. performed the studies. C.J.B., C.S., and B.R.G. analyzed the results and wrote the manuscript. R.O. provided the SISSO algorithm and facilitated its implementation. C.B.M., L.M.G., and M.S. supervised the project. All the authors discussed the results and implications and edited the manuscript.

**Competing interests:** The authors declare that they have no competing financial interests.

**Data and materials availability:** A repository containing all files necessary for classifying *ABX*_{3} and *AA′BB′*(*XX′*)_{3} compositions as perovskite or nonperovskite using τ is available at https://github.com/CJBartel/perovskite-stability. A graphical interface allowing users to classify compounds with τ is also available at https://analytics-toolkit.nomad-coe.eu. The classification of all compounds shown in the manuscript is available in the Supplementary Materials. All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.

- Copyright © 2019 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution License 4.0 (CC BY).