Research Article | Applied Mathematics

Semblance: An empirical similarity kernel on probability spaces


Science Advances  04 Dec 2019:
Vol. 5, no. 12, eaau9630
DOI: 10.1126/sciadv.aau9630

Abstract

In data science, determining proximity between observations is critical to many downstream analyses such as clustering, classification, and prediction. However, when the data’s underlying probability distribution is unclear, the function used to compute similarity between data points is often arbitrarily chosen. Here, we present a novel definition of proximity, Semblance, that uses the empirical distribution of a feature to inform the pair-wise similarity between observations. The advantage of Semblance lies in its distribution-free formulation and its ability to place greater emphasis on proximity between observation pairs that fall at the outskirts of the data distribution, as opposed to those toward the center. Semblance is a valid Mercer kernel, allowing its principled use in kernel-based learning algorithms, and for any data modality. We demonstrate its consistently improved performance against conventional methods through simulations and real case studies from diverse applications in single-cell transcriptomics, image reconstruction, and financial forecasting.

INTRODUCTION

In modern data analysis, the data are often first reduced to a proximity matrix representing the pairwise similarity between observations, which becomes the input to downstream analyses such as clustering, classification, and prediction. This proximity matrix is an information map as well as an information bottleneck—the former, because all of the information available to a downstream analysis algorithm is represented in the matrix, and the latter, because the matrix must transmit enough information about the data for any downstream method to be able to do its task. In exploratory data analysis settings, Euclidean distance– or correlation-based metrics are popular ad hoc choices for inferring proximity (1–3), although more sophisticated, context-specific choices have been designed for particular tasks (4, 5).

During the past two decades, efficient kernel-based learning algorithms and their reproducing kernel Hilbert space (RKHS) interpretations have generated intense renewed interest. Specifically, efforts have focused on the development of proximity matrices that satisfy the Mercer condition, which would allow the detection of complex nonlinear patterns in the data using well-understood linear algorithms (6–8). Such proximity matrices, called Mercer kernels, form the core of several state-of-the-art machine learning systems. Constructing a similarity function or a proximity matrix amounts to encoding our prior expectation about the possible patterns we expect to find in a given feature space, and thus, it is a critical step in real-world data modeling (9, 10). Noise distributions in the real world are often non-elliptical, with continuous and discrete features generally intermixed. Yet, in the initial stages of data analysis, when the underlying structure of the data’s probability space is unclear, the choice of the similarity/distance metric is often arbitrary. In exploratory data analysis, there is often little prior knowledge to guide the selection of distance/similarity measures, much less the design of valid Mercer kernels. Thus, even as kernel-based machine learning algorithms become more sophisticated, due to the lack of more informed options, we often default to relying on Euclidean distance or Pearson correlation during the exploratory stage of data analysis.

Here, we present a general, off-the-shelf kernel function, Semblance, that uses the empirical distribution of a feature across all observations to inform the proximity between each pair. Semblance puts a premium on agreement in rare feature values, for discrete features, and on proximity between points at the outskirts of the data distribution, for continuous features. This allows Semblance to reveal structures in the data that are obscured by current, commonly used kernel measures. We first describe the intuitions behind Semblance using a concrete example and subsequently prove that it is a valid Mercer kernel and thus can be used in any kernel-based learning algorithm (11). Then, under simplified but transparent simulation experiments, we systematically explore the types of patterns that we can expect to identify using Semblance versus other common approaches such as Euclidean distance. Semblance achieves higher sensitivity for niche features by adapting to the empirical data distribution. Through examples from several fields—single-cell biology, image analysis, and finance—we demonstrate how the Semblance kernel can be used.

Constructing the rank-based semblance function

Suppose we begin with Nn × G, the data matrix with n rows and G columns. Let each row correspond to an object and each column correspond to a feature measured for all objects. We would like to construct a similarity kernel relating the objects. For ease of notation, let X = (x1, …, xg, …, xG) and Y = (y1, …, yg, …, yG) be two objects, i.e., two rows in the matrix Nn × G.

Consider a feature for which most objects record the value “1,” and only very few record the value “0.” Now, consider two objects, both of which record the rare value “0” for this feature. Is this stronger evidence for similarity between these two objects, as opposed to the scenario where both record the much more typical value “1”? Intuitively, it is much more improbable for two independent objects to agree on a rare value than on a common value, and thus, two objects agreeing on the rare values for many features suggests that they may belong to a common niche group. Similarly, for a continuous-valued feature, proximity in terms of absolute distance between two independent draws is much more unlikely at the tails of the distribution as compared to at its center. Thus, for discrete features, we would like to reward agreement between objects on rare feature values, and for continuous features, we would like to reward proximity between objects in the tail of the empirical distribution. Furthermore, for robustness, it is desirable for the similarity function to be nonparametric and invariant to monotone transformations of the features. These are the considerations underlying the construction of Semblance.

More formally, for a given feature g, let ℙg be its underlying probability distribution. Let the observed values for this feature g in objects X and Y be xg and yg, respectively. In practice, we do not know ℙg, but if we did, we could ask how likely we would be, if we were to redraw one of the two values (xg, yg), to obtain a distance that is equal to or smaller than the actual observed distance, while preserving the order between the two. Let Z denote the redraw; then this can be expressed as the probability

$$p_g(x_g, y_g) = p_g(y_g, x_g) = \mathbb{P}_g\{\min(x_g, y_g) \le Z \le \max(x_g, y_g)\} \tag{1}$$

The above probability is a natural measure of the dissimilarity between any two values of feature g (see Fig. 1). A subtle but important detail is that the probability in Eq. 1 includes both endpoints xg and yg, and therefore pg (xg, xg) = ℙ{xg} > 0. This definition of proximity is desirable because it naturally incorporates the information in the underlying probability measure that generated the data. For example, as illustrated in Fig. 1, in the binary setting, it is much more rare for two observations to both be equal to 0 if 0 has low probability, and thus, the “reward” for xg = yg = 0 depends on the probability mass at 0. Similarly, in the continuous setting, the reward for proximity between xg and yg depends on where the pair falls on the distribution. For the same linear distance between xg and yg, their dissimilarity is higher when they both fall at the center of the distribution than when they both fall at the tails.

Fig. 1 Illustration of what pg(xg, yg) corresponds to in the case of a discrete distribution or a continuous distribution.

In this toy example, X and Y are two objects with four features measured. Semblance computes an empirical distribution from the data for each feature and uses the information of where the observations fall on that distribution to determine how similar they are to each other. Specifically, it emphasizes relationships that are less likely to occur by chance and that lie at the tail ends of a probability distribution. For example, X and Y are equal to 0 for both the first and second feature, but these two features contribute different values to the kernel: “0” is more rare for the second feature, and thus p2 (0, 0) is smaller than p1 (0, 0), and the second feature contributes a higher value in the Semblance kernel. Similarly, even though the difference between X and Y is 1 for both features 3 and 4, feature 4, where the values fall in the tail, has lower pg(xg, yg) and thus contributes a higher value in the Semblance kernel than feature 3.

In practice, ℙg is not known, but with a large enough sample size, the empirical distribution ℙ̂g serves as a good approximation, leading to the plug-in empirical estimate p̂g(xg, yg), obtained by substituting ℙ̂g for ℙg in Eq. 1. This is reminiscent of empirical Bayes methods, where information is borrowed across all observed values to inform our dissimilarity evaluation between any given pair. We define

$$k_g(X, Y) = 1 - \hat{p}_g(x_g, y_g) = \frac{1}{n}\sum_{i=1}^{n}\left[1 - I\bigl(\min(x_g, y_g) \le N_{ig} \le \max(x_g, y_g)\bigr)\right]$$

the empirical probability of falling strictly outside the interval [min(xg, yg), max(xg, yg)]. The indicator I returns 1 if min(xg, yg) ≤ Nig ≤ max(xg, yg), and 0 otherwise.
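As a concrete illustration, this plug-in estimate can be computed in a few lines of R. The sketch below is written for this exposition (the helper name kg_single is ours, not part of the released package); v holds the n observed values of one feature, and x and y are the values recorded for the two objects.

kg_single <- function(x, y, v) {
  inside <- v >= min(x, y) & v <= max(x, y)  # indicator of falling in [min, max], endpoints included
  mean(!inside)                              # empirical probability of falling strictly outside
}

# Example: agreement on a rare "0" is rewarded more than agreement on a common "1"
v <- c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
kg_single(0, 0, v)   # 0.8
kg_single(1, 1, v)   # 0.2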

Suppose feature g is continuous, so that each observed value is unique, and let rX, rY be the ranks of xg, yg among all observed values of this feature across the n objects. Then, kg can be expressed simply as a function of the ranks

$$k_g(X, Y) = 1 - \frac{|r_X - r_Y| + 1}{n}$$
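As a quick worked check of this formula: with n = 10 distinct values and ranks rX = 3 and rY = 6, the closed interval [xg, yg] contains exactly |rX − rY| + 1 = 4 of the n observations, so kg(X, Y) = 1 − 4/10 = 0.6.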

However, for discrete features, the computation of kg(X, Y) is more complicated owing to ties. Nevertheless, computing kg(X, Y) in general is easy and fast. An example algorithm is provided in Materials and Methods.

We now define the Semblance function as

$$K(X, Y) = \frac{1}{G}\sum_{g=1}^{G} k_g(X, Y)\, w_g \tag{2}$$

where wg corresponds to the relative weight or importance of each feature. When there is reliable domain knowledge to prioritize features, these should be used to construct the weights. When no a priori information is available, a weight that reflects the shape of the feature distribution can be used, for example, the Gini coefficient for positive-valued features or a robust approximation to the negentropy for real-valued features. In our experiments, we have found that considering all features to be equally important (wg = 1) gives decent results in most cases.
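To make Eq. 2 concrete, here is a minimal, brute-force R sketch of the kernel (our own illustration, not the CRAN package's ranksem implementation): it takes the n × G data matrix N and an optional weight vector and returns the n × n Semblance matrix, with the equal-weight default wg = 1.

semblance_kernel <- function(N, w = rep(1, ncol(N))) {
  n <- nrow(N); G <- ncol(N)
  K <- matrix(0, n, n)
  for (g in seq_len(G)) {
    v <- N[, g]
    Kg <- matrix(0, n, n)
    for (i in seq_len(n)) {
      for (j in i:n) {
        lo <- min(v[i], v[j]); hi <- max(v[i], v[j])
        Kg[i, j] <- Kg[j, i] <- mean(v < lo | v > hi)  # k_g = 1 - p_hat_g
      }
    }
    K <- K + w[g] * Kg
  }
  K / G
}

# Example: 20 objects, 5 binary features
set.seed(1)
N <- matrix(rbinom(100, size = 1, prob = 0.2), nrow = 20)
K <- semblance_kernel(N)
all(abs(K - t(K)) < 1e-12)   # TRUE: the output is symmetric

The double loop is O(n²G) and is meant only to make the definition transparent; a rank-based formulation such as the one outlined in Materials and Methods avoids rescanning each column for every pair of objects.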

Since p̂g(xg, xg) ≤ p̂g(xg, yg) whenever xg ≠ yg, it follows that K(X, X) ≥ K(X, Y) for all X ≠ Y.

Thus, when applied to any data matrix N, this function outputs a symmetric n × n matrix whose rows and columns are maximized at the diagonal.

RESULTS

Semblance is a valid Mercer kernel

Since K(X, Y) is just the mean of kg(X, Y) across g, we start by considering

$$K_g = \bigl\{K_g(i, j) = k_g(N_{ig}, N_{jg}) : 1 \le i, j \le n\bigr\}$$

the matrix derived only from observations of feature g. First, assume that the objects have been permuted such that {Nig : i = 1, …, n} are monotone nondecreasing. Define

$$a_i = \hat{\mathbb{P}}(Z < N_{ig}) \quad \text{and} \quad b_i = \hat{\mathbb{P}}(Z > N_{ig}) \tag{3}$$

suppressing the notational dependence of ai and bi on g, for simplicity. On the basis of Eq. 1, for i ≤ j

$$K_g(i, j) = a_i + b_j \tag{4}$$

By our monotone nondecreasing assumption, ai ≤ ai+1 and bi ≥ bi+1. Thus, Kg has the decomposition

$$K_g = \begin{bmatrix} a_1 & a_1 & \cdots & a_1 \\ a_1 & a_2 & \cdots & a_2 \\ \vdots & \vdots & \ddots & \vdots \\ a_1 & a_2 & \cdots & a_n \end{bmatrix} + \begin{bmatrix} b_1 & b_2 & \cdots & b_n \\ b_2 & b_2 & \cdots & b_n \\ \vdots & \vdots & \ddots & \vdots \\ b_n & b_n & \cdots & b_n \end{bmatrix} = M + N$$

Remark: The matrices M and N have a symmetric and analogous structure. The left upper hook comprising the first row and column of M has all entries a1, the second hook has all entries a2, and so on, until the nth hook, which is simply the entry an. Similarly, the right lower hook of N comprising the last row and column has all entries bn, all the way up to the solo entry b1 in the first row and column.

Proposition 1: M is a nonnegative-definite (NND) matrix.

The proof is by induction. For the base case, consider the 2 × 2 matrix

$$\dot{M} = \begin{bmatrix} a_1 & a_1 \\ a_1 & a_2 \end{bmatrix}$$

By construction a1 ≥ 0 and a2 ≥ a1, so det(Ṁ) = a1(a2 − a1) ≥ 0 and the diagonal entries of Ṁ are nonnegative; hence Ṁ is NND. The induction hypothesis is that all (m − 1) × (m − 1) matrices M̄ with the structure

$$\begin{bmatrix} a_2 & a_2 & \cdots & a_2 \\ a_2 & a_3 & \cdots & a_3 \\ \vdots & \vdots & \ddots & \vdots \\ a_2 & a_3 & \cdots & a_m \end{bmatrix} \quad \text{where } a_2 \le a_3 \le \cdots \le a_m \tag{5}$$

are NND. To prove that the same is true for m × m matrices, we can write such matrices in the form

$$\begin{bmatrix} a_1 & U \\ U^{T} & \bar{M} \end{bmatrix} \tag{6}$$

where U is the row vector (a1 a1 ⋯ a1) and M̄ is a matrix that satisfies the induction hypothesis. Using the Schur complement condition for the nonnegative definiteness of a symmetric matrix (12), it suffices to show that the Schur complement of the entry a1 is NND:

$$\bar{M} - U^{T} a_1^{-1} U = \bar{M} - \begin{bmatrix} a_1 & a_1 & \cdots & a_1 \\ a_1 & a_1 & \cdots & a_1 \\ \vdots & \vdots & \ddots & \vdots \\ a_1 & a_1 & \cdots & a_1 \end{bmatrix} = \begin{bmatrix} a_2 - a_1 & a_2 - a_1 & \cdots & a_2 - a_1 \\ a_2 - a_1 & a_3 - a_1 & \cdots & a_3 - a_1 \\ \vdots & \vdots & \ddots & \vdots \\ a_2 - a_1 & a_3 - a_1 & \cdots & a_m - a_1 \end{bmatrix}$$

This resultant matrix is of a form that satisfies Eq. 5 and thus, by the induction hypothesis, is NND. Therefore, the matrix (Eq. 6) is also NND.

Since N mirrors the properties of M by construction, we have by Proposition 1 that N is also an NND matrix. For NND matrices, it is also true that (i) the sum of NND matrices is NND, and (ii) applying the same permutation to the rows and columns of an NND matrix preserves the NND structure (see proofs in Materials and Methods). On the basis of these facts, together with Proposition 1, the kernel matrix K (sum of all Kg’s) is NND. The matrix K computed on any data matrix by the Semblance function defined in Eq. 2 is NND, and thus, Semblance is a valid Mercer kernel. As a result, the Representer theorem allows effective implementation of nonlinear mappings through inner products represented by our kernel function (6, 11). We review the theory governing the existence of an RKHS and a feature space for Semblance in the Supplementary Materials.
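As an informal numerical check (illustration only, not a substitute for the proof), the smallest eigenvalue of a Semblance matrix computed with the semblance_kernel sketch given earlier should be nonnegative up to floating-point error, for both continuous and discrete inputs.

set.seed(2)
N_cont <- matrix(rnorm(200), nrow = 20)           # 20 objects, 10 continuous features
N_disc <- matrix(rbinom(200, 1, 0.1), nrow = 20)  # 20 objects, 10 rare binary features
min(eigen(semblance_kernel(N_cont), symmetric = TRUE)$values)  # >= 0 up to ~1e-15
min(eigen(semblance_kernel(N_disc), symmetric = TRUE)$values)  # >= 0 up to ~1e-15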

Semblance is conceptually different from rank-based similarity measures

Since, in the case where all features are continuous, Semblance can be simplified to a function on ranks, we first clarify how it differs from existing rank-based similarity measures: Spearman’s rho (ρ) and Kendall’s tau (τ). By construction, Semblance is fundamentally different from these existing measures in two ways. First, while ρ and τ are based on ranks computed by ordering the values within each object (the rows of matrix N), Semblance is computed using ranks determined by ordering the values within each feature (the columns of matrix N). Thus, the Semblance kernel can be expected to produce values that differ substantially from these two measures. Second, Semblance treats ties differently from simple rank-based methods, such that ties shared by many objects diminish the proximity between those objects. This treatment of ties, for discrete data, makes Semblance more sensitive for niche subgroups in the data. Therefore, Semblance is better understood through the lens of empirical Bayes, where, for each feature, the empirical distribution across all objects informs our evaluation of the similarity between each pair of objects.

Simulations

Simulations allow us to compare the effectiveness of similarity/distance measures under simplified but interpretable settings. We used simulations to compare Semblance against Euclidean distance, Pearson correlation, and Spearman correlation in their ability to separate two groups in an unsupervised setting. We simulated from a two-group model, where multivariate objects came from either group 1, with probability q < 0.5, or group 2, with probability 1 − q. Let each object contain m features, drawn independently, with a proportion p ∈ (0, 1) of the features being informative. The informative features have distribution PI,1 in group 1 and PI,2 in group 2. The rest of the features are non-informative and have the same distribution PNI across both groups. We consider both continuous and discrete distributions for the features. In the continuous case, the features are generated fromPNI=N(0,1), PI,1=N(μσ2,σ1), PI,2=N(0,σ2)(7)

In the discrete case, the features are generated from

$$P_{NI} = P_{I,2} = \mathrm{Bernoulli}(r_0), \qquad P_{I,1} = \mathrm{Bernoulli}(r_1) \tag{8}$$

Of course, whether a feature is informative or not, and whether an object is from group 1 or group 2, is not used when computing the similarity/distance matrix.

As shown in Fig. 2A, in each simulation run, we generated n objects with the first n1 = qn coming from group 1 and the next n2 = (1 − q)n coming from group 2. Our goal is to detect the existence of the minority group 1 and assign objects to the appropriate group. Similarities (Semblance, Pearson, and Spearman) and distances (Euclidean) are computed on these data, each producing an n × n matrix, which we will call S. Let

$$\bar{S}_{11} = \frac{1}{\binom{n_1}{2}} \sum_{1 \le i < j \le n_1} S_{ij}, \qquad \bar{S}_{22} = \frac{1}{\binom{n_2}{2}} \sum_{n_1 < i < j \le n} S_{ij}, \qquad \bar{S}_{12} = \frac{1}{n_1 n_2} \sum_{i \le n_1 < j \le n} S_{ij}$$

Fig. 2 Simulations exploring the effectiveness of similarity/distance measures.

(A) Setup for one simulation run. (B) T1 (top) and T2 (bottom) values for each similarity/distance metric, for varying values of σ1 ∈ [0.1, 1] (horizontal axis) and σ2 ∈ [0.1, 1] (vertical axis). (C) T1 (top) and T2 (bottom) values for each similarity/distance metric, for varying values of r1 ∈ [0.1, 1] (horizontal axis) and q ∈ [0.1, 1] (vertical axis).

Then, S̄11 is the mean similarity/distance between objects in group 1, S̄22 is the mean similarity/distance between objects in group 2, and S̄12 is the mean similarity/distance across groups. To quantify the signal in S, we let

$$T_1 = \frac{\bar{S}_{11} - \bar{S}_{12}}{\mathrm{se}_1}, \qquad T_2 = \frac{\bar{S}_{22} - \bar{S}_{12}}{\mathrm{se}_2}$$

where se1 and se2 are the standard errors of the differences in the numerators. Hence, large positive values of T1 and T2 imply that downstream algorithms based on S will be able to separate the two groups well.
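For readers who want to reproduce this kind of experiment, the R sketch below generates one run under the normal model of Eq. 7 and computes T1 and T2 for Semblance using the semblance_kernel sketch given earlier. The parameter names mirror the text; se1 and se2 are taken here to be the usual two-sample standard errors of a difference in means, which is our reading of the definition above.

simulate_T <- function(n = 100, m = 100, p = 0.1, q = 0.1,
                       mu = 2, sigma1 = 0.5, sigma2 = 0.5) {
  n1 <- round(q * n); n2 <- n - n1; mI <- round(p * m)
  grp <- rep(1:2, c(n1, n2))
  N <- matrix(rnorm(n * m), n, m)                           # non-informative features: N(0, 1)
  N[grp == 1, 1:mI] <- rnorm(n1 * mI, mu * sigma2, sigma1)  # informative features, group 1
  N[grp == 2, 1:mI] <- rnorm(n2 * mI, 0, sigma2)            # informative features, group 2
  S <- semblance_kernel(N)
  s11 <- S[grp == 1, grp == 1][upper.tri(diag(n1))]
  s22 <- S[grp == 2, grp == 2][upper.tri(diag(n2))]
  s12 <- as.vector(S[grp == 1, grp == 2])
  se <- function(a, b) sqrt(var(a) / length(a) + var(b) / length(b))
  c(T1 = (mean(s11) - mean(s12)) / se(s11, s12),
    T2 = (mean(s22) - mean(s12)) / se(s22, s12))
}

simulate_T()   # one run at the default parameter values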

Figure 2B shows the T1 and T2 values for an example set of simulations where n = m = 100, the proportion of informative features is 10%, the rare subpopulation proportion is 10%, and every feature is normal following Eq. 7 with μ = 2 and σ1, σ2 varying from 0.1 to 1. Heatmaps in the top row show the values of T1, and those in the bottom row show the values of T2 for each of the four similarity/distance measures. We see that Semblance improves upon Euclidean distance, Pearson, and Spearman, attaining large values for T1 and T2 across a broad range of parameters, especially when σ2 is small. Figure 2C shows another set of simulations, with the same n, m, and p values as Fig. 2B, but under the model (Eq. 8) with r0 = 0.5, r1 varying from 0.01 to 0.2, and q varying from 0.05 to 0.5. We see that, in this case, there is no signal in T2 for any of the measures except Semblance, and both Pearson correlation and Spearman correlation fail to separate the two groups for much of the parameter range. In contrast, Semblance gives large values for both T1 and T2 for a large portion of the explored parameter region.

We explored varying combinations of p, q, σ1, and σ2 in the normal setting, and p, q, r0, and r1 in the Bernoulli setting. Summarizing these systematic experiments in representative heatmaps (Fig. 3), we found that Semblance has robust performance across different distributions and distribution parameters (σ1, σ2, r0, and r1) as long as the proportion of informative features is not too small. Semblance is better than the other metrics especially in differentiating small tight subpopulations, i.e., niche groups. Unweighted Semblance (wg = 1 in Eq. 2) retains less information and should not be used when informative features are extremely rare (p → 0) but the separation between clusters is extremely large (μ → ∞). This lack of sensitivity for rare features, however, can be remedied by the use of distribution-informative weights wg.

Fig. 3 Simulation results over parameter sweeps.

For each 2 × 4 group of heatmaps, the top row shows T1 and the bottom row shows T2 for each similarity/distance metric, computed as described in the text. Simulation parameters are varied along the rows and columns of the heatmaps. (A) Normal model, p ∈ {0.1, 0.2, …, 0.9} for the horizontal axis and q ∈ {0.05, 0.1, …, 0.5} for the vertical axis. (B) Normal model, q ∈ {0.05, 0.1, …, 0.5} for the horizontal axis and σ2 ∈ {0.1, 0.2, …, 1.5} for the vertical axis. (C) Normal model, p ∈ {0.1, 0.2, …, 0.9} for the horizontal axis and σ2 ∈ {0.1, 0.2, …, 1.5} for the vertical axis. (D) Bernoulli model, p ∈ {0.1, 0.2, …, 0.9} for the horizontal axis and q ∈ {0.05, 0.1, …, 0.5} for the vertical axis.

In its explicit construction, Semblance is shift and scale invariant; however, this robustness comes at the cost of insensitivity to the shape of the distribution. We compared the performance of a Semblance metric that does not put explicit weights on the features (wg = 1) with a modified metric where features are weighted on the basis of the shape of their distribution (wg ≠ 1 in Eq. 2). We used as weights the negentropy, a robust approximation of kurtosis and non-Gaussianity that is invariant under invertible linear transformations of the data (13), in our simulations described above and found that negentropy weighting further enhances the information captured by Semblance (fig. S1, A to C). For positive-valued data, we recommend the use of the Gini coefficient for weights. The Gini index is a robust measure of dispersion (14). It ranges from 0 to 1, wherein a value of 0 means that the feature takes essentially the same value across all objects, and a value close to 1 means that the feature's values are highly dispersed. When the data are simulated from gamma distributions (fig. S1D), we found that Gini weighting provides a robust and sensitive method of feature selection. Compared to an unweighted Semblance, the Gini-weighted statistic carries more signal, especially when the fraction of informative features is small (fig. S1, E to G). Collectively, our simulations demonstrate that Semblance can perform well in a totally unsupervised fashion; nonetheless, when prior knowledge about the data distribution is available, Semblance can also incorporate it to weight features and augment its performance.
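A short sketch of Gini-coefficient weighting for positive-valued features is given below; it uses the standard mean-difference form of the Gini index and may differ in detail from the weighting used for the reported experiments.

# Gini coefficient of a positive-valued feature (0 = all values equal, near 1 = highly dispersed)
gini <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

set.seed(3)
Npos <- matrix(rgamma(20 * 5, shape = 2), nrow = 20)  # positive-valued data, e.g., gamma-distributed
w <- apply(Npos, 2, gini)                             # one weight per feature (column)
K_weighted <- semblance_kernel(Npos, w)               # Gini-weighted Semblance (Eq. 2 with these w_g)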

Semblance kernel-tSNE identifies a niche retinal horizontal cell population

In the setting of single-cell RNA sequencing (scRNA-seq), the data are in the form of a matrix with each row representing a cell and each column representing a gene. For cell c and gene g, the entry Ncg of this count matrix measures the gene's RNA expression level in that cell. A first step in the analysis of these data is often visualization via a t-distributed stochastic neighbor embedding (tSNE)–type dimension reduction. Most studies arbitrarily use the Euclidean distance or the radial basis function (RBF) in this step, although methods based on more sophisticated kernel choices that rely on strong prior assumptions have been proposed (15). Starting from the low-dimension embedding, a primary goal in many single-cell studies is to classify cells into distinct cell types and identify previously unknown cell subpopulations. This analysis is challenging for many reasons: (i) expression levels are not comparable across genes, so lowly expressed cell-type markers may be swamped by highly expressed housekeeping genes; (ii) gene expression at the single-cell level is often bursty and thus cannot be approximated by the normal distribution; and (iii) one is often interested in detecting rare niche subpopulations for which current methods have low power. These considerations motivated us to use the Semblance kernel to compute a cell-to-cell similarity metric, which can be used as input to tSNE, principal components analysis (PCA), and other kernel-based algorithms, as sketched below. Most methods used for cell-type identification based on scRNA-seq limit their consideration to highly variable genes, thereby using only a subset of the features. Instead, Semblance can be computed over all features, ensuring that information from all informative genes is retained.
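One simple way to feed a Semblance matrix into tSNE (an illustration of the workflow rather than necessarily the exact pipeline used for the analyses below) is to convert the kernel into the distance it induces in the RKHS, d(i, j) = sqrt(Kii + Kjj − 2Kij), and hand that distance matrix to Rtsne; here, expr stands for a cells × genes expression matrix.

library(Rtsne)

K   <- semblance_kernel(expr)                            # cell-to-cell Semblance matrix
d2  <- pmax(outer(diag(K), diag(K), "+") - 2 * K, 0)     # squared kernel-induced distance, clipped at 0
d   <- sqrt(d2)
emb <- Rtsne(d, is_distance = TRUE, perplexity = 30)$Y   # 2D embedding from the precomputed distances
plot(emb, xlab = "tSNE 1", ylab = "tSNE 2")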

Consider the retinal horizontal cell (RHC), a unique cell type that recently came into the limelight because of its notable morphological plasticity and its role as a possible precursor for retinoblastoma (16). RHCs have a special level of complexity wherein they can undergo migration, mitosis, and differentiation at late developmental stages. They are traditionally divided into H1 axon-bearing and H2 to H4 axonless subtypes, although the latter are largely absent in the rod-dominated retina of most mammals (17). The axon-bearing and axonless RHC subtypes are generated during retinal development from progenitors that are susceptible to a transition in metabolic activity. For example, follistatin, an anabolic agent that alters protein synthesis and the inherent metabolic architecture in tissues, increases RHC proliferation (18). RHC subtypes also exhibit temporally distinguishable periods of migration, likely affected by their cellular metabolic state. These distinctive features are controlled by a niche set of genes, and thus RHCs provide a nonpareil setting to test Semblance. We used unweighted Semblance on an scRNA-seq dataset of 710 Lhx1+ RHCs from healthy P14 mice (19) and sought to answer the question, How similar are RHCs to each other? When we used Euclidean distance for the tSNE analysis, only one RHC cluster could be identified (Fig. 4A), as opposed to the two subsets of RHCs identified using Semblance kernel-tSNE (Fig. 4B).

Fig. 4 A unique RHC cluster is identified by Semblance kernel-tSNE.

Each dot in (A) to (C) represents a single cell. Euclidean distance tSNE identifies a single RHC cluster (A) as opposed to two subpopulations identified by Semblance. Comparing the kernel’s performance when features are naturally weighted on skewness (B) versus when they are weighted based on Gini coefficient (C) points to a geometric, trajectory-like structure in the data. The top five pathways found to be enriched in the rare cellular subtype are shown (D), and GO analysis suggested that the smaller RHC cluster has unique metabolic response properties (E). We also found evidence that these metabolic properties might lead to increased proliferation as suggested by increased expression of cell cycle genes by the cells in the red/rare cluster (F). For DE analysis, Benjamini-Hochberg–corrected P values are noted underneath each cyclin gene; the color codes blue and red correspond to the major and rare RHC clusters, respectively.

Moreover, since the Gini coefficient is a robust measure of gene expression dispersion in single-cell transcriptomics experiments (20), we wondered how the unweighted Semblance-tSNE (wg = 1 in Eq. 2) would compare against a Semblance measure where features are Gini-weighted. Although both kernel-tSNE projections successfully separated the second/rare RHC cluster, weighing genes by their Gini coefficient also suggested an underlying geometry, pointing to a plausible trajectory (Fig. 4C and fig. S2). We then sought further biological interpretation of these results and discovered that the cells in the second, smaller cluster—comprising 12% of the total RHC population—identified by Semblance exhibit differential expression (DE) of pathways that affect metabolism (Fig. 4D). We explicated our results by testing for enriched Gene Ontology (GO) functional categories using REVIGO (21) and uncovered a niche RHC population that has unique metabolic response properties (Fig. 4E). Gini weighting showed that this niche RHC cohort, which has an up-regulated metabolic state, resides at one end of a trajectory, suggesting that this trajectory may be related to proliferation. In other words, the niche RHC cohort could be a group of proliferating horizontal cells, compared with the more mature RHCs that constitute the larger cell cluster. To examine this, we tested for DE of cell cycle–associated genes, which are common markers of proliferation (22), between the two groups using MAST (model-based analysis of single-cell transcriptomics) (23). We found evidence for DE of genes associated with increased proliferation ability (Fig. 4F) in support of our hypothesis. Thus, without any domain knowledge, exploratory analysis using Semblance successfully uncovered meaningful biological signals from these data.

Semblance kernel PCA is efficient at image reconstruction and compression

Kernel PCA (kPCA), the nonlinear version of PCA, exploits the structure of high-dimensional features and can be used for data denoising, compression, and reconstruction (24, 25). This task, however, is nontrivial because the kPCA output resides in some high-dimensional feature space and does not necessarily have preimages in the input space (26). kPCA, particularly using the Gaussian kernel defined by k(x, y) = exp(−‖x − y‖²/(2σ²)), has been used extensively to improve active shape models, reconstruct preimages, and recreate compressed shapes owing to its ability to recognize more nuanced features in real-world pictures (27). Nonlinear data recreation based on kPCA rests on the principle that using a small set of f kPCA features provides an f-dimensional reparametrization of the data that better captures its inherent complexity (28). Since Semblance is nonparametric and empirically driven, emphasizing rare or niche feature values in data, we surmised that it would be useful as a nonlinear image reconstruction method. We discovered that Semblance kPCA can be used to reconstruct real-world images with remarkably good performance (Fig. 5, A and B). Upon adding uniform noise to an image, we found that Semblance kPCA can denoise images and compares favorably against linear PCA and Gaussian kPCA (Fig. 5, C and D).

Fig. 5 kPCA using the Semblance kernel provides a useful method for image reconstruction and denoising.

Two examples of open-source images, the Philadelphia skyline (A) and daffodil flowers (B), are shown here. Semblance kPCA was able to effectively recover and compress these images when compared with linear PCA or Gaussian kPCA. The same images corrupted with added uniform noise are shown in (C) and (D), respectively, together with the recovered output from linear PCA, Gaussian kPCA, and Semblance kPCA. Comparing the same number of features (and even 2.5× as many features for Gaussian kPCA), Semblance performs favorably. More examples are given in the Supplementary Materials. Photo credits: Mo Huang (The Wharton School) and the EBImage package.

We further evaluated the performance of kPCA on pictures obtained from The Yale Face Database (http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html) and the Bioconductor package EBImage (29) and found that Semblance can give a good re-encoding of the data when they lie along a nonlinear manifold, as is often the case with images. In each experiment, we computed the projections of the given image data onto the first f components and then sought to reconstruct the image as precisely as possible. We found that Semblance kPCA performed better than linear PCA and Gaussian kPCA when using a comparable number of components (fig. S3). This encouraging observation is supported by the intuitions underlying the construction of Semblance. Linear PCA encapsulates the coarse data structure as well as the noise. In contrast, Gaussian kPCA, similar to a k-nearest neighbor method, recreates the connection between data points that are close to each other in the original feature space (30). Semblance, on the other hand, captures data points that are proximal in the feature space under the metric determined by the empirical data distribution and therefore performs better when informative and non-informative features in an image are not on the same scale.
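A minimal kernel PCA sketch with a precomputed Semblance matrix, using the kernlab package, is shown below; treating the rows of a grayscale image matrix as observations is an illustrative assumption, and the preimage reconstruction step used for the figures is beyond this snippet.

library(kernlab)

K      <- semblance_kernel(img_rows)               # img_rows: rows of a grayscale image matrix (assumed input)
kp     <- kpca(as.kernelMatrix(K), features = 20)  # kernel PCA on the precomputed Semblance kernel
scores <- rotated(kp)                              # projections onto the first 20 kernel principal components
barplot(eig(kp), main = "Semblance kPCA eigenvalues")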

Semblance performs comparably with domain-specific kernel support vector machines in stock market forecasting

In finance and business analytics, although predictions of market volatility are inherently challenging, support vector machines (SVMs) using Gaussian and Laplacian kernels have been found to efficiently model stock market prices, partly because training these kernel SVMs (kSVMs) allows convex optimization with a linear constraint, resulting in a stable and unique global minimum (31, 32). Semblance is a context-free kernel, and therefore one might be tempted to rely instead on existing kernels that explicitly incorporate domain-specific information. To examine Semblance in the setting of financial forecasting, we compared its performance against eight other kernel functions. We used the Center for Research in Security Prices (CRSP) Database (http://www.crsp.com/products/research-products/crspziman-real-estate-database), which combines stock price and returns data with financial indices and company-specific information on all real estate investment trusts (REITs) that have traded on the three primary exchanges: Nasdaq, New York Stock Exchange (NYSE), and NYSE MKT. We focused our analysis on the 5419 REITs that were actively trading between 1 January 2016 and 31 December 2017 and obtained a list of financial indices and company-specific indicators for each REIT (table S1).

We determined which REITs had a net positive rate of return on their stock and then classified the companies based on whether they had a positive or negative rate of return. We then used kSVM to determine how accurately the model was able to predict the REIT category. We randomly split the data into training and test subsets in a 3:1 ratio and compared the generalization ability of each kSVM classifier using 10-fold cross-validation. Consistent with previous research in this area (33, 34), we observed that the Gaussian and Laplacian kernels performed better than linear, spline, and hyperbolic tangent kernels likely because the former two kernels are homogeneous and have good approximation capabilities to model financial fluctuations. Nonetheless, Semblance was more accurate at REIT classification than most other kernel choices (table S2) and performed comparably to other popularly used context-specific kernels.
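The R sketch below shows how a precomputed Semblance matrix can be plugged into kernlab's SVM with a 3:1 train/test split like the one described above; X and y are placeholders for the REIT indicator matrix and the positive/negative rate-of-return labels, and for simplicity the kernel is computed once on all objects and then subset.

library(kernlab)

set.seed(4)
K_all <- semblance_kernel(X)                               # Semblance over all REITs
train <- sample(seq_len(nrow(X)), round(0.75 * nrow(X)))   # 3:1 train/test split
fit   <- ksvm(as.kernelMatrix(K_all[train, train]),
              as.factor(y[train]), type = "C-svc")

# Predict held-out REITs from their kernel values against the support vectors
K_test <- as.kernelMatrix(K_all[-train, train][, SVindex(fit), drop = FALSE])
pred   <- predict(fit, K_test)
mean(pred == y[-train])                                    # test accuracy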

DISCUSSION

We have presented Semblance, a new similarity kernel for the analysis of multivariate data. Semblance relies on the simple intuition that the empirically observed distribution of each feature should be used to place a premium on proximity among objects in low-probability regions of the feature space. In this way, Semblance is sensitive for detecting niche features in the data. We have shown that Semblance is a valid Mercer kernel and thus can be used in a principled way in kernel-based learning algorithms. From a computational point of view, Semblance enables the extraction of features of the data's empirical distribution at low computational cost. It naturally relies only on ranked feature values and thus is extremely robust. We evaluated Semblance and compared it to some commonly used similarity measures through simulations and diverse real-world examples, demonstrating scenarios where Semblance can improve downstream analysis.

Kernelized learning methods have been tremendously useful in a wide variety of applications and disciplines, particularly because of their ability to map data in complex, nonlinear spaces (6, 25, 28). Most commonly, kernels have been used to compare and classify objects, for instance, in clustering algorithms (7); however, Mercer kernels have another important interpretation in that they reflect similarities between sets of local features in the data. Semblance exploits this latter concept by defining a general-purpose similarity measure on probability spaces, even when no explicit correspondence between the data might appear obvious or intuitive. Satisfying the Mercer condition ensures that Semblance will guarantee unique global optimal solutions for downstream learning algorithms (35). Semblance operates in a high-dimensional, implicit feature space and can be applied to any data domain. We anticipate that it will also find utility in “multiple-kernel learning” approaches, wherein multiple kernels are often combined to learn from a heterogeneous data source.

MATERIALS AND METHODS

Algorithm to implement the semblance kernel

procedure Step 1
  For a given feature g, create a descending ranked list such that the object with the highest value of g is ranked 1 and the object with the lowest value is ranked last.

procedure Step 2
  Compute the empirical cumulative distribution function (CDF) for feature g.
  loop: for features g : 1 → G do
    Store the ranked lists as look-up tables.
    For a given feature g, determine where the two observations x and y fall on the CDF for g.

procedure Step 3
  Calculate the absolute difference between the ranks of x and y, and add 1 to this difference.

procedure Step 4
  loop: for features g : 1 → G do
    Apply Step 3 and store the cumulative sum in a matrix as entry (x, y), corresponding to the xth row and yth column.

R package implementation

Semblance is an open-source R package available on CRAN (Comprehensive R Archive Network; https://cran.r-project.org/web/packages/Semblance/) and is compatible with existing kernel method libraries such as kernlab (36). In our R package, we implemented the kernel method in the ranksem function, which takes as input an n × G data matrix (of G feature measurements on n objects) and returns an n × n similarity matrix.
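A hedged usage sketch of the package interface as described above (the function name ranksem is taken from the text; argument details may differ across package versions):

install.packages("Semblance")
library(Semblance)
K <- ranksem(N)   # N: n x G data matrix; K: n x n Semblance similarity matrix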

Proofs concerning NND matrices

Lemma 1: The sum of NND matrices is NND.

Proof: Let Kg(1) and Kg(2) be two NND matrices, so that for all z ∈ ℝn

$$z^{T} K_g^{(1)} z \ge 0 \quad \text{and} \quad z^{T} K_g^{(2)} z \ge 0$$

Using the distributive law of matrix multiplication,

$$z^{T} K_g^{(1)} z + z^{T} K_g^{(2)} z = z^{T}\bigl(K_g^{(1)} + K_g^{(2)}\bigr) z \ge 0 \;\Rightarrow\; K_g^{(1)} + K_g^{(2)} \succeq 0$$

Lemma 2: Permuting the observations of an NND matrix preserves the NND structure.

Proof: Let π be a permutation matrix, i.e., a matrix with exactly one entry in each row and in each column equal to 1 and all other entries equal to 0. For any permutation matrix, π−1 = πT, and thus

$$\pi\pi^{T} = \pi^{T}\pi = I$$

For any given NND matrix K, the matrix πKπT is symmetric, and it is also NND because, for all w,

$$w^{T}(\pi K \pi^{T}) w = (\pi^{T} w)^{T} K (\pi^{T} w) \ge 0$$

since K is NND.

Furthermore, every NND matrix can be factored as K = ATA, where A is the Cholesky factor of K. The Cholesky factorization of NND matrices is numerically stable—a principal permutation of the rows and columns does not numerically destabilize the factorization (37). This leads to the result that symmetrically permuting the rows and columns of an NND matrix yields another NND matrix.

SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/12/eaau9630/DC1

Semblance and the connection to Mercer kernels

Existence of corresponding feature space for Semblance

Table S1. List of technical indicators recorded for each observation/REIT by the CRSP Real Estate Database.

Table S2. Test accuracy in forecasting whether the rate of return for an REIT would be positive or negative using SVMs for a range of kernel choices.

Fig. S1. Comparison of a naturally weighted Semblance metric with one wherein features are weighed by a context-dependent measure.

Fig. S2. We tested Semblance on an scRNA-seq dataset with 710 RHCs (19) and compared its performance against the conventionally used, Euclidean distance–based analysis.

Fig. S3. kPCA using the Semblance kernel is able to efficiently reconstruct images.

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.

REFERENCES AND NOTES

Acknowledgments: N.R.Z. thanks Z. Wu (Brown University, Rhode Island) for enlightening discussions. We thank H. Tang (Stanford University, California) for providing feedback on an earlier version of this research report. We also thank the referees for their helpful comments, which have led to a better presentation of the paper. Funding: This work was supported by the National Institutes of Health (NIH grant R01-GM125301 to N.R.Z.). Author contributions: D.A. and N.R.Z. devised the idea, conducted the supporting experiments, and wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials and are publicly available on open-access repositories. Any additional computer codes related to this paper may be requested from the authors. The scRNA-seq data were generated using the Drop-Seq platform and are publicly accessible via the NCBI Gene Expression Omnibus (accession: GSE63473). The data on real estate investment trusts (REITs) are curated by the Center for Research in Security Prices (CRSP) and are accessible at www.crsp.com/products/research-products/crspziman-real-estate-database.