Research Article | APPLIED MATHEMATICS

Semblance: An empirical similarity kernel on probability spaces

Science Advances  04 Dec 2019:
Vol. 5, no. 12, eaau9630
DOI: 10.1126/sciadv.aau9630
  • Fig. 1 Illustration of what p_g(x_g, y_g) corresponds to for a discrete or a continuous distribution.

    In this toy example, X and Y are two objects with four measured features. Semblance computes an empirical distribution from the data for each feature and uses where the observations fall on that distribution to determine how similar they are. Specifically, it emphasizes relationships that are less likely to occur by chance and that lie in the tails of a probability distribution. For example, X and Y both equal 0 for the first and second features, but these two features contribute different values to the kernel: “0” is rarer for the second feature, so p_2(0, 0) is smaller than p_1(0, 0), and the second feature contributes a higher value to the Semblance kernel. Similarly, even though the difference between X and Y is 1 for both features 3 and 4, feature 4, where the values fall in the tail, has a lower p_g(x_g, y_g) and thus contributes a higher value to the Semblance kernel than feature 3.

  • Fig. 2 Simulations exploring the effectiveness of similarity/distance measures.

    (A) Setup for one simulation run. (B) T1 (top) and T2 (bottom) values for each similarity/distance metric, for varying values of σ1 ∈ [0.1, 1] (horizontal axis) and σ2 ∈ [0.1, 1] (vertical axis). (C) T1 (top) and T2 (bottom) values for each similarity/distance metric, for varying values of r1 ∈ [0.1, 1] (horizontal axis) and q ∈ [0.1, 1] (vertical axis).

  • Fig. 3 Simulation results over parameter sweeps.

    For each 2 × 4 group of heatmaps, the top row shows T1 and the bottom row shows T2 for each similarity/distance metric, computed as described in the text. Simulation parameters are varied along the rows and columns of the heatmaps. (A) Normal model, p ∈ {0.1, 0.2, …, 0.9} for the horizontal axis and q ∈ {0.05, 0.1, …, 0.5} for the vertical axis. (B) Normal model, q ∈ {0.05, 0.1, …, 0.5} for the horizontal axis and σ2 ∈ {0.1, 0.2, …, 1.5} for the vertical axis. (C) Normal model, p ∈ {0.1, 0.2, …, 0.9} for the horizontal axis and σ2 ∈ {0.1, 0.2, …, 1.5} for the vertical axis. (D) Binomial model, p ∈ {0.1, 0.2, …, 0.9} for the horizontal axis and q ∈ {0.05, 0.1, …, 0.5} for the vertical axis.

  • Fig. 4 A unique RHC cluster is identified by Semblance kernel-tSNE.

    Each dot in (A) to (C) represents a single cell. Euclidean distance tSNE identifies a single RHC cluster (A) as opposed to two subpopulations identified by Semblance. Comparing the kernel’s performance when features are naturally weighted on skewness (B) versus when they are weighted based on Gini coefficient (C) points to a geometric, trajectory-like structure in the data. The top five pathways found to be enriched in the rare cellular subtype are shown (D), and GO analysis suggested that the smaller RHC cluster has unique metabolic response properties (E). We also found evidence that these metabolic properties might lead to increased proliferation as suggested by increased expression of cell cycle genes by the cells in the red/rare cluster (F). For DE analysis, Benjamini-Hochberg–corrected P values are noted underneath each cyclin gene; the color codes blue and red correspond to the major and rare RHC clusters, respectively.

  • Fig. 5 kPCA using the Semblance kernel provides a useful method for image reconstruction and denoising.

    Two open-source images are shown: the Philadelphia skyline (A) and daffodil flowers (B). Each image was corrupted with added uniform noise, (C) and (D), respectively, and the images recovered using linear PCA, Gaussian kPCA, and Semblance kPCA are displayed. Semblance kPCA recovers and compresses the images effectively compared with linear PCA or Gaussian kPCA: with the same number of features (and even when Gaussian kPCA uses 2.5× as many features), Semblance performs favorably. More examples are given in the Supplementary Materials. Photo credits: Mo Huang (The Wharton School) and the EB Image Package.
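
    The idea described in the Fig. 1 caption can be illustrated with a short sketch. This is not the authors' implementation. It assumes that p_g(x_g, y_g) is the empirical probability that an observation of feature g falls between x_g and y_g (inclusive), and that each feature contributes 1 − p_g(x_g, y_g) to the kernel with equal weight. The function names `interval_prob` and `semblance_sketch` and the toy data are hypothetical.

```python
import numpy as np


def interval_prob(column, x, y):
    """Empirical probability that a value in `column` lies in [min(x, y), max(x, y)].

    Assumed stand-in for p_g(x_g, y_g); the exact definition is in the article text.
    """
    column = np.asarray(column, dtype=float)
    lo, hi = min(x, y), max(x, y)
    return float(np.mean((column >= lo) & (column <= hi)))


def semblance_sketch(data):
    """Return an n x n similarity matrix for `data` (n objects x G features).

    Each feature contributes 1 - p_g(x_g, y_g); features are weighted equally
    (an assumption made for this sketch).
    """
    data = np.asarray(data, dtype=float)
    n_objects, n_features = data.shape
    kernel = np.zeros((n_objects, n_objects))
    for g in range(n_features):
        col = data[:, g]
        for a in range(n_objects):
            for b in range(n_objects):
                kernel[a, b] += 1.0 - interval_prob(col, col[a], col[b])
    return kernel / n_features


# Toy data: rows are objects, columns are features.
# Rows 0 and 1 equal 0 in both features, but 0 is rarer in the second feature.
toy = np.array([[0, 0],
                [0, 0],
                [0, 3],
                [1, 4]])

p1 = interval_prob(toy[:, 0], 0, 0)  # 0 is common in feature 1
p2 = interval_prob(toy[:, 1], 0, 0)  # 0 is rarer in feature 2
K = semblance_sketch(toy)
```

    In this toy data, p2 is smaller than p1, so under the assumptions above the second feature contributes more to the similarity between the first two objects, mirroring the behavior the caption describes for features 1 and 2.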

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/12/eaau9630/DC1

    Semblance and the connection to Mercer kernels

    Existence of corresponding feature space for Semblance

    Table S1. List of technical indicators recorded for each observation/REIT by the CRSP Real Estate Database.

    Table S2. Test accuracy in forecasting whether the rate of return for an REIT would be positive or negative using SVMs for a range of kernel choices.

    Fig. S1. Comparison of a naturally weighted Semblance metric with one wherein features are weighted by a context-dependent measure.

    Fig. S2. We tested Semblance on an scRNA-seq dataset with 710 RHCs (19) and compared its performance against the conventionally used, Euclidean distance–based analysis.

    Fig. S3. kPCA using the Semblance kernel is able to efficiently reconstruct images.
