Research Article | SYSTEMS BIOLOGY

Distinguishing cell phenotype using cell epigenotype

Science Advances  18 Mar 2020:
Vol. 6, no. 12, eaax7798
DOI: 10.1126/sciadv.aax7798
  • Fig. 1 Schematic illustration of our approach to distinguish cell types.

    (A) The epigenomic measurements of two different cells in blue and orange (top left) yield different epigenotypes (top right), from which a condition-specific effective network (bottom right) is determined from correlations in the data. Solid lines indicate relationships that are enforced under the specified conditions, and dashed lines indicate relationships that are possible but not enforced. Projection to the state space of correlation eigenvectors approximates the attractors. (B) The probability distribution functions of distances between pairs of measurements of the same and different types are compared at selected percentiles (shaded regions) to determine whether pairs of the same type are more similar than pairs of different types. (C) The performance is evaluated by using KNN to predict unseen data (top) and by measuring the frequency with which chords cross cell type boundaries (gray dashed line, bottom panel).

  • Fig. 2 Assessment of our method applied to the GTEx dataset and comparison with alternatives.

    (A) AUROC for each cell type presented as a box plot for each number of features. Asterisks indicate significant improvement (P < 0.05, Kolmogorov-Smirnov test) relative to the MetaNeighbor performance. (B) Accuracy of LOGO validation as a function of the number of features and the size of the test set expressed as a fraction of all experiments. Optimization-selected features perform better than PCA-selected ones, especially for models with few features. (C) LOGO validation accuracy using nine features, where the cell types are listed in order of the number of experiments.

  • Fig. 3 Distinguishing cell types by the cell type homogeneity criterion for the GeneExp dataset.

    Equation 5 quantifies cell type homogeneity according to the UU, WU, and WC versions of measuring distance. The gray and white checkered background corresponds to the cell type groupings enumerated in table S2, and tick labels indicate the cell type associated with each row and column based on the key below the figure. The color coding defined in the legend above the figure marks the cases in which one or more of the versions failed for each query (row) and test (column) cell type. Gray indicates that the identification was successful for all three versions (91.4% of all cases). Self-comparisons (white diagonal) were not evaluated.

  • Fig. 4 Comparison of the UU (blue), WC (orange), and WU (green) versions of the KNN technique applied to the GeneExp and Hi-C datasets.

    (A) Boxplots summarizing the distribution of classification accuracy over n = 25 test sets plotted as a function of the set size indicated as a fraction of all experiments for the GeneExp dataset. Red lines, boxes, and whiskers denote the median, interquartile range, and 5th to 95th percentile range, respectively. (B) Mean accuracy plotted as a function of the number of features for the GeneExp dataset. (C and D) Same as (A and B), respectively, but for the Hi-C dataset. Brackets indicate statistically significant differences between version accuracies as reported in table S1.

  • Fig. 5 Comparison of LOGO validation for the three versions of the KNN technique applied to the GeneExp and Hi-C datasets.

    (A) Validation for the GeneExp dataset using 4 features. The colors indicate the version of the method used to classify the cell types (blue for UU, green for WU, and orange for WC), while the opacity indicates the fraction of the total number of experiments belonging to the x axis cell type that are predicted to belong to the y axis cell type. (B) Same as (A), but for the Hi-C dataset using 3 features.

  • Table 1 Comparison between our machine-learning techniques and existing methods applied to both datasets, measured by the percentage of correct classifications under LOGO cross-validation.

    Method    GeneExp (%)    Hi-C (%)
    KNN       68.4           63.4
    SVC       57.8           43.7
    RF        39.7           40.0
    HNN       5.6            11.5
    PDM       18.8           59.3
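
The KNN percentages in Table 1 come from LOGO cross-validation. As a rough illustration of that kind of evaluation, the sketch below assumes that LOGO stands for leave-one-group-out: every experiment sharing a group label is held out together and predicted from the remaining experiments. The function names `knn_predict` and `logo_accuracy`, the group labels, and the toy data are all hypothetical. The distance is plain Euclidean distance, not the UU, WU, or WC versions used in the article, so this is not the authors' implementation.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    Assumption: plain Euclidean distance stands in for the article's
    UU/WU/WC distance versions, which are not reproduced here.
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def logo_accuracy(features, labels, groups, k=3):
    """Hold out each group in turn, train on the rest, and return the
    fraction of held-out samples that are classified correctly."""
    correct = 0
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_x, train_y, features[i], k) == labels[i]:
                correct += 1
    return correct / len(features)


# Invented toy data: two well-separated "cell types" measured in three groups.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
groups = ["g1", "g2", "g3", "g1", "g2", "g3"]

print(logo_accuracy(features, labels, groups, k=1))  # 1.0 on this toy data
```

Holding out whole groups rather than single experiments keeps measurements that share a source out of the training set, which makes the reported accuracy a stricter estimate than ordinary leave-one-out.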

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/6/12/eaax7798/DC1

    Supplementary Information

    Fig. S1. Confusion matrices for discerning actual and simulated data.

    Fig. S2. Method testing results as a function of the SNR under three scenarios (rows) for two criteria (columns).

    Fig. S3. Comparison of forward selection with PCA.

    Fig. S4. Distinguishing cell types for the Hi-C dataset.

    Fig. S5. KNN classification accuracy by cell type for the GeneExp dataset under LOGO cross-validation.

    Fig. S6. Fraction of nonconvex chords for each cell type.

    Fig. S7. Compilation of the number of squares of each color found in the preceding figures.

    Fig. S8. Accuracy as a function of genomic distance between loci and number of features for the Hi-C dataset.

    Table S1. Version comparison results and KS test P values.

    Table S2. Cell type counts, tick labels for Figs. 2C, 3, and 5 and figs. S5 and S6, and database accession numbers for the GeneExp and Hi-C datasets.

    Reference (43)

