Research ArticleCOMPUTER SCIENCE

A trainable clustering algorithm based on shortest paths from density peaks

See allHide authors and affiliations

Science Advances  30 Oct 2019:
Vol. 5, no. 10, eaax3770
DOI: 10.1126/sciadv.aax3770
  • Fig. 1 Association rules on a simplified example.

    (A) Raw data points in a 2D space. (B) Color-coded point density (ρ). (C) Color-coded distance to the closest point with a higher density (δ). (D) Density-distance plot for selecting density peaks (dashed box), characterized as outliers with sufficiently high density and distance. In this case, the density peaks correspond to the points p, q in (E) and (F). (E) Local association rule of CDP connecting a point x, to the closest point with higher density, until this is a density peak. This happens regardless of the other connections. (F) Global association rule of the proposed method, which computes the optimal path to connect a starting node s to a generic point x, passing by density peaks. (G) The path from S, via p, to x can be evaluated according to its properties, such as the maximum gap included. (H) Color-coded results of CDP. (I) Color-coded results of the proposed algorithm.

  • Fig. 2 Evaluation on a synthetic dataset with heterogeneous structures.

    (A and B) The performances of the proposed method are evaluated on the synthetic dataset 01_chang_pathbased provided in ClustEval (2). This includes groups with heterogeneous structures (i.e., a thin-elongated structure surrounding two globular-like structures). Our method, with a generic minimax path-cost function, partially improved the results produced by CDP and DBSCAN (A and B) with respect to the ground truth (GT). By contrast, by training a path classifier, the algorithm achieved full performances with an F1 score ≥0.99 and a Jaccard index ≥0.98. (C) Training paths composed of 25 examples of desired paths and 25 examples of undesired paths. (D) Path features corresponding to the density profile (vector including the density of the nodes in the path).

  • Fig. 3 Robustness to density peak detection.

    (A) Color-coded ground truth including two clusters in close proximity (red and blue) from the synthetic dataset 04_fu_flame provided in ClustEval. Points colored in black, at the intersection of the arrows, are the identified density peaks, which are used as seeds by the proposed algorithm. The position of seeds is perturbed by a displacement of max ±1 (black line). (B) Performance degradation with respect to the maximum perturbation. Results are obtained with n = 5 replicas with a random perturbation direction. CDP* refers to a modified version of CDP, forced to associate points to the perturbed seeds. (C) Color-coded results of the proposed method using the seeds colored in black. (D and E) Robustness to the setting of τρ and τδ parameters. The optimal settings of the parameters (D) identifies two outliers of the δ − ρ graph. (E) Variations from the optimal settings include additional subclusters, lowering the performances of the algorithm.

  • Fig. 4 Robustness to background noise.

    (A) Synthetic probability distribution from which point distributions are drawn. Lighter areas have the maximum probability. The area in blue is a background with uniform probability from 0 to 0.1, corresponding to a number of points from 10 to 52% of the total. (B) Quantitative performance degradation assay with respect to the value of background probability. A total of 1000 points are drawn from the probability distribution in (A), varying the background uniform probability from 0 to 0.1. For each level of background probability, n = 5 replicates are generated. Bold lines refer to the mean of the F1 score among the replicates, while the shaded lines correspond to the range. Three peaks were selected using γ ranking (5). (C to E) Qualitative results with a background probability of 0.01, corresponding to 900 points from the clusters and 100 points in the background. Cluster labels are color coded. Red x correspond to noise. (F to H) Qualitative results with a background probability of 0.1, corresponding to 480 points from the true clusters and 520 points in the background. Arrows indicate an aggregation of points in the background. The trained algorithm detects those points as noise.

  • Fig. 5 Application for segmentation of immune cells.

    (A to C) Qualitative results (color-coded labels) produced by different density-based clustering algorithms for clustering pixels in space. Images show a confocal microscopy acquisition of murine immune cells, labeled with CD11c + GFP. (i) Magnification showing a cell with dendritic morphology close to a cell with globular shape. (ii) Magnification showing two cells with globular shape in contact. CDP (A) using a Euclidean metric correctly separates touching cells but associates a piece of dendrite to the wrong cell. DBSCAN (B) correctly reconstructs the shape of dendritic cells but is not able to separate touching cells with the same density-reachability criterion. The proposed method (C) correctly associates the dendrite of the dendritic cell and separates the touching cells. Black lines indicate the optimal path followed by the algorithm, from the cell centroid (density peak) to a point in the dendrite and in the touching region, respectively. (D) Quantitative performances. The F1 score and the Jaccard index are computed with respect to the ground truth. Here, the trained version of the algorithm achieved similar performances to the generic version. These are bounded because of the separation of cells with increased size into multiple subclusters.

  • Fig. 6 Recognition of different heart rhythms.

    (A to D) Fragments of an ECG signal from a patient with several types of arrhythmia. Blue lines indicate a beat (systole). Cardiologists labeled six different types of beats “A, V, E, R, L, and !,” associated with different rhythms. Atrial/ventricular/ectopic beats “A/V/E” are extrasystoles close to the previous beat that can induce a pause in the rhythm (A). Right bundle branch block “R”: beat appearing in this patient between two V. Left bundle branch block “L” (B): normal beat type for this patient (similar RR1 and RR2 varying linearly in tachycardia and bradycardia). Runs of ventricular extrasystoles not followed by long pause are also reported (C). Ventricular flutter “!,” later indicated as “VF,” a critical condition characterized by extremely fast contractions (D). (E) Heartbeats represented as points, described according to their distance from the previous and to the next beat (RR1 and RR2). Our method recognized ventricular extrasystoles (red), which are otherwise mixed with L beats by CDP. The trained version of the algorithm further allowed identification of a group composed of mixed atrial and ventricular extrasystoles (orange).

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/10/eaax3770/DC1

    Proof S1. Unique cluster assignation.

    Fig. S1. Graph structure.

    Fig. S2. Results on the 12 synthetic datasets provided by ClustEval.

    Fig. S3. Performance degradation with respect to number of training paths.

    Fig. S4. Benchmark on high-dimensional synthetic datasets.

    Fig. S5. Results on the bone marrow leukemia dataset.

    Data file S1. Additional datasets used in this article.

    Data file S2. MATLAB source code of both the generic and trainable algorithm.

    Movie S1. Demo showing the training procedure.

  • Supplementary Materials

    The PDF file includes:

    • Proof S1. Unique cluster assignation.
    • Fig. S1. Graph structure.
    • Fig. S2. Results on the 12 synthetic datasets provided by ClustEval.
    • Fig. S3. Performance degradation with respect to number of training paths.
    • Fig. S4. Benchmark on high-dimensional synthetic datasets.
    • Fig. S5. Results on the bone marrow leukemia dataset.

    Download PDF

    Other Supplementary Material for this manuscript includes the following:

    • Data file S1 (.zip format). Additional datasets used in this article.
    • Data file S2 (.zip format). MATLAB source code of both the generic and trainable algorithm.
    • Movie S1 (.mp4 format). Demo showing the training procedure.

    Files in this Data Supplement:

Navigate This Article