Research ArticleAPPLIED MATHEMATICS

The ground truth about metadata and community detection in networks

See allHide authors and affiliations

Science Advances  03 May 2017:
Vol. 3, no. 5, e1602548
DOI: 10.1126/sciadv.1602548
  • Fig. 1 The stochastic blockmodel log-likelihood surface for bipartitions of the Karate Club network (14).

    The high-dimensional space of all possible bipartitions of the network has been projected onto the x, y plane (using a method described in Supplementary Text D.4) such that points representing similar partitions are closer together. The surface shows two distinct peaks that represent scientifically reasonable partitions. The lower peak corresponds to the social group partition given by the metadata—often treated as ground truth—whereas the higher peak corresponds to a leader-follower partition.

  • Fig. 2 Expected P value estimates of the blockmodel entropy significance test as the correlation ℓ between metadata and planted communities increases (each metadata label correctly reflects the planted community with probability (1 + ℓ)/2; see Supplementary Text B).

    Each curve represents networks with a fixed community strength ϵ = ωrs/ωrr. Solid lines indicate strong community structure in the so-called detectable regime (ϵ < λ), whereas dashed lines represent weak undetectable communities (ϵ > λ) (8). Three block density diagrams visually depict ϵ values.

  • Fig. 3 The neoSBM on synthetic data.

    (A) The stochastic blockmodel likelihood surface shows four distinct peaks corresponding to a sequence of locally optimal partitions. (B) Block density diagrams depict community structure for locally optimal partitions, where darker color indicates higher probability of interaction. (C) The neoSBM, with partition 1 as the metadata partition, interpolates between partition 1 and the globally optimal stochastic blockmodel partition 4. The number of free nodes q and stochastic blockmodel log likelihood as a function of θ show three discontinuous jumps as the neoSBM traverses each of the locally optimal partitions (1 to 4).

  • Fig. 4 The neoSBM on Lazega Lawyers friendship data (52).

    (A) Points of two neoSBM paths using office (red) and law school (blue) metadata partitions are shown on the stochastic blockmodel likelihood surface (grayscale to emphasize paths). (B) Block density diagrams depict community structure for metadata, (1 and 2) intermediate optimal, and (3) globally optimal partitions, where darker color indicates higher probability of interaction. (C) The neoSBM traverses two distinct paths to the global optimum (3), but only the path beginning at the office metadata partition traverses a local optimum (1), indicated by a plateau in free nodes q and log likelihood.

  • Table 1 BESTest P values for Lazega Lawyers.
    Metadata attribute
    NetworkStatusGenderOfficePracticeLaw school
    Friendship<10−60.034<10−60.0330.134
    Cowork<10−30.094<10−6<10−60.922
    Advice<10−60.010<10−6<10−60.205
  • Table 2 BESTest P values for malaria var genes.
    var gene network number
    123456789
    Genome0.5660.0640.5360.5880.3820.2750.0200.4640.115

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/3/5/e1602548/DC1

    Supplementary Text

    table S1. Notation used in Supplementary Text A.

    table S2. Notation used in Supplementary Text B.

    table S3. Lazega Lawyers: BESTest P values.

    table S4. Malaria: BESTest P values for parasite origin metadata.

    table S5. Malaria: BESTest P values for CP group metadata.

    table S6. Malaria: BESTest P values for UPS metadata.

    table S7. Notation used in the Supplementary Text.

    table S8. Normalized mutual information for partitions in fig. S6.

    table S9. Adjusted mutual information for partitions in fig. S6.

    fig. S1. The results of the neoSBM and the degree-corrected neoSBM on the Karate Club network.

    fig. S2. Results of the neoSBM on the malaria var gene network at locus one (“malaria 1”) using UPS metadata.

    fig. S3. Results of the neoSBM on the malaria var gene network at locus six (“malaria 6”) using UPS metadata.

    fig. S4. The block interaction matrix used to generate synthetic networks.

    fig. S5. Distributions of permuted partition entropies are negatively skewed.

    fig. S6. The five distinct ways to partition three nodes.

    References (5360)

  • Supplementary Materials

    This PDF file includes:

    • Supplementary Text
    • table S1. Notation used in Supplementary Text A.
    • table S2. Notation used in Supplementary Text B.
    • table S3. Lazega Lawyers: BESTest P values.
    • table S4. Malaria: BESTest P values for parasite origin metadata.
    • table S5. Malaria: BESTest P values for CP group metadata.
    • table S6. Malaria: BESTest P values for UPS metadata.
    • table S7. Notation used in the Supplementary Text.
    • table S8. Normalized mutual information for partitions in fig. S6.
    • table S9. Adjusted mutual information for partitions in fig. S6.
    • fig. S1. The results of the neoSBM and the degree-corrected neoSBM on the
      Karate Club network.
    • fig. S2. Results of the neoSBM on the malaria var gene network at locus one
      (“malaria 1”) using UPS metadata.
    • fig. S3. Results of the neoSBM on the malaria var gene network at locus six
      (“malaria 6”) using UPS metadata.
    • fig. S4. The block interaction matrix used to generate synthetic networks.
    • fig. S5. Distributions of permuted partition entropies are negatively skewed.
    • fig. S6. The five distinct ways to partition three nodes.
    • References (53–60)

    Download PDF

    Files in this Data Supplement: