Research Article | Network Science

A network approach to topic models


Science Advances  18 Jul 2018:
Vol. 4, no. 7, eaaq1360
DOI: 10.1126/sciadv.aaq1360
  • Fig. 1 Two approaches to extract information from collections of texts.

    Topic models represent the texts as a document-word matrix (how often each word appears in each document), which is then written as a product of two matrices of smaller dimensions with the help of a latent topic variable. The approach we propose here represents the texts as a network and infers communities in this network. The nodes consist of documents and words, and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph that is equivalent to the word-document matrix used in topic models.
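    A minimal sketch of this network construction (the toy corpus and variable names are illustrative, not from the paper): each document and each word becomes a node, and the edge weight stores the number of occurrences, so the biadjacency matrix of the graph is exactly the document-word count matrix.

```python
# Build the bipartite word-document graph from raw texts (toy example).
from collections import Counter

import networkx as nx

docs = {
    "doc_1": "network community detection network",
    "doc_2": "topic model topic inference",
}

G = nx.Graph()  # weighted simple graph; the weight n_dw plays the role of parallel edges
for doc_id, text in docs.items():
    G.add_node(doc_id, kind="document")
    for word, count in Counter(text.split()).items():
        G.add_node(word, kind="word")
        # edge weight = number of occurrences of the word in the document,
        # i.e., one entry of the document-word matrix
        G.add_edge(doc_id, word, weight=count)
```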

  • Fig. 2 Parallelism between topic models and community detection methods.

    The pLSI and SBMs are mathematically equivalent, and therefore, methods from community detection (for example, the hSBM we propose in this study) can be used as alternatives to traditional topic models (for example, LDA).
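    Because of this equivalence, the fitting step can be done with standard community-detection tooling. Below is a hedged sketch using graph-tool's nested SBM on the word-document graph; the toy corpus is illustrative, and the exact keyword arguments of minimize_nested_blockmodel_dl (for example, how to forbid mixed document-word groups, as done in the authors' reference implementation) vary between graph-tool versions.

```python
# Fit a nested SBM to the bipartite word-document multigraph with graph-tool.
import graph_tool.all as gt

docs = {"doc_1": ["network", "community", "network"],
        "doc_2": ["topic", "model", "topic"]}

g = gt.Graph(directed=False)
name = g.vp["name"] = g.new_vertex_property("string")
kind = g.vp["kind"] = g.new_vertex_property("int")  # 0 = document node, 1 = word node

vertex = {}
def get_vertex(label, k):
    if label not in vertex:
        v = g.add_vertex()
        name[v], kind[v] = label, k
        vertex[label] = v
    return vertex[label]

for doc_id, words in docs.items():
    d = get_vertex(doc_id, 0)
    for w in words:                     # one edge per word token (multigraph)
        g.add_edge(d, get_vertex(w, 1))

# Infer the hierarchical partition by minimizing the description length.
state = gt.minimize_nested_blockmodel_dl(g)
print(state.entropy())                  # description length of the fit
```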

  • Fig. 3 LDA is unable to infer non-Dirichlet topic mixtures.

    Visualization of the distribution of topic mixtures logP(θd) for different synthetic and real data sets in the two-simplex using K = 3 topics. We show the true distribution in the case of the synthetic data (top) and the distributions inferred by LDA (middle) and SBM (bottom). (A) Synthetic data sets with Dirichlet mixtures from the generative process of LDA with document hyperparameters αd = 0.01 × (1/3, 1/3, 1/3) (left) and αd = 100 × (1/3, 1/3, 1/3) (right), leading to different true mixture distributions logP(θd). We fix the word hyperparameter βrw = 0.01, D = 1000 documents, V = 100 different words, and text length kd = 1000. (B) Synthetic data sets with non-Dirichlet mixtures obtained by combining two Dirichlet mixtures: αd ∈ {100 × (1/3, 1/3, 1/3), 100 × (0.1, 0.8, 0.1)} (left) and αd ∈ {100 × (0.1, 0.2, 0.7), 100 × (0.1, 0.7, 0.2)} (right). (C) Real data sets with unknown topic mixtures: Reuters (left) and Web of Science (right), each containing D = 1000 documents. For LDA, we use hyperparameter optimization. For SBM, we use an overlapping, non-nested parametrization in which each document belongs to its own group such that B = D + K, allowing for an unambiguous interpretation of the group membership as topic mixtures in the framework of topic models.
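    The synthetic corpora in (A) follow the standard LDA generative process; a minimal numpy sketch with the caption's parameter values (variable names are illustrative) is shown below. For the non-Dirichlet mixtures in (B), θd would instead be drawn from one of the two listed Dirichlet components, chosen at random for each document.

```python
# Draw a synthetic corpus from the LDA generative process (panel A, right).
import numpy as np

rng = np.random.default_rng(0)
K, V, D, k_d = 3, 100, 1000, 1000                # topics, vocabulary, documents, text length
alpha_d = 100 * np.array([1 / 3, 1 / 3, 1 / 3])  # document hyperparameter
beta_rw = 0.01 * np.ones(V)                      # word hyperparameter

phi = rng.dirichlet(beta_rw, size=K)    # topic-word distributions (K x V)
theta = rng.dirichlet(alpha_d, size=D)  # document-topic mixtures (D x K)

corpus = []
for d in range(D):
    z = rng.choice(K, size=k_d, p=theta[d])      # topic assignment of each token
    words = np.concatenate(                      # word of each token, drawn from its topic
        [rng.choice(V, size=(z == k).sum(), p=phi[k]) for k in range(K)]
    )
    corpus.append(words)
```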

  • Fig. 4 Comparison between LDA and SBM for artificial corpora drawn from LDA.

    Description length Σ of LDA and hSBM for an artificial corpus drawn from the generative process of LDA with K = 10 topics. (A) Difference in Σ, ΔΣ = Σi − ΣLDA-true prior, relative to LDA with the true priors (the model that generated the data) as a function of the text length kd = m for D = 10^6 documents. (B) Normalized Σ (per word) as a function of the number of documents D for fixed text length kd = m = 128. The four curves correspond to different choices in the parametrization of the topic models: (i) LDA with noninformative (noninf) priors (light blue, ×), (ii) LDA with true priors, that is, the hyperparameters used to generate the artificial corpus (dark blue, •), (iii) hSBM without clustering of documents (light orange, ▲), and (iv) hSBM with clustering of documents (dark orange, ▼).
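    A hedged sketch of the comparison between (i) and (ii): fit LDA to an LDA-generated corpus once with default (noninformative) priors and once with the true hyperparameters, and compare the quality of fit. scikit-learn's variational bound is used here only as a stand-in for the Bayesian description length Σ of the paper, and the parameter values are illustrative, so the numbers are not comparable to the figure; only the structure of the comparison is shown.

```python
# Compare LDA fits with noninformative vs. true priors on an LDA-generated corpus.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(1)
K, V, D, m = 10, 100, 500, 128          # topics, vocabulary, documents, text length
alpha, beta = 0.01, 0.01                # symmetric "true" hyperparameters (illustrative)

phi = rng.dirichlet(beta * np.ones(V), size=K)
theta = rng.dirichlet(alpha * np.ones(K), size=D)
# Sampling each token's topic and then its word is equivalent to sampling the word
# counts of document d from a multinomial with probabilities theta[d] @ phi.
X = np.vstack([rng.multinomial(m, theta[d] @ phi) for d in range(D)])

models = {
    "noninformative priors": LatentDirichletAllocation(n_components=K, random_state=0),
    "true priors": LatentDirichletAllocation(n_components=K, doc_topic_prior=alpha,
                                             topic_word_prior=beta, random_state=0),
}
for label, model in models.items():
    model.fit(X)
    # -score(X) is the negative variational bound: smaller = better fit,
    # playing the role that Σ plays in the figure.
    print(label, -model.score(X))
```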

  • Fig. 5 Inference of the hSBM for articles from Wikipedia.

    Articles from three categories (chemical physics, experimental physics, and computational biology). The first hierarchical level reflects the bipartite nature of the network, with document nodes (left) and word nodes (right). The grouping on the second hierarchical level is indicated by solid lines. We show examples of nodes that belong to each group on the third hierarchical level (indicated by dotted lines): For word nodes, we show the five most frequent words; for document nodes, we show three (or fewer) randomly selected articles. For each word, we calculate the dissemination coefficient UD, which quantifies how unevenly words are distributed among documents (60): UD = 1 indicates the expected dissemination from a random null model; the smaller UD (0 < UD < 1), the more unevenly a word is distributed. We show the 5th, 25th, 50th, 75th, and 95th percentiles for each group of word nodes on the third level of the hierarchy. Intl. Soc. for Comp. Biol., International Society for Computational Biology; RRKM theory, Rice-Ramsperger-Kassel-Marcus theory.
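    A hedged sketch of the dissemination coefficient: following its definition in (60), UD(w) = Dw / E[Dw], where Dw is the number of distinct documents containing word w and E[Dw] is its expectation when the nw tokens of w are scattered at random over the corpus. The hypergeometric form of E[Dw] used below is one standard choice of that null model and is an assumption here, as are all variable names.

```python
# Dissemination coefficient U_D for every word of a document-word count matrix.
import numpy as np
from scipy.special import gammaln


def dissemination_coefficient(counts):
    """counts: (D x V) array of word counts (every word occurring at least once);
    returns U_D for each of the V words."""
    m_d = counts.sum(axis=1)              # tokens per document
    M = m_d.sum()                         # total number of tokens in the corpus
    n_w = counts.sum(axis=0)              # tokens per word
    D_w = (counts > 0).sum(axis=0)        # documents in which each word occurs

    def log_binom(n, k):
        return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    # Null model: P(word w absent from doc d) = C(M - m_d, n_w) / C(M, n_w)
    log_p_absent = (log_binom((M - m_d)[:, None], n_w[None, :])
                    - log_binom(M, n_w)[None, :])
    expected_D_w = (1.0 - np.exp(log_p_absent)).sum(axis=0)
    return D_w / expected_D_w             # U_D = 1 under the null model


counts = np.array([[3, 0, 1],             # toy corpus: 2 documents, 3 words
                   [0, 2, 1]])
print(dissemination_coefficient(counts))
```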

  • Table 1 hSBM outperforms LDA in real corpora.

    Each row corresponds to a different data set (for details, see the “Data sets for real corpora” section in Materials and Methods). We provide basic statistics of each data set in the “Corpus” columns. The models are compared on the basis of their description length Σ (see Eq. 22). We mark the smallest Σ for each corpus with an asterisk (*) to indicate the best model. Results for LDA with noninformative and fitted hyperparameters are shown in the columns “ΣLDA” and “ΣLDA (hyperfit)” for different numbers of topics K ∈ {10, 50, 100, 500}. Results for the hSBM are shown in the column “ΣhSBM,” and the inferred numbers of groups (documents and words) in “hSBM groups.”

    Corpus         | Docs   | Words  | Word tokens | ΣLDA (K = 10 / 50 / 100 / 500)                     | ΣLDA (hyperfit) (K = 10 / 50 / 100 / 500)         | ΣhSBM       | hSBM groups (Doc. / Words)
    Twitter        | 10,000 | 12,258 | 196,625     | 1,231,104 / 1,648,195 / 1,960,947 / 2,558,940      | 1,040,987 / 1,041,106 / 1,037,678 / 1,057,956     | *963,260    | 365 / 359
    Reuters        | 1,000  | 8,692  | 117,661     | 498,194 / 593,893 / 669,723 / 922,984              | 463,660 / 477,645 / 481,098 / 496,645             | *341,199    | 54 / 55
    Web of Science | 1,000  | 11,198 | 126,313     | 530,519 / 666,447 / 760,114 / 1,056,554            | 531,893 / 555,727 / 560,455 / 571,291             | *426,529    | 16 / 18
    New York Times | 1,000  | 32,415 | 335,749     | 1,658,815 / 1,673,333 / 2,178,439 / 2,977,931      | 1,658,815 / 1,673,333 / 1,686,495 / 1,725,057     | *1,448,631  | 124 / 125
    PLOS ONE       | 1,000  | 68,188 | 5,172,908   | 10,637,464 / 10,964,312 / 11,145,531 / 13,180,803  | 10,358,157 / 10,140,244 / 10,033,886 / 9,348,149  | *8,475,866  | 897 / 972

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/4/7/eaaq1360/DC1

    Section S1. Marginal likelihood of the SBM

    Section S2. Artificial corpora drawn from LDA

    Section S3. Varying the hyperparameters and number of topics

    Section S4. Word-document networks are not sparse

    Section S5. Empirical word-frequency distribution

    Fig. S1. Varying the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.

    Fig. S2. Varying the number of topics K in the comparison between LDA and SBM for artificial corpora drawn from LDA.

    Fig. S3. Varying the base measure of the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.

    Fig. S4. Word-document networks are not sparse.

    Fig. S5. Empirical rank-frequency distribution.

    Reference (61)

