Fig. 1 Two approaches to extract information from collections of texts. Topic models represent the texts as a document-word matrix (how often each word appears in each document), which is then written as a product of two matrices of smaller dimensions with the help of the latent variable topic. The approach we propose here represents texts as a network and infers communities in this network. The nodes consist of documents and words, and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph that is equivalent to the word-document matrix used in topic models.
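To make this equivalence concrete, the following is a minimal sketch (not the code used in the paper) that turns a toy document-word count matrix into the corresponding bipartite multigraph; the example documents, vocabulary, and counts are made up, and networkx is used only for illustration.

```python
import networkx as nx

docs = ["doc_0", "doc_1"]
vocab = ["network", "topic", "model"]
counts = [[2, 0, 1],   # word counts in doc_0
          [0, 3, 1]]   # word counts in doc_1

g = nx.MultiGraph()
g.add_nodes_from(docs, kind="document")
g.add_nodes_from(vocab, kind="word")

for d, doc in enumerate(docs):
    for w, word in enumerate(vocab):
        # one parallel edge per occurrence, so the edge multiplicity
        # equals the corresponding entry of the document-word matrix
        for _ in range(counts[d][w]):
            g.add_edge(doc, word)

print(g.number_of_edges())  # total number of word tokens (7 here)
```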
Fig. 3 LDA is unable to infer non-Dirichlet topic mixtures. Visualization of the distribution of topic mixtures log P(θd) for different synthetic and real data sets in the two-simplex using K = 3 topics. We show the true distribution in the case of the synthetic data (top) and the distributions inferred by LDA (middle) and SBM (bottom). (A) Synthetic data sets with Dirichlet mixtures from the generative process of LDA with document hyperparameters αd = 0.01 × (1/3, 1/3, 1/3) (left) and αd = 100 × (1/3, 1/3, 1/3) (right), leading to different true mixture distributions log P(θd). We fix the word hyperparameter βrw = 0.01, D = 1000 documents, V = 100 different words, and text length kd = 1000. (B) Synthetic data sets with non-Dirichlet mixtures, each obtained by combining two Dirichlet mixtures: αd ∈ {100 × (1/3, 1/3, 1/3), 100 × (0.1, 0.8, 0.1)} (left) and αd ∈ {100 × (0.1, 0.2, 0.7), 100 × (0.1, 0.7, 0.2)} (right). (C) Real data sets with unknown topic mixtures: Reuters (left) and Web of Science (right), each containing D = 1000 documents. For LDA, we use hyperparameter optimization. For SBM, we use an overlapping, non-nested parametrization in which each document belongs to its own group such that B = D + K, allowing for an unambiguous interpretation of the group membership as topic mixtures in the framework of topic models.
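For concreteness, the generative process behind the synthetic data in (A) can be sketched as follows; this is a minimal numpy illustration of LDA's sampling steps with the caption's hyperparameter values, not the exact code used to produce the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, k_d = 3, 100, 1000, 1000           # topics, vocabulary size, documents, text length
alpha = 0.01 * np.ones(K) / K               # document hyperparameter alpha_d = 0.01 x (1/3, 1/3, 1/3)
beta = 0.01 * np.ones(V)                    # word hyperparameter beta_rw = 0.01

phi = rng.dirichlet(beta, size=K)           # one topic-word distribution per topic
thetas, corpus = [], []
for d in range(D):
    theta_d = rng.dirichlet(alpha)          # topic mixture of document d
    z = rng.choice(K, size=k_d, p=theta_d)  # topic assignment of each token
    words = [rng.choice(V, p=phi[k]) for k in z]  # draw each word from its topic
    thetas.append(theta_d)
    corpus.append(words)
```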
Fig. 4 Comparison between LDA and SBM for artificial corpora drawn from LDA. Description length Σ of LDA and hSBM for an artificial corpus drawn from the generative process of LDA with K = 10 topics. (A) Difference in Σ, ΔΣ = Σi − ΣLDA (true priors), relative to LDA with true priors (the model that generated the data) as a function of the text length kd = m for D = 10⁶ documents. (B) Normalized Σ (per word) as a function of the number of documents D for fixed text length kd = m = 128. The four curves correspond to different choices in the parametrization of the topic models: (i) LDA with noninformative (noninf) priors (light blue, ×), (ii) LDA with true priors, that is, the hyperparameters used to generate the artificial corpus (dark blue, •), (iii) hSBM without clustering of documents (light orange, ▲), and (iv) hSBM with clustering of documents (dark orange, ▼).
Fig. 5 Inference of hSBM for articles from Wikipedia. Articles from three categories (chemical physics, experimental physics, and computational biology). The first hierarchical level reflects the bipartite nature of the network, with document nodes (left) and word nodes (right). The grouping on the second hierarchical level is indicated by solid lines. We show examples of nodes that belong to each group on the third hierarchical level (indicated by dotted lines): For word nodes, we show the five most frequent words; for document nodes, we show three (or fewer) randomly selected articles. For each word, we calculate the dissemination coefficient UD, which quantifies how unevenly words are distributed among documents (60): UD = 1 indicates the expected dissemination from a random null model; the smaller UD (0 < UD < 1), the more unevenly a word is distributed. We show the 5th, 25th, 50th, 75th, and 95th percentiles for each group of word nodes on the third level of the hierarchy. Intl. Soc. for Comp. Biol., International Society for Computational Biology; RRKM theory, Rice-Ramsperger-Kassel-Marcus theory.
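As a rough illustration of the dissemination coefficient, the sketch below compares the observed number of documents containing each word with a null-model expectation. The specific approximation E[Dw] ≈ Σd [1 − (1 − kd/N)^fw] is our assumption in the spirit of ref. (60), not necessarily the exact formula used there, and the toy counts are made up.

```python
import numpy as np

def dissemination(counts):
    """counts: (D, V) array of word counts per document; returns U_D per word."""
    counts = np.asarray(counts, dtype=float)
    k_d = counts.sum(axis=1)               # length of each document
    N = k_d.sum()                           # total number of tokens
    f_w = counts.sum(axis=0)               # total frequency of each word
    observed = (counts > 0).sum(axis=0)    # documents actually containing the word
    # assumed null model: the f_w tokens of a word land at random among all N token slots
    expected = (1.0 - (1.0 - k_d[:, None] / N) ** f_w[None, :]).sum(axis=0)
    return observed / expected              # U_D < 1: more uneven than the random expectation

U = dissemination([[5, 0, 1], [0, 4, 1], [0, 0, 2]])
print(U)
```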
Table 1 hSBM outperforms LDA in real corpora. Each row corresponds to a different data set (for details, see the "Data sets for real corpora" section in Materials and Methods). We provide basic statistics of each data set in column "Corpus." The models are compared on the basis of their description length Σ (see Eq. 22); a minimal computational sketch of obtaining such a description length follows the table. We highlight the smallest Σ for each corpus in boldface to indicate the best model. Results for LDA with noninformative and fitted hyperparameters are shown in columns "ΣLDA" and "ΣLDA (hyperfit)" for different numbers of topics K ∈ {10, 50, 100, 500}. Results for the hSBM are shown in column "ΣhSBM" and the inferred number of groups (documents and words) in "hSBM groups."
| Corpus | Doc. | Words | Word tokens | ΣLDA (K=10) | ΣLDA (K=50) | ΣLDA (K=100) | ΣLDA (K=500) | ΣLDA hyperfit (K=10) | ΣLDA hyperfit (K=50) | ΣLDA hyperfit (K=100) | ΣLDA hyperfit (K=500) | ΣhSBM | hSBM groups (Doc.) | hSBM groups (Words) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Twitter | 10,000 | 12,258 | 196,625 | 1,231,104 | 1,648,195 | 1,960,947 | 2,558,940 | 1,040,987 | 1,041,106 | 1,037,678 | 1,057,956 | **963,260** | 365 | 359 |
| Reuters | 1000 | 8692 | 117,661 | 498,194 | 593,893 | 669,723 | 922,984 | 463,660 | 477,645 | 481,098 | 496,645 | **341,199** | 54 | 55 |
| Web of Science | 1000 | 11,198 | 126,313 | 530,519 | 666,447 | 760,114 | 1,056,554 | 531,893 | 555,727 | 560,455 | 571,291 | **426,529** | 16 | 18 |
| New York Times | 1000 | 32,415 | 335,749 | 1,658,815 | 1,673,333 | 2,178,439 | 2,977,931 | 1,658,815 | 1,673,333 | 1,686,495 | 1,725,057 | **1,448,631** | 124 | 125 |
| PLOS ONE | 1000 | 68,188 | 5,172,908 | 10,637,464 | 10,964,312 | 11,145,531 | 13,180,803 | 10,358,157 | 10,140,244 | 10,033,886 | 9,348,149 | **8,475,866** | 897 | 972 |
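As a minimal computational sketch (assuming the graph-tool library is installed), a nested SBM can be fit to a small word-document multigraph and a description length read off from state.entropy(). This simplified call does not reproduce the paper's exact hSBM parametrization (for example, it does not force document and word nodes into separate groups), so the numbers are only illustrative.

```python
import graph_tool.all as gt

g = gt.Graph(directed=False)
docs = [g.add_vertex() for _ in range(3)]   # document nodes
words = [g.add_vertex() for _ in range(4)]  # word nodes

# toy document-word counts; one parallel edge per word occurrence
counts = [[2, 1, 0, 0],
          [0, 3, 1, 0],
          [0, 0, 2, 2]]
for d, row in enumerate(counts):
    for w, c in enumerate(row):
        for _ in range(c):
            g.add_edge(docs[d], words[w])

state = gt.minimize_nested_blockmodel_dl(g)  # fit the nested SBM
print(state.entropy())                       # description length (in nats, as returned by graph-tool)
```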
Supplementary Materials
Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/4/7/eaaq1360/DC1
Section S1. Marginal likelihood of the SBM
Section S2. Artificial corpora drawn from LDA
Section S3. Varying the hyperparameters and number of topics
Section S4. Word-document networks are not sparse
Section S5. Empirical word-frequency distribution
Fig. S1. Varying the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S2. Varying the number of topics K in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S3. Varying the base measure of the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S4. Word-document networks are not sparse.
Fig. S5. Empirical rank-frequency distribution.
Reference (61)