Research Article | NETWORK SCIENCE

A network approach to topic models


Science Advances  18 Jul 2018:
Vol. 4, no. 7, eaaq1360
DOI: 10.1126/sciadv.aaq1360

Abstract

One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.

INTRODUCTION

The accelerating rate of digitization of information increases the importance and number of problems that require automatic organization and classification of written text. Topic models (1) are a flexible and widely used tool that identifies semantically related documents through the topics they address. These methods originated in machine learning and were largely based on heuristic approaches, such as the singular value decomposition used in latent semantic indexing (LSI) (2), in which one optimizes an arbitrarily chosen quality function. Only a more statistically principled approach, based on the formulation of probabilistic generative models (3), allowed for a deeper theoretical foundation within the framework of Bayesian statistical inference. This, in turn, led to a series of key developments, in particular, probabilistic LSI (pLSI) (4) and latent Dirichlet allocation (LDA) (5, 6). The latter established itself as the state-of-the-art method in topic modeling and has been widely used not only for recommendation and classification (7) but also for bibliometric (8), psychological (9), and political (10) analysis. Beyond the scope of natural language, LDA has also been applied in biology (11) [developed independently in this context (12)] and image processing (13).

However, despite its success and overwhelming popularity, LDA is known to suffer from fundamental flaws in the way it represents text. In particular, it lacks an intrinsic methodology to choose the number of topics and contains a large number of free parameters that can cause overfitting. Furthermore, there is no justification for the use of the Dirichlet prior in the model formulation besides mathematical convenience. This choice restricts the types of topic mixtures and is not designed to be compatible with well-known properties of real text (14), such as Zipf’s law (15) for the frequency of words. More recently, consistency problems have also been identified with respect to how planted structures in artificial corpora can be recovered with LDA (16). A substantial part of the research in topic models focuses on creating more sophisticated and realistic versions of LDA that account for, for example, syntax (17), correlations between topics (18), meta-information (such as authors) (19), or burstiness (20). Other approaches consist of postinference fitting of the number of topics (21) or the hyperparameters (22) or the formulation of nonparametric hierarchical extensions (23–25). In particular, models based on the Pitman-Yor (26–28) or the negative binomial process have tried to address the issue of Zipf’s law (29), yielding useful generalizations of the simplistic Dirichlet prior (30). While all these approaches lead to demonstrable improvements, they do not provide satisfying solutions to the aforementioned issues because they share the limitations due to the choice of Dirichlet priors, introduce idiosyncratic structures to the model, or rely on heuristic approaches in the optimization of the free parameters.

A similar evolution from heuristic approaches to probabilistic models is occurring in the field of complex networks, particularly in the problem of community detection (31). Topic models and community-detection methods have been developed largely independently from each other, with only a few papers pointing to their conceptual similarities (16, 32, 33). The idea of community detection is to find large-scale structure, that is, to identify groups of nodes with similar connectivity patterns (31). This is motivated by the fact that these groups describe the heterogeneous nonrandom structure of the network and may correspond to functional units, giving potential insights into the generative mechanisms behind the network formation. While there is a variety of different approaches to community detection, most methods are heuristic and optimize a quality function, the most popular being modularity (34). Modularity suffers from severe conceptual deficiencies, such as its inability to assess statistical significance, leading to the detection of groups in completely random networks (35), or its inability to find groups below a given size (36). Methods such as modularity maximization are analogous to the pre-pLSI heuristic approaches to topic models, sharing many conceptual and practical deficiencies with them. In an effort to overcome these problems, many researchers moved to probabilistic inference approaches, most notably those based on stochastic block models (SBMs) (32, 37, 38), mirroring the same trend that occurred in topic modeling.

Here, we propose and apply a unified framework to the fields of topic modeling and community detection. As illustrated in Fig. 1, by representing the word-document matrix as a bipartite network, the problem of inferring topics becomes a problem of inferring communities. Topic models and community-detection methods have been previously discussed as being part of mixed-membership models (39). However, this has remained a conceptual connection (16), and in practice, the two approaches are used to address different problems (32): the occurrence of words within documents and the links/citations between documents, respectively. In contrast, here, we develop a formal correspondence that builds on the mathematical equivalence between pLSI of texts and SBMs of networks (33) and that we use to adapt community-detection methods to perform topic modeling. In particular, we derive a nonparametric Bayesian parametrization of pLSI—adapted from a hierarchical SBM (hSBM) (40–42)—that makes fewer assumptions about the underlying structure of the data. As a consequence, it better matches the statistical properties of real texts and solves many of the intrinsic limitations of LDA. For example, we demonstrate the limitations induced by the Dirichlet priors by showing that LDA fails to infer topical structures that deviate from the Dirichlet assumption. We show that our model correctly infers these structures and thus leads to a better topic model than Dirichlet-based methods (such as LDA) in terms of model selection not only in various real corpora but also in artificial corpora generated from LDA itself. In addition, our nonparametric approach uncovers topical structures on many scales of resolution and automatically determines the number of topics, together with the word classification, and its symmetric formulation allows the documents themselves to be clustered into hierarchical categories.

Fig. 1 Two approaches to extract information from collections of texts.

Topic models represent the texts as a document-word matrix (how often each word appears in each document), which is then written as a product of two matrices of smaller dimensions with the help of the latent variable topic. The approach we propose here represents texts as a network and infers communities in this network. The nodes consist of documents and words, and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph that is equivalent to the word-document matrix used in topic models.
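As a small illustration of this equivalence, the following minimal sketch builds both representations from the same toy corpus; the documents and words are invented for the example.

```python
# Toy illustration: the document-word count matrix and the bipartite multigraph
# of documents and words are two views of the same data.
from collections import Counter

docs = [
    "network community detection network",
    "topic model topic inference",
    "community topic network model",
]

# Document-word matrix: n[d][w] = number of occurrences of word w in document d.
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(w for c in counts for w in c))
n = [[c[w] for w in vocab] for c in counts]

# Equivalent bipartite multigraph: one edge per word token,
# connecting the document node d to the word node w.
edges = [(d, w) for d, c in enumerate(counts) for w, m in c.items() for _ in range(m)]

print(vocab)
print(n)                  # document-word matrix
print(len(edges))         # number of edges = total number of word tokens
```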

The goal of our study is to introduce a unified approach to topic modeling and community detection, showing how ideas and methods can be transported between these two classes of problems. The benefit of this unified approach is illustrated by the derivation of an alternative to Dirichlet-based topic models, which is more principled in its theoretical foundation (making fewer assumptions about the data) and superior in practice according to model selection criteria.

RESULTS

Community detection for topic modeling

Here, we expose the connection between topic modeling and community detection, as illustrated in Fig. 2. We first revisit how a Bayesian formulation of pLSI assuming Dirichlet priors leads to LDA and how we can reinterpret the former as a mixed-membership SBM. We then use the latter to derive a more principled approach to topic modeling using nonparametric and hierarchical priors.

Fig. 2 Parallelism between topic models and community detection methods.

The pLSI and SBMs are mathematically equivalent, and therefore, methods from community detection (for example, the hSBM we propose in this study) can be used as alternatives to traditional topic models (for example, LDA).

Topic models: pLSI and LDA. pLSI is a model that generates a corpus composed of D documents, where each document d has k_d words (4). Words are placed in the documents based on the topic mixtures assigned to both documents and words, from a total of K topics. More specifically, one iterates through all D documents; for each document d, one samples k_d ~ Poi(η_d), and for each word token l ∈ {1,…, k_d}, first, a topic r is chosen with probability θ_dr, and then, a word w is chosen from that topic with probability φ_rw. If n_dw^r is the number of occurrences of word w of topic r in document d (summarized as n), then the probability of a corpus is

$$P(\boldsymbol{n}\mid\boldsymbol{\eta},\boldsymbol{\theta},\boldsymbol{\phi})=\prod_{dwr}\frac{(\eta_d\theta_{dr}\phi_{rw})^{n_{dw}^{r}}}{n_{dw}^{r}!}\,e^{-\eta_d\theta_{dr}\phi_{rw}} \qquad (1)$$

We denote matrices by boldface symbols, for example, θ = {θ_dr} with d = 1,…, D and r = 1,…, K, where θ_dr is an individual entry; thus, the notation θ_d refers to the vector {θ_dr} with fixed d and r = 1,…, K.
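As a concrete illustration, the following minimal sketch samples a corpus from the generative process just described; the parameter values, and the use of Dirichlet draws merely to produce valid probability vectors θ and φ, are arbitrary choices for the example and not part of the model.

```python
# Sketch of the pLSI generative process described above.
# The parameter values below are arbitrary and for illustration only.
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 5, 3, 50                            # documents, topics, vocabulary size
eta = rng.uniform(50, 150, D)                 # expected document lengths eta_d
theta = rng.dirichlet(np.ones(K), size=D)     # document-topic mixtures theta_dr
phi = rng.dirichlet(np.ones(V), size=K)       # topic-word distributions phi_rw

n = np.zeros((D, K, V), dtype=int)            # n[d, r, w]: tokens of word w with topic r in doc d
for d in range(D):
    k_d = rng.poisson(eta[d])                 # document length k_d ~ Poi(eta_d)
    topics = rng.choice(K, size=k_d, p=theta[d])   # topic r chosen with probability theta_dr
    for r in topics:
        w = rng.choice(V, p=phi[r])           # word w chosen from topic r with probability phi_rw
        n[d, r, w] += 1

print(n.sum(axis=1))                          # observed word counts n_dw (topic labels summed out)
```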

For an unknown text, we could simply maximize Eq. 1 to obtain the best parameters η, θ, and φ, which describe the topical structure of the corpus. However, we cannot directly use this approach to model textual data without a significant danger of overfitting. The model has a large number of parameters that grows as the number of documents, words, and topics is increased, and hence, a maximum likelihood estimate will invariably incorporate a considerable amount of noise. One solution to this problem is to use a Bayesian formulation by proposing prior distributions for the parameters and integrating over them. This is precisely what is performed in LDA (5, 6), where one chooses Dirichlet priors D_d(θ_d|α_d) and D_r(φ_r|β_r) with hyperparameters α and β for the probabilities θ and φ above and one uses instead the marginal likelihood

$$P(\boldsymbol{n}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})=\int P(\boldsymbol{n}\mid\boldsymbol{\eta},\boldsymbol{\theta},\boldsymbol{\phi})\prod_{d}\mathcal{D}_d(\boldsymbol{\theta}_d\mid\boldsymbol{\alpha}_d)\prod_{r}\mathcal{D}_r(\boldsymbol{\phi}_r\mid\boldsymbol{\beta}_r)\,\mathrm{d}\boldsymbol{\theta}\,\mathrm{d}\boldsymbol{\phi} \qquad (2)$$

If one makes a noninformative choice, that is, α_dr = 1 and β_rw = 1, then inference using Eq. 2 is nonparametric and less susceptible to overfitting. In particular, one can obtain the labeling of word tokens into topics, {n_dw^r}, conditioned only on the observed total frequencies of words in documents, n_dw = ∑_r n_dw^r, in addition to the number of topics K itself, simply by maximizing or sampling from the posterior distribution. The weakness of this approach lies in the fact that the Dirichlet prior is a simplistic assumption about the data-generating process: In its noninformative form, every mixture in the model—both of topics in each document as well as of words into topics—is assumed to be equally likely, precluding the existence of any form of higher-order structure. This limitation has prompted the widespread practice of using LDA in a parametric manner, by maximizing the likelihood with respect to the hyperparameters α and β, which can improve the quality of fit in many cases. But not only does this undermine to a large extent the initial purpose of a Bayesian approach—as the number of hyperparameters still increases with the number of documents, words, and topics, and hence maximizing over them reintroduces the danger of overfitting—but it also does not sufficiently address the original limitation of the Dirichlet prior. Namely, regardless of the hyperparameter choice, the Dirichlet distribution is unimodal, meaning that it generates mixtures that are either concentrated around the mean value or spread away uniformly from it toward pure components. This means that for any choice of α and β, the whole corpus is characterized by a single typical mixture of topics into documents and a single typical mixture of words into topics. This is an extreme level of assumed homogeneity, which stands in contradiction to a clustering approach initially designed to capture heterogeneity.
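The unimodality of the Dirichlet can be made tangible with a small numerical sketch: topic mixtures drawn from a single Dirichlet concentrate around one typical mixture, whereas a superposition of two Dirichlets (of the kind used later in Fig. 3B) is bimodal and cannot be matched by any single choice of α. The parameter values below are illustrative.

```python
# Illustration of the unimodality of the Dirichlet prior (illustrative parameters).
import numpy as np

rng = np.random.default_rng(1)
K, D = 3, 10000

# Topic mixtures theta_d from a single Dirichlet: one typical mixture.
single = rng.dirichlet(100 * np.array([1/3, 1/3, 1/3]), size=D)

# Topic mixtures from a superposition of two Dirichlets: two well-separated modes,
# a structure that no single Dirichlet prior can reproduce.
mixture = np.vstack([
    rng.dirichlet(100 * np.array([0.1, 0.8, 0.1]), size=D // 2),
    rng.dirichlet(100 * np.array([0.8, 0.1, 0.1]), size=D // 2),
])

print("P(theta_d2 > 0.5), single Dirichlet:     ", (single[:, 1] > 0.5).mean())
print("P(theta_d2 > 0.5), superposition of two: ", (mixture[:, 1] > 0.5).mean())
```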

In addition to the above, the use of nonparametric Dirichlet priors is inconsistent with well-known universal statistical properties of real texts, most notably, the highly skewed distribution of word frequencies, which typically follows Zipf’s law (15). In contrast, the noninformative choice of the Dirichlet distribution with hyperparameters βrw = 1 amounts to an expected uniform frequency of words in topics and documents. Although choosing appropriate values of βrw can address this disagreement, such an approach, as already mentioned, runs contrary to nonparametric inference and is subject to overfitting. In the following, we will show how one can recast the same original pLSI model as a network model that completely removes the limitations described above and is capable of uncovering heterogeneity in the data at multiple scales.

Topic models and community detection: Equivalence between pLSI and SBM. We show that pLSI is equivalent to a specific form of a mixed-membership SBM, as proposed by Ball et al. (33). The SBM is a model that generates a network composed of i = 1,…, N nodes with adjacency matrix A_ij, which we will assume without loss of generality to correspond to a multigraph, that is, A_ij ∈ ℕ. The nodes are placed in a partition composed of B overlapping groups, and the edges between nodes i and j are sampled from a Poisson distribution with average

$$\lambda_{ij}=\sum_{rs}\kappa_{ir}\,\omega_{rs}\,\kappa_{js} \qquad (3)$$

where ω_rs is the expected number of edges between group r and group s, and κ_ir is the probability that node i is sampled from group r. We can write the likelihood of observing {A_ij^rs}, that is, a particular decomposition of A_ij into labeled half-edges (that is, edge end points) such that A_ij = ∑_rs A_ij^rs, as

$$P(\{A_{ij}^{rs}\}\mid\boldsymbol{\kappa},\boldsymbol{\omega})=\prod_{i<j}\prod_{rs}\frac{(\kappa_{ir}\omega_{rs}\kappa_{js})^{A_{ij}^{rs}}}{A_{ij}^{rs}!}\,e^{-\kappa_{ir}\omega_{rs}\kappa_{js}} \qquad (4)$$

by exploiting the fact that the sum of Poisson variables is also distributed according to a Poisson.
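The following sketch samples a small multigraph from this Poisson process (Eq. 3); the network size, the memberships κ, and the edge-count matrix ω are arbitrary illustrative values.

```python
# Sketch of edge sampling in the mixed-membership SBM of Eq. 3 (illustrative parameters).
import numpy as np

rng = np.random.default_rng(2)
N, B = 6, 2
kappa = rng.dirichlet(np.ones(N), size=B).T     # kappa[i, r]: probability that node i is sampled from group r
omega = np.array([[20.0, 5.0],
                  [5.0, 30.0]])                 # omega[r, s]: expected number of edges between groups r and s

lam = kappa @ omega @ kappa.T                   # lambda_ij = sum_rs kappa_ir * omega_rs * kappa_js
A = rng.poisson(np.triu(lam, k=1))              # sample each pair i < j independently from Poi(lambda_ij)
A = A + A.T                                     # symmetric adjacency matrix of the multigraph

print(A)
print("expected edges:", np.triu(lam, 1).sum().round(1), "sampled edges:", np.triu(A, 1).sum())
```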

We can now make the connection to pLSI by rewriting the token probabilities in Eq. 1 in a symmetric fashion as

$$\eta_d\,\theta_{dr}\,\phi_{rw}=\eta_d\,\theta_{dr}\,\bar{\phi}_{rw}\,\eta_w \qquad (5)$$

where φ̄_rw = φ_rw/η_w is the probability that the word w belongs to topic r, and η_w ≡ ∑_s φ_sw is the overall propensity with which the word w is chosen across all topics. In this manner, we can rewrite the likelihood of Eq. 1 as

$$P(\boldsymbol{n}\mid\boldsymbol{\eta},\boldsymbol{\theta},\bar{\boldsymbol{\phi}})=\prod_{dwr}\frac{(\eta_d\theta_{dr}\bar{\phi}_{rw}\eta_w)^{n_{dw}^{r}}}{n_{dw}^{r}!}\,e^{-\eta_d\theta_{dr}\bar{\phi}_{rw}\eta_w} \qquad (6)$$

with n_dw = ∑_r n_dw^r. If we choose to view the counts n_dw as the entries of the adjacency matrix of a bipartite multigraph with documents and words as nodes, the likelihood of Eq. 6 is equivalent to the likelihood of Eq. 4 of the SBM if we assume that each document belongs to its own specific group, κ_ir = δ_ir, with i = 1,…, D for document nodes, and by rewriting ω_ds = η_d θ_ds and κ_ws = η_w φ̄_sw = φ_sw. Therefore, the SBM of Eq. 4 is a generalization of pLSI that allows the words and the documents to be clustered into groups and includes it as a special case when the documents are not clustered.
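A quick numerical check of this correspondence, under the mapping just stated (each document in its own group, ω_ds = η_d θ_ds, κ_ws = φ_sw) and with arbitrary illustrative parameters:

```python
# Numerical check that the pLSI token rates coincide with the SBM edge rates
# under the mapping described above (illustrative parameters).
import numpy as np

rng = np.random.default_rng(3)
D, K, V = 4, 3, 20
eta = rng.uniform(50, 150, D)               # document propensities eta_d
theta = rng.dirichlet(np.ones(K), size=D)   # theta[d, r]
phi = rng.dirichlet(np.ones(V), size=K)     # phi[r, w]

# pLSI: expected number of tokens of word w in document d.
rate_plsi = np.einsum('d,dr,rw->dw', eta, theta, phi)

# SBM: each document is its own group; omega_ds = eta_d * theta_ds, kappa_ws = phi_sw.
omega = eta[:, None] * theta                # omega[d, s]
kappa_words = phi.T                         # kappa_words[w, s]
rate_sbm = omega @ kappa_words.T            # lambda_dw = sum_s omega_ds * kappa_ws

print(np.allclose(rate_plsi, rate_sbm))     # True: the two parametrizations agree
```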

In the symmetric setting of the SBM, we make no explicit distinction between words and documents, both of which become nodes in different partitions of a bipartite network. We base our Bayesian formulation that follows on this symmetric parametrization.

Community detection and the hSBM. Taking advantage of the above connection between pLSI and SBM, we show how we can extend the idea of hSBMs developed in (40–42) such that we can effectively use them for the inference of topical structure in texts. Like pLSI, the SBM likelihood of Eq. 4 contains a large number of parameters that grow with the number of groups and therefore cannot be used effectively without knowing the most appropriate dimension of the model beforehand. Analogously to what is carried out in LDA, we can address this by assuming noninformative priors for the parameters κ and ω and computing the marginal likelihood (for an explicit expression, see section S1.1)

$$P(\boldsymbol{A}\mid\bar{\lambda})=\int P(\boldsymbol{A}\mid\boldsymbol{\kappa},\boldsymbol{\omega})\,P(\boldsymbol{\kappa})\,P(\boldsymbol{\omega}\mid\bar{\lambda})\,\mathrm{d}\boldsymbol{\kappa}\,\mathrm{d}\boldsymbol{\omega} \qquad (7)$$

where λ̄ is a global parameter determining the overall density of the network. We can use this to infer the labeled adjacency matrix {A_ij^rs}, as performed in LDA, with the difference that not only the words but also the documents would be clustered into mixed categories.

However, at this stage, the model still shares some disadvantages with LDA. In particular, the noninformative priors make unrealistic assumptions about the data, where the mixture between groups and the distribution of nodes into groups are expected to be unstructured. Among other problems, this leads to a practical obstacle, as this approach has a “resolution limit” where, at most, O(√N) groups can be inferred on a sparse network with N nodes (42, 43). In the following, we propose a qualitatively different approach to the choice of priors by replacing the noninformative approach with a deeper Bayesian hierarchy of priors and hyperpriors, which are agnostic about the higher-order properties of the data while maintaining the nonparametric nature of the approach. We begin by reformulating the above model as an equivalent microcanonical model (for a proof, see section S1.2) (42) such that we can write the marginal likelihood as the joint likelihood of the data and its discrete parameters

$$P(\boldsymbol{A},\boldsymbol{k},\boldsymbol{e}\mid\bar{\lambda})=P(\boldsymbol{A}\mid\boldsymbol{k},\boldsymbol{e})\,P(\boldsymbol{k}\mid\boldsymbol{e})\,P(\boldsymbol{e}\mid\bar{\lambda}) \qquad (8)$$

where the three factors on the right-hand side are given explicitly by Eqs. 9 to 11. Here, e_rs = ∑_ij A_ij^rs is the total number of edges between groups r and s (we used the shorthands e_r = ∑_s e_rs and k_i^r = ∑_js A_ij^rs), P(A|k,e) is the probability of a labeled graph A in which the labeled degrees k and the edge counts between groups e are constrained to specific values (and not their expectation values), P(k|e) is the uniform prior distribution of the labeled degrees constrained by the edge counts e, and P(e|λ̄) is the prior distribution of the edge counts, given by a mixture of independent geometric distributions with average λ̄.

The main advantage of this alternative model formulation is that it allows us to remove the homogeneity assumptions by replacing the uniform priors P(k|e) and P(e|λ̄) with a hierarchy of priors and hyperpriors that incorporate the possibility of higher-order structures. This can be achieved in a tractable manner, without the need to solve the complicated integrals that would be required if deeper Bayesian hierarchies were introduced into Eq. 7 directly.

In a first step, we follow the approach of (41) and condition the labeled degrees k on an overlapping partition b = {b_ir} (Eq. 12), from which they are sampled according to the distribution of Eq. 13.

The labeled degree sequence is sampled conditioned on the frequency of each degree inside each mixture b, which itself is sampled from its own noninformative prior (Eq. 14), where e_b is the number of incident edges in each mixture (for detailed expressions, see section S1.3).

Because the frequencies of the mixtures and those of the labeled degrees are treated as latent variables, this model admits group mixtures that are far more heterogeneous than those generated by the Dirichlet prior used in LDA. In particular, as was shown in (42), the expected degrees generated in this manner follow a Bose-Einstein distribution, which is much broader than the exponential distribution obtained with the prior of Eq. 10. The asymptotic form of the degree likelihood will approach the true distribution as the prior washes out (42), making it more suitable for skewed empirical frequencies, such as Zipf’s law or mixtures thereof (44), without requiring specific parameters—such as exponents—to be determined a priori.

In a second step, we follow (40, 42) and model the prior for the edge counts e between groups by interpreting it as an adjacency matrix itself, that is, a multigraph where the B groups are the nodes. We then proceed by generating it from another SBM, which, in turn, has its own partition into groups and matrix of edge counts. Continuing in the same manner yields a hierarchy of nested SBMs, where each level l = 1,…, L clusters the groups of the level below. This yields a probability given by Eqs. 15 to 17 [see (42)], where the index l refers to the variables of the SBM at a particular level; for example, n_r^l is the number of nodes in group r at level l.

The use of this hierarchical prior is a strong departure from the noninformative assumption considered previously, while containing it as a special case when the depth of the hierarchy is L = 1. It means that we expect some form of heterogeneity in the data at multiple scales, where groups of nodes are themselves grouped into larger groups, forming a hierarchy. Crucially, this removes the “unimodality” inherent in the LDA assumption, as the group mixtures are now modeled by another generative level, which admits as much heterogeneity as the original one. Furthermore, it can be shown to significantly alleviate the resolution limit of the noninformative approach, since it enables the detection of up to O(N/log N) groups in a sparse network with N nodes (40, 42).

Given the above model, we can find the best overlapping partitions of the nodes by maximizing the posterior distribution

$$P(\{\boldsymbol{b}_l\}\mid\boldsymbol{A})=\frac{P(\boldsymbol{A},\{\boldsymbol{b}_l\})}{P(\boldsymbol{A})} \qquad (18)$$

where the joint distribution P(A, {b_l}) of the data and the hierarchical partitions (Eq. 19) combines the likelihood and the priors introduced above. The posterior can be maximized or sampled efficiently using Markov chain Monte Carlo, as described in (41, 42). The nonparametric nature of the model makes it possible to infer (i) the depth of the hierarchy (containing the “flat” model in case the data do not support a hierarchical structure) and (ii) the number of groups for both documents and words directly from the posterior distribution, without the need for extrinsic methods or supervised approaches to prevent overfitting. We can see the latter by interpreting Eq. 19 as a description length (see the discussion after Eq. 22).

The model above generates arbitrary multigraphs, whereas text is represented as a bipartite network of words and documents. Since the latter is a special case of the former, where words and documents belong to distinct groups, we can use the model as it is, as it will “learn” the bipartite structure during inference. However, a more consistent approach for text is to include this information in the prior, since we should not have to infer what we already know. We can perform this via a simple modification of the model, where one replaces the prior for the overlapping partition appearing in Eq. 13 by

$$P(\boldsymbol{b})=P_{w}(\boldsymbol{b}_{w})\,P_{d}(\boldsymbol{b}_{d}) \qquad (20)$$

where P_w(b_w) and P_d(b_d) now correspond to disjoint overlapping partitions of the words and documents, respectively. Likewise, the same must be carried out at the upper levels of the hierarchy by replacing Eq. 17 with

$$P(\boldsymbol{b}_{l})=P_{w}(\boldsymbol{b}_{w}^{l})\,P_{d}(\boldsymbol{b}_{d}^{l}) \qquad (21)$$

In this manner, by construction, words and documents will never be placed together in the same group.

Comparing LDA and hSBM in real and artificial data

Here, we show that the theoretical considerations discussed in the previous section are relevant in practice. We show that hSBM constitutes a better model than LDA in three classes of problems. First, we construct simple examples that show that LDA fails in cases of non-Dirichlet topic mixtures, while hSBM is able to infer both Dirichlet and non-Dirichlet mixtures. Second, we show that hSBM outperforms LDA even in artificial corpora drawn from the generative process of LDA. Third, we consider five different real corpora. We perform statistical model selection based on the principle of minimum description length (45), computing the description length Σ of each model (the smaller, the better; for details, see the “Minimum description length” section in Materials and Methods).

Failure of LDA in the case of non-Dirichlet mixtures. The choice of the Dirichlet distribution as a prior for the topic mixtures θd implies that the ensemble of topic mixtures P(θd) is assumed to be either unimodal or concentrated at the edges of the simplex. This is an undesired feature of this prior because there is no reason why data should show these characteristics. To explore how this affects the inference of LDA, we construct a set of simple examples with K = 3 topics, which allow for easy visualization. Besides real data, we consider synthetic data constructed from the generative process of LDA [in which case P(θd) follows a Dirichlet distribution] and from cases in which the Dirichlet assumption is violated [for example, by superimposing two Dirichlet mixtures, resulting in a bimodal instead of a unimodal P(θd)].

The results summarized in Fig. 3 show that the SBM leads to better results than LDA. In Dirichlet-generated data (Fig. 3A), LDA self-consistently identifies the distribution of mixtures correctly. The SBM is also able to correctly identify the Dirichlet mixture, although we did not explicitly specify Dirichlet priors. In the non-Dirichlet synthetic data (Fig. 3B), the SBM results again closely match the true topic mixtures, but LDA completely fails. Although the result inferred by LDA no longer resembles the Dirichlet distribution after being influenced by the data, it is significantly distorted by the unsuitable prior assumptions. Turning to real data (Fig. 3C), LDA and the SBM yield very different results. While the “true” underlying topic mixture of each document is unknown in this case, we can identify the negative consequence of the Dirichlet priors from the fact that the results from LDA are again similar to the ones expected from a Dirichlet distribution (thus, likely an artifact), while the SBM results suggest a much richer pattern.

Fig. 3 LDA is unable to infer non-Dirichlet topic mixtures.

Visualization of the distribution of topic mixtures logP(θd) for different synthetic and real data sets in the two-simplex using K = 3 topics. We show the true distribution in the case of the synthetic data (top) and the distributions inferred by LDA (middle) and SBM (bottom). (A) Synthetic data sets with Dirichlet mixtures from the generative process of LDA with document hyperparameters αd = 0.01 × (1/3, 1/3, 1/3) (left) and αd = 100 × (1/3, 1/3, 1/3) (right) leading to different true mixture distributions logP(θd). We fix the word hyperparameter βrw = 0.01, D = 1000 documents, V = 100 different words, and text length kd = 1000. (B) Synthetic data sets with non-Dirichlet mixtures from a combination of two Dirichlet mixtures, respectively: αd ϵ {100 × (1/3, 1/3, 1/3), 100 × (0.1, 0.8, 0.1)} (left) and αd ϵ {100 × (0.1, 0.2, 0.7), 100 × (0.1, 0.7, 0.2)} (right). (C) Real data sets with unknown topic mixtures: Reuters (left) and Web of Science (right) each containing D = 1000 documents. For LDA, we use hyperparameter optimization. For SBM, we use an overlapping, non-nested parametrization in which each document belongs to its own group such that B = D + K, allowing for an unambiguous interpretation of the group membership as topic mixtures in the framework of topic models.

Together, the results of this simple example visually show that LDA not only struggles to infer non-Dirichlet mixtures but also shows strong biases in the inference toward Dirichlet-type mixtures. On the other hand, SBM is able to capture a much richer spectrum of topic mixtures due to its nonparametric formulation. This is a direct consequence of the choice of priors: While LDA assumes a priori that the ensemble of topic mixtures, P(θd), follows a Dirichlet distribution, SBM is more agnostic with respect to the type of mixtures while retaining its nonparametric formulation.

Artificial corpora sampled from LDA. We consider artificial corpora constructed from the generative process of LDA, incorporating some aspects of real texts (for details, see “Artificial corpora” section in Materials and Methods and section S2.1). Although LDA is not a good model for real corpora (as the Dirichlet assumption is not realistic), it serves to illustrate that even in a situation that favors LDA, the hSBM frequently provides a better description of the data.

From the generative process, we know the true latent variable of each word token. Therefore, we are able to obtain the inferred topical structure from each method by simply assigning the true labels without using approximate numerical optimization methods for the inference. This allows us to separate intrinsic properties of the model itself from external properties related to the numerical implementation.

To allow for a fair comparison between hSBM and LDA, we consider two different choices in the inference of each method, respectively. LDA requires the specification of a set of hyperparameters α and β used in the inference. While, in this particular case, we know the true hyperparameters that generated the corpus, in general, these are unknown. Therefore, in addition to the true values, we also consider a noninformative choice, that is, αdr = 1 and βrw = 1. For the inference with hSBM, we only use the special case where the hierarchy has a single level such that the prior is noninformative. We consider two different parametrizations of the SBM: (i) Each document is assigned to its own group, that is, they are not clustered, and (ii) different documents can belong to the same group, that is, they are clustered. While the former is motivated by the original correspondence between pLSI and SBM, the latter shows the additional advantage offered by the possibility of clustering documents due to its symmetric treatment of words and documents in a bipartite network (for details, see section S2.2).

In Fig. 4A, we show that hSBM is consistently better than LDA for synthetic corpora of almost any text length kd = m ranging over four orders of magnitude. These results hold for asymptotically large corpora (in terms of the number of documents), as shown in Fig. 4B, where we observe that the normalized description length of each model converges to a fixed value when increasing the size of the corpus. We confirm that these results hold across a wide range of parameter settings varying the number of topics, as well as the values and base measures of the hyperparameters (section S3 and figs. S1 to S3).

Fig. 4 Comparison between LDA and SBM for artificial corpora drawn from LDA.

Description length Σ of LDA and hSBM for an artificial corpus drawn from the generative process of LDA with K = 10 topics. (A) Difference in Σ, ΔΣ = Σ_i − Σ_LDA-true prior, compared to the LDA with true priors—the model that generated the data—as a function of the text length kd = m for D = 10^6 documents. (B) Normalized Σ (per word) as a function of the number of documents D for fixed text length kd = m = 128. The four curves correspond to different choices in the parametrization of the topic models: (i) LDA with noninformative (noninf) priors (light blue, ×), (ii) LDA with true priors, that is, the hyperparameters used to generate the artificial corpus (dark blue, •), (iii) hSBM without clustering of documents (light orange, ▲), and (iv) hSBM with clustering of documents (dark orange, ▼).

The LDA description length ΣLDA does not depend strongly on the considered prior (true or noninformative) as the size of the corpora increases (Fig. 4B). This is consistent with the typical expectation that in the limit of large data, the prior washes out. However, note that for smaller corpora, the Σ of the noninformative prior is significantly worse than the Σ of the true prior.

In contrast, the hSBM provides much shorter description lengths than LDA for the same data when documents are allowed to be clustered as well. The only exception is for very small texts (m < 10 tokens), where we have not converged to the asymptotic limit in the per-word description length. In the limit D → ∞, we expect hSBM to provide a similarly good or better model than LDA for all text lengths. The improvement of the hSBM over LDA in an LDA-generated corpus is counterintuitive because, for sufficient data, we expect the true model to provide a better description of it. However, for a model such as LDA, the limit of sufficient data involves the simultaneous scaling of the number of documents, words, and topics to very high values. In particular, the generative process of LDA requires a large number of documents to resolve the underlying Dirichlet distribution of the topic-document distribution and a large number of topics to resolve the underlying word-topic distribution. While the former is realized by growing the corpus, that is, by adding documents, the latter aspect is nontrivial because the observed size of the vocabulary V is not a free parameter but is determined by the word-frequency distribution and the size of the corpus through the so-called Heaps’ law (14). This means that, as we grow the corpus by adding more and more documents, the vocabulary initially increases linearly, and only at very large corpora does it settle into an asymptotic sublinear growth (section S4 and fig. S4). This, in turn, requires an ever larger number of topics to resolve the underlying word-topic distribution. Such a large number of topics is not feasible in practice because it renders the whole goal of topic models obsolete, namely, to compress the information by obtaining an effective, coarse-grained description of the corpus with a manageable number of topics.
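The vocabulary-growth effect invoked here can be reproduced with a short sketch: drawing tokens from a Zipf-like word distribution (a synthetic stand-in for the empirical frequencies, not the actual Wikipedia data), the number of distinct word types keeps growing with corpus size and becomes sublinear only slowly.

```python
# Sketch of Heaps'-law-like vocabulary growth from a Zipf-like word distribution
# (synthetic frequencies for illustration, not the empirical Wikipedia data).
import numpy as np

rng = np.random.default_rng(4)
V_true = 100_000
freq = 1.0 / np.arange(1, V_true + 1)         # Zipf-like word probabilities ~ 1/rank
freq /= freq.sum()

tokens = rng.choice(V_true, size=1_000_000, p=freq)
for n_tokens in (10**3, 10**4, 10**5, 10**6):
    vocab_size = len(np.unique(tokens[:n_tokens]))
    print(f"{n_tokens:>8} tokens -> {vocab_size:>6} distinct word types")
```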

In summary, the limits in which LDA provides a better description, that is, either extremely small texts or a very large number of topics, are irrelevant in practice. The observed limitations of LDA are due to the following reasons: (i) The finite number of topics used to generate the data always leads to an undersampling of the Dirichlet distributions, and (ii) LDA is redundant in the way it describes the data in this sparse regime. In contrast, the assumptions of the hSBM are better suited for this sparse regime and hence lead to a more compact description of the data, despite the fact that the corpora were generated by LDA.

Real corpora. We compare LDA and hSBM for a variety of different data sets, as shown in Table 1 (for details, see the “Data sets for real corpora” and “Numerical implementations” sections in Materials and Methods). When using LDA, we consider both noninformative priors and fitted hyperparameters for a wide range of numbers of topics. We obtain systematically smaller values of the description length using the hSBM. For real corpora, the difference is exacerbated by the fact that the hSBM is capable of clustering documents, capitalizing on a source of structure in the data that is completely unavailable to LDA.

Table 1 hSBM outperforms LDA in real corpora.

Each row corresponds to a different data set (for details, see the “Data sets for real corpora” section in Materials and Methods). We provide basic statistics of each data set in the column “Corpus.” The models are compared on the basis of their description length Σ (see Eq. 22). We highlight the smallest Σ for each corpus in boldface to indicate the best model. Results for LDA with noninformative and fitted hyperparameters are shown in the columns “ΣLDA” and “ΣLDA (hyperfit)” for different numbers of topics K ∈ {10, 50, 100, 500}. Results for the hSBM are shown in the column “ΣhSBM,” and the inferred numbers of groups (documents and words) are given in “hSBM groups.”


As our examples also show, LDA cannot be used in a direct manner to choose the number of topics, as the noninformative choice systematically underfits (ΣLDA increases monotonically with the number of topics) and the parametric approach systematically overfits (ΣLDA decreases monotonically with the number of topics). In practice, users are required to resort to heuristics (46, 47) or more complicated inference approaches based on the computation of the model evidence, which not only are numerically expensive but can only be performed under onerous approximations (6, 22). In contrast, the hSBM is capable of extracting the appropriate number of topics directly from its posterior distribution while simultaneously avoiding both under- and overfitting (40, 42).

In addition to these formal aspects, we argue that the hierarchical nature of the hSBM and the fact that it clusters words and documents make it more useful in interpreting text. We illustrate this with a case study in the next section.

Case study: Application of hSBM to Wikipedia articles

We illustrate the results of the inference with the hSBM for articles taken from the English Wikipedia in Fig. 5, showing the hierarchical clustering of documents and words. To make the visualization clearer, we focus on a small network created from only three scientific disciplines: chemical physics (21 articles), experimental physics (24 articles), and computational biology (18 articles). For clarity, we only consider words that appear more than once so that we end up with a network of 63 document nodes, 3140 word nodes, and 39,704 edges.

Fig. 5 Inference of hSBM to articles from the Wikipedia.

Articles from three categories (chemical physics, experimental physics, and computational biology). The first hierarchical level reflects the bipartite nature of the network with document nodes (left) and word nodes (right). The grouping on the second hierarchical level is indicated by solid lines. We show examples of nodes that belong to each group on the third hierarchical level (indicated by dotted lines): For word nodes, we show the five most frequent words; for document nodes, we show three (or fewer) randomly selected articles. For each word, we calculate the dissemination coefficient UD, which quantifies how unevenly words are distributed among documents (60): UD = 1 indicates the expected dissemination from a random null model; the smaller UD (0 < UD < 1), the more unevenly a word is distributed. We show the 5th, 25th, 50th, 75th, and 95th percentiles for each group of word nodes on the third level of the hierarchy. Intl. Soc. for Comp. Biol., International Society for Computational Biology; RRKM theory, Rice-Ramsperger-Kassel-Marcus theory.

The hSBM splits the network into groups on different levels, organized as a hierarchical tree. Note that the number of groups and the number of levels were not specified beforehand but were automatically detected in the inference. On the highest level, the hSBM separates document nodes from word nodes, reflecting the bipartite structure that is imposed in our model.

In contrast to traditional topic models such as LDA, hSBM automatically clusters documents into groups. While we considered articles from three different categories (one category from biology and two categories from physics), the second level in the hierarchy separates documents into only two groups corresponding to articles about biology (for example, bioinformatics or K-mer) and articles on physics (for example, rotating wave approximation or molecular beam). For lower levels, articles become separated into a larger number of groups; for example, one group contains two articles on Euler’s and Newton’s law of motion, respectively.

For words, the second level in the hierarchy splits the nodes into three separate groups. We find that two groups represent words belonging to physics (for example, beam, formula, or energy) and biology (assembly, folding, or protein), while the third group represents function words (the, of, or a). By calculating the dissemination coefficient (right side of Fig. 5; see the caption for its definition), we find that the words in the latter group show a close-to-random distribution across documents. Furthermore, the median dissemination of the other groups is substantially less random, with the exception of one subgroup (containing and, for, or which). This suggests a more data-driven approach to dealing with function words in topic models. The standard practice is to remove words from a manually curated list of stopwords; however, recent results question the efficacy of these methods (48). In contrast, the hSBM is able to automatically identify groups of stopwords, potentially rendering these heuristic interventions unnecessary.
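As a minimal sketch of a dissemination coefficient of this kind, the snippet below compares the observed number of documents containing each word with the number expected under a simple Poisson null model; this particular null model is an assumption made here for illustration, and the precise definition used in the figure follows (60).

```python
# Sketch of a dissemination coefficient: observed document coverage of each word
# divided by the coverage expected under a random (Poisson) null model.
# The null model below is an illustrative assumption; the exact definition
# used in Fig. 5 follows reference (60).
import numpy as np

def dissemination(counts):
    """counts: array of shape (D, V) with word counts per document."""
    counts = np.asarray(counts, dtype=float)
    doc_len = counts.sum(axis=1)                    # k_d: length of each document
    p_w = counts.sum(axis=0) / counts.sum()         # global word frequencies
    observed = (counts > 0).sum(axis=0)             # D_w: documents containing word w
    expected = (1 - np.exp(-np.outer(doc_len, p_w))).sum(axis=0)   # null-model expectation
    return observed / expected                      # ~1: random spread; <1: unevenly distributed

counts = [[10, 0, 1, 3],
          [8, 5, 0, 2],
          [12, 0, 0, 4]]
print(dissemination(counts))
```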

DISCUSSION

The underlying equivalence between pLSI and the overlapping version of the SBM means that the “bag-of-words” formulation of topical corpora is mathematically equivalent to bipartite networks of words and documents with modular structures. From this, we were able to formulate a topic model based on the hSBM in a fully Bayesian framework, alleviating some of the most serious conceptual deficiencies in current approaches to topic modeling such as LDA. In particular, the model formulation is nonparametric, and model complexity aspects, such as the number of topics, can be inferred directly from the model’s posterior distribution. Furthermore, the model is based on a hierarchical clustering of both words and documents, in contrast to LDA, which is based on a nonhierarchical clustering of the words alone. This enables the identification of structural patterns in text that are unavailable to LDA while, at the same time, allowing for the identification of patterns at multiple scales of resolution.

We have shown that the hSBM constitutes a better topic model compared to LDA not only for a diverse set of real corpora but also for artificial corpora generated from LDA itself. It is capable of providing better compression—as a measure of the quality of fit—and a richer interpretation of the data. More importantly, the hSBM offers an alternative to the Dirichlet priors used in virtually every variation of current approaches to topic modeling. While motivated by their computational convenience, Dirichlet priors do not reflect prior knowledge compatible with the actual usage of language. Our analysis suggests that Dirichlet priors introduce severe biases into the inference result, which, in turn, markedly hinder its performance in the event of even slight deviations from the Dirichlet assumption. In contrast, our work shows how to formulate and incorporate different (and, as we have shown, more suitable) priors in a fully Bayesian framework, which are completely agnostic to the type of inferred mixtures. Furthermore, it also serves as a working example that efficient numerical implementations of non-Dirichlet topic models are feasible and can be applied in practice to large collections of documents.

More generally, our results show how we can apply the same mathematical ideas to two extremely popular and mostly disconnected problems: the inference of topics in corpora and of communities in networks. We used this connection to obtain improved topic models, but there are many additional theoretical results in community detection that should be explored in the topic-model context, for example, fundamental limits to inference such as the undetectable-detectable phase transition (49) or the analogy to Potts-like spin systems in statistical physics (50). Furthermore, this connection allows the many extensions of the SBM, such as multilayer (51) and annotated (52, 53) versions, to be readily used for topic modeling of richer text including hyperlinks, citations between documents, etc. Conversely, the field of topic modeling has long adopted a Bayesian perspective to inference, which, until now, has not seen widespread use in community detection. Thus, insights from topic modeling about either the formulation of suitable priors or the approximation of posterior distributions might catalyze the development of improved statistical methods to detect communities in networks. Furthermore, the traditional application of topic models in the analysis of texts leads to classes of networks usually not considered by community-detection algorithms. The word-document network is bipartite (words-documents), the topics/communities can be overlapping, and the number of links (word tokens) and nodes (word types) are connected to each other through Heaps’ law. In particular, the latter aspect results in dense networks, which have been largely overlooked by the networks community (54). Thus, topic models might provide additional insights into how to approach these networks, as it remains unclear how these properties affect the inference of communities in word-document networks. More generally, Heaps’ law constitutes only one of numerous statistical laws in language (14), such as the well-known Zipf’s law (15). While these regularities are well studied empirically, few attempts have been made to incorporate them explicitly as prior knowledge, for example, by formulating generative processes that lead to Zipf’s law (27, 28). Our results show that the SBM provides a flexible approach to deal with Zipf’s law, which constitutes a challenge to state-of-the-art topic models such as LDA. Zipf’s law also appears in genetic codes (55) and images (26), two prominent fields in which LDA-type models have been extensively applied (12, 29), suggesting that the block-model approach we introduce here is also promising beyond text analysis.

MATERIALS AND METHODS

Minimum description length

We compared both models based on the description length Σ, where smaller values indicate a better model (45). We obtained Σ for LDA from Eq. 2 and Σ for hSBM from Eq. 19 as

$$\Sigma_{\mathrm{LDA}}=-\ln P(\boldsymbol{n}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta}) \qquad (22)$$

$$\Sigma_{\mathrm{hSBM}}=-\ln P(\boldsymbol{A},\{\boldsymbol{b}_l\}) \qquad (23)$$

We noted that Σ_LDA is conditioned on the hyperparameters β and α and, therefore, is exact only for noninformative priors (α_dr = 1 and β_rw = 1). Otherwise, Eq. 22 is only a lower bound for Σ_LDA because it lacks the terms involving hyperpriors for β and α. For simplicity, we ignored this correction in our analysis and therefore favored LDA. The motivation for this approach was twofold.

On the one hand, it offers a well-founded approach to unsupervised model selection within the framework of information theory, as it corresponds to the amount of information necessary to simultaneously describe (i) the data when the model parameters are known and (ii) the parameters themselves. As the complexity of the model increases, the former will typically decrease, as the model fits the data more closely, while at the same time, this is compensated by an increase of the latter term, which serves as a penalty that prevents overfitting. In addition, given data and two models M1 and M2 with description lengths Σ_M1 and Σ_M2, we could relate the difference ΔΣ = Σ_M1 − Σ_M2 to the Bayes factor (56). The latter quantifies how much more likely one model is compared to the other given the data

$$\Lambda=\frac{P(M_1\mid \text{data})}{P(M_2\mid \text{data})}=\frac{P(\text{data}\mid M_1)\,P(M_1)}{P(\text{data}\mid M_2)\,P(M_2)}=e^{-\Delta\Sigma} \qquad (24)$$

where we assumed that each model is a priori equally likely, that is, P(M1) = P(M2).
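A minimal sketch of this relation, assuming description lengths measured in nats and equal prior probabilities for the two models:

```python
# Posterior odds between two models from their description lengths (in nats),
# assuming equal prior model probabilities (Eq. 24).
import math

def posterior_odds(sigma_m1, sigma_m2):
    """Return P(M1 | data) / P(M2 | data) for description lengths in nats."""
    return math.exp(-(sigma_m1 - sigma_m2))

# Example: a model that is shorter by 10 nats is e^10 (about 2.2e4) times more likely.
print(posterior_odds(1000.0, 1010.0))
```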

On the other hand, the description length allows for a straightforward model comparison without the introduction of confounding factors. Commonly used supervised model selection approaches, such as perplexity, require additional approximation techniques (22), which are not readily applicable to the microcanonical formulation of the SBM. It is thus not clear whether any difference in predictive power would result from the model and its inference or the approximation used in the calculation of perplexity. Furthermore, we noted that it was shown recently that supervised approaches based on the held-out likelihood of missing edges tend to overfit in key cases, failing to select the most parsimonious model, unlike unsupervised approaches that are more robust (57).

Artificial corpora

For the construction of the artificial corpora, we fixed the parameters in the generative process of LDA, that is, the number of topics K, the hyperparameters α and β, and the length of individual articles m. The α(β) hyperparameters determine the distribution of topics (words) in each document (topic).

The generative process of LDA can be described in the following way. For each topic r ∈ {1,…, K}, we sampled a distribution over words φ_r from a V-dimensional Dirichlet distribution with parameters β_rw for w ∈ {1,…, V}. For each document d ∈ {1,…, D}, we sampled a topic mixture θ_d from a K-dimensional Dirichlet distribution with parameters α_dr for r ∈ {1,…, K}. For each word position l_d ∈ {1,…, k_d} (k_d is the length of document d), we first sampled a topic r from a multinomial with parameters θ_d and then sampled a word w from a multinomial with parameters φ_r.

We assumed a parametrization in which (i) each document has the same topic-document hyperparameter, that is, α_dr = α_r for d ∈ {1,…, D}, and (ii) each topic has the same word-topic hyperparameter, that is, β_rw = β_w for r ∈ {1,…, K}. We fixed the average probability of occurrence of a topic, p_r (of a word, p_w), by introducing scalar hyperparameters α (β), that is, α_dr = α K p_r for r ∈ {1,…, K} [β_rw = β V p_w for w = 1,…, V]. In our case, we chose (i) equiprobable topics, that is, p_r = 1/K, and (ii) empirically measured word frequencies from the Wikipedia corpus for p_w, with w = 1,…, 95,129, yielding a Zipfian distribution (section S5 and fig. S5), shown to be universally described by a double power law (44).
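The following sketch implements this construction with a Zipf-like stand-in for the empirical Wikipedia frequencies; all numerical values (D, K, V, m, α, β) are illustrative and smaller than those used in the actual experiments.

```python
# Sketch of the artificial-corpus construction described above.
# All parameter values are illustrative; a Zipf-like distribution stands in
# for the empirically measured Wikipedia word frequencies.
import numpy as np

rng = np.random.default_rng(5)
D, K, V, m = 100, 10, 1000, 128            # documents, topics, vocabulary size, text length
alpha, beta = 1.0, 1.0                     # scalar hyperparameters

p_r = np.full(K, 1.0 / K)                  # equiprobable topics
p_w = 1.0 / np.arange(1, V + 1)            # Zipf-like word probabilities
p_w /= p_w.sum()

alpha_d = alpha * K * p_r                  # alpha_dr = alpha * K * p_r (same for all documents)
beta_r = beta * V * p_w                    # beta_rw = beta * V * p_w (same for all topics)

phi = rng.dirichlet(beta_r, size=K)        # topic-word distributions
theta = rng.dirichlet(alpha_d, size=D)     # document-topic mixtures
corpus = np.zeros((D, V), dtype=int)       # document-word counts n_dw
for d in range(D):
    for r in rng.choice(K, size=m, p=theta[d]):
        corpus[d, rng.choice(V, p=phi[r])] += 1

print(corpus.sum(), "tokens;", (corpus.sum(axis=0) > 0).sum(), "distinct words used")
```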

Data sets for real corpora

For the comparison of hSBM and LDA, we considered different data sets of written texts varying in genre, time of origin, average text length, number of documents, and language, as well as data sets used in previous works on topic models, for example, (5, 16, 58, 59):

(1) “Twitter,” a sample of Twitter messages obtained from www.nltk.org/nltk_data/;

(2) “Reuters,” a collection of documents from the Reuters financial newswire service denoted as “Reuters-21578, Distribution 1.0” obtained from www.nltk.org/nltk_data/;

(3) “Web of Science,” abstracts from physics papers published in the year 2000;

(4) “New York Times,” a collection of newspaper articles obtained from http://archive.ics.uci.edu/ml;

(5) “PLOS ONE,” full text of all scientific articles published in 2011 in the journal PLOS ONE, obtained via the PLOS API (http://api.plos.org/).

In all cases, we considered a random subset of the documents, as detailed in Table 1. For the New York Times data, we did not use any additional filtering since the data were already provided in the form of prefiltered word counts. For the other data sets, we used the following filtering: (i) We decapitalized all words, (ii) we replaced punctuation and special characters (for example, “.”, “,”, or “/”) by blank spaces so that we could define a word as any substring between two blank spaces, and (iii) we kept only those words that consisted of the letters a to z.
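One possible implementation of these filtering steps follows; treating digits as special characters to be removed is one reading of step (ii) and an assumption of this sketch.

```python
# Sketch of the preprocessing described above: decapitalize, replace punctuation
# and special characters by blank spaces, and keep only words made of letters a-z.
# Treating digits as special characters is an assumption of this sketch.
import re

def tokenize(text):
    text = text.lower()                         # (i) decapitalize all words
    text = re.sub(r"[^a-z\s]", " ", text)       # (ii) punctuation/special characters -> blank spaces
    return [w for w in text.split()
            if re.fullmatch(r"[a-z]+", w)]      # (iii) keep only words consisting of a-z

print(tokenize("Topic models, e.g. LDA (Blei et al., 2003), infer K=10 topics!"))
```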

Numerical implementations

For inference with LDA, we used the mallet package (http://mallet.cs.umass.edu/). The algorithm for inference with the hSBM shown in this work was implemented in C++ as part of the graph-tool Python library (https://graph-tool.skewed.de). We provide code on how to use hSBM for topic modeling in a GitHub repository (https://github.com/martingerlach/hSBM_Topicmodel).
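As a minimal usage sketch (not the full pipeline of the repository above), the following builds a toy word-document multigraph and fits the nested SBM with graph-tool; options such as enforcing the bipartite document/word constraint or using overlapping groups depend on the graph-tool version and are handled in the repository code.

```python
# Minimal sketch of hSBM inference with graph-tool on a toy word-document graph.
# Version-dependent options (overlapping groups, enforcing the bipartite
# document/word constraint) are omitted here; see the hSBM_Topicmodel repository.
import graph_tool.all as gt

docs = [["network", "community", "network"],
        ["topic", "model", "topic"],
        ["community", "topic", "network"]]

g = gt.Graph(directed=False)
name = g.new_vertex_property("string")
vertices = {}

def get_vertex(label):
    if label not in vertices:
        v = g.add_vertex()
        name[v] = label
        vertices[label] = v
    return vertices[label]

for d, words in enumerate(docs):
    v_doc = get_vertex("doc_%d" % d)
    for w in words:                              # one edge per word token (multigraph)
        g.add_edge(v_doc, get_vertex(w))

state = gt.minimize_nested_blockmodel_dl(g)      # fit the nested (hierarchical) SBM
print("description length (nats):", state.entropy())
state.print_summary()                            # number of groups at each level
```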

SUPPLEMENTARY MATERIALS

Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/4/7/eaaq1360/DC1

Section S1. Marginal likelihood of the SBM

Section S2. Artificial corpora drawn from LDA

Section S3. Varying the hyperparameters and number of topics

Section S4. Word-document networks are not sparse

Section S5. Empirical word-frequency distribution

Fig. S1. Varying the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.

Fig. S2. Varying the number of topics K in the comparison between LDA and SBM for artificial corpora drawn from LDA.

Fig. S3. Varying the base measure of the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.

Fig. S4. Word-document networks are not sparse.

Fig. S5. Empirical rank-frequency distribution.

Reference (61)

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.

REFERENCES AND NOTES

Acknowledgments: We thank M. Palzenberger for the help with the Web of Science data. E.G.A. thanks L. Azizi and W. L. Buntine for the helpful discussions. Author contributions: M.G., T.P.P., and E.G.A. designed the research. M.G., T.P.P., and E.G.A. performed the research. M.G. and T.P.P. analyzed the data. M.G., T.P.P., and E.G.A. wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.
