Table 1. hSBM outperforms LDA in real corpora.

Each row corresponds to a different data set (for details, see the “Data sets for real corpora” section in Materials and Methods). We provide basic statistics of each data set in the “Corpus” columns. The models are compared on the basis of their description length Σ (see Eq. 22). We highlight the smallest Σ for each corpus in boldface to indicate the best model. Results for LDA with noninformative and fitted hyperparameters are shown in columns “ΣLDA” and “ΣLDA (hyperfit)” for different numbers of topics K ∈ {10, 50, 100, 500}. Results for the hSBM are shown in column “ΣhSBM” and the inferred numbers of groups (documents and words) in “hSBM groups.”

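| Corpus | Doc. | Words | Word tokens | ΣLDA (K = 10) | ΣLDA (K = 50) | ΣLDA (K = 100) | ΣLDA (K = 500) | ΣLDA (hyperfit, K = 10) | ΣLDA (hyperfit, K = 50) | ΣLDA (hyperfit, K = 100) | ΣLDA (hyperfit, K = 500) | ΣhSBM | hSBM groups (Doc.) | hSBM groups (Words) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Twitter | 10,000 | 12,258 | 196,625 | 1,231,104 | 1,648,195 | 1,960,947 | 2,558,940 | 1,040,987 | 1,041,106 | 1,037,678 | 1,057,956 | **963,260** | 365 | 359 |
| Reuters | 1000 | 8692 | 117,661 | 498,194 | 593,893 | 669,723 | 922,984 | 463,660 | 477,645 | 481,098 | 496,645 | **341,199** | 54 | 55 |
| Web of Science | 1000 | 11,198 | 126,313 | 530,519 | 666,447 | 760,114 | 1,056,554 | 531,893 | 555,727 | 560,455 | 571,291 | **426,529** | 16 | 18 |
| New York Times | 1000 | 32,415 | 335,749 | 1,658,815 | 1,673,333 | 2,178,439 | 2,977,931 | 1,658,815 | 1,673,333 | 1,686,495 | 1,725,057 | **1,448,631** | 124 | 125 |
| PLOS ONE | 1000 | 68,188 | 5,172,908 | 10,637,464 | 10,964,312 | 11,145,531 | 13,180,803 | 10,358,157 | 10,140,244 | 10,033,886 | 9,348,149 | **8,475,866** | 897 | 972 |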
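
The comparison described in the caption (the model with the smallest description length Σ is preferred) can be checked directly from the tabulated values. The following is a minimal sketch, not part of the original analysis: the dictionary layout and the helper name `best_model` are hypothetical, and the numbers are copied from Table 1 for three of the five corpora.

```python
# Description lengths (Sigma) copied from Table 1.
# The dictionary layout and the name `best_model` are assumptions made for this sketch.
# LDA lists are ordered by number of topics K = 10, 50, 100, 500.
SIGMA = {
    "Twitter": {
        "LDA": [1_231_104, 1_648_195, 1_960_947, 2_558_940],
        "LDA (hyperfit)": [1_040_987, 1_041_106, 1_037_678, 1_057_956],
        "hSBM": [963_260],
    },
    "Reuters": {
        "LDA": [498_194, 593_893, 669_723, 922_984],
        "LDA (hyperfit)": [463_660, 477_645, 481_098, 496_645],
        "hSBM": [341_199],
    },
    "PLOS ONE": {
        "LDA": [10_637_464, 10_964_312, 11_145_531, 13_180_803],
        "LDA (hyperfit)": [10_358_157, 10_140_244, 10_033_886, 9_348_149],
        "hSBM": [8_475_866],
    },
}


def best_model(row):
    """Return (model name, smallest description length) for one corpus."""
    candidates = ((name, min(values)) for name, values in row.items())
    return min(candidates, key=lambda pair: pair[1])


for corpus, row in SIGMA.items():
    name, sigma = best_model(row)
    # Best LDA result over both hyperparameter settings and all K.
    best_lda = min(row["LDA"] + row["LDA (hyperfit)"])
    # Relative reduction in description length of the hSBM versus the best LDA run.
    reduction = 1 - row["hSBM"][0] / best_lda
    print(f"{corpus}: best = {name} (Sigma = {sigma:,}), "
          f"hSBM is {reduction:.1%} shorter than the best LDA run")
```

Taking the minimum over all LDA settings gives LDA its most favorable configuration for each corpus, so the reported reduction is a conservative estimate of the gap.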