Table 2 Evaluation of the robustness of the clustering results.

For each feature set, clustering is performed at selection levels ranging from the 1% most frequent features to all features (MF); the number of features (N) is given, along with the cluster purity with respect to alleged authors (P-A) and with respect to the analysis obtained through our reference selection procedure (P-R). The last line, in italics, shows the results obtained with the reference selection procedure (RS), as shown in the dendrograms of Figs. 1, 2, and 5. Results with cluster purity above 0.9 are shown in bold. When P-A = P-R for all frequency cutoff thresholds, the reference selection procedure can be considered optimal.

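Control corpus

| MF | N | P-A | P-R | MF | N | P-A | P-R | MF | N | P-A | P-R |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lemmas | | | | Word forms | | | | POS 3-gr. | | | |
| 1% | 80 | 0.90 | 0.97 | 1% | 155 | 0.90 | 0.90 | 1% | 95 | 0.90 | 0.90 |
| 10% | 795 | 1 | 0.93 | 10% | 1546 | 1 | 1 | 10% | 948 | 0.97 | 0.97 |
| 25% | 1986 | 1 | 0.93 | 25% | 3865 | 1 | 1 | 25% | 2369 | 0.97 | 0.97 |
| 50% | 3971 | 0.90 | 0.97 | 50% | 7730 | 0.83 | 0.83 | 50% | 4738 | 0.80 | 0.80 |
| 75% | 5998 | 0.90 | 0.97 | 75% | 11704 | 0.57 | 0.57 | 75% | 7117 | 0.73 | 0.73 |
| 100% | 7941 | 0.97 | 0.97 | 100% | 15457 | 0.80 | 0.80 | 100% | 9476 | 0.87 | 0.87 |
| RS | 1614 | 0.93 | | RS | 1887 | 1 | | RS | 1158 | 1 | |
| Lemmas in rhyme position | | | | Affixes | | | | Function words | | | |
| 1% | 55 | 0.83 | 0.87 | 1% | 32 | 0.77 | 0.77 | 1% | 2 | 0.50 | 0.57 |
| 10% | 543 | 0.90 | 0.93 | 10% | 317 | 0.93 | 0.93 | 10% | 11 | 0.90 | 0.90 |
| 25% | 1358 | 0.90 | 0.77 | 25% | 792 | 1 | 1 | 25% | 28 | 0.87 | 0.93 |
| 50% | 2715 | 0.77 | 0.80 | 50% | 1583 | 1 | 1 | 50% | 55 | 0.90 | 0.97 |
| 75% | 4114 | 0.70 | 0.70 | 75% | 2374 | 1 | 1 | 75% | 82 | 0.93 | 1 |
| 100% | 5429 | 0.73 | 0.73 | 100% | 3165 | 0.97 | 0.97 | 100% | 110 | 0.93 | 1 |
| RS | 510 | 0.87 | | RS | 1512 | 1 | | RS | 110 | 0.93 | |

Main corpus (subgroup)

| MF | N | P-A | P-R | MF | N | P-A | P-R | MF | N | P-A | P-R |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lemmas | | | | Word forms | | | | POS 3-gr. | | | |
| 1% | 88 | 0.92 | 0.97 | 1% | 176 | 0.95 | 1 | 1% | 103 | 0.89 | 0.95 |
| 10% | 879 | 0.95 | 1 | 10% | 1754 | 0.95 | 1 | 10% | 1025 | 0.86 | 0.97 |
| 25% | 2196 | 0.92 | 0.97 | 25% | 4385 | 0.95 | 1 | 25% | 2561 | 0.89 | 1 |
| 50% | 4391 | 0.92 | 0.97 | 50% | 8770 | 0.95 | 1 | 50% | 5122 | 0.81 | 0.81 |
| 75% | 6635 | 0.81 | 0.86 | 75% | 13228 | 0.70 | 0.76 | 75% | 7710 | 0.81 | 0.81 |
| 100% | 8781 | 0.73 | 0.78 | 100% | 17540 | 0.89 | 0.95 | 100% | 10243 | 0.86 | 0.81 |
| RS | 1789 | 0.95 | | RS | 2219 | 0.95 | | RS | 1344 | 0.89 | |
| Lemmas in rhyme position | | | | Affixes | | | | Function words | | | |
| 1% | 58 | 0.81 | 0.86 | 1% | 33 | 0.89 | 0.95 | 1% | 2 | 0.59 | 0.59 |
| 10% | 573 | 0.92 | 0.95 | 10% | 326 | 0.95 | 1 | 10% | 11 | 0.62 | 0.62 |
| 25% | 1432 | 0.95 | 0.97 | 25% | 814 | 0.95 | 1 | 25% | 27 | 0.86 | 0.86 |
| 50% | 2864 | 0.95 | 0.95 | 50% | 1627 | 0.95 | 1 | 50% | 54 | 1 | 1 |
| 75% | 4298 | 0.76 | 0.76 | 75% | 2440 | 0.92 | 0.97 | 75% | 81 | 1 | 1 |
| 100% | 5728 | 0.68 | 0.73 | 100% | 3253 | 0.92 | 0.97 | 100% | 108 | 1 | 1 |
| RS | 668 | 0.92 | | RS | 1646 | 0.95 | | RS | 108 | 1 | |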
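
The purity scores P-A and P-R in Table 2 compare a clustering against a reference labelling (the alleged authors, or the clusters from the reference selection procedure). The sketch below illustrates one common way to compute such a score, assuming the standard definition of cluster purity (the share of items that fall in the majority reference class of their cluster); the exact definition used for the table is not stated here, and the function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_labels, reference_labels):
    """Fraction of items belonging to the majority reference class of their cluster.

    Assumes the standard purity definition; not necessarily the exact
    computation behind Table 2.
    """
    if len(cluster_labels) != len(reference_labels):
        raise ValueError("label sequences must have the same length")
    if not cluster_labels:
        raise ValueError("at least one item is required")

    # Group the reference labels by the cluster each item was assigned to.
    by_cluster = {}
    for cluster, reference in zip(cluster_labels, reference_labels):
        by_cluster.setdefault(cluster, Counter())[reference] += 1

    # For each cluster, count the items carrying its most common reference label.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_labels)


# Hypothetical toy data: six texts grouped into three clusters.
clusters = [0, 0, 1, 1, 2, 2]
alleged_authors = ["A", "A", "B", "C", "C", "C"]   # reference for a P-A style score
reference_clusters = [0, 0, 1, 2, 2, 2]            # reference for a P-R style score

purity_vs_authors = cluster_purity(clusters, alleged_authors)
purity_vs_reference = cluster_purity(clusters, reference_clusters)
```

In this toy case the second cluster mixes two reference classes, so five of the six texts sit in the majority class of their cluster and both scores equal 5/6. A score of 1 would mean that every cluster contains texts from a single reference class.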