Research Article | SOCIAL SCIENCES

Why Molière most likely did write his plays


Science Advances  27 Nov 2019:
Vol. 5, no. 11, eaax5489
DOI: 10.1126/sciadv.aax5489
  • Fig. 1 Dendrograms of agglomerative hierarchical clustering of the six feature sets, performed on the exploratory corpus (Ward’s linkage criterion, Manhattan distance, z transformation, and vector length normalization), accompanied by the number of features selected, the agglomerative coefficient, and cluster purity for the exhibited clusters with respect to the alleged authors (see Materials and Methods).

    Different sets of features are analyzed, from the most thematic to the most genre invariant: (A) lemma, (B) lemma in rhyme position, and (C) word forms, which are strongly related to the texts' thematic contents; and (D) affixes, (E) POS 3-grams, and (F) function words, which, in the current state of knowledge, are deemed to reflect most accurately the less conscious variations in individual style. Author abbreviations: B, Boursault; C, Chevalier; CP, Pierre Corneille; CT, Thomas Corneille; DDV, Donneau de Visé; G, Gillet de la Tessonerie; LF, La Fontaine; M, Molière; O, Ouville; Q, Quinault; R, Rotrou; S, Scarron.
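The caption names three preprocessing and distance choices: z transformation, vector length normalization, and Manhattan distance. A minimal sketch of these steps follows. It assumes that z-scoring is applied per feature across texts (with the population standard deviation), that length normalization scales each text vector to unit Euclidean length, and that the steps run in that order; the caption does not spell out these details, and the frequency values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def z_transform(rows):
    """Z-score each feature (column) across all texts (rows)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    # A constant feature carries no information: map it to 0 instead of dividing by 0.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


def length_normalize(rows):
    """Scale each text vector to unit Euclidean length (assumed normalization)."""
    out = []
    for row in rows:
        norm = sqrt(sum(x * x for x in row))
        out.append([x / norm if norm > 0 else 0.0 for x in row])
    return out


def manhattan(u, v):
    """Manhattan (city block) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical relative frequencies: 3 texts x 2 features.
freqs = [[0.10, 0.02], [0.12, 0.01], [0.30, 0.05]]
prepared = length_normalize(z_transform(freqs))
distances = [[manhattan(u, v) for v in prepared] for u in prepared]
```

In this toy input the first two texts have similar profiles, so their distance comes out smaller than either one's distance to the third text.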

  • Fig. 2 Dendrograms of agglomerative hierarchical clustering of the six feature sets (Ward’s linkage criterion, Manhattan distance, z transformation, and vector length normalization), accompanied by the number of features selected, the agglomerative coefficient, and cluster purity for the exhibited clusters with respect to the alleged authors (see Materials and Methods).

    Features analyzed: (A) lemma, (B) lemma in rhyme position, (C) word forms, (D) affixes, (E) POS 3-grams, and (F) function words. Despite variations in detail, each analysis shows clusters strongly or completely related to the putative authors (CP, Pierre Corneille; CT, Thomas Corneille; M, Molière; R, Rotrou; S, Scarron).
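Cluster purity with respect to the alleged authors can be illustrated with the usual definition: each cluster is credited with its most frequent author label, and the credited counts are summed and divided by the number of texts. The sketch below assumes that definition (the article's Materials and Methods give the authoritative one) and uses made-up cluster assignments.

```python
from collections import Counter


def cluster_purity(cluster_ids, labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Hypothetical assignment of 8 plays to 3 clusters.
clusters = [1, 1, 1, 2, 2, 3, 3, 3]
authors = ["M", "M", "CP", "CP", "CP", "S", "S", "S"]
purity = cluster_purity(clusters, authors)  # (2 + 2 + 3) / 8
```

A purity of 1 means that every cluster contains plays attributed to a single author.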

  • Fig. 3 Distributions of the size in tokens of the texts and samples, and dendrograms of agglomerative hierarchical clustering on the function words (Ward’s linkage criterion, Manhattan distance and MinMax metric, z transformation), accompanied by the number of features selected, the agglomerative coefficient, and cluster purity for the exhibited clusters with respect to the alleged authors (see Materials and Methods).

    (A) The distribution of text lengths, in tokens, for the initial corpus shows two outliers (texts that were too short), which were removed. (B) The corpus size per alleged author displays noticeable differences between authors owing to differences in their production of comedies; in particular, the Rotrou and Scarron samples are relatively small. (C) In the corpus used for the final analyses, the chosen plays are relatively homogeneous in length (minimum, 7887; maximum, 18279) but still display some variation between authors. For cross-validation, we complemented (D) the analysis on function words done with our main procedure with (E) an analysis using the MinMax metric. The results proved very similar, and the main clusters have identical members.
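The MinMax metric used for the cross-validation in panel (E) is commonly defined as the sum of feature-wise minima divided by the sum of feature-wise maxima, turned into a distance by subtracting from 1. The sketch below assumes that definition and non-negative frequencies; the caption itself does not define the metric, and the example vectors are hypothetical.

```python
def minmax_distance(u, v):
    """1 minus the MinMax similarity of two non-negative frequency vectors."""
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    # Two all-zero vectors are treated as identical.
    return 1.0 - numerator / denominator if denominator > 0 else 0.0


# Hypothetical relative frequencies of three function words in two plays.
play_a = [0.03, 0.01, 0.02]
play_b = [0.02, 0.01, 0.04]
d = minmax_distance(play_a, play_b)  # 1 - 0.05 / 0.08
```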

  • Fig. 4 Distribution of the most discriminant features between clusters for each author.

    (A) The lemma gloire (“glory”); (B) the lemma contentement (“contentment”), in rhyme position; (C) the form contentements; (D) the affix ^glo; (E) the POS sequence “DETdem ADJqua NOMcom” (demonstrative determiner, qualificative adjective, common noun); (F) the function word et (“and”). The feature with the strongest correlation, measured by η², with the five clusters of each analysis was selected (B, Boursault; C, Chevalier; CP, Pierre Corneille; CT, Thomas Corneille; DDV, Donneau de Visé; G, Gillet de la Tessonerie; LF, La Fontaine; M, Molière; O, Ouville; Q, Quinault; R, Rotrou; S, Scarron).
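The correlation ratio η² used to rank features is, in its standard form, the between-cluster sum of squares divided by the total sum of squares of a feature's values. The sketch below assumes that standard form and uses invented frequencies and cluster labels.

```python
from statistics import mean


def correlation_ratio(values, groups):
    """Eta squared: share of the variance in `values` explained by `groups`."""
    grand_mean = mean(values)
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in by_group.values()
    )
    ss_total = sum((value - grand_mean) ** 2 for value in values)
    return ss_between / ss_total if ss_total > 0 else 0.0


# Hypothetical frequencies of one feature in six plays from two clusters.
frequencies = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
cluster_of_play = ["a", "a", "a", "b", "b", "b"]
eta_squared = correlation_ratio(frequencies, cluster_of_play)  # 54 / 58
```

Values close to 1 indicate that the feature's frequency is almost entirely determined by cluster membership.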

  • Fig. 5 Dendrograms of agglomerative hierarchical clustering of the six feature sets of the control corpus (Ward’s linkage criterion, Manhattan distance, z transformation, and vector length normalization), accompanied by the number of features selected, the agglomerative coefficient, and cluster purity for the exhibited clusters with respect to the alleged authors (see Materials and Methods).

    Six sets of features are analyzed: (A) lemma and (B) lemma in rhyme position, (C) word forms and (D) affixes, (E) POS 3-grams, and (F) function words. B, Boissy; DA, Dancourt; DU, Dufresny; N, Nivelle; R, Regnard; and V, Voltaire.
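The dendrograms, including those of the control corpus, come from agglomerative clustering with Ward's linkage criterion. The following sketch shows one way such a clustering can be computed from a precomputed distance matrix, using the Lance-Williams update for Ward's criterion on squared dissimilarities. It is an illustration under stated assumptions, not the authors' implementation: Ward's criterion is strictly defined for Euclidean distances, so applying it to Manhattan distances is a heuristic, and the squaring convention, the function names, and the toy data are all assumptions.

```python
def ward_linkage(dist):
    """Agglomerate items with the Lance-Williams update for Ward's criterion.

    dist: symmetric matrix of pairwise dissimilarities.
    Returns one (members_a, members_b, height) tuple per merge.
    """
    members = {i: [i] for i in range(len(dist))}  # active cluster id -> item indices
    # Squared dissimilarities between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in members for j in members if i < j}
    merges = []
    next_id = len(dist)
    while len(members) > 1:
        (a, b), best = min(d2.items(), key=lambda item: item[1])
        na, nb = len(members[a]), len(members[b])
        for k in members:
            if k in (a, b):
                continue
            nk = len(members[k])
            dak = d2.pop((min(a, k), max(a, k)))
            dbk = d2.pop((min(b, k), max(b, k)))
            # Lance-Williams update for Ward's criterion.
            d2[(k, next_id)] = (
                (na + nk) * dak + (nb + nk) * dbk - nk * best
            ) / (na + nb + nk)
        del d2[(a, b)]
        merges.append((members[a], members[b], best ** 0.5))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges


def cut_tree(merges, n_items, k):
    """Replay the first n_items - k merges and return the resulting k clusters."""
    clusters = [[i] for i in range(n_items)]
    for left, right, _ in merges[: n_items - k]:
        clusters = [c for c in clusters if c != left and c != right]
        clusters.append(left + right)
    return sorted(sorted(c) for c in clusters)


# Toy example: four one-dimensional "texts" forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in points] for x in points]
merges = ward_linkage(dist)
two_clusters = cut_tree(merges, len(points), 2)
```

Cutting the resulting tree at a fixed number of clusters gives the groups whose purity is reported with each dendrogram.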

  • Table 1 Features most correlated to the clusters detected, according to their correlation ratio η².

    | Lemma | η² | P | Rhyme lemma | η² | P | Forms | η² | P |
    | Gloire | 0.81 | 2.4 × 10^−17 | Contentement | 0.78 | 8.9 × 10^−16 | Contentements | 0.85 | 1.2 × 10^−20 |
    | Négliger | 0.79 | 2.4 × 10^−16 | Affection | 0.78 | 3.0 × 10^−15 | Gloire | 0.82 | 3.4 × 10^−18 |
    | Ha | 0.79 | 7.4 × 10^−16 | Gloire | 0.66 | 3.3 × 10^−10 | Ta | 0.80 | 1.4 × 10^−16 |
    | Affection | 0.77 | 8.9 × 10^−15 | Courage | 0.65 | 6.7 × 10^−10 | Ha | 0.79 | 3.8 × 10^−16 |
    | Contentement | 0.76 | 2.6 × 10^−14 | Sein | 0.59 | 4.2 × 10^−8 | Servage | 0.73 | 3.1 × 10^−13 |
    | Bref | 0.75 | 8.8 × 10^−14 | Cavalier | 0.58 | 1.0 × 10^−7 | Bref | 0.73 | 4.5 × 10^−13 |
    | Ton | 0.74 | 3.0 × 10^−13 | Maîtresse | 0.56 | 2.7 × 10^−7 | Affections | 0.72 | 1.1 × 10^−12 |
    | Cela | 0.72 | 1.1 × 10^−12 | Négliger | 0.56 | 3.1 × 10^−7 | Tes | 0.71 | 5.0 × 10^−12 |
    | Manière | 0.68 | 6.6 × 10^−11 | Éclat | 0.56 | 3.9 × 10^−7 | Pas | 0.70 | 1.3 × 10^−11 |
    | De_le | 0.68 | 7.3 × 10^−11 | État | 0.55 | 4.7 × 10^−7 | Indigne | 0.70 | 1.4 × 10^−11 |
    | Pas | 0.67 | 1.3 × 10^−10 | Envie | 0.55 | 5.5 × 10^−7 | Illustre | 0.69 | 2.8 × 10^−11 |
    | Éclat | 0.66 | 2.6 × 10^−10 | Élément | 0.55 | 6.1 × 10^−7 | Dedans | 0.68 | 6.7 × 10^−11 |
    | Dedans | 0.66 | 2.6 × 10^−10 | Souci | 0.53 | 2.1 × 10^−6 | Cela | 0.67 | 8.6 × 10^−11 |
    | Illustre | 0.66 | 3.4 × 10^−10 | Enrager | 0.53 | 2.1 × 10^−6 | Éclat | 0.67 | 1.3 × 10^−10 |
    | Bien | 0.66 | 3.8 × 10^−10 | Loi | 0.52 | 2.7 × 10^−6 | Manière | 0.66 | 2.3 × 10^−10 |
    | Indigne | 0.65 | 5.4 × 10^−10 | Dieu | 0.52 | 3.3 × 10^−6 | Bien | 0.66 | 4.1 × 10^−10 |
    | Être | 0.65 | 8.6 × 10^−10 | Cela | 0.52 | 3.5 × 10^−6 | Objets | 0.65 | 4.4 × 10^−10 |
    | À | 0.65 | 9.1 × 10^−10 | Dire | 0.51 | 3.8 × 10^−6 | Affection | 0.65 | 7.5 × 10^−10 |
    | Soleil | 0.64 | 2.0 × 10^−9 | Visage | 0.51 | 4.9 × 10^−6 | Ose | 0.65 | 8.5 × 10^−10 |
    |  | 0.63 | 2.4 × 10^−9 | Dépit | 0.50 | 6.5 × 10^−6 | Transport | 0.63 | 2.2 × 10^−9 |

    | Affixes | η² | P | POS 3-gr. | η² | P | Function words | η² | P |
    | ^Glo | 0.77 | 5.3 × 10^−15 | DETdem.ADJqua.NOMcom | 0.67 | 9.0 × 10^−12 | Et | 0.71 | 2.9 × 10^−12 |
    | Ha_ | 0.77 | 5.5 × 10^−15 | PROper.VERcjg.ADVgen | 0.66 | 1.8 × 10^−11 | Des | 0.70 | 9.4 × 10^−12 |
    | Ta_ | 0.77 | 5.9 × 10^−15 | VERinf.PROper.VERcjg | 0.65 | 6.4 × 10^−11 | C’ | 0.65 | 5.5 × 10^−10 |
    | ^Dés | 0.74 | 2.8 × 10^−13 | PRE.DETpos.NOMcom | 0.64 | 8.7 × 10^−11 | Oui | 0.65 | 7.2 × 10^−10 |
    | _Gl | 0.72 | 1.9 × 10^−12 | PRE.NOMcom.PROper | 0.64 | 1.0 × 10^−10 | De | 0.64 | 1.3 × 10^−9 |
    | Et_ | 0.71 | 2.8 × 10^−12 | NOMcom.NOMpro.NOMpro | 0.64 | 1.4 × 10^−10 | Est | 0.64 | 1.7 × 10^−9 |
    | _Dé | 0.71 | 3.3 × 10^−12 | NOMcom.PROrel.VERcjg | 0.63 | 2.7 × 10^−10 | Plus | 0.64 | 2.0 × 10^−9 |
    | Ela$ | 0.71 | 4.0 × 10^−12 | NOMcom.CONsub.DETpos | 0.63 | 2.8 × 10^−10 | Voilà | 0.63 | 3.3 × 10^−9 |
    | _L’ | 0.70 | 9.7 × 10^−12 | DETpos.NOMcom.PRE | 0.63 | 3.0 × 10^−10 | En | 0.60 | 1.7 × 10^−8 |
    | L’_ | 0.70 | 9.7 × 10^−12 | PROind.PRE.NOMcom | 0.62 | 4.2 × 10^−10 | S’ | 0.60 | 2.2 × 10^−8 |
    | _Bi | 0.69 | 1.6 × 10^−11 | CONsub.DETpos.NOMcom | 0.62 | 6.2 × 10^−10 | Ou | 0.60 | 2.5 × 10^−8 |
    | ^Bie | 0.69 | 1.7 × 10^−11 | DETpos.NOMcom.ADVneg | 0.61 | 7.9 × 10^−10 | À | 0.59 | 5.8 × 10^−8 |
    | _Ta | 0.69 | 2.6 × 10^−11 | PROper.VERinf.PROper | 0.60 | 1.9 × 10^−9 | Cela | 0.58 | 7.0 × 10^−8 |
    | Ang$ | 0.69 | 2.8 × 10^−11 | PROper.PROadv.VERcjg | 0.60 | 1.9 × 10^−9 | Suis | 0.58 | 7.8 × 10^−8 |
    | Ng_ | 0.68 | 4.1 × 10^−11 | NOMcom.CONcoo.DETpos | 0.59 | 3.3 × 10^−9 | Encor | 0.58 | 8.1 × 10^−8 |
    | Lat$ | 0.67 | 1.3 × 10^−10 | NOMcom.PRE.DETdef.NOMcom | 0.59 | 3.7 × 10^−9 | Ces | 0.56 | 2.2 × 10^−7 |
    | _Et | 0.66 | 2.5 × 10^−10 | DETdem.NOMcom.ADJqua | 0.59 | 4.5 × 10^−9 | Encore | 0.56 | 2.5 × 10^−7 |
    | _Je | 0.66 | 3.4 × 10^−10 | NOMcom.PRE.DETpos | 0.58 | 9.3 × 10^−9 | Non | 0.56 | 3.4 × 10^−7 |
    |  | 0.64 | 1.2 × 10^−9 | NOMcom.CONcoo.PRE | 0.58 | 1.1 × 10^−8 | Tous | 0.55 | 4.0 × 10^−7 |
    | À_ | 0.64 | 1.2 × 10^−9 | PROadv.VERcjg.DETdef | 0.57 | 1.4 × 10^−8 | Enfin | 0.55 | 4.2 × 10^−7 |
  • Table 2 Evaluation of the robustness of the clustering results.

    For each feature set, clustering is performed at selection levels ranging from the 1% most frequent features to all features (MF); the number of features is given (N), as well as the cluster purity with respect to alleged authors (P-A) and with respect to the analysis obtained through our reference selection procedure (P-R). The last line, in italics, shows the results obtained with the reference selection procedure (RS), as shown in the dendrograms of Figs. 1, 2, and 5. Results with cluster purity above 0.9 are shown in bold. When P-A = P-R for all frequency cutoff thresholds, the reference selection procedure can be considered optimal.

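    Control corpus

    | MF | Lemmas: N | P-A | P-R | Word forms: N | P-A | P-R | POS 3-gr.: N | P-A | P-R |
    | 1% | 80 | 0.90 | 0.97 | 155 | 0.90 | 0.90 | 95 | 0.90 | 0.90 |
    | 10% | 795 | 1 | 0.93 | 1546 | 1 | 1 | 948 | 0.97 | 0.97 |
    | 25% | 1986 | 1 | 0.93 | 3865 | 1 | 1 | 2369 | 0.97 | 0.97 |
    | 50% | 3971 | 0.90 | 0.97 | 7730 | 0.83 | 0.83 | 4738 | 0.80 | 0.80 |
    | 75% | 5998 | 0.90 | 0.97 | 11704 | 0.57 | 0.57 | 7117 | 0.73 | 0.73 |
    | 100% | 7941 | 0.97 | 0.97 | 15457 | 0.80 | 0.80 | 9476 | 0.87 | 0.87 |
    | RS | 1614 | 0.93 |  | 1887 | 1 |  | 1158 | 1 |  |

    | MF | Lemmas in rhyme position: N | P-A | P-R | Affixes: N | P-A | P-R | Function words: N | P-A | P-R |
    | 1% | 55 | 0.83 | 0.87 | 32 | 0.77 | 0.77 | 2 | 0.50 | 0.57 |
    | 10% | 543 | 0.90 | 0.93 | 317 | 0.93 | 0.93 | 11 | 0.90 | 0.90 |
    | 25% | 1358 | 0.90 | 0.77 | 792 | 1 | 1 | 28 | 0.87 | 0.93 |
    | 50% | 2715 | 0.77 | 0.80 | 1583 | 1 | 1 | 55 | 0.90 | 0.97 |
    | 75% | 4114 | 0.70 | 0.70 | 2374 | 1 | 1 | 82 | 0.93 | 1 |
    | 100% | 5429 | 0.73 | 0.73 | 3165 | 0.97 | 0.97 | 110 | 0.93 | 1 |
    | RS | 510 | 0.87 |  | 1512 | 1 |  | 110 | 0.93 |  |

    Main corpus (subgroup)

    | MF | Lemmas: N | P-A | P-R | Word forms: N | P-A | P-R | POS 3-gr.: N | P-A | P-R |
    | 1% | 88 | 0.92 | 0.97 | 176 | 0.95 | 1 | 103 | 0.89 | 0.95 |
    | 10% | 879 | 0.95 | 1 | 1754 | 0.95 | 1 | 1025 | 0.86 | 0.97 |
    | 25% | 2196 | 0.92 | 0.97 | 4385 | 0.95 | 1 | 2561 | 0.89 | 1 |
    | 50% | 4391 | 0.92 | 0.97 | 8770 | 0.95 | 1 | 5122 | 0.81 | 0.81 |
    | 75% | 6635 | 0.81 | 0.86 | 13228 | 0.70 | 0.76 | 7710 | 0.81 | 0.81 |
    | 100% | 8781 | 0.73 | 0.78 | 17540 | 0.89 | 0.95 | 10243 | 0.86 | 0.81 |
    | RS | 1789 | 0.95 |  | 2219 | 0.95 |  | 1344 | 0.89 |  |

    | MF | Lemmas in rhyme position: N | P-A | P-R | Affixes: N | P-A | P-R | Function words: N | P-A | P-R |
    | 1% | 58 | 0.81 | 0.86 | 33 | 0.89 | 0.95 | 2 | 0.59 | 0.59 |
    | 10% | 573 | 0.92 | 0.95 | 326 | 0.95 | 1 | 11 | 0.62 | 0.62 |
    | 25% | 1432 | 0.95 | 0.97 | 814 | 0.95 | 1 | 27 | 0.86 | 0.86 |
    | 50% | 2864 | 0.95 | 0.95 | 1627 | 0.95 | 1 | 54 | 1 | 1 |
    | 75% | 4298 | 0.76 | 0.76 | 2440 | 0.92 | 0.97 | 81 | 1 | 1 |
    | 100% | 5728 | 0.68 | 0.73 | 3253 | 0.92 | 0.97 | 108 | 1 | 1 |
    | RS | 668 | 0.92 |  | 1646 | 0.95 |  | 108 | 1 |  |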
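The robustness check in Table 2 repeats the clustering on the most frequent features at several cutoffs (MF). A minimal sketch of such a frequency-based selection follows. The rounding rule (ceiling), the alphabetical tie-breaking, and the counts are assumptions made for illustration, so the sketch is not expected to reproduce the N values reported in the table.

```python
from math import ceil


def most_frequent(feature_counts, fraction):
    """Return the top `fraction` of features, ranked by decreasing count."""
    # Ties are broken alphabetically so that the selection is deterministic.
    ranked = sorted(feature_counts, key=lambda f: (-feature_counts[f], f))
    keep = max(1, ceil(fraction * len(ranked)))
    return ranked[:keep]


# Hypothetical corpus-wide counts of five function words.
counts = {"et": 900, "de": 850, "des": 400, "oui": 120, "voilà": 60}
top_quarter = most_frequent(counts, 0.25)  # ceil(1.25) = 2 features
everything = most_frequent(counts, 1.0)
```

Each selected subset would then be clustered and scored for purity against the alleged authors (P-A) and against the clusters of the reference selection (P-R).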

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/11/eaax5489/DC1

    Section S1. Data preparation

    Section S2. Data analysis implementation

    Table S1. Plays used in the final (subgroup) analysis.

    Table S2. Plays used only in the exploratory study.

    Table S3. Plays of the control corpus.

    Table S4. Function words used for the analyses.

    Table S5. List of cluster members for each dendrogram shown in the main text.

    References (55–58)

    Other Supplementary Material for this manuscript includes the following:

    • Data file S1 (.zip format). Training corpus for the lemmatizer and POS tagger, in tsv format, with the trained models.
    • Data file S2 (.7z format). Automatically labeled corpora in xml format, with import scripts.
    • Data file S3 (.zip format). Feature datasets and analysis scripts in csv, R, and RMarkdown formats.
