Research Article | SOCIAL SCIENCES

Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche

Science Advances  04 Sep 2019:
Vol. 5, no. 9, eaaw2594
DOI: 10.1126/sciadv.aaw2594
  • Fig. 1 SR and IR.

    The distribution of SR (in syllables per second) (left) and IR (in bits per second) (right) within the languages in our database (colored areas; colors represent the language families) and across them (black areas at the top) using a Gaussian kernel density estimate. The black vertical lines spanning the whole plot represent the means (solid lines) ± 1 SD (dashed lines). The short black vertical lines represent the actual data points.

  • Fig. 2 Relationship between SR and ID across languages.

    Colors represent the language families, and individual languages are identified by the labels on top (to avoid overlapping labels, short black lines might show their actual position). While there is only one value of ID per language, there are as many values of SR per language as texts read by individual speakers. The straight yellow line represents the linear regression [with 95% confidence interval (CI)], and the black curve represents the locally estimated scatterplot smoothing regression (with 95% CI) of SR on ID.

  • Fig. 3 Pairwise divergence between languages.

    The distribution of the Jensen-Shannon divergence between pairs of languages for the NS, SR, and IR, also showing the significant differences using a randomization paired t test (1000 permutations). The IR-SR and IR-NS P values are <10^−4, while the NS-SR P value is 0.30. All other divergence measures produce essentially identical results. n.s., not significant. ****P ≤ 0.0001.
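
The two quantities named in the Fig. 3 caption can be illustrated with a minimal sketch. It is not the authors' code (their analysis is the RMarkdown script listed in the Supplementary Materials). It assumes that each distribution has already been binned into discrete probabilities over the same bins, and that the randomization test flips the sign of each paired difference at random; the function names `jensen_shannon_divergence` and `paired_permutation_test` are hypothetical.

```python
import math
import random


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions.

    Assumption: p and q are non-negative weights over the same bins.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    # Normalize so that each input sums to 1.
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with a_i == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def paired_permutation_test(x, y, n_permutations=1000, seed=0):
    """Two-sided randomization test on the paired t statistic.

    Assumption: under the null hypothesis the sign of each paired
    difference is exchangeable, so signs are flipped at random.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)

    def t_stat(d):
        mean = sum(d) / n
        var = sum((v - mean) ** 2 for v in d) / (n - 1)
        if var == 0:
            # Degenerate case: all differences are identical.
            return math.copysign(math.inf, mean) if mean != 0 else 0.0
        return mean / math.sqrt(var / n)

    rng = random.Random(seed)
    observed = abs(t_stat(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(t_stat(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated P value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 bit (distributions with no overlap). With 1000 permutations and the add-one correction, the smallest P value this sketch can return is 1/1001.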

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/9/eaaw2594/DC1

    Text S1. Relationship between semantic and encoding levels of information.

    Text S2. Relationship between canonical SR and automatically estimated SR.

    Text S3. Multilingual parallel corpus.

    Fig. S1. Two different encoding strategies that convey the same semantic information.

    Fig. S2. Automatic versus canonical SRs per language.

    Table S1. The 17 languages used in this study.

    Table S2. The written corpora.

    Data file S1. Raw data (as a TAB-separated CSV file) used for the analyses.

    Data file S2. Raw data (as a TAB-separated CSV file) used for the analyses.

    Analysis report file S1. Full analysis report and details about the data in HTML format.

    Analysis script file S1. The RMarkdown script containing all the R code needed to reproduce the results and plots reported here.

    References (50–59)

  • Supplementary Materials

    The PDF file includes:

    • Text S1. Relationship between semantic and encoding levels of information.
    • Text S2. Relationship between canonical SR and automatically estimated SR.
    • Text S3. Multilingual parallel corpus.
    • Fig. S1. Two different encoding strategies that convey the same semantic information.
    • Fig. S2. Automatic versus canonical SRs per language.
    • Table S1. The 17 languages used in this study.
    • Table S2. The written corpora.
    • Legends for data files S1 and S2
    • Legend for Analysis report file S1
    • Legend for Analysis script file S1
    • References (50–59)

    Other Supplementary Material for this manuscript includes the following:

    • Data file S1 (.csv format). Raw data (as a TAB-separated CSV file) used for the analyses.
    • Data file S2 (.csv format). Raw data (as a TAB-separated CSV file) used for the analyses.
    • Analysis report file S1 (.html format). Full analysis report and details about the data in HTML format.
    • Analysis script file S1 (.Rmd format). The RMarkdown script containing all the R code needed to reproduce the results and plots reported here.
