Research Article | EVOLUTIONARY BIOLOGY

Robust inference of positive selection on regulatory sequences in the human brain

Science Advances  27 Nov 2020:
Vol. 6, no. 48, eabc9863
DOI: 10.1126/sciadv.abc9863
  • Fig. 1 Illustration of the procedure for inferring positive selection.

    The method includes two parts. Part I (left) is gapped k-mer support vector machine (gkm-SVM) model training. The gkm-SVM classifier was trained using transcription factor binding sites (TFBSs) as the positive training set and randomly sampled genomic sequences as the negative training set. SVM weights for all possible 10-mers, each representing the contribution of that 10-mer to the predicted transcription factor binding affinity, were then generated from the gkm-SVM. Part II (right) is the positive selection inference. The ancestral sequence was inferred from a sequence alignment with a sister species (species B) and an outgroup (species C). Then, the binding affinity change (deltaSVM) caused by the two substitutions accumulated on the red branch leading to species A was calculated from the weight list. The significance of the observed deltaSVM was evaluated by comparing it with a null distribution of deltaSVM, constructed by scoring the same number of random substitutions 10,000 times.

  • Fig. 2 Mouse CEBPA binding sites study.

    (A) Topological illustration of the phylogenetic relationships between the three mouse species used to detect positive selection on CEBPA binding sites. We aimed to detect positive selection that occurred on the C57BL/6J lineage after divergence from CAST/EiJ, as indicated by the red branch. Ma, million years. (B) Receiver operating characteristic (ROC) curve for gkm-SVM classification performance on CEBPA binding sites (fivefold cross-validation). The AUC value represents the area under the ROC curve and provides an overall measure of predictive power. (C) The left-hand graphs are the distributions of deltaSVM. The right-hand graphs are the distributions of deltaSVM P values (test for positive selection). (D) Proportion of CEBPA binding sites with evidence of positive selection. (E to G) The number of binding sites in each category is indicated below each box. The P values from a Wilcoxon test comparing categories are reported above the boxes. Positive sites are binding sites with evidence of positive selection (deltaSVM P value <0.01). (E) Conserved binding sites. (F) Lineage-specific gain binding sites. (G) Lineage-specific loss binding sites. Here, we compare the binding intensity from CAST/EiJ, used as an approximation of ancestral binding intensity, between positive loss binding sites and nonpositive loss binding sites.

  • Fig. 3 Human CEBPA binding site study.

    (A) Topological illustration of the phylogenetic relationships between human, chimpanzee, and gorilla. We detected positive selection that occurred on the human lineage after divergence from chimpanzee, as indicated by the red branch. (B) ROC curve for gkm-SVM classification performance on CEBPA binding sites (fivefold cross-validation). The AUC value represents the area under the ROC curve and provides an overall measure of predictive power. (C) The left graph is the distribution of deltaSVM. The right graph is the distribution of deltaSVM P values (test for positive selection). (D) Ratio between the number of substitutions and the number of polymorphisms [single-nucleotide polymorphisms (SNPs)] for CEBPA binding sites. Positive sites are binding sites with evidence of positive selection (deltaSVM P value <0.01). The P value from Fisher’s exact test is reported above the bars. (E) Comparison of expression variance (adjusted variance) of putative target genes (closest gene to a TFBS) between positive sites and nonpositive sites. The number of binding sites in each category is indicated below each box. The P values from a Wilcoxon test comparing categories are reported above the boxes.

  • Fig. 4 Proportion of positive CTCF binding sites in different tissues or cell types.

    Positive binding sites (PBSs) are binding sites with evidence of positive selection (deltaSVM P value <0.01). Colors correspond to broad anatomical systems. (A) CTCF binding sites in 29 human tissues or cell types. (B) CTCF binding sites in 11 mouse tissues.
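
The inference step described in the Fig. 1 caption (scoring the observed substitutions with the k-mer weight list, then comparing against a null built from the same number of random substitutions) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy 3-mer weights, the uniform choice of random substitutions, and the one-sided upper-tail P value are all assumptions made here for the example; the published method uses 10-mer weights produced by a trained gkm-SVM.

```python
import random


def score(seq, weights, k):
    """Sum the weights of every k-mer in seq; k-mers absent from the list count as 0."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def delta_svm(ancestral, derived, weights, k):
    """Change in summed k-mer weight from the ancestral to the derived sequence."""
    return score(derived, weights, k) - score(ancestral, weights, k)


def random_substitutions(seq, n, rng):
    """Return seq with n distinct positions mutated to a different base.

    Assumption: positions and replacement bases are drawn uniformly.
    """
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n):
        out[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(out)


def permutation_p_value(ancestral, derived, weights, k, n_perm=10_000, seed=0):
    """Observed deltaSVM and the fraction of null deltaSVM values at least as large."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned and of equal length")
    rng = random.Random(seed)
    n_subs = sum(a != d for a, d in zip(ancestral, derived))
    observed = delta_svm(ancestral, derived, weights, k)
    null = [
        delta_svm(ancestral, random_substitutions(ancestral, n_subs, rng), weights, k)
        for _ in range(n_perm)
    ]
    # Assumption: one-sided upper-tail test (substitutions that raise the score).
    p_value = sum(d >= observed for d in null) / n_perm
    return observed, p_value


if __name__ == "__main__":
    toy_weights = {"AAA": 1.0}  # hypothetical weights, for illustration only
    obs, p = permutation_p_value("CAAC", "AAAC", toy_weights, k=3)
    print(obs, p)
```

In this sketch, a small P value means that few random substitution sets change the score as much as the observed substitutions do, which is the pattern the captions refer to as evidence of positive selection (deltaSVM P value <0.01).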

Supplementary Materials

  • Supplementary Materials

    Robust inference of positive selection on regulatory sequences in the human brain

    Jialin Liu, Marc Robinson-Rechavi

    This PDF file includes:

    • Figs. S1 to S20
    • Tables S1 and S2
