Research Article | CELL BIOLOGY

Ancestry-dependent gene expression correlates with reprogramming to pluripotency and multiple dynamic biological processes


Science Advances  20 Nov 2020:
Vol. 6, no. 47, eabc3851
DOI: 10.1126/sciadv.abc3851
  • Fig. 1 Substantial gene expression changes during iPSC induction are sex and race independent.

    (A) Volcano plot displays the log2 fold change in expression during reprogramming for all detected RNAs. Transcripts with log2 fold change ≥4 or ≤−4 (FDR ≤ 0.05) are highlighted in red. (B) Box plots show the log2 fold change during reprogramming in each matched DF-iPSC pair for a subset of canonical stem cell and fibroblast markers. (C) Principal components analysis (PCA) plot showing clusters of samples based on similarity. The first two components (PC1 and PC2) of gene expression variance are displayed. Each dot represents a sample color coded by both cell type and demographic. (D) Ridgeline plots compare the pairwise differences of expression data for each gene up-regulated during reprogramming. AA, African American; WA, White American; M, male; F, female.

  • Fig. 2 Ancestry-dependent and ancestry-independent genes are associated with reprogramming efficiency.

    (A) Overlap of genes significantly associated with reprogramming efficiency (Spearman, P ≤ 0.01) in the full cohort and in each ancestry analyzed independently, separated into positively and negatively associated genes. The Spearman correlation values corresponding to the P ≤ 0.01 cutoff are 0.388 for n = 36 (race cohorts) and 0.274 for n = 72 (total cohort). (B and C) Line plots with 95% confidence intervals of RNA expression (log2) in DFs (dark colors) or iPSCs (iPS; matching light colors) for each individual sample in the cohort plotted against reprogramming efficiency, shown separately for African American (AA; red) and White American (WA; blue) samples. (B) Examples of genes associated in African Americans only (GAS2 and PLCE1). (C) Examples of genes associated in White Americans only (FAM69B and WASHC2C). Spearman correlations calculated in the total population are in gray, African American in red, and White American in blue. Spearman correlations that reach a significance of P ≤ 0.01 are denoted by * in individual plots. (D) GO analysis for genes associated with reprogramming efficiency in DFs. Heatmaps show the −log10 P value for enriched GO categories, grouped into related broader groupings, with ancestry-dependent functional categories in orange and functional categories identified in both the combined analysis and only one race in purple (AA) and green (WA).

  • Fig. 3 Majority of genes associated with reprogramming efficiency are primed.

    (A) Number of associated genes (Spearman, P ≤ 0.01) considered to have a primed (orange, red, or blue) or a nonprimed (gray) gene expression pattern. (B to E) Log2 fold change between each individual DF line and the mean iPSC expression for example genes that are (B) primed and positively associated with reprogramming efficiency in the combined ancestries (KIF26A), African Americans only (GAS2), and White Americans only (FAM69B); (C) primed and negatively associated with reprogramming efficiency in the combined ancestries (TOR1B), African Americans only (PLCE1), and White Americans only (WASHC2C); (D) nonprimed and positively associated with reprogramming efficiency in the combined ancestries (TCF4), African Americans only (GRP), and White Americans only (CCDC36); and (E) nonprimed and negatively associated with reprogramming efficiency in the combined ancestries (GRB14), African Americans only (TINAGL1), and White Americans only (CELF5). Note that the samples that reprogram with higher efficiency have log2 fold changes closer to 0 (dashed green line) in primed genes and farther from 0 in nonprimed genes.

  • Fig. 4 Scoring by ancestry-dependent reprogramming efficiency improves rank score.

    (A) Schematic of binary scoring system used to rank samples on sum of expression of gene sets of interest. For a given fibroblast line (hypothetical examples shown in black) with expression levels of a positively associated gene above the cohort mean (green line) or expression levels of a negatively associated gene below the cohort mean, a score for that individual gene was 1. If expression levels did not meet these criteria, then a value of 0 was assigned. Total scores for all genes in a gene set of interest were summed and used to rank order samples. (B) Histograms of the distribution of the correlations between reprogramming efficiency and the relative sample ranks resulting from our scoring system when applied to 10,000 random sets of 750 genes, in the total cohort (gray) and race-specific subcohorts (red and blue). Arrows indicate the correlation values when the scoring system is applied to the top 750 associated genes (based on Spearman correlations) in the total cohort (gray), African American cohort (red), and White American cohort (blue).

  • Fig. 5 Reprogramming efficiency–associated genes are involved in multiple dynamic processes.

    (A) Enrichment plots generated using preranked (Pearson) GSEA for gene sets of interest. (B and C) Univariate predictive ability, measured by area under the ROC curve (AUC), for (B) 5-year breast cancer survival and (C) race and reprogramming efficiency. The 95% confidence intervals (CIs) are shown.

  • Fig. 6 Prediction of previously unknown regulators of genes associated with reprogramming efficiency.

    Concept network (Cnet) plots showing exemplar upstream regulators of efficiency-associated genes (full results are provided in table S5). (A) Regulators and target genes are color-coded by whether they are a hit/associated in the total cohort (gray), African American cohort (red), or White American cohort (blue). (B) Same as in (A) but only including information for the ancestry-dependent analyses.
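
The binary scoring scheme in Fig. 4A and the random-gene-set comparison in Fig. 4B can be illustrated with a short Python sketch. This is not the authors' code. The function names, the toy gene names (GENE_UP, GENE_DOWN), and the toy data are hypothetical. One further assumption is that each randomly drawn gene is scored in the direction of its own Spearman correlation with efficiency, which the caption does not specify.

```python
import random
from statistics import mean


def rank_average(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = rank_average(x), rank_average(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant input; treated as 0 here
    return cov / (var_x * var_y) ** 0.5


def binary_scores(expression, directions):
    """Sum per-gene binary scores for each sample.

    expression: {gene: [expression value per sample]}
    directions: {gene: +1 (positively associated) or -1 (negatively associated)}
    A sample scores 1 for a gene if it lies above the cohort mean (positive gene)
    or below the cohort mean (negative gene), and 0 otherwise.
    """
    n_samples = len(next(iter(expression.values())))
    totals = [0] * n_samples
    for gene, sign in directions.items():
        values = expression[gene]
        cohort_mean = mean(values)
        for s, v in enumerate(values):
            if (sign > 0 and v > cohort_mean) or (sign < 0 and v < cohort_mean):
                totals[s] += 1
    return totals


def null_correlations(expression, efficiency, set_size, n_sets, seed=0):
    """Correlate efficiency with scores from randomly drawn gene sets.

    Assumption: each gene's direction is the sign of its own Spearman correlation.
    """
    rng = random.Random(seed)
    genes = sorted(expression)
    signs = {g: 1 if spearman(expression[g], efficiency) >= 0 else -1 for g in genes}
    results = []
    for _ in range(n_sets):
        chosen = rng.sample(genes, set_size)
        scores = binary_scores(expression, {g: signs[g] for g in chosen})
        results.append(spearman(scores, efficiency))
    return results


# Toy data: four hypothetical fibroblast lines, two hypothetical genes.
efficiency = [0.1, 0.2, 0.3, 0.4]
expression = {
    "GENE_UP": [1.0, 2.0, 3.0, 4.0],    # rises with efficiency
    "GENE_DOWN": [9.0, 7.0, 5.0, 3.0],  # falls with efficiency
}
scores = binary_scores(expression, {"GENE_UP": 1, "GENE_DOWN": -1})
print(scores)  # [0, 0, 2, 2]
print(round(spearman(scores, efficiency), 3))  # 0.894
```

For a comparison along the lines of Fig. 4B, `null_correlations(expression, efficiency, set_size=750, n_sets=10000)` would be run on a full expression matrix, and the correlation obtained from the top 750 associated genes would be compared against the resulting distribution.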

