Research Article | GENETICS

SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance

Science Advances  06 Nov 2019:
Vol. 5, no. 11, eaax9249
DOI: 10.1126/sciadv.aax9249
  • Fig. 1 Correlations between indel frequencies at endogenous and integrated sites and effect of chromatin accessibility on indel frequencies.

    (A) Correlation between indel frequencies at 120 endogenous and corresponding integrated target sequences. The Spearman correlation coefficients (R) and squared Pearson correlation coefficients (R²) are shown. (B) Effect of chromatin accessibility on the activities of SpCas9 (left) and AsCpf1 (right) at endogenous sites. Indel frequencies at endogenous sites were evaluated after transfection of plasmids encoding SpCas9 or AsCpf1 and guide RNAs. The target sites were divided into two groups, DNase I hypersensitive (DHS) sites and other sites (non-DHS), and their indel frequencies were compared. The numbers of analyzed target sites are as follows: SpCas9, n = 50 for DHS target sites and n = 74 for non-DHS target sites; AsCpf1, n = 20 for DHS target sites and n = 35 for non-DHS target sites. The HEK-plasmid dataset from (20) was used for drawing this graph. Error bars represent SEM. Statistical significance, determined by Student’s t test, is shown. (C and D) Correlation between indel frequencies at endogenous and corresponding integrated target sequences at 50 DHS sites (C) and 70 non-DHS sites (D). The Spearman correlation coefficients (R) and squared Pearson correlation coefficients (R²) are shown.
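
The two statistics reported in this legend, the Spearman coefficient (R) and the squared Pearson coefficient (R²), can be computed as in the minimal sketch below. It is an illustration only, not the authors' analysis code: the function names and the five pairs of indel frequencies are hypothetical and do not come from the paper's datasets.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical indel frequencies (%) at five paired target sites.
endogenous = [5.0, 12.0, 30.0, 45.0, 60.0]
integrated = [8.0, 10.0, 35.0, 40.0, 80.0]

spearman_r = spearman(endogenous, integrated)          # rank agreement
pearson_r_squared = pearson(endogenous, integrated) ** 2  # linear fit quality
print(f"Spearman R = {spearman_r:.3f}, Pearson R^2 = {pearson_r_squared:.3f}")
```

Because Spearman's coefficient depends only on rank order, it rewards any monotonic relationship between the two measurements, whereas the squared Pearson coefficient measures how well a straight line fits them.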

  • Fig. 2 Evaluation of machine learning–based computational models predicting Cas9 activities.

    (A) Cross-validation of DeepSpCas9 models trained on datasets of varying sizes. Each dot represents the Spearman correlation coefficient between the measured indel frequency and the predicted activity from 10-fold cross-validation (total n = 10 correlation coefficients). (B) Cross-validation of SpCas9 activity prediction models based on previously reported machine learning–based approaches. Each dot represents the Spearman correlation coefficient between the measured indel frequency and the predicted activity from 10-fold cross-validation (total n = 10 correlation coefficients). Statistical significance between the best, next-best, and third-best models is shown (Steiger’s test). In (A) and (B), the top, middle, and bottom lines in the boxes represent the 75th, 50th, and 25th percentiles, respectively. Whiskers indicate the minimum and maximum values. The confidence intervals are described in table S6. RT, regression trees. (C) Performance comparison of DeepSpCas9 with other prediction models using dataset Endo_Cas9 (n = 124 independent target sites) and two published datasets (n = 4207 and 2060 independent target sites for datasets Hart 2015 and Xu 2015, respectively) as the test datasets. Error bars represent 95% confidence intervals, which are described in detail in table S6. For clarity, results from statistical testing are shown only for DeepSpCas9 versus deep learning with an equal-sized filter, DeepSpCas9 versus the best conventional machine learning–based model, and deep learning with an equal-sized filter versus the best conventional machine learning–based model (left to right: *P = 1.4 × 10⁻², DeepSpCas9 versus deep learning with an equal-sized filter; *P = 1.1 × 10⁻², DeepSpCas9 versus SVM; *P = 4.6 × 10⁻², deep learning with an equal-sized filter versus SVM; Steiger’s test). ns, not significant. (D) Performance comparison of DeepSpCas9 and DeepSpCas9-CA (chromatin accessibility). The DeepSpCas9-CA model was developed by fine-tuning the DeepSpCas9 model using the Endo-1A dataset. DeepSpCas9 (left) and DeepSpCas9-CA (right) models were evaluated with the Endo-1B dataset. The Spearman correlation coefficients (R) are shown. (E) Results from 10 iterations of fine-tuning and evaluation. Each dot represents the Spearman correlation coefficient between the measured indel frequency and the predicted activity. A total of 10 (= 2 × 5) rounds of fine-tuning and subsequent testing results are shown.
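
Panels (A) and (B) each plot one Spearman coefficient per fold of a 10-fold cross-validation. The sketch below shows how such a set of 10 coefficients could arise. It rests on loud assumptions: the data are synthetic, and a one-feature least-squares line (`fit_line`, a hypothetical helper) stands in for the deep learning and conventional models evaluated in the paper.

```python
import random
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Stand-in model (assumption): ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

def cross_validate(xs, ys, k=10, seed=0):
    """Return one Spearman coefficient per held-out fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        predicted = [slope * xs[i] + intercept for i in held_out]
        measured = [ys[i] for i in held_out]
        scores.append(spearman(predicted, measured))
    return scores

# Synthetic data: one feature and a noisy "measured activity".
rng = random.Random(1)
feature = [rng.random() for _ in range(200)]
activity = [2.0 * f + rng.gauss(0, 0.2) for f in feature]

scores = cross_validate(feature, activity)
print(len(scores), [round(s, 2) for s in scores])
```

Every site is held out exactly once, so the 10 coefficients summarize performance on data the model did not see during training.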

  • Fig. 3 Comparison of generalization performances of computational models predicting Cas9 activities.

    The heat map shows Spearman correlation coefficients determined from DeepSpCas9 and previously reported models, which are arranged horizontally. The names of the vertically placed test datasets include information about the cell line or species used. Other related parameters, such as the guide RNA expression method [U6 promoter–driven (U6) versus in vitro transcribed (IVT)], the Cas9 activity analysis method [phenotypic change (phenotype) versus indel], and the number of analyzed sites, are also shown. Each gray box indicates the correlation of a model tested against a test dataset that includes its own training dataset. In the evaluation against each test dataset, the statistical significance between the two best models is indicated for the best model (from the top: ****P = 5.3 × 10⁻⁹, ****P = 1.8 × 10⁻¹⁰, ****P = 3.4 × 10⁻⁸, ****P = 1.1 × 10⁻¹³, ****P = 2.9 × 10⁻¹¹, ****P = 3.9 × 10⁻⁸, ***P = 2.5 × 10⁻⁴, *P = 3.7 × 10⁻², and *P = 3.9 × 10⁻²; Steiger’s test).
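
The legends cite Steiger’s test for comparing two models scored against the same measured values. The sketch below implements one common form of that test, Steiger's (1980) Z statistic for two dependent correlations that share a variable. Whether the authors used exactly this variant is an assumption, and the function name and all input values are hypothetical.

```python
from math import atanh, erfc, sqrt

def steiger_z(r_a, r_b, r_ab, n):
    """Compare two dependent correlations that share one variable.

    r_a  : correlation of model A's predictions with the measured values
    r_b  : correlation of model B's predictions with the measured values
    r_ab : correlation between the two models' predictions
    n    : number of target sites
    Returns (Z, two-sided P) under a standard normal reference.
    """
    r_mean = (r_a + r_b) / 2
    # Covariance term of the two correlations, evaluated at their mean.
    psi = (r_ab * (1 - 2 * r_mean ** 2)
           - 0.5 * r_mean ** 2 * (1 - 2 * r_mean ** 2 - r_ab ** 2))
    c = psi / (1 - r_mean ** 2) ** 2
    # Difference of Fisher z-transformed correlations, scaled by its SE.
    z = (atanh(r_a) - atanh(r_b)) * sqrt(n - 3) / sqrt(2 - 2 * c)
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical example: two models evaluated on the same 124 sites.
z, p = steiger_z(r_a=0.7, r_b=0.5, r_ab=0.6, n=124)
print(f"Z = {z:.2f}, P = {p:.2g}")
```

The test accounts for the fact that both correlations are computed on the same sites, so they are not independent; treating them as independent would misstate the significance of their difference.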

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/11/eaax9249/DC1

    Fig. S1. Development of a high-throughput evaluation system for Cas9-induced indel frequencies.

    Fig. S2. Overview of DeepSpCas9 development.

    Table S1. Datasets generated from the results of the high-throughput experiments.

    Table S2. Datasets used for this study.

    Table S3. Datasets generated from the results of the experiments at endogenous target sites.

    Table S4. Oligonucleotides used in this study.

    Table S5. Model selection results.

    Table S6. Confidence intervals for the values shown in the graphs.

    Supplementary Code

  • Supplementary Materials

    The PDF file includes:

    • Fig. S1. Development of a high-throughput evaluation system for Cas9-induced indel frequencies.
    • Fig. S2. Overview of DeepSpCas9 development.
    • Legend for table S1
    • Table S2. Datasets used for this study.
    • Legends for tables S3 to S6
    • Legend for Supplementary Code

    Other Supplementary Material for this manuscript includes the following:

    • Table S1 (Microsoft Excel format). Datasets generated from the results of the high-throughput experiments.
    • Table S3 (Microsoft Excel format). Datasets generated from the results of the experiments at endogenous target sites.
    • Table S4 (Microsoft Excel format). Oligonucleotides used in this study.
    • Table S5 (Microsoft Excel format). Model selection results.
    • Table S6 (Microsoft Excel format). Confidence intervals for the values shown in the graphs.
    • Supplementary Code (.zip format)
