Research Article | CANCER

DORGE: Discovery of Oncogenes and tumoR suppressor genes using Genetic and Epigenetic features

Science Advances  11 Nov 2020:
Vol. 6, no. 46, eaba6784
DOI: 10.1126/sciadv.aba6784
  • Fig. 1 Features that discriminate TSGs from OGs.

    (A) Feature groups selected for TSGs. (B) Feature groups selected for OGs. Feature groups are sorted according to the AUPRC reduction in elastic net fivefold cross-validation. Feature groups are named according to the representative features. Box plots showing the distribution of (C) H3K4me3 mean peak length, (D) VEST score, (E) missense damaging/benign ratio, (F) missense entropy, (G) pLI score, and (H) super enhancer percentage for the CGC-OG, CGC-TSG, and NG sets. Genes annotated as both TSGs and OGs are excluded. P values for the differences between the TSGs/OGs and NGs were calculated by the one-sided “greater than” Wilcoxon rank-sum test.

  • Fig. 2 Evaluation of the DORGE method and characterization of the DORGE-predicted novel TSGs and OGs.

    Venn diagrams showing the overlap (A) between DORGE-predicted novel TSGs/OGs and CGC-TSGs/OGs; (B) between DORGE-predicted novel TSGs, CGC-TSGs, CancerMine-TSGs, and TSGene database-TSGs; and (C) between DORGE-predicted novel OGs, CGC-OGs, CancerMine-OGs, and ONGene database-OGs. Precision-recall curves (PRCs) for (D) TSG and (E) OG prediction. Different lines represent different PRCs from DORGE or DORGE variants. (F) Stacked bar plots showing the number of rediscovered CGC-TSGs and CGC-OGs using all features compared to CRISPR-screening data only. Cumulative distribution function (CDF) plots of DORGE-predicted TSG scores (G) and OG scores (H) of 19,636 human genes. The x axis and the y axis are swapped for illustration purposes, and the y axis is stretched to emphasize large TSG and OG scores. CGC genes are plotted as jitter points to avoid overplotting. The dashed lines indicate DORGE-TSG and DORGE-OG thresholds at a target FPR of 1%, and the CGC genes whose TSG scores and OG scores exceed the thresholds (above the dashed lines) are predicted as TSGs and OGs. (I) Top 15 DORGE-predicted non-CGC novel TSGs (left) and OGs (right), respectively, along with representative feature heatmaps and PubMed IDs. To make features comparable, feature values are transformed into quantiles. (J) Top 15 DORGE-predicted non-CGC novel TSGs (left) and OGs (right) that have no documented role in cancer based on the TSGene, ONGene, and CancerMine databases, along with representative feature heatmaps.
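
The thresholds in panels (G) and (H) are set at a target FPR of 1%. One way such a cutoff could be derived is sketched below. This is a minimal illustration, not the authors' code: it assumes the FPR is estimated on a set of negative (for example, neutral) genes, and the function names `threshold_at_fpr` and `call_positive` are hypothetical.

```python
def threshold_at_fpr(negative_scores, target_fpr=0.01):
    """Return a cutoff such that at most `target_fpr` of the negatives score above it.

    Assumption: genes are called positive when their score is strictly
    greater than the cutoff.
    """
    ranked = sorted(negative_scores, reverse=True)
    if not ranked:
        raise ValueError("need at least one negative score")
    # Number of negatives that may exceed the cutoff (small epsilon guards
    # against floating-point error in the product).
    allowed = int(target_fpr * len(ranked) + 1e-9)
    # The (allowed + 1)-th highest negative score becomes the cutoff.
    return ranked[min(allowed, len(ranked) - 1)]


def call_positive(scores, threshold):
    """Return the genes whose score strictly exceeds the cutoff."""
    return {gene for gene, score in scores.items() if score > threshold}


if __name__ == "__main__":
    # Toy negative scores, evenly spread between 0 and 1 (illustrative only).
    negatives = [i / 1000 for i in range(1000)]
    cutoff = threshold_at_fpr(negatives)
    observed_fpr = sum(s > cutoff for s in negatives) / len(negatives)
    print(cutoff, observed_fpr)
```

With tied scores at the cutoff, the strict comparison makes the observed FPR fall at or below the target rather than above it.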

  • Fig. 3 Characterization and evaluation of DORGE-predicted novel TSGs/OGs by independent functional genomic and genomic datasets.

    (A) KEGG pathway enrichment analysis performed by Enrichr (75) for DORGE-predicted novel TSGs and OGs. Because of space limitations, only terms with adjusted P values < 10⁻⁴ are shown. In addition, terms with adjusted P values 10⁸-fold lower for TSGs than OGs or 10⁴-fold lower for OGs than TSGs are also shown. (B) ATAC-seq peak score measuring open chromatin for CGC-TSGs/OGs, DORGE-predicted novel TSGs/OGs, and NGs. Enrichment heatmaps of various gene types in (C) ER gene list and (D) inactivating pattern gene list for SB insertional mutagenesis, a screening tool for cancer driver genes. (E) Boxplot showing the Cox hazard ratio (HR) score for various gene types. Data are from rectum adenocarcinoma (READ). (F) Boxplot showing the phyloP score for various gene types. The phyloP score measures phylogenetic conservation and represents –logP values under a null hypothesis of neutral evolution. PhyloP basewise conservation scores were derived from a Multiz alignment of 46 vertebrate species. (G) TSGs and OGs are enriched in genes having earlier evolutionary origin (Eukaryota). P values for the differences between indicated gene categories were calculated by the one-sided Wilcoxon rank-sum test. In boxplots and heatmap, the Fisher’s exact test is used to calculate P values, and gene numbers in different gene categories are normalized to 200 to make P values comparable. In this figure, dual-functional CGC genes were excluded from the CGC-TSGs/OGs.
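
The caption states that enrichment P values come from Fisher's exact test after gene numbers in each category are normalized to 200. The sketch below shows one plausible reading of that procedure. It assumes a one-sided (enrichment) test and assumes the normalization is a proportional rescaling of the overlap count; neither detail is given in this section. The function names `fisher_enrichment_p` and `normalize_overlap` are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap, category_size, list_size, universe_size):
    """One-sided Fisher's exact test P value for enrichment.

    Probability of seeing at least `overlap` genes from a reference list of
    `list_size` genes in a category of `category_size` genes drawn from
    `universe_size` genes (hypergeometric upper tail).
    """
    max_overlap = min(category_size, list_size)
    total = comb(universe_size, category_size)
    tail = sum(
        comb(list_size, k) * comb(universe_size - list_size, category_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def normalize_overlap(overlap, category_size, target_size=200):
    """Rescale an overlap count as if the category contained `target_size` genes.

    Assumption: proportional rescaling with rounding to the nearest integer.
    """
    return round(overlap * target_size / category_size), target_size


if __name__ == "__main__":
    # Toy numbers: 30 of 600 category genes fall in a 500-gene reference list.
    k, n = normalize_overlap(30, 600)
    print(k, n, fisher_enrichment_p(k, n, list_size=500, universe_size=19636))
```

Rescaling every category to the same size keeps large categories from producing smaller P values simply because they contain more genes.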

  • Fig. 4 Dual-functional cancer driver genes act as backbones in BioGRID PPI and characterization of hub genes in PPI and PharmacoDB gene-drug networks.

    (A) Complete BioGRID PPI network. (B) The Molecular Complex Detection (MCODE) algorithm was applied to DORGE-predicted novel TSGs/OGs to identify densely connected network modules (or backbones). All genes in the identified network are CGC dual-functional genes or novel dual-functional genes. Gene categories are represented as pie charts, with the colors coded based on gene categories. (C) Enrichment of CGC-TSGs/OGs and DORGE-predicted novel TSGs/OGs in hub genes in the BioGRID network. (D) Enrichment of various gene sets or epigenetic and mutational patterns in hub genes in the BioGRID network. (E) Enrichment of CGC-TSGs/OGs and DORGE-predicted novel TSGs/OGs in hub genes in the PharmacoDB gene-drug network. (F) Enrichment of various gene sets or epigenetic and mutational features in hub genes in the PharmacoDB gene-drug network. Hub genes are defined as the genes with the top 5% highest degree in the BioGRID or PharmacoDB network. To generate comparable P values, the gene numbers in different gene categories were normalized to 200. Broad H3K4me3: Genes with H3K4me3 length > 4000. P values for the differences between indicated gene categories were calculated by the right-sided Wilcoxon rank-sum test.
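
Hub genes are defined above as the genes with the top 5% highest degree in a network. A minimal sketch of that selection from an undirected edge list follows. It assumes that self-interactions and duplicate edges are ignored and that ties are broken alphabetically, none of which is specified in this section. The function name `hub_genes` is hypothetical.

```python
import math
from collections import defaultdict


def hub_genes(edges, top_fraction=0.05):
    """Return the genes whose degree ranks in the top `top_fraction` of the network."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # assumption: self-interactions do not count toward degree
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Degree = number of distinct interaction partners.
    degree = {gene: len(partners) for gene, partners in neighbors.items()}
    n_hubs = max(1, math.ceil(top_fraction * len(degree)))
    # Sort by descending degree; ties are broken alphabetically (assumption).
    ranked = sorted(degree, key=lambda gene: (-degree[gene], gene))
    return set(ranked[:n_hubs])


if __name__ == "__main__":
    # Toy star network: one central gene connected to 19 others.
    toy_edges = [("H", f"G{i}") for i in range(19)]
    print(hub_genes(toy_edges))
```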

  • Table 1 Evaluation of cancer driver gene (TSGs + OGs) prediction based on the v.87 CGC genes.

    Method               #     Sn     Sp     Precision  Accuracy  Algorithm
    DORGE                1172  0.611  0.997  0.966      0.948     Logistic regression with the elastic net model
    OncodriveFM (34)     2600  0.338  0.915  0.367      0.841     Functional impact model
    MuSIC (35)           1975  0.331  0.870  0.272      0.801     Mutational background model
    MutPanning (36)      460   0.318  0.994  0.880      0.907     Nucleotide context model
    TUSON (9)            243   0.222  0.999  0.961      0.900     P value combination
    OncodriveFML (58)    680   0.212  0.983  0.646      0.885     Functional impact model
    20/20+ (7)           193   0.208  1.000  0.991      0.899     Random Forest model
    GUST (78)            276   0.206  0.994  0.838      0.894     Random Forest model
    MutSigCV (57)        158   0.137  0.998  0.905      0.888     Mutational background model
    OncodriveCLUST (59)  586   0.118  0.963  0.319      0.855     Mutational hotspot model
    ActiveDriver (61)    417   0.098  0.996  0.771      0.881     Logistic regression model

    Sn, sensitivity; Sp, specificity.
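
For reference, the Sn, Sp, Precision, and Accuracy columns of Table 1 follow the standard confusion-matrix definitions, which can be sketched as below. The evaluation gene sets used here are an assumption for illustration: the sketch scores predictions only against a labeled positive set and a labeled negative set and ignores predicted genes that belong to neither. The function name `evaluate_predictions` is hypothetical.

```python
def evaluate_predictions(predicted, positives, negatives):
    """Compute Sn, Sp, Precision, and Accuracy for a set of predicted driver genes."""
    predicted, positives, negatives = set(predicted), set(positives), set(negatives)
    tp = len(predicted & positives)   # known drivers that were predicted
    fn = len(positives - predicted)   # known drivers that were missed
    fp = len(predicted & negatives)   # negatives wrongly predicted as drivers
    tn = len(negatives - predicted)   # negatives correctly left out

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "Precision": ratio(tp, tp + fp),
        "Accuracy": ratio(tp + tn, tp + fn + fp + tn),
    }


if __name__ == "__main__":
    # Toy gene sets (illustrative only, not taken from the article).
    metrics = evaluate_predictions(
        predicted={"A", "B", "X"},
        positives={"A", "B", "C", "D"},
        negatives={"X", "Y", "Z", "W", "V", "U"},
    )
    print(metrics)
```

Because precision depends only on the predicted genes that fall in the labeled sets, a method that predicts few genes can reach high precision while its sensitivity stays low, which is the pattern visible for several methods in the table.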

