Research Article | CANCER

Matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns

Science Advances  01 Jul 2020:
Vol. 6, no. 27, eaba1862
DOI: 10.1126/sciadv.aba1862
  • Fig. 1 Methodology for data alignment and cancer type classification.

    (A) Principal component (PC) 1 and PC2 of a PCA, in the gene expression (GE) data, before adjustment for batch effects (raw data) and after adjustment [quantile normalization (QN) + ComBat] [see fig. S1 for PCA of DNA methylation data (MET)]. Colors represent the dataset sources (GDSC and CCLE are two sources for the cell line data, and TCGA is the source for the tumor data). (B) ROC curves for classifying tumors versus cell lines in the data before adjustment (orange) and after adjustment (blue) for GE and MET. (C) Schematic overview of the HyperTracker methodology. First, we systematically identified possible mislabeled cell lines using GE and MET data, independently. Second, we used various types of mutation-based data to corroborate the predictions. Third, we further validated the cell lines (CL) suspected to originate from skin using independent data, such as drug sensitivity. CT, cancer type.

  • Fig. 2 Detection of cell lines suspected to be mislabeled with a different cancer type.

    (A) TCGA-based precision scores calculated by the one-versus-rest MET and GE cancer type classifiers for 614 cell lines and for 69 blood cancer cell lines. The higher the precision, the higher the confidence that the sample belongs to that particular cancer type (here, showing cases of SKCM, KIRC, and CRAD from left to right; see fig. S2 for the other cancer types). The cell lines that were originally annotated as the cancer type being tested are shown in red, and the rest in blue. (B) Heat map showing the 25 genes (GE) and CpG probes (MET) with the highest absolute values of ridge regression coefficients in one-versus-rest classifiers for each of the cancer types in the plot. The suspected skin cell lines are labeled. The cancer types shown are the suspected cancer type [melanoma (SKCM) in this case] and, additionally, the originally declared cancer types of the suspected cell lines [here, esophagus and stomach cancer (ESTAD), sarcoma (SARC), colorectal cancer (CRAD), and ovarian and uterus cancer (GYNE)]. See fig. S3 for the heat maps for the rest of the suspected cell lines. (C) Overview of the results from the systematic mislabeling testing of all cell lines. Cell lines with a TCGA-based precision ≥ 0.7 for their original cancer type (i) in both GE and MET are assigned to the golden set and (ii) in either GE or MET are assigned to the silver set. If, however, the TCGA-based precision is ≥ 0.7 for a different cancer type in GE and in MET, the cell line is assigned to the suspect set.

  • Fig. 3 Additional evidence supporting tissue identity of the suspected mislabeled cell lines.

    (A) Prediction consistency score (0 to 20) for each suspected cell line across 20 runs of one-versus-one classifiers that predicted suspected versus original cancer type in GE, MET, CNA, trinucleotide mutation spectrum (MS96), and oncogenic mutations (OGM). A value of 20 means that the cell line is predicted as the suspected cancer type in all 20 runs of the classification algorithm, and a value of 0 means that it is predicted as the original cancer type in all 20 runs. (B) Histograms of the consistency scores for CNA and MS96 classifiers for the models based on actual data and a baseline expectation on randomized data. (C) Prediction scores for MS96 and CNA for the suspected cell lines. Colors represent the suspected cancer type [see column “new” in (A)]. Gray dots represent the random values. (D) Drug sensitivity (IC50) for mutant BRAF-targeting drugs dabrafenib and trametinib for 614 cell lines. Cell lines originally labeled as skin cancer are shown in turquoise, and skin-suspected cell lines are marked with a square and their name. (E) Burden of UV-associated mutation signature 7 (estimated from two different sources) in 614 cell lines. Cell lines originally labeled as skin cancer are shown in turquoise, and skin-suspected cell lines are marked with a square and their name.

  • Fig. 4 Drug sensitivity association testing using high-confidence sets of cell lines.

    (A) Drug sensitivity (IC50) to dabrafenib in all colorectal (CRAD) cell lines (left) and all CRAD cell lines except MDST8 (right), which is suspected of being skin cancer. Cell lines with and without (wild type) a BRAF mutation are compared. ANOVA FDR for this association (dabrafenib sensitivity with BRAF mutation status) is shown in blue for both datasets. A horizontal line is shown at 0 because a score < 0 implies sensitivity to the drug. Dots and error bars represent the mean and SEM. (B) Number of significant associations between “CFEs” (which include mutations and CNAs in cancer genes) and drug sensitivity detected (at FDR 25%) in the ANOVA test for all cell lines (“all”), cell lines in the golden set (“G”), cell lines in the golden plus silver sets (“G&S”), a random subset of cell lines that matches the number in the golden set (“r_G”), and a random subset of cell lines that matches the number in the golden plus silver sets (“r_G&S”). For the random subsets, the number of significant associations is calculated from 10 random selections and the median is shown. P values for a sign test (one-tailed) between the number of associations in the G/G&S and in r_G/r_G&S are shown. See fig. S7 for the remaining cancer types. (C) Differential sensitivity of drugs was analyzed by ANOVA for all brain cancer cell lines (left) and the brain cancer cell lines in the golden set only (right). Each point is an association between the sensitivity of a drug and a genetic feature (CFE). (D) Differential sensitivity of drugs was analyzed by ANOVA for all pancreatic (PAAD) cell lines (left) and PAAD cell lines in the golden and silver sets only (right). Each point is an association between the sensitivity of a drug and a genetic feature (CFE). n.s., not significant.
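
The set-assignment rule in the Fig. 2C caption and the consistency score in the Fig. 3A caption can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the dictionary layout, and the "unassigned" label are hypothetical, and two details that the captions leave open are filled in as assumptions: the golden and silver checks take precedence over the suspect check, and a suspect call requires the same non-original cancer type to pass the threshold in both GE and MET.

```python
from typing import Dict, List

# Precision cutoff quoted in the Fig. 2C caption.
THRESHOLD = 0.7


def assign_set(original: str,
               ge_precision: Dict[str, float],
               met_precision: Dict[str, float],
               threshold: float = THRESHOLD) -> str:
    """Assign one cell line to a set from its per-cancer-type precision scores.

    `ge_precision` and `met_precision` map cancer type -> TCGA-based precision
    (hypothetical layout; missing cancer types are treated as precision 0).
    """
    ge_ok = ge_precision.get(original, 0.0) >= threshold
    met_ok = met_precision.get(original, 0.0) >= threshold
    if ge_ok and met_ok:
        return "golden"
    if ge_ok or met_ok:
        return "silver"
    # Assumption: the same different cancer type must pass in both data types.
    for cancer_type, precision in ge_precision.items():
        if (cancer_type != original
                and precision >= threshold
                and met_precision.get(cancer_type, 0.0) >= threshold):
            return "suspect"
    # Hypothetical label for cell lines matching none of the caption's rules.
    return "unassigned"


def consistency_score(run_predictions: List[str], suspected: str) -> int:
    """Count the classifier runs that predicted the suspected cancer type."""
    return sum(1 for predicted in run_predictions if predicted == suspected)
```

With 20 one-versus-one runs, `consistency_score` ranges from 0 (every run predicts the original cancer type) to 20 (every run predicts the suspected one), matching the scale described in the Fig. 3A caption.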

Supplementary Materials

    Marina Salvadores, Francisco Fuster-Tormo, Fran Supek

    The PDF file includes:

    • Figs. S1 to S9
