Research ArticleCANCER

Defined lifestyle and germline factors predispose Asian populations to gastric cancer

See allHide authors and affiliations

Science Advances  06 May 2020:
Vol. 6, no. 19, eaav9778
DOI: 10.1126/sciadv.aav9778


Germline and environmental effects on the development of gastric cancers (GC) and their ethnic differences have been poorly understood. Here, we performed genomic-scale trans-ethnic analysis of 531 GCs (319 Asian and 212 non-Asians). There was one distinct GC subclass with clear alcohol-associated mutation signature and strong Asian specificity, almost all of which were attributable to alcohol intake behavior, smoking habit, and Asian-specific defective ALDH2 allele. Alcohol-related GCs have low mutation burden and characteristic immunological profiles. In addition, we found frequent (7.4%) germline CDH1 variants among Japanese GCs, most of which were attributed to a few recurrent single-nucleotide variants shared by Japanese and Koreans, suggesting the existence of common ancestral events among East Asians. Specifically, approximately one-fifth of diffuse-type GCs were attributable to the combination of alcohol intake and defective ALDH2 allele or to CDH1 variants. These results revealed uncharacterized impacts of germline variants and lifestyles in the high incidence areas.


Gastric cancer (GC) is the third leading cause of cancer mortality worldwide (1). The GC incidence shows strong geographic variations, with the highest incidence in East Asia, especially Japan and Korea (2). Although there are several reports that suggest hereditary and lifestyle factors, including strains of Helicobacter pylori (H. pylori), as contributors to these geographic variations (3, 4), no reports so far have analyzed these factors with genomic resolution in a trans-ethnic manner. Although recent whole-genome/exome sequencing has contributed to the global characterization of somatic genetics and driver genes of GCs (58), the precise interplays among lifestyles, germline variations, and somatic mutations, especially in the contexts of ethnic variations, are not fully clarified to date. To elucidate this point, we conducted whole-exome sequencing for a large cohort of GCs with detailed etiological information among Japanese, a population with one of the highest incidences of GCs in the world. These results were integrated with The Cancer Genome Atlas (TCGA) dataset to conduct trans-ethnic analysis of somatic and germline genetics of GCs.


Trans-ethnic mutational signature of GC

The overall profile of the whole-exome sequencing for 531 GC cases (288-case TCGA dataset and our 243-case Japanese dataset) exhibited substantial diversity in the somatic mutation profiles among GCs (Fig. 1A). Comparing the distributions of GC subtypes from the perspective of the TCGA subtypes (9), Epstein-Barr virus (EBV)–associated GC and microsatellite instability (MSI)–GC were more frequent in the TCGA dataset, while genomically stable (GS)–GC was more frequent in the Japanese cohort (table S1). Cancer genomes are known to have acquired various, but specific, somatic mutational patterns during carcinogenesis depending on the carcinogenic etiologies they experienced, recently called as mutational “Signatures” (10). We used a pan-cancer catalog of 30 Signatures from the COSMIC (Catalogue Of Somatic Mutations In Cancer) database (11) for this analysis (see Materials and Methods). According to the hierarchical clustering based on the mutational signature contributions in each case (see Materials and Methods), GCs were subdivided into six groups (marked by color bars at the top of the heat map), each of which exhibited dominant contribution of Signatures 1, 3, 6, 15, 16, 17, and others (Fig. 1A). Single-nucleotide variants (SNVs) attributable to Signatures 1, 3, 6, and 15 are known to be associated with the aging of the patients, BRCA deficiency, and mismatch repair (MMR) deficiency, respectively (10). To the contrary, theoretical etiologies of Signature 17 have not been determined to date.

Fig. 1 Trans-ethnic mutational signature analysis of GC.

(A) The overall genetic profiles of 531 GCs, including Japanese GC cohort and TCGA GC datasets, are shown, on the basis of the hierarchical clustering of mutational signatures. GCs were genetically subdivided into six subgroups, each of which is characterized by Signatures 1 (red), 6 (yellow), 3 (blue), 16 (orange, surrounded by the dotted line), 15 (green), and 17 (purple). Clinicopathological information, representative somatic driver gene alterations, and germline variations of ALDH2 rs671/ADH1B rs1229984 are indicated as black/white columns at the top. White-to-red colored columns in the hierarchical clustering map represent the contribution rates (red bar on the left side) of the mutational signatures in each case. Mutation frequencies per megabase (Mb) are indicated at the bottom as bar graphs. PI3Ks: PIK3CA and PTEN mutations. (B) The numbers of the Signature 16 SNVs/Mb in GC cases are shown in relation to the races and ALDH2 genotypes of the patients. P values were calculated using unpaired Wilcoxon rank sum test. (C) The numbers of total SNVs/Mb in cases of Signature 16 cluster and those of other clusters are shown. Cases with hypermutator signature (yellow and green bars) were excluded. P values were calculated using the unpaired Wilcoxon rank sum test. NA, not applicable.

To identify any unknown interplays between SNV Signatures and other factors, we examined possible clinicopathological factors in each of the clusters, including patient races and well-known germline variants. We then found that a subgroup (Fig. 1A, orange bar) was strongly contributed by Signature 16, whose contributions consisted of approximately 28% of somatic SNVs, and almost all the patients in this cluster were of Asian ethnic background (90.5%), and a high proportion of these patients (16 of 23 = 69.6%) harbored a well-known inactive ALDH2 allele (rs671 AA or AG) (table S2). We further confirmed this observation of the correlation between Signature 16 and the ALDH2 allele by analyzing the Signature 16 contributions and ALDH2 genotypes in all the 531 cases, finding their positive correlation in an Asian-specific manner (Fig. 1B; P < 0.0001, unpaired Wilcoxon rank sum test). As is reported (12), inactive ALDH2 rs671-A allele is specific to Asian populations; thus, the correlation of the ALDH2 genotype with Signature 16 contribution is considered attributable specifically in the Asian populations.

Although the GC cluster with Signature 16 had no characteristic patterns of major driver gene mutations, this cluster was relatively enriched for diffuse-type histology (12 of 23) (Fig. 1A and table S2). The overall mutation burdens in this cluster were significantly smaller than those of the other GCs (P = 0.0075; Fig. 1, A and C), while the age at onset and Signature 1 mutations of the GC cases in the Signature 16 cluster were comparable to the other GCs (fig. S1). In view of chromosomal aberrations, we compared large-scale state transition (LST) scores (13) between all the Signature clusters (fig. S2), revealing that the LST scores of Signature 16 cluster were among the lowest, except for hypermutators (Signature 6/15 clusters) that are already known to harbor fewer chromosomal abnormalities (5, 9) (fig. S2). Together, the Signature 16 cluster is suggested not only as an Asian-specific subclass but also as a distinct biological entity.

Etiological factors of Signature 16 GC

With these findings and recent reports showing that Signature 16 is suggestive of an alcohol-associated signature in liver, esophageal, and several other cancers (1416), we then performed an in-depth analysis of our Japanese cohort with more information of their lifestyles and etiologies to investigate the relationship among Signature 16, patient ALDH2 genotypes, and alcohol intake. We also analyzed how strongly the combination of germline genetics and lifestyles affects the high-GC incidences in the Japanese population. The existence of the Signature 16 GC cluster was clearly reconfirmed (Fig. 2A), and the cluster contained 6.6% (16 of 243) of Japanese GCs. A high portion of patients (11 of 16, 68.8%) in the Signature 16 GC cluster were characterized as alcohol consumers with inactive ALDH2 allele (table S3). Therefore, the Signature 16 mutations in the GC context were most probably attributable to the combined effects of alcohol intake and loss-of-function allele of ALDH2 (table S3). To quantitatively confirm this phenomenon, we investigated the correlation between the number of Signature 16 SNVs and the combination of the alcohol intake habit and ALDH2 allele among all the Japanese GCs (Fig. 2B). While the effects of either alcohol intake or the ALDH2 allele alone were minimal on the number of Signature 16 mutations, the combination of both the alcohol intake behavior and inactive ALDH2 allele resulted in an 11.1-fold increase of the Signature 16 burdens (P < 0.0001, unpaired Wilcoxon rank sum test; Fig. 2B). ALDH2 is an enzyme that metabolizes acetaldehyde to acetic acid, the former of which exhibits significant genotoxic effects (17). As reported, the gastric mucosa expresses ALDH2, and the acetaldehyde levels in the gastric contents of ALDH2-inactive individuals were 5.6 times higher than those in the ALDH2-active individuals, after intragastric alcohol infusion (18, 19). Therefore, the accumulated acetaldehyde due to the low ALDH2 activity most likely induced the Signature 16 mutation patterns in gastric epithelial cells. The Signature 16 SNV occurred at the transcribed strands significantly more frequently than at the untranscribed strands (fig. S3; P < 0.0001, Fisher’s exact test), suggesting its association with transcription-coupled nucleotide excision repair (20). In summary, we conclude that the mutational Signature 16 significantly contributes to GCs in Asian populations and is attributable to the combined effects of the germline factor and patients’ lifestyle, which are alcohol intake behavior and loss-of-function allele of ALDH2. It was revealed that patients with GC with high numbers of Signature 16 SNVs did not always consume large amounts of alcohol; therefore, it can be concluded that a large amount of alcohol is not necessarily required to induce Signature 16 SNVs in ALDH2-inactive individuals. Mutational Signature 16 is also observed in esophageal squamous cell carcinoma and is related to alcohol intake with inactive ADH1B genotypes (rs1229984-GG) (16). In our Japanese GC cohort, however, such a correlation was not statistically confirmed (fig. S4, A and B), partly because of the limited statistical power due to the low minor allele frequency (MAF) of the ADH1B rs1229984 locus.

Fig. 2 Mutational signature analysis of Japanese GCs with lifestyle and germline information.

(A) The overall genetic profiles of 243 Japanese GCs are shown as a hierarchical clustering heat map as in Fig. 1A. Clinicopathological information, germline variations of ALDH2 rs671 and ADH1B rs1229984, and alcohol and smoking habits are indicated as black/white columns at the top. The contribution rate of the mutational signatures in each case is shown as in Fig. 1A. The germline and somatic variations of the CDH1 and BRCA pathway genes are indicated at the bottom of the figure. Red and blue columns indicate truncation and missense variants, respectively, with the proviso that somatic mutations in hypermutators (GCs with Signature 6 contribution, yellow column) and BRCA pathway variations without BRCA signatures (both somatic and germline) are shown as transparent columns. (B) The numbers of Signature 16 SNVs are plotted according to the patient subgroups defined by ALDH2 genotype and alcohol consumption habit. P values were calculated using unpaired Wilcoxon rank sum test. (C) The numbers of Signature 16 SNVs are plotted according to the patient subgroups defined by the combinations of ALDH2 genotype, alcohol consumption, and smoking habit. P values were calculated using the unpaired Wilcoxon rank sum test.

It is well known that alcohol consumption together with smoking epidemiologically increases the risk of esophageal squamous cell carcinoma (21). Conversely, it is not fully evident whether or not smoking synergizes with alcohol consumption regarding the GC risk. Here, detailed etiological analysis using Japanese GC cohort revealed that smoking habit has synergistic effects on Signature 16 mutations, specifically among GCs of alcohol consumers with ALDH2 defective allele (rs671 AA/AG) (P = 0.0339; Fig. 2C). This synergic effect was not observed among patients with ALDH2-proficient genotypes (rs671 GG) nor was there an increase in the numbers of Signature 16 mutations among smokers without drinking habit. Therefore, it can be considered that combination of alcohol consumption and smoking habit has synergic effects on the generation of Signature 16 SNVs, specifically among people with ALDH2 defective allele. To the contrary, the so-called smoking signature (Signature 4) was not apparently dominant among patients in the Signature 16 cluster (Figs. 1A and 2A), and it was also revealed that Signature 4 SNVs were not synergized by alcohol consumption (fig. S4C), indicating a possible one-side effect of smoking habit on the generation of alcohol-related signatures during GC carcinogenesis.

Immunological features of alcohol-related GC

Because of the distinct nature of the Signature 16 GCs, especially their smaller mutation burdens, we were then motivated to investigate whether this GC subgroup has characteristic biology including immunological features (22). Bioinformatics approach using gene expression profiles of 184 (of 243 cases of exome-sequenced) Japanese GCs pointed out that Signature 16 GCs exhibited characteristic immunological microenvironments (Fig. 3A), where we estimated compositions of 22 types of tumor-infiltrating immune cells in each GC using CIBERSORT (cell-type identification by estimating relative subsets of RNA transcripts) algorism (see Materials and Methods) (23). It was revealed that Signature 16 GCs showed skewed distributions in principal components analysis in view of the compositions of infiltrating immune cells (Fig. 3A), and higher B cell infiltration was the most prominent feature of the Signature 16 GCs (Fig. 3B). This trend of B cell infiltration was commonly observed among Signature 16 GCs regardless of the tumor subtypes, either diffuse or intestinal types. It was also noted that expression of CXCL13, a well-established B cell–recruiting chemokine, was significantly higher among Signature 16 GCs compared to the other groups of GCs (Fig. 3C). To confirm these observations, we histopathologically evaluated the B cell infiltration and CXCL13 expression in Signature 16 GCs. It was revealed that cancer cells themselves, in addition to well-characterized follicular dendritic cells, expressed CXCL13, and in total, 68.8% (11 of 16) of Japanese Signature 16 GCs showed immunohistochemically recognizable positivity for CXCL13 in cancer cells, either in a focal or wide-spread manner (Fig. 3D). In these cases, cancer cell nests with CXCL13 positivity were occasionally accompanied with B cell infiltrations around the nests. Representatively, one case showed massive infiltration of B cells within the cancer area as well as high-level expression of cancer cell–intrinsic CXCL13 (Fig. 3D). Pathway enrichment analysis using RNA sequencing (RNA-seq) data (see Materials and Methods) revealed significant enrichment of B cell receptor signaling, cytokine/chemokine signal, and Fc g receptor signaling pathways in the Signature 16 GCs compared to others (table S4), which further supports the characteristic B cell infiltrations in Signature 16 GCs. In summary, it is suggested that Signature 16 GCs frequently express CXCL13 in a cancer cell–intrinsic manner and thus recruit the infiltration of B cells within cancer area.

Fig. 3 Cancer-recruited B cell infiltration in alcohol-related GCs.

(A) Principal components analysis of 184 Japanese GC cases based on their profiles of the proportions of tumor-infiltrating immune cells defined by CIBERSORT algorism (see Materials and Methods). PC3 and PC4 components are shown. Green and red circles indicate Signature 16 GC and other GCs, respectively. Green and red circular areas indicate 95% confidence intervals of Signature 16 GC and other GCs, respectively. Arrows represent the correlations between the principal components (PC) and the variables (compositions of immune cells). (B) The proportion of B cells infiltrating in the GC tissues determined by the CIBERSORT algorism (see Materials and Methods) is shown. Green and red dots indicate Signature 16 GCs and other GCs, respectively. P value was calculated by unpaired Wilcoxon rank sum test. (C) Gene expression levels of B cell–attracting chemokine CXCL13 determined by RNA sequencing of bulk GC tissues (see Materials and Methods) are shown. Y axes are shown as log scale. Green and red dots indicate Signature 16 GCs and other GCs, respectively. P value was calculated by unpaired Wilcoxon rank sum test. (D) Representative cases of Signature 16 GCs with immunohistochemical staining of CD20 and CXCL13 are shown. In a GC case on the left side, as in other Signature 16 GCs, cancer cell–intrinsic CXCL13 expression and B cell infiltration around the tumor nest are observed. The GC case shown in the middle panels represents the prominent expression of tumor cell–intrinsic CXCL13 expression and massive infiltrations of B cells, followed by a picture at the right side, indicating negative CXCL13 staining in the normal gastric mucosae of the same specimen, as a control. A GC case at the rightest side indicates negative staining of CXCL13. Black bars under pictures, 50 μm.

Impact of germline variations in Asian GC

Germline variations of E-cadherin (CDH1) are well known to be responsible for hereditary diffuse-type GC (HDGC) (4), and recent exome sequencing identified germline mutations of the BRCA1/2 pathway genes as causative factors for GCs especially with familial histories (24). However, most of these data were obtained in non-Asian cases. Meanwhile, in Asian countries, the global burdens of GC-causing germline variants and their effects on familial aggregations have remained unclear to date. We summarized possible pathogenic germline variants of 624 cancer-related genes (25) in our Japanese GC cohort after filtering out the common and/or presumably nonpathogenic variants in the Japanese population (see Materials and Methods). While several truncating variants were observed in the BRCA pathway genes, we unexpectedly found that the CDH1 gene has the highest variant frequency and density (Table 1, bold), suggesting that the germline variants of the CDH1 gene significantly contribute to the genetic backgrounds of GCs in the Japanese population. Among Genome-Wide Association Study (GWAS)–reported GC-susceptive genes (2630), we found that PLCE1 variants were mildly enriched among Japanese GC patients compared to the general Japanese population (table S5; 2.0-fold, P = 0.039, Fisher’s exact test).

Table 1 Summary of germline variants that are presumably pathogenic.

Genes with presumably pathogenic germline variants are shown in the order of their densities of nonsilent variants (left, only genes with 5% or higher variant frequency among our cohort are shown) and numbers of truncation variants (right). AA, number of amino acid residues.

View this table:

In our detailed analysis of the BRCA pathway genes, we found 20 (in 22 of 243 cases, 9.1%) germline variants among cases with BRCA-related mutational signature (Signature 3) [Figs. 2A and 4A (blue bars) and table S6]. This frequency was not statistically different from that in the TCGA non-Asian GC population (19 of 212 cases, 9.0%; P = 1.0000, Fisher’s exact test). According to reported studies (31, 32), it has also been concluded that frequencies of BRCA1/2 pathogenic variants among general Japanese and West European populations are almost comparable. Several cancers with the BRCA signature are reported to be good responders to platinum chemotherapeutics and poly(adenosine diphosphate–ribose) polymerase inhibitors (33). The Signature 3 cluster (blue bars in Figs. 1A and 2A) consists of a roughly equal fraction of GCs in both Japanese and TCGA non-Asians (25 of 243, 10.3% and 19 of 212, 9.4%, respectively; P = 0.8735, Fisher’s exact test), which should be appropriate targets of such therapeutics.

Fig. 4 The landscape of germline variations in East Asian and non-Asian GCs.

(A) Distributions of germline variations of BRCA pathway genes found in combined datasets [Japanese GCs, TCGA GCs (9), and Korean DGCs (6)]. Only variants found in cases linked to the BRCA signature are shown. Red, Japanese; green, TCGA non-Asian. None of the patients were TCGA East Asians/Koreans (orange) who met this criterion. #, predicted to be damaging in silico; NLS, nuclear localization signal; SCD, serine cluster domain; HD, helical domain; OB, oligonucleotide binding; TR2, second RAD51-binding domain; aa, amino acid; ATP, adenosine triphosphate. (B) Distributions of germline variations of CDH1 in the same datasets as in (A). *, reported in clinically defined HDGC (strong familial aggregation or extremely early-onset case). Bold circle, diffuse-type histology. SIG, signal; TM, transmembrane domain.

As for the CDH1 germline variants, after filtering out common variations (see Materials and Methods), we summarized all nonsilent germline variants in our 243-case Japanese GC cohort (Fig. 4B, Table 2, and table S7), discovering 18 variants in total. This observed frequency (18 of 243, 7.4%) was higher than the reported consensus in the Japanese population (3436). In contrast to the variants found in TCGA non-Asians (Fig. 4B, green colored circle), most (14 of 18 cases) of the variants in the Japanese population were linked to diffuse-type histology (Fig. 4B, black outlined red circle), which provides a strong biological support to conclude their pathogenic roles (P = 0.0060, Fisher’s exact test) (37). The variant frequency of CDH1 in the Japanese DGC patients (14 of 105, 13.3%) was approximately 4.1-fold higher than that in the TCGA non-Asian GC populations (2 of 62, 3.2%; P = 0.0535, Fisher’s exact test; Table 2). It is intriguing that distributions of the CDH1 germline variants in the Japanese population were distinct from those of the TCGA non-Asians and specifically fell into five SNVs (p.G62V, p.T340A, p.L630V, p.V832M, and p.E880K). Among these five SNVs, four were predicted to be damaging in silico, four were reported in cases that met the HDGC criteria [strong familial aggregation (34, 35, 38) or extremely early onset (6)], and at least two were molecularly confirmed to be causative of malignancies due to the deficits in cell-cell adhesion and/or in the inhibition of invasion (Table 2) (39, 40). When germline variants found in a large Korean sporadic DGC dataset (6) were combined, it was revealed that these five recurrent SNVs were shared by both Japanese and Korean populations at measurable frequencies (Fig. 4B). This trend shows a clear contrast to the germline BRCA pathway variants (Fig. 4A), where no shared variants were found between Japanese and Korean GCs. The overall variant frequency of the 11 SNV loci found in this study was significantly higher among Japanese DGC patients than the general Japanese populations (4.0-fold; P < 0.0001, Fisher’s exact test) and also higher among the general East Asian population than the general European population (6.0-fold; P = 0.00124, Fisher’s exact test) (Table 2), supporting the pathogenic characteristics of these variants and confirming their enrichments in East Asian populations.

Table 2 Summary of germline CDH1 variants in East Asia (Japan and Korea) and TCGA datasets.

Germline CDH1 variants in East Asia (Japan and Korea) and TCGA datasets are shown. ToMMo, Tohoku Medical Megabank Organization.

View this table:

Familial history of germline CDH1 variant carriers

Out of 18 Japanese GC patients with germline CDH1 variants, 11 had mild familial cancer histories (table S5), including one case with lobular carcinoma of the breast, a well-known trace of germline CDH1 variants. However, although the information available was limited, only one case (DGC at 39 years old) met the clinical criteria of the HDGC (4). On the other hand, seven individuals had no familial histories of cancers and were clinically considered sporadic cases. Therefore, it should be noted that substantial portions of GCs that are clinically assumed as sporadic incidences could be attributable to germline variations of the patients. In view of molecular carcinogenesis pathways, it is hypothesized that the clinical phenotypes of CDH1 germline variations are significantly modified by other environmental factors such as H. pylori infection and food contents. We investigated H. pylori copy numbers at noncancerous mucosa of our GC patient cohort, finding that GC cases with germline variants of CDH1 exhibited statistically higher H. pylori copy numbers compared to others (fig. S5). This finding is consistent with a previous report showing that ongoing infections of H. pylori are more frequent in DGCs (41), since almost all the CDH1 germline variant patients harbor DGCs (table S7). H. pylori is known to disrupt proper function of CDH1 (42), which might suggest that H. pylori infection works as an additional second hit for the CDH1 germline variant population in the GC development. This observation and our speculations, however, are not scientifically conclusive because the copy number of H. pylori is influenced by the existence of mucosal metaplasia and the longevity of chronic infection (43). Even if the familial histories or clinical phenotypes did not straightforwardly resemble the definitions of familial GCs clinically, at least a portion of GC cases can be attributable to CDH1 germline variants. The p.V832M (34) and p.G62V (35) variants, for instance, were reported in families with strong familial aggregations of GCs (7 of 12 and 7 of 13 family members examined, respectively); however, only four of eight cases with these variants in our Japanese GCs showed, at most, slight familial histories, and none of them fulfilled the clinical criteria of HDGC, although the family information was available only for the first-degree relatives.


Large-scale whole-exome sequencing for East Asian GCs in combination with the public dataset has clarified important lifestyle and germline predispositions of GC development among high-incident East Asian populations. Alcohol intake behavior with genetically inactive ALDH2 background predisposes individuals to GCs with Signature 16. Besides, synergetic effects of alcohol intake and smoking habit were observed among the ALDH2-inactive individuals. Alcohol intake and the incidence of various types of cancers have extensively been studied so far (1416, 21). Although conflicting reports have indicated the correlations between alcohol intake and GC risk based on indirect epidemiological data (44, 45), no reports have shown direct biological evidence of the effects of alcohol on GC carcinogenesis. Global genetic classifications combined with clear quantification of alcohol contributions defined by mutational signatures, as taken by this study, could succeed in unraveling the ethnicity-specific intercorrelations of etiologies in GC development in the high-incidence areas. Alcohol-related GC is a biologically distinct subgroup with low mutation burdens and characteristic immunological profiles of increased B cell infiltration due to cancer cell–derived CXCL13 chemokine. While there are several reports of alcohol-induced dysregulation of adaptive immunity including B cells (46, 47), its biological mechanism and its relation to alcohol exposure are unclear and remain to be investigated further. Our results show that Signature 16 GCs have distinct immunological features, which attracts further attention in relation to the responsiveness of these GCs to current immune therapies as well as their applicability for alternative immunotherapies.

In addition to the BRCA pathways, this study also clarified that germline variations of CDH1 exist at a higher-than-expected frequency among East Asian GCs. The shared distributions of the recurrent CDH1 variants between Japanese and Koreans strongly suggest the existence of common ancestral events and widespread distributions of these CDH1 pathogenic variants in the East Asian areas. Thus, these variants affect the prevalence of GCs in these areas, although their precise geographic distributions are yet to be determined. It is important that many of these cases are clinically recognized as sporadic cases and have been overlooked so far without special attention to the preventive cares for family members, such as periodic endoscopies and/or prophylactic surgeries (4). The incomplete penetrance and environmental dependency of CDH1 variants are likely to have led to the controversy about their pathogenicity so far. Several groups have tried to experimentally show the evidence of pathogenicity of the variants, but their results were often conflicting (6, 40, 48, 49), partially because of the differences in the assays used and the difficulties in proving the biological function of the variants with mild effects. Our large-scale analysis of genomics and clinical data in this study clearly showed multiple evidences of the pathogenic nature of these CDH1 variants.

Approximately one-fifth (22 of 105, 21.0%) of the DGCs in our study were attributable either to the alcohol consumption coupled with defective ALDH2 allele or to the germline loss-of-function variants of CDH1. Our findings clarified previously unrecognized impacts of the defined germline and lifestyle factors on the high incidences of GCs in East Asian areas and provided us strong motivations for clinical intervention in lifestyles and familial cares coupled with precision germline genotyping in these areas.


Informed consent and sample preparation

Fresh-frozen GC and paired normal stomach tissues were obtained from essentially consecutive patients who underwent gastrectomy at the University of Tokyo Hospital (n = 169) and Yokohama City University Hospital (n = 126). Informed consent was obtained from each individual, and this study was approved by the institutional review boards at the University of Tokyo, Yokohama City University, and Tokyo Medical and Dental University. QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) and RNeasy Mini Kit (Qiagen) were used to extract 295 paired DNA and RNA samples from these tissues, respectively, as per the manufacturer’s instructions.

Clinicopathological information

Patient clinicopathological information [gender, age at diagnosis, alcoholic drinking habits, smoking habits, Lauren classification, tumor site, stage (UICC TNM classification), and histopathological features] were collected with informed consents. Patients were classified as “consuming alcohol” if they reported socially associated or more severe alcohol usage or otherwise as “not consuming alcohol.” Patients who smoked at least one cigarette/day on average for at least 1 year during their lifetime were defined as “smokers” or otherwise defined as “nonsmokers.” Tumor TNM staging including an assessment of the tumor size and lymph node and distant metastases were reviewed by at least two pathologists and defined according to the American Joint Committee on Cancer (eighth edition). Detailed data for the 243 patients are shown in table S9. Mix-type GCs (i.e., coexistence of diffuse and intestinal components) were regarded as diffuse-type GCs for this study, as is suggested by guidelines (50).

H. pylori infection status

H. pylori infection status of the patients at the time of surgeries was determined by quantitative polymerase chain reaction (qPCR), using the extracted DNA from noncancerous gastric tissues of patients with GC. The ratio of qPCR quantities of H. pylori per human genomic DNA was considered the H. pylori copy number in the subjected specimens. Human genomic DNA was quantified by the TaqMan method targeting ribonuclease P gene locus (Thermo Fisher Scientific, MA, USA), and the amount of H. pylori was measured by SYBR Green method targeting ureC gene, which was reported to be the most sensitive and specific marker for H. pylori (51). The ureC gene primers were 5′-GCATGCAATTGAATAAAGCC-3′ (forward) and 5′-GCCGCTATAACGGATCAAAT-3′ (reverse) as reported (51). TaqMan PCR was performed as per the manufacturer’s protocol, and SYBR Green qPCR was performed 50 cycles for denature, 95°C for 15 s, and extension, 60°C for 1 min. In total, we obtained informative data of H. pylori copy numbers for 170 of 243 Japanese GC patients. Because of the substantial paucities of genomic DNA, it was not possible to perform qPCRs for the remaining 73 cases.

Exome capture, library construction, and whole-exome sequencing

Whole-exome sequencing was performed for 295 GCs and paired normal gastric tissues, as previously described (8). Each DNA sample (1.1 mg) was sheared using a Covaris SS Ultrasonicator (Covaris, MA, USA) according to the manufacturer’s instructions. A Sciclone next-generation sequencing workstation (Caliper Life Sciences, MA, USA) was used to automatically construct the DNA sequence libraries, according to the manufacturer’s instruction. Exome capture was performed using the Agilent SureSelect Human All Exon Kit v4 and v5+ LincRNA (Agilent Technologies, CA, USA). Each sample was sequenced using an Illumina HiSeq 2000 platform (Illumina, CA, USA) and the provided protocol for producing 2 × 100–base pair (bp) pair-end reads. Image analyses and base calling were performed using an Illumina pipeline with default settings. The median depth of tumor/normal coding sequence is 170/84 in Japan and 165/91 in TCGA.

Mutation calling

We obtained FASTQ files of the TCGA dataset [published in 2014 (9)] from the TCGA website ( Then, we performed mutation calling of our dataset and TCGA dataset using exactly the same pipeline from the beginning of sequence mapping procedure as follows. Paired-end reads were aligned to the human reference genome (GRCh37) using NovoAlign ( for both tumor and nontumor samples. Probable PCR duplications, for which paired-end reads aligned to the same genomic position, were removed, and pileup files were generated using SAMtools (52) and some in-house programs. To find somatic point mutations (SNVs and short indels), the following cutoff values were used for base selection: (i) a mapping quality score of at least 20 and (ii) a base quality score of at least 10. Somatic mutations were selected using the following filtering conditions: (iii) The numbers of reads supporting a mutation for tumor were at least 4 and 8, with at least one base with a quality score greater than 30, when tumor variant allele frequency (TVAF) ≥ 0.15 and 0.15 > TVAF ≥ 0.05, respectively; (iv) variant allele frequency (VAF) of matched nontumor sample was less than 0.03 with a read depth of at least 8. Since sequence errors are considered to occur in a sequence-specific manner, the sequence read information of all nontumor samples was combined together to accurately discriminate true positives from false positives. Then, the following filter was applied: (v) VAF of grouped nontumor sample (NVAF) was less than 0.03 and 0.01, when TVAF ≥ 0.15 and 0.15 > TVAF ≥ 0.05, respectively; (vi) the ratio (TVAF/NVAF) of VAF of tumor (TVAF) and grouped nontumor samples (NVAF) must be more than 20. After mutation calling, (vii) mutations with a strand bias (between forward and reverse reads) greater than 95% were removed.

TCGA dataset

TCGA GC dataset [published in 2014 (9)] was obtained from the TCGA website ( It includes 288 (212 non-Asians and 76 Asians) GC cases with information regarding their somatic mutations, list of reads per kilobase per million (RPKM) mapped reads (for the 262 GCs and 29 normal stomach tissues), and clinical data. No ethnicity information was provided for 41 GC cases (obtained from 38 German and 3 Canadian patients); therefore, we considered these to be non-Asian cases. BAM files were obtained from the TCGA website archives and used to analyze germline variations for the TCGA dataset.

Classifications of Japanese GC cases by the TCGA subtypes

We classified our Japanese GC cases into four subgroups: EBV-associated GC, MSI-GC, GS-GC, and chromosomal instability (CIN)–GC (9). First, we defined EBV-GC by detecting viral sequences in the RNA-seq data using PathSeq (53). If the sequence reads unique to the EBV were more than 1 × 10−5 of the reads unique to human genome per individual, these GCs were classified as EBV-GC. This cutoff was determined in accordance with the data of Epstein-Barr virus-encoded small RNA (EBER) in situ hybridization. Second, we defined MSI-GC by mutational signature analysis. Signatures 6 and 15 are well-known resultants of the MMR deficiencies; therefore, we identified MSI-GC as those in Signature 6 or 15 clusters. Last, we divided the rest of GCs into GS- or CIN-GCs. These two subtypes were distinguished by the presence or absence of extensive somatic copy number aberrations based on the hierarchical clustering. To analyze copy number aberrations, the read depth was compared between normal and tumor for each capture target region. The tumor/normal depth ratios were calculated, and the values were smoothed using a moving average. Using these values, clustering was done in R based on Euclidean distance using Ward’s method as was performed in the TCGA (9).

Mutational signature analysis

We obtained somatic SNVs for 531 GC cases (288 TCGA cases and 243 Japanese cases). We used the pan-cancer catalog of 30 Signatures referenced in the COSMIC database (11) (, and the contribution of each mutational Signature was calculated by deconstructSigs software (table S10) (54). Unsupervised hierarchical clustering was performed on the Signature contribution scores for each GC case, by using pheatmap v1.0.2 software in R to define the GC subgroups (Fig. 1A). The ward.D was used as the clustering method, and Euclidean distances (Japan and TCGA datasets; Fig. 1A) and correlation distances (Japanese GCs; Fig. 2A), respectively, were used to assemble the GC clusters. The incidence of Signature 4– or Signature 16–associated SNVs/Mb was calculated on the basis of the contribution scores of Signature 16 in each case, as defined above. We excluded hypermutator GCs from the analyses of the frequencies of SNVs between subgroups, since large numbers of SNVs in hypermutators could extremely affect the statistical analyses. We defined the hypermutator GCs as those with mutation burdens over 10/Mb (55).

RNA sequencing

For 184 Japanese GC cases, total RNA was extracted from frozen sections (10 sections × 10 μm thickness) of GC specimens by RNeasy Mini Kit (Qiagen). Quantification and qualification of extracted RNAs were determined by Agilent Bioanalyzer (Agilent Technologies, CA, USA). RNA-seq libraries were constructed by the TruSeq Stranded mRNA Sample Prep Kit (Illumina, CA, USA) according to the manufacturer’s protocol. Each RNA-seq library was sequenced on an Illumina HiSeq platform (Illumina) by 2 × 100-bp paired-end sequencing as per the manufacturer’s protocol. FASTQ files were processed using Kallisto version 0.43.1 with Ensemble transcriptome (release 79) to create gene expression profiles.

Estimation of LSTs

We estimated the numbers of LSTs by the following three steps: (i) copy number ratio calculation, (ii) detection of large copy number changes, and (iii) LST detection. First, initial copy number ratios were obtained by comparing read depth information for tumor and normal samples, and the copy number estimates were adjusted by tumor purity as described previously (8). Next, copy number ratios within 100-kbp window were averaged and further smoothed by moving average filters with a window of five points. Then, circular binary segmentation implemented in python ( was applied with 10,000 random permutations. To detect only highly confident copy number variation (CNV) regions, the cutoff value was set to P = 0. The detected segments were merged when the difference of mean log ratios between the regions ≤ 0.3. Last, LSTs were detected on the basis of the definition of LST as previously described (13). After filtering and smoothing of all variations less than 3 Mb, a chromosomal break between adjacent regions of at least 10 Mb was defined as LST. The number of LSTs in the tumor genome was estimated for each chromosome arm independently.

Detection of differentially expressed genes among Signature 16 GCs

To detect any differentially expressed gene sets among Signature 16 GCs, we compared the fragments per kilobase of exon per million reads mapped (FPKM) between Signature 16 GCs (n = 12) and the other GCs (n = 172). Initially, we calculated q values, but there were no genes with statistical significances, probably because of the few numbers of Signature 16 GCs; therefore, we used P values to define differentially expressed genes. We selected genes with P values of <0.05 and, at the same time, the average FPKM of the genes among Signature 16 GCs > the other GCs. Last, we identified 1303 genes. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed using The Database for Annotation, Visualization and Integrated Discovery (DAVID) platform.


CIBERSORT (23) was applied to 184 Japanese GC tissues to enumerate the proportions of 22 types of immune cells [B cells naïve, B cells memory, plasma cells, T cells CD8, T cells CD4 naïve, T cells CD4 memory resting, T cells CD4 memory activated, T cells follicular helper, T cells gamma delta, T cells regulatory, natural killer (NK) cells resting, NK cells activated, monocytes, macrophages M0, macrophages M1, macrophages M2, dendritic cells resting, dendritic cells activated, mast cells resting, mast cells activated, eosinophils, and neutrophils] using the LM22 dataset. CIBERSORT was run in “absolute mode,” and “disable quantile normalization” option was selected as recommended for RNA-seq data.


Immunohistochemistry staining was performed onto formalin-fixed and paraffin-embedded (FFPE) specimens of GC cases. FFPE tissues were deparaffinized by xylene (Wako Pure Chemical Industries, Japan), and antigens were retrieved by autoclave treatments in citrate buffer (pH 6.0) (Abcam, Cambridge, UK). Endogenous peroxidases were removed by 3% H2O2 (Sigma-Aldrich, MO, USA). Then, 2% bovine serum albumin (Sigma-Aldrich) in phosphate-buffered saline was used for a blocking solution. Antibodies α-CD20 #ab78237 (Abcam) and α-CXCL13 #20927-1-AP (Proteintech Group, IL, USA) were used as primary antibodies. Histostar (MBL, Japan) and Histostar diaminobenzidine (DAB) substrate solution (MBL) were used for detecting signals.

Signature 16 transcriptional strand bias

Transcriptional strand biases of the Signature 16 were determined as follows. All SNVs called by Karkinos were divided into transcribed or untranscribed SNVs. The contribution of each mutational Signature was calculated by deconstructSigs software (54) on transcribed or untranscribed SNVs.

Germline variant analysis

We analyzed germline variants by using GATK software ( and identified 85,006 germline nonsilent (missense and nonframeshift indels) and 7000 germline truncation variants (nonsense, nonstop, and frameshift indels) using the following parameters: sequence depth ≧30 and <1% of population-level MAF, according to the Tohoku Medical Megabank Organization (ToMMo) (56), 1000 Genomes (57), and Exome Aggregation Consortium (ExAC) (58) databases. We retained 66,438 nonsilent variants and 3952 truncation variants after the removal of variants with VAF <0.2 and/or GATK quality value score < 500. We further screened the 262 gastric tumors and 29 normal stomach tissues sourced from TCGA according to whether they achieved a C score (59) ≧10 and a RPKM value of >1 and resultantly generated a list of 34,622 nonsilent and 2424 truncation variants. Last, we focused on 624 cancer-associated genes selected using a previously described pan-cancer germline variant analysis method (25) and obtained 1777 nonsilent variants (in 390 genes) and 60 truncation variants (in 28 genes) (fig. S6). Table 1 shows genes with presumed pathogenic germline variants identified in the 243 Japanese GC cases. For the focused analysis of CDH1, we included the TCGA GC dataset whose germline variants were called using GATK by a filtering depth of ≧30, <1% of population-level MAF, VAF ≧0.2, and a quality score of ≧500 (as described above) (9) and the germline variants called in Korean DGC dataset (6). We identified 11 SNVs in total (Table 2).

The truncation germline variants identified in the 243 Japanese GC cases are also summarized in Table 1. According to a previous paper of comprehensive breast cancer study (60), BRCA pathway genes include BRCA1, BRCA2, PALB2, RAD51, BARD1, and BRIP1; thus, we selected GC cases that exhibited nonsilent and/or truncation variants in these BRCA genes and were associated with Signature 3 contributions (the BRCA signature) using a filtering depth of ≧30, <1% population-level MAF, and VAF ≧0.2, and a quality score of ≧500 (table S3) from the Japanese cohort and TCGA dataset (9).


Supplementary material for this article is available at

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.


Acknowledgments: We thank K. Shiina, K. Nakano, S. Kawanabe, S. Yuba, and A. Yamamoto for technical assistances. Funding: This study was supported by AMED Project for Cancer Research and Therapeutic Evolution (P-CREATE) (JP19cm0106502) (to H.Abu.), JSPS KAKENHI Grant-in-Aid for Scientific Research (S) 24221011 (to H.Abu.), JSPS KAKENHI Grant-in-Aid for Scientific Research(A) 16H02481 (to S.I.), AMED P-CREATE (JP19cm0106551) (to S.I.), and AMED Practical Research for Innovative Cancer Control (JP16ck0106265) (to T.S.). The supercomputing resource was provided by the Human Genome Center (University of Tokyo). Author contributions: S.I. and H.Abu. designed the study. A.S. processed samples and organized exome sequencing and data analyses. M.K., H.Abe, T.Us., T.M., E.S., T.O., S.N., Y.S., Y.R., and M.F. coordinated acquisitions of surgical samples. H.K., D.K., A.T., H.Abe, T.Us., M.F., and S.I. carried out pathological reviews and analyses. S.F., T.Um., H.U., K.T., Y.T., and S.Y. performed computational analyses. A.S., H.K., D.K., and S.I. wrote the manuscript. G.N., T.S., A.N., M.F., S.I., and H.Abu. involved in supervisions of experiments and analyses, critical reviews, and discussion. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in this paper are present in this paper, the Supplementary Materials, or available at the following repository: [JGA (Japanese Genotype-phenotype Archive) database;] with the accession number JGAS00000000228 and JGAS00000000229. Additional data related to this paper may be requested from the authors.

Stay Connected to Science Advances

Navigate This Article