Research ArticleCANCER

Remote modulation of lncRNA GCLET by risk variant at 16p13 underlying genetic susceptibility to gastric cancer

See allHide authors and affiliations

Science Advances  20 May 2020:
Vol. 6, no. 21, eaay5525
DOI: 10.1126/sciadv.aay5525


The biological effects of susceptibility loci are rarely reported in gastric tumorigenesis. We conducted a large-scale cross-ancestry genetic study in 18,852 individuals and identified the potential causal variant rs3850997 T>G at 16p13 significantly associated with a decreased risk of gastric cancer [odds ratio (OR) = 0.87, 95% confidence interval (CI) = 0.83 to 0.91, P = 2.13 × 10−9]. This risk effect was mediated through the mapped long noncoding RNA GCLET (Gastric Cancer Low-Expressed Transcript; ORindirect = 0.987, 95% CI = 0.975 to 0.999, P = 0.018). Mechanistically, rs3850997 exerted an allele-specific long-range regulatory effect on GCLET by affecting the binding affinity of CTCF. Furthermore, GCLET increased FOXP2 expression by competing with miR-27a-3p, and this regulation remarkably affected in vitro, in vivo, and clinical gastric cancer phenotypes. The findings highlight the genetic functions and implications for the etiology and pathology of cancers.


Gastric cancer is the fifth most common malignancy and is the third leading cause of cancer death worldwide; half of new cases occur in Asia, including in China (1, 2). The high incidence and mortality rates in China possibly result from the differences in Helicobacter pylori infection and behavioral risk factors, as well as genetic diversity in particular (3, 4). In recent decades, genome-wide association studies (GWASs) have identified several gastric cancer risk–related single-nucleotide polymorphisms (SNPs) (511), and most of them were confirmed to be potentially useful as high-quality biomarkers for the programs of gastric cancer susceptibility screening and prognostic evaluation (1214).

In addition to GWAS, some risk loci for gastric cancer were also identified in hypothesis-driven studies based on biological relevance. However, because of the small sample size, most risk loci only explain a small proportion of the phenotypic variance and thus lead to publication bias (13). The currently reported risk loci only explain 1.07% of the phenotypic variance in gastric cancer (13), a value that is much lower than the 22% of phenotypic variance explained by heritability from a twin study (15). Moreover, the molecular mechanisms of risk variants and their assigned genes in gastric cancer have rarely been comprehensively investigated. Several researchers have proposed the idea of “from association to function” in the post-GWAS era to delineate the causal variants and biological mechanisms underlying disease risk, including cancer etiology and pathology (16, 17).

Thus, there is an urgent need to perform an in-depth analysis of gastric cancer GWAS to explore the remaining “missing heritability” and to reveal its underlying causal mechanism. In this study, we used an integrated analytical strategy to explore potential causal genetic variants and susceptibility genes for gastric cancer in a large-scale cross-ancestry population. A series of in vitro and in vivo experiments were applied to investigate the genetic and epigenetic biological impacts on gastric oncogenesis.


Study overview

The flowchart of this study is shown in Fig. 1A. First, a two-step gene-set analysis with both distribution-based sets and gene-based sets was performed for gastric cancer GWAS. Second, the fine-mapping and functional analyses were conducted to identify potential causal risk variants in candidate genes. Third, a large-scale cross-ancestry population was used to evaluate the effect of candidate genetic variants on gastric cancer risk. Last, molecular experiments were performed to interpret the biological significance of risk variants and genes.

Fig. 1 The procedures of gene-set analysis of gastric cancer risk.

(A) Flowchart of the study design. (B) Global map of the distribution of genetic variants by gene types, including protein-coding genes (red), noncoding RNAs (ncRNAs) (blue), the combination (orange), and intergenic regions (green). (C) Application of the gastric cancer GWAS. (D) Distribution-based gene-set analysis, including intervals of protein-coding genes, ncRNA genes, the combination, and intergenic regions. The top panel shows the SNP numbers, and the bottom panel shows the P value. (E) Gene-based gene-set analysis. The top panel shows the SNP numbers, and the bottom panel shows the P value. (F) The linkage disequilibrium plots based on Asian populations (Han Chinese in Beijing and Japanese in Tokyo from 1000 Genomes Project; left) and functional annotation with scores from RegulomeDB and HaploReg (right) for the third novel gene set. DNase, deoxyribonuclease.

The distribution of whole-genome and cancer risk genetic variants

The genes present in the genome were primarily divided into two major categories: protein-coding genes and noncoding RNA (ncRNA) genes (fig. S1A and table S1). The genes were subsequently used to annotate the distributions of the genome-wide genetic variants, pan-cancer risk variants, and gastric cancer–specific risk variants (Fig. 1B). The variants were mainly located at four subgroup intervals with different distribution proportions, including protein-coding genes, ncRNA genes, combined intervals of protein-coding and ncRNA genes (abbreviated as “the combination”), and intergenic regions (fig. S1 and table S2). Greater than 60% of gastric cancer risk variants were enriched in the combination compared with that in 1000 Genomes Project (24.6%; P = 2.07 × 10−5) and pan-cancer risk variants (31.2%; P = 5.76 × 10−4), suggesting the potential biological effect of the combined loci of protein-coding and ncRNA loci on gastric cancer susceptibility.

Identification of novel risk loci

After imputation and quality control for the gastric cancer GWAS dataset, 499 variants with P < 1.0 × 10−4 were included in the gene-set analysis (Fig. 1C and table S3). According to the aforementioned distribution patterns, we primarily performed a distribution-based analysis, which showed the strongest association between 110 SNPs in the subgroup of the combination and gastric cancer risk (P = 1.14 × 10−24; Fig. 1D and table S4). We subsequently conducted a second gene-set analysis with delicate gene annotation for 110 SNPs to clearly define the key genes with substantial effects. The gene-based sets harboring PLCE1 and MUC1 displayed the top two strongest associations with gastric cancer risk, consistent with the data from the original study (6) (P = 3.53 × 10−10 for the PLCE1 set and P = 9.03 × 10−7 for the MUC1 set; Fig. 1E and table S4). In the present study, the gene set with the third strongest effect served as a candidate for further investigation (i.e., 15 SNPs in the RP11-166B2.1|TNFRSF17 set with P = 2.11 × 10−6; Fig. 1E and table S4).

Functional assessment and determination of causal variants

We used a fine-mapping credible set analysis to identify the potential causal variants and to further dissect their effects on driving the changes in the expression of the novel regions (18). Among the 15 candidate SNPs located in the RP11-166B2.1|TNFRSF17 set, three tagSNPs were identified with high posterior probability (0.940 for rs3850997, 1.000 for rs11570151, and 0.971 for rs446289; table S5). In addition, we performed a functional annotation analysis to nominate putative causal variants and found that three tagSNPs, which are located in evolutionarily conserved and functional elements (fig. S2A), not only presented high functional scores (6 for rs3850997, 4 for rs11570151, and 6 for rs446289; Fig. 1F and table S5) but also exhibited high integrated haplotype scores (iHSs) (fig. S2B). Moreover, all three tagSNPs exerted strong pleiotropic expression quantitative trait loci (eQTL) effects on the expression of multiple genes in each tissue, of which only rs3850997 and rs446289 exerted allele-specific effects on RP11-166B2.1 expression in stomach tissues (fig. S3).

Because the causal variants might exert consistent genetic effects across populations with less heterogeneity, we subsequently evaluated the genetic impacts of these 15 candidate SNPs in a European population. Similarly, rs3850997, but not the other SNPs, presented a significant association with gastric cancer risk [odds ratio (OR) = 0.74, 95% confidence interval (CI) = 0.59 to 0.93, P = 9.26 × 10−3 for the G allele; Table 1 and table S6]. Moreover, the conditional analysis pinpointed out an independent effect of rs3850997 on gastric cancer risk after adjusting the GWAS-SNP rs2274223 genotypes (table S7) (6). Thus, rs3850997 likely drove the effect of the RP11-166B2.1|TNFRSF17 region on gastric cancer risk.

Table 1 Association between rs3850997 and gastric cancer risk in GWAS and validation stages.

MAF, minor allele frequency; HWE, Hardy-Weinberg equilibrium. The covariates of age, sex, and study design were adjusted in GWAS stage; the covariates of age and sex were adjusted in the Nanjing-1, Nanjing-2, Yixing, Nantong, and Jilin stages; and sex was adjusted in both the Nanjing-3 and European stages.

View this table:

Evaluation of the genetic effect of rs3850997 on gastric cancer risk

SNP rs3850997 was subsequently genotyped in six independent case-control populations. As shown in Table 1, rs3850997 displayed the same directional effect of the G allele on gastric cancer risk, with ORs ranging from 0.87 to 0.96, although the P value for the association intensity ranged from 0.017 to 0.618. When combined, the effect of rs3850997 reached genome-wide significance for an association with gastric cancer risk in a Chinese population (OR = 0.87, 95% CI = 0.84 to 0.92, P = 2.43 × 10−8), particularly in the cross-ancestry analysis of the Asian and European populations (OR = 0.87, 95% CI = 0.83 to 0.91, P = 2.13 × 10−9). The further stratified analysis confirmed the association of rs3850997 with gastric cancer risk in both demographic (age and sex) and clinical (tumor site) subgroups without heterogeneity (fig. S4).

In silico analysis of rs3850997 and its corresponding host gene

The transcript biotype of ENST00000532936.1 harboring rs3850997 was defined as a processed transcript of the RP11-166B2.1 gene, which was generally categorized as an ncRNA (also named as lnc-RP11-166B2.1.1-2:1 in LNCipedia). The coding probability evaluation revealed an extremely low coding probability for ENST00000532936.1 [probabilities of 0.0158 from Coding-Potential Assessment Tool (CPAT) and 0.0287 from iSeeRNA and a score of 9.7817 from PhyloCSF]. The quantitative analysis showed that expression of ENST00000532936.1 was significantly lower in gastric cancer tissues than in normal tissues, based on evidences from both public and in-house datasets with five independent populations (Fig. 2A). Similarly, this differential expression pattern was observed in gastric cell lines (Fig. 2B). Moreover, this transcript includes 574 poly(A)-negative base pairs (bps) without a Kozak sequence and is enriched in the cytoplasm, as detected by both quantitative and fluorescence analyses (Fig. 2C and fig. S5). On the basis of these results, the ENST00000532936.1 transcript is a novel long ncRNA (lncRNA) that is strongly correlated with gastric cancer and, hence, is named Gastric Cancer Low-Expressed Transcript (GCLET).

Fig. 2 The expression pattern of the lncRNA GCLET and genetic effect of rs3850997.

(A) Both public and in-house datasets showed significantly down-regulated GCLET expression in gastric cancer tissues [public databases: GSE13911, GSE37023, and GSE29272; in-house datasets: 51 pairs of tumor and normal tissues analyzed using reverse transcription polymerase chain reaction (PCR) and 20 pairs from one unpublished dataset detected using an SBC microarray]. Red dots, tumor (T) tissues; blue dots, normal (N) tissues. (B) The constitutive expression of GCLET in gastric cancer cells (BGC-823, SGC-7901, and MGC-803) is significantly lower than that in normal gastric cell (GES-1). (C) The fractionation of gastric cell lysates reveals the cytoplasmic expression of GCLET. The U6 RNA served as a positive control for nuclear gene expression. GAPDH, glyceraldehyde-3-phosphate dehydrogenase. (D) An eQTL analysis for the association of rs3850997 and GCLET expression was conducted using 51 tumor tissues from Nanjing-1 and 280 tumor tissues from Nantong. P values were calculated using an unpaired t test, paired t test, Wilcoxon signed-rank test, and linear regression analysis. (E) The mediation model for the genetic effect of rs3850997 (exposure variable) on the risk of gastric cancer (dependent variable) through GCLET expression (mediator variable). The arrows indicate the direction of the mediation model.

In addition, rs3850997 is located in the third intron of the GCLET gene. By analyzing the chromatin immunoprecipitation sequencing data deposited in Encyclopedia of DNA Elements (ENCODE) and Roadmap Epigenomics projects, the intronic region surrounding rs3850997 was subjected to deoxyribonuclease hypersensitivity and was enriched for multiple histone modifications and key transcription factor (TF) binding sites. Among the candidate TFs, CCCTC-binding factor (CTCF) presented the strongest binding affinity with the top 1000 scores (fig. S6A). Furthermore, specific to CTCF, we found that multiple CTCF binding sites were predicted in the region harboring rs3850997 using Find Individual Motif Occurrences (FIMO) (19) and CTCF-binding site database (CTCFBSDB) (20) and rs3850997 did not generate a new CTCF binding motif but affected the log-likelihood ratio for the match of CTCF to the T allele (PT allele = 0.042) and G allele (PG allele = 0.140; fig. S6B, bottom left). Besides, rs3850997 altered the DNA folding energy of the CTCF binding site (fig. S6C), indicating the easier access of CTCF to the region with T allele. These findings supported a prerequisite that rs3850997 might mediate the impact of CTCF on GCLET transcriptional activity.

Evaluation of the GCLET-mediated genetic effect of rs3850997 on gastric cancer risk

We then performed an eQTL analysis to evaluate the genetic effect of rs3850997 on GCLET expression. The protective rs3850997 G allele significantly increased GCLET expression in two independent sets (βNanjing-1 = 0.004, Ptrend = 0.043; βNantong = 0.011, Ptrend = 0.004; Fig. 2D). Because of the notable eQTL and risk effect of rs3850997, as well as the differential expression of GCLET, we hypothesized that GCLET served as a mediator of the effect of rs3850997 on gastric cancer risk. Using a causal mediation analysis, we observed a significant indirect effect of rs3850997 on gastric cancer susceptibility (ORindirect = 0.987, 95% CI = 0.975 to 0.999, P = 0.018, 9.50% effects mediated; Fig. 2E).

Allele-specific effect of rs3850997 on mediating GCLET expression

We further exploited the allele-specific expression (ASE) analysis to dissect this cis-regulatory variability in gastric cancer susceptibility. As shown in Fig. 3A, compared with the T risk allele, the G allele was selectively transcribed with a 2.3-fold increase in transcriptional activity (P = 0.003).

Fig. 3 Allele-specific effect of rs3850997.

(A) GCLET expression displayed allele specificity, with a preference for the G allele of rs3850997. The ratio of signal density at rs3850997 was calculated using Sanger sequencing in both the genomic DNAs (gDNAs) and complementary DNAs (cDNAs) derived from 32 gastric cancer tissues carrying the TG genotype. P values were calculated using the Wilcoxon signed-rank test. siRNA, small interfering RNA. (B) In dual-luciferase reporter assays, a higher transcriptional activity of the GCLET promoter was observed with the rs3850997 G allele than with the T allele. (C) In electrophoretic mobility shift assays (EMSAs), the G allele of rs3850997 showed a lower binding affinity for the negative regulatory TF CTCF than the T allele. (D) ChIP and allele-specific quantitative polymerase chain reaction (qPCR) indicated that CTCF preferentially bound to the T allele at rs3850997. Notably, the PCR products shown in lanes 1 to 4 were amplified using primers specific for the human H19 imprinting control region, which served as positive controls (lane 1, 100- to 600-bp ladder; lane 2, CTCF binding product; lane 3, immunoglobulin G (IgG) binding product as a negative control; lane 4, input product), and the products shown in lanes 5 to 9 were amplified with primers specific for rs3850997 alleles (lane 5, 100- to 600-bp ladder; lanes 6 and 8, CTCF binding product for T and G alleles, respectively; lanes 7 and 9, IgG binding product for T and G alleles, respectively). (E) The left diagram showed the relative positions of the long-range regulation of rs3850997 on GCLET transcription. Target and Anchor were two primers for hybrid fragments; the blue blocks refer to exons in the GCLET gene; the colorful pies represent multiple TFs, and the red line indicates the Pst1 restriction sites. TSS, transcription start site. The right plots showed the physical interactions of the rs3850997 intronic region with the GCLET promoter (top) as determined using PCR. The products from Pst1-digested cross-linked chromatin without ligation and noncross-linked gDNA with or without ligation were used as negative controls. We used the PCR products from gDNA that were not cleaved by any restriction enzyme as the loading controls (bottom).

Next, we constructed in vitro and in vivo allele-specific assays to investigate the mechanism by which rs3850997 could regulate GCLET transcription. As shown in Fig. 3B, compared with the GCLET promoter, the region containing the rs3850997 T allele notably repressed the transcriptional activity of the luciferase gene, while the G allele rescued this inhibitory effect on both constitutive (in red) and CTCF-overexpressing cells (in blue). The difference in luciferase activity between the T and G alleles of the region harboring rs3850997 was diminished by a CTCF small interfering RNA (in green), and both alleles could activate luciferase activity compared with the basal promoter.

We conducted an electrophoretic mobility shift assay (EMSA) with or without CTCF-containing nuclear extracts to independently verify the results described above. The supershift assays showed the preference of CTCF for the rs3850997 T allele probe (Fig. 3C), but this effect was not observed in the total nucleoprotein assays (fig. S7A). Furthermore, we used chromatin immunoprecipitation (ChIP) assays with a CTCF-specific antibody to explore whether the binding of CTCF to the rs3850997-containing region occurred in MGC-803 cells (heterozygous for rs3850997; fig. S7B). Using ChIP followed by quantitative polymerase chain reaction (qPCR) with or without allele-specific primers, we confirmed that CTCF chromatin binding occurred at the rs3850997 risk-associated locus, and we preferentially precipitated the risk-associated T allele with CTCF (Fig. 3D). Consistent with these findings, DNA sequencing of the ChIP products showed that unlike the input genomic DNA (gDNA), fragments precipitated with the anti-CTCF antibody were enriched for the T allele (fig. S7C). In addition, the region harboring rs3850997 was recognized as the modification region of active histones H3K4me1 and H3K27ac but had no allelic effect on histone modification (fig. S7D).

Furthermore, in silico analysis also predicted multiple CTCF binding sites in the promoter region of GCLET gene (fig. S6B, bottom right). Most of CTCF binding sites located in the region of rs3850997 and GCLET promoter were in the same motif direction aligning with the transcription, indicating a potential formation of a CTCF tandem loop. Subsequently, the chromosome conformation capture assay (3C) assay supported that the rs3850997 intronic region could physically juxtapose with the GCLET promoter and form an intron-promoter complex in the nucleus through a higher-order chromatin loop structure (Fig. 3E and fig. S7E), which was in accordance with the finding from the in situ Hi-C data (fig. S6B, top) (21).

Association of GCLET expression and clinical characteristics

Apart from the risk SNP function, we investigated whether dysregulated GCLET expression correlated with gastric cancer progression. The survival analysis revealed a better survival for patients with higher GCLET expression than for those with lower levels [Plog-rank = 0.005, hazard ratio (HR) = 0.70, 95% CI = 0.51 to 0.96; Fig. 4A and fig. S8A], but rs3850997 did not exert a significant effect on gastric cancer survival (Plog-rank = 0.670, HR = 0.85, 95% CI = 0.66 to 1.10; fig. S8A). Besides, the quantitative analysis revealed a substantial down-regulation of GCLET expression upon tumor progression to the advanced stage (especially the invasion depth and lymph node metastasis; Fig. 4, B to D) but not tumor size, distant metastasis, and histological type (fig. S8, B to D).

Fig. 4 Association of GCLET expression with clinical features of gastric cancer and effect of GCLET on cellular phenotypes and molecular regulation.

(A) Significant association of GCLET expression with the survival of gastric cancer, as determined using the log-rank test. (B to D) Significant associations of GCLET expression with clinical features, including tumor invasion (B), lymph node metastasis (C), and tumor-node-metastasis classification (D), as determined using a linear regression analysis. (E and F) Prominent effect of GCLET on gastric cancer cell proliferation, as assessed using the CCK8 assay (E) and colony-forming assay (F). OD, optical density. (G) Remarkable effects of GCLET on gastric cancer cell invasion and migration. (H and I) GCLET attenuated the effect of miR-27a-3p on the expression of its target gene FOXP2. Luciferase expression was reduced if the reporter plasmid contained the wild-type FOXP2 3′ untranslated region (3’UTR), and this effect was more notable when the reporter plasmids were cotransfected together with the miR-27a-3p mimics (H). GCLET independently increased the level of the FOXP2 protein and further rescued the inhibitory effect of miR-27a-3p on FOXP2 expression (I).

Effect of GCLET on cell and mouse phenotypes

To address the clinical relevance of GCLET, we detected the biological impact of GCLET on cell and mouse models. As illustrated in Fig. 4, E to G, GCLET overexpression significantly suppressed proliferation, clone formation, invasion, and migration of transiently transfected gastric cancer cells. Besides, the similar phenotypes were identified in stably transfected cells (fig. S9, A to C). Moreover, we injected MGC-803 cells carrying the GCLET or negative control (NC) lentiviral vector into nude mice. Consistent with the in vitro findings, GCLET overexpression significantly decreased the mean tumor weight and average tumor volume (fig. S9, D to G). The tumor tissues from mice inoculated with cells expressing GCLET showed increased expression of GCLET (fig. S9H) and significantly reduced expression of the proliferation marker Ki67 (fig. S9I).

Mechanistic significance of GCLET in gastric cancer etiology

According to the enrichment of GCLET in the cytoplasm, we investigated whether GCLET affected microRNA (miRNA)–mediated regulation of their shared targets in tumorigenesis. By combining two prediction algorithms from TargetScan and miRcode, we identified six miRNAs that matched the GCLET RNA sequences (fig. S10A) and identified only miR-27a-3p as a target of GCLET using luciferase reporter assays (fig. S10B).

We hypothesized that the lncRNA GCLET would compete for available miR-27a-3p and further regulate the expression of target genes shared by miR-27a-3p. Therefore, we conducted a systematic analysis to determine the targets of miR-27a-3p, which could also be regulated by GCLET (see Supplementary Materials and Methods). Among the 26 candidate targets (table S8), FOXP2 exhibited the strongest correlation coefficient and was assessed as a putative shared target of both GCLET and miR-27a-3p in further investigations. Subsequently, the luciferase reporter assays revealed a significantly lower activity in cells expressing plasmids containing the FOXP2 3′ untranslated region (3’UTR) than that in cells expressing plasmids containing mutations; this effect was markedly promoted by the biological actions of miR-27a-3p (Fig. 4H). Moreover, Western blot assays showed that miR-27a-3p markedly decreased the levels of the FOXP2 protein and GCLET rescued the suppressive effect of miR-27a-3p on the FOXP2 protein levels (Fig. 4I). These results indicated the potential of FOXP2 to serve as a miR-27a-3p target whose expression was regulated by GCLET.

Correlation with candidate genes expression

We further explored the expression patterns of CTCF, GCLET, miR-27a-3p, and FOXP2 and their correlations. The expressions of both CTCF and miR-27a-3p were greatly up-regulated in gastric cancer tissues, while both GCLET and FOXP2 expressions were markedly down-regulated (fig. S10, C and D). Moreover, a notable expression correlation for each gene was observed in the cascade of CTCF, GCLET, miR-27a-3p, and FOXP2 (fig. S10, C and D), except for the relationship of GCLET and miR-27a-3p (fig. S10E). In addition, after modifying the CTCF expression pattern in gastric cancer cells, we observed an inhibitory effect of CTCF on GCLET expression (fig. S11A).

Together, we identified a novel gastric cancer susceptibility locus in a previously unknown lncRNA GCLET at 16p13 (Fig. 5). SNP rs3850997 T>G in this region represents a potential causal variant that may drive the long-range regulation of GCLET expression through attenuating the binding affinity of CTCF, which, in turn, was associated with a decreased gastric cancer risk. Moreover, GCLET affected gastric cancer progression by competing with miR-27a-3p to regulate FOXP2 expression in oncogenesis.

Fig. 5 Graphical representation of our findings.

SNP rs3850997 was significantly associated with a decreased risk of gastric cancer, most likely by altering the CTCF binding affinity and mediating a long-range regulatory effect on the expression of the lncRNA GCLET. In addition, GCLET competed with miR-27a-3p to regulate FOXP2 expression, tumor growth, invasion, and migration.


GWAS has uncovered thousands of genetic variants related to human complex traits and diseases, and approximately 90% of these loci are located in noncoding regions. Although the functional studies underlying the observed statistical associations with disease risk have lagged, emerging evidence has demonstrated the allele-specific regulatory effects of risk variants (even in noncoding) on the development and progression of traits, which provides additional insights into the causal genetic variants and biological mechanisms (16, 22). In this study, our integrated analysis of gastric cancer GWAS data revealed that the potential causal risk SNP rs3850997 altered a long-range allele-specific effect on the transcriptional activity of the novel lncRNA GCLET and was involved in gastric oncogenesis.

Initially, we observed a significant enrichment of genetic variants in the subgroup of the combination, along with its significant association with gastric cancer risk using gene-set analysis. This finding supported the extended meaning that the region enriched with a large portion of risk variants could exert the potential biological role in disease, particularly the combination in gastric cancer etiology in this study. Subsequently, we adopted a delicate gene-set analysis to fill the gap in the missing heritability of conventional analyses (23). After two-step gene-set analysis and independent validations, we identified that rs3850997 T>G was significantly associated with a decreased risk of gastric cancer, without demographic and clinical specificity. This association was further confirmed through a cross-ancestry analysis. In the original gastric cancer GWAS by single-locus analysis, rs2274223 was one of the most significant signals associated with gastric cancer risk (P = 8.40 × 10−9) (6). Thus, we enrolled both rs2274223 and rs3850997 in the logistic regression model to confirm the independent genetic effect of rs3850997 on gastric cancer risk. All these findings support the hypothesis that rs3850997 represents a potential causal variant for gastric cancer risk.

Currently, the integration of omics data has advanced our knowledge of the role of genomics and epigenomics in cancer etiology and pathology as a bridge between an association and causality. Using both functional annotations and experimental studies, the intronic region of the lncRNA GCLET surrounding rs3850997 displayed significantly selective transcription, which helped rs3850997 exert an ASE effect on the expression of its assigned gene GCLET. In addition, this phenomenon partially mediated the genetic effect of rs3850997 on gastric cancer risk.

Progressive studies have revealed that noncoding risk variants function as cis-/trans-regulatory elements involved in key biological processes, including DNA methylation, TF binding affinity, and histone modification (17, 22). We primarily treated CTCF as a candidate regulatory element that disrupted the allele-specific effect of rs3850997 on GCLET expression. CTCF, a well-known TF involved in genomic and epigenomic regulation, facilitates changes in cell phenotypes that participate in the development and progression of cancers (24). The subsequent in vitro and in vivo studies revealed that the protection-associated rs3850997 G allele escaped its surrounding regulatory element by binding to the inhibitor CTCF and further exerted long-range regulatory effects on GCLET transcription by forming an intron-promoter loop structure. Collectively, rs3850997 plausibly drives the long-range regulation of GCLET expression, explaining its risk effect on gastric cancer etiology. Moreover, any state-of-the-art technologies, such as the CRISPR-Cas9–mediated genome editing approach, will be further used to demonstrate and define the real genes affected by the risk T allele.

LncRNAs are aberrantly expressed in a broad spectrum of cancers with disease- and stage-restricted expression and represent potentially important diagnostic biomarkers and therapeutic targets in cancer development and progression (25, 26). In the present study, higher expression of GCLET was markedly correlated with less progression and improved survival of patients with gastric cancer by diminishing malignant tumor phenotypes. Thus, GCLET represents a potential biomarker for the evaluation of gastric cancer outcome. Besides, regarding the mechanism, RNA transcripts, especially cytoplasm-enriched lncRNAs, are well established as competing endogenous RNAs that sponge miRNAs involved in human development and diseases (27, 28). Notably, miR-27a-3p is an oncogene that is involved in gastric cancer growth and drug resistance and potentially serves as a gastric cancer outcome biomarker (29, 30). We demonstrated that GCLET did not affect the abundance of miR-27a-3p but antagonized the negative modulatory effect of miR-27a-3p on its target FOXP2, contributing to the development and progression of gastric cancer. These findings need to be confirmed in further comprehensive functional studies. In addition, a future study will be conducted to comprehensively evaluate the predictive value of a set of biomarkers combining the expression of CTCF, GCLET, miR-27a-3p, and FOXP2 and the genotypes of rs3850997 for gastric cancer patient survival, progression, and therapeutic effects.

In summary, our findings suggest that 16p13 is a novel gastric cancer susceptibility locus where a potential causal variant rs3850997 T>G drives the long-range regulation of the expression of the first-reported lncRNA GCLET through altering the CTCF binding affinity. Besides, GCLET is involved in tumorigenesis by competing with miR-27a-3p to increase FOXP2 expression (Fig. 5). This study not only provides important insight into the role of “post-GWAS” in determining genetic susceptibility to cancer but also establishes the basis for the identification of genetic and epigenetic biomarkers for further screening and prognostic programs.


Global overview of genetic variants based on transcript distribution

Biotype annotation was supported by the GENCODE project (v19), which comprised a total of 229,917 transcripts from 85,305 genes, including 81,184 transcripts for protein-coding genes and 148,103 for ncRNA genes. The details for each biotype are listed in fig. S1A and table S1.

Study subjects

The whole-genome genotyping data of 1625 gastric cancer cases and 2100 controls from northwestern China were used for the GWAS stage (6). In the validation study, we enrolled 5353 cases and 6685 controls from six independent sets. All cases were diagnosed and histopathologically confirmed as gastric cancer at the hospitals, and the controls were genetically unrelated to the cases within the surrounding region. All enrollment criteria have been previously described in detail (6, 14). The demographic characteristics of all individuals are shown in table S9. The study was approved by the ethical committees of all institutions.

Imputation in the GWAS database

For the gastric cancer GWAS, the nongenotyped variants were imputed on the basis of the 1000 Genomes Project (phase 1 integrated release 3, March 2012) using IMPUTE (31). The imputed SNPs were then excluded on the basis of the criteria of a minor allele frequency of <0.01, a genotype call rate of <95%, and P < 0.001 for Hardy-Weinberg equilibrium (HWE). A total of 4,817,140 variants were included in the analysis of the association with gastric cancer using a probabilistic dosage model in SNPTEST v2.5 (32).

Functional prediction for SNPs and assigned genes

The functional scores for candidate SNPs were calculated using an aggregation of RegulomeDB and HaploReg v4; each provided the regulatory scores for candidate SNPs, including the items of protein binding, chromatin changing, and eQTL (table S5). The genomic sequences of multiple species were obtained from the UCSC Genome Browser to conduct the conservation assessment. Haplotter displayed the results of a scan for positive selection in the human genome and detected recent positive selection with iHSs (33). The Genotype-Tissue Expression project is a comprehensive public resource that was used to study tissue-specific eQTL. FIMO (19) and CTCFBSDB (20) were applied to predict the change in CTCF motif and binding sites affected by the risk variant. The mfold tool (34) calculated the energy change in DNA folding affected by the risk variant. The mature sequences of the GCLET transcript were derived from the Ensembl genome and LNCipedia v5.0 databases (35). Three stringent methods, CPAT (36), iSeeRNA (37), and PhyloCSF (38), were used to evaluate the coding probability of candidate lncRNAs.

Publicly available clinical datasets

We obtained three independent gastric cancer datasets from Gene Expression Omnibus to detect GCLET expression pattern, including GSE13911 (38 gastric tumor and 31 normal tissues) (39), GSE37023 (29 tumors and 36 normal tissues) (40), and GSE29272 (62 pairs of tumor and adjacent normal tissues) (41). Another dataset with 274 gastric cancer tissues and 33 normal tissues from The Cancer Genome Atlas allowed us to analyze the correlations between miR-27a-3p and its candidate targets by calculating log2(X + 1) transformed values.

Analysis of ASE in tissues

ASE was detected using Sanger sequencing with specific primers (table S10). We examined the sequences of the region surrounding rs3850997 in 32 gastric cancer tissues from heterozygous carriers (T/G) using TaqMan assays, with products of approximately 400 bps. ASE ratio (G versus T) was calculated by normalizing the ratio between the peak areas of the two alleles (T/G) in complementary DNA (cDNA) samples with the same parameters in gDNA (1)E(GT)=1ni=1n(KcDNA(GT)KgDNA(GT))i(1)

In Eq. 1, K represents the mean peak ratio of the peak area G allele/peak area T allele, and E represents the mean adjusted peak ratio of the KcDNA/KgDNA.

ChIP assay

ChIP experiments were strictly performed according to the manufacturer’s instructions (53009, Active Motif) using antibodies against CTCF (2418, Cell Signaling Technology), H3K4me1 (ab8895, Abcam), K3K27ac (ab4729, Abcam), and normal rabbit immunoglobulin G (2729, Cell Signaling Technology). Briefly, cells cultured in 15-cm plates were cross-linked with fixation solution (1% formaldehyde) at room temperature for 10 min. The reaction was quenched with Glycine Stop-Fix Solution. Cells were washed twice with cold phosphate-buffered saline (PBS), collected with cell scraping solution [1× PBS with 100 mM phenylmethylsulfonyl fluoride (PMSF)], and centrifuged into a pellet. The pellet was resuspended in cold lysis buffer supplemented with protease inhibitor cocktail (PIC) and PMSF to isolate nuclei. The centrifuged nuclear fraction was resuspended in digestion buffer; then, a working stock of enzymatic shearing cocktail was added and incubated for 15 min (the optimal enzymatic shearing condition) at 37°C. The shearing of the chromatin samples was terminated with 0.5 M EDTA, and the samples were collected by centrifugation. Chromatin lysates were then diluted with ChIP buffer 1 supplemented with Protein G magnetic beads and PIC and subjected to immunoprecipitation with candidate antibodies overnight. Subsequently, the complexes were washed with ChIP buffer 1 and ChIP buffer 2 on a magnetic stand, eluted with Elution Buffer AM2, and incubated for 15 min at room temperature. Reverse cross-linking buffer was added to the eluted chromatin, which was then incubated for 2.5 hours at 65°C, followed by a 1-hour incubation with Proteinase K at 37°C. After the addition of Proteinase K stop solution, DNA was purified from the collected liquid with a QIAquick PCR Purification kit (QIAGEN) and analyzed using qPCR with specific primers for CTCF binding sites and histones modification sites (table S10). The primers specific for the prostate-specific antigen enhancer (42) and HOXC-AS3 (43) served as positive controls for active histones modification.

ASE analysis in ChIP reaction

For allele-specific quantitative reverse transcription PCR, the DNA products of ChIP were quantitated using the allele-specific qPCR method, which is similar to normal qPCR with the allele-specific amplification of the rs3850997 region containing the T or G allele. The primers designed to specifically amplify the rs3850997 region with the T or G allele were listed in table S10. In addition, the difference in peak areas between the T and G alleles from the ChIP DNA products was detected using Sanger sequencing with the same primers used for ChIP experiments.

Chromosome conformation capture assay

Briefly, cross-linking was achieved by incubating 1 million cells with 2% formaldehyde for fixation and stopped by the addition of glycine. Cells were then lysed in lysis buffer supplemented with protease inhibitors for 60 min at 4°C. The nuclei were suspended in 1.2× restriction buffer and incubated for 1 hour at 37°C. Next, Triton X-100 was added and incubated for 1 hour at 37°C. The restriction enzyme Pst1 was chosen to cleave the DNA because the recognition sequence CTGCAG was identified in both the rs3850997 intronic region and the GCLET promoter. One hundred units of Pst1 was added, and the samples were subjected to a 22-hour digestion. The reaction was stopped by the addition of SDS and a 20-min incubation at 65°C. Approximately 2 μg of digested chromatin was diluted in 800 μl of ligation buffer. When the reaction was cooled to 16°C, 2000 U of T4 DNA ligase was added and incubated overnight. Then, the chromatin mixtures were incubated with Proteinase K (100 μg/ml) at 65°C overnight to reverse the cross-links. RNA was digested with ribonuclease A at 37°C for 30 min. The captured sample was purified with a QIAquick PCR purification kit (QIAGEN). The DNA extract was then amplified with PCR using the specific primers listed in table S10, and the primers were designed around the Pst1 recognition site. PCR products from Pst1-digested cross-linked chromatin without ligation and non–cross-linked gDNA with or without ligation were used as negative controls. All PCR products were sequenced.

Statistical analysis

The hypergeometric test deposited in dhyper function of R stats package was used to assess the enrichment significance for four classifications of gene types. The two-stage gene-set analysis was conducted using a sequence kernel association test, which was implemented in the sequence kernel association optimal test (23). A logistic regression analysis was performed to determine the association of rs3850997 and gastric cancer risk after adjustment for available covariates in each study. The inverse variance method for meta-analysis was performed to merge the effects of rs3850997 reported in all available studies.

Potential causal variants were identified using the credible set analysis with Bayesian fine-mapping algorithm FM-summary (18, 44). We split the analyses into four main scenarios in the following predefined order to evaluate the association between candidate variant and gastric cancer risk underlying the causal mediation pathway through gene expression: (i) merging all associations between genetic variant and gastric cancer risk; (ii) merging all associations between genetic variant and gene expression; (iii) merging all associations between gene expression and gastric cancer risk; and (iv) causal inference analysis. The indirect effect of genetic variant on gastric cancer risk through gene expression was estimated using Eq. 2, as well as the proportion of the effect mediated (EM%)ORindirect=eβ(b)*β(c)EM%=β(b)*β(c)β(a)(2)

The statistical P value for ORindirect was determined using the Sobel test.

A linear regression analysis was used to detect the eQTL effect and the relationship between rs3850997 and clinical phenotypes. We performed both Kaplan-Meier survival analysis and Cox multivariate regression model to calculate log-rank P value and HR for the association between GCLET expression or rs3850997 genotypes and 5-year survival of patients with gastric cancer. The adjustment was from the demographic characteristics (age and sex) and clinical features (tumor size, tumor-node-metastasis, and histological type). Paired or unpaired t tests, rank-sum test, and Pearson’s χ2 test were performed for continuous or categorical variables, as appropriate. PLINK 1.90 and R 3.4.2 were used for general statistical analyses.


Supplementary material for this article is available at

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.


Acknowledgments: Funding: This study was partly supported by National Key R&D Program of China (2018YFC1313100 and 2018YFC1313102), National Natural Science Foundation of China (81230068, 81473049, 81773538, 81773539, and 81903390), China Postdoctoral Science Foundation funded project (2015 M580449), Jiangsu Provincial Postdoctoral Science Foundation funded project (1501081C), Collaborative Innovation Center For Cancer Personalized Medicine, and the Priority Academic Program Development of Jiangsu Higher Education Institutions (Public Health and Preventive Medicine). Author contributions: Z.Z. and M.W. conceptualized, designed, and supervised the study. M.D. and J.X. performed the statistical analyses and data interpretation. M.D. and H.C. prepared the manuscript. R.Z., J.L., and G.M. contributed to the functional experiments. S.L., N.T., and G.Z. provided technical support. W.W., F.Q., W.G., Q.Z., G.T., J.C., Z.J., J.J., G.J., Z.H., and H.S. contributed to the data and biological sample collection in the original studies. Competing interests: The authors declare that they have no competing interests. Data and materials availability: Additional methods are described in Supplementary Materials and Methods. All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.

Stay Connected to Science Advances

Navigate This Article