Identification of significant genes associated with prognosis of gastric cancer by bioinformatics analysis

Gastric cancer (GC) ranks second in mortality among all malignant diseases worldwide. However, the cause and molecular mechanism underlying gastric cancer are not clear. Here, we used integrated bioinformatics to identify possible key genes and reveal the pathogenesis and prognosis of gastric cancer. The gene expression profiles of GSE118916, GSE79973, and GSE29272 were available from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) between GC and normal gastric tissues were screened by R software and Venn diagram software. GO and KEGG pathway enrichment of DEGs was performed using the DAVID database. A protein-protein interaction (PPI) network was established by STRING and visualized using Cytoscape software. Then the influence of hub genes on expression and survival was assessed using TCGA database. A total of 83 DEGs were found in the three datasets, including 41 up-regulated genes and 42 down-regulated genes. These DEGs were mainly enriched in extracellular matrix organization and cell adhesion. The enriched pathways obtained in the KEGG pathway analysis were extracellular matrix (ECM)-receptor interaction and focal adhesion. A PPI network of DEGs was analyzed using the Molecular Complex Detection (MCODE) app of Cytoscape. Four genes were considered hub genes, including COL5A1, FBN1, SPARC, and LUM. Among them, LUM was found to have a significantly worse prognosis based on TCGA database. We screened DEGs associated with GC by integrated bioinformatics analysis and found one potential biomarker that may be involved in the progress of GC. This hub gene may serve as a guide for further molecular biological experiments.


Background
Gastric cancer (GC) is the sixth most commonly diagnosed cancer. Its mortality rate places it second among the malignant tumors worldwide [1]. The 5-year overall survival rate of patients in the early stage can reach 95% [2], but for patients in the advanced stage, it has remained at about 50% even after comprehensive treatment based on surgery [3,4]. The cause of the low survival rate is tumor recurrence and metastasis. Therefore, it is important to study the potential molecular mechanism underlying the malignant biological behavior of GC cells and find effective early diagnostic techniques and reliable molecular markers for monitoring recurrence and evaluating prognosis. Despite major advances in the understanding of the molecular mechanisms of GC and in emerging targeted therapeutic options, not all patients see effective results from existing targeted therapies [5,6].
In recent years, the use of microarray and RNAsequencing technology has provided an efficient tool in the search for promising biomarkers for cancer diagnosis, treatment, and prognosis [7,8]. A large amount of data has been collected on public database platforms such as Gene Expression Omnibus (GEO) and the Cancer Genome Atlas (TCGA). These databases can be used to study the molecular mechanism further. A lot of research has been done on the gene expression profile of GC. The exact molecular mechanism of the GC is far from fully uncovered [9]. There is considerable need to find more potential for effective therapeutic strategies.
In order to better understand the influence of DEGs on molecular pathogenesis of GC, in this study, we downloaded three gene expression profiles from the GEO database and screened DEGs. We performed further gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment of DEGs. Finally, key genes affecting the prognosis of GC patients were identified using the PPI network and survival analyses.

Microarray data and identification of DEGs
Three sets of microarrays, GSE118916, GSE79973, and GSE29272, were downloaded from the Gene Expression Omnibus (http:// www. ncbi. nlm. nih. gov/ geo/) database. We only chose paired GC tissues and their matched adjacent tissues. When multiple probes were found to correspond to one specific gene, the average level of expression was considered to be its final expression. The original microarray data of each series were processed using R software package (version 3.6.1; http:// www.R-proje ct. org/). The data were log 2 transformed. |Log 2 fold change (FC)| > 1 and adjusted P < 0.01 were considered the cutoff criteria for DEG screening. A Venn diagram was created using Venny (version 2.1; https:// bioin fogp. cnb. csic. es/ tools/ venny/ index. html). All common DEGs in these three datasets were selected for further study.

GO and KEGG pathway enrichment analysis
GO is a common method for annotating a large number of genes [10]. KEGG is an integrated database resource for biological interpretation of genome sequences and other high-throughput data [11]. GO and KEGG pathway enrichment analysis was performed using the database for annotation, visualization, and integrated discovery (DAVID) online tool (version DAVID 6.8; http:// david. ncifc rf. gov/), which provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind the large list of genes [12]. A P < 0.05 was considered statistically significant.

PPI network construction and hub gene identification
The Search Tool for the Retrieval of Interacting Genes (STRING; version 11.0; http:// string-db. org/ cgi/ input. pl) was used to explore the protein-protein interaction (PPI) information of DEGs. Validated interaction score > 0.4 was selected as the cutoff criterion. Cytoscape software (version 3.6.0; http:// www. cytos cape. org/) was used to visualize and analyze integration of PPI networks. The Molecular Complex Detection (MCODE) app with default parameters in Cytoscape was used to filter modules of the entire network. The cytoHubba app of the Cytoscape software was used to select important hub genes among these DEGs. We use the density of maximum neighborhood component (DMNC) and maximal clique centrality (MCC) methods provided in the cytoHubba app. Mutual genes from two methods were selected as hub genes.

Validation and survival analysis based on TCGA database
To validate the results of hub genes, expression on box plots of GC from the Cancer Genome Atlas (TCGA) database was used to show the expression patterns between tumor and normal samples. Survival and stage analysis of the hub genes were also made with the Gene Expression Profiling Interactive Analysis (GEPIA) online database (http:// gepia. cancer-pku. cn/ detail. php).

Microarray data information and identification of DEGs
Three gene expression profiles (GSE118916, GSE79973, and GSE29272) were acquired from GEO database. The detailed information of these three gene expression profiles is shown in Table 1. There were a total of 318 samples, including 159 tumor and 159 matched adjacent  Fig. 1 a-c. A total of 83 genes were screened out in all three datasets for further analysis (Fig. 1d). There were 41 upregulated genes and 42 downregulated genes in GC tissues compared to adjacent tissues ( Table 2).

GO and KEGG pathway enrichment analysis of DEGs
GO and KEGG pathway enrichment of all 83 DEGs was analyzed using the DAVID online tool. The GO enrichment analysis results were divided into three functional categories, biological processes (BP), cell component (CC), and molecular function (MF). In the BP category, the genes were significantly enriched in extracellular matrix organization, collagen catabolic process, and cell adhesion categories. In the CC category, the genes were significantly enriched in extracellular exosome and extracellular regions. In the MF category, the genes were significantly enriched in calcium ion binding and identical protein binding. The details are shown in Table 3. The signaling pathways of DEGs were mainly enriched in extracellular matrix (ECM)-receptor interaction, protein digestion and absorption, focal adhesion, and PI3K-Akt signaling pathway (Table 4).

PPI network construction and selection of hub genes
To further explore the interaction between these 83 DEGs, the STRING database was used to construct PPI networks, and the resulting PPI networks were constructed using Cytoscape (Fig. 2a). Then, using MCODE, two key modules were identified from the whole network (Fig. 2 b and c). There were 21 nodes and 177 edges in module 1. In module 2, there were seven nodes and 19 edges. In order to identify hub genes, two algorithms (DMNC and MCC) of the cytoHubba app in the Cytoscape software were used. The top 10 hub genes based on the two methods were screened, and there were four mutual hub genes from the two methods: COL5A1, FBN1, SPARC, and LUM.

Validation and survival analysis based on TCGA database
To validate the results given above, the gene expression profiles of these four hub genes from TCGA database were used. GEPIA was used to visualize and analyze integration of TCGA database. These hub genes were significantly differentially expressed (P < 0.01), which was consistent with the results from the GEO data sets (Fig. 3). These hub genes were differentially expressed across various stages of GC (Fig. 4). Only LUM was  significantly closely correlated with the overall survival of GC patients (log-rank P = 0.041; Fig. 5).

Discussion
In this study, we integrated three microarray expression profiles from GEO and identified 83 DEGs between GC and normal gastric tissues, including 41 upregulated and 42 downregulated genes. Functional enrichment and KEGG pathway analysis showed that the DEGs primarily enriched in ECM organization, ECM-receptor interaction, and cell adhesion pathways. Our results suggested that these DEGs may play important role in the progression of GC. ECM organization and ECM-receptor interaction have been proven to be an important part of tumorigenesis and development [16]. Genes encoding proteins that mediate ECM remodeling were upregulated in patients with prostate, lung, and gastric cancers [17]. Collagens are the most abundant ECM components, and they can regulate the physical and biochemical properties of the tumor microenvironment, which modulate cancer cell polarity, migration, and signaling [18,19]. Cell adhesion is a key mediator of cancer progression and facilitates cancer metastatic dissemination. Many cell adhesion molecules within the tumor microenvironment are changed, and these changes alter the ability of tumor cells to interact with other cells and proteins of the ECM [20].
We also identified four major hub genes through the establishment of the PPI network by the STRING database and modules analysis, namely, COL5A1, FBN1, SPARC, and LUM. Subsequent survival analysis of these genes revealed that one of these four upregulated genes was closely related to the poor prognosis of GC patients.
The collagen type 5 α-1 chain (COL5A1) encodes an alpha chain for one of the low-abundance fibrillar collagens. In the research on ovarian cancer, COL5A1 is a poor outcome gene signature. Collagen remodeling might be a common biological process that contributes to poor overall survival [21]. Some studies have suggested COL5A1 is highly expressed at the mRNA and protein levels in breast cancer, and the patients with breast cancer with high COL5A1 expression have a reduced prognosis [22]. In GC, the COL family is a promising prognostic marker [23]. Fibrillin 1 (FBN1) is overexpressed in testicular germ cell tumors relative to nonneoplastic testicular tissue in patients with germ cell tumors, and it could be involved in germ cell neoplasia in situ development [24]. Silencing FBN1 could inhibit the cell proliferative, migratory, and invasive abilities of GC cells, while the influence of upregulated FBN1 expression showed the opposite effect [25]. Secreted protein acidic and rich in cysteine (SPARC) is a matricellular protein modulating cell-matrix interactions and has been found upregulated in colorectal tumor stroma. High SPARC was associated with better disease outcome in stage 2 colorectal cancer, but not in stage 3 colorectal cancer. It may play different roles in different development stages of colorectal cancer [26]. However, SPARC is upregulated in gastric cancer tissues relative to normal gastric tissues. High SPARC expression is associated with worse outcomes than negative and low SPARC expression, and SPARC is a potential marker for poor gastric cancer prognosis [27].
Lumican (LUM) is a protein-coding gene that encodes a member of the small leucine-rich proteoglycan (SLRP) family, which includes decorin, biglycan, fibromodulin, keratocan, epiphycan, and osteoglycin [28]. In recent years, an increasing number of experimental data has come to show that LUM is expressed in many kinds of tumors, including colorectal, prostate, lung, and pancreatic cancer [29][30][31][32]. The role of LUM in cancer varies according to the type of tumor. LUM is highly expressed in bladder cancer tissues and cell lines, and increased LUM expression is associated with the histological grade and the T/N stage of bladder tumors. The in vitro and in vivo data further indicate that low expression of LUM can inhibit the growth and migration of bladder cancer cells by inactivating MAPK signaling [33]. In node-negative invasive breast cancer, low lumican expression has a worse survival [34]. We provide reliable molecular biomarkers for therapy and prognosis of GC based on integrated bioinformatics analysis, including GO, KEGG pathway enrichment, PPI network, module analysis, and TCGA database, particularly when two algorithms are used to identify hub genes. However, our study has a number of limitations that should be considered. First, although we used the TCGA database to valid the results of GEO, molecular experiments are urgently needed to verify. Although we integrated three microarray data, large sample size is needed to validate the results. Second, we compared the paired GC tissues and their matched adjacent tissues. Many details were not taken into account, including histological type, grade of GC, and the distance from adjacent tissue to cancerous tissue. All of these may affect the expression of DEGs. Finally, in order to reduce the number of falsepositive DEGs, we obtained co-expressed DEGs in three datasets. In this way, many important genes may have been lost.

Conclusions
We screened DEGs associated with GC by integrated bioinformatics analysis and found one potential biomarker that may be involved in the progress of GC. This hub gene may serve as a guide for further molecular biological experiments.