The Self-training Gene Clustering Pipeline (SGCP
) is an innovative framework for constructing and analyzing gene co-expression networks. Its primary objective is to group genes with similar expression patterns into cohesive clusters, often referred to as modules. SGCP introduces several novel steps that enable the computation of highly enriched gene modules in an unsupervised manner. What sets SGCP apart from existing frameworks is its integration of a semi-supervised clustering approach, which leverages Gene Ontology (GO) information. This unique step significantly enhances the quality of the resulting modules, producing highly enriched and biologically relevant clusters.
SGCP
is available at BMC Bioinformatics.
For detailed instructions and steps, please refer to the SGCP
manual on Bioconductor page. To install the latest version of SGCP
, you can access the GitHub repository using the following command:
#install.packages("devtools")
#devtools::install_github("na396/SGCP")
GPL-3
UTF-8
SGCP
requires three main inputs; expData , geneID, and annotation_db.
m*n
where m
represents the number of genes and n
represents the number of samples. It can contain data from either DNA-microarray or RNA-seq experiments . Note that SGCP
assumes that pre-processing steps, such as normalization and batch effect corection, have already been performed, as these are not handled by the pipeline.annotation_db
package must be installed by user prior to using SGCP
.Below are some commonly used annotation_db
packages along with their corresponding gene identifiers for different organisms.
Gene expression datasets for your analysis can be obtained from the Gene Expression Omnibus, a public repository of high-throughput gene expression data.
In SGCP
, the following assumptions are made about the input genes:
Here, we give a brief example of the SGCP
input. For this documentation, we use the gene expression GSE181225
. For more information visit its Gene Expression Omnibus page).
Throughout this section, several Bioconductor packages will be required. Make sure to install and load them as needed to follow the example.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("org.Hs.eg.db", "GEOquery", "AnnotationDbi"))
First, set the directory
# Display the current working directory
print(getwd())
# If necessary, change the path below to the directory where the data files are stored.
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = "."
setwd(workingDir)
First, we need to download the gene expression file. The R package GEOquery
is used to obtain gene expression data from the Gene Expression Omnibus. For detailed information on how to use GEOquer
y, refer to the GEOquery guide.
To download the expression file for GSE181225
, visit its Gene Expression Omnibus page. On the page, locate the file GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz
in the Supplementary files
section, which contains the normalized gene expression data. Download this supplementary file
and save it to the directory specified by baseDir
.
library(GEOquery)
gse = getGEOSuppFiles("GSE181225", baseDir = getwd())
After downloading the file, you should find a new directory named GSE181225
, which contains the gene expression file. To proceed, read the gene expression file into R. The file has the following structure:
Symbol
column contains the gene symbols.df = read.delim("GSE181225/GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz")
head(df)
Next, create the expData, geneID, and annotation_db.
geneID = df[,1]
expData = df[, 2:ncol(df)]
rownames(expData) = geneID
library(org.Hs.eg.db)
To map gene symbols to Entrez identifiers using the annotation_db, you can use the select
function from the AnnotationDbi
package. Here’s how you can do it in R:
library(AnnotationDbi)
genes = AnnotationDbi::select(org.Hs.eg.db, keys = rownames(expData),
columns=c("ENTREZID"),
keytype="SYMBOL")
# initial dimension
print(dim(genes))
head(genes)
Remove genes with missing SYMBOL
or ENTREZID
.
genes = genes[!is.na(genes$SYMBOL), ]
genes = genes[!is.na(genes$ENTREZID), ]
#dimension after dropping missing values
print(dim(genes))
head(genes)
Remove genes with duplicated SYMBOL
or ENTREZID
.
genes = genes[!duplicated(genes$SYMBOL),]
genes = genes[!duplicated(genes$ENTREZID), ]
#dimension after dropping missing values
print(dim(genes))
print(head(genes))
Keep only rows in expData that have corresponding gene identifiers present in genes
.
expData = data.frame(expData, SYMBOL = rownames(expData))
expData = merge(expData, genes, by = "SYMBOL")
Produce expData.
rownames(expData) = expData$ENTREZID
expData = expData[, c(2:6)]
print(head(expData))
Remove genes with zero variance from expData.
# Dropping zero variance genes
vars = apply(expData, 1, var)
zeroInd = which(vars == 0)
if(length(zeroInd) != 0) {
print(paste0("number of zero variance genes ", length(zeroInd)))
expData = expData[-zeroInd, ]
genes = genes[-zeroInd, ]
}
print(paste0("number of genes after dropping ", dim(genes)[1]))
Remove genes with no gene ontology mapping.
## Remove genes with no GO mapping
xx = as.list(org.Hs.egGO[genes$ENTREZID])
haveGO = sapply(xx,
function(x) {if (length(x) == 1 && is.na(x)) FALSE else TRUE })
numNoGO = sum(!haveGO)
if(numNoGO != 0){
print(paste0("number of genes with no GO mapping ", length(zeroInd)))
expData = expData[haveGO, ]
genes = genes[haveGO, ]
}
print(paste0("number of genes after dropping ", dim(genes)[1]))
Produce the final expData, geneID, annotation_db. Now, the input is ready for SGCP
. Refer to SGCP Bioconductor page in order to see how to use this input in SGCP
.
expData = expData
print(head(expData))
geneID = genes$ENTREZID
print(head(geneID))
annotation_db = "org.Hs.eg.db"
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4