There are many ways to perform the analysis. The following sections will be splited by steps, and finishing with the complete analysis with visualization. We will use the airway (Himes et al. 2014) dataset in the following sections. This dataset provides a RNA-seq count data from four human ASM cell lines that were treated with dexamenthasone - a potent synthetic glucocorticoid. Briefly, this dataset has 4 samples untreated and other 4 samples with the treatment.
PCITThe first option is to perform the PCIT analysis. The output will be a list with 3 elements. The first one contains a dataframe with the pairwise correlation between genes (corr1) and the significant pairwise correlation (corr2 \(\neq\) 0). The second element of the list stores the adjacency matrix with all correlation. And the last element contains the adjacency matrix with only the significant values:
# Loading packages
library(CeTF)
library(airway)
library(kableExtra)
library(knitr)
# Loading airway data
data("airway")
# Creating a variable with annotation data
anno <- as.data.frame(colData(airway))
anno <- anno[order(anno$dex, decreasing = TRUE), ]
anno <- data.frame(cond = anno$dex,
row.names = rownames(anno))
# Creating a variable with count data
counts <- assay(airway)
# Sorting count data samples by conditions (untrt and trt)
counts <- counts[, rownames(anno)]
colnames(counts) <- paste0(colnames(counts), c(rep("_untrt", 4), rep("_trt", 4)))
# Differential Expression analysis to use only informative genes
DEGenes <- expDiff(exp = counts,
anno = anno,
conditions = c('untrt', 'trt'),
lfc = 4.5,
padj = 0.05,
diffMethod = "Reverter")
# Selecting only DE genes from counts data
counts <- counts[rownames(DEGenes$DE_unique), ]
# Converting count data to TPM
tpm <- apply(counts, 2, function(x) {
(1e+06 * x)/sum(x)
})
# Count normalization
PCIT_input <- normExp(tpm)
# PCIT input for untrt
PCIT_input_untrt <- PCIT_input[,grep("_untrt", colnames(PCIT_input))]
# PCIT input for trt
PCIT_input_trt <- PCIT_input[,grep("_trt", colnames(PCIT_input))]
# Performing PCIT analysis for untrt condition
PCIT_out_untrt <- PCIT(PCIT_input_untrt, tolType = "mean")
# Performing PCIT analysis for trt condition
PCIT_out_trt <- PCIT(PCIT_input_trt, tolType = "mean")
# Printing first 10 rows for untrt condition
kable(PCIT_out_untrt$tab[1:10, ]) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
gene1 gene2 corr1 corr2 ENSG00000011465 ENSG00000026025 0.01160 0.00000 ENSG00000011465 ENSG00000035403 0.61286 0.00000 ENSG00000011465 ENSG00000049323 -0.71685 0.00000 ENSG00000011465 ENSG00000067225 -0.99935 -0.99935 ENSG00000011465 ENSG00000071127 -0.56817 0.00000 ENSG00000011465 ENSG00000071242 -0.89063 0.00000 ENSG00000011465 ENSG00000075624 0.09820 0.00000 ENSG00000011465 ENSG00000077942 -0.80938 0.00000 ENSG00000011465 ENSG00000080824 -0.72639 0.00000 ENSG00000011465 ENSG00000087086 -0.91357 0.00000
# Printing first 10 rows for trt condition
kable(PCIT_out_trt$tab[1:10, ]) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
gene1 gene2 corr1 corr2 ENSG00000011465 ENSG00000026025 0.63800 0.00000 ENSG00000011465 ENSG00000035403 0.96857 0.96857 ENSG00000011465 ENSG00000049323 0.14794 0.00000 ENSG00000011465 ENSG00000067225 -0.99687 -0.99687 ENSG00000011465 ENSG00000071127 -0.92734 -0.92734 ENSG00000011465 ENSG00000071242 -0.75914 0.00000 ENSG00000011465 ENSG00000075624 -0.70489 0.00000 ENSG00000011465 ENSG00000077942 -0.29712 0.00000 ENSG00000011465 ENSG00000080824 -0.93148 0.00000 ENSG00000011465 ENSG00000087086 -0.94620 0.00000
# Adjacency matrix: PCIT_out_untrt$adj_raw or PCIT_out_trt$adj_raw
# Adjacency matrix only with the significant values: PCIT_out_untrt$adj_sig or PCIT_out_trt$adj_sig
Histogram of connectivity distribution
After performing the PCIT analysis, it is possible to verify the histogram distribution of the clustering coefficient of the adjacency matrix with the significant values:
# Example for trt condition
histPlot(PCIT_out_trt$adj_sig)
Density Plot of raw correlation and significant PCIT
Itâs possible to generate the density plot with the significance values of correlation. Weâll use the raw adjacency matrix and the adjacency matrix with significant values. It is necessary to define a cutoff of the correlation module (values between -1 and 1) that will be considered as significant:
# Example for trt condition
densityPlot(mat1 = PCIT_out_trt$adj_raw,
mat2 = PCIT_out_trt$adj_sig,
threshold = 0.5)
RIF
To perform the RIF analysis we will need the count data, an annotation table and a list with the Transcript Factors of specific organism (Homo sapiens in this case) and follow the following steps in order to get the output (dataframe with the average expression, RIF1 and RIF2 metrics for each TF):
# Loading packages
library(CeTF)
library(airway)
library(kableExtra)
library(knitr)
# Loading airway data
data("airway")
# Creating a variable with annotation data
anno <- as.data.frame(colData(airway))
anno <- anno[order(anno$dex, decreasing = TRUE), ]
anno <- data.frame(cond = anno$dex,
row.names = rownames(anno))
# Creating a variable with count data
counts <- assay(airway)
# Sorting count data samples by conditions (untrt and trt)
counts <- counts[, rownames(anno)]
colnames(counts) <- paste0(colnames(counts), c(rep("_untrt", 4), rep("_trt", 4)))
# Differential Expression analysis to use only informative genes
DEGenes <- expDiff(exp = counts,
anno = anno,
conditions = c('untrt', 'trt'),
lfc = 2,
padj = 0.05,
diffMethod = "Reverter")
# Selecting only DE genes from counts data
counts <- counts[rownames(DEGenes$DE_unique), ]
# Converting count data to TPM
tpm <- apply(counts, 2, function(x) {
(1e+06 * x)/sum(x)
})
# Count normalization
Clean_Dat <- normExp(tpm)
# Loading the Transcript Factors (TFs) character
data("TFs")
# Verifying which TFs are in the subsetted normalized data
TFs <- rownames(Clean_Dat)[rownames(Clean_Dat) %in% TFs]
# Selecting the Target genes
Target <- setdiff(rownames(Clean_Dat), TFs)
# Ordering rows of normalized count data
RIF_input <- Clean_Dat[c(Target, TFs), ]
# Performing RIF analysis
RIF_out <- RIF(input = RIF_input,
nta = length(Target),
ntf = length(TFs),
nSamples1 = 4,
nSamples2 = 4)
# Printing first 10 rows
kable(RIF_out[1:10, ]) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
TF avgexpr RIF1 RIF2 ENSG00000008441 10.381697 -0.1528773 -0.6769198 ENSG00000102804 9.265949 -0.0004763 0.1980176 ENSG00000103888 13.592705 0.0877683 -0.3434817 ENSG00000103994 10.386200 -1.2749114 0.8224675 ENSG00000112837 8.144279 -0.1825080 -1.7860290 ENSG00000116016 11.377135 -1.2236196 0.0188771 ENSG00000116132 9.389114 1.4025115 0.8876227 ENSG00000118689 9.190363 0.9718881 0.3826641 ENSG00000123562 9.208115 0.4050667 0.4663411 ENSG00000157514 7.216761 -1.5719588 -0.3117244 Whole analysis of Regulatory Impact Factors (RIF) and Partial Correlation and Information Theory analysis (PCIT)
Finally, it is possible to run the entire analysis all at once. The output will be a CeTF object with all results generated between the different steps. To access the CeTF object is recommended to use the accessors from CeTF class:
# Loading packages
library(airway)
library(CeTF)
# Loading airway data
data("airway")
# Creating a variable with annotation data
anno <- as.data.frame(colData(airway))
# Creating a variable with count data
counts <- assay(airway)
# Sorting count data samples by conditions (untrt and trt)
counts <- counts[, order(anno$dex, decreasing = TRUE)]
# Loading the Transcript Factors (TFs) character
data("TFs")
# Performing the complete analysis
out <- runAnalysis(mat = counts,
conditions=c("untrt", "trt"),
lfc = 5,
padj = 0.05,
TFs = TFs,
nSamples1 = 4,
nSamples2= 4,
tolType = "mean",
diffMethod = "Reverter",
data.type = "counts")
Getting some graphical outputs
Based on CeTF class object resulted from runAnalysis function is possible to get a plot for Differentially Expressed (DE) genes that shows the relationship between log(baseMean) and Difference of Expression or log2FoldChange, enabling the visualization of the distribution of DE genes and TF in both conditions. These two first plot provides a initial graphical visualization of results:
# Using the runAnalysis output (CeTF class object)
SmearPlot(object = out,
diffMethod = 'Reverter',
lfc = 1.5,
conditions = c('untrt', 'trt'),
type = "DE")
Then is possible to generate the same previous plot but now for a single TF. So, if you have a specific TF that you want to visualize, this is the recommended plot:
# Using the runAnalysis output (CeTF class object)
SmearPlot(object = out,
diffMethod = 'Reverter',
lfc = 1.5,
conditions = c('untrt', 'trt'),
TF = 'ENSG00000185917',
label = TRUE,
type = "TF")
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The next output is a network for both conditions resulted from :
# Using the runAnalysis output (CeTF class object)
netConditionsPlot(out)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## â¹ The deprecated feature was likely used in the CeTF package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Then we will associate the genes in both conditions with Gene Ontology (GO) and other databases. getGroupGO function will be performed where a set of genes are counted within all possible ontologies, without statistical power. This function is capable of perform groupGO for any organism as long as itâs annotation package exists (i.e. org.Hs.eg.db, org.Bt.eg.db, org.Rn.eg.db, etc). In this example we will perform only for first condition:
# Loading Homo sapiens annotation package
library(org.Hs.eg.db)
# Accessing the network for condition 1
genes <- unique(c(as.character(NetworkData(out, "network1")[, "gene1"]),
as.character(NetworkData(out, "network1")[, "gene2"])))
# Performing getGroupGO analysis
cond1 <- getGroupGO(genes = genes,
ont = "BP",
keyType = "ENSEMBL",
annoPkg = org.Hs.eg.db,
level = 3)
Alternatively, it is possible to perform the enrichment analysis with a statistical power. In this case, there are many databases options to perform this analysis (i.e. GO, KEGG, REACTOME, etc). The getEnrich function will perform the enrichment analysis and returns which pathways are enriched in a given set of genes. This analysis requires a reference list of genes with all genes that will be considered to enrich. In this package there is a function, refGenes that has these references genes for ENSEMBL and SYMBOL nomenclatures for 5 organisms: Human (Homo sapiens), Mouse (Mus musculus), Zebrafish (Danio rerio), Cow (Bos taurus) and Rat (Rattus norvegicus). If the user wants to use a different reference set, simply inputs a character vector with the genes.
# Accessing the network for condition 1
genes <- unique(c(as.character(NetworkData(out, "network1")[, "gene1"]),
as.character(NetworkData(out, "network1")[, "gene2"])))
# Performing getEnrich analysis
enrich <- getEnrich(genes = genes, organismDB = org.Hs.eg.db,
keyType = 'ENSEMBL', ont = 'BP', fdrMethod = "BH",
fdrThr = 0.05, minGSSize = 5, maxGSSize = 500)
In the next steps we will use the getGroupGO results, but it is worth mentioning that you can use the results of the getEnrich function. To run the following steps with the getEnrich results is recommended to change the lfc parameter in aboce runAnalysis function by 3. That way there will be more pathways to enrich. Remembering that this will require more processing time.
After the association of GOs and genes, we are able to plot the network of previously results. Itâs possible to choose which variables (âpathwaysâ, âTFsâ or âgenesâ) we want to group. All these variables are inputted by the user, i.e. the user can choose the pathways, TFs or genes returned from analysis from the runAnalysis function and CeTF class object, or can also choose by specific pathways, TFs or genes of interest. The table used to plot the networks are stored in a list resulted from netGOTFPlot function. To get access is needed to use the object$tab. The list will be splitted by groupBy argument choice:
library(dplyr)
# Subsetting only the first 12 Ontologies with more counts
t1 <- head(cond1$results, 12)
# Subsetting the network for the conditions to make available only the 12 nodes subsetted
t2 <- subset(cond1$netGO, cond1$netGO$gene1 %in% as.character(t1[, "ID"]))
# generating the GO plot grouping by pathways
pt <- netGOTFPlot(netCond = NetworkData(out, "network1") %>%
mutate(across(everything(), as.character)),
resultsGO = t1,
netGO = t2,
anno = NetworkData(out, "annotation"),
groupBy = 'pathways',
type = 'GO')
pt$plot
head(pt$tab$`GO:0006807`)
## NULL
# generating the GO plot grouping by TFs
TFs <- NetworkData(out, "keytfs")$TF[1:4]
pt <- netGOTFPlot(netCond = NetworkData(out, "network1"),
resultsGO = t1,
netGO = t2,
anno = NetworkData(out, "annotation"),
groupBy = 'TFs',
TFs = TFs,
type = 'GO')
pt$plot
head(pt$tab$ENSG00000011465)
## NULL
# generating the GO plot grouping by specifics genes
genes <- c('ENSG00000011465', 'ENSG00000026025', 'ENSG00000075624', 'ENSG00000077942')
pt <- netGOTFPlot(netCond = NetworkData(out, "network1"),
resultsGO = t1,
netGO = t2,
anno = NetworkData(out, "annotation"),
groupBy = 'genes',
genes = genes,
type = 'GO')
pt$plot
head(pt$tab$ENSG00000011465)
## gene1 genes gene2 class
## 1 ENSG00000004799 ENSG00000011465 <NA> gene
## 5 ENSG00000013293 ENSG00000011465 <NA> gene
## 9 ENSG00000030419 ENSG00000011465 <NA> TF
## 13 ENSG00000053254 ENSG00000011465 <NA> TF
## 17 ENSG00000064961 ENSG00000011465 <NA> TF
## 21 ENSG00000067082 ENSG00000011465 <NA> TF
Finally, we will merge the results from the network of condition 1 (Gene-Gene and TF-Gene interaction) with groupGO results (GO-Gene or GO-TF interaction). The final results has the objective to integrate all these data in order to results in a complete view of the data:
pt <- netGOTFPlot(netCond = NetworkData(out, "network1") %>%
mutate(across(everything(), as.character)),
netGO = t2,
keyTFs = NetworkData(out, "keytfs"),
type = 'Integrated')
pt$plot
## gene1 gene2
## 22 ENSG00000004799 ENSG00000113761
## 24 ENSG00000004799 ENSG00000114861
## 26 ENSG00000004799 ENSG00000119138
## 27 ENSG00000004799 ENSG00000120093
## 47 ENSG00000004799 ENSG00000143127
## 77 ENSG00000004799 ENSG00000184271
Using accessors to access results
Is possible to use the accessors from CeTF class object to access the outputs generated by the runAnalysis function. These outputs can be used as input in Cytoscape (Shannon et al. 2003). The accessors are:
All accessors can be further explored by looking in more detail at the documentation.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4