Showing content from https://github.com/dozmorovlab/HiCcompare/raw/supplemental/supplemental_files/S3_File.Rmd below:
--- title: Estimation of the power-law dependence between the $log_{10}-log_{10}$ SD of interaction frequencies and distance between interacting regions author: "John Stansfield, Mikhail Dozmorov" output: pdf_document: toc: yes html_document: toc: yes --- ```{r setup, echo=FALSE, message=FALSE, warning=FALSE} # Set up the environment library(knitr) opts_chunk$set(cache.path='cache/', fig.path='img/', cache=F, tidy=T, fig.keep='high', echo=F, dpi=100, warnings=F, message=F, comment=NA, warning=F, results='as.is', fig.width = 5.5, fig.height = 3) #out.width=700, library(pander) panderOptions('table.split.table', Inf) set.seed(1) library(dplyr) options(stringsAsFactors = FALSE) ``` ```{r libraries} library(HiCcompare) library(igraph) library(ggplot2) library(gridExtra) ``` # Introduction To estimate the parameters for simulating individual Hi-C matrices, Hi-C data from Gm12878 cell line [@Rao:2014aa] were used (Supplementary Table 1, [GSE63525](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525)). The first dataset was obtained with the DpnII restriction enzyme, while the second dataset was obtained with the MboI enzyme. Data from chromosome 1 was used at resolutions of 1Mb, 500kb, 100kb, and 50kb. The data were converted in a sparse matrix format (see `HiCcompare-vignette.Rmd` for details). Additional Gm12878 Hi-C data from chrs 1, 18, and 19 at 1Mb resolution, cut using the DpnII and MboI enzymes were also included. ```{r} githubURL <- "https://github.com/dozmorovlab/HiCcompare/raw/supplemental/Supplemental_data/S2_File_data.RData" load(url(githubURL)) # Pairwise matrices as `hic.table` objects table.1mb = create.hic.table(S2.dpnii.chr1, S2.mbol.chr1, 'chr1', scale = F, include.zeros = T) table.500kb = create.hic.table(S2.dpnii.500kb, S2.p.500kb, 'chr1', scale = F, include.zeros = T) table.100kb = create.hic.table(S2.dpnii.100kb, S2.p.100kb, 'chr1', scale = F, include.zeros = T) table.50kb = create.hic.table(S2.dpnii.50kb, S2.p.50kb, 'chr1', scale = F, include.zeros = T) chr1.tab = create.hic.table(S2.dpnii.chr1, S2.mbol.chr1, chr = 'chr1') chr18.tab = create.hic.table(S2.dpnii.chr18, S2.mbol.chr18, chr = 'chr18') chr19.tab = create.hic.table(S2.dpnii.chr19, S2.mbol.chr19, chr = 'chr19') ``` ```{r} # fucntion for making ggplots sdplot = function(D, SD, title = 'SD vs D plot', linfit = FALSE) { dat = data.frame(D = D, SD = SD) p = ggplot(dat, aes(x = D, y = SD)) + geom_point() + ggtitle(title) if (linfit) { p = p + geom_smooth(method='lm', se = FALSE) # add linear fit } else { p = p + scale_x_continuous(limits = c(0, 50)) } return(p) } ``` # SD of IFs vs. distance dependence First, we estimate the power-law approximation of the dependence between the standard deviation (SD) of interaction frequencies (IFs) and distance by fitting the power-law estimate and assessing the fit using the Kolmogorov-Smirnov test. Because SDs at larger distances do not fit the power-law well the outlier values are iteratively removed, starting from the largest distances, until the Kolmogorov-Smirnov test results indicate the power-law fit is adequate. The $\alpha$ power-law parameter can then be used to approximate the decay of SD with distance. As with the decay of IFs with distance, the power-law approximation can be affected by multiple factors, e.g., the resolution of the data, the enzymes used to obtain the data, the chromosomal differences (e.g., chromosome length (chromosome 1 being the longest), gene density (gene-poor chromosome 18 and gene-dense chromosome 19)). ```{r} kb1000 = aggregate(table.1mb$IF1, by = list(table.1mb$D), FUN = sd) kb500 = aggregate(table.500kb$IF1, by = list(table.500kb$D), FUN = sd) kb100 = aggregate(table.100kb$IF1, by = list(table.100kb$D), FUN = sd) kb50 = aggregate(table.50kb$IF1, by = list(table.50kb$D), FUN = sd) mkb1000 = aggregate(table.1mb$IF2, by = list(table.1mb$D), FUN = sd) mkb500 = aggregate(table.500kb$IF2, by = list(table.500kb$D), FUN = sd) mkb100 = aggregate(table.100kb$IF2, by = list(table.100kb$D), FUN = sd) mkb50 = aggregate(table.50kb$IF2, by = list(table.50kb$D), FUN = sd) chr1 = aggregate(chr1.tab$IF1, by = list(chr1.tab$D), FUN = sd) chr18 = aggregate(chr18.tab$IF1, by = list(chr18.tab$D), FUN = sd) chr19 = aggregate(chr19.tab$IF1, by = list(chr19.tab$D), FUN = sd) mchr1 = aggregate(chr1.tab$IF2, by = list(chr1.tab$D), FUN = sd) mchr18 = aggregate(chr18.tab$IF2, by = list(chr18.tab$D), FUN = sd) mchr19 = aggregate(chr19.tab$IF2, by = list(chr19.tab$D), FUN = sd) ``` ```{r} # function to iteratively remove the observations at the largest distance until the power law is met fit_powerlaw = function(tab) { tab = na.omit(tab) fit = power.law.fit(tab$x) pval = fit$KS.p # loop until fit is good while(pval <= 0.05) { tab = tab[-nrow(tab),] fit = power.law.fit(tab$x) pval = fit$KS.p } return(tab) } ``` ```{r} kb1000 = fit_powerlaw(kb1000) kb500 = fit_powerlaw(kb500) kb100 = fit_powerlaw(kb100) kb50 = fit_powerlaw(kb50) mkb1000 = fit_powerlaw(mkb1000) mkb500 = fit_powerlaw(mkb500) mkb100 = fit_powerlaw(mkb100) mkb50 = fit_powerlaw(mkb50) chr1 = fit_powerlaw(chr1) chr18 = fit_powerlaw(chr18) chr19 = fit_powerlaw(chr19) mchr1 = fit_powerlaw(mchr1) mchr18 = fit_powerlaw(mchr18) mchr19 = fit_powerlaw(mchr19) ``` ## The effect of resolution Here the fit to the power-law of the SD of IFs at each distance is tested for Chr 1 from GM12878 using cutting enzyme DpnII at 1MB, 500KB, 100KB, 50KB resolution, respectively. Tables show the output of `fitdistplus::power.law.fit` function. Key variables to note are `alpha` - the power of the $C*x^{-alpha}$ power-law formula, and `KS.p` - p-value of the Kolmogorov-Smirnov test, larger p-value means that the power-law fit is adequate. The first row is for DpnII and the second row is for MboI. The plots represent the log10(SD) vs log10(Distance), one plot per cutting enzyme. ### 1 MB ```{r} t1 = power.law.fit(kb1000$x) t2 = power.law.fit(mkb1000$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1 = sdplot(log10(kb1000$Group.1), log10(kb1000$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mkb1000$Group.1), log10(mkb1000$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### 500KB ```{r} t1 = power.law.fit(kb500$x) t2 = power.law.fit(mkb500$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1 = sdplot(log10(kb500$Group.1), log10(kb500$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mkb500$Group.1), log10(mkb500$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### 100KB ```{r} t1 = power.law.fit(kb100$x) t2 = power.law.fit(mkb100$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1 = sdplot(log10(kb100$Group.1), log10(kb100$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mkb100$Group.1), log10(mkb100$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### 50KB ```{r} t1 = power.law.fit(kb50$x) t2 = power.law.fit(mkb50$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1 = sdplot(log10(kb50$Group.1), log10(kb50$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mkb50$Group.1), log10(mkb50$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### Summary The SD of the IFs seems to fit the power-law adequately over the range of resolutions after the outliers are removed, however some of the plots still show some deviations from the ideal fit. $\alpha$ ranges from 1.73 to 2.68. There is more variability in the $\alpha$ parameter for modeling SD compared to the median IF. ## The effect of chromosomes Here the fit to the power-law of SD of IF at each distance is tested for chromosome 1, 18, 19 from GM12878 cell line using cutting enzymes DpnII and MboI at 1MB resolution. As above, the table shows the output of `fitdistplus::power.law.fit` function. Key variables to note are `alpha` - the power of the $C*x^{-alpha}$ power-law formula, and `KS.p` - p-value of the Kolmogorov-Smirnov test, larger p-value means that the power-law fit is adequate. The plots represents the log10(SD) and log10(Distance), one plot for each cutting enzyme. ### Chr 1 ```{r} t1 = power.law.fit(chr1$x) t2 = power.law.fit(mchr1$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1= sdplot(log10(chr1$Group.1), log10(chr1$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mchr1$Group.1), log10(mchr1$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### Chr 18 ```{r} t1 = power.law.fit(chr18$x) t2 = power.law.fit(mchr18$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1 = sdplot(log10(chr18$Group.1), log10(chr18$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mchr18$Group.1), log10(mchr18$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### Chr 19 ```{r} t1 = power.law.fit(chr19$x) t2 = power.law.fit(mchr19$x) t3 = rbind(t1, t2) t3 = cbind(Enzyme = c('DpnII', 'MboI'), t3) rownames(t3) = NULL t3 %>% as.data.frame() %>% pander() ``` ```{r} p1 = sdplot(log10(chr19$Group.1), log10(chr19$x), title = 'DpnII log10(SD) vs log10(D)', linfit = T) p2 = sdplot(log10(mchr19$Group.1), log10(mchr19$x), title = 'MboI log10(SD) vs log10(D)', linfit = T) grid.arrange(p1, p2, ncol=2) ``` ### Summary The power-law fit is better over the varying chromosomes at 1MB resolution after the outliers were removed. $\alpha$ ranges from 1.79 to 2.25. The plots of the fits show less deviations compared to the plots in the effect of resolution section. For simulations an $\alpha$ between 1.7 and 2.7 should provide give a reasonable approximation to the data.
RetroSearch is an open source project built by @garambo
| Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4