High-throughput sequencing enables an unprecedented resolution in transcript quantification, at the cost of magnifying the impact of technical noise. The consistent reduction of random background noise to capture functionally meaningful biological signals is still challenging. Intrinsic sequencing variability introducing low-level expression variations can obscure patterns in downstream analyses.
The noisyR package comprises an end-to-end pipeline for quantifying and removing technical noise from HTS datasets. The three main pipeline steps are [i] similarity calculation across samples, [ii] noise quantification, and [iii] noise removal; each step can be finely tuned using hyperparameters; optimal, data-driven values for these parameters are also determined.
Publication (Nucleic Acids Research, June 2021): https://doi.org/10.1093/nar/gkab433
This project was presented at the UK Conference of Bioinformatics and Computational Biology on 29th September 2021. The slides can be found here.
Workflow diagram of the noisyr pipeline
Similarity calculation across samplesFor the sample-similarity calculation, two approaches are available:
The output formats for the two approaches correspond to their inputs: a denoised count matrix for the count matrix approach, and denoised BAM files for the transcript approach.
Main functions: calculate_expression_similarity_counts(), calculate_expression_similarity_transcript()
Supporting functions: get_methods_correlation_distance(), optimise_window_length(), calculate_expression_profile()
Input preparation functions: cast_matrix_to_numeric(), cast_gtf_to_genes()
The noise quantification step uses the expression-similarity relation calculated in step i to determine the noise threshold, representing the level below which the gene expression is considered noisy.
For example, if a similarity threshold is used as input then the corresponding expression from a (smoothed) expression-similarity line plot is selected as the noise threshold for each sample. The shape of the distribution can vary across experiments; we provide functionality for different thresholds and recommend the choice of the one that results in the lowest variance in the noise thresholds across samples.
Options for smoothing, or summarising the observations in a box plot and selecting the minimum abundance for which the interquartile range (or median, or 5-95% range) is consistently above the similarity threshold are also available. As a general rule (due to the number of observations), we recommend using the smoothing with the count matrix approach, and the boxplot representation with the transcript option.
Indicative plots of the Pearson correlation calculated on windows of increasing average abundance for the count matrix-based noise removal approach (left) and per exon for the transcript-based noise removal approach (right).
Main function: calculate_noise_threshold_base()
Supporting functions: get_methods_calculate_noise_threshold(), calculate_first_minimum_density()
Visualisation functions: plot_expression_similarity(), calculate_noise_threshold_method_statistics()
The third step uses the noise threshold calculated in step ii to remove noise from the count matrix or BAM files.
Main functions: remove_noise_from_matrix(), remove_noise_from_bams()
CRAN (use install.packages())RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4