RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://github.com/Core-Bioinformatics/noisyR below:

Core-Bioinformatics/noisyR: Package documentation and examples:

High-throughput sequencing enables an unprecedented resolution in transcript quantification, at the cost of magnifying the impact of technical noise. The consistent reduction of random background noise to capture functionally meaningful biological signals is still challenging. Intrinsic sequencing variability introducing low-level expression variations can obscure patterns in downstream analyses.

The noisyR package comprises an end-to-end pipeline for quantifying and removing technical noise from HTS datasets. The three main pipeline steps are [i] similarity calculation across samples, [ii] noise quantification, and [iii] noise removal; each step can be finely tuned using hyperparameters; optimal, data-driven values for these parameters are also determined.

Publication (Nucleic Acids Research, June 2021): https://doi.org/10.1093/nar/gkab433

This project was presented at the UK Conference of Bioinformatics and Computational Biology on 29th September 2021. The slides can be found here.

Workflow diagram of the noisyr pipeline

Similarity calculation across samples

For the sample-similarity calculation, two approaches are available:

The count matrix approach uses the original, un-normalised expression matrix, as provided after alignment and feature quantification; each sample is processed individually, only the relative expressions across samples are compared. Relying on the hypothesis that the majority of genes are not DE, most of the evaluations are expected to point towards a high similarity across samples. Choosing from a collection of >45 similarity metrics, users can select a measure to assess the localised consistency in expression across samples. A sliding window approach is used to compare the similarity of ranks or abundances for the selected features between samples. The window length is a hyperparameter, which can be user-defined or inferred from the data.
The transcript approach uses as input the alignment files derived from read-mappers (in BAM format). For each sample and each exon, the point-to-point similarity of expression across the transcript is calculated across samples in a pairwise all-versus-all comparison.

The output formats for the two approaches correspond to their inputs: a denoised count matrix for the count matrix approach, and denoised BAM files for the transcript approach.

Main functions: calculate_expression_similarity_counts(), calculate_expression_similarity_transcript()

Supporting functions: get_methods_correlation_distance(), optimise_window_length(), calculate_expression_profile()

Input preparation functions: cast_matrix_to_numeric(), cast_gtf_to_genes()

The noise quantification step uses the expression-similarity relation calculated in step i to determine the noise threshold, representing the level below which the gene expression is considered noisy.

For example, if a similarity threshold is used as input then the corresponding expression from a (smoothed) expression-similarity line plot is selected as the noise threshold for each sample. The shape of the distribution can vary across experiments; we provide functionality for different thresholds and recommend the choice of the one that results in the lowest variance in the noise thresholds across samples.

Options for smoothing, or summarising the observations in a box plot and selecting the minimum abundance for which the interquartile range (or median, or 5-95% range) is consistently above the similarity threshold are also available. As a general rule (due to the number of observations), we recommend using the smoothing with the count matrix approach, and the boxplot representation with the transcript option.

Indicative plots of the Pearson correlation calculated on windows of increasing average abundance for the count matrix-based noise removal approach (left) and per exon for the transcript-based noise removal approach (right).

Main function: calculate_noise_threshold_base()

Supporting functions: get_methods_calculate_noise_threshold(), calculate_first_minimum_density()

Visualisation functions: plot_expression_similarity(), calculate_noise_threshold_method_statistics()

The third step uses the noise threshold calculated in step ii to remove noise from the count matrix or BAM files.

For the count matrix approach, genes whose expression is below the noise thresholds for every sample are removed and the average noise threshold is calculated and added to every entry. This ensures that the fold-changes observed by downstream analyses are not biased by low expression, while still preserving the structure and relative expression levels in the data.
For the transcript approach, genes are removed from the BAM files if the expression of all their exons is below the noise thresholds for every sample. The removal is done at gene level to avoid scenarios that are not biologically possible.

Main functions: remove_noise_from_matrix(), remove_noise_from_bams()

CRAN (use install.packages())

utils
grDevices
tibble
dplyr
magrittr
ggplot2
philentropy
doParallel
foreach

Bioconductor (use BiocManager::install())

preprocessCore
IRanges
GenomicRanges
Rsamtools

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4