To get the most up-to-date options, run
canu -options
The default values below will vary based on the input data type and genome size.
Boolean options accept true/false or 1/0.
Memory sizes are assumed to be in gigabytes if no units are supplied. Values may be non-integer, with or without a unit: 'k' for kilobytes, 'm' for megabytes, 'g' for gigabytes, or 't' for terabytes. For example, '0.25t' is equivalent to '256g' (or simply '256').
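For instance, the following spec-file lines are equivalent ways of requesting 256 gigabytes (ovsMemory is used here purely as an illustration):

ovsMemory=0.25t
ovsMemory=256g
ovsMemory=256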
Global Options

The catch-all category.
The allowed difference in an overlap between two corrected reads, expressed as fraction error. Sets obtOvlErrorRate, utgOvlErrorRate, obtErrorRate, utgErrorRate, and cnsErrorRate. The correctedErrorRate can be adjusted to account for the quality of read correction, for the amount of divergence in the sample being assembled, and for the amount of sequence being assembled. The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.
For low coverage datasets (less than 30X), we recommend increasing correctedErrorRate slightly, by 1% or so.
For high-coverage datasets (more than 60X), we recommend decreasing correctedErrorRate slightly, by 1% or so.
Raising the correctedErrorRate will increase run time. Likewise, decreasing correctedErrorRate will decrease run time, at the risk of missing overlaps and fracturing the assembly.
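As an illustrative command only (file names and genome size are placeholders, and depending on your Canu version the read-type option may be -nanopore or -nanopore-raw), a low-coverage Nanopore assembly might raise the Nanopore default of 0.144 by one percent:

canu -p asm -d asm-dir genomeSize=4.8m correctedErrorRate=0.154 -nanopore reads.fastq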
Reads shorter than this are not loaded into the assembler. Reads output by correction and trimming that are shorter than this are discarded.
Must be no smaller than minOverlapLength.
[In Canu v1.9 and earlier] If set high enough, the gatekeeper module will claim there are errors in the input reads, as too many of the input reads have been discarded. As long as there is sufficient coverage, this is not a problem. See stopOnReadQuality and stopOnLowCoverage.
Overlaps shorter than this will not be discovered. Smaller values can be used to overcome a lack of read coverage, but will also lead to false overlaps and potential misassemblies. Larger values will result in assemblies that are more correct, but also more fragmented.
Must be no bigger than minReadLength.
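For example, to relax both limits for a low-coverage dataset (values are illustrative, not recommendations), a spec file could contain:

minReadLength=1000
minOverlapLength=500

This satisfies the constraint above, since minReadLength (1000) is no smaller than minOverlapLength (500).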
Adjust the sampling bias towards discarding longer (negative numbers) or shorter (positive numbers) reads. Reads are assigned a score equal to random_number * read_length ^ bias and the lowest scoring reads are discarded, as described in readSamplingCoverage.
In the pictures below, green reads are kept, while purple reads are discarded. The reads are along the X axis, sorted by decreasing score. The Y axis is the length of each read.
A bias of 0.0 will retain random reads:
A negative bias will retain shorter reads:
A positive bias will retain longer reads:
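As a worked example of the scoring formula: with readSamplingBias=1.0, a 20 kbp read receives a score of random_number * 20000, twice the expected score of a 10 kbp read, so longer reads tend to survive the sampling. A hypothetical spec-file setting would be:

readSamplingBias=1.0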
An estimate of the size of the genome. Common suffixes are allowed, for example, 3.7m or 2.8g.
The genome size estimate is used to decide how many reads to correct (via the corOutCoverage parameter) and how sensitive the mhap overlapper should be (via the mhapSensitivity parameter). It also impacts some logging, in particular, reports of NG50 sizes.
This option uses MHAP overlapping for all steps, not just correction, making assembly significantly faster. It can be used on any genome size but may produce less continuous assemblies on genomes larger than 1 Gbp. It is recommended for nanopore genomes smaller than 1 Gbp or metagenomes.
The fast option will also optionally use wtdbg for unitigging if wtdbg is manually copied to the Canu binary folder. However, this is only tested with very small genomes and is NOT recommended.
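For example, a small Nanopore genome might be assembled with (paths and sizes are placeholders):

canu -fast -p asm -d asm-dir genomeSize=500k -nanopore reads.fastq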
Execute the command supplied when Canu terminates abnormally. The command will execute in the <assembly-directory> (the -d option to canu) and will be supplied with the name of the assembly (the -p option to canu) as its first and only parameter.
There are two exceptions when the command is not executed: if a 'spec' file cannot be read, or if canu tries to access an invalid parameter. The former will be reported as a command line error, and canu will never start. The latter should never occur except when developers are developing the software.
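For example, assuming this section describes the onFailure option, a hypothetical notification script could be hooked in with:

onFailure=/path/to/notify-me.sh

Canu would then run notify-me.sh in the assembly directory, passing it the assembly name as its only parameter, after an abnormal termination.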
[In Canu v1.9 and earlier] If set, Canu will stop with the following error if there are significantly fewer reads or bases loaded into the read store than what is in the input data.
Gatekeeper detected potential problems in your input reads.

Please review the logging in files:
  /assembly/godzilla/asm.gkpStore.BUILDING.err
  /assembly/godzilla/asm.gkpStore.BUILDING/errorLog

If you wish to proceed, rename the store with the following
command and restart canu.

  mv /assembly/godzilla/asm.gkpStore.BUILDING \
     /assembly/godzilla/asm.gkpStore.ACCEPTED

Option stopOnReadQuality=false skips these checks.
The missing reads could be too short (decrease minReadLength to include them), or have invalid bases or quality values. A summary of the files loaded and errors detected is in the asm.gkpStore.BUILDING.err file, with full gory details in the asm.gkpStore.BUILDING/errorLog. To proceed, set stopOnReadQuality=false or rename the directory as shown.
Note that U bases are silently translated to T bases, to allow assembly of RNA sequences.
If set, Canu will stop processing after a specific stage in the pipeline finishes. Valid values are:
stopAfter=             Canu will stop after ...
sequenceStore          reads are loaded into the assembler read database.
meryl-configure        kmer counting jobs are configured.
meryl-count            kmers are counted, but not processed into one database.
meryl-merge            kmers are merged into one database.
meryl-process          frequent kmers are generated.
meryl-subtract         haplotype specific kmers are generated.
meryl                  all kmer work is complete.
haplotype-configure    haplotype read separation jobs are configured.
haplotype              haplotype-specific reads are generated.
overlapConfigure       overlap jobs are configured.
overlap                overlaps are generated, before they are loaded into the database.
overlapStoreConfigure  the jobs for creating the overlap database are configured.
overlapStore           overlaps are loaded into the overlap database.
correction             corrected reads are generated.
trimming               trimmed reads are generated.
unitig                 unitigs and contigs are created.
consensusConfigure     consensus jobs are configured.
consensus              consensus sequences are loaded into the databases.

readCorrection and readTrimming are deprecated synonyms for correction and trimming, respectively.
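For example, to generate corrected and trimmed reads but skip contig construction entirely (the command is illustrative; stopAfter=trimming is the point here):

canu -p asm -d asm-dir genomeSize=4.8m stopAfter=trimming -pacbio reads.fastq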
The correction stage of Canu requires random access to all the reads. Performance is greatly improved if the gkpStore database of reads is copied locally to each node that computes corrected read consensus sequences. This 'staging' is enabled by supplying a path name to fast local storage with the stageDirectory option, and, optionally, requesting access to that resource from the grid with the gridEngineStageOption option.
A path to a directory local to each compute node. The directory should use an environment variable specific to the grid engine to ensure that it is unique to each task.
For example, in Sun Grid Engine, /scratch/$JOB_ID-$SGE_TASK_ID will use both the numeric job ID and the numeric task ID. In SLURM, /scratch/$SLURM_JOB_ID accomplishes the same.
If specified on the command line, be sure to escape the dollar sign, otherwise the shell will try to expand it before Canu sees the option: stageDirectory=/scratch/$JOB_ID-$SGE_TASK_ID.
If specified in a specFile, do not escape the dollar signs.
This string is passed to the job submission command, and is expected to request local disk space on each node. It is highly grid specific. The string DISK_SPACE will be replaced with the amount of disk space needed, in gigabytes.
On SLURM, an example is --gres=lscratch:DISK_SPACE
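Putting the two options together in a spec file, for a hypothetical SLURM cluster with node-local scratch space:

stageDirectory=/scratch/$SLURM_JOB_ID
gridEngineStageOption=--gres=lscratch:DISK_SPACE

The exact resource name (lscratch here) is site-specific; ask your grid administrators what your cluster uses.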
Controls when to remove intermediate overlap results.
'never' removes no intermediate overlap results. This is only useful if you have a desire to exhaust your disk space.
'false' is the same as 'never'.
'normal' removes intermediate overlap results after they are loaded into an overlap store.
'true' is the same as 'normal'.
'aggressive' removes intermediate overlap results as soon as possible. In the event of a corrupt or lost file, this can result in a fair amount of suffering to recompute the data. In particular, overlapper output is removed as soon as it is loaded into buckets, and buckets are removed once they are rewritten as sorted overlaps.
'dangerous' removes intermediate results as soon as possible, in some cases, before they are even fully processed. In addition to corrupt files, jobs killed by out of memory, power outages, stray cosmic rays, et cetera, will result in a fair amount of suffering to recompute the lost data. This mode can help when creating ginormous overlap stores, by removing the bucketized data immediately after it is loaded into the sorting jobs, thus making space for the output of the sorting jobs.
Use 'normal' for non-large assemblies, and when disk space is plentiful. Use 'aggressive' on large assemblies when disk space is tight. Never use 'dangerous', unless you know how to recover from an error and you fully trust your compute environment.
For Mhap and Minimap2, the raw overlaps (in Mhap and PAF format) are deleted immediately after being converted to Canu ovb format, except when purgeOverlaps=never.
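For example, a disk-constrained large assembly might use:

purgeOverlaps=aggressive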
If set, do not remove intermediate outputs. Normally, intermediate files are removed once they are no longer needed.
NOT IMPLEMENTED.
The Canu 'executive' is responsible for controlling what tasks run and when they run. It doesn't directly do any significant computations; rather, it just examines the files that exist and decides which component to run next. For example, if overlaps exist but contigs do not, it would create contigs next.
When under grid control, some tasks can be run in the same job as the executive, if there is enough memory and threads reserved for the executive. The benefit of this is slight; on a heavily loaded grid, it would reduce the number of job scheduling iterations Canu needs to run.
executiveMemory <integer=4>
The amount of memory, in gigabytes, to reserve when running the Canu executive (and any jobs it runs directly). Increasing this past 4 GB can allow some tasks (such as creating an overlap store or creating contigs) to run directly, without needing a separate grid job.
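For example, to let tasks such as overlap store construction run inside the executive job (an illustrative value, not a tuned recommendation):

executiveMemory=16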
executiveThreads <integer=1>
The number of threads to reserve for the Canu executive.

Overlapper Configuration
Overlaps are generated for three purposes: read correction, read trimming and unitig construction. The algorithm and parameters used can be set independently for each set of overlaps.
Two overlap algorithms are in use. One, mhap, is typically applied to raw uncorrected reads and returns alignment-free overlaps with imprecise extents. The other, the original overlapper algorithm 'ovl', returns alignments but is much more expensive.
There are three sets of parameters, one for the 'mhap' algorithm, one for the 'ovl' algorithm, and one for the 'minimap' algorithm. Parameters used for a specific type of overlap are set by a prefix on the option: 'cor' for read correction, 'obt' for read trimming ('overlap based trimming') or 'utg' for unitig construction. For example, 'corOverlapper=ovl' would set the overlapper used for read correction to the 'ovl' algorithm.
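As an illustration of the prefix scheme (not a statement of the defaults), one could select mhap for correction overlaps and the slower, alignment-based ovl for trimming and unitigging:

corOverlapper=mhap
obtOverlapper=ovl
utgOverlapper=ovl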
Do not seed overlaps with these kmers, or, for mhap, do not seed with these kmers unless necessary (down-weight them).
For corFrequentMers (mhap), the file must contain a single line header followed by number-of-kmers data lines:
0 number-of-kmers
forward-kmer word-frequency kmer-count total-number-of-kmers
reverse-kmer word-frequency kmer-count total-number-of-kmers
Where kmer-count is the number of times this kmer sequence occurs in the reads, 'total-number-of-kmers' is the number of kmers in the reads (including duplicates; roughly the number of bases in the reads), and 'word-frequency' is 'kmer-count' / 'total-number-of-kmers'.
For example:
0 4
AAAATAATAGACTTATCGAGTC 0.0000382200 52 1360545
GACTCGATAAGTCTATTATTTT 0.0000382200 52 1360545
AAATAATAGACTTATCGAGTCA 0.0000382200 52 1360545
TGACTCGATAAGTCTATTATTT 0.0000382200 52 1360545
This file must be gzip compressed.
For obtFrequentMers and ovlFrequentMers, the file must contain a list of the canonical kmers and their count on a single line. The count value is ignored, but needs to be present. This file should not be compressed.
For example:
AAAATAATAGACTTATCGAGTC 52
AAATAATAGACTTATCGAGTCA 52
The overlap algorithms return overlaps in an arbitrary order; however, all other algorithms (or nearly all) require all overlaps for a single read to be readily available. Thus, the overlap store collects and sorts the overlapper outputs into a store of overlaps, sorted by the first read in the overlap. Each overlap is listed twice in the store, once in an 'A vs B' format, and once in a 'B vs A' format (that is, swapping which read is 'first' in the overlap description).
Two construction algorithms are supported. A 'sequential' method uses a single data stream, and is faster for small and moderate size assemblies. A 'parallel' method uses multiple compute nodes and can be faster (depending on your network disk bandwidth) for moderate and large assemblies. Be advised that the parallel method is less efficient than the sequential method, and can easily thrash consumer-level NAS devices, resulting in exceptionally poor performance.
The sequential method loads all overlapper outputs (.ovb files in 1-overlapper) into memory, duplicating each overlap. It then sorts the overlaps and creates the final overlap store.
The parallel method uses two parallel tasks: bucketizing ('ovb' tasks) and sorting ('ovs' tasks). Bucketizing reads the outputs of the overlap tasks (ovb files in 1-overlapper), duplicates each overlap, and writes these to intermediate files. Sorting tasks load these intermediate files into memory, sort the overlaps, then write the sorted overlaps back to disk. There will be one bucketizer ('ovb') task per overlap task, and tens to hundreds of sorter ('ovs') tasks. A final indexing step is done in the Canu executive, which ties all the various files together into the final overlap store.
Increasing ovsMemory will allow more overlaps to fit into memory at once. This will allow larger assemblies to use the sequential method, or reduce the number of 'ovs' tasks for the parallel method.
Increasing the allowed memory for the Canu executive can allow the overlap store to be constructed as part of the executive job, so that a separate grid job for constructing the store is not needed.
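For example, to favor the sequential method on a machine with ample memory (values are illustrative only):

ovsMemory=64
executiveMemory=64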
The 'meryl' algorithm counts the occurrences of kmers in the input reads. It outputs a FASTA format list of frequent kmers, and (optionally) a binary database of the counts for each kmer in the input.
Meryl can run in (almost) any memory size, by splitting the computation into smaller (or larger) chunks.
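For example, assuming the usual {prefix}Memory and {prefix}Threads naming applies to meryl, its chunk size could be constrained with (illustrative values):

merylMemory=32
merylThreads=8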
Canu directly supports most common grid scheduling systems. Under normal use, Canu will query the system for grid support, configure itself for the machines available in the grid, then submit itself to the grid for execution. The Canu pipeline is a series of about a dozen steps that alternate between embarrassingly parallel computations (e.g., overlap computation) and sequential bookkeeping steps (e.g., checking if all overlap jobs finished). This is entirely managed by Canu.
Canu has first class support for the various schedulers derived from Sun Grid Engine (Univa, Son of Grid Engine) and the Simple Linux Utility for Resource Management (SLURM), meaning that the developers have direct access to these systems. Platform Computing's Load Sharing Facility (LSF) and the various schedulers derived from the Portable Batch System (PBS, Torque and PBSPro) are supported as well, but without developer access bugs do creep in. As of Canu v1.5, support seems stable and working.
Master control. If 'false', no algorithms will run under grid control. Does not change the value of the other useGrid options.
If 'remote', jobs are configured for grid execution, but not submitted. A message, with commands to launch the job, is reported and canu halts execution.
Note that the host used to run canu for 'remote' execution must know about the grid, that is, it must be able to submit jobs to the grid.
It is also possible to enable/disable grid support for individual algorithms with options such as useGridBAT, useGridCNS, et cetera. This has been useful in the (far) past to prevent certain algorithms, notably overlap error adjustment, from running too many jobs concurrently and thrashing disk. Recent storage systems seem to be able to handle the load better; computers have gotten faster quicker than genomes have gotten larger.
There are many options for configuring a new grid ('gridEngine*') and for configuring how canu configures its computes to run under grid control ('gridOptions*'). The grid engine to use is specified with the 'gridEngine' option.
There are many options to configure support for a new grid engine, and we don't describe them fully. If you feel the need to add support for a new engine, please contact us. That said, the file src/pipeline/canu/Defaults.pm lists a whole slew of parameters that are used to build up grid commands; they all start with gridEngine. For each grid, these parameters are defined in the various src/pipeline/Grid_*.pm modules. The parameters are used in src/pipeline/canu/Execution.pm.
In Canu 1.8 and earlier, gridEngineMemoryOption and gridEngineThreadsOption are used to tell Canu how to request resources from the grid. Starting with snapshot v1.8 +90 changes (roughly January 11th), those options were merged into gridEngineResourceOption. These options specify the grid options needed to request memory and threads for each job. For example, the default gridEngineResourceOption for PBS/Torque is "-l nodes=1:ppn=THREADS:mem=MEMORY", and for Slurm it is "--cpus-per-task=THREADS --mem-per-cpu=MEMORY". Canu will replace "THREADS" and "MEMORY" with the specific values needed for each job.
To run on the grid, each stage needs to be configured: to tell the grid how many cores it will use and how much memory it needs. Some support for this is automagic (for example, overlapInCore and mhap know how to do this); others need to be manually configured. Yes, it's a problem, and yes, we want to fix it.
The gridOptions* parameters supply grid-specific options to the grid submission command.
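For example, on SLURM one might pass a time limit and partition to every job Canu submits (the flag values and partition name are site-specific placeholders):

gridOptions="--time=24:00:00 --partition=norm"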
Several algorithmic components of canu can be disabled, based on the type of the reads being assembled, the type of processing desired, or the amount of compute resources available.
Canu has a fairly sophisticated (or complicated, depending on if it is working or not) method for dividing large computes, such as read overlapping and consensus, into many smaller pieces and then running those pieces on a grid or in parallel on the local machine. The size of each piece is generally determined by the amount of memory the task is allowed to use, and this memory size (actually a range of memory sizes) is set based on the genomeSize parameter, but can be set explicitly by the user. The same holds for the number of processors each task can use. For example, a genomeSize=5m would result in overlaps using between 4gb and 8gb of memory, and between 1 and 8 processors.
Given these requirements, Canu will pick a specific memory size and number of processors so that the maximum number of jobs will run at the same time. In the overlapper example, if we are running on a machine with 32gb memory and 8 processors, it is not possible to run 8 concurrent jobs that each require 8gb memory, but it is possible to run 4 concurrent jobs each using 6gb memory and 2 processors.
To completely specify how Canu runs algorithms, one needs to specify a maximum memory size, a maximum number of processors, and how many pieces to run at one time. Users can set these manually through the {prefix}Memory, {prefix}Threads and {prefix}Concurrency options. If they are not set, defaults are chosen based on genomeSize.
Available prefixes are:
Prefix               Algorithm
{cor,obt,utg}mhap    Overlap generation using the 'mhap' algorithm, for
                     correction ('cor'), trimming ('obt') or assembly ('utg').
{cor,obt,utg}mmap    Overlap generation using the 'minimap2' algorithm, for
                     correction ('cor'), trimming ('obt') or assembly ('utg').
{cor,obt,utg}ovl     Overlap generation using the 'overlapInCore' algorithm, for
                     correction ('cor'), trimming ('obt') or assembly ('utg').
ovb                  Parallel overlap store bucketizing
ovs                  Parallel overlap store bucket sorting
cor                  Read correction
red                  Error detection in reads
oea                  Error adjustment in overlaps
bat                  Unitig/contig construction
cns                  Unitig/contig consensus

For example, 'mhapMemory' would set the memory limit for computing overlaps with the mhap algorithm; 'cormhapMemory' would set the memory limit only when mhap is used for generating overlaps used for correction.
The 'minMemory', 'maxMemory', 'minThreads' and 'maxThreads' options will apply to all jobs, and can be used to artificially limit canu to a portion of the current machine. In the overlapper example above, setting maxThreads=4 would result in two concurrent jobs instead of four.
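For example, to confine Canu to half of a 64 GB, 16-core machine (illustrative values):

maxMemory=32
maxThreads=8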
Overlap Error Adjustment

The Overlap Error Adjustment module adjusts the error rate claimed by each overlap to account for sequencing error and true polymorphism. A pair-wise multialignment is generated for all overlaps to a given read. Each multialignment column is examined to determine if the base in the given read is correct, is part of a true difference, or is a likely sequencing error. For the latter case, a base change is noted. Once all base changes in all columns of all reads have been determined, all overlaps are recomputed with said changes applied, and the new fraction error stored for each overlap.
Three parameters exist to change the behavior:
This module consists of two steps: RED (read error detection) and OEA (overlap error adjustment). They have slightly different run-time requirements. RED can use multiple threads and is slightly more computationally expensive; OEA cannot use multiple threads and is slightly more I/O intensive. The batch length and batch size parameters can tune the size of each job, however, the default values have worked well (so well that we don't actually document what these values should be set to).
STILL DONE BY UNITIGGER, NEED TO MOVE OUTSIDE
The first step in Canu is to find high-error overlaps and generate corrected sequences for subsequent assembly. This is currently the fastest step in Canu. By default, only the longest 40X of data (based on the specified genome size) is used for correction. Typically, some reads are trimmed during correction due to being chimeric or having erroneous sequence, resulting in a loss of 20-25% (30X output). You can force correction to be non-lossy by setting corMinCoverage=0, in which case the corrected reads output will be the same length as the input data, keeping any high-error unsupported bases. Canu will trim these in downstream steps before assembly.
If you have a dataset with uneven coverage or small plasmids, correcting the longest 40X may not give you sufficient coverage of your genome/plasmid. In these cases, you can set corOutCoverage=999, or any value greater than your total input coverage, which will correct and assemble all input data at the expense of runtime.
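For example, to correct all input reads for a dataset with a suspected small plasmid (genome size and file name are placeholders):

canu -p asm -d asm-dir genomeSize=4.8m corOutCoverage=999 -pacbio reads.fastq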
Do not use overlaps with error rate higher than this when computing corrected reads.
In Canu v2.2, this parameter was changed from 0.50 (for -nanopore) and 0.30 (for -pacbio) to 0.30 and 0.25, respectively.
The tables below show a significant speedup for Nanopore reads without much loss in output quantity. There is indication of a slight improvement in corrected read quality at lower corErrorRate, however, read quality was not directly evaluated.
For PacBio reads, with a smaller change in corErrorRate, the speedup is about 10%.
CHM13 Chromosome X, nanopore, 105x input coverage

corErrorRate  Corrected  Trimmed   Bogart      CPU Time
              Coverage   Coverage  Error Rate  (hours)
 5            22.0x      21.8x     0.3958%
10            35.7x      35.2x
15            38.1x      37.5x
20            38.6x      38.0x                 1160
25            38.7x      38.1x                 1290
30            38.8x      38.1x                 1449
40            38.8x      38.1x                 1625
50            38.8x      38.2x                 3683

HG002 Chromosome X, nanopore, 20x input coverage

corErrorRate  Corrected  Trimmed   Bogart      CPU Time
              Coverage   Coverage  Error Rate  (hours)
 5            -.-x       -.-x      -%            31
10            3.9x       -.-x      -%            66
15            9.6x       -.-x      -%           105
20            11.4x      11.2x     1.71%        134
25            11.9x      11.6x     1.79%        154
30            12.0x      11.8x     1.83%        169
40            12.2x      12.0x     1.94%        221
50            12.6x      12.3x     2.29%        709

A contig that meets any of the following conditions is flagged as 'unassembled' and removed from further consideration:
- fewer than minReads reads (default 2)
- shorter than minLength bases (default 0)
- a single read covers more than singleReadSpan fraction of the contig (default 1.0)
- more than lowCovSpan fraction of the contig is at coverage below lowCovDepth (defaults 0.5, 5)
This filtering is done immediately after initial contigs are formed, before potentially incorrectly spanned repeats are detected. Initial contigs that incorrectly span a repeat can be split into multiple contigs; none of these new contigs will be flagged as âunassembledâ, even if they are a single read.
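Assuming these thresholds are supplied through the contigFilter option, in the order listed above, the defaults would correspond to the following spec-file line (treat this as a sketch; verify the option name and value order with canu -options):

contigFilter="2 0 1.0 0.5 5"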