Canu v1.6
These are release notes for Canu version 1.6, which was released on August 14th, 2017. Canu is specialized for assembly of single-molecule high-noise sequences. Full documentation can be found at http://canu.readthedocs.org/.
This release provides a stable, tested, and documented version of the software. The binary distributions should work on any relatively recent version of the respective OS. The source code distribution contains everything you need to create a binary distribution for your own specific OS.
Citation
- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. (2017).
Minimum Requirements
- Perl 5.12.0, or File::Path 2.08
- Java SE 8
- GCC 4.5 (for compilation only)
- OS X 10.10 (for binaries only)
- gnuplot (optional, for generating diagnostic graphs)
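A quick way to compare a machine against the list above is to print the version of each tool that is on PATH. This is a minimal sketch, not part of the Canu distribution: the helper name check_tool is made up for illustration, and it assumes the tools are installed under their usual executable names.

```shell
#!/bin/sh
# Print the version of each prerequisite found on PATH, or MISSING otherwise.
check_tool() {
    # $1 = executable name; remaining arguments make it print its version
    tool=$1
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        # java writes its version to stderr, so merge both streams
        printf '%-8s found: %s\n' "$tool" "$("$tool" "$@" 2>&1 | head -n 1)"
    else
        printf '%-8s MISSING\n' "$tool"
    fi
}

check_tool perl -e 'print "$^V\n"'
check_tool perl -MFile::Path -e 'print "File::Path $File::Path::VERSION\n"'
check_tool java -version
check_tool gcc --version
check_tool gnuplot --version
```

The script only reports what it finds; comparing the printed versions against the minimums is left to the reader.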
Installation
Users can download Canu as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.
To install from source code (the file can be named either canu-v1.6.tar.gz or just v1.6.tar.gz, depending on how it is downloaded):
gunzip -dc canu-v1.6.tar.gz | tar -xf -
cd canu-1.6/src
make -j 8
cd ..
To install from a binary distribution:
xz -dc canu-1.6.*.tar.xz | tar -xf -
In both cases, canu is installed in a platform-specific subdirectory of canu-1.6, for example, canu-1.6/Linux-amd64. You can run the assembler with:
canu-1.6/*/bin/canu
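Because the platform directory name varies, it can be convenient to resolve it once and put it on PATH. The function below is an illustrative sketch (canu_bin_dir is a hypothetical helper, not something shipped with Canu) that applies the same canu-1.6/*/bin/canu pattern shown above.

```shell
#!/bin/sh
# Print the bin directory that holds the canu executable under a release tree.
canu_bin_dir() {
    # $1 = unpacked release directory, e.g. canu-1.6
    for exe in "$1"/*/bin/canu; do
        if [ -x "$exe" ]; then
            dirname "$exe"
            return 0
        fi
    done
    echo "no canu executable found under $1" >&2
    return 1
}
```

For example, PATH="$(canu_bin_dir canu-1.6):$PATH" would let later commands call canu without the full path.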
Changes
- Improved detection of unitig and contig edges in GFA outputs.
- Repeats that are confirmed correct no longer form unitigs. This increases unitig length and greatly simplifies the unitig GFA.
- Small plasmids are no longer flagged as 'unassembled' sequences. Note that the contigFilter option values have changed and old values run the risk of filtering incorrectly.
- Improved contig consensus accuracy (longer alignments to reference).
- Added a unitig to contig mapping via a BED output.
- Better memory management in bogart should slightly reduce its memory footprint and run time.
- Remove the ovlStore for correction and trimming when those stages are finished; set saveOverlaps=stores to retain them. The correction overlaps are usually the single largest consumer of disk space during the assembly.
- Remove the partitioned gkpStore copy when consensus is finished.
- Use file names with five digits, instead of four, for overlap error adjustment.
- Options minMemory and minThreads are now implemented.
- Use all overlaps, not just the best, to position reads in unitigs/contigs, resulting in more accurate repeat and edge detection.
- Implement the 'suggestCircular' flag in contigs and unitigs. It is set to 'true' if the single sequence can be circularized. Note: the flag is 'false' if two or more contigs are needed to form the circular chromosome.
- Stability improvements to overlap store building when ovsMethod=parallel (the default for large genomes).
- Easier restarts: if restarted from within the assembly directory, the -p and -d options and the read files can be omitted.
- Improved logging: citations are output at the start of the run for any included software within Canu.
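Several of the changes above are controlled by key=value options on the canu command line. The sketch below shows how minMemory, minThreads and saveOverlaps=stores might be combined in one run. The prefix, directory, genome size and read file are hypothetical, and the minMemory and minThreads values are placeholders only; check the parameter reference for their units and for whether they should be set at all.

```shell
#!/bin/sh
# Wrap one canu run that keeps the overlap stores and sets resource floors.
# All names and values here are illustrative assumptions.
assemble_with_limits() {
    # $1 = raw PacBio read file
    canu -p asm -d asm-run \
        genomeSize=4.8m \
        minMemory=16 minThreads=4 \
        saveOverlaps=stores \
        -pacbio-raw "$1"
}
```

Calling assemble_with_limits reads.fastq would start the run, provided canu is on PATH.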
Bug Fixes
- Fixed CIGAR multithreading bug in unitig and contig graphs which dropped some true edges.
- Fix invalid characters in corrected reads due to an out-of-bounds array access.
- Fix useGrid=remote, which failed to output commands when multiple jobs needed to be submitted.
Known Issues
See the issues page for up-to-date open issues, or to report a problem.
- When running each step (correct/trim/assemble) by hand, the assemble step will use corrected, not trimmed, reads when all steps are run with the same -d option. Run with different -d options as a workaround.
- Large memory usage during unitig consensus calling on unitigs over 100 Mbp in size; a 140 Mbp contig required approximately 75 GB.
- Large memory usage and runtime for long reads (e.g., Nanopore) when using the overlapper=ovl algorithm, and during Overlap Error Adjustment. The options overlapper=mhap utgReAlign=true are significantly faster but may produce slightly less contiguous assemblies on genomes >200 Mbp in size.
- Bubbles are not captured in the contig graph, but are included in the unitig graph. No attempt at marking bubbles is made.
See the FAQ for many suggestions, including suggestions for specific data types, e.g., Nanopore r9 reads.
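The workaround for the first known issue, running each step by hand with its own -d directory, might look like the sketch below. The prefix asm, the directory names, the genome size and the Nanopore read type are assumptions for illustration; the intermediate file names follow the prefix.correctedReads.fasta.gz and prefix.trimmedReads.fasta.gz pattern that Canu writes into each step's directory.

```shell
#!/bin/sh
# Run correct, trim and assemble separately, each in its own directory,
# so the assemble step reads the trimmed (not merely corrected) reads.
stepwise_assembly() {
    raw_reads=$1   # hypothetical raw Nanopore read file
    canu -correct -p asm -d asm-correct genomeSize=4.8m \
        -nanopore-raw "$raw_reads" &&
    canu -trim -p asm -d asm-trim genomeSize=4.8m \
        -nanopore-corrected asm-correct/asm.correctedReads.fasta.gz &&
    canu -assemble -p asm -d asm-assemble genomeSize=4.8m \
        -nanopore-corrected asm-trim/asm.trimmedReads.fasta.gz
}
```

The && chaining stops the sequence as soon as one step fails, so a later step never starts from incomplete output.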
Legal
Canu is derived from Celera Assembler and includes code from many other projects. Most, but not all, of the code is GPL licensed. See the README.licenses file and individual source code files for details.