Skip to content

BilkentCompGen/mrcanavar

Repository files navigation

mrCaNaVaR analyzes the mapping read depth to discover large segmental duplications and deletions. It also has the capability of predicting absolute copy numbers of genomic intervals.

Citation

  • Personalized copy number and segmental duplication maps using next-generation sequencing. Can Alkan, Jeffrey M. Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O. Kitzman, Carl Baker, Maika Malig, Onur Mutlu, S. Cenk Sahinalp, Richard A. Gibbs, Evan E. Eichler. Nature Genetics, Oct, 41(10):1061-1067, 2009.
  • Whole-genome shotgun sequence CNV detection using read depth. Fatma Kahveci and Can Alkan. Methods Mol Biol., 1833:61-72, 2018.

Basics & Definitions

Pronunciation:

mrCaNaVaR is pronounced as /mɪstər ʤanavar/, which sounds like "mister juh-nuh-vuhR".

Requirements:

  • mrFAST or mrsFAST for read mapping stage. No other mapper is recommended to use with mrCaNaVaR for characterization of duplications and absolute copy number prediction. Deletion discovery might work, but other mappers are untested.
  • zlib for the ability to read compressed SAM files.
  • C compiler (mrCaNaVaR is developed with gcc version 4.1.2)

Please download the latest version from our download page and then unzip the downloaded file. Run 'make' to build mrFAST.

Coordinate space:

mrCaNaVaR uses BED-style coordinates for genomic locations (chrom/chromStart/chromEnd) to facilitate easier integration with genome browsers and tools like BEDtools. The first base in a chromosome is numbered 0. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

Sliding windows:

mrCaNaVaR operates on three classses of sliding windows over the reference genome to estimate absolute copy numbers and discover duplicated and deleted regions. These three windows are:

  • Long Windows (LW) are used to find locations of deletions and duplications. The LWs are overlapping. By default the window size size is 5000 bp of non-masked characters (i.e. A, C, G, T only), and the slide size is 1000 bp of any characters. For example, in repeat masked NCBI Build 35, the first two LWs are:

        chr1    0       6792
        chr1    1000    6792
        chr1    2000    7203
    
        Here, the first window has 6792 any characters, but 5000 non-N characters (thus 1792 N characters). 
        We then slide the window by 1000 "any" base pairs, starting the second window at position 1000.  
        The end of the second window is again 6792 since all of the first 1000 characters of chr1 is masked. 
        The third window starts at position 2000 and contains 5203 any characters, where 5000 are non-N.
    
  • Short Windows (SW) are used to refine breakpoints of detected duplication and deletion intervals. Default window size is 1000 bp of non-masked characters, and the slide size is 1000 bp of any characters, similar to LW.

  • Copy Windows (CW) are used to predict absolute copy numbers. CWs are non-overlapping windows. Default window size is 1000 bp of non-masked characters.

Before using mrCaNaVaR

  • Mask the repeats in your reference genome assembly. For the human genome assemblies (NCBI Build 35, 36, GRCh37) we recommend both:
  • RepeatMasker with sensitive "-s" parameter enabled. The repeat masked versions of many genome assemblies are available at the UCSC Genome Browser.
  • Tandem Repeats Finder (parameters: trf input_file 2 7 7 80 10 50 500 -m)
  • The repeats should be masked with N characters in your genome. Lowercase letters will be treated as non-repeats.

Read Mapping:

  • Please use either mrFAST or mrsFAST in single-read mapping mode (not paired-end). Recommended edit/hamming distance threshold is 5% of the read length (i.e. 2 for 36 bp reads, 4 for 76 bp reads, etc.). Using other aligners might be possible if all map locations (including suboptimal alignments) are reported, however, we strongly recommend mrFAST or mrsFAST that were specifically developed for SV/CNV analysis.

  • Use the same repeat masked version of the reference genome described above for building the mrFAST/mrsFAST index and read mapping.

Using mrCaNaVaR:

  • Mask the common and tandem repeats in your reference genome assembly of choice. Replace repeat characters with N.
  • Build mrFAST or mrsFAST search index using the same genome.
  • Map your reads with mrFAST or mrsFAST.
  • Prepare genome configuration file with mrCaNaVaR (PREP mode; see below).
  • Calculate read depth values from the mapping information generated by mrFAST or mrsFAST using mrCaNaVaR (READ mode; see below).
  • Call segmental duplication and deletion intervals, and predict absolute copy numbers using mrCaNaVaR (CALL mode; see below).

Run Modes

mrCaNaVaR has three run modes

  • PREP mode (--prep) is used to prepare a reference genome for analysis. Users need to run mrCaNaVaR in this mode only once for a new reference genome assembly (or versions of it with different repeat masking or different window sizes). In this mode mrCaNaVaR will:

    • Load reference genome file (FASTA format), calculate the coordinates and GC content of three window classes (LW, SW, and CW).
    • Save this information in a "configuration" file.

    Required files and parameters:

    • fasta: Repeat masked reference genome in FASTA format (e.g. hg17_masked.fa)
    • gaps: Coordinates of assembly gaps in the reference genome in BED format (e.g. hg19.gaps.bed)

    Output:

    • conf: Reference configuration file. This file is in a native binary format. (e.g. hg17.cnvr)

    Sample command line:

      mrcanavar --prep -fasta hg17_masked.fa -gaps hg17.gaps.bed -conf hg17.cnvr
    

    Optional parameters:

      -lw_size   <lw_size>  : Long window span size. Default is 5000.
      -lw_slide  <lw_slide> : Long window slide size. Default is 1000.
      -sw_size   <sw_size>  : Short window span size. Default is 1000.
      -sw_slide  <sw_slide> : Short window slide size. Default is 1000.
      -cw_size   <cw_size>  : Copy number window size. Default is 1000.
      -pseudoa   <bed_file> : Coordinates for pseudoautosomal regions in the 
              		        reference genome in BED format.
    
  • READ mode (--read) is used after read mapping stage to load mapping information and calculate read depths over the LW, SW, and CW intervals. The mapping information should be in SAM format; either uncompressed (.sam), or compressed with gzip (.sam.gz). Both mrFAST and mrsFAST are capable of generating map files in plain SAM format and gzip-compressed SAM. In this mode mrCaNaVaR will:

    • Load reference configuration file.
    • Scan a given directory for *.sam or *.sam.gz files, and calculate the read depth in three window classes.
    • Save the read depth information in a depth output file in a native binary format. In addition it will generate three human-readable text filesfor read depths and GC contents of LW, SW, and CW intervals.
    • Perform GC correction of the read depth values over the LW, SW, and CW intervals. This is new due to the CONC mode and multiple library support.

    Required files:

    • conf: Reference configuration file created using the PREP mode.
    • samdir: Plain SAM files (.sam) or gzip-compressed SAM files (.sam.gz) in a directory (e.g. sam_files_are_here/).

    Output:

    • depth: Read depth file in a native binary format.
    • Three human readable text files that contain the same information as the file above. These three files are for manual inspection purposes only, if needed, and will not be loaded and needed by mrCaNaVaR.

    Sample command line:

      mrcanavar --read -conf hg17.cnvr -samdir sam_files_are_here/ -depth mysample.depth
    

    Optional parameters:

      --gz                    : Indicates the SAM files are compressed in gzip format.
      -nsam <first> <second>  : Choose a sequence of SAM files between two indexes.
    
  • CONC mode (--conc) is used to merge >=2 *.depth files created using the READ mode. The CONC mode is optional, and needed only if the SAM files are read into a number of *.depth files; or, if there are more than one library preparation (which should be read into different *.depth files). In this mode, mrCaNaVaR will:

    • Load reference configuration file.
    • Load read depth files through -concdepth parameter.
    • Merge depth files and output the merged LW, SW, and CW intervals.

    Required files:

    • conf: Reference configuration file created using the PREP mode.
    • concdepth: List of read depth files to merge.

    Output:

    • depth: Read depth file in a native binary format.
    • Three human readable text files that contain the same information as the file above. These three files are for manual inspection purposesonly, if needed, and will not be loaded and needed by mrCaNaVaR.

    Sample command line:

      mrcanavar --conc -conf hg17.cnvr -concdepth lib1.depth lib2.depth lib3.depth -depth merged.depth
    
  • CALL mode (--call) is the final step in CNV analysis using mrCaNaVaR. In this mode, mrCaNaVaR will:

    • Load reference configuration file.
    • Load read depth file.
    • Perform GC correction of the read depth values over the LW, SW, and CW intervals if the *.depth file is created with mrCaNaVaR version earlier than 0.5.
    • Predict absolute copy number over the CW intervals.
    • Predict duplicated and deleted regions in the analyzed sample using the LW and SW intervals.

    Required files:

    • conf: Reference configuration file created using the PREP mode.
    • depth: Read depth file created using the READ mode.

    Output:

    • o: Output file prefix (out_prefix).

    Output files:

    • out_prefix.log: Log file that describes what mrCaNaVaR did. Includes read depth statistics.
    • out_prefix.cw_norm.bed: GC-corrected read depth values for the inferred control regions for CW intervals.
    • out_prefix.copynumber.bed: Estimated copy numbers over CW intervals.

    Sample command line:

      mrcanavar --call -conf hg17.cnvr -depth mysample.depth -o mysample
    

    Optional parameters:

      --xx : Set gender of the sequenced sample as female. Mammalian genomes only. 
      			Default is autodetect.
      --xy : Set gender of the sequenced sample as male. Mammalian genomes only. 
      			Default is autodetect.
      --multgc : Perform multiplicative GC correction. Default is additive.
      -mindup <min_dup_len>   : Minimum duplication length. Default is 10000.
      -gene <genelist>        : Coordinates for genes for calculating gene-based copy numbers.
      -cont_win <cwin_number> : Contiguous window number to look for high/low depth windows. 
      			  Default is 7.
      -cut_win  <cwin_cutoff> : Window number cutoff. Default is 6.
      --verbose : Verbose output.
    

Automated run.

To compile:

make && make auto

mrcanavar-auto is a wrapper for mrsFAST and mrCaNaVaR. It assumes that both mrsFAST and mrCaNaVaR are available in your PATH. It also assumes that the mrsFAST index file is available in the same directory as the reference genome. Simply run:

mrcanavar-auto --ref repeat_masked_ref.fa --input file1.fastq.gz,file2.fastq.gz,file3.fastq.gz --conf repeat_masked_ref.cnvr --gene repeat_masked_ref.genes.bed

Full parameter list:

--input [list]: comma-separated FASTQ(.gz) files.
--aln-input [list]: comma-separated BAM/CRAM files (alternative to --input).
--ref [ref.fa]: repeat masked reference genome. mrsFAST index (ref.fa.index) should be in the same directory.
--unmasked-ref [unmaskedref.fasta]: Ogirinal unmasked reference  genome. This is required if --aln-input is used for BAM/CRAM files.
--conf [ref.cnvr]: mrCaNaVaR config file.
--gene [genes.bed]: List of genes.
--threads [int]: number of threads for mrsFAST.
--kmer [int]: Cropping length for mrsFAST mapping. Default is 36.
--ploidy [int]: Ploidy value for the organism. Default is 2.
--mincnv [int]: Minimum length for CNVs to call. Default is 10000.
--skip-mapping: Skip mrsFAST mapping. Use this only if you have the mapping output and rerunning mrCaNaVaR for some reason.
--cloud: Cloud mode, the directory info from the input file names will be stripped for output file generation.
--no-gz: Do not compress mrsFAST output. This option will generate larger files, but it will save some run time.
--no-sam: Do not generate SAM(.gz) output files. Pipe the mrsFAST output directly into mrCaNaVaR. This option does not work with multiple FASTQ files as input.
--dry-run: Do not execute commands, just print them.