HiCCUPS
HiCCUPS is an algorithm for finding chromatin loops.
Most users will only need the basic usage (more detailed usage is described below):
hiccups [-m matrixSize] [-c chromosome(s)] [-r resolution(s)]
<HiC file> <outputDirectory>
Upon a successful run of HiCCUPS (depending on parameters used), the outputDirectory will contain something similar to:
.../outputDirectory/
.../outputDirectory/merged_loops (the final looplist - this is likely what you'll use)
.../outputDirectory/enriched_pixels_10000
.../outputDirectory/enriched_pixels_5000 (contains raw enriched pixels from GPU)
.../outputDirectory/fdr_thresholds_10000
.../outputDirectory/fdr_thresholds_5000 (threshold values used to calculate enrichment)
.../outputDirectory/postprocessed_pixels_10000
.../outputDirectory/postprocessed_pixels_5000 (clustered pixels for each resolution)
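As a rough illustration of the listing above, the following Python sketch enumerates the paths one might expect for a given set of resolutions. The function name `expected_outputs` is hypothetical and not part of HiCCUPS; the file-name pattern is taken only from the example listing, and the actual contents depend on the parameters used.

```python
from pathlib import PurePosixPath

def expected_outputs(output_dir, resolutions):
    """List the paths shown in the example listing for the given resolutions."""
    root = PurePosixPath(output_dir)
    paths = [root / "merged_loops"]  # the final loop list
    for prefix in ("enriched_pixels", "fdr_thresholds", "postprocessed_pixels"):
        # Descending order only mirrors the order of the listing above.
        for res in sorted(resolutions, reverse=True):
            paths.append(root / f"{prefix}_{res}")
    return [str(p) for p in paths]

for path in expected_outputs("hiccups_results", [5000, 10000]):
    print(path)
```

This can be handy for checking that a run produced everything it was expected to before moving on to downstream steps.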
The merged_loops file uses the loop list format described below.
NOTE: HiCCUPS will choose appropriate defaults for the HiC file if no parameters are specified.
hiccups local/folder/HIC006.hic local/folder/hiccups_results
This command will run HiCCUPS on HIC006 and save all found loops in the hiccups_results folder. It will check the density of the Hi-C map to determine appropriate parameters for running HiCCUPS on the provided .hic file.
hiccups -m 1024 -r 5000,10000 -c 22 https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined_30.hic hiccups_results
This command will run HiCCUPS on chromosome 22 of GM12878 using default parameters for 5kB and 10kB resolutions. The results will be merged and saved in the hiccups_results folder. The GPU will use matrix slices of size 1024x1024 (likely speeding up the computation on a dedicated GPU).
hiccups [-m matrixSize] [-c chromosome(s)] [-r resolution(s)] [--threads num_threads]
[-k normalization (NONE/VC/VC_SQRT/KR)] [-f fdr]
[-p peak width] [-i window] [-t thresholds]
[-d centroid distances] <HiC file> <outputDirectory> [specified_loop_list]
The required arguments are:
- <HiC file>: Address of HiC file which should end with ".hic". This is the file you will load into Juicebox. URLs or local addresses may be used. HiCCUPS should be run using MAPQ>30 hic files.
- <outputDirectory>: Directory containing final merged list of all loops found by HiCCUPS. Can be visualized directly in Juicebox as a 2D annotation. By default, various values critical to the HICCUPS algorithm are saved as attributes for each loop found. These can be disabled using the suppress flag below. Intermediate files created by HiCCUPS (raw pixels from GPU, clustered pixels for each resolution, and FDR thresholds) will also be saved in this directory.
The optional arguments are:
- specified_loop_list <Loop List>: An optional positional argument which should point to a Juicebox formatted loop list. HiCCUPS will then return enrichments for these specified loops for each resolution. Starting with version 1.12.03, the given pixels are post-processed at each resolution and the results are merged across resolutions. This creates files in addition to those created by prior versions. The CPU version only searches near the diagonal (in order to run in a reasonable amount of time), so it will not include regions far from the diagonal.
- -m <int>: Maximum size of the submatrix within the chromosome passed on to the GPU (must be an even number greater than 40 to prevent issues when running the CUDA kernel). The upper limit will depend on your GPU. Dedicated GPUs should be able to use values such as 500, 1000, or 2048 without trouble. Integrated GPUs are unlikely to run sizes larger than 90 or 100. Matrix size does not affect the result, only the time HiCCUPS takes to run. Larger values (with a dedicated GPU) will run fastest.
- -c <String(s)>: Chromosome(s) on which HiCCUPS will be run. The number/letter for the chromosome can be used with or without the "chr" prefix. Multiple chromosomes can be specified using commas (e.g. 1,chr2,X,chrY).
- -r <int(s)>: Resolution(s) for which HiCCUPS will be run. Multiple resolutions can be specified using commas (e.g. 25000,10000,5000). Due to the nature of DNA looping, it is unlikely that loops will be found at lower resolutions (i.e. 50kB or 100kB). IMPORTANT: if multiple resolutions are used, the flags below can be configured so that different parameters are used for the different resolutions.
- -k <NONE/VC/VC_SQRT/KR>: Normalization to use (case sensitive). Generally, KR (Knight-Ruiz) balancing should be used when available.
- -f <float(s)>: FDR values, which correspond to max_q_val (e.g. for 1% FDR use 0.01; for 10% FDR use 0.1). Different FDR values can be used for each resolution using commas (e.g. "-r 5000,10000 -f 0.1,0.15" would run HiCCUPS at 10% FDR for resolution 5000 and 15% FDR for resolution 10000).
- -p <int(s)>: Peak width used for finding enriched pixels in HiCCUPS. Different peak widths can be used for each resolution using commas (e.g. "-r 5000,10000 -p 4,2" would run at peak width 4 for resolution 5000 and peak width 2 for resolution 10000).
- -i <int(s)>: Window width used for finding enriched pixels in HiCCUPS. Different window widths can be used for each resolution using commas (e.g. "-r 5000,10000 -i 10,6" would run at window width 10 for resolution 5000 and window width 6 for resolution 10000).
- -t <floats>: Thresholds for merging loop lists of different resolutions. Four values must be given, separated by commas (e.g. 0.02,1.5,1.75,2). These thresholds represent, in order:
  1. the threshold allowed for the sum of FDR values of the horizontal, vertical, donut, and bottom left filters (an accepted loop must stay below this threshold)
  2. the threshold ratio that both the horizontal and vertical filters must exceed
  3. the threshold ratio that both the donut and bottom left filters must exceed
  4. the threshold ratio that at least one of the donut and bottom left filters must exceed
- -d <ints>: Distances used for merging nearby pixels to a centroid. Different distances can be used for each resolution using commas (e.g. "-r 5000,10000 -d 20000,21000" would merge pixels within 20kB of each other at 5kB resolution and within 21kB at 10kB resolution).
- --threads <int>: Number of threads to use (HiCCUPS is multi-threaded). As of Juicer Tools Version 1.13.02, the default number of threads used is 1. Passing in a value of 0 will result in the jar calculating the number of available threads. Passing in a value >0 will result in that value being used directly.
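One way to read the four -t values is as a per-loop filter. The Python sketch below is only an interpretation of the description above, not the HiCCUPS implementation: it assumes that each "ratio" is the observed count divided by the expected count for that filter, that all comparisons are strict, and that a loop is available as a dict keyed by the field names of the merged loop list. The function name `passes_merge_thresholds` is hypothetical.

```python
FILTERS = ("horizontal", "vertical", "donut", "bottom_left")

def passes_merge_thresholds(loop, thresholds=(0.02, 1.5, 1.75, 2.0)):
    """Apply the four -t thresholds to one loop (assumed interpretation)."""
    fdr_sum_max, hv_min, both_min, either_min = thresholds
    # Threshold 1: the summed FDR values must stay below the limit.
    fdr_sum = sum(loop[f"fdr_{name}"] for name in FILTERS)
    # Assumption: "ratio" means observed / expected for each filter.
    ratio = {name: loop["observed"] / loop[f"expected_{name}"] for name in FILTERS}
    return (
        fdr_sum < fdr_sum_max
        and ratio["horizontal"] > hv_min and ratio["vertical"] > hv_min   # threshold 2
        and ratio["donut"] > both_min and ratio["bottom_left"] > both_min  # threshold 3
        and (ratio["donut"] > either_min or ratio["bottom_left"] > either_min)  # threshold 4
    )

# Illustrative values only, not real data.
strong = {"observed": 100.0,
          "expected_horizontal": 40.0, "expected_vertical": 40.0,
          "expected_donut": 40.0, "expected_bottom_left": 40.0,
          "fdr_horizontal": 0.001, "fdr_vertical": 0.001,
          "fdr_donut": 0.001, "fdr_bottom_left": 0.001}
weak = dict(strong, expected_donut=60.0, expected_bottom_left=55.0)
print(passes_merge_thresholds(strong), passes_merge_thresholds(weak))
```

In the second example the donut ratio (100/60) falls below 1.75, so the loop fails the third threshold even though its FDR values are small.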
A Colab notebook with an example run is also available.
hiccups -m 500 -r 5000,10000 -f 0.1,0.1 -p 4,2 -i 7,5 -d 20000,20000,0 -c 22 HIC006.hic all_hiccups_loops
This command will run HiCCUPS on chromosome 22 of HIC006 at 5kB and 10kB resolution using the following values:
- 5kB: fdr 10%, peak width 4, window width 7, and centroid distance 20kB
- 10kB: fdr 10%, peak width 2, window width 5, and centroid distance 20kB
The resulting merged loop list as well as intermediate results will be saved in the all_hiccups_loops folder.
Note that these are values used for generating the GM12878 loop list, and that we could have also simply called
hiccups -m 500 -r 5000,10000 -c 22 HIC006.hic all_hiccups_loops
to achieve the same results.
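When scripting many runs, it can help to assemble the argument list programmatically and catch obvious mistakes before launching. The following Python sketch builds the tokens of a command like the ones above. The helper `build_hiccups_args` is hypothetical; the check that -f, -p, and -i lists match the number of resolutions is an assumption of this sketch rather than a documented rule; and how the `hiccups` command itself is launched depends on your installation. The -d flag is left out because the examples on this page show -d lists whose length differs from the -r list.

```python
def build_hiccups_args(hic_file, output_dir, resolutions, matrix_size=None,
                       chromosomes=None, fdr=None, peak_width=None, window=None):
    """Return the command tokens for a HiCCUPS run (a sketch, not an official API)."""
    args = ["hiccups"]
    if matrix_size is not None:
        # Documented constraint: an even number greater than 40.
        if matrix_size <= 40 or matrix_size % 2 != 0:
            raise ValueError("-m must be an even number greater than 40")
        args += ["-m", str(matrix_size)]
    if chromosomes:
        args += ["-c", ",".join(chromosomes)]
    args += ["-r", ",".join(str(r) for r in resolutions)]
    for flag, values in (("-f", fdr), ("-p", peak_width), ("-i", window)):
        if values is None:
            continue
        # Assumption: one value per resolution, in the same order as -r.
        if len(values) != len(resolutions):
            raise ValueError(f"{flag} needs one value per resolution")
        args += [flag, ",".join(str(v) for v in values)]
    return args + [hic_file, output_dir]

args = build_hiccups_args("HIC006.hic", "all_hiccups_loops", [5000, 10000],
                          matrix_size=500, chromosomes=["22"],
                          fdr=[0.1, 0.1], peak_width=[4, 2], window=[7, 5])
print(" ".join(args))
```

Returning a list of tokens rather than a single string keeps the result usable with `subprocess.run` without shell quoting concerns.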
These are the default parameters for HiCCUPS, listed with all the available flags, as described in Rao, Huntley et al. Cell 2014.
Medium resolution maps:
-m 512
-c (all chromosomes)
-r 5000,10000,25000
-k KR
-f .1,.1,.1
-p 4,2,1
-i 7,5,3
-t 0.02,1.5,1.75,2
-d 20000,20000,50000
High resolution maps:
-m 512
-c (all chromosomes)
-r 5000,10000
-k KR
-f .1,.1
-p 4,2
-i 7,5
-t 0.02,1.5,1.75,2
-d 20000,20000,50000
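The comma-separated values pair up with the resolutions by position. As a small illustration, the Python sketch below expands the medium resolution defaults listed above into one settings record per resolution. The function name `per_resolution_settings` and the dictionary keys are hypothetical, chosen only for readability.

```python
def per_resolution_settings(resolutions, fdr, peak_width, window, centroid_distance):
    """Pair comma-separated flag values with resolutions by position."""
    def split(text, cast):
        return [cast(item) for item in text.split(",")]

    keys = split(resolutions, int)
    columns = {
        "fdr": split(fdr, float),
        "peak_width": split(peak_width, int),
        "window": split(window, int),
        "centroid_distance": split(centroid_distance, int),
    }
    return {
        res: {name: values[i] for name, values in columns.items()}
        for i, res in enumerate(keys)
    }

# Medium resolution defaults: -r, -f, -p, -i, -d
settings = per_resolution_settings("5000,10000,25000", ".1,.1,.1",
                                   "4,2,1", "7,5,3", "20000,20000,50000")
for res, values in settings.items():
    print(res, values)
```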
The merged loop list created by HiCCUPS will start with a header line, followed by a line for every loop. By default, the file should contain 20 fields per line in the following format:
chromosome1 x1 x2 chromosome2 y1 y2 color observed
expected_bottom_left expected_donut expected_horizontal expected_vertical
fdr_bottom_left fdr_donut fdr_horizontal fdr_vertical
number_collapsed centroid1 centroid2 radius
Note: If you also run Motif Finder, additional fields will be created.
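A minimal Python sketch of reading this format is shown below. It assumes that fields are separated by whitespace (tabs or spaces), that the first non-empty line is the header, and that every field other than the chromosomes and the color is numeric. The function name `parse_merged_loops` is hypothetical, and the sample row contains made-up values for illustration only. Any fields beyond the first 20 (for example, those added by Motif Finder) are kept unparsed under an "extra" key.

```python
FIELDS = [
    "chromosome1", "x1", "x2", "chromosome2", "y1", "y2", "color", "observed",
    "expected_bottom_left", "expected_donut", "expected_horizontal", "expected_vertical",
    "fdr_bottom_left", "fdr_donut", "fdr_horizontal", "fdr_vertical",
    "number_collapsed", "centroid1", "centroid2", "radius",
]
TEXT_FIELDS = {"chromosome1", "chromosome2", "color"}

def parse_merged_loops(text):
    """Parse loop-list text into a list of dicts, skipping the header line."""
    rows = [line for line in text.splitlines() if line.strip()]
    loops = []
    for line in rows[1:]:  # rows[0] is the header line
        values = line.split()
        if len(values) < len(FIELDS):
            raise ValueError(f"expected at least {len(FIELDS)} fields, got {len(values)}")
        loop = {}
        for name, value in zip(FIELDS, values):
            loop[name] = value if name in TEXT_FIELDS else float(value)
        loop["extra"] = values[len(FIELDS):]  # e.g. Motif Finder fields, left as text
        loops.append(loop)
    return loops

# Made-up sample: a header line followed by one loop.
sample = "\n".join([
    "\t".join(FIELDS),
    "\t".join(["22", "17395000", "17400000", "22", "17535000", "17540000",
               "0,255,255", "100", "40", "41", "42", "43",
               "0.001", "0.002", "0.003", "0.004",
               "3", "17397500", "17537500", "5000"]),
])
loops = parse_merged_loops(sample)
print(loops[0]["chromosome1"], loops[0]["observed"], loops[0]["radius"])
```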
Explanations of the 20 fields are as follows:
- chromosome = the chromosome that the loop is located on
- x1,x2 = the coordinates of the upstream locus corresponding to the peak pixel
- y1,y2 = the coordinates of the downstream locus corresponding to the peak pixel
- color = the color that the feature will be rendered as if loaded in Juicebox
- observed = the raw observed counts at the peak pixel
- expected_[bottom_left, donut, horizontal, vertical] = the expected counts calculated using the [bottom_left, donut, horizontal, vertical] filter
- fdr_[bottom_left, donut, horizontal, vertical] = the q-value of the loop calculated using the [bottom_left, donut, horizontal, vertical] filter
- number_collapsed = the number of pixels that were clustered together as part of the loop call
- centroid1 = the upstream coordinate of the centroid of the cluster of pixels corresponding to the loop
- centroid2 = the downstream coordinate of the centroid of the cluster of pixels corresponding to the loop
- radius = the Euclidean distance from the centroid of the cluster of pixels to the farthest pixel in the cluster of pixels
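To make the centroid and radius fields concrete, here is a small Python sketch. It assumes the centroid is the unweighted mean of the pixel coordinates in a cluster, which is a simplification for illustration; the actual clustering procedure is described in the paper cited below. The function name `centroid_and_radius` is hypothetical.

```python
import math

def centroid_and_radius(pixels):
    """Return (centroid1, centroid2, radius) for a cluster of (x, y) pixels.

    Assumption: the centroid is the plain mean of the pixel coordinates.
    The radius is the Euclidean distance from the centroid to the farthest pixel.
    """
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    radius = max(math.hypot(x - cx, y - cy) for x, y in pixels)
    return cx, cy, radius

# Four neighboring pixels at 10kB spacing (illustrative coordinates).
cluster = [(0, 0), (10000, 0), (0, 10000), (10000, 10000)]
print(centroid_and_radius(cluster))
```

For this square cluster the centroid sits at (5000, 5000), and the radius is the distance to any corner, about 7071 bases.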
See section VI.a.5.iv of the Extended Experimental Procedures of Rao, Huntley et al. Cell 2014 for more details.