- Link to original ABC paper: use
snakeABC_SE.smk
- Yeolab paired-end protocol: use
snakeOligoCLIP_PE.smk
- Main environment: Snakemake 7.3.8 and scipy:
- Snakemake Installation
- install snakemake 7.3.8 using
rules/envs/snakemake.yaml
. - Snakemake 8 has different command line options that will need modification in
--profile
- Singularity 3.11: Singularity.
- If you are on a server, ask the sys admin to install it. Sometimes there are weird permission issue if you install on your own.
- Not recommended: install via conda
- Download this repository by
git clone https://github.com/YeoLab/Mudskipper.git
.
- Download data from SRA.
utils/download_sra.smk
this script can help you download.
- Prepare config and manifest
PATH_TO_YOUR_CONFIG
. Example inputs:- config file:
config/preprocess_config/oligose_k562.yaml
- manifest:
config/fastq_csv/ABC_2rep.csv
- barcode csv:
config/barcode_csv/ABC_barcode.csv
- config file:
- Adjust profile for your cluster and computing resource:
- see profiles/tscc2 as an example
- for each option, see documentation
- Run snakemake
snakemake -s snakeABC_SE.smk \ --configfile config/preprocess_config/oligose_k562_noalt.yaml \ --profile profiles/tscc2 \ -n
-s
: usesnakeOligoCLIP_PE.smk
if you did YeoLab internal pair-end protocol. usesnakeABC_SE.smk
if you did ABC--configfile
: yaml file to specify your inputs, including where are the fastqs, what are the barcode, what reference genome...etc.-n
: dry run.- the rest of the options are in
--profile
. Adjust as needed. see documentation
Refer to the following sections to learn what to include in your configuration.
- Multiplex Example:
- Yeo lab internal pair-end protocol:
config/preprocess_config/oligope_iter5.yaml
- ABC single-end protocol:
config/preprocess_config/ABC_2rep.yaml
- Yeo lab internal pair-end protocol:
- Singleplex Example:
- ABC single-end protocol:
config/preprocess_config/oligose_single_rbfox2_hek.yaml
- Yeo lab internal paired-end protocol:
config/preprocess_config/oligope_v5_nanos2.yaml
- Process 1 type of singleplex per 1 manifest.
- ABC single-end protocol:
All FASTQ files in the same manifest should have the same combinations of barcodes. For example, if you performed two singleplex experiments—one with barcode 1 and the other with barcode 2—they should be specified in two separate config files.
- Example files:
- Multiplex Example:
- Yeo lab paired-end:
config/fastq_csv/katie_pe_iteration5.csv
- ABC:
config/fastq_csv/ABC_2rep.csv
- Yeo lab paired-end:
- Singleplex Example:
- Yeo labe paired-end:
config/fastq_csv/V5_NANOS2.csv
- ABC:
/config/fastq_csv/ABC_SLBP_singleplex.csv
- Yeo labe paired-end:
- Multiplex Example:
- columns:
fastq1
&fastq2
: *.fastq.gz file for read1 and read 2libname
: unique names for each library. Should not contain space, special characters such as #,%,*experiment
: unique names for experiment. Rows with the sameexperiment
will be treated as replicates. Should not contain space, special characters such as #,%,*
- Example:
config/barcode_csv/iter5.csv
- Notebook to generate this file (Yeolab internal user):
utils/generate barcode-iter5.ipynb
- delimiter:
:
- columns:
- 1st column: barcode sequence
- YeoLab internal protocol: read 2 starts with this sequence. Double check with
zcat READ2.fastq.gz | grep BARCODE_SEQ
. This sequence is reverse complement to the adapter sequence (see notebook for detail) - ABC: read starts with this sequence.
- YeoLab internal protocol: read 2 starts with this sequence. Double check with
- 2nd column: Antibody/RBP name, Should not contain space, special characters such as #,%,*.
- 1st column: barcode sequence
WORKDIR
: output directoryRBP_TO_RUN_MOTIF
: list of RBP names to run motif analysis. Must be one of the rows inbarcode_csv
.
Although the Mudskipper paper describes the Dirichlet Multinomial Mixture (DMM) algorithm, this pipeline supports all other peak callers and background models benchmarked in the paper. Below are options for running them. For everyday use, it is recommended to set all of these options to False unless you are specifically trying to regenerate all results.
run_clipper
: True if you want CLIPper outputs (works, but slow)run_skipper
: True if you want to run Skipper. (usually doesn't work in ABC)run_comparison
: True if you want to run other peak caller such as Piranha and OmniCLIPDB_FILE
andGENOME_dir
: are the omniclip files. check out their github on how to run them
debug
: True if you want to debug. This tries to blast the unmapped reads.
By default, if the options below are left blank, the pipeline runs the Dirichlet Multinomial Mixture (DMM) algorithm for multiplex datasets, where RBPs are explicitly compared against each other. DMM is the most effective model for multiplex datasets. There is an option to compare to 'internal control' such as a spike-in or IgG etc with a barcode as background.
Unfortunately, DMM is not compatible with singleplex datasets. Calling binding sites in singleplex datasets requires an 'external control' (see details below). Without an external control, the process will stop at the read counting stage. The best external control according to Mudskipper benchmark is an eCLIP SMInput.
The backgrounds can be applied to other peak callers such as Skipper, CLIPper, and the Beta-Binomial Mixture model.
If you want to include a background library, here's how to do it:
AS_INPUT
: If you have an IgG antibody/spike-in/bead-only control in the multiplex experiment that will serve as the normalization reference, enter its name here. It must be one of the rows in the barcode_csv file.
"External control": a library that is NOT in the same fastq as your oligoCLIP/ABC to serve as the background.
- specify them in
external_bam
with name of the library (first line, exeCLIP_SLBP_SMInput
), followed byfile:
andINFORMATIVE_READ
# For example: eCLIP_SLBP_SMInput: file: /tscc/nfs/home/hsher/ps-yeolab5/ENCODE_k562_noalt/output/bams/dedup/genome_R2/SLBP_IN_1.genome.Aligned.sort.dedup.R2.bam INFORMATIVE_READ: 2
- This can be an eCLIP SMInput, total RNA-seq, IgG pull down from another experiment, bead control, spike-ins
- How to generate them? the bams must be processed with the exact same STAR index as
STAR_DIR
, and is recommended to be processed with the same/similar mapping parameters as this repo or skipper.
adaptor_fwd
,adaptor_rev
: adapter sequence to trim. Do not include barcodetile_length
: we tile adapter sequences of this length so that indels don't mess up with trimmingQUALITY_CUTOFF
: default 15. cutadapt paramsumi_length
: Length of unique molecular identifier (UMI).STAR_DIR
: directory to STAR index
- skipper annotations: follow skipper instructions or generate with skipper_utils
- Yeolab internal users: Brian had all sorts of annotations here
/tscc/projects/ps-yeolab4/software/skipper/1.0.0/bin/skipper/annotations/
.
- Yeolab internal users: Brian had all sorts of annotations here
CHROM_SIZES
GENOMEFA
These are in the EXPERIMENT_NAME
folders. For example, in your manifest.csv, there are two experiments, "GN_1019" and "GN_1020", then, under the GN_1019/
folder you would see the following:
fastqs
: The trimmed and the demultiplexed fastqs.all.Tr.fq*
is the adapter and UMI removed onesultraplex*RBP.fastq.gz
are the demultiplexed.
bams
: bam files.*rmDup.Aligned.sortedByCoord.out.bam
is the PCR deduplicated bams. This is usually the bam you would want to use for other analysis!*Aligned.sortedByCoord.out.bam
is before deduplication.
bw
contains bigwig files!CITS
calculates the number of 5' read stops per nucleotide position. 5' read stop is presumed to be the crosslinking site.COV
calculates read coverage.
bw_bg
contains bigwigs of "complementary control(CC)", namely, summing all the other RBPs together.
These contain numbers that will help you debug your protocol. To look at them swiftly, I personally like to install VS code, and install some csv editing extension so the file can appear like a table. Everything is compiled into QC/summary.csv
.
cutadapt_stat.csv
contains how many times reads contain adapter, and how many reads are too short after adapter trimming.- One of my favorite column is "Reads/Pairs that were too short". When reads are too short after trimming, it means they are probably adapter dimer, or you fragment it too much.
fastQC*
are the summary from fastqs that has been trimmed.- CLIP reads are usually bad at GC content. So it is normal to see failed GC.
- I will always look at "adapter content" in
fastQC_passfail.csv
to make sure adapters are all gone. If this column fails, maybe you input the wrong adapter sequence, or gets contaminated by some other stuffs.
- Read count after demultiplexing.
demux_read_count.txt
. - Duplication:
dup_level.csv
. This file contains the number of reads after and before PCR deduplication. If the column "unique reads" is very low, you might have amplified too much! - Mapping statistics
mapping_stat.csv
.- "Unique mapped reads%" are the percentage of reads that map to only 1 position in the genome. These were called "Usable reads" in the old eCLIP terminology.
- "% of reads mapped to multiple loci" are the reads that map to 2~100 positions. These are mostly ribosomal RNAs. It is normal to see quite some (30-50%) multimapping reads in eCLIP
- Then it is the reason why the rest are unmapped:
- "% of reads mapped to multiple loci", "% of reads unmapped: too short", and "other reasons". These will help your bioinformatician debug what is going wrong. Common reasons to have lots of unmapped reads:
- Adapter trimming is bad. The read still contains adapter sequences when they enter mapping algorithms.
- Cells is contaminated. The baterial/fungal genomes gets sequences and does not map to human genome.
- "% of reads mapped to multiple loci", "% of reads unmapped: too short", and "other reasons". These will help your bioinformatician debug what is going wrong. Common reasons to have lots of unmapped reads:
- Read level metrices in
read_count/
folder:*cosine_similarity.csv
: Here we construct a vector, contains read counts per genomic bin for each RBP. Cosine similarity measure globally how two RBP are similar with each other. If you see big numbers, it means the two RBPs are very similar. In the Antibody barcoded CLIP paper we published, the cosine similarity is typically between 0.4~0.7. If all of the RBPs are similar, it can indicate loss of specificity.*gene_name.csv
: This contains how many read per gene for each RBP.*gene_type.csv
: contains how many read per gene/transcript type for each RBP.*region.csv
: This contains how many read per region for each RBP.- All of the above can help you see if there is the right enrichment and specificity for your multiplex CLIP.
This pipeline tries to integrate multiple peak callers/binding site finders and orchestrate secondary analysis (peaks per region, motifs etc).
-
beta-mixture*
considers enrichment of RBP reads against complementary control (CC) or an internal library(IgG)! It is not great at detecting shared binding sites. But it runs fast.- For biologist, all that matters is
*enriched_windows.tsv
. This contains the presumed binding sites.- column
logLR
measures confidence. The number represents the log likelihood ratio(LR) of a window being a binding site versus not, aka, how likely is it to observe the data if it is bound, versus it being not bound. A logLR of 2 means about 10x more likely to observe the data from binding than being something sh**ty. Higher means high confidence of it being a binding site -
p_bar
,p_raw
,fc_raw
,fc_bar
: Effect sizes-
*_bar
is the estimate from the model. -
*_raw
is directly calculated from counts: - p= (read in RBP)/(read in window)
- fc = ((read in RBP)/(read in window))/((all reads in RBP)/total reads)
-
- column
- Secondary analysis:
-
feature_logistic.pdf
andfeature_ridge.pdf
contains how likely each types of regions is bound. -
*summary.csv
: What gene types/transcript types/region types are bound. -
homer/
folder contains motifs -
finemapping
contains finemapped binding sites. It isn't great for splice site binding proteins. TODO: make algorithm better,
-
- Modelling outputs:
-
fit.rda
contains everything including models of various numbers of components(K). -
*alpha.tsv
contains the parameter alpha/beta for beta-binomial distribution for each component. -
*null_alpha.tsv
contains the parameter alpha/beta for a single component beta-binomial distribution. -
*goodness_of_fit.pdf
: AIC, BIC, log likelihood for model selection. -
*label_component.csv
: The labels (bound vs not bound) for each component. -
*mixture_weight.tsv
: This is namely badly. What is contains is$E[z_ik]$ which is "how likely each window belong to a cluster". -
*weights
: The "mixture weight".
-
- For biologist, all that matters is
-
DMM
(Dirichlet Multinomial Mixture) considers the distribution of reads among all RBPs without summing the rest into CC. This model detects shared binding site better than beta-mixture model. But it is so f* slow. (but still faster than CLIPper. ^.<)- The folder structure is the same.
- Biologists would only care for
*enriched_windows.tsv
which are the binding sites.
skipper_CC
models RBP versus complementary control.skipper_{INTERNAL_CONTROL_NAME}
models RBP versus an internal control library, e.g. IgG.skipper_external/{EXTERNAL_CONTROL_NAME}
models RBP versus an external control library, e.g. RNA-seq or SMInput.- For skipper outputs, see skipper's documentation!
CLIPper
only uses the IP to find local read enrichment.CLIPper_CC
contains local read enrichment, "normalized to" complementary control using chi-square or fisher exact test. This is what we publised in the original paper.CLIPper-{EXTERNAL_CONTROL_NAME}
: contains peaks "normalized to" an external control library. (SMInput or total RNA-seq)CLIPper.{INTERNAL_CONTROL_NAME}
: contains peaks "normalized to" an internal control library. (IgG)*normed.compressed.annotate.bed
is the final output. See the ENCODE pipeline for columns specification
- See their respective documentation for detail.
The meat of the pipeline is in rules/
:
se_preprocess
andpe_preprocess
takes fastq --> trim -> demultiplex -> deduplicate -> bamsQC
contains rules to assemble quality control statistics, and some additional debugging rules such as investigating unmapped reads and those without barcode.- rules for bigwig generation is in
make_track.smk
merge_bw.smk
sums up bigwigs to make complementary control.normalization_DMN
,repeat_DMN
contains Mudskipper code, which does mixture model/generative clustering in genomic windows and repeat windows.skipper.smk
,repeat.smk
is entirely stolen from skipperclipper.smk
runs CLIPper and the chi-square things. Stolen from ENCODE pipeline.analysis.smk
andfinemap.smk
: runs finemapping, motif detection from Skipper and MudSkipper
Some rules to help you debug
map_r1.smk
andmultimap.smk