- minimap2 (https://github.com/lh3/minimap2) (must be available via commandline)
- blast tools (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
- R and RSCRIPT (https://www.r-project.org/)
- Build all singularity images inside of the
simages
folder
- At the core of the proviral pipeline, data is read from
contigs.csv
andconseqs.csv
files produced by MiCall - First the pipeline reads through all of the contigs, then the contigs
- When it does this (see the
find_primers()
function) it applies the following logic in this order for filtering/tagging:- If a sample is not proviral, skip it. Do not attempt to find primers or anything, just log a message saying
sample X was skipped because it was non-proviral
- If a sample has 0 in the remap column of the
cascade.csv
file, tag that sequence with an error:No contig/conseq constructed
, do not analyze it or try to find primers, and write it to the*primer_analysis.csv
file (which records all failures) - If the
consensus-percent-cutoff
is NOTMAX
, tag it with an error:contig not MAX
and skip the sequence (do not try to find primers) - If the reference of the sample is
HIV1-CON-XX-Consensus-seed
tag that sequence with an error:is V3 sequence
, skip the sequence (do not try to find primers), and write it to the*primer_analysis.csv
file - If there is an
X
in the middle of the sequence, tag that sequence with an error:low internal read coverage
, skip the sequence (do not try to find primers), and write it to the*primer_analysis.csv
file - If there are ANY non-TCGA characters in the sequence, tag that sequence with an error:
contig sequence contained non-TCGA/gap
, skip the sequence (do not try to find primers), and write it to the*primer_analysis.csv
file - For each end (5' (fwd), 3' (rev)) of the sequence:
- If there are
X
characters found, try to remove them (if they are clustered) and if not possible to remove tag the fwd/rev end with a fwd/rev primer error:low read coverage in primer region
, skip the fwd/rev end (do not try to find primers) - If fwd/rev end has zero nucleotides found for primer, tag the fwd/rev end with a fwd/rev primer error:
primer was not found
, skip to the next end if any - If the fwd/rev primer is deemed not valid, tag the fwd/rev end with a fwd/rev primer error:
primer failed secondary validation
, skip to the next end if any
- If there are
- Write the sequence to the
*primer_analysis.csv
file regardless of tagged errors in any error column - Load the
*primer_analysis.csv
files for both contigs and conseqs and for both of them apply the following filters in order:- Remove all rows where either the
error
,fwd_error
, orrev_error
is tagged - Remove the primers from the sequences (for hivseqinr)
- Remove rows where sample name appears twice (duplicates)
- Remove rows where the reference contains
unknown
orreverse
- Remove all rows where either the
- Finally merge the filtered contigs and conseqs and write the final
*filtered.csv
file with conseqs taking precedence over contigs
- If a sample is not proviral, skip it. Do not attempt to find primers or anything, just log a message saying