GitHub - cfe-lab/proviral

Readme

Dependencies

minimap2 (https://github.com/lh3/minimap2) (must be available via commandline)
blast tools (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
R and RSCRIPT (https://www.r-project.org/)

Singularity builds

Build all singularity images inside of the simages folder

Filtering

At the core of the proviral pipeline, data is read from contigs.csv and conseqs.csv files produced by MiCall
First the pipeline reads through all of the contigs, then the contigs
When it does this (see the find_primers() function) it applies the following logic in this order for filtering/tagging:
1. If a sample is not proviral, skip it. Do not attempt to find primers or anything, just log a message saying sample X was skipped because it was non-proviral
2. If a sample has 0 in the remap column of the cascade.csv file, tag that sequence with an error: No contig/conseq constructed, do not analyze it or try to find primers, and write it to the *primer_analysis.csv file (which records all failures)
3. If the consensus-percent-cutoff is NOT MAX, tag it with an error: contig not MAX and skip the sequence (do not try to find primers)
4. If the reference of the sample is HIV1-CON-XX-Consensus-seed tag that sequence with an error: is V3 sequence, skip the sequence (do not try to find primers), and write it to the *primer_analysis.csv file
5. If there is an X in the middle of the sequence, tag that sequence with an error: low internal read coverage, skip the sequence (do not try to find primers), and write it to the *primer_analysis.csv file
6. If there are ANY non-TCGA characters in the sequence, tag that sequence with an error: contig sequence contained non-TCGA/gap, skip the sequence (do not try to find primers), and write it to the *primer_analysis.csv file
7. For each end (5' (fwd), 3' (rev)) of the sequence:
  1. If there are X characters found, try to remove them (if they are clustered) and if not possible to remove tag the fwd/rev end with a fwd/rev primer error: low read coverage in primer region, skip the fwd/rev end (do not try to find primers)
  2. If fwd/rev end has zero nucleotides found for primer, tag the fwd/rev end with a fwd/rev primer error: primer was not found, skip to the next end if any
  3. If the fwd/rev primer is deemed not valid, tag the fwd/rev end with a fwd/rev primer error: primer failed secondary validation, skip to the next end if any
8. Write the sequence to the *primer_analysis.csv file regardless of tagged errors in any error column
9. Load the *primer_analysis.csv files for both contigs and conseqs and for both of them apply the following filters in order:
  1. Remove all rows where either the error, fwd_error, or rev_error is tagged
  2. Remove the primers from the sequences (for hivseqinr)
  3. Remove rows where sample name appears twice (duplicates)
  4. Remove rows where the reference contains unknown or reverse
10. Finally merge the filtered contigs and conseqs and write the final *filtered.csv file with conseqs taking precedence over contigs

Name		Name	Last commit message	Last commit date
Latest commit History 320 Commits
.github/workflows		.github/workflows
.vscode		.vscode
docs		docs
gene_splicer		gene_splicer
tests		tests
.gitignore		.gitignore
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
Singularity		Singularity
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Readme

Dependencies

Singularity builds

Filtering

About

Releases 2

Packages

Contributors 5

Languages

cfe-lab/proviral

Folders and files

Latest commit

History

Repository files navigation

Readme

Dependencies

Singularity builds

Filtering

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 5

Languages

Packages