Discovery and validation of viruses using long read technology
Goal: Use machine learning to find viruses, both novel and known, in PacBio long-read metagenomic data, in order to detect and diagnose viral infections in cultured cells.
Discovery and validation of viruses using VirFinder and RPS-BLAST, trained on PacBio long-read metagenomic data for use on PacBio long-read metagenomic data.
https://docs.google.com/presentation/d/125WTRdU_TjUtEwrDlG9p-orZDJxauGQl7ojDjvmwlMI/edit?usp=sharing
VirusAlert is intended for periodically checking cell cultures used in pharma bioreactors for viral infection. By running clean cell cultures through LRV, users can set a baseline p-value for their specific cell line. Extreme deviations from that baseline, together with BLAST analysis of 'contaminant' contigs, can indicate viral infection and the need for further investigation.
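The baseline comparison described above could be sketched as follows. This is a minimal illustration, not VirusAlert's own logic: the function names, the 0.05 threshold, the three-standard-deviation cutoff, and all p-values are hypothetical. It assumes each run yields one p-value per contig, and that a run is summarized by the fraction of contigs falling below the threshold.

```python
from statistics import mean, stdev


def viral_fraction(pvalues, threshold=0.05):
    """Fraction of contigs whose p-value falls below the threshold."""
    if not pvalues:
        return 0.0
    return sum(p < threshold for p in pvalues) / len(pvalues)


def deviates_from_baseline(baseline_runs, sample, threshold=0.05, n_sd=3.0):
    """True if the sample's fraction of low p-values exceeds the
    clean-run mean by more than n_sd standard deviations.
    Needs at least two baseline runs to estimate the spread."""
    fractions = [viral_fraction(run, threshold) for run in baseline_runs]
    cutoff = mean(fractions) + n_sd * stdev(fractions)
    return viral_fraction(sample, threshold) > cutoff


# Hypothetical per-contig p-values from three clean runs of one cell line.
clean_runs = [
    [0.90, 0.40, 0.03, 0.70, 0.55],
    [0.80, 0.60, 0.20, 0.95, 0.35],
    [0.02, 0.50, 0.65, 0.75, 0.85],
]
# Hypothetical run to be checked against that baseline.
suspect = [0.01, 0.02, 0.60, 0.004, 0.03]

if deviates_from_baseline(clean_runs, suspect):
    print("Deviation from baseline: BLAST the low p-value contigs.")
```

A flagged run is only a prompt for follow-up; the BLAST step is what identifies which contigs, if any, match known viruses.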
At the heart of VirusAlert is VirFinder, a k-mer based virus detection tool that uses machine learning to let users separate viral reads from host reads. By default, VirFinder is trained on viral sequences from the RefSeq database. It then uses machine learning algorithms to find possible viral k-mers within the reads.
GitHub: https://github.com/jessieren/VirFinder
NCBI: https://www.ncbi.nlm.nih.gov/pubmed/28683828
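To illustrate what "k-mer based" means, the sketch below counts overlapping k-mers in a read and converts them to relative frequencies, the kind of feature a classifier can be trained on. It is not VirFinder's code and does not reproduce its model; the function name and the choice of k=4 are assumptions made for the example.

```python
from collections import Counter


def kmer_frequencies(sequence, k=4):
    """Relative frequency of each overlapping k-mer in a sequence.
    Windows containing anything other than A, C, G, T are skipped."""
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy read; a real PacBio read would be thousands of bases long.
for kmer, freq in sorted(kmer_frequencies("ACGTACGT").items()):
    print(kmer, freq)
```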
BLAST: Searches the NCBI database for matches to viral contigs.
-
First, clone the repository:
git clone https://github.com/NCBI-Hackathons/LongReadViruses.git
Next, run the top-level install.sh script.
This will install dependencies in the tools directory, and test data files in the data directory.
Run virusalert.py.
Command Line Options for virusalert.py:
    -h|--help    Print this help text.
    -v           Print debugging information. [default: true]
    -i INPUTS    One or more SRR numbers or fastq/a file paths as input, e.g. SRR5150787 or testfile.fq [default: SRR5150787]
    -t INTYPE    Type of input provided - can be either srr, fasta or fastq [default: srr]
    -c CONTDB    Contamination database to use. Default is to download and install the RefSeq viral database.
    -o OUTDIR    Working directory and where to save results [default: analysis]
Sequence SRR: All data passed into VirusAlert should be long-read PacBio shotgun metagenomic data, supplied as an SRA Run Accession (SRR).
Threshold [Optional]: Minimum p-value for a non-contaminated output.
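For readers who want to see how the options above map onto parsed values, here is a minimal sketch. The install step lists docopt as a dependency, but this example swaps in the standard library's argparse so it runs without extra packages; it is not the parser used in virusalert.py. The attribute names (inputs, intype, contdb, outdir, verbose) are assumptions, and "results" is a hypothetical output directory.

```python
import argparse


def build_parser():
    """argparse mirror of the documented options (-h is added automatically)."""
    parser = argparse.ArgumentParser(prog="virusalert.py")
    # The help text lists the default as true, so the flag is
    # effectively always on in this sketch.
    parser.add_argument("-v", dest="verbose", action="store_true", default=True,
                        help="Print debugging information.")
    parser.add_argument("-i", dest="inputs", nargs="+", default=["SRR5150787"],
                        help="One or more SRR numbers or fastq/a file paths.")
    parser.add_argument("-t", dest="intype", choices=["srr", "fasta", "fastq"],
                        default="srr", help="Type of input provided.")
    # None stands in for "download and install the RefSeq viral database".
    parser.add_argument("-c", dest="contdb", default=None,
                        help="Contamination database to use.")
    parser.add_argument("-o", dest="outdir", default="analysis",
                        help="Working directory and where to save results.")
    return parser


args = build_parser().parse_args(["-i", "SRR5150787", "-t", "srr", "-o", "results"])
print(args.inputs, args.intype, args.outdir)
```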
[Tree of viruses image]?
-
Install (if not present) any BLAST dependencies, R, then VirFinder, and any pip dependencies (docopt):
sh install.sh
-
Run program with sample options:
python3 virusalert.py -i <input-SRR-code> <optional p-value> -o <output-results-directory-path>
-
Output
This should:
1. Fetch the fasta data from the SRR code.
2. Run the fasta data with VirFinder.
3. Programmatically BLAST to see what the top hits are.
4. Return a graphic, a file of p-values, and a file of top hits in the output directory (date-stamped).
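Step 3 could be sketched as below, assuming BLAST was run with tabular output (-outfmt 6), which writes twelve tab-separated columns per hit. The function name, contig names, and hit values are hypothetical, and picking the highest bit score per contig is one reasonable definition of "top hit", not necessarily the one VirusAlert uses.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def top_hits(handle):
    """Return the highest-bitscore hit for each query contig."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        query = row["qseqid"]
        score = float(row["bitscore"])
        if query not in best or score > float(best[query]["bitscore"]):
            best[query] = row
    return best


# Hypothetical BLAST results; in practice, pass an open results file.
example = io.StringIO(
    "contig1\tvirusA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-90\t350.0\n"
    "contig1\tvirusB\t90.0\t180\t18\t0\t1\t180\t5\t184\t1e-50\t210.0\n"
    "contig2\tvirusC\t95.0\t150\t7\t1\t1\t150\t1\t150\t1e-60\t250.0\n"
)
for query, hit in top_hits(example).items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```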
VirFinder: Blast: