kpop-workflow

A simplified workflow to run KPop (https://www.biorxiv.org/content/10.1101/2022.06.22.497172v2; https://github.com/PaoloRibeca/KPop) for clustering and classification.

To run the workflow, first create the kpop environment from the environment.yml file, then activate it:

conda env create -f environment.yml
conda activate kpop

Currently, there are three main workflows:

Clustering workflow - starting with an unknown dataset, running KPop to generate transformational database files (.KPopTwister and .KPopTwisted) and a pseudo-phylogenetic tree showing relatedness:

nextflow run main.nf --cluster --input_dir /dir/containing/fastqAndFastas

Classifying workflow - starting with a known training dataset with metadata, such as classes/species, and predicting the class/species of an unknown test set:

nextflow run main.nf --classify --input_dir /dir/containing/training/fastqAndFastas --meta_data /path/to/meta/file --test_dir /dir/containing/test/fastqAndFastas

Update transformational database workflow - starting with an unknown dataset and previous transformational database files, the new data is added to the existing database, creating updated .KPopTwister and .KPopTwisted files and a new pseudo-phylogenetic tree:

nextflow run main.nf --update --test_dir /dir/containing/new/fastqAndFastas --twister_file /path/to/KPopTwister/file --twisted_file /path/to/KPopTwisted/file

All workflows create many files, but the most important outputs are:

--cluster - the database files (.KPopTwister and .KPopTwisted in KPopTwist_files) and the KPop pseudo-phylogenetic tree found in results/trees_and_metrics/output_2.<nwk/pdf>.

--classify - the class/species predictions found in results/predictions/output.<predictions/KPopSummary>.txt.

--update - the updated database files (<output_prefix>.KPopTwister and <output_prefix>.KPopTwisted) and the updated_comparison.pdf plot in updated_KPopTwist_files.

When no KPop workflow is selected, pre-processing is still performed. In the example below, fastq files are downloaded from NCBI based on the IDs in sample_list.txt, then QC is performed, and finally assemblies are generated.

nextflow run main.nf --accession_list sample_list.txt
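
A list of IDs copied from a spreadsheet or web page can carry blank lines, stray whitespace, Windows line endings or duplicate entries. The sketch below normalises such a list to one accession per line before it is handed to --accession_list. The function name clean_accession_list and the file names are illustrative and not part of the workflow, and it is an assumption that the workflow does not already tolerate such lists.

```shell
# Sketch only: clean_accession_list is a hypothetical helper, not part of kpop-workflow.
# $1 = raw list of SRA IDs, $2 = cleaned list with one ID per line
clean_accession_list() {
    # drop carriage returns, keep the first field of each non-blank line,
    # then remove repeated IDs while preserving the original order
    tr -d '\r' < "$1" | awk 'NF { print $1 }' | awk '!seen[$0]++' > "$2"
}
```

For example, `clean_accession_list raw_ids.txt sample_list.txt` could be run first, followed by the nextflow command above.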

Workflows

Option	Argument(s)	Effect	Note(s)
`--cluster`		Data is run through the clustering workflow. Starting with an unknown dataset, the pipeline produces a distance matrix and a pseudo-phylogenetic tree showing relatedness between samples.
`--classify`		Data is run through the classification workflow. Starting with separate training and test datasets, a model is created from the training dataset and its known class metadata. This model is then used to predict the classes of the unknown test dataset. Requires --test_dir argument.
`--update`		New data is added to the existing database, creating updated .KPopTwister and .KPopTwisted files.

Input

Option	Argument(s)	Effect
`--input_dir`	directory_name	Path to directory containing fasta/fastq files. Paired-end fastqs require "R1" and "R2" in filenames. Gzipped files are allowed. If `--classify` is used, this directory is the training dataset.
`--test_dir`	directory_name	Directory containing the unseen test dataset. Only required if the `--classify` workflow is invoked.
`--accession_list`	txt_filename	Text file listing the SRA IDs to download as input samples, with one SRA ID per line.
`--test_accession_list`	txt_filename	Text file listing the SRA IDs to download as test samples, with one SRA ID per line. Only required if the `--classify` workflow is invoked.
`--meta_data`	TSV_filename	TSV file with two required columns, with the headers "fileName" and "class". "fileName" is the file name for a fasta or fasta.gz file, or the file prefix for paired-end fastqs, e.g. sample1.fasta.gz for a fasta file, or sample1 for sample1_R1.fastq.gz and sample1_R2.fastq.gz. Additional columns are allowed. Only required if the `--classify` workflow is invoked. If used with the `--cluster` workflow, a pseudo-phylogenetic tree coloured with the metadata is generated.
`--twisted_file`	.KPopTwisted_file	Full path to .KPopTwisted file. Only required for `--update` workflow.
`--twister_file`	.KPopTwister_file	Full path to .KPopTwister file. Only required for `--update` workflow.
`--kpopcount_input`	.KPopCounter_file	`.KPopCounter` file as an input. Incompatible with `--no_assembly`, `--match_reference`, `--input_dir`, `--accession_list`, `--max_dim` and `--meta_data` options.
`--kpopcount_test_input`	.KPopCounter_file	`.KPopCounter` file as a test input. Only required if `--classify` workflow invoked. Incompatible with `--no_assembly`, `--match_reference`, `--input_dir`, `--accession_list`, `--max_dim` and `--meta_data` options.
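
Writing the `--meta_data` file by hand is error-prone because the "fileName" value differs between fasta files (full file name) and paired-end fastqs (prefix only). The sketch below writes a skeleton TSV for a directory of inputs, leaving the class to be filled in afterwards. The function name make_meta_skeleton and the UNKNOWN placeholder are illustrative, not part of the workflow. It assumes pairs are named <prefix>_R1... and <prefix>_R2..., as in the sample1 example, and it only recognises the .fasta, .fasta.gz, .fastq and .fastq.gz extensions.

```shell
# Sketch only: make_meta_skeleton is a hypothetical helper, not part of kpop-workflow.
# $1 = input directory, $2 = output TSV, $3 = placeholder class (default UNKNOWN)
make_meta_skeleton() {
    meta_dir="$1"
    meta_out="$2"
    meta_class="${3:-UNKNOWN}"
    # the two required, fixed headers
    printf 'fileName\tclass\n' > "$meta_out"
    for meta_path in "$meta_dir"/*; do
        [ -f "$meta_path" ] || continue
        meta_name="$(basename "$meta_path")"
        case "$meta_name" in
            *_R1*.fastq|*_R1*.fastq.gz)
                # paired-end fastq: one row per pair, keyed by the prefix before _R1
                printf '%s\t%s\n' "${meta_name%%_R1*}" "$meta_class" >> "$meta_out" ;;
            *.fasta|*.fasta.gz)
                # fasta: the full file name is used
                printf '%s\t%s\n' "$meta_name" "$meta_class" >> "$meta_out" ;;
            # anything else (including the _R2 mates) is skipped
        esac
    done
}
```

After running `make_meta_skeleton /dir/containing/training/fastqAndFastas meta.tsv`, each UNKNOWN entry would need to be replaced with the real class before the file is used with `--classify`.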

Output

Option	Argument(s)	Effect	Note(s)
`--output_dir`	directory_name	Path to output directory. If directory doesn't exist then a new directory will be created.	default=projectDir/results
`--output_prefix`	prefix_name	Prefix for output files	default=output
`--pred_class_num`	positive_integer | all	Number of top predictions to include in the .KPopSummary file. E.g. 2 would keep the two closest classes	default=all

Reference matching

Option	Argument(s)	Effect	Note(s)
`--match_reference`	fasta_filename	Full path to reference fasta file. Used to select only the contigs that match the supplied reference.
`--min_contig_match_len`	positive_integer	Minimum number of query contig base pairs that match the reference. Only used with `--match_reference` option	default=250
`--min_contig_match_proportion`	fractional_float	Minimum fraction of query contig base pairs that match reference. Only used with `--match_reference` option	default=0.6
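
To make the two thresholds concrete, the sketch below decides whether a single contig would pass, given its length and the number of its base pairs that match the reference. The function name keep_contig is hypothetical and this is not the workflow's own code. In particular, it is an assumption that both thresholds must be met at the same time.

```shell
# Sketch only: keep_contig is a hypothetical helper, not the workflow's implementation.
# $1 = contig length in bp, $2 = bp of the contig matching the reference
# $3 = minimum matching bp (default 250), $4 = minimum matching fraction (default 0.6)
# Returns 0 (keep) when BOTH thresholds are met, which is an assumption.
keep_contig() {
    awk -v len="$1" -v hit="$2" -v min_len="${3:-250}" -v min_prop="${4:-0.6}" \
        'BEGIN { exit !(len > 0 && hit >= min_len && hit / len >= min_prop) }'
}
```

With the defaults, `keep_contig 1000 700` succeeds (700 bp and 70% matched), while `keep_contig 1000 300` fails on the proportion and `keep_contig 300 200` fails on the minimum length.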

General arguments

Option	Argument(s)	Effect	Note(s)
`--kmer_len`	positive_integer	Length of k-mer to use when generating spectra	default=12
`--cpu_num`	positive_integer	Number of CPUs used per process	default=8
`--no_assembly`		Do not perform assembly on the reads. The workflow counts k-mers directly from the raw reads instead of from assemblies
`--no_qc`		Do not perform quality control using trim_galore
`--validate_inputs`		Perform a validation check on fastq and fasta inputs to ensure they are formatted correctly. Incorrectly formatted files will be skipped
`--max_dim`	positive_integer	Maximum number of dimensions used to separate data. Choosing 0 uses all available dimensions, which will be one less than the number of samples for `--cluster` or one less than the number of classes for `--classify`. A lower number will reduce memory usage. If the data cannot be separated into the number chosen, fewer dimensions will be chosen automatically. Must not exceed the number of samples.	default=0
`-profile`	conda	Install the required conda environment automatically from the environment.yml file found in the same directory as main.nf. Slower than installing it manually.
`--help`		Print help instructions
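
When choosing a value for `--max_dim` with `--classify`, it can help to know the ceiling, which is one less than the number of classes. The sketch below counts the distinct values in the "class" column of a metadata TSV and prints that ceiling. The function name max_dim_for_classify is hypothetical and not part of the workflow; the column is located by its header because additional columns are allowed.

```shell
# Sketch only: max_dim_for_classify is a hypothetical helper, not part of kpop-workflow.
# $1 = metadata TSV with a "class" header; prints (number of distinct classes - 1)
max_dim_for_classify() {
    awk -F'\t' '
        # header row: find which column holds the class
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "class") col = i; next }
        # data rows: remember each non-empty class value once
        col && $col != "" { seen[$col] = 1 }
        END { n = 0; for (c in seen) n++; print (n > 0 ? n - 1 : 0) }
    ' "$1"
}
```

A metadata file with three distinct classes would print 2, so any `--max_dim` above 2 would give no further separation under the rule described in the table.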

Additional arguments

Option	Argument(s)	Effect	Note(s)
`--tree_type`	string	Specify the type of tree generated by ggtree - 'rectangular' or 'circular'	default=rectangular
`--tree_label_size`	positive_integer	Specify the size of the labels on the tree generated by ggtree. Choose 0 to remove labels	default=3
`--extra_flash`	string	Any additional arguments for flash (https://pubmed.ncbi.nlm.nih.gov/21903629/). E.g. `--extra_flash '-O -x 0.35'`
`--flash_minOverlap`	positive_integer	The minimum required overlap length between two reads to provide a confident overlap. Only used on fastq inputs.	default=20
`--flash_maxOverlap`	positive_integer	Maximum overlap length expected in approximately 90% of read pairs. Only used on fastq inputs.	default=1000
`--extra_megahit`	string	Any additional arguments for Megahit (https://pubmed.ncbi.nlm.nih.gov/25609793/). E.g. `--extra_megahit '--k-min 25'`
`--extra_trimGalore`	string	Any additional arguments for TrimGalore (https://github.com/FelixKrueger/TrimGalore).
`--extra_prefetch`	string	Any additional arguments for prefetch (https://github.com/ncbi/sra-tools).
`--extra_fasterq_dump`	string	Any additional arguments for fasterq-dump (https://github.com/ncbi/sra-tools).
`--extra_kpopCount`	string	Any additional arguments for KPopCount (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#41-kpopcount).
`--extra_kpopCountDB`	string	Any additional arguments for KPopCountDB (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#42-kpopcountdb).
`--extra_kpopTwist`	string	Any additional arguments for KPopTwist (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#43-kpoptwist).
`--extra_kpopTwistDB`	string	Any additional arguments for KPopTwistDB (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#44-kpoptwistdb).
`--kpopPhylo_power`	positive_integer	Set the external power when computing distances.	default=2
`--kpopPhylo_distance`	string	Distance measure to be used. This must be one of 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary' or 'minkowski'.	default=euclidean
`--kpopPhylo_magic`	string	Cluster-related variable (Not currently implemented).	default=1.
`--kpopScale_power`	string	Set the external power when computing distances.	default=2
