A simplified workflow to run KPop (https://www.biorxiv.org/content/10.1101/2022.06.22.497172v2; https://github.com/PaoloRibeca/KPop) for clustering and classification
To run the workflow, first create the kpop environment from the environment.yml file, then activate it:
conda env create -f environment.yml
conda activate kpop
Currently, there are three main workflows:
- Clustering workflow - starting with an unknown dataset, running KPop to generate transformational database files (.KPopTwister and .KPopTwisted) and a pseudo-phylogenetic tree showing relatedness:
nextflow run main.nf --cluster --input_dir /dir/containing/fastqAndFastas
- Classifying workflow - starting with a known training dataset with metadata, such as classes/species, and predicting the class/species of an unknown test set:
nextflow run main.nf --classify --input_dir /dir/containing/training/fastqAndFastas --meta_data /path/to/meta/file --test_dir /dir/containing/test/fastqAndFastas
- Update transformational database workflow - starting with an unknown dataset and previous transformational database files, the new data is added to the existing database, creating updated .KPopTwister and .KPopTwisted files and a new pseudo-phylogenetic tree:
nextflow run main.nf --update --test_dir /dir/containing/new/fastqAndFastas --twister_file /path/to/KPopTwister/file --twisted_file /path/to/KPopTwisted/file
All workflows create many files, but the most important outputs are:
- Clustering (--cluster): the database files (.KPopTwister and .KPopTwisted in KPopTwist_files) and the KPop pseudo-phylogenetic tree found in results/trees_and_metrics/output_2.<nwk/pdf>
- Classifying (--classify): the class/species predictions found in results/predictions/output.<predictions/KPopSummary>.txt
- Updating (--update): the updated database files (<output_prefix>.KPopTwister and <output_prefix>.KPopTwisted) and the updated_comparison.pdf plot in updated_KPopTwist_files
When no KPop workflow is selected, pre-processing is still performed. In the example below, fastq files are downloaded from NCBI based on the IDs in sample_list.txt, then QC is performed, and finally assemblies are generated:
nextflow run main.nf --accession_list sample_list.txt
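The file passed to --accession_list is a plain text file with one SRA ID per line. The following is a minimal sketch of building and sanity-checking such a file. The accessions (SRR0000001 and so on) are hypothetical placeholders rather than real runs, and the scratch directory from mktemp is only there to keep the sketch away from real data.

```shell
# Scratch directory so the sketch does not touch real data.
demo_dir=$(mktemp -d)
list_file="$demo_dir/sample_list.txt"

# Placeholder accessions (hypothetical); replace with real SRA run IDs.
printf '%s\n' SRR0000001 SRR0000002 SRR0000003 > "$list_file"

# Count lines that are empty or contain whitespace, i.e. not a single bare ID.
# grep exits non-zero when nothing matches, hence the "|| true".
bad_lines=$(grep -c -E '^$|[[:space:]]' "$list_file" || true)
n_ids=$(wc -l < "$list_file" | tr -d ' ')
echo "IDs: $n_ids, malformed lines: $bad_lines"

# The list would then be passed to the pipeline (not run in this sketch):
# nextflow run main.nf --accession_list "$list_file"
```

If the malformed-line count is not zero, it is probably worth cleaning the file before starting a download.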
Workflows
Option | Argument(s) | Effect | Note(s) |
---|---|---|---|
--cluster | | Runs the clustering workflow. Starting with an unknown dataset, the pipeline produces a distance matrix and a pseudo-phylogenetic tree showing relatedness between samples. | |
--classify | | Runs the classification workflow. Starting with separate training and test datasets, a model is created using the training dataset and known class metadata. This model is used to predict the classes of the unknown test dataset. Requires the --test_dir argument. | |
--update | | New data is added to the existing database, creating updated .KPopTwister and .KPopTwisted files. | |
Input
Option | Argument(s) | Effect | Note(s) |
---|---|---|---|
--input_dir | directory_name | Path to directory containing fasta/fastq files. Paired-end fastqs require "R1" and "R2" in filenames. Gzipped files are allowed. If --classify is used, this directory is the training dataset. | |
--test_dir | directory_name | Directory containing the unseen test dataset. Only required if the --classify workflow is invoked. | |
--accession_list | txt_filename | Supply a list of SRA IDs to download as input samples, as a text file with one SRA ID per line. | |
--test_accession_list | txt_filename | Supply a list of SRA IDs to download as test samples, as a text file with one SRA ID per line. Only required if the --classify workflow is invoked. | |
--meta_data | TSV_filename | TSV file with two required columns with defined headers: "fileName" and "class". "fileName" is the file name for a fasta or fasta.gz file, or the file prefix for paired-end fastqs. E.g. sample1.fasta.gz for a fasta file, or sample1 for sample1_R1.fastq.gz and sample1_R2.fastq.gz. Only required if the --classify workflow is invoked. If used with the --cluster workflow, it generates a pseudo-phylogenetic tree coloured with metadata. Additional columns are allowed. | |
--twisted_file | .KPopTwisted_file | Full path to .KPopTwisted file. Only required for the --update workflow. | |
--twister_file | .KPopTwister_file | Full path to .KPopTwister file. Only required for the --update workflow. | |
--kpopcount_input | .KPopCounter_file | .KPopCounter file as an input. Incompatible with the --no_assembly, --match_reference, --input_dir, --accession_list, --max_dim and --meta_data options. | |
--kpopcount_test_input | .KPopCounter_file | .KPopCounter file as a test input. Only required if the --classify workflow is invoked. Incompatible with the --no_assembly, --match_reference, --input_dir, --accession_list, --max_dim and --meta_data options. | |
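The --meta_data file format described above can be illustrated with a small sketch. It writes a two-column TSV with the required "fileName" and "class" headers, derives the fileName value for a paired-end sample by stripping everything from _R1 onwards, and runs a basic check. The sample names and class labels are hypothetical, and the check is only an illustration, not the pipeline's own validation.

```shell
# Scratch directory so the sketch does not touch real data.
demo_dir=$(mktemp -d)
meta_file="$demo_dir/metadata.tsv"

# For paired-end fastqs the fileName entry is the prefix before _R1/_R2.
r1_name="sample2_R1.fastq.gz"      # hypothetical file name
pair_prefix=${r1_name%_R1*}        # strips "_R1.fastq.gz"

# Required headers, then one row per sample (class labels are hypothetical).
printf 'fileName\tclass\n' > "$meta_file"
printf 'sample1.fasta.gz\tclassA\n' >> "$meta_file"   # fasta: full file name
printf '%s\tclassB\n' "$pair_prefix" >> "$meta_file"  # paired fastqs: prefix only

# Check that both required headers are present somewhere in the header row.
header_ok=$(awk -F '\t' 'NR == 1 {
    for (i = 1; i <= NF; i++) seen[$i] = 1
    ok = (("fileName" in seen) && ("class" in seen)) ? "yes" : "no"
    print ok
}' "$meta_file")

# Count data rows with fewer than two fields or an empty first or second field.
bad_rows=$(awk -F '\t' 'NR > 1 && (NF < 2 || $1 == "" || $2 == "") { n++ }
    END { print n + 0 }' "$meta_file")

echo "headers present: $header_ok, incomplete rows: $bad_rows"
```

The row check assumes "fileName" and "class" are the first two columns, as written here; a file with additional columns in a different order would need the field indices adjusted.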
Output
Option | Argument(s) | Effect | Note(s) |
---|---|---|---|
--output_dir | directory_name | Path to output directory. If the directory doesn't exist, a new directory will be created. | default=projectDir/results |
--output_prefix | prefix_name | Prefix for output files | default=output |
--pred_class_num | positive_integer or all | Specify the top n best predictions to be included in the .KPopSummary file. E.g. 2 would choose the top two closest classes | default=all |
Reference matching
Option | Argument(s) | Effect | Note(s) |
---|---|---|---|
--match_reference | fasta_filename | Full path to reference fasta file. Used to select only the contigs that match the supplied reference. | |
--min_contig_match_len | positive_integer | Minimum number of query contig base pairs that match the reference. Only used with the --match_reference option | default=250 |
--min_contig_match_proportion | fractional_float | Minimum fraction of query contig base pairs that match the reference. Only used with the --match_reference option | default=0.6 |
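To make the two thresholds concrete, here is a rough sketch of a contig filter using the default values. It assumes, without confirmation from the pipeline source, that a contig must pass both thresholds to be kept. The three-column input (contig name, contig length, matched base pairs) is a hypothetical format invented for this illustration, and the numbers are made up.

```shell
min_len=250     # default of --min_contig_match_len
min_prop=0.6    # default of --min_contig_match_proportion

# Hypothetical rows: contig name, contig length (bp), bp matching the reference.
# Assumption: a contig is kept only if it passes BOTH thresholds.
kept=$(printf '%s\n' \
    'contig_1 1000 800' \
    'contig_2 1000 300' \
    'contig_3 300 200' |
  awk -v min_len="$min_len" -v min_prop="$min_prop" \
    '$3 >= min_len && $3 / $2 >= min_prop { print $1 }')

echo "$kept"
```

Under these assumptions contig_2 fails the proportion threshold (300/1000) and contig_3 fails the length threshold (200 bp), so only contig_1 is kept.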
General arguments
Option | Argument(s) | Effect | Note(s) |
---|---|---|---|
--kmer_len | positive_integer | Length of k-mer to use when generating spectra | default=12 |
--cpu_num | positive_integer | Number of CPUs used per process | default=8 |
--no_assembly | | Do not perform assembly on the reads; the workflow counts k-mers directly from the raw reads instead of from assemblies | |
--no_qc | | Do not perform quality control using trim_galore | |
--validate_inputs | | Perform a validation check on fastq and fasta inputs to ensure they are formatted correctly; incorrectly formatted files will be skipped | |
--max_dim | positive_integer | Maximum number of dimensions used to separate data. Choosing 0 uses all available dimensions, which will be one less than the number of samples for --cluster or one less than the number of classes for --classify. A lower number will reduce memory usage. If the data cannot be separated into the number chosen, fewer dimensions will be chosen automatically. Must not be a number above the maximum number of samples. | default=0 |
-profile | conda | Install the required conda environment automatically from the environment.yml file found in the same directory as main.nf. Slower than installing it manually. | |
--help | | Print help instructions | |
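The --max_dim rule above can be sketched as a small helper that returns an upper bound on the number of dimensions. The function name dims_upper_bound is hypothetical and not part of the pipeline. The sketch only covers the arithmetic stated in the table (0 means all available dimensions, which is the count minus one); the pipeline may still choose fewer dimensions depending on the data.

```shell
# dims_upper_bound COUNT REQUESTED
# COUNT is the number of samples (--cluster) or classes (--classify).
# REQUESTED is the --max_dim value; 0 means "use all available dimensions".
dims_upper_bound() {
    dub_available=$(( $1 - 1 ))
    if [ "$2" -eq 0 ] || [ "$2" -gt "$dub_available" ]; then
        echo "$dub_available"
    else
        echo "$2"
    fi
}

cluster_dims=$(dims_upper_bound 10 0)   # 10 samples, --max_dim 0
classify_dims=$(dims_upper_bound 4 2)   # 4 classes, --max_dim 2

echo "cluster: $cluster_dims, classify: $classify_dims"
```

With 10 samples and --max_dim 0 the bound is 9; with 4 classes and --max_dim 2 it stays at 2.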
Additional arguments
Option | Argument(s) | Effect | Note(s) |
---|---|---|---|
--tree_type | string | Specify the type of tree generated by ggtree - 'rectangular' or 'circular' | default=rectangular |
--tree_label_size | positive_integer | Specify the size of the labels on the tree generated by ggtree, choose 0 to remove labels | default=3 |
--extra_flash | string | Any additional arguments for flash (https://pubmed.ncbi.nlm.nih.gov/21903629/). E.g. --extra_flash '-O -x 0.35' | |
--flash_minOverlap | positive_integer | The minimum required overlap length between two reads to provide a confident overlap. Only used on fastq inputs. | default=20 |
--flash_maxOverlap | positive_integer | Maximum overlap length expected in approximately 90% of read pairs. Only used on fastq inputs. | default=1000 |
--extra_megahit | string | Any additional arguments for Megahit (https://pubmed.ncbi.nlm.nih.gov/25609793/). E.g. --extra_megahit '--k-min 25' | |
--extra_trimGalore | string | Any additional arguments for TrimGalore (https://github.com/FelixKrueger/TrimGalore). | |
--extra_prefetch | string | Any additional arguments for prefetch (https://github.com/ncbi/sra-tools). | |
--extra_fasterq_dump | string | Any additional arguments for fasterq-dump (https://github.com/ncbi/sra-tools). | |
--extra_kpopCount | string | Any additional arguments for KPopCount (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#41-kpopcount). | |
--extra_kpopCountDB | string | Any additional arguments for KPopCountDB (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#42-kpopcountdb). | |
--extra_kpopTwist | string | Any additional arguments for KPopTwist (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#43-kpoptwist). | |
--extra_kpopTwistDB | string | Any additional arguments for KPopTwistDB (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#44-kpoptwistdb). | |
--kpopPhylo_power | positive_integer | Set the external power when computing distances. | default=2 |
--kpopPhylo_distance | string | Distance measure to be used. This must be one of 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary' or 'minkowski'. | default=euclidean |
--kpopPhylo_magic | string | Cluster-related variable (Not currently implemented). | default=1. |
--kpopScale_power | string | Set the external power when computing distances. | default=2 |