A simplified workflow to run KPop (https://github.com/PaoloRibeca/KPop) for disease outbreak clustering and classification

ryanmorrison22/kpop-workflow

kpop-workflow

A simplified workflow to run KPop (https://www.biorxiv.org/content/10.1101/2022.06.22.497172v2; https://github.com/PaoloRibeca/KPop) for clustering and classification

To run the workflow, you'll first need to create the kpop environment from the environment.yml file and then activate it:

conda env create -f environment.yml
conda activate kpop

Currently, there are three main workflows:

  1. Clustering workflow - starting with an unknown dataset, KPop is run to generate transformational database files (.KPopTwister and .KPopTwisted) and a pseudo-phylogenetic tree showing relatedness:
nextflow run main.nf --cluster --input_dir /dir/containing/fastqAndFastas
  2. Classifying workflow - starting with a known training dataset with metadata, such as classes/species, the class/species of an unknown test set is predicted:
nextflow run main.nf --classify --input_dir /dir/containing/training/fastqAndFastas --meta_data /path/to/meta/file --test_dir /dir/containing/test/fastqAndFastas
  3. Update transformational database workflow - starting with an unknown dataset and previous transformational database files, the new data is added to the existing database, creating updated .KPopTwister and .KPopTwisted files and a new pseudo-phylogenetic tree:
nextflow run main.nf --update --test_dir /dir/containing/new/fastqAndFastas --twister_file /path/to/KPopTwister/file --twisted_file /path/to/KPopTwisted/file

All workflows create many files. The most important outputs are: for --cluster, the database files (.KPopTwister and .KPopTwisted in KPopTwist_files) and the KPop pseudo-phylogenetic tree in results/trees_and_metrics/output_2.<nwk/pdf>; for --classify, the class/species predictions in results/predictions/output.<predictions/KPopSummary>.txt; and for --update, the updated database files (<output_prefix>.KPopTwister and <output_prefix>.KPopTwisted) and the updated_comparison.pdf plot in updated_KPopTwist_files.

When no KPop workflow is selected, pre-processing is still performed. In the example below, fastq files are downloaded from NCBI based on the IDs in sample_list.txt, then QC is performed, and finally assemblies are generated.

nextflow run main.nf --accession_list sample_list.txt
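
The accession list is a plain text file with one SRA ID per line. The sketch below shows one way to create such a file and sanity-check it before launching the workflow. The accession IDs are placeholders, both function names are hypothetical helpers that are not part of this repository, and the check assumes run accessions of the form SRR/ERR/DRR followed by digits; the workflow itself may accept other ID types.

```shell
# Write a list with one accession per line (placeholder IDs, replace with real ones).
write_example_list() {
    printf '%s\n' SRR0000001 SRR0000002 ERR0000003 > "$1"
}

# Hypothetical helper: print (with line numbers) every line that does not look
# like a run accession, and return non-zero if any such line exists.
# Assumed pattern: SRR, ERR or DRR followed by digits. Blank lines are flagged too.
check_accession_list() {
    if grep -Evn '^(SRR|ERR|DRR)[0-9]+$' "$1"; then
        return 1
    fi
    return 0
}
```

For example, after `write_example_list sample_list.txt`, running `check_accession_list sample_list.txt` prints nothing and succeeds, so the list can be passed to --accession_list.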

Workflows

| Option | Argument(s) | Effect | Note(s) |
| --- | --- | --- | --- |
| --cluster | | Runs the clustering workflow: starting with an unknown dataset, the pipeline produces a distance matrix and a pseudo-phylogenetic tree showing relatedness between samples. | |
| --classify | | Runs the classification workflow: starting with separate training and test datasets, a model is created from the training dataset and its known class metadata. This model is then used to predict the classes of the unknown test dataset. | Requires --test_dir argument. |
| --update | | Adds new data to the existing database, creating updated .KPopTwister and .KPopTwisted files. | |

Input

| Option | Argument(s) | Effect | Note(s) |
| --- | --- | --- | --- |
| --input_dir | directory_name | Path to directory containing fasta/fastq files. | Paired-end fastqs require "R1" and "R2" in filenames. Gzipped files are allowed. If --classify is used, this directory is the training dataset. |
| --test_dir | directory_name | Directory containing unseen test dataset. | Only required if --classify workflow invoked. |
| --accession_list | txt_filename | Text file listing SRA IDs to download as input samples, one SRA ID per line. | |
| --test_accession_list | txt_filename | Text file listing SRA IDs to download as test samples, one SRA ID per line. | Only required if --classify workflow invoked. |
| --meta_data | TSV_filename | TSV file with two required columns with defined headers: "fileName" and "class". "fileName" is the file name for a fasta or fasta.gz file, or the file prefix for paired-end fastqs, e.g. sample1.fasta.gz for a fasta file, or sample1 for sample1_R1.fastq.gz and sample1_R2.fastq.gz. Additional columns are allowed. | Only required if --classify workflow invoked. If used with --cluster workflow, it will generate a pseudo-phylogenetic tree coloured with metadata. |
| --twisted_file | .KPopTwisted_file | Full path to .KPopTwisted file. | Only required for --update workflow. |
| --twister_file | .KPopTwister_file | Full path to .KPopTwister file. | Only required for --update workflow. |
| --kpopcount_input | .KPopCounter_file | .KPopCounter file as an input. | Incompatible with --no_assembly, --match_reference, --input_dir, --accession_list, --max_dim and --meta_data options. |
| --kpopcount_test_input | .KPopCounter_file | .KPopCounter file as a test input. | Only required if --classify workflow invoked. Incompatible with --no_assembly, --match_reference, --input_dir, --accession_list, --max_dim and --meta_data options. |
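
The "fileName" rule for --meta_data (full file name for fasta files, prefix for paired-end fastqs) can be sketched as a small helper that prints a TSV template for a training directory. This is an illustration only: `make_meta_template` is a hypothetical function, not part of the workflow; it assumes mates are named `<prefix>_R1...` and `<prefix>_R2...` with a .fastq or .fastq.gz extension, and it fills the class column with the placeholder UNKNOWN, which has to be replaced by hand.

```shell
# Hypothetical helper: print a --meta_data TSV template for the files in directory $1.
# Assumes paired-end fastqs are named <prefix>_R1*.fastq[.gz] / <prefix>_R2*.fastq[.gz].
make_meta_template() {
    printf 'fileName\tclass\n'
    for path in "$1"/*; do
        [ -f "$path" ] || continue
        name=${path##*/}                  # strip the directory part
        case "$name" in
            *_R2*.fastq|*_R2*.fastq.gz)
                continue ;;               # R2 mate: already covered by its R1 partner
            *_R1*.fastq|*_R1*.fastq.gz)
                name=${name%%_R1*} ;;     # paired-end fastq: keep only the prefix
        esac
        printf '%s\tUNKNOWN\n' "$name"    # placeholder class, edit before use
    done
}
```

Running `make_meta_template /dir/containing/training/fastqAndFastas > meta.tsv` gives a file whose class column can then be filled in and passed to --meta_data.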

Output

| Option | Argument(s) | Effect | Note(s) |
| --- | --- | --- | --- |
| --output_dir | directory_name | Path to output directory. | If the directory doesn't exist, a new directory will be created. default=projectDir/results |
| --output_prefix | prefix_name | Prefix for output files. | default=output |
| --pred_class_num | positive_integer or all | Number of best predictions (top n) to include in the .KPopSummary file, e.g. 2 would choose the two closest classes. | default=all |

Reference matching

| Option | Argument(s) | Effect | Note(s) |
| --- | --- | --- | --- |
| --match_reference | fasta_filename | Full path to reference fasta file. | Used to select only contigs that match the supplied reference. |
| --min_contig_match_len | positive_integer | Minimum number of query contig base pairs that match the reference. | Only used with --match_reference option. default=250 |
| --min_contig_match_proportion | fractional_float | Minimum fraction of query contig base pairs that match the reference. | Only used with --match_reference option. default=0.6 |
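
To make the two thresholds concrete, here is a minimal sketch of how a length threshold and a proportion threshold could be applied to a table of contig matches. It is not the workflow's implementation: the three-column input format (contig ID, contig length, matched base pairs), the function name `filter_contigs`, and the assumption that a contig must pass both thresholds are all illustrative.

```shell
# Hypothetical helper: read "contig_id<TAB>contig_length<TAB>matched_bp" lines on stdin
# and print the IDs of contigs that pass BOTH thresholds (an assumption of this sketch).
# $1 = minimum matched base pairs (default 250), $2 = minimum matched fraction (default 0.6)
filter_contigs() {
    awk -F '\t' -v min_len="${1:-250}" -v min_prop="${2:-0.6}" \
        '$2 > 0 && $3 >= min_len && ($3 / $2) >= min_prop { print $1 }'
}
```

With the defaults, a 1000 bp contig with 700 matching base pairs is kept, while a 1000 bp contig with 300 matching base pairs is dropped because only 30% of it matches.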

General arguments

| Option | Argument(s) | Effect | Note(s) |
| --- | --- | --- | --- |
| --kmer_len | positive_integer | Length of k-mer to use when generating spectra. | default=12 |
| --cpu_num | positive_integer | Number of CPUs used per process. | default=8 |
| --no_assembly | | Do not perform assembly on the reads; the workflow counts k-mers directly from the raw reads instead of from assemblies. | |
| --no_qc | | Do not perform quality control using trim_galore. | |
| --validate_inputs | | Perform a validation check on fastq and fasta inputs to ensure they are formatted correctly; incorrectly formatted files are skipped. | |
| --max_dim | positive_integer | Maximum number of dimensions used to separate data. Choosing 0 uses all available dimensions, which is one less than the number of samples for --cluster, or one less than the number of classes for --classify. A lower number will reduce memory usage. If the data cannot be separated into the number chosen, fewer dimensions will be chosen automatically. | Must not be a number above the maximum number of samples. default=0 |
| -profile | conda | Install the required conda environment automatically from the environment.yml file found in the same directory as main.nf. | Slower than installing it manually. |
| --help | | Print help instructions. | |
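
As an illustration of what --kmer_len controls, the sketch below counts every overlapping k-mer in a single sequence, which is the basic idea behind a k-mer spectrum. It is a toy example rather than what KPopCount does internally: `kmer_spectrum` is a hypothetical helper, and it makes no attempt to handle reverse complements, ambiguous bases, or multi-record fasta/fastq files.

```shell
# Toy k-mer spectrum: count every overlapping substring of length k in one sequence.
# $1 = sequence, $2 = k. Prints "kmer count" lines sorted by k-mer.
kmer_spectrum() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            counts[substr($0, i, k)]++
    } END {
        for (m in counts) print m, counts[m]
    }' | sort
}
```

For instance, `kmer_spectrum ACGTACGT 4` reports ACGT twice and CGTA, GTAC and TACG once each. A sequence of length n contains n - k + 1 overlapping k-mers, and the number of distinct possible k-mers grows as 4^k, so the value chosen for --kmer_len determines how large the spectra can become.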

Additional arguments

| Option | Argument(s) | Effect | Note(s) |
| --- | --- | --- | --- |
| --tree_type | string | Type of tree generated by ggtree: 'rectangular' or 'circular'. | default=rectangular |
| --tree_label_size | positive_integer | Size of the labels on the tree generated by ggtree; choose 0 to remove labels. | default=3 |
| --extra_flash | string | Any additional arguments for flash (https://pubmed.ncbi.nlm.nih.gov/21903629/). | E.g. --extra_flash '-O -x 0.35' |
| --flash_minOverlap | positive_integer | The minimum required overlap length between two reads to provide a confident overlap. | Only used on fastq inputs. default=20 |
| --flash_maxOverlap | positive_integer | Maximum overlap length expected in approximately 90% of read pairs. | Only used on fastq inputs. default=1000 |
| --extra_megahit | string | Any additional arguments for Megahit (https://pubmed.ncbi.nlm.nih.gov/25609793/). | E.g. --extra_megahit '--k-min 25' |
| --extra_trimGalore | string | Any additional arguments for TrimGalore (https://github.com/FelixKrueger/TrimGalore). | |
| --extra_prefetch | string | Any additional arguments for prefetch (https://github.com/ncbi/sra-tools). | |
| --extra_fasterq_dump | string | Any additional arguments for fasterq-dump (https://github.com/ncbi/sra-tools). | |
| --extra_kpopCount | string | Any additional arguments for KPopCount (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#41-kpopcount). | |
| --extra_kpopCountDB | string | Any additional arguments for KPopCountDB (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#42-kpopcountdb). | |
| --extra_kpopTwist | string | Any additional arguments for KPopTwist (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#43-kpoptwist). | |
| --extra_kpopTwistDB | string | Any additional arguments for KPopTwistDB (https://github.com/PaoloRibeca/KPop?tab=readme-ov-file#44-kpoptwistdb). | |
| --kpopPhylo_power | positive_integer | Set the external power when computing distances. | default=2 |
| --kpopPhylo_distance | string | Distance measure to be used. Must be one of 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary' or 'minkowski'. | default=euclidean |
| --kpopPhylo_magic | string | Cluster-related variable (not currently implemented). | default=1 |
| --kpopScale_power | string | Set the external power when computing distances. | default=2 |
