The Structure of a Project Configuration File

`[General]`

output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
entrez_api_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
run_rcorrector = Yes
run_inter_pro_scan = No
prepend_assembly_name_to_sequence_name = Yes
kraken_2_confidence = 0.20
requery_after = 7
use_colors = Yes

output_directory: a path to a directory where kakapo should place all of its output.
project_name: a short (hopefully meaningful) name for the analysis. kakapo will create a subdirectory with this name to store project-specific output (log files, backups of the configuration files used, assembled sequences, etc.): [output_directory]/02-project-specific/[project_name].
entrez_api_key: kakapo will make use of the public resources on GenBank. GenBank users are allowed 3 requests/second without an API key. With an API key, the limit is increased to 10 requests/second. You will need to generate the key here: https://www.ncbi.nlm.nih.gov/account/settings
run_rcorrector: if set to Yes, the reads will be processed by Rcorrector.
run_inter_pro_scan: if set to Yes, the translated CDS sequences found by kakapo will be submitted for functional annotation by InterProScan to https://www.ebi.ac.uk/interpro/search/sequence.
prepend_assembly_name_to_sequence_name: if set to Yes, kakapo will prepend the sample name to the assembled isoform names. The sample name is derived from the type of input:
- SRA: GenBank metadata.
- FASTQ: file name.
- FASTA (user-provided assembly): file name.
If set to No, the unaltered names produced by SPAdes will be used in kakapo output. I highly discourage setting this option to No when more than one sample is being analyzed.
kraken_2_confidence: a value between 0 and 1. I find that a value of 0.20 works quite well. Higher confidence values will reduce the number of reads classified (filtered out), and lower values will increase it. See the discussion here for more guidance.
requery_after: kakapo will not search GenBank and/or Pfam again if the same search was already performed and its results are less than this many days old.
use_colors: if set to Yes, adds color to the log messages in the terminal. The colors may not look great on light terminal backgrounds.
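
As a rough illustration of how the [General] values can be interpreted, here is a minimal Python sketch. It is not kakapo's own parser: the helper names read_general and needs_requery are hypothetical, and the use of the standard configparser module is an assumption made for the example. It reads the Yes/No options as booleans, checks that kraken_2_confidence lies between 0 and 1, builds the project directory path described above, and applies the requery_after rule.

```python
import configparser
from datetime import datetime, timedelta
from pathlib import PurePosixPath

EXAMPLE = """\
[General]
output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
run_rcorrector = Yes
run_inter_pro_scan = No
kraken_2_confidence = 0.20
requery_after = 7
use_colors = Yes
"""


def read_general(text):
    """Parse the [General] section into typed values (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    general = parser["General"]

    confidence = general.getfloat("kraken_2_confidence")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("kraken_2_confidence must be between 0 and 1")

    output_directory = PurePosixPath(general["output_directory"])
    project_name = general["project_name"]

    return {
        "output_directory": output_directory,
        "project_name": project_name,
        # Layout taken from the project_name description above.
        "project_dir": output_directory / "02-project-specific" / project_name,
        # getboolean() accepts Yes/No regardless of letter case.
        "run_rcorrector": general.getboolean("run_rcorrector"),
        "run_inter_pro_scan": general.getboolean("run_inter_pro_scan"),
        "kraken_2_confidence": confidence,
        "requery_after": timedelta(days=general.getint("requery_after")),
        "use_colors": general.getboolean("use_colors"),
    }


def needs_requery(last_search, requery_after, now):
    """True when the cached results are no longer younger than requery_after."""
    return now - last_search >= requery_after


settings = read_general(EXAMPLE)
print(settings["project_dir"])
print(needs_requery(datetime(2024, 1, 1), settings["requery_after"],
                    datetime(2024, 1, 5)))
```

With requery_after = 7, results from four days ago count as fresh, so the sketch reports that no new search is needed.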

`[Target filters]`

allow_non_aug_start_codon = No
allow_missing_start_codon = No
allow_missing_stop_codon = No

allow_non_aug_start_codon: allow other start codons, in addition to AUG. A set of appropriate start codons will be chosen using GenBank taxonomical classification. If allow_non_aug_start_codon is set to No, allow_missing_start_codon will have no effect.
allow_missing_start_codon: annotate ORFs even if the start codon is missing. For allow_missing_start_codon to have any effect, allow_non_aug_start_codon must be set to Yes.
allow_missing_stop_codon: annotate ORFs even if the stop codon is missing.
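
The way these three options interact can be sketched in a few lines of Python. This only illustrates the rules as described above and is not kakapo's ORF annotation code: the function name orf_passes_target_filters is hypothetical, and the alternative_start_codons set is a placeholder, since kakapo chooses the real set from the GenBank taxonomic classification.

```python
def orf_passes_target_filters(
    start_codon,
    has_stop_codon,
    *,
    allow_non_aug_start_codon=False,
    allow_missing_start_codon=False,
    allow_missing_stop_codon=False,
    # Placeholder set for illustration only; the real set depends on taxonomy.
    alternative_start_codons=frozenset({"GUG", "UUG"}),
):
    """Decide whether an ORF is kept. start_codon is None when it is missing."""
    if not has_stop_codon and not allow_missing_stop_codon:
        return False
    if start_codon == "AUG":
        return True
    if not allow_non_aug_start_codon:
        # With this option off, allow_missing_start_codon has no effect.
        return False
    if start_codon is None:
        return allow_missing_start_codon
    return start_codon in alternative_start_codons


# Missing start codon: accepted only when both start-codon options are Yes.
only_missing_allowed = orf_passes_target_filters(
    None, True, allow_missing_start_codon=True)
both_allowed = orf_passes_target_filters(
    None, True, allow_non_aug_start_codon=True, allow_missing_start_codon=True)
print(only_missing_allowed, both_allowed)
```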

`[Query taxonomic group]`

plants

The group your samples belong to. Although it is rare, Latin binomials of vastly different organisms may contain the same words. The purpose of this setting is to restrict the search space to a relatively broad taxonomic group in order to avoid ambiguity in name resolution. You may choose between animals, archaea, bacteria, fungi, plants, viruses. Alternatively, you may enter an NCBI TaxID for any taxon as long as all of your samples belong to it. You can look up NCBI TaxIDs here.

`[Target SRA accessions]`

SRR7829961
SRR23214014

A list of SRA accessions, one per line.

`[Target FASTQ files]`

/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample1_R*.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample2_R*.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample3_R*.fq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample4_R1.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample5_R1.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample6_R1.fq

A list of FASTQ files, one entry per line. The files can be gzip-compressed or not. For paired-end reads, replace the 1/2 or F/R in the file name with a *. Anything with a * in the file name is treated as a paired-end set. File names without a * character will be treated as single-read (forward-read only) files, even if the reverse reads are in the same directory.
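
The * convention can be illustrated with a short Python sketch. It assumes shell-style matching through the standard glob module, which may differ from how kakapo resolves the entries internally, and classify_fastq_entry is a hypothetical helper name. The demonstration creates empty placeholder files in a temporary directory.

```python
import glob
import os
import tempfile


def classify_fastq_entry(entry):
    """Classify one [Target FASTQ files] entry and list the files it names."""
    if "*" in entry:
        # A * marks a paired-end set; the pattern matches both read files.
        return "paired-end", sorted(glob.glob(entry))
    # No *: single-read, even if a reverse-read file sits next to it.
    return "single-read", [entry]


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
                 "sample4_R1.fastq.gz", "sample4_R2.fastq.gz"):
        open(os.path.join(tmp, name), "w").close()

    paired = classify_fastq_entry(os.path.join(tmp, "sample1_R*.fastq.gz"))
    single = classify_fastq_entry(os.path.join(tmp, "sample4_R1.fastq.gz"))
    paired_names = [os.path.basename(p) for p in paired[1]]
    single_names = [os.path.basename(p) for p in single[1]]

print(paired[0], paired_names)
print(single[0], single_names)
```

Note that sample4_R2.fastq.gz is ignored in the second case: without a * in the entry, only the named forward-read file is used.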

`[Target assemblies: FASTA files (DNA)]`

/home/kakapo/kakapo-input/assemblies/Matucana_madisoniorum_HBG13.fasta

A list of FASTA files, one entry per line. Use this section if you already have a set of transcripts or any other set of sequences without introns (CDS, mRNA). kakapo will perform the gene search part of the pipeline on these sequences and will output the matching transcripts together with the transcripts derived from raw reads.

`[Bowtie2 filter order]`

cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta
plastid
mitochondrion

A list of FASTA files and/or keywords plastid and mitochondrion, one entry per line. For the keywords plastid and mitochondrion, kakapo will find the most closely related plastid or mitochondrial assembly on GenBank. Reads mapping to any of the entries listed here will be stored in subdirectories in the output_directory.
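
The two kinds of entries shown in the example (a name = path pair and a bare keyword) could be told apart along these lines. This is a sketch only: parse_bowtie2_entry is a hypothetical helper, and reading the left-hand side of = as a label for the reference is an assumption based on the example entry above, not a documented rule.

```python
KEYWORDS = {"plastid", "mitochondrion"}


def parse_bowtie2_entry(line):
    """Split one [Bowtie2 filter order] line into its kind, name, and path."""
    line = line.strip()
    if line in KEYWORDS:
        # kakapo looks up a closely related assembly on GenBank for these.
        return {"kind": "keyword", "name": line, "path": None}
    if "=" in line:
        # Assumed form: label = path to a FASTA file.
        name, path = (part.strip() for part in line.split("=", 1))
        return {"kind": "fasta", "name": name, "path": path}
    raise ValueError(f"unrecognized entry: {line!r}")


entries = [
    "cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta",
    "plastid",
    "mitochondrion",
]
# The list order is preserved, matching the order given in the section.
parsed = [parse_bowtie2_entry(entry) for entry in entries]
for item in parsed:
    print(item)
```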

`[Kraken2 filter order]`

16S_Silva132
16S_Silva138
viral
mitochondrion
plastid
mitochondrion_and_plastid
minikraken_8GB_2020-03-12

A list of Kraken2 databases, one entry per line. kakapo will download a few smaller Kraken2 databases during the dependency installation process. You can place (or link) additional databases in the ~/.local/share/kakapo/kraken2_dbs directory for them to be visible to kakapo. Reads classified by Kraken2 will be stored in subdirectories in the output_directory.

`[BLAST SRA/FASTQ]`

evalue = 1e-5
max_hsps = 10000
qcov_hsp_perc = 1
best_hit_overhang = 0.05
best_hit_score_edge = 0.25
max_target_seqs = 1000000

`[BLAST assemblies]`

evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

If any of the settings in the [BLAST assemblies] section are found in the search strategies files, they will be overwritten for each search strategy.
