
The Structure of a Project Configuration File

Karolis Ramanauskas edited this page Aug 14, 2023 · 14 revisions

[General]

output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
entrez_api_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
run_rcorrector = Yes
run_inter_pro_scan = No
prepend_assembly_name_to_sequence_name = Yes
kraken_2_confidence = 0.20
requery_after = 7
use_colors = Yes
  • output_directory: a path to a directory where kakapo should place all of its output.

  • project_name: a short (hopefully meaningful) name for the analysis. kakapo will create a subdirectory with this name to store project-specific output (log files, backups of the configuration files used, assembled sequences, etc.): [output_directory]/02-project-specific/[project_name].

  • entrez_api_key: kakapo will make use of the public resources on GenBank. GenBank users are allowed 3 requests/second without an API key. With an API key, the limit is increased to 10 requests/second. You will need to generate the key here: https://www.ncbi.nlm.nih.gov/account/settings

  • run_rcorrector: if set to Yes, the reads will be processed by Rcorrector.

  • run_inter_pro_scan: if set to Yes, the translated CDS sequences found by kakapo will be submitted to InterProScan (https://www.ebi.ac.uk/interpro/search/sequence) for functional annotation.

  • prepend_assembly_name_to_sequence_name: if set to Yes, kakapo will prepend the sample name to the assembled isoform names. The source of the sample name depends on the type of input:

    • SRA: GenBank metadata.
    • FASTQ: file name.
    • FASTA (user-provided assembly): file name.

    If set to No, the unaltered names produced by SPAdes will be used in kakapo output. I highly discourage setting this option to No when more than one sample is being analyzed.

  • kraken_2_confidence: this should be set to a value between 0 and 1. I find that a value of 0.20 works quite well. Higher confidence values reduce the number of reads that are classified (and therefore filtered out); lower values increase it. See discussion here for more guidance.

  • requery_after: kakapo will not search GenBank and/or Pfam again if the same search has already been performed and the results are less than this many days old.

  • use_colors: if set to Yes, adds color to the log messages in the terminal. The colors may not look great on light terminal backgrounds.
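
As a rough illustration of how these settings fit together, the following Python sketch reads a [General] section with the standard-library configparser module and derives the project-specific directory described above. This is not kakapo's own parsing code: the helper name load_general and the returned dictionary keys are hypothetical, and the sketch assumes that Yes/No values can be treated as booleans.

```python
import configparser
from pathlib import PurePosixPath

EXAMPLE = """
[General]
output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
run_rcorrector = Yes
run_inter_pro_scan = No
kraken_2_confidence = 0.20
requery_after = 7
"""


def load_general(text):
    """Parse the [General] section into a plain dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    general = parser["General"]

    # The confidence value is documented as lying between 0 and 1.
    confidence = general.getfloat("kraken_2_confidence")
    if not 0 <= confidence <= 1:
        raise ValueError("kraken_2_confidence must be between 0 and 1")

    out_dir = PurePosixPath(general["output_directory"])
    return {
        # getboolean() accepts Yes/No regardless of case.
        "run_rcorrector": general.getboolean("run_rcorrector"),
        "run_inter_pro_scan": general.getboolean("run_inter_pro_scan"),
        "kraken_2_confidence": confidence,
        "requery_after_days": general.getint("requery_after"),
        # [output_directory]/02-project-specific/[project_name]
        "project_dir": out_dir / "02-project-specific" / general["project_name"],
    }


settings = load_general(EXAMPLE)
print(settings["project_dir"])
# prints /home/kakapo/kakapo-output/02-project-specific/kakapo-prj-01
```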

[Target filters]

allow_non_aug_start_codon = No
allow_missing_start_codon = No
allow_missing_stop_codon = No
  • allow_non_aug_start_codon: allow start codons other than AUG. A set of appropriate start codons will be chosen using the GenBank taxonomic classification. If allow_non_aug_start_codon is set to No, allow_missing_start_codon will have no effect.

  • allow_missing_start_codon: annotate ORFs even if the start codon is missing. For allow_missing_start_codon to have any effect, allow_non_aug_start_codon must be set to Yes.

  • allow_missing_stop_codon: annotate ORFs even if the stop codon is missing.
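
The dependency between the two start-codon options can be summarized with a small Python sketch. The function name effective_start_codon_rules is hypothetical and only restates the rule given above; it does not reproduce kakapo's internal logic.

```python
def effective_start_codon_rules(allow_non_aug, allow_missing_start):
    """Return the start-codon settings that would actually take effect."""
    return {
        "allow_non_aug_start_codon": allow_non_aug,
        # Per the rule above, this option is ignored unless
        # non-AUG start codons are also allowed.
        "allow_missing_start_codon": allow_non_aug and allow_missing_start,
    }


# allow_missing_start_codon = Yes has no effect on its own:
print(effective_start_codon_rules(False, True))
# Both options set to Yes:
print(effective_start_codon_rules(True, True))
```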

[Query taxonomic group]

plants

The taxonomic group your samples belong to. Although it is rare, Latin binomials of vastly different organisms may contain the same words. The purpose of this setting is to restrict the search space to a relatively broad taxonomic group in order to avoid ambiguity in name resolution. You may choose one of: animals, archaea, bacteria, fungi, plants, viruses. Alternatively, you may enter an NCBI TaxID for any taxon, as long as all of your samples belong to it. You can look up NCBI TaxIDs in the NCBI Taxonomy database.
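
To make the two accepted kinds of value concrete, here is a minimal Python sketch that distinguishes a named group from a numeric TaxID. The function classify_group_value is hypothetical, and exact (case-sensitive) matching of group names is an assumption of this sketch, not a documented kakapo behavior.

```python
NAMED_GROUPS = {"animals", "archaea", "bacteria", "fungi", "plants", "viruses"}


def classify_group_value(value):
    """Classify a [Query taxonomic group] entry (hypothetical helper)."""
    value = value.strip()
    if value in NAMED_GROUPS:
        return ("group", value)
    if value.isdecimal():
        # Anything purely numeric is treated as an NCBI TaxID.
        return ("taxid", int(value))
    raise ValueError(f"not a known group name or a numeric TaxID: {value!r}")


print(classify_group_value("plants"))
print(classify_group_value("33090"))  # 33090 is the NCBI TaxID of Viridiplantae
```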

[Target SRA accessions]

SRR7829961
SRR23214014

[Target FASTQ files]

pe_001 = /home/kakapo/kakapo-input/fastq/Solanum_chilense_sample1_R*.fastq.gz
pe_002 = /home/kakapo/kakapo-input/fastq/Solanum_chilense_sample2_R*.fastq
pe_003 = /home/kakapo/kakapo-input/fastq/Solanum_chilense_sample3_R*.fq.gz
se_001 = /home/kakapo/kakapo-input/fastq/Solanum_chilense_sample4_R1.fastq.gz
se_002 = /home/kakapo/kakapo-input/fastq/Solanum_chilense_sample5_R1.fastq
se_003 = /home/kakapo/kakapo-input/fastq/Solanum_chilense_sample6_R1.fq
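
The pe_ entries above use a * wildcard in place of the read number. As a sketch, and assuming the wildcard behaves like a shell glob that should match both mate files of a paired-end sample, the matching can be illustrated with Python's fnmatch module. The candidate file names below are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern taken from the pe_001 entry (file name part only).
pattern = "Solanum_chilense_sample1_R*.fastq.gz"

# Hypothetical directory listing.
candidates = [
    "Solanum_chilense_sample1_R1.fastq.gz",
    "Solanum_chilense_sample1_R2.fastq.gz",
    "Solanum_chilense_sample2_R1.fastq",
]

# Keep only the names that match the wildcard pattern.
matched = sorted(name for name in candidates if fnmatchcase(name, pattern))
print(matched)
# prints ['Solanum_chilense_sample1_R1.fastq.gz', 'Solanum_chilense_sample1_R2.fastq.gz']
```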

[Target assemblies: FASTA files (DNA)]

/home/kakapo/kakapo-input/assemblies/Matucana_madisoniorum_HBG13.fasta

[Bowtie2 filter order]

# cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta
# plastid
# mitochondrion
# ...

[Kraken2 filter order]

# 16S_Silva132
# 16S_Silva138
# viral
# mitochondrion
# plastid
# mitochondrion_and_plastid
# minikraken_8GB_2020-03-12

[BLAST SRA/FASTQ]

evalue = 1e-5
max_hsps = 10000
qcov_hsp_perc = 1
best_hit_overhang = 0.05
best_hit_score_edge = 0.25
max_target_seqs = 1000000

[BLAST assemblies]

evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

If a search strategies file defines any of the settings listed in the [BLAST assemblies] section, the values given here will be overridden for that search strategy.
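
One way to picture this precedence is as a per-strategy dictionary merge. The Python sketch below assumes that values from a search strategies file win over the [BLAST assemblies] defaults; the strategy_overrides dictionary and the function effective_blast_settings are hypothetical and are not part of kakapo.

```python
# Defaults from the [BLAST assemblies] section above.
blast_assemblies = {
    "evalue": 1e-20,
    "max_hsps": 4,
    "qcov_hsp_perc": 70,
    "best_hit_overhang": 0.15,
    "best_hit_score_edge": 0.15,
    "max_target_seqs": 500,
}


def effective_blast_settings(defaults, strategy_overrides):
    """Merge settings so that per-strategy values take precedence."""
    return {**defaults, **strategy_overrides}


# Hypothetical search strategy that relaxes only the e-value threshold.
strategy_overrides = {"evalue": 1e-10}
effective = effective_blast_settings(blast_assemblies, strategy_overrides)
print(effective["evalue"], effective["max_hsps"])
```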
