
The Structure of a Project Configuration File

Karolis Ramanauskas edited this page Aug 15, 2023 · 14 revisions

[General]

output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
entrez_api_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
run_rcorrector = Yes
run_inter_pro_scan = No
prepend_assembly_name_to_sequence_name = Yes
kraken_2_confidence = 0.20
requery_after = 7
use_colors = Yes
  • output_directory: a path to a directory where kakapo places all of its output.

  • project_name: a short name for the analysis. kakapo creates a subdirectory with this name, [output_directory]/02-project-specific/[project_name], where project-specific output files are stored (log files, backups of the configuration files used, assembled sequences, etc.). A well-chosen name helps considerably with future data management.

  • entrez_api_key: this field is required. kakapo queries GenBank, which allows 3 requests per second without an API key and 10 requests per second with one. To obtain your key, go to https://www.ncbi.nlm.nih.gov/account/settings

  • run_rcorrector: if set to Yes, the reads are processed by Rcorrector.

  • run_inter_pro_scan: if set to Yes, the translated CDS sequences found by kakapo are submitted for functional annotation by InterProScan to https://www.ebi.ac.uk/interpro/search/sequence.

  • prepend_assembly_name_to_sequence_name: if set to Yes, the sample name is prepended to the assembled isoform names. The sample name is derived from the type of input:

    • SRA: GenBank metadata.
    • FASTQ: file name.
    • FASTA (user-provided assembly): file name.

    If set to No, the unaltered names produced by SPAdes are used in kakapo output. (Setting this option to No is strongly discouraged when more than one sample is being analyzed.)

  • kraken_2_confidence: a value between 0 and 1. I find that a value of 0.20 works quite well. Higher values reduce the number of reads classified (and therefore filtered out); lower values increase it. See the helpful discussion here for additional guidance.

  • requery_after: a number of days. If a GenBank and/or Pfam search was already performed and its results are less than this many days old, kakapo reuses those results instead of re-querying.

  • use_colors: if set to Yes, adds color to the log messages in the terminal. The colors may not look great on light terminal backgrounds.
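
As a rough illustration of two of the settings above, the sketch below builds the project-specific directory path and decides whether cached search results are old enough to query again. The function names project_dir and should_requery are hypothetical and are not part of kakapo; how kakapo implements these checks internally is not shown on this page, so treat this only as a reading aid.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath


def project_dir(output_directory: str, project_name: str) -> PurePosixPath:
    """Build [output_directory]/02-project-specific/[project_name]."""
    return PurePosixPath(output_directory) / "02-project-specific" / project_name


def should_requery(last_query: datetime, now: datetime,
                   requery_after_days: float) -> bool:
    """Return True once cached results are at least requery_after_days old.

    Assumption: results younger than the threshold are reused, as the
    requery_after description states; the exact boundary behavior in
    kakapo is not documented here.
    """
    return now - last_query >= timedelta(days=requery_after_days)


if __name__ == "__main__":
    print(project_dir("/home/kakapo/kakapo-output", "kakapo-prj-01"))
    print(should_requery(datetime(2023, 8, 1), datetime(2023, 8, 5), 7))
```

With requery_after = 7, results obtained four days ago would be reused, while results obtained eight days ago would trigger a new query.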

[Target filters]

allow_non_aug_start_codon = No
allow_missing_start_codon = No
allow_missing_stop_codon = No
  • allow_non_aug_start_codon: allow other start codons in addition to AUG. A set of appropriate start codons is chosen using the GenBank taxonomic classification. If allow_non_aug_start_codon is set to No, allow_missing_start_codon has no effect.

  • allow_missing_start_codon: annotate ORFs even if the start codon is missing. For allow_missing_start_codon to have any effect, allow_non_aug_start_codon must be set to Yes.

  • allow_missing_stop_codon: annotate ORFs even if the stop codon is missing.
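
The dependency between the two start codon options can be summarized in a few lines. This is a minimal sketch of the rule as described above, not kakapo's own code; the function name effective_target_filters is hypothetical.

```python
def effective_target_filters(allow_non_aug_start_codon: bool,
                             allow_missing_start_codon: bool,
                             allow_missing_stop_codon: bool) -> dict:
    """Return the settings that actually take effect.

    allow_missing_start_codon only matters when allow_non_aug_start_codon
    is Yes (True); allow_missing_stop_codon is independent of both.
    """
    return {
        "allow_non_aug_start_codon": allow_non_aug_start_codon,
        "allow_missing_start_codon": (allow_non_aug_start_codon
                                      and allow_missing_start_codon),
        "allow_missing_stop_codon": allow_missing_stop_codon,
    }


if __name__ == "__main__":
    # allow_missing_start_codon = Yes is ignored because non-AUG starts are off.
    print(effective_target_filters(False, True, False))
```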

[Query taxonomic group]

plants

The group your samples belong to. Although it is rare, Latin binomials of vastly different organisms may contain the same words. The purpose of this setting is to restrict the search space to a relatively broad taxonomic group in order to avoid ambiguity in name resolution. Choose one of: animals, archaea, bacteria, fungi, plants, or viruses. Alternatively, you may enter an NCBI TaxID for any taxon, as long as all of your samples belong to it. You can look up NCBI TaxIDs here.
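
To make the two accepted kinds of value concrete, here is a small sketch that distinguishes a named group from a numeric TaxID. The function parse_taxonomic_group is hypothetical, and the exact validation kakapo performs (for example, whether group names are case-sensitive) is an assumption.

```python
NAMED_GROUPS = {"animals", "archaea", "bacteria", "fungi", "plants", "viruses"}


def parse_taxonomic_group(value: str) -> tuple:
    """Classify a [Query taxonomic group] entry.

    Returns ("group", name) for one of the named groups, or
    ("taxid", number) for an all-digit NCBI TaxID.
    """
    text = value.strip()
    if text in NAMED_GROUPS:
        return ("group", text)
    if text.isascii() and text.isdigit():
        return ("taxid", int(text))
    raise ValueError(f"not a named group or a numeric TaxID: {value!r}")


if __name__ == "__main__":
    print(parse_taxonomic_group("plants"))
    print(parse_taxonomic_group("4070"))
```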

[Target SRA accessions]

SRR7829961
SRR23214014

A list of SRA accessions. One per line.

[Target FASTQ files]

/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample1_R*.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample2_R*.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample3_R*.fq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample4_R1.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample5_R1.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample6_R1.fq

A list of FASTQ files, one entry per line. Files can be gzip-compressed or not. For paired-end reads, replace the 1/2 or F/R in the file name with a *. Any entry with a * in the file name is treated as a paired-end set. File names without a * character are treated as single-end (forward-read only) files, even if the reverse reads are in the same directory.
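
The paired-end rule above can be sketched as follows. The helper names are hypothetical, and the use of shell-style wildcard matching to find the files behind a * entry is an assumption about how such an entry could be resolved, not a description of kakapo's implementation.

```python
from fnmatch import fnmatchcase


def classify_fastq_entry(entry: str) -> str:
    """Entries with a * in the file name are paired-end, others single-end."""
    file_name = entry.rsplit("/", 1)[-1]
    return "paired-end" if "*" in file_name else "single-end"


def files_matching(entry: str, candidates: list) -> list:
    """Return the candidate paths that a * entry would match (sorted)."""
    return sorted(path for path in candidates if fnmatchcase(path, entry))


if __name__ == "__main__":
    base = "/home/kakapo/kakapo-input/fastq/"
    entry = base + "Solanum_chilense_sample1_R*.fastq.gz"
    on_disk = [
        base + "Solanum_chilense_sample1_R1.fastq.gz",
        base + "Solanum_chilense_sample1_R2.fastq.gz",
        base + "Solanum_chilense_sample4_R1.fastq.gz",
    ]
    print(classify_fastq_entry(entry))
    print(files_matching(entry, on_disk))
```

Note that an entry such as Solanum_chilense_sample4_R1.fastq.gz is classified as single-end even if a matching R2 file sits next to it, which mirrors the behavior described above.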

[Target assemblies: FASTA files (DNA)]

/home/kakapo/kakapo-input/assemblies/Matucana_madisoniorum_HBG13.fasta

A list of FASTA files, one entry per line. Use this section if you already have a set of transcripts or any other set of sequences without introns (CDS, mRNA). kakapo will perform the gene search part of the pipeline on these sequences and will output the matching transcripts it finds, together with the transcripts derived from raw reads.

[Bowtie2 filter order]

cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta
plastid
mitochondrion

A list of FASTA files and/or keywords plastid and mitochondrion, one entry per line. For the keywords plastid and mitochondrion, kakapo finds the most closely related plastid or mitochondrial assembly on GenBank. Reads mapping to any of the entries listed here are stored in subdirectories in the output_directory.
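
The example entries in this section take two forms: a name = path pair and a bare keyword. The sketch below parses only those two forms, in the order listed. The function parse_bowtie2_entry and the returned tuple layout are hypothetical; whether kakapo accepts other forms is not stated here.

```python
KEYWORDS = {"plastid", "mitochondrion"}


def parse_bowtie2_entry(line: str) -> tuple:
    """Parse one [Bowtie2 filter order] line.

    Returns ("keyword", name, None) for plastid / mitochondrion, or
    ("fasta", name, path) for a "name = path" entry.
    """
    text = line.strip()
    if text in KEYWORDS:
        return ("keyword", text, None)
    if "=" in text:
        name, path = (part.strip() for part in text.split("=", 1))
        if name and path:
            return ("fasta", name, path)
    raise ValueError(f"unrecognized entry: {line!r}")


if __name__ == "__main__":
    lines = [
        "cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta",
        "plastid",
        "mitochondrion",
    ]
    # The list order is preserved, matching the "filter order" in the section name.
    for entry in [parse_bowtie2_entry(line) for line in lines]:
        print(entry)
```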

[Kraken2 filter order]

16S_Silva132
16S_Silva138
viral
mitochondrion
plastid
mitochondrion_and_plastid
minikraken_8GB_2020-03-12

A list of Kraken2 databases, one entry per line. kakapo will download a few smaller Kraken2 databases during the dependency installation process. You can place (or link) additional databases in the ~/.local/share/kakapo/kraken2_dbs directory for them to be visible to kakapo. Reads classified by Kraken2 will be stored in subdirectories in the output_directory.

[BLAST SRA/FASTQ]

evalue = 1e-5
max_hsps = 10000
qcov_hsp_perc = 1
best_hit_overhang = 0.05
best_hit_score_edge = 0.25
max_target_seqs = 1000000

BLAST parameters used when searching RNA-Seq reads for matches to the query.

[BLAST assemblies]

evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

BLAST parameters used when searching assembled transcripts for matches to the query. Settings in this section can be overridden in a search strategies file; each search strategy can have its own settings.
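
One way to picture the override behavior is a per-strategy dictionary layered over the defaults from this section. This is only a sketch: the function settings_for_strategy is hypothetical, and the format of the search strategies file (including its key names) is not described on this page, so the override shown is an assumed example.

```python
BLAST_ASSEMBLIES_DEFAULTS = {
    "evalue": 1e-20,
    "max_hsps": 4,
    "qcov_hsp_perc": 70,
    "best_hit_overhang": 0.15,
    "best_hit_score_edge": 0.15,
    "max_target_seqs": 500,
}


def settings_for_strategy(defaults: dict, overrides: dict) -> dict:
    """Return defaults with per-strategy overrides applied.

    Unknown keys are rejected so that a typo does not pass silently.
    The defaults dictionary itself is left unchanged.
    """
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"unknown BLAST settings: {unknown}")
    return {**defaults, **overrides}


if __name__ == "__main__":
    # Hypothetical strategy that relaxes only the e-value threshold.
    print(settings_for_strategy(BLAST_ASSEMBLIES_DEFAULTS, {"evalue": 1e-10}))
```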
