The Structure of a Project Configuration File
output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
entrez_api_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
run_rcorrector = Yes
run_inter_pro_scan = No
prepend_assembly_name_to_sequence_name = Yes
kraken_2_confidence = 0.20
requery_after = 7
use_colors = Yes
- output_directory: a path to a directory where kakapo should place all of its output.
- project_name: a short (hopefully meaningful) name for the analysis. kakapo will create a subdirectory with this name where project-specific output will be stored (log files, backups of the configuration files used, assembled sequences, etc.): [output_directory]/02-project-specific/[project_name].
- entrez_api_key: kakapo makes use of the public resources on GenBank. GenBank users are allowed 3 requests/second without an API key. With an API key, the limit is increased to 10 requests/second. You will need to generate the key here: https://www.ncbi.nlm.nih.gov/account/settings
- run_rcorrector: if set to Yes, the reads will be processed by Rcorrector.
- run_inter_pro_scan: if set to Yes, the translated CDS sequences found by kakapo will be submitted for functional annotation by InterProScan to https://www.ebi.ac.uk/interpro/search/sequence.
- prepend_assembly_name_to_sequence_name: if set to Yes, kakapo will prepend the sample name to the assembled isoform names. The sample name is derived based on the type of input:
  - SRA: GenBank metadata.
  - FASTQ: file name.
  - FASTA (user-provided assembly): file name.
  If set to No, the unaltered names produced by SPAdes will be used in kakapo output. I highly discourage setting this option to No when more than one sample is being analyzed.
- kraken_2_confidence: this should be set to a value between 0 and 1. I find that a value of 0.20 works quite well. Higher confidence values will reduce the number of reads classified (filtered out), and lower values will increase it. See discussion here for more guidance.
- requery_after: do not search GenBank and/or Pfam if the search was performed already and the results are less than this many days old.
- use_colors: if set to Yes, adds color to the log messages in the terminal. This may not look great on light terminal backgrounds.
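To make the options above concrete, here is a minimal Python sketch (not kakapo's own code) that reads a few of these settings, derives the project-specific directory, and applies the requery_after rule. The [General] section header and the function names load_general, project_dir, and can_reuse_results are assumptions made for illustration; the fragment above does not show a section header.

```python
import configparser
from datetime import datetime, timedelta
from pathlib import PurePosixPath

# Hypothetical section header: the fragment in this document shows none.
SAMPLE = """
[General]
output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
run_rcorrector = Yes
run_inter_pro_scan = No
kraken_2_confidence = 0.20
requery_after = 7
"""


def load_general(text):
    """Parse the general options into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sec = parser["General"]
    confidence = sec.getfloat("kraken_2_confidence")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("kraken_2_confidence must be between 0 and 1")
    return {
        "output_directory": sec["output_directory"],
        "project_name": sec["project_name"],
        # getboolean() accepts Yes/No regardless of case.
        "run_rcorrector": sec.getboolean("run_rcorrector"),
        "run_inter_pro_scan": sec.getboolean("run_inter_pro_scan"),
        "kraken_2_confidence": confidence,
        "requery_after": sec.getint("requery_after"),
    }


def project_dir(settings):
    """[output_directory]/02-project-specific/[project_name]"""
    return (PurePosixPath(settings["output_directory"])
            / "02-project-specific" / settings["project_name"])


def can_reuse_results(searched_at, now, requery_after_days):
    """True when earlier results are less than requery_after days old."""
    return now - searched_at < timedelta(days=requery_after_days)


settings = load_general(SAMPLE)
print(project_dir(settings))
```

The age check uses a strict "less than" comparison to mirror the wording of requery_after; how kakapo treats results that are exactly that many days old is not stated here.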
allow_non_aug_start_codon = No
allow_missing_start_codon = No
allow_missing_stop_codon = No
- allow_non_aug_start_codon: allow other start codons, in addition to AUG. A set of appropriate start codons will be chosen using GenBank taxonomical classification. If allow_non_aug_start_codon is set to No, allow_missing_start_codon will have no effect.
- allow_missing_start_codon: annotate ORFs even if the start codon is missing. For allow_missing_start_codon to have any effect, allow_non_aug_start_codon must be set to Yes.
- allow_missing_stop_codon: annotate ORFs even if the stop codon is missing.
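The dependency between the two start codon options can be sketched as a small Python function. This only illustrates the rule stated above; the function name effective_orf_options and the dictionary keys are hypothetical and not part of kakapo.

```python
def effective_orf_options(allow_non_aug_start_codon,
                          allow_missing_start_codon,
                          allow_missing_stop_codon):
    """Return the options that actually take effect.

    allow_missing_start_codon only matters when
    allow_non_aug_start_codon is also enabled.
    """
    return {
        "non_aug_start": allow_non_aug_start_codon,
        "missing_start": (allow_non_aug_start_codon
                          and allow_missing_start_codon),
        # The stop codon option is independent of the other two.
        "missing_stop": allow_missing_stop_codon,
    }


# Missing start codons requested, but non-AUG start codons disabled:
print(effective_orf_options(False, True, False))
```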
plants
The group your samples belong to. Although it is rare, Latin binomials of vastly different organisms may contain the same words. The purpose of this setting is to restrict the search space to a relatively broad taxonomic group in order to avoid ambiguity in name resolution. You may choose between animals, archaea, bacteria, fungi, plants, and viruses. Alternatively, you may enter an NCBI TaxID for any taxon as long as all of your samples belong to it. You can look up NCBI TaxIDs here.
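As an illustration of the two accepted forms, the following Python sketch classifies a value as either one of the six group names or a numeric NCBI TaxID. The function name parse_taxonomic_group is hypothetical, the TaxID used in the example is arbitrary, and the sketch does not check that a TaxID exists at NCBI.

```python
GROUPS = ("animals", "archaea", "bacteria", "fungi", "plants", "viruses")


def parse_taxonomic_group(value):
    """Classify the setting as a named group or a numeric TaxID."""
    value = value.strip()
    if value in GROUPS:
        return ("group", value)
    # A TaxID is written as a plain positive integer.
    if value.isascii() and value.isdigit() and int(value) > 0:
        return ("taxid", int(value))
    raise ValueError(
        "expected one of %s or an NCBI TaxID, got %r"
        % (", ".join(GROUPS), value)
    )


print(parse_taxonomic_group("plants"))
print(parse_taxonomic_group("4070"))
```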
SRR7829961
SRR23214014
A list of SRA accessions. One per line.
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample1_R*.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample2_R*.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample3_R*.fq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample4_R1.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample5_R1.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample6_R1.fq
A list of FASTQ files, one entry per line. Files can be gzip-compressed or not. For paired-end reads, replace 1/2 or F/R with a *. Anything with * in the file name is treated as a paired-end set. File names without a * character will be treated as single-read (forward-read only) files even if the reverse reads are in the same directory.
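The paired-end rule can be sketched in Python as follows. This is an illustration only: the function name resolve_fastq_entry is hypothetical, and the use of shell-style pattern matching to expand the * is an assumption about how such entries could be resolved, not a description of kakapo's internals.

```python
from fnmatch import fnmatchcase


def resolve_fastq_entry(entry, available_files):
    """Classify one entry and list the files it refers to.

    An entry containing '*' is a paired-end set; anything else is a
    single-read file, even if a reverse-read file sits next to it.
    """
    if "*" not in entry:
        return ("single", [entry])
    matches = sorted(f for f in available_files if fnmatchcase(f, entry))
    return ("paired", matches)


# Hypothetical directory listing used only for this example.
files = [
    "/data/sample1_R1.fastq.gz",
    "/data/sample1_R2.fastq.gz",
    "/data/sample4_R1.fastq.gz",
    "/data/sample4_R2.fastq.gz",
]

print(resolve_fastq_entry("/data/sample1_R*.fastq.gz", files))
print(resolve_fastq_entry("/data/sample4_R1.fastq.gz", files))
```

In the second call the R2 file for sample4 is present in the listing but is ignored, because the entry contains no * character.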
/home/kakapo/kakapo-input/assemblies/Matucana_madisoniorum_HBG13.fasta
A list of FASTA files, one entry per line. Use this if you already have a set of transcripts or any other set of sequences without introns (CDS, mRNA). kakapo will perform the gene search part of the pipeline on these files and will output the transcripts matching the search parameters it finds, together with the transcripts derived from raw reads.
cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta
plastid
mitochondrion
A list of FASTA files and/or the keywords plastid and mitochondrion, one entry per line. For the keywords plastid and mitochondrion, kakapo will find the most closely related plastid or mitochondrial assembly on GenBank. Reads mapping to any of the entries listed here will be stored in subdirectories in the output_directory.
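A rough Python sketch of how such entries could be told apart is shown below. The "name = path" form is taken from the example entry above; its exact semantics are not described here, so treating the left-hand side as a label is an assumption, as is the function name parse_filter_entry.

```python
KEYWORDS = ("plastid", "mitochondrion")


def parse_filter_entry(line):
    """Split one entry into (kind, label, value).

    kind is 'keyword' for plastid/mitochondrion and 'fasta' otherwise.
    """
    line = line.strip()
    if line in KEYWORDS:
        return ("keyword", line, None)
    if "=" in line:
        # Assumed meaning: a user-chosen label followed by a FASTA path.
        label, path = (part.strip() for part in line.split("=", 1))
        return ("fasta", label, path)
    # A bare path without a label.
    return ("fasta", None, line)


entries = [
    "cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta",
    "plastid",
    "mitochondrion",
]
for entry in entries:
    print(parse_filter_entry(entry))
```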
16S_Silva132
16S_Silva138
viral
mitochondrion
plastid
mitochondrion_and_plastid
minikraken_8GB_2020-03-12
A list of Kraken2 databases, one entry per line. kakapo will download a few smaller Kraken2 databases during the dependency installation process. You can place (or link) additional databases in the ~/.local/share/kakapo/kraken2_dbs directory for them to be visible to kakapo. Reads classified by Kraken2 will be stored in subdirectories in the output_directory.
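To check which database names are available before listing them in the configuration, something like the following Python sketch could be used. It assumes that each database is a subdirectory (or a symbolic link to a directory) whose name is the name used in the list above; the function name list_kraken2_dbs is hypothetical.

```python
from pathlib import Path

# Directory named in the text above.
DEFAULT_DB_ROOT = Path("~/.local/share/kakapo/kraken2_dbs")


def list_kraken2_dbs(root=DEFAULT_DB_ROOT):
    """Return the sorted names of database directories under root."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []
    # is_dir() follows symbolic links, so linked databases are included.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


print(list_kraken2_dbs())
```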
evalue = 1e-5
max_hsps = 10000
qcov_hsp_perc = 1
best_hit_overhang = 0.05
best_hit_score_edge = 0.25
max_target_seqs = 1000000
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500
If any of the settings in the [BLAST assemblies] section are found in the search strategies files, they will be overwritten for each search strategy.
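One way to picture this is as a per-strategy merge of settings. The Python sketch below assumes that a value given in a search strategy takes precedence over the project-level value for that strategy only; this reading of the sentence above, the function name settings_for_strategy, and the strategy value shown are assumptions. The project-level values are copied from one of the example blocks above purely for illustration.

```python
# Project-level values, copied from an example block above.
PROJECT_BLAST_SETTINGS = {
    "evalue": "1e-20",
    "max_hsps": "4",
    "qcov_hsp_perc": "70",
    "best_hit_overhang": "0.15",
    "best_hit_score_edge": "0.15",
    "max_target_seqs": "500",
}


def settings_for_strategy(project_settings, strategy_settings):
    """Merge settings, letting the search strategy win on conflicts."""
    merged = dict(project_settings)
    merged.update(strategy_settings)
    return merged


# Hypothetical search strategy that sets only evalue.
strategy = {"evalue": "1e-10"}
merged = settings_for_strategy(PROJECT_BLAST_SETTINGS, strategy)
print(merged["evalue"], merged["max_hsps"])
```

Copying the project-level dictionary before updating keeps the project values unchanged, so each search strategy starts from the same defaults.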