
The Structure of a Search Strategies File

Karolis Ramanauskas edited this page Aug 15, 2023 · 8 revisions

A search strategies file can contain one or more entries. Each search strategy entry consists of a section heading (text in square brackets) followed by a set of key = value pairs on the subsequent lines. These options describe which sequences kakapo will use to search for the gene of interest. Comments in the .ini files used by kakapo may begin with the ; or # characters.

  • organelle: possible values are nucleus, plastid, mitochondrion. Note: use nucleus for non-eukaryotes.
  • min_query_length: minimum amino acid sequence length for kakapo to treat the query sequences as acceptable.
  • max_query_length: maximum amino acid sequence length for kakapo to treat the query sequences as acceptable.
  • max_query_identity: a value between 0 and 1. kakapo will consolidate the user-provided and downloaded query sequences and will dereplicate the set using vsearch so that no two sequences in the resulting set have a pairwise identity above this threshold. This is an easy way to speed up BLAST/vsearch as similar query sequences will return most of the same hits.
  • min_target_orf_length: minimum nucleotide sequence length for kakapo to treat the target ORF as acceptable.
  • max_target_orf_length: maximum nucleotide sequence length for kakapo to treat the target ORF as acceptable.
  • evalue, max_hsps, qcov_hsp_perc, best_hit_overhang, best_hit_score_edge, max_target_seqs: BLAST parameters for searching assembled transcripts matching this search strategy. These six settings (if present) override the ones in the project configuration file, section [BLAST assemblies].
  • pfam_families: zero or more Pfam accessions. List one accession per line, each indented by four spaces.
  • ncbi_accessions_aa: zero or more NCBI Protein accessions. List one accession per line, each indented by four spaces.
  • entrez_search_queries: any Entrez search query.
  • fasta_files_aa: a path to a FASTA file with one or more amino acid sequences.
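
The value constraints and the override rule described in the list above can be sketched in Python. This is only an illustration: the helper names check_strategy and effective_blast_settings are hypothetical, and kakapo's own validation may differ. The sketch checks only keys that are present and makes no assumption about which keys are required.

```python
# Hypothetical helpers for illustration; not part of kakapo.
ORGANELLES = {"nucleus", "plastid", "mitochondrion"}
BLAST_KEYS = (
    "evalue", "max_hsps", "qcov_hsp_perc",
    "best_hit_overhang", "best_hit_score_edge", "max_target_seqs",
)


def check_strategy(entry):
    """Return a list of problems found in one entry (a dict of strings)."""
    problems = []
    if "organelle" in entry and entry["organelle"] not in ORGANELLES:
        problems.append("organelle must be nucleus, plastid, or mitochondrion")
    if "max_query_identity" in entry:
        if not 0 <= float(entry["max_query_identity"]) <= 1:
            problems.append("max_query_identity must be between 0 and 1")
    # Query lengths are in amino acids, target ORF lengths in nucleotides,
    # so each pair is only compared with itself.
    for low, high in (("min_query_length", "max_query_length"),
                      ("min_target_orf_length", "max_target_orf_length")):
        if low in entry and high in entry and int(entry[low]) > int(entry[high]):
            problems.append(f"{low} exceeds {high}")
    return problems


def effective_blast_settings(project_defaults, entry):
    """Strategy-level BLAST values, when present, replace project-level ones."""
    settings = dict(project_defaults)
    settings.update({key: entry[key] for key in BLAST_KEYS if key in entry})
    return settings
```

Here project_defaults stands in for the values of the [BLAST assemblies] section of the project configuration file, passed as a plain dictionary.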

The example below contains three search strategies, one each for the T2/S-Ribonuclease protein family, S-locus associated F-box domain-containing proteins, and Elongation factor-1 alpha (EF-1α).

[T2-RNases]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus

# Length in Amino Acids
min_query_length = 175
max_query_length = 350

max_query_identity = 0.80

# Length in Nucleotides
min_target_orf_length = 570
max_target_orf_length = 1200

# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

# Pfam families
pfam_families =
    ; t2-ribonucleases
    PF00445

# NCBI protein accessions
ncbi_accessions_aa =

# NCBI Entrez search queries
# T2-type ribonucleases
entrez_search_queries = ("Ribonuclease T2"[Title] OR "RNase T2"[Title] OR "S-RNase"[Title]) AND (200[SLEN]:350[SLEN]) AND Eudicots[Organism] NOT (partial OR putative OR predicted OR "binding protein" OR thioredoxin)

# FASTA files (Amino Acid)
fasta_files_aa = input/T2-RNases.fasta


[F-boxes]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus

# Length in Amino Acids
min_query_length = 340
max_query_length = 450

max_query_identity = 0.99

# Length in Nucleotides
min_target_orf_length = 900
max_target_orf_length = 1500

# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 85
best_hit_overhang = 0.2
best_hit_score_edge = 0.1
max_target_seqs = 500

# Pfam families
pfam_families =

# NCBI protein accessions
ncbi_accessions_aa =

# NCBI Entrez search queries
entrez_search_queries = ("S-locus F-box" OR "S locus F-box" OR "S-haplotype-specific" OR "SLF" OR "SLFL" OR "SFB" OR "SFBB") AND (340[SLEN]:450[SLEN]) AND eudicotyledons[Organism] NOT (predicted[Title] OR partial OR putative OR hypothetical OR "non-S")

# FASTA files (Amino Acid)
fasta_files_aa = input/F-boxes.fasta


[Elong-factor-1-alpha]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus

# Length in Amino Acids
min_query_length = 430
max_query_length = 460

max_query_identity = 0.80

# Length in Nucleotides
min_target_orf_length = 1250
max_target_orf_length = 1400

# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

# Pfam families
pfam_families =

# NCBI protein accessions
ncbi_accessions_aa =

# NCBI Entrez search queries
entrez_search_queries =

# FASTA files (Amino Acid)
fasta_files_aa = input/elong_factor_1_alpha.fasta
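
To check how an entry like the ones above is read, the file can be loaded with Python's standard configparser module. This is a minimal sketch that assumes the file follows ordinary INI rules; kakapo's own parser may behave differently. The helper names read_strategies and as_list are hypothetical, and EXAMPLE is a shortened copy of the [T2-RNases] entry.

```python
import configparser

# A shortened copy of the [T2-RNases] entry, for illustration only.
EXAMPLE = """\
[T2-RNases]
organelle = nucleus
min_query_length = 175
max_query_length = 350
max_query_identity = 0.80

# Pfam families
pfam_families =
    ; t2-ribonucleases
    PF00445

ncbi_accessions_aa =

entrez_search_queries = ("Ribonuclease T2"[Title] OR "S-RNase"[Title]) AND (200[SLEN]:350[SLEN])
fasta_files_aa = input/T2-RNases.fasta
"""


def read_strategies(text):
    """Parse search strategies text into {section name: {key: value}}."""
    # interpolation=None keeps any '%' character in a value literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def as_list(value):
    """Split a multi-line value into its non-empty entries."""
    return [line.strip() for line in value.splitlines() if line.strip()]


strategies = read_strategies(EXAMPLE)
t2 = strategies["T2-RNases"]
print(as_list(t2["pfam_families"]))       # ['PF00445']
print(as_list(t2["ncbi_accessions_aa"]))  # []
print(float(t2["max_query_identity"]))    # 0.8
```

The indented lines under pfam_families are read as continuation lines of a single value, and the indented ; line is skipped as a comment, which is why the accession list contains only PF00445.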