The Structure of a Search Strategies File

Karolis Ramanauskas edited this page Aug 15, 2023 · 8 revisions

A search strategies file contains one or more entries. Each entry consists of a section heading (a name in square brackets) followed by a set of key = value pairs, one per line, that describe which sequences kakapo will use to search for the gene of interest.

  • organelle: possible values are nucleus, plastid, mitochondrion. Note: use nucleus for non-eukaryotes.
  • min_query_length: minimum length, in amino acids, of a query sequence that kakapo will accept.
  • max_query_length: maximum length, in amino acids, of a query sequence that kakapo will accept.
  • max_query_identity: a value between 0 and 1. kakapo will consolidate the user-provided and downloaded query sequences and will dereplicate the set using vsearch so that no two sequences in the resulting set have a pairwise identity above this threshold. This is an easy way to speed up BLAST/vsearch as similar query sequences will return most of the same hits.
  • min_target_orf_length: minimum length, in nucleotides, of a target ORF that kakapo will accept.
  • max_target_orf_length: maximum length, in nucleotides, of a target ORF that kakapo will accept.
  • evalue, max_hsps, qcov_hsp_perc, best_hit_overhang, best_hit_score_edge, max_target_seqs: BLAST parameters for searching assembled transcripts matching this search strategy. These six settings (if present) override the ones in the project configuration file, section [BLAST assemblies].
  • pfam_families: zero or more Pfam accessions, one per line, each indented by four spaces.
  • ncbi_accessions_aa: zero or more NCBI Protein accessions, one per line, each indented by four spaces.
  • entrez_search_queries: any Entrez search query.
  • fasta_files_aa: a path to a FASTA file with one or more amino acid sequences.
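
The override rule for the six BLAST settings can be sketched in Python. This is only an illustration, not kakapo's own code: it assumes that both files can be read with the standard-library configparser and that the project configuration file uses the same key names. The default values in PROJECT_CFG and the helper name resolve_blast_settings are hypothetical.

```python
import configparser

# The six BLAST settings that a search strategy may override.
BLAST_KEYS = (
    "evalue", "max_hsps", "qcov_hsp_perc",
    "best_hit_overhang", "best_hit_score_edge", "max_target_seqs",
)

# Hypothetical stand-in for the [BLAST assemblies] section of a
# project configuration file. The values are made up for illustration.
PROJECT_CFG = """\
[BLAST assemblies]
evalue = 1e-5
max_hsps = 2
qcov_hsp_perc = 50
best_hit_overhang = 0.1
best_hit_score_edge = 0.1
max_target_seqs = 100
"""

# A search strategy entry that sets only two of the six keys.
STRATEGY_CFG = """\
[T2-RNases]
evalue = 1e-20
qcov_hsp_perc = 70
"""


def resolve_blast_settings(project_text, strategy_text, strategy_name):
    """Return the six BLAST settings, preferring the strategy's values."""
    project = configparser.ConfigParser(interpolation=None)
    project.read_string(project_text)
    strategy = configparser.ConfigParser(interpolation=None)
    strategy.read_string(strategy_text)

    defaults = project["BLAST assemblies"]
    entry = strategy[strategy_name]
    # A key present in the strategy wins; otherwise use the project value.
    return {key: entry.get(key, defaults[key]) for key in BLAST_KEYS}


settings = resolve_blast_settings(PROJECT_CFG, STRATEGY_CFG, "T2-RNases")
for key in BLAST_KEYS:
    print(key, "=", settings[key])
```

In this sketch evalue and qcov_hsp_perc come from the strategy entry, and the remaining four settings fall back to the project-level values. All values are kept as strings, as configparser returns them.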

The example below contains three search strategies, one each for the T2/S-ribonuclease protein family, S-locus-associated F-box domain-containing proteins, and elongation factor 1-alpha (EF-1α).

[T2-RNases]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus

# Length in Amino Acids
min_query_length = 175
max_query_length = 350

max_query_identity = 0.80

# Length in Nucleotides
min_target_orf_length = 570
max_target_orf_length = 1200

# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

# Pfam families
pfam_families =
    ; t2-ribonucleases
    PF00445

# NCBI protein accessions
ncbi_accessions_aa =

# NCBI Entrez search queries
# T2-type ribonucleases
entrez_search_queries = ("Ribonuclease T2"[Title] OR "RNase T2"[Title] OR "S-RNase"[Title]) AND (200[SLEN]:350[SLEN]) AND Eudicots[Organism] NOT (partial OR putative OR predicted OR "binding protein" OR thioredoxin)

# FASTA files (Amino Acid)
fasta_files_aa = input/T2-RNases.fasta


[F-boxes]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus

# Length in Amino Acids
min_query_length = 340
max_query_length = 450

max_query_identity = 0.99

# Length in Nucleotides
min_target_orf_length = 900
max_target_orf_length = 1500

# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 85
best_hit_overhang = 0.2
best_hit_score_edge = 0.1
max_target_seqs = 500

# Pfam families
pfam_families =

# NCBI protein accessions
ncbi_accessions_aa =

# NCBI Entrez search queries
entrez_search_queries = ("S-locus F-box" OR "S locus F-box" OR "S-haplotype-specific" OR "SLF" OR "SLFL" OR "SFB" OR "SFBB") AND (340[SLEN]:450[SLEN]) AND eudicotyledons[Organism] NOT (predicted[Title] OR partial OR putative OR hypothetical OR "non-S")

# FASTA files (Amino Acid)
fasta_files_aa = input/F-boxes.fasta


[Elong-factor-1-alpha]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus

# Length in Amino Acids
min_query_length = 430
max_query_length = 460

max_query_identity = 0.80

# Length in Nucleotides
min_target_orf_length = 1250
max_target_orf_length = 1400

# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500

# Pfam families
pfam_families =

# NCBI protein accessions
ncbi_accessions_aa =

# NCBI Entrez search queries
entrez_search_queries =

# FASTA files (Amino Acid)
fasta_files_aa = input/elong_factor_1_alpha.fasta
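
To check which query sources each entry actually defines, a file like the one above can be read with Python's standard-library configparser. This is a minimal sketch under the assumption that the file behaves like an ordinary INI file; it is not kakapo's own parser. The function name query_sources is hypothetical, and the sample is an abbreviated version of two of the entries above with a shortened Entrez query.

```python
import configparser

# Keys that tell kakapo where query sequences come from.
SOURCE_KEYS = (
    "pfam_families", "ncbi_accessions_aa",
    "entrez_search_queries", "fasta_files_aa",
)

# Abbreviated sample based on the example above (Entrez query shortened).
SAMPLE = """\
[T2-RNases]
pfam_families =
    ; t2-ribonucleases
    PF00445
ncbi_accessions_aa =
entrez_search_queries = "RNase T2"[Title] AND (200[SLEN]:350[SLEN])
fasta_files_aa = input/T2-RNases.fasta

[Elong-factor-1-alpha]
pfam_families =
ncbi_accessions_aa =
entrez_search_queries =
fasta_files_aa = input/elong_factor_1_alpha.fasta
"""


def query_sources(text):
    """Map each section name to its non-empty query source values."""
    # interpolation=None keeps any '%' characters in queries literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    summary = {}
    for name in parser.sections():
        entry = parser[name]
        summary[name] = {
            # Indented continuation lines become separate list items;
            # configparser skips full-line comments starting with ';' or '#'.
            key: [v.strip() for v in entry.get(key, "").splitlines() if v.strip()]
            for key in SOURCE_KEYS
        }
    return summary


sources = query_sources(SAMPLE)
for name, keys in sources.items():
    used = [key for key, values in keys.items() if values]
    print(name, "->", ", ".join(used))
```

Here the T2-RNases entry draws on a Pfam family, an Entrez query, and a FASTA file, while the Elong-factor-1-alpha entry relies on its FASTA file alone.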