The Structure of a Search Strategies File
Karolis Ramanauskas edited this page Aug 15, 2023 · 8 revisions
A search strategies file can contain one or more entries. Each search strategy entry consists of a section heading (marked by text in square brackets) and a set of key = value pairs on subsequent lines, with various options describing which sequences kakapo will use to search for the gene of interest. Note that comments in the .ini files used by kakapo may begin with the ; or # character.
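As a rough illustration of this layout, the sketch below reads a small strategy entry with Python's standard configparser module. It assumes that module behaves similarly to kakapo's own reader, which is an assumption rather than something stated on this page; the section name and values are hypothetical.

```python
import configparser

# Hypothetical minimal search strategy entry, for illustration only.
EXAMPLE = """
[My-gene]
; a comment starting with a semicolon
# a comment starting with a hash
organelle = nucleus
min_query_length = 175
"""

# By default, configparser skips full lines that start with ';' or '#'.
parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)

sections = parser.sections()                            # section headings
organelle = parser["My-gene"]["organelle"]              # values are strings
min_len = parser["My-gene"].getint("min_query_length")  # converted to int

print(sections, organelle, min_len)
```

Each section heading becomes one entry in `sections`, and the two comment lines do not appear among the keys.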
- organelle: possible values are nucleus, plastid, mitochondrion. Note: use nucleus for non-eukaryotes.
- min_query_length: minimum amino acid sequence length for kakapo to treat the query sequences as acceptable.
- max_query_length: maximum amino acid sequence length for kakapo to treat the query sequences as acceptable.
- max_query_identity: a value between 0 and 1. kakapo will consolidate the user-provided and downloaded query sequences and will dereplicate the set using vsearch so that no two sequences in the resulting set have a pairwise identity above this threshold. This is an easy way to speed up BLAST/vsearch, as similar query sequences will return most of the same hits.
- min_target_orf_length: minimum nucleotide sequence length for kakapo to treat the target ORF as acceptable.
- max_target_orf_length: maximum nucleotide sequence length for kakapo to treat the target ORF as acceptable.
- evalue, max_hsps, qcov_hsp_perc, best_hit_overhang, best_hit_score_edge, max_target_seqs: BLAST parameters for searching assembled transcripts matching this search strategy. These six settings (if present) override the ones in the project configuration file, section [BLAST assemblies].
- pfam_families: zero or more Pfam accessions, one per line; each must be indented by four spaces.
- ncbi_accessions_aa: zero or more NCBI Protein accessions, one per line; each must be indented by four spaces.
- entrez_search_queries: any Entrez search query.
- fasta_files_aa: a path to a FASTA file with one or more amino acid sequences.
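The constraints in the list above can be checked mechanically before running a search. The following is a minimal sketch, not kakapo's own validation: it assumes the file is read with Python's configparser, that all of the checked keys are present, and the helper names `split_list` and `check_strategy` are hypothetical.

```python
import configparser

ALLOWED_ORGANELLES = {"nucleus", "plastid", "mitochondrion"}

EXAMPLE = """
[T2-RNases]
organelle = nucleus
min_query_length = 175
max_query_length = 350
max_query_identity = 0.80
min_target_orf_length = 570
max_target_orf_length = 1200
pfam_families =
    PF00445
ncbi_accessions_aa =
"""


def split_list(raw):
    """Split a multi-line value into its non-empty, stripped items."""
    return [item.strip() for item in raw.splitlines() if item.strip()]


def check_strategy(section):
    """Return a list of problems found in one search strategy section."""
    problems = []
    if section.get("organelle") not in ALLOWED_ORGANELLES:
        problems.append("organelle must be nucleus, plastid, or mitochondrion")
    if section.getint("min_query_length") > section.getint("max_query_length"):
        problems.append("min_query_length exceeds max_query_length")
    if (section.getint("min_target_orf_length")
            > section.getint("max_target_orf_length")):
        problems.append("min_target_orf_length exceeds max_target_orf_length")
    if not 0 <= section.getfloat("max_query_identity") <= 1:
        problems.append("max_query_identity must be between 0 and 1")
    return problems


# interpolation=None keeps any '%' characters in values literal.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE)
section = parser["T2-RNases"]

problems = check_strategy(section)
pfam = split_list(section["pfam_families"])
accessions = split_list(section["ncbi_accessions_aa"])

print(problems, pfam, accessions)
```

Indented continuation lines are read as part of the preceding key's value, which is why the accession under `pfam_families` must be indented; an empty value yields an empty list.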
The example below contains three search strategies, one each for the T2/S-Ribonuclease protein family, S-locus associated F-box domain-containing proteins, and Elongation factor-1 alpha (EF-1α).
[T2-RNases]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus
# Length in Amino Acids
min_query_length = 175
max_query_length = 350
max_query_identity = 0.80
# Length in Nucleotides
min_target_orf_length = 570
max_target_orf_length = 1200
# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500
# Pfam families
pfam_families =
    ; t2-ribonucleases
    PF00445
# NCBI protein accessions
ncbi_accessions_aa =
# NCBI Entrez search queries
# T2-type ribonucleases
entrez_search_queries = ("Ribonuclease T2"[Title] OR "RNase T2"[Title] OR "S-RNase"[Title]) AND (200[SLEN]:350[SLEN]) AND Eudicots[Organism] NOT (partial OR putative OR predicted OR "binding protein" OR thioredoxin)
# FASTA files (Amino Acid)
fasta_files_aa = input/T2-RNases.fasta
[F-boxes]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus
# Length in Amino Acids
min_query_length = 340
max_query_length = 450
max_query_identity = 0.99
# Length in Nucleotides
min_target_orf_length = 900
max_target_orf_length = 1500
# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 85
best_hit_overhang = 0.2
best_hit_score_edge = 0.1
max_target_seqs = 500
# Pfam families
pfam_families =
# NCBI protein accessions
ncbi_accessions_aa =
# NCBI Entrez search queries
entrez_search_queries = ("S-locus F-box" OR "S locus F-box" OR "S-haplotype-specific" OR "SLF" OR "SLFL" OR "SFB" OR "SFBB") AND (340[SLEN]:450[SLEN]) AND eudicotyledons[Organism] NOT (predicted[Title] OR partial OR putative OR hypothetical OR "non-S")
# FASTA files (Amino Acid)
fasta_files_aa = input/F-boxes.fasta
[Elong-factor-1-alpha]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus
# Length in Amino Acids
min_query_length = 430
max_query_length = 460
max_query_identity = 0.80
# Length in Nucleotides
min_target_orf_length = 1250
max_target_orf_length = 1400
# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500
# Pfam families
pfam_families =
# NCBI protein accessions
ncbi_accessions_aa =
# NCBI Entrez search queries
entrez_search_queries =
# FASTA files (Amino Acid)
fasta_files_aa = input/elong_factor_1_alpha.fasta
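As noted in the option list, the six BLAST settings in a strategy entry override those in the [BLAST assemblies] section of the project configuration file. The sketch below shows one way such an override could be resolved. It is an illustration under assumptions, not kakapo's implementation: the project default values are made up, the function name `effective_blast_settings` is hypothetical, and Python's configparser stands in for kakapo's reader.

```python
import configparser

BLAST_KEYS = (
    "evalue",
    "max_hsps",
    "qcov_hsp_perc",
    "best_hit_overhang",
    "best_hit_score_edge",
    "max_target_seqs",
)

# Hypothetical project-level defaults (values invented for illustration).
PROJECT = """
[BLAST assemblies]
evalue = 1e-5
max_hsps = 2
qcov_hsp_perc = 50
best_hit_overhang = 0.1
best_hit_score_edge = 0.1
max_target_seqs = 100
"""

# A strategy entry that sets only two of the six BLAST options.
STRATEGY = """
[F-boxes]
evalue = 1e-20
qcov_hsp_perc = 85
"""


def effective_blast_settings(project_section, strategy_section):
    """Prefer the strategy's value for each key, else the project default."""
    return {
        key: strategy_section.get(key, project_section.get(key))
        for key in BLAST_KEYS
    }


project = configparser.ConfigParser(interpolation=None)
project.read_string(PROJECT)
strategy = configparser.ConfigParser(interpolation=None)
strategy.read_string(STRATEGY)

settings = effective_blast_settings(
    project["BLAST assemblies"], strategy["F-boxes"]
)
print(settings)
```

Settings missing from the strategy entry fall back to the project-level value, so an entry only needs to list the options it wants to change.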