The Structure of a Search Strategies File
Karolis Ramanauskas edited this page Aug 15, 2023 · 8 revisions
A search strategies file can contain one or more entries. Each search strategy entry consists of a section heading (text in square brackets) followed by a set of key = value pairs on subsequent lines. These options describe which sequences kakapo will use to search for the gene of interest.
- organelle: possible values are nucleus, plastid, mitochondrion. Note: use nucleus for non-eukaryotes.
- min_query_length: minimum length, in amino acids, that a query sequence must have for kakapo to accept it.
- max_query_length: maximum length, in amino acids, that a query sequence may have for kakapo to accept it.
- max_query_identity: a value between 0 and 1. kakapo will consolidate the user-provided and downloaded query sequences and will dereplicate the set using vsearch so that no two sequences in the resulting set have a pairwise identity above this threshold. This is an easy way to speed up BLAST/vsearch, as similar query sequences will return most of the same hits.
- min_target_orf_length: minimum length, in nucleotides, that a target ORF must have for kakapo to accept it.
- max_target_orf_length: maximum length, in nucleotides, that a target ORF may have for kakapo to accept it.
- evalue, max_hsps, qcov_hsp_perc, best_hit_overhang, best_hit_score_edge, max_target_seqs: BLAST parameters for searching assembled transcripts matching this search strategy. These six settings (if present) override the ones in the project configuration file, section [BLAST assemblies].
- pfam_families: zero or more Pfam accessions, one per line; each line must be indented by four spaces.
- ncbi_accessions_aa: zero or more NCBI Protein accessions, one per line; each line must be indented by four spaces.
- entrez_search_queries: any Entrez search query.
- fasta_files_aa: a path to a FASTA file with one or more amino acid sequences.
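To make the layout above concrete, here is a minimal sketch that reads one entry with Python's standard configparser module. This is an assumption made for illustration only: this page does not say how kakapo itself parses the file, and the helper name list_values and the inline SAMPLE text are hypothetical. It shows how an indented, one-per-line value such as pfam_families can be split into a list, and how an empty value yields an empty list.

```python
import configparser

# A trimmed-down entry, written inline so the sketch is self-contained.
SAMPLE = """\
[T2-RNases]
organelle = nucleus
min_query_length = 175
max_query_length = 350
max_query_identity = 0.80
pfam_families =
    ; t2-ribonucleases
    PF00445
ncbi_accessions_aa =
fasta_files_aa = input/T2-RNases.fasta
"""


def list_values(raw):
    """Split a multi-line value into its non-empty, stripped lines.

    Hypothetical helper, not part of kakapo.
    """
    return [line.strip() for line in raw.splitlines() if line.strip()]


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

for name in parser.sections():
    entry = parser[name]
    print("strategy:", name)
    print("organelle:", entry["organelle"])
    print("query length range (aa):",
          entry.getint("min_query_length"),
          entry.getint("max_query_length"))
    print("max identity:", entry.getfloat("max_query_identity"))
    # Lines starting with ';' or '#' are treated as comments and skipped.
    print("Pfam families:", list_values(entry["pfam_families"]))
    print("NCBI accessions:", list_values(entry["ncbi_accessions_aa"]))
```

With configparser, the continuation lines must be indented more deeply than the key they belong to, which is consistent with the four-space rule described above.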
The example below contains three search strategies, one each for the T2/S-Ribonuclease protein family, S-locus-associated F-box domain-containing proteins, and Elongation factor-1 alpha (EF-1α).
[T2-RNases]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus
# Length in Amino Acids
min_query_length = 175
max_query_length = 350
max_query_identity = 0.80
# Length in Nucleotides
min_target_orf_length = 570
max_target_orf_length = 1200
# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500
# Pfam families
pfam_families =
    ; t2-ribonucleases
    PF00445
# NCBI protein accessions
ncbi_accessions_aa =
# NCBI Entrez search queries
# T2-type ribonucleases
entrez_search_queries = ("Ribonuclease T2"[Title] OR "RNase T2"[Title] OR "S-RNase"[Title]) AND (200[SLEN]:350[SLEN]) AND Eudicots[Organism] NOT (partial OR putative OR predicted OR "binding protein" OR thioredoxin)
# FASTA files (Amino Acid)
fasta_files_aa = input/T2-RNases.fasta
[F-boxes]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus
# Length in Amino Acids
min_query_length = 340
max_query_length = 450
max_query_identity = 0.99
# Length in Nucleotides
min_target_orf_length = 900
max_target_orf_length = 1500
# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 85
best_hit_overhang = 0.2
best_hit_score_edge = 0.1
max_target_seqs = 500
# Pfam families
pfam_families =
# NCBI protein accessions
ncbi_accessions_aa =
# NCBI Entrez search queries
entrez_search_queries = ("S-locus F-box" OR "S locus F-box" OR "S-haplotype-specific" OR "SLF" OR "SLFL" OR "SFB" OR "SFBB") AND (340[SLEN]:450[SLEN]) AND eudicotyledons[Organism] NOT (predicted[Title] OR partial OR putative OR hypothetical OR "non-S")
# FASTA files (Amino Acid)
fasta_files_aa = input/F-boxes.fasta
[Elong-factor-1-alpha]
# nucleus, plastid, mitochondrion
# Note: use nucleus for non-eukaryotes.
organelle = nucleus
# Length in Amino Acids
min_query_length = 430
max_query_length = 460
max_query_identity = 0.80
# Length in Nucleotides
min_target_orf_length = 1250
max_target_orf_length = 1400
# BLAST assemblies
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500
# Pfam families
pfam_families =
# NCBI protein accessions
ncbi_accessions_aa =
# NCBI Entrez search queries
entrez_search_queries =
# FASTA files (Amino Acid)
fasta_files_aa = input/elong_factor_1_alpha.fasta
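Every entry in this example sets all six BLAST keys, so none of them would fall back to the project configuration file. The override rule itself can be pictured as a simple dictionary merge, sketched below. The function name effective_blast_settings is hypothetical, and the project-level values are invented for illustration; they are not kakapo's defaults, and this is not kakapo's actual implementation.

```python
# The six keys that a search strategy may override.
BLAST_KEYS = (
    "evalue",
    "max_hsps",
    "qcov_hsp_perc",
    "best_hit_overhang",
    "best_hit_score_edge",
    "max_target_seqs",
)


def effective_blast_settings(project_defaults, strategy):
    """Return the project values with strategy-level BLAST keys applied on top.

    Hypothetical helper: keys absent from the strategy keep the project value,
    and non-BLAST keys in the strategy are ignored.
    """
    merged = dict(project_defaults)
    for key in BLAST_KEYS:
        if key in strategy:
            merged[key] = strategy[key]
    return merged


# Invented stand-in for the [BLAST assemblies] section of a project file.
project_defaults = {
    "evalue": "1e-5",
    "max_hsps": "1",
    "qcov_hsp_perc": "50",
    "best_hit_overhang": "0.1",
    "best_hit_score_edge": "0.1",
    "max_target_seqs": "100",
}

# A strategy that sets only two of the six keys (values as in [F-boxes]).
strategy = {"organelle": "nucleus", "evalue": "1e-20", "qcov_hsp_perc": "85"}

settings = effective_blast_settings(project_defaults, strategy)
for key in BLAST_KEYS:
    print(key, "=", settings[key])
```

Here evalue and qcov_hsp_perc come from the strategy, while the other four keys keep the project-level values.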