Roadmap
This page describes the current plans for development of the `find_differential_primers` codebase.

The `find_differential_primers.py` script was written to a specific request, and developed organically from that - it 'grew', rather than being 'engineered'. As a result, it is not the most flexible or well-organised codebase. This roadmap describes how I would like to progress future development of this codebase, almost from the ground up.
The procedure for finding differential primers is essentially a 'pipeline' - input data goes in (a set of sequences and classifications of those sequences), is subjected to a series of processes, and output data is delivered (a set of primers that are predicted to distinguish between the sets of sequences defined in the inputs).
The series of processes is, in principle, flexible. It comprises a number of conceptual stages that could, in practice, be performed in any of several different ways, by any of several different software tools. The current `find_differential_primers.py` script grew out of an initial piece of code in an ad hoc way and does not manage this flexibility well: it runs as a single command, where flags/options need to be set over multiple lines to control execution of each stage. For the same reason, it is inflexible regarding reuse of existing output from any stage (without recalculation).
A major focus of this redesign is therefore the user interface.
A common style of interaction for packages that implement multiple stages in a process - potentially in arbitrary order, and taking advantage of precalculated datasets - is to use a subcommand model, like `git`. This would also enable us to have 'pipelining' command scripts that execute complete pathways with alternative steps, e.g.
```
# example potential pipeline command script
pdp.py process -i input.conf -o sequences/
pdp.py prodigal -i sequences/
pdp.py eprimer3 -i sequences/
pdp.py bwa_in_silico -i sequences/
pdp.py classify -i sequences
```
Many of the tools that are used in the primer design are employed several times across multiple datasets. This is an obvious target for parallelisation. The existing codebase allows for parallel execution with Python's `multiprocessing` module, but SGE/OGE capability was never implemented. This would be very useful to have, and a flexible approach to parallelisation (generating lists of the required commands and then passing these to the appropriate scheduler mechanism) is desirable.

The parallelisation strategy is to generate command-lines for external tools, and pass these to an appropriate scheduler.
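As a minimal sketch of this strategy (the function names, and the example `prodigal` invocation, are illustrative, not the package's actual API), the command list is built once and then handed to whichever scheduler back-end is available:

```python
import multiprocessing
import subprocess

def build_commands(seqfiles):
    """Generate one external tool command-line per input sequence file.

    The prodigal call here is only an example command."""
    return ["prodigal -i {0} -o {0}.gff -f gff".format(f) for f in seqfiles]

def run_command(cmd):
    """Run a single command-line; return its exit code."""
    return subprocess.run(cmd, shell=True).returncode

def run_multiprocessing(commands, workers=4):
    """Local scheduler: pass the command list to a multiprocessing pool.

    An SGE/OGE back-end would instead write the same command list out to
    job scripts and submit them to the scheduler."""
    with multiprocessing.Pool(workers) as pool:
        return pool.map(run_command, commands)
```

Because command generation is separated from scheduling, the same command list can be reused unchanged by either back-end.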
The major time bottleneck in the original script is the in silico hybridisation. This is currently performed with the EMBOSS `primersearch` package, whose algorithm is rather slow. Alternative, faster methods are possible, but not - to my current knowledge - implemented in a standalone tool. We could massively improve runtimes if we were to develop an alternative approach.

A major target of this redesign is to speed up the in silico hybridisation step.
- The process begins with named and classified sequence inputs. At a minimum this requires a set of input (multiple) sequence files, a unique identifier for each sequence file, and at least one classification for each file.
- To restrict the acceptable locations for primer amplification, we can define regions within which the primers must amplify to be considered. These regions may be provided in a standard format as input (e.g. to design only to CDS regions), or generated during the pipeline run (e.g. with a gene-caller).
- Primer pairs are designed against each input sequence file (with or without oligos), meeting the required thermodynamic parameters.
- In silico hybridisation is performed with all primer pairs against all input sequences to identify which sequences they amplify, in principle.
- The results of in silico hybridisation are used to identify primer pairs that selectively amplify only members of each classification.
- An optional quick screen against larger databases can be applied to reduce candidate primer sets further.
We require, at a minimum, for each sequence input:
- The location of a file containing all sequences to be considered. Multiple sequence files (e.g. draft genomes) will have sequences concatenated with a spacer.
- A list of classes to which the input file belongs.
- An identifier for the file that is unique in the analysis context.

For ease of generation, this may be defined in tab-separated tabular format:

```
<identifier>\t<file>\t<classes>
```
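For example, a config file in this format might look like the following (the identifiers, file paths, and class names are hypothetical, and multiple classes are assumed here to be comma-separated):

```
Pba_SCRI1043	tests/sequences/Pba_SCRI1043.fasta	Pectobacterium,atrosepticum
Pwa_SCC3193	tests/sequences/Pwa_SCC3193.fasta	Pectobacterium,wasabiae
```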
However, all configuration files used as input to processes in the pipeline must be presented in JSON format. This format was chosen as it is readily serialisable, standardised, and human readable. This is intended to allow easy reconstruction of the Python data object central to the pipeline.
To generate a JSON file for input to the pipeline, we implement a command such as

```
pdp.py config <config.tab> --to_json <config.json>
```

that produces a JSON equivalent file, which can then be processed to produce the appropriate Python data object for the pipeline.
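The JSON equivalent of the tab-format example above might look like the following sketch (the exact schema is an assumption, using the `PDPData` fields described later on this page):

```json
[
    {
        "name": "Pba_SCRI1043",
        "groups": ["Pectobacterium", "atrosepticum"],
        "seqfile": "tests/sequences/Pba_SCRI1043.fasta",
        "regions": null,
        "primers": null
    },
    {
        "name": "Pwa_SCC3193",
        "groups": ["Pectobacterium", "wasabiae"],
        "seqfile": "tests/sequences/Pwa_SCC3193.fasta",
        "regions": null,
        "primers": null
    }
]
```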
- An introduction to JSON: http://www.json.org/
- Python's built-in `json` module: https://docs.python.org/3/library/json.html
We want to allow for config file checking with a command such as:

```
pdp.py config <config.tab|json> --validate
```

This should tell us whether any input files need to be stitched (with a standard spacer), or have ambiguous base symbols replaced with `N`, for use with the third-party pipeline tools. If this is the case, then we want to be able to fix these problems with a command such as:

```
pdp.py config <config.tab|json> --fix <config.json>
```
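The kind of check that `--validate` implies could be sketched as follows with Biopython (this is an illustrative outline, not the package's implementation; the function name is hypothetical):

```python
from Bio import SeqIO

# IUPAC ambiguity symbols other than N
AMBIGUITY_SYMBOLS = set("RYSWKMBDHV")

def validate_seqfile(seqfile):
    """Report whether one input file needs stitching or base replacement."""
    records = list(SeqIO.parse(seqfile, "fasta"))
    # More than one record (e.g. a draft genome) needs concatenation
    # with a standard spacer before the third-party tools can use it
    needs_stitch = len(records) > 1
    # Any non-N ambiguity symbol would need to be replaced with N
    needs_replacement = any(
        set(str(rec.seq).upper()) & AMBIGUITY_SYMBOLS for rec in records
    )
    return needs_stitch, needs_replacement
```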
The old `find_differential_primers.py` script used the EMBOSS `ePrimer3` tool, which has limitations:

- Both `EMBOSS` and `Primer3` are required dependencies
- The `EMBOSS` tool only works with an outdated version of `Primer3` (v.1.1.4)
We would gain two advantages by allowing a choice of primer design tools: reduced dependency requirements, and flexibility to use updated/alternative design tools. Hence, a command structure such as:

```
pdp.py primers_eprimer3 <config.json> <configout.json> <tool options>
pdp.py primers_primer3 <config.json> <configout.json> <tool options>
pdp.py primers_toolX <config.json> <configout.json> <tool options>
```

where the subcommand defines the tool, or

```
pdp.py primers --tool eprimer3 <config.json> <configout.json> <tool options>
pdp.py primers --tool primer3 <config.json> <configout.json> <tool options>
pdp.py primers --tool toolX <config.json> <configout.json> <tool options>
```

might be considered.
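Either variant maps naturally onto `argparse` subparsers. The sketch below wires up the second (`--tool`) form; the dispatch table and runner functions are illustrative placeholders, not settled names:

```python
import argparse

def build_parser():
    """Build a pdp.py-style parser with a `primers` subcommand."""
    parser = argparse.ArgumentParser(prog="pdp.py")
    subparsers = parser.add_subparsers(dest="command")
    primers = subparsers.add_parser("primers")
    primers.add_argument("--tool", default="eprimer3",
                         choices=["eprimer3", "primer3"])
    primers.add_argument("infname")   # <config.json>
    primers.add_argument("outfname")  # <configout.json>
    return parser

# Hypothetical dispatch: each tool gets its own runner function
TOOL_RUNNERS = {"eprimer3": lambda args: None,  # e.g. run_eprimer3(args)
                "primer3": lambda args: None}   # e.g. run_primer3(args)

if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.command == "primers":
        TOOL_RUNNERS[args.tool](args)
```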
As each tool may produce designed primers in one of a number of formats, a single reliable interchange format will be used. This will be JSON, again chosen because it is human-readable and readily serialisable.
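A record in this interchange format might look something like the sketch below (the field names and values are hypothetical; the real schema would reflect whatever the chosen design tools report):

```json
{
    "name": "Pba_SCRI1043_primer_0001",
    "forward_seq": "AGCGGATAACAATTTCACACAGGA",
    "reverse_seq": "CGCCAGGGTTTTCCCAGTCACGAC",
    "forward_tm": 59.8,
    "reverse_tm": 64.9,
    "product_size": 180
}
```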
Each analytical/processing step in the pipeline will need to make reference to a central account of input data. For some steps, one or more input file locations will be required; for others, metadata such as classifications will also be required.

This information can accrete during any single run of a pipeline, in that an input might start with only the sequence file and classification being defined, but step-by-step also acquire the locations of primer sequences and regions for amplification. As each new piece of information is acquired, it would be possible to write out a new configuration file, and pick up the analysis again from that point. The `PDPCollection` and `PDPData` classes will manage this data for the pipeline.
The general role of the two classes is:

- `PDPData`: for a single sequence file input, store and manage metadata such as the locations of relevant data files, and provide methods to operate on those files.
- `PDPCollection`: manage several `PDPData` objects and provide methods for configuration file input and output, and overviews of the collection data.
The `PDPData` class will store metadata and locations of datafiles for a single input:
- `name`: unique identifier for the input (`string`)
- `groups`: classifications for the input (`list`)
- `seqfile`: file with FASTA format sequence(s) for this input (path)
- `regions`: file defining regions to be included/excluded in primer design (path)
- `primers`: file containing predicted primer information in JSON format (path)
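As a rough sketch of how these classes might fit together (attribute names follow the list above; the methods shown are placeholders, not a settled interface):

```python
import json

class PDPData(object):
    """Metadata and data file locations for a single sequence input."""
    def __init__(self, name, groups, seqfile, regions=None, primers=None):
        self.name = name        # unique identifier (string)
        self.groups = groups    # classifications (list)
        self.seqfile = seqfile  # FASTA sequence file (path)
        self.regions = regions  # included/excluded regions (path)
        self.primers = primers  # predicted primers, JSON (path)

class PDPCollection(object):
    """Manages several PDPData objects and config file input/output."""
    def __init__(self):
        self.data = {}  # PDPData objects, keyed by unique input name

    def from_json(self, fname):
        """Populate the collection from a JSON config file."""
        with open(fname) as ifh:
            for entry in json.load(ifh):
                self.data[entry["name"]] = PDPData(**entry)

    def to_json(self, fname):
        """Write the current state of the collection as a JSON config."""
        with open(fname, "w") as ofh:
            json.dump([vars(d) for d in self.data.values()], ofh, indent=4)
```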
The prodigal results generated in an early step are not currently used for primer design by the package. They are kept in a generic `features` field that points to a GFF file (which could store any feature set), and could be handled in a `pdp.py filter` step, implemented as `pdp.py filter <INCONFIG> <OUTCONFIG>`, with an optional `--excludefeatures` flag (the default would be to use the features to positively select regions from the input genome; the `--excludefeatures` flag would instead choose the inversion of that feature set).
Applying `pdp.py filter` after primer design would only reduce the total number of primers carried forward, and would not force design of primers only to selected regions. Therefore, we would need to apply the filter prior to primer design, generating a new sequence for each filter region; these would be stitched together to generate a new 'input genome' for primer design.

The filtered sequence could be carried forward in the `seqfile` field, except that we want to run `primersearch` against the original genomes; therefore we should carry forward a link to the filtered sequence in a new field, maybe `filtered_seqfile`.
We'd want suitably long runs of `N`s to separate each filter region, so an argument to specify the spacer length (default 250?), such as `--spacerlen N`, would be useful.
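A minimal sketch of this stitching step, using Biopython (illustrative only; the function name and the region representation are assumptions):

```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def write_filtered_seqfile(seqfile, regions, outfname, spacerlen=250):
    """Stitch filter regions into a new 'input genome' for primer design.

    regions: iterable of (seq_id, start, end) tuples, 0-based half-open.
    Each extracted region is joined to the next with a run of Ns."""
    seqs = SeqIO.to_dict(SeqIO.parse(seqfile, "fasta"))
    spacer = "N" * spacerlen
    stitched = spacer.join(str(seqs[seq_id].seq[start:end])
                           for seq_id, start, end in regions)
    record = SeqRecord(Seq(stitched), id="filtered", description="")
    SeqIO.write(record, outfname, "fasta")
```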
This could be handled in new subcommands, specific to the kind of filtering being performed: e.g. `pdp.py cdsfilter`, where both the prodigal genecall step and filtering of the genome are combined to modify the config file. A similar command, such as `igrfilter`, could be created, where the `prodigal` predictions are used to identify intergenic regions, and regions are chosen that include these, plus some of the flanking gene for each region (with an argument such as `--flanklen N` to decide how much of the gene to incorporate).
Alternatively, the `pdp.py filter` command could have flags like `--prodigal` or `--igr` to direct the action of the filter (or an `--inplace` option so that custom GFF files could be used). In that case the `filtered_seqfile` field would be written to the next config file to carry on to the next step, and the filter would need to be invoked explicitly with a `--filter` flag, such as `pdp.py eprimer3 --filter <INCONFIG> <OUTCONFIG>`.