
This page describes the current plans for development of the find_differential_primers codebase.

Introduction

The find_differential_primers.py script was written to meet a specific request, and developed organically from there - it 'grew', rather than being 'engineered'. As a result, it is not the most flexible or well-organised codebase.

This roadmap describes how I would like to progress future development of this codebase, almost from the ground up.

Design

Overview

The procedure for finding differential primers is essentially a 'pipeline' - input data goes in (a set of sequences and classifications of those sequences), is subjected to a series of processes, and output data is delivered (a set of primers that are predicted to distinguish between the sets of sequences defined in the inputs).

The series of processes is, in principle, flexible. It comprises a number of conceptual stages that could, in practice, be performed in any of several different ways, by any of several different software tools. The current find_differential_primers.py script grew out of an initial piece of code in an ad hoc way and does not manage this flexibility well: it runs as a single command, whose flags/options must be set over multiple lines to control execution of each stage. For the same reason, it cannot reuse existing output from any stage without recalculation.

A major focus of this redesign is therefore the user interface.

A common style of interaction for packages that implement multiple stages in a process - potentially in arbitrary order, and taking advantage of precalculated datasets - is to use a subcommand model, like git. This would also enable us to have 'pipelining' command scripts that execute complete pathways with alternative steps, e.g.

# example potential pipeline command script
pdp.py process -i input.conf -o sequences/
pdp.py prodigal -i sequences/
pdp.py eprimer3 -i sequences/
pdp.py bwa_in_silico -i sequences/
pdp.py classify -i sequences/

Many of the tools used in primer design are employed several times across multiple datasets. This is an obvious target for parallelisation. The existing codebase allows for parallel execution with Python's multiprocessing module, but SGE/OGE capability was never implemented. This would be very useful to have. A flexible approach to parallelisation - generating lists of the required commands and then passing these to the appropriate scheduler mechanism - is desirable.

The parallelisation strategy is to generate command-lines for external tools, and pass these to an appropriate scheduler
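
A minimal sketch of this strategy with a local multiprocessing backend is below; the function names are illustrative, not the package's actual API. An SGE/OGE backend would instead write the same command list to job scripts for submission with qsub.

import subprocess
from multiprocessing import Pool

def build_commands(seqfiles):
    # One external-tool command line per input sequence file
    return ["prodigal -i {0} -o {0}.gff -f gff".format(f) for f in seqfiles]

def run_command(cmd):
    subprocess.run(cmd, shell=True, check=True)

def run_multiprocessing(commands, workers=4):
    # Run the generated command list locally on a worker pool;
    # a scheduler backend would submit the same list via qsub instead
    with Pool(workers) as pool:
        pool.map(run_command, commands)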

The major time bottleneck in the original script is the in silico hybridisation. This is currently performed with the EMBOSS primersearch package, whose algorithm is rather slow. Alternative, faster methods are possible, but not - to my current knowledge - implemented in a standalone tool. We could massively improve runtimes if we were to develop an alternative approach.

A major target of this redesign is to speed up the in silico hybridisation step.

Analysis procedure

  • The process begins with named and classified sequence inputs. At a minimum this requires a set of input (multiple) sequence files, a unique identifier for each sequence file, and at least one classification for each file.
  • To restrict the acceptable locations for primer amplification, we can define regions within which the primers must amplify to be considered. These regions may be provided in a standard format as input (e.g. to design only to CDS regions), or generated during the pipeline run (e.g. with a gene-caller).
  • Primer pairs (with or without internal oligos) are designed against each input sequence file, subject to the required thermodynamic parameters.
  • in silico hybridisation is performed with all primer pairs against all input sequences, to identify which sequences each pair can amplify in principle.
  • The results of in silico hybridisation are used to identify primer pairs that selectively amplify only members of each classification.
  • An optional quick screen against larger databases can be applied to reduce candidate primer sets further.

Classified sequence inputs

We require, at a minimum, for each sequence input:

  • The location of a file containing all sequences to be considered. Files containing multiple sequences (e.g. draft genomes) will have those sequences concatenated with a spacer.
  • A list of classes to which the input file belongs.
  • An identifier for the file that is unique in the analysis context.

For ease of generation, this may be defined in tab-separated tabular format:

<identifier>\t<file>\t<classes>
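
For example, a config.tab might contain rows like the following (the identifiers, paths, and class names here are illustrative, and classes are assumed to be comma-separated):

isolate_01	genomes/isolate_01.fasta	Pectobacterium,atrosepticum
isolate_02	genomes/isolate_02.fasta	Pectobacterium,betavasculorum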

However, all configuration files used as input to processes in the pipeline must be presented in JSON format. This format was chosen as it is readily serialisable, standardised, and human readable. This is intended to allow easy reconstruction of the Python data object central to the pipeline.

To generate a JSON file for input to the pipeline, we implement a command such as

pdp.py config <config.tab> --to_json <config.json>

that produces a JSON equivalent file which can then be processed to produce the appropriate Python data object for the pipeline.
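
A minimal sketch of such a conversion, assuming the tab format above with comma-separated classes (tab_to_json is an illustrative name, not the package's API):

import json

def tab_to_json(tabpath, jsonpath):
    # Read <identifier>\t<file>\t<classes> rows and write JSON
    entries = []
    with open(tabpath) as infh:
        for line in infh:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank and comment lines
            name, seqfile, classes = line.rstrip("\n").split("\t")
            entries.append({"name": name,
                            "seqfile": seqfile,
                            "groups": classes.split(",")})
    with open(jsonpath, "w") as outfh:
        json.dump(entries, outfh, indent=2)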

Config file validation and correction

We want to allow for config file checking with a command such as:

pdp.py config <config.tab|json> --validate 

This should tell us whether any input files need to be stitched (with a standard spacer), or have ambiguous base symbols replaced with N, for use with the third-party pipeline tools. If this is the case, then we want to be able to fix these problems with a command such as:

pdp.py config <config.tab|json> --fix <config.json>
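
A sketch of the per-file checks that --validate might perform, assuming Biopython is available (check_seqfile is an illustrative name):

from Bio import SeqIO

def check_seqfile(path):
    # Report whether a FASTA file needs stitching or N-replacement
    records = list(SeqIO.parse(path, "fasta"))
    needs_stitch = len(records) > 1  # multiple sequences: stitch with spacer
    # Symbols outside ACGTN are ambiguity codes to be replaced with N
    needs_replace = any(set(str(rec.seq).upper()) - set("ACGTN")
                        for rec in records)
    return needs_stitch, needs_replace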

Primer design

The old find_differential_primers.py script used the EMBOSS ePrimer3 tool, which has limitations:

  • Both EMBOSS and Primer3 are required dependencies
  • The EMBOSS tool only works with an outdated version of Primer3 (v.1.1.4)

We would gain two advantages by allowing a choice of primer design tools: reduced dependency requirements, and flexibility to use updated/alternative design tools. Hence, a command structure such as:

pdp.py primers_eprimer3 <config.json> <configout.json> <tool options>
pdp.py primers_primer3 <config.json> <configout.json> <tool options>
pdp.py primers_toolX <config.json> <configout.json> <tool options>

where the subcommand defines the tool, or

pdp.py primers --tool eprimer3 <config.json> <configout.json> <tool options>
pdp.py primers --tool primer3 <config.json> <configout.json> <tool options>
pdp.py primers --tool toolX <config.json> <configout.json> <tool options>

might be considered.

As each tool may produce designed primers in one of a number of formats, a single reliable interchange format will be used. This will be JSON, again chosen because it is human-readable and readily serialisable.
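
For illustration, a single primer pair record in such an interchange file might look like the following (the field names are hypothetical, not a fixed schema):

{
  "name": "isolate_01_primer_0001",
  "forward_seq": "ACGTACGTACGTACGTACGT",
  "reverse_seq": "TGCATGCATGCATGCATGCA",
  "forward_tm": 59.8,
  "reverse_tm": 60.1,
  "product_size": 180
}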

PDPCollection and PDPData classes

Each analytical/processing step in the pipeline will need to make reference to a central account of input data. For some steps, one or more input file locations will be required; for others metadata such as classifications will also be required.

This information can accrete during any single run of the pipeline: an input might start with only the sequence file and classification defined, but acquire, step by step, the locations of primer sequences and regions for amplification. As each new piece of information is acquired, a new configuration file could be written out, allowing the analysis to be picked up again from that point. The PDPCollection and PDPData classes will manage this data for the pipeline.

The general role of the two classes is:

  • PDPData: for a single sequence file input, store and manage metadata such as the locations of relevant data files, and provide methods to operate on those files.
  • PDPCollection: manage several PDPData objects and provide methods for configuration file input and output, and overviews of the collection data.

PDPData

This class will store metadata and locations of datafiles for a single input.

Attributes

  • name: unique identifier for the input (string)
  • groups: classifications for the input (list)
  • seqfile: file with FASTA format sequence(s) for this input (path)
  • regions: file defining regions to be included/excluded in primer design (path)
  • primers: file containing predicted primer information in JSON format (path)
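
A skeletal sketch of the two classes as described above (methods beyond attribute storage are illustrative):

class PDPData(object):
    # Metadata and data file locations for a single sequence input
    def __init__(self, name, groups, seqfile):
        self.name = name          # unique identifier (string)
        self.groups = groups      # classifications (list)
        self.seqfile = seqfile    # FASTA sequence file (path)
        self.regions = None       # regions for primer design (path)
        self.primers = None       # predicted primers, JSON (path)

class PDPCollection(object):
    # Manage a set of PDPData objects; provide config file I/O
    def __init__(self):
        self.data = {}
    def add(self, pdpdata):
        self.data[pdpdata.name] = pdpdata
    def __len__(self):
        return len(self.data)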

How should we filter out features to restrict primer design locations?

The prodigal results generated in an early step are not currently used for primer design by the package. They are kept in a generic features field that points to a GFF file (which could store any feature set), and could be handled in a pdp.py filter step, implemented as pdp.py filter <INCONFIG> <OUTCONFIG>, with an optional --excludefeatures flag (the default would be to use the features to positively select regions from the input genome; --excludefeatures would instead select the inverse of that feature set).

Applying pdp.py filter after primer design would only reduce the total number of primers carried forward, and would not force design of primers only to selected regions. We would therefore need to apply the filter prior to primer design, generating a new sequence for each filtered region; these would be stitched together to form a new 'input genome' for primer design.

The filtered sequence could be carried forward in the seqfile field, except that we want to run primersearch against the original genomes; we should therefore carry forward a link to the filtered sequence in a new field, perhaps filtered_seqfile. We would want suitably long runs of Ns to separate the filtered regions, so an argument to specify the run length (default 250?), such as --spacerlen N, would be useful.
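
A minimal sketch of the stitching step, assuming Biopython and the default spacer length mooted above (stitch_regions is an illustrative name):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def stitch_regions(regions, name, spacerlen=250):
    # Join the filtered region sequences with runs of Ns into a
    # single 'input genome' record for primer design
    spacer = "N" * spacerlen
    stitched = spacer.join(str(region) for region in regions)
    return SeqRecord(Seq(stitched), id=name,
                     description="stitched filtered regions")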

This could be handled in new subcommands specific to the kind of filtering being performed: e.g. pdp.py cdsfilter, where the prodigal genecall step and the filtering of the genome are combined to modify the config file. A similar command, such as igrfilter, could be created, where the prodigal predictions are used to identify intergenic regions, and regions are chosen that include these plus some of the flanking gene (with an argument such as --flanklen N to decide how much of the gene to incorporate).

Alternatively, the pdp.py filter command could have flags like --prodigal or --igr to direct the action of the filter (or an --inplace option so that custom GFF files could be used). In that case the filtered_seqfile field would be created in the config file passed on to the next step, and the filter would need to be invoked explicitly with a --filter flag, such as pdp.py eprimer3 --filter <INCONFIG> <OUTCONFIG>.