Merge pull request #19 from superphy/proposal-novel
ADD: Proposal after Chad and Matt's edits
kevinkle authored Dec 17, 2017
2 parents 8fca196 + 6d78ef2 commit 64d5157
Showing 3 changed files with 50 additions and 2 deletions.
Binary file added paper-proposal.pdf
Binary file not shown.
48 changes: 48 additions & 0 deletions paper-proposal.tex
@@ -0,0 +1,48 @@
\documentclass{article}
\usepackage[parfill]{parskip}
\usepackage[backend=bibtex,style=numeric-comp]{biblatex}
\bibliography{paper-webserver.bib}

\begin{document}
% "contain, at the top, the following affirmative statement. "This website is free and open to all users and there is no login requirement." Additionally, any third party software employed by the website that has more restrictive usage terms must be listed."
This website is free and open to all users and there is no login requirement. The code for this webserver, and all third party software used, are available under the open-source Apache 2.0, BSD 3-clause, or similar licenses. \para

% "include the website address; website name; and the names, affiliations, and email addresses of all authors."
The website is available at \url{https://lfz.corefacility.ca/superphy/spfy/}. Spfy's code is provided at \url{https://github.com/superphy/backend} and documentation at \url{https://superphy.readthedocs.io/en/latest/}. \para

% MAIN CONTENT
% "include a notification if this is an update from a previous publication in the Web Server issue, and in that case, include an estimate of the number of users and the number of citations."
% "For web servers, or essentially similar web servers, that have been the subject of a previous publication, including publication in journals other than NAR, there is a minimum two-year interval before re-publication in the Web Server Issue."
Our proposal covers an update to Superphy \citep{whiteside2016superphy}, an online predictive genomics platform targeting \textit{Escherichia coli}.
The update, called Spfy, uses graph data structures to store and retrieve results for computational workflows.
We demonstrate the ability of graph data structures to scale to the [approximate number, e.g. greater than 50,000] whole-genome sequences accumulated so far, and show the ability to scale to X genomes.
% I'm unsure if we should add more about the subtyping options. For example, see:
% https://github.com/superphy/paper_platform/commit/c017b1e022d310e16a1433af9d58a73e9550a401
Current comparative computational workflows chain different analysis software, but lack storage and retrieval methods for generated results.
% "IF THE WEBSITE IMPLEMENTS A META-SERVER OR COMPUTATIONAL WORKFLOW, the summary MUST describe 1) significant added value beyond the simple chaining together of existing third party software or the calculation of a consensus prediction from third party predictors and classifiers; and at least one of the following: 2) how user time for data gathering and multi-step analysis is significantly reduced, or 3) how the website offers significantly enhanced display of the data and results."
By making the storage and retrieval of results part of the platform, with data effectively linked to the organisms of interest through a standardized ontology, we can mitigate the recomputing of analyses.
Within Spfy, we store the output from every analysis and link these outputs together in the context of a genome graph. This graph also stores metadata for each genome, facilitating inquiries ranging from population genomics to epidemiological investigations.
Integrated data storage will be necessary as whole genome sequencing (WGS) data for bacterial pathogens have accumulated in public databases in the tens of thousands, with hundreds of thousands set to be available within the next few years. \para

% STATISTICS{}
% "provide descriptions of the input data, the output, and the processing method; complete citations for previous publications of the method or the web server; and two to four keywords. Additionally, authors must indicate how long the server has been running, the number of inputs analyzed during testing, and an estimate of the number of individuals outside of the authors' group who have been involved in the testing."
Spfy was tested with 59,534 public assembled \textit{E. coli} genomes (5,353 from GenBank and 54,181 from Enterobase; $\sim$596 GB), storing both the entire sequences and the results of all included analysis modules.
Spfy provides real-time subtyping, and the results are immediately displayed to the user following their completion.
Subtyping options include O-antigen, H-antigen, Shiga-toxin 1 (Stx1), Shiga-toxin 2 (Stx2), and Intimin typing. Reference-lab type tests include virulence factor and antimicrobial resistance annotation. All genomes are analyzed within the pan-genome framework of \textit{E. coli}.
The resulting database had XYZ million nodes and XYZ million edges, with XYZ object properties, which worked out to X TB of data stored. \para

% COMPARED TO EXISTING PLATFORMS
% This aims to be more of an implementation paragraph.
Existing scientific workflow technologies such as Galaxy \cite{goecks2010galaxy}, and pipelines such as the Bacterium Analysis Pipeline (BAP) \cite{thomsen2016bacterial} and the Integrated Rapid Infectious Disease Analysis (IRIDA) platform (\url{http://www.irida.ca/}), help automate the use of WGS data for public-health surveillance.
% data integration
Like IRIDA and BAP, Spfy automates workflows for users, and like Galaxy, Spfy uses task queues to distribute selected analyses. File uploads begin through the ReactJS-based website, where user-defined analysis options are selected. To these concepts, we add the use of Docker containerization for task queue workers, thus allowing analysis software to safely run in parallel.
To avoid proliferating ontologies, and to allow Spfy to integrate with existing ones, annotations from the GenEpiO \citep{griffiths2017context}, FALDO \citep{bolleman2016faldo}, and TypOn \citep{vaz2014typon} ontologies are used to describe biological data.
The entire platform is packaged using Docker-Compose, and can be recreated with a simple command. \para

% up-time

% collaborators

% analysis run-time / throughput with different levels of parallelization
\para
\end{document}
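
As a rough illustration of the storage model the proposal describes (analysis outputs linked to a genome node, so that a repeated request reuses stored results instead of recomputing them), the following is a minimal in-memory sketch. It is not Spfy's implementation: the `ResultGraph` class, the `run_analysis` helper, and the predicate names `hasAnalysis` and `hasResult` are hypothetical and are not taken from GenEpiO, FALDO, or TypOn.

```python
from collections import defaultdict


class ResultGraph:
    """Tiny in-memory triple store: (subject, predicate, object) edges.

    Hypothetical stand-in for a real graph database.
    """

    def __init__(self):
        # subject -> predicate -> set of objects
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, subject, predicate, obj):
        self._edges[subject][predicate].add(obj)

    def objects(self, subject, predicate):
        # Read through .get so that lookups never create empty entries.
        return set(self._edges.get(subject, {}).get(predicate, set()))


def run_analysis(graph, genome_id, analysis, compute):
    """Return results for (genome, analysis), computing them at most once."""
    node = f"{genome_id}/{analysis}"
    # If the genome is already linked to this analysis node, reuse its results.
    if node in graph.objects(genome_id, "hasAnalysis"):
        return graph.objects(node, "hasResult")
    for value in compute(genome_id):
        graph.add(node, "hasResult", value)
    graph.add(genome_id, "hasAnalysis", node)
    return graph.objects(node, "hasResult")


if __name__ == "__main__":
    graph = ResultGraph()
    # Metadata lives on the same genome node as the analysis links.
    graph.add("genome:example1", "isolatedFrom", "bovine")
    # The lambda stands in for a real serotyping tool.
    print(run_analysis(graph, "genome:example1", "serotype", lambda g: ["O157", "H7"]))
```

Because metadata and analysis results hang off the same genome node, a population-level query reduces to walking edges from a set of genome nodes, which is the property the proposal relies on.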
4 changes: 2 additions & 2 deletions paper-webserver.tex
@@ -122,7 +122,7 @@ \section{FUNCTIONALITY}

% Need a sentence introducing the tools otherwise this sounds just like the intro section
Prebuilt tools are available for a variety of reference laboratory tasks.
Spfy performs: O-antigen typing, H-antigen typing, and VF gene determination using ECtyper \url{https://github.com/phac-nml/ecoli\_serotyping}, STX typing using Phylotyper \citep{whiteside2017phylotyper}, and AMR gene determination using the RGI program \citep{mcarthur2013comprehensive}. Spfy also performs bioinformatics analyses: pangenome generation using Panseq \citep{laing2010pan}, statistical significance testing of genome markers for user-defined groups using SciPy \citep{jones2014scipy}, and support vector machine (SVM)-backed AMR predictions using Scikit-learn \citep{pedregosa2011scikit}.
Spfy performs: O-antigen typing, H-antigen typing, and VF gene determination using ECtyper \url{https://github.com/phac-nml/ecoli\_serotyping}, STX typing using Phylotyper \citep{whiteside2017phylotyper}, and AMR gene determination using the RGI program \citep{mcarthur2013comprehensive}. Spfy also performs bioinformatics analyses: pangenome generation using Panseq \citep{laing2010pan}, statistical significance testing using SciPy \citep{jones2014scipy}, and support vector machine (SVM)-backed AMR predictions using Scikit-learn \citep{pedregosa2011scikit}.
For the larger population comparisons, we store and aggregate the results from individual genome analyses.
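
To make the aggregation step concrete, here is a small sketch of testing one genome marker for association with two user-defined groups, using stored per-genome results. The text states only that SciPy is used; the choice of Fisher's exact test is an assumption (a common option for presence/absence data), and the helper names `marker_table` and `fisher_exact_two_sided` are hypothetical. `scipy.stats.fisher_exact` provides the same test; the version below uses only the standard library so that it stays self-contained.

```python
from math import comb


def marker_table(results, group_a, group_b, marker):
    """Count marker presence/absence in two groups of genomes.

    `results` maps a genome id to the set of markers found in that genome.
    Returns (a, b, c, d) for the 2x2 table [[a, b], [c, d]], where rows are
    the groups and columns are marker present / marker absent.
    """
    a = sum(marker in results[g] for g in group_a)
    c = sum(marker in results[g] for g in group_b)
    return a, len(group_a) - a, c, len(group_b) - c


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x marker-positive genomes in group A.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    stored = {
        "g1": {"stx1"}, "g2": {"stx1", "eae"},
        "g3": set(), "g4": {"eae"},
    }
    table = marker_table(stored, ["g1", "g2"], ["g3", "g4"], "stx1")
    print(table, fisher_exact_two_sided(*table))
```

Only the per-genome marker sets are needed as input, so the comparison can be rerun for any user-defined grouping without repeating the underlying genome analyses.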

% para covering ectyper & RGI -- don't really need this
@@ -295,7 +295,7 @@ \subsection{Comparison with other bioinformatic pipeline technologies}
% trying to cut down on details that would be better suited for the "Implementation" section.

% namely galaxy
Other scientific workflow technologies such as Galaxy \citep{goecks2010galaxy}, and bioinformatics pipelines IRIDA and BAP also run analysis modules on WGS data.
Scientific workflow technologies such as Galaxy \citep{goecks2010galaxy}, and bioinformatics pipelines IRIDA and BAP run analysis modules on WGS data.
Galaxy aims to provide a reproducible, computation-based research environment which is accessible to individuals without programming knowledge. Galaxy tackles the problem of linking different analysis software together, by defining interdependencies using a custom schema. A visual workflow editor is also provided for ease of use.
IRIDA is built on top of the Galaxy framework, and adds prebuilt pipelines specific to bioinformatics uses, as well as sequence and result storage. IRIDA takes a project-based approach, with sequences stored per project, and results stored linearly per sequence. IRIDA adds controls for collaborating on projects, and uses common terms found in the GenEpiO ontology to describe results.
