\documentclass{article}
\RequirePackage{hyperref}
\usepackage[parfill]{parskip}
\usepackage[letterpaper, portrait, margin=1in]{geometry}
% Author
\usepackage{authblk}
% Images
\usepackage{graphicx}
\usepackage{lscape}
\usepackage{longtable}
% \usepackage{svg}
%\articlesubtype{This is the article type (optional)}
% \bibliography{paper-webserver.bib}
\usepackage{xcolor}
\newcommand\mwcomment[1]{\textcolor{red}{#1}}
\begin{document}
\title{Spfy: an integrated graph database for real-time prediction of bacterial phenotypes and downstream comparative analyses}
\author[1]{Kevin K Le\thanks{[email protected]}}
\author[1]{Matthew D Whiteside}
\author[1]{James E Hopkins}
\author[1]{Victor PJ Gannon}
\author[1]{Chad R Laing\thanks{[email protected]}}
\affil[1]{National Microbiology Laboratory at Lethbridge, Public Health Agency of Canada, Twp Rd 9-1, Lethbridge, AB, T1J 3Z4, Canada}
\renewcommand\Authands{ and }
\maketitle
\begin{abstract}
Public health laboratories are currently moving to whole-genome sequence (WGS) based analyses, and require the rapid, \textit{in-silico} prediction of the results of standard reference laboratory methods based solely on genomic data. Currently, these predictive genomics tasks rely on workflows that chain together multiple programs for the requisite analyses. While useful, these systems do not store the analyses in a genome-centric way, meaning the same analyses are often re-computed for the same genomes.
To solve this problem, we created Spfy, a platform that rapidly performs the common reference lab tests, uses a graph database to store and retrieve the results from the computational workflows, and links data to individual genomes using standardized ontologies.
The Spfy platform facilitates rapid phenotype identification, as well as the efficient storage and downstream comparative analysis of tens of thousands of genome sequences. Though generally applicable to bacterial genome sequences, Spfy currently contains 10,243 \textit{Escherichia coli} genomes, for which the \textit{in-silico} serotype and Shiga-toxin subtype, as well as the presence of known virulence factors and antimicrobial resistance determinants, have been computed. Additionally, the presence or absence of every region of the \textit{E. coli} pan-genome was computed and linked to each genome.
Owing to its database of diverse pre-computed results, and the ability to easily incorporate user data, Spfy facilitates hypothesis testing in fields ranging from population genomics to epidemiology, while mitigating the re-computation of analyses.
The graph approach of Spfy is flexible and can accommodate new analysis software modules as they are developed, easily linking new results to those already stored. Spfy provides a database and analysis approach for \textit{E. coli} that is able to match the rapid accumulation of WGS data in public databases.
\par
Database URL: \url{https://lfz.corefacility.ca/superphy/spfy/}.
\end{abstract}
\section{INTRODUCTION}
% new outline
% para
% 1. brief: WGS is standard
% 2. Big Problem: but tools are for individual analysis
% 3. 100,000 genomes, (lookup number for Enterobase, GenBank), how do we run analysis on it
% 4. previous methods (Galaxy/IRIDA: real-time, but no storage of results, other website examples - Denmark?), what problems they addressed
% 5. Problems that remain / lack of result storage means: recomputation, lost data, can't reference old analyses, not suited for "big data"
% 6. Solutions in general: store and increment, huge "big data" analyses, parallelization / queues
% para
% 1. Our specific problem, previous work
% 2. Why solving it is important (for Public health / research)
% 3. how we solved it
% para
% 4. benefits: rapid analyses in real-time -> huge comparisons, replace reference labs -> time & money saved, future work -> expand analyses, more genomes, more species
% 5. analyses modules -> conda -> IRIDA/Galaxy
% 6. short snippet on website link & github link
% 1. brief: WGS is standard
Whole genome sequencing (WGS) can provide the entire genetic content of an organism. This unparalleled resolution and sensitivity have recently transformed public-health surveillance and outbreak response \cite{ronholm2016navigating,lytsy2017time}. Additionally, the identification of novel disease mechanisms \cite{wang2014whole,yuen2015whole}, as well as rapid clinical diagnoses and reference lab tests, are now possible \cite{willig2015whole,dewey2014clinical}.
% 2. Big Problem: but tools are for individual analysis
Rapid characterization based on WGS relies on the outputs of multiple software programs, each targeted at a specific application. Examples include the identification of known antimicrobial resistance (AMR) genes, through software such as the Resistance Gene Identifier (RGI) \cite{mcarthur2013comprehensive}, ResFinder \cite{kleinheinz2014applying}, ARG-ANNOT \cite{gupta2014arg}, and ARIBA \cite{hunt2017ariba}; or the identification of known virulence factor (VF) genes through software such as VirulenceFinder \cite{kleinheinz2014applying}, SRST2 \cite{inouye2014srst2}, and GeneSippr \cite{lambert2015genesippr}.
For subtyping in clinical diagnoses and reference lab environments, software relies on pre-selected intraspecies genes or genomic regions, which are targeted through software such as Phylotyper \cite{whiteside2017phylotyper}, SerotypeFinder \cite{joensen2015rapid}, the EcOH dataset applied through SRST2 \cite{ingle2016silico}, and V-Typer \cite{carrillo2016comparative}. These methods represent \textit{in-silico} analogues of traditional wet-lab tests, which allow new whole-genome sequences to be viewed in the context of historical tests without the need for the time and labor of the traditional wet-lab tests.
Comprehensive platforms that combine individual analysis programs into a cohesive whole also exist. These include free platforms such as the Bacterium Analysis Pipeline (BAP) \cite{thomsen2016bacterial}, and the Pathosystems Resource Integration Center (PATRIC) \cite{wattam2016improvements}. Commercial applications also exist, such as Bionumerics, which is used by PulseNet International for the analysis of WGS data in outbreak situations, and which offers support as well as accredited, standardized tests \cite{swaminathan2001pulsenet}. These platforms are designed to be applied to individual projects.
Many of the analyses used in the characterization and study of bacterial genomes are broadly useful, and are therefore often computed multiple times for the same genome across different studies. An effective way to mitigate this re-computation is to make the storage and retrieval of results part of the analysis platform, with results linked to the genomes of interest through a standardized ontology. Such measures help ensure the rapid response times required for public health applications, and allow results to be integrated and progressively updated as new data become available.
% 4. Our specific problem, previous work
We have previously developed SuperPhy \cite{whiteside2016superphy}, an online predictive genomics platform targeting \textit{E. coli}. SuperPhy integrates pre-computed results with domain-specific knowledge to provide real-time exploration of publicly available genomes. While this tool has been useful for the thousands of pre-computed genomes in its database, the current pace of genome sequencing requires real-time predictive genomic analyses of tens-, and soon hundreds-of-thousands of genomes, as well as the long-term storage and referencing of these results, which the original SuperPhy platform was incapable of.
% ?. Why solving it is important (for Public health / research)
% Merging points from here into the above sections (2,3,4) to avoid repetition
% - everything is being sequenced (surveillance / outbreak / research)
% - previously mentioned common analyses / want to leverage pre-computed results
% - WGS does not discard old methods, linkage to thousands of historical results by developing in-silico methods of traditional tests
% - need fast / standardized outbreak response
% - need fast / standardized in-silico reference lab
% - need fast / standardized storage and retrieval of results based on ontology
% - all known data can be leveraged, allows the most informed decisions possible
% - etc.
% 5. how we solved it
Here, we present Spfy, a new platform that merges the pre-computed results of SuperPhy with a novel data storage and processing architecture. Spfy integrates a graph database with real-time analyses to avoid recomputing identical results. Additionally, graph-based result storage allows retrospective comparisons across stored results as more genomes are sequenced or populations change. Spfy is flexible, accommodating new analysis modules as they are developed. The database is available at \url{https://lfz.corefacility.ca/superphy/spfy/}.
% {needs to be beefed up, in language a biologist / public health worker would care about}
% end of new intro
% **************************************************************
% Keep this command to avoid text of first page running into the
% first page footnotes
\enlargethispage{-65.1pt}
% **************************************************************
% Some general comments
% The NAR Database issue is more of a showcase than a rigorous exploration of software design choices.
% Given this focus, I would suggest the following:
% 1. Increase/highlight the descriptions of the functions and capabilities, maybe by adding a Functionality section
% 2. In the Implementation (or Methods) section, only give a cursory description of the layout and components. Don't need to provide too much justification
% 3. Use the Results section to highlight the scope/size and speed. This can be short
% 4. In the Discussion, this is where I would expand on the justifications and reasons for specific design choices. Pick 2-3 main ones and discuss those (i.e. don't need to justify our choice of documentation software). Also compare with other software in Discussion.
% 5. Add a conclusions section
\section{FUNCTIONALITY}
% IMPORTANT: All Tables and Figues must be explicitly mentioned in the text.
% ONLY FOCUS ON THE ANALYSIS MODULES
% Describe available functions in spfy
% para covering everything
Spfy provides rapid \textit{in-silico} versions of common reference laboratory tests for the analysis of \textit{E. coli}. It supports the following \textit{in-silico} subtyping options: serotyping, through both O- and H-antigen identification, as well as VF gene determination, using ECTyper \url{https://github.com/phac-nml/ecoli_serotyping}; Shiga-toxin 1 (Stx1), Shiga-toxin 2 (Stx2), and Intimin typing using Phylotyper \cite{whiteside2017phylotyper}; and AMR annotation using the Resistance Gene Identifier from the Comprehensive Antibiotic Resistance Database \cite{mcarthur2013comprehensive}. An example of the VF results is given in Figure \ref{fig-tables}. Spfy also performs pan-genome analyses using Panseq \cite{laing2010pan}, with the entire pan-genome stored in the database and associated with the source genomes.
No account creation is required to use the platform. A sharable token is automatically created for users upon entering the website, and is embedded into the website address. Users can share results by copying their URL, and files submitted from different computers using the same token will be visible to anyone with the same link.
\begin{figure}[!htb]
\begin{center}{}
\includegraphics[width=\textwidth]{images/tables.png}
\end{center}
\caption{Detailed results for the serotyping and virulence factor subtyping task. While data storage in Spfy is graph-based, a familiar tabular structure is presented to users. The genome file, GCA\_001911825.1\_ASM191182v1\_genome.fna, was analyzed, with the determined serotype associated with the file and virulence factors associated with the contiguous DNA sequences on which they were found. The start and stop positions on the contig are provided, along with the percent identity cutoff used in the analysis.}
\label{fig-tables}
\end{figure}
Spfy handles all of the analysis tasks by dividing them into subtasks, which are subsequently distributed across a built-in task queue. Results are converted into individual graphs and stored within a larger graph database according to the previously created ontologies GenEpiO \cite{griffiths2017context}, FALDO \cite{bolleman2016faldo}, and TypOn \cite{vaz2014typon}, which include metadata for genotype, location, biomarkers, host, and source, among others.
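As a minimal sketch of this conversion step, the following Python snippet builds RDF triples for one analysis result with the rdflib library; the namespace and predicate names are illustrative placeholders, not the actual GenEpiO, FALDO, or TypOn terms used by Spfy.
\begin{verbatim}
# Sketch: converting one analysis result into RDF triples. The
# namespace and predicates are placeholders, not the actual
# GenEpiO/FALDO/TypOn terms used by Spfy.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/spfy/")   # hypothetical namespace
g = Graph()

genome = URIRef(EX["genome/GCA_001911825"])
g.add((genome, EX.hasSerotype, Literal("O157:H7")))
g.add((genome, EX.hasAMRGene, URIRef(EX["gene/aadA1"])))

# The resulting subgraph is merged into the central graph store,
# where it links to any existing nodes for the same genome.
print(g.serialize(format="turtle"))
\end{verbatim}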
By integrating task distribution with graph storage, Spfy enables large-scale analyses, such as epidemiological association studies. Any data-type or relation in the graph is a valid option for analysis. This means that genomes can be compared on the basis of the presence or absence of pan-genome regions, serotype, subtyping data, or provided metadata such as location or host-source. All results are displayed to users in real-time, usually within 2-3 minutes.
For example, Spfy can determine, using Fisher's exact test, whether a statistically significant difference in the carriage of any identified AMR gene exists between \textit{E. coli} genomes of serotype O157 and genomes of serotype O26, as shown in Figure \ref{fig-groupings}.
Different data types can also be joined into a group through the logical connectives AND, OR, and NOT. This approach can be used to compare any data regardless of the source software.
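As a minimal sketch of the statistic behind such a comparison, the following Python snippet applies Fisher's exact test to the carriage counts of a single gene in two genome groups; the function name and example counts are illustrative, not Spfy's actual API.
\begin{verbatim}
# Sketch of the per-gene test behind a group comparison; the function
# name and example counts are illustrative, not Spfy's actual API.
from scipy.stats import fisher_exact

def compare_gene_carriage(present_a, total_a, present_b, total_b):
    """2x2 contingency test for one gene between two genome groups."""
    table = [[present_a, total_a - present_a],  # group A: present/absent
             [present_b, total_b - present_b]]  # group B: present/absent
    return fisher_exact(table)                  # (odds ratio, p-value)

# e.g. a hypothetical AMR gene found in 120 of 903 O157 genomes
# and in 15 of 291 O26 genomes
odds_ratio, p_value = compare_gene_carriage(120, 903, 15, 291)
\end{verbatim}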
\begin{figure}[!hb]
\begin{center}
\includegraphics[width=\textwidth]{images/o157vo26_amrgenes}
\end{center}
\caption{
% This needs to be specific to the image we are showing eg. the O157 vs. O26 comparison. We mention the general comparison in the text, and this can be offered as a specific example of that.
\textit{E. coli} genomes of serotype O157 (903 genomes) compared against genomes of serotype O26 (291 genomes) for statistically significant differences in the carriage of 129 AMR genes. Fisher's exact test is used by Spfy for these comparisons.}
\label{fig-groupings}
\end{figure}
\section{IMPLEMENTATION}
% para: intro to the spfy stack
The server-side code for Spfy, including graph generation and the analysis modules, was developed in Python, with the front-end website developed using the React JavaScript library \url{https://facebook.github.io/react/}. When new data are added to the database, the following steps are taken:
i) The upload begins through the website, where user-defined analysis options are selected. The results of these analyses are immediately reported to the user, while all other non-selected analyses are subsequently completed in the background and stored in the database without interaction from the user. The public web service accepts uploads of up to 200 MB (approximately 50 \textit{E. coli} genomes uncompressed, or 120 genomes compressed) at a time, though an unlimited amount of data can be submitted to a local instance.
ii) User-selected analyses are enqueued with the Redis Queue \url{http://python-rq.org/} task queue. Redis Queue consists of a Redis Database \url{https://redis.io/} and task queue workers which run as Python processes.
iii) The workers dequeue the analyses, run them in parallel, and temporarily store results in the Redis database.
iv) Python functions parse the results and permanently store them in Blazegraph \url{https://www.blazegraph.com/}, the graph database used for Spfy.
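A minimal sketch of steps ii) to iv) using Redis Queue is shown below; the dotted module paths are hypothetical stand-ins, not Spfy's actual task functions.
\begin{verbatim}
# Hedged sketch of steps ii)-iv) with Redis Queue; the dotted paths
# 'analyses.run_analysis' and 'storage.store_result' are hypothetical.
from redis import Redis
from rq import Queue

queue = Queue(connection=Redis())   # backed by the Redis database

# ii) enqueue one analysis subtask per genome; iii) a pool of worker
# processes dequeues and runs these jobs in parallel.
job = queue.enqueue("analyses.run_analysis", "genome.fna", "serotyping")

# iv) a dependent job parses the finished result and permanently
# stores it in the Blazegraph graph database.
queue.enqueue("storage.store_result", depends_on=job)
\end{verbatim}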
\subsection{Data Storage}
% para
% 0. Goals: big-data, everything linked, easy addition of new links
% 1. spfy is built around graph technologies
% 1. how we structure our graph
% 2. ontologies used
% 3. inferencing
% 3 1/2. SPARQL queries
% para: the semantic web
Semantic web technologies describe the relationships between data, and have been proposed as an open standard for sharing public information \cite{berners2001semantic}; graph databases are a flexible means of storing this information \cite{horrocks2005semantic}. A biological data point can be a genome, a contiguous DNA sequence, or a gene, and these are linked together in a searchable graph structure using existing ontologies. This system is flexible and allows novel data to be incorporated into the existing graph.
\begin{figure}[!hb]
\begin{center}
\includegraphics[width=\textwidth]{images/ontology_r6}
\end{center}
\caption{Structure of the Spfy graph database. Brackets highlight the source of different data points and the software they were generated from. Data are added as the analysis modules complete, at varying times, and the overall connections are inferred by the database. Non-bracketed sections are sourced from the uploaded genome files or user-supplied metadata. The figure was generated using \url{http://www.visualdataweb.de/webvowl/}.}
\label{fig-ontology}
\end{figure}
% Permanent storage + why a graph is preferable over tables
The permanent storage of results is a one-time cost, which avoids re-computation when the same analysis is re-run. During analyses, Spfy searches the graph for all data points annotated with the queried ontology term. This graph data is then converted into the required structure, usually numerical arrays, for the given analysis module.
In a graph database, a search can begin at any node or attribute. This is in contrast to a SQL database, which requires a predefined schema, or a NoSQL database, which treats data as documents with varying structure.
For example, the addition of a new analysis module would typically require a new table definition in a SQL database, or the addition of a new document type in a NoSQL database. With a graph database, new nodes or attributes are added and then connected to existing data, removing the need for explicit joins or data conversions. Additionally, data can be added to Spfy in parts, and the database will infer the correct connections between the data.
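As an illustration of the retrieval step described above, the sketch below queries a Blazegraph SPARQL endpoint with the SPARQLWrapper library; the endpoint URL and the predicate are placeholders, not the ontology terms Spfy actually stores.
\begin{verbatim}
# Sketch of retrieving all data points for one ontology term over a
# Blazegraph SPARQL endpoint; the URL and predicate are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:9999/blazegraph/sparql")
sparql.setQuery("""
    SELECT ?genome ?gene WHERE {
        ?genome <http://example.org/spfy/hasAMRGene> ?gene .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Flatten the bindings into (genome, gene) pairs, from which the
# presence/absence arrays used by analysis modules can be built.
pairs = [(b["genome"]["value"], b["gene"]["value"])
         for b in results["results"]["bindings"]]
\end{verbatim}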
Spfy primarily uses Blazegraph \url{https://github.com/blazegraph/database} for storage along with MongoDB to cache a hash table for duplicate checking.
The cache allows Spfy to more efficiently check for duplicate files in Blazegraph than would be possible through a search of the graph structure.
MongoDB is also used to support the synchronized user sessions feature of the website.
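A minimal sketch of such a duplicate check, assuming files are keyed by a content hash, is shown below; the database, collection, and field names are hypothetical, not Spfy's actual schema.
\begin{verbatim}
# Sketch of a content-hash duplicate check; database, collection, and
# field names are hypothetical, not Spfy's actual schema.
import hashlib
from pymongo import MongoClient

hashes = MongoClient()["spfy"]["file_hashes"]

def is_duplicate(path):
    """Return True if an identical file has already been stored."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if hashes.find_one({"sha256": digest}):
        return True                        # results already in Blazegraph
    hashes.insert_one({"sha256": digest})  # record the new file
    return False
\end{verbatim}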
\subsection{Web design}
% para
% 1. goals: intuitive/familiar, ease of use
% 2. design specs
% 3. Google Material design
% 1. implementation: reactjs, react-md, ES6, JSX
% 4. separation from Flask layer
The front-end website is written as a single-page application.
To ensure a familiar user interface, we followed the Material Design specification \url{https://material.io/}, published by Google, built around a card-based design (Figure \ref{fig-results}).
Both the task selection and result displays follow the same design pattern: while data storage is graph-based, the results of analysis modules are presented to users in a familiar tabular structure, and are available for download as .csv spreadsheet files.
\begin{figure}[!htb]
\begin{center}
\includegraphics[width=\textwidth]{images/results.png}
\end{center}
\caption{The results interface for submitted tasks. Cards represent individual tasks submitted by the user, such as checking the Database status, subtyping of single genomes, population comparisons, or subtyping of multiple genomes.}
\label{fig-results}
\end{figure}
\subsection{Service Virtualization}
% para
% 1. goals: real-time, support for pipelines (linked modules)
% 2. how pipelines have been handled in the past: Galaxy, other examples
% 3. Python, RQ
% para
% 1. how we implemented RQ
% related: packaging of modules in conda
% para
% 1. goals: scale analyses to "big-data", error handling
% 3. how we handle parallelization with RQ
% para
% 1. how many tasks have we tested this with
% 2. error handling: rq-dashboard, sentry
% 3. why options like sentry are better than traditional logging: scales well to tons (big-data levels) of tasks, groups the same errors together, reporting via email
% para
% 1. goals: why compartmentalization
% 1. how we implemented docker
% 3. how this lets us replicate worker containers and link everything together
% cost of docker < a full VM
% it's not quite a full OS
Docker \url{https://www.docker.com/} is a virtualization technology that simulates self-contained operating systems on the same host computer, without the overhead of full hardware virtualization \cite{felter2015updated}.
The Spfy platform depends on a series of webservers, databases, and task workers, and uses Docker to compartmentalize these services, which are then networked together using Docker-Compose \url{https://docs.docker.com/compose/} (see Figure \ref{fig-docker}).
Docker integration ensures that software dependencies, which are typically manually installed \cite{doi:10.1093/bioinformatics/btu153,laing2010pan,inouye2014srst2,naccache2014cloud}, are instead handled automatically.
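The sketch below shows the general shape of such a Docker-Compose definition; the service names and images are illustrative, not Spfy's actual configuration.
\begin{verbatim}
# Illustrative docker-compose.yml; service names and images are
# hypothetical, not Spfy's actual configuration.
version: "3"
services:
  webserver:            # Flask API behind the ReactJS front end
    build: ./app
    ports: ["8000:8000"]
    depends_on: [redis, blazegraph, mongo]
  worker:               # Redis Queue worker; replicate to scale
    build: ./app
    command: rq worker
    depends_on: [redis]
  redis:
    image: redis
  blazegraph:
    image: lyrasis/blazegraph
  mongo:
    image: mongo
\end{verbatim}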
\begin{figure}[!hb]
\begin{center}
\includegraphics[width=\textwidth]{images/docker}
\end{center}
\caption{The Docker containers used in Spfy. Arrows represent the connections between different containers, and the entire platform can be recreated with a single command using its Docker-Compose definition. Users access the platform using the ReactJS based website, which makes requests to the Flask API. Any requested analysis task is distributed to the Redis Task Queue and data files are stored in a Docker volume. MongoDB stores a hash table for efficient duplicate checking of results in Blazegraph.}
\label{fig-docker}
\end{figure}
% standard web tech means you can deploy to different cloud computing services
One of the key benefits of using commonplace technologies is compatibility with other infrastructure resources.
Docker containers are widely supported by cloud computing services, including Amazon Web Services (AWS) \url{https://aws.amazon.com/docker/}, Google Cloud Platform (GCloud) \url{https://cloud.google.com/container-engine/}, and Microsoft Azure \url{https://azure.microsoft.com/en-us/services/container-service/}, as well as by self-hosted cloud computing technologies such as OpenStack \url{https://wiki.openstack.org/wiki/Docker}.
% we may want to sign up for a free trial to give an example of this
Spfy packages its compute nodes as reproducible Docker containers, allowing the platform to easily scale to demand.
\section{RESULTS}
Spfy was tested with 10,243 public \textit{E. coli} assembled genomes from Enterobase, storing every sequence and the results for all included analysis modules. These included: serotyping (O-antigen, H-antigen), toxin sub-typing (Shiga-toxin 1, Shiga-toxin 2, and Intimin), the identification of VF and AMR determinants, and determination of the pan-genome content of \textit{E. coli}.
The resulting database has 17,820 nodes and 3,811,473 leaves, with 1,125,909,074 object properties.
Spfy has been publicly available since May 2017. The server accepts assembled \textit{E. coli} genomes with the \textit{.fasta} or \textit{.fna} extensions. Submissions are subjected to quality control, ensuring the submitted genomes are \textit{E. coli} sequences before subsequent analyses are run.
\par
% analysis run-time / throughput with different levels of parallelization
% particularly for statistical tests
% should do a 1 genome = X number VFs, Y number of genomes for Y*X retrieval/analysis
Spfy is running on a virtual machine (VM) with 8 vCPUs emulating single-core Intel® Xeon® E3-series processors, and 32 GB of RAM.
The VM is running CentOS version 7.3, with Docker version 17.06.1-ce and Docker-Compose version 1.12.0.
On comparison tasks, Spfy can retrieve and compare 1 million nodes/attributes in the graph database in approximately 70 seconds.
As shown in Figure \ref{fig:fishers_performance}, performance scales linearly with the number of genomes involved in the comparison, or the number of target nodes retrieved per genome; 1.5 million nodes/attributes can be compared in approximately 90 seconds, and 2 million nodes/attributes in approximately 110 seconds.
% The X and Y axes labels need to be slightly larger. Designate the panes a), b), and c) and then reference them in the caption. The caption should contain the information that the titles on each pane currently do.
\begin{figure}[!htb]
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{images/fishers_genomes.png}
\endminipage\hfill
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{images/fishers_targets.png}
\endminipage\hfill
\minipage{0.32\textwidth}%
\includegraphics[width=\linewidth]{images/fishers_overall.png}
\endminipage
\caption{Runtimes of Fisher's exact test depending on the number of nodes/attributes involved in the comparison. A: Runtimes as the number of genomes increased, for a fixed number (107) of targets per genome. B: Runtimes as the number of targets increased, for a fixed number (116) of genomes per target. C: Overall runtimes as the total number of targets retrieved increased; the total number of targets was calculated as (number of genomes in group A + number of genomes in group B) $\times$ number of targets per genome. In all cases, a linear increase in runtime was observed as the number of targets or genomes increased.}\label{fig:fishers_performance}
\end{figure}
On analysis tasks, Spfy runs all included analysis modules in an average of 130 seconds per genome.
Spfy can also queue batches of genomes for analysis, decreasing the runtime to an average of 54 seconds per file due to the parallelization of analysis runs (Figure \ref{fig:spfy_performance}).
In total, the platform analyzed 50 genomes in 45 minutes, and 100 genomes in 89 minutes.
\begin{figure}[!htb]
\begin{center}
\minipage{0.5\textwidth}%
\includegraphics[width=\linewidth]{images/spfy_batches_totals.png}
\endminipage
\caption{Total runtimes of Spfy's analysis modules for batches of files. The blue line indicates the actual time to completion after accounting for parallelization; 50 files are analyzed in 45 minutes and 100 files in 89 minutes.}\label{fig:spfy_performance}
\end{center}
\end{figure}
\section{DISCUSSION}
% Gist: we address the "Limitations" of \cite{de2015trends}.
% Namely:
% 1. Custom formats.
% 2. Use of legacy tech (SQL) means you can't accommodate "Big-Data" goals.
% drawbacks - big data
Many bioinformatics software programs have been developed \textit{ad hoc}, with individual researchers and laboratories developing software specific to their environment \cite{de2015trends}.
Such tools were often script-based, with custom data formats, and only suitable for small collections of data \cite{de2015trends}.
Recent efforts \cite{goecks2010galaxy,thomsen2016bacterial} have focused on providing a common web interface for these programs, while still returning the same result files.
However, many subfields of biology now require the analysis of big data, where inputs are taken from a variety of analysis programs and involve large-scale data warehousing \cite{schatz2015biological}.
The ability to integrate data from different source technologies, merge submissions from other labs, and distribute computations over fault-tolerant systems are now required \cite{schatz2015biological}.
One of the key goals in developing Spfy was to accommodate and store a variety of result formats, and then to make the data from these results retrievable and usable as inputs for downstream analyses, such as predictive biomarker discovery.
% This has not been shown in this paper; it has only been stated. We should probably have a large-scale analysis for an example, with accompanying times etc.
We have shown how a graph database can accommodate the results from a variety of bioinformatics programs, and how Spfy remains performant when retrieving the results of multiple analyses across more than 10,000 genomes. Spfy provides results from large-scale comparisons with the same efficiency as previous analyses on single files.
\subsection{Impact on Public Health Efforts}
% para
% focus on application
The isolation and characterization of bacterial pathogens are critical for public health laboratories to respond rapidly to outbreaks, and to effectively monitor known and emerging pathogens through surveillance programs.
Until recently, public-health agencies relied on laboratory tests such as serotyping, pulsed-field gel electrophoresis (PFGE), PCR-based amplification of known VFs, and disc-diffusion assays, for the characterization of bacterial isolates in outbreak, surveillance, and reference laboratory settings \cite{ronholm2016navigating}. Current efforts are focused on predictive genomics, where the relevant phenotypic information can be determined through examination of the whole-genome sequence without need for the traditional laboratory tests.
Spfy provides rapid and easy predictive genomic analyses of \textit{E. coli} genomes while also addressing the problem of large scale comparisons. With the larger datasets involved in population genomics, it is no longer viable for individual researchers to download data to perform comparisons. Instead, efforts have focused on storing biological data online and enabling analyses of those data \cite{schatz2015biological}. By using a graph database, Spfy integrates results from different technologies, as well as laboratory results and user-submitted metadata. In addition, datasets can be built and submitted from multiple labs for joint analyses.
\subsection{Comparison with other bioinformatics pipeline technologies}
% trying to cut down on details that would be better suited for the "Implementation" section.
% namely galaxy
The automated analysis of WGS data is currently facilitated by existing scientific workflow technologies such as Galaxy \cite{goecks2010galaxy}. Galaxy aims to provide a reproducible computational interface that is accessible to individuals without programming knowledge. Galaxy defines a formal schema for linking different analysis software together, so that an entire pipeline can be replicated and also extended as new tools are developed. The Galaxy workflow focuses on running an individual analysis pipeline; it does not include functionality to store and collate analysis results for large-scale comparative studies.
The Bacterium Analysis Pipeline (BAP) \cite{thomsen2016bacterial} provides an integrated analysis pipeline for bacterial WGS data as a web service. It provides an individual per-genome report of the determined species, multilocus sequence type, VF and AMR genes \cite{thomsen2016bacterial}.
Spfy is similar to these technologies in that it automates workflows for users and uses task queues to distribute selected analyses.
% re: Matt "2. When comparing to IRIDA, BAP etc., can mention some differences with Superphy, e.g. the storage of interim result data that allows downstream integrated analysis"
% If we mention speed comparisons, we need to have actual data supporting them
On a per-file basis, Spfy performs at a similar speed to BAP on predictive genomics tasks, though Spfy does not provide genome assembly services.
BAP reported an average runtime of 8-9 minutes per genome over 476 runs \cite{thomsen2016bacterial}, which, after accounting for its assembly step, is on the same scale as Spfy's average of just over 2 minutes.
Note, however, that the two platforms are not directly comparable, due to differences in the analysis tasks involved.
On similar tasks, such as AMR determination, the ResFinder program included in BAP took an average of 3-4 minutes \cite{thomsen2016bacterial}, comparable to the RGI program included in Spfy, which took an average of 1 minute 30 seconds.
However, unlike these workflow managers, Spfy is designed to mitigate the re-computation of analyses by storing results in a graph database for downstream comparative studies (Table \ref{db_comparison}). This allows Spfy to, for example, perform population-wide analyses on varied data produced by multiple diverse software programs.
\small \input{tables/comparison}
PATRIC \cite{wattam2016improvements} and Spfy share the same goal of integrated analyses. PATRIC supports comparing up to nine user-submitted genomes against a reference genome, based on gene annotations; the platform indexes a NoSQL document store to compare similar document types. Unlike PATRIC, Spfy provides the ability to perform statistical comparisons of any permutation of a population group based on the chosen data types, and the graph database of Spfy places no limit on the number of genomes grouped for comparison.
\section{CONCLUSIONS}
The integrated approach taken in the creation of Spfy, where the analyses, storage, and retrieval of results are combined, provides enormous benefits for the large-scale analysis of \textit{E. coli}. The developed analysis modules are also self-contained and can be used in existing platforms such as Galaxy. Future work will focus on adding machine learning modules to improve genotype/phenotype predictions, and on supporting additional bacterial species such as \textit{Salmonella} and \textit{Campylobacter}. The source code for Spfy is hosted at \url{https://github.com/superphy/backend}, and is available for free under the open-source Apache 2.0 license. A developer guide is provided at \url{https://superphy.readthedocs.io/en/latest/}.
\textit{Conflict of interest}. None declared.
\newpage
\bibliographystyle{unsrt}
\bibliography{paper-webserver}
\end{document}