The study of cellular structure and core biological processes -- transcription, translation, signaling, metabolism, etc. -- in humans and model organisms will greatly impact our understanding of human disease over the long horizon [@tag:Nih_curiosity]. Predicting how cellular systems respond to environmental perturbations and how they are altered by genetic variation remains a daunting task. Deep learning offers new approaches for modeling biological processes and integrating multiple types of omic data [@doi:10.1038/ncomms13090], which could eventually help predict how these processes are disrupted in disease. Recent work has already advanced our ability to identify and interpret genetic variants, study microbial communities, and predict protein structures, which also relates to the problems discussed in the drug development section. In addition, unsupervised deep learning has enormous potential for discovering novel cellular states from gene expression, fluorescence microscopy, and other types of data that may ultimately prove to be clinically relevant.
Progress has been rapid in genomics and imaging, fields where important tasks are readily adapted to well-established deep learning paradigms. One-dimensional convolutional and recurrent neural networks are well-suited for tasks related to DNA- and RNA-binding proteins, epigenomics, and RNA splicing. Two-dimensional CNNs are ideal for segmentation, feature extraction, and classification in fluorescence microscopy images [@doi:10.3109/10409238.2015.1135868]. Other areas, such as cellular signaling, are biologically important but have been studied less frequently to date, with some exceptions [@tag:Chen2015_trans_species]. This may be a consequence of data limitations or greater challenges in adapting neural network architectures to the available data. Here, we highlight several areas of investigation and assess how deep learning might move these fields forward.
Gene expression technologies characterize the abundance of many thousands of RNA transcripts within a given organism, tissue, or cell. This characterization can represent the underlying state of the given system and can be used to study heterogeneity across samples as well as how the system reacts to perturbation. While gene expression measurements were traditionally made by quantitative polymerase chain reaction (qPCR), low-throughput fluorescence-based methods, and microarray technologies, the field has shifted in recent years to primarily performing RNA sequencing (RNA-seq) to catalog whole transcriptomes. As RNA-seq continues to fall in price and rise in throughput, sample sizes will increase and training deep models to study gene expression will become even more useful.
Already several deep learning approaches have been applied to gene expression data with varying aims. For instance, many researchers have applied unsupervised deep learning models to extract meaningful representations of gene modules or sample clusters. Denoising autoencoders have been used to cluster yeast expression microarrays into known modules representing cell cycle processes [@tag:Gupta2015_exprs_yeast] and to stratify yeast strains based on chemical and mutational perturbations [@tag:Chen2016_exprs_yeast]. Shallow (one hidden layer) denoising autoencoders have also been fruitful in extracting biological insight from thousands of Pseudomonas aeruginosa experiments [@tag:Tan2015_adage; @tag:Tan2016_eadage] and in aggregating features relevant to specific breast cancer subtypes [@tag:Tan2014_psb]. These unsupervised approaches applied to gene expression data are powerful methods for identifying gene signatures that may otherwise be overlooked. An additional benefit of unsupervised approaches is that ground truth labels, which are often difficult to acquire or are incorrect, are nonessential. However, the genes that have been aggregated into features must be interpreted carefully. Attributing each node to a single specific biological function risks over-interpreting models. Batch effects could cause models to discover non-biological features, and downstream analyses should take this into consideration.
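To make the workflow concrete, the sketch below implements a minimal one-hidden-layer denoising autoencoder for a gene expression matrix (samples by genes) in PyTorch. The layer sizes, corruption level, and training details are illustrative assumptions rather than the exact configurations used in the studies cited above; the point is only that the hidden-node weights learned in this way can be mined for candidate gene signatures, which must still be interpreted cautiously.

```python
# A minimal sketch of a one-hidden-layer denoising autoencoder for a gene
# expression matrix (samples x genes). Dimensions, noise level, and training
# details are illustrative assumptions.
import torch
import torch.nn as nn

n_genes, n_hidden = 5000, 50          # hypothetical dimensions

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_genes, n_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_genes), nn.Sigmoid())

    def forward(self, x, corruption=0.1):
        # Corrupt the input by randomly zeroing a fraction of gene measurements,
        # then reconstruct the clean profile from the hidden representation.
        mask = (torch.rand_like(x) > corruption).float()
        return self.decoder(self.encoder(x * mask))

model = DenoisingAutoencoder(n_genes, n_hidden)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

expression = torch.rand(200, n_genes)  # 200 samples, values scaled to [0, 1]
for epoch in range(10):
    reconstruction = model(expression)
    loss = loss_fn(reconstruction, expression)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Hidden-node weights can then be inspected: genes weighted heavily on a node
# form a candidate "signature" for downstream interpretation.
gene_weights = model.encoder[0].weight  # shape: (n_hidden, n_genes)
```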
Deep learning approaches are also being applied to gene expression prediction tasks. For example, a deep neural network with three hidden layers outperformed linear regression in inferring the expression of over 20,000 target genes based on a representative, well-connected set of about 1,000 landmark genes [@tag:Chen2016_gene_expr]. However, while the deep learning model outperformed existing algorithms in nearly every scenario, its overall performance was still poor. The paper was also limited by computational bottlenecks that required the target genes to be randomly split across two distinct models, each trained separately. It is unclear how much performance would have increased if not for computational restrictions.
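The underlying task is multi-output regression from landmark genes to target genes; a minimal sketch of such a network is shown below, with hypothetical layer widths and data standing in for the published architecture.

```python
# Minimal sketch of expression inference from landmark genes: a feed-forward
# network with three hidden layers mapping ~1,000 landmark genes to many
# target genes. Layer widths and data are illustrative assumptions.
import torch
import torch.nn as nn

n_landmark, n_target = 1000, 10000     # hypothetical half of the target genes

model = nn.Sequential(
    nn.Linear(n_landmark, 3000), nn.Tanh(),
    nn.Linear(3000, 3000), nn.Tanh(),
    nn.Linear(3000, 3000), nn.Tanh(),
    nn.Linear(3000, n_target),         # one regression output per target gene
)

landmark = torch.randn(64, n_landmark) # a batch of 64 expression profiles
target = torch.randn(64, n_target)
loss = nn.MSELoss()(model(landmark), target)
loss.backward()                        # gradients for an optimizer step
```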
Epigenetic data, combined with deep learning, may have sufficient explanatory power to infer gene expression. For instance, the DeepChrome CNN [@tag:Singh2016_deepchrome] improved prediction accuracy of high or low gene expression from histone modifications over existing methods. AttentiveChrome [@tag:Singh2017_attentivechrome] added a deep attention model to further enhance DeepChrome. Deep learning can also integrate different data types. For example, Liang et al. combined RBMs to integrate gene expression, DNA methylation, and miRNA data to define ovarian cancer subtypes [@tag:Liang2015_exprs_cancer]. While these approaches are promising, many convert gene expression measurements to categorical or binary variables, thus discarding the complex gene expression signatures encoded in intermediate and relative expression levels.
Deep learning applied to gene expression data is still in its infancy, but the future is bright. Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies. For example, the effects of cellular heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.
Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatiotemporal flexibility to generate multiple distinct proteins from a single gene. This remarkable complexity can lend itself to defects that underlie many diseases. For instance, splicing mutations in the lamin A (LMNA) gene can lead to specific variants of dilated cardiomyopathy and limb girdle muscular dystrophy [@tag:Scotti2016_missplicing]. A recent study found that quantitative trait loci that affect splicing in lymphoblastoid cell lines are enriched within risk loci for schizophrenia, multiple sclerosis, and other immune diseases, implicating mis-splicing as a more widespread feature of human pathologies than previously thought [@tag:Li2016_variation]. Therapeutic strategies that aim to modulate splicing are also currently being considered for disorders such as Duchenne muscular dystrophy and spinal muscular atrophy [@tag:Scotti2016_missplicing].
Sequencing studies routinely return thousands of unannotated variants, but which cause functional changes in splicing and how are those changes manifested? Prediction of a "splicing code" has been a goal of the field for the past decade. Initial machine learning approaches used a naïve Bayes model and a 2-layer Bayesian neural network with thousands of hand-derived sequence-based features to predict the probability of exon skipping [@tag:Barash2010_splicing_code; @tag:Xiong2011_bayesian]. With the advent of deep learning, more complex models provided better predictive accuracy [@tag:Xiong2015_splicing_code; @tag:Jha2017_integrative_models]. Importantly, these new approaches can take in multiple kinds of epigenomic measurements as well as tissue identity and RNA binding partners of splicing factors. Deep learning is critical in furthering these kinds of integrative studies where different data types and inputs interact in unpredictable (often nonlinear) ways to create higher-order features. Moreover, as in gene expression network analysis, interrogating the hidden nodes within neural networks could potentially illuminate important aspects of splicing behavior. For instance, tissue-specific splicing mechanisms could be inferred by training networks on splicing data from different tissues, then searching for common versus distinctive hidden nodes, a technique employed by Qin et al. for tissue-specific transcription factor (TF) binding predictions [@tag:Qin2017_onehot].
A parallel effort has been to use more data with simpler models. An exhaustive study using readouts of splicing for millions of synthetic intronic sequences uncovered motifs that influence the strength of alternative splice sites [@tag:Rosenberg2015_synthetic_seqs]. The authors built a simple linear model using hexamer motif frequencies that successfully generalized to exon skipping. In a limited analysis using single nucleotide polymorphisms (SNPs) from three genes, it predicted exon skipping with three times the accuracy of an existing deep learning-based framework [@tag:Xiong2015_splicing_code]. This case is instructive in that clever sources of data, not just more descriptive models, are still critical.
We already understand how mis-splicing of a single gene can cause diseases such as limb girdle muscular dystrophy. The challenge now is to uncover how genome-wide alternative splicing underlies complex, non-Mendelian diseases such as autism, schizophrenia, Type 1 diabetes, and multiple sclerosis [@tag:JuanMateu2016_t1d]. As a proof of concept, Xiong et al. [@tag:Xiong2015_splicing_code] sequenced five autism spectrum disorder and 12 control samples, each with an average of 42,000 rare variants, and identified mis-splicing in 19 genes with neural functions. Such methods may one day enable scientists and clinicians to rapidly profile thousands of unannotated variants for functional effects on splicing and nominate candidates for further investigation. Moreover, these nonlinear algorithms can deconvolve the effects of multiple variants on a single splice event without the need to perform combinatorial in vitro experiments. The ultimate goal is to predict an individual’s tissue-specific, exon-specific splicing patterns from their genome sequence and other measurements to enable a new branch of precision diagnostics that also stratifies patients and suggests targeted therapies to correct splicing defects. However, to achieve this we expect that methods to interpret the "black box" of deep neural networks and integrate diverse data sources will be required.
Transcription factors and RNA-binding proteins are key components in gene regulation and higher-level biological processes. TFs are regulatory proteins that bind to certain genomic loci and control the rate of mRNA production. While high-throughput sequencing techniques such as chromatin immunoprecipitation and massively parallel DNA sequencing (ChIP-seq) have been able to accurately identify targets for TFs, these experiments are both time consuming and expensive. Thus, there is a need to computationally predict binding sites and understand binding specificities de novo from sequence data. In this section we focus on TFs, with the understanding that deep learning methods for TFs are similar to those for RNA-binding proteins, though RNA-specific models do exist [@doi:10.1186/s12859-017-1561-8].
ChIP-seq and related technologies are able to identify highly likely binding sites for a certain TF, and databases such as ENCODE [@tag:Consortium2012_encode] have made freely available ChIP-seq data for hundreds of different TFs across many laboratories. In order to computationally predict transcription factor binding sites (TFBSs) on a DNA sequence, researchers initially used consensus sequences and position weight matrices to match against a test sequence [@tag:Stormo2000_dna]. Simple neural network classifiers were then proposed to differentiate positive and negative binding sites but did not show meaningful improvements over the weight matrix matching methods [@tag:Horton1992_assessment]. Later, support vector machines (SVMs) outperformed the generative methods by using k-mer features [@tag:Ghandi2014_enhanced; @tag:Setty2015_seqgl], but string kernel-based SVM systems are limited by their expensive computational cost, which is proportional to the number of training and testing sequences.
With the advent of deep learning, Alipanahi et al. [@tag:Alipanahi2015_predicting] showed that convolutional neural network models could achieve state-of-the-art results on the TFBS task and are scalable to a large number of genomic sequences. Lanchantin et al. [@tag:Lanchantin2016_motif] introduced several new convolutional and recurrent neural network models that further improved TFBS predictive accuracy. Due to the motif-driven nature of the TFBS task, most architectures have been convolution-based [@tag:Zeng2016_convolutional]. While many models for TFBS prediction resemble those used in computer vision and NLP tasks, it is important to note that DNA sequence tasks are fundamentally different, and models should be adapted accordingly. For example, motifs may appear in either strand of a DNA sequence, resulting in two different forms of the motif (forward and reverse complement) due to complementary base pairing. To handle this issue, specialized reverse complement convolutional models share parameters to find motifs in both directions [@tag:Shrikumar2017_reversecomplement].
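The sketch below illustrates one way to implement this kind of reverse-complement parameter sharing: the same convolutional filters scan both the forward sequence and its reverse complement, and the stronger response at each position is kept. The filter dimensions and max-based combination are illustrative assumptions, not the exact scheme of the cited model.

```python
# Sketch of reverse-complement parameter sharing for a convolutional TFBS
# model: the same filters scan the forward sequence and its reverse
# complement, and the two responses are combined (here by elementwise max).
import torch
import torch.nn as nn

class RCSharedConv(nn.Module):
    def __init__(self, n_filters=16, width=15):
        super().__init__()
        # Input: one-hot DNA of shape (batch, 4, length) with channels A,C,G,T
        self.conv = nn.Conv1d(4, n_filters, kernel_size=width, padding=width // 2)

    def forward(self, x):
        fwd = self.conv(x)
        # Reverse complement: flip the sequence axis and swap A<->T, C<->G,
        # which is exactly reversing the channel order when channels are A,C,G,T.
        rc = self.conv(torch.flip(x, dims=[1, 2]))
        # Flip the reverse-strand response back so positions align, then keep
        # the stronger match at each position from either strand.
        return torch.maximum(fwd, torch.flip(rc, dims=[2]))

one_hot = torch.zeros(1, 4, 100)
one_hot[0, 0, :] = 1.0                  # a toy all-A sequence
scores = RCSharedConv()(one_hot)        # shape: (1, 16, 100)
```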
Despite these advances, several challenges remain. First, because the inputs (ChIP-seq measurements) are continuous and most current algorithms are designed to produce binary outputs (whether or not there is TF binding at a particular site), false positives or false negatives can result depending on the threshold chosen by the algorithm. Second, most methods predict binding of TFs at sites in isolation, whereas in reality multiple TFs may compete for binding at a single site or act synergistically to co-occupy it. Fortunately, multi-task models are rapidly improving at simultaneous prediction of many TFs' binding at any given site [@tag:Zhou2015_deep_sea]. Third, it is unclear exactly how to define a non-binding or "negative" site in the training data because the number of positive binding sites of a particular TF is relatively small with respect to the total number of base-pairs in a genome (see Discussion).
While deep learning-based models can automatically extract features for TFBS prediction at the sequence level, they often cannot predict binding patterns for cell types or species that have not been previously studied (i.e. for which no labeled data exist).
We often have experimental TF binding data for one cell type (e.g. normal) but not for the cell type of interest (e.g. cancer). There are two options for handling this issue. The first is to use features, such as epigenetic marks, that are specific to the cell type of interest. When such features are available, this is probably the best route to accurate prediction because they provide strong signals for that particular cell type and TF. The second option is domain adaptation, a type of transfer learning in which a model learned from features in one domain (e.g. normal) is transferred to make predictions from features in another domain (e.g. cancer). Domain adaptation has been applied in many other areas such as sentiment analysis [@url:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.231.3442], for example training on book reviews and predicting on movie reviews, which are similar tasks in different contexts. TFImpute [@tag:Qin2017_onehot] predicts binding in new cell type-TF pairs, but the cell types must appear in the training set for other TFs. This is a step in the right direction, but a more general model that transfers across cell types would be more useful. The main obstacle to cross cell type prediction via domain adaptation on sequence data is that the sequences in both cell types are essentially identical, so it is necessary to also incorporate cell type-specific data.
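The simplest flavor of this kind of transfer is sketched below: convolutional layers trained on the data-rich cell type are reused and frozen, cell type-specific inputs (here, a hypothetical chromatin accessibility score) are appended, and only the final classifier is retrained on the limited data from the cell type of interest. This is an illustrative simplification, not a published domain adaptation method.

```python
# Sketch of a simple transfer strategy for a new cell type: reuse frozen
# convolutional sequence features learned on a data-rich cell type, append a
# hypothetical cell type-specific signal, and retrain only the classifier.
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    def __init__(self, n_extra_features=0):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32 + n_extra_features, 1)

    def forward(self, seq, extra=None):
        h = self.features(seq)
        if extra is not None:
            h = torch.cat([h, extra], dim=1)
        return self.classifier(h)

source = SeqModel()                    # assume this was trained on cell type A
target = SeqModel(n_extra_features=1)  # cell type B, plus one accessibility score
target.features.load_state_dict(source.features.state_dict())
for p in target.features.parameters():
    p.requires_grad = False            # freeze the shared sequence features
# Only target.classifier is now trained on the limited cell type B labels.
```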
Similar to the cross cell type case, we may have TF binding data for one species (e.g. mouse) but not for the species of interest (e.g. human). When predicting across species, domain adaptation may be used to reduce overfitting to differences in the distribution of genomic sequence features between the species. However, since only a small fraction of the genome differs between human and mouse, it is not clear how well domain adaptation would work in this case.
Deep learning can also illustrate TF binding preferences. Lanchantin et al. [@tag:Lanchantin2016_motif] and Shrikumar et al. [@tag:Shrikumar2017_learning] developed tools to visualize TF motifs learned from TFBS classification tasks. Alipanahi et al. [@tag:Alipanahi2015_predicting] also introduced mutation maps, where they could easily mutate, add, or delete base pairs in a sequence and see how the model changed its prediction. Though time consuming to assay in a lab, this was easy to simulate with a computational model. As we learn to better visualize and analyze the hidden nodes within deep learning models, our understanding of TF binding motifs and dynamics will likely improve.
Multiple TFs act in concert to coordinate changes in gene regulation at the genomic regions known as promoters and enhancers. Each gene has an upstream promoter, essential for initiating that gene's transcription. The gene may also interact with multiple enhancers, which can amplify transcription in particular cellular contexts. These contexts include different cell types in development or environmental stresses.
Promoters and enhancers provide a nexus where clusters of TFs and binding sites mediate downstream gene regulation, starting with transcription. The gold standard to identify an active promoter or enhancer requires demonstrating its ability to affect transcription or other downstream gene products. Even extensive biochemical TF binding data has thus far proven insufficient on its own to accurately and comprehensively locate promoters and enhancers. We lack sufficient understanding of these elements to derive a mechanistic "promoter code" or "enhancer code". But extensive labeled data on promoters and enhancers lends itself to probabilistic classification. The complex interplay of TFs and chromatin leading to the emergent properties of promoter and enhancer activity seems particularly apt for representation by deep neural networks.
Despite decades of work, computational identification of promoters remains a stubborn problem [@doi:10.1093/bib/4.1.22]. Researchers have used neural networks for promoter recognition as early as 1996 [@tag:matis]. Recently, a CNN recognized promoter sequences with sensitivity and specificity exceeding 90% [@doi:10.1371/journal.pone.0171410]. Most activity in computational prediction of regulatory regions, however, has moved to enhancer identification. Because one can identify promoters with straightforward biochemical assays [@doi:10.1073/pnas.2136655100; @doi:10.1101/gr.110254.110], the direct rewards of promoter prediction alone have decreased. But the reliable ground truth provided by these assays makes promoter identification an appealing test bed for deep learning approaches that can also identify enhancers.
Recognizing enhancers presents additional challenges. Enhancers may be up to 1,000,000 bp away from the affected promoter, and even within introns of other genes [@doi:10.1038/nrg3458]. Enhancers do not necessarily operate on the nearest gene and may affect multiple genes. Their activity is frequently tissue- or context-specific. No biochemical assay can reliably identify all enhancers. Distinguishing them from other regulatory elements remains difficult, and some believe the distinction somewhat artificial [@doi:10.1016/j.tig.2015.05.007]. While these factors make the enhancer identification problem more difficult, they also make a solution more valuable.
Several neural network approaches yielded promising results in enhancer prediction. Both Basset [@doi:10.1101/gr.200535.115] and DeepEnhancer [@tag:Min2016_deepenhancer] used CNNs to predict enhancers. DECRES used a feed-forward neural network [@doi:10.1101/041616] to distinguish between different kinds of regulatory elements, such as active enhancers and promoters. DECRES had difficulty distinguishing between inactive enhancers and promoters. Its authors also investigated the power of sequence features to drive classification, finding that beyond CpG islands, few were useful.
Comparing the performance of enhancer prediction methods illustrates the problems in using metrics created with different benchmarking procedures. Both the Basset and DeepEnhancer studies include comparisons to a baseline SVM approach, gkm-SVM [@doi:10.1371/journal.pcbi.1003711]. The Basset study reports gkm-SVM attains a mean auPRC of 0.322 over 164 cell types [@doi:10.1101/gr.200535.115]. The DeepEnhancer study reports for gkm-SVM a dramatically different auPRC of 0.899 on nine cell types [@tag:Min2016_deepenhancer]. This large difference means it's impossible to directly compare the performance of Basset and DeepEnhancer based solely on their reported metrics. DECRES used a different set of metrics altogether. To drive further progress in enhancer identification, we must develop a common and comparable benchmarking procedure (see Discussion).
In addition to the location of enhancers, identifying enhancer-promoter interactions in three-dimensional space will provide critical knowledge for understanding transcriptional regulation. SPEID used a CNN to predict these interactions with only sequence and the location of putative enhancers and promoters along a one-dimensional chromosome [@doi:10.1101/085241]. It compared well to other methods using a full complement of biochemical data from ChIP-seq and other epigenomic methods. Of course, the putative enhancers and promoters used were themselves derived from epigenomic methods. But one could easily replace them with the output of one of the enhancer or promoter prediction methods above.
Prediction of microRNAs (miRNAs) and miRNA targets is of great interest, as they are critical components of gene regulatory networks and are often conserved across great evolutionary distance [@tag:Bracken2016_mirna; @tag:Berezikov2011_mirna]. While many machine learning algorithms have been applied to these tasks, they currently require extensive feature selection and optimization. For instance, one of the most widely adopted tools for miRNA target prediction, TargetScan, trained multiple linear regression models on 14 hand-curated features including structural accessibility of the target site on the mRNA, the degree of site conservation, and predicted thermodynamic stability of the miRNA-mRNA complex [@tag:Agarwal2015_targetscan]. Some of these features, including structural accessibility, are imperfect or empirically derived. In addition, current algorithms suffer from low specificity [@tag:Lee2016_deeptarget].
As in other applications, deep learning promises to achieve equal or better performance in predictive tasks by automatically engineering complex features to minimize an objective function. Two recently published tools use different recurrent neural network-based architectures to perform miRNA and target prediction with solely sequence data as input [@tag:Park2016_deepmirgene; @tag:Lee2016_deeptarget]. Though the results are preliminary and still based on a validation set rather than a completely independent test set, they were able to predict microRNA target sites with higher specificity and sensitivity than TargetScan. Excitingly, these tools seem to show that RNNs can accurately align sequences and predict bulges, mismatches, and wobble base pairing without requiring the user to input secondary structure predictions or thermodynamic calculations. Further incremental advances in deep learning for miRNA and target prediction will likely be sufficient to meet the current needs of systems biologists and other researchers who use prediction tools mainly to nominate candidates that are then tested experimentally.
Proteins play fundamental roles in almost all biological processes, and understanding their structure is critical for basic biology and drug development. UniProt currently has about 94 million protein sequences, yet fewer than 100,000 proteins across all species have experimentally solved structures in the Protein Data Bank (PDB). As a result, computational structure prediction is essential for a majority of proteins. However, this is very challenging, especially when similar solved structures, called templates, are not available in PDB. Over the past several decades, many computational methods have been developed to predict aspects of protein structure such as secondary structure, torsion angles, solvent accessibility, inter-residue contact maps, disorder regions, and side-chain packing. In recent years, multiple deep learning architectures have been applied, including deep belief networks, LSTMs, CNNs, and deep convolutional neural fields (DeepCNFs) [@doi:10.1007/978-3-319-46227-1_1; @doi:10.1038/srep18962].
Here we focus on deep learning methods for two representative sub-problems: secondary structure prediction and contact map prediction. Secondary structure refers to local conformation of a sequence segment, while a contact map contains information on all residue-residue contacts. Secondary structure prediction is a basic problem and an almost essential module of any protein structure prediction package. Contact prediction is much more challenging than secondary structure prediction, but it has a much larger impact on tertiary structure prediction. In recent years, the accuracy of contact prediction has greatly improved [@doi:10.1371/journal.pcbi.1005324; @doi:10.1093/bioinformatics/btu791; @doi:10.1073/pnas.0805923106; @doi:10.1371/journal.pone.0028766].
One can represent protein secondary structure with three different states (alpha helix, beta strand, and loop regions) or eight finer-grained states. Accuracy of a three-state prediction is called Q3, and accuracy of an 8-state prediction is called Q8. Several groups [@doi:10.1371/journal.pone.0032235; @doi:10.1109/TCBB.2014.2343960; @doi:10.1038/srep11476] applied deep learning to protein secondary structure prediction but were unable to achieve significant improvement over the de facto standard method PSIPRED [@doi:10.1006/jmbi.1999.3091], which uses two shallow feedforward neural networks. In 2014, Zhou and Troyanskaya demonstrated that they could improve Q8 accuracy by using a deep supervised and convolutional generative stochastic network [@arxiv:1403.1347]. In 2016 Wang et al. developed a DeepCNF model that improved Q3 and Q8 accuracy as well as prediction of solvent accessibility and disorder regions [@doi:10.1038/srep18962; @doi:10.1007/978-3-319-46227-1_1]. DeepCNF achieved a higher Q3 accuracy than the standard maintained by PSIPRED for more than 10 years. This improvement may be mainly due to the ability of convolutional neural fields to capture long-range sequential information, which is important for beta strand prediction. Nevertheless, the improvements in secondary structure prediction from DeepCNF are unlikely to result in a commensurate improvement in tertiary structure prediction since secondary structure mainly reflects coarse-grained local conformation of a protein structure.
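The Q3 metric itself is simply the fraction of residues assigned the correct three-state label, as in the small example below; Q8 is computed identically over eight states.

```python
# Worked example of the Q3 metric: the fraction of residues whose predicted
# three-state secondary structure (H = helix, E = strand, C = coil/loop)
# matches the observed state.
def q3(predicted, observed):
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

print(q3("HHHHCCEEEE", "HHHCCCEEEE"))  # 0.9: nine of ten residues correct
```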
Protein contact prediction and contact-assisted folding (i.e. folding proteins using predicted contacts as restraints) represents a promising new direction for ab initio folding of proteins without good templates in PDB. Co-evolution analysis is effective for proteins with a very large number (>1000) of sequence homologs [@doi:10.1371/journal.pone.0028766], but fares poorly for proteins without many sequence homologs. By combining co-evolution information with a few other protein features, shallow neural network methods such as MetaPSICOV [@doi:10.1093/bioinformatics/btu791] and CoinDCA-NN [@doi:10.1093/bioinformatics/btv472] have shown some advantage over pure co-evolution analysis for proteins with few sequence homologs, but their accuracy is still far from satisfactory. In recent years, deeper architectures have been explored for contact prediction, such as CMAPpro [@doi:10.1093/bioinformatics/bts475], DNCON [@doi:10.1093/bioinformatics/bts598] and PConsC [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in the well-known CASP competitions, these methods did not show any advantage over MetaPSICOV [@doi:10.1093/bioinformatics/btu791].
Recently, Wang et al. proposed the deep learning method RaptorX-Contact [@doi:10.1371/journal.pcbi.1005324], which significantly improves contact prediction over MetaPSICOV and pure co-evolution methods, especially for proteins without many sequence homologs. It employs a network architecture formed by one 1D residual neural network and one 2D residual neural network. Blindly tested in the latest CASP competition (i.e. CASP12 [@url:http://www.predictioncenter.org/casp12/rrc_avrg_results.cgi]), RaptorX-Contact ranked first in F1 score on free-modeling targets as well as the whole set of targets. In CAMEO (which can be interpreted as a fully-automated CASP) [@url:http://www.cameo3d.org/], its predicted contacts were also able to fold proteins with a novel fold and only 65-330 sequence homologs. This technique also worked well on membrane proteins even when trained on non-membrane proteins [@arxiv:1704.07207]. RaptorX-Contact performed better mainly due to introduction of residual neural networks and exploitation of contact occurrence patterns by simultaneously predicting all the contacts in a single protein.
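One generic way to couple a 1D and a 2D network for this task is sketched below: sequential features are expanded into pairwise features by concatenating the feature vectors of every residue pair and stacking them with co-evolution couplings, after which a 2D network scores each pair. The block uses plain convolutions and hypothetical feature sizes for brevity; the published RaptorX-Contact architecture uses residual blocks and its own pairwise feature construction.

```python
# Generic sketch of 1D-to-2D feature coupling for contact map prediction.
# Networks, feature sizes, and inputs are illustrative stand-ins.
import torch
import torch.nn as nn

L, d1, d_pair = 120, 8, 4                     # hypothetical length and feature sizes

net_1d = nn.Conv1d(20, d1, kernel_size=5, padding=2)      # stand-in for a 1D ResNet
net_2d = nn.Sequential(                                    # stand-in for a 2D ResNet
    nn.Conv2d(2 * d1 + d_pair, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)

seq_features = torch.randn(1, 20, L)          # e.g. a position-specific profile
pair_features = torch.randn(1, d_pair, L, L)  # e.g. co-evolution couplings

h = net_1d(seq_features)                      # (1, d1, L)
row = h.unsqueeze(3).expand(-1, -1, -1, L)    # features of residue i, broadcast over j
col = h.unsqueeze(2).expand(-1, -1, L, -1)    # features of residue j, broadcast over i
pairwise = torch.cat([row, col, pair_features], dim=1)
contact_logits = net_2d(pairwise)             # (1, 1, L, L) contact map scores
```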
Taken together, ab initio folding is becoming much easier with the advent of direct evolutionary coupling analysis and deep learning techniques. We expect further improvements in contact prediction for proteins with fewer than 1000 homologs by studying new deep network architectures. However, it is unclear if there is an effective way to use deep learning to improve prediction for proteins with few or no sequence homologs. Finally, the deep learning methods summarized above also apply to interfacial contact prediction for protein complexes but may be less effective since on average protein complexes have fewer sequence homologs.
Complementing computational prediction approaches, cryo-electron microscopy (cryo-EM) allows near-atomic resolution determination of protein models by comparing individual electron micrographs [@doi:10.1016/j.cell.2015.03.049].
Detailed structures require tens of thousands of protein images [@doi:10.1016/j.cell.2015.03.050].
Technological development has increased the throughput of image capture.
New hardware, such as direct electron detectors, has made large-scale image production practical, while new software has focused on rapid, automated image processing.
Some components of cryo-EM image processing remain difficult to automate. For instance, in particle picking, micrographs are scanned to identify individual molecular images that will be used in structure refinement. In typical applications, hundreds of thousands of particles are necessary to determine a structure to near atomic resolution, making manual selection impractical [@doi:10.1016/j.cell.2015.03.050]. Typical selection approaches are semi-supervised; a user will select several particles manually, and these selections will be used to train a classifier [@doi:10.1016/j.jsb.2006.04.006; @doi:10.1016/j.jsb.2014.11.010]. Now CNNs are being used to select particles in tools like DeepPicker [@doi:10.1016/j.jsb.2016.07.006] and DeepEM [@doi:10.1186/s12859-017-1757-y]. In addition to addressing shortcomings from manual selection, such as selection bias and poor discrimination of low-contrast images, these approaches also provide a means of full automation. DeepPicker can be trained by reference particles from other experiments with structurally unrelated macromolecules, allowing for fully automated application to new samples.
Downstream of particle picking, deep learning is being applied to other aspects of cryo-EM image processing. Statistical manifold learning has been implemented in the software package ROME to classify selected particles and elucidate the different conformations of the subject molecule necessary for accurate 3D structures [@doi:10.1371/journal.pone.0182130]. These recent tools highlight the general applicability of deep learning approaches for image processing to increase the throughput of high-resolution cryo-EM.
Protein-protein interactions (PPIs) are highly specific and non-accidental physical contacts between proteins, which occur for purposes other than generic protein production or degradation [@doi:10.1371/journal.pcbi.1000807]. Abundant interaction data have been generated in-part thanks to advances in high-throughput screening methods, such as yeast two-hybrid and affinity-purification with mass spectrometry. However, because many PPIs are transient or dependent on biological context, high-throughput methods can fail to capture a number of interactions. The imperfections and costs associated with many experimental PPI screening methods have motivated an interest in high-throughput computational prediction.
Many machine learning approaches to PPI have focused on text mining the literature [@doi:10.1016/j.jbi.2007.11.008; @arxiv:1706.01556v2], but these approaches can fail to capture context-specific interactions, motivating de novo PPI prediction. Early de novo prediction approaches used a variety of statistical and machine learning tools on structural and sequential data, sometimes with reference to the existing body of protein structure knowledge. In the context of PPIs -- as in other domains -- deep learning shows promise both for exceeding current predictive performance and for circumventing limitations from which other approaches suffer.
One of the key difficulties in applying deep learning techniques to binding prediction is the task of representing peptide and protein sequences in a meaningful way. DeepPPI [@doi:10.1021/acs.jcim.7b00028] made PPI predictions from a set of sequence and composition protein descriptors using a two-stage deep neural network that trained two subnetworks for each protein and combined them into a single network. Sun et al. [@doi:10.1186/s12859-017-1700-2] applied autocovariances, a coding scheme that returns uniform-size vectors describing the covariance between physicochemical properties of the protein sequence at various positions. Wang et al. [@doi:10.1039/C7MB00188F] used deep learning as an intermediate step in PPI prediction. They examined protein sequences of 70 amino acids, extracting 1,260 features from each. A stacked sparse autoencoder with two hidden layers was then used to reduce feature dimensions and noisiness before a novel type of classification vector machine made PPI predictions.
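As an illustration of the autocovariance idea, the sketch below encodes a protein sequence of arbitrary length as a fixed-length vector of lagged covariances for a single physicochemical property. Published schemes combine several property scales and a range of lags, so the hydrophobicity scale and lag count here are simplifying assumptions.

```python
# Sketch of autocovariance encoding: for each lag g, compute the covariance
# between property values at positions i and i + g, giving a fixed-length
# vector regardless of sequence length.
import numpy as np

HYDROPHOBICITY = {                      # a simplified Kyte-Doolittle-style scale
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def autocovariance(sequence, max_lag=30):
    values = np.array([HYDROPHOBICITY[aa] for aa in sequence])
    centered = values - values.mean()
    return np.array([
        np.mean(centered[:-g] * centered[g:])   # covariance at lag g
        for g in range(1, max_lag + 1)
    ])

features = autocovariance("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
print(features.shape)                   # (30,): one value per lag
```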
Beyond predicting whether or not two proteins interact, Du et al. [@doi:10.1016/j.ymeth.2016.06.001] employed a deep learning approach to predict the residue contacts between two interacting proteins. Using features that describe how similar a protein's residue is relative to similar proteins at the same position, the authors extracted uniform-length features for each residue in the protein sequence. A stacked autoencoder took two such vectors as input for the prediction of contact between two residues. The authors evaluated the performance of this method with several classifiers and showed that a deep neural network classifier paired with the stacked autoencoder significantly exceeded classical machine learning accuracy.
Because many studies used predefined higher-level features, one of the benefits of deep learning -- automatic feature extraction -- is not fully leveraged. More work is needed to determine the best ways to represent raw protein sequence information so that the full benefits of deep learning as an automatic feature extractor can be realized.
An important type of PPI involves the immune system's ability to recognize the body's own cells. The major histocompatibility complex (MHC) plays a key role in regulating this process by binding antigens and displaying them on the cell surface to be recognized by T cells. Due to its importance in immunity and immune response, peptide-MHC binding prediction is an important problem in computational biology, and one that must account for the allelic diversity of the MHC-encoding gene region.
Shallow, feed-forward neural networks are competitive methods and have made progress toward pan-allele and pan-length peptide representations. Sequence alignment techniques are useful for representing variable-length peptides as uniform-length features [@doi:10.1110/ps.0239403; @doi:10.1093/bioinformatics/btv639]. For pan-allelic prediction, NetMHCpan [@doi:10.1007/s00251-008-0341-z; @doi:10.1186/s13073-016-0288-x] used a pseudo-sequence representation of the MHC class I molecule, which included only polymorphic peptide contact residues. The sequences of the peptide and MHC were then represented using both sparse vector encoding and Blosum encoding, in which amino acids are encoded by matrix score vectors. MHCflurry [@doi:10.1101/174243] is comparable to the NetMHC tools and shows superior performance on peptides of lengths other than nine. MHCflurry adds placeholder amino acids to transform variable-length peptides to length 15 peptides. In training the MHCflurry feed-forward neural network [@doi:10.1101/054775], the authors imputed missing MHC-peptide binding affinities using a Gibbs sampling method, showing that imputation improves performance for datasets with roughly 100 or fewer training examples. MHCflurry's imputation method increases its performance on poorly characterized alleles, making it competitive with NetMHCpan for this task. Kuksa et al. [@doi:10.1093/bioinformatics/btv371] developed a shallow, higher-order neural network (HONN) comprised of both mean and covariance hidden units to capture some of the higher-order dependencies between amino acid locations. Pretraining this HONN with a semi-restricted Boltzmann machine, the authors found that the performance of the HONN exceeded that of a simple DNN, as well as that of NetMHC.
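The fixed-length representation problem these tools address can be illustrated with a small sketch: a peptide shorter than 15 residues is padded with a placeholder symbol and then encoded as a uniform-length numeric vector. The mid-peptide placement of the placeholder and the use of one-hot rather than Blosum encoding are simplifying assumptions, not the exact MHCflurry scheme.

```python
# Sketch of representing variable-length peptides as fixed-length inputs for
# an MHC binding model: pad to 15 residues with a placeholder symbol, then
# one-hot encode. Placement rule and encoding are illustrative assumptions.
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"        # 20 amino acids plus placeholder X

def pad_to_15(peptide):
    pad = 15 - len(peptide)
    middle = len(peptide) // 2
    return peptide[:middle] + "X" * pad + peptide[middle:]

def one_hot(peptide):
    encoding = np.zeros((15, len(ALPHABET)))
    for position, residue in enumerate(pad_to_15(peptide)):
        encoding[position, ALPHABET.index(residue)] = 1.0
    return encoding.flatten()             # fixed-length input vector

print(one_hot("SIINFEKL").shape)          # an 8-mer becomes a (315,) vector
```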
Deep learning's unique flexibility was recently leveraged by Bhattacharya et al. [@doi:10.1101/154757], who used a gated RNN method called MHCnuggets to overcome the difficulty of variable-length peptides. Under this framework, they used smoothed sparse encoding to represent amino acids individually. Because MHCnuggets had to be trained for every MHC allele, performance was far better for alleles with abundant, balanced training data. Vang et al. [@doi:10.1093/bioinformatics/btx264] developed HLA-CNN, a method which maps amino acids onto a 15-dimensional vector space based on their context relation to other amino acids before making predictions with a CNN. In a comparison of several current methods, Bhattacharya et al. found that the top methods -- NetMHC, NetMHCpan, MHCflurry, and MHCnuggets -- showed comparable performance, but large differences in speed. Convolutional neural networks (in this case, HLA-CNN) showed comparatively poor performance, while shallow and recurrent neural networks performed the best. They found that MHCnuggets -- the recurrent neural network -- was by far the fastest to train among the top performing methods.
Because interacting proteins are more likely to share a similar function, the connectivity of a PPI network itself can be a valuable information source for the prediction of protein function [@doi:10.1038/msb4100129]. To incorporate higher-order network information, it is necessary to find a lower-level embedding of network structure that preserves this higher-order structure. Rather than use hand-crafted network features, deep learning shows promise for the automatic discovery of predictive features within networks. For example, Navlakha [@doi:10.1162/NECO_a_00924] showed that a deep autoencoder was able to compress a graph to 40% of its original size, while being able to reconstruct 93% of the original graph's edges, improving upon standard dimension reduction methods. To achieve this, each graph was represented as an adjacency matrix with rows sorted in descending node degree order, then flattened into a vector and given as input to the autoencoder. While the activity of some hidden layers correlated with several popular hand-crafted network features such as k-core size and graph density, this work showed that deep learning can effectively reduce graph dimensionality while retaining much of its structural information.
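The preprocessing described above is straightforward to sketch: the adjacency matrix is reordered by descending node degree and flattened into a vector suitable for a standard autoencoder. Reordering columns along with rows, to keep the matrix symmetric, is an assumption made here for illustration.

```python
# Sketch of the graph-to-vector preparation: reorder the adjacency matrix so
# that rows (and columns) follow descending node degree, then flatten it into
# a single vector that a standard autoencoder can compress.
import numpy as np

def graph_to_vector(adjacency):
    degrees = adjacency.sum(axis=1)
    order = np.argsort(-degrees)              # nodes in descending degree order
    reordered = adjacency[np.ix_(order, order)]
    return reordered.flatten()

# A toy 4-node undirected graph (edges: 0-1, 0-2, 0-3, 1-2).
adjacency = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
vector = graph_to_vector(adjacency)           # length-16 autoencoder input
```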
An important challenge in PPI network prediction is the task of combining different networks and types of networks. Gligorijevic et al. [@doi:10.1101/223339] developed a multimodal deep autoencoder, deepNF, to find a feature representation common among several different PPI networks. This common lower-level representation allows for the combination of various PPI data sources towards a single predictive task. An SVM classifier trained on the compressed features from the middle layer of the autoencoder outperformed previous methods in predicting protein function.
Hamilton et al. addressed the issue of large, heterogeneous, and changing networks with an inductive approach called GraphSAGE [@arxiv:1706.02216v2]. By finding node embeddings through learned aggregator functions that describe the node and its neighbors in the network, the GraphSAGE approach allows for the generalization of the model to new graphs. In a classification task for the prediction of protein function, Chen and Zhu [@arxiv:1710.10568v1] optimized this approach and enhanced the graph convolutional network with a preprocessing step that uses an approximation to the dropout operation. This preprocessing effectively reduces the number of graph convolutional layers and it significantly improves both training time and prediction accuracy.
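The core of this inductive approach can be sketched with a single mean-aggregator layer: each node's embedding is a learned function of its own features and the average of its neighbors' features, so the same weights apply to nodes and graphs unseen during training. Neighborhood sampling, layer stacking, and the preprocessing step of Chen and Zhu are omitted for brevity, and the dimensions are illustrative.

```python
# Sketch of a single GraphSAGE-style mean-aggregator layer.
import torch
import torch.nn as nn

class MeanAggregatorLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, features, adjacency):
        # Mean of neighbor features for every node (rows of the adjacency matrix).
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = adjacency @ features / degree
        combined = torch.cat([features, neighbor_mean], dim=1)
        return torch.relu(self.linear(combined))

n_nodes, in_dim = 5, 8
features = torch.randn(n_nodes, in_dim)
adjacency = (torch.rand(n_nodes, n_nodes) > 0.5).float()     # toy binary graph
embeddings = MeanAggregatorLayer(in_dim, 16)(features, adjacency)   # (5, 16)
```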
A field poised for dramatic revolution by deep learning is bioimage analysis. Thus far, the primary use of deep learning for biological images has been for segmentation -- that is, for the identification of biologically relevant structures in images such as nuclei, infected cells, or vasculature -- in fluorescence or even brightfield channels [@doi:10.1371/journal.pcbi.1005177]. Once so-called regions of interest have been identified, it is often straightforward to measure biological properties of interest, such as fluorescence intensities, textures, and sizes. Given the dramatic successes of deep learning in biological imaging, we simply refer to articles that review recent advancements [@doi:10.3109/10409238.2015.1135868; @doi:10.1371/journal.pcbi.1005177; @doi:10.1007/978-3-319-24574-4_28]. For deep learning to become commonplace for biological image segmentation, we need user-friendly tools.
We anticipate an additional paradigm shift in bioimaging that will be brought about by deep learning: what if images of biological samples, from simple cell cultures to three-dimensional organoids and tissue samples, could be mined for much more extensive biologically meaningful information than is currently standard? For example, a recent study demonstrated the ability to predict lineage fate in hematopoietic cells up to three generations in advance of differentiation [@doi:10.1038/nmeth.4182]. In biomedical research, most often biologists decide in advance what feature to measure in images from their assay system. Although classical methods of segmentation and feature extraction can produce hundreds of metrics per cell in an image, deep learning is unconstrained by human intuition and can in theory extract more subtle features through its hidden nodes. Already, there is evidence deep learning can surpass the efficacy of classical methods [@doi:10.1101/081364], even using generic deep convolutional networks trained on natural images [@doi:10.1101/085118], known as transfer learning. Recent work by Johnson et al. [@tag:Johnson2017_integ_cell] demonstrated how the use of a conditional adversarial autoencoder allows for a probabilistic interpretation of cell and nuclear morphology and structure localization from fluorescence images. The proposed model is able to generalize well to a wide range of subcellular localizations. The generative nature of the model allows it to produce high-quality synthetic images predicting localization of subcellular structures by directly modeling the localization of fluorescent labels. Notably, this approach reduces the modeling time by omitting the subcellular structure segmentation step.
The impact of further improvements on biomedicine could be enormous. Comparing cell population morphologies using conventional methods of segmentation and feature extraction has already proven useful for functionally annotating genes and alleles, identifying the cellular target of small molecules, and identifying disease-specific phenotypes suitable for drug screening [@doi:10.1016/j.copbio.2016.04.003; @doi:10.1002/cyto.a.22909; @doi:10.1083/jcb.201610026]. Deep learning would bring to these new kinds of experiments -- known as image-based profiling or morphological profiling -- a higher degree of accuracy, stemming from the freedom from human-tuned feature extraction strategies.
Single-cell methods are generating excitement as biologists characterize the vast heterogeneity within unicellular species and between cells of the same tissue type in the same organism [@tag:Gawad2016_singlecell]. For instance, tumor cells and neurons can both harbor extensive somatic variation [@tag:Lodato2015_neurons]. Understanding single-cell diversity in all its dimensions -- genetic, epigenetic, transcriptomic, proteomic, morphologic, and metabolic -- is key if treatments are to be targeted not only to a specific individual, but also to specific pathological subsets of cells. Single-cell methods also promise to uncover a wealth of new biological knowledge. A sufficiently large population of single cells will have enough representative "snapshots" to recreate timelines of dynamic biological processes. If tracking processes over time is not the limiting factor, single-cell techniques can provide maximal resolution compared to averaging across all cells in bulk tissue, enabling the study of transcriptional bursting with single-cell fluorescence in situ hybridization or the heterogeneity of epigenetic patterns with single-cell Hi-C or ATAC-seq [@tag:Liu2016_sc_transcriptome; @tag:Vera2016_sc_analysis]. Joint profiling of single-cell epigenetic and transcriptional states provides unprecedented views of regulatory processes [@doi:10.1101/138685].
However, large challenges exist in studying single cells. Relatively few cells can be assayed at once using current droplet, imaging, or microwell technologies, and low-abundance molecules or modifications may not be detected by chance due to a phenomenon known as dropout, not to be confused with the dropout layer of deep learning. To solve this problem, Angermueller et al. [@tag:Angermueller2016_single_methyl] trained a neural network to predict the presence or absence of methylation of a specific CpG site in single cells based on surrounding methylation signal and underlying DNA sequence, achieving several percentage points of improvement compared to random forests or deep networks trained only on CpG or sequence information. Similar deep learning methods have been applied to impute low-resolution ChIP-seq signal from bulk tissue with great success, and they could easily be adapted to single-cell data [@tag:Qin2017_onehot; @tag:Koh2016_denoising]. Deep learning has also been useful for dealing with batch effects [@tag:Shaham2016_batch_effects].
Examining populations of single cells can reveal biologically meaningful subsets of cells as well as their underlying gene regulatory networks [@tag:Gaublomme2015_th17]. Unfortunately, machine learning methods generally struggle with imbalanced data -- when there are many more examples of class 1 than class 2 -- because prediction accuracy is usually evaluated over the entire dataset. To tackle this challenge, Arvaniti et al. [@tag:Arvaniti2016_rare_subsets] classified healthy and cancer cells expressing 25 markers by using the most discriminative filters from a CNN trained on the data as a linear classifier. They achieved impressive performance, even for cell types where the subset percentage ranged from 0.1 to 1%, significantly outperforming logistic regression and distance-based outlier detection methods. However, they did not benchmark against random forests, which tend to work better for imbalanced data, and their data was relatively low dimensional.
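A simplified rendering of this filter-based strategy is sketched below: a small set of learned filters scores every cell from its 25 marker intensities, the per-cell responses are pooled so the network can be trained against sample-level labels, and the learned filters are then reused to score individual cells. The filter count and mean pooling are illustrative assumptions rather than the published configuration.

```python
# Simplified sketch of filter-based rare-subset detection: per-cell filter
# responses pooled to a sample-level prediction, with the learned filter
# weights reusable as a per-cell linear classifier.
import torch
import torch.nn as nn

n_markers, n_filters = 25, 3

class CellFilterNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.filters = nn.Linear(n_markers, n_filters)   # per-cell filter responses
        self.output = nn.Linear(n_filters, 1)            # sample-level prediction

    def forward(self, cells):                            # cells: (n_cells, n_markers)
        responses = torch.relu(self.filters(cells))      # (n_cells, n_filters)
        pooled = responses.mean(dim=0)                    # average over all cells
        return self.output(pooled)

model = CellFilterNet()
sample = torch.randn(10000, n_markers)                   # one sample of 10,000 cells
sample_logit = model(sample)                             # trained against the sample label
per_cell_scores = model.filters(sample)                  # reused to score single cells
```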
Neural networks can also learn low-dimensional representations of single-cell gene expression data for visualization, clustering, and other tasks. Both scvis [@doi:10.1101/178624] and scVI [@arxiv:1709.02082] are unsupervised approaches based on VAEs. Whereas scvis primarily focuses on single-cell visualization as a replacement for t-Distributed Stochastic Neighbor Embedding [@tag:Maaten2008_tsne], the scVI model accounts for zero-inflated expression distributions and can impute zero values that are due to technical effects. Beyond VAEs, Lin et al. developed a supervised model to predict cell type [@doi:10.1093/nar/gkx681]. Similar to transfer learning approaches for microscopy images [@doi:10.1101/085118], they demonstrated that the hidden layer representations were informative in general and could be used to identify cellular subpopulations or match new cells to known cell types. The supervised neural network's representation was better overall at retrieving cell types than alternatives, but all methods struggled to recover certain cell types such as hematopoietic stem cells and inner cell mass cells. As the Human Cell Atlas [@doi:10.7554/eLife.27041] and related efforts generate more single-cell expression data, there will be opportunities to assess how well these low-dimensional representations generalize to new cell types as well as abundant training data to learn broadly-applicable representations.
The sheer quantity of omic information that can be obtained from each cell, as well as the number of cells in each dataset, uniquely position single-cell data to benefit from deep learning. In the future, lineage tracing could be revolutionized by using autoencoders to reduce the feature space of transcriptomic or variant data followed by algorithms to learn optimal cell differentiation trajectories [@tag:Qiu2017_graph_embedding] or by feeding cell morphology and movement into neural networks [@tag:Buggenthin2017_imaged_lineage]. Reinforcement learning algorithms [@tag:Silver2016_alphago] could be trained on the evolutionary dynamics of cancer cells or bacterial cells undergoing selection pressure and reveal whether patterns of adaptation are random or deterministic, allowing us to develop therapeutic strategies that forestall resistance. We are excited to see the creative applications of deep learning to single-cell biology that emerge over the next few years.
Metagenomics, which refers to the study of genetic material -- 16S rRNA or whole-genome shotgun DNA -- from microbial communities, has revolutionized the study of micro-scale ecosystems within and around us. In recent years, machine learning has proved to be a powerful tool for metagenomic analysis. 16S rRNA has long been used to deconvolve mixtures of microbial genomes, yet this ignores more than 99% of the genomic content. Subsequent tools aimed to classify 300 bp-3000 bp reads from complex mixtures of microbial genomes based on tetranucleotide frequencies, which differ across organisms [@tag:Karlin], using supervised [@tag:McHardy; @tag:nbc] or unsupervised methods [@tag:Abe]. Then, researchers began to use techniques that could estimate relative abundances from an entire sample faster than classifying individual reads [@tag:Metaphlan; @tag:wgsquikr; @tag:lmat; @tag:Vervier]. There is also great interest in identifying and annotating sequence reads [@tag:yok; @tag:Soueidan]. However, the focus on taxonomic and functional annotation is just the first step. Several groups have proposed methods to determine host or environment phenotypes from the organisms that are identified [@tag:Guetterman; @tag:Knights; @tag:Stratnikov; @tag:Segata] or overall sequence composition [@tag:Ding]. Also, researchers have looked into how feature selection can improve classification [@tag:Liu; @tag:Segata], and techniques have been proposed that are classifier-independent [@tag:Ditzler; @tag:Ditzler2].
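As an example of the sequence composition features mentioned above, the snippet below computes the tetranucleotide frequency vector of a read, which can serve as input to the supervised or unsupervised classifiers cited.

```python
# Sketch of the tetranucleotide-frequency representation used to classify or
# bin metagenomic reads: count every overlapping 4-mer in a read and normalize
# to a 256-dimensional frequency vector.
from itertools import product

TETRAMERS = ["".join(k) for k in product("ACGT", repeat=4)]   # 256 possible 4-mers
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}

def tetranucleotide_frequencies(read):
    counts = [0] * 256
    for i in range(len(read) - 3):
        kmer = read[i:i + 4]
        if kmer in INDEX:                # skip 4-mers containing N or other codes
            counts[INDEX[kmer]] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

features = tetranucleotide_frequencies("ACGTACGTTTGACCATGNNACGT")
```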
Most neural networks are used for phylogenetic classification or functional annotation from sequence data where there is ample data for training. Neural networks have been applied successfully to gene annotation (e.g. Orphelia [@tag:Hoff] and FragGeneScan [@doi:10.1093/nar/gkq747]). Representations (similar to Word2Vec [@tag:Word2Vec] in natural language processing) for protein family classification have been introduced and classified with a skip-gram neural network [@tag:Asgari]. Recurrent neural networks show good performance for homology and protein family identification [@tag:Hochreiter; @tag:Sonderby].
One of the first techniques of de novo genome binning used self-organizing maps, a type of neural network [@tag:Abe]. Essinger et al. [@tag:Essinger2010_taxonomic] used Adaptive Resonance Theory to cluster similar genomic fragments and showed that it had better performance than k-means. However, other methods based on interpolated Markov models [@tag:Salzberg] have performed better than these early genome binners. Neural networks can be slow and therefore have had limited use for reference-based taxonomic classification, with TAC-ELM [@tag:TAC-ELM] being the only neural network-based algorithm to taxonomically classify massive amounts of metagenomic data. An initial study successfully applied neural networks to taxonomic classification of 16S rRNA genes, with convolutional networks providing about a 10% improvement in genus-level accuracy over RNNs and random forests [@tag:Mrzelj]. However, this study evaluated only 3,000 sequences.
Neural network uses for classifying phenotype from microbial composition are just beginning. A simple multi-layer perceptron (MLP) was able to classify wound severity from microbial species present in the wound [@doi:10.1016/j.bjid.2015.08.013]. Recently, Ditzler et al. associated soil samples with pH level using MLPs, DBNs, and RNNs [@tag:Ditzler3]. Besides classifying samples appropriately, internal phylogenetic tree nodes inferred by the networks represented features for low and high pH. Thus, hidden nodes might provide biological insight as well as new features for future metagenomic sample comparison. Also, an initial study has shown promise of these networks for diagnosing disease [@tag:Faruqi].
Challenges remain in applying deep neural networks to metagenomics problems. They are not yet ideal for phenotype classification because most studies contain tens of samples and hundreds or thousands of features (species). Such underdetermined, or ill-conditioned, problems are still a challenge for deep neural networks that require many training examples. Also, due to convergence issues [@arxiv:1212.0901v2], taxonomic classification of reads from whole genome sequencing seems out of reach at the moment for deep neural networks. There are only thousands of fully sequenced genomes, as compared to hundreds of thousands of 16S rRNA sequences available for training.
However, because RNNs have been applied to base calls for the Oxford Nanopore long-read sequencer with some success [@tag:Boza] (discussed below), one day the entire pipeline, from denoising to functional classification, may be combined into one step using powerful LSTMs [@tag:Sutskever]. For example, metagenomic assembly usually requires binning then assembly, but could deep neural nets accomplish both tasks in one network? We believe the greatest potential in deep learning is to learn the complete characteristics of a metagenomic sample in one complex network.
While we have so far primarily discussed the role of deep learning in analyzing genomic data, deep learning can also substantially improve our ability to obtain the genomic data itself. We discuss two specific challenges: calling SNPs and indels (insertions and deletions) with high specificity and sensitivity and improving the accuracy of new types of data such as nanopore sequencing. These two tasks are critical for studying rare variation, allele-specific transcription and translation, and splice site mutations. In the clinical realm, sequencing of rare tumor clones and other genetic diseases will require accurate calling of SNPs and indels.
Current methods achieve relatively high (>99%) precision at 90% recall for SNP and indel calls from Illumina short-read data [@tag:Poplin2016_deepvariant], yet this leaves a large number of potentially clinically important false positives and false negatives. These methods have so far relied on experts to build probabilistic models that reliably separate signal from noise. However, this process is time consuming and fundamentally limited by how well we understand and can model the factors that contribute to noise. Recently, two groups have applied deep learning to construct data-driven unbiased noise models. One of these models, DeepVariant, leverages Inception, a neural network trained for image classification by Google Brain, by encoding reads around a candidate SNP as a 221x100 bitmap image, where each column is a nucleotide and each row is a read from the sample library [@tag:Poplin2016_deepvariant]. The top 5 rows represent the reference, and the bottom 95 rows represent randomly sampled reads that overlap the candidate variant. Each RGBA (red/green/blue/alpha) image pixel encodes the base (A, C, G, T) as a different red value, quality score as a green value, strand as a blue value, and variation from the reference as the alpha value. The neural network outputs genotype probabilities for each candidate variant. They were able to achieve better performance than GATK, a leading genotype caller, even when GATK was given information about population variation for each candidate variant. Another method, still in its infancy, used 62 hand-developed features for each candidate variant, feeding these vectors into a fully connected deep neural network [@tag:Torracinta2016_deep_snp]. Unfortunately, this feature set required at least 15 iterations of software development to fine-tune, which suggests that these models may not generalize.
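The layout of this encoding can be sketched as follows. Only the overall structure (a 221-position window, 100 rows, and base/quality/strand/mismatch channels) is taken from the description above; the specific channel scalings are assumptions for illustration.

```python
# Sketch of a pileup-image encoding for a candidate variant: 100 rows
# (reference plus sampled reads) by 221 positions, with four channels per
# pixel. Channel scalings are illustrative assumptions.
import numpy as np

WIDTH, ROWS = 221, 100
BASE_VALUE = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}   # assumed red-channel encoding

def encode_read(row, reference, read_bases, qualities, is_reverse, image):
    for col, (base, qual) in enumerate(zip(read_bases, qualities)):
        if base not in BASE_VALUE:
            continue
        image[row, col, 0] = BASE_VALUE[base]                 # red: base identity
        image[row, col, 1] = min(qual, 40) / 40.0             # green: base quality
        image[row, col, 2] = 1.0 if is_reverse else 0.0       # blue: strand
        image[row, col, 3] = 0.0 if base == reference[col] else 1.0  # alpha: mismatch

reference = "A" * WIDTH                                       # toy reference window
image = np.zeros((ROWS, WIDTH, 4), dtype=np.float32)
for row in range(5):                                          # top rows: the reference itself
    encode_read(row, reference, reference, [40] * WIDTH, False, image)
# The remaining rows would be filled with up to 95 aligned reads overlapping
# the candidate variant; the image is then fed to an image-classification CNN.
```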
Variant calling will benefit more from optimizing neural network architectures than from developing features by hand. An interesting and informative next step would be to rigorously test if encoding raw sequence and quality data as an image, tensor, or some other mixed format produces the best variant calls. Because many of the latest neural network architectures (ResNet, Inception, Xception, and others) are already optimized for and pre-trained on generic, large-scale image datasets [@tag:Chollet2016_xception], encoding genomic data as images could prove to be a generally effective and efficient strategy.
In limited experiments, DeepVariant was robust to sequencing depth, read length, and even species [@tag:Poplin2016_deepvariant]. However, a model built on Illumina data, for instance, may not be optimal for Pacific Biosciences long-read data or MinION nanopore data, which have vastly different specificity and sensitivity profiles and signal-to-noise characteristics. Recently, Boza et al. used bidirectional recurrent neural networks to infer the E. coli sequence from MinION nanopore electric current data with higher per-base accuracy than the proprietary hidden Markov model-based algorithm Metrichor [@tag:Boza]. Unfortunately, training any neural network requires a large amount of data, which is often not available for new sequencing technologies. To circumvent this, one very preliminary study simulated mutations and spiked them into somatic and germline RNA-seq data, then trained and tested a neural network on simulated paired RNA-seq and exome sequencing data [@tag:Torracinta2016_sim]. However, because this model was not subsequently tested on ground-truth datasets, it is unclear whether simulation can produce sufficiently realistic data to produce reliable models.
Method development for interpreting new types of sequencing data has historically taken two steps: first, easily implemented hard cutoffs that prioritize specificity over sensitivity, then expert development of probabilistic models with hand-developed inputs [@tag:Torracinta2016_sim]. We anticipate that these steps will be replaced by deep learning, which will infer features simply by its ability to optimize a complex model against data.