%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Start the main part of the manuscript here.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
Protein-protein interactions underlie essentially all biological processes, including signal transduction and antibody-antigen recognition. Many protein-protein interfaces are sensitive to mutations that can alter interaction affinity and specificity.
In fact, mutations at protein-protein interfaces have been reported to be overrepresented within disease-causing mutations\cite{jubb_mutations_2017}, highlighting the central importance of these interactions to biology and human health.
A sufficiently accurate computational method capable of predicting mutations that strengthen or weaken known protein-protein interactions would hence serve as a useful tool to dissect the role of specific protein-protein interactions in important biological processes. Coupled with state-of-the-art methods for protein engineering and design, such a method would also enhance our ability to create new and selective interactions, enabling the development of improved protein therapeutics, protein-based sensors, and protein materials.
Several prior methods have been developed to predict changes in protein-protein binding affinity upon mutation using different approaches to estimating energetic effects (scoring) and modeling structural changes (sampling). Common approaches include
weighted energy functions that seek to describe physical interactions underlying protein-protein interactions\cite{guerois_predicting_2002,kamisetty_accounting_2011},
statistical and contact potentials \cite{dehouck_beatmusic:_2013,moal_intermolecular_2013,vangone_contacts-based_2015,brender_predicting_2015},
a combination of these approaches\cite{li_predicting_2014,tuncbag_identification_2009},
graph-based representations\cite{pires_mcsm:_2014},
methods that sample backbone structure space locally around mutations\cite{dourado_multiscale_2014},
and machine learning approaches\cite{zhu_kfc2:_2011}.
We set out to develop and assess methods for estimating experimentally determined changes in binding free energy after mutation (interface \ddg) within the Rosetta macromolecular modeling suite. Rosetta is freely available for academic use, and allows combination of interface \ddg\ predictions with Rosetta's powerful protein design capabilities, which have proven successful in a variety of applications \cite{mandell_computer-aided_2009,kaufmann_practically_2010}.
Prior projects have applied Rosetta predictions to
dissect determinants of binding specificity and promiscuity\cite{boulanger_convergent_2003,mcfarland_symmetry_2003},
enhance protein-protein binding affinities\cite{sammond_structure-based_2007,song_rational_2006},
and to design modified\cite{kapp_control_2012}
and new interactions\cite{chevalier_design_2002,fleishman_computational_2011,chevalier_massively_2017}, but no prior benchmarking effort has quantitatively assessed Rosetta's performance at predicting changes in binding free energy on a large, diverse benchmark dataset, in part because such datasets have only recently become available.
The current state-of-the-art Rosetta \ddg\ method, ddg\_monomer\cite{kellogg_role_2011}, has proven effective at predicting changes in the stability of monomeric proteins after mutation, but had not yet been tested at predicting changes in binding free energy in protein-protein complexes.
Prior ``computational alanine scanning'' \ddg\ methods were benchmarked on mutations in protein-protein interfaces, focusing on mutations to alanine \cite{kortemme_simple_2002,kortemme_computational_2004,conchuir_web_2015}.
The original Rosetta alanine scanning method\cite{kortemme_simple_2002} did not sample backbone degrees of freedom. This is a reasonable first-order approximation for mutations to alanine, which are not expected to cause large backbone perturbations\cite{cunningham_high-resolution_1989}, but it is less likely to be predictive for mutations to larger side chains, which might require some degree of backbone rearrangement to accommodate the change.
Inclusion of recent Rosetta energy function and sampling method developments, including methods that attempt to more aggressively sample conformational space, has not resulted in significant improvement to the alanine scanning method\cite{conchuir_web_2015}.
We sought to create a method that would take into account aspects of the conformational plasticity of proteins by representing structures as an ensemble of individual full-atom models to explore biologically relevant and accessible portions of conformational space near the crystallographically determined input structures.
Ensemble representations have previously been shown to be effective at predicting changes in protein stabilities after mutation\cite{davey_prediction_2015} and at predicting the effects of mutation on protein-protein binding affinities\cite{benedix_predicting_2009}, as well as at improving $\Delta G_{binding}$ calculations between kinases and their inhibitors \cite{araki_effect_2016}.
We chose to sample conformational plasticity using the ``backrub'' protocol implemented in Rosetta \cite{smith_backrub-like_2008}.
The backrub method samples local side chain and backbone conformational changes, similar to those suggested to underlie observed conformational heterogeneity in high-resolution crystal structures \cite{davis_backrub_2006}, and to accommodate evolved and designed mutations\cite{keedy_role_2012}.
Backrub ensembles have been demonstrated to recapitulate properties of proteins that have been experimentally determined, such as side chain NMR order parameters\cite{friedland_simple_2008}, tolerated sequence profiles at protein-protein \cite{humphris_prediction_2008} and protein-peptide interfaces \cite{smith_structure-based_2010,smith_predicting_2011}, and conformational variability between protein homologs\cite{schenkelberg_protein_2016}.
Backrub has also proved effective in design applications, such as the redesign of protein-protein interfaces\cite{kapp_control_2012} and recapitulation of mutations that alter ligand-binding specificity\cite{ollikainen_coupling_2015}.
When compared to ensembles generated via molecular dynamics simulations or the ``PertMin'' method\cite{davey_improving_2014}, backrub was the only method whose ensembles showed greater diversity (as measured by RMSD) between output models than between the output models and the original input crystal structure.
This observation suggests that backrub could be uniquely suited to produce diverse ensembles that effectively explore the local conformational space around an input structure\cite{davey_improving_2014}.
Taken together, we hypothesized that these previously demonstrated properties of backrub ensembles would also make them an effective representation of near-native conformational states for use in predicting interface \ddg\ values.
\section{Methods}
\subsection{Benchmark datasets}
Developing and assessing the accuracy of a new method to predict changes in binding free energy after mutation requires a large and diverse benchmark set covering single mutations to all amino acid types, multiple mutations, and mutations across a variety of protein-protein interfaces.
To facilitate comparisons to other methods and to avoid biases specific to our approach, we chose to use an existing benchmark dataset created by Dourado and Flores\cite{dourado_multiscale_2014} during the development of their ZEMu (Zone Equilibration of Mutants) method.
The ZEMu dataset was curated from the larger SKEMPI database\cite{moal_skempi:_2012} by avoiding a bias towards complexes in which a single position is repeatedly mutated, experimental data that are not peer-reviewed, redundancy (duplicate experimental values), mutations outside of interfaces, mutations involved in crystal contacts, and experimental \ddg\ values for which wild-type and mutant conditions (such as pH) varied.
Confidence in the ``known'' experimental \ddg\ values is important, as it has been pointed out that the experimental methodology used can have a strong effect on the performance of predictors of changes in binding free energy\cite{geng_exploring_2016}.
The ZEMu dataset was also curated to include a range of both stabilizing and destabilizing mutants, small side chain to large side chain mutations, single and multiple mutations, and a diversity of complexes.
Small-to-large mutations are defined as those dataset cases in which every mutation occurs at a position where the residue side chain increases in van der Waals volume upon mutation\cite{simpson_proteins_2002}.
\subimport*{figs-and-tables/}{table-composition}
After a review of the literature from which the known experimental \ddg\ values originated, we removed one data point from the 1254-point ZEMu set that we could not match to the originally reported affinity value. We also removed 5 mutations we determined to be duplicates, along with 8 mutations that were reverse mutations of other data points, leaving us with a test set of 1240 mutations (\cref{tab:table-composition}).
We used SAbDab \cite{dunbar_sabdab:_2014} to define complexes that contained at least one antibody binding partner.
Our version of the ZEMu dataset is available in the Supporting Information as Dataset \ref{dst:zemu-filtered}.
All \ddg\ predictions described in the paper are available in the Supporting Information.
\subsection{Rosetta implementation and prediction protocol}
Our protocol, called ``flex ddG'', is implemented within the RosettaScripts interface to the Rosetta macromolecular modeling software suite \cite{fleishman_rosettascripts:_2011}, which makes the protocol easily adaptable to future improvements and energy function development.
The method can be run using a RosettaScripts XML script that is available in the Supporting Information as \cref{lst:ddg-script}.
Version numbers of tested software are available in \cref{tab:table-versions}.
\begin{figure}
\centering
\includegraphics[width=\textwidth,keepaspectratio]{figures/fig-overview.pdf}
\caption[]{
Schematic of the flex ddG protocol.
} \label{fig:figure-overview}
\end{figure}
Flex ddG method steps are outlined in \cref{fig:figure-overview}.
\textbf{Step 1:} The protocol begins with an initial minimization of the input crystal structure of the wild-type protein complex, on backbone $\phi$/$\psi$\ and side chain $\chi$\ torsional degrees of freedom, using the limited-memory Broyden-Fletcher-Goldfarb-Shanno minimizer implementation within Rosetta with Armijo inexact line search conditions (option ``lbfgs\_armijo\_nonmonotone'').
This minimization (and all later minimizations) is performed with harmonic restraints on pairwise atom distances, relative to their values in the input crystal structure.
Restraints are added for all pairs of C-$\alpha$ atoms within 9 \AA\ of each other, using a harmonic score potential with the width (standard deviation) parameter set to 0.5 \AA; the restraint term is added to the Rosetta score function with a weight of 1.0.
Minimization is run until convergence (absolute score change upon minimization of less than one REU (Rosetta Energy Unit)).
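To make the restraint setup concrete, the following is a minimal Python sketch of the restraint energy described above. The 9 \AA\ pair cutoff, 0.5 \AA\ width, and 1.0 term weight are the values given in the text; the quadratic form $((d-d_0)/\sigma)^2$ is assumed to match Rosetta's standard harmonic potential, and the function is illustrative rather than the Rosetta implementation.
\begin{verbatim}
import itertools
import math

def restraint_score(ca_coords, ref_coords, cutoff=9.0, sd=0.5, weight=1.0):
    """Harmonic pairwise C-alpha distance restraints (illustrative sketch).

    ca_coords  -- current C-alpha coordinates, list of (x, y, z) tuples
    ref_coords -- reference coordinates from the input crystal structure
    Pairs are selected by their distance in the reference structure.
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    score = 0.0
    for i, j in itertools.combinations(range(len(ref_coords)), 2):
        d0 = dist(ref_coords[i], ref_coords[j])
        if d0 <= cutoff:  # restrain only pairs within 9 A in the input
            d = dist(ca_coords[i], ca_coords[j])
            score += weight * ((d - d0) / sd) ** 2  # harmonic, sd = 0.5 A
    return score
\end{verbatim}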
\textbf{Step 2:} Starting from the minimized input structure including both binding partners in the protein-protein complex, the backrub method in Rosetta\cite{smith_backrub-like_2008} is used to create an ensemble of models.
In brief, each backrub move is undertaken on a randomly chosen protein segment consisting of three to twelve adjacent residues in the neighborhood of any mutated position.
The mutation neighborhood is defined by finding all residues in the protein-protein complex with a C-$\beta$ atom (C-$\alpha$ for glycines) within 8 \AA\ of any mutant position, then adding each such residue and its adjacent N- and C-terminal residues to the list of neighborhood residues.
All atoms in the backrub segment are rotated locally about an axis defined as the vector between the endpoint C-$\alpha$ atoms.
The allowed rotation angles for the backrub steps use Rosetta default values as described in Smith \& Kortemme, 2008\cite{smith_backrub-like_2008}.
Backrub is run at a temperature of 1.2 kT, for up to 50,000 backrub Monte Carlo trials/steps (\cref{tab:table-temperature} shows that using a kT of 1.6 gives similar results to a kT of 1.2).
Up to 50 output models are generated.
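As an illustration, a short Python sketch of the neighborhood selection described above; the helper is hypothetical (not the Rosetta implementation), and residue coordinates are assumed to be given as C-$\beta$ positions (C-$\alpha$ for glycines).
\begin{verbatim}
import math

def backrub_neighborhood(coords, mutant_positions, cutoff=8.0):
    """Select residues eligible for backrub sampling (Step 2 sketch).

    coords           -- dict: residue position -> (x, y, z) of the C-beta
                        atom (C-alpha for glycines)
    mutant_positions -- iterable of mutated sequence positions
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    neighborhood = set()
    for pos, xyz in coords.items():
        if any(dist(xyz, coords[m]) <= cutoff for m in mutant_positions):
            # add the residue plus its N- and C-terminal sequence neighbors
            neighborhood.update((pos - 1, pos, pos + 1))
    return {p for p in neighborhood if p in coords}
\end{verbatim}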
\textbf{Step 3A:} For each of the 50 models in the ensemble output by backrub, the Rosetta ``packer'' is used to optimize side chain conformations for the wild-type sequence using discrete rotameric conformations \cite{shapovalov_smoothed_2011} and simulated annealing. The packer is run with the multi-cool annealer option\cite{leaver-fay_generic_2011}, which is set to keep a history of the 6 best rotameric states visited during annealing.
\textbf{Step 3B:} Independently and in parallel to step 3A, side chain conformations for the mutant sequence are optimized on all 50 models, introducing the mutation(s).
\textbf{Step 4A:} Each of the 50 wild-type models is minimized, again with harmonic restraints on pairwise interatomic distances. Minimization is run with the same parameters as in Step 1; the reference coordinates for the restraints in this step are taken from the corresponding Step 3A model.
\textbf{Step 4B:} As Step 4A, but for each of the 50 mutant models.
\textbf{Step 5A:} Each of the 50 minimized wild-type models is scored in complex, and the complex partners are scored individually. The scores of the split, unbound complex partners are obtained simply by moving the complex halves away from each other. No further minimization or side chain optimization is performed on the unbound partners before scoring.
\textbf{Step 5B:} In the same fashion as Step 5A, each of the 50 minimized mutant models is scored in complex, and the complex partners are scored individually.
\textbf{Step 6:} The interface \ddg\ score is calculated via Eq.~\ref{eqn:split-ddg} as the arithmetic mean over the different models produced:
\begin{equation}\label{eqn:split-ddg}
\begin{split}
{\Delta\Delta}G_{bind} & ={\Delta}G^{MUT}_{bind} - {\Delta}G^{WT}_{bind}\\
& =({\Delta}G^{MUT}_{complex} - {\Delta}G^{MUT}_{partner A} - {\Delta}G^{MUT}_{partner B})\\
& \quad - ({\Delta}G^{WT}_{complex} - {\Delta}G^{WT}_{partner A} - {\Delta}G^{WT}_{partner B})\\
\end{split}
\end{equation}
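In code form, the averaging in Step 6 amounts to the following Python sketch of Eq.~\ref{eqn:split-ddg}; the per-model score dictionary layout is illustrative, not the protocol's actual output format.
\begin{verbatim}
def interface_ddg(models):
    """Mean interface ddG over the ensemble (sketch of Step 6).

    models -- list of per-model score dicts with keys 'mut_complex',
              'mut_A', 'mut_B', 'wt_complex', 'wt_A', 'wt_B'
    """
    def dg_bind(complex_score, partner_a, partner_b):
        return complex_score - partner_a - partner_b

    per_model = [dg_bind(m['mut_complex'], m['mut_A'], m['mut_B'])
                 - dg_bind(m['wt_complex'], m['wt_A'], m['wt_B'])
                 for m in models]
    return sum(per_model) / len(per_model)
\end{verbatim}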
We evaluate the performance of the protocol by comparing predicted \ddg\ scores to known experimental values, using Pearson's correlation coefficient (R), Fraction Correct (FC), and Mean Absolute Error (MAE). Fraction Correct is defined as the number of cases in the dataset categorized correctly as stabilizing, neutral, or destabilizing, divided by the total number of cases in the dataset. Stabilizing mutations are defined as those with \ddg\ $\leq -1.0$ kcal/mol, neutral as those with $-1.0$ kcal/mol $<$ \ddg\ $<$ $1.0$ kcal/mol, and destabilizing as those with \ddg\ $\geq 1.0$ kcal/mol.
MAE (Mean Absolute Error) is defined in Eq.~\ref{eqn:mae} as:
\begin{equation}\label{eqn:mae}
MAE = \dfrac{1}{n}\sum\limits_{i=1}^n|y_i-x_i| = \dfrac{1}{n}\sum\limits_{i=1}^n|e_i|
\end{equation}
where $y_i$ are the predicted \ddg\ values, $x_i$ are the known, experimentally determined values, and $e_i$ is the prediction error.
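The three evaluation metrics can be computed as in the following Python sketch, assuming NumPy and SciPy are available; the classification cutoffs are the $\pm 1.0$ kcal/mol thresholds defined above.
\begin{verbatim}
import numpy as np
from scipy.stats import pearsonr

def classify(ddg):
    """Stabilizing / neutral / destabilizing, per the 1 kcal/mol cutoffs."""
    if ddg <= -1.0:
        return "stabilizing"
    if ddg >= 1.0:
        return "destabilizing"
    return "neutral"

def evaluate(predicted, experimental):
    """Pearson R, Fraction Correct, and MAE, as defined in the text."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    r, _ = pearsonr(predicted, experimental)
    fc = np.mean([classify(p) == classify(x)
                  for p, x in zip(predicted, experimental)])
    mae = np.mean(np.abs(predicted - experimental))
    return r, fc, mae
\end{verbatim}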
As a control, we ran the flex ddG protocol omitting the backrub ensemble generation step.
This control protocol can in principle generate multiple models because of the minimization and packing steps, but in practice these models are structurally highly similar or identical.
\subsection{Rosetta energy function}
We utilized Rosetta's Talaris\cite{song_structure-guided_2011,shapovalov_smoothed_2011,omeara_combined_2015} all-atom energy function for the modeling steps.
As we do not modify our models of the unbound state, several terms of the Rosetta energy function will cancel out in the final \ddg\ scoring because the $\Delta G$\ of folding score of the unbound partners is subtracted from the total score of the complex (\cref{eqn:split-ddg}).
After subtraction, seven score terms remain; combined, they constitute the final interface \ddg\ score. The score is dominated by solvation (fa\_sol, using an implicit solvation model\cite{lazaridis_effective_1999}), hydrogen bonding and electrostatics\cite{kortemme_orientation-dependent_2003,song_structure-guided_2011,omeara_combined_2015} (hbond\_sc: side chain-side chain hydrogen bonds; hbond\_bb\_sc: hydrogen bonds between backbone and side chain atoms; hbond\_lr\_bb: long-range hydrogen bonds between backbone atoms; fa\_elec: Coulomb electrostatics), and Lennard-Jones atomic packing interactions (fa\_rep and fa\_atr: the repulsive and attractive components of the Lennard-Jones potential).
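For reference, a minimal sketch of how the seven surviving terms combine into the final interface \ddg\ score; the term names are as listed above, while the dictionary-based bookkeeping is purely illustrative.
\begin{verbatim}
# The seven score terms that survive the subtraction in the ddG equation.
DDG_TERMS = ('fa_sol', 'hbond_sc', 'hbond_bb_sc', 'hbond_lr_bb',
             'fa_elec', 'fa_rep', 'fa_atr')

def interface_ddg_score(term_ddgs):
    """Sum per-term ddG contributions (dict: term name -> value in REU)."""
    return sum(term_ddgs[term] for term in DDG_TERMS)
\end{verbatim}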
\subsection{Score analysis}
To investigate potential sources of prediction error on an individual score term basis, we used a generalized additive model\cite{hastie_generalized_1990} approach to fit Rosetta's predicted \ddg\ values to experimentally known values.
First, we apply an unbiased logistic scaling to individual score terms,
$$h_{a,b}(x) = \frac{2e^{a}}{1 + e^{-xe^{b}}}-e^a,$$
where $a$ is the scaling range of the score, and $b$ is the steepness of the sigmoid scaling. Both parameters are transformed through an exponential to ensure non-negativity. The scaling function $h$ does not introduce bias, that is, $h_\theta(0) = 0$ for any $\theta$. The scoring model results in a generalized additive model (GAM) over the $M$ score terms,
$$f(\mathbf{x}) = \sum_{j=1}^M h_{a_j,b_j}(x_j).$$
The parameters $\theta = (a_j,b_j)_{j=1}^M$ for the score terms were simultaneously sampled using a random walk Metropolis-Hastings MCMC algorithm (the \texttt{mhsample} function in Matlab) assuming a Gaussian likelihood as the target distribution
$$p(\theta \mid \mathbf{y}) \propto \prod_{i=1}^N \mathcal{N}\!\left(y_i \mid f(\mathbf{x}_i), \sigma_n^2\right),$$
with the noise variance set to $\sigma_n^2 = 1.0$, where $(\mathbf{x}_i,y_i)_{i=1}^N$ are the empirical observations $y_i$ and their corresponding protein score terms $\mathbf{x}_i$. We draw $1000$ samples with a burn-in of $1000$ samples and a thinning parameter of $20$. The proposal distribution was selected to be a symmetric uniform distribution such that $[a^{(s+1)},b^{(s+1)}] \sim U( a^{(s)} \pm 2, b^{(s)} \pm 2)$. The resulting MCMC sample represents all logistic score scalings that reproduce the empirical measurements assuming an error model with noise variance $\sigma_n^2$.
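The original fit was performed with Matlab's \texttt{mhsample}; the following Python sketch transcribes the model as described above under the stated settings ($\sigma_n^2 = 1.0$, uniform $\pm 2$ random-walk proposal). Function and variable names are ours, and this is not the original fitting code.
\begin{verbatim}
import numpy as np

def h(x, a, b):
    """Bias-free logistic scaling of one score term; h(0) = 0."""
    return 2.0 * np.exp(a) / (1.0 + np.exp(-x * np.exp(b))) - np.exp(a)

def gam_predict(X, params):
    """GAM prediction f(x) = sum_j h(x_j; a_j, b_j).

    X      -- (N, M) array of per-case score terms
    params -- (M, 2) array of (a_j, b_j) pairs
    """
    return sum(h(X[:, j], a, b) for j, (a, b) in enumerate(params))

def log_likelihood(params, X, y, sigma2=1.0):
    """Gaussian likelihood with noise variance sigma_n^2 = 1.0."""
    resid = y - gam_predict(X, params)
    return -0.5 * np.sum(resid ** 2) / sigma2

def mh_step(params, X, y, rng, half_width=2.0):
    """One random-walk Metropolis step with the uniform +/-2 proposal."""
    proposal = params + rng.uniform(-half_width, half_width, params.shape)
    log_alpha = log_likelihood(proposal, X, y) - log_likelihood(params, X, y)
    return proposal if np.log(rng.random()) < log_alpha else params
\end{verbatim}
Burn-in and thinning are applied on top of repeated \texttt{mh\_step} calls, as described in the text.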
\section{Results and discussion}
\subimport*{figs-and-tables/}{table-main}
The overall performance of the protocol is summarized in \cref{tab:table-main}.
We compare 4 prediction methods: (a) our flex ddG backrub ensemble method, (b) the prior state-of-the-art Rosetta methodology, ddg\_monomer \cite{kellogg_role_2011}, (c) a control version of our flex ddG protocol which omits the backrub ensemble generation step, leaving only the minimization and packing steps, and (d) published data from the ZEMu (zone equilibration of mutants) method\cite{dourado_multiscale_2014}. Data split by input protein-protein complex are shown in \cref{tab:table-by-structure}.
The new flex ddG method outperforms the comparison methods on the complete dataset on each of the correlation, MAE, and Fraction Correct metrics (\cref{tab:table-main}). In particular, we see a large increase in performance relative to the other methods on the small-to-large subset of mutations. This is in accordance with our expectation that backrub ensembles should be able to sample the small backbone conformational adjustments required to accommodate changes in amino acid residue size. Notably, the backrub ensemble approach outperforms methods that include only backbone minimization, including the current state-of-the-art Rosetta ddg\_monomer method. On the small-to-large mutations subset, the ddg\_monomer method achieves a Pearson correlation of only 0.31, compared to 0.65 with flex ddG.
On the subset of single mutations to alanine, the performance of the flex ddG method is also competitive with, or better than, the alternative methods.
As we do not expect single mutations to alanine to require intensive backbone sampling, our method's effectiveness on this subset shows that the method is fairly robust to the mutation type.
As we chose to perform backrub sampling prior to introducing mutations, these results could suggest that flex ddG is effective by sampling underlying, relevant plasticity of the input crystal structure instead of distorting the local structure around a mutation to resolve a clash or poor interaction with a mutant side chain.
While the flex ddG method shows improved performance on the subset of multiple mutations as compared to the control and ddg\_monomer methods, flex ddG did not match the performance of the ZEMu method on this subset.
This result could indicate that further refinement of the backrub parameters is required when simultaneously sampling conformational space around the sites of multiple mutations.
For example, while we modeled all mutations simultaneously, it is possible that a protocol that considers mutations sequentially could improve predictions.
However, and remarkably, flex ddG outperforms ZEMu on the subset of cases with multiple mutations where none of the mutations are to alanine (\cref{tab:table-main}).
While any explanation of the origins of this difference remains speculative, especially given the limited structural information on the mutated proteins (and on possible changes in dynamics), we note that flex ddG predictions are more accurate for several cases in this dataset with experimental \ddg\ values around zero that ZEMu over- or underpredicts.
Finally, the flex ddG method also shows considerable improvements over other methods on the subset of antibody-antigen complexes (\cref{tab:table-main}).
\cref{fig:figure-scatter} illustrates the performance of the flex ddG and control methods on the complete dataset and the small-to-large subset, using scatterplots comparing experimentally determined and computationally estimated changes in binding free energy for each case. In particular, a notable improvement of flex ddG over the control can be seen for the 13 small-to-large mutations that were experimentally determined to stabilize the protein-protein interface significantly (\ddg\ $\leq -1.0$ kcal/mol). For this set, the control method misclassifies most stabilizing mutations as having minimal effect or being destabilizing (9 mutations with predicted Rosetta \ddg\ scores $> 0$) (\cref{fig:figure-scatter}d), whereas flex ddG predicts a sizable number (12 of 13 mutations) to have Rosetta \ddg\ scores $< 0$ (\cref{fig:figure-scatter}c), even though only one of these mutations is predicted to be strongly stabilizing (predicted \ddg\ score $< -1$). The capability to predict stabilizing mutations is especially important for challenging design applications that modulate binding affinity and selectivity, as well as for creating entirely new high-affinity protein-protein interactions.
It has been previously observed that increasing the number of stabilizing mutations that are correctly identified (decreasing ``false negatives'') can be accompanied by an increase in ``false positives'', i.e.\ predictions that a mutation is stabilizing when it is not. However, using backbone ensembles was found to mitigate this effect by decreasing the number of false negatives more than it increases the number of false positives\cite{davey_improving_2014}. We therefore also evaluated the number of false positive predictions. For the complete dataset, there are 12 cases where the no-backrub control method incorrectly predicts as stabilizing (Rosetta \ddg\ score $\leq -1$) a mutation that was experimentally determined to destabilize the interface significantly (\ddg\ $> 1$ kcal/mol). In contrast, flex ddG misclassified only 1 destabilizing mutation as stabilizing. We conclude that flex ddG makes both fewer false negative and fewer false positive predictions.
In the following sections, we assess how different flex ddG implementations would affect prediction performance, focusing separately on sampling and scoring.
\subimport*{figs-and-tables/}{figure-scatter}
\subimport*{figs-and-tables/}{structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}
\subsection{Effect of ensemble size}
While the results presented above used an ensemble size of 50 members, we next investigated what the ideal ensemble size would be to maximize the predictive ability of our method.
For example, prior methods used ensemble sizes ranging from ten\cite{kamisetty_accounting_2011}\ to thousands\cite{benedix_predicting_2009}. As the computational time required to run flex ddG increases linearly with ensemble size, determining an optimal size is practically relevant. We therefore evaluated the performance of flex ddG as we average across an increasing number of models (from 1 to 50, \cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}).
The models are first sorted by the score of the corresponding repacked and minimized wild-type model, such that the \ddg\ computed with one model uses only the lowest (best) scoring model, with two models the two lowest scoring models, and so forth.
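For clarity, this sorting procedure can be sketched in Python as follows, reusing the illustrative \texttt{interface\_ddg} helper from the Methods sketch; the \texttt{wt\_complex} key is our hypothetical name for the per-model wild-type complex score.
\begin{verbatim}
def ddg_vs_ensemble_size(models):
    """Interface ddG as a function of ensemble size: sort models by the
    score of the repacked, minimized wild-type complex, then average over
    the n best-scoring models for n = 1..len(models)."""
    ordered = sorted(models, key=lambda m: m['wt_complex'])
    return [interface_ddg(ordered[:n]) for n in range(1, len(ordered) + 1)]
\end{verbatim}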
\cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}(a) shows the performance on the complete dataset.
As more models with increasing wild-type complex score are included in the average, the correlation with known experimental values increases.
Conversely, performance for the no backrub control method stays approximately constant as more models are averaged.
This result indicates that sampling with backrub adds information that improves \ddg\ calculation even though the additional averaged models have higher scores (average ensemble total score is shown in \cref{fig:wildtypecomplex-scores-complete}).
These higher scoring models would be excluded in methods such as the Rosetta ddg\_monomer protocol, which typically use only the lowest scoring wild-type and mutant models.
Similar observations on the utility of higher scoring models for stability prediction have been made previously\cite{howell_understanding_2014,davey_optimization_2015}.
Increasing the ensemble size may hence be useful to increase the odds of finding alternative conformations that are informative for estimating the effects of mutations, rather than simple minimization of structural models.
We find that the performance of the ddg\_monomer method also improves as more output models are averaged, rather than using only the three lowest-energy models as originally prescribed\cite{kellogg_role_2011} (\cref{fig:structs-v-corr-WildTypeComplex-ddg-monomer-16-003-zemu-2}, \cref{tab:structs-v-corr-WildTypeComplex-ddg-monomer-16-003-zemu-2-underlying-data}).
This was somewhat unexpected, as the no-backrub control method, which did not show an improvement with increasing ensemble size, is conceptually similar to the ddg\_monomer method. However, the difference may arise from the fact that the ddg\_monomer method ramps the weight of the repulsive Lennard-Jones term in the energy function during minimization. This strategy explores conformational space more broadly in different backbone ensemble members than minimization with a fully weighted repulsive term in the no-backrub control method. In this fashion, including more ensemble members generated by the ddg\_monomer method increases the conformational plasticity sampled which in turn increases performance, as seen for the flex ddG method.
Using flex ddG, the subset of small-to-large mutations shows the largest increase in correlation with experimental \ddg\ values as more models are averaged (\cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}(b)). This result is consistent with our reasoning above that improved modeling of conformational plasticity is important for prediction performance, and that this effect is most important for significant changes in amino acid residue size. For the subset of multiple mutations where none are mutations to alanine (\cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}(c)), performance overall increases substantially initially when more models are added.
Averaging across increased numbers of models also improves correlation for the subset of single mutations to alanine (\cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}(d)).
Here, improvements are seen up to averaging about 10 models, after which performance stays approximately constant.
This observation indicates that increased sampling is, at the very least, not harmful for cases where one would expect structural changes to be relatively small on average.
To test whether the optimal number of models depends on the structural context of the mutation, we binned the complete dataset by secondary structure class (alpha-helix, strand, loop, turn) at the site of the mutation using DSSP\cite{kabsch_dictionary_1983,joosten_series_2011}. For all secondary structure classes, we observed a performance increase when averaging over an increasing number of models, reaching a plateau at around 20--30 models (\cref{fig:structs-v-corr-id-zemu-12-60000-rscript-simplified-t14-SS}). We observed similar behavior when binning the dataset by residue burial at the site of mutation, using solvent accessible surface area computed with DSSP\cite{kabsch_dictionary_1983,joosten_series_2011}\ (\cref{fig:structs-v-corr-id-zemu-12-60000-rscript-simplified-t14-burial}). In all cases, performance increased when averaging across a larger number of models.
In summary, from a practical standpoint, generating 20--30 models should constitute sufficient sampling for most cases. Sorting the generated models by score and selecting the best scoring 20--30 out of 50 models does not appear to be necessary, as not sorting the models by score (\cref{fig:structs-v-corr-id-zemu-12-60000-rscript-simplified-t14}, \cref{tab:structs-v-corr-id-zemu-12-60000-rscript-simplified-t14-underlying-data}) gives results similar to sorting the models (\cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}).
\subsection{Effect of extent of backrub sampling in each trajectory}
\subimport*{figs-and-tables/}{steps-v-corr}
The extent of sampling can also be controlled by changing the number of Monte Carlo steps in the backrub simulations.
\cref{fig:steps-v-corr} shows the effect of increasing the number of backrub Monte Carlo steps (while averaging all 50 models at each output step) on flex ddG performance, compared to a control method with zero backrub steps that uses only minimization and side chain packing.
\ddg\ scores are calculated every 2,500 backrub steps.
After an initial increase over the first 2,500 backrub steps, performance stays relatively constant for the complete dataset (\cref{fig:steps-v-corr}a) and for single mutations to alanine (\cref{fig:steps-v-corr}d). However, for the subsets of small-to-large mutations (\cref{fig:steps-v-corr}b) and multiple mutations, none to alanine (\cref{fig:steps-v-corr}c), performance increases considerably with increasing numbers of Monte Carlo steps. This increase in performance is similar to that observed when averaging over more models for these subsets (\cref{fig:structs-v-corr-WildTypeComplex-zemu-12-60000-rscript-simplified-t14}b,c).
Performance levels off at around 30,000 backrub Monte Carlo steps.
The increased performance does not appear to be simply a result of decreasing scores, as the average score of the minimized wild-type complexes does not decrease uniformly across the sampled ensemble as the simulation progresses (\cref{fig:wildtypecomplex-scores-complete}).
The pairwise backrub ensemble RMSDs continue to increase throughout the backrub simulation for all subsets (\cref{fig:t14-mean-ensemble}), indicating that the diminishing returns beyond 30,000 Monte Carlo steps are not a result of a failure to sample new conformations; rather, continued sampling may simply not capture additional relevant local changes in structure in this benchmark set.
\subsection{Score analysis}
As the sampling and scoring problems of protein modeling are generally linked, it is often the case that improving one enables further improvements in the other.
First, we compared the performance of our flex ddG method, which was run using Rosetta's Talaris\cite{song_structure-guided_2011,shapovalov_smoothed_2011,omeara_combined_2015} energy function, to an identical protocol run with the more recently developed Rosetta Energy Function (REF)\cite{alford_rosetta_2017}.
The REF energy function differs from the Talaris energy function by utilizing a new anisotropic implicit solvation model, and an improved electrostatics and Lennard-Jones model. REF was optimized simultaneously against small-molecule thermodynamic data and high-resolution macromolecular structural data.
Using the REF energy function, we did not observe an increase in performance on the complete ZEMu dataset, and performance decreases were seen for the subsets of small-to-large mutations and multiple mutations (\cref{tab:table-ref}).
Interestingly, flex ddG performance with the REF energy function improved over that with the Talaris energy function when the resolution of the input crystal structure was $\leq 1.5$ \AA, although this subset of the data is rather small, comprising only 52 mutations.
Next, we sought to analyze underlying errors of the Rosetta energy function (when applied to interface \ddg) by assessing the individual terms of the energy function. To do so, we chose to reweight the terms of the energy function using a non-linear reweighting scheme similar to Generalized Additive Models (GAMs)\cite{hastie_generalized_1990}.
In this reweighting method, we used Monte Carlo sampling to fit a sigmoid function to the individual distributions of energy function terms, with the objective function of reducing the absolute error between our predictions and known experimental values over the entire dataset.
\begin{figure}
\includegraphics[width=0.5\textwidth,keepaspectratio]{figures/zemu-sigmoid2-corrs-main.png}
\caption[]{
Experimentally determined \ddg\ values (x-axis) versus predictions using a generalized additive model (GAM).
The complete dataset is shown.
GAM scores are refit from values in Rosetta Energy Units (REU) using the Rosetta Talaris\cite{song_structure-guided_2011,shapovalov_smoothed_2011,omeara_combined_2015} energy function.
The error bars in gray represent the range from minimum to maximum fit predicted \ddg\ value for the 1000 sampled GAM models.
\textbf{(a)}: Control (no backrub) Rosetta predictions.
\textbf{(b)}: Flex ddG Rosetta predictions using 35,000 backrub steps and 50 output models.
A line of best fit is shown in each of the panels.
} \label{fig:t14-fit-scatter-main}
\end{figure}
The effect on the predictions is shown in \cref{fig:t14-fit-scatter-main}, \cref{fig:t14-fit-scatter-supp}, and \cref{tab:table-gam-fit}. In general, the GAM-adjusted predictions contain fewer outliers. In particular, experimental \ddg\ values that are relatively neutral (near zero) can sometimes be predicted by flex ddG to be highly destabilizing; the GAM model reduces the magnitude of error of many of these outliers, improving overall performance (\cref{fig:t14-fit-scatter-main}). The overall correlation increases from 0.64 to 0.68 (\cref{tab:table-main} and \cref{tab:table-gam-fit})\ when refitting the values from the Rosetta Talaris energy function\cite{song_structure-guided_2011,shapovalov_smoothed_2011,omeara_combined_2015}; refitting values from the Rosetta REF energy function\cite{alford_rosetta_2017} leads to a similar increase from 0.63 to 0.68 (\cref{fig:t14-fit-scatter-supp}, \cref{tab:table-ref}, \cref{tab:table-gam-fit}). The correlation coefficient also increases when refitting the values obtained for the no backrub control, but only to 0.62 (\cref{fig:t14-fit-scatter-main}a, \cref{tab:table-gam-fit}).
The fit functions (fit for Talaris-derived \ddg\ predictions) are shown in \cref{fig:t14-fits-feats}. Extreme values for most score terms are downweighted, especially for the fa\_sol and fa\_atr terms, which make the largest contributions to predicted \ddg\ (\cref{fig:tal-GAM-terms-mpl}).
\section{Conclusions}
We have shown on a large, curated benchmark dataset that the ``flex ddG'' method presented here is more accurate than previous methods for estimating changes in binding affinity after mutation in protein-protein interfaces.
Particular improvement in performance is seen on the subset of small-to-large mutations, indicating that representing backbone flexibility using backrub motions is effective in cases where backbone rearrangements are expected to be more common. Other notable improvements over previous methods are seen for stabilizing mutations, mutations in antibody-antigen interfaces, and for cases with multiple changes where none of the mutations is to an alanine residue.
We have also shown that more accurate predictions can be obtained by averaging the predictions across a generated structural ensemble of backrub models, and that the number of required models is relatively low (20--30).
Prior methods that produced \ddg\ predictions by averaging an ensemble of models required on the order of thousands of models \cite{benedix_predicting_2009}, indicating that backrub sampling can efficiently sample the local conformational space around an input wild-type structure that is relevant for interface \ddg\ prediction.
By creating a method that uses backrub to sample conformational space more broadly than minimization alone, while still staying close to the known wild-type input structure, we have also generated data that should prove useful for future energy function improvements.
In particular, using Rosetta's newest REF energy function\cite{alford_rosetta_2017} does not improve the performance of our method compared to the prior Talaris\cite{leaver-fay_chapter_2013,song_structure-guided_2011,shapovalov_smoothed_2011} energy function (\cref{tab:table-ref}), indicating that the backrub sampling parameters might require further benchmarking and adaptation to the REF energy function.
Our error analysis via GAM-like reweighting also indicates potential avenues for energy function improvement by identifying imbalances in predicted energetic contributions leading to overestimation of stabilizing and destabilizing effects.
Further improvements might also be obtained by more explicitly including the effects of altering water-mediated interactions\cite{lai_enhancing_2017} and of conformational entropy\cite{hu_protein_2006,guerois_predicting_2002}, as well as by considering the commonly observed shortcomings of energy functions balancing the magnitudes of electrostatic interactions and desolvation costs.
We expect energy function improvements to require more accurate representation of subtle conformational changes, as these changes can have a considerable impact on design predictions\cite{dou_sampling_2017}.