Uncertainty #7
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #7 +/- ##
==========================================
+ Coverage 92.65% 97.47% +4.81%
==========================================
Files 34 35 +1
Lines 2547 3518 +971
==========================================
+ Hits 2360 3429 +1069
+ Misses 187 89 -98
- introduce parameter gp_type
- remove parameter method
- validate parameter consistency
- compute Lp separately
- more testing
@ManuSetty The refactoring is done. I made the changes I mentioned and I removed the argument
gp_type : str or GaussianProcessType
The type of sparsification used for the Gaussian Process
- 'full' Non-sparse Gaussian Process
- 'full_nystroem' Sparse GP with Nyström rank reduction without landmarks,
which lowers the computational complexity.
- 'sparse_cholesky' Sparse GP using landmarks/inducing points,
typically employed to enable scalable GP models.
- 'sparse_nystroem' Sparse GP using landmarks or inducing points,
along with an improved Nyström rank reduction method that balances
accuracy with efficiency.
The value can be either a string matching one of the above options or an instance of
the `mellon.parameters.GaussianProcessType` Enum. If a partial match is found with the
Enum, a warning will be logged, and the closest match will be used.
Defaults to 'sparse_cholesky'.
This comes with an additional parameter validation making sure no contradictory parameters are specified.
The key changes are:
- **Old Behavior**: `sigma` explained noise on the input data, impacting the conditional mean functions and thus the prediction. This prediction reflected the mean over all functions and all possible inputs given the variability of the input data indicated by `sigma`.
- **Intermediate Behavior**: `sigma` now only inflated the uncertainty of the prediction, assuming the input is fixed. The new behavior is more sensible for the `DensityEstimator`, where the uncertainty of the input is quantified only by ADVI inference but the values `y` are already mean values of the GP.
- **Introduction of `y_is_mean` Parameter**: A boolean parameter `y_is_mean` is added to the `conditional_mean` that computes the predictive function of the GP. If `y_is_mean=True`, the `y` values are considered a fixed mean, and `sigma` only reflects the uncertainty estimate. If `y_is_mean=False`, the values `y` are treated as a noisy measurement, leading to a potentially smoothed value at corresponding locations `x`.

This update brings clarity to the treatment of values we condition on in the Gaussian Process and allows controlling if they are seen as the mean of a Gaussian Process or noisy measurements. It promotes better alignment with the underlying statistical principles and the requirements of the `DensityEstimator` and `FunctionEstimator`, ensuring backwards compatibility.
Commit 27b7d63 resolves a major ambiguity harmonizing the new uncertainty computation for the
The core objective of this PR is to introduce uncertainty estimation into Mellon's primary results.
New Features
`with_uncertainty` Parameter
Integrates a boolean parameter `with_uncertainty` across all estimators: DensityEstimator, TimeSensitiveDensityEstimator, FunctionEstimator, and DimensionalityEstimator. It modifies the fitted predictor, accessible via the `.predict` property, to include the following methods:
- `.covariance(X)`: Calculates the (co-)variance of the posterior Gaussian Process (GP). With `diag=True`, only the covariance matrix diagonal is computed.
- `.mean_covariance(X)`: Computes the (co-)variance through the uncertainty of the mean function's GP posterior. Requires `optimizer='advi'`, except for the `FunctionEstimator`, where input uncertainty is specified through the `sigma` parameter. With `diag=True`, only the covariance matrix diagonal is computed.
- `.uncertainty(X)`: Combines `.covariance(X)` and `.mean_covariance(X)`. With `diag=True`, only the covariance matrix diagonal is computed.
`gp_type` Parameter
Introduces the `gp_type` parameter to all relevant estimators to explicitly specify the Gaussian Process (GP) sparsification strategy, replacing the previously used `method` argument (with options auto, fixed, and percent) that implicitly controlled sparsification. The available options for `gp_type` include:
- 'full': Non-sparse Gaussian Process.
- 'full_nystroem': Sparse GP with Nyström rank reduction without landmarks, which lowers the computational complexity.
- 'sparse_cholesky': Sparse GP using landmarks/inducing points, typically employed to enable scalable GP models.
- 'sparse_nystroem': Sparse GP using landmarks or inducing points, along with an improved Nyström rank reduction method that balances accuracy with efficiency.
This new parameter adds validation steps, ensuring that no contradictory parameters are specified. If inconsistencies are detected, a helpful message guides the user on how to fix the issue. The value can be either a string matching one of the options above or an instance of the `mellon.parameters.GaussianProcessType` Enum. Partial matches log a warning, using the closest match. Defaults to 'sparse_cholesky'.
Note: Nyström strategies are not applicable to the FunctionEstimator.
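The string-or-Enum handling with partial matching could look roughly like the sketch below. This is not Mellon's implementation: the enum is a stand-in defined here for illustration, and the helper `resolve_gp_type`, its use of `difflib`, and the 0.5 cutoff are all assumptions.

```python
import difflib
import logging
from enum import Enum

logger = logging.getLogger("gp_type_sketch")


class GaussianProcessType(Enum):
    """Stand-in enum for illustration; values follow the options listed above."""

    FULL = "full"
    FULL_NYSTROEM = "full_nystroem"
    SPARSE_CHOLESKY = "sparse_cholesky"
    SPARSE_NYSTROEM = "sparse_nystroem"


def resolve_gp_type(value="sparse_cholesky"):
    """Hypothetical helper: map a string or enum member to an enum member."""
    if isinstance(value, GaussianProcessType):
        return value
    options = [member.value for member in GaussianProcessType]
    if value in options:
        return GaussianProcessType(value)
    # Partial match: fall back to the closest option and log a warning.
    close = difflib.get_close_matches(str(value).lower(), options, n=1, cutoff=0.5)
    if not close:
        raise ValueError(f"Unknown gp_type {value!r}; expected one of {options}.")
    logger.warning("gp_type %r is not an exact match; using %r.", value, close[0])
    return GaussianProcessType(close[0])
```

Under these assumptions, `resolve_gp_type("sparse_nystrom")` would log a warning and resolve to the `sparse_nystroem` member, while a string with no close option raises a `ValueError`.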
`y_is_mean` Parameter
Adds a boolean parameter `y_is_mean` to FunctionEstimator, affecting how `y` values are interpreted:
- Old behavior: `sigma` impacted conditional mean functions and predictions.
- Intermediate behavior: `sigma` only influenced prediction uncertainty.
- With `y_is_mean`: If `y_is_mean=True`, `y` values are treated as a fixed mean; `sigma` reflects only uncertainty. If `y_is_mean=False`, `y` is considered a noisy measurement, potentially smoothing values at locations `x`.
This change benefits DensityEstimator, TimeSensitiveDensityEstimator, and DimensionalityEstimator where function values are predicted for out-of-sample locations after mean GP computation.
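To make the two interpretations of `y` concrete, here is a small NumPy sketch of a GP posterior mean. It is an illustration under assumptions, not Mellon's code: the RBF kernel, the jitter value, and the function names `rbf` and `conditional_mean` are made up for this example, and treating `y_is_mean=True` as noise-free conditioning is one reading of the description above.

```python
import numpy as np


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D point sets (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def conditional_mean(x, y, x_new, sigma=0.0, y_is_mean=False, jitter=1e-8):
    """Hypothetical GP posterior mean at x_new given values y at x."""
    K = rbf(x, x) + jitter * np.eye(len(x))
    if not y_is_mean:
        # y is a noisy measurement: the noise variance enters the solve
        # and pulls the prediction towards the prior mean (zero here).
        K = K + sigma**2 * np.eye(len(x))
    # With y_is_mean=True, sigma is left out of the mean entirely and
    # would only enter a separate covariance computation (not shown).
    weights = np.linalg.solve(K, y)
    return rbf(x_new, x) @ weights


x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, -1.0, 1.0, -1.0])

# Predict back at the training locations with the same sigma in both modes.
fixed = conditional_mean(x, y, x, sigma=0.5, y_is_mean=True)
noisy = conditional_mean(x, y, x, sigma=0.5, y_is_mean=False)
```

In this sketch `fixed` reproduces `y` at the training locations up to the jitter, whereas `noisy` is shrunk towards zero, which is the smoothing effect mentioned for `y_is_mean=False`.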
`check_rank` Parameter
Introduces the `check_rank` parameter to all relevant estimators. This boolean parameter explicitly controls whether the rank check is performed, specifically in the `gp_type="sparse_cholesky"` case. The rank check assesses the chosen landmarks for adequate complexity by examining the approximate rank of the covariance matrix, issuing a warning if insufficient. Allowed values are:
- `True`: Always perform the check.
- `False`: Never perform the check.
- `None` (Default): Perform the check only if `n_landmarks` is greater than or equal to `n_samples` divided by 10.
The default setting aims to bypass unnecessary computation when the number of landmarks is so abundant that insufficient complexity becomes improbable.
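The three allowed values can be summarized as a small decision function. The helper name `should_check_rank` is hypothetical and only restates the rule as written above; it is not taken from Mellon's source.

```python
def should_check_rank(check_rank, n_landmarks, n_samples):
    """Hypothetical helper mirroring the three allowed values of check_rank."""
    if check_rank is None:
        # Default: check only if n_landmarks >= n_samples / 10.
        return n_landmarks >= n_samples / 10
    # True always checks, False never does.
    return bool(check_rank)


# Explicit settings override the default rule.
always = should_check_rank(True, n_landmarks=50, n_samples=10_000)
never = should_check_rank(False, n_landmarks=5_000, n_samples=10_000)

# With the default, the ratio of landmarks to samples decides.
default_few = should_check_rank(None, n_landmarks=50, n_samples=10_000)
default_many = should_check_rank(None, n_landmarks=1_000, n_samples=10_000)
```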
`normalize` Parameter
The `normalize` parameter is applicable to both the `.mean` method and `.__call__` method within the mellon.Predictor class. When set to `True`, these methods will subtract `log(number of observations)` from the value returned. This feature is particularly useful with the DensityEstimator, where normalization adjusts for the number of cells in the training sample, allowing for accurate density comparisons between datasets. This correction takes into account the effect of dataset size, ensuring that differences in total cell numbers are not unduly influential. By default, the parameter is set to `False`, meaning that density differences due to variations in total cell number will remain uncorrected.
`normalize_per_time_point` Parameter
This parameter fine-tunes the `TimeSensitiveDensityEstimator` to handle variations in sampling bias across different time points, ensuring both continuity and differentiability in the resulting density estimation. Notably, it also makes it possible to reflect the growth of a population even if the same number of cells was sampled from each time point.
The normalization is realized by manipulating the nearest neighbor distances `nn_distances` to reflect the deviation from an expected cell count.
Accepted types: `bool`, `list`, `array-like`, or `dict`.
Options:
- `True`: Normalizes to emulate an even distribution of total cell count across all time points.
- `False`: Retains raw cell counts at each time point for density estimation.
Notes:
- `nn_distances` Precedence: If `nn_distances` is supplied, this parameter will be bypassed, and the provided distances will be used directly.
- `False`
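The effect of the `normalize` option on the predictor can be illustrated with plain NumPy, without calling Mellon. The array values and the helper name `normalize_log_density` are invented for this sketch; only the subtraction of `log(number of observations)` comes from the description.

```python
import numpy as np


def normalize_log_density(log_density, n_obs):
    """Hypothetical helper: subtract log(n_obs) from log-density values."""
    return np.asarray(log_density, dtype=float) - np.log(n_obs)


# Two made-up datasets with the same shape but a tenfold size difference.
# The larger one has log-densities shifted up by log(10).
small = np.array([-2.0, -1.0, -3.0])
large = small + np.log(10.0)

small_norm = normalize_log_density(small, n_obs=1_000)
large_norm = normalize_log_density(large, n_obs=10_000)
```

After the correction the two sets of values coincide, so any remaining difference between datasets would reflect the shape of the density rather than the total number of cells.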
Enhancements
- `Lp` is kept in the estimators for reuse, enhancing the speed of the predictive function computation in non-Nyström strategies.
- `DimensionalityEstimator.predict` now returns a subclass of the `mellon.Predictor` class instead of a closure, giving access to serialization and uncertainty computations.
- `compute_L` function
Changes
- `.mean` is an alias to `.__call__`.
- `...ConditionalMean...` were renamed to `...Conditional...` since they now also compute `.covariance` and `.mean_covariance`.
- `...conditional_mean...` were renamed to `conditional`.
- `d_method != "fractal"`. Additionally, using `normalize=True` in the density predictor triggers a warning that one has to use the non-default `d_method = "fractal"` in the `DensityEstimator`.