Uncertainty #7

katosh · 2023-08-09T17:46:51Z

The core objective of this PR is to introduce uncertainty estimation into Mellon's primary results.

New Features

`with_uncertainty` Parameter

Integrates a boolean parameter with_uncertainty across all estimators: DensityEstimator, TimeSensitiveDensityEstimator, FunctionEstimator, and DimensionalityEstimator. It modifies the fitted predictor, accessible via the .predict property, to include the following methods:

.covariance(X): Calculates the (co-)variance of the posterior Gaussian Process (GP).
- Almost 0 near landmarks; grows for out-of-sample locations.
- Increases with sparsity.
- Defaults to diag=True, computing only the covariance matrix diagonal.
.mean_covariance(X): Computes the (co-)variance through the uncertainty of the mean function's GP posterior.
- Derived from Bayesian inference for latent density function representation.
- Increases in low data or low-density areas.
- Only available with posterior uncertainty quantification, e.g., optimizer='advi' except for the FunctionEstimator where input uncertainty is specified through the sigma parameter.
- Defaults to diag=True, computing only the covariance matrix diagonal.
.uncertainty(X): Combines .covariance(X) and .mean_covariance(X).
- Defaults to diag=True, computing only the covariance matrix diagonal.
- Square root provides standard deviation.

`gp_type` Parameter

Introduces the gp_type parameter to all relevant estimators to explicitly specify the Gaussian Process (GP) sparsification strategy, replacing the previously used method argument (with options auto, fixed, and percent) that implicitly controlled sparsification. The available options for gp_type include:

'full': Non-sparse GP.
'full_nystroem': Sparse GP with Nyström rank reduction, lowering computational complexity.
'sparse_cholesky': SParse GP using landmarks/inducing points.
'sparse_nystroem': Improved Nyström rank reduction on sparse GP with landmarks, balancing accuracy and efficiency.

This new parameter adds additional validation steps, ensuring that no contradictory parameters are specified. If inconsistencies are detected, a helpful reply guides the user on how to fix the issue. The value can be either a string matching one of the options above or an instance of the mellon.parameters.GaussianProcessType Enum. Partial matches log a warning, using the closest match. Defaults to 'sparse_cholesky'.

Note: Nyström strategies are not applicable to the FunctionEstimator.

`y_is_mean` Parameter

Adds a boolean parameter y_is_mean to FunctionEstimator, affecting how y values are interpreted:

Old Behavior: sigma impacted conditional mean functions and predictions.
Intermediate Behavior: sigma only influenced prediction uncertainty.
New Parameter: If y_is_mean=True, y values are treated as a fixed mean; sigma reflects only uncertainty. If y_is_mean=False, y is considered a noisy measurement, potentially smoothing values at locations x.

This change benefits DensityEstimator, TimeSensitiveDensityEstimator, and DimensionalityEstimator where function values are predicted for out-of-sample locations after mean GP computation.

`check_rank` Parameter

Introduces the check_rank parameter to all relevant estimators. This boolean parameter explicitly controls whether the rank check is performed, specifically in the gp_type="sparse_cholesky" case. The rank check assesses the chosen landmarks for adequate complexity by examining the approximate rank of the covariance matrix, issuing a warning if insufficient. Allowed values are:

True: Always perform the check.
False: Never perform the check.
None (Default): Perform the check only if n_landmarks is greater than or equal to n_samples divided by 10.

The default setting aims to bypass unnecessary computation when the number of landmarks is so abundant that insufficient complexity becomes improbable.

`normalize` Parameter

The normalize parameter is applicable to both the .mean method and .__call__ method within the mellon.Predictor class. When set to True, these methods will subtract log(number of observations) from the value returned. This feature is particularly useful with the DensityEstimator, where normalization adjusts for the number of cells in the training sample, allowing for accurate density comparisons between datasets. This correction takes into account the effect of dataset size, ensuring that differences in total cell numbers are not unduly influential. By default, the parameter is set to False, meaning that density differences due to variations in total cell number will remain uncorrected.

`normalize_per_time_point` Parameter

This parameter fine-tunes the TimeSensitiveDensityEstimator to handle variations in sampling bias across different time points, ensuring both continuity and differentiability in the resulting density estimation. Notably, it also allows to reflect the growth of a population even if the same number of cells were sampled from each time point.

The normalization is realized by manipulating the nearest neighbor distances
nn_distances to reflect the deviation from an expected cell count.

Type: Optional, accepts bool, list, array-like, or dict.

Options:

True: Normalizes to emulate an even distribution of total cell count across all time points.
False: Retains raw cell counts at each time point for density estimation.
List/Array-like: Specifies an ordered sequence of total cell count targets for each time point, starting with the earliest.
Dict: Associates each unique time point with a specific total cell count target.

Notes:

Relative Metrics: While this parameter adjusts for sample bias, it only requires relative cell counts for comparisons within the dataset; exact counts are not mandatory.
nn_distance Precedence: If nn_distance is supplied, this parameter will be bypassed, and the provided distances will be used directly.
The default value is False

Enhancements

Optimization by saving the intermediate result Lp in the estimators for reuse, enhancing the speed of the predictive function computation in non-Nyström strategies.
The DimensionalityEstimator.predict now returns a subclass of the mellon.Predictor class instead of a closure. Giving access to serialization and uncertainty computations.
Expanded testing.
propagate logging messages and explicit logger name "mellon" everywhere
extended parameter validation for the estimators now also applies to the compute_L function
better string representation of estimators and predictors
bugfix some edge cases
Revise some documentation (s. b70bb04) and include Predictor page on sphinx doc

Changes

The mellon.Predictor class now has a method .mean that is an alias to .__call__.
All mellon.Predictor sub classes ...ConditionalMean... were renamed to ...Conditional... since they now also compute .covariance and .mean_covariance.
All generating methods for mellon.Predictor were renamed from ...conditional_mean... to conditional.
A new log message now informs that the normalization is not effective d_method != "fractal". Additionally, using normalize=True in the density predictor triggers a warning that one has to use the non default d_method = "fractal" in the DensityEstimator.

codecov · 2023-08-09T17:52:22Z

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (c274b7f) 92.65% compared to head (924aa5a) 97.47%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #7      +/-   ##
==========================================
+ Coverage   92.65%   97.47%   +4.81%     
==========================================
  Files          34       35       +1     
  Lines        2547     3518     +971     
==========================================
+ Hits         2360     3429    +1069     
+ Misses        187       89      -98

Files	Coverage Δ
mellon/__init__.py	`100.00% <100.00%> (ø)`
mellon/_conditional.py	`100.00% <ø> (ø)`
mellon/_inference.py	`100.00% <ø> (ø)`
mellon/_parameters.py	`100.00% <ø> (ø)`
mellon/_util.py	`100.00% <ø> (ø)`
mellon/base_cov.py	`94.40% <100.00%> (+0.38%)`	⬆️
mellon/base_model.py	`93.06% <100.00%> (+8.55%)`	⬆️
mellon/compute_ls_time.py	`100.00% <100.00%> (ø)`
mellon/cov.py	`100.00% <100.00%> (ø)`
mellon/decomposition.py	`91.93% <100.00%> (+8.15%)`	⬆️
... and 18 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

- introduce parameter gp_type - remove parameter method - validate parameter consistency - compute Lp separatly - more testing

katosh · 2023-08-12T04:16:15Z

@ManuSetty The refactoring is done. I made the chages I mentioned and I removed the argument method (with options auto, fixed, and percent) in favor of gp_type that lets the user choose the sparsification method explicitly:

    gp_type : str or GaussianProcessType
        The type of sparcification used for the Gaussian Process
         - 'full' None-sparse Gaussian Process
         - 'full_nystroem' Sparse GP with Nyström rank reduction without landmarks,
            which lowers the computational complexity.
         - 'sparse_cholesky' Sparse GP using landmarks/inducing points,
            typically employed to enable scalable GP models.
         - 'sparse_nystroem' Sparse GP using landmarks or inducing points,
            along with an improved Nyström rank reduction method that balances
            accuracy with efficiency.

        The value can be either a string matching one of the above options or an instance of
        the `mellon.parameters.GaussianProcessType` Enum. If a partial match is found with the
        Enum, a warning will be logged, and the closest match will be used.
        Defaults to 'sparse_cholesky'.

This comes with an additional parameter validation making sure no contradictory parameters are specified.

The key changes are: - **Old Behavior**: `sigma` explained noise on the input data, impacting the conditional mean functions and thus the prediction. This prediction reflected the mean over all functions and all possible inputs given the variability of the input data indicated by `sigma`. - **Intermediate Behavior**: `sigma` now only inflated the uncertainty of the prediction, assuming the input is fixed. The new behavior is more sensible for the `DensityEstimator`, where the uncertainty of the input is quantified only by ADVI inference but the values `y` are already mean values of the GP. - **Introduction of `y_is_mean` Parameter**: A boolean parameter `y_is_mean` is added to the `conditional_mean` that computes the predictive function of the GP. If `y_is_mean=True`, the `y` values are considered a fixed mean, and `sigma` only reflects the uncertainty estimate. If `y_is_mean=False`, the values `y` are treated as a noisy measurement, leading to a potentially smoothed value at corresponding locations `x`. This update brings clarity to the treatment of values we condition on in the Gaussian Process and allows controlling if they are seen as the mean of a Gaussian Process or noisy measurements. It promotes better alignment with the underlying statistical principles and the requirements of the `DensityEstimator` and `FunctionEstimator` ensuring backwards compatibility.

katosh · 2023-08-18T00:57:33Z

Commit 27b7d63 resolves a major ambiguity harmonizing the new uncertainty computation for the DensityEstimator with the noise input handling of the FunctionEstimator.

…r naming

katosh added 7 commits August 8, 2023 16:10

implement cov matrix diagonal computation

26902e7

decomposition wth sigma

61f2646

implement predictor uncertanty estimation

f605615

replace deprecated function in testing

e5be987

avoid recomputation of Lp and generalize uncertainty computation

170dfda

make dim estimator return predictor instance

27a912a

bugfix: advi stds were in logarithmized

78070b6

katosh added 3 commits August 9, 2023 11:54

save distance computation in case of identical cells

95641f7

fix uncertainty typo

ea204c4

Refactoring, version bump 1.3.1 -> 1.4.0

ef5dbed

- introduce parameter gp_type - remove parameter method - validate parameter consistency - compute Lp separatly - more testing

katosh added 11 commits August 12, 2023 00:44

more testing

d6610bb

lint doc config

1cc75d9

validation doc and array positivity

459e5ef

make sure automaticly ls is used as ls in cov_func_curry

92201fb

default alpha=1.0 in cov.RatQuad

6c3609b

avoid passing GaussianProcessType to _validate_params

2cd6bbf

log choice of gp_type

2b66e05

trajectory-trends_tutorial compatibility with more palantir versions

8bf8601

statement on ls in gene-trend notebook

eb622f3

Merge branch 'main' into uncertainty

deeadd2

time point normalization warning

2664fdd

katosh force-pushed the main branch from eb622f3 to ad86892 Compare August 17, 2023 21:14

katosh added 3 commits August 17, 2023 14:15

Merge branch 'main' into uncertainty

b9e2a29

error if uncertainty requested without advi

754b3be

katosh added 2 commits August 17, 2023 18:10

test covariances and mean_covariance methods

9d12ec7

test more validation cases

93af527

katosh added 13 commits August 28, 2023 16:18

backwards compatibility with v1.3 serialized predictors

741a07e

more doc and feedback on n_obs

784c36e

update GaussianProcessType reference in error message

59ff524

better string representation of estimators & consistent ls_time_facto…

318864a

…r naming

document n_obs and n_input_features in Predictor

12664df

"sphinx_github_style" only for readthedocs

8303a3b

implement per-time-point normalization

e442be9

typing compatibility with python 3.8

d5a288d

test backards compatible predictor loading

a0bfb98

test failure when n_obs is missing

334cd8e

normalize in time-sensitive predictor by average cells per time point

e195414

corret code target of error tests

23de93a

better representation of Predictor

1874c15

katosh force-pushed the uncertainty branch from b8899e6 to 1874c15 Compare August 30, 2023 03:46

katosh added 3 commits August 29, 2023 21:20

allow ExpPredictor to return log scale mean

310bc7a

enable testing of DimensionalityEstimator uncertainty

fef82a9

Predictor tab in sphinx doc

b70bb04

katosh force-pushed the uncertainty branch from 7107cd0 to b70bb04 Compare August 30, 2023 06:57

katosh added 6 commits August 30, 2023 00:12

refernece attributes instea of methods

e3ea0d6

mention multi_time in all docstrings

222ef83

improve multi_time doc

7d66c20

remove dublication in doc

56434e2

more refences to documentation

99bb40f

log info about normalization

cae38a9

katosh force-pushed the uncertainty branch from 81f7b8b to cae38a9 Compare October 3, 2023 18:23

katosh marked this pull request as ready for review October 3, 2023 18:29

katosh added 3 commits October 3, 2023 11:34

warning when trying to normalize

251d0e7

avoid deprecated logger.warn(...)

4691f2f

add CHANGELOG.md

924aa5a

katosh merged commit 64af1ce into main Oct 3, 2023
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uncertainty #7

Uncertainty #7

katosh commented Aug 9, 2023 •

edited

Loading

codecov bot commented Aug 9, 2023 •

edited

Loading

katosh commented Aug 12, 2023

katosh commented Aug 18, 2023

Uncertainty #7

Uncertainty #7

Conversation

katosh commented Aug 9, 2023 • edited Loading

New Features

with_uncertainty Parameter

gp_type Parameter

y_is_mean Parameter

check_rank Parameter

normalize Parameter

normalize_per_time_point Parameter

Options:

Notes:

Enhancements

Changes

codecov bot commented Aug 9, 2023 • edited Loading

Codecov Report

katosh commented Aug 12, 2023

katosh commented Aug 18, 2023

katosh commented Aug 9, 2023 •

edited

Loading

`with_uncertainty` Parameter

`gp_type` Parameter

`y_is_mean` Parameter

`check_rank` Parameter

`normalize` Parameter

`normalize_per_time_point` Parameter

codecov bot commented Aug 9, 2023 •

edited

Loading