
[RFC] Sklearn regression #407

Open
wants to merge 9 commits into base: sklearn_api
Conversation

@thomaspinder (Collaborator) commented Nov 5, 2023

Type of changes

  • Bug fix
  • New feature
  • Documentation / docstrings
  • Tests
  • Other

Checklist

  • I've formatted the new code by running poetry run pre-commit run --all-files --show-diff-on-failure before committing.
  • I've added tests for new code.
  • I've added docstrings for the new code.

Description

Opening a draft PR to receive feedback on the overall API design. Once this has been agreed upon, tests and more detailed typing will be added to the PR, along with functionality for classification and optimisation. To prevent a single monstrous PR, I'll open one PR each for regression, classification, and optimisation, each targeting the sklearn_api branch, which can eventually be merged into main.

This PR introduces an SKLearn API for GPJax that allows users to invoke GP modelling through .fit and .predict calls, mirroring the scikit-learn API. It also adds the ability to score a fitted GP model. In this PR, I would like comments on the high-level design choices and suggestions for how they can be improved.

Issue Number: N/A
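For concreteness, here is a hedged sketch of the full workflow, assembled from the snippets reviewed below; the gpx.sklearn names and constructor arguments are provisional and may change as this RFC evolves.

import gpjax as gpx
import jax.numpy as jnp
import jax.random as jr
from sklearn.metrics import mean_squared_error

# Toy 1D regression data, purely for illustration.
key = jr.PRNGKey(123)
x = jnp.linspace(-3.0, 3.0, 50).reshape(-1, 1)
y = jnp.sin(x)
xtest = jnp.linspace(-3.5, 3.5, 100).reshape(-1, 1)
ytest = jnp.sin(xtest)

# Constructor arguments are assumed here; the RFC has not fixed them yet.
model = gpx.sklearn.GPJaxRegressor(kernel=gpx.kernels.RBF())
model.fit(x, y, key=key)        # key-passing is itself debated further down
ypred = model.predict(xtest)

# Scoring, as exercised in the notebook diff reviewed below.
model.score(xtest, ytest, gpx.sklearn.SKLearnScore("mse", mean_squared_error))
model.score(x, y, gpx.sklearn.LogPredictiveDensity())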

@thomaspinder thomaspinder added the enhancement New feature or request label Nov 5, 2023
@thomaspinder thomaspinder added this to the v1.0.0 milestone Nov 5, 2023
@thomaspinder thomaspinder self-assigned this Nov 5, 2023
Comment on lines +35 to +37
#
# We store our data $\mathcal{D}$ as a GPJax `Dataset` and create test inputs and labels
# for later.
Member
We aren't using the Dataset abstraction in this example.

Collaborator Author
Oops, copy-and-paste hangover. Thanks!

Comment on lines +70 to +71
# %%
model.fit(x, y, key=key)
Member

Nice. Just to play devil's advocate here: do we want to be JAX'y or sklearn'y with regard to random_state? I.e., in an sklearn regressor you would provide an integer (seed-like) random_state input. Do we want to import a key from jax.random and follow its key-passing API, or do we just want model.fit(x, y) with an argument random_state: int = 0 or something?

Collaborator Author

Hmm, straight out of the gate, I lean towards using a key, just as it guarantees reproducibility inside larger codebases. What are your thoughts?

Contributor

Fitting a simple GP shouldn't need any randomness though?

I would argue that if you want to implement the sklearn API, you should actually implement the API, which means fit should accept only X and y. Otherwise, it wouldn't actually be possible to write code that takes an arbitrary model and fits it (and does some more stuff with it). If there are any parameters - like a random state - that need to be passed in, these should go in the constructor of the sklearn-style model class.
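A minimal sketch of that suggestion, assuming an sklearn BaseEstimator-based GPJaxRegressor whose internals are illustrative rather than the PR's actual implementation:

import jax.random as jr
from sklearn.base import BaseEstimator, RegressorMixin

class GPJaxRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, kernel=None, random_state: int = 0):
        self.kernel = kernel
        self.random_state = random_state  # seed-like, per sklearn convention

    def fit(self, X, y):
        # Derive the JAX PRNG key internally so `fit` keeps the bare (X, y)
        # signature that generic sklearn tooling expects.
        key = jr.PRNGKey(self.random_state)
        ...  # hyperparameter optimisation using `key` (omitted)
        return self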

Comment on lines +89 to +90
model.score(xtest, ytest, gpx.sklearn.SKLearnScore("mse", mean_squared_error))
model.score(x, y, gpx.sklearn.LogPredictiveDensity())
Member
Looks clean!

Comment on lines +1 to +21
from dataclasses import dataclass


@dataclass
class AbstractStrategy:
pass


@dataclass
class ExactInference(AbstractStrategy):
pass


@dataclass
class VariationalInference(AbstractStrategy):
pass


@dataclass
class MCMCInference(AbstractStrategy):
pass
Member

Is this file used?

Collaborator Author

Not yet. Tbh, I accidentally committed this.

# %% [markdown]
# ## Model building
#
# We'll now proceed to build our model. Within the SKLearn API we have three main classes: `GPJaxRegressor`, `GPJaxClassifier`, and `GPJaxOptimizer`/`GPJaxOptimiser`. We'll consider a problem where the response is continuous and so we'll use the `GPJaxRegressor` class. The problem is identical to the one considered in the [Regression notebook](regression.py); however, we'll now use the SKLearn API to build our model. This offers an alternative to the lower-level API and is designed to be similar to the API of [scikit-learn](https://scikit-learn.org/stable/).
Member

Does GPJaxClassifier exist?

Collaborator Author

Not yet. Will implement once we’re aligned on an API.

Comment on lines +13 to +16
class GPJaxOptimizer(BaseEstimator):
kernel: AbstractKernel
mean_function: AbstractMeanFunction = None
n_inducing: int = -1
Member

What is the intended usage for the GPJaxOptimizer? (Why do we have it?)

Collaborator Author

Once we’re happy with an API, I see it being analogous to SciPy’s minimize function.
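For reference, the SciPy pattern being alluded to, alongside a purely hypothetical GPJaxOptimizer counterpart (the commented calls are speculative and not part of this PR):

from scipy.optimize import minimize

# scipy.optimize.minimize: supply an objective and a starting point,
# receive the located minimiser back.
result = minimize(lambda x: (x[0] - 2.0) ** 2, x0=[0.0])  # result.x ~ [2.0]

# A GPJaxOptimizer analogue might fit a GP surrogate to observed (x, f(x))
# pairs and return the surrogate's minimiser. Hypothetical only:
# opt = GPJaxOptimizer(kernel=gpx.kernels.RBF())
# xmin = opt.minimize(objective, bounds=...)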

Comment on lines 39 to 41
sparse_threshold: Optional[int] = 2000
stochastic_threshold: Optional[int] = 20000
min_num_inducing: Optional[int] = 100
Member

Out of interest, how were these numbers chosen? E.g., in some of the documentation we suggest that exact regression is still fine up to 5,000 datapoints before switching to sparse strategies.

Collaborator Author

Ad hoc for now. Happy to bump the default to 5,000.
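A hedged sketch of how these thresholds might route between inference strategies, using the bumped 5,000 default; the actual dispatch logic in the PR may differ:

def select_strategy(n_data: int,
                    sparse_threshold: int = 5000,
                    stochastic_threshold: int = 20000) -> str:
    # Pick an inference strategy from the dataset size (illustrative).
    if n_data <= sparse_threshold:
        return "exact"        # full GP regression is still tractable
    if n_data <= stochastic_threshold:
        return "sparse"       # switch to an inducing-point approximation
    return "stochastic"       # minibatched/stochastic variational inference

select_strategy(1_000)   # -> "exact"
select_strategy(10_000)  # -> "sparse"
select_strategy(50_000)  # -> "stochastic"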

thomaspinder and others added 2 commits November 6, 2023 06:18
Labels: enhancement (New feature or request)
Projects: None yet
3 participants