diff --git a/Gaussian_Processes_colab.ipynb b/Gaussian_Processes_colab.ipynb new file mode 100644 index 0000000..8acf2f6 --- /dev/null +++ b/Gaussian_Processes_colab.ipynb @@ -0,0 +1,1528 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "id": "dense-anthony", + "metadata": { + "id": "dense-anthony" + }, + "source": [ + "
\n", + "

Tutorial 4

\n", + "

Gaussian Processes

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "careful-fashion", + "metadata": { + "id": "careful-fashion" + }, + "source": [ + "## Overview\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "liked-canyon", + "metadata": { + "id": "liked-canyon" + }, + "source": [ + "
\n", + "\n", + "

Gaussian Processes

\n", + "\n", + "This tutorial is based on work done by Ollie Pollard on using Gaussian Processes to predict sea-level rise. Following the steps outlined in this [visual article](https://distill.pub/2019/visual-exploration-gaussian-processes/). Gaussian processes are often used to make predictions about our data by incorporating prior knowledge often to fit a function to a data set. For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The mean of this probability distribution then represents the most probable characterization of the data.\n", + "\n", + "\n", + " \n", + "## Recommended reading\n", + "\n", + "* [Overview of Linear Regression](https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86)\n", + "* [An Intuative Guide to Gaussian Processes](https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d)\n", + " \n", + "
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "hispanic-recall", + "metadata": { + "id": "hispanic-recall" + }, + "source": [ + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "

Machine Learning Theory

\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "This tutorial is mainly focusing on using python to explore Gaussian Processes. Please read the full [visual aritcle](https://distill.pub/2019/visual-exploration-gaussian-processes/) for a more in-depth explanation.\n", + " \n", + "The [gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) forms the building blocks of Gaussian Processes.\n", + " $\\displaystyle f(x) ={\\frac {1}{\\sigma {\\sqrt {2\\pi }}}}e^{-{\\frac {1}{2}}\\left({\\frac {x-\\mu }{\\sigma }}\\right)^{2}}$\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "For Machine Learning Gaussian Processes we are interested in the [multivariate case](https://en.wikipedia.org/wiki/Multivariate_normal_distribution). The multivariate Gaussian distribution is defined by a mean vector $\\mu$ (the expected value of the distribution) and a covariance matrix $\\Sigma$. $\\Sigma$ models the variance along each dimension and determines how the different random variables are correlated. If $X$ follows a normal distribution:\n", + " \n", + "\\begin{equation} X = \\begin{bmatrix} X_1 \\\\ X_2 \\\\ \\vdots \\\\ X_n \\end{bmatrix} \\sim \\mathcal{N}(\\mu, \\Sigma) \\end{equation}\n", + "\n", + "The covariance matrix $\\Sigma$ describes the shape of the distribution. It is defined in terms of the expected value $E$\n", + " \n", + "\\begin{equation} \n", + "\\Sigma = \\text{Cov}(X_i, X_j) = E \\left[ (X_i - \\mu_i)(X_j - \\mu_j)^T \\right]\n", + "\\end{equation}\n", + "\n", + "Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable.\n", + "Marginalization and conditioning both work on subsets of the original distribution and we will use the following notation:\n", + "\n", + "\\begin{equation}\n", + "P_{X,Y} = \\begin{bmatrix} X \\\\ Y \\end{bmatrix} \\sim \\mathcal{N}(\\mu, \\Sigma) = \\mathcal{N} \\left( \\begin{bmatrix} \\mu_X \\\\ \\mu_Y \\end{bmatrix}, \\begin{bmatrix} \\Sigma_{XX} \\, \\Sigma_{XY} \\\\ \\Sigma_{YX} \\, \\Sigma_{YY} \\end{bmatrix} \\right) \n", + "\\end{equation}\n", + "\n", + "With $X$ and $Y$ representing subsets of original random variables. \n", + "Through marginalization, we can extract partial information from multivariate probability distributions. In particular, given a normal probability distribution $P(X, Y)$ over vectors of random variables $X$ and $Y$, we can determine their marginalized probability distributions in the following way:\n", + " \n", + "\\begin{equation}\n", + "\\begin{aligned}\n", + "X \\sim \\mathcal{N}(\\mu_X, \\Sigma_{XX}) \\\\\n", + "Y \\sim \\mathcal{N}(\\mu_Y, \\Sigma_{YY})\n", + "\\end{aligned}\n", + "\\end{equation} \n", + " \n", + "The interpretation of this equation is that each partition $X$ and $Y$ only depends on its corresponding entries in $\\mu$ and $\\Sigma$. 
To marginalize a random variable from a Gaussian distribution we can simply drop the variables from $\\mu$ and $\\Sigma$.\n", + " \n", + "\\begin{equation} \n", + "p_X(x) = \\int_y p_{X,Y}(x,y)dy = \\int_y p_{X|Y}(x|y) p_Y(y) dy\n", + "\\end{equation}\n", + " \n", + "The way to interpret this equation is that if we are interested in the probability density of $X=x$, we need to consider all possible outcomes of $Y$ that can jointly lead to the result.\n", + "\n", + "Another important operation for Gaussian processes is conditioning. It is used to determine the probability of one variable depending on another variable. Similar to marginalization, this operation is also closed and yields a modified Gaussian distribution. This operation is the cornerstone of Gaussian processes since it allows Bayesian inference. Conditioning is defined by:\n", + " \n", + "\\begin{equation} \n", + "\\begin{aligned}\n", + "X|Y \\sim \\mathcal{N}(\\:\\mu_X + \\Sigma_{XY}\\Sigma_{YY}^{-1}(Y - \\mu_Y),\\: \\Sigma_{XX}-\\Sigma_{XY}\\Sigma_{YY}^{-1}\\Sigma_{YX}\\:) \\\\\n", + "Y|X \\sim \\mathcal{N}(\\:\\mu_Y + \\Sigma_{YX}\\Sigma_{XX}^{-1}(X - \\mu_X),\\: \\Sigma_{YY}-\\Sigma_{YX}\\Sigma_{XX}^{-1}\\Sigma_{XY}\\:) \\\\\n", + "\\end{aligned}\n", + "\\end{equation}\n", + "\n", + "The new mean only depends on the conditioned variable, while the covariance matrix is independent of this variable.\n", + " \n", + "
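\n",
+    "As a small worked sketch of both operations (illustrative numbers only, not taken from the tutorial data), the formulas above can be applied directly with NumPy:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "# joint 2-D Gaussian over (X, Y)\n",
+    "mu = np.array([0.0, 1.0])                 # [mu_X, mu_Y]\n",
+    "Sigma = np.array([[1.0, 0.8],\n",
+    "                  [0.8, 2.0]])            # [[S_XX, S_XY], [S_YX, S_YY]]\n",
+    "\n",
+    "# marginalization: keep only the entries belonging to X\n",
+    "mu_X, S_XX = mu[0], Sigma[0, 0]\n",
+    "\n",
+    "# conditioning on an observation Y = y\n",
+    "y_obs = 2.0\n",
+    "mu_X_given_Y = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y_obs - mu[1])\n",
+    "S_X_given_Y = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]\n",
+    "print(mu_X_given_Y, S_X_given_Y)\n",
+    "```\n",
+    "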
\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "expired-attribute", + "metadata": { + "id": "expired-attribute" + }, + "source": [ + " \n", + "
\n", + "\n", + "

Python

\n", + "\n", + "## Tensorflow and GPflow\n", + " \n", + "There are many machine learning python libraries available, [TensorFlow](https://www.tensorflow.org/) a is one such library. Throughout this tutorial, you will see some complex machine learning tasks executed in just a few lines of code by calling [GPflow](https://gpflow.readthedocs.io/en/master/) functions which use Tensor flow. If you have GPUs on the machine you are using, these python libraries will automatically use them and run the code even faster!\n", + "\n", + "## Further Reading\n", + " \n", + "* [GPflow example Notebooks](https://gpflow.readthedocs.io/en/develop/notebooks_file.html)\n", + "\n", + "
\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "seasonal-concept", + "metadata": { + "id": "seasonal-concept" + }, + "source": [ + "
\n", + " \n", + "

Requirements

\n", + "\n", + "These notebooks should run on a standard laptop with the correct python environment.\n", + "\n", + "

Python Packages:

\n", + "\n", + "* Python 3.8\n", + "* tensorflow > 2.1\n", + "* gpflow 2.1 *(must be installed via pip to get latest version)*\n", + "* numpy\n", + "* matplotlib\n", + "* plotly\n", + "* scipy\n", + "* pandas\n", + "\n", + "\n", + "

Data Requirements

\n", + " \n", + "This notebook refers to some data included in the git hub repository in the [data](data) folder\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "coral-attempt", + "metadata": { + "id": "coral-attempt" + }, + "source": [ + "\n", + "**Contents:**\n", + "\n", + "1. [Overview of Gausian Processes](#Overview-of-Gausian-Processes)\n", + "2. [Sea Level Example](#Sea-Level-Example)\n", + "1. [Load Data](#Load-Data)\n", + "2. [Normalise Data](#Normalise-Data)\n", + "3. [Plot Data](#Plot-Data)\n", + "4. [Define GP flow model](#Define-GP-flow-model)\n", + "5. [Optamization](#Optamization)\n", + "6. [Prediction](#Prediction)\n", + "7. [Cross Validation](#Cross-Validation)\n", + "8. [Plot High Stands](#Plot-High-Stands)\n" + ] + }, + { + "cell_type": "markdown", + "id": "enabling-steam", + "metadata": { + "id": "enabling-steam" + }, + "source": [ + "
\n", + " \n", + "Load in all required modules and turn off warnings. If you have no [GPU's](https://www.analyticsvidhya.com/blog/2020/09/why-gpus-are-more-suited-for-deep-learning/) available you may see some tensor flow warnings\n", + "\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "inappropriate-magic", + "metadata": { + "id": "inappropriate-magic" + }, + "outputs": [], + "source": [ + "# For readability: disable warnings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "code", + "source": [ + "pip install gpflow" + ], + "metadata": { + "id": "w4mqZ8EWbKUV" + }, + "id": "w4mqZ8EWbKUV", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "affecting-scanner", + "metadata": { + "id": "affecting-scanner" + }, + "outputs": [], + "source": [ + "# import modules\n", + "import gpflow\n", + "import tensorflow as tf\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import plotly.express as px\n", + "\n", + "from gpflow.utilities import print_summary\n", + "from scipy.spatial.distance import cdist\n", + "from scipy.stats import multivariate_normal\n", + "\n", + "gpflow.config.set_default_summary_fmt(\"notebook\")\n", + "plt.style.use(\"seaborn-whitegrid\")" + ] + }, + { + "cell_type": "markdown", + "id": "922ed27c", + "metadata": { + "id": "922ed27c" + }, + "source": [ + "
\n", + "\n", + "The next cell check's for GPUS. If the result is not what you expect check your python installation.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "affccf11", + "metadata": { + "id": "affccf11" + }, + "outputs": [], + "source": [ + "if tf.config.list_physical_devices('GPU'):\n", + " print('GPU found')\n", + " tf.config.list_physical_devices('GPU')\n", + "else:\n", + " print('No GPU found')" + ] + }, + { + "cell_type": "markdown", + "id": "07ce84e8", + "metadata": { + "id": "07ce84e8" + }, + "source": [ + "# Overview of Gaussian Processes\n", + "\n", + "
\n", + "\n", + "Say we want to find an unknown function `y` where $y = x^3 - 9x + e^{(x^3/30)}$ from a random sample of points [`x_samp`, `ysamp`]\n", + " \n", + "**NB function `y` is chosen at random for demonstrative purposes you can change `y` to whatever you like and see similar results!**\n", + " \n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "27c14374", + "metadata": { + "id": "27c14374" + }, + "source": [ + "
\n", + "\n", + "After you've run through the code you might want to re-run with a different choice of `y`\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8440359", + "metadata": { + "id": "d8440359" + }, + "outputs": [], + "source": [ + "# some function y that we're going to try and find\n", + "# set x to between -6 and 6\n", + "x = np.linspace(-6,6,100)\n", + "# A random function chosem\n", + "y = x**3 - 9*x + np.exp(x**3/30)" + ] + }, + { + "cell_type": "markdown", + "id": "05515d9e", + "metadata": { + "id": "05515d9e" + }, + "source": [ + "# generate sample" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86dadabc", + "metadata": { + "id": "86dadabc" + }, + "outputs": [], + "source": [ + "# a random sample of points\n", + "# whole x range is between -6 and 6 and we'll sample x between -2 and 2\n", + "sample_size = 5\n", + "\n", + "samp_index = np.random.randint((len(x)-1)/4, high=(len(x)-1)*3/4, size=sample_size)" + ] + }, + { + "cell_type": "markdown", + "id": "75c75cb9", + "metadata": { + "id": "75c75cb9" + }, + "source": [ + "
\n", + "\n", + "Later if you wish you can uncomment the below cell to see the effects of changing the sample size or sample area (below is set to sample the whole range with a sample size of 8)\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf732d59", + "metadata": { + "id": "bf732d59" + }, + "outputs": [], + "source": [ + "#sample_size = 8\n", + "#samp_index = np.random.randint(0, high=len(x), size=sample_size)" + ] + }, + { + "cell_type": "markdown", + "id": "5f2f9385", + "metadata": { + "id": "5f2f9385" + }, + "source": [ + "
\n", + "\n", + "As we've taken a random sample there's a small chance we might have sampled the same point twice so the below code is going to check we have 5 unique sample points \n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3526ef0a", + "metadata": { + "id": "3526ef0a" + }, + "outputs": [], + "source": [ + "# Check no duplicates in sample index\n", + "\n", + "if len(np.unique(samp_index)) != sample_size :\n", + " print(\"duplicate sample index found please rerun above cell\")\n", + " print(\"the sample will automatically be reset to 5 over the whole range\")\n", + " sample_size = 5\n", + " samp_index = np.random.randint(0, high=len(x), size=sample_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "016c0340", + "metadata": { + "id": "016c0340" + }, + "outputs": [], + "source": [ + "samp_index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b48cff78", + "metadata": { + "id": "b48cff78" + }, + "outputs": [], + "source": [ + "x_samp = x[samp_index]\n", + "y_samp = y[samp_index]\n", + "# Plot Sample points\n", + "fig, ax = plt.subplots(figsize=(10,5))\n", + "point_c = \"#f77f00\"\n", + "ax.scatter(x_samp, y_samp, s=50, c=point_c, zorder=10)\n", + "ax.set_xlim([-6,6])\n", + "ax.set_xlabel(\"x\")\n", + "ax.set_ylabel(\"f(x)\")" + ] + }, + { + "cell_type": "markdown", + "id": "68be05c9", + "metadata": { + "id": "68be05c9" + }, + "source": [ + "
\n", + "\n", + "From these few points, it's not obvious what function `y` could possibly be. So we can use Gaussian Processes to help work out the unknown function from these few points\n", + "\n", + "The covariance matrix $\\Sigma$ is determined by its covariance function $k$, which is often also called the kernel of the Gaussian process. Here we will use the [Radial Basis Function Kernal](https://towardsdatascience.com/radial-basis-function-rbf-kernel-the-go-to-kernel-acf0d22c798a):\n", + " \n", + "\\begin{equation}\n", + " K(X_1,X_2) = exp(-{\\frac{||X_1 - X_2|| ^2}{2\\sigma^2}})\n", + "\\end{equation}\n", + "\n", + "where $\\sigma$ is the variance and hyperparameter and $||X_1 - X_2|| ^2$ is the Euclidean distance between two points\n", + " \n", + "this is defined in the function `rbf_kernel`\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c77c430", + "metadata": { + "id": "9c77c430" + }, + "outputs": [], + "source": [ + "# radial basis function kernel\n", + "def rbf_kernel(x1, x2, var, lscale):\n", + " \"\"\"\n", + " Compute the Euclidean distance between each row of X and X2, or between\n", + " each pair of rows of X if X2 is None and feed it to the kernel.\n", + " \"\"\"\n", + " if x2 is None:\n", + " d = cdist(x1, x1)\n", + " else:\n", + " d = cdist(x1, x2)\n", + " K = var*np.exp(-np.power(d,2)/(lscale**2))\n", + " return K" + ] + }, + { + "cell_type": "markdown", + "id": "82ca3e8d", + "metadata": { + "id": "82ca3e8d" + }, + "source": [ + "
\n", + "\n", + "so `K` can be obtained via `rbf_kernel` and $\\mu$ `mu` is often assumed to be zero as a starting point\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "e90b5f6d", + "metadata": { + "id": "e90b5f6d" + }, + "source": [ + "
\n", + "\n", + "try adjusting `lscale` to see what happens with the results\n", + "e.g. after running through the first time set `lscale=5` and run through all cells again\n", + " \n", + "Increasing the length parameter increases the banding, as points further away from each other become more correlated.\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e2bb5b3", + "metadata": { + "id": "9e2bb5b3" + }, + "outputs": [], + "source": [ + "lscale = 1.0\n", + "K_rbf = rbf_kernel(x.reshape(-1,1), None, 1.0, lscale)\n", + "mu = np.zeros(x.shape[0])\n", + "f_rbf = np.random.multivariate_normal(mu, K_rbf, 100)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff646d12", + "metadata": { + "id": "ff646d12" + }, + "outputs": [], + "source": [ + "rv = multivariate_normal(np.zeros(2), rbf_kernel(np.arange(0,2).reshape(-1,1), None, 1.0, 2.0))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64f1ad96", + "metadata": { + "id": "64f1ad96" + }, + "outputs": [], + "source": [ + "#Create grid and multivariate normal\n", + "a = np.linspace(-3,3,500)\n", + "b = np.linspace(-3,3,500)\n", + "X, Y = np.meshgrid(a,b)\n", + "pos = np.empty(X.shape + (2,))\n", + "pos[:, :, 0] = X\n", + "pos[:, :, 1] = Y" + ] + }, + { + "cell_type": "markdown", + "id": "98be54da", + "metadata": { + "id": "98be54da" + }, + "source": [ + "
\n", + "\n", + "Stochastic processes, such as Gaussian processes, are essentially a set of random variables. In addition, each of these random variables has a corresponding index $i$. We will use this index to refer to the $i$-th dimension of our nnn-dimensional multivariate distributions.\n", + " \n", + "Below, we have a two-dimensional normal distribution. Each dimension $y_i$ is assigned an index $\\displaystyle{i \\in {1,2}}$. This representation allows us to understand the connection between the covariance and the resulting values: the underlying Gaussian distribution has a positive covariance between $y_1$ and $y_1$ -  this means that $y_2$ will increases as $y_1$ gets larger and vice versa.\n", + " \n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23b0eb7b", + "metadata": { + "id": "23b0eb7b" + }, + "outputs": [], + "source": [ + "grid = plt.GridSpec(1, 2, wspace=0.2, hspace=0.3)\n", + "fig = plt.figure(figsize=(10, 5))\n", + "ax1 = fig.add_subplot(grid[0, 0])\n", + "ax2 = fig.add_subplot(grid[0, 1])\n", + "ax1.contourf(X, Y, rv.pdf(pos), cmap=\"Reds\", levels=300)\n", + "for index in range(10):\n", + " rand_y1, rand_y2 = np.random.multivariate_normal(np.zeros(2), rbf_kernel(np.arange(0,2).reshape(-1,1), None, 1.0, 2.0), 1)[0]\n", + " ax1.scatter(rand_y1, rand_y2)\n", + " ax1.set_xlabel(\"$y_1$\")\n", + " ax1.set_ylabel(\"$y_2$\")\n", + " ax2.plot(np.arange(1,3), [rand_y1, rand_y2], '-o')\n", + " ax2.set_xlim([0,3])\n", + " ax2.set_ylim([-3,3])\n", + " ax2.set_xlabel(\"index\")\n", + " ax2.set_ylabel(\"$y$\")\n", + " ax2.set_xticks([1,2])" + ] + }, + { + "cell_type": "markdown", + "id": "bf7147ba", + "metadata": { + "id": "bf7147ba" + }, + "source": [ + "
\n", + " \n", + "The more horizontal the lines are the more strongly correlated\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75487067", + "metadata": { + "id": "75487067" + }, + "outputs": [], + "source": [ + "# rbf_kernel(X1,X2,var,lscale)\n", + "rbf_kernel(np.arange(0,2).reshape(-1,1), None, 1.0, lscale)" + ] + }, + { + "cell_type": "markdown", + "id": "af39fff9", + "metadata": { + "id": "af39fff9" + }, + "source": [ + "
\n", + "\n", + "## Prior distribution\n", + "\n", + "\n", + "The following figure shows samples of potential functions from prior distributions (the case where we have not yet observed any training data) that were created using RBF kernel\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff4bf6c5", + "metadata": { + "id": "ff4bf6c5" + }, + "outputs": [], + "source": [ + "var_index = np.linspace(0, 20,100)\n", + "fig, ax = plt.subplots(figsize=(10,5))\n", + "for index in range(10):\n", + " ax.plot(var_index, np.random.multivariate_normal(np.zeros(var_index.shape[0]), rbf_kernel(var_index.reshape(-1,1), None, 1.0, 5.0), 1)[0])\n", + " ax.set_xlim(0,20)\n", + " ax.set_ylim(-3,3)\n", + " ax.set_xlabel(\"$x$\")\n", + " ax.set_ylabel(\"$y$\")\n", + " ax.set_xticks([])\n", + "ax.set_title('GP samples from RBF where lscale ='+str(lscale))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "198512d1", + "metadata": { + "id": "198512d1" + }, + "outputs": [], + "source": [ + "# Covariance matrix from the RBF kernel\n", + "# rbf_kernel(X1,X2,var,lscale)\n", + "K_rbf = rbf_kernel(x.reshape(-1,1), None, 1.0, lscale)\n", + "# mu = 0\n", + "mu = np.zeros(x.shape[0])\n", + "# fuctions from rbf kernel\n", + "f_rbf = np.random.multivariate_normal(mu, K_rbf, 100)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "becd6ca0", + "metadata": { + "id": "becd6ca0" + }, + "outputs": [], + "source": [ + "# plot 100 GP samples\n", + "index=100\n", + "fig, ax = plt.subplots(figsize=(10,5))\n", + "ax.plot(x, f_rbf.T[:,0:index], color='C0', alpha=0.2)\n", + "ax.set_xlim([-6,6])\n", + "# depending you you function choice you might need to comment out the ylim line below\n", + "ax.set_ylim([-10,15])\n", + "ax.set_xlabel(\"x\")\n", + "ax.set_ylabel(\"f(x)\")\n", + "ax.set_title('100 GP samples from RBF kernel where lscale ='+str(lscale))" + ] + }, + { + "cell_type": "markdown", + "id": "1fcbf842", + "metadata": { + "id": "1fcbf842" + }, + "source": [ + "
\n", + " \n", + "as $\\mu$ is set to zero all functions are distributed normally around the mean $\\mu$ (0)\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "38c26c7d", + "metadata": { + "id": "38c26c7d" + }, + "source": [ + "
\n", + "\n", + "## Posterior distribution\n", + "\n", + "Now we're going to activate training data which we can add back into our distribution. To give the posterior distribution (where we have incorporated the training data into our model).\n", + "\n", + "1. First, we form the joint distribution between all x points `x` and the training points `x_samp` which gives `k_starX`\n", + "2. we can the use `k_xx` (covariance matrix of test x points) `k_starstar` covariance matrix of all x to calculate $\\mu_{pred}$ and $K_{pred}$\n", + " \n", + "\\begin{equation}\n", + "\\boldsymbol \\mu_{\\text{pred}} = \\mathbf{K}_*^\\top \\left[\\mathbf{K} + \\sigma^2 \\mathbf{I}\\right]^{-1} \\mathbf{y}\n", + "\\end{equation}\n", + " \n", + "\\begin{equation}\n", + "\\mathbf{K}_{\\text{pred}} = \\mathbf{K}_{*,*} - \\mathbf{K}_*^\\top \\left[\\mathbf{K} + \\sigma^2 \\mathbf{I}\\right]^{-1} \\mathbf{K}_*.\n", + "\\end{equation}\n", + "\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a11ec1cb", + "metadata": { + "id": "a11ec1cb" + }, + "outputs": [], + "source": [ + "k_starX = rbf_kernel(x.reshape(-1,1), x_samp.reshape(-1,1), 3, lscale)\n", + "\n", + "# K from sample point\n", + "k_xx = rbf_kernel(x_samp.reshape(-1,1), None, 3, lscale)\n", + "k_starstar = rbf_kernel(x.reshape(-1,1), None, 3, lscale)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9eb330e", + "metadata": { + "id": "c9eb330e" + }, + "outputs": [], + "source": [ + "print('no training data multivariate Gaussian distribution shape = ' +str(k_starstar.shape))\n", + "print('sample-whole multivariate Gaussian distribution shape = ' +str(k_starX.shape))\n", + "print('whole multivariate Gaussian distribution shape = ' +str(k_starX.shape))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98e08ec9", + "metadata": { + "id": "98e08ec9" + }, + "outputs": [], + "source": [ + "try:\n", + " mu = k_starX.dot(np.linalg.inv(k_xx)).dot(y_samp).flatten()\n", + " var = k_starstar - k_starX.dot(np.linalg.inv(k_xx)).dot(k_starX.T)\n", + "except:\n", + " print('ERROR PLEASE RETURN TO GENERATE SAMPLE')" + ] + }, + { + "cell_type": "markdown", + "id": "b65b08da", + "metadata": { + "id": "b65b08da" + }, + "source": [ + "
\n", + "\n", + "If you get an error here please return to [generate sample](#generate-sample) section to check for duplicated sample points\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "b3b2bc70", + "metadata": { + "id": "b3b2bc70" + }, + "source": [ + "
\n", + "In the constrained covariance matrix `var`, we can see that the correlation of neighbouring points is affected by the training data. If a predicted point lies on the training data, there is no correlation with other points. Therefore, the function must pass directly through it. Predicted values further away are also affected by the training data: proportional to their distance.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc008204", + "metadata": { + "id": "cc008204" + }, + "outputs": [], + "source": [ + "fig, ax = plt.subplots()\n", + "ax.imshow(var, cmap=\"BuPu\", interpolation='None')\n", + "ax.axis(False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75426e81", + "metadata": { + "id": "75426e81" + }, + "outputs": [], + "source": [ + "# functions from new mu and covariance matrix\n", + "f_star = np.random.multivariate_normal(mu, var, 100)\n", + "f_star_mean = f_star.mean(axis=0)\n", + "f_star_std = f_star.std(axis=0)" + ] + }, + { + "cell_type": "markdown", + "id": "3e49d531", + "metadata": { + "id": "3e49d531" + }, + "source": [ + "
\n", + "\n", + "Now we have our function predictions `f_star`that intercepts all training points\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86d813ac", + "metadata": { + "id": "86d813ac" + }, + "outputs": [], + "source": [ + "# plot predictions\n", + "fig, ax = plt.subplots(figsize=(10,5))\n", + "#plt.scatter(x, y, 100, 'k', 'o', zorder=100)\n", + "ax.scatter(x_samp, y_samp, s=50, c=\"C1\", zorder=10)\n", + "ax.plot(x, f_star.T, color='C0', alpha=0.1)\n", + "# if you have altered the function you may need to alter the y lim\n", + "ax.set_xlim([-6,6])\n", + "ax.set_ylim([-10,15])\n", + "ax.set_xlabel(\"x\")\n", + "ax.set_ylabel(\"f(x)\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "9398b3a5", + "metadata": { + "id": "9398b3a5" + }, + "source": [ + "
\n", + "\n", + "We can plot below the mean prediction function and the standard deviation. Away from training points, the standard deviation is much higher. Reflecting the lack of knowledge in these areas.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee7df77f", + "metadata": { + "id": "ee7df77f" + }, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(10,5))\n", + "ax.scatter(x_samp, y_samp, s=50, c=\"C1\", zorder=10)\n", + "ax.plot(x,y,color='r',alpha=0.2)\n", + "ax.plot(x, f_star_mean, color='C0')\n", + "ax.fill_between(x.flatten(), f_star_mean+f_star_std, f_star_mean-f_star_std, alpha=0.1, color='C0')\n", + "ax.set_xlim([-6,6])\n", + "# if you have altered the function you may need to alter the y lim\n", + "ax.set_ylim([-10,15])\n", + "ax.set_xlabel(\"x\")\n", + "ax.set_ylabel(\"f(x)\")\n", + "ax.legend(['actual','predicted'])" + ] + }, + { + "cell_type": "markdown", + "id": "a1624a46", + "metadata": { + "id": "a1624a46" + }, + "source": [ + "
\n", + "Since the RBF kernel is stationary it will always return to $\\mu=0$ in regions further away from observed training data. This decreases the accuracy of predictions that reach further into the past or the future.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "9851a817", + "metadata": { + "id": "9851a817" + }, + "source": [ + "# Sea Level Example\n", + "\n", + "
\n", + "\n", + "Now let's look at an example using Gaussian Process to Predict sea-level change using a subset of sea level data from the RISeR dataset provided in python NumPy arrays by Oliver Pollard. \n", + "\n", + "Example of a global RSL change output for the Last Glacial Maximum to present (using the ICE-5G (Peltier et al., 2015) input to an implemented model (Han and Gomez, 2018) of sea-level theory (Kendall et al., 2005)).\n", + " \n", + "\n", + "\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "hispanic-jones", + "metadata": { + "tags": [ + "hide-cell" + ], + "id": "hispanic-jones" + }, + "source": [ + "# Load Data" + ] + }, + { + "cell_type": "markdown", + "id": "modular-arabic", + "metadata": { + "tags": [ + "hide-input" + ], + "id": "modular-arabic" + }, + "source": [ + "
\n", + "\n", + "For this example we're using data provided in python [.npy files](https://towardsdatascience.com/what-is-npy-files-and-why-you-should-use-them-603373c78883) containing relative sea-level change at a single point in the southern north sea at 122 (ka). `highstand` data we will use to test our predictions\n", + " \n", + "\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "source": [ + "%%bash\n", + "\n", + "wget https://github.com/cemac/LIFD_GaussianProcesses/raw/469cc8b75c903a5976ddf37fff300d58b853bd6a/data/highstand_data.npy\n", + "wget https://github.com/cemac/LIFD_GaussianProcesses/raw/469cc8b75c903a5976ddf37fff300d58b853bd6a/data/parameter_data.npy\n", + "wget https://raw.githubusercontent.com/cemac/LIFD_GaussianProcesses/469cc8b75c903a5976ddf37fff300d58b853bd6a/data/predict_points.npy\n", + "\n", + "mkdir data/\n", + "mv highstand_data.npy parameter_data.npy predict_points.npy data/" + ], + "metadata": { + "id": "rnmTegC9XJry" + }, + "execution_count": null, + "outputs": [], + "id": "rnmTegC9XJry" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ef40bb9", + "metadata": { + "id": "7ef40bb9" + }, + "outputs": [], + "source": [ + "highstand = np.load(\"data/highstand_data.npy\")" + ] + }, + { + "cell_type": "markdown", + "id": "133dd0ed", + "metadata": { + "id": "133dd0ed" + }, + "source": [ + "
\n", + " \n", + "We can use 4 parameters provided in `parameter_data.npy` which correspond to the parameters listed below\n", + "\n", + "The Rates of Interglacial Sea-level Change and Responses (RISeR) dataset covers the early interglacial period and can be used to show the relative contribution to the Penultimate Glacial Period (PPGM) Max Ice\n", + "Volume and Last Interglacial (LIG) Highstnad from 4 parameters from sediment cores\n", + "\n", + "\n", + "\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "427b8674", + "metadata": { + "id": "427b8674" + }, + "outputs": [], + "source": [ + "parameters = np.load('data/parameter_data.npy')" + ] + }, + { + "cell_type": "markdown", + "id": "9c6e1a7b", + "metadata": { + "id": "9c6e1a7b" + }, + "source": [ + "# Normalise Data\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fcc73ef", + "metadata": { + "id": "6fcc73ef" + }, + "outputs": [], + "source": [ + "# useful functions\n", + "def stack_parameters(*parameters):\n", + " return np.column_stack([[*parameters]]).T\n", + "\n", + "def normalise(parameters):\n", + " normalised_parameters = np.zeros_like(parameters)\n", + " normalisation_values = []\n", + " for index, parameter in enumerate(parameters.T):\n", + " shift = - np.min(parameter)\n", + " scale = np.max(parameter + shift)\n", + " normalised_parameters[:,index] = (parameter + shift)/scale\n", + " normalisation_values.append((shift, scale))\n", + "\n", + " return normalised_parameters, normalisation_values\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "breeding-membership", + "metadata": { + "scrolled": false, + "id": "breeding-membership" + }, + "outputs": [], + "source": [ + "\n", + "# move parameter ranges down to improve results\n", + "p1 = parameters[:,0] # Bedrock\n", + "p2 = parameters[:,1] # Onshore Sediment\n", + "p3 = parameters[:,2] # Marine Sediement\n", + "p4 = parameters[:,3] # Ice streams\n", + "parmlist = ['Bedrock','Onshore Sediment','Marine Sediement','Ice streams']\n", + "parameters = stack_parameters(p1, p2, p3, p4)\n", + "highstand = highstand.reshape(-1,1)\n", + "\n", + "parameters_norm, parameters_norm_params = normalise(parameters)\n", + "highstand_norm, highstand_norm_params = normalise(highstand)" + ] + }, + { + "cell_type": "markdown", + "id": "b49828d5", + "metadata": { + "id": "b49828d5" + }, + "source": [ + "# Plot Data\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "proud-edwards", + "metadata": { + "id": "proud-edwards" + }, + "outputs": [], + "source": [ + "# plot data for parameter\n", + "scatter_fig = plt.figure(figsize=[16,10])\n", + "for index in range(4):\n", + " ax = scatter_fig.add_subplot(221+index)\n", + " ax.scatter(parameters_norm[:,index], highstand, s=15, c=\"#71a8ad\")\n", + " ax.set_title(f\"{str(parmlist[index])}\", fontsize=16, fontweight='bold')\n", + " ax.set_ylabel(\"Highstand (m)\")" + ] + }, + { + "cell_type": "markdown", + "id": "fd286e00", + "metadata": { + "id": "fd286e00" + }, + "source": [ + "## Plot parameters against other parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "accessory-philip", + "metadata": { + "id": "accessory-philip" + }, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(nrows=4, ncols=4, figsize=(15,15))\n", + "for index_1 in range(4):\n", + " for index_2 in range(4):\n", + " if index_1 == index_2:\n", + " ax[index_1, index_2].axis('off')\n", + " ax[index_1, index_2].text(0.5,0.5,f\"{str(parmlist[index_1])}\",horizontalalignment=\"center\",\n", + " verticalalignment=\"center\", fontsize=15,\n", + " fontweight='bold')\n", + " elif index_2 < index_1:\n", + " ax[index_1, index_2].axis('off')\n", + "\n", + " else:\n", + " ax[index_1, index_2].scatter(parameters_norm[:,index_2], parameters_norm[:,index_1],s=2,c=\"red\")\n", + " ax[index_1, index_2].set_xlim([0,1])\n", + " ax[index_1, index_2].set_ylim([0,1])" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
+ "id": "b896cfa8", + "metadata": { + "id": "b896cfa8" + }, + "outputs": [], + "source": [ + "# Over lay sparse points in red over the points to be predicted in blue" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "polish-direction", + "metadata": { + "id": "polish-direction" + }, + "outputs": [], + "source": [ + "predict_coords = np.load(\"data/predict_points.npy\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f08a106", + "metadata": { + "id": "4f08a106" + }, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(nrows=4, ncols=4, figsize=(15,15))\n", + "for index_1 in range(4):\n", + " for index_2 in range(4):\n", + " if index_1 == index_2:\n", + " ax[index_1, index_2].axis('off')\n", + " ax[index_1, index_2].text(0.5,0.5,f\"{str(parmlist[index_1])}\",\n", + " horizontalalignment=\"center\",\n", + " verticalalignment=\"center\", fontsize=15)\n", + " elif index_2 < index_1:\n", + " ax[index_1, index_2].axis('off')\n", + "\n", + " else:\n", + " ax[index_1, index_2].scatter(predict_coords[:,index_2], predict_coords[:,index_1],s=2,c=\"C0\", alpha=0.2)\n", + " ax[index_1, index_2].scatter(parameters_norm[:,index_2], parameters_norm[:,index_1],s=2,c=\"red\")\n", + " ax[index_1, index_2].set_xlim([0,1])\n", + " ax[index_1, index_2].set_ylim([0,1])" + ] + }, + { + "cell_type": "markdown", + "id": "96445033", + "metadata": { + "id": "96445033" + }, + "source": [ + "## Define GP flow model\n", + "\n", + "
\n", + "\n", + "We're going to use the python library [GP flow model](https://gpflow.readthedocs.io/en/master/notebooks/basics/regression.html) to create our Gaussian process model to create a more complex model that in our previous example in fewer lines of code\n", + " \n", + "`k = gpflow.kernels.Matern52(lengthscales=lscale)` selects a [Matérn covariance function](https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function) for the [GP flow kernel](https://gpflow.readthedocs.io/en/master/notebooks/advanced/kernels.html) \n", + "\n", + "`m = gpflow.models.GPR(data=(X, Y), kernel=k, mean_function=None)` constructs a regression model from data points and the selected kernal.\n", + "\n", + "to inspect the chosen kernel you can run the `print_summary` command which should show you a list of hyperparameters: `variance` and `lengthscale` and will display information about those hyperparameters which will start at default values of 1 and the [transformation](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Softplus) applied.\n", + "\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5be48c64", + "metadata": { + "id": "5be48c64" + }, + "outputs": [], + "source": [ + "gpflow.config.set_default_summary_fmt(\"notebook\")\n", + "k = gpflow.kernels.Matern52(lengthscales=np.ones((len(parameters_norm.T))))\n", + "print_summary(k)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a60f04b", + "metadata": { + "id": "7a60f04b" + }, + "outputs": [], + "source": [ + "m = gpflow.models.GPR(data=(parameters_norm, highstand_norm), kernel=k, mean_function=None)\n", + "print_summary(m)" + ] + }, + { + "cell_type": "markdown", + "id": "2847d55e", + "metadata": { + "id": "2847d55e" + }, + "source": [ + "## Optamization\n", + "\n", + "
\n", + " \n", + "`opt = gpflow.optimizers.Scipy()` uses the Scipy optimizer, which by default implements the [Limited Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS-B)](https://en.wikipedia.org/wiki/Limited-memory_BFGS) algorithm\n", + " \n", + "`opt.minimize(m.training_loss, m.trainable_variables, options=dict(maxiter=100))` calls the minimize method of an optimizer which uses the training_loss defined by the GPflow model and the variables to train with and the number of iterations\n", + "\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d55b5932", + "metadata": { + "id": "d55b5932" + }, + "outputs": [], + "source": [ + "opt = gpflow.optimizers.Scipy()\n", + "opt.minimize(m.training_loss, m.trainable_variables, options=dict(maxiter=100))\n", + "print_summary(m)" + ] + }, + { + "cell_type": "markdown", + "id": "5eb5be2f", + "metadata": { + "id": "5eb5be2f" + }, + "source": [ + "## Prediction\n", + "\n", + "\n", + "
\n", + " \n", + "`m.predict_f` predicts at the new points producing a mean and variance we'll create an array of predicted means `predict` and variance `predict_var` by looping over the normalised training data from our parameters \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c51e8b3", + "metadata": { + "id": "7c51e8b3" + }, + "outputs": [], + "source": [ + "mean, var = m.predict_f(np.asarray([[0.2,0.5,0.6,0.7]]))\n", + "print(mean.numpy()[0][0], var.numpy()[0][0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9abf715", + "metadata": { + "id": "b9abf715" + }, + "outputs": [], + "source": [ + "mean, var = m.predict_f(predict_coords)\n", + "print(mean.numpy()[:,0], var)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "roman-chase", + "metadata": { + "id": "roman-chase" + }, + "outputs": [], + "source": [ + "def coords_flattern(*coords):\n", + " meshs = np.meshgrid(*coords)\n", + " return np.column_stack(([mesh.ravel() for mesh in meshs]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58b53e4f", + "metadata": { + "id": "58b53e4f" + }, + "outputs": [], + "source": [ + "predict = []\n", + "predict_var = []\n", + "for index, point in enumerate(parameters_norm):\n", + " parameters_norm_training = np.delete(parameters_norm, index, axis=0)\n", + " highstand_norm_training = np.delete(highstand_norm, index, axis=0)\n", + "\n", + " k = gpflow.kernels.Matern52(lengthscales=np.ones((len(parameters.T))))\n", + " m = gpflow.models.GPR(data=(parameters_norm_training, highstand_norm_training), kernel=k, mean_function=None)\n", + " opt.minimize(m.training_loss, m.trainable_variables, options=dict(maxiter=100))\n", + "\n", + " lop = parameters_norm[index]\n", + " meanp, varp = m.predict_f(coords_flattern([lop[0]],[lop[1]],[lop[2]],[lop[3]]))\n", + " predict_var.append(varp.numpy()[0][0])\n", + " predict.append(meanp.numpy()[0][0])" + ] + }, + { + "cell_type": "markdown", + "id": "f61ab311", + "metadata": { + "id": "f61ab311" + }, + "source": [ + "## Cross Validation\n", + "\n", + "\n", + "
\n", + " \n", + "we can now plot our predicted values (`predict`) against our actual values (`highstand`) *normalised*\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf1da56c", + "metadata": { + "id": "bf1da56c" + }, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=[6,6])\n", + "ax.plot(np.linspace(0,1,num=100), np.linspace(0,1,num=100), c=\"#031f1d\")\n", + "ax.plot(highstand_norm, predict, 'x', c=\"#71a8ad\")\n", + "ax.errorbar(highstand_norm, predict, yerr=predict_var, linestyle=\"None\")\n", + "ax.set_ylabel(\"Predicted Highstand Normalised\")\n", + "ax.set_xlabel(\"Actual Highstand Normalised\")" + ] + }, + { + "cell_type": "markdown", + "id": "794b84f8", + "metadata": { + "id": "794b84f8" + }, + "source": [ + "## Plot High Stands\n", + "\n", + "
\n", + " \n", + "We can now plot the actual predictions (`mean`) in purple with the grey dots agreement with the actual data (`highstand_norm`) \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7c9ee16", + "metadata": { + "id": "e7c9ee16" + }, + "outputs": [], + "source": [ + "# plot data for parameter\n", + "scatter_fig = plt.figure(figsize=[16,10])\n", + "for index in range(4):\n", + " ax = scatter_fig.add_subplot(221+index)\n", + " ax.scatter(predict_coords[:,index], mean.numpy()[:,0], s=15, c=\"purple\")\n", + " ax.scatter(parameters_norm[:,index], highstand_norm, s=15, c=\"#71a8ad\")\n", + " ax.set_title(f\"{str(parmlist[index])}\", fontsize=16, fontweight='bold')\n", + " ax.set_ylabel(\"Highstand (m)\")\n", + " plt.legend(['mean','highstand_norm'])\n", + "\n", + "plt.tight_layout()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa9f0dd6", + "metadata": { + "id": "aa9f0dd6" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + }, + "colab": { + "provenance": [], + "include_colab_link": true + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file