Commit fd4d4c9: CircleCI build openturns 17383

CircleCI committed Apr 5, 2024
1 parent d50e3e9 commit fd4d4c9
Showing 367 changed files with 5,999 additions and 3,582 deletions.
@@ -1,86 +1,141 @@
"""
Create a linear model
=====================
In this example we create a surrogate model using linear model approximation
using the :class:`~openturns.LinearModelAlgorithm` class.
We show how the :class:`~openturns.LinearModelAnalysis` class
can be used to produce the statistical analysis of the least squares
regression model.
"""
# %%
# Introduction
# ~~~~~~~~~~~~
#
# The following 2-dimensional function is used in this example:
#
# .. math::
#
# \model(x,y) = 3 + 2x - y
#
# for any :math:`x, y \in \Rset`.
#
# Notice that this model is linear:
#
# .. math::
#
# \model(x,y) = \beta_1 + \beta_2 x + \beta_3 y
#
# where :math:`\beta_1 = 3`, :math:`\beta_2 = 2` and :math:`\beta_3 = -1`.
#
# We consider noisy observations of the output:
#
# .. math::
#
# Y = \model(x,y) + \epsilon
#
# where :math:`\epsilon \sim \cN(0, \sigma^2)` and :math:`\sigma > 0`
# is the standard deviation of the noise.
# In our example, we use :math:`\sigma = 0.1`.
#
# Finally, we use :math:`n = 1000` independent observations of the model.
#

# %%
import openturns as ot
import openturns.viewer as viewer

# %%
# Simulate the data set
# ~~~~~~~~~~~~~~~~~~~~~
#
# We first generate the data and add noise to the output observations:

# %%
ot.RandomGenerator.SetSeed(0)
distribution = ot.Normal(2)
distribution.setDescription(["x", "y"])
sampleSize = 1000
func = ot.SymbolicFunction(["x", "y"], ["3 + 2 * x - y"])
input_sample = distribution.getSample(sampleSize)
epsilon = ot.Normal(0, 0.1).getSample(sampleSize)
output_sample = func(input_sample) + epsilon

# %%
# Linear regression
# ~~~~~~~~~~~~~~~~~
#
# Let us run the linear model algorithm using the
# :class:`~openturns.LinearModelAlgorithm` class and get the associated
# results:

# %%
algo = ot.LinearModelAlgorithm(input_sample, output_sample)
result = algo.getResult()
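
# %%
# As a quick check (not part of the original example), we can compare the
# estimated coefficients with the true values
# :math:`(\beta_1, \beta_2, \beta_3) = (3, 2, -1)`.
# We assume here that the
# :meth:`~openturns.LinearModelResult.getCoefficients` accessor is available
# in your OpenTURNS version.

# %%
print(result.getCoefficients())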

# %%
# Residuals analysis
# ~~~~~~~~~~~~~~~~~~
#
# We can now analyse the residuals of the regression on the training data.
# For clarity, only the first 5 residual values are printed.
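#
# As a reminder, the :math:`i`-th residual is the difference between the
# observed output and the corresponding fitted value (a standard definition,
# stated here for reference):
#
# .. math::
#
#    \hat{\epsilon}_i = Y_i - \hat{Y}_i
#
# where :math:`\hat{Y}_i` is the output of the fitted linear model at the
# :math:`i`-th input point.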

# %%
residuals = result.getSampleResiduals()
residuals[:5]

# %%
# Alternatively, the `standardized` or `studentized` residuals can be used:
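#
# The standardized residuals rescale each raw residual by its estimated
# standard deviation (a standard definition, stated here for reference):
#
# .. math::
#
#    r_i = \frac{\hat{\epsilon}_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}
#
# where :math:`\hat{\sigma}` is the estimated standard deviation of the noise
# and :math:`h_{ii}` is the leverage of the :math:`i`-th observation.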

# %%
stdresiduals = result.getStandardizedResiduals()
stdresiduals[:5]

# %%
# Similarly, we can also obtain the underlying distribution characterizing
# the residuals:

# %%
result.getNoiseDistribution()


# %%
# Analysis of the results
# ~~~~~~~~~~~~~~~~~~~~~~~
#
# In order to post-process the linear regression results, the
# :class:`~openturns.LinearModelAnalysis` class can be used:

# %%
analysis = ot.LinearModelAnalysis(result)
analysis

# %%
# The results seem to indicate that the linear model is satisfactory.
#
# - The basis uses the three functions :math:`1` (i.e. the intercept),
#   :math:`x` and :math:`y`.
# - Each row of the table of coefficients tests if one single coefficient
#   is zero.
#   The probability of observing a large value of the T statistic is close to
#   zero for all coefficients.
#   Hence, we can reject the hypothesis that any coefficient is zero.
#   In other words, all the coefficients are significantly nonzero.
# - The R2 score is close to 1.
#   Furthermore, the adjusted R2 value, which takes into account the size of
#   the data set and the number of parameters, is similar to the
#   unadjusted R2 score.
#   Most of the variance is explained by the linear model.
# - The F-test tests if all coefficients are simultaneously zero.
#   The `Fisher-Snedecor` p-value is lower than 1%.
#   Hence, there is at least one nonzero coefficient.
# - The normality test checks that the residuals have a normal distribution.
#   The normality assumption can be rejected if the p-value is close to zero.
#   The p-values are larger than 0.05: the normality assumption of the
#   residuals is not rejected.
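#
# The cell below (not part of the original example) sketches how some of
# these diagnostics can be accessed programmatically; we assume here that the
# :meth:`~openturns.LinearModelAnalysis.getCoefficientsPValues` and
# :meth:`~openturns.LinearModelAnalysis.getFisherPValue` accessors are
# available in your OpenTURNS version.

# %%
print(analysis.getCoefficientsPValues())  # p-values of the Student tests
print(analysis.getFisherPValue())  # p-value of the Fisher test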

# %%
# Graphical analyses
# ~~~~~~~~~~~~~~~~~~
#
# Let us compare model and fitted values:

@@ -89,7 +144,8 @@
view = viewer.View(graph)

# %%
# The previous figure seems to indicate that the linearity hypothesis
# is accurate.

# %%
# Residuals can be plotted against the fitted values.
@@ -107,9 +163,11 @@
view = viewer.View(graph)

# %%
# In this case, the two distributions are very close: there is no obvious
# outlier.
#
# Cook's distance measures the impact of every individual data point on the
# linear regression, and can be plotted as follows:
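#
# For the :math:`i`-th observation, a common expression of Cook's distance
# (a standard definition, stated here for reference) is:
#
# .. math::
#
#    D_i = \frac{\hat{\epsilon}_i^2}{p \hat{\sigma}^2}
#    \frac{h_{ii}}{(1 - h_{ii})^2}
#
# where :math:`p` is the number of coefficients, :math:`\hat{\sigma}^2` is the
# estimated variance of the noise and :math:`h_{ii}` is the leverage of the
# :math:`i`-th observation.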

# %%
graph = analysis.drawCookDistance()
@@ -118,27 +176,32 @@
# %%
# This graph shows us the index of the points with disproportionate influence.
#
# One of the components of the computation of Cook's distance at a given
# point is that point's *leverage*.
# High-leverage points are far from their closest neighbors, so the fitted
# linear regression model must pass close to them.
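#
# More precisely, the leverage :math:`h_{ii}` is the :math:`i`-th diagonal
# entry of the hat matrix (a standard definition, stated here for reference):
#
# .. math::
#
#    H = X (X^T X)^{-1} X^T
#
# where :math:`X` is the design matrix of the regression.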

# %%
# sphinx_gallery_thumbnail_number = 6
graph = analysis.drawResidualsVsLeverages()
view = viewer.View(graph)

# %%
# In this case, there seems to be no obvious influential outlier characterized
# by large leverage and residual values.
#
# Similarly, we can also plot Cook's distances as a function of the sample
# leverages:

# %%
graph = analysis.drawCookVsLeverages()
view = viewer.View(graph)

# %%
# Finally, we give the intervals for each estimated coefficient (95% confidence
# interval):
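#
# Each interval is derived from the Student distribution of the coefficient
# estimator (a standard construction, stated here for reference): with
# confidence level :math:`\alpha`, the interval for :math:`\beta_j` is
#
# .. math::
#
#    \hat{\beta}_j \pm t_{(1 + \alpha)/2, n - p} \,
#    \widehat{\operatorname{se}}(\hat{\beta}_j)
#
# where :math:`t_{(1 + \alpha)/2, n - p}` is the quantile of order
# :math:`(1 + \alpha)/2` of the Student distribution with :math:`n - p`
# degrees of freedom and :math:`\widehat{\operatorname{se}}(\hat{\beta}_j)`
# is the estimated standard error of :math:`\hat{\beta}_j`.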

# %%
alpha = 0.95
interval = analysis.getCoefficientsConfidenceInterval(alpha)
print("confidence intervals with level=%1.2f : " % (alpha))
print("confidence intervals with level=%1.2f: " % (alpha))
print("%s" % (interval))
@@ -0,0 +1,161 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Distribution of estimators in linear regression\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n\nIn this example, we are interested in the distribution of the estimator of the variance\nof the observation error in linear regression.\nWe are also interested in the estimator of the standard deviation of the\nobservation error.\nWe show how to use the :class:`~openturns.PythonRandomVector` class in order to\nperform a study of the sample distribution of these estimators.\n\nIn the general linear regression model, the observation error $\\epsilon$ has the\nnormal distribution $\\cN(0, \\sigma^2)$ where $\\sigma > 0$\nis the standard deviation.\nWe are interested in the estimators of the variance $\\sigma^2$\nand the standard deviation $\\sigma$:\n\n- the variance of the residuals, $\\sigma^2$, is estimated from :meth:`~openturns.LinearModelResult.getResidualsVariance`;\n- the standard deviation $\\sigma$ is estimated from :meth:`~openturns.LinearModelAnalysis.getResidualsStandardError`.\n\nThe asymptotic distribution of these estimators is known (see [vaart2000]_)\nbut we want to perform an empirical study, based on simulation.\nIn order to see the distribution of the estimator, we simulate an observation of the estimator,\nand repeat that experiment $r \\in \\Nset$ times, where $r$\nis a large integer.\nThen we approximate the distribution using :class:`~openturns.KernelSmoothing`.\n\n"
]
},
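{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder (a standard construction, stated here for reference), a common estimator of $\\sigma^2$ divides the sum of squared residuals by the number of degrees of freedom:\n\n\\begin{align}\\hat{\\sigma}^2 = \\frac{1}{n - p} \\sum_{i=1}^n \\hat{\\epsilon}_i^2\\end{align}\n\nwhere $n$ is the sample size, $p$ is the number of coefficients and $\\hat{\\epsilon}_i$ is the $i$-th residual; the corresponding estimator of the standard deviation is $\\hat{\\sigma} = \\sqrt{\\hat{\\sigma}^2}$.\n"
]
},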
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import openturns as ot\nimport openturns.viewer as otv\nimport pylab as pl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The simulation engine\nThe simulation is based on the :class:`~openturns.PythonRandomVector` class,\nwhich simulates independent observations of a random vector.\nThe `getRealization()` method implements the simulation of the observation\nof the estimator.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class SampleEstimatorLinearRegression(ot.PythonRandomVector):\n def __init__(\n self, sample_size, true_standard_deviation, coefficients, estimator=\"variance\"\n ):\n \"\"\"\n Create a RandomVector for an estimator from a linear regression model.\n\n Parameters\n ----------\n sample_size : int\n The sample size n.\n true_standard_deviation : float\n The standard deviation of the Gaussian observation error.\n coefficients: sequence of p floats\n The coefficients of the linear model.\n estimator: str\n The estimator.\n Available estimators are \"variance\" or \"standard-deviation\".\n \"\"\"\n super(SampleEstimatorLinearRegression, self).__init__(1)\n self.sample_size = sample_size\n self.numberOfParameters = coefficients.getDimension()\n input_dimension = self.numberOfParameters - 1 # Because of the intercept\n centerCoefficient = [0.0] * input_dimension\n constantCoefficient = [coefficients[0]]\n linearCoefficient = ot.Matrix([coefficients[1:]])\n self.linearModel = ot.LinearFunction(\n centerCoefficient, constantCoefficient, linearCoefficient\n )\n self.distribution = ot.Normal(input_dimension)\n self.distribution.setDescription([\"x%d\" % (i) for i in range(input_dimension)])\n self.errorDistribution = ot.Normal(0.0, true_standard_deviation)\n self.estimator = estimator\n # In classical linear regression, the input is deterministic.\n # Here, we set it once and for all.\n self.input_sample = self.distribution.getSample(self.sample_size)\n self.output_sample = self.linearModel(self.input_sample)\n\n def getRealization(self):\n errorSample = self.errorDistribution.getSample(self.sample_size)\n noisy_output_sample = self.output_sample + errorSample\n algo = ot.LinearModelAlgorithm(self.input_sample, noisy_output_sample)\n lmResult = algo.getResult()\n if self.estimator == \"variance\":\n output = lmResult.getResidualsVariance()\n elif self.estimator == \"standard-deviation\":\n lmAnalysis = ot.LinearModelAnalysis(lmResult)\n output = lmAnalysis.getResidualsStandardError()\n else:\n raise ValueError(\"Unknown estimator %s\" % (self.estimator))\n return [output]\n\n\ndef plot_sample_by_kernel_smoothing(\n sample_size, true_standard_deviation, coefficients, estimator, repetitions_size\n):\n \"\"\"\n Plot the estimated distribution of the biased sample variance from a sample\n\n The method uses Kernel Smoothing with default kernel.\n\n Parameters\n ----------\n repetitions_size : int\n The number of repetitions of the experiments.\n This is the (children) size of the sample of the biased\n sample variance\n\n Returns\n -------\n graph : ot.Graph\n The plot of the PDF of the estimated distribution.\n\n \"\"\"\n myRV = ot.RandomVector(\n SampleEstimatorLinearRegression(\n sample_size, true_standard_deviation, coefficients, estimator\n )\n )\n sample_estimator = myRV.getSample(repetitions_size)\n if estimator == \"variance\":\n sample_estimator.setDescription([r\"$\\hat{\\sigma}^2$\"])\n elif estimator == \"standard-deviation\":\n sample_estimator.setDescription([r\"$\\hat{\\sigma}$\"])\n else:\n raise ValueError(\"Unknown estimator %s\" % (estimator))\n mean_sample_estimator = sample_estimator.computeMean()\n\n graph = ot.KernelSmoothing().build(sample_estimator).drawPDF()\n graph.setLegends([\"Sample\"])\n bb = graph.getBoundingBox()\n ylb = bb.getLowerBound()[1]\n yub = bb.getUpperBound()[1]\n curve = ot.Curve([true_standard_deviation ** 2] * 2, [ylb, yub])\n curve.setLegend(\"Exact\")\n curve.setLineWidth(2.0)\n graph.add(curve)\n graph.setTitle(\n \"Size = %d, True S.D. 
= %.4f, Mean = %.4f, Rep. = %d\"\n % (\n sample_size,\n true_standard_deviation,\n mean_sample_estimator[0],\n repetitions_size,\n )\n )\n return graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distribution of the variance estimator\nWe first consider the estimation of the variance $\\sigma^2$.\nIn the next cell, we consider a sample size equal to $n = 6$ with\n$p = 3$ parameters.\nWe use $r = 1000$ repetitions.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"repetitions_size = 1000\ntrue_standard_deviation = 0.1\nsample_size = 6\ncoefficients = ot.Point([3.0, 2.0, -1.0])\nestimator = \"variance\"\nview = otv.View(\n plot_sample_by_kernel_smoothing(\n sample_size, true_standard_deviation, coefficients, estimator, repetitions_size\n ),\n figure_kw={\"figsize\": (6.0, 3.5)},\n)\npl.subplots_adjust(bottom=0.25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we use a sample size equal to $n = 6$ with\n$p = 3$ parameters, the distribution is not symmetric.\nThe mean of the observations of the sample variances remains close to\nthe true value $0.1^2 = 0.01$.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we increase the sample size $n$.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"repetitions_size = 1000\ntrue_standard_deviation = 0.1\nsample_size = 100\ncoefficients = ot.Point([3.0, 2.0, -1.0])\nview = otv.View(\n plot_sample_by_kernel_smoothing(\n sample_size, true_standard_deviation, coefficients, estimator, repetitions_size\n ),\n figure_kw={\"figsize\": (6.0, 3.5)},\n)\npl.subplots_adjust(bottom=0.25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we use a sample size equal to $n = 6$ with\n$p = 3$ parameters, the distribution is almost symmetric and\nalmost normal.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distribution of the standard deviation estimator\nWe now consider the estimation of the standard deviation $\\sigma$.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"repetitions_size = 1000\ntrue_standard_deviation = 0.1\nsample_size = 6\ncoefficients = ot.Point([3.0, 2.0, -1.0])\nestimator = \"standard-deviation\"\nview = otv.View(\n plot_sample_by_kernel_smoothing(\n sample_size, true_standard_deviation, coefficients, estimator, repetitions_size\n ),\n figure_kw={\"figsize\": (6.0, 3.5)},\n)\npl.subplots_adjust(bottom=0.25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we use a sample size equal to $n = 6$ with\n$p = 3$ parameters, we see that the distribution is almost symmetric.\nWe notice a slight bias, since the mean of the observations of the\nstandard deviation is not as close to the true value 0.1\nas we could expect.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"repetitions_size = 1000\ntrue_standard_deviation = 0.1\nsample_size = 100\ncoefficients = ot.Point([3.0, 2.0, -1.0])\nestimator = \"standard-deviation\"\nview = otv.View(\n plot_sample_by_kernel_smoothing(\n sample_size, true_standard_deviation, coefficients, estimator, repetitions_size\n ),\n figure_kw={\"figsize\": (6.0, 3.5)},\n)\npl.subplots_adjust(bottom=0.25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we use a sample size equal to $n = 100$ with\n$p = 3$ parameters, we see that the distribution is almost normal.\nWe notice that the bias disappeared.\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}