From b57ea7b601266ee8fd0d36f2f2ccf9c6beeef02b Mon Sep 17 00:00:00 2001 From: fcorowe Date: Thu, 22 Feb 2024 23:10:06 +0000 Subject: [PATCH] update chp5 --- 04-points.qmd | 4 +- 05-flows.qmd | 8 +- 06-spatial-econometrics.qmd | 41 +- _quarto.yml | 10 +- docs/01-overview.html | 5 +- docs/02-spatial_data.html | 25 +- docs/03-data-wrangling.html | 29 +- docs/04-points.html | 688 +++++---- docs/05-flows.html | 33 +- .../figure-html/unnamed-chunk-17-1.png | Bin 83880 -> 83921 bytes .../figure-html/unnamed-chunk-18-1.png | Bin 95741 -> 95704 bytes .../figure-html/unnamed-chunk-22-1.png | Bin 96793 -> 96600 bytes docs/06-spatial-econometrics.html | 595 ++++---- docs/07-multilevel-01.html | 55 +- .../figure-html/unnamed-chunk-21-1.png | Bin 110262 -> 109875 bytes .../figure-html/unnamed-chunk-24-1.png | Bin 170087 -> 170088 bytes docs/08-multilevel-02.html | 29 +- .../figure-html/unnamed-chunk-11-1.png | Bin 172652 -> 172204 bytes .../figure-html/unnamed-chunk-9-1.png | Bin 133583 -> 133642 bytes docs/09-gwr.html | 25 +- docs/10-st_analysis.html | 25 +- docs/11-datasets.html | 1 - docs/index.html | 5 +- docs/references.html | 327 +++- docs/robots.txt | 1 + docs/search.json | 732 +++++++++ docs/sitemap.xml | 55 + index.aux | 9 +- index.log | 1359 ++++++++--------- index.pdf | Bin 21879656 -> 21876583 bytes index.tex | 121 +- references.bib | 2 +- 32 files changed, 2606 insertions(+), 1578 deletions(-) create mode 100644 docs/robots.txt create mode 100644 docs/search.json create mode 100644 docs/sitemap.xml diff --git a/04-points.qmd b/04-points.qmd index e8166d1..343adb6 100644 --- a/04-points.qmd +++ b/04-points.qmd @@ -51,7 +51,8 @@ colnames(db) The rest of this session will focus on two main elements of the table: the spatial dimension (as stored in the point coordinates), and the nightly price values, expressed in USD and contained in the `price` column. To get a sense of what they look like first, let us plot both. We can get a quick look at the non-spatial distribution of house values with the following commands: -```{r fig.margin=TRUE, fig.cap="Raw AirBnb prices in San Diego"} +```{r} +#| warning: false # Create the histogram qplot( data = db, x = price) ``` @@ -59,6 +60,7 @@ qplot( data = db, x = price) This basically shows there is a lot of values concentrated around the lower end of the distribution but a few very large ones. A usual transformation to *shrink* these differences is to take logarithms. The original table already contains an additional column with the logarithm of each price (`log_price`). ```{r} +#| warning: false # Create the histogram qplot( data = db, x = log_price ) ``` diff --git a/05-flows.qmd b/05-flows.qmd index 1158dff..274f2c1 100644 --- a/05-flows.qmd +++ b/05-flows.qmd @@ -35,7 +35,7 @@ In this chapter we will show a slightly different way of managing spatial data i ## Data -In this note, we will use data from the city of San Francisco representing bike trips on their public bike share system. The original source is the SF Open Data portal ([link](http://www.bayareabikeshare.com/open-data)) and the dataset comprises both the location of each station in the Bay Area as well as information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the `sf_bikes` folder. 
The data file is named `flows.geojson` and, in case you are interested, the (Python) code required to created from the original files in the SF Data Portal is also available on the `flows_prep.ipynb` notebook [\[url\]](https://github.com/darribas/spa_notes/blob/master/sf_bikes/flows_prep.ipynb), also in the same folder. +In this note, we will use data from the city of San Francisco representing bike trips on their public bike share system. The original source is the [SF Open Data portal](https://datasf.org/opendata/) and the dataset comprises both the location of each station in the Bay Area as well as information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the `sf_bikes` folder. The data file is named `flows.geojson` and, in case you are interested, the (Python) code required to created from the original files in the SF Data Portal is also available on the `flows_prep.ipynb` [notebook](https://github.com/darribas/spa_notes/blob/master/sf_bikes/flows_prep.ipynb), also in the same folder. Let us then directly load the file with all the information necessary: @@ -277,7 +277,7 @@ generate_draw <- function(m){ } ``` -This function takes a model `m` and the set of covariates `x` used and returns a random realization of predictions from the model. To get a sense of how this works, we can get and plot a realization of the model, compared to the expected one and the actual values: +This function takes a model `m` and the set of covariates `x` used and returns a random realisation of predictions from the model. To get a sense of how this works, we can get and plot a realisation of the model, compared to the expected one and the actual values: ```{r} new_y <- generate_draw(m1) @@ -355,11 +355,11 @@ legend( title(main="Predictive check - Baseline model") ``` -The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realizations of our "inferential engine", which depict a much less concentrated distribution of values. This is likely due to the combination of two different reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, it may be that the amount of variation we are not able to account for in the model[^05-flows-2] is so large that the degree of uncertainty contained in the error term of the model is very large, hence resulting in such a flat predictive distribution. +The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realisations of our "inferential engine", which depict a much less concentrated distribution of values. 
This is likely due to the combination of two different reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, it may be that the amount of variation we are not able to account for in the model[^05-flows-2] is so large that the degree of uncertainty contained in the error term of the model is very large, hence resulting in such a flat predictive distribution. [^05-flows-2]: The $R^2$ of our model is around 2% -It is important to keep in mind that the issues discussed in the paragraph above relate only to the uncertainty behind our model, not to the point predictions derived from them, which are a mechanistic result of the minimization of the squared residuals and hence are not subject to probability or inference. That allows them in this case to provide a fitted distribution much more accurate apparently (black line above). However, the lesson to take from this model is that, even if the point predictions (fitted values) are artificially accurate[^05-flows-3], our capabilities to infer about the more general underlying process are fairly limited. +Keep in mind that the issues discussed in the paragraph above relate to the uncertainty behind our model. This is not to the point predictions derived from them, which are a mechanistic result of the minimisation of the squared residuals and hence are not subject to probability or inference. That allows the model in this case to provide a fitted distribution much more accurate apparently (black line above). However, the lesson to take from the model is that, even if the point predictions (fitted values) are artificially accurate[^05-flows-3], our capabilities to infer about the more general underlying process are fairly limited. [^05-flows-3]: which they are not really, in light of the comparison between the black and red lines. diff --git a/06-spatial-econometrics.qmd b/06-spatial-econometrics.qmd index 8945b92..9008b32 100644 --- a/06-spatial-econometrics.qmd +++ b/06-spatial-econometrics.qmd @@ -26,7 +26,7 @@ library(spdep) ## Data -To explore ideas in spatial regression, we will the set of Airbnb properties for San Diego (US), borrowed from the "Geographic Data Science with Python" book (see [here](https://geographicdata.science/book/data/airbnb/regression_cleaning.html) for more info on the dataset source). This covers the point location of properties advertised on the Airbnb website in the San Diego region. +To explore ideas in spatial regression, we will use a set of Airbnb properties for San Diego (US) from @reyABwolf. The dataset provides point location data of properties advertised on the Airbnb website in the San Diego region. Let us load the data: @@ -34,13 +34,13 @@ Let us load the data: db <- st_read('data/abb_sd/regression_db.geojson') ``` -The table contains the followig variables: +The table contains the following variables: ```{r} names(db) ``` -For most of this chapter, we will be exploring determinants and strategies for modelling the price of a property advertised in AirBnb. To get a first taste of what this means, we can create a plot of prices within the area of San Diego: +We will be exploring determinants and strategies for modelling the price of a property advertised in AirBnb. 
To get a first taste of what this means, we can create a plot of prices within the area of San Diego: ```{r} db %>% @@ -50,11 +50,9 @@ db %>% theme_void() ``` -## Non-spatial regression, a refresh +## Non-spatial regression -Before we discuss how to explicitly include space into the linear regression framework, let us show how basic regression can be carried out in R, and how you can interpret the results. By no means is this a formal and complete introduction to regression so, if that is what you are looking for, the first part of @gelman2006data, in particular chapters 3 and 4, are excellent places to check out. - -The core idea of linear regression is to explain the variation in a given (*dependent*) variable as a linear function of a series of other (*explanatory*) variables. For example, in our case, we may want to express/explain the price of a property advertised on AirBnb as a function of some of its characteristics, such as the number of people it accommodates, and how many bathrooms, bedrooms and beds it features. At the individual level, we can express this as: +Before we discuss how to explicitly include space into the linear regression framework, let us fit a linear regression model and interpret the results. We may want to explain the price of a property advertised on AirBnb as a function of some of its characteristics, such as the number of people it accommodates, and how many bathrooms, bedrooms and beds it features. At the individual level, we can express this as: $$ \log(P_i) = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i @@ -66,17 +64,11 @@ $$ \log(P) = \alpha + \beta_1 Acc + \beta_2 Bath + \beta_3 Bedr + \beta_4 Beds + \epsilon $$ where each term can be interpreted in terms of vectors instead of scalars (wit the exception of the parameters $(\alpha, \beta_{1, 2, 3, 4})$, which *are* scalars). Note we are using the logarithm of the price, since this allows us to interpret the coefficients as roughly the percentage change induced by a unit increase in the explanatory variable of the estimate. -Remember a regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the $\beta_k$ coefficients in the equation above is as the degree of correlation between the explanatory variable $k$ and the dependent variable, *keeping all the other explanatory variables constant*. When you calculate simple bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables --also called confounding factors\[\^06-spatial-econometrics-1\]. Regression allows you to isolate the distinct effect that a single variable has on the dependent one, once we *control* for those other variables. - -::: column-margin -::: {.callout-tip appearance="simple" icon="false"} -**Task** +Remember a regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the $\beta_k$ coefficients in the equation above is as the degree of correlation between the explanatory variable $k$ and the dependent variable, *keeping all the other explanatory variables constant*. When you calculate simple bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables --also called confounding factors. 
Regression allows you to isolate the distinct effect that a single variable has on the dependent one, once we *control* for those other variables. Assume that new houses tend to be built more often in areas with low deprivation. If that is the case, then $NEW$ and $IMD$ will be correlated with each other (as well as with the price of a house, as we are hypothesizing in this case). If we calculate a simple correlation between $P$ and $IMD$, the coefficient will represent the degree of association between both variables, but it will also include some of the association between $IMD$ and $NEW$. That is, part of the obtained correlation coefficient will be due not to the fact that higher prices tend to be found in areas with low IMD, but to the fact that new houses tend to be more expensive. This is because (in this example) new houses tend to be built in areas with low deprivation and simple bivariate correlation cannot account for that. -::: -::: -Practically speaking, running linear regressions in `R` is straightforward. For example, to fit the model specified in the equation above, we only need one line of code: +We first fit the model specified in the equation above by running: ```{r} m1 <- lm('log_price ~ accommodates + bathrooms + bedrooms + beds', db) @@ -94,11 +86,11 @@ A full detailed explanation of the output is beyond the scope of the chapter, bu [^06-spatial-econometrics-1]: Keep in mind that regression is no magic. We are only discounting the effect of other confounding factors that we include in the model, not of *all* potentially confounding factors. -## Spatial regression: a (very) first dip +## Spatial regression Spatial regression is about *explicitly* introducing space or geographical context into the statistical framework of a regression. Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors we cannot but should include in our model. As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views. To illustrate the latter, we can think of how the character of a neighborhood is important in determining the price of a house; however, it is very hard to identify and quantify "character" per se, although it might be easier to get at its spatial variation, hence a case of space as a proxy. -Spatial regression is a large field of development in the econometrics and statistics literature. In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence. For more rigorous treatments of the topics introduced here, the reader is referred to @anselin2003spatial, @anselin2014modern, and @gibbons2014spatial. +Spatial regression is a large field of development in the econometrics and statistics literature. In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: *spatial heterogeneity* and *spatial dependence*. For more rigorous treatments of the topics introduced here, the reader is referred to @anselin2003spatial and @anselin2014modern. ## Spatial heterogeneity @@ -160,9 +152,9 @@ nei.fes %>% theme_void() ``` -We can see how neighborhoods in the left (west) tend to have higher prices. 
What we can't see, but it is represented there if you are familiar with the geography of San Diego, is that the city is bounded by the Pacific ocean on the left, suggesting neighbourhoods by the beach tend to be more expensive. +We can see how neighborhoods in the left (west) tend to have higher prices. What we cannot see, but it is represented there if you are familiar with the geography of San Diego, is that the city is bounded by the Pacific ocean on the left, suggesting neighbourhoods by the beach tend to be more expensive. -Remember that the interpretation of a $\beta_k$ coefficient is the effect of variable $k$, *given all the other explanatory variables included remain constant*. By including a single variable for each area, we are effectively forcing the model to compare as equal only house prices that share the same value for each variable; in other words, only houses located within the same area. Introducing FE affords you a higher degree of isolation of the effects of the variables you introduce in your model because you can control for unobserved effects that align spatially with the distribution of the FE you introduce (by neighbourhood, in our case). +Remember that the interpretation of a $\beta_k$ coefficient is the effect of variable $k$, *given all the other explanatory variables included remained constant*. By including a single variable for each area, we are effectively forcing the model to compare as equal only house prices that share the same value for each variable; that is, only houses located within the same area. Introducing FE affords you a higher degree of isolation of the effects of the variables you introduce in your model because you can control for unobserved effects that align spatially with the distribution of the FE you introduce (by neighbourhood, in our case). **Spatial regimes** @@ -187,7 +179,7 @@ m3 <- lm( summary(m3) ``` -This allows us to get a separate constant term and estimate of the impact of each variable *for every neighborhood* +This allows us to get a separate constant term and estimate of the impact of each variable *for every neighborhood*. Note that to obtain a neighbourhood-specific constant, you will need to add the regression constant and the estimate for the interaction between one and a specific neighbourhood estimate. ## Spatial dependence @@ -195,9 +187,9 @@ As we have just discussed, SH is about effects of phenomena that are *explicitly **Spatial Weights** -There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication [see @anselin2003spatial for a good overview]. Common to all of them however is the way space is formally encapsulated: through *spatial weights matrices (*$W$)[^06-spatial-econometrics-2] These are $NxN$ matrices with zero diagonals and every $w_{ij}$ cell with a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$, otherwise $w_{ij}>0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterium we use to define neighborhood relations. These matrices also tend to be row-standardized so the sum of each row equals to one. +There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication [see @anselin2003spatial for a good overview]. 
Common to all of them however is the way space is formally encapsulated: through *spatial weights matrices (*$W$)[^06-spatial-econometrics-2] These are $NxN$ matrices with zero diagonals and every $w_{ij}$ cell with a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$, otherwise $w_{ij}>0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterium we use to define neighborhood relations. These matrices also tend to be row-standardized so the sum of each row equals to one. -[^06-spatial-econometrics-2]: If you need to refresh your knowledge on spatial weight matrices, check [Block E](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course; [Chapter 4](https://geographicdata.science/book/notebooks/04_spatial_weights.html) of @reyABwolf; or the [Spatial Weights](https://fcorowe.github.io/intro-gds/03-spatial_weights.html) Section of @rowe2022a. +[^06-spatial-econometrics-2]: If you need to refresh your knowledge on spatial weight matrices. [Block E](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course [Chapter 4](https://geographicdata.science/book/notebooks/04_spatial_weights.html) of @reyABwolf provide a good explanation of theory around spatial weights and the [Spatial Weights](https://fcorowe.github.io/intro-gds/03-spatial_weights.html) Section of @rowe2022a illustrates the use of R to compute different types of spatial weight matrices. A related concept to spatial weight matrices is that of *spatial lag*. This is an operator that multiplies a given variable $y$ by a spatial weight matrix: @@ -235,7 +227,7 @@ hknn **Exogenous spatial effects** -Let us come back to the house price example we have been working with. So far, we have hypothesized that the price of an AirBnb property in San Diego can be explained using information about its own characteristics, and the neighbourhood it belongs to. However, we can hypothesise that the price of a house is also affected by the characteristics of the houses surrounding it. Considering it as a proxy for larger and more luxurious houses, we will use the number of bathrooms of neighboring houses as an additional explanatory variable. This represents the most straightforward way to introduce spatial dependence in a regression, by considering not only a given explanatory variable, but also its spatial lag. +Let us come back to the house price example we have been working with. So far, we have hypothesised that the price of an AirBnb property in San Diego can be explained using information about its own characteristics, and the neighbourhood it belongs to. However, we can hypothesise that the price of a house is also affected by the characteristics of the houses surrounding it. Considering it as a proxy for larger and more luxurious houses, we will use the number of bathrooms of neighboring houses as an additional explanatory variable. This represents the most straightforward way to introduce spatial dependence in a regression, by considering not only a given explanatory variable, but also its spatial lag. In our example case, in addition to including the number of bathrooms of the property, we will include its spatial lag. In other words, we will be saying that it is not only the number of bathrooms in a house but also that of the surrounding properties that helps explain the final price at which a house is advertised for. 
Mathematically, this implies estimating the following model: @@ -298,8 +290,7 @@ Conceptually, predicting in linear regression models involves using the estimate $$ \log(\bar{P_i}) = \bar{\alpha} + \bar{\beta_1} Acc_i^* + \bar{\beta_2} Bath_i^* + \bar{\beta_3} Bedr_i^* + \bar{\beta_4} Beds_i^* -$$ -where $\log(\bar{P_i})$ is our predicted value, and we include the bar sign to note that it is our estimate obtained from fitting the model. We use the $^*$ sign to note that those can be new values for the explanatory variables, not necessarily those used to fit the model. +$$ where $\log(\bar{P_i})$ is our predicted value, and we include the bar sign to note that it is our estimate obtained from fitting the model. We use the $^*$ sign to note that those can be new values for the explanatory variables, not necessarily those used to fit the model. Technically speaking, prediction in linear models is relatively streamlined in R. Suppose we are given data for a new house which is to be put on the AirBnb platform. We know it accommodates four people, and has two bedrooms, three beds, and one bathroom. We also know that the surrounding properties have, on average, 1.5 bathrooms. Let us record the data first: diff --git a/_quarto.yml b/_quarto.yml index fb8dde4..dc01705 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -13,7 +13,7 @@ book: style: docked background: light date: today - downloads: [pdf] + #downloads: [pdf] chapters: - index.qmd - 01-overview.qmd @@ -28,8 +28,8 @@ book: - 10-st_analysis.qmd - 11-datasets.qmd - references.qmd - bibliography: references.bib +editor: visual format: html: @@ -55,8 +55,6 @@ format: reference-location: margin mermaid: theme: neutral - pdf: - documentclass: scrreprt - -editor: visual +# pdf: +# documentclass: scrreprt diff --git a/docs/01-overview.html b/docs/01-overview.html index a0241e5..07e443c 100644 --- a/docs/01-overview.html +++ b/docs/01-overview.html @@ -119,7 +119,6 @@ Spatial Modelling for Data Scientists @@ -400,8 +399,8 @@

1.5 Assessment

The final module mark is composed of the two computational essays. Together they are designed to cover the materials introduced across the entire semester. A computational essay is an essay whose narrative is supported by code and computational results that are included in the essay itself. Each teaching week, you will be required to address a set of questions relating to the module content covered in that week, and to use the material that you will produce for this purpose to build your computational essay.

-

Assignment 1 (50%) refer to the set of questions at the end of sec-chp4, sec-chp5 and sec-chp6. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance of the tasks and discussion that you are required to consider in your assignment.

-

Assignment 2 (50%) refer to the set of questions at the end of sec-chp7, sec-chp8, sec-chp9 and sec-chp10. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance of the tasks and discussion that you are required to consider in your assignment.

+

Assignment 1 (50%) refers to the set of questions at the end of Chapter 4, Chapter 5 and Chapter 6. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance on the tasks and discussion that you are required to consider in your assignment.

+

Assignment 2 (50%) refers to the set of questions at the end of Chapter 7, Chapter 8, Chapter 9 and Chapter 10. You are required to use your responses to build your computational essay. Each chapter provides more specific guidance on the tasks and discussion that you are required to consider in your assignment.

1.5.1 Format Requirements

Both assignments will have the same requirements:

diff --git a/docs/02-spatial_data.html b/docs/02-spatial_data.html index abdd9be..b852987 100644 --- a/docs/02-spatial_data.html +++ b/docs/02-spatial_data.html @@ -139,7 +139,6 @@ Spatial Modelling for Data Scientists @@ -272,13 +271,13 @@

Different classifications of spatial data types exist. Knowing the structure of the data at hand is important, as specific analytical methods are more appropriate for particular data types. We will use a particular classification involving four data types: lattice/areal data, point data, flow data and trajectory data (Fig. 1). This is not an exhaustive list, but it is helpful to motivate the analytical and modelling methods that we cover in this book.

-
Fig. 1. Data Types. Area / Lattice data source: Önnerfors et al. (2019). Point data source: Tao et al. (2018). Flow data source: Rowe and Patias (2020). Trajectory data source: Kwan and Lee (2004).
+
Fig. 1. Data Types. Area / Lattice data source: Önnerfors et al. (2019). Point data source: Tao et al. (2018). Flow data source: Rowe and Patias (2020). Trajectory data source: Kwan and Lee (2004).

Lattice/Areal Data. These data correspond to records of attribute values (such as population counts) for a fixed geographical area. They may comprise regular shapes (such as grids or pixels) or irregular shapes (such as states, counties or travel-to-work areas). Raster data are a common source of regular lattice/areal data, while censuses are probably the most common form of irregular lattice/areal data. Point data within an area can be aggregated to produce lattice/areal data.

Point Data. These data refer to records of the geographic location of a discrete event, or the number of occurrences of a geographical process at a given location. As displayed in Fig. 1, examples include the geographic location of bus stops in a city, or the number of boarding passengers at each bus stop.

Flow Data. These data refer to records of measurements for a pair of geographic point locations, or a pair of areas. These data capture the linkage or spatial interaction between two locations. Migration flows between a place of origin and a place of destination are an example of this type of data.

-

Trajectory Data. These data record geographic locations of moving objects at various points in time. A trajectory is composed of a single string of data recording the geographic location of an object at various points in time and each record in the string contains a time stamp. These data are complex and can be classified into explicit trajectory data and implicit trajectory data. The former refer to well-structured data and record positions of objects continuously and intensively at uniform time intervals, such as GPS data. The latter is less structured and record data in relatively time point intervals, including sensor-based, network-based and signal-based data (Kong et al. 2018).

-

In this course, we cover analytical and modelling approaches for point, lattice/areal and flow data. While we do not explicitly analyse trajectory data, various of the analytical approaches described in this book can be extended to incorporate time, and can be applied to model these types of data. In sec-chp10, we describe approaches to analyse and model spatio-temporal data. These same methods can be applied to trajectory data.

+

Trajectory Data. These data record geographic locations of moving objects at various points in time. A trajectory is composed of a single string of data recording the geographic location of an object at various points in time, and each record in the string contains a time stamp. These data are complex and can be classified into explicit trajectory data and implicit trajectory data. The former refer to well-structured data and record positions of objects continuously and intensively at uniform time intervals, such as GPS data. The latter are less structured and record data at less regular time intervals, including sensor-based, network-based and signal-based data (Kong et al. 2018).

+

In this course, we cover analytical and modelling approaches for point, lattice/areal and flow data. While we do not explicitly analyse trajectory data, several of the analytical approaches described in this book can be extended to incorporate time, and can be applied to model these types of data. In Chapter 10, we describe approaches to analyse and model spatio-temporal data. These same methods can be applied to trajectory data.
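To make the link between these data types concrete, the sketch below shows one way point data can be aggregated into lattice/areal data with sf. It is a minimal, illustrative example only: the object and column names (`stops` for a point layer, `wards` and `ward_id` for a polygon layer) are assumptions, not data used in this book.

```{r}
#| eval: false
library(sf)
library(dplyr)

# Hypothetical inputs: a point layer of bus stops and a polygon layer of wards
# stops <- st_read("bus_stops.geojson")
# wards <- st_read("wards.geojson")

# Attach to each point the identifier of the polygon it falls within
stops_in_wards <- st_join(stops, wards["ward_id"], join = st_within)

# Count points per polygon to produce an areal (lattice) dataset
ward_counts <- stops_in_wards %>%
  st_drop_geometry() %>%
  count(ward_id, name = "n_stops")

# Join the counts back to the polygons for mapping or modelling
wards_agg <- left_join(wards, ward_counts, by = "ward_id")
```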

2.2 Hierarchical Structure of Data

The hierarchical organisation is a key feature of spatial data. Smaller geographical units are organised within larger geographical units. You can find the hierarchical representation of UK Statistical Geographies on the Office for National Statistics website. In the bottom part of the output below, we can observe a spatial data frame for Liverpool displaying the hierarchical structure of census data (from the smallest to the largest): Output Areas (OAs), Lower Super Output Areas (LSOAs), Middle Super Output Areas (MSOAs) and Local Authority Districts (LADs). This hierarchical structure entails that units in smaller geographies are nested within units in larger geographies, and that smaller units can be aggregated to produce large units.
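As a small illustration of this nesting, the sketch below aggregates a hypothetical OA-level spatial data frame up to the LSOA level it nests within. The object and column names (`oa_sf`, `pop`, `lsoa_cd`) are assumptions for illustration only.

```{r}
#| eval: false
library(sf)
library(dplyr)

# Aggregate OA-level population counts to the LSOAs they nest within;
# sf unions the OA geometries of each group into a single LSOA geometry
lsoa_sf <- oa_sf %>%
  group_by(lsoa_cd) %>%
  summarise(pop = sum(pop))
```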

@@ -304,34 +303,34 @@

Major challenges exist when working with spatial data. Below we explore some of the key longstanding problems data scientists often face when working with geographical data.

2.3.1 Modifiable Area Unit Problem (MAUP)

-

The Modifible Area Unit Problem (MAUP) represents a challenge that has troubled geographers for decades (Openshaw 1981). Two aspects of the MAUP are normally recognised in empirical analysis relating to scale and zonation. Fig. 2 illustrates these issues

+

The Modifiable Area Unit Problem (MAUP) represents a challenge that has troubled geographers for decades (Openshaw 1981). Two aspects of the MAUP are normally recognised in empirical analysis, relating to scale and zonation. Fig. 2 illustrates these issues:

  • Scale refers to the idea that a geographical area can be divided into geographies with differing numbers of spatial units.

  • Zonation refers to the idea that a geographical area can be divided into the same number of units in a variety of ways.

-
Fig. 2. MAUP effect. (a) scale effect; and, (b) zonation effect. Source: Loidl et al. (2016).
+
Fig. 2. MAUP effect. (a) scale effect; and, (b) zonation effect. Source: Loidl et al. (2016).
-

The MAUP is a critical issue as it can impact our analysis and thus any conclusions we can infer from our results (e.g. Fotheringham and Wong 1991). There is no agreed systematic approach on how to handle the effects of the MAUP. Some have suggested to perform analyses based on different existing geographical scales, and assess the consistency of the results and identify potential sources of change. The issue with such approach is that results from analysis at different scales are likely to differ because distinct dimensions of a geographic process may be captured at different scales. For example, in migration studies, smaller geographies may be more suitable to capture residential mobility over short distances, while large geographies may be more suitable to capture long-distance migration. And it is well documented that these types of moves are driven by different factors. While residential mobility tends to be driven by housing related reasons, long-distance migration is more closely related to employment-related motives (Niedomysl 2011).

-

An alternative approach is to use the smallest geographical system available and create random aggregations at various geographical scales, to directly quantify the extent of scale and zonation. This approach has shown promising results in applications to study internal migration flows (Stillwell, Daras, and Bell 2018). Another approach involves the production of “meaningful” or functional geographies that can more appropriately capture the process of interest. There is an active area of work defining functional labour markets (Casado-Díaz, Martínez-Bernabéu, and Rowe 2017), urban areas (Arribas-Bel, Garcia-López, and Viladecans-Marsal 2021) and various forms of geodemographic classifications (Singleton and Spielman 2013; Patias, Rowe, and Cavazzi 2019) . However there is the recognition that none of the existing approaches resolve the effects of the MAUP and recently it has been suggested that the most plausible ‘solution’ would be to ignore the MAUP (Wolf et al. 2020).

+

The MAUP is a critical issue as it can impact our analysis and thus any conclusions we can infer from our results (e.g. Fotheringham and Wong 1991). There is no agreed systematic approach on how to handle the effects of the MAUP. Some have suggested to perform analyses based on different existing geographical scales, and assess the consistency of the results and identify potential sources of change. The issue with such approach is that results from analysis at different scales are likely to differ because distinct dimensions of a geographic process may be captured at different scales. For example, in migration studies, smaller geographies may be more suitable to capture residential mobility over short distances, while large geographies may be more suitable to capture long-distance migration. And it is well documented that these types of moves are driven by different factors. While residential mobility tends to be driven by housing related reasons, long-distance migration is more closely related to employment-related motives (Niedomysl 2011).

+

An alternative approach is to use the smallest geographical system available and create random aggregations at various geographical scales, to directly quantify the extent of scale and zonation. This approach has shown promising results in applications to study internal migration flows (Stillwell, Daras, and Bell 2018). Another approach involves the production of “meaningful” or functional geographies that can more appropriately capture the process of interest. There is an active area of work defining functional labour markets (Casado-Díaz, Martínez-Bernabéu, and Rowe 2017), urban areas (Arribas-Bel, Garcia-López, and Viladecans-Marsal 2021) and various forms of geodemographic classifications (Singleton and Spielman 2013; Patias, Rowe, and Cavazzi 2019) . However there is the recognition that none of the existing approaches resolve the effects of the MAUP and recently it has been suggested that the most plausible ‘solution’ would be to ignore the MAUP (Wolf et al. 2020).
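To see the scale aspect of the MAUP in practice, a rough sketch along the following lines can be used. It assumes a projected point layer `pts` with a numeric attribute `value` (both hypothetical), and simply aggregates the same points to two regular grids of different cell size to show how summary patterns shift with the scale of aggregation.

```{r}
#| eval: false
library(sf)

# Build two regular grids of different resolution over the same points
grid_small <- st_make_grid(pts, cellsize = 500)   # fine grid
grid_large <- st_make_grid(pts, cellsize = 2000)  # coarse grid

# Mean of the attribute within each cell, at both scales
agg_by <- function(grid, pts) {
  idx <- st_intersects(st_sf(geometry = grid), pts)
  sapply(idx, function(i) mean(pts$value[i]))
}
mean_small <- agg_by(grid_small, pts)
mean_large <- agg_by(grid_large, pts)

# The spread of cell means typically changes with the aggregation scale
summary(mean_small); summary(mean_large)
```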

2.3.2 Ecological Fallacy

-

Ecological fallacy is an error in the interpretation of statistical data based on aggregate information. Specifically it refers to inferences made about the nature of specific individuals based solely on statistics aggregated for a given group. It is about thinking that relationships observed for groups necessarily hold for individuals. A key example is Robinson (1950) who illustrates this problem exploring the difference between ecological correlations and individual correlations. He looked at the relationship between country of birth and literacy. Robinson (1950) used the percent of foreign-born population and percent of literate population for the 48 states in the United States in 1930. The ecological correlation based on these data was 0.53. This suggests a positive association between foreign birth and literacy, and could be interpreted as foreign born individuals being more likely to be literate than native-born individuals. Yet, the correlation based on individual data was negative -0.11 which indicates the opposite. The main point emerging from this example is to carefully interpret analysis based on spatial data and avoid making inferences about individuals from these data.

+

Ecological fallacy is an error in the interpretation of statistical data based on aggregate information. Specifically it refers to inferences made about the nature of specific individuals based solely on statistics aggregated for a given group. It is about thinking that relationships observed for groups necessarily hold for individuals. A key example is Robinson (1950) who illustrates this problem exploring the difference between ecological correlations and individual correlations. He looked at the relationship between country of birth and literacy. Robinson (1950) used the percent of foreign-born population and percent of literate population for the 48 states in the United States in 1930. The ecological correlation based on these data was 0.53. This suggests a positive association between foreign birth and literacy, and could be interpreted as foreign born individuals being more likely to be literate than native-born individuals. Yet, the correlation based on individual data was negative -0.11 which indicates the opposite. The main point emerging from this example is to carefully interpret analysis based on spatial data and avoid making inferences about individuals from these data.
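The gap between group-level and individual-level correlations can be reproduced with a small simulation. The sketch below is purely illustrative and does not use Robinson's data: it builds groups whose averages are positively related while individuals within groups move in the opposite direction.

```{r}
#| eval: false
set.seed(1)

# 48 groups whose group means of x and y are positively related
g_x <- rnorm(48)
g_y <- 0.6 * g_x + rnorm(48, sd = 0.3)

# 100 individuals per group; within groups, x and y move in opposite directions
w <- rnorm(48 * 100)
x <- rep(g_x, each = 100) + w
y <- rep(g_y, each = 100) - w + rnorm(48 * 100, sd = 0.5)

cor(g_x, g_y)  # group-level (ecological) correlation: strongly positive
cor(x, y)      # individual-level correlation: negative
```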

2.3.3 Spatial Dependence

-

Spatial dependence refers to the spatial relationship of a variable’s values for a pair of locations at a certain distance apart, so that these values are more similar (or less similar) than expected for randomly associated pairs of observations (Anselin 1988). For example, we could think of observed patterns of ethnic segregation in an area are a result of spillover effects of pre-existing patterns of ethnic segregation in neighbouring areas. sec-chp5 will illustrate approach to explicitly incorporate spatial dependence in regression analysis.

+

Spatial dependence refers to the spatial relationship of a variable’s values for a pair of locations at a certain distance apart, so that these values are more similar (or less similar) than expected for randomly associated pairs of observations (Anselin 1988). For example, we could think of observed patterns of ethnic segregation in an area as a result of spillover effects of pre-existing patterns of ethnic segregation in neighbouring areas. Chapter 5 will illustrate approaches to explicitly incorporate spatial dependence in regression analysis.
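A quick way to gauge this kind of dependence is a global Moran's I test. The sketch below is a minimal example, assuming a polygon sf object `areas` with a numeric column `seg` (e.g. a segregation index); both names are placeholders.

```{r}
#| eval: false
library(spdep)

# Contiguity-based neighbours and row-standardised spatial weights
nb <- poly2nb(areas)
lw <- nb2listw(nb, style = "W")

# Global Moran's I: is seg more similar among neighbours than expected at random?
moran.test(areas$seg, lw)

# Spatial lag of the variable: the average value of seg among each area's neighbours
areas$seg_lag <- lag.listw(lw, areas$seg)
```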

2.3.4 Spatial Heterogeneity

-

Spatial heterogeneity refers to the uneven distribution of a variable’s values across space. Concentration of deprivation or unemployment across an area are good examples of spatial heterogeneity. We illustrate various ways to visualise, explore and measure the spatial distribution of data in multiple chapters. We also discuss on potential modelling approaches to capture spatial heterogeneity in sec-chp5, sec-chp7 and sec-chp10.

+

Spatial heterogeneity refers to the uneven distribution of a variable’s values across space. Concentrations of deprivation or unemployment across an area are good examples of spatial heterogeneity. We illustrate various ways to visualise, explore and measure the spatial distribution of data in multiple chapters. We also discuss potential modelling approaches to capture spatial heterogeneity in Chapter 5, Chapter 7 and Chapter 10.
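A first look at this kind of unevenness is often just a choropleth of the variable. A minimal sketch, assuming a polygon sf object `areas` with a hypothetical unemployment rate column `unemp`:

```{r}
#| eval: false
library(ggplot2)

ggplot(areas) +
  geom_sf(aes(fill = unemp), colour = NA) +  # map the uneven distribution of unemployment
  scale_fill_viridis_c() +
  theme_void()
```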

2.3.5 Spatial nonstationarity

-

Spatial nonstationarity refers to variations in the relationship between an outcome variable and a set of predictor variables across space. In a modelling context, it relates to a situation in which a simple “global” model is inappropriate to explain the relationships between a set of variables. The geographical nature of the model must be modified to reflect local structural relationships within the data. For example, ethinic segregation has been positively associated with employment outcomes in some countries pointing to networks in pre-existing communities facilitating access to the local labour market. Inversely ethinic segregation has been negatively associated with employment outcomes pointing to lack of integration into the broader local community. We illustrate various modelling approaches to capture spatial nonstationarity in sec-chp8 and sec-chp9.

+

Spatial nonstationarity refers to variations in the relationship between an outcome variable and a set of predictor variables across space. In a modelling context, it relates to a situation in which a simple “global” model is inappropriate to explain the relationships between a set of variables. The geographical nature of the model must be modified to reflect local structural relationships within the data. For example, ethnic segregation has been positively associated with employment outcomes in some countries, pointing to networks in pre-existing communities facilitating access to the local labour market. Conversely, ethnic segregation has been negatively associated with employment outcomes, pointing to a lack of integration into the broader local community. We illustrate various modelling approaches to capture spatial nonstationarity in Chapter 8 and Chapter 9.
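One crude but transparent way to check for this is to fit the same model separately over different parts of the study area and compare the coefficients. The sketch below assumes a hypothetical data frame `df` with an outcome `y`, a predictor `seg` and a `region` identifier; Chapter 8 and Chapter 9 cover principled alternatives such as multilevel models and GWR.

```{r}
#| eval: false
library(dplyr)
library(purrr)

# Fit y ~ seg within each region and compare the slope estimates
df %>%
  group_split(region) %>%
  map_dfr(~ data.frame(
    region = unique(.x$region),
    slope  = coef(lm(y ~ seg, data = .x))["seg"]
  ))
```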

-
+
@@ -305,10 +304,10 @@

If you are already familiar with R, R computational notebooks and data types, you may want to jump to Section Read Data and start from there. This section describes how to read and manipulate data using sf and tidyverse functions, including mutate(), %>% (known as the pipe operator), select() and filter(), as well as specific packages and functions to manipulate spatial data.

The chapter is based on:

    -
  • Grolemund and Wickham (2019), this book illustrates key libraries, including tidyverse, and functions for data manipulation in R

  • -
  • Xie, Allaire, and Grolemund (2019), excellent introduction to R markdown!

  • -
  • Williamson (2018), some examples from the first lecture of ENVS450 are used to explain the various types of random variables.

  • -
  • Lovelace, Nowosad, and Muenchow (2019), a really good book on handling spatial data and historical background of the evolution of R packages for spatial data analysis.

  • +
  • Grolemund and Wickham (2019), this book illustrates key libraries, including tidyverse, and functions for data manipulation in R

  • +
  • Xie, Allaire, and Grolemund (2019), excellent introduction to R markdown!

  • +
  • Williamson (2018), some examples from the first lecture of ENVS450 are used to explain the various types of random variables.

  • +
  • Lovelace, Nowosad, and Muenchow (2019), a really good book on handling spatial data and historical background of the evolution of R packages for spatial data analysis.

3.1 Dependencies

@@ -381,7 +380,7 @@
-

To get familiar with good practices in writing your code in R, we recommend the Chapter Workflow: basics and Workflow: scripts and projects from the R in Data Science book by Wickham, Çetinkaya-Rundel, and Grolemund (2023).

+

To get familiar with good practices in writing your code in R, we recommend the Chapter Workflow: basics and Workflow: scripts and projects from the R in Data Science book by Wickham, Çetinkaya-Rundel, and Grolemund (2023).

https://r4ds.hadley.nz/workflow-basics.html

@@ -433,7 +432,7 @@
  • Save the script: File > Save As, select your required destination folder, and enter any filename that you like, provided that it ends with the file extension .R
  • -

    An R Notebook or a Quarto Document are a Markdown options with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see Xie, Allaire, and Grolemund (2019). A Quarto Document is an improved version of the original R Notebook. Quarto Document requires a package called Quarto. Quarto does not have a dependency or requirement for R. Quarto is multilingual, beginning with R, Python, Javascript, and Julia. The concept is that Quarto will work even for languages that do not yet exist. This book was original written in R Notebook but later transitioned into Quarto Documents.

    +

An R Notebook or a Quarto Document are Markdown options with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see Xie, Allaire, and Grolemund (2019). A Quarto Document is an improved version of the original R Notebook. A Quarto Document requires a package called Quarto. Quarto does not have a dependency or requirement for R. Quarto is multilingual, beginning with R, Python, Javascript, and Julia. The concept is that Quarto will work even for languages that do not yet exist. This book was originally written in R Notebooks but later transitioned into Quarto Documents.

    To create an R Notebook, you need to:

    • Open a new script file: File > New File > R Notebook @@ -857,7 +856,7 @@
      census <- census %>% 
         mutate( per_ghealth = ghealth / pop )
      -

      Note we used a pipe operator %>%, which helps make the code more efficient and readable - more details, see Grolemund and Wickham (2019). When using the pipe operator, recall to first indicate the data frame before %>%.

      +

Note we used a pipe operator %>%, which helps make the code more efficient and readable - for more details, see Grolemund and Wickham (2019). When using the pipe operator, remember to first indicate the data frame before %>%.

Note also the use of a variable name before the = sign in brackets to indicate the name of the new variable after mutate.

    3.9.2 Selecting Variables

    @@ -932,7 +931,7 @@

    3.10 Using Spatial Data Frames

    -

    A core area of the module is learning to work with spatial data in R. R has various purposedly designed packages for manipulation of spatial data and spatial analysis techniques. Various packages exist in CRAN, including sf (Pebesma 2018, 2022a), stars (Pebesma 2022b), terra, s2 (Dunnington, Pebesma, and Rubak 2023), lwgeom (Pebesma 2023), gstat (Pebesma 2004; Pebesma and Graeler 2022), spdep (Bivand 2022), spatialreg (Bivand and Piras 2022), spatstat (Baddeley, Rubak, and Turner 2015; Baddeley, Turner, and Rubak 2022), tmap (Tennekes 2018, 2022), mapview (Appelhans et al. 2022) and more. A key package is this ecosystem is sf (Pebesma and Bivand 2023). R package sf provides a table format for simple features, where feature geometries are stored in a list-column. It appeared in 2016 and was developed to move spatial data analysis in R closer to standards-based approaches seen in the industry and open source projects, to build upon more modern versions of open source geospatial software stack and allow for integration of R spatial software with the tidyverse (Wickham et al. 2019), particularly ggplot2, dplyr, and tidyr. Hence, this book relies heavely on sf for the manipulation and analysis of the data.

    +

A core area of the module is learning to work with spatial data in R. R has various purposely designed packages for manipulation of spatial data and spatial analysis techniques. Various packages exist in CRAN, including sf (Pebesma 2018, 2022a), stars (Pebesma 2022b), terra, s2 (Dunnington, Pebesma, and Rubak 2023), lwgeom (Pebesma 2023), gstat (Pebesma 2004; Pebesma and Graeler 2022), spdep (Bivand 2022), spatialreg (Bivand and Piras 2022), spatstat (Baddeley, Rubak, and Turner 2015; Baddeley, Turner, and Rubak 2022), tmap (Tennekes 2018, 2022), mapview (Appelhans et al. 2022) and more. A key package in this ecosystem is sf (Pebesma and Bivand 2023). R package sf provides a table format for simple features, where feature geometries are stored in a list-column. It appeared in 2016 and was developed to move spatial data analysis in R closer to standards-based approaches seen in the industry and open source projects, to build upon more modern versions of the open source geospatial software stack and allow for integration of R spatial software with the tidyverse (Wickham et al. 2019), particularly ggplot2, dplyr, and tidyr. Hence, this book relies heavily on sf for the manipulation and analysis of the data.
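As a small illustration of the sf data model described above (ordinary attribute columns alongside geometries stored in a list-column), the sketch below builds an sf object from a plain data frame of hypothetical coordinates:

```{r}
#| eval: false
library(sf)

df <- data.frame(
  name = c("a", "b"),
  lon  = c(-2.98, -2.96),
  lat  = c(53.40, 53.41)
)

# Convert to sf: attributes stay as columns, points go into a geometry list-column
pts <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)
pts
```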

    @@ -945,7 +944,7 @@
    -

    Lovelace, Nowosad, and Muenchow (2024) provide a helpful overview and evolution of R spatial package ecosystem.

    +

Lovelace, Nowosad, and Muenchow (2024) provide a helpful overview of the evolution of the R spatial package ecosystem.

To read our spatial data, we use the st_read function. We read a shapefile containing data at Output Area (OA) level for Liverpool. These data illustrate the hierarchical structure of spatial data.

    @@ -1168,7 +1167,7 @@ -

    To master ggplot, see Wickham (2009).

    +

    To master ggplot, see Wickham (2009).

    Using tmap

    Similar to ggplot2, tmap is based on the idea of a ‘grammar of graphics’ which involves a separation between the input data and aesthetics (i.e. the way data are visualised). Each data set can be mapped in various different ways, including location as defined by its geometry, colour and other features. The basic building block is tm_shape() (which defines input data), followed by one or more layer elements such as tm_fill() and tm_dots().
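A minimal tmap call following that grammar might look like the sketch below. It assumes a polygon sf object `shp` carrying the share-of-good-health column `per_ghealth` created earlier in the chapter; the object name is a placeholder.

```{r}
#| eval: false
library(tmap)

tm_shape(shp) +             # input data
  tm_fill("per_ghealth") +  # fill polygons by the chosen attribute
  tm_borders(alpha = 0.3)   # add light polygon outlines
```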

    @@ -1204,8 +1203,8 @@
    Warning: In view mode, scale bar breaks are ignored.
    -
    - +
    +
    @@ -1223,7 +1222,7 @@

    3.10.3 Comparing geographies

    -

    If you recall, one of the key issues of working with spatial data is the modifiable area unit problem (MAUP) - see (spatial_data?). To get a sense of the effects of MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.

    +

If you recall, one of the key issues of working with spatial data is the modifiable area unit problem (MAUP) - see Chapter 2. To get a sense of the effects of MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.

    @@ -1293,7 +1292,7 @@
    -
    +
    @@ -302,14 +301,14 @@

This chapter is based on the following references, which are great follow-ups on the topic:

    • -Lovelace, Nowosad, and Muenchow (2019) offer a great introduction.
    • -
    • Chapter 6 of Brunsdon and Comber (2015), in particular subsections 6.3 and 6.7.
    • +Lovelace, Nowosad, and Muenchow (2019) offer a great introduction. +
    • Chapter 6 of Brunsdon and Comber (2015), in particular subsections 6.3 and 6.7.
    • -Bivand, Pebesma, and Gómez-Rubio (2013) provides an in-depth treatment of spatial data in R.
    • +Bivand, Pebesma, and Gómez-Rubio (2013) provides an in-depth treatment of spatial data in R.

    4.1 Dependencies

    -

    We will rely on the following libraries in this section, all of them included in sec-dependencies:

    +

    We will rely on the following libraries in this section, all of them included in Section 1.4.1:

    # data manipulation, transformation and visualisation
     library(tidyverse)
    @@ -356,29 +355,20 @@
     

    The rest of this session will focus on two main elements of the table: the spatial dimension (as stored in the point coordinates), and the nightly price values, expressed in USD and contained in the price column. To get a sense of what they look like first, let us plot both. We can get a quick look at the non-spatial distribution of house values with the following commands:

    -
    +
    # Create the histogram
     qplot( data = db, x = price)
    -
    -
    Warning: `qplot()` was deprecated in ggplot2 3.4.0.
    -
    -
    -
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    -
    -
    +

    -
    Raw AirBnb prices in San Diego
    +

This basically shows there are a lot of values concentrated around the lower end of the distribution, but a few very large ones. A usual transformation to shrink these differences is to take logarithms. The original table already contains an additional column with the logarithm of each price (log_price).

    -
    # Create the histogram
    +
    # Create the histogram
     qplot( data = db, x = log_price )
    -
    -
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    -

    @@ -388,7 +378,7 @@

    To obtain the spatial distribution of these houses, we need to focus on the geometry column. The easiest, quickest (and also “dirtiest”) way to get a sense of what the data look like over space is using plot:
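The call itself is not shown in this hunk; a minimal version, assuming the sf object `db` loaded above and the sf package attached, would be:

```{r}
#| eval: false
# Plot only the point geometries, dropping the attribute columns
plot(st_geometry(db))
```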

    - +

    @@ -401,7 +391,7 @@ 4.3 Binning

    The two-dimensional sister of histograms are binning maps: we divide each of the two dimensions into “buckets”, and count how many points fall within each bucket. Unlike histograms, we encode that count with a color gradient rather than a bar chart over an additional dimension (for that, we would need a 3D plot). These “buckets” can be squares (left) or hexagons (right):

    -
          # Squared binning
    +
          # Squared binning
     # Set up plot
     sqbin <- ggplot( ) + 
     # Add 2D binning with the XY coordinates as
    @@ -440,7 +430,7 @@
     4.4.1 One-dimensional KDE
     

    KDE over a single dimension is essentially a contiguous version of a histogram. We can see that by overlaying a KDE on top of the histogram of logs that we have created before:

    -
    # Create the base
    +
    # Create the base
     base <- ggplot(db, aes(x=log_price))
     # Histogram
     hist <- base + 
    @@ -467,7 +457,7 @@
     

Geography, at the end of the day, is usually represented as a two-dimensional space where we locate objects using a system of dual coordinates, X and Y (or latitude and longitude). Thanks to that, we can use the same technique as above to obtain a smooth representation of the distribution of a two-dimensional variable. The crucial difference is that, instead of obtaining a curve as the output, we will create a surface, where intensity will be represented with a color gradient, rather than with the second dimension, as is the case in the figure above.

    To create a spatial KDE in R, we can use general tooling for non-spatial points, such as the stat_density2d_filled method:

# Create the KDE surface
kde <- ggplot(data = db) +
  stat_density2d_filled(alpha = 1,
    data = as.data.frame(st_coordinates(db)), 
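A hedged sketch of a complete call in the same spirit (the aesthetic mapping and grid resolution are illustrative):

kde <- ggplot(data = db) +
  stat_density2d_filled(alpha = 1,
    data = as.data.frame(st_coordinates(db)),
    aes(x = X, y = Y),
    n = 100                       # resolution of the estimation grid
  ) +
  theme_void() +
  theme(legend.position = "none")
kde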
     

    This approach generates a surface that represents the density of dots, that is an estimation of the probability of finding a house transaction at a given coordinate. However, without any further information, they are hard to interpret and link with previous knowledge of the area. To bring such context to the figure, we can plot an underlying basemap, using a cloud provider such as Google Maps or, as in this case, OpenStreetMap. To do it, we will leverage the library basemapR, which is designed to play nicely with the ggplot2 family (hence the seemingly counterintuitive example above). Before we can plot them with the online map, we need to reproject them though.

bbox_db <- st_bbox(db)
ggplot() +
  base_map(bbox_db, increase_zoom = 2, basemap = "positron") +
  #geom_sf(data = db, fill = NA) +
4.5 Spatial Interpolation
     

The previous section demonstrates how to visualize the distribution of a set of spatial objects represented as points. In particular, given a bunch of house locations, it shows how one can effectively visualize their distribution over space and get a sense of the density of occurrences. Such visualization, because it relies on KDE, is based on a smooth continuum rather than a discrete approach (as a choropleth would be, for example).

    Many times however, we are not particularly interested in learning about the density of occurrences, but about the distribution of a given value attached to each location. Think for example of weather stations and temperature: the location of the stations is no secret and rarely changes, so it is not of particular interest to visualize the density of stations; what we are usually interested instead is to know how temperature is distributed over space, given we only measure it in a few places. One could argue the example we have been working with so far, house prices in AirBnb, fits into this category as well: although where a house is advertised may be of relevance, more often we are interested in finding out what the “surface of price” looks like. Rather than where are most houses being advertised? we usually want to know where the most expensive or most affordable houses are located.

In cases where we are interested in creating a surface of a given value, rather than a simple density surface of occurrences, KDE cannot help us. In these cases, what we are interested in is spatial interpolation, a family of techniques that aim at exactly that: creating continuous surfaces for a particular phenomenon (e.g. temperature, house prices) given only a finite sample of observations. Spatial interpolation is a large field of research that is still being actively developed and that can involve a substantial amount of mathematical complexity in order to obtain the most accurate estimates possible1. In this chapter, we will introduce the simplest possible way of interpolating values, hoping this will give you a general understanding of the methodology and, if you are interested, you can check out further literature. For example, Banerjee, Carlin, and Gelfand (2014) or Cressie (2015) are hard but good overviews.

    1 There is also an important economic incentive to do this: some of the most popular applications are in the oil and gas or mining industries. In fact, the very creator of this technique, Danie G. Krige, was a mining engineer. His name is usually used to nickname spatial interpolation as kriging.

    4.5.1 Inverse Distance Weight (IDW) interpolation

The technique we will cover here is called Inverse Distance Weighting, or IDW for convenience. Brunsdon and Comber (2015) offer a good description:

    In the inverse distance weighting (IDW) approach to interpolation, to estimate the value of \(z\) at location \(x\) a weighted mean of nearby observations is taken […]. To accommodate the idea that observations of \(z\) at points closer to \(x\) should be given more importance in the interpolation, greater weight is given to these points […]

    — Page 204
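In symbols, this weighted mean can be written as follows; the power parameter \(p\) and the notation are a generic formulation of the idea, not necessarily the exact scheme used by the authors:

\[ \hat{z}(x) = \frac{\sum_{i=1}^{n} w_i \, z(x_i)}{\sum_{i=1}^{n} w_i}, \qquad w_i = \frac{1}{d(x, x_i)^p} \]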


Let us go into each of them in detail4. First, let us set up a grid for the extent of the bounding box of our data (note the use of the pipe operator, %>%, to chain functions):

    4 For the relevant calculations, we will be using the gstat library.

sd.grid <- db %>%
  st_bbox() %>%
  st_as_sfc() %>%
  st_make_grid(
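A hedged sketch of how the chain can be completed to produce the 100 x 100 grid of cell centres described below (the argument choices are illustrative):

sd.grid <- db %>%
  st_bbox() %>%
  st_as_sfc() %>%
  st_make_grid(
    n = c(100, 100),   # 100 x 100 cells over the bounding box
    what = "centers"   # keep the cell centres as points
  ) %>%
  st_sf()              # convert back to an sf table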
     

    The object sd.grid is a regular grid with 10,000 (\(100 \times 100\)) equally spaced cells:

sd.grid
Simple feature collection with 10000 features and 2 fields
Geometry type: POINT
     

Now, sd.grid only contains the location of points to which we wish to interpolate. That is, we now have our “target” geography for which we’d like to have AirBnb prices, but we don’t have price estimates. For that, on to the IDW, which will generate estimates for locations in sd.grid based on the observed prices in db. Again, this is hugely simplified by gstat:

idw.hp <- idw(
  price ~ 1,         # Formula for IDW
  locations = db,    # Initial locations with values
  newdata=sd.grid,   # Locations we want predictions for
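A hedged sketch of the complete call; the `nmax` value follows the 150 nearest observations mentioned in the text below:

idw.hp <- idw(
  price ~ 1,          # Formula for IDW
  locations = db,     # Initial locations with values
  newdata = sd.grid,  # Locations we want predictions for
  nmax = 150          # Use only the 150 nearest observations
)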
     

    Boom! We’ve got it. Let us pause for a second to see how we just did it. First, we pass price ~ 1. This specifies the formula we are using to model house prices. The name on the left of ~ represents the variable we want to explain, while everything to its right captures the explanatory variables. Since we are considering the simplest possible case, we do not have further variables to add, so we simply write 1. Then we specify the original locations for which we do have house prices (our original db object), and the points where we want to interpolate the house prices (the sd.grid object we just created above). One more note: by default, idw uses all the available observations, weighted by distance, to provide an estimate for a given point. If you want to modify that and restrict the maximum number of neighbors to consider, you need to tweak the argument nmax, as we do above by using the 150 nearest observations to each point5.

    5 Have a play with this because the results do change significantly. Can you reason why?

    The object we get from idw is another spatial table, just as db, containing the interpolated values. As such, we can inspect it just as with any other of its kind. For example, to check out the top of the estimated table:

head(idw.hp)
Simple feature collection with 6 features and 2 fields
Geometry type: POINT
4.5.2 A surface of housing prices
     

    Once we have the IDW object computed, we can plot it to explore the distribution, not of AirBnb locations in this case, but of house prices over the geography of San Diego. To do this using ggplot2, we first append the coordinates of each grid cell as columns of the table:

idw.hp = idw.hp %>%
  cbind(st_coordinates(.))

    Now, we can visualise the surface using standard ggplot2 tools:

ggplot(idw.hp, aes(x = X, y = Y, fill = var1.pred)) +
  geom_raster()

    And we can “dress it up” a bit further:

ggplot(idw.hp, aes(x = X, y = Y, fill = var1.pred)) +
  geom_raster() +
  scale_fill_viridis_b() +
  theme_void() +
     

The last bit we will explore in this session relates to prediction for new values. Imagine you are a real estate data scientist working for AirBnb and your boss asks you to give an estimate of how much a new house going into the market should cost. The only information you have to make such a guess is the location of the house. In this case, the IDW model we have just fitted can help you. The trick is realizing that, instead of creating an entire grid, all we need is to obtain an estimate for a single location.

Let us say a new house is going to be advertised at the coordinates X = -117.02259063720702, Y = 32.76511965117273, expressed in longitude and latitude. In that case, we can do as follows:

pt <- c(X = -117.02259063720702, Y = 32.76511965117273) %>%
  st_point() %>%
  st_sfc() %>%
  st_sf(crs = "EPSG:4326") %>%
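A hedged sketch of the remaining steps: bring the point into the CRS of `db` and interpolate at that single location (the object name `idw.one` matches the output shown next):

pt <- c(X = -117.02259063720702, Y = 32.76511965117273) %>%
  st_point() %>%
  st_sfc() %>%
  st_sf(crs = "EPSG:4326") %>%
  st_transform(st_crs(db))   # match the CRS of the observations

idw.one <- idw(
  price ~ 1,
  locations = db,
  newdata = pt,
  nmax = 150
)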
     
    [inverse distance weighted interpolation]
idw.one
    Simple feature collection with 1 feature and 2 fields
     Geometry type: POINT
4.6 Questions
     

    We will be using the Madrid AirBnb dataset:

mad_abb <- st_read("data/assignment_1_madrid/madrid_abb.gpkg")
    Reading layer `madrid_abb' from data source 
       `/Users/franciscorowe/Dropbox/Francisco/uol/teaching/envs453/202324/san/data/assignment_1_madrid/madrid_abb.gpkg' 
     

    This is fairly similar in spirit to the one from San Diego we have relied on for the chapter, although the column set is not exactly the same:

colnames(mad_abb)
     [1] "price"           "price_usd"       "log1p_price_usd" "accommodates"   
      [5] "bathrooms_text"  "bathrooms"       "bedrooms"        "beds"           
     
     
     

Content is based on the following references, which are great follow-ups on the topic:

• Fotheringham and O’Kelly (1989) offer a historical overview of spatial interaction models and illustration of use cases.

• Rowe, Lovelace, and Dennett (2022) provide a good overview of the existing limitations and opportunities of spatial interaction modelling.

• Singleton (2017), an online short course on R for Geographic Data Science and Urban Analytics. In particular, the section on mapping flows is especially relevant here.

• The predictive checks section draws heavily from Gelman and Hill (2006), in particular Chapters 6 and 7.

    5.1 Dependencies

We will rely on the following libraries in this section, all of them included in Section 1.4.1:

    # Data management
     library(tidyverse)
     

    In this chapter we will show a slightly different way of managing spatial data in R. Although most of the functionality will be similar to that seen in previous chapters, we will not rely on the “sf stack” and we will instead show how to read and manipulate data using the more traditional sp stack. Although this approach is being slowly phased out, it is still important to be aware of its existence and its differences with more modern approaches.

    5.2 Data

In this note, we will use data from the city of San Francisco representing bike trips on their public bike share system. The original source is the SF Open Data portal and the dataset comprises both the location of each station in the Bay Area as well as information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the sf_bikes folder. The data file is named flows.geojson and, in case you are interested, the (Python) code required to create it from the original files in the SF Data Portal is also available in the flows_prep.ipynb notebook, also in the same folder.

    Let us then directly load the file with all the information necessary:

    db <- st_read('./data/sf_bikes/flows.geojson')

    Note how we transform the size so it’s a proportion of the largest trip and then it is compressed with a logarithm.

    5.4 Modelling flows

Now we have an idea of the spatial distribution of flows, we can begin to think about modeling them. The core idea in this section is to fit a model that can capture the particular characteristics of our variable of interest (the volume of trips) using a set of predictors that describe the nature of a given flow. We will start from the simplest model and then progressively build complexity until we get to a satisfying point. Along the way, we will be exploring each model using concepts from Gelman and Hill (2006), such as predictive performance checks1 (PPC).

    1 For a more elaborate introduction to PPC, have a look at Chapters 7 and 8.

    Before we start running regressions, let us first standardize the predictors so we can interpret the intercept as the average flow when all the predictors take the average value, and so we can interpret the model coefficients as changes in standard deviation units:

    # Scale all the table
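A hedged sketch of that standardisation step, scaling every numeric column in the table (the object name is illustrative; in practice you may want to exclude the outcome from the scaling):

# Scale all numeric columns to mean zero and unit standard deviation
db_std <- db %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.))))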
  return(y_hat)
}

    This function takes a model m and the set of covariates x used and returns a random realization of predictions from the model. To get a sense of how this works, we can get and plot a realization of the model, compared to the expected one and the actual values:

    +

    This function takes a model m and the set of covariates x used and returns a random realisation of predictions from the model. To get a sense of how this works, we can get and plot a realisation of the model, compared to the expected one and the actual values:

    new_y <- generate_draw(m1)
     

The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realisations of our “inferential engine”, which depict a much less concentrated distribution of values. This is likely due to the combination of two different reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, it may be that the amount of variation we are not able to account for in the model2 is so large that the degree of uncertainty contained in the error term of the model is very large, hence resulting in such a flat predictive distribution.

2 The \(R^2\) of our model is around 2%.

3 Which they are not really, in light of the comparison between the black and red lines.

Keep in mind that the issues discussed in the paragraph above relate to the uncertainty behind our model, not to the point predictions derived from it, which are a mechanistic result of the minimisation of the squared residuals and hence are not subject to probability or inference. That allows the model in this case to provide an apparently much more accurate fitted distribution (black line above). However, the lesson to take from the model is that, even if the point predictions (fitted values) are artificially accurate3, our capabilities to infer about the more general underlying process are fairly limited.

    Improving the model

    The bad news from the previous section is that our initial model is not great at explaining bike trips. The good news is there are several ways in which we can improve this. In this section we will cover three main extensions that exemplify three different routes you can take when enriching and augmenting models in general, and spatial interaction ones in particular4. These three routes are aligned around the following principles:

4 These principles are general and can be applied to pretty much any modeling exercise you run into. The specific approaches we take in this note relate to spatial interaction models.


    As we can see, the differences exist but they are not massive. Let’s use this example to learn how to interpret coefficients in a Poisson model8. Effectively, these estimates can be understood as multiplicative effects. Since our model fits

\[ T_{ij} \sim Poisson (\exp(X_{ij}\beta)) \]

8 See section 6.2 of Gelman and Hill (2006) for a similar treatment of these.

    we need to transform \(\beta\) through an exponential in order to get a sense of the effect of distance on the number of trips. This means that for the street distance, our original estimate is \(\beta_{street} = -0.0996\), but this needs to be translated through the exponential into \(e^{-0.0996} = 0.906\). In other words, since distance is expressed in standard deviations9, we can expect a 10% decrease in the number of trips for an increase of one standard deviation (about 1Km) in the distance between the stations. This can be compared with \(e^{-0.0782} = 0.925\) for the straight distances, or a reduction of about 8% the number of trips for every increase of a standard deviation (about 720m).
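The back-transformation is straightforward to reproduce from the quoted estimates:

# Multiplicative effect of a one-standard-deviation increase in distance
exp(-0.0996)  # street distance: ~0.91, i.e. roughly 10% fewer trips
exp(-0.0782)  # straight-line distance: ~0.92, i.e. roughly 8% fewer trips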


The table contains the following variables:

    names(db)
[19] "rt_Shared_room" "geometry"
We will be exploring determinants and strategies for modelling the price of a property advertised in AirBnb. To get a first taste of what this means, we can create a plot of prices within the area of San Diego:

    db %>%
       ggplot(aes(color = price)) +
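A hedged sketch of how such a plot can be completed (the colour scale and theme are illustrative choices):

db %>%
  ggplot(aes(color = price)) +
  geom_sf() +                  # plot the point geometries, coloured by price
  scale_color_viridis_c() +
  theme_void()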

6.3 Non-spatial regression

Before we discuss how to explicitly include space into the linear regression framework, let us fit a linear regression model and interpret the results. We may want to explain the price of a property advertised on AirBnb as a function of some of its characteristics, such as the number of people it accommodates, and how many bathrooms, bedrooms and beds it features. At the individual level, we can express this as:

    \[ \log(P_i) = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i \]


\[ \log(P) = \alpha + \beta_1 Acc + \beta_2 Bath + \beta_3 Bedr + \beta_4 Beds + \epsilon \] where each term can be interpreted in terms of vectors instead of scalars (with the exception of the parameters \((\alpha, \beta_{1, 2, 3, 4})\), which are scalars). Note we are using the logarithm of the price, since this allows us to interpret the coefficients as roughly the percentage change induced by a unit increase in the explanatory variable of the estimate.

Remember a regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the \(\beta_k\) coefficients in the equation above is as the degree of correlation between the explanatory variable \(k\) and the dependent variable, keeping all the other explanatory variables constant. When you calculate simple bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables –also called confounding factors. Regression allows you to isolate the distinct effect that a single variable has on the dependent one, once we control for those other variables.

Task

Assume that new houses tend to be built more often in areas with low deprivation. If that is the case, then \(NEW\) and \(IMD\) will be correlated with each other (as well as with the price of a house, as we are hypothesizing in this case). If we calculate a simple correlation between \(P\) and \(IMD\), the coefficient will represent the degree of association between both variables, but it will also include some of the association between \(IMD\) and \(NEW\). That is, part of the obtained correlation coefficient will be due not to the fact that higher prices tend to be found in areas with low IMD, but to the fact that new houses tend to be more expensive. This is because (in this example) new houses tend to be built in areas with low deprivation and simple bivariate correlation cannot account for that.

We first fit the model specified in the equation above by running:

    m1 <- lm('log_price ~ accommodates + bathrooms + bedrooms + beds', db)

A full detailed explanation of the output is beyond the scope of the chapter, but we will highlight the relevant bits for our main purpose. We concentrate on the Coefficients section, which gives us the estimates for the \(\beta_k\) coefficients in our model. These estimates are the raw equivalent of the correlation coefficient between each explanatory variable and the dependent one, once the “polluting” effect of the other variables included in the model has been accounted for1. Results are as expected for the most part: houses tend to be significantly more expensive if they accommodate more people (an extra person increases the price by 17.7%, approximately), have more bathrooms (15.1%), or bedrooms (11.2%). Perhaps counterintuitively, an extra bed available seems to decrease the price by about 7.7%. However, keep in mind that this is the case, everything else equal. Hence, more beds per room and bathroom (i.e. a more crowded house) is a bit cheaper.

    -

    1 Keep in mind that regression is no magic. We are only discounting the effect of other confounding factors that we include in the model, not of all potentially confounding factors.

    -6.4 Spatial regression: a (very) first dip

    +

    1 Keep in mind that regression is no magic. We are only discounting the effect of other confounding factors that we include in the model, not of all potentially confounding factors.

    +6.4 Spatial regression

    Spatial regression is about explicitly introducing space or geographical context into the statistical framework of a regression. Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors we cannot but should include in our model. As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views. To illustrate the latter, we can think of how the character of a neighborhood is important in determining the price of a house; however, it is very hard to identify and quantify “character” per se, although it might be easier to get at its spatial variation, hence a case of space as a proxy.

Spatial regression is a large field of development in the econometrics and statistics literature. In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence. For more rigorous treatments of the topics introduced here, the reader is referred to Anselin (2003) and Anselin and Rey (2014).

    6.5 Spatial heterogeneity

    Spatial heterogeneity (SH) arises when we cannot safely assume the process we are studying operates under the same “rules” throughout the geography of interest. In other words, we can observe SH when there are effects on the outcome variable that are intrinsically linked to specific locations. A good example of this is the case of seafront houses above: we are trying to model the price of a house and, the fact some houses are located under certain conditions (i.e. by the sea), makes their price behave differently. This somewhat abstract concept of SH can be made operational in a model in several ways. We will explore the following two: spatial fixed-effects (FE); and spatial regimes, which is a generalization of FE.
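A hedged sketch of a spatial fixed-effects specification, assuming the table contains a `neighborhood` column (the column name is an assumption); `- 1` removes the global intercept so each neighbourhood gets its own:

m_fe <- lm(
  'log_price ~ neighborhood + accommodates + bathrooms + bedrooms + beds - 1',
  db
)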


We can see how neighborhoods on the left (west) tend to have higher prices. What we cannot see, but is represented there if you are familiar with the geography of San Diego, is that the city is bounded by the Pacific ocean on the left, suggesting neighbourhoods by the beach tend to be more expensive.

Remember that the interpretation of a \(\beta_k\) coefficient is the effect of variable \(k\), given all the other explanatory variables included remain constant. By including a single variable for each area, we are effectively forcing the model to compare as equal only house prices that share the same value for each variable; that is, only houses located within the same area. Introducing FE affords you a higher degree of isolation of the effects of the variables you introduce in your model because you can control for unobserved effects that align spatially with the distribution of the FE you introduce (by neighbourhood, in our case).

    Spatial regimes

    At the core of estimating spatial FEs is the idea that, instead of assuming the dependent variable behaves uniformly over space, there are systematic effects following a geographical pattern that affect its behaviour. In other words, spatial FEs introduce econometrically the notion of spatial heterogeneity. They do this in the simplest possible form: by allowing the constant term to vary geographically. The other elements of the regression are left untouched and hence apply uniformly across space. The idea of spatial regimes (SRs) is to generalize the spatial FE approach to allow not only the constant term to vary but also any other explanatory variable. This implies that the equation we will be estimating is: \[\log(P_i) = \alpha_r + \beta_{1r} Acc_i + \beta_{2r} Bath_i + \beta_{3r} Bedr_i + \beta_{4r} Beds_i + \epsilon_i\]

    where we are not only allowing the constant term to vary by region (\(\alpha_r\)), but also every other parameter (\(\beta_{kr}\)).
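A hedged sketch of one way to specify such regimes via interactions, again assuming a `neighborhood` column; this illustrates the idea rather than the exact specification whose output follows:

m_regimes <- lm(
  'log_price ~ neighborhood * (accommodates + bathrooms + bedrooms + beds)',
  db
)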

F-statistic: 51.51 on 224 and 5885 DF,  p-value: < 2.2e-16

This allows us to get a separate constant term and estimate of the impact of each variable for every neighborhood. Note that to obtain a neighbourhood-specific constant, you will need to add the regression constant and the estimate for the interaction between the constant and that specific neighbourhood.

    6.6 Spatial dependence

As we have just discussed, SH is about effects of phenomena that are explicitly linked to geography and that hence cause spatial variation and clustering of values. This encompasses many of the kinds of spatial effects we may be interested in when we fit linear regressions. However, in other cases, our interest is in the effect of the spatial configuration of the observations, and the extent to which that has an effect on the outcome we are considering. For example, we might think that the price of a house not only depends on the number of bathrooms it has but, if we take number of bathrooms as a proxy for size and status, also on whether it is surrounded by other houses with many bathrooms. This kind of spatial effect is fundamentally different from SH in that it is not related to inherent characteristics of the geography but relates to the characteristics of the observations in our dataset and, especially, to their spatial arrangement. We call this phenomenon by which the values of observations are related to each other through distance spatial dependence (Anselin 1988).

    Spatial Weights

There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication (see Anselin 2003 for a good overview). Common to all of them, however, is the way space is formally encapsulated: through spatial weights matrices (\(W\))2. These are \(N \times N\) matrices with zero diagonals, where every \(w_{ij}\) cell takes a value that represents the degree of spatial connectivity/interaction between observations \(i\) and \(j\). If they are not connected at all, \(w_{ij}=0\); otherwise \(w_{ij}>0\) and we call \(i\) and \(j\) neighbors. The exact value in the latter case depends on the criterion we use to define neighborhood relations. These matrices also tend to be row-standardised so the sum of each row equals one.

2 If you need to refresh your knowledge on spatial weight matrices, Block E of Arribas-Bel (2019) and Chapter 4 of Rey, Arribas-Bel, and Wolf (2023) provide a good explanation of the theory around spatial weights, and the Spatial Weights section of Rowe (2022) illustrates the use of R to compute different types of spatial weight matrices.

A related concept to spatial weight matrices is that of the spatial lag. This is an operator that multiplies a given variable \(y\) by a spatial weight matrix:

    \[ y_{lag} = W y \]
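A hedged sketch of how a spatial weights matrix and a spatial lag can be built with spdep (the choice of k = 10 nearest neighbours is illustrative):

library(spdep)
# k-nearest-neighbour weights from the point coordinates
knn <- knearneigh(st_coordinates(db), k = 10)
nb <- knn2nb(knn)
w <- nb2listw(nb, style = "W")                 # row-standardised weights
# Spatial lag of the number of bathrooms
db$w_bathrooms <- lag.listw(w, db$bathrooms)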


    Exogenous spatial effects

Let us come back to the house price example we have been working with. So far, we have hypothesised that the price of an AirBnb property in San Diego can be explained using information about its own characteristics, and the neighbourhood it belongs to. However, we can hypothesise that the price of a house is also affected by the characteristics of the houses surrounding it. Considering it as a proxy for larger and more luxurious houses, we will use the number of bathrooms of neighboring houses as an additional explanatory variable. This represents the most straightforward way to introduce spatial dependence in a regression, by considering not only a given explanatory variable, but also its spatial lag.

    In our example case, in addition to including the number of bathrooms of the property, we will include its spatial lag. In other words, we will be saying that it is not only the number of bathrooms in a house but also that of the surrounding properties that helps explain the final price at which a house is advertised for. Mathematically, this implies estimating the following model:

\[ \log(P_i) = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \beta_5 Bath_{lag-i} + \epsilon_i \]

The spatial error model, in turn, allows the error term itself to be spatially correlated:

\[ u_i = u_{lag-i} + \epsilon_i \]

    Again, although similar, one can show this specification violates the assumptions about the error term in a classical OLS model.

Both the spatial lag and error model violate some of the assumptions on which OLS relies and thus render the technique unusable. Much of the efforts have thus focused on coming up with alternative methodologies that allow unbiased, robust, and efficient estimation of such models. A survey of those is beyond the scope of this note, but the interested reader is referred to Anselin (1988), Anselin (2003), and Anselin and Rey (2014) for further reference.
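For reference, a hedged sketch of how such models can be estimated by maximum likelihood with the spatialreg package, assuming `w` is the row-standardised listw object built above:

library(spatialreg)
# Spatial lag model
m_lag <- lagsarlm(log_price ~ accommodates + bathrooms + bedrooms + beds,
                  data = db, listw = w)
# Spatial error model
m_err <- errorsarlm(log_price ~ accommodates + bathrooms + bedrooms + beds,
                    data = db, listw = w)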

    6.7 Predicting house prices

    So far, we have seen how to exploit the output of a regression model to evaluate the role different variables play in explaining another one of interest. However, once fit, a model can also be used to obtain predictions of the dependent variable given a new set of values for the explanatory variables. We will finish this session by dipping our toes in predicting with linear models.


    This chapter provides an introduction to multi-level data structures and multi-level modelling and draws on the following references:

• Gelman and Hill (2006) provides an excellent and intuitive explanation of multilevel modelling and data analysis in general. Read Part 2A for a really good explanation of multilevel models.

• Multilevel Modelling (n.d.) is a useful online resource on multilevel modelling and is free!

    7.1 Dependencies


    7.3 Modelling

We should now be persuaded that ignoring the hierarchical structure of data may be a major issue. Let us now use a simple example to understand the intuition of multilevel modelling using the census data. We will seek to understand the spatial distribution of the proportion of population in unemployment in Liverpool, particularly why and where concentrations in this proportion occur. To illustrate the advantages of taking a multilevel modelling approach, we will start by estimating a linear regression model and progressively building complexity. We will first estimate a model and then explain the intuition underpinning the process. We will seek to gain a general understanding of multilevel modelling. If you are interested in the statistical and mathematical formalisation of the underpinning concepts, please refer to Gelman and Hill (2006).

We first want to understand our dependent variable: its density distribution:

    ggplot(data = oa_shp) +
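A hedged sketch of such a density plot, assuming the unemployment proportion is stored in a column called `unemp` (the column name is an assumption):

ggplot(data = oa_shp) +
  geom_density(aes(x = unemp), alpha = 0.5,
               colour = "black", fill = "lightblue") +
  theme_classic()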
       head(10)
             groupFctr             groupID        term         mean       median
1  lsoa_cd:msoa_cd E01006512:E02001377 (Intercept) -0.016750074 -0.016974644
2  lsoa_cd:msoa_cd E01006513:E02006932 (Intercept) -0.016985050 -0.015851627
3  lsoa_cd:msoa_cd E01006514:E02001383 (Intercept) -0.019717446 -0.018885616
4  lsoa_cd:msoa_cd E01006515:E02001383 (Intercept) -0.014932030 -0.016818498
5  lsoa_cd:msoa_cd E01006518:E02001390 (Intercept) -0.019077813 -0.018799761
6  lsoa_cd:msoa_cd E01006519:E02001402 (Intercept) -0.015795825 -0.015550835
7  lsoa_cd:msoa_cd E01006520:E02001389 (Intercept) -0.025289464 -0.024460855
8  lsoa_cd:msoa_cd E01006521:E02001398 (Intercept)  0.004727472  0.005030506
9  lsoa_cd:msoa_cd E01006522:E02001394 (Intercept)  0.019116330  0.019146406
10 lsoa_cd:msoa_cd E01006523:E02001398 (Intercept)  0.003182825  0.002890660
            sd
1  0.018218673
2  0.019760651
3  0.018909129
4  0.018781160
5  0.019617365
6  0.009553315
7  0.018801694
8  0.020192525
9  0.018851941
10 0.020034567

    The results contain the estimated mean, median and standard deviation for the intercept within each group (e.g. LSOA). The mean estimates are similar to those obtained from ranef with some small differences due to rounding.

 $ groupFctr: chr  "msoa_cd" "msoa_cd" "msoa_cd" "msoa_cd" ...
 $ groupID  : chr  "E02001347" "E02001348" "E02001349" "E02001350" ...
 $ term     : chr  "(Intercept)" "(Intercept)" "(Intercept)" "(Intercept)" ...
 $ mean     : num  -0.01256 -0.02679 -0.03173 0.00477 0.02249 ...
 $ median   : num  -0.01239 -0.02967 -0.03086 0.00561 0.02157 ...
 $ sd       : num  0.0311 0.0338 0.0312 0.0335 0.0168 ...
    # merge data
     msoa_shp <- merge(x = msoa_shp, y = re_msoa, by.x = "MSOA_CD", by.y = "groupID")

This chapter explains varying slopes. The content of this chapter is based on the following references:

• Gelman and Hill (2006) provides an excellent and intuitive explanation of multilevel modelling and data analysis in general. Read Part 2A for a really good explanation of multilevel models.

• Multilevel Modelling (n.d.) is a useful online resource on multilevel modelling and is free!

    8.1 Dependencies

This chapter uses the following libraries, which are listed in Section 1.4.1 in Chapter 1:

    # Data manipulation, transformation and visualisation
     library(tidyverse)
     

    So far, we have estimated varying-intercept models; that is, when the intercept (\(\beta_{0}\)) is allowed to vary by group (eg. geographical area) - as shown in Fig. 1(a). The strength of the relationship between \(y\) (i.e. unemployment rate) and \(x\) (long-term illness) has been assumed to be the same across groups (i.e. MSOAs), as captured by the regression slope (\(\beta_{1}\)). Yet it can also vary by group as shown in Fig. 1(b), or we can observe group variability for both intercepts and slopes as represented in Fig. 1(c).
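A hedged sketch of a varying-intercept, varying-slope specification with lme4, using the grouping (`msoa_cd`) and predictor (`lt_ill`) names that appear in the chapter's output; the outcome name `unemp` is an assumption:

library(lme4)
# Both the intercept and the slope of lt_ill vary by MSOA
m_vs <- lmer(unemp ~ lt_ill + (1 + lt_ill | msoa_cd), data = oa_shp)
summary(m_vs)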

Fig. 1. Linear regression model with (a) varying intercepts, (b) varying slopes, and (c) both. Source: Gelman and Hill (2006) p.238.

    8.3.1 Exploratory Analysis: Varying Slopes

 $ groupFctr: chr  "msoa_cd" "msoa_cd" "msoa_cd" "msoa_cd" ...
 $ groupID  : chr  "E02001347" "E02001348" "E02001349" "E02001350" ...
 $ term     : chr  "lt_ill" "lt_ill" "lt_ill" "lt_ill" ...
 $ mean     : num  0.023 -0.1111 0.0588 -0.1413 -0.2804 ...
 $ median   : num  0.0241 -0.1043 0.0579 -0.1431 -0.2823 ...
 $ sd       : num  0.0464 0.0744 0.0848 0.038 0.0385 ...
    # merge data
     msoa_shp <- merge(x = msoa_shp, y = re_msoa_m6, by.x = "MSOA_CD", by.y = "groupID")
10 MULTIPOLYGON (((337363.8 39...
Now try adding a group-level predictor and an individual-level predictor to the model. If unsure, look at Section 7.4.5 and Section 7.4.6 in Chapter 7.

    8.5 Interpreting Correlations Between Group-level Intercepts and Slopes

Correlations of random effects are confusing to interpret. Key for their appropriate interpretation is to recall that they refer to group-level residuals, i.e. deviations of intercepts and slopes from the average model intercept and slope. A strong negative correlation indicates that groups with high intercepts have relatively low slopes, and vice versa. A strong positive correlation indicates that groups with high intercepts have relatively high slopes, and vice versa. A correlation close to zero indicates little or no systematic relationship between intercepts and slopes. Note that a high correlation between intercepts and slopes is not a problem, but it makes the interpretation of the estimated intercepts more challenging. For this reason, a suggestion is to center the predictors (\(x's\)); that is, subtract their average value (\(z = x - \bar{x}\)). For a more detailed discussion, see Multilevel Modelling (n.d.).

    To illustrate this, let’s reestimate our model adding an individual-level predictor: the share of population with no educational qualification.

    # centering to the mean
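A hedged sketch of the centering step, assuming the share of population with no qualification is stored in a column called `no_qual` (the column name is an assumption):

# Subtract the mean so the centred variable has mean zero
oa_shp$no_qual_c <- oa_shp$no_qual - mean(oa_shp$no_qual, na.rm = TRUE)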
     

    How do you interpret the random effect correlation?

    8.6 Model building

Now we know how to estimate multilevel regression models in R. The question that remains is: when does multilevel modelling make a difference? The short answer is: it depends on the degree of group-level variation. When there is very little group-level variation, the multilevel model reduces to classical linear regression estimates with no group indicators. Inversely, when group-level coefficients vary greatly (compared to their standard errors of estimation), multilevel modelling reduces to classical regression with group indicators (Gelman and Hill 2006).

    How do you go about building a model?

We generally start by fitting simple linear regressions and then work our way up to a full multilevel model - see Gelman and Hill (2006), p. 270. A sketch of this workflow follows.
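
A minimal sketch of that incremental workflow, reusing the toy data frame df from the previous sketch (any data frame with an outcome y, a predictor x and a grouping factor group would do):

library(lme4)
m_lm <- lm(y ~ x, data = df)                       # single-level baseline
m_ri <- lmer(y ~ x + (1 | group), data = df)       # varying intercepts
m_rs <- lmer(y ~ x + (1 + x | group), data = df)   # varying intercepts and slopes
anova(m_ri, m_rs)   # likelihood-ratio comparison of the two multilevel models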

    How many groups are needed?

    As an absolute minimum, more than two groups are required. With only one or two groups, a multilevel model reduces to a linear regression model.

    How many observations per group?

    @@ -653,7 +652,7 @@

    8.7 Questions

We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.

    sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
    @@ -673,7 +672,7 @@
  • Fit a varying-intercept and varying-slope model.

  • Compare the results for models fitted in 1 and 2. Which is better? Why?

• Use the same explanatory variables used for the Chapter 7 challenge, so you can compare the model results from this chapter.

    Analyse and discuss:

    1. the varying slope estimate(s) from your model(s) (to what extent does the relationship between your dependent and independent variables vary across groups / areas? are they statistically significantly different?).
2. @@ -683,7 +682,7 @@

@@ -307,8 +306,8 @@

      This chapter provides an introduction to geographically weighted regression models.

      The content of this chapter is based on:

• Fotheringham, Brunsdon, and Charlton (2002), a must-read book if you are working or planning to start working on geographically weighted regression modelling.

• Comber et al. (2022) provide a roadmap to approach various practical issues in the application of GWR.

      9.1 Dependencies

      @@ -502,7 +501,7 @@

The correlogram shows the strength and significance of the linear relationship between our set of variables. The size of the circle reflects the strength of the relationship as captured by the Pearson correlation coefficient, and crosses indicate statistically insignificant relationships at the 95% level of confidence. The colour indicates the direction of the relationship, with dark (light) colours indicating a negative (positive) association.

The results indicate that the incidence of COVID-19 is significantly and positively related to the share of overcrowded housing, nonwhite ethnic minorities and administrative & support workers. Against expectations, the incidence of COVID-19 appears to be negatively correlated with the share of elderly population, of population suffering from long-term illness and of administrative & support workers, and displays no significant association with the share of the population living in deprived areas as well as the share of public administration & defence workers, and health & social workers. The latter probably reflects the effectiveness of the protective measures undertaken to prevent infection among these population groups, but it may also reflect the partial coverage of COVID-19 testing and underreporting. It may also reveal the descriptive limitations of correlation coefficients, as they show the relationship between pairs of variables without controlling for others. Correlation coefficients can thus produce spurious relationships resulting from confounding variables. We will return to this point below.

The results also reveal high collinearity between particular pairs of variables, notably between the share of crowded housing and of nonwhite ethnic population, the share of crowded housing and of elderly population, the share of overcrowded housing and of administrative & support workers, and the share of elderly population and of population suffering from long-term illness. A more refined analysis of multicollinearity is needed. Various diagnostics for multicollinearity in a regression framework exist, including matrix condition numbers (CNs), predictor variance inflation factors (VIFs) and variance decomposition proportions (VDPs). Rules of thumb (CNs > 30, VIFs > 10 and VDPs > 0.5) to indicate worrying levels of collinearity can be found in Belsley, Kuh, and Welsch (2005). To avoid problems of multicollinearity, a simple strategy is often to remove highly correlated predictors. The difficulty is in deciding which predictor(s) to remove, especially when all are considered important. Keep this in mind when specifying your model.
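
These diagnostics are quick to compute in R. A minimal sketch, assuming a fitted linear model such as the model1 estimated later in this chapter (the car package is an extra dependency; the condition-number line is only a rough check, as Belsley et al. scale the design matrix columns first):

library(car)
vif(model1)                     # variance inflation factors; VIF > 10 is a common warning sign
X <- model.matrix(model1)       # design matrix of the fitted model
kappa(scale(X, center = FALSE), exact = TRUE)   # rough condition number; CN > 30 is worrying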

Challenge 2: Analyse the relationship between all the variables by executing pairs(df_sel). How accurate would a linear regression be in capturing the relationships for our set of variables?

    @@ -551,7 +550,7 @@

The regression results indicate a positive relationship exists between the share of nonwhite population and an increased risk of COVID-19 infection. A one percentage point increase in the share of nonwhite population is associated with a rise of 271 in the cumulative count of COVID-19 infections per 100,000 people, everything else constant. The results also reveal a positive (albeit statistically insignificant) relationship between the share of population suffering from long-term illness and an increased risk of COVID-19 infection, after controlling for the share of nonwhite population, thereby confirming our suspicion about the limitations of correlation coefficients; that is, once differences in the share of nonwhite population are taken into account, the association between the share of population suffering from long-term illness and an increased risk of COVID-19 infection becomes positive. We also test for multicollinearity. The VIFs are below 10, indicating that multicollinearity is not highly problematic.

The \(R^{2}\) value for the OLS regression is 0.393, indicating that our model explains only 39% of the variance in the rate of COVID-19 infection. This leaves 61% of the variance unexplained. Some of this unexplained variance may arise because we have only included two explanatory variables in our model, but also because the OLS regression model assumes that the relationships in the model are constant over space; that is, it assumes a stationary process. Hence, an OLS regression model is considered to capture global relationships. However, relationships may vary over space. Suppose, for instance, that there are intrinsic behavioural variations across England and that people have adhered more strictly to self-isolation and social distancing measures in some areas than in others, or that ethnic minorities are less exposed to contracting COVID-19 in certain parts of England. If such variations in associations exist over space, our estimated OLS model will be a misspecification of reality because it assumes these relationships to be constant.

To better understand this potential misspecification, we investigate the model residuals, which show high variability (see below). The distribution is non-random, displaying large positive residuals in the metropolitan areas of London, Liverpool, Newcastle (in light colours) and the Lake District, and large negative residuals across much of England (in black). This conforms to the spatial pattern of confirmed COVID-19 cases, with high concentration in a limited number of metropolitan areas (see above). While our residual map reveals that there is a problem with the OLS model, it does not indicate which, if any, of the parameters in the model might exhibit spatial nonstationarity. A simple way of examining whether the relationships being modelled in our global OLS model are likely to be stationary over space would be to estimate a separate OLS model for each UTLA in England. But this would require higher-resolution data, i.e. data within UTLAs, and we only have one data point per UTLA. Fotheringham, Brunsdon, and Charlton (2002, p. 40-44) discuss alternative approaches and their limitations.

    utla_shp$res_m1 <- residuals(model1)
     
    @@ -578,14 +577,14 @@
     

Graphically, GWR involves fitting a spatial kernel to the data, as described in Fig. 1. For a given regression point \(X\), the weight (\(W\)) of a data point is at a maximum at the location of the regression point. The weight decreases gradually as the distance between the two points increases. A regression model is thus calibrated locally by moving the regression point across the area under study. For each location, the data are weighted differently, so that the resulting estimates are unique to a particular location.

Fig. 1. GWR with fixed spatial kernel. Source: Fotheringham et al. (2002, 45).

    9.6.1 Fixed or Adaptive Kernel

A key issue is to decide between two types of spatial kernel: a fixed kernel or an adaptive kernel. Intuitively, a fixed kernel involves using a fixed bandwidth to define a region around all regression points, as displayed in Fig. 1. The extent of the kernel is determined by the distance to a given regression point, with the kernel being identical at any point in space. An adaptive kernel involves using a varying bandwidth to define a region around regression points, as displayed in Fig. 2. The extent of the kernel is determined by the number of nearest neighbours from a given regression point. The kernels have larger bandwidths where the data are sparse.

Fig. 2. GWR with adaptive spatial kernel. Source: Fotheringham et al. (2002, 47).

    9.6.2 Optimal Bandwidth

    @@ -595,7 +594,7 @@

Choosing an optimal bandwidth involves a compromise between bias and precision. For example, a larger bandwidth will involve using a larger number of observations to fit a local regression, and hence result in reduced variance (or increased precision) but greater bias of the estimates. On the other hand, too small a bandwidth involves using a very small number of observations, resulting in increased variance but small bias. An optimal bandwidth offers a compromise between bias and variance.

    9.6.3 Shape of Spatial Kernel

Two general sets of kernel functions can be distinguished: continuous kernels and kernels with compact support. Continuous kernels are used to weight all observations in the study area and include uniform, Gaussian and exponential kernel functions. Kernels with compact support are used to assign a nonzero weight to observations within a certain distance and a zero weight beyond it. The shape of the kernel has been reported to cause only small changes to the resulting estimates (Brunsdon, Fotheringham, and Charlton 1998).
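
As an illustration (not code from the chapter), the two families can be written as simple weight functions of the distance d between a regression point and a data point, and the bandwidth b:

# continuous kernel: Gaussian weights decay smoothly and never reach exactly zero
gauss_w <- function(d, b) exp(-0.5 * (d / b)^2)
# compact-support kernel: bi-square weights drop to zero beyond the bandwidth
bisquare_w <- function(d, b) ifelse(d < b, (1 - (d / b)^2)^2, 0)
gauss_w(c(0, 5, 10, 20), b = 10)      # positive weights at all distances
bisquare_w(c(0, 5, 10, 20), b = 10)   # zero weight at and beyond d = b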

    9.6.4 Selecting a Bandwidth

    Let’s now implement a GWR model. The first key step is to define the optimal bandwidth. We first illustrate the use of a fixed spatial kernel.

    @@ -804,10 +803,10 @@
Analysing the map for long-term illness, a clear North-South divide can be identified. In the North we observe the expected positive relationship between COVID-19 and long-term illness, i.e. as the share of the local population suffering from long-term illness rises, the cumulative number of positive COVID-19 cases is expected to increase. In the South, we observe the inverse pattern, i.e. as the share of the local population suffering from long-term illness rises, the cumulative number of positive COVID-19 cases is expected to drop. This pattern is counterintuitive but may be explained by the wider socio-economic disadvantages between the North and the South of England. The North is usually characterised by a persistent concentration of more disadvantaged neighbourhoods than the South, where affluent households have tended to cluster for the last 40 years (Patias, Rowe, and Arribas-Bel 2021).

    9.6.7 Assessing statistical significance

While the maps above offer valuable insights to understand the spatial patterning of relationships, they do not identify whether these associations are statistically significant. They may not be. Roughly, if a coefficient estimate has an absolute t value greater than 1.96 and the sample is sufficiently large, then it is statistically significant. Our sample has only 150 observations, so we are more conservative and consider a coefficient to be statistically significant if it has an absolute t value larger than 2. Note also that p-values could be computed - see Lu et al. (2014).

    # compute t statistic
     utla_shp$t_ethnic = ab_gwr_out$ethnic / ab_gwr_out$ethnic_se
    @@ -848,10 +847,10 @@
     
     

    9.6.8 Collinearity in GWR

An important final note is: collinearity tends to be problematic in GWR models. It can be present in the data subsets used to estimate local coefficients even when it is not observed globally (Wheeler and Tiefelsdorf 2005). Collinearity can be highly problematic in the case of compositional, categorical and ordinal predictors, and may result in exact local collinearity, making the search for an optimal bandwidth impossible. A recent paper suggests potential ways forward (Comber et al. 2022).

    9.7 Questions

We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.

    sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
@@ -879,7 +878,7 @@

@@ -304,11 +303,11 @@

This chapter provides an introduction to the complexities of spatio-temporal data and modelling. For modelling, we consider the Fixed Rank Kriging (FRK) framework developed by Cressie and Johannesson (2008). It enables constructing a spatial random effects model on a discretised spatial domain. Key advantages of this approach comprise the capacity to: (1) work with large data sets; (2) be scaled up; (3) generate predictions based on sparse linear algebraic techniques; and (4) produce fine-scale resolution uncertainty estimates.

    The content of this chapter is based on:

• Wikle, Zammit-Mangion, and Cressie (2019), a recently published book which provides a good overview of existing statistical approaches to spatio-temporal modelling and R packages.

• Zammit-Mangion and Cressie (2017), who introduce the statistical framework and R package for spatio-temporal modelling used in this chapter.

    10.1 Dependencies

    @@ -358,7 +357,7 @@

Investigating the spatial patterns of human processes, as we have done so far in this book, offers only a partial, incomplete representation of these processes. It does not allow understanding of the temporal evolution of these processes. Human processes evolve in space and time. Human mobility is an inherently geographical process which changes over the course of the day, with peaks at rush hours and high concentration towards employment, education and retail centres. Exposure to air pollution changes with local climatic conditions, and with the emission and concentration of atmospheric pollutants, which fluctuate over time. The rate of disease spread varies over space and may significantly change over time, as we have seen during the current outbreak, with flattened or rapidly declining trends in Australia, New Zealand and South Korea but fast proliferation in the United Kingdom and the United States. Only by considering time and space together can we address how geographic entities change over time and why they change. A large part of how and why such change occurs is due to interactions across space and time, and across multiple processes. It is essential to understand the past to inform our understanding of the present and make predictions about the future.

    10.3.1 Spatio-temporal Data Structures

A first key element is to understand the structure of spatio-temporal data. Spatio-temporal data incorporate two dimensions. At one end, we have the temporal dimension. In quantitative analysis, time-series data are used to capture geographical processes at regular or irregular intervals; that is, on a continuous (daily) or discrete (only when an event occurs) temporal scale. At the other end, we have the spatial dimension. We often use spatial data as temporal aggregations or temporally frozen states (or ‘snapshots’) of a geographical process - this is what we have done so far. Recall that spatial data can be captured in different geographical units, such as areal or lattice data, points, flows or trajectories - refer to the introductory lecture in Week 1. Relatively few ways exist to formally integrate temporal and spatial data in a consistent analytical framework. Two notable exceptions in R are the packages TraMineR (Gabadinho et al. 2009) and spacetime (Pebesma et al. 2012). We use the class definitions provided in the R package spacetime. These classes extend those used for spatial data in sp and time-series data in xts. Next, a brief introduction to concepts that facilitate thinking about spatio-temporal data structures.
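
To give a flavour of these classes, here is a minimal sketch of building an irregular space-time object (STIDF) with toy points and dates, not the chapter’s data:

library(sp)
library(spacetime)
pts <- SpatialPoints(cbind(long = c(-2.9, -1.5, -0.1), lat = c(53.4, 52.9, 51.5)))
when <- as.POSIXct(c("2020-03-01", "2020-03-05", "2020-03-09"))
vals <- data.frame(cases = c(3, 10, 25))
sti <- STIDF(sp = pts, time = when, data = vals)   # one coordinate and one time stamp per row
str(sti)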

    10.3.1.1 Type of Table

Spatio-temporal data can be conceptualised as three main types of tables:

    @@ -379,7 +378,7 @@
  • Irregular (STI): an object with an irregular space-time data structure, where each point is allocated a spatial coordinate and a time stamp;

• Simple Trajectories (STT): an object containing a sequence of space-time points that form trajectories.

For more details on these spatio-temporal structures, their construction and manipulation, see Pebesma et al. (2012). Enough theory, let’s code!

    10.4 Data Wrangling

This section illustrates the complexities of handling spatio-temporal data. It discusses good practices in data manipulation and the construction of a Space Time Irregular Data Frame (STIDF) object. Three key requirements to define a STIDF object are:

    @@ -628,7 +627,7 @@
    #tmap_mode("view")
     #imap

Alternative data visualisation tools are animations, Trelliscope and Shiny. Animations can be constructed by plotting spatial data frame-by-frame, and then stringing the frames together in sequence. Useful R packages are gganimate and tmap! See Lovelace, Nowosad, and Muenchow (2019). Note that the creation of animations may require external dependencies; hence, they have not been included here. Both Trelliscope and Shiny are useful for visualising large spatio-temporal data sets in an interactive way. Some effort is required to deploy these tools.
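
A possible sketch with tmap is shown below (not run in this chapter; it assumes an sf object covid19_spt with a date column and the rate variable n_covid19_r, and it requires gifski or ImageMagick to be installed):

library(tmap)
anim <- tm_shape(covid19_spt) +
  tm_polygons("n_covid19_r") +
  tm_facets(along = "date", free.coords = FALSE)   # one frame per date
tmap_animation(anim, filename = "covid19_spread.gif", delay = 40)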

    10.5.2 Exploratory Analysis

    In addition to visualising data, we often want to obtain numerical summaries of the data. Again, innovative ways to reduce the inherent dimensionality of the data and examine dependence structures and potential relationships in time and space are needed. We consider visualisations of empirical spatial and temporal means, dependence structures and some basic time-series analysis.
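
For example, the empirical temporal mean (the average rate of new cases across areas at each time point) can be obtained with a grouped summary; a sketch assuming a covid19 data frame with date and n_covid19_r columns, as used later in this chapter:

library(dplyr)
covid19 %>%
  group_by(date) %>%
  summarise(mean_rate = mean(n_covid19_r, na.rm = TRUE))   # one empirical mean per date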

    @@ -693,7 +692,7 @@

    10.5.2.2 Dependence

    Spatial Dependence

As we know, spatial dependence refers to the spatial relationship of a variable’s values for pairs of locations a certain distance apart, such that they are more similar (or less similar) than expected for randomly associated pairs of observations. Patterns of spatial dependence may change over time. In the case of a disease outbreak, patterns of spatial dependence can change very quickly as new cases emerge and social distancing measures are implemented. Chapter 6 illustrates how to measure spatial dependence in the context of spatial data.

Challenge 1: Measure how spatial dependence changes over time. Hint: compute the Moran’s I on the rate of new COVID-19 cases (i.e. n_covid19_r in the covid19 data frame) at multiple time points.
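
A possible starting point for this challenge is sketched below. It assumes that utla_shp holds the UTLA polygons and that, within each date, the rows of covid19 follow the same area order as utla_shp:

library(spdep)
nb <- poly2nb(utla_shp)                          # queen contiguity neighbours
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
sapply(unique(covid19$date), function(d) {
  x <- covid19$n_covid19_r[covid19$date == d]    # rate of new cases on date d
  moran.test(x, lw, zero.policy = TRUE)$estimate["Moran I statistic"]
})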

@@ -743,7 +742,7 @@

For a good introduction to time-series analysis in R, refer to Hyndman and Athanasopoulos (2018) and DataCamp.

    10.6 Spatio-Temporal Data Modelling

Having gained some understanding of the spatio-temporal patterns of COVID-19 spread through data exploration, we are ready to start examining structural relationships between the rate of new infections and local contextual factors via regression modelling across UTLAs. Specifically, we consider the number of new cases per 100,000 people to capture the rate of new infections, and only one contextual factor, namely the share of the population suffering from long-term sickness or disability. We will consider some basic statistical models, in the form of linear regression and generalised linear models, to account for spatio-temporal dependencies in the data. Note that we do not consider more complex structures based on hierarchical models or spatio-temporal weighted regression models, which would be the natural next step moving forward.

    @@ -780,7 +779,7 @@

These basis functions are incorporated as independent variables in the regression model. Additionally, we also include the share of population suffering from long-term illness, as we know it is highly correlated with the cumulative number of COVID-19 cases. The share of population suffering from long-term illness is incorporated as a spatially varying, temporally invariant covariate, given that it relies on 2011 census data.
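
As an illustration of how such spatial basis functions can be generated and attached to the data with the FRK package (a sketch under assumed object names; the chapter’s own code may differ):

library(FRK)
library(sp)
# covid19_sp: a SpatialPointsDataFrame of UTLA locations (assumed name)
G <- auto_basis(manifold = plane(), data = covid19_sp, nres = 2, type = "bisquare")
S <- as.matrix(eval_basis(basis = G, s = coordinates(covid19_sp)))   # evaluate at data locations
colnames(S) <- paste0("B", 1:ncol(S))
# the columns of S can then enter lm() or glm() as additional covariates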

    10.6.2 Fitting Spatio-Temporal Models

As indicated at the start of this chapter, we use the FRK framework developed by Cressie and Johannesson (2008). It is scalable, relies on a spatial random effects model (with which we have some familiarity) and can easily be implemented in R through the FRK package (Zammit-Mangion and Cressie 2017). In this framework, spatially correlated errors can be decomposed into a linear combination of spatial basis functions, effectively addressing issues of spatio-temporal dependence and nonstationarity. The specification of spatio-temporal basis functions is a key component of the model, and they can be generated automatically or by the user via the FRK package. We will use the automatically generated functions. While, as we will see, they are difficult to interpret, user-generated functions require a greater understanding of the spatio-temporal structure of COVID-19, which is beyond the scope of this chapter.

    Prepare Data

The first step is to create a data frame with the variables that we will consider for the analysis. We first remove the geometries to convert covid19_spt from a simple features object to a data frame, and then compute the share of the population suffering from long-term illness.
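
A minimal sketch of this step, assuming covid19_spt is an sf object with population and long-term illness counts (the column names below are placeholders):

library(sf)
library(dplyr)
covid19_df <- covid19_spt %>%
  st_drop_geometry() %>%                                   # drop geometries: sf -> data frame
  mutate(lt_illness_share = long_term_ill / population)    # placeholder column names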

    @@ -1060,7 +1059,7 @@

    Note that you may need to install the huxtable package.

10.6.2.1 Model Comparison

To compare regression models based on different specifications and assumptions, like those reported above, you may want to consider the cross-validation approach used in Chapter 5. An alternative approach, if you would like to get a quick sense of model fit, is to explore the correlation between the observed and predicted values of the dependent variable. For our models, we can achieve this by executing:

    # computing predictions for all models
     lm_cnt <- predict(lm_m)
    @@ -1102,7 +1101,7 @@
Option 4 Check for collinearity. Collinearity is likely to be an issue given the way basis functions are created. Checking for collinearity will of course not improve the fit of the existing model, but it is important to remove collinear terms if statistical inference is a key goal - which in this case it is. Over to you now!
     

    10.7 Questions

We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.

    sdf <- st_read("data/assignment_2_covid/covid19_eng.gpkg")
@@ -1130,7 +1129,7 @@
    diff --git a/docs/index.html b/docs/index.html index 90949d3..98fec94 100644 --- a/docs/index.html +++ b/docs/index.html @@ -7,7 +7,7 @@ - + Spatial Modelling for Data Scientists