
Commit

update chp5
fcorowe committed Feb 22, 2024
1 parent c74f330 commit b57ea7b
Showing 32 changed files with 2,606 additions and 1,578 deletions.
4 changes: 3 additions & 1 deletion 04-points.qmd
@@ -51,14 +51,16 @@ colnames(db)

The rest of this session will focus on two main elements of the table: the spatial dimension (as stored in the point coordinates), and the nightly price values, expressed in USD and contained in the `price` column. To get a sense of what they look like first, let us plot both. We can get a quick look at the non-spatial distribution of house values with the following commands:

```{r fig.margin=TRUE, fig.cap="Raw AirBnb prices in San Diego"}
```{r}
#| warning: false
# Create the histogram
qplot( data = db, x = price)
```
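As a complement to the histogram, a quick numerical summary of the same column (a small illustrative addition, assuming the `db` table loaded above) makes the skew explicit:

```{r}
# Five-number summary plus the mean of the nightly price (USD)
summary(db$price)
```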

The histogram shows that a lot of values are concentrated around the lower end of the distribution, with a few very large ones. A usual transformation to *shrink* these differences is to take logarithms. The original table already contains an additional column with the logarithm of each price (`log_price`).
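Had the column not been shipped with the table, it could be derived from `price` directly. A minimal sketch (the `log_price_check` name is hypothetical, and a natural logarithm with strictly positive prices is assumed):

```{r}
# Derive the log of the nightly price by hand (illustrative only;
# the table already contains a `log_price` column)
db$log_price_check <- log(db$price)
```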

```{r}
#| warning: false
# Create the histogram
qplot( data = db, x = log_price )
```
8 changes: 4 additions & 4 deletions 05-flows.qmd
@@ -35,7 +35,7 @@ In this chapter we will show a slightly different way of managing spatial data i

## Data

In this note, we will use data from the city of San Francisco representing bike trips on their public bike share system. The original source is the SF Open Data portal ([link](http://www.bayareabikeshare.com/open-data)) and the dataset comprises both the location of each station in the Bay Area as well as information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the `sf_bikes` folder. The data file is named `flows.geojson` and, in case you are interested, the (Python) code required to created from the original files in the SF Data Portal is also available on the `flows_prep.ipynb` notebook [\[url\]](https://github.com/darribas/spa_notes/blob/master/sf_bikes/flows_prep.ipynb), also in the same folder.
In this note, we will use data from the city of San Francisco representing bike trips on its public bike share system. The original source is the [SF Open Data portal](https://datasf.org/opendata/) and the dataset comprises both the location of each station in the Bay Area and information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the `sf_bikes` folder. The data file is named `flows.geojson` and, in case you are interested, the (Python) code required to create it from the original files in the SF Data Portal is also available in the `flows_prep.ipynb` [notebook](https://github.com/darribas/spa_notes/blob/master/sf_bikes/flows_prep.ipynb), also in the same folder.

Let us then directly load the file with all the information necessary:
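The loading chunk itself is collapsed in this diff, but a minimal sketch of how the file could be read, assuming the `sf` package and the `sf_bikes/flows.geojson` path described above (the object name `db` is illustrative), would be:

```{r}
# Illustrative sketch: read the flows file as an sf object
# (the actual loading code sits in the collapsed part of the chapter)
library(sf)
db <- st_read("sf_bikes/flows.geojson")
```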

@@ -277,7 +277,7 @@ generate_draw <- function(m){
}
```
This function takes a model `m` and the set of covariates `x` used and returns a random realization of predictions from the model. To get a sense of how this works, we can get and plot a realization of the model, compared to the expected one and the actual values:
This function takes a model `m` and the set of covariates `x` used to fit it, and returns a random realisation of predictions from the model. To get a sense of how this works, we can generate a realisation of the model and plot it against the expected and actual values.
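The body of `generate_draw` is collapsed above, but a minimal sketch of what such a function might look like, assuming a Gaussian linear model fitted with `lm()` (the function name and the distributional assumption are illustrative, not the original code), is:

```{r}
# Illustrative sketch only: draw one synthetic realisation of the outcome
# implied by a Gaussian linear model fitted with lm()
generate_draw_sketch <- function(m){
  mu <- fitted(m)                  # point predictions from the model
  sigma <- summary(m)$sigma        # residual standard deviation
  rnorm(length(mu), mean = mu, sd = sigma)  # one random draw per observation
}
```

With a function of this kind available, the chunk below draws one realisation from the fitted baseline model: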
```{r}
new_y <- generate_draw(m1)
@@ -355,11 +355,11 @@ legend(
title(main="Predictive check - Baseline model")
```
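The plotting code above is largely collapsed; one way such a predictive check could be assembled, assuming the baseline model `m1`, the `generate_draw()` function and an observed outcome recoverable from the model frame (the colour choices are illustrative, loosely matching the black and red lines discussed below), is:

```{r}
# Illustrative sketch of a predictive check for the baseline model
y_obs <- model.response(model.frame(m1))         # observed flows used to fit m1
plot(density(y_obs), lwd = 2, col = "grey",
     main = "Predictive check - Baseline model (sketch)")
lines(density(fitted(m1)), lwd = 2, col = "black")            # fitted values
for (i in 1:25) {
  lines(density(generate_draw(m1)), col = rgb(1, 0, 0, 0.2))  # random draws
}
legend("topright",
       legend = c("Observed", "Fitted", "Simulated draws"),
       col = c("grey", "black", "red"), lwd = 2)
```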
The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realizations of our "inferential engine", which depict a much less concentrated distribution of values. This is likely due to the combination of two different reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, it may be that the amount of variation we are not able to account for in the model[^05-flows-2] is so large that the degree of uncertainty contained in the error term of the model is very large, hence resulting in such a flat predictive distribution.
The plot shows a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realisations of our "inferential engine", which depict a much less concentrated distribution of values. This is likely due to a combination of two reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, the amount of variation we are not able to account for in the model[^05-flows-2] may be so large that the uncertainty contained in the error term is very large, hence resulting in such a flat predictive distribution.
[^05-flows-2]: The $R^2$ of our model is around 2%
It is important to keep in mind that the issues discussed in the paragraph above relate only to the uncertainty behind our model, not to the point predictions derived from them, which are a mechanistic result of the minimization of the squared residuals and hence are not subject to probability or inference. That allows them in this case to provide a fitted distribution much more accurate apparently (black line above). However, the lesson to take from this model is that, even if the point predictions (fitted values) are artificially accurate[^05-flows-3], our capabilities to infer about the more general underlying process are fairly limited.
Keep in mind that the issues discussed in the paragraph above relate to the uncertainty behind our model, not to the point predictions derived from it, which are a mechanistic result of the minimisation of the squared residuals and hence are not subject to probability or inference. That is what allows the model in this case to provide an apparently much more accurate fitted distribution (black line above). However, the lesson to take from the model is that, even if the point predictions (fitted values) are artificially accurate[^05-flows-3], our ability to make inferences about the more general underlying process is fairly limited.
[^05-flows-3]: which they are not really, in light of the comparison between the black and red lines.
