# Model Selection Using Information Criteria
So far, we have learned to fit models with multiple predictors, both
quantitative and categorical, and to assess whether the required conditions
are met for linear regression to be an appropriate model for a dataset.
One piece is still missing: given an appropriate model with a set of
multiple predictors, how can we choose which predictors are worth
retaining in a "best" model for the data (and which have no
relationship, or only a weak one, with the response and should be
discarded)?
## Data and Model
Today we will recreate part of the analysis from [*Vertebrate community
composition and diversity declines along a defaunation gradient
radiating from rural villages in
Gabon*](https://doi.org/10.1111/1365-2664.12798), by Sally Koerner and
colleagues. They investigated the relationship between rural villages,
hunting, and wildlife in Gabon. One question they asked was how ape
abundance (the response `RA_Apes` below) depends on distance from
villages, village size, and vegetation characteristics.
They shared their data at
[Dryad.org](https://datadryad.org/stash/dataset/doi:10.5061/dryad.vs97g)
and we can read it in and fit a regression model like this:
```{r, gabon-data}
defaun <- read.csv('http://sldr.netlify.com/data/koerner_gabon_defaunation.csv')
```
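Before fitting, it can help to glance at the dataset's structure (an
optional check, not part of the original code):

```{r, data-peek}
# quick look at the variables available in the defaunation data
str(defaun)
```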
```{r, ape-models, echo=TRUE}
ape_mod <- lm(RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory +
Veg_Rich + Veg_Stems + Veg_liana +
LandUse + Distance + NumHouseholds, data = defaun)
summary(ape_mod)
as.numeric(logLik(ape_mod))
```
```{r, include=FALSE}
monkeyBIC <- BIC(ape_mod)
```
## Calculations
- Information criteria allow us to **balance the conflicting goals**
  of having a model that *fits the data as well as possible* (which
  pushes us toward models with more predictors) and *parsimony*
  (choosing the simplest model, with the fewest predictors, that works
  for the data and research question). The basic idea is that we
  **minimize** the quantity
  $-(2\,\text{LogLikelihood} - \text{penalty}) = -2\,\text{LogLikelihood} + \text{penalty}$
- AIC is computed as $-2\,\text{LogLikelihood} + 2k$, where $k$ is the
  number of coefficients being estimated (don't forget $\sigma$!).
  **Smaller AIC is better.**
- BIC is computed as $-2\,\text{LogLikelihood} + \ln(n)\,k$, where $n$
  is the number of observations (rows) in the dataset and $k$ is the
  number of coefficients being estimated. **Smaller BIC is better.**
- Verify that the BIC for this model is
  `r round(monkeyBIC, digits=2)` (the sketch below recomputes it by hand).
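To make these formulas concrete, here is a minimal by-hand check,
assuming the `ape_mod` object fitted above, compared against R's
built-in `AIC()` and `BIC()` functions:

```{r, ic-by-hand}
ll <- as.numeric(logLik(ape_mod))  # log-likelihood of the fitted model
k <- attr(logLik(ape_mod), 'df')   # number of estimated coefficients (includes sigma)
n <- nobs(ape_mod)                 # number of rows used in the fit
-2 * ll + 2 * k                    # AIC by hand
-2 * ll + log(n) * k               # BIC by hand (log() is the natural log in R)
AIC(ape_mod)                       # built-in versions, for comparison
BIC(ape_mod)
```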
## Decisions with ICs
The following rules of thumb (**not** laws, just common guidelines)
may help you make decisions with ICs; a worked comparison follows the
list.

- A model with lower IC *by at least 3 units* is notably better.
- If two or more models have ICs *within* 3 units of each other,
  there is not much difference between them. In that case, we usually
  choose the model with the fewest predictors.
- In some cases the research question is to measure the influence
  of a particular predictor on the response, but *the IC does not
  strongly support including that predictor* in the best model (IC
  difference less than 3). You might keep that predictor in anyway and
  then discuss the situation honestly, for example: "AIC does not
  provide strong support for including predictor x in the best model,
  but the model including x indicates that as x increases,
  the response decreases slightly. More research would be needed..."
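As a concrete illustration of these rules, here is a sketch comparing
the full model to one smaller candidate (the choice of
`Distance + NumHouseholds` is just an example, not from the original
analysis):

```{r, ic-comparison}
# a smaller candidate model using only the village-related predictors
small_mod <- lm(RA_Apes ~ Distance + NumHouseholds, data = defaun)
BIC(ape_mod)
BIC(small_mod)
BIC(small_mod) - BIC(ape_mod)  # a gap of 3+ units favors the lower-BIC model
```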
## All-possible-subsets Selection
The model we just fitted is our *full model*, with all predictors of
potential interest included. How can we use information criteria to
choose the best model from among all the models that use subsets of
these predictors?

We can use the `dredge()` function from the `MuMIn` package to compute
and display ICs for all of these candidate models.

Before using `dredge()`, we need to make sure our dataset has no missing
values, and we must set the `na.action` option for our model (this can
also be done in the original call, as in `lm(..., na.action = 'na.fail')`):
```{r, getaicbic}
library(MuMIn)
ape_mod <- ape_mod |> update(na.action = 'na.fail')
ape_dredge <- dredge(ape_mod, rank='BIC')
pander::pander(head(ape_dredge, 7))
```
- According to BIC, what is the best model for this dataset?
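Once `dredge()` has ranked the candidate models, you can refit the
top-ranked one directly (a sketch assuming the `ape_dredge` table
created above):

```{r, best-model}
# get.models() refits models from the dredge table; subset = 1 selects the top-ranked one
best_mod <- get.models(ape_dredge, subset = 1)[[1]]
summary(best_mod)
```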
## Which IC should I use?
AIC and BIC may give different best models, especially if the dataset is
large. You may want to choose one criterion *a priori* (before making
any calculations). You might prefer BIC if you want to err on the
"conservative" side: it is more likely to select a "smaller" model
with fewer predictors, because its per-coefficient penalty $\ln(n)$
exceeds AIC's penalty of 2 whenever $n > e^2 \approx 7.4$, which holds
for almost any real dataset.
## Quantities derived from AIC
- $\Delta AIC$ is the AIC for a given model minus the AIC of the best
  model in the candidate set (and likewise for $\Delta BIC$).
- *Akaike weights* are values between 0 and 1 that measure the
  weight of evidence that a given model is the best one, given
  that one of the models in the set really is best. A sketch of the
  computation follows.
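Here is a minimal sketch of the weight computation, assuming the
`ape_dredge` table from above (its `delta` column holds the
$\Delta$IC values; since we ranked by BIC these are BIC-based weights,
but the formula is the same):

```{r, akaike-weights}
# weights are proportional to exp(-delta/2), normalized to sum to 1
delta <- ape_dredge$delta
w <- exp(-delta / 2) / sum(exp(-delta / 2))
head(round(w, 3))  # weights for the top few models
```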
## Important Caution
**Very important**: ICs can **ONLY** be compared for models with the same
response variable, fitted to exactly the same rows of data. The sketch
below shows one quick check.
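One quick way to guard against silently dropped rows (a sketch assuming
`ape_mod` and the `small_mod` from the earlier comparison):

```{r, check-rows}
# lm() silently drops rows with missing values, so confirm both models used the same n
nobs(ape_mod) == nobs(small_mod)
```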