modeling_college_ed.qmd

---
title: "Modeling College Education"
format:
  html:
    embed-resources: true
editor: visual
execute: 
  echo: false
  warning: false
fig-cap-location: top
toc: true
toc-expand: true
---

```{r}
library(Hmisc)
library(tidyverse)
library(margins)
library(scales)
library(forecast)
library(fpp)
library(ggridges)
library(modelsummary)
library(urbnthemes)
set_urbn_defaults(style = "print")

# In regressions, treat ordered factors as unordered
defcontrasts=options('contrasts')
options(contrasts=c("contr.treatment", "contr.treatment"))

```

# Introduction

The main purpose of this document is to define a new model of college education that is sensitive to parents' wealth and income.

## Current Model

The model of college education in the current version of DYNASIM consists of four models that simulate four separate college-related events each year:

1.  Enrollment, for those who graduated from high school,
2.  Returning to college, for those who are enrolled in college but not yet in their senior year,
3.  Failing the senior year and remaining enrolled, for those who reach senior year, and
4.  Dropping out of college, for those in senior year.

Those who reach the senior year, don't fail it, and don't drop out graduate from college.

Likelihoods of these events, which are implemented as a set of transition-probability look-up tables, are modeled as a function of:

-   gender,
-   parents' education (less than high school, high-school diploma, some college, or college diploma or more), and
-   race/ethnicity (Black Non-Hispanic, Hispanic, and White Non-Hispanic).

Parents' education is the education of the head of a family. In two-parent households, in which parents are heterosexual couples, the father is assigned as the head.

This is a simplified model, as it assumes that each year spent in college, before the senior year, is successfully completed. All students who get enrolled in college are assumed to complete first year. In subsequent years, the retention model simulates their decision to return to school and continue their education.

## Logistic Hazard Model

The new model of college education will use the logistic hazard model rather than tables of transitional probabilities. This section introduces some concepts and notation.

### Survival and Hazard Functions

Assuming discrete time $t=1,2,...$ and a random variable $T$ that indicates the timing of an event (e.g., college enrollment), we define a *survival function* $S(.)$ at the end of time $t$ as: $$Pr(T>t) = 1-F(t) = S(t) = S_t$$ Similarly, at the beginning of $t$, the survival function is $Pr(T>t-1) = 1-F(t-1) = S(t-1)$. The (unconditional) probability of exit during the interval $t$ is $Pr(t-1<T<t) = S_{t-1}-S_t$. The conditional probability of exit during time $t$ conditional on surviving until $t-1$ is *interval hazard rate*: \begin{align}
h_t &= Pr(t-1<T<t|T>t-1) \\
     &= \frac{Pr(t-1<T<t)}{Pr(T>t-1)} \\
     &= \frac{S_{t-1}-S_t}{S_{t-1}}
\end{align}

The probability of survival until the end of interval $t$ is the product of probabilities of not experiencing an event in each of the intervals ${1,...,t}$:

$$S_t = \Pi_{k=1}^{t} (1-h_k)$$

### Proportional Odds and Logistic Hazard Model

The proportional odds model assumes that the odds ratio for a person with a vector of individual characteristics $X$ at time $t$ is proportional to the odds ratio at $t$ that is common to all individuals and an individual-specific scaling factor:

$$\frac{h(t,X)}{1-h(t,X)} = \frac{h_0(t)}{1-h_0(t)} \exp (\beta' X)$$

Taking into account the definition of the logit function $\text{logit}(x) = \log(x/(1-x))$ and denoting $\text{logit}(h_0(t))=\alpha_t$, yields the *logistic hazard model*:

$$h(t,X) = \frac{1}{1+\exp(-\alpha_t -\beta' X)}$$ This model can be estimated by the standard logistic regression. The data set has to have one observation per person per time period. The dependent variable is 0 for time periods without an event (i.e., survived periods) and 1 for the time period in which an event occurs. If an event occurs for a person, the period in which it occurs is the last period for that person.

Note that the only time-variable term is $\alpha_t$. This term can be modeled as a function of time (e.g., linear, logarithmic) or with time dummies.

# Proposed Model

The proposed model of college education consists of four models:

1.  College enrollment, for those who graduated from high school,
2.  College re-enrollment, for those who dropped out of college,
3.  Dropping out of college, for those who are enrolled in college,
4.  Graduating from college, for those who reach their senior year.

## Data and Settings


```{r}
#| echo: true

# These covariates are used in all regressions
model_covars = c("yearf", "asian", "pnetworth", "asian*pnetworth", "pared", "asian*pared")

# Linear transformation of wealth (divided by average annual earnings)
opt_wealth_xform = 'linear'

# Parents' education is encoded as number of parents with a college diploma
opt_pared_code = 'college'

# Coefficients' significance level
opt_sig_level = .1

```


```{r}
#| label: make-basedf

source(here::here("NLSY/nlsy_lib.R"))
source(here::here("dynasim_lib.R"))

# Create a base frame
basedf0 = nlsy_get_base_df() |> 
    select(id, sex, race, hisp, pnetworth, wt) |>
    filter(!is.na(race) & !is.na(hisp) & race!='No information') |>
    mutate(
        pnetworth_na = as.integer(is.na(pnetworth)),
        pnetworth = if_else(is.na(pnetworth), 0, pnetworth),
        race = nlsy_recode_race_and_eth5(race, hisp)
    ) |>
    # There are not enough Asians in the dataset for a separate model
    # We group them with whites and use a dummy variable
    mutate(
        asian = as.integer(race=="Asian"),
        race  = if_else(race=="Asian", "White", race)
    ) |>
    filter(race != 'Other')


basedf = nlsy_get_base_df() |> 
    select(id, sex, race, hisp, wt) |>
    filter(!is.na(race) & !is.na(hisp) & race!='No information') |>
    left_join(nlsy_get_imputed()) |>
    mutate(
        race = nlsy_recode_race_and_eth5(race, hisp),
        pared= nlsy_encode_parents_educ(mom_educ, dad_educ, encoding=opt_pared_code),
        # Remove retirement savings because we don't have them in FEH
        pnetworth = pnetworth-retsav
    ) |>
    # There are not enough Asians in the dataset for a separate model
    # We group them with whites and use a dummy variable
    mutate(
        asian = as.integer(race=="Asian"),
        race  = if_else(race=="Asian", "White", race)
    ) |>
    filter(race != 'Other') |>
    mutate(
        pnworth_cat = cut(
            pnetworth, 
            breaks=Hmisc::wtd.quantile(pnetworth, wt, probs=seq(0,1,.2)),
            labels=c('1st','2nd','3rd','4th','5th'),
            include.lowest = TRUE)
    )
  

```
```{r}
#| label: fig-pnetworth-density
#| fig-cap: "Distribution of parent's networth (divided by the average wage index) by race"

basedf |>
  select(id, pnetworth, pincome, race, wt) |>
  mutate(pnetworth = dyn_wealth_xform(pnetworth, type='linear')) |>
  filter(pnetworth>-20) |>
  ggplot() +
  geom_density(aes(x=pnetworth, y=after_stat(density), color=race, weight=wt), fill=NA) +
  xlab("Net worth / AWI") + ylab("Density")

```


```{r}
#| label: fig-ever-college-by-race-pnetworth
#| fig-cap: "The share of NLSY participants who attended college by race and quintile of parents' net worth"

nlsy_get_col_stat_annual_df() |>
    group_by(id) |>
    summarise(evercol = max(colenr)) |>
    right_join(
        basedf,
        by='id'
    ) |>
    filter(!is.na(pnworth_cat)) |>
    group_by(race, pnworth_cat) |>
    mutate(
        `College Attendance`=weighted.mean(evercol, wt),
        se = sqrt(`College Attendance`*(1-`College Attendance`)/n()),
        ci_l = `College Attendance`-1.96*se,
        ci_h = `College Attendance`+1.96*se) |>
    ggplot() +
      geom_col(aes(x=race,y=`College Attendance`,fill=pnworth_cat), position="dodge") +
#      geom_errorbar(aes(x=race, ymin=ci_l, ymax=ci_h, group=pnworth_cat), color='red', width = .2, position=position_dodge(.7)) +
      xlab("Race")

```


## Enrollment

### Enrollment After Graduating from High School

The college enrollment model simulates enrollment into college for high-school graduates. To allow high-school graduates to take a break before enrolling into college, we need to model enrollment as a *survival model*, in which all high-school graduate who have not enrolled in college are at risk of enrolling in each year after graduating from high school for some number of years that we call *enrollment window*. The enrollment window in the current model is seven years.

The data for estimating this model should have one observation per person per year. The time should be measured in years since the high-school graduation. The dependent variable should have value zero in years in which a person was not enrolled in college, and one in the year in which the person enrolled into college. For a person who enrolled in college, the year of enrollment would be the last year for that person in the dataset. A person who did not enroll during the enrollment window would have a zero for each year during that window.

Note that a simpler way to implement this model would be as a one-shot logistic regression model that selects high-school graduates who enroll into college. Those who are not selected don't get another opportunity. However, this model does not represent reality well because many people don't enroll into college immediately after their high-school graduation (@fig-enrol-hazard). If this kind of model was calibrated to match the total enrollment, it would result in the average college graduation age that is younger than that in the population.

Another interesting thing to note on @fig-enrol-hazard is that the college-enrollment hazard for Black women in years 2 to 7 remains significantly higher than the hazard for other groups.

```{r}
#| label: make-colenrdf
##| cache: true
##| cache-vars: colenrdf


colenrdf = nlsy_make_spell_df('enroll') |>
    filter(!is.na(spEnroll)) |>
    select(id, timeEnroll, tEnroll, spEnroll) |>
    inner_join(
        basedf,
        by='id'
    )


```

```{r}
#| label: fig-enrol-hazard
#| fig-cap: "Hazard function for college enrollment by race, sex and year since the graduation from high school."


plotdf = colenrdf |>
    group_by(timeEnroll, spEnroll, race, sex) |>
    summarise(tEnroll=weighted.mean(tEnroll, wt)) |>
    filter(spEnroll==1)

ggplot(plotdf) +
    geom_line(aes(x=timeEnroll, y=tEnroll, color=race)) +
    facet_grid(~sex) +
    xlab('Years') +
    ylab('Likelihood of college enrollment')

```

Even though some people make long breaks between high school and college, majority enrolls soon after high school: 90 percent enrolls within four years, 95 percent enrolls within 7 years, and 99 percent enrolls within 13 years (@fig-cum-share-gap-years).

```{r}
#| label: fig-cum-share-gap-years
#| fig-cap: "Cummulative share of years between high-school graduation and college enrollment."

plotdf = nlsy_make_spell_df() |> 
    filter(yrEnroll>0) |>
    mutate(gap_years=yrEnroll-hs_grad_year) |> 
    group_by(id) |> 
    summarise(gap_years=min(gap_years)) |>
    inner_join(basedf, by='id') |>
    ungroup() |>
    mutate(totwt=sum(wt)) |>
    group_by(gap_years) |>
    summarise(
        totwt=max(totwt),
        share=sum(wt)/totwt
        ) |>
    mutate(F=cumsum(share))

plotdf |>
    ggplot() +
    geom_line(aes(x=gap_years, y=F)) +
    xlab("Years between high-school graduation and college enrollment") +
    ylab("Cumulative share")

```


```{r}
#| label: fig-enrol-hazard-by-pnetworth
#| fig-cap: "Hazard function for college enrollment by quintile of parents net-worth and year since the graduation from high school."


plotdf = colenrdf |>
    group_by(timeEnroll, spEnroll, pnworth_cat) |>
    summarise(tEnroll=weighted.mean(tEnroll, wt)) |>
    filter(spEnroll==1)

ggplot(plotdf) +
    geom_line(aes(x=timeEnroll, y=tEnroll, color=pnworth_cat)) +
    xlab('Years') +
    ylab('Likelihood of college enrollment')

```


```{r}
#| label: fig-college-completion-tiles
#| fig-width: 7
#| fig-height: 10
#| fig-cap: "College years completed in every year since the graduation from high school."
#| eval: false

make_first_enr_year_df = function()
{
    # Make a college enrollment frame with:
    #   col1 (boolean):         enrolled first time
    #   cumenr (int):           cumulative years of enrollment,
    #   years_since_hs (int):   number of years since the high-school graduation
    #
    
    # Make a frame with college enrollment and highest grade completed
    colstdf = left_join(
      nlsy_get_col_stat_fall_df(),
      nlsy_get_highest_grade_completed_df(),
      by=c('id', 'year')
    )

    data = colstdf |>
      filter(year>=hs_grad_year) |>
      mutate(
        cumenr = cumsum(enrolled),
        col1 = as.integer(enrolled==TRUE & cumenr==1),
        year_since_hs = year-hs_grad_year
        )

    # Test that nobody has col1==1 in more than one year    
    stopifnot(max((data |> group_by(id) |>
        summarise(col1 = sum(col1)))$col1) == 1)

    # Add the year of college enrollment
    #   colenryr (int):         the year when first enrolled in college
    #
    data = data |>
        group_by(id) |>
        mutate(
            colenryr = case_when(
                col1 == 1 ~ year,
                TRUE ~ 0
                ),
            colenryr = max(colenryr)
        )
    
    return(data)    
}


plotdf = make_first_enr_year_df() |> 
  ungroup() |>
  select(id, year_since_hs, hcyc) |>
  pivot_wider(id_cols=id, names_from=year_since_hs, values_from = hcyc, names_prefix="y") |>
  arrange(y0,y1,y2,y3,y4,y5,y6,y7,y8,y9,y10,y11,y12,y13,y14,y15,y16,y17,y18,y19,y20,y21,y22) |>
  mutate(id2=row_number()) |>
  pivot_longer(
    starts_with('y'),
    names_prefix='y',
    values_to = 'hcyc',
    names_to='year_since_hs') |>
  mutate(
    year_since_hs=as.integer(year_since_hs)
    ) 

plotdf |>
  ggplot() + 
    geom_tile(aes(x=year_since_hs, y=id2, fill=hcyc)) +
    theme() + 
    scale_fill_gradientn() +
    theme(legend.position = "right",
          legend.direction = "vertical",
          axis.line.x = element_blank(),
          axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank(),
          panel.grid.major.y = element_blank()) +
    remove_ticks() +
    xlab("Years Since High School")

```

We estimate a hazard model of college enrollment that regresses the college-enrollment indicator on a set of indicators for years after the high school graduation and a set of variables representing individual characteristics for race and sex and their interactions, $x_i$, one for each combination of sex and race:

$$ \text{logit}(y_{it}^{(j)}) = \sum_{k=0}^T \alpha_k^{(j)} I(k=t) + \beta' x_i, j=1,...,6$$

@tbl-enrol-simple-mod-est shows the estimated coefficients.

```{r}
#| label: est-simple-models

marginal_effects = function(model, data, variables)
{
    return((summary(margins(model, data=data, variables=variables)))$AME)
}

# Create a factor year variable to use it as dummmies in regressions
estdf = colenrdf |>
    filter(spEnroll==1) |>
    mutate(
        yearf=factor(pmin(timeEnroll,8)),
        pnetworth=dyn_wealth_xform(pnetworth, type=opt_wealth_xform)
    )

model_formula = as.formula(paste("tEnroll ~ ", paste(model_covars, collapse = " + ")))

# Estimate one model for each sex-race combination
estdf = estdf |>
    group_by(race, sex) |>
    nest() |>
    mutate(
        mods = map(data, \(x) glm(
            model_formula, 
            data=x, 
            family=binomial)
        )
    ) |>
    mutate(pred = map(mods, \(x) predict(x, type='response'))) |>
    mutate(marg = map2(mods, data, \(x,y) marginal_effects(x, y, variables=c("pnetworth"))))


```

```{r}
#| label: tbl-enrol-simple-mod-est
#| tbl-cap: "Estimated coefficients for the model of college enrollment"

# Create a named list of models
estdf = estdf |> arrange(desc(sex), race)
models = estdf$mods
names(models) = dyn_get_sexrace_names(estdf)

# Create a regression table
modelsummary(models, 
             estimate="{estimate}{stars}", 
             statistic=NULL,
             stars=c('*'=.1, '**'=.05, '***'=.01),
             coef_rename=coef_rename
#             coef_rename=\(x) gsub('yearf(.*)', 'Year\\1', x)
             )

# Dataframe with coefficients
res_colenr = dyn_coef_to_df(models, sig_level=opt_sig_level)

coefs = list()

coefs[['colenr']] = list(
    df = dyn_coef_to_df(models, sig_level=opt_sig_level),
    filename = "coefs/coef_col_enr.csv",
    description = 'Coefficients for college enrollment model.'
)

```

@tbl-enroll-marginal shows average marginal effects of parental networth on the likelihood of enrollment. The numbers in the table can be interpreted as the effect that an increase in parental wealth by the amount equal to the average earnings in 1997 ($27,426) has on the likelihood of enrollment in college.

```{r}
#| label: tbl-enroll-marginal
#| tbl-cap: "Marginal effects of parental networth"

df = data.frame(estdf$marg)
colnames(df) = dyn_get_sexrace_names(estdf)
knitr::kable(df, digits=5)
```

@fig-enr-simple-mod-pred shows hazard rates by sex and race predicted by the model, as well as the actual hazard rates from @fig-enrol-hazard.

```{r}
#| label: fig-enr-simple-mod-pred
#| fig-cap: 'Hazard rates for college enrollment after graduating from high school by sex and race, actual and predicted.'

estdf |>
    select(sex, race, data, pred) |>
    unnest(cols=c(data, pred)) |>
    group_by(timeEnroll, race, sex) |>
    summarise(
        tEnroll = weighted.mean(tEnroll, wt),
        pred = weighted.mean(pred, wt)
    ) |>
    rename(
        `Actual`    = tEnroll,
        `Predicted` = pred
        ) |>
    pivot_longer(cols=c('Actual', 'Predicted')) |>
    ggplot() +
    geom_line(aes(x=timeEnroll, y=value, color=name)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab('Years since high-school graduation') +
    ylab('Likelihood of college enrollment')


```


### Enrollment After Graduating from High School, an Alternative

In this variation, we use the logarithmic transformation for parents' networth and we separate it in a positive and a negative term. For positive values of networth, the positive term equals the  natural logarithm of networth and the negative term is zero. For negative values of networth, the positive term is zero and the negative term equals the natural logarithm of negative networth. @tbl-enrol2-simple-mod-est shows the estimated coefficients.

```{r}
#| label: est-enroll2-models

# Create a factor year variable to use it as dummmies in regressions
estdf = colenrdf |>
    filter(spEnroll==1) |>
    mutate(
        yearf       = factor(pmin(timeEnroll,8)),
        pwealthpos  = dyn_wealth_xform(pnetworth, type='log+'),
        pwealthneg  = dyn_wealth_xform(pnetworth, type='log-')
    )

model_covars2 = gsub("pnetworth","pwealthpos", model_covars)
model_covars2 = c(model_covars2, "pwealthneg", "asian*pwealthneg")
model_formula = as.formula(paste("tEnroll ~ ", paste(model_covars2, collapse = " + ")))

# Estimate one model for each sex-race combination
estdf = estdf |>
    group_by(race, sex) |>
    nest() |>
    mutate(
        mods = map(data, \(x) glm(
            model_formula, 
            data=x, 
            family=binomial)
        )
    ) |>
    mutate(pred = map(mods, \(x) predict(x, type='response'))) |>
    mutate(
        margpos = map2(mods, data, \(x,y) marginal_effects(x, y, variables=c("pwealthpos"))),
        margneg = map2(mods, data, \(x,y) marginal_effects(x, y, variables=c("pwealthneg"))),
        )

```

```{r}
#| label: tbl-enrol2-simple-mod-est
#| tbl-cap: "Estimated coefficients for the model of college enrollment with parental networth split into negative an dpositive terms"

# Create a named list of models
estdf = estdf |> arrange(desc(sex), race)
models = estdf$mods
names(models) = dyn_get_sexrace_names(estdf)

# Create a regression table
modelsummary(models, 
             estimate="{estimate}{stars}", 
             statistic=NULL,
             stars=c('*'=.1, '**'=.05, '***'=.01),
             coef_rename=\(x) gsub('yearf(.*)', 'Year\\1', x)
             )

```

@tbl-enroll2-marginal shows average marginal effects of parental networth on the likelihood of enrollment. The numbers in the table can be interpreted as the effect that one percent increase in parental wealth has on the likelihood of enrollment in college.

```{r}
#| label: tbl-enroll2-marginal
#| tbl-cap: "Marginal effects of parental networth separate for negative and positive"

df1 = data.frame(estdf$margpos)
df2 = data.frame(estdf$margneg)
colnames(df1) = dyn_get_sexrace_names(estdf)
colnames(df2) = dyn_get_sexrace_names(estdf)

df = bind_rows(df1, df2)
colnames(df) = dyn_get_sexrace_names(estdf)
knitr::kable(df, digits=5)
```


### Enrollment After Dropping out of College

Many students who drop out of college return and re-enroll. Some do this multiple times. @fig-enrol-hazard-spells shows enrollment hazard rates for the first two spells of non-enrollment after dropping out of college for all students who drop out of college. @fig-enrol-hazard-two-spells-sex-race shows enrollment hazard rates for the first two spells of non-enrollment after dropping out of college by sex and race. The higher-order spells have few observations, which makes their hazard rates very noisy.

```{r}
#| label: fig-enrol-hazard-spells
#| fig-cap: "Hazard function for college enrollment by enrollment spell. X-axes represents years since the graduation from high school, for the first spell of non-enrollment, and years, since dropping out of college for subsequent spells."


plotdf = colenrdf |>
    group_by(timeEnroll, spEnroll) |>
    summarise(tEnroll=weighted.mean(tEnroll, wt)) |>
    filter(spEnroll %in% 2:4) |>
    mutate(spEnroll = factor(spEnroll))


ggplot(plotdf) +
    geom_line(aes(x=timeEnroll, y=tEnroll, color=spEnroll)) +
    xlab('Years') +
    ylab('Likelihood of college enrollment')

```

```{r}
#| label: fig-enrol-hazard-two-spells-sex-race
#| fig-cap: "Hazard function for college enrollment by enrollment spell, sex, and race. X-axes represents years since dropping out of college."

plotdf = colenrdf |>
    group_by(timeEnroll, spEnroll, race, sex) |>
    summarise(tEnroll=weighted.mean(tEnroll, wt)) |>
    filter(spEnroll %in% 2:3) |>
    filter(tEnroll < .5) |>
    mutate(spEnroll = factor(spEnroll))


ggplot(plotdf) +
    geom_line(aes(x=timeEnroll, y=tEnroll, color=spEnroll)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab('Years') +
    ylab('Likelihood of college enrollment')


```

Because hazard rates for the first two spells are so close to each other and higher-order spells have few observations, we can merge all the spells together and estimate a single hazard rate for returning to college after dropping out.

```{r}
#| label: fig-enrol-hazard-all-spells-sex-race
#| fig-cap: "Hazard function for college enrollment, sex, and race. X-axes represents years since dropping out of college."

plotdf = colenrdf |>
    filter(spEnroll %in% 2:7) |>
    group_by(timeEnroll, race, sex) |>
    summarise(tEnroll=weighted.mean(tEnroll, wt)) |>
    filter(tEnroll < .5)


ggplot(plotdf) +
    geom_line(aes(x=timeEnroll, y=tEnroll)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab('Years') +
    ylab('Likelihood of college enrollment')

```

```{r}
#| label: est-reenrol-models

# Create a factor year variable to use it as dummmies in regressions
estdf = colenrdf |>
    filter(spEnroll==2) |>
    mutate(
        yearf=factor(pmin(timeEnroll,5)),
        pnetworth=dyn_wealth_xform(pnetworth, type=opt_wealth_xform)
    )

model_formula = as.formula(paste("tEnroll ~ ", paste(model_covars, collapse = " + ")))

# Estimate one model for each sex-race combination
estdf = estdf |>
    group_by(race, sex) |>
    nest() |>
    mutate(
        mods = map(data, \(x) glm(
            model_formula, 
            data=x,
            family=binomial)
        )
    ) |>
    mutate(pred = map(mods, \(x) predict(x, type='response'))) |>
    mutate(marg = map2(mods, data, \(x,y) marginal_effects(x, y, variables=c("pnetworth"))))
```

```{r}
#| label: tbl-reenrol-mod-est
#| tbl-cap: "Estimated coefficients of the re-enrollment model"

# Create a named list of models
estdf = estdf |> arrange(desc(sex), race)
models = estdf$mods
names(models) = dyn_get_sexrace_names(estdf)

# Create a regression table
modelsummary(models, 
             estimate="{estimate}{stars}", 
             statistic=NULL,
             stars=c('*'=.1, '**'=.05, '***'=.01),
             coef_rename=\(x) gsub('yearf(.*)', 'Year\\1', x)
             )

# Dataframe with coefficients
res_colreenr = dyn_coef_to_df(models)

coefs[['colreenr']] = list(
    df = dyn_coef_to_df(models, sig_level=opt_sig_level),
    filename = "coefs/coef_col_reenr.csv",
    description = 'Coefficients for the model of re-enrolling into college after dropping out.'
)

```


@tbl-reenroll-marginal shows average marginal effects of parental networth on the likelihood of enrollment in college after having dropped out. The numbers in the table can be interpreted as the effect that an increase in parental wealth by the amount equal to the average earnings in 1997 ($27,426) has on the likelihood of re-enrollment in college.

```{r}
#| label: tbl-reenroll-marginal
#| tbl-cap: "Marginal effects of parental networth on re-enrollment in college"

df = data.frame(estdf$marg)
colnames(df) = dyn_get_sexrace_names(estdf)
knitr::kable(df, digits=5)
```


```{r}
#| label: fig-enr-return-simple-mod-pred
#| fig-cap: 'Hazard rates for college enrollment after dropping out of college by sex and race, actual and predicted.'

estdf |>
    select(sex, race, data, pred) |>
    unnest(cols=c(data, pred)) |>
    group_by(timeEnroll, race, sex) |>
    summarise(
        tEnroll = weighted.mean(tEnroll, wt),
        pred = weighted.mean(pred, wt)
    ) |>
    rename(
        `Actual` = tEnroll,
        `Predicted` = pred
        ) |>
    pivot_longer(cols=c('Actual', 'Predicted')) |>
    ggplot() +
    geom_line(aes(x=timeEnroll, y=value, color=name)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab('Years') +
    ylab('Likelihood of college enrollment')


```

## Dropping out of College

```{r}
rm(m1, m2models, models, plotdf)
```

The drop-out model simulates the likelihood of dropping out of college for students who are enrolled in college. This is a survival model where the dependent variable is zero if a student is enrolled in college and one in the year the student drops out. For students who graduate from college, the variable is zero in all years.

**Multiple Spells.** A problem with modeling dropping out of college as a terminal event is that some students who drop out of college return later. Some drop out and return multiple times, i.e., have multiple spells of college enrollment. @fig-college-spells-number shows the distribution of number of spells in the NLSY sample for all individuals who were enrolled in college and @fig-college-spells-shares shows this distribution by race and sex. It appears that about one-third of students have more than one spell, and some have as many as seven. Which is why we do not treat dropping out of college as a terminal event. If a student drops out before graduation, they would be at risk of re-enrolling in subsequent years.

```{r}
#| label: make-drop-df
#| cache: true
#| cache-vars: dropdf

dropdf = nlsy_make_spell_df() |>
    filter(!is.na(spDrop))

```

```{r}
#| label: fig-college-spells-number
#| fig-cap: "Number of college enrollment spells"

nspellsdf = dropdf |>
    group_by(id) |>
    summarise(nspells = max(spDrop)) |>
    inner_join(
        basedf,
        by='id'
    )

nspellsdf |>
    ggplot() +
    geom_bar(aes(x=nspells)) +
    xlab("Number of spells") +
    ylab("Number of students")    
```

```{r}
#| label: fig-college-spells-shares
#| fig-cap: "Shares of students by the number of college enrollment spells by race and sex"

nspellsdf |>
    group_by(race, sex) |>
    mutate(total=n()) |>
    ungroup() |>
    group_by(race, sex, nspells) |>
    summarise(
        spellrat = n()/max(total),
        nspells = max(nspells)
        ) |>
    ggplot() +
    geom_col(aes(x=nspells, y=spellrat)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab("Number of spells") +
    ylab("Share of students")

# nspellsdf |>
#     filter(colgradyr>0 & colenryr>0) |>
#     mutate(coltime = colgradyr-colenryr) |>
#     group_by(nspells, coltime) |>
#     tally() |>
#     spread(nspells, n)
#     


```

```{r}
plotdf = dropdf |>
    filter(timeDrop>0) |>
  inner_join(basedf, by='id') |>
  group_by(spDrop, timeDrop) |>
  summarise(tDrop = weighted.mean(tDrop, wt))

plotdf |>
  mutate(spell=factor(spDrop)) |>
  ggplot(aes(x=timeDrop, y=tDrop, color=spell)) +
  geom_line() +
  xlab("Year since the beginning of a spell") +
  ylab("Share dropped")

```


@fig-dropout-hazard shows estimated hazard functions of dropping out of college in every year after the enrollment. Because some students drop out and re-enroll, these hazards are estimated for each enrollment spell separately (@fig-dropout-hazard-1). Hazard functions for spells two and three track each other closely over first seven years. Because higher numbered spells have few observations, we combine them and estimate a hazard function for spells 3-6 (@fig-dropout-hazard-2), spells 2-6 (@fig-dropout-hazard-3), and spells 1-6 (@fig-dropout-hazard-4).

```{r}
#| label: fig-dropout-hazard
#| fig-cap: "Drop-out hazard functions"
#| fig-subcap: 
#|   - "All six spells"
#|   - "Combine spells 3-6"
#|   - "Combine spells 2-6"
#|   - "Combine all spells"
#| layout-nrow: 2
#| layout-ncol: 2

plotdf = dropdf |>
  filter(timeDrop>0) |>
  inner_join(basedf, by='id') |>
  group_by(spDrop, timeDrop) |>
  summarise(tDrop = weighted.mean(tDrop, wt))

# Plot all 6 spells
plotdf |>
  mutate(spell=factor(spDrop)) |>
  ggplot(aes(x=timeDrop, y=tDrop, color=spell)) +
  geom_line() +
  xlab("Year since the beginning of a spell") +
  ylab("Share dropped")

# Merge spells 3 and above
plotdf = dropdf |>
  filter(timeDrop>0) |>
  inner_join(basedf, by='id') |>
  mutate(spDrop = pmin(spDrop, 3)) |>
  group_by(spDrop, timeDrop) |>
  summarise(tDrop = weighted.mean(tDrop, wt))

plotdf |>
  mutate(spell=factor(spDrop)) |>
  ggplot(aes(x=timeDrop, y=tDrop, color=spell)) +
  geom_line() +
  xlab("Year since the beginning of a spell") +
  ylab("Share dropped")

# Merge spells 2 and above
plotdf = dropdf |>
  filter(timeDrop>0) |>
  inner_join(basedf, by='id') |>
  mutate(spDrop = pmin(spDrop, 2)) |>
  group_by(spDrop, timeDrop) |>
  summarise(tDrop = weighted.mean(tDrop, wt))

plotdf |>
  mutate(spell=factor(spDrop)) |>
  ggplot(aes(x=timeDrop, y=tDrop, color=spell)) +
  geom_line() +
  xlab("Year since the beginning of a spell") +
  ylab("Share dropped")

# Merge all spells
plotdf = dropdf |>
  filter(timeDrop>0) |>
  inner_join(basedf, by='id') |>
  group_by(timeDrop) |>
  summarise(tDrop = weighted.mean(tDrop, wt))

plotdf |>
  ggplot(aes(x=timeDrop, y=tDrop)) +
  geom_line() +
  xlab("Year since the beginning of a spell") +
  ylab("Share dropped")


```


```{r}
#| label: est-simple-drop-models

# Create a factor year variable to use it as dummmies in regressions
estdf = dropdf |>
    filter(timeDrop>0) |>
    inner_join(
        basedf,
        by='id'
    ) |>
#    filter(spDrop==1) |>
    mutate(
        yearf=factor(pmin(timeDrop,5)),
        pnetworth=dyn_wealth_xform(pnetworth, type=opt_wealth_xform)
        )

model_formula = as.formula(paste("tDrop ~ ", paste(model_covars, collapse = " + ")))

# Estimate one model for each sex-race combination
estdf = estdf |>
    group_by(race, sex) |>
    nest() |>
    mutate(
        mods = map(data, \(x) glm(
            model_formula, 
            data=x, 
            family=binomial)
        )
    ) |>
    mutate(pred = map(mods, \(x) predict(x, type='response'))) |>
    mutate(marg = map2(mods, data, \(x,y) marginal_effects(x, y, variables=c("pnetworth"))))
```

```{r}
#| label: tbl-drop-simple-mod-est
#| tbl-cap: "Estimated coefficients of the model of dropping out of college"

# Create a named list of models
estdf = estdf |> arrange(desc(sex), race)
models = estdf$mods
names(models) = dyn_get_sexrace_names(estdf)

# Create a regression table
modelsummary(models, 
             estimate="{estimate}{stars}", 
             statistic=NULL,
             stars=c('*'=.1, '**'=.05, '***'=.01),
             coef_rename=\(x) gsub('yearf(.*)', 'Year\\1', x))

# Dataframe with coefficients
res_coldrop = dyn_coef_to_df(models)

coefs[['coldrop']] = list(
    df = dyn_coef_to_df(models, sig_level=opt_sig_level),
    filename = "coefs/coef_col_drop.csv",
    description = 'Coefficients for the model of dropping out of college.'
)

```

@tbl-drop-marginal shows average marginal effects of parental networth on the likelihood of dropping out of college. The numbers in the table can be interpreted as the effect that an increase in parental wealth by the amount equal to the average earnings in 1997 ($27,426) has on the likelihood of dropping out of college.

```{r}
#| label: tbl-drop-marginal
#| tbl-cap: "Marginal effects of parental networth on dropping out of college"

df = data.frame(estdf$marg)
colnames(df) = dyn_get_sexrace_names(estdf)
knitr::kable(df, digits=5)
```

```{r}
#| label: fig-drop-simple-mod-pred
#| fig-cap: 'Hazard rates for dropping out of college by sex and race, actual and predicted.'

estdf |>
    select(sex, race, data, pred) |>
    unnest(cols=c(data, pred)) |>
    group_by(timeDrop, race, sex) |>
    summarise(
        tDrop = weighted.mean(tDrop, wt),
        pred = weighted.mean(pred, wt)
    ) |>
    rename(
        `Actual`    = tDrop,
        `Predicted` = pred
        ) |>
    pivot_longer(cols=c('Actual', 'Predicted')) |>
    filter(timeDrop<=10) |>
    ggplot() +
    geom_line(aes(x=timeDrop, y=value, color=name)) +
    scale_x_continuous(breaks=seq(2,10,2)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab('Years since college enrollment') +
    ylab('Likelihood of dropping out of college')


```

## Graduating from College

```{r}
#| label: make-grad-df
#| cache: true
#| cache-vars: graddf

graddf = nlsy_make_spell_df() |>
    filter(!is.na(spGrad))

```


```{r}
#| label: fig-college-grad-time
#| fig-cap: "Distribution of the time between enrollment into college and the graduation (unweighted)"

graddf |>
    group_by(id) |>
    filter(tGrad==1) |>
    mutate(coltime = colgradyr-colenryr) |>
    ggplot() +
    geom_bar(aes(x=coltime)) +
    xlab("Years to graduate") +
    ylab("Number of students")
    
```


```{r}
#| label: fig-college-grad-hazard
#| fig-cap: "Hazard function for college graduation by sex and race"

graddf = graddf |>
    inner_join(
        basedf,
        by='id'
    ) 

graddf |>
    group_by(timeGrad,sex,race) |>
    summarise(tGrad = weighted.mean(tGrad,wt)) |>
    ggplot(aes(x=timeGrad, y=tGrad, color=race)) +
    geom_line() +
    facet_wrap(~sex) +
    xlab("Years since the senior year") +
    ylab("Likelihood of graduating")
    

```
```{r}
#| label: est-grad-simple-models

# Create a factor year variable to use it as dummmies in regressions
estdf = graddf |>
    mutate(
        yearf=factor(pmin(timeGrad,4)),
        pnetworth=dyn_wealth_xform(pnetworth, type=opt_wealth_xform)
    )

model_formula = as.formula(paste("tGrad ~ ", paste(model_covars, collapse = " + ")))

# Estimate one model for each sex-race combination
estdf = estdf |>
    group_by(race, sex) |>
    nest() |>
    mutate(mods = map(
        data, \(x) glm(
            model_formula, 
            data=x, 
            family=binomial)
        )
    ) |>
    mutate(pred = map(mods, \(x) predict(x, type='response'))) |>
    mutate(marg = map2(mods, data, \(x,y) marginal_effects(x, y, variables=c("pnetworth"))))
```

```{r}
#| label: tbl-grad-simple-mod-est
#| tbl-cap: "Estimated coefficients of the college graduation model"

# Create a named list of models
estdf = estdf |> arrange(desc(sex), race)
models = estdf$mods
names(models) = dyn_get_sexrace_names(estdf)

# Create a regression table
modelsummary(models, 
             estimate="{estimate}{stars}", 
             statistic=NULL,
             stars=c('*'=.1, '**'=.05, '***'=.01),
             coef_rename=\(x) gsub('yearf(.*)', 'Year\\1', x))

# Dataframe with coefficients
res_colgrad = dyn_coef_to_df(models)

coefs[['colgrad']] = list(
    df = dyn_coef_to_df(models, sig_level=opt_sig_level),
    filename = "coefs/coef_col_grad.csv",
    description = 'Coefficients for college graduation model.'
)

# Uniformize and write coefficients
coefs2 = dyn_uniformize_coef(coefs)
map(coefs2, \(x) dyn_write_coef_file(x$df, x$filename, x$description))

```

@tbl-grad-marginal shows average marginal effects of parental networth on the likelihood of graduating from college. The numbers in the table can be interpreted as the effect that an increase in parental wealth by the amount equal to the average earnings in 1997 ($27,426) has on the likelihood of graduation from college.

```{r}
#| label: tbl-grad-marginal
#| tbl-cap: "Marginal effects of parental networth on graduation from college"

df = data.frame(estdf$marg)
colnames(df) = dyn_get_sexrace_names(estdf)
knitr::kable(df, digits=5)
```

```{r}
#| label: fig-grad-simple-mod-pred
#| fig-cap: 'Hazard rates for college graduation after the senior year by sex and race, actual and predicted.'

estdf |>
    select(sex, race, data, pred) |>
    unnest(cols=c(data, pred)) |>
    group_by(timeGrad, race, sex) |>
    summarise(
        tGrad = weighted.mean(tGrad, wt),
        pred = weighted.mean(pred, wt),
    ) |>
    rename(
        `Actual`    = tGrad,
        `Predicted` = pred
        ) |>
    pivot_longer(cols=c('Actual', 'Predicted')) |>
    ggplot() +
    geom_line(aes(x=timeGrad, y=value, color=name)) +
    facet_grid(rows=vars(sex), cols=vars(race)) +
    xlab('Years since senior year') +
    ylab('Likelihood of college graduation')


```