04_synthesis-demo.qmd

---
title: "Data Synthesis Demo"
date: today
format:
  html:
    fig-cap-location: top
    number_sections: false
    embed-resources: true
    toc: true
    css: ../www/web_report.css
editor_options:
  chunk_output_type: console
execute:
  warning: false
  message: false
bibliography: references.bib
---

```{=html}
<style>
@import url('https://fonts.googleapis.com/css?family=Lato&display=swap');
</style>
```

```{r}
#| echo: false

exercise_number <- 1
```

```{r, echo = FALSE, setup}
# load packages we need
library(tidyverse)
library(urbnthemes)
library(palmerpenguins)
library(kableExtra)
library(gt)

source(here::here("R", "create_table.R"))

# set Urban Institute data visualization styles
set_urbn_defaults(base_size = 12)

# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
```

## Demo: Creating Fully Synthetic `palmerpenguins` Data

The `palmerpenguins` package contains data about three species of penguins collected from three islands in the Palmer Archipelago, Antarctica. We will use an adapted version of the dataset to demonstrate some of the concepts discussed above.

```{r}
# create dataset we will be using
penguins <- penguins |> 
  filter(species == "Adelie") |> 
  select(
    sex, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  drop_na() 

penguins |> 
  head() |> 
  create_table()
```

<br>

The above code simplifies the dataset to only three variables and removes missing values in those variables. We will synthesize the `sex`, `bill_length_mm`, and `flipper_length_mm` variables in this dataset using some of the methods discussed above. Since we are synthesizing all three variables, our final version of the dataset is considered fully synthetic.


### Synthesize `sex` variable

Let's start by synthesizing `sex`, which is a binary variable that can take a value of either "male" or "female". To synthesize this variable, we will identify the underlying percentages of the data that fall into each category and use it to generate records that mimic the properties of the confidential data.

```{r}
# identify percentage of total that each category (sex) makes up
penguins |>
  count(sex) |>
  mutate(relative_frequency = n / sum(n)) |> 
  create_table()
```

<br>

Using these proportions, we will now randomly sample with replacement to mimic the underlying distribution of gender.

```{r}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)

# vector of gender categories
sex_categories <- c("female", "male")

# size of sample to generate
synthetic_data_size <- nrow(penguins)

# probability weights
sex_probs <- penguins |>
  count(sex) |>
  mutate(relative_frequency = n / sum(n)) |>
  pull(relative_frequency)

# use sample function to generate synthetic vector of genders
sex_synthetic <- sample(
  x = sex_categories, 
  size = synthetic_data_size, 
  replace = TRUE, 
  prob = sex_probs
)
```

Our new `sex_synthetic` variable will form the foundation of our synthesized data.

```{r}
# use vector to generate synthetic gender column
penguins_synthetic <- tibble(
  sex = sex_synthetic
)

penguins_synthetic |> 
  head() |> 
  create_table()
```

<br>

### Synthesize `bill_length_mm` variable

Unlike `sex`, `bill_length_mm` is numeric.

```{r}
summary(penguins$bill_length_mm)
```

<br>

To synthesize this variable, we are going to predict the `bill_length_mm` for each penguin using a linear regression with `sex` as a predictor. 

Note that `sex` is a factor variable with possible values of "male" and "female" which can't directly be used in a regression. So under the hood, R converts that factor variable into a binary numeric variables (i.e. `0` or `1`), and then runs the regression.

```{r}
# linear regression
bill_length_lm <- lm(
  formula = bill_length_mm ~ sex, 
  data = penguins
)

summary(bill_length_lm)
```

<br> <br>

Now that we have coefficients for the linear regression, we can generate our synthetic values. First, we try a straightforward prediction of bill lengths using the synthetic `sex` variable.

```{r}
# predict bill length with model coefficients
bill_length_synthetic_method1 <- predict(
  object = bill_length_lm, 
  newdata = penguins_synthetic
)

# add predictions to synthetic data as bill_length
penguins_synthetic_method1 <- penguins_synthetic |>
  mutate(bill_length_mm = bill_length_synthetic_method1)
```


And now we compare the univariate distributions of the confidential data to our newly synthesized variable with a graph.

```{r}
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
  "confidential" = penguins,
  "synthetic" = penguins_synthetic_method1,
  .id = "data_source"
)

# plot comparison of bill_length_mm distributions
compare_penguins |>
  select(data_source, bill_length_mm) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_density(alpha = 0.3) +
  labs(title = "Comparison of Univariate Distributions",
       subtitle = "Method 1") +
  scatter_grid()
```

### What is the problem here?

Simply using the predicted values from our linear regression does not give us enough variation in the synthetic variable.

To understand more about the predictions made from the linear regression model, let's dig into the predicted values for the first five rows of data and the corresponding synthetic `sex`.

```{r}
# Look at first few rows of synthetic data
penguins_synthetic_method1 |> 
  head() |> 
  create_table()
```

<br>


We know from our regression analysis output from above that the intercept ($\beta_0$) is 42.1 and the coefficient for a male penguin ($\beta_1$) is 3.8. Therefore, if the penguin is male, we have a predicted value ($\hat{y}$) of `42.1 + 3.8 = 45.9`. If the penguin is female, our predicted value ($\hat{y}$) is only the intercept, `42.1`. 

Because `sex` will only take two values, our synthetic `bill_length_mm` also only takes two values. The model fit limits the variation that is possible, making our synthetic variable significantly less useful.

Instead, we can try a second method. We will create a version of the variable, where the synthetic value is a draw from a normal distribution with a mean of the regression line for the given predictions, and standard deviation equal to the residual standard error.

```{r}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)

# predict bill length with model coefficients
bill_length_predicted <- predict(
  object = bill_length_lm, 
  newdata = penguins_synthetic
)

# synthetic column using normal distribution centered on predictions with sd of residual standard error
bill_length_synthetic_method2 <- rnorm(
  n = nrow(penguins_synthetic), 
  mean = bill_length_predicted, 
  sd = sigma(bill_length_lm)
)

# add predictions to synthetic data as bill_length
penguins_synthetic_method2 <- penguins_synthetic |>
  mutate(bill_length_mm = bill_length_synthetic_method2)
```

Now, we again compare the univariate distributions of the confidential data and the synthetic data we generated with this second method.

```{r}
#| code-fold: true

# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
  "confidential" = penguins,
  "synthetic" = penguins_synthetic_method2,
  .id = "data_source"
)

# plot comparison of bill_length_mm distributions
compare_penguins |>
  select(data_source, bill_length_mm) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_density(alpha = 0.3) +
  labs(title = "Comparison of Univariate Distributions",
       subtitle = "Method 2") +
  scatter_grid()
```

We have much more variation with this new method, though the distributions still do not match perfectly. We choose this method's output to add as the synthetic `bill_length_mm` in our final synthesized dataset. And now our synthesized dataset has two columns.

```{r}
# using method 2 as synthesized variable
penguins_synthetic <- penguins_synthetic |>
  mutate(bill_length_mm = bill_length_synthetic_method2)


penguins_synthetic |> 
  head() |> 
  create_table()
```

<br> 

### Synthesize `flipper_length_mm`

The `flipper_length_mm` variable is also numeric, so we can follow the same steps we used to synthesize `bill_length_mm`.

```{r}
summary(penguins$flipper_length_mm)
```

<br>

This time, we regress `flipper_length_mm` on both `sex` and `bill_length_mm`.

```{r}
# linear regression
flipper_length_lm <- lm(
  formula = flipper_length_mm ~ sex + bill_length_mm, 
  data = penguins
)

summary(flipper_length_lm)
```

<br> <br>

Since we already know we prefer the method using draws from the normal distribution centered on the mean of the regression line, we will default to that.

```{r}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)

# predict flipper length with model coefficients
flipper_length_predicted <- predict(
  object = flipper_length_lm, 
  newdata = penguins_synthetic
)

# synthetic column using normal distribution centered on predictions with sd of residual standard error
flipper_length_synthetic <- rnorm(
  n = nrow(penguins_synthetic), 
  mean = flipper_length_predicted, 
  sd = sigma(flipper_length_lm)
)

# add predictions to synthetic data as flipper_length
penguins_synthetic <- penguins_synthetic |>
  mutate(flipper_length_mm = flipper_length_synthetic)
```

### Final (fully synthesized) product and discussion

With all three synthesized variables, we can compare the the univariate distributions of the confidential data and the synthetic data (we have already done this for `bill_length_mm`).

```{r}
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
  "confidential" = penguins,
  "synthetic" = penguins_synthetic,
  .id = "data_source"
)

# Write out final compare_penguins df for use in future exercises
dir.create(here::here("data/"), showWarnings = FALSE)

compare_penguins |> 
  write_csv(here::here("data", "penguins_synthetic_and_confidential.csv"))
```

First, we can compare the distributions of the `sex` variable in the confidential and synthetic data.

```{r, fig.height = 3}
sex_comparison <- compare_penguins |>
  select(data_source, sex) |>
  count(data_source, sex) |>
  group_by(data_source) |>
  mutate(relative_frequency = n / sum(n))

sex_comparison |> 
  ggplot(aes(x = n, y = sex, fill = data_source)) +
  geom_text(aes(label = n), 
            position = position_dodge(width = 0.5),
            hjust = -0.4) + 
  geom_col(position = "dodge", alpha = 0.7) +
  scale_x_continuous(expand = expansion(add = c(0, 15))) +
  labs(y = "", x = "N", title = "Comparison of sex distribution")

```

```{r}
# plot comparison of distributions
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Comparison of Univariate Distributions",
       subtitle = "Final synthetic product") +
  scatter_grid()
```

<br> <br>

## `r paste("Exercise", exercise_number)`

### Discuss `sex` synthesis

::: {.panel-tabset}

#### <font color="#55b748">**Question 1**</font>

-   Was the method we used parametric or nonparametric? Why?
-   What do we notice about the synthetic variable compared to the original?
-   We generated a new `sex` variable that was the same length as the confidential data. Was this necessary for the method to be applied correctly?

#### <font color="#55b748">**Question 1 Notes**</font>


- Was the method we used parametric or nonparametric? Why?
  
  - **Nonparametric; the data generation process was based on underlying frequencies rather than a distribution or generative model.**

- What do we notice about the synthetic variable compared to the original?

- We generated a new `sex` variable that was the same length as the confidential data. Was this necessary for the method to be applied correctly?
  
  - **No, the number of rows in the dataset does not matter as long as the underlying frequencies are preserved**

:::

### Discuss `bill_length_mm` synthesis {.tabset}

::: {.panel-tabset}

#### <font color="#55b748">**Question 2**</font>


- Was the method we just used parametric or nonparametric? Why?

- What do we notice about the synthetic variable compared to the original?

- Are we synthesizing these data sequentially? How do you know?

#### <font color="#55b748">**Question 2 Notes**</font>
- Was the method we just used parametric or nonparametric? Why?
  
  - **Parametric: the data generation process was based on a (linear) model.**

- What do we notice about the synthetic variable compared to the original?

- Are we synthesizing these data sequentially? How do you know?
  
  - **Yes, the previously synthesized `sex` variable was used as a predictor.**

:::

### Discuss `flipper_length_mm` synthesis

::: {.panel-tabset}

#### <font color="#55b748">**Question 3**</font>

-   What do we notice about the synthetic variable compared to the original?
-   What are benefits and drawbacks to generating this variable sequentially?

#### <font color="#55b748">**Question 3 Notes**</font>

-   What do we notice about the synthetic variable compared to the original?

-   What are benefits and drawbacks to generating this variable sequentially?

    -   **Benefits: generating this variable sequentially allows us to preserve more of the multivariate relationships present in the data**.
    -   **Drawbacks: if we synthesized previous variables poorly, our synthesis of this variable will also be affected**.

:::

### Overall Discussion

<font color="#55b748">**Question 4**</font>

-   Would you feel comfortable using this version of the dataset for analysis? Why or why not? What additional tests might you run to determine this?

<br> <br>