-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04_synthesis-demo.qmd
460 lines (329 loc) · 13.6 KB
/
04_synthesis-demo.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
---
title: "Data Synthesis Demo"
date: today
format:
html:
fig-cap-location: top
number_sections: false
embed-resources: true
toc: true
css: ../www/web_report.css
editor_options:
chunk_output_type: console
execute:
warning: false
message: false
bibliography: references.bib
---
```{=html}
<style>
@import url('https://fonts.googleapis.com/css?family=Lato&display=swap');
</style>
```
```{r}
#| echo: false
exercise_number <- 1
```
```{r, echo = FALSE, setup}
# load packages we need
library(tidyverse)
library(urbnthemes)
library(palmerpenguins)
library(kableExtra)
library(gt)
source(here::here("R", "create_table.R"))
# set Urban Institute data visualization styles
set_urbn_defaults(base_size = 12)
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
```
## Demo: Creating Fully Synthetic `palmerpenguins` Data
The `palmerpenguins` package contains data about three species of penguins collected from three islands in the Palmer Archipelago, Antarctica. We will use an adapted version of the dataset to demonstrate some of the concepts discussed above.
```{r}
# create dataset we will be using
penguins <- penguins |>
filter(species == "Adelie") |>
select(
sex,
bill_length_mm,
flipper_length_mm
) |>
drop_na()
penguins |>
head() |>
create_table()
```
<br>
The above code simplifies the dataset to only three variables and removes missing values in those variables. We will synthesize the `sex`, `bill_length_mm`, and `flipper_length_mm` variables in this dataset using some of the methods discussed above. Since we are synthesizing all three variables, our final version of the dataset is considered fully synthetic.
### Synthesize `sex` variable
Let's start by synthesizing `sex`, which is a binary variable that can take a value of either "male" or "female". To synthesize this variable, we will identify the underlying percentages of the data that fall into each category and use it to generate records that mimic the properties of the confidential data.
```{r}
# identify percentage of total that each category (sex) makes up
penguins |>
count(sex) |>
mutate(relative_frequency = n / sum(n)) |>
create_table()
```
<br>
Using these proportions, we will now randomly sample with replacement to mimic the underlying distribution of gender.
```{r}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# vector of gender categories
sex_categories <- c("female", "male")
# size of sample to generate
synthetic_data_size <- nrow(penguins)
# probability weights
sex_probs <- penguins |>
count(sex) |>
mutate(relative_frequency = n / sum(n)) |>
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
sex_synthetic <- sample(
x = sex_categories,
size = synthetic_data_size,
replace = TRUE,
prob = sex_probs
)
```
Our new `sex_synthetic` variable will form the foundation of our synthesized data.
```{r}
# use vector to generate synthetic gender column
penguins_synthetic <- tibble(
sex = sex_synthetic
)
penguins_synthetic |>
head() |>
create_table()
```
<br>
### Synthesize `bill_length_mm` variable
Unlike `sex`, `bill_length_mm` is numeric.
```{r}
summary(penguins$bill_length_mm)
```
<br>
To synthesize this variable, we are going to predict the `bill_length_mm` for each penguin using a linear regression with `sex` as a predictor.
Note that `sex` is a factor variable with possible values of "male" and "female" which can't directly be used in a regression. So under the hood, R converts that factor variable into a binary numeric variables (i.e. `0` or `1`), and then runs the regression.
```{r}
# linear regression
bill_length_lm <- lm(
formula = bill_length_mm ~ sex,
data = penguins
)
summary(bill_length_lm)
```
<br> <br>
Now that we have coefficients for the linear regression, we can generate our synthetic values. First, we try a straightforward prediction of bill lengths using the synthetic `sex` variable.
```{r}
# predict bill length with model coefficients
bill_length_synthetic_method1 <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method1 <- penguins_synthetic |>
mutate(bill_length_mm = bill_length_synthetic_method1)
```
And now we compare the univariate distributions of the confidential data to our newly synthesized variable with a graph.
```{r}
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method1,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins |>
select(data_source, bill_length_mm) |>
pivot_longer(-data_source, names_to = "variable") |>
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 1") +
scatter_grid()
```
### What is the problem here?
Simply using the predicted values from our linear regression does not give us enough variation in the synthetic variable.
To understand more about the predictions made from the linear regression model, let's dig into the predicted values for the first five rows of data and the corresponding synthetic `sex`.
```{r}
# Look at first few rows of synthetic data
penguins_synthetic_method1 |>
head() |>
create_table()
```
<br>
We know from our regression analysis output from above that the intercept ($\beta_0$) is 42.1 and the coefficient for a male penguin ($\beta_1$) is 3.8. Therefore, if the penguin is male, we have a predicted value ($\hat{y}$) of `42.1 + 3.8 = 45.9`. If the penguin is female, our predicted value ($\hat{y}$) is only the intercept, `42.1`.
Because `sex` will only take two values, our synthetic `bill_length_mm` also only takes two values. The model fit limits the variation that is possible, making our synthetic variable significantly less useful.
Instead, we can try a second method. We will create a version of the variable, where the synthetic value is a draw from a normal distribution with a mean of the regression line for the given predictions, and standard deviation equal to the residual standard error.
```{r}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict bill length with model coefficients
bill_length_predicted <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
bill_length_synthetic_method2 <- rnorm(
n = nrow(penguins_synthetic),
mean = bill_length_predicted,
sd = sigma(bill_length_lm)
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method2 <- penguins_synthetic |>
mutate(bill_length_mm = bill_length_synthetic_method2)
```
Now, we again compare the univariate distributions of the confidential data and the synthetic data we generated with this second method.
```{r}
#| code-fold: true
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method2,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins |>
select(data_source, bill_length_mm) |>
pivot_longer(-data_source, names_to = "variable") |>
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 2") +
scatter_grid()
```
We have much more variation with this new method, though the distributions still do not match perfectly. We choose this method's output to add as the synthetic `bill_length_mm` in our final synthesized dataset. And now our synthesized dataset has two columns.
```{r}
# using method 2 as synthesized variable
penguins_synthetic <- penguins_synthetic |>
mutate(bill_length_mm = bill_length_synthetic_method2)
penguins_synthetic |>
head() |>
create_table()
```
<br>
### Synthesize `flipper_length_mm`
The `flipper_length_mm` variable is also numeric, so we can follow the same steps we used to synthesize `bill_length_mm`.
```{r}
summary(penguins$flipper_length_mm)
```
<br>
This time, we regress `flipper_length_mm` on both `sex` and `bill_length_mm`.
```{r}
# linear regression
flipper_length_lm <- lm(
formula = flipper_length_mm ~ sex + bill_length_mm,
data = penguins
)
summary(flipper_length_lm)
```
<br> <br>
Since we already know we prefer the method using draws from the normal distribution centered on the mean of the regression line, we will default to that.
```{r}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict flipper length with model coefficients
flipper_length_predicted <- predict(
object = flipper_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
flipper_length_synthetic <- rnorm(
n = nrow(penguins_synthetic),
mean = flipper_length_predicted,
sd = sigma(flipper_length_lm)
)
# add predictions to synthetic data as flipper_length
penguins_synthetic <- penguins_synthetic |>
mutate(flipper_length_mm = flipper_length_synthetic)
```
### Final (fully synthesized) product and discussion
With all three synthesized variables, we can compare the the univariate distributions of the confidential data and the synthetic data (we have already done this for `bill_length_mm`).
```{r}
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic,
.id = "data_source"
)
# Write out final compare_penguins df for use in future exercises
dir.create(here::here("data/"), showWarnings = FALSE)
compare_penguins |>
write_csv(here::here("data", "penguins_synthetic_and_confidential.csv"))
```
First, we can compare the distributions of the `sex` variable in the confidential and synthetic data.
```{r, fig.height = 3}
sex_comparison <- compare_penguins |>
select(data_source, sex) |>
count(data_source, sex) |>
group_by(data_source) |>
mutate(relative_frequency = n / sum(n))
sex_comparison |>
ggplot(aes(x = n, y = sex, fill = data_source)) +
geom_text(aes(label = n),
position = position_dodge(width = 0.5),
hjust = -0.4) +
geom_col(position = "dodge", alpha = 0.7) +
scale_x_continuous(expand = expansion(add = c(0, 15))) +
labs(y = "", x = "N", title = "Comparison of sex distribution")
```
```{r}
# plot comparison of distributions
compare_penguins |>
select(
data_source,
bill_length_mm,
flipper_length_mm
) |>
pivot_longer(-data_source, names_to = "variable") |>
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
facet_wrap(~variable, scales = "free") +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Final synthetic product") +
scatter_grid()
```
<br> <br>
## `r paste("Exercise", exercise_number)`
### Discuss `sex` synthesis
::: {.panel-tabset}
#### <font color="#55b748">**Question 1**</font>
- Was the method we used parametric or nonparametric? Why?
- What do we notice about the synthetic variable compared to the original?
- We generated a new `sex` variable that was the same length as the confidential data. Was this necessary for the method to be applied correctly?
#### <font color="#55b748">**Question 1 Notes**</font>
- Was the method we used parametric or nonparametric? Why?
- **Nonparametric; the data generation process was based on underlying frequencies rather than a distribution or generative model.**
- What do we notice about the synthetic variable compared to the original?
- We generated a new `sex` variable that was the same length as the confidential data. Was this necessary for the method to be applied correctly?
- **No, the number of rows in the dataset does not matter as long as the underlying frequencies are preserved**
:::
### Discuss `bill_length_mm` synthesis {.tabset}
::: {.panel-tabset}
#### <font color="#55b748">**Question 2**</font>
- Was the method we just used parametric or nonparametric? Why?
- What do we notice about the synthetic variable compared to the original?
- Are we synthesizing these data sequentially? How do you know?
#### <font color="#55b748">**Question 2 Notes**</font>
- Was the method we just used parametric or nonparametric? Why?
- **Parametric: the data generation process was based on a (linear) model.**
- What do we notice about the synthetic variable compared to the original?
- Are we synthesizing these data sequentially? How do you know?
- **Yes, the previously synthesized `sex` variable was used as a predictor.**
:::
### Discuss `flipper_length_mm` synthesis
::: {.panel-tabset}
#### <font color="#55b748">**Question 3**</font>
- What do we notice about the synthetic variable compared to the original?
- What are benefits and drawbacks to generating this variable sequentially?
#### <font color="#55b748">**Question 3 Notes**</font>
- What do we notice about the synthetic variable compared to the original?
- What are benefits and drawbacks to generating this variable sequentially?
- **Benefits: generating this variable sequentially allows us to preserve more of the multivariate relationships present in the data**.
- **Drawbacks: if we synthesized previous variables poorly, our synthesis of this variable will also be affected**.
:::
### Overall Discussion
<font color="#55b748">**Question 4**</font>
- Would you feel comfortable using this version of the dataset for analysis? Why or why not? What additional tests might you run to determine this?
<br> <br>