-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathreport-template.Rmd
274 lines (195 loc) · 9.41 KB
/
report-template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
---
title: "Course Project Template"
output: html_notebook
---
The purpose of this template is:
1. to recap some of the concepts covered in the course, and
2. to provide a resource of code snippets that will be useful for your course project (and hopefully other future R projects!)
As you prepare the report, you will likely encounter strange errors and get stuck multiple times. **That's normal and there's no way around it as you learn**. The basic approach for getting un-stuck is to spend a **limited** amount of time **actively** looking for the solution and then **asking for help**.
Two common scenarios, and a systematic way to troubleshoot them are:
1. **You encounter an unexpected result or error**
* Check **spelling** - missing or extra quotes, commas, parentheses, spaces, and misspelled function names
* Read the error message and try to make sense of it
* Google the error message (cut and paste!)
2. **You (kind of) know what you want to do but don't know how to do it in R**
* Consult a relevant [RStudio Cheat Sheet](https://www.rstudio.com/resources/cheatsheets/)
* If you know it was covered in the workshop, re-watch the workshop. The shared drive for Penn Pathology is: smb://uphsfp15.uphs.upenn.edu/SHAREDATA2/HUP/Pathology/Reproducible_Clinical_Data_Analysis
* Do a Datacamp tutorial. Recommended tutorials include:
* [Introduction to the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse) is highly recommended for everyone.
* [Data Visualization with ggplot2 (Part 1)](https://www.datacamp.com/courses/data-visualization-with-ggplot2-1) for more practice with ggplot.
* [Data Manipulation in R with dplyr](https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial) for more practice with dplyr
* [Cleaning Data in R](https://www.datacamp.com/courses/cleaning-data-in-r) is useful if you have messy data that needs to be cleaned.
* [Joining Data in R with dplyr](https://www.datacamp.com/courses/joining-data-in-r-with-dplyr) is useful if you need to combine data from two separate data frames.
Asking good questions when you get stuck is a key skill for coding and collaborative data analysis. I recommend the following formula:
> I tried to accomplish (*goal*) so I tried this code:
> (*copy and paste the code here*)
>
> I expected this: ...
>
> Instead I get this result / error:
> (*copy and paste the result or error message*)
The best resources for asking questions are:
1. Email one of the course directors: [Stephan](mailto:[email protected]), [Amrom](mailto:[email protected]), or [Dan](mailto:[email protected])
2. Post your question on the [CHOP R User Group Slack channel](https://choprusergroup.slack.com)
What follows is a suggested structure for a reproducible report as well as some code snippets that you might want to use or modify.
## Background and Objectives
This section should address the following questions:
* What is the question to be addressed by this project?
* How could addressing this question change patient management and/or further generalizable knowledge?
* What's the point of doing this analysis?
* Would anyone else care about the results?
* What are the first 1-2 simple but important graphs or statistics that you'd like to generate?
## Data Acquisition
The **narrative** of this section should cover the following:
* General properties of the data set
* For data files:
* Origin, with details on how is was created and/or source from where it was acquired
* File name and file format
* For databases:
* Database server, database, and table name (**not** login credentials!)
* Code Book
* List each column of the data set (after isolating data of interest) with a concise description.
The **code** in this section should accomplish the following:
* For data files:
* Import your data into R
* Isolate the data of interest
* For databases:
* Connect to the database
* Query the data and place it into a data frame
* Optional: Dump the data in a local file
* Disconnect
### Import data file
Don't forget to load necessary packages first:
```{r}
library(tidyverse)
library(lubridate)
```
#### CSV file
Here is a sample code snippet to load a CSV file named `my_data.csv` into a data frame named `my_data`. Try to give your data more descriptive names than that!
```{r}
my_data <- read_csv("my_data.csv")
```
#### Excel file
This code snippet loads and Excel file `my_data.xlsx` into a data frame named `my_data`. We need to load the `readxl` package for the `read_excel()` function to work.
```{r}
library(readxl)
my_data <- read_excel("my_data.xlsx")
```
### Isolate data of interest
This sample code snippet would select the `Patient_MRN`, `Patient_Age`, `Patient_Sex`, `Date_of_Collection`, and `Result` columns and then filter the rows such that only those rows remain in which `Date_of_Collection` is within the year of 2017. The `arrange()` function would then sort rows by `Date_of_Collection`.
```{r}
my_data <- my_data %>%
select(Patient_MRN,
Patient_Age,
Patient_Sex,
Date_of_Collection,
Result) %>%
filter(year(Date_of_Collection) == 2017) %>%
arrange(Date_of_Collection)
```
### Code book
The code book is where you document concisely document the data contained in each column. Be sure to include enough detail to avoid misunderstandings. For example:
- **Patient_MRN**. Patient medical record number (HUP)
- **Patient_Age**. Patient age (in years) at the time the specimen was collected.
- **Patient_Sex**. Patient sex. M = Male, F = Female.
- **Date_of_Collection**. The date on which the specimen was collected.
- **Result**. Laboratory result value for this specimen.
## Data Exploration and Cleaning
The **narrative and code** of this section should, together, describe and accomplish the following:
1. Convert **messy** data to **tidy**
2. Check for **missing values**
3. **Explore** and **fix** data in each column
4. **Display** the cleaned data table
### Convert messy data to tidy
If any of following are not true for your data set, then you will have to tidy your data:
* Each **variable** is in its own **column**
* Each **case** is in its own **row**
* Each **value** is in its own **cell**
A tutorial for cleaning data can be found here: [Cleaning Data in R](https://www.datacamp.com/courses/cleaning-data-in-r)
### Missing values
The following code chunk shows columns that have missing data in the data frame `my_data`.
```{r}
my_data %>%
map_df(function(x) sum(is.na(x))) %>%
gather(column, NAs)
```
### Explore and fix data column-by-column
In this section, go column by column and do the following:
* Verify (and if necessary, convert) data type
* **Numeric** variables should be **integer** (whole numbers) or **numeric** (fractions)
* **Date-time** variables should be **Date** (whole dates) or **POSIXct** (date/time stamps)
* **Categorial** variables should be **factor**
* All others, in general, should be **character**
* Examine the distribution
* **Numeric** variables: **histogram**
* **Categorical** variables: **numerical summary**
* **Character** variables: inspect a small **sample**
* Comment on whether the data is as expected
* Range and shape of distribution
* Number and breakdown of categories
* Properties of character variables
* Explain discrepancies
* Resolve data problems
Below are a few examples of how various types of columns could be handled.
#### A numeric column
`Patient_Age` is an **integer** column, which is appropriate.
```{r}
ggplot(data = my_data) +
geom_histogram(mapping = aes(x = Patient_Age),
binwidth = 1)
```
The histogram shows ..., which is expected.
#### A character column
`Patient_MRN` is a **character** column, which is appropriate.
```{r}
set.seed(1)
my_data %>%
sample_n(10) %>%
pull(Patient_MRN)
```
A sample of data shows ..., which is expected.
#### A character column that should be a factor
`Patient_Sex` is a **character** column, but since it represents a categorial variable, convert it to a **factor**.
```{r}
my_data <- my_data %>%
mutate(Patient_Sex = as_factor(Patient_Sex))
```
```{r}
my_data %>%
pull(Patient_Sex) %>%
summary()
```
A numerical summary shows ..., which is expected.
#### A character column that should be integer
`Result` is a **character** column, but since it represents a numeric variable, convert it to **integer**.
Before doing so, examine any rows that do **not** contain an integer value:
```{r}
my_data %>%
filter(is.na(as.integer(Result))) %>%
group_by(Result) %>%
summarize(n = n()) %>%
arrange(desc(n))
```
Now convert to integer and drop all rows in which `Result` had a non-numeric entry.
```{r}
my_data <- my_data %>%
mutate(Result = as.integer(Result)) %>%
drop_na(Result)
```
```{r}
ggplot(data = my_data) +
geom_histogram(mapping = aes(x = Result),
binwidth = 1)
```
The histogram shows ..., which is expected.
### Overview of the cleaned data
The following code chunk creates an interactive table from the `my_data` data frame, which is useful for peer review and presentation purposes.
```{r}
my_data
```
## Visualization and Modeling
This section should iterate through multiple cycles of:
1. Pose a **question** or state a **hypothesis**
2. Create a **graph** or **statistical summary** to address #1
3. **Comment** on new insights gained and new problems raised
## Summary and Conclusions