forked from rdpeng/RepData_PeerAssessment1
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
## Loading and preprocessing the data
##### 1. Load the data
```{r}
## Import the necessary libraries
library(data.table)
library(ggplot2)
library(plyr)
## Read the CSV file
data <- read.csv(file = "activity.csv")
```
##### 2. Transform the data into a format suitable for analysis
```{r}
stepsData <- as.data.table(x = data)
```
## What is mean total number of steps taken per day?
##### 1. Calculate the total number of steps taken per day
```{r}
## Pass the formula positionally: the `formula` argument name was renamed to `x` in R 4.2
totalSteps <- aggregate(steps ~ date, data = stepsData, FUN = sum)
```
The first few rows of `totalSteps` are as follows:
```{r, echo=FALSE}
head(totalSteps)
```
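Note that the formula interface of `aggregate` drops rows containing `NA` by default (`na.action = na.omit`), so a day with no recorded steps is absent from `totalSteps` rather than listed with a total of 0. A minimal sketch on a hypothetical three-day data frame (`toyDays` and `toyTotals` are illustrative names, not part of the assignment data) shows the effect:

```r
## Hypothetical toy data: 2012-10-02 has no recorded steps at all
toyDays <- data.frame(
  steps = c(10, 20, NA, NA, 5, 15),
  date = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)
## Rows with NA are dropped before summing, so 2012-10-02 does not appear
toyTotals <- aggregate(steps ~ date, data = toyDays, FUN = sum)
print(toyTotals)
```

This is why the histogram and the mean and median below are computed over days with data only.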
##### 2. Make a histogram of the total number of steps taken each day
```{r}
qplot(x = steps, data = totalSteps, geom = "histogram", main = "Total Steps Per Day")
```
##### 3. Calculate the mean and median of the total number of steps taken per day
```{r}
stepsMean <- mean(x = totalSteps$steps, na.rm = TRUE)
stepsMedian <- median(x = totalSteps$steps, na.rm = TRUE)
```
The **mean** of the total number of steps taken per day is **`r sprintf(fmt = "%4.2f", stepsMean)`** and the **median** is **`r stepsMedian`**.
## What is the average daily activity pattern?
##### 1. Make a time series plot of the 5-minute interval and the average number of steps taken, averaged across all days
```{r}
## Calculate the average number of steps taken in the 5-minute interval
averageSteps <- ddply(.data = stepsData, .variables = "interval", .fun = summarise, steps = mean(steps, na.rm = TRUE))
## Create the plot
qplot(x = interval, y = steps, data = averageSteps, geom = "line", xlab = "5-Minute Interval(HHMM)", ylab = "Average Number of Steps", main = "Time Series of Avg. Steps Against 5-Minute Interval")
```
##### 2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r}
## Find the index of the maximum of steps
index <- which.max(x = averageSteps$steps)
## Get the interval located at the index
maxInterval <- averageSteps$interval[index]
```
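`which.max` returns a row position in the table of averages, so reading the interval label from that same table keeps position and label aligned regardless of how the raw data is ordered. A small sketch with hypothetical values (`toyAverage`, `toyIndex`, and `toyMaxInterval` are illustrative names, not the real results):

```r
## Hypothetical per-interval averages
toyAverage <- data.frame(
  interval = c(0, 5, 10, 15),
  steps = c(1.5, 0.2, 7.8, 3.1)
)
## Row position of the largest average
toyIndex <- which.max(toyAverage$steps)
## Interval label stored at that row
toyMaxInterval <- toyAverage$interval[toyIndex]
print(toyMaxInterval)
```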
The *interval with maximum number of steps* is **`r maxInterval`**.
## Imputing missing values
##### 1. Calculate and report the total number of missing values in the dataset
```{r}
countNA <- sum(is.na(stepsData[,steps]))
```
*Total number of missing values* in the dataset is **`r countNA`**.
##### 2. Fill in all of the missing values in the dataset with the average 5-minute interval values and create a new dataset with all the missing data filled in
We replace each NA value with the mean number of steps for its 5-minute interval, rounded up with `ceiling`. The new dataset, with the missing values filled in, is named `completeStepsData`.
```{r}
## Mean steps for each 5-minute interval, rounded up and named by interval
intervalMeans <- ceiling(tapply(X = stepsData[,steps], INDEX = stepsData[,interval], FUN = mean, na.rm = TRUE))
## Copy the original data so that it stays untouched
completeStepsData <- data
## Replace each NA with the mean of its own interval, looked up by interval label
isMissing <- is.na(completeStepsData$steps)
completeStepsData$steps[isMissing] <- intervalMeans[as.character(completeStepsData$interval[isMissing])]
```
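The fill-in rule can be sketched on a small hypothetical data frame (`toy`, `toyMeans`, and `toyFilled` are illustrative names, not part of the assignment data). Each missing value is replaced by the rounded-up mean of its own interval, looked up by interval label:

```r
## Hypothetical toy data: two days, three intervals, two missing values
toy <- data.frame(
  steps = c(NA, 4, 7, 3, NA, 8),
  date = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2)
)
## Mean steps per interval, ignoring NA, rounded up
toyMeans <- ceiling(tapply(toy$steps, toy$interval, mean, na.rm = TRUE))
## Replace each NA with the mean of its own interval
toyFilled <- toy
isMissing <- is.na(toyFilled$steps)
toyFilled$steps[isMissing] <- toyMeans[as.character(toyFilled$interval[isMissing])]
print(toyFilled)
```

Looking the mean up by interval label, rather than relying on row order, keeps the result correct even if the missing values do not cover whole days.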
The first few rows of the new dataset `completeStepsData`, with all the missing data filled in, are as follows:
```{r, echo=FALSE}
head(completeStepsData)
```
##### 4. Make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day
###### Total number of steps taken each day(with missing values filled in)
```{r}
## Pass the formula positionally: the `formula` argument name was renamed to `x` in R 4.2
newTotalSteps <- aggregate(steps ~ date, data = completeStepsData, FUN = sum)
```
The first few rows of `newTotalSteps` are as follows:
```{r, echo=FALSE}
head(newTotalSteps)
```
###### Histogram of the total number of steps taken each day(with missing values filled in)
```{r}
qplot(x = steps, data = newTotalSteps, geom = "histogram", main = "Total Steps Per Day With Missing Data Filled In")
```
###### Calculate the mean and median of the total number of steps taken per day(with missing values filled in)
```{r}
newStepsMean <- mean(x = newTotalSteps$steps, na.rm = TRUE)
newStepsMedian <- median(x = newTotalSteps$steps, na.rm = TRUE)
```
The **mean** of the total number of steps taken per day is **`r sprintf(fmt = "%4.2f", newStepsMean)`** and the **median** is **`r sprintf(fmt = "%4.2f", newStepsMedian)`**.
## Are there differences in activity patterns between weekdays and weekends?
```{r}
## Create a new data frame with a column for the type of the day
day.name <- weekdays(as.POSIXct(completeStepsData$date))
day.type <- ifelse(day.name == "Saturday" | day.name == "Sunday", "weekend", "weekday" )
weeklySteps <- cbind(completeStepsData, day.type)
## Calculate the average steps for 5-minute intervals for weekend and weekday
newAverageSteps <- ddply(.data = weeklySteps, .variables = c("interval", "day.type"), .fun = summarise, steps = mean(steps))
## Plotting
qplot(x = interval, y = steps, data = newAverageSteps, geom = "line", xlab = "5-Minute Interval(HHMM)", ylab = "Average Number of Steps", main = "Time Series of Avg. Steps Against 5-Minute Interval", facets = day.type ~ .)
```
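`weekdays()` returns day names in the current locale, so the comparison against "Saturday" and "Sunday" above assumes an English locale. A locale-independent alternative, sketched here on hypothetical dates (`toyDates` and `toyDayType` are illustrative names), uses the numeric weekday from `as.POSIXlt`, where 0 is Sunday and 6 is Saturday:

```r
## Hypothetical dates: Friday 5 October 2012 through Monday 8 October 2012
toyDates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
## wday is 0 for Sunday and 6 for Saturday, whatever the locale
toyDayType <- ifelse(as.POSIXlt(toyDates)$wday %in% c(0, 6), "weekend", "weekday")
print(toyDayType)
```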