forked from rdpeng/RepData_PeerAssessment1
# Data Science Specialization
## Reproducible Research: Peer Assessment 1
### Loading and preprocessing the data
Loading the data
```{R}
raw_data <- read.csv("activity.csv", stringsAsFactors = FALSE)
# Omit rows with missing values
activity.data <- na.omit(raw_data)
```
### Mean total number of steps taken per day
Histogram of the total number of steps taken each day
```{R}
# Total number of steps taken per day
steps_per_day <- tapply(activity.data$steps, activity.data$date, sum)
head(steps_per_day)
# Histogram of the total number of steps taken each day
hist(steps_per_day, xlab="Total steps per day", main="Histogram of Total steps per day")
# Mean total number of steps per day
mean(steps_per_day)
# Median total number of steps per day
median(steps_per_day)
```
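Daily totals can also be computed with `aggregate()`, which returns a data frame with one row per date instead of a named vector. The sketch below runs on made-up data; `toy_steps` and `toy_daily` are illustrative names, not part of the analysis above.

```{R}
# Made-up data: two days with two intervals each (not the real dataset)
toy_steps <- data.frame(
  steps = c(5, 15, 20, 30),
  date  = rep(c("2012-10-01", "2012-10-02"), each = 2)
)
# One row per date, with the summed steps in the `steps` column
toy_daily <- aggregate(steps ~ date, data = toy_steps, FUN = sum)
toy_daily
# Mean of the daily totals
mean(toy_daily$steps)
```

The formula interface drops rows with missing values by default, so it behaves like the `na.omit()` approach used above.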
### Average daily activity pattern
Time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days
```{R}
# Mean number of steps for each 5-minute interval, named by interval
steps_by_interval <- tapply(activity.data$steps, activity.data$interval, mean)
plot(x=as.numeric(names(steps_by_interval)), y=steps_by_interval, type="l", xlab="Interval", ylab="Average number of steps")
```
The 5-minute interval that, on average across all the days in the dataset, contains the maximum number of steps
```{R}
names(steps_by_interval)[steps_by_interval == max(steps_by_interval)]
```
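The same lookup can be written with `which.max()`, which returns the position of the first maximum and keeps its name. A minimal sketch on made-up values; `toy_means` is a hypothetical stand-in for `steps_by_interval`.

```{R}
# Hypothetical interval means, named by interval identifier (not real data)
toy_means <- c("0" = 1.5, "5" = 0.3, "10" = 7.2, "15" = 2.0)
# which.max() returns the index of the first maximum and keeps its name
toy_peak <- which.max(toy_means)
names(toy_peak)        # label of the interval with the highest mean
toy_means[[toy_peak]]  # the maximum mean itself
```

Unlike the `==` comparison, `which.max()` reports only the first interval if several share the maximum.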
### Imputing missing values
Total number of missing values in the dataset (i.e. the total number of rows with NAs)
```{R}
sum(is.na(raw_data$steps))
```
My strategy for filling in the missing values is to replace each one with the mean for its 5-minute interval, averaged across all days
```{R}
fill_in_data <- raw_data
# Replace each missing value with the mean for its 5-minute interval
missing <- is.na(fill_in_data$steps)
fill_in_data$steps[missing] <- steps_by_interval[as.character(fill_in_data$interval[missing])]
# Recompute the daily totals from the imputed data
steps_per_day <- tapply(fill_in_data$steps, fill_in_data$date, sum)
```
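The interval-mean strategy can also be sketched with `ave()`, which computes a statistic within groups and returns a vector aligned with the input rows. This is an alternative illustration on made-up data, not the code used above; `toy` and its values are assumptions.

```{R}
# Made-up data: two days, three 5-minute intervals, one missing value
toy <- data.frame(
  steps    = c(10, NA, 30, 20, 40, 50),
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2)
)
# Mean of the non-missing steps within each interval, aligned row by row
toy_interval_mean <- ave(toy$steps, toy$interval,
                         FUN = function(x) mean(x, na.rm = TRUE))
# Replace only the missing entries; observed values are left untouched
toy$steps_filled <- ifelse(is.na(toy$steps), toy_interval_mean, toy$steps)
toy$steps_filled
```

Because `ave()` returns one value per row, no lookup by interval name is needed.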
Histogram of the total number of (imputed) steps taken per day
```{R}
hist(steps_per_day, xlab="Total number of steps per day",
main="(Imputed) Histogram of total number of steps per day")
# Mean total number of steps per day
mean(steps_per_day)
# Median total number of steps per day
median(steps_per_day)
```
#### Do these values differ from the estimates from the first part of the assignment?
The mean has not changed, because each missing value was replaced with the mean for its 5-minute interval.
The median has changed, because the imputed values alter the distribution of daily totals.
#### What is the impact of imputing missing data on the estimates of the total daily number of steps?
Because step values were added where previously there were none, the total number of steps has increased.
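To see how the mean of the daily totals can stay the same while the median moves, consider the case where the missing values cover whole days. That is an assumption made here for illustration; it is not checked above. Each imputed day then totals the sum of the interval means, which equals the mean daily total of the observed days. The data and all `toy_` names below are made up.

```{R}
# Made-up data: three observed days and one fully missing day, two intervals each
toy_days <- data.frame(
  steps    = c(0, 10, 4, 20, 8, 60, NA, NA),
  date     = rep(paste0("day", 1:4), each = 2),
  interval = rep(c(0, 5), times = 4)
)
# Daily totals from the observed rows only
toy_obs <- na.omit(toy_days)
toy_totals_before <- tapply(toy_obs$steps, toy_obs$date, sum)
# Impute with interval means computed from the observed rows
toy_int_means <- tapply(toy_obs$steps, toy_obs$interval, mean)
toy_filled <- toy_days
toy_na <- is.na(toy_filled$steps)
toy_filled$steps[toy_na] <- toy_int_means[as.character(toy_filled$interval[toy_na])]
toy_totals_after <- tapply(toy_filled$steps, toy_filled$date, sum)
c(mean(toy_totals_before), mean(toy_totals_after))      # equal
c(median(toy_totals_before), median(toy_totals_after))  # differ
```

If missing values were scattered within otherwise observed days, the daily totals of those days would rise and the mean would generally change as well.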
### Differences in activity patterns between weekdays and weekends
Factor variable in the dataset with two levels: "Weekday" and "Weekend"
```{R}
# POSIXlt wday runs from 0 (Sunday) to 6 (Saturday), so weekends are 0 and 6
fill_in_data$day <- factor(ifelse(as.POSIXlt(as.Date(fill_in_data$date))$wday %in% c(0, 6), "Weekend", "Weekday"))
```
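`as.POSIXlt()` numbers the days of the week from 0 (Sunday) to 6 (Saturday), so a weekend test has to cover both ends of that range. A quick check on four consecutive dates, chosen here only for illustration:

```{R}
# Friday through Monday, 5 to 8 October 2012
toy_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
toy_wday <- as.POSIXlt(toy_dates)$wday
toy_wday   # 5 6 0 1: Saturday is 6 and Sunday is 0
# Classify each date; both 0 and 6 count as weekend
factor(ifelse(toy_wday %in% c(0, 6), "Weekend", "Weekday"))
```

Testing `wday` against numeric codes avoids the locale dependence of `weekdays()`, which returns day names in the session's language.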
Panel plot containing a time series plot of the 5-minute interval (x-axis) and the
average number of steps taken, averaged across all weekday days or weekend days (y-axis).
```{R}
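# Mean steps per interval, computed separately for weekdays and weekends
fill_in_data2 <- aggregate(steps ~ interval + day, fill_in_data, mean)
library(lattice)
xyplot(steps ~ interval | day, data = fill_in_data2, aspect = 1/2,
       type = "l", xlab = "Interval", ylab = "Average number of steps",
       main = "Average 5-minute intervals: Weekdays vs. Weekends")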
```