-
Notifications
You must be signed in to change notification settings - Fork 1
/
2-DateTime-RegEx.Rmd
199 lines (135 loc) · 9.61 KB
/
2-DateTime-RegEx.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
---
title: "Data Tidying: Dates, Time, and Regular Expressions"
author: "Drs. Sarangan Ravichandran and Randall Johnson"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(knitr)
library(lubridate)
# pull Examples for citing line number:
examples <- readLines('Examples.R')
```
# Table of Contents
1. [Dates and Time](#dateTime)
1.1 [Base R](#dateTimeR)
1.2 [lubridate](#dateTimeLub)
1.3 [Arithmetic](#dateTimeMath)
1.4 [Exercises](#eeDateTime)
2. [Regular Expressions](#regEx)
2.1 [Definition](#regExDefinition)
2.2 [Find and Replace](#findReplace)
2.3 [Exercises](#eeRegEx)
# <a name="dateTime"></a>Dates and Time
## <a name="dateTimeR"></a>Base R
Dates and times are stored in one of three formats in R:
* Date - number of days since January 1, 1970
* POSIXct - number of seconds since January 1, 1970
* POSIXlt - a list containing information about the time (see Table 1)
```{r kable1, echo = FALSE}
tibble(Position = 1:11,
Name = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"),
Value = c("seconds","minutes","hours","day of the month (1-31)", "month of the year (0-11)",
"years since 1900", "day of the week (0: Sunday - 6: Saturday)",
"day of the year (0-365)", "daylight savings (Yes/No)", "time zone", "seconds off of GMT")) %>%
kable(caption = 'Table 1: POSIXlt time format.')
```
If you are interested, there are some examples in [Examples.R](https://github.com/ravichas/TidyingData/blob/master/Examples.R) comparing the format of these two variables, starting on line ```r which(examples == "########## Comparison of POSIXct and POSIXlt formats ##########")```.
You can use the `as.date()` function to create a date variable in R, and you can also use it to convert strings into dates. In either case, you will want to be aware of how the text you give to `as.Date()` is formatted. According to the documentation (see `?as.Date`), the default format is "YYYY-MM-DD" or "YYYY/MM/DD". If neither works and you didn't specify a different format, `as.date()` will return `NA`.
To specify formatting, see the values in Table 2. For example, the default formats `as.Date()` expects would be coded as `%Y-%m-%d` or `%Y/%m/%d`. If you have a date, "May 3, 2018", you would need to specify the format as `%B %d, %Y`
```{r date formatting, echo = FALSE}
tibble(Code = c("%d", "%m","%b","%B","%y","%Y"),
Value = c("Day of the month (integer, 01-31)", "Month (integer, 01-12)", "Month (abbreviation)",
"Month (full name)", "Year (2 digit)", "Year (4 digit)")) %>%
kable(caption = 'Table 2: as.Date format codes. See ?strptime for more codes.')
```
```{r date examples}
as.Date('1920-6-16')
as.Date('2017/03/07')
as.Date("May 3, 2018", format = "%B %d, %Y")
## Get current Date and Time from the system
Sys.Date() # returns a Date object
Sys.time() # returns a POSIXct object
```
Formatting times is a little more complicated. For this task `strftime()` and `strptime()` are our friends.
- `strftime()` returns a character string (or character vector of strings) representing the input.
- `strptime()` returns POSIXlt formatted values represented by the input.
These function both have a default format of `%Y-%m-%d %H:%M:%S` if any element has a time component which is not midnight, and `%Y-%m-%d` otherwise (see `?strptime` for the full list of codes). For example:
```{r str_time}
# get the current system time as a string
Sys.time() %>% strftime()
# convert this string to POSIXlt format (assumes local time zone)
strptime("May 3, 2018 1:45 PM", format = "%B %d, %Y %H:%M %p")
```
## <a name="dateTimeLub"></a>lubridate
The [lubridate](https://github.com/tidyverse/lubridate/blob/master/vignettes/lubridate.Rmd) package, which is part of the tidyverse, has some additional functions that will help us work with dates and time.
- `now()` essentially does the same thing as `Sys.time()` by default, but you can optionally specify a different timezone.
- `second()`, `minute()`, `hour()`, `day()`, `year()`, ... allow you to extract or set (change) the specified part of a POSIXlt formatted variable.
- `seconds()`, `minutes()`, `hours()`, `days()`, `years()`, ... allow you to specify a period of time (e.g. `days(5)` is 5 days).
See line ```r which(examples == "########## lubridate Examples ##########")```of [Examples.R](https://github.com/ravichas/TidyingData/blob/master/Examples.R) for some examples using the *lubridate* package.
Note: there seems to be a bug with R's handling of the default time zone settings on some systems (as of early 2018). If you happen to run into this error,
```
Error in (function (dt, year, month, yday, mday, wday, hour, minute, second, :
Invalid timezone of input vector: ""
```
this bit of code will set things straight (or something similar, e.g. EDT or GMT):
```{r eval = FALSE}
Sys.setenv(TZ="EST")
```
## <a name="dateTimeMath"></a>Arithmetic
> "If anyone drove a time machine, they would crash" - Hadley Wickham, author of the tidyverse
As Hadley Wickham points out [here](https://github.com/tidyverse/lubridate/blob/master/vignettes/lubridate.Rmd#if-anyone-drove-a-time-machine-they-would-crash), arithmetic with dates isn't as straight forward as it could be. Take the following operation for example: `January 31 + one month`. There are at least three possible answers:
* February 31 (doesn't exist)
* March 4 (31 days after January 31st)
* February 28 (or the 29th if it is a leap year)
Ug! To avoid abiguity, lubridate implements addition in a very specific way.
* By default, adding time, increments the specified slot approriately in the data structure. For example, when you add a month, you simply increment the month slot by 1.
* If you add a month to December 23, the month slot goes back to 1 and the year is incrmented by 1.
* If you add a month to January 31, you get `NA`, because February 31 doesn't exist. If you want one of the other two options above, you need to craft your statement a little more carefully (e.g. by adding `days(30)` instead of `months(1)`).
* Alternately, some shorcut functions allow the addition of a fixed period of time.
* For example, `dyears(1)` is 365 days, even on a leap year, while `years(1)` increments the years slot by 1 year, without respect to leap years.
See line ```r which(examples == "########## Date arithmetic ##########")``` of [Examples.R](https://github.com/ravichas/TidyingData/blob/master/Examples.R) for some examples of date arithmetic.
## <a name="eeDateTime"></a>Excercises
Practice exercises for this section can be found in [Exercsies.Rmd](https://github.com/ravichas/TidyingData/blob/master/Exercises.md#timeEx).
# <a name="regEx"></a>Regular Expressions
## <a name="regExDefinition"></a>Define regular expressions
A regular expression (sometimes referred to as a regex) is a string of special charcaters defining search criteria for a pattern search of some text.
- Was originally developed for PERL
- Regular Expressions help us identify patterns in text
- Cross-platform compatible (mostly)
- Speed up calculations
There are three types of regular expression parts:
* Anchors, used to specify a position on the line of text
* Character Sets, used to match characters
* Modifiers, used to specify how many times the previous character should be repeated
| Type | Regex | Meaning |
|-------|--------------|---|
| Anchors | `^` | Beginning of the line |
| | `$` | End of the line |
| Character Sets | `[a-z]` | A lower case character |
| | `[A-Z]` | An upper case character |
| | `[0-9]` | An number character |
| | `.` | Any single character |
| | `[abc]` | Any letter in {a, b, c} |
| | `[^a]` | Any character that is *not* an 'a' |
| | `\e` | Escape |
| | `\f` | Form feed |
| | `\n` | New line |
| | `\r` | Carriage return |
| | `\t` | Tab |
| Modifiers | `?` | 0 or 1 repetitions |
| | `*` | 0 or more repetitions |
| | `+` | 1 or more repetitions |
| | `\{n\}` | exactly *n* repetitions |
| | `\{n,m\}` | at least *n* and not more than *m* repetitions |
As you can see, some characters have special meaning in regular expressions. If you want to search for a '[' or a '.', you will need to escape the special meaning with a `\` (e.g. `\[` or `\.`). Note that `\` is an escape character in R strings, so you will need to enter it as `\\` in your strings (e.g. if you want to include `\[` in your regular expression, you'll need to give R the string: "\\[")
You can also search for multiple patterns at the same time by including the `|` character between regular expressions. For example, if you wanted to search for _vcf_ files, you may want to use this regular expression to find both compressed and uncompressed files with the proper ending: `vcf$|vcf.tar.gz$`.
Check out line ```r which(examples == "########## grep ##########")``` of [Examples.R](https://github.com/ravichas/TidyingData/blob/master/Examples.R) for some examples of using the `grep()` and `grepl()` functions to search for IDs containing a specific pattern.
## <a name="findReplace"></a>Find and replace
Sometimes we want to find and replace specific text in a string. For this, we generally will use the `sub()` or `gsub()` functions.
* `sub()` replaces the first occurance of `pattern` with `replacement`.
* `gsub()` replaces all occurances of `pattern` with `replacement`.
See line ```r which(examples == "########## g/sub ##########")``` of [Examples.R](https://github.com/ravichas/TidyingData/blob/master/Examples.R) for some examples using `sub()` and `gsub()`.
## <a name="eeRegEx"></a>Exercises
Practice exercises for this section can be found in [Exercsies.Rmd](https://github.com/ravichas/TidyingData/blob/master/Exercises.md#regEx).