<!DOCTYPE html>
<html>
<head>
  <title>R Intro Part 4</title>
  <meta charset="utf-8">
  <meta name="description" content="R Intro Part 4">
  <meta name="author" content="Ilan Man">
  <meta name="generator" content="slidify" />
  <meta name="apple-mobile-web-app-capable" content="yes">
  <meta http-equiv="X-UA-Compatible" content="chrome=1">
  <link rel="stylesheet" href="libraries/frameworks/io2012/css/default.css" media="all" >
  <link rel="stylesheet" href="libraries/frameworks/io2012/phone.css" 
    media="only screen and (max-device-width: 480px)" >
  <link rel="stylesheet" href="libraries/frameworks/io2012/css/slidify.css" >
  <link rel="stylesheet" href="libraries/highlighters/highlight.js/css/tomorrow.css" />
  <base target="_blank"> <!-- This amazingness opens all links in a new tab. -->
  <script data-main="libraries/frameworks/io2012/js/slides" 
    src="libraries/frameworks/io2012/js/require-1.0.8.min.js">
  </script>
  
    <link rel="stylesheet" href = "assets/css/ribbons.css">

</head>
<body style="opacity: 0">
  <slides class="layout-widescreen">
    
    <!-- LOGO SLIDE -->
    <!-- END LOGO SLIDE -->
    

    <!-- TITLE SLIDE -->
    <!-- Should I move this to a Local Layout File? -->
    <slide class="title-slide segue nobackground">
      <hgroup class="auto-fadein">
        <h1>R Intro Part IV</h1>
        <h2>Data structures and functions</h2>
        <p>Ilan Man<br/>Strategy Operations @ Squarespace</p>
      </hgroup>
          </slide>

    <!-- SLIDES -->
      <slide class="" id="slide-1" style="background:;">
  <hgroup>
    <h2>Agenda</h2>
  </hgroup>
  <article>
    <ol>
<li>R intro</li>
<li>Data structures</li>
<li>Control structures</li>
<li>Functions</li>
<li>Commonly used built in functions</li>
<li>String manipulation</li>
<li>Miscellaneous Tips and tricks</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-2" style="background:;">
  <hgroup>
    <h2>R Intro</h2>
  </hgroup>
  <article>
    <h1>Background</h1>

<p><space></p>

<ul>
<li>Derivative of S language, developed at Bell Laboratories by John Chambers</li>
<li>R was created by two statisticians at the University of Auckland, New Zealand</li>
<li>R is written in C, Fortran and R</li>
<li>Open source (Revolution Analytics offers commercial software)</li>
<li>Originally command line, but graphical interfaces (including RStudio and Rattle) becoming new norm</li>
<li>Very popular, especially among academics and statisticians</li>
<li>Interpreted language - easier to write code, but slower computations</li>
<li>Packages available to speed up R code - <a href="http://cran.r-project.org/web/packages/Rcpp/index.html"><code>Rcpp</code></a>, <a href="http://cran.r-project.org/web/packages/ff/index.html"><code>ff</code></a>, <a href="http://cran.r-project.org/web/packages/snow/snow.pdf"><code>snow</code></a>, <a href="http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf"><code>parallel</code></a></li>
<li>R holds all data in RAM. Problematic for large data sets</li>
<li>R is excellent for prototyping</li>
<li><code>?help</code> -&gt; use this to get help. <code>?</code> is easily the most useful function in R.</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-3" style="background:;">
  <hgroup>
    <h2>R Intro</h2>
  </hgroup>
  <article>
    <h1>Background</h1>

<p><space></p>

<ul>
<li>Installing packages</li>
</ul>

<pre><code class="r">install.packages(&#39;ggplot2&#39;)    ## do this once only

require(&#39;ggplot2&#39;)             ## do this every time you load up an R session

library()                      ## shows you every package in your standard package location
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-4" style="background:;">
  <hgroup>
    <h2>R Intro</h2>
  </hgroup>
  <article>
    <h1>Styling</h1>

<p><space></p>

<ul>
<li><a href="http://cran.r-project.org/web/packages/rockchalk/vignettes/Rstyle.pdf">CRAN</a> and <a href="http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml">Google</a> style guide</li>
<li><p><a href="https://docs.google.com/document/d/1esDVxyWvH8AsX-VJa-8oqWaHLs4stGlIbk8kLc5VlII/edit">R Coding convention</a> is another resource</p></li>
<li><p>Use <code>&lt;-</code> NOT <code>=</code> for assignment</p></li>
<li><p>Spaces between operators like <code>+</code>, <code>%*%</code>, <code>&lt;</code>, <code>&gt;</code> and after closing brackets <code>)</code>, <code>}</code></p></li>
<li><p>Don&#39;t write functions named <code>rep()</code>, <code>sample()</code>, <code>plot()</code> or any other built-in R names</p></li>
<li><p><code>c</code> should not be used for any variable names</p></li>
<li><p><code>i</code> and <code>j</code> should only be used in loops, conditionals, etc...</p></li>
<li><p>Use camel case for functions: <code>myFirstFunction()</code> is better than <code>my.first.function()</code></p></li>
<li><p>Use <code>&#39;hello&#39;</code> or <code>&quot;hello&quot;</code> for strings, but be consistent.</p></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-5" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <ol>
<li>Vectors</li>
<li>Matrices</li>
<li>Lists</li>
<li>Data.frames</li>
<li>Factors</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-6" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectors</h1>

<p><space></p>

<ul>
<li>Fundamental R data type: Everything is a vector in R (including scalars)</li>
<li>Vector elements must be of the same type, or <code>mode</code> in R. Known as atomic.</li>
<li>Common ways to initialize a vector</li>
</ul>

<pre><code class="r">x &lt;- c(1,2,3,4,5,6,7,8,9,10) ## vector from 1 to 10 - class numeric
x &lt;- 1:10                    ## alternative - class integer
x &lt;- seq(from=1,to=10,by=1)  ## alternative - class numeric

n &lt;- 10
x &lt;- numeric(n)             ## preallocate: provide the size of your object when initializing it
for (i in 1:n) x[i] &lt;- i    ## preferred vs. below

x &lt;- numeric(0)             
for (i in 1:n) x &lt;- c(x,i)  ## as n gets large, this is very slow (compared to the alternatives)
                            ## because x is copied and regrown on every iteration
</code></pre>

<ul>
<li>Six atomic vector types: <code>logical</code>, <code>character</code>, <code>integer</code>, <code>double</code>, <code>complex</code>, <code>raw</code></li>
</ul>
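
<p>A minimal sketch of the six atomic types in practice, using <code>typeof()</code> to inspect each one. The variable name <code>mixed</code> is only illustrative.</p>

<pre><code class="r">typeof(TRUE)          ## &quot;logical&quot;
typeof(&quot;a&quot;)           ## &quot;character&quot;
typeof(1:3)           ## &quot;integer&quot;
typeof(c(1, 2, 3))    ## &quot;double&quot;
typeof(1 + 2i)        ## &quot;complex&quot;
typeof(as.raw(255))   ## &quot;raw&quot;

## mixing types in c() silently coerces everything to the most flexible type
mixed &lt;- c(TRUE, 2L, 3.5)
typeof(mixed)         ## &quot;double&quot;
</code></pre>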

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-7" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectors</h1>

<p><space></p>

<ul>
<li>Vectors obviate need for loops (most of the time!)</li>
</ul>

<pre><code class="r">x &lt;- seq(from = 1, to = 10, by = 1)
y &lt;- 0
for (i in c(1:length(x))) y[i] &lt;- x[i] * 5
print(y)
</code></pre>

<pre><code> [1]  5 10 15 20 25 30 35 40 45 50
</code></pre>

<pre><code class="r">## alternatively....
y &lt;- x * 5
print(y)
</code></pre>

<pre><code> [1]  5 10 15 20 25 30 35 40 45 50
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-8" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vector Indexing</h1>

<p><space></p>

<ul>
<li>Important, but only if you like using vectors. And R.</li>
<li>Indexing begins at 1, not 0.</li>
<li>Can index a vector by name, if elements are named.</li>
</ul>

<pre><code class="r">x &lt;- 1:10
x[ c( 1:5 , 8:10 ) ]         
[1] 1  2  3  4  5  8  9 10

x[ c(TRUE , FALSE) ]         ## recycling - common R feature. R will not give you a warning!
[1] 1 3 5 7 9                ## very useful, but make sure you are comparing vectors of same length

x &gt; 5                        ## Boolean vector. mode = &quot;logical&quot;
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

any(x &gt; 5)              
[1] TRUE
all(x &lt; 8)              
[1] FALSE
</code></pre>
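
<p>Indexing by name might look like the sketch below. The vector <code>scores</code> and its names are made up for illustration.</p>

<pre><code class="r">## hypothetical exam scores, named by student
scores &lt;- c(dave = 82, jenny = 95, scott = 71)

names(scores)                 ## &quot;dave&quot; &quot;jenny&quot; &quot;scott&quot;
scores[&quot;jenny&quot;]               ## one element, selected by name
scores[c(&quot;scott&quot;, &quot;dave&quot;)]    ## several names at once, in the order requested
unname(scores[&quot;jenny&quot;])       ## 95, with the name dropped
</code></pre>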

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-9" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectorized Operations</h1>

<p><space></p>

<ul>
<li>Easiest way to achieve speed in R - apply a function to a vector</li>
</ul>

<pre><code class="r">f &lt;- function(a, b) return(a^b)
f(x, 2)
</code></pre>

<pre><code> [1]   1   4   9  16  25  36  49  64  81 100
</code></pre>

<ul>
<li>Even operators such as <code>+</code>, <code>-</code>, <code>*</code> are functions</li>
</ul>

<pre><code class="r">&quot;*&quot;(x,5)              ## returns 5 * x[1], 5 * x[2], ...
&#39;[&#39;(x, x &gt; 5 )        ## returns vector of values where x[1] &gt; 5, x[2] &gt; 5, ..., x[10] &gt; 5 is TRUE

ifelse(x &lt; 5, x^2, 0) ## if (condition) { do something } else { do something else }
[1]  1  4  9 16  0  0  0  0  0  0
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-10" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectorized Operations</h1>

<p><space></p>

<ul>
<li>When coming from a different language, probably best NOT to translate code verbatim</li>
<li>Loops are your friend in C. In R, loops are like a bad friend - time-consuming at best.</li>
<li>Under the hood, a vectorized operation is running a loop - in C. Much faster than in R.</li>
<li>Vectorization also provides clarity (but don&#39;t get carried away one-lining everything)</li>
</ul>

<pre><code class="r">logsum &lt;- 0
x &lt;- seq(100,1000000,by=10)
for (i in 1:length(x)){
  logsum &lt;- logsum + log(x[i])
}
logsum
[1] 1281524                 ## this calculation takes about 0.17 seconds

# R translation
logsum &lt;- sum(log(x))       ## this calculation takes about 0.002 seconds.
[1] 1281524
</code></pre>
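
<p>One way to check timings like these on your own machine is <code>system.time()</code>. This is only a sketch: the numbers will vary by hardware and R version, and the names <code>loopTime</code>, <code>vecTime</code> and <code>vecsum</code> are illustrative.</p>

<pre><code class="r">x &lt;- seq(100, 1000000, by = 10)

## time the explicit loop
loopTime &lt;- system.time({
  logsum &lt;- 0
  for (i in seq_along(x)) logsum &lt;- logsum + log(x[i])
})

## time the vectorized version
vecTime &lt;- system.time(vecsum &lt;- sum(log(x)))

all.equal(logsum, vecsum)   ## both approaches agree up to floating point error
loopTime[&quot;elapsed&quot;]         ## elapsed seconds for the loop
vecTime[&quot;elapsed&quot;]          ## elapsed seconds for the vectorized call
</code></pre>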

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-11" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectorized Operations</h1>

<p><space></p>

<ul>
<li>Be careful when thinking you are vectorizing</li>
<li>Many R functions take a single vector as their argument</li>
<li><code>sum</code>, <code>max</code>, <code>min</code>, ... are exceptions: they accept any number of arguments</li>
</ul>

<pre><code class="r">mean(1,3,2)
[1] 1                 ## huh??

mean(c(1,3,2))
[1] 2                 ## that&#39;s better

max(1,3,2)
[1] 3
</code></pre>

<ul>
<li>Vectorization might not work when the current iteration depends on the previous (think $ \sum \sum $)</li>
<li>Try to put code outside of loops when possible</li>
<li>Use built-in functions such as <code>rowSums(x)</code> instead of <code>apply(x,1,sum)</code>...more on this later!</li>
</ul>
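
<p>A short sketch of why <code>mean(1,3,2)</code> surprises, and of the <code>rowSums()</code> versus <code>apply()</code> equivalence. The matrix <code>m</code> is made up for illustration.</p>

<pre><code class="r">## mean() takes ONE vector; extra positional arguments are matched to trim and na.rm
names(formals(mean.default))      ## &quot;x&quot; &quot;trim&quot; &quot;na.rm&quot; &quot;...&quot;
mean(c(1, 3, 2))                  ## 2

## rowSums() gives the same totals as apply(m, 1, sum)
m &lt;- matrix(1:6, nrow = 2)
rowSums(m)                        ## 9 12
apply(m, 1, sum)                  ## 9 12
</code></pre>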

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-12" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectors: <code>NA</code> and <code>NULL</code></h1>

<p><space></p>

<ul>
<li><code>NA</code> appears often in messy data, especially when a value doesn&#39;t exist</li>
<li><code>NA</code> propagates: most calculations involving <code>NA</code> return <code>NA</code></li>
<li>If R sees <code>NULL</code>, it skips it. <code>NULL</code> is nonexistent. Yet it exists as a <code>NULL</code>. ?philosophy.</li>
</ul>

<pre><code class="r">x &lt;- c(5, 10, NA, 20, 25)
mean(x)                
[1] NA

is.na(x)                ## commonly used when cleaning data sets
[1] FALSE FALSE  TRUE FALSE FALSE
mean(x,na.rm=TRUE)      ## 15

x &lt;- c(5,10,NULL,20,25)
mean(x)                 ## 15

length(NA)              ## NA is a logical constant of length 1
[1] 1
length(NULL)            ## NULL does not take any value. By definition, it&#39;s undefined
[1] 0                   ## ?philosophy
</code></pre>
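
<p>A few more <code>NA</code> and <code>NULL</code> behaviours that tend to come up when cleaning data, sketched on the same kind of vector as above:</p>

<pre><code class="r">x &lt;- c(5, 10, NA, 20, 25)

NA &gt; 1                 ## NA: a comparison with NA is unknown too
x == NA                ## all NA, so never test for missing values with ==
sum(is.na(x))          ## 1 missing value
x[!is.na(x)]           ## 5 10 20 25
length(c(1, NULL, 2))  ## 2: NULL contributes nothing to the vector
</code></pre>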

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-13" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vector Filtering</h1>

<p><space></p>

<ul>
<li>Extremely useful for quick data analysis. Similar to indexing.</li>
</ul>

<pre><code class="r">x &lt;- 1:10
x[ x &gt; 5 ]          ## What&#39;s happening here?
</code></pre>

<ul>
<li><code>x &gt; 5</code> is a function call to <code>&quot;&gt;&quot;(a,b)</code> which returns <code>TRUE</code> or <code>FALSE</code> on every element of vector <code>x</code>. </li>
<li>Output of <code>x &gt; 5</code> is <code>logical</code> vector. And when used as an index on <code>x</code>...</li>
</ul>

<pre><code class="r">x[c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]
</code></pre>

<pre><code>[1]  6  7  8  9 10
</code></pre>

<p>...returns the elements of <code>x</code> at the positions where the index is <code>TRUE</code></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-14" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vector Filtering</h1>

<p><br>
Common filtering functions include:</p>

<pre><code class="r">subset(x, x &gt; 5)      ## [1]  6  7  8  9 10
which(x &gt; 5)          ## [1]  6  7  8  9 10  (positions, not values; they coincide here because x is 1:10)
4 %in% x              ## [1]  TRUE
</code></pre>
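
<p>With a vector whose values differ from its positions, the difference between these functions is easier to see. The vector <code>v</code> below is made up for illustration.</p>

<pre><code class="r">v &lt;- c(12, 3, 8, 20)     ## hypothetical values

which(v &gt; 5)             ## 1 3 4    positions where the condition holds
v[v &gt; 5]                 ## 12 8 20  values where the condition holds
subset(v, v &gt; 5)         ## 12 8 20  same values
v[which(v &gt; 5)]          ## 12 8 20  positions turned back into values
8 %in% v                 ## TRUE
</code></pre>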

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-15" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Vectors: Summary</h1>

<p><space></p>

<ul>
<li>Everything in R is a vector</li>
<li>All elements are of one type, <code>atomic</code></li>
<li>Vectorize whenever possible</li>
<li>Filtering and indexing are important concepts</li>
<li>Recycling - useful but note that R will not give you an error message</li>
<li><code>seq()</code>, <code>rep()</code>, <code>sample()</code>, <code>runif()</code></li>
<li><code>any()</code>, <code>all()</code>, <code>which()</code>, <code>subset()</code>, <code>%in%</code></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-16" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><br>
Compute the following:</p>

<p>a) \(\large \sum_{i=1}^{500} \ln{(i^{2})} + \frac{2}{i}\)</p>

<p>b) \(\large \frac{1}{n}\sum_{i=1}^{n} (\bar{X} - X_{i})^{2}\), where X ~ Normal(5,100) and n = 1000
<br><br>
Hint:</p>

<pre><code class="r">?rnorm
</code></pre>

<p>c) \(\large \frac{1}{n}\sum_{i=1}^{n} (\bar{X} - X_{i})(\bar{Y} - Y_{i})\), where X ~ Poisson with lambda of 2, Y ~ Exponential with a rate of 1, and n = 1000</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-17" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Matrices</h1>

<p><space></p>

<ul>
<li>Like vectors with two additional attributes: rows and columns</li>
<li>Column-major order: insert values in first column, going down, then continuing to second column, going down, and so on</li>
</ul>

<pre><code class="r">x &lt;- matrix(seq(1, 6, by = 1), nrow = 3, ncol = 2)  ## 3 by 2 matrix
print(x)
</code></pre>

<pre><code>     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
</code></pre>

<pre><code class="r">x &lt;- matrix( seq(1,6,by=1), nrow=3)                ## same as above
x &lt;- matrix( seq(1,6,by=1), nrow=3, byrow=TRUE)    ## row-major order    
x &lt;- matrix( seq(1,6,by=1), nrow=4)                ## is this ok?
x &lt;- matrix( seq(1,6,by=1), nrow=3, ncol=3)        ## is this ok?
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-18" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Matrix operations</h1>

<p><space></p>

<pre><code class="r">x &lt;- matrix(seq(1,9),nrow=3,ncol=3)
x + 5
x * 2
t(x)                        ## transpose
x %*% x                     ## matrix multiplication
crossprod(x,x)              ## t(x) %*% x (matrix cross product)
x * x                       ## element-wise product
diag(x)                     ## diagonal elements of x; diag(3) gives the 3 by 3 identity matrix
det(x)                      ## determinant
eigen(x)                    ## list of eigenvalues and eigenvectors
</code></pre>

<ul>
<li>Remember your linear algebra!</li>
</ul>
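
<p>A quick sketch for checking what some of these operations actually compute, using the same 3 by 3 matrix:</p>

<pre><code class="r">x &lt;- matrix(1:9, nrow = 3)

all.equal(crossprod(x), t(x) %*% x)   ## TRUE: crossprod(x) is t(x) %*% x
diag(x)                               ## 1 5 9: the diagonal of x
diag(3)                               ## 3 by 3 identity matrix
dim(x %*% x)                          ## 3 3: matrix product of two 3 by 3 matrices
</code></pre>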

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-19" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Matrix indexing and filtering</h1>

<p><space></p>

<pre><code class="r">x[2,1]            ## second row, first column

x[,1]             ## all rows, first column. Vector form, not matrix.

x[,]              ## all rows, all columns. Same as print(x), or just x. 

x[-1,]            ## remove first row. Negative indexing.
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-20" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Matrix class</h1>

<p><space></p>

<pre><code class="r">x &lt;- matrix( c(1:9), nrow=3, ncol=3)
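class(x)                      ## &quot;matrix&quot; &quot;array&quot; (just &quot;matrix&quot; before R 4.0)

y &lt;- x[1,]                    ## 3 element vector
class(y)                      ## integer
attributes(y)                 ## returns NULL

y &lt;- x[1,, drop=FALSE]        ## drop=FALSE keeps the matrix structure
class(y)                      ## &quot;matrix&quot; &quot;array&quot;
attributes(y)                 ## $dim is 1 3, i.e. a 1 by 3 matrix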

colnames(x) &lt;- c( &#39;first col&#39; , &#39;second col&#39; , &#39;third col&#39; )
rownames(x) &lt;- c( &#39;row 1&#39; , &#39;row 2&#39; , &#39;row 3&#39; )
</code></pre>

<ul>
<li>Higher-dimensional structures are also possible: <code>arrays</code></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-21" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><space></p>

<pre><code class="r">a) 
x &lt;- matrix(rep(c(1,3,-1,2),5),ncol=4)
(i)  What is returned by the following? Do it by hand before typing it in.
     mean(x[ x[1,] &gt; 1,  c(1:2) ])
(ii) Find the column in x which has the largest total.

b) 
y &lt;- matrix(c(c(1,2,4,8),c(2,3,-1,-7),c(0,5,12,-4),c(3,4,5,0)),ncol=4)
(i)  Calculate the trace of y.
(ii) Replace each element of the 3rd column with the median of the elements of the first, second and 
fourth columns for the same row. 
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-22" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Lists</h1>

<p><space></p>

<ul>
<li>Combine objects of different types. Can have different <code>modes</code>.</li>
<li>Forms basis for <code>data.frames</code></li>
<li>Vectors, matrices cannot be broken down into smaller components, hence atomic.</li>
<li>Lists can be broken down - known as recursive vectors.</li>
</ul>

<pre><code class="r">x &lt;- list(title = &quot;R presentation&quot;, date = format(as.POSIXlt(Sys.time(), &quot;America/New_York&quot;), 
    &quot;%m %d %Y&quot;), num_attendees = 10)
</code></pre>

<pre><code class="r">print(x)
</code></pre>

<pre><code>$title
[1] &quot;R presentation&quot;

$date
[1] &quot;07 09 2014&quot;

$num_attendees
[1] 10
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-23" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Lists</h1>

<p><space></p>

<ul>
<li>Accessing <code>list</code> components</li>
</ul>

<pre><code class="r">## one bracket - [ - returns a list type
x[1]
</code></pre>

<pre><code>$title
[1] &quot;R presentation&quot;
</code></pre>

<pre><code class="r">## two brackets  -  [[  -  returns the actual element, in this case a character
x[[1]]
x$title
x[[&#39;title&#39;]]

[1] &quot;R presentation&quot;
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-24" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Lists</h1>

<p><space></p>

<ul>
<li>Accessing <code>list</code> components and values</li>
</ul>

<pre><code class="r">names(x)      
[1] &quot;title&quot;         &quot;date&quot;          &quot;num_attendees&quot;

unlist(x)      ## flattens the list into a character vector

pres_1 &lt;- format(Sys.Date(), &quot;%m %d %Y&quot;)
pres_2 &lt;- format(Sys.Date() + 30, &quot;%m %d %Y&quot;)

x &lt;- list(title=&#39;1st R presentation&#39;, date=pres_1, num_attendees=10)
y &lt;- list(title=&#39;2nd R presentation&#39;, date=pres_2, num_attendees=20)

z &lt;- list(x,y)       ## list of lists
## z[[1]] is equivalent to x; z[[1]][1] is a one-element list holding only x&#39;s title
</code></pre>
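
<p>A self-contained sketch of digging into a list of lists. The dates are left out to keep it short, and the anonymous function passed to <code>sapply()</code> is just one possible way to pull the same field out of every element.</p>

<pre><code class="r">x &lt;- list(title = &quot;1st R presentation&quot;, num_attendees = 10)
y &lt;- list(title = &quot;2nd R presentation&quot;, num_attendees = 20)
z &lt;- list(x, y)

identical(z[[1]], x)                      ## TRUE: [[ returns the element itself
z[[2]]$title                              ## &quot;2nd R presentation&quot;
class(z[[1]][1])                          ## &quot;list&quot;: [ keeps the list wrapper
sapply(z, function(p) p$num_attendees)    ## 10 20
</code></pre>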

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-25" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Lists</h1>

<p><space></p>

<ul>
<li>Most statistical operations in R return a <code>list</code></li>
<li>Knowing how to manipulate lists is important</li>
</ul>

<pre><code class="r">n &lt;- 100
x &lt;- rnorm(n, mean = 0, sd = 1)          ## sample of 100 random standard normal variables
y &lt;- 1 - 2 * x + rnorm(n)
f &lt;- y ~ x                               ## y ~ x is a formula object
r &lt;- lm(f)                               ## r is linear model object, i.e. linear regression

## the function str() - &quot;structure&quot; - is VERY useful in exploratory data analysis
## structure of r is a bunch of lists
str(r)

r$coefficients                           ## r$coeff also works, via partial matching of names
r$residuals
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-26" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Data Frames</h1>

<p><space></p>

<ul>
<li>The most useful object in R for data analysis</li>
<li>Like a list of vectors (the columns) of equal length, displayed as a table</li>
<li>Many R functions and packages assume input is in the form of a <code>data.frame</code></li>
<li><p>Every CSV or text file you read in with <code>read.csv()</code> or <code>read.table()</code> becomes a <code>data.frame</code>, i.e. most real data comes in the form of a <code>data.frame</code></p></li>
<li><p>Creating Data Frames</p></li>
</ul>

<pre><code class="r">z &lt;- data.frame()  ## data frame with 0 columns and 0 rows
y &lt;- data.frame(col1 = c(1, 2), col2 = c(&quot;a&quot;, &quot;b&quot;), row.names = c(&quot;row1&quot;, &quot;row2&quot;))
print(y)
</code></pre>

<pre><code>     col1 col2
row1    1    a
row2    2    b
</code></pre>
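
<p>Because a <code>data.frame</code> is built on a list, the usual list tools apply to it. A brief sketch on a data frame like the one above:</p>

<pre><code class="r">y &lt;- data.frame(col1 = c(1, 2), col2 = c(&quot;a&quot;, &quot;b&quot;))

is.list(y)        ## TRUE: a data.frame is a list of columns
length(y)         ## 2: the number of columns, not rows
names(y)          ## &quot;col1&quot; &quot;col2&quot;
y[[&quot;col1&quot;]]       ## 1 2: [[ extracts a column as a vector
nrow(y)           ## 2
</code></pre>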

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-27" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Data Frames</h1>

<p><space></p>

<pre><code class="r">x &lt;- data.frame(matrix( sample(c(50:100), size=12, replace=TRUE), nrow=6, ncol=2))

## return first column
x[,1]                 ## type is vector
x$X1                  ## type is vector
x[1]                  ## type data.frame. 
x[&#39;X1&#39;]               ## type data.frame
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-28" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Data Frames</h1>

<p><space></p>

<ul>
<li>Helpful <code>data.frame</code> functions</li>
</ul>

<pre><code class="r">x &lt;- x[-6,]                   ## remove rows or columns with a &quot;-&quot; sign. Like negative indexing.
y &lt;- data.frame(names = c(&quot;dave&quot;,&quot;jenny&quot;,&quot;scott&quot;,&quot;mary&quot;,&quot;harry&quot;) )
z &lt;- cbind(y, x)               ## column bind. Can be used on matrices too.
                               ## if you cbind two vectors you get a matrix, NOT data.frame

## alternatively you can create columns implicitly
x$names &lt;- c(&quot;dave&quot; ,&quot;jenny&quot; ,&quot;scott&quot; ,&quot;mary&quot; ,&quot;harry&quot;)

w &lt;- data.frame(names=&quot;megan&quot;, X1=82, X2=85)
z &lt;- rbind(z, w)               ## row bind

## make sure number of elements in row, column are consistent
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-29" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Data Frames</h1>

<p><space></p>

<pre><code class="r">## explicitly set columns names for z. Use rownames() for row names. Shocker.
names(z) &lt;- c(&quot;names&quot;, &quot;Exam 1&quot;,&quot;Exam 2&quot;)

## get dimensions
dim(z)                        
[1] 6 3

head(z)           ## default to first 6 rows
tail(z)           ## default to last 6 rows
</code></pre>

<ul>
<li>While very useful, <code>data.frames</code> are more memory intensive than <code>matrices</code></li>
<li>When initializing, if possible, preallocate <code>data.frame</code>, i.e. set size of <code>data.frame</code> before using it</li>
<li>Whenever possible, use <code>matrices</code></li>
</ul>
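
<p>Preallocating a <code>data.frame</code> could look like the sketch below. The column names <code>id</code> and <code>score</code> and the size <code>n</code> are assumptions for illustration only.</p>

<pre><code class="r">n &lt;- 5

## create all n rows up front instead of growing the data.frame with rbind()
results &lt;- data.frame(id = integer(n), score = numeric(n))

for (i in seq_len(n)) {
  results$id[i]    &lt;- i
  results$score[i] &lt;- i^2
}

dim(results)      ## 5 2
results$score     ## 1 4 9 16 25
</code></pre>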

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-30" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Factors</h1>

<p><space></p>

<ul>
<li>Comes from the notion of categorical variables in statistics</li>
<li>Can be thought of as a <code>vector</code> with additional information - categories, or <code>levels</code></li>
<li>Used to split up data sets; commonly seen as columns of <code>data.frame</code>s</li>
</ul>

<pre><code class="r">x &lt;- factor(c(&quot;finance&quot;, &quot;tech&quot;, &quot;tech&quot;, &quot;auto&quot;, &quot;finance&quot;, &quot;energy&quot;, &quot;tech&quot;))
print(x)
</code></pre>

<pre><code>[1] finance tech    tech    auto    finance energy  tech   
Levels: auto energy finance tech
</code></pre>

<pre><code class="r">y &lt;- factor(x, levels = c(levels(x), &quot;tv&quot;))  ## include new level, even though no tv data exists
print(y)
</code></pre>

<pre><code>[1] finance tech    tech    auto    finance energy  tech   
Levels: auto energy finance tech tv
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-31" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Factors</h1>

<p><space></p>

<ul>
<li>Use <code>levels</code> to order your levels. Helpful when sorting factors</li>
</ul>

<pre><code class="r">wday &lt;- c(&quot;mon&quot;, &quot;tues&quot;, &quot;mon&quot;, &quot;wed&quot;, &quot;fri&quot;, &quot;wed&quot;)
wdayf &lt;- factor(wday)
sort(wdayf)  ## did this do what we expected?
</code></pre>

<pre><code>## [1] fri  mon  mon  tues wed  wed 
## Levels: fri mon tues wed
</code></pre>

<pre><code class="r">wdayf &lt;- factor(wday, levels = c(&quot;mon&quot;, &quot;tues&quot;, &quot;wed&quot;, &quot;thurs&quot;, &quot;fri&quot;))  ## let&#39;s add Thursday as well
sort(wdayf)
</code></pre>

<pre><code>## [1] mon  mon  tues wed  wed  fri 
## Levels: mon tues wed thurs fri
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-32" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Factors</h1>

<p><space></p>

<ul>
<li>Common <code>factor</code> functions</li>
</ul>

<pre><code class="r">z$names2 &lt;- NULL                            ## assigning NULL removes a column from a data.frame (or an element from a list)
z$gender &lt;- c(&quot;m&quot;,&quot;f&quot;,&quot;m&quot;,&quot;f&quot;,&quot;m&quot;,&quot;f&quot;)
z$party &lt;- c(&quot;D&quot;,&quot;D&quot;,&quot;R&quot;,&quot;R&quot;,&quot;D&quot;,&quot;D&quot;)
</code></pre>

<pre><code class="r">tbl &lt;- table(z$gender,z$party)              ## contingency table. class &quot;table&quot;
addmargins(tbl)           ## marginal sums

##      D R Sum
##  f   2 1   3
##  m   2 1   3
##  Sum 4 2   6
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-33" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Factors</h1>

<p><space></p>

<ul>
<li>Converting between factors and other types</li>
</ul>

<pre><code class="r">x &lt;- seq(5,20,by=5)
f &lt;- factor(x)

print(f)
[1] 5  10 15 20
Levels: 5 10 15 20

as.numeric(f)                    ## huh??
[1] 1 2 3 4

as.numeric(as.character(f))
[1]  5 10 15 20                  ## much better

as.numeric(levels(f))[f]         ## more efficient: each level is converted only once, then indexed by f
</code></pre>
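
<p>The pitfall is easier to see when values repeat and are not already sorted. A small sketch with a made-up factor:</p>

<pre><code class="r">f &lt;- factor(c(20, 5, 20, 10))

levels(f)                      ## &quot;5&quot; &quot;10&quot; &quot;20&quot;
as.numeric(f)                  ## 3 1 3 2: the internal level codes, not the values
as.numeric(as.character(f))    ## 20 5 20 10
as.numeric(levels(f))[f]       ## 20 5 20 10
</code></pre>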

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-34" style="background:;">
  <hgroup>
    <h2>Data structures</h2>
  </hgroup>
  <article>
    <h1>Summary</h1>

<p><space></p>

<ul>
<li><code>Vectors</code> - lifeblood of R</li>
<li><code>Matrices</code> - great for linear algebra and stats functions</li>
<li><code>Lists</code> - store and access elements of complex objects</li>
<li><code>Data.frames</code> - data analysis object of choice</li>
<li><code>Factors</code> - good for statistics and categorization of data into groups</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-35" style="background:;">
  <hgroup>
    <h2>Control Structures</h2>
  </hgroup>
  <article>
    <ol>
<li><code>for()</code></li>
<li><code>while()</code></li>
<li><code>repeat</code></li>
<li><code>try()</code></li>
<li><code>if()</code></li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-36" style="background:;">
  <hgroup>
    <h2>Control Structures</h2>
  </hgroup>
  <article>
    <h1><code>for()</code></h1>

<p><space></p>

<pre><code class="r">x &lt;- seq(0, 20, by=1)           ## default increment is 1 

for (i in c(1:length(x))){
  x[i] &lt;- x[i] * 2
}

## can be written on one line - but careful to not make it too messy
for (i in c(1:length(x))) x[i] &lt;- x[i] * 2
</code></pre>

<h1><code>while()</code></h1>

<p><space></p>

<pre><code class="r">i &lt;- 1
while (i &lt;= 21) {
  x[i] &lt;- x[i] * 2
  i &lt;- i + 1
}
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-37" style="background:;">
  <hgroup>
    <h2>Control Structures</h2>
  </hgroup>
  <article>
    <h1><code>repeat</code></h1>

<p><space></p>

<pre><code class="r">x &lt;- seq(0,20,by=1) 

i &lt;- 1
repeat {
  x[i] &lt;- x[i] * 2
  i &lt;- i + 1
  if (i &gt; 21) break
}
</code></pre>

<h1><code>try()</code></h1>

<p><space></p>

<pre><code class="r">try(&quot;hello&quot; + 1, silent = FALSE)
try(&quot;hello&quot; + 1, silent = TRUE)
tryCatch(&quot;hello&quot; + 1, error = function(e) print(&quot;don&#39;t be ridiculous&quot;))
</code></pre>

<pre><code>## [1] &quot;don&#39;t be ridiculous&quot;
</code></pre>
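
<p><code>tryCatch()</code> can also handle warnings and return a fallback value. The helper <code>safeLog()</code> below is a hypothetical example, not a built-in function.</p>

<pre><code class="r">## hypothetical helper: return NA instead of failing or warning
safeLog &lt;- function(v) {
  tryCatch(log(v),
           warning = function(w) NA_real_,   ## e.g. log(-1) warns that NaNs were produced
           error   = function(e) NA_real_)   ## e.g. log(&quot;a&quot;) is an error
}

safeLog(10)     ## same as log(10)
safeLog(-1)     ## NA
safeLog(&quot;a&quot;)    ## NA
</code></pre>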

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-38" style="background:;">
  <hgroup>
    <h2>Control Structures</h2>
  </hgroup>
  <article>
    <h1><code>if()</code></h1>

<p><space></p>

<pre><code class="r">if (a == b) {
  # do something
} else {                  ## the else statement MUST be on the same line as the 
  # do something else     ## closing bracket of the if()
}

if (a == b) {
  # do something
} 
else {                    ## WRONG. returns lots of headaches. 
  # do something else     
}

if (a == b) do something  ## one-liners

ifelse (a == b, x, y)     ## use ifelse() on vectors
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-39" style="background:;">
  <hgroup>
    <h2>Control Structures</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><space></p>

<pre><code class="r">(a) Write a loop to scan through an integer vector and return the index of the 
largest value. The loop should terminate as soon as the index is found. Ignore ties.

(b) Redo the above using built-in R functions such as rank(), sort() and order().
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-40" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <ol>
<li>Functions</li>
<li>Arguments</li>
<li>Environment</li>
<li>Pointers</li>
<li>Generic</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-41" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Functions</h1>

<p><space></p>

<ul>
<li>Write functions - it&#39;s good practice</li>
<li>Each function should perform one specific task, with easily understood inputs and outputs</li>
<li><code>function()</code> is a built-in R function whose job is to create functions...#mindblown</li>
</ul>

<pre><code class="r">exponentiate &lt;- function(x, y) {
    return(x^y)
}

exponentiate(2, 4)
</code></pre>

<pre><code>## [1] 16
</code></pre>

<ul>
<li>The right-hand side of the assignment to <code>exponentiate</code> has two parts: the formal arguments (<code>x</code>, <code>y</code>) and the body</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-42" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Arguments</h1>

<p><space></p>

<pre><code class="r">formals(exponentiate)        # $x   $y   These are the arguments to exponentiate()
body(exponentiate)           # { return (x^y) }
</code></pre>

<pre><code class="r">exponentiate  # prints out the entire function - good if you forget what&#39;s in it!
</code></pre>

<pre><code>## function(x, y) {
##     return(x^y)
## }
## &lt;environment: 0x0000000009b68db0&gt;
</code></pre>

<ul>
<li>Try it out on any built-in R function to see its innards</li>
<li>Note that it won&#39;t show much for some functions: primitives written in C (e.g. <code>sum()</code>) print no R code, and generics (e.g. <code>mean()</code>) print only a <code>UseMethod()</code> call</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-43" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Arguments</h1>

<p><space></p>

<ul>
<li>Arguments can have default values.</li>
</ul>

<pre><code class="r">f &lt;- function(x, y=3) { ... }

## Some functions have tons of parameters. Use ... to pass any extras straight through.
f &lt;- function(x, ...) {
  plot(x, ...)
}
</code></pre>

<ul>
<li>Anonymous functions. Single use. No name. No feelings exchanged.</li>
</ul>

<pre><code class="r">sapply (x, function(x) x*2)
</code></pre>
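
<ul>
<li>A runnable sketch of default values and <code>...</code>. The names <code>power()</code> and <code>avg()</code> are made up for illustration.</li>
</ul>

<pre><code class="r">power &lt;- function(x, y = 2) x^y        ## y defaults to 2
power(3)                               ## 9
power(3, y = 3)                        ## 27

avg &lt;- function(x, ...) mean(x, ...)   ## ... forwards na.rm on to mean()
avg(c(1, NA, 3), na.rm = TRUE)         ## 2
</code></pre>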

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-44" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Environment and Scope</h1>

<p><space></p>

<ul>
<li>A function consists of its arguments, body and environment</li>
</ul>

<pre><code class="r">d &lt;- 8
f &lt;- function(y){
  x &lt;- 3 * y
  h &lt;- function (){
    return(y*(x+d))
  }
  return(x+h())
}
f(2)   ## 34: x = 6, h() = 2 * (6 + 8) = 28, and 6 + 28 = 34

# d is global to f()
# x is local to f() and global to h()
# h cannot be called at the &quot;top level&quot; since it exists only inside f()
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-45" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><br>
Write a function that finds the maximum value in corresponding indices for two vectors. For example:</p>

<pre><code class="r">x &lt;- c(1,2,3,4)
y &lt;- c(0,3,5,4)
## output should be 
[1] 1 3 5 4
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-46" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Environment and scope</h1>

<p><space></p>

<pre><code class="r">ls()           # returns all the variables in the environment
               # good to know when you&#39;ve created a ton and are starting to lose track
rm(x)          # rm(x) removes x...rm(list=ls()) is usually not a great idea!
</code></pre>

<ul>
<li>You can even make custom operators!</li>
</ul>

<pre><code class="r">&quot;%powerUp%&quot; &lt;- function(a, b) return(a^b)

3 %powerUp% 2
</code></pre>

<pre><code>## [1] 9
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-47" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Pointers</h1>

<p><space></p>

<ul>
<li>R does not have pointers or references like Python: functions return new values instead of changing objects in place</li>
</ul>

<pre><code class="r">## in Python
&gt;&gt;&gt; x = c(5,2,8)
&gt;&gt;&gt; x.sort()             ## this doesn&#39;t exist in R
&gt;&gt;&gt; x
[2, 5 , 8]

## in R
x &lt;- c(5,2,8)
sort(x)
[1] 2 5 8
x
[1] 5 2 8
x &lt;- sort(x)
x
[1] 2 5 8
</code></pre>
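
<ul>
<li>The same idea applies to your own functions. A minimal sketch, where <code>double_first()</code> is a hypothetical helper: the function changes its own copy, so the caller must reassign the result.</li>
</ul>

<pre><code class="r">double_first &lt;- function(v) {   ## made-up helper for illustration
  v[1] &lt;- v[1] * 2              ## changes the local copy only
  v
}
x &lt;- c(5, 2, 8)
double_first(x)                 ## 10 2 8
x                               ## 5 2 8, unchanged
x &lt;- double_first(x)            ## reassign to keep the change
</code></pre>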

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-48" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Generic functions</h1>

<p><space></p>

<ul>
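<li>R is a dialect of the S language; S3 and S4 are its object systems (S4 is the newer one)</li>
<li>S3 has generic functions, such as <code>print()</code>, <code>plot()</code>, <code>summary()</code>, which pick a method based on the class of their argument. Concept of OOP.</li>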
</ul>

<pre><code class="r">data(cars)              ## load built in dataset
fit &lt;- lm(dist ~ speed, data=cars)
summary(fit)

## same function call, on a different object type
summary(c(1,2,3))

## lists out all the methods for the summary function
methods(summary)  
</code></pre>

<ul>
<li>There is a <code>summary.lm()</code> method and a <code>summary.default()</code> method</li>
<li>What gets printed is a formatted view, not the object itself, so extracting values takes an extra step (try <code>str()</code> or <code>names()</code> on the result)</li>
</ul>
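
<ul>
<li>A minimal sketch of writing your own S3 method. The class name <code>pet</code> and the object <code>p</code> are made up for illustration.</li>
</ul>

<pre><code class="r">p &lt;- structure(list(name = &quot;rex&quot;), class = &quot;pet&quot;)   ## &quot;pet&quot; is a made-up class
summary.pet &lt;- function(object, ...) paste(&quot;A pet named&quot;, object$name)
summary(p)                 ## dispatches to summary.pet()
</code></pre>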

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-49" style="background:;">
  <hgroup>
    <h2>Functions</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><space>
<br>
a) Get the Adjusted R-squared from the regression of distance on speed in the cars dataset
<br>
<br>
b) Get the t-value of the X variable (i.e. speed)
<br>
<br>
c) Predict the braking distance if going 200 miles per hour</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-50" style="background:;">
  <hgroup>
    <h2>Commonly used built-in functions</h2>
  </hgroup>
  <article>
    <ol>
<li><code>apply()</code> </li>
<li><code>lapply()</code>, <code>tapply()</code>, <code>sapply()</code></li>
<li><code>mapply()</code></li>
<li><code>by()</code>, <code>cut()</code>, <code>aggregate()</code>, <code>split()</code></li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-51" style="background:;">
  <hgroup>
    <h2>Commonly used built-in functions</h2>
  </hgroup>
  <article>
    <h1><code>apply()</code></h1>

<p><space></p>

<ul>
<li>Apply a function to the rows or columns of a matrix</li>
</ul>

<pre><code class="r">x &lt;- matrix(sample(c(0:100),20,replace=TRUE),nrow=5,ncol=4)
apply(x,1,sum)                                   ## sum rows
apply(x,1,function(x) x^2)                       ## apply function to every element
                                                 ## What type of function is this?
</code></pre>

<ul>
<li>Easy to read</li>
<li>Fewer lines of code</li>
<li>NOT necessarily faster than for loops. Loops are built into these functions.</li>
</ul>
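
<ul>
<li>A small sketch of the second argument (1 for rows, 2 for columns), using a made-up matrix <code>m</code>.</li>
</ul>

<pre><code class="r">m &lt;- matrix(1:6, nrow = 2)   ## 2 rows, 3 columns (example data)
apply(m, 1, sum)             ## row sums: 9 12
apply(m, 2, sum)             ## column sums: 3 7 11
rowSums(m); colSums(m)       ## built-in shortcuts for the same sums
</code></pre>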

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-52" style="background:;">
  <hgroup>
    <h2>Commonly used built-in functions</h2>
  </hgroup>
  <article>
    <h1><code>lapply()</code>, <code>tapply()</code>, <code>sapply()</code></h1>

<p><space></p>

<ul>
<li>Like <code>apply()</code>, but for other data structures</li>
</ul>

<pre><code class="r">lapply(list(z$&#39;Exam 1&#39;,z$&#39;Exam 2&#39;),mean)    ## mean of Exam scores; returns a list
                                            ## like apply() but can be used on data.frames
sapply(list(z$&#39;Exam 1&#39;,z$&#39;Exam 2&#39;),mean)    ## mean of Exam scores; returns a vector
sapply(list(z$&#39;Exam 1&#39;,z$&#39;Exam 2&#39;), mean, simplify=FALSE)   ## same as lapply()
</code></pre>

<pre><code class="r">## find mean exam 1 scores, split by party
tapply(z$&#39;Exam 1&#39;, z$party, FUN = mean, simplify=TRUE)      ## simplify determines output type

##     D      R 
## 82.25  84.50 
</code></pre>
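
<ul>
<li>A self-contained sketch of the three functions. The data frame <code>scores</code> and its values are made up for illustration and are not the <code>z</code> used above.</li>
</ul>

<pre><code class="r">scores &lt;- data.frame(exam1 = c(80, 90, 70, 64),       ## hypothetical data
                     exam2 = c(85, 95, 75, 65),
                     party = c(&quot;D&quot;, &quot;R&quot;, &quot;D&quot;, &quot;R&quot;))
sapply(scores[, c(&quot;exam1&quot;, &quot;exam2&quot;)], mean)   ## vector: exam1 76, exam2 80
lapply(scores[, c(&quot;exam1&quot;, &quot;exam2&quot;)], mean)   ## same values, as a list
tapply(scores$exam1, scores$party, mean)      ## D 75, R 77
</code></pre>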

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-53" style="background:;">
  <hgroup>
    <h2>Commonly used built-in functions</h2>
  </hgroup>
  <article>
    <h1><code>mapply()</code></h1>

<p><space></p>

<ul>
<li>Apply a function to corresponding elements of several vectors or lists</li>
</ul>

<pre><code class="r">a&lt;-c(1:5)
b&lt;-c(6:10)
d&lt;-c(11:15)

mapply(sum,a,b,d)
sum(a[1],b[1],d[1])
sum(a[2],b[2],d[2])

mapply(mean,a,b,d)            ## What&#39;s happening here?
</code></pre>
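
<ul>
<li>Two more small sketches: <code>mapply()</code> walks its inputs in parallel and passes one element from each to the function on every step.</li>
</ul>

<pre><code class="r">mapply(function(x, y) x * y, 1:3, 4:6)   ## 1*4, 2*5, 3*6 gives 4 10 18
mapply(rep, 1:3, 3:1)                    ## a list: 1 1 1, then 2 2, then 3
</code></pre>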

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-54" style="background:;">
  <hgroup>
    <h2>Commonly used built-in functions</h2>
  </hgroup>
  <article>
    <h1><code>by()</code>, <code>cut()</code>, <code>aggregate()</code>, <code>split()</code></h1>

<p><space></p>

<ul>
<li>Like <code>apply()</code>, but used on <code>data.frames</code></li>
<li><a href="http://www.jstatsoft.org/v40/i01/paper">Split, apply, combine</a> is an important concept in data analysis</li>
<li>Package <a href="http://cran.r-project.org/web/packages/plyr/index.html"><code>plyr</code></a> is very popular and useful, but important to learn Base R first</li>
</ul>

<pre><code class="r">aggregate(z[,c(2:3)], by=list(z$party), mean)          ## mean Exam score by party
aggregate(z[,c(2:3)], by=list(z$party, z$gender), sum) ## sum by party and gender

## same as tapply() but for data.frames (instead of arrays)
## returns class &quot;by&quot;
by(z$&#39;Exam 1&#39;,z$party,sum)

## convert numeric column of data.frame into factor
## great for binning data
z$age &lt;- c(21,29,38,41,26,50)
cut(z$age,breaks=c(20,30,40,50))

split(z,f = z$gender)                       ## split a dataframe according to a factor
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-55" style="background:;">
  <hgroup>
    <h2>Commonly used built-in functions</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><space></p>

<p>a)  Create a column for the average grade for each student. Label it:</p>

<ul>
<li>D is between 50 and 59</li>
<li>C is between 60 and 69</li>
<li>B is between 70 and 79</li>
<li>A is between 80 and 100</li>
</ul>

<p>b)</p>

<ul>
<li>The function <code>system.time()</code> returns timings for R operations. Examine the help documentation for this function.</li>
<li>Initialize a 100 x 100 matrix with random uniform values between 20 and 50 in each cell. Compute the median standard deviation of each column using:

<ul>
<li>A <code>for()</code> loop</li>
<li>An <code>apply()</code> function<br></li>
</ul></li>
<li>Which is faster?</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-56" style="background:;">
  <hgroup>
    <h2>String Manipulation</h2>
  </hgroup>
  <article>
    <h1>Basic string functions</h1>

<p><space></p>

<pre><code class="r">example &lt;- c(&quot;THIS IS AN EXAMPLE&quot;,&quot;and so is this&quot;,&quot;this is not&quot;,&quot;hello world&quot;,&quot;extra&quot;)
grep(&quot;an&quot;, example)                        ## return index of occurrence of &quot;an&quot;
[1] 2
grep(&quot;an&quot;, example, ignore.case=TRUE)
[1] 1 2
grep(&quot;an&quot;, example, ignore.case=TRUE, value=TRUE)  
[1] &quot;THIS IS AN EXAMPLE&quot; &quot;and so is this&quot; 

nchar(example)
[1] 18 14 11 11  5

paste(example[1],example[2])
[1] &quot;THIS IS AN EXAMPLE and so is this&quot;
files &lt;- c(&quot;ex1&quot;,&quot;ex2&quot;)
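for (i in files){
   filename &lt;- paste(&quot;Title&quot;, i, &quot;.pdf&quot;, sep = &quot;&quot;)   ## sep = &quot;&quot; avoids stray spaces
   print(filename)                                 ## e.g. pass to pdf(file = filename)
}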
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-57" style="background:;">
  <hgroup>
    <h2>String Manipulation</h2>
  </hgroup>
  <article>
    <h1>Basic string functions</h1>

<p><space></p>

<pre><code class="r"># formatting strings
sprintf(&quot;%f&quot;,exp(1))
[1] &quot;2.718282&quot;
sprintf(&quot;%0.2f&quot;,exp(1))
[1] &quot;2.72&quot;
sprintf(&quot;Today&#39;s date is %s&quot;,format(Sys.Date(),&quot;%d %b %Y&quot;))
[1] &quot;Today&#39;s date is 31 Oct 2013&quot;
</code></pre>

<pre><code class="r">example2 &lt;- &quot;Substring takes a subset of...the string!...It&#39;s nuts!&quot;
paste(substr(example2, 19, 21), substr(example2, 22, 24), sep = &quot;&quot;)
</code></pre>

<pre><code>## [1] &quot;subset&quot;
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-58" style="background:;">
  <hgroup>
    <h2>String Manipulation</h2>
  </hgroup>
  <article>
    <h1>Basic string functions</h1>

<p><space></p>

<pre><code class="r">sp &lt;- strsplit(example2,split=&quot;of&quot;)
sp
[[1]]
[1] &quot;Substring takes a subset &quot;  &quot;...the string!...It&#39;s nuts!&quot;

length(sp)
[1] 1

length(unlist(sp))
[1] 2

regexpr(&quot;!&quot;,example2)     ## first occurrence of &quot;!&quot; in example2

gregexpr(&quot;!&quot;,example2)    ## all occurrences of &quot;!&quot; in example2
</code></pre>
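
<ul>
<li>A short sketch of what these return, on a made-up string <code>s</code>. <code>as.integer()</code> strips the extra match attributes so only the positions remain.</li>
</ul>

<pre><code class="r">s &lt;- &quot;a!b!c&quot;                                      ## hypothetical example string
as.integer(regexpr(&quot;!&quot;, s, fixed = TRUE))         ## 2
as.integer(gregexpr(&quot;!&quot;, s, fixed = TRUE)[[1]])   ## 2 4
gsub(&quot;!&quot;, &quot;?&quot;, s, fixed = TRUE)                   ## &quot;a?b?c&quot;
</code></pre>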

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-59" style="background:;">
  <hgroup>
    <h2>String Manipulation</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><space></p>

<pre><code class="r">a) Convert the following character vector into a 3 column dataframe. Name each column.
b) Format the numbers to be percentages with 2 decimal places.
c) Find the total score for people with J-letter first names. 
d) Find the most common weekday.

char_vec &lt;- c(&quot;{&#39;al&#39; &#39;einst&#39;} score:0.4503-[12302013]&quot;,
&quot;{&#39;isaac&#39; &#39;knewt&#39;} score:0.0007-[11202013]&quot;,
&quot;{&#39;ralph&#39; &#39;emerson&#39;} score:0.10321-[09122013]&quot;,
&quot;{&#39;james&#39; &#39;dean&#39;} score:0.84-[02032012]&quot;,
&quot;{&#39;jim&#39; &#39;beam&#39;} score:0.2-[10172013]&quot;,
&quot;{&#39;tommy&#39; &#39;bahamas&#39;} score:0.761-[05212013]&quot;,
&quot;{&#39;george&#39; &#39;of the jungle&#39;} score:0.9434-[01302013]&quot;,
&quot;{&#39;harry&#39; &#39;henderson&#39;} score:0.5456-[08112012]&quot;,
&quot;{&#39;johnny&#39; &#39;walker&#39;} score:0.309118-[08212011]&quot;)

e) Print out the following sentence with one word on each line:
y &lt;- &quot;This is a sentence.&quot;
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-60" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1>Basic statistical functions</h1>

<p><br></p>

<pre><code class="r">x &lt;- rnorm(1000,85,5)
y &lt;- 2 * runif(1000,0,10)
mean(x)             ## [1] 85.22434
median(x)           ## [1] 85.24506
sd(x)               ## [1] 4.869123
var(x)              ## [1] 23.70836
cov(x,y)            ## [1] 0.8053751
cor(x,y)            ## [1] 0.02883522
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-61" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1>Operator precedence</h1>

<p><space></p>

<pre><code class="r">1:n-1       ## wrong: parsed as (1:n) - 1, i.e. 0 to n-1
1:(n-1)     ## this is what you want
</code></pre>

<pre><code class="r">-2.4 ^ 2.5  ## nice: parsed as -(2.4 ^ 2.5)
[1] -8.923354
x &lt;- -2.4
x ^ 2.5     ## not so nice: a negative base with a fractional power
[1] NaN
</code></pre>

<ul>
<li>Parentheses force the order of evaluation you want</li>
</ul>
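
<ul>
<li>A runnable sketch with a made-up <code>n</code>: <code>^</code> binds tighter than unary minus, which binds tighter than <code>:</code>, which binds tighter than binary <code>-</code>.</li>
</ul>

<pre><code class="r">n &lt;- 5           ## example value
1:n-1            ## 0 1 2 3 4
1:(n-1)          ## 1 2 3 4
-2^2             ## -4
(-2)^2           ## 4
</code></pre>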

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-62" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1>Boolean operations</h1>

<p><space></p>

<ul>
<li>Boolean operations coerce non-zero numbers to <code>TRUE</code> (and 0 to <code>FALSE</code>)</li>
</ul>

<pre><code class="r">x == 4 | 6          ## OR function - always TRUE, since 6 is coerced to TRUE
x == 4 | TRUE       ## weird

x == 4 | x == 6     ## better

x %in% c(4,6)       ## best
</code></pre>
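
<ul>
<li>The same comparisons on a made-up vector, to show the results side by side.</li>
</ul>

<pre><code class="r">x &lt;- c(2, 4, 6)     ## example data
x == 4 | 6          ## TRUE TRUE TRUE
x == 4 | x == 6     ## FALSE TRUE TRUE
x %in% c(4, 6)      ## FALSE TRUE TRUE
</code></pre>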

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-63" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1>Coercion</h1>

<p><space></p>

<ul>
<li>Sometimes it&#39;s good to coerce</li>
<li><code>read.csv()</code> returns a <code>data.frame</code>. What if you want to do math? Matrices are better.</li>
</ul>

<pre><code class="r">x &lt;- data.frame(num=c(1,2,3,4))
mean(x)                             ## Nope: NA, with a warning
x &lt;- as.matrix(x)
mean(x)                             ## Yup: 2.5

as.numeric()
as.character()
as.factor()
as.data.frame()
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-64" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1><code>print()</code> vs <code>cat()</code></h1>

<p><space></p>

<ul>
<li><code>print()</code> is a generic function</li>
<li><code>cat()</code> is a concatenate function</li>
</ul>

<pre><code class="r">x &lt;- 2
print(&quot;One plus one is&quot;,x)
[1] &quot;One plus one is&quot;

## alternatively...
print(&quot;One plus one is&quot;);print(x);
[1] &quot;One plus one is&quot;
[1] 2

## even better...
cat(paste(&quot;One plus one is&quot;,x))
One plus one is 2
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-65" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1><code>class</code> vs <code>mode</code></h1>

<p><space></p>

<ul>
<li>An object&#39;s <code>mode</code> determines how it&#39;s stored in memory </li>
<li>An object&#39;s <code>class</code> determines its abstract type, a concept borrowed from OOP.</li>
<li>Most statistical programs don&#39;t have the OOP concept.</li>
</ul>

<pre><code class="r">x &lt;- data.frame(scores = c(80, 90, 70))
y &lt;- as.Date(&quot;2013-11-05&quot;)
cat(paste(mode(x), mode(y)))
</code></pre>

<pre><code>## list numeric
</code></pre>

<pre><code class="r">cat(paste(class(x), class(y)))
</code></pre>

<pre><code>## data.frame Date
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-66" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1><code>do.call()</code></h1>

<p><space></p>

<ul>
<li>Lots of real data comes in row form, where each element has a different <code>mode</code></li>
<li>It is often stored as a list. Use <code>do.call()</code> to combine the elements into a <code>data.frame</code></li>
</ul>

<pre><code class="r">a &lt;- list(1.3, 2.5, &quot;jeff&quot;)
b &lt;- list(4.5, 2.8, &quot;jerry&quot;)
d &lt;- list(6.5, 0.8, &quot;joe&quot;)
z &lt;- list(a, b, d)
df &lt;- data.frame(do.call(rbind, z))
df
</code></pre>

<pre><code>##    X1  X2    X3
## 1 1.3 2.5  jeff
## 2 4.5 2.8 jerry
## 3 6.5 0.8   joe
</code></pre>
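
<ul>
<li>One possible variation, sketched with made-up column names (<code>a</code>, <code>b</code>, <code>name</code>): build a one-row <code>data.frame</code> per record first, so the numeric columns stay numeric after binding.</li>
</ul>

<pre><code class="r">rows &lt;- list(list(1.3, 2.5, &quot;jeff&quot;), list(4.5, 2.8, &quot;jerry&quot;))   ## example records
dfs  &lt;- lapply(rows, function(r) data.frame(a = r[[1]], b = r[[2]], name = r[[3]]))
df2  &lt;- do.call(rbind, dfs)     ## same as rbind(dfs[[1]], dfs[[2]])
sapply(df2, class)              ## a and b are numeric
</code></pre>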

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-67" style="background:;">
  <hgroup>
    <h2>Miscellaneous Tips and tricks</h2>
  </hgroup>
  <article>
    <h1>Exercise!</h1>

<p><space></p>

<pre><code class="r">a)
## Create a new column called num2 for which each value is double the corresponding value in num
## Make sure num2 is also a factor
x &lt;- data.frame(num=factor(c(1.0,0.03,8.0, 0.4)))

b)
## Find the letters in z corresponding to the indices of even numbers in y
y &lt;- c(1,2,NA,4,5,8,5,2,3)
z &lt;- c(&quot;f&quot;,&quot;g&quot;,&quot;e&quot;,&quot;i&quot;,&quot;l&quot;,&quot;o&quot;,&quot;p&quot;,&quot;u&quot;)
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-68" style="background:;">
  <hgroup>
    <h2>Not covered</h2>
  </hgroup>
  <article>
    <ol>
<li>Object oriented programming with S3 and S4 </li>
<li>Input/output</li>
<li>Packages: <a href="http://cran.r-project.org/web/packages/ggplot2/index.html"><code>ggplot2</code></a>, <a href="http://cran.r-project.org/web/packages/reshape/index.html"><code>reshape</code></a>, <a href="http://cran.r-project.org/web/packages/plyr/index.html"><code>plyr</code></a>, <a href="http://cran.r-project.org/web/packages/forecast/index.html"><code>forecast</code></a>, <a href="http://cran.r-project.org/web/packages/MASS/index.html"><code>MASS</code></a></li>
<li>Debugging: <code>browser()</code>, <code>warnings()</code></li>
<li>Parallelizing</li>
<li>Much more</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-69" style="background:;">
  <hgroup>
    <h2>Resources</h2>
  </hgroup>
  <article>
    <ol>
<li><a href="http://nostarch.com/artofr.htm">The Art of R Programming</a> - <em>programming in R</em></li>
<li><a href="http://shop.oreilly.com/product/9780596809164.do">R Cookbook</a> - <em>recipes and tips</em></li>
<li><a href="https://www.coursera.org/course/compdata">Computing for Data Analysis</a> (Coursera) - <em>fundamentals</em></li>
<li><a href="http://slidify.org">Slidify</a> - <em>slides</em></li>
<li><a href="http://yihui.name/knitr/">knitr</a> - <em>reports and presentations</em></li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-70" style="background:;">
  <hgroup>
    <h2>PRACTICE, PRACTICE, PRACTICE</h2>
  </hgroup>
  <article>
    
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-71" style="background:;">
  <hgroup>
    <h2>Questions?</h2>
  </hgroup>
  <article>
    
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-72" style="background:;">
  <hgroup>
    <h2>Thank you!</h2>
  </hgroup>
  <article>
    
  </article>
  <!-- Presenter Notes -->
</slide>

    <slide class="backdrop"></slide>
  </slides>

  <!--[if IE]>
    <script 
      src="http://ajax.googleapis.com/ajax/libs/chrome-frame/1/CFInstall.min.js">  
    </script>
    <script>CFInstall.check({mode: 'overlay'});</script>
  <![endif]-->
</body>
<!-- Grab CDN jQuery, fall back to local if offline -->
<script src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.7.min.js"></script>
<script>window.jQuery || document.write('<script src="libraries/widgets/quiz/js/jquery-1.7.min.js"><\/script>')</script>
<!-- Load Javascripts for Widgets -->
<!-- MathJax: Fall back to local if CDN offline but local image fonts are not supported (saves >100MB) -->
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [['$','$'], ['\\(','\\)']],
      processEscapes: true
    }
  });
</script>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<!-- <script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script> -->
<script>window.MathJax || document.write('<script type="text/x-mathjax-config">MathJax.Hub.Config({"HTML-CSS":{imageFont:null}});<\/script><script src="libraries/widgets/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"><\/script>')
</script>
<!-- LOAD HIGHLIGHTER JS FILES -->
<script src="libraries/highlighters/highlight.js/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<!-- DONE LOADING HIGHLIGHTER JS FILES -->
</html>