p8105_hw1_zdz2101.Rmd

---
title: "p8105_hw1_zdz2101"
author: "Zelos Zhu"
date: "9/18/2018"
output: github_document
---

#Problem 1
```{r Load libs, message=FALSE}
library(tidyverse)
```

```{r Problem 1a}
problem1_df <- tibble(numeric_vec = runif(n = 10, min = 0, max = 5),
                      logical_vec = numeric_vec > 2,
                      char_vec = c("My", "name", "is", "Zelos", "Zhu", "and", "I", "like", "to", "eat"),
                      factor_vec = factor(  c( rep( c("small", "medium", "large"), 3), "x-large")  )
                      )
print(problem1_df)
str(problem1_df)

#means <- apply(problem1_df, 2, mean) --- doesn't work
#Above line coerces all into NA, probably because there is non-numeric in it, good to know

means_prob1 <- c(mean(problem1_df$numeric_vec),
                 mean(problem1_df$logical_vec),
                 mean(problem1_df$char_vec),
                 mean(problem1_df$factor_vec))
means_prob1
```

The mean() function successfully worked for the numeric and logical vectors of the data frame while it produced NAs for the character and factor vectors. This makes sense: the mean of the first vector could be calculated since it is a numeric variable. It's exactly what we think it should be: the sum of the elements divided by the number of elements. As a result the mean is `r round(means_prob1[1], 2)`. 

Interestingly enough, the second vector in the data frame could also have a mean and that is because R seems to automatically consider FALSE into 0 and TRUE into 1. So, taking a "mean" of a logical allows us to see the proportion of TRUE cases, similarly if we were to take the sum we would see the total counts of TRUE's. 

As expected, a mean could not be taken from neither the character or factor varaibles and that is because character and factor variables cannot be used with traditional "mathematical" functions. As a result taking their means produce a warning and the value we get is NA. 

```{r Problem 1b, eval = FALSE}
as.numeric(problem1_df$logical_vec)
as.numeric(problem1_df$char_vec)
as.numeric(problem1_df$factor_vec)
```

The logical vector turned into a vector of 0's and 1's, the FALSEs being 0, the TRUEs being 1.

The character vector turned into a vector of NA's. Though I imagine if you wrote out numbers in character form, they can convert properly. For example, executing the line

as.numeric(c("1","2","3")) 

would properly display 1,2,3. 

The factor vector turns in a vector that consist of the numbers `r 1:(length(levels(problem1_df$factor_vec)))`. I'm assuming R assigns each level of a factor a number and when you use the as.numeric() function with it, it pulls these numbers. 

```{r Problem 1c, eval = FALSE}
as.numeric(as.factor(problem1_df$char_vec))
as.numeric(as.character(problem1_df$factor_vec))
```

When you convert the character vector into a factor, it turns into a vector of 10 levels because I had 10 unique values, in which case when as.numeric() is applied it turns into a scrambled mess of the numbers 1 to 10. 

When you convert the factor vector into character and then attempt to use as.numeric() it displays a warning and we get a vector of NA's.


#Problem 2
```{r Problem 2 make plotting df}
problem2_df <- tibble( x = rnorm(1000),
                       y = rnorm(1000),
                       is_sum_positive = (x + y) > 0,
                       num_coerce = as.numeric(is_sum_positive),
                       factor_coerce = as.factor(is_sum_positive))
print(head(problem2_df))
str(problem2_df)
```

The size of the dataset is `r nrow(problem2_df)` rows (observations) and `r ncol(problem2_df)` columns (variables).
The mean of x is `r round( mean(problem2_df$x), 2)`.
The median of x is `r round( median(problem2_df$x), 2)`.
The proportion of cases for which the logical vector is TRUE is `r round( mean(problem2_df$is_sum_positive), 2)`. 

```{r Problem 2 graphs}
hw1plot_1 <- (ggplot(problem2_df, aes(x=x, y=y))
            + geom_point( aes(color = is_sum_positive))
            + ggtitle("Scatterplot of 1000 points based on normal distribution")
            + geom_abline(intercept = 0, slope = -1)
            + theme(plot.title = element_text(size = 10, face = "bold")))

hw1plot_2 <- (ggplot(problem2_df, aes(x=x, y=y))
            + geom_point( aes(color = num_coerce))
            + ggtitle("Scatterplot of 1000 points based on normal distribution")
            + geom_abline(intercept = 0, slope = -1)
            + theme(plot.title = element_text(size = 10, face = "bold")))

hw1plot_3 <- (ggplot(problem2_df, aes(x=x, y=y))
            + geom_point( aes(color = factor_coerce))
            + ggtitle("Scatterplot of 1000 points based on normal distribution")
            + geom_abline(intercept = 0, slope = -1)
            + theme(plot.title = element_text(size = 10, face = "bold")))

hw1plot_1
ggsave(filename = "hw1plot1.png")

hw1plot_2
hw1plot_3
#library(gridExtra) --- for my viewing pleasure when I was working on this so I don't have to scroll through in viewer
#grid.arrange(hw1plot_1, hw1plot_2, hw1plot_3, nrow=1)
```

The color arrangement seems to be the exact same for the logical and factor variables, each point has a distinct color allocated based on the discrete values TRUE and FALSE (in this case a light red and cyan).

MY hypothesis/guess on why this is, is that similar to how R automatically register logicals as 0/1's for mathematical functions, that in this setting, ggplot2 is written such that when logicals enter an aes, it automatically considers it a factor of 2 levels: TRUE/FALSE which have correspoding numbers to them, in this case 1 and 2. This would then allow R to have an index to pull certain "colors" from some sort of color vector. Otherwise with FALSE being 0, it would be difficult to use that as an index since R objects start indexing at 1. (Just my guess)

In regards to the second plot where we have the blue and black dots, it seems in this case our color is based on some sort of gradient, the range of which is defined as 0 to 1. In our case we only have exactly 0's and 1's because we coerced it from logical to numeric. As a result we have a concentration from each of the extremes of the gradient. Say if we plugged in x as our color we would see this entire black(low)-blue(high) gradient generated going left to right. The same could be said for if we plugged in y instead, but we would generate that gradient from bottom to top.