2011DataComparisons.Rmd

---
title: "2011DataComparison"
author: "Mausam Duggal"
date: "July 19, 2016"
output: html_document
---
---

```{r init, echo=FALSE, message=FALSE, warning=FALSE}
library(dplyr); library(ggplot2); library(knitr); library(gaussfacts); library(rmsfact)
```

#### 2011 Data Comparison 

Rolf and Carlos have noted some significant differences in the data being generated by PopSyn3 and those after expanding the TSRC. Of special significance was the differences observed in the frquency counts by **income** segments. Given the importance of getting the data inputs correct from the point of view of model estimation as well as aligning all the input datasets so that they match across the various models, we are putting together a comparison of the different datasets by certain key variables.

```{r echo=TRUE}
opts_knit$set(root.dir = 'c:/personal/r')
```

Batch in the PopSyn3 and 2011NHS data gotten from a semi-custom tabulation
```{r echo=TRUE}

nhs <- read.csv("2011NHS.csv", stringsAsFactors = FALSE)
hhold <- read.csv("households.csv", stringsAsFactors = FALSE)
pp <- read.csv("persons.csv", stringsAsFactors = FALSE)

```

#### Clean the 2011 NHS.

The 2011NHS semi-custom tabulation is an extremely big dataset with **2621 columns**. A lot of these are not needed for what we want to do, so it is best to bring this to a manageable size. Further, the data is not really designed in a format that is conducive to data wrangling. So, some cleaning steps are needed.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}

#' filter out records that correspond to only those that belong to Ontario
on <- nhs[c(19844:40115), ] 

#' only keep the income columns
on1 <- on[c(1, 2231:2621)]

#' only keep records with DA numbers. This will get rid of sub-totals by various
#' sub-level geographies.

on2 <- filter(on1, grepl("^[35]",Geography))

#' All the columns come in as characters. So, we need to convert them back to 
#' numeric for undertaking analysis

# create vector of column names to apply the conversion to
cols <- colnames(on2)
cols1 <- cols[-1]    # get rid of the first element in the column vector

#' conver to numeric
on2[,cols1]<-lapply(cols1, function(x) as.numeric(as.character(on2[,x])))
on2[is.na(on2)] <- 0    # set NA to zero

```

#### Get the PopSyn3 ready for comparison

The 2011 PopSyn3 outputs have already been batched in. Given that the person file only includes a placeholder for the household id, it is necessary to join the two dataframe, strip out unnecessary columns and create the necessary columns for estimating the number of people in each income category. 

Of note, all those person records with a **population below 15** are excluded because that is what the 2011NHS used for reporting their numbers. This could present a challenge with the TSRC data because people below that age will need to be removed as well for an accurate comparison.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}
#' join the tables to get the income variable mapped to the person table.
#' Further only keep persons 15 years or older.

pp1 <- left_join(pp, hhold, by.x = "hhid", by.y = "hhid") %>% 
  subset(., age>14) %>% subset(., select = c(hhid, hhinc))

#' create an income categorical variable
pp1 <- transform(pp1, incat = ifelse(hhinc < 50001, 1,
                                     ifelse(hhinc > 50000 & hhinc < 80001, 2,
                                            ifelse(hhinc > 80000 & hhinc < 100001, 3, 4))))

# summarise to get counts by income category
pp1.sum <- pp1 %>% group_by(incat) %>% summarise(cnt = n())

```

#### Populate Rolf and Carlos' TSRC expanded numbers

These are just manually batched in. Of note, currently the income categories **don't quite align** because the TSRC data breaks at 70,000 CAD; however, the 2011 NHS goes all the way to 80,000 CAD. So request Rolf and Carlos to produce TSRC with revised income categories. 

```{r warning=FALSE, tidy=TRUE, echo=TRUE}

#' I have just created these approximate values from Carlos' pdf document. They serve for comparison purposes only, hence exact values are not #' necessarily needed, although good to have. 

tsrc <- c(3000000, 1400000, 1550000, 2700000)

#' Populate the table
for (i in 1:4){
  on3_sum[3,i] <- tsrc[i]
}

```

#### Populate the PopSyn3 numbers

```{r warning=FALSE, tidy=TRUE, echo=TRUE}

#' Populate the table
for (i in 1:4){
  on3_sum[2,i] <- pp1.sum[i,2]
}

```

#### Compare how these numbers stack up.

```{r warning=FALSE, tidy=TRUE, echo=TRUE}


```