cm011.Rmd

# File input/output (I/O)

Today's class is all about reading in and writing out data files into and out of R.

## Worksheet

Normally there would be a separate worksheet file, but we'll do everything in-line for this class

## Resources

### References and tutorials

* Jenny Bryan's [notes](https://stat545.com/import-export.html) on file I/O
* [Tutorial](https://beanumber.github.io/sds192/lab-import.html#data_from_an_excel_file) on importing excel file
* [Tutorial](http://jenrichmond.rbind.io/post/how-to-use-the-here-package/) on relative paths and how RStudio treats .R script files and .Rmd files

### Package documentation

* [read_csv](https://readr.tidyverse.org/reference/read_delim.html) package
* [read_excel](https://www.rdocumentation.org/packages/readxl/versions/0.1.1/topics/read_excel) package (https://www.rdocumentation.org/packages/readxl/versions/0.1.1/topics/read_excel)
* [download.file](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html)
* [here::here()](https://here.r-lib.org) package

<!---The following chunk allows errors when knitting--->

```{r allow errors, echo = FALSE}
knitr::opts_chunk$set(error = TRUE)
```

## Writing data to disk (10 mins)

Let's first load the built-in gapminder dataset and the tidyverse:

```{r, echo = FALSE, warning = FALSE, message = FALSE}
library("gapminder")
library("tidyverse")
library("tidyr")
```

Next, let's filter the data only from 2007 and only in the Asia continent and save it to a variable.

```{r}
gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
gap_asia_2007
```

We can write this to a comma-separated value (csv) file with just one command:

```
write_csv(gap_asia_2007,"exported_file.csv")
```
But let's not just write the csv file anywhere, we should download the file in a sensible location. The next section talks about the `here` package

## Making the case for here::here()

If you wanted to make your Rproj more portable and accessible to more users in a cross-platform (between Mac, Unix, Windows users), rather than specifying every path explicitly, here::here() allows you to set relative paths much more easily.

For example, create a new folder "data" and within it a subfolder called "cm011_data", specify here() and then save a file to that location with these commands:

###
```{r}
write_csv(gap_asia_2007,here::here("data","cm011_data","exported_file.csv"))
```

More thorough notes to come...

In the meantime, read this [short article](http://jenrichmond.rbind.io/post/how-to-use-the-here-package/) for an excellent explanation of why we use the here::here() package in R.

## Reading data from disk (5 mins)

The same csv file that we just saved to disk can be imported into R again by specifying the path where it exists:

```{r}
read_csv(here::here("data","cm011_data","exported_file.csv"))
```

Normally we would store the imported data into a new variable and you can use that by assigning the output to a variable. Notice that the output of the imported file is the same as the original tibble, and read_csv was intelligent enough to detect the types of the columns. This won't always be true so it's worth checking! The [read_csv package](https://readr.tidyverse.org/reference/read_delim.html) has many additional options including the ability to skip columns, skip rows, rename columns on import, trim whitespace, and more...

## Import a file from the web/cloud

### Import a csv file from the internet

To import a csv file from a web, assign the URL to a variable
```{r}
url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"
```
and then apply read_csv file to the `url`.

```{r}
read_csv(url)
```

### Import an excel file (.xls) from the internet

First, we'll need the package to load in excel files:

```{r}
library("readxl") 
```
'Datafiles from this tutorial were obtained from: https://beanumber.github.io/sds192/lab-import.html#data_from_an_excel_file

To import an .xls or .xlsx file from the internet, you first need to download it locally. The read_excel function from the readxl package can help us read it after we download it. To download it, create a new variable called xls_url, as well the name of the destination file you would like to download the data into. 

```{r}
xls_url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/GreatestGivers.xls"
download.file(xls_url,here::here("data","cm011_data","some_file.xls"))
```
**NOTE: If you are on windows and end up downloading a corrupt file, you need to add an extra argument: download.file(..., mode="wb"). More details about this behaviour can be found [here](https://github.com/HenrikBengtsson/Wishlist-for-R/issues/30).**

Naming a file "some_file" is extremely bad practice (hard to keep track of the files) and I would strongly encourage you to name the file similar (or the same) to the original file. Let me show you a handy trick to extract the filename from the URL:

```{r}
file_name <- basename(xls_url)
download.file(xls_url,here::here("data","cm011_data",file_name))
```
Now we can import the file:

```{r}
read_excel(here::here("data","cm011_data",file_name))
```

## Read in a sample Excel file. (Optional)

Let's load in a sample dataset from a PhD research project in MRI (magnetic resonance imaging). In MRI, subjects are imaged and the data collected can be visualized in "slices". If a human head was being imaged, the first slice might be a cross-section of the neck. The next slice would be an image in the same plane as the first slice, but a few mm above the first slice, and so on and so forth. The exact details of the dataset aren't important, but here are key details we can extract from the dataset:

- the first column contains the subject ID (e.g., "HerS18Bs01.BS1/8")
- Each row contains 8 measurements, each from a different slice in the image
- The next column is a weighted average, and the final column is the volume measurement of all 8 slices
- The data is **not** in tidy format
- The subjects are divided into two treatment groups: "Avastin" and "Herceptin", but unfortunately, this information is not captured in the table
- The relevant data can be found in the range A1:K12 since the rest of the data contains either footer rows or repeated data.

```{r}
mri_file = here::here("data","cm011_data","Firas-MRI.xlsx")
mri <- read_excel(mri_file, range = "A1:K12")
```

Viewing the data using the `View()` function will allow you to investigate how the imported data looks in R. As hinted above, it looks like we have to do one preliminary data-processing steps before we can import all the data.

Notice that below row 12, two subsets of the data from the first 12 rows is duplicated into two groups: "Avastin" and "Herceptin" treated. Indeed, the person who did the analysis copied the raw data and split it into the two groups. While it is possible to do this processing in R, in this case it is better that we do this directly in the excel file. Let's do this and save a new file called `Firas-MRI_minor_cleaning.xlsx`

Note also that column J is a calculated column of weighted averages. Rather than bring this column in a-is, we want to remove this column and then later calculate it in R if we need to. 

We are now ready to import in the newly processed file:

```{r}
mri_file = here::here("data","cm011_data","Firas-MRI.xlsx")
mri <- read_excel(mri_file, range = "A1:L12")
# We can remove the "weighted averages" column like so:
mri <- mri[,-10]
```

Finally, let's make our data tidy using `pivot_longer`:

```{r}
mri <- mri %>% 
  tidyr::pivot_longer(cols = `Slice 1`:`Slice 8`,
               names_to = 'slice_no',
               values_to = 'value')
mri
```

And we are done! Ready to explore trends in this dataset.