week2.Rmd

---
title: "Week 2"
output:
  html_document:
    toc: true
    includes:
      after_body: footer.html
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Key Git/GitHub Skills

* [The lecture notes](week2-moregit.html)
* [Lecture video](https://youtu.be/TiII9trsArk)

## Questions from the post-lecture chat

* Difference between forking and cloning
    * This [video](https://youtu.be/uuti1G48yhY) shows how to fork and some of the features you'll see on GitHub when you do that.
    * This [video](https://youtu.be/0-5LiuxNnbM) shows how to clone the same repository and how what you see differs from what you see when you fork.

* Discussion of alternatives to using branches as repository 'versions'. [Video](https://youtu.be/t2YqepzFSYc). Branches are not meant for versioning your repository. They are for breaking off a copy to work on something and then merging those changes back in (or deleting the copy). Branches might have a long life, e.g. a Development branch, but they are not for 'versioning' (2020 data, 2021 data). It is tempting to use them that way, but there are better alternatives, e.g. releases, separate repositories, or separate folders. One exception might be a branch that, say, uses C++ versus pure R but is otherwise the same code base.
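
One of the alternatives mentioned above is a release. A GitHub Release is built on a Git tag, so the idea can be sketched from the command line. This is only a minimal sketch in a throwaway repository: the file name `data.csv`, the tag names `v2020` and `v2021`, and the user details are made up for illustration.

```shell
# Throwaway repository in a temporary folder (all names here are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Example User"
git config user.email "user@example.com"

# The 2020 state of the project
echo "year,count" > data.csv
echo "2020,10" >> data.csv
git add data.csv
git commit -q -m "Add 2020 data"
git tag -a v2020 -m "2020 data"     # mark this state instead of making a branch

# The 2021 state, still on main
echo "2021,12" >> data.csv
git commit -q -am "Add 2021 data"
git tag -a v2021 -m "2021 data"

# List the tagged versions and read the 2020 file without leaving main
git tag --list
git show v2020:data.csv
```

Pushing a tag (for example `git push origin v2020`) makes it visible on GitHub, where it can be turned into a Release.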

## Branches

This [video](https://youtu.be/4TUZEQdaZGM) shows a basic workflow using branches.

In the lecture, I cautioned against using branches when you are beginning with Git. Even though I work on many Git repositories and develop public R packages, I rarely need to use branches. Normal scientific workflow does not involve branches and IMO most scientists won't gain much by adding them to their workflow. The GitHub features I have talked about so far are things that are already part of our workflow (taking notes on what we are doing, saving copies along the way, reviewing work with collaborators, sharing our work), but Git/GitHub allows us to do them more efficiently, better and faster.

I do use branches regularly with certain R package projects, and when I use them, it is in very specific ways:

   * work on a file or set of files, finish work, merge into main, delete the branch. 
   * sandbox an idea. If I am really uncertain about an idea/change or about to make a major revamp, I'll make a branch while working on the idea. Once I decide yea or nay, I merge the branch into main (or drop the changes) and delete the branch.
   * When working on a branch, I stay off other branches and my main branch as much as possible.
   * I create a timeline (mentally) for a branch before I create it. The branch has a concrete purpose and end state (i.e. when xyz plot functions are done or when documentation for xyz files is done). Vague branches like 'development' are too amorphous.
   * When I am using GitHub Pages and need the `gh-pages` branch to serve the site.
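
The first pattern in the list above (work on a branch, finish, merge into main, delete the branch) can be sketched on the command line. This is a minimal sketch in a throwaway repository, not a recording of my own commands; the branch name `fix-plot-labels`, the file `plot.R` and the user details are hypothetical.

```shell
# Throwaway repository with one commit on main (all names are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Example User"
git config user.email "user@example.com"
echo 'plot(1:10)' > plot.R
git add plot.R
git commit -q -m "Add plot script"

# 1. Create a branch with a concrete purpose and switch to it
git checkout -q -b fix-plot-labels

# 2. Do the work and commit it on the branch
echo 'title("Counts")' >> plot.R
git commit -q -am "Add plot title"

# 3. Switch back to main and merge the finished work
git checkout -q main
git merge -q fix-plot-labels

# 4. Delete the branch now that its purpose is served
git branch -d fix-plot-labels

# Only main is left
git branch --list
```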
    
But usually other features of GitHub are sufficient and more appropriate.

   * Using the revert feature to get rid of a change that I made.
   * Using Releases to create the stable version and using main as the development version.
   * Releases also effectively create an 'archive' version of the repository at key states: draft 1, draft 2, etc.
   * Using issues to create concrete small chunks of work.
   * Using a fork of a repository and a pull request (if I am collaborating with someone)
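
On the command line, the rough counterpart of the revert feature is `git revert`, which adds a new commit that reverses an earlier one. A minimal sketch in a throwaway repository follows; the file `analysis.R` and the user details are made up for illustration.

```shell
# Throwaway repository (all names are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Example User"
git config user.email "user@example.com"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Start analysis"

# A change that turns out to be a mistake
echo 'x <- 100' >> analysis.R
git commit -q -am "Change x"

# Undo the last commit by adding a new commit that reverses it
git revert --no-edit HEAD

# The history keeps all three commits; the file is back to its first state
git log --oneline
cat analysis.R
```

Because the mistake and its reversal both stay in the history, nothing is lost and no branch is needed.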
   
That said, branches are definitely helpful in certain situations. Before you start working with branches, make sure you think through how they affect your file system and how that will affect your current workflow and the workflow of any users of your repository.
    
   * Switching branches changes your file system state. A branch is not a separate 'space'. When you switch branches, you tell Git to change your files to reflect the branch state. Do you have any code that 'sources' the folder/repository with branches? For example, do you have code in other folders with lines like `source("Documents/myrepowithbranches/plot.R")`? That kind of workflow will cause problems because, after you switch branches, those lines will pick up the branch version of the file. Your file system will remain in the branch state until you switch back to the main branch.
   * File time stamps. [Video](https://youtu.be/_WXHp6uRmAI) Because switching branches on your computer causes Git to change your file system to that branch state, the file time stamps of any files that differ between branches will update to the time that you switched branches. It is possible to make Git change the file time stamps back to the last modification time. This [video](https://youtu.be/WbP9B_jfxPU) shows how.
   * Are you working in a team with users who are not Git-savvy but will need to access branches? If they only access branches on GitHub, then it is probably fine, but if they will switch branches using RStudio or GitHub Desktop on their computer, then they are likely to get confused when they accidentally leave their file system in a branch state.
   * Pulling and pushing work a bit differently when you have branches. When you do a pull from GitHub, you will get the changes across all branches, but when you push, the push is branch specific. So if you have changes on the main branch and another branch, you need to push from main, then switch to the other branch(es) and push from them too. [Video](https://youtu.be/uJxk2l5PEKQ)
   * Are you using a cloud backup system that syncs across devices/computers? Dropbox and iCloud are examples. Syncing repositories outside of Git (so not using the push/pull system) can definitely cause problems especially if you are using branches. If your backup system is linear, one computer being backed up, with no syncing across devices, and you are not working off-line then you are probably ok.
       1. Why are cloud backup/syncing systems so problematic? The problem happens when you work offline and you use branches. Remember that changing branches changes your file system. Let's say you are on branch A on computer 1 (home) and working offline. Then on computer 2 (office), you create branch B and do a bunch of work. You leave that computer on branch B. Then you go back home, get on computer 1 and get online. When that computer syncs to the cloud, it either wipes out all the work from computer 2, because computer 1 is in the branch A state, or creates a slew of 'Conflicted Copies'. These sorts of problems happen all the time when you do automatic syncing across computers.
       2. Backups of repositories with branches can be confusing even if you don't have syncing across devices. Let's say you switch to branch B and delete most of your files. You stay in branch B and that is what is backed up. The info to get back the files (in the main branch A) is still there, but in the hidden .git folder. If you go and look at the backup, you will just see branch B with all the files gone. You have to know that it is a Git repository and use Git to find out what branch the repository is on. But it might not be obvious to you that this folder is a Git repository.
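
The first point in the list above, that switching branches changes the files on disk, is easy to see in a throwaway repository. This is a minimal sketch; the branch name `experiment`, the file `plot.R` and the user details are hypothetical.

```shell
# Throwaway repository (all names are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Example User"
git config user.email "user@example.com"

echo 'message("main version")' > plot.R
git add plot.R
git commit -q -m "Add plot.R on main"

# Change the same file on a branch
git checkout -q -b experiment
echo 'message("experiment version")' > plot.R
git commit -q -am "Change plot.R on experiment"

# Same path on disk, different content depending on the checked-out branch
on_experiment=$(cat plot.R)
git checkout -q main
on_main=$(cat plot.R)

echo "While on experiment: $on_experiment"
echo "While on main:       $on_main"
```

Any script outside the repository that calls `source()` on this path gets whichever version happens to be checked out at the time.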

* How do I clone just one branch? For this you will use Git from the command line. Open a terminal window and change to the directory where you keep repositories (e.g. `cd Documents/GitHub`). Then issue the command 
    ```
    git clone -b <branchname> --single-branch <remote-repo-url>
    ```
    
    You can add the name of the folder optionally at the end. So the command I issued in the video was
    
    ```
    git clone -b test --single-branch https://github.com/eeholmes/Week2 Week2-test
    ```    
    
    Here's a screen recording of this [Video](https://youtu.be/CNZh9L2qwCc).
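
To try the single-branch clone without touching GitHub, the sketch below builds a small local repository as a stand-in for the remote and clones only its `test` branch. The folder name `origin-repo` and the user details are assumptions for illustration; with a real repository you would use its GitHub URL in place of the local path.

```shell
work=$(mktemp -d)
cd "$work"

# Local stand-in for the repository on GitHub, with branches main and test
git init -q -b main origin-repo     # -b needs Git 2.28 or newer
cd origin-repo
git config user.name "Example User"
git config user.email "user@example.com"
echo "main" > README.md
git add README.md
git commit -q -m "First commit"
git checkout -q -b test
echo "test" >> README.md
git commit -q -am "Work on test"
git checkout -q main
cd "$work"

# Clone only the test branch into a folder named Week2-test
git clone -q -b test --single-branch "$work/origin-repo" Week2-test

# The clone knows about the test branch only
cd Week2-test
git branch -a
```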


## Fix timestamps

* Look in the .git folder (it is hidden, so show hidden files) for the hooks folder.
* Click on one of the xxx.sample files and duplicate it (click on the checkbox next to the file, click More, click 'Copy to'). *This is important*: duplicate the file this way so that the copy retains the special file permissions.
* Save the file as 'post-checkout' with no file extension. Delete the code in the copy and paste in the code below.

Here is the post-checkout code used in this [video](https://youtu.be/WbP9B_jfxPU) to show you how to fix time stamps when you switch branches.
```
#!/bin/sh -e

OS=${OS:-`uname`}

if [ "$OS" = 'Darwin' ]; then
  get_touch_time() {
    date -r ${unixtime} '+%Y%m%d%H%M.%S'
  }
else
  get_touch_time() {
    date -d @${unixtime} '+%Y%m%d%H%M.%S'
  }
fi

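# All files tracked in the commit that was just checked out (temporary list)
git ls-tree -r --name-only HEAD > .git_ls-tree_r_name-only_HEAD

# Tracked files with uncommitted modifications (temporary list)
git diff --name-only > .git_diff_name-only

# Reset the time stamp only for files without uncommitted modifications:
# each one gets the author date of the last commit that touched it
comm -2 -3 .git_ls-tree_r_name-only_HEAD .git_diff_name-only | while IFS= read -r filename; do
  unixtime=$(git log -1 --format="%at" -- "${filename}")
  touchtime=$(get_touch_time)
  echo "${touchtime}" "${filename}"
  touch -t "${touchtime}" "${filename}"
done

# Remove the temporary lists
rm .git_ls-tree_r_name-only_HEAD .git_diff_name-only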
```
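
If you prefer the terminal to the Files pane, installing the hook can be sketched as below. This is a sketch under assumptions, not the method shown in the video: it uses a throwaway repository, a one-line stand-in script saved under the hypothetical name `post-checkout.sh`, and `chmod +x` to set the executable permission directly instead of inheriting it from a copied `.sample` file.

```shell
# Throwaway repository (all names are hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Example User"
git config user.email "user@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "First commit"

# Stand-in for the real script; replace its body with the post-checkout code
cat > post-checkout.sh <<'EOF'
#!/bin/sh -e
echo "post-checkout hook ran"
EOF

# Install it: the file must be named exactly 'post-checkout' and be executable
cp post-checkout.sh .git/hooks/post-checkout
chmod +x .git/hooks/post-checkout

# Switching branches now triggers the hook
git checkout -q -b test
```

With the real script in place, each branch switch prints one line per file whose time stamp was reset.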