diff --git a/01_syllabus.qmd b/01_syllabus.qmd index 7489d2c..f213965 100644 --- a/01_syllabus.qmd +++ b/01_syllabus.qmd @@ -1,6 +1,5 @@ --- title: "Syllabus" -subtitle: "" format: html: code-line-numbers: true @@ -9,6 +8,8 @@ editor_options: chunk_output_type: console --- +![Reggee and Aaron R. Williams](images/aaron-and-reggee.jpeg) + ```{r} #| echo: false diff --git a/05_advanced-quarto.qmd b/05_advanced-quarto.qmd index ac51ec0..72a7e00 100644 --- a/05_advanced-quarto.qmd +++ b/05_advanced-quarto.qmd @@ -9,6 +9,8 @@ editor_options: chunk_output_type: console --- +![Chemistry stencils that used to be used for drawing equipment in lab notebooks](images/Schablone_Logarex_25524-S,_Chemie_II.jpg) + ```{r hidden-here-load} #| include: false diff --git a/06_reproducible-research-with-git.qmd b/06_reproducible-research-with-git.qmd index 2ea50c6..818fd63 100644 --- a/06_reproducible-research-with-git.qmd +++ b/06_reproducible-research-with-git.qmd @@ -1,12 +1,18 @@ --- title: "Reproducible Research with Git and GitHub" -abstract: Git and Github are powerful software tools used to control different versions of a codebase, track changes, and collaborate with other programmers. This section introduces both tools. +abstract: Git and GitHub are powerful software tools used to control different versions of a codebase, track changes, and collaborate with other programmers. This section introduces both tools. 
format: html: toc: true code-line-numbers: true --- +```{r} +#| echo: false +exercise_number <- 1 + +``` + ```{r} #| label: quarto-setup #| echo: false @@ -16,9 +22,35 @@ format: knitr::opts_chunk$set(fig.align = "center") library(tidyverse) +library(gt) library(knitr) library(RXKCD) +source("src/motivation.R") + +``` + +```{r} +#| label: tbl-roadmap +#| tbl-cap: "Opinionated Analysis Development" +#| echo: false + +motivation |> + filter(!is.na(Section), Section == "Version Control") |> + select(-`Analysis Feature`) |> + arrange(Section) |> + gt() |> + tab_header( + title = "Opinionated Analysis Development" + ) |> + tab_footnote( + footnote = "Added by Aaron R. Williams", + locations = cells_column_labels(columns = c(Tool, Section)) + ) |> + tab_source_note( + source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.") + ) + ``` ::: {.callout-note} @@ -29,31 +61,37 @@ library(RXKCD) The command line (also known as shell or console) is a way of controlling computers without using a graphical user interface (i.e. pointing-and-clicking). The command line is useful because pointing-and-clicking is tough to reproduce or scale and because lots of useful software is only available through the command line. Furthermore, cloud computing often requires use of the command line. -We will run Bash, a command line program, using Terminal on Mac and Git Bash on Windows. Open Terminal like any other program on Mac. Right-click in a desired directory and select "Git Bash Here" to access Git Bash on Windows. - -![](images/terminal.png){width="400" fig-align="center" width=70%} +There are different ways to use the command line. -Fortunately, we only need to know a little Bash for version control with Git and cloud computing. +Macs use the Terminal (@fig-terminal). Open Terminal like any other program on Mac. 
-`pwd` - print working directory - prints the file path to the current location in the +![Mac Terminal](images/terminal.png){#fig-terminal width="400" fig-align="center" width=70%} -`ls` - list - lists files and folders in the current working directory. +Git Bash, which is installed with Git, works well on Windows. If you have Git Bash, you should be able to right-click in a desired directory and select "Git Bash Here" to access Git Bash on Windows. -`cd` - change directory - move the current working directory. +RStudio contains a terminal in the tab adjacent to the console (@fig-terminal-rstudio). This will allow us to work at the command line with a common experience on Mac-, Windows-, and Linux-based computers. -`mkdir` - make directory - creates a directory (folder) in the current working directory. +![RStudio Terminal](images/terminal-rstudio.png){#fig-terminal-rstudio width="400" fig-align="center" width=70%} -`touch` - creates a text file with the provided name. +### Bash -`mv` - move - moves a file from one location to the other. +Bash is a shell program and command language that allows us to control our computer at the command line. Fortunately, we only need to know a little Bash for version control with Git. -`cat` - concatenate - concatenate and print a file. +- `pwd` - print working directory - prints the file path to the current location in the +- `ls` - list - lists files and folders in the current working directory. +- `cd` - change directory - move the current working directory. Specify the relative path to move down in a directory. Use `cd ..` to move up a directory. +- `mkdir` - make directory - creates a directory (folder) in the current working directory. +- `touch` - creates a text file with the provided name. +- `mv` - move - moves a file from one location to the other. +- `cat` - concatenate - concatenate and print a file. +### Useful tips +- Tab completion can save a ton of typing.
Hitting tab twice shows all of the available options that can complete from the currently typed text. +- Hit the up arrow to cycle through previously submitted commands. +- Use `man ` to pull up help documentation. Hit `q` to exit. +- Typing `..` refers to the directory above the working directory. Writing `cd ..` changes to the directory above the working directory. +- Typing just `cd` changes to the home directory. ::: callout #### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"} @@ -64,9 +102,8 @@ exercise_number <- 1 + exercise_number ``` -1. Create a new directory called `cli-exercise`. -2. Navigate to this directory using `cd` in the Terminal or Git Bash. -3. Submit `pwd` to confirm you are in the correct directory. +1. Navigate to the `example-project` directory using `cd` in the RStudio terminal. +2. Submit `pwd` to confirm you are in the correct directory. ::: ::: callout @@ -88,8 +125,6 @@ ls ``` ::: - - ::: callout #### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"} @@ -134,15 +169,6 @@ cat poems/haiku.txt ``` ::: - -### Useful tips - -- Tab completion can save a ton of typing. Hitting tab twice shows all of the available options that can complete from the currently typed text. -- Hit the up arrow to cycle through previously submitted commands. -- Use `man ` to pull up help documentation. Hit `q` to exit. -- Typing `..` refers to the directory above the working directory. Writing `cd ..` changes to the directory above the working directory. -- Typing just `cd` changes to the home directory. - ### Programs We can run programs from the command line. Commands from programs always start with the name of the program. Running git commands intuitively start with `git`. For example: @@ -155,43 +181,75 @@ git status ## Why version control? -Version control is a system for managing and recording changes to files over time. Version control is essential to managing code and analyses. 
Good version control can: +::: {.callout-tip} +## Version Control + +Version control is a system for managing and recording changes to files over time. +::: + + Version control is essential to managing code and analyses. Good version control can: -- Limit the chance of making a mistake -- Maximize the chance of catching a mistake when it happens - Create a permanent record of changes to code - Easily undo mistakes by switching between iterations of code - Allow multiple paths of development while protecting working versions of code - Encourage communication between collaborators +- Facilitate multiple code reviews - Be used for external communication ## Why distributed version control? -*Centralized version control* stores all files and the log of those files in one centralized location. *Distributed version control* stores files and logs in one or many locations and has tools for combining the different versions of files and logs. +::: {.callout-tip} +## Centralized version control + +Centralized version control stores all files and the log of those files in one centralized location. +::: + +::: {.callout-tip} +## Distributed version control + +Distributed version control stores files and logs in one or many locations and has tools for combining the different versions of files and logs. +::: Centralized version control systems like Google Drive or Box are good for sharing a Microsoft Word document, but they are terrible for collaborating on code. Distributed version control allows for the simultaneous editing and running of code. It also allows for code development without sacrificing a working version of the code. +::: {.callout-note} +Git and GitHub are difficult to motivate a priori but the value is obvious after adopting the tools. We've done our best to motivate the tools. If you are unconvinced, we ask that you just trust us on this one. +::: + ## Git vs. 
GitHub +::: {.callout-tip} +## Git + Git is a distributed version-control system for tracking changes in code. Git is free, open-source software and can be used locally without an internet connection. It's like a turbo-charged version of Microsoft Word's track changes for code. +::: + +::: {.callout-tip} +## GitHub [GitHub](https://github.com/), which is owned by Microsoft, is an online hosting service for version control using Git. It also contains useful tools for collaboration and project management. It's like a turbo-charged version of Google Drive or Box for sharing repositories created using Git. +::: At first, it's easy to mix up Git and GitHub. Just try to remember that they are separate tools that complement each other well. +::: callout +#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"} + +1. If you don't already have an account, sign up for [GitHub](https://github.com/). +::: + ## SSH Keys for Authentication GitHub started requiring token-based or SSH-based authentication in [2021](https://github.blog/2020-12-15-token-authentication-requirements-for-git-operations/). We will focus on creating SSH keys for authentication. For instructions on creating a personal access token for authentication, see @sec-ap-a below. We will follow the [instructions for setting up SSH keys](https://happygitwithr.com/ssh-keys.html#option-2-set-up-from-the-shell) using the console, or terminal window, from Jenny Bryan's fantastic *Happy Git with R*. - ::: callout #### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"} -1. Follow the instructions above for setting up SSH keys using the console. We recommend using the default key location and key name. You can choose whether or not to add a password for the key. Note that if you choose to add a password, you will need to enter that password every time you perform operations with GitHub - so make sure you'll be able to remember it! +1. 
Follow [the instructions](https://happygitwithr.com/ssh-keys.html#option-2-set-up-from-the-shell) for setting up SSH keys using the console. We recommend using the default key location and key name. You can choose whether or not to add a password for the key. Note that if you choose to add a password, you will need to enter that password every time you perform operations with GitHub - so make sure you'll be able to remember it! 2. When you get to the section of the instructions to provide the public key to GitHub, we recommend obtaining the public key as follows: - In a terminal window, run `cat ~/.ssh/id_ed25519.pub` @@ -251,7 +309,7 @@ See @sec-app-b for the instructions on initializing a repo locally and then addi 4. Save the tracked files to the remote GitHub repository. 5. Repeat, repeat, repeat -[^files]: Github refuses to store files larger than 100 MiB. This poses a challenge to writing reproducible code. However, many data sources can be downloaded directly from the web or via APIs, allowing code to be reproducible without relying on storing large data sets on Github. Materials later in this book discuss scaping data from the web and using APIs. +[^files]: GitHub refuses to store files larger than 100 MiB. This poses a challenge to writing reproducible code. However, many data sources can be downloaded directly from the web or via APIs, allowing code to be reproducible without relying on storing large data sets on GitHub. Materials later in this book discuss scraping data from the web and using APIs.
![](images/git-github-workflow.jpeg){fig-align="center" width=70%} diff --git a/_quarto.yml b/_quarto.yml index 39bb355..488a7f3 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -24,9 +24,15 @@ book: - 06_reproducible-research-with-git.qmd - 07_advanced-git.qmd - part: Programming + chapters: + - functions-and-tests.qmd + - assertive-testing.qmd - part: Environment Management chapters: - renv.qmd + - part: Culture and Ethics + chapters: + - culture-and-ethics.qmd - references.qmd appendices: - reproducible-research-bootcamp_software-installation.qmd diff --git a/assertive-testing.qmd b/assertive-testing.qmd new file mode 100644 index 0000000..65cfa7d --- /dev/null +++ b/assertive-testing.qmd @@ -0,0 +1,351 @@ +--- +title: "Modular, Tested Code" +abstract: "" +format: + html: + code-line-numbers: true + fig-align: "center" +editor_options: + chunk_output_type: console +bibliography: references.bib +--- + +![A multiple-choice test](images/Exams_Start..._Now.jpg) + +~ Photo by [Ryan McGilchrist](https://en.wikipedia.org/wiki/Multiple_choice#/media/File:Exams_Start..._Now.jpg) + +```{r hidden-here-load} +#| include: false + +exercise_number <- 1 +``` + +```{r} +#| echo: false +#| warning: false + +library(tidyverse) +library(gt) + +source("src/motivation.R") + +``` + +```{r} +#| label: tbl-roadmap +#| tbl-cap: "Opinionated Analysis Development" +#| echo: false + +motivation |> + filter(!is.na(Section), Section == "Programming") |> + select(-`Analysis Feature`) |> + arrange(Section) |> + gt() |> + tab_header( + title = "Opinionated Analysis Development" + ) |> + tab_footnote( + footnote = "Added by Aaron R. Williams", + locations = cells_column_labels(columns = c(Tool, Section)) + ) |> + tab_source_note( + source_note = md("**Source:** Parker, Hilary. n.d. 
“Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.") + ) + +``` + +## Assertive Testing of Data + +>While reproducibility drastically reduces the number of errors and opacity of analysis, without assertive testing it runs the risk of applying an analysis to corrupted data, or applying an analysis to data that have drifted too far from assumptions. ~ Parker + +Assertions are useful for verifying the quality of data. Many of the principles from assertions and unit testing for functions apply: + +- Fail fast, fail often +- Fail loudly +- Fail clearly + +Assertive testing of data and assumptions is often much squishier than the unit testing and assertions from the previous section. We must now rely on subject matter expertise and experience with the data to develop assertions that can catch corruptions of the data or data processing mistakes. + +> Assertive testing means establishing these quality-control checks – usually based on past knowledge of possible corruptions of the data – and halting an analysis if the quality-control checks are not passed, so the analyst can investigate and hopefully fix (or at least account for) the problem. ~ Parker + +### `library(assertr)` + +[`library(assertr)`](https://docs.ropensci.org/assertr/) is a framework for applying assertions to data frames in R. It works well with the pipe (`%>%` or `|>`) because the first argument of the five main functions is always a data frame. + +::: {.callout-tip} +## Predicate Function + +A predicate function is a function that returns a single `TRUE` or `FALSE`. +::: + +`verify()` takes a logical expression. If all values are `TRUE` for the logical expression, the code proceeds. If any value is `FALSE` for the logical expression, the code terminates and returns a diagnostic tibble. + +```{r} +library(assertr) + +msleep %>% + verify(nrow(.)
== 83) |> + verify(sleep_total < 24) |> + verify(has_class("sleep_total", class = "numeric")) + +``` + +```{r} +#| eval: false +msleep %>% + verify(nrow(.) == 82) |> + verify(sleep_total < 14) |> + verify(has_class("sleep_total", class = "character")) + +``` + +``` +verification [nrow(.) == 82] failed! (1 failure) + + verb redux_fn predicate column index value +1 verify NA nrow(.) == 82 NA 1 NA + +Error: assertr stopped execution + +``` + +`assert()` takes a predicate function and an arbitrary number of variables. `assert()` will terminate if any values violate the predicate function. It can apply the same test to multiple variables at once. + +```{r} +msleep %>% + assert(within_bounds(0, 24), c(sleep_total, sleep_rem, sleep_cycle)) + +``` + +`insist()` is like `assert()`, but `insist()` can make assertions based on the observed data (e.g. throw an error if any value lies more than four sample standard deviations from the sample mean). + +```{r} +msleep %>% + insist(within_n_sds(n = 3), sleep_total) + +``` + +`assert_rows()` extends `assert()` so the assertion can rely on values from multiple columns (e.g. row means within a bound or row must have a certain number of non-missing values). + +```{r} +msleep |> + assert_rows(num_row_NAs, within_bounds(0, 5), everything()) + +``` + +`insist_rows()` extends `insist()` so the assertion can rely on values from multiple columns. This is less common but can be used to see if any observation exceeds a certain Mahalanobis distance from other rows. + +- `verify()` predicate functions + - `has_all_names()` + - `has_only_names()` + - `has_class()` +- `assert()` predicate functions + - `not_na()` + - `within_bounds()` + - `in_set()` + - `is_uniq()` +- `insist()` predicate functions + - `within_n_sds()` + - `within_n_mads()` + +:::callout + +#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"} + +```{r} +#| echo: false + +exercise_number <- exercise_number + 1 +``` + +1. Run `glimpse(trees)`. +2.
`verify()` that the variable `Girth` is numeric. +3. `assert()` that all three variables are in the interval $[0, \infty)$. +::: + +[This vignette](https://docs.ropensci.org/assertr/) demonstrates additional functionality. + +`library(assertr)` is designed to be used early in a workflow. If you want to run the assertions at the end of the workflow and you don't want to see printed tibble after printed tibble, end the chain of code with the following custom function. + +``` +#' Helper function to silence output from testing code +#' +#' @param data A data frame +#' +quiet <- function(data) { + + # return the data invisibly so nothing is printed + invisible(data) + +} +``` + +Example: [Boosting Upward Mobility from Poverty](https://github.com/UI-Research/mobility-from-poverty/blob/version2024/10_construct-database/11_construct_county_all.qmd) + +### Other Assertions + +[`library(tidylog)`](https://cran.r-project.org/web/packages/tidylog/readme/README.html) prints diagnostic information when functions from `library(dplyr)` and `library(tidyr)` are used. + +```{r} +library(tidylog) + + +``` + +```{r} +math_scores <- tribble( + ~name, ~math_score, + "Alec", 95, + "Bart", 97, + "Carrie", 100 +) + +reading_scores <- tribble( + ~name, ~reading_score, + "Alec", 88, + "Bart", 67, + "Carrie", 100, + "Zeta", 100 +) + +left_join(x = math_scores, y = reading_scores, by = "name") + +full_join(x = math_scores, y = reading_scores, by = "name") + +``` + +We'll detach tidylog to keep the rest of this document clean. + +```{r} +detach("package:tidylog", unload = TRUE) + +``` + +::: {.callout-note} +`library(tidylog)` is excellent for interactive development of data analyses. + +If you look at `library(tidylog)` output *more than once*, then write an assertion to capture the same information. +::: + +#### Missing Values + +The following throws an error if the data set contains any missing values.
+ +```{r} +missing_values <- map_dbl(.x = trees, ~sum(is.na(.x))) + +stopifnot(sum(missing_values) == 0) + +``` + +#### Joins + +Joins are one of the most dangerous parts of any data analysis. We can think of many different types of joins: + +- "one-to-one" +- "one-to-many" +- "many-to-one" +- "many-to-many" + +We can provide an expectation for the type of join using the `relationship` argument in `*_join()` functions. This is an assertion. + +Consider the test scores data sets from earlier. This should be a one-to-one join because each row in `x` matches at most 1 row in `y` and each row in `y` matches at most 1 row in `x`. + +```{r} +math_scores <- tribble( + ~name, ~math_score, + "Alec", 95, + "Bart", 97, + "Carrie", 100 +) + +reading_scores <- tribble( + ~name, ~reading_score, + "Alec", 88, + "Bart", 67, + "Carrie", 100, + "Zeta", 100 +) + +left_join( + x = math_scores, + y = reading_scores, + by = "name", + relationship = "one-to-one" +) + +``` + +Suppose there were two `"Alec"` in either data set. Then this code would throw a loud error. + +#### Pivots + +Pivots are also one of the most dangerous parts of any data analysis. We can write tests for the number of rows and the class for the output of pivots. + +Consider `table4a` from `library(tidyr)`. + +```{r} +table4a + +``` + +We want to pivot this data set to be longer because the data set isn't [tidy](https://r4ds.had.co.nz/tidy-data.html). Before writing code to tidy the data, we can probably come up with a few assertions: + +- There should be six rows. +- `year` and `cases` should be numeric. + +```{r} +table4a_tidy <- table4a |> + pivot_longer( + cols = c(`1999`, `2000`), + names_to = "year", + values_to = "cases" + ) |> + mutate(year = as.numeric(year)) + +stopifnot(nrow(table4a_tidy) == 6) +stopifnot(class(pull(table4a_tidy, year)) == "numeric") +stopifnot(class(pull(table4a_tidy, cases)) == "numeric") + +``` + +It's easy to get tired and to cut corners. Assertions never rest. 
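The row-count and class checks above could be bundled into a small helper so every pivot in a project is guarded the same way. The following is a minimal sketch and not part of the original materials: the name `assert_pivot()` and its arguments are hypothetical, and it uses `is.numeric()`, which also accepts integer columns, rather than an exact class comparison.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (an assumption, not from the source materials):
# stop unless `data` has the expected number of rows and every column
# named in `numeric_cols` is numeric.
assert_pivot <- function(data, expected_rows, numeric_cols) {
  if (nrow(data) != expected_rows) {
    stop(
      "expected ", expected_rows, " rows but found ", nrow(data),
      call. = FALSE
    )
  }
  for (col in numeric_cols) {
    if (!is.numeric(data[[col]])) {
      stop("column `", col, "` is not numeric", call. = FALSE)
    }
  }
  # return the data invisibly so the helper can end a pipeline
  invisible(data)
}

# Same pivot as above, now ending with the assertion
table4a_tidy <- table4a |>
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  ) |>
  mutate(year = as.numeric(year)) |>
  assert_pivot(expected_rows = 6, numeric_cols = c("year", "cases"))
```

Because the helper returns its input, it can sit at the end of the pipe without changing the result. If the pivot ever yields a different number of rows, the script halts instead of continuing silently.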
+ +> Understand that your assertion is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity or remorse or fear. It absolutely will not stop ever until your analysis is correct. ~ [Terminator (sort of)](https://www.youtube.com/watch?v=zu0rP2VWLWw) + +## Assertive Testing of Assumptions + +Assertive testing of assumptions is the squishiest of everything we've considered testing. We don't want to apply an analysis to data that have drifted too far from the assumptions of the analysis. We also don't want to inappropriately apply a set of binary tests (think mechanical null hypothesis testing with p-values). + +At the very least, we should include visualizations and diagnostic tests that systematically explore the assumptions of an analysis in our Quarto documents. Then, we can use version control to track if anything changed unexpectedly. + +Beyond that, we need to rely on subject matter expertise to come up with heuristics for assertions. + +## Profiling and Benchmarking + +We skipped the question "If you are not using efficient code, will you be able to identify it?" + +Human time is expensive. Machine time is cheap. All else equal, we shouldn't worry too much about making our code more efficient. + +Sometimes, it is necessary to make our code more efficient. After all, who cares if our analysis is reproducible if it takes two weeks to run? + +::: {.callout-tip} +## Profiling + +Profiling is the systematic measurement of the run-time of each line of code. +::: + +::: {.callout-tip} +## Benchmarking + +Benchmarking is the precise measurement of the performance of a small piece of code. Typically, the code is run multiple times to improve the precision of the measurement. +::: + +Systematically making code more efficient generally proceeds in three steps: + +- Step 1: Profile the entire set of code to identify bottlenecks. +- Step 2: Benchmark small pieces of code that are responsible for the bottleneck.
+- Step 3: Try to improve the slow pieces of code. Return to step 2 to evaluate the result. + +RStudio has built-in tools for profiling the run time and memory usage of large chunks of code. See [this section](https://adv-r.hadley.nz/perf-measure.html#profiling) of Advanced R to learn more. + +`library(microbenchmark)` has robust tools for benchmarking code. See [this section](https://adv-r.hadley.nz/perf-measure.html#microbenchmarking) of Advanced R to learn more. diff --git a/culture-and-ethics.qmd b/culture-and-ethics.qmd new file mode 100644 index 0000000..2d57a3f --- /dev/null +++ b/culture-and-ethics.qmd @@ -0,0 +1,86 @@ +--- +title: "Culture and Ethics" +abstract: "" +format: + html: + code-line-numbers: true + fig-align: "center" +editor_options: + chunk_output_type: console +bibliography: references.bib +--- + +![An early expression of culture](images/9_Bisonte_Magdaleniense_polícromo.jpg) + +~ Photo by [Museo de Altamira y D. Rodríguez](https://en.wikipedia.org/wiki/Culture#/media/File:9_Bisonte_Magdaleniense_pol%C3%ADcromo.jpg) + +```{r hidden-here-load} +#| include: false + +exercise_number <- 1 +``` + +## Culture + +Tools and strategies will only go so far. Culture is essential to building reproducible data analyses. + +People who are scared to be wrong in public will not work as hard to find errors. They will be less likely to share their code and data and to adopt transparent and reproducible practices. + +Instead of dunking on people for corrections or errors, we should celebrate their transparency. + +### Continuous Improvement + +Continuous improvement is an ongoing practice to improve processes and outputs. Continuous improvement focuses on incremental improvement over breakthrough improvement. + +There are many different continuous improvement models but most have at least three features: + +1. Analyze performance +2. Identify areas for improvement +3.
Make incremental changes + +### Error Log + +Code reviews, tests, and assertions are essential for analyzing performance and identifying areas for improvement. + +Error logs are one tool for analyzing performance. Any time an error makes it past code review, document the error in a running tracker and note the plan for remedying the error. The errors can be analytic (e.g. `2 + 2 = 5`) or process (e.g. "we started too late on data collection and missed our deadline"). + +### Blameless Postmortem + +Incident postmortems are common in data engineering: + +> An incident postmortem brings people together to discuss the details of an incident: why it happened, its impact, what actions were taken to mitigate it and resolve it, and what should be done to prevent it from happening again. \~ [Atlassian](https://www.atlassian.com/incident-management/postmortem) + +A blameless postmortem approaches the incident postmortem without any cynicism or hidden agendas: + +> In a blameless postmortem, it’s assumed that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying—and punishing—whoever screwed up, blameless postmortems focus on improving performance moving forward. \~ [Atlassian](https://www.atlassian.com/incident-management/postmortem/blameless) + +The term "incident postmortem" hides some of the value of this approach. + +1. We don't need an incident to host this type of discussion. +2. This type of meeting need not be post-data analysis. + +::: {.callout-note} +Holding regular meetings where it is assumed that everyone acted with good intentions to analyze performance, identify areas for improvement, and make incremental changes will improve collaboration and strengthen data analyses. +::: + +## Ethics + +We could talk for eight more hours about the ethics of statistics and data science.
+ +The social sciences are in a multi-decade transformation motivated by a series of major issues: + +- Multiple testing +- Hypothesizing After Results are Known (HARKing) +- p-hacking +- Publication bias + +Transparency and reproducibility help with some of these issues. Adopting version control *at the beginning* can help too. Pre-registration is a final tool that can help [@nosek2018]. + +::: callout-note +## Pre-Registration + +Pre-registration is the process of submitting a pre-analysis plan. This differentiates hypothesis generation and hypothesis testing, which is necessary because the same data cannot be used to generate and test a hypothesis. +::: + +- [Center for Open Science pre-registration resources](https://www.cos.io/initiatives/prereg) +- [Open Science pre-registration resources](https://help.osf.io/article/145-preregistration) diff --git a/docs/01_syllabus.html b/docs/01_syllabus.html index df47779..e461205 100644 --- a/docs/01_syllabus.html +++ b/docs/01_syllabus.html @@ -199,8 +199,27 @@