diff --git a/.Rproj.user/EEE9B81B/pcs/source-pane.pper b/.Rproj.user/EEE9B81B/pcs/source-pane.pper index 902cc6f..ddca97d 100644 --- a/.Rproj.user/EEE9B81B/pcs/source-pane.pper +++ b/.Rproj.user/EEE9B81B/pcs/source-pane.pper @@ -1,3 +1,3 @@ { - "activeTab": 0 + "activeTab": 2 } \ No newline at end of file diff --git a/.quarto/_freeze/01-basics/execute-results/html.json b/.quarto/_freeze/01-basics/execute-results/html.json index f606372..45e5d0d 100644 --- a/.quarto/_freeze/01-basics/execute-results/html.json +++ b/.quarto/_freeze/01-basics/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7d2dcc8f9868107397250763dbc8ea6c", + "hash": "0a40a1abf3ec5b113cf949b0a8f93347", "result": { - "markdown": "# Projects and R Markdown {#sec-basics}\n\n## Intended Learning Outcomes {.unnumbered}\n\nBy the end of this chapter, you should be able to:\n\n- Re-familiarise yourself with setting up projects\n- Re-familiarise yourself with RMarkdown documents\n- Recap and apply data wrangling procedures to analyse data\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n## R and R Studio\n\nRemember, R is a programming language that you will write code in and RStudio is an Integrated Development Environment (IDE) which makes working with R easier as it's more user friendly. You need both components for this course.\n\nIf this is not ringing any bells yet, have a quick browse through the [materials from year 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#sec-intro-r){target=\"_blank\"} to refresh your memory.\n\n\n### R server\n\nUse the server *only* if you are unable to install R and RStudio on your computer (e.g., if you are using a Chromebook) or if you encounter issues while installing R on your own machine. Otherwise, you should install R and RStudio directly on your own computer. 
R and RStudio are already installed on the *R server*.\n\nYou will find the link to the server on Moodle.\n\n\n### Installing R and RStudio on your computer\n\nThe [RSetGo book](https://psyteachr.github.io/RSetGo/){target=\"_blank\"} provides detailed instructions on how to install R and RStudio on your computer. It also includes links to walkthroughs for installing R on different types of computers and operating systems.\n\nIf you had R and RStudio installed on your computer last year, we recommend updating to the latest versions. In fact, it’s a good practice to update them at the start of each academic year. Detailed guidance can be found in @sec-updating-r.\n\nOnce you have installed or updated R and RStudio, return to this chapter.\n\n\n### Settings for Reproducibility\n\nBy now, you should be aware that the Psychology department at the University of Glasgow places a strong emphasis on reproducibility, open science, and raising awareness about questionable research practices (QRPs) and how to avoid them. Therefore, it's important that you work in a reproducible manner so that others (and your future self) can understand and check your work. This also makes it easier for you to reuse your work in the future.\n\nAlways start with a clear workspace. If your `Global Environment` contains anything from a previous session, you can’t be certain whether your current code is working as intended or if it’s using objects created earlier.\n\nTo ensure a clean and reproducible workflow, there are a few settings you should adjust immediately after installing or updating RStudio. In Tools \\> Global Options... 
General tab\n\n* Uncheck the box labelled Restore .RData into workspace at startup to make sure no data from a previous session is loaded into the environment\n* Set Save workspace to .RData on exit to **Never** to prevent your workspace from being saved when you exit RStudio.\n\n![Reproducibility settings in Global Options](images/rstudio_settings_reproducibility.png)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for keeping tabs on parentheses\n\nRStudio includes **rainbow parentheses** to help you keep track of your brackets.\n\nTo enable the feature, go to Tools \\> Global Options... Code tab \\> Display tab and tick the last checkbox \"Use rainbow parentheses\"\n\n![Enable rainbow parentheses](images/rainbow.PNG)\n\n:::\n\n### RStudio panes\n\nRStudio has four main panes, each in a quadrant of your screen:\n\n* Source pane\n* Environment pane\n* Console pane\n* Output pane\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAre you ready for a quick quiz to see what you remember about the RStudio panes from last year? Click on **Quiz** to see the questions.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Quiz\n\n**What is their purpose?**\n\n**The Source pane...**
\n\n\n**The Environment pane...**
\n\n\n**The Console pane...**
\n\n\n**The Output pane...**
\n\n\n**Where are these panes located by default?**\n\n* The Source pane is located? \n* The Environment pane is located? \n* The Console pane is located? \n* The Output pane is located? \n\n:::\n\n:::\n\nIf you were not quite sure about any of the panes, check out the [materials from Level 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#rstudio-panes){target=\"_blank\"}. If you want to know more about them, there is the [RStudio guide on Posit](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html){target=\"_blank\"}.\n\n\n\n## Activity 1: Creating a new project {#sec-project}\n\nIt's important to create a new RStudio project whenever you start a new piece of work. This practice makes it easier to work in multiple contexts, such as when analysing different datasets simultaneously. Each RStudio project has its own folder location, workspace, and working directory, which keeps all your data and RMarkdown documents organised in one place.\n\nLast year, you learnt how to create projects on the server, so you already know the steps. If you cannot quite recall how that was done, go back to the [Level 1 materials](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#new-project){target=\"_blank\"}.\n\nOn your own computer, open RStudio, and complete the following steps in this order:\n\n* Click on File \\> New Project...\n* Then, click on \"New Directory\"\n* Then, click on \"New Project\"\n* Name the directory something meaningful (e.g., \"2A_chapter1\"), and save it in a location that makes sense, for example, a dedicated folder you have for your level 2 Psychology labs - you can either select a folder you have already in place or create a new one (e.g., I named my new folder \"Level 2 labs\")\n* Click \"Create Project\". RStudio will restart itself and open with this new project directory as the working directory. 
If you accidentally close it, you can open it by double-clicking on the project icon in your folder\n* You can also check in your folder structure that everything was created as intended\n\n![Creating a new project](images/project_setup.gif)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Why is the colour scheme in the gif different to my version?\n\nIn case anyone is wondering why my colour scheme in the gif above looks different to yours, I've set mine to \"Pastel On Dark\" in Tools \\> Global Options... \\> Appearance. And my computer lives in \"dark mode\".\n\n:::\n\n::: callout-important\n\n## Don't nest projects\n\nDon't ever save a new project **inside** another project directory. This can cause some hard-to-resolve problems.\n\n:::\n\n\n## Activity 2: Create a new R Markdown file {#sec-rmd}\n\n* Open a new R Markdown document: click File \\> New File \\> R Markdown or click on the little page icon with a green plus sign (top left).\n* Give it a meaningful `Title` (e.g., Level 2 chapter 1) - you can also change the title later. Feel free to add your name or GUID in the `Author` field. Keep the `Default Output Format` as HTML.\n* Once the `.Rmd` has opened, you need to save the file.\n* To save it, click File \\> Save As... or click on the little disc icon. Name it something meaningful (e.g., \"chapter_01.Rmd\", \"01_intro.Rmd\"). Make sure there are no spaces in the name - R is not very fond of spaces... This file will automatically be saved in your project folder (i.e., your working directory) so you should now see this file appear in your file viewer pane.\n\n\n![Creating a new `.Rmd` file](images/Rmd_setup.gif)\n\n\nRemember, an R Markdown document or `.Rmd` has \"white space\" (i.e., the markdown for formatted text) and \"grey parts\" (i.e., code chunks) in the default colour scheme (see @fig-rmd). R Markdown is a powerful tool for creating dynamic documents because it allows you to integrate code and regular text seamlessly. 
You can then knit your `.Rmd` using the `knitr` package to create a final document as either a webpage (HTML), a PDF, or a Word document (.docx). We'll only knit to HTML documents in this course.\n\n\n![R markdown anatomy (image from [https://intro2r.com/r-markdown-anatomy.html](https://intro2r.com/r-markdown-anatomy.html){target=\"_blank\"})](images/rm_components.png)\n\n\n\n### Markdown\n\nThe markdown space in an `.Rmd` is ideal for writing notes that explain your code and document your thought process. Use this space to clarify what your code is doing, why certain decisions were made, and any insights or conclusions you have drawn along the way. These notes are invaluable when revisiting your work later, helping you (or others) understand the rationale behind key decisions, such as setting inclusion/exclusion criteria or interpreting the results of assumption tests. Effectively documenting your work in the markdown space enhances both the clarity and reproducibility of your analysis.\n\nThe markdown space offers a variety of formatting options to help you organise and present your notes effectively. Here are a few of them that can enhance your documentation:\n\n#### Heading levels {.unnumbered}\n\nThere is a variety of **heading levels** to make use of, using the `#` symbol.\n\n\n::: columns\n\n::: column\n\n##### You would incorporate this into your text as: {.unnumbered}\n\n\\# Heading level 1\n\n\\## Heading level 2\n\n\\### Heading level 3\n\n\\#### Heading level 4\n\n\\##### Heading level 5\n\n\\###### Heading level 6\n\n:::\n\n::: column\n\n##### And it will be displayed in your knitted html file as: {.unnumbered}\n\n![](images/heading_levels.PNG)\n\n:::\n\n:::\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My heading levels don't render properly when knitting\n\nYou need a space between the # and the first letter. 
If the space is missing, the heading will be displayed in the HTML file as ...\n\n#Heading 1\n\n:::\n\n#### Unordered and ordered lists {.unnumbered}\n\nYou can also include **unordered lists** and **ordered lists**. Click on the tabs below to see how they are incorporated\n\n::: panel-tabset\n\n## unordered lists\n\nYou can add **bullet points** using either `*`, `-` or `+` and they will turn into:\n\n* bullet point (created with `*`)\n* bullet point (created with `-`)\n+ bullet point (created with `+`)\n\nor use bullet points of different levels using 1 tab key press or 2 spaces (for sub-item 1) or 2 tabs/4 spaces (for sub-sub-item 1):\n\n* bullet point item 1\n * sub-item 1\n * sub-sub-item 1\n * sub-sub-item 2\n* bullet point item 2\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My bullet points don't render properly when knitting\n\nYou need an empty row before your bullet points start. If I delete the empty row before the bullet points, they will be displayed in the HTML as ...\n\nText without the empty row: * bullet point created with `*` - bullet point created with `-` + bullet point created with `+`\n\n:::\n\n\n## ordered lists\n\nStart the line with **1.**, **2.**, etc. When you want to include sub-items, either use the `tab` key twice or add **4 spaces**. Same goes for the sub-sub-item: include either 2 tabs (or 4 manual spaces) from the last item or 4 tabs/ 8 spaces from the start of the line.\n\n1. list item 1\n2. list item 2\n i) sub-item 1 (with 4 spaces)\n A. sub-sub-item 1 (with an additional 4 spaces from the last indent)\n\n::: {.callout-important collapse=\"true\"}\n\n## My list items don't render properly when knitting\n\nIf you don't leave enough spaces, the list won't be recognised, and your output looks like this:\n\n3. list item 3\n i) sub-item 1 (with only 2 spaces) \n A. 
sub-sub-item 1 (with an additional 2 spaces from the last indent)\n\n:::\n\n\n## ordered lists magic\n\nThe great thing though is that you don't need to know your alphabet or number sequences. R markdown will fix that for you\n\nIf I type into my `.Rmd`...\n\n![](images/list_magic.PNG)\n\n...it will be rendered in the knitted HTML output as...\n\n3. list item 3\n1. list item 1\n a) sub-item labelled \"a)\"\n i) sub-item labelled \"i)\"\n C) sub-item labelled \"C)\"\n Z) sub-item labelled \"Z)\"\n7. list item 7\n\n\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: The labels of the sub-items are not what I thought they would be. You said they are fixing themselves...\n\nYes, they do but you need to label your sub-item lists accordingly. The first label you list in each level is set as the baseline. If they are labelled `1)` instead of `i)` or `A.`, the output will show as follows, but the automatic-item-fixing still works:\n\n7. list item 7\n 1) list item \"1)\" with 4 spaces\n 1) list item \"1)\" with 8 spaces\n 6) this is an item labelled \"6)\" (magically corrected to \"2.\")\n:::\n\n:::\n\n#### Emphasis {.unnumbered}\n\nInclude **emphasis** to draw attention to keywords in your text:\n\n| R markdown syntax | Displayed in the knitted HTML file |\n|:----------------------------|:-----------------------------------|\n| \\*\\*bold text\\*\\* | **bold text** |\n| \\*italic text\\* | *italic text* |\n| \\*\\*\\*bold and italic\\*\\*\\* | ***bold and italic*** |\n\n\nOther examples can be found in the [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target=\"_blank\"}\n\n\n\n### Code chunks {#sec-chunks}\n\nEverything you write inside the **code chunks** will be interpreted as code and executed by R. Code chunks start with ```` ``` ```` followed by an `{r}` which specifies the coding language R, some space for code, and ends with ```` ``` ````. 
If you accidentally delete one of those backticks, your code won't run and/or your text parts will be interpreted as part of the code chunks or vice versa. This should be evident from the colour change - more white than expected typically indicates missing starting backticks, whilst too much grey/not enough white suggests missing ending backticks. But no need to fret if that happens - just add the missing backticks manually.\n\n\nYou can **insert a new code chunk** in several ways:\n\n\n* Click the `Insert a new code chunk` button in the RStudio Toolbar (green icon at the top right corner of the `Source pane`).\n* Select Code \\> Insert Chunk from the menu.\n* Using the shortcut `Ctrl + Alt + I` for Windows or `Cmd + Option + I` on MacOSX.\n* Type ```` ```{r} ```` and ```` ``` ```` manually\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Default `.Rmd` with highlighting - names in pink and knitr display options in purple](images/default_highlighted.png){#fig-rmd fig-align='center' width=100%}\n:::\n:::\n\n\n\n\nWithin the curly brackets of a code chunk, you can **specify a name** for the code chunk (see pink highlighting in @fig-rmd). The chunk name is not necessarily required; however, it is good practice to give each chunk a unique name to support more advanced knitting approaches. It also makes it easier to reference and manage chunks.\n\nWithin the curly brackets, you can also place **rules and arguments** (see purple highlighting in @fig-rmd) to control how your code is executed and what is displayed in your final HTML output. 
The most common **knitr display options** include:\n\n\n| Code | Does code run | Does code show | Do results show |\n|:--------------------|:-------------:|:--------------:|:---------------:|\n| eval=FALSE | NO | YES | NO |\n| echo=TRUE (default) | YES | YES | YES |\n| echo=FALSE | YES | NO | YES |\n| results='hide' | YES | YES | NO |\n| include=FALSE | YES | NO | NO |\n\n\n::: callout-important\n\nThe table above will be incredibly important for the data skills homework II. When solving error mode items you will need to pay attention to the first one, `eval = FALSE`.\n\n:::\n\nOne last thing: In your newly created `.Rmd` file, delete everything below line 12 (keep the set-up code chunk) and save your `.Rmd` by clicking on the disc symbol.\n\n![Delete everything below line 12](images/delete_12.gif)\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nThat was quite a long section about what Markdown can do. I promise, we'll practise that more later. For now, we want you to create a new level 2 heading on line 12 and give it a meaningful heading title (something like \"Loading packages and reading in data\" or \"Chapter 1\").\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\nOn line 12, you should have typed **## Loading packages and reading in data** (or whatever meaningful title you chose). This will create a level 2 heading once we knit the `.Rmd`.\n\n:::\n\n:::\n\n\n## Activity 3: Download the data {#sec-download_data_ch1}\n\nThe data for chapters 1-3 can be downloaded here: [data_ch1.zip](data/data_ch1.zip \"download\"). There are 2 files contained in a zip folder. 
One is the data file we are going to use today `prp_data_reduced.csv` and the other is an Excel file `prp_codebook` that explains the variables in the data.\n\nThe first step is to **unzip the zip folder** so that the files are placed within the same folder as your project.\n\n* Place the zip folder within your 2A_chapter1 folder\n* Right mouse click --> `Extract All...`\n* Check the folder location is the one to extract the files to\n* Check the extracted files are placed next to the project icon\n* Files and project should be visible in the Output pane in RStudio\n\n::: {.callout-note collapse=\"true\"}\n\n## For screenshots click here\n\n::: {layout-ncol=\"1\"}\n\n![](images/pic1.PNG){fig-align=\"center\"}\n\n![](images/pic23.PNG){fig-align=\"center\"}\n\n![](images/pic45.PNG){fig-align=\"center\"}\n\nUnzipping a zip folder\n\n:::\n:::\n\nThe paper by Pownall et al. was a **registered report** published in 2023, and the original data can be found on OSF ([https://osf.io/5qshg/](https://osf.io/5qshg/){target=\"_blank\"}).\n\n**Citation**\n\n> Pownall, M., Pennington, C. R., Norris, E., Juanchich, M., Smailes, D., Russell, S., Gooch, D., Evans, T. R., Persson, S., Mak, M. H. C., Tzavella, L., Monk, R., Gough, T., Benwell, C. S. Y., Elsherif, M., Farran, E., Gallagher-Mitchell, T., Kendrick, L. T., Bahnmueller, J., . . . Clark, K. (2023). Evaluating the Pedagogical Effectiveness of Study Preregistration in the Undergraduate Dissertation. *Advances in Methods and Practices in Psychological Science, 6*(4). [https://doi.org/10.1177/25152459231202724](https://doi.org/10.1177/25152459231202724){target=\"_blank\"}\n\n**Abstract**\n\n> Research shows that questionable research practices (QRPs) are present in undergraduate final-year dissertation projects. 
One entry-level Open Science practice proposed to mitigate QRPs is “study preregistration,” through which researchers outline their research questions, design, method, and analysis plans before data collection and/or analysis. In this study, we aimed to empirically test the effectiveness of preregistration as a pedagogic tool in undergraduate dissertations using a quasi-experimental design. A total of 89 UK psychology students were recruited, including students who preregistered their empirical quantitative dissertation (*n* = 52; experimental group) and students who did not (*n* = 37; control group). Attitudes toward statistics, acceptance of QRPs, and perceived understanding of Open Science were measured both before and after dissertation completion. Exploratory measures included capability, opportunity, and motivation to engage with preregistration, measured at Time 1 only. This study was conducted as a Registered Report; Stage 1 protocol: https://osf.io/9hjbw (date of in-principle acceptance: September 21, 2021). Study preregistration did not significantly affect attitudes toward statistics or acceptance of QRPs. However, students who preregistered reported greater perceived understanding of Open Science concepts from Time 1 to Time 2 compared with students who did not preregister. Exploratory analyses indicated that students who preregistered reported significantly greater capability, opportunity, and motivation to preregister. Qualitative responses revealed that preregistration was perceived to improve clarity and organization of the dissertation, prevent QRPs, and promote rigor. Disadvantages and barriers included time, perceived rigidity, and need for training. 
These results contribute to discussions surrounding embedding Open Science principles into research training.\n\n**Changes made to the dataset**\n\nWe made some changes to the dataset for the purpose of increasing difficulty for data wrangling (@sec-wrangling and @sec-wrangling2) and data visualisation (@sec-dataviz and @sec-dataviz2). This will ensure some \"teachable moments\". The changes are as follows:\n\n* We removed some of the variables to make the data more manageable for teaching purposes.\n* We recoded some values from numeric responses to labels (e.g., `understanding`).\n* We added the word \"years\" to one of the `Age` entries.\n* We tidied a messy column `Ethnicity` but introduced a similar but easier-to-solve \"messiness pattern\" when recoding the `understanding` data.\n* The scores in the original file were already corrected from reverse-coded responses. We reversed that process to present raw data here.\n\n\n\n\n## Activity 4: Installing packages, loading packages, and reading in data\n\n### Installing packages\n\nWhen you install R and RStudio for the first time (or after an update), most of the packages we will be using won’t be pre-installed. Before you can load new packages like `tidyverse`, you will need to install them.\n\nIf you try to load a package that has not been installed yet, you will receive an error message that looks something like this: `Error in library(tidyverse) : there is no package called 'tidyverse'`. \n\nTo fix this, simply install the package first. **In the console**, type the command `install.packages(\"tidyverse\")`. This **only needs to be done once after a fresh installation**. 
After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio.\n\n\nNote, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`.\n\n\n### Loading packages and reading in data\n\nThe first step is to load in the packages we need and read in the data. Today, we'll only be using `tidyverse`, and `read_csv()` will help us store the data from `prp_data_reduced.csv` in an object called `data_prp`.\n\nCopy the code into a code chunk in your `.Rmd` file and run it. You can either click the `green arrow` to run the entire code chunk, or use the shortcut `Ctrl + Enter` (Windows) or `Cmd + Enter` (Mac) to run a line of code/pipe from the `.Rmd`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\nRows: 89 Columns: 91\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (17): Code, Age, Ethnicity, Opptional_mod_1_TEXT, Research_exp_1_TEXT, U...\ndbl (74): Gender, Secondyeargrade, Opptional_mod, Research_exp, Plan_prereg,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n\n\n\n## Activity 5: Familiarise yourself with the data 
{#sec-familiarise}\n\n* Look at the **Codebook** to get a feel for the variables in the dataset and how they have been measured. Note that some of the columns were deleted in the dataset you have been given.\n* You'll notice that some questionnaire data was collected at 2 different time points (i.e., SATS28, QRPs, Understanding_OS)\n* Some of the data was only collected at one time point (i.e., supervisor judgements, OS_behav items, and Included_prereg variables are t2-only variables)\n\n\n\n### First glimpse at the data\n\nBefore you start wrangling your data, it is important to understand what kind of data you're working with and what the format of your dataframe looks like.\n\nAs you may have noticed, `read_csv()` provides a **message** listing the data types in your dataset and how many columns are of each type. Plus, it shows a few example columns for each data type.\n\nTo obtain more detailed information about your data, you have several options. Click on the individual tabs to see the different options available. Test them out in your own `.Rmd` file and use whichever method you prefer (but do try at least one).\n\n::: callout-warning\n\nSome of the output is a bit long because we do have quite a few variables in the data file.\n\n:::\n\n::: panel-tabset\n\n## visual inspection 1\n\nIn the `Global Environment`, click the blue arrow icon next to the object name `data_prp`. This action will expand the object, revealing details about its columns. The `$` symbol is commonly used in base R to access a specific column within your dataframe.\n\n![Visual inspection of the data](images/data_prp.PNG)\n\nCon: When you have quite a few variables, not all of them are shown.\n\n## `glimpse()`\n\nUse `glimpse()` if you want a more detailed overview you can see on your screen. 
The output will display rows and column numbers, and some examples of the first couple of observations for each variable.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 91\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", …\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"2…\n$ Ethnicity \"White European\", \"White British…\n$ Secondyeargrade 2, 3, 1, 2, 2, 2, 2, 2, 1, 1, 1,…\n$ Opptional_mod 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,…\n$ Opptional_mod_1_TEXT \"Research methods in first year\"…\n$ Research_exp 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…\n$ Research_exp_1_TEXT NA, NA, NA, NA, NA, NA, NA, NA, …\n$ Plan_prereg 1, 3, 1, 2, 1, 1, 3, 3, 2, 2, 2,…\n$ SATS28_1_Affect_Time1 4, 5, 5, 6, 2, 1, 6, 3, 2, 5, 2,…\n$ SATS28_2_Affect_Time1 5, 6, 3, 3, 6, 1, 2, 2, 7, 3, 4,…\n$ SATS28_3_Affect_Time1 3, 2, 5, 2, 6, 7, 2, 6, 6, 5, 2,…\n$ SATS28_4_Affect_Time1 4, 5, 2, 2, 6, 6, 5, 5, 5, 5, 2,…\n$ SATS28_5_Affect_Time1 5, 5, 5, 6, 1, 1, 5, 1, 2, 5, 2,…\n$ SATS28_6_Affect_Time1 5, 6, 2, 5, 6, 7, 4, 5, 5, 3, 5,…\n$ SATS28_7_CognitiveCompetence_Time1 4, 2, 2, 5, 6, 7, 2, 5, 5, 2, 2,…\n$ SATS28_8_CognitiveCompetence_Time1 2, 2, 2, 1, 6, 7, 2, 5, 3, 2, 3,…\n$ SATS28_9_CognitiveCompetence_Time1 2, 2, 2, 3, 3, 7, 2, 6, 3, 3, 1,…\n$ SATS28_10_CognitiveCompetence_Time1 6, 7, 6, 6, 4, 2, 6, 4, 5, 6, 5,…\n$ SATS28_11_CognitiveCompetence_Time1 4, 3, 5, 5, 3, 1, 6, 2, 5, 6, 5,…\n$ SATS28_12_CognitiveCompetence_Time1 3, 5, 3, 5, 5, 7, 3, 4, 7, 2, 3,…\n$ SATS28_13_Value_Time1 1, 1, 2, 1, 3, 7, 1, 2, 1, 2, 4,…\n$ SATS28_14_Value_Time1 7, 7, 6, 6, 5, 1, 6, 5, 7, 6, 2,…\n$ SATS28_15_Value_Time1 7, 7, 6, 6, 3, 5, 6, 6, 6, 5, 5,…\n$ SATS28_16_Value_Time1 2, 1, 3, 2, 6, 5, 3, 7, 2, 2, 2,…\n$ SATS28_17_Value_Time1 1, 1, 3, 3, 7, 7, 2, 7, 2, 2, 5,…\n$ SATS28_18_Value_Time1 3, 6, 5, 3, 1, 1, 5, 1, 5, 2, 2,…\n$ SATS28_19_Value_Time1 3, 3, 3, 3, 7, 7, 4, 5, 3, 5, 
6,…\n$ SATS28_20_Value_Time1 2, 1, 4, 2, 7, 7, 2, 4, 2, 2, 7,…\n$ SATS28_21_Value_Time1 2, 1, 3, 2, 6, 7, 2, 5, 1, 3, 5,…\n$ SATS28_22_Difficulty_Time1 3, 2, 5, 3, 2, 1, 4, 2, 2, 5, 3,…\n$ SATS28_23_Difficulty_Time1 5, 6, 5, 6, 6, 7, 4, 6, 7, 5, 6,…\n$ SATS28_24_Difficulty_Time1 2, 2, 2, 3, 1, 4, 4, 2, 2, 2, 2,…\n$ SATS28_25_Difficulty_Time1 6, 7, 5, 5, 6, 7, 5, 6, 5, 5, 5,…\n$ SATS28_26_Difficulty_Time1 4, 2, 2, 2, 6, 7, 4, 5, 3, 5, 3,…\n$ SATS28_27_Difficulty_Time1 4, 5, 5, 3, 6, 7, 4, 3, 5, 3, 6,…\n$ SATS28_28_Difficulty_Time1 1, 7, 5, 5, 6, 6, 5, 4, 4, 4, 2,…\n$ QRPs_1_Time1 7, 7, 7, 7, 7, 7, 6, 2, 7, 6, 7,…\n$ QRPs_2_Time1 7, 7, 7, 7, 7, 7, 6, 7, 7, 7, 5,…\n$ QRPs_3_Time1 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3,…\n$ QRPs_4_Time1 7, 7, 6, 6, 7, 4, 6, 7, 7, 7, 6,…\n$ QRPs_5_Time1 3, 3, 7, 7, 2, 7, 4, 6, 7, 3, 2,…\n$ QRPs_6_Time1 4, 7, 6, 5, 7, 4, 4, 5, 7, 6, 5,…\n$ QRPs_7_Time1 5, 7, 7, 7, 7, 4, 5, 6, 7, 7, 5,…\n$ QRPs_8_Time1 7, 7, 7, 7, 7, 7, 7, 7, 7, 2, 7,…\n$ QRPs_9_Time1 6, 7, 7, 4, 7, 7, 3, 7, 6, 6, 2,…\n$ QRPs_10_Time1 7, 6, 5, 2, 5, 4, 2, 6, 7, 7, 2,…\n$ QRPs_11_Time1 7, 7, 7, 4, 7, 7, 4, 6, 7, 7, 5,…\n$ QRPs_12NotQRP_Time1 2, 2, 1, 4, 1, 4, 2, 4, 2, 2, 1,…\n$ QRPs_13NotQRP_Time1 1, 1, 1, 1, 1, 4, 2, 4, 1, 1, 1,…\n$ QRPs_14NotQRP_Time1 1, 4, 3, 4, 1, 4, 2, 3, 3, 4, 3,…\n$ QRPs_15NotQRP_Time1 2, 4, 2, 2, 1, 4, 2, 1, 4, 4, 2,…\n$ Understanding_OS_1_Time1 \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at…\n$ Understanding_OS_2_Time1 \"2\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_3_Time1 \"2\", \"Not at all confident\", \"3\"…\n$ Understanding_OS_4_Time1 \"6\", \"Not at all confident\", \"6\"…\n$ Understanding_OS_5_Time1 \"Entirely confident\", \"6\", \"6\", …\n$ Understanding_OS_6_Time1 \"Entirely confident\", \"Entirely …\n$ Understanding_OS_7_Time1 \"6\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_8_Time1 \"6\", \"3\", \"5\", \"3\", \"5\", \"Not at…\n$ Understanding_OS_9_Time1 \"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_10_Time1 
\"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_11_Time1 \"Entirely confident\", \"2\", \"4\", …\n$ Understanding_OS_12_Time1 \"Entirely confident\", \"2\", \"5\", …\n$ Pre_reg_group 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2,…\n$ Other_OS_behav_2 1, NA, NA, NA, 1, NA, NA, 1, NA,…\n$ Other_OS_behav_4 1, NA, NA, NA, NA, NA, NA, NA, N…\n$ Other_OS_behav_5 NA, NA, NA, NA, 1, 1, NA, NA, NA…\n$ Closely_follow 2, 2, 2, NA, 3, 3, 3, NA, NA, 2,…\n$ SATS28_Affect_Time2_mean 3.500000, 3.166667, 4.833333, 4.…\n$ SATS28_CognitiveCompetence_Time2_mean 4.166667, 4.666667, 6.166667, 5.…\n$ SATS28_Value_Time2_mean 3.000000, 6.222222, 6.000000, 4.…\n$ SATS28_Difficulty_Time2_mean 2.857143, 2.857143, 4.000000, 2.…\n$ QRPs_Acceptance_Time2_mean 5.636364, 5.454545, 6.272727, 5.…\n$ Time2_Understanding_OS 5.583333, 3.333333, 5.416667, 4.…\n$ Supervisor_1 5, 7, 7, 1, 7, 1, 7, 6, 7, 5, 6,…\n$ Supervisor_2 5, 6, 7, 4, 6, 2, 7, 5, 6, 5, 5,…\n$ Supervisor_3 6, 7, 7, 1, 7, 1, 7, 5, 6, 6, 7,…\n$ Supervisor_4 6, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_5 5, 7, 7, 4, 7, 3, 7, 7, 6, 6, 6,…\n$ Supervisor_6 5, 7, 7, 4, 6, 3, 7, 6, 7, 6, 6,…\n$ Supervisor_7 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ Supervisor_8 5, 5, 7, 1, 7, 1, 7, 5, 7, 5, 6,…\n$ Supervisor_9 6, 7, 7, 4, 7, 3, 7, 5, 7, 6, 7,…\n$ Supervisor_10 5, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_11 NA, 7, 7, NA, 7, 1, 7, 5, 7, 6, …\n$ Supervisor_12 4, 5, 7, 1, 4, 1, 7, 3, 6, 6, 5,…\n$ Supervisor_13 4, 2, 5, 1, 2, 1, 6, 3, 5, 6, 5,…\n$ Supervisor_14 5, 7, 7, 1, 7, 1, 7, 5, 7, 6, 6,…\n$ Supervisor_15_R 1, 1, 1, 4, 1, 7, 1, 2, 1, 2, 1,…\n```\n:::\n:::\n\n\n\n## `spec()`\n\nYou can also use `spec()` as suggested in the message above and then it shows you a list of the data type in every single column. 
But it doesn't show you the number of rows and columns.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nspec(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ncols(\n Code = col_character(),\n Gender = col_double(),\n Age = col_character(),\n Ethnicity = col_character(),\n Secondyeargrade = col_double(),\n Opptional_mod = col_double(),\n Opptional_mod_1_TEXT = col_character(),\n Research_exp = col_double(),\n Research_exp_1_TEXT = col_character(),\n Plan_prereg = col_double(),\n SATS28_1_Affect_Time1 = col_double(),\n SATS28_2_Affect_Time1 = col_double(),\n SATS28_3_Affect_Time1 = col_double(),\n SATS28_4_Affect_Time1 = col_double(),\n SATS28_5_Affect_Time1 = col_double(),\n SATS28_6_Affect_Time1 = col_double(),\n SATS28_7_CognitiveCompetence_Time1 = col_double(),\n SATS28_8_CognitiveCompetence_Time1 = col_double(),\n SATS28_9_CognitiveCompetence_Time1 = col_double(),\n SATS28_10_CognitiveCompetence_Time1 = col_double(),\n SATS28_11_CognitiveCompetence_Time1 = col_double(),\n SATS28_12_CognitiveCompetence_Time1 = col_double(),\n SATS28_13_Value_Time1 = col_double(),\n SATS28_14_Value_Time1 = col_double(),\n SATS28_15_Value_Time1 = col_double(),\n SATS28_16_Value_Time1 = col_double(),\n SATS28_17_Value_Time1 = col_double(),\n SATS28_18_Value_Time1 = col_double(),\n SATS28_19_Value_Time1 = col_double(),\n SATS28_20_Value_Time1 = col_double(),\n SATS28_21_Value_Time1 = col_double(),\n SATS28_22_Difficulty_Time1 = col_double(),\n SATS28_23_Difficulty_Time1 = col_double(),\n SATS28_24_Difficulty_Time1 = col_double(),\n SATS28_25_Difficulty_Time1 = col_double(),\n SATS28_26_Difficulty_Time1 = col_double(),\n SATS28_27_Difficulty_Time1 = col_double(),\n SATS28_28_Difficulty_Time1 = col_double(),\n QRPs_1_Time1 = col_double(),\n QRPs_2_Time1 = col_double(),\n QRPs_3_Time1 = col_double(),\n QRPs_4_Time1 = col_double(),\n QRPs_5_Time1 = col_double(),\n QRPs_6_Time1 = col_double(),\n QRPs_7_Time1 = col_double(),\n QRPs_8_Time1 = col_double(),\n 
QRPs_9_Time1 = col_double(),\n QRPs_10_Time1 = col_double(),\n QRPs_11_Time1 = col_double(),\n QRPs_12NotQRP_Time1 = col_double(),\n QRPs_13NotQRP_Time1 = col_double(),\n QRPs_14NotQRP_Time1 = col_double(),\n QRPs_15NotQRP_Time1 = col_double(),\n Understanding_OS_1_Time1 = col_character(),\n Understanding_OS_2_Time1 = col_character(),\n Understanding_OS_3_Time1 = col_character(),\n Understanding_OS_4_Time1 = col_character(),\n Understanding_OS_5_Time1 = col_character(),\n Understanding_OS_6_Time1 = col_character(),\n Understanding_OS_7_Time1 = col_character(),\n Understanding_OS_8_Time1 = col_character(),\n Understanding_OS_9_Time1 = col_character(),\n Understanding_OS_10_Time1 = col_character(),\n Understanding_OS_11_Time1 = col_character(),\n Understanding_OS_12_Time1 = col_character(),\n Pre_reg_group = col_double(),\n Other_OS_behav_2 = col_double(),\n Other_OS_behav_4 = col_double(),\n Other_OS_behav_5 = col_double(),\n Closely_follow = col_double(),\n SATS28_Affect_Time2_mean = col_double(),\n SATS28_CognitiveCompetence_Time2_mean = col_double(),\n SATS28_Value_Time2_mean = col_double(),\n SATS28_Difficulty_Time2_mean = col_double(),\n QRPs_Acceptance_Time2_mean = col_double(),\n Time2_Understanding_OS = col_double(),\n Supervisor_1 = col_double(),\n Supervisor_2 = col_double(),\n Supervisor_3 = col_double(),\n Supervisor_4 = col_double(),\n Supervisor_5 = col_double(),\n Supervisor_6 = col_double(),\n Supervisor_7 = col_double(),\n Supervisor_8 = col_double(),\n Supervisor_9 = col_double(),\n Supervisor_10 = col_double(),\n Supervisor_11 = col_double(),\n Supervisor_12 = col_double(),\n Supervisor_13 = col_double(),\n Supervisor_14 = col_double(),\n Supervisor_15_R = col_double()\n)\n```\n:::\n:::\n\n\n\n## visual inspection 2\n\nIn the `Global Environment`, click on the object name `data_prp`. This action will open the data in a new tab. Hovering over the column headings with your mouse will also reveal their data type. 
However, it seems to be a fairly tedious process when you have loads of columns.\n\n::: {.callout-important collapse=\"true\"}\n\n## Hang on, where is the rest of my data? Why do I only see 50 columns?\n\nOne common source of confusion is not seeing all your columns when you open up a data object as a tab. This is because RStudio shows you a maximum of 50 columns at a time. If you have more than 50 columns, navigate with the arrows to see the remaining columns.\n\n![Showing 50 columns at a time](images/50_col.PNG)\n\n:::\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nNow that you have tested out all the options in your own `.Rmd` file, you can probably answer the following questions:\n\n* How many observations? \n* How many variables? \n* How many columns have the `col_character` or `chr` data type? \n* How many columns have the `col_double` or `dbl` data type? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe visual inspection shows you the **number of observations and variables**. `glimpse()` also gives you that information but calls them **rows and columns** respectively.\n\nThe **data type information** actually comes from the output when using the `read_csv()` function. Did you notice the information on **Column specification** (see screenshot below)?\n\n![message from `read_csv()` when reading in the data](images/col_spec.PNG)\n\nWhilst `spec()` is quite useful for data type information per individual column, it doesn't give you the total count of each data type. So it doesn't really help with answering the questions here - unless you want to count manually from its extremely long output.\n\n:::\n\nIn your `.Rmd`, include a **new heading level 2** called \"Information about the data\" (or something equally meaningful) and jot down some notes about `data_prp`. 
You could include the citation and/or the abstract, and whatever information you think you should note about this dataset (e.g., any observations from looking at the codebook?). You could also include some notes on the functions used so far and what they do. Try to incorporate some **bold**, *italic* or ***bold and italic*** emphasis and perhaps a bullet point or two.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Possible solution\n\n\\#\\# Information about the data\n\nThe data is from Pownall et al. (2023), and I can find the paper here: https://doi.org/10.1177/25152459231202724.\n\nI've noticed in the prp codebook that the SATS-28 questionnaire has quite a few \\*reverse-coded items\\*, and the supervisor support questionnaire also has a reverse-coded item.\n\nSo far, I think I prefer \\*\\*glimpse()\\*\\* to show me some more detail about the data. spec() is too text-heavy for me which makes it hard to read.\n\nThings to keep in mind:\n\n* \\*\\*don't forget to load in tidyverse first!!!\\*\\*\n* always read in the data with \\*\\*read_csv\\*\\*, \\*\\*\\*never ever use read.csv\\*\\*\\*!!!\n\n![The output rendered in a knitted html file](images/knitted_markdown.PNG)\n\n:::\n\n:::\n\n### Data types {#sec-datatypes}\n\nEach variable has a **data type**, such as numeric (numbers), character (text), and logical (TRUE/FALSE values), or a special class of factor. As you have just seen, our `data_prp` only has character and numeric columns (so far).\n\n**Numeric data** can be double (`dbl`) or integer (`int`). Doubles can have decimal places (e.g., 1.1). Integers are whole numbers (e.g., 1, 2, -1) and are written in code with the suffix L (e.g., 1L). This is not overly important but might leave you less puzzled the next time you see an L after a number.\n\n**Characters** (also called “strings”) are anything written between quotation marks. 
This is usually text, but in special circumstances, a number can be a character if it is placed within quotation marks. This can happen when you are recoding variables. It might not be too obvious at the time, but you won't be able to calculate anything if the number is a character.\n\n::: panel-tabset\n\n## Example data types\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntypeof(1)\ntypeof(1L)\ntypeof(\"1\")\ntypeof(\"text\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n[1] \"integer\"\n[1] \"character\"\n[1] \"character\"\n```\n:::\n:::\n\n\n## numeric computation\n\nNo problems here...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n1+1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n## character computation\n\nWhen the data type is incorrect, you won't be able to compute anything, despite your numbers being shown as numeric values in the dataframe. The error message tells you exactly what's wrong with it, i.e., that you have `non-numeric arguments`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n\"1\"+\"1\" # ERROR\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"1\" + \"1\": non-numeric argument to binary operator\n```\n:::\n:::\n\n\n:::\n\n**Logical** data (also sometimes called “Boolean” values) are one of two values: TRUE or FALSE (written in uppercase). They become really important when we use `filter()` or `mutate()` with conditional statements such as `case_when()`. 
More about those in @sec-wrangling2.\n\n\nSome commonly used logical operators:\n\n| operator | description |\n|:---------|:-----------------------------------------------|\n| \\> | greater than |\n| \\>= | greater than or equal to |\n| \\< | less than |\n| \\<= | less than or equal to |\n| == | equal to |\n| != | not equal to |\n| %in% | TRUE if any element is in the following vector |\n\n\nA **factor** is a specific type of integer or character that lets you assign the order of the categories. This becomes useful when you want to display certain categories in \"the correct order\" either in a dataframe (see *arrange*) or when plotting (see @sec-dataviz/ @sec-dataviz2).\n\n\n\n### Variable types\n\nYou've already encountered them in [Level 1](https://psyteachr.github.io/data-skills-v2/intro-to-probability.html){target=\"_blank\"} but let's refresh. Variables can be classified as **continuous** (numbers) or **categorical** (labels).\n\n**Categorical** variables are properties you can count. They can be **nominal**, where the categories don't have an order (e.g., gender) or **ordinal** (e.g., Likert scales either with numeric values 1-7 or with character labels such as \"agree\", \"neither agree nor disagree\", \"disagree\"). Categorical data may also be **factors** rather than characters.\n\n**Continuous variables** are properties you can measure and calculate sums/ means/ etc. They may be rounded to the nearest whole number, but it should make sense to have a value between them. Continuous variables always have a **numeric** data type (i.e. 
`integer` or `double`).\n\n::: callout-tip\n\n## Why is this important you may ask?\n\nKnowing your variable and data types will help later on when deciding on an appropriate plot (see @sec-dataviz and @sec-dataviz2) or which inferential test to run (@sec-nhstI to @sec-factorial).\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAs we've seen earlier, `data_prp` only had character and numeric variables which hardly tests your understanding to see if you can identify a variety of data types and variable types. So, for this little quiz, we've spiced it up a bit. We've selected a few columns, shortened some of the column names, and modified some of the data types. Here you can see the first few rows of the new object `data_quiz`. *You can find the code with explanations at the end of this section.*\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Code |Age |Gender |Ethnicity |Secondyeargrade | QRP_item| QRPs_mean|Understanding_item |QRP_item > 4 |\n|:----|:---|:------|:--------------|:-----------------------|--------:|---------:|:------------------|:------------|\n|Tr10 |22 |2 |White European |60-69% (2:1 grade) | 5| 5.636364|2 |TRUE |\n|Bi07 |20 |2 |White British |50-59% (2:2 grade) | 2| 5.454546|2 |FALSE |\n|SK03 |22 |2 |White British |≥ 70% (1st class grade) | 6| 6.272727|6 |TRUE |\n|SM95 |26 |2 |White British |60-69% (2:1 grade) | 2| 5.000000|2 |FALSE |\n|St01 |22 |2 |White British |60-69% (2:1 grade) | 6| 5.545454|6 |TRUE |\n\n
\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_quiz)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 9\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", \"St01\", \"St10\", \"Wa…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"20\", \"21\", \"21\", \"22…\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ Ethnicity \"White European\", \"White British\", \"White British\",…\n$ Secondyeargrade 60-69% (2:1 grade), 50-59% (2:2 grade), ≥ 70% (1st …\n$ QRP_item 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3, 4, 4, 4, 4, 6, 3, …\n$ QRPs_mean 5.636364, 5.454545, 6.272727, 5.000000, 5.545455, 6…\n$ Understanding_item \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at all confident\", \"4…\n$ `QRP_item > 4` TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,…\n```\n:::\n:::\n\n\n\n\nSelect the variable type and the data type for each column from the dropdown menus.\n\n\n\n\n\n| Column | Variable type | Data type |\n|:---------------------|:--------------|:--------------|\n| `Age` | | |\n| `Gender` | | |\n| `Ethnicity` | | |\n| `Secondyeargrade` | | |\n| `QRP_item` | | |\n| `QRPs_mean` | | |\n| `Understanding_item` | | |\n| `QRP_item > 4` | | |\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Revealing the mystery code that created `data_quiz`\n\nThe code might look a bit complex for the minute despite the line-by-line explanations below. 
Come back to it after completing chapter 2.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_quiz <- data_prp %>% \n select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n mutate(Gender = factor(Gender),\n Secondyeargrade = factor(Secondyeargrade,\n levels = c(1, 2, 3, 4, 5),\n labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n `QRP_item > 4` = case_when(\n QRP_item > 4 ~ TRUE, \n .default = FALSE))\n```\n:::\n\n\nLet's go through this line by line:\n\n* **line 1**: creates a new object called `data_quiz` and it is based on the already existing data object `data_prp`\n* **line 2**: we are selecting a few variables of interest, such as Code, Age etc. Some of those variables were renamed in the process according to the structure `new_name = old_name`, for example QRP item 3 at time point 1 got renamed as `QRP_item`.\\\n* **line 3**: The function `mutate()` is used to create a new column called `Gender` that turns the existing column `Gender` from a numeric value into a factor. R simply overwrites the existing column of the same name. If we had named the new column `Gender_factor`, we would have been able to retain the original `Gender` column and `Gender_factor` would have been added as the last column.\n* **line 4-6**: See how the line starts with an indent which indicates we are still within the `mutate()` function. You can also see this by counting brackets - in line 3 there are 2 opening brackets but only 1 closes.\n * Similar to `Gender`, we are replacing the \"old\" `Secondyeargrade` with the new `Secondyeargrade` column that is now a factor.\n * Turning our variable `Secondyeargrade` into a factor, spot the difference between this attempt and the one we used for `Gender`? 
Here we are using a lot more arguments in that factor function, namely levels and labels. **Levels** describes the unique values we have for that column, and in **labels** we want to define how these levels will be shown in the data object. If you don't add the levels and labels arguments, the labels will be the same as the levels (as you can see in the `Gender` column in which we kept the numbers).\n* **line 7**: Doesn't start with a function name and has an indent, which means we are *still* within the `mutate()` function - count the opening and closing brackets to confirm.\n * Here, we are creating a new column called `QRP_item > 4`. Notice the two backticks we have to use to make this weird column name work? This is because it has spaces (and we did mention that R doesn't like spaces). So the backticks help R to group it as a unit/ a single name.\n * Next we have a `case_when()` function which helps to execute conditional statements. We are using it to check whether a statement is TRUE or FALSE. Here, we ask whether the QRP item (column `QRP_item`) is larger than 4 (midpoint of the scale) using the comparison operator `>`. If the statement is `TRUE`, the label `TRUE` should appear in column `QRP_item > 4`. Otherwise, if the value is equal to 4 or smaller, the label should read `FALSE`. We will come back to conditional statements in @sec-wrangling. But long story short, this Boolean expression created the only logical data type in `data_quiz`.\n:::\n\nAnd with this, we are done with the individual walkthrough. Well done :)\n\n\n\n\n\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nThe data we will be using in the upcoming lab activities is from a randomised controlled trial by Binfet et al. (2021) that was conducted in Canada.\n\n**Citation**\n\n> Binfet, J. T., Green, F. L. L., & Draper, Z. A. (2021). The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. 
*Anthrozoös, 35*(1), 1–22. [https://doi.org/10.1080/08927936.2021.1944558](https://doi.org/10.1080/08927936.2021.1944558){target=\"_blank\"}\n\n**Abstract**\n\n> Researchers have claimed that canine-assisted interventions (CAIs) contribute significantly to bolstering participants' wellbeing, yet the mechanisms within interactions have received little empirical attention. The aim of this study was to assess the impact of client–canine contact on wellbeing outcomes in a sample of 284 undergraduate college students (77% female; 21% male, 2% non-binary). Participants self-selected to participate and were randomly assigned to one of two canine interaction treatment conditions (touch or no touch) or to a handler-only condition with no therapy dog present. To assess self-reports of wellbeing, measures of flourishing, positive and negative affect, social connectedness, happiness, integration into the campus community, stress, homesickness, and loneliness were administered. Exploratory analyses were conducted to assess whether these wellbeing measures could be considered as measuring a unidimensional construct. This included both reliability analysis and exploratory factor analysis. Based on the results of these analyses we created a composite measure using participant scores on a latent factor. We then conducted the tests of the four hypotheses using these factor scores. Results indicate that participants across all conditions experienced enhanced wellbeing on several measures; however, only those in the direct contact condition reported significant improvements on all measures of wellbeing. Additionally, direct interactions with therapy dogs through touch elicited greater wellbeing benefits than did no touch/indirect interactions or interactions with only a dog handler. 
Similarly, analyses using scores on the wellbeing factor indicated significant improvement in wellbeing across all conditions (handler-only, *d* = 0.18, *p* = 0.041; indirect, *d* = 0.38, *p* \\< 0.001; direct, *d* = 0.78, *p* \\< 0.001), with more benefit when a dog was present (*d* = 0.20, *p* \\< 0.001), and the most benefit coming from physical contact with the dog (*d* = 0.13, *p* = 0.002). The findings hold implications for post-secondary wellbeing programs as well as the organization and delivery of CAIs.\n\n\nHowever, we accessed the data via Ciaran Evans' github ([https://github.com/ciaran-evans/dog-data-analysis](https://github.com/ciaran-evans/dog-data-analysis){target=\"_blank\"}). Evans et al. (2023) published a paper that reused the Binfet data for teaching statistics and research methods. If anyone is interested, the accompanying paper is:\n\n> Evans, C., Cipolli, W., Draper, Z. A., & Binfet, J. T. (2023). Repurposing a Peer-Reviewed Publication to Engage Students in Statistics: An Illustration of Study Design, Data Collection, and Analysis. *Journal of Statistics and Data Science Education, 31*(3), 236–247. [https://doi.org/10.1080/26939169.2023.2238018](https://doi.org/10.1080/26939169.2023.2238018){target=\"_blank\"}\n\n**There are a few changes that Evans and we made to the data:**\n\n* Evans removed the demographics ethnicity and gender to make the study data available while protecting participant privacy. 
This means we'll have limited demographic variables available, but we will make do with what we've got.\n* We modified some of the responses in the raw data csv - for example, we took out impossible response values and replaced them with `NA`.\n* We replaced some of the numbers with labels to increase the difficulty in the dataset for @sec-wrangling and @sec-wrangling2.\n\n\n\n### Task 1: Create a project folder for the lab activities {.unnumbered}\n\nSince we will be working with the same data throughout semester 1, create a separate project for the lab data. Name it something useful, like `lab_data` or `dogs_in_the_lab`. Make sure you are not placing it within the project you have already created today. If you need guidance, see @sec-project above.\n\n\n\n### Task 2: Create a new `.Rmd` file {.unnumbered}\n\n... and name it something useful. If you need help, have a look at @sec-rmd.\n\n\n\n### Task 3: Download the data {.unnumbered}\n\nDownload the data here: [data_pair_ch1](data/data_pair_ch1.zip \"download\"). The zip folder contains the raw data file with responses to individual questions, a cleaned version of the same data in long format and wide format, and the codebook describing the variables in the raw data file and the long format.\n\n**Unzip the folder and place the data files in the same folder as your project.**\n\n\n\n### Task 4: Familiarise yourself with the data {.unnumbered}\n\nOpen the data files, look at the codebook, and perhaps skim over the original Binfet article (methods in particular) to see what kind of measures they used.\n\nRead in the raw data file as `dog_data_raw` and the cleaned-up data (long format) as `dog_data_long`. 
See if you can answer the following questions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\ndog_data_long <- read_csv(\"dog_data_clean_long.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\nRows: 284 Columns: 136\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (41): GroupAssignment, L2_1, L2_2, L2_3, L2_4, L2_5, L2_6, L2_7, L2_8, L...\ndbl (95): RID, Age_Yrs, Year_of_Study, Live_Pets, Consumer_BARK, S1_1, HO1_1...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\nRows: 568 Columns: 16\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): GroupAssignment, Year_of_Study, Live_Pets, Stage\ndbl (12): RID, Age_Yrs, Consumer_BARK, Flourishing, PANAS_PA, PANAS_NA, SHS,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n* How many participants took part in the study? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nYou can see this from `dog_data_raw`. Each participant ID is on a single row meaning the number of observations is the number of participants.\n\nIf you look at `dog_data_long`, there are 568 observations. Each participant answered the questionnaires pre and post intervention, resulting in 2 rows per participant ID. This means you'd have to divide the number of observations by 2 to get to the number of participants.\n\n:::\n\n* How many different questionnaires did the participants answer? 
\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nThe Binfet paper (e.g., Methods section and/or abstract) and the codebook show it's 9 questionnaires - Flourishing scale (variable `Flourishing`), the UCLA Loneliness scale Version 3 (`Loneliness`), Positive and Negative affect scale (`PANAS_PA` and `PANAS_NA`), the Subjective Happiness scale (`SHS`), the Social connectedness scale (`SCS`), and 3 scales with 1 question each, i.e., perception of stress levels (`Stress`), self-reported level of homesickness (`Homesick`), and integration into the campus community (`Engagement`).\n\nHowever, if you thought `PANAS_PA` and `PANAS_NA` are a single questionnaire, 8 was also acceptable as an answer here.\n\n:::\n\n\n\n\n## [Test your knowledge]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nAre you ready for some knowledge check questions to test your understanding of the chapter? We also have some faulty code. See if you can spot what's wrong with it.\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nOne of the key first steps when we open RStudio is to:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nOpening an existing project (e.g., when coming back to the same dataset) or creating a new project (e.g., for a new task or new dataset) ensures that subsequent `.Rmd` files, any output, figures, etc. are saved within the same folder on your computer (i.e., the working directory). If the `.Rmd` files or data are not in the same folder as \"the project icon\", things can get messy and code might not run.\n\n:::\n\n\n#### Question 2 {.unnumbered}\n\nWhen using the default environment colour settings for RStudio, what colour would the background of a code chunk be in R Markdown? \n\nWhen using the default environment colour settings for RStudio, what colour would the background of normal text be in R Markdown? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nAssuming you have not changed any of the settings in RStudio, code chunks will tend to have a grey background and normal text will tend to have a white background. This is a good way to check that you have closed and opened code chunks correctly.\n\n:::\n\n\n\n#### Question 3 {.unnumbered}\n\nCode chunks start and end with:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCode chunks always take the same general format: they start with three backticks followed by curly brackets with a lower case r inside (`{r}`), and they end with three backticks. People often mistake these backticks for single quotes but that will not work. If you have set your code chunk correctly using backticks, the background colour should change to grey from white.\n\n:::\n\n\n\n#### Question 4 {.unnumbered}\n\nWhat is the correct way to include a code chunk in RMarkdown that will be executed but neither the code nor its output will be shown in the final HTML document? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCheck the table of knitr display options in @sec-chunks.\n\n* {r, echo=FALSE} also executes the code and does not show the code, but it *does* display the result in the knitted html file. (matches 2/3 criteria)\n* {r, eval=FALSE} does not show the results but does *not* execute the code and it *does* show it in the knitted file. (matches 1/3 criteria)\n* {r, results=\"hide\"} executes the code and does not show results, however, it *does* include the code in the knitted html document. (matches 2/3 criteria)\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of these code chunks contain mistakes; others do not quite produce what was intended. Your task is to spot anything faulty, explain why it happened, and perhaps try to fix it.\n\n\n\n#### Question 5 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. You have just started R, created a new `.Rmd` file, and typed the following code into your code chunk.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\n\nHowever, R gives you an error message: `could not find function \"read_csv\"`. 
What could be the reason?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\n\"Could not find function\" is an indication that you have forgotten to load in tidyverse. Because `read_csv()` is a function in the tidyverse collection, R cannot find it.\n\nFIX: Add `library(tidyverse)` prior to reading in the data and run the code chunk again.\n\n:::\n\n\n\n#### Question 6 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. This time, you are certain you have loaded in tidyverse first. The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\nThe error message shows `'data.csv' does not exist in current working directory`. You check your folder and it looks like this:\n\n![](images/error_ch1_01.PNG)\n\nWhy is there an error message?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nR is looking for a csv file that is called data which is currently not in the working directory. We may assume it's in the data folder. Perhaps that happened when unzipping the zip file. So instead of placing the csv file on the same level as the project icon, it was unzipped into a folder named data.\n\nFIX - option 1: Take the `data.csv` out of the data folder and place it next to the project icon and the `.Rmd` file.\n\nFIX - option 2: Modify your R code to tell R that the data is in a separate folder called data, e.g., ...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data/data.csv\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\n\nYou want to load `tidyverse` into the library. 
The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\nThe error message says: `Error in library(tidyverse) : there is no package called ‘tidyverse’`\n\nWhy is there an error message and how can we fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nIf R says there is no package called `tidyverse`, it means you haven't installed the package yet. This could be an error message you receive either after switching computers or a fresh install of R and RStudio.\n\nFIX: Type `install.packages(\"tidyverse\")` into your **Console**.\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nYou knitted your `.Rmd` into an HTML file but the output is not as expected. You see the following:\n\n![](images/error_knitted.PNG)\n\nWhy did the file not knit properly?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThere is a backtick missing in the code chunk. If you check your `.Rmd` file, you can see that the code chunk does not show up in grey which means it's one of the 3 backticks at the beginning of the chunk.\n\n![](images/error_ch1_08.PNG)\n\nFIX: Add a single backtick manually where it's missing.\n\n:::\n", + "markdown": "# Projects and R Markdown {#sec-basics}\n\n## Intended Learning Outcomes {.unnumbered}\n\nBy the end of this chapter, you should be able to:\n\n- Re-familiarise yourself with setting up projects\n- Re-familiarise yourself with RMarkdown documents\n- Recap and apply data wrangling procedures to analyse data\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n## R and RStudio\n\nRemember, R is a programming language that you will write code in and RStudio is an Integrated Development Environment (IDE) which makes working with R easier as it's more user friendly. 
You need both components for this course.\n\nIf this is not ringing any bells yet, have a quick browse through the [materials from year 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#sec-intro-r){target=\"_blank\"} to refresh your memory.\n\n\n### R server\n\nUse the server *only* if you are unable to install R and RStudio on your computer (e.g., if you are using a Chromebook) or if you encounter issues while installing R on your own machine. Otherwise, you should install R and RStudio directly on your own computer. R and RStudio are already installed on the *R server*.\n\nYou will find the link to the server on Moodle.\n\n\n### Installing R and RStudio on your computer\n\nThe [RSetGo book](https://psyteachr.github.io/RSetGo/){target=\"_blank\"} provides detailed instructions on how to install R and RStudio on your computer. It also includes links to walkthroughs for installing R on different types of computers and operating systems.\n\nIf you had R and RStudio installed on your computer last year, we recommend updating to the latest versions. In fact, it’s a good practice to update them at the start of each academic year. Detailed guidance can be found in @sec-updating-r.\n\nOnce you have installed or updated R and RStudio, return to this chapter.\n\n\n### Settings for Reproducibility\n\nBy now, you should be aware that the Psychology department at the University of Glasgow places a strong emphasis on reproducibility, open science, and raising awareness about questionable research practices (QRPs) and how to avoid them. Therefore, it's important that you work in a reproducible manner so that others (and your future self) can understand and check your work. This also makes it easier for you to reuse your work in the future.\n\nAlways start with a clear workspace. 
If your `Global Environment` contains anything from a previous session, you can’t be certain whether your current code is working as intended or if it’s using objects created earlier.\n\nTo ensure a clean and reproducible workflow, there are a few settings you should adjust immediately after installing or updating RStudio. In Tools \\> Global Options... General tab\n\n* Uncheck the box labelled Restore .RData into workspace at startup to make sure no data from a previous session is loaded into the environment\n* Set Save workspace to .RData on exit to **Never** to prevent your workspace from being saved when you exit RStudio.\n\n![Reproducibility settings in Global Options](images/rstudio_settings_reproducibility.png)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for keeping tabs on parentheses\n\nRStudio includes **rainbow parentheses** to help you keep track of matching brackets.\n\nTo enable the feature, go to Tools \\> Global Options... Code tab \\> Display tab and tick the last checkbox \"Use rainbow parentheses\"\n\n![Enable rainbow parentheses](images/rainbow.PNG)\n\n:::\n\n### RStudio panes\n\nRStudio has four main panes, each in a quadrant of your screen:\n\n* Source pane\n* Environment pane\n* Console pane\n* Output pane\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAre you ready for a quick quiz to see what you remember about the RStudio panes from last year? Click on **Quiz** to see the questions.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Quiz\n\n**What is their purpose?**\n\n**The Source pane...**
\n\n\n**The Environment pane...**
\n\n\n**The Console pane...**
\n\n\n**The Output pane...**
\n\n\n**Where are these panes located by default?**\n\n* The Source pane is located? \n* The Environment pane is located? \n* The Console pane is located? \n* The Output pane is located? \n\n:::\n\n:::\n\nIf you were not quite sure about one/any of the panes, check out the [materials from Level 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#rstudio-panes){target=\"_blank\"}. If you want to know more about them, there is the [RStudio guide on Posit](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html){target=\"_blank\"}.\n\n\n\n## Activity 1: Creating a new project {#sec-project}\n\nIt's important to create a new RStudio project whenever you start a new project. This practice makes it easier to work in multiple contexts, such as when analysing different datasets simultaneously. Each RStudio project has its own folder location, workspace, and working directories, which keeps all your data and RMarkdown documents organised in one place.\n\nLast year, you learnt how to create projects on the server, so you already know the steps. If you cannot quite recall how that was done, go back to the [Level 1 materials](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#new-project){target=\"_blank\"}.\n\nOn your own computer, open RStudio, and complete the following steps in this order:\n\n* Click on File \\> New Project...\n* Then, click on \"New Directory\"\n* Then, click on \"New Project\"\n* Name the directory something meaningful (e.g., \"2A_chapter1\"), and save it in a location that makes sense, for example, a dedicated folder you have for your level 2 Psychology labs - you can either select a folder you have already in place or create a new one (e.g., I named my new folder \"Level 2 labs\")\n* Click \"Create Project\". RStudio will restart itself and open with this new project directory as the working directory.
If you accidentally close it, you can open it by double-clicking on the project icon in your folder\n* You can also check in your folder structure that everything was created as intended\n\n![Creating a new project](images/project_setup.gif)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Why is the Colour scheme in the gif different to my version?\n\nIn case anyone is wondering why my colour scheme in the gif above looks different to yours, I've set mine to \"Pastel On Dark\" in Tools \\> Global Options... \\> Appearance. And my computer lives in \"dark mode\".\n\n:::\n\n::: callout-important\n\n## Don't nest projects\n\nDon't ever save a new project **inside** another project directory. This can cause some hard-to-resolve problems.\n\n:::\n\n\n## Activity 2: Create a new R Markdown file {#sec-rmd}\n\n* Open a new R Markdown document: click File \\> New File \\> R Markdown or click on the little page icon with a green plus sign (top left).\n* Give it a meaningful `Title` (e.g., Level 2 chapter 1) - you can also change the title later. Feel free to add your name or GUID in the `Author` field. Keep the `Default Output Format` as HTML.\n* Once the `.Rmd` has opened, you need to save the file.\n* To save it, click File \\> Save As... or click on the little disc icon. Name it something meaningful (e.g., \"chapter_01.Rmd\", \"01_intro.Rmd\"). Make sure there are no spaces in the name - R is not very fond of spaces... This file will automatically be saved in your project folder (i.e., your working directory) so you should now see this file appear in your file viewer pane.\n\n\n![Creating a new `.Rmd` file](images/Rmd_setup.gif)\n\n\nRemember, an R Markdown document or `.Rmd` has \"white space\" (i.e., the markdown for formatted text) and \"grey parts\" (i.e., code chunks) in the default colour scheme (see @fig-rmd). R Markdown is a powerful tool for creating dynamic documents because it allows you to integrate code and regular text seamlessly.
You can then knit your `.Rmd` using the `knitr` package to create a final document as either a webpage (HTML), a PDF, or a Word document (.docx). We'll only knit to HTML documents in this course.\n\n\n![R markdown anatomy (image from [https://intro2r.com/r-markdown-anatomy.html](https://intro2r.com/r-markdown-anatomy.html){target=\"_blank\"})](images/rm_components.png)\n\n\n\n### Markdown\n\nThe markdown space in an `.Rmd` is ideal for writing notes that explain your code and document your thought process. Use this space to clarify what your code is doing, why certain decisions were made, and any insights or conclusions you have drawn along the way. These notes are invaluable when revisiting your work later, helping you (or others) understand the rationale behind key decisions, such as setting inclusion/exclusion criteria or interpreting the results of assumption tests. Effectively documenting your work in the markdown space enhances both the clarity and reproducibility of your analysis.\n\nThe markdown space offers a variety of formatting options to help you organise and present your notes effectively. Here are a few of them that can enhance your documentation:\n\n#### Heading levels {.unnumbered}\n\nThere is a variety of **heading levels** to make use of, using the `#` symbol.\n\n\n::: columns\n\n::: column\n\n##### You would incorporate this into your text as: {.unnumbered}\n\n\\# Heading level 1\n\n\\## Heading level 2\n\n\\### Heading level 3\n\n\\#### Heading level 4\n\n\\##### Heading level 5\n\n\\###### Heading level 6\n\n:::\n\n::: column\n\n##### And it will be displayed in your knitted html file as: {.unnumbered}\n\n![](images/heading_levels.PNG)\n\n:::\n\n:::\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My heading levels don't render properly when knitting\n\nYou need a space between the # and the first letter. 
If the space is missing, the heading will be displayed in the HTML file as ...\n\n#Heading 1\n\n:::\n\n#### Unordered and ordered lists {.unnumbered}\n\nYou can also include **unordered lists** and **ordered lists**. Click on the tabs below to see how they are incorporated\n\n::: panel-tabset\n\n## unordered lists\n\nYou can add **bullet points** using either `*`, `-` or `+` and they will turn into:\n\n* bullet point (created with `*`)\n* bullet point (created with `-`)\n+ bullet point (created with `+`)\n\nor use bullet points of different levels using 1 tab key press or 2 spaces (for sub-item 1) or 2 tabs/4 spaces (for sub-sub-item 1):\n\n* bullet point item 1\n * sub-item 1\n * sub-sub-item 1\n * sub-sub-item 2\n* bullet point item 2\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My bullet points don't render properly when knitting\n\nYou need an empty row before your bullet points start. If I delete the empty row before the bullet points, they will be displayed in the HTML as ...\n\nText without the empty row: * bullet point created with `*` - bullet point created with `-` + bullet point created with `+`\n\n:::\n\n\n## ordered lists\n\nStart the line with **1.**, **2.**, etc. When you want to include sub-items, either use the `tab` key twice or add **4 spaces**. Same goes for the sub-sub-item: include either 2 tabs (or 4 manual spaces) from the last item or 4 tabs/ 8 spaces from the start of the line.\n\n1. list item 1\n2. list item 2\n i) sub-item 1 (with 4 spaces)\n A. sub-sub-item 1 (with an additional 4 spaces from the last indent)\n\n::: {.callout-important collapse=\"true\"}\n\n## My list items don't render properly when knitting\n\nIf you don't leave enough spaces, the list won't be recognised, and your output looks like this:\n\n3. list item 3\n i) sub-item 1 (with only 2 spaces) \n A. 
sub-sub-item 1 (with an additional 2 spaces from the last indent)\n\n:::\n\n\n## ordered lists magic\n\nThe great thing though is that you don't need to know your alphabet or number sequences. R markdown will fix that for you\n\nIf I type into my `.Rmd`...\n\n![](images/list_magic.PNG)\n\n...it will be rendered in the knitted HTML output as...\n\n3. list item 3\n1. list item 1\n a) sub-item labelled \"a)\"\n i) sub-item labelled \"i)\"\n C) sub-item labelled \"C)\"\n Z) sub-item labelled \"Z)\"\n7. list item 7\n\n\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: The labels of the sub-items are not what I thought they would be. You said they are fixing themselves...\n\nYes, they do but you need to label your sub-item lists accordingly. The first label you list in each level is set as the baseline. If they are labelled `1)` instead of `i)` or `A.`, the output will show as follows, but the automatic-item-fixing still works:\n\n7. list item 7\n 1) list item \"1)\" with 4 spaces\n 1) list item \"1)\" with 8 spaces\n 6) this is an item labelled \"6)\" (magically corrected to \"2.\")\n:::\n\n:::\n\n#### Emphasis {.unnumbered}\n\nInclude **emphasis** to draw attention to keywords in your text:\n\n| R markdown syntax | Displayed in the knitted HTML file |\n|:----------------------------|:-----------------------------------|\n| \\*\\*bold text\\*\\* | **bold text** |\n| \\*italic text\\* | *italic text* |\n| \\*\\*\\*bold and italic\\*\\*\\* | ***bold and italic*** |\n\n\nOther examples can be found in the [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target=\"_blank\"}\n\n\n\n### Code chunks {#sec-chunks}\n\nEverything you write inside the **code chunks** will be interpreted as code and executed by R. Code chunks start with ```` ``` ```` followed by an `{r}` which specifies the coding language R, some space for code, and ends with ```` ``` ````. 
If you accidentally delete one of those backticks, your code won't run and/or your text parts will be interpreted as part of the code chunks or vice versa. This should be evident from the colour change - more white than expected typically indicates missing starting backticks, whilst too much grey/not enough white suggests missing ending backticks. But no need to fret if that happens - just add the missing backticks manually.\n\n\nYou can **insert a new code chunk** in several ways:\n\n\n* Click the `Insert a new code chunk` button in the RStudio Toolbar (green icon at the top right corner of the `Source pane`).\n* Select Code \\> Insert Chunk from the menu.\n* Using the shortcut `Ctrl + Alt + I` for Windows or `Cmd + Option + I` on MacOSX.\n* Type ```` ```{r} ```` and ```` ``` ```` manually\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Default `.Rmd` with highlighting - names in pink and knitr display options in purple](images/default_highlighted.png){#fig-rmd fig-align='center' width=100%}\n:::\n:::\n\n\n\n\nWithin the curly brackets of a code chunk, you can **specify a name** for the code chunk (see pink highlighting in @fig-rmd). The chunk name is not necessarily required; however, it is good practice to give each chunk a unique name to support more advanced knitting approaches. It also makes it easier to reference and manage chunks.\n\nWithin the curly brackets, you can also place **rules and arguments** (see purple highlighting in @fig-rmd) to control how your code is executed and what is displayed in your final HTML output. 
The most common **knitr display options** include:\n\n\n| Code | Does code run | Does code show | Do results show |\n|:--------------------|:-------------:|:--------------:|:---------------:|\n| eval=FALSE | NO | YES | NO |\n| echo=TRUE (default) | YES | YES | YES |\n| echo=FALSE | YES | NO | YES |\n| results='hide' | YES | YES | NO |\n| include=FALSE | YES | NO | NO |\n\n\n::: callout-important\n\nThe table above will be incredibly important for the data skills homework II. When solving error mode items you will need to pay attention to the first one, `eval = FALSE`.\n\n:::\n\nOne last thing: In your newly created `.Rmd` file, delete everything below line 12 (keep the set-up code chunk) and save your `.Rmd` by clicking on the disc symbol.\n\n![Delete everything below line 12](images/delete_12.gif)\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nThat was quite a long section about what Markdown can do. I promise, we'll practice that more later. For the minute, we want you to create a new level 2 heading on line 12 and give it a meaningful heading title (something like \"Loading packages and reading in data\" or \"Chapter 1\").\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\nOn line 12, you should have typed **## Loading packages and reading in data** (or whatever meaningful title you chose). This will create a level 2 heading once we knit the `.Rmd`.\n\n:::\n\n:::\n\n\n## Activity 3: Download the data {#sec-download_data_ch1}\n\nThe data for chapters 1-3 can be downloaded here: [data_ch1.zip](data/data_ch1.zip \"download\"). There are 2 files contained in a zip folder.
One is the data file we are going to use today `prp_data_reduced.csv` and the other is an Excel file `prp_codebook` that explains the variables in the data.\n\nThe first step is to **unzip the zip folder** so that the files are placed within the same folder as your project.\n\n* Place the zip folder within your 2A_chapter1 folder\n* Right mouse click --> `Extract All...`\n* Check the folder location is the one to extract the files to\n* Check the extracted files are placed next to the project icon\n* Files and project should be visible in the Output pane in RStudio\n\n::: {.callout-note collapse=\"false\"}\n\n## Screenshots for \"unzipping a zip folder\"\n\n::: {layout-ncol=\"1\"}\n\n![](images/pic1.PNG){fig-align=\"center\"}\n\n![](images/pic23.PNG){fig-align=\"center\"}\n\n![](images/pic45.PNG){fig-align=\"center\"}\n\nUnzipping a zip folder\n\n:::\n:::\n\nThe paper by Pownall et al. was a **registered report** published in 2023, and the original data can be found on OSF ([https://osf.io/5qshg/](https://osf.io/5qshg/){target=\"_blank\"}).\n\n**Citation**\n\n> Pownall, M., Pennington, C. R., Norris, E., Juanchich, M., Smailes, D., Russell, S., Gooch, D., Evans, T. R., Persson, S., Mak, M. H. C., Tzavella, L., Monk, R., Gough, T., Benwell, C. S. Y., Elsherif, M., Farran, E., Gallagher-Mitchell, T., Kendrick, L. T., Bahnmueller, J., . . . Clark, K. (2023). Evaluating the Pedagogical Effectiveness of Study Preregistration in the Undergraduate Dissertation. *Advances in Methods and Practices in Psychological Science, 6*(4). [https://doi.org/10.1177/25152459231202724](https://doi.org/10.1177/25152459231202724){target=\"_blank\"}\n\n**Abstract**\n\n> Research shows that questionable research practices (QRPs) are present in undergraduate final-year dissertation projects. 
One entry-level Open Science practice proposed to mitigate QRPs is “study preregistration,” through which researchers outline their research questions, design, method, and analysis plans before data collection and/or analysis. In this study, we aimed to empirically test the effectiveness of preregistration as a pedagogic tool in undergraduate dissertations using a quasi-experimental design. A total of 89 UK psychology students were recruited, including students who preregistered their empirical quantitative dissertation (*n* = 52; experimental group) and students who did not (*n* = 37; control group). Attitudes toward statistics, acceptance of QRPs, and perceived understanding of Open Science were measured both before and after dissertation completion. Exploratory measures included capability, opportunity, and motivation to engage with preregistration, measured at Time 1 only. This study was conducted as a Registered Report; Stage 1 protocol: https://osf.io/9hjbw (date of in-principle acceptance: September 21, 2021). Study preregistration did not significantly affect attitudes toward statistics or acceptance of QRPs. However, students who preregistered reported greater perceived understanding of Open Science concepts from Time 1 to Time 2 compared with students who did not preregister. Exploratory analyses indicated that students who preregistered reported significantly greater capability, opportunity, and motivation to preregister. Qualitative responses revealed that preregistration was perceived to improve clarity and organization of the dissertation, prevent QRPs, and promote rigor. Disadvantages and barriers included time, perceived rigidity, and need for training. 
These results contribute to discussions surrounding embedding Open Science principles into research training.\n\n**Changes made to the dataset**\n\nWe made some changes to the dataset for the purpose of increasing difficulty for data wrangling (@sec-wrangling and @sec-wrangling2) and data visualisation (@sec-dataviz and @sec-dataviz2). This will ensure some \"teachable moments\". The changes are as follows:\n\n* We removed some of the variables to make the data more manageable for teaching purposes.\n* We recoded some values from numeric responses to labels (e.g., `understanding`).\n* We added the word \"years\" to one of the `Age` entries.\n* We tidied a messy column `Ethnicity` but introduced a similar but easier-to-solve \"messiness pattern\" when recoding the `understanding` data.\n* The scores in the original file were already corrected from reverse-coded responses. We reversed that process to present raw data here.\n\n\n\n\n## Activity 4: Installing packages, loading packages, and reading in data\n\n### Installing packages\n\nWhen you install R and RStudio for the first time (or after an update), most of the packages we will be using won’t be pre-installed. Before you can load new packages like `tidyverse`, you will need to install them.\n\nIf you try to load a package that has not been installed yet, you will receive an error message that looks something like this: `Error in library(tidyverse) : there is no package called 'tidyverse'`. \n\nTo fix this, simply install the package first. **In the console**, type the command `install.packages(\"tidyverse\")`. This **only needs to be done once after a fresh installation**. After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio.\n\n::: callout-important\n\n## Install packages from the console only\n\nNever include `install.packages()` in the Rmd. 
Only install packages from the console pane or the packages tab of the lower right pane!!!\n:::\n\n\nNote, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`.\n\n\n### Loading packages and reading in data\n\nThe first step is to load in the packages we need and read in the data. Today, we'll only be using `tidyverse`, and `read_csv()` will help us store the data from `prp_data_reduced.csv` in an object called `data_prp`.\n\nCopy the code into a code chunk in your `.Rmd` file and run it. You can either click the `green arrow` to run the entire code chunk, or use the shortcut `Ctrl + Enter` (Windows) or `Cmd + Enter` (Mac) to run a single line of code or pipe from the `.Rmd`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\nRows: 89 Columns: 91\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (17): Code, Age, Ethnicity, Opptional_mod_1_TEXT, Research_exp_1_TEXT, U...\ndbl (74): Gender, Secondyeargrade, Opptional_mod, Research_exp, Plan_prereg,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n\n\n\n## Activity 5: Familiarise yourself with the data {#sec-familiarise}\n\n* 
Look at the **Codebook** to get a feel for the variables in the dataset and how they have been measured. Note that some of the columns were deleted in the dataset you have been given.\n* You'll notice that some questionnaire data was collected at 2 different time points (i.e., SATS28, QRPs, Understanding_OS)\n* Some of the data was only collected at one time point (i.e., supervisor judgements, OS_behav items, and Included_prereg variables are t2-only variables)\n\n\n\n### First glimpse at the data\n\nBefore you start wrangling your data, it is important to understand what kind of data you're working with and what the format of your dataframe looks like.\n\nAs you may have noticed, `read_csv()` provides a **message** listing the data types in your dataset and how many columns are of each type. Plus, it shows a few example columns for each data type.\n\nTo obtain more detailed information about your data, you have several options. Click on the individual tabs to see the different options available. Test them out in your own `.Rmd` file and use whichever method you prefer (but do it).\n\n::: callout-warning\n\nSome of the output is a bit long because we do have quite a few variables in the data file.\n\n:::\n\n::: panel-tabset\n\n## visual inspection 1\n\nIn the `Global Environment`, click the blue arrow icon next to the object name `data_prp`. This action will expand the object, revealing details about its columns. The `$` symbol is commonly used in Base R to access a specific column within your dataframe.\n\n![Visual inspection of the data](images/data_prp.PNG)\n\nCon: When you have quite a few variables, not all of them are shown.\n\n## `glimpse()`\n\nUse `glimpse()` if you want a more detailed overview you can see on your screen.
The output will display rows and column numbers, and some examples of the first couple of observations for each variable.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 91\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", …\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"2…\n$ Ethnicity \"White European\", \"White British…\n$ Secondyeargrade 2, 3, 1, 2, 2, 2, 2, 2, 1, 1, 1,…\n$ Opptional_mod 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,…\n$ Opptional_mod_1_TEXT \"Research methods in first year\"…\n$ Research_exp 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…\n$ Research_exp_1_TEXT NA, NA, NA, NA, NA, NA, NA, NA, …\n$ Plan_prereg 1, 3, 1, 2, 1, 1, 3, 3, 2, 2, 2,…\n$ SATS28_1_Affect_Time1 4, 5, 5, 6, 2, 1, 6, 3, 2, 5, 2,…\n$ SATS28_2_Affect_Time1 5, 6, 3, 3, 6, 1, 2, 2, 7, 3, 4,…\n$ SATS28_3_Affect_Time1 3, 2, 5, 2, 6, 7, 2, 6, 6, 5, 2,…\n$ SATS28_4_Affect_Time1 4, 5, 2, 2, 6, 6, 5, 5, 5, 5, 2,…\n$ SATS28_5_Affect_Time1 5, 5, 5, 6, 1, 1, 5, 1, 2, 5, 2,…\n$ SATS28_6_Affect_Time1 5, 6, 2, 5, 6, 7, 4, 5, 5, 3, 5,…\n$ SATS28_7_CognitiveCompetence_Time1 4, 2, 2, 5, 6, 7, 2, 5, 5, 2, 2,…\n$ SATS28_8_CognitiveCompetence_Time1 2, 2, 2, 1, 6, 7, 2, 5, 3, 2, 3,…\n$ SATS28_9_CognitiveCompetence_Time1 2, 2, 2, 3, 3, 7, 2, 6, 3, 3, 1,…\n$ SATS28_10_CognitiveCompetence_Time1 6, 7, 6, 6, 4, 2, 6, 4, 5, 6, 5,…\n$ SATS28_11_CognitiveCompetence_Time1 4, 3, 5, 5, 3, 1, 6, 2, 5, 6, 5,…\n$ SATS28_12_CognitiveCompetence_Time1 3, 5, 3, 5, 5, 7, 3, 4, 7, 2, 3,…\n$ SATS28_13_Value_Time1 1, 1, 2, 1, 3, 7, 1, 2, 1, 2, 4,…\n$ SATS28_14_Value_Time1 7, 7, 6, 6, 5, 1, 6, 5, 7, 6, 2,…\n$ SATS28_15_Value_Time1 7, 7, 6, 6, 3, 5, 6, 6, 6, 5, 5,…\n$ SATS28_16_Value_Time1 2, 1, 3, 2, 6, 5, 3, 7, 2, 2, 2,…\n$ SATS28_17_Value_Time1 1, 1, 3, 3, 7, 7, 2, 7, 2, 2, 5,…\n$ SATS28_18_Value_Time1 3, 6, 5, 3, 1, 1, 5, 1, 5, 2, 2,…\n$ SATS28_19_Value_Time1 3, 3, 3, 3, 7, 7, 4, 5, 3, 5, 
6,…\n$ SATS28_20_Value_Time1 2, 1, 4, 2, 7, 7, 2, 4, 2, 2, 7,…\n$ SATS28_21_Value_Time1 2, 1, 3, 2, 6, 7, 2, 5, 1, 3, 5,…\n$ SATS28_22_Difficulty_Time1 3, 2, 5, 3, 2, 1, 4, 2, 2, 5, 3,…\n$ SATS28_23_Difficulty_Time1 5, 6, 5, 6, 6, 7, 4, 6, 7, 5, 6,…\n$ SATS28_24_Difficulty_Time1 2, 2, 2, 3, 1, 4, 4, 2, 2, 2, 2,…\n$ SATS28_25_Difficulty_Time1 6, 7, 5, 5, 6, 7, 5, 6, 5, 5, 5,…\n$ SATS28_26_Difficulty_Time1 4, 2, 2, 2, 6, 7, 4, 5, 3, 5, 3,…\n$ SATS28_27_Difficulty_Time1 4, 5, 5, 3, 6, 7, 4, 3, 5, 3, 6,…\n$ SATS28_28_Difficulty_Time1 1, 7, 5, 5, 6, 6, 5, 4, 4, 4, 2,…\n$ QRPs_1_Time1 7, 7, 7, 7, 7, 7, 6, 2, 7, 6, 7,…\n$ QRPs_2_Time1 7, 7, 7, 7, 7, 7, 6, 7, 7, 7, 5,…\n$ QRPs_3_Time1 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3,…\n$ QRPs_4_Time1 7, 7, 6, 6, 7, 4, 6, 7, 7, 7, 6,…\n$ QRPs_5_Time1 3, 3, 7, 7, 2, 7, 4, 6, 7, 3, 2,…\n$ QRPs_6_Time1 4, 7, 6, 5, 7, 4, 4, 5, 7, 6, 5,…\n$ QRPs_7_Time1 5, 7, 7, 7, 7, 4, 5, 6, 7, 7, 5,…\n$ QRPs_8_Time1 7, 7, 7, 7, 7, 7, 7, 7, 7, 2, 7,…\n$ QRPs_9_Time1 6, 7, 7, 4, 7, 7, 3, 7, 6, 6, 2,…\n$ QRPs_10_Time1 7, 6, 5, 2, 5, 4, 2, 6, 7, 7, 2,…\n$ QRPs_11_Time1 7, 7, 7, 4, 7, 7, 4, 6, 7, 7, 5,…\n$ QRPs_12NotQRP_Time1 2, 2, 1, 4, 1, 4, 2, 4, 2, 2, 1,…\n$ QRPs_13NotQRP_Time1 1, 1, 1, 1, 1, 4, 2, 4, 1, 1, 1,…\n$ QRPs_14NotQRP_Time1 1, 4, 3, 4, 1, 4, 2, 3, 3, 4, 3,…\n$ QRPs_15NotQRP_Time1 2, 4, 2, 2, 1, 4, 2, 1, 4, 4, 2,…\n$ Understanding_OS_1_Time1 \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at…\n$ Understanding_OS_2_Time1 \"2\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_3_Time1 \"2\", \"Not at all confident\", \"3\"…\n$ Understanding_OS_4_Time1 \"6\", \"Not at all confident\", \"6\"…\n$ Understanding_OS_5_Time1 \"Entirely confident\", \"6\", \"6\", …\n$ Understanding_OS_6_Time1 \"Entirely confident\", \"Entirely …\n$ Understanding_OS_7_Time1 \"6\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_8_Time1 \"6\", \"3\", \"5\", \"3\", \"5\", \"Not at…\n$ Understanding_OS_9_Time1 \"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_10_Time1 
\"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_11_Time1 \"Entirely confident\", \"2\", \"4\", …\n$ Understanding_OS_12_Time1 \"Entirely confident\", \"2\", \"5\", …\n$ Pre_reg_group 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2,…\n$ Other_OS_behav_2 1, NA, NA, NA, 1, NA, NA, 1, NA,…\n$ Other_OS_behav_4 1, NA, NA, NA, NA, NA, NA, NA, N…\n$ Other_OS_behav_5 NA, NA, NA, NA, 1, 1, NA, NA, NA…\n$ Closely_follow 2, 2, 2, NA, 3, 3, 3, NA, NA, 2,…\n$ SATS28_Affect_Time2_mean 3.500000, 3.166667, 4.833333, 4.…\n$ SATS28_CognitiveCompetence_Time2_mean 4.166667, 4.666667, 6.166667, 5.…\n$ SATS28_Value_Time2_mean 3.000000, 6.222222, 6.000000, 4.…\n$ SATS28_Difficulty_Time2_mean 2.857143, 2.857143, 4.000000, 2.…\n$ QRPs_Acceptance_Time2_mean 5.636364, 5.454545, 6.272727, 5.…\n$ Time2_Understanding_OS 5.583333, 3.333333, 5.416667, 4.…\n$ Supervisor_1 5, 7, 7, 1, 7, 1, 7, 6, 7, 5, 6,…\n$ Supervisor_2 5, 6, 7, 4, 6, 2, 7, 5, 6, 5, 5,…\n$ Supervisor_3 6, 7, 7, 1, 7, 1, 7, 5, 6, 6, 7,…\n$ Supervisor_4 6, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_5 5, 7, 7, 4, 7, 3, 7, 7, 6, 6, 6,…\n$ Supervisor_6 5, 7, 7, 4, 6, 3, 7, 6, 7, 6, 6,…\n$ Supervisor_7 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ Supervisor_8 5, 5, 7, 1, 7, 1, 7, 5, 7, 5, 6,…\n$ Supervisor_9 6, 7, 7, 4, 7, 3, 7, 5, 7, 6, 7,…\n$ Supervisor_10 5, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_11 NA, 7, 7, NA, 7, 1, 7, 5, 7, 6, …\n$ Supervisor_12 4, 5, 7, 1, 4, 1, 7, 3, 6, 6, 5,…\n$ Supervisor_13 4, 2, 5, 1, 2, 1, 6, 3, 5, 6, 5,…\n$ Supervisor_14 5, 7, 7, 1, 7, 1, 7, 5, 7, 6, 6,…\n$ Supervisor_15_R 1, 1, 1, 4, 1, 7, 1, 2, 1, 2, 1,…\n```\n:::\n:::\n\n\n\n## `spec()`\n\nYou can also use `spec()` as suggested in the message above and then it shows you a list of the data type in every single column. 
But it doesn't show you the number of rows and columns.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nspec(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ncols(\n Code = col_character(),\n Gender = col_double(),\n Age = col_character(),\n Ethnicity = col_character(),\n Secondyeargrade = col_double(),\n Opptional_mod = col_double(),\n Opptional_mod_1_TEXT = col_character(),\n Research_exp = col_double(),\n Research_exp_1_TEXT = col_character(),\n Plan_prereg = col_double(),\n SATS28_1_Affect_Time1 = col_double(),\n SATS28_2_Affect_Time1 = col_double(),\n SATS28_3_Affect_Time1 = col_double(),\n SATS28_4_Affect_Time1 = col_double(),\n SATS28_5_Affect_Time1 = col_double(),\n SATS28_6_Affect_Time1 = col_double(),\n SATS28_7_CognitiveCompetence_Time1 = col_double(),\n SATS28_8_CognitiveCompetence_Time1 = col_double(),\n SATS28_9_CognitiveCompetence_Time1 = col_double(),\n SATS28_10_CognitiveCompetence_Time1 = col_double(),\n SATS28_11_CognitiveCompetence_Time1 = col_double(),\n SATS28_12_CognitiveCompetence_Time1 = col_double(),\n SATS28_13_Value_Time1 = col_double(),\n SATS28_14_Value_Time1 = col_double(),\n SATS28_15_Value_Time1 = col_double(),\n SATS28_16_Value_Time1 = col_double(),\n SATS28_17_Value_Time1 = col_double(),\n SATS28_18_Value_Time1 = col_double(),\n SATS28_19_Value_Time1 = col_double(),\n SATS28_20_Value_Time1 = col_double(),\n SATS28_21_Value_Time1 = col_double(),\n SATS28_22_Difficulty_Time1 = col_double(),\n SATS28_23_Difficulty_Time1 = col_double(),\n SATS28_24_Difficulty_Time1 = col_double(),\n SATS28_25_Difficulty_Time1 = col_double(),\n SATS28_26_Difficulty_Time1 = col_double(),\n SATS28_27_Difficulty_Time1 = col_double(),\n SATS28_28_Difficulty_Time1 = col_double(),\n QRPs_1_Time1 = col_double(),\n QRPs_2_Time1 = col_double(),\n QRPs_3_Time1 = col_double(),\n QRPs_4_Time1 = col_double(),\n QRPs_5_Time1 = col_double(),\n QRPs_6_Time1 = col_double(),\n QRPs_7_Time1 = col_double(),\n QRPs_8_Time1 = col_double(),\n 
QRPs_9_Time1 = col_double(),\n QRPs_10_Time1 = col_double(),\n QRPs_11_Time1 = col_double(),\n QRPs_12NotQRP_Time1 = col_double(),\n QRPs_13NotQRP_Time1 = col_double(),\n QRPs_14NotQRP_Time1 = col_double(),\n QRPs_15NotQRP_Time1 = col_double(),\n Understanding_OS_1_Time1 = col_character(),\n Understanding_OS_2_Time1 = col_character(),\n Understanding_OS_3_Time1 = col_character(),\n Understanding_OS_4_Time1 = col_character(),\n Understanding_OS_5_Time1 = col_character(),\n Understanding_OS_6_Time1 = col_character(),\n Understanding_OS_7_Time1 = col_character(),\n Understanding_OS_8_Time1 = col_character(),\n Understanding_OS_9_Time1 = col_character(),\n Understanding_OS_10_Time1 = col_character(),\n Understanding_OS_11_Time1 = col_character(),\n Understanding_OS_12_Time1 = col_character(),\n Pre_reg_group = col_double(),\n Other_OS_behav_2 = col_double(),\n Other_OS_behav_4 = col_double(),\n Other_OS_behav_5 = col_double(),\n Closely_follow = col_double(),\n SATS28_Affect_Time2_mean = col_double(),\n SATS28_CognitiveCompetence_Time2_mean = col_double(),\n SATS28_Value_Time2_mean = col_double(),\n SATS28_Difficulty_Time2_mean = col_double(),\n QRPs_Acceptance_Time2_mean = col_double(),\n Time2_Understanding_OS = col_double(),\n Supervisor_1 = col_double(),\n Supervisor_2 = col_double(),\n Supervisor_3 = col_double(),\n Supervisor_4 = col_double(),\n Supervisor_5 = col_double(),\n Supervisor_6 = col_double(),\n Supervisor_7 = col_double(),\n Supervisor_8 = col_double(),\n Supervisor_9 = col_double(),\n Supervisor_10 = col_double(),\n Supervisor_11 = col_double(),\n Supervisor_12 = col_double(),\n Supervisor_13 = col_double(),\n Supervisor_14 = col_double(),\n Supervisor_15_R = col_double()\n)\n```\n:::\n:::\n\n\n\n## visual inspection 2\n\nIn the `Global Environment`, click on the object name `data_prp`. This action will open the data in a new tab. Hovering over the column headings with your mouse will also reveal their data type. 
However, it seems to be a fairly tedious process when you have loads of columns.\n\n::: {.callout-important collapse=\"true\"}\n\n## Hang on, where is the rest of my data? Why do I only see 50 columns?\n\nOne common source of confusion is not seeing all your columns when you open up a data object as a tab. This is because RStudio shows you a maximum of 50 columns at a time. If you have more than 50 columns, navigate with the arrows to see the remaining columns.\n\n![Showing 50 columns at a time](images/50_col.PNG)\n\n:::\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nNow that you have tested out all the options in your own `.Rmd` file, you can probably answer the following questions:\n\n* How many observations? \n* How many variables? \n* How many columns are `col_character` or `chr` data type? \n* How many columns are `col_double` or `dbl` data type? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe visual inspection shows you the **number of observations and variables**. `glimpse()` also gives you that information but calls them **rows and columns** respectively.\n\nThe **data type information** actually comes from the output when using the `read_csv()` function. Did you notice the information on **Column specification** (see screenshot below)?\n\n![message from `read_csv()` when reading in the data](images/col_spec.PNG)\n\nWhilst `spec()` is quite useful for data type information per individual column, it doesn't give you the total count of each data type. So it doesn't really help with answering the questions here - unless you want to count manually from its extremely long output.\n\n:::\n\nIn your `.Rmd`, include a **new heading level 2** called \"Information about the data\" (or something equally meaningful) and jot down some notes about `data_prp`. 
You could include the citation and/or the abstract, and whatever information you think you should note about this dataset (e.g., any observations from looking at the codebook?). You could also include some notes on the functions used so far and what they do. Try to incorporate some **bold**, *italic* or ***bold and italic*** emphasis and perhaps a bullet point or two.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Possible solution\n\n\\#\\# Information about the data\n\nThe data is from Pownall et al. (2023), and I can find the paper here: https://doi.org/10.1177/25152459231202724.\n\nI've noticed in the prp codebook that the SATS-28 questionnaire has quite a few \\*reverse-coded items\\*, and the supervisor support questionnaire also has a reverse-coded item.\n\nSo far, I think I prefer \\*\\*glimpse()\\*\\* to show me some more detail about the data. Specs() is too text-heavy for me which makes it hard to read.\n\nThings to keep in mind:\n\n* \\*\\*don't forget to load in tidyverse first!!!\\*\\*\n* always read in the data with \\*\\*read_csv\\*\\*, \\*\\*\\*never ever use read.csv\\*\\*\\*!!!\n\n![The output rendered in a knitted html file](images/knitted_markdown.PNG)\n\n:::\n\n:::\n\n### Data types {#sec-datatypes}\n\nEach variable has a **data type**, such as numeric (numbers), character (text), and logical (TRUE/FALSE values), or a special class of factor. As you have just seen, our `data_prp` only has character and numeric columns (so far).\n\n**Numeric data** can be double (`dbl`) or integer (`int`). Doubles can have decimal places (e.g., 1.1). Integers are the whole numbers (e.g., 1, 2, -1) and are written with the suffix L (e.g., 1L). This is not overly important but might leave you less puzzled the next time you see an L after a number.\n\n**Characters** (also called “strings”) are anything written between quotation marks. 
This is usually text, but in special circumstances, a number can be a character if it is placed within quotation marks. This can happen when you are recoding variables. It might not be too obvious at the time, but you won't be able to calculate anything if the number is a character.\n\n::: panel-tabset\n\n## Example data types\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntypeof(1)\ntypeof(1L)\ntypeof(\"1\")\ntypeof(\"text\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n[1] \"integer\"\n[1] \"character\"\n[1] \"character\"\n```\n:::\n:::\n\n\n## numeric computation\n\nNo problems here...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n1+1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n## character computation\n\nWhen the data type is incorrect, you won't be able to compute anything, despite your numbers being shown as numeric values in the dataframe. The error message tells you exactly what's wrong with it, i.e., that you have `non-numeric arguments`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n\"1\"+\"1\" # ERROR\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"1\" + \"1\": non-numeric argument to binary operator\n```\n:::\n:::\n\n\n:::\n\n**Logical** data (also sometimes called “Boolean” values) are one of two values: TRUE or FALSE (written in uppercase). They become really important when we use `filter()` or `mutate()` with conditional statements such as `case_when()`. 
More about those in @sec-wrangling2.\n\n\nSome commonly used logical operators:\n\n| operator | description |\n|:---------|:-----------------------------------------------|\n| \\> | greater than |\n| \\>= | greater than or equal to |\n| \\< | less than |\n| \\<= | less than or equal to |\n| == | equal to |\n| != | not equal to |\n| %in% | TRUE if any element is in the following vector |\n\n\nA **factor** is a specific type of integer or character that lets you assign the order of the categories. This becomes useful when you want to display certain categories in \"the correct order\" either in a dataframe (see *arrange*) or when plotting (see @sec-dataviz/ @sec-dataviz2).\n\n\n\n### Variable types\n\nYou've already encountered them in [Level 1](https://psyteachr.github.io/data-skills-v2/intro-to-probability.html){target=\"_blank\"} but let's refresh. Variables can be classified as **continuous** (numbers) or **categorical** (labels).\n\n**Categorical** variables are properties you can count. They can be **nominal**, where the categories don't have an order (e.g., gender) or **ordinal** (e.g., Likert scales either with numeric values 1-7 or with character labels such as \"agree\", \"neither agree nor disagree\", \"disagree\"). Categorical data may also be **factors** rather than characters.\n\n**Continuous variables** are properties you can measure and calculate sums/ means/ etc. They may be rounded to the nearest whole number, but it should make sense to have a value between them. Continuous variables always have a **numeric** data type (i.e. 
`integer` or `double`).\n\n::: callout-tip\n\n## Why is this important you may ask?\n\nKnowing your variable and data types will help later on when deciding on an appropriate plot (see @sec-dataviz and @sec-dataviz2) or which inferential test to run (@sec-nhstI to @sec-factorial).\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAs we've seen earlier, `data_prp` only had character and numeric variables which hardly tests your understanding to see if you can identify a variety of data types and variable types. So, for this little quiz, we've spiced it up a bit. We've selected a few columns, shortened some of the column names, and modified some of the data types. Here you can see the first few rows of the new object `data_quiz`. *You can find the code with explanations at the end of this section.*\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Code |Age |Gender |Ethnicity |Secondyeargrade | QRP_item| QRPs_mean|Understanding_item |QRP_item > 4 |\n|:----|:---|:------|:--------------|:-----------------------|--------:|---------:|:------------------|:------------|\n|Tr10 |22 |2 |White European |60-69% (2:1 grade) | 5| 5.636364|2 |TRUE |\n|Bi07 |20 |2 |White British |50-59% (2:2 grade) | 2| 5.454546|2 |FALSE |\n|SK03 |22 |2 |White British |≥ 70% (1st class grade) | 6| 6.272727|6 |TRUE |\n|SM95 |26 |2 |White British |60-69% (2:1 grade) | 2| 5.000000|2 |FALSE |\n|St01 |22 |2 |White British |60-69% (2:1 grade) | 6| 5.545454|6 |TRUE |\n\n
\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_quiz)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 9\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", \"St01\", \"St10\", \"Wa…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"20\", \"21\", \"21\", \"22…\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ Ethnicity \"White European\", \"White British\", \"White British\",…\n$ Secondyeargrade 60-69% (2:1 grade), 50-59% (2:2 grade), ≥ 70% (1st …\n$ QRP_item 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3, 4, 4, 4, 4, 6, 3, …\n$ QRPs_mean 5.636364, 5.454545, 6.272727, 5.000000, 5.545455, 6…\n$ Understanding_item \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at all confident\", \"4…\n$ `QRP_item > 4` TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,…\n```\n:::\n:::\n\n\n\n\nSelect the variable type and data type for each of the columns from the dropdown menus.\n\n\n\n\n\n| Column | Variable type | Data type |\n|:---------------------|:--------------|:--------------|\n| `Age` | | |\n| `Gender` | | |\n| `Ethnicity` | | |\n| `Secondyeargrade` | | |\n| `QRP_item` | | |\n| `QRPs_mean` | | |\n| `Understanding_item` | | |\n| `QRP_item > 4` | | |\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Revealing the mystery code that created `data_quiz`\n\nThe code might look a bit complex for the minute despite the line-by-line explanations below. 
Come back to it after completing chapter 2.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_quiz <- data_prp %>% \n select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n mutate(Gender = factor(Gender),\n Secondyeargrade = factor(Secondyeargrade,\n levels = c(1, 2, 3, 4, 5),\n labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n `QRP_item > 4` = case_when(\n QRP_item > 4 ~ TRUE, \n .default = FALSE))\n```\n:::\n\n\nLet's go through this line by line:\n\n* **line 1**: creates a new object called `data_quiz` and it is based on the already existing data object `data_prp`\n* **line 2**: we are selecting a few variables of interest, such as Code, Age etc. Some of those variables were renamed in the process according to the structure `new_name = old_name`, for example QRP item 3 at time point 1 got renamed as `QRP_item`.\\\n* **line 3**: The function `mutate()` is used to create a new column called `Gender` that turns the existing column `Gender` from a numeric value into a factor. R simply overwrites the existing column of the same name. If we had named the new column `Gender_factor`, we would have been able to retain the original `Gender` column and `Gender_factor` would have been added as the last column.\n* **lines 4-6**: See how the line starts with an indent which indicates we are still within the `mutate()` function. You can also see this by counting brackets - in line 3 there are 2 opening brackets but only 1 closes.\n * Similar to `Gender`, we are replacing the \"old\" `Secondyeargrade` with the new `Secondyeargrade` column that is now a factor.\n * Turning our variable `Secondyeargrade` into a factor, spot the difference between this attempt and the one we used for `Gender`? 
Here we are using a lot more arguments in that factor function, namely levels and labels. **Levels** describes the unique values we have for that column, and in **labels** we want to define how these levels will be shown in the data object. If you don't add the levels and labels arguments, the labels will be the same as the levels (as you can see in the `Gender` column in which we kept the numbers).\n* **line 7**: Doesn't start with a function name and has an indent, which means we are *still* within the `mutate()` function - count the opening and closing brackets to confirm.\n * Here, we are creating a new column called `QRP_item > 4`. Notice the two backticks we have to use to make this weird column name work? This is because it has spaces (and we did mention that R doesn't like spaces). So the backticks help R to group it as a unit/ a single name.\n * Next we have a `case_when()` function which helps execute conditional statements. We are using it to check whether a statement is TRUE or FALSE. Here, we ask whether the QRP item (column `QRP_item`) is larger than 4 (midpoint of the scale) using the Boolean operator `>`. If the statement is `TRUE`, the label `TRUE` should appear in column `QRP_item > 4`. Otherwise, if the value is equal to 4 or smaller, the label should read `FALSE`. We will come back to conditional statements in @sec-wrangling. But long story short, this Boolean expression created the only logical data type in `data_quiz`.\n:::\n\nAnd with this, we are done with the individual walkthrough. Well done :)\n\n\n\n\n\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nThe data we will be using in the upcoming lab activities is from a randomised controlled trial by Binfet et al. (2021) that was conducted in Canada.\n\n**Citation**\n\n> Binfet, J. T., Green, F. L. L., & Draper, Z. A. (2021). The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. 
*Anthrozoös, 35*(1), 1–22. [https://doi.org/10.1080/08927936.2021.1944558](https://doi.org/10.1080/08927936.2021.1944558){target=\"_blank\"}\n\n**Abstract**\n\n> Researchers have claimed that canine-assisted interventions (CAIs) contribute significantly to bolstering participants' wellbeing, yet the mechanisms within interactions have received little empirical attention. The aim of this study was to assess the impact of client–canine contact on wellbeing outcomes in a sample of 284 undergraduate college students (77% female; 21% male, 2% non-binary). Participants self-selected to participate and were randomly assigned to one of two canine interaction treatment conditions (touch or no touch) or to a handler-only condition with no therapy dog present. To assess self-reports of wellbeing, measures of flourishing, positive and negative affect, social connectedness, happiness, integration into the campus community, stress, homesickness, and loneliness were administered. Exploratory analyses were conducted to assess whether these wellbeing measures could be considered as measuring a unidimensional construct. This included both reliability analysis and exploratory factor analysis. Based on the results of these analyses we created a composite measure using participant scores on a latent factor. We then conducted the tests of the four hypotheses using these factor scores. Results indicate that participants across all conditions experienced enhanced wellbeing on several measures; however, only those in the direct contact condition reported significant improvements on all measures of wellbeing. Additionally, direct interactions with therapy dogs through touch elicited greater wellbeing benefits than did no touch/indirect interactions or interactions with only a dog handler. 
Similarly, analyses using scores on the wellbeing factor indicated significant improvement in wellbeing across all conditions (handler-only, *d* = 0.18, *p* = 0.041; indirect, *d* = 0.38, *p* \\< 0.001; direct, *d* = 0.78, *p* \\< 0.001), with more benefit when a dog was present (*d* = 0.20, *p* \\< 0.001), and the most benefit coming from physical contact with the dog (*d* = 0.13, *p* = 0.002). The findings hold implications for post-secondary wellbeing programs as well as the organization and delivery of CAIs.\n\n\nHowever, we accessed the data via Ciaran Evans' github ([https://github.com/ciaran-evans/dog-data-analysis](https://github.com/ciaran-evans/dog-data-analysis){target=\"_blank\"}). Evans et al. (2023) published a paper that reused the Binfet data for teaching statistics and research methods. If anyone is interested, the accompanying paper is:\n\n> Evans, C., Cipolli, W., Draper, Z. A., & Binfet, J. T. (2023). Repurposing a Peer-Reviewed Publication to Engage Students in Statistics: An Illustration of Study Design, Data Collection, and Analysis. *Journal of Statistics and Data Science Education, 31*(3), 236–247. [https://doi.org/10.1080/26939169.2023.2238018](https://doi.org/10.1080/26939169.2023.2238018){target=\"_blank\"}\n\n**There are a few changes that Evans and we made to the data:**\n\n* Evans removed the demographics ethnicity and gender to make the study data available while protecting participant privacy. 
Which means we'll have limited demographic variables available, but we will make do with what we've got.\n* We modified some of the responses in the raw data csv - for example, we took out impossible response values and replaced them with `NA`.\n* We replaced some of the numbers with labels to increase the difficulty in the dataset for @sec-wrangling and @sec-wrangling2.\n\n\n\n### Task 1: Create a project folder for the lab activities {.unnumbered}\n\nSince we will be working with the same data throughout semester 1, create a separate project for the lab data. Name it something useful, like `lab_data` or `dogs_in_the_lab`. Make sure you are not placing it within the project you have already created today. If you need guidance, see @sec-project above.\n\n\n\n### Task 2: Create a new `.Rmd` file {.unnumbered}\n\n... and name it something useful. If you need help, have a look at @sec-rmd.\n\n\n\n### Task 3: Download the data {.unnumbered}\n\nDownload the data here: [data_pair_ch1](data/data_pair_ch1.zip \"download\"). The zip folder contains the raw data file with responses to individual questions, a cleaned version of the same data in long format and wide format, and the codebook describing the variables in the raw data file and the long format.\n\n**Unzip the folder and place the data files in the same folder as your project.**\n\n\n\n### Task 4: Familiarise yourself with the data {.unnumbered}\n\nOpen the data files, look at the codebook, and perhaps skim over the original Binfet article (methods in particular) to see what kind of measures they used.\n\nRead in the raw data file as `dog_data_raw` and the cleaned-up data (long format) as `dog_data_long`. 
See if you can answer the following questions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\ndog_data_long <- read_csv(\"dog_data_clean_long.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\nRows: 284 Columns: 136\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (41): GroupAssignment, L2_1, L2_2, L2_3, L2_4, L2_5, L2_6, L2_7, L2_8, L...\ndbl (95): RID, Age_Yrs, Year_of_Study, Live_Pets, Consumer_BARK, S1_1, HO1_1...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\nRows: 568 Columns: 16\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): GroupAssignment, Year_of_Study, Live_Pets, Stage\ndbl (12): RID, Age_Yrs, Consumer_BARK, Flourishing, PANAS_PA, PANAS_NA, SHS,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n* How many participants took part in the study? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nYou can see this from `dog_data_raw`. Each participant ID is on a single row meaning the number of observations is the number of participants.\n\nIf you look at `dog_data_long`, there are 568 observations. Each participant answered the questionnaires pre and post intervention, resulting in 2 rows per participant ID. This means you'd have to divide the number of observations by 2 to get to the number of participants.\n\n:::\n\n* How many different questionnaires did the participants answer? 
 \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nThe Binfet paper (e.g., Methods section and/or abstract) and the codebook show it's 9 questionnaires - Flourishing scale (variable `Flourishing`), the UCLA Loneliness scale Version 3 (`Loneliness`), Positive and Negative affect scale (`PANAS_PA` and `PANAS_NA`), the Subjective Happiness scale (`SHS`), the Social connectedness scale (`SCS`), and 3 scales with 1 question each, i.e., perception of stress levels (`Stress`), self-reported level of homesickness (`Homesick`), and integration into the campus community (`Engagement`).\n\nHowever, if you thought `PANAS_PA` and `PANAS_NA` are a single questionnaire, 8 was also acceptable as an answer here.\n\n:::\n\n\n\n\n## [Test your knowledge]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nAre you ready for some knowledge check questions to test your understanding of the chapter? We also have some faulty code chunks. See if you can spot what's wrong with them.\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nOne of the key first steps when we open RStudio is to:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nOpening an existing project (e.g., when coming back to the same dataset) or creating a new project (e.g., for a new task or new dataset) ensures that subsequent `.Rmd` files, any output, figures, etc are saved within the same folder on your computer (i.e., the working directory). If the `.Rmd` files or data are not in the same folder as \"the project icon\", things can get messy and code might not run.\n\n:::\n\n\n#### Question 2 {.unnumbered}\n\nWhen using the default environment colour settings for RStudio, what colour would the background of a code chunk be in R Markdown? \n\nWhen using the default environment colour settings for RStudio, what colour would the background of normal text be in R Markdown? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nAssuming you have not changed any of the settings in RStudio, code chunks will tend to have a grey background and normal text will tend to have a white background. This is a good way to check that you have closed and opened code chunks correctly.\n\n:::\n\n\n\n#### Question 3 {.unnumbered}\n\nCode chunks start and end with:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCode chunks always take the same general format of three backticks followed by curly parentheses and a lower case r inside the parentheses (`{r}`). People often mistake these backticks for single quotes but that will not work. If you have set your code chunk correctly using backticks, the background colour should change to grey from white.\n\n:::\n\n\n\n#### Question 4 {.unnumbered}\n\nWhat is the correct way to include a code chunk in RMarkdown that will be executed but neither the code nor its output will be shown in the final HTML document? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCheck the table of knitr display options in @sec-chunks.\n\n* {r, echo=FALSE} also executes the code and does not show the code, but it *does* display the result in the knitted html file. (matches 2/3 criteria)\n* {r, eval=FALSE} does not show the results but does *not* execute the code and it *does* show it in the knitted file. (matches 1/3 criteria)\n* {r, results=“hide”} executes the code and does not show results, however, it *does* include the code in the knitted html document. (matches 2/3 criteria)\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of these code chunks have mistakes in them; others are not quite producing what was intended. Your task is to spot anything faulty, explain why it happened, and perhaps try to fix it.\n\n\n\n#### Question 5 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. You have just started R, created a new `.Rmd` file, and typed the following code into your code chunk.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\n\nHowever, R gives you an error message: `could not find function \"read_csv\"`. 
What could be the reason?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\n\"Could not find function\" is an indication that you have forgotten to load in tidyverse. Because `read_csv()` is a function in the tidyverse collection, R cannot find it.\n\nFIX: Add `library(tidyverse)` prior to reading in the data and run the code chunk again.\n\n:::\n\n\n\n#### Question 6 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. This time, you are certain you have loaded in tidyverse first. The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\nThe error message shows `'data.csv' does not exist in current working directory`. You check your folder and it looks like this:\n\n![](images/error_ch1_01.PNG)\n\nWhy is there an error message?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nR is looking for a csv file that is called data which is currently not in the working directory. We may assume it's in the data folder. Perhaps that happened when unzipping the zip file. So instead of placing the csv file on the same level as the project icon, it was unzipped into a folder named data.\n\nFIX - option 1: Take the `data.csv` out of the data folder and place it next to the project icon and the `.Rmd` file.\n\nFIX - option 2: Modify your R code to tell R that the data is in a separate folder called data, e.g., ...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data/data.csv\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\n\nYou want to load `tidyverse` into the library. 
The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\nThe error message says: `Error in library(tidyverse) : there is no package called ‘tidyverse’`\n\nWhy is there an error message and how can we fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nIf R says there is no package called `tidyverse`, it means you haven't installed the package yet. This could be an error message you receive either after switching computers or after a fresh install of R and RStudio.\n\nFIX: Type `install.packages(\"tidyverse\")` into your **Console**.\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nYou knitted your `.Rmd` into a html but the output is not as expected. You see the following:\n\n![](images/error_knitted.PNG)\n\nWhy did the file not knit properly?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThere is a backtick missing in the code chunk. If you check your `.Rmd` file, you can see that the code chunk does not show up in grey which means it's one of the 3 backticks at the beginning of the chunk.\n\n![](images/error_ch1_08.PNG)\n\nFIX: Add a single backtick manually where it's missing.\n\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/.quarto/_freeze/02-wrangling/execute-results/html.json b/.quarto/_freeze/02-wrangling/execute-results/html.json index e5f22d3..dc380b5 100644 --- a/.quarto/_freeze/02-wrangling/execute-results/html.json +++ b/.quarto/_freeze/02-wrangling/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "85f2fa0ec2f8baeeb06d4b656ca711c6", + "hash": "f28c870f506255c679a3456fc152f995", "result": { - "markdown": "# Data wrangling I {#sec-wrangling}\n\n## Intended Learning Outcomes {.unnumbered}\n\nIn the next two chapters, we will build on the data wrangling skills from level 1. 
We will revisit all the functions you have already encountered (and might have forgotten over the summer break) and introduce 2 or 3 new functions. These two chapters will provide an opportunity to revise and apply the functions to a novel dataset.\n\nBy the end of this chapter, you should be able to:\n\n- apply familiar data wrangling functions to novel datasets\n- read and interpret error messages\n- realise there are several ways of getting to the results\n- export data objects as csv files\n\nThe main purpose of this chapter and @sec-wrangling2 is to wrangle your data into shape for data visualisation (@sec-dataviz and @sec-dataviz2). For the two chapters, we will:\n\n1. calculate demographics\n2. tidy 3 different questionnaires with varying degrees of complexity\n3. solve an error mode problem\n4. join all data objects together\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nBefore we start, we need to set up some things.\n\n\n## Activity 1: Setup\n\n* We will be working on the **dataset by Pownall et al. (2023)** again, which means we can still use the project we created last week. The data files will already be there, so no need to download them again.\n* To **open the project** in RStudio, go to the folder in which you stored the project and the data last time, and double click on the project icon.\n* **Create a new `.Rmd` file** for chapter 2 and save it to your project folder. Name it something meaningful (e.g., “chapter_02.Rmd”, “02_data_wrangling.Rmd”). 
See @sec-rmd if you need some guidance.\n* In your newly created `.Rmd` file, delete everything below line 12 (after the set-up code chunk).\n\n\n\n## Activity 2: Load in the libraries and read in the data\n\nWe will use `tidyverse` today, and we want to create a data object `data_prp` that stores the data from the file `prp_data_reduced.csv`.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(???)\ndata_prp <- read_csv(\"???\")\n```\n:::\n\n\n\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n\n:::\n\nIf you need a quick reminder what the dataset was about, have a look at the abstract in @sec-download_data_ch1. We also addressed the changes we made to the dataset there.\n\nAnd remember to have a quick `glimpse()` at your data.\n\n\n\n## Activity 3: Calculating demographics\n\nLet’s start with some simple data-wrangling steps to compute demographics for our original dataset, `data_prp`. First, we want to determine how many participants took part in the study by Pownall et al. (2023) and compute the mean age and the standard deviation of age for the sample.\n\n\n\n### ... for the full sample using `summarise()`\n\nThe `summarise()` function is part of the **\"Wickham Six\"** alongside `group_by()`, `select()`, `filter()`, `mutate()`, and `arrange()`. You used them plenty of times last year.\n\nWithin `summarise()`, we can use the `n()` function, which calculates the number of rows in the dataset. 
Since each row corresponds to a unique participant, this gives us the total number of participants.\n\nTo calculate the mean age and the standard deviation of age, we need to use the functions `mean()` and `sd()` on the column `Age` respectively.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: There were 2 warnings in `summarise()`.\nThe first warning was:\nℹ In argument: `mean_age = mean(Age)`.\nCaused by warning in `mean.default()`:\n! argument is not numeric or logical: returning NA\nℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.\n```\n:::\n\n```{.r .cell-code}\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nR did not give us an error message per se, but the output is not quite as expected either. There are `NA` values in the `mean_age` and `sd_age` columns. Looking at the warning message and at `Age`, can you explain what happened?\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nThe warning message says: `argument is not numeric or logical: returning NA`. If we look at the `Age` column more closely, we can see that it's a character data type.\n\n:::\n\n\n\n#### Fixing `Age` {.unnumbered}\n\nIt would be wise to look at the unique values in column `Age` to determine what is wrong. We can do that with the function `distinct()`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nage_distinct <- data_prp %>% \n distinct(Age)\n\nage_distinct\n```\n:::\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Show the unique values of `Age`.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Age |\n|:--------|\n|22 |\n|20 |\n|26 |\n|21 |\n|29 |\n|23 |\n|39 |\n|NA |\n|24 |\n|43 |\n|31 |\n|25 years |\n\n
\n:::\n:::\n\n:::\n\n::: columns\n\n::: column\n\nOne cell has the string \"years\" added to their number 25, which has converted the entire column into a character column.\n\nWe can easily fix this by extracting only the numbers from the column and converting it into a numeric data type. The `parse_number()` function, which is part of the `tidyverse` package, handles both steps in one go (so there’s no need to load additional packages).\n\nWe will combine this with the `mutate()` function to create a new column called `Age` (containing those numeric values), effectively replacing the old `Age` column (which had the character values).\n\n:::\n\n::: column\n\n![parse_number() illustration by Allison Horst (see [https://allisonhorst.com/r-packages-functions](https://allisonhorst.com/r-packages-functions){target=\"_blank\"})](images/parse_number.png){width=\"95%\"}\n\n:::\n\n:::\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_prp <- data_prp %>% \n mutate(Age = parse_number(Age))\n\ntypeof(data_prp$Age) # fixed\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n```\n:::\n:::\n\n\n\n\n#### Computing summary stats {.unnumbered}\n\nExcellent. Now that the numbers are in a numeric format, let's try calculating the demographics for the total sample again.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nEven though there's no error or warning, the table still shows `NA` values for `mean_age` and `sd_age`. So, what could possibly be wrong now?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nDid you notice that the `Age` column in `age_distinct` contains some missing values (`NA`)? To be honest, it's easier to spot this issue in the actual R output than in the printed HTML page.\n\n:::\n\n\n\n#### Computing summary stats - third attempt {.unnumbered}\n\nTo ensure R ignores missing values during calculations, we need to add the extra argument `na.rm = TRUE` to the `mean()` and `sd()` functions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age, na.rm = TRUE), # mean age\n sd_age = sd(Age, na.rm = TRUE)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|--------:|\n| 89| 21.88506| 3.485603|\n\n
\n:::\n:::\n\n\nFinally, we’ve got it! 🥳 Third time's the charm!\n\n\n\n### ... per gender using `summarise()` and `group_by()`\n\nNow we want to compute the summary statistics for each gender. The code inside the `summarise()` function remains unchanged; we just need to use the `group_by()` function beforehand to tell R that we want to compute the summary statistics for each group separately. It’s also a good practice to use `ungroup()` afterwards, so you are not taking groupings forward unintentionally.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% # split data up into groups (here Gender)\n summarise(n = n(), # participant number \n mean_age = mean(Age, na.rm = TRUE), # mean age \n sd_age = sd(Age, na.rm = TRUE)) %>% # standard deviation of age\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| mean_age| sd_age|\n|------:|--:|--------:|--------:|\n| 1| 17| 23.31250| 5.770254|\n| 2| 69| 21.57353| 2.738973|\n| 3| 3| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n\n\n### Adding percentages\n\nSometimes, it may be useful to calculate percentages, such as for the gender split. You can do this by adding a line within the `summarise()` function to perform the calculation. All we need to do is take the number of female, male, and non-binary participants (stored in the `n` column of `demo_by_gender`), divide it by the total number of participants (stored in the `n` column of `demo_total`), and multiply by 100. Let's add `percentage` to the `summarise()` function of `demo_by_gender`. Make sure that the code for `percentage` is placed after the value for `n` has been computed.\n\nAccessing the value of `n` for the different gender categories is straightforward because we can refer back to it directly. However, since the total number of participants is stored in a different data object, we need to use a base R function to access it – specifically the `$` operator. To do this, you simply type the name of the data object (in this case, `demo_total`), followed by the `$` symbol (with no spaces), and then the name of the column you want to retrieve (in this case, `n`). The general pattern is `data$column`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n # n from the line above divided by n from demo_total *100\n percentage = n/demo_total$n *100, \n mean_age = mean(Age, na.rm = TRUE), \n sd_age = sd(Age, na.rm = TRUE)) %>% \n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|--------:|\n| 1| 17| 19.101124| 23.31250| 5.770254|\n| 2| 69| 77.528090| 21.57353| 2.738973|\n| 3| 3| 3.370786| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for decimal places - use `round()`\n\nThis is not essential, because you could round the values yourself when writing up your reports, but if you want to tidy up the decimal places in the output, you can use the `round()` function. You would need to \"wrap\" it around your computations and specify how many decimal places you want to display (for example `mean(Age)` would turn into `round(mean(Age), 1)`). It may look odd for `percentage`; just make sure the number that specifies the decimal places is placed **within** the `round()` function. The default value is 0 (meaning no decimal places).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n percentage = round(n/demo_total$n *100, 2), # percentage with 2 decimal places\n mean_age = round(mean(Age, na.rm = TRUE), 1), # mean Age with 1 decimal place\n sd_age = round(sd(Age, na.rm = TRUE), 3)) %>% # sd Age with 3 decimal places\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|------:|\n| 1| 17| 19.10| 23.3| 5.770|\n| 2| 69| 77.53| 21.6| 2.739|\n| 3| 3| 3.37| 21.3| 1.155|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 4: Questionable Research Practices (QRPs) {#sec-ch2_act4}\n\n#### The main goal is to compute the mean QRP score per participant for time point 1. {.unnumbered}\n\nAt the moment, the data is in wide format. The table below shows data from the first 3 participants:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhead(data_prp, n = 3)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | Gender| Age|Ethnicity | Secondyeargrade| Opptional_mod|Opptional_mod_1_TEXT | Research_exp|Research_exp_1_TEXT | Plan_prereg| SATS28_1_Affect_Time1| SATS28_2_Affect_Time1| SATS28_3_Affect_Time1| SATS28_4_Affect_Time1| SATS28_5_Affect_Time1| SATS28_6_Affect_Time1| SATS28_7_CognitiveCompetence_Time1| SATS28_8_CognitiveCompetence_Time1| SATS28_9_CognitiveCompetence_Time1| SATS28_10_CognitiveCompetence_Time1| SATS28_11_CognitiveCompetence_Time1| SATS28_12_CognitiveCompetence_Time1| SATS28_13_Value_Time1| SATS28_14_Value_Time1| SATS28_15_Value_Time1| SATS28_16_Value_Time1| SATS28_17_Value_Time1| SATS28_18_Value_Time1| SATS28_19_Value_Time1| SATS28_20_Value_Time1| SATS28_21_Value_Time1| SATS28_22_Difficulty_Time1| SATS28_23_Difficulty_Time1| SATS28_24_Difficulty_Time1| SATS28_25_Difficulty_Time1| SATS28_26_Difficulty_Time1| SATS28_27_Difficulty_Time1| SATS28_28_Difficulty_Time1| QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1| QRPs_12NotQRP_Time1| QRPs_13NotQRP_Time1| QRPs_14NotQRP_Time1| QRPs_15NotQRP_Time1|Understanding_OS_1_Time1 |Understanding_OS_2_Time1 |Understanding_OS_3_Time1 |Understanding_OS_4_Time1 |Understanding_OS_5_Time1 |Understanding_OS_6_Time1 |Understanding_OS_7_Time1 |Understanding_OS_8_Time1 |Understanding_OS_9_Time1 |Understanding_OS_10_Time1 |Understanding_OS_11_Time1 |Understanding_OS_12_Time1 | Pre_reg_group| Other_OS_behav_2| Other_OS_behav_4| Other_OS_behav_5| Closely_follow| SATS28_Affect_Time2_mean| SATS28_CognitiveCompetence_Time2_mean| SATS28_Value_Time2_mean| SATS28_Difficulty_Time2_mean| QRPs_Acceptance_Time2_mean| Time2_Understanding_OS| Supervisor_1| Supervisor_2| Supervisor_3| Supervisor_4| Supervisor_5| Supervisor_6| Supervisor_7| Supervisor_8| Supervisor_9| Supervisor_10| Supervisor_11| Supervisor_12| Supervisor_13| Supervisor_14| 
Supervisor_15_R|\n|:----|------:|---:|:--------------|---------------:|-------------:|:------------------------------|------------:|:-------------------|-----------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|----------------------------------:|----------------------------------:|----------------------------------:|-----------------------------------:|-----------------------------------:|-----------------------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:-------------------------|:-------------------------|:-------------------------|-------------:|----------------:|----------------:|----------------:|--------------:|------------------------:|-------------------------------------:|-----------------------:|----------------------------:|--------------------------:|----------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------:|-------------:|-------------:|---------------:|\n|Tr10 | 2| 22|White European | 2| 
1|Research methods in first year | 2|NA | 1| 4| 5| 3| 4| 5| 5| 4| 2| 2| 6| 4| 3| 1| 7| 7| 2| 1| 3| 3| 2| 2| 3| 5| 2| 6| 4| 4| 1| 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7| 2| 1| 1| 2|2 |2 |2 |6 |Entirely confident |Entirely confident |6 |6 |Entirely confident |Entirely confident |Entirely confident |Entirely confident | 1| 1| 1| NA| 2| 3.500000| 4.166667| 3.000000| 2.857143| 5.636364| 5.583333| 5| 5| 6| 6| 5| 5| 1| 5| 6| 5| NA| 4| 4| 5| 1|\n|Bi07 | 2| 20|White British | 3| 2|NA | 2|NA | 3| 5| 6| 2| 5| 5| 6| 2| 2| 2| 7| 3| 5| 1| 7| 7| 1| 1| 6| 3| 1| 1| 2| 6| 2| 7| 2| 5| 7| 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7| 2| 1| 4| 4|2 |Not at all confident |Not at all confident |Not at all confident |6 |Entirely confident |Not at all confident |3 |6 |6 |2 |2 | 1| NA| NA| NA| 2| 3.166667| 4.666667| 6.222222| 2.857143| 5.454546| 3.333333| 7| 6| 7| 7| 7| 7| 1| 5| 7| 7| 7| 5| 2| 7| 1|\n|SK03 | 2| 22|White British | 1| 2|NA | 2|NA | 1| 5| 3| 5| 2| 5| 2| 2| 2| 2| 6| 5| 3| 2| 6| 6| 3| 3| 5| 3| 4| 3| 5| 5| 2| 5| 2| 5| 5| 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7| 1| 1| 3| 2|6 |2 |3 |6 |6 |5 |2 |5 |5 |5 |4 |5 | 1| NA| NA| NA| 2| 4.833333| 6.166667| 6.000000| 4.000000| 6.272727| 5.416667| 7| 7| 7| 7| 7| 7| 1| 7| 7| 7| 7| 7| 5| 7| 1|\n\n
\n:::\n:::\n\n

\n\nLooking at the QRP data at time point 1, you determine that\n\n* individual item columns are , and\n* according to the codebook, there are reverse-coded items in this questionnaire.\n\nAccording to the codebook and the data table above, we just have to **compute the average score for QRP items to **, since items to are distractor items. Seems quite straightforward.\n\nHowever, as you can see in the table above, each item is in a separate column, meaning the data is in **wide format**. It would be much easier to calculate the mean scores if the items were arranged in **long format**.\n\n\nLet’s tackle this problem step by step. It’s best to create a separate data object for this. If we tried to compute it within `data_prp`, it could quickly become messy.\n\n\n* **Step 1**: Select the relevant columns `Code`, and `QRPs_1_Time1` to `QRPs_11_Time1` and store them in an object called `qrp_t1`\n* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can calculate the average score more easily (in step 3)\n* **Step 3**: Calculate the average QRP score (`QRPs_Acceptance_Time1_mean`) per participant using `group_by()` and `summarise()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_t1 <- data_prp %>% \n # Step 1\n select(Code, QRPs_1_Time1:QRPs_11_Time1) %>%\n # Step 2\n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(Code) %>% # grouping by participant id\n summarise(QRPs_Acceptance_Time1_mean = mean(Scores)) %>% # calculating the average Score\n ungroup() # just make it a habit\n```\n:::\n\n\n::: {.callout-caution icon=\"false\" collapse=\"true\"}\n\n## Explain the individual functions\n\n::: panel-tabset\n\n## `select()`\n\nThe `select()` function allows us to include or exclude certain variables (columns). Here we want to focus on the participant ID column (i.e., `Code`) and the QRP items at time point 1. 
We could list them all individually, i.e., Code, QRPs_1_Time1, QRPs_2_Time1, QRPs_3_Time1, and so forth (you get the gist), but that would take forever to type.\n\nA shortcut is to use the colon operator `:`. It allows us to select all columns that fall within the range of `first_column_name` to `last_column_name`. We can apply this here since the QRP items (1 to 11) are sequentially listed in `data_prp`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step1 <- data_prp %>% \n select(Code, QRPs_1_Time1:QRPs_11_Time1)\n\n# show first 5 rows of qrp_step1\nhead(qrp_step1, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1|\n|:----|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|\n|Tr10 | 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7|\n|Bi07 | 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7|\n|SK03 | 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7|\n|SM95 | 7| 7| 2| 6| 7| 5| 7| 7| 4| 2| 4|\n|St01 | 7| 7| 6| 7| 2| 7| 7| 7| 7| 5| 7|\n\n
\n:::\n:::\n\n\nHow many rows/observations and columns/variables do we have in `qrp_step1`?\n\n* rows/observations: \n* columns/variables: \n\n## `pivot_longer()`\n\nAs you can see, the table we got from Step 1 is in wide format. To get it into long format, we need to define:\n\n* the columns that need to be reshuffled from wide into long format (`cols` argument). Here we selected \"everything except the `Code` column\", as indicated by `-Code` \\[minus `Code`\\]. However, `QRPs_1_Time1:QRPs_11_Time1` would also work and give you the exact same result.\n* the `names_to` argument. R creates a new column in which all the column names from the columns you selected in `cols` will be stored. Here we are naming this column \"Items\" but you could pick something equally sensible if you like.\n* the `values_to` argument. R creates this second column to store all responses the participants gave to the individual questions, i.e., all the numbers in this case. We named it \"Scores\" here, but you could have called it something different, like \"Responses\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step2 <- qrp_step1 %>% \n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\")\n\n# show first 15 rows of qrp_step2\nhead(qrp_step2, n = 15)\n```\n\n::: {.cell-output-display}\n
\n\n|Code |Items | Scores|\n|:----|:-------------|------:|\n|Tr10 |QRPs_1_Time1 | 7|\n|Tr10 |QRPs_2_Time1 | 7|\n|Tr10 |QRPs_3_Time1 | 5|\n|Tr10 |QRPs_4_Time1 | 7|\n|Tr10 |QRPs_5_Time1 | 3|\n|Tr10 |QRPs_6_Time1 | 4|\n|Tr10 |QRPs_7_Time1 | 5|\n|Tr10 |QRPs_8_Time1 | 7|\n|Tr10 |QRPs_9_Time1 | 6|\n|Tr10 |QRPs_10_Time1 | 7|\n|Tr10 |QRPs_11_Time1 | 7|\n|Bi07 |QRPs_1_Time1 | 7|\n|Bi07 |QRPs_2_Time1 | 7|\n|Bi07 |QRPs_3_Time1 | 2|\n|Bi07 |QRPs_4_Time1 | 7|\n\n
\n:::\n:::\n\n\nNow, have a look at `qrp_step2`. In total, we now have rows/observations, per participant, and columns/variables.\n\n## `group_by()` and `summarise()`\n\nThis follows exactly the same sequence we used when calculating descriptive statistics by gender. The only difference is that we are now grouping the data by the participant's `Code` instead of `Gender`.\n\n`summarise()` works exactly the same way: `summarise(new_column_name = function_to_calculate_something(column_name_of_numeric_values))`\n\nThe `function_to_calculate_something` can be `mean()`, `sd()` or `sum()` for mean scores, standard deviations, or summed-up scores respectively. You could also use `min()` or `max()` if you wanted to determine the lowest or the highest score for each participant.\n\n:::\n\n:::\n\n::: callout-tip\n\nYou could **rename the columns whilst selecting** them. The pattern would be `select(new_name = old_name)`. For example, if we wanted to select variable `Code` and rename it as `Participant_ID`, we could do that.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrenaming_col <- data_prp %>% \n select(Participant_ID = Code)\n\nhead(renaming_col, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Participant_ID |\n|:--------------|\n|Tr10 |\n|Bi07 |\n|SK03 |\n|SM95 |\n|St01 |\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 5: Knitting\n\nOnce you've completed your R Markdown file, the final step is to \"knit\" it, which converts the `.Rmd` file into an HTML file. Knitting combines your code, text, and output (like tables and plots) into a single cohesive document. This is a really good way to check your code is working.\n\nTo knit the file, **click the Knit button** at the top of your RStudio window. The document will be generated and, depending on your settings, automatically opened in the viewer in the `Output pane` or an external browser window.\n\nIf any errors occur during knitting, RStudio will show you an error message with details to help you troubleshoot.\n\nIf you want to **intentionally keep any errors** we tackled today as a reference for how you solved them, you could add `error=TRUE` (the chunk runs and the error message appears in the knitted document) or `eval=FALSE` (the code is shown but not run) to the code chunk that isn't running.\n\n\n\n## Activity 6: Export a data object as a csv\n\nTo avoid having to repeat the same steps in the next chapter, it's a good idea to save the data objects you've created today as csv files. You can do this by using the `write_csv()` function from the `readr` package. The csv files will appear in your project folder.\n\nThe basic syntax is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_object, \"filename.csv\")\n```\n:::\n\n\nNow, let's export the objects `data_prp` and `qrp_t1`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_prp, \"data_prp_for_ch3.csv\")\n```\n:::\n\n\nHere we named the file `data_prp_for_ch3.csv`, so we wouldn't overwrite the original data csv file `prp_data_reduced.csv`. 
However, feel free to choose a name that makes sense to you.\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nExport `qrp_t1`.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(qrp_t1, \"qrp_t1.csv\")\n```\n:::\n\n\n:::\n\n:::\n\nCheck that your csv files have appeared in your project folder, and you're all set!\n\n**That’s it for Chapter 2: Individual Walkthrough.**\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n\nWe will continue working with the data from Binfet et al. (2021), focusing on the randomised controlled trial of therapy dog interventions. Today, our goal is to **calculate an average `Flourishing` score for each participant** at time point 1 (pre-intervention) using the raw data file `dog_data_raw`. Currently, the data looks like this:\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| F1_1| F1_2| F1_3| F1_4| F1_5| F1_6| F1_7| F1_8|\n|---:|----:|----:|----:|----:|----:|----:|----:|----:|\n| 1| 6| 7| 5| 5| 7| 7| 6| 6|\n| 2| 5| 7| 6| 5| 5| 5| 5| 4|\n| 3| 5| 5| 5| 6| 6| 6| 5| 5|\n| 4| 7| 6| 7| 7| 7| 6| 7| 4|\n| 5| 5| 5| 4| 6| 7| 7| 7| 6|\n\n
\n:::\n:::\n\n\n\nHowever, we want the data to look like this:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| Flourishing_pre|\n|---:|---------------:|\n| 1| 6.125|\n| 2| 5.250|\n| 3| 5.375|\n| 4| 6.375|\n| 5| 5.875|\n\n
\n:::\n:::\n\n\n\n\n### Task 1: Open the R project you created last week {.unnumbered}\n\nIf you haven’t created an R project for the lab yet, please do so now. If you already have one set up, go ahead and open it.\n\n\n### Task 2: Open your `.Rmd` file from last week {.unnumbered}\n\nSince we haven’t used it much yet, feel free to continue using the `.Rmd` file you created last week in Task 2.\n\n\n### Task 3: Load in the library and read in the data {.unnumbered}\n\nThe data should be in your project folder. If you didn’t download it last week, or if you’d like a fresh copy, you can download the data again here: [data_pair_ch1](data/data_pair_ch1.zip \"download\").\n\nWe will be using the `tidyverse` package today, and the data file we need to read in is `dog_data_raw.csv`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(???)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"???\")\n```\n:::\n\n\n:::\n\n\n### Task 4: Calculating the mean for `Flourishing_pre` {.unnumbered}\n\n\n* **Step 1**: Select all relevant columns from `dog_data_raw`, including participant ID and all items from the `Flourishing` questionnaire completed before the intervention. Store this data in an object called `data_flourishing`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nLook at the codebook. Try to determine:\n\n* The variable name of the column where the participant ID is stored.\n* The items related to the Flourishing scale at the pre-intervention stage.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nFrom the codebook, we know that:\n\n* The participant ID column is called `RID`.\n* The Flourishing items at the pre-intervention stage start with `F1_`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_flourishing <- ??? 
%>% \n select(???, F1_???:F1_???)\n```\n:::\n\n\n:::\n\n:::\n\n\n* **Step 2**: Pivot the data from wide format to long format so we can calculate the average score more easily (in step 3).\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nWhich pivot function should you use? We have `pivot_wider()` and `pivot_longer()` to choose from.\n\nWe also need 3 arguments in that function:\n\n* The columns you want to select (e.g., all the Flourishing items),\n* The name of the column where the current column headings will be stored (e.g., \"Questionnaire\"),\n* The name of the column that should store all the values (e.g., \"Responses\").\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nWe need `pivot_longer()`. You already encountered `pivot_longer()` in first year (or in the individual walkthrough if you have already completed this chapter). The 3 arguments were also a give-away; `pivot_wider()` only requires 2 arguments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n pivot_longer(cols = ???, names_to = \"???\", values_to = \"???\")\n```\n:::\n\n\n:::\n\n:::\n\n* **Step 3**: Calculate the average Flourishing score per participant and name this column `Flourishing_pre` to match the table above.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nBefore summarising the mean, you may need to group the data.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nTo compute an average score **per participant**, we would need to group by participant ID first.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n group_by(???) 
%>% \n summarise(Flourishing_pre = mean(???)) %>% \n ungroup()\n```\n:::\n\n:::\n\n:::\n\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(tidyverse)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\n\n# Task 4: Tidying \ndata_flourishing <- dog_data_raw %>% \n # Step 1\n select(RID, F1_1:F1_8) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Questionnaire\", values_to = \"Responses\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_pre = mean(Responses)) %>% \n ungroup()\n```\n:::\n\n\n:::\n\n\n\n## [Test your knowledge and challenge yourself]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nWhich function of the Wickham Six would you use to include or exclude certain variables (columns)? \n\n\n#### Question 2 {.unnumbered}\n\nWhich function of the Wickham Six would you use to create new columns or modify existing columns in a dataframe? \n\n\n#### Question 3 {.unnumbered}\n\n\nWhich function of the Wickham Six would you use to organise data into groups based on one or more columns? \n\n\n\n#### Question 4 {.unnumbered}\n\nWhich function of the Wickham Six would you use to sort the rows of a dataframe based on the values in one or more columns? \n\n\n\n#### Question 5 {.unnumbered}\n\nWhich function of the Wickham Six would NOT modify the original dataframe? 
\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain these answers\n\n| Function | Description |\n|:-------------|:------------------------------------------------------|\n| `select()` | Include or exclude certain variables/columns |\n| `filter()` | Include or exclude certain observations/rows |\n| `mutate()` | Creates new columns or modifies existing ones |\n| `arrange()` | Changes the order of the rows |\n| `group_by()` | Split data into groups based on one or more variables |\n| `summarise()`| Creates a new dataframe returning one row for each combination of grouping variables |\n\n\nTechnically, the first five functions operate on the existing data object, making adjustments like sorting the data (e.g., with `arrange()`), reducing the number of rows (e.g., with `filter()`), reducing the number of columns (e.g., with `select()`), or adding new columns (e.g., with `mutate()`). In contrast, `summarise()` fundamentally alters the structure of the original dataframe by generating a completely new dataframe that contains only summary statistics, rather than retaining the original rows and columns.\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of the code chunks contain mistakes and result in errors, while others do not produce the expected results. Your task is to identify any issues, explain why they occurred, and, if possible, fix them.\n\nWe will use a few built-in datasets, such as `billboard` and `starwars`, to help you replicate the errors in your own R environment. You can view the data either by typing the dataset name directly into your console or by storing the data as a separate object in your `Global Environment`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbillboard\n\nstarwars_data = starwars\n```\n:::\n\n\n\n\n#### Question 6 {.unnumbered}\n\nCurrently, the weekly song rankings for Billboard Top 100 in 2000 are in wide format, with each week in a separate column. 
The following code is supposed to transpose the wide-format `billboard` data into long format:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(names_to = \"weeks\", values_to = \"rank\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `pivot_longer()`:\n! `cols` must select at least one column.\n```\n:::\n:::\n\n\nWhat does this error message mean and how do you fix it?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe error message indicates that the `cols` argument is missing in the function. This means the function doesn’t know which columns to transpose from wide format to long format.\n\nFIX: Add `cols = wk1:wk76` to the function to select columns from wk1 to wk76. Alternatively, `cols = starts_with(\"wk\")` would also work since all columns start with the letter combination \"wk\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(cols = wk1:wk76, names_to = \"weeks\", values_to = \"rank\")\n# OR\nlong_data <- billboard %>% \n pivot_longer(cols = starts_with(\"wk\"), names_to = \"weeks\", values_to = \"rank\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\nThe following code is intended to calculate the mean height of all the characters in the built-in `starwars` dataset, grouped by their gender. 
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = height)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\ndplyr 1.1.0.\nℹ Please use `reframe()` instead.\nℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n always returns an ungrouped data frame and adjust accordingly.\n```\n:::\n:::\n\n\nThe code runs, but it's giving us some weird warning and the output is also not as expected. What steps should we take to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe aggregation function `mean()` is missing from within `summarise()`. Without it, the function does not perform any aggregation and returns *all* rows with only the columns for gender and height.\n\nFIX: Wrap the `mean()` function around the variable you want to aggregate, here `height`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height))\n```\n:::\n\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nFollowing up on Question 7, we now have `summary_data` that looks approximately correct - it has the expected rows and column numbers, however, the cell values are \"weird\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | NA|\n|masculine | NA|\n|NA | 175|\n\n
\n:::\n:::\n\n\n\nCan you explain what is happening here? And how can we modify the code to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nLook at the original `starwars` data. You will notice that some of the characters with feminine and masculine gender entries have missing height values. However, all four characters without a specified gender have provided their height.\n\nFIX: We need to add `na.rm = TRUE` to the `mean()` function to ensure that R ignores missing values before aggregating the data.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height, na.rm = TRUE))\n\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | 166.5333|\n|masculine | 176.5323|\n|NA | 175.0000|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n### Challenge yourself {.unnumbered}\n\nIf you want to **challenge yourself** and further apply the skills from Chapter 2, you can wrangle the data from `dog_data_raw` for additional questionnaires from either the pre- and/or post-intervention stages:\n\n* Calculate the mean score for `flourishing_post` for each participant.\n* Calculate the mean score for the `PANAS` (Positive and/or Negative Affect) per participant\n* Calculate the mean score for happiness (`SHS`) per participant\n\nThe 3 steps are equivalent for those questionnaires - select, pivot, group_by and summarise; you just have to \"replace\" the questionnaire items involved.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution for **Challenge yourself**\n\nFlourishing post-intervention\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## flourishing_post\nflourishing_post <- dog_data_raw %>% \n # Step 1\n select(RID, starts_with(\"F2\")) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Names\", values_to = \"Response\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_post = mean(Response)) %>% \n ungroup()\n```\n:::\n\n\nThe PANAS could be solved more concisely with the skills we learn in @sec-wrangling2, but for now, you would have solved it this way:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# PANAS - positive affect pre\nPANAS_PA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_3, PN1_5, PN1_7, PN1_8, PN1_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - positive affect post\nPANAS_PA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_3, PN2_5, PN2_7, PN2_8, PN2_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_post = mean(Scores)) %>% \n 
ungroup()\n\n# PANAS - negative affect pre\nPANAS_NA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_1, PN1_2, PN1_4, PN1_6, PN1_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - negative affect post\nPANAS_NA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_1, PN2_2, PN2_4, PN2_6, PN2_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_post = mean(Scores)) %>% \n ungroup()\n```\n:::\n\n\nHappiness scale\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# happiness_pre\nhappiness_pre <- dog_data_raw %>% \n # Step 1\n select(RID, HA1_1, HA1_2, HA1_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_pre = mean(Score)) %>% \n ungroup()\n\n#happiness_post\nhappiness_post <- dog_data_raw %>% \n # Step 1\n select(RID, HA2_1, HA2_2, HA2_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_post = mean(Score)) %>% \n ungroup()\n```\n:::\n\n\n:::\n", + "markdown": "# Data wrangling I {#sec-wrangling}\n\n## Intended Learning Outcomes {.unnumbered}\n\nIn the next two chapters, we will build on the data wrangling skills from level 1. We will revisit all the functions you have already encountered (and might have forgotten over the summer break) and introduce 2 or 3 new functions. 
These two chapters will provide an opportunity to revise and apply the functions to a novel dataset.\n\nBy the end of this chapter, you should be able to:\n\n- apply familiar data wrangling functions to novel datasets\n- read and interpret error messages\n- realise there are several ways of getting to the results\n- export data objects as csv files\n\nThe main purpose of this chapter and @sec-wrangling2 is to wrangle your data into shape for data visualisation (@sec-dataviz and @sec-dataviz2). For the two chapters, we will:\n\n1. calculate demographics\n2. tidy 3 different questionnaires with varying degrees of complexity\n3. solve an error mode problem\n4. join all data objects together\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nBefore we start, we need to set up some things.\n\n\n## Activity 1: Setup\n\n* We will be working on the **dataset by Pownall et al. (2023)** again, which means we can still use the project we created last week. The data files will already be there, so no need to download them again.\n* To **open the project** in RStudio, go to the folder in which you stored the project and the data last time, and double click on the project icon.\n* **Create a new `.Rmd` file** for chapter 2 and save it to your project folder. Name it something meaningful (e.g., “chapter_02.Rmd”, “02_data_wrangling.Rmd”). 
See @sec-rmd if you need some guidance.\n* In your newly created `.Rmd` file, delete everything below line 12 (after the set-up code chunk).\n\n\n\n## Activity 2: Load in the libraries and read in the data\n\nWe will use `tidyverse` today, and we want to create a data object `data_prp` that stores the data from the file `prp_data_reduced.csv`.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(???)\ndata_prp <- read_csv(\"???\")\n```\n:::\n\n\n\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n\n:::\n\nIf you need a quick reminder what the dataset was about, have a look at the abstract in @sec-download_data_ch1. We also addressed the changes we made to the dataset there.\n\nAnd remember to have a quick `glimpse()` at your data.\n\n\n\n## Activity 3: Calculating demographics\n\nLet’s start with some simple data-wrangling steps to compute demographics for our original dataset, `data_prp`. First, we want to determine how many participants took part in the study by Pownall et al. (2023) and compute the mean age and the standard deviation of age for the sample.\n\n\n\n### ... for the full sample using `summarise()`\n\nThe `summarise()` function is part of the **\"Wickham Six\"** alongside `group_by()`, `select()`, `filter()`, `mutate()`, and `arrange()`. You used them plenty of times last year.\n\nWithin `summarise()`, we can use the `n()` function, which calculates the number of rows in the dataset. 
Since each row corresponds to a unique participant, this gives us the total number of participants.\n\nTo calculate the mean age and the standard deviation of age, we need to use the functions `mean()` and `sd()` on the column `Age` respectively.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: There were 2 warnings in `summarise()`.\nThe first warning was:\nℹ In argument: `mean_age = mean(Age)`.\nCaused by warning in `mean.default()`:\n! argument is not numeric or logical: returning NA\nℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.\n```\n:::\n\n```{.r .cell-code}\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nR did not give us an error message per se, but the output is not quite as expected either. There are `NA` values in the `mean_age` and `sd_age` columns. Looking at the warning message and at `Age`, can you explain what happened?\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nThe warning message says: `argument is not numeric or logical: returning NA` If we look at the `Age` column more closely, we can see that it's a character data type.\n\n:::\n\n\n\n#### Fixing `Age` {.unnumbered}\n\nMight be wise to look at the unique answers in column `Age` to determine what is wrong. We can do that with the function `distinct()`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nage_distinct <- data_prp %>% \n distinct(Age)\n\nage_distinct\n```\n:::\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Show the unique values of `Age`.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Age |\n|:--------|\n|22 |\n|20 |\n|26 |\n|21 |\n|29 |\n|23 |\n|39 |\n|NA |\n|24 |\n|43 |\n|31 |\n|25 years |\n\n
\n:::\n:::\n\n:::\n\n::: columns\n\n::: column\n\nOne cell has the string \"years\" added to their number 25, which has converted the entire column into a character column.\n\nWe can easily fix this by extracting only the numbers from the column and converting it into a numeric data type. The `parse_number()` function, which is part of the `tidyverse` package, handles both steps in one go (so there’s no need to load additional packages).\n\nWe will combine this with the `mutate()` function to create a new column called `Age` (containing those numeric values), effectively replacing the old `Age` column (which had the character values).\n\n:::\n\n::: column\n\n![parse_number() illustration by Allison Horst (see [https://allisonhorst.com/r-packages-functions](https://allisonhorst.com/r-packages-functions){target=\"_blank\"})](images/parse_number.png){width=\"95%\"}\n\n:::\n\n:::\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_prp <- data_prp %>% \n mutate(Age = parse_number(Age))\n\ntypeof(data_prp$Age) # fixed\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n```\n:::\n:::\n\n\n\n\n#### Computing summary stats {.unnumbered}\n\nExcellent. Now that the numbers are in a numeric format, let's try calculating the demographics for the total sample again.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nEven though there's no error or warning, the table still shows `NA` values for `mean_age` and `sd_age`. So, what could possibly be wrong now?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nDid you notice that the `Age` column in `age_distinct` contains some missing values (`NA`)? To be honest, it's easier to spot this issue in the actual R output than in the printed HTML page.\n\n:::\n\n\n\n#### Computing summary stats - third attempt {.unnumbered}\n\nTo ensure R ignores missing values during calculations, we need to add the extra argument `na.rm = TRUE` to the `mean()` and `sd()` functions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age, na.rm = TRUE), # mean age\n sd_age = sd(Age, na.rm = TRUE)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|--------:|\n| 89| 21.88506| 3.485603|\n\n
\n:::\n:::\n\n\nFinally, we’ve got it! 🥳 Third time's the charm!\n\n\n\n### ... per gender using `summarise()` and `group_by()`\n\nNow we want to compute the summary statistics for each gender. The code inside the `summarise()` function remains unchanged; we just need to use the `group_by()` function beforehand to tell R that we want to compute the summary statistics for each group separately. It’s also a good practice to use `ungroup()` afterwards, so you are not taking groupings forward unintentionally.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% # split data up into groups (here Gender)\n summarise(n = n(), # participant number \n mean_age = mean(Age, na.rm = TRUE), # mean age \n sd_age = sd(Age, na.rm = TRUE)) %>% # standard deviation of age\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| mean_age| sd_age|\n|------:|--:|--------:|--------:|\n| 1| 17| 23.31250| 5.770254|\n| 2| 69| 21.57353| 2.738973|\n| 3| 3| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n\n\n### Adding percentages\n\nSometimes, it may be useful to calculate percentages, such as for the gender split. You can do this by adding a line within the `summarise()` function to perform the calculation. All we need to do is take the number of female, male, and non-binary participants (stored in the `n` column of `demo_by_gender`), divide it by the total number of participants (stored in the `n` column of `demo_total`), and multiply by 100. Let's add `percentage` to the `summarise()` function of `demo_by_gender`. Make sure that the code for `percentage` is placed after the value for `n` has been computed.\n\nAccessing the value of `n` for the different gender categories is straightforward because we can refer back to it directly. However, since the total number of participants is stored in a different data object, we need to use a base R function to access it – specifically the `$` operator. To do this, you simply type the name of the data object (in this case, `demo_total`), followed by the `$` symbol (with no spaces), and then the name of the column you want to retrieve (in this case, `n`). The general pattern is `data$column`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n # n from the line above divided by n from demo_total *100\n percentage = n/demo_total$n *100, \n mean_age = mean(Age, na.rm = TRUE), \n sd_age = sd(Age, na.rm = TRUE)) %>% \n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|--------:|\n| 1| 17| 19.101124| 23.31250| 5.770254|\n| 2| 69| 77.528090| 21.57353| 2.738973|\n| 3| 3| 3.370786| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for decimal places - use `round()`\n\nNot super important, because you could round the values by yourself when writing up your reports, but if you wanted to tidy up the decimal places in the output, you can do that using the `round()` function. You would need to \"wrap\" it around your computations and specify how many decimal places you want to display (for example `mean(Age)` would turn into `round(mean(Age), 1)`). It may look odd for `percentage`, just make sure the number that specifies the decimal places is placed **within** the round function. The default value is 0 (meaning no decimal places).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n percentage = round(n/demo_total$n *100, 2), # percentage with 2 decimal places\n mean_age = round(mean(Age, na.rm = TRUE), 1), # mean Age with 1 decimal place\n sd_age = round(sd(Age, na.rm = TRUE), 3)) %>% # sd Age with 3 decimal places\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|------:|\n| 1| 17| 19.10| 23.3| 5.770|\n| 2| 69| 77.53| 21.6| 2.739|\n| 3| 3| 3.37| 21.3| 1.155|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 4: Questionable Research Practices (QRPs) {#sec-ch2_act4}\n\n#### The main goal is to compute the mean QRP score per participant for time point 1. {.unnumbered}\n\nAt the moment, the data is in wide format. The table below shows data from the first 3 participants:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhead(data_prp, n = 3)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | Gender| Age|Ethnicity | Secondyeargrade| Opptional_mod|Opptional_mod_1_TEXT | Research_exp|Research_exp_1_TEXT | Plan_prereg| SATS28_1_Affect_Time1| SATS28_2_Affect_Time1| SATS28_3_Affect_Time1| SATS28_4_Affect_Time1| SATS28_5_Affect_Time1| SATS28_6_Affect_Time1| SATS28_7_CognitiveCompetence_Time1| SATS28_8_CognitiveCompetence_Time1| SATS28_9_CognitiveCompetence_Time1| SATS28_10_CognitiveCompetence_Time1| SATS28_11_CognitiveCompetence_Time1| SATS28_12_CognitiveCompetence_Time1| SATS28_13_Value_Time1| SATS28_14_Value_Time1| SATS28_15_Value_Time1| SATS28_16_Value_Time1| SATS28_17_Value_Time1| SATS28_18_Value_Time1| SATS28_19_Value_Time1| SATS28_20_Value_Time1| SATS28_21_Value_Time1| SATS28_22_Difficulty_Time1| SATS28_23_Difficulty_Time1| SATS28_24_Difficulty_Time1| SATS28_25_Difficulty_Time1| SATS28_26_Difficulty_Time1| SATS28_27_Difficulty_Time1| SATS28_28_Difficulty_Time1| QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1| QRPs_12NotQRP_Time1| QRPs_13NotQRP_Time1| QRPs_14NotQRP_Time1| QRPs_15NotQRP_Time1|Understanding_OS_1_Time1 |Understanding_OS_2_Time1 |Understanding_OS_3_Time1 |Understanding_OS_4_Time1 |Understanding_OS_5_Time1 |Understanding_OS_6_Time1 |Understanding_OS_7_Time1 |Understanding_OS_8_Time1 |Understanding_OS_9_Time1 |Understanding_OS_10_Time1 |Understanding_OS_11_Time1 |Understanding_OS_12_Time1 | Pre_reg_group| Other_OS_behav_2| Other_OS_behav_4| Other_OS_behav_5| Closely_follow| SATS28_Affect_Time2_mean| SATS28_CognitiveCompetence_Time2_mean| SATS28_Value_Time2_mean| SATS28_Difficulty_Time2_mean| QRPs_Acceptance_Time2_mean| Time2_Understanding_OS| Supervisor_1| Supervisor_2| Supervisor_3| Supervisor_4| Supervisor_5| Supervisor_6| Supervisor_7| Supervisor_8| Supervisor_9| Supervisor_10| Supervisor_11| Supervisor_12| Supervisor_13| Supervisor_14| 
Supervisor_15_R|\n|:----|------:|---:|:--------------|---------------:|-------------:|:------------------------------|------------:|:-------------------|-----------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|----------------------------------:|----------------------------------:|----------------------------------:|-----------------------------------:|-----------------------------------:|-----------------------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:-------------------------|:-------------------------|:-------------------------|-------------:|----------------:|----------------:|----------------:|--------------:|------------------------:|-------------------------------------:|-----------------------:|----------------------------:|--------------------------:|----------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------:|-------------:|-------------:|---------------:|\n|Tr10 | 2| 22|White European | 2| 
1|Research methods in first year | 2|NA | 1| 4| 5| 3| 4| 5| 5| 4| 2| 2| 6| 4| 3| 1| 7| 7| 2| 1| 3| 3| 2| 2| 3| 5| 2| 6| 4| 4| 1| 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7| 2| 1| 1| 2|2 |2 |2 |6 |Entirely confident |Entirely confident |6 |6 |Entirely confident |Entirely confident |Entirely confident |Entirely confident | 1| 1| 1| NA| 2| 3.500000| 4.166667| 3.000000| 2.857143| 5.636364| 5.583333| 5| 5| 6| 6| 5| 5| 1| 5| 6| 5| NA| 4| 4| 5| 1|\n|Bi07 | 2| 20|White British | 3| 2|NA | 2|NA | 3| 5| 6| 2| 5| 5| 6| 2| 2| 2| 7| 3| 5| 1| 7| 7| 1| 1| 6| 3| 1| 1| 2| 6| 2| 7| 2| 5| 7| 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7| 2| 1| 4| 4|2 |Not at all confident |Not at all confident |Not at all confident |6 |Entirely confident |Not at all confident |3 |6 |6 |2 |2 | 1| NA| NA| NA| 2| 3.166667| 4.666667| 6.222222| 2.857143| 5.454546| 3.333333| 7| 6| 7| 7| 7| 7| 1| 5| 7| 7| 7| 5| 2| 7| 1|\n|SK03 | 2| 22|White British | 1| 2|NA | 2|NA | 1| 5| 3| 5| 2| 5| 2| 2| 2| 2| 6| 5| 3| 2| 6| 6| 3| 3| 5| 3| 4| 3| 5| 5| 2| 5| 2| 5| 5| 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7| 1| 1| 3| 2|6 |2 |3 |6 |6 |5 |2 |5 |5 |5 |4 |5 | 1| NA| NA| NA| 2| 4.833333| 6.166667| 6.000000| 4.000000| 6.272727| 5.416667| 7| 7| 7| 7| 7| 7| 1| 7| 7| 7| 7| 7| 5| 7| 1|\n\n
\n:::\n:::\n\n

\n\nLooking at the QRP data at time point 1, you determine that\n\n* individual item columns are , and\n* according to the codebook, there are reverse-coded items in this questionnaire.\n\nAccording to the codebook and the data table above, we just have to **compute the average score for QRP items to **, since items to are distractor items. Seems quite straightforward.\n\nHowever, as you can see in the table above, each item is in a separate column, meaning the data is in **wide format**. It would be much easier to calculate the mean scores if the items were arranged in **long format**.\n\n\nLet’s tackle this problem step by step. It’s best to create a separate data object for this. If we tried to compute it within `data_prp`, it could quickly become messy.\n\n\n* **Step 1**: Select the relevant columns `Code`, and `QRPs_1_Time1` to `QRPs_11_Time1` and store them in an object called `qrp_t1`\n* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can calculate the average score more easily (in step 3)\n* **Step 3**: Calculate the average QRP score (`QRPs_Acceptance_Time1_mean`) per participant using `group_by()` and `summarise()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_t1 <- data_prp %>% \n #Step 1\n select(Code, QRPs_1_Time1:QRPs_11_Time1) %>%\n # Step 2\n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(Code) %>% # grouping by participant id\n summarise(QRPs_Acceptance_Time1_mean = mean(Scores)) %>% # calculating the average Score\n ungroup() # just make it a habit\n```\n:::\n\n\n::: {.callout-caution icon=\"false\" collapse=\"true\"}\n\n## Explain the individual functions\n\n::: panel-tabset\n\n## `select ()`\n\nThe select function allows to include or exclude certain variables (columns). Here we want to focus on the participant ID column (i.e., `Code`) and the QRP items at time point 1. 
We can either list them all individually, i.e., Code, QRPs_1_Time1, QRPs_2_Time1, QRPs_3_Time1, and so forth (you get the gist), but that would take forever to type.\n\nA shortcut is to use the colon operator `:`. It allows us to select all columns that fall within the range of `first_column_name` to `last_column_name`. We can apply this here since the QRP items (1 to 11) are sequentially listed in `data_prp`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step1 <- data_prp %>% \n select(Code, QRPs_1_Time1:QRPs_11_Time1)\n\n# show first 5 rows of qrp_step1\nhead(qrp_step1, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1|\n|:----|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|\n|Tr10 | 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7|\n|Bi07 | 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7|\n|SK03 | 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7|\n|SM95 | 7| 7| 2| 6| 7| 5| 7| 7| 4| 2| 4|\n|St01 | 7| 7| 6| 7| 2| 7| 7| 7| 7| 5| 7|\n\n
\n:::\n:::\n\n\nHow many rows/observations and columns/variables do we have in `qrp_step1`?\n\n* rows/observations: \n* columns/variables: \n\n## `pivot_longer()`\n\nAs you can see, the table we got from Step 1 is in wide format. To get it into long format, we need to define:\n\n* the columns that need to be reshuffled from wide into long format (`cols` argument). Here we selected \"everything except the `Code` column\", as indicated by `-Code` \\[minus `Code`\\]. However, `QRPs_1_Time1:QRPs_11_Time1` would also work and give you the exact same result.\n* the `names_to` argument. R creates a new column in which all the column names from the columns you selected in `cols` will be stored. Here we are naming this column \"Items\" but you could pick something equally sensible if you like.\n* the `values_to` argument. R creates this second column to store all responses the participants gave to the individual questions, i.e., all the numbers in this case. We named it \"Scores\" here, but you could have called it something different, like \"Responses\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step2 <- qrp_step1 %>% \n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\")\n\n# show first 15 rows of qrp_step2\nhead(qrp_step2, n = 15)\n```\n\n::: {.cell-output-display}\n
\n\n|Code |Items | Scores|\n|:----|:-------------|------:|\n|Tr10 |QRPs_1_Time1 | 7|\n|Tr10 |QRPs_2_Time1 | 7|\n|Tr10 |QRPs_3_Time1 | 5|\n|Tr10 |QRPs_4_Time1 | 7|\n|Tr10 |QRPs_5_Time1 | 3|\n|Tr10 |QRPs_6_Time1 | 4|\n|Tr10 |QRPs_7_Time1 | 5|\n|Tr10 |QRPs_8_Time1 | 7|\n|Tr10 |QRPs_9_Time1 | 6|\n|Tr10 |QRPs_10_Time1 | 7|\n|Tr10 |QRPs_11_Time1 | 7|\n|Bi07 |QRPs_1_Time1 | 7|\n|Bi07 |QRPs_2_Time1 | 7|\n|Bi07 |QRPs_3_Time1 | 2|\n|Bi07 |QRPs_4_Time1 | 7|\n\n
\n:::\n:::\n\n\nNow, have a look at `qrp_step2`. In total, we now have rows/observations, per participant, and columns/variables.\n\n## `group_by()` and `summarise()`\n\nThis follows exactly the same sequence we used when calculating descriptive statistics by gender. The only difference is that we are now grouping the data by the participant's `Code` instead of `Gender`.\n\n`summarise()` works exactly the same way: `summarise(new_column_name = function_to_calculate_something(column_name_of_numeric_values))`\n\nThe `function_to_calculate_something` can be `mean()`, `sd()` or `sum()` for mean scores, standard deviations, or summed-up scores respectively. You could also use `min()` or `max()` if you wanted to determine the lowest or the highest score for each participant.\n\n:::\n\n:::\n\n::: callout-tip\n\nYou could **rename the columns whilst selecting** them. The pattern would be `select(new_name = old_name)`. For example, if we wanted to select variable `Code` and rename it as `Participant_ID`, we could do that.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrenaming_col <- data_prp %>% \n select(Participant_ID = Code)\n\nhead(renaming_col, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Participant_ID |\n|:--------------|\n|Tr10 |\n|Bi07 |\n|SK03 |\n|SM95 |\n|St01 |\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 5: Knitting\n\nOnce you've completed your R Markdown file, the final step is to \"knit\" it, which converts the `.Rmd` file into a HTML file. Knitting combines your code, text, and output (like tables and plots) into a single cohesive document. This is a really good way to check your code is working.\n\nTo knit the file, **click the Knit button** at the top of your RStudio window. The document will be generated and, depending on your setting, automatically opened in the viewer in the `Output pane` or an external browser window.\n\nIf any errors occur during knitting, RStudio will show you an error message with details to help you troubleshoot.\n\nIf you want to **intentionally keep any errors** we tackled today to keep a reference on how you solved them, you could add `error=TRUE` or `eval=FALSE` to the code chunk that isn't running.\n\n\n\n## Activity 6: Export a data object as a csv\n\nTo avoid having to repeat the same steps in the next chapter, it's a good idea to save the data objects you've created today as csv files. You can do this by using the `write_csv()` function from the `readr` package. The csv files will appear in your project folder.\n\nThe basic syntax is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_object, \"filename.csv\")\n```\n:::\n\n\nNow, let's export the objects `data_prp` and `qrp_t1`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_prp, \"data_prp_for_ch3.csv\")\n```\n:::\n\n\nHere we named the file `data_prp_for_ch3.csv`, so we wouldn't overwrite the original data csv file `prp_data_reduced.csv`. 
However, feel free to choose a name that makes sense to you.\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nExport `qrp_t1`.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(qrp_t1, \"qrp_t1.csv\")\n```\n:::\n\n\n:::\n\n:::\n\nCheck that your csv files have appeared in your project folder, and you're all set!\n\n**That’s it for Chapter 2: Individual Walkthrough.**\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n\nWe will continue working with the data from Binfet et al. (2021), focusing on the randomised controlled trial of therapy dog interventions. Today, our goal is to **calculate an average `Flourishing` score for each participant** at time point 1 (pre-intervention) using the raw data file `dog_data_raw`. Currently, the data looks like this:\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| F1_1| F1_2| F1_3| F1_4| F1_5| F1_6| F1_7| F1_8|\n|---:|----:|----:|----:|----:|----:|----:|----:|----:|\n| 1| 6| 7| 5| 5| 7| 7| 6| 6|\n| 2| 5| 7| 6| 5| 5| 5| 5| 4|\n| 3| 5| 5| 5| 6| 6| 6| 5| 5|\n| 4| 7| 6| 7| 7| 7| 6| 7| 4|\n| 5| 5| 5| 4| 6| 7| 7| 7| 6|\n\n
\n:::\n:::\n\n\n\nHowever, we want the data to look like this:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| Flourishing_pre|\n|---:|---------------:|\n| 1| 6.125|\n| 2| 5.250|\n| 3| 5.375|\n| 4| 6.375|\n| 5| 5.875|\n\n
\n:::\n:::\n\n\n\n\n### Task 1: Open the R project you created last week {.unnumbered}\n\nIf you haven’t created an R project for the lab yet, please do so now. If you already have one set up, go ahead and open it.\n\n\n### Task 2: Open your `.Rmd` file from last week {.unnumbered}\n\nSince we haven’t used it much yet, feel free to continue using the `.Rmd` file you created last week in Task 2.\n\n\n### Task 3: Load in the library and read in the data {.unnumbered}\n\nThe data should be in your project folder. If you didn’t download it last week, or if you’d like a fresh copy, you can download the data again here: [data_pair_ch1](data/data_pair_ch1.zip \"download\").\n\nWe will be using the `tidyverse` package today, and the data file we need to read in is `dog_data_raw.csv`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(???)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"???\")\n```\n:::\n\n\n:::\n\n\n### Task 4: Calculating the mean for `Flourishing_pre` {.unnumbered}\n\n\n* **Step 1**: Select all relevant columns from `dog_data_raw`, including participant ID and all items from the `Flourishing` questionnaire completed before the intervention. Store this data in an object called `data_flourishing`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nLook at the codebook. Try to determine:\n\n* The variable name of the column where the participant ID is stored.\n* The items related to the Flourishing scale at the pre-intervention stage.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nFrom the codebook, we know that:\n\n* The participant ID column is called `RID`.\n* The Flourishing items at the pre-intervention stage start with `F1_`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_flourishing <- ??? 
%>% \n select(???, F1_???:F1_???)\n```\n:::\n\n\n:::\n\n:::\n\n\n* **Step 2**: Pivot the data from wide format to long format so we can calculate the average score more easily (in step 3).\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nWhich pivot function should you use? We have `pivot_wider()` and `pivot_longer()` to choose from.\n\nWe also need 3 arguments in that function:\n\n* The columns you want to select (e.g., all the Flourishing items),\n* The name of the column where the current column headings will be stored (e.g., \"Questionnaire\"),\n* The name of the column that should store all the values (e.g., \"Responses\").\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nWe need `pivot_longer()`. You already encountered `pivot_longer()` in first year (or in the individual walkthrough if you have already completed this Chapter). The 3 arguments were also a giveaway; `pivot_wider()` only requires 2 arguments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n pivot_longer(cols = ???, names_to = \"???\", values_to = \"???\")\n```\n:::\n\n\n:::\n\n:::\n\n* **Step 3**: Calculate the average Flourishing score per participant and name this column `Flourishing_pre` to match the table above.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nBefore summarising the mean, you may need to group the data.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nTo compute an average score **per participant**, we would need to group by participant ID first.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n group_by(???) 
%>% \n summarise(Flourishing_pre = mean(???)) %>% \n ungroup()\n```\n:::\n\n:::\n\n:::\n\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(tidyverse)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\n\n# Task 4: Tidying \ndata_flourishing <- dog_data_raw %>% \n # Step 1\n select(RID, F1_1:F1_8) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Questionnaire\", values_to = \"Responses\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_pre = mean(Responses)) %>% \n ungroup()\n```\n:::\n\n\n:::\n\n\n\n## [Test your knowledge and challenge yourself]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nWhich function of the Wickham Six would you use to include or exclude certain variables (columns)? \n\n\n#### Question 2 {.unnumbered}\n\nWhich function of the Wickham Six would you use to create new columns or modify existing columns in a dataframe? \n\n\n#### Question 3 {.unnumbered}\n\n\nWhich function of the Wickham Six would you use to organise data into groups based on one or more columns? \n\n\n\n#### Question 4 {.unnumbered}\n\nWhich function of the Wickham Six would you use to sort the rows of a dataframe based on the values in one or more columns? \n\n\n\n#### Question 5 {.unnumbered}\n\nWhich function of the Wickham Six would NOT modify the original dataframe? 
\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain these answers\n\n| Function | Description |\n|:-------------|:------------------------------------------------------|\n| `select()` | Include or exclude certain variables/columns |\n| `filter()` | Include or exclude certain observations/rows |\n| `mutate()` | Creates new columns or modifies existing ones |\n| `arrange()` | Changes the order of the rows |\n| `group_by()` | Split data into groups based on one or more variables |\n| `summarise()`| Creates a new dataframe returning one row for each combination of grouping variables |\n\n\nTechnically, the first five functions operate on the existing data object, making adjustments like sorting the data (e.g., with `arrange()`), reducing the number of rows (e.g., with `filter()`), reducing the number of columns (e.g., with `select()`), or adding new columns (e.g., with `mutate()`). In contrast, `summarise()` fundamentally alters the structure of the original dataframe by generating a completely new dataframe that contains only summary statistics, rather than retaining the original rows and columns.\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of the code chunks contain mistakes and result in errors, while others do not produce the expected results. Your task is to identify any issues, explain why they occurred, and, if possible, fix them.\n\nWe will use a few built-in datasets, such as `billboard` and `starwars`, to help you replicate the errors in your own R environment. You can view the data either by typing the dataset name directly into your console or by storing the data as a separate object in your `Global Environment`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbillboard\n\nstarwars_data = starwars\n```\n:::\n\n\n\n\n#### Question 6 {.unnumbered}\n\nCurrently, the weekly song rankings for Billboard Top 100 in 2000 are in wide format, with each week in a separate column. 
The following code is supposed to transpose the wide-format `billboard` data into long format:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(names_to = \"weeks\", values_to = \"rank\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `pivot_longer()`:\n! `cols` must select at least one column.\n```\n:::\n:::\n\n\nWhat does this error message mean and how do you fix it?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe error message indicates that the `cols` argument is missing in the function. This means the function doesn’t know which columns to transpose from wide format to long format.\n\nFIX: Add `cols = wk1:wk76` to the function to select columns from wk1 to wk76. Alternatively, `cols = starts_with(\"wk\")` would also work since all columns start with the letter combination \"wk\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(cols = wk1:wk76, names_to = \"weeks\", values_to = \"rank\")\n# OR\nlong_data <- billboard %>% \n pivot_longer(cols = starts_with(\"wk\"), names_to = \"weeks\", values_to = \"rank\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\nThe following code is intended to calculate the mean height of all the characters in the built-in `starwars` dataset, grouped by their gender. 
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = height)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\ndplyr 1.1.0.\nℹ Please use `reframe()` instead.\nℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n always returns an ungrouped data frame and adjust accordingly.\n```\n:::\n:::\n\n\nThe code runs, but it's giving us some weird warning and the output is also not as expected. What steps should we take to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe aggregation function `mean()` is missing from within `summarise()`. Without it, the function does not perform any aggregation and returns *all* rows with only the columns for gender and height.\n\nFIX: Wrap the `mean()` function around the variable you want to aggregate, here `height`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height))\n```\n:::\n\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nFollowing up on Question 7, we now have `summary_data` that looks approximately correct - it has the expected rows and column numbers, however, the cell values are \"weird\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | NA|\n|masculine | NA|\n|NA | 175|\n\n
\n:::\n:::\n\n\n\nCan you explain what is happening here? And how can we modify the code to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nLook at the original `starwars` data. You will notice that some of the characters with feminine and masculine gender entries have missing height values. However, all four characters without a specified gender have provided their height.\n\nFIX: We need to add `na.rm = TRUE` to the `mean()` function to ensure that R ignores missing values before aggregating the data.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height, na.rm = TRUE))\n\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | 166.5333|\n|masculine | 176.5323|\n|NA | 175.0000|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n### Challenge yourself {.unnumbered}\n\nIf you want to **challenge yourself** and further apply the skills from Chapter 2, you can wrangle the data from `dog_data_raw` for additional questionnaires from either the pre- and/or post-intervention stages:\n\n* Calculate the mean score for `flourishing_post` for each participant.\n* Calculate the mean score for the `PANAS` (Positive and/or Negative Affect) per participant\n* Calculate the mean score for happiness (`SHS`) per participant\n\nThe 3 steps are equivalent for those questionnaires - select, pivot, group_by and summarise; you just have to \"replace\" the questionnaire items involved.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution for **Challenge yourself**\n\nFlourishing post-intervention\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## flourishing_post\nflourishing_post <- dog_data_raw %>% \n # Step 1\n select(RID, starts_with(\"F2\")) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Names\", values_to = \"Response\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_post = mean(Response)) %>% \n ungroup()\n```\n:::\n\n\nThe PANAS could be solved more concisely with the skills we learn in @sec-wrangling2, but for now, you would have solved it this way:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# PANAS - positive affect pre\nPANAS_PA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_3, PN1_5, PN1_7, PN1_8, PN1_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - positive affect post\nPANAS_PA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_3, PN2_5, PN2_7, PN2_8, PN2_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_post = mean(Scores)) %>% \n 
ungroup()\n\n# PANAS - negative affect pre\nPANAS_NA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_1, PN1_2, PN1_4, PN1_6, PN1_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - negative affect post\nPANAS_NA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_1, PN2_2, PN2_4, PN2_6, PN2_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_post = mean(Scores)) %>% \n ungroup()\n```\n:::\n\n\nHappiness scale\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# happiness_pre\nhappiness_pre <- dog_data_raw %>% \n # Step 1\n select(RID, HA1_1, HA1_2, HA1_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_pre = mean(Score)) %>% \n ungroup()\n\n#happiness_post\nhappiness_post <- dog_data_raw %>% \n # Step 1\n select(RID, HA2_1, HA2_2, HA2_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_post = mean(Score)) %>% \n ungroup()\n```\n:::\n\n\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/.quarto/cites/index.json b/.quarto/cites/index.json index 02e2ca5..000df8e 100644 --- a/.quarto/cites/index.json +++ b/.quarto/cites/index.json @@ -1 +1 @@ 
-{"09-correlation.qmd":[],"09-simple-regression.qmd":[],"05-dataviz2.qmd":[],"07-apes.qmd":[],"11-one-way-anova.qmd":[],"11-multiple-regression.qmd":[],"06-paired.qmd":[],"08-paired.qmd":[],"appendix-c-exporting-server.qmd":[],"06-independent.qmd":[],"04-dataviz.qmd":[],"10-regression.qmd":[],"references.qmd":[],"04-dataviz2.qmd":[],"index.qmd":[],"13-factorial-anova.qmd":[],"01-basics.qmd":[],"08-correlation.qmd":[],"02-wrangling.qmd":[],"12-one-way-anova.qmd":[],"appendix-d-symbols.qmd":[],"10-multiple-regression.qmd":[],"03-wrangling2.qmd":[],"05-independent.qmd":[],"instructions.qmd":["usethis"],"03-dataviz.qmd":[],"appendix-a-installing-r.qmd":[],"appendix-x-How-to-cite-R.qmd":[],"07-independent.qmd":[],"06-chi-square-one-sample.qmd":[],"04-chi-square-one-sample.qmd":[],"appendix-b-updating-packages.qmd":[],"05-chi-square-one-sample.qmd":[],"12-factorial-anova.qmd":[],"webexercises.qmd":[],"07-paired.qmd":[],"appendix-y-license.qmd":[],"04-prob-binom-one-sample.qmd":[]} +{"05-independent.qmd":[],"01-basics.qmd":[],"07-apes.qmd":[],"12-factorial-anova.qmd":[],"03-dataviz.qmd":[],"03-wrangling2.qmd":[],"appendix-d-symbols.qmd":[],"08-paired.qmd":[],"09-simple-regression.qmd":[],"12-one-way-anova.qmd":[],"references.qmd":[],"index.qmd":[],"04-chi-square-one-sample.qmd":[],"06-independent.qmd":[],"06-chi-square-one-sample.qmd":[],"webexercises.qmd":[],"04-dataviz.qmd":[],"appendix-b-updating-packages.qmd":[],"appendix-x-How-to-cite-R.qmd":[],"11-multiple-regression.qmd":[],"13-factorial-anova.qmd":[],"02-wrangling.qmd":[],"appendix-a-installing-r.qmd":[],"07-paired.qmd":[],"04-dataviz2.qmd":[],"05-dataviz2.qmd":[],"04-prob-binom-one-sample.qmd":[],"appendix-y-license.qmd":[],"07-independent.qmd":[],"06-paired.qmd":[],"05-chi-square-one-sample.qmd":[],"instructions.qmd":["usethis"],"10-regression.qmd":[],"appendix-c-exporting-server.qmd":[],"11-one-way-anova.qmd":[],"10-multiple-regression.qmd":[],"08-correlation.qmd":[],"09-correlation.qmd":[]} diff --git 
a/.quarto/idx/01-basics.qmd.json b/.quarto/idx/01-basics.qmd.json index 740ec3e..9d0ab86 100644 --- a/.quarto/idx/01-basics.qmd.json +++ b/.quarto/idx/01-basics.qmd.json @@ -1 +1 @@ -{"title":"Projects and R Markdown","markdown":{"headingText":"Projects and R Markdown","headingAttr":{"id":"sec-basics","classes":[],"keyvalue":[]},"containsRefs":false,"markdown":"\n## Intended Learning Outcomes {.unnumbered}\n\nBy the end of this chapter, you should be able to:\n\n- Re-familiarise yourself with setting up projects\n- Re-familiarise yourself with RMarkdown documents\n- Recap and apply data wrangling procedures to analyse data\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n## R and R Studio\n\nRemember, R is a programming language that you will write code in and RStudio is an Integrated Development Environment (IDE) which makes working with R easier as it's more user friendly. You need both components for this course.\n\nIf this is not ringing any bells yet, have a quick browse through the [materials from year 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#sec-intro-r){target=\"_blank\"} to refresh your memory.\n\n\n### R server\n\nUse the server *only* if you are unable to install R and RStudio on your computer (e.g., if you are using a Chromebook) or if you encounter issues while installing R on your own machine. Otherwise, you should install R and RStudio directly on your own computer. R and RStudio are already installed on the *R server*.\n\nYou will find the link to the server on Moodle.\n\n\n### Installing R and RStudio on your computer\n\nThe [RSetGo book](https://psyteachr.github.io/RSetGo/){target=\"_blank\"} provides detailed instructions on how to install R and RStudio on your computer. 
It also includes links to walkthroughs for installing R on different types of computers and operating systems.\n\nIf you had R and RStudio installed on your computer last year, we recommend updating to the latest versions. In fact, it’s a good practice to update them at the start of each academic year. Detailed guidance can be found in @sec-updating-r.\n\nOnce you have installed or updated R and RStudio, return to this chapter.\n\n\n### Settings for Reproducibility\n\nBy now, you should be aware that the Psychology department at the University of Glasgow places a strong emphasis on reproducibility, open science, and raising awareness about questionable research practices (QRPs) and how to avoid them. Therefore, it's important that you work in a reproducible manner so that others (and your future self) can understand and check your work. This also makes it easier for you to reuse your work in the future.\n\nAlways start with a clear workspace. If your `Global Environment` contains anything from a previous session, you can’t be certain whether your current code is working as intended or if it’s using objects created earlier.\n\nTo ensure a clean and reproducible workflow, there are a few settings you should adjust immediately after installing or updating RStudio. In Tools \\> Global Options... General tab\n\n* Uncheck the box labelled Restore .RData into workspace at startup to make sure no data from a previous session is loaded into the environment\n* Set Save workspace to .RData on exit to **Never** to prevent your workspace from being saved when you exit RStudio.\n\n![Reproducibility settings in Global Options](images/rstudio_settings_reproducibility.png)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for keeping tabs on parentheses\n\nRStudio includes **rainbow parentheses** to help you keep track of your brackets.\n\nTo enable the feature, go to Tools \\> Global Options... 
Code tab \\> Display tab and tick the last checkbox \"Use rainbow parentheses\"\n\n![Enable Rainbow parenthesis](images/rainbow.PNG)\n\n:::\n\n### RStudio panes\n\nRStudio has four main panes each in a quadrant of your screen:\n\n* Source pane\n* Environment pane\n* Console pane\n* Output pane\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAre you ready for a quick quiz to see what you remember about the RStudio panes from last year? Click on **Quiz** to see the questions.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Quiz\n\n**What is their purpose?**\n\n**The Source pane...** `r longmcq(c(answer = \"allows users to view and edit various code-related files, such as .Rmd files\", \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", \"provides an area to interactively execute code\"))`\n\n**The Environment pane...** `r longmcq(c(\"allows users to view and edit various code-related files, such as .Rmd files\", \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", answer = \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", \"provides an area to interactively execute code\"))`\n\n**The Console pane...** `r longmcq(c(\"allows users to view and edit various code-related files, such as .Rmd files\", \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", answer = \"provides an area to interactively execute code\"))`\n\n**The Output pane...** `r 
longmcq(c(\"allows users to view and edit various code-related files, such as .Rmd files\", answer = \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", \"provides an area to interactively execute code\"))`\n\n**Where are these panes located by default?**\n\n* The Source pane is located? `r mcq(sample(c(answer = \"top left\", \"bottom left\", \"top right\", \"bottom right\")))`\n* The Environment pane is located? `r mcq(sample(c(\"top left\", \"bottom left\", answer = \"top right\", \"bottom right\")))`\n* The Console pane is located? `r mcq(sample(c(\"top left\", answer = \"bottom left\", \"top right\", \"bottom right\")))`\n* The Output pane is located? `r mcq(sample(c(\"top left\", \"bottom left\", \"top right\", answer = \"bottom right\")))`\n\n:::\n\n:::\n\nIf you were not quite sure about one/any of the panes, check out the [materials from Level 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#rstudio-panes){target=\"_blank\"}. If you want to know more about them, there is the [RStudio guide on posit](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html){target=\"_blank\"}\n\n\n\n## Activity 1: Creating a new project {#sec-project}\n\nIt's important to create a new RStudio project whenever you start a new project. This practice makes it easier to work in multiple contexts, such as when analysing different datasets simultaneously. Each RStudio project has its own folder location, workspace, and working directories, which keeps all your data and RMarkdown documents organised in one place.\n\nLast year, you learnt how to create projects on the server, so you already know the steps. 
If you cannot quite recall how that was done, go back to the [Level 1 materials](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#new-project){target=\"_blank\"}.\n\nOn your own computer, open RStudio, and complete the following steps in this order:\n\n* Click on File \\> New Project...\n* Then, click on \"New Directory\"\n* Then, click on \"New Project\"\n* Name the directory something meaningful (e.g., \"2A_chapter1\"), and save it in a location that makes sense, for example, a dedicated folder you have for your level 2 Psychology labs - you can either select a folder you have already in place or create a new one (e.g., I named my new folder \"Level 2 labs\")\n* Click \"Create Project\". RStudio will restart itself and open with this new project directory as the working directory. If you accidentally close it, you can open it by double-clicking on the project icon in your folder\n* You can also check in your folder structure that everything was created as intended\n\n![Creating a new project](images/project_setup.gif)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Why is the Colour scheme in the gif different to my version?\n\nIn case anyone is wondering why my colour scheme in the gif above looks different to yours, I've set mine to \"Pastel On Dark\" in Tools \\> Global Options... \\> Appearances. And my computer lives in \"dark mode\".\n\n:::\n\n::: callout-important\n\n## Don't nest projects\n\nDon't ever save a new project **inside** another project directory. This can cause some hard-to-resolve problems.\n\n:::\n\n\n## Activity 2: Create a new R Markdown file {#sec-rmd}\n\n* Open a new R Markdown document: click File \\> New File \\> R Markdown or click on the little page icon with a green plus sign (top left).\n* Give it a meaningful `Title` (e.g., Level 2 chapter 1) - you can also change the title later. Feel free to add your name or GUID in the `Author` field. 
Keep the `Default Output Format` as HTML.\n* Once the `.Rmd` has opened, you need to save the file.\n* To save it, click File \\> Save As... or click on the little disc icon. Name it something meaningful (e.g., \"chapter_01.Rmd\", \"01_intro.Rmd\"). Make sure there are no spaces in the name - R is not very fond of spaces... This file will automatically be saved in your project folder (i.e., your working directory) so you should now see this file appear in your file viewer pane.\n\n\n![Creating a new `.Rmd` file](images/Rmd_setup.gif)\n\n\nRemember, an R Markdown document or `.Rmd` has \"white space\" (i.e., the markdown for formatted text) and \"grey parts\" (i.e., code chunks) in the default colour scheme (see @fig-rmd). R Markdown is a powerful tool for creating dynamic documents because it allows you to integrate code and regular text seamlessly. You can then knit your `.Rmd` using the `knitr` package to create a final document as either a webpage (HTML), a PDF, or a Word document (.docx). We'll only knit to HTML documents in this course.\n\n\n![R markdown anatomy (image from [https://intro2r.com/r-markdown-anatomy.html](https://intro2r.com/r-markdown-anatomy.html){target=\"_blank\"})](images/rm_components.png)\n\n\n\n### Markdown\n\nThe markdown space in an `.Rmd` is ideal for writing notes that explain your code and document your thought process. Use this space to clarify what your code is doing, why certain decisions were made, and any insights or conclusions you have drawn along the way. These notes are invaluable when revisiting your work later, helping you (or others) understand the rationale behind key decisions, such as setting inclusion/exclusion criteria or interpreting the results of assumption tests. Effectively documenting your work in the markdown space enhances both the clarity and reproducibility of your analysis.\n\nThe markdown space offers a variety of formatting options to help you organise and present your notes effectively. 
Here are a few of them that can enhance your documentation:\n\n#### Heading levels {.unnumbered}\n\nThere is a variety of **heading levels** to make use of, using the `#` symbol.\n\n\n::: columns\n\n::: column\n\n##### You would incorporate this into your text as: {.unnumbered}\n\n\\# Heading level 1\n\n\\## Heading level 2\n\n\\### Heading level 3\n\n\\#### Heading level 4\n\n\\##### Heading level 5\n\n\\###### Heading level 6\n\n:::\n\n::: column\n\n##### And it will be displayed in your knitted html file as: {.unnumbered}\n\n![](images/heading_levels.PNG)\n\n:::\n\n:::\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My heading levels don't render properly when knitting\n\nYou need a space between the # and the first letter. If the space is missing, the heading will be displayed in the HTML file as ...\n\n#Heading 1\n\n:::\n\n#### Unordered and ordered lists {.unnumbered}\n\nYou can also include **unordered lists** and **ordered lists**. Click on the tabs below to see how they are incorporated\n\n::: panel-tabset\n\n## unordered lists\n\nYou can add **bullet points** using either `*`, `-` or `+` and they will turn into:\n\n* bullet point (created with `*`)\n* bullet point (created with `-`)\n+ bullet point (created with `+`)\n\nor use bullet points of different levels using 1 tab key press or 2 spaces (for sub-item 1) or 2 tabs/4 spaces (for sub-sub-item 1):\n\n* bullet point item 1\n * sub-item 1\n * sub-sub-item 1\n * sub-sub-item 2\n* bullet point item 2\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My bullet points don't render properly when knitting\n\nYou need an empty row before your bullet points start. If I delete the empty row before the bullet points, they will be displayed in the HTML as ...\n\nText without the empty row: * bullet point created with `*` - bullet point created with `-` + bullet point created with `+`\n\n:::\n\n\n## ordered lists\n\nStart the line with **1.**, **2.**, etc. 
When you want to include sub-items, either use the `tab` key twice or add **4 spaces**. Same goes for the sub-sub-item: include either 2 tabs (or 4 manual spaces) from the last item or 4 tabs/ 8 spaces from the start of the line.\n\n1. list item 1\n2. list item 2\n i) sub-item 1 (with 4 spaces)\n A. sub-sub-item 1 (with an additional 4 spaces from the last indent)\n\n::: {.callout-important collapse=\"true\"}\n\n## My list items don't render properly when knitting\n\nIf you don't leave enough spaces, the list won't be recognised, and your output looks like this:\n\n3. list item 3\n i) sub-item 1 (with only 2 spaces) \n A. sub-sub-item 1 (with an additional 2 spaces from the last indent)\n\n:::\n\n\n## ordered lists magic\n\nThe great thing though is that you don't need to know your alphabet or number sequences. R markdown will fix that for you\n\nIf I type into my `.Rmd`...\n\n![](images/list_magic.PNG)\n\n...it will be rendered in the knitted HTML output as...\n\n3. list item 3\n1. list item 1\n a) sub-item labelled \"a)\"\n i) sub-item labelled \"i)\"\n C) sub-item labelled \"C)\"\n Z) sub-item labelled \"Z)\"\n7. list item 7\n\n\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: The labels of the sub-items are not what I thought they would be. You said they are fixing themselves...\n\nYes, they do but you need to label your sub-item lists accordingly. The first label you list in each level is set as the baseline. If they are labelled `1)` instead of `i)` or `A.`, the output will show as follows, but the automatic-item-fixing still works:\n\n7. 
list item 7\n 1) list item \"1)\" with 4 spaces\n 1) list item \"1)\" with 8 spaces\n 6) this is an item labelled \"6)\" (magically corrected to \"2.\")\n:::\n\n:::\n\n#### Emphasis {.unnumbered}\n\nInclude **emphasis** to draw attention to keywords in your text:\n\n| R markdown syntax | Displayed in the knitted HTML file |\n|:----------------------------|:-----------------------------------|\n| \\*\\*bold text\\*\\* | **bold text** |\n| \\*italic text\\* | *italic text* |\n| \\*\\*\\*bold and italic\\*\\*\\* | ***bold and italic*** |\n\n\nOther examples can be found in the [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target=\"_blank\"}\n\n\n\n### Code chunks {#sec-chunks}\n\nEverything you write inside the **code chunks** will be interpreted as code and executed by R. Code chunks start with ```` ``` ```` followed by an `{r}` which specifies the coding language R, some space for code, and ends with ```` ``` ````. If you accidentally delete one of those backticks, your code won't run and/or your text parts will be interpreted as part of the code chunks or vice versa. This should be evident from the colour change - more white than expected typically indicates missing starting backticks, whilst too much grey/not enough white suggests missing ending backticks. 
But no need to fret if that happens - just add the missing backticks manually.\n\n\nYou can **insert a new code chunk** in several ways:\n\n\n* Click the `Insert a new code chunk` button in the RStudio Toolbar (green icon at the top right corner of the `Source pane`).\n* Select Code \\> Insert Chunk from the menu.\n* Using the shortcut `Ctrl + Alt + I` for Windows or `Cmd + Option + I` on MacOSX.\n* Type ```` ```{r} ```` and ```` ``` ```` manually\n\n\n```{r fig-rmd, echo=FALSE, fig.cap=\"Default `.Rmd` with highlighting - names in pink and knitr display options in purple\"}\nknitr::include_graphics(\"images/default_highlighted.png\")\n```\n\n\n\nWithin the curly brackets of a code chunk, you can **specify a name** for the code chunk (see pink highlighting in @fig-rmd). The chunk name is not necessarily required; however, it is good practice to give each chunk a unique name to support more advanced knitting approaches. It also makes it easier to reference and manage chunks.\n\nWithin the curly brackets, you can also place **rules and arguments** (see purple highlighting in @fig-rmd) to control how your code is executed and what is displayed in your final HTML output. The most common **knitr display options** include:\n\n\n| Code | Does code run | Does code show | Do results show |\n|:--------------------|:-------------:|:--------------:|:---------------:|\n| eval=FALSE | NO | YES | NO |\n| echo=TRUE (default) | YES | YES | YES |\n| echo=FALSE | YES | NO | YES |\n| results='hide' | YES | YES | NO |\n| include=FALSE | YES | NO | NO |\n\n\n::: callout-important\n\nThe table above will be incredibly important for the data skills homework II. 
When solving error mode items you will need to pay attention to the first one `eval = FALSE`.\n\n:::\n\nOne last thing: In your newly created `.Rmd` file, delete everything below line 12 (keep the set-up code chunk) and save your `.Rmd` by clicking on the disc symbol.\n\n![Delete everything below line 12](images/delete_12.gif)\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nThat was quite a long section about what Markdown can do. I promise, we'll practise that more later. For the minute, we want you to create a new level 2 heading on line 12 and give it a meaningful heading title (something like \"Loading packages and reading in data\" or \"Chapter 1\").\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\nOn line 12, you should have typed **## Loading packages and reading in data** (or whatever meaningful title you chose). This will create a level 2 heading once we knit the `.Rmd`.\n\n:::\n\n:::\n\n\n## Activity 3: Download the data {#sec-download_data_ch1}\n\nDownload the data for chapters 1-3 here: [data_ch1.zip](data/data_ch1.zip \"download\"). There are 2 files contained in a zip folder. 
One is the data file we are going to use today `prp_data_reduced.csv` and the other is an Excel file `prp_codebook` that explains the variables in the data.\n\nThe first step is to **unzip the zip folder** so that the files are placed within the same folder as your project.\n\n* Place the zip folder within your 2A_chapter1 folder\n* Right mouse click --> `Extract All...`\n* Check the folder location is the one to extract the files to\n* Check the extracted files are placed next to the project icon\n* Files and project should be visible in the Output pane in RStudio\n\n::: {.callout-note collapse=\"true\"}\n\n## For screenshots click here\n\n::: {layout-ncol=\"1\"}\n\n![](images/pic1.PNG){fig-align=\"center\"}\n\n![](images/pic23.PNG){fig-align=\"center\"}\n\n![](images/pic45.PNG){fig-align=\"center\"}\n\nUnzipping a zip folder\n\n:::\n:::\n\nThe paper by Pownall et al. was a **registered report** published in 2023, and the original data can be found on OSF ([https://osf.io/5qshg/](https://osf.io/5qshg/){target=\"_blank\"}).\n\n**Citation**\n\n> Pownall, M., Pennington, C. R., Norris, E., Juanchich, M., Smailes, D., Russell, S., Gooch, D., Evans, T. R., Persson, S., Mak, M. H. C., Tzavella, L., Monk, R., Gough, T., Benwell, C. S. Y., Elsherif, M., Farran, E., Gallagher-Mitchell, T., Kendrick, L. T., Bahnmueller, J., . . . Clark, K. (2023). Evaluating the Pedagogical Effectiveness of Study Preregistration in the Undergraduate Dissertation. *Advances in Methods and Practices in Psychological Science, 6*(4). [https://doi.org/10.1177/25152459231202724](https://doi.org/10.1177/25152459231202724){target=\"_blank\"}\n\n**Abstract**\n\n> Research shows that questionable research practices (QRPs) are present in undergraduate final-year dissertation projects. 
One entry-level Open Science practice proposed to mitigate QRPs is “study preregistration,” through which researchers outline their research questions, design, method, and analysis plans before data collection and/or analysis. In this study, we aimed to empirically test the effectiveness of preregistration as a pedagogic tool in undergraduate dissertations using a quasi-experimental design. A total of 89 UK psychology students were recruited, including students who preregistered their empirical quantitative dissertation (*n* = 52; experimental group) and students who did not (*n* = 37; control group). Attitudes toward statistics, acceptance of QRPs, and perceived understanding of Open Science were measured both before and after dissertation completion. Exploratory measures included capability, opportunity, and motivation to engage with preregistration, measured at Time 1 only. This study was conducted as a Registered Report; Stage 1 protocol: https://osf.io/9hjbw (date of in-principle acceptance: September 21, 2021). Study preregistration did not significantly affect attitudes toward statistics or acceptance of QRPs. However, students who preregistered reported greater perceived understanding of Open Science concepts from Time 1 to Time 2 compared with students who did not preregister. Exploratory analyses indicated that students who preregistered reported significantly greater capability, opportunity, and motivation to preregister. Qualitative responses revealed that preregistration was perceived to improve clarity and organization of the dissertation, prevent QRPs, and promote rigor. Disadvantages and barriers included time, perceived rigidity, and need for training. 
These results contribute to discussions surrounding embedding Open Science principles into research training.\n\n**Changes made to the dataset**\n\nWe made some changes to the dataset for the purpose of increasing difficulty for data wrangling (@sec-wrangling and @sec-wrangling2) and data visualisation (@sec-dataviz and @sec-dataviz2). This will ensure some \"teachable moments\". The changes are as follows:\n\n* We removed some of the variables to make the data more manageable for teaching purposes.\n* We recoded some values from numeric responses to labels (e.g., `understanding`).\n* We added the word \"years\" to one of the `Age` entries.\n* We tidied a messy column `Ethnicity` but introduced a similar but easier-to-solve \"messiness pattern\" when recoding the `understanding` data.\n* The scores in the original file were already corrected from reverse-coded responses. We reversed that process to present raw data here.\n\n\n\n\n## Activity 4: Installing packages, loading packages, and reading in data\n\n### Installing packages\n\nWhen you install R and RStudio for the first time (or after an update), most of the packages we will be using won’t be pre-installed. Before you can load new packages like `tidyverse`, you will need to install them.\n\nIf you try to load a package that has not been installed yet, you will receive an error message that looks something like this: `Error in library(tidyverse) : there is no package called 'tidyverse'`. \n\nTo fix this, simply install the package first. **In the console**, type the command `install.packages(\"tidyverse\")`. This **only needs to be done once after a fresh installation**. 
After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio.\n\n\nNote, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`.\n\n\n### Loading packages and reading in data\n\nThe first step is to load in the packages we need and read in the data. Today, we'll only be using `tidyverse`, and `read_csv()` will help us store the data from `prp_data_reduced.csv` in an object called `data_prp`.\n\nCopy the code into a code chunk in your `.Rmd` file and run it. You can either click the `green arrow` to run the entire code chunk, or use the shortcut `Ctrl + Enter` (Windows) or `Cmd + Enter` (Mac) to run a line of code/pipe from the Rmd.\n\n```{r eval=FALSE}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n\n\n\n```{r echo=FALSE}\n## I basically have to have 2 code chunks since I tell them to put the data files next to the project, and mine are in a separate folder called data - unless I'll turn this into a fixed path\nlibrary(tidyverse)\ndata_prp <- read_csv(\"data/prp_data_reduced.csv\")\n```\n\n\n\n\n## Activity 5: Familiarise yourself with the data {#sec-familiarise}\n\n* Look at the **Codebook** to get a feel of the variables in the dataset and how they have been measured. 
Note that some of the columns were deleted in the dataset you have been given.\n* You'll notice that some questionnaire data was collected at 2 different time points (i.e., SATS28, QRPs, Understanding_OS).\n* Some of the data was only collected at one time point (i.e., supervisor judgements, OS_behav items, and Included_prereg variables are t2-only variables).\n\n\n\n### First glimpse at the data\n\nBefore you start wrangling your data, it is important to understand what kind of data you're working with and what the format of your dataframe looks like.\n\nAs you may have noticed, `read_csv()` provides a **message** listing the data types in your dataset and how many columns are of each type. Plus, it shows a few example columns for each data type.\n\nTo obtain more detailed information about your data, you have several options. Click on the individual tabs to see the different options available. Test them out in your own `.Rmd` file and use whichever method you prefer (but do it).\n\n::: callout-warning\n\nSome of the output is a bit long because we do have quite a few variables in the data file.\n\n:::\n\n::: panel-tabset\n\n## visual inspection 1\n\nIn the `Global Environment`, click the blue arrow icon next to the object name `data_prp`. This action will expand the object, revealing details about its columns. The `$` symbol is commonly used in Base R to access a specific column within your dataframe.\n\n![Visual inspection of the data](images/data_prp.PNG)\n\nCon: When you have quite a few variables, not all of them are shown.\n\n## `glimpse()`\n\nUse `glimpse()` if you want a more detailed overview you can see on your screen. The output will display the number of rows and columns, and some examples of the first couple of observations for each variable.\n\n\n```{r}\nglimpse(data_prp)\n```\n\n\n## `spec()`\n\nYou can also use `spec()` as suggested in the message above, which shows you a list of the data type in every single column. 
But it doesn't show you the number of rows and columns.\n\n\n```{r}\nspec(data_prp)\n```\n\n\n## visual inspection 2\n\nIn the `Global Environment`, click on the object name `data_prp`. This action will open the data in a new tab. Hovering over the column headings with your mouse will also reveal their data type. However, it seems to be a fairly tedious process when you have loads of columns.\n\n::: {.callout-important collapse=\"true\"}\n\n## Hang on, where is the rest of my data? Why do I only see 50 columns?\n\nOne common source of confusion is not seeing all your columns when you open up a data object as a tab. This is because RStudio shows you a maximum of 50 columns at a time. If you have more than 50 columns, navigate with the arrows to see the remaining columns.\n\n![Showing 50 columns at a time](images/50_col.PNG)\n\n:::\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nNow that you have tested out all the options in your own `.Rmd` file, you can probably answer the following questions:\n\n* How many observations? `r fitb(\"89\")`\n* How many variables? `r fitb(\"91\")`\n* How many columns are `col_character` or `chr` data type? `r fitb(\"17\")`\n* How many columns are `col_double` or `dbl` data type? `r fitb(\"74\")`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe visual inspections show you the **number of observations and variables**. `glimpse()` also gives you that information but calls them **rows and columns** respectively.\n\nThe **data type information** actually comes from the output when using the `read_csv()` function. Did you notice the information on **Column specification** (see screenshot below)?\n\n![message from `read_csv()` when reading in the data](images/col_spec.PNG)\n\nWhilst `spec()` is quite useful for data type information per individual column, it doesn't give you the total count of each data type. 
So it doesn't really help with answering the questions here - unless you want to count manually from its extremely long output.\n\n:::\n\nIn your `.Rmd`, include a **new heading level 2** called \"Information about the data\" (or something equally meaningful) and jot down some notes about `data_prp`. You could include the citation and/or the abstract, and whatever information you think you should note about this dataset (e.g., any observations from looking at the codebook?). You could also include some notes on the functions used so far and what they do. Try to incorporate some **bold**, *italic* or ***bold and italic*** emphasis and perhaps a bullet point or two.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Possible solution\n\n\\#\\# Information about the data\n\nThe data is from Pownall et al. (2023), and I can find the paper here: https://doi.org/10.1177/25152459231202724.\n\nI've noticed in the prp codebook that the SATS-28 questionnaire has quite a few \\*reverse-coded items\\*, and the supervisor support questionnaire also has a reverse-coded item.\n\nSo far, I think I prefer \\*\\*glimpse()\\*\\* to show me some more detail about the data. Specs() is too text-heavy for me which makes it hard to read.\n\nThings to keep in mind:\n\n* \\*\\*don't forget to load in tidyverse first!!!\\*\\*\n* always read in the data with \\*\\*read_csv\\*\\*, \\*\\*\\*never ever use read.csv\\*\\*\\*!!!\n\n![The output rendered in a knitted html file](images/knitted_markdown.PNG)\n\n:::\n\n:::\n\n### Data types {#sec-datatypes}\n\nEach variable has a **data type**, such as numeric (numbers), character (text), and logical (TRUE/FALSE values), or a special class of factor. As you have just seen, our `data_prp` only has character and numeric columns (so far).\n\n**Numeric data** can be double (`dbl`) or integer (`int`). Doubles can have decimal places (e.g., 1.1). Integers are the whole numbers (e.g., 1, 2, -1) and are displayed with the suffix L (e.g., 1L). 
This is not overly important but might leave you less puzzled the next time you see an L after a number.\n\n**Characters** (also called “strings”) are anything written between quotation marks. This is usually text, but in special circumstances, a number can be a character if it is placed within quotation marks. This can happen when you are recoding variables. It might not be too obvious at the time, but you won't be able to calculate anything if the number is a character.\n\n::: panel-tabset\n\n## Example data types\n\n```{r}\ntypeof(1)\ntypeof(1L)\ntypeof(\"1\")\ntypeof(\"text\")\n```\n\n## numeric computation\n\nNo problems here...\n\n```{r}\n1+1\n```\n\n## character computation\n\nWhen the data type is incorrect, you won't be able to compute anything, despite your numbers being shown as numeric values in the dataframe. The error message tells you exactly what's wrong with it, i.e., that you have `non-numeric arguments`.\n\n```{r, error = TRUE}\n\"1\"+\"1\" # ERROR\n```\n\n:::\n\n**Logical** data (also sometimes called “Boolean” values) are one of two values: TRUE or FALSE (written in uppercase). They become really important when we use `filter()` or `mutate()` with conditional statements such as `case_when()`. More about those in @sec-wrangling2.\n\n\nSome commonly used logical operators:\n\n| operator | description |\n|:---------|:-----------------------------------------------|\n| \\> | greater than |\n| \\>= | greater than or equal to |\n| \\< | less than |\n| \\<= | less than or equal to |\n| == | equal to |\n| != | not equal to |\n| %in% | TRUE if any element is in the following vector |\n\n\nA **factor** is a specific type of integer or character that lets you assign the order of the categories. 
This becomes useful when you want to display certain categories in \"the correct order\" either in a dataframe (see *arrange*) or when plotting (see @sec-dataviz/ @sec-dataviz2).\n\n\n\n### Variable types\n\nYou've already encountered them in [Level 1](https://psyteachr.github.io/data-skills-v2/intro-to-probability.html){target=\"_blank\"} but let's refresh. Variables can be classified as **continuous** (numbers) or **categorical** (labels).\n\n**Categorical** variables are properties you can count. They can be **nominal**, where the categories don't have an order (e.g., gender) or **ordinal** (e.g., Likert scales either with numeric values 1-7 or with character labels such as \"agree\", \"neither agree nor disagree\", \"disagree\"). Categorical data may also be **factors** rather than characters.\n\n**Continuous variables** are properties you can measure and calculate sums/ means/ etc. They may be rounded to the nearest whole number, but it should make sense to have a value between them. Continuous variables always have a **numeric** data type (i.e. `integer` or `double`).\n\n::: callout-tip\n\n## Why is this important you may ask?\n\nKnowing your variable and data types will help later on when deciding on an appropriate plot (see @sec-dataviz and @sec-dataviz2) or which inferential test to run (@sec-nhstI to @sec-factorial).\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAs we've seen earlier, `data_prp` only had character and numeric variables which hardly tests your understanding to see if you can identify a variety of data types and variable types. So, for this little quiz, we've spiced it up a bit. We've selected a few columns, shortened some of the column names, and modified some of the data types. Here you can see the first few rows of the new object `data_quiz`. 
*You can find the code with explanations at the end of this section.*\n\n```{r echo=FALSE}\ndata_quiz <- data_prp %>% \n  select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n  mutate(Gender = factor(Gender),\n         Secondyeargrade = factor(Secondyeargrade,\n                                  levels = c(1, 2, 3, 4, 5),\n                                  labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n         `QRP_item > 4` = case_when(\n           QRP_item > 4 ~ TRUE, \n           .default = FALSE))\n```\n\n```{r echo=FALSE}\n# the `head()` function shows the first n rows of a dataset (here 5)\nhead(data_quiz, n = 5)\n```\n\n```{r}\nglimpse(data_quiz)\n```\n\n\n\nSelect the variable type and data type for each of the columns from the dropdown menus.\n\n```{r, include = FALSE}\n# variable type\ncon <- c(answer = \"continuous\", x = \"nominal\", x = \"ordinal\")\nnom <- c(x = \"continuous\", answer = \"nominal\", x = \"ordinal\")\nord <- c(x = \"continuous\", x = \"nominal\", answer = \"ordinal\")\n\n# data type\nnum <- c(answer = \"numeric\", x = \"character\", x = \"logical\", x = \"factor\")\nchr <- c(x = \"numeric\", answer = \"character\", x = \"logical\", x = \"factor\")\nlog <- c(x = \"numeric\", x = \"character\", answer = \"logical\", x = \"factor\")\nfctr <- c(x = \"numeric\", x = \"character\", x = \"logical\", answer = \"factor\")\n\n```\n\n| Column | Variable type | Data type |\n|:---------------------|:--------------|:--------------|\n| `Age` | `r mcq(con)` | `r mcq(chr)` |\n| `Gender` | `r mcq(nom)` | `r mcq(fctr)` |\n| `Ethnicity` | `r mcq(nom)` | `r mcq(chr)` |\n| `Secondyeargrade` | `r mcq(ord)` | `r mcq(fctr)` |\n| `QRP_item` | `r mcq(ord)` | `r mcq(num)` |\n| `QRPs_mean` | `r mcq(con)` | `r mcq(num)` |\n| `Understanding_item` | `r mcq(ord)` | `r mcq(chr)` |\n| `QRP_item > 4` | `r mcq(nom)` | `r mcq(log)` |\n\n:::\n\n::: 
{.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Revealing the mystery code that created `data_quiz`\n\nThe code might look a bit complex for the minute despite the line-by-line explanations below. Come back to it after completing chapter 2.\n\n```{r eval=FALSE}\ndata_quiz <- data_prp %>% \n  select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n  mutate(Gender = factor(Gender),\n         Secondyeargrade = factor(Secondyeargrade,\n                                  levels = c(1, 2, 3, 4, 5),\n                                  labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n         `QRP_item > 4` = case_when(\n           QRP_item > 4 ~ TRUE, \n           .default = FALSE))\n```\n\nLet's go through this line by line:\n\n* **line 1**: creates a new object called `data_quiz` that is based on the already existing data object `data_prp`.\n* **line 2**: we are selecting a few variables of interest, such as `Code`, `Age`, etc. Some of those variables were renamed in the process according to the structure `new_name = old_name`, for example QRP item 3 at time point 1 got renamed as `QRP_item`.\n* **line 3**: The function `mutate()` is used to create a new column called `Gender` that turns the existing column `Gender` from a numeric value into a factor. R simply overwrites the existing column of the same name. If we had named the new column `Gender_factor`, we would have been able to retain the original `Gender` column and `Gender_factor` would have been added as the last column.\n* **lines 4-6**: See how the line starts with an indent which indicates we are still within the `mutate()` function. 
You can also see this by counting brackets - in line 3 there are 2 opening brackets but only 1 closes.\n  * Similar to `Gender`, we are replacing the \"old\" `Secondyeargrade` with the new `Secondyeargrade` column that is now a factor.\n  * We are turning our variable `Secondyeargrade` into a factor, but can you spot the difference between this attempt and the one we used for `Gender`? Here we are using a lot more arguments in that factor function, namely levels and labels. **Levels** describes the unique values we have for that column, and in **labels** we want to define how these levels will be shown in the data object. If you don't add the levels and labels arguments, the original values will be used as the labels (as you can see in the `Gender` column in which we kept the numbers).\n* **line 7**: Doesn't start with a function name and has an indent, which means we are *still* within the `mutate()` function - count the opening and closing brackets to confirm.\n  * Here, we are creating a new column called `QRP_item > 4`. Notice the two backticks we have to use to make this weird column name work? This is because it has spaces (and we did mention that R doesn't like spaces). So the backticks help R to group it as a unit, i.e., a single name.\n  * Next we have a `case_when()` function which helps to execute conditional statements. We are using it to check whether a statement is TRUE or FALSE. Here, we ask whether the QRP item (column `QRP_item`) is larger than 4 (midpoint of the scale) using the Boolean operator `>`. If the statement is `TRUE`, the label `TRUE` should appear in column `QRP_item > 4`. Otherwise, if the value is equal to 4 or smaller, the label should read `FALSE`. We will come back to conditional statements in @sec-wrangling. But long story short, this Boolean expression created the only logical data type in `data_quiz`.\n:::\n\nAnd with this, we are done with the individual walkthrough. 
Well done :)\n\n\n\n\n\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nThe data we will be using in the upcoming lab activities come from a randomised controlled trial by Binfet et al. (2021) that was conducted in Canada.\n\n**Citation**\n\n> Binfet, J. T., Green, F. L. L., & Draper, Z. A. (2021). The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. *Anthrozoös, 35*(1), 1–22. [https://doi.org/10.1080/08927936.2021.1944558](https://doi.org/10.1080/08927936.2021.1944558){target=\"_blank\"}\n\n**Abstract**\n\n> Researchers have claimed that canine-assisted interventions (CAIs) contribute significantly to bolstering participants' wellbeing, yet the mechanisms within interactions have received little empirical attention. The aim of this study was to assess the impact of client–canine contact on wellbeing outcomes in a sample of 284 undergraduate college students (77% female; 21% male, 2% non-binary). Participants self-selected to participate and were randomly assigned to one of two canine interaction treatment conditions (touch or no touch) or to a handler-only condition with no therapy dog present. To assess self-reports of wellbeing, measures of flourishing, positive and negative affect, social connectedness, happiness, integration into the campus community, stress, homesickness, and loneliness were administered. Exploratory analyses were conducted to assess whether these wellbeing measures could be considered as measuring a unidimensional construct. This included both reliability analysis and exploratory factor analysis. Based on the results of these analyses we created a composite measure using participant scores on a latent factor. We then conducted the tests of the four hypotheses using these factor scores. 
Results indicate that participants across all conditions experienced enhanced wellbeing on several measures; however, only those in the direct contact condition reported significant improvements on all measures of wellbeing. Additionally, direct interactions with therapy dogs through touch elicited greater wellbeing benefits than did no touch/indirect interactions or interactions with only a dog handler. Similarly, analyses using scores on the wellbeing factor indicated significant improvement in wellbeing across all conditions (handler-only, *d* = 0.18, *p* = 0.041; indirect, *d* = 0.38, *p* \\< 0.001; direct, *d* = 0.78, *p* \\< 0.001), with more benefit when a dog was present (*d* = 0.20, *p* \\< 0.001), and the most benefit coming from physical contact with the dog (*d* = 0.13, *p* = 0.002). The findings hold implications for post-secondary wellbeing programs as well as the organization and delivery of CAIs.\n\n\nHowever, we accessed the data via Ciaran Evans' github ([https://github.com/ciaran-evans/dog-data-analysis](https://github.com/ciaran-evans/dog-data-analysis){target=\"_blank\"}). Evans et al. (2023) published a paper that reused the Binfet data for teaching statistics and research methods. If anyone is interested, the accompanying paper is:\n\n> Evans, C., Cipolli, W., Draper, Z. A., & Binfet, J. T. (2023). Repurposing a Peer-Reviewed Publication to Engage Students in Statistics: An Illustration of Study Design, Data Collection, and Analysis. *Journal of Statistics and Data Science Education, 31*(3), 236–247. [https://doi.org/10.1080/26939169.2023.2238018](https://doi.org/10.1080/26939169.2023.2238018){target=\"_blank\"}\n\n**There are a few changes that Evans and we made to the data:**\n\n* Evans removed the demographics ethnicity and gender to make the study data available while protecting participant privacy. 
Which means we'll have limited demographic variables available, but we will make do with what we've got.\n* We modified some of the responses in the raw data csv - for example, we took out impossible response values and replaced them with `NA`.\n* We replaced some of the numbers with labels to increase the difficulty in the dataset for @sec-wrangling and @sec-wrangling2.\n\n\n\n### Task 1: Create a project folder for the lab activities {.unnumbered}\n\nSince we will be working with the same data throughout semester 1, create a separate project for the lab data. Name it something useful, like `lab_data` or `dogs_in_the_lab`. Make sure you are not placing it within the project you have already created today. If you need guidance, see @sec-project above.\n\n\n\n### Task 2: Create a new `.Rmd` file {.unnumbered}\n\n... and name it something useful. If you need help, have a look at @sec-rmd.\n\n\n\n### Task 3: Download the data {.unnumbered}\n\nDownload the data here: [data_pair_ch1](data/data_pair_ch1.zip \"download\"). The zip folder contains the raw data file with responses to individual questions, a cleaned version of the same data in long format and wide format, and the codebook describing the variables in the raw data file and the long format.\n\n**Unzip the folder and place the data files in the same folder as your project.**\n\n\n\n### Task 4: Familiarise yourself with the data {.unnumbered}\n\nOpen the data files, look at the codebook, and perhaps skim over the original Binfet article (methods in particular) to see what kind of measures they used.\n\nRead in the raw data file as `dog_data_raw` and the cleaned-up data (long format) as `dog_data_long`. 
See if you can answer the following questions.\n\n```{r eval=FALSE}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\ndog_data_long <- read_csv(\"dog_data_clean_long.csv\")\n```\n\n```{r echo=FALSE}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"data/dog_data_raw.csv\")\ndog_data_long <- read_csv(\"data/dog_data_clean_long.csv\")\n```\n\n* How many participants took part in the study? `r fitb(\"284\")`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nYou can see this from `dog_data_raw`. Each participant ID is on a single row, meaning the number of observations is the number of participants.\n\nIf you look at `dog_data_long`, there are 568 observations. Each participant answered the questionnaires pre- and post-intervention, resulting in 2 rows per participant ID. This means you'd have to divide the number of observations by 2 to get to the number of participants.\n\n:::\n\n* How many different questionnaires did the participants answer? `r fitb(c(\"9\", \"8\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nThe Binfet paper (e.g., Methods section and/or abstract) and the codebook show it's 9 questionnaires - Flourishing scale (variable `Flourishing`), the UCLA Loneliness Scale Version 3 (`Loneliness`), Positive and Negative affect scale (`PANAS_PA` and `PANAS_NA`), the Subjective Happiness scale (`SHS`), the Social connectedness scale (`SCS`), and 3 scales with 1 question each, i.e., perception of stress levels (`Stress`), self-reported level of homesickness (`Homesick`), and integration into the campus community (`Engagement`).\n\nHowever, if you thought `PANAS_PA` and `PANAS_NA` are a single questionnaire, 8 was also acceptable as an answer here.\n\n:::\n\n\n\n\n## [Test your knowledge]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nAre you ready for some knowledge check questions to test your understanding of the chapter? 
We also have some faulty code. See if you can spot what's wrong with it.\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nOne of the key first steps when we open RStudio is to: `r longmcq(c(x = \"put on some music as we will be here a while\", answer = \"open an existing project or create a new one\", x = \"make a coffee\", x = \"check out the news\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nOpening an existing project (e.g., when coming back to the same dataset) or creating a new project (e.g., for a new task or new dataset) ensures that subsequent `.Rmd` files, any output, figures, etc., are saved within the same folder on your computer (i.e., the working directory). If the `.Rmd` files or data are not in the same folder as \"the project icon\", things can get messy and code might not run.\n\n:::\n\n\n#### Question 2 {.unnumbered}\n\nWhen using the default environment colour settings for RStudio, what colour would the background of a code chunk be in R Markdown? `r mcq(c(x = \"red\", x = \"white\", x = \"green\", answer = \"grey\"))`\n\nWhen using the default environment colour settings for RStudio, what colour would the background of normal text be in R Markdown? `r mcq(c(x = \"red\", answer = \"white\", x = \"green\", x = \"grey\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nAssuming you have not changed any of the settings in RStudio, code chunks will tend to have a grey background and normal text will tend to have a white background. 
This is a good way to check that you have closed and opened code chunks correctly.\n\n:::\n\n\n\n#### Question 3 {.unnumbered}\n\nCode chunks start and end with: `r longmcq(c(x = \"three single quotes\", answer = \"three backticks\", x = \"three double quotes\", x = \"three single asterisks\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCode chunks always take the same general format of three backticks followed by curly parentheses and a lower case r inside the parentheses (`{r}`). People often mistake these backticks for single quotes but that will not work. If you have set your code chunk correctly using backticks, the background colour should change to grey from white.\n\n:::\n\n\n\n#### Question 4 {.unnumbered}\n\nWhat is the correct way to include a code chunk in RMarkdown that will be executed but neither the code nor its output will be shown in the final HTML document? `r mcq(c(x = \"{r, echo=FALSE}\", x = \"{r, eval=FALSE}\", answer = \"{r, include=FALSE}\", x = \"{r, results='hide'}\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCheck the table of knitr display options in @sec-chunks.\n\n* {r, echo=FALSE} also executes the code and does not show the code, but it *does* display the result in the knitted html file. (matches 2/3 criteria)\n* {r, eval=FALSE} does not show the results but does *not* execute the code and it *does* show it in the knitted file. (matches 1/3 criteria)\n* {r, results=“hide”} executes the code and does not show results, however, it *does* include the code in the knitted html document. (matches 2/3 criteria)\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of these codes have mistakes in them, other code chunks are not quite producing what was aimed for. 
Your task is to spot anything faulty, explain why the things happened, and perhaps try to fix them.\n\n\n\n#### Question 5 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. You have just started R, created a new `.Rmd` file, and typed the following code into your code chunk.\n\n```{r eval=FALSE}\ndata <- read_csv(\"data.csv\")\n```\n\n\nHowever, R gives you an error message: `could not find function \"read_csv\"`. What could be the reason?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\n\"Could not find function\" is an indication that you have forgotten to load in tidyverse. Because `read_csv()` is a function in the tidyverse collection, R cannot find it.\n\nFIX: Add `library(tidyverse)` prior to reading in the data and run the code chunk again.\n\n:::\n\n\n\n#### Question 6 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. This time, you are certain you have loaded in tidyverse first. The code is as follows:\n\n```{r eval=FALSE}\nlibrary(tidyverse)\ndata <- read_csv(\"data.csv\")\n```\n\nThe error message shows `'data.csv' does not exist in current working directory`. You check your folder and it looks like this:\n\n![](images/error_ch1_01.PNG)\n\nWhy is there an error message?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nR is looking for a csv file that is called data which is currently not in the working directory. We may assume it's in the data folder. Perhaps that happened when unzipping the zip file. 
So instead of placing the csv file on the same level as the project icon, it was unzipped into a folder named data.\n\nFIX - option 1: Take the `data.csv` out of the data folder and place it next to the project icon and the `.Rmd` file.\n\nFIX - option 2: Modify your R code to tell R that the data is in a separate folder called data, e.g., ...\n\n```{r eval=FALSE}\nlibrary(tidyverse)\ndata <- read_csv(\"data/data.csv\")\n```\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\n\nYou want to load `tidyverse` into the library. The code is as follows:\n\n```{r eval=FALSE}\nlibrary(tidyverse)\n```\n\n\nThe error message says: `Error in library(tidyverse) : there is no package called ‘tidyverse’`\n\nWhy is there an error message and how can we fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nIf R says there is no package called `tidyverse`, it means you haven't installed the package yet. This could be an error message you receive either after switching computers or a fresh install of R and RStudio.\n\nFIX: Type `install.packages(\"tidyverse\")` into your **Console**.\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nYou knitted your `.Rmd` into an HTML file but the output is not as expected. You see the following:\n\n![](images/error_knitted.PNG)\n\nWhy did the file not knit properly?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThere is a backtick missing in the code chunk. 
If you check your `.Rmd` file, you can see that the code chunk does not show up in grey which means it's one of the 3 backticks at the beginning of the chunk.\n\n![](images/error_ch1_08.PNG)\n\nFIX: Add a single backtick manually where it's missing.\n\n:::\n","srcMarkdownNoYaml":""},"formats":{"html":{"identifier":{"display-name":"HTML","target-format":"html","base-format":"html"},"execute":{"fig-width":7,"fig-height":5,"fig-format":"retina","fig-dpi":96,"df-print":"kable","error":false,"eval":true,"cache":null,"freeze":"auto","echo":true,"output":true,"warning":true,"include":true,"keep-md":false,"keep-ipynb":false,"ipynb":null,"enabled":null,"daemon":null,"daemon-restart":false,"debug":false,"ipynb-filters":[],"engine":"knitr"},"render":{"keep-tex":false,"keep-source":false,"keep-hidden":false,"prefer-html":false,"output-divs":true,"output-ext":"html","fig-align":"default","fig-pos":null,"fig-env":null,"code-fold":false,"code-overflow":"wrap","code-link":true,"code-line-numbers":true,"code-tools":false,"tbl-colwidths":"auto","merge-includes":true,"inline-includes":false,"preserve-yaml":false,"latex-auto-mk":true,"latex-auto-install":true,"latex-clean":true,"latex-max-runs":10,"latex-makeindex":"makeindex","latex-makeindex-opts":[],"latex-tlmgr-opts":[],"latex-input-paths":[],"latex-output-dir":null,"link-external-icon":false,"link-external-newwindow":false,"self-contained-math":false,"format-resources":[],"notebook-links":true,"format-links":true},"pandoc":{"standalone":true,"wrap":"none","default-image-extension":"png","to":"html","css":["https://use.fontawesome.com/releases/v5.13.0/css/all.css","include/booktem.css","include/webex.css","include/glossary.css","include/style.css","include/custom.scss"],"highlight-style":"a11y","include-after-body":["include/webex.js","include/script.js"],"output-file":"01-basics.html"},"language":{"toc-title-document":"Table of contents","toc-title-website":"On this page","related-formats-title":"Other 
Formats","related-notebooks-title":"Notebooks","source-notebooks-prefix":"Source","section-title-abstract":"Abstract","section-title-appendices":"Appendices","section-title-footnotes":"Footnotes","section-title-references":"References","section-title-reuse":"Reuse","section-title-copyright":"Copyright","section-title-citation":"Citation","appendix-attribution-cite-as":"For attribution, please cite this work as:","appendix-attribution-bibtex":"BibTeX citation:","title-block-author-single":"Author","title-block-author-plural":"Authors","title-block-affiliation-single":"Affiliation","title-block-affiliation-plural":"Affiliations","title-block-published":"Published","title-block-modified":"Modified","callout-tip-title":"Tip","callout-note-title":"Note","callout-warning-title":"Warning","callout-important-title":"Important","callout-caution-title":"Caution","code-summary":"Code","code-tools-menu-caption":"Code","code-tools-show-all-code":"Show All Code","code-tools-hide-all-code":"Hide All Code","code-tools-view-source":"View Source","code-tools-source-code":"Source Code","code-line":"Line","code-lines":"Lines","copy-button-tooltip":"Copy to Clipboard","copy-button-tooltip-success":"Copied!","repo-action-links-edit":"Edit this page","repo-action-links-source":"View source","repo-action-links-issue":"Report an issue","back-to-top":"Back to top","search-no-results-text":"No results","search-matching-documents-text":"matching documents","search-copy-link-title":"Copy link to search","search-hide-matches-text":"Hide additional matches","search-more-match-text":"more match in this document","search-more-matches-text":"more matches in this document","search-clear-button-title":"Clear","search-detached-cancel-button-title":"Cancel","search-submit-button-title":"Submit","search-label":"Search","toggle-section":"Toggle section","toggle-sidebar":"Toggle sidebar navigation","toggle-dark-mode":"Toggle dark mode","toggle-reader-mode":"Toggle reader mode","toggle-navigation":"Toggle 
navigation","crossref-fig-title":"Figure","crossref-tbl-title":"Table","crossref-lst-title":"Listing","crossref-thm-title":"Theorem","crossref-lem-title":"Lemma","crossref-cor-title":"Corollary","crossref-prp-title":"Proposition","crossref-cnj-title":"Conjecture","crossref-def-title":"Definition","crossref-exm-title":"Example","crossref-exr-title":"Exercise","crossref-ch-prefix":"Chapter","crossref-apx-prefix":"Appendix","crossref-sec-prefix":"Section","crossref-eq-prefix":"Equation","crossref-lof-title":"List of Figures","crossref-lot-title":"List of Tables","crossref-lol-title":"List of Listings","environment-proof-title":"Proof","environment-remark-title":"Remark","environment-solution-title":"Solution","listing-page-order-by":"Order By","listing-page-order-by-default":"Default","listing-page-order-by-date-asc":"Oldest","listing-page-order-by-date-desc":"Newest","listing-page-order-by-number-desc":"High to Low","listing-page-order-by-number-asc":"Low to High","listing-page-field-date":"Date","listing-page-field-title":"Title","listing-page-field-description":"Description","listing-page-field-author":"Author","listing-page-field-filename":"File Name","listing-page-field-filemodified":"Modified","listing-page-field-subtitle":"Subtitle","listing-page-field-readingtime":"Reading Time","listing-page-field-categories":"Categories","listing-page-minutes-compact":"{0} min","listing-page-category-all":"All","listing-page-no-matches":"No matching items"},"metadata":{"lang":"en","fig-responsive":true,"quarto-version":"1.3.450","lightbox":true,"bibliography":["include/references.bib"],"csl":"include/apa.csl","theme":{"light":["flatly","include/light.scss"],"dark":["darkly","include/dark.scss"]},"code-copy":"hover","mainfont":"","monofont":""},"extensions":{"book":{"multiFile":true}}}},"projectFormats":["html"]} \ No newline at end of file +{"title":"Projects and R Markdown","markdown":{"headingText":"Projects and R 
Markdown","headingAttr":{"id":"sec-basics","classes":[],"keyvalue":[]},"containsRefs":false,"markdown":"\n## Intended Learning Outcomes {.unnumbered}\n\nBy the end of this chapter, you should be able to:\n\n- Re-familiarise yourself with setting up projects\n- Re-familiarise yourself with RMarkdown documents\n- Recap and apply data wrangling procedures to analyse data\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n## R and R Studio\n\nRemember, R is a programming language that you will write code in and RStudio is an Integrated Development Environment (IDE) which makes working with R easier as it's more user friendly. You need both components for this course.\n\nIf this is not ringing any bells yet, have a quick browse through the [materials from year 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#sec-intro-r){target=\"_blank\"} to refresh your memory.\n\n\n### R server\n\nUse the server *only* if you are unable to install R and RStudio on your computer (e.g., if you are using a Chromebook) or if you encounter issues while installing R on your own machine. Otherwise, you should install R and RStudio directly on your own computer. R and RStudio are already installed on the *R server*.\n\nYou will find the link to the server on Moodle.\n\n\n### Installing R and RStudio on your computer\n\nThe [RSetGo book](https://psyteachr.github.io/RSetGo/){target=\"_blank\"} provides detailed instructions on how to install R and RStudio on your computer. It also includes links to walkthroughs for installing R on different types of computers and operating systems.\n\nIf you had R and RStudio installed on your computer last year, we recommend updating to the latest versions. In fact, it’s a good practice to update them at the start of each academic year. 
Detailed guidance can be found in @sec-updating-r.\n\nOnce you have installed or updated R and RStudio, return to this chapter.\n\n\n### Settings for Reproducibility\n\nBy now, you should be aware that the Psychology department at the University of Glasgow places a strong emphasis on reproducibility, open science, and raising awareness about questionable research practices (QRPs) and how to avoid them. Therefore, it's important that you work in a reproducible manner so that others (and your future self) can understand and check your work. This also makes it easier for you to reuse your work in the future.\n\nAlways start with a clear workspace. If your `Global Environment` contains anything from a previous session, you can’t be certain whether your current code is working as intended or if it’s using objects created earlier.\n\nTo ensure a clean and reproducible workflow, there are a few settings you should adjust immediately after installing or updating RStudio. In Tools \\> Global Options... General tab\n\n* Uncheck the box labelled Restore .RData into workspace at startup to make sure no data from a previous session is loaded into the environment\n* set Save workspace to .RData on exit to **Never** to prevent your workspace from being saved when you exit RStudio.\n\n![Reproducibility settings in Global Options](images/rstudio_settings_reproducibility.png)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for keeping taps on parentheses\n\nR has included **rainbow parentheses** to help with keeping count on the brackets.\n\nTo enable the feature, go to Tools \\> Global Options... 
Code tab \\> Display tab and tick the last checkbox \"Use rainbow parentheses\"\n\n![Enable Rainbow parenthesis](images/rainbow.PNG)\n\n:::\n\n### RStudio panes\n\nRStudio has four main panes each in a quadrant of your screen:\n\n* Source pane\n* Environment pane\n* Console pane\n* Output pane\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAre you ready for a quick quiz to see what you remember about the RStudio panes from last year? Click on **Quiz** to see the questions.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Quiz\n\n**What is their purpose?**\n\n**The Source pane...** `r longmcq(c(answer = \"allows users to view and edit various code-related files, such as .Rmd files\", \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", \"provides an area to interactively execute code\"))`\n\n**The Environment pane...** `r longmcq(c(\"allows users to view and edit various code-related files, such as .Rmd files\", \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", answer = \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", \"provides an area to interactively execute code\"))`\n\n**The Console pane...** `r longmcq(c(\"allows users to view and edit various code-related files, such as .Rmd files\", \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", answer = \"provides an area to interactively execute code\"))`\n\n**The Output pane...** `r 
longmcq(c(\"allows users to view and edit various code-related files, such as .Rmd files\", answer = \"contains the Files, Plots, R Packages, Help, Tutorial, Viewer, and Presentation tabs\", \"includes the Environment tab that displays currently saved objects, and the History tab that displays the commands that were executed in the current session along a search function\", \"provides an area to interactively execute code\"))`\n\n**Where are these panes located by default?**\n\n* The Source pane is located? `r mcq(sample(c(answer = \"top left\", \"bottom left\", \"top right\", \"bottom right\")))`\n* The Environment pane is located? `r mcq(sample(c(\"top left\", \"bottom left\", answer = \"top right\", \"bottom right\")))`\n* The Console pane is located? `r mcq(sample(c(\"top left\", answer = \"bottom left\", \"top right\", \"bottom right\")))`\n* The Output pane is located? `r mcq(sample(c(\"top left\", \"bottom left\", \"top right\", answer = \"bottom right\")))`\n\n:::\n\n:::\n\nIf you were not quite sure about one/any of the panes, check out the [materials from Level 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#rstudio-panes){target=\"_blank\"}. If you want to know more about them, there is the [RStudio guide on posit](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html){target=\"_blank\"}\n\n\n\n## Activity 1: Creating a new project {#sec-project}\n\nIt's important to create a new RStudio project whenever you start a new project. This practice makes it easier to work in multiple contexts, such as when analysing different datasets simultaneously. Each RStudio project has its own folder location, workspace, and working directories, which keeps all your data and RMarkdown documents organised in one place.\n\nLast year, you learnt how to create projects on the server, so you already know the steps. 
If cannot quite recall how that was done, go back to the [Level 1 materials](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#new-project){target=\"_blank\"}.\n\nOn your own computer, open RStudio, and complete the following steps in this order:\n\n* Click on File \\> New Project...\n* Then, click on \"New Directory\"\n* Then, click on \"New Project\"\n* Name the directory something meaningful (e.g., \"2A_chapter1\"), and save it in a location that makes sense, for example, a dedicated folder you have for your level 2 Psychology labs - you can either select a folder you have already in place or create a new one (e.g., I named my new folder \"Level 2 labs\")\n* Click \"Create Project\". RStudio will restart itself and open with this new project directory as the working directory. If you accidentally close it, you can open it by double-clicking on the project icon in your folder\n* You can also check in your folder structure that everything was created as intended\n\n![Creating a new project](images/project_setup.gif)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Why is the Colour scheme in the gif different to my version?\n\nIn case anyone is wondering why my colour scheme in the gif above looks different to yours, I've set mine to \"Pastel On Dark\" in Tools \\> Global Options... \\> Appearances. And my computer lives in \"dark mode\".\n\n:::\n\n::: callout-important\n\n## Don't nest projects\n\nDon't ever save a new project **inside** another project directory. This can cause some hard-to-resolve problems.\n\n:::\n\n\n## Activity 2: Create a new R Markdown file {#sec-rmd}\n\n* Open a new R Markdown document: click File \\> New File \\> R Markdown or click on the little page icon with a green plus sign (top left).\n* Give it a meaningful `Title` (e.g., Level 2 chapter 1) - you can also change the title later. Feel free to add your name or GUID in the `Author` field author name. 
Keep the `Default Output Format` as HTML.\n* Once the .`Rmd` opened, you need to save the file.\n* To save it, click File \\> Save As... or click on the little disc icon. Name it something meaningful (e.g., \"chapter_01.Rmd\", \"01_intro.Rmd\"). Make sure there are no spaces in the name - R is not very fond of spaces... This file will automatically be saved in your project folder (i.e., your working directory) so you should now see this file appear in your file viewer pane.\n\n\n![Creating a new `.Rmd` file](images/Rmd_setup.gif)\n\n\nRemember, an R Markdown document or `.Rmd` has \"white space\" (i.e., the markdown for formatted text) and \"grey parts\" (i.e., code chunks) in the default colour scheme (see @fig-rmd). R Markdown is a powerful tool for creating dynamic documents because it allows you to integrate code and regular text seamlessly. You can then knit your `.Rmd` using the `knitr` package to create a final document as either a webpage (HTML), a PDF, or a Word document (.docx). We'll only knit to HTML documents in this course.\n\n\n![R markdown anatomy (image from [https://intro2r.com/r-markdown-anatomy.html](https://intro2r.com/r-markdown-anatomy.html){target=\"_blank\"})](images/rm_components.png)\n\n\n\n### Markdown\n\nThe markdown space in an `.Rmd` is ideal for writing notes that explain your code and document your thought process. Use this space to clarify what your code is doing, why certain decisions were made, and any insights or conclusions you have drawn along the way. These notes are invaluable when revisiting your work later, helping you (or others) understand the rationale behind key decisions, such as setting inclusion/exclusion criteria or interpreting the results of assumption tests. Effectively documenting your work in the markdown space enhances both the clarity and reproducibility of your analysis.\n\nThe markdown space offers a variety of formatting options to help you organise and present your notes effectively. 
Here are a few of them that can enhance your documentation:\n\n#### Heading levels {.unnumbered}\n\nThere is a variety of **heading levels** to make use of, using the `#` symbol.\n\n\n::: columns\n\n::: column\n\n##### You would incorporate this into your text as: {.unnumbered}\n\n\\# Heading level 1\n\n\\## Heading level 2\n\n\\### Heading level 3\n\n\\#### Heading level 4\n\n\\##### Heading level 5\n\n\\###### Heading level 6\n\n:::\n\n::: column\n\n##### And it will be displayed in your knitted html file as: {.unnumbered}\n\n![](images/heading_levels.PNG)\n\n:::\n\n:::\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My heading levels don't render properly when knitting\n\nYou need a space between the # and the first letter. If the space is missing, the heading will be displayed in the HTML file as ...\n\n#Heading 1\n\n:::\n\n#### Unordered and ordered lists {.unnumbered}\n\nYou can also include **unordered lists** and **ordered lists**. Click on the tabs below to see how they are incorporated\n\n::: panel-tabset\n\n## unordered lists\n\nYou can add **bullet points** using either `*`, `-` or `+` and they will turn into:\n\n* bullet point (created with `*`)\n* bullet point (created with `-`)\n+ bullet point (created with `+`)\n\nor use bullet points of different levels using 1 tab key press or 2 spaces (for sub-item 1) or 2 tabs/4 spaces (for sub-sub-item 1):\n\n* bullet point item 1\n * sub-item 1\n * sub-sub-item 1\n * sub-sub-item 2\n* bullet point item 2\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My bullet points don't render properly when knitting\n\nYou need an empty row before your bullet points start. If I delete the empty row before the bullet points, they will be displayed in the HTML as ...\n\nText without the empty row: * bullet point created with `*` - bullet point created with `-` + bullet point created with `+`\n\n:::\n\n\n## ordered lists\n\nStart the line with **1.**, **2.**, etc. 
When you want to include sub-items, either use the `tab` key twice or add **4 spaces**. Same goes for the sub-sub-item: include either 2 tabs (or 4 manual spaces) from the last item or 4 tabs/ 8 spaces from the start of the line.\n\n1. list item 1\n2. list item 2\n i) sub-item 1 (with 4 spaces)\n A. sub-sub-item 1 (with an additional 4 spaces from the last indent)\n\n::: {.callout-important collapse=\"true\"}\n\n## My list items don't render properly when knitting\n\nIf you don't leave enough spaces, the list won't be recognised, and your output looks like this:\n\n3. list item 3\n i) sub-item 1 (with only 2 spaces) \n A. sub-sub-item 1 (with an additional 2 spaces from the last indent)\n\n:::\n\n\n## ordered lists magic\n\nThe great thing though is that you don't need to know your alphabet or number sequences. R markdown will fix that for you\n\nIf I type into my `.Rmd`...\n\n![](images/list_magic.PNG)\n\n...it will be rendered in the knitted HTML output as...\n\n3. list item 3\n1. list item 1\n a) sub-item labelled \"a)\"\n i) sub-item labelled \"i)\"\n C) sub-item labelled \"C)\"\n Z) sub-item labelled \"Z)\"\n7. list item 7\n\n\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: The labels of the sub-items are not what I thought they would be. You said they are fixing themselves...\n\nYes, they do but you need to label your sub-item lists accordingly. The first label you list in each level is set as the baseline. If they are labelled `1)` instead of `i)` or `A.`, the output will show as follows, but the automatic-item-fixing still works:\n\n7. 
list item 7\n 1) list item \"1)\" with 4 spaces\n 1) list item \"1)\" with 8 spaces\n 6) this is an item labelled \"6)\" (magically corrected to \"2.\")\n:::\n\n:::\n\n#### Emphasis {.unnumbered}\n\nInclude **emphasis** to draw attention to keywords in your text:\n\n| R markdown syntax | Displayed in the knitted HTML file |\n|:----------------------------|:-----------------------------------|\n| \\*\\*bold text\\*\\* | **bold text** |\n| \\*italic text\\* | *italic text* |\n| \\*\\*\\*bold and italic\\*\\*\\* | ***bold and italic*** |\n\n\nOther examples can be found in the [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target=\"_blank\"}\n\n\n\n### Code chunks {#sec-chunks}\n\nEverything you write inside the **code chunks** will be interpreted as code and executed by R. Code chunks start with ```` ``` ```` followed by an `{r}` which specifies the coding language R, some space for code, and ends with ```` ``` ````. If you accidentally delete one of those backticks, your code won't run and/or your text parts will be interpreted as part of the code chunks or vice versa. This should be evident from the colour change - more white than expected typically indicates missing starting backticks, whilst too much grey/not enough white suggests missing ending backticks. 
But no need to fret if that happens - just add the missing backticks manually.\n\n\nYou can **insert a new code chunk** in several ways:\n\n\n* Click the `Insert a new code chunk` button in the RStudio Toolbar (green icon at the top right corner of the `Source pane`).\n* Select Code \\> Insert Chunk from the menu.\n* Using the shortcut `Ctrl + Alt + I` for Windows or `Cmd + Option + I` on MacOSX.\n* Type ```` ```{r} ```` and ```` ``` ```` manually\n\n\n```{r fig-rmd, echo=FALSE, fig.cap=\"Default `.Rmd` with highlighting - names in pink and knitr display options in purple\"}\nknitr::include_graphics(\"images/default_highlighted.png\")\n```\n\n\n\nWithin the curly brackets of a code chunk, you can **specify a name** for the code chunk (see pink highlighting in @fig-rmd). The chunk name is not necessarily required; however, it is good practice to give each chunk a unique name to support more advanced knitting approaches. It also makes it easier to reference and manage chunks.\n\nWithin the curly brackets, you can also place **rules and arguments** (see purple highlighting in @fig-rmd) to control how your code is executed and what is displayed in your final HTML output. The most common **knitr display options** include:\n\n\n| Code | Does code run | Does code show | Do results show |\n|:--------------------|:-------------:|:--------------:|:---------------:|\n| eval=FALSE | NO | YES | NO |\n| echo=TRUE (default) | YES | YES | YES |\n| echo=FALSE | YES | NO | YES |\n| results='hide' | YES | YES | NO |\n| include=FALSE | YES | NO | NO |\n\n\n::: callout-important\n\nThe table above will be incredibly important for the data skills homework II. 
When solving error mode items you will need to pay attention to the first one `eval = FALSE`.\n\n:::\n\nOne last thing: In your newly created `.Rmd` file, delete everything below line 12 (keep the set-up code chunk) and save your `.Rmd` by clicking on the disc symbol.\n\n![Delete everything below line 12](images/delete_12.gif)\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nThat was quite a long section about what Markdown can do. I promise, we'll practice that more later. For the minute, we want you to create a new level 2 heading on line 12 and give it a meaningful heading title (something like \"Loading packages and reading in data\" or \"Chapter 1\").\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\nOn line 12, you should have typed **## Loading packages and reading in data** (or whatever meaningful title you chose). This will create a level 2 heading once we knit the `.Rmd`.\n\n:::\n\n:::\n\n\n## Activity 3: Download the data {#sec-download_data_ch1}\n\nThe data for chapters 1-3. Download it here: [data_ch1.zip](data/data_ch1.zip \"download\"). There are 2 csv files contained in a zip folder. 
One is the data file we are going to use today `prp_data_reduced.csv` and the other is an Excel file `prp_codebook` that explains the variables in the data.\n\nThe first step is to **unzip the zip folder** so that the files are placed within the same folder as your project.\n\n* Place the zip folder within your 2A_chapter1 folder\n* Right mouse click --> `Extract All...`\n* Check the folder location is the one to extract the files to\n* Check the extracted files are placed next to the project icon\n* Files and project should be visible in the Output pane in RStudio\n\n::: {.callout-note collapse=\"false\"}\n\n## Screenshots for \"unzipping a zip folder\"\n\n::: {layout-ncol=\"1\"}\n\n![](images/pic1.PNG){fig-align=\"center\"}\n\n![](images/pic23.PNG){fig-align=\"center\"}\n\n![](images/pic45.PNG){fig-align=\"center\"}\n\nUnzipping a zip folder\n\n:::\n:::\n\nThe paper by Pownall et al. was a **registered report** published in 2023, and the original data can be found on OSF ([https://osf.io/5qshg/](https://osf.io/5qshg/){target=\"_blank\"}).\n\n**Citation**\n\n> Pownall, M., Pennington, C. R., Norris, E., Juanchich, M., Smailes, D., Russell, S., Gooch, D., Evans, T. R., Persson, S., Mak, M. H. C., Tzavella, L., Monk, R., Gough, T., Benwell, C. S. Y., Elsherif, M., Farran, E., Gallagher-Mitchell, T., Kendrick, L. T., Bahnmueller, J., . . . Clark, K. (2023). Evaluating the Pedagogical Effectiveness of Study Preregistration in the Undergraduate Dissertation. *Advances in Methods and Practices in Psychological Science, 6*(4). [https://doi.org/10.1177/25152459231202724](https://doi.org/10.1177/25152459231202724){target=\"_blank\"}\n\n**Abstract**\n\n> Research shows that questionable research practices (QRPs) are present in undergraduate final-year dissertation projects. 
One entry-level Open Science practice proposed to mitigate QRPs is “study preregistration,” through which researchers outline their research questions, design, method, and analysis plans before data collection and/or analysis. In this study, we aimed to empirically test the effectiveness of preregistration as a pedagogic tool in undergraduate dissertations using a quasi-experimental design. A total of 89 UK psychology students were recruited, including students who preregistered their empirical quantitative dissertation (*n* = 52; experimental group) and students who did not (*n* = 37; control group). Attitudes toward statistics, acceptance of QRPs, and perceived understanding of Open Science were measured both before and after dissertation completion. Exploratory measures included capability, opportunity, and motivation to engage with preregistration, measured at Time 1 only. This study was conducted as a Registered Report; Stage 1 protocol: https://osf.io/9hjbw (date of in-principle acceptance: September 21, 2021). Study preregistration did not significantly affect attitudes toward statistics or acceptance of QRPs. However, students who preregistered reported greater perceived understanding of Open Science concepts from Time 1 to Time 2 compared with students who did not preregister. Exploratory analyses indicated that students who preregistered reported significantly greater capability, opportunity, and motivation to preregister. Qualitative responses revealed that preregistration was perceived to improve clarity and organization of the dissertation, prevent QRPs, and promote rigor. Disadvantages and barriers included time, perceived rigidity, and need for training. 
These results contribute to discussions surrounding embedding Open Science principles into research training.\n\n**Changes made to the dataset**\n\nWe made some changes to the dataset for the purpose of increasing difficulty for data wrangling (@sec-wrangling and @sec-wrangling2) and data visualisation (@sec-dataviz and @sec-dataviz2). This will ensure some \"teachable moments\". The changes are as follows:\n\n* We removed some of the variables to make the data more manageable for teaching purposes.\n* We recoded some values from numeric responses to labels (e.g., `understanding`).\n* We added the word \"years\" to one of the `Age` entries.\n* We tidied a messy column `Ethnicity` but introduced a similar but easier-to-solve \"messiness pattern\" when recoding the `understanding` data.\n* The scores in the original file were already corrected from reverse-coded responses. We reversed that process to present raw data here.\n\n\n\n\n## Activity 4: Installing packages, loading packages, and reading in data\n\n### Installing packages\n\nWhen you install R and RStudio for the first time (or after an update), most of the packages we will be using won’t be pre-installed. Before you can load new packages like `tidyverse`, you will need to install them.\n\nIf you try to load a package that has not been installed yet, you will receive an error message that looks something like this: `Error in library(tidyverse) : there is no package called 'tidyverse'`. \n\nTo fix this, simply install the package first. **In the console**, type the command `install.packages(\"tidyverse\")`. This **only needs to be done once after a fresh installation**. After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio.\n\n::: callout-important\n\n## Install packages from the console only\n\nNever include `install.packages()` in the Rmd. 
Only install packages from the console pane or the packages tab of the lower right pane!!!\n:::\n\n\nNote, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`.\n\n\n### Loading packages and reading in data\n\nThe first step is to load in the packages we need and read in the data. Today, we'll only be using `tidyverse`, and `read_csv()` will help us store the data from `prp_data_reduced.csv` in an object called `data_prp`.\n\nCopy the code into a code chunk in your `.Rmd` file and run it. You can either click the `green arrow` to run the entire code chunk, or use the shortcut `Ctrl + Enter` (Windows) or `Cmd + Enter` (Mac) to run a line of code/pipe from the Rmd.\n\n```{r eval=FALSE}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n\n\n\n```{r echo=FALSE}\n## I basically have to have 2 code chunks since I tell them to put the data files next to the project, and mine are in a separate folder called data - unless I'll turn this into a fixed path\nlibrary(tidyverse)\ndata_prp <- read_csv(\"data/prp_data_reduced.csv\")\n```\n\n\n\n\n## Activity 5: Familiarise yourself with the data {#sec-familiarise}\n\n* Look at the **Codebook** to get a feel for the variables in the dataset and how they have been measured. 
Note that some of the columns were deleted in the dataset you have been given.\n* You'll notice that some questionnaire data was collected at 2 different time points (i.e., SATS28, QRPs, Understanding_OS)\n* Some of the data was only collected at one time point (i.e., supervisor judgements, OS_behav items, and Included_prereg variables are t2-only variables)\n\n\n\n### First glimpse at the data\n\nBefore you start wrangling your data, it is important to understand what kind of data you're working with and what the format of your dataframe looks like.\n\nAs you may have noticed, `read_csv()` provides a **message** listing the data types in your dataset and how many columns are of each type. Plus, it shows a few example columns for each data type.\n\nTo obtain more detailed information about your data, you have several options. Click on the individual tabs to see the different options available. Test them out in your own `.Rmd` file and use whichever method you prefer (but do it).\n\n::: callout-warning\n\nSome of the output is a bit long because we do have quite a few variables in the data file.\n\n:::\n\n::: panel-tabset\n\n## visual inspection 1\n\nIn the `Global Environment`, click the blue arrow icon next to the object name `data_prp`. This action will expand the object, revealing details about its columns. The `$` symbol is commonly used in Base R to access a specific column within your dataframe.\n\n![Visual inspection of the data](images/data_prp.PNG)\n\nCon: When you have quite a few variables, not all of them are shown.\n\n## `glimpse()`\n\nUse `glimpse()` if you want a more detailed overview you can see on your screen. The output will display the number of rows and columns, and some examples of the first couple of observations for each variable.\n\n\n```{r}\nglimpse(data_prp)\n```\n\n\n## `spec()`\n\nYou can also use `spec()` as suggested in the message above and then it shows you a list of the data type in every single column. 
But it doesn't show you the number of rows and columns.\n\n\n```{r}\nspec(data_prp)\n```\n\n\n## visual inspection 2\n\nIn the `Global Environment`, click on the object name `data_prp`. This action will open the data in a new tab. Hovering over the column headings with your mouse will also reveal their data type. However, it seems to be a fairly tedious process when you have loads of columns.\n\n::: {.callout-important collapse=\"true\"}\n\n## Hang on, where is the rest of my data? Why do I only see 50 columns?\n\nOne common source of confusion is not seeing all your columns when you open up a data object as a tab. This is because RStudio shows you a maximum of 50 columns at a time. If you have more than 50 columns, navigate with the arrows to see the remaining columns.\n\n![Showing 50 columns at a time](images/50_col.PNG)\n\n:::\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nNow that you have tested out all the options in your own `.Rmd` file, you can probably answer the following questions:\n\n* How many observations? `r fitb(\"89\")`\n* How many variables? `r fitb(\"91\")`\n* How many columns are `col_character` or `chr` data type? `r fitb(\"17\")`\n* How many columns are `col_double` or `dbl` data type? `r fitb(\"74\")`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe visual inspections show you the **number of observations and variables**. `glimpse()` also gives you that information but calls them **rows and columns** respectively.\n\nThe **data type information** actually comes from the output when using the `read_csv()` function. Did you notice the information on **Column specification** (see screenshot below)?\n\n![message from `read_csv()` when reading in the data](images/col_spec.PNG)\n\nWhilst `spec()` is quite useful for data type information per individual column, it doesn't give you the total count of each data type. 
So it doesn't really help with answering the questions here - unless you want to count manually from its extremely long output.\n\n:::\n\nIn your `.Rmd`, include a **new heading level 2** called \"Information about the data\" (or something equally meaningful) and jot down some notes about `data_prp`. You could include the citation and/or the abstract, and whatever information you think you should note about this dataset (e.g., any observations from looking at the codebook?). You could also include some notes on the functions used so far and what they do. Try to incorporate some **bold**, *italic* or ***bold and italic*** emphasis and perhaps a bullet point or two.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Possible solution\n\n\\#\\# Information about the data\n\nThe data is from Pownall et al. (2023), and I can find the paper here: https://doi.org/10.1177/25152459231202724.\n\nI've noticed in the prp codebook that the SATS-28 questionnaire has quite a few \\*reverse-coded items\\*, and the supervisor support questionnaire also has a reverse-coded item.\n\nSo far, I think I prefer \\*\\*glimpse()\\*\\* to show me some more detail about the data. Specs() is too text-heavy for me which makes it hard to read.\n\nThings to keep in mind:\n\n* \\*\\*don't forget to load in tidyverse first!!!\\*\\*\n* always read in the data with \\*\\*read_csv\\*\\*, \\*\\*\\*never ever use read.csv\\*\\*\\*!!!\n\n![The output rendered in a knitted html file](images/knitted_markdown.PNG)\n\n:::\n\n:::\n\n### Data types {#sec-datatypes}\n\nEach variable has a **data type**, such as numeric (numbers), character (text), and logical (TRUE/FALSE values), or a special class of factor. As you have just seen, our `data_prp` only has character and numeric columns (so far).\n\n**Numeric data** can be double (`dbl`) or integer (`int`). Doubles can have decimal places (e.g., 1.1). Integers are the whole numbers (e.g., 1, 2, -1) and are displayed with the suffix L (e.g., 1L). 
This is not overly important but might leave you less puzzled the next time you see an L after a number.\n\n**Characters** (also called “strings”) are anything written between quotation marks. This is usually text, but in special circumstances, a number can be a character if it is placed within quotation marks. This can happen when you are recoding variables. It might not be too obvious at the time, but you won't be able to calculate anything if the number is a character.\n\n::: panel-tabset\n\n## Example data types\n\n```{r}\ntypeof(1)\ntypeof(1L)\ntypeof(\"1\")\ntypeof(\"text\")\n```\n\n## numeric computation\n\nNo problems here...\n\n```{r}\n1+1\n```\n\n## character computation\n\nWhen the data type is incorrect, you won't be able to compute anything, despite your numbers being shown as numeric values in the dataframe. The error message tells you exactly what's wrong with it, i.e., that you have `non-numeric arguments`.\n\n```{r, error = TRUE}\n\"1\"+\"1\" # ERROR\n```\n\n:::\n\n**Logical** data (also sometimes called “Boolean” values) are one of two values: TRUE or FALSE (written in uppercase). They become really important when we use `filter()` or `mutate()` with conditional statements such as `case_when()`. More about those in @sec-wrangling2.\n\n\nSome commonly used logical operators:\n\n| operator | description |\n|:---------|:-----------------------------------------------|\n| \\> | greater than |\n| \\>= | greater than or equal to |\n| \\< | less than |\n| \\<= | less than or equal to |\n| == | equal to |\n| != | not equal to |\n| %in% | TRUE if the value matches any element of the following vector |\n\n\nA **factor** is a specific type of integer or character that lets you assign the order of the categories. 
This becomes useful when you want to display certain categories in \"the correct order\" either in a dataframe (see *arrange*) or when plotting (see @sec-dataviz/ @sec-dataviz2).\n\n\n\n### Variable types\n\nYou've already encountered them in [Level 1](https://psyteachr.github.io/data-skills-v2/intro-to-probability.html){target=\"_blank\"} but let's refresh. Variables can be classified as **continuous** (numbers) or **categorical** (labels).\n\n**Categorical** variables are properties you can count. They can be **nominal**, where the categories don't have an order (e.g., gender) or **ordinal** (e.g., Likert scales either with numeric values 1-7 or with character labels such as \"agree\", \"neither agree nor disagree\", \"disagree\"). Categorical data may also be **factors** rather than characters.\n\n**Continuous variables** are properties you can measure and calculate sums/ means/ etc. They may be rounded to the nearest whole number, but it should make sense to have a value between them. Continuous variables always have a **numeric** data type (i.e. `integer` or `double`).\n\n::: callout-tip\n\n## Why is this important you may ask?\n\nKnowing your variable and data types will help later on when deciding on an appropriate plot (see @sec-dataviz and @sec-dataviz2) or which inferential test to run (@sec-nhstI to @sec-factorial).\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAs we've seen earlier, `data_prp` only had character and numeric variables which hardly tests your understanding to see if you can identify a variety of data types and variable types. So, for this little quiz, we've spiced it up a bit. We've selected a few columns, shortened some of the column names, and modified some of the data types. Here you can see the first few rows of the new object `data_quiz`. 
*You can find the code with explanations at the end of this section.*\n\n```{r echo=FALSE}\ndata_quiz <- data_prp %>% \n select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n mutate(Gender = factor(Gender),\n Secondyeargrade = factor(Secondyeargrade,\n levels = c(1, 2, 3, 4, 5),\n labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n `QRP_item > 4` = case_when(\n QRP_item > 4 ~ TRUE, \n .default = FALSE))\n```\n\n```{r echo=FALSE}\n# the `head()` function shows the first n number of rows of a dataset (here 5)\nhead(data_quiz, n = 5)\n```\n\n```{r}\nglimpse(data_quiz)\n```\n\n\n\nSelect from the dropdown menu the variable type and the data type for each of the columns.\n\n```{r, include = FALSE}\n# variable type\ncon <- c(answer = \"continuous\", x = \"nominal\", x = \"ordinal\")\nnom <- c(x = \"continuous\", answer = \"nominal\", x = \"ordinal\")\nord <- c(x = \"continuous\", x = \"nominal\", answer = \"ordinal\")\n\n# data type\nnum <- c(answer = \"numeric\", x = \"character\", x = \"logical\", x = \"factor\")\nchr <- c(x = \"numeric\", answer = \"character\", x = \"logical\", x = \"factor\")\nlog <- c(x = \"numeric\", x = \"character\", answer = \"logical\", x = \"factor\")\nfctr <- c(x = \"numeric\", x = \"character\", x = \"logical\", answer = \"factor\")\n\n```\n\n| Column | Variable type | Data type |\n|:---------------------|:--------------|:--------------|\n| `Age` | `r mcq(con)` | `r mcq(chr)` |\n| `Gender` | `r mcq(nom)` | `r mcq(fctr)` |\n| `Ethnicity` | `r mcq(nom)` | `r mcq(chr)` |\n| `Secondyeargrade` | `r mcq(ord)` | `r mcq(fctr)` |\n| `QRP_item` | `r mcq(ord)` | `r mcq(num)` |\n| `QRPs_mean` | `r mcq(con)` | `r mcq(num)` |\n| `Understanding_item` | `r mcq(ord)` | `r mcq(chr)` |\n| `QRP_item > 4` | `r mcq(nom)` | `r mcq(log)` |\n\n:::\n\n::: 
{.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Revealing the mystery code that created `data_quiz`\n\nThe code might look a bit complex for the minute despite the line-by-line explanations below. Come back to it after completing chapter 2.\n\n```{r eval=FALSE}\ndata_quiz <- data_prp %>% \n select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n mutate(Gender = factor(Gender),\n Secondyeargrade = factor(Secondyeargrade,\n levels = c(1, 2, 3, 4, 5),\n labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n `QRP_item > 4` = case_when(\n QRP_item > 4 ~ TRUE, \n .default = FALSE))\n```\n\nLet's go through this line by line:\n\n* **line 1**: creates a new object called `data_quiz` and it is based on the already existing data object `data_prp`\n* **line 2**: we are selecting a few variables of interest, such as Code, Age etc. Some of those variables were renamed in the process according to the structure `new_name = old_name`, for example QRP item 3 at time point 1 got renamed as `QRP_item`.\\\n* **line 3**: The function `mutate()` is used to create a new column called `Gender` that turns the existing column `Gender` from a numeric value into a factor. R simply overwrites the existing column of the same name. If we had named the new column `Gender_factor`, we would have been able to retain the original `Gender` column and `Gender_factor` would have been added as the last column.\n* **lines 4-6**: See how the line starts with an indent which indicates we are still within the `mutate()` function. 
You can also see this by counting brackets - in line 3 there are 2 opening brackets but only 1 closes.\n * Similar to `Gender`, we are replacing the \"old\" `Secondyeargrade` with the new `Secondyeargrade` column that is now a factor.\n * Turning our variable `Secondyeargrade` into a factor, spot the difference between this attempt and the one we used for `Gender`? Here we are using a lot more arguments in that factor function, namely levels and labels. **Levels** describes the unique values we have for that column, and in **labels** we want to define how these levels will be shown in the data object. If you don't add the levels and labels arguments, the labels will be the same as the levels (as you can see in the `Gender` column in which we kept the numbers).\n* **line 7**: Doesn't start with a function name and has an indent, which means we are *still* within the `mutate()` function - count the opening and closing brackets to confirm.\n * Here, we are creating a new column called `QRP_item > 4`. Notice the two backticks we have to use to make this weird column name work? This is because it has spaces (and we did mention that R doesn't like spaces). So the backticks help R to group it as a unit/ a single name.\n * Next we have a `case_when()` function which helps execute conditional statements. We are using it to check whether a statement is TRUE or FALSE. Here, we ask whether the QRP item (column `QRP_item`) is larger than 4 (midpoint of the scale) using the Boolean operator `>`. If the statement is `TRUE`, the label `TRUE` should appear in column `QRP_item > 4`. Otherwise, if the value is equal to 4 or smaller, the label should read `FALSE`. We will come back to conditional statements in @sec-wrangling. But long story short, this Boolean expression created the only logical data type in `data_quiz`.\n:::\n\nAnd with this, we are done with the individual walkthrough. 
Well done :)\n\n\n\n\n\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nThe data we will be using in the upcoming lab activities is from a randomised controlled trial by Binfet et al. (2021) that was conducted in Canada.\n\n**Citation**\n\n> Binfet, J. T., Green, F. L. L., & Draper, Z. A. (2021). The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. *Anthrozoös, 35*(1), 1–22. [https://doi.org/10.1080/08927936.2021.1944558](https://doi.org/10.1080/08927936.2021.1944558){target=\"_blank\"}\n\n**Abstract**\n\n> Researchers have claimed that canine-assisted interventions (CAIs) contribute significantly to bolstering participants' wellbeing, yet the mechanisms within interactions have received little empirical attention. The aim of this study was to assess the impact of client–canine contact on wellbeing outcomes in a sample of 284 undergraduate college students (77% female; 21% male, 2% non-binary). Participants self-selected to participate and were randomly assigned to one of two canine interaction treatment conditions (touch or no touch) or to a handler-only condition with no therapy dog present. To assess self-reports of wellbeing, measures of flourishing, positive and negative affect, social connectedness, happiness, integration into the campus community, stress, homesickness, and loneliness were administered. Exploratory analyses were conducted to assess whether these wellbeing measures could be considered as measuring a unidimensional construct. This included both reliability analysis and exploratory factor analysis. Based on the results of these analyses we created a composite measure using participant scores on a latent factor. We then conducted the tests of the four hypotheses using these factor scores. 
Results indicate that participants across all conditions experienced enhanced wellbeing on several measures; however, only those in the direct contact condition reported significant improvements on all measures of wellbeing. Additionally, direct interactions with therapy dogs through touch elicited greater wellbeing benefits than did no touch/indirect interactions or interactions with only a dog handler. Similarly, analyses using scores on the wellbeing factor indicated significant improvement in wellbeing across all conditions (handler-only, *d* = 0.18, *p* = 0.041; indirect, *d* = 0.38, *p* \\< 0.001; direct, *d* = 0.78, *p* \\< 0.001), with more benefit when a dog was present (*d* = 0.20, *p* \\< 0.001), and the most benefit coming from physical contact with the dog (*d* = 0.13, *p* = 0.002). The findings hold implications for post-secondary wellbeing programs as well as the organization and delivery of CAIs.\n\n\nHowever, we accessed the data via Ciaran Evans' github ([https://github.com/ciaran-evans/dog-data-analysis](https://github.com/ciaran-evans/dog-data-analysis){target=\"_blank\"}). Evans et al. (2023) published a paper that reused the Binfet data for teaching statistics and research methods. If anyone is interested, the accompanying paper is:\n\n> Evans, C., Cipolli, W., Draper, Z. A., & Binfet, J. T. (2023). Repurposing a Peer-Reviewed Publication to Engage Students in Statistics: An Illustration of Study Design, Data Collection, and Analysis. *Journal of Statistics and Data Science Education, 31*(3), 236–247. [https://doi.org/10.1080/26939169.2023.2238018](https://doi.org/10.1080/26939169.2023.2238018){target=\"_blank\"}\n\n**There are a few changes that Evans and we made to the data:**\n\n* Evans removed the demographics ethnicity and gender to make the study data available while protecting participant privacy. 
Which means we'll have limited demographic variables available, but we will make do with what we've got.\n* We modified some of the responses in the raw data csv - for example, we took out impossible response values and replaced them with `NA`.\n* We replaced some of the numbers with labels to increase the difficulty in the dataset for @sec-wrangling and @sec-wrangling2.\n\n\n\n### Task 1: Create a project folder for the lab activities {.unnumbered}\n\nSince we will be working with the same data throughout semester 1, create a separate project for the lab data. Name it something useful, like `lab_data` or `dogs_in_the_lab`. Make sure you are not placing it within the project you have already created today. If you need guidance, see @sec-project above.\n\n\n\n### Task 2: Create a new `.Rmd` file {.unnumbered}\n\n... and name it something useful. If you need help, have a look at @sec-rmd.\n\n\n\n### Task 3: Download the data {.unnumbered}\n\nDownload the data here: [data_pair_ch1](data/data_pair_ch1.zip \"download\"). The zip folder contains the raw data file with responses to individual questions, a cleaned version of the same data in long format and wide format, and the codebook describing the variables in the raw data file and the long format.\n\n**Unzip the folder and place the data files in the same folder as your project.**\n\n\n\n### Task 4: Familiarise yourself with the data {.unnumbered}\n\nOpen the data files, look at the codebook, and perhaps skim over the original Binfet article (methods in particular) to see what kind of measures they used.\n\nRead in the raw data file as `dog_data_raw` and the cleaned-up data (long format) as `dog_data_long`. 
See if you can answer the following questions.\n\n```{r eval=FALSE}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\ndog_data_long <- read_csv(\"dog_data_clean_long.csv\")\n```\n\n```{r echo=FALSE}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"data/dog_data_raw.csv\")\ndog_data_long <- read_csv(\"data/dog_data_clean_long.csv\")\n```\n\n* How many participants took part in the study? `r fitb(\"284\")`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nYou can see this from `dog_data_raw`. Each participant ID is on a single row meaning the number of observations is the number of participants.\n\nIf you look at `dog_data_long`, there are 568 observations. Each participant answered the questionnaires pre and post intervention, resulting in 2 rows per participant ID. This means you'd have to divide the number of observations by 2 to get to the number of participants.\n\n:::\n\n* How many different questionnaires did the participants answer? `r fitb(c(\"9\", \"8\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nThe Binfet paper (e.g., Methods section and/or abstract) and the codebook show it's 9 questionnaires - Flourishing scale (variable `Flourishing`), the UCLA Loneliness scale Version 3 (`Loneliness`), Positive and Negative affect scale (`PANAS_PA` and `PANAS_NA`), the Subjective Happiness scale (`SHS`), the Social connectedness scale (`SCS`), and 3 scales with 1 question each, i.e., perception of stress levels (`Stress`), self-reported level of homesickness (`Homesick`), and integration into the campus community (`Engagement`).\n\nHowever, if you thought `PANAS_PA` and `PANAS_NA` are a single questionnaire, 8 was also acceptable as an answer here.\n\n:::\n\n\n\n\n## [Test your knowledge]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nAre you ready for some knowledge check questions to test your understanding of the chapter? 
We also have some faulty code. See if you can spot what's wrong with it.\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nOne of the key first steps when we open RStudio is to: `r longmcq(c(x = \"put on some music as we will be here a while\", answer = \"open an existing project or create a new one\", x = \"make a coffee\", x = \"check out the news\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nOpening an existing project (e.g., when coming back to the same dataset) or creating a new project (e.g., for a new task or new dataset) ensures that subsequent `.Rmd` files, any output, figures, etc. are saved within the same folder on your computer (i.e., the working directory). If the `.Rmd` files or data are not in the same folder as \"the project icon\", things can get messy and code might not run.\n\n:::\n\n\n#### Question 2 {.unnumbered}\n\nWhen using the default environment colour settings for RStudio, what colour would the background of a code chunk be in R Markdown? `r mcq(c(x = \"red\", x = \"white\", x = \"green\", answer = \"grey\"))`\n\nWhen using the default environment colour settings for RStudio, what colour would the background of normal text be in R Markdown? `r mcq(c(x = \"red\", answer = \"white\", x = \"green\", x = \"grey\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nAssuming you have not changed any of the settings in RStudio, code chunks will tend to have a grey background and normal text will tend to have a white background. 
This is a good way to check that you have closed and opened code chunks correctly.\n\n:::\n\n\n\n#### Question 3 {.unnumbered}\n\nCode chunks start and end with: `r longmcq(c(x = \"three single quotes\", answer = \"three backticks\", x = \"three double quotes\", x = \"three single asterisks\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCode chunks always take the same general format of three backticks followed by curly parentheses and a lower case r inside the parentheses (`{r}`). People often mistake these backticks for single quotes but that will not work. If you have set your code chunk correctly using backticks, the background colour should change to grey from white.\n\n:::\n\n\n\n#### Question 4 {.unnumbered}\n\nWhat is the correct way to include a code chunk in RMarkdown that will be executed but neither the code nor its output will be shown in the final HTML document? `r mcq(c(x = \"{r, echo=FALSE}\", x = \"{r, eval=FALSE}\", answer = \"{r, include=FALSE}\", x = \"{r, results='hide'}\"))`\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCheck the table of knitr display options in @sec-chunks.\n\n* {r, echo=FALSE} also executes the code and does not show the code, but it *does* display the result in the knitted html file. (matches 2/3 criteria)\n* {r, eval=FALSE} does not show the results but does *not* execute the code and it *does* show it in the knitted file. (matches 1/3 criteria)\n* {r, results='hide'} executes the code and does not show results, however, it *does* include the code in the knitted html document. (matches 2/3 criteria)\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of these code chunks have mistakes in them; others are not quite producing what was aimed for. 
Your task is to spot anything faulty, explain why it happened, and perhaps try to fix it.\n\n\n\n#### Question 5 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. You have just started R, created a new `.Rmd` file, and typed the following code into your code chunk.\n\n```{r eval=FALSE}\ndata <- read_csv(\"data.csv\")\n```\n\n\nHowever, R gives you an error message: `could not find function \"read_csv\"`. What could be the reason?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\n\"Could not find function\" is an indication that you have forgotten to load in tidyverse. Because `read_csv()` is a function in the tidyverse collection, R cannot find it.\n\nFIX: Add `library(tidyverse)` prior to reading in the data and run the code chunk again.\n\n:::\n\n\n\n#### Question 6 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. This time, you are certain you have loaded in tidyverse first. The code is as follows:\n\n```{r eval=FALSE}\nlibrary(tidyverse)\ndata <- read_csv(\"data.csv\")\n```\n\nThe error message shows `'data.csv' does not exist in current working directory`. You check your folder and it looks like this:\n\n![](images/error_ch1_01.PNG)\n\nWhy is there an error message?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nR is looking for a csv file that is called data which is currently not in the working directory. We may assume it's in the data folder. Perhaps that happened when unzipping the zip file. 
So instead of placing the csv file on the same level as the project icon, it was unzipped into a folder named data.\n\nFIX - option 1: Take the `data.csv` out of the data folder and place it next to the project icon and the `.Rmd` file.\n\nFIX - option 2: Modify your R code to tell R that the data is in a separate folder called data, e.g., ...\n\n```{r eval=FALSE}\nlibrary(tidyverse)\ndata <- read_csv(\"data/data.csv\")\n```\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\n\nYou want to load `tidyverse` into the library. The code is as follows:\n\n```{r eval=FALSE}\nlibrary(tidyverse)\n```\n\n\nThe error message says: `Error in library(tidyverse) : there is no package called ‘tidyverse’`\n\nWhy is there an error message and how can we fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nIf R says there is no package called `tidyverse`, it means you haven't installed the package yet. This could be an error message you receive either after switching computers or a fresh install of R and RStudio.\n\nFIX: Type `install.packages(\"tidyverse\")` into your **Console**.\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nYou knitted your `.Rmd` into a html but the output is not as expected. You see the following:\n\n![](images/error_knitted.PNG)\n\nWhy did the file not knit properly?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThere is a backtick missing in the code chunk. 
If you check your `.Rmd` file, you can see that the code chunk does not show up in grey which means it's one of the 3 backticks at the beginning of the chunk.\n\n![](images/error_ch1_08.PNG)\n\nFIX: Add a single backtick manually where it's missing.\n\n:::\n","srcMarkdownNoYaml":""},"formats":{"html":{"identifier":{"display-name":"HTML","target-format":"html","base-format":"html"},"execute":{"fig-width":7,"fig-height":5,"fig-format":"retina","fig-dpi":96,"df-print":"kable","error":false,"eval":true,"cache":null,"freeze":"auto","echo":true,"output":true,"warning":true,"include":true,"keep-md":false,"keep-ipynb":false,"ipynb":null,"enabled":null,"daemon":null,"daemon-restart":false,"debug":false,"ipynb-filters":[],"engine":"knitr"},"render":{"keep-tex":false,"keep-source":false,"keep-hidden":false,"prefer-html":false,"output-divs":true,"output-ext":"html","fig-align":"default","fig-pos":null,"fig-env":null,"code-fold":false,"code-overflow":"wrap","code-link":true,"code-line-numbers":true,"code-tools":false,"tbl-colwidths":"auto","merge-includes":true,"inline-includes":false,"preserve-yaml":false,"latex-auto-mk":true,"latex-auto-install":true,"latex-clean":true,"latex-max-runs":10,"latex-makeindex":"makeindex","latex-makeindex-opts":[],"latex-tlmgr-opts":[],"latex-input-paths":[],"latex-output-dir":null,"link-external-icon":false,"link-external-newwindow":false,"self-contained-math":false,"format-resources":[],"notebook-links":true,"format-links":true},"pandoc":{"standalone":true,"wrap":"none","default-image-extension":"png","to":"html","css":["https://use.fontawesome.com/releases/v5.13.0/css/all.css","include/booktem.css","include/webex.css","include/glossary.css","include/style.css","include/custom.scss"],"highlight-style":"a11y","include-after-body":["include/webex.js","include/script.js"],"output-file":"01-basics.html"},"language":{"toc-title-document":"Table of contents","toc-title-website":"On this page","related-formats-title":"Other 
Formats","related-notebooks-title":"Notebooks","source-notebooks-prefix":"Source","section-title-abstract":"Abstract","section-title-appendices":"Appendices","section-title-footnotes":"Footnotes","section-title-references":"References","section-title-reuse":"Reuse","section-title-copyright":"Copyright","section-title-citation":"Citation","appendix-attribution-cite-as":"For attribution, please cite this work as:","appendix-attribution-bibtex":"BibTeX citation:","title-block-author-single":"Author","title-block-author-plural":"Authors","title-block-affiliation-single":"Affiliation","title-block-affiliation-plural":"Affiliations","title-block-published":"Published","title-block-modified":"Modified","callout-tip-title":"Tip","callout-note-title":"Note","callout-warning-title":"Warning","callout-important-title":"Important","callout-caution-title":"Caution","code-summary":"Code","code-tools-menu-caption":"Code","code-tools-show-all-code":"Show All Code","code-tools-hide-all-code":"Hide All Code","code-tools-view-source":"View Source","code-tools-source-code":"Source Code","code-line":"Line","code-lines":"Lines","copy-button-tooltip":"Copy to Clipboard","copy-button-tooltip-success":"Copied!","repo-action-links-edit":"Edit this page","repo-action-links-source":"View source","repo-action-links-issue":"Report an issue","back-to-top":"Back to top","search-no-results-text":"No results","search-matching-documents-text":"matching documents","search-copy-link-title":"Copy link to search","search-hide-matches-text":"Hide additional matches","search-more-match-text":"more match in this document","search-more-matches-text":"more matches in this document","search-clear-button-title":"Clear","search-detached-cancel-button-title":"Cancel","search-submit-button-title":"Submit","search-label":"Search","toggle-section":"Toggle section","toggle-sidebar":"Toggle sidebar navigation","toggle-dark-mode":"Toggle dark mode","toggle-reader-mode":"Toggle reader mode","toggle-navigation":"Toggle 
navigation","crossref-fig-title":"Figure","crossref-tbl-title":"Table","crossref-lst-title":"Listing","crossref-thm-title":"Theorem","crossref-lem-title":"Lemma","crossref-cor-title":"Corollary","crossref-prp-title":"Proposition","crossref-cnj-title":"Conjecture","crossref-def-title":"Definition","crossref-exm-title":"Example","crossref-exr-title":"Exercise","crossref-ch-prefix":"Chapter","crossref-apx-prefix":"Appendix","crossref-sec-prefix":"Section","crossref-eq-prefix":"Equation","crossref-lof-title":"List of Figures","crossref-lot-title":"List of Tables","crossref-lol-title":"List of Listings","environment-proof-title":"Proof","environment-remark-title":"Remark","environment-solution-title":"Solution","listing-page-order-by":"Order By","listing-page-order-by-default":"Default","listing-page-order-by-date-asc":"Oldest","listing-page-order-by-date-desc":"Newest","listing-page-order-by-number-desc":"High to Low","listing-page-order-by-number-asc":"Low to High","listing-page-field-date":"Date","listing-page-field-title":"Title","listing-page-field-description":"Description","listing-page-field-author":"Author","listing-page-field-filename":"File Name","listing-page-field-filemodified":"Modified","listing-page-field-subtitle":"Subtitle","listing-page-field-readingtime":"Reading Time","listing-page-field-categories":"Categories","listing-page-minutes-compact":"{0} min","listing-page-category-all":"All","listing-page-no-matches":"No matching items"},"metadata":{"lang":"en","fig-responsive":true,"quarto-version":"1.3.450","lightbox":true,"bibliography":["include/references.bib"],"csl":"include/apa.csl","theme":{"light":["flatly","include/light.scss"],"dark":["darkly","include/dark.scss"]},"code-copy":"hover","mainfont":"","monofont":""},"extensions":{"book":{"multiFile":true}}}},"projectFormats":["html"]} \ No newline at end of file diff --git a/.quarto/xref/5c23a141 b/.quarto/xref/5c23a141 index ba7841b..3e702b8 100644 --- a/.quarto/xref/5c23a141 +++ 
b/.quarto/xref/5c23a141 @@ -1 +1 @@ -{"headings":["intended-learning-outcomes","individual-walkthrough","r-and-r-studio","r-server","installing-r-and-rstudio-on-your-computer","settings-for-reproducibility","rstudio-panes","sec-project","sec-rmd","markdown","heading-levels","you-would-incorporate-this-into-your-text-as","and-it-will-be-displayed-in-your-knitted-html-file-as","unordered-and-ordered-lists","emphasis","sec-chunks","sec-download_data_ch1","activity-4-installing-packages-loading-packages-and-reading-in-data","installing-packages","loading-packages-and-reading-in-data","sec-familiarise","first-glimpse-at-the-data","sec-datatypes","variable-types","pair-coding","task-1-create-a-project-folder-for-the-lab-activities","task-2-create-a-new-.rmd-file","task-3-download-the-data","task-4-familiarise-yourself-with-the-data","test-your-knowledge","knowledge-check","question-1","question-2","question-3","question-4","error-mode","question-5","question-6","question-7","question-8","sec-basics"],"entries":[{"order":{"section":[1,0,0,0,0,0,0],"number":1},"key":"sec-basics"},{"order":{"section":[1,2,0,0,0,0,0],"number":1},"caption":"1.2 Activity 1: Creating a new project","key":"sec-project"},{"order":{"section":[1,6,2,0,0,0,0],"number":6},"caption":"1.6.2 Data types","key":"sec-datatypes"},{"order":{"section":[1,6,0,0,0,0,0],"number":5},"caption":"1.6 Activity 5: Familiarise yourself with the data","key":"sec-familiarise"},{"order":{"section":[1,3,2,0,0,0,0],"number":1},"caption":"Default .Rmd with highlighting - names in pink and knitr display options in purple","key":"fig-rmd"},{"order":{"section":[1,4,0,0,0,0,0],"number":4},"caption":"1.4 Activity 3: Download the data","key":"sec-download_data_ch1"},{"order":{"section":[1,3,0,0,0,0,0],"number":2},"caption":"1.3 Activity 2: Create a new R Markdown file","key":"sec-rmd"},{"order":{"section":[1,3,2,0,0,0,0],"number":3},"caption":"1.3.2 Code 
chunks","key":"sec-chunks"}],"options":{"chapter-id":"sec-basics","chapters":true}} \ No newline at end of file +{"options":{"chapters":true,"chapter-id":"sec-basics"},"headings":["intended-learning-outcomes","individual-walkthrough","r-and-r-studio","r-server","installing-r-and-rstudio-on-your-computer","settings-for-reproducibility","rstudio-panes","sec-project","sec-rmd","markdown","heading-levels","you-would-incorporate-this-into-your-text-as","and-it-will-be-displayed-in-your-knitted-html-file-as","unordered-and-ordered-lists","emphasis","sec-chunks","sec-download_data_ch1","activity-4-installing-packages-loading-packages-and-reading-in-data","installing-packages","loading-packages-and-reading-in-data","sec-familiarise","first-glimpse-at-the-data","sec-datatypes","variable-types","pair-coding","task-1-create-a-project-folder-for-the-lab-activities","task-2-create-a-new-.rmd-file","task-3-download-the-data","task-4-familiarise-yourself-with-the-data","test-your-knowledge","knowledge-check","question-1","question-2","question-3","question-4","error-mode","question-5","question-6","question-7","question-8","sec-basics"],"entries":[{"key":"sec-basics","order":{"section":[1,0,0,0,0,0,0],"number":1}},{"key":"sec-chunks","order":{"section":[1,3,2,0,0,0,0],"number":3},"caption":"1.3.2 Code chunks"},{"key":"sec-rmd","order":{"section":[1,3,0,0,0,0,0],"number":2},"caption":"1.3 Activity 2: Create a new R Markdown file"},{"key":"sec-download_data_ch1","order":{"section":[1,4,0,0,0,0,0],"number":4},"caption":"1.4 Activity 3: Download the data"},{"key":"sec-familiarise","order":{"section":[1,6,0,0,0,0,0],"number":5},"caption":"1.6 Activity 5: Familiarise yourself with the data"},{"key":"fig-rmd","order":{"section":[1,3,2,0,0,0,0],"number":1},"caption":"Default .Rmd with highlighting - names in pink and knitr display options in purple"},{"key":"sec-datatypes","order":{"section":[1,6,2,0,0,0,0],"number":6},"caption":"1.6.2 Data 
types"},{"key":"sec-project","order":{"section":[1,2,0,0,0,0,0],"number":1},"caption":"1.2 Activity 1: Creating a new project"}]} \ No newline at end of file diff --git a/.quarto/xref/7a0d69cd b/.quarto/xref/7a0d69cd index f0417c8..e5ccceb 100644 --- a/.quarto/xref/7a0d69cd +++ b/.quarto/xref/7a0d69cd @@ -1 +1 @@ -{"entries":[{"key":"sec-wrangling","order":{"section":[2,0,0,0,0,0,0],"number":1}},{"caption":"2.4 Activity 4: Questionable Research Practices (QRPs)","key":"sec-ch2_act4","order":{"section":[2,4,0,0,0,0,0],"number":1}}],"headings":["intended-learning-outcomes","individual-walkthrough","activity-1-setup","activity-2-load-in-the-libraries-and-read-in-the-data","activity-3-calculating-demographics","for-the-full-sample-using-summarise","fixing-age","computing-summary-stats","computing-summary-stats---third-attempt","per-gender-using-summarise-and-group_by","adding-percentages","sec-ch2_act4","the-main-goal-is-to-compute-the-mean-qrp-score-per-participant-for-time-point-1.","activity-5-knitting","activity-6-export-a-data-object-as-a-csv","pair-coding","task-1-open-the-r-project-you-created-last-week","task-2-open-your-.rmd-file-from-last-week","task-3-load-in-the-library-and-read-in-the-data","task-4-calculating-the-mean-for-flourishing_pre","test-your-knowledge-and-challenge-yourself","knowledge-check","question-1","question-2","question-3","question-4","question-5","error-mode","question-6","question-7","question-8","challenge-yourself","sec-wrangling"],"options":{"chapter-id":"sec-wrangling","chapters":true}} \ No newline at end of file +{"options":{"chapters":true,"chapter-id":"sec-wrangling"},"entries":[{"key":"sec-wrangling","order":{"number":1,"section":[2,0,0,0,0,0,0]}},{"key":"sec-ch2_act4","caption":"2.4 Activity 4: Questionable Research Practices 
(QRPs)","order":{"number":1,"section":[2,4,0,0,0,0,0]}}],"headings":["intended-learning-outcomes","individual-walkthrough","activity-1-setup","activity-2-load-in-the-libraries-and-read-in-the-data","activity-3-calculating-demographics","for-the-full-sample-using-summarise","fixing-age","computing-summary-stats","computing-summary-stats---third-attempt","per-gender-using-summarise-and-group_by","adding-percentages","sec-ch2_act4","the-main-goal-is-to-compute-the-mean-qrp-score-per-participant-for-time-point-1.","activity-5-knitting","activity-6-export-a-data-object-as-a-csv","pair-coding","task-1-open-the-r-project-you-created-last-week","task-2-open-your-.rmd-file-from-last-week","task-3-load-in-the-library-and-read-in-the-data","task-4-calculating-the-mean-for-flourishing_pre","test-your-knowledge-and-challenge-yourself","knowledge-check","question-1","question-2","question-3","question-4","question-5","error-mode","question-6","question-7","question-8","challenge-yourself","sec-wrangling"]} \ No newline at end of file diff --git a/01-basics.qmd b/01-basics.qmd index c65ea94..e7749c4 100644 --- a/01-basics.qmd +++ b/01-basics.qmd @@ -382,9 +382,9 @@ The first step is to **unzip the zip folder** so that the files are placed withi * Check the extracted files are placed next to the project icon * Files and project should be visible in the Output pane in RStudio -::: {.callout-note collapse="true"} +::: {.callout-note collapse="false"} -## For screenshots click here +## Screenshots for "unzipping a zip folder" ::: {layout-ncol="1"} @@ -432,6 +432,13 @@ If you try to load a package that has not been installed yet, you will receive a To fix this, simply install the package first. **In the console**, type the command `install.packages("tidyverse")`. This **only needs to be done once after a fresh installation**. After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio. 
+::: callout-important + +## Install packages from the console only + +Never include `install.packages()` in the Rmd. Only install packages from the console pane or the packages tab of the lower right pane!!! +::: + Note, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`. diff --git a/02-wrangling.qmd b/02-wrangling.qmd index a3346f5..6922b12 100644 --- a/02-wrangling.qmd +++ b/02-wrangling.qmd @@ -281,7 +281,7 @@ However, as you can see in the table above, each item is in a separate column, m Let’s tackle this problem step by step. It’s best to create a separate data object for this. If we tried to compute it within `data_prp`, it could quickly become messy. -* **Step 1**: Select the relevant columns `Code`, and `QRPs_1_Time1` to `QRPs_1_Time1` and store them in an object called `qrp_t1` +* **Step 1**: Select the relevant columns `Code`, and `QRPs_1_Time1` to `QRPs_11_Time1` and store them in an object called `qrp_t1` * **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can calculate the average score more easily (in step 3) * **Step 3**: Calculate the average QRP score (`QRPs_Acceptance_Time1_mean`) per participant using `group_by()` and `summarise()` diff --git a/_freeze/01-basics/execute-results/html.json b/_freeze/01-basics/execute-results/html.json index f606372..45e5d0d 100644 --- a/_freeze/01-basics/execute-results/html.json +++ b/_freeze/01-basics/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7d2dcc8f9868107397250763dbc8ea6c", + "hash": "0a40a1abf3ec5b113cf949b0a8f93347", "result": { - "markdown": "# Projects and R Markdown {#sec-basics}\n\n## Intended Learning Outcomes {.unnumbered}\n\nBy the end of this chapter, you should be able to:\n\n- Re-familiarise yourself with setting up projects\n- Re-familiarise yourself with RMarkdown documents\n- Recap and apply data wrangling procedures to analyse data\n\n## 
[Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n## R and R Studio\n\nRemember, R is a programming language that you will write code in and RStudio is an Integrated Development Environment (IDE) which makes working with R easier as it's more user friendly. You need both components for this course.\n\nIf this is not ringing any bells yet, have a quick browse through the [materials from year 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#sec-intro-r){target=\"_blank\"} to refresh your memory.\n\n\n### R server\n\nUse the server *only* if you are unable to install R and RStudio on your computer (e.g., if you are using a Chromebook) or if you encounter issues while installing R on your own machine. Otherwise, you should install R and RStudio directly on your own computer. R and RStudio are already installed on the *R server*.\n\nYou will find the link to the server on Moodle.\n\n\n### Installing R and RStudio on your computer\n\nThe [RSetGo book](https://psyteachr.github.io/RSetGo/){target=\"_blank\"} provides detailed instructions on how to install R and RStudio on your computer. It also includes links to walkthroughs for installing R on different types of computers and operating systems.\n\nIf you had R and RStudio installed on your computer last year, we recommend updating to the latest versions. In fact, it’s a good practice to update them at the start of each academic year. Detailed guidance can be found in @sec-updating-r.\n\nOnce you have installed or updated R and RStudio, return to this chapter.\n\n\n### Settings for Reproducibility\n\nBy now, you should be aware that the Psychology department at the University of Glasgow places a strong emphasis on reproducibility, open science, and raising awareness about questionable research practices (QRPs) and how to avoid them. 
Therefore, it's important that you work in a reproducible manner so that others (and your future self) can understand and check your work. This also makes it easier for you to reuse your work in the future.\n\nAlways start with a clear workspace. If your `Global Environment` contains anything from a previous session, you can’t be certain whether your current code is working as intended or if it’s using objects created earlier.\n\nTo ensure a clean and reproducible workflow, there are a few settings you should adjust immediately after installing or updating RStudio. In Tools \\> Global Options... General tab\n\n* Uncheck the box labelled Restore .RData into workspace at startup to make sure no data from a previous session is loaded into the environment\n* set Save workspace to .RData on exit to **Never** to prevent your workspace from being saved when you exit RStudio.\n\n![Reproducibility settings in Global Options](images/rstudio_settings_reproducibility.png)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for keeping taps on parentheses\n\nR has included **rainbow parentheses** to help with keeping count on the brackets.\n\nTo enable the feature, go to Tools \\> Global Options... Code tab \\> Display tab and tick the last checkbox \"Use rainbow parentheses\"\n\n![Enable Rainbow parenthesis](images/rainbow.PNG)\n\n:::\n\n### RStudio panes\n\nRStudio has four main panes each in a quadrant of your screen:\n\n* Source pane\n* Environment pane\n* Console pane\n* Output pane\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAre you ready for a quick quiz to see what you remember about the RStudio panes from last year? Click on **Quiz** to see the questions.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Quiz\n\n**What is their purpose?**\n\n**The Source pane...**
\n\n\n**The Environment pane...**
\n\n\n**The Console pane...**
\n\n\n**The Output pane...**
\n\n\n**Where are these panes located by default?**\n\n* The Source pane is located? \n* The Environment pane is located? \n* The Console pane is located? \n* The Output pane is located? \n\n:::\n\n:::\n\nIf you were not quite sure about one/any of the panes, check out the [materials from Level 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#rstudio-panes){target=\"_blank\"}. If you want to know more about them, there is the [RStudio guide on posit](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html){target=\"_blank\"}\n\n\n\n## Activity 1: Creating a new project {#sec-project}\n\nIt's important to create a new RStudio project whenever you start a new project. This practice makes it easier to work in multiple contexts, such as when analysing different datasets simultaneously. Each RStudio project has its own folder location, workspace, and working directories, which keeps all your data and RMarkdown documents organised in one place.\n\nLast year, you learnt how to create projects on the server, so you already know the steps. If cannot quite recall how that was done, go back to the [Level 1 materials](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#new-project){target=\"_blank\"}.\n\nOn your own computer, open RStudio, and complete the following steps in this order:\n\n* Click on File \\> New Project...\n* Then, click on \"New Directory\"\n* Then, click on \"New Project\"\n* Name the directory something meaningful (e.g., \"2A_chapter1\"), and save it in a location that makes sense, for example, a dedicated folder you have for your level 2 Psychology labs - you can either select a folder you have already in place or create a new one (e.g., I named my new folder \"Level 2 labs\")\n* Click \"Create Project\". RStudio will restart itself and open with this new project directory as the working directory. 
If you accidentally close it, you can open it by double-clicking on the project icon in your folder\n* You can also check in your folder structure that everything was created as intended\n\n![Creating a new project](images/project_setup.gif)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Why is the Colour scheme in the gif different to my version?\n\nIn case anyone is wondering why my colour scheme in the gif above looks different to yours, I've set mine to \"Pastel On Dark\" in Tools \\> Global Options... \\> Appearances. And my computer lives in \"dark mode\".\n\n:::\n\n::: callout-important\n\n## Don't nest projects\n\nDon't ever save a new project **inside** another project directory. This can cause some hard-to-resolve problems.\n\n:::\n\n\n## Activity 2: Create a new R Markdown file {#sec-rmd}\n\n* Open a new R Markdown document: click File \\> New File \\> R Markdown or click on the little page icon with a green plus sign (top left).\n* Give it a meaningful `Title` (e.g., Level 2 chapter 1) - you can also change the title later. Feel free to add your name or GUID in the `Author` field author name. Keep the `Default Output Format` as HTML.\n* Once the .`Rmd` opened, you need to save the file.\n* To save it, click File \\> Save As... or click on the little disc icon. Name it something meaningful (e.g., \"chapter_01.Rmd\", \"01_intro.Rmd\"). Make sure there are no spaces in the name - R is not very fond of spaces... This file will automatically be saved in your project folder (i.e., your working directory) so you should now see this file appear in your file viewer pane.\n\n\n![Creating a new `.Rmd` file](images/Rmd_setup.gif)\n\n\nRemember, an R Markdown document or `.Rmd` has \"white space\" (i.e., the markdown for formatted text) and \"grey parts\" (i.e., code chunks) in the default colour scheme (see @fig-rmd). R Markdown is a powerful tool for creating dynamic documents because it allows you to integrate code and regular text seamlessly. 
You can then knit your `.Rmd` using the `knitr` package to create a final document as either a webpage (HTML), a PDF, or a Word document (.docx). We'll only knit to HTML documents in this course.\n\n\n![R markdown anatomy (image from [https://intro2r.com/r-markdown-anatomy.html](https://intro2r.com/r-markdown-anatomy.html){target=\"_blank\"})](images/rm_components.png)\n\n\n\n### Markdown\n\nThe markdown space in an `.Rmd` is ideal for writing notes that explain your code and document your thought process. Use this space to clarify what your code is doing, why certain decisions were made, and any insights or conclusions you have drawn along the way. These notes are invaluable when revisiting your work later, helping you (or others) understand the rationale behind key decisions, such as setting inclusion/exclusion criteria or interpreting the results of assumption tests. Effectively documenting your work in the markdown space enhances both the clarity and reproducibility of your analysis.\n\nThe markdown space offers a variety of formatting options to help you organise and present your notes effectively. Here are a few of them that can enhance your documentation:\n\n#### Heading levels {.unnumbered}\n\nThere is a variety of **heading levels** to make use of, using the `#` symbol.\n\n\n::: columns\n\n::: column\n\n##### You would incorporate this into your text as: {.unnumbered}\n\n\\# Heading level 1\n\n\\## Heading level 2\n\n\\### Heading level 3\n\n\\#### Heading level 4\n\n\\##### Heading level 5\n\n\\###### Heading level 6\n\n:::\n\n::: column\n\n##### And it will be displayed in your knitted html file as: {.unnumbered}\n\n![](images/heading_levels.PNG)\n\n:::\n\n:::\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My heading levels don't render properly when knitting\n\nYou need a space between the # and the first letter. 
If the space is missing, the heading will be displayed in the HTML file as ...\n\n#Heading 1\n\n:::\n\n#### Unordered and ordered lists {.unnumbered}\n\nYou can also include **unordered lists** and **ordered lists**. Click on the tabs below to see how they are incorporated\n\n::: panel-tabset\n\n## unordered lists\n\nYou can add **bullet points** using either `*`, `-` or `+` and they will turn into:\n\n* bullet point (created with `*`)\n* bullet point (created with `-`)\n+ bullet point (created with `+`)\n\nor use bullet points of different levels using 1 tab key press or 2 spaces (for sub-item 1) or 2 tabs/4 spaces (for sub-sub-item 1):\n\n* bullet point item 1\n * sub-item 1\n * sub-sub-item 1\n * sub-sub-item 2\n* bullet point item 2\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My bullet points don't render properly when knitting\n\nYou need an empty row before your bullet points start. If I delete the empty row before the bullet points, they will be displayed in the HTML as ...\n\nText without the empty row: * bullet point created with `*` - bullet point created with `-` + bullet point created with `+`\n\n:::\n\n\n## ordered lists\n\nStart the line with **1.**, **2.**, etc. When you want to include sub-items, either use the `tab` key twice or add **4 spaces**. Same goes for the sub-sub-item: include either 2 tabs (or 4 manual spaces) from the last item or 4 tabs/ 8 spaces from the start of the line.\n\n1. list item 1\n2. list item 2\n i) sub-item 1 (with 4 spaces)\n A. sub-sub-item 1 (with an additional 4 spaces from the last indent)\n\n::: {.callout-important collapse=\"true\"}\n\n## My list items don't render properly when knitting\n\nIf you don't leave enough spaces, the list won't be recognised, and your output looks like this:\n\n3. list item 3\n i) sub-item 1 (with only 2 spaces) \n A. 
sub-sub-item 1 (with an additional 2 spaces from the last indent)\n\n:::\n\n\n## ordered lists magic\n\nThe great thing though is that you don't need to know your alphabet or number sequences. R markdown will fix that for you\n\nIf I type into my `.Rmd`...\n\n![](images/list_magic.PNG)\n\n...it will be rendered in the knitted HTML output as...\n\n3. list item 3\n1. list item 1\n a) sub-item labelled \"a)\"\n i) sub-item labelled \"i)\"\n C) sub-item labelled \"C)\"\n Z) sub-item labelled \"Z)\"\n7. list item 7\n\n\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: The labels of the sub-items are not what I thought they would be. You said they are fixing themselves...\n\nYes, they do but you need to label your sub-item lists accordingly. The first label you list in each level is set as the baseline. If they are labelled `1)` instead of `i)` or `A.`, the output will show as follows, but the automatic-item-fixing still works:\n\n7. list item 7\n 1) list item \"1)\" with 4 spaces\n 1) list item \"1)\" with 8 spaces\n 6) this is an item labelled \"6)\" (magically corrected to \"2.\")\n:::\n\n:::\n\n#### Emphasis {.unnumbered}\n\nInclude **emphasis** to draw attention to keywords in your text:\n\n| R markdown syntax | Displayed in the knitted HTML file |\n|:----------------------------|:-----------------------------------|\n| \\*\\*bold text\\*\\* | **bold text** |\n| \\*italic text\\* | *italic text* |\n| \\*\\*\\*bold and italic\\*\\*\\* | ***bold and italic*** |\n\n\nOther examples can be found in the [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target=\"_blank\"}\n\n\n\n### Code chunks {#sec-chunks}\n\nEverything you write inside the **code chunks** will be interpreted as code and executed by R. Code chunks start with ```` ``` ```` followed by an `{r}` which specifies the coding language R, some space for code, and ends with ```` ``` ````. 
If you accidentally delete one of those backticks, your code won't run and/or your text parts will be interpreted as part of the code chunks or vice versa. This should be evident from the colour change - more white than expected typically indicates missing starting backticks, whilst too much grey/not enough white suggests missing ending backticks. But no need to fret if that happens - just add the missing backticks manually.\n\n\nYou can **insert a new code chunk** in several ways:\n\n\n* Click the `Insert a new code chunk` button in the RStudio Toolbar (green icon at the top right corner of the `Source pane`).\n* Select Code \\> Insert Chunk from the menu.\n* Using the shortcut `Ctrl + Alt + I` for Windows or `Cmd + Option + I` on MacOSX.\n* Type ```` ```{r} ```` and ```` ``` ```` manually\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Default `.Rmd` with highlighting - names in pink and knitr display options in purple](images/default_highlighted.png){#fig-rmd fig-align='center' width=100%}\n:::\n:::\n\n\n\n\nWithin the curly brackets of a code chunk, you can **specify a name** for the code chunk (see pink highlighting in @fig-rmd). The chunk name is not necessarily required; however, it is good practice to give each chunk a unique name to support more advanced knitting approaches. It also makes it easier to reference and manage chunks.\n\nWithin the curly brackets, you can also place **rules and arguments** (see purple highlighting in @fig-rmd) to control how your code is executed and what is displayed in your final HTML output. 
The most common **knitr display options** include:\n\n\n| Code | Does code run | Does code show | Do results show |\n|:--------------------|:-------------:|:--------------:|:---------------:|\n| eval=FALSE | NO | YES | NO |\n| echo=TRUE (default) | YES | YES | YES |\n| echo=FALSE | YES | NO | YES |\n| results='hide' | YES | YES | NO |\n| include=FALSE | YES | NO | NO |\n\n\n::: callout-important\n\nThe table above will be incredibly important for the data skills homework II. When solving error mode items you will need to pay attention to the first one `eval = FALSE`.\n\n:::\n\nOne last thing: In your newly created `.Rmd` file, delete everything below line 12 (keep the set-up code chunk) and save your `.Rmd` by clicking on the disc symbol.\n\n![Delete everything below line 12](images/delete_12.gif)\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nThat was quite a long section about what Markdown can do. I promise, we'll practice that more later. For the minute, we want you to create a new level 2 heading on line 12 and give it a meaningful heading title (something like \"Loading packages and reading in data\" or \"Chapter 1\").\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\nOn line 12, you should have typed **## Loading packages and reading in data** (or whatever meaningful title you chose). This will create level 2 heading once we knit the `.Rmd`.\n\n:::\n\n:::\n\n\n## Activity 3: Download the data {#sec-download_data_ch1}\n\nThe data for chapters 1-3. Download it here: [data_ch1.zip](data/data_ch1.zip \"download\"). There are 2 csv files contained in a zip folder. 
One is the data file we are going to use today `prp_data_reduced.csv` and the other is an Excel file `prp_codebook` that explains the variables in the data.\n\nThe first step is to **unzip the zip folder** so that the files are placed within the same folder as your project.\n\n* Place the zip folder within your 2A_chapter1 folder\n* Right mouse click --> `Extract All...`\n* Check the folder location is the one to extract the files to\n* Check the extracted files are placed next to the project icon\n* Files and project should be visible in the Output pane in RStudio\n\n::: {.callout-note collapse=\"true\"}\n\n## For screenshots click here\n\n::: {layout-ncol=\"1\"}\n\n![](images/pic1.PNG){fig-align=\"center\"}\n\n![](images/pic23.PNG){fig-align=\"center\"}\n\n![](images/pic45.PNG){fig-align=\"center\"}\n\nUnzipping a zip folder\n\n:::\n:::\n\nThe paper by Pownall et al. was a **registered report** published in 2023, and the original data can be found on OSF ([https://osf.io/5qshg/](https://osf.io/5qshg/){target=\"_blank\"}).\n\n**Citation**\n\n> Pownall, M., Pennington, C. R., Norris, E., Juanchich, M., Smailes, D., Russell, S., Gooch, D., Evans, T. R., Persson, S., Mak, M. H. C., Tzavella, L., Monk, R., Gough, T., Benwell, C. S. Y., Elsherif, M., Farran, E., Gallagher-Mitchell, T., Kendrick, L. T., Bahnmueller, J., . . . Clark, K. (2023). Evaluating the Pedagogical Effectiveness of Study Preregistration in the Undergraduate Dissertation. *Advances in Methods and Practices in Psychological Science, 6*(4). [https://doi.org/10.1177/25152459231202724](https://doi.org/10.1177/25152459231202724){target=\"_blank\"}\n\n**Abstract**\n\n> Research shows that questionable research practices (QRPs) are present in undergraduate final-year dissertation projects. 
One entry-level Open Science practice proposed to mitigate QRPs is “study preregistration,” through which researchers outline their research questions, design, method, and analysis plans before data collection and/or analysis. In this study, we aimed to empirically test the effectiveness of preregistration as a pedagogic tool in undergraduate dissertations using a quasi-experimental design. A total of 89 UK psychology students were recruited, including students who preregistered their empirical quantitative dissertation (*n* = 52; experimental group) and students who did not (*n* = 37; control group). Attitudes toward statistics, acceptance of QRPs, and perceived understanding of Open Science were measured both before and after dissertation completion. Exploratory measures included capability, opportunity, and motivation to engage with preregistration, measured at Time 1 only. This study was conducted as a Registered Report; Stage 1 protocol: https://osf.io/9hjbw (date of in-principle acceptance: September 21, 2021). Study preregistration did not significantly affect attitudes toward statistics or acceptance of QRPs. However, students who preregistered reported greater perceived understanding of Open Science concepts from Time 1 to Time 2 compared with students who did not preregister. Exploratory analyses indicated that students who preregistered reported significantly greater capability, opportunity, and motivation to preregister. Qualitative responses revealed that preregistration was perceived to improve clarity and organization of the dissertation, prevent QRPs, and promote rigor. Disadvantages and barriers included time, perceived rigidity, and need for training. 
These results contribute to discussions surrounding embedding Open Science principles into research training.\n\n**Changes made to the dataset**\n\nWe made some changes to the dataset for the purpose of increasing difficulty for data wrangling (@sec-wrangling and @sec-wrangling2) and data visualisation (@sec-dataviz and @sec-dataviz2). This will ensure some \"teachable moments\". The changes are as follows:\n\n* We removed some of the variables to make the data more manageable for teaching purposes.\n* We recoded some values from numeric responses to labels (e.g., `understanding`).\n* We added the word \"years\" to one of the `Age` entries.\n* We tidied a messy column `Ethnicity` but introduced a similar but easier-to-solve \"messiness pattern\" when recoding the `understanding` data.\n* The scores in the original file were already corrected from reverse-coded responses. We reversed that process to present raw data here.\n\n\n\n\n## Activity 4: Installing packages, loading packages, and reading in data\n\n### Installing packages\n\nWhen you install R and RStudio for the first time (or after an update), most of the packages we will be using won’t be pre-installed. Before you can load new packages like `tidyverse`, you will need to install them.\n\nIf you try to load a package that has not been installed yet, you will receive an error message that looks something like this: `Error in library(tidyverse) : there is no package called 'tidyverse'`. \n\nTo fix this, simply install the package first. **In the console**, type the command `install.packages(\"tidyverse\")`. This **only needs to be done once after a fresh installation**. 
After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio.\n\n\nNote, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`.\n\n\n### Loading packages and reading in data\n\nThe first step is to load in the packages we need and read in the data. Today, we'll only be using `tidyverse`, and `read_csv()` will help us store the data from `prp_data_reduced.csv` in an object called `data_prp`.\n\nCopy the code into a code chunk in your `.Rmd` file and run it. You can either click the `green arrow` to run the entire code chunk, or use the shortcut `Ctrl + Enter` (Windows) or `Cmd + Enter` (Mac) to run a line of code/ pipe from the `.Rmd`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\nRows: 89 Columns: 91\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (17): Code, Age, Ethnicity, Opptional_mod_1_TEXT, Research_exp_1_TEXT, U...\ndbl (74): Gender, Secondyeargrade, Opptional_mod, Research_exp, Plan_prereg,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n\n\n\n## Activity 5: Familiarise yourself with the data 
{#sec-familiarise}\n\n* Look at the **Codebook** to get a feel for the variables in the dataset and how they have been measured. Note that some of the columns were deleted in the dataset you have been given.\n* You'll notice that some questionnaire data was collected at 2 different time points (i.e., SATS28, QRPs, Understanding_OS).\n* Some of the data was only collected at one time point (i.e., supervisor judgements, OS_behav items, and Included_prereg variables are t2-only variables).\n\n\n\n### First glimpse at the data\n\nBefore you start wrangling your data, it is important to understand what kind of data you're working with and what the format of your dataframe looks like.\n\nAs you may have noticed, `read_csv()` provides a **message** listing the data types in your dataset and how many columns are of each type. Plus, it shows a few example columns for each data type.\n\nTo obtain more detailed information about your data, you have several options. Click on the individual tabs to see the different options available. Test them out in your own `.Rmd` file and use whichever method you prefer (but do it).\n\n::: callout-warning\n\nSome of the output is a bit long because we do have quite a few variables in the data file.\n\n:::\n\n::: panel-tabset\n\n## visual inspection 1\n\nIn the `Global Environment`, click the blue arrow icon next to the object name `data_prp`. This action will expand the object, revealing details about its columns. The `$` symbol is commonly used in Base R to access a specific column within your dataframe.\n\n![Visual inspection of the data](images/data_prp.PNG)\n\nCon: When you have quite a few variables, not all of them are shown.\n\n## `glimpse()`\n\nUse `glimpse()` if you want a more detailed overview you can see on your screen. 
The output will display rows and column numbers, and some examples of the first couple of observations for each variable.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 91\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", …\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"2…\n$ Ethnicity \"White European\", \"White British…\n$ Secondyeargrade 2, 3, 1, 2, 2, 2, 2, 2, 1, 1, 1,…\n$ Opptional_mod 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,…\n$ Opptional_mod_1_TEXT \"Research methods in first year\"…\n$ Research_exp 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…\n$ Research_exp_1_TEXT NA, NA, NA, NA, NA, NA, NA, NA, …\n$ Plan_prereg 1, 3, 1, 2, 1, 1, 3, 3, 2, 2, 2,…\n$ SATS28_1_Affect_Time1 4, 5, 5, 6, 2, 1, 6, 3, 2, 5, 2,…\n$ SATS28_2_Affect_Time1 5, 6, 3, 3, 6, 1, 2, 2, 7, 3, 4,…\n$ SATS28_3_Affect_Time1 3, 2, 5, 2, 6, 7, 2, 6, 6, 5, 2,…\n$ SATS28_4_Affect_Time1 4, 5, 2, 2, 6, 6, 5, 5, 5, 5, 2,…\n$ SATS28_5_Affect_Time1 5, 5, 5, 6, 1, 1, 5, 1, 2, 5, 2,…\n$ SATS28_6_Affect_Time1 5, 6, 2, 5, 6, 7, 4, 5, 5, 3, 5,…\n$ SATS28_7_CognitiveCompetence_Time1 4, 2, 2, 5, 6, 7, 2, 5, 5, 2, 2,…\n$ SATS28_8_CognitiveCompetence_Time1 2, 2, 2, 1, 6, 7, 2, 5, 3, 2, 3,…\n$ SATS28_9_CognitiveCompetence_Time1 2, 2, 2, 3, 3, 7, 2, 6, 3, 3, 1,…\n$ SATS28_10_CognitiveCompetence_Time1 6, 7, 6, 6, 4, 2, 6, 4, 5, 6, 5,…\n$ SATS28_11_CognitiveCompetence_Time1 4, 3, 5, 5, 3, 1, 6, 2, 5, 6, 5,…\n$ SATS28_12_CognitiveCompetence_Time1 3, 5, 3, 5, 5, 7, 3, 4, 7, 2, 3,…\n$ SATS28_13_Value_Time1 1, 1, 2, 1, 3, 7, 1, 2, 1, 2, 4,…\n$ SATS28_14_Value_Time1 7, 7, 6, 6, 5, 1, 6, 5, 7, 6, 2,…\n$ SATS28_15_Value_Time1 7, 7, 6, 6, 3, 5, 6, 6, 6, 5, 5,…\n$ SATS28_16_Value_Time1 2, 1, 3, 2, 6, 5, 3, 7, 2, 2, 2,…\n$ SATS28_17_Value_Time1 1, 1, 3, 3, 7, 7, 2, 7, 2, 2, 5,…\n$ SATS28_18_Value_Time1 3, 6, 5, 3, 1, 1, 5, 1, 5, 2, 2,…\n$ SATS28_19_Value_Time1 3, 3, 3, 3, 7, 7, 4, 5, 3, 5, 
6,…\n$ SATS28_20_Value_Time1 2, 1, 4, 2, 7, 7, 2, 4, 2, 2, 7,…\n$ SATS28_21_Value_Time1 2, 1, 3, 2, 6, 7, 2, 5, 1, 3, 5,…\n$ SATS28_22_Difficulty_Time1 3, 2, 5, 3, 2, 1, 4, 2, 2, 5, 3,…\n$ SATS28_23_Difficulty_Time1 5, 6, 5, 6, 6, 7, 4, 6, 7, 5, 6,…\n$ SATS28_24_Difficulty_Time1 2, 2, 2, 3, 1, 4, 4, 2, 2, 2, 2,…\n$ SATS28_25_Difficulty_Time1 6, 7, 5, 5, 6, 7, 5, 6, 5, 5, 5,…\n$ SATS28_26_Difficulty_Time1 4, 2, 2, 2, 6, 7, 4, 5, 3, 5, 3,…\n$ SATS28_27_Difficulty_Time1 4, 5, 5, 3, 6, 7, 4, 3, 5, 3, 6,…\n$ SATS28_28_Difficulty_Time1 1, 7, 5, 5, 6, 6, 5, 4, 4, 4, 2,…\n$ QRPs_1_Time1 7, 7, 7, 7, 7, 7, 6, 2, 7, 6, 7,…\n$ QRPs_2_Time1 7, 7, 7, 7, 7, 7, 6, 7, 7, 7, 5,…\n$ QRPs_3_Time1 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3,…\n$ QRPs_4_Time1 7, 7, 6, 6, 7, 4, 6, 7, 7, 7, 6,…\n$ QRPs_5_Time1 3, 3, 7, 7, 2, 7, 4, 6, 7, 3, 2,…\n$ QRPs_6_Time1 4, 7, 6, 5, 7, 4, 4, 5, 7, 6, 5,…\n$ QRPs_7_Time1 5, 7, 7, 7, 7, 4, 5, 6, 7, 7, 5,…\n$ QRPs_8_Time1 7, 7, 7, 7, 7, 7, 7, 7, 7, 2, 7,…\n$ QRPs_9_Time1 6, 7, 7, 4, 7, 7, 3, 7, 6, 6, 2,…\n$ QRPs_10_Time1 7, 6, 5, 2, 5, 4, 2, 6, 7, 7, 2,…\n$ QRPs_11_Time1 7, 7, 7, 4, 7, 7, 4, 6, 7, 7, 5,…\n$ QRPs_12NotQRP_Time1 2, 2, 1, 4, 1, 4, 2, 4, 2, 2, 1,…\n$ QRPs_13NotQRP_Time1 1, 1, 1, 1, 1, 4, 2, 4, 1, 1, 1,…\n$ QRPs_14NotQRP_Time1 1, 4, 3, 4, 1, 4, 2, 3, 3, 4, 3,…\n$ QRPs_15NotQRP_Time1 2, 4, 2, 2, 1, 4, 2, 1, 4, 4, 2,…\n$ Understanding_OS_1_Time1 \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at…\n$ Understanding_OS_2_Time1 \"2\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_3_Time1 \"2\", \"Not at all confident\", \"3\"…\n$ Understanding_OS_4_Time1 \"6\", \"Not at all confident\", \"6\"…\n$ Understanding_OS_5_Time1 \"Entirely confident\", \"6\", \"6\", …\n$ Understanding_OS_6_Time1 \"Entirely confident\", \"Entirely …\n$ Understanding_OS_7_Time1 \"6\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_8_Time1 \"6\", \"3\", \"5\", \"3\", \"5\", \"Not at…\n$ Understanding_OS_9_Time1 \"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_10_Time1 
\"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_11_Time1 \"Entirely confident\", \"2\", \"4\", …\n$ Understanding_OS_12_Time1 \"Entirely confident\", \"2\", \"5\", …\n$ Pre_reg_group 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2,…\n$ Other_OS_behav_2 1, NA, NA, NA, 1, NA, NA, 1, NA,…\n$ Other_OS_behav_4 1, NA, NA, NA, NA, NA, NA, NA, N…\n$ Other_OS_behav_5 NA, NA, NA, NA, 1, 1, NA, NA, NA…\n$ Closely_follow 2, 2, 2, NA, 3, 3, 3, NA, NA, 2,…\n$ SATS28_Affect_Time2_mean 3.500000, 3.166667, 4.833333, 4.…\n$ SATS28_CognitiveCompetence_Time2_mean 4.166667, 4.666667, 6.166667, 5.…\n$ SATS28_Value_Time2_mean 3.000000, 6.222222, 6.000000, 4.…\n$ SATS28_Difficulty_Time2_mean 2.857143, 2.857143, 4.000000, 2.…\n$ QRPs_Acceptance_Time2_mean 5.636364, 5.454545, 6.272727, 5.…\n$ Time2_Understanding_OS 5.583333, 3.333333, 5.416667, 4.…\n$ Supervisor_1 5, 7, 7, 1, 7, 1, 7, 6, 7, 5, 6,…\n$ Supervisor_2 5, 6, 7, 4, 6, 2, 7, 5, 6, 5, 5,…\n$ Supervisor_3 6, 7, 7, 1, 7, 1, 7, 5, 6, 6, 7,…\n$ Supervisor_4 6, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_5 5, 7, 7, 4, 7, 3, 7, 7, 6, 6, 6,…\n$ Supervisor_6 5, 7, 7, 4, 6, 3, 7, 6, 7, 6, 6,…\n$ Supervisor_7 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ Supervisor_8 5, 5, 7, 1, 7, 1, 7, 5, 7, 5, 6,…\n$ Supervisor_9 6, 7, 7, 4, 7, 3, 7, 5, 7, 6, 7,…\n$ Supervisor_10 5, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_11 NA, 7, 7, NA, 7, 1, 7, 5, 7, 6, …\n$ Supervisor_12 4, 5, 7, 1, 4, 1, 7, 3, 6, 6, 5,…\n$ Supervisor_13 4, 2, 5, 1, 2, 1, 6, 3, 5, 6, 5,…\n$ Supervisor_14 5, 7, 7, 1, 7, 1, 7, 5, 7, 6, 6,…\n$ Supervisor_15_R 1, 1, 1, 4, 1, 7, 1, 2, 1, 2, 1,…\n```\n:::\n:::\n\n\n\n## `spec()`\n\nYou can also use `spec()` as suggested in the message above and then it shows you a list of the data type in every single column. 
But it doesn't show you the number of rows and columns.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nspec(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ncols(\n Code = col_character(),\n Gender = col_double(),\n Age = col_character(),\n Ethnicity = col_character(),\n Secondyeargrade = col_double(),\n Opptional_mod = col_double(),\n Opptional_mod_1_TEXT = col_character(),\n Research_exp = col_double(),\n Research_exp_1_TEXT = col_character(),\n Plan_prereg = col_double(),\n SATS28_1_Affect_Time1 = col_double(),\n SATS28_2_Affect_Time1 = col_double(),\n SATS28_3_Affect_Time1 = col_double(),\n SATS28_4_Affect_Time1 = col_double(),\n SATS28_5_Affect_Time1 = col_double(),\n SATS28_6_Affect_Time1 = col_double(),\n SATS28_7_CognitiveCompetence_Time1 = col_double(),\n SATS28_8_CognitiveCompetence_Time1 = col_double(),\n SATS28_9_CognitiveCompetence_Time1 = col_double(),\n SATS28_10_CognitiveCompetence_Time1 = col_double(),\n SATS28_11_CognitiveCompetence_Time1 = col_double(),\n SATS28_12_CognitiveCompetence_Time1 = col_double(),\n SATS28_13_Value_Time1 = col_double(),\n SATS28_14_Value_Time1 = col_double(),\n SATS28_15_Value_Time1 = col_double(),\n SATS28_16_Value_Time1 = col_double(),\n SATS28_17_Value_Time1 = col_double(),\n SATS28_18_Value_Time1 = col_double(),\n SATS28_19_Value_Time1 = col_double(),\n SATS28_20_Value_Time1 = col_double(),\n SATS28_21_Value_Time1 = col_double(),\n SATS28_22_Difficulty_Time1 = col_double(),\n SATS28_23_Difficulty_Time1 = col_double(),\n SATS28_24_Difficulty_Time1 = col_double(),\n SATS28_25_Difficulty_Time1 = col_double(),\n SATS28_26_Difficulty_Time1 = col_double(),\n SATS28_27_Difficulty_Time1 = col_double(),\n SATS28_28_Difficulty_Time1 = col_double(),\n QRPs_1_Time1 = col_double(),\n QRPs_2_Time1 = col_double(),\n QRPs_3_Time1 = col_double(),\n QRPs_4_Time1 = col_double(),\n QRPs_5_Time1 = col_double(),\n QRPs_6_Time1 = col_double(),\n QRPs_7_Time1 = col_double(),\n QRPs_8_Time1 = col_double(),\n 
QRPs_9_Time1 = col_double(),\n QRPs_10_Time1 = col_double(),\n QRPs_11_Time1 = col_double(),\n QRPs_12NotQRP_Time1 = col_double(),\n QRPs_13NotQRP_Time1 = col_double(),\n QRPs_14NotQRP_Time1 = col_double(),\n QRPs_15NotQRP_Time1 = col_double(),\n Understanding_OS_1_Time1 = col_character(),\n Understanding_OS_2_Time1 = col_character(),\n Understanding_OS_3_Time1 = col_character(),\n Understanding_OS_4_Time1 = col_character(),\n Understanding_OS_5_Time1 = col_character(),\n Understanding_OS_6_Time1 = col_character(),\n Understanding_OS_7_Time1 = col_character(),\n Understanding_OS_8_Time1 = col_character(),\n Understanding_OS_9_Time1 = col_character(),\n Understanding_OS_10_Time1 = col_character(),\n Understanding_OS_11_Time1 = col_character(),\n Understanding_OS_12_Time1 = col_character(),\n Pre_reg_group = col_double(),\n Other_OS_behav_2 = col_double(),\n Other_OS_behav_4 = col_double(),\n Other_OS_behav_5 = col_double(),\n Closely_follow = col_double(),\n SATS28_Affect_Time2_mean = col_double(),\n SATS28_CognitiveCompetence_Time2_mean = col_double(),\n SATS28_Value_Time2_mean = col_double(),\n SATS28_Difficulty_Time2_mean = col_double(),\n QRPs_Acceptance_Time2_mean = col_double(),\n Time2_Understanding_OS = col_double(),\n Supervisor_1 = col_double(),\n Supervisor_2 = col_double(),\n Supervisor_3 = col_double(),\n Supervisor_4 = col_double(),\n Supervisor_5 = col_double(),\n Supervisor_6 = col_double(),\n Supervisor_7 = col_double(),\n Supervisor_8 = col_double(),\n Supervisor_9 = col_double(),\n Supervisor_10 = col_double(),\n Supervisor_11 = col_double(),\n Supervisor_12 = col_double(),\n Supervisor_13 = col_double(),\n Supervisor_14 = col_double(),\n Supervisor_15_R = col_double()\n)\n```\n:::\n:::\n\n\n\n## visual inspection 2\n\nIn the `Global Environment`, click on the object name `data_prp`. This action will open the data in a new tab. Hovering over the column headings with your mouse will also reveal their data type. 
However, it seems to be a fairly tedious process when you have loads of columns.\n\n::: {.callout-important collapse=\"true\"}\n\n## Hang on, where is the rest of my data? Why do I only see 50 columns?\n\nOne common source of confusion is not seeing all your columns when you open up a data object as a tab. This is because RStudio shows you a maximum of 50 columns at a time. If you have more than 50 columns, navigate with the arrows to see the remaining columns.\n\n![Showing 50 columns at a time](images/50_col.PNG)\n\n:::\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nNow that you have tested out all the options in your own `.Rmd` file, you can probably answer the following questions:\n\n* How many observations? \n* How many variables? \n* How many columns are `col_character` or `chr` data type? \n* How many columns are `col_double` or `dbl` data type? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe visual inspections show you the **number of observations and variables**. `glimpse()` also gives you that information but calls them **rows and columns** respectively.\n\nThe **data type information** actually comes from the output when using the `read_csv()` function. Did you notice the information on **Column specification** (see screenshot below)?\n\n![message from `read_csv()` when reading in the data](images/col_spec.PNG)\n\nWhilst `spec()` is quite useful for data type information per individual column, it doesn't give you the total count of each data type. So it doesn't really help with answering the questions here - unless you want to count manually from its extremely long output.\n\n:::\n\nIn your `.Rmd`, include a **new heading level 2** called \"Information about the data\" (or something equally meaningful) and jot down some notes about `data_prp`. 
You could include the citation and/or the abstract, and whatever information you think you should note about this dataset (e.g., any observations from looking at the codebook?). You could also include some notes on the functions used so far and what they do. Try to incorporate some **bold**, *italic* or ***bold and italic*** emphasis and perhaps a bullet point or two.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Possible solution\n\n\\#\\# Information about the data\n\nThe data is from Pownall et al. (2023), and I can find the paper here: https://doi.org/10.1177/25152459231202724.\n\nI've noticed in the prp codebook that the SATS-28 questionnaire has quite a few \\*reverse-coded items\\*, and the supervisor support questionnaire also has a reverse-coded item.\n\nSo far, I think I prefer \\*\\*glimpse()\\*\\* to show me some more detail about the data. Specs() is too text-heavy for me which makes it hard to read.\n\nThings to keep in mind:\n\n* \\*\\*don't forget to load in tidyverse first!!!\\*\\*\n* always read in the data with \\*\\*read_csv\\*\\*, \\*\\*\\*never ever use read.csv\\*\\*\\*!!!\n\n![The output rendered in a knitted html file](images/knitted_markdown.PNG)\n\n:::\n\n:::\n\n### Data types {#sec-datatypes}\n\nEach variable has a **data type**, such as numeric (numbers), character (text), and logical (TRUE/FALSE values), or a special class of factor. As you have just seen, our `data_prp` only has character and numeric columns (so far).\n\n**Numeric data** can be double (`dbl`) or integer (`int`). Doubles can have decimal places (e.g., 1.1). Integers are whole numbers (e.g., 1, 2, -1) and are displayed with the suffix L (e.g., 1L). This is not overly important but might leave you less puzzled the next time you see an L after a number.\n\n**Characters** (also called “strings”) are anything written between quotation marks. 
This is usually text, but in special circumstances, a number can be a character if it is placed within quotation marks. This can happen when you are recoding variables. It might not be too obvious at the time, but you won't be able to calculate anything if the number is a character.\n\n::: panel-tabset\n\n## Example data types\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntypeof(1)\ntypeof(1L)\ntypeof(\"1\")\ntypeof(\"text\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n[1] \"integer\"\n[1] \"character\"\n[1] \"character\"\n```\n:::\n:::\n\n\n## numeric computation\n\nNo problems here...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n1+1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n## character computation\n\nWhen the data type is incorrect, you won't be able to compute anything, despite your numbers being shown as numeric values in the dataframe. The error message tells you exactly what's wrong with it, i.e., that you have `non-numeric arguments`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n\"1\"+\"1\" # ERROR\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"1\" + \"1\": non-numeric argument to binary operator\n```\n:::\n:::\n\n\n:::\n\n**Logical** data (also sometimes called “Boolean” values) are one of two values: TRUE or FALSE (written in uppercase). They become really important when we use `filter()` or `mutate()` with conditional statements such as `case_when()`. 
More about those in @sec-wrangling2.\n\n\nSome commonly used logical operators:\n\n| operator | description |\n|:---------|:-----------------------------------------------|\n| \\> | greater than |\n| \\>= | greater than or equal to |\n| \\< | less than |\n| \\<= | less than or equal to |\n| == | equal to |\n| != | not equal to |\n| %in% | TRUE if any element is in the following vector |\n\n\nA **factor** is a specific type of integer or character that lets you assign the order of the categories. This becomes useful when you want to display certain categories in \"the correct order\" either in a dataframe (see *arrange*) or when plotting (see @sec-dataviz/ @sec-dataviz2).\n\n\n\n### Variable types\n\nYou've already encountered them in [Level 1](https://psyteachr.github.io/data-skills-v2/intro-to-probability.html){target=\"_blank\"} but let's refresh. Variables can be classified as **continuous** (numbers) or **categorical** (labels).\n\n**Categorical** variables are properties you can count. They can be **nominal**, where the categories don't have an order (e.g., gender) or **ordinal** (e.g., Likert scales either with numeric values 1-7 or with character labels such as \"agree\", \"neither agree nor disagree\", \"disagree\"). Categorical data may also be **factors** rather than characters.\n\n**Continuous variables** are properties you can measure and calculate sums/ means/ etc. They may be rounded to the nearest whole number, but it should make sense to have a value between them. Continuous variables always have a **numeric** data type (i.e. 
`integer` or `double`).\n\n::: callout-tip\n\n## Why is this important you may ask?\n\nKnowing your variable and data types will help later on when deciding on an appropriate plot (see @sec-dataviz and @sec-dataviz2) or which inferential test to run (@sec-nhstI to @sec-factorial).\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAs we've seen earlier, `data_prp` only had character and numeric variables which hardly tests your understanding to see if you can identify a variety of data types and variable types. So, for this little quiz, we've spiced it up a bit. We've selected a few columns, shortened some of the column names, and modified some of the data types. Here you can see the first few rows of the new object `data_quiz`. *You can find the code with explanations at the end of this section.*\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Code |Age |Gender |Ethnicity |Secondyeargrade | QRP_item| QRPs_mean|Understanding_item |QRP_item > 4 |\n|:----|:---|:------|:--------------|:-----------------------|--------:|---------:|:------------------|:------------|\n|Tr10 |22 |2 |White European |60-69% (2:1 grade) | 5| 5.636364|2 |TRUE |\n|Bi07 |20 |2 |White British |50-59% (2:2 grade) | 2| 5.454546|2 |FALSE |\n|SK03 |22 |2 |White British |≥ 70% (1st class grade) | 6| 6.272727|6 |TRUE |\n|SM95 |26 |2 |White British |60-69% (2:1 grade) | 2| 5.000000|2 |FALSE |\n|St01 |22 |2 |White British |60-69% (2:1 grade) | 6| 5.545454|6 |TRUE |\n\n
\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_quiz)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 9\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", \"St01\", \"St10\", \"Wa…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"20\", \"21\", \"21\", \"22…\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ Ethnicity \"White European\", \"White British\", \"White British\",…\n$ Secondyeargrade 60-69% (2:1 grade), 50-59% (2:2 grade), ≥ 70% (1st …\n$ QRP_item 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3, 4, 4, 4, 4, 6, 3, …\n$ QRPs_mean 5.636364, 5.454545, 6.272727, 5.000000, 5.545455, 6…\n$ Understanding_item \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at all confident\", \"4…\n$ `QRP_item > 4` TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,…\n```\n:::\n:::\n\n\n\n\nSelect the variable type and data type for each of the columns from the dropdown menus.\n\n\n\n\n\n| Column | Variable type | Data type |\n|:---------------------|:--------------|:--------------|\n| `Age` | | |\n| `Gender` | | |\n| `Ethnicity` | | |\n| `Secondyeargrade` | | |\n| `QRP_item` | | |\n| `QRPs_mean` | | |\n| `Understanding_item` | | |\n| `QRP_item > 4` | | |\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Revealing the mystery code that created `data_quiz`\n\nThe code might look a bit complex for the minute despite the line-by-line explanations below. 
Come back to it after completing chapter 2.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_quiz <- data_prp %>% \n select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n mutate(Gender = factor(Gender),\n Secondyeargrade = factor(Secondyeargrade,\n levels = c(1, 2, 3, 4, 5),\n labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n `QRP_item > 4` = case_when(\n QRP_item > 4 ~ TRUE, \n .default = FALSE))\n```\n:::\n\n\nLet's go through this line by line:\n\n* **line 1**: creates a new object called `data_quiz` and it is based on the already existing data object `data_prp`\n* **line 2**: we are selecting a few variables of interest, such as Code, Age etc. Some of those variables were renamed in the process according to the structure `new_name = old_name`, for example QRP item 3 at time point 1 got renamed as `QRP_item`.\\\n* **line 3**: The function `mutate()` is used to create a new column called `Gender` that turns the existing column `Gender` from a numeric value into a factor. R simply overwrites the existing column of the same name. If we had named the new column `Gender_factor`, we would have been able to retain the original `Gender` column and `Gender_factor` would have been added as the last column.\n* **lines 4-6**: See how the line starts with an indent which indicates we are still within the `mutate()` function. You can also see this by counting brackets - in line 3 there are 2 opening brackets but only 1 closes.\n * Similar to `Gender`, we are replacing the \"old\" `Secondyeargrade` with the new `Secondyeargrade` column that is now a factor.\n * Turning our variable `Secondyeargrade` into a factor, spot the difference between this attempt and the one we used for `Gender`? 
Here we are using a lot more arguments in that factor function, namely levels and labels. **Levels** describes the unique values we have for that column, and in **labels** we want to define how these levels will be shown in the data object. If you don't add the levels and labels arguments, the unique values themselves will be used as the labels (as you can see in the `Gender` column in which we kept the numbers).\n* **line 7**: Doesn't start with a function name and has an indent, which means we are *still* within the `mutate()` function - count the opening and closing brackets to confirm.\n * Here, we are creating a new column called `QRP_item > 4`. Notice the two backticks we have to use to make this weird column name work? This is because it has spaces (and we did mention that R doesn't like spaces). So the backticks help R to group it as a unit/ a single name.\n * Next we have a `case_when()` function which helps execute conditional statements. We are using it to check whether a statement is TRUE or FALSE. Here, we ask whether the QRP item (column `QRP_item`) is larger than 4 (midpoint of the scale) using the Boolean operator `>`. If the statement is `TRUE`, the label `TRUE` should appear in column `QRP_item > 4`. Otherwise, if the value is equal to 4 or smaller, the label should read `FALSE`. We will come back to conditional statements in @sec-wrangling. But long story short, this Boolean expression created the only logical data type in `data_quiz`.\n:::\n\nAnd with this, we are done with the individual walkthrough. Well done :)\n\n\n\n\n\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nThe data we will be using in the upcoming lab activities comes from a randomised controlled trial by Binfet et al. (2021) that was conducted in Canada.\n\n**Citation**\n\n> Binfet, J. T., Green, F. L. L., & Draper, Z. A. (2021). The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. 
*Anthrozoös, 35*(1), 1–22. [https://doi.org/10.1080/08927936.2021.1944558](https://doi.org/10.1080/08927936.2021.1944558){target=\"_blank\"}\n\n**Abstract**\n\n> Researchers have claimed that canine-assisted interventions (CAIs) contribute significantly to bolstering participants' wellbeing, yet the mechanisms within interactions have received little empirical attention. The aim of this study was to assess the impact of client–canine contact on wellbeing outcomes in a sample of 284 undergraduate college students (77% female; 21% male, 2% non-binary). Participants self-selected to participate and were randomly assigned to one of two canine interaction treatment conditions (touch or no touch) or to a handler-only condition with no therapy dog present. To assess self-reports of wellbeing, measures of flourishing, positive and negative affect, social connectedness, happiness, integration into the campus community, stress, homesickness, and loneliness were administered. Exploratory analyses were conducted to assess whether these wellbeing measures could be considered as measuring a unidimensional construct. This included both reliability analysis and exploratory factor analysis. Based on the results of these analyses we created a composite measure using participant scores on a latent factor. We then conducted the tests of the four hypotheses using these factor scores. Results indicate that participants across all conditions experienced enhanced wellbeing on several measures; however, only those in the direct contact condition reported significant improvements on all measures of wellbeing. Additionally, direct interactions with therapy dogs through touch elicited greater wellbeing benefits than did no touch/indirect interactions or interactions with only a dog handler. 
Similarly, analyses using scores on the wellbeing factor indicated significant improvement in wellbeing across all conditions (handler-only, *d* = 0.18, *p* = 0.041; indirect, *d* = 0.38, *p* \\< 0.001; direct, *d* = 0.78, *p* \\< 0.001), with more benefit when a dog was present (*d* = 0.20, *p* \\< 0.001), and the most benefit coming from physical contact with the dog (*d* = 0.13, *p* = 0.002). The findings hold implications for post-secondary wellbeing programs as well as the organization and delivery of CAIs.\n\n\nHowever, we accessed the data via Ciaran Evans' github ([https://github.com/ciaran-evans/dog-data-analysis](https://github.com/ciaran-evans/dog-data-analysis){target=\"_blank\"}). Evans et al. (2023) published a paper that reused the Binfet data for teaching statistics and research methods. If anyone is interested, the accompanying paper is:\n\n> Evans, C., Cipolli, W., Draper, Z. A., & Binfet, J. T. (2023). Repurposing a Peer-Reviewed Publication to Engage Students in Statistics: An Illustration of Study Design, Data Collection, and Analysis. *Journal of Statistics and Data Science Education, 31*(3), 236–247. [https://doi.org/10.1080/26939169.2023.2238018](https://doi.org/10.1080/26939169.2023.2238018){target=\"_blank\"}\n\n**There are a few changes that Evans and we made to the data:**\n\n* Evans removed the demographics ethnicity and gender to make the study data available while protecting participant privacy. 
Which means we'll have limited demographic variables available, but we will make do with what we've got.\n* We modified some of the responses in the raw data csv - for example, we took out impossible response values and replaced them with `NA`.\n* We replaced some of the numbers with labels to increase the difficulty in the dataset for @sec-wrangling and @sec-wrangling2.\n\n\n\n### Task 1: Create a project folder for the lab activities {.unnumbered}\n\nSince we will be working with the same data throughout semester 1, create a separate project for the lab data. Name it something useful, like `lab_data` or `dogs_in_the_lab`. Make sure you are not placing it within the project you have already created today. If you need guidance, see @sec-project above.\n\n\n\n### Task 2: Create a new `.Rmd` file {.unnumbered}\n\n... and name it something useful. If you need help, have a look at @sec-rmd.\n\n\n\n### Task 3: Download the data {.unnumbered}\n\nDownload the data here: [data_pair_ch1](data/data_pair_ch1.zip \"download\"). The zip folder contains the raw data file with responses to individual questions, a cleaned version of the same data in long format and wide format, and the codebook describing the variables in the raw data file and the long format.\n\n**Unzip the folder and place the data files in the same folder as your project.**\n\n\n\n### Task 4: Familiarise yourself with the data {.unnumbered}\n\nOpen the data files, look at the codebook, and perhaps skim over the original Binfet article (methods in particular) to see what kind of measures they used.\n\nRead in the raw data file as `dog_data_raw` and the cleaned-up data (long format) as `dog_data_long`. 
See if you can answer the following questions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\ndog_data_long <- read_csv(\"dog_data_clean_long.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\nRows: 284 Columns: 136\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (41): GroupAssignment, L2_1, L2_2, L2_3, L2_4, L2_5, L2_6, L2_7, L2_8, L...\ndbl (95): RID, Age_Yrs, Year_of_Study, Live_Pets, Consumer_BARK, S1_1, HO1_1...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\nRows: 568 Columns: 16\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): GroupAssignment, Year_of_Study, Live_Pets, Stage\ndbl (12): RID, Age_Yrs, Consumer_BARK, Flourishing, PANAS_PA, PANAS_NA, SHS,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n* How many participants took part in the study? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nYou can see this from `dog_data_raw`. Each participant ID is on a single row meaning the number of observations is the number of participants.\n\nIf you look at `dog_data_long`, there are 568 observations. Each participant answered the questionnaires pre and post intervention, resulting in 2 rows per participant ID. This means you'd have to divide the number of observations by 2 to get to the number of participants.\n\n:::\n\n* How many different questionnaires did the participants answer? 
\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nThe Binfet paper (e.g., Methods section and/or abstract) and the codebook show it's 9 questionnaires - Flourishing scale (variable `Flourishing`), the UCLS Loneliness scale Version 3 (`Loneliness`), Positive and Negative affect scale (`PANAS_PA` and `PANAS_NA`), the Subjective Happiness scale (`SHS`), the Social connectedness scale (`SCS`), and 3 scales with 1 question each, i.e., perception of stress levels (`Stress`), self-reported level of homesickness (`Homesick`), and integration into the campus community (`Engagement`).\n\nHowever, if you thought `PANAS_PA` and `PANAS_NA` are a single questionnaire, 8 was also acceptable as an answer here.\n\n:::\n\n\n\n\n## [Test your knowledge]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nAre you ready for some knowledge check questions to test your understanding of the chapter? We also have some faulty codes. See if you can spot what's wrong with them.\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nOne of the key first steps when we open RStudio is to:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nOpening an existing project (e.g., when coming back to the same dataset) or creating a new project (e.g., for a new task or new dataset) ensures that subsequent `.Rmd` files, any output, figures, etc are saved within the same folder on your computer (i.e., the working directory). If the`.Rmd` files or data is not in the same folder as \"the project icon\", things can get messy and code might not run.\n\n:::\n\n\n#### Question 2 {.unnumbered}\n\nWhen using the default environment colour settings for RStudio, what colour would the background of a code chunk be in R Markdown? \n\nWhen using the default environment colour settings for RStudio, what colour would the background of normal text be in R Markdown? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nAssuming you have not changed any of the settings in RStudio, code chunks will tend to have a grey background and normal text will tend to have a white background. This is a good way to check that you have closed and opened code chunks correctly.\n\n:::\n\n\n\n#### Question 3 {.unnumbered}\n\nCode chunks start and end with:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCode chunks always take the same general format of three backticks followed by curly parentheses and a lower case r inside the parentheses (`{r}`). People often mistake these backticks for single quotes but that will not work. If you have set your code chunk correctly using backticks, the background colour should change to grey from white.\n\n:::\n\n\n\n#### Question 4 {.unnumbered}\n\nWhat is the correct way to include a code chunk in RMarkdown that will be executed but neither the code nor its output will be shown in the final HTML document? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCheck the table of knitr display options in @sec-chunks.\n\n* {r, echo=FALSE} also executes the code and does not show the code, but it *does* display the result in the knitted html file. (matches 2/3 criteria)\n* {r, eval=FALSE} does not show the results but does *not* execute the code and it *does* show it in the knitted file. (matches 1/3 criteria)\n* {r, results=“hide”} executes the code and does not show results, however, it *does* include the code in the knitted html document. (matches 2/3 criteria)\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of these codes have mistakes in them, other code chunks are not quite producing what was aimed for. Your task is to spot anything faulty, explain why the things happened, and perhaps try to fix them.\n\n\n\n#### Question 5 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. You have just stated R, created a new `.Rmd` file, and typed the following code into your code chunk.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\n\nHowever, R gives you an error message: `could not find function \"read_csv\"`. 
What could be the reason?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\n\"Could not find function\" is an indication that you have forgotten to load in tidyverse. Because `read_csv()` is a function in the tidyverse collection, R cannot find it.\n\nFIX: Add `library(tidyverse)` prior to reading in the data and run the code chunk again.\n\n:::\n\n\n\n#### Question 6 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. This time, you are certain you have loaded in tidyverse first. The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\nThe error message shows `'data.csv' does not exist in current working directory`. You check your folder and it looks like this:\n\n![](images/error_ch1_01.PNG)\n\nWhy is there an error message?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nR is looking for a csv file that is called data which is currently not in the working directory. We may assume it's in the data folder. Perhaps that happened when unzipping the zip file. So instead of placing the csv file on the same level as the project icon, it was unzipped into a folder named data.\n\nFIX - option 1: Take the `data.csv` out of the data folder and place it next to the project icon and the `.Rmd` file.\n\nFIX - option 2: Modify your R code to tell R that the data is in a separate folder called data, e.g., ...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data/data.csv\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\n\nYou want to load `tidyverse` into the library. 
The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\nThe error message says: `Error in library(tidyverse) : there is no package called ‘tidyverse’`\n\nWhy is there an error message and how can we fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nIf R says there is no package called `tidyverse`, means you haven't installed the package yet. This could be an error message you receive either after switching computers or a fresh install of R and RStudio.\n\nFIX: Type `install.packages(\"tidyverse\")` into your **Console**.\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nYou knitted your `.Rmd` into a html but the output is not as expected. You see the following:\n\n![](images/error_knitted.PNG)\n\nWhy did the file not knit properly?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThere is a backtick missing in the code chunk. If you check your `.Rmd` file, you can see that the code chunk does not show up in grey which means it's one of the 3 backticks at the beginning of the chunk.\n\n![](images/error_ch1_08.PNG)\n\nFIX: Add a single backtick manually where it's missing.\n\n:::\n", + "markdown": "# Projects and R Markdown {#sec-basics}\n\n## Intended Learning Outcomes {.unnumbered}\n\nBy the end of this chapter, you should be able to:\n\n- Re-familiarise yourself with setting up projects\n- Re-familiarise yourself with RMarkdown documents\n- Recap and apply data wrangling procedures to analyse data\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n## R and R Studio\n\nRemember, R is a programming language that you will write code in and RStudio is an Integrated Development Environment (IDE) which makes working with R easier as it's more user friendly. 
You need both components for this course.\n\nIf this is not ringing any bells yet, have a quick browse through the [materials from year 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#sec-intro-r){target=\"_blank\"} to refresh your memory.\n\n\n### R server\n\nUse the server *only* if you are unable to install R and RStudio on your computer (e.g., if you are using a Chromebook) or if you encounter issues while installing R on your own machine. Otherwise, you should install R and RStudio directly on your own computer. R and RStudio are already installed on the *R server*.\n\nYou will find the link to the server on Moodle.\n\n\n### Installing R and RStudio on your computer\n\nThe [RSetGo book](https://psyteachr.github.io/RSetGo/){target=\"_blank\"} provides detailed instructions on how to install R and RStudio on your computer. It also includes links to walkthroughs for installing R on different types of computers and operating systems.\n\nIf you had R and RStudio installed on your computer last year, we recommend updating to the latest versions. In fact, it’s a good practice to update them at the start of each academic year. Detailed guidance can be found in @sec-updating-r.\n\nOnce you have installed or updated R and RStudio, return to this chapter.\n\n\n### Settings for Reproducibility\n\nBy now, you should be aware that the Psychology department at the University of Glasgow places a strong emphasis on reproducibility, open science, and raising awareness about questionable research practices (QRPs) and how to avoid them. Therefore, it's important that you work in a reproducible manner so that others (and your future self) can understand and check your work. This also makes it easier for you to reuse your work in the future.\n\nAlways start with a clear workspace. 
If your `Global Environment` contains anything from a previous session, you can’t be certain whether your current code is working as intended or if it’s using objects created earlier.\n\nTo ensure a clean and reproducible workflow, there are a few settings you should adjust immediately after installing or updating RStudio. In Tools \\> Global Options... General tab\n\n* Uncheck the box labelled Restore .RData into workspace at startup to make sure no data from a previous session is loaded into the environment\n* Set Save workspace to .RData on exit to **Never** to prevent your workspace from being saved when you exit RStudio.\n\n![Reproducibility settings in Global Options](images/rstudio_settings_reproducibility.png)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for keeping tabs on parentheses\n\nRStudio includes **rainbow parentheses** to help you keep count of the brackets.\n\nTo enable the feature, go to Tools \\> Global Options... Code tab \\> Display tab and tick the last checkbox \"Use rainbow parentheses\"\n\n![Enable rainbow parentheses](images/rainbow.PNG)\n\n:::\n\n### RStudio panes\n\nRStudio has four main panes each in a quadrant of your screen:\n\n* Source pane\n* Environment pane\n* Console pane\n* Output pane\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAre you ready for a quick quiz to see what you remember about the RStudio panes from last year? Click on **Quiz** to see the questions.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Quiz\n\n**What is their purpose?**\n\n**The Source pane...**
\n\n\n**The Environment pane...**
\n\n\n**The Console pane...**
\n\n\n**The Output pane...**
\n\n\n**Where are these panes located by default?**\n\n* The Source pane is located? \n* The Environment pane is located? \n* The Console pane is located? \n* The Output pane is located? \n\n:::\n\n:::\n\nIf you were not quite sure about one/any of the panes, check out the [materials from Level 1](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#rstudio-panes){target=\"_blank\"}. If you want to know more about them, there is the [RStudio guide on Posit](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html){target=\"_blank\"}.\n\n\n\n## Activity 1: Creating a new project {#sec-project}\n\nIt's important to create a new RStudio project whenever you start a new project. This practice makes it easier to work in multiple contexts, such as when analysing different datasets simultaneously. Each RStudio project has its own folder location, workspace, and working directories, which keeps all your data and RMarkdown documents organised in one place.\n\nLast year, you learnt how to create projects on the server, so you already know the steps. If you cannot quite recall how that was done, go back to the [Level 1 materials](https://psyteachr.github.io/data-skills-v2/sec-intro.html?q=RMark#new-project){target=\"_blank\"}.\n\nOn your own computer, open RStudio, and complete the following steps in this order:\n\n* Click on File \\> New Project...\n* Then, click on \"New Directory\"\n* Then, click on \"New Project\"\n* Name the directory something meaningful (e.g., \"2A_chapter1\"), and save it in a location that makes sense, for example, a dedicated folder you have for your level 2 Psychology labs - you can either select a folder you have already in place or create a new one (e.g., I named my new folder \"Level 2 labs\")\n* Click \"Create Project\". RStudio will restart itself and open with this new project directory as the working directory.
If you accidentally close it, you can open it by double-clicking on the project icon in your folder\n* You can also check in your folder structure that everything was created as intended\n\n![Creating a new project](images/project_setup.gif)\n\n::: {.callout-tip collapse=\"true\"}\n\n## Why is the colour scheme in the gif different to my version?\n\nIn case anyone is wondering why my colour scheme in the gif above looks different to yours, I've set mine to \"Pastel On Dark\" in Tools \\> Global Options... \\> Appearance. And my computer lives in \"dark mode\".\n\n:::\n\n::: callout-important\n\n## Don't nest projects\n\nDon't ever save a new project **inside** another project directory. This can cause some hard-to-resolve problems.\n\n:::\n\n\n## Activity 2: Create a new R Markdown file {#sec-rmd}\n\n* Open a new R Markdown document: click File \\> New File \\> R Markdown or click on the little page icon with a green plus sign (top left).\n* Give it a meaningful `Title` (e.g., Level 2 chapter 1) - you can also change the title later. Feel free to add your name or GUID in the `Author` field. Keep the `Default Output Format` as HTML.\n* Once the `.Rmd` has opened, you need to save the file.\n* To save it, click File \\> Save As... or click on the little disc icon. Name it something meaningful (e.g., \"chapter_01.Rmd\", \"01_intro.Rmd\"). Make sure there are no spaces in the name - R is not very fond of spaces... This file will automatically be saved in your project folder (i.e., your working directory) so you should now see this file appear in your file viewer pane.\n\n\n![Creating a new `.Rmd` file](images/Rmd_setup.gif)\n\n\nRemember, an R Markdown document or `.Rmd` has \"white space\" (i.e., the markdown for formatted text) and \"grey parts\" (i.e., code chunks) in the default colour scheme (see @fig-rmd). R Markdown is a powerful tool for creating dynamic documents because it allows you to integrate code and regular text seamlessly.
You can then knit your `.Rmd` using the `knitr` package to create a final document as either a webpage (HTML), a PDF, or a Word document (.docx). We'll only knit to HTML documents in this course.\n\n\n![R markdown anatomy (image from [https://intro2r.com/r-markdown-anatomy.html](https://intro2r.com/r-markdown-anatomy.html){target=\"_blank\"})](images/rm_components.png)\n\n\n\n### Markdown\n\nThe markdown space in an `.Rmd` is ideal for writing notes that explain your code and document your thought process. Use this space to clarify what your code is doing, why certain decisions were made, and any insights or conclusions you have drawn along the way. These notes are invaluable when revisiting your work later, helping you (or others) understand the rationale behind key decisions, such as setting inclusion/exclusion criteria or interpreting the results of assumption tests. Effectively documenting your work in the markdown space enhances both the clarity and reproducibility of your analysis.\n\nThe markdown space offers a variety of formatting options to help you organise and present your notes effectively. Here are a few of them that can enhance your documentation:\n\n#### Heading levels {.unnumbered}\n\nThere is a variety of **heading levels** to make use of, using the `#` symbol.\n\n\n::: columns\n\n::: column\n\n##### You would incorporate this into your text as: {.unnumbered}\n\n\\# Heading level 1\n\n\\## Heading level 2\n\n\\### Heading level 3\n\n\\#### Heading level 4\n\n\\##### Heading level 5\n\n\\###### Heading level 6\n\n:::\n\n::: column\n\n##### And it will be displayed in your knitted html file as: {.unnumbered}\n\n![](images/heading_levels.PNG)\n\n:::\n\n:::\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My heading levels don't render properly when knitting\n\nYou need a space between the # and the first letter. 
If the space is missing, the heading will be displayed in the HTML file as ...\n\n#Heading 1\n\n:::\n\n#### Unordered and ordered lists {.unnumbered}\n\nYou can also include **unordered lists** and **ordered lists**. Click on the tabs below to see how they are incorporated\n\n::: panel-tabset\n\n## unordered lists\n\nYou can add **bullet points** using either `*`, `-` or `+` and they will turn into:\n\n* bullet point (created with `*`)\n* bullet point (created with `-`)\n+ bullet point (created with `+`)\n\nor use bullet points of different levels using 1 tab key press or 2 spaces (for sub-item 1) or 2 tabs/4 spaces (for sub-sub-item 1):\n\n* bullet point item 1\n * sub-item 1\n * sub-sub-item 1\n * sub-sub-item 2\n* bullet point item 2\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: My bullet points don't render properly when knitting\n\nYou need an empty row before your bullet points start. If I delete the empty row before the bullet points, they will be displayed in the HTML as ...\n\nText without the empty row: * bullet point created with `*` - bullet point created with `-` + bullet point created with `+`\n\n:::\n\n\n## ordered lists\n\nStart the line with **1.**, **2.**, etc. When you want to include sub-items, either use the `tab` key twice or add **4 spaces**. Same goes for the sub-sub-item: include either 2 tabs (or 4 manual spaces) from the last item or 4 tabs/ 8 spaces from the start of the line.\n\n1. list item 1\n2. list item 2\n i) sub-item 1 (with 4 spaces)\n A. sub-sub-item 1 (with an additional 4 spaces from the last indent)\n\n::: {.callout-important collapse=\"true\"}\n\n## My list items don't render properly when knitting\n\nIf you don't leave enough spaces, the list won't be recognised, and your output looks like this:\n\n3. list item 3\n i) sub-item 1 (with only 2 spaces) \n A. 
sub-sub-item 1 (with an additional 2 spaces from the last indent)\n\n:::\n\n\n## ordered lists magic\n\nThe great thing though is that you don't need to know your alphabet or number sequences. R markdown will fix that for you\n\nIf I type into my `.Rmd`...\n\n![](images/list_magic.PNG)\n\n...it will be rendered in the knitted HTML output as...\n\n3. list item 3\n1. list item 1\n a) sub-item labelled \"a)\"\n i) sub-item labelled \"i)\"\n C) sub-item labelled \"C)\"\n Z) sub-item labelled \"Z)\"\n7. list item 7\n\n\n\n::: {.callout-important collapse=\"true\"}\n\n## ERROR: The labels of the sub-items are not what I thought they would be. You said they are fixing themselves...\n\nYes, they do but you need to label your sub-item lists accordingly. The first label you list in each level is set as the baseline. If they are labelled `1)` instead of `i)` or `A.`, the output will show as follows, but the automatic-item-fixing still works:\n\n7. list item 7\n 1) list item \"1)\" with 4 spaces\n 1) list item \"1)\" with 8 spaces\n 6) this is an item labelled \"6)\" (magically corrected to \"2.\")\n:::\n\n:::\n\n#### Emphasis {.unnumbered}\n\nInclude **emphasis** to draw attention to keywords in your text:\n\n| R markdown syntax | Displayed in the knitted HTML file |\n|:----------------------------|:-----------------------------------|\n| \\*\\*bold text\\*\\* | **bold text** |\n| \\*italic text\\* | *italic text* |\n| \\*\\*\\*bold and italic\\*\\*\\* | ***bold and italic*** |\n\n\nOther examples can be found in the [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target=\"_blank\"}\n\n\n\n### Code chunks {#sec-chunks}\n\nEverything you write inside the **code chunks** will be interpreted as code and executed by R. Code chunks start with ```` ``` ```` followed by an `{r}` which specifies the coding language R, some space for code, and ends with ```` ``` ````. 
If you accidentally delete one of those backticks, your code won't run and/or your text parts will be interpreted as part of the code chunks or vice versa. This should be evident from the colour change - more white than expected typically indicates missing starting backticks, whilst too much grey/not enough white suggests missing ending backticks. But no need to fret if that happens - just add the missing backticks manually.\n\n\nYou can **insert a new code chunk** in several ways:\n\n\n* Click the `Insert a new code chunk` button in the RStudio Toolbar (green icon at the top right corner of the `Source pane`).\n* Select Code \\> Insert Chunk from the menu.\n* Using the shortcut `Ctrl + Alt + I` for Windows or `Cmd + Option + I` on MacOSX.\n* Type ```` ```{r} ```` and ```` ``` ```` manually\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Default `.Rmd` with highlighting - names in pink and knitr display options in purple](images/default_highlighted.png){#fig-rmd fig-align='center' width=100%}\n:::\n:::\n\n\n\n\nWithin the curly brackets of a code chunk, you can **specify a name** for the code chunk (see pink highlighting in @fig-rmd). The chunk name is not necessarily required; however, it is good practice to give each chunk a unique name to support more advanced knitting approaches. It also makes it easier to reference and manage chunks.\n\nWithin the curly brackets, you can also place **rules and arguments** (see purple highlighting in @fig-rmd) to control how your code is executed and what is displayed in your final HTML output. 
The most common **knitr display options** include:\n\n\n| Code | Does code run | Does code show | Do results show |\n|:--------------------|:-------------:|:--------------:|:---------------:|\n| eval=FALSE | NO | YES | NO |\n| echo=TRUE (default) | YES | YES | YES |\n| echo=FALSE | YES | NO | YES |\n| results='hide' | YES | YES | NO |\n| include=FALSE | YES | NO | NO |\n\n\n::: callout-important\n\nThe table above will be incredibly important for the data skills homework II. When solving error mode items you will need to pay attention to the first one `eval = FALSE`.\n\n:::\n\nOne last thing: In your newly created `.Rmd` file, delete everything below line 12 (keep the set-up code chunk) and save your `.Rmd` by clicking on the disc symbol.\n\n![Delete everything below line 12](images/delete_12.gif)\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nThat was quite a long section about what Markdown can do. I promise, we'll practise that more later. For the minute, we want you to create a new level 2 heading on line 12 and give it a meaningful heading title (something like \"Loading packages and reading in data\" or \"Chapter 1\").\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\nOn line 12, you should have typed **## Loading packages and reading in data** (or whatever meaningful title you chose). This will create a level 2 heading once we knit the `.Rmd`.\n\n:::\n\n:::\n\n\n## Activity 3: Download the data {#sec-download_data_ch1}\n\nThe data for chapters 1-3 can be downloaded here: [data_ch1.zip](data/data_ch1.zip \"download\"). There are 2 files contained in the zip folder.
One is the data file we are going to use today `prp_data_reduced.csv` and the other is an Excel file `prp_codebook` that explains the variables in the data.\n\nThe first step is to **unzip the zip folder** so that the files are placed within the same folder as your project.\n\n* Place the zip folder within your 2A_chapter1 folder\n* Right mouse click --> `Extract All...`\n* Check the folder location is the one to extract the files to\n* Check the extracted files are placed next to the project icon\n* Files and project should be visible in the Output pane in RStudio\n\n::: {.callout-note collapse=\"false\"}\n\n## Screenshots for \"unzipping a zip folder\"\n\n::: {layout-ncol=\"1\"}\n\n![](images/pic1.PNG){fig-align=\"center\"}\n\n![](images/pic23.PNG){fig-align=\"center\"}\n\n![](images/pic45.PNG){fig-align=\"center\"}\n\nUnzipping a zip folder\n\n:::\n:::\n\nThe paper by Pownall et al. was a **registered report** published in 2023, and the original data can be found on OSF ([https://osf.io/5qshg/](https://osf.io/5qshg/){target=\"_blank\"}).\n\n**Citation**\n\n> Pownall, M., Pennington, C. R., Norris, E., Juanchich, M., Smailes, D., Russell, S., Gooch, D., Evans, T. R., Persson, S., Mak, M. H. C., Tzavella, L., Monk, R., Gough, T., Benwell, C. S. Y., Elsherif, M., Farran, E., Gallagher-Mitchell, T., Kendrick, L. T., Bahnmueller, J., . . . Clark, K. (2023). Evaluating the Pedagogical Effectiveness of Study Preregistration in the Undergraduate Dissertation. *Advances in Methods and Practices in Psychological Science, 6*(4). [https://doi.org/10.1177/25152459231202724](https://doi.org/10.1177/25152459231202724){target=\"_blank\"}\n\n**Abstract**\n\n> Research shows that questionable research practices (QRPs) are present in undergraduate final-year dissertation projects. 
One entry-level Open Science practice proposed to mitigate QRPs is “study preregistration,” through which researchers outline their research questions, design, method, and analysis plans before data collection and/or analysis. In this study, we aimed to empirically test the effectiveness of preregistration as a pedagogic tool in undergraduate dissertations using a quasi-experimental design. A total of 89 UK psychology students were recruited, including students who preregistered their empirical quantitative dissertation (*n* = 52; experimental group) and students who did not (*n* = 37; control group). Attitudes toward statistics, acceptance of QRPs, and perceived understanding of Open Science were measured both before and after dissertation completion. Exploratory measures included capability, opportunity, and motivation to engage with preregistration, measured at Time 1 only. This study was conducted as a Registered Report; Stage 1 protocol: https://osf.io/9hjbw (date of in-principle acceptance: September 21, 2021). Study preregistration did not significantly affect attitudes toward statistics or acceptance of QRPs. However, students who preregistered reported greater perceived understanding of Open Science concepts from Time 1 to Time 2 compared with students who did not preregister. Exploratory analyses indicated that students who preregistered reported significantly greater capability, opportunity, and motivation to preregister. Qualitative responses revealed that preregistration was perceived to improve clarity and organization of the dissertation, prevent QRPs, and promote rigor. Disadvantages and barriers included time, perceived rigidity, and need for training. 
These results contribute to discussions surrounding embedding Open Science principles into research training.\n\n**Changes made to the dataset**\n\nWe made some changes to the dataset for the purpose of increasing difficulty for data wrangling (@sec-wrangling and @sec-wrangling2) and data visualisation (@sec-dataviz and @sec-dataviz2). This will ensure some \"teachable moments\". The changes are as follows:\n\n* We removed some of the variables to make the data more manageable for teaching purposes.\n* We recoded some values from numeric responses to labels (e.g., `understanding`).\n* We added the word \"years\" to one of the `Age` entries.\n* We tidied a messy column `Ethnicity` but introduced a similar but easier-to-solve \"messiness pattern\" when recoding the `understanding` data.\n* The scores in the original file were already corrected from reverse-coded responses. We reversed that process to present raw data here.\n\n\n\n\n## Activity 4: Installing packages, loading packages, and reading in data\n\n### Installing packages\n\nWhen you install R and RStudio for the first time (or after an update), most of the packages we will be using won’t be pre-installed. Before you can load new packages like `tidyverse`, you will need to install them.\n\nIf you try to load a package that has not been installed yet, you will receive an error message that looks something like this: `Error in library(tidyverse) : there is no package called 'tidyverse'`. \n\nTo fix this, simply install the package first. **In the console**, type the command `install.packages(\"tidyverse\")`. This **only needs to be done once after a fresh installation**. After that, you will be able to load the `tidyverse` package into your library whenever you open RStudio.\n\n::: callout-important\n\n## Install packages from the console only\n\nNever include `install.packages()` in the Rmd. 
Only install packages from the console pane or the packages tab of the lower right pane!!!\n:::\n\n\nNote, there will be other packages used in later chapters that will also need to be installed before their first use, so this error is not limited to `tidyverse`.\n\n\n### Loading packages and reading in data\n\nThe first step is to load in the packages we need and read in the data. Today, we'll only be using `tidyverse`, and `read_csv()` will help us store the data from `prp_data_reduced.csv` in an object called `data_prp`.\n\nCopy the code into a code chunk in your `.Rmd` file and run it. You can either click the `green arrow` to run the entire code chunk, or use the shortcut `Ctrl + Enter` (Windows) or `Cmd + Enter` (Mac) to run a line of code/ pipe from the Rmd.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\nRows: 89 Columns: 91\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (17): Code, Age, Ethnicity, Opptional_mod_1_TEXT, Research_exp_1_TEXT, U...\ndbl (74): Gender, Secondyeargrade, Opptional_mod, Research_exp, Plan_prereg,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n\n\n\n## Activity 5: Familiarise yourself with the data {#sec-familiarise}\n\n* 
Look at the **Codebook** to get a feel for the variables in the dataset and how they have been measured. Note that some of the columns were deleted in the dataset you have been given.\n* You'll notice that some questionnaire data was collected at 2 different time points (i.e., SATS28, QRPs, Understanding_OS).\n* Some of the data was only collected at one time point (i.e., supervisor judgements, OS_behav items, and Included_prereg variables are t2-only variables).\n\n\n\n### First glimpse at the data\n\nBefore you start wrangling your data, it is important to understand what kind of data you're working with and what the format of your dataframe looks like.\n\nAs you may have noticed, `read_csv()` provides a **message** listing the data types in your dataset and how many columns are of each type. Plus, it shows a few example columns for each data type.\n\nTo obtain more detailed information about your data, you have several options. Click on the individual tabs to see the different options available. Test them out in your own `.Rmd` file and use whichever method you prefer (but do it).\n\n::: callout-warning\n\nSome of the output is a bit long because we do have quite a few variables in the data file.\n\n:::\n\n::: panel-tabset\n\n## visual inspection 1\n\nIn the `Global Environment`, click the blue arrow icon next to the object name `data_prp`. This action will expand the object, revealing details about its columns. The `$` symbol is commonly used in Base R to access a specific column within your dataframe.\n\n![Visual inspection of the data](images/data_prp.PNG)\n\nCon: When you have quite a few variables, not all of them are shown.\n\n## `glimpse()`\n\nUse `glimpse()` if you want a more detailed overview you can see on your screen. 
The output will display rows and column numbers, and some examples of the first couple of observations for each variable.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 91\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", …\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"2…\n$ Ethnicity \"White European\", \"White British…\n$ Secondyeargrade 2, 3, 1, 2, 2, 2, 2, 2, 1, 1, 1,…\n$ Opptional_mod 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,…\n$ Opptional_mod_1_TEXT \"Research methods in first year\"…\n$ Research_exp 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…\n$ Research_exp_1_TEXT NA, NA, NA, NA, NA, NA, NA, NA, …\n$ Plan_prereg 1, 3, 1, 2, 1, 1, 3, 3, 2, 2, 2,…\n$ SATS28_1_Affect_Time1 4, 5, 5, 6, 2, 1, 6, 3, 2, 5, 2,…\n$ SATS28_2_Affect_Time1 5, 6, 3, 3, 6, 1, 2, 2, 7, 3, 4,…\n$ SATS28_3_Affect_Time1 3, 2, 5, 2, 6, 7, 2, 6, 6, 5, 2,…\n$ SATS28_4_Affect_Time1 4, 5, 2, 2, 6, 6, 5, 5, 5, 5, 2,…\n$ SATS28_5_Affect_Time1 5, 5, 5, 6, 1, 1, 5, 1, 2, 5, 2,…\n$ SATS28_6_Affect_Time1 5, 6, 2, 5, 6, 7, 4, 5, 5, 3, 5,…\n$ SATS28_7_CognitiveCompetence_Time1 4, 2, 2, 5, 6, 7, 2, 5, 5, 2, 2,…\n$ SATS28_8_CognitiveCompetence_Time1 2, 2, 2, 1, 6, 7, 2, 5, 3, 2, 3,…\n$ SATS28_9_CognitiveCompetence_Time1 2, 2, 2, 3, 3, 7, 2, 6, 3, 3, 1,…\n$ SATS28_10_CognitiveCompetence_Time1 6, 7, 6, 6, 4, 2, 6, 4, 5, 6, 5,…\n$ SATS28_11_CognitiveCompetence_Time1 4, 3, 5, 5, 3, 1, 6, 2, 5, 6, 5,…\n$ SATS28_12_CognitiveCompetence_Time1 3, 5, 3, 5, 5, 7, 3, 4, 7, 2, 3,…\n$ SATS28_13_Value_Time1 1, 1, 2, 1, 3, 7, 1, 2, 1, 2, 4,…\n$ SATS28_14_Value_Time1 7, 7, 6, 6, 5, 1, 6, 5, 7, 6, 2,…\n$ SATS28_15_Value_Time1 7, 7, 6, 6, 3, 5, 6, 6, 6, 5, 5,…\n$ SATS28_16_Value_Time1 2, 1, 3, 2, 6, 5, 3, 7, 2, 2, 2,…\n$ SATS28_17_Value_Time1 1, 1, 3, 3, 7, 7, 2, 7, 2, 2, 5,…\n$ SATS28_18_Value_Time1 3, 6, 5, 3, 1, 1, 5, 1, 5, 2, 2,…\n$ SATS28_19_Value_Time1 3, 3, 3, 3, 7, 7, 4, 5, 3, 5, 
6,…\n$ SATS28_20_Value_Time1 2, 1, 4, 2, 7, 7, 2, 4, 2, 2, 7,…\n$ SATS28_21_Value_Time1 2, 1, 3, 2, 6, 7, 2, 5, 1, 3, 5,…\n$ SATS28_22_Difficulty_Time1 3, 2, 5, 3, 2, 1, 4, 2, 2, 5, 3,…\n$ SATS28_23_Difficulty_Time1 5, 6, 5, 6, 6, 7, 4, 6, 7, 5, 6,…\n$ SATS28_24_Difficulty_Time1 2, 2, 2, 3, 1, 4, 4, 2, 2, 2, 2,…\n$ SATS28_25_Difficulty_Time1 6, 7, 5, 5, 6, 7, 5, 6, 5, 5, 5,…\n$ SATS28_26_Difficulty_Time1 4, 2, 2, 2, 6, 7, 4, 5, 3, 5, 3,…\n$ SATS28_27_Difficulty_Time1 4, 5, 5, 3, 6, 7, 4, 3, 5, 3, 6,…\n$ SATS28_28_Difficulty_Time1 1, 7, 5, 5, 6, 6, 5, 4, 4, 4, 2,…\n$ QRPs_1_Time1 7, 7, 7, 7, 7, 7, 6, 2, 7, 6, 7,…\n$ QRPs_2_Time1 7, 7, 7, 7, 7, 7, 6, 7, 7, 7, 5,…\n$ QRPs_3_Time1 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3,…\n$ QRPs_4_Time1 7, 7, 6, 6, 7, 4, 6, 7, 7, 7, 6,…\n$ QRPs_5_Time1 3, 3, 7, 7, 2, 7, 4, 6, 7, 3, 2,…\n$ QRPs_6_Time1 4, 7, 6, 5, 7, 4, 4, 5, 7, 6, 5,…\n$ QRPs_7_Time1 5, 7, 7, 7, 7, 4, 5, 6, 7, 7, 5,…\n$ QRPs_8_Time1 7, 7, 7, 7, 7, 7, 7, 7, 7, 2, 7,…\n$ QRPs_9_Time1 6, 7, 7, 4, 7, 7, 3, 7, 6, 6, 2,…\n$ QRPs_10_Time1 7, 6, 5, 2, 5, 4, 2, 6, 7, 7, 2,…\n$ QRPs_11_Time1 7, 7, 7, 4, 7, 7, 4, 6, 7, 7, 5,…\n$ QRPs_12NotQRP_Time1 2, 2, 1, 4, 1, 4, 2, 4, 2, 2, 1,…\n$ QRPs_13NotQRP_Time1 1, 1, 1, 1, 1, 4, 2, 4, 1, 1, 1,…\n$ QRPs_14NotQRP_Time1 1, 4, 3, 4, 1, 4, 2, 3, 3, 4, 3,…\n$ QRPs_15NotQRP_Time1 2, 4, 2, 2, 1, 4, 2, 1, 4, 4, 2,…\n$ Understanding_OS_1_Time1 \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at…\n$ Understanding_OS_2_Time1 \"2\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_3_Time1 \"2\", \"Not at all confident\", \"3\"…\n$ Understanding_OS_4_Time1 \"6\", \"Not at all confident\", \"6\"…\n$ Understanding_OS_5_Time1 \"Entirely confident\", \"6\", \"6\", …\n$ Understanding_OS_6_Time1 \"Entirely confident\", \"Entirely …\n$ Understanding_OS_7_Time1 \"6\", \"Not at all confident\", \"2\"…\n$ Understanding_OS_8_Time1 \"6\", \"3\", \"5\", \"3\", \"5\", \"Not at…\n$ Understanding_OS_9_Time1 \"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_10_Time1 
\"Entirely confident\", \"6\", \"5\", …\n$ Understanding_OS_11_Time1 \"Entirely confident\", \"2\", \"4\", …\n$ Understanding_OS_12_Time1 \"Entirely confident\", \"2\", \"5\", …\n$ Pre_reg_group 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2,…\n$ Other_OS_behav_2 1, NA, NA, NA, 1, NA, NA, 1, NA,…\n$ Other_OS_behav_4 1, NA, NA, NA, NA, NA, NA, NA, N…\n$ Other_OS_behav_5 NA, NA, NA, NA, 1, 1, NA, NA, NA…\n$ Closely_follow 2, 2, 2, NA, 3, 3, 3, NA, NA, 2,…\n$ SATS28_Affect_Time2_mean 3.500000, 3.166667, 4.833333, 4.…\n$ SATS28_CognitiveCompetence_Time2_mean 4.166667, 4.666667, 6.166667, 5.…\n$ SATS28_Value_Time2_mean 3.000000, 6.222222, 6.000000, 4.…\n$ SATS28_Difficulty_Time2_mean 2.857143, 2.857143, 4.000000, 2.…\n$ QRPs_Acceptance_Time2_mean 5.636364, 5.454545, 6.272727, 5.…\n$ Time2_Understanding_OS 5.583333, 3.333333, 5.416667, 4.…\n$ Supervisor_1 5, 7, 7, 1, 7, 1, 7, 6, 7, 5, 6,…\n$ Supervisor_2 5, 6, 7, 4, 6, 2, 7, 5, 6, 5, 5,…\n$ Supervisor_3 6, 7, 7, 1, 7, 1, 7, 5, 6, 6, 7,…\n$ Supervisor_4 6, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_5 5, 7, 7, 4, 7, 3, 7, 7, 6, 6, 6,…\n$ Supervisor_6 5, 7, 7, 4, 6, 3, 7, 6, 7, 6, 6,…\n$ Supervisor_7 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ Supervisor_8 5, 5, 7, 1, 7, 1, 7, 5, 7, 5, 6,…\n$ Supervisor_9 6, 7, 7, 4, 7, 3, 7, 5, 7, 6, 7,…\n$ Supervisor_10 5, 7, 7, 1, 7, 1, 7, 6, 7, 6, 6,…\n$ Supervisor_11 NA, 7, 7, NA, 7, 1, 7, 5, 7, 6, …\n$ Supervisor_12 4, 5, 7, 1, 4, 1, 7, 3, 6, 6, 5,…\n$ Supervisor_13 4, 2, 5, 1, 2, 1, 6, 3, 5, 6, 5,…\n$ Supervisor_14 5, 7, 7, 1, 7, 1, 7, 5, 7, 6, 6,…\n$ Supervisor_15_R 1, 1, 1, 4, 1, 7, 1, 2, 1, 2, 1,…\n```\n:::\n:::\n\n\n\n## `spec()`\n\nYou can also use `spec()` as suggested in the message above and then it shows you a list of the data type in every single column. 
But it doesn't show you the number of rows and columns.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nspec(data_prp)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ncols(\n Code = col_character(),\n Gender = col_double(),\n Age = col_character(),\n Ethnicity = col_character(),\n Secondyeargrade = col_double(),\n Opptional_mod = col_double(),\n Opptional_mod_1_TEXT = col_character(),\n Research_exp = col_double(),\n Research_exp_1_TEXT = col_character(),\n Plan_prereg = col_double(),\n SATS28_1_Affect_Time1 = col_double(),\n SATS28_2_Affect_Time1 = col_double(),\n SATS28_3_Affect_Time1 = col_double(),\n SATS28_4_Affect_Time1 = col_double(),\n SATS28_5_Affect_Time1 = col_double(),\n SATS28_6_Affect_Time1 = col_double(),\n SATS28_7_CognitiveCompetence_Time1 = col_double(),\n SATS28_8_CognitiveCompetence_Time1 = col_double(),\n SATS28_9_CognitiveCompetence_Time1 = col_double(),\n SATS28_10_CognitiveCompetence_Time1 = col_double(),\n SATS28_11_CognitiveCompetence_Time1 = col_double(),\n SATS28_12_CognitiveCompetence_Time1 = col_double(),\n SATS28_13_Value_Time1 = col_double(),\n SATS28_14_Value_Time1 = col_double(),\n SATS28_15_Value_Time1 = col_double(),\n SATS28_16_Value_Time1 = col_double(),\n SATS28_17_Value_Time1 = col_double(),\n SATS28_18_Value_Time1 = col_double(),\n SATS28_19_Value_Time1 = col_double(),\n SATS28_20_Value_Time1 = col_double(),\n SATS28_21_Value_Time1 = col_double(),\n SATS28_22_Difficulty_Time1 = col_double(),\n SATS28_23_Difficulty_Time1 = col_double(),\n SATS28_24_Difficulty_Time1 = col_double(),\n SATS28_25_Difficulty_Time1 = col_double(),\n SATS28_26_Difficulty_Time1 = col_double(),\n SATS28_27_Difficulty_Time1 = col_double(),\n SATS28_28_Difficulty_Time1 = col_double(),\n QRPs_1_Time1 = col_double(),\n QRPs_2_Time1 = col_double(),\n QRPs_3_Time1 = col_double(),\n QRPs_4_Time1 = col_double(),\n QRPs_5_Time1 = col_double(),\n QRPs_6_Time1 = col_double(),\n QRPs_7_Time1 = col_double(),\n QRPs_8_Time1 = col_double(),\n 
QRPs_9_Time1 = col_double(),\n QRPs_10_Time1 = col_double(),\n QRPs_11_Time1 = col_double(),\n QRPs_12NotQRP_Time1 = col_double(),\n QRPs_13NotQRP_Time1 = col_double(),\n QRPs_14NotQRP_Time1 = col_double(),\n QRPs_15NotQRP_Time1 = col_double(),\n Understanding_OS_1_Time1 = col_character(),\n Understanding_OS_2_Time1 = col_character(),\n Understanding_OS_3_Time1 = col_character(),\n Understanding_OS_4_Time1 = col_character(),\n Understanding_OS_5_Time1 = col_character(),\n Understanding_OS_6_Time1 = col_character(),\n Understanding_OS_7_Time1 = col_character(),\n Understanding_OS_8_Time1 = col_character(),\n Understanding_OS_9_Time1 = col_character(),\n Understanding_OS_10_Time1 = col_character(),\n Understanding_OS_11_Time1 = col_character(),\n Understanding_OS_12_Time1 = col_character(),\n Pre_reg_group = col_double(),\n Other_OS_behav_2 = col_double(),\n Other_OS_behav_4 = col_double(),\n Other_OS_behav_5 = col_double(),\n Closely_follow = col_double(),\n SATS28_Affect_Time2_mean = col_double(),\n SATS28_CognitiveCompetence_Time2_mean = col_double(),\n SATS28_Value_Time2_mean = col_double(),\n SATS28_Difficulty_Time2_mean = col_double(),\n QRPs_Acceptance_Time2_mean = col_double(),\n Time2_Understanding_OS = col_double(),\n Supervisor_1 = col_double(),\n Supervisor_2 = col_double(),\n Supervisor_3 = col_double(),\n Supervisor_4 = col_double(),\n Supervisor_5 = col_double(),\n Supervisor_6 = col_double(),\n Supervisor_7 = col_double(),\n Supervisor_8 = col_double(),\n Supervisor_9 = col_double(),\n Supervisor_10 = col_double(),\n Supervisor_11 = col_double(),\n Supervisor_12 = col_double(),\n Supervisor_13 = col_double(),\n Supervisor_14 = col_double(),\n Supervisor_15_R = col_double()\n)\n```\n:::\n:::\n\n\n\n## visual inspection 2\n\nIn the `Global Environment`, click on the object name `data_prp`. This action will open the data in a new tab. Hovering over the column headings with your mouse will also reveal their data type. 
However, it seems to be a fairly tedious process when you have loads of columns.\n\n::: {.callout-important collapse=\"true\"}\n\n## Hang on, where is the rest of my data? Why do I only see 50 columns?\n\nOne common source of confusion is not seeing all your columns when you open up a data object as a tab. This is because RStudio shows you a maximum of 50 columns at a time. If you have more than 50 columns, navigate with the arrows to see the remaining columns.\n\n![Showing 50 columns at a time](images/50_col.PNG)\n\n:::\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nNow that you have tested out all the options in your own `.Rmd` file, you can probably answer the following questions:\n\n* How many observations? \n* How many variables? \n* How many columns are `col_character` or `chr` data type? \n* How many columns are `col_double` or `dbl` data type? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe visual inspections show you the **number of observations and variables**. `glimpse()` also gives you that information but calls them **rows and columns** respectively.\n\nThe **data type information** actually comes from the output when using the `read_csv()` function. Did you notice the information on **Column specification** (see screenshot below)?\n\n![message from `read_csv()` when reading in the data](images/col_spec.PNG)\n\nWhilst `spec()` is quite useful for data type information per individual column, it doesn't give you the total count of each data type. So it doesn't really help with answering the questions here - unless you want to count manually from its extremely long output.\n\n:::\n\nIn your `.Rmd`, include a **new heading level 2** called \"Information about the data\" (or something equally meaningful) and jot down some notes about `data_prp`. 
You could include the citation and/or the abstract, and whatever information you think you should note about this dataset (e.g., any observations from looking at the codebook?). You could also include some notes on the functions used so far and what they do. Try to incorporate some **bold**, *italic* or ***bold and italic*** emphasis and perhaps a bullet point or two.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Possible solution\n\n\\#\\# Information about the data\n\nThe data is from Pownall et al. (2023), and I can find the paper here: https://doi.org/10.1177/25152459231202724.\n\nI've noticed in the prp codebook that the SATS-28 questionnaire has quite a few \\*reverse-coded items\\*, and the supervisor support questionnaire also has a reverse-coded item.\n\nSo far, I think I prefer \\*\\*glimpse()\\*\\* to show me some more detail about the data. spec() is too text-heavy for me which makes it hard to read.\n\nThings to keep in mind:\n\n* \\*\\*don't forget to load in tidyverse first!!!\\*\\*\n* always read in the data with \\*\\*read_csv\\*\\*, \\*\\*\\*never ever use read.csv\\*\\*\\*!!!\n\n![The output rendered in a knitted html file](images/knitted_markdown.PNG)\n\n:::\n\n:::\n\n### Data types {#sec-datatypes}\n\nEach variable has a **data type**, such as numeric (numbers), character (text), and logical (TRUE/FALSE values), or a special class of factor. As you have just seen, our `data_prp` only has character and numeric columns (so far).\n\n**Numeric data** can be double (`dbl`) or integer (`int`). Doubles can have decimal places (e.g., 1.1). Integers are the whole numbers (e.g., 1, 2, -1) and are displayed with the suffix L (e.g., 1L). This is not overly important but might leave you less puzzled the next time you see an L after a number.\n\n**Characters** (also called “strings”) are anything written between quotation marks. 
This is usually text, but in special circumstances, a number can be a character if it is placed within quotation marks. This can happen when you are recoding variables. It might not be too obvious at the time, but you won't be able to calculate anything if the number is a character.\n\n::: panel-tabset\n\n## Example data types\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntypeof(1)\ntypeof(1L)\ntypeof(\"1\")\ntypeof(\"text\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n[1] \"integer\"\n[1] \"character\"\n[1] \"character\"\n```\n:::\n:::\n\n\n## numeric computation\n\nNo problems here...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n1+1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n## character computation\n\nWhen the data type is incorrect, you won't be able to compute anything, even if the values look like numbers in the dataframe. The error message tells you exactly what's wrong, i.e., that you have `non-numeric arguments`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n\"1\"+\"1\" # ERROR\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"1\" + \"1\": non-numeric argument to binary operator\n```\n:::\n:::\n\n\n:::\n\n**Logical** data (also sometimes called “Boolean” values) are one of two values: TRUE or FALSE (written in uppercase). They become really important when we use `filter()` or `mutate()` with conditional statements such as `case_when()`. 
More about those in @sec-wrangling2.\n\n\nSome commonly used logical operators:\n\n| operator | description |\n|:---------|:-----------------------------------------------|\n| \\> | greater than |\n| \\>= | greater than or equal to |\n| \\< | less than |\n| \\<= | less than or equal to |\n| == | equal to |\n| != | not equal to |\n| %in% | TRUE if the element is found in the following vector |\n\n\nA **factor** is a specific type of integer or character that lets you assign the order of the categories. This becomes useful when you want to display certain categories in \"the correct order\" either in a dataframe (see `arrange()`) or when plotting (see @sec-dataviz and @sec-dataviz2).\n\n\n\n### Variable types\n\nYou've already encountered them in [Level 1](https://psyteachr.github.io/data-skills-v2/intro-to-probability.html){target=\"_blank\"} but let's refresh. Variables can be classified as **continuous** (numbers) or **categorical** (labels).\n\n**Categorical** variables are properties you can count. They can be **nominal**, where the categories don't have an order (e.g., gender) or **ordinal**, where they do (e.g., Likert scales either with numeric values 1-7 or with character labels such as \"agree\", \"neither agree nor disagree\", \"disagree\"). Categorical data may also be **factors** rather than characters.\n\n**Continuous variables** are properties you can measure and calculate sums/ means/ etc. They may be rounded to the nearest whole number, but it should make sense to have a value between them. Continuous variables always have a **numeric** data type (i.e. 
`integer` or `double`).\n\n::: callout-tip\n\n## Why is this important, you may ask?\n\nKnowing your variable and data types will help later on when deciding on an appropriate plot (see @sec-dataviz and @sec-dataviz2) or which inferential test to run (@sec-nhstI to @sec-factorial).\n\n:::\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nAs we've seen earlier, `data_prp` only had character and numeric variables, which hardly tests whether you can identify a variety of data types and variable types. So, for this little quiz, we've spiced it up a bit. We've selected a few columns, shortened some of the column names, and modified some of the data types. Here you can see the first few rows of the new object `data_quiz`. *You can find the code with explanations at the end of this section.*\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Code |Age |Gender |Ethnicity |Secondyeargrade | QRP_item| QRPs_mean|Understanding_item |QRP_item > 4 |\n|:----|:---|:------|:--------------|:-----------------------|--------:|---------:|:------------------|:------------|\n|Tr10 |22 |2 |White European |60-69% (2:1 grade) | 5| 5.636364|2 |TRUE |\n|Bi07 |20 |2 |White British |50-59% (2:2 grade) | 2| 5.454546|2 |FALSE |\n|SK03 |22 |2 |White British |≥ 70% (1st class grade) | 6| 6.272727|6 |TRUE |\n|SM95 |26 |2 |White British |60-69% (2:1 grade) | 2| 5.000000|2 |FALSE |\n|St01 |22 |2 |White British |60-69% (2:1 grade) | 6| 5.545454|6 |TRUE |\n\n
\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(data_quiz)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 89\nColumns: 9\n$ Code \"Tr10\", \"Bi07\", \"SK03\", \"SM95\", \"St01\", \"St10\", \"Wa…\n$ Age \"22\", \"20\", \"22\", \"26\", \"22\", \"20\", \"21\", \"21\", \"22…\n$ Gender 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ Ethnicity \"White European\", \"White British\", \"White British\",…\n$ Secondyeargrade 60-69% (2:1 grade), 50-59% (2:2 grade), ≥ 70% (1st …\n$ QRP_item 5, 2, 6, 2, 6, 4, 6, 3, 7, 3, 3, 4, 4, 4, 4, 6, 3, …\n$ QRPs_mean 5.636364, 5.454545, 6.272727, 5.000000, 5.545455, 6…\n$ Understanding_item \"2\", \"2\", \"6\", \"2\", \"6\", \"Not at all confident\", \"4…\n$ `QRP_item > 4` TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,…\n```\n:::\n:::\n\n\n\n\nFor each of the columns, select the variable type and the data type from the dropdown menu.\n\n\n\n\n\n| Column | Variable type | Data type |\n|:---------------------|:--------------|:--------------|\n| `Age` | | |\n| `Gender` | | |\n| `Ethnicity` | | |\n| `Secondyeargrade` | | |\n| `QRP_item` | | |\n| `QRPs_mean` | | |\n| `Understanding_item` | | |\n| `QRP_item > 4` | | |\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Revealing the mystery code that created `data_quiz`\n\nThe code might look a bit complex for the minute despite the line-by-line explanations below. 
Come back to it after completing chapter 2.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_quiz <- data_prp %>% \n select(Code, Age, Gender, Ethnicity, Secondyeargrade, QRP_item = QRPs_3_Time1, QRPs_mean = QRPs_Acceptance_Time2_mean, Understanding_item = Understanding_OS_1_Time1) %>% \n mutate(Gender = factor(Gender),\n Secondyeargrade = factor(Secondyeargrade,\n levels = c(1, 2, 3, 4, 5),\n labels = c(\"≥ 70% (1st class grade)\", \"60-69% (2:1 grade)\", \"50-59% (2:2 grade)\", \"40-49% (3rd class)\", \"< 40%\")),\n `QRP_item > 4` = case_when(\n QRP_item > 4 ~ TRUE, \n .default = FALSE))\n```\n:::\n\n\nLet's go through this line by line:\n\n* **line 1**: creates a new object called `data_quiz` and it is based on the already existing data object `data_prp`\n* **line 2**: we are selecting a few variables of interest, such as Code, Age etc. Some of those variables were renamed in the process according to the structure `new_name = old_name`, for example QRP item 3 at time point 1 got renamed as `QRP_item`.\n* **line 3**: The function `mutate()` is used to create a new column called `Gender` that turns the existing column `Gender` from a numeric value into a factor. R simply overwrites the existing column of the same name. If we had named the new column `Gender_factor`, we would have been able to retain the original `Gender` column and `Gender_factor` would have been added as the last column.\n* **lines 4-6**: See how the line starts with an indent which indicates we are still within the `mutate()` function. You can also see this by counting brackets - in line 3 there are 2 opening brackets but only 1 closes.\n * Similar to `Gender`, we are replacing the \"old\" `Secondyeargrade` with the new `Secondyeargrade` column that is now a factor.\n * We are turning our variable `Secondyeargrade` into a factor. Can you spot the difference between this attempt and the one we used for `Gender`? 
Here we are using a lot more arguments in the `factor()` function, namely `levels` and `labels`. **Levels** describes the unique values we have for that column, and in **labels** we define how these levels will be shown in the data object. If you don't add the `levels` and `labels` arguments, the existing values will be used as the labels (as you can see in the `Gender` column in which we kept the numbers).\n* **line 7**: Doesn't start with a function name and has an indent, which means we are *still* within the `mutate()` function - count the opening and closing brackets to confirm.\n * Here, we are creating a new column called `QRP_item > 4`. Notice the two backticks we have to use to make this weird column name work? This is because it has spaces (and we did mention that R doesn't like spaces). So the backticks help R to group it as a unit/ a single name.\n * Next we have a `case_when()` function which helps execute conditional statements. We are using it to check whether a statement is TRUE or FALSE. Here, we ask whether the QRP item (column `QRP_item`) is larger than 4 (midpoint of the scale) using the comparison operator `>`. If the statement is `TRUE`, the label `TRUE` should appear in column `QRP_item > 4`. Otherwise, if the value is equal to 4 or smaller, the label should read `FALSE`. We will come back to conditional statements in @sec-wrangling. But long story short, this Boolean expression created the only logical data type in `data_quiz`.\n:::\n\nAnd with this, we are done with the individual walkthrough. Well done :)\n\n\n\n\n\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nThe data we will be using in the upcoming lab activities comes from a randomised controlled trial by Binfet et al. (2021) that was conducted in Canada.\n\n**Citation**\n\n> Binfet, J. T., Green, F. L. L., & Draper, Z. A. (2021). The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. 
*Anthrozoös, 35*(1), 1–22. [https://doi.org/10.1080/08927936.2021.1944558](https://doi.org/10.1080/08927936.2021.1944558){target=\"_blank\"}\n\n**Abstract**\n\n> Researchers have claimed that canine-assisted interventions (CAIs) contribute significantly to bolstering participants' wellbeing, yet the mechanisms within interactions have received little empirical attention. The aim of this study was to assess the impact of client–canine contact on wellbeing outcomes in a sample of 284 undergraduate college students (77% female; 21% male, 2% non-binary). Participants self-selected to participate and were randomly assigned to one of two canine interaction treatment conditions (touch or no touch) or to a handler-only condition with no therapy dog present. To assess self-reports of wellbeing, measures of flourishing, positive and negative affect, social connectedness, happiness, integration into the campus community, stress, homesickness, and loneliness were administered. Exploratory analyses were conducted to assess whether these wellbeing measures could be considered as measuring a unidimensional construct. This included both reliability analysis and exploratory factor analysis. Based on the results of these analyses we created a composite measure using participant scores on a latent factor. We then conducted the tests of the four hypotheses using these factor scores. Results indicate that participants across all conditions experienced enhanced wellbeing on several measures; however, only those in the direct contact condition reported significant improvements on all measures of wellbeing. Additionally, direct interactions with therapy dogs through touch elicited greater wellbeing benefits than did no touch/indirect interactions or interactions with only a dog handler. 
Similarly, analyses using scores on the wellbeing factor indicated significant improvement in wellbeing across all conditions (handler-only, *d* = 0.18, *p* = 0.041; indirect, *d* = 0.38, *p* \\< 0.001; direct, *d* = 0.78, *p* \\< 0.001), with more benefit when a dog was present (*d* = 0.20, *p* \\< 0.001), and the most benefit coming from physical contact with the dog (*d* = 0.13, *p* = 0.002). The findings hold implications for post-secondary wellbeing programs as well as the organization and delivery of CAIs.\n\n\nHowever, we accessed the data via Ciaran Evans' github ([https://github.com/ciaran-evans/dog-data-analysis](https://github.com/ciaran-evans/dog-data-analysis){target=\"_blank\"}). Evans et al. (2023) published a paper that reused the Binfet data for teaching statistics and research methods. If anyone is interested, the accompanying paper is:\n\n> Evans, C., Cipolli, W., Draper, Z. A., & Binfet, J. T. (2023). Repurposing a Peer-Reviewed Publication to Engage Students in Statistics: An Illustration of Study Design, Data Collection, and Analysis. *Journal of Statistics and Data Science Education, 31*(3), 236–247. [https://doi.org/10.1080/26939169.2023.2238018](https://doi.org/10.1080/26939169.2023.2238018){target=\"_blank\"}\n\n**There are a few changes that Evans and we made to the data:**\n\n* Evans removed the demographics ethnicity and gender to make the study data available while protecting participant privacy. 
Which means we'll have limited demographic variables available, but we will make do with what we've got.\n* We modified some of the responses in the raw data csv - for example, we took out impossible response values and replaced them with `NA`.\n* We replaced some of the numbers with labels to increase the difficulty in the dataset for @sec-wrangling and @sec-wrangling2.\n\n\n\n### Task 1: Create a project folder for the lab activities {.unnumbered}\n\nSince we will be working with the same data throughout semester 1, create a separate project for the lab data. Name it something useful, like `lab_data` or `dogs_in_the_lab`. Make sure you are not placing it within the project you have already created today. If you need guidance, see @sec-project above.\n\n\n\n### Task 2: Create a new `.Rmd` file {.unnumbered}\n\n... and name it something useful. If you need help, have a look at @sec-rmd.\n\n\n\n### Task 3: Download the data {.unnumbered}\n\nDownload the data here: [data_pair_ch1](data/data_pair_ch1.zip \"download\"). The zip folder contains the raw data file with responses to individual questions, a cleaned version of the same data in long format and wide format, and the codebook describing the variables in the raw data file and the long format.\n\n**Unzip the folder and place the data files in the same folder as your project.**\n\n\n\n### Task 4: Familiarise yourself with the data {.unnumbered}\n\nOpen the data files, look at the codebook, and perhaps skim over the original Binfet article (methods in particular) to see what kind of measures they used.\n\nRead in the raw data file as `dog_data_raw` and the cleaned-up data (long format) as `dog_data_long`. 
See if you can answer the following questions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\ndog_data_long <- read_csv(\"dog_data_clean_long.csv\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stderr}\n```\nRows: 284 Columns: 136\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (41): GroupAssignment, L2_1, L2_2, L2_3, L2_4, L2_5, L2_6, L2_7, L2_8, L...\ndbl (95): RID, Age_Yrs, Year_of_Study, Live_Pets, Consumer_BARK, S1_1, HO1_1...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\nRows: 568 Columns: 16\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): GroupAssignment, Year_of_Study, Live_Pets, Stage\ndbl (12): RID, Age_Yrs, Consumer_BARK, Flourishing, PANAS_PA, PANAS_NA, SHS,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n* How many participants took part in the study? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nYou can see this from `dog_data_raw`. Each participant ID is on a single row meaning the number of observations is the number of participants.\n\nIf you look at `dog_data_long`, there are 568 observations. Each participant answered the questionnaires pre and post intervention, resulting in 2 rows per participant ID. This means you'd have to divide the number of observations by 2 to get to the number of participants.\n\n:::\n\n* How many different questionnaires did the participants answer? 
\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nThe Binfet paper (e.g., Methods section and/or abstract) and the codebook show it's 9 questionnaires - Flourishing scale (variable `Flourishing`), the UCLA Loneliness scale Version 3 (`Loneliness`), Positive and Negative affect scale (`PANAS_PA` and `PANAS_NA`), the Subjective Happiness scale (`SHS`), the Social connectedness scale (`SCS`), and 3 scales with 1 question each, i.e., perception of stress levels (`Stress`), self-reported level of homesickness (`Homesick`), and integration into the campus community (`Engagement`).\n\nHowever, if you thought `PANAS_PA` and `PANAS_NA` are a single questionnaire, 8 was also acceptable as an answer here.\n\n:::\n\n\n\n\n## [Test your knowledge]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nAre you ready for some knowledge check questions to test your understanding of the chapter? We also have some faulty code chunks. See if you can spot what's wrong with them.\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nOne of the key first steps when we open RStudio is to:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nOpening an existing project (e.g., when coming back to the same dataset) or creating a new project (e.g., for a new task or new dataset) ensures that subsequent `.Rmd` files, any output, figures, etc. are saved within the same folder on your computer (i.e., the working directory). If the `.Rmd` files or data are not in the same folder as \"the project icon\", things can get messy and code might not run.\n\n:::\n\n\n#### Question 2 {.unnumbered}\n\nWhen using the default environment colour settings for RStudio, what colour would the background of a code chunk be in R Markdown? \n\nWhen using the default environment colour settings for RStudio, what colour would the background of normal text be in R Markdown? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nAssuming you have not changed any of the settings in RStudio, code chunks will tend to have a grey background and normal text will tend to have a white background. This is a good way to check that you have closed and opened code chunks correctly.\n\n:::\n\n\n\n#### Question 3 {.unnumbered}\n\nCode chunks start and end with:
\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCode chunks always take the same general format of three backticks followed by curly parentheses and a lower case r inside the parentheses (`{r}`). People often mistake these backticks for single quotes but that will not work. If you have set your code chunk correctly using backticks, the background colour should change to grey from white.\n\n:::\n\n\n\n#### Question 4 {.unnumbered}\n\nWhat is the correct way to include a code chunk in RMarkdown that will be executed but neither the code nor its output will be shown in the final HTML document? \n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain this answer\n\nCheck the table of knitr display options in @sec-chunks.\n\n* {r, echo=FALSE} also executes the code and does not show the code, but it *does* display the result in the knitted html file. (matches 2/3 criteria)\n* {r, eval=FALSE} does not show the results but does *not* execute the code and it *does* show it in the knitted file. (matches 1/3 criteria)\n* {r, results=\"hide\"} executes the code and does not show results, however, it *does* include the code in the knitted html document. (matches 2/3 criteria)\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of these code chunks have mistakes in them; others are not quite producing what was aimed for. Your task is to spot anything faulty, explain why it happened, and perhaps try to fix it.\n\n\n\n#### Question 5 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. You have just started R, created a new `.Rmd` file, and typed the following code into your code chunk.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\n\nHowever, R gives you an error message: `could not find function \"read_csv\"`. 
What could be the reason?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\n\"Could not find function\" is an indication that you have forgotten to load in tidyverse. Because `read_csv()` is a function in the tidyverse collection, R cannot find it.\n\nFIX: Add `library(tidyverse)` prior to reading in the data and run the code chunk again.\n\n:::\n\n\n\n#### Question 6 {.unnumbered}\n\nYou want to read in data with the `read_csv()` function. This time, you are certain you have loaded in tidyverse first. The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data.csv\")\n```\n:::\n\n\nThe error message shows `'data.csv' does not exist in current working directory`. You check your folder and it looks like this:\n\n![](images/error_ch1_01.PNG)\n\nWhy is there an error message?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nR is looking for a csv file that is called data which is currently not in the working directory. We may assume it's in the data folder. Perhaps that happened when unzipping the zip file. So instead of placing the csv file on the same level as the project icon, it was unzipped into a folder named data.\n\nFIX - option 1: Take the `data.csv` out of the data folder and place it next to the project icon and the `.Rmd` file.\n\nFIX - option 2: Modify your R code to tell R that the data is in a separate folder called data, e.g., ...\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata <- read_csv(\"data/data.csv\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\n\nYou want to load `tidyverse` into the library. 
The code is as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\nThe error message says: `Error in library(tidyverse) : there is no package called ‘tidyverse’`\n\nWhy is there an error message and how can we fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nIf R says there is no package called `tidyverse`, it means you haven't installed the package yet. This could be an error message you receive either after switching computers or after a fresh install of R and RStudio.\n\nFIX: Type `install.packages(\"tidyverse\")` into your **Console**.\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nYou knitted your `.Rmd` into an HTML file but the output is not as expected. You see the following:\n\n![](images/error_knitted.PNG)\n\nWhy did the file not knit properly?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThere is a backtick missing in the code chunk. If you check your `.Rmd` file, you can see that the code chunk does not show up in grey which means it's one of the 3 backticks at the beginning of the chunk.\n\n![](images/error_ch1_08.PNG)\n\nFIX: Add a single backtick manually where it's missing.\n\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/02-wrangling/execute-results/html.json b/_freeze/02-wrangling/execute-results/html.json index e5f22d3..dc380b5 100644 --- a/_freeze/02-wrangling/execute-results/html.json +++ b/_freeze/02-wrangling/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "85f2fa0ec2f8baeeb06d4b656ca711c6", + "hash": "f28c870f506255c679a3456fc152f995", "result": { - "markdown": "# Data wrangling I {#sec-wrangling}\n\n## Intended Learning Outcomes {.unnumbered}\n\nIn the next two chapters, we will build on the data wrangling skills from level 1. 
We will revisit all the functions you have already encountered (and might have forgotten over the summer break) and introduce 2 or 3 new functions. These two chapters will provide an opportunity to revise and apply the functions to a novel dataset.\n\nBy the end of this chapter, you should be able to:\n\n- apply familiar data wrangling functions to novel datasets\n- read and interpret error messages\n- realise there are several ways of getting to the results\n- export data objects as csv files\n\nThe main purpose of this chapter and @sec-wrangling2 is to wrangle your data into shape for data visualisation (@sec-dataviz and @sec-dataviz2). For the two chapters, we will:\n\n1. calculate demographics\n2. tidy 3 different questionnaires with varying degrees of complexity\n3. solve an error mode problem\n4. join all data objects together\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nBefore we start, we need to set up some things.\n\n\n## Activity 1: Setup\n\n* We will be working on the **dataset by Pownall et al. (2023)** again, which means we can still use the project we created last week. The data files will already be there, so no need to download them again.\n* To **open the project** in RStudio, go to the folder in which you stored the project and the data last time, and double click on the project icon.\n* **Create a new `.Rmd` file** for chapter 2 and save it to your project folder. Name it something meaningful (e.g., “chapter_02.Rmd”, “02_data_wrangling.Rmd”). 
See @sec-rmd if you need some guidance.\n* In your newly created `.Rmd` file, delete everything below line 12 (after the set-up code chunk).\n\n\n\n## Activity 2: Load in the libraries and read in the data\n\nWe will use `tidyverse` today, and we want to create a data object `data_prp` that stores the data from the file `prp_data_reduced.csv`.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(???)\ndata_prp <- read_csv(\"???\")\n```\n:::\n\n\n\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n\n:::\n\nIf you need a quick reminder what the dataset was about, have a look at the abstract in @sec-download_data_ch1. We also addressed the changes we made to the dataset there.\n\nAnd remember to have a quick `glimpse()` at your data.\n\n\n\n## Activity 3: Calculating demographics\n\nLet’s start with some simple data-wrangling steps to compute demographics for our original dataset, `data_prp`. First, we want to determine how many participants took part in the study by Pownall et al. (2023) and compute the mean age and the standard deviation of age for the sample.\n\n\n\n### ... for the full sample using `summarise()`\n\nThe `summarise()` function is part of the **\"Wickham Six\"** alongside `group_by()`, `select()`, `filter()`, `mutate()`, and `arrange()`. You used them plenty of times last year.\n\nWithin `summarise()`, we can use the `n()` function, which calculates the number of rows in the dataset. 
Since each row corresponds to a unique participant, this gives us the total number of participants.\n\nTo calculate the mean age and the standard deviation of age, we need to use the functions `mean()` and `sd()` on the column `Age` respectively.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: There were 2 warnings in `summarise()`.\nThe first warning was:\nℹ In argument: `mean_age = mean(Age)`.\nCaused by warning in `mean.default()`:\n! argument is not numeric or logical: returning NA\nℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.\n```\n:::\n\n```{.r .cell-code}\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nR did not give us an error message per se, but the output is not quite as expected either. There are `NA` values in the `mean_age` and `sd_age` columns. Looking at the warning message and at `Age`, can you explain what happened?\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nThe warning message says: `argument is not numeric or logical: returning NA` If we look at the `Age` column more closely, we can see that it's a character data type.\n\n:::\n\n\n\n#### Fixing `Age` {.unnumbered}\n\nMight be wise to look at the unique answers in column `Age` to determine what is wrong. We can do that with the function `distinct()`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nage_distinct <- data_prp %>% \n distinct(Age)\n\nage_distinct\n```\n:::\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Show the unique values of `Age`.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Age |\n|:--------|\n|22 |\n|20 |\n|26 |\n|21 |\n|29 |\n|23 |\n|39 |\n|NA |\n|24 |\n|43 |\n|31 |\n|25 years |\n\n
\n:::\n:::\n\n:::\n\n::: columns\n\n::: column\n\nOne cell has the string \"years\" added to their number 25, which has converted the entire column into a character column.\n\nWe can easily fix this by extracting only the numbers from the column and converting it into a numeric data type. The `parse_number()` function, which is part of the `tidyverse` package, handles both steps in one go (so there’s no need to load additional packages).\n\nWe will combine this with the `mutate()` function to create a new column called `Age` (containing those numeric values), effectively replacing the old `Age` column (which had the character values).\n\n:::\n\n::: column\n\n![parse_number() illustration by Allison Horst (see [https://allisonhorst.com/r-packages-functions](https://allisonhorst.com/r-packages-functions){target=\"_blank\"})](images/parse_number.png){width=\"95%\"}\n\n:::\n\n:::\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_prp <- data_prp %>% \n mutate(Age = parse_number(Age))\n\ntypeof(data_prp$Age) # fixed\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n```\n:::\n:::\n\n\n\n\n#### Computing summary stats {.unnumbered}\n\nExcellent. Now that the numbers are in a numeric format, let's try calculating the demographics for the total sample again.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nEven though there's no error or warning, the table still shows `NA` values for `mean_age` and `sd_age`. So, what could possibly be wrong now?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nDid you notice that the `Age` column in `age_distinct` contains some missing values (`NA`)? To be honest, it's easier to spot this issue in the actual R output than in the printed HTML page.\n\n:::\n\n\n\n#### Computing summary stats - third attempt {.unnumbered}\n\nTo ensure R ignores missing values during calculations, we need to add the extra argument `na.rm = TRUE` to the `mean()` and `sd()` functions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age, na.rm = TRUE), # mean age\n sd_age = sd(Age, na.rm = TRUE)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|--------:|\n| 89| 21.88506| 3.485603|\n\n
\n:::\n:::\n\n\nFinally, we’ve got it! 🥳 Third time's the charm!\n\n\n\n### ... per gender using `summarise()` and `group_by()`\n\nNow we want to compute the summary statistics for each gender. The code inside the `summarise()` function remains unchanged; we just need to use the `group_by()` function beforehand to tell R that we want to compute the summary statistics for each group separately. It’s also a good practice to use `ungroup()` afterwards, so you are not taking groupings forward unintentionally.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% # split data up into groups (here Gender)\n summarise(n = n(), # participant number \n mean_age = mean(Age, na.rm = TRUE), # mean age \n sd_age = sd(Age, na.rm = TRUE)) %>% # standard deviation of age\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| mean_age| sd_age|\n|------:|--:|--------:|--------:|\n| 1| 17| 23.31250| 5.770254|\n| 2| 69| 21.57353| 2.738973|\n| 3| 3| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n\n\n### Adding percentages\n\nSometimes, it may be useful to calculate percentages, such as for the gender split. You can do this by adding a line within the `summarise()` function to perform the calculation. All we need to do is take the number of female, male, and non-binary participants (stored in the `n` column of `demo_by_gender`), divide it by the total number of participants (stored in the `n` column of `demo_total`), and multiply by 100. Let's add `percentage` to the `summarise()` function of `demo_by_gender`. Make sure that the code for `percentage` is placed after the value for `n` has been computed.\n\nAccessing the value of `n` for the different gender categories is straightforward because we can refer back to it directly. However, since the total number of participants is stored in a different data object, we need to use a base R operator to access it – specifically the `$` operator. To do this, you simply type the name of the data object (in this case, `demo_total`), followed by the `$` symbol (with no spaces), and then the name of the column you want to retrieve (in this case, `n`). The general pattern is `data$column`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n # n from the line above divided by n from demo_total *100\n percentage = n/demo_total$n *100, \n mean_age = mean(Age, na.rm = TRUE), \n sd_age = sd(Age, na.rm = TRUE)) %>% \n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|--------:|\n| 1| 17| 19.101124| 23.31250| 5.770254|\n| 2| 69| 77.528090| 21.57353| 2.738973|\n| 3| 3| 3.370786| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for decimal places - use `round()`\n\nNot super important, because you could round the values by yourself when writing up your reports, but if you wanted to tidy up the decimal places in the output, you can do that using the `round()` function. You would need to \"wrap\" it around your computations and specify how many decimal places you want to display (for example `mean(Age)` would turn into `round(mean(Age), 1)`). It may look odd for `percentage`; just make sure the number that specifies the decimal places is placed **within** the round function. The default value is 0 (meaning no decimal places).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n percentage = round(n/demo_total$n *100, 2), # percentage with 2 decimal places\n mean_age = round(mean(Age, na.rm = TRUE), 1), # mean Age with 1 decimal place\n sd_age = round(sd(Age, na.rm = TRUE), 3)) %>% # sd Age with 3 decimal places\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|------:|\n| 1| 17| 19.10| 23.3| 5.770|\n| 2| 69| 77.53| 21.6| 2.739|\n| 3| 3| 3.37| 21.3| 1.155|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 4: Questionable Research Practices (QRPs) {#sec-ch2_act4}\n\n#### The main goal is to compute the mean QRP score per participant for time point 1. {.unnumbered}\n\nAt the moment, the data is in wide format. The table below shows data from the first 3 participants:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhead(data_prp, n = 3)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | Gender| Age|Ethnicity | Secondyeargrade| Opptional_mod|Opptional_mod_1_TEXT | Research_exp|Research_exp_1_TEXT | Plan_prereg| SATS28_1_Affect_Time1| SATS28_2_Affect_Time1| SATS28_3_Affect_Time1| SATS28_4_Affect_Time1| SATS28_5_Affect_Time1| SATS28_6_Affect_Time1| SATS28_7_CognitiveCompetence_Time1| SATS28_8_CognitiveCompetence_Time1| SATS28_9_CognitiveCompetence_Time1| SATS28_10_CognitiveCompetence_Time1| SATS28_11_CognitiveCompetence_Time1| SATS28_12_CognitiveCompetence_Time1| SATS28_13_Value_Time1| SATS28_14_Value_Time1| SATS28_15_Value_Time1| SATS28_16_Value_Time1| SATS28_17_Value_Time1| SATS28_18_Value_Time1| SATS28_19_Value_Time1| SATS28_20_Value_Time1| SATS28_21_Value_Time1| SATS28_22_Difficulty_Time1| SATS28_23_Difficulty_Time1| SATS28_24_Difficulty_Time1| SATS28_25_Difficulty_Time1| SATS28_26_Difficulty_Time1| SATS28_27_Difficulty_Time1| SATS28_28_Difficulty_Time1| QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1| QRPs_12NotQRP_Time1| QRPs_13NotQRP_Time1| QRPs_14NotQRP_Time1| QRPs_15NotQRP_Time1|Understanding_OS_1_Time1 |Understanding_OS_2_Time1 |Understanding_OS_3_Time1 |Understanding_OS_4_Time1 |Understanding_OS_5_Time1 |Understanding_OS_6_Time1 |Understanding_OS_7_Time1 |Understanding_OS_8_Time1 |Understanding_OS_9_Time1 |Understanding_OS_10_Time1 |Understanding_OS_11_Time1 |Understanding_OS_12_Time1 | Pre_reg_group| Other_OS_behav_2| Other_OS_behav_4| Other_OS_behav_5| Closely_follow| SATS28_Affect_Time2_mean| SATS28_CognitiveCompetence_Time2_mean| SATS28_Value_Time2_mean| SATS28_Difficulty_Time2_mean| QRPs_Acceptance_Time2_mean| Time2_Understanding_OS| Supervisor_1| Supervisor_2| Supervisor_3| Supervisor_4| Supervisor_5| Supervisor_6| Supervisor_7| Supervisor_8| Supervisor_9| Supervisor_10| Supervisor_11| Supervisor_12| Supervisor_13| Supervisor_14| 
Supervisor_15_R|\n|:----|------:|---:|:--------------|---------------:|-------------:|:------------------------------|------------:|:-------------------|-----------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|----------------------------------:|----------------------------------:|----------------------------------:|-----------------------------------:|-----------------------------------:|-----------------------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:-------------------------|:-------------------------|:-------------------------|-------------:|----------------:|----------------:|----------------:|--------------:|------------------------:|-------------------------------------:|-----------------------:|----------------------------:|--------------------------:|----------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------:|-------------:|-------------:|---------------:|\n|Tr10 | 2| 22|White European | 2| 
1|Research methods in first year | 2|NA | 1| 4| 5| 3| 4| 5| 5| 4| 2| 2| 6| 4| 3| 1| 7| 7| 2| 1| 3| 3| 2| 2| 3| 5| 2| 6| 4| 4| 1| 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7| 2| 1| 1| 2|2 |2 |2 |6 |Entirely confident |Entirely confident |6 |6 |Entirely confident |Entirely confident |Entirely confident |Entirely confident | 1| 1| 1| NA| 2| 3.500000| 4.166667| 3.000000| 2.857143| 5.636364| 5.583333| 5| 5| 6| 6| 5| 5| 1| 5| 6| 5| NA| 4| 4| 5| 1|\n|Bi07 | 2| 20|White British | 3| 2|NA | 2|NA | 3| 5| 6| 2| 5| 5| 6| 2| 2| 2| 7| 3| 5| 1| 7| 7| 1| 1| 6| 3| 1| 1| 2| 6| 2| 7| 2| 5| 7| 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7| 2| 1| 4| 4|2 |Not at all confident |Not at all confident |Not at all confident |6 |Entirely confident |Not at all confident |3 |6 |6 |2 |2 | 1| NA| NA| NA| 2| 3.166667| 4.666667| 6.222222| 2.857143| 5.454546| 3.333333| 7| 6| 7| 7| 7| 7| 1| 5| 7| 7| 7| 5| 2| 7| 1|\n|SK03 | 2| 22|White British | 1| 2|NA | 2|NA | 1| 5| 3| 5| 2| 5| 2| 2| 2| 2| 6| 5| 3| 2| 6| 6| 3| 3| 5| 3| 4| 3| 5| 5| 2| 5| 2| 5| 5| 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7| 1| 1| 3| 2|6 |2 |3 |6 |6 |5 |2 |5 |5 |5 |4 |5 | 1| NA| NA| NA| 2| 4.833333| 6.166667| 6.000000| 4.000000| 6.272727| 5.416667| 7| 7| 7| 7| 7| 7| 1| 7| 7| 7| 7| 7| 5| 7| 1|\n\n
\n:::\n:::\n\n

\n\nLooking at the QRP data at time point 1, you determine that\n\n* individual item columns are , and\n* according to the codebook, there are reverse-coded items in this questionnaire.\n\nAccording to the codebook and the data table above, we just have to **compute the average score for QRP items to **, since items to are distractor items. Seems quite straightforward.\n\nHowever, as you can see in the table above, each item is in a separate column, meaning the data is in **wide format**. It would be much easier to calculate the mean scores if the items were arranged in **long format**.\n\n\nLet’s tackle this problem step by step. It’s best to create a separate data object for this. If we tried to compute it within `data_prp`, it could quickly become messy.\n\n\n* **Step 1**: Select the relevant columns `Code`, and `QRPs_1_Time1` to `QRPs_11_Time1` and store them in an object called `qrp_t1`\n* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can calculate the average score more easily (in step 3)\n* **Step 3**: Calculate the average QRP score (`QRPs_Acceptance_Time1_mean`) per participant using `group_by()` and `summarise()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_t1 <- data_prp %>% \n # Step 1\n select(Code, QRPs_1_Time1:QRPs_11_Time1) %>%\n # Step 2\n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(Code) %>% # grouping by participant id\n summarise(QRPs_Acceptance_Time1_mean = mean(Scores)) %>% # calculating the average Score\n ungroup() # just make it a habit\n```\n:::\n\n\n::: {.callout-caution icon=\"false\" collapse=\"true\"}\n\n## Explain the individual functions\n\n::: panel-tabset\n\n## `select()`\n\nThe `select()` function allows you to include or exclude certain variables (columns). Here we want to focus on the participant ID column (i.e., `Code`) and the QRP items at time point 1. 
We can either list them all individually, i.e., Code, QRPs_1_Time1, QRPs_2_Time1, QRPs_3_Time1, and so forth (you get the gist), but that would take forever to type.\n\nA shortcut is to use the colon operator `:`. It allows us to select all columns that fall within the range of `first_column_name` to `last_column_name`. We can apply this here since the QRP items (1 to 11) are sequentially listed in `data_prp`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step1 <- data_prp %>% \n select(Code, QRPs_1_Time1:QRPs_11_Time1)\n\n# show first 5 rows of qrp_step1\nhead(qrp_step1, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1|\n|:----|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|\n|Tr10 | 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7|\n|Bi07 | 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7|\n|SK03 | 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7|\n|SM95 | 7| 7| 2| 6| 7| 5| 7| 7| 4| 2| 4|\n|St01 | 7| 7| 6| 7| 2| 7| 7| 7| 7| 5| 7|\n\n
\n:::\n:::\n\n\nHow many rows/observations and columns/variables do we have in `qrp_step1`?\n\n* rows/observations: \n* columns/variables: \n\n## `pivot_longer()`\n\nAs you can see, the table we got from Step 1 is in wide format. To get it into long format, we need to define:\n\n* the columns that need to be reshuffled from wide into long format (`cols` argument). Here we selected \"everything except the `Code` column\", as indicated by `-Code` \\[minus `Code`\\]. However, `QRPs_1_Time1:QRPs_11_Time1` would also work and give you the exact same result.\n* the `names_to` argument. R creates a new column in which all the column names from the columns you selected in `cols` will be stored. Here we are naming this column \"Items\" but you could pick something equally sensible if you like.\n* the `values_to` argument. R creates this second column to store all responses the participants gave to the individual questions, i.e., all the numbers in this case. We named it \"Scores\" here, but you could have called it something different, like \"Responses\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step2 <- qrp_step1 %>% \n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\")\n\n# show first 15 rows of qrp_step2\nhead(qrp_step2, n = 15)\n```\n\n::: {.cell-output-display}\n
\n\n|Code |Items | Scores|\n|:----|:-------------|------:|\n|Tr10 |QRPs_1_Time1 | 7|\n|Tr10 |QRPs_2_Time1 | 7|\n|Tr10 |QRPs_3_Time1 | 5|\n|Tr10 |QRPs_4_Time1 | 7|\n|Tr10 |QRPs_5_Time1 | 3|\n|Tr10 |QRPs_6_Time1 | 4|\n|Tr10 |QRPs_7_Time1 | 5|\n|Tr10 |QRPs_8_Time1 | 7|\n|Tr10 |QRPs_9_Time1 | 6|\n|Tr10 |QRPs_10_Time1 | 7|\n|Tr10 |QRPs_11_Time1 | 7|\n|Bi07 |QRPs_1_Time1 | 7|\n|Bi07 |QRPs_2_Time1 | 7|\n|Bi07 |QRPs_3_Time1 | 2|\n|Bi07 |QRPs_4_Time1 | 7|\n\n
\n:::\n:::\n\n\nNow, have a look at `qrp_step2`. In total, we now have rows/observations, per participant, and columns/variables.\n\n## `group_by()` and `summarise()`\n\nThis follows exactly the same sequence we used when calculating descriptive statistics by gender. The only difference is that we are now grouping the data by the participant's `Code` instead of `Gender`.\n\n`summarise()` works exactly the same way: `summarise(new_column_name = function_to_calculate_something(column_name_of_numeric_values))`\n\nThe `function_to_calculate_something` can be `mean()`, `sd()` or `sum()` for mean scores, standard deviations, or summed-up scores respectively. You could also use `min()` or `max()` if you wanted to determine the lowest or the highest score for each participant.\n\n:::\n\n:::\n\n::: callout-tip\n\nYou could **rename the columns whilst selecting** them. The pattern would be `select(new_name = old_name)`. For example, if we wanted to select variable `Code` and rename it as `Participant_ID`, we could do that.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrenaming_col <- data_prp %>% \n select(Participant_ID = Code)\n\nhead(renaming_col, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Participant_ID |\n|:--------------|\n|Tr10 |\n|Bi07 |\n|SK03 |\n|SM95 |\n|St01 |\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 5: Knitting\n\nOnce you've completed your R Markdown file, the final step is to \"knit\" it, which converts the `.Rmd` file into an HTML file. Knitting combines your code, text, and output (like tables and plots) into a single cohesive document. This is a really good way to check your code is working.\n\nTo knit the file, **click the Knit button** at the top of your RStudio window. The document will be generated and, depending on your settings, automatically opened in the viewer in the `Output pane` or an external browser window.\n\nIf any errors occur during knitting, RStudio will show you an error message with details to help you troubleshoot.\n\nIf you want to **intentionally keep any errors** we tackled today as a reference for how you solved them, you could add `error=TRUE` or `eval=FALSE` to the code chunk that isn't running.\n\n\n\n## Activity 6: Export a data object as a csv\n\nTo avoid having to repeat the same steps in the next chapter, it's a good idea to save the data objects you've created today as csv files. You can do this by using the `write_csv()` function from the `readr` package. The csv files will appear in your project folder.\n\nThe basic syntax is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_object, \"filename.csv\")\n```\n:::\n\n\nNow, let's export the objects `data_prp` and `qrp_t1`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_prp, \"data_prp_for_ch3.csv\")\n```\n:::\n\n\nHere we named the file `data_prp_for_ch3.csv`, so we wouldn't overwrite the original data csv file `prp_data_reduced.csv`. 
However, feel free to choose a name that makes sense to you.\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nExport `qrp_t1`.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(qrp_t1, \"qrp_t1.csv\")\n```\n:::\n\n\n:::\n\n:::\n\nCheck that your csv files have appeared in your project folder, and you're all set!\n\n**That’s it for Chapter 2: Individual Walkthrough.**\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n\nWe will continue working with the data from Binfet et al. (2021), focusing on the randomised controlled trial of therapy dog interventions. Today, our goal is to **calculate an average `Flourishing` score for each participant** at time point 1 (pre-intervention) using the raw data file `dog_data_raw`. Currently, the data looks like this:\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| F1_1| F1_2| F1_3| F1_4| F1_5| F1_6| F1_7| F1_8|\n|---:|----:|----:|----:|----:|----:|----:|----:|----:|\n| 1| 6| 7| 5| 5| 7| 7| 6| 6|\n| 2| 5| 7| 6| 5| 5| 5| 5| 4|\n| 3| 5| 5| 5| 6| 6| 6| 5| 5|\n| 4| 7| 6| 7| 7| 7| 6| 7| 4|\n| 5| 5| 5| 4| 6| 7| 7| 7| 6|\n\n
\n:::\n:::\n\n\n\nHowever, we want the data to look like this:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| Flourishing_pre|\n|---:|---------------:|\n| 1| 6.125|\n| 2| 5.250|\n| 3| 5.375|\n| 4| 6.375|\n| 5| 5.875|\n\n
\n:::\n:::\n\n\n\n\n### Task 1: Open the R project you created last week {.unnumbered}\n\nIf you haven’t created an R project for the lab yet, please do so now. If you already have one set up, go ahead and open it.\n\n\n### Task 2: Open your `.Rmd` file from last week {.unnumbered}\n\nSince we haven’t used it much yet, feel free to continue using the `.Rmd` file you created last week in Task 2.\n\n\n### Task 3: Load in the library and read in the data {.unnumbered}\n\nThe data should be in your project folder. If you didn’t download it last week, or if you’d like a fresh copy, you can download the data again here: [data_pair_ch1](data/data_pair_ch1.zip \"download\").\n\nWe will be using the `tidyverse` package today, and the data file we need to read in is `dog_data_raw.csv`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(???)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"???\")\n```\n:::\n\n\n:::\n\n\n### Task 4: Calculating the mean for `Flourishing_pre` {.unnumbered}\n\n\n* **Step 1**: Select all relevant columns from `dog_data_raw`, including participant ID and all items from the `Flourishing` questionnaire completed before the intervention. Store this data in an object called `data_flourishing`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nLook at the codebook. Try to determine:\n\n* The variable name of the column where the participant ID is stored.\n* The items related to the Flourishing scale at the pre-intervention stage.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nFrom the codebook, we know that:\n\n* The participant ID column is called `RID`.\n* The Flourishing items at the pre-intervention stage start with `F1_`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_flourishing <- ??? 
%>% \n select(???, F1_???:F1_???)\n```\n:::\n\n\n:::\n\n:::\n\n\n* **Step 2**: Pivot the data from wide format to long format so we can calculate the average score more easily (in step 3).\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nWhich pivot function should you use? We have `pivot_wider()` and `pivot_longer()` to choose from.\n\nWe also need 3 arguments in that function:\n\n* The columns you want to select (e.g., all the Flourishing items),\n* The name of the column where the current column headings will be stored (e.g., \"Questionnaire\"),\n* The name of the column that should store all the values (e.g., \"Responses\").\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nWe need `pivot_longer()`. You already encountered `pivot_longer()` in first year (or in the individual walkthrough if you have already completed this chapter). The 3 arguments were also a giveaway; `pivot_wider()` only requires 2 arguments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n pivot_longer(cols = ???, names_to = \"???\", values_to = \"???\")\n```\n:::\n\n\n:::\n\n:::\n\n* **Step 3**: Calculate the average Flourishing score per participant and name this column `Flourishing_pre` to match the table above.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nBefore summarising the mean, you may need to group the data.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nTo compute an average score **per participant**, we would need to group by participant ID first.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n group_by(???) 
%>% \n summarise(Flourishing_pre = mean(???)) %>% \n ungroup()\n```\n:::\n\n:::\n\n:::\n\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(tidyverse)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\n\n# Task 4: Tidying \ndata_flourishing <- dog_data_raw %>% \n # Step 1\n select(RID, F1_1:F1_8) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Questionnaire\", values_to = \"Responses\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_pre = mean(Responses)) %>% # column name must match values_to in Step 2\n ungroup()\n```\n:::\n\n\n:::\n\n\n\n## [Test your knowledge and challenge yourself]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nWhich function of the Wickham Six would you use to include or exclude certain variables (columns)? \n\n\n#### Question 2 {.unnumbered}\n\nWhich function of the Wickham Six would you use to create new columns or modify existing columns in a dataframe? \n\n\n#### Question 3 {.unnumbered}\n\n\nWhich function of the Wickham Six would you use to organise data into groups based on one or more columns? \n\n\n\n#### Question 4 {.unnumbered}\n\nWhich function of the Wickham Six would you use to sort the rows of a dataframe based on the values in one or more columns? \n\n\n\n#### Question 5 {.unnumbered}\n\nWhich function of the Wickham Six would NOT modify the original dataframe? 
\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain these answers\n\n| Function | Description |\n|:-------------|:------------------------------------------------------|\n| `select()` | Include or exclude certain variables/columns |\n| `filter()` | Include or exclude certain observations/rows |\n| `mutate()` | Creates new columns or modifies existing ones |\n| `arrange()` | Changes the order of the rows |\n| `group_by()` | Split data into groups based on one or more variables |\n| `summarise()`| Creates a new dataframe returning one row for each combination of grouping variables |\n\n\nTechnically, the first five functions operate on the existing data object, making adjustments like sorting the data (e.g., with `arrange()`), reducing the number of rows (e.g., with `filter()`), reducing the number of columns (e.g., with `select()`), or adding new columns (e.g., with `mutate()`). In contrast, `summarise()` fundamentally alters the structure of the original dataframe by generating a completely new dataframe that contains only summary statistics, rather than retaining the original rows and columns.\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of the code chunks contain mistakes and result in errors, while others do not produce the expected results. Your task is to identify any issues, explain why they occurred, and, if possible, fix them.\n\nWe will use a few built-in datasets, such as `billboard` and `starwars`, to help you replicate the errors in your own R environment. You can view the data either by typing the dataset name directly into your console or by storing the data as a separate object in your `Global Environment`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbillboard\n\nstarwars_data = starwars\n```\n:::\n\n\n\n\n#### Question 6 {.unnumbered}\n\nCurrently, the weekly song rankings for Billboard Top 100 in 2000 are in wide format, with each week in a separate column. 
The following code is supposed to transpose the wide-format `billboard` data into long format:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(names_to = \"weeks\", values_to = \"rank\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `pivot_longer()`:\n! `cols` must select at least one column.\n```\n:::\n:::\n\n\nWhat does this error message mean and how do you fix it?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe error message indicates that the `cols` argument is missing in the function. This means the function doesn’t know which columns to transpose from wide format to long format.\n\nFIX: Add `cols = wk1:wk76` to the function to select columns from wk1 to wk76. Alternatively, `cols = starts_with(\"wk\")` would also work since all columns start with the letter combination \"wk\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(cols = wk1:wk76, names_to = \"weeks\", values_to = \"rank\")\n# OR\nlong_data <- billboard %>% \n pivot_longer(cols = starts_with(\"wk\"), names_to = \"weeks\", values_to = \"rank\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\nThe following code is intended to calculate the mean height of all the characters in the built-in `starwars` dataset, grouped by their gender. 
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = height)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\ndplyr 1.1.0.\nℹ Please use `reframe()` instead.\nℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n always returns an ungrouped data frame and adjust accordingly.\n```\n:::\n:::\n\n\nThe code runs, but it's giving us some weird warning and the output is also not as expected. What steps should we take to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe aggregation function `mean()` is missing from within `summarise()`. Without it, the function does not perform any aggregation and returns *all* rows with only the columns for gender and height.\n\nFIX: Wrap the `mean()` function around the variable you want to aggregate, here `height`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height))\n```\n:::\n\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nFollowing up on Question 7, we now have `summary_data` that looks approximately correct - it has the expected rows and column numbers, however, the cell values are \"weird\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | NA|\n|masculine | NA|\n|NA | 175|\n\n
\n:::\n:::\n\n\n\nCan you explain what is happening here? And how can we modify the code to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nLook at the original `starwars` data. You will notice that some of the characters with feminine and masculine gender entries have missing height values. However, all four characters without a specified gender have provided their height.\n\nFIX: We need to add `na.rm = TRUE` to the `mean()` function to ensure that R ignores missing values before aggregating the data.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height, na.rm = TRUE))\n\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | 166.5333|\n|masculine | 176.5323|\n|NA | 175.0000|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n### Challenge yourself {.unnumbered}\n\nIf you want to **challenge yourself** and further apply the skills from Chapter 2, you can wrangle the data from `dog_data_raw` for additional questionnaires from either the pre- and/or post-intervention stages:\n\n* Calculate the mean score for `flourishing_post` for each participant.\n* Calculate the mean score for the `PANAS` (Positive and/or Negative Affect) per participant\n* Calculate the mean score for happiness (`SHS`) per participant\n\nThe 3 steps are equivalent for those questionnaires - select, pivot, group_by and summarise; you just have to \"replace\" the questionnaire items involved.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution for **Challenge yourself**\n\nFlourishing post-intervention\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## flourishing_post\nflourishing_post <- dog_data_raw %>% \n # Step 1\n select(RID, starts_with(\"F2\")) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Names\", values_to = \"Response\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_post = mean(Response)) %>% \n ungroup()\n```\n:::\n\n\nThe PANAS could be solved more concisely with the skills we learn in @sec-wrangling2, but for now, you would have solved it this way:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# PANAS - positive affect pre\nPANAS_PA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_3, PN1_5, PN1_7, PN1_8, PN1_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - positive affect post\nPANAS_PA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_3, PN2_5, PN2_7, PN2_8, PN2_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_post = mean(Scores)) %>% \n 
ungroup()\n\n# PANAS - negative affect pre\nPANAS_NA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_1, PN1_2, PN1_4, PN1_6, PN1_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - negative affect post\nPANAS_NA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_1, PN2_2, PN2_4, PN2_6, PN2_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_post = mean(Scores)) %>% \n ungroup()\n```\n:::\n\n\nHappiness scale\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# happiness_pre\nhappiness_pre <- dog_data_raw %>% \n # Step 1\n select(RID, HA1_1, HA1_2, HA1_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_pre = mean(Score)) %>% \n ungroup()\n\n#happiness_post\nhappiness_post <- dog_data_raw %>% \n # Step 1\n select(RID, HA2_1, HA2_2, HA2_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_post = mean(Score)) %>% \n ungroup()\n```\n:::\n\n\n:::\n", + "markdown": "# Data wrangling I {#sec-wrangling}\n\n## Intended Learning Outcomes {.unnumbered}\n\nIn the next two chapters, we will build on the data wrangling skills from level 1. We will revisit all the functions you have already encountered (and might have forgotten over the summer break) and introduce 2 or 3 new functions. 
These two chapters will provide an opportunity to revise and apply the functions to a novel dataset.\n\nBy the end of this chapter, you should be able to:\n\n- apply familiar data wrangling functions to novel datasets\n- read and interpret error messages\n- realise there are several ways of getting to the results\n- export data objects as csv files\n\nThe main purpose of this chapter and @sec-wrangling2 is to wrangle your data into shape for data visualisation (@sec-dataviz and @sec-dataviz2). For the two chapters, we will:\n\n1. calculate demographics\n2. tidy 3 different questionnaires with varying degrees of complexity\n3. solve an error mode problem\n4. join all data objects together\n\n## [Individual Walkthrough]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\nBefore we start, we need to set up some things.\n\n\n## Activity 1: Setup\n\n* We will be working on the **dataset by Pownall et al. (2023)** again, which means we can still use the project we created last week. The data files will already be there, so no need to download them again.\n* To **open the project** in RStudio, go to the folder in which you stored the project and the data last time, and double click on the project icon.\n* **Create a new `.Rmd` file** for chapter 2 and save it to your project folder. Name it something meaningful (e.g., “chapter_02.Rmd”, “02_data_wrangling.Rmd”). 
See @sec-rmd if you need some guidance.\n* In your newly created `.Rmd` file, delete everything below line 12 (after the set-up code chunk).\n\n\n\n## Activity 2: Load in the libraries and read in the data\n\nWe will use `tidyverse` today, and we want to create a data object `data_prp` that stores the data from the file `prp_data_reduced.csv`.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(???)\ndata_prp <- read_csv(\"???\")\n```\n:::\n\n\n\n\n:::\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata_prp <- read_csv(\"prp_data_reduced.csv\")\n```\n:::\n\n\n:::\n\nIf you need a quick reminder what the dataset was about, have a look at the abstract in @sec-download_data_ch1. We also addressed the changes we made to the dataset there.\n\nAnd remember to have a quick `glimpse()` at your data.\n\n\n\n## Activity 3: Calculating demographics\n\nLet’s start with some simple data-wrangling steps to compute demographics for our original dataset, `data_prp`. First, we want to determine how many participants took part in the study by Pownall et al. (2023) and compute the mean age and the standard deviation of age for the sample.\n\n\n\n### ... for the full sample using `summarise()`\n\nThe `summarise()` function is part of the **\"Wickham Six\"** alongside `group_by()`, `select()`, `filter()`, `mutate()`, and `arrange()`. You used them plenty of times last year.\n\nWithin `summarise()`, we can use the `n()` function, which calculates the number of rows in the dataset. 
Since each row corresponds to a unique participant, this gives us the total number of participants.\n\nTo calculate the mean age and the standard deviation of age, we need to use the functions `mean()` and `sd()` on the column `Age` respectively.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: There were 2 warnings in `summarise()`.\nThe first warning was:\nℹ In argument: `mean_age = mean(Age)`.\nCaused by warning in `mean.default()`:\n! argument is not numeric or logical: returning NA\nℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.\n```\n:::\n\n```{.r .cell-code}\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nR did not give us an error message per se, but the output is not quite as expected either. There are `NA` values in the `mean_age` and `sd_age` columns. Looking at the warning message and at `Age`, can you explain what happened?\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nThe warning message says: `argument is not numeric or logical: returning NA` If we look at the `Age` column more closely, we can see that it's a character data type.\n\n:::\n\n\n\n#### Fixing `Age` {.unnumbered}\n\nMight be wise to look at the unique answers in column `Age` to determine what is wrong. We can do that with the function `distinct()`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nage_distinct <- data_prp %>% \n distinct(Age)\n\nage_distinct\n```\n:::\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Show the unique values of `Age`.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n|Age |\n|:--------|\n|22 |\n|20 |\n|26 |\n|21 |\n|29 |\n|23 |\n|39 |\n|NA |\n|24 |\n|43 |\n|31 |\n|25 years |\n\n
\n:::\n:::\n\n:::\n\n::: columns\n\n::: column\n\nOne cell has the string \"years\" added to their number 25, which has converted the entire column into a character column.\n\nWe can easily fix this by extracting only the numbers from the column and converting it into a numeric data type. The `parse_number()` function, which is part of the `tidyverse` package, handles both steps in one go (so there’s no need to load additional packages).\n\nWe will combine this with the `mutate()` function to create a new column called `Age` (containing those numeric values), effectively replacing the old `Age` column (which had the character values).\n\n:::\n\n::: column\n\n![parse_number() illustration by Allison Horst (see [https://allisonhorst.com/r-packages-functions](https://allisonhorst.com/r-packages-functions){target=\"_blank\"})](images/parse_number.png){width=\"95%\"}\n\n:::\n\n:::\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_prp <- data_prp %>% \n mutate(Age = parse_number(Age))\n\ntypeof(data_prp$Age) # fixed\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n```\n:::\n:::\n\n\n\n\n#### Computing summary stats {.unnumbered}\n\nExcellent. Now that the numbers are in a numeric format, let's try calculating the demographics for the total sample again.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age), # mean age\n sd_age = sd(Age)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|------:|\n| 89| NA| NA|\n\n
\n:::\n:::\n\n\nEven though there's no error or warning, the table still shows `NA` values for `mean_age` and `sd_age`. So, what could possibly be wrong now?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Answer\n\nDid you notice that the `Age` column in `age_distinct` contains some missing values (`NA`)? To be honest, it's easier to spot this issue in the actual R output than in the printed HTML page.\n\n:::\n\n\n\n#### Computing summary stats - third attempt {.unnumbered}\n\nTo ensure R ignores missing values during calculations, we need to add the extra argument `na.rm = TRUE` to the `mean()` and `sd()` functions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_total <- data_prp %>% \n summarise(n = n(), # participant number\n mean_age = mean(Age, na.rm = TRUE), # mean age\n sd_age = sd(Age, na.rm = TRUE)) # standard deviation of age\n\ndemo_total\n```\n\n::: {.cell-output-display}\n
\n\n| n| mean_age| sd_age|\n|--:|--------:|--------:|\n| 89| 21.88506| 3.485603|\n\n
\n:::\n:::\n\n\nFinally, we’ve got it! 🥳 Third time's the charm!\n\n\n\n### ... per gender using `summarise()` and `group_by()`\n\nNow we want to compute the summary statistics for each gender. The code inside the `summarise()` function remains unchanged; we just need to use the `group_by()` function beforehand to tell R that we want to compute the summary statistics for each group separately. It’s also a good practice to use `ungroup()` afterwards, so you are not taking groupings forward unintentionally.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% # split data up into groups (here Gender)\n summarise(n = n(), # participant number \n mean_age = mean(Age, na.rm = TRUE), # mean age \n sd_age = sd(Age, na.rm = TRUE)) %>% # standard deviation of age\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| mean_age| sd_age|\n|------:|--:|--------:|--------:|\n| 1| 17| 23.31250| 5.770254|\n| 2| 69| 21.57353| 2.738973|\n| 3| 3| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n\n\n### Adding percentages\n\nSometimes, it may be useful to calculate percentages, such as for the gender split. You can do this by adding a line within the `summarise()` function to perform the calculation. All we need to do is take the number of female, male, and non-binary participants (stored in the `n` column of `demo_by_gender`), divide it by the total number of participants (stored in the `n` column of `demo_total`), and multiply by 100. Let's add `percentage` to the `summarise()` function of `demo_by_gender`. Make sure that the code for `percentage` is placed after the value for `n` has been computed.\n\nAccessing the value of `n` for the different gender categories is straightforward because we can refer back to it directly. However, since the total number of participants is stored in a different data object, we need to use a base R function to access it – specifically the `$` operator. To do this, you simply type the name of the data object (in this case, `demo_total`), followed by the `$` symbol (with no spaces), and then the name of the column you want to retrieve (in this case, `n`). The general pattern is `data$column`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n # n from the line above divided by n from demo_total *100\n percentage = n/demo_total$n *100, \n mean_age = mean(Age, na.rm = TRUE), \n sd_age = sd(Age, na.rm = TRUE)) %>% \n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|--------:|\n| 1| 17| 19.101124| 23.31250| 5.770254|\n| 2| 69| 77.528090| 21.57353| 2.738973|\n| 3| 3| 3.370786| 21.33333| 1.154700|\n\n
\n:::\n:::\n\n\n::: {.callout-tip collapse=\"true\"}\n\n## Tip for decimal places - use `round()`\n\nNot super important, because you could round the values yourself when writing up your reports, but if you want to tidy up the decimal places in the output, you can do that using the `round()` function. You would need to \"wrap\" it around your computations and specify how many decimal places you want to display (for example `mean(Age)` would turn into `round(mean(Age), 1)`). It may look odd for `percentage`; just make sure the number that specifies the decimal places is placed **within** the round function. The default value is 0 (meaning no decimal places).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_by_gender <- data_prp %>% \n group_by(Gender) %>% \n summarise(n = n(), \n percentage = round(n/demo_total$n *100, 2), # percentage with 2 decimal places\n mean_age = round(mean(Age, na.rm = TRUE), 1), # mean Age with 1 decimal place\n sd_age = round(sd(Age, na.rm = TRUE), 3)) %>% # sd Age with 3 decimal places\n ungroup()\n\ndemo_by_gender\n```\n\n::: {.cell-output-display}\n
\n\n| Gender| n| percentage| mean_age| sd_age|\n|------:|--:|----------:|--------:|------:|\n| 1| 17| 19.10| 23.3| 5.770|\n| 2| 69| 77.53| 21.6| 2.739|\n| 3| 3| 3.37| 21.3| 1.155|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 4: Questionable Research Practices (QRPs) {#sec-ch2_act4}\n\n#### The main goal is to compute the mean QRP score per participant for time point 1. {.unnumbered}\n\nAt the moment, the data is in wide format. The table below shows data from the first 3 participants:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhead(data_prp, n = 3)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | Gender| Age|Ethnicity | Secondyeargrade| Opptional_mod|Opptional_mod_1_TEXT | Research_exp|Research_exp_1_TEXT | Plan_prereg| SATS28_1_Affect_Time1| SATS28_2_Affect_Time1| SATS28_3_Affect_Time1| SATS28_4_Affect_Time1| SATS28_5_Affect_Time1| SATS28_6_Affect_Time1| SATS28_7_CognitiveCompetence_Time1| SATS28_8_CognitiveCompetence_Time1| SATS28_9_CognitiveCompetence_Time1| SATS28_10_CognitiveCompetence_Time1| SATS28_11_CognitiveCompetence_Time1| SATS28_12_CognitiveCompetence_Time1| SATS28_13_Value_Time1| SATS28_14_Value_Time1| SATS28_15_Value_Time1| SATS28_16_Value_Time1| SATS28_17_Value_Time1| SATS28_18_Value_Time1| SATS28_19_Value_Time1| SATS28_20_Value_Time1| SATS28_21_Value_Time1| SATS28_22_Difficulty_Time1| SATS28_23_Difficulty_Time1| SATS28_24_Difficulty_Time1| SATS28_25_Difficulty_Time1| SATS28_26_Difficulty_Time1| SATS28_27_Difficulty_Time1| SATS28_28_Difficulty_Time1| QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1| QRPs_12NotQRP_Time1| QRPs_13NotQRP_Time1| QRPs_14NotQRP_Time1| QRPs_15NotQRP_Time1|Understanding_OS_1_Time1 |Understanding_OS_2_Time1 |Understanding_OS_3_Time1 |Understanding_OS_4_Time1 |Understanding_OS_5_Time1 |Understanding_OS_6_Time1 |Understanding_OS_7_Time1 |Understanding_OS_8_Time1 |Understanding_OS_9_Time1 |Understanding_OS_10_Time1 |Understanding_OS_11_Time1 |Understanding_OS_12_Time1 | Pre_reg_group| Other_OS_behav_2| Other_OS_behav_4| Other_OS_behav_5| Closely_follow| SATS28_Affect_Time2_mean| SATS28_CognitiveCompetence_Time2_mean| SATS28_Value_Time2_mean| SATS28_Difficulty_Time2_mean| QRPs_Acceptance_Time2_mean| Time2_Understanding_OS| Supervisor_1| Supervisor_2| Supervisor_3| Supervisor_4| Supervisor_5| Supervisor_6| Supervisor_7| Supervisor_8| Supervisor_9| Supervisor_10| Supervisor_11| Supervisor_12| Supervisor_13| Supervisor_14| 
Supervisor_15_R|\n|:----|------:|---:|:--------------|---------------:|-------------:|:------------------------------|------------:|:-------------------|-----------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|----------------------------------:|----------------------------------:|----------------------------------:|-----------------------------------:|-----------------------------------:|-----------------------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|---------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|--------------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------------:|-------------------:|-------------------:|-------------------:|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:-------------------------|:-------------------------|:-------------------------|-------------:|----------------:|----------------:|----------------:|--------------:|------------------------:|-------------------------------------:|-----------------------:|----------------------------:|--------------------------:|----------------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|-------------:|-------------:|-------------:|---------------:|\n|Tr10 | 2| 22|White European | 2| 
1|Research methods in first year | 2|NA | 1| 4| 5| 3| 4| 5| 5| 4| 2| 2| 6| 4| 3| 1| 7| 7| 2| 1| 3| 3| 2| 2| 3| 5| 2| 6| 4| 4| 1| 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7| 2| 1| 1| 2|2 |2 |2 |6 |Entirely confident |Entirely confident |6 |6 |Entirely confident |Entirely confident |Entirely confident |Entirely confident | 1| 1| 1| NA| 2| 3.500000| 4.166667| 3.000000| 2.857143| 5.636364| 5.583333| 5| 5| 6| 6| 5| 5| 1| 5| 6| 5| NA| 4| 4| 5| 1|\n|Bi07 | 2| 20|White British | 3| 2|NA | 2|NA | 3| 5| 6| 2| 5| 5| 6| 2| 2| 2| 7| 3| 5| 1| 7| 7| 1| 1| 6| 3| 1| 1| 2| 6| 2| 7| 2| 5| 7| 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7| 2| 1| 4| 4|2 |Not at all confident |Not at all confident |Not at all confident |6 |Entirely confident |Not at all confident |3 |6 |6 |2 |2 | 1| NA| NA| NA| 2| 3.166667| 4.666667| 6.222222| 2.857143| 5.454546| 3.333333| 7| 6| 7| 7| 7| 7| 1| 5| 7| 7| 7| 5| 2| 7| 1|\n|SK03 | 2| 22|White British | 1| 2|NA | 2|NA | 1| 5| 3| 5| 2| 5| 2| 2| 2| 2| 6| 5| 3| 2| 6| 6| 3| 3| 5| 3| 4| 3| 5| 5| 2| 5| 2| 5| 5| 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7| 1| 1| 3| 2|6 |2 |3 |6 |6 |5 |2 |5 |5 |5 |4 |5 | 1| NA| NA| NA| 2| 4.833333| 6.166667| 6.000000| 4.000000| 6.272727| 5.416667| 7| 7| 7| 7| 7| 7| 1| 7| 7| 7| 7| 7| 5| 7| 1|\n\n
\n:::\n:::\n\n

\n\nLooking at the QRP data at time point 1, you determine that\n\n* individual item columns are , and\n* according to the codebook, there are reverse-coded items in this questionnaire.\n\nAccording to the codebook and the data table above, we just have to **compute the average score for QRP items to **, since items to are distractor items. Seems quite straightforward.\n\nHowever, as you can see in the table above, each item is in a separate column, meaning the data is in **wide format**. It would be much easier to calculate the mean scores if the items were arranged in **long format**.\n\n\nLet’s tackle this problem step by step. It’s best to create a separate data object for this. If we tried to compute it within `data_prp`, it could quickly become messy.\n\n\n* **Step 1**: Select the relevant columns (`Code`, and `QRPs_1_Time1` to `QRPs_11_Time1`) and store them in an object called `qrp_t1`\n* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can calculate the average score more easily (in step 3)\n* **Step 3**: Calculate the average QRP score (`QRPs_Acceptance_Time1_mean`) per participant using `group_by()` and `summarise()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_t1 <- data_prp %>% \n # Step 1\n select(Code, QRPs_1_Time1:QRPs_11_Time1) %>%\n # Step 2\n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(Code) %>% # grouping by participant id\n summarise(QRPs_Acceptance_Time1_mean = mean(Scores)) %>% # calculating the average Score\n ungroup() # just make it a habit\n```\n:::\n\n\n::: {.callout-caution icon=\"false\" collapse=\"true\"}\n\n## Explain the individual functions\n\n::: panel-tabset\n\n## `select()`\n\nThe `select()` function allows you to include or exclude certain variables (columns). Here we want to focus on the participant ID column (i.e., `Code`) and the QRP items at time point 1. 
We could list them all individually, i.e., `Code`, `QRPs_1_Time1`, `QRPs_2_Time1`, `QRPs_3_Time1`, and so forth (you get the gist), but that would take forever to type.\n\nA shortcut is to use the colon operator `:`. It allows us to select all columns that fall within the range of `first_column_name` to `last_column_name`. We can apply this here since the QRP items (1 to 11) are sequentially listed in `data_prp`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step1 <- data_prp %>% \n select(Code, QRPs_1_Time1:QRPs_11_Time1)\n\n# show first 5 rows of qrp_step1\nhead(qrp_step1, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Code | QRPs_1_Time1| QRPs_2_Time1| QRPs_3_Time1| QRPs_4_Time1| QRPs_5_Time1| QRPs_6_Time1| QRPs_7_Time1| QRPs_8_Time1| QRPs_9_Time1| QRPs_10_Time1| QRPs_11_Time1|\n|:----|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|-------------:|\n|Tr10 | 7| 7| 5| 7| 3| 4| 5| 7| 6| 7| 7|\n|Bi07 | 7| 7| 2| 7| 3| 7| 7| 7| 7| 6| 7|\n|SK03 | 7| 7| 6| 6| 7| 6| 7| 7| 7| 5| 7|\n|SM95 | 7| 7| 2| 6| 7| 5| 7| 7| 4| 2| 4|\n|St01 | 7| 7| 6| 7| 2| 7| 7| 7| 7| 5| 7|\n\n
\n:::\n:::\n\n\nHow many rows/observations and columns/variables do we have in `qrp_step1`?\n\n* rows/observations: \n* columns/variables: \n\n## `pivot_longer()`\n\nAs you can see, the table we got from Step 1 is in wide format. To get it into long format, we need to define:\n\n* the columns that need to be reshuffled from wide into long format (`cols` argument). Here we selected \"everything except the `Code` column\", as indicated by `-Code` \\[minus `Code`\\]. However, `QRPs_1_Time1:QRPs_11_Time1` would also work and give you the exact same result.\n* the `names_to` argument. R creates a new column in which all the column names from the columns you selected in `cols` will be stored. Here we are naming this column \"Items\" but you could pick something equally sensible if you like.\n* the `values_to` argument. R creates this second column to store all responses the participants gave to the individual questions, i.e., all the numbers in this case. We named it \"Scores\" here, but you could have called it something different, like \"Responses\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqrp_step2 <- qrp_step1 %>% \n pivot_longer(cols = -Code, names_to = \"Items\", values_to = \"Scores\")\n\n# show first 15 rows of qrp_step2\nhead(qrp_step2, n = 15)\n```\n\n::: {.cell-output-display}\n
\n\n|Code |Items | Scores|\n|:----|:-------------|------:|\n|Tr10 |QRPs_1_Time1 | 7|\n|Tr10 |QRPs_2_Time1 | 7|\n|Tr10 |QRPs_3_Time1 | 5|\n|Tr10 |QRPs_4_Time1 | 7|\n|Tr10 |QRPs_5_Time1 | 3|\n|Tr10 |QRPs_6_Time1 | 4|\n|Tr10 |QRPs_7_Time1 | 5|\n|Tr10 |QRPs_8_Time1 | 7|\n|Tr10 |QRPs_9_Time1 | 6|\n|Tr10 |QRPs_10_Time1 | 7|\n|Tr10 |QRPs_11_Time1 | 7|\n|Bi07 |QRPs_1_Time1 | 7|\n|Bi07 |QRPs_2_Time1 | 7|\n|Bi07 |QRPs_3_Time1 | 2|\n|Bi07 |QRPs_4_Time1 | 7|\n\n
\n:::\n:::\n\n\nNow, have a look at `qrp_step2`. In total, we now have rows/observations, per participant, and columns/variables.\n\n## `group_by()` and `summarise()`\n\nThis follows exactly the same sequence we used when calculating descriptive statistics by gender. The only difference is that we are now grouping the data by the participant's `Code` instead of `Gender`.\n\n`summarise()` works exactly the same way: `summarise(new_column_name = function_to_calculate_something(column_name_of_numeric_values))`\n\nThe `function_to_calculate_something` can be `mean()`, `sd()` or `sum()` for mean scores, standard deviations, or summed-up scores respectively. You could also use `min()` or `max()` if you wanted to determine the lowest or the highest score for each participant.\n\n:::\n\n:::\n\n::: callout-tip\n\nYou could **rename the columns whilst selecting** them. The pattern would be `select(new_name = old_name)`. For example, if we wanted to select variable `Code` and rename it as `Participant_ID`, we could do that.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrenaming_col <- data_prp %>% \n select(Participant_ID = Code)\n\nhead(renaming_col, n = 5)\n```\n\n::: {.cell-output-display}\n
\n\n|Participant_ID |\n|:--------------|\n|Tr10 |\n|Bi07 |\n|SK03 |\n|SM95 |\n|St01 |\n\n
\n:::\n:::\n\n\n:::\n\n\n\n## Activity 5: Knitting\n\nOnce you've completed your R Markdown file, the final step is to \"knit\" it, which converts the `.Rmd` file into an HTML file. Knitting combines your code, text, and output (like tables and plots) into a single cohesive document. This is a really good way to check your code is working.\n\nTo knit the file, **click the Knit button** at the top of your RStudio window. The document will be generated and, depending on your settings, automatically opened in the viewer in the `Output pane` or an external browser window.\n\nIf any errors occur during knitting, RStudio will show you an error message with details to help you troubleshoot.\n\nIf you want to **intentionally keep any errors** we tackled today as a reference for how you solved them, you could add `error=TRUE` or `eval=FALSE` to the code chunk that isn't running.\n\n\n\n## Activity 6: Export a data object as a csv\n\nTo avoid having to repeat the same steps in the next chapter, it's a good idea to save the data objects you've created today as csv files. You can do this by using the `write_csv()` function from the `readr` package. The csv files will appear in your project folder.\n\nThe basic syntax is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_object, \"filename.csv\")\n```\n:::\n\n\nNow, let's export the objects `data_prp` and `qrp_t1`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(data_prp, \"data_prp_for_ch3.csv\")\n```\n:::\n\n\nHere we named the file `data_prp_for_ch3.csv`, so we wouldn't overwrite the original data csv file `prp_data_reduced.csv`. 
However, feel free to choose a name that makes sense to you.\n\n::: {.callout-note icon=\"false\"}\n\n## Your Turn\n\nExport `qrp_t1`.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwrite_csv(qrp_t1, \"qrp_t1.csv\")\n```\n:::\n\n\n:::\n\n:::\n\nCheck that your csv files have appeared in your project folder, and you're all set!\n\n**That’s it for Chapter 2: Individual Walkthrough.**\n\n## [Pair-coding]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n\nWe will continue working with the data from Binfet et al. (2021), focusing on the randomised controlled trial of therapy dog interventions. Today, our goal is to **calculate an average `Flourishing` score for each participant** at time point 1 (pre-intervention) using the raw data file `dog_data_raw`. Currently, the data looks like this:\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| F1_1| F1_2| F1_3| F1_4| F1_5| F1_6| F1_7| F1_8|\n|---:|----:|----:|----:|----:|----:|----:|----:|----:|\n| 1| 6| 7| 5| 5| 7| 7| 6| 6|\n| 2| 5| 7| 6| 5| 5| 5| 5| 4|\n| 3| 5| 5| 5| 6| 6| 6| 5| 5|\n| 4| 7| 6| 7| 7| 7| 6| 7| 4|\n| 5| 5| 5| 4| 6| 7| 7| 7| 6|\n\n
\n:::\n:::\n\n\n\nHowever, we want the data to look like this:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n\n| RID| Flourishing_pre|\n|---:|---------------:|\n| 1| 6.125|\n| 2| 5.250|\n| 3| 5.375|\n| 4| 6.375|\n| 5| 5.875|\n\n
\n:::\n:::\n\n\n\n\n### Task 1: Open the R project you created last week {.unnumbered}\n\nIf you haven’t created an R project for the lab yet, please do so now. If you already have one set up, go ahead and open it.\n\n\n### Task 2: Open your `.Rmd` file from last week {.unnumbered}\n\nSince we haven’t used it much yet, feel free to continue using the `.Rmd` file you created last week in Task 2.\n\n\n### Task 3: Load in the library and read in the data {.unnumbered}\n\nThe data should be in your project folder. If you didn’t download it last week, or if you’d like a fresh copy, you can download the data again here: [data_pair_ch1](data/data_pair_ch1.zip \"download\").\n\nWe will be using the `tidyverse` package today, and the data file we need to read in is `dog_data_raw.csv`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(???)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"???\")\n```\n:::\n\n\n:::\n\n\n### Task 4: Calculating the mean for `Flourishing_pre` {.unnumbered}\n\n\n* **Step 1**: Select all relevant columns from `dog_data_raw`, including participant ID and all items from the `Flourishing` questionnaire completed before the intervention. Store this data in an object called `data_flourishing`.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nLook at the codebook. Try to determine:\n\n* The variable name of the column where the participant ID is stored.\n* The items related to the Flourishing scale at the pre-intervention stage.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nFrom the codebook, we know that:\n\n* The participant ID column is called `RID`.\n* The Flourishing items at the pre-intervention stage start with `F1_`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata_flourishing <- ??? 
%>% \n select(???, F1_???:F1_???)\n```\n:::\n\n\n:::\n\n:::\n\n\n* **Step 2**: Pivot the data from wide format to long format so we can calculate the average score more easily (in step 3).\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nWhich pivot function should you use? We have `pivot_wider()` and `pivot_longer()` to choose from.\n\nWe also need 3 arguments in that function:\n\n* The columns you want to select (e.g., all the Flourishing items),\n* The name of the column where the current column headings will be stored (e.g., \"Questionnaire\"),\n* The name of the column that should store all the values (e.g., \"Responses\").\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nWe need `pivot_longer()`. You already encountered `pivot_longer()` in first year (or in the individual walkthrough if you have already completed this Chapter). The 3 arguments were also a giveaway; `pivot_wider()` only requires 2 arguments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n pivot_longer(cols = ???, names_to = \"???\", values_to = \"???\")\n```\n:::\n\n\n:::\n\n:::\n\n* **Step 3**: Calculate the average Flourishing score per participant and name this column `Flourishing_pre` to match the table above.\n\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## Hint\n\nBefore summarising the mean, you may need to group the data.\n\n::: {.callout-note collapse=\"true\" icon=\"false\"}\n\n## More concrete hint\n\nTo compute an average score **per participant**, we would need to group by participant ID first.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n group_by(???) 
%>% \n summarise(Flourishing_pre = mean(???)) %>% \n ungroup()\n```\n:::\n\n:::\n\n:::\n\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loading tidyverse into the library\nlibrary(tidyverse)\n\n# reading in `dog_data_raw.csv`\ndog_data_raw <- read_csv(\"dog_data_raw.csv\")\n\n# Task 4: Tidying \ndata_flourishing <- dog_data_raw %>% \n # Step 1\n select(RID, F1_1:F1_8) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Questionnaire\", values_to = \"Responses\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_pre = mean(Responses)) %>% # column name must match values_to above\n ungroup()\n```\n:::\n\n\n:::\n\n\n\n## [Test your knowledge and challenge yourself]{style=\"color: #F39C12; text-transform: uppercase;\"} {.unnumbered}\n\n### Knowledge check {.unnumbered}\n\n#### Question 1 {.unnumbered}\n\nWhich function of the Wickham Six would you use to include or exclude certain variables (columns)? \n\n\n#### Question 2 {.unnumbered}\n\nWhich function of the Wickham Six would you use to create new columns or modify existing columns in a dataframe? \n\n\n#### Question 3 {.unnumbered}\n\n\nWhich function of the Wickham Six would you use to organise data into groups based on one or more columns? \n\n\n\n#### Question 4 {.unnumbered}\n\nWhich function of the Wickham Six would you use to sort the rows of a dataframe based on the values in one or more columns? \n\n\n\n#### Question 5 {.unnumbered}\n\nWhich function of the Wickham Six would NOT modify the original dataframe? 
\n\n\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain these answers\n\n| Function | Description |\n|:-------------|:------------------------------------------------------|\n| `select()` | Include or exclude certain variables/columns |\n| `filter()` | Include or exclude certain observations/rows |\n| `mutate()` | Creates new columns or modifies existing ones |\n| `arrange()` | Changes the order of the rows |\n| `group_by()` | Split data into groups based on one or more variables |\n| `summarise()`| Creates a new dataframe returning one row for each combination of grouping variables |\n\n\nTechnically, the first five functions operate on the existing data object, making adjustments like sorting the data (e.g., with `arrange()`), reducing the number of rows (e.g., with `filter()`), reducing the number of columns (e.g., with `select()`), or adding new columns (e.g., with `mutate()`). In contrast, `summarise()` fundamentally alters the structure of the original dataframe by generating a completely new dataframe that contains only summary statistics, rather than retaining the original rows and columns.\n\n:::\n\n\n\n### Error mode {.unnumbered}\n\nSome of the code chunks contain mistakes and result in errors, while others do not produce the expected results. Your task is to identify any issues, explain why they occurred, and, if possible, fix them.\n\nWe will use a few built-in datasets, such as `billboard` and `starwars`, to help you replicate the errors in your own R environment. You can view the data either by typing the dataset name directly into your console or by storing the data as a separate object in your `Global Environment`.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbillboard\n\nstarwars_data = starwars\n```\n:::\n\n\n\n\n#### Question 6 {.unnumbered}\n\nCurrently, the weekly song rankings for Billboard Top 100 in 2000 are in wide format, with each week in a separate column. 
The following code is supposed to transpose the wide-format `billboard` data into long format:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(names_to = \"weeks\", values_to = \"rank\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `pivot_longer()`:\n! `cols` must select at least one column.\n```\n:::\n:::\n\n\nWhat does this error message mean and how do you fix it?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe error message indicates that the `cols` argument is missing in the function. This means the function doesn’t know which columns to transpose from wide format to long format.\n\nFIX: Add `cols = wk1:wk76` to the function to select columns from wk1 to wk76. Alternatively, `cols = starts_with(\"wk\")` would also work since all columns start with the letter combination \"wk\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlong_data <- billboard %>% \n pivot_longer(cols = wk1:wk76, names_to = \"weeks\", values_to = \"rank\")\n# OR\nlong_data <- billboard %>% \n pivot_longer(cols = starts_with(\"wk\"), names_to = \"weeks\", values_to = \"rank\")\n```\n:::\n\n\n:::\n\n\n\n#### Question 7 {.unnumbered}\n\nThe following code is intended to calculate the mean height of all the characters in the built-in `starwars` dataset, grouped by their gender. 
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = height)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Returning more (or less) than 1 row per `summarise()` group was deprecated in\ndplyr 1.1.0.\nℹ Please use `reframe()` instead.\nℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`\n always returns an ungrouped data frame and adjust accordingly.\n```\n:::\n:::\n\n\nThe code runs, but it's giving us some weird warning and the output is also not as expected. What steps should we take to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nThe aggregation function `mean()` is missing from within `summarise()`. Without it, the function does not perform any aggregation and returns *all* rows with only the columns for gender and height.\n\nFIX: Wrap the `mean()` function around the variable you want to aggregate, here `height`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height))\n```\n:::\n\n\n:::\n\n\n\n#### Question 8 {.unnumbered}\n\nFollowing up on Question 7, we now have `summary_data` that looks approximately correct - it has the expected rows and column numbers, however, the cell values are \"weird\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | NA|\n|masculine | NA|\n|NA | 175|\n\n
\n:::\n:::\n\n\n\nCan you explain what is happening here? And how can we modify the code to fix this?\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Explain the solution\n\nLook at the original `starwars` data. You will notice that some of the characters with feminine and masculine gender entries have missing height values. However, all four characters without a specified gender have provided their height.\n\nFIX: We need to add `na.rm = TRUE` to the `mean()` function to ensure that R ignores missing values before aggregating the data.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary_data <- starwars %>%\n group_by(gender) %>%\n summarise(mean_height = mean(height, na.rm = TRUE))\n\nsummary_data\n```\n\n::: {.cell-output-display}\n
\n\n|gender | mean_height|\n|:---------|-----------:|\n|feminine | 166.5333|\n|masculine | 176.5323|\n|NA | 175.0000|\n\n
\n:::\n:::\n\n\n:::\n\n\n\n### Challenge yourself {.unnumbered}\n\nIf you want to **challenge yourself** and further apply the skills from Chapter 2, you can wrangle the data from `dog_data_raw` for additional questionnaires from either the pre- and/or post-intervention stages:\n\n* Calculate the mean score for `flourishing_post` for each participant.\n* Calculate the mean score for the `PANAS` (Positive and/or Negative Affect) per participant\n* Calculate the mean score for happiness (`SHS`) per participant\n\nThe 3 steps are equivalent for those questionnaires - select, pivot, group_by and summarise; you just have to \"replace\" the questionnaire items involved.\n\n::: {.callout-caution collapse=\"true\" icon=\"false\"}\n\n## Solution for **Challenge yourself**\n\nFlourishing post-intervention\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## flourishing_post\nflourishing_post <- dog_data_raw %>% \n # Step 1\n select(RID, starts_with(\"F2\")) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Names\", values_to = \"Response\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(Flourishing_post = mean(Response)) %>% \n ungroup()\n```\n:::\n\n\nThe PANAS could be solved more concisely with the skills we learn in @sec-wrangling2, but for now, you would have solved it this way:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# PANAS - positive affect pre\nPANAS_PA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_3, PN1_5, PN1_7, PN1_8, PN1_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - positive affect post\nPANAS_PA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_3, PN2_5, PN2_7, PN2_8, PN2_10) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_PA_post = mean(Scores)) %>% \n 
ungroup()\n\n# PANAS - negative affect pre\nPANAS_NA_pre <- dog_data_raw %>% \n # Step 1\n select(RID, PN1_1, PN1_2, PN1_4, PN1_6, PN1_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_pre = mean(Scores)) %>% \n ungroup()\n\n# PANAS - negative affect post\nPANAS_NA_post <- dog_data_raw %>% \n # Step 1\n select(RID, PN2_1, PN2_2, PN2_4, PN2_6, PN2_9) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Items\", values_to = \"Scores\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(PANAS_NA_post = mean(Scores)) %>% \n ungroup()\n```\n:::\n\n\nHappiness scale\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# happiness_pre\nhappiness_pre <- dog_data_raw %>% \n # Step 1\n select(RID, HA1_1, HA1_2, HA1_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_pre = mean(Score)) %>% \n ungroup()\n\n#happiness_post\nhappiness_post <- dog_data_raw %>% \n # Step 1\n select(RID, HA2_1, HA2_2, HA2_3) %>% \n # Step 2\n pivot_longer(cols = -RID, names_to = \"Item\", values_to = \"Score\") %>% \n # Step 3\n group_by(RID) %>% \n summarise(SHS_post = mean(Score)) %>% \n ungroup()\n```\n:::\n\n\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/docs/01-basics.html b/docs/01-basics.html index 79ab89f..18f8930 100644 --- a/docs/01-basics.html +++ b/docs/01-basics.html @@ -394,46 +394,46 @@

What is their purpose?

The Source pane… -
- +
+
The Environment pane… -
- +
+
The Console pane… -
- +
+
The Output pane… -
- +
+

Where are these panes located by default?

  • The Source pane is located? + + +
  • The Environment pane is located?
  • The Console pane is located?
  • The Output pane is located? +
@@ -826,16 +826,16 @@

Files and project should be visible in the Output pane in RStudio
-