% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
%
\documentclass[
]{book}
\usepackage{amsmath,amssymb}
\usepackage{iftex}
\ifPDFTeX
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math} % this also loads fontspec
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
\usepackage{lmodern}
\ifPDFTeX\else
% xetex/luatex font selection
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{#1}}}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
\usepackage{booktabs}
\usepackage{amsthm}
\makeatletter
\def\thm@space@setup{%
\thm@preskip=8pt plus 2pt minus 4pt
\thm@postskip=\thm@preskip
}
\makeatother
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage{threeparttablex}
\usepackage[normalem]{ulem}
\usepackage{makecell}
\usepackage{xcolor}
\ifLuaTeX
\usepackage{selnolig} % disable illegal ligatures
\fi
\usepackage[]{natbib}
\bibliographystyle{apalike}
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same}
\hypersetup{
pdftitle={UG Quantitative Methods in the Social Sciences lab workbook},
pdfauthor={by J Rafael Verudzco Torres and Mark Wong},
hidelinks,
pdfcreator={LaTeX via pandoc}}
\title{UG Quantitative Methods in the Social Sciences lab workbook}
\usepackage{etoolbox}
\makeatletter
\providecommand{\subtitle}[1]{% add subtitle to \maketitle
\apptocmd{\@title}{\par {\large #1 \par}}{}{}
}
\makeatother
\subtitle{A step-by-step guide for conducting quantitative research with R}
\author{by J Rafael Verudzco Torres and Mark Wong}
\date{2024-09-09}
\begin{document}
\maketitle
{
\setcounter{tocdepth}{1}
\tableofcontents
}
\hypertarget{Welcome}{%
\chapter*{Welcome}\label{Welcome}}
\addcontentsline{toc}{chapter}{Welcome}
\includegraphics{./images/cover.PNG}
Welcome to the Quantitative Methods in the Social Sciences lab!
This workbook is targeted to University of Glasgow students enrolled in the Undergraduate Quantitative Research Methods course of the School of Social \& Political Sciences. The activities are designed for \href{https://rstudio.cloud/}{RStudio Cloud}.
The book was written using the \texttt{R} \href{https://github.com/rstudio/bookdown}{bookdown} package, based on the GitHub repository: \url{https://github.com/rstudio/bookdown-demo}.
\includegraphics{./images/by-nc-sa.png}
The online version of this book is licensed under the \href{http://creativecommons.org/licenses/by-nc-sa/4.0/}{Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License}.
\hypertarget{lab-intro}{%
\chapter{Introduction to R}\label{lab-intro}}
For this course we will be using \href{https://www.r-project.org/}{R} \citep{R-base} and \href{https://rstudio.com/}{R Studio} as the main tools for conducting quantitative analysis. \texttt{R} and the basic versions of \texttt{R\ Studio} are open-source and thus free software. Even though \texttt{R} appeared in the early 90s, it has gained a lot of popularity in recent years. In fact, it is now one of the most commonly used software tools for doing statistics in academia.
\texttt{R} and \texttt{R\ Studio} are two separate things. \texttt{R} is the actual programming language and the main processing tool which does the computations in the background, whereas RStudio integrates all functionalities in a friendly and interactive interface. In short, for this course (and most of the time in practice) you will chiefly use RStudio whilst \texttt{R} silently does all the work in the background. Hereafter, we will refer to \texttt{R} as the integrated interface.
\texttt{R} works in a command-line environment. This means that you need to call the commands (or \emph{\textbf{functions}}, as they are called in R) through text. This can look intimidating at first glance. But do not worry, we will guide you step by step.
At this point you may be wondering why you need to bother learning these tools. In the next section you will see some of the advantages and examples that can be achieved using \texttt{R}.
\hypertarget{why-r}{%
\section{Why R?}\label{why-r}}
\hypertarget{r-a-flexible-tool}{%
\subsection{R: a flexible tool}\label{r-a-flexible-tool}}
R can be applied in a wide variety of fields and subjects, including not only those in the social sciences (e.g.~sociology, politics or policy research), but also in humanities (e.g.~history, digital humanities), natural and physical sciences (e.g.~biology, chemistry or geography), health (e.g.~medical studies, public health, epidemiology), business and management (e.g.~finance, economics, marketing), among many others.
The broad application of R is due to its flexibility, which allows you to perform a range of tasks related to data. These cover tasks at initial stages, such as downloading, mining, or importing data. It is also useful for manipulating, editing, transforming, and organizing information. Furthermore, and most important for us, there is a set of tools that allows us to analyse data using a range of statistical techniques. These are useful to understand, summarize and draw conclusions about samples, e.g.~people. Lastly, \texttt{R} is a powerful tool for communicating and sharing information and documents. There are several extensions (called \emph{\textbf{packages}} in R) that can help to produce static and interactive plots/charts, maps, written reports, interactive applications or even entire books! In fact, this workbook was written from RStudio.
\hypertarget{advantages-of-using-r}{%
\subsection{Advantages of using R}\label{advantages-of-using-r}}
Some of the advantages of using R are the following:
\begin{itemize}
\tightlist
\item
It is free and open source. You do not need to pay for a license. Thus you can use it anywhere at anytime even if you do not have an affiliation to an institution or organization (e.g.~University or workplace);
\item
It is a collaborative project. This means that it is the users who maintain, extend and update its applications;
\item
It is reproducible. Research can be more transparent since you will get the same results every time you run your analysis through a specific pathway (i.e.~through scripts);
\item
High compatibility. You can read and produce most types of file extensions;
\item
There are a number of easy-access web resources to support you in the learning process.
\end{itemize}
\hypertarget{getting-started}{%
\section{Getting started}\label{getting-started}}
\hypertarget{setting-up-rstudio}{%
\subsection{Setting up RStudio}\label{setting-up-rstudio}}
At this point you need to know that there are at least two alternatives to start using RStudio. One, and by far the most common, is to download both \texttt{R} and RStudio and install the applications on your local drive. The other option is RStudio Cloud. This is an online version of RStudio that does not require installing any additional software. You can run it directly from your browser (e.g.~Google Chrome, Safari, Firefox, etc.). For now, we will use the cloud version.
To get started, follow the next steps:
\textbf{Part 1} Create an RStudio Cloud account.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Click on this link: \href{https://sso.rstudio.cloud/glasgow}{RStudio Cloud - SSO}, which should automatically open a new tab in your web browser. Alternatively, copy this URL into your browser: \url{https://sso.rstudio.cloud/glasgow};
\item
Enter your University of Glasgow email address in the login page as normal;
\item
You will then be redirected to the SSO sign-in page, where you enter your GUID and password (the same page as when you log into the library portal/e-reading list);
\end{enumerate}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/sso_login} \hfill{}
\caption{SSO Login.}\label{fig:unnamed-chunk-2}
\end{figure}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
Done! You will be taken to your own RStudio Cloud workspace.
\end{enumerate}
\textbf{Part 2} Join your lab group.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
You will receive a link from your tutor to join your lab group on RStudio Cloud (the link will be posted on Moodle too). N.B. you must use this specific link to join and access your lab group workspace, as each link is unique to your group. So only use your group's specific link. Copy and paste the link in your web browser. You should see the following window:
\end{enumerate}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_joinspace} \hfill{}
\caption{Join Space.}\label{fig:unnamed-chunk-3}
\end{figure}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
Join your lab by clicking on the `Join space' button shown above.
\item
Open the shared space from the left-hand side pane called `Quants Lab Group..' and start the Lab 1 project by clicking on the `Start' button as shown below:
\end{enumerate}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_startproject} \hfill{}
\caption{Start project.}\label{fig:unnamed-chunk-4}
\end{figure}
\hypertarget{rstudio-environment}{%
\subsection{RStudio environment}\label{rstudio-environment}}
\hypertarget{rstudio-screen}{%
\subsubsection{RStudio screen}\label{rstudio-screen}}
Once you have started `Lab 1' you will see the screen below.
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_lab1_empty} \hfill{}
\caption{Project name.}\label{fig:unnamed-chunk-5}
\end{figure}
Now, go to the ``File'' tab and create an R Script as follows: \texttt{File\ \textgreater{}\ New\ file\ \textgreater{}\ R\ Script}.
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_new_script} \hfill{}
\caption{New R Script.}\label{fig:unnamed-chunk-6}
\end{figure}
Once you have created your first R Script, save it by clicking on File \textgreater{} Save as.. \textgreater{} \texttt{{[}write\ the\ name\ of\ your\ file{]}}.
After this, your RStudio screen will be split into four \textbf{important} windows or panes as shown below:
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_panels} \hfill{}
\caption{R Studio panes.}\label{fig:unnamed-chunk-7}
\end{figure}
\begin{itemize}
\tightlist
\item
In \textbf{Pane 1}, you have your newly created \texttt{R} script. This is the area where you will be working most of the time. From here, you will write functions. To run a line of an \texttt{R} script, you can click on the green \texttt{Run} arrow at the top of pane 1 or, more commonly, press \texttt{alt\ +\ enter}. The things you write in this section will be saved in your R script file.
\includegraphics{./images/rstudio_cloud_run_button.png}
\item
In \textbf{Pane 2}, you have the ``Global Environment'', one of the most useful tabs in this pane. It shows you the active `objects' that you have available/loaded in your current session (this will probably make more sense in the coming sections).
\item
In \textbf{Pane 3}, you have the R Console. This is where you will see most of the results of the functions you run from your script (pane 1). You can also write and run functions from here, by typing the function and hitting enter. NOTE that what you do here will NOT be saved; the console is usually used to quickly call functions that you do not want to save in your script.
\end{itemize}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_console} \hfill{}
\caption{Console.}\label{fig:unnamed-chunk-8}
\end{figure}
\begin{itemize}
\tightlist
\item
Finally, in \textbf{Pane 4} you have multiple useful tabs. In the \texttt{File} tab you can see the files and directories that you have in your R project. In the \texttt{Plot} tab you will see a preview of the static plots/charts you will be producing from your script. In \texttt{Packages}, you have a list of the extensions or plug-ins (called `packages' in R) that are installed in your working environment. The \texttt{Help} tab contains resources that clarify or expand on what each function does. Again, this will probably make more sense once you get started. We will come back to this later. Finally, the \texttt{Viewer} tab displays interactive outputs.
\end{itemize}
\hypertarget{hands-on-r}{%
\section{Hands on R}\label{hands-on-r}}
Now you are ready! It is your turn to start exploring and getting familiar with R by completing the following activities.
\hypertarget{r-as-calculator}{%
\subsubsection{R as calculator}\label{r-as-calculator}}
Go to your \textbf{console} (pane 3, bottom-left pane), write some simple calculations and run them by pressing `enter' after each of them, as shown below.
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_r_as_calculator} \hfill{}
\caption{R Console as calculator.}\label{fig:unnamed-chunk-9}
\end{figure}
Try different operations such as \texttt{50\ /\ 20} or \texttt{3\ *\ 5}.
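As a brief illustration, the lines below could be typed in the console one at a time. The numbers are arbitrary examples, and the text after each \texttt{\#} is a comment stating the value R should return. The last two lines suggest that brackets change the order in which operations are carried out.

\begin{Shaded}
\begin{Highlighting}[]
\DecValTok{50} \SpecialCharTok{/} \DecValTok{20}  \CommentTok{\# division, returns 2.5}
\DecValTok{3} \SpecialCharTok{*} \DecValTok{5}  \CommentTok{\# multiplication, returns 15}
\NormalTok{(}\DecValTok{10} \SpecialCharTok{+} \DecValTok{2}\NormalTok{) }\SpecialCharTok{/} \DecValTok{4}  \CommentTok{\# brackets are evaluated first, returns 3}
\DecValTok{10} \SpecialCharTok{+} \DecValTok{2} \SpecialCharTok{/} \DecValTok{4}  \CommentTok{\# division is evaluated before addition, returns 10.5}
\end{Highlighting}
\end{Shaded}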
Fairly simple, right? And don't forget, it is entirely normal to copy/paste and tweak existing code. Unlike writing an essay or an exam, you don't need to write code ``off the cuff'' or memorise any syntax. You are only expected to know how to run the code and tweak it as you go along; there is a huge amount of trial and error when you work in R. So don't worry if you feel like you are just making minor changes to the code. That's how it's supposed to work. The first few weeks are all about getting comfortable using R; then the level of challenge will go up. Let's continue with the next activities!
\hypertarget{testing-logical-operators}{%
\subsubsection{Testing logical operators}\label{testing-logical-operators}}
Now, write and run the following lines in your \textbf{console} (pane 3) and take some time to observe the result in detail for each of them:
\begin{itemize}
\tightlist
\item
\texttt{10\ ==\ 10}
\item
\texttt{10\ !=\ 10}
\item
\texttt{1\ ==\ 5}
\item
\texttt{1\ \textgreater{}\ 5}
\item
\texttt{\textquotesingle{}a\textquotesingle{}\ ==\ \textquotesingle{}a\textquotesingle{}}
\item
\texttt{\textquotesingle{}a\textquotesingle{}\ ==\ \textquotesingle{}b\textquotesingle{}}
\end{itemize}
What do you see? \ldots{}
\ldots That's it! When you use the double equal sign \texttt{==} you are \emph{asking} R whether the value on the left-hand side of the operator is equal to the one on the right-hand side. Likewise, when you combine the exclamation mark \texttt{!} with another operator, you get the reversed result. In the previous exercise you used \texttt{!=}, which is interpreted as ``is not equal to''. That is why \texttt{10\ !=\ 10} returns \texttt{FALSE}, but \texttt{10\ ==\ 10} returns \texttt{TRUE}.
\texttt{R} can process different classes of inputs. In this case we used letters and we \emph{asked} R whether `a' was equal to `a', and of course the result is \texttt{TRUE}. Note that when you want to input text (referred to as \emph{character} values in R), you need quotation marks \texttt{\textquotesingle{}}. If you want to enter numeric values, you simply input the raw number. These are values of different `classes'.
Perhaps logical operators do not make much sense at this point, but you will find out later that they are useful to manipulate data. For example, these are essential to filter a data set based on specific \emph{rules} or patterns.
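To give a flavour of how such rules can be used for filtering, here is a minimal sketch. The object name \texttt{ages} and its values are made up for illustration, and \texttt{c()} is a base R function that combines several values into a single object (a vector).

\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Hypothetical ages of four survey respondents}
\NormalTok{ages }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\DecValTok{15}\NormalTok{, }\DecValTok{22}\NormalTok{, }\DecValTok{37}\NormalTok{, }\DecValTok{41}\NormalTok{)}
\CommentTok{\# Ask R which ages are greater than 20: FALSE TRUE TRUE TRUE}
\NormalTok{ages }\SpecialCharTok{\textgreater{}} \DecValTok{20}
\CommentTok{\# Keep only the values for which the rule is TRUE: 22 37 41}
\NormalTok{ages[ages }\SpecialCharTok{\textgreater{}} \DecValTok{20}\NormalTok{]}
\end{Highlighting}
\end{Shaded}

The logical operator produces one \texttt{TRUE} or \texttt{FALSE} per value, and the square brackets keep only the values marked \texttt{TRUE}.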
\hypertarget{assigning-values-to-objects}{%
\subsubsection{Assigning values to `objects'}\label{assigning-values-to-objects}}
In \texttt{R}, it is very common (and practical) to store values or data as `objects'. These are temporarily stored in your current session. Let's try it!
Now, we will work in the \texttt{R} script file (\textbf{Pane 1}, top-left pane). Write the following and run it by clicking the green arrow or using \texttt{alt\ +\ enter}:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{a }\OtherTok{\textless{}{-}} \DecValTok{10}
\NormalTok{a }\SpecialCharTok{+} \DecValTok{5}
\end{Highlighting}
\end{Shaded}
What do you observe?\ldots{}
\ldots That's right! The operator \texttt{\textless{}-} assigned the numeric value \texttt{10} to the object \texttt{a} (on the left-hand side of the arrow). Later, you used the object (\texttt{a}) to compute a sum (i.e., \texttt{a\ +\ 5}).
Now, write and run the following in your \texttt{R} script (Pane 1)
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{c }\OtherTok{\textless{}{-}} \DecValTok{3}
\NormalTok{a }\SpecialCharTok{*}\NormalTok{ c}
\end{Highlighting}
\end{Shaded}
As you can see, you stored the numeric value \texttt{3} in the variable \texttt{c}. Then, you called the previously created object \texttt{a} in a multiplication.
In the same way as you assigned these simple variables, you will store other types of objects later, e.g.~vectors, data frames or lists. This is useful because those objects will be ready in your session to do some computations.
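As a small preview, the sketch below stores a vector and a data frame as objects. All names and values (\texttt{heights}, \texttt{person}, \texttt{people}) are invented for illustration; \texttt{c()}, \texttt{data.frame()}, \texttt{nrow()} and \texttt{mean()} are base R functions.

\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# A numeric vector and a character vector (made{-}up values)}
\NormalTok{heights }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\FloatTok{1.60}\NormalTok{, }\FloatTok{1.75}\NormalTok{, }\FloatTok{1.82}\NormalTok{)}
\NormalTok{person }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{\textquotesingle{}Ana\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}Ben\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}Cat\textquotesingle{}}\NormalTok{)}
\CommentTok{\# A data frame that holds both vectors as columns}
\NormalTok{people }\OtherTok{\textless{}{-}} \FunctionTok{data.frame}\NormalTok{(}\AttributeTok{person =}\NormalTok{ person, }\AttributeTok{height =}\NormalTok{ heights)}
\CommentTok{\# Stored objects can be reused in later computations}
\FunctionTok{nrow}\NormalTok{(people)  }\CommentTok{\# number of rows, returns 3}
\FunctionTok{mean}\NormalTok{(heights)  }\CommentTok{\# average of the three heights}
\end{Highlighting}
\end{Shaded}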
There are a few things to note when assigning objects to variables. If you assign a different value to the same variable, e.g.~by typing \texttt{a\ \textless{}-\ 5}, you will replace the old value with the new one. So, instead of having \texttt{a} representing the value \texttt{10}, you will have \texttt{5}. You can see the objects available in your session in the Global Environment (`Environment' tab in Pane 2) as shown below.
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_environment} \hfill{}
\caption{`Environment' tab.}\label{fig:unnamed-chunk-12}
\end{figure}
This is a very good start, great job!
Note that the changes made in your script are saved automatically in RStudio Cloud. To verify this, have a look at the name of your script in the top-left of pane 1. If changes are due to be saved, the name will be written in red. If it is in red, save changes manually by clicking on the disk icon. After you have made sure your changes are saved, end your session simply by closing the RStudio Cloud tab in your browser.
\hypertarget{activity}{%
\section{Activity}\label{activity}}
Discuss the following questions with your neighbour or tutor:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
What are the main differences between working on a \texttt{R} script file (pane 1) and directly on the console (pane 3)?
\item
Can you describe what happens when you run the following code? (tip: look at the environment tab in pane 2)
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{object1 }\OtherTok{\textless{}{-}} \DecValTok{10}
\NormalTok{object1 }\OtherTok{\textless{}{-}} \DecValTok{30}
\end{Highlighting}
\end{Shaded}
\hypertarget{lab2}{%
\chapter{Data in R}\label{lab2}}
\hypertarget{welcome-back}{%
\section{Welcome back!}\label{welcome-back}}
In our previous lab, we set up an RStudio Cloud session and got familiar with the RStudio environment and the purpose and contents of its panes. In this lab we will learn about R packages and how to install and load them. We will also use different types of data, and you will have the chance to practise with additional \texttt{R} operators. Lastly, we will load a real-world data set and put your new skills into practice.
\hypertarget{learn-packages}{%
\section{R Packages}\label{learn-packages}}
As mentioned in our last lab, \texttt{R} \citep{R-base} is a collaborative project. This means that \texttt{R} users are constantly developing, maintaining, and extending its functionalities. When you set up \texttt{R} and RStudio for the first time, as we did last week, it comes only with the `basic' functionalities by default. However, there are literally thousands of extensions that are developed by other users. In R, these non-default extensions are called \emph{\textbf{packages}}.
Most of the time, we use packages because they simplify our work in \texttt{R} or they allow us to extend the capabilities of base R.
\hypertarget{installing-packages}{%
\subsection{Installing packages}\label{installing-packages}}
Let's get hands-on and install and load some useful packages. We will start with \texttt{tidyverse} \citep{R-tidyverse}.\footnote{\url{https://www.tidyverse.org/}}
\hypertarget{activity-1}{%
\subsection{Activity:}\label{activity-1}}
\textbf{Part 1}. Access your lab group in R Studio Cloud
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Make sure you have a free, institutional-subscription \href{https://rstudio.cloud/}{RStudio Cloud} account (in case you have not created one yet, please follow the guidance provided in \protect\hyperlink{lab-intro}{Lab 1});
\item
You will receive a link from your tutor to join your lab group in a shared space. Copy and paste it in your web browser (log in if necessary). If you already joined your lab group in RStudio Cloud, simply access the `Lab 2' project and omit steps 3 to 5. Otherwise, continue with steps 3, 4 and 5.
\item
If you did not join your lab group yet, you should see the following window:
\end{enumerate}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_joinspace} \hfill{}
\caption{Join Space.}\label{fig:unnamed-chunk-14}
\end{figure}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
Click on the `Join space' button shown above.
\item
Open the shared space from the left-hand side pane called `Quants Lab Group..' and start the Lab 2 project by clicking on the `Start' button as shown below:
\end{enumerate}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_startproject_lab2} \hfill{}
\caption{Start Lab 2.}\label{fig:unnamed-chunk-15}
\end{figure}
\textbf{Part 2}. Working on your script
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Once you have accessed the `Lab 2' project, write or copy the following line in your \textbf{script} (pane 1) and run it:
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{install.packages}\NormalTok{(}\StringTok{\textquotesingle{}tidyverse\textquotesingle{}}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
Wait until you get the message `The downloaded source packages are in\ldots{}'. The installing process can take up to a couple of minutes to finish.
\item
Once the package is installed, you need to load it using the \texttt{library()} function. Please, copy and paste the following line, and run it:
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(tidyverse)}
\end{Highlighting}
\end{Shaded}
And that's it, \texttt{tidyverse} is ready to be used in your current session!
There are a couple of things you should know. First, packages need to be installed only once per project in RStudio Cloud (and only once overall if you are working in the RStudio Desktop version). However, packages must be loaded using the \texttt{library()} function every time you restart an R session.
Another thing to notice is that when you install a package you need to use quotation marks, whereas in \texttt{library()} you only need to write the plain package name within brackets. Usually, you will load the packages at the beginning of your script.
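One common pattern, shown here only as a sketch and not as a required step of this lab, is to install a package only when it is missing and then load it. It relies on the base R function \texttt{requireNamespace()}, which returns \texttt{TRUE} when the named package is already installed. Note that the installation line needs an internet connection.

\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Install tidyverse only if it is not yet available in this project}
\ControlFlowTok{if}\NormalTok{ (}\SpecialCharTok{!}\FunctionTok{requireNamespace}\NormalTok{(}\StringTok{\textquotesingle{}tidyverse\textquotesingle{}}\NormalTok{, }\AttributeTok{quietly =} \ConstantTok{TRUE}\NormalTok{)) \{}
\NormalTok{  }\FunctionTok{install.packages}\NormalTok{(}\StringTok{\textquotesingle{}tidyverse\textquotesingle{}}\NormalTok{)}
\NormalTok{\}}
\CommentTok{\# Load the package; this is needed in every new R session}
\FunctionTok{library}\NormalTok{(tidyverse)}
\end{Highlighting}
\end{Shaded}

Written this way, the script can be re-run without reinstalling the package each time.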
\hypertarget{types-of-variables}{%
\section{Types of variables}\label{types-of-variables}}
\texttt{R} can handle many classes of data. It is crucial that you can distinguish the main ones. Broadly speaking there are two types of variables,
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
\textbf{categorical} and;
\item
\textbf{numeric} (formally known as interval or ratio).
\end{enumerate}
Categorical variables are distinctive because the number of categories they can take is limited, e.g., country, name, political party, or gender. Ordinal data is a \emph{sub-type} of categorical data, used when the categories can be ranked and their order is meaningful, e.g., education level or level of satisfaction. Numeric values can be continuous; these are usually measured and can take infinitely many values, e.g.~speed or time.\footnote{For more details, please refer to the DataCamp module \href{https://learn.datacamp.com/courses/introduction-to-data-in-r}{Introduction to Data in R}.}
In \texttt{R}, the basic types of data are known as `atomic vectors' and there are 6 of them (logical, integer, double, character, complex and raw). In the social sciences, we often use the following: \texttt{numeric}, \texttt{factor} and \texttt{character}. Numeric vectors are used to represent continuous numerical data.\footnote{Notice that \texttt{numeric} vectors can be represented as \texttt{integer}or \texttt{double} in \texttt{R}, their difference is of little relevance for now.} On the other hand, factor vectors are used to represent categorical and ordinal data.
In R, there are couple of functions that will help us to identify the type of data. First, we have \texttt{glimpse()}. This prints some of the main characteristics of a data set, namely its overall dimension, name of each variable (column), the first values for each variable, and the type of the variable. Second we have the function \texttt{class()}, that will help us to determine the overall class(type) of on \texttt{R} object.
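As a minimal sketch of how these types look in practice, the lines below create a few small vectors and check them with \texttt{class()}. The object names (\texttt{heights}, \texttt{first\_names}, \texttt{party}, \texttt{satisfaction}) and their values are made up for illustration only:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Hypothetical example vectors (values invented for illustration)}
\NormalTok{heights }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\DecValTok{172}\NormalTok{, }\DecValTok{165}\NormalTok{, }\DecValTok{180}\NormalTok{)}
\NormalTok{first\_names }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"Ana"}\NormalTok{, }\StringTok{"Ben"}\NormalTok{, }\StringTok{"Cai"}\NormalTok{)}
\NormalTok{party }\OtherTok{\textless{}{-}} \FunctionTok{factor}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\StringTok{"Green"}\NormalTok{, }\StringTok{"Red"}\NormalTok{, }\StringTok{"Green"}\NormalTok{))}
\CommentTok{\# An ordered factor can represent ordinal data (ranked categories)}
\NormalTok{satisfaction }\OtherTok{\textless{}{-}} \FunctionTok{factor}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\StringTok{"Low"}\NormalTok{, }\StringTok{"High"}\NormalTok{, }\StringTok{"Medium"}\NormalTok{),}
\AttributeTok{levels =} \FunctionTok{c}\NormalTok{(}\StringTok{"Low"}\NormalTok{, }\StringTok{"Medium"}\NormalTok{, }\StringTok{"High"}\NormalTok{), }\AttributeTok{ordered =}\NormalTok{ TRUE)}
\CommentTok{\# Check the class of each object}
\FunctionTok{class}\NormalTok{(heights)}
\FunctionTok{class}\NormalTok{(first\_names)}
\FunctionTok{class}\NormalTok{(party)}
\FunctionTok{class}\NormalTok{(satisfaction)}
\end{Highlighting}
\end{Shaded}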
\hypertarget{activity-2}{%
\subsection{Activity:}\label{activity-2}}
We are now going to use some datasets that are pre-loaded in the \texttt{R} session by default. Please go to your `Lab\_2' project in RStudio Cloud and do the following:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
We will start with a classic example dataset in \texttt{R} called \texttt{iris}. This contains measurements of various flower species (for more info type \texttt{?iris} in your console). Please go to your \textbf{console} and type the line below.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(iris)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
What do you observe from the output? First, it tells you the number of rows and columns at the top. Then, it lists the name of each variable. Additionally, it tells you the type of each variable between the symbols \texttt{\textless{}\ \textgreater{}}. The first four variables in this dataset are of type \texttt{\textless{}dbl\textgreater{}}, which is a type of numeric variable. The last, \texttt{Species}, is a factor (\texttt{\textless{}fct\textgreater{}}). In sum, this dataset contains the species of each flower and four continuous measurements associated with it.
\item
Now you know that each flower belongs to a species, but what are the specific categories in this data set? To find out, type the following in your console.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{levels}\NormalTok{(iris}\SpecialCharTok{$}\NormalTok{Species)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
As you can see, there are three categories, which are three flower species. In \texttt{R}, the categories in factor vectors are called \emph{levels}.
\end{enumerate}
Note the syntax above. Inside the function, we used the name of the dataset followed by the dollar sign (\$), which is needed to access the specific column/variable \texttt{Species}.
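The same dollar-sign syntax works with other functions too. As a short sketch using only base \texttt{R} functions and the built-in \texttt{iris} data, you could try the following in your console:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Class of a single column accessed with the dollar sign}
\FunctionTok{class}\NormalTok{(iris}\SpecialCharTok{$}\NormalTok{Species)}
\CommentTok{\# Number of levels (categories) in the factor}
\FunctionTok{nlevels}\NormalTok{(iris}\SpecialCharTok{$}\NormalTok{Species)}
\CommentTok{\# Number of flowers in each category}
\FunctionTok{table}\NormalTok{(iris}\SpecialCharTok{$}\NormalTok{Species)}
\end{Highlighting}
\end{Shaded}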
Now, let's get serious and explore Star Wars. Yes, the famous film series!
The \texttt{starwars} data set from the \texttt{dplyr} package contains information about the characters, including height, hair colour, and sex (to get more information type \texttt{?starwars} in your console). For now, we will use a reduced version of the full data set. Please complete the following activities from your \texttt{R} script (pane 1).
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
First, we will run the next couple of lines to reduce the data set, and then we will glimpse the Star Wars characters:
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{starwars2 }\OtherTok{\textless{}{-}}\NormalTok{ starwars[ ,}\DecValTok{1}\SpecialCharTok{:}\DecValTok{11}\NormalTok{]}
\FunctionTok{glimpse}\NormalTok{(starwars2)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
What do you observe this time? It seems that the data types are not consistent with their content. For example, the variables \texttt{species}, \texttt{gender}, and \texttt{hair\_color} are of type \texttt{\textless{}chr\textgreater{}} (that is, \texttt{character}), when according to what we just learnt they should be factors. To transform them, we will use the function \texttt{factor()}. This process is known as coercing a variable, that is, changing it from one type to another.
\item
Let's coerce the species variable from character to factor and assign the result to the same column in the dataset.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{starwars2}\SpecialCharTok{$}\NormalTok{species }\OtherTok{\textless{}{-}} \FunctionTok{factor}\NormalTok{(starwars2}\SpecialCharTok{$}\NormalTok{species)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
Let's check if the type of variable really changed by glimpsing the data and checking the levels of \texttt{species}.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(starwars2)}
\FunctionTok{levels}\NormalTok{(starwars2}\SpecialCharTok{$}\NormalTok{species)}
\end{Highlighting}
\end{Shaded}
The glimpse result now is telling us that \texttt{species} is a \texttt{\textless{}fct\textgreater{}}, as expected. Furthermore, the \texttt{levels()} function reveals that there are 37 types of species, including Human, Ewok, Droid, and more.
Hopefully, these examples will help you to identify the main vector types and, more importantly, to coerce them into an appropriate type. Be aware that many data sets represent categories with numeric values, for example, using `0' for males and `1' for females. Usually, large data sets are accompanied by extra information in a \emph{code book} or \emph{documentation} file, which specifies the values of the numeric codes and their respective meanings. It's important to read the code book/documentation of every dataset, as the conventions and meanings can vary.
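As an illustration of the last point, the sketch below turns a numerically coded variable into a factor with readable labels. The vector \texttt{sex\_code} and the coding `0 = Male, 1 = Female' are invented for this example; in a real dataset the coding must be taken from the code book:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Hypothetical coded variable: 0 = Male, 1 = Female (always check the code book)}
\NormalTok{sex\_code }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\DecValTok{0}\NormalTok{, }\DecValTok{1}\NormalTok{, }\DecValTok{1}\NormalTok{, }\DecValTok{0}\NormalTok{, }\DecValTok{1}\NormalTok{)}
\CommentTok{\# Coerce to factor, stating which label belongs to which code}
\NormalTok{sex }\OtherTok{\textless{}{-}} \FunctionTok{factor}\NormalTok{(sex\_code, }\AttributeTok{levels =} \FunctionTok{c}\NormalTok{(}\DecValTok{0}\NormalTok{, }\DecValTok{1}\NormalTok{), }\AttributeTok{labels =} \FunctionTok{c}\NormalTok{(}\StringTok{"Male"}\NormalTok{, }\StringTok{"Female"}\NormalTok{))}
\FunctionTok{levels}\NormalTok{(sex)}
\FunctionTok{table}\NormalTok{(sex)}
\end{Highlighting}
\end{Shaded}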
\hypertarget{more-operators-and-some-essential-symbols}{%
\section{More operators and some essential symbols}\label{more-operators-and-some-essential-symbols}}
A useful operator is the pipe \texttt{\%\textgreater{}\%}. This is part of the \texttt{tidyverse} package, so it is ready for you to use once the package is loaded. This operator passes the result of one operation to the next. Check the results of the following operations in your \textbf{console}:
\begin{Shaded}
\begin{Highlighting}[]
\DecValTok{1} \SpecialCharTok{\%\textgreater{}\%} \SpecialCharTok{+} \DecValTok{1}
\DecValTok{1} \SpecialCharTok{\%\textgreater{}\%} \SpecialCharTok{+} \DecValTok{1} \SpecialCharTok{\%\textgreater{}\%} \SpecialCharTok{+} \DecValTok{5}
\end{Highlighting}
\end{Shaded}
Observe what happened\ldots{} The result from the first line was 2. This is because this line can be read as: `take 1, THEN add 1'. Therefore, the result is 2.
Similarly, the second line follows this process: `take 1, THEN add 1, take the result of this (which is 2) and THEN add 5'. Therefore, the result is 7. This can sound a bit abstract at this point, but we will practice with some data in the next section.
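The pipe becomes more useful when several functions are chained. As a small sketch (the numbers and the object names \texttt{nested} and \texttt{piped} are arbitrary), the two approaches below compute the same thing, first by nesting the functions and then by piping:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(tidyverse)}
\CommentTok{\# Without the pipe: read from the inside out}
\NormalTok{nested }\OtherTok{\textless{}{-}} \FunctionTok{sum}\NormalTok{(}\FunctionTok{sqrt}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\DecValTok{1}\NormalTok{, }\DecValTok{4}\NormalTok{, }\DecValTok{9}\NormalTok{)))}
\CommentTok{\# With the pipe: take the vector, THEN take square roots, THEN sum}
\NormalTok{piped }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\DecValTok{1}\NormalTok{, }\DecValTok{4}\NormalTok{, }\DecValTok{9}\NormalTok{) }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{sqrt}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{sum}\NormalTok{()}
\CommentTok{\# Both objects hold the same value}
\NormalTok{nested}
\NormalTok{piped}
\end{Highlighting}
\end{Shaded}
Both versions give 6, since the square roots of 1, 4 and 9 are 1, 2 and 3.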
\hypertarget{black-lives-matter}{%
\section{Black lives matter!}\label{black-lives-matter}}
In this section we will work with data originally collected by The Guardian in 2015 (for more information click \href{https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/about-the-counted}{here}). The data set we will use today is an extended version which was openly shared on GitHub by the American news website \href{https://fivethirtyeight.com/}{FiveThirtyEight}. This data set contains information about the people who were killed by police or other law enforcement bodies in the US, such as age, gender, race/ethnicity, etc. Additionally, it includes information about the city or region where the event happened. For more information click \href{https://github.com/fivethirtyeight/data/tree/master/police-killings}{here}.
\hypertarget{downloading-and-reading-the-data}{%
\subsection{Downloading and reading the data}\label{downloading-and-reading-the-data}}
For the following exercises, please make sure that you are working in your \texttt{R} script.
First, we will create a new folder in our project directory to store the data. To do it from \texttt{R}, run this line in your script (don't worry if you get a warning; this appears when you already have a folder with this name):
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{dir.create}\NormalTok{(}\StringTok{"data"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
Note that in the `Files' tab of Pane 4, there is a new folder called \texttt{data}.
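If you would rather avoid the warning altogether, one option is to check whether the folder already exists before creating it. This is only a sketch of an alternative using base \texttt{R} functions; the line shown above works just as well:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Create the folder only if it does not exist yet}
\NormalTok{if (}\SpecialCharTok{!}\FunctionTok{dir.exists}\NormalTok{(}\StringTok{"data"}\NormalTok{)) }\FunctionTok{dir.create}\NormalTok{(}\StringTok{"data"}\NormalTok{)}
\CommentTok{\# Confirm that the folder is there}
\FunctionTok{dir.exists}\NormalTok{(}\StringTok{"data"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}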
Now, download the data from the FiveThirtyEight website using the function \texttt{download.file()}. This function takes two arguments separated by a comma: (1) the URL and (2) the destination (including the directory, file name, and file extension), as shown below. Also, since the file we are downloading is wrapped in a \texttt{.zip} file, we will need to unzip it using \texttt{unzip()}. Copy and paste the following lines into your script, and run them:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{download.file}\NormalTok{(}\StringTok{"https://projects.fivethirtyeight.com/data{-}webpage{-}data/datasets/police{-}killings.zip"}\NormalTok{,}
\StringTok{"data/police{-}killings.zip"}\NormalTok{)}
\FunctionTok{unzip}\NormalTok{(}\StringTok{"data/police{-}killings.zip"}\NormalTok{, }\AttributeTok{exdir =} \StringTok{"data"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
After following the previous steps, we are ready to read the data. As you can see in the `Files' tab, the data comes as a \texttt{.csv} file. Thus, we can use the \texttt{read\_csv()} function included in the \texttt{tidyverse} package (make sure the package is loaded in your session as explained in a \protect\hyperlink{installing-packages}{previous section}). We will assign the data to an object called \texttt{police}.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{police }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"data/police{-}killings/police\_killings.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\hypertarget{examining-the-data}{%
\subsection{Examining the data}\label{examining-the-data}}
If you look at your `Environment' tab in pane 2, you will see there is a new object called \texttt{police}, which has 467 observations and 34 variables (or columns). To start exploring the contents, we will glimpse the \texttt{police} data as follows:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(police)}
\end{Highlighting}
\end{Shaded}
As you can see, there are several variables included in the dataset, such as age, gender, law enforcement agency (\texttt{lawenforcementagency}), or whether the victim was armed (\texttt{armed}). You will see that some of these variables are not of the appropriate type. For instance, some are categorical and should be of type \texttt{\textless{}fct\textgreater{}} instead of \texttt{\textless{}chr\textgreater{}}.
\hypertarget{data-wrangling}{%
\subsection{Data wrangling}\label{data-wrangling}}
Before coercing these variables, we will create a smaller subset selecting only the variables that we are interested in. To do so, we can use the \texttt{select()} function. The \texttt{select} function takes the name of the data first and then the name of the variables we want to keep (no quotation marks needed). We will select a few variables and assign the result to a new object called \texttt{police\_2}.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{police\_2 }\OtherTok{\textless{}{-}} \FunctionTok{select}\NormalTok{(police, age, gender, raceethnicity, lawenforcementagency, armed)}
\end{Highlighting}
\end{Shaded}
If you look again at the `Environment' tab, there is a second data set with the same number of observations but only 5 variables. You can glimpse this object to get a better idea of its contents.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(police\_2)}
\end{Highlighting}
\end{Shaded}
Having a closer look at the reduced version, we can see that in fact all the variables are of type \texttt{\textless{}chr\textgreater{}}, including \texttt{age}.
Let's coerce the variables into their correct type. We will start with age, from character to numeric:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{police\_2 }\OtherTok{\textless{}{-}}\NormalTok{ police\_2 }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{age =} \FunctionTok{as.numeric}\NormalTok{(age))}
\end{Highlighting}
\end{Shaded}
Age is not known for some cases. Thus, it is recorded as `Unknown' in the dataset. Since this is not recognized as a numeric value in the coercion process, \texttt{R} automatically sets it as a missing value, \texttt{NA}. This is why it will give you a warning message.
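To see this behaviour in isolation, here is a minimal sketch with a made-up character vector (the names \texttt{age\_chr} and \texttt{age\_num} and the values are hypothetical, not taken from the police data):
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Hypothetical character vector mixing numbers and text}
\NormalTok{age\_chr }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"16"}\NormalTok{, }\StringTok{"Unknown"}\NormalTok{, }\StringTok{"87"}\NormalTok{)}
\CommentTok{\# Coercing gives a warning because "Unknown" cannot be read as a number}
\NormalTok{age\_num }\OtherTok{\textless{}{-}} \FunctionTok{as.numeric}\NormalTok{(age\_chr)}
\NormalTok{age\_num}
\CommentTok{\# Count the missing values created by the coercion}
\FunctionTok{sum}\NormalTok{(}\FunctionTok{is.na}\NormalTok{(age\_num))}
\end{Highlighting}
\end{Shaded}
Counting the \texttt{NA} values after a coercion is a quick way to check how many observations were affected.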
We can continue coercing \texttt{raceethnicity} and \texttt{gender} from character to a factor:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{police\_2 }\OtherTok{\textless{}{-}}\NormalTok{ police\_2 }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{raceethnicity =} \FunctionTok{factor}\NormalTok{(raceethnicity))}
\NormalTok{police\_2 }\OtherTok{\textless{}{-}}\NormalTok{ police\_2 }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{gender =} \FunctionTok{factor}\NormalTok{(gender))}
\end{Highlighting}
\end{Shaded}
Let's run a summary of your data. This shows the number of observations in each category or a summary of a numeric variable:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{summary}\NormalTok{(police\_2)}
\end{Highlighting}
\end{Shaded}
There are some interesting figures coming out of the summary. For instance, in age you can see that the youngest is\ldots{} 16 years old(?!), and the oldest 87 years old. Also, the vast majority are male (445 vs 22). In relation to race/ethnicity, roughly half of them are `White', whereas `Black' individuals represent an important share. One may ask how the proportion of people killed in each race/ethnicity group compares to the composition of the total population (considering that Black people are a minority group in the US).
Let's suppose that we only want observations in which race/ethnicity is known. To `remove' the \emph{undesired} observations we can use the \texttt{filter()} function. We will assign the result of \texttt{filter()} to an object called \texttt{police\_2}.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{police\_2 }\OtherTok{\textless{}{-}}\NormalTok{ police\_2 }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{filter}\NormalTok{(raceethnicity }\SpecialCharTok{!=} \StringTok{\textquotesingle{}Unknown\textquotesingle{}}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
So, what just happened in the code above? First, the pipe operator, \texttt{\%\textgreater{}\%}: what we are doing verbally is \emph{take the object \texttt{police\_2}, THEN filter raceethnicity based on a condition}. Next, what is happening inside \texttt{filter()}? Let's have a look at what \texttt{R} does in the background for us (Artwork by @allison\_horst):
\begin{figure}
\includegraphics[width=1\linewidth]{./images/lab_2_filter} \hfill{}
\caption{Filter. Source: Artwork by @allison\_horst.}\label{fig:unnamed-chunk-34}
\end{figure}
In the example above, we are keeping the observations in which \texttt{raceethnicity} is NOT EQUAL to `Unknown'. Finally, when we assigned the result to an object with the same name as our previous object, we replaced the \emph{old} dataset with the filtered version.
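The logic of \texttt{filter()} can also be tried on a tiny dataset where the result is easy to check by eye. The sketch below uses a hypothetical tibble called \texttt{toy} with invented values; it is not part of the police data:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(tidyverse)}
\CommentTok{\# Hypothetical toy data (invented values)}
\NormalTok{toy }\OtherTok{\textless{}{-}} \FunctionTok{tibble}\NormalTok{(}
\AttributeTok{group =} \FunctionTok{c}\NormalTok{(}\StringTok{"A"}\NormalTok{, }\StringTok{"B"}\NormalTok{, }\StringTok{"Unknown"}\NormalTok{, }\StringTok{"A"}\NormalTok{),}
\AttributeTok{score =} \FunctionTok{c}\NormalTok{(}\DecValTok{10}\NormalTok{, }\DecValTok{25}\NormalTok{, }\DecValTok{30}\NormalTok{, }\DecValTok{40}\NormalTok{)}
\NormalTok{)}
\CommentTok{\# Keep the rows where group is NOT EQUAL to "Unknown" (3 rows remain)}
\NormalTok{toy }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{filter}\NormalTok{(group }\SpecialCharTok{!=} \StringTok{"Unknown"}\NormalTok{)}
\CommentTok{\# Conditions can be combined: group is "A" AND score is above 20 (1 row remains)}
\NormalTok{toy }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{filter}\NormalTok{(group }\SpecialCharTok{==} \StringTok{"A"} \SpecialCharTok{\&}\NormalTok{ score }\SpecialCharTok{\textgreater{}} \DecValTok{20}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
Because the result is not assigned to an object here, \texttt{toy} itself stays unchanged; the filtered rows are only printed.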
\hypertarget{activity-3}{%
\section{Activity}\label{activity-3}}
Discuss the following questions with your neighbour or tutor:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
What are the main purposes of the functions \texttt{select()} and \texttt{filter()}?
\item
What does \emph{coerce} mean in the context of \texttt{R}? Why do we need to coerce some variables?
\item
What is the \texttt{mutate()} function useful for?
\end{enumerate}
Using the police\_2 dataset:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Filter the observations that are `White' in \texttt{raceethnicity}. How many rows/observations are left?
\item
How many `Latino/Hispanic' are there in the dataset?
\item
Using the example of Figure 2.3, could you filter the data to find how many of the people killed were (a) `Black' and (b) killed by firearm (`firearm')?
\item
What about `White' and `firearm'?
\end{enumerate}
This is the end of Lab 2. Again, the changes in your script should be saved automatically in RStudio Cloud. However, make sure this is the case as you were taught in Lab 1. After this, you can close the tab in your web browser. Hope you had fun!
\hypertarget{data-wrangling-1}{%
\chapter{Data wrangling}\label{data-wrangling-1}}
Welcome to Lab 3!
In our previous session we learned about \texttt{R} packages, including how to install and load them. We talked about the main types of data used in social science research and how to represent them in \texttt{R}. Also, we played around with some datasets using some key functions, such as \texttt{filter()}, \texttt{select()}, and \texttt{mutate()}. In this session we will learn how to import data into \texttt{R}, and how to clean and format it, using a real-world dataset. This is a common and important phase in quantitative research.
\hypertarget{importing-and-data-wrangling}{%
\section{Importing and data wrangling}\label{importing-and-data-wrangling}}
Today, we will be working with data generated by the \href{https://www.ark.ac.uk/ARK/}{Access Research Knowledge (ARK)} hub. ARK conducts a series of surveys about society and life in Northern Ireland. For this lab, we will be working with the results of the \href{https://www.ark.ac.uk/nilt/}{Northern Ireland Life and Times Survey (NILT)} in the year 2012. In particular, we will be using a teaching dataset that focuses on community relations and political attitudes. This includes background information on the participants and their households. Please take 5-10 minutes to read the documentation of this dataset (\href{https://www.ark.ac.uk/teaching/NILT2012TeachingResources.pdf}{click here to access the documentation}). Note that you will have to consult this document regularly to understand and use the data in NILT, so I recommend that you save the PDF file to your local drive if you can. This NILT teaching dataset is also what you will be using for the research report assignment in this course (smart, isn't it?), so it's worth investing the time to learn how to work with this data through the next few labs, as part of the preparation and practice for your assignment.
\hypertarget{downloading-and-reading-the-data-1}{%
\subsection{Downloading and reading the data}\label{downloading-and-reading-the-data-1}}
We will continue using RStudio Cloud, as we did in our previous labs. Please follow the next steps:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Go to your `Quants lab group' in \href{https://rstudio.cloud/}{RStudio Cloud} (if you have not joined a shared space, follow the instructions in \protect\hyperlink{learn-packages}{Section 2.2} of \protect\hyperlink{lab2}{Lab 2}).
\item
Start the project called `NILT' located in your lab group.
\item
Once you have initialized the project, create a new \texttt{R} script file, and save it as `Exploratory analysis'.
\item
Load the \texttt{tidyverse} and \texttt{haven} packages. This last package is useful to import data from SPSS (the \texttt{tidyverse} package was pre-installed in your session). You can copy, paste, and run the following functions from your script:
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(tidyverse)}
\FunctionTok{library}\NormalTok{(haven)}
\end{Highlighting}
\end{Shaded}
Next, we will create a folder to store the data, and then download and read the NILT data set, following the next steps:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
From your script, create a new folder called `data':
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{dir.create}\NormalTok{(}\StringTok{\textquotesingle{}data\textquotesingle{}}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
Download the data using the \texttt{download.file()} function. Remember that you have to specify the URL first, and the destination of the file second (including the folder).
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{download.file}\NormalTok{(}\StringTok{\textquotesingle{}https://www.ark.ac.uk/teaching/NILT2012GR.sav\textquotesingle{}}\NormalTok{, }
\StringTok{\textquotesingle{}data/nilt2012.sav\textquotesingle{}}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{2}
\tightlist
\item
Take a look at the `Files' tab in pane 4. You will see a folder called `data'; click on it, and you will see the \texttt{nilt2012.sav} file.
\end{enumerate}
\begin{figure}
\includegraphics[width=1\linewidth]{./images/rstudio_cloud_files} \hfill{}
\caption{Cloud files.}\label{fig:unnamed-chunk-38}
\end{figure}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{3}
\tightlist
\item
To read this type of file use the \texttt{read\_sav()} function. Read the \texttt{.sav} file and assign it to an object called \texttt{nilt}.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{nilt }\OtherTok{\textless{}{-}} \FunctionTok{read\_sav}\NormalTok{(}\StringTok{"data/nilt2012.sav"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
And that's it! You should see a new data object in your `Environment' tab (Pane 2) ready to be used. You can also see that this contains 1204 observations (rows) and 133 variables (columns). Let's glimpse our newly imported data and see the types of variables included.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{glimpse}\NormalTok{(nilt)}
\end{Highlighting}
\end{Shaded}
\hypertarget{data-wrangling-2}{%
\section{Data wrangling}\label{data-wrangling-2}}
As you can see from the result of \texttt{glimpse()}, the class for practically all the variables is \texttt{\textless{}dbl+lbl\textgreater{}}. What does this mean? This happens because datasets usually use numbers to represent each of the categories/levels in categorical variables. These numbers are \emph{labelled} with their respective meaning. This is why we have a combination of value types (\texttt{\textless{}dbl+lbl\textgreater{}}). Take the example of the variable called \texttt{rsex}. As you can see from the values displayed using \texttt{glimpse()}, it includes numbers only, e.g.~\texttt{1,1,2,2...}. This is because `1' represents `Male' respondents and `2' represents `Female' respondents in the NILT dataset (n.b.~the authors of this lab workbook recognise that sex and gender are different concepts, and we acknowledge this tension and that it is problematic to imply or define gender identities as binary, as with any dataset. More recent surveys normally approach this in a more inclusive way by offering self-describe options). You can check the pre-defined values and labels of each variable in NILT in the \href{https://www.ark.ac.uk/teaching/NILT2012TeachingResources.pdf}{documentation} or by running \texttt{print\_labels(nilt\$rsex)} in your console, which returns the numeric values and their respective labels. As with \texttt{rsex}, this is the case for many other variables in this data set.
You should be aware that this type of `mixed' variable is a special case, since we imported the data from a \emph{foreign} file format that saves metadata for each variable (containing the names of the categories). As you learned in the last lab, in \texttt{R} we treat categorical variables as \texttt{factor}. Therefore, we will coerce some variables to \texttt{factor}. This time we will use the function \texttt{as\_factor()} instead of the simple \texttt{factor()} that we used before. This is because \texttt{as\_factor()} allows us to keep the names of each category in the variables. The syntax is exactly the same as before. Copy and run the following from your script:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Gender of the respondent}
\NormalTok{nilt }\OtherTok{\textless{}{-}}\NormalTok{ nilt }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{rsex =} \FunctionTok{as\_factor}\NormalTok{(rsex))}
\CommentTok{\# Highest Educational qualification}
\NormalTok{nilt }\OtherTok{\textless{}{-}}\NormalTok{ nilt }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{highqual =} \FunctionTok{as\_factor}\NormalTok{(highqual))}
\CommentTok{\# Religion}
\NormalTok{nilt }\OtherTok{\textless{}{-}}\NormalTok{ nilt }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{religcat =} \FunctionTok{as\_factor}\NormalTok{(religcat))}
\CommentTok{\# Political identification}
\NormalTok{nilt }\OtherTok{\textless{}{-}}\NormalTok{ nilt }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{uninatid =} \FunctionTok{as\_factor}\NormalTok{(uninatid))}
\CommentTok{\# Happiness}
\NormalTok{nilt }\OtherTok{\textless{}{-}}\NormalTok{ nilt }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\AttributeTok{ruhappy =} \FunctionTok{as\_factor}\NormalTok{(ruhappy))}
\end{Highlighting}
\end{Shaded}
Notice from the code above that we are replacing the `old' dataset with the result of the mutated variables, which are now of type \texttt{factor}. This is why we assigned the result with the \emph{assignment operator} \texttt{\textless{}-}.
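To see what \texttt{as\_factor()} does with the labels, here is a minimal sketch using a small labelled vector built by hand with \texttt{labelled()} from the \texttt{haven} package. The object names \texttt{sex\_lbl} and \texttt{sex\_fct} and the values are made up to mimic a variable imported from SPSS; they are not taken from NILT:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(haven)}
\CommentTok{\# Hypothetical labelled vector: numeric codes with labels attached}
\NormalTok{sex\_lbl }\OtherTok{\textless{}{-}} \FunctionTok{labelled}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\DecValTok{1}\NormalTok{, }\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{, }\DecValTok{1}\NormalTok{), }\AttributeTok{labels =} \FunctionTok{c}\NormalTok{(}\AttributeTok{Male =} \DecValTok{1}\NormalTok{, }\AttributeTok{Female =} \DecValTok{2}\NormalTok{))}
\CommentTok{\# Show the numeric values and their labels}
\FunctionTok{print\_labels}\NormalTok{(sex\_lbl)}
\CommentTok{\# as\_factor() uses the labels as the names of the levels}
\NormalTok{sex\_fct }\OtherTok{\textless{}{-}} \FunctionTok{as\_factor}\NormalTok{(sex\_lbl)}
\FunctionTok{levels}\NormalTok{(sex\_fct)}
\end{Highlighting}
\end{Shaded}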
What about the numeric variables? In the documentation file there is a table in which some variables have the type of measure `scale'. This usually refers to continuous numeric variables (e.g.~age or income).\footnote{Be careful, in some cases these actually correspond to \emph{discrete} numeric values in this dataset (things that can be counted, e.g.~number of\ldots).} Let's coerce some variables to the appropriate type.
In the previous operation we coerced the variables as factor one by one, but we can transform several variables at once within the \texttt{mutate} function. As we did before, copy and run the following code in your script:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Coerce several variables as numeric}
\NormalTok{nilt }\OtherTok{\textless{}{-}}\NormalTok{ nilt }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{mutate}\NormalTok{(}
\AttributeTok{rage =} \FunctionTok{as.numeric}\NormalTok{(rage),}
\AttributeTok{rhourswk =} \FunctionTok{as.numeric}\NormalTok{(rhourswk),}
\AttributeTok{persinc2 =} \FunctionTok{as.numeric}\NormalTok{(persinc2),}
\NormalTok{ )}
\end{Highlighting}
\end{Shaded}
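When the same function is applied to many columns, \texttt{dplyr} also offers \texttt{across()}, which can shorten the code. This is an optional alternative rather than the approach used in this workbook; the sketch below uses a hypothetical tibble \texttt{toy} with invented values stored as text:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(tidyverse)}
\CommentTok{\# Hypothetical toy data with numbers stored as text}
\NormalTok{toy }\OtherTok{\textless{}{-}} \FunctionTok{tibble}\NormalTok{(}\AttributeTok{a =} \FunctionTok{c}\NormalTok{(}\StringTok{"1"}\NormalTok{, }\StringTok{"2"}\NormalTok{), }\AttributeTok{b =} \FunctionTok{c}\NormalTok{(}\StringTok{"3.5"}\NormalTok{, }\StringTok{"4"}\NormalTok{))}
\CommentTok{\# Apply as.numeric() to columns a and b in one step}
\NormalTok{toy }\OtherTok{\textless{}{-}}\NormalTok{ toy }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{mutate}\NormalTok{(}\FunctionTok{across}\NormalTok{(}\FunctionTok{c}\NormalTok{(a, b), as.numeric))}
\FunctionTok{glimpse}\NormalTok{(toy)}
\end{Highlighting}
\end{Shaded}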
Before doing some analyses, we will drop unused levels (or categories) in our dataset using the function \texttt{droplevels()}, as follows:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# drop unused levels}
\NormalTok{nilt }\OtherTok{\textless{}{-}} \FunctionTok{droplevels}\NormalTok{(nilt)}
\end{Highlighting}
\end{Shaded}
The previous function is useful to remove some categories that are not being used in the dataset (e.g.~categories including 0 observations).
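As a small illustration of what `unused' means, the sketch below builds a factor with a level that never occurs and then drops it. The object names \texttt{answer} and \texttt{answer\_2} and the categories are invented for this example:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Hypothetical factor with a level ("Refused") that has no observations}
\NormalTok{answer }\OtherTok{\textless{}{-}} \FunctionTok{factor}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\StringTok{"Yes"}\NormalTok{, }\StringTok{"No"}\NormalTok{, }\StringTok{"Yes"}\NormalTok{), }\AttributeTok{levels =} \FunctionTok{c}\NormalTok{(}\StringTok{"Yes"}\NormalTok{, }\StringTok{"No"}\NormalTok{, }\StringTok{"Refused"}\NormalTok{))}
\CommentTok{\# The unused level is listed with a count of 0}
\FunctionTok{table}\NormalTok{(answer)}
\CommentTok{\# After dropping unused levels, only "Yes" and "No" remain}
\NormalTok{answer\_2 }\OtherTok{\textless{}{-}} \FunctionTok{droplevels}\NormalTok{(answer)}
\FunctionTok{levels}\NormalTok{(answer\_2)}
\end{Highlighting}
\end{Shaded}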
Finally, save the NILT survey in an \texttt{.rds} file (this is the \texttt{R} format). We will not use this file now, but it will save us time formatting the dataset in the next labs, so we do not have to repeat the steps above every time.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{saveRDS}\NormalTok{(nilt, }\StringTok{"data/nilt\_r\_object.rds"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
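The general idea of saving an object and reading it back can be sketched with a small built-in dataset and a temporary file. The names \texttt{example\_data}, \texttt{example\_path} and \texttt{example\_copy} are hypothetical and unrelated to the NILT file:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Hypothetical small object and a temporary file path (illustration only)}
\NormalTok{example\_data }\OtherTok{\textless{}{-}} \FunctionTok{head}\NormalTok{(iris)}
\NormalTok{example\_path }\OtherTok{\textless{}{-}} \FunctionTok{tempfile}\NormalTok{(}\AttributeTok{fileext =} \StringTok{".rds"}\NormalTok{)}
\CommentTok{\# Save the object, then read it back into a new object}
\FunctionTok{saveRDS}\NormalTok{(example\_data, example\_path)}
\NormalTok{example\_copy }\OtherTok{\textless{}{-}} \FunctionTok{readRDS}\NormalTok{(example\_path)}
\CommentTok{\# Check whether the copy matches the original}
\FunctionTok{identical}\NormalTok{(example\_data, example\_copy)}
\end{Highlighting}
\end{Shaded}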
\hypertarget{read-the-clean-dataset}{%
\section{Read the clean dataset}\label{read-the-clean-dataset}}
Phew! Good job. You have completed the basics for wrangling the data and producing a workable dataset.
As a final step, just double check that things went as expected. For this purpose, we will re-read the clean dataset.
\hypertarget{activity-4}{%
\subsection{Activity}\label{activity-4}}
\begin{itemize}
\tightlist
\item
Using the \texttt{readRDS()} function, read the \texttt{.rds} file that you just created in the last step and assign it to an object called \texttt{cleanesed\_data}. Remember to include the full path to the file in quotation marks inside the function.
\item
Run the \texttt{glimpse} function on the \texttt{cleanesed\_data} object.
\item
Run the \texttt{glimpse} function on the \texttt{nilt} object.
\item
Do they look the same? If yes, it means that you successfully saved your work.
\end{itemize}
\hypertarget{exploratory-data-analysis}{%
\chapter{Exploratory data analysis}\label{exploratory-data-analysis}}