From b423ba7efc7508f5e277489859cd70db8c618ece Mon Sep 17 00:00:00 2001 From: trevorcampbell Date: Wed, 21 Aug 2024 20:03:42 +0000 Subject: [PATCH] deploy: 3011db3a984f1a7efa7f33b5a254908f49e4f53c --- .../figure-html/01-ref-vs-tibble-1.png | Bin 479619 -> 479619 bytes .../figure-html/02-dataframe-1.png | Bin 102423 -> 102423 bytes pull583/_main_files/figure-html/02-obs-1.png | Bin 93430 -> 93430 bytes .../figure-html/02-tidy-image-1.png | Bin 156456 -> 156456 bytes .../figure-html/02-vec-vs-list-1.png | Bin 45587 -> 45587 bytes .../_main_files/figure-html/02-vector-1.png | Bin 26767 -> 26767 bytes .../_main_files/figure-html/02-vectors-1.png | Bin 99790 -> 99790 bytes .../figure-html/activate-and-run-button-1.png | Bin 174457 -> 174457 bytes .../figure-html/add-collab-01-1.png | Bin 327587 -> 327587 bytes .../figure-html/add-collab-02-1.png | Bin 240768 -> 240768 bytes .../figure-html/add-collab-03-1.png | Bin 363829 -> 363829 bytes .../figure-html/add-collab-04-1.png | Bin 273765 -> 273765 bytes .../figure-html/add-collab-05-1.png | Bin 252578 -> 252578 bytes .../_main_files/figure-html/clone-01-1.png | Bin 159601 -> 159601 bytes .../_main_files/figure-html/clone-02-1.png | Bin 291800 -> 291800 bytes .../_main_files/figure-html/clone-03-1.png | Bin 160044 -> 160044 bytes .../_main_files/figure-html/clone-04-1.png | Bin 155365 -> 155365 bytes .../figure-html/code-cell-not-run-1.png | Bin 182474 -> 182474 bytes .../figure-html/code-cell-run-1.png | Bin 358718 -> 358718 bytes .../convert-to-markdown-cell-1.png | Bin 269921 -> 269921 bytes .../figure-html/create-new-code-cell-1.png | Bin 169488 -> 169488 bytes .../figure-html/create-new-file-01-1.png | Bin 356131 -> 356131 bytes .../figure-html/create-new-file-02-1.png | Bin 269604 -> 269604 bytes .../figure-html/create-new-file-03-1.png | Bin 225306 -> 225306 bytes .../figure-html/docker-desktop-images-1.png | Bin 73548 -> 73548 bytes .../docker-desktop-runconfig-1.png | Bin 94084 -> 94084 bytes 
.../figure-html/docker-desktop-search-1.png | Bin 102127 -> 102127 bytes .../figure-html/docker-desktop-url-1.png | Bin 132419 -> 132419 bytes .../figure-html/generate-pat-01-1.png | Bin 104053 -> 104053 bytes .../figure-html/generate-pat-02-1.png | Bin 359293 -> 359293 bytes .../figure-html/generate-pat-03-1.png | Bin 173724 -> 173724 bytes .../_main_files/figure-html/git-add-01-1.png | Bin 271993 -> 271993 bytes .../_main_files/figure-html/git-add-02-1.png | Bin 304469 -> 304469 bytes .../_main_files/figure-html/git-add-03-1.png | Bin 300497 -> 300497 bytes .../figure-html/git-commit-01-1.png | Bin 518196 -> 518196 bytes .../figure-html/git-commit-03-1.png | Bin 355898 -> 355898 bytes .../_main_files/figure-html/git-pull-00-1.png | Bin 357152 -> 357152 bytes .../_main_files/figure-html/git-pull-01-1.png | Bin 338026 -> 338026 bytes .../_main_files/figure-html/git-pull-02-1.png | Bin 288729 -> 288729 bytes .../_main_files/figure-html/git-pull-03-1.png | Bin 367892 -> 367892 bytes .../_main_files/figure-html/git-pull-04-1.png | Bin 481152 -> 481152 bytes .../_main_files/figure-html/git-push-01-1.png | Bin 339710 -> 339710 bytes .../_main_files/figure-html/git-push-02-1.png | Bin 320771 -> 320771 bytes .../_main_files/figure-html/git-push-03-1.png | Bin 326564 -> 326564 bytes .../_main_files/figure-html/git-push-04-1.png | Bin 440884 -> 440884 bytes .../_main_files/figure-html/img-arrange-1.png | Bin 59507 -> 59507 bytes .../_main_files/figure-html/img-filter-1.png | Bin 83663 -> 83663 bytes .../_main_files/figure-html/img-ggplot-1.png | Bin 165258 -> 165258 bytes .../_main_files/figure-html/img-mutate-1.png | Bin 118403 -> 118403 bytes .../figure-html/img-pivot-longer-1.png | Bin 166077 -> 166077 bytes .../figure-html/img-pivot-wider-1.png | Bin 100008 -> 100008 bytes .../figure-html/img-read-csv-1.png | Bin 45884 -> 45884 bytes .../_main_files/figure-html/img-select-1.png | Bin 49600 -> 49600 bytes .../figure-html/img-separate-1.png | Bin 128761 -> 128761 bytes 
.../_main_files/figure-html/issue-01-1.png | Bin 391945 -> 391945 bytes .../_main_files/figure-html/issue-02-1.png | Bin 230244 -> 230244 bytes .../_main_files/figure-html/issue-03-1.png | Bin 462659 -> 462659 bytes .../_main_files/figure-html/issue-04-1.png | Bin 487695 -> 487695 bytes .../_main_files/figure-html/issue-06-1.png | Bin 384235 -> 384235 bytes .../_main_files/figure-html/launcher-1.png | Bin 210962 -> 210962 bytes .../figure-html/markdown-cell-not-run-1.png | Bin 170111 -> 170111 bytes .../figure-html/markdown-cell-run-1.png | Bin 160529 -> 160529 bytes .../figure-html/merge-conflict-01-1.png | Bin 356185 -> 356185 bytes .../figure-html/merge-conflict-03-1.png | Bin 330969 -> 330969 bytes .../figure-html/merge-conflict-04-1.png | Bin 346504 -> 346504 bytes .../figure-html/merge-conflict-05-1.png | Bin 337675 -> 337675 bytes .../figure-html/merge-conflict-06-1.png | Bin 318977 -> 318977 bytes .../figure-html/mutate-across-1.png | Bin 17930 -> 17930 bytes .../figure-html/new-repository-01-1.png | Bin 388924 -> 388924 bytes .../figure-html/new-repository-02-1.png | Bin 340622 -> 340622 bytes .../figure-html/new-repository-03-1.png | Bin 309321 -> 309321 bytes .../figure-html/open-data-w-editor-1-1.png | Bin 302867 -> 302867 bytes .../figure-html/open-data-w-editor-2-1.png | Bin 976187 -> 976187 bytes .../figure-html/out-of-order-1-1.png | Bin 46463 -> 46463 bytes .../figure-html/out-of-order-2-1.png | Bin 49900 -> 49900 bytes .../figure-html/out-of-order-3-1.png | Bin 69998 -> 69998 bytes .../_main_files/figure-html/pen-tool-01-1.png | Bin 298831 -> 298831 bytes .../_main_files/figure-html/pen-tool-02-1.png | Bin 222380 -> 222380 bytes .../_main_files/figure-html/pen-tool-03-1.png | Bin 226916 -> 226916 bytes .../figure-html/restart-kernel-run-all-1.png | Bin 226369 -> 226369 bytes pull583/_main_files/figure-html/rowwise-1.png | Bin 9979 -> 9979 bytes .../_main_files/figure-html/summarize-1.png | Bin 10416 -> 10416 bytes 
.../figure-html/summarize-across-1.png | Bin 13593 -> 13593 bytes .../figure-html/summarize-groupby-1.png | Bin 14985 -> 14985 bytes .../figure-html/ubuntu-docker-terminal-1.png | Bin 209027 -> 209027 bytes .../figure-html/upload-files-01-1.png | Bin 371590 -> 371590 bytes .../figure-html/upload-files-02-1.png | Bin 358888 -> 358888 bytes .../_main_files/figure-html/vc-ba2-add-1.png | Bin 117358 -> 117358 bytes .../figure-html/vc-ba3-commit-1.png | Bin 124594 -> 124594 bytes .../figure-html/vc1-no-changes-1.png | Bin 139664 -> 139664 bytes .../_main_files/figure-html/vc2-changes-1.png | Bin 149129 -> 149129 bytes .../_main_files/figure-html/vc5-push-1.png | Bin 261330 -> 261330 bytes .../figure-html/vc6-remote-changes-1.png | Bin 183654 -> 183654 bytes .../_main_files/figure-html/vc7-pull-1.png | Bin 199988 -> 199988 bytes pull583/classification1.html | 2 +- pull583/index.html | 10 ++++++---- pull583/regression1.html | 2 +- pull583/regression2.html | 2 +- pull583/search_index.json | 2 +- 99 files changed, 10 insertions(+), 8 deletions(-) diff --git a/pull583/_main_files/figure-html/01-ref-vs-tibble-1.png b/pull583/_main_files/figure-html/01-ref-vs-tibble-1.png index 9ff225f678d19d6ff5a9d3f570de5e51f80b4019..7540db29d678c10f8543bd3017d89bf20b7c293f 100644 GIT binary patch delta 92 zcmZo(EZe+TcEc@3b`b`aS~G!Z%?}ydA2Nb46A&{4F$)m00x{e6hm7n~{P;~Rtc;Dd f4GgRd44!QLTt2-efL#e$aOw;1)6=&Gu%80}%`zYv delta 92 zcmZo(EZe+TcEc@3c42*;_zTlonjbQ@KV$@9CLm@8Viq7~1!A`C4;k5~`0<;VSs5B? 
f8yHv_7`*e~3Ygv!z^;TWIA3(ro9SBv*v|m~1NI-C diff --git a/pull583/_main_files/figure-html/02-dataframe-1.png b/pull583/_main_files/figure-html/02-dataframe-1.png index 2f23b4428505c99b0affd6d2564a4aeaff6bd3fb..62162701648d19e803357ff8659a67e4196ebd24 100644 GIT binary patch delta 72 zcmbQffNlB$wh0s1MHpDQ3>6p}r*ECkXwuJbYGGw;tZiUmWnl1R>*w<6xB3~Cq>%)t SzVJTH00f?{elF{r5}E)qnih8e delta 72 zcmbQffNlB$wh0s1h4pnct}hd6oW6BBqe(x%iJ6t5k+y+>m4U(i1p;TM-|A;nl136- T_jg4h0}yz+`njxgN@xNAfk+pp diff --git a/pull583/_main_files/figure-html/02-obs-1.png b/pull583/_main_files/figure-html/02-obs-1.png index 34948c2ec96cf5c9d61b542b438d20f62b028254..6e3030b74260d8cb15a6f52489ac634fbb3102a6 100644 GIT binary patch delta 72 zcmex%ll9w8)(I2XMHpDQ3>6p}r*ECkD4fG@YGGw;tZiUmWnl1R>*w<6J98M7q>%)t SzVJTH00f?{elF{r5}E*Skry2R delta 72 zcmex%ll9w8)(I2Xh4pnct}hd6oW6BBqi_ztiJ6t5k+y+>m4U(i1p;TM@62ITl136- T_jg4h0}yz+`njxgN@xNAzCRdE diff --git a/pull583/_main_files/figure-html/02-tidy-image-1.png b/pull583/_main_files/figure-html/02-tidy-image-1.png index b7c9998bfb8230ff38c35399f2155c3960e5feef..f12efb70fed3257b7f6d42d0c9fc5afa820e8ebe 100644 GIT binary patch delta 75 zcmZ2+jB~{?&IuFPMHpDQ3>6p}r?*aL+&Z1fHHqKU!phiK+rYrez~IT&&*jr!Cow5W WBMDA@;eDC`2s~Z=T-G@yGywpaju+AZ delta 75 zcmZ2+jB~{?&IuFPh4pnct}hd6oZdQ}aqDy@*Cc)uGb=+QZ36=<1B3eu1kO%>oy4Rh WjU>44?}|bOAn6qQPB+-dZ)#y>Y^-fyU}a$NWb5bh$=5b2Nh66(ec^qY P0SG)@{an^LB{Ts50Y(-H delta 68 zcmbRIglX~(Mft7*5{RIMNCtusBB#k7t?(d31 P1|aZs^>bP0l+XkKN!b@U diff --git a/pull583/_main_files/figure-html/02-vector-1.png b/pull583/_main_files/figure-html/02-vector-1.png index 12a17ee102338a7dca650b281ce941a30c7a4207..a7784bf0108f09064175ab379597a2928c8505ed 100644 GIT binary patch delta 68 zcmeCb$k>08al!<45e8N+Lj{J7)7PZ)n_5^I8*3XFSQ!{R+4{MBa!`hnG?LiV7v85C OfWXt$&t;ucLK6VyZ52KM delta 68 zcmeCb$k>08al!<4VSQbV>&t{TPG6JGZ(?R;XrygmU}a!%e}TZ+$w3)P(nw075WzHZ?+wXiZa);2J(GB9|u^>g|3iLH!E(nx|+ SUwEHp00K`}KbLh*2~7Z9cNY`@ delta 72 zcmX@t&33MvZNdb0VSQbV>&t{1r*ECk__~GP#LUXjNZY``%D~|M0)eyBC$=&wNh1la 
T`@5o$0SG)@{an^LB{Ts5tR)yk diff --git a/pull583/_main_files/figure-html/activate-and-run-button-1.png b/pull583/_main_files/figure-html/activate-and-run-button-1.png index 69e8fbe6dfb708092ca9431a5bca5a61e58f2452..999da0255c20e30b005cf0432e8c5bdb3d51faba 100644 GIT binary patch delta 72 zcmex)iRCJSpl}yP1?(Z3p diff --git a/pull583/_main_files/figure-html/add-collab-01-1.png b/pull583/_main_files/figure-html/add-collab-01-1.png index 7b82e0aeb6c2726366101ecd91096c1c360b8fbb..840f7aae5a3a411a7bffe494b43681c96f9eb32f 100644 GIT binary patch delta 80 zcmZ4dUwH9<;SHB~*hQE*HCEZbXuiqQev^k0h?#(xdHYQs7WGg3rWRJl#@Yr3Rt5%7 Zwtg<3&i|i930ZLJ3-8m@9saW<0|42)9i#vN delta 80 zcmZ4dUwH9<;SHB~*oBQuHeX0O*L;(w{U#415HkTW^Y)uOEb5>5P0XwejkFC6tPBk9 ZFAz98o&P_J60+dBzbgu-JN#!!1^^Hh9yI^} diff --git a/pull583/_main_files/figure-html/add-collab-02-1.png b/pull583/_main_files/figure-html/add-collab-02-1.png index 7689491c73e84fd3183e3bee117edd7a7cbf0922..298c76a406120ebbfbd6b5eb56bbbad1c8ea0bad 100644 GIT binary patch delta 76 zcmZp;$=7g`Z^I=Xb`fSyjaBw9ns4&7-{fHgVy5jkd6-$|^P5^&85?UG7+4t?JlXoW WeEOT!%u2|DQ(t(Wo-VtFIT-*gsT(r@ delta 76 zcmZp;$=7g`Z^I=Xc3~saBiDl>ns4&7-{fHgVy5jkd6-$|^P8Ai85(ID7+4t?++QGY WcKVyu%u2|D>;A4NoG!bDIT-*(P#aeO diff --git a/pull583/_main_files/figure-html/add-collab-03-1.png b/pull583/_main_files/figure-html/add-collab-03-1.png index 5bb6ac6da5c62bc005e701ecd7f778f5d43bf2e2..9a5104ff2307141469915159299185a150fcf8e2 100644 GIT binary patch delta 84 zcmdlwOKj^bu??4a*hQGRHZIN!YQD+Sev^k0h?#(x8Hic7-{fK4Sa+MgZ_;8`l5; delta 81 zcmaEQOW^4(fep`j*oBSEV(&3OYJSJl{*H$ch?#(xdHXva7Abjt6EiDABW(i%D+7c3 Z3k1$i7uR7?LKa;2cSYfL4;>a+MgaI%8@m7i diff --git a/pull583/_main_files/figure-html/add-collab-05-1.png b/pull583/_main_files/figure-html/add-collab-05-1.png index 2fa7e1ad105a003044744ba4de2475a8dbfa8d26..4d3b2f314ce30d337e53b229099b43837914b78c 100644 GIT binary patch delta 77 zcmZ3qm4DGz{teH0*hQGRHZIN!YJSJl{*H$ch?%y(<6#az!*6O~Wo)c%U|?ln@MP=f X^64QrnU#rU|?lnaDRco 
X+36uSnU#8-{fK3ev^mkYa+j?g_W_fwt<0_fx(llpUbD; T&0|tR7M%LR`}B0ae5Pao-If{3 delta 72 zcmex(kMrX_&JCA%*oBSsEw`>#YQD+Sev^lB`%NCEuZjF7W>$tq+6D$z1_t*R2%MdM TH;+jPS#aIo6@}CJ@|ltW2+SH& diff --git a/pull583/_main_files/figure-html/clone-02-1.png b/pull583/_main_files/figure-html/clone-02-1.png index 7d25feb394f78e47969bd4c45c96aa32f2e0aaf1..673a3fd7426a6cd2b8e427fbbef3abba41dc4657 100644 GIT binary patch delta 80 zcmccdT=2$o!3~#q*hQFG!la!;n{V>8-{fHgVkRJF-hPvZB_@sE)WXWxSlhtB%D~{s Y*3aeBU5Z(hkOil{@IF1gw3sCs0F`tb-~a#s delta 80 zcmccdT=2$o!3~#q*oBSsEw`>#YQD+Sev^k0h?#(xdHYQsmY6ht6EiDABW(i%D+7c3 Y3k1$icPVC3LKa;2cSYg!(qfim0Lb$kX#fBK diff --git a/pull583/_main_files/figure-html/clone-03-1.png b/pull583/_main_files/figure-html/clone-03-1.png index 585f8cf94734fd077ff4d5f4f9aa3054c31a2854..6e7f97dd31c59f48a2c9943adeade78342d18486 100644 GIT binary patch delta 72 zcmZ4Uh;z*&&JCA%*hQFG!la!;n{V>8-{fK3ev^l3Lo&aqg_W_fwt<0_fx(llpUbDu TC}2`T7M%LR`}Fja1x(2Rs+}2r delta 72 zcmZ4Uh;z*&&JCA%*oBP@w(MPYz4<0j`%NCk?KgRtHYD?#m{}PbX&V?=85rDOAaHj2 Ui~=SlWWjZRR}@Y^S-_MG02;F!I{*Lx diff --git a/pull583/_main_files/figure-html/clone-04-1.png b/pull583/_main_files/figure-html/clone-04-1.png index b4db248fa9e1ac2d86668d4378245881fe2bbd65..8a0e0f4e004853ebddbb2373819c0bc6f8bbe583 100644 GIT binary patch delta 72 zcmaF5m-Fdf&JCA%*hQFG6=jMKH{axGzsbY6{U#4nW+1<*g_W_fwt<0_fx(llpUbBQ T#xW@&3r>CEeR_IR98)p?v%eVz delta 72 zcmaF5m-Fdf&JCA%*oBP@w(MPYz4<0j`%NCk?KgRtG6VTd%&ZKJv<(cb3=Hls5I8$M TFpfzHS#aIo6@}BA;+T>F4AvT0 diff --git a/pull583/_main_files/figure-html/code-cell-not-run-1.png b/pull583/_main_files/figure-html/code-cell-not-run-1.png index bd03806940aa29f661135ee8175fab8dee3f5761..d2c87383b0374adbb899f234fb13478f9a91387f 100644 GIT binary patch delta 72 zcmX>#k^9s{?hVg**hQGk+OI25X@1Ak{*H%n`#T;c%ia8@7FNba+6D$z1_n-tejJ`| Tc7jO>S#k^9s{?hVg**oBQ$Zzed|G{56%f5*eP{T&aJjAwaK7lKH`8xduqFclKBOJP diff --git a/pull583/_main_files/figure-html/convert-to-markdown-cell-1.png 
b/pull583/_main_files/figure-html/convert-to-markdown-cell-1.png index 80ded2d6c5c3ab0a98ed2eb447112841a3587a50..ba96a00b2cf2d5778fcd3506a6e43066fd9bfff7 100644 GIT binary patch delta 80 zcmaF3N8sTefep`j*hQEu67u;bH^1X)f5*cJ#7scUy!{;y%LM^`Qwu9&BW(i%D+2?k YLq85rKP$_kge+)0-=Ti`KUtP!0KcRg?f?J) delta 80 zcmaF3N8sTefep`j*oBSM&GVE6n&0uXzvE#9VkRJF-u{k<<$?geiJ6t5p|*j6m4U%K Y53YdeXJuKGkOk+9ZhABQpDar<0EJ;1`Tzg` diff --git a/pull583/_main_files/figure-html/create-new-code-cell-1.png b/pull583/_main_files/figure-html/create-new-code-cell-1.png index 325b867ef322e472888bbedfed397c42210bcca2..29fa1319e55125959fcdd00cad0b955027b73193 100644 GIT binary patch delta 72 zcmbQRhHJtat_{z5*hQGkKU@%IYJSJl{*H%n`#T<{`VM|m3oBzIZ36=<0|Tc+KMqf? TnaZSuENDF6p?><>sZ7ZLxg!|t delta 72 zcmbQRhHJtat_{z5*oBSM95RL5n&0uXzvE%t{*H&KzJuSy%*xPE+rYrez~G$+SHSd| TsZ2`9g7ZZ;y_vptDpN85fl?R! diff --git a/pull583/_main_files/figure-html/create-new-file-01-1.png b/pull583/_main_files/figure-html/create-new-file-01-1.png index 6c62f78646d0abb9b3a34c049bcbc0b948c549a4..32ca7fb76675dfa0e8411864a24533f81085aa4c 100644 GIT binary patch delta 84 zcmZ2{PjvA;(G8b)*hQEb)e5)YYQD+Sev^k0h?#(x8Hic7-{fIkp2%-%VP$NrZD3$! 
bVDMz?=kn?Od8|svf>U33pPqgok2M(pN>v`s delta 84 zcmZ2{PjvA;(G8b)*oBRB7VP;oqxmLJ`%NB3AZ7w$W*}zSev^lFc_P1wnU$fDwt<0_ bfx-O+0%xc9=dmgw3$FXSqHy|wJl13Yh*BSF diff --git a/pull583/_main_files/figure-html/create-new-file-02-1.png b/pull583/_main_files/figure-html/create-new-file-02-1.png index ef27845798509a40593774743881c056dcfe3289..f3d173a24f7b0d03e7f3ff365fae294510019a34 100644 GIT binary patch delta 80 zcmZ3oOJK<^fen{<*hQEb)e5)YYQD+Sev^k0h?#(xdHYQsmKFT`rWRJl#@Yr3Rt5%7 Zwtg<3K0$^>30ZLJ3-8m@56ZA40|1{M92Wop delta 80 zcmZ3oOJK<^fen{<*oBRBukSv^*L;(w{U#415HkTW^Y)uOEGzi=P0XwejkFC6tPBk9 ZFAz98eS!>&60+dBzbgu-ACzH91^~EL95Da@ diff --git a/pull583/_main_files/figure-html/create-new-file-03-1.png b/pull583/_main_files/figure-html/create-new-file-03-1.png index 0c36e7776551cebd9662d28b953747a165825be8..8a85fde39f4f991218f6c7abd1d8c136b5136124 100644 GIT binary patch delta 76 zcmbRBfOpmd-VK*{*hQF`_E$VhXuiqQev^k0h?%zE>|uUi_U*zY366z&dCEeR{i=45Jb=02%!i%m4rY delta 69 zcmX@JkLAohmJJ+C?83&3e=4T0Y366z&dgn)#Ww^D{9DXYreuSs5B>8yHv_7~EeVaCW+34x>|uUi_U*zY366z&d>k_C8HM% delta 69 zcmaDqm+k#rwhbIi?83&3e=4T0Y366z&d>k_DWewj diff --git a/pull583/_main_files/figure-html/docker-desktop-url-1.png b/pull583/_main_files/figure-html/docker-desktop-url-1.png index 92de2782230237bfa16d7f59d65dd83e1282628b..b3f56d44af023b2f34d4ccd02b26cc73a250a2e3 100644 GIT binary patch delta 69 zcmX@y#c{ZcV*>{hy9l$;qVwNan)#Ww^D{AC`p<7_VP$NrZD3$!VDMz?=kn{hyRb3ipNi>gn)#Ww^D{AC`p<7-W@TukZD3$!U~qqdz}e{^7@3ri Q1=syuQMg@^iAjkW03u5j+W-In diff --git a/pull583/_main_files/figure-html/generate-pat-01-1.png b/pull583/_main_files/figure-html/generate-pat-01-1.png index 1f582d513c5c6765b21839b788bae870d9eec779..ab665c1054fc1c9f51b82292a6b355712c3487bf 100644 GIT binary patch delta 68 zcmeymhVAPbwhb?M*hQF`Pt;%g*8HAl`+FY7#~u8p7FNc_+6D$z1_n>IelDN>a4Mq` Qvf$Jg-lwMvO=C<30Nw5yt^fc4 delta 68 zcmeymhVAPbwhb?M*oBSs{5H+r()^xh`+FY7#~u78W>$tq+6D$z1_t*R2%Me%a4Mq` Qvf#SED+;FzO=C<30Jdov!2kdN diff --git 
a/pull583/_main_files/figure-html/generate-pat-02-1.png b/pull583/_main_files/figure-html/generate-pat-02-1.png index f227bd32a5a04ac2e72e870a432385f92515741d..3657890f5b4851050543e055f979b9d6b95fc936 100644 GIT binary patch delta 85 zcmezSP4w?K(G46->>|v}C+e?#YvyNa=VxLBVi3&?#4OwSnOKcV`AsdXjE%Jo46FI?7F+Z)SS|1bjpf7Ken delta 85 zcmezSP4w?K(G46-?7~L+mRnaVHS;sI^D{95F^FacVwUavOsqzw{3d2rhDO>3237_J a_ZJA9ogPrms)Q`K?(d4i?TzKEf0zL%?-@Y= diff --git a/pull583/_main_files/figure-html/generate-pat-03-1.png b/pull583/_main_files/figure-html/generate-pat-03-1.png index 402f0f09a490b588e20a0881eab71b3508beeefb..4e2fb22a7046c76fdae628230a10ffccf7677b40 100644 GIT binary patch delta 72 zcmbPpmTS>_?4E*hQF`Pt;%g*8HBQ{XGxk_V+wYTr>GiEv$@#YJSht{+@?%`+FWHu9^HMW>$tq+6D$z1_t*R2%MeH Uv5ZLxS#aIo6@}9+moX&+0Jf+Y^8f$< diff --git a/pull583/_main_files/figure-html/git-add-01-1.png b/pull583/_main_files/figure-html/git-add-01-1.png index 8952dd5c13a8ef313564d1db8966a6ec50eca84c..e2514ee04fb2a8b2eec0ede4f1cef19447c3d11c 100644 GIT binary patch delta 80 zcmex)M&Rcefen{<*hQFG6=jMKH{axGzsbW0#7scUy!|E*%U=n8Qwu9&V{HQiD+7Zk YTR)dif2zu&ge*Aqh4<;{VrneO0LQKzPXGV_ delta 80 zcmex)M&Rcefen{<*oBP@w(MPYz4<0j`%NB3AZ7w$=Iu9mSpG`zo0wS{8fhCCSQ!}H ZUm$RH`cqXFC1k;Me^(Sv7gJ+N1^_Q*9iadK diff --git a/pull583/_main_files/figure-html/git-add-02-1.png b/pull583/_main_files/figure-html/git-add-02-1.png index f6791a3a4421342ab7c3a5312ff3e724c2d49f3e..f6ca7c15a7d87ad14b35c63a97d96a3d38bd0071 100644 GIT binary patch delta 80 zcmcb5O6ck-p$(UK*hQFG6=jMKH{axGzsbW0#7scUy!|E*%gyQhrWRJl#@Yr3Rt5%7 Zwtg<3esBqk60+db7v86*e_Fzl3;@mv9v1)r delta 80 zcmcb5O6ck-p$(UK*oBP@5AT}h*nE?x{U#415HkTW^Y)uOEH|g~o0wS{8fhCCSQ!}H ZUm$RH`oSeEO2~ri{;nvT{%HwIG5`;o9?$>) diff --git a/pull583/_main_files/figure-html/git-add-03-1.png b/pull583/_main_files/figure-html/git-add-03-1.png index f45276b857cdf1fe261f5b02b2afad0e0abab039..2ee1b5ca228af187c7359e5d9fcdaa844fac775e 100644 GIT binary patch delta 80 
zcmcb(TIk|xp$(UK*hQGxc8DrfHs9oFzsbW0#7scUy!|E*OK1nbsfCrXv9^JMm4U&N Yt)I)M+fHRsLKd9*!u#~}ys0e70FC?{4gdfE delta 80 zcmcb(TIk|xp$(UK*oBP@5AT}h*nE?x{U#415HkTW^Y)uOETJ9zCT3QKM%o4jRt5(5 Z7YLl4ZabAl30ZL6-xY<^^QN*S0|3yK9MJ#( diff --git a/pull583/_main_files/figure-html/git-commit-01-1.png b/pull583/_main_files/figure-html/git-commit-01-1.png index 2aa5ef6bd46ab4b05234b9612993f5354003a12b..3325e37bd0e631b8281e2430095957f723337488 100644 GIT binary patch delta 92 zcmdmTL4L~x`3;wN*hQGxc8DrfHs9oFzsbW0#7scU48$xz%nHP8+i&u)@3_TpYGGw; gtZiUmWnl1R>*w<63!bwpAq!4@;eC4g#pmqF08RQLvj6}9 delta 92 zcmdmTL4L~x`3;wN*oBP@5AT}h*nE?x{U#415HkTWGZ3=?F)I+WZNJIGzT+0ZiJ6t5 hk+y+>m4U(i1p;TMFL=(bgexa#@v-1*g97K0W<<+3Uv3$FXSqHy~4T-Ia&l!YI? diff --git a/pull583/_main_files/figure-html/git-pull-00-1.png b/pull583/_main_files/figure-html/git-pull-00-1.png index 7f0b4c8bf143b6419f7a437d391a06cf34986b71..566517cff7e00b8f14dbb6dcd2479504e48c6d91 100644 GIT binary patch delta 84 zcmZ4ROmx9B(G8b)*hQGRk1dHi(tMMr{U#415HkTWGZ3?EzsbY8B$eOP!phiK+rYre bz~IT&&*js5iddD91*g97K0SSJ5oCEeR?{(25T|^in1Q8 delta 84 zcmaELLFCm1kqwu4*oBSEwQd@hG~eWDzsbW0#7scU48$zkZ}PCdlj1iqvobW&HZZU< bFu1=!;Oz8k>a0r0g6saSD4fo&!I}&JRC^tw diff --git a/pull583/_main_files/figure-html/git-pull-02-1.png b/pull583/_main_files/figure-html/git-pull-02-1.png index fa9a770ea5710a586748547df0d00446c001c787..9be8d4ff07f7d26b44ddbb1bac8dc9ae89bea6fd 100644 GIT binary patch delta 80 zcmcb4Uhw95!3~#q*hQFmA`YBRZ@$UXev^k0h?#(xdHYQsme?qMQwu9&V{HQiD+7Zk YTR)dicTHzeLKd9*!u#~}vUHYY0NBDD+5i9m delta 80 zcmcb4Uhw95!3~#q*oBSEwQd@hG~eWDzsbW0#7scUy!|E*OKcRsiJ6t5k+y+>m4U(i Y1p;TMyQZ@!Aq%eiyP|M8-{fHgVkRJF24WTUc*7+gb`fUw^Fl)Jn{V>8-{fHgVkRJF-hPvZrTsd;sfCrXv9^JMm4U&N Yt)I)M=RaXlLKd9*!u#~}1y5L#0kl&cumAu6 delta 80 zcmZoZCER>Uc*7+gc3~r9-7Vj9n{V>8-{fHgVkRJF-hPvZrTsd;iJ6t5k+y+>m4U(i Y1p;TM=RaXlLKa;2cSYg!1y5L#0n=0+>i_@% diff --git a/pull583/_main_files/figure-html/git-push-03-1.png b/pull583/_main_files/figure-html/git-push-03-1.png index 
76b5c65e9c2b55457838ebda71225ed2e931ea92..ac53d80a14f6687899d220d2e143c46129072e54 100644 GIT binary patch delta 80 zcmZ4TU3kfN;SHB~*hQE*;@9%mHQ(fEzsbW0#7scUy!|E*i^e;CQwu9&V{HQiD+7Zk YTR)di7x>Mhge*Aqh4<;{j=x!w0leuQEC2ui delta 80 zcmZ4TU3kfN;SHB~*oBRZb+>%aZNACVev^k0h?#(xdHYQs7L9lOCT3QKM%o4jRt5(5 Z7YLl4F7TU030ZL6-xY<^9e=YV0|4#=9j^cY diff --git a/pull583/_main_files/figure-html/git-push-04-1.png b/pull583/_main_files/figure-html/git-push-04-1.png index f9776db41840508ac58761a6beb8f8017af1dd31..39d0649516c4054892512cd1c2b12d92cedf01a8 100644 GIT binary patch delta 88 zcmdn;MQY0zsSTHS*hQE*;@9%mHQ(fEzsbW0#7scU48$xz%)0$158IA4{H7LG#>UzP e237_JPquz8pT1xxn-a3%)EC~Tr(fL3mJ9&iO(7)! delta 88 zcmdn;MQY0zsSTHS*oBQuHeX0O*L;(w{U#415HkTWGZ3=?G3)l5JZwAG@SB)f85(ID e7+4t?++QGYcKU*yY)Z(2>;A4NoPKd9TQUGR4x8yCY!xbl137nFS_Xs P0}yz+`njxgN@xNAexw&u diff --git a/pull583/_main_files/figure-html/img-filter-1.png b/pull583/_main_files/figure-html/img-filter-1.png index 12f01b87306343cbfbfa368feb3997efbb8bc8e5..7ad87ca5ac1d12defa804b9203a7e7f08628f7a4 100644 GIT binary patch delta 72 zcmX@#%X+?-b;1O85eB9yp<5R>PTxA6@r^6LsfCrXk+y+>m4SiNp&y5*PjX{al136V Sp6^i400f?{elF{r5}E*?02em^ delta 72 zcmX@#%X+?-b;1O8VSTM_n~gU%PTxA6@r^6LiJ6t5p|*j6m4U%K53YdeliV1Uq>%*Y Si*9G(_5!AZk^7wyoKMy%*xPE+rYrez~G$+SHN`NRwgBB VB*FQjo8B-0fv2mV%Q~loCIH2m7p?#R diff --git a/pull583/_main_files/figure-html/img-mutate-1.png b/pull583/_main_files/figure-html/img-mutate-1.png index 08eb1bea84c006f9e0d6ee2695269f6f27b57b73..0695ac203941c48fbce7b87341bb5056b371e15c 100644 GIT binary patch delta 72 zcmZpk%icVfeZmBG5eBx&7KOVSr*ECkxZo_msfCrXv9^JMm4U&Nt)I)MyPacHl137o T`ojA(0}yz+`njxgN@xNAn)w(! 
delta 72 zcmZpk%icVfeZmBGVST-g_r9-aoW6BBgTe~DWM4f;iDNy diff --git a/pull583/_main_files/figure-html/img-pivot-longer-1.png b/pull583/_main_files/figure-html/img-pivot-longer-1.png index 5fa971551f340f096818ee3f87977f3fc7579da8..b55a95a5c200d683bc6a6160d5cb38f0a1274291 100644 GIT binary patch delta 75 zcmdlxk!$Znt_c&^MHpDQ3>6p}r?*aL+&Z1$tq+6D$z1_t*R2%MeX(7~i6 WjU>44?}|bOAn075W9&P0}wXiZa);2J(GB9|u^>g|3>^4RvX(Yj^ SFT77P0D-5gpUXO@geCwzo)&@t delta 72 zcmZ3{%eJDIZNdb0VSQbV>&t{1r*ECkc(j$@#LUXjNZY``%D~|M0)eyBv)dSzq>%*I S{asPW00f?{elF{r5}E*vOc$~M diff --git a/pull583/_main_files/figure-html/img-read-csv-1.png b/pull583/_main_files/figure-html/img-read-csv-1.png index d35ce0daf233085b16a481ba867c4c7c5c29dc92..1940b7ed4977d572ce10bd8303c417b01a81158d 100644 GIT binary patch delta 68 zcmdnClhElmBm0l136Up6^i4 O00f?{elF{r5}E))y%ur+ delta 68 zcmdn;qti*9;qti*95*D8T diff --git a/pull583/_main_files/figure-html/img-separate-1.png b/pull583/_main_files/figure-html/img-separate-1.png index 55f9cd67f891ffc7615ca6f67b1516e3d95aeca0..59b6bea3e60473787be9ffd920b0d9047f3296db 100644 GIT binary patch delta 72 zcmezQmi^~j_6ZZ%MHpDQ3>6p}r*ECkDE5ut)WXWxSlhtB%D~{s*3aeB_k3eil137o T`ojA(0}yz+`njxgN@xNAsmT~= delta 72 zcmezQmi^~j_6ZZ%h4pnct}hd6oW6BBqu4io6EiDABW(i%D+7c33k1$i-}8-8Ng7FT T-QN|33_#%N>gTe~DWM4f_;(qZ diff --git a/pull583/_main_files/figure-html/issue-01-1.png b/pull583/_main_files/figure-html/issue-01-1.png index ac668da497cb4df57e03bc499efd94e30d331dbf..9742bf539f8fa116e4c924b659ebf8ca190e5ba7 100644 GIT binary patch delta 84 zcmeDDCf@l?e8VLkb`fU&gC}R*ZNACVev^k0h?#(x8Hic7-{fKKeamlZVP$NrZD3$! 
bVDMz?=kn>LzgU%!1*g97K0STeFV9_AON{3d2rhDO>3237_J_ZJA9 VoqnN-SqWKi-QN|3(;1tYlL0o28rlE= diff --git a/pull583/_main_files/figure-html/issue-03-1.png b/pull583/_main_files/figure-html/issue-03-1.png index 0ca680d4b605d94417172d74d19fc7634cd56ab0..ad900b7a60cd9f209fd6ddfb74991981c0fa9aa2 100644 GIT binary patch delta 92 zcmX>+Pv-DEnGMf**hQEHd=5_6Xnx1j{*H$ch?#(x8Hibcm=%cGroZE1|G{r+VP$Nr gZD3$!VDMz?=kn=adD)ea1*g97KD}Lyk6o4#0AL*;ssI20 delta 92 zcmX>+Pv-DEnGMf**oBQ5y6cx)HNWF&f5*cJ#7scU48$xz%nHP8)8Fy1|KK+toCJHNWF&f5*cJ#7scU48$xz%nHP8+u!l9A4%djwXiZa h);2J(GB9|u^>g|3gZb=A$bwT}c%R<>DW6@I5dg)YBM<-p delta 93 zcmeBwB-{T;cEfWXc41?N?)v3c&F^^H-|;X4F%u9o12GE_vjQ>O_IEt&N0RtW%&ZKJ hv<(cb3=Hls5I8&iU_QGNvf#SED+;%N%4e5l1OS*8A~65} diff --git a/pull583/_main_files/figure-html/issue-06-1.png b/pull583/_main_files/figure-html/issue-06-1.png index ec0153d4eb24c66208cd2c8254249ace1d7e7490..db40a862b08f230ea3e6bda2696672b69b24244f 100644 GIT binary patch delta 84 zcmaF8Q~dQ#@eP-F*hQEHt&c6=*L;(w{U#415HkTWGZ3?EzsbXzf1cme!phiK+rYre bz~IT&&*js@@31N%3r>CEeR_Jw9oA$3hFu^B delta 84 zcmaF8Q~dQ#@eP-F*oBQ5|5Qw0(|nVs{U#415HkTWGZ3?EzsbXzf1cmO%*xP6+rYre bz~KG@fwR-Y@31N%3$FXSqHuc09oA$3lt&-+ diff --git a/pull583/_main_files/figure-html/launcher-1.png b/pull583/_main_files/figure-html/launcher-1.png index 87fb6237dc96c5a3d7dd14514301e84fe0accf05..5576146b82d87b2baca6ea8baffb8aa7dc471425 100644 GIT binary patch delta 76 zcmbRAfoIYOo(<1=*hQGE`V#Z^H^1X)f5*cJ#7x`Y@h~@P@tay$85?OE7+4t?I34FdpzlL1VY8aV&} delta 76 zcmbRAfoIYOo(<1=*oBQWFI?jjZ+^$q{*H$ch?%y(<6&;p;x{p~GBngSFt9Q(c;~?t WFul&4SqWKizUZbm)7P6bCj$T`S{aJ~ diff --git a/pull583/_main_files/figure-html/markdown-cell-not-run-1.png b/pull583/_main_files/figure-html/markdown-cell-not-run-1.png index f2b141a0ea775b05e8ec6735a9bdd49a5ccaa82e..bef38f0e827607df1e4acc758e43c045a9ca35db 100644 GIT binary patch delta 72 zcmeyrf$RSUt_{z5*hQGkKU@%IYJSJl{*H%n`#T<{cU}CZ7FNba+6D$z1_n-tejJ|u TdODL5vY_#Nhx+LCEeR}%0Jl13YvRojc delta 84 
zcmcb4PxR(J(G8b)*oBQO&nC(hH{axGzsbW0#7scU48$zkZ}PC-P2@K*vobW&HZZU< bFu1=!;Oz9Hd8|svg6saSD4hN+k2M(pY^NVq diff --git a/pull583/_main_files/figure-html/merge-conflict-03-1.png b/pull583/_main_files/figure-html/merge-conflict-03-1.png index 000e9afdbe51726c12da387843b5cc6c57dae92a..46e9f88ef0f47c46017e6d288234b96d01004a10 100644 GIT binary patch delta 80 zcmcaPQ{?7Mkqwu4*hQH6ww+t$-F%a${U#415HkTW^Y)uOEV2LjO)ad9jkOI7tPBjE YZ2eq5-Ia$`30ZLJ3-8m@%XnCm0oiXHmH+?% delta 80 zcmcaPQ{?7Mkqwu4*oBQO&nC(hH{axGzsbW0#7scUy!|E*OYDDs6EiDABW(i%D+7c3 Y3k1$icjaMKLKa;2cSYg!G9K1s0I!G}vj6}9 diff --git a/pull583/_main_files/figure-html/merge-conflict-04-1.png b/pull583/_main_files/figure-html/merge-conflict-04-1.png index 197341a689c7fa1ab66bec9ac1404dbcff14daee..15064334907d6af693c24746975904df99e6f552 100644 GIT binary patch delta 84 zcmeB}EZQ+ybi*Ycb`fU2ZRb{bH{axGzsbW0#7scU48$zkZ}PD6*zlWLSQ#5@8yHv_ b7(ChfxqSK;4^|~)!Kp92Pfu6%WK9MD2TdIw delta 84 zcmeB}EZQ+ybi*Ycc3~sSvx#!W%{O`4Z}Kn#F%u9o12N0?n>?&MHvA@LR)$8}1_o9J a2KN^ToSpu~gH;JxaNXY(h0|3%S(5?V(;Pbh diff --git a/pull583/_main_files/figure-html/merge-conflict-05-1.png b/pull583/_main_files/figure-html/merge-conflict-05-1.png index e8f6f96cdca08d4a01bfb22af220e7ccca6a1aa0..46fdd513864ca619d3c9612321243f69de84d28f 100644 GIT binary patch delta 84 zcmeA^C(?aRWWyyMb`fU&gC}R*ZNACVev^k0h?#(x8Hic7-{fKKm*h9KurfB*HZZU< bFnF@{bNTdgHC82L!Kp92PfuT=#+nQOReT;o delta 84 zcmeA^C(?aRWWyyMc3~r{kPQEm%{O`4Z}Kn#F%u9o12N0?n>?)jlKduSR)$8}1_o9J a2KN^ToSj~-#;SxYxbE+Y!s#p2Sd#$?QXK36 diff --git a/pull583/_main_files/figure-html/merge-conflict-06-1.png b/pull583/_main_files/figure-html/merge-conflict-06-1.png index 989833c7527c5240d627c3a46638bec5e5275b05..1c01aae23008e896da492d09ea4f1112ec359ad9 100644 GIT binary patch delta 80 zcmZqNBiy)0c*7+gb`fU&gC}R*ZNACVev^k0h?#(xdHYQsmevdWrWRJl#@Yr3Rt5%7 Zwtg<3o_m)?30ZLJ3-8m@=iOyV1_1NV9-IIG delta 80 zcmZqNBiy)0c*7+gc3~r{kPQEm%{O`4Z}Kn#F%u9oZ@0y(nwCEeR}%McdW?(hNB?H delta 84 
zcmdmUPkhfk@eP-F*oBR>JnzS^ZNACVev^k0h?#(x8Hic7-{fIE@Q~lc%*xP6+rYre bz~KG@fwR+Bykk{D7F_ptMd9?D?^u%okboes diff --git a/pull583/_main_files/figure-html/new-repository-02-1.png b/pull583/_main_files/figure-html/new-repository-02-1.png index 40068040e0a83f500162ecb21db710a9a2112fb5..86961db91ba5515d69e53d465b9d98af1b649453 100644 GIT binary patch delta 84 zcmeBsD$@5?&Ss{E!FR>sEK1_o9J a22ZwrE}#C(kW~p;aOw;1)6=z$Sd#%CaUC-N delta 84 zcmeBsD$@58-{fHgVkRJF24a@&H+fivRQXNJtPG8`4GgRd b4DK%wI6M89A*&Ly;JUvn3a4uuu_glmDNY?6 diff --git a/pull583/_main_files/figure-html/new-repository-03-1.png b/pull583/_main_files/figure-html/new-repository-03-1.png index 3d783de196a707fee1a793836b817171529ba8fc..79b74866e10907321b025bd70fafa56b71cd9f7c 100644 GIT binary patch delta 80 zcmX^4LFnWMp$(UK*hQGEf2PH2Hs9oFzsbW0#7scUy!|E*%h^@@rWRJl#@Yr3Rt5%7 Zwtg<3zI8i`60+db7v86*Kikfd3;+$E9|-^e delta 80 zcmX^4LFnWMp$(UK*oBR>?QY~-H{axGzsbW0#7scUy!|E*%h^@@CT3QKM%o4jRt5(5 Z7YLl4zI8i`60+dBzbgu-Kikfd3;_7%9)AD; diff --git a/pull583/_main_files/figure-html/open-data-w-editor-1-1.png b/pull583/_main_files/figure-html/open-data-w-editor-1-1.png index b98d06fe43124762e66273133a6de1e6b7adaea0..6c8d6d6a0b138f421ebaad24f6d3bfb777ac771d 100644 GIT binary patch delta 80 zcmbQdPiXQ!p$(UK*hQEu#WU8-{fHgVkRJF-hPvZW%@*ZQwu9&BW(i%D+2?k YLq85rub;=Fge+)0-=Ti`hIuT>09tMu-2eap delta 80 zcmbQdPiXQ!p$(UK*oBQWmS41MY`)3Uev^k0h?#(xdHYQsmgy7uP0Xwe4YdsntPBj^ Zd2j_xub;=Fge*8;bkm#Z8|JYj0|2H~8}kXH9*hQEu#WU8-{fHgVkRJF24WTkXH9*oBQWFI?jjZ@$UXev^k0h?#(x8Hibcm=%cGfS4VKIe?fGh`E56 x8;C*bdAHx>;oE2>Q delta 68 zcmaF2h~?cPmJQE&*oBQWmS41MY<|bH{T&bEU1ok0Gb=+wZ36=<1A})STmjQ>2{0-l P3(gna^kzD@AY(EBWwI8R diff --git a/pull583/_main_files/figure-html/pen-tool-01-1.png b/pull583/_main_files/figure-html/pen-tool-01-1.png index 0621f3ffcd3ccd353919572759dda2997191e263..afa9ab649324469b7efacaa00b90a5a95d26d31b 100644 GIT binary patch delta 80 zcmX>?QY~-H{axGzsbW0#7scUy!|E*%jHIX6EiDABW(i%D+7c3 Y3k1$i-`&TegeI?7F(>+s|lK}-!8Lj{T delta 76 
zcmZ2;l6TEX-VK*{*oBRB7VP;oqxmLJ`%NB3AZFTrlZV+LoZrOE%FsyLz`)AD;Qj)E Wv(qI~n3a$P*Zo~lINdXaIT-+3fg0KX diff --git a/pull583/_main_files/figure-html/pen-tool-03-1.png b/pull583/_main_files/figure-html/pen-tool-03-1.png index 7af4135533994b557ac6625e45ab54441c808f08..70d1752594dd39d2029de4653583153b7b29c646 100644 GIT binary patch delta 76 zcmaFzhWE)E-VK*{*hQEb)e5)YYQD+Sev^k0h?%zE{zv8gY-Pcs04 Mr>mdKI;Vst0QJ`um;e9( delta 66 zcmezE``dTI1a@J4{bM(e+ijdKp~i1wW@TukZD3$!U~qqdz}d?QOE!U Mp00i_>zopr04~KAfB*mh diff --git a/pull583/_main_files/figure-html/summarize-1.png b/pull583/_main_files/figure-html/summarize-1.png index c35d0893cea6903bb569e92e89ba37a4fcd4e674..8cd7b86e2059a03c7478a8738b5b578dcaaafd41 100644 GIT binary patch delta 66 zcmdlGxFK-D1a=Vyw#gQSyEaZgtIls~VP$NrZD3$!VDMz?=km$L8cNbgVpCsupJo68 MPgg&ebxsLQ005R1+yDRo delta 66 zcmdlGxFK-D1a@J4y^Z(2uh=;KtUAAmnU$fDwt<0_fx-O+0%s=|YbZ%0iLLv)qL2Xy NJYD@<);T3K0RTa@7U%!~ diff --git a/pull583/_main_files/figure-html/summarize-across-1.png b/pull583/_main_files/figure-html/summarize-across-1.png index 68bb626a70aa66a8c99b55a4398983a4747193e1..64ddae749b3ebf1a95162ede67c6d39a58214467 100644 GIT binary patch delta 66 zcmbQ4H8X3%1a=Vyw#gQSyEaZYGvPP2urfB*HZZUuCQ8ysVpCsupJo68 MPgg&ebxsLQ0Q7ei%K!iX delta 66 zcmbQ4H8X3%1a@J4y^Z(2uh=-<%!J>>%*xP6+rYrez~KG@fwPnEm?%jjiLLv)qL2Xy NJYD@<);T3K0RT0n7S{j( diff --git a/pull583/_main_files/figure-html/summarize-groupby-1.png b/pull583/_main_files/figure-html/summarize-groupby-1.png index e542b245314f8b5713e1b01ce77fe94ddbcbfd07..ad29759c6691710eada4eadd6f557bfad321d335 100644 GIT binary patch delta 66 zcmeAy?JS)zfn9`wZL&q-u8q@|S@N4&SQ#5@8yHv_7(ChfxqPyZm69})*wh!^rx}32 M)78&qol`;+0Qc|}tN;K2 delta 66 zcmeAy?JS)zfn8W%Z{xl1D>hDFX31}2W@TukZD3$!U~qqdz}d+@R!Y)HV(b2{C}aQv MPgg&ebxsLQ0688OxBvhE diff --git a/pull583/_main_files/figure-html/ubuntu-docker-terminal-1.png b/pull583/_main_files/figure-html/ubuntu-docker-terminal-1.png index 
ee611f28fbdaa323f614ca662d132c631e359f94..f13c89fddf9a70ccef1ace8b9c5eaa472d95b8cb 100644 GIT binary patch delta 77 zcmZpE$kY6gX9Fh_y9l$;qVwNangy8J1(+Ctm}$EJ6LXvvzo~_lv9Y#+ft7*5ldYf2 Wr`PH*Dgngy8J1(+Ctm}$EJ6LXvvzloWZp^>(Mft7*5{RIMN Wr`PH*DF4lHkY{MlUb`fT#{T0s=ns4&7-{fHgVkRJF24a@&H+fjO=JA_aSQ#5@8yHv_ b7(ChfxqSMkRjf+Lf>U33pPsI~nl%{!NrfJ~ delta 84 zcmZp>F4lHkY{MlUc3~quzfH5ZG~eWDzsbW0#7scU48$zkZ}PBm&Eq#QvobW&HZZU< bFu1=!;Oz8It5}tg1=syuQ8-U33pPt@Y$(jrRw|O7l delta 84 zcmaFyS@gwc(G8b)*oBSs{5H+r(tMMr{U#415HkTWGZ3?EzsbXzlf!RfW@TukZD3$! bU~qqdz}e{`m8?q0g6saSD4gC}$(jrRi9{ap diff --git a/pull583/_main_files/figure-html/vc-ba2-add-1.png b/pull583/_main_files/figure-html/vc-ba2-add-1.png index 64d136091c592a9730b798c689129d03252e128f..3bd7bc6aac48c67cabbecf3e491e42f50fcbba77 100644 GIT binary patch delta 72 zcmaDih5g+W_6ZZ%MVPGm67%;rPTxA6vHK*ysfCrXv9^JMm4U&Nt)I)M8=hiRl137o T`ojA(0}yz+`njxgN@xNA-ohCx delta 72 zcmaDih5g+W_6ZZ%g^jd4@5irgoW6BBWA{mZ6EiDABW(i%D+7c33k1$iH$26tB#k7v T?(d311|aZs^>bP0l+XkK)#@1E diff --git a/pull583/_main_files/figure-html/vc-ba3-commit-1.png b/pull583/_main_files/figure-html/vc-ba3-commit-1.png index 9418666bc235ca0fecba1794867fc85c0e8d629e..f2b09ee31c12fc8c95fd9658b87fdc2452abd491 100644 GIT binary patch delta 72 zcmdmVmVMJ%_6ZZ%MVPGm67%;rPTxA6@%&SMQwu9&V{HQiD+7ZkTR)diFMYbP0l+XkK`;i&t delta 72 zcmdmVmVMJ%_6ZZ%g^jd4@5irgoW6BBgTe~DWM4f^0^tA diff --git a/pull583/_main_files/figure-html/vc1-no-changes-1.png b/pull583/_main_files/figure-html/vc1-no-changes-1.png index c04e2721fd9ab086c85789dbbc27c015a68fd0ba..ff3beb76336aea31fd019504c434223f67355818 100644 GIT binary patch delta 75 zcmbPmm}3GEO<)&cvg%9B-`_aBbvonL=}c=C_)RUWjE%Jo46F UaOw;1(+oi1>FVdQ&MBb@0PpS?-~a#s delta 75 zcmbPmm}3GEO<)%`(!6kuPrPw@>vYDg)0x&P@SB)f85(ID7+4t?++QGYc6zWPlae%& U;JUvn3K@XF)78&qol`;+0MHH?DgXcg diff --git a/pull583/_main_files/figure-html/vc2-changes-1.png b/pull583/_main_files/figure-html/vc2-changes-1.png index 
6f3ce4216cdde239044bcab79358881204f77c23..9bcf7acdfb12c5c0105e1f6e1e6620b71e4014db 100644 GIT binary patch delta 75 zcmeC|8;Zlw@znT=E`qkW@TukZD3$!U~qqdz}e|OZcIwj VNP_GBt|(*x0#8>zmvv4FO#sgv7+3%R diff --git a/pull583/_main_files/figure-html/vc5-push-1.png b/pull583/_main_files/figure-html/vc5-push-1.png index c05b0c0e0e139be35180a1e03fb4fd485719d81d..db73bf3131f7f817df7308a8f52f42be071ec398 100644 GIT binary patch delta 78 zcmccglmF6B{s|MvZP#zxhoqtc;Dd4GgRd44!QLTt0p3 ZA7&+KB*CdKyiYR#fv2mV%Q~loCIIg89XJ30 delta 78 zcmccglmF6B{s|M8yHv_7~EeVaCZ9C ZKg>$fNP_GBt|(*x0#8>zmvv4FO#tU!9P9u9 diff --git a/pull583/_main_files/figure-html/vc6-remote-changes-1.png b/pull583/_main_files/figure-html/vc6-remote-changes-1.png index bc52adf7f026300ccc7c583f71ede4839c379ae4..33bc7bd19817de765b90b04ee30a034c96495d91 100644 GIT binary patch delta 75 zcmaDhi~HFu?g^zW>$tq+6D$z1_t*R2%Md+d6r2@ W8cA^7-xY-nK;Y@>=d#Wzp$PyiYZ>?e diff --git a/pull583/_main_files/figure-html/vc7-pull-1.png b/pull583/_main_files/figure-html/vc7-pull-1.png index b1ebae6c3bb51f047a7655a05338fcedadc7f094..88eb0387f7f190a9a10e246e17ad29be435527bb 100644 GIT binary patch delta 78 zcmdloi)YI$o(U7!MVPFArp0SEPH&yg*gBnQ>vZNo9)42`D`R7A0|P4qgC|=*mrwuB Z!>lBYBslej_h|+o@O1TaS?83{1OPaA8R!52 delta 78 zcmdloi)YI$o(U7!g^jd4@5irgoZdQ}v2{Ar*6GZFJp3kRR)$8}1_o9J2KN^ToSpui ZhgnG)NpRiY6@?5y;OXk;vd$@?2>?8>8PNa$ diff --git a/pull583/classification1.html b/pull583/classification1.html index b40114ab0..ce06168bd 100644 --- a/pull583/classification1.html +++ b/pull583/classification1.html @@ -1068,7 +1068,7 @@

5.5.2 More than two explanatory v as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.

- +

Figure 5.8: 3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.

diff --git a/pull583/index.html b/pull583/index.html index 7518e39c2..71b81bcff 100644 --- a/pull583/index.html +++ b/pull583/index.html @@ -562,10 +562,12 @@

Welcome!CRC Press website or on Amazon.

For the python version of the textbook, visit https://python.datasciencebook.ca.

-

This book is listed in a number of open educational resource (OER) collections: -- The University of British Columbia OER collection -- The OER Commons -- MERLOT

+

This book is listed in a number of open educational resource (OER) collections:

+

This work by Tiffany Timbers, Trevor Campbell, diff --git a/pull583/regression1.html b/pull583/regression1.html index b482991e8..8ffb893f4 100644 --- a/pull583/regression1.html +++ b/pull583/regression1.html @@ -1239,7 +1239,7 @@

7.9 Multivariable K-NN regression predictors instead of 1.

- +

Figure 7.10: K-NN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes.

diff --git a/pull583/regression2.html b/pull583/regression2.html index 3753d5223..328f36945 100644 --- a/pull583/regression2.html +++ b/pull583/regression2.html @@ -934,7 +934,7 @@

8.6 Multivariable linear regressi shown in Figure 8.7.

- +

Figure 8.7: Linear regression plane of best fit overlaid on top of the data (using price, house size, and number of bedrooms as predictors). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the regression plane looks like for learning purposes.

diff --git a/pull583/search_index.json b/pull583/search_index.json index 811d14987..f3d924d09 100644 --- a/pull583/search_index.json +++ b/pull583/search_index.json @@ -1 +1 @@ -[["index.html", "A First Introduction Welcome!", " Data Science A First Introduction Tiffany Timbers, Trevor Campbell, and Melissa Lee 2024-08-21 Welcome! This is the website for Data Science: A First Introduction. You can read the web version of the book on this site. Click a section in the table of contents on the left side of the page to navigate to it. If you are on a mobile device, you may need to open the table of contents first by clicking the menu button on the top left of the page. You can purchase a PDF or print copy of the book on the CRC Press website or on Amazon. For the python version of the textbook, visit https://python.datasciencebook.ca. This book is listed in a number of open educational resource (OER) collections: - The University of British Columbia OER collection - The OER Commons - MERLOT This work by Tiffany Timbers, Trevor Campbell, and Melissa Lee is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. "],["foreword.html", "Foreword", " Foreword Roger D. Peng Johns Hopkins Bloomberg School of Public Health 2022-01-04 The field of data science has expanded and grown significantly in recent years, attracting excitement and interest from many different directions. The demand for introductory educational materials has grown concurrently with the growth of the field itself, leading to a proliferation of textbooks, courses, blog posts, and tutorials. This book is an important contribution to this fast-growing literature, but given the wide availability of materials, a reader should be inclined to ask, “What is the unique contribution of this book?” In order to answer that question it is useful to step back for a moment and consider the development of the field of data science over the past few years. 
When thinking about data science, it is important to consider two questions: “What is data science?” and “How should one do data science?” The former question is under active discussion amongst a broad community of researchers and practitioners and there does not appear to be much consensus to date. However, there seems a general understanding that data science focuses on the more “active” elements—data wrangling, cleaning, and analysis—of answering questions with data. These elements are often highly problem-specific and may seem difficult to generalize across applications. Nevertheless, over time we have seen some core elements emerge that appear to repeat themselves as useful concepts across different problems. Given the lack of clear agreement over the definition of data science, there is a strong need for a book like this one to propose a vision for what the field is and what the implications are for the activities in which members of the field engage. The first important concept addressed by this book is tidy data, which is a format for tabular data formally introduced to the statistical community in a 2014 paper by Hadley Wickham. The tidy data organization strategy has proven a powerful abstract concept for conducting data analysis, in large part because of the vast toolchain implemented in the Tidyverse collection of R packages. The second key concept is the development of workflows for reproducible and auditable data analyses. Modern data analyses have only grown in complexity due to the availability of data and the ease with which we can implement complex data analysis procedures. Furthermore, these data analyses are often part of decision-making processes that may have significant impacts on people and communities. Therefore, there is a critical need to build reproducible analyses that can be studied and repeated by others in a reliable manner. 
Statistical methods clearly represent an important element of data science for building prediction and classification models and for making inferences about unobserved populations. Finally, because a field can succeed only if it fosters an active and collaborative community, it has become clear that being fluent in the tools of collaboration is a core element of data science. This book takes these core concepts and focuses on how one can apply them to do data science in a rigorous manner. Students who learn from this book will be well-versed in the techniques and principles behind producing reliable evidence from data. This book is centered around the use of the R programming language within the tidy data framework, and as such employs the most recent advances in data analysis coding. The use of Jupyter notebooks for exercises immediately places the student in an environment that encourages auditability and reproducibility of analyses. The integration of git and GitHub into the course is a key tool for teaching about collaboration and community, key concepts that are critical to data science. The demand for training in data science continues to increase. The availability of large quantities of data to answer a variety of questions, the computational power available to many more people than ever before, and the public awareness of the importance of data for decision-making have all contributed to the need for high-quality data science work. This book provides a sophisticated first introduction to the field of data science and provides a balanced mix of practical skills along with generalizable principles. As we continue to introduce students to data science and train them to confront an expanding array of data science problems, they will be well-served by the ideas presented here. "],["preface.html", "Preface", " Preface This textbook aims to be an approachable introduction to the world of data science. 
In this book, we define data science as the process of generating insight from data through reproducible and auditable processes. If you analyze some data and give your analysis to a friend or colleague, they should be able to re-run the analysis from start to finish and get the same result you did (reproducibility). They should also be able to see and understand all the steps in the analysis, as well as the history of how the analysis developed (auditability). Creating reproducible and auditable analyses allows both you and others to easily double-check and validate your work. At a high level, in this book, you will learn how to identify common problems in data science, and solve those problems with reproducible and auditable workflows. Figure 0.1 summarizes what you will learn in each chapter of this book. Throughout, you will learn how to use the R programming language (R Core Team 2021) to perform all the tasks associated with data analysis. You will spend the first four chapters learning how to use R to load, clean, wrangle (i.e., restructure the data into a usable format) and visualize data while answering descriptive and exploratory data analysis questions. In the next six chapters, you will learn how to answer predictive, exploratory, and inferential data analysis questions with common methods in data science, including classification, regression, clustering, and estimation. In the final chapters (11–13), you will learn how to combine R code, formatted text, and images in a single coherent document with Jupyter, use version control for collaboration, and install and configure the software needed for data science on your own computer. If you are reading this book as part of a course that you are taking, the instructor may have set up all of these tools already for you; in this case, you can continue on through the book reading the chapters in order. 
But if you are reading this independently, you may want to jump to these last three chapters early before going on to make sure your computer is set up in such a way that you can try out the example code that we include throughout the book. Figure 0.1: Where are we going? Each chapter in the book has an accompanying worksheet that provides exercises to help you practice the concepts you will learn. We strongly recommend that you work through the worksheet when you finish reading each chapter before moving on to the next chapter. All of the worksheets are available at https://worksheets.datasciencebook.ca; the “Exercises” section at the end of each chapter points you to the right worksheet for that chapter. For each worksheet, you can either launch an interactive version of the worksheet in your browser by clicking the “launch binder” button, or preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["acknowledgments.html", "Acknowledgments", " Acknowledgments We’d like to thank everyone that has contributed to the development of Data Science: A First Introduction. This is an open source textbook that began as a collection of course readings for DSCI 100, a new introductory data science course at the University of British Columbia (UBC). Several faculty members in the UBC Department of Statistics were pivotal in shaping the direction of that course, and as such, contributed greatly to the broad structure and list of topics in this book. We would especially like to thank Matías Salibían-Barrera for his mentorship during the initial development and roll-out of both DSCI 100 and this book. 
His door was always open when we needed to chat about how to best introduce and teach data science to our first-year students. We would also like to thank Gabriela Cohen Freue for her DSCI 561 (Regression I) teaching materials from the UBC Master of Data Science program, as some of our linear regression figures were inspired by these. We would also like to thank all those who contributed to the process of publishing this book. In particular, we would like to thank all of our reviewers for their feedback and suggestions: Rohan Alexander, Isabella Ghement, Virgilio Gómez Rubio, Albert Kim, Adam Loy, Maria Prokofieva, Emily Riederer, and Greg Wilson. The book was improved substantially by their insights. We would like to give special thanks to Jim Zidek for his support and encouragement throughout the process, and to Roger Peng for graciously offering to write the Foreword. Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the past few years. They provided invaluable feedback on the book and worksheets; they found bugs for us (and stood by very patiently in class while we frantically fixed those bugs); and they brought a level of enthusiasm to the class that sustained us during the hard work of creating a new course and writing a textbook. Our interactions with them taught us how to teach data science, and that learning is reflected in the content of this book. "],["about-the-authors.html", "About the authors", " About the authors Tiffany Timbers is an Associate Professor of Teaching in the Department of Statistics and Co-Director for the Master of Data Science program (Vancouver Option) at the University of British Columbia. In these roles she teaches and develops curriculum around the responsible application of Data Science to solve real-world problems. 
One of her favorite courses she teaches is a graduate course on collaborative software development, which focuses on teaching how to create R and Python packages using modern tools and workflows. Trevor Campbell is an Associate Professor in the Department of Statistics at the University of British Columbia. His research focuses on automated, scalable Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and Bayesian theory. He was previously a postdoctoral associate advised by Tamara Broderick in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and Institute for Data, Systems, and Society (IDSS) at MIT, a Ph.D. candidate under Jonathan How in the Laboratory for Information and Decision Systems (LIDS) at MIT, and before that he was in the Engineering Science program at the University of Toronto. Melissa Lee is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. She teaches and develops curriculum for undergraduate statistics and data science courses. Her work focuses on student-centered approaches to teaching, developing and assessing open educational resources, and promoting equity, diversity, and inclusion initiatives. "],["intro.html", "Chapter 1 R and the Tidyverse 1.1 Overview 1.2 Chapter learning objectives 1.3 Canadian languages data set 1.4 Asking a question 1.5 Loading a tabular data set 1.6 Naming things in R 1.7 Creating subsets of data frames with filter & select 1.8 Using arrange to order and slice to select rows by index number 1.9 Adding and modifying columns using mutate 1.10 Exploring data with visualizations 1.11 Accessing documentation 1.12 Exercises", " Chapter 1 R and the Tidyverse 1.1 Overview This chapter provides an introduction to data science and the R programming language. The goal here is to get your hands dirty right from the start! 
We will walk through an entire data analysis, and along the way introduce different types of data analysis question, some fundamental programming concepts in R, and the basics of loading, cleaning, and visualizing data. In the following chapters, we will dig into each of these steps in much more detail; but for now, let’s jump in to see how much we can do with data science! 1.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Identify the different types of data analysis question and categorize a question into the correct type. Load the tidyverse package into R. Read tabular data with read_csv. Create new variables and objects in R using the assignment symbol. Create and organize subsets of tabular data using filter, select, arrange, and slice. Add and modify columns in tabular data using mutate. Visualize data with a ggplot bar plot. Use ? to access help and documentation tools in R. 1.3 Canadian languages data set In this chapter, we will walk through a full analysis of a data set relating to languages spoken at home by Canadian residents (Figure 1.1). Many Indigenous peoples exist in Canada with their own cultures and languages; these languages are often unique to Canada and not spoken anywhere else in the world (Statistics Canada 2018). Sadly, colonization has led to the loss of many of these languages. For instance, generations of children were not allowed to speak their mother tongue (the first language an individual learns in childhood) in Canadian residential schools. Colonizers also renamed places they had “discovered” (K. Wilson 2018). Acts such as these have significantly harmed the continuity of Indigenous languages in Canada, and some languages are considered “endangered” as few people report speaking them. 
To learn more, please see Canadian Geographic’s article, “Mapping Indigenous Languages in Canada” (Walker 2017), They Came for the Children: Canada, Aboriginal peoples, and Residential Schools (Truth and Reconciliation Commission of Canada 2012) and the Truth and Reconciliation Commission of Canada’s Calls to Action (Truth and Reconciliation Commission of Canada 2015). Figure 1.1: Map of Canada. The data set we will study in this chapter is taken from the canlang R data package (Timbers 2020), which has population language data collected during the 2016 Canadian census (Statistics Canada 2016a). In this data, there are 214 languages recorded, each having six different properties: category: Higher-level language category, describing whether the language is an Official Canadian language, an Aboriginal (i.e., Indigenous) language, or a Non-Official and Non-Aboriginal language. language: The name of the language. mother_tongue: Number of Canadian residents who reported the language as their mother tongue. Mother tongue is generally defined as the language someone was exposed to since birth. most_at_home: Number of Canadian residents who reported the language as being spoken most often at home. most_at_work: Number of Canadian residents who reported the language as being used most often at work. lang_known: Number of Canadian residents who reported knowledge of the language. According to the census, more than 60 Aboriginal languages were reported as being spoken in Canada. Suppose we want to know which are the most common; then we might ask the following question, which we wish to answer using our data: Which ten Aboriginal languages were most often reported in 2016 as mother tongues in Canada, and how many people speak each of them? Note: Data science cannot be done without a deep understanding of the data and problem domain. In this book, we have simplified the data sets used in our examples to concentrate on methods and fundamental concepts. 
But in real life, you cannot and should not do data science without a domain expert. Alternatively, it is common to practice data science in your own domain of expertise! Remember that when you work with data, it is essential to think about how the data were collected, which affects the conclusions you can draw. If your data are biased, then your results will be biased! 1.4 Asking a question Every good data analysis begins with a question—like the above—that you aim to answer using data. As it turns out, there are actually a number of different types of question regarding data: descriptive, exploratory, predictive, inferential, causal, and mechanistic, all of which are defined in Table 1.1. Carefully formulating a question as early as possible in your analysis—and correctly identifying which type of question it is—will guide your overall approach to the analysis as well as the selection of appropriate tools. Table 1.1: Types of data analysis question (Leek and Peng 2015; Peng and Matsui 2015). Question type Description Example Descriptive A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). How many people live in each province and territory in Canada? Exploratory A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? Predictive A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. What political party will someone vote for in the next Canadian election? Inferential A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. 
Does political party voting change with indicators of wealth for all people living in Canada? Causal A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. Does wealth lead to voting for a certain political party in Canadian elections? Mechanistic A question that asks about the underlying mechanism of the observed patterns, trends, or relationships (i.e., how does it happen?) How does wealth lead to voting for a certain political party in Canadian elections? In this book, you will learn techniques to answer the first four types of question: descriptive, exploratory, predictive, and inferential; causal and mechanistic questions are beyond the scope of this book. In particular, you will learn how to apply the following analysis tools: Summarization: computing and reporting aggregated values pertaining to a data set. Summarization is most often used to answer descriptive questions, and can occasionally help with answering exploratory questions. For example, you might use summarization to answer the following question: What is the average race time for runners in this data set? Tools for summarization are covered in detail in Chapters 2 and 3, but appear regularly throughout the text. Visualization: plotting data graphically. Visualization is typically used to answer descriptive and exploratory questions, but plays a critical supporting role in answering all of the types of question in Table 1.1. For example, you might use visualization to answer the following question: Is there any relationship between race time and age for runners in this data set? This is covered in detail in Chapter 4, but again appears regularly throughout the book. Classification: predicting a class or category for a new observation. Classification is used to answer predictive questions. 
For example, you might use classification to answer the following question: Given measurements of a tumor’s average cell area and perimeter, is the tumor benign or malignant? Classification is covered in Chapters 5 and 6. Regression: predicting a quantitative value for a new observation. Regression is also used to answer predictive questions. For example, you might use regression to answer the following question: What will be the race time for a 20-year-old runner who weighs 50 kg? Regression is covered in Chapters 7 and 8. Clustering: finding previously unknown/unlabeled subgroups in a data set. Clustering is often used to answer exploratory questions. For example, you might use clustering to answer the following question: What products are commonly bought together on Amazon? Clustering is covered in Chapter 9. Estimation: taking measurements for a small number of items from a large group and making a good guess for the average or proportion for the large group. Estimation is used to answer inferential questions. For example, you might use estimation to answer the following question: Given a survey of cellphone ownership of 100 Canadians, what proportion of the entire Canadian population own Android phones? Estimation is covered in Chapter 10. Referring to Table 1.1, our question about Aboriginal languages is an example of a descriptive question: we are summarizing the characteristics of a data set without further interpretation. And referring to the list above, it looks like we should use visualization and perhaps some summarization to answer the question. So in the remainder of this chapter, we will work towards making a visualization that shows us the ten most common Aboriginal languages in Canada and their associated counts, according to the 2016 census. 1.5 Loading a tabular data set A data set is, at its core, a structured collection of numbers and characters. 
Aside from that, there are really no strict rules; data sets can come in many different forms! Perhaps the most common form of data set that you will find in the wild, however, is tabular data. Think spreadsheets in Microsoft Excel: tabular data are rectangular-shaped and spreadsheet-like, as shown in Figure 1.2. In this book, we will focus primarily on tabular data. Since we are using R for data analysis in this book, the first step for us is to load the data into R. When we load tabular data into R, it is represented as a data frame object. Figure 1.2 shows that an R data frame is very similar to a spreadsheet. We refer to the rows as observations; these are the individual objects for which we collect data. In Figure 1.2, the observations are languages. We refer to the columns as variables; these are the characteristics of each observation. In Figure 1.2, the variables are the language’s category, its name, the number of mother tongue speakers, etc. Figure 1.2: A spreadsheet versus a data frame in R. The first kind of data file that we will learn how to load into R as a data frame is the comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened and saved using common spreadsheet programs like Microsoft Excel and Google Sheets. For example, the .csv file named can_lang.csv is included with the code for this book. 
If we were to open this data in a plain text editor (a program like Notepad that just shows text with no formatting), we would see each row on its own line, and each entry in the table separated by a comma: category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,44 Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 Aboriginal languages,Algonquin,1260,370,40,2480 Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 To load this data into R so that we can do things with it (e.g., perform analyses or create data visualizations), we will need to use a function. A function is a special word in R that takes instructions (we call these arguments) and does something. The function we will use to load a .csv file into R is called read_csv. In its most basic use-case, read_csv expects that the data file: has column names (or headers), uses a comma (,) to separate the columns, and does not have row names. Below you’ll see the code used to load the data into R using the read_csv function. Note that the read_csv function is not included in the base installation of R, meaning that it is not one of the primary functions ready to use when you install R. Therefore, you need to load it from somewhere else before you can use it. The place from which we will load it is called an R package. An R package is a collection of functions that can be used in addition to the built-in R package functions once loaded. 
The read_csv function, in particular, can be made accessible by loading the tidyverse R package (Wickham 2021b; Wickham et al. 2019) using the library function. The tidyverse package contains many functions that we will use throughout this book to load, clean, wrangle, and visualize data. library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.2 ✔ readr 2.1.4 ## ✔ forcats 1.0.0 ✔ stringr 1.5.0 ## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ## ✔ purrr 1.0.1 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors Note: You may have noticed that we got some extra output from R regarding attached packages and conflicts below our code line. These are examples of messages in R, which give the user more information that might be handy to know. The Attaching packages message is natural when loading tidyverse, since tidyverse actually automatically causes other packages to be imported too, such as dplyr. In the future, when we load tidyverse in this book, we will silence these messages to help with the readability of the book. The Conflicts message is also totally normal in this circumstance. This message tells you if functions from different packages share the same name, which is confusing to R. For example, in this case, the dplyr package and the stats package both provide a function called filter. The message above (dplyr::filter() masks stats::filter()) is R telling you that it is going to default to the dplyr package version of this function. So if you use the filter function, you will be using the dplyr version. In order to use the stats version, you need to use its full name stats::filter. 
Messages are not errors, so generally you don’t need to take action when you see a message; but you should always read the message and critically think about what it means and whether you need to do anything about it. After loading the tidyverse package, we can call the read_csv function and pass it a single argument: the name of the file, \"can_lang.csv\". We have to put quotes around file names and other letters and words that we use in our code to distinguish them from the special words (like functions!) that make up the R programming language. The file’s name is the only argument we need to provide because our file satisfies everything else that the read_csv function expects in the default use-case. Figure 1.3 describes how we use the read_csv function to read data into R. Figure 1.3: Syntax for the read_csv function. read_csv("data/can_lang.csv") ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows Note: There is another function that also loads csv files named read.csv. We will always use read_csv in this book, as it is designed to play nicely with all of the other tidyverse functions, which we will use extensively. Be careful not to accidentally use read.csv, as it can cause some tricky errors to occur in your code that are hard to track down! 
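As a minimal sketch of the difference between the two readers mentioned in the note above, the following code reads the same small file with both read_csv and read.csv and compares the kind of object each returns. The temporary file and its contents are assumptions made only for illustration (three columns from two of the rows shown earlier); they are not the full can_lang.csv file, and the sketch assumes the readr package (part of the tidyverse) is installed.

```r
library(readr)

# Write a tiny excerpt (three columns, two rows) to a temporary file
# so that the example is self-contained. This is an illustrative
# stand-in, not the real data/can_lang.csv file.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("category,language,mother_tongue",
    "Aboriginal languages,Algonquin,1260",
    "Non-Official & Non-Aboriginal languages,Amharic,22465"),
  tmp
)

# read_csv comes from readr (loaded with the tidyverse) and returns a tibble.
tidy_df <- read_csv(tmp, show_col_types = FALSE)

# read.csv is the base R reader and returns a plain data frame.
base_df <- read.csv(tmp)

class(tidy_df)  # the class vector includes "tbl_df"
class(base_df)  # "data.frame"
```

Both calls succeed, which is part of why mixing them up is easy to miss: the two objects hold the same values but differ in class and in how they print, so code written for one may behave differently with the other.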
1.6 Naming things in R When we loaded the 2016 Canadian census language data using read_csv, we did not give this data frame a name. Therefore the data was just printed on the screen, and we cannot do anything else with it. That isn’t very useful. What would be more useful would be to give a name to the data frame that read_csv outputs, so that we can refer to it later for analysis and visualization. The way to assign a name to a value in R is via the assignment symbol <-. On the left side of the assignment symbol you put the name that you want to use, and on the right side of the assignment symbol you put the value that you want the name to refer to. Names can be used to refer to almost anything in R, such as numbers, words (also known as strings of characters), and data frames! Below, we set my_number to 3 (the result of 1+2) and we set name to the string \"Alice\". my_number <- 1 + 2 name <- "Alice" Note that when we name something in R using the assignment symbol, <-, we do not need to surround the name we are creating with quotes. This is because we are formally telling R that this special word denotes the value of whatever is on the right-hand side. Only characters and words that act as values on the right-hand side of the assignment symbol—e.g., the file name \"data/can_lang.csv\" that we specified before, or \"Alice\" above—need to be surrounded by quotes. After making the assignment, we can use the special name words we have created in place of their values. For example, if we want to do something with the value 3 later on, we can just use my_number instead. Let’s try adding 2 to my_number; you will see that R just interprets this as adding 3 and 2: my_number + 2 ## [1] 5 Object names can consist of letters, numbers, periods . and underscores _. Other symbols won’t work since they have their own meanings in R. For example, - is the subtraction symbol; if we try to assign a name with the - symbol, R will complain and we will get an error! 
my-number <- 1 Error in my - number <- 1 : object 'my' not found There are certain conventions for naming objects in R. When naming an object we suggest using only lowercase letters, numbers and underscores _ to separate the words in a name. R is case sensitive, which means that Letter and letter would be two different objects in R. You should also try to give your objects meaningful names. For instance, you can name a data frame x. However, using more meaningful terms, such as language_data, will help you remember what each name in your code represents. We recommend following the Tidyverse naming conventions outlined in the Tidyverse Style Guide (Wickham 2020). Let’s now use the assignment symbol to give the name can_lang to the 2016 Canadian census language data frame that we get from read_csv. can_lang <- read_csv("data/can_lang.csv") Wait a minute, nothing happened this time! Where’s our data? Actually, something did happen: the data was loaded in and now has the name can_lang associated with it. And we can use that name to access the data frame and do things with it. For example, we can type the name of the data frame to print the first few rows on the screen. You will also see at the top that the number of observations (i.e., rows) and variables (i.e., columns) are printed. Printing the first few rows of a data frame like this is a handy way to get a quick sense for what is contained in a data frame. 
can_lang ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 1.7 Creating subsets of data frames with filter & select Now that we’ve loaded our data into R, we can start wrangling the data to find the ten Aboriginal languages that were most often reported in 2016 as mother tongues in Canada. In particular, we will construct a table with the ten Aboriginal languages that have the largest counts in the mother_tongue column. The filter and select functions from the tidyverse package will help us here. The filter function allows you to obtain a subset of the rows with specific values, while the select function allows you to obtain a subset of the columns. Therefore, we can filter the rows to extract the Aboriginal languages in the data set, and then use select to obtain only the columns we want to include in our table. 1.7.1 Using filter to extract rows Looking at the can_lang data above, we see the category column contains different high-level categories of languages, which include “Aboriginal languages”, “Non-Official & Non-Aboriginal languages” and “Official languages”. To answer our question we want to filter our data set so we restrict our attention to only those languages in the “Aboriginal languages” category. We can use the filter function to obtain the subset of rows with desired values from a data frame. 
Figure 1.4 outlines what arguments we need to specify to use filter. The first argument to filter is the name of the data frame object, can_lang. The second argument is a logical statement to use when filtering the rows. A logical statement evaluates to either TRUE or FALSE; filter keeps only those rows for which the logical statement evaluates to TRUE. For example, in our analysis, we are interested in keeping only languages in the “Aboriginal languages” high-level category. We can use the equality operator == to compare the values of the category column with the value \"Aboriginal languages\"; you will learn about many other kinds of logical statements in Chapter 3. Similar to when we loaded the data file and put quotes around the file name, here we need to put quotes around \"Aboriginal languages\". Using quotes tells R that this is a string value and not one of the special words that make up the R programming language, or one of the names we have given to data frames in the code we have already written. Figure 1.4: Syntax for the filter function. With these arguments, filter returns a data frame that has all the columns of the input data frame, but only those rows we asked for in our logical filter statement.
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages") aboriginal_lang ## # A tibble: 67 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Aboriginal langu… Algonqu… 45 10 0 120 ## 3 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 4 Aboriginal langu… Athabas… 50 10 0 85 ## 5 Aboriginal langu… Atikame… 6150 5465 1100 6645 ## 6 Aboriginal langu… Babine … 110 20 10 210 ## 7 Aboriginal langu… Beaver 190 50 0 340 ## 8 Aboriginal langu… Blackfo… 2815 1110 85 5645 ## 9 Aboriginal langu… Carrier 1025 250 15 2100 ## 10 Aboriginal langu… Cayuga 45 10 10 125 ## # ℹ 57 more rows It’s good practice to check the output after using a function in R. We can see the original can_lang data set contained 214 rows spanning multiple categories. The data frame aboriginal_lang contains only 67 rows, and it appears to contain only languages whose category column is “Aboriginal languages”. So it looks like the function gave us the result we wanted! 1.7.2 Using select to extract columns Now let’s use select to extract the language and mother_tongue columns from this data frame. Figure 1.5 shows us the syntax for the select function. To extract these columns, we need to provide the select function with three arguments. The first argument is the name of the data frame object, which in this example is aboriginal_lang. The second and third arguments are the column names that we want to select: language and mother_tongue. After passing these three arguments, the select function returns two columns (the language and mother_tongue columns that we asked for) as a data frame. This code is also a great example of why being able to name things in R is useful: you can see that we are using the result of our earlier filter step (which we named aboriginal_lang) here in the next step of the analysis! Figure 1.5: Syntax for the select function.
selected_lang <- select(aboriginal_lang, language, mother_tongue) selected_lang ## # A tibble: 67 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Aboriginal languages, n.o.s. 590 ## 2 Algonquian languages, n.i.e. 45 ## 3 Algonquin 1260 ## 4 Athabaskan languages, n.i.e. 50 ## 5 Atikamekw 6150 ## 6 Babine (Wetsuwet'en) 110 ## 7 Beaver 190 ## 8 Blackfoot 2815 ## 9 Carrier 1025 ## 10 Cayuga 45 ## # ℹ 57 more rows 1.8 Using arrange to order and slice to select rows by index number We have used filter and select to obtain a table with only the Aboriginal languages in the data set and their associated counts. However, we want to know the ten languages that are spoken most often. As a next step, we will order the mother_tongue column from largest to smallest value and then extract only the top ten rows. This is where the arrange and slice functions come to the rescue! The arrange function allows us to order the rows of a data frame by the values of a particular column. Figure 1.6 details what arguments we need to specify to use the arrange function. We need to pass the data frame as the first argument to this function, and the variable to order by as the second argument. Since we want to choose the ten Aboriginal languages most often reported as a mother tongue language, we will use the arrange function to order the rows in our selected_lang data frame by the mother_tongue column. We want to arrange the rows in descending order (from largest to smallest), so we pass the column to the desc function before using it as an argument. Figure 1.6: Syntax for the arrange function. arranged_lang <- arrange(selected_lang, by = desc(mother_tongue)) arranged_lang ## # A tibble: 67 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Cree, n.o.s. 
64050 ## 2 Inuktitut 35210 ## 3 Ojibway 17885 ## 4 Oji-Cree 12855 ## 5 Dene 10700 ## 6 Montagnais (Innu) 10235 ## 7 Mi'kmaq 6690 ## 8 Atikamekw 6150 ## 9 Plains Cree 3065 ## 10 Stoney 3025 ## # ℹ 57 more rows Next we will use the slice function, which selects rows according to their row number. Since we want to choose the most common ten languages, we will indicate we want the rows 1 to 10 using the argument 1:10. ten_lang <- slice(arranged_lang, 1:10) ten_lang ## # A tibble: 10 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Cree, n.o.s. 64050 ## 2 Inuktitut 35210 ## 3 Ojibway 17885 ## 4 Oji-Cree 12855 ## 5 Dene 10700 ## 6 Montagnais (Innu) 10235 ## 7 Mi'kmaq 6690 ## 8 Atikamekw 6150 ## 9 Plains Cree 3065 ## 10 Stoney 3025 1.9 Adding and modifying columns using mutate Recall that our data analysis question referred to the count of Canadians that speak each of the top ten most commonly reported Aboriginal languages as their mother tongue, and the ten_lang data frame indeed contains those counts… But perhaps, seeing these numbers, we became curious about the percentage of the population of Canada associated with each count. It is common to come up with new data analysis questions in the process of answering a first one—so fear not and explore! To answer this small question along the way, we need to divide each count in the mother_tongue column by the total Canadian population according to the 2016 census—i.e., 35,151,728—and multiply it by 100. We can perform this computation using the mutate function. We pass the ten_lang data frame as its first argument, then specify the equation that computes the percentages in the second argument. By using a new variable name on the left hand side of the equation, we will create a new column in the data frame; and if we use an existing name, we will modify that variable. In this case, we will opt to create a new column called mother_tongue_percent. 
canadian_population <- 35151728 ten_lang_percent <- mutate(ten_lang, mother_tongue_percent = 100 * mother_tongue / canadian_population) ten_lang_percent ## # A tibble: 10 × 3 ## language mother_tongue mother_tongue_percent ## <chr> <dbl> <dbl> ## 1 Cree, n.o.s. 64050 0.182 ## 2 Inuktitut 35210 0.100 ## 3 Ojibway 17885 0.0509 ## 4 Oji-Cree 12855 0.0366 ## 5 Dene 10700 0.0304 ## 6 Montagnais (Innu) 10235 0.0291 ## 7 Mi'kmaq 6690 0.0190 ## 8 Atikamekw 6150 0.0175 ## 9 Plains Cree 3065 0.00872 ## 10 Stoney 3025 0.00861 The ten_lang_percent data frame shows that the ten Aboriginal languages in the ten_lang data frame were spoken as a mother tongue by between about 0.009% and 0.18% of the Canadian population. 1.10 Exploring data with visualizations The ten_lang table we generated in Section 1.8 answers our initial data analysis question. Are we done? Well, not quite; tables are almost never the best way to present the result of your analysis to your audience. Even the ten_lang table with only two columns presents some difficulty: for example, you have to scrutinize the table quite closely to get a sense for the relative numbers of speakers of each language. When you move on to more complicated analyses, this issue only gets worse. In contrast, a visualization would convey this information in a much more easily understood format. Visualizations are a great tool for summarizing information to help you effectively communicate with your audience, and creating effective data visualizations is an essential component of any data analysis. In this section we will develop a visualization of the ten Aboriginal languages that were most often reported in 2016 as mother tongues in Canada, as well as the number of people that speak each of them. 1.10.1 Using ggplot to create a bar plot In our data set, we can see that language and mother_tongue are in separate columns (or variables). In addition, there is a single row (or observation) for each language.
The data are, therefore, in what we call a tidy data format. Tidy data is a fundamental concept and will be a significant focus in the remainder of this book: many of the functions from tidyverse require tidy data, including the ggplot function that we will use shortly for our visualization. We will formally introduce tidy data in Chapter 3. We will make a bar plot to visualize our data. A bar plot is a chart where the lengths of the bars represent certain values, like counts or proportions. We will make a bar plot using the mother_tongue and language columns from our ten_lang data frame. To create a bar plot of these two variables using the ggplot function, we must specify the data frame, which variables to put on the x and y axes, and what kind of plot to create. The ggplot function and its common usage is illustrated in Figure 1.7. Figure 1.8 shows the resulting bar plot generated by following the instructions in Figure 1.7. Figure 1.7: Creating a bar plot with the ggplot function. ggplot(ten_lang, aes(x = language, y = mother_tongue)) + geom_bar(stat = "identity") Figure 1.8: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made. Note: The vast majority of the time, a single expression in R must be contained in a single line of code. However, there are a small number of situations in which you can have a single R expression span multiple lines. Above is one such case: here, R knows that a line cannot end with a + symbol, and so it keeps reading the next line to figure out what the right-hand side of the + symbol should be. We could, of course, put all of the added layers on one line of code, but splitting them across multiple lines helps a lot with code readability. 1.10.2 Formatting ggplot objects It is exciting that we can already visualize our data to help answer our question, but we are not done yet! 
We can (and should) do more to improve the interpretability of the data visualization that we created. For example, by default, R uses the column names as the axis labels. Usually these column names do not have enough information about the variable in the column. We really should replace this default with a more informative label. For the example above, R uses the column name mother_tongue as the label for the y axis, but most people will not know what that is. And even if they did, they will not know how we measured this variable, or the group of people on which the measurements were taken. An axis label that reads “Mother Tongue (Number of Canadian Residents)” would be much more informative. Adding additional layers to our visualizations that we create in ggplot is one common and easy way to improve and refine our data visualizations. New layers are added to ggplot objects using the + symbol. For example, we can use the xlab (short for x axis label) and ylab (short for y axis label) functions to add layers where we specify meaningful and informative labels for the x and y axes. Again, since we are specifying words (e.g. \"Mother Tongue (Number of Canadian Residents)\") as arguments to xlab and ylab, we surround them with double quotation marks. We can add many more layers to format the plot further, and we will explore these in Chapter 4. ggplot(ten_lang, aes(x = language, y = mother_tongue)) + geom_bar(stat = "identity") + xlab("Language") + ylab("Mother Tongue (Number of Canadian Residents)") Figure 1.9: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue with x and y labels. Note that this visualization is not done yet; there are still improvements to be made. The result is shown in Figure 1.9. This is already quite an improvement! 
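The layering idea can also be seen by building the plot one piece at a time. The sketch below is only illustrative: `toy_lang` is a hypothetical three-row stand-in for the ten_lang data frame (using three of the counts shown earlier), and the intermediate names `base_plot`, `with_bars` and `labelled` are invented for this example.

```r
library(ggplot2)

# Hypothetical three-row stand-in for the ten_lang data frame
toy_lang <- data.frame(
  language = c("Cree, n.o.s.", "Inuktitut", "Ojibway"),
  mother_tongue = c(64050, 35210, 17885)
)

# Start with the data and the axis mapping, then add one piece at a time
base_plot <- ggplot(toy_lang, aes(x = language, y = mother_tongue))
with_bars <- base_plot + geom_bar(stat = "identity")
labelled <- with_bars +
  xlab("Language") +
  ylab("Mother Tongue (Number of Canadian Residents)")

labelled  # printing the object draws the plot
```

Each `+` returns a new plot object, so an intermediate result such as `with_bars` can be kept and extended in different ways.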
Let’s tackle the next major issue with the visualization in Figure 1.9: the overlapping x axis labels, which are currently making it difficult to read the different language names. One solution is to rotate the plot such that the bars are horizontal rather than vertical. To accomplish this, we will swap the x and y coordinate axes: ggplot(ten_lang, aes(x = mother_tongue, y = language)) + geom_bar(stat = "identity") + xlab("Mother Tongue (Number of Canadian Residents)") + ylab("Language") Figure 1.10: Horizontal bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. There are no more serious issues with this visualization, but it could be refined further. Another big step forward, as shown in Figure 1.10! There are no more serious issues with the visualization. Now comes time to refine the visualization to make it even more well-suited to answering the question we asked earlier in this chapter. For example, the visualization could be made more transparent by organizing the bars according to the number of Canadian residents reporting each language, rather than in alphabetical order. We can reorder the bars using the reorder function, which orders a variable (here language) based on the values of the second variable (mother_tongue). ggplot(ten_lang, aes(x = mother_tongue, y = reorder(language, mother_tongue))) + geom_bar(stat = "identity") + xlab("Mother Tongue (Number of Canadian Residents)") + ylab("Language") Figure 1.11: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue with bars reordered. Figure 1.11 provides a very clear and well-organized answer to our original question; we can see what the ten most often reported Aboriginal languages were, according to the 2016 Canadian census, and how many people speak each of them. For instance, we can see that the Aboriginal language most often reported was Cree n.o.s. 
with over 60,000 Canadian residents reporting it as their mother tongue. Note: “n.o.s.” means “not otherwise specified”, so Cree n.o.s. refers to individuals who reported Cree as their mother tongue. In this data set, the Cree languages include the following categories: Cree n.o.s., Swampy Cree, Plains Cree, Woods Cree, and a ‘Cree not included elsewhere’ category (which includes Moose Cree, Northern East Cree and Southern East Cree) (Statistics Canada 2016b). 1.10.3 Putting it all together In the block of code below, we put everything from this chapter together, with a few modifications. In particular, we have actually skipped the select step that we did above; since you specify the variable names to plot in the ggplot function, you don’t actually need to select the columns in advance when creating a visualization. We have also provided comments next to many of the lines of code below using the hash symbol #. When R sees a # sign, it will ignore all of the text that comes after the symbol on that line. So you can use comments to explain lines of code for others, and perhaps more importantly, your future self! It’s good practice to get in the habit of commenting your code to improve its readability. This exercise demonstrates the power of R. In relatively few lines of code, we performed an entire data science workflow with a highly effective data visualization! We asked a question, loaded the data into R, wrangled the data (using filter, arrange and slice) and created a data visualization to help answer our question. In this chapter, you got a quick taste of the data science workflow; continue on with the next few chapters to learn each of these steps in much more detail! 
library(tidyverse) # load the data set can_lang <- read_csv("data/can_lang.csv") # obtain the 10 most common Aboriginal languages aboriginal_lang <- filter(can_lang, category == "Aboriginal languages") arranged_lang <- arrange(aboriginal_lang, by = desc(mother_tongue)) ten_lang <- slice(arranged_lang, 1:10) # create the visualization ggplot(ten_lang, aes(x = mother_tongue, y = reorder(language, mother_tongue))) + geom_bar(stat = "identity") + xlab("Mother Tongue (Number of Canadian Residents)") + ylab("Language") Figure 1.12: Putting it all together: bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. 1.11 Accessing documentation There are many R functions in the tidyverse package (and beyond!), and nobody can be expected to remember what every one of them does or all of the arguments we have to give them. Fortunately, R provides the ? symbol, which provides an easy way to pull up the documentation for most functions quickly. To use the ? symbol to access documentation, you just put the name of the function you are curious about after the ? symbol. For example, if you had forgotten what the filter function did or exactly what arguments to pass in, you could run the following code: ?filter Figure 1.13 shows the documentation that will pop up, including a high-level description of the function, its arguments, a description of each, and more. Note that you may find some of the text in the documentation a bit too technical right now (for example, what is dbplyr, and what is a lazy data frame?). Fear not: as you work through this book, many of these terms will be introduced to you, and slowly but surely you will become more adept at understanding and navigating documentation like that shown in Figure 1.13. 
But do keep in mind that the documentation is not written to teach you about a function; it is just there as a reference to remind you about the different arguments and usage of functions that you have already learned about elsewhere. Figure 1.13: The documentation for the filter function, including a high-level description, a list of arguments and their meanings, and more. 1.12 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “R and the tidyverse” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["reading.html", "Chapter 2 Reading in data locally and from the web 2.1 Overview 2.2 Chapter learning objectives 2.3 Absolute and relative file paths 2.4 Reading tabular data from a plain text file into R 2.5 Reading tabular data from a Microsoft Excel file 2.6 Reading data from a database 2.7 Writing data from R to a .csv file 2.8 Obtaining data from the web 2.9 Exercises 2.10 Additional resources", " Chapter 2 Reading in data locally and from the web 2.1 Overview In this chapter, you’ll learn to read tabular data of various formats into R from your local device (e.g., your laptop) and the web. “Reading” (or “loading”) is the process of converting data (stored as plain text, a database, HTML, etc.) into an object (e.g., a data frame) that R can easily access and manipulate. Thus reading data is the gateway to any data analysis; you won’t be able to analyze data unless you’ve loaded it first. 
And because there are many ways to store data, there are similarly many ways to read data into R. The more time you spend upfront matching the data reading method to the type of data you have, the less time you will have to devote to re-formatting, cleaning and wrangling your data (the second step in all data analyses). It’s like making sure your shoelaces are tied well before going for a run so that you don’t trip later on! 2.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Define the types of path and use them to locate files: absolute file path, relative file path, and Uniform Resource Locator (URL). Read data into R from various types of path using read_csv, read_tsv, read_csv2, read_delim, and read_excel. Compare and contrast the read_* functions. Describe when to use the following read_* function arguments: skip, delim, and col_names. Choose the appropriate tidyverse read_* function and function arguments to load a given plain text tabular data set into R. Use the rename function to rename columns in a data frame. Use the read_excel function and its arguments to load a sheet from an Excel file into R. Work with databases using functions from dbplyr and DBI: connect to a database with dbConnect; list tables in the database with dbListTables; create a reference to a database table with tbl; bring data from a database into R using collect. Use write_csv to save a data frame to a .csv file. (Optional) Obtain data from the web using scraping and application programming interfaces (APIs): read HTML source code from a URL using the rvest package; read data from the NASA “Astronomy Picture of the Day” API using the httr2 package; compare downloading tabular data from a plain text file (e.g., .csv), accessing data from an API, and scraping the HTML source code from a website.
2.3 Absolute and relative file paths This chapter discusses the different functions we can use to import data into R, but before we talk about how to read data with these functions, we first need to talk about where the data lives. When you load a data set into R, you first need to tell R where that file lives. The file could live on your computer (local) or somewhere on the internet (remote). The place where the file lives on your computer is referred to as its “path”. You can think of the path as directions to the file. There are two kinds of paths: relative paths and absolute paths. A relative path indicates where the file is with respect to your working directory (i.e., “where you are currently”) on the computer. On the other hand, an absolute path indicates where the file is with respect to the computer’s filesystem base (or root) folder, regardless of where you are working. Suppose our computer’s filesystem looks like the picture in Figure 2.1. We are working in a file titled project3.ipynb, and our current working directory is project3; typically, as is the case here, the working directory is the directory containing the file you are currently working on. Figure 2.1: Example file system. Let’s say we wanted to open the happiness_report.csv file. We have two options to indicate where the file is: using a relative path, or using an absolute path. The absolute path of the file always starts with a slash / (representing the root folder on the computer) and proceeds by listing out the sequence of folders you would have to enter to reach the file, each separated by another slash /. So in this case, happiness_report.csv would be reached by starting at the root, and entering the home folder, then the dsci-100 folder, then the project3 folder, and then finally the data folder. So its absolute path would be /home/dsci-100/project3/data/happiness_report.csv. We can load the file using its absolute path as a string passed to the read_csv function.
happy_data <- read_csv("/home/dsci-100/project3/data/happiness_report.csv") If we instead wanted to use a relative path, we would need to list out the sequence of steps needed to get from our current working directory to the file, with slashes / separating each step. Since we are currently in the project3 folder, we just need to enter the data folder to reach our desired file. Hence the relative path is data/happiness_report.csv, and we can load the file using its relative path as a string passed to read_csv. happy_data <- read_csv("data/happiness_report.csv") Note that there is no forward slash at the beginning of a relative path; if we accidentally typed \"/data/happiness_report.csv\", R would look for a folder named data in the root folder of the computer, but that doesn’t exist! Aside from specifying places to go in a path using folder names (like data and project3), we can also specify two additional special places: the current directory and the parent directory (the folder that contains the current one). We indicate the current working directory with a single dot (.), and the parent directory with two dots (..). So for instance, if we wanted to reach the bike_share.csv file from the project3 folder, we could use the relative path ../project2/bike_share.csv. We can even combine these two; for example, we could reach the bike_share.csv file using the (very silly) path ../project2/../project2/./bike_share.csv with quite a few redundant directions: it says to go up a folder, then open project2, then go up a folder again, then open project2 again, then stay in the current directory, then finally get to bike_share.csv. Whew, what a long trip! So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths. Using a relative path helps ensure that your code can be run on a different computer (and as an added bonus, relative paths are often shorter and easier to type!).
This is because a file’s relative path is often the same across different computers, while a file’s absolute path (the names of all of the folders between the computer’s root, represented by /, and the file) isn’t usually the same across different computers. For example, suppose Fatima and Jayden are working on a project together on the happiness_report.csv data. Fatima’s file is stored at /home/Fatima/project3/data/happiness_report.csv, while Jayden’s is stored at /home/Jayden/project3/data/happiness_report.csv. Even though Fatima and Jayden stored their files in the same place on their computers (in their home folders), the absolute paths are different due to their different usernames. If Jayden has code that loads the happiness_report.csv data using an absolute path, the code won’t work on Fatima’s computer. But the relative path from inside the project3 folder (data/happiness_report.csv) is the same on both computers; any code that uses relative paths will work on both! In the additional resources section, we include a link to a short video on the difference between absolute and relative paths. You can also check out the here package, which provides methods for finding and constructing file paths in R. Beyond files stored on your computer (i.e., locally), we also need a way to locate resources stored elsewhere on the internet (i.e., remotely). For this purpose we use a Uniform Resource Locator (URL), i.e., a web address that looks something like https://datasciencebook.ca/. URLs indicate the location of a resource on the internet; they start with a scheme (such as https://), followed by a web domain, a forward slash /, and then a path to where the resource is located on the remote machine. 2.4 Reading tabular data from a plain text file into R 2.4.1 read_csv to read in comma-separated values files Now that we have learned about where data could be, we will learn about how to import data into R using various functions.
Specifically, we will learn how to read tabular data from a plain text file (a document containing only text) into R and write tabular data to a file out of R. The function we use to do this depends on the file’s format. For example, in the last chapter, we learned about using the tidyverse read_csv function when reading .csv (comma-separated values) files. In that case, the separator or delimiter that divided our columns was a comma (,). We only learned the case where the data matched the expected defaults of the read_csv function (column names are present, and commas are used as the delimiter between columns). In this section, we will learn how to read files that do not satisfy the default expectations of read_csv. Before we jump into the cases where the data aren’t in the expected default format for tidyverse and read_csv, let’s revisit the more straightforward case where the defaults hold, and the only argument we need to give to the function is the path to the file, data/can_lang.csv. The can_lang data set contains language data from the 2016 Canadian census. We put data/ before the file’s name when we are loading the data set because this data set is located in a sub-folder, named data, relative to where we are running our R code. Here is what the text in the file data/can_lang.csv looks like. 
category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,44 Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 Aboriginal languages,Algonquin,1260,370,40,2480 Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 And here is a review of how we can use read_csv to load it into R. First we load the tidyverse package to gain access to useful functions for reading the data. library(tidyverse) Next we use read_csv to load the data into R, and in that call we specify the relative path to the file. Note that it is normal and expected that a message is printed out after using the read_csv and related functions. This message lets you know the data types of each of the columns that R inferred while reading the data into R. In the future when we use this and related functions to load data in this book, we will silence these messages to help with the readability of the book. canlang_data <- read_csv("data/can_lang.csv") ## Rows: 214 Columns: 6 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (2): category, language ## dbl (4): mother_tongue, most_at_home, most_at_work, lang_known ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. 
Finally, to view the first 10 rows of the data frame, we must call it: canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.2 Skipping rows when reading in data Oftentimes, information about how data was collected, or other relevant information, is included at the top of the data file. This information is usually written in sentence and paragraph form, with no delimiter because it is not organized into columns. An example of this is shown below. This information gives the data scientist useful context and information about the data, however, it is not well formatted or intended to be read into a data frame cell along with the tabular data that follows later in the file. Data source: https://ttimbers.github.io/canlang/ Data originally published in: Statistics Canada Census of Population 2016. Reproduced and distributed on an as-is basis with their permission. 
category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,44 Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 Aboriginal languages,Algonquin,1260,370,40,2480 Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 With this extra information being present at the top of the file, using read_csv as we did previously does not allow us to correctly load the data into R. In the case of this file we end up only reading in one column of the data set. In contrast to the normal and expected messages above, this time R prints out a warning for us indicating that there might be a problem with how our data is being read in. canlang_data <- read_csv("data/can_lang_meta-data.csv") ## Warning: One or more parsing issues, call `problems()` on your data frame for details, ## e.g.: ## dat <- vroom(...) ## problems(dat) canlang_data ## # A tibble: 217 × 1 ## `Data source: https://ttimbers.github.io/canlang/` ## <chr> ## 1 "Data originally published in: Statistics Canada Census of Population 2016." ## 2 "Reproduced and distributed on an as-is basis with their permission." 
## 3 "category,language,mother_tongue,most_at_home,most_at_work,lang_known" ## 4 "Aboriginal languages,\\"Aboriginal languages, n.o.s.\\",590,235,30,665" ## 5 "Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415" ## 6 "Non-Official & Non-Aboriginal languages,\\"Afro-Asiatic languages, n.i.e.\\",… ## 7 "Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150" ## 8 "Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930" ## 9 "Aboriginal languages,\\"Algonquian languages, n.i.e.\\",45,10,0,120" ## 10 "Aboriginal languages,Algonquin,1260,370,40,2480" ## # ℹ 207 more rows To successfully read data like this into R, the skip argument can be useful to tell R how many lines to skip before it should start reading in the data. In the example above, we would set this value to 3. canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3) canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows How did we know to skip three lines? We looked at the data! The first three lines of the data had information we didn’t need to import: Data source: https://ttimbers.github.io/canlang/ Data originally published in: Statistics Canada Census of Population 2016. Reproduced and distributed on an as-is basis with their permission. The column names began at line 4, so we skipped the first three lines. 
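The "look at the data, then choose skip" step can also be done from within R. The following is a minimal sketch, assuming a small made-up file written to a temporary location (the metadata lines and the two data rows are placeholders for illustration, not the book's can_lang_meta-data.csv file); it uses read_lines from readr to peek at the raw text before calling read_csv.

```r
library(readr)

# Hypothetical example file: three lines of free-text metadata, then a CSV table.
path <- tempfile(fileext = ".csv")
write_lines(
  c("Data source: an example source",
    "Collected for illustration only.",
    "Distributed as-is.",
    "language,mother_tongue",
    "Afrikaans,10260",
    "Albanian,26895"),
  path
)

# Peek at the first few raw lines to see where the column names start.
read_lines(path, n_max = 5)

# The column names are on line 4, so skip the 3 lines before them.
example_data <- read_csv(path, skip = 3, show_col_types = FALSE)
example_data
```

Here show_col_types = FALSE only quiets the column specification message; it does not change how the data is read.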
2.4.3 read_tsv to read in tab-separated values files Another common way data is stored is with tabs as the delimiter. Notice the data file, can_lang.tsv, has tabs in between the columns instead of commas. category language mother_tongue most_at_home most_at_work lang_kno Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930 Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120 Aboriginal languages Algonquin 1260 370 40 2480 Non-Official & Non-Aboriginal languages American Sign Language 2685 3020 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 We can use the read_tsv function to read in .tsv (tab separated values) files. canlang_data <- read_tsv("data/can_lang.tsv") canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows If you compare the data frame here to the data frame we obtained in Section 2.4.1 using read_csv, you’ll notice that they look identical: they have the same number of columns and rows, the same column names, and the same entries! 
So even though we needed to use a different function depending on the file format, our resulting data frame (canlang_data) in both cases was the same. 2.4.4 read_delim as a more flexible method to get tabular data into R The read_csv and read_tsv functions are actually just special cases of the more general read_delim function. We can use read_delim to import both comma and tab-separated values files, and more; we just have to specify the delimiter. For example, the can_lang_no_names.tsv file contains a different version of this same data set with no column names and uses tabs as the delimiter instead of commas. Here is how the file would look in a plain text editor: Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930 Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120 Aboriginal languages Algonquin 1260 370 40 2480 Non-Official & Non-Aboriginal languages American Sign Language 2685 3020 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 Non-Official & Non-Aboriginal languages Arabic 419890 223535 5585 629055 To read this into R using the read_delim function, we specify the path to the file as the first argument, provide the tab character "\t" as the delim argument, and set the col_names argument to FALSE to denote that there are no column names provided in the data. Note that the read_csv, read_tsv, and read_delim functions all have a col_names argument with the default value TRUE. Note: \t is an example of an escaped character, which always starts with a backslash (\). Escaped characters are used to represent non-printing characters (like the tab) or those with special meanings (such as quotation marks).
canlang_data <- read_delim("data/can_lang_no_names.tsv", delim = "\t", col_names = FALSE) canlang_data ## # A tibble: 214 × 6 ## X1 X2 X3 X4 X5 X6 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal languages Aborigina… 590 235 30 665 ## 2 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415 ## 3 Non-Official & Non-Aboriginal languages Afro-Asia… 1150 445 10 2775 ## 4 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150 ## 5 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930 ## 6 Aboriginal languages Algonquia… 45 10 0 120 ## 7 Aboriginal languages Algonquin 1260 370 40 2480 ## 8 Non-Official & Non-Aboriginal languages American … 2685 3020 1145 21930 ## 9 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 ## 10 Non-Official & Non-Aboriginal languages Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows Data frames in R need to have column names. Thus if you read in data without column names, R will assign names automatically. In this example, R assigns the column names X1, X2, X3, X4, X5, X6. It is best to rename your columns manually in this scenario. The current column names (X1, X2, etc.) are not very descriptive and will make your analysis confusing. To rename your columns, you can use the rename function from the dplyr R package (Wickham, François, et al. 2021) (one of the packages loaded with tidyverse, so we don’t need to load it separately). The first argument is the data set, and in the subsequent arguments you write new_name = old_name for the selected variables to rename. We rename the X1, X2, ..., X6 columns in the canlang_data data frame to more descriptive names below.
canlang_data <- rename(canlang_data, category = X1, language = X2, mother_tongue = X3, most_at_home = X4, most_at_work = X5, lang_known = X6) canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.5 Reading tabular data directly from a URL We can also use read_csv, read_tsv, or read_delim (and related functions) to read in data directly from a Uniform Resource Locator (URL) that contains tabular data. Here, we provide the URL of a remote file to read_*, instead of a path to a local file on our computer. We need to surround the URL with quotes similar to when we specify a path on our local computer. All other arguments that we use are the same as when using these functions with a local file on our computer. 
url <- "https://raw.githubusercontent.com/UBC-DSCI/data/main/can_lang.csv" canlang_data <- read_csv(url) canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.6 Downloading data from a URL Occasionally the data available at a URL is not formatted nicely enough to use read_csv, read_tsv, read_delim, or other related functions to read the data directly into R. In situations where it is necessary to download a file to our local computer prior to working with it in R, we can use the download.file function. The first argument is the URL, and the second is a path where we would like to store the downloaded file. 
download.file(url, "data/can_lang.csv") canlang_data <- read_csv("data/can_lang.csv") canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.7 Previewing a data file before reading it into R In many of the examples above, we gave you previews of the data file before we read it into R. Previewing data is essential to see whether or not there are column names, what the delimiters are, and if there are lines you need to skip. You should do this yourself when trying to read in data files: open the file in whichever text editor you prefer to inspect its contents prior to reading it into R. 2.5 Reading tabular data from a Microsoft Excel file There are many other ways to store tabular data sets beyond plain text files, and similarly, many ways to load those data sets into R. For example, it is very common to encounter, and need to load into R, data stored as a Microsoft Excel spreadsheet (with the file name extension .xlsx). To be able to do this, a key thing to know is that even though .csv and .xlsx files look almost identical when loaded into Excel, the data themselves are stored completely differently. While .csv files are plain text files, where the characters you see when you open the file in a text editor are exactly the data they represent, this is not the case for .xlsx files. 
Take a look at a snippet of what a .xlsx file would look like in a text editor: ,?'O _rels/.rels???J1??>E?{7? <?V????w8?'J???'QrJ???Tf?d??d?o?wZ'???@>?4'?|??hlIo??F t 8f??3wn ????t??u"/ %~Ed2??<?w?? ?Pd(??J-?E???7?'t(?-GZ?????y???c~N?g[^_r?4 yG?O ?K??G? ]TUEe??O??c[???????6q??s??d?m???\\???H?^????3} ?rZY? ?:L60?^?????XTP+?|? X?a??4VT?,D?Jq This type of file representation allows Excel files to store additional things that you cannot store in a .csv file, such as fonts, text formatting, graphics, multiple sheets and more. And despite looking odd in a plain text editor, we can read Excel spreadsheets into R using the readxl package developed specifically for this purpose. library(readxl) canlang_data <- read_excel("data/can_lang.xlsx") canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows If the .xlsx file has multiple sheets, you have to use the sheet argument to specify the sheet number or name. You can also specify cell ranges using the range argument. This functionality is useful when a single sheet contains multiple tables (a sad thing that happens to many Excel spreadsheets since this makes reading in data more difficult). As with plain text files, you should always explore the data file before importing it into R. Exploring the data beforehand helps you decide which arguments you need to load the data into R successfully. 
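One way to explore a workbook without opening Excel is to list its sheets from R before reading any of them. The sketch below is only an illustration: it assumes the datasets.xlsx example workbook that ships with the readxl package (not the book's can_lang.xlsx file), and the cell range "A1:C5" is an arbitrary choice to show the range argument.

```r
library(readxl)

# readxl ships a few example workbooks; datasets.xlsx contains several sheets.
path <- readxl_example("datasets.xlsx")

# List the sheet names so we know what can be passed to the sheet argument.
excel_sheets(path)

# Read one sheet by name, restricted to a rectangular block of cells
# (the header row plus four data rows, first three columns).
mtcars_part <- read_excel(path, sheet = "mtcars", range = "A1:C5")
mtcars_part
```

Passing a sheet number (for example sheet = 2) instead of a name also works; naming the sheet tends to be more robust if sheets are later reordered.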
If you do not have the Excel program on your computer, you can use other programs to preview the file. Examples include Google Sheets and Libre Office. In Table 2.1 we summarize the read_* functions we covered in this chapter. We also include the read_csv2 function for data separated by semicolons ;, which you may run into with data sets where the decimal is represented by a comma instead of a period (as with some data sets from European countries).
Table 2.1: Summary of read_* functions
Data File Type | R Function | R Package
Comma (,) separated files | read_csv | readr
Tab (\t) separated files | read_tsv | readr
Semicolon (;) separated files | read_csv2 | readr
Various formats (.csv, .tsv) | read_delim | readr
Excel files (.xlsx) | read_excel | readxl
Note: readr is a part of the tidyverse package so we did not need to load this package separately since we loaded tidyverse. 2.6 Reading data from a database Another very common form of data storage is the relational database. Databases are great when you have large data sets or multiple users working on a project. There are many relational database management systems, such as SQLite, MySQL, PostgreSQL, Oracle, and many more. These different relational database management systems each have their own advantages and limitations. Almost all employ SQL (structured query language) to obtain data from the database. But you don’t need to know SQL to analyze data from a database; several packages have been written that allow you to connect to relational databases and use the R programming language to obtain data. In this book, we will give examples of how to do this using R with SQLite and PostgreSQL databases. 2.6.1 Reading data from a SQLite database SQLite is probably the simplest relational database system that one can use in combination with R. SQLite databases are self-contained, and are usually stored and accessed locally on one computer from a file with a .db extension (or sometimes an .sqlite extension).
Similar to Excel files, these are not plain text files and cannot be read in a plain text editor. The first thing you need to do to read data into R from a database is to connect to the database. We do that using the dbConnect function from the DBI (database interface) package. This does not read in the data, but simply tells R where the database is and opens up a communication channel that R can use to send SQL commands to the database. library(DBI) canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db") Often relational databases have many tables; thus, in order to retrieve data from a database, you need to know the name of the table in which the data is stored. You can get the names of all the tables in the database using the dbListTables function: tables <- dbListTables(canlang_conn) tables ## [1] "lang" The dbListTables function returned only one name, which tells us that there is only one table in this database. To reference a table in the database (so that we can perform operations like selecting columns and filtering rows), we use the tbl function from the dbplyr package. The object returned by the tbl function allows us to work with data stored in databases as if they were just regular data frames; but secretly, behind the scenes, dbplyr is turning your function calls (e.g., select and filter) into SQL queries! library(dbplyr) lang_db <- tbl(canlang_conn, "lang") lang_db ## # Source: table<lang> [?? 
x 6] ## # Database: sqlite 3.41.2 [/home/rstudio/introduction-to-datascience/data/can_lang.db] ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ more rows Although it looks like we just got a data frame from the database, we didn’t! It’s a reference; the data is still stored only in the SQLite database. The dbplyr package works this way because databases are often more efficient at selecting, filtering and joining large data sets than R. And typically the database will not even be stored on your computer, but rather a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell it to using the collect function. Figure 2.2 highlights the difference between a tibble object in R and the output we just created. Notice in the table on the right, the first two lines of the output indicate the source is SQL. The last line doesn’t show how many rows there are (R is trying to avoid performing expensive query operations), whereas the output for the tibble object does. Figure 2.2: Comparison of a reference to data in a database and a tibble in R. We can look at the SQL commands that are sent to the database when we write tbl(canlang_conn, \"lang\") in R with the show_query function from the dbplyr package. show_query(tbl(canlang_conn, "lang")) ## <SQL> ## SELECT * ## FROM `lang` The output above shows the SQL code that is sent to the database. 
When we write tbl(canlang_conn, \"lang\") in R, in the background, the function is translating the R code into SQL, sending that SQL to the database, and then translating the response for us. So dbplyr does all the hard work of translating from R to SQL and back for us; we can just stick with R! With our lang_db table reference for the 2016 Canadian Census data in hand, we can mostly continue onward as if it were a regular data frame. For example, let’s do the same exercise from Chapter 1: we will obtain only those rows corresponding to Aboriginal languages, and keep only the language and mother_tongue columns. We can use the filter function to obtain only certain rows. Below we filter the data to include only Aboriginal languages. aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages") aboriginal_lang_db ## # Source: SQL [?? x 6] ## # Database: sqlite 3.41.2 [/home/rstudio/introduction-to-datascience/data/can_lang.db] ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Aboriginal langu… Algonqu… 45 10 0 120 ## 3 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 4 Aboriginal langu… Athabas… 50 10 0 85 ## 5 Aboriginal langu… Atikame… 6150 5465 1100 6645 ## 6 Aboriginal langu… Babine … 110 20 10 210 ## 7 Aboriginal langu… Beaver 190 50 0 340 ## 8 Aboriginal langu… Blackfo… 2815 1110 85 5645 ## 9 Aboriginal langu… Carrier 1025 250 15 2100 ## 10 Aboriginal langu… Cayuga 45 10 10 125 ## # ℹ more rows Above you can again see the hints that this data is not actually stored in R yet: the source is SQL [?? x 6] and the output says ... more rows at the end (both indicating that R does not know how many rows there are in total!), and a database type sqlite is listed. We didn’t use the collect function because we are not ready to bring the data into R yet. 
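To see what a filter step asks the database to do, show_query can be applied to the filtered reference as well. The following is a minimal sketch, assuming a tiny in-memory SQLite database built on the spot (the toy_conn, toy_db, and toy_data names and the three-row table are hypothetical stand-ins, using a few values from the can_lang data shown earlier) so that it runs without the can_lang.db file.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Toy stand-in for the lang table, stored in a temporary in-memory SQLite database.
toy_conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(toy_conn, "lang", data.frame(
  category = c("Aboriginal languages",
               "Non-Official & Non-Aboriginal languages",
               "Aboriginal languages"),
  language = c("Algonquin", "Afrikaans", "Beaver"),
  mother_tongue = c(1260, 10260, 190)
))

# A reference to the table, then a filtered reference; no rows are in R yet.
toy_db <- tbl(toy_conn, "lang")
toy_filtered_db <- filter(toy_db, category == "Aboriginal languages")

# Print the SQL that dbplyr generates for the filtered reference.
show_query(toy_filtered_db)

# collect() runs the query and returns an ordinary tibble.
toy_data <- collect(toy_filtered_db)
toy_data

# Close the connection once the data has been collected.
dbDisconnect(toy_conn)
```

The printed query should contain a WHERE clause on category, which is the database-side equivalent of the filter call; the exact SQL text can vary between dbplyr versions.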
We can still use the database to do some work to obtain only the small amount of data we want to work with locally in R. Let’s add the second part of our database query: selecting only the language and mother_tongue columns using the select function. aboriginal_lang_selected_db <- select(aboriginal_lang_db, language, mother_tongue) aboriginal_lang_selected_db ## # Source: SQL [?? x 2] ## # Database: sqlite 3.41.2 [/home/rstudio/introduction-to-datascience/data/can_lang.db] ## language mother_tongue ## <chr> <dbl> ## 1 Aboriginal languages, n.o.s. 590 ## 2 Algonquian languages, n.i.e. 45 ## 3 Algonquin 1260 ## 4 Athabaskan languages, n.i.e. 50 ## 5 Atikamekw 6150 ## 6 Babine (Wetsuwet'en) 110 ## 7 Beaver 190 ## 8 Blackfoot 2815 ## 9 Carrier 1025 ## 10 Cayuga 45 ## # ℹ more rows Now you can see that the database will return only the two columns we asked for with the select function. In order to actually retrieve this data in R as a data frame, we use the collect function. Below you will see that after running collect, R knows that the retrieved data has 67 rows, and there is no database listed any more. aboriginal_lang_data <- collect(aboriginal_lang_selected_db) aboriginal_lang_data ## # A tibble: 67 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Aboriginal languages, n.o.s. 590 ## 2 Algonquian languages, n.i.e. 45 ## 3 Algonquin 1260 ## 4 Athabaskan languages, n.i.e. 50 ## 5 Atikamekw 6150 ## 6 Babine (Wetsuwet'en) 110 ## 7 Beaver 190 ## 8 Blackfoot 2815 ## 9 Carrier 1025 ## 10 Cayuga 45 ## # ℹ 57 more rows Aside from knowing the number of rows, the data looks pretty similar in both outputs shown above. And dbplyr provides many more functions (not just filter) that you can use to directly feed the database reference (lang_db) into downstream analysis functions (e.g., ggplot2 for data visualization). But dbplyr does not provide every function that we need for analysis; we do eventually need to call collect. 
For example, look what happens when we try to use nrow to count rows in a data frame: nrow(aboriginal_lang_selected_db) ## [1] NA or tail to preview the last six rows of a data frame: tail(aboriginal_lang_selected_db) ## Error: tail() is not supported by sql sources Additionally, some operations will not work to extract columns or single values from the reference given by the tbl function. Thus, once you have finished your data wrangling of the tbl database reference object, it is advisable to bring it into R as a data frame using collect. But be very careful using collect: databases are often very big, and reading an entire table into R might take a long time to run or even possibly crash your machine. So make sure you use filter and select on the database table to reduce the data to a reasonable size before using collect to read it into R! 2.6.2 Reading data from a PostgreSQL database PostgreSQL (also called Postgres) is a very popular and open-source option for relational database software. Unlike SQLite, PostgreSQL uses a client–server database engine, as it was designed to be used and accessed on a network. This means that you have to provide more information to R when connecting to Postgres databases. The additional information that you need to include when you call the dbConnect function is listed below: dbname: the name of the database (a single PostgreSQL instance can host more than one database) host: the URL pointing to where the database is located port: the communication endpoint between R and the PostgreSQL database (usually 5432) user: the username for accessing the database password: the password for accessing the database Additionally, we must use the RPostgres package instead of RSQLite in the dbConnect function call. Below we demonstrate how to connect to a version of the can_mov_db database, which contains information about Canadian movies. 
Note that the host (fakeserver.stat.ubc.ca), user (user0001), and password (abc123) below are not real; you will not actually be able to connect to a database using this information. library(RPostgres) canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db", host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123") After opening the connection, everything looks and behaves almost identically to when we were using an SQLite database in R. For example, we can again use dbListTables to find out what tables are in the can_mov_db database: dbListTables(canmov_conn) [1] "themes" "medium" "titles" "title_aliases" "forms" [6] "episodes" "names" "names_occupations" "occupation" "ratings" We see that there are 10 tables in this database. Let’s first look at the "ratings" table to find the lowest rating that exists in the can_mov_db database: ratings_db <- tbl(canmov_conn, "ratings") ratings_db # Source: table<ratings> [?? x 3] # Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db] title average_rating num_votes <chr> <dbl> <int> 1 The Grand Seduction 6.6 150 2 Rhymes for Young Ghouls 6.3 1685 3 Mommy 7.5 1060 4 Incendies 6.1 1101 5 Bon Cop, Bad Cop 7.0 894 6 Goon 5.5 1111 7 Monsieur Lazhar 5.6 610 8 What if 5.3 1401 9 The Barbarian Invasions 5.8 99 10 Away from Her 6.9 2311 # … with more rows To find the lowest rating that exists in the database, we first need to extract the average_rating column using select: avg_rating_db <- select(ratings_db, average_rating) avg_rating_db # Source: lazy query [?? x 1] # Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db] average_rating <dbl> 1 6.6 2 6.3 3 7.5 4 6.1 5 7.0 6 5.5 7 5.6 8 5.3 9 5.8 10 6.9 # … with more rows Next we use min to find the minimum rating in that column: min(avg_rating_db) Error in min(avg_rating_db) : invalid 'type' (list) of argument Instead of the minimum, we get an error!
This is another example of when we need to use the collect function to bring the data into R for further computation: avg_rating_data <- collect(avg_rating_db) min(avg_rating_data) [1] 1 We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie… 2.6.3 Why should we bother with databases at all? Opening a database involved a lot more effort than just opening a .csv, .tsv, or any of the other plain text or Excel formats. We had to open a connection to the database, then use dbplyr to translate tidyverse-like commands (filter, select etc.) into SQL commands that the database understands, and then finally collect the results. And not all tidyverse commands can currently be translated to work with databases. For example, we can compute a mean with a database but can’t easily compute a median. So you might be wondering: why should we use databases at all? Databases are beneficial in a large-scale setting: They enable storing large data sets across multiple computers with backups. They provide mechanisms for ensuring data integrity and validating input. They provide security and data access control. They allow multiple users to access data simultaneously and remotely without conflicts and errors. For example, there were billions of Google searches conducted daily in 2021 (Real Time Statistics Project 2021). Can you imagine if Google stored all of the data from those searches in a single .csv file!? Chaos would ensue! 2.7 Writing data from R to a .csv file At the middle and end of a data analysis, we often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file to share it with others or use it for another step in the analysis. The most straightforward way to do this is to use the write_csv function from the tidyverse package. The default arguments for this function are to use a comma (,) as the delimiter and include column names.
Below we demonstrate creating a new version of the Canadian languages data set without the official languages category according to the Canadian 2016 Census, and then writing this to a .csv file: no_official_lang_data <- filter(can_lang, category != "Official languages") write_csv(no_official_lang_data, "data/no_official_languages.csv") 2.8 Obtaining data from the web Note: This section is not required reading for the remainder of the textbook. It is included for those readers interested in learning a little bit more about how to obtain different types of data from the web. Data doesn’t just magically appear on your computer; you need to get it from somewhere. Earlier in the chapter we showed you how to access data stored in a plain text, spreadsheet-like format (e.g., comma- or tab-separated) from a web URL using one of the read_* functions from the tidyverse. But as time goes on, it is increasingly uncommon to find data (especially large amounts of data) in this format available for download from a URL. Instead, websites now often offer something known as an application programming interface (API), which provides a programmatic way to ask for subsets of a data set. This allows the website owner to control who has access to the data, what portion of the data they have access to, and how much data they can access. Typically, the website owner will give you a token or key (a secret string of characters somewhat like a password) that you have to provide when accessing the API. Another interesting thought: websites themselves are data! When you type a URL into your browser window, your browser asks the web server (another computer on the internet whose job it is to respond to requests for the website) to give it the website’s data, and then your browser translates that data into something you can see. If the website shows you some information that you’re interested in, you could create a data set for yourself by copying and pasting that information into a file. 
This process of taking information directly from what a website displays is called web scraping (or sometimes screen scraping). Now, of course, copying and pasting information manually is a painstaking and error-prone process, especially when there is a lot of information to gather. So instead of asking your browser to translate the information that the web server provides into something you can see, you can collect that data programmatically—in the form of hypertext markup language (HTML) and cascading style sheet (CSS) code—and process it to extract useful information. HTML provides the basic structure of a site and tells the webpage how to display the content (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the content and tells the webpage how the HTML elements should be presented (e.g., colors, layouts, fonts etc.). This subsection will show you the basics of both web scraping with the rvest R package (Wickham 2021a) and accessing the NASA “Astronomy Picture of the Day” API using the httr2 R package (Wickham 2023). 2.8.1 Web scraping HTML and CSS selectors When you enter a URL into your browser, your browser connects to the web server at that URL and asks for the source code for the website. This is the data that the browser translates into something you can see; so if we are going to create our own data by scraping a website, we have to first understand what that data looks like! For example, let’s say we are interested in knowing the average rental price (per square foot) of the most recently available one-bedroom apartments in Vancouver on Craigslist. When we visit the Vancouver Craigslist website and search for one-bedroom apartments, we should see something similar to Figure 2.3. Figure 2.3: Craigslist webpage of advertisements for one-bedroom apartments. Based on what our browser shows us, it’s pretty easy to find the size and price for each apartment listed.
But we would like to be able to obtain that information using R, without any manual human effort or copying and pasting. We do this by examining the source code that the web server actually sent our browser to display for us. We show a snippet of it below; the entire source is included with the code for this book: <span class="result-meta"> <span class="result-price">$800</span> <span class="housing"> 1br - </span> <span class="result-hood"> (13768 108th Avenue)</span> <span class="result-tags"> <span class="maptag" data-pid="6786042973">map</span> </span> <span class="banish icon icon-trash" role="button"> <span class="screen-reader-text">hide this posting</span> </span> <span class="unbanish icon icon-trash red" role="button"></span> <a href="#" class="restore-link"> <span class="restore-narrow-text">restore</span> <span class="restore-wide-text">restore this posting</span> </a> <span class="result-price">$2285</span> </span> Oof…you can tell that the source code for a web page is not really designed for humans to understand easily. However, if you look through it closely, you will find that the information we’re interested in is hidden among the muck. For example, near the top of the snippet above you can see a line that looks like <span class="result-price">$800</span> That snippet is definitely storing the price of a particular apartment. With some more investigation, you should be able to find things like the date and time of the listing, the address of the listing, and more. So this source code most likely contains all the information we are interested in! Let’s dig into that line above a bit more. You can see that that bit of code has an opening tag (words between < and >, like <span>) and a closing tag (the same with a slash, like </span>). HTML source code generally stores its data between opening and closing tags like these. Tags are keywords that tell the web browser how to display or format the content. 
Above you can see that the information we want ($800) is stored between an opening and closing tag (<span> and </span>). In the opening tag, you can also see a very useful “class” (a special word that is sometimes included with opening tags): class=\"result-price\". Since we want R to programmatically sort through all of the source code for the website to find apartment prices, maybe we can look for all the tags with the \"result-price\" class, and grab the information between the opening and closing tag. Indeed, take a look at another line of the source snippet above: <span class="result-price">$2285</span> It’s yet another price for an apartment listing, and the tags surrounding it have the \"result-price\" class. Wonderful! Now that we know what pattern we are looking for—a dollar amount between opening and closing tags that have the \"result-price\" class—we should be able to use code to pull out all of the matching patterns from the source code to obtain our data. This sort of “pattern” is known as a CSS selector (where CSS stands for cascading style sheet). The above was a simple example of “finding the pattern to look for”; many websites are quite a bit larger and more complex, and so is their website source code. Fortunately, there are tools available to make this process easier. For example, SelectorGadget is an open-source tool that simplifies generating and finding CSS selectors. At the end of the chapter in the additional resources section, we include a link to a short video on how to install and use the SelectorGadget tool to obtain CSS selectors for use in web scraping. After installing and enabling the tool, you can click the website element for which you want an appropriate selector. For example, if we click the price of an apartment listing, we find that SelectorGadget shows us the selector .result-price in its toolbar, and highlights all the other apartment prices that would be obtained using that selector (Figure 2.4).
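To make the idea of matching a class-based pattern concrete, here is a minimal sketch that applies the .result-price selector to a short, hard-coded HTML string adapted from the snippet above. Nothing is downloaded from any website. It relies on the rvest functions read_html, html_nodes, and html_text, which are introduced in more detail later in this section; the object names snippet, price_nodes, and prices are chosen here purely for illustration.

```r
library(rvest)

# A small HTML fragment typed in by hand (adapted from the snippet above);
# this is plain text stored in R, not a page fetched from the web
snippet <- '<span class="result-meta">
  <span class="result-price">$800</span>
  <span class="housing">1br -</span>
  <span class="result-price">$2285</span>
</span>'

# Parse the text as HTML
page <- read_html(snippet)

# Keep only the elements whose class is "result-price"
price_nodes <- html_nodes(page, ".result-price")

# Extract the text stored between the opening and closing tags
prices <- html_text(price_nodes)
prices
```

The leading dot in ".result-price" is what tells the selector to match on a class name rather than on a tag name such as span.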
Figure 2.4: Using the SelectorGadget on a Craigslist webpage to obtain the CSS selector useful for obtaining apartment prices. If we then click the size of an apartment listing, SelectorGadget shows us the span selector, and highlights many of the lines on the page; this indicates that the span selector is not specific enough to capture only apartment sizes (Figure 2.5). Figure 2.5: Using the SelectorGadget on a Craigslist webpage to obtain a CSS selector useful for obtaining apartment sizes. To narrow the selector, we can click one of the highlighted elements that we do not want. For example, we can deselect the “pic/map” links, resulting in only the data we want highlighted using the .housing selector (Figure 2.6). Figure 2.6: Using the SelectorGadget on a Craigslist webpage to refine the CSS selector to one that is most useful for obtaining apartment sizes. So to scrape information about the square footage and rental price of apartment listings, we need to use the two CSS selectors .housing and .result-price, respectively. The selector gadget returns them to us as a comma-separated list (here .housing , .result-price), which is exactly the format we need to provide to R if we are using more than one CSS selector. Caution: are you allowed to scrape that website? Before scraping data from the web, you should always check whether or not you are allowed to scrape it! There are two documents that are important for this: the robots.txt file and the Terms of Service document. If we take a look at Craigslist’s Terms of Service document, we find the following text: “You agree not to copy/collect CL content via robots, spiders, scripts, scrapers, crawlers, or any automated or manual equivalent (e.g., by hand).” So unfortunately, without explicit permission, we are not allowed to scrape the website. What to do now? Well, we could ask the owner of Craigslist for permission to scrape.
However, we are not likely to get a response, and even if we did they would not likely give us permission. The more realistic answer is that we simply cannot scrape Craigslist. If we still want to find data about rental prices in Vancouver, we must go elsewhere. To continue learning how to scrape data from the web, let’s instead scrape data on the population of Canadian cities from Wikipedia. We have checked the Terms of Service document, and it does not mention that web scraping is disallowed. We will use the SelectorGadget tool to pick elements that we are interested in (city names and population counts) and deselect others to indicate that we are not interested in them (province names), as shown in Figure 2.7. Figure 2.7: Using the SelectorGadget on a Wikipedia webpage. We include a link to a short video tutorial on this process at the end of the chapter in the additional resources section. SelectorGadget provides in its toolbar the following list of CSS selectors to use: td:nth-child(8) , td:nth-child(4) , .largestCities-cell-background+ td a Now that we have the CSS selectors that describe the properties of the elements that we want to target, we can use them to find certain elements in web pages and extract data. Using rvest We will use the rvest R package to scrape data from the Wikipedia page. We start by loading the rvest package: library(rvest) Next, we tell R what page we want to scrape by providing the webpage’s URL in quotations to the function read_html: page <- read_html("https://en.wikipedia.org/wiki/Canada") The read_html function directly downloads the source code for the page at the URL you specify, just like your browser would if you navigated to that site. But instead of displaying the website to you, the read_html function just returns the HTML source code itself, which we have stored in the page variable. Next, we send the page object to the html_nodes function, along with the CSS selectors we obtained from the SelectorGadget tool. 
Make sure to surround the selectors with quotation marks; the function, html_nodes, expects that argument to be a string. We store the result of the html_nodes function in the population_nodes variable. Note that below we use the paste function with a comma separator (sep=\",\") to build the list of selectors. The paste function converts elements to characters and combines the values into a single string. We use this function to build the list of selectors to maintain code readability; this avoids having a very long line of code. selectors <- paste("td:nth-child(8)", "td:nth-child(4)", ".largestCities-cell-background+ td a", sep = ",") population_nodes <- html_nodes(page, selectors) head(population_nodes) ## {xml_nodeset (6)} ## [1] <a href="/wiki/Greater_Toronto_Area" title="Greater Toronto Area">Toronto ... ## [2] <td style="text-align:right;">6,202,225</td> ## [3] <a href="/wiki/London,_Ontario" title="London, Ontario">London</a> ## [4] <td style="text-align:right;">543,551\\n</td> ## [5] <a href="/wiki/Greater_Montreal" title="Greater Montreal">Montreal</a> ## [6] <td style="text-align:right;">4,291,732</td> Note: head is a function that is often useful for viewing only a short summary of an R object, rather than the whole thing (which may be quite a lot to look at). For example, here head shows us only the first 6 items in the population_nodes object. Note that some R objects by default print only a small summary. For example, tibble data frames only show you the first 10 rows. But not all R objects do this, and that’s where the head function helps summarize things for you. Each of the items in the population_nodes list is a node from the HTML document that matches the CSS selectors you specified. A node is an HTML tag pair (e.g., <td> and </td> which defines the cell of a table) combined with the content stored between the tags.
For our CSS selector td:nth-child(4), an example node that would be selected would be: <td style="text-align:left;background:#f0f0f0;"> <a href="/wiki/London,_Ontario" title="London, Ontario">London</a> </td> Next we extract the meaningful data—in other words, we get rid of the HTML code syntax and tags—from the nodes using the html_text function. In the case of the example node above, the html_text function returns \"London\". population_text <- html_text(population_nodes) head(population_text) ## [1] "Toronto" "6,202,225" "London" "543,551\\n" "Montreal" "4,291,732" Fantastic! We seem to have extracted the data of interest from the raw HTML source code. But we are not quite done; the data is not yet in an optimal format for data analysis. Both the city names and population are encoded as characters in a single vector, instead of being in a data frame with one character column for city and one numeric column for population (like a spreadsheet). Additionally, the populations contain commas (not useful for programmatically dealing with numbers), and some even contain a line break character at the end (\\n). In Chapter 3, we will learn more about how to wrangle data such as this into a more useful format for data analysis using R. 2.8.2 Using an API Rather than posting a data file at a URL for you to download, many websites these days provide an API that must be accessed through a programming language like R. The benefit of using an API is that data owners have much more control over the data they provide to users. However, unlike web scraping, there is no consistent way to access an API across websites. Every website typically has its own API designed especially for its own use case. Therefore we will just provide one example of accessing data through an API in this book, with the hope that it gives you enough of a basic idea that you can learn how to use another API if needed.
In particular, in this book we will show you the basics of how to use the httr2 package in R to access data from the NASA “Astronomy Picture of the Day” API (a great source of desktop backgrounds, by the way—take a look at the stunning picture of the Rho-Ophiuchi cloud complex (NASA et al. 2023) in Figure 2.8 from July 13, 2023!). Figure 2.8: The James Webb Space Telescope’s NIRCam image of the Rho Ophiuchi molecular cloud complex. First, you will need to visit the NASA APIs page and generate an API key (i.e., a password used to identify you when accessing the API). Note that a valid email address is required to associate with the key. The signup form looks something like Figure 2.9. After filling out the basic information, you will receive the token via email. Make sure to store the key in a safe place, and keep it private. Figure 2.9: Generating the API access token for the NASA API Caution: think about your API usage carefully! When you access an API, you are initiating a transfer of data from a web server to your computer. Web servers are expensive to run and do not have infinite resources. If you try to ask for too much data at once, you can use up a huge amount of the server’s bandwidth. If you try to ask for data too frequently—e.g., if you make many requests to the server in quick succession—you can also bog the server down and make it unable to talk to anyone else. Most servers have mechanisms to revoke your access if you are not careful, but you should try to prevent issues from happening in the first place by being extra careful with how you write and run your code. You should also keep in mind that when a website owner grants you API access, they also usually specify a limit (or quota) of how much data you can ask for. Be careful not to overrun your quota! So before we try to use the API, we will first visit the NASA website to see what limits we should abide by when using the API. These limits are outlined in Figure 2.10. 
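One simple way to stay within a request quota is to work out how long to pause between consecutive requests. The sketch below does that arithmetic for the hourly limit of 1,000 requests shown in Figure 2.10. The helper min_delay_seconds is a hypothetical function written for this illustration; it is not part of httr2 or any other package.

```r
# Hypothetical helper (not from any package): the shortest pause, in seconds,
# between requests that keeps us at or below `max_requests` requests
# in every window of `window_seconds` seconds
min_delay_seconds <- function(max_requests, window_seconds) {
  window_seconds / max_requests
}

# 1,000 requests per hour, where one hour is 3600 seconds
delay <- min_delay_seconds(max_requests = 1000, window_seconds = 3600)
delay

# In code that sends many requests one after another, we could then
# pause after each request, for example with Sys.sleep(delay)
```

Pausing for at least this long after every request means that even a long-running loop of requests should not exceed the stated quota.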
Figure 2.10: The NASA website specifies an hourly limit of 1,000 requests. After checking the NASA website, it seems like we can send at most 1,000 requests per hour. That should be more than enough for our purposes in this section. Accessing the NASA API The NASA API is what is known as an HTTP API: this is a particularly common kind of API, where you can obtain data simply by accessing a particular URL as if it were a regular website. To make a query to the NASA API, we need to specify three things. First, we specify the URL endpoint of the API, which is simply a URL that helps the remote server understand which API you are trying to access. NASA offers a variety of APIs, each with its own endpoint; in the case of the NASA “Astronomy Picture of the Day” API, the URL endpoint is https://api.nasa.gov/planetary/apod. Second, we write ?, which denotes that a list of query parameters will follow. And finally, we specify a list of query parameters of the form parameter=value, separated by & characters. The NASA “Astronomy Picture of the Day” API accepts the parameters shown in Figure 2.11. Figure 2.11: The set of parameters that you can specify when querying the NASA “Astronomy Picture of the Day” API, along with syntax, default settings, and a description of each. So for example, to obtain the image of the day from July 13, 2023, the API query would have two parameters: api_key=YOUR_API_KEY and date=2023-07-13. Remember to replace YOUR_API_KEY with the API key you received from NASA in your email! 
Putting it all together, the query will look like the following: https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13 If you try putting this URL into your web browser, you’ll actually find that the server responds to your request with some text: {"date":"2023-07-13","explanation":"A mere 390 light-years away, Sun-like stars and future planetary systems are forming in the Rho Ophiuchi molecular cloud complex, the closest star-forming region to our fair planet. The James Webb Space Telescope's NIRCam peered into the nearby natal chaos to capture this infrared image at an inspiring scale. The spectacular cosmic snapshot was released to celebrate the successful first year of Webb's exploration of the Universe. The frame spans less than a light-year across the Rho Ophiuchi region and contains about 50 young stars. Brighter stars clearly sport Webb's characteristic pattern of diffraction spikes. Huge jets of shocked molecular hydrogen blasting from newborn stars are red in the image, with the large, yellowish dusty cavity carved out by the energetic young star near its center. Near some stars in the stunning image are shadows cast by their protoplanetary disks.","hdurl":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph.png", "media_type":"image","service_version":"v1","title":"Webb's Rho Ophiuchi","url":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph1024.png"} Neat! There is definitely some data there, but it’s a bit hard to see what it all is. As it turns out, this is a common format for data called JSON (JavaScript Object Notation). We won’t encounter this kind of data much in this book, but for now you can interpret this data as key : value pairs separated by commas. For example, if you look closely, you’ll see that the first entry is \"date\":\"2023-07-13\", which indicates that we indeed successfully received data corresponding to July 13, 2023. So now our job is to do all of this programmatically in R. 
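As a small first step, the query URL itself can be assembled in R from the three pieces described above: the endpoint, the ? character, and the parameter=value pairs joined by &. This is only a sketch using the base R function paste0; the variable names endpoint, api_key, query_date, and query_url are chosen here for illustration, and YOUR_API_KEY remains a placeholder.

```r
# The URL endpoint of the "Astronomy Picture of the Day" API
endpoint <- "https://api.nasa.gov/planetary/apod"

# Query parameter values; replace the placeholder with your own key
api_key <- "YOUR_API_KEY"
query_date <- "2023-07-13"

# endpoint, then "?", then parameter=value pairs separated by "&"
query_url <- paste0(
  endpoint, "?",
  "api_key=", api_key, "&",
  "date=", query_date
)
query_url
```

Building the URL from named pieces like this makes it easy to change the date or the key without retyping the whole query.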
We will load the httr2 package, and construct the query using the request function, which takes a single URL argument; you will recognize the same query URL that we pasted into the browser earlier. We will then send the query using the req_perform function, and finally obtain a JSON representation of the response using the resp_body_json function. library(httr2) req <- request("https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13") resp <- req_perform(req) nasa_data_single <- resp_body_json(resp) nasa_data_single ## $date ## [1] "2023-07-13" ## ## $explanation ## [1] "A mere 390 light-years away, Sun-like stars and future planetary systems are forming in the Rho Ophiuchi molecular cloud complex, the closest star-forming region to our fair planet. The James Webb Space Telescope's NIRCam peered into the nearby natal chaos to capture this infrared image at an inspiring scale. The spectacular cosmic snapshot was released to celebrate the successful first year of Webb's exploration of the Universe. The frame spans less than a light-year across the Rho Ophiuchi region and contains about 50 young stars. Brighter stars clearly sport Webb's characteristic pattern of diffraction spikes. Huge jets of shocked molecular hydrogen blasting from newborn stars are red in the image, with the large, yellowish dusty cavity carved out by the energetic young star near its center. Near some stars in the stunning image are shadows cast by their protoplanetary disks." ## ## $hdurl ## [1] "https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph.png" ## ## $media_type ## [1] "image" ## ## $service_version ## [1] "v1" ## ## $title ## [1] "Webb's Rho Ophiuchi" ## ## $url ## [1] "https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph1024.png" We can obtain more records at once by using the start_date and end_date parameters, as shown in the table of parameters in Figure 2.11.
Let’s obtain all the records between May 1, 2023, and July 13, 2023, and store the result in an object called nasa_data; now the response will take the form of an R list (you’ll learn more about these in Chapter 3). Each item in the list will correspond to a single day’s record (just like the nasa_data_single object), and there will be 74 items total, one for each day between the start and end dates: req <- request("https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&start_date=2023-05-01&end_date=2023-07-13") resp <- req_perform(req) nasa_data <- resp_body_json(resp) length(nasa_data) ## [1] 74 For further data processing using the techniques in this book, you’ll need to turn this list of items into a data frame. Here we will extract the date, title, copyright, and url variables from the JSON data, and construct a data frame using the extracted information. Note: Understanding this code is not required for the remainder of the textbook. It is included for those readers who would like to parse JSON data into a data frame in their own data analyses. nasa_df_all <- tibble(bind_rows(lapply(nasa_data, as.data.frame.list))) nasa_df <- select(nasa_df_all, date, title, copyright, url) nasa_df ## # A tibble: 74 × 4 ## date title copyright url ## <chr> <chr> <chr> <chr> ## 1 2023-05-01 Carina Nebula North "\\nCarlos Tayl… http… ## 2 2023-05-02 Flat Rock Hills on Mars "\\nNASA, \\nJPL… http… ## 3 2023-05-03 Centaurus A: A Peculiar Island of Stars "\\nMarco Loren… http… ## 4 2023-05-04 The Galaxy, the Jet, and a Famous Black Hole <NA> http… ## 5 2023-05-05 Shackleton from ShadowCam <NA> http… ## 6 2023-05-06 Twilight in a Flower "Dario Giannob… http… ## 7 2023-05-07 The Helix Nebula from CFHT <NA> http… ## 8 2023-05-08 The Spanish Dancer Spiral Galaxy <NA> http… ## 9 2023-05-09 Shadows of Earth "\\nMarcella Gi… http… ## 10 2023-05-10 Milky Way over Egyptian Desert "\\nAmr Abdulwa… http… ## # ℹ 64 more rows Success—we have created a small data set using the NASA API!
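The <NA> entries in the copyright column of nasa_df appear when a record does not include a copyright field at all. The sketch below reproduces that behavior on two made-up records, so it runs without an API key. The records list and its contents are invented for illustration and are not real NASA data; as_tibble is used here in place of tibble simply to convert the combined data frame.

```r
library(dplyr)

# Two invented records shaped like items in the API response;
# the second one has no "copyright" field
records <- list(
  list(date = "2023-05-01", title = "Example A", copyright = "Some Person"),
  list(date = "2023-05-02", title = "Example B")
)

# Turn each record into a one-row data frame, then stack the rows;
# bind_rows fills in NA wherever a record is missing a column
records_df <- as_tibble(bind_rows(lapply(records, as.data.frame.list)))
records_df
```

In the resulting data frame, the second row has NA in the copyright column, which mirrors the missing values seen in nasa_df.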
This data is also quite different from what we obtained from web scraping; the extracted information is readily available in a JSON format, as opposed to raw HTML code (although not every API will provide data in such a nice format). From this point onward, the nasa_df data frame is stored on your machine, and you can play with it to your heart’s content. For example, you can use write_csv to save it to a file and read_csv to read it into R again later; and after reading the next few chapters you will have the skills to do even more interesting things! If you decide that you want to ask any of the various NASA APIs for more data (see the list of awesome NASA APIs here for more examples of what is possible), just be mindful as usual about how much data you are requesting and how frequently you are making requests. 2.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Reading in data locally and from the web” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 2.10 Additional resources The readr documentation provides the documentation for many of the reading functions we cover in this chapter. It is where you should look if you want to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data wrangling functions from this chapter.
Sometimes you might run into data in such poor shape that none of the reading functions we cover in this chapter work. In that case, you can consult the data import chapter from R for Data Science (Wickham and Grolemund 2016), which goes into a lot more detail about how R parses text from files into data frames. The here R package (Müller 2020) provides a way for you to construct or find your files’ paths. The readxl documentation provides more details on reading data from Excel, such as reading in data with multiple sheets, or specifying the cells to read in. The rio R package (Leeper 2021) provides an alternative set of tools for reading and writing data in R. It aims to be a “Swiss army knife” for data reading/writing/converting, and supports a wide variety of data types (including data formats generated by other statistical software like SPSS and SAS). A video from the Udacity course Linux Command Line Basics provides a good explanation of absolute versus relative paths. If you read the subsection on obtaining data from the web via scraping and APIs, we provide two companion tutorial video links for how to use the SelectorGadget tool to obtain desired CSS selectors for: extracting the data for apartment listings on Craigslist, and extracting Canadian city names and populations from Wikipedia. The polite R package (Perepolkin 2021) provides a set of tools for responsibly scraping data from websites. 
References "],["wrangling.html", "Chapter 3 Cleaning and wrangling data 3.1 Overview 3.2 Chapter learning objectives 3.3 Data frames, vectors, and lists 3.4 Tidy data 3.5 Using select to extract a range of columns 3.6 Using filter to extract rows 3.7 Using mutate to modify or add columns 3.8 Combining functions using the pipe operator, |> 3.9 Aggregating data with summarize and map 3.10 Apply functions across many columns with mutate and across 3.11 Apply functions across columns within one row with rowwise and mutate 3.12 Summary 3.13 Exercises 3.14 Additional resources", " Chapter 3 Cleaning and wrangling data 3.1 Overview This chapter is centered around defining tidy data—a data format that is suitable for analysis—and the tools needed to transform raw data into this format. This will be presented in the context of a real-world data science application, providing more practice working through a whole case study. 3.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Define the term “tidy data”. Discuss the advantages of storing data in a tidy data format. Define what vectors, lists, and data frames are in R, and describe how they relate to each other. Describe the common types of data in R and their uses. Use the following functions for their intended data wrangling tasks: c pivot_longer pivot_wider separate select filter mutate summarize map group_by across rowwise Use the following operators for their intended data wrangling tasks: ==, !=, <, <=, >, and >= %in% !, &, and | |> and %>% 3.3 Data frames, vectors, and lists In Chapters 1 and 2, data frames were the focus: we learned how to import data into R as a data frame, and perform basic operations on data frames in R. In the remainder of this book, this pattern continues. The vast majority of tools we use will require that data are represented as a data frame in R. 
Therefore, in this section, we will dig more deeply into what data frames are and how they are represented in R. This knowledge will be helpful in effectively utilizing these objects in our data analyses. 3.3.1 What is a data frame? A data frame is a table-like structure for storing data in R. Data frames are important to learn about because most data that you will encounter in practice can be naturally stored as a table. In order to define data frames precisely, we need to introduce a few technical terms: variable: a characteristic, number, or quantity that can be measured. observation: all of the measurements for a given entity. value: a single measurement of a single variable for a given entity. Given these definitions, a data frame is a tabular data structure in R that is designed to store observations, variables, and their values. Most commonly, each column in a data frame corresponds to a variable, and each row corresponds to an observation. For example, Figure 3.1 displays a data set of city populations. Here, the variables are “region, year, population”; each of these are properties that can be collected or measured. The first observation is “Toronto, 2016, 2235145”; these are the values that the three variables take for the first entity in the data set. There are 13 entities in the data set in total, corresponding to the 13 rows in Figure 3.1. Figure 3.1: A data frame storing data regarding the population of various regions in Canada. In this example data frame, the row that corresponds to the observation for the city of Vancouver is colored yellow, and the column that corresponds to the population variable is colored blue. R stores the columns of a data frame as either lists or vectors. For example, the data frame in Figure 3.2 has three vectors whose names are region, year and population. The next two sections will explain what lists and vectors are. Figure 3.2: Data frame with three vectors. 3.3.2 What is a vector? 
In R, vectors are objects that can contain one or more elements. The vector elements are ordered, and they must all be of the same data type; R has several different basic data types, as shown in Table 3.1. Figure 3.3 provides an example of a vector where all of the elements are of character type. You can create vectors in R using the c function (c stands for “concatenate”). For example, to create the vector region as shown in Figure 3.3, you would write: region <- c("Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa") region ## [1] "Toronto" "Montreal" "Vancouver" "Calgary" "Ottawa" Note: Technically, these objects are called “atomic vectors.” In this book we have chosen to call them “vectors,” which is how they are most commonly referred to in the R community. To be totally precise, “vector” is an umbrella term that encompasses both atomic vector and list objects in R. But this creates a confusing situation where the term “vector” could mean “atomic vector” or “the umbrella term for atomic vector and list,” depending on context. Very confusing indeed! So to keep things simple, in this book we always use the term “vector” to refer to “atomic vector.” We encourage readers who are enthusiastic to learn more to read the Vectors chapter of Advanced R (Wickham 2019). Figure 3.3: Example of a vector whose type is character. Table 3.1: Basic data types in R Data type Abbreviation Description Example character chr letters or numbers surrounded by quotes “1”, “Hello world!” double dbl numbers with decimal values 1.2333 integer int numbers that do not contain decimals 1L, 20L (where “L” tells R to store as an integer) logical lgl either true or false TRUE, FALSE factor fct used to represent data with a limited number of values (usually categories) a color variable with levels red, green and orange It is important in R to make sure you represent your data with the correct type.
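As a quick check of the types listed in Table 3.1, the following sketch builds one small vector of each type and asks R what it is. The vector names and values here are made up for illustration and are not part of the chapter's data.

```r
# Toy vectors, one per basic data type (illustrative values only).
chr_vec <- c("1", "Hello world!")
dbl_vec <- c(1, 20)          # numbers typed without "L" are stored as doubles
int_vec <- c(1L, 20L)        # the "L" suffix asks R for integers
lgl_vec <- c(TRUE, FALSE)
fct_vec <- factor(c("red", "green", "orange", "red"))

typeof(chr_vec)   # "character"
typeof(dbl_vec)   # "double"
typeof(int_vec)   # "integer"
typeof(lgl_vec)   # "logical"
class(fct_vec)    # "factor"
typeof(fct_vec)   # "integer": factors are stored internally as integers
levels(fct_vec)   # the distinct categories of the factor

# All elements of a vector share one type, so mixing types in c()
# silently converts everything to a common type (here, character).
mixed <- c(1, "a", TRUE)
typeof(mixed)     # "character"
```

Checking typeof (or class, for factors) before an analysis is a cheap way to catch numbers that were accidentally stored as text.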
Many of the tidyverse functions we use in this book treat the various data types differently. You should use integers and double types (which both fall under the “numeric” umbrella type) to represent numbers and perform arithmetic. Doubles are more common than integers in R, though; for instance, a double data type is the default when you create a vector of numbers using c(), and when you read in whole numbers via read_csv. Characters are used to represent data that should be thought of as “text”, such as words, names, paths, URLs, and more. Factors help us encode variables that represent categories; a factor variable takes one of a discrete set of values known as levels (one for each category). The levels can be ordered or unordered. Even though factors can sometimes look like characters, they are not used to represent text, words, names, and paths in the way that characters are; in fact, R internally stores factors using integers! There are other basic data types in R, such as raw and complex, but we do not use these in this textbook. 3.3.3 What is a list? Lists are also objects in R that have multiple, ordered elements. Vectors and lists differ by the requirement of element type consistency. All elements within a single vector must be of the same type (e.g., all elements are characters), whereas elements within a single list can be of different types (e.g., characters, integers, logicals, and even other lists). See Figure 3.4. Figure 3.4: A vector versus a list. 3.3.4 What does this have to do with data frames? A data frame is really a special kind of list that follows two rules: Each element itself must either be a vector or a list. Each element (vector or list) must have the same length. Not all columns in a data frame need to be of the same type. Figure 3.5 shows a data frame where the columns are vectors of different types. But remember: because the columns in this example are vectors, the elements must be the same data type within each column. 
On the other hand, if our data frame had list columns, there would be no such requirement. It is generally much more common to use vector columns, though, as the values for a single variable are usually all of the same type. Figure 3.5: Data frame and vector types. Data frames are actually included in R itself, without the need for any additional packages. However, the tidyverse functions that we use throughout this book all work with a special kind of data frame called a tibble. Tibbles have some additional features and benefits over built-in data frames in R. These include the ability to add useful attributes (such as grouping, which we will discuss later) and more predictable type preservation when subsetting. Because a tibble is just a data frame with some added features, we will collectively refer to both built-in R data frames and tibbles as data frames in this book. Note: You can use the function class on a data object to assess whether a data frame is a built-in R data frame or a tibble. If the data object is a data frame, class will return \"data.frame\". If the data object is a tibble it will return \"tbl_df\" \"tbl\" \"data.frame\". You can easily convert built-in R data frames to tibbles using the tidyverse as_tibble function. For example we can check the class of the Canadian languages data set, can_lang, we worked with in the previous chapters and we see it is a tibble. class(can_lang) ## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame" Vectors, data frames and lists are basic types of data structure in R, which are core to most data analyses. We summarize them in Table 3.2. There are several other data structures in the R programming language (e.g., matrices), but these are beyond the scope of this book. Table 3.2: Basic data structures in R Data Structure Description vector An ordered collection of one, or more, values of the same data type. list An ordered collection of one, or more, values of possibly different data types. 
data frame A list of either vectors or lists of the same length, with column names. We typically use a data frame to represent a data set. 3.4 Tidy data There are many ways a tabular data set can be organized. This chapter will focus on introducing the tidy data format of organization and how to make your raw (and likely messy) data tidy. A tidy data frame satisfies the following three criteria (Wickham 2014): each row is a single observation, each column is a single variable, and each value is a single cell (i.e., its entry in the data frame is not shared with another value). Figure 3.6 demonstrates a tidy data set that satisfies these three criteria. Figure 3.6: Tidy data satisfies three criteria. There are many good reasons for making sure your data are tidy as a first step in your analysis. The most important is that it is a single, consistent format that nearly every function in the tidyverse recognizes. No matter what the variables and observations in your data represent, as long as the data frame is tidy, you can manipulate it, plot it, and analyze it using the same tools. If your data is not tidy, you will have to write special bespoke code in your analysis that will not only be error-prone, but hard for others to understand. Beyond making your analysis more accessible to others and less error-prone, tidy data is also typically easy for humans to interpret. Given these benefits, it is well worth spending the time to get your data into a tidy format upfront. Fortunately, there are many well-designed tidyverse data cleaning/wrangling tools to help you easily tidy your data. Let’s explore them below! Note: Is there only one shape for tidy data for a given data set? Not necessarily! It depends on the statistical question you are asking and what the variables are for that question. For tidy data, each variable should be its own column. 
So, just as it’s essential to match your statistical question with the appropriate data analysis tool, it’s important to match your statistical question with the appropriate variables and ensure they are represented as individual columns to make the data tidy. 3.4.1 Tidying up: going from wide to long using pivot_longer One task that is commonly performed to get data into a tidy format is to combine values that are stored in separate columns, but are really part of the same variable, into one column. Data is often stored this way because humans create data sets, and this format is often easier for people to read and understand. In Figure 3.7, the table on the left is in an untidy, “wide” format because the year values (2006, 2011, 2016) are stored as column names. As a consequence, the population values for the various cities over these years are also split across several columns. For humans, this table is easy to read, which is why you will often find data stored in this wide format. However, this format is difficult to work with when performing data visualization or statistical analysis using R. For example, finding the latest year would be challenging because the year values are stored as column names instead of as values in a single column. Before we could apply a function to find the latest year (for example, max), we would first have to extract the column names as a vector and then apply the function to that vector. The problem only gets worse if you would like to find the population of a given region in the latest year. Both of these tasks are greatly simplified once the data is tidied. Another problem with data in this format is that we don’t know what the numbers under each year actually represent. Do those numbers represent population size? Land area? It’s not clear.
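To make the difficulty concrete, here is a minimal sketch that finds the latest year in a wide table and then in its tidy counterpart. The table pop_wide, its region names, and its population figures are hypothetical values invented for this illustration; the reshaping step uses pivot_longer, which the chapter introduces next.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: the years are stored as column names.
pop_wide <- tibble(
  region = c("City A", "City B"),
  `2006` = c(100, 200),
  `2011` = c(110, 210),
  `2016` = c(120, 220)
)

# Wide format: pull the year values out of the column names first.
year_names <- setdiff(names(pop_wide), "region")
latest_wide <- max(as.numeric(year_names))

# Tidy format: year and population are ordinary columns.
pop_long <- pivot_longer(pop_wide,
  cols = -region,
  names_to = "year",
  values_to = "population"
)
latest_long <- max(as.numeric(pop_long$year))

# Population of one region in the latest year, using the tidy table.
latest_pop <- pop_long$population[
  pop_long$region == "City A" & as.numeric(pop_long$year) == latest_long
]
```

In the tidy version the name of the population column also documents what the numbers mean, which the wide table leaves unstated.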
To solve both of these problems, we can reshape this data set to a tidy data format by creating a column called “year” and a column called “population.” This transformation—which makes the data “longer”—is shown as the right table in Figure 3.7. Figure 3.7: Pivoting data from a wide to long data format. We can achieve this effect in R using the pivot_longer function from the tidyverse package. The pivot_longer function combines columns, and is usually used during tidying data when we need to make the data frame longer and narrower. To learn how to use pivot_longer, we will work through an example with the region_lang_top5_cities_wide.csv data set. This data set contains the counts of how many Canadians cited each language as their mother tongue for five major Canadian cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton) from the 2016 Canadian census. To get started, we will load the tidyverse package and use read_csv to load the (untidy) data. library(tidyverse) lang_wide <- read_csv("data/region_lang_top5_cities_wide.csv") lang_wide ## # A tibble: 214 × 7 ## category language Toronto Montréal Vancouver Calgary Edmonton ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal languages Aborigi… 80 30 70 20 25 ## 2 Non-Official & Non-Abor… Afrikaa… 985 90 1435 960 575 ## 3 Non-Official & Non-Abor… Afro-As… 360 240 45 45 65 ## 4 Non-Official & Non-Abor… Akan (T… 8485 1015 400 705 885 ## 5 Non-Official & Non-Abor… Albanian 13260 2450 1090 1365 770 ## 6 Aboriginal languages Algonqu… 5 5 0 0 0 ## 7 Aboriginal languages Algonqu… 5 30 5 5 0 ## 8 Non-Official & Non-Abor… America… 470 50 265 100 180 ## 9 Non-Official & Non-Abor… Amharic 7460 665 1140 4075 2515 ## 10 Non-Official & Non-Abor… Arabic 85175 151955 14320 18965 17525 ## # ℹ 204 more rows What is wrong with the untidy format above? The table on the left in Figure 3.8 represents the data in the “wide” (messy) format. 
From a data analysis perspective, this format is not ideal because the values of the variable region (Toronto, Montréal, Vancouver, Calgary, and Edmonton) are stored as column names. Thus they are not easily accessible to the data analysis functions we will apply to our data set. Additionally, the mother tongue variable values are spread across multiple columns, which will prevent us from doing any desired visualization or statistical tasks until we combine them into one column. For instance, suppose we want to know the languages with the highest number of Canadians reporting it as their mother tongue among all five regions. This question would be tough to answer with the data in its current format. We could find the answer with the data in this format, though it would be much easier to answer if we tidy our data first. If mother tongue were instead stored as one column, as shown in the tidy data on the right in Figure 3.8, we could simply use the max function in one line of code to get the maximum value. Figure 3.8: Going from wide to long with the pivot_longer function. Figure 3.9 details the arguments that we need to specify in the pivot_longer function to accomplish this data transformation. Figure 3.9: Syntax for the pivot_longer function. We use pivot_longer to combine the Toronto, Montréal, Vancouver, Calgary, and Edmonton columns into a single column called region, and create a column called mother_tongue that contains the count of how many Canadians report each language as their mother tongue for each metropolitan area. 
We use a colon : between Toronto and Edmonton to tell R to select all the columns between Toronto and Edmonton: lang_mother_tidy <- pivot_longer(lang_wide, cols = Toronto:Edmonton, names_to = "region", values_to = "mother_tongue" ) lang_mother_tidy ## # A tibble: 1,070 × 4 ## category language region mother_tongue ## <chr> <chr> <chr> <dbl> ## 1 Aboriginal languages Aboriginal lang… Toron… 80 ## 2 Aboriginal languages Aboriginal lang… Montr… 30 ## 3 Aboriginal languages Aboriginal lang… Vanco… 70 ## 4 Aboriginal languages Aboriginal lang… Calga… 20 ## 5 Aboriginal languages Aboriginal lang… Edmon… 25 ## 6 Non-Official & Non-Aboriginal languages Afrikaans Toron… 985 ## 7 Non-Official & Non-Aboriginal languages Afrikaans Montr… 90 ## 8 Non-Official & Non-Aboriginal languages Afrikaans Vanco… 1435 ## 9 Non-Official & Non-Aboriginal languages Afrikaans Calga… 960 ## 10 Non-Official & Non-Aboriginal languages Afrikaans Edmon… 575 ## # ℹ 1,060 more rows Note: In the code above, the call to the pivot_longer function is split across several lines. This is allowed in certain cases; for example, when calling a function as above, as long as the line ends with a comma , R knows to keep reading on the next line. Splitting long lines like this across multiple lines is encouraged as it helps significantly with code readability. Generally speaking, you should limit each line of code to about 80 characters. The data above is now tidy because all three criteria for tidy data have now been met: All the variables (category, language, region and mother_tongue) are now their own columns in the data frame. Each observation, (i.e., each language in a region) is in a single row. Each value is a single cell, i.e., its row, column position in the data frame is not shared with another value. 3.4.2 Tidying up: going from long to wide using pivot_wider Suppose we have observations spread across multiple rows rather than in a single row. 
For example, in Figure 3.10, the table on the left is in an untidy, long format because the count column contains three variables (population, commuter count, and year the city was incorporated) and information about each observation (here, population, commuter, and incorporated values for a region) is split across three rows. Remember: one of the criteria for tidy data is that each observation must be in a single row. Using data in this format—where two or more variables are mixed together in a single column—makes it harder to apply many usual tidyverse functions. For example, finding the maximum number of commuters would require an additional step of filtering for the commuter values before the maximum can be computed. In comparison, if the data were tidy, all we would have to do is compute the maximum value for the commuter column. To reshape this untidy data set to a tidy (and in this case, wider) format, we need to create columns called “population”, “commuters”, and “incorporated.” This is illustrated in the right table of Figure 3.10. Figure 3.10: Going from long to wide data. To tidy this type of data in R, we can use the pivot_wider function. The pivot_wider function generally increases the number of columns (widens) and decreases the number of rows in a data set. To learn how to use pivot_wider, we will work through an example with the region_lang_top5_cities_long.csv data set. This data set contains the number of Canadians reporting the primary language at home and work for five major cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton). lang_long <- read_csv("data/region_lang_top5_cities_long.csv") lang_long ## # A tibble: 2,140 × 5 ## region category language type count ## <chr> <chr> <chr> <chr> <dbl> ## 1 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_home 15 ## 2 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 3 Toronto Aboriginal languages Aboriginal languages, n.o.s. 
most_at_home 50 ## 4 Toronto Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 5 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_home 5 ## 6 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 7 Edmonton Aboriginal languages Aboriginal languages, n.o.s. most_at_home 10 ## 8 Edmonton Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 9 Vancouver Aboriginal languages Aboriginal languages, n.o.s. most_at_home 15 ## 10 Vancouver Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## # ℹ 2,130 more rows What makes the data set shown above untidy? In this example, each observation is a language in a region. However, each observation is split across multiple rows: one where the count for most_at_home is recorded, and the other where the count for most_at_work is recorded. Suppose the goal with this data was to visualize the relationship between the number of Canadians reporting their primary language at home and work. Doing that would be difficult with this data in its current form, since these two variables are stored in the same column. Figure 3.11 shows how this data will be tidied using the pivot_wider function. Figure 3.11: Going from long to wide with the pivot_wider function. Figure 3.12 details the arguments that we need to specify in the pivot_wider function. Figure 3.12: Syntax for the pivot_wider function. We will apply the function as detailed in Figure 3.12. 
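Before looking at the full data set, the reshaping can be sketched on a tiny example. The data frame toy_long and its counts are made up for illustration (they are not taken from the census file); only the column names mirror the ones used in this section.

```r
library(tibble)
library(tidyr)

# Hypothetical long table: each region's observation is split over two rows.
toy_long <- tibble(
  region = c("City A", "City A", "City B", "City B"),
  type   = c("most_at_home", "most_at_work", "most_at_home", "most_at_work"),
  count  = c(50, 0, 15, 5)
)

# Each distinct value of `type` becomes its own column,
# filled with the matching values from `count`.
toy_wide <- pivot_wider(toy_long,
  names_from = type,
  values_from = count
)
toy_wide

# With one row per observation, column-wise summaries need no filtering step.
max(toy_wide$most_at_home)
```

The four input rows collapse to two, one per region, which is the "each observation in a single row" criterion in action.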
lang_home_tidy <- pivot_wider(lang_long, names_from = type, values_from = count ) lang_home_tidy ## # A tibble: 1,070 × 5 ## region category language most_at_home most_at_work ## <chr> <chr> <chr> <dbl> <dbl> ## 1 Montréal Aboriginal languages Aborigi… 15 0 ## 2 Toronto Aboriginal languages Aborigi… 50 0 ## 3 Calgary Aboriginal languages Aborigi… 5 0 ## 4 Edmonton Aboriginal languages Aborigi… 10 0 ## 5 Vancouver Aboriginal languages Aborigi… 15 0 ## 6 Montréal Non-Official & Non-Aboriginal l… Afrikaa… 10 0 ## 7 Toronto Non-Official & Non-Aboriginal l… Afrikaa… 265 0 ## 8 Calgary Non-Official & Non-Aboriginal l… Afrikaa… 505 15 ## 9 Edmonton Non-Official & Non-Aboriginal l… Afrikaa… 300 0 ## 10 Vancouver Non-Official & Non-Aboriginal l… Afrikaa… 520 10 ## # ℹ 1,060 more rows The data above is now tidy! We can go through the three criteria again to check that this data is a tidy data set. All the statistical variables are their own columns in the data frame (i.e., most_at_home, and most_at_work have been separated into their own columns in the data frame). Each observation, (i.e., each language in a region) is in a single row. Each value is a single cell (i.e., its row, column position in the data frame is not shared with another value). You might notice that we have the same number of columns in the tidy data set as we did in the messy one. Therefore pivot_wider didn’t really “widen” the data, as the name suggests. This is just because the original type column only had two categories in it. If it had more than two, pivot_wider would have created more columns, and we would see the data set “widen.” 3.4.3 Tidying up: using separate to deal with multiple delimiters Data are also not considered tidy when multiple values are stored in the same cell. 
The data set we show below is even messier than the ones we dealt with above: the Toronto, Montréal, Vancouver, Calgary, and Edmonton columns contain the number of Canadians reporting their primary language at home and work in one column separated by the delimiter (/). The column names are the values of a variable, and each value does not have its own cell! To turn this messy data into tidy data, we’ll have to fix these issues. lang_messy <- read_csv("data/region_lang_top5_cities_messy.csv") lang_messy ## # A tibble: 214 × 7 ## category language Toronto Montréal Vancouver Calgary Edmonton ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Aboriginal languages Aborigi… 50/0 15/0 15/0 5/0 10/0 ## 2 Non-Official & Non-Abor… Afrikaa… 265/0 10/0 520/10 505/15 300/0 ## 3 Non-Official & Non-Abor… Afro-As… 185/10 65/0 10/0 15/0 20/0 ## 4 Non-Official & Non-Abor… Akan (T… 4045/20 440/0 125/10 330/0 445/0 ## 5 Non-Official & Non-Abor… Albanian 6380/2… 1445/20 530/10 620/25 370/10 ## 6 Aboriginal languages Algonqu… 5/0 0/0 0/0 0/0 0/0 ## 7 Aboriginal languages Algonqu… 0/0 10/0 0/0 0/0 0/0 ## 8 Non-Official & Non-Abor… America… 720/245 70/0 300/140 85/25 190/85 ## 9 Non-Official & Non-Abor… Amharic 3820/55 315/0 540/10 2730/50 1695/35 ## 10 Non-Official & Non-Abor… Arabic 45025/… 72980/1… 8680/275 11010/… 10590/3… ## # ℹ 204 more rows First we’ll use pivot_longer to create two columns, region and value, similar to what we did previously. The new region columns will contain the region names, and the new column value will be a temporary holding place for the data that we need to further separate, i.e., the number of Canadians reporting their primary language at home and work. 
lang_messy_longer <- pivot_longer(lang_messy, cols = Toronto:Edmonton, names_to = "region", values_to = "value" ) lang_messy_longer ## # A tibble: 1,070 × 4 ## category language region value ## <chr> <chr> <chr> <chr> ## 1 Aboriginal languages Aboriginal languages, n… Toron… 50/0 ## 2 Aboriginal languages Aboriginal languages, n… Montr… 15/0 ## 3 Aboriginal languages Aboriginal languages, n… Vanco… 15/0 ## 4 Aboriginal languages Aboriginal languages, n… Calga… 5/0 ## 5 Aboriginal languages Aboriginal languages, n… Edmon… 10/0 ## 6 Non-Official & Non-Aboriginal languages Afrikaans Toron… 265/0 ## 7 Non-Official & Non-Aboriginal languages Afrikaans Montr… 10/0 ## 8 Non-Official & Non-Aboriginal languages Afrikaans Vanco… 520/… ## 9 Non-Official & Non-Aboriginal languages Afrikaans Calga… 505/… ## 10 Non-Official & Non-Aboriginal languages Afrikaans Edmon… 300/0 ## # ℹ 1,060 more rows Next we’ll use separate to split the value column into two columns. One column will contain only the counts of Canadians that speak each language most at home, and the other will contain the counts of Canadians that speak each language most at work for each region. Figure 3.13 outlines what we need to specify to use separate. Figure 3.13: Syntax for the separate function. 
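As a small, self-contained sketch of what separate does, the example below splits a made-up value column on the / delimiter. The data frame toy and its contents are hypothetical; the second call also sets the convert argument of separate, which asks it to guess a more suitable type for the new columns.

```r
library(tibble)
library(tidyr)

# Hypothetical data: two counts packed into one cell, separated by "/".
toy <- tibble(
  region = c("City A", "City B"),
  value  = c("50/0", "520/10")
)

# By default the new columns keep the character type of the original column.
as_chr <- separate(toy,
  col = value,
  into = c("most_at_home", "most_at_work"),
  sep = "/"
)
typeof(as_chr$most_at_home)

# With convert = TRUE, separate converts the new columns where it can.
as_int <- separate(toy,
  col = value,
  into = c("most_at_home", "most_at_work"),
  sep = "/",
  convert = TRUE
)
typeof(as_int$most_at_home)
```

Comparing the two typeof results shows why the type of a freshly separated column is worth checking before doing arithmetic on it.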
tidy_lang <- separate(lang_messy_longer, col = value, into = c("most_at_home", "most_at_work"), sep = "/" ) tidy_lang ## # A tibble: 1,070 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <chr> <chr> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,060 more rows Is this data set now tidy? Recall the three criteria for tidy data: each row is a single observation, each column is a single variable, and each value is a single cell. We can see that this data now satisfies all three criteria, making it easier to analyze. But we aren’t done yet! Notice in the table above that <chr> appears beneath each of the column names. The abbreviation under the column name indicates the data type of each column. Here all of the variables are “character” data types. Recall that character data types are letter(s) or digit(s) surrounded by quotes. In the previous example in Section 3.4.2, the most_at_home and most_at_work variables were <dbl> (double), which is a type of numeric data; you can verify this by looking at the tables in the previous sections. This change is due to the delimiter (/) present in these columns when we read in this messy data set: R read the columns in as character types, and by default, separate returns columns as character data types. It makes sense for region, category, and language to be stored as a character (or perhaps factor) type.
However, suppose we want to apply any functions that treat the most_at_home and most_at_work columns as a number (e.g., finding rows above a numeric threshold of a column). In that case, it won’t be possible to do if the variable is stored as a character. Fortunately, the separate function provides a natural way to fix problems like this: we can set convert = TRUE to convert the most_at_home and most_at_work columns to the correct data type. tidy_lang <- separate(lang_messy_longer, col = value, into = c("most_at_home", "most_at_work"), sep = "/", convert = TRUE ) tidy_lang ## # A tibble: 1,070 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,060 more rows Now we see <int> appears under the most_at_home and most_at_work columns, indicating they are integer data types (i.e., numbers)! 3.5 Using select to extract a range of columns Now that the tidy_lang data is indeed tidy, we can start manipulating it using the powerful suite of functions from the tidyverse. For the first example, recall the select function from Chapter 1, which lets us create a subset of columns from a data frame. Suppose we wanted to select only the columns language, region, most_at_home and most_at_work from the tidy_lang data set. 
Using what we learned in Chapter 1, we would pass the tidy_lang data frame as well as all of these column names into the select function: selected_columns <- select(tidy_lang, language, region, most_at_home, most_at_work) selected_columns ## # A tibble: 1,070 × 4 ## language region most_at_home most_at_work ## <chr> <chr> <int> <int> ## 1 Aboriginal languages, n.o.s. Toronto 50 0 ## 2 Aboriginal languages, n.o.s. Montréal 15 0 ## 3 Aboriginal languages, n.o.s. Vancouver 15 0 ## 4 Aboriginal languages, n.o.s. Calgary 5 0 ## 5 Aboriginal languages, n.o.s. Edmonton 10 0 ## 6 Afrikaans Toronto 265 0 ## 7 Afrikaans Montréal 10 0 ## 8 Afrikaans Vancouver 520 10 ## 9 Afrikaans Calgary 505 15 ## 10 Afrikaans Edmonton 300 0 ## # ℹ 1,060 more rows Here we wrote out the names of each of the columns. However, this method is time-consuming, especially if you have a lot of columns! Another approach is to use a “select helper”. Select helpers are operators that make it easier for us to select columns. For instance, we can use a select helper to choose a range of columns rather than typing each column name out. To do this, we use the colon (:) operator to denote the range. For example, to get all the columns in the tidy_lang data frame from language to most_at_work we pass language:most_at_work as the second argument to the select function. column_range <- select(tidy_lang, language:most_at_work) column_range ## # A tibble: 1,070 × 4 ## language region most_at_home most_at_work ## <chr> <chr> <int> <int> ## 1 Aboriginal languages, n.o.s. Toronto 50 0 ## 2 Aboriginal languages, n.o.s. Montréal 15 0 ## 3 Aboriginal languages, n.o.s. Vancouver 15 0 ## 4 Aboriginal languages, n.o.s. Calgary 5 0 ## 5 Aboriginal languages, n.o.s. 
Edmonton 10 0 ## 6 Afrikaans Toronto 265 0 ## 7 Afrikaans Montréal 10 0 ## 8 Afrikaans Vancouver 520 10 ## 9 Afrikaans Calgary 505 15 ## 10 Afrikaans Edmonton 300 0 ## # ℹ 1,060 more rows Notice that we get the same output as we did above, but with less (and clearer!) code. This type of operator is especially handy for large data sets. Suppose instead we wanted to extract columns that followed a particular pattern rather than just selecting a range. For example, let’s say we wanted only to select the columns most_at_home and most_at_work. There are other helpers that allow us to select variables based on their names. In particular, we can use the select helper starts_with to choose only the columns that start with the word “most”: select(tidy_lang, starts_with("most")) ## # A tibble: 1,070 × 2 ## most_at_home most_at_work ## <int> <int> ## 1 50 0 ## 2 15 0 ## 3 15 0 ## 4 5 0 ## 5 10 0 ## 6 265 0 ## 7 10 0 ## 8 520 10 ## 9 505 15 ## 10 300 0 ## # ℹ 1,060 more rows We could also have chosen the columns containing an underscore _ by adding contains(\"_\") as the second argument in the select function, since we notice the columns we want contain underscores and the others don’t. select(tidy_lang, contains("_")) ## # A tibble: 1,070 × 2 ## most_at_home most_at_work ## <int> <int> ## 1 50 0 ## 2 15 0 ## 3 15 0 ## 4 5 0 ## 5 10 0 ## 6 265 0 ## 7 10 0 ## 8 520 10 ## 9 505 15 ## 10 300 0 ## # ℹ 1,060 more rows There are many different select helpers that select variables based on certain criteria. The additional resources section at the end of this chapter provides a comprehensive resource on select helpers. 3.6 Using filter to extract rows Next, we revisit the filter function from Chapter 1, which lets us create a subset of rows from a data frame. Recall the two main arguments to the filter function: the first is the name of the data frame object, and the second is a logical statement to use when filtering the rows. 
filter works by returning the rows where the logical statement evaluates to TRUE. This section will highlight more advanced usage of the filter function. In particular, this section provides an in-depth treatment of the variety of logical statements one can use in the filter function to select subsets of rows. 3.6.1 Extracting rows that have a certain value with == Suppose we are only interested in the subset of rows in tidy_lang corresponding to the official languages of Canada (English and French). We can filter for these rows by using the equivalency operator (==) to compare the values of the category column with the value "Official languages". With these arguments, filter returns a data frame with all the columns of the input data frame but only the rows we asked for in the logical statement, i.e., those where the category column holds the value "Official languages". We name this data frame official_langs. official_langs <- filter(tidy_lang, category == "Official languages") official_langs ## # A tibble: 10 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 ## 6 Official languages French Toronto 29800 11940 ## 7 Official languages French Montréal 2669195 1607550 ## 8 Official languages French Vancouver 8630 3245 ## 9 Official languages French Calgary 8630 2140 ## 10 Official languages French Edmonton 10950 2520 3.6.2 Extracting rows that do not have a certain value with != What if we want all the other language categories in the data set except for those in the "Official languages" category? We can accomplish this with the != operator, which means “not equal to”.
So if we want to find all the rows where the category does not equal "Official languages" we write the code below. filter(tidy_lang, category != "Official languages") ## # A tibble: 1,060 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,050 more rows 3.6.3 Extracting rows satisfying multiple conditions using , or & Suppose now we want to look at only the rows for the French language in Montréal. To do this, we need to filter the data set to find rows that satisfy multiple conditions simultaneously. We can do this with the comma symbol (,), which in the case of filter is interpreted by R as “and”. We write the code as shown below to filter the official_langs data frame to subset the rows where region == "Montréal" and language == "French". filter(official_langs, region == "Montréal", language == "French") ## # A tibble: 1 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages French Montréal 2669195 1607550 We can also use the ampersand (&) logical operator, which gives us cases where both one condition and another condition are satisfied. You can use either comma (,) or ampersand (&) in the filter function interchangeably.
filter(official_langs, region == "Montréal" & language == "French") ## # A tibble: 1 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages French Montréal 2669195 1607550 3.6.4 Extracting rows satisfying at least one condition using | Suppose we were interested in only those rows corresponding to cities in Alberta in the official_langs data set (Edmonton and Calgary). We can’t use , as we did above because region cannot be both Edmonton and Calgary simultaneously. Instead, we can use the vertical pipe (|) logical operator, which gives us the cases where one condition or another condition or both are satisfied. In the code below, we ask R to return the rows where the value in the region column is equal to “Calgary” or “Edmonton”. filter(official_langs, region == "Calgary" | region == "Edmonton") ## # A tibble: 4 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Calgary 1065070 844740 ## 2 Official languages English Edmonton 1050410 792700 ## 3 Official languages French Calgary 8630 2140 ## 4 Official languages French Edmonton 10950 2520 3.6.5 Extracting rows with values in a vector using %in% Next, suppose we want to see the populations of our five cities. Let’s read in the region_data.csv file that comes from the 2016 Canadian census, as it contains statistics for number of households, land area, population and number of dwellings for different regions. region_data <- read_csv("data/region_data.csv") region_data ## # A tibble: 35 × 5 ## region households area population dwellings ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Belleville 43002 1355. 103472 45050 ## 2 Lethbridge 45696 3047. 117394 48317 ## 3 Thunder Bay 52545 2618. 121621 57146 ## 4 Peterborough 50533 1637. 121721 55662 ## 5 Saint John 52872 3793. 126202 58398 ## 6 Brantford 52530 1086. 134203 54419 ## 7 Moncton 61769 2625. 144810 66699 ## 8 Guelph 59280 604.
151984 63324 ## 9 Trois-Rivières 72502 1053. 156042 77734 ## 10 Saguenay 72479 3079. 160980 77968 ## # ℹ 25 more rows To get the population of the five cities we can filter the data set using the %in% operator. The %in% operator is used to see if an element belongs to a vector. Here we are filtering for rows where the value in the region column matches any of the five cities we are interested in: Toronto, Montréal, Vancouver, Calgary, and Edmonton. city_names <- c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton") five_cities <- filter(region_data, region %in% city_names) five_cities ## # A tibble: 5 × 5 ## region households area population dwellings ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Edmonton 502143 9858. 1321426 537634 ## 2 Calgary 519693 5242. 1392609 544870 ## 3 Vancouver 960894 3040. 2463431 1027613 ## 4 Montréal 1727310 4638. 4098927 1823281 ## 5 Toronto 2135909 6270. 5928040 2235145 Note: What’s the difference between == and %in%? Suppose we have two vectors, vectorA and vectorB. If you type vectorA == vectorB into R it will compare the vectors element by element. R checks if the first element of vectorA equals the first element of vectorB, the second element of vectorA equals the second element of vectorB, and so on. On the other hand, vectorA %in% vectorB compares the first element of vectorA to all the elements in vectorB. Then the second element of vectorA is compared to all the elements in vectorB, and so on. Notice the difference between == and %in% in the example below. c("Vancouver", "Toronto") == c("Toronto", "Vancouver") ## [1] FALSE FALSE c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver") ## [1] TRUE TRUE 3.6.6 Extracting rows above or below a threshold using > and < We saw in Section 3.6.3 that 2,669,195 people reported speaking French in Montréal as their primary language at home.
If we are interested in finding the official language and region pairs in which more people reported that language as their primary language at home than reported French in Montréal, then we can use filter to obtain rows where the value of most_at_home is greater than 2,669,195. We use the > symbol to look for values above a threshold, and the < symbol to look for values below a threshold. The >= and <= symbols similarly look for values equal to or above a threshold and equal to or below a threshold. filter(official_langs, most_at_home > 2669195) ## # A tibble: 1 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Toronto 3836770 3218725 filter returns a data frame with only one row, indicating that when considering the official languages, only English in Toronto is reported by more people as their primary language at home than French in Montréal according to the 2016 Canadian census. 3.7 Using mutate to modify or add columns 3.7.1 Using mutate to modify columns In Section 3.4.3, when we first read in the "region_lang_top5_cities_messy.csv" data, all of the variables were “character” data types. During the tidying process, we used the convert argument from the separate function to convert the most_at_home and most_at_work columns to the desired integer (i.e., numeric class) data types. But suppose we didn’t use the convert argument, and needed to modify the column type some other way. Below we create such a situation so that we can demonstrate how to use mutate to change the column types of a data frame. mutate is a useful function to modify or create new data frame columns.
lang_messy <- read_csv("data/region_lang_top5_cities_messy.csv") lang_messy_longer <- pivot_longer(lang_messy, cols = Toronto:Edmonton, names_to = "region", values_to = "value") tidy_lang_chr <- separate(lang_messy_longer, col = value, into = c("most_at_home", "most_at_work"), sep = "/") official_langs_chr <- filter(tidy_lang_chr, category == "Official languages") official_langs_chr ## # A tibble: 10 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <chr> <chr> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 ## 6 Official languages French Toronto 29800 11940 ## 7 Official languages French Montréal 2669195 1607550 ## 8 Official languages French Vancouver 8630 3245 ## 9 Official languages French Calgary 8630 2140 ## 10 Official languages French Edmonton 10950 2520 To use mutate, again we first specify the data set in the first argument, and in the following arguments, we specify the name of the column we want to modify or create (here most_at_home and most_at_work), an = sign, and then the function we want to apply (here as.numeric). In the function we want to apply, we refer directly to the column name upon which we want it to act (here most_at_home and most_at_work). In our example, we are naming the columns the same names as columns that already exist in the data frame (“most_at_home”, “most_at_work”) and this will cause mutate to overwrite those columns (also referred to as modifying those columns in-place). If we were to give the columns a new name, then mutate would create new columns with the names we specified. mutate’s general syntax is detailed in Figure 3.14. Figure 3.14: Syntax for the mutate function. 
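The two behaviours described above (overwriting an existing column versus creating a new one) can be sketched on a small, made-up data frame. The names `toy`, `count`, and `count_doubled` are hypothetical and chosen only for illustration; they are not part of the language data used in this chapter.

```r
library(dplyr)

# A small, made-up data frame; `count` is stored as text on purpose.
toy <- tibble(city = c("A", "B"), count = c("10", "25"))

# Reusing an existing column name overwrites that column in the result.
toy_overwritten <- mutate(toy, count = as.numeric(count))

# Using a new name adds a column and leaves the original column untouched.
toy_added <- mutate(toy, count_doubled = as.numeric(count) * 2)

toy_overwritten
toy_added
```

In both calls the original `toy` object is unchanged; mutate returns a new data frame, which is why the result is assigned to a name.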
Below we use mutate to convert the columns most_at_home and most_at_work of the official_langs_chr data frame to numeric data types, following the syntax described in Figure 3.14: official_langs_numeric <- mutate(official_langs_chr, most_at_home = as.numeric(most_at_home), most_at_work = as.numeric(most_at_work) ) official_langs_numeric ## # A tibble: 10 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <dbl> <dbl> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 ## 6 Official languages French Toronto 29800 11940 ## 7 Official languages French Montréal 2669195 1607550 ## 8 Official languages French Vancouver 8630 3245 ## 9 Official languages French Calgary 8630 2140 ## 10 Official languages French Edmonton 10950 2520 Now we see <dbl> appears under the most_at_home and most_at_work columns, indicating that they are of the double data type (which is a numeric data type)! 3.7.2 Using mutate to create new columns We can see in the table that 3,836,770 people reported speaking English in Toronto as their primary language at home, according to the 2016 Canadian census. What does this number mean to us? To understand this number, we need context. In particular, how many people were in Toronto when this data was collected? From the 2016 Canadian census profile, the population of Toronto was reported to be 5,928,040 people. The number of people who report that English is their primary language at home is much more meaningful when we report it in this context. We can even go a step further and transform this count to a relative frequency or proportion. We can do this by dividing the number of people reporting a given language as their primary language at home by the number of people who live in Toronto.
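That division can be sketched directly in R using the two figures quoted above. The variable names here are hypothetical and used only for this illustration.

```r
# Counts quoted in the text (2016 Canadian census).
english_at_home_toronto <- 3836770
toronto_population <- 5928040

# Relative frequency: count divided by the total population.
proportion <- english_at_home_toronto / toronto_population

# Rounded to two decimal places for reporting.
round(proportion, 2)
```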
For example, the proportion of people who reported that their primary language at home was English in the 2016 Canadian census was 0.65 in Toronto. Let’s use mutate to create a new column in our data frame that holds the proportion of people who speak English for our five cities of focus in this chapter. To accomplish this, we will need to do two tasks beforehand: (1) create a vector containing the population values for the cities, and (2) filter the official_langs data frame so that we only keep the rows where the language is English. To create a vector containing the population values for the five cities (Toronto, Montréal, Vancouver, Calgary, Edmonton), we will use the c function (recall that c stands for “concatenate”): city_pops <- c(5928040, 4098927, 2463431, 1392609, 1321426) city_pops ## [1] 5928040 4098927 2463431 1392609 1321426 And next, we will filter the official_langs data frame so that we only keep the rows where the language is English. We will name the new data frame we get from this english_langs: english_langs <- filter(official_langs, language == "English") english_langs ## # A tibble: 5 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 Finally, we can use mutate to create a new column, named most_at_home_proportion, whose values are the proportions of people reporting English as their primary language at home. We will compute this by dividing the most_at_home column by our vector of city populations.
english_langs <- mutate(english_langs, most_at_home_proportion = most_at_home / city_pops) english_langs ## # A tibble: 5 × 6 ## category language region most_at_home most_at_work most_at_home_proport…¹ ## <chr> <chr> <chr> <int> <int> <dbl> ## 1 Official lan… English Toron… 3836770 3218725 0.647 ## 2 Official lan… English Montr… 620510 412120 0.151 ## 3 Official lan… English Vanco… 1622735 1330555 0.659 ## 4 Official lan… English Calga… 1065070 844740 0.765 ## 5 Official lan… English Edmon… 1050410 792700 0.795 ## # ℹ abbreviated name: ¹most_at_home_proportion In the computation above, we had to ensure that we ordered the city_pops vector in the same order as the cities were listed in the english_langs data frame. This is because R will perform the division computation we did by dividing each element of the most_at_home column by the corresponding element of the city_pops vector, matching them up by position. Failing to do this would have resulted in the incorrect math being performed. Note: In more advanced data wrangling, one might solve this problem in a less error-prone way by using a technique called “joins.” We link to resources that discuss this in the additional resources at the end of this chapter. 3.8 Combining functions using the pipe operator, |> In R, we often have to call multiple functions in a sequence to process a data frame. The basic ways of doing this can quickly become unreadable if there are many steps. For example, suppose we need to perform three operations on a data frame called data: add a new column new_col that is double another column, old_col, filter for rows where another column, other_col, is more than 5, and select only the new column new_col for those rows.
One way of performing these three steps is to just write multiple lines of code, storing temporary objects as you go: output_1 <- mutate(data, new_col = old_col * 2) output_2 <- filter(output_1, other_col > 5) output <- select(output_2, new_col) This is difficult to understand for multiple reasons. The reader may be tricked into thinking the named output_1 and output_2 objects are important for some reason, while they are just temporary intermediate computations. Further, the reader has to look through and find where output_1 and output_2 are used in each subsequent line. Another option for doing this would be to compose the functions: output <- select(filter(mutate(data, new_col = old_col * 2), other_col > 5), new_col) Code like this can also be difficult to understand. Functions compose (reading from left to right) in the opposite order in which they are computed by R (above, mutate happens first, then filter, then select). It is also just a really long line of code to read in one go. The pipe operator (|>) solves this problem, resulting in cleaner and easier-to-follow code. |> is built into R so you don’t need to load any packages to use it. You can think of the pipe as a physical pipe. It takes the output from the function on the left-hand side of the pipe, and passes it as the first argument to the function on the right-hand side of the pipe. The code below accomplishes the same thing as the previous two code blocks: output <- data |> mutate(new_col = old_col * 2) |> filter(other_col > 5) |> select(new_col) Note: You might also have noticed that we split the function calls across lines after the pipe, similar to when we did this earlier in the chapter for long function calls. Again, this is allowed and recommended, especially when the piped function calls create a long line of code. Doing this makes your code more readable. When you do this, it is important to end each line with the pipe operator |> to tell R that your code is continuing onto the next line. 
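To make the "first argument" rule concrete, here is a minimal sketch that uses only base R functions, so no packages are needed (the base pipe itself requires R 4.1 or later). The object names `values`, `nested`, and `piped` are hypothetical.

```r
values <- c(1, 4, 9)

# Nested form: has to be read from the inside out.
nested <- round(sum(sqrt(values)), digits = 1)

# Piped form: read top to bottom; each result becomes the
# first argument of the next function call.
piped <- values |>
  sqrt() |>
  sum() |>
  round(digits = 1)

# Both forms compute the same value.
identical(nested, piped)
```

Note that each line of the piped form ends with |>, which is what tells R the expression continues on the next line.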
Note: In this textbook, we will be using the base R pipe operator syntax, |>. This base R |> pipe operator was inspired by a previous version of the pipe operator, %>%. The %>% pipe operator is not built into R and is from the magrittr R package. The tidyverse metapackage imports the %>% pipe operator via dplyr (which in turn imports the magrittr R package). There are some other differences between %>% and |> related to more advanced R uses, such as sharing and distributing code as R packages, however, these are beyond the scope of this textbook. We have this note in the book to make the reader aware that %>% exists as it is still commonly used in data analysis code and in many data science books and other resources. In most cases these two pipes are interchangeable and either can be used. 3.8.1 Using |> to combine filter and select Let’s work with the tidy tidy_lang data set from Section 3.4.3, which contains the number of Canadians reporting their primary language at home and work for five major cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton): tidy_lang ## # A tibble: 1,070 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,060 more rows Suppose we want to create a subset of the data with only the languages and counts of each language spoken most at home for the city of Vancouver. To do this, we can use the functions filter and select. 
First, we use filter to create a data frame called van_data that contains only values for Vancouver. van_data <- filter(tidy_lang, region == "Vancouver") van_data ## # A tibble: 214 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Vanco… 15 0 ## 2 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 3 Non-Official & Non-Aboriginal lang… Afro-As… Vanco… 10 0 ## 4 Non-Official & Non-Aboriginal lang… Akan (T… Vanco… 125 10 ## 5 Non-Official & Non-Aboriginal lang… Albanian Vanco… 530 10 ## 6 Aboriginal languages Algonqu… Vanco… 0 0 ## 7 Aboriginal languages Algonqu… Vanco… 0 0 ## 8 Non-Official & Non-Aboriginal lang… America… Vanco… 300 140 ## 9 Non-Official & Non-Aboriginal lang… Amharic Vanco… 540 10 ## 10 Non-Official & Non-Aboriginal lang… Arabic Vanco… 8680 275 ## # ℹ 204 more rows We then use select on this data frame to keep only the variables we want: van_data_selected <- select(van_data, language, most_at_home) van_data_selected ## # A tibble: 214 × 2 ## language most_at_home ## <chr> <int> ## 1 Aboriginal languages, n.o.s. 15 ## 2 Afrikaans 520 ## 3 Afro-Asiatic languages, n.i.e. 10 ## 4 Akan (Twi) 125 ## 5 Albanian 530 ## 6 Algonquian languages, n.i.e. 0 ## 7 Algonquin 0 ## 8 American Sign Language 300 ## 9 Amharic 540 ## 10 Arabic 8680 ## # ℹ 204 more rows Although this is valid code, there is a more readable approach we could take by using the pipe, |>. With the pipe, we do not need to create an intermediate object to store the output from filter. Instead, we can directly send the output of filter to the input of select: van_data_selected <- filter(tidy_lang, region == "Vancouver") |> select(language, most_at_home) van_data_selected ## # A tibble: 214 × 2 ## language most_at_home ## <chr> <int> ## 1 Aboriginal languages, n.o.s. 15 ## 2 Afrikaans 520 ## 3 Afro-Asiatic languages, n.i.e. 10 ## 4 Akan (Twi) 125 ## 5 Albanian 530 ## 6 Algonquian languages, n.i.e. 
0 ## 7 Algonquin 0 ## 8 American Sign Language 300 ## 9 Amharic 540 ## 10 Arabic 8680 ## # ℹ 204 more rows But wait… Why do the select function calls look different in these two examples? Remember: when you use the pipe, the output of the first function is automatically provided as the first argument for the function that comes after it. Therefore you do not specify the first argument in that function call. In the code above, the pipe passes the left-hand side (the output of filter) to the first argument of the function on the right (select), so in the select function you only see the second argument (and beyond). As you can see, both of these approaches, with and without pipes, give us the same output, but the second approach is clearer and more readable. 3.8.2 Using |> with more than two functions The pipe operator (|>) can be used with any function in R. Additionally, we can pipe together more than two functions. For example, we can pipe together three functions to: filter rows to include only those where the counts of the language most spoken at home are greater than 10,000, select only the columns corresponding to region, language and most_at_home, and arrange the data frame rows in order by counts of the language most spoken at home from smallest to largest. As we saw in Chapter 1, we can use the tidyverse arrange function to order the rows in the data frame by the values of one or more columns. Here we pass the column name most_at_home to arrange the data frame rows by the values in that column, in ascending order.
large_region_lang <- filter(tidy_lang, most_at_home > 10000) |> select(region, language, most_at_home) |> arrange(most_at_home) large_region_lang ## # A tibble: 67 × 3 ## region language most_at_home ## <chr> <chr> <int> ## 1 Edmonton Arabic 10590 ## 2 Montréal Tamil 10670 ## 3 Vancouver Russian 10795 ## 4 Edmonton Spanish 10880 ## 5 Edmonton French 10950 ## 6 Calgary Arabic 11010 ## 7 Calgary Urdu 11060 ## 8 Vancouver Hindi 11235 ## 9 Montréal Armenian 11835 ## 10 Toronto Romanian 12200 ## # ℹ 57 more rows You will notice above that we passed tidy_lang as the first argument of the filter function. We can also pipe the data frame into the same sequence of functions rather than using it as the first argument of the first function. These two choices are equivalent, and we get the same result. large_region_lang <- tidy_lang |> filter(most_at_home > 10000) |> select(region, language, most_at_home) |> arrange(most_at_home) large_region_lang ## # A tibble: 67 × 3 ## region language most_at_home ## <chr> <chr> <int> ## 1 Edmonton Arabic 10590 ## 2 Montréal Tamil 10670 ## 3 Vancouver Russian 10795 ## 4 Edmonton Spanish 10880 ## 5 Edmonton French 10950 ## 6 Calgary Arabic 11010 ## 7 Calgary Urdu 11060 ## 8 Vancouver Hindi 11235 ## 9 Montréal Armenian 11835 ## 10 Toronto Romanian 12200 ## # ℹ 57 more rows Now that we’ve shown you the pipe operator as an alternative to storing temporary objects and composing code, does this mean you should never store temporary objects or compose code? Not necessarily! There are times when you will still want to do these things. For example, you might store a temporary object before feeding it into a plot function so you can iteratively change the plot without having to redo all of your data transformations. Additionally, piping many functions can be overwhelming and difficult to debug; you may want to store a temporary object midway through to inspect your result before moving on with further steps. 
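As a rough sketch of the debugging strategy just mentioned, the example below stops a pipeline partway, stores the intermediate result so it can be inspected, and then continues from it. The data frame `toy` and the object names `toy_filtered` and `toy_result` are made up for illustration.

```r
library(dplyr)

# A small, made-up data frame with the same kinds of columns as tidy_lang.
toy <- tibble(
  region = c("A", "A", "B", "B"),
  language = c("x", "y", "x", "y"),
  most_at_home = c(20000, 50, 15000, 12000)
)

# Stop the pipeline partway and store the result so it can be inspected.
toy_filtered <- toy |>
  filter(most_at_home > 10000)
toy_filtered

# Once the intermediate result looks right, continue from the stored object.
toy_result <- toy_filtered |>
  select(region, most_at_home) |>
  arrange(most_at_home)
toy_result
```

Whether to keep an intermediate object like `toy_filtered` in the final code is a judgment call; it is often removed once the pipeline is known to work.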
3.9 Aggregating data with summarize and map 3.9.1 Calculating summary statistics on whole columns As a part of many data analyses, we need to calculate a summary value for the data (a summary statistic). Examples of summary statistics we might want to calculate are the number of observations, the average/mean value for a column, the minimum value, etc. Oftentimes, this summary statistic is calculated from the values in a data frame column, or columns, as shown in Figure 3.15. Figure 3.15: summarize is useful for calculating summary statistics on one or more column(s). In its simplest use case, it creates a new data frame with a single row containing the summary statistic(s) for each column being summarized. The darker, top row of each table represents the column headers. A useful dplyr function for calculating summary statistics is summarize, where the first argument is the data frame and subsequent arguments are the summaries we want to perform. Here we show how to use the summarize function to calculate the minimum and maximum number of Canadians reporting a particular language as their primary language at home. First a reminder of what region_lang looks like: region_lang ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. 
Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows We apply summarize to calculate the minimum and maximum number of Canadians reporting a particular language as their primary language at home, for any region: summarize(region_lang, min_most_at_home = min(most_at_home), max_most_at_home = max(most_at_home)) ## # A tibble: 1 × 2 ## min_most_at_home max_most_at_home ## <dbl> <dbl> ## 1 0 3836770 From this we see that there are some languages in the data set that no one speaks as their primary language at home. We also see that the most commonly spoken primary language at home is spoken by 3,836,770 people. 3.9.2 Calculating summary statistics when there are NAs In data frames in R, the value NA is often used to denote missing data. Many of the base R statistical summary functions (e.g., max, min, mean, sum, etc) will return NA when applied to columns containing NA values. Usually that is not what we want to happen; instead, we would usually like R to ignore the missing entries and calculate the summary statistic using all of the other non-NA values in the column. Fortunately many of these functions provide an argument na.rm that lets us tell the function what to do when it encounters NA values. In particular, if we specify na.rm = TRUE, the function will ignore missing values and return a summary of all the non-missing entries. We show an example of this combined with summarize below. 
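Before the data frame example, the effect of na.rm can be seen on a plain vector. This is a minimal sketch; the vector `x` and the result names are hypothetical.

```r
# A short numeric vector with one missing entry.
x <- c(3, NA, 7)

# Without na.rm, the missing value propagates to the result.
max_with_na <- max(x)

# With na.rm = TRUE, the summary uses only the non-missing entries.
max_without_na <- max(x, na.rm = TRUE)
mean_without_na <- mean(x, na.rm = TRUE)

max_with_na
max_without_na
mean_without_na
```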
First we create a new version of the region_lang data frame, named region_lang_na, that has a seemingly innocuous NA in the first row of the most_at_home column: region_lang_na ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. Joh… Aborigi… Aborigi… 5 NA 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows Now if we apply the summarize function as above, we see that we no longer get the minimum and maximum returned, but just an NA instead! summarize(region_lang_na, min_most_at_home = min(most_at_home), max_most_at_home = max(most_at_home)) ## # A tibble: 1 × 2 ## min_most_at_home max_most_at_home ## <dbl> <dbl> ## 1 NA NA We can fix this by adding the na.rm = TRUE as explained above: summarize(region_lang_na, min_most_at_home = min(most_at_home, na.rm = TRUE), max_most_at_home = max(most_at_home, na.rm = TRUE)) ## # A tibble: 1 × 2 ## min_most_at_home max_most_at_home ## <dbl> <dbl> ## 1 0 3836770 3.9.3 Calculating summary statistics for groups of rows A common pairing with summarize is group_by. Pairing these functions together can let you summarize values for subgroups within a data set, as illustrated in Figure 3.16. For example, we can use group_by to group the regions of the region_lang data frame and then calculate the minimum and maximum number of Canadians reporting the language as the primary language at home for each of the regions in the data set. Figure 3.16: summarize and group_by is useful for calculating summary statistics on one or more column(s) for each group. 
It creates a new data frame—with one row for each group—containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The orange, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example. The group_by function takes at least two arguments. The first is the data frame that will be grouped, and the second and onwards are columns to use in the grouping. Here we use only one column for grouping (region), but more than one can also be used. To do this, list additional columns separated by commas. group_by(region_lang, region) |> summarize( min_most_at_home = min(most_at_home), max_most_at_home = max(most_at_home) ) ## # A tibble: 35 × 3 ## region min_most_at_home max_most_at_home ## <chr> <dbl> <dbl> ## 1 Abbotsford - Mission 0 137445 ## 2 Barrie 0 182390 ## 3 Belleville 0 97840 ## 4 Brantford 0 124560 ## 5 Calgary 0 1065070 ## 6 Edmonton 0 1050410 ## 7 Greater Sudbury 0 133960 ## 8 Guelph 0 130950 ## 9 Halifax 0 371215 ## 10 Hamilton 0 630380 ## # ℹ 25 more rows Notice that group_by on its own doesn’t change the way the data looks. In the output below, the grouped data set looks the same, and it doesn’t appear to be grouped by region. Instead, group_by simply changes how other functions work with the data, as we saw with summarize above. group_by(region_lang, region) ## # A tibble: 7,490 × 7 ## # Groups: region [35] ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. 
Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows 3.9.4 Calculating summary statistics on many columns Sometimes we need to summarize statistics across many columns. An example of this is illustrated in Figure 3.17. In such a case, using summarize alone means that we have to type out the name of each column we want to summarize. In this section we will meet two strategies for performing this task. First we will see how we can do this using summarize + across. Then we will also explore how we can use a more general iteration function, map, to also accomplish this. Figure 3.17: summarize + across or map is useful for efficiently calculating summary statistics on many columns at once. The darker, top row of each table represents the column headers. summarize and across for calculating summary statistics on many columns To summarize statistics across many columns, we can use the summarize function we have just recently learned about. However, in such a case, using summarize alone means that we have to type out the name of each column we want to summarize. To do this more efficiently, we can pair summarize with across and use a colon : to specify a range of columns we would like to perform the statistical summaries on. Here we demonstrate finding the maximum value of each of the numeric columns of the region_lang data set. 
region_lang |> summarize(across(mother_tongue:lang_known, max)) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 Note: Similar to when we use base R statistical summary functions (e.g., max, min, mean, sum, etc.) with summarize alone, the summarize + across functions paired with base R statistical summary functions also return NAs when we apply them to columns that contain NAs in the data frame. To resolve this issue, again we need to add the argument na.rm = TRUE. But in this case we need to use it a little bit differently: we write a ~, and then call the summary function with the first argument .x and the second argument na.rm = TRUE. For example, for the previous example with the max function, we would write region_lang_na |> summarize(across(mother_tongue:lang_known, ~ max(.x, na.rm = TRUE))) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 The meaning of this unusual syntax is a bit beyond the scope of this book, but interested readers can look up anonymous functions in the purrr package from tidyverse. map for calculating summary statistics on many columns An alternative to summarize and across for applying a function to many columns is the map family of functions. Let’s again find the maximum value of each column of the region_lang data frame, but using map with the max function this time. map takes two arguments: an object (a vector, data frame or list) that you want to apply the function to, and the function that you would like to apply to each column. Note that map does not have an argument to specify which columns to apply the function to. Therefore, we will use the select function before calling map to choose the columns for which we want the maximum.
region_lang |> select(mother_tongue:lang_known) |> map(max) ## $mother_tongue ## [1] 3061820 ## ## $most_at_home ## [1] 3836770 ## ## $most_at_work ## [1] 3218725 ## ## $lang_known ## [1] 5600480 Note: The map function comes from the purrr package. But since purrr is part of the tidyverse, once we call library(tidyverse) we do not need to load the purrr package separately. The output looks a bit weird… we passed in a data frame, but the output doesn’t look like a data frame. As it so happens, it is not a data frame, but rather a plain list: region_lang |> select(mother_tongue:lang_known) |> map(max) |> typeof() ## [1] "list" So what do we do? Should we convert this to a data frame? We could, but a simpler alternative is to just use a different map function. There are quite a few to choose from, they all work similarly, but their name reflects the type of output you want from the mapping operation. Table 3.3 lists the commonly used map functions as well as their output type. Table 3.3: The map functions in R. map function Output map list map_lgl logical vector map_int integer vector map_dbl double vector map_chr character vector map_dfc data frame, combining column-wise map_dfr data frame, combining row-wise Let’s get the columns’ maximums again, but this time use the map_dfr function to return the output as a data frame: region_lang |> select(mother_tongue:lang_known) |> map_dfr(max) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 Note: Similar to when we use base R statistical summary functions (e.g., max, min, mean, sum, etc.) with summarize, map functions paired with base R statistical summary functions also return NA values when we apply them to columns that contain NA values. To avoid this, again we need to add the argument na.rm = TRUE. 
When we use this with map, we do this by adding a , and then na.rm = TRUE after specifying the function, as illustrated below: region_lang_na |> select(mother_tongue:lang_known) |> map_dfr(max, na.rm = TRUE) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 The map functions are generally quite useful for solving many problems involving repeatedly applying functions in R. Additionally, their use is not limited to columns of a data frame; map family functions can be used to apply functions to elements of a vector, or a list, and even to lists of (nested!) data frames. To learn more about the map functions, see the additional resources section at the end of this chapter. 3.10 Apply functions across many columns with mutate and across Sometimes we need to apply a function to many columns in a data frame. For example, we would need to do this when converting units of measurements across many columns. We illustrate such a data transformation in Figure 3.18. Figure 3.18: mutate and across is useful for applying functions across many columns. The darker, top row of each table represents the column headers. For example, imagine that we wanted to convert all the numeric columns in the region_lang data frame from double type to integer type using the as.integer function. When we revisit the region_lang data frame, we can see that this would be the columns from mother_tongue to lang_known. region_lang ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. 
Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows To accomplish such a task, we can use mutate paired with across. This works in a similar way for column selection, as we saw when we used summarize + across earlier. As we did above, we again use across to specify the columns using select syntax as well as the function we want to apply on the specified columns. However, a key difference here is that we are using mutate, which means that we get back a data frame with the same number of columns and rows. The only thing that changes is the transformation we applied to the specified columns (here mother_tongue to lang_known). region_lang |> mutate(across(mother_tongue:lang_known, as.integer)) ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <int> <int> <int> <int> ## 1 St. Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows 3.11 Apply functions across columns within one row with rowwise and mutate What if you want to apply a function across columns but within one row? We illustrate such a data transformation in Figure 3.19. Figure 3.19: rowwise and mutate is useful for applying functions across columns within one row. The darker, top row of each table represents the column headers. 
For instance, suppose we want to know the maximum value between mother_tongue, most_at_home, most_at_work and lang_known for each language and region in the region_lang data set. In other words, we want to apply the max function row-wise. We will use the (aptly named) rowwise function in combination with mutate to accomplish this task. Before we apply rowwise, we will select only the count columns so we can see all the columns in the data frame’s output easily in the book. So for this demonstration, the data set we are operating on looks like this: region_lang |> select(mother_tongue:lang_known) ## # A tibble: 7,490 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 5 0 0 0 ## 2 5 0 0 0 ## 3 0 0 0 0 ## 4 0 0 0 0 ## 5 5 5 0 0 ## 6 0 5 0 20 ## 7 0 0 0 0 ## 8 0 0 0 0 ## 9 30 15 0 10 ## 10 0 0 0 0 ## # ℹ 7,480 more rows Now we apply rowwise before mutate, to tell R that we would like the mutate function to be applied across, and within, a row, as opposed to being applied on a column (which is the default behavior of mutate): region_lang |> select(mother_tongue:lang_known) |> rowwise() |> mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) ## # A tibble: 7,490 × 5 ## # Rowwise: ## mother_tongue most_at_home most_at_work lang_known maximum ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5 0 0 0 5 ## 2 5 0 0 0 5 ## 3 0 0 0 0 0 ## 4 0 0 0 0 0 ## 5 5 5 0 0 5 ## 6 0 5 0 20 20 ## 7 0 0 0 0 0 ## 8 0 0 0 0 0 ## 9 30 15 0 10 30 ## 10 0 0 0 0 0 ## # ℹ 7,480 more rows We see that we get an additional column added to the data frame, named maximum, which is the maximum value between mother_tongue, most_at_home, most_at_work and lang_known for each language and region. Similar to group_by, rowwise doesn’t appear to do anything when it is called by itself. However, we can apply rowwise in combination with other functions to change how these other functions operate on the data. 
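As a minimal sketch of this behavior, the code below uses a small made-up data frame so that the results can be checked by eye. The toy data frame and its columns a, b and c are hypothetical and are not part of the region_lang data set; only the dplyr package (part of the tidyverse) is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, made up purely for illustration
toy <- tibble(a = c(1, 5, 3), b = c(4, 2, 9), c = c(7, 0, 6))

# With rowwise, max is computed separately within each row
row_max <- toy |>
  rowwise() |>
  mutate(maximum = max(c(a, b, c))) |>
  ungroup()

# Without rowwise, max is computed once over all values in the three columns
overall_max <- toy |>
  mutate(maximum = max(c(a, b, c)))

row_max$maximum      # 7 5 9
overall_max$maximum  # 9 9 9
```

The call to ungroup at the end of the first pipeline removes the row-wise grouping again, so that later operations on row_max behave column-wise as usual.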
Notice that if we had used mutate without rowwise, we would have computed the maximum value across all rows rather than the maximum value for each row. Below we show what would have happened had we not used rowwise. In particular, the same maximum value is reported in every single row; this code does not provide the desired result. region_lang |> select(mother_tongue:lang_known) |> mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) ## # A tibble: 7,490 × 5 ## mother_tongue most_at_home most_at_work lang_known maximum ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5 0 0 0 5600480 ## 2 5 0 0 0 5600480 ## 3 0 0 0 0 5600480 ## 4 0 0 0 0 5600480 ## 5 5 5 0 0 5600480 ## 6 0 5 0 20 5600480 ## 7 0 0 0 0 5600480 ## 8 0 0 0 0 5600480 ## 9 30 15 0 10 5600480 ## 10 0 0 0 0 5600480 ## # ℹ 7,480 more rows 3.12 Summary Cleaning and wrangling data can be a very time-consuming process. However, it is a critical step in any data analysis. We have explored many different functions for cleaning and wrangling data into a tidy format. Table 3.4 summarizes some of the key wrangling functions we learned in this chapter. In the following chapters, you will learn how you can take this tidy data and do so much more with it to answer your burning data science questions!
Table 3.4: Summary of wrangling functions Function Description across allows you to apply function(s) to multiple columns filter subsets rows of a data frame group_by allows you to apply function(s) to groups of rows mutate adds or modifies columns in a data frame map general iteration function pivot_longer generally makes the data frame longer and narrower pivot_wider generally makes a data frame wider and decreases the number of rows rowwise applies functions across columns within one row separate splits up a character column into multiple columns select subsets columns of a data frame summarize calculates summaries of inputs 3.13 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Cleaning and wrangling data” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 3.14 Additional resources As we mentioned earlier, tidyverse is actually an R metapackage: it installs and loads a collection of R packages that all follow the tidy data philosophy we discussed above. One of the tidyverse packages is dplyr, a data wrangling workhorse. You have already met many of dplyr’s functions (select, filter, mutate, arrange, summarize, and group_by). To learn more about these functions and meet a few more useful functions, we recommend you check out Chapters 5-9 of the STAT545 online notes of the data wrangling, exploration, and analysis with R book. The dplyr R package documentation (Wickham, François, et al.
2021) is another resource to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data wrangling functions from this chapter. Check out the tidyselect R package page (Henry and Wickham 2021) for a comprehensive list of select helpers. These helpers can be used to choose columns in a data frame when paired with the select function (and other functions that use the tidyselect syntax, such as pivot_longer). The documentation for select helpers is a useful reference to find the helper you need for your particular problem. R for Data Science (Wickham and Grolemund 2016) has a few chapters related to data wrangling that go into more depth than this book. For example, the tidy data chapter covers tidy data, pivot_longer/pivot_wider and separate, but also covers missing values and additional wrangling functions (like unite). The data transformation chapter covers select, filter, arrange, mutate, and summarize. And the map functions chapter provides more about the map functions. You will occasionally encounter a case where you need to iterate over items in a data frame, but none of the above functions are flexible enough to do what you want. In that case, you may consider using a for loop. References "],["viz.html", "Chapter 4 Effective data visualization 4.1 Overview 4.2 Chapter learning objectives 4.3 Choosing the visualization 4.4 Refining the visualization 4.5 Creating visualizations with ggplot2 4.6 Explaining the visualization 4.7 Saving the visualization 4.8 Exercises 4.9 Additional resources", " Chapter 4 Effective data visualization 4.1 Overview This chapter will introduce concepts and tools relating to data visualization beyond what we have seen and practiced so far. We will focus on guiding principles for effective data visualization and explaining visualizations independent of any particular tool or programming language. 
In the process, we will cover some specifics of creating visualizations (scatter plots, bar plots, line plots, and histograms) for data using R. 4.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe when to use the following kinds of visualizations to answer specific questions using a data set: scatter plots line plots bar plots histogram plots Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question. Evaluate the effectiveness of a visualization and suggest improvements to better answer a given question. Referring to the visualization, communicate the conclusions in non-technical terms. Identify rules of thumb for creating effective visualizations. Use the ggplot2 package in R to create and refine the above visualizations using: geometric objects: geom_point, geom_line, geom_histogram, geom_bar, geom_vline, geom_hline scales: xlim, ylim aesthetic mappings: x, y, fill, color, shape labeling: xlab, ylab, labs font control and legend positioning: theme subplots: facet_grid Define the three key aspects of ggplot2 objects: aesthetic mappings geometric objects scales Describe the difference between raster and vector output formats. Use ggsave to save visualizations in .png and .svg format. 4.3 Choosing the visualization Ask a question, and answer it The purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do before creating a visualization is to formulate the question about the data you are trying to answer. A good visualization will clearly answer your question without distraction; a great visualization will even suggest what the question itself was, without additional explanation. Imagine your visualization as part of a poster presentation for a project; even if you aren’t standing at the poster explaining things, an effective visualization will convey your message to the audience.
Recall the different data analysis questions from Chapter 1. With the visualizations we will cover in this chapter, we will be able to answer only descriptive and exploratory questions. Be careful to not answer any predictive, inferential, causal or mechanistic questions with the visualizations presented here, as we have not learned the tools necessary to do that properly just yet. As with most coding tasks, it is totally fine (and quite common) to make mistakes and iterate a few times before you find the right visualization for your data and question. There are many different kinds of plotting graphics available to use (see Chapter 5 of Fundamentals of Data Visualization (Wilke 2019) for a directory). The types of plot that we introduce in this book are shown in Figure 4.1; which one you should select depends on your data and the question you want to answer. In general, the guiding principles of when to use each type of plot are as follows: scatter plots visualize the relationship between two quantitative variables line plots visualize trends with respect to an independent, ordered quantity (e.g., time) bar plots visualize comparisons of amounts histograms visualize the distribution of one quantitative variable (i.e., all its possible values and how often they occur) Figure 4.1: Examples of scatter, line and bar plots, as well as histograms. All types of visualization have their (mis)uses, but three kinds are usually hard to understand or are easily replaced with an oft-better alternative. In particular, you should avoid pie charts; it is generally better to use bars, as it is easier to compare bar heights than pie slice sizes. You should also not use 3-D visualizations, as they are typically hard to understand when converted to a static 2-D image format. Finally, do not use tables to make numerical comparisons; humans are much better at quickly processing visual information than text and math. Bar plots are again typically a better alternative. 
4.4 Refining the visualization Convey the message, minimize noise Just being able to make a visualization in R (or any other language, for that matter) doesn’t mean that it effectively communicates your message to others. Once you have selected a broad type of visualization to use, you will have to refine it to suit your particular need. Some rules of thumb for doing this are listed below. They generally fall into two classes: you want to make your visualization convey your message, and you want to reduce visual noise as much as possible. Humans have limited cognitive ability to process information; both of these types of refinement aim to reduce the mental load on your audience when viewing your visualization, making it easier for them to understand and remember your message quickly. Convey the message Make sure the visualization answers the question you have asked as simply and plainly as possible. Use legends and labels so that your visualization is understandable without reading the surrounding text. Ensure the text, symbols, lines, etc., on your visualization are big enough to be easily read. Ensure the data are clearly visible; don’t hide the shape/distribution of the data behind other objects (e.g., a bar). Make sure to use color schemes that are understandable by those with colorblindness (a surprisingly large fraction of the overall population, from about 1% to 10% depending on sex and ancestry (Deeb 2005)). For example, ColorBrewer and the RColorBrewer R package (Neuwirth 2014) provide the ability to pick such color schemes, and you can check your visualizations after you have created them by uploading to online tools such as a color blindness simulator. Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience. Minimize noise Use colors sparingly. Too many different colors can be distracting, create false patterns, and detract from the message. Be wary of overplotting.
Overplotting is when marks that represent the data overlap, and is problematic as it prevents you from seeing how many data points are represented in areas of the visualization where this occurs. If your plot has too many dots or lines and starts to look like a mess, you need to do something different. Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small. Don’t adjust the axes to zoom in on small differences. If the difference is small, show that it’s small! 4.5 Creating visualizations with ggplot2 Build the visualization iteratively This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, and then how to create the visualization in R using the ggplot2 R package. Since the ggplot2 package is loaded by the tidyverse metapackage, we need to load only tidyverse: library(tidyverse) 4.5.1 Scatter plots and line plots: the Mauna Loa CO\\(_{\\text{2}}\\) data set The Mauna Loa CO\\(_{\\text{2}}\\) data set, curated by Dr. Pieter Tans, NOAA/GML and Dr. Ralph Keeling, Scripps Institution of Oceanography, records the atmospheric concentration of carbon dioxide (CO\\(_{\\text{2}}\\), in parts per million) at the Mauna Loa research station in Hawaii from 1959 onward (Tans and Keeling 2020). For this book, we are going to focus on the years 1980-2020. Question: Does the concentration of atmospheric CO\\(_{\\text{2}}\\) change over time, and are there any interesting patterns to note? To get started, we will read and inspect the data: # mauna loa carbon dioxide data co2_df <- read_csv("data/mauna_loa_data.csv") co2_df ## # A tibble: 484 × 2 ## date_measured ppm ## <date> <dbl> ## 1 1980-02-01 338. ## 2 1980-03-01 340. ## 3 1980-04-01 341. ## 4 1980-05-01 341. ## 5 1980-06-01 341. ## 6 1980-07-01 339. ## 7 1980-08-01 338. ## 8 1980-09-01 336. ## 9 1980-10-01 336. ## 10 1980-11-01 337.
## # ℹ 474 more rows We see that there are two columns in the co2_df data frame; date_measured and ppm. The date_measured column holds the date the measurement was taken, and is of type date. The ppm column holds the value of CO\\(_{\\text{2}}\\) in parts per million that was measured on each date, and is type double. Note: read_csv was able to parse the date_measured column into the date vector type because it was entered in the international standard date format, called ISO 8601, which lists dates as year-month-day. date vectors are double vectors with special properties that allow them to handle dates correctly. For example, date type vectors allow functions like ggplot to treat them as numeric dates and not as character vectors, even though they contain non-numeric characters (e.g., in the date_measured column in the co2_df data frame). This means R will not accidentally plot the dates in the wrong order (i.e., not alphanumerically as would happen if it was a character vector). An in-depth study of dates and times is beyond the scope of the book, but interested readers may consult the Dates and Times chapter of R for Data Science (Wickham and Grolemund 2016); see the additional resources at the end of this chapter. Since we are investigating a relationship between two variables (CO\\(_{\\text{2}}\\) concentration and date), a scatter plot is a good place to start. Scatter plots show the data as individual points with x (horizontal axis) and y (vertical axis) coordinates. Here, we will use the measurement date as the x coordinate and the CO\\(_{\\text{2}}\\) concentration as the y coordinate. When using the ggplot2 package, we create a plot object with the ggplot function. There are a few basic aspects of a plot that we need to specify: The name of the data frame object to visualize. Here, we specify the co2_df data frame. The aesthetic mapping, which tells ggplot how the columns in the data frame map to properties of the visualization. 
To create an aesthetic mapping, we use the aes function. Here, we set the plot x axis to the date_measured variable, and the plot y axis to the ppm variable. The + operator, which tells ggplot that we would like to add another layer to the plot. The geometric object, which specifies how the mapped data should be displayed. To create a geometric object, we use a geom_* function (see the ggplot reference for a list of geometric objects). Here, we use the geom_point function to visualize our data as a scatter plot. co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_point() co2_scatter Figure 4.2: Scatter plot of atmospheric concentration of CO\\(_{2}\\) over time. The visualization in Figure 4.2 shows a clear upward trend in the atmospheric concentration of CO\\(_{\\text{2}}\\) over time. This plot answers the first part of our question in the affirmative, but that appears to be the only conclusion one can make from the scatter visualization. One important thing to note about this data is that one of the variables we are exploring is time. Time is a special kind of quantitative variable because it forces additional structure on the data—the data points have a natural order. Specifically, each observation in the data set has a predecessor and a successor, and the order of the observations matters; changing their order alters their meaning. In situations like this, we typically use a line plot to visualize the data. Line plots connect the sequence of x and y coordinates of the observations with line segments, thereby emphasizing their order. We can create a line plot in ggplot using the geom_line function. Let’s now try to visualize the co2_df as a line plot with just the default arguments: co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_line() co2_line Figure 4.3: Line plot of atmospheric concentration of CO\\(_{2}\\) over time. Aha! 
Figure 4.3 shows us there is another interesting phenomenon in the data: in addition to increasing over time, the concentration seems to oscillate as well. Given the visualization as it is now, it is still hard to tell how fast the oscillation is, but nevertheless, the line seems to be a better choice for answering the question than the scatter plot was. The comparison between these two visualizations also illustrates a common issue with scatter plots: often, the points are shown too close together or even on top of one another, muddling information that would otherwise be clear (overplotting). Now that we have settled on the rough details of the visualization, it is time to refine things. This plot is fairly straightforward, and there is not much visual noise to remove. But there are a few things we must do to improve clarity, such as adding informative axis labels and making the font a more readable size. To add axis labels, we use the xlab and ylab functions. To change the font size, we use the theme function with the text argument: co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_line() + xlab("Year") + ylab("Atmospheric CO2 (ppm)") + theme(text = element_text(size = 12)) co2_line Figure 4.4: Line plot of atmospheric concentration of CO\\(_{2}\\) over time with clearer axes and labels. Note: The theme function is quite complex and has many arguments that can be specified to control many non-data aspects of a visualization. An in-depth discussion of the theme function is beyond the scope of this book. Interested readers may consult the theme function documentation; see the additional resources section at the end of this chapter. Finally, let’s see if we can better understand the oscillation by changing the visualization slightly. Note that it is totally fine to use a small number of visualizations to answer different aspects of the question you are trying to answer. 
We will accomplish this by using scales, another important feature of ggplot2 that lets us easily transform the different variables and set limits. We scale the horizontal axis using the xlim function, and the vertical axis with the ylim function. In particular, here, we will use the xlim function to zoom in on just a few years of data (here, 1990 through 1993). xlim takes a vector of length two that specifies the lower and upper bounds of the axis. We can create that using the c function. Note that the vector given to xlim must be of the same type as the data that is mapped to that axis. Here, we have mapped a date to the x-axis, and so we need to use the date function (from the tidyverse lubridate R package (Spinu, Grolemund, and Wickham 2021; Grolemund and Wickham 2011)) to convert the character strings we provide to c to date vectors. Note: lubridate is a package that is installed by the tidyverse metapackage, but is not loaded by it. Hence we need to load it separately in the code below. library(lubridate) co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_line() + xlab("Year") + ylab("Atmospheric CO2 (ppm)") + xlim(c(date("1990-01-01"), date("1993-12-01"))) + theme(text = element_text(size = 12)) co2_line Figure 4.5: Line plot of atmospheric concentration of CO\\(_{2}\\) from 1990 to 1994. Interesting! It seems that each year, the atmospheric CO\\(_{\\text{2}}\\) increases until it reaches its peak somewhere around April, decreases until around late September, and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April. Therefore, the oscillating pattern in CO\\(_{\\text{2}}\\) matches up fairly closely with the two seasons. As you might have noticed from the code used to create the final visualization of the co2_df data frame, we construct the visualizations in ggplot with layers.
New layers are added with the + operator, and we can really add as many as we would like! A useful analogy to constructing a data visualization is painting a picture. We start with a blank canvas, and the first thing we do is prepare the surface for our painting by adding primer. In our data visualization this is akin to calling ggplot and specifying the data set we will be using. Next, we sketch out the background of the painting. In our data visualization, this would be when we map data to the axes in the aes function. Then we add our key visual subjects to the painting. In our data visualization, this would be the geometric objects (e.g., geom_point, geom_line, etc.). And finally, we work on adding details and refinements to the painting. In our data visualization this would be when we fine tune axis labels, change the font, adjust the point size, and do other related things. 4.5.2 Scatter plots: the Old Faithful eruption time data set The faithful data set contains measurements of the waiting time between eruptions and the subsequent eruption duration (in minutes) of the Old Faithful geyser in Yellowstone National Park, Wyoming, United States. The faithful data set is available in base R as a data frame, so it does not need to be loaded. We convert it to a tibble to take advantage of the nicer print output these specialized data frames provide. Question: Is there a relationship between the waiting time before an eruption and the duration of the eruption? # old faithful eruption time / wait time data faithful <- as_tibble(faithful) faithful ## # A tibble: 272 × 2 ## eruptions waiting ## <dbl> <dbl> ## 1 3.6 79 ## 2 1.8 54 ## 3 3.33 74 ## 4 2.28 62 ## 5 4.53 85 ## 6 2.88 55 ## 7 4.7 88 ## 8 3.6 85 ## 9 1.95 51 ## 10 4.35 85 ## # ℹ 262 more rows Here again, we investigate the relationship between two quantitative variables (waiting time and eruption time). 
But if you look at the output of the data frame, you’ll notice that unlike time in the Mauna Loa CO\\(_{\\text{2}}\\) data set, neither of the variables here has a natural order. So a scatter plot is likely to be the most appropriate visualization. Let’s create a scatter plot using the ggplot function with the waiting variable on the horizontal axis, the eruptions variable on the vertical axis, and the geom_point geometric object. The result is shown in Figure 4.6. faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point() faithful_scatter Figure 4.6: Scatter plot of waiting time and eruption time. We can see in Figure 4.6 that the data tend to fall into two groups: one with short waiting and eruption times, and one with long waiting and eruption times. Note that in this case, there is no overplotting: the points are generally nicely visually separated, and the pattern they form is clear. In order to refine the visualization, we need only to add axis labels and make the font more readable: faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point() + xlab("Waiting Time (mins)") + ylab("Eruption Duration (mins)") + theme(text = element_text(size = 12)) faithful_scatter Figure 4.7: Scatter plot of waiting time and eruption time with clearer axes and labels. 4.5.3 Axis transformation and colored scatter plots: the Canadian languages data set Recall the can_lang data set (Timbers 2020) from Chapters 1, 2, and 3, which contains counts of languages from the 2016 Canadian census. Question: Is there a relationship between the percentage of people who speak a language as their mother tongue and the percentage for whom that is the primary language spoken at home? And is there a pattern in the strength of this relationship in the higher-level language categories (Official languages, Aboriginal languages, or non-official and non-Aboriginal languages)?
To get started, we will read and inspect the data: can_lang <- read_csv("data/can_lang.csv") can_lang ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows We will begin with a scatter plot of the mother_tongue and most_at_home columns from our data frame. The resulting plot is shown in Figure 4.8. ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) + geom_point() Figure 4.8: Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home. To make an initial improvement in the interpretability of Figure 4.8, we should replace the default axis names with more informative labels. We can use \\n to create a line break in the axis names so that the words after \\n are printed on a new line. This will make the axes labels on the plots more readable. We should also increase the font size to further improve readability. ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) + geom_point() + xlab("Language spoken most at home \\n (number of Canadian residents)") + ylab("Mother tongue \\n (number of Canadian residents)") + theme(text = element_text(size = 12)) Figure 4.9: Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels. Okay! The axes and labels in Figure 4.9 are much more readable and interpretable now. 
However, the scatter points themselves could use some work; most of the 214 data points are bunched up in the lower left-hand side of the visualization. The data is clumped because many more people in Canada speak English or French (the two points in the upper right corner) than other languages. In particular, the most common mother tongue language has 19,460,850 speakers, while the least common has only 10. That’s a difference of roughly six orders of magnitude between these two numbers! We can confirm that the two points in the upper right-hand corner correspond to Canada’s two official languages by filtering the data: can_lang |> filter(language == "English" | language == "French") ## # A tibble: 2 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Official languages English 19460850 22162865 15265335 29748265 ## 2 Official languages French 7166700 6943800 3825215 10242945 Recall that our question about this data pertains to all languages; so to properly answer our question, we will need to adjust the scale of the axes so that we can clearly see all of the scatter points. In particular, we will improve the plot by adjusting the horizontal and vertical axes so that they are on a logarithmic (or log) scale. Log scaling is useful when your data take both very large and very small values, because it helps space out small values and squishes larger values together. For example, \\(\\log_{10}(1) = 0\\), \\(\\log_{10}(10) = 1\\), \\(\\log_{10}(100) = 2\\), and \\(\\log_{10}(1000) = 3\\); on the logarithmic scale, the values 1, 10, 100, and 1000 are all the same distance apart! So we see that applying this function is moving big values closer together and moving small values farther apart. Note that if your data can take the value 0, logarithmic scaling may not be appropriate (since log10(0) is -Inf in R). There are other ways to transform the data in such a case, but these are beyond the scope of the book.
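To make the spacing argument concrete, here is a small sketch in base R. The vector name values and the other object names are our own illustrative choices, not part of the analysis in this chapter. It compares the gaps between 1, 10, 100, and 1000 on the raw scale and on the log10 scale, and shows what happens when a value is 0.

```r
# Hypothetical example values spanning several orders of magnitude
values <- c(1, 10, 100, 1000)

# On the raw scale, the gaps between neighbouring values grow quickly
raw_gaps <- diff(values)

# On the log10 scale, the same values are evenly spaced
log_values <- log10(values)
log_gaps <- diff(log_values)

raw_gaps
log_gaps

# log10(0) is -Inf, which is why zeros are a problem for log scales
log10(0)
```

The raw gaps differ by a factor of ten from one step to the next, while the gaps on the log scale are all equal; this is the sense in which a log axis spreads out small values and compresses large ones.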
We can accomplish logarithmic scaling in a ggplot visualization using the scale_x_log10 and scale_y_log10 functions. Given that the x and y axes have large numbers, we should also format the axis labels to put commas in these numbers to increase their readability. We can do this in R by passing the label_comma function (from the scales package) to the labels argument of the scale_x_log10 and scale_y_log10 functions. library(scales) ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) + geom_point() + xlab("Language spoken most at home \\n (number of Canadian residents)") + ylab("Mother tongue \\n (number of Canadian residents)") + theme(text = element_text(size = 12)) + scale_x_log10(labels = label_comma()) + scale_y_log10(labels = label_comma()) Figure 4.10: Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with log adjusted x and y axes. Similar to some of the examples in Chapter 3, we can convert the counts to percentages to give them context and make them easier to understand. We can do this by dividing the number of people reporting a given language as their mother tongue or primary language at home by the number of people who live in Canada and multiplying by 100%. For example, the percentage of people who reported that their mother tongue was English in the 2016 Canadian census was 19,460,850 / 35,151,728 \\(\\times\\) 100 % = 55.36%. Below we use mutate to calculate the percentage of people reporting a given language as their mother tongue and primary language at home for all the languages in the can_lang data set. Since the new columns are appended to the end of the data table, we select the new columns after the transformation so you can clearly see the mutated output from the table.
can_lang <- can_lang |> mutate( mother_tongue_percent = (mother_tongue / 35151728) * 100, most_at_home_percent = (most_at_home / 35151728) * 100 ) can_lang |> select(mother_tongue_percent, most_at_home_percent) ## # A tibble: 214 × 2 ## mother_tongue_percent most_at_home_percent ## <dbl> <dbl> ## 1 0.00168 0.000669 ## 2 0.0292 0.0136 ## 3 0.00327 0.00127 ## 4 0.0383 0.0170 ## 5 0.0765 0.0374 ## 6 0.000128 0.0000284 ## 7 0.00358 0.00105 ## 8 0.00764 0.00859 ## 9 0.0639 0.0364 ## 10 1.19 0.636 ## # ℹ 204 more rows Finally, we will edit the visualization to use the percentages we just computed (and change our axis labels to reflect this change in units). Figure 4.11 displays the final result. ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12)) + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) Figure 4.11: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home. Figure 4.11 is the appropriate visualization to use to answer the first question in this section, i.e., whether there is a relationship between the percentage of people who speak a language as their mother tongue and the percentage for whom that is the primary language spoken at home. To fully answer the question, we need to use Figure 4.11 to assess a few key characteristics of the data: Direction: if the y variable tends to increase when the x variable increases, then y has a positive relationship with x. If y tends to decrease when x increases, then y has a negative relationship with x. If y does not meaningfully increase or decrease as x increases, then y has little or no relationship with x. Strength: if the y variable reliably increases, decreases, or stays flat as x increases, then the relationship is strong. 
Otherwise, the relationship is weak. Intuitively, the relationship is strong when the scatter points are close together and look more like a “line” or “curve” than a “cloud.” Shape: if you can draw a straight line roughly through the data points, the relationship is linear. Otherwise, it is nonlinear. In Figure 4.11, we see that as the percentage of people who have a language as their mother tongue increases, so does the percentage of people who speak that language at home. Therefore, there is a positive relationship between these two variables. Furthermore, because the points in Figure 4.11 are fairly close together, and the points look more like a “line” than a “cloud”, we can say that this is a strong relationship. And finally, because drawing a straight line through these points in Figure 4.11 would fit the pattern we observe quite well, we say that the relationship is linear. Onto the second part of our exploratory data analysis question! Recall that we are interested in knowing whether the strength of the relationship we uncovered in Figure 4.11 depends on the higher-level language category (Official languages, Aboriginal languages, and non-official, non-Aboriginal languages). One common way to explore this is to color the data points on the scatter plot we have already created by group. For example, given that we have the higher-level language category for each language recorded in the 2016 Canadian census, we can color the points in our previous scatter plot to represent each language’s higher-level language category. Here we want to distinguish the values according to the category group to which they belong. We can add an argument to the aes function, specifying that the category column should color the points. Adding this argument will color the points according to their group and add a legend at the side of the plot.
ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent, color = category)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12)) + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) Figure 4.12: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category. The legend in Figure 4.12 takes up valuable plot area. We can improve this by moving the legend using the legend.position and legend.direction arguments of the theme function. Here we set legend.position to \"top\" to put the legend above the plot and legend.direction to \"vertical\" so that the legend items remain vertically stacked on top of each other. When the legend.position is set to either \"top\" or \"bottom\" the default direction is to stack the legend items horizontally. However, that will not work well for this particular visualization because the legend labels are quite long and would run off the page if displayed this way. ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent, color = category)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical") + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) Figure 4.13: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with the legend edited. In Figure 4.13, the points are colored with the default ggplot2 color palette. But what if you want to use different colors?
In R, two packages that provide alternative color palettes are RColorBrewer (Neuwirth 2014) and ggthemes (Arnold 2019); in this book we will cover how to use RColorBrewer. You can visualize the list of color palettes that RColorBrewer has to offer with the display.brewer.all function. You can also print a list of color-blind friendly palettes by adding colorblindFriendly = TRUE to the function. library(RColorBrewer) display.brewer.all(colorblindFriendly = TRUE) Figure 4.14: Color palettes available from the RColorBrewer R package. From Figure 4.14, we can choose the color palette we want to use in our plot. To change the color palette, we add the scale_color_brewer layer indicating the palette we want to use. You can use this color blindness simulator to check if your visualizations are color-blind friendly. Below we pick the \"Set2\" palette, with the result shown in Figure 4.15. We also set the shape aesthetic mapping to the category variable as well; this makes the scatter point shapes different for each category. This kind of visual redundancy—i.e., conveying the same information with both scatter point color and shape—can further improve the clarity and accessibility of your visualization. ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent, color = category, shape = category)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical") + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) + scale_color_brewer(palette = "Set2") Figure 4.15: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with color-blind friendly colors. 
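If you would like to see the actual colors behind a palette name, the short sketch below may help. It is our own illustration rather than part of the original analysis, and the object name set2_colors is an assumption. It uses the brewer.pal function from RColorBrewer to list the hex codes for a three-color version of the "Set2" palette, and looks up whether that palette is flagged as color-blind friendly in the package's brewer.pal.info table.

```r
library(RColorBrewer)

# Hex codes for a three-color version of the "Set2" palette
set2_colors <- brewer.pal(n = 3, name = "Set2")
set2_colors

# brewer.pal.info records, for each palette, whether it is
# considered color-blind friendly
brewer.pal.info["Set2", "colorblind"]
```

A vector of hex codes like this one could also be passed to scale_color_manual through its values argument if you prefer to choose the colors yourself, one color per category.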
From the visualization in Figure 4.15, we can now clearly see that the vast majority of Canadians reported one of the official languages as their mother tongue and as the language they speak most often at home. What do we see when considering the second part of our exploratory question? Do we see a difference in the relationship between languages spoken as a mother tongue and as a primary language at home across the higher-level language categories? Based on Figure 4.15, there does not appear to be much of a difference. For each higher-level language category, there appears to be a strong, positive, and linear relationship between the percentage of people who speak a language as their mother tongue and the percentage who speak it as their primary language at home. The relationship looks similar regardless of the category. Does this mean that this relationship is positive for all languages in the world? And further, can we use this data visualization on its own to predict how many people have a given language as their mother tongue if we know how many people speak it as their primary language at home? The answer to both these questions is “no!” However, with exploratory data analysis, we can create new hypotheses, ideas, and questions (like the ones at the beginning of this paragraph). Answering those questions often involves doing more complex analyses, and sometimes even gathering additional data. We will see more of such complex analyses later on in this book. 4.5.4 Bar plots: the island landmass data set The islands.csv data set contains a list of Earth’s landmasses as well as their area (in thousands of square miles) (McNeil 1977). Question: Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth’s seven largest landmasses? If so, what are the next few largest landmasses after those? 
To get started, we will read and inspect the data: # islands data islands_df <- read_csv("data/islands.csv") islands_df ## # A tibble: 48 × 3 ## landmass size landmass_type ## <chr> <dbl> <chr> ## 1 Africa 11506 Continent ## 2 Antarctica 5500 Continent ## 3 Asia 16988 Continent ## 4 Australia 2968 Continent ## 5 Axel Heiberg 16 Other ## 6 Baffin 184 Other ## 7 Banks 23 Other ## 8 Borneo 280 Other ## 9 Britain 84 Other ## 10 Celebes 73 Other ## # ℹ 38 more rows Here, we have a data frame of Earth’s landmasses, and are trying to compare their sizes. The right type of visualization to answer this question is a bar plot. In a bar plot, the height of each bar represents the value of an amount (a size, count, proportion, percentage, etc). They are particularly useful for comparing counts or proportions across different groups of a categorical variable. Note, however, that bar plots should generally not be used to display mean or median values, as they hide important information about the variation of the data. Instead it’s better to show the distribution of all the individual data points, e.g., using a histogram, which we will discuss further in Section 4.5.5. We specify that we would like to use a bar plot via the geom_bar function in ggplot2. However, by default, geom_bar sets the heights of bars to the number of times a value appears in a data frame (its count); here, we want to plot exactly the values in the data frame, i.e., the landmass sizes. So we have to pass the stat = \"identity\" argument to geom_bar. The result is shown in Figure 4.16. islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) + geom_bar(stat = "identity") islands_bar Figure 4.16: Bar plot of Earth’s landmass sizes with squished labels. Alright, not bad! The plot in Figure 4.16 is definitely the right kind of visualization, as we can clearly see and compare sizes of landmasses. 
The major issues are that the smaller landmasses’ sizes are hard to distinguish, and the names of the landmasses are obscuring each other as they have been squished into too little space. But remember that the question we asked was only about the largest landmasses; let’s make the plot a little bit clearer by keeping only the largest 12 landmasses. We do this using the slice_max function: the order_by argument is the name of the column we want to use for comparing which is largest, and the n argument specifies how many rows to keep. Then to give the labels enough space, we’ll use horizontal bars instead of vertical ones. We do this by swapping the x and y variables. Note: Recall that in Chapter 1, we used arrange followed by slice to obtain the ten rows with the largest values of a variable. We could have instead used the slice_max function for this purpose. The slice_max and slice_min functions achieve the same goal as arrange followed by slice, but are slightly more efficient because they are specialized for this purpose. In general, it is good to use more specialized functions when they are available! islands_top12 <- slice_max(islands_df, order_by = size, n = 12) islands_bar <- ggplot(islands_top12, aes(x = size, y = landmass)) + geom_bar(stat = "identity") islands_bar Figure 4.17: Bar plot of size for Earth’s largest 12 landmasses. The plot in Figure 4.17 is definitely clearer now, and allows us to answer our question (“Are the top 7 largest landmasses continents?”) in the affirmative. However, we could still improve this visualization by coloring the bars based on whether they correspond to a continent, and by organizing the bars by landmass size rather than by alphabetical order. The data for coloring the bars is stored in the landmass_type column, so we add the fill argument to the aesthetic mapping and set it to landmass_type. 
We manually select two colors for the bars using the scale_fill_manual function: \"darkorange\" for orange and \"steelblue\" for blue. To organize the landmasses by their size variable, we will use the tidyverse fct_reorder function in the aesthetic mapping. The first argument passed to fct_reorder is the name of the factor column whose levels we would like to reorder (here, landmass). The second argument is the column name that holds the values we would like to use to do the ordering (here, size). The fct_reorder function uses ascending order by default, but this can be changed to descending order by setting .desc = TRUE. We do this here so that the largest bar will be closest to the axis line, which is more visually appealing. To finalize this plot we will customize the axis and legend labels, and add a title to the chart. Plot titles are not always required, especially when it would be redundant with an already-existing caption or surrounding context (e.g., in a slide presentation with annotations). But if you decide to include one, a good plot title should provide the take home message that you want readers to focus on, e.g., “Earth’s seven largest landmasses are continents,” or a more general summary of the information displayed, e.g., “Earth’s twelve largest landmasses.” To make these final adjustments we will use the labs function rather than the xlab and ylab functions we have seen earlier in this chapter, as labs lets us modify the legend label and title in addition to axis labels. We provide a label for each aesthetic mapping in the plot (in this case, x, y, and fill) as well as one for the title argument. Finally, we again use the theme function to change the font size.
islands_bar <- ggplot(islands_top12, aes(x = size, y = fct_reorder(landmass, size, .desc = TRUE), fill = landmass_type)) + geom_bar(stat = "identity") + labs(x = "Size (1000 square mi)", y = "Landmass", fill = "Type", title = "Earth's twelve largest landmasses") + scale_fill_manual(values = c("steelblue", "darkorange")) + theme(text = element_text(size = 10)) islands_bar Figure 4.18: Bar plot of size for Earth’s largest 12 landmasses, colored by landmass type, with clearer axes and labels. The plot in Figure 4.18 is now a very effective visualization for answering our original questions. Landmasses are organized by their size, and continents are colored differently than other landmasses, making it quite clear that continents are the largest seven landmasses. 4.5.5 Histograms: the Michelson speed of light data set The morley data set contains measurements of the speed of light collected in experiments performed in 1879. Five experiments were performed, and in each experiment, 20 runs were performed—meaning that 20 measurements of the speed of light were collected in each experiment (Michelson 1882). The morley data set is available in base R as a data frame, so it does not need to be loaded. Because the speed of light is a very large number (the true value is 299,792.458 km/sec), the data is coded to be the measured speed of light minus 299,000. This coding allows us to focus on the variations in the measurements, which are generally much smaller than 299,000. If we used the full large speed measurements, the variations in the measurements would not be noticeable, making it difficult to study the differences between the experiments. Note that we convert the morley data to a tibble to take advantage of the nicer print output these specialized data frames provide. Question: Given what we know now about the speed of light (299,792.458 kilometres per second), how accurate were each of the experiments? 
# michelson morley experimental data morley <- as_tibble(morley) morley ## # A tibble: 100 × 3 ## Expt Run Speed ## <int> <int> <int> ## 1 1 1 850 ## 2 1 2 740 ## 3 1 3 900 ## 4 1 4 1070 ## 5 1 5 930 ## 6 1 6 850 ## 7 1 7 950 ## 8 1 8 980 ## 9 1 9 980 ## 10 1 10 880 ## # ℹ 90 more rows In this experimental data, Michelson was trying to measure just a single quantitative number (the speed of light). The data set contains many measurements of this single quantity. To tell how accurate the experiments were, we need to visualize the distribution of the measurements (i.e., all their possible values and how often each occurs). We can do this using a histogram. A histogram helps us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin. To create a histogram in ggplot2 we will use the geom_histogram geometric object, setting the x axis to the Speed measurement variable. As usual, let’s use the default arguments just to see how things look. morley_hist <- ggplot(morley, aes(x = Speed)) + geom_histogram() morley_hist Figure 4.19: Histogram of Michelson’s speed of light data. Figure 4.19 is a great start. However, we cannot tell how accurate the measurements are using this visualization unless we can see the true value. In order to visualize the true speed of light, we will add a vertical line with the geom_vline function. To draw a vertical line with geom_vline, we need to specify where on the x-axis the line should be drawn. We can do this by setting the xintercept argument. Here we set it to 792.458, which is the true value of light speed minus 299,000; this ensures it is coded the same way as the measurements in the morley data frame. We would also like to fine tune this vertical line, styling it so that it is dashed by setting linetype = \"dashed\". There is a similar function, geom_hline, that is used for plotting horizontal lines. 
Note that vertical lines are used to denote quantities on the horizontal axis, while horizontal lines are used to denote quantities on the vertical axis. morley_hist <- ggplot(morley, aes(x = Speed)) + geom_histogram() + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.20: Histogram of Michelson’s speed of light data with vertical line indicating true speed of light. In Figure 4.20, we still cannot tell which experiments (denoted in the Expt column) led to which measurements; perhaps some experiments were more accurate than others. To fully answer our question, we need to separate the measurements from each other visually. We can try to do this using a colored histogram, where counts from different experiments are drawn in different colors. We can create a histogram colored by the Expt variable by adding it to the fill aesthetic mapping. We make sure the different colors can be seen (despite them all sitting on top of each other) by setting the alpha argument in geom_histogram to 0.5 to make the bars slightly translucent. We also specify position = \"identity\" in geom_histogram to ensure the histograms for each experiment will be overlaid on top of one another, each starting from the horizontal axis, instead of being drawn as stacked bars (which is the default for bar plots or histograms when they are colored by another categorical variable). morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) + geom_histogram(alpha = 0.5, position = "identity") + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.21: Histogram of Michelson’s speed of light data where an attempt is made to color the bars by experiment. Alright great, Figure 4.21 looks…wait a second! The histogram is still all the same color! What is going on here? Well, if you recall from Chapter 3, the data type you use for each variable can influence how R and the tidyverse treat it. Here, we indeed have an issue with the data types in the morley data frame.
In particular, the Expt column is currently an integer (you can see the label <int> underneath the Expt column in the printed data frame at the start of this section). But we want to treat it as a category, i.e., there should be one category per type of experiment. To fix this issue we can convert the Expt variable into a factor by passing it to as_factor in the fill aesthetic mapping. Recall that factor is a data type in R that is often used to represent categories. By writing as_factor(Expt) we are ensuring that R will treat this variable as a factor, and the color will be mapped discretely. morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) + geom_histogram(alpha = 0.5, position = "identity") + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.22: Histogram of Michelson’s speed of light data colored by experiment as factor. Note: Factors impact plots in two ways: (1) ensuring a color is mapped discretely where appropriate (as in this example) and (2) the ordering of levels in a plot. ggplot takes into account the order of the factor levels as opposed to the order of data in your data frame. Learning how to reorder your factor levels will help you with reordering the labels of a factor on a plot. Unfortunately, the attempt to separate out the experiment number visually has created a bit of a mess. All of the colors in Figure 4.22 are blending together, and although it is possible to derive some insight from this (e.g., experiments 1 and 3 had some of the most incorrect measurements), it isn’t the clearest way to convey our message and answer the question. Let’s try a different strategy of creating a grid of separate histogram plots. We use the facet_grid function to create a plot that has multiple subplots arranged in a grid. The argument to facet_grid specifies the variable(s) used to split the plot into subplots, and how to split them (i.e., into rows or columns).
If the plot is to be split horizontally, into rows, then the rows argument is used. If the plot is to be split vertically, into columns, then the cols argument is used. Both the rows and cols arguments take the column names on which to split the data when creating the subplots. Note that the column names must be surrounded by the vars function. This function allows the column names to be correctly evaluated in the context of the data frame. morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) + geom_histogram() + facet_grid(rows = vars(Expt)) + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.23: Histogram of Michelson’s speed of light data split vertically by experiment. The visualization in Figure 4.23 now makes it quite clear how accurate the different experiments were with respect to one another. The most variable measurements came from Experiment 1. There the measurements ranged from about 650–1050 km/sec. The least variable measurements came from Experiment 2. There, the measurements ranged from about 750–950 km/sec. The most different experiments still obtained quite similar results! There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels using the labs function, and increase the font size to make it readable using the theme function. Second, and perhaps more subtly, even though it is easy to compare the experiments on this plot to one another, it is hard to get a sense of just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light? 
To answer this question, we’ll use the mutate function to transform our data into a relative measure of accuracy rather than absolute measurements: morley_rel <- mutate(morley, relative_accuracy = 100 * ((299000 + Speed) - 299792.458) / (299792.458)) morley_hist <- ggplot(morley_rel, aes(x = relative_accuracy, fill = as_factor(Expt))) + geom_histogram() + facet_grid(rows = vars(Expt)) + geom_vline(xintercept = 0, linetype = "dashed") + labs(x = "Relative Accuracy (%)", y = "# Measurements", fill = "Experiment ID") + theme(text = element_text(size = 12)) morley_hist Figure 4.24: Histogram of relative accuracy split vertically by experiment with clearer axes and labels. Wow, impressive! These measurements of the speed of light from 1879 had errors around 0.05% of the true speed. Figure 4.24 shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an admirable job given the technology available at the time. Choosing a binwidth for histograms When you create a histogram in R, the default number of bins used is 30. Naturally, this is not always the right number to use. You can set the number of bins yourself by using the bins argument in the geom_histogram geometric object. You can also set the width of the bins using the binwidth argument in the geom_histogram geometric object. But what number of bins, or bin width, is the right one to use? Unfortunately there is no hard rule for what the right bin number or width is. It depends entirely on your problem; the right number of bins or bin width is the one that helps you answer the question you asked. Choosing the correct setting for your problem is something that commonly takes iteration. We recommend setting the bin width (not the number of bins) because it often more directly corresponds to values in your problem of interest. 
For example, if you are looking at a histogram of human heights, a bin width of 1 inch would likely be reasonable, while the number of bins to use is not immediately clear. It’s usually a good idea to try out several bin widths to see which one most clearly captures your data in the context of the question you want to answer. To get a sense for how different bin widths affect visualizations, let’s experiment with the histogram that we have been working on in this section. In Figure 4.25, we compare the default setting with three other histograms where we set the binwidth to 0.001, 0.01 and 0.1. In this case, we can see that both the default number of bins and the binwidth of 0.01 are effective for helping answer our question. On the other hand, the bin widths of 0.001 and 0.1 are too small and too big, respectively. Figure 4.25: Effect of varying bin width on histograms. Adding layers to a ggplot plot object One of the powerful features of ggplot is that you can continue to iterate on a single plot object, adding and refining one layer at a time. If you stored your plot as a named object using the assignment symbol (<-), you can add to it using the + operator. For example, if we wanted to add a title to the last plot we created (morley_hist), we can use the + operator to add a title layer with the ggtitle function. The result is shown in Figure 4.26. morley_hist_title <- morley_hist + ggtitle("Speed of light experiments \\n were accurate to about 0.05%") morley_hist_title Figure 4.26: Histogram of relative accuracy split vertically by experiment with a descriptive title highlighting the take home message of the visualization. Note: Good visualization titles clearly communicate the take home message to the audience. Typically, that is the answer to the question you posed before making the visualization. 
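To see why the bin widths of 0.001 and 0.1 behaved so differently in Figure 4.25, it can help to estimate how many bins each width implies. The sketch below uses only base R and the built-in morley data; the helper approx_bins is our own illustrative name (it is not part of ggplot2), and the count is only approximate because ggplot may align bin edges slightly differently.

```r
# Relative accuracy (%) of each measurement, as in the mutate step earlier.
true_speed <- 299792.458
rel_acc <- 100 * ((299000 + morley$Speed) - true_speed) / true_speed

# Hypothetical helper: approximate number of bins needed to cover the
# range of the data for a given bin width.
approx_bins <- function(x, binwidth) {
  ceiling(diff(range(x)) / binwidth)
}

widths <- c(0.001, 0.01, 0.1)
n_bins <- sapply(widths, function(w) approx_bins(rel_acc, w))

# One row per candidate bin width.
data.frame(binwidth = widths, approx_bins = n_bins)
```

With 100 measurements in total, a width that produces more bins than observations leaves most bins nearly empty, while a width that produces only a couple of bins hides the shape of the distribution.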
4.6 Explaining the visualization Tell a story Typically, your visualization will not be shown entirely on its own, but rather it will be part of a larger presentation. Further, visualizations can provide supporting information for any aspect of a presentation, from opening to conclusion. For example, you could use an exploratory visualization in the opening of the presentation to motivate your choice of a more detailed data analysis / model, a visualization of the results of your analysis to show what your analysis has uncovered, or even one at the end of a presentation to help suggest directions for future work. Regardless of where it appears, a good way to discuss your visualization is as a story: Establish the setting and scope, and describe why you did what you did. Pose the question that your visualization answers. Justify why the question is important to answer. Answer the question using your visualization. Make sure you describe all aspects of the visualization (including describing the axes). But you can emphasize different aspects based on what is important to answer your question: trends (lines): Does a line describe the trend well? If so, the trend is linear, and if not, the trend is nonlinear. Is the trend increasing, decreasing, or neither? Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line “jump around” a lot) or smooth? distributions (scatters, histograms): How spread out are the data? Where are they centered, roughly? Are there any obvious “clusters” or “subgroups”, which would be visible as multiple bumps in the histogram? distributions of two variables (scatters): Is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible relationship (the data are too noisy to make any conclusion)? amounts (bars): How large are the bars relative to one another? Are there patterns in different groups of bars? 
Summarize your findings, and use them to motivate whatever you will discuss next. Below are two examples of how one might take these four steps in describing the example visualizations that appeared earlier in this chapter. Each of the steps is denoted by its numeral in parentheses, e.g., (3). Mauna Loa Atmospheric CO\\(_{\\text{2}}\\) Measurements: (1) Many current forms of energy generation and conversion, from automotive engines to natural gas power plants, rely on burning fossil fuels and produce greenhouse gases, primarily carbon dioxide (CO\\(_{\\text{2}}\\)), as a byproduct. Too much of these gases in the Earth’s atmosphere will cause it to trap more heat from the sun, leading to global warming. (2) In order to assess how quickly the atmospheric concentration of CO\\(_{\\text{2}}\\) is increasing over time, we (3) used a data set from the Mauna Loa observatory in Hawaii, consisting of CO\\(_{\\text{2}}\\) measurements from 1980 to 2020. We plotted the measured concentration of CO\\(_{\\text{2}}\\) (on the vertical axis) over time (on the horizontal axis). From this plot, you can see a clear, increasing, and generally linear trend over time. There is also a periodic oscillation that occurs once per year and aligns with Hawaii’s seasons, with an amplitude that is small relative to the growth in the overall trend. This shows that atmospheric CO\\(_{\\text{2}}\\) is clearly increasing over time, and (4) it is perhaps worth investigating the causes further. Michelson Light Speed Experiments: (1) Our modern understanding of the physics of light has advanced significantly from the late 1800s when Michelson and Morley’s experiments first demonstrated that it had a finite speed. We now know, based on modern experiments, that it moves at roughly 299,792.458 kilometers per second. (2) But how accurately were we first able to measure this fundamental physical constant, and did certain experiments produce more accurate results than others? 
(3) To better understand this, we plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as histograms stacked on top of one another. The horizontal axis shows the accuracy of the measurements relative to the true speed of light as we know it today, expressed as a percentage. From this visualization, you can see that most results had relative errors of at most 0.05%. You can also see that experiments 1 and 3 had measurements that were the farthest from the true value, and experiment 5 tended to provide the most consistently accurate result. (4) It would be worth further investigating the differences between these experiments to see why they produced different results. 4.7 Saving the visualization Choose the right output format for your needs Just as there are many ways to store data sets, there are many ways to store visualizations and images. Which one you choose can depend on several factors, such as file size/type limitations (e.g., if you are submitting your visualization as part of a conference paper or to a poster printing shop) and where it will be displayed (e.g., online, in a paper, on a poster, on a billboard, in talk slides). Generally speaking, images come in two flavors: raster formats and vector formats. Raster images are represented as a 2-D grid of square pixels, each with its own color. Raster images are often compressed before storing so they take up less space. A compressed format is lossy if the image cannot be perfectly re-created when loading and displaying, with the hope that the change is not noticeable. Lossless formats, on the other hand, allow a perfect display of the original image. 
Common file types: JPEG (.jpg, .jpeg): lossy, usually used for photographs PNG (.png): lossless, usually used for plots / line drawings BMP (.bmp): lossless, raw image data, no compression (rarely used) TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing Open-source software: GIMP Vector images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas. Common file types: SVG (.svg): general-purpose use EPS (.eps): general-purpose use (rarely used) Open-source software: Inkscape Raster and vector images have opposing advantages and disadvantages. A raster image of a fixed width / height takes the same amount of space and time to load regardless of what the image shows (the one caveat is that the compression algorithms may shrink the image more or run faster for certain images). A vector image takes space and time to load corresponding to how complex the image is, since the computer has to draw all the elements each time it is displayed. For example, if you have a scatter plot with 1 million points stored as an SVG file, it may take your computer some time to open the image. On the other hand, you can zoom into / scale up vector graphics as much as you like without the image looking bad, while raster images eventually start to look “pixelated.” Note: The portable document format PDF (.pdf) is commonly used to store both raster and vector formats. If you try to open a PDF and it’s taking a long time to load, it may be because there is a complicated vector graphics image that your computer is rendering. Let’s learn how to save plot images to these different file formats using a scatter plot of the Old Faithful data set (Hardle 1991), shown in Figure 4.27. 
library(svglite) # we need this to save SVG files faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions)) + geom_point() + labs(x = "Waiting time to next eruption \\n (minutes)", y = "Eruption time \\n (minutes)") + theme(text = element_text(size = 12)) faithful_plot Figure 4.27: Scatter plot of waiting time and eruption time. Now that we have a named ggplot plot object, we can use the ggsave function to save a file containing this image. ggsave takes the name of the image file to create as its first argument; this can include the path to the directory where you would like to save the file (e.g., img/viz/filename.png to save a file named filename.png to the img/viz/ directory). It takes the plot object to save as its second argument. The kind of image to save is specified by the file extension. For example, to create a PNG image file, we specify that the file extension is .png. Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types for the faithful_plot: ggsave("img/viz/faithful_plot.png", faithful_plot) ggsave("img/viz/faithful_plot.jpg", faithful_plot) ggsave("img/viz/faithful_plot.bmp", faithful_plot) ggsave("img/viz/faithful_plot.tiff", faithful_plot) ggsave("img/viz/faithful_plot.svg", faithful_plot) Table 4.1: File sizes of the scatter plot of the Old Faithful data set when saved as different file formats. Image type File type Image size Raster PNG 0.15 MB Raster JPG 0.42 MB Raster BMP 3.15 MB Raster TIFF 9.44 MB Vector SVG 0.03 MB Take a look at the file sizes in Table 4.1. Wow, that’s quite a difference! Notice that for such a simple plot with few graphical elements (points), the vector graphics format (SVG) is over 100 times smaller than the uncompressed raster images (BMP, TIFF). Also, note that the JPG format is nearly three times as large as the PNG format since the JPG compression algorithm is designed for natural images (not plots). 
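The size comparisons drawn from Table 4.1 can be checked with a little arithmetic. The sketch below simply re-enters the sizes from the table (in MB) as a named vector; the variable names are our own. If you want to make the same comparison for your own saved files, base R's file.size function returns the size of a file in bytes.

```r
# File sizes in MB, copied from Table 4.1.
sizes_mb <- c(PNG = 0.15, JPG = 0.42, BMP = 3.15, TIFF = 9.44, SVG = 0.03)

# How many times larger is each format than the SVG file?
ratio_to_svg <- sizes_mb / sizes_mb[['SVG']]
round(ratio_to_svg, 1)

# How many times larger is the JPG file than the PNG file?
jpg_vs_png <- sizes_mb[['JPG']] / sizes_mb[['PNG']]
jpg_vs_png
```

The uncompressed raster formats come out at roughly 105 (BMP) and 315 (TIFF) times the size of the SVG file, and the JPG file is about 2.8 times the size of the PNG file.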
In Figure 4.28, we also show what the images look like when we zoom in to a rectangle with only 2 data points. You can see why vector graphics formats are so useful: because they’re just based on mathematical formulas, vector graphics can be scaled up to arbitrary sizes. This makes them great for presentation media of all sizes, from papers to posters to billboards. Figure 4.28: Zoomed in faithful, raster (PNG, left) and vector (SVG, right) formats. 4.8 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Effective data visualization” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 4.9 Additional resources The ggplot2 R package page (Wickham, Chang, et al. 2021) is where you should look if you want to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the plotting functions from this chapter. The Fundamentals of Data Visualization (Wilke 2019) has a wealth of information on designing effective visualizations. It is not specific to any particular programming language or library. If you want to improve your visualization skills, this is the next place to look. R for Data Science (Wickham and Grolemund 2016) has a chapter on creating visualizations using ggplot2. This reference is specific to R and ggplot2, but provides a much more detailed introduction to the full set of tools that ggplot2 provides. 
This chapter is where you should look if you want to learn how to make more intricate visualizations in ggplot2 than what is included in this chapter. The theme function documentation is an excellent reference to see how you can fine tune the non-data aspects of your visualization. R for Data Science (Wickham and Grolemund 2016) has a chapter on dates and times. This chapter is where you should look if you want to learn about date vectors, including how to create them, and how to use them to effectively handle durations, periods and intervals using the lubridate package. References "],["classification1.html", "Chapter 5 Classification I: training & predicting 5.1 Overview 5.2 Chapter learning objectives 5.3 The classification problem 5.4 Exploring a data set 5.5 Classification with K-nearest neighbors 5.6 K-nearest neighbors with tidymodels 5.7 Data preprocessing with tidymodels 5.8 Putting it together in a workflow 5.9 Exercises", " Chapter 5 Classification I: training & predicting 5.1 Overview In previous chapters, we focused solely on descriptive and exploratory data analysis questions. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. 5.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Recognize situations where a classifier would be appropriate for making predictions. Describe what a training data set is and how it is used in classification. 
Interpret the output of a classifier. Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. Explain the K-nearest neighbors classification algorithm. Perform K-nearest neighbors classification in R using tidymodels. Use a recipe to center, scale, balance, and impute data as a preprocessing step. Combine preprocessing and model training using a workflow. 5.3 The classification problem In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” based on the email’s text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other variables (sometimes called features). Generally, a classifier assigns an observation without a known class (e.g., a new patient) to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations for which we do know the class (e.g., previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set; this name comes from the fact that we use these data to train, or teach, our classifier. Once taught, we can use the classifier to make predictions on new data for which we do not know the class. There are many possible methods that we could use to predict a categorical class/label for an observation. In this book, we will focus on the widely used K-nearest neighbors algorithm (Fix and Hodges 1951; Cover and Hart 1967). 
In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. It is also worth mentioning that there are many variations on the basic classification problem. For example, we focus on the setting of binary classification where only two classes are involved (e.g., a diagnosis of either healthy or diseased), but you may also run into multiclass classification problems with more than two categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold). 5.4 Exploring a data set In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian (Street, Wolberg, and Mangasarian 1993). Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). Diagnosis for each image was conducted by physicians. As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor? Answering this question is important because traditional, non-data-driven methods for tumor diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumors are not normally dangerous; the cells stay in the same place, and the tumor stops growing before it gets very large. By contrast, in malignant tumors, the cells invade the surrounding tissue and spread into nearby organs, where they can cause serious damage (Stanford Health Care 2021). 
Thus, it is important to quickly and accurately diagnose the tumor type to guide patient treatment. 5.4.1 Loading the cancer data Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the tidyverse package needed for our analysis. library(tidyverse) In this case, the file containing the breast cancer data set is a .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents: cancer <- read_csv("data/wdbc.csv") cancer ## # A tibble: 569 × 12 ## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65 ## 2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238 ## 3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36 ## 4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91 ## 5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37 ## 6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866 ## 7 8.44e5 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300 ## 8 8.45e7 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610 ## 9 8.45e5 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22 ## 10 8.45e7 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74 ## # ℹ 559 more rows ## # ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>, ## # Fractal_Dimension <dbl> 5.4.2 Describing the variables in the cancer data set Breast tumors can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. 
Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been standardized (centered and scaled); we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is: ID: identification number Class: the diagnosis (M = malignant or B = benign) Radius: the mean of distances from center to points on the perimeter Texture: the standard deviation of gray-scale values Perimeter: the length of the surrounding contour Area: the area inside the contour Smoothness: the local variation in radius lengths Compactness: the ratio of squared perimeter and area Concavity: severity of concave portions of the contour Concave Points: the number of concave portions of the contour Symmetry: how similar the nucleus is when mirrored Fractal Dimension: a measurement of how “rough” the perimeter is Below we use glimpse to preview the data frame. This function can make it easier to inspect the data when we have a lot of columns, as it prints the data such that the columns go down the page (instead of across). 
glimpse(cancer) ## Rows: 569 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786… ## $ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M… ## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875… ## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150… ## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7… ## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8… ## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2… ## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5… ## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3… ## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42… ## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, … ## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0… From the summary of the data above, we can see that Class is of type character (denoted by <chr>). We can use the distinct function to see all the unique values present in that column. We see that there are two diagnoses: benign, represented by “B”, and malignant, represented by “M”. cancer |> distinct(Class) ## # A tibble: 2 × 1 ## Class ## <chr> ## 1 M ## 2 B Since we will be working with Class as a categorical variable, it is a good idea to convert it to a factor type using the as_factor function. We will also improve the readability of our analysis by renaming “M” to “Malignant” and “B” to “Benign” using the fct_recode function. The fct_recode function is used to replace the names of factor values with other names. The arguments of fct_recode are the column that you want to modify, followed by any number of arguments of the form \"new name\" = \"old name\" to specify the renaming scheme. 
cancer <- cancer |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) glimpse(cancer) ## Rows: 569 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786… ## $ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan… ## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875… ## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150… ## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7… ## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8… ## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2… ## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5… ## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3… ## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42… ## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, … ## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0… Let’s verify that we have successfully converted the Class column to a factor variable and renamed its values to “Benign” and “Malignant” using the distinct function once more. cancer |> distinct(Class) ## # A tibble: 2 × 1 ## Class ## <fct> ## 1 Malignant ## 2 Benign 5.4.3 Exploring the cancer data Before we start doing any modeling, let’s explore our data set. Below we use the group_by, summarize and n functions to find the number and percentage of benign and malignant tumor observations in our data set. The n function within summarize, when paired with group_by, counts the number of observations in each Class group. Then we calculate the percentage in each group by dividing by the total number of observations and multiplying by 100. We have 357 (63%) benign and 212 (37%) malignant tumor observations. 
num_obs <- nrow(cancer) cancer |> group_by(Class) |> summarize( count = n(), percentage = n() / num_obs * 100 ) ## # A tibble: 2 × 3 ## Class count percentage ## <fct> <int> <dbl> ## 1 Malignant 212 37.3 ## 2 Benign 357 62.7 Next, let’s draw a scatter plot to visualize the relationship between the perimeter and concavity variables. Rather than use ggplot's default palette, we select our own colorblind-friendly colors—\"darkorange\" for orange and \"steelblue\" for blue—and pass them as the values argument to the scale_color_manual function. perim_concav <- cancer |> ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.6) + labs(x = "Perimeter (standardized)", y = "Concavity (standardized)", color = "Diagnosis") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) perim_concav Figure 5.1: Scatter plot of concavity versus perimeter colored by diagnosis label. In Figure 5.1, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. In other words, benign observations tend to have lower concavity and perimeter values, and malignant ones tend to have larger values. Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumor class). We could compute the standardized perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? Based on the scatter plot, how might you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. 
Based on our visualization, it seems like it may be possible to make accurate predictions of the Class variable (i.e., a diagnosis) for tumor images with unknown diagnoses. 5.5 Classification with K-nearest neighbors In order to actually make predictions for new observations in practice, we will need a classification algorithm. In this book, we will use the K-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign or malignant), the K-nearest neighbors classifier generally finds the \\(K\\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. \\(K\\) is a number that we must choose in advance; for now, we will assume that someone has chosen \\(K\\) for us. We will cover how to choose \\(K\\) ourselves in the next chapter. To illustrate the concept of K-nearest neighbors classification, we will walk through an example. Suppose we have a new observation, with standardized perimeter of 2 and standardized concavity of 4, whose diagnosis “Class” is unknown. This new observation is depicted by the red, diamond point in Figure 5.2. Figure 5.2: Scatter plot of concavity versus perimeter with new observation represented as a red diamond. Figure 5.3 shows that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis. Figure 5.3: Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label. Suppose we have another new observation with standardized perimeter 0.2 and concavity of 3.3. Looking at the scatter plot in Figure 5.4, how would you classify this red, diamond observation? 
The nearest neighbor to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. Figure 5.4: Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label. To improve the prediction we can consider several neighboring points, say \\(K = 3\\), that are closest to the new observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. As shown in Figure 5.5, we see that the diagnoses of 2 of the 3 nearest neighbors to our new observation are malignant. Therefore we take a majority vote and classify our new red, diamond observation as malignant. Figure 5.5: Scatter plot of concavity versus perimeter with three nearest neighbors. Here we chose the \\(K=3\\) nearest observations, but there is nothing special about \\(K=3\\). We could have used \\(K=4, 5\\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \\(K\\) in the next chapter. 5.5.1 Distance between points We decide which points are the \\(K\\) “nearest” to our new observation using the straight-line distance (we will often just refer to this as distance). Suppose we have two observations \\(a\\) and \\(b\\), each having two predictor variables, \\(x\\) and \\(y\\). Let \\(a_x\\) and \\(a_y\\) denote the values of variables \\(x\\) and \\(y\\) for observation \\(a\\); \\(b_x\\) and \\(b_y\\) have similar definitions for observation \\(b\\). 
Then the straight-line distance between observation \\(a\\) and \\(b\\) on the x-y plane can be computed using the following formula: \\[\\mathrm{Distance} = \\sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}\\] To find the \\(K\\) nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the \\(K\\) observations corresponding to the \\(K\\) smallest distance values. For example, suppose we want to use \\(K=5\\) neighbors to classify a new observation with perimeter of 0 and concavity of 3.5, shown as a red diamond in Figure 5.6. Let’s calculate the distances between our new point and each of the observations in the training set to find the \\(K=5\\) neighbors that are nearest to our new point. You will see in the mutate step below, we compute the straight-line distance using the formula above: we square the differences between the two observations’ perimeter and concavity coordinates, add the squared differences, and then take the square root. In order to find the \\(K=5\\) nearest neighbors, we will use the slice_min function. Figure 5.6: Scatter plot of concavity versus perimeter with new observation represented as a red diamond. new_obs_Perimeter <- 0 new_obs_Concavity <- 3.5 cancer |> select(ID, Perimeter, Concavity, Class) |> mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) |> slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance ## # A tibble: 5 × 5 ## ID Perimeter Concavity Class dist_from_new ## <dbl> <dbl> <dbl> <fct> <dbl> ## 1 86409 0.241 2.65 Benign 0.881 ## 2 887181 0.750 2.87 Malignant 0.980 ## 3 899667 0.623 2.54 Malignant 1.14 ## 4 907914 0.417 2.31 Malignant 1.26 ## 5 8710441 -1.16 4.04 Benign 1.28 In Table 5.1 we show in mathematical detail how the mutate step was used to compute the dist_from_new variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data. 
Table 5.1: Evaluating the distances from the new observation to each of its 5 nearest neighbors Perimeter Concavity Distance Class 0.24 2.65 \\(\\sqrt{(0 - 0.24)^2 + (3.5 - 2.65)^2} = 0.88\\) Benign 0.75 2.87 \\(\\sqrt{(0 - 0.75)^2 + (3.5 - 2.87)^2} = 0.98\\) Malignant 0.62 2.54 \\(\\sqrt{(0 - 0.62)^2 + (3.5 - 2.54)^2} = 1.14\\) Malignant 0.42 2.31 \\(\\sqrt{(0 - 0.42)^2 + (3.5 - 2.31)^2} = 1.26\\) Malignant -1.16 4.04 \\(\\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} = 1.28\\) Benign The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in Figure 5.7. Figure 5.7: Scatter plot of concavity versus perimeter with 5 nearest neighbors circled. 5.5.2 More than two explanatory variables Although the above description is directed toward two predictor variables, exactly the same K-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have \\(m\\) predictor variables for two observations \\(a\\) and \\(b\\), i.e., \\(a = (a_{1}, a_{2}, \\dots, a_{m})\\) and \\(b = (b_{1}, b_{2}, \\dots, b_{m})\\). The distance formula becomes \\[\\mathrm{Distance} = \\sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \\dots + (a_{m} - b_{m})^2}.\\] This formula still corresponds to a straight-line distance, just in a space with more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. 
Previously, when we had two variables, we added up the squared difference between each of our (two) variables, and then took the square root. Now we will do the same, except for our three variables. We calculate the distance as follows \\[\\mathrm{Distance} =\\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.\\] Let’s calculate the distances between our new observation and each of the observations in the training set to find the \\(K=5\\) neighbors when we have these three predictors. new_obs_Perimeter <- 0 new_obs_Concavity <- 3.5 new_obs_Symmetry <- 1 cancer |> select(ID, Perimeter, Concavity, Symmetry, Class) |> mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2 + (Symmetry - new_obs_Symmetry)^2)) |> slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance ## # A tibble: 5 × 6 ## ID Perimeter Concavity Symmetry Class dist_from_new ## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> ## 1 907914 0.417 2.31 0.837 Malignant 1.27 ## 2 90439701 1.33 2.89 1.10 Malignant 1.47 ## 3 925622 0.470 2.08 1.15 Malignant 1.50 ## 4 859471 -1.37 2.81 1.09 Benign 1.53 ## 5 899667 0.623 2.54 2.06 Malignant 1.56 Based on \\(K=5\\) nearest neighbors with these three predictors, we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are from the malignant class. Figure 5.8 shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. Figure 5.8: 3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes. 
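The general distance formula in this section can be sketched as a short helper function. The following is a minimal illustration in base R, not code from the cancer analysis itself; the function name euclidean_distance is hypothetical, and the numbers are the ones used in the worked example above.

```r
# Hypothetical helper (not part of tidymodels): straight-line distance
# between two observations given as numeric vectors of equal length.
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

# Values from the worked example: the new observation and one training
# observation, each described by perimeter, concavity, and symmetry.
new_obs  <- c(Perimeter = 0,     Concavity = 3.5,  Symmetry = 1)
neighbor <- c(Perimeter = 0.417, Concavity = 2.31, Symmetry = 0.837)

# Distance using all three predictors
dist_3d <- euclidean_distance(new_obs, neighbor)
round(dist_3d, 2)

# The same helper works for two predictors: drop symmetry from both vectors
dist_2d <- euclidean_distance(new_obs[1:2], neighbor[1:2])
round(dist_2d, 2)
```

Because the helper sums over however many elements the vectors contain, the same code covers two, three, or \\(m\\) predictors without modification.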
5.5.3 Summary of K-nearest neighbors algorithm In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following: Compute the distance between the new observation and each observation in the training set. Sort the data table in ascending order according to the distances. Choose the top \\(K\\) rows of the sorted table. Classify the new observation based on a majority vote of the neighbor classes. 5.6 K-nearest neighbors with tidymodels Coding the K-nearest neighbors algorithm in R ourselves can get complicated, especially if we want to handle multiple classes, more than two variables, or predict the class for multiple new observations. Thankfully, in R, the K-nearest neighbors algorithm is implemented in the parsnip R package (Kuhn and Vaughan 2021) included in tidymodels, along with many other models that you will encounter in this and future chapters of the book. The tidymodels collection provides tools to help make and use models, such as classifiers. Using the packages in this collection will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we will likely make. We start by loading tidymodels. library(tidymodels) Let’s walk through how to use tidymodels to perform K-nearest neighbors classification. We will use the cancer data set from above, with perimeter and concavity as predictors and \\(K = 5\\) neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with perimeter 0, concavity 3.5, and an unknown diagnosis label. 
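As a point of comparison for the tidymodels code that follows, the four steps summarized in Section 5.5.3 can be sketched by hand. This is a rough base R illustration under stated assumptions: toy_train is a hypothetical data frame holding only a handful of rows copied from the outputs shown earlier (not the full cancer data set), and knn_classify is a hypothetical helper, not a function from any package.

```r
# Toy training set: a few rows copied from the outputs shown earlier
# (standardized perimeter and concavity). This is NOT the full data set.
toy_train <- data.frame(
  Perimeter = c(0.241, 0.750, 0.623, 0.417, -1.16, 1.27, 1.68, -0.592),
  Concavity = c(2.65, 2.87, 2.54, 2.31, 4.04, 2.65, -0.0238, 1.91),
  Class = c("Benign", "Malignant", "Malignant", "Malignant",
            "Benign", "Malignant", "Malignant", "Malignant")
)

# Hypothetical helper implementing the four steps with a simple majority vote.
knn_classify <- function(train, new_perimeter, new_concavity, k) {
  # Step 1: distance from the new observation to every training observation
  dists <- sqrt((train$Perimeter - new_perimeter)^2 +
                (train$Concavity - new_concavity)^2)
  # Steps 2 and 3: sort by distance (ascending) and keep the first k rows
  nearest <- order(dists)[seq_len(k)]
  # Step 4: majority vote among the k neighbor classes
  # (in a tie, which.max returns the first class in alphabetical order)
  votes <- table(train$Class[nearest])
  names(votes)[which.max(votes)]
}

toy_prediction <- knn_classify(toy_train, new_perimeter = 0,
                               new_concavity = 3.5, k = 5)
toy_prediction
```

Even this small sketch has to make a decision about ties and is limited to two hard-coded predictors, which hints at why a well-tested implementation is preferable in practice.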
Let’s pick out our two desired predictor variables and class label and store them as a new data set named cancer_train: cancer_train <- cancer |> select(Class, Perimeter, Concavity) cancer_train ## # A tibble: 569 × 3 ## Class Perimeter Concavity ## <fct> <dbl> <dbl> ## 1 Malignant 1.27 2.65 ## 2 Malignant 1.68 -0.0238 ## 3 Malignant 1.57 1.36 ## 4 Malignant -0.592 1.91 ## 5 Malignant 1.78 1.37 ## 6 Malignant -0.387 0.866 ## 7 Malignant 1.14 0.300 ## 8 Malignant -0.0728 0.0610 ## 9 Malignant -0.184 1.22 ## 10 Malignant -0.329 1.74 ## # ℹ 559 more rows Next, we create a model specification for K-nearest neighbors classification by calling the nearest_neighbor function, specifying that we want to use \\(K = 5\\) neighbors (we will discuss how to choose \\(K\\) in the next chapter) and that each neighboring point should have the same weight when voting (weight_func = "rectangular"). The weight_func argument controls how neighbors vote when classifying a new observation; by setting it to "rectangular", each of the \\(K\\) nearest neighbors gets exactly 1 vote as described above. Other choices, which weigh each neighbor’s vote differently, can be found on the parsnip website. In the set_engine function, we specify which package or system will be used for training the model. Here kknn is the R package we will use for performing K-nearest neighbors classification. Finally, we specify that this is a classification problem with the set_mode function. knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |> set_engine("kknn") |> set_mode("classification") knn_spec ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 5 ## weight_func = rectangular ## ## Computational engine: kknn In order to fit the model on the breast cancer data, we need to pass the model specification and the data set to the fit function. We also need to specify what variables to use as predictors and what variable to use as the response.
Below, the Class ~ Perimeter + Concavity argument specifies that Class is the response variable (the one we want to predict), and both Perimeter and Concavity are to be used as the predictors. knn_fit <- knn_spec |> fit(Class ~ Perimeter + Concavity, data = cancer_train) We can also use a convenient shorthand syntax using a period, Class ~ ., to indicate that we want to use every variable except Class as a predictor in the model. In this particular setup, since Concavity and Perimeter are the only two predictors in the cancer_train data frame, Class ~ Perimeter + Concavity and Class ~ . are equivalent. In general, you can choose individual predictors using the + symbol, or you can specify to use all predictors using the . symbol. knn_fit <- knn_spec |> fit(Class ~ ., data = cancer_train) knn_fit ## parsnip model object ## ## ## Call: ## kknn::train.kknn(formula = Class ~ ., data = data, ks = min_rows(5, data, 5) ## , kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.07557118 ## Best kernel: rectangular ## Best k: 5 Here you can see the final trained model summary. It confirms that the computational engine used to train the model was kknn::train.kknn. It also shows the fraction of errors made by the K-nearest neighbors model, but we will ignore this for now and discuss it in more detail in the next chapter. Finally, it shows (somewhat confusingly) that the “best” weight function was “rectangular” and “best” setting of \\(K\\) was 5; but since we specified these earlier, R is just repeating those settings to us here. In the next chapter, we will actually let R find the value of \\(K\\) for us. Finally, we make the prediction on the new observation by calling the predict function, passing both the fit object we just created and the new observation itself. As above, when we ran the K-nearest neighbors classification algorithm manually, the knn_fit object classifies the new observation as malignant. 
Note that the predict function outputs a data frame with a single variable named .pred_class. new_obs <- tibble(Perimeter = 0, Concavity = 3.5) predict(knn_fit, new_obs) ## # A tibble: 1 × 1 ## .pred_class ## <fct> ## 1 Malignant Is this predicted malignant label the actual class for this observation? Well, we don’t know because we do not have this observation’s diagnosis— that is what we were trying to predict! The classifier’s prediction is not necessarily correct, but in the next chapter, we will learn ways to quantify how accurate we think our predictions are. 5.7 Data preprocessing with tidymodels 5.7.1 Centering and scaling When using K-nearest neighbors classification, the scale of each variable (i.e., its size and range of values) matters. Since the classifier predicts classes by identifying observations nearest to it, any variables with a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. For example, suppose you have a data set with two features, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the neighbor distances, a difference of $1000 is huge compared to a difference of 10 years of education. But for our conceptual understanding and answering of the problem, it’s the opposite; 10 years of education is huge compared to a difference of $1000 in yearly salary! In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in degrees Kelvin, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273 (even though they contain exactly the same information). 
Likewise, in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. Although this doesn’t affect the K-nearest neighbors classification algorithm, this large shift can change the outcome of using many other predictive models. To scale and center our data, we need to find our variables’ mean (the average, which quantifies the “central” value of a set of numbers) and standard deviation (a number quantifying how spread out values are). For each observed value of the variable, we subtract the mean (i.e., center the variable) and divide by the standard deviation (i.e., scale the variable). When we do this, the data is said to be standardized, and all variables in a data set will have a mean of 0 and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest neighbors algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. As before, we will convert the Class variable to the factor type and rename the values to “Malignant” and “Benign.” To keep things simple, we will just use the Area, Smoothness, and Class variables: unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Benign" = "B", "Malignant" = "M")) |> select(Class, Area, Smoothness) unscaled_cancer ## # A tibble: 569 × 3 ## Class Area Smoothness ## <fct> <dbl> <dbl> ## 1 Malignant 1001 0.118 ## 2 Malignant 1326 0.0847 ## 3 Malignant 1203 0.110 ## 4 Malignant 386. 0.142 ## 5 Malignant 1297 0.100 ## 6 Malignant 477. 0.128 ## 7 Malignant 1040 0.0946 ## 8 Malignant 578. 0.119 ## 9 Malignant 520. 0.127 ## 10 Malignant 476. 
0.119 ## # ℹ 559 more rows Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for smoothness. Will this affect predictions? In order to find out, we will create a scatter plot of these two predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the unscaled_cancer data set with tidymodels. In the tidymodels framework, all data preprocessing happens using a recipe from the recipes R package (Kuhn and Wickham 2021). Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the response, and all other variables are predictors: uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) uc_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 2 So far, there is not much in the recipe; just a statement about the number of response variables and predictors. Let’s add scaling (step_scale) and centering (step_center) steps for all of the predictors so that they each have a mean of 0 and standard deviation of 1. Note that tidymodels actually provides step_normalize, which does both centering and scaling in a single recipe step; in this book we will keep step_scale and step_center separate to emphasize conceptually that there are two steps happening. The prep function finalizes the recipe by using the data (here, unscaled_cancer) to compute anything necessary to run the recipe (in this case, the column means and standard deviations): uc_recipe <- uc_recipe |> step_scale(all_predictors()) |> step_center(all_predictors()) |> prep() uc_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 2 ## ## ── Training information ## Training data contained 569 data points and no incomplete rows.
## ## ── Operations ## • Scaling for: Area, Smoothness | Trained ## • Centering for: Area, Smoothness | Trained You can now see that the recipe includes a scaling and centering step for all predictor variables. Note that when you add a step to a recipe, you must specify what columns to apply the step to. Here we used the all_predictors() function to specify that each step should be applied to all predictor variables. However, there are a number of different arguments one could use here, as well as naming particular columns with the same syntax as the select function. For example: all_nominal() and all_numeric(): specify all categorical or all numeric variables all_predictors() and all_outcomes(): specify all predictor or all response variables Area, Smoothness: specify both the Area and Smoothness variable -Class: specify everything except the Class variable You can find a full set of all the steps and variable selection functions on the recipes reference page. At this point, we have calculated the required statistics based on the data input into the recipe, but the data are not yet scaled and centered. To actually scale and center the data, we need to apply the bake function to the unscaled data. scaled_cancer <- bake(uc_recipe, unscaled_cancer) scaled_cancer ## # A tibble: 569 × 3 ## Area Smoothness Class ## <dbl> <dbl> <fct> ## 1 0.984 1.57 Malignant ## 2 1.91 -0.826 Malignant ## 3 1.56 0.941 Malignant ## 4 -0.764 3.28 Malignant ## 5 1.82 0.280 Malignant ## 6 -0.505 2.24 Malignant ## 7 1.09 -0.123 Malignant ## 8 -0.219 1.60 Malignant ## 9 -0.384 2.20 Malignant ## 10 -0.509 1.58 Malignant ## # ℹ 559 more rows It may seem redundant that we had to both bake and prep to scale and center the data. However, we do this in two steps so we can specify a different data set in the bake step if we want. For example, we may want to specify new data that were not part of the training set. You may wonder why we are doing so much work just to center and scale our variables. 
Can’t we just manually scale and center the Area and Smoothness variables ourselves before building our K-nearest neighbors model? Well, technically yes; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a different centering / scaling than what we used while training. Proper use of a recipe helps keep our code simple, readable, and error-free. Furthermore, note that using prep and bake is required only when you want to inspect the result of the preprocessing steps yourself. You will see further on in Section 5.8 that tidymodels provides tools to automatically apply prep and bake as necessary without additional coding effort. Figure 5.9 shows the two scatter plots side-by-side—one for unscaled_cancer and one for scaled_cancer. Each has the same new observation annotated with its \\(K=3\\) nearest neighbors. In the original unstandardized data plot, you can see some odd choices for the three nearest neighbors. In particular, the “neighbors” are visually well within the cloud of benign observations, and the neighbors are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). Figure 5.10 shows a close-up of that region on the unstandardized plot. Here the computation of nearest neighbors is dominated by the much larger-scale area variable. The plot for standardized data on the right in Figure 5.9 shows a much more intuitively reasonable selection of nearest neighbors. Thus, standardizing the data can change things in an important way when we are using predictive algorithms. Standardizing your data should be a part of the preprocessing you do before predictive modeling and you should always think carefully about your problem domain and whether you need to standardize your data. Figure 5.9: Comparison of K = 3 nearest neighbors with unstandardized and standardized data. 
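The centering and scaling discussed in this section can also be sketched by hand to see why the larger-scale variable dominates the distance. The following base R snippet is only an illustration under stated assumptions: toy_unscaled is a hypothetical data frame holding the first ten rows of Area and Smoothness as displayed (rounded) in the output earlier, and standardize and area_share are hypothetical helpers. Since the means and standard deviations here come from ten rows rather than all 569, the standardized values will differ from those produced by the recipe.

```r
# First ten rows of Area and Smoothness as displayed (rounded) earlier;
# this is a small illustrative subset, not the full data set.
toy_unscaled <- data.frame(
  Area = c(1001, 1326, 1203, 386, 1297, 477, 1040, 578, 520, 476),
  Smoothness = c(0.118, 0.0847, 0.110, 0.142, 0.100,
                 0.128, 0.0946, 0.119, 0.127, 0.119)
)

# Hypothetical helper: subtract the mean (center), then divide by the
# standard deviation (scale).
standardize <- function(x) (x - mean(x)) / sd(x)

toy_scaled <- data.frame(
  Area = standardize(toy_unscaled$Area),
  Smoothness = standardize(toy_unscaled$Smoothness)
)

# Each standardized column has mean 0 and standard deviation 1
# (up to floating-point rounding).
round(colMeans(toy_scaled), 10)
sapply(toy_scaled, sd)

# Hypothetical helper: fraction of the squared distance between rows i and j
# that is contributed by the Area variable.
area_share <- function(df, i, j) {
  sq_diff <- (unlist(df[i, ]) - unlist(df[j, ]))^2
  unname(sq_diff["Area"] / sum(sq_diff))
}

share_unscaled <- area_share(toy_unscaled, 1, 4)
share_scaled   <- area_share(toy_scaled, 1, 4)
c(unscaled = share_unscaled, scaled = share_scaled)
```

On the unscaled values, essentially all of the squared distance between the two rows comes from Area; after standardization, Smoothness contributes a meaningful share as well.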
Figure 5.10: Close-up of three nearest neighbors for unstandardized data. 5.7.2 Balancing Another potential issue in a data set for a classifier is class imbalance, i.e., when one label is much more common than another. Since classifiers like the K-nearest neighbors algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the “pattern” of data suggests otherwise). Class imbalance is actually quite a common and important problem: from rare disease diagnosis to malicious email detection, there are many cases in which the “important” class to identify (presence of disease, malicious email) is much rarer than the “unimportant” class (no disease, normal email). To better illustrate the problem, let’s revisit the scaled breast cancer data, cancer; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the slice_head function, which takes two arguments: a data frame-like object, and the number of rows to select from the top (n). We will use the bind_rows function to glue the two resulting filtered data frames back together, and name the result rare_cancer. The new imbalanced data is shown in Figure 5.11. 
rare_cancer <- bind_rows( filter(cancer, Class == "Benign"), cancer |> filter(Class == "Malignant") |> slice_head(n = 3) ) |> select(Class, Perimeter, Concavity) rare_plot <- rare_cancer |> ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(x = "Perimeter (standardized)", y = "Concavity (standardized)", color = "Diagnosis") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) rare_plot Figure 5.11: Imbalanced data. Suppose we now decided to use \\(K = 7\\) in K-nearest neighbors classification. With only 3 observations of malignant tumors, the classifier will always predict that the tumor is benign, no matter what its concavity and perimeter are! This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, Figure 5.12 shows what happens for a new tumor observation that is quite close to three observations in the training data that were tagged as malignant. Figure 5.12: Imbalanced data with 7 nearest neighbors to a new observation highlighted. Figure 5.13 shows what happens if we set the background color of each area of the plot to the prediction the K-nearest neighbors classifier would make for a new observation at that location. We can see that the decision is always “benign,” corresponding to the blue color. Figure 5.13: Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data. Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by oversampling the rare class. 
In other words, we will replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbors algorithm. In order to do this, we will create a new recipe for the rare_cancer data that includes an oversampling step, using the step_upsample function from the themis R package. We show below how to do this, and also use the group_by and summarize functions to see that our classes are now balanced: library(themis) ups_recipe <- recipe(Class ~ ., data = rare_cancer) |> step_upsample(Class, over_ratio = 1, skip = FALSE) |> prep() ups_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 2 ## ## ── Training information ## Training data contained 360 data points and no incomplete rows. ## ## ── Operations ## • Up-sampling based on: Class | Trained upsampled_cancer <- bake(ups_recipe, rare_cancer) upsampled_cancer |> group_by(Class) |> summarize(n = n()) ## # A tibble: 2 × 2 ## Class n ## <fct> <int> ## 1 Malignant 357 ## 2 Benign 357 Now suppose we train our K-nearest neighbors classifier with \\(K=7\\) on this balanced data. Figure 5.14 shows what happens now when we set the background color of each area of our scatter plot to the decision the K-nearest neighbors classifier would make. We can see that the decision is more reasonable; when the points are close to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are closer to the benign tumor observations. Figure 5.14: Upsampled data with background color indicating the decision of the classifier. 5.7.3 Missing data One of the most common issues in real data sets in the wild is missing data, i.e., observations where the values of some of the variables were not recorded. Unfortunately, as common as it is, handling missing data properly is very challenging and generally relies on expert knowledge about the data, setting, and how the data were collected.
One typical challenge with missing data is that missing entries can be informative: the very fact that an entry is missing may be related to the values of other variables. For example, survey participants from a marginalized group of people may be less likely to respond to certain kinds of questions if they fear that answering honestly will come with negative consequences. In that case, if we were to simply throw away data with missing entries, we would bias the conclusions of the survey by inadvertently removing many members of that group of respondents. So ignoring this issue in real problems can easily lead to misleading analyses, with detrimental impacts. In this book, we will cover only those techniques for dealing with missing entries in situations where missing entries are just “randomly missing”, i.e., where the fact that certain entries are missing isn’t related to anything else about the observation. Let’s load and examine a modified subset of the tumor image data that has a few missing entries: missing_cancer <- read_csv("data/wdbc_missing.csv") |> select(Class, Radius, Texture, Perimeter) |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) missing_cancer ## # A tibble: 7 × 4 ## Class Radius Texture Perimeter ## <fct> <dbl> <dbl> <dbl> ## 1 Malignant NA NA 1.27 ## 2 Malignant 1.83 -0.353 1.68 ## 3 Malignant 1.58 NA 1.57 ## 4 Malignant -0.768 0.254 -0.592 ## 5 Malignant 1.75 -1.15 1.78 ## 6 Malignant -0.476 -0.835 -0.387 ## 7 Malignant 1.17 0.161 1.14 Recall that K-nearest neighbors classification makes predictions by computing the straight-line distance to nearby training observations, and hence requires access to the values of all variables for all observations in the training data. So how can we perform K-nearest neighbors classification in the presence of missing data?
Well, since there are not too many observations with missing entries, one option is to simply remove those observations prior to building the K-nearest neighbors classifier. We can accomplish this by using the drop_na function from tidyverse prior to working with the data. no_missing_cancer <- missing_cancer |> drop_na() no_missing_cancer ## # A tibble: 5 × 4 ## Class Radius Texture Perimeter ## <fct> <dbl> <dbl> <dbl> ## 1 Malignant 1.83 -0.353 1.68 ## 2 Malignant -0.768 0.254 -0.592 ## 3 Malignant 1.75 -1.15 1.78 ## 4 Malignant -0.476 -0.835 -0.387 ## 5 Malignant 1.17 0.161 1.14 However, this strategy will not work when many of the rows have missing entries, as we may end up throwing away too much data. In this case, another possible approach is to impute the missing entries, i.e., fill in synthetic values based on the other observations in the data set. One reasonable choice is to perform mean imputation, where missing entries are filled in using the mean of the present entries in each variable. To perform mean imputation, we add the step_impute_mean step to the tidymodels preprocessing recipe. impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |> step_impute_mean(all_predictors()) |> prep() impute_missing_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 3 ## ## ── Training information ## Training data contained 7 data points and 2 incomplete rows. ## ## ── Operations ## • Mean imputation for: Radius, Texture, Perimeter | Trained To visualize what mean imputation does, let’s just apply the recipe directly to the missing_cancer data frame using the bake function. The imputation step fills in the missing entries with the mean values of their corresponding variables. 
imputed_cancer <- bake(impute_missing_recipe, missing_cancer) imputed_cancer ## # A tibble: 7 × 4 ## Radius Texture Perimeter Class ## <dbl> <dbl> <dbl> <fct> ## 1 0.847 -0.385 1.27 Malignant ## 2 1.83 -0.353 1.68 Malignant ## 3 1.58 -0.385 1.57 Malignant ## 4 -0.768 0.254 -0.592 Malignant ## 5 1.75 -1.15 1.78 Malignant ## 6 -0.476 -0.835 -0.387 Malignant ## 7 1.17 0.161 1.14 Malignant Many other options for missing data imputation can be found in the recipes documentation. However you decide to handle missing data in your data analysis, it is always crucial to think critically about the setting, how the data were collected, and the question you are answering. 5.8 Putting it together in a workflow The tidymodels package collection also provides the workflow, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole pipeline, let’s start from scratch with the wdbc_unscaled.csv data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed: # load the unscaled cancer data # and make sure the response variable, Class, is a factor unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) # create the K-NN model knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |> set_engine("kknn") |> set_mode("classification") # create the centering / scaling recipe uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |> step_scale(all_predictors()) |> step_center(all_predictors()) Note that each of these steps is exactly the same as earlier, except for one major difference: we did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the formula Class ~ Area + Smoothness (instead of Class ~ .) in the recipe. 
You will also notice that we did not call prep() on the recipe; this is unnecessary when it is placed in a workflow. We will now place these steps in a workflow using the add_recipe and add_model functions, and finally we will use the fit function to run the whole workflow on the unscaled_cancer data. Note another difference from earlier here: we do not include a formula in the fit function. This is again because we included the formula in the recipe, so there is no need to respecify it: knn_fit <- workflow() |> add_recipe(uc_recipe) |> add_model(knn_spec) |> fit(data = unscaled_cancer) knn_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## ## Call: ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(7, data, 5), ## kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.112478 ## Best kernel: rectangular ## Best k: 7 As before, the fit object lists the function that trains the model as well as the “best” settings for the number of neighbors and weight function (for now, these are just the values we chose manually when we created knn_spec above). But now the fit object also includes information about the overall workflow, including the centering and scaling preprocessing steps. In other words, when we use the predict function with the knn_fit object to make a prediction for a new observation, it will first apply the same recipe steps to the new observation. As an example, we will predict the class label of two new observations: one with Area = 500 and Smoothness = 0.075, and one with Area = 1500 and Smoothness = 0.1. 
new_observation <- tibble(Area = c(500, 1500), Smoothness = c(0.075, 0.1)) prediction <- predict(knn_fit, new_observation) prediction ## # A tibble: 2 × 1 ## .pred_class ## <fct> ## 1 Benign ## 2 Malignant The classifier predicts that the first observation is benign, while the second is malignant. Figure 5.15 visualizes the predictions that this trained K-nearest neighbors model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. For the interested reader who wants a learning challenge, we now include it below. The basic idea is to create a grid of synthetic new observations using the expand.grid function, predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low alpha value) and large point radius. See if you can figure out what each line is doing! Note: Understanding this code is not required for the remainder of the textbook. It is included for those readers who would like to use similar visualizations in their own data analyses. # create the grid of area/smoothness vals, and arrange in a data frame are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100) smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100) asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid)) # use the fit workflow to make predictions at the grid points knnPredGrid <- predict(knn_fit, asgrid) # bind the predictions as a new column with the grid points prediction_table <- bind_cols(knnPredGrid, asgrid) |> rename(Class = .pred_class) # plot: # 1. the colored scatter of the original data # 2. 
the faded colored scatter for the grid points wkflw_plot <- ggplot() + geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.75) + geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.02, size = 5) + labs(color = "Diagnosis", x = "Area", y = "Smoothness") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) wkflw_plot Figure 5.15: Scatter plot of smoothness versus area where background color indicates the decision of the classifier. 5.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Classification I: training and predicting” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["classification2.html", "Chapter 6 Classification II: evaluation & tuning 6.1 Overview 6.2 Chapter learning objectives 6.3 Evaluating performance 6.4 Randomness and seeds 6.5 Evaluating performance with tidymodels 6.6 Tuning the classifier 6.7 Summary 6.8 Predictor variable selection 6.9 Exercises 6.10 Additional resources", " Chapter 6 Classification II: evaluation & tuning 6.1 Overview This chapter continues the introduction to predictive modeling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to evaluate the performance of a classifier, as well as how to improve the classifier (where possible) to maximize its accuracy. 
6.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe what training, validation, and test data sets are and how they are used in classification. Split data into training, validation, and test data sets. Describe what a random seed is and its importance in reproducible data analysis. Set the random seed in R using the set.seed function. Describe and interpret accuracy, precision, recall, and confusion matrices. Evaluate classification accuracy, precision, and recall in R using a test set, a single validation set, and cross-validation. Produce a confusion matrix in R. Choose the number of neighbors in a K-nearest neighbors classifier by maximizing estimated cross-validation accuracy. Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors classification. Describe the advantages and disadvantages of the K-nearest neighbors classification algorithm. 6.3 Evaluating performance Sometimes our classifier might make the wrong prediction. A classifier does not need to be right 100% of the time to be useful, though we don’t want the classifier to make too many wrong predictions. How do we measure how “good” our classifier is? Let’s revisit the breast cancer images data (Street, Wolberg, and Mangasarian 1993) and think about how our classifier will be used in practice. A biopsy will be performed on a new patient’s tumor, the resulting image will be analyzed, and the classifier will be asked to decide whether the tumor is benign or malignant. The key word here is new: our classifier is “good” if it provides accurate predictions on data not seen during training, as this implies that it has actually learned about the relationship between the predictor variables and response variable, as opposed to simply memorizing the labels of individual training data examples. But then, how can we evaluate our classifier without visiting the hospital to collect more tumor images? 
The trick is to split the data into a training set and test set (Figure 6.1) and use only the training set when building the classifier. Then, to evaluate the performance of the classifier, we first set aside the labels from the test set, and then use the classifier to predict the labels in the test set. If our predictions match the actual labels for the observations in the test set, then we have some confidence that our classifier might also accurately predict the class labels for new observations without known class labels. Note: If there were a golden rule of machine learning, it might be this: you cannot use the test data to build the model! If you do, the model gets to “see” the test data in advance, making it look more accurate than it really is. Imagine how bad it would be to overestimate your classifier’s accuracy when predicting whether a patient’s tumor is malignant or benign! Figure 6.1: Splitting the data into training and testing sets. How exactly can we assess how well our predictions match the actual labels for the observations in the test set? One way we can do this is to calculate the prediction accuracy. This is the fraction of examples for which the classifier made the correct prediction. To calculate this, we divide the number of correct predictions by the number of predictions made. The process for assessing if our predictions match the actual labels in the test set is illustrated in Figure 6.2. \\[\\mathrm{accuracy} = \\frac{\\mathrm{number \\; of \\; correct \\; predictions}}{\\mathrm{total \\; number \\; of \\; predictions}}\\] Figure 6.2: Process for splitting the data and finding the prediction accuracy. Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with a single number. But prediction accuracy by itself does not tell the whole story. 
In particular, accuracy alone only tells us how often the classifier makes mistakes in general, but does not tell us anything about the kinds of mistakes the classifier makes. A more comprehensive view of performance can be obtained by additionally examining the confusion matrix. The confusion matrix shows how many test set labels of each type are predicted correctly and incorrectly, which gives us more detail about the kinds of mistakes the classifier tends to make. Table 6.1 shows an example of what a confusion matrix might look like for the tumor image data with a test set of 65 observations. Table 6.1: An example confusion matrix for the tumor image data. Actually Malignant Actually Benign Predicted Malignant 1 4 Predicted Benign 3 57 In the example in Table 6.1, we see that there was 1 malignant observation that was correctly classified as malignant (top left corner), and 57 benign observations that were correctly classified as benign (bottom right corner). However, we can also see that the classifier made some mistakes: it classified 3 malignant observations as benign, and 4 benign observations as malignant. The accuracy of this classifier is roughly 89%, given by the formula \\[\\mathrm{accuracy} = \\frac{\\mathrm{number \\; of \\; correct \\; predictions}}{\\mathrm{total \\; number \\; of \\; predictions}} = \\frac{1+57}{1+57+4+3} = 0.892.\\] But we can also see that the classifier only identified 1 out of 4 total malignant tumors; in other words, it misclassified 75% of the malignant cases present in the data set! In this example, misclassifying a malignant tumor is a potentially disastrous error, since it may lead to a patient who requires treatment not receiving it. Since we are particularly interested in identifying malignant cases, this classifier would likely be unacceptable even with an accuracy of 89%. Focusing more on one label than the other is common in classification problems. 
In such cases, we typically refer to the label we are more interested in identifying as the positive label, and the other as the negative label. In the tumor example, we would refer to malignant observations as positive, and benign observations as negative. We can then use the following terms to talk about the four kinds of prediction that the classifier can make, corresponding to the four entries in the confusion matrix: True Positive: A malignant observation that was classified as malignant (top left in Table 6.1). False Positive: A benign observation that was classified as malignant (top right in Table 6.1). True Negative: A benign observation that was classified as benign (bottom right in Table 6.1). False Negative: A malignant observation that was classified as benign (bottom left in Table 6.1). A perfect classifier would have zero false negatives and false positives (and therefore, 100% accuracy). However, classifiers in practice will almost always make some errors. So you should think about which kinds of error are most important in your application, and use the confusion matrix to quantify and report them. Two commonly used metrics that we can compute using the confusion matrix are the precision and recall of the classifier. These are often reported together with accuracy. Precision quantifies how many of the positive predictions the classifier made were actually positive. Intuitively, we would like a classifier to have a high precision: for a classifier with high precision, if the classifier reports that a new observation is positive, we can trust that the new observation is indeed positive. We can compute the precision of a classifier using the entries in the confusion matrix, with the formula \\[\\mathrm{precision} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; predictions}}.\\] Recall quantifies how many of the positive observations in the test set were identified as positive. 
Intuitively, we would like a classifier to have a high recall: for a classifier with high recall, if there is a positive observation in the test data, we can trust that the classifier will find it. We can also compute the recall of the classifier using the entries in the confusion matrix, with the formula \\[\\mathrm{recall} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; test \\; set \\; observations}}.\\] In the example presented in Table 6.1, we have that the precision and recall are \\[\\mathrm{precision} = \\frac{1}{1+4} = 0.20, \\quad \\mathrm{recall} = \\frac{1}{1+3} = 0.25.\\] So even with an accuracy of 89%, the precision and recall of the classifier were both relatively low. For this data analysis context, recall is particularly important: if someone has a malignant tumor, we certainly want to identify it. A recall of just 25% would likely be unacceptable! Note: It is difficult to achieve both high precision and high recall at the same time; models with high precision tend to have low recall and vice versa. As an example, we can easily make a classifier that has perfect recall: just always guess positive! This classifier will of course find every positive observation in the test set, but it will make lots of false positive predictions along the way and have low precision. Similarly, we can easily make a classifier that has perfect precision: never guess positive! This classifier will never incorrectly identify an observation as positive, but it will make a lot of false negative predictions along the way. In fact, this classifier will have 0% recall! Of course, most real classifiers fall somewhere in between these two extremes. But these examples serve to show that in settings where one of the classes is of interest (i.e., there is a positive label), there is a trade-off between precision and recall that one has to make when designing a classifier.
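As a quick check of the arithmetic in this section, the accuracy, precision, and recall for the example confusion matrix in Table 6.1 can be recomputed by hand in base R. This is only a minimal sketch: the variable names below are illustrative and do not come from tidymodels.

```r
# entries of the example confusion matrix in Table 6.1
# (variable names are illustrative, not part of any package)
true_positive  <- 1   # malignant, predicted malignant
false_positive <- 4   # benign, predicted malignant
false_negative <- 3   # malignant, predicted benign
true_negative  <- 57  # benign, predicted benign

total_predictions <- true_positive + false_positive +
  false_negative + true_negative

# fraction of all predictions that were correct
accuracy <- (true_positive + true_negative) / total_predictions

# fraction of positive predictions that were actually positive
precision <- true_positive / (true_positive + false_positive)

# fraction of actual positives that were predicted positive
recall <- true_positive / (true_positive + false_negative)

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

The three values should agree with the 0.892, 0.20, and 0.25 computed in the formulas above.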
6.4 Randomness and seeds Beginning in this chapter, our data analyses will often involve the use of randomness. We use randomness any time we need to make a decision in our analysis that needs to be fair, unbiased, and not influenced by human input. For example, in this chapter, we need to split a data set into a training set and test set to evaluate our classifier. We certainly do not want to choose how to split the data ourselves by hand, as we want to avoid accidentally influencing the result of the evaluation. So instead, we let R randomly split the data. In future chapters we will use randomness in many other ways, e.g., to help us select a small subset of data from a larger data set, to pick groupings of data, and more. However, the use of randomness runs counter to one of the main tenets of good data analysis practice: reproducibility. Recall that a reproducible analysis produces the same result each time it is run; if we include randomness in the analysis, would we not get a different result each time? The trick is that in R—and other programming languages—randomness is not actually random! Instead, R uses a random number generator that produces a sequence of numbers that are completely determined by a seed value. Once you set the seed value using the set.seed function, everything after that point may look random, but is actually totally reproducible. As long as you pick the same seed value, you get the same result! Let’s use an example to investigate how seeds work in R. Say we want to randomly pick 10 numbers from 0 to 9 in R using the sample function, but we want it to be reproducible. Before using the sample function, we call set.seed, and pass it any integer as an argument. Here, we pass in the number 1. set.seed(1) random_numbers1 <- sample(0:9, 10, replace = TRUE) random_numbers1 ## [1] 8 3 6 0 1 6 1 2 0 4 You can see that random_numbers1 is a list of 10 numbers from 0 to 9 that, from all appearances, looks random. 
If we run the sample function again, we will get a fresh batch of 10 numbers that also look random. random_numbers2 <- sample(0:9, 10, replace = TRUE) random_numbers2 ## [1] 4 9 5 9 6 8 4 4 8 8 If we want to force R to produce the same sequences of random numbers, we can simply call the set.seed function again with the same argument value. set.seed(1) random_numbers1_again <- sample(0:9, 10, replace = TRUE) random_numbers1_again ## [1] 8 3 6 0 1 6 1 2 0 4 random_numbers2_again <- sample(0:9, 10, replace = TRUE) random_numbers2_again ## [1] 4 9 5 9 6 8 4 4 8 8 Notice that after setting the seed, we get the same two sequences of numbers in the same order. random_numbers1 and random_numbers1_again produce the same sequence of numbers, and the same can be said about random_numbers2 and random_numbers2_again. And if we choose a different value for the seed—say, 4235—we obtain a different sequence of random numbers. set.seed(4235) random_numbers1_different <- sample(0:9, 10, replace = TRUE) random_numbers1_different ## [1] 8 3 1 4 6 8 8 4 1 7 random_numbers2_different <- sample(0:9, 10, replace = TRUE) random_numbers2_different ## [1] 3 7 8 2 8 8 6 3 3 8 In other words, even though the sequences of numbers that R is generating look random, they are totally determined when we set a seed value! So what does this mean for data analysis? Well, sample is certainly not the only function that uses randomness in R. Many of the functions that we use in tidymodels, tidyverse, and beyond use randomness—some of them without even telling you about it. So at the beginning of every data analysis you do, right after loading packages, you should call the set.seed function and pass it an integer that you pick. Also note that when R starts up, it creates its own seed to use. So if you do not explicitly call the set.seed function in your code, your results will likely not be reproducible. And finally, be careful to set the seed only once at the beginning of a data analysis. 
Each time you set the seed, you are inserting your own human input, thereby influencing the analysis. If you use set.seed many times throughout your analysis, the randomness that R uses will not look as random as it should. In summary: if you want your analysis to be reproducible, i.e., produce the same result each time you run it, make sure to use set.seed exactly once at the beginning of the analysis. Different argument values in set.seed lead to different patterns of randomness, but as long as you pick the same argument value your result will be the same. In the remainder of the textbook, we will set the seed once at the beginning of each chapter. 6.5 Evaluating performance with tidymodels Back to evaluating classifiers now! In R, we can use the tidymodels package not only to perform K-nearest neighbors classification, but also to assess how well our classification worked. Let’s work through an example of how to use tools from tidymodels to evaluate a classifier using the breast cancer data set from the previous chapter. We begin the analysis by loading the packages we require, reading in the breast cancer data, and then making a quick scatter plot visualization of tumor cell concavity versus smoothness colored by diagnosis in Figure 6.3. You will also notice that we set the random seed here at the beginning of the analysis using the set.seed function, as described in Section 6.4. 
# load packages library(tidyverse) library(tidymodels) # set the seed set.seed(1) # load data cancer <- read_csv("data/wdbc_unscaled.csv") |> # convert the character Class variable to the factor datatype mutate(Class = as_factor(Class)) |> # rename the factor values to be more readable mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) # create scatter plot of tumor cell concavity versus smoothness, # labeling the points by diagnosis class perim_concav <- cancer |> ggplot(aes(x = Smoothness, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(color = "Diagnosis") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) perim_concav Figure 6.3: Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label. 6.5.1 Create the train / test split Once we have decided on a predictive question to answer and done some preliminary exploration, the very next thing to do is to split the data into the training and test sets. Typically, the training set is between 50% and 95% of the data, while the test set is the remaining 5% to 50%; the intuition is that you want to trade off between training an accurate model (by using a larger training data set) and getting an accurate evaluation of its performance (by using a larger test data set). Here, we will use 75% of the data for training, and 25% for testing. The initial_split function from tidymodels handles the procedure of splitting the data for us. It also applies two very important steps when splitting to ensure that the accuracy estimates from the test data are reasonable. First, it shuffles the data before splitting, which ensures that any ordering present in the data does not influence the data that ends up in the training and testing sets. Second, it stratifies the data by the class label, to ensure that roughly the same proportion of each class ends up in both the training and testing sets.
For example, in our data set, roughly 63% of the observations are from the benign class, and 37% are from the malignant class, so initial_split ensures that roughly 63% of the training data are benign, 37% of the training data are malignant, and the same proportions exist in the testing data. Let’s use the initial_split function to create the training and testing sets. We will specify that prop = 0.75 so that 75% of our original data set ends up in the training set. We will also set the strata argument to the categorical label variable (here, Class) to ensure that the training and testing subsets contain the right proportions of each category of observation. The training and testing functions then extract the training and testing data sets into two separate data frames. Note that the initial_split function uses randomness, but since we set the seed earlier in the chapter, the split will be reproducible. cancer_split <- initial_split(cancer, prop = 0.75, strata = Class) cancer_train <- training(cancer_split) cancer_test <- testing(cancer_split) glimpse(cancer_train) ## Rows: 426 ## Columns: 12 ## $ ID <dbl> 8510426, 8510653, 8510824, 854941, 85713702, 857155,… ## $ Class <fct> Benign, Benign, Benign, Benign, Benign, Benign, Beni… ## $ Radius <dbl> 13.540, 13.080, 9.504, 13.030, 8.196, 12.050, 13.490… ## $ Texture <dbl> 14.36, 15.71, 12.44, 18.42, 16.84, 14.63, 22.30, 21.… ## $ Perimeter <dbl> 87.46, 85.63, 60.34, 82.61, 51.71, 78.04, 86.91, 74.… ## $ Area <dbl> 566.3, 520.0, 273.9, 523.8, 201.9, 449.3, 561.0, 427… ## $ Smoothness <dbl> 0.09779, 0.10750, 0.10240, 0.08983, 0.08600, 0.10310… ## $ Compactness <dbl> 0.08129, 0.12700, 0.06492, 0.03766, 0.05943, 0.09092… ## $ Concavity <dbl> 0.066640, 0.045680, 0.029560, 0.025620, 0.015880, 0.… ## $ Concave_Points <dbl> 0.047810, 0.031100, 0.020760, 0.029230, 0.005917, 0.… ## $ Symmetry <dbl> 0.1885, 0.1967, 0.1815, 0.1467, 0.1769, 0.1675, 0.18… ## $ Fractal_Dimension <dbl> 0.05766, 0.06811, 0.06905, 0.05863, 0.06503, 
0.06043… glimpse(cancer_test) ## Rows: 143 ## Columns: 12 ## $ ID <dbl> 842517, 84300903, 84501001, 84610002, 848406, 848620… ## $ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan… ## $ Radius <dbl> 20.570, 19.690, 12.460, 15.780, 14.680, 16.130, 19.8… ## $ Texture <dbl> 17.77, 21.25, 24.04, 17.89, 20.13, 20.68, 22.15, 14.… ## $ Perimeter <dbl> 132.90, 130.00, 83.97, 103.60, 94.74, 108.10, 130.00… ## $ Area <dbl> 1326.0, 1203.0, 475.9, 781.0, 684.5, 798.8, 1260.0, … ## $ Smoothness <dbl> 0.08474, 0.10960, 0.11860, 0.09710, 0.09867, 0.11700… ## $ Compactness <dbl> 0.07864, 0.15990, 0.23960, 0.12920, 0.07200, 0.20220… ## $ Concavity <dbl> 0.08690, 0.19740, 0.22730, 0.09954, 0.07395, 0.17220… ## $ Concave_Points <dbl> 0.070170, 0.127900, 0.085430, 0.066060, 0.052590, 0.… ## $ Symmetry <dbl> 0.1812, 0.2069, 0.2030, 0.1842, 0.1586, 0.2164, 0.15… ## $ Fractal_Dimension <dbl> 0.05667, 0.05999, 0.08243, 0.06082, 0.05922, 0.07356… We can see from glimpse in the code above that the training set contains 426 observations, while the test set contains 143 observations. This corresponds to a train / test split of 75% / 25%, as desired. Recall from Chapter 5 that we use the glimpse function to view data with a large number of columns, as it prints the data such that the columns go down the page (instead of across). We can use group_by and summarize to find the percentage of malignant and benign classes in cancer_train and we see about 63% of the training data are benign and 37% are malignant, indicating that our class proportions were roughly preserved when we split the data. 
cancer_proportions <- cancer_train |> group_by(Class) |> summarize(n = n()) |> mutate(percent = 100*n/nrow(cancer_train)) cancer_proportions ## # A tibble: 2 × 3 ## Class n percent ## <fct> <int> <dbl> ## 1 Malignant 159 37.3 ## 2 Benign 267 62.7 6.5.2 Preprocess the data As we mentioned in the last chapter, K-nearest neighbors is sensitive to the scale of the predictors, so we should perform some preprocessing to standardize them. An additional consideration is that we should create the standardization preprocessor using only the training data. This ensures that our test data does not influence any aspect of our model training. Once we have created the standardization preprocessor, we can then apply it separately to both the training and test data sets. Fortunately, the recipe framework from tidymodels helps us handle this properly. Below we construct the recipe using only the training data (due to data = cancer_train in the first line). cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) 6.5.3 Train the classifier Now that we have split our original data set into training and test sets, we can create our K-nearest neighbors classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose the number \\(K\\) of neighbors to be 3, and use concavity and smoothness as the predictors. As before, we need to create a model specification, combine the model specification and recipe into a workflow, and then finally use fit with the training data cancer_train to build the classifier.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |> set_engine("kknn") |> set_mode("classification") knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit(data = cancer_train) knn_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## ## Call: ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3, data, 5), ## kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.1126761 ## Best kernel: rectangular ## Best k: 3 6.5.4 Predict the labels in the test set Now that we have a K-nearest neighbors classifier object, we can use it to predict the class labels for our test set. We use the bind_cols function to add the column of predictions to the original test data, creating the cancer_test_predictions data frame. The Class variable contains the actual diagnoses, while the .pred_class variable contains the predicted diagnoses from the classifier. cancer_test_predictions <- predict(knn_fit, cancer_test) |> bind_cols(cancer_test) cancer_test_predictions ## # A tibble: 143 × 13 ## .pred_class ID Class Radius Texture Perimeter Area Smoothness ## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Benign 842517 Malignant 20.6 17.8 133. 1326 0.0847 ## 2 Malignant 84300903 Malignant 19.7 21.2 130 1203 0.110 ## 3 Malignant 84501001 Malignant 12.5 24.0 84.0 476. 0.119 ## 4 Malignant 84610002 Malignant 15.8 17.9 104. 781 0.0971 ## 5 Benign 848406 Malignant 14.7 20.1 94.7 684. 0.0987 ## 6 Malignant 84862001 Malignant 16.1 20.7 108. 799. 0.117 ## 7 Malignant 849014 Malignant 19.8 22.2 130 1260 0.0983 ## 8 Malignant 8511133 Malignant 15.3 14.3 102. 704. 0.107 ## 9 Malignant 852552 Malignant 16.6 21.4 110 905. 0.112 ## 10 Malignant 853612 Malignant 11.8 18.7 77.9 441.
0.111 ## # ℹ 133 more rows ## # ℹ 5 more variables: Compactness <dbl>, Concavity <dbl>, Concave_Points <dbl>, ## # Symmetry <dbl>, Fractal_Dimension <dbl> 6.5.5 Evaluate performance Finally, we can assess our classifier’s performance. First, we will examine accuracy. To do this we use the metrics function from tidymodels, specifying the truth and estimate arguments: cancer_test_predictions |> metrics(truth = Class, estimate = .pred_class) |> filter(.metric == "accuracy") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.853 In the metrics data frame, we filtered the .metric column since we are interested in the accuracy row. Other entries involve other metrics that are beyond the scope of this book. Looking at the value of the .estimate variable shows that the estimated accuracy of the classifier on the test data was 85%. To compute the precision and recall, we can use the precision and recall functions from tidymodels. We first check the order of the labels in the Class variable using the levels function: cancer_test_predictions |> pull(Class) |> levels() ## [1] "Malignant" "Benign" This shows that \"Malignant\" is the first level. Therefore we will set the truth and estimate arguments to Class and .pred_class as before, but also specify that the “positive” class corresponds to the first factor level via event_level=\"first\". If the labels were in the other order, we would instead use event_level=\"second\". cancer_test_predictions |> precision(truth = Class, estimate = .pred_class, event_level = "first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 precision binary 0.767 cancer_test_predictions |> recall(truth = Class, estimate = .pred_class, event_level = "first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 recall binary 0.868 The output shows that the estimated precision and recall of the classifier on the test data were 77% and 87%, respectively.
Finally, we can look at the confusion matrix for the classifier using the conf_mat function. confusion <- cancer_test_predictions |> conf_mat(truth = Class, estimate = .pred_class) confusion ## Truth ## Prediction Malignant Benign ## Malignant 46 14 ## Benign 7 76 The confusion matrix shows 46 observations were correctly predicted as malignant, and 76 were correctly predicted as benign. It also shows that the classifier made some mistakes; in particular, it classified 7 observations as benign when they were actually malignant, and 14 observations as malignant when they were actually benign. Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what R reported. \\[\\mathrm{accuracy} = \\frac{\\mathrm{number \\; of \\; correct \\; predictions}}{\\mathrm{total \\; number \\; of \\; predictions}} = \\frac{46+76}{46+76+14+7} = 0.853\\] \\[\\mathrm{precision} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; predictions}} = \\frac{46}{46 + 14} = 0.767\\] \\[\\mathrm{recall} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; test \\; set \\; observations}} = \\frac{46}{46+7} = 0.868\\] 6.5.6 Critically analyze performance We now know that the classifier was 85% accurate on the test data set, and had a precision of 77% and a recall of 87%. That sounds pretty good! Wait, is it good? Or do we need something higher? In general, a good value for accuracy (as well as precision and recall, if applicable) depends on the application; you must critically analyze your accuracy in the context of the problem you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99% of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!). 
And beyond just accuracy, we need to consider the precision and recall: as mentioned earlier, the kind of mistake the classifier makes is important in many applications as well. In the previous example with 99% benign observations, it might be very bad for the classifier to predict “benign” when the actual class is “malignant” (a false negative), as this might result in a patient not receiving appropriate medical attention. In other words, in this context, we need the classifier to have a high recall. On the other hand, it might be less bad for the classifier to guess “malignant” when the actual class is “benign” (a false positive), as the patient will then likely see a doctor who can provide an expert diagnosis. In other words, we are fine with sacrificing some precision in the interest of achieving high recall. This is why it is important not only to look at accuracy, but also the confusion matrix. However, there is always an easy baseline that you can compare to for any classification problem: the majority classifier. The majority classifier always guesses the majority class label from the training data, regardless of the predictor variables’ values. It helps to give you a sense of scale when considering accuracies. If the majority classifier obtains a 90% accuracy on a problem, then you might hope for your K-nearest neighbors classifier to do better than that. If your classifier provides a significant improvement upon the majority classifier, this means that at least your method is extracting some useful information from your predictor variables. Be careful though: improving on the majority classifier does not necessarily mean the classifier is working well enough for your application. 
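The majority classifier described above can be sketched in a few lines of base R. The label vectors here are made up purely for illustration (they are not the breast cancer data), and the variable names are assumptions rather than anything from tidymodels.

```r
# hypothetical training and test labels, for illustration only
train_labels <- factor(c(rep("Benign", 9), rep("Malignant", 1)))
test_labels  <- factor(c(rep("Benign", 4), rep("Malignant", 1)))

# the majority classifier ignores the predictors entirely:
# it finds the most common label in the training data...
majority_class <- names(which.max(table(train_labels)))

# ...and predicts that label for every test observation
majority_predictions <- rep(majority_class, length(test_labels))

# baseline accuracy: fraction of test labels equal to the majority label
baseline_accuracy <- mean(majority_predictions == as.character(test_labels))
baseline_accuracy
```

Any classifier built from the predictors should be compared against a baseline like this one; the baseline is computed with training labels only, so the test set still plays no role in building the model.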
As an example, in the breast cancer data, recall the proportions of benign and malignant observations in the training data are as follows: cancer_proportions ## # A tibble: 2 × 3 ## Class n percent ## <fct> <int> <dbl> ## 1 Malignant 159 37.3 ## 2 Benign 267 62.7 Since the benign class represents the majority of the training data, the majority classifier would always predict that a new observation is benign. The estimated accuracy of the majority classifier is usually fairly close to the majority class proportion in the training data. In this case, we would suspect that the majority classifier will have an accuracy of around 63%. The K-nearest neighbors classifier we built does quite a bit better than this, with an accuracy of 85%. This means that from the perspective of accuracy, the K-nearest neighbors classifier improved quite a bit on the basic majority classifier. Hooray! But we still need to be cautious; in this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing patients who actually need medical care. The confusion matrix above shows that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign (7 out of 53 malignant tumors, or 13%!). Therefore, even though the accuracy improved upon the majority classifier, our critical analysis suggests that this classifier may not have appropriate performance for the application. 6.6 Tuning the classifier The vast majority of predictive models in statistics and machine learning have parameters. A parameter is a number you have to pick in advance that determines some aspect of how the model behaves. For example, in the K-nearest neighbors classification algorithm, \\(K\\) is a parameter that we have to pick that determines how many neighbors participate in the class vote. By picking different values of \\(K\\), we create different classifiers that make different predictions. 
So then, how do we pick the best value of \\(K\\), i.e., tune the model? And is it possible to make this selection in a principled way? In this book, we will focus on maximizing the accuracy of the classifier. Ideally, we want somehow to maximize the accuracy of our classifier on data it hasn’t seen yet. But we cannot use our test data set in the process of building our model. So we will play the same trick we did before when evaluating our classifier: we’ll split our training data itself into two subsets, use one to train the model, and then use the other to evaluate it. In this section, we will cover the details of this procedure, as well as how to use it to help you pick a good parameter value for your classifier. And remember: don’t touch the test set during the tuning process. Tuning is a part of model training! 6.6.1 Cross-validation The first step in choosing the parameter \\(K\\) is to be able to evaluate the classifier using only the training data. If this is possible, then we can compare the classifier’s performance for different values of \\(K\\)—and pick the best—using only the training data. As suggested at the beginning of this section, we will accomplish this by splitting the training data, training on one subset, and evaluating on the other. The subset of training data used for evaluation is often called the validation set. There is, however, one key difference from the train/test split that we performed earlier. In particular, we were forced to make only a single split of the data. This is because at the end of the day, we have to produce a single classifier. If we had multiple different splits of the data into training and testing data, we would produce multiple different classifiers. But while we are tuning the classifier, we are free to create multiple classifiers based on multiple splits of the training data, evaluate them, and then choose a parameter value based on all of the different results. 
If we just split our overall training data once, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set. Perhaps using multiple different train/validation splits, we’ll get a better estimate of accuracy, which will lead to a better choice of the number of neighbors \\(K\\) for the overall set of training data. Let’s investigate this idea in R! In particular, we will generate five different train/validation splits of our overall training data, train five different K-nearest neighbors models, and evaluate their accuracy. We will start with just a single split. # create the 75/25 split of the training data into training and validation cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class) cancer_subtrain <- training(cancer_split) cancer_validation <- testing(cancer_split) # recreate the standardization recipe from before # (since it must be based on the training data) cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_subtrain) |> step_scale(all_predictors()) |> step_center(all_predictors()) # fit the knn model (we can reuse the old knn_spec model from before) knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit(data = cancer_subtrain) # get predictions on the validation data validation_predicted <- predict(knn_fit, cancer_validation) |> bind_cols(cancer_validation) # compute the accuracy acc <- validation_predicted |> metrics(truth = Class, estimate = .pred_class) |> filter(.metric == "accuracy") |> select(.estimate) |> pull() acc ## [1] 0.8598131 The accuracy estimate using this split is 86%. Now we repeat the above code 4 more times, which generates 4 more splits. This gives us five different shuffles of the data, and therefore five different values for accuracy: 86.0%, 89.7%, 88.8%, 86.0%, 86.9%.
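The five estimates listed above can be summarized with a few lines of base R. This is only a sketch: the vector name is illustrative, and the standard error is computed with the usual formula (sample standard deviation divided by the square root of the number of estimates), which is an assumption about how one might quantify the spread rather than something taken from the original analysis.

```r
# The five validation accuracy estimates reported above, as proportions.
split_accuracies <- c(0.860, 0.897, 0.888, 0.860, 0.869)

# A single combined estimate: the average over the five splits.
mean_accuracy <- mean(split_accuracies)

# A rough measure of uncertainty in that average
# (usual standard error of the mean; an assumption of this sketch).
std_err <- sd(split_accuracies) / sqrt(length(split_accuracies))

round(mean_accuracy, 2)  # 0.87
range(split_accuracies)  # smallest and largest of the five estimates
std_err
```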
None of these values are necessarily “more correct” than any other; they’re just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their average (here 87%) to try to get a single assessment of our classifier’s accuracy; this has the effect of reducing the influence of any one (un)lucky validation set on the estimate. In practice, we don’t use random splits, but rather use a more structured splitting procedure so that each observation in the data set is used in a validation set only a single time. The name for this strategy is cross-validation. In cross-validation, we split our overall training data into \\(C\\) evenly sized chunks. Then, we iteratively use \\(1\\) chunk as the validation set and combine the remaining \\(C-1\\) chunks as the training set. This procedure is shown in Figure 6.4. Here, \\(C=5\\) different chunks of the data set are used, resulting in 5 different choices for the validation set; we call this 5-fold cross-validation. Figure 6.4: 5-fold cross-validation. To perform 5-fold cross-validation in R with tidymodels, we use another function: vfold_cv. This function splits our training data into v folds automatically. We set the strata argument to the categorical label variable (here, Class) to ensure that the training and validation subsets contain the right proportions of each category of observation. cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class) cancer_vfold ## # 5-fold cross-validation using stratification ## # A tibble: 5 × 2 ## splits id ## <list> <chr> ## 1 <split [340/86]> Fold1 ## 2 <split [340/86]> Fold2 ## 3 <split [341/85]> Fold3 ## 4 <split [341/85]> Fold4 ## 5 <split [342/84]> Fold5 Then, when we create our data analysis workflow, we use the fit_resamples function instead of the fit function for training. This runs cross-validation on each train/validation split.
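To make the chunking idea behind cross-validation concrete, here is a minimal base R sketch that assigns row indices to folds by hand. It is not how vfold_cv is implemented and, unlike the call above, it does not stratify by Class; the seed and object names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 426     # number of training observations (e.g., 340 + 86 above)
C <- 5       # number of folds

# Assign each row index to one of C folds of (nearly) equal size,
# then shuffle the assignment.
fold_id <- sample(rep(seq_len(C), length.out = n))

# For each fold, the validation rows are those assigned to it;
# all remaining rows form the training subset.
splits <- lapply(seq_len(C), function(fold) {
  list(
    train = which(fold_id != fold),
    validation = which(fold_id == fold)
  )
})

# Every row appears in exactly one validation set.
validation_rows <- unlist(lapply(splits, function(s) s$validation))
all(sort(validation_rows) == seq_len(n))

# Sizes of the five validation sets (they sum to n).
sapply(splits, function(s) length(s$validation))
```

In practice the tidymodels functions handle this bookkeeping, along with stratification, so a hand-rolled version like this is useful mainly for building intuition.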
# recreate the standardization recipe from before # (since it must be based on the training data) cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) # fit the knn model (we can reuse the old knn_spec model from before) knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold) knn_fit ## # Resampling results ## # 5-fold cross-validation using stratification ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [340/86]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> ## 2 <split [340/86]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]> ## 3 <split [341/85]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> ## 4 <split [341/85]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]> ## 5 <split [342/84]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]> The collect_metrics function is used to aggregate the mean and standard error of the classifier’s validation accuracy across the folds. You will find results related to the accuracy in the row with accuracy listed under the .metric column. You should consider the mean (mean) to be the estimated accuracy, while the standard error (std_err) is a measure of how uncertain we are in the mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean is 0.89 and standard error is 0.02, you can expect the true average accuracy of the classifier to be somewhere roughly between 87% and 91% (although it may fall outside this range). You may ignore the other columns in the metrics data frame, as they do not provide any additional insight. You can also ignore the entire second row with roc_auc in the .metric column, as it is beyond the scope of this book. 
knn_fit |> collect_metrics() ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.890 5 0.0180 Preprocessor1_Model1 ## 2 roc_auc binary 0.925 5 0.0151 Preprocessor1_Model1 We can choose any number of folds, and typically the more we use the better our accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis. So when you do cross-validation, you need to consider the size of the data, the speed of the algorithm (e.g., K-nearest neighbors), and the speed of your computer. In practice, this is a trial-and-error process, but typically \\(C\\) is chosen to be either 5 or 10. Here we will try 10-fold cross-validation to see if we get a lower standard error: cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class) vfold_metrics <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold) |> collect_metrics() vfold_metrics ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.890 10 0.0127 Preprocessor1_Model1 ## 2 roc_auc binary 0.913 10 0.0150 Preprocessor1_Model1 In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes you might even end up with a higher standard error when increasing the number of folds! We can make the reduction in standard error more dramatic by increasing the number of folds by a large amount. In the following code we show the result when \\(C = 50\\); picking such a large number of folds often takes a long time to run in practice, so we usually stick to 5 or 10. 
cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class) vfold_metrics_50 <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold_50) |> collect_metrics() vfold_metrics_50 ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.884 50 0.00568 Preprocessor1_Model1 ## 2 roc_auc binary 0.926 50 0.0148 Preprocessor1_Model1 6.6.2 Parameter value selection Using 5- and 10-fold cross-validation, we have estimated that the prediction accuracy of our classifier is somewhere around 89%. Whether that is good or not depends entirely on the downstream application of the data analysis. In the present situation, we are trying to predict a tumor diagnosis, with expensive, damaging chemo/radiation therapy or patient death as potential consequences of misprediction. Hence, we might like to do better than 89% for this application. In order to improve our classifier, we have one choice of parameter: the number of neighbors, \\(K\\). Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of \\(K\\) in a reasonable range, and then pick the value of \\(K\\) that gives us the best accuracy. The tidymodels package collection provides a very simple syntax for tuning models: each parameter in the model to be tuned should be specified as tune() in the model specification rather than given a particular value. knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("classification") Then instead of using fit or fit_resamples, we will use the tune_grid function to fit the model for each value in a range of parameter values. 
In particular, we first create a data frame with a neighbors variable that contains the sequence of values of \\(K\\) to try; below we create the k_vals data frame with the neighbors variable containing values from 1 to 100 (stepping by 5) using the seq function. Then we pass that data frame to the grid argument of tune_grid. k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5)) knn_results <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = k_vals) |> collect_metrics() accuracies <- knn_results |> filter(.metric == "accuracy") accuracies ## # A tibble: 20 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 accuracy binary 0.866 10 0.0165 Preprocessor1_Model01 ## 2 6 accuracy binary 0.890 10 0.0153 Preprocessor1_Model02 ## 3 11 accuracy binary 0.887 10 0.0173 Preprocessor1_Model03 ## 4 16 accuracy binary 0.887 10 0.0142 Preprocessor1_Model04 ## 5 21 accuracy binary 0.887 10 0.0143 Preprocessor1_Model05 ## 6 26 accuracy binary 0.887 10 0.0170 Preprocessor1_Model06 ## 7 31 accuracy binary 0.897 10 0.0145 Preprocessor1_Model07 ## 8 36 accuracy binary 0.899 10 0.0144 Preprocessor1_Model08 ## 9 41 accuracy binary 0.892 10 0.0135 Preprocessor1_Model09 ## 10 46 accuracy binary 0.892 10 0.0156 Preprocessor1_Model10 ## 11 51 accuracy binary 0.890 10 0.0155 Preprocessor1_Model11 ## 12 56 accuracy binary 0.873 10 0.0156 Preprocessor1_Model12 ## 13 61 accuracy binary 0.876 10 0.0104 Preprocessor1_Model13 ## 14 66 accuracy binary 0.871 10 0.0139 Preprocessor1_Model14 ## 15 71 accuracy binary 0.876 10 0.0104 Preprocessor1_Model15 ## 16 76 accuracy binary 0.873 10 0.0127 Preprocessor1_Model16 ## 17 81 accuracy binary 0.876 10 0.0135 Preprocessor1_Model17 ## 18 86 accuracy binary 0.873 10 0.0131 Preprocessor1_Model18 ## 19 91 accuracy binary 0.873 10 0.0140 Preprocessor1_Model19 ## 20 96 accuracy binary 0.866 10 0.0126 Preprocessor1_Model20 We can decide 
which number of neighbors is best by plotting the accuracy versus \\(K\\), as shown in Figure 6.5. accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = "Neighbors", y = "Accuracy Estimate") + theme(text = element_text(size = 12)) accuracy_vs_k Figure 6.5: Plot of estimated accuracy versus the number of neighbors. We can also obtain the number of neighbors with the highest accuracy programmatically by accessing the neighbors variable in the accuracies data frame where the mean variable is highest. Note that it is still useful to visualize the results as we did above since this provides additional information on how the model performance varies. best_k <- accuracies |> arrange(desc(mean)) |> head(1) |> pull(neighbors) best_k ## [1] 36 Setting the number of neighbors to \\(K =\\) 36 provides the highest cross-validation accuracy estimate (89.89%). But there is no exact or perfect answer here; any selection between \\(K = 30\\) and \\(60\\) would be reasonably justified, as all of these differ in classifier accuracy by a small amount. Remember: the values you see on this plot are estimates of the true accuracy of our classifier. Although the \\(K =\\) 36 value is higher than the others on this plot, that doesn’t mean the classifier is actually more accurate with this parameter value! Generally, when selecting \\(K\\) (and other parameters for other predictive models), we are looking for a value where: we get roughly optimal accuracy, so that our model will likely be accurate; changing the value to a nearby one (e.g., adding or subtracting a small number) doesn’t decrease accuracy too much, so that our choice is reliable in the presence of uncertainty; the cost of training the model is not prohibitive (e.g., in our situation, if \\(K\\) is too large, predicting becomes expensive!). We know that \\(K =\\) 36 provides the highest estimated accuracy.
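The selection of the best \\(K\\) can also be reproduced in base R from the mean accuracies printed in the tuning table. The sketch below copies those values by hand into a plain data frame (so it does not depend on tidymodels) and then inspects the neighboring grid values; the window of plus or minus 10 neighbors is an arbitrary illustrative choice.

```r
# Mean cross-validation accuracies copied from the tuning table above.
accuracies <- data.frame(
  neighbors = seq(from = 1, to = 100, by = 5),
  mean = c(0.866, 0.890, 0.887, 0.887, 0.887, 0.887, 0.897, 0.899, 0.892, 0.892,
           0.890, 0.873, 0.876, 0.871, 0.876, 0.873, 0.876, 0.873, 0.873, 0.866)
)

# The grid value with the highest estimated accuracy.
best_k <- accuracies$neighbors[which.max(accuracies$mean)]
best_k  # 36

# How much does the estimate change at nearby grid values?
subset(accuracies, abs(neighbors - best_k) <= 10)
```

Looking at the nearby rows is a quick numerical counterpart to reading the plot: if the estimates around the maximum are all similar, the choice of \\(K\\) is not very sensitive.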
Further, Figure 6.5 shows that the estimated accuracy changes by only a small amount if we increase or decrease \\(K\\) near \\(K =\\) 36. And finally, \\(K =\\) 36 does not create a prohibitively expensive computational cost of training. Considering these three points, we would indeed select \\(K =\\) 36 for the classifier. 6.6.3 Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of neighbors \\(K\\)? In fact, the accuracy actually starts to decrease! Let’s specify a much larger range of values of \\(K\\) to try in the grid argument of tune_grid. Figure 6.6 shows a plot of estimated accuracy as we vary \\(K\\) from 1 to almost the number of observations in the training set. k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10)) knn_results <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = k_lots) |> collect_metrics() accuracies_lots <- knn_results |> filter(.metric == "accuracy") accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = "Neighbors", y = "Accuracy Estimate") + theme(text = element_text(size = 12)) accuracy_vs_k_lots Figure 6.6: Plot of accuracy estimate versus number of neighbors for many K values. Underfitting: What is actually happening to our classifier that causes this? As we increase the number of neighbors, more and more of the training observations (and those that are farther and farther away from the point) get a “say” in what the class of a new observation is. This causes a sort of “averaging effect” to take place, making the boundary between where our classifier would predict a tumor to be malignant versus benign to smooth out and become simpler. If you take this to the extreme, setting \\(K\\) to the total training data set size, then the classifier will always predict the same label regardless of what the new observation looks like. 
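The extreme case just described can be checked on a toy example. The sketch below is a hand-written, one-predictor K-nearest neighbors vote in base R; the data and the knn_predict helper are hypothetical and have nothing to do with the cancer data or the kknn engine.

```r
# A tiny, hypothetical training set with a single predictor.
train_x <- c(1.0, 1.2, 1.9, 3.1, 3.3, 3.6, 4.0)
train_y <- c("Benign", "Benign", "Benign", "Benign",
             "Malignant", "Malignant", "Malignant")

# Minimal K-nearest neighbors majority vote for one new observation.
knn_predict <- function(new_x, k) {
  nearest <- order(abs(train_x - new_x))[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

new_points <- c(0.5, 2.5, 3.4, 5.0)

# With a small K, predictions depend on where the new point lies.
sapply(new_points, knn_predict, k = 3)

# With K equal to the training set size, every training point votes,
# so the prediction is the overall majority class for any new point.
sapply(new_points, knn_predict, k = length(train_x))
```

With all seven training points voting, the four benign labels outvote the three malignant ones no matter where the new observation is.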
In general, if the model isn’t influenced enough by the training data, it is said to underfit the data. Overfitting: In contrast, when we decrease the number of neighbors, each individual data point has a stronger and stronger vote regarding nearby points. Since the data themselves are noisy, this causes a more “jagged” boundary corresponding to a less simple model. If you take this case to the extreme, setting \\(K = 1\\), then the classifier is essentially just matching each new observation to its closest neighbor in the training data set. This is just as problematic as the large \\(K\\) case, because the classifier becomes unreliable on new data: if we had a different training set, the predictions would be completely different. In general, if the model is influenced too much by the training data, it is said to overfit the data. Figure 6.7: Effect of K in overfitting and underfitting. Both overfitting and underfitting are problematic and will lead to a model that does not generalize well to new data. When fitting a model, we need to strike a balance between the two. You can see these two effects in Figure 6.7, which shows how the classifier changes as we set the number of neighbors \\(K\\) to 1, 7, 20, and 300. 6.6.4 Evaluating on the test set Now that we have tuned the K-NN classifier and set \\(K =\\) 36, we are done building the model and it is time to evaluate the quality of its predictions on the held out test data, as we did earlier in Section 6.5.5. We first need to retrain the K-NN classifier on the entire training data set using the selected number of neighbors. 
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |> set_engine("kknn") |> set_mode("classification") knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit(data = cancer_train) knn_fit ## ══ Workflow [trained] ══════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## ## Call: ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(36, data, 5), kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.1150235 ## Best kernel: rectangular ## Best k: 36 Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the predict and metrics functions as we did earlier in the chapter. We can then pass those predictions to the precision, recall, and conf_mat functions to assess the estimated precision and recall, and print a confusion matrix. 
cancer_test_predictions <- predict(knn_fit, cancer_test) |> bind_cols(cancer_test) cancer_test_predictions |> metrics(truth = Class, estimate = .pred_class) |> filter(.metric == "accuracy") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.860 cancer_test_predictions |> precision(truth = Class, estimate = .pred_class, event_level="first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 precision binary 0.8 cancer_test_predictions |> recall(truth = Class, estimate = .pred_class, event_level="first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 recall binary 0.830 confusion <- cancer_test_predictions |> conf_mat(truth = Class, estimate = .pred_class) confusion ## Truth ## Prediction Malignant Benign ## Malignant 44 11 ## Benign 9 79 At first glance, this is a bit surprising: the accuracy of the classifier has only changed a small amount despite tuning the number of neighbors! Our first model with \\(K =\\) 3 (before we knew how to tune) had an estimated accuracy of 85%, while the tuned model with \\(K =\\) 36 had an estimated accuracy of 86%. Upon examining Figure 6.5 again to see the cross-validation accuracy estimates for a range of neighbors, this result becomes much less surprising. From 1 to around 96 neighbors, the cross-validation accuracy estimate varies only by around 3%, with each estimate having a standard error around 1%. Since the cross-validation accuracy estimates the test set accuracy, the fact that the test set accuracy also doesn’t change much is expected. Also note that the \\(K =\\) 3 model had a precision of 77% and recall of 87%, while the tuned model had a precision of 80% and recall of 83%. Given that the recall decreased (remember, in this application, recall is critical to making sure we find all the patients with malignant tumors), the tuned model may actually be less preferred in this setting.
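The comparison between the two models can be verified directly from the two confusion matrices printed in this chapter. The helper below is an illustrative base R sketch (summarize_confusion is not a tidymodels function) that treats "Malignant" as the positive class.

```r
# Accuracy, precision, and recall from the four cells of a two-class
# confusion matrix, treating "Malignant" as the positive class.
summarize_confusion <- function(true_pos, false_pos, false_neg, true_neg) {
  total <- true_pos + false_pos + false_neg + true_neg
  c(
    accuracy = (true_pos + true_neg) / total,
    precision = true_pos / (true_pos + false_pos),
    recall = true_pos / (true_pos + false_neg)
  )
}

# Counts read off the confusion matrices for the K = 3 model
# and the tuned K = 36 model.
k3_model <- summarize_confusion(true_pos = 46, false_pos = 14,
                                false_neg = 7, true_neg = 76)
tuned_model <- summarize_confusion(true_pos = 44, false_pos = 11,
                                   false_neg = 9, true_neg = 79)

round(rbind(k3_model, tuned_model), 3)
```

The tuned model gains a little accuracy and precision but misses two more malignant tumors (9 false negatives instead of 7), which is what drives the drop in recall.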
In any case, it is important to think critically about the result of tuning. Models tuned to maximize accuracy are not necessarily better for a given application. 6.7 Summary Classification algorithms use one or more quantitative variables to predict the value of another categorical variable. In particular, the K-nearest neighbors algorithm does this by first finding the \\(K\\) points in the training data nearest to the new observation, and then returning the majority class vote from those training observations. We can tune and evaluate a classifier by splitting the data randomly into a training and test data set. The training set is used to build the classifier, and we can tune the classifier (e.g., select the number of neighbors in K-NN) by maximizing estimated accuracy via cross-validation. After we have tuned the model we can use the test set to estimate its accuracy. The overall process is summarized in Figure 6.8. Figure 6.8: Overview of K-NN classification. The overall workflow for performing K-nearest neighbors classification using tidymodels is as follows: Use the initial_split function to split the data into a training and test set. Set the strata argument to the class label variable. Put the test set aside for now. Use the vfold_cv function to split up the training data for cross-validation. Create a recipe that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the data argument of the recipe. Create a nearest_neighbor model specification, with neighbors = tune(). Add the recipe and model specification to a workflow(), and use the tune_grid function on the train/validation splits to estimate the classifier accuracy for a range of \\(K\\) values. Pick a value of \\(K\\) that yields a high accuracy estimate that doesn’t change much if you change \\(K\\) to a nearby value.
Make a new model specification for the best parameter value (i.e., \\(K\\)), and retrain the classifier using the fit function. Evaluate the estimated accuracy of the classifier on the test set using the predict function. In these last two chapters, we focused on the K-nearest neighbors algorithm, but there are many other methods we could have used to predict a categorical label. All algorithms have their strengths and weaknesses, and we summarize these for the K-NN here. Strengths: K-nearest neighbors classification is a simple, intuitive algorithm, requires few assumptions about what the data must look like, and works for binary (two-class) and multi-class (more than 2 classes) classification problems. Weaknesses: K-nearest neighbors classification becomes very slow as the training data gets larger, may not perform well with a large number of predictors, and may not perform well when classes are imbalanced. 6.8 Predictor variable selection Note: This section is not required reading for the remainder of the textbook. It is included for those readers interested in learning how irrelevant variables can influence the performance of a classifier, and how to pick a subset of useful variables to include as predictors. Another potentially important part of tuning your classifier is to choose which variables from your data will be treated as predictor variables. Technically, you can choose anything from using a single predictor variable to using every variable in your data; the K-nearest neighbors algorithm accepts any number of predictors. However, it is not the case that using more predictors always yields better predictions! In fact, sometimes including irrelevant predictors can actually negatively affect classifier performance. 6.8.1 The effect of irrelevant predictors Let’s take a look at an example where K-nearest neighbors performs worse when given more predictors to work with. 
In this example, we modified the breast cancer data to have only the Smoothness, Concavity, and Perimeter variables from the original data. Then, we added irrelevant variables that we created ourselves using a random number generator. The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless of the value that the Class variable takes. In other words, the irrelevant variables have no meaningful relationship with the Class variable. cancer_irrelevant |> select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2) ## # A tibble: 569 × 6 ## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2 ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Malignant 0.118 0.300 123. 1 0 ## 2 Malignant 0.0847 0.0869 133. 0 0 ## 3 Malignant 0.110 0.197 130 0 0 ## 4 Malignant 0.142 0.241 77.6 0 1 ## 5 Malignant 0.100 0.198 135. 0 0 ## 6 Malignant 0.128 0.158 82.6 1 0 ## 7 Malignant 0.0946 0.113 120. 0 1 ## 8 Malignant 0.119 0.0937 90.2 1 0 ## 9 Malignant 0.127 0.186 87.5 0 0 ## 10 Malignant 0.119 0.227 84.0 1 1 ## # ℹ 559 more rows Next, we build a sequence of K-NN classifiers that include Smoothness, Concavity, and Perimeter as predictor variables, but also increasingly many irrelevant variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. Then we build a model, tuned via 5-fold cross-validation, for each data set. Figure 6.9 shows the estimated cross-validation accuracy versus the number of irrelevant predictors. As we add more irrelevant predictor variables, the estimated accuracy of our classifier decreases. This is because the irrelevant variables add a random amount to the distance between each pair of observations; the more irrelevant variables there are, the more (random) influence they have, and the more they corrupt the set of nearest neighbors that vote on the class of the new observation to predict. Figure 6.9: Effect of inclusion of irrelevant predictors.
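The code that generated the irrelevant variables is not shown, so the following base R sketch is only one plausible way to create variables with the stated property (0 or 1 with equal probability, independent of Class). The seed and the data frame name are arbitrary.

```r
set.seed(2024)  # arbitrary seed, only so the sketch is reproducible
n_obs <- 569    # number of rows in the cancer data shown above

# Each irrelevant variable is an independent fair coin flip per observation,
# generated without looking at the Class label at all.
irrelevant <- data.frame(
  Irrelevant1 = sample(c(0, 1), size = n_obs, replace = TRUE),
  Irrelevant2 = sample(c(0, 1), size = n_obs, replace = TRUE)
)

head(irrelevant)
colMeans(irrelevant)  # both proportions should be near 0.5

# Squared Euclidean distance contributed by the irrelevant columns alone
# between the first two observations (before any standardization):
# simply the number of columns in which the two rows differ.
sum((as.numeric(irrelevant[1, ]) - as.numeric(irrelevant[2, ]))^2)
```

Because that extra distance is pure noise, adding more such columns increasingly scrambles which training points count as the nearest neighbors.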
Although the accuracy decreases as expected, one surprising thing about Figure 6.9 is that it shows that the method still outperforms the baseline majority classifier (with about 63% accuracy) even with 40 irrelevant variables. How could that be? Figure 6.10 provides the answer: the tuning procedure for the K-nearest neighbors classifier combats the extra randomness from the irrelevant variables by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant variables, the number of neighbors does not increase smoothly; but the general trend is increasing. Figure 6.11 corroborates this evidence; if we fix the number of neighbors to \\(K=3\\), the accuracy falls off more quickly. Figure 6.10: Tuned number of neighbors for varying number of irrelevant predictors. Figure 6.11: Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors. 6.8.2 Finding a good subset of predictors So then, if it is not ideal to use all of our variables as predictors without consideration, how do we choose which variables we should use? A simple method is to rely on your scientific understanding of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer data that we have been studying, the ID variable is just a unique identifier for the observation. As it is not related to any measured property of the cells, the ID variable should therefore not be used as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables is less obvious, as all seem like reasonable candidates. It is not clear which subset of them will create the best classifier. One could use visualizations and other exploratory analyses to try to help understand which variables are potentially relevant, but this process is both time-consuming and error-prone when there are many variables to consider. 
Therefore we need a more systematic and programmatic way of choosing variables. This is a very difficult problem to solve in general, and there are a number of methods that have been developed that apply in particular cases of interest. Here we will discuss two basic selection methods as an introduction to the topic. See the additional resources at the end of this chapter to find out where you can learn more about variable selection, including more advanced methods. The first idea you might think of for a systematic way to select predictors is to try all possible subsets of predictors and then pick the set that results in the “best” classifier. This procedure is indeed a well-known variable selection method referred to as best subset selection (Beale, Kendall, and Mann 1967; Hocking and Leslie 1967). In particular, you create a separate model for every possible subset of predictors, tune each one using cross-validation, and pick the subset of predictors that gives you the highest cross-validation accuracy. Best subset selection is applicable to any classification method (K-NN or otherwise). However, it becomes very slow when you have even a moderate number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets grows very quickly with the number of predictors, and you have to train the model (itself a slow process!) for each one. For example, if we have 2 predictors—let’s call them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A and B together. If we have 3 predictors—A, B, and C—then we have 7 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models we have to train for \\(m\\) predictors is \\(2^m-1\\); in other words, when we get to 10 predictors we have over one thousand models to train, and at 20 predictors we have over one million models to train! So although it is a simple method, best subset selection is usually too computationally expensive to use in practice. 
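The growth in the number of predictor subsets is easy to check numerically. The short base R sketch below evaluates \\(2^m-1\\) for a few values of \\(m\\) and enumerates the subsets for the three-predictor example; the function and variable names are illustrative.

```r
# Number of non-empty predictor subsets for m predictors.
n_subsets <- function(m) 2^m - 1
n_subsets(c(2, 3, 10, 20))  # 3, 7, 1023, 1048575

# Enumerate the subsets explicitly for three predictors A, B, and C.
predictors <- c("A", "B", "C")
subsets <- unlist(
  lapply(seq_along(predictors), function(size) {
    # All combinations of the given size, each pasted into one label.
    combn(predictors, size, FUN = paste, collapse = "")
  })
)
subsets
length(subsets)  # 7
```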
Another idea is to iteratively build up a model by adding one predictor variable at a time. This method—known as forward selection (Efroymson 1966; Draper and Smith 1966)—is also widely applicable and fairly straightforward. It involves the following steps: (1) Start with a model having no predictors. (2) Run the following 3 steps until you run out of predictors: (a) For each unused predictor, add it to the model to form a candidate model. (b) Tune all of the candidate models. (c) Update the model to be the candidate model with the highest cross-validation accuracy. (3) Select the model that provides the best trade-off between accuracy and simplicity. Say you have \\(m\\) total predictors to work with. In the first iteration, you have to make \\(m\\) candidate models, each with 1 predictor. Then in the second iteration, you have to make \\(m-1\\) candidate models, each with 2 predictors (the one you chose before and a new one). This pattern continues for as many iterations as you want. If you run the method all the way until you run out of predictors to choose, you will end up training \\(m + (m-1) + \\dots + 1 = \\frac{1}{2}m(m+1)\\) separate models. This is a big improvement over the \\(2^m-1\\) models that best subset selection requires you to train! For example, while best subset selection requires training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models. Therefore we will continue the rest of this section using forward selection. Note: One word of caution before we move on. Every additional model that you train increases the likelihood that you will get unlucky and stumble on a model that has a high cross-validation accuracy estimate, but a low true accuracy on the test data and other future observations. Since forward selection involves training a lot of models, you run a fairly high risk of this happening. To keep this risk low, only use forward selection when you have a large amount of data and a relatively small total number of predictors.
More advanced methods do not suffer from this problem as much; see the additional resources at the end of this chapter for where to learn more about advanced predictor selection methods. 6.8.3 Forward selection in R We now turn to implementing forward selection in R. Unfortunately there is no built-in way to do this using the tidymodels framework, so we will have to code it ourselves. First we will use the select function to extract a smaller set of predictors to work with in this illustrative example—Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, and Irrelevant3—as well as the Class variable as the label. We will also extract the column names for the full set of predictors. cancer_subset <- cancer_irrelevant |> select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, Irrelevant3) names <- colnames(cancer_subset |> select(-Class)) cancer_subset ## # A tibble: 569 × 7 ## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2 Irrelevant3 ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Malignant 0.118 0.300 123. 1 0 1 ## 2 Malignant 0.0847 0.0869 133. 0 0 0 ## 3 Malignant 0.110 0.197 130 0 0 0 ## 4 Malignant 0.142 0.241 77.6 0 1 0 ## 5 Malignant 0.100 0.198 135. 0 0 0 ## 6 Malignant 0.128 0.158 82.6 1 0 1 ## 7 Malignant 0.0946 0.113 120. 0 1 1 ## 8 Malignant 0.119 0.0937 90.2 1 0 0 ## 9 Malignant 0.127 0.186 87.5 0 0 1 ## 10 Malignant 0.119 0.227 84.0 1 1 0 ## # ℹ 559 more rows The key idea of the forward selection code is to use the paste function (which concatenates strings separated by spaces) to create a model formula for each subset of predictors for which we want to build a model. The collapse argument tells paste what to put between the items in the list; to make a formula, we need to put a + symbol between each variable. 
As an example, let’s make a model formula for all the predictors, which should output something like Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3: example_formula <- paste("Class", "~", paste(names, collapse="+")) example_formula ## [1] "Class ~ Smoothness+Concavity+Perimeter+Irrelevant1+Irrelevant2+Irrelevant3" Finally, we need to write some code that performs the task of sequentially finding the best predictor to add to the model. If you recall the end of the wrangling chapter, we mentioned that sometimes one needs more flexible forms of iteration than what we have used earlier, and in these cases one typically resorts to a for loop; see the chapter on iteration in R for Data Science (Wickham and Grolemund 2016). Here we will use two for loops: one over increasing predictor set sizes (where you see for (i in 1:length(names)) below), and another to check which predictor to add in each round (where you see for (j in 1:length(names)) below). For each set of predictors to try, we construct a model formula, pass it into a recipe, build a workflow that tunes a K-NN classifier using 5-fold cross-validation, and finally records the estimated accuracy. 
# create an empty tibble to store the results accuracies <- tibble(size = integer(), model_string = character(), accuracy = numeric()) # create a model specification knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("classification") # create a 5-fold cross-validation object cancer_vfold <- vfold_cv(cancer_subset, v = 5, strata = Class) # store the total number of predictors n_total <- length(names) # stores selected predictors selected <- c() # for every size from 1 to the total number of predictors for (i in 1:n_total) { # for every predictor still not added yet accs <- list() models <- list() for (j in 1:length(names)) { # create a model string for this combination of predictors preds_new <- c(selected, names[[j]]) model_string <- paste("Class", "~", paste(preds_new, collapse="+")) # create a recipe from the model string cancer_recipe <- recipe(as.formula(model_string), data = cancer_subset) |> step_scale(all_predictors()) |> step_center(all_predictors()) # tune the K-NN classifier with these predictors, # and collect the accuracy for the best K acc <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = 10) |> collect_metrics() |> filter(.metric == "accuracy") |> summarize(mx = max(mean)) acc <- acc$mx |> unlist() # add this result to the dataframe accs[[j]] <- acc models[[j]] <- model_string } jstar <- which.max(unlist(accs)) accuracies <- accuracies |> add_row(size = i, model_string = models[[jstar]], accuracy = accs[[jstar]]) selected <- c(selected, names[[jstar]]) names <- names[-jstar] } accuracies ## # A tibble: 6 × 3 ## size model_string accuracy ## <int> <chr> <dbl> ## 1 1 Class ~ Perimeter 0.896 ## 2 2 Class ~ Perimeter+Concavity 0.916 ## 3 3 Class ~ Perimeter+Concavity+Smoothness 0.931 ## 4 4 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1 0.928 ## 5 5 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3 0.924 ## 6 6 Class 
~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3… 0.902 Interesting! The forward selection procedure first added the three meaningful variables Perimeter, Concavity, and Smoothness, followed by the irrelevant variables. Figure 6.12 visualizes the accuracy versus the number of predictors in the model. You can see that as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). The way to find that balance is to look for the elbow in Figure 6.12, i.e., the place on the plot where the accuracy stops increasing dramatically and levels off or begins to decrease. The elbow in Figure 6.12 appears to occur at the model with 3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors occurs with 3 variables: Class ~ Perimeter + Concavity + Smoothness. In other words, we have successfully removed irrelevant predictors from the model! It is always worth remembering, however, that what cross-validation gives you is an estimate of the true accuracy; you have to use your judgement when looking at this plot to decide where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy. Figure 6.12: Estimated accuracy versus the number of predictors for the sequence of models built using forward selection. Note: Since the choice of which variables to include as predictors is part of tuning your classifier, you cannot use your test data for this process! 
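One way to make the elbow judgement a little more explicit is sketched below in base R, using the accuracy estimates reported in the table above. The rule used here (pick the smallest model whose estimated accuracy is within a tolerance of the best one) and the tolerance of 0.01 are assumptions for illustration only; they are not the procedure used in this chapter, and they do not replace looking at the plot.

```r
# Cross-validation accuracy estimates copied from the forward selection output.
acc <- data.frame(
  size = 1:6,
  accuracy = c(0.896, 0.916, 0.931, 0.928, 0.924, 0.902)
)

# Change in estimated accuracy from adding each successive predictor.
acc$gain <- c(NA, diff(acc$accuracy))
acc

# Hypothetical rule: choose the smallest model whose accuracy is within
# `tolerance` of the best estimate. The tolerance is an assumed value.
tolerance <- 0.01
chosen_size <- min(acc$size[acc$accuracy >= max(acc$accuracy) - tolerance])
chosen_size
```

With these numbers the rule selects the 3-predictor model, which agrees with the elbow read off the plot. Because the accuracies are only estimates, a different tolerance (or a different random split) could reasonably lead to a different choice.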
6.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Classification II: evaluation and tuning” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 6.10 Additional resources The tidymodels website is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a nice beginner’s tutorial and an extensive list of more advanced examples that you can use to continue learning beyond the scope of this book. It’s worth noting that the tidymodels package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you’ll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters. An Introduction to Statistical Learning (James et al. 2013) provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. 
Note that while this book is still a very accessible introductory text, it assumes a bit more mathematical background than we do. References "],["regression1.html", "Chapter 7 Regression I: K-nearest neighbors 7.1 Overview 7.2 Chapter learning objectives 7.3 The regression problem 7.4 Exploring a data set 7.5 K-nearest neighbors regression 7.6 Training, evaluating, and tuning the model 7.7 Underfitting and overfitting 7.8 Evaluating on the test set 7.9 Multivariable K-NN regression 7.10 Strengths and limitations of K-NN regression 7.11 Exercises", " Chapter 7 Regression I: K-nearest neighbors 7.1 Overview This chapter continues our foray into answering predictive questions. Here we will focus on predicting numerical variables and will use regression to perform this task. This is unlike the past two chapters, which focused on predicting categorical variables via classification. However, regression does have many similarities to classification: for example, just as in the case of classification, we will split our data into training, validation, and test sets, we will use tidymodels workflows, we will use a K-nearest neighbors (K-NN) approach to make predictions, and we will use cross-validation to choose K. Because of how similar these procedures are, make sure to read Chapters 5 and 6 before reading this one—we will move a little bit faster here with the concepts that have already been covered. This chapter will primarily focus on the case where there is a single predictor, but the end of the chapter shows how to perform regression with more than one predictor variable, i.e., multivariable regression. It is important to note that regression can also be used to answer inferential and causal questions; however, that is beyond the scope of this book. 7.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Recognize situations where a regression analysis would be appropriate for making predictions.
Explain the K-nearest neighbors (K-NN) regression algorithm and describe how it differs from K-NN classification. Interpret the output of a K-NN regression. In a data set with two or more variables, perform K-nearest neighbors regression in R. Evaluate K-NN regression prediction quality in R using the root mean squared prediction error (RMSPE). Estimate the RMSPE in R using cross-validation or a test set. Choose the number of neighbors in K-nearest neighbors regression by minimizing estimated cross-validation RMSPE. Describe underfitting and overfitting, and relate them to the number of neighbors in K-nearest neighbors regression. Describe the advantages and disadvantages of K-nearest neighbors regression. 7.3 The regression problem Regression, like classification, is a predictive problem setting where we want to use past information to predict future observations. But in the case of regression, the goal is to predict numerical values instead of categorical values. The variable that you want to predict is often called the response variable. For example, we could try to use the number of hours a person spends on exercise each week to predict their race time in the annual Boston marathon. As another example, we could try to use the size of a house to predict its sale price. Both of these response variables—race time and sale price—are numerical, and so predicting them given past data is considered a regression problem. Just like in the classification setting, there are many possible methods that we can use to predict numerical response variables. In this chapter we will focus on the K-nearest neighbors algorithm (Fix and Hodges 1951; Cover and Hart 1967), and in the next chapter we will study linear regression. In your future studies, you might encounter regression trees, splines, and general local regression methods; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods.
Many of the concepts from classification map over to the setting of regression. For example, a regression model predicts a new observation’s response variable based on the response variables for similar observations in the data set of past observations. When building a regression model, we first split the data into training and test sets, in order to ensure that we assess the performance of our method on observations not seen during training. And finally, we can use cross-validation to evaluate different choices of model parameters (e.g., K in a K-nearest neighbors model). The major difference is that we are now predicting numerical variables instead of categorical variables. Note: You can usually tell whether a variable is numerical or categorical—and therefore whether you need to perform regression or classification—by taking the response variable for two observations X and Y from your data, and asking the question, “is response variable X more than response variable Y?” If the variable is categorical, the question will make no sense. (Is blue more than red? Is benign more than malignant?) If the variable is numerical, it will make sense. (Is 1.5 hours more than 2.25 hours? Is $500,000 more than $400,000?) Be careful when applying this heuristic, though: sometimes categorical variables will be encoded as numbers in your data (e.g., “1” represents “benign”, and “0” represents “malignant”). In these cases you have to ask the question about the meaning of the labels (“benign” and “malignant”), not their values (“1” and “0”). 7.4 Exploring a data set In this chapter and the next, we will study a data set of 932 real estate transactions in Sacramento, California originally reported in the Sacramento Bee newspaper. We first need to formulate a precise question that we want to answer. In this example, our question is again predictive: Can we use the size of a house in the Sacramento, CA area to predict its sale price? 
A rigorous, quantitative answer to this question might help a realtor advise a client as to whether the price of a particular listing is fair, or perhaps how to set the price of a new listing. We begin the analysis by loading and examining the data, and setting the seed value. library(tidyverse) library(tidymodels) library(gridExtra) set.seed(5) sacramento <- read_csv("data/sacramento.csv") sacramento ## # A tibble: 932 × 9 ## city zip beds baths sqft type price latitude longitude ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 SACRAMENTO z95838 2 1 836 Residential 59222 38.6 -121. ## 2 SACRAMENTO z95823 3 1 1167 Residential 68212 38.5 -121. ## 3 SACRAMENTO z95815 2 1 796 Residential 68880 38.6 -121. ## 4 SACRAMENTO z95815 2 1 852 Residential 69307 38.6 -121. ## 5 SACRAMENTO z95824 2 1 797 Residential 81900 38.5 -121. ## 6 SACRAMENTO z95841 3 1 1122 Condo 89921 38.7 -121. ## 7 SACRAMENTO z95842 3 2 1104 Residential 90895 38.7 -121. ## 8 SACRAMENTO z95820 3 1 1177 Residential 91002 38.5 -121. ## 9 RANCHO_CORDOVA z95670 2 2 941 Condo 94905 38.6 -121. ## 10 RIO_LINDA z95673 3 2 1146 Residential 98937 38.7 -121. ## # ℹ 922 more rows The scientific question guides our initial exploration: the columns in the data that we are interested in are sqft (house size, in livable square feet) and price (house sale price, in US dollars (USD)). The first step is to visualize the data as a scatter plot where we place the predictor variable (house size) on the x-axis, and we place the response variable that we want to predict (sale price) on the y-axis. Note: Given that the y-axis unit is dollars in Figure 7.1, we format the axis labels to put dollar signs in front of the house prices, as well as commas to increase the readability of the larger numbers. We can do this in R by passing the dollar_format function (from the scales package) to the labels argument of the scale_y_continuous function. 
eda <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + theme(text = element_text(size = 12)) eda Figure 7.1: Scatter plot of price (USD) versus house size (square feet). The plot is shown in Figure 7.1. We can see that in Sacramento, CA, as the size of a house increases, so does its sale price. Thus, we can reason that we may be able to use the size of a not-yet-sold house (for which we don’t know the sale price) to predict its final sale price. Note that we do not suggest here that a larger house size causes a higher sale price; just that house price tends to increase with house size, and that we may be able to use the latter to predict the former. 7.5 K-nearest neighbors regression Much like in the case of classification, we can use a K-nearest neighbors-based approach in regression to make predictions. Let’s take a small sample of the data in Figure 7.1 and walk through how K-nearest neighbors (K-NN) works in a regression context before we dive in to creating our model and assessing how well it predicts house sale price. This subsample is taken to allow us to illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. To take a small random sample of size 30, we’ll use the function slice_sample, and input the data frame to sample from and the number of rows to randomly select. small_sacramento <- slice_sample(sacramento, n = 30) Next let’s say we come across a 2,000 square-foot house in Sacramento we are interested in purchasing, with an advertised list price of $350,000. Should we offer to pay the asking price for this house, or is it overpriced and we should offer less? Absent any other information, we can get a sense for a good answer to this question by using the data we have to predict the sale price given the sale prices we have already observed. 
But in Figure 7.2, you can see that we have no observations of a house of size exactly 2,000 square feet. How can we predict the sale price? small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) + geom_point() + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + geom_vline(xintercept = 2000, linetype = "dashed") + theme(text = element_text(size = 12)) small_plot Figure 7.2: Scatter plot of price (USD) versus house size (square feet) with vertical line indicating 2,000 square feet on x-axis. We will employ the same intuition from the classification chapter, and use the neighboring points to the new point of interest to suggest/predict what its sale price might be. For the example shown in Figure 7.2, we find and label the 5 nearest neighbors to our observation of a house that is 2,000 square feet. nearest_neighbors <- small_sacramento |> mutate(diff = abs(2000 - sqft)) |> slice_min(diff, n = 5) nearest_neighbors ## # A tibble: 5 × 10 ## city zip beds baths sqft type price latitude longitude diff ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 ROSEVILLE z95661 3 2 2049 Residenti… 395500 38.7 -121. 49 ## 2 ANTELOPE z95843 4 3 2085 Residenti… 408431 38.7 -121. 85 ## 3 SACRAMENTO z95823 4 2 1876 Residenti… 299940 38.5 -121. 124 ## 4 ROSEVILLE z95747 3 2.5 1829 Residenti… 306500 38.8 -121. 171 ## 5 SACRAMENTO z95825 4 2 1776 Multi_Fam… 221250 38.6 -121. 224 Figure 7.3: Scatter plot of price (USD) versus house size (square feet) with lines to 5 nearest neighbors (highlighted in orange). Figure 7.3 illustrates the difference between the house sizes of the 5 nearest neighbors (in terms of house size) to our new 2,000 square-foot house of interest. Now that we have obtained these nearest neighbors, we can use their values to predict the sale price for the new home. 
Specifically, we can take the mean (or average) of these 5 values as our predicted value, as illustrated by the red point in Figure 7.4. prediction <- nearest_neighbors |> summarise(predicted = mean(price)) prediction ## # A tibble: 1 × 1 ## predicted ## <dbl> ## 1 326324. Figure 7.4: Scatter plot of price (USD) versus house size (square feet) with predicted price for a 2,000 square-foot house based on 5 nearest neighbors represented as a red dot. Our predicted price is $326,324 (shown as a red point in Figure 7.4), which is much less than $350,000; perhaps we might want to offer less than the list price at which the house is advertised. But this is only the very beginning of the story. We still have all the same unanswered questions here with K-NN regression that we had with K-NN classification: which \\(K\\) do we choose, and is our model any good at making predictions? In the next few sections, we will address these questions in the context of K-NN regression. One strength of the K-NN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships (i.e., if the relationship is not a straight line). This stems from the use of nearest neighbors to predict values. The algorithm really has very few assumptions about what the data must look like for it to work. 7.6 Training, evaluating, and tuning the model As usual, we must start by putting some test data away in a lock box that we will come back to only after we choose our final model. Let’s take care of that now. Note that for the remainder of the chapter we’ll be working with the entire Sacramento data set, as opposed to the smaller sample of 30 points that we used earlier in the chapter (Figure 7.2). sacramento_split <- initial_split(sacramento, prop = 0.75, strata = price) sacramento_train <- training(sacramento_split) sacramento_test <- testing(sacramento_split) Next, we’ll use cross-validation to choose \\(K\\). 
In K-NN classification, we used accuracy to see how well our predictions matched the true labels. We cannot use the same metric in the regression setting, since our predictions will almost never exactly match the true response variable values. Therefore in the context of K-NN regression we will use root mean squared prediction error (RMSPE) instead. The mathematical formula for calculating RMSPE is: \\[\\text{RMSPE} = \\sqrt{\\frac{1}{n}\\sum\\limits_{i=1}^{n}(y_i - \\hat{y}_i)^2}\\] where: \\(n\\) is the number of observations, \\(y_i\\) is the observed value for the \\(i^\\text{th}\\) observation, and \\(\\hat{y}_i\\) is the forecasted/predicted value for the \\(i^\\text{th}\\) observation. In other words, we compute the squared difference between the predicted and true response value for each observation in our test (or validation) set, compute the average, and then finally take the square root. The reason we use the squared difference (and not just the difference) is that the differences can be positive or negative, i.e., we can overshoot or undershoot the true response value. Figure 7.5 illustrates both positive and negative differences between predicted and true response values. So if we want to measure error—a notion of distance between our predicted and true response values—we want to make sure that we are only adding up positive values, with larger positive values representing larger mistakes. If the predictions are very close to the true values, then RMSPE will be small. If, on the other hand, the predictions are very different from the true values, then RMSPE will be quite large. When we use cross-validation, we will choose the \\(K\\) that gives us the smallest RMSPE. Figure 7.5: Scatter plot of price (USD) versus house size (square feet) with example predictions (blue line) and the error in those predictions compared with true response values (vertical lines).
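The RMSPE formula above can be computed directly in a few lines of base R. The sketch below is only an illustration: the function name rmspe and the three observed and predicted prices are made-up assumptions, not values from the Sacramento data.

```r
# Root mean squared prediction error, following the formula in the text:
# square the differences, average them, then take the square root.
# (rmspe is a hypothetical helper name used for illustration.)
rmspe <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  sqrt(mean((observed - predicted)^2))
}

# Made-up sale prices and predictions (USD), for illustration only.
observed  <- c(200000, 300000, 400000)
predicted <- c(210000, 280000, 430000)

# The raw differences have mixed signs (overshoots and undershoots),
# so they partly cancel when averaged.
mean(observed - predicted)

# Squaring first prevents that cancellation.
rmspe(observed, predicted)
```

Here the average raw difference is much smaller in magnitude than the RMSPE, which shows why the differences are squared before averaging.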
Note: When using many code packages (tidymodels included), the evaluation output we will get to assess the prediction quality of our K-NN regression models is labeled “RMSE”, or “root mean squared error”. Why is this so, and why not RMSPE? In statistics, we try to be very precise with our language to indicate whether we are calculating the prediction error on the training data (in-sample prediction) versus on the testing data (out-of-sample prediction). When predicting and evaluating prediction quality on the training data, we say RMSE. By contrast, when predicting and evaluating prediction quality on the testing or validation data, we say RMSPE. The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the \\(y\\)s are training or testing data. But many people just use RMSE for both, and rely on context to denote which data the root mean squared error is being calculated on. Now that we know how we can assess how well our model predicts a numerical value, let’s use R to perform cross-validation and to choose the optimal \\(K\\). First, we will create a recipe for preprocessing our data. Note that we include standardization in our preprocessing to build good habits, but since we only have one predictor, it is technically not necessary; there is no risk of comparing two predictors of different scales. Next we create a model specification for K-nearest neighbors regression. Note that we use set_mode(\"regression\") now in the model specification to denote a regression problem, as opposed to the classification problems from the previous chapters. The use of set_mode(\"regression\") essentially tells tidymodels that we need to use different metrics (RMSPE, not accuracy) for tuning and evaluation. Then we create a 5-fold cross-validation object, and put the recipe and model specification together in a workflow. 
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("regression") sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price) sacr_wkflw <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) sacr_wkflw ## ══ Workflow ══════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## K-Nearest Neighbor Model Specification (regression) ## ## Main Arguments: ## neighbors = tune() ## weight_func = rectangular ## ## Computational engine: kknn Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200. The following code tunes the model and returns the RMSPE for each number of neighbors. In the output of the sacr_results results data frame, we see that the neighbors variable contains the value of \\(K\\), the mean (mean) contains the value of the RMSPE estimated via cross-validation, and the standard error (std_err) contains a value corresponding to a measure of how uncertain we are in the mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean RMSPE is $100,000 and standard error is $1,000, you can expect the true RMSPE to be somewhere roughly between $99,000 and $101,000 (although it may fall outside this range). You may ignore the other columns in the metrics data frame, as they do not provide any additional insight. 
gridvals <- tibble(neighbors = seq(from = 1, to = 200, by = 3)) sacr_results <- sacr_wkflw |> tune_grid(resamples = sacr_vfold, grid = gridvals) |> collect_metrics() |> filter(.metric == "rmse") # show the results sacr_results ## # A tibble: 67 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 rmse standard 107206. 5 4102. Preprocessor1_Model01 ## 2 4 rmse standard 90469. 5 3312. Preprocessor1_Model02 ## 3 7 rmse standard 86580. 5 3062. Preprocessor1_Model03 ## 4 10 rmse standard 85321. 5 3395. Preprocessor1_Model04 ## 5 13 rmse standard 85045. 5 3641. Preprocessor1_Model05 ## 6 16 rmse standard 84675. 5 3679. Preprocessor1_Model06 ## 7 19 rmse standard 84776. 5 3984. Preprocessor1_Model07 ## 8 22 rmse standard 84617. 5 3952. Preprocessor1_Model08 ## 9 25 rmse standard 84953. 5 3929. Preprocessor1_Model09 ## 10 28 rmse standard 84612. 5 3917. Preprocessor1_Model10 ## # ℹ 57 more rows Figure 7.6: Effect of the number of neighbors on the RMSPE. Figure 7.6 visualizes how the RMSPE varies with the number of neighbors \\(K\\). We take the minimum RMSPE to find the best setting for the number of neighbors: # show only the row of minimum RMSPE sacr_min <- sacr_results |> filter(mean == min(mean)) sacr_min ## # A tibble: 1 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 52 rmse standard 84561. 5 4470. Preprocessor1_Model18 The smallest RMSPE occurs when \\(K =\\) 52. 7.7 Underfitting and overfitting Similar to the setting of classification, by setting the number of neighbors to be too small or too large, we cause the RMSPE to increase, as shown in Figure 7.6. What is happening here? Figure 7.7 visualizes the effect of different settings of \\(K\\) on the regression model. 
Each plot shows the predicted values for house sale price from our K-NN regression model on the training data for 6 different values for \\(K\\): 1, 3, 25, 52, 250, and 680 (almost the entire training set). For each model, we predict prices for the range of possible home sizes we observed in the data set (here 500 to 5,000 square feet) and we plot the predicted prices as a blue line. Figure 7.7: Predicted values for house price (represented as a blue line) from K-NN regression models for six different values for \\(K\\). Figure 7.7 shows that when \\(K\\) = 1, the blue line runs perfectly through (almost) all of our training observations. This happens because our predicted values for a given region (typically) depend on just a single observation. In general, when \\(K\\) is too small, the line follows the training data quite closely, even if it does not match it perfectly. If we used a different training data set of house prices and sizes from the Sacramento real estate market, we would end up with completely different predictions. In other words, the model is influenced too much by the data. Because the model follows the training data so closely, it will not make accurate predictions on new observations which, generally, will not have the same fluctuations as the original training data. Recall from the classification chapters that this behavior—where the model is influenced too much by the noisy data—is called overfitting; we use this same term in the context of regression. What about the plots in Figure 7.7 where \\(K\\) is quite large, say, \\(K\\) = 250 or 680? In this case the blue line becomes extremely smooth, and actually becomes flat once \\(K\\) is equal to the number of data points in the training set.
This happens because our predicted values for a given x value (here, home size), depend on many neighboring observations; in the case where \\(K\\) is equal to the size of the training set, the prediction is just the mean of the house prices (completely ignoring the house size). In contrast to the \\(K=1\\) example, the smooth, inflexible blue line does not follow the training observations very closely. In other words, the model is not influenced enough by the training data. Recall from the classification chapters that this behavior is called underfitting; we again use this same term in the context of regression. Ideally, what we want is neither of the two situations discussed above. Instead, we would like a model that (1) follows the overall “trend” in the training data, so the model actually uses the training data to learn something useful, and (2) does not follow the noisy fluctuations, so that we can be confident that our model will transfer/generalize well to other new data. If we explore the other values for \\(K\\), in particular \\(K\\) = 52 (as suggested by cross-validation), we can see it achieves this goal: it follows the increasing trend of house price versus house size, but is not influenced too much by the idiosyncratic variations in price. All of this is similar to how the choice of \\(K\\) affects K-nearest neighbors classification, as discussed in the previous chapter. 7.8 Evaluating on the test set To assess how well our model might do at predicting on unseen data, we will assess its RMSPE on the test data. To do this, we will first re-train our K-NN regression model on the entire training data set, using \\(K =\\) 52 neighbors. Then we will use predict to make predictions on the test data, and use the metrics function again to compute the summary of regression quality. Because we specify that we are performing regression in set_mode, the metrics function knows to output a quality summary related to regression, and not, say, classification. 
kmin <- sacr_min |> pull(neighbors) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) |> set_engine("kknn") |> set_mode("regression") sacr_fit <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) |> fit(data = sacramento_train) sacr_summary <- sacr_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) |> metrics(truth = price, estimate = .pred) |> filter(.metric == 'rmse') sacr_summary ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 90529. Our final model’s test error as assessed by RMSPE is $90,529. Note that RMSPE is measured in the same units as the response variable. In other words, on new observations, we expect the error in our prediction to be roughly $90,529. From one perspective, this is good news: this is about the same as the cross-validation RMSPE estimate of our tuned model (which was $84,561), so we can say that the model appears to generalize well to new data that it has never seen before. However, much like in the case of K-NN classification, whether this value for RMSPE is good (i.e., whether an error of around $90,529 is acceptable) depends entirely on the application. In this application, this error is not prohibitively large, but it is not negligible either; $90,529 might represent a substantial fraction of a home buyer’s budget, and could make or break whether or not they could afford to put an offer on a house. Finally, Figure 7.8 shows the predictions that our final model makes across the range of house sizes we might encounter in the Sacramento area. Note that instead of predicting the house price only for those house sizes that happen to appear in our data, we predict it for evenly spaced values between the minimum and maximum in the data set (roughly 500 to 5,000 square feet). We superimpose this prediction line on a scatter plot of the original housing price data, so that we can qualitatively assess if the model seems to fit the data well.
You have already seen a few plots like this in this chapter, but here we also provide the code that generated it as a learning opportunity. sqft_prediction_grid <- tibble( sqft = seq( from = sacramento |> select(sqft) |> min(), to = sacramento |> select(sqft) |> max(), by = 10 ) ) sacr_preds <- sacr_fit |> predict(sqft_prediction_grid) |> bind_cols(sqft_prediction_grid) plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + geom_line(data = sacr_preds, mapping = aes(x = sqft, y = .pred), color = "steelblue", linewidth = 1) + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + ggtitle(paste0("K = ", kmin)) + theme(text = element_text(size = 12)) plot_final Figure 7.8: Predicted values of house price (blue line) for the final K-NN regression model. 7.9 Multivariable K-NN regression As in K-NN classification, we can use multiple predictors in K-NN regression. In this setting, we have the same concerns regarding the scale of the predictors. Once again, predictions are made by identifying the \\(K\\) observations that are nearest to the new point we want to predict; any variables that are on a large scale will have a much larger effect than variables on a small scale. But since the recipe we built above scales and centers all predictor variables, this is handled for us. Note that we also have the same concern regarding the selection of predictors in K-NN regression as in K-NN classification: having more predictors is not always better, and the choice of which predictors to use has a potentially large influence on the quality of predictions. Fortunately, we can use the predictor selection algorithm from the classification chapter in K-NN regression as well. As the algorithm is the same, we will not cover it again in this chapter. We will now demonstrate a multivariable K-NN regression analysis of the Sacramento real estate data using tidymodels. 
This time we will use house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our response variable that we are trying to predict. It is always a good practice to do exploratory data analysis, such as visualizing the data, before we start modeling the data. Figure 7.9 shows that the number of bedrooms might provide useful information to help predict the sale price of a house. plot_beds <- sacramento |> ggplot(aes(x = beds, y = price)) + geom_point(alpha = 0.4) + labs(x = 'Number of Bedrooms', y = 'Price (USD)') + theme(text = element_text(size = 12)) plot_beds Figure 7.9: Scatter plot of the sale price of houses versus the number of bedrooms. Figure 7.9 shows that as the number of bedrooms increases, the house sale price tends to increase as well, but that the relationship is quite weak. Does adding the number of bedrooms to our model improve our ability to predict price? To answer that question, we will have to create a new K-NN regression model using house size and number of bedrooms, and then we can compare it to the model we previously came up with that only used house size. Let’s do that now! First we’ll build a new model specification and recipe for the analysis. Note that we use the formula price ~ sqft + beds to denote that we have two predictors, and set neighbors = tune() to tell tidymodels to tune the number of neighbors for us. 
sacr_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("regression") Next, we’ll use 5-fold cross-validation to choose the number of neighbors via the minimum RMSPE: gridvals <- tibble(neighbors = seq(1, 200)) sacr_multi <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) |> tune_grid(sacr_vfold, grid = gridvals) |> collect_metrics() |> filter(.metric == "rmse") |> filter(mean == min(mean)) sacr_k <- sacr_multi |> pull(neighbors) sacr_multi ## # A tibble: 1 × 7 ## neighbors .metric .estimator mean n std_err .config ## <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 11 rmse standard 81839. 5 3108. Preprocessor1_Model011 Here we see that the smallest estimated RMSPE from cross-validation occurs when \\(K =\\) 11. If we want to compare this multivariable K-NN regression model to the model with only a single predictor as part of the model tuning process (e.g., if we are running forward selection as described in the chapter on evaluating and tuning classification models), then we must compare the RMSPE estimated using only the training data via cross-validation. Looking back, the estimated cross-validation RMSPE for the single-predictor model was $84,561. The estimated cross-validation RMSPE for the multivariable model is $81,839. Thus in this case, we did not improve the model by a large amount by adding this additional predictor. Regardless, let’s continue the analysis to see how we can make predictions with a multivariable K-NN regression model and evaluate its performance on test data. We first need to re-train the model on the entire training data set with \\(K =\\) 11, and then use that model to make predictions on the test data. 
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = sacr_k) |> set_engine("kknn") |> set_mode("regression") knn_mult_fit <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) |> fit(data = sacramento_train) knn_mult_preds <- knn_mult_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) |> filter(.metric == 'rmse') knn_mult_mets ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 90862. This time, when we performed K-NN regression on the same data set, but also included number of bedrooms as a predictor, we obtained an RMSPE test error of $90,862. Figure 7.10 visualizes the model’s predictions overlaid on top of the data. This time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2 predictors instead of 1. Figure 7.10: K-NN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three variables (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes. We can see that the predictions in this case, where we have 2 predictors, form a surface instead of a line. Because the newly added predictor (number of bedrooms) is related to price (as price changes, so does number of bedrooms) and is not totally determined by house size (our other predictor), we get additional and useful information for making our predictions. For example, in this model we would predict that the cost of a house with a size of 2,500 square feet generally increases slightly as the number of bedrooms increases. Without the number of bedrooms as an additional predictor, we would predict the same price for any two houses of that size, regardless of how many bedrooms each has.
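To make the role of a second predictor concrete, here is a minimal sketch in base R. It is not the tidymodels/kknn workflow used in this chapter: the `knn_predict` helper is a hypothetical hand-rolled K-NN regression written only for illustration, and the six training rows are made-up values rather than Sacramento data. The sketch standardizes the predictors, finds the \\(K\\) nearest training rows, and averages their prices.

```r
# Toy training data (hypothetical values, NOT the Sacramento data set)
toy_train <- data.frame(
  sqft  = c(2400, 2500, 2600, 2450, 2580, 2500),
  beds  = c(2, 2, 3, 4, 4, 5),
  price = c(300000, 310000, 330000, 350000, 360000, 380000)
)

# Hypothetical helper: hand-rolled K-NN regression.
# 1. standardize the chosen predictors (center and scale),
# 2. standardize the new point with the same center and scale,
# 3. average the prices of the k closest training rows.
knn_predict <- function(train, new_point, predictors, k) {
  x <- scale(as.matrix(train[, predictors, drop = FALSE]))
  new_x <- (unlist(new_point[predictors]) - attr(x, "scaled:center")) /
    attr(x, "scaled:scale")
  dists <- sqrt(rowSums(sweep(x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  mean(train$price[nearest])
}

# Two houses of the same size with different numbers of bedrooms
house_a <- list(sqft = 2500, beds = 2)
house_b <- list(sqft = 2500, beds = 5)

# Using house size only: both houses have the same neighbors
pred_a_sqft <- knn_predict(toy_train, house_a, "sqft", k = 2)
pred_b_sqft <- knn_predict(toy_train, house_b, "sqft", k = 2)

# Using house size and bedrooms: the neighbors (and predictions) differ
pred_a_both <- knn_predict(toy_train, house_a, c("sqft", "beds"), k = 2)
pred_b_both <- knn_predict(toy_train, house_b, c("sqft", "beds"), k = 2)
```

With house size as the only predictor, the two houses share the same nearest neighbors and so receive the same prediction. Once the number of bedrooms is included, the neighbor sets differ and so do the predictions.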
7.10 Strengths and limitations of K-NN regression As with K-NN classification (or any prediction algorithm for that matter), K-NN regression has both strengths and weaknesses. Some are listed here: Strengths: K-nearest neighbors regression is a simple, intuitive algorithm, requires few assumptions about what the data must look like, and works well with non-linear relationships (i.e., if the relationship is not a straight line). Weaknesses: K-nearest neighbors regression becomes very slow as the training data gets larger, may not perform well with a large number of predictors, and may not predict well beyond the range of values input in your training data. 7.11 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Regression I: K-nearest neighbors” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["regression2.html", "Chapter 8 Regression II: linear regression 8.1 Overview 8.2 Chapter learning objectives 8.3 Simple linear regression 8.4 Linear regression in R 8.5 Comparing simple linear and K-NN regression 8.6 Multivariable linear regression 8.7 Multicollinearity and outliers 8.8 Designing new predictors 8.9 The other sides of regression 8.10 Exercises 8.11 Additional resources", " Chapter 8 Regression II: linear regression 8.1 Overview Up to this point, we have solved all of our predictive problems—both classification and regression—using K-nearest neighbors (K-NN)-based approaches. 
In the context of regression, there is another commonly used method known as linear regression. This chapter provides an introduction to the basic concept of linear regression, shows how to use tidymodels to perform linear regression in R, and characterizes its strengths and weaknesses compared to K-NN regression. The focus is, as usual, on the case where there is a single predictor and single response variable of interest; but the chapter concludes with an example using multivariable linear regression when there is more than one predictor. 8.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Use R to fit simple and multivariable linear regression models on training data. Evaluate the linear regression model on test data. Compare and contrast predictions obtained from K-nearest neighbors regression to those obtained using linear regression from the same data set. Describe how linear regression is affected by outliers and multicollinearity. 8.3 Simple linear regression At the end of the previous chapter, we noted some limitations of K-NN regression. While the method is simple and easy to understand, K-NN regression does not predict well beyond the range of the predictors in the training data, and the method gets significantly slower as the training data set grows. Fortunately, there is an alternative to K-NN regression—linear regression—that addresses both of these limitations. Linear regression is also very commonly used in practice because it provides an interpretable mathematical equation that describes the relationship between the predictor and response variables. In this first part of the chapter, we will focus on simple linear regression, which involves only one predictor variable and one response variable; later on, we will consider multivariable linear regression, which involves multiple predictor variables. 
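The extrapolation limitation of K-NN regression mentioned above can be sketched with a small base R example. Everything here is an assumption made for illustration: the data are synthetic (generated from a straight line plus noise), `knn_1d` is a hypothetical hand-rolled one-predictor K-NN regression, and `lm_fit_toy` is a plain `lm` fit rather than the tidymodels workflow used in this book.

```r
set.seed(1)

# Synthetic training data: sizes from 1,000 to 3,000 square feet,
# with prices following a straight line plus random noise
size  <- seq(1000, 3000, by = 100)
price <- 20000 + 130 * size + rnorm(length(size), sd = 10000)

# Hypothetical helper: one-predictor K-NN regression that averages
# the prices of the k training houses closest in size to x_new
knn_1d <- function(x_new, k = 3) {
  nearest <- order(abs(size - x_new))[seq_len(k)]
  mean(price[nearest])
}

# Straight line of best fit on the same data
lm_fit_toy <- lm(price ~ size)

# Predict well beyond the largest house in the training data
knn_4000 <- knn_1d(4000)
knn_6000 <- knn_1d(6000)
lm_4000 <- unname(predict(lm_fit_toy, data.frame(size = 4000)))
lm_6000 <- unname(predict(lm_fit_toy, data.frame(size = 6000)))
```

Beyond the largest training house, the K-NN predictions stop changing because the same three largest houses are always the nearest neighbors, whereas the fitted line keeps increasing with house size.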
Like K-NN regression, simple linear regression involves predicting a numerical response variable (like race time, house price, or height); but how it makes those predictions for a new observation is quite different from K-NN regression. Instead of looking at the K nearest neighbors and averaging over their values for a prediction, in simple linear regression, we create a straight line of best fit through the training data and then “look up” the prediction using the line. Note: Although we did not cover it in earlier chapters, there is another popular method for classification called logistic regression (it is used for classification even though the name, somewhat confusingly, has the word “regression” in it). In logistic regression—similar to linear regression—you “fit” the model to the training data and then “look up” the prediction for each new observation. Logistic regression and K-NN classification have an advantage/disadvantage comparison similar to that of linear regression and K-NN regression. It is useful to have a good understanding of linear regression before learning about logistic regression. After reading this chapter, see the “Additional Resources” section at the end of the classification chapters to learn more about logistic regression. Let’s return to the Sacramento housing data from Chapter 7 to learn how to apply linear regression and compare it to K-NN regression. For now, we will consider a smaller version of the housing data to help make our visualizations clear. Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict its sale price? In particular, recall that we have come across a new 2,000 square-foot house we are interested in purchasing with an advertised list price of $350,000. Should we offer the list price, or is that over/undervalued? To answer this question using simple linear regression, we use the data we have to draw the straight line of best fit through our existing data points. 
The small subset of data as well as the line of best fit are shown in Figure 8.1. Figure 8.1: Scatter plot of sale price versus size with line of best fit for subset of the Sacramento housing data. The equation for the straight line is: \\[\\text{house sale price} = \\beta_0 + \\beta_1 \\cdot (\\text{house size}),\\] where \\(\\beta_0\\) is the vertical intercept of the line (the price when house size is 0) \\(\\beta_1\\) is the slope of the line (how quickly the price increases as you increase house size) Therefore using the data to find the line of best fit is equivalent to finding coefficients \\(\\beta_0\\) and \\(\\beta_1\\) that parametrize (correspond to) the line of best fit. Now of course, in this particular problem, the idea of a 0 square-foot house is a bit silly; but you can think of \\(\\beta_0\\) here as the “base price,” and \\(\\beta_1\\) as the increase in price for each square foot of space. Let’s push this thought even further: what would happen in the equation for the line if you tried to evaluate the price of a house with size 6 million square feet? Or what about negative 2,000 square feet? As it turns out, nothing in the formula breaks; linear regression will happily make predictions for nonsensical predictor values if you ask it to. But even though you can make these wild predictions, you shouldn’t. You should only make predictions roughly within the range of your original data, and perhaps a bit beyond it only if it makes sense. For example, the data in Figure 8.1 only reaches around 800 square feet on the low end, but it would probably be reasonable to use the linear regression model to make a prediction at 600 square feet, say. Back to the example! Once we have the coefficients \\(\\beta_0\\) and \\(\\beta_1\\), we can use the equation above to evaluate the predicted sale price given the value we have for the predictor variable—here 2,000 square feet. Figure 8.2 demonstrates this process. 
Figure 8.2: Scatter plot of sale price versus size with line of best fit and a red dot at the predicted sale price for a 2,000 square-foot home. By using simple linear regression on this small data set to predict the sale price for a 2,000 square-foot house, we get a predicted value of $295,564. But wait a minute… how exactly does simple linear regression choose the line of best fit? Many different lines could be drawn through the data points. Some plausible examples are shown in Figure 8.3. Figure 8.3: Scatter plot of sale price versus size with many possible lines that could be drawn through the data points. Simple linear regression chooses the straight line of best fit by choosing the line that minimizes the average squared vertical distance between itself and each of the observed data points in the training data (equivalent to minimizing the RMSE). Figure 8.4 illustrates these vertical distances as red lines. Finally, to assess the predictive accuracy of a simple linear regression model, we use RMSPE—the same measure of predictive performance we used with K-NN regression. Figure 8.4: Scatter plot of sale price versus size with red lines denoting the vertical distances between the predicted values and the observed data points. 8.4 Linear regression in R We can perform simple linear regression in R using tidymodels in a very similar manner to how we performed K-NN regression. To do this, instead of creating a nearest_neighbor model specification with the kknn engine, we use a linear_reg model specification with the lm engine. Another difference is that we do not need to choose \\(K\\) in the context of linear regression, and so we do not need to perform cross-validation. Below we illustrate how we can use the usual tidymodels workflow to predict house sale price given house size using a simple linear regression approach using the full Sacramento real estate data set. 
As usual, we start by loading packages, setting the seed, loading data, and putting some test data away in a lock box that we can come back to after we choose our final model. Let’s take care of that now. library(tidyverse) library(tidymodels) set.seed(7) sacramento <- read_csv("data/sacramento.csv") sacramento_split <- initial_split(sacramento, prop = 0.75, strata = price) sacramento_train <- training(sacramento_split) sacramento_test <- testing(sacramento_split) Now that we have our training data, we will create the model specification and recipe, and fit our simple linear regression model: lm_spec <- linear_reg() |> set_engine("lm") |> set_mode("regression") lm_recipe <- recipe(price ~ sqft, data = sacramento_train) lm_fit <- workflow() |> add_recipe(lm_recipe) |> add_model(lm_spec) |> fit(data = sacramento_train) lm_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ────────── ## 0 Recipe Steps ## ## ── Model ────────── ## ## Call: ## stats::lm(formula = ..y ~ ., data = data) ## ## Coefficients: ## (Intercept) sqft ## 18450.3 134.8 Note: An additional difference that you will notice here is that we do not standardize (i.e., scale and center) our predictors. In K-nearest neighbors models, recall that the model fit changes depending on whether we standardize first or not. In linear regression, standardization does not affect the fit (it does affect the coefficients in the equation, though!). So you can standardize if you want—it won’t hurt anything—but if you leave the predictors in their original form, the best fit coefficients are usually easier to interpret afterward. Our coefficients are (intercept) \\(\\beta_0=\\) 18450 and (slope) \\(\\beta_1=\\) 135. 
This means that the equation of the line of best fit is \\[\\text{house sale price} = 18450 + 135\\cdot (\\text{house size}).\\] In other words, the model predicts that houses start at $18,450 for 0 square feet, and that every extra square foot increases the cost of the house by $135. Finally, we predict on the test data set to assess how well our model does: lm_test_results <- lm_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) |> metrics(truth = price, estimate = .pred) lm_test_results ## # A tibble: 3 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 88528. ## 2 rsq standard 0.608 ## 3 mae standard 61892. Our final model’s test error as assessed by RMSPE is $88,528. Remember that this is in units of the response variable, and here that is US Dollars (USD). Does this mean our model is “good” at predicting house sale price based off of the predictor of home size? Again, answering this is tricky and requires knowledge of how you intend to use the prediction. To visualize the simple linear regression model, we can plot the predicted house sale price across all possible house sizes we might encounter. Since our model is linear, we only need to compute the predicted price of the minimum and maximum house size, and then connect them with a straight line. We superimpose this prediction line on a scatter plot of the original housing price data, so that we can qualitatively assess if the model seems to fit the data well. Figure 8.5 displays the result. 
sqft_prediction_grid <- tibble( sqft = c( sacramento |> select(sqft) |> min(), sacramento |> select(sqft) |> max() ) ) sacr_preds <- lm_fit |> predict(sqft_prediction_grid) |> bind_cols(sqft_prediction_grid) lm_plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + geom_line(data = sacr_preds, mapping = aes(x = sqft, y = .pred), color = "steelblue", linewidth = 1) + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + theme(text = element_text(size = 12)) lm_plot_final Figure 8.5: Scatter plot of sale price versus size with line of best fit for the full Sacramento housing data. We can extract the coefficients from our model by accessing the fit object that is output by the fit function; we first have to extract it from the workflow using the extract_fit_parsnip function, and then apply the tidy function to convert the result into a data frame: coeffs <- lm_fit |> extract_fit_parsnip() |> tidy() coeffs ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 18450. 7916. 2.33 2.01e- 2 ## 2 sqft 135. 4.31 31.2 1.37e-134 8.5 Comparing simple linear and K-NN regression Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let’s look at the visualization of the simple linear regression model predictions for the Sacramento real estate data (predicting price from house size) and the “best” K-NN regression model obtained from the same problem, shown in Figure 8.6. Figure 8.6: Comparison of simple linear regression and K-NN regression. What differences do we observe in Figure 8.6? One obvious difference is the shape of the blue lines. In simple linear regression we are restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. 
But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the response variable we predict given a unit increase in the predictor variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line. There can, however, also be a disadvantage to using a simple linear regression model in some cases, particularly when the relationship between the response and the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In these cases the prediction model from a simple linear regression will underfit, meaning that model/predicted values do not match the actual observed values very well. Such a model would probably have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future books that may do even better at predicting with such data. How do these two models compare on the Sacramento house prices data set? In Figure 8.6, we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model. Finally, note that the K-NN regression model becomes “flat” at the left and right boundaries of the data, while the linear model predicts a constant slope. 
Predicting outside the range of the observed data is known as extrapolation; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model may have actually predicted a negative price for a small house (if the intercept \\(\\beta_0\\) was negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, so the “flat” extrapolation of K-NN likely does not match reality. 8.6 Multivariable linear regression As in K-NN classification and K-NN regression, we can move beyond the simple case of only one predictor to the case with multiple predictors, known as multivariable linear regression. To do this, we follow a very similar approach to what we did for K-NN regression: we just add more predictors to the model formula in the recipe. But recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. Note once again that we have the same concerns regarding multiple predictors as in the settings of multivariable K-NN regression and classification: having more predictors is not always better. But because the same predictor selection algorithm from the classification chapter extends to the setting of linear regression, it will not be covered again in this chapter. We will demonstrate multivariable linear regression using the Sacramento real estate data with both house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our response variable. 
We will start by changing the formula in the recipe to include both the sqft and beds variables as predictors: mlm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) Now we can build our workflow and fit the model: mlm_fit <- workflow() |> add_recipe(mlm_recipe) |> add_model(lm_spec) |> fit(data = sacramento_train) mlm_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ────────── ## 0 Recipe Steps ## ## ── Model ────────── ## ## Call: ## stats::lm(formula = ..y ~ ., data = data) ## ## Coefficients: ## (Intercept) sqft beds ## 72547.8 160.6 -29644.3 And finally, we make predictions on the test data set to assess the quality of our model: lm_mult_test_results <- mlm_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) |> metrics(truth = price, estimate = .pred) lm_mult_test_results ## # A tibble: 3 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 88739. ## 2 rsq standard 0.603 ## 3 mae standard 61732. Our model’s test error as assessed by RMSPE is $88,739. In the case of two predictors, the predictions made by our linear regression form a plane of best fit, as shown in Figure 8.7. Figure 8.7: Linear regression plane of best fit overlaid on top of the data (using house size and number of bedrooms as predictors of price). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the regression plane looks like for learning purposes. We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the wiggly, flexible surface we get from other methods such as K-NN regression. As discussed, this can be advantageous in one aspect, which is that for each predictor, we can get slopes/intercept from linear regression, and thus describe the plane mathematically.
We can extract those slope values from our model object as shown below: mcoeffs <- mlm_fit |> extract_fit_parsnip() |> tidy() mcoeffs ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 72548. 11670. 6.22 8.76e- 10 ## 2 sqft 161. 5.93 27.1 8.34e-111 ## 3 beds -29644. 4799. -6.18 1.11e- 9 And then use those slopes to write a mathematical equation to describe the prediction plane: \\[\\text{house sale price} = \\beta_0 + \\beta_1\\cdot(\\text{house size}) + \\beta_2\\cdot(\\text{number of bedrooms}),\\] where: \\(\\beta_0\\) is the vertical intercept of the hyperplane (the price when both house size and number of bedrooms are 0) \\(\\beta_1\\) is the slope for the first predictor (how quickly the price changes as you increase house size, holding number of bedrooms constant) \\(\\beta_2\\) is the slope for the second predictor (how quickly the price changes as you increase the number of bedrooms, holding house size constant) Finally, we can fill in the values for \\(\\beta_0\\), \\(\\beta_1\\) and \\(\\beta_2\\) from the model output above to create the equation of the plane of best fit to the data: \\[\\text{house sale price} = 72548 + 161\\cdot (\\text{house size}) -29644 \\cdot (\\text{number of bedrooms})\\] This model is more interpretable than the multivariable K-NN regression model; we can write a mathematical equation that explains how each predictor is affecting the predictions. But as always, we should question how well multivariable linear regression is doing compared to the other tools we have, such as simple linear regression and multivariable K-NN regression. If this comparison is part of the model tuning process—for example, if we are trying out many different sets of predictors for multivariable linear and K-NN regression—we must perform this comparison using cross-validation on only our training data. 
But if we have already decided on a small number (e.g., 2 or 3) of tuned candidate models and we want to make a final comparison, we can do so by comparing the prediction error of the methods on the test data. lm_mult_test_results ## # A tibble: 3 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 88739. ## 2 rsq standard 0.603 ## 3 mae standard 61732. We obtain an RMSPE for the multivariable linear regression model of $88,739.45. This prediction error is less than the prediction error for the multivariable K-NN regression model, indicating that we should likely choose linear regression for predictions of house sale price on this data set. Revisiting the simple linear regression model with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was $88,527.75, which is almost the same as that of our more complex model. As mentioned earlier, this is not always the case: often including more predictors will either positively or negatively impact the prediction performance on unseen test data. 8.7 Multicollinearity and outliers What can go wrong when performing (possibly multivariable) linear regression? This section will introduce two common issues—outliers and collinear predictors—and illustrate their impact on predictions. 8.7.1 Outliers Outliers are data points that do not follow the usual pattern of the rest of the data. In the setting of linear regression, these are points that have a vertical distance to the line of best fit that is either much higher or much lower than you might expect based on the rest of the data. The problem with outliers is that they can have too much influence on the line of best fit. In general, it is very difficult to judge accurately which data are outliers without advanced techniques that are beyond the scope of this book. 
But to illustrate what can happen when you have outliers, Figure 8.8 shows a small subset of the Sacramento housing data again, except we have added a single data point (highlighted in red). This house is 5,000 square feet in size, and sold for only $50,000. Unbeknownst to the data analyst, this house was sold by a parent to their child for an absurdly low price. Of course, this is not representative of the real housing market values that the other data points follow; the data point is an outlier. In blue we plot the original line of best fit, and in red we plot the new line of best fit including the outlier. You can see how different the red line is from the blue line, which is entirely caused by that one extra outlier data point. Figure 8.8: Scatter plot of a subset of the data, with outlier highlighted in red. Fortunately, if you have enough data, the inclusion of one or two outliers—as long as their values are not too wild—will typically not have a large effect on the line of best fit. Figure 8.9 shows how that same outlier data point from earlier influences the line of best fit when we are working with the entire original Sacramento training data. You can see that with this larger data set, the line changes much less when adding the outlier. Nevertheless, it is still important when working with linear regression to critically think about how much any individual data point is influencing the model. Figure 8.9: Scatter plot of the full data, with outlier highlighted in red. 8.7.2 Multicollinearity The second, and much more subtle, issue can occur when performing multivariable linear regression. In particular, if you include multiple predictors that are strongly linearly related to one another, the coefficients that describe the plane of best fit can be very unreliable—small changes to the data can result in large changes in the coefficients. Consider an extreme example using the Sacramento housing data where the house was measured twice by two people. 
Since the two people are each slightly inaccurate, the two measurements might not agree exactly, but they are very strongly linearly related to each other, as shown in Figure 8.10. Figure 8.10: Scatter plot of house size (in square feet) measured by person 1 versus house size (in square feet) measured by person 2. If we again fit the multivariable linear regression model on this data, then the plane of best fit has regression coefficients that are very sensitive to the exact values in the data. For example, if we change the data ever so slightly—e.g., by running cross-validation, which splits up the data randomly into different chunks—the coefficients vary by large amounts: Best Fit 1: \\(\\text{house sale price} = 22535 + (220)\\cdot (\\text{house size 1 (ft$^2$)}) + (-86) \\cdot (\\text{house size 2 (ft$^2$)}).\\) Best Fit 2: \\(\\text{house sale price} = 15966 + (86)\\cdot (\\text{house size 1 (ft$^2$)}) + (49) \\cdot (\\text{house size 2 (ft$^2$)}).\\) Best Fit 3: \\(\\text{house sale price} = 17178 + (107)\\cdot (\\text{house size 1 (ft$^2$)}) + (27) \\cdot (\\text{house size 2 (ft$^2$)}).\\) Therefore, when performing multivariable linear regression, it is important to avoid including very linearly related predictors. However, techniques for doing so are beyond the scope of this book; see the list of additional resources at the end of this chapter to find out where you can learn more. 8.8 Designing new predictors We were quite fortunate in our initial exploration to find a predictor variable (house size) that seems to have a meaningful and nearly linear relationship with our response variable (sale price). But what should we do if we cannot immediately find such a nice variable? Well, sometimes it is just a fact that the variables in the data do not have enough of a relationship with the response variable to provide useful predictions. 
For example, if the only available predictor was “the current house owner’s favorite ice cream flavor”, we likely would have little hope of using that variable to predict the house’s sale price (barring any future remarkable scientific discoveries about the relationship between the housing market and homeowner ice cream preferences). In cases like these, the only option is to obtain measurements of more useful variables. There are, however, a wide variety of cases where the predictor variables do have a meaningful relationship with the response variable, but that relationship does not fit the assumptions of the regression method you have chosen. For example, a data frame df with two variables—x and y—with a nonlinear relationship between the two variables will not be fully captured by simple linear regression, as shown in Figure 8.11. df ## # A tibble: 100 × 2 ## x y ## <dbl> <dbl> ## 1 0.102 0.0720 ## 2 0.800 0.532 ## 3 0.478 0.148 ## 4 0.972 1.01 ## 5 0.846 0.677 ## 6 0.405 0.157 ## 7 0.879 0.768 ## 8 0.130 0.0402 ## 9 0.852 0.576 ## 10 0.180 0.0847 ## # ℹ 90 more rows Figure 8.11: Example of a data set with a nonlinear relationship between the predictor and the response. Instead of trying to predict the response y using a linear regression on x, we might have some scientific background about our problem to suggest that y should be a cubic function of x. So before performing regression, we might create a new predictor variable z using the mutate function: df <- df |> mutate(z = x^3) Then we can perform linear regression for y using the predictor variable z, as shown in Figure 8.12. Here you can see that the transformed predictor z helps the linear regression model make more accurate predictions. Note that none of the y response values have changed between Figures 8.11 and 8.12; the only change is that the x values have been replaced by z values. Figure 8.12: Relationship between the transformed predictor and the response. 
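The effect of the transformed predictor can be sketched with simulated data. This is an illustration under stated assumptions, not the data behind Figures 8.11 and 8.12: the data frame sim_df, the noise level, and the helper rmse are all made up here, and the sketch uses base R lm rather than a tidymodels workflow so that it is self-contained. It compares error on the same data used for fitting purely for illustration; in a real analysis you would compare candidate features with cross-validation.

```r
set.seed(1)

# Simulated data (an assumption for illustration): y is roughly a cubic function of x
sim_df <- data.frame(x = runif(100))
sim_df$y <- sim_df$x^3 + rnorm(100, sd = 0.05)

# Engineer the new predictor z = x^3
sim_df$z <- sim_df$x^3

# Fit one linear regression on the original predictor and one on the engineered one
fit_x <- lm(y ~ x, data = sim_df)
fit_z <- lm(y ~ z, data = sim_df)

# Root mean squared error of the fitted values (hypothetical helper)
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

c(linear_in_x = rmse(fit_x), linear_in_z = rmse(fit_z))
```

In this simulation the model that uses z should have the smaller error, because the relationship between z and y is close to linear while the relationship between x and y is not.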
The process of transforming predictors (and potentially combining multiple predictors in the process) is known as feature engineering. In real data analysis problems, you will need to rely on a deep understanding of the problem, as well as the wrangling tools from previous chapters, to engineer useful new features that improve predictive performance. Note: Feature engineering is part of tuning your model, and as such you must not use your test data to evaluate the quality of the features you produce. You are free to use cross-validation, though! 8.9 The other sides of regression So far in this textbook we have used regression only in the context of prediction. However, regression can also be seen as a method to understand and quantify the effects of individual predictor variables on a response variable of interest. In the housing example from this chapter, beyond just using past data to predict future sale prices, we might also be interested in describing the individual relationships of house size and the number of bedrooms with house price, quantifying how strong each of these relationships is, and assessing how accurately we can estimate their magnitudes. And even beyond that, we may be interested in understanding whether the predictors cause changes in the price. These sides of regression are well beyond the scope of this book; but the material you have learned here should give you a foundation of knowledge that will serve you well when moving to more advanced books on the topic. 8.10 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Regression II: linear regression” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button.
You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 8.11 Additional resources The tidymodels website is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a nice beginner’s tutorial and an extensive list of more advanced examples that you can use to continue learning beyond the scope of this book. Modern Dive (Ismay and Kim 2020) is another textbook that uses the tidyverse / tidymodels framework. Chapter 6 complements the material in the current chapter well; it covers some slightly more advanced concepts than we do without getting mathematical. Give this chapter a read before moving on to the next reference. It is also worth noting that this book takes a more “explanatory” / “inferential” approach to regression in general (in Chapters 5, 6, and 10), which provides a nice complement to the predictive tack we take in the present book. An Introduction to Statistical Learning (James et al. 2013) provides a great next stop in the process of learning about regression. Chapter 3 covers linear regression at a slightly more mathematical level than we do here, but it is not too large a leap and so should provide a good stepping stone. Chapter 6 discusses how to pick a subset of “informative” predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the computational efficiency of linear regression. In contrast, the K-NN methods we covered earlier are indeed more flexible but become very slow when given lots of data. 
References "],["clustering.html", "Chapter 9 Clustering 9.1 Overview 9.2 Chapter learning objectives 9.3 Clustering 9.4 An illustrative example 9.5 K-means 9.6 K-means in R 9.7 Exercises 9.8 Additional resources", " Chapter 9 Clustering 9.1 Overview As part of exploratory data analysis, it is often helpful to see if there are meaningful subgroups (or clusters) in the data. This grouping can be used for many purposes, such as generating new questions or improving predictive analyses. This chapter provides an introduction to clustering using the K-means algorithm, including techniques to choose the number of clusters. 9.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe a situation in which clustering is an appropriate technique to use, and what insight it might extract from the data. Explain the K-means clustering algorithm. Interpret the output of a K-means analysis. Differentiate between clustering, classification, and regression. Identify when it is necessary to scale variables before clustering, and do this using R. Perform K-means clustering in R using tidymodels workflows. Use the elbow method to choose the number of clusters for K-means. Visualize the output of K-means clustering in R using colored scatter plots. Describe the advantages, limitations and assumptions of the K-means clustering algorithm. 9.3 Clustering Clustering is a data analysis technique involving separating a data set into subgroups of related data. For example, we might use clustering to separate a data set of documents into groups that correspond to topics, a data set of human genetic information into groups that correspond to ancestral subpopulations, or a data set of online customers into groups that correspond to purchasing behaviors. Once the data are separated, we can, for example, use the subgroups to generate new questions about the data and follow up with a predictive modeling exercise. 
In this course, clustering will be used only for exploratory analysis, i.e., uncovering patterns in the data. Note that clustering is a fundamentally different kind of task than classification or regression. In particular, both classification and regression are supervised tasks where there is a response variable (a category label or value), and we have examples of past data with labels/values that help us predict those of future data. By contrast, clustering is an unsupervised task, as we are trying to understand and examine the structure of data without any response variable labels or values to help us. This approach has both advantages and disadvantages. Clustering requires no additional annotation or input on the data. For example, while it would be nearly impossible to annotate all the articles on Wikipedia with human-made topic labels, we can cluster the articles without this information to find groupings corresponding to topics automatically. However, given that there is no response variable, it is not as easy to evaluate the “quality” of a clustering. With classification, we can use a test data set to assess prediction performance. In clustering, there is not a single good choice for evaluation. In this book, we will use visualization to ascertain the quality of a clustering, and leave rigorous evaluation for more advanced courses. As in the case of classification, there are many possible methods that we could use to cluster our observations to look for subgroups. In this book, we will focus on the widely used K-means algorithm (Lloyd 1982). In your future studies, you might encounter hierarchical clustering, principal component analysis, multidimensional scaling, and more; see the additional resources section at the end of this chapter for where to begin learning more about these other methods. Note: There are also so-called semisupervised tasks, where only some of the data come with response variable labels/values, but the vast majority don’t. 
The goal is to try to uncover underlying structure in the data that allows one to guess the missing labels. This sort of task is beneficial, for example, when one has an unlabeled data set that is too large to manually label, but one is willing to provide a few informative example labels as a “seed” to guess the labels for all the data. 9.4 An illustrative example In this chapter we will focus on a data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). This data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Site, and includes measurements for adult penguins (Figure 9.1) found near there (Gorman, Williams, and Fraser 2014). Our goal will be to use two variables—penguin bill and flipper length, both in millimeters—to determine whether there are distinct types of penguins in our data. Understanding this might help us with species discovery and classification in a data-driven way. Note that we have reduced the size of the data set to 18 observations and 2 variables; this will help us make clear visualizations that illustrate how clustering works for learning purposes. Figure 9.1: A Gentoo penguin. Before we get started, we will load the tidyverse metapackage as well as set a random seed. This will ensure we have access to the functions we need and that our analysis will be reproducible. As we will learn in more detail later in the chapter, setting the seed here is important because the K-means clustering algorithm uses randomness when choosing a starting position for each cluster. library(tidyverse) set.seed(1) Now we can load and preview the penguins data. 
penguins <- read_csv("data/penguins.csv") penguins ## # A tibble: 18 × 2 ## bill_length_mm flipper_length_mm ## <dbl> <dbl> ## 1 39.2 196 ## 2 36.5 182 ## 3 34.5 187 ## 4 36.7 187 ## 5 38.1 181 ## 6 39.2 190 ## 7 36 195 ## 8 37.8 193 ## 9 46.5 213 ## 10 46.1 215 ## 11 47.8 215 ## 12 45 220 ## 13 49.1 212 ## 14 43.3 208 ## 15 46 195 ## 16 46.7 195 ## 17 52.2 197 ## 18 46.8 189 We will begin by using a version of the data that we have standardized, penguins_standardized, to illustrate how K-means clustering works (recall standardization from Chapter 5). Later in this chapter, we will return to the original penguins data to see how to include standardization automatically in the clustering pipeline. penguins_standardized ## # A tibble: 18 × 2 ## bill_length_standardized flipper_length_standardized ## <dbl> <dbl> ## 1 -0.641 -0.190 ## 2 -1.14 -1.33 ## 3 -1.52 -0.922 ## 4 -1.11 -0.922 ## 5 -0.847 -1.41 ## 6 -0.641 -0.678 ## 7 -1.24 -0.271 ## 8 -0.902 -0.434 ## 9 0.720 1.19 ## 10 0.646 1.36 ## 11 0.963 1.36 ## 12 0.440 1.76 ## 13 1.21 1.11 ## 14 0.123 0.786 ## 15 0.627 -0.271 ## 16 0.757 -0.271 ## 17 1.78 -0.108 ## 18 0.776 -0.759 Next, we can create a scatter plot using this data set to see if we can detect subtypes or groups in our data set. ggplot(penguins_standardized, aes(x = flipper_length_standardized, y = bill_length_standardized)) + geom_point() + xlab("Flipper Length (standardized)") + ylab("Bill Length (standardized)") + theme(text = element_text(size = 12)) Figure 9.2: Scatter plot of standardized bill length versus standardized flipper length. Based on the visualization in Figure 9.2, we might suspect there are a few subtypes of penguins within our data set. We can see roughly 3 groups of observations in Figure 9.2, including: a small flipper and bill length group, a small flipper length, but large bill length group, and a large flipper and bill length group. 
Data visualization is a great tool to give us a rough sense of such patterns when we have a small number of variables. But if we are to group data (and select the number of groups) as part of a reproducible analysis, we need something a bit more automated. Additionally, finding groups via visualization becomes more difficult as we increase the number of variables we consider when clustering. The way to rigorously separate the data into groups is to use a clustering algorithm. In this chapter, we will focus on the K-means algorithm, a widely used and often very effective clustering method, combined with the elbow method for selecting the number of clusters. This procedure will separate the data into groups; Figure 9.3 shows these groups denoted by colored scatter points. Figure 9.3: Scatter plot of standardized bill length versus standardized flipper length with colored groups. What are the labels for these groups? Unfortunately, we don’t have any. K-means, like almost all clustering algorithms, just outputs meaningless “cluster labels” that are typically whole numbers: 1, 2, 3, etc. But in a simple case like this, where we can easily visualize the clusters on a scatter plot, we can give human-made labels to the groups using their positions on the plot: small flipper length and small bill length (orange cluster), small flipper length and large bill length (blue cluster), and large flipper length and large bill length (yellow cluster). Once we have made these determinations, we can use them to inform our species classifications or ask further questions about our data. For example, we might be interested in understanding the relationship between flipper length and bill length, and that relationship may differ depending on the type of penguin we have. 9.5 K-means 9.5.1 Measuring cluster quality The K-means algorithm is a procedure that groups data into K clusters.
It starts with an initial clustering of the data, and then iteratively improves it by making adjustments to the assignment of data to clusters until it cannot improve any further. But how do we measure the “quality” of a clustering, and what does it mean to improve it? In K-means clustering, we measure the quality of a cluster by its within-cluster sum-of-squared-distances (WSSD). Computing this involves two steps. First, we find the cluster centers by computing the mean of each variable over data points in the cluster. For example, suppose we have a cluster containing four observations, and we are using two variables, \\(x\\) and \\(y\\), to cluster the data. Then we would compute the coordinates, \\(\\mu_x\\) and \\(\\mu_y\\), of the cluster center via \\[\\mu_x = \\frac{1}{4}(x_1+x_2+x_3+x_4) \\quad \\mu_y = \\frac{1}{4}(y_1+y_2+y_3+y_4).\\] In the first cluster from the example, there are 4 data points. These are shown with their cluster center (standardized flipper length -0.35, standardized bill length 0.99) highlighted in Figure 9.4. Figure 9.4: Cluster 1 from the penguins_standardized data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The second step in computing the WSSD is to add up the squared distance between each point in the cluster and the cluster center. We use the straight-line / Euclidean distance formula that we learned about in Chapter 5. In the 4-observation cluster example above, we would compute the WSSD \\(S^2\\) via \\[\\begin{align*} S^2 = \\left((x_1 - \\mu_x)^2 + (y_1 - \\mu_y)^2\\right) + \\left((x_2 - \\mu_x)^2 + (y_2 - \\mu_y)^2\\right) + \\\\ \\left((x_3 - \\mu_x)^2 + (y_3 - \\mu_y)^2\\right) + \\left((x_4 - \\mu_x)^2 + (y_4 - \\mu_y)^2\\right). \\end{align*}\\] These distances are denoted by lines in Figure 9.5 for the first cluster of the penguin data example. Figure 9.5: Cluster 1 from the penguins_standardized data set example. 
Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The distances from the observations to the cluster center are represented as black lines. The larger the value of \\(S^2\\), the more spread out the cluster is, since large \\(S^2\\) means that points are far from the cluster center. Note, however, that “large” is relative to both the scale of the variables for clustering and the number of points in the cluster. A cluster where points are very close to the center might still have a large \\(S^2\\) if there are many data points in the cluster. After we have calculated the WSSD for all the clusters, we sum them together to get the total WSSD. For our example, this means adding up all the squared distances for the 18 observations. These distances are denoted by black lines in Figure 9.6. Figure 9.6: All clusters from the penguins_standardized data set example. Observations are small orange, blue, and yellow points with cluster centers denoted by larger points with a black outline. The distances from the observations to each of the respective cluster centers are represented as black lines. Since K-means uses the straight-line distance to measure the quality of a clustering, it is limited to clustering based on quantitative variables. However, note that there are variants of the K-means algorithm, as well as other clustering algorithms entirely, that use other distance metrics to allow for non-quantitative data to be clustered. These are beyond the scope of this book. 9.5.2 The clustering algorithm We begin the K-means algorithm by picking K, and randomly assigning a roughly equal number of observations to each of the K clusters. An example random initialization is shown in Figure 9.7. Figure 9.7: Random initialization of labels. Then K-means consists of two major steps that attempt to minimize the sum of WSSDs over all the clusters, i.e., the total WSSD: Center update: Compute the center of each cluster. 
Label update: Reassign each data point to the cluster with the nearest center. These two steps are repeated until the cluster assignments no longer change. We show what the first four iterations of K-means would look like in Figure 9.8. There each pair of plots in each row corresponds to an iteration, where the left figure in the pair depicts the center update, and the right figure in the pair depicts the label update (i.e., the reassignment of data to clusters). Figure 9.8: First four iterations of K-means clustering on the penguins_standardized example data set. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black. Note that at this point, we can terminate the algorithm since none of the assignments changed in the fourth iteration; both the centers and labels will remain the same from this point onward. Note: Is K-means guaranteed to stop at some point, or could it iterate forever? As it turns out, thankfully, the answer is that K-means is guaranteed to stop after some number of iterations. For the interested reader, the logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration, (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments are changing, and the algorithm terminates. 9.5.3 Random restarts Unlike the classification and regression models we studied in previous chapters, K-means can get “stuck” in a bad solution. For example, Figure 9.9 illustrates an unlucky random initialization by K-means. Figure 9.9: Random initialization of labels. 
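The center update and label update from Section 9.5.2, together with the total WSSD from Section 9.5.1, can be sketched in a few lines of base R. This is a teaching sketch under stated assumptions, not the implementation used by the "stats" engine: the data are simulated (three groups of six points, loosely imitating the standardized penguin data), and all names (x, labels, centers, wssd_history) are made up for illustration.

```r
set.seed(1)

# Simulated data (an assumption): 18 points scattered around three group centers
centers_true <- matrix(c(-1, -1, 1, 0, 0.5, 1.5), ncol = 2, byrow = TRUE)
x <- centers_true[rep(1:3, each = 6), ] + matrix(rnorm(36, sd = 0.2), ncol = 2)

K <- 3
# Random initialization: assign an equal number of observations to each cluster
labels <- sample(rep(1:K, length.out = nrow(x)))
centers <- matrix(NA_real_, nrow = K, ncol = ncol(x))
wssd_history <- c()

repeat {
  # Center update: each center is the mean of the points currently in its cluster
  # (a cluster that becomes empty keeps its previous center)
  for (k in 1:K) {
    members <- labels == k
    if (any(members)) centers[k, ] <- colMeans(x[members, , drop = FALSE])
  }
  # Total WSSD: squared straight-line distance from each point to its own center
  wssd_history <- c(wssd_history, sum((x - centers[labels, , drop = FALSE])^2))

  # Label update: squared distance from every point to every center, then pick the nearest
  d <- sapply(1:K, function(k) rowSums(sweep(x, 2, centers[k, ])^2))
  new_labels <- apply(d, 1, which.min)

  # Stop when no assignment changes
  if (all(new_labels == labels)) break
  labels <- new_labels
}

wssd_history
```

The recorded total WSSD never increases from one iteration to the next, which is the key fact behind the termination argument in the note above. Like any single run of K-means, this sketch can still stop at a poor clustering if the initialization is unlucky.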
Figure 9.10 shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure 9.9. Figure 9.10: First five iterations of K-means clustering on the penguins_standardized example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black. This looks like a relatively bad clustering of the data, but K-means cannot improve it. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD. 9.5.4 Choosing K In order to cluster data using K-means, we also have to pick the number of clusters, K. But unlike in classification, we have no response variable and cannot perform cross-validation with some measure of model prediction error. Further, if K is chosen too small, then multiple clusters get grouped together; if K is too large, then clusters get subdivided. In both cases, we will potentially miss interesting structure in the data. Figure 9.11 illustrates the impact of K on K-means clustering of our penguin flipper and bill length data by showing the different clusterings for K’s ranging from 1 to 9. Figure 9.11: Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black. If we set K less than 3, then the clustering merges separate groups of data; this causes a large total WSSD, since the cluster center is not close to any of the data in the cluster. On the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still decrease the total WSSD, but by only a diminishing amount. 
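Both ideas, random restarts and the way total WSSD shrinks as K grows, can be sketched directly with the kmeans function from base R's stats package, whose nstart argument runs several random initializations and keeps the one with the lowest total WSSD. The data below are simulated with three well-separated groups (an assumption for illustration, not the penguin data), and the names x, ks, and total_wssd are made up for this sketch.

```r
set.seed(1)

# Simulated data (an assumption): 60 points scattered around three group centers
centers_true <- matrix(c(-1, -1, 1, 0, 0.5, 1.5), ncol = 2, byrow = TRUE)
x <- centers_true[rep(1:3, each = 20), ] + matrix(rnorm(120, sd = 0.2), ncol = 2)

# For each K, run K-means with 10 random restarts and record the best total WSSD
ks <- 1:9
total_wssd <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

data.frame(num_clusters = ks, total_wssd = round(total_wssd, 2))
```

With K = 1 the total WSSD is just the sum of squared distances to the overall mean. In a simulation like this one we would expect a sharp drop up to K = 3 and only small improvements afterwards.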
If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly the right number of clusters (Figure 9.12). Figure 9.12: Total WSSD for K clusters ranging from 1 to 9. 9.6 K-means in R We can perform K-means clustering in R using a tidymodels workflow similar to those in the earlier classification and regression chapters. We will begin by loading the tidyclust library, which contains the necessary functionality. library(tidyclust) Returning to the original (unstandardized) penguins data, recall that K-means clustering uses straight-line distance to decide which points are similar to each other. Therefore, the scale of each of the variables in the data will influence which cluster data points end up being assigned to. Variables with a large scale will have a much larger effect on deciding cluster assignment than variables with a small scale. To address this problem, we need to create a recipe that standardizes our data before clustering using the step_scale and step_center preprocessing steps. Standardization will ensure that each variable has a mean of 0 and standard deviation of 1 prior to clustering. We will designate that all variables are to be used in clustering via the model formula ~ .. Note: Recipes were originally designed specifically for predictive data analysis problems, like classification and regression, not clustering problems. So the functions in R that we use to construct recipes are a little bit awkward in the setting of clustering. In particular, we will have to treat “predictors” here as if it meant “variables to be used in clustering”. So the model formula ~ . specifies that all variables are “predictors”, i.e., all variables should be used for clustering.
Similarly, when we use the all_predictors() function in the preprocessing steps, we really mean “apply this step to all variables used for clustering.” kmeans_recipe <- recipe(~ ., data=penguins) |> step_scale(all_predictors()) |> step_center(all_predictors()) kmeans_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## predictor: 2 ## ## ── Operations ## • Scaling for: all_predictors() ## • Centering for: all_predictors() To indicate that we are performing K-means clustering, we will use the k_means model specification. We will use the num_clusters argument to specify the number of clusters (here we choose K = 3), and specify that we are using the \"stats\" engine. kmeans_spec <- k_means(num_clusters = 3) |> set_engine("stats") kmeans_spec ## K Means Cluster Specification (partition) ## ## Main Arguments: ## num_clusters = 3 ## ## Computational engine: stats To actually run the K-means clustering, we combine the recipe and model specification in a workflow, and use the fit function. Note that the K-means algorithm uses a random initialization of assignments; but since we set the random seed earlier, the clustering will be reproducible. 
kmeans_fit <- workflow() |> add_recipe(kmeans_recipe) |> add_model(kmeans_spec) |> fit(data = penguins) kmeans_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: k_means() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## K-means clustering with 3 clusters of sizes 4, 6, 8 ## ## Cluster means: ## bill_length_mm flipper_length_mm ## 1 0.9858721 -0.3524358 ## 2 0.6828058 1.2606357 ## 3 -1.0050404 -0.7692589 ## ## Clustering vector: ## [1] 3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1 ## ## Within cluster sum of squares by cluster: ## [1] 1.098928 1.247042 2.121932 ## (between_SS / total_SS = 86.9 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" ## [6] "betweenss" "size" "iter" "ifault" As you can see above, the fit object has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD. Let’s start by visualizing the clusters as a colored scatter plot! In order to do that, we first need to augment our original data frame with the cluster assignments. We can achieve this using the augment function from tidyclust. clustered_data <- kmeans_fit |> augment(penguins) clustered_data ## # A tibble: 18 × 3 ## bill_length_mm flipper_length_mm .pred_cluster ## <dbl> <dbl> <fct> ## 1 39.2 196 Cluster_1 ## 2 36.5 182 Cluster_1 ## 3 34.5 187 Cluster_1 ## 4 36.7 187 Cluster_1 ## 5 38.1 181 Cluster_1 ## 6 39.2 190 Cluster_1 ## 7 36 195 Cluster_1 ## 8 37.8 193 Cluster_1 ## 9 46.5 213 Cluster_2 ## 10 46.1 215 Cluster_2 ## 11 47.8 215 Cluster_2 ## 12 45 220 Cluster_2 ## 13 49.1 212 Cluster_2 ## 14 43.3 208 Cluster_2 ## 15 46 195 Cluster_3 ## 16 46.7 195 Cluster_3 ## 17 52.2 197 Cluster_3 ## 18 46.8 189 Cluster_3 Now that we have the cluster assignments included in the clustered_data tidy data frame, we can visualize them as shown in Figure 9.13. 
Note that we are plotting the un-standardized data here; if we for some reason wanted to visualize the standardized data from the recipe, we would need to use the bake function to obtain that first. cluster_plot <- ggplot(clustered_data, aes(x = flipper_length_mm, y = bill_length_mm, color = .pred_cluster), size = 2) + geom_point() + labs(x = "Flipper Length", y = "Bill Length", color = "Cluster") + scale_color_manual(values = c("steelblue", "darkorange", "goldenrod1")) + theme(text = element_text(size = 12)) cluster_plot Figure 9.13: The data colored by the cluster assignments returned by K-means. As mentioned above, we also need to select K by finding where the “elbow” occurs in the plot of total WSSD versus the number of clusters. We can obtain the total WSSD (tot.withinss) from our clustering with 3 clusters using the glance function. glance(kmeans_fit) ## # A tibble: 1 × 4 ## totss tot.withinss betweenss iter ## <dbl> <dbl> <dbl> <int> ## 1 34 4.47 29.5 2 To calculate the total WSSD for a variety of Ks, we will create a data frame with a column named num_clusters with rows containing each value of K we want to run K-means with (here, 1 to 9). penguin_clust_ks <- tibble(num_clusters = 1:9) penguin_clust_ks ## # A tibble: 9 × 1 ## num_clusters ## <int> ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 Then we construct our model specification again, this time specifying that we want to tune the num_clusters parameter. kmeans_spec <- k_means(num_clusters = tune()) |> set_engine("stats") kmeans_spec ## K Means Cluster Specification (partition) ## ## Main Arguments: ## num_clusters = tune() ## ## Computational engine: stats We combine the recipe and specification in a workflow, and then use the tune_cluster function to run K-means on each of the different settings of num_clusters. The grid argument controls which values of K we want to try—in this case, the values from 1 to 9 that are stored in the penguin_clust_ks data frame. 
We set the resamples argument to apparent(penguins) to tell K-means to run on the whole data set for each value of num_clusters. Finally, we collect the results using the collect_metrics function. kmeans_results <- workflow() |> add_recipe(kmeans_recipe) |> add_model(kmeans_spec) |> tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |> collect_metrics() kmeans_results ## # A tibble: 18 × 7 ## num_clusters .metric .estimator mean n std_err .config ## <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 sse_total standard 34 1 NA Preprocessor1_… ## 2 1 sse_within_total standard 34 1 NA Preprocessor1_… ## 3 2 sse_total standard 34 1 NA Preprocessor1_… ## 4 2 sse_within_total standard 10.9 1 NA Preprocessor1_… ## 5 3 sse_total standard 34 1 NA Preprocessor1_… ## 6 3 sse_within_total standard 4.47 1 NA Preprocessor1_… ## 7 4 sse_total standard 34 1 NA Preprocessor1_… ## 8 4 sse_within_total standard 3.54 1 NA Preprocessor1_… ## 9 5 sse_total standard 34 1 NA Preprocessor1_… ## 10 5 sse_within_total standard 2.23 1 NA Preprocessor1_… ## 11 6 sse_total standard 34 1 NA Preprocessor1_… ## 12 6 sse_within_total standard 1.75 1 NA Preprocessor1_… ## 13 7 sse_total standard 34 1 NA Preprocessor1_… ## 14 7 sse_within_total standard 2.06 1 NA Preprocessor1_… ## 15 8 sse_total standard 34 1 NA Preprocessor1_… ## 16 8 sse_within_total standard 2.46 1 NA Preprocessor1_… ## 17 9 sse_total standard 34 1 NA Preprocessor1_… ## 18 9 sse_within_total standard 0.906 1 NA Preprocessor1_… The total WSSD results correspond to the mean column when the .metric variable is equal to sse_within_total. We can obtain a tidy data frame with this information using filter and mutate. 
kmeans_results <- kmeans_results |> filter(.metric == "sse_within_total") |> mutate(total_WSSD = mean) |> select(num_clusters, total_WSSD) kmeans_results ## # A tibble: 9 × 2 ## num_clusters total_WSSD ## <int> <dbl> ## 1 1 34 ## 2 2 10.9 ## 3 3 4.47 ## 4 4 3.54 ## 5 5 2.23 ## 6 6 1.75 ## 7 7 2.06 ## 8 8 2.46 ## 9 9 0.906 Now that we have total_WSSD and num_clusters as columns in a data frame, we can make a line plot (Figure 9.14) and search for the “elbow” to find which value of K to use. elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) + geom_point() + geom_line() + xlab("K") + ylab("Total within-cluster sum of squares") + scale_x_continuous(breaks = 1:9) + theme(text = element_text(size = 12)) elbow_plot Figure 9.14: A plot showing the total WSSD versus the number of clusters. It looks like 3 clusters is the right choice for this data. But why is there a “bump” in the total WSSD plot here? Shouldn’t total WSSD always decrease as we add more clusters? Technically yes, but remember: K-means can get “stuck” in a bad solution. Unfortunately, for K = 8 we had an unlucky initialization and found a bad clustering! We can help prevent finding a bad clustering by trying a few different random initializations via the nstart argument in the model specification. Here we will try using 10 restarts. kmeans_spec <- k_means(num_clusters = tune()) |> set_engine("stats", nstart = 10) kmeans_spec ## K Means Cluster Specification (partition) ## ## Main Arguments: ## num_clusters = tune() ## ## Engine-Specific Arguments: ## nstart = 10 ## ## Computational engine: stats Now if we rerun the same workflow with the new model specification, K-means clustering will be performed nstart = 10 times for each value of K. For each value of K, the underlying kmeans function keeps only the best of the 10 runs (the one with the lowest total WSSD), and the collect_metrics function then reports the results for that best clustering.
Figure 9.15 shows the resulting total WSSD plot from using 10 restarts; the bump is gone and the total WSSD decreases as expected. The more times we perform K-means clustering, the more likely we are to find a good clustering (if one exists). What value should you choose for nstart? The answer is that it depends on many factors: the size and characteristics of your data set, as well as how powerful your computer is. The larger the nstart value the better from an analysis perspective, but there is a trade-off that doing many clusterings could take a long time. So this is something that needs to be balanced. kmeans_results <- workflow() |> add_recipe(kmeans_recipe) |> add_model(kmeans_spec) |> tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |> collect_metrics() |> filter(.metric == "sse_within_total") |> mutate(total_WSSD = mean) |> select(num_clusters, total_WSSD) elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) + geom_point() + geom_line() + xlab("K") + ylab("Total within-cluster sum of squares") + scale_x_continuous(breaks = 1:9) + theme(text = element_text(size = 12)) elbow_plot Figure 9.15: A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts. 9.7 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Clustering” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 9.8 Additional resources Chapter 10 of An Introduction to Statistical Learning (James et al. 
2013) provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers hierarchical clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers principal components analysis (PCA), which is a very popular technique for reducing the number of predictors in a data set. References "],["inference.html", "Chapter 10 Statistical inference 10.1 Overview 10.2 Chapter learning objectives 10.3 Why do we need sampling? 10.4 Sampling distributions 10.5 Bootstrapping 10.6 Exercises 10.7 Additional resources", " Chapter 10 Statistical inference 10.1 Overview A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that population; we typically do not get data on the entire population. Data analysis questions regarding how summaries, patterns, trends, or relationships in a data set extend to the wider population are called inferential questions. This chapter will start with the fundamental ideas of sampling from populations and then introduce two common techniques in statistical inference: point estimation and interval estimation. 10.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe real-world examples of questions that can be answered with statistical inference. Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution. Explain the difference between a population parameter and a sample point estimate. 
Use R to draw random samples from a finite population. Use R to create a sampling distribution from a finite population. Describe how sample size influences the sampling distribution. Define bootstrapping. Use R to create a bootstrap distribution to approximate a sampling distribution. Contrast the bootstrap and sampling distributions. 10.3 Why do we need sampling? We often need to understand how quantities we observe in a subset of data relate to the same quantities in the broader population. For example, suppose a retailer is considering selling iPhone accessories, and they want to estimate how big the market might be. Additionally, they want to strategize how they can market their products on North American college and university campuses. This retailer might formulate the following question: What proportion of all undergraduate students in North America own an iPhone? In the above question, we are interested in making a conclusion about all undergraduate students in North America; this is referred to as the population. In general, the population is the complete collection of individuals or cases we are interested in studying. Further, in the above question, we are interested in computing a quantity—the proportion of iPhone owners—based on the entire population. This proportion is referred to as a population parameter. In general, a population parameter is a numerical characteristic of the entire population. To compute this number in the example above, we would need to ask every single undergraduate in North America whether they own an iPhone. In practice, directly computing population parameters is often time-consuming and costly, and sometimes impossible. A more practical approach would be to make measurements for a sample, i.e., a subset of individuals collected from the population. We can then compute a sample estimate—a numerical characteristic of the sample—that estimates the population parameter. 
For example, suppose we randomly selected ten undergraduate students across North America (the sample) and computed the proportion of those students who own an iPhone (the sample estimate). In that case, we might suspect that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population. Figure 10.1 illustrates this process. In general, the process of using a sample to make a conclusion about the broader population from which it is taken is referred to as statistical inference. Figure 10.1: The process of using a sample from a broader population to obtain a point estimate of a population parameter. In this case, a sample of 10 individuals yielded 6 who own an iPhone, resulting in an estimated population proportion of 60% iPhone owners. The actual population proportion in this example illustration is 53.8%. Note that proportions are not the only kind of population parameter we might be interested in. For example, suppose an undergraduate student studying at the University of British Columbia in Canada is looking for an apartment to rent. They need to create a budget, so they want to know about studio apartment rental prices in Vancouver. This student might formulate the question: What is the average price per month of studio apartment rentals in Vancouver? In this case, the population consists of all studio apartment rentals in Vancouver, and the population parameter is the average price per month. Here we used the average as a measure of the center to describe the “typical value” of studio apartment rental prices. But even within this one example, we could also be interested in many other population parameters. For instance, we know that not every studio apartment rental in Vancouver will have the same price per month. The student might be interested in how much monthly prices vary and want to find a measure of the rentals’ spread (or variability), such as the standard deviation. 
Or perhaps the student might be interested in the fraction of studio apartment rentals that cost more than $1000 per month. The question we want to answer will help us determine the parameter we want to estimate. If we were somehow able to observe the whole population of studio apartment rental offerings in Vancouver, we could compute each of these numbers exactly; therefore, these are all population parameters. There are many kinds of observations and population parameters that you will run into in practice, but in this chapter, we will focus on two settings: Using categorical observations to estimate the proportion of a category Using quantitative observations to estimate the average (or mean) 10.4 Sampling distributions 10.4.1 Sampling distributions for proportions We will look at an example using data from Inside Airbnb (Cox n.d.). Airbnb is an online marketplace for arranging vacation rentals and places to stay. The data set contains listings for Vancouver, Canada, in September 2020. Our data includes an ID number, neighborhood, type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night. 
library(tidyverse) set.seed(123) airbnb <- read_csv("data/listings.csv") airbnb ## # A tibble: 4,594 × 8 ## id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 Downtown Entire h… 5 2 baths 2 2 150 ## 2 2 Downtown Eastside Entire h… 4 2 baths 2 2 132 ## 3 3 West End Entire h… 2 1 bath 1 1 85 ## 4 4 Kensington-Cedar… Entire h… 2 1 bath 1 0 146 ## 5 5 Kensington-Cedar… Entire h… 4 1 bath 1 2 110 ## 6 6 Hastings-Sunrise Entire h… 4 1 bath 2 3 195 ## 7 7 Renfrew-Collingw… Entire h… 8 3 baths 4 5 130 ## 8 8 Mount Pleasant Entire h… 2 1 bath 1 1 94 ## 9 9 Grandview-Woodla… Private … 2 1 privat… 1 1 79 ## 10 10 West End Private … 2 1 privat… 1 1 75 ## # ℹ 4,584 more rows Suppose the city of Vancouver wants information about Airbnb rentals to help plan city bylaws, and they want to know how many Airbnb places are listed as entire homes and apartments (rather than as private or shared rooms). Therefore they may want to estimate the true proportion of all Airbnb listings where the “type of place” is listed as “entire home or apartment.” Of course, we usually do not have access to the true population, but here let’s imagine (for learning purposes) that our data set represents the population of all Airbnb rental listings in Vancouver, Canada. We can find the proportion of listings where room_type == \"Entire home/apt\". airbnb |> summarize( n = sum(room_type == "Entire home/apt"), proportion = sum(room_type == "Entire home/apt") / nrow(airbnb) ) ## # A tibble: 1 × 2 ## n proportion ## <int> <dbl> ## 1 3434 0.747 We can see that the proportion of Entire home/apt listings in the data set is 0.747. This value, 0.747, is the population parameter. Remember, this parameter value is usually unknown in real data analysis problems, as it is typically not possible to make measurements for an entire population. Instead, perhaps we can approximate it with a small subset of data! 
To investigate this idea, let’s try randomly selecting 40 listings (i.e., taking a random sample of size 40 from our population), and computing the proportion for that sample. We will use the rep_sample_n function from the infer package to take the sample. The arguments of rep_sample_n are (1) the data frame to sample from, and (2) the size of the sample to take. library(infer) sample_1 <- rep_sample_n(tbl = airbnb, size = 40) airbnb_sample_1 <- summarize(sample_1, n = sum(room_type == "Entire home/apt"), prop = sum(room_type == "Entire home/apt") / 40 ) airbnb_sample_1 ## # A tibble: 1 × 3 ## replicate n prop ## <int> <int> <dbl> ## 1 1 28 0.7 Here we see that the proportion of entire home/apartment listings in this random sample is 0.7. Wow—that’s close to our true population value! But remember, we computed the proportion using a random sample of size 40. This has two consequences. First, this value is only an estimate, i.e., our best guess of our population parameter using this sample. Given that we are estimating a single value here, we often refer to it as a point estimate. Second, since the sample was random, if we were to take another random sample of size 40 and compute the proportion for that sample, we would not get the same answer: sample_2 <- rep_sample_n(airbnb, size = 40) airbnb_sample_2 <- summarize(sample_2, n = sum(room_type == "Entire home/apt"), prop = sum(room_type == "Entire home/apt") / 40 ) airbnb_sample_2 ## # A tibble: 1 × 3 ## replicate n prop ## <int> <int> <dbl> ## 1 1 35 0.875 Confirmed! We get a different value for our estimate this time. That means that our point estimate might be unreliable. Indeed, estimates vary from sample to sample due to sampling variability. But just how much should we expect the estimates of our random samples to vary? Or in other words, how much can we really trust our point estimate based on a single sample? 
To understand this, we will simulate many samples (many more than just two) of size 40 from our population of listings and calculate the proportion of entire home/apartment listings in each sample. This simulation will create many sample proportions, which we can visualize using a histogram. The distribution of the estimate for all possible samples of a given size (which we commonly refer to as \\(n\\)) from a population is called a sampling distribution. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40. We again use the rep_sample_n function to take samples of size 40 from our population of Airbnb listings. But this time we set the reps argument to 20,000 to specify that we want to take 20,000 samples of size 40. samples <- rep_sample_n(airbnb, size = 40, reps = 20000) samples ## # A tibble: 800,000 × 9 ## # Groups: replicate [20,000] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1 4403 Downtown Entire h… 2 1 bath 1 1 ## 2 1 902 Kensington-C… Private … 2 1 shared… 1 1 ## 3 1 3808 Hastings-Sun… Entire h… 6 1.5 baths 1 3 ## 4 1 561 Kensington-C… Entire h… 6 1 bath 2 2 ## 5 1 3385 Mount Pleasa… Entire h… 4 1 bath 1 1 ## 6 1 4232 Shaughnessy Entire h… 6 1.5 baths 2 2 ## 7 1 1169 Downtown Entire h… 3 1 bath 1 1 ## 8 1 959 Kitsilano Private … 1 1.5 shar… 1 1 ## 9 1 2171 Downtown Entire h… 2 1 bath 1 1 ## 10 1 1258 Dunbar South… Entire h… 4 1 bath 2 2 ## # ℹ 799,990 more rows ## # ℹ 1 more variable: price <dbl> Notice that the column replicate indicates the replicate, or sample, to which each listing belongs. Above, since by default R only prints the first few rows, it looks like all of the listings have replicate set to 1. But you can check the last few entries using the tail() function to verify that we indeed created 20,000 samples (or replicates).
tail(samples) ## # A tibble: 6 × 9 ## # Groups: replicate [1] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 20000 3414 Marpole Entire h… 4 1 bath 2 2 ## 2 20000 1974 Hastings-Sunr… Private … 2 1 shared… 1 1 ## 3 20000 1846 Riley Park Entire h… 4 1 bath 2 3 ## 4 20000 862 Downtown Entire h… 5 2 baths 2 2 ## 5 20000 3295 Victoria-Fras… Private … 2 1 shared… 1 1 ## 6 20000 997 Dunbar Southl… Private … 1 1.5 shar… 1 1 ## # ℹ 1 more variable: price <dbl> Now that we have obtained the samples, we need to compute the proportion of entire home/apartment listings in each sample. We first group the data by the replicate variable—to group the set of listings in each sample together—and then use summarize to compute the proportion in each sample. We print both the first and last few entries of the resulting data frame below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. sample_estimates <- samples |> group_by(replicate) |> summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40) sample_estimates ## # A tibble: 20,000 × 2 ## replicate sample_proportion ## <int> <dbl> ## 1 1 0.85 ## 2 2 0.85 ## 3 3 0.65 ## 4 4 0.7 ## 5 5 0.75 ## 6 6 0.725 ## 7 7 0.775 ## 8 8 0.775 ## 9 9 0.7 ## 10 10 0.675 ## # ℹ 19,990 more rows tail(sample_estimates) ## # A tibble: 6 × 2 ## replicate sample_proportion ## <int> <dbl> ## 1 19995 0.75 ## 2 19996 0.675 ## 3 19997 0.625 ## 4 19998 0.75 ## 5 19999 0.875 ## 6 20000 0.65 We can now visualize the sampling distribution of sample proportions for samples of size 40 using a histogram in Figure 10.2. Keep in mind: in the real world, we don’t have access to the full population. So we can’t take many samples and can’t actually construct or visualize the sampling distribution. 
We have created this particular example such that we do have access to the full population, which lets us visualize the sampling distribution directly for learning purposes. sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) + geom_histogram(color = "lightgrey", bins = 12) + labs(x = "Sample proportions", y = "Count") + theme(text = element_text(size = 12)) sampling_distribution Figure 10.2: Sampling distribution of the sample proportion for sample size 40. The sampling distribution in Figure 10.2 appears to be bell-shaped, is roughly symmetric, and has one peak. It is centered around 0.7 and the sample proportions range from about 0.4 to about 1. In fact, we can calculate the mean of the sample proportions. sample_estimates |> summarize(mean_proportion = mean(sample_proportion)) ## # A tibble: 1 × 1 ## mean_proportion ## <dbl> ## 1 0.747 We notice that the sample proportions are centered around the population proportion value, 0.747! In general, the mean of the sampling distribution should be equal to the population proportion. This is great news because it means that the sample proportion is neither an overestimate nor an underestimate of the population proportion. In other words, if you were to take many samples as we did above, there is no tendency towards over or underestimating the population proportion. In a real data analysis setting where you just have access to your single sample, this implies that you would suspect that your sample point estimate is roughly equally likely to be above or below the true population proportion. 10.4.2 Sampling distributions for means In the previous section, our variable of interest—room_type—was categorical, and the population parameter was a proportion. As mentioned in the chapter introduction, there are many choices of the population parameter for each type of variable. What if we wanted to infer something about a population of quantitative variables instead? 
For instance, a traveler visiting Vancouver, Canada may wish to estimate the population mean (or average) price per night of Airbnb listings. Knowing the average could help them tell whether a particular listing is overpriced. We can visualize the population distribution of the price per night with a histogram. population_distribution <- ggplot(airbnb, aes(x = price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) population_distribution Figure 10.3: Population distribution of price per night (dollars) for all Airbnb listings in Vancouver, Canada. In Figure 10.3, we see that the population distribution has one peak. It is also skewed (i.e., is not symmetric): most of the listings are less than $250 per night, but a small number of listings cost much more, creating a long tail on the histogram’s right side. Along with visualizing the population, we can calculate the population mean, the average price per night for all the Airbnb listings. population_parameters <- airbnb |> summarize(mean_price = mean(price)) population_parameters ## # A tibble: 1 × 1 ## mean_price ## <dbl> ## 1 154.51 The price per night of all Airbnb rentals in Vancouver, BC is $154.51, on average. This value is our population parameter since we are calculating it using the population data. Now suppose we did not have access to the population data (which is usually the case!), yet we wanted to estimate the mean price per night. We could answer this question by taking a random sample of as many Airbnb listings as our time and resources allow. Let’s say we could do this for 40 listings. What would such a sample look like? Let’s take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using rep_sample_n. 
one_sample <- airbnb |> rep_sample_n(40) We can create a histogram to visualize the distribution of observations in the sample (Figure 10.4), and calculate the mean of our sample. sample_distribution <- ggplot(one_sample, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) sample_distribution Figure 10.4: Distribution of price per night (dollars) for sample of 40 Airbnb listings. estimates <- one_sample |> summarize(mean_price = mean(price)) estimates ## # A tibble: 1 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 155.80 The average value of the sample of size 40 is $155.80. This number is a point estimate for the mean of the full population. Recall that the population mean was $154.51. So our estimate was fairly close to the population parameter: the mean was about 0.8% off. Note that we usually cannot compute the estimate’s accuracy in practice since we do not have access to the population parameter; if we did, we wouldn’t need to estimate it! Also, recall from the previous section that the point estimate can vary; if we took another random sample from the population, our estimate’s value might change. So then, did we just get lucky with our point estimate above? How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population, we can take many samples and plot the sampling distribution of sample means for samples of size 40 to get a sense for this variation. In this case, we’ll use 20,000 samples of size 40. 
samples <- rep_sample_n(airbnb, size = 40, reps = 20000) samples ## # A tibble: 800,000 × 9 ## # Groups: replicate [20,000] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1 1177 Downtown Entire h… 4 2 baths 2 2 ## 2 1 4063 Downtown Entire h… 2 1 bath 1 1 ## 3 1 2641 Kitsilano Private … 1 1 shared… 1 1 ## 4 1 1941 West End Entire h… 2 1 bath 1 1 ## 5 1 2431 Mount Pleasa… Entire h… 2 1 bath 1 1 ## 6 1 1871 Arbutus Ridge Entire h… 4 1 bath 2 2 ## 7 1 2557 Marpole Private … 3 1 privat… 1 2 ## 8 1 3534 Downtown Entire h… 2 1 bath 1 1 ## 9 1 4379 Downtown Entire h… 4 1 bath 1 0 ## 10 1 2161 Downtown Entire h… 4 2 baths 2 2 ## # ℹ 799,990 more rows ## # ℹ 1 more variable: price <dbl> Now we can calculate the sample mean for each replicate and plot the sampling distribution of sample means for samples of size 40. sample_estimates <- samples |> group_by(replicate) |> summarize(mean_price = mean(price)) sample_estimates ## # A tibble: 20,000 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 160.06 ## 2 2 173.18 ## 3 3 131.20 ## 4 4 176.96 ## 5 5 125.65 ## 6 6 148.84 ## 7 7 134.82 ## 8 8 137.26 ## 9 9 166.11 ## 10 10 157.81 ## # ℹ 19,990 more rows sampling_distribution_40 <- ggplot(sample_estimates, aes(x = mean_price)) + geom_histogram(color = "lightgrey") + labs(x = "Sample mean price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) sampling_distribution_40 Figure 10.5: Sampling distribution of the sample means for sample size of 40. In Figure 10.5, the sampling distribution of the mean has one peak and is bell-shaped. Most of the estimates are between about $140 and $170; but there is a good fraction of cases outside this range (i.e., where the point estimate was not close to the population parameter). So it does indeed look like we were quite lucky when we estimated the population mean with only 0.8% error. 
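The idea that sample means scatter around the population mean, but with much less spread than individual observations, can be sketched in base R without the Airbnb data. This is only an illustration under stated assumptions: the population below is synthetic (a made-up right-skewed set of values with a mean near 150), and the names population and sample_means are hypothetical, not objects from this chapter.

```r
# A minimal sketch with a synthetic, right-skewed "population" (not the Airbnb data)
set.seed(2024)                             # arbitrary seed so the sketch is repeatable
population <- rexp(5000, rate = 1 / 150)   # made-up prices with mean near 150

# Draw 2,000 samples of size 40 (without replacement) and keep each sample's mean
sample_means <- replicate(2000, mean(sample(population, size = 40)))

# Center of the sampling distribution versus the population mean
mean(sample_means)
mean(population)

# Spread of the sample means versus spread of the individual observations
sd(sample_means)
sd(population)

# Fraction of point estimates that land within 15 units of the population mean
mean(abs(sample_means - mean(population)) <= 15)
```

The last quantity gives a rough sense of how often a single sample of size 40 would produce an estimate close to the true mean for this particular synthetic population; the exact fraction depends on the population and on the tolerance chosen.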
Let’s visualize the population distribution, distribution of the sample, and the sampling distribution on one plot to compare them in Figure 10.6. Comparing these three distributions, the centers of the distributions are all around the same price (around $150). The original population distribution has a long right tail, and the sample distribution has a similar shape to that of the population distribution. However, the sampling distribution is not shaped like the population or sample distribution. Instead, it has a bell shape, and it has a lower spread than the population or sample distributions. The sample means vary less than the individual observations because there will be some high values and some small values in any random sample, which will keep the average from being too extreme. Figure 10.6: Comparison of population distribution, sample distribution, and sampling distribution. Given that there is quite a bit of variation in the sampling distribution of the sample mean—i.e., the point estimate that we obtain is not very reliable—is there any way to improve the estimate? One way to improve a point estimate is to take a larger sample. To illustrate what effect this has, we will take many samples of size 20, 50, 100, and 500, and plot the sampling distribution of the sample mean. We indicate the mean of the sampling distribution with a vertical dashed line. Figure 10.7: Comparison of sampling distributions, with mean highlighted as a vertical dashed line. Based on the visualization in Figure 10.7, three points about the sample mean become clear. First, the mean of the sample mean (across samples) is equal to the population mean. In other words, the sampling distribution is centered at the population mean. Second, increasing the size of the sample decreases the spread (i.e., the variability) of the sampling distribution. Therefore, a larger sample size results in a more reliable point estimate of the population parameter. 
And third, the distribution of the sample mean is roughly bell-shaped. Note: You might notice that in the n = 20 case in Figure 10.7, the distribution is not quite bell-shaped. There is a bit of skew towards the right! You might also notice that in the n = 50 case and larger, that skew seems to disappear. In general, the sampling distribution—for both means and proportions—only becomes bell-shaped once the sample size is large enough. How large is “large enough?” Unfortunately, it depends entirely on the problem at hand. But as a rule of thumb, often a sample size of at least 20 will suffice. 10.4.3 Summary A point estimate is a single value computed using a sample from a population (e.g., a mean or proportion). The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population. The shape of the sampling distribution is usually bell-shaped with one peak and centered at the population mean or proportion. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases. 10.5 Bootstrapping 10.5.1 Overview Why all this emphasis on sampling distributions? We saw in the previous section that we could compute a point estimate of a population parameter using a sample of observations from the population. And since we constructed examples where we had access to the population, we could evaluate how accurate the estimate was, and even get a sense of how much the estimate would vary for different samples from the population. But in real data analysis settings, we usually have just one sample from our population and do not have access to the population itself. Therefore we cannot construct the sampling distribution as we did in the previous section. And as we saw, our sample estimate’s value can vary significantly from the population parameter. 
So reporting the point estimate from a single sample alone may not be enough. We also need to report some notion of uncertainty in the value of the point estimate. Unfortunately, we cannot construct the exact sampling distribution without full access to the population. However, if we could somehow approximate what the sampling distribution would look like for a sample, we could use that approximation to then report how uncertain our sample point estimate is (as we did above with the exact sampling distribution). There are several methods to accomplish this; in this book, we will use the bootstrap. We will discuss interval estimation and construct confidence intervals using just a single sample from a population. A confidence interval is a range of plausible values for our population parameter. Here is the key idea. First, if you take a big enough sample, it looks like the population. Notice the histograms’ shapes for samples of different sizes taken from the population in Figure 10.8. We see that the sample’s distribution looks like that of the population for a large enough sample. Figure 10.8: Comparison of samples of different sizes from the population. In the previous section, we took many samples of the same size from our population to get a sense of the variability of a sample estimate. But if our sample is big enough that it looks like our population, we can pretend that our sample is the population, and take more samples (with replacement) of the same size from it instead! This very clever technique is called the bootstrap. Note that by taking many samples from our single, observed sample, we do not obtain the true sampling distribution, but rather an approximation that we call the bootstrap distribution. Note: We must sample with replacement when using the bootstrap. Otherwise, if we had a sample of size \\(n\\), and obtained a sample from it of size \\(n\\) without replacement, it would just return our original sample! 
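The note above can be checked with a short base R sketch. It uses a tiny illustrative sample of five prices taken from the first printed rows of the sample in this chapter; the object names and the seed are assumptions made only for this example. The sketch uses base R's sample() function rather than rep_sample_n.

```r
# Tiny illustrative sample of nightly prices (five values, for demonstration only).
original_sample <- c(58, 112, 151, 700, 157)
n <- length(original_sample)

set.seed(1)  # arbitrary seed so the draws are repeatable

# Without replacement: a draw of size n is just the original values reshuffled.
without_replacement <- sample(original_sample, size = n, replace = FALSE)
identical(sort(without_replacement), sort(original_sample))  # TRUE for any seed

# With replacement: values may repeat or be left out, so the mean can change.
with_replacement <- sample(original_sample, size = n, replace = TRUE)
mean(original_sample)
mean(with_replacement)
```

Because the draw without replacement always contains exactly the original values, its mean never changes, so it tells us nothing about the variability of the estimate.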
This section will explore how to create a bootstrap distribution from a single sample using R. The process is visualized in Figure 10.9. For a sample of size \\(n\\), you would do the following: 1. Randomly select an observation from the original sample, which was drawn from the population. 2. Record the observation’s value. 3. Replace that observation. 4. Repeat steps 1–3 (sampling with replacement) until you have \\(n\\) observations, which form a bootstrap sample. 5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, or slope) of the \\(n\\) observations in your bootstrap sample. 6. Repeat steps 1–5 many times to create a distribution of point estimates (the bootstrap distribution). 7. Calculate the plausible range of values around our observed point estimate. Figure 10.9: Overview of the bootstrap process. 10.5.2 Bootstrapping in R Let’s continue working with our Airbnb example to illustrate how we might create and use a bootstrap distribution using just a single sample from the population. Once again, suppose we are interested in estimating the population mean price per night of all Airbnb listings in Vancouver, Canada, using a single sample of size 40. Recall our point estimate was $155.80. The histogram of prices in the sample is displayed in Figure 10.10.
one_sample ## # A tibble: 40 × 8 ## id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 3928 Marpole Private … 2 1 shared… 1 1 58 ## 2 3013 Kensington-Cedar… Entire h… 4 1 bath 2 2 112 ## 3 3156 Downtown Entire h… 6 2 baths 2 2 151 ## 4 3873 Dunbar Southlands Private … 5 1 bath 2 3 700 ## 5 3632 Downtown Eastside Entire h… 6 2 baths 3 3 157 ## 6 296 Kitsilano Private … 1 1 shared… 1 1 100 ## 7 3514 West End Entire h… 2 1 bath 1 1 110 ## 8 594 Sunset Entire h… 5 1 bath 3 3 105 ## 9 3305 Dunbar Southlands Entire h… 4 1 bath 1 2 196 ## 10 938 Downtown Entire h… 7 2 baths 2 3 269 ## # ℹ 30 more rows one_sample_dist <- ggplot(one_sample, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) one_sample_dist Figure 10.10: Histogram of price per night (dollars) for one sample of size 40. The histogram for the sample is skewed, with a few observations out to the right. The mean of the sample is $155.80. Remember, in practice, we usually only have this one sample from the population. So this sample and estimate are the only data we can work with. We now perform steps 1–5 listed above to generate a single bootstrap sample in R and calculate a point estimate from that bootstrap sample. We will use the rep_sample_n function as we did when we were creating our sampling distribution. But critically, note that we now pass one_sample—our single sample of size 40—as the first argument. And since we need to sample with replacement, we change the argument for replace from its default value of FALSE to TRUE. boot1 <- one_sample |> rep_sample_n(size = 40, replace = TRUE, reps = 1) boot1_dist <- ggplot(boot1, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) boot1_dist Figure 10.11: Bootstrap distribution. 
summarize(boot1, mean_price = mean(price)) ## # A tibble: 1 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 164.20 Notice in Figure 10.11 that the histogram of our bootstrap sample has a similar shape to the original sample histogram. Though the shapes of the distributions are similar, they are not identical. You’ll also notice that the original sample mean and the bootstrap sample mean differ. How might that happen? Remember that we are sampling with replacement from the original sample, so we don’t end up with the same sample values again. We are pretending that our single sample is close to the population, and we are trying to mimic drawing another sample from the population by drawing one from our original sample. Let’s now take 20,000 bootstrap samples from the original sample (one_sample) using rep_sample_n, and calculate the means for each of those replicates. Recall that this assumes that one_sample looks like our original population; but since we do not have access to the population itself, this is often the best we can do. 
boot20000 <- one_sample |> rep_sample_n(size = 40, replace = TRUE, reps = 20000) boot20000 ## # A tibble: 800,000 × 9 ## # Groups: replicate [20,000] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1 1276 Hastings-Sun… Entire h… 2 1 bath 1 1 ## 2 1 3235 Hastings-Sun… Entire h… 2 1 bath 1 1 ## 3 1 1301 Oakridge Entire h… 12 2 baths 2 12 ## 4 1 118 Grandview-Wo… Entire h… 4 1 bath 2 2 ## 5 1 2550 Downtown Eas… Private … 2 1.5 shar… 1 1 ## 6 1 1006 Grandview-Wo… Entire h… 5 1 bath 3 4 ## 7 1 3632 Downtown Eas… Entire h… 6 2 baths 3 3 ## 8 1 1923 West End Entire h… 4 2 baths 2 2 ## 9 1 3873 Dunbar South… Private … 5 1 bath 2 3 ## 10 1 2349 Kerrisdale Private … 2 1 shared… 1 1 ## # ℹ 799,990 more rows ## # ℹ 1 more variable: price <dbl> tail(boot20000) ## # A tibble: 6 × 9 ## # Groups: replicate [1] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 20000 1949 Kitsilano Entire h… 3 1 bath 1 1 ## 2 20000 1025 Kensington-Ce… Entire h… 3 1 bath 1 1 ## 3 20000 3013 Kensington-Ce… Entire h… 4 1 bath 2 2 ## 4 20000 2868 Downtown Entire h… 2 1 bath 1 1 ## 5 20000 3156 Downtown Entire h… 6 2 baths 2 2 ## 6 20000 1923 West End Entire h… 4 2 baths 2 2 ## # ℹ 1 more variable: price <dbl> Let’s take a look at the histograms of the first six replicates of our bootstrap samples. six_bootstrap_samples <- boot20000 |> filter(replicate <= 6) ggplot(six_bootstrap_samples, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + facet_wrap(~replicate) + theme(text = element_text(size = 12)) Figure 10.12: Histograms of the first six replicates of the bootstrap samples. We see in Figure 10.12 how the bootstrap samples differ. We can also calculate the sample mean for each of these six replicates. 
six_bootstrap_samples |> group_by(replicate) |> summarize(mean_price = mean(price)) ## # A tibble: 6 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 177.2 ## 2 2 131.45 ## 3 3 179.10 ## 4 4 171.35 ## 5 5 191.32 ## 6 6 170.05 We can see that the bootstrap sample distributions and the sample means are different. They are different because we are sampling with replacement. We will now calculate point estimates for our 20,000 bootstrap samples and generate a bootstrap distribution of our point estimates. The bootstrap distribution (Figure 10.13) suggests how we might expect our point estimate to behave if we took another sample. boot20000_means <- boot20000 |> group_by(replicate) |> summarize(mean_price = mean(price)) boot20000_means ## # A tibble: 20,000 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 177.2 ## 2 2 131.45 ## 3 3 179.10 ## 4 4 171.35 ## 5 5 191.32 ## 6 6 170.05 ## 7 7 178.83 ## 8 8 154.78 ## 9 9 163.85 ## 10 10 209.28 ## # ℹ 19,990 more rows tail(boot20000_means) ## # A tibble: 6 × 2 ## replicate mean_price ## <int> <dbl> ## 1 19995 130.40 ## 2 19996 189.18 ## 3 19997 168.98 ## 4 19998 168.23 ## 5 19999 155.73 ## 6 20000 136.95 boot_est_dist <- ggplot(boot20000_means, aes(x = mean_price)) + geom_histogram(color = "lightgrey") + labs(x = "Sample mean price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) boot_est_dist Figure 10.13: Distribution of the bootstrap sample means. Let’s compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with the true sampling distribution—which corresponds to taking many samples from the population. Figure 10.14: Comparison of the distribution of the bootstrap sample means and sampling distribution. There are two essential points that we can take away from Figure 10.14. 
First, the shape and spread of the true sampling distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate’s variability. The second important point is that the means of these two distributions are different. The sampling distribution is centered at $154.51, the population mean value. However, the bootstrap distribution is centered at the original sample’s mean price per night, $155.87. Because we are resampling from the original sample repeatedly, we see that the bootstrap distribution is centered at the original sample’s mean value (unlike the sampling distribution of the sample mean, which is centered at the population parameter value). Figure 10.15 summarizes the bootstrapping process. The idea here is that we can use this distribution of bootstrap sample means to approximate the sampling distribution of the sample means when we only have one sample. Since the bootstrap distribution pretty well approximates the sampling distribution spread, we can use the bootstrap spread to help us develop a plausible range for our population parameter along with our estimate! Figure 10.15: Summary of bootstrapping process. 10.5.3 Using the bootstrap to calculate a plausible range Now that we have constructed our bootstrap distribution, let’s use it to create an approximate 95% percentile bootstrap confidence interval. A confidence interval is a range of plausible values for the population parameter. We will find the range of values covering the middle 95% of the bootstrap distribution, giving us a 95% confidence interval. You may be wondering, what does “95% confidence” mean? If we took 100 random samples and calculated 100 95% confidence intervals, then about 95% of the ranges would capture the population parameter’s value. Note there’s nothing special about 95%. We could have used other levels, such as 90% or 99%. There is a balance between our level of confidence and precision. 
A higher confidence level corresponds to a wider interval, and a lower confidence level corresponds to a narrower one. Therefore the level we choose depends on what chance of being wrong we are willing to accept, given the implications of being wrong for our application. In general, we choose confidence levels to be comfortable with our level of uncertainty but not so strict that the interval is unhelpful. For instance, if our decision impacts human life and the implications of being wrong are deadly, we may want to be very confident and choose a higher confidence level. To calculate a 95% percentile bootstrap confidence interval, we will do the following: Arrange the observations in the bootstrap distribution in ascending order. Find the value such that 2.5% of observations fall below it (the 2.5th percentile). Use that value as the lower bound of the interval. Find the value such that 97.5% of observations fall below it (the 97.5th percentile). Use that value as the upper bound of the interval. To do this in R, we can use the quantile() function. Quantiles are expressed in proportions rather than percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively. bounds <- boot20000_means |> select(mean_price) |> pull() |> quantile(c(0.025, 0.975)) bounds ## 2.5% 97.5% ## 119 204 Our interval, $119.28 to $203.63, captures the middle 95% of the sample mean prices in the bootstrap distribution. We can visualize the interval on our distribution in Figure 10.16. Figure 10.16: Distribution of the bootstrap sample means with percentile lower and upper bounds. To finish our estimation of the population parameter, we would report the point estimate and our confidence interval’s lower and upper bounds. Here the sample mean price per night of 40 Airbnb listings was $155.80, and we are 95% “confident” that the true population mean price per night for all Airbnb listings in Vancouver is between $119.28 and $203.63.
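The resampling steps and the percentile interval described in this section can also be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the 40 prices are simulated stand-ins rather than the Airbnb sample, so the numbers will not match the interval reported above, and the names sample_prices and boot_means, the seed, and the choice of 5,000 replicates are all made up for the example.

```r
# Hypothetical sample of 40 nightly prices; simulated, NOT the Airbnb data.
set.seed(2024)
sample_prices <- round(rexp(40, rate = 1 / 150))
n <- length(sample_prices)

# Draw many bootstrap samples (with replacement, same size n)
# and record the mean of each one.
boot_means <- replicate(
  5000,
  mean(sample(sample_prices, size = n, replace = TRUE))
)

# Middle 95% of the bootstrap distribution: the 0.025 and 0.975 quantiles.
bounds <- quantile(boot_means, probs = c(0.025, 0.975))

mean(sample_prices)  # point estimate from the single sample
bounds               # lower and upper bounds of the percentile interval
```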
Notice that our interval does indeed contain the true population mean value, $154.51! However, in practice, we would not know whether our interval captured the population parameter or not because we usually only have a single sample, not the entire population. This is the best we can do when we only have one sample! This chapter is only the beginning of the journey into statistical inference. We can extend the concepts learned here to do much more than report point estimates and confidence intervals, such as testing for real differences between populations, tests for associations between variables, and so much more. We have just scratched the surface of statistical inference; however, the material presented here will serve as the foundation for more advanced statistical techniques you may learn about in the future! 10.6 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the two “Statistical inference” rows. You can launch an interactive version of each worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of each worksheet by clicking “view worksheet.” If you instead decide to download the worksheets and run them on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 10.7 Additional resources Chapters 7 to 10 of Modern Dive (Ismay and Kim 2020) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using tidyverse and infer in a slightly more in-depth manner than the present chapter. 
Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics. Chapters 4 to 7 of OpenIntro Statistics (Diez, Çetinkaya-Rundel, and Barr 2019) provide a good next step after Modern Dive. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is the language of statistics; if you have a solid grasp of probability, more advanced statistics will come naturally to you! References "],["jupyter.html", "Chapter 11 Combining code and text with Jupyter 11.1 Overview 11.2 Chapter learning objectives 11.3 Jupyter 11.4 Code cells 11.5 Markdown cells 11.6 Saving your work 11.7 Best practices for running a notebook 11.8 Exploring data files 11.9 Exporting to a different file format 11.10 Creating a new Jupyter notebook 11.11 Additional resources", " Chapter 11 Combining code and text with Jupyter 11.1 Overview A typical data analysis involves not only writing and executing code, but also writing text and displaying images that help tell the story of the analysis. In fact, ideally, we would like to interleave these three media, with the text and images serving as narration for the code and its output. In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in data science. Jupyter notebooks do precisely what we need: they let you combine text, images, and (executable!) code in a single document. 
In this chapter, we will focus on the use of Jupyter notebooks to program in R and write text via a web interface. These skills are essential to getting your analysis running; think of it like getting dressed in the morning! Note that we assume that you already have Jupyter set up and ready to use. If that is not the case, please first read Chapter 13 to learn how to install and configure Jupyter on your own computer. 11.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Create new Jupyter notebooks. Write, edit, and execute R code in a Jupyter notebook. Write, edit, and view text in a Jupyter notebook. Open and view plain text data files in Jupyter. Export Jupyter notebooks to other standard file types (e.g., .html, .pdf). 11.3 Jupyter Jupyter (Kluyver et al. 2016) is a web-based interactive development environment for creating, editing, and executing documents called Jupyter notebooks. Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they combine these two analysis artifacts in a single document—code is not separate from the output or written report—notebooks are one of the leading tools to create reproducible data analyses. Reproducible data analysis is one where you can reliably and easily re-create the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality, this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner. An example of what a Jupyter notebook looks like is shown in Figure 11.1. Figure 11.1: A screenshot of a Jupyter Notebook. 11.3.1 Accessing Jupyter One of the easiest ways to start working with Jupyter is to use a web-based platform called JupyterHub. JupyterHubs often have Jupyter, R, a number of R packages, and collaboration tools installed, configured and ready to use. 
JupyterHubs are usually created and provisioned by organizations, and require authentication to gain access. For example, if you are reading this book as part of a course, your instructor may have a JupyterHub already set up for you to use! Jupyter can also be installed on your own computer; see Chapter 13 for instructions. 11.4 Code cells The sections of a Jupyter notebook that contain code are referred to as code cells. A code cell that has not yet been executed has no number inside the square brackets to the left of the cell (Figure 11.2). Running a code cell will execute all of the code it contains, and the output (if any exists) will be displayed directly underneath the code that generated it. Outputs may include printed text or numbers, data frames and data visualizations. Cells that have been executed also have a number inside the square brackets to the left of the cell. This number indicates the order in which the cells were run (Figure 11.3). Figure 11.2: A code cell in Jupyter that has not yet been executed. Figure 11.3: A code cell in Jupyter that has been executed. 11.4.1 Executing code cells Code cells can be run independently or as part of executing the entire notebook using one of the “Run all” commands found in the Run or Kernel menus in Jupyter. Running a single code cell independently is a workflow typically used when editing or writing your own R code. Executing an entire notebook is a workflow typically used to ensure that your analysis runs in its entirety before sharing it with others, and when using a notebook as part of an automated process. To run a code cell independently, the cell needs to first be activated. This is done by clicking on it with the cursor. Jupyter will indicate a cell has been activated by highlighting it with a blue rectangle to its left. After the cell has been activated (Figure 11.4), the cell can be run by either pressing the Run (▶) button in the toolbar, or by using the keyboard shortcut Shift + Enter.
Figure 11.4: An activated cell that is ready to be run. The blue rectangle to the cell’s left (annotated by a red arrow) indicates that it is ready to be run. The cell can be run by clicking the run button (circled in red). To execute all of the code cells in an entire notebook, you have three options: Select Run >> Run All Cells from the menu. Select Kernel >> Restart Kernel and Run All Cells… from the menu (Figure 11.5). Click the (⏭) button in the tool bar. All of these commands result in all of the code cells in a notebook being run. However, there is a slight difference between them. In particular, only options 2 and 3 above will restart the R session before running all of the cells; option 1 will not restart the session. Restarting the R session means that all previous objects that were created from running cells before this command was run will be deleted. In other words, restarting the session and then running all cells (options 2 or 3) emulates how your notebook code would run if you completely restarted Jupyter before executing your entire notebook. Figure 11.5: Restarting the R session can be accomplished by clicking Restart Kernel and Run All Cells… 11.4.2 The Kernel The kernel is a program that executes the code inside your notebook and outputs the results. Kernels for many different programming languages have been created for Jupyter, which means that Jupyter can interpret and execute the code of many different programming languages. To run R code, your notebook will need an R kernel. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (◯) the kernel is idle and ready to execute code. If the circle is filled in (⬤) the kernel is busy running some code. You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. 
If this happens, try the following steps: At the top of your screen, click Kernel, then Interrupt Kernel. If that doesn’t help, click Kernel, then Restart Kernel… If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work. If that still doesn’t help, restart Jupyter. First, save your work by clicking File at the top left of your screen, then Save Notebook. Next, if you are accessing Jupyter using a JupyterHub server, from the File menu click Hub Control Panel. Choose Stop My Server to shut it down, then the My Server button to start it back up. If you are running Jupyter on your own computer, from the File menu click Shut Down, then start Jupyter again. Finally, navigate back to the notebook you were working on. 11.4.3 Creating new code cells To create a new code cell in Jupyter (Figure 11.6), click the + button in the toolbar. By default, all new cells in Jupyter start out as code cells, so after this, all you have to do is write R code within the new cell you just created! Figure 11.6: New cells can be created by clicking the + button, and are by default code cells. 11.5 Markdown cells Text cells inside a Jupyter notebook are called Markdown cells. Markdown cells are rich formatted text cells, which means you can bold and italicize text, create subject headers, create bullet and numbered lists, and more. These cells are given the name “Markdown” because they use Markdown language to specify the rich text formatting. You do not need to learn Markdown to write text in the Markdown cells in Jupyter; plain text will work just fine. However, you might want to learn a bit about Markdown eventually to enable you to create nicely formatted analyses. See the additional resources at the end of this chapter to find out where you can start learning Markdown. 11.5.1 Editing Markdown cells To edit a Markdown cell in Jupyter, you need to double click on the cell. 
Once you do this, the unformatted (or unrendered) version of the text will be shown (Figure 11.7). You can then use your keyboard to edit the text. To view the formatted (or rendered) text (Figure 11.8), click the Run (▶) button in the toolbar, or use the Shift + Enter keyboard shortcut. Figure 11.7: A Markdown cell in Jupyter that has not yet been rendered and can be edited. Figure 11.8: A Markdown cell in Jupyter that has been rendered and exhibits rich text formatting. 11.5.2 Creating new Markdown cells To create a new Markdown cell in Jupyter, click the + button in the toolbar. By default, all new cells in Jupyter start as code cells, so the cell format needs to be changed to be recognized and rendered as a Markdown cell. To do this, click on the cell with your cursor to ensure it is activated. Then click on the drop-down box on the toolbar that says “Code” (it is next to the ⏭ button), and change it from “Code” to “Markdown” (Figure 11.9). Figure 11.9: New cells are by default code cells. To create Markdown cells, the cell format must be changed. 11.6 Saving your work As with any file you work on, it is critical to save your work often so you don’t lose your progress! Jupyter has an autosave feature, where open files are saved periodically. The default for this is every two minutes. You can also manually save a Jupyter notebook by selecting Save Notebook from the File menu, by clicking the disk icon on the toolbar, or by using a keyboard shortcut (Control + S for Windows, or Command + S for Mac OS). 11.7 Best practices for running a notebook 11.7.1 Best practices for executing code cells As you might know (or at least imagine) by now, Jupyter notebooks are great for interactively editing, writing and running R code; this is what they were designed for! Consequently, Jupyter notebooks are flexible in regards to code cell execution order. This flexibility means that code cells can be run in any arbitrary order using the Run (▶) button. 
But this flexibility has a downside: it can lead to Jupyter notebooks whose code cannot be executed in a linear order (from top to bottom of the notebook). A nonlinear notebook is problematic because a linear order is the conventional way code documents are run, and others will have this expectation when running your notebook. Finally, if the code is used in some automated process, it will need to run in a linear order, from top to bottom of the notebook. The most common way to inadvertently create a nonlinear notebook is to rely solely on using the ▶ button to execute cells. For example, suppose you write some R code that creates an R object, say a variable named y. When you execute that cell and create y, it will continue to exist until it is deliberately deleted with R code, or when the Jupyter notebook R session (i.e., kernel) is stopped or restarted. It can also be referenced in another distinct code cell (Figure 11.10). Together, this means that you could then write a code cell further above in the notebook that references y and execute it without error in the current session (Figure 11.11). This could also be done successfully in future sessions if, and only if, you run the cells in the same unconventional order. However, it is difficult to remember this unconventional order, and it is not the order that others would expect your code to be executed in. Thus, in the future, this would lead to errors when the notebook is run in the conventional linear order (Figure 11.12). Figure 11.10: Code that was written out of order, but not yet executed. Figure 11.11: Code that was written out of order, and was executed using the run button in a nonlinear order without error. The order of execution can be traced by following the numbers to the left of the code cells; their order indicates the order in which the cells were executed. 
Figure 11.12: Code that was written out of order, and was executed in a linear order using “Restart Kernel and Run All Cells…” This resulted in an error at the execution of the second code cell and it failed to run all code cells in the notebook. You can also accidentally create a nonfunctioning notebook by creating an object in a cell that later gets deleted. In such a scenario, that object only exists for that one particular R session and will not exist once the notebook is restarted and run again. If that object was referenced in another cell in that notebook, an error would occur when the notebook was run again in a new session. These events may not negatively affect the current R session when the code is being written; but as you might now see, they will likely lead to errors when that notebook is run in a future session. Regularly executing the entire notebook in a fresh R session will help guard against this. If you restart your session and new errors seem to pop up when you run all of your cells in linear order, you can at least be aware that there is an issue. Knowing this sooner rather than later will allow you to fix the issue and ensure your notebook can be run linearly from start to finish. We recommend as a best practice to run the entire notebook in a fresh R session at least 2–3 times within any period of work. Note that, critically, you must do this in a fresh R session by restarting your kernel. We recommend using either the Kernel >> Restart Kernel and Run All Cells… command from the menu or the ⏭ button in the toolbar. Note that the Run >> Run All Cells menu item will not restart the kernel, and so it is not sufficient to guard against these errors. 11.7.2 Best practices for including R packages in notebooks Most data analyses these days depend on functions from external R packages that are not built into R. One example is the tidyverse metapackage that we heavily rely on in this book. 
This package provides us access to functions like read_csv for reading data, select for subsetting columns, and ggplot for creating high-quality graphics. As mentioned earlier in the book, external R packages need to be loaded before the functions they contain can be used. Our recommended way to do this is via library(package_name). But where should this line of code be written in a Jupyter notebook? One idea could be to load the library right before the function is used in the notebook. However, although this technically works, this causes hidden, or at least non-obvious, R package dependencies when others view or try to run the notebook. These hidden dependencies can lead to errors when the notebook is executed on another computer if the needed R packages are not installed. Additionally, if the data analysis code takes a long time to run, uncovering the hidden dependencies that need to be installed so that the analysis can run without error can take a great deal of time. Therefore, we recommend you load all R packages in a code cell near the top of the Jupyter notebook. Loading all your packages at the start ensures that all packages are loaded before their functions are called, assuming the notebook is run in a linear order from top to bottom as recommended above. It also makes it easy for others viewing or running the notebook to see what external R packages are used in the analysis, and hence, what packages they should install on their computer to run the analysis successfully. 11.7.3 Summary of best practices for running a notebook Write code so that it can be executed in a linear order. As you write code in a Jupyter notebook, run the notebook in a linear order and in its entirety often (2–3 times every work session) via the Kernel >> Restart Kernel and Run All Cells… command from the Jupyter menu or the ⏭ button in the toolbar. Write the code that loads external R packages near the top of the Jupyter notebook.
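As an illustration of the last recommendation, the first code cell of a notebook might look like the sketch below. A plain sequence of library(package_name) calls at the top is the simplest option; this variation additionally reports any packages that are not installed before the analysis starts. The package names listed and the object names are assumptions chosen for the example, so replace them with whatever your own analysis uses.

```r
# First code cell of the notebook: name every external package in one place.
# "tidyverse" and "infer" are examples; list whatever your analysis uses.
required_packages <- c("tidyverse", "infer")

# Check which packages are installed without attaching them yet.
installed <- vapply(required_packages, requireNamespace,
                    logical(1), quietly = TRUE)
missing_packages <- required_packages[!installed]

if (length(missing_packages) > 0) {
  # Report missing packages up front rather than failing partway through.
  message("Install before running: ",
          paste(missing_packages, collapse = ", "))
} else {
  # Attach each package so its functions are available to later cells.
  for (pkg in required_packages) {
    library(pkg, character.only = TRUE)
  }
}
```

Keeping the package list in a single cell at the top means a reader can see every dependency at a glance, and a missing package is flagged immediately rather than after a long-running cell has finished.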
11.8 Exploring data files It is essential to preview data files before you try to read them into R to see whether or not there are column names, what the delimiters are, and if there are lines you need to skip. In Jupyter, you preview data files stored as plain text files (e.g., comma- and tab-separated files) in their plain text format (Figure 11.14) by right-clicking on the file’s name in the Jupyter file explorer, selecting Open with, and then selecting Editor (Figure 11.13). Suppose you do not specify to open the data file with an editor. In that case, Jupyter will render a nice table for you, and you will not be able to see the column delimiters, and therefore you will not know which function to use, nor which arguments to use and values to specify for them. Figure 11.13: Opening data files with an editor in Jupyter. Figure 11.14: A data file as viewed in an editor in Jupyter. 11.9 Exporting to a different file format In Jupyter, viewing, editing and running R code is done in the Jupyter notebook file format with file extension .ipynb. This file format is not easy to open and view outside of Jupyter. Thus, to share your analysis with people who do not commonly use Jupyter, it is recommended that you export your executed analysis as a more common file type, such as an .html file, or a .pdf. We recommend exporting the Jupyter notebook after executing the analysis so that you can also share the outputs of your code. Note, however, that your audience will not be able to run your analysis using a .html or .pdf file. If you want your audience to be able to reproduce the analysis, you must provide them with the .ipynb Jupyter notebook file. 11.9.1 Exporting to HTML Exporting to .html will result in a shareable file that anyone can open using a web browser (e.g., Firefox, Safari, Chrome, or Edge). The .html output will produce a document that is visually similar to what the Jupyter notebook looked like inside Jupyter. 
One point of caution here is that if there are images in your Jupyter notebook, you will need to share the image files and the .html file to see them. 11.9.2 Exporting to PDF Exporting to .pdf will result in a shareable file that anyone can open using many programs, including Adobe Acrobat, Preview, web browsers and many more. The benefit of exporting to PDF is that it is a standalone document, even if the Jupyter notebook included references to image files. Unfortunately, the default settings will result in a document that visually looks quite different from what the Jupyter notebook looked like. The font, page margins, and other details will appear different in the .pdf output. 11.10 Creating a new Jupyter notebook At some point, you will want to create a new, fresh Jupyter notebook for your own project instead of viewing, running or editing a notebook that was started by someone else. To do this, navigate to the Launcher tab, and click on the R icon under the Notebook heading. If no Launcher tab is visible, you can get a new one via clicking the + button at the top of the Jupyter file explorer (Figure 11.15). Figure 11.15: Clicking on the R icon under the Notebook heading will create a new Jupyter notebook with an R kernel. Once you have created a new Jupyter notebook, be sure to give it a descriptive name, as the default file name is Untitled.ipynb. You can rename files by first right-clicking on the file name of the notebook you just created, and then clicking Rename. This will make the file name editable. Use your keyboard to change the name. Pressing Enter or clicking anywhere else in the Jupyter interface will save the changed file name. We recommend not using white space or non-standard characters in file names. Doing so will not prevent you from using that file in Jupyter. However, these sorts of things become troublesome as you start to do more advanced data science projects that involve repetition and automation. 
We recommend naming files using lower case characters and separating words by a dash (-) or an underscore (_). 11.11 Additional resources The JupyterLab Documentation is a good next place to look for more information about working in Jupyter notebooks. This documentation goes into significantly more detail about all of the topics we covered in this chapter, and covers more advanced topics as well. If you are keen to learn about the Markdown language for rich text formatting, two good places to start are CommonMark’s Markdown cheatsheet and Markdown tutorial. References "],["version-control.html", "Chapter 12 Collaboration with version control 12.1 Overview 12.2 Chapter learning objectives 12.3 What is version control, and why should I use it? 12.4 Version control repositories 12.5 Version control workflows 12.6 Working with remote repositories using GitHub 12.7 Working with local repositories using Jupyter 12.8 Collaboration 12.9 Exercises 12.10 Additional resources", " Chapter 12 Collaboration with version control You mostly collaborate with yourself, and me-from-two-months-ago never responds to email. –Mark T. Holder 12.1 Overview This chapter will introduce the concept of using version control systems to track changes to a project over its lifespan, to share and edit code in a collaborative team, and to distribute the finished project to its intended audience. This chapter will also introduce how to use the two most common version control tools: Git for local version control, and GitHub for remote version control. We will focus on the most common version control operations used day-to-day in a standard data science project. There are many user interfaces for Git; in this chapter we will cover the Jupyter Git interface. 12.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe what version control is and why data analysis projects can benefit from it. Create a remote version control repository on GitHub. 
Use Jupyter’s Git version control tools for project versioning and collaboration: Clone a remote version control repository to create a local repository. Commit changes to a local version control repository. Push local changes to a remote version control repository. Pull changes from a remote version control repository to a local version control repository. Resolve merge conflicts. Give collaborators access to a remote GitHub repository. Communicate with collaborators using GitHub issues. Use best practices when collaborating on a project with others. 12.3 What is version control, and why should I use it? Data analysis projects often require iteration and revision to move from an initial idea to a finished product ready for the intended audience. Without deliberate and conscious effort towards tracking changes made to the analysis, projects tend to become messy. This mess can have serious, negative repercussions on an analysis project, including results that your code cannot reproduce, temporary files with snippets of ideas that are forgotten or not easy to find, mind-boggling file names that make it unclear which is the current working version of the file (e.g., document_final_draft_final.txt, to_hand_in_final_v2.txt, etc.), and more. Additionally, the iterative nature of data analysis projects means that most of the time, the final version of the analysis that is shared with the audience is only a fraction of what was explored during the development of that analysis. Changes in data visualizations and modeling approaches, as well as some negative results, are often not observable from reviewing only the final, polished analysis. The lack of observability of these parts of the analysis development can lead to others repeating things that did not work well, instead of seeing what did not work well, and using that as a springboard to new, more fruitful approaches. Finally, data analyses are typically completed by a team of people rather than a single person. 
This means that files need to be shared across multiple computers, and multiple people often end up editing the project simultaneously. In such a situation, determining who has the latest version of the project—and how to resolve conflicting edits—can be a real challenge. Version control helps solve these challenges. Version control is the process of keeping a record of changes to documents, including when the changes were made and who made them, throughout the history of their development. It also provides the means both to view earlier versions of the project and to revert changes. Version control is most commonly used in software development, but can be used for any electronic files for any type of project, including data analyses. Being able to record and view the history of a data analysis project is important for understanding how and why decisions to use one method or another were made, among other things. Version control also facilitates collaboration via tools to share edits with others and resolve conflicting edits. But even if you’re working on a project alone, you should still use version control. It helps you keep track of what you’ve done, when you did it, and what you’re planning to do next! To version control a project, you generally need two things: a version control system and a repository hosting service. The version control system is the software responsible for tracking changes, sharing changes you make with others, obtaining changes from others, and resolving conflicting edits. The repository hosting service is responsible for storing a copy of the version-controlled project online (a repository), where you and your collaborators can access it remotely, discuss issues and bugs, and distribute your final product. For both of these items, there is a wide variety of choices. In this textbook we’ll use Git for version control, and GitHub for repository hosting, because both are currently the most widely used platforms. 
In the additional resources section at the end of the chapter, we list many of the common version control systems and repository hosting services in use today. Note: Technically you don’t have to use a repository hosting service. You can, for example, version control a project that is stored only in a folder on your computer—never sharing it on a repository hosting service. But using a repository hosting service provides a few big benefits, including managing collaborator access permissions, tools to discuss and track bugs, and the ability to have external collaborators contribute work, not to mention the safety of having your work backed up in the cloud. Since most repository hosting services now offer free accounts, there are not many situations in which you wouldn’t want to use one for your project. 12.4 Version control repositories Typically, when we put a data analysis project under version control, we create two copies of the repository (Figure 12.1). One copy we use as our primary workspace where we create, edit, and delete files. This copy is commonly referred to as the local repository. The local repository most commonly exists on our computer or laptop, but can also exist within a workspace on a server (e.g., JupyterHub). The other copy is typically stored in a repository hosting service (e.g., GitHub), where we can easily share it with our collaborators. This copy is commonly referred to as the remote repository. Figure 12.1: Schematic of local and remote version control repositories. Both copies of the repository have a working directory where you can create, store, edit, and delete files (e.g., analysis.ipynb in Figure 12.1). Both copies of the repository also maintain a full project history (Figure 12.1). This history is a record of all versions of the project files that have been created. The repository history is not automatically generated; Git must be explicitly told when to record a version of the project. These records are called commits. 
They are a snapshot of the file contents as well as metadata about the repository at the time the record was created (who made the commit, when it was made, etc.). In the local and remote repositories shown in Figure 12.1, there are two commits represented as rectangles inside the “Repository History” sections. The white rectangle represents the most recent commit, while faded rectangles represent previous commits. Each commit can be identified by a human-readable message, which you write when you make a commit, and a commit hash that Git automatically adds for you. The purpose of the message is to contain a brief, rich description of what work was done since the last commit. Messages act as a very useful narrative of the changes to a project over its lifespan. If you ever want to view or revert to an earlier version of the project, the message can help you identify which commit to view or revert to. In Figure 12.1, you can see two such messages, one for each commit: Created README.md and Added analysis draft. The hash is a string of characters consisting of about 40 letters and numbers. It serves as a unique identifier for the commit and is used by Git to index project history. Although hashes are quite long (imagine having to type out 40 precise characters to view an old project version!), Git is able to work with shorter versions of hashes. In Figure 12.1, you can see two of these shortened hashes, one for each commit: Daa29d6 and 884c7ce. 12.5 Version control workflows When you work in a local version-controlled repository, there are generally three additional steps you must take as part of your regular workflow. In addition to just working on files (creating, editing, and deleting files as you normally would), you must: Tell Git when to make a commit of your own changes in the local repository. Tell Git when to send your new commits to the remote GitHub repository.
Tell Git when to retrieve any new changes (that others made) from the remote GitHub repository. In this section we will discuss all three of these steps in detail. 12.5.1 Committing changes to a local repository When working on files in your local version control repository (e.g., using Jupyter) and saving your work, these changes will only initially exist in the working directory of the local repository (Figure 12.2). Figure 12.2: Local repository with changes to files. Once you reach a point that you want Git to keep a record of the current version of your work, you need to commit (i.e., snapshot) your changes. A prerequisite to this is telling Git which files should be included in that snapshot. We call this step adding the files to the staging area. Note that the staging area is not a real physical location on your computer; it is instead a conceptual placeholder for these files until they are committed. The benefit of the Git version control system using a staging area is that you can choose to commit changes in only certain files. For example, in Figure 12.3, we add only the two files that are important to the analysis project (analysis.ipynb and README.md) and not our personal scratch notes for the project (notes.txt). Figure 12.3: Adding modified files to the staging area in the local repository. Once the files we wish to commit have been added to the staging area, we can then commit those files to the repository history (Figure 12.4). When we do this, we are required to include a helpful commit message to tell collaborators (which often includes future you!) about the changes that were made. In Figure 12.4, the message is Message about changes...; in your work you should make sure to replace this with an informative message about what changed. It is also important to note here that these changes are only being committed to the local repository’s history. 
The remote repository on GitHub has not changed, and collaborators are not yet able to see your new changes. Figure 12.4: Committing the modified files in the staging area to the local repository history, with an informative message about what changed. 12.5.2 Pushing changes to a remote repository Once you have made one or more commits that you want to share with your collaborators, you need to push (i.e., send) those commits back to GitHub (Figure 12.5). This updates the history in the remote repository (i.e., GitHub) to match what you have in your local repository. Now when collaborators interact with the remote repository, they will be able to see the changes you made. And you can also take comfort in the fact that your work is now backed up in the cloud! Figure 12.5: Pushing the commit to send the changes to the remote repository on GitHub. 12.5.3 Pulling changes from a remote repository If you are working on a project with collaborators, they will also be making changes to files (e.g., to the analysis code in a Jupyter notebook and the project’s README file), committing them to their own local repository, and pushing their commits to the remote GitHub repository to share them with you. When they push their changes, those changes will only initially exist in the remote GitHub repository and not in your local repository (Figure 12.6). Figure 12.6: Changes pushed by collaborators, or created directly on GitHub will not be automatically sent to your local repository. To obtain the new changes from the remote repository on GitHub, you will need to pull those changes to your own local repository. By pulling changes, you synchronize your local repository to what is present on GitHub (Figure 12.7). Additionally, until you pull changes from the remote repository, you will not be able to push any more changes yourself (though you will still be able to work and make commits in your own local repository). 
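For readers who like to see the same cycle spelled out as commands, here is a minimal command-line sketch of the commit, push, and pull steps described in this section. It is only an illustration under stated assumptions: a bare repository in a temporary folder stands in for the remote on GitHub, and the folder names (you, collaborator), file contents, and commit messages are hypothetical. Later in this chapter the same operations are carried out with buttons rather than commands.

```shell
#!/bin/sh
# Sketch of the commit / push / pull cycle. A local bare repository plays the
# role of the remote on GitHub; all paths and names here are hypothetical.
set -eu

sandbox=$(mktemp -d)
git init --quiet --bare "$sandbox/remote.git"

# Your local repository: a clone of the (still empty) remote.
git clone --quiet "$sandbox/remote.git" "$sandbox/you" 2>/dev/null
cd "$sandbox/you"
git config user.name 'You'
git config user.email 'you@example.com'

# Work on files, then stage only the ones that belong in the snapshot.
echo '# Analysis project' > README.md
echo 'scratch ideas' > notes.txt             # deliberately left unstaged
git add README.md
git commit --quiet -m 'Created README.md'    # recorded in the local history only

# Push: send the new commit to the remote.
git push --quiet -u origin HEAD

# A collaborator clones the remote, commits a change, and pushes it.
git clone --quiet "$sandbox/remote.git" "$sandbox/collaborator"
cd "$sandbox/collaborator"
git config user.name 'Collaborator'
git config user.email 'collaborator@example.com'
echo 'Edited by a collaborator.' >> README.md
git add README.md
git commit --quiet -m 'Expanded README.md'
git push --quiet origin HEAD

# Pull: bring the collaborator's commit into your local repository.
cd "$sandbox/you"
git pull --quiet --ff-only
git log --oneline
```

After the final pull, the local history contains both commits, while notes.txt remains an untracked file that was never part of any snapshot.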
Figure 12.7: Pulling changes from the remote GitHub repository to synchronize your local repository. 12.6 Working with remote repositories using GitHub Now that you have been introduced to some of the key general concepts and workflows of Git version control, we will walk through the practical steps. There are several different ways to start using version control with a new project. For simplicity and ease of setup, we recommend creating a remote repository first. This section covers how to both create and edit a remote repository on GitHub. Once you have a remote repository set up, we recommend cloning (or copying) that repository to create a local repository in which you primarily work. You can clone the repository either on your own computer or in a workspace on a server (e.g., a JupyterHub server). Section 12.7 below will cover this second step in detail. 12.6.1 Creating a remote repository on GitHub Before you can create remote repositories on GitHub, you will need a GitHub account; you can sign up for a free account at https://github.com/. Once you have logged into your account, you can create a new repository to host your project by clicking on the “+” icon in the upper right-hand corner, and then on “New Repository,” as shown in Figure 12.8. Figure 12.8: New repositories on GitHub can be created by clicking on “New Repository” from the + menu. Repositories can be set up with a variety of configurations, including a name, optional description, and the inclusion (or not) of several template files. One of the most important configuration items to choose is the visibility to the outside world, either public or private. Public repositories can be viewed by anyone. Private repositories can be viewed by only you. Both public and private repositories are only editable by you, but you can change that by giving access to other collaborators. 
To get started with a public repository having a template README.md file, take the steps shown in Figure 12.9: Enter the name of your project repository. In the example below, we use canadian_languages. Most repositories follow a similar naming convention involving only lowercase words separated by either underscores or hyphens. Choose an option for the privacy of your repository. Select “Add a README file.” This creates a template README.md file in your repository’s root folder. When you are happy with your repository name and configuration, click on the green “Create Repository” button. Figure 12.9: Repository configuration for a project that is public and initialized with a README.md template file. A newly created public repository with a README.md template file should look something like what is shown in Figure 12.10. Figure 12.10: Repository configuration for a project that is public and initialized with a README.md template file. 12.6.2 Editing files on GitHub with the pen tool The pen tool can be used to edit existing plain text files. When you click on the pen tool, the file will be opened in a text box where you can use your keyboard to make changes (Figures 12.11 and 12.12). Figure 12.11: Clicking on the pen tool opens a text box for editing plain text files. Figure 12.12: The text box where edits can be made after clicking on the pen tool. After you are done with your edits, they can be “saved” by committing your changes. When you commit a file in a repository, the version control system takes a snapshot of what the file looks like. As you continue working on the project, over time you will possibly make many commits to a single file; this generates a useful version history for that file. On GitHub, if you click the green “Commit changes” button, it will save the file and then make a commit (Figure 12.13). Recall from Section 12.5.1 that you normally have to add files to the staging area before committing them.
Why don’t we have to do that when we work directly on GitHub? Behind the scenes, when you click the green “Commit changes” button, GitHub is adding that one file to the staging area prior to committing it. But note that on GitHub you are limited to committing changes to only one file at a time. When you work in your own local repository, you can commit changes to multiple files simultaneously. This is especially useful when one “improvement” to the project involves modifying multiple files. You can also do things like run code when working in a local repository, which you cannot do on GitHub. In general, editing on GitHub is reserved for small edits to plain text files. Figure 12.13: Saving changes using the pen tool requires committing those changes, and an associated commit message. 12.6.3 Creating files on GitHub with the “Add file” menu The “Add file” menu can be used to create new plain text files and upload files from your computer. To create a new plain text file, click the “Add file” drop-down menu and select the “Create new file” option (Figure 12.14). Figure 12.14: New plain text files can be created directly on GitHub. A page will open with a small text box for the file name to be entered, and a larger text box where the desired file content text can be entered. Note the two tabs, “Edit new file” and “Preview”. Toggling between them lets you enter and edit text and view what the text will look like when rendered, respectively (Figure 12.15). Note that GitHub understands and renders .md files using a markdown syntax very similar to Jupyter notebooks, so the “Preview” tab is especially helpful for checking markdown code correctness. Figure 12.15: New plain text files require a file name in the text box circled in red, and file content entered in the larger text box (red arrow). Save and commit your changes by clicking the green “Commit changes” button at the bottom of the page (Figure 12.16). 
Figure 12.16: To be saved, newly created files are required to be committed along with an associated commit message. You can also upload files that you have created on your local machine by using the “Add file” drop-down menu and selecting “Upload files” (Figure 12.17). To select the files from your local computer to upload, you can either drag and drop them into the gray box area shown in Figure 12.18, or click the “choose your files” link to access a file browser dialog. Once the files you want to upload have been selected, click the green “Commit changes” button at the bottom of the page (Figure 12.18). Figure 12.17: New files of any type can be uploaded to GitHub. Figure 12.18: Specify files to upload by dragging them into the GitHub website (red circle) or by clicking on “choose your files.” Uploaded files are also required to be committed along with an associated commit message. Note that Git and GitHub are designed to track changes in individual files. Do not upload your whole project in an archive file (e.g., .zip). If you do, then Git can only keep track of changes to the entire .zip file, which will not be human-readable. Committing one big archive defeats the whole purpose of using version control: you won’t be able to see, interpret, or find changes in the history of any of the actual content of your project! 12.7 Working with local repositories using Jupyter Although there are several ways to create and edit files on GitHub, they are not quite powerful enough for efficiently creating and editing complex files, or files that need to be executed to assess whether they work (e.g., files containing code). For example, you wouldn’t be able to run an analysis written with R code directly on GitHub. Thus, it is useful to be able to connect the remote repository that was created on GitHub to a local coding environment. This can be done by creating and working in a local copy of the repository. 
In this chapter, we focus on interacting with Git via Jupyter using the Jupyter Git extension. The Jupyter Git extension can be run by Jupyter on your local computer, or on a JupyterHub server. We recommend reading Chapter 11 to learn how to use Jupyter before reading this chapter. 12.7.1 Generating a GitHub personal access token To send and retrieve work between your local repository and the remote repository on GitHub, you will frequently need to authenticate with GitHub to prove you have the required permission. There are several methods to do this, but for beginners we recommend using the HTTPS method because it is easier and requires less setup. In order to use the HTTPS method, GitHub requires you to provide a personal access token. A personal access token is like a password—so keep it a secret!—but it gives you more fine-grained control over what parts of your account the token can be used to access, and lets you set an expiry date for the authentication. To generate a personal access token, you must first visit https://github.com/settings/tokens, which will take you to the “Personal access tokens” page in your account settings. Once there, click “Generate new token” (Figure 12.19). Note that you may be asked to re-authenticate with your username and password to proceed. Figure 12.19: The “Generate new token” button used to initiate the creation of a new personal access token. It is found in the “Personal access tokens” section of the “Developer settings” page in your account settings. You will be asked to add a note to describe the purpose for your personal access token. Next, you need to select permissions for the token; this is where you can control what parts of your account the token can be used to access. Make sure to choose only those permissions that you absolutely require. In Figure 12.20, we tick only the “repo” box, which gives the token access to our repositories (so that we can push and pull) but none of our other GitHub account features. 
Finally, to generate the token, scroll to the bottom of that page and click the green “Generate token” button (Figure 12.20). Figure 12.20: Webpage for creating a new personal access token. Finally, you will be taken to a page where you will be able to see and copy the personal access token you just generated (Figure 12.21). Since it provides access to certain parts of your account, you should treat this token like a password; for example, you should consider securely storing it (and your other passwords and tokens, too!) using a password manager. Note that this page will only display the token to you once, so make sure you store it in a safe place right away. If you accidentally forget to store it, though, do not fret—you can delete that token by clicking the “Delete” button next to your token, and generate a new one from scratch. To learn more about GitHub authentication, see the additional resources section at the end of this chapter. Figure 12.21: Display of the newly generated personal access token. 12.7.2 Cloning a repository using Jupyter Cloning a remote repository from GitHub to create a local repository results in a copy that knows where it was obtained from so that it knows where to send/receive new committed edits. In order to do this, first copy the URL from the HTTPS tab of the Code drop-down menu on GitHub (Figure 12.22). Figure 12.22: The green “Code” drop-down menu contains the remote address (URL) corresponding to the location of the remote GitHub repository. Open Jupyter, and click the Git+ icon on the file browser tab (Figure 12.23). Figure 12.23: The Jupyter Git Clone icon (red circle). Paste the URL of the GitHub project repository you created and click the blue “CLONE” button (Figure 12.24). Figure 12.24: Prompt where the remote address (URL) corresponding to the location of the GitHub repository needs to be input in Jupyter. On the file browser tab, you will now see a folder for the repository. 
Inside this folder will be all the files that existed on GitHub (Figure 12.25). Figure 12.25: Cloned GitHub repositories can be seen and accessed via the Jupyter file browser. 12.7.3 Specifying files to commit Now that you have cloned the remote repository from GitHub to create a local repository, you can get to work editing, creating, and deleting files. For example, suppose you created and saved a new file (named eda.ipynb) that you would like to send back to the project repository on GitHub (Figure 12.26). To “add” this modified file to the staging area (i.e., flag that this is a file whose changes we would like to commit), click the Jupyter Git extension icon on the far left-hand side of Jupyter (Figure 12.26). Figure 12.26: Jupyter Git extension icon (circled in red). This opens the Jupyter Git graphical user interface pane. Next, click the plus sign (+) beside the file(s) that you want to “add” (Figure 12.27). Note that because this is the first change for this file, it falls under the “Untracked” heading. However, next time you edit this file and want to add the changes, you will find it under the “Changed” heading. You will also see an eda-checkpoint.ipynb file under the “Untracked” heading. This is a temporary “checkpoint file” created by Jupyter when you work on eda.ipynb. You generally do not want to add auto-generated files to Git repositories; only add the files you directly create and edit. Figure 12.27: eda.ipynb is added to the staging area via the plus sign (+). Clicking the plus sign (+) moves the file from the “Untracked” heading to the “Staged” heading, so that Git knows you want a snapshot of its current state as a commit (Figure 12.28). Now you are ready to “commit” the changes. Make sure to include a (clear and helpful!) message about what was changed so that your collaborators (and future you) know what happened in this commit. Figure 12.28: Adding eda.ipynb makes it visible in the staging area.
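As a rough command-line counterpart to clicking the plus sign (+), the sketch below stages a single notebook and leaves the auto-generated checkpoint copy alone. It assumes a throwaway repository in a temporary folder; the notebook contents are a placeholder, and the .ipynb_checkpoints folder is created by hand here to mimic what Jupyter produces.

```shell
#!/bin/sh
# Sketch: staging one file from the command line. The repository and file
# contents are hypothetical stand-ins for the cloned project.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init --quiet

# A new notebook plus a checkpoint copy like the one Jupyter generates.
echo '{}' > eda.ipynb
mkdir .ipynb_checkpoints
cp eda.ipynb .ipynb_checkpoints/eda-checkpoint.ipynb

git status --short    # notebook and checkpoint folder are both untracked (??)
git add eda.ipynb     # stage only the file you created yourself
git status --short    # eda.ipynb is now staged (A); the checkpoint folder is not
```

Staging by file name, rather than adding everything at once, is what keeps auto-generated files out of the commit.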
12.7.4 Making the commit To snapshot the changes with an associated commit message, you must put a message in the text box at the bottom of the Git pane and click on the blue “Commit” button (Figure 12.29). It is highly recommended to write useful and meaningful messages about what was changed. These commit messages, and the datetime stamp for a given commit, are the primary means to navigate through the project’s history in the event that you need to view or retrieve a past version of a file, or revert your project to an earlier state. When you click the “Commit” button for the first time, you will be prompted to enter your name and email. This only needs to be done once for each machine you use Git on. Figure 12.29: A commit message must be added into the Jupyter Git extension commit text box before the blue Commit button can be used to record the commit. After “committing” the file(s), you will see there are 0 “Staged” files. You are now ready to push your changes to the remote repository on GitHub (Figure 12.30). Figure 12.30: After recording a commit, the staging area should be empty. 12.7.5 Pushing the commits to GitHub To send the committed changes back to the remote repository on GitHub, you need to push them. To do this, click on the cloud icon with the up arrow on the Jupyter Git tab (Figure 12.31). Figure 12.31: The Jupyter Git extension “push” button (circled in red). You will then be prompted to enter your GitHub username and the personal access token that you generated earlier (not your account password!). Click the blue “OK” button to initiate the push (Figure 12.32). Figure 12.32: Enter your Git credentials to authorize the push to the remote repository. If the files were successfully pushed to the project repository on GitHub, you will be shown a success message (Figure 12.33). Click “Dismiss” to continue working in Jupyter. Figure 12.33: The prompt that the push was successful. 
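The name and email prompt, and the role of the commit message, can also be illustrated with plain Git commands. This is a sketch under assumptions, not what the Jupyter Git extension runs internally: the repository lives in a temporary folder, the identity Jane Doe and the commit message are made up, and the identity is set for this one repository only so that machine-wide settings are left untouched.

```shell
#!/bin/sh
# Sketch: recording a commit with an identity and a message, then reading the
# history back. Names, email, and message are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Identity attached to every commit. It is set per repository here; adding
# --global to these two commands would set it once for the whole machine.
git config user.name 'Jane Doe'
git config user.email 'jane.doe@example.com'

echo '{}' > eda.ipynb
git add eda.ipynb
git commit --quiet -m 'Add exploratory data analysis notebook'

# Short hash, author, date, and message for each commit in the history.
git log --format='%h  %an  %ad  %s' --date=short
```

The last command prints one line per commit, which shows why a meaningful message matters: together with the date, it is what you scan when looking for an earlier version.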
If you visit the remote repository on GitHub, you will see that the changes now exist there too (Figure 12.34)! Figure 12.34: The GitHub web interface shows a preview of the commit message, and the time of the most recently pushed commit for each file. 12.8 Collaboration 12.8.1 Giving collaborators access to your project As mentioned earlier, GitHub allows you to control who has access to your project. By default, for both public and private projects, only the person who created the GitHub repository has permission to create, edit, and delete files (write access). To give your collaborators write access to the project, navigate to the “Settings” tab (Figure 12.35). Figure 12.35: The “Settings” tab on the GitHub web interface. Then click “Manage access” (Figure 12.36). Figure 12.36: The “Manage access” tab on the GitHub web interface. Then click the green “Invite a collaborator” button (Figure 12.37). Figure 12.37: The “Invite a collaborator” button on the GitHub web interface. Type in the collaborator’s GitHub username or email, and select their name when it appears (Figure 12.38). Figure 12.38: The text box where a collaborator’s GitHub username or email can be entered. Finally, click the green “Add collaborator to this repository” button (Figure 12.39). Figure 12.39: The confirmation button for adding a collaborator to a repository on the GitHub web interface. After this, you should see your newly added collaborator listed under the “Manage access” tab. They should receive an email invitation to join the GitHub repository as a collaborator. They need to accept this invitation to enable write access. 12.8.2 Pulling changes from GitHub using Jupyter We will now walk through how to use the Jupyter Git extension tool to pull changes to our eda.ipynb analysis file that were made by a collaborator (Figure 12.40). 
Figure 12.40: The GitHub interface indicates the name of the last person to push a commit to the remote repository, a preview of the associated commit message, the unique commit identifier, and how long ago the commit was snapshotted. You can tell Git to “pull” by clicking on the cloud icon with the down arrow in Jupyter (Figure 12.41). Figure 12.41: The Jupyter Git extension pull button. Once the files are successfully pulled from GitHub, you need to click “Dismiss” to keep working (Figure 12.42). Figure 12.42: The prompt after changes have been successfully pulled from a remote repository. Then, when you open (or refresh) the files whose changes you just pulled, you should be able to see them (Figure 12.43). Figure 12.43: Changes made by the collaborator to eda.ipynb (code highlighted by red arrows). It can be very useful to review the history of the changes to your project. You can do this directly in Jupyter by clicking “History” in the Git tab (Figure 12.44). Figure 12.44: Version control repository history viewed using the Jupyter Git extension. It is good practice to pull any changes at the start of every work session before you start working on your local copy. If you do not do this, and your collaborators have pushed some changes to the project to GitHub, then you will be unable to push your changes to GitHub until you pull. This situation can be recognized by the error message shown in Figure 12.45. Figure 12.45: Error message that indicates that there are changes on the remote repository that you do not have locally. Usually, getting out of this situation is not too troublesome. First you need to pull the changes that exist on GitHub that you do not yet have in the local repository. Usually when this happens, Git can automatically merge the changes for you, even if you and your collaborators were working on different parts of the same file! 
If, however, you and your collaborators made changes to the same line of the same file, Git will not be able to automatically merge the changes—it will not know whether to keep your version of the line(s), your collaborator’s version of the line(s), or some blend of the two. When this happens, Git will tell you that you have a merge conflict in certain file(s) (Figure 12.46). Figure 12.46: Error message that indicates you and your collaborators made changes to the same line of the same file and that Git will not be able to automatically merge the changes. 12.8.3 Handling merge conflicts To fix the merge conflict, you need to open the offending file in a plain text editor and look for special marks that Git puts in the file to tell you where the merge conflict occurred (Figure 12.47). Figure 12.47: How to open a Jupyter notebook as a plain text file view in Jupyter. The beginning of the merge conflict is preceded by <<<<<<< HEAD and the end of the merge conflict is marked by >>>>>>>. Between these markings, Git also inserts a separator (=======). The version of the change before the separator is your change, and the version that follows the separator was the change that existed on GitHub. In Figure 12.48, you can see that in your local repository there is a line of code that calls scale_color_manual with three color values (deeppink2, cyan4, and purple1). It looks like your collaborator made an edit to that line too, except with different colors (blue3, red3, and black)! Figure 12.48: Merge conflict identifiers (highlighted in red). Once you have decided which version of the change (or what combination!) to keep, you need to use the plain text editor to remove the special marks that Git added (Figure 12.49). Figure 12.49: File where a merge conflict has been resolved. The file must be saved, added to the staging area, and then committed before you will be able to push your changes to GitHub. 
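To see the conflict markers described above without involving GitHub, the situation can be reproduced locally. The following is a rough sketch under stated assumptions: a branch named collaborator stands in for the version on GitHub, and plot.R is a hypothetical one-line file used in place of the eda.ipynb notebook so that the markers are easy to read.

```shell
#!/bin/sh
set -e

# Throwaway repository; names and email are placeholders.
workdir=$(mktemp -d)
cd "$workdir"
git init --quiet project
cd project
git config user.name "Example Author"
git config user.email "author@example.com"

# Starting point shared by you and your collaborator.
echo 'colors <- c("grey")' > plot.R
git add plot.R
git commit --quiet -m "Add color setting"
main_branch=$(git symbolic-ref --short HEAD)

# The collaborator's edit (this branch stands in for the version on GitHub).
git checkout --quiet -b collaborator
echo 'colors <- c("blue3", "red3", "black")' > plot.R
git commit --quiet -am "Use blue, red, and black"

# Your edit to the same line.
git checkout --quiet "$main_branch"
echo 'colors <- c("deeppink2", "cyan4", "purple1")' > plot.R
git commit --quiet -am "Use pink, cyan, and purple"

# Merging now fails with a conflict; Git writes the markers into the file.
git merge collaborator > /dev/null 2>&1 || true
conflict_start=$(grep -c '^<<<<<<< HEAD' plot.R)
cat plot.R

# Resolve by keeping your version: rewrite the file without the markers,
# then add it to the staging area and commit.
echo 'colors <- c("deeppink2", "cyan4", "purple1")' > plot.R
git add plot.R
git commit --quiet -m "Resolve merge conflict in plot.R"
remaining_markers=$(grep -c '^=======' plot.R || true)
echo "marker lines left: $remaining_markers"
```

The cat step prints the file with your version above the ======= separator and the collaborator's version below it, which is the same layout you would see after a conflicting pull.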
12.8.4 Communicating using GitHub issues When working on a project in a team, you don’t just want a historical record of who changed what file and when in the project—you also want a record of decisions that were made, ideas that were floated, problems that were identified and addressed, and all other communication surrounding the project. Email and messaging apps are both very popular for general communication, but are not designed for project-specific communication: they both generally do not have facilities for organizing conversations by project subtopics, searching for conversations related to particular bugs or software versions, etc. GitHub issues are an alternative written communication medium to email and messaging apps, and were designed specifically to facilitate project-specific communication. Issues are opened from the “Issues” tab on the project’s GitHub page, and they persist there even after the conversation is over and the issue is closed (in contrast to email, issues are not usually deleted). One issue thread is usually created per topic, and they are easily searchable using GitHub’s search tools. All issues are accessible to all project collaborators, so no one is left out of the conversation. Finally, issues can be set up so that team members get email notifications when a new issue is created or a new post is made in an issue thread. Replying to issues from email is also possible. Given all of these advantages, we highly recommend the use of issues for project-related communication. To open a GitHub issue, first click on the “Issues” tab (Figure 12.50). Figure 12.50: The “Issues” tab on the GitHub web interface. Next click the “New issue” button (Figure 12.51). Figure 12.51: The “New issue” button on the GitHub web interface. Add an issue title (which acts like an email subject line), and then put the body of the message in the larger text box. Finally, click “Submit new issue” to post the issue to share with others (Figure 12.52). 
Figure 12.52: Dialog boxes and submission button for creating new GitHub issues. You can reply to an issue that someone opened by adding your written response to the large text box and clicking “Comment” (Figure 12.53). Figure 12.53: Dialog box for replying to GitHub issues. When a conversation is resolved, you can click “Close issue”. The closed issue can be later viewed by clicking the “Closed” header link in the “Issues” tab (Figure 12.54). Figure 12.54: The “Closed” issues tab on the GitHub web interface. 12.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Collaboration with version control” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 12.10 Additional resources Now that you’ve picked up the basics of version control with Git and GitHub, you can expand your knowledge through the resources listed below: GitHub’s guides website and Happy Git and GitHub for the useR are great resources for learning more about Git and GitHub. Good enough practices in scientific computing (G. Wilson et al. 2017) provides more advice on useful workflows and “good enough” practices in data analysis projects. In addition to GitHub, there are other popular Git repository hosting services such as GitLab and BitBucket. Comparing all of these options is beyond the scope of this book, and until you become a more advanced user, you are perfectly fine to just stick with GitHub. Just be aware that you have options! 
GitHub’s documentation on creating a personal access token and the Happy Git and GitHub for the useR personal access tokens chapter are both excellent additional resources to consult if you need additional help generating and using personal access tokens. References "],["setup.html", "Chapter 13 Setting up your computer 13.1 Overview 13.2 Chapter learning objectives 13.3 Obtaining the worksheets for this book 13.4 Working with Docker 13.5 Working with JupyterLab Desktop", " Chapter 13 Setting up your computer 13.1 Overview In this chapter, you’ll learn how to set up the software needed to follow along with this book on your own computer. Given that installation instructions can vary based on computer setup, we provide instructions for multiple operating systems (Ubuntu Linux, MacOS, and Windows). Although the instructions in this chapter will likely work on many systems, we have specifically verified that they work on a computer that: runs Windows 10 Home, MacOS 13 Ventura, or Ubuntu 22.04, uses a 64-bit CPU, has a connection to the internet, uses English as the default language. 13.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Download the worksheets that accompany this book. Install the Docker virtualization engine. Edit and run the worksheets using JupyterLab running inside a Docker container. Install Git, JupyterLab Desktop, and R packages. Edit and run the worksheets using JupyterLab Desktop. 13.3 Obtaining the worksheets for this book The worksheets containing exercises for this book are online at https://worksheets.datasciencebook.ca. The worksheets can be launched directly from that page using the Binder links in the rightmost column of the table. This is the easiest way to access the worksheets, but note that you will not be able to save your work and return to it again later. In order to save your progress, you will need to download the worksheets to your own computer and work on them locally. 
You can download the worksheets as a compressed zip file using the link at the top of the page. Once you unzip the downloaded file, you will have a folder containing all of the Jupyter notebook worksheets accompanying this book. See Chapter 11 for instructions on working with Jupyter notebooks. 13.4 Working with Docker Once you have downloaded the worksheets, you will next need to install and run the software required to work on Jupyter notebooks on your own computer. Doing this setup manually can be quite tricky, as it involves quite a few different software packages, not to mention getting the right versions of everything—the worksheets and autograder tests may not work unless all the versions are exactly right! To keep things simple, we instead recommend that you install Docker. Docker lets you run your Jupyter notebooks inside a pre-built container that comes with precisely the right versions of all software packages needed to run the worksheets that come with this book. Note: A container is a virtual user space within your computer. Within the container, you can run software in isolation without interfering with the other software that already exists on your machine. In this book, we use a container to run a specific version of the R programming language, as well as other necessary packages. The container ensures that the worksheets function correctly, even if you have a different version of R installed on your computer—or even if you haven’t installed R at all! 13.4.1 Windows Installation To install Docker on Windows, visit the online Docker documentation, and download the Docker Desktop Installer.exe file. Double-click the file to open the installer and follow the instructions on the installation wizard, choosing WSL-2 instead of Hyper-V when prompted. Note: Occasionally, when you first run Docker on Windows, you will encounter an error message. 
Some common errors you may see: If you need to update WSL, you can enter cmd.exe in the Start menu to run the command line. Type wsl --update to update WSL. If the admin account on your computer is different to your user account, you must add the user to the “docker-users” group. Run Computer Management as an administrator and navigate to Local Users and Groups -> Groups -> docker-users. Right-click to add the user to the group. Log out and log back in for the changes to take effect. If you need to enable virtualization, you will need to edit your BIOS. Restart your computer, and enter the BIOS using the hotkey (usually Delete, Esc, and/or one of the F# keys). Look for an “Advanced” menu, and under your CPU settings, set the “Virtualization” option to “enabled”. Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. Running JupyterLab Run Docker Desktop. Once it is running, you need to download and run the Docker image that we have made available for the worksheets (an image is like a “snapshot” of a computer with all the right packages pre-installed). You only need to do this step one time; the image will remain the next time you run Docker Desktop. In the Docker Desktop search bar, enter ubcdsci/r-dsci-100, as this is the name of the image. You will see the ubcdsci/r-dsci-100 image in the list (Figure 13.1), and “latest” in the Tag drop down menu. We need to change “latest” to the right image version before proceeding. To find the right tag, open the Dockerfile in the worksheets repository, and look for the line FROM ubcdsci/r-dsci-100: followed by the tag consisting of a sequence of numbers and letters. Back in Docker Desktop, in the “Tag” drop down menu, click that tag to select the correct image version. Then click the “Pull” button to download the image. 
Figure 13.1: The Docker Desktop search window. Make sure to click the Tag drop down menu and find the right version of the image before clicking the Pull button to download it. Once the image is done downloading, click the “Images” button on the left side of the Docker Desktop window (Figure 13.2). You will see the recently downloaded image listed there under the “Local” tab. Figure 13.2: The Docker Desktop images tab. To start up a container using that image, click the play button beside the image. This will open the run configuration menu (Figure 13.3). Expand the “Optional settings” drop down menu. In the “Host port” textbox, enter 8888. In the “Volumes” section, click the “Host path” box and navigate to the folder where your Jupyter worksheets are stored. In the “Container path” text box, enter /home/jovyan/work. Then click the “Run” button to start the container. Figure 13.3: The Docker Desktop container run configuration menu. After clicking the “Run” button, you will see a terminal. The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the URL in the terminal that starts with http://127.0.0.1:8888 (highlighted by the red box in Figure 13.4), and paste it into your browser to start JupyterLab. Figure 13.4: The terminal text after running the Docker container. The red box indicates the URL that you should paste into your browser to open JupyterLab. When you are done working, make sure to shut down and remove the container by clicking the red trash can symbol (in the top right corner of Figure 13.4). You will not be able to start the container again until you do so. More information on installing and running Docker on Windows, as well as troubleshooting tips, can be found in the online Docker documentation. 13.4.2 MacOS Installation To install Docker on MacOS, visit the online Docker documentation, and download the Docker.dmg installation file that is appropriate for your computer. 
To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the Apple support page has information to help you determine which processor you have. Once downloaded, double-click the file to open the installer, then drag the Docker icon to the Applications folder. Double-click the icon in the Applications folder to start Docker. In the installation window, use the recommended settings. Running JupyterLab Run Docker Desktop. Once it is running, follow the instructions above in the Windows section on Running JupyterLab (the user interface is the same). More information on installing and running Docker on MacOS, as well as troubleshooting tips, can be found in the online Docker documentation. 13.4.3 Ubuntu Installation To install Docker on Ubuntu, open the terminal and enter the following five commands. sudo apt update sudo apt install ca-certificates curl gnupg curl -fsSL https://get.docker.com -o get-docker.sh sudo chmod u+x get-docker.sh sudo sh get-docker.sh Running JupyterLab First, open the Dockerfile in the worksheets repository, and look for the line FROM ubcdsci/r-dsci-100: followed by a tag consisting of a sequence of numbers and letters. Then in the terminal, navigate to the directory where you want to run JupyterLab, and run the following command, replacing TAG with the tag you found earlier. docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/r-dsci-100:TAG jupyter lab The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the URL in your terminal that starts with http://127.0.0.1:8888 (highlighted by the red box in Figure 13.5), and paste it into your browser to start JupyterLab. More information on installing and running Docker on Ubuntu, as well as troubleshooting tips, can be found in the online Docker documentation. 
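The tag lookup described for Ubuntu can also be scripted. This is only a sketch under stated assumptions: the tag abc1234 and the sample Dockerfile written below are made-up placeholders, not the real tag from the worksheets repository, and the script prints the docker run command instead of running it, so Docker does not need to be installed to try it.

```shell
#!/bin/sh
set -e

# Write a sample Dockerfile to a throwaway directory.
# "abc1234" is a hypothetical tag used for illustration only.
workdir=$(mktemp -d)
printf 'FROM ubcdsci/r-dsci-100:abc1234\n' > "$workdir/Dockerfile"

# Pull out whatever follows "FROM ubcdsci/r-dsci-100:" on that line.
tag=$(sed -n 's/^FROM ubcdsci\/r-dsci-100:\([^[:space:]]*\).*$/\1/p' "$workdir/Dockerfile")

# Assemble the command from the text above with TAG replaced.
# The \$(pwd) is left unexpanded so it is evaluated where you run the command.
run_command="docker run --rm -v \$(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/r-dsci-100:$tag jupyter lab"
echo "$run_command"
```

To use this for real, you would point sed at the Dockerfile from the worksheets repository and run the printed command from the directory that holds your worksheets.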
Figure 13.5: The terminal text after running the Docker container in Ubuntu. The red box indicates the URL that you should paste into your browser to open JupyterLab. 13.5 Working with JupyterLab Desktop You can also run the worksheets accompanying this book on your computer using JupyterLab Desktop. The advantage of JupyterLab Desktop over Docker is that it can be easier to install; Docker can sometimes run into some fairly technical issues (especially on Windows computers) that require expert troubleshooting. The downside of JupyterLab Desktop is that there is a (very) small chance that you may not end up with the right versions of all the R packages needed for the worksheets. Docker, on the other hand, guarantees that the worksheets will work exactly as intended. In this section, we will cover how to install JupyterLab Desktop, Git and the JupyterLab Git extension (for version control, as discussed in Chapter 12), and all of the R packages needed to run the code in this book. 13.5.1 Windows Installation First, we will install Git for version control. Go to the Git download page and download the Windows version of Git. Once the download has finished, run the installer and accept the default configuration for all pages. Next, visit the “Installation” section of the JupyterLab Desktop homepage. Download the JupyterLab-Setup-Windows.exe installer file for Windows. Double-click the installer to run it, and use the default settings. Run JupyterLab Desktop by clicking the icon on your desktop. Configuring JupyterLab Desktop Next, in the JupyterLab Desktop graphical interface that appears (Figure 13.6), you will see text at the bottom saying “Python environment not found”. Click “Install using the bundled installer” to set up the environment. Figure 13.6: The JupyterLab Desktop graphical user interface. 
Next, we need to add the JupyterLab Git extension (so that we can use version control directly from within JupyterLab Desktop), the IRkernel (to enable the R programming language), and various R software packages. Click “New session…” in the JupyterLab Desktop user interface, then scroll to the bottom, and click “Terminal” under the “Other” heading (Figure 13.7). Figure 13.7: A JupyterLab Desktop session, showing the Terminal option at the bottom. In this terminal, run the following commands: pip install --upgrade jupyterlab-git conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/environment.yml The second command installs the specific R and package versions specified in the environment.yml file found in the worksheets repository. We will always keep the versions in the environment.yml file updated so that they are compatible with the exercise worksheets that accompany the book. Once all of the software installation is complete, it is a good idea to restart JupyterLab Desktop entirely before you proceed to doing your data analysis. This will ensure all the software and settings you put in place are correctly set up and ready for use. 13.5.2 MacOS Installation First, we will install Git for version control. Open the terminal (how-to video) and type the following command: xcode-select --install Next, visit the “Installation” section of the JupyterLab Desktop homepage. Download the JupyterLab-Setup-MacOS-x64.dmg or JupyterLab-Setup-MacOS-arm64.dmg installer file. To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the Apple support page has information to help you determine which processor you have. Once downloaded, double-click the file to open the installer, then drag the JupyterLab Desktop icon to the Applications folder. 
Double-click the icon in the Applications folder to start JupyterLab Desktop. Configuring JupyterLab Desktop From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on Configuring JupyterLab Desktop to set up the environment, install the JupyterLab Git extension, and install the various R software packages needed for the worksheets. 13.5.3 Ubuntu Installation First, we will install Git for version control. Open the terminal and type the following commands: sudo apt update sudo apt install git Next, visit the “Installation” section of the JupyterLab Desktop homepage. Download the JupyterLab-Setup-Debian.deb installer file for Ubuntu/Debian. Open a terminal, navigate to where the installer file was downloaded, and run the command sudo dpkg -i JupyterLab-Setup-Debian.deb Run JupyterLab Desktop using the command jlab Configuring JupyterLab Desktop From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on Configuring JupyterLab Desktop to set up the environment, install the JupyterLab Git extension, and install the various R software packages needed for the worksheets. "],["references.html", "References", " References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] +[["index.html", "A First Introduction Welcome!", " Data Science A First Introduction Tiffany Timbers, Trevor Campbell, and Melissa Lee 2024-08-21 Welcome! This is the website for Data Science: A First Introduction. You can read the web version of the book on this site. Click a section in the table of contents on the left side of the page to navigate to it. If you are on a mobile device, you may need to open the table of contents first by clicking the menu button on the top left of the page. 
You can purchase a PDF or print copy of the book on the CRC Press website or on Amazon. For the Python version of the textbook, visit https://python.datasciencebook.ca. This book is listed in a number of open educational resource (OER) collections: The University of British Columbia OER collection The OER Commons MERLOT This work by Tiffany Timbers, Trevor Campbell, and Melissa Lee is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. "],["foreword.html", "Foreword", " Foreword Roger D. Peng Johns Hopkins Bloomberg School of Public Health 2022-01-04 The field of data science has expanded and grown significantly in recent years, attracting excitement and interest from many different directions. The demand for introductory educational materials has grown concurrently with the growth of the field itself, leading to a proliferation of textbooks, courses, blog posts, and tutorials. This book is an important contribution to this fast-growing literature, but given the wide availability of materials, a reader should be inclined to ask, “What is the unique contribution of this book?” In order to answer that question it is useful to step back for a moment and consider the development of the field of data science over the past few years. When thinking about data science, it is important to consider two questions: “What is data science?” and “How should one do data science?” The former question is under active discussion amongst a broad community of researchers and practitioners and there does not appear to be much consensus to date. However, there seems a general understanding that data science focuses on the more “active” elements—data wrangling, cleaning, and analysis—of answering questions with data. These elements are often highly problem-specific and may seem difficult to generalize across applications. 
Nevertheless, over time we have seen some core elements emerge that appear to repeat themselves as useful concepts across different problems. Given the lack of clear agreement over the definition of data science, there is a strong need for a book like this one to propose a vision for what the field is and what the implications are for the activities in which members of the field engage. The first important concept addressed by this book is tidy data, which is a format for tabular data formally introduced to the statistical community in a 2014 paper by Hadley Wickham. The tidy data organization strategy has proven a powerful abstract concept for conducting data analysis, in large part because of the vast toolchain implemented in the Tidyverse collection of R packages. The second key concept is the development of workflows for reproducible and auditable data analyses. Modern data analyses have only grown in complexity due to the availability of data and the ease with which we can implement complex data analysis procedures. Furthermore, these data analyses are often part of decision-making processes that may have significant impacts on people and communities. Therefore, there is a critical need to build reproducible analyses that can be studied and repeated by others in a reliable manner. Statistical methods clearly represent an important element of data science for building prediction and classification models and for making inferences about unobserved populations. Finally, because a field can succeed only if it fosters an active and collaborative community, it has become clear that being fluent in the tools of collaboration is a core element of data science. This book takes these core concepts and focuses on how one can apply them to do data science in a rigorous manner. Students who learn from this book will be well-versed in the techniques and principles behind producing reliable evidence from data. 
This book is centered around the use of the R programming language within the tidy data framework, and as such employs the most recent advances in data analysis coding. The use of Jupyter notebooks for exercises immediately places the student in an environment that encourages auditability and reproducibility of analyses. The integration of git and GitHub into the course is a key tool for teaching about collaboration and community, key concepts that are critical to data science. The demand for training in data science continues to increase. The availability of large quantities of data to answer a variety of questions, the computational power available to many more people than ever before, and the public awareness of the importance of data for decision-making have all contributed to the need for high-quality data science work. This book provides a sophisticated first introduction to the field of data science and provides a balanced mix of practical skills along with generalizable principles. As we continue to introduce students to data science and train them to confront an expanding array of data science problems, they will be well-served by the ideas presented here. "],["preface.html", "Preface", " Preface This textbook aims to be an approachable introduction to the world of data science. In this book, we define data science as the process of generating insight from data through reproducible and auditable processes. If you analyze some data and give your analysis to a friend or colleague, they should be able to re-run the analysis from start to finish and get the same result you did (reproducibility). They should also be able to see and understand all the steps in the analysis, as well as the history of how the analysis developed (auditability). Creating reproducible and auditable analyses allows both you and others to easily double-check and validate your work. 
At a high level, in this book, you will learn how to identify common problems in data science, and solve those problems with reproducible and auditable workflows. Figure 0.1 summarizes what you will learn in each chapter of this book. Throughout, you will learn how to use the R programming language (R Core Team 2021) to perform all the tasks associated with data analysis. You will spend the first four chapters learning how to use R to load, clean, wrangle (i.e., restructure the data into a usable format) and visualize data while answering descriptive and exploratory data analysis questions. In the next six chapters, you will learn how to answer predictive, exploratory, and inferential data analysis questions with common methods in data science, including classification, regression, clustering, and estimation. In the final chapters (11–13), you will learn how to combine R code, formatted text, and images in a single coherent document with Jupyter, use version control for collaboration, and install and configure the software needed for data science on your own computer. If you are reading this book as part of a course that you are taking, the instructor may have set up all of these tools already for you; in this case, you can continue on through the book reading the chapters in order. But if you are reading this independently, you may want to jump to these last three chapters early before going on to make sure your computer is set up in such a way that you can try out the example code that we include throughout the book. Figure 0.1: Where are we going? Each chapter in the book has an accompanying worksheet that provides exercises to help you practice the concepts you will learn. We strongly recommend that you work through the worksheet when you finish reading each chapter before moving on to the next chapter. 
All of the worksheets are available at https://worksheets.datasciencebook.ca; the “Exercises” section at the end of each chapter points you to the right worksheet for that chapter. For each worksheet, you can either launch an interactive version of the worksheet in your browser by clicking the “launch binder” button, or preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["acknowledgments.html", "Acknowledgments", " Acknowledgments We’d like to thank everyone who has contributed to the development of Data Science: A First Introduction. This is an open source textbook that began as a collection of course readings for DSCI 100, a new introductory data science course at the University of British Columbia (UBC). Several faculty members in the UBC Department of Statistics were pivotal in shaping the direction of that course, and as such, contributed greatly to the broad structure and list of topics in this book. We would especially like to thank Matías Salibían-Barrera for his mentorship during the initial development and roll-out of both DSCI 100 and this book. His door was always open when we needed to chat about how to best introduce and teach data science to our first-year students. We would also like to thank Gabriela Cohen Freue for her DSCI 561 (Regression I) teaching materials from the UBC Master of Data Science program, as some of our linear regression figures were inspired by these. We would also like to thank all those who contributed to the process of publishing this book. 
In particular, we would like to thank all of our reviewers for their feedback and suggestions: Rohan Alexander, Isabella Ghement, Virgilio Gómez Rubio, Albert Kim, Adam Loy, Maria Prokofieva, Emily Riederer, and Greg Wilson. The book was improved substantially by their insights. We would like to give special thanks to Jim Zidek for his support and encouragement throughout the process, and to Roger Peng for graciously offering to write the Foreword. Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the past few years. They provided invaluable feedback on the book and worksheets; they found bugs for us (and stood by very patiently in class while we frantically fixed those bugs); and they brought a level of enthusiasm to the class that sustained us during the hard work of creating a new course and writing a textbook. Our interactions with them taught us how to teach data science, and that learning is reflected in the content of this book. "],["about-the-authors.html", "About the authors", " About the authors Tiffany Timbers is an Associate Professor of Teaching in the Department of Statistics and Co-Director for the Master of Data Science program (Vancouver Option) at the University of British Columbia. In these roles she teaches and develops curriculum around the responsible application of Data Science to solve real-world problems. One of her favorite courses she teaches is a graduate course on collaborative software development, which focuses on teaching how to create R and Python packages using modern tools and workflows. Trevor Campbell is an Associate Professor in the Department of Statistics at the University of British Columbia. His research focuses on automated, scalable Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and Bayesian theory. 
He was previously a postdoctoral associate advised by Tamara Broderick in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and Institute for Data, Systems, and Society (IDSS) at MIT, a Ph.D. candidate under Jonathan How in the Laboratory for Information and Decision Systems (LIDS) at MIT, and before that he was in the Engineering Science program at the University of Toronto. Melissa Lee is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. She teaches and develops curriculum for undergraduate statistics and data science courses. Her work focuses on student-centered approaches to teaching, developing and assessing open educational resources, and promoting equity, diversity, and inclusion initiatives. "],["intro.html", "Chapter 1 R and the Tidyverse 1.1 Overview 1.2 Chapter learning objectives 1.3 Canadian languages data set 1.4 Asking a question 1.5 Loading a tabular data set 1.6 Naming things in R 1.7 Creating subsets of data frames with filter & select 1.8 Using arrange to order and slice to select rows by index number 1.9 Adding and modifying columns using mutate 1.10 Exploring data with visualizations 1.11 Accessing documentation 1.12 Exercises", " Chapter 1 R and the Tidyverse 1.1 Overview This chapter provides an introduction to data science and the R programming language. The goal here is to get your hands dirty right from the start! We will walk through an entire data analysis, and along the way introduce different types of data analysis question, some fundamental programming concepts in R, and the basics of loading, cleaning, and visualizing data. In the following chapters, we will dig into each of these steps in much more detail; but for now, let’s jump in to see how much we can do with data science! 
1.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Identify the different types of data analysis question and categorize a question into the correct type. Load the tidyverse package into R. Read tabular data with read_csv. Create new variables and objects in R using the assignment symbol. Create and organize subsets of tabular data using filter, select, arrange, and slice. Add and modify columns in tabular data using mutate. Visualize data with a ggplot bar plot. Use ? to access help and documentation tools in R. 1.3 Canadian languages data set In this chapter, we will walk through a full analysis of a data set relating to languages spoken at home by Canadian residents (Figure 1.1). Many Indigenous peoples exist in Canada with their own cultures and languages; these languages are often unique to Canada and not spoken anywhere else in the world (Statistics Canada 2018). Sadly, colonization has led to the loss of many of these languages. For instance, generations of children were not allowed to speak their mother tongue (the first language an individual learns in childhood) in Canadian residential schools. Colonizers also renamed places they had “discovered” (K. Wilson 2018). Acts such as these have significantly harmed the continuity of Indigenous languages in Canada, and some languages are considered “endangered” as few people report speaking them. To learn more, please see Canadian Geographic’s article, “Mapping Indigenous Languages in Canada” (Walker 2017), They Came for the Children: Canada, Aboriginal peoples, and Residential Schools (Truth and Reconciliation Commission of Canada 2012) and the Truth and Reconciliation Commission of Canada’s Calls to Action (Truth and Reconciliation Commission of Canada 2015). Figure 1.1: Map of Canada. 
The data set we will study in this chapter is taken from the canlang R data package (Timbers 2020), which has population language data collected during the 2016 Canadian census (Statistics Canada 2016a). In this data, there are 214 languages recorded, each having six different properties: category: Higher-level language category, describing whether the language is an Official Canadian language, an Aboriginal (i.e., Indigenous) language, or a Non-Official and Non-Aboriginal language. language: The name of the language. mother_tongue: Number of Canadian residents who reported the language as their mother tongue. Mother tongue is generally defined as the language someone was exposed to since birth. most_at_home: Number of Canadian residents who reported the language as being spoken most often at home. most_at_work: Number of Canadian residents who reported the language as being used most often at work. lang_known: Number of Canadian residents who reported knowledge of the language. According to the census, more than 60 Aboriginal languages were reported as being spoken in Canada. Suppose we want to know which are the most common; then we might ask the following question, which we wish to answer using our data: Which ten Aboriginal languages were most often reported in 2016 as mother tongues in Canada, and how many people speak each of them? Note: Data science cannot be done without a deep understanding of the data and problem domain. In this book, we have simplified the data sets used in our examples to concentrate on methods and fundamental concepts. But in real life, you cannot and should not do data science without a domain expert. Alternatively, it is common to practice data science in your own domain of expertise! Remember that when you work with data, it is essential to think about how the data were collected, which affects the conclusions you can draw. If your data are biased, then your results will be biased! 
1.4 Asking a question Every good data analysis begins with a question—like the above—that you aim to answer using data. As it turns out, there are actually a number of different types of question regarding data: descriptive, exploratory, predictive, inferential, causal, and mechanistic, all of which are defined in Table 1.1. Carefully formulating a question as early as possible in your analysis—and correctly identifying which type of question it is—will guide your overall approach to the analysis as well as the selection of appropriate tools. Table 1.1: Types of data analysis question (Leek and Peng 2015; Peng and Matsui 2015). Question type Description Example Descriptive A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). How many people live in each province and territory in Canada? Exploratory A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? Predictive A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. What political party will someone vote for in the next Canadian election? Inferential A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. Does political party voting change with indicators of wealth for all people living in Canada? Causal A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. Does wealth lead to voting for a certain political party in Canadian elections? 
Mechanistic A question that asks about the underlying mechanism of the observed patterns, trends, or relationships (i.e., how does it happen?) How does wealth lead to voting for a certain political party in Canadian elections? In this book, you will learn techniques to answer the first four types of question: descriptive, exploratory, predictive, and inferential; causal and mechanistic questions are beyond the scope of this book. In particular, you will learn how to apply the following analysis tools: Summarization: computing and reporting aggregated values pertaining to a data set. Summarization is most often used to answer descriptive questions, and can occasionally help with answering exploratory questions. For example, you might use summarization to answer the following question: What is the average race time for runners in this data set? Tools for summarization are covered in detail in Chapters 2 and 3, but appear regularly throughout the text. Visualization: plotting data graphically. Visualization is typically used to answer descriptive and exploratory questions, but plays a critical supporting role in answering all of the types of question in Table 1.1. For example, you might use visualization to answer the following question: Is there any relationship between race time and age for runners in this data set? This is covered in detail in Chapter 4, but again appears regularly throughout the book. Classification: predicting a class or category for a new observation. Classification is used to answer predictive questions. For example, you might use classification to answer the following question: Given measurements of a tumor’s average cell area and perimeter, is the tumor benign or malignant? Classification is covered in Chapters 5 and 6. Regression: predicting a quantitative value for a new observation. Regression is also used to answer predictive questions. 
For example, you might use regression to answer the following question: What will be the race time for a 20-year-old runner who weighs 50kg? Regression is covered in Chapters 7 and 8. Clustering: finding previously unknown/unlabeled subgroups in a data set. Clustering is often used to answer exploratory questions. For example, you might use clustering to answer the following question: What products are commonly bought together on Amazon? Clustering is covered in Chapter 9. Estimation: taking measurements for a small number of items from a large group and making a good guess for the average or proportion for the large group. Estimation is used to answer inferential questions. For example, you might use estimation to answer the following question: Given a survey of cellphone ownership of 100 Canadians, what proportion of the entire Canadian population own Android phones? Estimation is covered in Chapter 10. Referring to Table 1.1, our question about Aboriginal languages is an example of a descriptive question: we are summarizing the characteristics of a data set without further interpretation. And referring to the list above, it looks like we should use visualization and perhaps some summarization to answer the question. So in the remainder of this chapter, we will work towards making a visualization that shows us the ten most common Aboriginal languages in Canada and their associated counts, according to the 2016 census. 1.5 Loading a tabular data set A data set is, at its core essence, a structured collection of numbers and characters. Aside from that, there are really no strict rules; data sets can come in many different forms! Perhaps the most common form of data set that you will find in the wild, however, is tabular data. Think spreadsheets in Microsoft Excel: tabular data are rectangular-shaped and spreadsheet-like, as shown in Figure 1.2. In this book, we will focus primarily on tabular data. 
Since we are using R for data analysis in this book, the first step for us is to load the data into R. When we load tabular data into R, it is represented as a data frame object. Figure 1.2 shows that an R data frame is very similar to a spreadsheet. We refer to the rows as observations; these are the individual objects for which we collect data. In Figure 1.2, the observations are languages. We refer to the columns as variables; these are the characteristics of each observation. In Figure 1.2, the variables are the language’s category, its name, the number of mother tongue speakers, etc. Figure 1.2: A spreadsheet versus a data frame in R. The first kind of data file that we will learn how to load into R as a data frame is the comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened and saved using common spreadsheet programs like Microsoft Excel and Google Sheets. For example, the .csv file named can_lang.csv is included with the code for this book. 
If we were to open this data in a plain text editor (a program like Notepad that just shows text with no formatting), we would see each row on its own line, and each entry in the table separated by a comma: category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,445,10,2775 Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 Aboriginal languages,Algonquin,1260,370,40,2480 Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21930 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 To load this data into R so that we can do things with it (e.g., perform analyses or create data visualizations), we will need to use a function. A function is a special word in R that takes instructions (we call these arguments) and does something. The function we will use to load a .csv file into R is called read_csv. In its most basic use-case, read_csv expects that the data file: has column names (or headers), uses a comma (,) to separate the columns, and does not have row names. Below you’ll see the code used to load the data into R using the read_csv function. Note that the read_csv function is not included in the base installation of R, meaning that it is not one of the primary functions ready to use when you install R. Therefore, you need to load it from somewhere else before you can use it. The place from which we will load it is called an R package. An R package is a collection of functions that can be used in addition to the built-in R package functions once loaded. 
The read_csv function, in particular, can be made accessible by loading the tidyverse R package (Wickham 2021b; Wickham et al. 2019) using the library function. The tidyverse package contains many functions that we will use throughout this book to load, clean, wrangle, and visualize data. library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.2 ✔ readr 2.1.4 ## ✔ forcats 1.0.0 ✔ stringr 1.5.0 ## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ## ✔ purrr 1.0.1 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors Note: You may have noticed that we got some extra output from R regarding attached packages and conflicts below our code line. These are examples of messages in R, which give the user more information that might be handy to know. The Attaching packages message is natural when loading tidyverse, since tidyverse actually automatically causes other packages to be imported too, such as dplyr. In the future, when we load tidyverse in this book, we will silence these messages to help with the readability of the book. The Conflicts message is also totally normal in this circumstance. This message tells you if functions from different packages share the same name, which is confusing to R. For example, in this case, the dplyr package and the stats package both provide a function called filter. The message above (dplyr::filter() masks stats::filter()) is R telling you that it is going to default to the dplyr package version of this function. So if you use the filter function, you will be using the dplyr version. In order to use the stats version, you need to use its full name stats::filter. 
Messages are not errors, so generally you don’t need to take action when you see a message; but you should always read the message and critically think about what it means and whether you need to do anything about it. After loading the tidyverse package, we can call the read_csv function and pass it a single argument: the name of the file, \"can_lang.csv\". We have to put quotes around file names and other letters and words that we use in our code to distinguish it from the special words (like functions!) that make up the R programming language. The file’s name is the only argument we need to provide because our file satisfies everything else that the read_csv function expects in the default use-case. Figure 1.3 describes how we use the read_csv to read data into R. Figure 1.3: Syntax for the read_csv function. read_csv("data/can_lang.csv") ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows Note: There is another function that also loads csv files named read.csv. We will always use read_csv in this book, as it is designed to play nicely with all of the other tidyverse functions, which we will use extensively. Be careful not to accidentally use read.csv, as it can cause some tricky errors to occur in your code that are hard to track down! 
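As a small illustration of the default read_csv use-case described above, the sketch below reads a tiny CSV written directly in the code instead of a file. The names toy_csv and toy_data and all of the values are made up for this example, and passing literal text through I() assumes a reasonably recent version of the readr package (2.0 or later).

```r
library(readr)

# Hypothetical CSV text with made-up values. Wrapping the string in I()
# tells read_csv to treat it as literal data rather than a file path
# (assumes readr 2.0 or later).
toy_csv <- I('language,speakers
"Language A, n.o.s.",590
Language B,10260
')

# The quoted first field contains a comma but is still read as one entry.
toy_data <- read_csv(toy_csv, show_col_types = FALSE)
toy_data
```

The first line of the text supplies the column names, commas separate the columns, and there are no row names, so no arguments beyond the data itself are needed here.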
1.6 Naming things in R When we loaded the 2016 Canadian census language data using read_csv, we did not give this data frame a name. Therefore the data was just printed on the screen, and we cannot do anything else with it. That isn’t very useful. What would be more useful would be to give a name to the data frame that read_csv outputs, so that we can refer to it later for analysis and visualization. The way to assign a name to a value in R is via the assignment symbol <-. On the left side of the assignment symbol you put the name that you want to use, and on the right side of the assignment symbol you put the value that you want the name to refer to. Names can be used to refer to almost anything in R, such as numbers, words (also known as strings of characters), and data frames! Below, we set my_number to 3 (the result of 1+2) and we set name to the string \"Alice\". my_number <- 1 + 2 name <- "Alice" Note that when we name something in R using the assignment symbol, <-, we do not need to surround the name we are creating with quotes. This is because we are formally telling R that this special word denotes the value of whatever is on the right-hand side. Only characters and words that act as values on the right-hand side of the assignment symbol—e.g., the file name \"data/can_lang.csv\" that we specified before, or \"Alice\" above—need to be surrounded by quotes. After making the assignment, we can use the special name words we have created in place of their values. For example, if we want to do something with the value 3 later on, we can just use my_number instead. Let’s try adding 2 to my_number; you will see that R just interprets this as adding 3 and 2: my_number + 2 ## [1] 5 Object names can consist of letters, numbers, periods . and underscores _. Other symbols won’t work since they have their own meanings in R. For example, - is the subtraction symbol; if we try to assign a name with the - symbol, R will complain and we will get an error! 
my-number <- 1 Error in my - number <- 1 : object 'my' not found There are certain conventions for naming objects in R. When naming an object we suggest using only lowercase letters, numbers and underscores _ to separate the words in a name. R is case sensitive, which means that Letter and letter would be two different objects in R. You should also try to give your objects meaningful names. For instance, you can name a data frame x. However, using more meaningful terms, such as language_data, will help you remember what each name in your code represents. We recommend following the Tidyverse naming conventions outlined in the Tidyverse Style Guide (Wickham 2020). Let’s now use the assignment symbol to give the name can_lang to the 2016 Canadian census language data frame that we get from read_csv. can_lang <- read_csv("data/can_lang.csv") Wait a minute, nothing happened this time! Where’s our data? Actually, something did happen: the data was loaded in and now has the name can_lang associated with it. And we can use that name to access the data frame and do things with it. For example, we can type the name of the data frame to print the first few rows on the screen. You will also see at the top that the number of observations (i.e., rows) and variables (i.e., columns) are printed. Printing the first few rows of a data frame like this is a handy way to get a quick sense for what is contained in a data frame. 
can_lang ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 1.7 Creating subsets of data frames with filter & select Now that we’ve loaded our data into R, we can start wrangling the data to find the ten Aboriginal languages that were most often reported in 2016 as mother tongues in Canada. In particular, we will construct a table with the ten Aboriginal languages that have the largest counts in the mother_tongue column. The filter and select functions from the tidyverse package will help us here. The filter function allows you to obtain a subset of the rows with specific values, while the select function allows you to obtain a subset of the columns. Therefore, we can filter the rows to extract the Aboriginal languages in the data set, and then use select to obtain only the columns we want to include in our table. 1.7.1 Using filter to extract rows Looking at the can_lang data above, we see the category column contains different high-level categories of languages, which include “Aboriginal languages”, “Non-Official & Non-Aboriginal languages” and “Official languages”. To answer our question we want to filter our data set so we restrict our attention to only those languages in the “Aboriginal languages” category. We can use the filter function to obtain the subset of rows with desired values from a data frame. 
Figure 1.4 outlines what arguments we need to specify to use filter. The first argument to filter is the name of the data frame object, can_lang. The second argument is a logical statement to use when filtering the rows. A logical statement evaluates to either TRUE or FALSE; filter keeps only those rows for which the logical statement evaluates to TRUE. For example, in our analysis, we are interested in keeping only languages in the “Aboriginal languages” higher-level category. We can use the equivalency operator == to compare the values of the category column with the value \"Aboriginal languages\"; you will learn about many other kinds of logical statements in Chapter 3. Similar to when we loaded the data file and put quotes around the file name, here we need to put quotes around \"Aboriginal languages\". Using quotes tells R that this is a string value and not one of the special words that make up the R programming language, or one of the names we have given to data frames in the code we have already written. Figure 1.4: Syntax for the filter function. With these arguments, filter returns a data frame that has all the columns of the input data frame, but only those rows we asked for in our logical filter statement. 
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages") aboriginal_lang ## # A tibble: 67 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Aboriginal langu… Algonqu… 45 10 0 120 ## 3 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 4 Aboriginal langu… Athabas… 50 10 0 85 ## 5 Aboriginal langu… Atikame… 6150 5465 1100 6645 ## 6 Aboriginal langu… Babine … 110 20 10 210 ## 7 Aboriginal langu… Beaver 190 50 0 340 ## 8 Aboriginal langu… Blackfo… 2815 1110 85 5645 ## 9 Aboriginal langu… Carrier 1025 250 15 2100 ## 10 Aboriginal langu… Cayuga 45 10 10 125 ## # ℹ 57 more rows It’s good practice to check the output after using a function in R. We can see the original can_lang data set contained 214 rows with multiple kinds of category. The data frame aboriginal_lang contains only 67 rows, and looks like it only contains languages in the “Aboriginal languages” in the category column. So it looks like the function gave us the result we wanted! 1.7.2 Using select to extract columns Now let’s use select to extract the language and mother_tongue columns from this data frame. Figure 1.5 shows us the syntax for the select function. To extract these columns, we need to provide the select function with three arguments. The first argument is the name of the data frame object, which in this example is aboriginal_lang. The second and third arguments are the column names that we want to select: language and mother_tongue. After passing these three arguments, the select function returns two columns (the language and mother_tongue columns that we asked for) as a data frame. This code is also a great example of why being able to name things in R is useful: you can see that we are using the result of our earlier filter step (which we named aboriginal_lang) here in the next step of the analysis! Figure 1.5: Syntax for the select function. 
selected_lang <- select(aboriginal_lang, language, mother_tongue) selected_lang ## # A tibble: 67 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Aboriginal languages, n.o.s. 590 ## 2 Algonquian languages, n.i.e. 45 ## 3 Algonquin 1260 ## 4 Athabaskan languages, n.i.e. 50 ## 5 Atikamekw 6150 ## 6 Babine (Wetsuwet'en) 110 ## 7 Beaver 190 ## 8 Blackfoot 2815 ## 9 Carrier 1025 ## 10 Cayuga 45 ## # ℹ 57 more rows 1.8 Using arrange to order and slice to select rows by index number We have used filter and select to obtain a table with only the Aboriginal languages in the data set and their associated counts. However, we want to know the ten languages that are spoken most often. As a next step, we will order the mother_tongue column from largest to smallest value and then extract only the top ten rows. This is where the arrange and slice functions come to the rescue! The arrange function allows us to order the rows of a data frame by the values of a particular column. Figure 1.6 details what arguments we need to specify to use the arrange function. We need to pass the data frame as the first argument to this function, and the variable to order by as the second argument. Since we want to choose the ten Aboriginal languages most often reported as a mother tongue language, we will use the arrange function to order the rows in our selected_lang data frame by the mother_tongue column. We want to arrange the rows in descending order (from largest to smallest), so we pass the column to the desc function before using it as an argument. Figure 1.6: Syntax for the arrange function. arranged_lang <- arrange(selected_lang, by = desc(mother_tongue)) arranged_lang ## # A tibble: 67 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Cree, n.o.s. 
64050 ## 2 Inuktitut 35210 ## 3 Ojibway 17885 ## 4 Oji-Cree 12855 ## 5 Dene 10700 ## 6 Montagnais (Innu) 10235 ## 7 Mi'kmaq 6690 ## 8 Atikamekw 6150 ## 9 Plains Cree 3065 ## 10 Stoney 3025 ## # ℹ 57 more rows Next we will use the slice function, which selects rows according to their row number. Since we want to choose the most common ten languages, we will indicate we want the rows 1 to 10 using the argument 1:10. ten_lang <- slice(arranged_lang, 1:10) ten_lang ## # A tibble: 10 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Cree, n.o.s. 64050 ## 2 Inuktitut 35210 ## 3 Ojibway 17885 ## 4 Oji-Cree 12855 ## 5 Dene 10700 ## 6 Montagnais (Innu) 10235 ## 7 Mi'kmaq 6690 ## 8 Atikamekw 6150 ## 9 Plains Cree 3065 ## 10 Stoney 3025 1.9 Adding and modifying columns using mutate Recall that our data analysis question referred to the count of Canadians that speak each of the top ten most commonly reported Aboriginal languages as their mother tongue, and the ten_lang data frame indeed contains those counts… But perhaps, seeing these numbers, we became curious about the percentage of the population of Canada associated with each count. It is common to come up with new data analysis questions in the process of answering a first one—so fear not and explore! To answer this small question along the way, we need to divide each count in the mother_tongue column by the total Canadian population according to the 2016 census—i.e., 35,151,728—and multiply it by 100. We can perform this computation using the mutate function. We pass the ten_lang data frame as its first argument, then specify the equation that computes the percentages in the second argument. By using a new variable name on the left hand side of the equation, we will create a new column in the data frame; and if we use an existing name, we will modify that variable. In this case, we will opt to create a new column called mother_tongue_percent. 
canadian_population = 35151728 ten_lang_percent = mutate(ten_lang, mother_tongue_percent = 100 * mother_tongue / canadian_population) ten_lang_percent ## # A tibble: 10 × 3 ## language mother_tongue mother_tongue_percent ## <chr> <dbl> <dbl> ## 1 Cree, n.o.s. 64050 0.182 ## 2 Inuktitut 35210 0.100 ## 3 Ojibway 17885 0.0509 ## 4 Oji-Cree 12855 0.0366 ## 5 Dene 10700 0.0304 ## 6 Montagnais (Innu) 10235 0.0291 ## 7 Mi'kmaq 6690 0.0190 ## 8 Atikamekw 6150 0.0175 ## 9 Plains Cree 3065 0.00872 ## 10 Stoney 3025 0.00861 The ten_lang_percent data frame shows that the ten Aboriginal languages in the ten_lang data frame were spoken as a mother tongue by between 0.008% and 0.18% of the Canadian population. 1.10 Exploring data with visualizations The ten_lang table we generated in Section 1.8 answers our initial data analysis question. Are we done? Well, not quite; tables are almost never the best way to present the result of your analysis to your audience. Even the ten_lang table with only two columns presents some difficulty: for example, you have to scrutinize the table quite closely to get a sense for the relative numbers of speakers of each language. When you move on to more complicated analyses, this issue only gets worse. In contrast, a visualization would convey this information in a much more easily understood format. Visualizations are a great tool for summarizing information to help you effectively communicate with your audience, and creating effective data visualizations is an essential component of any data analysis. In this section we will develop a visualization of the ten Aboriginal languages that were most often reported in 2016 as mother tongues in Canada, as well as the number of people that speak each of them. 1.10.1 Using ggplot to create a bar plot In our data set, we can see that language and mother_tongue are in separate columns (or variables). In addition, there is a single row (or observation) for each language. 
The data are, therefore, in what we call a tidy data format. Tidy data is a fundamental concept and will be a significant focus in the remainder of this book: many of the functions from tidyverse require tidy data, including the ggplot function that we will use shortly for our visualization. We will formally introduce tidy data in Chapter 3. We will make a bar plot to visualize our data. A bar plot is a chart where the lengths of the bars represent certain values, like counts or proportions. We will make a bar plot using the mother_tongue and language columns from our ten_lang data frame. To create a bar plot of these two variables using the ggplot function, we must specify the data frame, which variables to put on the x and y axes, and what kind of plot to create. The ggplot function and its common usage is illustrated in Figure 1.7. Figure 1.8 shows the resulting bar plot generated by following the instructions in Figure 1.7. Figure 1.7: Creating a bar plot with the ggplot function. ggplot(ten_lang, aes(x = language, y = mother_tongue)) + geom_bar(stat = "identity") Figure 1.8: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made. Note: The vast majority of the time, a single expression in R must be contained in a single line of code. However, there are a small number of situations in which you can have a single R expression span multiple lines. Above is one such case: here, R knows that a line cannot end with a + symbol, and so it keeps reading the next line to figure out what the right-hand side of the + symbol should be. We could, of course, put all of the added layers on one line of code, but splitting them across multiple lines helps a lot with code readability. 1.10.2 Formatting ggplot objects It is exciting that we can already visualize our data to help answer our question, but we are not done yet! 
We can (and should) do more to improve the interpretability of the data visualization that we created. For example, by default, R uses the column names as the axis labels. Usually these column names do not have enough information about the variable in the column. We really should replace this default with a more informative label. For the example above, R uses the column name mother_tongue as the label for the y axis, but most people will not know what that is. And even if they did, they would not know how we measured this variable, or the group of people on which the measurements were taken. An axis label that reads “Mother Tongue (Number of Canadian Residents)” would be much more informative. Adding layers to the visualizations we create with ggplot is one common and easy way to improve and refine them. New layers are added to ggplot objects using the + symbol. For example, we can use the xlab (short for x axis label) and ylab (short for y axis label) functions to add layers where we specify meaningful and informative labels for the x and y axes. Again, since we are specifying words (e.g. "Mother Tongue (Number of Canadian Residents)") as arguments to xlab and ylab, we surround them with double quotation marks. We can add many more layers to format the plot further, and we will explore these in Chapter 4. ggplot(ten_lang, aes(x = language, y = mother_tongue)) + geom_bar(stat = "identity") + xlab("Language") + ylab("Mother Tongue (Number of Canadian Residents)") Figure 1.9: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue with x and y labels. Note that this visualization is not done yet; there are still improvements to be made. The result is shown in Figure 1.9. This is already quite an improvement!
Let’s tackle the next major issue with the visualization in Figure 1.9: the overlapping x axis labels, which are currently making it difficult to read the different language names. One solution is to rotate the plot such that the bars are horizontal rather than vertical. To accomplish this, we will swap the x and y coordinate axes: ggplot(ten_lang, aes(x = mother_tongue, y = language)) + geom_bar(stat = "identity") + xlab("Mother Tongue (Number of Canadian Residents)") + ylab("Language") Figure 1.10: Horizontal bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. There are no more serious issues with this visualization, but it could be refined further. Another big step forward, as shown in Figure 1.10! There are no more serious issues with the visualization. Now comes time to refine the visualization to make it even more well-suited to answering the question we asked earlier in this chapter. For example, the visualization could be made more transparent by organizing the bars according to the number of Canadian residents reporting each language, rather than in alphabetical order. We can reorder the bars using the reorder function, which orders a variable (here language) based on the values of the second variable (mother_tongue). ggplot(ten_lang, aes(x = mother_tongue, y = reorder(language, mother_tongue))) + geom_bar(stat = "identity") + xlab("Mother Tongue (Number of Canadian Residents)") + ylab("Language") Figure 1.11: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue with bars reordered. Figure 1.11 provides a very clear and well-organized answer to our original question; we can see what the ten most often reported Aboriginal languages were, according to the 2016 Canadian census, and how many people speak each of them. For instance, we can see that the Aboriginal language most often reported was Cree n.o.s. 
with over 60,000 Canadian residents reporting it as their mother tongue. Note: “n.o.s.” means “not otherwise specified”, so Cree n.o.s. refers to individuals who reported Cree as their mother tongue. In this data set, the Cree languages include the following categories: Cree n.o.s., Swampy Cree, Plains Cree, Woods Cree, and a ‘Cree not included elsewhere’ category (which includes Moose Cree, Northern East Cree and Southern East Cree) (Statistics Canada 2016b). 1.10.3 Putting it all together In the block of code below, we put everything from this chapter together, with a few modifications. In particular, we have actually skipped the select step that we did above; since you specify the variable names to plot in the ggplot function, you don’t actually need to select the columns in advance when creating a visualization. We have also provided comments next to many of the lines of code below using the hash symbol #. When R sees a # sign, it will ignore all of the text that comes after the symbol on that line. So you can use comments to explain lines of code for others, and perhaps more importantly, your future self! It’s good practice to get in the habit of commenting your code to improve its readability. This exercise demonstrates the power of R. In relatively few lines of code, we performed an entire data science workflow with a highly effective data visualization! We asked a question, loaded the data into R, wrangled the data (using filter, arrange and slice) and created a data visualization to help answer our question. In this chapter, you got a quick taste of the data science workflow; continue on with the next few chapters to learn each of these steps in much more detail! 
library(tidyverse) # load the data set can_lang <- read_csv("data/can_lang.csv") # obtain the 10 most common Aboriginal languages aboriginal_lang <- filter(can_lang, category == "Aboriginal languages") arranged_lang <- arrange(aboriginal_lang, by = desc(mother_tongue)) ten_lang <- slice(arranged_lang, 1:10) # create the visualization ggplot(ten_lang, aes(x = mother_tongue, y = reorder(language, mother_tongue))) + geom_bar(stat = "identity") + xlab("Mother Tongue (Number of Canadian Residents)") + ylab("Language") Figure 1.12: Putting it all together: bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. 1.11 Accessing documentation There are many R functions in the tidyverse package (and beyond!), and nobody can be expected to remember what every one of them does or all of the arguments we have to give them. Fortunately, R provides the ? symbol, which provides an easy way to pull up the documentation for most functions quickly. To use the ? symbol to access documentation, you just put the name of the function you are curious about after the ? symbol. For example, if you had forgotten what the filter function did or exactly what arguments to pass in, you could run the following code: ?filter Figure 1.13 shows the documentation that will pop up, including a high-level description of the function, its arguments, a description of each, and more. Note that you may find some of the text in the documentation a bit too technical right now (for example, what is dbplyr, and what is a lazy data frame?). Fear not: as you work through this book, many of these terms will be introduced to you, and slowly but surely you will become more adept at understanding and navigating documentation like that shown in Figure 1.13. 
But do keep in mind that the documentation is not written to teach you about a function; it is just there as a reference to remind you about the different arguments and usage of functions that you have already learned about elsewhere. Figure 1.13: The documentation for the filter function, including a high-level description, a list of arguments and their meanings, and more. 1.12 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “R and the tidyverse” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["reading.html", "Chapter 2 Reading in data locally and from the web 2.1 Overview 2.2 Chapter learning objectives 2.3 Absolute and relative file paths 2.4 Reading tabular data from a plain text file into R 2.5 Reading tabular data from a Microsoft Excel file 2.6 Reading data from a database 2.7 Writing data from R to a .csv file 2.8 Obtaining data from the web 2.9 Exercises 2.10 Additional resources", " Chapter 2 Reading in data locally and from the web 2.1 Overview In this chapter, you’ll learn to read tabular data of various formats into R from your local device (e.g., your laptop) and the web. “Reading” (or “loading”) is the process of converting data (stored as plain text, a database, HTML, etc.) into an object (e.g., a data frame) that R can easily access and manipulate. Thus reading data is the gateway to any data analysis; you won’t be able to analyze data unless you’ve loaded it first. 
And because there are many ways to store data, there are similarly many ways to read data into R. The more time you spend upfront matching the data reading method to the type of data you have, the less time you will have to devote to re-formatting, cleaning and wrangling your data (the second step to all data analyses). It’s like making sure your shoelaces are tied well before going for a run so that you don’t trip later on! 2.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Define the types of path and use them to locate files: absolute file path relative file path Uniform Resource Locator (URL) Read data into R from various types of path using: read_csv read_tsv read_csv2 read_delim read_excel Compare and contrast the read_* functions. Describe when to use the following read_* function arguments: skip delim col_names Choose the appropriate tidyverse read_* function and function arguments to load a given plain text tabular data set into R. Use the rename function to rename columns in a data frame. Use read_excel function and arguments to load a sheet from an excel file into R. Work with databases using functions from dbplyr and DBI: Connect to a database with dbConnect. List tables in the database with dbListTables. Create a reference to a database table with tbl. Bring data from a database into R using collect. Use write_csv to save a data frame to a .csv file. (Optional) Obtain data from the web using scraping and application programming interfaces (APIs): Read HTML source code from a URL using the rvest package. Read data from the NASA “Astronomy Picture of the Day” API using the httr2 package. Compare downloading tabular data from a plain text file (e.g., .csv), accessing data from an API, and scraping the HTML source code from a website. 
2.3 Absolute and relative file paths This chapter will discuss the different functions we can use to import data into R, but before we can talk about how we read the data into R with these functions, we first need to talk about where the data lives. When you load a data set into R, you first need to tell R where those files live. The file could live on your computer (local) or somewhere on the internet (remote). The place where the file lives on your computer is referred to as its “path”. You can think of the path as directions to the file. There are two kinds of paths: relative paths and absolute paths. A relative path indicates where the file is with respect to your working directory (i.e., “where you are currently”) on the computer. On the other hand, an absolute path indicates where the file is with respect to the computer’s filesystem base (or root) folder, regardless of where you are working. Suppose our computer’s filesystem looks like the picture in Figure 2.1. We are working in a file titled project3.ipynb, and our current working directory is project3; typically, as is the case here, the working directory is the directory containing the file you are currently working on. Figure 2.1: Example file system. Let’s say we wanted to open the happiness_report.csv file. We have two options to indicate where the file is: using a relative path, or using an absolute path. The absolute path of the file always starts with a slash /—representing the root folder on the computer—and proceeds by listing out the sequence of folders you would have to enter to reach the file, each separated by another slash /. So in this case, happiness_report.csv would be reached by starting at the root, and entering the home folder, then the dsci-100 folder, then the project3 folder, and then finally the data folder. So its absolute path would be /home/dsci-100/project3/data/happiness_report.csv. We can load the file using its absolute path as a string passed to the read_csv function. 
happy_data <- read_csv("/home/dsci-100/project3/data/happiness_report.csv") If we instead wanted to use a relative path, we would need to list out the sequence of steps needed to get from our current working directory to the file, with slashes / separating each step. Since we are currently in the project3 folder, we just need to enter the data folder to reach our desired file. Hence the relative path is data/happiness_report.csv, and we can load the file using its relative path as a string passed to read_csv. happy_data <- read_csv("data/happiness_report.csv") Note that there is no forward slash at the beginning of a relative path; if we accidentally typed "/data/happiness_report.csv", R would look for a folder named data in the root folder of the computer, but that doesn’t exist! Aside from specifying places to go in a path using folder names (like data and project3), we can also specify two additional special places: the current directory and the previous directory. We indicate the current working directory with a single dot (.), and the previous directory with two dots (..). So for instance, if we wanted to reach the bike_share.csv file from the project3 folder, we could use the relative path ../project2/bike_share.csv. We can even combine these two; for example, we could reach the bike_share.csv file using the (very silly) path ../project2/../project2/./bike_share.csv with quite a few redundant directions: it says to go back a folder, then open project2, then go back a folder again, then open project2 again, then stay in the current directory, then finally get to bike_share.csv. Whew, what a long trip! So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths. Using a relative path helps ensure that your code can be run on a different computer (and as an added bonus, relative paths are often shorter and easier to type!).
This is because a file’s relative path is often the same across different computers, while a file’s absolute path (the names of all of the folders between the computer’s root, represented by /, and the file) isn’t usually the same across different computers. For example, suppose Fatima and Jayden are working on a project together on the happiness_report.csv data. Fatima’s file is stored at /home/Fatima/project3/data/happiness_report.csv, while Jayden’s is stored at /home/Jayden/project3/data/happiness_report.csv. Even though Fatima and Jayden stored their files in the same place on their computers (in their home folders), the absolute paths are different due to their different usernames. If Jayden has code that loads the happiness_report.csv data using an absolute path, the code won’t work on Fatima’s computer. But the relative path from inside the project3 folder (data/happiness_report.csv) is the same on both computers; any code that uses relative paths will work on both! In the additional resources section, we include a link to a short video on the difference between absolute and relative paths. You can also check out the here package, which provides methods for finding and constructing file paths in R. Beyond files stored on your computer (i.e., locally), we also need a way to locate resources stored elsewhere on the internet (i.e., remotely). For this purpose we use a Uniform Resource Locator (URL), i.e., a web address that looks something like https://datasciencebook.ca/. URLs indicate the location of a resource on the internet, and start with a web domain, followed by a forward slash /, and then a path to where the resource is located on the remote machine. 2.4 Reading tabular data from a plain text file into R 2.4.1 read_csv to read in comma-separated values files Now that we have learned about where data could be, we will learn about how to import data into R using various functions. 
Specifically, we will learn how to read tabular data from a plain text file (a document containing only text) into R and write tabular data to a file out of R. The function we use to do this depends on the file’s format. For example, in the last chapter, we learned about using the tidyverse read_csv function when reading .csv (comma-separated values) files. In that case, the separator or delimiter that divided our columns was a comma (,). We only learned the case where the data matched the expected defaults of the read_csv function (column names are present, and commas are used as the delimiter between columns). In this section, we will learn how to read files that do not satisfy the default expectations of read_csv. Before we jump into the cases where the data aren’t in the expected default format for tidyverse and read_csv, let’s revisit the more straightforward case where the defaults hold, and the only argument we need to give to the function is the path to the file, data/can_lang.csv. The can_lang data set contains language data from the 2016 Canadian census. We put data/ before the file’s name when we are loading the data set because this data set is located in a sub-folder, named data, relative to where we are running our R code. Here is what the text in the file data/can_lang.csv looks like. 
category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,44 Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 Aboriginal languages,Algonquin,1260,370,40,2480 Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 And here is a review of how we can use read_csv to load it into R. First we load the tidyverse package to gain access to useful functions for reading the data. library(tidyverse) Next we use read_csv to load the data into R, and in that call we specify the relative path to the file. Note that it is normal and expected that a message is printed out after using the read_csv and related functions. This message lets you know the data types of each of the columns that R inferred while reading the data into R. In the future when we use this and related functions to load data in this book, we will silence these messages to help with the readability of the book. canlang_data <- read_csv("data/can_lang.csv") ## Rows: 214 Columns: 6 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (2): category, language ## dbl (4): mother_tongue, most_at_home, most_at_work, lang_known ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. 
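The message printed above itself points to the `show_col_types = FALSE` argument as a way to quiet it. As a minimal sketch of that option, the code below writes a tiny stand-in file and reads it back without the column specification message. The temporary file and the name `quiet_data` are made up for illustration; the two rows are borrowed from the tables shown earlier, and this is not the real can_lang.csv.

```r
library(readr)  # read_csv comes from readr, which library(tidyverse) also loads

# A tiny stand-in for data/can_lang.csv, written to a temporary file
# (hypothetical example file, not the real data set)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("language,mother_tongue",
             "Algonquin,1260",
             "Beaver,190"),
           tmp_csv)

# show_col_types = FALSE suppresses the column specification message
quiet_data <- read_csv(tmp_csv, show_col_types = FALSE)
quiet_data
```

The data frame is read exactly as before; only the message about the inferred column types is left out.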
Finally, to view the first 10 rows of the data frame, we must call it: canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.2 Skipping rows when reading in data Oftentimes, information about how data was collected, or other relevant information, is included at the top of the data file. This information is usually written in sentence and paragraph form, with no delimiter because it is not organized into columns. An example of this is shown below. This information gives the data scientist useful context about the data; however, it is not well formatted or intended to be read into a data frame cell along with the tabular data that follows later in the file. Data source: https://ttimbers.github.io/canlang/ Data originally published in: Statistics Canada Census of Population 2016. Reproduced and distributed on an as-is basis with their permission.
category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,44 Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 Aboriginal languages,Algonquin,1260,370,40,2480 Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 With this extra information being present at the top of the file, using read_csv as we did previously does not allow us to correctly load the data into R. In the case of this file we end up only reading in one column of the data set. In contrast to the normal and expected messages above, this time R prints out a warning for us indicating that there might be a problem with how our data is being read in. canlang_data <- read_csv("data/can_lang_meta-data.csv") ## Warning: One or more parsing issues, call `problems()` on your data frame for details, ## e.g.: ## dat <- vroom(...) ## problems(dat) canlang_data ## # A tibble: 217 × 1 ## `Data source: https://ttimbers.github.io/canlang/` ## <chr> ## 1 "Data originally published in: Statistics Canada Census of Population 2016." ## 2 "Reproduced and distributed on an as-is basis with their permission." 
## 3 "category,language,mother_tongue,most_at_home,most_at_work,lang_known" ## 4 "Aboriginal languages,\"Aboriginal languages, n.o.s.\",590,235,30,665" ## 5 "Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415" ## 6 "Non-Official & Non-Aboriginal languages,\"Afro-Asiatic languages, n.i.e.\",… ## 7 "Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150" ## 8 "Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930" ## 9 "Aboriginal languages,\"Algonquian languages, n.i.e.\",45,10,0,120" ## 10 "Aboriginal languages,Algonquin,1260,370,40,2480" ## # ℹ 207 more rows To successfully read data like this into R, the skip argument can be useful to tell R how many lines to skip before it should start reading in the data. In the example above, we would set this value to 3. canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3) canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows How did we know to skip three lines? We looked at the data! The first three lines of the data had information we didn’t need to import: Data source: https://ttimbers.github.io/canlang/ Data originally published in: Statistics Canada Census of Population 2016. Reproduced and distributed on an as-is basis with their permission. The column names began at line 4, so we skipped the first three lines.
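One way to "look at the data" without leaving R is to print the first few raw lines of the file before parsing it. The sketch below uses the readr function read_lines for that, on a small stand-in file. The temporary file, its single data row, and the names `tmp_meta` and `peek_data` are assumptions made for illustration, not the real can_lang_meta-data.csv.

```r
library(readr)  # read_lines and read_csv both come from readr

# Hypothetical miniature of a file with three lines of metadata on top
tmp_meta <- tempfile(fileext = ".csv")
writeLines(c("Data source: https://ttimbers.github.io/canlang/",
             "Data originally published in: Statistics Canada Census of Population 2016.",
             "Reproduced and distributed on an as-is basis with their permission.",
             "category,language,mother_tongue",
             "Aboriginal languages,Algonquin,1260"),
           tmp_meta)

# Print the first four lines as plain text, without parsing them as a table
read_lines(tmp_meta, n_max = 4)

# The column names sit on line 4, so skip the three lines before it
# (show_col_types = FALSE only quiets the column specification message)
peek_data <- read_csv(tmp_meta, skip = 3, show_col_types = FALSE)
peek_data
```

Counting the lines printed by read_lines before the header row gives the value to pass to skip.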
2.4.3 read_tsv to read in tab-separated values files Another common way data is stored is with tabs as the delimiter. Notice the data file, can_lang.tsv, has tabs in between the columns instead of commas. category language mother_tongue most_at_home most_at_work lang_kno Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930 Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120 Aboriginal languages Algonquin 1260 370 40 2480 Non-Official & Non-Aboriginal languages American Sign Language 2685 3020 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 We can use the read_tsv function to read in .tsv (tab separated values) files. canlang_data <- read_tsv("data/can_lang.tsv") canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows If you compare the data frame here to the data frame we obtained in Section 2.4.1 using read_csv, you’ll notice that they look identical: they have the same number of columns and rows, the same column names, and the same entries! 
So even though we needed to use a different function depending on the file format, our resulting data frame (canlang_data) in both cases was the same. 2.4.4 read_delim as a more flexible method to get tabular data into R The read_csv and read_tsv functions are actually just special cases of the more general read_delim function. We can use read_delim to import both comma and tab-separated values files, and more; we just have to specify the delimiter. For example, the can_lang_no_names.tsv file contains a different version of this same data set with no column names and uses tabs as the delimiter instead of commas. Here is how the file would look in a plain text editor: Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930 Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120 Aboriginal languages Algonquin 1260 370 40 2480 Non-Official & Non-Aboriginal languages American Sign Language 2685 3020 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 Non-Official & Non-Aboriginal languages Arabic 419890 223535 5585 629055 To read this into R using the read_delim function, we specify the path to the file as the first argument, provide the tab character "\t" as the delim argument, and set the col_names argument to FALSE to denote that there are no column names provided in the data. Note that the read_csv, read_tsv, and read_delim functions all have a col_names argument with the default value TRUE. Note: \t is an example of an escaped character, which always starts with a backslash (\). Escaped characters are used to represent non-printing characters (like the tab) or those with special meanings (such as quotation marks).
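As a small aside on the note about escaped characters, the sketch below checks in base R that the two keystrokes of a tab escape stand for a single character, and that doubling the backslash produces something different. The variable name `tab` and the printed header text are made up for illustration.

```r
# A backslash followed by t is typed with two keystrokes,
# but R stores it as one tab character
tab <- "\t"
nchar(tab)

# cat() prints the string as text, so the tab shows up as spacing
cat("language\tmother_tongue\n")

# Doubling the backslash escapes the backslash itself, giving two characters:
# a literal backslash followed by the letter t (not a tab)
nchar("\\t")
```

This is why the delim argument for a tab-separated file is written with a single backslash.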
canlang_data <- read_delim("data/can_lang_no_names.tsv", delim = "\t", col_names = FALSE) canlang_data ## # A tibble: 214 × 6 ## X1 X2 X3 X4 X5 X6 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal languages Aborigina… 590 235 30 665 ## 2 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415 ## 3 Non-Official & Non-Aboriginal languages Afro-Asia… 1150 445 10 2775 ## 4 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150 ## 5 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930 ## 6 Aboriginal languages Algonquia… 45 10 0 120 ## 7 Aboriginal languages Algonquin 1260 370 40 2480 ## 8 Non-Official & Non-Aboriginal languages American … 2685 3020 1145 21930 ## 9 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 ## 10 Non-Official & Non-Aboriginal languages Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows Data frames in R need to have column names. Thus if you read in data without column names, R will assign names automatically. In this example, R assigns the column names X1, X2, X3, X4, X5, X6. It is best to rename your columns manually in this scenario. The current column names (X1, X2, etc.) are not very descriptive and will make your analysis confusing. To rename your columns, you can use the rename function from the dplyr R package (Wickham, François, et al. 2021) (one of the packages loaded with tidyverse, so we don’t need to load it separately). The first argument is the data set, and in the subsequent arguments you write new_name = old_name for the selected variables to rename. We rename the X1, X2, ..., X6 columns in the canlang_data data frame to more descriptive names below.
canlang_data <- rename(canlang_data, category = X1, language = X2, mother_tongue = X3, most_at_home = X4, most_at_work = X5, lang_known = X6) canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.5 Reading tabular data directly from a URL We can also use read_csv, read_tsv, or read_delim (and related functions) to read in data directly from a Uniform Resource Locator (URL) that contains tabular data. Here, we provide the URL of a remote file to read_*, instead of a path to a local file on our computer. We need to surround the URL with quotes similar to when we specify a path on our local computer. All other arguments that we use are the same as when using these functions with a local file on our computer. 
url <- "https://raw.githubusercontent.com/UBC-DSCI/data/main/can_lang.csv" canlang_data <- read_csv(url) canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.6 Downloading data from a URL Occasionally the data available at a URL is not formatted nicely enough to use read_csv, read_tsv, read_delim, or other related functions to read the data directly into R. In situations where it is necessary to download a file to our local computer prior to working with it in R, we can use the download.file function. The first argument is the URL, and the second is a path where we would like to store the downloaded file. 
download.file(url, "data/can_lang.csv") canlang_data <- read_csv("data/can_lang.csv") canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows 2.4.7 Previewing a data file before reading it into R In many of the examples above, we gave you previews of the data file before we read it into R. Previewing data is essential to see whether or not there are column names, what the delimiters are, and if there are lines you need to skip. You should do this yourself when trying to read in data files: open the file in whichever text editor you prefer to inspect its contents prior to reading it into R. 2.5 Reading tabular data from a Microsoft Excel file There are many other ways to store tabular data sets beyond plain text files, and similarly, many ways to load those data sets into R. For example, it is very common to encounter, and need to load into R, data stored as a Microsoft Excel spreadsheet (with the file name extension .xlsx). To be able to do this, a key thing to know is that even though .csv and .xlsx files look almost identical when loaded into Excel, the data themselves are stored completely differently. While .csv files are plain text files, where the characters you see when you open the file in a text editor are exactly the data they represent, this is not the case for .xlsx files. 
Take a look at a snippet of what a .xlsx file would look like in a text editor: ,?'O _rels/.rels???J1??>E?{7? <?V????w8?'J???'QrJ???Tf?d??d?o?wZ'???@>?4'?|??hlIo??F t 8f??3wn ????t??u"/ %~Ed2??<?w?? ?Pd(??J-?E???7?'t(?-GZ?????y???c~N?g[^_r?4 yG?O ?K??G? ]TUEe??O??c[???????6q??s??d?m???\\???H?^????3} ?rZY? ?:L60?^?????XTP+?|? X?a??4VT?,D?Jq This type of file representation allows Excel files to store additional things that you cannot store in a .csv file, such as fonts, text formatting, graphics, multiple sheets and more. And despite looking odd in a plain text editor, we can read Excel spreadsheets into R using the readxl package developed specifically for this purpose. library(readxl) canlang_data <- read_excel("data/can_lang.xlsx") canlang_data ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows If the .xlsx file has multiple sheets, you have to use the sheet argument to specify the sheet number or name. You can also specify cell ranges using the range argument. This functionality is useful when a single sheet contains multiple tables (a sad thing that happens to many Excel spreadsheets since this makes reading in data more difficult). As with plain text files, you should always explore the data file before importing it into R. Exploring the data beforehand helps you decide which arguments you need to load the data into R successfully. 
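As a small sketch of the sheet and range arguments, the code below reads part of one sheet from a workbook with several sheets. Because the contents of data/can_lang.xlsx beyond its first sheet are not described here, we assume instead the sample workbook datasets.xlsx that ships with the readxl package (located with readxl_example); the sheet name "mtcars" and the cell range "A1:C5" are illustrative choices, not part of the Canadian languages example.

```r
library(readxl)

# path to a sample workbook bundled with readxl; it stands in for your own file
path <- readxl_example("datasets.xlsx")

# list the names of the sheets in the workbook
sheet_names <- excel_sheets(path)
sheet_names

# read one sheet by name, restricted to the cells A1 through C5
# (one header row plus four rows of data, three columns wide)
mtcars_part <- read_excel(path, sheet = "mtcars", range = "A1:C5")
mtcars_part
```

Listing the sheets first with excel_sheets is a convenient way to explore a workbook when you cannot open it in a spreadsheet program.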
If you do not have the Excel program on your computer, you can use other programs to preview the file. Examples include Google Sheets and Libre Office. In Table 2.1 we summarize the read_* functions we covered in this chapter. We also include the read_csv2 function for data separated by semicolons ;, which you may run into with data sets where the decimal is represented by a comma instead of a period (as with some data sets from European countries).
Table 2.1: Summary of read_* functions
Data File Type                   R Function   R Package
Comma (,) separated files        read_csv     readr
Tab (\t) separated files         read_tsv     readr
Semicolon (;) separated files    read_csv2    readr
Various formats (.csv, .tsv)     read_delim   readr
Excel files (.xlsx)              read_excel   readxl
Note: readr is a part of the tidyverse package so we did not need to load this package separately since we loaded tidyverse. 2.6 Reading data from a database Another very common form of data storage is the relational database. Databases are great when you have large data sets or multiple users working on a project. There are many relational database management systems, such as SQLite, MySQL, PostgreSQL, Oracle, and many more. These different relational database management systems each have their own advantages and limitations. Almost all employ SQL (structured query language) to obtain data from the database. But you don’t need to know SQL to analyze data from a database; several packages have been written that allow you to connect to relational databases and use the R programming language to obtain data. In this book, we will give examples of how to do this using R with SQLite and PostgreSQL databases. 2.6.1 Reading data from a SQLite database SQLite is probably the simplest relational database system that one can use in combination with R. SQLite databases are self-contained, and are usually stored and accessed locally on one computer from a file with a .db extension (or sometimes an .sqlite extension).
Similar to Excel files, these are not plain text files and cannot be read in a plain text editor. The first thing you need to do to read data into R from a database is to connect to the database. We do that using the dbConnect function from the DBI (database interface) package. This does not read in the data, but simply tells R where the database is and opens up a communication channel that R can use to send SQL commands to the database. library(DBI) canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db") Often relational databases have many tables; thus, in order to retrieve data from a database, you need to know the name of the table in which the data is stored. You can get the names of all the tables in the database using the dbListTables function: tables <- dbListTables(canlang_conn) tables ## [1] "lang" The dbListTables function returned only one name, which tells us that there is only one table in this database. To reference a table in the database (so that we can perform operations like selecting columns and filtering rows), we use the tbl function from the dbplyr package. The object returned by the tbl function allows us to work with data stored in databases as if they were just regular data frames; but secretly, behind the scenes, dbplyr is turning your function calls (e.g., select and filter) into SQL queries! library(dbplyr) lang_db <- tbl(canlang_conn, "lang") lang_db ## # Source: table<lang> [?? 
x 6] ## # Database: sqlite 3.41.2 [/home/rstudio/introduction-to-datascience/data/can_lang.db] ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ more rows Although it looks like we just got a data frame from the database, we didn’t! It’s a reference; the data is still stored only in the SQLite database. The dbplyr package works this way because databases are often more efficient at selecting, filtering and joining large data sets than R. And typically the database will not even be stored on your computer, but rather a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell it to using the collect function. Figure 2.2 highlights the difference between a tibble object in R and the output we just created. Notice in the table on the right, the first two lines of the output indicate the source is SQL. The last line doesn’t show how many rows there are (R is trying to avoid performing expensive query operations), whereas the output for the tibble object does. Figure 2.2: Comparison of a reference to data in a database and a tibble in R. We can look at the SQL commands that are sent to the database when we write tbl(canlang_conn, "lang") in R with the show_query function from the dbplyr package. show_query(tbl(canlang_conn, "lang")) ## <SQL> ## SELECT * ## FROM `lang` The output above shows the SQL code that is sent to the database.
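To see the same translation on a query with more than one step, here is a minimal sketch that does not need the can_lang.db file. It assumes the RSQLite package is installed and builds a tiny in-memory SQLite database; the table toy and its columns (category, language, speakers) are made up purely for illustration. The remote_query function from dbplyr returns the SQL that show_query prints, so we can store it as text.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# an in-memory SQLite database holding one small, made-up table
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  category = c("A", "B", "A"),
  language = c("x", "y", "z"),
  speakers = c(10, 20, 30)
)
dbWriteTable(conn, "toy", toy)

# a reference to the table, then a two-step query; nothing is read into R yet
toy_db <- tbl(conn, "toy")
query <- select(filter(toy_db, category == "A"), language, speakers)

# print the SQL that dbplyr would send to the database
show_query(query)

# keep the same SQL as a character string
sql_text <- as.character(remote_query(query))

dbDisconnect(conn)
```

The filter step shows up as a WHERE clause and the select step as the list of columns after SELECT, which is a useful check that the database, not R, will be doing that work.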
When we write tbl(canlang_conn, "lang") in R, in the background, the function is translating the R code into SQL, sending that SQL to the database, and then translating the response for us. So dbplyr does all the hard work of translating from R to SQL and back for us; we can just stick with R! With our lang_db table reference for the 2016 Canadian Census data in hand, we can mostly continue onward as if it were a regular data frame. For example, let’s do the same exercise from Chapter 1: we will obtain only those rows corresponding to Aboriginal languages, and keep only the language and mother_tongue columns. We can use the filter function to obtain only certain rows. Below we filter the data to include only Aboriginal languages. aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages") aboriginal_lang_db ## # Source: SQL [?? x 6] ## # Database: sqlite 3.41.2 [/home/rstudio/introduction-to-datascience/data/can_lang.db] ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Aboriginal langu… Algonqu… 45 10 0 120 ## 3 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 4 Aboriginal langu… Athabas… 50 10 0 85 ## 5 Aboriginal langu… Atikame… 6150 5465 1100 6645 ## 6 Aboriginal langu… Babine … 110 20 10 210 ## 7 Aboriginal langu… Beaver 190 50 0 340 ## 8 Aboriginal langu… Blackfo… 2815 1110 85 5645 ## 9 Aboriginal langu… Carrier 1025 250 15 2100 ## 10 Aboriginal langu… Cayuga 45 10 10 125 ## # ℹ more rows Above you can again see the hints that this data is not actually stored in R yet: the source is SQL [?? x 6] and the output says ... more rows at the end (both indicating that R does not know how many rows there are in total!), and a database type sqlite is listed. We didn’t use the collect function because we are not ready to bring the data into R yet.
We can still use the database to do some work to obtain only the small amount of data we want to work with locally in R. Let’s add the second part of our database query: selecting only the language and mother_tongue columns using the select function. aboriginal_lang_selected_db <- select(aboriginal_lang_db, language, mother_tongue) aboriginal_lang_selected_db ## # Source: SQL [?? x 2] ## # Database: sqlite 3.41.2 [/home/rstudio/introduction-to-datascience/data/can_lang.db] ## language mother_tongue ## <chr> <dbl> ## 1 Aboriginal languages, n.o.s. 590 ## 2 Algonquian languages, n.i.e. 45 ## 3 Algonquin 1260 ## 4 Athabaskan languages, n.i.e. 50 ## 5 Atikamekw 6150 ## 6 Babine (Wetsuwet'en) 110 ## 7 Beaver 190 ## 8 Blackfoot 2815 ## 9 Carrier 1025 ## 10 Cayuga 45 ## # ℹ more rows Now you can see that the database will return only the two columns we asked for with the select function. In order to actually retrieve this data in R as a data frame, we use the collect function. Below you will see that after running collect, R knows that the retrieved data has 67 rows, and there is no database listed any more. aboriginal_lang_data <- collect(aboriginal_lang_selected_db) aboriginal_lang_data ## # A tibble: 67 × 2 ## language mother_tongue ## <chr> <dbl> ## 1 Aboriginal languages, n.o.s. 590 ## 2 Algonquian languages, n.i.e. 45 ## 3 Algonquin 1260 ## 4 Athabaskan languages, n.i.e. 50 ## 5 Atikamekw 6150 ## 6 Babine (Wetsuwet'en) 110 ## 7 Beaver 190 ## 8 Blackfoot 2815 ## 9 Carrier 1025 ## 10 Cayuga 45 ## # ℹ 57 more rows Aside from knowing the number of rows, the data looks pretty similar in both outputs shown above. And dbplyr provides many more functions (not just filter) that you can use to directly feed the database reference (lang_db) into downstream analysis functions (e.g., ggplot2 for data visualization). But dbplyr does not provide every function that we need for analysis; we do eventually need to call collect. 
For example, look what happens when we try to use nrow to count rows in a data frame: nrow(aboriginal_lang_selected_db) ## [1] NA or tail to preview the last six rows of a data frame: tail(aboriginal_lang_selected_db) ## Error: tail() is not supported by sql sources Additionally, some operations will not work to extract columns or single values from the reference given by the tbl function. Thus, once you have finished your data wrangling of the tbl database reference object, it is advisable to bring it into R as a data frame using collect. But be very careful using collect: databases are often very big, and reading an entire table into R might take a long time to run or even possibly crash your machine. So make sure you use filter and select on the database table to reduce the data to a reasonable size before using collect to read it into R! 2.6.2 Reading data from a PostgreSQL database PostgreSQL (also called Postgres) is a very popular and open-source option for relational database software. Unlike SQLite, PostgreSQL uses a client–server database engine, as it was designed to be used and accessed on a network. This means that you have to provide more information to R when connecting to Postgres databases. The additional information that you need to include when you call the dbConnect function is listed below: dbname: the name of the database (a single PostgreSQL instance can host more than one database) host: the URL pointing to where the database is located port: the communication endpoint between R and the PostgreSQL database (usually 5432) user: the username for accessing the database password: the password for accessing the database Additionally, we must use the RPostgres package instead of RSQLite in the dbConnect function call. Below we demonstrate how to connect to a version of the can_mov_db database, which contains information about Canadian movies. 
Note that the host (fakeserver.stat.ubc.ca), user (user0001), and password (abc123) below are not real; you will not actually be able to connect to a database using this information. library(RPostgres) canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db", host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123") After opening the connection, everything looks and behaves almost identically to when we were using an SQLite database in R. For example, we can again use dbListTables to find out what tables are in the can_mov_db database: dbListTables(canmov_conn) [1] "themes" "medium" "titles" "title_aliases" "forms" [6] "episodes" "names" "names_occupations" "occupation" "ratings" We see that there are 10 tables in this database. Let’s first look at the "ratings" table to find the lowest rating that exists in the can_mov_db database: ratings_db <- tbl(canmov_conn, "ratings") ratings_db # Source: table<ratings> [?? x 3] # Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db] title average_rating num_votes <chr> <dbl> <int> 1 The Grand Seduction 6.6 150 2 Rhymes for Young Ghouls 6.3 1685 3 Mommy 7.5 1060 4 Incendies 6.1 1101 5 Bon Cop, Bad Cop 7.0 894 6 Goon 5.5 1111 7 Monsieur Lazhar 5.6 610 8 What if 5.3 1401 9 The Barbarian Invations 5.8 99 10 Away from Her 6.9 2311 # … with more rows To find the lowest rating that exists in the database, we first need to extract the average_rating column using select: avg_rating_db <- select(ratings_db, average_rating) avg_rating_db # Source: lazy query [?? x 1] # Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db] average_rating <dbl> 1 6.6 2 6.3 3 7.5 4 6.1 5 7.0 6 5.5 7 5.6 8 5.3 9 5.8 10 6.9 # … with more rows Next we use min to find the minimum rating in that column: min(avg_rating_db) Error in min(avg_rating_db) : invalid 'type' (list) of argument Instead of the minimum, we get an error!
This is another example of when we need to use the collect function to bring the data into R for further computation: avg_rating_data <- collect(avg_rating_db) min(avg_rating_data) [1] 1 We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie… 2.6.3 Why should we bother with databases at all? Opening a database involved a lot more effort than just opening a .csv, .tsv, or any of the other plain text or Excel formats. We had to open a connection to the database, then use dbplyr to translate tidyverse-like commands (filter, select etc.) into SQL commands that the database understands, and then finally collect the results. And not all tidyverse commands can currently be translated to work with databases. For example, we can compute a mean with a database but can’t easily compute a median. So you might be wondering: why should we use databases at all? Databases are beneficial in a large-scale setting: They enable storing large data sets across multiple computers with backups. They provide mechanisms for ensuring data integrity and validating input. They provide security and data access control. They allow multiple users to access data simultaneously and remotely without conflicts and errors. For example, there were billions of Google searches conducted daily in 2021 (Real Time Statistics Project 2021). Can you imagine if Google stored all of the data from those searches in a single .csv file!? Chaos would ensue! 2.7 Writing data from R to a .csv file At the middle and end of a data analysis, we often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file to share it with others or use it for another step in the analysis. The most straightforward way to do this is to use the write_csv function from the tidyverse package. By default, this function uses a comma (,) as the delimiter and includes the column names.
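To see those defaults in action on something small, here is a minimal sketch that writes a made-up two-row data frame to a temporary file and reads it back. The data frame toy, its columns, and the temporary path are illustrative assumptions and not part of the Canadian languages data.

```r
library(readr)

# a tiny, made-up data frame used only for illustration
toy <- data.frame(language = c("lang_a", "lang_b"), count = c(10, 20))

# write it to a temporary .csv file using the default arguments
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# inspect the raw text: a header row, then one comma-separated line per row
raw_lines <- readLines(path)
raw_lines

# read the file back in; col_types = cols() lets readr guess the column
# types without printing its column specification message
toy_again <- read_csv(path, col_types = cols())
toy_again
```

Looking at the raw lines of the file you just wrote is a quick way to confirm that the delimiter and header are what the next step of your analysis expects.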
Below we demonstrate creating a new version of the Canadian languages data set without the official languages category according to the Canadian 2016 Census, and then writing this to a .csv file: no_official_lang_data <- filter(can_lang, category != "Official languages") write_csv(no_official_lang_data, "data/no_official_languages.csv") 2.8 Obtaining data from the web Note: This section is not required reading for the remainder of the textbook. It is included for those readers interested in learning a little bit more about how to obtain different types of data from the web. Data doesn’t just magically appear on your computer; you need to get it from somewhere. Earlier in the chapter we showed you how to access data stored in a plain text, spreadsheet-like format (e.g., comma- or tab-separated) from a web URL using one of the read_* functions from the tidyverse. But as time goes on, it is increasingly uncommon to find data (especially large amounts of data) in this format available for download from a URL. Instead, websites now often offer something known as an application programming interface (API), which provides a programmatic way to ask for subsets of a data set. This allows the website owner to control who has access to the data, what portion of the data they have access to, and how much data they can access. Typically, the website owner will give you a token or key (a secret string of characters somewhat like a password) that you have to provide when accessing the API. Another interesting thought: websites themselves are data! When you type a URL into your browser window, your browser asks the web server (another computer on the internet whose job it is to respond to requests for the website) to give it the website’s data, and then your browser translates that data into something you can see. If the website shows you some information that you’re interested in, you could create a data set for yourself by copying and pasting that information into a file. 
This process of taking information directly from what a website displays is called web scraping (or sometimes screen scraping). Now, of course, copying and pasting information manually is a painstaking and error-prone process, especially when there is a lot of information to gather. So instead of asking your browser to translate the information that the web server provides into something you can see, you can collect that data programmatically, in the form of hypertext markup language (HTML) and cascading style sheet (CSS) code, and process it to extract useful information. HTML provides the basic structure of a site and tells the webpage how to display the content (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the content and tells the webpage how the HTML elements should be presented (e.g., colors, layouts, fonts etc.). This subsection will show you the basics of both web scraping with the rvest R package (Wickham 2021a) and accessing the NASA “Astronomy Picture of the Day” API using the httr2 R package (Wickham 2023). 2.8.1 Web scraping HTML and CSS selectors When you enter a URL into your browser, your browser connects to the web server at that URL and asks for the source code for the website. This is the data that the browser translates into something you can see; so if we are going to create our own data by scraping a website, we have to first understand what that data looks like! For example, let’s say we are interested in knowing the average rental price (per square foot) of the most recently available one-bedroom apartments in Vancouver on Craigslist. When we visit the Vancouver Craigslist website and search for one-bedroom apartments, we should see something similar to Figure 2.3. Figure 2.3: Craigslist webpage of advertisements for one-bedroom apartments. Based on what our browser shows us, it’s pretty easy to find the size and price for each apartment listed.
But we would like to be able to obtain that information using R, without any manual human effort or copying and pasting. We do this by examining the source code that the web server actually sent our browser to display for us. We show a snippet of it below; the entire source is included with the code for this book: <span class="result-meta"> <span class="result-price">$800</span> <span class="housing"> 1br - </span> <span class="result-hood"> (13768 108th Avenue)</span> <span class="result-tags"> <span class="maptag" data-pid="6786042973">map</span> </span> <span class="banish icon icon-trash" role="button"> <span class="screen-reader-text">hide this posting</span> </span> <span class="unbanish icon icon-trash red" role="button"></span> <a href="#" class="restore-link"> <span class="restore-narrow-text">restore</span> <span class="restore-wide-text">restore this posting</span> </a> <span class="result-price">$2285</span> </span> Oof…you can tell that the source code for a web page is not really designed for humans to understand easily. However, if you look through it closely, you will find that the information we’re interested in is hidden among the muck. For example, near the top of the snippet above you can see a line that looks like <span class="result-price">$800</span> That snippet is definitely storing the price of a particular apartment. With some more investigation, you should be able to find things like the date and time of the listing, the address of the listing, and more. So this source code most likely contains all the information we are interested in! Let’s dig into that line above a bit more. You can see that that bit of code has an opening tag (words between < and >, like <span>) and a closing tag (the same with a slash, like </span>). HTML source code generally stores its data between opening and closing tags like these. Tags are keywords that tell the web browser how to display or format the content. 
Above you can see that the information we want ($800) is stored between an opening and closing tag (<span> and </span>). In the opening tag, you can also see a very useful “class” (a special word that is sometimes included with opening tags): class="result-price". Since we want R to programmatically sort through all of the source code for the website to find apartment prices, maybe we can look for all the tags with the "result-price" class, and grab the information between the opening and closing tag. Indeed, take a look at another line of the source snippet above: <span class="result-price">$2285</span> It’s yet another price for an apartment listing, and the tags surrounding it have the "result-price" class. Wonderful! Now that we know what pattern we are looking for (a dollar amount between opening and closing tags that have the "result-price" class), we should be able to use code to pull out all of the matching patterns from the source code to obtain our data. This sort of “pattern” is known as a CSS selector (where CSS stands for cascading style sheet). The above was a simple example of “finding the pattern to look for”; many websites are quite a bit larger and more complex, and so is their website source code. Fortunately, there are tools available to make this process easier. For example, SelectorGadget is an open-source tool that simplifies generating and finding CSS selectors. At the end of the chapter in the additional resources section, we include a link to a short video on how to install and use the SelectorGadget tool to obtain CSS selectors for use in web scraping. After installing and enabling the tool, you can click the website element for which you want an appropriate selector. For example, if we click the price of an apartment listing, we find that SelectorGadget shows us the selector .result-price in its toolbar, and highlights all the other apartment prices that would be obtained using that selector (Figure 2.4).
Figure 2.4: Using the SelectorGadget on a Craigslist webpage to obtain the CSS selector useful for obtaining apartment prices. If we then click the size of an apartment listing, SelectorGadget shows us the span selector, and highlights many of the lines on the page; this indicates that the span selector is not specific enough to capture only apartment sizes (Figure 2.5). Figure 2.5: Using the SelectorGadget on a Craigslist webpage to obtain a CSS selector useful for obtaining apartment sizes. To narrow the selector, we can click one of the highlighted elements that we do not want. For example, we can deselect the “pic/map” links, resulting in only the data we want highlighted using the .housing selector (Figure 2.6). Figure 2.6: Using the SelectorGadget on a Craigslist webpage to refine the CSS selector to one that is most useful for obtaining apartment sizes. So to scrape information about the square footage and rental price of apartment listings, we need to use the two CSS selectors .housing and .result-price, respectively. SelectorGadget returns them to us as a comma-separated list (here .housing , .result-price), which is exactly the format we need to provide to R if we are using more than one CSS selector. Caution: are you allowed to scrape that website? Before scraping data from the web, you should always check whether or not you are allowed to scrape it! There are two documents that are important for this: the robots.txt file and the Terms of Service document. If we take a look at Craigslist’s Terms of Service document, we find the following text: “You agree not to copy/collect CL content via robots, spiders, scripts, scrapers, crawlers, or any automated or manual equivalent (e.g., by hand).” So unfortunately, without explicit permission, we are not allowed to scrape the website. What to do now? Well, we could ask the owner of Craigslist for permission to scrape.
However, we are not likely to get a response, and even if we did they would not likely give us permission. The more realistic answer is that we simply cannot scrape Craigslist. If we still want to find data about rental prices in Vancouver, we must go elsewhere. To continue learning how to scrape data from the web, let’s instead scrape data on the population of Canadian cities from Wikipedia. We have checked the Terms of Service document, and it does not mention that web scraping is disallowed. We will use the SelectorGadget tool to pick elements that we are interested in (city names and population counts) and deselect others to indicate that we are not interested in them (province names), as shown in Figure 2.7. Figure 2.7: Using the SelectorGadget on a Wikipedia webpage. We include a link to a short video tutorial on this process at the end of the chapter in the additional resources section. SelectorGadget provides in its toolbar the following list of CSS selectors to use: td:nth-child(8) , td:nth-child(4) , .largestCities-cell-background+ td a Now that we have the CSS selectors that describe the properties of the elements that we want to target, we can use them to find certain elements in web pages and extract data. Using rvest We will use the rvest R package to scrape data from the Wikipedia page. We start by loading the rvest package: library(rvest) Next, we tell R what page we want to scrape by providing the webpage’s URL in quotations to the function read_html: page <- read_html("https://en.wikipedia.org/wiki/Canada") The read_html function directly downloads the source code for the page at the URL you specify, just like your browser would if you navigated to that site. But instead of displaying the website to you, the read_html function just returns the HTML source code itself, which we have stored in the page variable. Next, we send the page object to the html_nodes function, along with the CSS selectors we obtained from the SelectorGadget tool. 
Make sure to surround the selectors with quotation marks; the html_nodes function expects that argument to be a string. We store the result of the html_nodes function in the population_nodes variable. Note that below we use the paste function with a comma separator (sep=\",\") to build the list of selectors. The paste function converts its arguments to character strings and joins them into a single string, here separated by commas. We use this function to build the list of selectors to maintain code readability; this avoids having a very long line of code. selectors <- paste("td:nth-child(8)", "td:nth-child(4)", ".largestCities-cell-background+ td a", sep = ",") population_nodes <- html_nodes(page, selectors) head(population_nodes) ## {xml_nodeset (6)} ## [1] <a href="/wiki/Greater_Toronto_Area" title="Greater Toronto Area">Toronto ... ## [2] <td style="text-align:right;">6,202,225</td> ## [3] <a href="/wiki/London,_Ontario" title="London, Ontario">London</a> ## [4] <td style="text-align:right;">543,551\\n</td> ## [5] <a href="/wiki/Greater_Montreal" title="Greater Montreal">Montreal</a> ## [6] <td style="text-align:right;">4,291,732</td> Note: head is a function that is often useful for viewing only a short summary of an R object, rather than the whole thing (which may be quite a lot to look at). For example, here head shows us only the first 6 items in the population_nodes object. Note that some R objects by default print only a small summary. For example, tibble data frames only show you the first 10 rows. But not all R objects do this, and that’s where the head function helps summarize things for you. Each of the items in the population_nodes list is a node from the HTML document that matches the CSS selectors you specified. A node is an HTML tag pair (e.g., <td> and </td> which defines the cell of a table) combined with the content stored between the tags. 
For our CSS selector td:nth-child(4), an example node that would be selected would be: <td style="text-align:left;background:#f0f0f0;"> <a href="/wiki/London,_Ontario" title="London, Ontario">London</a> </td> Next we extract the meaningful data—in other words, we get rid of the HTML code syntax and tags—from the nodes using the html_text function. In the case of the example node above, html_text function returns \"London\". population_text <- html_text(population_nodes) head(population_text) ## [1] "Toronto" "6,202,225" "London" "543,551\\n" "Montreal" "4,291,732" Fantastic! We seem to have extracted the data of interest from the raw HTML source code. But we are not quite done; the data is not yet in an optimal format for data analysis. Both the city names and population are encoded as characters in a single vector, instead of being in a data frame with one character column for city and one numeric column for population (like a spreadsheet). Additionally, the populations contain commas (not useful for programmatically dealing with numbers), and some even contain a line break character at the end (\\n). In Chapter 3, we will learn more about how to wrangle data such as this into a more useful format for data analysis using R. 2.8.2 Using an API Rather than posting a data file at a URL for you to download, many websites these days provide an API that must be accessed through a programming language like R. The benefit of using an API is that data owners have much more control over the data they provide to users. However, unlike web scraping, there is no consistent way to access an API across websites. Every website typically has its own API designed especially for its own use case. Therefore we will just provide one example of accessing data through an API in this book, with the hope that it gives you enough of a basic idea that you can learn how to use another API if needed. 
In particular, in this book we will show you the basics of how to use the httr2 package in R to access data from the NASA “Astronomy Picture of the Day” API (a great source of desktop backgrounds, by the way—take a look at the stunning picture of the Rho-Ophiuchi cloud complex (NASA et al. 2023) in Figure 2.8 from July 13, 2023!). Figure 2.8: The James Webb Space Telescope’s NIRCam image of the Rho Ophiuchi molecular cloud complex. First, you will need to visit the NASA APIs page and generate an API key (i.e., a password used to identify you when accessing the API). Note that a valid email address is required to associate with the key. The signup form looks something like Figure 2.9. After filling out the basic information, you will receive the token via email. Make sure to store the key in a safe place, and keep it private. Figure 2.9: Generating the API access token for the NASA API Caution: think about your API usage carefully! When you access an API, you are initiating a transfer of data from a web server to your computer. Web servers are expensive to run and do not have infinite resources. If you try to ask for too much data at once, you can use up a huge amount of the server’s bandwidth. If you try to ask for data too frequently—e.g., if you make many requests to the server in quick succession—you can also bog the server down and make it unable to talk to anyone else. Most servers have mechanisms to revoke your access if you are not careful, but you should try to prevent issues from happening in the first place by being extra careful with how you write and run your code. You should also keep in mind that when a website owner grants you API access, they also usually specify a limit (or quota) of how much data you can ask for. Be careful not to overrun your quota! So before we try to use the API, we will first visit the NASA website to see what limits we should abide by when using the API. These limits are outlined in Figure 2.10. 
Figure 2.10: The NASA website specifies an hourly limit of 1,000 requests. After checking the NASA website, it seems like we can send at most 1,000 requests per hour. That should be more than enough for our purposes in this section. Accessing the NASA API The NASA API is what is known as an HTTP API: this is a particularly common kind of API, where you can obtain data simply by accessing a particular URL as if it were a regular website. To make a query to the NASA API, we need to specify three things. First, we specify the URL endpoint of the API, which is simply a URL that helps the remote server understand which API you are trying to access. NASA offers a variety of APIs, each with its own endpoint; in the case of the NASA “Astronomy Picture of the Day” API, the URL endpoint is https://api.nasa.gov/planetary/apod. Second, we write ?, which denotes that a list of query parameters will follow. And finally, we specify a list of query parameters of the form parameter=value, separated by & characters. The NASA “Astronomy Picture of the Day” API accepts the parameters shown in Figure 2.11. Figure 2.11: The set of parameters that you can specify when querying the NASA “Astronomy Picture of the Day” API, along with syntax, default settings, and a description of each. So for example, to obtain the image of the day from July 13, 2023, the API query would have two parameters: api_key=YOUR_API_KEY and date=2023-07-13. Remember to replace YOUR_API_KEY with the API key you received from NASA in your email! 
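The three parts of a query described above (endpoint, ?, and parameter=value pairs joined by &) can also be assembled in R rather than typed by hand. The following is only a minimal sketch using base R; the variable names endpoint, params, query, and query_url are chosen here for illustration and are not part of any package, and YOUR_API_KEY remains a placeholder for your own key.

```r
# Sketch: assemble an HTTP API query URL from its three parts (base R only).
# All variable names below are illustrative assumptions.

# 1. The URL endpoint of the "Astronomy Picture of the Day" API
endpoint <- "https://api.nasa.gov/planetary/apod"

# 2. The query parameters as a named character vector (name = value)
params <- c(api_key = "YOUR_API_KEY", date = "2023-07-13")

# 3. Join each name and value with "=", then join the pairs with "&"
query <- paste(names(params), params, sep = "=", collapse = "&")

# Put the endpoint, the "?" marker, and the parameter list together
query_url <- paste0(endpoint, "?", query)
query_url
## [1] "https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13"
```

Keeping the parameters in a named vector makes it easy to change a single value (for example, the date) without retyping the whole URL. Note that this sketch does not escape special characters in parameter values, so it is only suitable for simple values like the ones shown.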
Putting it all together, the query will look like the following: https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13 If you try putting this URL into your web browser, you’ll actually find that the server responds to your request with some text: {"date":"2023-07-13","explanation":"A mere 390 light-years away, Sun-like stars and future planetary systems are forming in the Rho Ophiuchi molecular cloud complex, the closest star-forming region to our fair planet. The James Webb Space Telescope's NIRCam peered into the nearby natal chaos to capture this infrared image at an inspiring scale. The spectacular cosmic snapshot was released to celebrate the successful first year of Webb's exploration of the Universe. The frame spans less than a light-year across the Rho Ophiuchi region and contains about 50 young stars. Brighter stars clearly sport Webb's characteristic pattern of diffraction spikes. Huge jets of shocked molecular hydrogen blasting from newborn stars are red in the image, with the large, yellowish dusty cavity carved out by the energetic young star near its center. Near some stars in the stunning image are shadows cast by their protoplanetary disks.","hdurl":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph.png", "media_type":"image","service_version":"v1","title":"Webb's Rho Ophiuchi","url":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph1024.png"} Neat! There is definitely some data there, but it’s a bit hard to see what it all is. As it turns out, this is a common format for data called JSON (JavaScript Object Notation). We won’t encounter this kind of data much in this book, but for now you can interpret this data as key : value pairs separated by commas. For example, if you look closely, you’ll see that the first entry is \"date\":\"2023-07-13\", which indicates that we indeed successfully received data corresponding to July 13, 2023. So now our job is to do all of this programmatically in R. 
We will load the httr2 package, and construct the query using the request function, which takes a single URL argument; you will recognize the same query URL that we pasted into the browser earlier. We will then send the query using the req_perform function, and finally obtain a JSON representation of the response using the resp_body_json function. library(httr2) req <- request("https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13") resp <- req_perform(req) nasa_data_single <- resp_body_json(resp) nasa_data_single ## $date ## [1] "2023-07-13" ## ## $explanation ## [1] "A mere 390 light-years away, Sun-like stars and future planetary systems are forming in the Rho Ophiuchi molecular cloud complex, the closest star-forming region to our fair planet. The James Webb Space Telescope's NIRCam peered into the nearby natal chaos to capture this infrared image at an inspiring scale. The spectacular cosmic snapshot was released to celebrate the successful first year of Webb's exploration of the Universe. The frame spans less than a light-year across the Rho Ophiuchi region and contains about 50 young stars. Brighter stars clearly sport Webb's characteristic pattern of diffraction spikes. Huge jets of shocked molecular hydrogen blasting from newborn stars are red in the image, with the large, yellowish dusty cavity carved out by the energetic young star near its center. Near some stars in the stunning image are shadows cast by their protoplanetary disks." ## ## $hdurl ## [1] "https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph.png" ## ## $media_type ## [1] "image" ## ## $service_version ## [1] "v1" ## ## $title ## [1] "Webb's Rho Ophiuchi" ## ## $url ## [1] "https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph1024.png" We can obtain more records at once by using the start_date and end_date parameters, as shown in the table of parameters in Figure 2.11. 
Let’s obtain all the records between May 1, 2023, and July 13, 2023, and store the result in an object called nasa_data; now the response will take the form of an R list (you’ll learn more about these in Chapter 3). Each item in the list will correspond to a single day’s record (just like the nasa_data_single object), and there will be 74 items total, one for each day between the start and end dates: req <- request("https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&start_date=2023-05-01&end_date=2023-07-13") resp <- req_perform(req) nasa_data <- resp_body_json(resp) length(nasa_data) ## [1] 74 For further data processing using the techniques in this book, you’ll need to turn this list of items into a data frame. Here we will extract the date, title, copyright, and url variables from the JSON data, and construct a data frame using the extracted information. Note: Understanding this code is not required for the remainder of the textbook. It is included for those readers who would like to parse JSON data into a data frame in their own data analyses. nasa_df_all <- tibble(bind_rows(lapply(nasa_data, as.data.frame.list))) nasa_df <- select(nasa_df_all, date, title, copyright, url) nasa_df ## # A tibble: 74 × 4 ## date title copyright url ## <chr> <chr> <chr> <chr> ## 1 2023-05-01 Carina Nebula North "\\nCarlos Tayl… http… ## 2 2023-05-02 Flat Rock Hills on Mars "\\nNASA, \\nJPL… http… ## 3 2023-05-03 Centaurus A: A Peculiar Island of Stars "\\nMarco Loren… http… ## 4 2023-05-04 The Galaxy, the Jet, and a Famous Black Hole <NA> http… ## 5 2023-05-05 Shackleton from ShadowCam <NA> http… ## 6 2023-05-06 Twilight in a Flower "Dario Giannob… http… ## 7 2023-05-07 The Helix Nebula from CFHT <NA> http… ## 8 2023-05-08 The Spanish Dancer Spiral Galaxy <NA> http… ## 9 2023-05-09 Shadows of Earth "\\nMarcella Gi… http… ## 10 2023-05-10 Milky Way over Egyptian Desert "\\nAmr Abdulwa… http… ## # ℹ 64 more rows Success—we have created a small data set using the NASA API! 
This data is also quite different from what we obtained from web scraping; the extracted information is readily available in a JSON format, as opposed to raw HTML code (although not every API will provide data in such a nice format). From this point onward, the nasa_df data frame is stored on your machine, and you can play with it to your heart’s content. For example, you can use write_csv to save it to a file and read_csv to read it into R again later; and after reading the next few chapters you will have the skills to do even more interesting things! If you decide that you want to ask any of the various NASA APIs for more data (see the list of awesome NASA APIs here for more examples of what is possible), just be mindful as usual about how much data you are requesting and how frequently you are making requests. 2.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Reading in data locally and from the web” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 2.10 Additional resources The readr documentation provides the documentation for many of the reading functions we cover in this chapter. It is where you should look if you want to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data wrangling functions from this chapter. 
Sometimes you might run into data in such poor shape that none of the reading functions we cover in this chapter work. In that case, you can consult the data import chapter from R for Data Science (Wickham and Grolemund 2016), which goes into a lot more detail about how R parses text from files into data frames. The here R package (Müller 2020) provides a way for you to construct or find your files’ paths. The readxl documentation provides more details on reading data from Excel, such as reading in data with multiple sheets, or specifying the cells to read in. The rio R package (Leeper 2021) provides an alternative set of tools for reading and writing data in R. It aims to be a “Swiss army knife” for data reading/writing/converting, and supports a wide variety of data types (including data formats generated by other statistical software like SPSS and SAS). A video from the Udacity course Linux Command Line Basics provides a good explanation of absolute versus relative paths. If you read the subsection on obtaining data from the web via scraping and APIs, we provide two companion tutorial video links for how to use the SelectorGadget tool to obtain desired CSS selectors for: extracting the data for apartment listings on Craigslist, and extracting Canadian city names and populations from Wikipedia. The polite R package (Perepolkin 2021) provides a set of tools for responsibly scraping data from websites. 
References "],["wrangling.html", "Chapter 3 Cleaning and wrangling data 3.1 Overview 3.2 Chapter learning objectives 3.3 Data frames, vectors, and lists 3.4 Tidy data 3.5 Using select to extract a range of columns 3.6 Using filter to extract rows 3.7 Using mutate to modify or add columns 3.8 Combining functions using the pipe operator, |> 3.9 Aggregating data with summarize and map 3.10 Apply functions across many columns with mutate and across 3.11 Apply functions across columns within one row with rowwise and mutate 3.12 Summary 3.13 Exercises 3.14 Additional resources", " Chapter 3 Cleaning and wrangling data 3.1 Overview This chapter is centered around defining tidy data—a data format that is suitable for analysis—and the tools needed to transform raw data into this format. This will be presented in the context of a real-world data science application, providing more practice working through a whole case study. 3.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Define the term “tidy data”. Discuss the advantages of storing data in a tidy data format. Define what vectors, lists, and data frames are in R, and describe how they relate to each other. Describe the common types of data in R and their uses. Use the following functions for their intended data wrangling tasks: c pivot_longer pivot_wider separate select filter mutate summarize map group_by across rowwise Use the following operators for their intended data wrangling tasks: ==, !=, <, <=, >, and >= %in% !, &, and | |> and %>% 3.3 Data frames, vectors, and lists In Chapters 1 and 2, data frames were the focus: we learned how to import data into R as a data frame, and perform basic operations on data frames in R. In the remainder of this book, this pattern continues. The vast majority of tools we use will require that data are represented as a data frame in R. 
Therefore, in this section, we will dig more deeply into what data frames are and how they are represented in R. This knowledge will be helpful in effectively utilizing these objects in our data analyses. 3.3.1 What is a data frame? A data frame is a table-like structure for storing data in R. Data frames are important to learn about because most data that you will encounter in practice can be naturally stored as a table. In order to define data frames precisely, we need to introduce a few technical terms: variable: a characteristic, number, or quantity that can be measured. observation: all of the measurements for a given entity. value: a single measurement of a single variable for a given entity. Given these definitions, a data frame is a tabular data structure in R that is designed to store observations, variables, and their values. Most commonly, each column in a data frame corresponds to a variable, and each row corresponds to an observation. For example, Figure 3.1 displays a data set of city populations. Here, the variables are “region, year, population”; each of these are properties that can be collected or measured. The first observation is “Toronto, 2016, 2235145”; these are the values that the three variables take for the first entity in the data set. There are 13 entities in the data set in total, corresponding to the 13 rows in Figure 3.1. Figure 3.1: A data frame storing data regarding the population of various regions in Canada. In this example data frame, the row that corresponds to the observation for the city of Vancouver is colored yellow, and the column that corresponds to the population variable is colored blue. R stores the columns of a data frame as either lists or vectors. For example, the data frame in Figure 3.2 has three vectors whose names are region, year and population. The next two sections will explain what lists and vectors are. Figure 3.2: Data frame with three vectors. 3.3.2 What is a vector? 
In R, vectors are objects that can contain one or more elements. The vector elements are ordered, and they must all be of the same data type; R has several different basic data types, as shown in Table 3.1. Figure 3.3 provides an example of a vector where all of the elements are of character type. You can create vectors in R using the c function (c stands for “concatenate”). For example, to create the vector region as shown in Figure 3.3, you would write: region <- c("Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa") region ## [1] "Toronto" "Montreal" "Vancouver" "Calgary" "Ottawa" Note: Technically, these objects are called “atomic vectors.” In this book we have chosen to call them “vectors,” which is how they are most commonly referred to in the R community. To be totally precise, “vector” is an umbrella term that encompasses both atomic vector and list objects in R. But this creates a confusing situation where the term “vector” could mean “atomic vector” or “the umbrella term for atomic vector and list,” depending on context. Very confusing indeed! So to keep things simple, in this book we always use the term “vector” to refer to “atomic vector.” We encourage readers who are enthusiastic to learn more to read the Vectors chapter of Advanced R (Wickham 2019). Figure 3.3: Example of a vector whose type is character. Table 3.1: Basic data types in R Data type Abbreviation Description Example character chr letters or numbers surrounded by quotes “1” , “Hello world!” double dbl numbers with decimal values 1.2333 integer int numbers that do not contain decimals 1L, 20L (where “L” tells R to store as an integer) logical lgl either true or false TRUE, FALSE factor fct used to represent data with a limited number of values (usually categories) a color variable with levels red, green and orange It is important in R to make sure you represent your data with the correct type. 
Many of the tidyverse functions we use in this book treat the various data types differently. You should use integers and double types (which both fall under the “numeric” umbrella type) to represent numbers and perform arithmetic. Doubles are more common than integers in R, though; for instance, a double data type is the default when you create a vector of numbers using c(), and when you read in whole numbers via read_csv. Characters are used to represent data that should be thought of as “text”, such as words, names, paths, URLs, and more. Factors help us encode variables that represent categories; a factor variable takes one of a discrete set of values known as levels (one for each category). The levels can be ordered or unordered. Even though factors can sometimes look like characters, they are not used to represent text, words, names, and paths in the way that characters are; in fact, R internally stores factors using integers! There are other basic data types in R, such as raw and complex, but we do not use these in this textbook. 3.3.3 What is a list? Lists are also objects in R that have multiple, ordered elements. Vectors and lists differ by the requirement of element type consistency. All elements within a single vector must be of the same type (e.g., all elements are characters), whereas elements within a single list can be of different types (e.g., characters, integers, logicals, and even other lists). See Figure 3.4. Figure 3.4: A vector versus a list. 3.3.4 What does this have to do with data frames? A data frame is really a special kind of list that follows two rules: Each element itself must either be a vector or a list. Each element (vector or list) must have the same length. Not all columns in a data frame need to be of the same type. Figure 3.5 shows a data frame where the columns are vectors of different types. But remember: because the columns in this example are vectors, the elements must be the same data type within each column. 
On the other hand, if our data frame had list columns, there would be no such requirement. It is generally much more common to use vector columns, though, as the values for a single variable are usually all of the same type. Figure 3.5: Data frame and vector types. Data frames are actually included in R itself, without the need for any additional packages. However, the tidyverse functions that we use throughout this book all work with a special kind of data frame called a tibble. Tibbles have some additional features and benefits over built-in data frames in R. These include the ability to add useful attributes (such as grouping, which we will discuss later) and more predictable type preservation when subsetting. Because a tibble is just a data frame with some added features, we will collectively refer to both built-in R data frames and tibbles as data frames in this book. Note: You can use the function class on a data object to assess whether a data frame is a built-in R data frame or a tibble. If the data object is a data frame, class will return \"data.frame\". If the data object is a tibble it will return \"tbl_df\" \"tbl\" \"data.frame\". You can easily convert built-in R data frames to tibbles using the tidyverse as_tibble function. For example we can check the class of the Canadian languages data set, can_lang, we worked with in the previous chapters and we see it is a tibble. class(can_lang) ## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame" Vectors, data frames and lists are basic types of data structure in R, which are core to most data analyses. We summarize them in Table 3.2. There are several other data structures in the R programming language (e.g., matrices), but these are beyond the scope of this book. Table 3.2: Basic data structures in R Data Structure Description vector An ordered collection of one, or more, values of the same data type. list An ordered collection of one, or more, values of possibly different data types. 
data frame A list of either vectors or lists of the same length, with column names. We typically use a data frame to represent a data set. 3.4 Tidy data There are many ways a tabular data set can be organized. This chapter will focus on introducing the tidy data format of organization and how to make your raw (and likely messy) data tidy. A tidy data frame satisfies the following three criteria (Wickham 2014): each row is a single observation, each column is a single variable, and each value is a single cell (i.e., its entry in the data frame is not shared with another value). Figure 3.6 demonstrates a tidy data set that satisfies these three criteria. Figure 3.6: Tidy data satisfies three criteria. There are many good reasons for making sure your data are tidy as a first step in your analysis. The most important is that it is a single, consistent format that nearly every function in the tidyverse recognizes. No matter what the variables and observations in your data represent, as long as the data frame is tidy, you can manipulate it, plot it, and analyze it using the same tools. If your data is not tidy, you will have to write special bespoke code in your analysis that will not only be error-prone, but hard for others to understand. Beyond making your analysis more accessible to others and less error-prone, tidy data is also typically easy for humans to interpret. Given these benefits, it is well worth spending the time to get your data into a tidy format upfront. Fortunately, there are many well-designed tidyverse data cleaning/wrangling tools to help you easily tidy your data. Let’s explore them below! Note: Is there only one shape for tidy data for a given data set? Not necessarily! It depends on the statistical question you are asking and what the variables are for that question. For tidy data, each variable should be its own column. 
So, just as it’s essential to match your statistical question with the appropriate data analysis tool, it’s important to match your statistical question with the appropriate variables and ensure they are represented as individual columns to make the data tidy. 3.4.1 Tidying up: going from wide to long using pivot_longer One task that is commonly performed to get data into a tidy format is to combine values that are stored in separate columns, but are really part of the same variable, into one. Data is often stored this way because this format is sometimes more intuitive for human readability and understanding, and humans create data sets. In Figure 3.7, the table on the left is in an untidy, “wide” format because the year values (2006, 2011, 2016) are stored as column names. And as a consequence, the values for population for the various cities over these years are also split across several columns. For humans, this table is easy to read, which is why you will often find data stored in this wide format. However, this format is difficult to work with when performing data visualization or statistical analysis using R. For example, if we wanted to find the latest year it would be challenging because the year values are stored as column names instead of as values in a single column. So before we could apply a function to find the latest year (for example, by using max), we would have to first extract the column names to get them as a vector and then apply a function to extract the latest year. The problem only gets worse if you would like to find the value for the population for a given region for the latest year. Both of these tasks are greatly simplified once the data is tidied. Another problem with data in this format is that we don’t know what the numbers under each year actually represent. Do those numbers represent population size? Land area? It’s not clear. 
To solve both of these problems, we can reshape this data set to a tidy data format by creating a column called “year” and a column called “population.” This transformation—which makes the data “longer”—is shown as the right table in Figure 3.7. Figure 3.7: Pivoting data from a wide to long data format. We can achieve this effect in R using the pivot_longer function from the tidyverse package. The pivot_longer function combines columns, and is usually used during tidying data when we need to make the data frame longer and narrower. To learn how to use pivot_longer, we will work through an example with the region_lang_top5_cities_wide.csv data set. This data set contains the counts of how many Canadians cited each language as their mother tongue for five major Canadian cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton) from the 2016 Canadian census. To get started, we will load the tidyverse package and use read_csv to load the (untidy) data. library(tidyverse) lang_wide <- read_csv("data/region_lang_top5_cities_wide.csv") lang_wide ## # A tibble: 214 × 7 ## category language Toronto Montréal Vancouver Calgary Edmonton ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal languages Aborigi… 80 30 70 20 25 ## 2 Non-Official & Non-Abor… Afrikaa… 985 90 1435 960 575 ## 3 Non-Official & Non-Abor… Afro-As… 360 240 45 45 65 ## 4 Non-Official & Non-Abor… Akan (T… 8485 1015 400 705 885 ## 5 Non-Official & Non-Abor… Albanian 13260 2450 1090 1365 770 ## 6 Aboriginal languages Algonqu… 5 5 0 0 0 ## 7 Aboriginal languages Algonqu… 5 30 5 5 0 ## 8 Non-Official & Non-Abor… America… 470 50 265 100 180 ## 9 Non-Official & Non-Abor… Amharic 7460 665 1140 4075 2515 ## 10 Non-Official & Non-Abor… Arabic 85175 151955 14320 18965 17525 ## # ℹ 204 more rows What is wrong with the untidy format above? The table on the left in Figure 3.8 represents the data in the “wide” (messy) format. 
From a data analysis perspective, this format is not ideal because the values of the variable region (Toronto, Montréal, Vancouver, Calgary, and Edmonton) are stored as column names. Thus, they are not easily accessible to the data analysis functions we will apply to our data set. Additionally, the values of the mother tongue variable are spread across multiple columns, which will prevent us from doing any desired visualization or statistical tasks until we combine them into one column. For instance, suppose we want to know which language has the highest number of Canadians reporting it as their mother tongue among all five regions. This question would be tough to answer with the data in its current format: it is possible, but it would be much easier if we tidy our data first. If mother tongue were instead stored as one column, as shown in the tidy data on the right in Figure 3.8, we could simply use the max function in one line of code to get the maximum value. Figure 3.8: Going from wide to long with the pivot_longer function. Figure 3.9 details the arguments that we need to specify in the pivot_longer function to accomplish this data transformation. Figure 3.9: Syntax for the pivot_longer function. We use pivot_longer to combine the Toronto, Montréal, Vancouver, Calgary, and Edmonton columns into a single column called region, and create a column called mother_tongue that contains the count of how many Canadians report each language as their mother tongue for each metropolitan area.
We use a colon : between Toronto and Edmonton to tell R to select all the columns between Toronto and Edmonton: lang_mother_tidy <- pivot_longer(lang_wide, cols = Toronto:Edmonton, names_to = "region", values_to = "mother_tongue" ) lang_mother_tidy ## # A tibble: 1,070 × 4 ## category language region mother_tongue ## <chr> <chr> <chr> <dbl> ## 1 Aboriginal languages Aboriginal lang… Toron… 80 ## 2 Aboriginal languages Aboriginal lang… Montr… 30 ## 3 Aboriginal languages Aboriginal lang… Vanco… 70 ## 4 Aboriginal languages Aboriginal lang… Calga… 20 ## 5 Aboriginal languages Aboriginal lang… Edmon… 25 ## 6 Non-Official & Non-Aboriginal languages Afrikaans Toron… 985 ## 7 Non-Official & Non-Aboriginal languages Afrikaans Montr… 90 ## 8 Non-Official & Non-Aboriginal languages Afrikaans Vanco… 1435 ## 9 Non-Official & Non-Aboriginal languages Afrikaans Calga… 960 ## 10 Non-Official & Non-Aboriginal languages Afrikaans Edmon… 575 ## # ℹ 1,060 more rows Note: In the code above, the call to the pivot_longer function is split across several lines. This is allowed in certain cases; for example, when calling a function as above, as long as the line ends with a comma , R knows to keep reading on the next line. Splitting long lines like this across multiple lines is encouraged as it helps significantly with code readability. Generally speaking, you should limit each line of code to about 80 characters. The data above is now tidy because all three criteria for tidy data have now been met: All the variables (category, language, region and mother_tongue) are now their own columns in the data frame. Each observation, (i.e., each language in a region) is in a single row. Each value is a single cell, i.e., its row, column position in the data frame is not shared with another value. 3.4.2 Tidying up: going from long to wide using pivot_wider Suppose we have observations spread across multiple rows rather than in a single row. 
For example, in Figure 3.10, the table on the left is in an untidy, long format because the count column contains three variables (population, commuter count, and year the city was incorporated) and information about each observation (here, population, commuter, and incorporated values for a region) is split across three rows. Remember: one of the criteria for tidy data is that each observation must be in a single row. Using data in this format—where two or more variables are mixed together in a single column—makes it harder to apply many usual tidyverse functions. For example, finding the maximum number of commuters would require an additional step of filtering for the commuter values before the maximum can be computed. In comparison, if the data were tidy, all we would have to do is compute the maximum value for the commuter column. To reshape this untidy data set to a tidy (and in this case, wider) format, we need to create columns called “population”, “commuters”, and “incorporated.” This is illustrated in the right table of Figure 3.10. Figure 3.10: Going from long to wide data. To tidy this type of data in R, we can use the pivot_wider function. The pivot_wider function generally increases the number of columns (widens) and decreases the number of rows in a data set. To learn how to use pivot_wider, we will work through an example with the region_lang_top5_cities_long.csv data set. This data set contains the number of Canadians reporting the primary language at home and work for five major cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton). lang_long <- read_csv("data/region_lang_top5_cities_long.csv") lang_long ## # A tibble: 2,140 × 5 ## region category language type count ## <chr> <chr> <chr> <chr> <dbl> ## 1 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_home 15 ## 2 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 3 Toronto Aboriginal languages Aboriginal languages, n.o.s. 
most_at_home 50 ## 4 Toronto Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 5 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_home 5 ## 6 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 7 Edmonton Aboriginal languages Aboriginal languages, n.o.s. most_at_home 10 ## 8 Edmonton Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## 9 Vancouver Aboriginal languages Aboriginal languages, n.o.s. most_at_home 15 ## 10 Vancouver Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0 ## # ℹ 2,130 more rows What makes the data set shown above untidy? In this example, each observation is a language in a region. However, each observation is split across multiple rows: one where the count for most_at_home is recorded, and the other where the count for most_at_work is recorded. Suppose the goal with this data was to visualize the relationship between the number of Canadians reporting their primary language at home and work. Doing that would be difficult with this data in its current form, since these two variables are stored in the same column. Figure 3.11 shows how this data will be tidied using the pivot_wider function. Figure 3.11: Going from long to wide with the pivot_wider function. Figure 3.12 details the arguments that we need to specify in the pivot_wider function. Figure 3.12: Syntax for the pivot_wider function. We will apply the function as detailed in Figure 3.12. 
lang_home_tidy <- pivot_wider(lang_long, names_from = type, values_from = count ) lang_home_tidy ## # A tibble: 1,070 × 5 ## region category language most_at_home most_at_work ## <chr> <chr> <chr> <dbl> <dbl> ## 1 Montréal Aboriginal languages Aborigi… 15 0 ## 2 Toronto Aboriginal languages Aborigi… 50 0 ## 3 Calgary Aboriginal languages Aborigi… 5 0 ## 4 Edmonton Aboriginal languages Aborigi… 10 0 ## 5 Vancouver Aboriginal languages Aborigi… 15 0 ## 6 Montréal Non-Official & Non-Aboriginal l… Afrikaa… 10 0 ## 7 Toronto Non-Official & Non-Aboriginal l… Afrikaa… 265 0 ## 8 Calgary Non-Official & Non-Aboriginal l… Afrikaa… 505 15 ## 9 Edmonton Non-Official & Non-Aboriginal l… Afrikaa… 300 0 ## 10 Vancouver Non-Official & Non-Aboriginal l… Afrikaa… 520 10 ## # ℹ 1,060 more rows The data above is now tidy! We can go through the three criteria again to check that this data is a tidy data set. All the statistical variables are their own columns in the data frame (i.e., most_at_home, and most_at_work have been separated into their own columns in the data frame). Each observation, (i.e., each language in a region) is in a single row. Each value is a single cell (i.e., its row, column position in the data frame is not shared with another value). You might notice that we have the same number of columns in the tidy data set as we did in the messy one. Therefore pivot_wider didn’t really “widen” the data, as the name suggests. This is just because the original type column only had two categories in it. If it had more than two, pivot_wider would have created more columns, and we would see the data set “widen.” 3.4.3 Tidying up: using separate to deal with multiple delimiters Data are also not considered tidy when multiple values are stored in the same cell. 
The data set we show below is even messier than the ones we dealt with above: the Toronto, Montréal, Vancouver, Calgary, and Edmonton columns each contain two counts in a single cell, separated by the delimiter (/): the number of Canadians reporting each language as their primary language at home, and the number reporting it as their primary language at work. The column names are the values of a variable, and each value does not have its own cell! To turn this messy data into tidy data, we’ll have to fix these issues. lang_messy <- read_csv("data/region_lang_top5_cities_messy.csv") lang_messy ## # A tibble: 214 × 7 ## category language Toronto Montréal Vancouver Calgary Edmonton ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Aboriginal languages Aborigi… 50/0 15/0 15/0 5/0 10/0 ## 2 Non-Official & Non-Abor… Afrikaa… 265/0 10/0 520/10 505/15 300/0 ## 3 Non-Official & Non-Abor… Afro-As… 185/10 65/0 10/0 15/0 20/0 ## 4 Non-Official & Non-Abor… Akan (T… 4045/20 440/0 125/10 330/0 445/0 ## 5 Non-Official & Non-Abor… Albanian 6380/2… 1445/20 530/10 620/25 370/10 ## 6 Aboriginal languages Algonqu… 5/0 0/0 0/0 0/0 0/0 ## 7 Aboriginal languages Algonqu… 0/0 10/0 0/0 0/0 0/0 ## 8 Non-Official & Non-Abor… America… 720/245 70/0 300/140 85/25 190/85 ## 9 Non-Official & Non-Abor… Amharic 3820/55 315/0 540/10 2730/50 1695/35 ## 10 Non-Official & Non-Abor… Arabic 45025/… 72980/1… 8680/275 11010/… 10590/3… ## # ℹ 204 more rows First we’ll use pivot_longer to create two columns, region and value, similar to what we did previously. The new region column will contain the region names, and the new column value will be a temporary holding place for the data that we need to further separate, i.e., the number of Canadians reporting their primary language at home and work.
lang_messy_longer <- pivot_longer(lang_messy, cols = Toronto:Edmonton, names_to = "region", values_to = "value" ) lang_messy_longer ## # A tibble: 1,070 × 4 ## category language region value ## <chr> <chr> <chr> <chr> ## 1 Aboriginal languages Aboriginal languages, n… Toron… 50/0 ## 2 Aboriginal languages Aboriginal languages, n… Montr… 15/0 ## 3 Aboriginal languages Aboriginal languages, n… Vanco… 15/0 ## 4 Aboriginal languages Aboriginal languages, n… Calga… 5/0 ## 5 Aboriginal languages Aboriginal languages, n… Edmon… 10/0 ## 6 Non-Official & Non-Aboriginal languages Afrikaans Toron… 265/0 ## 7 Non-Official & Non-Aboriginal languages Afrikaans Montr… 10/0 ## 8 Non-Official & Non-Aboriginal languages Afrikaans Vanco… 520/… ## 9 Non-Official & Non-Aboriginal languages Afrikaans Calga… 505/… ## 10 Non-Official & Non-Aboriginal languages Afrikaans Edmon… 300/0 ## # ℹ 1,060 more rows Next we’ll use separate to split the value column into two columns. One column will contain only the counts of Canadians that speak each language most at home, and the other will contain the counts of Canadians that speak each language most at work for each region. Figure 3.13 outlines what we need to specify to use separate. Figure 3.13: Syntax for the separate function. 
tidy_lang <- separate(lang_messy_longer, col = value, into = c("most_at_home", "most_at_work"), sep = "/" ) tidy_lang ## # A tibble: 1,070 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <chr> <chr> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,060 more rows Is this data set now tidy? Recall the three criteria for tidy data: each row is a single observation, each column is a single variable, and each value is a single cell. We can see that this data now satisfies all three criteria, making it easier to analyze. But we aren’t done yet! Notice in the table above that the word <chr> appears beneath each of the column names. The word under the column name indicates the data type of each column. Here all of the variables are “character” data types. Recall that character data types are letter(s) or digit(s) surrounded by quotes. In the previous example in Section 3.4.2, the most_at_home and most_at_work variables were <dbl> (double), which is a type of numeric data; you can verify this by looking at the tables in the previous sections. The change occurred because of the delimiter (/) in this messy data set: R read these columns in as character types, and by default, separate will return columns as character data types. It makes sense for region, category, and language to be stored as a character (or perhaps factor) type.
However, suppose we want to apply any functions that treat the most_at_home and most_at_work columns as a number (e.g., finding rows above a numeric threshold of a column). In that case, it won’t be possible to do if the variable is stored as a character. Fortunately, the separate function provides a natural way to fix problems like this: we can set convert = TRUE to convert the most_at_home and most_at_work columns to the correct data type. tidy_lang <- separate(lang_messy_longer, col = value, into = c("most_at_home", "most_at_work"), sep = "/", convert = TRUE ) tidy_lang ## # A tibble: 1,070 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,060 more rows Now we see <int> appears under the most_at_home and most_at_work columns, indicating they are integer data types (i.e., numbers)! 3.5 Using select to extract a range of columns Now that the tidy_lang data is indeed tidy, we can start manipulating it using the powerful suite of functions from the tidyverse. For the first example, recall the select function from Chapter 1, which lets us create a subset of columns from a data frame. Suppose we wanted to select only the columns language, region, most_at_home and most_at_work from the tidy_lang data set. 
Using what we learned in Chapter 1, we would pass the tidy_lang data frame as well as all of these column names into the select function: selected_columns <- select(tidy_lang, language, region, most_at_home, most_at_work) selected_columns ## # A tibble: 1,070 × 4 ## language region most_at_home most_at_work ## <chr> <chr> <int> <int> ## 1 Aboriginal languages, n.o.s. Toronto 50 0 ## 2 Aboriginal languages, n.o.s. Montréal 15 0 ## 3 Aboriginal languages, n.o.s. Vancouver 15 0 ## 4 Aboriginal languages, n.o.s. Calgary 5 0 ## 5 Aboriginal languages, n.o.s. Edmonton 10 0 ## 6 Afrikaans Toronto 265 0 ## 7 Afrikaans Montréal 10 0 ## 8 Afrikaans Vancouver 520 10 ## 9 Afrikaans Calgary 505 15 ## 10 Afrikaans Edmonton 300 0 ## # ℹ 1,060 more rows Here we wrote out the names of each of the columns. However, this method is time-consuming, especially if you have a lot of columns! Another approach is to use a “select helper”. Select helpers are operators that make it easier for us to select columns. For instance, we can use a select helper to choose a range of columns rather than typing each column name out. To do this, we use the colon (:) operator to denote the range. For example, to get all the columns in the tidy_lang data frame from language to most_at_work we pass language:most_at_work as the second argument to the select function. column_range <- select(tidy_lang, language:most_at_work) column_range ## # A tibble: 1,070 × 4 ## language region most_at_home most_at_work ## <chr> <chr> <int> <int> ## 1 Aboriginal languages, n.o.s. Toronto 50 0 ## 2 Aboriginal languages, n.o.s. Montréal 15 0 ## 3 Aboriginal languages, n.o.s. Vancouver 15 0 ## 4 Aboriginal languages, n.o.s. Calgary 5 0 ## 5 Aboriginal languages, n.o.s. 
Edmonton 10 0 ## 6 Afrikaans Toronto 265 0 ## 7 Afrikaans Montréal 10 0 ## 8 Afrikaans Vancouver 520 10 ## 9 Afrikaans Calgary 505 15 ## 10 Afrikaans Edmonton 300 0 ## # ℹ 1,060 more rows Notice that we get the same output as we did above, but with less (and clearer!) code. This type of operator is especially handy for large data sets. Suppose instead we wanted to extract columns that followed a particular pattern rather than just selecting a range. For example, let’s say we wanted only to select the columns most_at_home and most_at_work. There are other helpers that allow us to select variables based on their names. In particular, we can use the select helper starts_with to choose only the columns that start with the word “most”: select(tidy_lang, starts_with("most")) ## # A tibble: 1,070 × 2 ## most_at_home most_at_work ## <int> <int> ## 1 50 0 ## 2 15 0 ## 3 15 0 ## 4 5 0 ## 5 10 0 ## 6 265 0 ## 7 10 0 ## 8 520 10 ## 9 505 15 ## 10 300 0 ## # ℹ 1,060 more rows We could also have chosen the columns containing an underscore _ by adding contains("_") as the second argument in the select function, since we notice the columns we want contain underscores and the others don’t. select(tidy_lang, contains("_")) ## # A tibble: 1,070 × 2 ## most_at_home most_at_work ## <int> <int> ## 1 50 0 ## 2 15 0 ## 3 15 0 ## 4 5 0 ## 5 10 0 ## 6 265 0 ## 7 10 0 ## 8 520 10 ## 9 505 15 ## 10 300 0 ## # ℹ 1,060 more rows There are many different select helpers that select variables based on certain criteria. The additional resources section at the end of this chapter provides a comprehensive resource on select helpers. 3.6 Using filter to extract rows Next, we revisit the filter function from Chapter 1, which lets us create a subset of rows from a data frame. Recall the two main arguments to the filter function: the first is the name of the data frame object, and the second is a logical statement to use when filtering the rows.
filter works by returning the rows where the logical statement evaluates to TRUE. This section will highlight more advanced usage of the filter function. In particular, this section provides an in-depth treatment of the variety of logical statements one can use in the filter function to select subsets of rows. 3.6.1 Extracting rows that have a certain value with == Suppose we are only interested in the subset of rows in tidy_lang corresponding to the official languages of Canada (English and French). We can filter for these rows by using the equivalency operator (==) to compare the values of the category column with the value "Official languages". With these arguments, filter returns a data frame with all the columns of the input data frame but only the rows we asked for in the logical statement, i.e., those where the category column holds the value "Official languages". We name this data frame official_langs. official_langs <- filter(tidy_lang, category == "Official languages") official_langs ## # A tibble: 10 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 ## 6 Official languages French Toronto 29800 11940 ## 7 Official languages French Montréal 2669195 1607550 ## 8 Official languages French Vancouver 8630 3245 ## 9 Official languages French Calgary 8630 2140 ## 10 Official languages French Edmonton 10950 2520 3.6.2 Extracting rows that do not have a certain value with != What if we want all the other language categories in the data set except for those in the "Official languages" category? We can accomplish this with the != operator, which means “not equal to”.
So if we want to find all the rows where the category does not equal "Official languages" we write the code below. filter(tidy_lang, category != "Official languages") ## # A tibble: 1,060 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,050 more rows 3.6.3 Extracting rows satisfying multiple conditions using , or & Suppose now we want to look at only the rows for the French language in Montréal. To do this, we need to filter the data set to find rows that satisfy multiple conditions simultaneously. We can do this with the comma symbol (,), which in the case of filter is interpreted by R as “and”. We write the code as shown below to filter the official_langs data frame to subset the rows where region == "Montréal" and language == "French". filter(official_langs, region == "Montréal", language == "French") ## # A tibble: 1 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages French Montréal 2669195 1607550 We can also use the ampersand (&) logical operator, which gives us cases where both one condition and another condition are satisfied. You can use either comma (,) or ampersand (&) in the filter function interchangeably.
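As a quick check of this equivalence, the sketch below applies both forms to a small made-up data frame. The object names (toy, with_comma, with_ampersand) and the three rows are hypothetical and used only for illustration; they are not part of the census data.

```r
library(tidyverse)

# Hypothetical toy data frame with the same column names as in the text
toy <- tibble(
  region = c("Montréal", "Montréal", "Toronto"),
  language = c("French", "English", "French")
)

# Conditions separated by a comma are combined with "and"
with_comma <- filter(toy, region == "Montréal", language == "French")

# The ampersand combines the same two conditions explicitly
with_ampersand <- filter(toy, region == "Montréal" & language == "French")

# Compare the two results; all.equal reports whether they match
all.equal(with_comma, with_ampersand)
```

Both calls keep only the rows where every condition holds, so which form to use is a matter of readability.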
filter(official_langs, region == "Montréal" & language == "French") ## # A tibble: 1 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages French Montréal 2669195 1607550 3.6.4 Extracting rows satisfying at least one condition using | Suppose we were interested in only those rows corresponding to cities in Alberta in the official_langs data set (Edmonton and Calgary). We can’t use , as we did above because region cannot be both Edmonton and Calgary simultaneously. Instead, we can use the vertical pipe (|) logical operator, which gives us the cases where one condition or another condition or both are satisfied. In the code below, we ask R to return the rows where the region column is equal to “Calgary” or “Edmonton”. filter(official_langs, region == "Calgary" | region == "Edmonton") ## # A tibble: 4 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Calgary 1065070 844740 ## 2 Official languages English Edmonton 1050410 792700 ## 3 Official languages French Calgary 8630 2140 ## 4 Official languages French Edmonton 10950 2520 3.6.5 Extracting rows with values in a vector using %in% Next, suppose we want to see the populations of our five cities. Let’s read in the region_data.csv file that comes from the 2016 Canadian census, as it contains statistics for number of households, land area, population and number of dwellings for different regions. region_data <- read_csv("data/region_data.csv") region_data ## # A tibble: 35 × 5 ## region households area population dwellings ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Belleville 43002 1355. 103472 45050 ## 2 Lethbridge 45696 3047. 117394 48317 ## 3 Thunder Bay 52545 2618. 121621 57146 ## 4 Peterborough 50533 1637. 121721 55662 ## 5 Saint John 52872 3793. 126202 58398 ## 6 Brantford 52530 1086. 134203 54419 ## 7 Moncton 61769 2625. 144810 66699 ## 8 Guelph 59280 604.
151984 63324 ## 9 Trois-Rivières 72502 1053. 156042 77734 ## 10 Saguenay 72479 3079. 160980 77968 ## # ℹ 25 more rows To get the population of the five cities we can filter the data set using the %in% operator. The %in% operator is used to see if an element belongs to a vector. Here we are filtering for rows where the value in the region column matches any of the five cities we are interested in: Toronto, Montréal, Vancouver, Calgary, and Edmonton. city_names <- c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton") five_cities <- filter(region_data, region %in% city_names) five_cities ## # A tibble: 5 × 5 ## region households area population dwellings ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Edmonton 502143 9858. 1321426 537634 ## 2 Calgary 519693 5242. 1392609 544870 ## 3 Vancouver 960894 3040. 2463431 1027613 ## 4 Montréal 1727310 4638. 4098927 1823281 ## 5 Toronto 2135909 6270. 5928040 2235145 Note: What’s the difference between == and %in%? Suppose we have two vectors, vectorA and vectorB. If you type vectorA == vectorB into R it will compare the vectors element by element. R checks if the first element of vectorA equals the first element of vectorB, the second element of vectorA equals the second element of vectorB, and so on. On the other hand, vectorA %in% vectorB compares the first element of vectorA to all the elements in vectorB. Then the second element of vectorA is compared to all the elements in vectorB, and so on. Notice the difference between == and %in% in the example below. c("Vancouver", "Toronto") == c("Toronto", "Vancouver") ## [1] FALSE FALSE c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver") ## [1] TRUE TRUE 3.6.6 Extracting rows above or below a threshold using > and < We saw in Section 3.6.3 that 2,669,195 people reported speaking French in Montréal as their primary language at home.
If we are interested in finding the official languages in regions with higher numbers of people who speak it as their primary language at home compared to French in Montréal, then we can use filter to obtain rows where the value of most_at_home is greater than 2,669,195. We use the > symbol to look for values above a threshold, and the < symbol to look for values below a threshold. Similarly, the >= and <= symbols look for values equal to or above a threshold and equal to or below a threshold, respectively. filter(official_langs, most_at_home > 2669195) ## # A tibble: 1 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Toronto 3836770 3218725 filter returns a data frame with only one row, indicating that when considering the official languages, only English in Toronto is reported by more people as their primary language at home than French in Montréal according to the 2016 Canadian census. 3.7 Using mutate to modify or add columns 3.7.1 Using mutate to modify columns In Section 3.4.3, when we first read in the "region_lang_top5_cities_messy.csv" data, all of the variables were “character” data types. During the tidying process, we used the convert argument from the separate function to convert the most_at_home and most_at_work columns to the desired integer (i.e., numeric class) data types. But suppose we didn’t use the convert argument, and needed to modify the column type some other way. Below we create such a situation so that we can demonstrate how to use mutate to change the column types of a data frame. mutate is a useful function to modify or create new data frame columns.
lang_messy <- read_csv("data/region_lang_top5_cities_messy.csv") lang_messy_longer <- pivot_longer(lang_messy, cols = Toronto:Edmonton, names_to = "region", values_to = "value") tidy_lang_chr <- separate(lang_messy_longer, col = value, into = c("most_at_home", "most_at_work"), sep = "/") official_langs_chr <- filter(tidy_lang_chr, category == "Official languages") official_langs_chr ## # A tibble: 10 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <chr> <chr> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 ## 6 Official languages French Toronto 29800 11940 ## 7 Official languages French Montréal 2669195 1607550 ## 8 Official languages French Vancouver 8630 3245 ## 9 Official languages French Calgary 8630 2140 ## 10 Official languages French Edmonton 10950 2520 To use mutate, again we first specify the data set in the first argument, and in the following arguments, we specify the name of the column we want to modify or create (here most_at_home and most_at_work), an = sign, and then the function we want to apply (here as.numeric). In the function we want to apply, we refer directly to the column name upon which we want it to act (here most_at_home and most_at_work). In our example, we are naming the columns the same names as columns that already exist in the data frame (“most_at_home”, “most_at_work”) and this will cause mutate to overwrite those columns (also referred to as modifying those columns in-place). If we were to give the columns a new name, then mutate would create new columns with the names we specified. mutate’s general syntax is detailed in Figure 3.14. Figure 3.14: Syntax for the mutate function. 
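The difference between overwriting a column and creating a new one can be sketched on a small made-up data frame. The names toy, count, and count_num below are hypothetical and chosen only for illustration.

```r
library(tidyverse)

# Hypothetical toy data frame with one character column
toy <- tibble(count = c("10", "20", "30"))

# Reusing an existing column name overwrites that column
toy_overwritten <- mutate(toy, count = as.numeric(count))

# Using a new column name adds a column and leaves the original untouched
toy_added <- mutate(toy, count_num = as.numeric(count))

colnames(toy_overwritten)
colnames(toy_added)
```

Note that mutate returns a new data frame in both cases; the original toy object only changes if we assign the result back to the same name.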
Below we use mutate to convert the columns most_at_home and most_at_work to numeric data types in the official_langs_chr data set as described in Figure 3.14: official_langs_numeric <- mutate(official_langs_chr, most_at_home = as.numeric(most_at_home), most_at_work = as.numeric(most_at_work) ) official_langs_numeric ## # A tibble: 10 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <dbl> <dbl> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 ## 6 Official languages French Toronto 29800 11940 ## 7 Official languages French Montréal 2669195 1607550 ## 8 Official languages French Vancouver 8630 3245 ## 9 Official languages French Calgary 8630 2140 ## 10 Official languages French Edmonton 10950 2520 Now we see <dbl> appears under the most_at_home and most_at_work columns, indicating they are double data types (which is a numeric data type)! 3.7.2 Using mutate to create new columns We can see in the table that 3,836,770 people reported speaking English in Toronto as their primary language at home, according to the 2016 Canadian census. What does this number mean to us? To understand this number, we need context. In particular, how many people were in Toronto when this data was collected? From the 2016 Canadian census profile, the population of Toronto was reported to be 5,928,040 people. The number of people who report that English is their primary language at home is much more meaningful when we report it in this context. We can even go a step further and transform this count to a relative frequency or proportion. We can do this by dividing the number of people reporting a given language as their primary language at home by the number of people who live in Toronto.
For example, the proportion of people who reported that their primary language at home was English in the 2016 Canadian census was 0.65 in Toronto. Let’s use mutate to create a new column in our data frame that holds the proportion of people who speak English for our five cities of focus in this chapter. To accomplish this, we will need to do two tasks beforehand: Create a vector containing the population values for the cities. Filter the official_langs data frame so that we only keep the rows where the language is English. To create a vector containing the population values for the five cities (Toronto, Montréal, Vancouver, Calgary, Edmonton), we will use the c function (recall that c stands for “concatenate”): city_pops <- c(5928040, 4098927, 2463431, 1392609, 1321426) city_pops ## [1] 5928040 4098927 2463431 1392609 1321426 And next, we will filter the official_langs data frame so that we only keep the rows where the language is English. We will name the new data frame we get from this english_langs: english_langs <- filter(official_langs, language == "English") english_langs ## # A tibble: 5 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Official languages English Toronto 3836770 3218725 ## 2 Official languages English Montréal 620510 412120 ## 3 Official languages English Vancouver 1622735 1330555 ## 4 Official languages English Calgary 1065070 844740 ## 5 Official languages English Edmonton 1050410 792700 Finally, we can use mutate to create a new column, named most_at_home_proportion, that will hold the proportion of people reporting English as their primary language at home in each city. We will compute this by dividing the most_at_home column by our vector of city populations.
english_langs <- mutate(english_langs, most_at_home_proportion = most_at_home / city_pops) english_langs ## # A tibble: 5 × 6 ## category language region most_at_home most_at_work most_at_home_proport…¹ ## <chr> <chr> <chr> <int> <int> <dbl> ## 1 Official lan… English Toron… 3836770 3218725 0.647 ## 2 Official lan… English Montr… 620510 412120 0.151 ## 3 Official lan… English Vanco… 1622735 1330555 0.659 ## 4 Official lan… English Calga… 1065070 844740 0.765 ## 5 Official lan… English Edmon… 1050410 792700 0.795 ## # ℹ abbreviated name: ¹​most_at_home_proportion In the computation above, we had to ensure that we ordered the city_pops vector in the same order as the cities were listed in the english_langs data frame. This is because R performs the division by dividing each element of the most_at_home column by the corresponding element of the city_pops vector, matching them up by position. Failing to do this would have resulted in an incorrect computation. Note: In more advanced data wrangling, one might solve this problem in a less error-prone way by using a technique called “joins.” We link to resources that discuss this in the additional resources at the end of this chapter. 3.8 Combining functions using the pipe operator, |> In R, we often have to call multiple functions in a sequence to process a data frame. The basic ways of doing this can quickly become unreadable if there are many steps. For example, suppose we need to perform three operations on a data frame called data: add a new column new_col that is double another old_col, filter for rows where another column, other_col, is more than 5, and select only the new column new_col for those rows.
One way of performing these three steps is to just write multiple lines of code, storing temporary objects as you go: output_1 <- mutate(data, new_col = old_col * 2) output_2 <- filter(output_1, other_col > 5) output <- select(output_2, new_col) This is difficult to understand for multiple reasons. The reader may be tricked into thinking the named output_1 and output_2 objects are important for some reason, while they are just temporary intermediate computations. Further, the reader has to look through and find where output_1 and output_2 are used in each subsequent line. Another option for doing this would be to compose the functions: output <- select(filter(mutate(data, new_col = old_col * 2), other_col > 5), new_col) Code like this can also be difficult to understand. Functions compose (reading from left to right) in the opposite order in which they are computed by R (above, mutate happens first, then filter, then select). It is also just a really long line of code to read in one go. The pipe operator (|>) solves this problem, resulting in cleaner and easier-to-follow code. |> is built into R so you don’t need to load any packages to use it. You can think of the pipe as a physical pipe. It takes the output from the function on the left-hand side of the pipe, and passes it as the first argument to the function on the right-hand side of the pipe. The code below accomplishes the same thing as the previous two code blocks: output <- data |> mutate(new_col = old_col * 2) |> filter(other_col > 5) |> select(new_col) Note: You might also have noticed that we split the function calls across lines after the pipe, similar to when we did this earlier in the chapter for long function calls. Again, this is allowed and recommended, especially when the piped function calls create a long line of code. Doing this makes your code more readable. When you do this, it is important to end each line with the pipe operator |> to tell R that your code is continuing onto the next line. 
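To make the equivalence of the composed and piped versions concrete, here is a minimal sketch on a small data frame. The values in data are invented purely for illustration, and the names output_nested and output_piped are hypothetical. The base R pipe |> requires R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data frame, invented only for this illustration
data <- tibble(
  old_col = c(1, 2, 3, 4),
  other_col = c(3, 6, 9, 2)
)

# Composed (nested) version: read from the inside out
output_nested <- select(
  filter(mutate(data, new_col = old_col * 2), other_col > 5),
  new_col
)

# Piped version: read from top to bottom
output_piped <- data |>
  mutate(new_col = old_col * 2) |>
  filter(other_col > 5) |>
  select(new_col)

# Both approaches should produce the same one-column data frame
identical(output_nested, output_piped)
```

Only the rows where other_col is 6 and 9 pass the filter, so new_col in both results holds the values 4 and 6.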
Note: In this textbook, we will be using the base R pipe operator syntax, |>. This base R |> pipe operator was inspired by a previous version of the pipe operator, %>%. The %>% pipe operator is not built into R and is from the magrittr R package. The tidyverse metapackage imports the %>% pipe operator via dplyr (which in turn imports the magrittr R package). There are some other differences between %>% and |> related to more advanced R uses, such as sharing and distributing code as R packages, however, these are beyond the scope of this textbook. We have this note in the book to make the reader aware that %>% exists as it is still commonly used in data analysis code and in many data science books and other resources. In most cases these two pipes are interchangeable and either can be used. 3.8.1 Using |> to combine filter and select Let’s work with the tidy tidy_lang data set from Section 3.4.3, which contains the number of Canadians reporting their primary language at home and work for five major cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton): tidy_lang ## # A tibble: 1,070 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Toron… 50 0 ## 2 Aboriginal languages Aborigi… Montr… 15 0 ## 3 Aboriginal languages Aborigi… Vanco… 15 0 ## 4 Aboriginal languages Aborigi… Calga… 5 0 ## 5 Aboriginal languages Aborigi… Edmon… 10 0 ## 6 Non-Official & Non-Aboriginal lang… Afrikaa… Toron… 265 0 ## 7 Non-Official & Non-Aboriginal lang… Afrikaa… Montr… 10 0 ## 8 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 9 Non-Official & Non-Aboriginal lang… Afrikaa… Calga… 505 15 ## 10 Non-Official & Non-Aboriginal lang… Afrikaa… Edmon… 300 0 ## # ℹ 1,060 more rows Suppose we want to create a subset of the data with only the languages and counts of each language spoken most at home for the city of Vancouver. To do this, we can use the functions filter and select. 
First, we use filter to create a data frame called van_data that contains only values for Vancouver. van_data <- filter(tidy_lang, region == "Vancouver") van_data ## # A tibble: 214 × 5 ## category language region most_at_home most_at_work ## <chr> <chr> <chr> <int> <int> ## 1 Aboriginal languages Aborigi… Vanco… 15 0 ## 2 Non-Official & Non-Aboriginal lang… Afrikaa… Vanco… 520 10 ## 3 Non-Official & Non-Aboriginal lang… Afro-As… Vanco… 10 0 ## 4 Non-Official & Non-Aboriginal lang… Akan (T… Vanco… 125 10 ## 5 Non-Official & Non-Aboriginal lang… Albanian Vanco… 530 10 ## 6 Aboriginal languages Algonqu… Vanco… 0 0 ## 7 Aboriginal languages Algonqu… Vanco… 0 0 ## 8 Non-Official & Non-Aboriginal lang… America… Vanco… 300 140 ## 9 Non-Official & Non-Aboriginal lang… Amharic Vanco… 540 10 ## 10 Non-Official & Non-Aboriginal lang… Arabic Vanco… 8680 275 ## # ℹ 204 more rows We then use select on this data frame to keep only the variables we want: van_data_selected <- select(van_data, language, most_at_home) van_data_selected ## # A tibble: 214 × 2 ## language most_at_home ## <chr> <int> ## 1 Aboriginal languages, n.o.s. 15 ## 2 Afrikaans 520 ## 3 Afro-Asiatic languages, n.i.e. 10 ## 4 Akan (Twi) 125 ## 5 Albanian 530 ## 6 Algonquian languages, n.i.e. 0 ## 7 Algonquin 0 ## 8 American Sign Language 300 ## 9 Amharic 540 ## 10 Arabic 8680 ## # ℹ 204 more rows Although this is valid code, there is a more readable approach we could take by using the pipe, |>. With the pipe, we do not need to create an intermediate object to store the output from filter. Instead, we can directly send the output of filter to the input of select: van_data_selected <- filter(tidy_lang, region == "Vancouver") |> select(language, most_at_home) van_data_selected ## # A tibble: 214 × 2 ## language most_at_home ## <chr> <int> ## 1 Aboriginal languages, n.o.s. 15 ## 2 Afrikaans 520 ## 3 Afro-Asiatic languages, n.i.e. 10 ## 4 Akan (Twi) 125 ## 5 Albanian 530 ## 6 Algonquian languages, n.i.e. 
0 ## 7 Algonquin 0 ## 8 American Sign Language 300 ## 9 Amharic 540 ## 10 Arabic 8680 ## # ℹ 204 more rows But wait… Why do the select function calls look different in these two examples? Remember: when you use the pipe, the output of the first function is automatically provided as the first argument for the function that comes after it. Therefore you do not specify the first argument in that function call. In the code above, the pipe passes the left-hand side (the output of filter) to the first argument of the function on the right (select), so in the select function you only see the second argument (and beyond). As you can see, both of these approaches, with and without pipes, give us the same output, but the second approach is clearer and more readable. 3.8.2 Using |> with more than two functions The pipe operator (|>) can be used with any function in R. Additionally, we can pipe together more than two functions. For example, we can pipe together three functions to: filter rows to include only those where the counts of the language most spoken at home are greater than 10,000, select only the columns corresponding to region, language and most_at_home, and arrange the data frame rows in order by counts of the language most spoken at home from smallest to largest. As we saw in Chapter 1, we can use the tidyverse arrange function to order the rows in the data frame by the values of one or more columns. Here we pass the column name most_at_home to arrange the data frame rows by the values in that column, in ascending order.
large_region_lang <- filter(tidy_lang, most_at_home > 10000) |> select(region, language, most_at_home) |> arrange(most_at_home) large_region_lang ## # A tibble: 67 × 3 ## region language most_at_home ## <chr> <chr> <int> ## 1 Edmonton Arabic 10590 ## 2 Montréal Tamil 10670 ## 3 Vancouver Russian 10795 ## 4 Edmonton Spanish 10880 ## 5 Edmonton French 10950 ## 6 Calgary Arabic 11010 ## 7 Calgary Urdu 11060 ## 8 Vancouver Hindi 11235 ## 9 Montréal Armenian 11835 ## 10 Toronto Romanian 12200 ## # ℹ 57 more rows You will notice above that we passed tidy_lang as the first argument of the filter function. We can also pipe the data frame into the same sequence of functions rather than using it as the first argument of the first function. These two choices are equivalent, and we get the same result. large_region_lang <- tidy_lang |> filter(most_at_home > 10000) |> select(region, language, most_at_home) |> arrange(most_at_home) large_region_lang ## # A tibble: 67 × 3 ## region language most_at_home ## <chr> <chr> <int> ## 1 Edmonton Arabic 10590 ## 2 Montréal Tamil 10670 ## 3 Vancouver Russian 10795 ## 4 Edmonton Spanish 10880 ## 5 Edmonton French 10950 ## 6 Calgary Arabic 11010 ## 7 Calgary Urdu 11060 ## 8 Vancouver Hindi 11235 ## 9 Montréal Armenian 11835 ## 10 Toronto Romanian 12200 ## # ℹ 57 more rows Now that we’ve shown you the pipe operator as an alternative to storing temporary objects and composing code, does this mean you should never store temporary objects or compose code? Not necessarily! There are times when you will still want to do these things. For example, you might store a temporary object before feeding it into a plot function so you can iteratively change the plot without having to redo all of your data transformations. Additionally, piping many functions can be overwhelming and difficult to debug; you may want to store a temporary object midway through to inspect your result before moving on with further steps. 
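The idea of stopping a pipeline midway to inspect the result can be sketched as follows. This is an illustration only: toy_lang is a made-up stand-in for tidy_lang with three invented rows, and the object names are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for tidy_lang; these rows are invented
toy_lang <- tibble(
  region = c("Toronto", "Vancouver", "Calgary"),
  language = c("A", "B", "C"),
  most_at_home = c(12000, 500, 30000)
)

# Store the filtered result so it can be inspected before continuing
large_counts <- toy_lang |>
  filter(most_at_home > 10000)

# Check how many rows survived the filter
nrow(large_counts)
## [1] 2

# Continue the pipeline from the stored object
large_counts_sorted <- large_counts |>
  select(region, language, most_at_home) |>
  arrange(most_at_home)
```

Storing large_counts costs one extra name, but it gives a checkpoint to examine if a later step produces an unexpected result.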
3.9 Aggregating data with summarize and map 3.9.1 Calculating summary statistics on whole columns As a part of many data analyses, we need to calculate a summary value for the data (a summary statistic). Examples of summary statistics we might want to calculate are the number of observations, the average/mean value for a column, the minimum value, etc. Oftentimes, this summary statistic is calculated from the values in a data frame column, or columns, as shown in Figure 3.15. Figure 3.15: summarize is useful for calculating summary statistics on one or more column(s). In its simplest use case, it creates a new data frame with a single row containing the summary statistic(s) for each column being summarized. The darker, top row of each table represents the column headers. A useful dplyr function for calculating summary statistics is summarize, where the first argument is the data frame and subsequent arguments are the summaries we want to perform. Here we show how to use the summarize function to calculate the minimum and maximum number of Canadians reporting a particular language as their primary language at home. First a reminder of what region_lang looks like: region_lang ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. 
Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows We apply summarize to calculate the minimum and maximum number of Canadians reporting a particular language as their primary language at home, for any region: summarize(region_lang, min_most_at_home = min(most_at_home), max_most_at_home = max(most_at_home)) ## # A tibble: 1 × 2 ## min_most_at_home max_most_at_home ## <dbl> <dbl> ## 1 0 3836770 From this we see that there are some languages in the data set that no one speaks as their primary language at home. We also see that the most commonly spoken primary language at home is spoken by 3,836,770 people. 3.9.2 Calculating summary statistics when there are NAs In data frames in R, the value NA is often used to denote missing data. Many of the base R statistical summary functions (e.g., max, min, mean, sum, etc) will return NA when applied to columns containing NA values. Usually that is not what we want to happen; instead, we would usually like R to ignore the missing entries and calculate the summary statistic using all of the other non-NA values in the column. Fortunately many of these functions provide an argument na.rm that lets us tell the function what to do when it encounters NA values. In particular, if we specify na.rm = TRUE, the function will ignore missing values and return a summary of all the non-missing entries. We show an example of this combined with summarize below. 
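Before turning to summarize, the effect of na.rm can be seen on a plain vector. The vector counts below is made up for illustration; max and its na.rm argument are base R.

```r
# Hypothetical vector with one missing entry
counts <- c(5, NA, 30)

# By default, a single NA makes the result NA
max_default <- max(counts)
max_default
## [1] NA

# With na.rm = TRUE, the NA is dropped before computing the maximum
max_no_na <- max(counts, na.rm = TRUE)
max_no_na
## [1] 30
```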
First we create a new version of the region_lang data frame, named region_lang_na, that has a seemingly innocuous NA in the first row of the most_at_home column: region_lang_na ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. Joh… Aborigi… Aborigi… 5 NA 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows Now if we apply the summarize function as above, we see that we no longer get the minimum and maximum returned, but just an NA instead! summarize(region_lang_na, min_most_at_home = min(most_at_home), max_most_at_home = max(most_at_home)) ## # A tibble: 1 × 2 ## min_most_at_home max_most_at_home ## <dbl> <dbl> ## 1 NA NA We can fix this by adding the na.rm = TRUE as explained above: summarize(region_lang_na, min_most_at_home = min(most_at_home, na.rm = TRUE), max_most_at_home = max(most_at_home, na.rm = TRUE)) ## # A tibble: 1 × 2 ## min_most_at_home max_most_at_home ## <dbl> <dbl> ## 1 0 3836770 3.9.3 Calculating summary statistics for groups of rows A common pairing with summarize is group_by. Pairing these functions together can let you summarize values for subgroups within a data set, as illustrated in Figure 3.16. For example, we can use group_by to group the regions of the region_lang data frame and then calculate the minimum and maximum number of Canadians reporting the language as the primary language at home for each of the regions in the data set. Figure 3.16: summarize and group_by is useful for calculating summary statistics on one or more column(s) for each group. 
It creates a new data frame—with one row for each group—containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The orange, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example. The group_by function takes at least two arguments. The first is the data frame that will be grouped, and the second and onwards are columns to use in the grouping. Here we use only one column for grouping (region), but more than one can also be used. To do this, list additional columns separated by commas. group_by(region_lang, region) |> summarize( min_most_at_home = min(most_at_home), max_most_at_home = max(most_at_home) ) ## # A tibble: 35 × 3 ## region min_most_at_home max_most_at_home ## <chr> <dbl> <dbl> ## 1 Abbotsford - Mission 0 137445 ## 2 Barrie 0 182390 ## 3 Belleville 0 97840 ## 4 Brantford 0 124560 ## 5 Calgary 0 1065070 ## 6 Edmonton 0 1050410 ## 7 Greater Sudbury 0 133960 ## 8 Guelph 0 130950 ## 9 Halifax 0 371215 ## 10 Hamilton 0 630380 ## # ℹ 25 more rows Notice that group_by on its own doesn’t change the way the data looks. In the output below, the grouped data set looks the same, and it doesn’t appear to be grouped by region. Instead, group_by simply changes how other functions work with the data, as we saw with summarize above. group_by(region_lang, region) ## # A tibble: 7,490 × 7 ## # Groups: region [35] ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. 
Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows 3.9.4 Calculating summary statistics on many columns Sometimes we need to summarize statistics across many columns. An example of this is illustrated in Figure 3.17. In such a case, using summarize alone means that we have to type out the name of each column we want to summarize. In this section we will meet two strategies for performing this task. First we will see how we can do this using summarize + across. Then we will also explore how we can use a more general iteration function, map, to also accomplish this. Figure 3.17: summarize + across or map is useful for efficiently calculating summary statistics on many columns at once. The darker, top row of each table represents the column headers. summarize and across for calculating summary statistics on many columns To summarize statistics across many columns, we can use the summarize function we have just recently learned about. However, in such a case, using summarize alone means that we have to type out the name of each column we want to summarize. To do this more efficiently, we can pair summarize with across and use a colon : to specify a range of columns we would like to perform the statistical summaries on. Here we demonstrate finding the maximum value of each of the numeric columns of the region_lang data set. 
region_lang |> summarize(across(mother_tongue:lang_known, max)) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 Note: Similar to when we use base R statistical summary functions (e.g., max, min, mean, sum, etc) with summarize alone, the use of the summarize + across functions paired with base R statistical summary functions also return NAs when we apply them to columns that contain NAs in the data frame. To resolve this issue, again we need to add the argument na.rm = TRUE. But in this case we need to use it a little bit differently: we write a ~, and then call the summary function with the first argument .x and the second argument na.rm = TRUE. For example, for the previous example with the max function, we would write region_lang_na |> summarize(across(mother_tongue:lang_known, ~ max(.x, na.rm = TRUE))) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 The meaning of this unusual syntax is a bit beyond the scope of this book, but interested readers can look up anonymous functions in the purrr package from tidyverse. map for calculating summary statistics on many columns An alternative to summarize and across for applying a function to many columns is the map family of functions. Let’s again find the maximum value of each column of the region_lang data frame, but using map with the max function this time. map takes two arguments: an object (a vector, data frame or list) that you want to apply the function to, and the function that you would like to apply to each column. Note that map does not have an argument to specify which columns to apply the function to. Therefore, we will use the select function before calling map to choose the columns for which we want the maximum. 
region_lang |> select(mother_tongue:lang_known) |> map(max) ## $mother_tongue ## [1] 3061820 ## ## $most_at_home ## [1] 3836770 ## ## $most_at_work ## [1] 3218725 ## ## $lang_known ## [1] 5600480 Note: The map function comes from the purrr package. But since purrr is part of the tidyverse, once we call library(tidyverse) we do not need to load the purrr package separately. The output looks a bit weird… we passed in a data frame, but the output doesn’t look like a data frame. As it so happens, it is not a data frame, but rather a plain list: region_lang |> select(mother_tongue:lang_known) |> map(max) |> typeof() ## [1] "list" So what do we do? Should we convert this to a data frame? We could, but a simpler alternative is to just use a different map function. There are quite a few to choose from, they all work similarly, but their name reflects the type of output you want from the mapping operation. Table 3.3 lists the commonly used map functions as well as their output type. Table 3.3: The map functions in R. map function Output map list map_lgl logical vector map_int integer vector map_dbl double vector map_chr character vector map_dfc data frame, combining column-wise map_dfr data frame, combining row-wise Let’s get the columns’ maximums again, but this time use the map_dfr function to return the output as a data frame: region_lang |> select(mother_tongue:lang_known) |> map_dfr(max) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 Note: Similar to when we use base R statistical summary functions (e.g., max, min, mean, sum, etc.) with summarize, map functions paired with base R statistical summary functions also return NA values when we apply them to columns that contain NA values. To avoid this, again we need to add the argument na.rm = TRUE. 
When we use this with map, we do this by adding a , and then na.rm = TRUE after specifying the function, as illustrated below: region_lang_na |> select(mother_tongue:lang_known) |> map_dfr(max, na.rm = TRUE) ## # A tibble: 1 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 3061820 3836770 3218725 5600480 The map functions are generally quite useful for solving many problems involving repeatedly applying functions in R. Additionally, their use is not limited to columns of a data frame; map family functions can be used to apply functions to elements of a vector, or a list, and even to lists of (nested!) data frames. To learn more about the map functions, see the additional resources section at the end of this chapter. 3.10 Apply functions across many columns with mutate and across Sometimes we need to apply a function to many columns in a data frame. For example, we would need to do this when converting units of measurements across many columns. We illustrate such a data transformation in Figure 3.18. Figure 3.18: mutate and across is useful for applying functions across many columns. The darker, top row of each table represents the column headers. For example, imagine that we wanted to convert all the numeric columns in the region_lang data frame from double type to integer type using the as.integer function. When we revisit the region_lang data frame, we can see that this would be the columns from mother_tongue to lang_known. region_lang ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 St. 
Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows To accomplish such a task, we can use mutate paired with across. This works in a similar way for column selection, as we saw when we used summarize + across earlier. As we did above, we again use across to specify the columns using select syntax as well as the function we want to apply on the specified columns. However, a key difference here is that we are using mutate, which means that we get back a data frame with the same number of columns and rows. The only thing that changes is the transformation we applied to the specified columns (here mother_tongue to lang_known). region_lang |> mutate(across(mother_tongue:lang_known, as.integer)) ## # A tibble: 7,490 × 7 ## region category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <chr> <int> <int> <int> <int> ## 1 St. Joh… Aborigi… Aborigi… 5 0 0 0 ## 2 Halifax Aborigi… Aborigi… 5 0 0 0 ## 3 Moncton Aborigi… Aborigi… 0 0 0 0 ## 4 Saint J… Aborigi… Aborigi… 0 0 0 0 ## 5 Saguenay Aborigi… Aborigi… 5 5 0 0 ## 6 Québec Aborigi… Aborigi… 0 5 0 20 ## 7 Sherbro… Aborigi… Aborigi… 0 0 0 0 ## 8 Trois-R… Aborigi… Aborigi… 0 0 0 0 ## 9 Montréal Aborigi… Aborigi… 30 15 0 10 ## 10 Kingston Aborigi… Aborigi… 0 0 0 0 ## # ℹ 7,480 more rows 3.11 Apply functions across columns within one row with rowwise and mutate What if you want to apply a function across columns but within one row? We illustrate such a data transformation in Figure 3.19. Figure 3.19: rowwise and mutate is useful for applying functions across columns within one row. The darker, top row of each table represents the column headers. 
For instance, suppose we want to know the maximum value between mother_tongue, most_at_home, most_at_work and lang_known for each language and region in the region_lang data set. In other words, we want to apply the max function row-wise. We will use the (aptly named) rowwise function in combination with mutate to accomplish this task. Before we apply rowwise, we will select only the count columns so we can see all the columns in the data frame’s output easily in the book. So for this demonstration, the data set we are operating on looks like this: region_lang |> select(mother_tongue:lang_known) ## # A tibble: 7,490 × 4 ## mother_tongue most_at_home most_at_work lang_known ## <dbl> <dbl> <dbl> <dbl> ## 1 5 0 0 0 ## 2 5 0 0 0 ## 3 0 0 0 0 ## 4 0 0 0 0 ## 5 5 5 0 0 ## 6 0 5 0 20 ## 7 0 0 0 0 ## 8 0 0 0 0 ## 9 30 15 0 10 ## 10 0 0 0 0 ## # ℹ 7,480 more rows Now we apply rowwise before mutate, to tell R that we would like the mutate function to be applied across, and within, a row, as opposed to being applied on a column (which is the default behavior of mutate): region_lang |> select(mother_tongue:lang_known) |> rowwise() |> mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) ## # A tibble: 7,490 × 5 ## # Rowwise: ## mother_tongue most_at_home most_at_work lang_known maximum ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5 0 0 0 5 ## 2 5 0 0 0 5 ## 3 0 0 0 0 0 ## 4 0 0 0 0 0 ## 5 5 5 0 0 5 ## 6 0 5 0 20 20 ## 7 0 0 0 0 0 ## 8 0 0 0 0 0 ## 9 30 15 0 10 30 ## 10 0 0 0 0 0 ## # ℹ 7,480 more rows We see that we get an additional column added to the data frame, named maximum, which is the maximum value between mother_tongue, most_at_home, most_at_work and lang_known for each language and region. Similar to group_by, rowwise doesn’t appear to do anything when it is called by itself. However, we can apply rowwise in combination with other functions to change how these other functions operate on the data. 
Notice that if we had used mutate without rowwise, we would have computed the maximum value across all rows rather than the maximum value for each row. Below we show what would have happened had we not used rowwise. In particular, the same maximum value is reported in every single row; this code does not provide the desired result. region_lang |> select(mother_tongue:lang_known) |> mutate(maximum = max(c(mother_tongue, most_at_home, most_at_work, lang_known))) ## # A tibble: 7,490 × 5 ## mother_tongue most_at_home most_at_work lang_known maximum ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5 0 0 0 5600480 ## 2 5 0 0 0 5600480 ## 3 0 0 0 0 5600480 ## 4 0 0 0 0 5600480 ## 5 5 5 0 0 5600480 ## 6 0 5 0 20 5600480 ## 7 0 0 0 0 5600480 ## 8 0 0 0 0 5600480 ## 9 30 15 0 10 5600480 ## 10 0 0 0 0 5600480 ## # ℹ 7,480 more rows 3.12 Summary Cleaning and wrangling data can be a very time-consuming process. However, it is a critical step in any data analysis. We have explored many different functions for cleaning and wrangling data into a tidy format. Table 3.4 summarizes some of the key wrangling functions we learned in this chapter. In the following chapters, you will learn how you can take this tidy data and do so much more with it to answer your burning data science questions!
Table 3.4: Summary of wrangling functions Function Description across allows you to apply function(s) to multiple columns filter subsets rows of a data frame group_by allows you to apply function(s) to groups of rows mutate adds or modifies columns in a data frame map general iteration function pivot_longer generally makes the data frame longer and narrower pivot_wider generally makes a data frame wider and decreases the number of rows rowwise applies functions across columns within one row separate splits up a character column into multiple columns select subsets columns of a data frame summarize calculates summaries of inputs 3.13 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Cleaning and wrangling data” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 3.14 Additional resources As we mentioned earlier, tidyverse is actually an R meta package: it installs and loads a collection of R packages that all follow the tidy data philosophy we discussed above. One of the tidyverse packages is dplyr, a data wrangling workhorse. You have already met many of dplyr’s functions (select, filter, mutate, arrange, summarize, and group_by). To learn more about these functions and meet a few more useful functions, we recommend you check out Chapters 5-9 of the STAT545 online notes on data wrangling, exploration, and analysis with R. The dplyr R package documentation (Wickham, François, et al. 
2021) is another resource to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data wrangling functions from this chapter. Check out the tidyselect R package page (Henry and Wickham 2021) for a comprehensive list of select helpers. These helpers can be used to choose columns in a data frame when paired with the select function (and other functions that use the tidyselect syntax, such as pivot_longer). The documentation for select helpers is a useful reference to find the helper you need for your particular problem. R for Data Science (Wickham and Grolemund 2016) has a few chapters related to data wrangling that go into more depth than this book. For example, the tidy data chapter covers tidy data, pivot_longer/pivot_wider and separate, but also covers missing values and additional wrangling functions (like unite). The data transformation chapter covers select, filter, arrange, mutate, and summarize. And the map functions chapter provides more about the map functions. You will occasionally encounter a case where you need to iterate over items in a data frame, but none of the above functions are flexible enough to do what you want. In that case, you may consider using a for loop. References "],["viz.html", "Chapter 4 Effective data visualization 4.1 Overview 4.2 Chapter learning objectives 4.3 Choosing the visualization 4.4 Refining the visualization 4.5 Creating visualizations with ggplot2 4.6 Explaining the visualization 4.7 Saving the visualization 4.8 Exercises 4.9 Additional resources", " Chapter 4 Effective data visualization 4.1 Overview This chapter will introduce concepts and tools relating to data visualization beyond what we have seen and practiced so far. We will focus on guiding principles for effective data visualization and explaining visualizations independent of any particular tool or programming language. 
In the process, we will cover some specifics of creating visualizations (scatter plots, bar plots, line plots, and histograms) for data using R. 4.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe when to use the following kinds of visualizations to answer specific questions using a data set: scatter plots line plots bar plots histogram plots Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question. Evaluate the effectiveness of a visualization and suggest improvements to better answer a given question. Referring to the visualization, communicate the conclusions in non-technical terms. Identify rules of thumb for creating effective visualizations. Use the ggplot2 package in R to create and refine the above visualizations using: geometric objects: geom_point, geom_line, geom_histogram, geom_bar, geom_vline, geom_hline scales: xlim, ylim aesthetic mappings: x, y, fill, color, shape labeling: xlab, ylab, labs font control and legend positioning: theme subplots: facet_grid Define the three key aspects of ggplot2 objects: aesthetic mappings geometric objects scales Describe the difference in raster and vector output formats. Use ggsave to save visualizations in .png and .svg format. 4.3 Choosing the visualization Ask a question, and answer it The purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do before creating a visualization is to formulate the question about the data you are trying to answer. A good visualization will clearly answer your question without distraction; a great visualization will suggest even what the question was itself without additional explanation. Imagine your visualization as part of a poster presentation for a project; even if you aren’t standing at the poster explaining things, an effective visualization will convey your message to the audience. 
Recall the different data analysis questions from Chapter 1. With the visualizations we will cover in this chapter, we will be able to answer only descriptive and exploratory questions. Be careful to not answer any predictive, inferential, causal or mechanistic questions with the visualizations presented here, as we have not learned the tools necessary to do that properly just yet. As with most coding tasks, it is totally fine (and quite common) to make mistakes and iterate a few times before you find the right visualization for your data and question. There are many different kinds of plotting graphics available to use (see Chapter 5 of Fundamentals of Data Visualization (Wilke 2019) for a directory). The types of plot that we introduce in this book are shown in Figure 4.1; which one you should select depends on your data and the question you want to answer. In general, the guiding principles of when to use each type of plot are as follows: scatter plots visualize the relationship between two quantitative variables line plots visualize trends with respect to an independent, ordered quantity (e.g., time) bar plots visualize comparisons of amounts histograms visualize the distribution of one quantitative variable (i.e., all its possible values and how often they occur) Figure 4.1: Examples of scatter, line and bar plots, as well as histograms. All types of visualization have their (mis)uses, but three kinds are usually hard to understand or are easily replaced with an oft-better alternative. In particular, you should avoid pie charts; it is generally better to use bars, as it is easier to compare bar heights than pie slice sizes. You should also not use 3-D visualizations, as they are typically hard to understand when converted to a static 2-D image format. Finally, do not use tables to make numerical comparisons; humans are much better at quickly processing visual information than text and math. Bar plots are again typically a better alternative. 
4.4 Refining the visualization Convey the message, minimize noise Just being able to make a visualization in R (or any other language, for that matter) doesn’t mean that it effectively communicates your message to others. Once you have selected a broad type of visualization to use, you will have to refine it to suit your particular need. Some rules of thumb for doing this are listed below. They generally fall into two classes: you want to make your visualization convey your message, and you want to reduce visual noise as much as possible. Humans have limited cognitive ability to process information; both of these types of refinement aim to reduce the mental load on your audience when viewing your visualization, making it easier for them to understand and remember your message quickly. Convey the message Make sure the visualization answers the question you have asked as simply and plainly as possible. Use legends and labels so that your visualization is understandable without reading the surrounding text. Ensure the text, symbols, lines, etc., on your visualization are big enough to be easily read. Ensure the data are clearly visible; don’t hide the shape/distribution of the data behind other objects (e.g., a bar). Make sure to use color schemes that are understandable by those with colorblindness (a surprisingly large fraction of the overall population: from about 1% to 10%, depending on sex and ancestry (Deeb 2005)). For example, ColorBrewer and the RColorBrewer R package (Neuwirth 2014) provide the ability to pick such color schemes, and you can check your visualizations after you have created them by uploading to online tools such as a color blindness simulator. Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience. Minimize noise Use colors sparingly. Too many different colors can be distracting, create false patterns, and detract from the message. Be wary of overplotting. 
Overplotting occurs when marks that represent the data overlap; it is problematic because it prevents you from seeing how many data points are represented in areas of the visualization where this occurs. If your plot has too many dots or lines and starts to look like a mess, you need to do something different. Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small. Don’t adjust the axes to zoom in on small differences. If the difference is small, show that it’s small! 4.5 Creating visualizations with ggplot2 Build the visualization iteratively This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, and then how to create the visualization in R using the ggplot2 R package. Because the ggplot2 package is loaded by the tidyverse metapackage, we need to load only tidyverse: library(tidyverse) 4.5.1 Scatter plots and line plots: the Mauna Loa CO\\(_{\\text{2}}\\) data set The Mauna Loa CO\\(_{\\text{2}}\\) data set, curated by Dr. Pieter Tans, NOAA/GML and Dr. Ralph Keeling, Scripps Institution of Oceanography, records the atmospheric concentration of carbon dioxide (CO\\(_{\\text{2}}\\), in parts per million) at the Mauna Loa research station in Hawaii from 1959 onward (Tans and Keeling 2020). For this book, we are going to focus on the years 1980-2020. Question: Does the concentration of atmospheric CO\\(_{\\text{2}}\\) change over time, and are there any interesting patterns to note? To get started, we will read and inspect the data: # mauna loa carbon dioxide data co2_df <- read_csv("data/mauna_loa_data.csv") co2_df ## # A tibble: 484 × 2 ## date_measured ppm ## <date> <dbl> ## 1 1980-02-01 338. ## 2 1980-03-01 340. ## 3 1980-04-01 341. ## 4 1980-05-01 341. ## 5 1980-06-01 341. ## 6 1980-07-01 339. ## 7 1980-08-01 338. ## 8 1980-09-01 336. ## 9 1980-10-01 336. ## 10 1980-11-01 337. 
## # ℹ 474 more rows We see that there are two columns in the co2_df data frame; date_measured and ppm. The date_measured column holds the date the measurement was taken, and is of type date. The ppm column holds the value of CO\\(_{\\text{2}}\\) in parts per million that was measured on each date, and is type double. Note: read_csv was able to parse the date_measured column into the date vector type because it was entered in the international standard date format, called ISO 8601, which lists dates as year-month-day. date vectors are double vectors with special properties that allow them to handle dates correctly. For example, date type vectors allow functions like ggplot to treat them as numeric dates and not as character vectors, even though they contain non-numeric characters (e.g., in the date_measured column in the co2_df data frame). This means R will not accidentally plot the dates in the wrong order (i.e., not alphanumerically as would happen if it was a character vector). An in-depth study of dates and times is beyond the scope of the book, but interested readers may consult the Dates and Times chapter of R for Data Science (Wickham and Grolemund 2016); see the additional resources at the end of this chapter. Since we are investigating a relationship between two variables (CO\\(_{\\text{2}}\\) concentration and date), a scatter plot is a good place to start. Scatter plots show the data as individual points with x (horizontal axis) and y (vertical axis) coordinates. Here, we will use the measurement date as the x coordinate and the CO\\(_{\\text{2}}\\) concentration as the y coordinate. When using the ggplot2 package, we create a plot object with the ggplot function. There are a few basic aspects of a plot that we need to specify: The name of the data frame object to visualize. Here, we specify the co2_df data frame. The aesthetic mapping, which tells ggplot how the columns in the data frame map to properties of the visualization. 
To create an aesthetic mapping, we use the aes function. Here, we set the plot x axis to the date_measured variable, and the plot y axis to the ppm variable. The + operator, which tells ggplot that we would like to add another layer to the plot. The geometric object, which specifies how the mapped data should be displayed. To create a geometric object, we use a geom_* function (see the ggplot reference for a list of geometric objects). Here, we use the geom_point function to visualize our data as a scatter plot. co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_point() co2_scatter Figure 4.2: Scatter plot of atmospheric concentration of CO\\(_{2}\\) over time. The visualization in Figure 4.2 shows a clear upward trend in the atmospheric concentration of CO\\(_{\\text{2}}\\) over time. This plot answers the first part of our question in the affirmative, but that appears to be the only conclusion one can make from the scatter visualization. One important thing to note about this data is that one of the variables we are exploring is time. Time is a special kind of quantitative variable because it forces additional structure on the data—the data points have a natural order. Specifically, each observation in the data set has a predecessor and a successor, and the order of the observations matters; changing their order alters their meaning. In situations like this, we typically use a line plot to visualize the data. Line plots connect the sequence of x and y coordinates of the observations with line segments, thereby emphasizing their order. We can create a line plot in ggplot using the geom_line function. Let’s now try to visualize the co2_df as a line plot with just the default arguments: co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_line() co2_line Figure 4.3: Line plot of atmospheric concentration of CO\\(_{2}\\) over time. Aha! 
Figure 4.3 shows us there is another interesting phenomenon in the data: in addition to increasing over time, the concentration seems to oscillate as well. Given the visualization as it is now, it is still hard to tell how fast the oscillation is, but nevertheless, the line seems to be a better choice for answering the question than the scatter plot was. The comparison between these two visualizations also illustrates a common issue with scatter plots: often, the points are shown too close together or even on top of one another, muddling information that would otherwise be clear (overplotting). Now that we have settled on the rough details of the visualization, it is time to refine things. This plot is fairly straightforward, and there is not much visual noise to remove. But there are a few things we must do to improve clarity, such as adding informative axis labels and making the font a more readable size. To add axis labels, we use the xlab and ylab functions. To change the font size, we use the theme function with the text argument: co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_line() + xlab("Year") + ylab("Atmospheric CO2 (ppm)") + theme(text = element_text(size = 12)) co2_line Figure 4.4: Line plot of atmospheric concentration of CO\\(_{2}\\) over time with clearer axes and labels. Note: The theme function is quite complex and has many arguments that can be specified to control many non-data aspects of a visualization. An in-depth discussion of the theme function is beyond the scope of this book. Interested readers may consult the theme function documentation; see the additional resources section at the end of this chapter. Finally, let’s see if we can better understand the oscillation by changing the visualization slightly. Note that it is totally fine to use a small number of visualizations to answer different aspects of the question you are trying to answer. 
We will accomplish this by using scales, another important feature of ggplot2 that easily transforms the different variables and sets limits. We scale the horizontal axis using the xlim function, and the vertical axis with the ylim function. In particular, here, we will use the xlim function to zoom in on just five years of data (say, 1990-1994). xlim takes a vector of length two that specifies the lower and upper bounds of the axis. We can create that using the c function. Note that the vector given to xlim must be of the same type as the data that is mapped to that axis. Here, we have mapped a date to the x-axis, and so we need to use the date function (from the tidyverse lubridate R package (Spinu, Grolemund, and Wickham 2021; Grolemund and Wickham 2011)) to convert the character strings we provide to c to date vectors. Note: lubridate is a package that is installed by the tidyverse metapackage, but is not loaded by it. Hence we need to load it separately in the code below. library(lubridate) co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) + geom_line() + xlab("Year") + ylab("Atmospheric CO2 (ppm)") + xlim(c(date("1990-01-01"), date("1993-12-01"))) + theme(text = element_text(size = 12)) co2_line Figure 4.5: Line plot of atmospheric concentration of CO\\(_{2}\\) from 1990 to 1994. Interesting! It seems that each year, the atmospheric CO\\(_{\\text{2}}\\) increases until it reaches its peak somewhere around April, decreases until around late September, and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April. Therefore, the oscillating pattern in CO\\(_{\\text{2}}\\) matches up fairly closely with the two seasons. As you might have noticed from the code used to create the final visualization of the co2_df data frame, we construct the visualizations in ggplot with layers. 
New layers are added with the + operator, and we can really add as many as we would like! A useful analogy to constructing a data visualization is painting a picture. We start with a blank canvas, and the first thing we do is prepare the surface for our painting by adding primer. In our data visualization this is akin to calling ggplot and specifying the data set we will be using. Next, we sketch out the background of the painting. In our data visualization, this would be when we map data to the axes in the aes function. Then we add our key visual subjects to the painting. In our data visualization, this would be the geometric objects (e.g., geom_point, geom_line, etc.). And finally, we work on adding details and refinements to the painting. In our data visualization this would be when we fine tune axis labels, change the font, adjust the point size, and do other related things. 4.5.2 Scatter plots: the Old Faithful eruption time data set The faithful data set contains measurements of the waiting time between eruptions and the subsequent eruption duration (in minutes) of the Old Faithful geyser in Yellowstone National Park, Wyoming, United States. The faithful data set is available in base R as a data frame, so it does not need to be loaded. We convert it to a tibble to take advantage of the nicer print output these specialized data frames provide. Question: Is there a relationship between the waiting time before an eruption and the duration of the eruption? # old faithful eruption time / wait time data faithful <- as_tibble(faithful) faithful ## # A tibble: 272 × 2 ## eruptions waiting ## <dbl> <dbl> ## 1 3.6 79 ## 2 1.8 54 ## 3 3.33 74 ## 4 2.28 62 ## 5 4.53 85 ## 6 2.88 55 ## 7 4.7 88 ## 8 3.6 85 ## 9 1.95 51 ## 10 4.35 85 ## # ℹ 262 more rows Here again, we investigate the relationship between two quantitative variables (waiting time and eruption time). 
But if you look at the output of the data frame, you’ll notice that unlike time in the Mauna Loa CO\\(_{\\text{2}}\\) data set, neither of the variables here has a natural order to it. So a scatter plot is likely to be the most appropriate visualization. Let’s create a scatter plot using the ggplot function with the waiting variable on the horizontal axis, the eruptions variable on the vertical axis, and the geom_point geometric object. The result is shown in Figure 4.6. faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point() faithful_scatter Figure 4.6: Scatter plot of waiting time and eruption time. We can see in Figure 4.6 that the data tend to fall into two groups: one with short waiting and eruption times, and one with long waiting and eruption times. Note that in this case, there is no overplotting: the points are generally nicely visually separated, and the pattern they form is clear. In order to refine the visualization, we need only to add axis labels and make the font more readable: faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point() + xlab("Waiting Time (mins)") + ylab("Eruption Duration (mins)") + theme(text = element_text(size = 12)) faithful_scatter Figure 4.7: Scatter plot of waiting time and eruption time with clearer axes and labels. 4.5.3 Axis transformation and colored scatter plots: the Canadian languages data set Recall the can_lang data set (Timbers 2020) from Chapters 1, 2, and 3, which contains counts of languages from the 2016 Canadian census. Question: Is there a relationship between the percentage of people who speak a language as their mother tongue and the percentage for whom that is the primary language spoken at home? And is there a pattern in the strength of this relationship in the higher-level language categories (Official languages, Aboriginal languages, or non-official and non-Aboriginal languages)? 
To get started, we will read and inspect the data: can_lang <- read_csv("data/can_lang.csv") can_lang ## # A tibble: 214 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Aboriginal langu… Aborigi… 590 235 30 665 ## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415 ## 3 Non-Official & N… Afro-As… 1150 445 10 2775 ## 4 Non-Official & N… Akan (T… 13460 5985 25 22150 ## 5 Non-Official & N… Albanian 26895 13135 345 31930 ## 6 Aboriginal langu… Algonqu… 45 10 0 120 ## 7 Aboriginal langu… Algonqu… 1260 370 40 2480 ## 8 Non-Official & N… America… 2685 3020 1145 21930 ## 9 Non-Official & N… Amharic 22465 12785 200 33670 ## 10 Non-Official & N… Arabic 419890 223535 5585 629055 ## # ℹ 204 more rows We will begin with a scatter plot of the mother_tongue and most_at_home columns from our data frame. The resulting plot is shown in Figure 4.8. ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) + geom_point() Figure 4.8: Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home. To make an initial improvement in the interpretability of Figure 4.8, we should replace the default axis names with more informative labels. We can use \\n to create a line break in the axis names so that the words after \\n are printed on a new line. This will make the axes labels on the plots more readable. We should also increase the font size to further improve readability. ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) + geom_point() + xlab("Language spoken most at home \\n (number of Canadian residents)") + ylab("Mother tongue \\n (number of Canadian residents)") + theme(text = element_text(size = 12)) Figure 4.9: Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels. Okay! The axes and labels in Figure 4.9 are much more readable and interpretable now. 
However, the scatter points themselves could use some work; most of the 214 data points are bunched up in the lower left-hand side of the visualization. The data is clumped because many more people in Canada speak English or French (the two points in the upper right corner) than other languages. In particular, the most common mother tongue language has 19,460,850 speakers, while the least common has only 10. That’s a difference of six orders of magnitude between these two numbers! We can confirm that the two points in the upper right-hand corner correspond to Canada’s two official languages by filtering the data: can_lang |> filter(language == "English" | language == "French") ## # A tibble: 2 × 6 ## category language mother_tongue most_at_home most_at_work lang_known ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Official languages English 19460850 22162865 15265335 29748265 ## 2 Official languages French 7166700 6943800 3825215 10242945 Recall that our question about this data pertains to all languages; so to properly answer our question, we will need to adjust the scale of the axes so that we can clearly see all of the scatter points. In particular, we will improve the plot by adjusting the horizontal and vertical axes so that they are on a logarithmic (or log) scale. Log scaling is useful when your data take both very large and very small values, because it spaces out small values and squishes larger values together. For example, \\(\\log_{10}(1) = 0\\), \\(\\log_{10}(10) = 1\\), \\(\\log_{10}(100) = 2\\), and \\(\\log_{10}(1000) = 3\\); on the logarithmic scale, the values 1, 10, 100, and 1000 are all the same distance apart! So we see that applying this function moves big values closer together and moves small values farther apart. Note that if your data can take the value 0, logarithmic scaling may not be appropriate (since log10(0) is -Inf in R). There are other ways to transform the data in such a case, but these are beyond the scope of the book. 
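The arithmetic behind log scaling can be checked directly in R. The short sketch below uses base R's log10 function on a few illustrative values (the vector name counts is hypothetical) to show that equal ratios become equal distances on the log scale, and that zero has no finite position on it. The last line applies the same function to the ratio of the largest and smallest mother tongue counts quoted above.

```r
# illustrative values spanning several orders of magnitude
counts <- c(1, 10, 100, 1000)

# on the log10 scale these values become 0, 1, 2, 3
log_counts <- log10(counts)
log_counts

# consecutive values are all the same distance apart
diff(log_counts)

# zero cannot be placed on a log scale: the result is -Inf
log10(0)

# ratio of the most and least common mother tongue counts,
# expressed in powers of ten (a little over 6)
log10(19460850 / 10)
```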
We can accomplish logarithmic scaling in a ggplot visualization using the scale_x_log10 and scale_y_log10 functions. Given that the x and y axes have large numbers, we should also format the axis labels to put commas in these numbers to increase their readability. We can do this in R by passing the label_comma function (from the scales package) to the labels argument of the scale_x_log10 and scale_y_log10 functions. library(scales) ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) + geom_point() + xlab("Language spoken most at home \\n (number of Canadian residents)") + ylab("Mother tongue \\n (number of Canadian residents)") + theme(text = element_text(size = 12)) + scale_x_log10(labels = label_comma()) + scale_y_log10(labels = label_comma()) Figure 4.10: Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with log adjusted x and y axes. Similar to some of the examples in Chapter 3, we can convert the counts to percentages to give them context and make them easier to understand. We can do this by dividing the number of people reporting a given language as their mother tongue or primary language at home by the number of people who live in Canada and multiplying by 100%. For example, the percentage of people who reported that their mother tongue was English in the 2016 Canadian census was 19,460,850 / 35,151,728 \\(\\times\\) 100 % = 55.36%. Below we use mutate to calculate the percentage of people reporting a given language as their mother tongue and primary language at home for all the languages in the can_lang data set. Since the new columns are appended to the end of the data table, we select the new columns after the transformation so you can clearly see the mutated output from the table. 
can_lang <- can_lang |> mutate( mother_tongue_percent = (mother_tongue / 35151728) * 100, most_at_home_percent = (most_at_home / 35151728) * 100 ) can_lang |> select(mother_tongue_percent, most_at_home_percent) ## # A tibble: 214 × 2 ## mother_tongue_percent most_at_home_percent ## <dbl> <dbl> ## 1 0.00168 0.000669 ## 2 0.0292 0.0136 ## 3 0.00327 0.00127 ## 4 0.0383 0.0170 ## 5 0.0765 0.0374 ## 6 0.000128 0.0000284 ## 7 0.00358 0.00105 ## 8 0.00764 0.00859 ## 9 0.0639 0.0364 ## 10 1.19 0.636 ## # ℹ 204 more rows Finally, we will edit the visualization to use the percentages we just computed (and change our axis labels to reflect this change in units). Figure 4.11 displays the final result. ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12)) + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) Figure 4.11: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home. Figure 4.11 is the appropriate visualization to use to answer the first question in this section, i.e., whether there is a relationship between the percentage of people who speak a language as their mother tongue and the percentage for whom that is the primary language spoken at home. To fully answer the question, we need to use Figure 4.11 to assess a few key characteristics of the data: Direction: if the y variable tends to increase when the x variable increases, then y has a positive relationship with x. If y tends to decrease when x increases, then y has a negative relationship with x. If y does not meaningfully increase or decrease as x increases, then y has little or no relationship with x. Strength: if the y variable reliably increases, decreases, or stays flat as x increases, then the relationship is strong. 
Otherwise, the relationship is weak. Intuitively, the relationship is strong when the scatter points are close together and look more like a “line” or “curve” than a “cloud.” Shape: if you can draw a straight line roughly through the data points, the relationship is linear. Otherwise, it is nonlinear. In Figure 4.11, we see that as the percentage of people who have a language as their mother tongue increases, so does the percentage of people who speak that language at home. Therefore, there is a positive relationship between these two variables. Furthermore, because the points in Figure 4.11 are fairly close together, and the points look more like a “line” than a “cloud”, we can say that this is a strong relationship. And finally, because drawing a straight line through these points in Figure 4.11 would fit the pattern we observe quite well, we say that the relationship is linear. Onto the second part of our exploratory data analysis question! Recall that we are interested in knowing whether the strength of the relationship we uncovered in Figure 4.11 depends on the higher-level language category (Official languages, Aboriginal languages, and non-official, non-Aboriginal languages). One common way to explore this is to color the data points on the scatter plot we have already created by group. For example, given that we have the higher-level language category for each language recorded in the 2016 Canadian census, we can color the points in our previous scatter plot to represent each language’s higher-level language category. Here we want to distinguish the values according to the category group with which they belong. We can add an argument to the aes function, specifying that the category column should color the points. Adding this argument will color the points according to their group and add a legend at the side of the plot. 
ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent, color = category)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12)) + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) Figure 4.12: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category. The legend in Figure 4.12 takes up valuable plot area. We can improve this by moving the legend using the legend.position and legend.direction arguments of the theme function. Here we set legend.position to \"top\" to put the legend above the plot and legend.direction to \"vertical\" so that the legend items remain vertically stacked on top of each other. When the legend.position is set to either \"top\" or \"bottom\" the default direction is to arrange the legend items horizontally. However, that will not work well for this particular visualization because the legend labels are quite long and would run off the page if displayed this way. ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent, color = category)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical") + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) Figure 4.13: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with the legend edited. In Figure 4.13, the points are colored with the default ggplot2 color palette. But what if you want to use different colors?
In R, two packages that provide alternative color palettes are RColorBrewer (Neuwirth 2014) and ggthemes (Arnold 2019); in this book we will cover how to use RColorBrewer. You can visualize the list of color palettes that RColorBrewer has to offer with the display.brewer.all function. You can also print a list of color-blind friendly palettes by adding colorblindFriendly = TRUE to the function. library(RColorBrewer) display.brewer.all(colorblindFriendly = TRUE) Figure 4.14: Color palettes available from the RColorBrewer R package. From Figure 4.14, we can choose the color palette we want to use in our plot. To change the color palette, we add the scale_color_brewer layer indicating the palette we want to use. You can use this color blindness simulator to check if your visualizations are color-blind friendly. Below we pick the \"Set2\" palette, with the result shown in Figure 4.15. We also set the shape aesthetic mapping to the category variable as well; this makes the scatter point shapes different for each category. This kind of visual redundancy—i.e., conveying the same information with both scatter point color and shape—can further improve the clarity and accessibility of your visualization. ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent, color = category, shape = category)) + geom_point() + xlab("Language spoken most at home \\n (percentage of Canadian residents)") + ylab("Mother tongue \\n (percentage of Canadian residents)") + theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical") + scale_x_log10(labels = comma) + scale_y_log10(labels = comma) + scale_color_brewer(palette = "Set2") Figure 4.15: Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with color-blind friendly colors. 
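As a small aside on the palette choice above, the sketch below lists which RColorBrewer palettes the package itself flags as color-blind friendly, using the brewer.pal.info data frame that ships with the package. The variable names friendly and set2_colors are illustrative and do not appear in the text.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette;
# its colorblind column is TRUE for color-blind friendly palettes
friendly <- rownames(brewer.pal.info[brewer.pal.info$colorblind, ])
friendly

# hex codes for three colors of the Set2 palette,
# one per higher-level language category
set2_colors <- brewer.pal(n = 3, name = "Set2")
set2_colors
```

Looking up the hex codes this way can be handy if you later want to pass the same colors to a manual color scale.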
From the visualization in Figure 4.15, we can now clearly see that the vast majority of Canadians reported one of the official languages as their mother tongue and as the language they speak most often at home. What do we see when considering the second part of our exploratory question? Do we see a difference in the relationship between languages spoken as a mother tongue and as a primary language at home across the higher-level language categories? Based on Figure 4.15, there does not appear to be much of a difference. For each higher-level language category, there appears to be a strong, positive, and linear relationship between the percentage of people who speak a language as their mother tongue and the percentage who speak it as their primary language at home. The relationship looks similar regardless of the category. Does this mean that this relationship is positive for all languages in the world? And further, can we use this data visualization on its own to predict how many people have a given language as their mother tongue if we know how many people speak it as their primary language at home? The answer to both these questions is “no!” However, with exploratory data analysis, we can create new hypotheses, ideas, and questions (like the ones at the beginning of this paragraph). Answering those questions often involves doing more complex analyses, and sometimes even gathering additional data. We will see more of such complex analyses later on in this book. 4.5.4 Bar plots: the island landmass data set The islands.csv data set contains a list of Earth’s landmasses as well as their area (in thousands of square miles) (McNeil 1977). Question: Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth’s seven largest landmasses? If so, what are the next few largest landmasses after those? 
To get started, we will read and inspect the data: # islands data islands_df <- read_csv("data/islands.csv") islands_df ## # A tibble: 48 × 3 ## landmass size landmass_type ## <chr> <dbl> <chr> ## 1 Africa 11506 Continent ## 2 Antarctica 5500 Continent ## 3 Asia 16988 Continent ## 4 Australia 2968 Continent ## 5 Axel Heiberg 16 Other ## 6 Baffin 184 Other ## 7 Banks 23 Other ## 8 Borneo 280 Other ## 9 Britain 84 Other ## 10 Celebes 73 Other ## # ℹ 38 more rows Here, we have a data frame of Earth’s landmasses, and are trying to compare their sizes. The right type of visualization to answer this question is a bar plot. In a bar plot, the height of each bar represents the value of an amount (a size, count, proportion, percentage, etc). They are particularly useful for comparing counts or proportions across different groups of a categorical variable. Note, however, that bar plots should generally not be used to display mean or median values, as they hide important information about the variation of the data. Instead it’s better to show the distribution of all the individual data points, e.g., using a histogram, which we will discuss further in Section 4.5.5. We specify that we would like to use a bar plot via the geom_bar function in ggplot2. However, by default, geom_bar sets the heights of bars to the number of times a value appears in a data frame (its count); here, we want to plot exactly the values in the data frame, i.e., the landmass sizes. So we have to pass the stat = \"identity\" argument to geom_bar. The result is shown in Figure 4.16. islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) + geom_bar(stat = "identity") islands_bar Figure 4.16: Bar plot of Earth’s landmass sizes with squished labels. Alright, not bad! The plot in Figure 4.16 is definitely the right kind of visualization, as we can clearly see and compare sizes of landmasses. 
The major issues are that the smaller landmasses’ sizes are hard to distinguish, and the names of the landmasses are obscuring each other as they have been squished into too little space. But remember that the question we asked was only about the largest landmasses; let’s make the plot a little bit clearer by keeping only the largest 12 landmasses. We do this using the slice_max function: the order_by argument is the name of the column we want to use for comparing which is largest, and the n argument specifies how many rows to keep. Then to give the labels enough space, we’ll use horizontal bars instead of vertical ones. We do this by swapping the x and y variables. Note: Recall that in Chapter 1, we used arrange followed by slice to obtain the ten rows with the largest values of a variable. We could have instead used the slice_max function for this purpose. The slice_max and slice_min functions achieve the same goal as arrange followed by slice, but are slightly more efficient because they are specialized for this purpose. In general, it is good to use more specialized functions when they are available! islands_top12 <- slice_max(islands_df, order_by = size, n = 12) islands_bar <- ggplot(islands_top12, aes(x = size, y = landmass)) + geom_bar(stat = "identity") islands_bar Figure 4.17: Bar plot of size for Earth’s largest 12 landmasses. The plot in Figure 4.17 is definitely clearer now, and allows us to answer our question (“Are the top 7 largest landmasses continents?”) in the affirmative. However, we could still improve this visualization by coloring the bars based on whether they correspond to a continent, and by organizing the bars by landmass size rather than by alphabetical order. The data for coloring the bars is stored in the landmass_type column, so we add the fill argument to the aesthetic mapping and set it to landmass_type. 
We manually select two colors for the bars using the scale_fill_manual function: \"darkorange\" for orange and \"steelblue\" for blue. To organize the landmasses by their size variable, we will use the tidyverse fct_reorder function in the aesthetic mapping. The first argument passed to fct_reorder is the name of the factor column whose levels we would like to reorder (here, landmass). The second argument is the column name that holds the values we would like to use to do the ordering (here, size). The fct_reorder function uses ascending order by default, but this can be changed to descending order by setting .desc = TRUE. We do this here so that the largest bar will be closest to the axis line, which is more visually appealing. To finalize this plot we will customize the axis and legend labels, and add a title to the chart. Plot titles are not always required, especially when they would be redundant with an already-existing caption or surrounding context (e.g., in a slide presentation with annotations). But if you decide to include one, a good plot title should provide the take home message that you want readers to focus on, e.g., “Earth’s seven largest landmasses are continents,” or a more general summary of the information displayed, e.g., “Earth’s twelve largest landmasses.” To make these final adjustments we will use the labs function rather than the xlab and ylab functions we have seen earlier in this chapter, as labs lets us modify the legend label and title in addition to axis labels. We provide a label for each aesthetic mapping in the plot (in this case, x, y, and fill) as well as one for the title argument. Finally, we again use the theme function to change the font size.
islands_bar <- ggplot(islands_top12, aes(x = size, y = fct_reorder(landmass, size, .desc = TRUE), fill = landmass_type)) + geom_bar(stat = "identity") + labs(x = "Size (1000 square mi)", y = "Landmass", fill = "Type", title = "Earth's twelve largest landmasses") + scale_fill_manual(values = c("steelblue", "darkorange")) + theme(text = element_text(size = 10)) islands_bar Figure 4.18: Bar plot of size for Earth’s largest 12 landmasses, colored by landmass type, with clearer axes and labels. The plot in Figure 4.18 is now a very effective visualization for answering our original questions. Landmasses are organized by their size, and continents are colored differently than other landmasses, making it quite clear that continents are the largest seven landmasses. 4.5.5 Histograms: the Michelson speed of light data set The morley data set contains measurements of the speed of light collected in experiments performed in 1879. Five experiments were performed, and in each experiment, 20 runs were performed—meaning that 20 measurements of the speed of light were collected in each experiment (Michelson 1882). The morley data set is available in base R as a data frame, so it does not need to be loaded. Because the speed of light is a very large number (the true value is 299,792.458 km/sec), the data is coded to be the measured speed of light minus 299,000. This coding allows us to focus on the variations in the measurements, which are generally much smaller than 299,000. If we used the full large speed measurements, the variations in the measurements would not be noticeable, making it difficult to study the differences between the experiments. Note that we convert the morley data to a tibble to take advantage of the nicer print output these specialized data frames provide. Question: Given what we know now about the speed of light (299,792.458 kilometres per second), how accurate were each of the experiments? 
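To make the coding described above concrete, the following sketch adds 299,000 back to the coded values to recover speeds in km/sec, and expresses the true speed of light in the same coded units. It uses the morley data frame built into R; the names decoded, speed_km_s, and true_coded are illustrative choices, not from the text.

```r
# morley is built into R (datasets package); Speed is coded as
# the measured speed minus 299,000 km/sec
decoded <- transform(morley, speed_km_s = Speed + 299000)
head(decoded)

# the true speed of light expressed in the coded units of the data set
true_coded <- 299792.458 - 299000
true_coded
```

Working in the coded units keeps the numbers small, which is why the plots in this section place the true value at roughly 792 rather than at 299,792.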
# michelson morley experimental data morley <- as_tibble(morley) morley ## # A tibble: 100 × 3 ## Expt Run Speed ## <int> <int> <int> ## 1 1 1 850 ## 2 1 2 740 ## 3 1 3 900 ## 4 1 4 1070 ## 5 1 5 930 ## 6 1 6 850 ## 7 1 7 950 ## 8 1 8 980 ## 9 1 9 980 ## 10 1 10 880 ## # ℹ 90 more rows In this experimental data, Michelson was trying to measure just a single quantitative number (the speed of light). The data set contains many measurements of this single quantity. To tell how accurate the experiments were, we need to visualize the distribution of the measurements (i.e., all their possible values and how often each occurs). We can do this using a histogram. A histogram helps us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin. To create a histogram in ggplot2 we will use the geom_histogram geometric object, setting the x axis to the Speed measurement variable. As usual, let’s use the default arguments just to see how things look. morley_hist <- ggplot(morley, aes(x = Speed)) + geom_histogram() morley_hist Figure 4.19: Histogram of Michelson’s speed of light data. Figure 4.19 is a great start. However, we cannot tell how accurate the measurements are using this visualization unless we can see the true value. In order to visualize the true speed of light, we will add a vertical line with the geom_vline function. To draw a vertical line with geom_vline, we need to specify where on the x-axis the line should be drawn. We can do this by setting the xintercept argument. Here we set it to 792.458, which is the true value of light speed minus 299,000; this ensures it is coded the same way as the measurements in the morley data frame. We would also like to fine tune this vertical line, styling it so that it is dashed by setting linetype = \"dashed\". There is a similar function, geom_hline, that is used for plotting horizontal lines. 
Note that vertical lines are used to denote quantities on the horizontal axis, while horizontal lines are used to denote quantities on the vertical axis. morley_hist <- ggplot(morley, aes(x = Speed)) + geom_histogram() + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.20: Histogram of Michelson’s speed of light data with vertical line indicating true speed of light. In Figure 4.20, we still cannot tell which experiments (denoted in the Expt column) led to which measurements; perhaps some experiments were more accurate than others. To fully answer our question, we need to separate the measurements from each other visually. We can try to do this using a colored histogram, where counts from different experiments are shown in different colors. We can create a histogram colored by the Expt variable by adding it to the fill aesthetic mapping. We make sure the different colors can be seen (despite them all sitting on top of each other) by setting the alpha argument in geom_histogram to 0.5 to make the bars slightly translucent. We also specify position = \"identity\" in geom_histogram to ensure the histograms for each experiment will be overlaid on top of one another, instead of being drawn as stacked bars (which is the default for bar plots or histograms when they are colored by another categorical variable). morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) + geom_histogram(alpha = 0.5, position = "identity") + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.21: Histogram of Michelson’s speed of light data where an attempt is made to color the bars by experiment. Alright great, Figure 4.21 looks…wait a second! The histogram is still all the same color! What is going on here? Well, if you recall from Chapter 3, the data type you use for each variable can influence how R and the tidyverse treat it. Here, we indeed have an issue with the data types in the morley data frame.
In particular, the Expt column is currently an integer (you can see the label <int> underneath the Expt column in the printed data frame at the start of this section). But we want to treat it as a category, i.e., there should be one category per type of experiment. To fix this issue we can convert the Expt variable into a factor by passing it to as_factor in the fill aesthetic mapping. Recall that factor is a data type in R that is often used to represent categories. By writing as_factor(Expt) we are ensuring that R will treat this variable as a factor, and the color will be mapped discretely. morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) + geom_histogram(alpha = 0.5, position = "identity") + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.22: Histogram of Michelson’s speed of light data colored by experiment as factor. Note: Factors impact plots in two ways: (1) they ensure a color is mapped discretely where appropriate (as in this example) and (2) they determine the ordering of levels in a plot. ggplot takes into account the order of the factor levels as opposed to the order of data in your data frame. Learning how to reorder your factor levels will help you with reordering the labels of a factor on a plot. Unfortunately, the attempt to separate out the experiment number visually has created a bit of a mess. All of the colors in Figure 4.22 are blending together, and although it is possible to derive some insight from this (e.g., experiments 1 and 3 had some of the most incorrect measurements), it isn’t the clearest way to convey our message and answer the question. Let’s try a different strategy of creating a grid of separate histogram plots. We use the facet_grid function to create a plot that has multiple subplots arranged in a grid. The argument to facet_grid specifies the variable(s) used to split the plot into subplots, and how to split them (i.e., into rows or columns).
If the plot is to be split horizontally, into rows, then the rows argument is used. If the plot is to be split vertically, into columns, then the cols argument is used. Both the rows and cols arguments take the column names on which to split the data when creating the subplots. Note that the column names must be surrounded by the vars function. This function allows the column names to be correctly evaluated in the context of the data frame. morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) + geom_histogram() + facet_grid(rows = vars(Expt)) + geom_vline(xintercept = 792.458, linetype = "dashed") morley_hist Figure 4.23: Histogram of Michelson’s speed of light data split vertically by experiment. The visualization in Figure 4.23 now makes it quite clear how accurate the different experiments were with respect to one another. The most variable measurements came from Experiment 1. There the measurements ranged from about 650–1050 km/sec. The least variable measurements came from Experiment 2. There, the measurements ranged from about 750–950 km/sec. The most different experiments still obtained quite similar results! There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels using the labs function, and increase the font size to make it readable using the theme function. Second, and perhaps more subtly, even though it is easy to compare the experiments on this plot to one another, it is hard to get a sense of just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light? 
To answer this question, we’ll use the mutate function to transform our data into a relative measure of accuracy rather than absolute measurements: morley_rel <- mutate(morley, relative_accuracy = 100 * ((299000 + Speed) - 299792.458) / (299792.458)) morley_hist <- ggplot(morley_rel, aes(x = relative_accuracy, fill = as_factor(Expt))) + geom_histogram() + facet_grid(rows = vars(Expt)) + geom_vline(xintercept = 0, linetype = "dashed") + labs(x = "Relative Accuracy (%)", y = "# Measurements", fill = "Experiment ID") + theme(text = element_text(size = 12)) morley_hist Figure 4.24: Histogram of relative accuracy split vertically by experiment with clearer axes and labels. Wow, impressive! These measurements of the speed of light from 1879 had errors around 0.05% of the true speed. Figure 4.24 shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an admirable job given the technology available at the time. Choosing a binwidth for histograms When you create a histogram in R, the default number of bins used is 30. Naturally, this is not always the right number to use. You can set the number of bins yourself by using the bins argument in the geom_histogram geometric object. You can also set the width of the bins using the binwidth argument in the geom_histogram geometric object. But what number of bins, or bin width, is the right one to use? Unfortunately there is no hard rule for what the right bin number or width is. It depends entirely on your problem; the right number of bins or bin width is the one that helps you answer the question you asked. Choosing the correct setting for your problem is something that commonly takes iteration. We recommend setting the bin width (not the number of bins) because it often more directly corresponds to values in your problem of interest. 
For example, if you are looking at a histogram of human heights, a bin width of 1 inch would likely be reasonable, while the number of bins to use is not immediately clear. It’s usually a good idea to try out several bin widths to see which one most clearly captures your data in the context of the question you want to answer. To get a sense for how different bin widths affect visualizations, let’s experiment with the histogram that we have been working on in this section. In Figure 4.25, we compare the default setting with three other histograms where we set the binwidth to 0.001, 0.01 and 0.1. In this case, we can see that both the default number of bins and the binwidth of 0.01 are effective for helping answer our question. On the other hand, the bin widths of 0.001 and 0.1 are too small and too big, respectively. Figure 4.25: Effect of varying bin width on histograms. Adding layers to a ggplot plot object One of the powerful features of ggplot is that you can continue to iterate on a single plot object, adding and refining one layer at a time. If you stored your plot as a named object using the assignment symbol (<-), you can add to it using the + operator. For example, if we wanted to add a title to the last plot we created (morley_hist), we can use the + operator to add a title layer with the ggtitle function. The result is shown in Figure 4.26. morley_hist_title <- morley_hist + ggtitle("Speed of light experiments \\n were accurate to about 0.05%") morley_hist_title Figure 4.26: Histogram of relative accuracy split vertically by experiment with a descriptive title highlighting the take home message of the visualization. Note: Good visualization titles clearly communicate the take home message to the audience. Typically, that is the answer to the question you posed before making the visualization. 
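As a minimal sketch of the two ideas above (choosing a bin width and adding layers with +), the code below rebuilds the relative accuracy column from the built-in morley data, draws a histogram with binwidth = 0.01, and then adds a title layer to the stored plot object. The names base_hist and titled_hist are illustrative, and the faceting by experiment is left out to keep the example short.

```r
library(ggplot2)

# relative accuracy (%) of each measurement, using the same formula
# as earlier in this section
morley_rel <- transform(
  morley,
  relative_accuracy = 100 * ((299000 + Speed) - 299792.458) / 299792.458
)

# set the bin width explicitly instead of relying on the default of 30 bins
base_hist <- ggplot(morley_rel, aes(x = relative_accuracy)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = 0, linetype = "dashed")

# add a further layer to the stored plot object with +
titled_hist <- base_hist +
  ggtitle("Speed of light experiments \n were accurate to about 0.05%")
titled_hist
```

Changing the binwidth value to 0.001 or 0.1 and re-running should show the too-fine and too-coarse behavior described for Figure 4.25.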
4.6 Explaining the visualization Tell a story Typically, your visualization will not be shown entirely on its own, but rather it will be part of a larger presentation. Further, visualizations can provide supporting information for any aspect of a presentation, from opening to conclusion. For example, you could use an exploratory visualization in the opening of the presentation to motivate your choice of a more detailed data analysis / model, a visualization of the results of your analysis to show what your analysis has uncovered, or even one at the end of a presentation to help suggest directions for future work. Regardless of where it appears, a good way to discuss your visualization is as a story: Establish the setting and scope, and describe why you did what you did. Pose the question that your visualization answers. Justify why the question is important to answer. Answer the question using your visualization. Make sure you describe all aspects of the visualization (including describing the axes). But you can emphasize different aspects based on what is important to answer your question: trends (lines): Does a line describe the trend well? If so, the trend is linear, and if not, the trend is nonlinear. Is the trend increasing, decreasing, or neither? Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line “jump around” a lot) or smooth? distributions (scatters, histograms): How spread out are the data? Where are they centered, roughly? Are there any obvious “clusters” or “subgroups”, which would be visible as multiple bumps in the histogram? distributions of two variables (scatters): Is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible relationship (the data are too noisy to make any conclusion)? amounts (bars): How large are the bars relative to one another? Are there patterns in different groups of bars? 
Summarize your findings, and use them to motivate whatever you will discuss next. Below are two examples of how one might take these four steps in describing the example visualizations that appeared earlier in this chapter. Each of the steps is denoted by its numeral in parentheses, e.g. (3). Mauna Loa Atmospheric CO\\(_{\\text{2}}\\) Measurements: (1) Many current forms of energy generation and conversion—from automotive engines to natural gas power plants—rely on burning fossil fuels and produce greenhouse gases, typically primarily carbon dioxide (CO\\(_{\\text{2}}\\)), as a byproduct. Too much of these gases in the Earth’s atmosphere will cause it to trap more heat from the sun, leading to global warming. (2) In order to assess how quickly the atmospheric concentration of CO\\(_{\\text{2}}\\) is increasing over time, we (3) used a data set from the Mauna Loa observatory in Hawaii, consisting of CO\\(_{\\text{2}}\\) measurements from 1980 to 2020. We plotted the measured concentration of CO\\(_{\\text{2}}\\) (on the vertical axis) over time (on the horizontal axis). From this plot, you can see a clear, increasing, and generally linear trend over time. There is also a periodic oscillation that occurs once per year and aligns with Hawaii’s seasons, with an amplitude that is small relative to the growth in the overall trend. This shows that atmospheric CO\\(_{\\text{2}}\\) is clearly increasing over time, and (4) it is perhaps worth investigating more into the causes. Michelson Light Speed Experiments: (1) Our modern understanding of the physics of light has advanced significantly from the late 1800s when Michelson and Morley’s experiments first demonstrated that it had a finite speed. We now know, based on modern experiments, that it moves at roughly 299,792.458 kilometers per second. (2) But how accurately were we first able to measure this fundamental physical constant, and did certain experiments produce more accurate results than others? 
(3) To better understand this, we plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as histograms stacked on top of one another. The horizontal axis shows the accuracy of the measurements relative to the true speed of light as we know it today, expressed as a percentage. From this visualization, you can see that most results had relative errors of at most 0.05%. You can also see that experiments 1 and 3 had measurements that were the farthest from the true value, and experiment 5 tended to provide the most consistently accurate result. (4) It would be worth further investigating the differences between these experiments to see why they produced different results. 4.7 Saving the visualization Choose the right output format for your needs Just as there are many ways to store data sets, there are many ways to store visualizations and images. Which one you choose can depend on several factors, such as file size/type limitations (e.g., if you are submitting your visualization as part of a conference paper or to a poster printing shop) and where it will be displayed (e.g., online, in a paper, on a poster, on a billboard, in talk slides). Generally speaking, images come in two flavors: raster formats and vector formats. Raster images are represented as a 2-D grid of square pixels, each with its own color. Raster images are often compressed before storing so they take up less space. A compressed format is lossy if the image cannot be perfectly re-created when loading and displaying, with the hope that the change is not noticeable. Lossless formats, on the other hand, allow a perfect display of the original image. 
Common file types: JPEG (.jpg, .jpeg): lossy, usually used for photographs PNG (.png): lossless, usually used for plots / line drawings BMP (.bmp): lossless, raw image data, no compression (rarely used) TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing Open-source software: GIMP Vector images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas. Common file types: SVG (.svg): general-purpose use EPS (.eps), general-purpose use (rarely used) Open-source software: Inkscape Raster and vector images have opposing advantages and disadvantages. A raster image of a fixed width / height takes the same amount of space and time to load regardless of what the image shows (the one caveat is that the compression algorithms may shrink the image more or run faster for certain images). A vector image takes space and time to load corresponding to how complex the image is, since the computer has to draw all the elements each time it is displayed. For example, if you have a scatter plot with 1 million points stored as an SVG file, it may take your computer some time to open the image. On the other hand, you can zoom into / scale up vector graphics as much as you like without the image looking bad, while raster images eventually start to look “pixelated.” Note: The portable document format PDF (.pdf) is commonly used to store both raster and vector formats. If you try to open a PDF and it’s taking a long time to load, it may be because there is a complicated vector graphics image that your computer is rendering. Let’s learn how to save plot images to these different file formats using a scatter plot of the Old Faithful data set (Hardle 1991), shown in Figure 4.27. 
library(svglite) # we need this to save SVG files faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions)) + geom_point() + labs(x = "Waiting time to next eruption \\n (minutes)", y = "Eruption time \\n (minutes)") + theme(text = element_text(size = 12)) faithful_plot Figure 4.27: Scatter plot of waiting time and eruption time. Now that we have a named ggplot plot object, we can use the ggsave function to save a file containing this image. ggsave takes the name of the file to create as its first argument, which can include the path to the directory where you would like to save the file (e.g., img/viz/filename.png to save a file named filename.png to the img/viz/ directory), and the plot object to save as its second argument. The kind of image to save is specified by the file extension. For example, to create a PNG image file, we specify that the file extension is .png. Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types for the faithful_plot: ggsave("img/viz/faithful_plot.png", faithful_plot) ggsave("img/viz/faithful_plot.jpg", faithful_plot) ggsave("img/viz/faithful_plot.bmp", faithful_plot) ggsave("img/viz/faithful_plot.tiff", faithful_plot) ggsave("img/viz/faithful_plot.svg", faithful_plot) Table 4.1: File sizes of the scatter plot of the Old Faithful data set when saved as different file formats. Image type File type Image size Raster PNG 0.15 MB Raster JPG 0.42 MB Raster BMP 3.15 MB Raster TIFF 9.44 MB Vector SVG 0.03 MB Take a look at the file sizes in Table 4.1. Wow, that’s quite a difference! Notice that for such a simple plot with few graphical elements (points), the vector graphics format (SVG) is over 100 times smaller than the uncompressed raster images (BMP, TIFF). Also, note that the JPG format is nearly three times as large as the PNG format since the JPG compression algorithm is designed for natural images (not plots).
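To check file sizes like those in Table 4.1 on your own machine, one option is base R's file.size function. The sketch below saves the plot as a PNG to a temporary file (an assumption made so the example does not depend on the img/viz/ directory existing) and reports its size in megabytes. The width and height values are arbitrary, and exact sizes will vary with plot dimensions and resolution.

```r
library(ggplot2)

faithful_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

# temporary file path; a hypothetical stand-in for img/viz/faithful_plot.png
png_path <- tempfile(fileext = ".png")
ggsave(png_path, faithful_plot, width = 6, height = 4)

# file.size returns the size in bytes; convert to megabytes
png_mb <- file.size(png_path) / 1e6
png_mb
```

Repeating the last three steps with a different file extension gives a rough comparison across formats.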
In Figure 4.28, we also show what the images look like when we zoom in to a rectangle with only 2 data points. You can see why vector graphics formats are so useful: because they’re just based on mathematical formulas, vector graphics can be scaled up to arbitrary sizes. This makes them great for presentation media of all sizes, from papers to posters to billboards. Figure 4.28: Zoomed in faithful, raster (PNG, left) and vector (SVG, right) formats. 4.8 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Effective data visualization” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 4.9 Additional resources The ggplot2 R package page (Wickham, Chang, et al. 2021) is where you should look if you want to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the plotting functions from this chapter. The Fundamentals of Data Visualization (Wilke 2019) has a wealth of information on designing effective visualizations. It is not specific to any particular programming language or library. If you want to improve your visualization skills, this is the next place to look. R for Data Science (Wickham and Grolemund 2016) has a chapter on creating visualizations using ggplot2. This reference is specific to R and ggplot2, but provides a much more detailed introduction to the full set of tools that ggplot2 provides.
This chapter is where you should look if you want to learn how to make more intricate visualizations in ggplot2 than what is included in this chapter. The theme function documentation is an excellent reference to see how you can fine tune the non-data aspects of your visualization. R for Data Science (Wickham and Grolemund 2016) has a chapter on dates and times. This chapter is where you should look if you want to learn about date vectors, including how to create them, and how to use them to effectively handle durations, periods and intervals using the lubridate package. References "],["classification1.html", "Chapter 5 Classification I: training & predicting 5.1 Overview 5.2 Chapter learning objectives 5.3 The classification problem 5.4 Exploring a data set 5.5 Classification with K-nearest neighbors 5.6 K-nearest neighbors with tidymodels 5.7 Data preprocessing with tidymodels 5.8 Putting it together in a workflow 5.9 Exercises", " Chapter 5 Classification I: training & predicting 5.1 Overview In previous chapters, we focused solely on descriptive and exploratory data analysis questions. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. 5.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Recognize situations where a classifier would be appropriate for making predictions. Describe what a training data set is and how it is used in classification. 
Interpret the output of a classifier. Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. Explain the K-nearest neighbors classification algorithm. Perform K-nearest neighbors classification in R using tidymodels. Use a recipe to center, scale, balance, and impute data as a preprocessing step. Combine preprocessing and model training using a workflow. 5.3 The classification problem In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” based on the email’s text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other variables (sometimes called features). Generally, a classifier assigns an observation without a known class (e.g., a new patient) to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations for which we do know the class (e.g., previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set; this name comes from the fact that we use these data to train, or teach, our classifier. Once taught, we can use the classifier to make predictions on new data for which we do not know the class. There are many possible methods that we could use to predict a categorical class/label for an observation. In this book, we will focus on the widely used K-nearest neighbors algorithm (Fix and Hodges 1951; Cover and Hart 1967). 
In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. It is also worth mentioning that there are many variations on the basic classification problem. For example, we focus on the setting of binary classification where only two classes are involved (e.g., a diagnosis of either healthy or diseased), but you may also run into multiclass classification problems with more than two categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold). 5.4 Exploring a data set In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian (Street, Wolberg, and Mangasarian 1993). Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). Diagnosis for each image was conducted by physicians. As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor? Answering this question is important because traditional, non-data-driven methods for tumor diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumors are not normally dangerous; the cells stay in the same place, and the tumor stops growing before it gets very large. By contrast, in malignant tumors, the cells invade the surrounding tissue and spread into nearby organs, where they can cause serious damage (Stanford Health Care 2021). 
Thus, it is important to quickly and accurately diagnose the tumor type to guide patient treatment. 5.4.1 Loading the cancer data Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the tidyverse package needed for our analysis. library(tidyverse) In this case, the file containing the breast cancer data set is a .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents: cancer <- read_csv("data/wdbc.csv") cancer ## # A tibble: 569 × 12 ## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65 ## 2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238 ## 3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36 ## 4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91 ## 5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37 ## 6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866 ## 7 8.44e5 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300 ## 8 8.45e7 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610 ## 9 8.45e5 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22 ## 10 8.45e7 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74 ## # ℹ 559 more rows ## # ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>, ## # Fractal_Dimension <dbl> 5.4.2 Describing the variables in the cancer data set Breast tumors can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. 
Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been standardized (centered and scaled); we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is: ID: identification number Class: the diagnosis (M = malignant or B = benign) Radius: the mean of distances from center to points on the perimeter Texture: the standard deviation of gray-scale values Perimeter: the length of the surrounding contour Area: the area inside the contour Smoothness: the local variation in radius lengths Compactness: the ratio of squared perimeter and area Concavity: severity of concave portions of the contour Concave Points: the number of concave portions of the contour Symmetry: how similar the nucleus is when mirrored Fractal Dimension: a measurement of how “rough” the perimeter is Below we use glimpse to preview the data frame. This function can make it easier to inspect the data when we have a lot of columns, as it prints the data such that the columns go down the page (instead of across). 
glimpse(cancer) ## Rows: 569 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786… ## $ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M… ## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875… ## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150… ## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7… ## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8… ## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2… ## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5… ## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3… ## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42… ## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, … ## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0… From the summary of the data above, we can see that Class is of type character (denoted by <chr>). We can use the distinct function to see all the unique values present in that column. We see that there are two diagnoses: benign, represented by “B”, and malignant, represented by “M”. cancer |> distinct(Class) ## # A tibble: 2 × 1 ## Class ## <chr> ## 1 M ## 2 B Since we will be working with Class as a categorical variable, it is a good idea to convert it to a factor type using the as_factor function. We will also improve the readability of our analysis by renaming “M” to “Malignant” and “B” to “Benign” using the fct_recode function. The fct_recode function replaces the names of factor values with other names. Its arguments are the column that you want to modify, followed by any number of arguments of the form \"new name\" = \"old name\" to specify the renaming scheme.
cancer <- cancer |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) glimpse(cancer) ## Rows: 569 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786… ## $ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan… ## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875… ## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150… ## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7… ## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8… ## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2… ## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5… ## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3… ## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42… ## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, … ## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0… Let’s verify that we have successfully converted the Class column to a factor variable and renamed its values to “Benign” and “Malignant” using the distinct function once more. cancer |> distinct(Class) ## # A tibble: 2 × 1 ## Class ## <fct> ## 1 Malignant ## 2 Benign 5.4.3 Exploring the cancer data Before we start doing any modeling, let’s explore our data set. Below we use the group_by, summarize and n functions to find the number and percentage of benign and malignant tumor observations in our data set. The n function within summarize, when paired with group_by, counts the number of observations in each Class group. Then we calculate the percentage in each group by dividing by the total number of observations and multiplying by 100. We have 357 (63%) benign and 212 (37%) malignant tumor observations. 
num_obs <- nrow(cancer) cancer |> group_by(Class) |> summarize( count = n(), percentage = n() / num_obs * 100 ) ## # A tibble: 2 × 3 ## Class count percentage ## <fct> <int> <dbl> ## 1 Malignant 212 37.3 ## 2 Benign 357 62.7 Next, let’s draw a scatter plot to visualize the relationship between the perimeter and concavity variables. Rather than use ggplot's default palette, we select our own colorblind-friendly colors—\"darkorange\" for orange and \"steelblue\" for blue—and pass them as the values argument to the scale_color_manual function. perim_concav <- cancer |> ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.6) + labs(x = "Perimeter (standardized)", y = "Concavity (standardized)", color = "Diagnosis") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) perim_concav Figure 5.1: Scatter plot of concavity versus perimeter colored by diagnosis label. In Figure 5.1, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. In other words, benign observations tend to have lower concavity and perimeter values, and malignant ones tend to have larger values. Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumor class). We could compute the standardized perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? Based on the scatter plot, how might you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. 
Based on our visualization, it seems like it may be possible to make accurate predictions of the Class variable (i.e., a diagnosis) for tumor images with unknown diagnoses. 5.5 Classification with K-nearest neighbors In order to actually make predictions for new observations in practice, we will need a classification algorithm. In this book, we will use the K-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign or malignant), the K-nearest neighbors classifier generally finds the \\(K\\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. \\(K\\) is a number that we must choose in advance; for now, we will assume that someone has chosen \\(K\\) for us. We will cover how to choose \\(K\\) ourselves in the next chapter. To illustrate the concept of K-nearest neighbors classification, we will walk through an example. Suppose we have a new observation, with standardized perimeter of 2 and standardized concavity of 4, whose diagnosis “Class” is unknown. This new observation is depicted by the red, diamond point in Figure 5.2. Figure 5.2: Scatter plot of concavity versus perimeter with new observation represented as a red diamond. Figure 5.3 shows that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis. Figure 5.3: Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label. Suppose we have another new observation with standardized perimeter 0.2 and concavity of 3.3. Looking at the scatter plot in Figure 5.4, how would you classify this red, diamond observation? 
The nearest neighbor to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. Figure 5.4: Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label. To improve the prediction we can consider several neighboring points, say \\(K = 3\\), that are closest to the new observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. As shown in Figure 5.5, we see that the diagnoses of 2 of the 3 nearest neighbors to our new observation are malignant. Therefore we take majority vote and classify our new red, diamond observation as malignant. Figure 5.5: Scatter plot of concavity versus perimeter with three nearest neighbors. Here we chose the \\(K=3\\) nearest observations, but there is nothing special about \\(K=3\\). We could have used \\(K=4, 5\\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \\(K\\) in the next chapter. 5.5.1 Distance between points We decide which points are the \\(K\\) “nearest” to our new observation using the straight-line distance (we will often just refer to this as distance). Suppose we have two observations \\(a\\) and \\(b\\), each having two predictor variables, \\(x\\) and \\(y\\). Denote \\(a_x\\) and \\(a_y\\) to be the values of variables \\(x\\) and \\(y\\) for observation \\(a\\); \\(b_x\\) and \\(b_y\\) have similar definitions for observation \\(b\\). 
Then the straight-line distance between observation \\(a\\) and \\(b\\) on the x-y plane can be computed using the following formula: \\[\\mathrm{Distance} = \\sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}\\] To find the \\(K\\) nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the \\(K\\) observations corresponding to the \\(K\\) smallest distance values. For example, suppose we want to use \\(K=5\\) neighbors to classify a new observation with perimeter of 0 and concavity of 3.5, shown as a red diamond in Figure 5.6. Let’s calculate the distances between our new point and each of the observations in the training set to find the \\(K=5\\) neighbors that are nearest to our new point. You will see in the mutate step below, we compute the straight-line distance using the formula above: we square the differences between the two observations’ perimeter and concavity coordinates, add the squared differences, and then take the square root. In order to find the \\(K=5\\) nearest neighbors, we will use the slice_min function. Figure 5.6: Scatter plot of concavity versus perimeter with new observation represented as a red diamond. new_obs_Perimeter <- 0 new_obs_Concavity <- 3.5 cancer |> select(ID, Perimeter, Concavity, Class) |> mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) |> slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance ## # A tibble: 5 × 5 ## ID Perimeter Concavity Class dist_from_new ## <dbl> <dbl> <dbl> <fct> <dbl> ## 1 86409 0.241 2.65 Benign 0.881 ## 2 887181 0.750 2.87 Malignant 0.980 ## 3 899667 0.623 2.54 Malignant 1.14 ## 4 907914 0.417 2.31 Malignant 1.26 ## 5 8710441 -1.16 4.04 Benign 1.28 In Table 5.1 we show in mathematical detail how the mutate step was used to compute the dist_from_new variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data. 
Table 5.1: Evaluating the distances from the new observation to each of its 5 nearest neighbors Perimeter Concavity Distance Class 0.24 2.65 \\(\\sqrt{(0 - 0.24)^2 + (3.5 - 2.65)^2} = 0.88\\) Benign 0.75 2.87 \\(\\sqrt{(0 - 0.75)^2 + (3.5 - 2.87)^2} = 0.98\\) Malignant 0.62 2.54 \\(\\sqrt{(0 - 0.62)^2 + (3.5 - 2.54)^2} = 1.14\\) Malignant 0.42 2.31 \\(\\sqrt{(0 - 0.42)^2 + (3.5 - 2.31)^2} = 1.26\\) Malignant -1.16 4.04 \\(\\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} = 1.28\\) Benign The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in Figure 5.7. Figure 5.7: Scatter plot of concavity versus perimeter with 5 nearest neighbors circled. 5.5.2 More than two explanatory variables Although the above description is directed toward two predictor variables, exactly the same K-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have \\(m\\) predictor variables for two observations \\(a\\) and \\(b\\), i.e., \\(a = (a_{1}, a_{2}, \\dots, a_{m})\\) and \\(b = (b_{1}, b_{2}, \\dots, b_{m})\\). The distance formula becomes \\[\\mathrm{Distance} = \\sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \\dots + (a_{m} - b_{m})^2}.\\] This formula still corresponds to a straight-line distance, just in a space with more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. 
Previously, when we had two variables, we added up the squared difference between each of our (two) variables, and then took the square root. Now we will do the same, except for our three variables. We calculate the distance as follows \\[\\mathrm{Distance} =\\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.\\] Let’s calculate the distances between our new observation and each of the observations in the training set to find the \\(K=5\\) neighbors when we have these three predictors. new_obs_Perimeter <- 0 new_obs_Concavity <- 3.5 new_obs_Symmetry <- 1 cancer |> select(ID, Perimeter, Concavity, Symmetry, Class) |> mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2 + (Symmetry - new_obs_Symmetry)^2)) |> slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance ## # A tibble: 5 × 6 ## ID Perimeter Concavity Symmetry Class dist_from_new ## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> ## 1 907914 0.417 2.31 0.837 Malignant 1.27 ## 2 90439701 1.33 2.89 1.10 Malignant 1.47 ## 3 925622 0.470 2.08 1.15 Malignant 1.50 ## 4 859471 -1.37 2.81 1.09 Benign 1.53 ## 5 899667 0.623 2.54 2.06 Malignant 1.56 Based on \\(K=5\\) nearest neighbors with these three predictors, we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are from the malignant class. Figure 5.8 shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. Figure 5.8: 3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes. 
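The distance formula and majority vote described in this section can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by any package: euclidean_distance and knn_classify are hypothetical helper names introduced here, and the toy training set simply reuses the five nearest neighbors from Table 5.1.

```r
# Straight-line (Euclidean) distance between two numeric vectors of equal
# length; works for any number of predictor variables.
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

# Hypothetical helper: classify one new observation by a majority vote
# among its k nearest training observations.
#   train_x: numeric matrix with one row per training observation
#   train_y: class labels, one per row of train_x
#   new_x:   numeric vector of predictor values for the new observation
knn_classify <- function(train_x, train_y, new_x, k) {
  distances <- apply(train_x, 1, euclidean_distance, b = new_x)
  nearest <- order(distances)[seq_len(k)]
  votes <- table(as.character(train_y[nearest]))
  # On a tie, which.max picks the first label in alphabetical order
  names(votes)[which.max(votes)]
}

# The three-predictor distance computed in the text
round(euclidean_distance(c(0, 3.5, 1), c(0.417, 2.31, 0.837)), 2)  # 1.27

# Toy training set: the five neighbors listed in Table 5.1
train_x <- rbind(c(0.241, 2.65), c(0.750, 2.87), c(0.623, 2.54),
                 c(0.417, 2.31), c(-1.16, 4.04))
train_y <- c("Benign", "Malignant", "Malignant", "Malignant", "Benign")

knn_classify(train_x, train_y, new_x = c(0, 3.5), k = 5)  # "Malignant"
```

With k = 1 the same call returns "Benign", because the single nearest neighbor is benign; this mirrors the earlier point that a single neighbor can give a misleading prediction.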
5.5.3 Summary of K-nearest neighbors algorithm In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following: Compute the distance between the new observation and each observation in the training set. Sort the data table in ascending order according to the distances. Choose the top \\(K\\) rows of the sorted table. Classify the new observation based on a majority vote of the neighbor classes. 5.6 K-nearest neighbors with tidymodels Coding the K-nearest neighbors algorithm in R ourselves can get complicated, especially if we want to handle multiple classes, more than two variables, or predict the class for multiple new observations. Thankfully, in R, the K-nearest neighbors algorithm is implemented in the parsnip R package (Kuhn and Vaughan 2021) included in tidymodels, along with many other models that you will encounter in this and future chapters of the book. The tidymodels collection provides tools to help make and use models, such as classifiers. Using the packages in this collection will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we will likely make. We start by loading tidymodels. library(tidymodels) Let’s walk through how to use tidymodels to perform K-nearest neighbors classification. We will use the cancer data set from above, with perimeter and concavity as predictors and \\(K = 5\\) neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with perimeter 0, concavity 3.5, and an unknown diagnosis label. 
Let’s pick out our two desired predictor variables and class label and store them as a new data set named cancer_train: cancer_train <- cancer |> select(Class, Perimeter, Concavity) cancer_train ## # A tibble: 569 × 3 ## Class Perimeter Concavity ## <fct> <dbl> <dbl> ## 1 Malignant 1.27 2.65 ## 2 Malignant 1.68 -0.0238 ## 3 Malignant 1.57 1.36 ## 4 Malignant -0.592 1.91 ## 5 Malignant 1.78 1.37 ## 6 Malignant -0.387 0.866 ## 7 Malignant 1.14 0.300 ## 8 Malignant -0.0728 0.0610 ## 9 Malignant -0.184 1.22 ## 10 Malignant -0.329 1.74 ## # ℹ 559 more rows Next, we create a model specification for K-nearest neighbors classification by calling the nearest_neighbor function, specifying that we want to use \\(K = 5\\) neighbors (we will discuss how to choose \\(K\\) in the next chapter) and that each neighboring point should have the same weight when voting (weight_func = \"rectangular\"). The weight_func argument controls how neighbors vote when classifying a new observation; by setting it to \"rectangular\", each of the \\(K\\) nearest neighbors gets exactly 1 vote as described above. Other choices, which weigh each neighbor’s vote differently, can be found on the parsnip website. With the set_engine function, we specify which package or system will be used for training the model. Here kknn is the R package we will use for performing K-nearest neighbors classification. Finally, we specify that this is a classification problem with the set_mode function. knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |> set_engine("kknn") |> set_mode("classification") knn_spec ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 5 ## weight_func = rectangular ## ## Computational engine: kknn In order to fit the model on the breast cancer data, we need to pass the model specification and the data set to the fit function. We also need to specify what variables to use as predictors and what variable to use as the response.
Below, the Class ~ Perimeter + Concavity argument specifies that Class is the response variable (the one we want to predict), and both Perimeter and Concavity are to be used as the predictors. knn_fit <- knn_spec |> fit(Class ~ Perimeter + Concavity, data = cancer_train) We can also use a convenient shorthand syntax using a period, Class ~ ., to indicate that we want to use every variable except Class as a predictor in the model. In this particular setup, since Concavity and Perimeter are the only two predictors in the cancer_train data frame, Class ~ Perimeter + Concavity and Class ~ . are equivalent. In general, you can choose individual predictors using the + symbol, or you can specify to use all predictors using the . symbol. knn_fit <- knn_spec |> fit(Class ~ ., data = cancer_train) knn_fit ## parsnip model object ## ## ## Call: ## kknn::train.kknn(formula = Class ~ ., data = data, ks = min_rows(5, data, 5) ## , kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.07557118 ## Best kernel: rectangular ## Best k: 5 Here you can see the final trained model summary. It confirms that the computational engine used to train the model was kknn::train.kknn. It also shows the fraction of errors made by the K-nearest neighbors model, but we will ignore this for now and discuss it in more detail in the next chapter. Finally, it shows (somewhat confusingly) that the “best” weight function was “rectangular” and “best” setting of \\(K\\) was 5; but since we specified these earlier, R is just repeating those settings to us here. In the next chapter, we will actually let R find the value of \\(K\\) for us. Finally, we make the prediction on the new observation by calling the predict function, passing both the fit object we just created and the new observation itself. As above, when we ran the K-nearest neighbors classification algorithm manually, the knn_fit object classifies the new observation as malignant. 
Note that the predict function outputs a data frame with a single variable named .pred_class. new_obs <- tibble(Perimeter = 0, Concavity = 3.5) predict(knn_fit, new_obs) ## # A tibble: 1 × 1 ## .pred_class ## <fct> ## 1 Malignant Is this predicted malignant label the actual class for this observation? Well, we don’t know because we do not have this observation’s diagnosis— that is what we were trying to predict! The classifier’s prediction is not necessarily correct, but in the next chapter, we will learn ways to quantify how accurate we think our predictions are. 5.7 Data preprocessing with tidymodels 5.7.1 Centering and scaling When using K-nearest neighbors classification, the scale of each variable (i.e., its size and range of values) matters. Since the classifier predicts classes by identifying observations nearest to it, any variables with a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. For example, suppose you have a data set with two features, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the neighbor distances, a difference of $1000 is huge compared to a difference of 10 years of education. But for our conceptual understanding and answering of the problem, it’s the opposite; 10 years of education is huge compared to a difference of $1000 in yearly salary! In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in degrees Kelvin, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273 (even though they contain exactly the same information). 
Likewise, in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. Although this doesn’t affect the K-nearest neighbors classification algorithm, this large shift can change the outcome of using many other predictive models. To scale and center our data, we need to find our variables’ mean (the average, which quantifies the “central” value of a set of numbers) and standard deviation (a number quantifying how spread out values are). For each observed value of the variable, we subtract the mean (i.e., center the variable) and divide by the standard deviation (i.e., scale the variable). When we do this, the data is said to be standardized, and all variables in a data set will have a mean of 0 and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest neighbors algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. As before, we will convert the Class variable to the factor type and rename the values to “Malignant” and “Benign.” To keep things simple, we will just use the Area, Smoothness, and Class variables: unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Benign" = "B", "Malignant" = "M")) |> select(Class, Area, Smoothness) unscaled_cancer ## # A tibble: 569 × 3 ## Class Area Smoothness ## <fct> <dbl> <dbl> ## 1 Malignant 1001 0.118 ## 2 Malignant 1326 0.0847 ## 3 Malignant 1203 0.110 ## 4 Malignant 386. 0.142 ## 5 Malignant 1297 0.100 ## 6 Malignant 477. 0.128 ## 7 Malignant 1040 0.0946 ## 8 Malignant 578. 0.119 ## 9 Malignant 520. 0.127 ## 10 Malignant 476. 
0.119 ## # ℹ 559 more rows Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for smoothness. Will this affect predictions? In order to find out, we will create a scatter plot of these two predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the unscaled_cancer data set with tidymodels. In the tidymodels framework, all data preprocessing happens using a recipe from the recipes R package (Kuhn and Wickham 2021). Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the response, and all other variables are predictors: uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) uc_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 2 So far, there is not much in the recipe; just a statement about the number of response variables and predictors. Let’s add scaling (step_scale) and centering (step_center) steps for all of the predictors so that they each have a mean of 0 and standard deviation of 1. Note that tidymodels actually provides step_normalize, which does both centering and scaling in a single recipe step; in this book we will keep step_scale and step_center separate to emphasize conceptually that there are two steps happening. The prep function finalizes the recipe by using the data (here, unscaled_cancer) to compute anything necessary to run the recipe (in this case, the column means and standard deviations): uc_recipe <- uc_recipe |> step_scale(all_predictors()) |> step_center(all_predictors()) |> prep() uc_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 2 ## ## ── Training information ## Training data contained 569 data points and no incomplete rows.
## ## ── Operations ## • Scaling for: Area, Smoothness | Trained ## • Centering for: Area, Smoothness | Trained You can now see that the recipe includes a scaling and centering step for all predictor variables. Note that when you add a step to a recipe, you must specify what columns to apply the step to. Here we used the all_predictors() function to specify that each step should be applied to all predictor variables. However, there are a number of different arguments one could use here, as well as naming particular columns with the same syntax as the select function. For example: all_nominal() and all_numeric(): specify all categorical or all numeric variables all_predictors() and all_outcomes(): specify all predictor or all response variables Area, Smoothness: specify both the Area and Smoothness variable -Class: specify everything except the Class variable You can find a full set of all the steps and variable selection functions on the recipes reference page. At this point, we have calculated the required statistics based on the data input into the recipe, but the data are not yet scaled and centered. To actually scale and center the data, we need to apply the bake function to the unscaled data. scaled_cancer <- bake(uc_recipe, unscaled_cancer) scaled_cancer ## # A tibble: 569 × 3 ## Area Smoothness Class ## <dbl> <dbl> <fct> ## 1 0.984 1.57 Malignant ## 2 1.91 -0.826 Malignant ## 3 1.56 0.941 Malignant ## 4 -0.764 3.28 Malignant ## 5 1.82 0.280 Malignant ## 6 -0.505 2.24 Malignant ## 7 1.09 -0.123 Malignant ## 8 -0.219 1.60 Malignant ## 9 -0.384 2.20 Malignant ## 10 -0.509 1.58 Malignant ## # ℹ 559 more rows It may seem redundant that we had to both bake and prep to scale and center the data. However, we do this in two steps so we can specify a different data set in the bake step if we want. For example, we may want to specify new data that were not part of the training set. You may wonder why we are doing so much work just to center and scale our variables. 
Can’t we just manually scale and center the Area and Smoothness variables ourselves before building our K-nearest neighbors model? Well, technically yes; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a different centering / scaling than what we used while training. Proper use of a recipe helps keep our code simple, readable, and error-free. Furthermore, note that using prep and bake is required only when you want to inspect the result of the preprocessing steps yourself. You will see further on in Section 5.8 that tidymodels provides tools to automatically apply prep and bake as necessary without additional coding effort. Figure 5.9 shows the two scatter plots side-by-side—one for unscaled_cancer and one for scaled_cancer. Each has the same new observation annotated with its \\(K=3\\) nearest neighbors. In the original unstandardized data plot, you can see some odd choices for the three nearest neighbors. In particular, the “neighbors” are visually well within the cloud of benign observations, and the neighbors are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). Figure 5.10 shows a close-up of that region on the unstandardized plot. Here the computation of nearest neighbors is dominated by the much larger-scale area variable. The plot for standardized data on the right in Figure 5.9 shows a much more intuitively reasonable selection of nearest neighbors. Thus, standardizing the data can change things in an important way when we are using predictive algorithms. Standardizing your data should be a part of the preprocessing you do before predictive modeling and you should always think carefully about your problem domain and whether you need to standardize your data. Figure 5.9: Comparison of K = 3 nearest neighbors with unstandardized and standardized data. 
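The centering and scaling arithmetic described in this section can be checked by hand on a small example. The following is a minimal sketch in base R, not the recipe code used in this chapter; the vector area holds made-up values chosen only to resemble tumor areas, and the names area and area_std are hypothetical.

```r
# Hypothetical area values, for illustration only (not read from the data file)
area <- c(1001, 1326, 1203, 386, 1297)

# Center (subtract the mean), then scale (divide by the standard deviation)
area_std <- (area - mean(area)) / sd(area)

# After standardization the mean is (numerically) 0 and the standard deviation is 1
mean(area_std)
sd(area_std)

# Base R's scale() performs the same centering and scaling by default
all.equal(area_std, as.numeric(scale(area)))
```

Doing this by hand is fine for inspecting a single variable, but as noted above, a recipe is the safer choice once the same transformation must be applied consistently to training data and new observations.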
Figure 5.10: Close-up of three nearest neighbors for unstandardized data. 5.7.2 Balancing Another potential issue in a data set for a classifier is class imbalance, i.e., when one label is much more common than another. Since classifiers like the K-nearest neighbors algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the “pattern” of data suggests otherwise). Class imbalance is actually quite a common and important problem: from rare disease diagnosis to malicious email detection, there are many cases in which the “important” class to identify (presence of disease, malicious email) is much rarer than the “unimportant” class (no disease, normal email). To better illustrate the problem, let’s revisit the scaled breast cancer data, cancer; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the slice_head function, which takes two arguments: a data frame-like object, and the number of rows to select from the top (n). We will use the bind_rows function to glue the two resulting filtered data frames back together, and name the result rare_cancer. The new imbalanced data is shown in Figure 5.11. 
rare_cancer <- bind_rows( filter(cancer, Class == "Benign"), cancer |> filter(Class == "Malignant") |> slice_head(n = 3) ) |> select(Class, Perimeter, Concavity) rare_plot <- rare_cancer |> ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(x = "Perimeter (standardized)", y = "Concavity (standardized)", color = "Diagnosis") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) rare_plot Figure 5.11: Imbalanced data. Suppose we now decided to use \\(K = 7\\) in K-nearest neighbors classification. With only 3 observations of malignant tumors, the classifier will always predict that the tumor is benign, no matter what its concavity and perimeter are! This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, Figure 5.12 shows what happens for a new tumor observation that is quite close to three observations in the training data that were tagged as malignant. Figure 5.12: Imbalanced data with 7 nearest neighbors to a new observation highlighted. Figure 5.13 shows what happens if we set the background color of each area of the plot to the prediction the K-nearest neighbors classifier would make for a new observation at that location. We can see that the decision is always “benign,” corresponding to the blue color. Figure 5.13: Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data. Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by oversampling the rare class. 
In other words, we will replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbors algorithm. In order to do this, we will build a new recipe, ups_recipe, for the rare_cancer data that includes an oversampling step from the step_upsample function in the themis R package. We show below how to do this, and also use the group_by and summarize functions to see that our classes are now balanced: library(themis) ups_recipe <- recipe(Class ~ ., data = rare_cancer) |> step_upsample(Class, over_ratio = 1, skip = FALSE) |> prep() ups_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 2 ## ## ── Training information ## Training data contained 360 data points and no incomplete rows. ## ## ── Operations ## • Up-sampling based on: Class | Trained upsampled_cancer <- bake(ups_recipe, rare_cancer) upsampled_cancer |> group_by(Class) |> summarize(n = n()) ## # A tibble: 2 × 2 ## Class n ## <fct> <int> ## 1 Malignant 357 ## 2 Benign 357 Now suppose we train our K-nearest neighbors classifier with \\(K=7\\) on this balanced data. Figure 5.14 shows what happens now when we set the background color of each area of our scatter plot to the decision the K-nearest neighbors classifier would make. We can see that the decision is more reasonable; when the points are close to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are closer to the benign tumor observations. Figure 5.14: Upsampled data with background color indicating the decision of the classifier. 5.7.3 Missing data One of the most common issues in real data sets in the wild is missing data, i.e., observations where the values of some of the variables were not recorded. Unfortunately, as common as it is, handling missing data properly is very challenging and generally relies on expert knowledge about the data, setting, and how the data were collected.
One typical challenge with missing data is that missing entries can be informative: the very fact that an entry is missing is related to the values of other variables. For example, survey participants from a marginalized group of people may be less likely to respond to certain kinds of questions if they fear that answering honestly will come with negative consequences. In that case, if we were to simply throw away data with missing entries, we would bias the conclusions of the survey by inadvertently removing many members of that group of respondents. So ignoring this issue in real problems can easily lead to misleading analyses, with detrimental impacts. In this book, we will cover only those techniques for dealing with missing entries in situations where missing entries are just “randomly missing”, i.e., where the fact that certain entries are missing isn’t related to anything else about the observation. Let’s load and examine a modified subset of the tumor image data that has a few missing entries: missing_cancer <- read_csv("data/wdbc_missing.csv") |> select(Class, Radius, Texture, Perimeter) |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) missing_cancer ## # A tibble: 7 × 4 ## Class Radius Texture Perimeter ## <fct> <dbl> <dbl> <dbl> ## 1 Malignant NA NA 1.27 ## 2 Malignant 1.83 -0.353 1.68 ## 3 Malignant 1.58 NA 1.57 ## 4 Malignant -0.768 0.254 -0.592 ## 5 Malignant 1.75 -1.15 1.78 ## 6 Malignant -0.476 -0.835 -0.387 ## 7 Malignant 1.17 0.161 1.14 Recall that K-nearest neighbors classification makes predictions by computing the straight-line distance to nearby training observations, and hence requires access to the values of all variables for all observations in the training data. So how can we perform K-nearest neighbors classification in the presence of missing data?
Well, since there are not too many observations with missing entries, one option is to simply remove those observations prior to building the K-nearest neighbors classifier. We can accomplish this by using the drop_na function from tidyverse prior to working with the data. no_missing_cancer <- missing_cancer |> drop_na() no_missing_cancer ## # A tibble: 5 × 4 ## Class Radius Texture Perimeter ## <fct> <dbl> <dbl> <dbl> ## 1 Malignant 1.83 -0.353 1.68 ## 2 Malignant -0.768 0.254 -0.592 ## 3 Malignant 1.75 -1.15 1.78 ## 4 Malignant -0.476 -0.835 -0.387 ## 5 Malignant 1.17 0.161 1.14 However, this strategy will not work when many of the rows have missing entries, as we may end up throwing away too much data. In this case, another possible approach is to impute the missing entries, i.e., fill in synthetic values based on the other observations in the data set. One reasonable choice is to perform mean imputation, where missing entries are filled in using the mean of the present entries in each variable. To perform mean imputation, we add the step_impute_mean step to the tidymodels preprocessing recipe. impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |> step_impute_mean(all_predictors()) |> prep() impute_missing_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## outcome: 1 ## predictor: 3 ## ## ── Training information ## Training data contained 7 data points and 2 incomplete rows. ## ## ── Operations ## • Mean imputation for: Radius, Texture, Perimeter | Trained To visualize what mean imputation does, let’s just apply the recipe directly to the missing_cancer data frame using the bake function. The imputation step fills in the missing entries with the mean values of their corresponding variables. 
imputed_cancer <- bake(impute_missing_recipe, missing_cancer) imputed_cancer ## # A tibble: 7 × 4 ## Radius Texture Perimeter Class ## <dbl> <dbl> <dbl> <fct> ## 1 0.847 -0.385 1.27 Malignant ## 2 1.83 -0.353 1.68 Malignant ## 3 1.58 -0.385 1.57 Malignant ## 4 -0.768 0.254 -0.592 Malignant ## 5 1.75 -1.15 1.78 Malignant ## 6 -0.476 -0.835 -0.387 Malignant ## 7 1.17 0.161 1.14 Malignant Many other options for missing data imputation can be found in the recipes documentation. However you decide to handle missing data in your data analysis, it is always crucial to think critically about the setting, how the data were collected, and the question you are answering. 5.8 Putting it together in a workflow The tidymodels package collection also provides the workflow, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole pipeline, let’s start from scratch with the wdbc_unscaled.csv data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed: # load the unscaled cancer data # and make sure the response variable, Class, is a factor unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |> mutate(Class = as_factor(Class)) |> mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) # create the K-NN model knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |> set_engine("kknn") |> set_mode("classification") # create the centering / scaling recipe uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |> step_scale(all_predictors()) |> step_center(all_predictors()) Note that each of these steps is exactly the same as earlier, except for one major difference: we did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the formula Class ~ Area + Smoothness (instead of Class ~ .) in the recipe. 
You will also notice that we did not call prep() on the recipe; this is unnecessary when it is placed in a workflow. We will now place these steps in a workflow using the add_recipe and add_model functions, and finally we will use the fit function to run the whole workflow on the unscaled_cancer data. Note another difference from earlier here: we do not include a formula in the fit function. This is again because we included the formula in the recipe, so there is no need to respecify it: knn_fit <- workflow() |> add_recipe(uc_recipe) |> add_model(knn_spec) |> fit(data = unscaled_cancer) knn_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## ## Call: ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(7, data, 5), ## kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.112478 ## Best kernel: rectangular ## Best k: 7 As before, the fit object lists the function that trains the model as well as the “best” settings for the number of neighbors and weight function (for now, these are just the values we chose manually when we created knn_spec above). But now the fit object also includes information about the overall workflow, including the centering and scaling preprocessing steps. In other words, when we use the predict function with the knn_fit object to make a prediction for a new observation, it will first apply the same recipe steps to the new observation. As an example, we will predict the class label of two new observations: one with Area = 500 and Smoothness = 0.075, and one with Area = 1500 and Smoothness = 0.1. 
new_observation <- tibble(Area = c(500, 1500), Smoothness = c(0.075, 0.1)) prediction <- predict(knn_fit, new_observation) prediction ## # A tibble: 2 × 1 ## .pred_class ## <fct> ## 1 Benign ## 2 Malignant The classifier predicts that the first observation is benign, while the second is malignant. Figure 5.15 visualizes the predictions that this trained K-nearest neighbors model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. For the interested reader who wants a learning challenge, we now include it below. The basic idea is to create a grid of synthetic new observations using the expand.grid function, predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low alpha value) and large point radius. See if you can figure out what each line is doing! Note: Understanding this code is not required for the remainder of the textbook. It is included for those readers who would like to use similar visualizations in their own data analyses. # create the grid of area/smoothness vals, and arrange in a data frame are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100) smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100) asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid)) # use the fit workflow to make predictions at the grid points knnPredGrid <- predict(knn_fit, asgrid) # bind the predictions as a new column with the grid points prediction_table <- bind_cols(knnPredGrid, asgrid) |> rename(Class = .pred_class) # plot: # 1. the colored scatter of the original data # 2. 
the faded colored scatter for the grid points wkflw_plot <- ggplot() + geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.75) + geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha = 0.02, size = 5) + labs(color = "Diagnosis", x = "Area", y = "Smoothness") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) wkflw_plot Figure 5.15: Scatter plot of smoothness versus area where background color indicates the decision of the classifier. 5.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Classification I: training and predicting” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["classification2.html", "Chapter 6 Classification II: evaluation & tuning 6.1 Overview 6.2 Chapter learning objectives 6.3 Evaluating performance 6.4 Randomness and seeds 6.5 Evaluating performance with tidymodels 6.6 Tuning the classifier 6.7 Summary 6.8 Predictor variable selection 6.9 Exercises 6.10 Additional resources", " Chapter 6 Classification II: evaluation & tuning 6.1 Overview This chapter continues the introduction to predictive modeling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to evaluate the performance of a classifier, as well as how to improve the classifier (where possible) to maximize its accuracy. 
6.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe what training, validation, and test data sets are and how they are used in classification. Split data into training, validation, and test data sets. Describe what a random seed is and its importance in reproducible data analysis. Set the random seed in R using the set.seed function. Describe and interpret accuracy, precision, recall, and confusion matrices. Evaluate classification accuracy, precision, and recall in R using a test set, a single validation set, and cross-validation. Produce a confusion matrix in R. Choose the number of neighbors in a K-nearest neighbors classifier by maximizing estimated cross-validation accuracy. Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors classification. Describe the advantages and disadvantages of the K-nearest neighbors classification algorithm. 6.3 Evaluating performance Sometimes our classifier might make the wrong prediction. A classifier does not need to be right 100% of the time to be useful, though we don’t want the classifier to make too many wrong predictions. How do we measure how “good” our classifier is? Let’s revisit the breast cancer images data (Street, Wolberg, and Mangasarian 1993) and think about how our classifier will be used in practice. A biopsy will be performed on a new patient’s tumor, the resulting image will be analyzed, and the classifier will be asked to decide whether the tumor is benign or malignant. The key word here is new: our classifier is “good” if it provides accurate predictions on data not seen during training, as this implies that it has actually learned about the relationship between the predictor variables and response variable, as opposed to simply memorizing the labels of individual training data examples. But then, how can we evaluate our classifier without visiting the hospital to collect more tumor images? 
The trick is to split the data into a training set and test set (Figure 6.1) and use only the training set when building the classifier. Then, to evaluate the performance of the classifier, we first set aside the labels from the test set, and then use the classifier to predict the labels in the test set. If our predictions match the actual labels for the observations in the test set, then we have some confidence that our classifier might also accurately predict the class labels for new observations without known class labels. Note: If there were a golden rule of machine learning, it might be this: you cannot use the test data to build the model! If you do, the model gets to “see” the test data in advance, making it look more accurate than it really is. Imagine how bad it would be to overestimate your classifier’s accuracy when predicting whether a patient’s tumor is malignant or benign! Figure 6.1: Splitting the data into training and testing sets. How exactly can we assess how well our predictions match the actual labels for the observations in the test set? One way we can do this is to calculate the prediction accuracy. This is the fraction of examples for which the classifier made the correct prediction. To calculate this, we divide the number of correct predictions by the number of predictions made. The process for assessing if our predictions match the actual labels in the test set is illustrated in Figure 6.2. \\[\\mathrm{accuracy} = \\frac{\\mathrm{number \\; of \\; correct \\; predictions}}{\\mathrm{total \\; number \\; of \\; predictions}}\\] Figure 6.2: Process for splitting the data and finding the prediction accuracy. Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with a single number. But prediction accuracy by itself does not tell the whole story. 
In particular, accuracy alone only tells us how often the classifier makes mistakes in general, but does not tell us anything about the kinds of mistakes the classifier makes. A more comprehensive view of performance can be obtained by additionally examining the confusion matrix. The confusion matrix shows how many test set labels of each type are predicted correctly and incorrectly, which gives us more detail about the kinds of mistakes the classifier tends to make. Table 6.1 shows an example of what a confusion matrix might look like for the tumor image data with a test set of 65 observations. Table 6.1: An example confusion matrix for the tumor image data. Actually Malignant Actually Benign Predicted Malignant 1 4 Predicted Benign 3 57 In the example in Table 6.1, we see that there was 1 malignant observation that was correctly classified as malignant (top left corner), and 57 benign observations that were correctly classified as benign (bottom right corner). However, we can also see that the classifier made some mistakes: it classified 3 malignant observations as benign, and 4 benign observations as malignant. The accuracy of this classifier is roughly 89%, given by the formula \\[\\mathrm{accuracy} = \\frac{\\mathrm{number \\; of \\; correct \\; predictions}}{\\mathrm{total \\; number \\; of \\; predictions}} = \\frac{1+57}{1+57+4+3} = 0.892.\\] But we can also see that the classifier only identified 1 out of 4 total malignant tumors; in other words, it misclassified 75% of the malignant cases present in the data set! In this example, misclassifying a malignant tumor is a potentially disastrous error, since it may lead to a patient who requires treatment not receiving it. Since we are particularly interested in identifying malignant cases, this classifier would likely be unacceptable even with an accuracy of 89%. Focusing more on one label than the other is common in classification problems. 
In such cases, we typically refer to the label we are more interested in identifying as the positive label, and the other as the negative label. In the tumor example, we would refer to malignant observations as positive, and benign observations as negative. We can then use the following terms to talk about the four kinds of prediction that the classifier can make, corresponding to the four entries in the confusion matrix: True Positive: A malignant observation that was classified as malignant (top left in Table 6.1). False Positive: A benign observation that was classified as malignant (top right in Table 6.1). True Negative: A benign observation that was classified as benign (bottom right in Table 6.1). False Negative: A malignant observation that was classified as benign (bottom left in Table 6.1). A perfect classifier would have zero false negatives and false positives (and therefore, 100% accuracy). However, classifiers in practice will almost always make some errors. So you should think about which kinds of error are most important in your application, and use the confusion matrix to quantify and report them. Two commonly used metrics that we can compute using the confusion matrix are the precision and recall of the classifier. These are often reported together with accuracy. Precision quantifies how many of the positive predictions the classifier made were actually positive. Intuitively, we would like a classifier to have a high precision: for a classifier with high precision, if the classifier reports that a new observation is positive, we can trust that the new observation is indeed positive. We can compute the precision of a classifier using the entries in the confusion matrix, with the formula \\[\\mathrm{precision} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; predictions}}.\\] Recall quantifies how many of the positive observations in the test set were identified as positive. 
Intuitively, we would like a classifier to have a high recall: for a classifier with high recall, if there is a positive observation in the test data, we can trust that the classifier will find it. We can also compute the recall of the classifier using the entries in the confusion matrix, with the formula \\[\\mathrm{recall} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; test \\; set \\; observations}}.\\] In the example presented in Table 6.1, we have that the precision and recall are \\[\\mathrm{precision} = \\frac{1}{1+4} = 0.20, \\quad \\mathrm{recall} = \\frac{1}{1+3} = 0.25.\\] So even with an accuracy of 89%, the precision and recall of the classifier were both relatively low. For this data analysis context, recall is particularly important: if someone has a malignant tumor, we certainly want to identify it. A recall of just 25% would likely be unacceptable! Note: It is difficult to achieve both high precision and high recall at the same time; models with high precision tend to have low recall and vice versa. As an example, we can easily make a classifier that has perfect recall: just always guess positive! This classifier will of course find every positive observation in the test set, but it will make lots of false positive predictions along the way and have low precision. Similarly, we can easily make a classifier that has perfect precision: never guess positive! This classifier will never incorrectly identify an observation as positive, but it will make a lot of false negative predictions along the way. In fact, this classifier will have 0% recall! Of course, most real classifiers fall somewhere in between these two extremes. But these examples serve to show that in settings where one of the classes is of interest (i.e., there is a positive label), there is a trade-off between precision and recall that one has to make when designing a classifier.
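The precision and recall arithmetic above can be reproduced with a few lines of base R. This is only a minimal sketch: the variable names are hypothetical, the three counts are the ones used in the Table 6.1 formulas, and the toy labels in the second half are made up purely to illustrate the "always guess positive" classifier.

```r
# Counts taken from the worked example (Table 6.1)
true_positives  <- 1  # malignant observations classified as malignant
false_positives <- 4  # benign observations classified as malignant
false_negatives <- 3  # malignant observations classified as benign

# precision = correct positive predictions / all positive predictions
precision_value <- true_positives / (true_positives + false_positives)

# recall = correct positive predictions / all positive observations
recall_value <- true_positives / (true_positives + false_negatives)

precision_value
recall_value

# Made-up toy labels (an assumption, not the book's data) showing that
# always guessing "Malignant" gives perfect recall but low precision
toy_truth <- c("Malignant", "Benign", "Benign", "Benign", "Malignant")
toy_guess <- rep("Malignant", length(toy_truth))

toy_tp <- sum(toy_guess == "Malignant" & toy_truth == "Malignant")
toy_fp <- sum(toy_guess == "Malignant" & toy_truth == "Benign")
toy_fn <- sum(toy_guess == "Benign" & toy_truth == "Malignant")

toy_precision <- toy_tp / (toy_tp + toy_fp)
toy_recall    <- toy_tp / (toy_tp + toy_fn)
```

With the Table 6.1 counts, the two values match the 0.20 and 0.25 computed by hand; with the toy labels, recall is 1 while precision is only the fraction of observations that are truly malignant.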
6.4 Randomness and seeds Beginning in this chapter, our data analyses will often involve the use of randomness. We use randomness any time we need to make a decision in our analysis that needs to be fair, unbiased, and not influenced by human input. For example, in this chapter, we need to split a data set into a training set and test set to evaluate our classifier. We certainly do not want to choose how to split the data ourselves by hand, as we want to avoid accidentally influencing the result of the evaluation. So instead, we let R randomly split the data. In future chapters we will use randomness in many other ways, e.g., to help us select a small subset of data from a larger data set, to pick groupings of data, and more. However, the use of randomness runs counter to one of the main tenets of good data analysis practice: reproducibility. Recall that a reproducible analysis produces the same result each time it is run; if we include randomness in the analysis, would we not get a different result each time? The trick is that in R (and in other programming languages) randomness is not actually random! Instead, R uses a random number generator that produces a sequence of numbers that are completely determined by a seed value. Once you set the seed value using the set.seed function, everything after that point may look random, but is actually totally reproducible. As long as you pick the same seed value, you get the same result! Let’s use an example to investigate how seeds work in R. Say we want to randomly pick 10 numbers from 0 to 9 in R using the sample function, but we want it to be reproducible. Before using the sample function, we call set.seed, and pass it any integer as an argument. Here, we pass in the number 1. set.seed(1) random_numbers1 <- sample(0:9, 10, replace = TRUE) random_numbers1 ## [1] 8 3 6 0 1 6 1 2 0 4 You can see that random_numbers1 is a vector of 10 numbers from 0 to 9 that, from all appearances, look random.
If we run the sample function again, we will get a fresh batch of 10 numbers that also look random. random_numbers2 <- sample(0:9, 10, replace = TRUE) random_numbers2 ## [1] 4 9 5 9 6 8 4 4 8 8 If we want to force R to produce the same sequences of random numbers, we can simply call the set.seed function again with the same argument value. set.seed(1) random_numbers1_again <- sample(0:9, 10, replace = TRUE) random_numbers1_again ## [1] 8 3 6 0 1 6 1 2 0 4 random_numbers2_again <- sample(0:9, 10, replace = TRUE) random_numbers2_again ## [1] 4 9 5 9 6 8 4 4 8 8 Notice that after setting the seed, we get the same two sequences of numbers in the same order. random_numbers1 and random_numbers1_again produce the same sequence of numbers, and the same can be said about random_numbers2 and random_numbers2_again. And if we choose a different value for the seed—say, 4235—we obtain a different sequence of random numbers. set.seed(4235) random_numbers1_different <- sample(0:9, 10, replace = TRUE) random_numbers1_different ## [1] 8 3 1 4 6 8 8 4 1 7 random_numbers2_different <- sample(0:9, 10, replace = TRUE) random_numbers2_different ## [1] 3 7 8 2 8 8 6 3 3 8 In other words, even though the sequences of numbers that R is generating look random, they are totally determined when we set a seed value! So what does this mean for data analysis? Well, sample is certainly not the only function that uses randomness in R. Many of the functions that we use in tidymodels, tidyverse, and beyond use randomness—some of them without even telling you about it. So at the beginning of every data analysis you do, right after loading packages, you should call the set.seed function and pass it an integer that you pick. Also note that when R starts up, it creates its own seed to use. So if you do not explicitly call the set.seed function in your code, your results will likely not be reproducible. And finally, be careful to set the seed only once at the beginning of a data analysis. 
Each time you set the seed, you are inserting your own human input, thereby influencing the analysis. If you use set.seed many times throughout your analysis, the randomness that R uses will not look as random as it should. In summary: if you want your analysis to be reproducible, i.e., produce the same result each time you run it, make sure to use set.seed exactly once at the beginning of the analysis. Different argument values in set.seed lead to different patterns of randomness, but as long as you pick the same argument value your result will be the same. In the remainder of the textbook, we will set the seed once at the beginning of each chapter. 6.5 Evaluating performance with tidymodels Back to evaluating classifiers now! In R, we can use the tidymodels package not only to perform K-nearest neighbors classification, but also to assess how well our classification worked. Let’s work through an example of how to use tools from tidymodels to evaluate a classifier using the breast cancer data set from the previous chapter. We begin the analysis by loading the packages we require, reading in the breast cancer data, and then making a quick scatter plot visualization of tumor cell concavity versus smoothness colored by diagnosis in Figure 6.3. You will also notice that we set the random seed here at the beginning of the analysis using the set.seed function, as described in Section 6.4. 
# load packages library(tidyverse) library(tidymodels) # set the seed set.seed(1) # load data cancer <- read_csv("data/wdbc_unscaled.csv") |> # convert the character Class variable to the factor datatype mutate(Class = as_factor(Class)) |> # rename the factor values to be more readable mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) # create scatter plot of tumor cell concavity versus smoothness, # labeling the points by diagnosis class perim_concav <- cancer |> ggplot(aes(x = Smoothness, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(color = "Diagnosis") + scale_color_manual(values = c("darkorange", "steelblue")) + theme(text = element_text(size = 12)) perim_concav Figure 6.3: Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label. 6.5.1 Create the train / test split Once we have decided on a predictive question to answer and done some preliminary exploration, the very next thing to do is to split the data into the training and test sets. Typically, the training set is between 50% and 95% of the data, while the test set is the remaining 5% to 50%; the intuition is that you want to trade off between training an accurate model (by using a larger training data set) and getting an accurate evaluation of its performance (by using a larger test data set). Here, we will use 75% of the data for training, and 25% for testing. The initial_split function from tidymodels handles the procedure of splitting the data for us. It also applies two very important steps when splitting to ensure that the accuracy estimates from the test data are reasonable. First, it shuffles the data before splitting, which ensures that any ordering present in the data does not influence the data that ends up in the training and testing sets. Second, it stratifies the data by the class label, to ensure that roughly the same proportion of each class ends up in both the training and testing sets.
For example, in our data set, roughly 63% of the observations are from the benign class, and 37% are from the malignant class, so initial_split ensures that roughly 63% of the training data are benign, 37% of the training data are malignant, and the same proportions exist in the testing data. Let’s use the initial_split function to create the training and testing sets. We will specify that prop = 0.75 so that 75% of our original data set ends up in the training set. We will also set the strata argument to the categorical label variable (here, Class) to ensure that the training and testing subsets contain the right proportions of each category of observation. The training and testing functions then extract the training and testing data sets into two separate data frames. Note that the initial_split function uses randomness, but since we set the seed earlier in the chapter, the split will be reproducible. cancer_split <- initial_split(cancer, prop = 0.75, strata = Class) cancer_train <- training(cancer_split) cancer_test <- testing(cancer_split) glimpse(cancer_train) ## Rows: 426 ## Columns: 12 ## $ ID <dbl> 8510426, 8510653, 8510824, 854941, 85713702, 857155,… ## $ Class <fct> Benign, Benign, Benign, Benign, Benign, Benign, Beni… ## $ Radius <dbl> 13.540, 13.080, 9.504, 13.030, 8.196, 12.050, 13.490… ## $ Texture <dbl> 14.36, 15.71, 12.44, 18.42, 16.84, 14.63, 22.30, 21.… ## $ Perimeter <dbl> 87.46, 85.63, 60.34, 82.61, 51.71, 78.04, 86.91, 74.… ## $ Area <dbl> 566.3, 520.0, 273.9, 523.8, 201.9, 449.3, 561.0, 427… ## $ Smoothness <dbl> 0.09779, 0.10750, 0.10240, 0.08983, 0.08600, 0.10310… ## $ Compactness <dbl> 0.08129, 0.12700, 0.06492, 0.03766, 0.05943, 0.09092… ## $ Concavity <dbl> 0.066640, 0.045680, 0.029560, 0.025620, 0.015880, 0.… ## $ Concave_Points <dbl> 0.047810, 0.031100, 0.020760, 0.029230, 0.005917, 0.… ## $ Symmetry <dbl> 0.1885, 0.1967, 0.1815, 0.1467, 0.1769, 0.1675, 0.18… ## $ Fractal_Dimension <dbl> 0.05766, 0.06811, 0.06905, 0.05863, 0.06503, 
0.06043… glimpse(cancer_test) ## Rows: 143 ## Columns: 12 ## $ ID <dbl> 842517, 84300903, 84501001, 84610002, 848406, 848620… ## $ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan… ## $ Radius <dbl> 20.570, 19.690, 12.460, 15.780, 14.680, 16.130, 19.8… ## $ Texture <dbl> 17.77, 21.25, 24.04, 17.89, 20.13, 20.68, 22.15, 14.… ## $ Perimeter <dbl> 132.90, 130.00, 83.97, 103.60, 94.74, 108.10, 130.00… ## $ Area <dbl> 1326.0, 1203.0, 475.9, 781.0, 684.5, 798.8, 1260.0, … ## $ Smoothness <dbl> 0.08474, 0.10960, 0.11860, 0.09710, 0.09867, 0.11700… ## $ Compactness <dbl> 0.07864, 0.15990, 0.23960, 0.12920, 0.07200, 0.20220… ## $ Concavity <dbl> 0.08690, 0.19740, 0.22730, 0.09954, 0.07395, 0.17220… ## $ Concave_Points <dbl> 0.070170, 0.127900, 0.085430, 0.066060, 0.052590, 0.… ## $ Symmetry <dbl> 0.1812, 0.2069, 0.2030, 0.1842, 0.1586, 0.2164, 0.15… ## $ Fractal_Dimension <dbl> 0.05667, 0.05999, 0.08243, 0.06082, 0.05922, 0.07356… We can see from glimpse in the code above that the training set contains 426 observations, while the test set contains 143 observations. This corresponds to a train / test split of 75% / 25%, as desired. Recall from Chapter 5 that we use the glimpse function to view data with a large number of columns, as it prints the data such that the columns go down the page (instead of across). We can use group_by and summarize to find the percentage of malignant and benign classes in cancer_train and we see about 63% of the training data are benign and 37% are malignant, indicating that our class proportions were roughly preserved when we split the data. 
cancer_proportions <- cancer_train |> group_by(Class) |> summarize(n = n()) |> mutate(percent = 100*n/nrow(cancer_train)) cancer_proportions ## # A tibble: 2 × 3 ## Class n percent ## <fct> <int> <dbl> ## 1 Malignant 159 37.3 ## 2 Benign 267 62.7 6.5.2 Preprocess the data As we mentioned in the last chapter, K-nearest neighbors is sensitive to the scale of the predictors, so we should perform some preprocessing to standardize them. An additional consideration we need to take when doing this is that we should create the standardization preprocessor using only the training data. This ensures that our test data does not influence any aspect of our model training. Once we have created the standardization preprocessor, we can then apply it separately to both the training and test data sets. Fortunately, the recipe framework from tidymodels helps us handle this properly. Below we construct and prepare the recipe using only the training data (due to data = cancer_train in the first line). cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) 6.5.3 Train the classifier Now that we have split our original data set into training and test sets, we can create our K-nearest neighbors classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose the number \\(K\\) of neighbors to be 3, and use concavity and smoothness as the predictors. As before we need to create a model specification, combine the model specification and recipe into a workflow, and then finally use fit with the training data cancer_train to build the classifier. 
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |> set_engine("kknn") |> set_mode("classification") knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit(data = cancer_train) knn_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## ## Call: ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3, data, 5), ## kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.1126761 ## Best kernel: rectangular ## Best k: 3 6.5.4 Predict the labels in the test set Now that we have a K-nearest neighbors classifier object, we can use it to predict the class labels for our test set. We use the bind_cols to add the column of predictions to the original test data, creating the cancer_test_predictions data frame. The Class variable contains the actual diagnoses, while the .pred_class contains the predicted diagnoses from the classifier. cancer_test_predictions <- predict(knn_fit, cancer_test) |> bind_cols(cancer_test) cancer_test_predictions ## # A tibble: 143 × 13 ## .pred_class ID Class Radius Texture Perimeter Area Smoothness ## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Benign 842517 Malignant 20.6 17.8 133. 1326 0.0847 ## 2 Malignant 84300903 Malignant 19.7 21.2 130 1203 0.110 ## 3 Malignant 84501001 Malignant 12.5 24.0 84.0 476. 0.119 ## 4 Malignant 84610002 Malignant 15.8 17.9 104. 781 0.0971 ## 5 Benign 848406 Malignant 14.7 20.1 94.7 684. 0.0987 ## 6 Malignant 84862001 Malignant 16.1 20.7 108. 799. 0.117 ## 7 Malignant 849014 Malignant 19.8 22.2 130 1260 0.0983 ## 8 Malignant 8511133 Malignant 15.3 14.3 102. 704. 0.107 ## 9 Malignant 852552 Malignant 16.6 21.4 110 905. 0.112 ## 10 Malignant 853612 Malignant 11.8 18.7 77.9 441. 
0.111 ## # ℹ 133 more rows ## # ℹ 5 more variables: Compactness <dbl>, Concavity <dbl>, Concave_Points <dbl>, ## # Symmetry <dbl>, Fractal_Dimension <dbl> 6.5.5 Evaluate performance Finally, we can assess our classifier’s performance. First, we will examine accuracy. To do this, we use the metrics function from tidymodels, specifying the truth and estimate arguments: cancer_test_predictions |> metrics(truth = Class, estimate = .pred_class) |> filter(.metric == "accuracy") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.853 In the metrics data frame, we filtered the .metric column since we are interested in the accuracy row. Other entries involve other metrics that are beyond the scope of this book. Looking at the value of the .estimate variable shows that the estimated accuracy of the classifier on the test data was 85%. To compute the precision and recall, we can use the precision and recall functions from tidymodels. We first check the order of the labels in the Class variable using the levels function: cancer_test_predictions |> pull(Class) |> levels() ## [1] "Malignant" "Benign" This shows that \"Malignant\" is the first level. Therefore we will set the truth and estimate arguments to Class and .pred_class as before, but also specify that the “positive” class corresponds to the first factor level via event_level=\"first\". If the labels were in the other order, we would instead use event_level=\"second\". cancer_test_predictions |> precision(truth = Class, estimate = .pred_class, event_level = "first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 precision binary 0.767 cancer_test_predictions |> recall(truth = Class, estimate = .pred_class, event_level = "first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 recall binary 0.868 The output shows that the estimated precision and recall of the classifier on the test data were 77% and 87%, respectively.
Finally, we can look at the confusion matrix for the classifier using the conf_mat function. confusion <- cancer_test_predictions |> conf_mat(truth = Class, estimate = .pred_class) confusion ## Truth ## Prediction Malignant Benign ## Malignant 46 14 ## Benign 7 76 The confusion matrix shows 46 observations were correctly predicted as malignant, and 76 were correctly predicted as benign. It also shows that the classifier made some mistakes; in particular, it classified 7 observations as benign when they were actually malignant, and 14 observations as malignant when they were actually benign. Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what R reported. \\[\\mathrm{accuracy} = \\frac{\\mathrm{number \\; of \\; correct \\; predictions}}{\\mathrm{total \\; number \\; of \\; predictions}} = \\frac{46+76}{46+76+14+7} = 0.853\\] \\[\\mathrm{precision} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; predictions}} = \\frac{46}{46 + 14} = 0.767\\] \\[\\mathrm{recall} = \\frac{\\mathrm{number \\; of \\; correct \\; positive \\; predictions}}{\\mathrm{total \\; number \\; of \\; positive \\; test \\; set \\; observations}} = \\frac{46}{46+7} = 0.868\\] 6.5.6 Critically analyze performance We now know that the classifier was 85% accurate on the test data set, and had a precision of 77% and a recall of 87%. That sounds pretty good! Wait, is it good? Or do we need something higher? In general, a good value for accuracy (as well as precision and recall, if applicable) depends on the application; you must critically analyze your accuracy in the context of the problem you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99% of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!). 
And beyond just accuracy, we need to consider the precision and recall: as mentioned earlier, the kind of mistake the classifier makes is important in many applications as well. In the previous example with 99% benign observations, it might be very bad for the classifier to predict “benign” when the actual class is “malignant” (a false negative), as this might result in a patient not receiving appropriate medical attention. In other words, in this context, we need the classifier to have a high recall. On the other hand, it might be less bad for the classifier to guess “malignant” when the actual class is “benign” (a false positive), as the patient will then likely see a doctor who can provide an expert diagnosis. In other words, we are fine with sacrificing some precision in the interest of achieving high recall. This is why it is important not only to look at accuracy, but also the confusion matrix. However, there is always an easy baseline that you can compare to for any classification problem: the majority classifier. The majority classifier always guesses the majority class label from the training data, regardless of the predictor variables’ values. It helps to give you a sense of scale when considering accuracies. If the majority classifier obtains a 90% accuracy on a problem, then you might hope for your K-nearest neighbors classifier to do better than that. If your classifier provides a significant improvement upon the majority classifier, this means that at least your method is extracting some useful information from your predictor variables. Be careful though: improving on the majority classifier does not necessarily mean the classifier is working well enough for your application. 
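The majority classifier baseline described above can be sketched in base R. This is an illustration under stated assumptions, not the book's code: the labels below are made-up toy data (not the breast cancer data), and all object names are hypothetical.

```r
# Made-up toy labels (an assumption for illustration only)
train_labels <- factor(c("Benign", "Benign", "Benign",
                         "Malignant", "Benign", "Malignant"))
test_labels <- factor(c("Benign", "Malignant", "Benign", "Benign"),
                      levels = levels(train_labels))

# The majority class is the most frequent label in the training data
majority_class <- names(which.max(table(train_labels)))

# The majority classifier ignores the predictors entirely and
# predicts the majority class for every test observation
baseline_predictions <- factor(rep(majority_class, length(test_labels)),
                               levels = levels(train_labels))

# Baseline accuracy: the fraction of test labels equal to the majority class
baseline_accuracy <- mean(baseline_predictions == test_labels)
baseline_accuracy
```

Any classifier that uses the predictor variables should be compared against this number; in the toy data the baseline is correct on 3 of the 4 test observations.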
As an example, in the breast cancer data, recall the proportions of benign and malignant observations in the training data are as follows: cancer_proportions ## # A tibble: 2 × 3 ## Class n percent ## <fct> <int> <dbl> ## 1 Malignant 159 37.3 ## 2 Benign 267 62.7 Since the benign class represents the majority of the training data, the majority classifier would always predict that a new observation is benign. The estimated accuracy of the majority classifier is usually fairly close to the majority class proportion in the training data. In this case, we would suspect that the majority classifier will have an accuracy of around 63%. The K-nearest neighbors classifier we built does quite a bit better than this, with an accuracy of 85%. This means that from the perspective of accuracy, the K-nearest neighbors classifier improved quite a bit on the basic majority classifier. Hooray! But we still need to be cautious; in this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing patients who actually need medical care. The confusion matrix above shows that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign (7 out of 53 malignant tumors, or 13%!). Therefore, even though the accuracy improved upon the majority classifier, our critical analysis suggests that this classifier may not have appropriate performance for the application. 6.6 Tuning the classifier The vast majority of predictive models in statistics and machine learning have parameters. A parameter is a number you have to pick in advance that determines some aspect of how the model behaves. For example, in the K-nearest neighbors classification algorithm, \\(K\\) is a parameter that we have to pick that determines how many neighbors participate in the class vote. By picking different values of \\(K\\), we create different classifiers that make different predictions. 
So then, how do we pick the best value of \\(K\\), i.e., tune the model? And is it possible to make this selection in a principled way? In this book, we will focus on maximizing the accuracy of the classifier. Ideally, we want somehow to maximize the accuracy of our classifier on data it hasn’t seen yet. But we cannot use our test data set in the process of building our model. So we will play the same trick we did before when evaluating our classifier: we’ll split our training data itself into two subsets, use one to train the model, and then use the other to evaluate it. In this section, we will cover the details of this procedure, as well as how to use it to help you pick a good parameter value for your classifier. And remember: don’t touch the test set during the tuning process. Tuning is a part of model training! 6.6.1 Cross-validation The first step in choosing the parameter \\(K\\) is to be able to evaluate the classifier using only the training data. If this is possible, then we can compare the classifier’s performance for different values of \\(K\\)—and pick the best—using only the training data. As suggested at the beginning of this section, we will accomplish this by splitting the training data, training on one subset, and evaluating on the other. The subset of training data used for evaluation is often called the validation set. There is, however, one key difference from the train/test split that we performed earlier. In particular, we were forced to make only a single split of the data. This is because at the end of the day, we have to produce a single classifier. If we had multiple different splits of the data into training and testing data, we would produce multiple different classifiers. But while we are tuning the classifier, we are free to create multiple classifiers based on multiple splits of the training data, evaluate them, and then choose a parameter value based on all of the different results. 
If we just split our overall training data once, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set. Perhaps by using multiple different train/validation splits, we’ll get a better estimate of accuracy, which will lead to a better choice of the number of neighbors \\(K\\) for the overall set of training data. Let’s investigate this idea in R! In particular, we will generate five different train/validation splits of our overall training data, train five different K-nearest neighbors models, and evaluate their accuracy. We will start with just a single split. # create the 75/25 split of the training data into training and validation cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class) cancer_subtrain <- training(cancer_split) cancer_validation <- testing(cancer_split) # recreate the standardization recipe from before # (since it must be based on the training data) cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_subtrain) |> step_scale(all_predictors()) |> step_center(all_predictors()) # fit the knn model (we can reuse the old knn_spec model from before) knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit(data = cancer_subtrain) # get predictions on the validation data validation_predicted <- predict(knn_fit, cancer_validation) |> bind_cols(cancer_validation) # compute the accuracy acc <- validation_predicted |> metrics(truth = Class, estimate = .pred_class) |> filter(.metric == "accuracy") |> select(.estimate) |> pull() acc ## [1] 0.8598131 The accuracy estimate using this split is 86%. Now we repeat the above code 4 more times, which generates 4 more splits. Therefore we get five different shuffles of the data, and therefore five different values for accuracy: 86.0%, 89.7%, 88.8%, 86.0%, 86.9%.
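The five accuracy estimates listed above can be summarized with a couple of lines of base R. This is only a small sketch: the vector name validation_accuracies is hypothetical, and the values are the rounded percentages quoted in the text, converted to proportions.

```r
# The five validation accuracies quoted in the text, as proportions
validation_accuracies <- c(0.860, 0.897, 0.888, 0.860, 0.869)

# A single summary of the five estimates: their average
mean_accuracy <- mean(validation_accuracies)
mean_accuracy

# How much the estimate moves from one split to another
accuracy_range <- range(validation_accuracies)
accuracy_range
```

The average works out to 0.8748, i.e., about 87%, while individual splits range from 86.0% to 89.7%.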
None of these values are necessarily “more correct” than any other; they’re just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their average (here 87%) to try to get a single assessment of our classifier’s accuracy; this has the effect of reducing the influence of any one (un)lucky validation set on the estimate. In practice, we don’t use random splits, but rather use a more structured splitting procedure so that each observation in the data set is used in a validation set only a single time. The name for this strategy is cross-validation. In cross-validation, we split our overall training data into \\(C\\) evenly sized chunks. Then, we iteratively use \\(1\\) chunk as the validation set and combine the remaining \\(C-1\\) chunks as the training set. This procedure is shown in Figure 6.4. Here, \\(C=5\\) different chunks of the data set are used, resulting in 5 different choices for the validation set; we call this 5-fold cross-validation. Figure 6.4: 5-fold cross-validation. To perform 5-fold cross-validation in R with tidymodels, we use another function: vfold_cv. This function splits our training data into v folds automatically. We set the strata argument to the categorical label variable (here, Class) to ensure that the training and validation subsets contain the right proportions of each category of observation. cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class) cancer_vfold ## # 5-fold cross-validation using stratification ## # A tibble: 5 × 2 ## splits id ## <list> <chr> ## 1 <split [340/86]> Fold1 ## 2 <split [340/86]> Fold2 ## 3 <split [341/85]> Fold3 ## 4 <split [341/85]> Fold4 ## 5 <split [342/84]> Fold5 Then, when we create our data analysis workflow, we use the fit_resamples function instead of the fit function for training. This trains and evaluates the classifier on each train/validation split.
# recreate the standardization recipe from before # (since it must be based on the training data) cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) # fit the knn model (we can reuse the old knn_spec model from before) knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold) knn_fit ## # Resampling results ## # 5-fold cross-validation using stratification ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [340/86]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> ## 2 <split [340/86]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]> ## 3 <split [341/85]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> ## 4 <split [341/85]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]> ## 5 <split [342/84]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]> The collect_metrics function is used to aggregate the mean and standard error of the classifier’s validation accuracy across the folds. You will find results related to the accuracy in the row with accuracy listed under the .metric column. You should consider the mean (mean) to be the estimated accuracy, while the standard error (std_err) is a measure of how uncertain we are in the mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean is 0.89 and standard error is 0.02, you can expect the true average accuracy of the classifier to be somewhere roughly between 87% and 91% (although it may fall outside this range). You may ignore the other columns in the metrics data frame, as they do not provide any additional insight. You can also ignore the entire second row with roc_auc in the .metric column, as it is beyond the scope of this book. 
knn_fit |> collect_metrics() ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.890 5 0.0180 Preprocessor1_Model1 ## 2 roc_auc binary 0.925 5 0.0151 Preprocessor1_Model1 We can choose any number of folds, and typically the more we use the better our accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis. So when you do cross-validation, you need to consider the size of the data, the speed of the algorithm (e.g., K-nearest neighbors), and the speed of your computer. In practice, this is a trial-and-error process, but typically \\(C\\) is chosen to be either 5 or 10. Here we will try 10-fold cross-validation to see if we get a lower standard error: cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class) vfold_metrics <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold) |> collect_metrics() vfold_metrics ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.890 10 0.0127 Preprocessor1_Model1 ## 2 roc_auc binary 0.913 10 0.0150 Preprocessor1_Model1 In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes you might even end up with a higher standard error when increasing the number of folds! We can make the reduction in standard error more dramatic by increasing the number of folds by a large amount. In the following code we show the result when \\(C = 50\\); picking such a large number of folds often takes a long time to run in practice, so we usually stick to 5 or 10. 
cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class) vfold_metrics_50 <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit_resamples(resamples = cancer_vfold_50) |> collect_metrics() vfold_metrics_50 ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.884 50 0.00568 Preprocessor1_Model1 ## 2 roc_auc binary 0.926 50 0.0148 Preprocessor1_Model1 6.6.2 Parameter value selection Using 5- and 10-fold cross-validation, we have estimated that the prediction accuracy of our classifier is somewhere around 89%. Whether that is good or not depends entirely on the downstream application of the data analysis. In the present situation, we are trying to predict a tumor diagnosis, with expensive, damaging chemo/radiation therapy or patient death as potential consequences of misprediction. Hence, we might like to do better than 89% for this application. In order to improve our classifier, we have one choice of parameter: the number of neighbors, \\(K\\). Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of \\(K\\) in a reasonable range, and then pick the value of \\(K\\) that gives us the best accuracy. The tidymodels package collection provides a very simple syntax for tuning models: each parameter in the model to be tuned should be specified as tune() in the model specification rather than given a particular value. knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("classification") Then instead of using fit or fit_resamples, we will use the tune_grid function to fit the model for each value in a range of parameter values. 
In particular, we first create a data frame with a neighbors variable that contains the sequence of values of \\(K\\) to try; below we create the k_vals data frame with the neighbors variable containing values from 1 to 100 (stepping by 5) using the seq function. Then we pass that data frame to the grid argument of tune_grid. k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5)) knn_results <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = k_vals) |> collect_metrics() accuracies <- knn_results |> filter(.metric == "accuracy") accuracies ## # A tibble: 20 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 accuracy binary 0.866 10 0.0165 Preprocessor1_Model01 ## 2 6 accuracy binary 0.890 10 0.0153 Preprocessor1_Model02 ## 3 11 accuracy binary 0.887 10 0.0173 Preprocessor1_Model03 ## 4 16 accuracy binary 0.887 10 0.0142 Preprocessor1_Model04 ## 5 21 accuracy binary 0.887 10 0.0143 Preprocessor1_Model05 ## 6 26 accuracy binary 0.887 10 0.0170 Preprocessor1_Model06 ## 7 31 accuracy binary 0.897 10 0.0145 Preprocessor1_Model07 ## 8 36 accuracy binary 0.899 10 0.0144 Preprocessor1_Model08 ## 9 41 accuracy binary 0.892 10 0.0135 Preprocessor1_Model09 ## 10 46 accuracy binary 0.892 10 0.0156 Preprocessor1_Model10 ## 11 51 accuracy binary 0.890 10 0.0155 Preprocessor1_Model11 ## 12 56 accuracy binary 0.873 10 0.0156 Preprocessor1_Model12 ## 13 61 accuracy binary 0.876 10 0.0104 Preprocessor1_Model13 ## 14 66 accuracy binary 0.871 10 0.0139 Preprocessor1_Model14 ## 15 71 accuracy binary 0.876 10 0.0104 Preprocessor1_Model15 ## 16 76 accuracy binary 0.873 10 0.0127 Preprocessor1_Model16 ## 17 81 accuracy binary 0.876 10 0.0135 Preprocessor1_Model17 ## 18 86 accuracy binary 0.873 10 0.0131 Preprocessor1_Model18 ## 19 91 accuracy binary 0.873 10 0.0140 Preprocessor1_Model19 ## 20 96 accuracy binary 0.866 10 0.0126 Preprocessor1_Model20 We can decide 
which number of neighbors is best by plotting the accuracy versus \\(K\\), as shown in Figure 6.5. accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = "Neighbors", y = "Accuracy Estimate") + theme(text = element_text(size = 12)) accuracy_vs_k Figure 6.5: Plot of estimated accuracy versus the number of neighbors. We can also obtain the number of neighbors with the highest accuracy programmatically by accessing the neighbors variable in the accuracies data frame where the mean variable is highest. Note that it is still useful to visualize the results as we did above since this provides additional information on how the model performance varies. best_k <- accuracies |> arrange(desc(mean)) |> head(1) |> pull(neighbors) best_k ## [1] 36 Setting the number of neighbors to \\(K =\\) 36 provides the highest cross-validation accuracy estimate (89.89%). But there is no exact or perfect answer here; any selection between \\(K = 30\\) and \\(60\\) would be reasonably justified, as all of these differ in classifier accuracy by a small amount. Remember: the values you see on this plot are estimates of the true accuracy of our classifier. Although the accuracy estimate at \\(K =\\) 36 is higher than the others on this plot, that doesn’t mean the classifier is actually more accurate with this parameter value! Generally, when selecting \\(K\\) (and other parameters for other predictive models), we are looking for a value where: we get roughly optimal accuracy, so that our model will likely be accurate; changing the value to a nearby one (e.g., adding or subtracting a small number) doesn’t decrease accuracy too much, so that our choice is reliable in the presence of uncertainty; the cost of training the model is not prohibitive (e.g., in our situation, if \\(K\\) is too large, predicting becomes expensive!). We know that \\(K =\\) 36 provides the highest estimated accuracy.
Further, Figure 6.5 shows that the estimated accuracy changes by only a small amount if we increase or decrease \\(K\\) near \\(K =\\) 36. And finally, \\(K =\\) 36 does not create a prohibitively expensive computational cost of training. Considering these three points, we would indeed select \\(K =\\) 36 for the classifier. 6.6.3 Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of neighbors \\(K\\)? In fact, the accuracy starts to decrease! Let’s specify a much larger range of values of \\(K\\) to try in the grid argument of tune_grid. Figure 6.6 shows a plot of estimated accuracy as we vary \\(K\\) from 1 to almost the number of observations in the training set. k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10)) knn_results <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = k_lots) |> collect_metrics() accuracies_lots <- knn_results |> filter(.metric == "accuracy") accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = "Neighbors", y = "Accuracy Estimate") + theme(text = element_text(size = 12)) accuracy_vs_k_lots Figure 6.6: Plot of accuracy estimate versus number of neighbors for many K values. Underfitting: What is actually happening to our classifier that causes this? As we increase the number of neighbors, more and more of the training observations (and those that are farther and farther away from the point) get a “say” in what the class of a new observation is. This causes a sort of “averaging effect” to take place, which makes the boundary between where our classifier would predict a tumor to be malignant versus benign smooth out and become simpler. If you take this to the extreme, setting \\(K\\) to the total training data set size, then the classifier will always predict the same label regardless of what the new observation looks like.
In general, if the model isn’t influenced enough by the training data, it is said to underfit the data. Overfitting: In contrast, when we decrease the number of neighbors, each individual data point has a stronger and stronger vote regarding nearby points. Since the data themselves are noisy, this causes a more “jagged” boundary corresponding to a less simple model. If you take this case to the extreme, setting \\(K = 1\\), then the classifier is essentially just matching each new observation to its closest neighbor in the training data set. This is just as problematic as the large \\(K\\) case, because the classifier becomes unreliable on new data: if we had a different training set, the predictions would be completely different. In general, if the model is influenced too much by the training data, it is said to overfit the data. Figure 6.7: Effect of K in overfitting and underfitting. Both overfitting and underfitting are problematic and will lead to a model that does not generalize well to new data. When fitting a model, we need to strike a balance between the two. You can see these two effects in Figure 6.7, which shows how the classifier changes as we set the number of neighbors \\(K\\) to 1, 7, 20, and 300. 6.6.4 Evaluating on the test set Now that we have tuned the K-NN classifier and set \\(K =\\) 36, we are done building the model and it is time to evaluate the quality of its predictions on the held out test data, as we did earlier in Section 6.5.5. We first need to retrain the K-NN classifier on the entire training data set using the selected number of neighbors. 
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |> set_engine("kknn") |> set_mode("classification") knn_fit <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> fit(data = cancer_train) knn_fit ## ══ Workflow [trained] ══════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## ## Call: ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(36, data, 5), kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.1150235 ## Best kernel: rectangular ## Best k: 36 Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the predict and metrics functions as we did earlier in the chapter. We can then pass those predictions to the precision, recall, and conf_mat functions to assess the estimated precision and recall, and print a confusion matrix. 
cancer_test_predictions <- predict(knn_fit, cancer_test) |> bind_cols(cancer_test) cancer_test_predictions |> metrics(truth = Class, estimate = .pred_class) |> filter(.metric == "accuracy") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.860 cancer_test_predictions |> precision(truth = Class, estimate = .pred_class, event_level="first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 precision binary 0.8 cancer_test_predictions |> recall(truth = Class, estimate = .pred_class, event_level="first") ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 recall binary 0.830 confusion <- cancer_test_predictions |> conf_mat(truth = Class, estimate = .pred_class) confusion ## Truth ## Prediction Malignant Benign ## Malignant 44 11 ## Benign 9 79 At first glance, this is a bit surprising: the accuracy of the classifier has only changed a small amount despite tuning the number of neighbors! Our first model with \\(K =\\) 3 (before we knew how to tune) had an estimated accuracy of 85%, while the tuned model with \\(K =\\) 36 had an estimated accuracy of 86%. Upon examining Figure 6.5 again to see the cross-validation accuracy estimates for a range of neighbors, this result becomes much less surprising. From 1 to around 96 neighbors, the cross-validation accuracy estimate varies only by around 3%, with each estimate having a standard error around 1%. Since the cross-validation accuracy estimates the test set accuracy, the fact that the test set accuracy also doesn’t change much is expected. Also note that the \\(K =\\) 3 model had a precision of 77% and recall of 87%, while the tuned model had a precision of 80% and recall of 83%. Given that the recall decreased (remember, in this application, recall is critical to making sure we find all the patients with malignant tumors), the tuned model may actually be less preferred in this setting.
In any case, it is important to think critically about the result of tuning. Models tuned to maximize accuracy are not necessarily better for a given application. 6.7 Summary Classification algorithms use one or more quantitative variables to predict the value of another categorical variable. In particular, the K-nearest neighbors algorithm does this by first finding the \\(K\\) points in the training data nearest to the new observation, and then returning the majority class vote from those training observations. We can tune and evaluate a classifier by splitting the data randomly into a training and test data set. The training set is used to build the classifier, and we can tune the classifier (e.g., select the number of neighbors in K-NN) by maximizing estimated accuracy via cross-validation. After we have tuned the model we can use the test set to estimate its accuracy. The overall process is summarized in Figure 6.8. Figure 6.8: Overview of K-NN classification. The overall workflow for performing K-nearest neighbors classification using tidymodels is as follows: Use the initial_split function to split the data into a training and test set. Set the strata argument to the class label variable. Put the test set aside for now. Use the vfold_cv function to split up the training data for cross-validation. Create a recipe that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the data argument of the recipe. Create a nearest_neighbor model specification, with neighbors = tune(). Add the recipe and model specification to a workflow(), and use the tune_grid function on the train/validation splits to estimate the classifier accuracy for a range of \\(K\\) values. Pick a value of \\(K\\) that yields a high accuracy estimate that doesn’t change much if you change \\(K\\) to a nearby value.
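To see how the accuracy, precision, and recall values relate to the confusion matrix, the three quantities can be recomputed by hand from the four counts printed above. This is a minimal sketch that assumes Malignant is treated as the positive class (consistent with event_level="first" in the calls above); the variable names are hypothetical.

```r
# counts copied from the confusion matrix printed above
# (rows are predictions, columns are the true classes)
true_pos  <- 44  # predicted Malignant, truly Malignant
false_pos <- 11  # predicted Malignant, truly Benign
false_neg <- 9   # predicted Benign, truly Malignant
true_neg  <- 79  # predicted Benign, truly Benign

# fraction of all predictions that were correct
accuracy <- (true_pos + true_neg) /
  (true_pos + false_pos + false_neg + true_neg)

# fraction of Malignant predictions that were truly Malignant
precision <- true_pos / (true_pos + false_pos)

# fraction of truly Malignant observations that were predicted Malignant
recall <- true_pos / (true_pos + false_neg)

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

The rounded results (0.860, 0.8, and 0.830) agree with the output of metrics, precision, and recall.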
Make a new model specification for the best parameter value (i.e., \\(K\\)), and retrain the classifier using the fit function. Evaluate the estimated accuracy of the classifier on the test set using the predict function. In these last two chapters, we focused on the K-nearest neighbors algorithm, but there are many other methods we could have used to predict a categorical label. All algorithms have their strengths and weaknesses, and we summarize these for the K-NN here. Strengths: K-nearest neighbors classification is a simple, intuitive algorithm, requires few assumptions about what the data must look like, and works for binary (two-class) and multi-class (more than 2 classes) classification problems. Weaknesses: K-nearest neighbors classification becomes very slow as the training data gets larger, may not perform well with a large number of predictors, and may not perform well when classes are imbalanced. 6.8 Predictor variable selection Note: This section is not required reading for the remainder of the textbook. It is included for those readers interested in learning how irrelevant variables can influence the performance of a classifier, and how to pick a subset of useful variables to include as predictors. Another potentially important part of tuning your classifier is to choose which variables from your data will be treated as predictor variables. Technically, you can choose anything from using a single predictor variable to using every variable in your data; the K-nearest neighbors algorithm accepts any number of predictors. However, it is not the case that using more predictors always yields better predictions! In fact, sometimes including irrelevant predictors can actually negatively affect classifier performance. 6.8.1 The effect of irrelevant predictors Let’s take a look at an example where K-nearest neighbors performs worse when given more predictors to work with. 
In this example, we modified the breast cancer data to have only the Smoothness, Concavity, and Perimeter variables from the original data. Then, we added irrelevant variables that we created ourselves using a random number generator. The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless of the value the Class variable takes. In other words, the irrelevant variables have no meaningful relationship with the Class variable. cancer_irrelevant |> select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2) ## # A tibble: 569 × 6 ## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2 ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Malignant 0.118 0.300 123. 1 0 ## 2 Malignant 0.0847 0.0869 133. 0 0 ## 3 Malignant 0.110 0.197 130 0 0 ## 4 Malignant 0.142 0.241 77.6 0 1 ## 5 Malignant 0.100 0.198 135. 0 0 ## 6 Malignant 0.128 0.158 82.6 1 0 ## 7 Malignant 0.0946 0.113 120. 0 1 ## 8 Malignant 0.119 0.0937 90.2 1 0 ## 9 Malignant 0.127 0.186 87.5 0 0 ## 10 Malignant 0.119 0.227 84.0 1 1 ## # ℹ 559 more rows Next, we build a sequence of K-NN classifiers that include Smoothness, Concavity, and Perimeter as predictor variables, but also increasingly many irrelevant variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. Then we build a model, tuned via 5-fold cross-validation, for each data set. Figure 6.9 shows the estimated cross-validation accuracy versus the number of irrelevant predictors. As we add more irrelevant predictor variables, the estimated accuracy of our classifier decreases. This is because the irrelevant variables add a random amount to the distance between each pair of observations; the more irrelevant variables there are, the more (random) influence they have, and the more they corrupt the set of nearest neighbors that vote on the class of the new observation to predict. Figure 6.9: Effect of inclusion of irrelevant predictors.
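The claim that irrelevant variables add a random amount to the distance between observations can be checked with a small sketch. Everything here is made up for illustration: the two observations, the seed, and all variable names are hypothetical, and the irrelevant 0/1 variables are left unstandardized for simplicity (unlike in the recipe used in this chapter).

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# two made-up observations with three (already standardized) relevant predictors
obs_a <- c(0.2, -0.5, 1.0)
obs_b <- c(0.3, -0.4, 0.8)

# Euclidean distance using only the relevant predictors
dist_relevant <- sqrt(sum((obs_a - obs_b)^2))

# append 40 irrelevant 0/1 variables to each observation
n_irrelevant <- 40
noise_a <- rbinom(n_irrelevant, size = 1, prob = 0.5)
noise_b <- rbinom(n_irrelevant, size = 1, prob = 0.5)

# Euclidean distance using relevant and irrelevant predictors together
dist_all <- sqrt(sum((c(obs_a, noise_a) - c(obs_b, noise_b))^2))

# the squared distance grows by the number of irrelevant variables
# on which the two observations happen to disagree
n_disagree <- sum(noise_a != noise_b)
c(relevant_only = dist_relevant,
  with_irrelevant = dist_all,
  n_disagree = n_disagree)
```

Because each 0/1 disagreement adds exactly 1 to the squared distance, two observations that are very close on the meaningful predictors can end up far apart purely by chance.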
Although the accuracy decreases as expected, one surprising thing about Figure 6.9 is that it shows that the method still outperforms the baseline majority classifier (with about 63% accuracy) even with 40 irrelevant variables. How could that be? Figure 6.10 provides the answer: the tuning procedure for the K-nearest neighbors classifier combats the extra randomness from the irrelevant variables by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant variables, the number of neighbors does not increase smoothly; but the general trend is increasing. Figure 6.11 corroborates this evidence; if we fix the number of neighbors to \\(K=3\\), the accuracy falls off more quickly. Figure 6.10: Tuned number of neighbors for varying number of irrelevant predictors. Figure 6.11: Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors. 6.8.2 Finding a good subset of predictors So then, if it is not ideal to use all of our variables as predictors without consideration, how do we choose which variables we should use? A simple method is to rely on your scientific understanding of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer data that we have been studying, the ID variable is just a unique identifier for the observation. As it is not related to any measured property of the cells, the ID variable should therefore not be used as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables is less obvious, as all seem like reasonable candidates. It is not clear which subset of them will create the best classifier. One could use visualizations and other exploratory analyses to try to help understand which variables are potentially relevant, but this process is both time-consuming and error-prone when there are many variables to consider. 
Therefore we need a more systematic and programmatic way of choosing variables. This is a very difficult problem to solve in general, and there are a number of methods that have been developed that apply in particular cases of interest. Here we will discuss two basic selection methods as an introduction to the topic. See the additional resources at the end of this chapter to find out where you can learn more about variable selection, including more advanced methods. The first idea you might think of for a systematic way to select predictors is to try all possible subsets of predictors and then pick the set that results in the “best” classifier. This procedure is indeed a well-known variable selection method referred to as best subset selection (Beale, Kendall, and Mann 1967; Hocking and Leslie 1967). In particular, you create a separate model for every possible subset of predictors, tune each one using cross-validation, and pick the subset of predictors that gives you the highest cross-validation accuracy. Best subset selection is applicable to any classification method (K-NN or otherwise). However, it becomes very slow when you have even a moderate number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets grows very quickly with the number of predictors, and you have to train the model (itself a slow process!) for each one. For example, if we have 2 predictors—let’s call them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A and B together. If we have 3 predictors—A, B, and C—then we have 7 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models we have to train for \\(m\\) predictors is \\(2^m-1\\); in other words, when we get to 10 predictors we have over one thousand models to train, and at 20 predictors we have over one million models to train! So although it is a simple method, best subset selection is usually too computationally expensive to use in practice. 
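The growth in the number of candidate subsets can be verified with a short sketch in base R. The predictor names A, B, and C follow the example in the text; the helper n_subsets is a hypothetical name introduced only for this illustration.

```r
# hypothetical predictor names, as in the A/B/C example
predictors <- c("A", "B", "C")

# enumerate every non-empty subset, one subset size at a time
subsets <- unlist(
  lapply(seq_along(predictors), function(k) {
    combn(predictors, k, FUN = paste, collapse = "")
  })
)
subsets
length(subsets)  # 7 subsets for 3 predictors

# number of non-empty subsets of m predictors
n_subsets <- function(m) 2^m - 1

# 7, 1023, and 1048575 models for 3, 10, and 20 predictors
n_subsets(c(3, 10, 20))
```

Since each of these models must also be tuned with cross-validation, the total amount of computation is larger still.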
Another idea is to iteratively build up a model by adding one predictor variable at a time. This method, known as forward selection (Efroymson 1966; Draper and Smith 1966), is also widely applicable and fairly straightforward. It involves the following steps: Start with a model having no predictors. Run the following 3 steps until you run out of predictors: For each unused predictor, add it to the model to form a candidate model. Tune all of the candidate models. Update the model to be the candidate model with the highest cross-validation accuracy. Select the model that provides the best trade-off between accuracy and simplicity. Say you have \\(m\\) total predictors to work with. In the first iteration, you have to make \\(m\\) candidate models, each with 1 predictor. Then in the second iteration, you have to make \\(m-1\\) candidate models, each with 2 predictors (the one you chose before and a new one). This pattern continues for as many iterations as you want. If you run the method all the way until you run out of predictors to choose, you will end up training \\(\\frac{1}{2}m(m+1)\\) separate models. This is a big improvement from the \\(2^m-1\\) models that best subset selection requires you to train! For example, while best subset selection requires training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models. Therefore we will continue the rest of this section using forward selection. Note: One word of caution before we move on. Every additional model that you train increases the likelihood that you will get unlucky and stumble on a model that has a high cross-validation accuracy estimate, but a low true accuracy on the test data and other future observations. Since forward selection involves training a lot of models, you run a fairly high risk of this happening. To keep this risk low, only use forward selection when you have a large amount of data and a relatively small total number of predictors.
More advanced methods do not suffer from this problem as much; see the additional resources at the end of this chapter for where to learn more about advanced predictor selection methods. 6.8.3 Forward selection in R We now turn to implementing forward selection in R. Unfortunately there is no built-in way to do this using the tidymodels framework, so we will have to code it ourselves. First we will use the select function to extract a smaller set of predictors to work with in this illustrative example—Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, and Irrelevant3—as well as the Class variable as the label. We will also extract the column names for the full set of predictors. cancer_subset <- cancer_irrelevant |> select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, Irrelevant3) names <- colnames(cancer_subset |> select(-Class)) cancer_subset ## # A tibble: 569 × 7 ## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2 Irrelevant3 ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Malignant 0.118 0.300 123. 1 0 1 ## 2 Malignant 0.0847 0.0869 133. 0 0 0 ## 3 Malignant 0.110 0.197 130 0 0 0 ## 4 Malignant 0.142 0.241 77.6 0 1 0 ## 5 Malignant 0.100 0.198 135. 0 0 0 ## 6 Malignant 0.128 0.158 82.6 1 0 1 ## 7 Malignant 0.0946 0.113 120. 0 1 1 ## 8 Malignant 0.119 0.0937 90.2 1 0 0 ## 9 Malignant 0.127 0.186 87.5 0 0 1 ## 10 Malignant 0.119 0.227 84.0 1 1 0 ## # ℹ 559 more rows The key idea of the forward selection code is to use the paste function (which concatenates strings separated by spaces) to create a model formula for each subset of predictors for which we want to build a model. The collapse argument tells paste what to put between the items in the list; to make a formula, we need to put a + symbol between each variable. 
As an example, let’s make a model formula for all the predictors, which should output something like Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3: example_formula <- paste("Class", "~", paste(names, collapse="+")) example_formula ## [1] "Class ~ Smoothness+Concavity+Perimeter+Irrelevant1+Irrelevant2+Irrelevant3" Finally, we need to write some code that performs the task of sequentially finding the best predictor to add to the model. If you recall the end of the wrangling chapter, we mentioned that sometimes one needs more flexible forms of iteration than what we have used earlier, and in these cases one typically resorts to a for loop; see the chapter on iteration in R for Data Science (Wickham and Grolemund 2016). Here we will use two for loops: one over increasing predictor set sizes (where you see for (i in 1:n_total) below), and another to check which predictor to add in each round (where you see for (j in 1:length(names)) below). For each set of predictors to try, we construct a model formula, pass it into a recipe, build a workflow that tunes a K-NN classifier using 5-fold cross-validation, and finally record the estimated accuracy.
# create an empty tibble to store the results accuracies <- tibble(size = integer(), model_string = character(), accuracy = numeric()) # create a model specification knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("classification") # create a 5-fold cross-validation object cancer_vfold <- vfold_cv(cancer_subset, v = 5, strata = Class) # store the total number of predictors n_total <- length(names) # stores selected predictors selected <- c() # for every size from 1 to the total number of predictors for (i in 1:n_total) { # for every predictor still not added yet accs <- list() models <- list() for (j in 1:length(names)) { # create a model string for this combination of predictors preds_new <- c(selected, names[[j]]) model_string <- paste("Class", "~", paste(preds_new, collapse="+")) # create a recipe from the model string cancer_recipe <- recipe(as.formula(model_string), data = cancer_subset) |> step_scale(all_predictors()) |> step_center(all_predictors()) # tune the K-NN classifier with these predictors, # and collect the accuracy for the best K acc <- workflow() |> add_recipe(cancer_recipe) |> add_model(knn_spec) |> tune_grid(resamples = cancer_vfold, grid = 10) |> collect_metrics() |> filter(.metric == "accuracy") |> summarize(mx = max(mean)) acc <- acc$mx |> unlist() # add this result to the dataframe accs[[j]] <- acc models[[j]] <- model_string } jstar <- which.max(unlist(accs)) accuracies <- accuracies |> add_row(size = i, model_string = models[[jstar]], accuracy = accs[[jstar]]) selected <- c(selected, names[[jstar]]) names <- names[-jstar] } accuracies ## # A tibble: 6 × 3 ## size model_string accuracy ## <int> <chr> <dbl> ## 1 1 Class ~ Perimeter 0.896 ## 2 2 Class ~ Perimeter+Concavity 0.916 ## 3 3 Class ~ Perimeter+Concavity+Smoothness 0.931 ## 4 4 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1 0.928 ## 5 5 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3 0.924 ## 6 6 Class 
~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3… 0.902 Interesting! The forward selection procedure first added the three meaningful variables Perimeter, Concavity, and Smoothness, followed by the irrelevant variables. Figure 6.12 visualizes the accuracy versus the number of predictors in the model. You can see that as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). The way to find that balance is to look for the elbow in Figure 6.12, i.e., the place on the plot where the accuracy stops increasing dramatically and levels off or begins to decrease. The elbow in Figure 6.12 appears to occur at the model with 3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors occurs with 3 variables: Class ~ Perimeter + Concavity + Smoothness. In other words, we have successfully removed irrelevant predictors from the model! It is always worth remembering, however, that what cross-validation gives you is an estimate of the true accuracy; you have to use your judgement when looking at this plot to decide where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy. Figure 6.12: Estimated accuracy versus the number of predictors for the sequence of models built using forward selection. Note: Since the choice of which variables to include as predictors is part of tuning your classifier, you cannot use your test data for this process! 
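As a complement to looking for the elbow visually, one can also inspect how much the estimated accuracy changes each time a predictor is added. The sketch below copies the accuracy estimates from the accuracies output above; the names acc_by_size and gain are hypothetical, and this numeric check is only an aid to judgement, not a replacement for examining the plot.

```r
# cross-validation accuracy estimates copied from the accuracies output above,
# for models with 1, 2, ..., 6 predictors
acc_by_size <- c(0.896, 0.916, 0.931, 0.928, 0.924, 0.902)

# change in estimated accuracy from adding each successive predictor
gain <- diff(acc_by_size)

# one row per model size from 2 to 6
data.frame(size = 2:6, gain = round(gain, 3))
```

The gains are positive for the second and third predictors and negative afterwards, which is consistent with placing the elbow at the model with 3 predictors.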
6.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Classification II: evaluation and tuning” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 6.10 Additional resources The tidymodels website is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a nice beginner’s tutorial and an extensive list of more advanced examples that you can use to continue learning beyond the scope of this book. It’s worth noting that the tidymodels package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you’ll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters. An Introduction to Statistical Learning (James et al. 2013) provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. 
Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require. References "],["regression1.html", "Chapter 7 Regression I: K-nearest neighbors 7.1 Overview 7.2 Chapter learning objectives 7.3 The regression problem 7.4 Exploring a data set 7.5 K-nearest neighbors regression 7.6 Training, evaluating, and tuning the model 7.7 Underfitting and overfitting 7.8 Evaluating on the test set 7.9 Multivariable K-NN regression 7.10 Strengths and limitations of K-NN regression 7.11 Exercises", " Chapter 7 Regression I: K-nearest neighbors 7.1 Overview This chapter continues our foray into answering predictive questions. Here we will focus on predicting numerical variables and will use regression to perform this task. This is unlike the past two chapters, which focused on predicting categorical variables via classification. However, regression does have many similarities to classification: for example, just as in the case of classification, we will split our data into training, validation, and test sets, we will use tidymodels workflows, we will use a K-nearest neighbors (K-NN) approach to make predictions, and we will use cross-validation to choose K. Because of how similar these procedures are, make sure to read Chapters 5 and 6 before reading this one—we will move a little bit faster here with the concepts that have already been covered. This chapter will primarily focus on the case where there is a single predictor, but the end of the chapter shows how to perform regression with more than one predictor variable, i.e., multivariable regression. It is important to note that regression can also be used to answer inferential and causal questions, however that is beyond the scope of this book. 7.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Recognize situations where a regression analysis would be appropriate for making predictions. 
Explain the K-nearest neighbors (K-NN) regression algorithm and describe how it differs from K-NN classification. Interpret the output of a K-NN regression. In a data set with two or more variables, perform K-nearest neighbors regression in R. Evaluate K-NN regression prediction quality in R using the root mean squared prediction error (RMSPE). Estimate the RMSPE in R using cross-validation or a test set. Choose the number of neighbors in K-nearest neighbors regression by minimizing estimated cross-validation RMSPE. Describe underfitting and overfitting, and relate them to the number of neighbors in K-nearest neighbors regression. Describe the advantages and disadvantages of K-nearest neighbors regression. 7.3 The regression problem Regression, like classification, is a predictive problem setting where we want to use past information to predict future observations. But in the case of regression, the goal is to predict numerical values instead of categorical values. The variable that you want to predict is often called the response variable. For example, we could try to use the number of hours a person spends on exercise each week to predict their race time in the annual Boston marathon. As another example, we could try to use the size of a house to predict its sale price. Both of these response variables—race time and sale price—are numerical, and so predicting them given past data is considered a regression problem. Just like in the classification setting, there are many possible methods that we can use to predict numerical response variables. In this chapter we will focus on the K-nearest neighbors algorithm (Fix and Hodges 1951; Cover and Hart 1967), and in the next chapter we will study linear regression. In your future studies, you might encounter regression trees, splines, and general local regression methods; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. 
Many of the concepts from classification map over to the setting of regression. For example, a regression model predicts a new observation’s response variable based on the response variables for similar observations in the data set of past observations. When building a regression model, we first split the data into training and test sets, in order to ensure that we assess the performance of our method on observations not seen during training. And finally, we can use cross-validation to evaluate different choices of model parameters (e.g., K in a K-nearest neighbors model). The major difference is that we are now predicting numerical variables instead of categorical variables. Note: You can usually tell whether a variable is numerical or categorical—and therefore whether you need to perform regression or classification—by taking the response variable for two observations X and Y from your data, and asking the question, “is response variable X more than response variable Y?” If the variable is categorical, the question will make no sense. (Is blue more than red? Is benign more than malignant?) If the variable is numerical, it will make sense. (Is 1.5 hours more than 2.25 hours? Is $500,000 more than $400,000?) Be careful when applying this heuristic, though: sometimes categorical variables will be encoded as numbers in your data (e.g., “1” represents “benign”, and “0” represents “malignant”). In these cases you have to ask the question about the meaning of the labels (“benign” and “malignant”), not their values (“1” and “0”). 7.4 Exploring a data set In this chapter and the next, we will study a data set of 932 real estate transactions in Sacramento, California originally reported in the Sacramento Bee newspaper. We first need to formulate a precise question that we want to answer. In this example, our question is again predictive: Can we use the size of a house in the Sacramento, CA area to predict its sale price? 
A rigorous, quantitative answer to this question might help a realtor advise a client as to whether the price of a particular listing is fair, or perhaps how to set the price of a new listing. We begin the analysis by loading and examining the data, and setting the seed value. library(tidyverse) library(tidymodels) library(gridExtra) set.seed(5) sacramento <- read_csv("data/sacramento.csv") sacramento ## # A tibble: 932 × 9 ## city zip beds baths sqft type price latitude longitude ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 SACRAMENTO z95838 2 1 836 Residential 59222 38.6 -121. ## 2 SACRAMENTO z95823 3 1 1167 Residential 68212 38.5 -121. ## 3 SACRAMENTO z95815 2 1 796 Residential 68880 38.6 -121. ## 4 SACRAMENTO z95815 2 1 852 Residential 69307 38.6 -121. ## 5 SACRAMENTO z95824 2 1 797 Residential 81900 38.5 -121. ## 6 SACRAMENTO z95841 3 1 1122 Condo 89921 38.7 -121. ## 7 SACRAMENTO z95842 3 2 1104 Residential 90895 38.7 -121. ## 8 SACRAMENTO z95820 3 1 1177 Residential 91002 38.5 -121. ## 9 RANCHO_CORDOVA z95670 2 2 941 Condo 94905 38.6 -121. ## 10 RIO_LINDA z95673 3 2 1146 Residential 98937 38.7 -121. ## # ℹ 922 more rows The scientific question guides our initial exploration: the columns in the data that we are interested in are sqft (house size, in livable square feet) and price (house sale price, in US dollars (USD)). The first step is to visualize the data as a scatter plot where we place the predictor variable (house size) on the x-axis, and we place the response variable that we want to predict (sale price) on the y-axis. Note: Given that the y-axis unit is dollars in Figure 7.1, we format the axis labels to put dollar signs in front of the house prices, as well as commas to increase the readability of the larger numbers. We can do this in R by passing the dollar_format function (from the scales package) to the labels argument of the scale_y_continuous function. 
eda <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + theme(text = element_text(size = 12)) eda Figure 7.1: Scatter plot of price (USD) versus house size (square feet). The plot is shown in Figure 7.1. We can see that in Sacramento, CA, as the size of a house increases, so does its sale price. Thus, we can reason that we may be able to use the size of a not-yet-sold house (for which we don’t know the sale price) to predict its final sale price. Note that we do not suggest here that a larger house size causes a higher sale price; just that house price tends to increase with house size, and that we may be able to use the latter to predict the former. 7.5 K-nearest neighbors regression Much like in the case of classification, we can use a K-nearest neighbors-based approach in regression to make predictions. Let’s take a small sample of the data in Figure 7.1 and walk through how K-nearest neighbors (K-NN) works in a regression context before we dive into creating our model and assessing how well it predicts house sale price. This subsample is taken to allow us to illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. To take a small random sample of size 30, we’ll use the function slice_sample, and input the data frame to sample from and the number of rows to randomly select. small_sacramento <- slice_sample(sacramento, n = 30) Next, let’s say we come across a 2,000 square-foot house in Sacramento we are interested in purchasing, with an advertised list price of $350,000. Should we offer to pay the asking price for this house, or is it overpriced, so that we should offer less? Absent any other information, we can get a sense for a good answer to this question by using the data we have to predict the sale price given the sale prices we have already observed. 
But in Figure 7.2, you can see that we have no observations of a house of size exactly 2,000 square feet. How can we predict the sale price? small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) + geom_point() + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + geom_vline(xintercept = 2000, linetype = "dashed") + theme(text = element_text(size = 12)) small_plot Figure 7.2: Scatter plot of price (USD) versus house size (square feet) with vertical line indicating 2,000 square feet on x-axis. We will employ the same intuition from the classification chapter, and use the neighboring points to the new point of interest to suggest/predict what its sale price might be. For the example shown in Figure 7.2, we find and label the 5 nearest neighbors to our observation of a house that is 2,000 square feet. nearest_neighbors <- small_sacramento |> mutate(diff = abs(2000 - sqft)) |> slice_min(diff, n = 5) nearest_neighbors ## # A tibble: 5 × 10 ## city zip beds baths sqft type price latitude longitude diff ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 ROSEVILLE z95661 3 2 2049 Residenti… 395500 38.7 -121. 49 ## 2 ANTELOPE z95843 4 3 2085 Residenti… 408431 38.7 -121. 85 ## 3 SACRAMENTO z95823 4 2 1876 Residenti… 299940 38.5 -121. 124 ## 4 ROSEVILLE z95747 3 2.5 1829 Residenti… 306500 38.8 -121. 171 ## 5 SACRAMENTO z95825 4 2 1776 Multi_Fam… 221250 38.6 -121. 224 Figure 7.3: Scatter plot of price (USD) versus house size (square feet) with lines to 5 nearest neighbors (highlighted in orange). Figure 7.3 illustrates the differences in house size between our new 2,000 square-foot house of interest and its 5 nearest neighbors (in terms of house size). Now that we have obtained these nearest neighbors, we can use their values to predict the sale price for the new home. 
Specifically, we can take the mean (or average) of these 5 values as our predicted value, as illustrated by the red point in Figure 7.4. prediction <- nearest_neighbors |> summarise(predicted = mean(price)) prediction ## # A tibble: 1 × 1 ## predicted ## <dbl> ## 1 326324. Figure 7.4: Scatter plot of price (USD) versus house size (square feet) with predicted price for a 2,000 square-foot house based on 5 nearest neighbors represented as a red dot. Our predicted price is $326,324 (shown as a red point in Figure 7.4), which is much less than $350,000; perhaps we might want to offer less than the list price at which the house is advertised. But this is only the very beginning of the story. We still have all the same unanswered questions here with K-NN regression that we had with K-NN classification: which \\(K\\) do we choose, and is our model any good at making predictions? In the next few sections, we will address these questions in the context of K-NN regression. One strength of the K-NN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships (i.e., if the relationship is not a straight line). This stems from the use of nearest neighbors to predict values. The algorithm really has very few assumptions about what the data must look like for it to work. 7.6 Training, evaluating, and tuning the model As usual, we must start by putting some test data away in a lock box that we will come back to only after we choose our final model. Let’s take care of that now. Note that for the remainder of the chapter we’ll be working with the entire Sacramento data set, as opposed to the smaller sample of 30 points that we used earlier in the chapter (Figure 7.2). sacramento_split <- initial_split(sacramento, prop = 0.75, strata = price) sacramento_train <- training(sacramento_split) sacramento_test <- testing(sacramento_split) Next, we’ll use cross-validation to choose \\(K\\). 
In K-NN classification, we used accuracy to see how well our predictions matched the true labels. We cannot use the same metric in the regression setting, since our predictions will almost never exactly match the true response variable values. Therefore in the context of K-NN regression we will use root mean squared prediction error (RMSPE) instead. The mathematical formula for calculating RMSPE is: \\[\\text{RMSPE} = \\sqrt{\\frac{1}{n}\\sum\\limits_{i=1}^{n}(y_i - \\hat{y}_i)^2}\\] where: \\(n\\) is the number of observations, \\(y_i\\) is the observed value for the \\(i^\\text{th}\\) observation, and \\(\\hat{y}_i\\) is the forecasted/predicted value for the \\(i^\\text{th}\\) observation. In other words, we compute the squared difference between the predicted and true response value for each observation in our test (or validation) set, compute the average, and then finally take the square root. The reason we use the squared difference (and not just the difference) is that the differences can be positive or negative, i.e., we can overshoot or undershoot the true response value. Figure 7.5 illustrates both positive and negative differences between predicted and true response values. So if we want to measure error—a notion of distance between our predicted and true response values—we want to make sure that we are only adding up positive values, with larger positive values representing larger mistakes. If the predictions are very close to the true values, then RMSPE will be small. If, on the other hand, the predictions are very different from the true values, then RMSPE will be quite large. When we use cross-validation, we will choose the \\(K\\) that gives us the smallest RMSPE. Figure 7.5: Scatter plot of price (USD) versus house size (square feet) with example predictions (blue line) and the error in those predictions compared with true response values (vertical lines). 
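To make the RMSPE formula concrete, here is a minimal sketch in base R. It assumes four made-up observed and predicted sale prices; the helper name rmspe and all of the numbers are ours, chosen purely for illustration (in practice tidymodels computes this metric for us). The predictions overshoot twice and undershoot twice by matching amounts, so the plain differences average to zero even though every prediction is wrong, while the RMSPE is not fooled.

```r
# Hypothetical helper implementing the RMSPE formula above
rmspe <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up observed and predicted sale prices (USD)
observed  <- c(200000, 300000, 400000, 500000)
predicted <- c(210000, 290000, 420000, 480000)

# The raw differences cancel out: their mean is 0
mean(observed - predicted)

# The RMSPE is about 15,811 (USD), reflecting the actual size of the errors
rmspe(observed, predicted)
```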
Note: When using many code packages (tidymodels included), the evaluation output we will get to assess the prediction quality of our K-NN regression models is labeled “RMSE”, or “root mean squared error”. Why is this so, and why not RMSPE? In statistics, we try to be very precise with our language to indicate whether we are calculating the prediction error on the training data (in-sample prediction) versus on the testing data (out-of-sample prediction). When predicting and evaluating prediction quality on the training data, we say RMSE. By contrast, when predicting and evaluating prediction quality on the testing or validation data, we say RMSPE. The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the \\(y\\)s are training or testing data. But many people just use RMSE for both, and rely on context to denote which data the root mean squared error is being calculated on. Now that we know how we can assess how well our model predicts a numerical value, let’s use R to perform cross-validation and to choose the optimal \\(K\\). First, we will create a recipe for preprocessing our data. Note that we include standardization in our preprocessing to build good habits, but since we only have one predictor, it is technically not necessary; there is no risk of comparing two predictors of different scales. Next we create a model specification for K-nearest neighbors regression. Note that we use set_mode(\"regression\") now in the model specification to denote a regression problem, as opposed to the classification problems from the previous chapters. The use of set_mode(\"regression\") essentially tells tidymodels that we need to use different metrics (RMSPE, not accuracy) for tuning and evaluation. Then we create a 5-fold cross-validation object, and put the recipe and model specification together in a workflow. 
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("regression") sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price) sacr_wkflw <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) sacr_wkflw ## ══ Workflow ══════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## K-Nearest Neighbor Model Specification (regression) ## ## Main Arguments: ## neighbors = tune() ## weight_func = rectangular ## ## Computational engine: kknn Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200. The following code tunes the model and returns the RMSPE for each number of neighbors. In the sacr_results data frame, we see that the neighbors variable contains the value of \\(K\\), the mean (mean) contains the value of the RMSPE estimated via cross-validation, and the standard error (std_err) contains a value corresponding to a measure of how uncertain we are in the mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean RMSPE is $100,000 and standard error is $1,000, you can expect the true RMSPE to be somewhere roughly between $99,000 and $101,000 (although it may fall outside this range). You may ignore the other columns in the metrics data frame, as they do not provide any additional insight. 
gridvals <- tibble(neighbors = seq(from = 1, to = 200, by = 3)) sacr_results <- sacr_wkflw |> tune_grid(resamples = sacr_vfold, grid = gridvals) |> collect_metrics() |> filter(.metric == "rmse") # show the results sacr_results ## # A tibble: 67 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 rmse standard 107206. 5 4102. Preprocessor1_Model01 ## 2 4 rmse standard 90469. 5 3312. Preprocessor1_Model02 ## 3 7 rmse standard 86580. 5 3062. Preprocessor1_Model03 ## 4 10 rmse standard 85321. 5 3395. Preprocessor1_Model04 ## 5 13 rmse standard 85045. 5 3641. Preprocessor1_Model05 ## 6 16 rmse standard 84675. 5 3679. Preprocessor1_Model06 ## 7 19 rmse standard 84776. 5 3984. Preprocessor1_Model07 ## 8 22 rmse standard 84617. 5 3952. Preprocessor1_Model08 ## 9 25 rmse standard 84953. 5 3929. Preprocessor1_Model09 ## 10 28 rmse standard 84612. 5 3917. Preprocessor1_Model10 ## # ℹ 57 more rows Figure 7.6: Effect of the number of neighbors on the RMSPE. Figure 7.6 visualizes how the RMSPE varies with the number of neighbors \\(K\\). We take the minimum RMSPE to find the best setting for the number of neighbors: # show only the row of minimum RMSPE sacr_min <- sacr_results |> filter(mean == min(mean)) sacr_min ## # A tibble: 1 × 7 ## neighbors .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 52 rmse standard 84561. 5 4470. Preprocessor1_Model18 The smallest RMSPE occurs when \\(K =\\) 52. 7.7 Underfitting and overfitting Similar to the setting of classification, by setting the number of neighbors to be too small or too large, we cause the RMSPE to increase, as shown in Figure 7.6. What is happening here? Figure 7.7 visualizes the effect of different settings of \\(K\\) on the regression model. 
Each plot shows the predicted values for house sale price from our K-NN regression model on the training data for 6 different values for \\(K\\): 1, 3, 25, 52, 250, and 680 (almost the entire training set). For each model, we predict prices for the range of possible home sizes we observed in the data set (here 500 to 5,000 square feet) and we plot the predicted prices as a blue line. Figure 7.7: Predicted values for house price (represented as a blue line) from K-NN regression models for six different values for \\(K\\). Figure 7.7 shows that when \\(K\\) = 1, the blue line runs perfectly through (almost) all of our training observations. This happens because our predicted values for a given region (typically) depend on just a single observation. In general, when \\(K\\) is too small, the line follows the training data quite closely, even if it does not match it perfectly. If we used a different training data set of house prices and sizes from the Sacramento real estate market, we would end up with completely different predictions. In other words, the model is influenced too much by the data. Because the model follows the training data so closely, it will not make accurate predictions on new observations which, generally, will not have the same fluctuations as the original training data. Recall from the classification chapters that this behavior—where the model is influenced too much by the noisy data—is called overfitting; we use this same term in the context of regression. What about the plots in Figure 7.7 where \\(K\\) is quite large, say, \\(K\\) = 250 or 680? In this case the blue line becomes extremely smooth, and actually becomes flat once \\(K\\) is equal to the number of datapoints in the training set. 
This happens because our predicted values for a given x value (here, home size), depend on many neighboring observations; in the case where \\(K\\) is equal to the size of the training set, the prediction is just the mean of the house prices (completely ignoring the house size). In contrast to the \\(K=1\\) example, the smooth, inflexible blue line does not follow the training observations very closely. In other words, the model is not influenced enough by the training data. Recall from the classification chapters that this behavior is called underfitting; we again use this same term in the context of regression. Ideally, what we want is neither of the two situations discussed above. Instead, we would like a model that (1) follows the overall “trend” in the training data, so the model actually uses the training data to learn something useful, and (2) does not follow the noisy fluctuations, so that we can be confident that our model will transfer/generalize well to other new data. If we explore the other values for \\(K\\), in particular \\(K\\) = 52 (as suggested by cross-validation), we can see it achieves this goal: it follows the increasing trend of house price versus house size, but is not influenced too much by the idiosyncratic variations in price. All of this is similar to how the choice of \\(K\\) affects K-nearest neighbors classification, as discussed in the previous chapter. 7.8 Evaluating on the test set To assess how well our model might do at predicting on unseen data, we will assess its RMSPE on the test data. To do this, we will first re-train our K-NN regression model on the entire training data set, using \\(K =\\) 52 neighbors. Then we will use predict to make predictions on the test data, and use the metrics function again to compute the summary of regression quality. Because we specify that we are performing regression in set_mode, the metrics function knows to output a quality summary related to regression, and not, say, classification. 
kmin <- sacr_min |> pull(neighbors) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) |> set_engine("kknn") |> set_mode("regression") sacr_fit <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) |> fit(data = sacramento_train) sacr_summary <- sacr_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) |> metrics(truth = price, estimate = .pred) |> filter(.metric == 'rmse') sacr_summary ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 90529. Our final model’s test error as assessed by RMSPE is $90,529. Note that RMSPE is measured in the same units as the response variable. In other words, on new observations, we expect the error in our prediction to be roughly $90,529. From one perspective, this is good news: this is about the same as the cross-validation RMSPE estimate of our tuned model (which was $84,561), so we can say that the model appears to generalize well to new data that it has never seen before. However, much like in the case of K-NN classification, whether this value for RMSPE is good—i.e., whether an error of around $90,529 is acceptable—depends entirely on the application. In this application, this error is not prohibitively large, but it is not negligible either; $90,529 might represent a substantial fraction of a home buyer’s budget, and could make or break whether or not they could afford to put an offer on a house. Finally, Figure 7.8 shows the predictions that our final model makes across the range of house sizes we might encounter in the Sacramento area. Note that instead of predicting the house price only for those house sizes that happen to appear in our data, we predict it for evenly spaced values between the minimum and maximum in the data set (roughly 500 to 5000 square feet). We superimpose this prediction line on a scatter plot of the original housing price data, so that we can qualitatively assess if the model seems to fit the data well. 
You have already seen a few plots like this in this chapter, but here we also provide the code that generated it as a learning opportunity. sqft_prediction_grid <- tibble( sqft = seq( from = sacramento |> select(sqft) |> min(), to = sacramento |> select(sqft) |> max(), by = 10 ) ) sacr_preds <- sacr_fit |> predict(sqft_prediction_grid) |> bind_cols(sqft_prediction_grid) plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + geom_line(data = sacr_preds, mapping = aes(x = sqft, y = .pred), color = "steelblue", linewidth = 1) + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + ggtitle(paste0("K = ", kmin)) + theme(text = element_text(size = 12)) plot_final Figure 7.8: Predicted values of house price (blue line) for the final K-NN regression model. 7.9 Multivariable K-NN regression As in K-NN classification, we can use multiple predictors in K-NN regression. In this setting, we have the same concerns regarding the scale of the predictors. Once again, predictions are made by identifying the \\(K\\) observations that are nearest to the new point we want to predict; any variables that are on a large scale will have a much larger effect than variables on a small scale. But since the recipe we built above scales and centers all predictor variables, this is handled for us. Note that we also have the same concern regarding the selection of predictors in K-NN regression as in K-NN classification: having more predictors is not always better, and the choice of which predictors to use has a potentially large influence on the quality of predictions. Fortunately, we can use the predictor selection algorithm from the classification chapter in K-NN regression as well. As the algorithm is the same, we will not cover it again in this chapter. We will now demonstrate a multivariable K-NN regression analysis of the Sacramento real estate data using tidymodels. 
This time we will use house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our response variable that we are trying to predict. It is always a good practice to do exploratory data analysis, such as visualizing the data, before we start modeling the data. Figure 7.9 shows that the number of bedrooms might provide useful information to help predict the sale price of a house. plot_beds <- sacramento |> ggplot(aes(x = beds, y = price)) + geom_point(alpha = 0.4) + labs(x = 'Number of Bedrooms', y = 'Price (USD)') + theme(text = element_text(size = 12)) plot_beds Figure 7.9: Scatter plot of the sale price of houses versus the number of bedrooms. Figure 7.9 shows that as the number of bedrooms increases, the house sale price tends to increase as well, but that the relationship is quite weak. Does adding the number of bedrooms to our model improve our ability to predict price? To answer that question, we will have to create a new K-NN regression model using house size and number of bedrooms, and then we can compare it to the model we previously came up with that only used house size. Let’s do that now! First we’ll build a new model specification and recipe for the analysis. Note that we use the formula price ~ sqft + beds to denote that we have two predictors, and set neighbors = tune() to tell tidymodels to tune the number of neighbors for us. 
sacr_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) |> step_scale(all_predictors()) |> step_center(all_predictors()) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> set_engine("kknn") |> set_mode("regression") Next, we’ll use 5-fold cross-validation to choose the number of neighbors via the minimum RMSPE: gridvals <- tibble(neighbors = seq(1, 200)) sacr_multi <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) |> tune_grid(sacr_vfold, grid = gridvals) |> collect_metrics() |> filter(.metric == "rmse") |> filter(mean == min(mean)) sacr_k <- sacr_multi |> pull(neighbors) sacr_multi ## # A tibble: 1 × 7 ## neighbors .metric .estimator mean n std_err .config ## <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 11 rmse standard 81839. 5 3108. Preprocessor1_Model011 Here we see that the smallest estimated RMSPE from cross-validation occurs when \\(K =\\) 11. If we want to compare this multivariable K-NN regression model to the model with only a single predictor as part of the model tuning process (e.g., if we are running forward selection as described in the chapter on evaluating and tuning classification models), then we must compare the RMSPE estimated using only the training data via cross-validation. Looking back, the estimated cross-validation RMSPE for the single-predictor model was $84,561. The estimated cross-validation RMSPE for the multivariable model is $81,839. Thus in this case, we did not improve the model by a large amount by adding this additional predictor. Regardless, let’s continue the analysis to see how we can make predictions with a multivariable K-NN regression model and evaluate its performance on test data. We first need to re-train the model on the entire training data set with \\(K =\\) 11, and then use that model to make predictions on the test data. 
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = sacr_k) |> set_engine("kknn") |> set_mode("regression") knn_mult_fit <- workflow() |> add_recipe(sacr_recipe) |> add_model(sacr_spec) |> fit(data = sacramento_train) knn_mult_preds <- knn_mult_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) |> filter(.metric == 'rmse') knn_mult_mets ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 90862. This time, when we performed K-NN regression on the same data set, but also included number of bedrooms as a predictor, we obtained an RMSPE test error of $90,862. Figure 7.10 visualizes the model’s predictions overlaid on top of the data. This time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2 predictors instead of 1. Figure 7.10: K-NN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three variables (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes. We can see that the predictions in this case, where we have 2 predictors, form a surface instead of a line. Because the newly added predictor (number of bedrooms) is related to price (as the number of bedrooms changes, price tends to change as well) and is not totally determined by house size (our other predictor), we get additional and useful information for making our predictions. For example, in this model we would predict that the cost of a house with a size of 2,500 square feet generally increases slightly as the number of bedrooms increases. Without the additional predictor of number of bedrooms, we would predict the same price for any two houses of this size, regardless of how many bedrooms each has. 
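To see the mechanics behind a two-predictor K-NN prediction, the following is a minimal base R sketch. It assumes a tiny made-up data set; the data frame toy and the helper knn_predict are hypothetical names used only for illustration, and this is not how the kknn engine is implemented internally. The sketch standardizes both predictors, computes the straight-line distance from the new house to each training house, and averages the prices of the K closest ones.

```r
# Made-up training data (not the Sacramento data)
toy <- data.frame(
  sqft  = c(1000, 1500, 2000, 2500, 2500, 3000),
  beds  = c(2, 3, 3, 3, 5, 4),
  price = c(150000, 220000, 300000, 340000, 380000, 420000)
)

# Hypothetical helper: K-NN regression prediction for one new house
knn_predict <- function(train, new_sqft, new_beds, k) {
  # Standardize each predictor using the training mean and standard deviation
  z_sqft <- (train$sqft - mean(train$sqft)) / sd(train$sqft)
  z_beds <- (train$beds - mean(train$beds)) / sd(train$beds)
  new_z_sqft <- (new_sqft - mean(train$sqft)) / sd(train$sqft)
  new_z_beds <- (new_beds - mean(train$beds)) / sd(train$beds)

  # Straight-line distance in the standardized predictor space
  dists <- sqrt((z_sqft - new_z_sqft)^2 + (z_beds - new_z_beds)^2)

  # Average the prices of the k closest training houses
  nearest <- order(dists)[1:k]
  mean(train$price[nearest])
}

# Two houses of the same size but with different numbers of bedrooms
knn_predict(toy, new_sqft = 2500, new_beds = 3, k = 1)  # 340000
knn_predict(toy, new_sqft = 2500, new_beds = 5, k = 1)  # 380000
```

With house size alone, these two new houses would be indistinguishable and would receive the same prediction; once the number of bedrooms is included in the distance, their nearest neighbors differ, and so do their predicted prices.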
7.10 Strengths and limitations of K-NN regression As with K-NN classification (or any prediction algorithm for that matter), K-NN regression has both strengths and weaknesses. Some are listed here: Strengths: K-nearest neighbors regression is a simple, intuitive algorithm, requires few assumptions about what the data must look like, and works well with non-linear relationships (i.e., if the relationship is not a straight line). Weaknesses: K-nearest neighbors regression becomes very slow as the training data gets larger, may not perform well with a large number of predictors, and may not predict well beyond the range of values input in your training data. 7.11 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Regression I: K-nearest neighbors” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. References "],["regression2.html", "Chapter 8 Regression II: linear regression 8.1 Overview 8.2 Chapter learning objectives 8.3 Simple linear regression 8.4 Linear regression in R 8.5 Comparing simple linear and K-NN regression 8.6 Multivariable linear regression 8.7 Multicollinearity and outliers 8.8 Designing new predictors 8.9 The other sides of regression 8.10 Exercises 8.11 Additional resources", " Chapter 8 Regression II: linear regression 8.1 Overview Up to this point, we have solved all of our predictive problems—both classification and regression—using K-nearest neighbors (K-NN)-based approaches. 
In the context of regression, there is another commonly used method known as linear regression. This chapter provides an introduction to the basic concept of linear regression, shows how to use tidymodels to perform linear regression in R, and characterizes its strengths and weaknesses compared to K-NN regression. The focus is, as usual, on the case where there is a single predictor and single response variable of interest; but the chapter concludes with an example using multivariable linear regression when there is more than one predictor. 8.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Use R to fit simple and multivariable linear regression models on training data. Evaluate the linear regression model on test data. Compare and contrast predictions obtained from K-nearest neighbors regression to those obtained using linear regression from the same data set. Describe how linear regression is affected by outliers and multicollinearity. 8.3 Simple linear regression At the end of the previous chapter, we noted some limitations of K-NN regression. While the method is simple and easy to understand, K-NN regression does not predict well beyond the range of the predictors in the training data, and the method gets significantly slower as the training data set grows. Fortunately, there is an alternative to K-NN regression—linear regression—that addresses both of these limitations. Linear regression is also very commonly used in practice because it provides an interpretable mathematical equation that describes the relationship between the predictor and response variables. In this first part of the chapter, we will focus on simple linear regression, which involves only one predictor variable and one response variable; later on, we will consider multivariable linear regression, which involves multiple predictor variables. 
Like K-NN regression, simple linear regression involves predicting a numerical response variable (like race time, house price, or height); but how it makes those predictions for a new observation is quite different from K-NN regression. Instead of looking at the K nearest neighbors and averaging over their values for a prediction, in simple linear regression, we create a straight line of best fit through the training data and then “look up” the prediction using the line. Note: Although we did not cover it in earlier chapters, there is another popular method for classification called logistic regression (it is used for classification even though the name, somewhat confusingly, has the word “regression” in it). In logistic regression—similar to linear regression—you “fit” the model to the training data and then “look up” the prediction for each new observation. Logistic regression and K-NN classification have an advantage/disadvantage comparison similar to that of linear regression and K-NN regression. It is useful to have a good understanding of linear regression before learning about logistic regression. After reading this chapter, see the “Additional Resources” section at the end of the classification chapters to learn more about logistic regression. Let’s return to the Sacramento housing data from Chapter 7 to learn how to apply linear regression and compare it to K-NN regression. For now, we will consider a smaller version of the housing data to help make our visualizations clear. Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict its sale price? In particular, recall that we have come across a new 2,000 square-foot house we are interested in purchasing with an advertised list price of $350,000. Should we offer the list price, or is that over/undervalued? To answer this question using simple linear regression, we use the data we have to draw the straight line of best fit through our existing data points. 
The small subset of data as well as the line of best fit are shown in Figure 8.1. Figure 8.1: Scatter plot of sale price versus size with line of best fit for subset of the Sacramento housing data. The equation for the straight line is: \\[\\text{house sale price} = \\beta_0 + \\beta_1 \\cdot (\\text{house size}),\\] where \\(\\beta_0\\) is the vertical intercept of the line (the price when house size is 0) \\(\\beta_1\\) is the slope of the line (how quickly the price increases as you increase house size) Therefore using the data to find the line of best fit is equivalent to finding coefficients \\(\\beta_0\\) and \\(\\beta_1\\) that parametrize (correspond to) the line of best fit. Now of course, in this particular problem, the idea of a 0 square-foot house is a bit silly; but you can think of \\(\\beta_0\\) here as the “base price,” and \\(\\beta_1\\) as the increase in price for each square foot of space. Let’s push this thought even further: what would happen in the equation for the line if you tried to evaluate the price of a house with size 6 million square feet? Or what about negative 2,000 square feet? As it turns out, nothing in the formula breaks; linear regression will happily make predictions for nonsensical predictor values if you ask it to. But even though you can make these wild predictions, you shouldn’t. You should only make predictions roughly within the range of your original data, and perhaps a bit beyond it only if it makes sense. For example, the data in Figure 8.1 only reaches around 800 square feet on the low end, but it would probably be reasonable to use the linear regression model to make a prediction at 600 square feet, say. Back to the example! Once we have the coefficients \\(\\beta_0\\) and \\(\\beta_1\\), we can use the equation above to evaluate the predicted sale price given the value we have for the predictor variable—here 2,000 square feet. Figure 8.2 demonstrates this process. 
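The "look up" step is plain arithmetic once the coefficients are known. As a minimal sketch, the snippet below evaluates a line at 2,000 square feet. The coefficient values are placeholders: they are the rounded intercept and slope that this chapter reports for the full training data in the section on linear regression in R, not the fit to the small subset, so the result will not match the prediction quoted for the small subset. The function name `predict_price` is hypothetical.

```r
# Placeholder coefficients: the rounded intercept and slope reported for the
# full training data elsewhere in this chapter (not the small-subset fit)
beta_0 <- 18450   # intercept, in USD
beta_1 <- 135     # slope, in USD per square foot

# Hypothetical helper: evaluate the line at a given house size
predict_price <- function(sqft) {
  beta_0 + beta_1 * sqft
}

# Predicted sale price for a 2,000 square-foot house under these coefficients
predicted_price <- predict_price(2000)
predicted_price
```

Nothing in `predict_price` stops you from passing a negative or enormous house size, which is exactly why predictions should stay roughly within the range of the original data.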
Figure 8.2: Scatter plot of sale price versus size with line of best fit and a red dot at the predicted sale price for a 2,000 square-foot home. By using simple linear regression on this small data set to predict the sale price for a 2,000 square-foot house, we get a predicted value of $295,564. But wait a minute… how exactly does simple linear regression choose the line of best fit? Many different lines could be drawn through the data points. Some plausible examples are shown in Figure 8.3. Figure 8.3: Scatter plot of sale price versus size with many possible lines that could be drawn through the data points. Simple linear regression chooses the straight line of best fit by choosing the line that minimizes the average squared vertical distance between itself and each of the observed data points in the training data (equivalent to minimizing the RMSE). Figure 8.4 illustrates these vertical distances as red lines. Finally, to assess the predictive accuracy of a simple linear regression model, we use RMSPE—the same measure of predictive performance we used with K-NN regression. Figure 8.4: Scatter plot of sale price versus size with red lines denoting the vertical distances between the predicted values and the observed data points. 8.4 Linear regression in R We can perform simple linear regression in R using tidymodels in a very similar manner to how we performed K-NN regression. To do this, instead of creating a nearest_neighbor model specification with the kknn engine, we use a linear_reg model specification with the lm engine. Another difference is that we do not need to choose \\(K\\) in the context of linear regression, and so we do not need to perform cross-validation. Below we illustrate how we can use the usual tidymodels workflow to predict house sale price given house size using a simple linear regression approach using the full Sacramento real estate data set. 
As usual, we start by loading packages, setting the seed, loading data, and putting some test data away in a lock box that we can come back to after we choose our final model. Let’s take care of that now. library(tidyverse) library(tidymodels) set.seed(7) sacramento <- read_csv("data/sacramento.csv") sacramento_split <- initial_split(sacramento, prop = 0.75, strata = price) sacramento_train <- training(sacramento_split) sacramento_test <- testing(sacramento_split) Now that we have our training data, we will create the model specification and recipe, and fit our simple linear regression model: lm_spec <- linear_reg() |> set_engine("lm") |> set_mode("regression") lm_recipe <- recipe(price ~ sqft, data = sacramento_train) lm_fit <- workflow() |> add_recipe(lm_recipe) |> add_model(lm_spec) |> fit(data = sacramento_train) lm_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ────────── ## 0 Recipe Steps ## ## ── Model ────────── ## ## Call: ## stats::lm(formula = ..y ~ ., data = data) ## ## Coefficients: ## (Intercept) sqft ## 18450.3 134.8 Note: An additional difference that you will notice here is that we do not standardize (i.e., scale and center) our predictors. In K-nearest neighbors models, recall that the model fit changes depending on whether we standardize first or not. In linear regression, standardization does not affect the fit (it does affect the coefficients in the equation, though!). So you can standardize if you want—it won’t hurt anything—but if you leave the predictors in their original form, the best fit coefficients are usually easier to interpret afterward. Our coefficients are (intercept) \\(\\beta_0=\\) 18450 and (slope) \\(\\beta_1=\\) 135. 
This means that the equation of the line of best fit is \\[\\text{house sale price} = 18450 + 135\\cdot (\\text{house size}).\\] In other words, the model predicts that houses start at $18,450 for 0 square feet, and that every extra square foot increases the cost of the house by $135. Finally, we predict on the test data set to assess how well our model does: lm_test_results <- lm_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) |> metrics(truth = price, estimate = .pred) lm_test_results ## # A tibble: 3 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 88528. ## 2 rsq standard 0.608 ## 3 mae standard 61892. Our final model’s test error as assessed by RMSPE is $88,528. Remember that this is in units of the response variable, and here that is US Dollars (USD). Does this mean our model is “good” at predicting house sale price based off of the predictor of home size? Again, answering this is tricky and requires knowledge of how you intend to use the prediction. To visualize the simple linear regression model, we can plot the predicted house sale price across all possible house sizes we might encounter. Since our model is linear, we only need to compute the predicted price of the minimum and maximum house size, and then connect them with a straight line. We superimpose this prediction line on a scatter plot of the original housing price data, so that we can qualitatively assess if the model seems to fit the data well. Figure 8.5 displays the result. 
sqft_prediction_grid <- tibble( sqft = c( sacramento |> select(sqft) |> min(), sacramento |> select(sqft) |> max() ) ) sacr_preds <- lm_fit |> predict(sqft_prediction_grid) |> bind_cols(sqft_prediction_grid) lm_plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + geom_line(data = sacr_preds, mapping = aes(x = sqft, y = .pred), color = "steelblue", linewidth = 1) + xlab("House size (square feet)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + theme(text = element_text(size = 12)) lm_plot_final Figure 8.5: Scatter plot of sale price versus size with line of best fit for the full Sacramento housing data. We can extract the coefficients from our model by accessing the fit object that is output by the fit function; we first have to extract it from the workflow using the extract_fit_parsnip function, and then apply the tidy function to convert the result into a data frame: coeffs <- lm_fit |> extract_fit_parsnip() |> tidy() coeffs ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 18450. 7916. 2.33 2.01e- 2 ## 2 sqft 135. 4.31 31.2 1.37e-134 8.5 Comparing simple linear and K-NN regression Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let’s look at the visualization of the simple linear regression model predictions for the Sacramento real estate data (predicting price from house size) and the “best” K-NN regression model obtained from the same problem, shown in Figure 8.6. Figure 8.6: Comparison of simple linear regression and K-NN regression. What differences do we observe in Figure 8.6? One obvious difference is the shape of the blue lines. In simple linear regression we are restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. 
But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the response variable we predict given a unit increase in the predictor variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line. There can, however, also be a disadvantage to using a simple linear regression model in some cases, particularly when the relationship between the response and the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In these cases the prediction model from a simple linear regression will underfit, meaning that model/predicted values do not match the actual observed values very well. Such a model would probably have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future books that may do even better at predicting with such data. How do these two models compare on the Sacramento house prices data set? In Figure 8.6, we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model. Finally, note that the K-NN regression model becomes “flat” at the left and right boundaries of the data, while the linear model predicts a constant slope. 
Predicting outside the range of the observed data is known as extrapolation; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model may have actually predicted a negative price for a small house (if the intercept \\(\\beta_0\\) was negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, so the “flat” extrapolation of K-NN likely does not match reality. 8.6 Multivariable linear regression As in K-NN classification and K-NN regression, we can move beyond the simple case of only one predictor to the case with multiple predictors, known as multivariable linear regression. To do this, we follow a very similar approach to what we did for K-NN regression: we just add more predictors to the model formula in the recipe. But recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. Note once again that we have the same concerns regarding multiple predictors as in the settings of multivariable K-NN regression and classification: having more predictors is not always better. But because the same predictor selection algorithm from the classification chapter extends to the setting of linear regression, it will not be covered again in this chapter. We will demonstrate multivariable linear regression using the Sacramento real estate data with both house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our response variable. 
We will start by changing the formula in the recipe to include both the sqft and beds variables as predictors: mlm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) Now we can build our workflow and fit the model: mlm_fit <- workflow() |> add_recipe(mlm_recipe) |> add_model(lm_spec) |> fit(data = sacramento_train) mlm_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ────────── ## 0 Recipe Steps ## ## ── Model ────────── ## ## Call: ## stats::lm(formula = ..y ~ ., data = data) ## ## Coefficients: ## (Intercept) sqft beds ## 72547.8 160.6 -29644.3 And finally, we make predictions on the test data set to assess the quality of our model: lm_mult_test_results <- mlm_fit |> predict(sacramento_test) |> bind_cols(sacramento_test) |> metrics(truth = price, estimate = .pred) lm_mult_test_results ## # A tibble: 3 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 88739. ## 2 rsq standard 0.603 ## 3 mae standard 61732. Our model’s test error as assessed by RMSPE is $88,739. In the case of two predictors, the predictions made by our linear regression form a plane of best fit, as shown in Figure 8.7. Figure 8.7: Linear regression plane of best fit overlaid on top of the data (with house size and number of bedrooms as predictors and price as the response). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the regression plane looks like for learning purposes. We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the wiggly, flexible surface we get from other methods such as K-NN regression. As discussed, this can be advantageous in one aspect: linear regression gives us an intercept and a slope for each predictor, so we can describe the plane mathematically.
We can extract those slope values from our model object as shown below: mcoeffs <- mlm_fit |> extract_fit_parsnip() |> tidy() mcoeffs ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 72548. 11670. 6.22 8.76e- 10 ## 2 sqft 161. 5.93 27.1 8.34e-111 ## 3 beds -29644. 4799. -6.18 1.11e- 9 And then use those slopes to write a mathematical equation to describe the prediction plane: \\[\\text{house sale price} = \\beta_0 + \\beta_1\\cdot(\\text{house size}) + \\beta_2\\cdot(\\text{number of bedrooms}),\\] where: \\(\\beta_0\\) is the vertical intercept of the hyperplane (the price when both house size and number of bedrooms are 0) \\(\\beta_1\\) is the slope for the first predictor (how quickly the price changes as you increase house size, holding number of bedrooms constant) \\(\\beta_2\\) is the slope for the second predictor (how quickly the price changes as you increase the number of bedrooms, holding house size constant) Finally, we can fill in the values for \\(\\beta_0\\), \\(\\beta_1\\) and \\(\\beta_2\\) from the model output above to create the equation of the plane of best fit to the data: \\[\\text{house sale price} = 72548 + 161\\cdot (\\text{house size}) -29644 \\cdot (\\text{number of bedrooms})\\] This model is more interpretable than the multivariable K-NN regression model; we can write a mathematical equation that explains how each predictor is affecting the predictions. But as always, we should question how well multivariable linear regression is doing compared to the other tools we have, such as simple linear regression and multivariable K-NN regression. If this comparison is part of the model tuning process—for example, if we are trying out many different sets of predictors for multivariable linear and K-NN regression—we must perform this comparison using cross-validation on only our training data. 
But if we have already decided on a small number (e.g., 2 or 3) of tuned candidate models and we want to make a final comparison, we can do so by comparing the prediction error of the methods on the test data. lm_mult_test_results ## # A tibble: 3 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 88739. ## 2 rsq standard 0.603 ## 3 mae standard 61732. We obtain an RMSPE for the multivariable linear regression model of $88,739.45. This prediction error is less than the prediction error for the multivariable K-NN regression model, indicating that we should likely choose linear regression for predictions of house sale price on this data set. Revisiting the simple linear regression model with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was $88,527.75, which is almost the same as that of our more complex model. As mentioned earlier, this is not always the case: often including more predictors will either positively or negatively impact the prediction performance on unseen test data. 8.7 Multicollinearity and outliers What can go wrong when performing (possibly multivariable) linear regression? This section will introduce two common issues—outliers and collinear predictors—and illustrate their impact on predictions. 8.7.1 Outliers Outliers are data points that do not follow the usual pattern of the rest of the data. In the setting of linear regression, these are points that have a vertical distance to the line of best fit that is either much higher or much lower than you might expect based on the rest of the data. The problem with outliers is that they can have too much influence on the line of best fit. In general, it is very difficult to judge accurately which data are outliers without advanced techniques that are beyond the scope of this book. 
But to illustrate what can happen when you have outliers, Figure 8.8 shows a small subset of the Sacramento housing data again, except we have added a single data point (highlighted in red). This house is 5,000 square feet in size, and sold for only $50,000. Unbeknownst to the data analyst, this house was sold by a parent to their child for an absurdly low price. Of course, this is not representative of the real housing market values that the other data points follow; the data point is an outlier. In blue we plot the original line of best fit, and in red we plot the new line of best fit including the outlier. You can see how different the red line is from the blue line, which is entirely caused by that one extra outlier data point. Figure 8.8: Scatter plot of a subset of the data, with outlier highlighted in red. Fortunately, if you have enough data, the inclusion of one or two outliers—as long as their values are not too wild—will typically not have a large effect on the line of best fit. Figure 8.9 shows how that same outlier data point from earlier influences the line of best fit when we are working with the entire original Sacramento training data. You can see that with this larger data set, the line changes much less when adding the outlier. Nevertheless, it is still important when working with linear regression to critically think about how much any individual data point is influencing the model. Figure 8.9: Scatter plot of the full data, with outlier highlighted in red. 8.7.2 Multicollinearity The second, and much more subtle, issue can occur when performing multivariable linear regression. In particular, if you include multiple predictors that are strongly linearly related to one another, the coefficients that describe the plane of best fit can be very unreliable—small changes to the data can result in large changes in the coefficients. Consider an extreme example using the Sacramento housing data where the house was measured twice by two people. 
Since the two people are each slightly inaccurate, the two measurements might not agree exactly, but they are very strongly linearly related to each other, as shown in Figure 8.10. Figure 8.10: Scatter plot of house size (in square feet) measured by person 1 versus house size (in square feet) measured by person 2. If we again fit the multivariable linear regression model on this data, then the plane of best fit has regression coefficients that are very sensitive to the exact values in the data. For example, if we change the data ever so slightly—e.g., by running cross-validation, which splits up the data randomly into different chunks—the coefficients vary by large amounts: Best Fit 1: \\(\\text{house sale price} = 22535 + (220)\\cdot (\\text{house size 1 (ft$^2$)}) + (-86) \\cdot (\\text{house size 2 (ft$^2$)}).\\) Best Fit 2: \\(\\text{house sale price} = 15966 + (86)\\cdot (\\text{house size 1 (ft$^2$)}) + (49) \\cdot (\\text{house size 2 (ft$^2$)}).\\) Best Fit 3: \\(\\text{house sale price} = 17178 + (107)\\cdot (\\text{house size 1 (ft$^2$)}) + (27) \\cdot (\\text{house size 2 (ft$^2$)}).\\) Therefore, when performing multivariable linear regression, it is important to avoid including very linearly related predictors. However, techniques for doing so are beyond the scope of this book; see the list of additional resources at the end of this chapter to find out where you can learn more. 8.8 Designing new predictors We were quite fortunate in our initial exploration to find a predictor variable (house size) that seems to have a meaningful and nearly linear relationship with our response variable (sale price). But what should we do if we cannot immediately find such a nice variable? Well, sometimes it is just a fact that the variables in the data do not have enough of a relationship with the response variable to provide useful predictions. 
For example, if the only available predictor was “the current house owner’s favorite ice cream flavor”, we likely would have little hope of using that variable to predict the house’s sale price (barring any future remarkable scientific discoveries about the relationship between the housing market and homeowner ice cream preferences). In cases like these, the only option is to obtain measurements of more useful variables. There are, however, a wide variety of cases where the predictor variables do have a meaningful relationship with the response variable, but that relationship does not fit the assumptions of the regression method you have chosen. For example, a data frame df with two variables—x and y—with a nonlinear relationship between the two variables will not be fully captured by simple linear regression, as shown in Figure 8.11. df ## # A tibble: 100 × 2 ## x y ## <dbl> <dbl> ## 1 0.102 0.0720 ## 2 0.800 0.532 ## 3 0.478 0.148 ## 4 0.972 1.01 ## 5 0.846 0.677 ## 6 0.405 0.157 ## 7 0.879 0.768 ## 8 0.130 0.0402 ## 9 0.852 0.576 ## 10 0.180 0.0847 ## # ℹ 90 more rows Figure 8.11: Example of a data set with a nonlinear relationship between the predictor and the response. Instead of trying to predict the response y using a linear regression on x, we might have some scientific background about our problem to suggest that y should be a cubic function of x. So before performing regression, we might create a new predictor variable z using the mutate function: df <- df |> mutate(z = x^3) Then we can perform linear regression for y using the predictor variable z, as shown in Figure 8.12. Here you can see that the transformed predictor z helps the linear regression model make more accurate predictions. Note that none of the y response values have changed between Figures 8.11 and 8.12; the only change is that the x values have been replaced by z values. Figure 8.12: Relationship between the transformed predictor and the response. 
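The effect of the transformation can be checked numerically as well as visually. The following is a minimal sketch in base R on simulated data; the data frame `sim_df`, the seed, and the noise level are assumptions made for this illustration and are not the data set shown in the figures. It fits one straight line using the raw predictor x and one using the engineered predictor z, then compares how closely each follows the response.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical simulated data: y is roughly a cubic function of x plus noise
sim_df <- data.frame(x = runif(100))
sim_df$y <- sim_df$x^3 + rnorm(100, sd = 0.05)

# Engineer the new predictor, mirroring the mutate(z = x^3) step
sim_df$z <- sim_df$x^3

# Fit a straight line using the raw predictor and using the transformed one
fit_x <- lm(y ~ x, data = sim_df)
fit_z <- lm(y ~ z, data = sim_df)

# Training RMSE for each fit (square root of the mean squared residual)
rmse_x <- sqrt(mean(residuals(fit_x)^2))
rmse_z <- sqrt(mean(residuals(fit_z)^2))
c(raw_x = rmse_x, transformed_z = rmse_z)
```

The comparison here uses error on the same data the models were fit to, which is enough to see the idea; when choosing between candidate features in a real analysis, the comparison should be made with cross-validation on the training data.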
The process of transforming predictors (and potentially combining multiple predictors in the process) is known as feature engineering. In real data analysis problems, you will need to rely on a deep understanding of the problem—as well as the wrangling tools from previous chapters—to engineer useful new features that improve predictive performance. Note: Feature engineering is part of tuning your model, and as such you must not use your test data to evaluate the quality of the features you produce. You are free to use cross-validation, though! 8.9 The other sides of regression So far in this textbook we have used regression only in the context of prediction. However, regression can also be seen as a method to understand and quantify the effects of individual predictor variables on a response variable of interest. In the housing example from this chapter, beyond just using past data to predict future sale prices, we might also be interested in describing the individual relationships of house size and the number of bedrooms with house price, quantifying how strong each of these relationships are, and assessing how accurately we can estimate their magnitudes. And even beyond that, we may be interested in understanding whether the predictors cause changes in the price. These sides of regression are well beyond the scope of this book; but the material you have learned here should give you a foundation of knowledge that will serve you well when moving to more advanced books on the topic. 8.10 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Regression II: linear regression” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. 
You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 8.11 Additional resources The tidymodels website is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a nice beginner’s tutorial and an extensive list of more advanced examples that you can use to continue learning beyond the scope of this book. Modern Dive (Ismay and Kim 2020) is another textbook that uses the tidyverse / tidymodels framework. Chapter 6 complements the material in the current chapter well; it covers some slightly more advanced concepts than we do without getting mathematical. Give this chapter a read before moving on to the next reference. It is also worth noting that this book takes a more “explanatory” / “inferential” approach to regression in general (in Chapters 5, 6, and 10), which provides a nice complement to the predictive tack we take in the present book. An Introduction to Statistical Learning (James et al. 2013) provides a great next stop in the process of learning about regression. Chapter 3 covers linear regression at a slightly more mathematical level than we do here, but it is not too large a leap and so should provide a good stepping stone. Chapter 6 discusses how to pick a subset of “informative” predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the computational efficiency of linear regression. In contrast, the K-NN methods we covered earlier are indeed more flexible but become very slow when given lots of data. 
References "],["clustering.html", "Chapter 9 Clustering 9.1 Overview 9.2 Chapter learning objectives 9.3 Clustering 9.4 An illustrative example 9.5 K-means 9.6 K-means in R 9.7 Exercises 9.8 Additional resources", " Chapter 9 Clustering 9.1 Overview As part of exploratory data analysis, it is often helpful to see if there are meaningful subgroups (or clusters) in the data. This grouping can be used for many purposes, such as generating new questions or improving predictive analyses. This chapter provides an introduction to clustering using the K-means algorithm, including techniques to choose the number of clusters. 9.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe a situation in which clustering is an appropriate technique to use, and what insight it might extract from the data. Explain the K-means clustering algorithm. Interpret the output of a K-means analysis. Differentiate between clustering, classification, and regression. Identify when it is necessary to scale variables before clustering, and do this using R. Perform K-means clustering in R using tidymodels workflows. Use the elbow method to choose the number of clusters for K-means. Visualize the output of K-means clustering in R using colored scatter plots. Describe the advantages, limitations and assumptions of the K-means clustering algorithm. 9.3 Clustering Clustering is a data analysis technique involving separating a data set into subgroups of related data. For example, we might use clustering to separate a data set of documents into groups that correspond to topics, a data set of human genetic information into groups that correspond to ancestral subpopulations, or a data set of online customers into groups that correspond to purchasing behaviors. Once the data are separated, we can, for example, use the subgroups to generate new questions about the data and follow up with a predictive modeling exercise. 
In this course, clustering will be used only for exploratory analysis, i.e., uncovering patterns in the data. Note that clustering is a fundamentally different kind of task than classification or regression. In particular, both classification and regression are supervised tasks where there is a response variable (a category label or value), and we have examples of past data with labels/values that help us predict those of future data. By contrast, clustering is an unsupervised task, as we are trying to understand and examine the structure of data without any response variable labels or values to help us. This approach has both advantages and disadvantages. Clustering requires no additional annotation or input on the data. For example, while it would be nearly impossible to annotate all the articles on Wikipedia with human-made topic labels, we can cluster the articles without this information to find groupings corresponding to topics automatically. However, given that there is no response variable, it is not as easy to evaluate the “quality” of a clustering. With classification, we can use a test data set to assess prediction performance. In clustering, there is not a single good choice for evaluation. In this book, we will use visualization to ascertain the quality of a clustering, and leave rigorous evaluation for more advanced courses. As in the case of classification, there are many possible methods that we could use to cluster our observations to look for subgroups. In this book, we will focus on the widely used K-means algorithm (Lloyd 1982). In your future studies, you might encounter hierarchical clustering, principal component analysis, multidimensional scaling, and more; see the additional resources section at the end of this chapter for where to begin learning more about these other methods. Note: There are also so-called semisupervised tasks, where only some of the data come with response variable labels/values, but the vast majority don’t. 
The goal is to try to uncover underlying structure in the data that allows one to guess the missing labels. This sort of task is beneficial, for example, when one has an unlabeled data set that is too large to manually label, but one is willing to provide a few informative example labels as a “seed” to guess the labels for all the data. 9.4 An illustrative example In this chapter we will focus on a data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). This data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Site, and includes measurements for adult penguins (Figure 9.1) found near there (Gorman, Williams, and Fraser 2014). Our goal will be to use two variables—penguin bill and flipper length, both in millimeters—to determine whether there are distinct types of penguins in our data. Understanding this might help us with species discovery and classification in a data-driven way. Note that we have reduced the size of the data set to 18 observations and 2 variables; this will help us make clear visualizations that illustrate how clustering works for learning purposes. Figure 9.1: A Gentoo penguin. Before we get started, we will load the tidyverse metapackage as well as set a random seed. This will ensure we have access to the functions we need and that our analysis will be reproducible. As we will learn in more detail later in the chapter, setting the seed here is important because the K-means clustering algorithm uses randomness when choosing a starting position for each cluster. library(tidyverse) set.seed(1) Now we can load and preview the penguins data. 
penguins <- read_csv("data/penguins.csv") penguins ## # A tibble: 18 × 2 ## bill_length_mm flipper_length_mm ## <dbl> <dbl> ## 1 39.2 196 ## 2 36.5 182 ## 3 34.5 187 ## 4 36.7 187 ## 5 38.1 181 ## 6 39.2 190 ## 7 36 195 ## 8 37.8 193 ## 9 46.5 213 ## 10 46.1 215 ## 11 47.8 215 ## 12 45 220 ## 13 49.1 212 ## 14 43.3 208 ## 15 46 195 ## 16 46.7 195 ## 17 52.2 197 ## 18 46.8 189 We will begin by using a version of the data that we have standardized, penguins_standardized, to illustrate how K-means clustering works (recall standardization from Chapter 5). Later in this chapter, we will return to the original penguins data to see how to include standardization automatically in the clustering pipeline. penguins_standardized ## # A tibble: 18 × 2 ## bill_length_standardized flipper_length_standardized ## <dbl> <dbl> ## 1 -0.641 -0.190 ## 2 -1.14 -1.33 ## 3 -1.52 -0.922 ## 4 -1.11 -0.922 ## 5 -0.847 -1.41 ## 6 -0.641 -0.678 ## 7 -1.24 -0.271 ## 8 -0.902 -0.434 ## 9 0.720 1.19 ## 10 0.646 1.36 ## 11 0.963 1.36 ## 12 0.440 1.76 ## 13 1.21 1.11 ## 14 0.123 0.786 ## 15 0.627 -0.271 ## 16 0.757 -0.271 ## 17 1.78 -0.108 ## 18 0.776 -0.759 Next, we can create a scatter plot using this data set to see if we can detect subtypes or groups in our data set. ggplot(penguins_standardized, aes(x = flipper_length_standardized, y = bill_length_standardized)) + geom_point() + xlab("Flipper Length (standardized)") + ylab("Bill Length (standardized)") + theme(text = element_text(size = 12)) Figure 9.2: Scatter plot of standardized bill length versus standardized flipper length. Based on the visualization in Figure 9.2, we might suspect there are a few subtypes of penguins within our data set. We can see roughly 3 groups of observations in Figure 9.2, including: a small flipper and bill length group, a small flipper length, but large bill length group, and a large flipper and bill length group. 
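The standardized values used in this example were shown without the code that produced them. One way they might be computed by hand is sketched below, assuming standardization means subtracting each column's mean and dividing by its sample standard deviation. The names standardize and penguins_by_hand are hypothetical, and the data are typed in from the table above rather than read from data/penguins.csv.

```r
library(tidyverse)

# The 18 observations from the penguins table above
penguins <- tibble(
  bill_length_mm = c(39.2, 36.5, 34.5, 36.7, 38.1, 39.2, 36, 37.8, 46.5,
                     46.1, 47.8, 45, 49.1, 43.3, 46, 46.7, 52.2, 46.8),
  flipper_length_mm = c(196, 182, 187, 187, 181, 190, 195, 193, 213,
                        215, 215, 220, 212, 208, 195, 195, 197, 189)
)

# Hypothetical helper: center a vector at 0 and scale it to standard deviation 1
standardize <- function(v) (v - mean(v)) / sd(v)

penguins_by_hand <- penguins |>
  transmute(bill_length_standardized = standardize(bill_length_mm),
            flipper_length_standardized = standardize(flipper_length_mm))

penguins_by_hand
```

Under this assumption, the first row works out to roughly -0.641 and -0.190, which agrees with the first row of penguins_standardized shown above.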
Data visualization is a great tool to give us a rough sense of such patterns when we have a small number of variables. But if we are to group data, and select the number of groups, as part of a reproducible analysis, we need something a bit more automated. Additionally, finding groups via visualization becomes more difficult as we increase the number of variables we consider when clustering. The way to rigorously separate the data into groups is to use a clustering algorithm. In this chapter, we will focus on the K-means algorithm, a widely used and often very effective clustering method, combined with the elbow method for selecting the number of clusters. This procedure will separate the data into groups; Figure 9.3 shows these groups denoted by colored scatter points. Figure 9.3: Scatter plot of standardized bill length versus standardized flipper length with colored groups. What are the labels for these groups? Unfortunately, we don’t have any. K-means, like almost all clustering algorithms, just outputs meaningless “cluster labels” that are typically whole numbers: 1, 2, 3, etc. But in a simple case like this, where we can easily visualize the clusters on a scatter plot, we can give human-made labels to the groups using their positions on the plot: small flipper length and small bill length (orange cluster), small flipper length and large bill length (blue cluster), and large flipper length and large bill length (yellow cluster). Once we have made these determinations, we can use them to inform our species classifications or ask further questions about our data. For example, we might be interested in understanding the relationship between flipper length and bill length, and that relationship may differ depending on the type of penguin we have. 9.5 K-means 9.5.1 Measuring cluster quality The K-means algorithm is a procedure that groups data into K clusters.
It starts with an initial clustering of the data, and then iteratively improves it by making adjustments to the assignment of data to clusters until it cannot improve any further. But how do we measure the “quality” of a clustering, and what does it mean to improve it? In K-means clustering, we measure the quality of a cluster by its within-cluster sum-of-squared-distances (WSSD). Computing this involves two steps. First, we find the cluster centers by computing the mean of each variable over data points in the cluster. For example, suppose we have a cluster containing four observations, and we are using two variables, \\(x\\) and \\(y\\), to cluster the data. Then we would compute the coordinates, \\(\\mu_x\\) and \\(\\mu_y\\), of the cluster center via \\[\\mu_x = \\frac{1}{4}(x_1+x_2+x_3+x_4) \\quad \\mu_y = \\frac{1}{4}(y_1+y_2+y_3+y_4).\\] In the first cluster from the example, there are 4 data points. These are shown with their cluster center (standardized flipper length -0.35, standardized bill length 0.99) highlighted in Figure 9.4. Figure 9.4: Cluster 1 from the penguins_standardized data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The second step in computing the WSSD is to add up the squared distance between each point in the cluster and the cluster center. We use the straight-line / Euclidean distance formula that we learned about in Chapter 5. In the 4-observation cluster example above, we would compute the WSSD \\(S^2\\) via \\[\\begin{align*} S^2 = \\left((x_1 - \\mu_x)^2 + (y_1 - \\mu_y)^2\\right) + \\left((x_2 - \\mu_x)^2 + (y_2 - \\mu_y)^2\\right) + \\\\ \\left((x_3 - \\mu_x)^2 + (y_3 - \\mu_y)^2\\right) + \\left((x_4 - \\mu_x)^2 + (y_4 - \\mu_y)^2\\right). \\end{align*}\\] These distances are denoted by lines in Figure 9.5 for the first cluster of the penguin data example. Figure 9.5: Cluster 1 from the penguins_standardized data set example. 
Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The distances from the observations to the cluster center are represented as black lines. The larger the value of \\(S^2\\), the more spread out the cluster is, since large \\(S^2\\) means that points are far from the cluster center. Note, however, that “large” is relative to both the scale of the variables for clustering and the number of points in the cluster. A cluster where points are very close to the center might still have a large \\(S^2\\) if there are many data points in the cluster. After we have calculated the WSSD for all the clusters, we sum them together to get the total WSSD. For our example, this means adding up all the squared distances for the 18 observations. These distances are denoted by black lines in Figure 9.6. Figure 9.6: All clusters from the penguins_standardized data set example. Observations are small orange, blue, and yellow points with cluster centers denoted by larger points with a black outline. The distances from the observations to each of the respective cluster centers are represented as black lines. Since K-means uses the straight-line distance to measure the quality of a clustering, it is limited to clustering based on quantitative variables. However, note that there are variants of the K-means algorithm, as well as other clustering algorithms entirely, that use other distance metrics to allow for non-quantitative data to be clustered. These are beyond the scope of this book. 9.5.2 The clustering algorithm We begin the K-means algorithm by picking K, and randomly assigning a roughly equal number of observations to each of the K clusters. An example random initialization is shown in Figure 9.7. Figure 9.7: Random initialization of labels. Then K-means consists of two major steps that attempt to minimize the sum of WSSDs over all the clusters, i.e., the total WSSD: Center update: Compute the center of each cluster. 
Label update: Reassign each data point to the cluster with the nearest center. These two steps are repeated until the cluster assignments no longer change. We show what the first four iterations of K-means would look like in Figure 9.8. There each pair of plots in each row corresponds to an iteration, where the left figure in the pair depicts the center update, and the right figure in the pair depicts the label update (i.e., the reassignment of data to clusters). Figure 9.8: First four iterations of K-means clustering on the penguins_standardized example data set. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black. Note that at this point, we can terminate the algorithm since none of the assignments changed in the fourth iteration; both the centers and labels will remain the same from this point onward. Note: Is K-means guaranteed to stop at some point, or could it iterate forever? As it turns out, thankfully, the answer is that K-means is guaranteed to stop after some number of iterations. For the interested reader, the logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration, (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments are changing, and the algorithm terminates. 9.5.3 Random restarts Unlike the classification and regression models we studied in previous chapters, K-means can get “stuck” in a bad solution. For example, Figure 9.9 illustrates an unlucky random initialization by K-means. Figure 9.9: Random initialization of labels. 
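To make the center update and label update from Section 9.5.2 concrete, and to experiment with how the result depends on the random initialization, here is a minimal from-scratch sketch in base R. It is written for illustration only: the toy data are simulated, the variable names are our own, and the code does not handle the case where a cluster becomes empty (which cannot happen with K = 2, but can with larger K).

```r
set.seed(1)

# Hypothetical toy data: two groups of 6 points each, in two variables
x <- rbind(
  matrix(rnorm(12, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(12, mean = 3, sd = 0.3), ncol = 2)
)
K <- 2

# Random initialization: assign a roughly equal number of points to each cluster
labels <- sample(rep(1:K, length.out = nrow(x)))

repeat {
  # Center update: mean of each variable over the points in each cluster
  centers <- t(sapply(1:K, function(k) colMeans(x[labels == k, , drop = FALSE])))

  # Squared straight-line distance from every point to every center
  sq_dist <- sapply(1:K, function(k) rowSums(sweep(x, 2, centers[k, ])^2))

  # Label update: reassign each point to the cluster with the nearest center
  new_labels <- max.col(-sq_dist, ties.method = "first")

  # Stop once no assignment changes
  if (all(new_labels == labels)) break
  labels <- new_labels
}

# Total WSSD of the final clustering
total_wssd <- sum(sq_dist[cbind(seq_len(nrow(x)), labels)])
labels
total_wssd
```

Changing the value passed to set.seed changes the initial labels, and may change the clustering the loop settles on; comparing total_wssd across several seeds is a small-scale version of the random restarts discussed in this section.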
Figure 9.10 shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure 9.9. Figure 9.10: First five iterations of K-means clustering on the penguins_standardized example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black. This looks like a relatively bad clustering of the data, but K-means cannot improve it. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD. 9.5.4 Choosing K In order to cluster data using K-means, we also have to pick the number of clusters, K. But unlike in classification, we have no response variable and cannot perform cross-validation with some measure of model prediction error. Further, if K is chosen too small, then multiple clusters get grouped together; if K is too large, then clusters get subdivided. In both cases, we will potentially miss interesting structure in the data. Figure 9.11 illustrates the impact of K on K-means clustering of our penguin flipper and bill length data by showing the different clusterings for K’s ranging from 1 to 9. Figure 9.11: Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black. If we set K less than 3, then the clustering merges separate groups of data; this causes a large total WSSD, since the cluster center is not close to any of the data in the cluster. On the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still decrease the total WSSD, but by only a diminishing amount. 
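Both ideas above, random restarts and the effect of K on the total WSSD, can be tried out with the kmeans function from base R's stats package before we turn to the tidymodels interface. The sketch below uses simulated toy data (an assumption, not the penguin data). Note also that kmeans does not follow the description in this chapter exactly: by default it uses the Hartigan-Wong algorithm and initializes by choosing random observations as centers rather than by assigning random labels.

```r
set.seed(1)

# Hypothetical toy data: three well-separated groups of 10 points each
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  cbind(rnorm(10, mean = 0, sd = 0.3), rnorm(10, mean = 3, sd = 0.3))
)

# For each K, run K-means from 10 random starts; kmeans keeps the run with
# the lowest total WSSD and reports that value as tot.withinss
ks <- 1:6
total_wssd <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

data.frame(K = ks, total_wssd = total_wssd)
```

With K = 1 the total WSSD is simply the sum of squared distances from every point to the overall mean; for data like these we would expect a sharp drop up to K = 3 and only small improvements afterwards.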
If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly the right number of clusters (Figure 9.12). Figure 9.12: Total WSSD for K clusters ranging from 1 to 9. 9.6 K-means in R We can perform K-means clustering in R using a tidymodels workflow similar to those in the earlier classification and regression chapters. We will begin by loading the tidyclust library, which contains the necessary functionality. library(tidyclust) Returning to the original (unstandardized) penguins data, recall that K-means clustering uses straight-line distance to decide which points are similar to each other. Therefore, the scale of each of the variables in the data will influence which cluster data points end up being assigned to. Variables with a large scale will have a much larger effect on deciding cluster assignment than variables with a small scale. To address this problem, we need to create a recipe that standardizes our data before clustering using the step_scale and step_center preprocessing steps. Standardization will ensure that each variable has a mean of 0 and standard deviation of 1 prior to clustering. We will designate that all variables are to be used in clustering via the model formula ~ .. Note: Recipes were originally designed specifically for predictive data analysis problems (like classification and regression), not clustering problems. So the functions in R that we use to construct recipes are a little bit awkward in the setting of clustering. In particular, we will have to treat “predictors” here as if it meant “variables to be used in clustering”. So the model formula ~ . specifies that all variables are “predictors”, i.e., all variables should be used for clustering.
Similarly, when we use the all_predictors() function in the preprocessing steps, we really mean “apply this step to all variables used for clustering.” kmeans_recipe <- recipe(~ ., data=penguins) |> step_scale(all_predictors()) |> step_center(all_predictors()) kmeans_recipe ## ## ── Recipe ────────── ## ## ── Inputs ## Number of variables by role ## predictor: 2 ## ## ── Operations ## • Scaling for: all_predictors() ## • Centering for: all_predictors() To indicate that we are performing K-means clustering, we will use the k_means model specification. We will use the num_clusters argument to specify the number of clusters (here we choose K = 3), and specify that we are using the \"stats\" engine. kmeans_spec <- k_means(num_clusters = 3) |> set_engine("stats") kmeans_spec ## K Means Cluster Specification (partition) ## ## Main Arguments: ## num_clusters = 3 ## ## Computational engine: stats To actually run the K-means clustering, we combine the recipe and model specification in a workflow, and use the fit function. Note that the K-means algorithm uses a random initialization of assignments; but since we set the random seed earlier, the clustering will be reproducible. 
kmeans_fit <- workflow() |> add_recipe(kmeans_recipe) |> add_model(kmeans_spec) |> fit(data = penguins) kmeans_fit ## ══ Workflow [trained] ══════════ ## Preprocessor: Recipe ## Model: k_means() ## ## ── Preprocessor ────────── ## 2 Recipe Steps ## ## • step_scale() ## • step_center() ## ## ── Model ────────── ## K-means clustering with 3 clusters of sizes 4, 6, 8 ## ## Cluster means: ## bill_length_mm flipper_length_mm ## 1 0.9858721 -0.3524358 ## 2 0.6828058 1.2606357 ## 3 -1.0050404 -0.7692589 ## ## Clustering vector: ## [1] 3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1 ## ## Within cluster sum of squares by cluster: ## [1] 1.098928 1.247042 2.121932 ## (between_SS / total_SS = 86.9 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" ## [6] "betweenss" "size" "iter" "ifault" As you can see above, the fit object has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD. Let’s start by visualizing the clusters as a colored scatter plot! In order to do that, we first need to augment our original data frame with the cluster assignments. We can achieve this using the augment function from tidyclust. clustered_data <- kmeans_fit |> augment(penguins) clustered_data ## # A tibble: 18 × 3 ## bill_length_mm flipper_length_mm .pred_cluster ## <dbl> <dbl> <fct> ## 1 39.2 196 Cluster_1 ## 2 36.5 182 Cluster_1 ## 3 34.5 187 Cluster_1 ## 4 36.7 187 Cluster_1 ## 5 38.1 181 Cluster_1 ## 6 39.2 190 Cluster_1 ## 7 36 195 Cluster_1 ## 8 37.8 193 Cluster_1 ## 9 46.5 213 Cluster_2 ## 10 46.1 215 Cluster_2 ## 11 47.8 215 Cluster_2 ## 12 45 220 Cluster_2 ## 13 49.1 212 Cluster_2 ## 14 43.3 208 Cluster_2 ## 15 46 195 Cluster_3 ## 16 46.7 195 Cluster_3 ## 17 52.2 197 Cluster_3 ## 18 46.8 189 Cluster_3 Now that we have the cluster assignments included in the clustered_data tidy data frame, we can visualize them as shown in Figure 9.13. 
Note that we are plotting the unstandardized data here; if we for some reason wanted to visualize the standardized data from the recipe, we would need to use the bake function to obtain that first. cluster_plot <- ggplot(clustered_data, aes(x = flipper_length_mm, y = bill_length_mm, color = .pred_cluster)) + geom_point(size = 2) + labs(x = "Flipper Length", y = "Bill Length", color = "Cluster") + scale_color_manual(values = c("steelblue", "darkorange", "goldenrod1")) + theme(text = element_text(size = 12)) cluster_plot Figure 9.13: The data colored by the cluster assignments returned by K-means. As mentioned above, we also need to select K by finding where the “elbow” occurs in the plot of total WSSD versus the number of clusters. We can obtain the total WSSD (tot.withinss) from our clustering with 3 clusters using the glance function. glance(kmeans_fit) ## # A tibble: 1 × 4 ## totss tot.withinss betweenss iter ## <dbl> <dbl> <dbl> <int> ## 1 34 4.47 29.5 2 To calculate the total WSSD for a variety of Ks, we will create a data frame with a column named num_clusters with rows containing each value of K we want to run K-means with (here, 1 to 9). penguin_clust_ks <- tibble(num_clusters = 1:9) penguin_clust_ks ## # A tibble: 9 × 1 ## num_clusters ## <int> ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 Then we construct our model specification again, this time specifying that we want to tune the num_clusters parameter. kmeans_spec <- k_means(num_clusters = tune()) |> set_engine("stats") kmeans_spec ## K Means Cluster Specification (partition) ## ## Main Arguments: ## num_clusters = tune() ## ## Computational engine: stats We combine the recipe and specification in a workflow, and then use the tune_cluster function to run K-means on each of the different settings of num_clusters. The grid argument controls which values of K we want to try: in this case, the values from 1 to 9 that are stored in the penguin_clust_ks data frame.
We set the resamples argument to apparent(penguins) to tell K-means to run on the whole data set for each value of num_clusters. Finally, we collect the results using the collect_metrics function. kmeans_results <- workflow() |> add_recipe(kmeans_recipe) |> add_model(kmeans_spec) |> tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |> collect_metrics() kmeans_results ## # A tibble: 18 × 7 ## num_clusters .metric .estimator mean n std_err .config ## <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 sse_total standard 34 1 NA Preprocessor1_… ## 2 1 sse_within_total standard 34 1 NA Preprocessor1_… ## 3 2 sse_total standard 34 1 NA Preprocessor1_… ## 4 2 sse_within_total standard 10.9 1 NA Preprocessor1_… ## 5 3 sse_total standard 34 1 NA Preprocessor1_… ## 6 3 sse_within_total standard 4.47 1 NA Preprocessor1_… ## 7 4 sse_total standard 34 1 NA Preprocessor1_… ## 8 4 sse_within_total standard 3.54 1 NA Preprocessor1_… ## 9 5 sse_total standard 34 1 NA Preprocessor1_… ## 10 5 sse_within_total standard 2.23 1 NA Preprocessor1_… ## 11 6 sse_total standard 34 1 NA Preprocessor1_… ## 12 6 sse_within_total standard 1.75 1 NA Preprocessor1_… ## 13 7 sse_total standard 34 1 NA Preprocessor1_… ## 14 7 sse_within_total standard 2.06 1 NA Preprocessor1_… ## 15 8 sse_total standard 34 1 NA Preprocessor1_… ## 16 8 sse_within_total standard 2.46 1 NA Preprocessor1_… ## 17 9 sse_total standard 34 1 NA Preprocessor1_… ## 18 9 sse_within_total standard 0.906 1 NA Preprocessor1_… The total WSSD results correspond to the mean column when the .metric variable is equal to sse_within_total. We can obtain a tidy data frame with this information using filter and mutate. 
kmeans_results <- kmeans_results |> filter(.metric == "sse_within_total") |> mutate(total_WSSD = mean) |> select(num_clusters, total_WSSD) kmeans_results ## # A tibble: 9 × 2 ## num_clusters total_WSSD ## <int> <dbl> ## 1 1 34 ## 2 2 10.9 ## 3 3 4.47 ## 4 4 3.54 ## 5 5 2.23 ## 6 6 1.75 ## 7 7 2.06 ## 8 8 2.46 ## 9 9 0.906 Now that we have total_WSSD and num_clusters as columns in a data frame, we can make a line plot (Figure 9.14) and search for the “elbow” to find which value of K to use. elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) + geom_point() + geom_line() + xlab("K") + ylab("Total within-cluster sum of squares") + scale_x_continuous(breaks = 1:9) + theme(text = element_text(size = 12)) elbow_plot Figure 9.14: A plot showing the total WSSD versus the number of clusters. It looks like 3 clusters is the right choice for this data. But why is there a “bump” in the total WSSD plot here? Shouldn’t total WSSD always decrease as we add more clusters? Technically yes, but remember: K-means can get “stuck” in a bad solution. Unfortunately, for K = 8 we had an unlucky initialization and found a bad clustering! We can help prevent finding a bad clustering by trying a few different random initializations via the nstart argument in the model specification. Here we will try using 10 restarts. kmeans_spec <- k_means(num_clusters = tune()) |> set_engine("stats", nstart = 10) kmeans_spec ## K Means Cluster Specification (partition) ## ## Main Arguments: ## num_clusters = tune() ## ## Engine-Specific Arguments: ## nstart = 10 ## ## Computational engine: stats Now if we rerun the same workflow with the new model specification, K-means clustering will be performed nstart = 10 times for each value of K. The collect_metrics function will then pick the best clustering of the 10 runs for each value of K, and report the results for that best clustering. 
Figure 9.15 shows the resulting total WSSD plot from using 10 restarts; the bump is gone and the total WSSD decreases as expected. The more times we perform K-means clustering, the more likely we are to find a good clustering (if one exists). What value should you choose for nstart? The answer is that it depends on many factors: the size and characteristics of your data set, as well as how powerful your computer is. The larger the nstart value the better from an analysis perspective, but there is a trade-off that doing many clusterings could take a long time. So this is something that needs to be balanced. kmeans_results <- workflow() |> add_recipe(kmeans_recipe) |> add_model(kmeans_spec) |> tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |> collect_metrics() |> filter(.metric == "sse_within_total") |> mutate(total_WSSD = mean) |> select(num_clusters, total_WSSD) elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) + geom_point() + geom_line() + xlab("K") + ylab("Total within-cluster sum of squares") + scale_x_continuous(breaks = 1:9) + theme(text = element_text(size = 12)) elbow_plot Figure 9.15: A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts. 9.7 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Clustering” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 9.8 Additional resources Chapter 10 of An Introduction to Statistical Learning (James et al. 
2013) provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers hierarchical clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers principal components analysis (PCA), which is a very popular technique for reducing the number of predictors in a data set. References "],["inference.html", "Chapter 10 Statistical inference 10.1 Overview 10.2 Chapter learning objectives 10.3 Why do we need sampling? 10.4 Sampling distributions 10.5 Bootstrapping 10.6 Exercises 10.7 Additional resources", " Chapter 10 Statistical inference 10.1 Overview A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that population; we typically do not get data on the entire population. Data analysis questions regarding how summaries, patterns, trends, or relationships in a data set extend to the wider population are called inferential questions. This chapter will start with the fundamental ideas of sampling from populations and then introduce two common techniques in statistical inference: point estimation and interval estimation. 10.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe real-world examples of questions that can be answered with statistical inference. Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution. Explain the difference between a population parameter and a sample point estimate. 
Use R to draw random samples from a finite population. Use R to create a sampling distribution from a finite population. Describe how sample size influences the sampling distribution. Define bootstrapping. Use R to create a bootstrap distribution to approximate a sampling distribution. Contrast the bootstrap and sampling distributions. 10.3 Why do we need sampling? We often need to understand how quantities we observe in a subset of data relate to the same quantities in the broader population. For example, suppose a retailer is considering selling iPhone accessories, and they want to estimate how big the market might be. Additionally, they want to strategize how they can market their products on North American college and university campuses. This retailer might formulate the following question: What proportion of all undergraduate students in North America own an iPhone? In the above question, we are interested in making a conclusion about all undergraduate students in North America; this is referred to as the population. In general, the population is the complete collection of individuals or cases we are interested in studying. Further, in the above question, we are interested in computing a quantity—the proportion of iPhone owners—based on the entire population. This proportion is referred to as a population parameter. In general, a population parameter is a numerical characteristic of the entire population. To compute this number in the example above, we would need to ask every single undergraduate in North America whether they own an iPhone. In practice, directly computing population parameters is often time-consuming and costly, and sometimes impossible. A more practical approach would be to make measurements for a sample, i.e., a subset of individuals collected from the population. We can then compute a sample estimate—a numerical characteristic of the sample—that estimates the population parameter. 
For example, suppose we randomly selected ten undergraduate students across North America (the sample) and computed the proportion of those students who own an iPhone (the sample estimate). In that case, we might suspect that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population. Figure 10.1 illustrates this process. In general, the process of using a sample to make a conclusion about the broader population from which it is taken is referred to as statistical inference. Figure 10.1: The process of using a sample from a broader population to obtain a point estimate of a population parameter. In this case, a sample of 10 individuals yielded 6 who own an iPhone, resulting in an estimated population proportion of 60% iPhone owners. The actual population proportion in this example illustration is 53.8%. Note that proportions are not the only kind of population parameter we might be interested in. For example, suppose an undergraduate student studying at the University of British Columbia in Canada is looking for an apartment to rent. They need to create a budget, so they want to know about studio apartment rental prices in Vancouver. This student might formulate the question: What is the average price per month of studio apartment rentals in Vancouver? In this case, the population consists of all studio apartment rentals in Vancouver, and the population parameter is the average price per month. Here we used the average as a measure of the center to describe the “typical value” of studio apartment rental prices. But even within this one example, we could also be interested in many other population parameters. For instance, we know that not every studio apartment rental in Vancouver will have the same price per month. The student might be interested in how much monthly prices vary and want to find a measure of the rentals’ spread (or variability), such as the standard deviation. 
Or perhaps the student might be interested in the fraction of studio apartment rentals that cost more than $1000 per month. The question we want to answer will help us determine the parameter we want to estimate. If we were somehow able to observe the whole population of studio apartment rental offerings in Vancouver, we could compute each of these numbers exactly; therefore, these are all population parameters. There are many kinds of observations and population parameters that you will run into in practice, but in this chapter, we will focus on two settings: Using categorical observations to estimate the proportion of a category Using quantitative observations to estimate the average (or mean) 10.4 Sampling distributions 10.4.1 Sampling distributions for proportions We will look at an example using data from Inside Airbnb (Cox n.d.). Airbnb is an online marketplace for arranging vacation rentals and places to stay. The data set contains listings for Vancouver, Canada, in September 2020. Our data includes an ID number, neighborhood, type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night. 
library(tidyverse) set.seed(123) airbnb <- read_csv("data/listings.csv") airbnb ## # A tibble: 4,594 × 8 ## id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 Downtown Entire h… 5 2 baths 2 2 150 ## 2 2 Downtown Eastside Entire h… 4 2 baths 2 2 132 ## 3 3 West End Entire h… 2 1 bath 1 1 85 ## 4 4 Kensington-Cedar… Entire h… 2 1 bath 1 0 146 ## 5 5 Kensington-Cedar… Entire h… 4 1 bath 1 2 110 ## 6 6 Hastings-Sunrise Entire h… 4 1 bath 2 3 195 ## 7 7 Renfrew-Collingw… Entire h… 8 3 baths 4 5 130 ## 8 8 Mount Pleasant Entire h… 2 1 bath 1 1 94 ## 9 9 Grandview-Woodla… Private … 2 1 privat… 1 1 79 ## 10 10 West End Private … 2 1 privat… 1 1 75 ## # ℹ 4,584 more rows Suppose the city of Vancouver wants information about Airbnb rentals to help plan city bylaws, and they want to know how many Airbnb places are listed as entire homes and apartments (rather than as private or shared rooms). Therefore they may want to estimate the true proportion of all Airbnb listings where the “type of place” is listed as “entire home or apartment.” Of course, we usually do not have access to the true population, but here let’s imagine (for learning purposes) that our data set represents the population of all Airbnb rental listings in Vancouver, Canada. We can find the proportion of listings where room_type == \"Entire home/apt\". airbnb |> summarize( n = sum(room_type == "Entire home/apt"), proportion = sum(room_type == "Entire home/apt") / nrow(airbnb) ) ## # A tibble: 1 × 2 ## n proportion ## <int> <dbl> ## 1 3434 0.747 We can see that the proportion of Entire home/apt listings in the data set is 0.747. This value, 0.747, is the population parameter. Remember, this parameter value is usually unknown in real data analysis problems, as it is typically not possible to make measurements for an entire population. Instead, perhaps we can approximate it with a small subset of data! 
To investigate this idea, let’s try randomly selecting 40 listings (i.e., taking a random sample of size 40 from our population), and computing the proportion for that sample. We will use the rep_sample_n function from the infer package to take the sample. The arguments of rep_sample_n are (1) the data frame to sample from, and (2) the size of the sample to take. library(infer) sample_1 <- rep_sample_n(tbl = airbnb, size = 40) airbnb_sample_1 <- summarize(sample_1, n = sum(room_type == "Entire home/apt"), prop = sum(room_type == "Entire home/apt") / 40 ) airbnb_sample_1 ## # A tibble: 1 × 3 ## replicate n prop ## <int> <int> <dbl> ## 1 1 28 0.7 Here we see that the proportion of entire home/apartment listings in this random sample is 0.7. Wow—that’s close to our true population value! But remember, we computed the proportion using a random sample of size 40. This has two consequences. First, this value is only an estimate, i.e., our best guess of our population parameter using this sample. Given that we are estimating a single value here, we often refer to it as a point estimate. Second, since the sample was random, if we were to take another random sample of size 40 and compute the proportion for that sample, we would not get the same answer: sample_2 <- rep_sample_n(airbnb, size = 40) airbnb_sample_2 <- summarize(sample_2, n = sum(room_type == "Entire home/apt"), prop = sum(room_type == "Entire home/apt") / 40 ) airbnb_sample_2 ## # A tibble: 1 × 3 ## replicate n prop ## <int> <int> <dbl> ## 1 1 35 0.875 Confirmed! We get a different value for our estimate this time. That means that our point estimate might be unreliable. Indeed, estimates vary from sample to sample due to sampling variability. But just how much should we expect the estimates of our random samples to vary? Or in other words, how much can we really trust our point estimate based on a single sample? 
To understand this, we will simulate many samples (many more than just two) of size 40 from our population of listings and calculate the proportion of entire home/apartment listings in each sample. This simulation will create many sample proportions, which we can visualize using a histogram. The distribution of the estimate for all possible samples of a given size (which we commonly refer to as \\(n\\)) from a population is called a sampling distribution. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40. We again use the rep_sample_n function to take samples of size 40 from our population of Airbnb listings. But this time we set the reps argument to 20,000 to specify that we want to take 20,000 samples of size 40. samples <- rep_sample_n(airbnb, size = 40, reps = 20000) samples ## # A tibble: 800,000 × 9 ## # Groups: replicate [20,000] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1 4403 Downtown Entire h… 2 1 bath 1 1 ## 2 1 902 Kensington-C… Private … 2 1 shared… 1 1 ## 3 1 3808 Hastings-Sun… Entire h… 6 1.5 baths 1 3 ## 4 1 561 Kensington-C… Entire h… 6 1 bath 2 2 ## 5 1 3385 Mount Pleasa… Entire h… 4 1 bath 1 1 ## 6 1 4232 Shaughnessy Entire h… 6 1.5 baths 2 2 ## 7 1 1169 Downtown Entire h… 3 1 bath 1 1 ## 8 1 959 Kitsilano Private … 1 1.5 shar… 1 1 ## 9 1 2171 Downtown Entire h… 2 1 bath 1 1 ## 10 1 1258 Dunbar South… Entire h… 4 1 bath 2 2 ## # ℹ 799,990 more rows ## # ℹ 1 more variable: price <dbl> Notice that the column replicate indicates the replicate, or sample, to which each listing belongs. Above, since by default R only prints the first few rows, it looks like all of the listings have replicate set to 1. But you can check the last few entries using the tail() function to verify that we indeed created 20,000 samples (or replicates). 
tail(samples) ## # A tibble: 6 × 9 ## # Groups: replicate [1] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 20000 3414 Marpole Entire h… 4 1 bath 2 2 ## 2 20000 1974 Hastings-Sunr… Private … 2 1 shared… 1 1 ## 3 20000 1846 Riley Park Entire h… 4 1 bath 2 3 ## 4 20000 862 Downtown Entire h… 5 2 baths 2 2 ## 5 20000 3295 Victoria-Fras… Private … 2 1 shared… 1 1 ## 6 20000 997 Dunbar Southl… Private … 1 1.5 shar… 1 1 ## # ℹ 1 more variable: price <dbl> Now that we have obtained the samples, we need to compute the proportion of entire home/apartment listings in each sample. We first group the data by the replicate variable—to group the set of listings in each sample together—and then use summarize to compute the proportion in each sample. We print both the first and last few entries of the resulting data frame below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. sample_estimates <- samples |> group_by(replicate) |> summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40) sample_estimates ## # A tibble: 20,000 × 2 ## replicate sample_proportion ## <int> <dbl> ## 1 1 0.85 ## 2 2 0.85 ## 3 3 0.65 ## 4 4 0.7 ## 5 5 0.75 ## 6 6 0.725 ## 7 7 0.775 ## 8 8 0.775 ## 9 9 0.7 ## 10 10 0.675 ## # ℹ 19,990 more rows tail(sample_estimates) ## # A tibble: 6 × 2 ## replicate sample_proportion ## <int> <dbl> ## 1 19995 0.75 ## 2 19996 0.675 ## 3 19997 0.625 ## 4 19998 0.75 ## 5 19999 0.875 ## 6 20000 0.65 We can now visualize the sampling distribution of sample proportions for samples of size 40 using a histogram in Figure 10.2. Keep in mind: in the real world, we don’t have access to the full population. So we can’t take many samples and can’t actually construct or visualize the sampling distribution. 
We have created this particular example such that we do have access to the full population, which lets us visualize the sampling distribution directly for learning purposes. sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) + geom_histogram(color = "lightgrey", bins = 12) + labs(x = "Sample proportions", y = "Count") + theme(text = element_text(size = 12)) sampling_distribution Figure 10.2: Sampling distribution of the sample proportion for sample size 40. The sampling distribution in Figure 10.2 appears to be bell-shaped, is roughly symmetric, and has one peak. It is centered around 0.7 and the sample proportions range from about 0.4 to about 1. In fact, we can calculate the mean of the sample proportions. sample_estimates |> summarize(mean_proportion = mean(sample_proportion)) ## # A tibble: 1 × 1 ## mean_proportion ## <dbl> ## 1 0.747 We notice that the sample proportions are centered around the population proportion value, 0.747! In general, the mean of the sampling distribution should be equal to the population proportion. This is great news because it means that the sample proportion is neither an overestimate nor an underestimate of the population proportion. In other words, if you were to take many samples as we did above, there is no tendency towards over or underestimating the population proportion. In a real data analysis setting where you just have access to your single sample, this implies that you would suspect that your sample point estimate is roughly equally likely to be above or below the true population proportion. 10.4.2 Sampling distributions for means In the previous section, our variable of interest—room_type—was categorical, and the population parameter was a proportion. As mentioned in the chapter introduction, there are many choices of the population parameter for each type of variable. What if we wanted to infer something about a population of quantitative variables instead? 
For instance, a traveler visiting Vancouver, Canada may wish to estimate the population mean (or average) price per night of Airbnb listings. Knowing the average could help them tell whether a particular listing is overpriced. We can visualize the population distribution of the price per night with a histogram. population_distribution <- ggplot(airbnb, aes(x = price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) population_distribution Figure 10.3: Population distribution of price per night (dollars) for all Airbnb listings in Vancouver, Canada. In Figure 10.3, we see that the population distribution has one peak. It is also skewed (i.e., is not symmetric): most of the listings are less than $250 per night, but a small number of listings cost much more, creating a long tail on the histogram’s right side. Along with visualizing the population, we can calculate the population mean, the average price per night for all the Airbnb listings. population_parameters <- airbnb |> summarize(mean_price = mean(price)) population_parameters ## # A tibble: 1 × 1 ## mean_price ## <dbl> ## 1 154.51 The price per night of all Airbnb rentals in Vancouver, BC is $154.51, on average. This value is our population parameter since we are calculating it using the population data. Now suppose we did not have access to the population data (which is usually the case!), yet we wanted to estimate the mean price per night. We could answer this question by taking a random sample of as many Airbnb listings as our time and resources allow. Let’s say we could do this for 40 listings. What would such a sample look like? Let’s take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using rep_sample_n. 
one_sample <- airbnb |> rep_sample_n(40) We can create a histogram to visualize the distribution of observations in the sample (Figure 10.4), and calculate the mean of our sample. sample_distribution <- ggplot(one_sample, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) sample_distribution Figure 10.4: Distribution of price per night (dollars) for sample of 40 Airbnb listings. estimates <- one_sample |> summarize(mean_price = mean(price)) estimates ## # A tibble: 1 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 155.80 The average value of the sample of size 40 is $155.80. This number is a point estimate for the mean of the full population. Recall that the population mean was $154.51. So our estimate was fairly close to the population parameter: the mean was about 0.8% off. Note that we usually cannot compute the estimate’s accuracy in practice since we do not have access to the population parameter; if we did, we wouldn’t need to estimate it! Also, recall from the previous section that the point estimate can vary; if we took another random sample from the population, our estimate’s value might change. So then, did we just get lucky with our point estimate above? How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population, we can take many samples and plot the sampling distribution of sample means for samples of size 40 to get a sense for this variation. In this case, we’ll use 20,000 samples of size 40. 
samples <- rep_sample_n(airbnb, size = 40, reps = 20000) samples ## # A tibble: 800,000 × 9 ## # Groups: replicate [20,000] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1 1177 Downtown Entire h… 4 2 baths 2 2 ## 2 1 4063 Downtown Entire h… 2 1 bath 1 1 ## 3 1 2641 Kitsilano Private … 1 1 shared… 1 1 ## 4 1 1941 West End Entire h… 2 1 bath 1 1 ## 5 1 2431 Mount Pleasa… Entire h… 2 1 bath 1 1 ## 6 1 1871 Arbutus Ridge Entire h… 4 1 bath 2 2 ## 7 1 2557 Marpole Private … 3 1 privat… 1 2 ## 8 1 3534 Downtown Entire h… 2 1 bath 1 1 ## 9 1 4379 Downtown Entire h… 4 1 bath 1 0 ## 10 1 2161 Downtown Entire h… 4 2 baths 2 2 ## # ℹ 799,990 more rows ## # ℹ 1 more variable: price <dbl> Now we can calculate the sample mean for each replicate and plot the sampling distribution of sample means for samples of size 40. sample_estimates <- samples |> group_by(replicate) |> summarize(mean_price = mean(price)) sample_estimates ## # A tibble: 20,000 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 160.06 ## 2 2 173.18 ## 3 3 131.20 ## 4 4 176.96 ## 5 5 125.65 ## 6 6 148.84 ## 7 7 134.82 ## 8 8 137.26 ## 9 9 166.11 ## 10 10 157.81 ## # ℹ 19,990 more rows sampling_distribution_40 <- ggplot(sample_estimates, aes(x = mean_price)) + geom_histogram(color = "lightgrey") + labs(x = "Sample mean price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) sampling_distribution_40 Figure 10.5: Sampling distribution of the sample means for sample size of 40. In Figure 10.5, the sampling distribution of the mean has one peak and is bell-shaped. Most of the estimates are between about $140 and $170; but there is a good fraction of cases outside this range (i.e., where the point estimate was not close to the population parameter). So it does indeed look like we were quite lucky when we estimated the population mean with only 0.8% error. 
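The simulation above relies on rep_sample_n from the infer package. As a rough illustration of the same idea without extra packages, the sketch below uses base R only. It does not use the Airbnb data: population is a synthetic, hypothetical right-skewed set of prices generated with rlnorm, and the names population and sample_means are made up for this example.

```r
set.seed(2024)

# Hypothetical right-skewed "population" of 5,000 nightly prices (synthetic data,
# not the Airbnb listings used in the text).
population <- rlnorm(5000, meanlog = 5, sdlog = 0.5)

# Draw 2,000 samples of size 40 (without replacement) and record each sample mean.
sample_means <- replicate(2000, mean(sample(population, size = 40)))

# Compare the center of the simulated sampling distribution with the population mean.
c(population_mean = mean(population), mean_of_sample_means = mean(sample_means))

# Compare the spread of the sample means with the spread of individual prices.
c(population_sd = sd(population), sd_of_sample_means = sd(sample_means))
```

Under these assumptions, the two means should be close to each other, while the standard deviation of the sample means should be noticeably smaller than that of the individual prices. Passing sample_means to hist() would give a picture comparable in spirit to Figure 10.5.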
Let’s visualize the population distribution, distribution of the sample, and the sampling distribution on one plot to compare them in Figure 10.6. Comparing these three distributions, the centers of the distributions are all around the same price (around $150). The original population distribution has a long right tail, and the sample distribution has a similar shape to that of the population distribution. However, the sampling distribution is not shaped like the population or sample distribution. Instead, it has a bell shape, and it has a lower spread than the population or sample distributions. The sample means vary less than the individual observations because there will be some high values and some small values in any random sample, which will keep the average from being too extreme. Figure 10.6: Comparison of population distribution, sample distribution, and sampling distribution. Given that there is quite a bit of variation in the sampling distribution of the sample mean—i.e., the point estimate that we obtain is not very reliable—is there any way to improve the estimate? One way to improve a point estimate is to take a larger sample. To illustrate what effect this has, we will take many samples of size 20, 50, 100, and 500, and plot the sampling distribution of the sample mean. We indicate the mean of the sampling distribution with a vertical dashed line. Figure 10.7: Comparison of sampling distributions, with mean highlighted as a vertical dashed line. Based on the visualization in Figure 10.7, three points about the sample mean become clear. First, the mean of the sample mean (across samples) is equal to the population mean. In other words, the sampling distribution is centered at the population mean. Second, increasing the size of the sample decreases the spread (i.e., the variability) of the sampling distribution. Therefore, a larger sample size results in a more reliable point estimate of the population parameter. 
And third, the distribution of the sample mean is roughly bell-shaped. Note: You might notice that in the n = 20 case in Figure 10.7, the distribution is not quite bell-shaped. There is a bit of skew towards the right! You might also notice that in the n = 50 case and larger, that skew seems to disappear. In general, the sampling distribution—for both means and proportions—only becomes bell-shaped once the sample size is large enough. How large is “large enough?” Unfortunately, it depends entirely on the problem at hand. But as a rule of thumb, often a sample size of at least 20 will suffice. 10.4.3 Summary A point estimate is a single value computed using a sample from a population (e.g., a mean or proportion). The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population. The shape of the sampling distribution is usually bell-shaped with one peak and centered at the population mean or proportion. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases. 10.5 Bootstrapping 10.5.1 Overview Why all this emphasis on sampling distributions? We saw in the previous section that we could compute a point estimate of a population parameter using a sample of observations from the population. And since we constructed examples where we had access to the population, we could evaluate how accurate the estimate was, and even get a sense of how much the estimate would vary for different samples from the population. But in real data analysis settings, we usually have just one sample from our population and do not have access to the population itself. Therefore we cannot construct the sampling distribution as we did in the previous section. And as we saw, our sample estimate’s value can vary significantly from the population parameter. 
So reporting the point estimate from a single sample alone may not be enough. We also need to report some notion of uncertainty in the value of the point estimate. Unfortunately, we cannot construct the exact sampling distribution without full access to the population. However, if we could somehow approximate what the sampling distribution would look like for a sample, we could use that approximation to then report how uncertain our sample point estimate is (as we did above with the exact sampling distribution). There are several methods to accomplish this; in this book, we will use the bootstrap. We will discuss interval estimation and construct confidence intervals using just a single sample from a population. A confidence interval is a range of plausible values for our population parameter. Here is the key idea. First, if you take a big enough sample, it looks like the population. Notice the histograms’ shapes for samples of different sizes taken from the population in Figure 10.8. We see that the sample’s distribution looks like that of the population for a large enough sample. Figure 10.8: Comparison of samples of different sizes from the population. In the previous section, we took many samples of the same size from our population to get a sense of the variability of a sample estimate. But if our sample is big enough that it looks like our population, we can pretend that our sample is the population, and take more samples (with replacement) of the same size from it instead! This very clever technique is called the bootstrap. Note that by taking many samples from our single, observed sample, we do not obtain the true sampling distribution, but rather an approximation that we call the bootstrap distribution. Note: We must sample with replacement when using the bootstrap. Otherwise, if we had a sample of size \\(n\\), and obtained a sample from it of size \\(n\\) without replacement, it would just return our original sample! 
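To see why sampling with replacement matters, here is a small base-R sketch. The ten prices are made-up values used only for illustration, and base R's sample() stands in for rep_sample_n; the variable names are hypothetical.

```r
set.seed(1)

# Hypothetical observed sample of 10 nightly prices (illustrative values only).
original_sample <- c(58, 112, 151, 700, 157, 100, 110, 105, 196, 269)
n <- length(original_sample)

# Without replacement: a draw of size n is just the original sample reshuffled.
without_replacement <- sample(original_sample, size = n, replace = FALSE)

# With replacement: some values may repeat and others may be left out.
with_replacement <- sample(original_sample, size = n, replace = TRUE)

# The reshuffled sample has the same values, and hence the same mean, as the original.
identical(sort(without_replacement), sort(original_sample))
mean(original_sample)
mean(without_replacement)

# The resample drawn with replacement will typically have a different mean.
mean(with_replacement)
```

Because the draw without replacement only reorders the data, any point estimate computed from it equals the original estimate, so it carries no information about variability. Only the draw with replacement can produce a different estimate.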
This section will explore how to create a bootstrap distribution from a single sample using R. The process is visualized in Figure 10.9. For a sample of size \\(n\\), you would do the following: (1) Randomly select an observation from the original sample, which was drawn from the population. (2) Record the observation’s value. (3) Replace that observation. (4) Repeat steps 1–3 (sampling with replacement) until you have \\(n\\) observations, which form a bootstrap sample. (5) Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the \\(n\\) observations in your bootstrap sample. (6) Repeat steps 1–5 many times to create a distribution of point estimates (the bootstrap distribution). (7) Calculate the plausible range of values around our observed point estimate. Figure 10.9: Overview of the bootstrap process. 10.5.2 Bootstrapping in R Let’s continue working with our Airbnb example to illustrate how we might create and use a bootstrap distribution using just a single sample from the population. Once again, suppose we are interested in estimating the population mean price per night of all Airbnb listings in Vancouver, Canada, using a single sample of size 40. Recall our point estimate was $155.80. The histogram of prices in the sample is displayed in Figure 10.10. 
one_sample ## # A tibble: 40 × 8 ## id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 3928 Marpole Private … 2 1 shared… 1 1 58 ## 2 3013 Kensington-Cedar… Entire h… 4 1 bath 2 2 112 ## 3 3156 Downtown Entire h… 6 2 baths 2 2 151 ## 4 3873 Dunbar Southlands Private … 5 1 bath 2 3 700 ## 5 3632 Downtown Eastside Entire h… 6 2 baths 3 3 157 ## 6 296 Kitsilano Private … 1 1 shared… 1 1 100 ## 7 3514 West End Entire h… 2 1 bath 1 1 110 ## 8 594 Sunset Entire h… 5 1 bath 3 3 105 ## 9 3305 Dunbar Southlands Entire h… 4 1 bath 1 2 196 ## 10 938 Downtown Entire h… 7 2 baths 2 3 269 ## # ℹ 30 more rows one_sample_dist <- ggplot(one_sample, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) one_sample_dist Figure 10.10: Histogram of price per night (dollars) for one sample of size 40. The histogram for the sample is skewed, with a few observations out to the right. The mean of the sample is $155.80. Remember, in practice, we usually only have this one sample from the population. So this sample and estimate are the only data we can work with. We now perform steps 1–5 listed above to generate a single bootstrap sample in R and calculate a point estimate from that bootstrap sample. We will use the rep_sample_n function as we did when we were creating our sampling distribution. But critically, note that we now pass one_sample—our single sample of size 40—as the first argument. And since we need to sample with replacement, we change the argument for replace from its default value of FALSE to TRUE. boot1 <- one_sample |> rep_sample_n(size = 40, replace = TRUE, reps = 1) boot1_dist <- ggplot(boot1, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) boot1_dist Figure 10.11: Bootstrap distribution. 
summarize(boot1, mean_price = mean(price)) ## # A tibble: 1 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 164.20 Notice in Figure 10.11 that the histogram of our bootstrap sample has a similar shape to the original sample histogram. Though the shapes of the distributions are similar, they are not identical. You’ll also notice that the original sample mean and the bootstrap sample mean differ. How might that happen? Remember that we are sampling with replacement from the original sample, so some observations can be drawn more than once while others are not drawn at all; the bootstrap sample therefore usually does not contain exactly the same set of values as the original sample. We are pretending that our single sample is close to the population, and we are trying to mimic drawing another sample from the population by drawing one from our original sample. Let’s now take 20,000 bootstrap samples from the original sample (one_sample) using rep_sample_n, and calculate the mean of each of those replicates. Recall that this assumes that one_sample looks like our original population; but since we do not have access to the population itself, this is often the best we can do. 
boot20000 <- one_sample |> rep_sample_n(size = 40, replace = TRUE, reps = 20000) boot20000 ## # A tibble: 800,000 × 9 ## # Groups: replicate [20,000] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1 1276 Hastings-Sun… Entire h… 2 1 bath 1 1 ## 2 1 3235 Hastings-Sun… Entire h… 2 1 bath 1 1 ## 3 1 1301 Oakridge Entire h… 12 2 baths 2 12 ## 4 1 118 Grandview-Wo… Entire h… 4 1 bath 2 2 ## 5 1 2550 Downtown Eas… Private … 2 1.5 shar… 1 1 ## 6 1 1006 Grandview-Wo… Entire h… 5 1 bath 3 4 ## 7 1 3632 Downtown Eas… Entire h… 6 2 baths 3 3 ## 8 1 1923 West End Entire h… 4 2 baths 2 2 ## 9 1 3873 Dunbar South… Private … 5 1 bath 2 3 ## 10 1 2349 Kerrisdale Private … 2 1 shared… 1 1 ## # ℹ 799,990 more rows ## # ℹ 1 more variable: price <dbl> tail(boot20000) ## # A tibble: 6 × 9 ## # Groups: replicate [1] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds ## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 20000 1949 Kitsilano Entire h… 3 1 bath 1 1 ## 2 20000 1025 Kensington-Ce… Entire h… 3 1 bath 1 1 ## 3 20000 3013 Kensington-Ce… Entire h… 4 1 bath 2 2 ## 4 20000 2868 Downtown Entire h… 2 1 bath 1 1 ## 5 20000 3156 Downtown Entire h… 6 2 baths 2 2 ## 6 20000 1923 West End Entire h… 4 2 baths 2 2 ## # ℹ 1 more variable: price <dbl> Let’s take a look at the histograms of the first six replicates of our bootstrap samples. six_bootstrap_samples <- boot20000 |> filter(replicate <= 6) ggplot(six_bootstrap_samples, aes(price)) + geom_histogram(color = "lightgrey") + labs(x = "Price per night (dollars)", y = "Count") + facet_wrap(~replicate) + theme(text = element_text(size = 12)) Figure 10.12: Histograms of the first six replicates of the bootstrap samples. We see in Figure 10.12 how the bootstrap samples differ. We can also calculate the sample mean for each of these six replicates. 
six_bootstrap_samples |> group_by(replicate) |> summarize(mean_price = mean(price)) ## # A tibble: 6 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 177.2 ## 2 2 131.45 ## 3 3 179.10 ## 4 4 171.35 ## 5 5 191.32 ## 6 6 170.05 We can see that the bootstrap sample distributions and the sample means are different. They are different because we are sampling with replacement. We will now calculate point estimates for our 20,000 bootstrap samples and generate a bootstrap distribution of our point estimates. The bootstrap distribution (Figure 10.13) suggests how we might expect our point estimate to behave if we took another sample. boot20000_means <- boot20000 |> group_by(replicate) |> summarize(mean_price = mean(price)) boot20000_means ## # A tibble: 20,000 × 2 ## replicate mean_price ## <int> <dbl> ## 1 1 177.2 ## 2 2 131.45 ## 3 3 179.10 ## 4 4 171.35 ## 5 5 191.32 ## 6 6 170.05 ## 7 7 178.83 ## 8 8 154.78 ## 9 9 163.85 ## 10 10 209.28 ## # ℹ 19,990 more rows tail(boot20000_means) ## # A tibble: 6 × 2 ## replicate mean_price ## <int> <dbl> ## 1 19995 130.40 ## 2 19996 189.18 ## 3 19997 168.98 ## 4 19998 168.23 ## 5 19999 155.73 ## 6 20000 136.95 boot_est_dist <- ggplot(boot20000_means, aes(x = mean_price)) + geom_histogram(color = "lightgrey") + labs(x = "Sample mean price per night (dollars)", y = "Count") + theme(text = element_text(size = 12)) boot_est_dist Figure 10.13: Distribution of the bootstrap sample means. Let’s compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with the true sampling distribution—which corresponds to taking many samples from the population. Figure 10.14: Comparison of the distribution of the bootstrap sample means and sampling distribution. There are two essential points that we can take away from Figure 10.14. 
First, the shape and spread of the true sampling distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate’s variability. The second important point is that the means of these two distributions are different. The sampling distribution is centered at $154.51, the population mean value. However, the bootstrap distribution is centered at the original sample’s mean price per night, $155.87. Because we are resampling from the original sample repeatedly, we see that the bootstrap distribution is centered at the original sample’s mean value (unlike the sampling distribution of the sample mean, which is centered at the population parameter value). Figure 10.15 summarizes the bootstrapping process. The idea here is that we can use this distribution of bootstrap sample means to approximate the sampling distribution of the sample means when we only have one sample. Since the bootstrap distribution pretty well approximates the sampling distribution spread, we can use the bootstrap spread to help us develop a plausible range for our population parameter along with our estimate! Figure 10.15: Summary of bootstrapping process. 10.5.3 Using the bootstrap to calculate a plausible range Now that we have constructed our bootstrap distribution, let’s use it to create an approximate 95% percentile bootstrap confidence interval. A confidence interval is a range of plausible values for the population parameter. We will find the range of values covering the middle 95% of the bootstrap distribution, giving us a 95% confidence interval. You may be wondering, what does “95% confidence” mean? If we took 100 random samples and calculated 100 95% confidence intervals, then about 95% of the ranges would capture the population parameter’s value. Note there’s nothing special about 95%. We could have used other levels, such as 90% or 99%. There is a balance between our level of confidence and precision. 
A higher confidence level corresponds to a wider interval, and a lower confidence level corresponds to a narrower one. Therefore, the level we choose depends on how much risk of being wrong we are willing to accept, given the consequences of being wrong in our application. In general, we choose confidence levels to be comfortable with our level of uncertainty but not so strict that the interval is unhelpful. For instance, if our decision impacts human life and the implications of being wrong are deadly, we may want to be very confident and choose a higher confidence level. To calculate a 95% percentile bootstrap confidence interval, we will do the following: Arrange the observations in the bootstrap distribution in ascending order. Find the value such that 2.5% of observations fall below it (the 2.5th percentile). Use that value as the lower bound of the interval. Find the value such that 97.5% of observations fall below it (the 97.5th percentile). Use that value as the upper bound of the interval. To do this in R, we can use the quantile() function. Quantiles are expressed in proportions rather than percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively. bounds <- boot20000_means |> select(mean_price) |> pull() |> quantile(c(0.025, 0.975)) bounds ## 2.5% 97.5% ## 119 204 Our interval, $119.28 to $203.63, captures the middle 95% of the sample mean prices in the bootstrap distribution. We can visualize the interval on our distribution in Figure 10.16. Figure 10.16: Distribution of the bootstrap sample means with percentile lower and upper bounds. To finish our estimation of the population parameter, we would report the point estimate and our confidence interval’s lower and upper bounds. Here the sample mean price per night of 40 Airbnb listings was $155.80, and we are 95% “confident” that the true population mean price per night for all Airbnb listings in Vancouver is between $119.28 and $203.63. 
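The percentile procedure listed above can also be sketched end to end in base R. This is only an illustration under stated assumptions: the prices are simulated rather than taken from the Airbnb sample, the names sample_prices and boot_means are hypothetical, and 2,000 replicates are used instead of 20,000 to keep it quick.

```r
# Hypothetical data: 40 simulated nightly prices (NOT the Airbnb sample).
set.seed(1)
sample_prices <- rexp(40, rate = 1 / 150)

# Draw 2,000 bootstrap samples (size 40, with replacement) and
# record the mean of each one.
boot_means <- replicate(
  2000,
  mean(sample(sample_prices, size = length(sample_prices), replace = TRUE))
)

# Arrange the bootstrap means in ascending order, then take the
# 0.025 and 0.975 quantiles as the lower and upper bounds.
sorted_means <- sort(boot_means)
lower <- quantile(sorted_means, 0.025)
upper <- quantile(sorted_means, 0.975)

# Report the point estimate together with the interval.
interval <- c(
  estimate = mean(sample_prices),
  lower = unname(lower),
  upper = unname(upper)
)
interval
```

Sorting is not strictly needed, since quantile() orders the values itself; it is kept here only to mirror the listed steps. The interval printed by this sketch comes from simulated prices, so it will not match the Airbnb interval of $119.28 to $203.63 reported above.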
Notice that our interval does indeed contain the true population mean value, $154.51! However, in practice, we would not know whether our interval captured the population parameter or not because we usually only have a single sample, not the entire population. This is the best we can do when we only have one sample! This chapter is only the beginning of the journey into statistical inference. We can extend the concepts learned here to do much more than report point estimates and confidence intervals, such as testing for real differences between populations, tests for associations between variables, and so much more. We have just scratched the surface of statistical inference; however, the material presented here will serve as the foundation for more advanced statistical techniques you may learn about in the future! 10.6 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the two “Statistical inference” rows. You can launch an interactive version of each worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of each worksheet by clicking “view worksheet.” If you instead decide to download the worksheets and run them on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 10.7 Additional resources Chapters 7 to 10 of Modern Dive (Ismay and Kim 2020) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using tidyverse and infer in a slightly more in-depth manner than the present chapter. 
Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics. Chapters 4 to 7 of OpenIntro Statistics (Diez, Çetinkaya-Rundel, and Barr 2019) provide a good next step after Modern Dive. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is the language of statistics; if you have a solid grasp of probability, more advanced statistics will come naturally to you! References "],["jupyter.html", "Chapter 11 Combining code and text with Jupyter 11.1 Overview 11.2 Chapter learning objectives 11.3 Jupyter 11.4 Code cells 11.5 Markdown cells 11.6 Saving your work 11.7 Best practices for running a notebook 11.8 Exploring data files 11.9 Exporting to a different file format 11.10 Creating a new Jupyter notebook 11.11 Additional resources", " Chapter 11 Combining code and text with Jupyter 11.1 Overview A typical data analysis involves not only writing and executing code, but also writing text and displaying images that help tell the story of the analysis. In fact, ideally, we would like to interleave these three media, with the text and images serving as narration for the code and its output. In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in data science. Jupyter notebooks do precisely what we need: they let you combine text, images, and (executable!) code in a single document. 
In this chapter, we will focus on the use of Jupyter notebooks to program in R and write text via a web interface. These skills are essential to getting your analysis running; think of it like getting dressed in the morning! Note that we assume that you already have Jupyter set up and ready to use. If that is not the case, please first read Chapter 13 to learn how to install and configure Jupyter on your own computer. 11.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Create new Jupyter notebooks. Write, edit, and execute R code in a Jupyter notebook. Write, edit, and view text in a Jupyter notebook. Open and view plain text data files in Jupyter. Export Jupyter notebooks to other standard file types (e.g., .html, .pdf). 11.3 Jupyter Jupyter (Kluyver et al. 2016) is a web-based interactive development environment for creating, editing, and executing documents called Jupyter notebooks. Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they combine these two analysis artifacts in a single document—code is not separate from the output or written report—notebooks are one of the leading tools to create reproducible data analyses. Reproducible data analysis is one where you can reliably and easily re-create the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality, this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner. An example of what a Jupyter notebook looks like is shown in Figure 11.1. Figure 11.1: A screenshot of a Jupyter Notebook. 11.3.1 Accessing Jupyter One of the easiest ways to start working with Jupyter is to use a web-based platform called JupyterHub. JupyterHubs often have Jupyter, R, a number of R packages, and collaboration tools installed, configured and ready to use. 
JupyterHubs are usually created and provisioned by organizations, and require authentication to gain access. For example, if you are reading this book as part of a course, your instructor may have a JupyterHub already set up for you to use! Jupyter can also be installed on your own computer; see Chapter 13 for instructions. 11.4 Code cells The sections of a Jupyter notebook that contain code are referred to as code cells. A code cell that has not yet been executed has no number inside the square brackets to the left of the cell (Figure 11.2). Running a code cell will execute all of the code it contains, and the output (if any exists) will be displayed directly underneath the code that generated it. Outputs may include printed text or numbers, data frames and data visualizations. Cells that have been executed have a number inside the square brackets to the left of the cell. This number indicates the order in which the cells were run (Figure 11.3). Figure 11.2: A code cell in Jupyter that has not yet been executed. Figure 11.3: A code cell in Jupyter that has been executed. 11.4.1 Executing code cells Code cells can be run independently or as part of executing the entire notebook using one of the “Run all” commands found in the Run or Kernel menus in Jupyter. Running a single code cell independently is a workflow typically used when editing or writing your own R code. Executing an entire notebook is a workflow typically used to ensure that your analysis runs in its entirety before sharing it with others, and when using a notebook as part of an automated process. To run a code cell independently, the cell needs to first be activated. This is done by clicking on it with the cursor. Jupyter will indicate a cell has been activated by highlighting it with a blue rectangle to its left. After the cell has been activated (Figure 11.4), the cell can be run by either pressing the Run (▶) button in the toolbar, or by using the keyboard shortcut Shift + Enter. 
Figure 11.4: An activated cell that is ready to be run. The blue rectangle to the cell’s left (annotated by a red arrow) indicates that it is ready to be run. The cell can be run by clicking the run button (circled in red). To execute all of the code cells in an entire notebook, you have three options: Select Run >> Run All Cells from the menu. Select Kernel >> Restart Kernel and Run All Cells… from the menu (Figure 11.5). Click the (⏭) button in the tool bar. All of these commands result in all of the code cells in a notebook being run. However, there is a slight difference between them. In particular, only options 2 and 3 above will restart the R session before running all of the cells; option 1 will not restart the session. Restarting the R session means that all previous objects that were created from running cells before this command was run will be deleted. In other words, restarting the session and then running all cells (options 2 or 3) emulates how your notebook code would run if you completely restarted Jupyter before executing your entire notebook. Figure 11.5: Restarting the R session can be accomplished by clicking Restart Kernel and Run All Cells… 11.4.2 The Kernel The kernel is a program that executes the code inside your notebook and outputs the results. Kernels for many different programming languages have been created for Jupyter, which means that Jupyter can interpret and execute the code of many different programming languages. To run R code, your notebook will need an R kernel. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (◯) the kernel is idle and ready to execute code. If the circle is filled in (⬤) the kernel is busy running some code. You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. 
If this happens, try the following steps: At the top of your screen, click Kernel, then Interrupt Kernel. If that doesn’t help, click Kernel, then Restart Kernel… If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work. If that still doesn’t help, restart Jupyter. First, save your work by clicking File at the top left of your screen, then Save Notebook. Next, if you are accessing Jupyter using a JupyterHub server, from the File menu click Hub Control Panel. Choose Stop My Server to shut it down, then the My Server button to start it back up. If you are running Jupyter on your own computer, from the File menu click Shut Down, then start Jupyter again. Finally, navigate back to the notebook you were working on. 11.4.3 Creating new code cells To create a new code cell in Jupyter (Figure 11.6), click the + button in the toolbar. By default, all new cells in Jupyter start out as code cells, so after this, all you have to do is write R code within the new cell you just created! Figure 11.6: New cells can be created by clicking the + button, and are by default code cells. 11.5 Markdown cells Text cells inside a Jupyter notebook are called Markdown cells. Markdown cells are rich formatted text cells, which means you can bold and italicize text, create subject headers, create bullet and numbered lists, and more. These cells are given the name “Markdown” because they use Markdown language to specify the rich text formatting. You do not need to learn Markdown to write text in the Markdown cells in Jupyter; plain text will work just fine. However, you might want to learn a bit about Markdown eventually to enable you to create nicely formatted analyses. See the additional resources at the end of this chapter to find out where you can start learning Markdown. 11.5.1 Editing Markdown cells To edit a Markdown cell in Jupyter, you need to double click on the cell. 
Once you do this, the unformatted (or unrendered) version of the text will be shown (Figure 11.7). You can then use your keyboard to edit the text. To view the formatted (or rendered) text (Figure 11.8), click the Run (▶) button in the toolbar, or use the Shift + Enter keyboard shortcut. Figure 11.7: A Markdown cell in Jupyter that has not yet been rendered and can be edited. Figure 11.8: A Markdown cell in Jupyter that has been rendered and exhibits rich text formatting. 11.5.2 Creating new Markdown cells To create a new Markdown cell in Jupyter, click the + button in the toolbar. By default, all new cells in Jupyter start as code cells, so the cell format needs to be changed to be recognized and rendered as a Markdown cell. To do this, click on the cell with your cursor to ensure it is activated. Then click on the drop-down box on the toolbar that says “Code” (it is next to the ⏭ button), and change it from “Code” to “Markdown” (Figure 11.9). Figure 11.9: New cells are by default code cells. To create Markdown cells, the cell format must be changed. 11.6 Saving your work As with any file you work on, it is critical to save your work often so you don’t lose your progress! Jupyter has an autosave feature, where open files are saved periodically. The default for this is every two minutes. You can also manually save a Jupyter notebook by selecting Save Notebook from the File menu, by clicking the disk icon on the toolbar, or by using a keyboard shortcut (Control + S for Windows, or Command + S for Mac OS). 11.7 Best practices for running a notebook 11.7.1 Best practices for executing code cells As you might know (or at least imagine) by now, Jupyter notebooks are great for interactively editing, writing and running R code; this is what they were designed for! Consequently, Jupyter notebooks are flexible in regards to code cell execution order. This flexibility means that code cells can be run in any arbitrary order using the Run (▶) button. 
But this flexibility has a downside: it can lead to Jupyter notebooks whose code cannot be executed in a linear order (from top to bottom of the notebook). A nonlinear notebook is problematic because a linear order is the conventional way code documents are run, and others will have this expectation when running your notebook. Finally, if the code is used in some automated process, it will need to run in a linear order, from top to bottom of the notebook. The most common way to inadvertently create a nonlinear notebook is to rely solely on using the ▶ button to execute cells. For example, suppose you write some R code that creates an R object, say a variable named y. When you execute that cell and create y, it will continue to exist until it is deliberately deleted with R code, or when the Jupyter notebook R session (i.e., kernel) is stopped or restarted. It can also be referenced in another distinct code cell (Figure 11.10). Together, this means that you could then write a code cell further above in the notebook that references y and execute it without error in the current session (Figure 11.11). This could also be done successfully in future sessions if, and only if, you run the cells in the same unconventional order. However, it is difficult to remember this unconventional order, and it is not the order that others would expect your code to be executed in. Thus, in the future, this would lead to errors when the notebook is run in the conventional linear order (Figure 11.12). Figure 11.10: Code that was written out of order, but not yet executed. Figure 11.11: Code that was written out of order, and was executed using the run button in a nonlinear order without error. The order of execution can be traced by following the numbers to the left of the code cells; their order indicates the order in which the cells were executed. 
Figure 11.12: Code that was written out of order, and was executed in a linear order using “Restart Kernel and Run All Cells…” This resulted in an error at the execution of the second code cell and it failed to run all code cells in the notebook. You can also accidentally create a nonfunctioning notebook by creating an object in a cell that later gets deleted. In such a scenario, that object only exists for that one particular R session and will not exist once the notebook is restarted and run again. If that object was referenced in another cell in that notebook, an error would occur when the notebook was run again in a new session. These events may not negatively affect the current R session when the code is being written; but as you might now see, they will likely lead to errors when that notebook is run in a future session. Regularly executing the entire notebook in a fresh R session will help guard against this. If you restart your session and new errors seem to pop up when you run all of your cells in linear order, you can at least be aware that there is an issue. Knowing this sooner rather than later will allow you to fix the issue and ensure your notebook can be run linearly from start to finish. We recommend as a best practice to run the entire notebook in a fresh R session at least 2–3 times within any period of work. Note that, critically, you must do this in a fresh R session by restarting your kernel. We recommend using either the Kernel >> Restart Kernel and Run All Cells… command from the menu or the ⏭ button in the toolbar. Note that the Run >> Run All Cells menu item will not restart the kernel, and so it is not sufficient to guard against these errors. 11.7.2 Best practices for including R packages in notebooks Most data analyses these days depend on functions from external R packages that are not built into R. One example is the tidyverse metapackage that we heavily rely on in this book. 
This package provides us access to functions like read_csv for reading data, select for subsetting columns, and ggplot for creating high-quality graphics. As mentioned earlier in the book, external R packages need to be loaded before the functions they contain can be used. Our recommended way to do this is via library(package_name). But where should this line of code be written in a Jupyter notebook? One idea could be to load the package right before the function is used in the notebook. However, although this technically works, it causes hidden, or at least non-obvious, R package dependencies when others view or try to run the notebook. These hidden dependencies can lead to errors when the notebook is executed on another computer if the needed R packages are not installed. Additionally, if the data analysis code takes a long time to run, uncovering the hidden dependencies that need to be installed so that the analysis can run without error can take a great deal of time. Therefore, we recommend you load all R packages in a code cell near the top of the Jupyter notebook. Loading all your packages at the start ensures that all packages are loaded before their functions are called, assuming the notebook is run in a linear order from top to bottom as recommended above. It also makes it easy for others viewing or running the notebook to see what external R packages are used in the analysis, and hence, what packages they should install on their computer to run the analysis successfully. 11.7.3 Summary of best practices for running a notebook Write code so that it can be executed in a linear order. As you write code in a Jupyter notebook, run the notebook in a linear order and in its entirety often (2–3 times every work session) via the Kernel >> Restart Kernel and Run All Cells… command from the Jupyter menu or the ⏭ button in the toolbar. Write the code that loads external R packages near the top of the Jupyter notebook. 
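As a rough illustration of the last recommendation, the first code cell of a notebook might look something like the sketch below. The package list (tidyverse and infer) is an assumption for a hypothetical analysis, and the check for missing packages is an optional extra that goes beyond the advice in this chapter; a plain sequence of library() calls at the top of the notebook follows the recommendation just as well.

```r
# First code cell of the notebook.
# Hypothetical list of packages this analysis depends on.
required_packages <- c("tidyverse", "infer")

# Check which of them are installed, so that a missing package is
# reported immediately rather than partway through a long analysis.
is_installed <- vapply(
  required_packages,
  requireNamespace,
  FUN.VALUE = logical(1),
  quietly = TRUE
)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message(
    "Install these packages first: ",
    paste(missing_packages, collapse = ", ")
  )
} else {
  # Load everything up front, in one visible place.
  library(tidyverse)
  library(infer)
}
```

Keeping this cell at the top means that anyone who runs Restart Kernel and Run All Cells… sees the full list of dependencies before any analysis code executes.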
11.8 Exploring data files It is essential to preview data files before you try to read them into R to see whether or not there are column names, what the delimiters are, and if there are lines you need to skip. In Jupyter, you can preview data files stored as plain text files (e.g., comma- and tab-separated files) in their plain text format (Figure 11.14) by right-clicking on the file’s name in the Jupyter file explorer, selecting Open with, and then selecting Editor (Figure 11.13). If you do not explicitly open the data file with an editor, Jupyter will render a nice table for you. In that view you cannot see the column delimiters, so you will not know which function to use to read the file, nor which arguments to specify and what values to give them. Figure 11.13: Opening data files with an editor in Jupyter. Figure 11.14: A data file as viewed in an editor in Jupyter. 11.9 Exporting to a different file format In Jupyter, viewing, editing and running R code is done in the Jupyter notebook file format with file extension .ipynb. This file format is not easy to open and view outside of Jupyter. Thus, to share your analysis with people who do not commonly use Jupyter, it is recommended that you export your executed analysis as a more common file type, such as an .html file, or a .pdf. We recommend exporting the Jupyter notebook after executing the analysis so that you can also share the outputs of your code. Note, however, that your audience will not be able to run your analysis using a .html or .pdf file. If you want your audience to be able to reproduce the analysis, you must provide them with the .ipynb Jupyter notebook file. 11.9.1 Exporting to HTML Exporting to .html will result in a shareable file that anyone can open using a web browser (e.g., Firefox, Safari, Chrome, or Edge). The .html output will produce a document that is visually similar to what the Jupyter notebook looked like inside Jupyter. 
One point of caution here is that if there are images in your Jupyter notebook, you will need to share the image files and the .html file to see them. 11.9.2 Exporting to PDF Exporting to .pdf will result in a shareable file that anyone can open using many programs, including Adobe Acrobat, Preview, web browsers and many more. The benefit of exporting to PDF is that it is a standalone document, even if the Jupyter notebook included references to image files. Unfortunately, the default settings will result in a document that visually looks quite different from what the Jupyter notebook looked like. The font, page margins, and other details will appear different in the .pdf output. 11.10 Creating a new Jupyter notebook At some point, you will want to create a new, fresh Jupyter notebook for your own project instead of viewing, running or editing a notebook that was started by someone else. To do this, navigate to the Launcher tab, and click on the R icon under the Notebook heading. If no Launcher tab is visible, you can get a new one via clicking the + button at the top of the Jupyter file explorer (Figure 11.15). Figure 11.15: Clicking on the R icon under the Notebook heading will create a new Jupyter notebook with an R kernel. Once you have created a new Jupyter notebook, be sure to give it a descriptive name, as the default file name is Untitled.ipynb. You can rename files by first right-clicking on the file name of the notebook you just created, and then clicking Rename. This will make the file name editable. Use your keyboard to change the name. Pressing Enter or clicking anywhere else in the Jupyter interface will save the changed file name. We recommend not using white space or non-standard characters in file names. Doing so will not prevent you from using that file in Jupyter. However, these sorts of things become troublesome as you start to do more advanced data science projects that involve repetition and automation. 
We recommend naming files using lower case characters and separating words by a dash (-) or an underscore (_). 11.11 Additional resources The JupyterLab Documentation is a good next place to look for more information about working in Jupyter notebooks. This documentation goes into significantly more detail about all of the topics we covered in this chapter, and covers more advanced topics as well. If you are keen to learn about the Markdown language for rich text formatting, two good places to start are CommonMark’s Markdown cheatsheet and Markdown tutorial. References "],["version-control.html", "Chapter 12 Collaboration with version control 12.1 Overview 12.2 Chapter learning objectives 12.3 What is version control, and why should I use it? 12.4 Version control repositories 12.5 Version control workflows 12.6 Working with remote repositories using GitHub 12.7 Working with local repositories using Jupyter 12.8 Collaboration 12.9 Exercises 12.10 Additional resources", " Chapter 12 Collaboration with version control You mostly collaborate with yourself, and me-from-two-months-ago never responds to email. –Mark T. Holder 12.1 Overview This chapter will introduce the concept of using version control systems to track changes to a project over its lifespan, to share and edit code in a collaborative team, and to distribute the finished project to its intended audience. This chapter will also introduce how to use the two most common version control tools: Git for local version control, and GitHub for remote version control. We will focus on the most common version control operations used day-to-day in a standard data science project. There are many user interfaces for Git; in this chapter we will cover the Jupyter Git interface. 12.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Describe what version control is and why data analysis projects can benefit from it. Create a remote version control repository on GitHub. 
Use Jupyter’s Git version control tools for project versioning and collaboration: Clone a remote version control repository to create a local repository. Commit changes to a local version control repository. Push local changes to a remote version control repository. Pull changes from a remote version control repository to a local version control repository. Resolve merge conflicts. Give collaborators access to a remote GitHub repository. Communicate with collaborators using GitHub issues. Use best practices when collaborating on a project with others. 12.3 What is version control, and why should I use it? Data analysis projects often require iteration and revision to move from an initial idea to a finished product ready for the intended audience. Without deliberate and conscious effort towards tracking changes made to the analysis, projects tend to become messy. This mess can have serious, negative repercussions on an analysis project, including results that your code cannot reproduce, temporary files with snippets of ideas that are forgotten or not easy to find, mind-boggling file names that make it unclear which is the current working version of the file (e.g., document_final_draft_final.txt, to_hand_in_final_v2.txt, etc.), and more. Additionally, the iterative nature of data analysis projects means that most of the time, the final version of the analysis that is shared with the audience is only a fraction of what was explored during the development of that analysis. Changes in data visualizations and modeling approaches, as well as some negative results, are often not observable from reviewing only the final, polished analysis. The lack of observability of these parts of the analysis development can lead to others repeating things that did not work well, instead of seeing what did not work well, and using that as a springboard to new, more fruitful approaches. Finally, data analyses are typically completed by a team of people rather than a single person. 
This means that files need to be shared across multiple computers, and multiple people often end up editing the project simultaneously. In such a situation, determining who has the latest version of the project—and how to resolve conflicting edits—can be a real challenge. Version control helps solve these challenges. Version control is the process of keeping a record of changes to documents, including when the changes were made and who made them, throughout the history of their development. It also provides the means both to view earlier versions of the project and to revert changes. Version control is most commonly used in software development, but can be used for any electronic files for any type of project, including data analyses. Being able to record and view the history of a data analysis project is important for understanding how and why decisions to use one method or another were made, among other things. Version control also facilitates collaboration via tools to share edits with others and resolve conflicting edits. But even if you’re working on a project alone, you should still use version control. It helps you keep track of what you’ve done, when you did it, and what you’re planning to do next! To version control a project, you generally need two things: a version control system and a repository hosting service. The version control system is the software responsible for tracking changes, sharing changes you make with others, obtaining changes from others, and resolving conflicting edits. The repository hosting service is responsible for storing a copy of the version-controlled project online (a repository), where you and your collaborators can access it remotely, discuss issues and bugs, and distribute your final product. For both of these items, there is a wide variety of choices. In this textbook we’ll use Git for version control, and GitHub for repository hosting, because both are currently the most widely used platforms. 
In the additional resources section at the end of the chapter, we list many of the common version control systems and repository hosting services in use today. Note: Technically you don’t have to use a repository hosting service. You can, for example, version control a project that is stored only in a folder on your computer—never sharing it on a repository hosting service. But using a repository hosting service provides a few big benefits, including managing collaborator access permissions, tools to discuss and track bugs, and the ability to have external collaborators contribute work, not to mention the safety of having your work backed up in the cloud. Since most repository hosting services now offer free accounts, there are not many situations in which you wouldn’t want to use one for your project. 12.4 Version control repositories Typically, when we put a data analysis project under version control, we create two copies of the repository (Figure 12.1). One copy we use as our primary workspace where we create, edit, and delete files. This copy is commonly referred to as the local repository. The local repository most commonly exists on our computer or laptop, but can also exist within a workspace on a server (e.g., JupyterHub). The other copy is typically stored in a repository hosting service (e.g., GitHub), where we can easily share it with our collaborators. This copy is commonly referred to as the remote repository. Figure 12.1: Schematic of local and remote version control repositories. Both copies of the repository have a working directory where you can create, store, edit, and delete files (e.g., analysis.ipynb in Figure 12.1). Both copies of the repository also maintain a full project history (Figure 12.1). This history is a record of all versions of the project files that have been created. The repository history is not automatically generated; Git must be explicitly told when to record a version of the project. These records are called commits. 
Each commit is a snapshot of the file contents as well as metadata about the repository at the time the record was created (who made the commit, when it was made, etc.). In the local and remote repositories shown in Figure 12.1, there are two commits represented as rectangles inside the “Repository History” sections. The white rectangle represents the most recent commit, while faded rectangles represent previous commits. Each commit can be identified by a human-readable message, which you write when you make a commit, and a commit hash that Git automatically adds for you. The purpose of the message is to contain a brief, rich description of what work was done since the last commit. Messages act as a very useful narrative of the changes to a project over its lifespan. If you ever want to view or revert to an earlier version of the project, the message can help you identify which commit to view or revert to. In Figure 12.1, you can see two such messages, one for each commit: Created README.md and Added analysis draft. The hash is a string of characters consisting of about 40 letters and numbers. The hash serves as a unique identifier for the commit, and Git uses it to index the project history. Although hashes are quite long (imagine having to type out 40 precise characters to view an old project version!), Git is able to work with shorter versions of hashes. In Figure 12.1, you can see two of these shortened hashes, one for each commit: Daa29d6 and 884c7ce. 12.5 Version control workflows When you work in a local version-controlled repository, there are generally three additional steps you must take as part of your regular workflow. In addition to just working on files (creating, editing, and deleting files as you normally would), you must: Tell Git when to make a commit of your own changes in the local repository. Tell Git when to send your new commits to the remote GitHub repository.
Tell Git when to retrieve any new changes (that others made) from the remote GitHub repository. In this section we will discuss all three of these steps in detail. 12.5.1 Committing changes to a local repository When working on files in your local version control repository (e.g., using Jupyter) and saving your work, these changes will only initially exist in the working directory of the local repository (Figure 12.2). Figure 12.2: Local repository with changes to files. Once you reach a point that you want Git to keep a record of the current version of your work, you need to commit (i.e., snapshot) your changes. A prerequisite to this is telling Git which files should be included in that snapshot. We call this step adding the files to the staging area. Note that the staging area is not a real physical location on your computer; it is instead a conceptual placeholder for these files until they are committed. The benefit of the Git version control system using a staging area is that you can choose to commit changes in only certain files. For example, in Figure 12.3, we add only the two files that are important to the analysis project (analysis.ipynb and README.md) and not our personal scratch notes for the project (notes.txt). Figure 12.3: Adding modified files to the staging area in the local repository. Once the files we wish to commit have been added to the staging area, we can then commit those files to the repository history (Figure 12.4). When we do this, we are required to include a helpful commit message to tell collaborators (which often includes future you!) about the changes that were made. In Figure 12.4, the message is Message about changes...; in your work you should make sure to replace this with an informative message about what changed. It is also important to note here that these changes are only being committed to the local repository’s history. 
The remote repository on GitHub has not changed, and collaborators are not yet able to see your new changes. Figure 12.4: Committing the modified files in the staging area to the local repository history, with an informative message about what changed. 12.5.2 Pushing changes to a remote repository Once you have made one or more commits that you want to share with your collaborators, you need to push (i.e., send) those commits back to GitHub (Figure 12.5). This updates the history in the remote repository (i.e., GitHub) to match what you have in your local repository. Now when collaborators interact with the remote repository, they will be able to see the changes you made. And you can also take comfort in the fact that your work is now backed up in the cloud! Figure 12.5: Pushing the commit to send the changes to the remote repository on GitHub. 12.5.3 Pulling changes from a remote repository If you are working on a project with collaborators, they will also be making changes to files (e.g., to the analysis code in a Jupyter notebook and the project’s README file), committing them to their own local repository, and pushing their commits to the remote GitHub repository to share them with you. When they push their changes, those changes will initially exist only in the remote GitHub repository and not in your local repository (Figure 12.6). Figure 12.6: Changes pushed by collaborators, or created directly on GitHub, will not be automatically sent to your local repository. To obtain the new changes from the remote repository on GitHub, you will need to pull those changes to your own local repository. By pulling changes, you synchronize your local repository to what is present on GitHub (Figure 12.7). Additionally, while the remote repository contains changes that you have not yet pulled, you will not be able to push any more changes yourself (though you will still be able to work and make commits in your own local repository).
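The commit, push, and pull steps described above can be sketched as a small toy model in Python. This is only an illustration of the workflow, not how Git is implemented: the `Repo` class, the `push` and `pull` functions, and the way commit identifiers are computed here are all assumptions made for this example.

```python
import hashlib


class Repo:
    '''Toy repository: a working directory, a staging area, and a history.'''

    def __init__(self):
        self.working = {}   # file name -> current contents
        self.staged = {}    # file name -> contents chosen for the next commit
        self.history = []   # list of (short id, message, snapshot) tuples

    def add(self, name):
        # Stage one file from the working directory.
        self.staged[name] = self.working[name]

    def commit(self, message):
        # The new snapshot is the previous snapshot plus the staged files.
        snapshot = dict(self.history[-1][2]) if self.history else {}
        snapshot.update(self.staged)
        parent = self.history[-1][0] if self.history else ''
        # Hypothetical id scheme for illustration; real Git hashes differ.
        payload = parent + message + repr(sorted(snapshot.items()))
        short_id = hashlib.sha1(payload.encode()).hexdigest()[:7]
        self.history.append((short_id, message, snapshot))
        self.staged = {}
        return short_id


def push(local, remote):
    # Refuse to push while the remote has commits the local copy lacks.
    if local.history[:len(remote.history)] != remote.history:
        raise RuntimeError('remote has new commits; pull first')
    remote.history = list(local.history)


def pull(local, remote):
    if local.history[:len(remote.history)] == remote.history:
        return  # nothing new on the remote
    if remote.history[:len(local.history)] != local.history:
        # Both sides have new commits; real Git would attempt a merge here.
        raise RuntimeError('histories diverged; a merge would be needed')
    local.history = list(remote.history)
    local.working.update(local.history[-1][2])
```

In this sketch, a commit only changes the local history, a push copies the local history to the remote, and a push is rejected whenever the remote holds commits that have not yet been pulled, which mirrors the rule described in the text.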
Figure 12.7: Pulling changes from the remote GitHub repository to synchronize your local repository. 12.6 Working with remote repositories using GitHub Now that you have been introduced to some of the key general concepts and workflows of Git version control, we will walk through the practical steps. There are several different ways to start using version control with a new project. For simplicity and ease of setup, we recommend creating a remote repository first. This section covers how to both create and edit a remote repository on GitHub. Once you have a remote repository set up, we recommend cloning (or copying) that repository to create a local repository in which you primarily work. You can clone the repository either on your own computer or in a workspace on a server (e.g., a JupyterHub server). Section 12.7 below will cover this second step in detail. 12.6.1 Creating a remote repository on GitHub Before you can create remote repositories on GitHub, you will need a GitHub account; you can sign up for a free account at https://github.com/. Once you have logged into your account, you can create a new repository to host your project by clicking on the “+” icon in the upper right-hand corner, and then on “New Repository,” as shown in Figure 12.8. Figure 12.8: New repositories on GitHub can be created by clicking on “New Repository” from the + menu. Repositories can be set up with a variety of configurations, including a name, optional description, and the inclusion (or not) of several template files. One of the most important configuration items to choose is the visibility to the outside world, either public or private. Public repositories can be viewed by anyone. Private repositories can be viewed by only you. Both public and private repositories are only editable by you, but you can change that by giving access to other collaborators. 
To get started with a public repository having a template README.md file, take the following steps shown in Figure 12.9: Enter the name of your project repository. In the example below, we use canadian_languages. Most repositories follow a similar naming convention involving only lowercase words separated by either underscores or hyphens. Choose an option for the privacy of your repository. Select “Add a README file.” This creates a template README.md file in your repository’s root folder. When you are happy with your repository name and configuration, click on the green “Create Repository” button. Figure 12.9: Repository configuration for a project that is public and initialized with a README.md template file. A newly created public repository with a README.md template file should look something like what is shown in Figure 12.10. Figure 12.10: Repository configuration for a project that is public and initialized with a README.md template file. 12.6.2 Editing files on GitHub with the pen tool The pen tool can be used to edit existing plain text files. When you click on the pen tool, the file will be opened in a text box where you can use your keyboard to make changes (Figures 12.11 and 12.12). Figure 12.11: Clicking on the pen tool opens a text box for editing plain text files. Figure 12.12: The text box where edits can be made after clicking on the pen tool. After you are done with your edits, they can be “saved” by committing your changes. When you commit a file in a repository, the version control system takes a snapshot of what the file looks like. As you continue working on the project, over time you will possibly make many commits to a single file; this generates a useful version history for that file. On GitHub, if you click the green “Commit changes” button, it will save the file and then make a commit (Figure 12.13). Recall from Section 12.5.1 that you normally have to add files to the staging area before committing them.
Why don’t we have to do that when we work directly on GitHub? Behind the scenes, when you click the green “Commit changes” button, GitHub is adding that one file to the staging area prior to committing it. But note that on GitHub you are limited to committing changes to only one file at a time. When you work in your own local repository, you can commit changes to multiple files simultaneously. This is especially useful when one “improvement” to the project involves modifying multiple files. You can also do things like run code when working in a local repository, which you cannot do on GitHub. In general, editing on GitHub is reserved for small edits to plain text files. Figure 12.13: Saving changes using the pen tool requires committing those changes, and an associated commit message. 12.6.3 Creating files on GitHub with the “Add file” menu The “Add file” menu can be used to create new plain text files and upload files from your computer. To create a new plain text file, click the “Add file” drop-down menu and select the “Create new file” option (Figure 12.14). Figure 12.14: New plain text files can be created directly on GitHub. A page will open with a small text box for the file name to be entered, and a larger text box where the desired file content text can be entered. Note the two tabs, “Edit new file” and “Preview”. Toggling between them lets you enter and edit text and view what the text will look like when rendered, respectively (Figure 12.15). Note that GitHub understands and renders .md files using a markdown syntax very similar to Jupyter notebooks, so the “Preview” tab is especially helpful for checking markdown code correctness. Figure 12.15: New plain text files require a file name in the text box circled in red, and file content entered in the larger text box (red arrow). Save and commit your changes by clicking the green “Commit changes” button at the bottom of the page (Figure 12.16). 
Figure 12.16: To be saved, newly created files are required to be committed along with an associated commit message. You can also upload files that you have created on your local machine by using the “Add file” drop-down menu and selecting “Upload files” (Figure 12.17). To select the files from your local computer to upload, you can either drag and drop them into the gray box area shown in Figure 12.18, or click the “choose your files” link to access a file browser dialog. Once the files you want to upload have been selected, click the green “Commit changes” button at the bottom of the page (Figure 12.18). Figure 12.17: New files of any type can be uploaded to GitHub. Figure 12.18: Specify files to upload by dragging them into the GitHub website (red circle) or by clicking on “choose your files.” Uploaded files are also required to be committed along with an associated commit message. Note that Git and GitHub are designed to track changes in individual files. Do not upload your whole project in an archive file (e.g., .zip). If you do, then Git can only keep track of changes to the entire .zip file, which will not be human-readable. Committing one big archive defeats the whole purpose of using version control: you won’t be able to see, interpret, or find changes in the history of any of the actual content of your project! 12.7 Working with local repositories using Jupyter Although there are several ways to create and edit files on GitHub, they are not quite powerful enough for efficiently creating and editing complex files, or files that need to be executed to assess whether they work (e.g., files containing code). For example, you wouldn’t be able to run an analysis written with R code directly on GitHub. Thus, it is useful to be able to connect the remote repository that was created on GitHub to a local coding environment. This can be done by creating and working in a local copy of the repository. 
In this chapter, we focus on interacting with Git via Jupyter using the Jupyter Git extension. The Jupyter Git extension can be run by Jupyter on your local computer, or on a JupyterHub server. We recommend reading Chapter 11 to learn how to use Jupyter before reading this chapter. 12.7.1 Generating a GitHub personal access token To send and retrieve work between your local repository and the remote repository on GitHub, you will frequently need to authenticate with GitHub to prove you have the required permission. There are several methods to do this, but for beginners we recommend using the HTTPS method because it is easier and requires less setup. In order to use the HTTPS method, GitHub requires you to provide a personal access token. A personal access token is like a password—so keep it a secret!—but it gives you more fine-grained control over what parts of your account the token can be used to access, and lets you set an expiry date for the authentication. To generate a personal access token, you must first visit https://github.com/settings/tokens, which will take you to the “Personal access tokens” page in your account settings. Once there, click “Generate new token” (Figure 12.19). Note that you may be asked to re-authenticate with your username and password to proceed. Figure 12.19: The “Generate new token” button used to initiate the creation of a new personal access token. It is found in the “Personal access tokens” section of the “Developer settings” page in your account settings. You will be asked to add a note to describe the purpose for your personal access token. Next, you need to select permissions for the token; this is where you can control what parts of your account the token can be used to access. Make sure to choose only those permissions that you absolutely require. In Figure 12.20, we tick only the “repo” box, which gives the token access to our repositories (so that we can push and pull) but none of our other GitHub account features. 
Finally, to generate the token, scroll to the bottom of that page and click the green “Generate token” button (Figure 12.20). Figure 12.20: Webpage for creating a new personal access token. Finally, you will be taken to a page where you will be able to see and copy the personal access token you just generated (Figure 12.21). Since it provides access to certain parts of your account, you should treat this token like a password; for example, you should consider securely storing it (and your other passwords and tokens, too!) using a password manager. Note that this page will only display the token to you once, so make sure you store it in a safe place right away. If you accidentally forget to store it, though, do not fret—you can delete that token by clicking the “Delete” button next to your token, and generate a new one from scratch. To learn more about GitHub authentication, see the additional resources section at the end of this chapter. Figure 12.21: Display of the newly generated personal access token. 12.7.2 Cloning a repository using Jupyter Cloning a remote repository from GitHub to create a local repository results in a copy that knows where it was obtained from so that it knows where to send/receive new committed edits. In order to do this, first copy the URL from the HTTPS tab of the Code drop-down menu on GitHub (Figure 12.22). Figure 12.22: The green “Code” drop-down menu contains the remote address (URL) corresponding to the location of the remote GitHub repository. Open Jupyter, and click the Git+ icon on the file browser tab (Figure 12.23). Figure 12.23: The Jupyter Git Clone icon (red circle). Paste the URL of the GitHub project repository you created and click the blue “CLONE” button (Figure 12.24). Figure 12.24: Prompt where the remote address (URL) corresponding to the location of the GitHub repository needs to be input in Jupyter. On the file browser tab, you will now see a folder for the repository. 
Inside this folder will be all the files that existed on GitHub (Figure 12.25). Figure 12.25: Cloned GitHub repositories can be seen and accessed via the Jupyter file browser. 12.7.3 Specifying files to commit Now that you have cloned the remote repository from GitHub to create a local repository, you can get to work editing, creating, and deleting files. For example, suppose you created and saved a new file (named eda.ipynb) that you would like to send back to the project repository on GitHub (Figure 12.26). To “add” this modified file to the staging area (i.e., flag that this is a file whose changes we would like to commit), click the Jupyter Git extension icon on the far left-hand side of Jupyter (Figure 12.26). Figure 12.26: Jupyter Git extension icon (circled in red). This opens the Jupyter Git graphical user interface pane. Next, click the plus sign (+) beside the file(s) that you want to “add” (Figure 12.27). Note that because this is the first change for this file, it falls under the “Untracked” heading. However, next time you edit this file and want to add the changes, you will find it under the “Changed” heading. You will also see an eda-checkpoint.ipynb file under the “Untracked” heading. This is a temporary “checkpoint file” created by Jupyter when you work on eda.ipynb. You generally do not want to add auto-generated files to Git repositories; only add the files you directly create and edit. Figure 12.27: eda.ipynb is added to the staging area via the plus sign (+). Clicking the plus sign (+) moves the file from the “Untracked” heading to the “Staged” heading, so that Git knows you want a snapshot of its current state as a commit (Figure 12.28). Now you are ready to “commit” the changes. Make sure to include a (clear and helpful!) message about what was changed so that your collaborators (and future you) know what happened in this commit. Figure 12.28: Adding eda.ipynb makes it visible in the staging area.
12.7.4 Making the commit To snapshot the changes with an associated commit message, you must put a message in the text box at the bottom of the Git pane and click on the blue “Commit” button (Figure 12.29). It is highly recommended to write useful and meaningful messages about what was changed. These commit messages, and the datetime stamp for a given commit, are the primary means to navigate through the project’s history in the event that you need to view or retrieve a past version of a file, or revert your project to an earlier state. When you click the “Commit” button for the first time, you will be prompted to enter your name and email. This only needs to be done once for each machine you use Git on. Figure 12.29: A commit message must be added into the Jupyter Git extension commit text box before the blue Commit button can be used to record the commit. After “committing” the file(s), you will see there are 0 “Staged” files. You are now ready to push your changes to the remote repository on GitHub (Figure 12.30). Figure 12.30: After recording a commit, the staging area should be empty. 12.7.5 Pushing the commits to GitHub To send the committed changes back to the remote repository on GitHub, you need to push them. To do this, click on the cloud icon with the up arrow on the Jupyter Git tab (Figure 12.31). Figure 12.31: The Jupyter Git extension “push” button (circled in red). You will then be prompted to enter your GitHub username and the personal access token that you generated earlier (not your account password!). Click the blue “OK” button to initiate the push (Figure 12.32). Figure 12.32: Enter your Git credentials to authorize the push to the remote repository. If the files were successfully pushed to the project repository on GitHub, you will be shown a success message (Figure 12.33). Click “Dismiss” to continue working in Jupyter. Figure 12.33: The prompt that the push was successful. 
If you visit the remote repository on GitHub, you will see that the changes now exist there too (Figure 12.34)! Figure 12.34: The GitHub web interface shows a preview of the commit message, and the time of the most recently pushed commit for each file. 12.8 Collaboration 12.8.1 Giving collaborators access to your project As mentioned earlier, GitHub allows you to control who has access to your project. The default for both public and private projects is that only the person who created the GitHub repository has permission to create, edit, and delete files (write access). To give your collaborators write access to the project, navigate to the “Settings” tab (Figure 12.35). Figure 12.35: The “Settings” tab on the GitHub web interface. Then click “Manage access” (Figure 12.36). Figure 12.36: The “Manage access” tab on the GitHub web interface. Then click the green “Invite a collaborator” button (Figure 12.37). Figure 12.37: The “Invite a collaborator” button on the GitHub web interface. Type in the collaborator’s GitHub username or email, and select their name when it appears (Figure 12.38). Figure 12.38: The text box where a collaborator’s GitHub username or email can be entered. Finally, click the green “Add collaborator to this repository” button (Figure 12.39). Figure 12.39: The confirmation button for adding a collaborator to a repository on the GitHub web interface. After this, you should see your newly added collaborator listed under the “Manage access” tab. They should receive an email invitation to join the GitHub repository as a collaborator. They need to accept this invitation to enable write access. 12.8.2 Pulling changes from GitHub using Jupyter We will now walk through how to use the Jupyter Git extension tool to pull changes to our eda.ipynb analysis file that were made by a collaborator (Figure 12.40).
Figure 12.40: The GitHub interface indicates the name of the last person to push a commit to the remote repository, a preview of the associated commit message, the unique commit identifier, and how long ago the commit was snapshotted. You can tell Git to “pull” by clicking on the cloud icon with the down arrow in Jupyter (Figure 12.41). Figure 12.41: The Jupyter Git extension pull button. Once the files are successfully pulled from GitHub, you need to click “Dismiss” to keep working (Figure 12.42). Figure 12.42: The prompt after changes have been successfully pulled from a remote repository. And then when you open (or refresh) the files whose changes you just pulled, you should be able to see them (Figure 12.43). Figure 12.43: Changes made by the collaborator to eda.ipynb (code highlighted by red arrows). It can be very useful to review the history of the changes to your project. You can do this directly in Jupyter by clicking “History” in the Git tab (Figure 12.44). Figure 12.44: Version control repository history viewed using the Jupyter Git extension. It is good practice to pull any changes at the start of every work session before you start working on your local copy. If you do not do this, and your collaborators have pushed some changes to the project to GitHub, then you will be unable to push your changes to GitHub until you pull. This situation can be recognized by the error message shown in Figure 12.45. Figure 12.45: Error message that indicates that there are changes on the remote repository that you do not have locally. Usually, getting out of this situation is not too troublesome. First you need to pull the changes that exist on GitHub that you do not yet have in the local repository. Usually when this happens, Git can automatically merge the changes for you, even if you and your collaborators were working on different parts of the same file!
If, however, you and your collaborators made changes to the same line of the same file, Git will not be able to automatically merge the changes: it will not know whether to keep your version of the line(s), your collaborators’ version of the line(s), or some blend of the two. When this happens, Git will tell you that you have a merge conflict in certain file(s) (Figure 12.46). Figure 12.46: Error message that indicates you and your collaborators made changes to the same line of the same file and that Git will not be able to automatically merge the changes. 12.8.3 Handling merge conflicts To fix the merge conflict, you need to open the offending file in a plain text editor and look for special marks that Git puts in the file to tell you where the merge conflict occurred (Figure 12.47). Figure 12.47: How to open a Jupyter notebook in a plain text view in Jupyter. The beginning of the merge conflict is marked by <<<<<<< HEAD and the end of the merge conflict is marked by >>>>>>>. Between these markings, Git also inserts a separator (=======). The version of the change before the separator is your change, and the version that follows the separator is the change that existed on GitHub. In Figure 12.48, you can see that in your local repository there is a line of code that calls scale_color_manual with three color values (deeppink2, cyan4, and purple1). It looks like your collaborator made an edit to that line too, except with different colors (blue3, red3, and black)! Figure 12.48: Merge conflict identifiers (highlighted in red). Once you have decided which version of the change (or what combination!) to keep, you need to use the plain text editor to remove the special marks that Git added (Figure 12.49). Figure 12.49: File where a merge conflict has been resolved. The file must be saved, added to the staging area, and then committed before you will be able to push your changes to GitHub.
12.8.4 Communicating using GitHub issues When working on a project in a team, you don’t just want a historical record of who changed what file and when in the project—you also want a record of decisions that were made, ideas that were floated, problems that were identified and addressed, and all other communication surrounding the project. Email and messaging apps are both very popular for general communication, but are not designed for project-specific communication: they both generally do not have facilities for organizing conversations by project subtopics, searching for conversations related to particular bugs or software versions, etc. GitHub issues are an alternative written communication medium to email and messaging apps, and were designed specifically to facilitate project-specific communication. Issues are opened from the “Issues” tab on the project’s GitHub page, and they persist there even after the conversation is over and the issue is closed (in contrast to email, issues are not usually deleted). One issue thread is usually created per topic, and they are easily searchable using GitHub’s search tools. All issues are accessible to all project collaborators, so no one is left out of the conversation. Finally, issues can be set up so that team members get email notifications when a new issue is created or a new post is made in an issue thread. Replying to issues from email is also possible. Given all of these advantages, we highly recommend the use of issues for project-related communication. To open a GitHub issue, first click on the “Issues” tab (Figure 12.50). Figure 12.50: The “Issues” tab on the GitHub web interface. Next click the “New issue” button (Figure 12.51). Figure 12.51: The “New issue” button on the GitHub web interface. Add an issue title (which acts like an email subject line), and then put the body of the message in the larger text box. Finally, click “Submit new issue” to post the issue to share with others (Figure 12.52). 
Figure 12.52: Dialog boxes and submission button for creating new GitHub issues. You can reply to an issue that someone opened by adding your written response to the large text box and clicking “Comment” (Figure 12.53). Figure 12.53: Dialog box for replying to GitHub issues. When a conversation is resolved, you can click “Close issue”. The closed issue can later be viewed by clicking the “Closed” header link in the “Issues” tab (Figure 12.54). Figure 12.54: The “Closed” issues tab on the GitHub web interface. 12.9 Exercises Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Collaboration with version control” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter 13. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 12.10 Additional resources Now that you’ve picked up the basics of version control with Git and GitHub, you can expand your knowledge through the resources listed below: GitHub’s guides website and Happy Git and GitHub for the useR are great resources for learning more about Git and GitHub. Good enough practices in scientific computing (G. Wilson et al. 2017) provides more advice on useful workflows and “good enough” practices in data analysis projects. In addition to GitHub, there are other popular Git repository hosting services such as GitLab and BitBucket. Comparing all of these options is beyond the scope of this book, and until you become a more advanced user, you are perfectly fine to just stick with GitHub. Just be aware that you have options!
GitHub’s documentation on creating a personal access token and the Happy Git and GitHub for the useR personal access tokens chapter are both excellent additional resources to consult if you need additional help generating and using personal access tokens. References "],["setup.html", "Chapter 13 Setting up your computer 13.1 Overview 13.2 Chapter learning objectives 13.3 Obtaining the worksheets for this book 13.4 Working with Docker 13.5 Working with JupyterLab Desktop", " Chapter 13 Setting up your computer 13.1 Overview In this chapter, you’ll learn how to set up the software needed to follow along with this book on your own computer. Given that installation instructions can vary based on computer setup, we provide instructions for multiple operating systems (Ubuntu Linux, MacOS, and Windows). Although the instructions in this chapter will likely work on many systems, we have specifically verified that they work on a computer that: runs Windows 10 Home, MacOS 13 Ventura, or Ubuntu 22.04, uses a 64-bit CPU, has a connection to the internet, uses English as the default language. 13.2 Chapter learning objectives By the end of the chapter, readers will be able to do the following: Download the worksheets that accompany this book. Install the Docker virtualization engine. Edit and run the worksheets using JupyterLab running inside a Docker container. Install Git, JupyterLab Desktop, and R packages. Edit and run the worksheets using JupyterLab Desktop. 13.3 Obtaining the worksheets for this book The worksheets containing exercises for this book are online at https://worksheets.datasciencebook.ca. The worksheets can be launched directly from that page using the Binder links in the rightmost column of the table. This is the easiest way to access the worksheets, but note that you will not be able to save your work and return to it again later. In order to save your progress, you will need to download the worksheets to your own computer and work on them locally. 
You can download the worksheets as a compressed zip file using the link at the top of the page. Once you unzip the downloaded file, you will have a folder containing all of the Jupyter notebook worksheets accompanying this book. See Chapter 11 for instructions on working with Jupyter notebooks. 13.4 Working with Docker Once you have downloaded the worksheets, you will next need to install and run the software required to work on Jupyter notebooks on your own computer. Doing this setup manually can be quite tricky, as it involves quite a few different software packages, not to mention getting the right versions of everything; the worksheets and autograder tests may not work unless all the versions are exactly right! To keep things simple, we instead recommend that you install Docker. Docker lets you run your Jupyter notebooks inside a pre-built container that comes with precisely the right versions of all software packages needed to run the worksheets that come with this book. Note: A container is a virtual user space within your computer. Within the container, you can run software in isolation without interfering with the other software that already exists on your machine. In this book, we use a container to run a specific version of the R programming language, as well as other necessary packages. The container ensures that the worksheets function correctly, even if you have a different version of R installed on your computer, or even if you haven’t installed R at all! 13.4.1 Windows Installation To install Docker on Windows, visit the online Docker documentation, and download the Docker Desktop Installer.exe file. Double-click the file to open the installer and follow the instructions on the installation wizard, choosing WSL-2 instead of Hyper-V when prompted. Note: Occasionally, when you first run Docker on Windows, you will encounter an error message.
Some common errors you may see: If you need to update WSL, you can enter cmd.exe in the Start menu to run the command line. Type wsl --update to update WSL. If the admin account on your computer is different to your user account, you must add the user to the “docker-users” group. Run Computer Management as an administrator and navigate to Local Users and Groups -> Groups -> docker-users. Right-click to add the user to the group. Log out and log back in for the changes to take effect. If you need to enable virtualization, you will need to edit your BIOS. Restart your computer, and enter the BIOS using the hotkey (usually Delete, Esc, and/or one of the F# keys). Look for an “Advanced” menu, and under your CPU settings, set the “Virtualization” option to “enabled”. Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. Running JupyterLab Run Docker Desktop. Once it is running, you need to download and run the Docker image that we have made available for the worksheets (an image is like a “snapshot” of a computer with all the right packages pre-installed). You only need to do this step one time; the image will remain the next time you run Docker Desktop. In the Docker Desktop search bar, enter ubcdsci/r-dsci-100, as this is the name of the image. You will see the ubcdsci/r-dsci-100 image in the list (Figure 13.1), and “latest” in the Tag drop down menu. We need to change “latest” to the right image version before proceeding. To find the right tag, open the Dockerfile in the worksheets repository, and look for the line FROM ubcdsci/r-dsci-100: followed by the tag consisting of a sequence of numbers and letters. Back in Docker Desktop, in the “Tag” drop down menu, click that tag to select the correct image version. Then click the “Pull” button to download the image. 
Figure 13.1: The Docker Desktop search window. Make sure to click the Tag drop down menu and find the right version of the image before clicking the Pull button to download it. Once the image is done downloading, click the “Images” button on the left side of the Docker Desktop window (Figure 13.2). You will see the recently downloaded image listed there under the “Local” tab. Figure 13.2: The Docker Desktop images tab. To start up a container using that image, click the play button beside the image. This will open the run configuration menu (Figure 13.3). Expand the “Optional settings” drop down menu. In the “Host port” textbox, enter 8888. In the “Volumes” section, click the “Host path” box and navigate to the folder where your Jupyter worksheets are stored. In the “Container path” text box, enter /home/jovyan/work. Then click the “Run” button to start the container. Figure 13.3: The Docker Desktop container run configuration menu. After clicking the “Run” button, you will see a terminal. The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the URL in the terminal that starts with http://127.0.0.1:8888 (highlighted by the red box in Figure 13.4), and paste it into your browser to start JupyterLab. Figure 13.4: The terminal text after running the Docker container. The red box indicates the URL that you should paste into your browser to open JupyterLab. When you are done working, make sure to shut down and remove the container by clicking the red trash can symbol (in the top right corner of Figure 13.4). You will not be able to start the container again until you do so. More information on installing and running Docker on Windows, as well as troubleshooting tips, can be found in the online Docker documentation. 13.4.2 MacOS Installation To install Docker on MacOS, visit the online Docker documentation, and download the Docker.dmg installation file that is appropriate for your computer. 
To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the Apple support page has information to help you determine which processor you have. Once downloaded, double-click the file to open the installer, then drag the Docker icon to the Applications folder. Double-click the icon in the Applications folder to start Docker. In the installation window, use the recommended settings. Running JupyterLab Run Docker Desktop. Once it is running, follow the instructions above in the Windows section on Running JupyterLab (the user interface is the same). More information on installing and running Docker on MacOS, as well as troubleshooting tips, can be found in the online Docker documentation. 13.4.3 Ubuntu Installation To install Docker on Ubuntu, open the terminal and enter the following five commands.
sudo apt update
sudo apt install ca-certificates curl gnupg
curl -fsSL https://get.docker.com -o get-docker.sh
sudo chmod u+x get-docker.sh
sudo sh get-docker.sh
Running JupyterLab First, open the Dockerfile in the worksheets repository, and look for the line FROM ubcdsci/r-dsci-100: followed by a tag consisting of a sequence of numbers and letters. Then in the terminal, navigate to the directory where you want to run JupyterLab, and run the following command, replacing TAG with the tag you found earlier.
docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/r-dsci-100:TAG jupyter lab
The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the URL in your terminal that starts with http://127.0.0.1:8888 (highlighted by the red box in Figure 13.5), and paste it into your browser to start JupyterLab. More information on installing and running Docker on Ubuntu, as well as troubleshooting tips, can be found in the online Docker documentation.
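The tag lookup described above can also be done from the terminal. The sketch below is self-contained, so it first writes a stand-in Dockerfile in a folder named docker_demo; both the folder name and the tag value abc1234 are made-up placeholders, not a real image version, and in practice you would run the grep line on the Dockerfile from the worksheets repository. To stay safe to run anywhere, the last line only prints the docker run command rather than executing it.

```shell
# Stand-in Dockerfile for illustration; abc1234 is a hypothetical tag.
mkdir -p docker_demo
echo 'FROM ubcdsci/r-dsci-100:abc1234' > docker_demo/Dockerfile

# Find the FROM line and keep the text after the colon, which is the tag.
TAG=$(grep '^FROM ubcdsci/r-dsci-100:' docker_demo/Dockerfile | cut -d: -f2)
echo $TAG

# Print (not run) the command from this section with the tag filled in:
#   --rm  removes the container when it exits
#   -v    makes the current directory available at /home/jovyan/work
#   -p    maps port 8888 in the container to port 8888 on your computer
echo docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/r-dsci-100:$TAG jupyter lab
```

Removing the leading echo from the last line would start the container for real, which requires Docker to be installed and the image tag to exist.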
Figure 13.5: The terminal text after running the Docker container in Ubuntu. The red box indicates the URL that you should paste into your browser to open JupyterLab. 13.5 Working with JupyterLab Desktop You can also run the worksheets accompanying this book on your computer using JupyterLab Desktop. The advantage of JupyterLab Desktop over Docker is that it can be easier to install; Docker can sometimes run into some fairly technical issues (especially on Windows computers) that require expert troubleshooting. The downside of JupyterLab Desktop is that there is a (very) small chance that you may not end up with the right versions of all the R packages needed for the worksheets. Docker, on the other hand, guarantees that the worksheets will work exactly as intended. In this section, we will cover how to install JupyterLab Desktop, Git and the JupyterLab Git extension (for version control, as discussed in Chapter 12), and all of the R packages needed to run the code in this book. 13.5.1 Windows Installation First, we will install Git for version control. Go to the Git download page and download the Windows version of Git. Once the download has finished, run the installer and accept the default configuration for all pages. Next, visit the “Installation” section of the JupyterLab Desktop homepage. Download the JupyterLab-Setup-Windows.exe installer file for Windows. Double-click the installer to run it, and use the default settings. Run JupyterLab Desktop by clicking the icon on your desktop. Configuring JupyterLab Desktop Next, in the JupyterLab Desktop graphical interface that appears (Figure 13.6), you will see text at the bottom saying “Python environment not found”. Click “Install using the bundled installer” to set up the environment. Figure 13.6: The JupyterLab Desktop graphical user interface.
Next, we need to add the JupyterLab Git extension (so that we can use version control directly from within JupyterLab Desktop), the IRkernel (to enable the R programming language), and various R software packages. Click “New session…” in the JupyterLab Desktop user interface, then scroll to the bottom, and click “Terminal” under the “Other” heading (Figure 13.7). Figure 13.7: A JupyterLab Desktop session, showing the Terminal option at the bottom. In this terminal, run the following commands:
pip install --upgrade jupyterlab-git
conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/environment.yml
The second command installs the specific R and package versions specified in the environment.yml file found in the worksheets repository. We will always keep the versions in the environment.yml file updated so that they are compatible with the exercise worksheets that accompany the book. Once all of the software installation is complete, it is a good idea to restart JupyterLab Desktop entirely before you proceed to doing your data analysis. This will ensure all the software and settings you put in place are correctly set up and ready for use. 13.5.2 MacOS Installation First, we will install Git for version control. Open the terminal (how-to video) and type the following command:
xcode-select --install
Next, visit the “Installation” section of the JupyterLab Desktop homepage. Download the JupyterLab-Setup-MacOS-x64.dmg or JupyterLab-Setup-MacOS-arm64.dmg installer file. To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the Apple support page has information to help you determine which processor you have. Once downloaded, double-click the file to open the installer, then drag the JupyterLab Desktop icon to the Applications folder.
Double-click the icon in the Applications folder to start JupyterLab Desktop. Configuring JupyterLab Desktop From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on Configuring JupyterLab Desktop to set up the environment, install the JupyterLab Git extension, and install the various R software packages needed for the worksheets. 13.5.3 Ubuntu Installation First, we will install Git for version control. Open the terminal and type the following commands:
sudo apt update
sudo apt install git
Next, visit the “Installation” section of the JupyterLab Desktop homepage. Download the JupyterLab-Setup-Debian.deb installer file for Ubuntu/Debian. Open a terminal, navigate to where the installer file was downloaded, and run the command
sudo dpkg -i JupyterLab-Setup-Debian.deb
Run JupyterLab Desktop using the command
jlab
Configuring JupyterLab Desktop From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on Configuring JupyterLab Desktop to set up the environment, install the JupyterLab Git extension, and install the various R software packages needed for the worksheets. "],["references.html", "References", " References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]