Rewrote tidy-data-a-recipe-for-efficient-data-analysis without code

tidy-intelligence · Nov 25, 2023
commit f7ba9ec · 1 parent 3ca9911

Showing 6 changed files with 521 additions and 231 deletions.
diff --git a/_freeze/posts/tidy-data-a-recipe-for-efficient-data-analysis/index/execute-results/html.json b/_freeze/posts/tidy-data-a-recipe-for-efficient-data-analysis/index/execute-results/html.json
@@ -1,8 +1,8 @@
 {
-  "hash": "4c62362bc0bfd7f3f9def2c3cf1de1e4",
+  "hash": "565760e0f3b63c95b458bdd01ef18118",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: \"Tidy Data: A Recipe for Efficient Data Analysis\"\ndescription: \"On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen\"\nauthor: \"Christoph Scheuch\"\ndate: \"2023-11-24\" \nimage: thumbnail.png\n---\n\n\nImagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. \n\nTidy data are like a well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row**  [@Wickham2014].\n\nTidying data is about structuring datasets to facilitate analysis or report generation. 
By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.\n\n## Example for tidy data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ningredients <- tibble(\n  type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n  quantity = c(500, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n  unit = c(\"grams\", \"grams\", \"grams\", \"units\", \"liters\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n\nspices <- tibble(\n  type = c(\"Paprika\", \"Turmeric\", \"Cumin\", \"Coriander\", \"Cinnamon\", \"Chili Powder\", \"Oregano\", \"Thyme\", \"Saffron\", \"Nutmeg\"),\n  quantity = c(50, 40, 30, 25, 20, 15, 10, 8, 5, 12),\n  unit = c(\"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\")\n)\n\ndairies <- tibble(\n  type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n  quantity = c(1, 200, 150, 100, 0.5, 250, 150, 100, 0.3, 500),\n  unit = c(\"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\")\n)\n```\n:::\n\n\n## When colum headers are values, not variable names\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n  type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n  liters = c(1, NA, NA, NA, 0.5, NA, NA, NA, 0.3, NA),\n  grams = c(NA, 200, 150, 100, NA, 250, 150, 100, NA, 500)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 3\n   type           liters grams\n   <chr>           <dbl> <dbl>\n 1 Milk             
 1      NA\n 2 Butter           NA     200\n 3 Yogurt           NA     150\n 4 Cheese           NA     100\n 5 Cream             0.5    NA\n 6 Cottage Cheese   NA     250\n 7 Sour Cream       NA     150\n 8 Ghee             NA     100\n 9 Whipping Cream    0.3    NA\n10 Ice Cream        NA     500\n```\n\n\n:::\n:::\n\n\n## When multiple variables are stored in one column\n\nThe `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n  type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n  quantity_and_unit = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\", \"10 grams\", \"0.2 liters\", \"300 grams\", \"400 grams\", \"250 grams\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 2\n   type      quantity_and_unit\n   <chr>     <chr>            \n 1 Flour     500 grams        \n 2 Sugar     200 grams        \n 3 Butter    100 grams        \n 4 Eggs      4 units          \n 5 Milk      1 liter          \n 6 Salt      10 grams         \n 7 Olive Oil 0.2 liters       \n 8 Tomatoes  300 grams        \n 9 Chicken   400 grams        \n10 Rice      250 grams        \n```\n\n\n:::\n:::\n\n\n## When variables are stored in both rows and columns\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. 
This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n  ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n  recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n  recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n  ingredient recipe1_quantity recipe2_quantity\n  <chr>      <chr>            <chr>           \n1 Flour      500 grams        300 grams       \n2 Sugar      200 grams        150 grams       \n3 Butter     100 grams        50 grams        \n4 Eggs       4 units          3               \n5 Milk       1 liter          0.5 liters      \n```\n\n\n:::\n:::\n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity.\n\n## When there are multiple types of data in the same column\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. 
There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n  type = c(\"Flour\", \"Butter\", \"Whisk\", \"Sugar\", \"Baking Time\"),\n  quantity = c(\"500 grams\", \"100 grams\", \"1\", \"200 grams\", \"30 minutes\"),\n  category = c(\"Ingredient\", \"Ingredient\", \"Utensil\", \"Ingredient\", \"Time\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n  type        quantity   category  \n  <chr>       <chr>      <chr>     \n1 Flour       500 grams  Ingredient\n2 Butter      100 grams  Ingredient\n3 Whisk       1          Utensil   \n4 Sugar       200 grams  Ingredient\n5 Baking Time 30 minutes Time      \n```\n\n\n:::\n:::\n\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.\n\n## When some data is missing\n\nKey points:\n\n- Huge difference between NA and 0 (or any other value)\n- Are you sure that you don't have the ingredient or do you just don't know?\n- Missing are dropped in filters \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n  type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", NA),\n  quantity = c(NA, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n  unit = c(\"grams\", \"grams\", \"grams\", \"units\", NA, \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 3\n   type      quantity unit  \n   <chr>        <dbl> <chr> \n 1 Flour         NA   grams \n 2 Sugar        200   grams \n 3 Butter       100   grams \n 4 Eggs           4   units \n 5 Milk           1   <NA>  \n 6 Salt          10   grams \n 7 Olive Oil      0.2 liters\n 8 Tomatoes     300   grams \n 9 Chicken      400   grams \n10 <NA>         250   grams 
\n```\n\n\n:::\n:::",
+    "markdown": "---\ntitle: \"Tidy Data: A Recipe for Efficient Data Analysis\"\ndescription: \"On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen\"\nauthor: \"Christoph Scheuch\"\ndate: \"2023-11-24\" \nimage: thumbnail.png\n---\n\n\nImagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. \n\nTidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014].\n\nTidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.\n\n## Example for tidy data\n\nTo illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called `ingredient` that contains information about all the ingredients that we currently have in our kitchen. 
It might look as follows:\n\n| name      | quantity | unit   | category  |\n|-----------|----------|--------|-----------|\n| flour     | 500      | grams  | baking    |\n| sugar     | 200      | grams  | baking    |\n| butter    | 100      | grams  | dairy     |\n| eggs      | 4        | units  | dairy     |\n| milk      | 1        | liters | dairy     |\n| salt      | 10       | grams  | seasoning |\n| olive oil | 0.2      | liters | oil       |\n| tomatoes  | 300      | grams  | vegetable |\n| chicken   | 400      | grams  | meat      |\n| rice      | 250      | grams  | grain     |\n\nEach row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column `quantity` contains information about how much of the ingredient called `name` we currently have and which `unit` we use to measure it. \n\nSimilarly, we could have a table just for `dairy` that might look as follows:\n\n| name           | quantity | unit   |\n|----------------|----------|--------|\n| milk           | 1        | liters |\n| butter         | 200      | grams  |\n| yogurt         | 150      | grams  |\n| cheese         | 100      | grams  |\n| cream          | 0.5      | liters |\n| cottage cheese | 250      | grams  |\n| sour cream     | 150      | grams  |\n| ghee           | 100      | grams  |\n| whipping cream | 0.3      | liters |\n| ice cream      | 500      | grams  |\n\nNotice that there is no `category` column in this table? It would actually be redundant to have this column because all rows in the `dairy`` table have the same category.\n\n## When colum headers are values, not variable names\n\nNow let us move to data structures that are untidy. 
Consider the following variant of our `dairy` table:\n\n| type           | liters | grams |\n|----------------|--------|-------|\n| milk           | 1      |       |\n| butter         |        | 200   |\n| yogurt         |        | 150   |\n| cheese         |        | 100   |\n| cream          | 0.5    |       |\n| cottage cheese |        | 250   |\n| sour cream     |        | 150   |\n| ghee           |        | 100   |\n| whipping cream | 0.3    |       |\n| ice cream      |        | 500   |\n\nWhat is the issue here? Each row still refers to a specific dairy product. However, instead of  dedicated `quantity` and `unit` columns, we have a `liters` and `grams` column. Since the units differ across dairy products, the table even contains missing values in the form of emtpy cells. So if you want to find out how much of ice cream you still have, you need to also check out the column name.  In practice, we would create dedicated `quantity` and `unit` columns. we might even decide to have the same unit for all ingredients (e.g., measure everything in grams) and just keep a `quantity` column.\n\n## When multiple variables are stored in one column\n\nLet us consider the following untidy version of our `ingredient` table. \n\n| type      | quantity_and_unit |\n|-----------|-------------------|\n| flour     | 500 grams         |\n| sugar     | 200 grams         |\n| butter    | 100 grams         |\n| eggs      | 4 units           |\n| milk      | 1 liter           |\n| salt      | 10 grams          |\n| olive oil | 0.2 liters        |\n| tomatoes  | 300 grams         |\n| chicken   | 400 grams         |\n| rice      | 250 grams         |\n\nThis one is really annoying, since the `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? This format actually makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. 
So in practice, we would actually start our data analysis by splitting out the `quantity_and_unit` column into `quantity` and `unit`.\n\n## When variables are stored in both rows and columns\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr     1.1.2     ✔ readr     2.1.4\n✔ forcats   1.0.0     ✔ stringr   1.5.0\n✔ ggplot2   3.4.2     ✔ tibble    3.2.1\n✔ lubridate 1.9.2     ✔ tidyr     1.3.0\n✔ purrr     1.0.1     \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag()    masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code}\ntibble(\n  ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n  recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n  recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n  ingredient recipe1_quantity recipe2_quantity\n  <chr>      <chr>            <chr>           \n1 Flour      500 grams        300 grams       \n2 Sugar      200 grams        150 grams       \n3 Butter     100 grams        50 grams        \n4 Eggs       4 units          3               \n5 Milk       1 liter          0.5 liters      \n```\n\n\n:::\n:::\n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each 
quantity.\n\n## When there are multiple types of data in the same column\n\n\n\n| type         | quantity    | category   |\n|--------------|-------------|------------|\n| flour        | 500 grams   | ingredient |\n| butter       | 100 grams   | ingredient |\n| whisk        | 1           | utensil    |\n| sugar        | 200 grams   | ingredient |\n| baking time  | 30 minutes  | time       |\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.\n\n## When some data is missing\n\nKey points:\n\n- Huge difference between NA and 0 (or any other value)\n- Are you sure that you don't have the ingredient or do you just don't know?\n- Missing are dropped in filters \n\n| type      | quantity | unit   |\n|-----------|----------|--------|\n| flour     |          | grams  |\n| sugar     | 200      | grams  |\n| butter    | 100      | grams  |\n| eggs      | 4        | units  |\n| milk      | 1        |        |\n| salt      | 10       | grams  |\n| olive oil | 0.2      | liters |\n| tomatoes  | 300      | grams  |\n| chicken   | 400      | grams  |\n|           | 250      | grams  |\n",
     "supporting": [],
     "filters": [
       "rmarkdown/pagebreak.lua"

diff --git a/docs/index.html b/docs/index.html
@@ -137,7 +137,7 @@
 
 <div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
 <div class="list grid quarto-listing-cols-3">
-<div class="g-col-1" data-index="0" data-listing-date-sort="1700780400000" data-listing-file-modified-sort="1700843651647" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="746">
+<div class="g-col-1" data-index="0" data-listing-date-sort="1700780400000" data-listing-file-modified-sort="1700916735155" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="12" data-listing-word-count-sort="2248">
 <a href="./posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html" class="quarto-grid-link">
 <div class="quarto-grid-item card h-100 card-left">
 <p class="card-img-top">