Commit
Rewrote tidy-data-a-recipe-for-efficient-data-analysis without code
christophscheuch committed Nov 25, 2023
1 parent 3ca9911 commit f7ba9ec
Showing 6 changed files with 521 additions and 231 deletions.
@@ -1,8 +1,8 @@
{
"hash": "4c62362bc0bfd7f3f9def2c3cf1de1e4",
"hash": "565760e0f3b63c95b458bdd01ef18118",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Tidy Data: A Recipe for Efficient Data Analysis\"\ndescription: \"On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen\"\nauthor: \"Christoph Scheuch\"\ndate: \"2023-11-24\" \nimage: thumbnail.png\n---\n\n\nImagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. \n\nTidy data are like a well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014].\n\nTidying data is about structuring datasets to facilitate analysis or report generation. 
By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.\n\n## Example for tidy data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ningredients <- tibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n quantity = c(500, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n unit = c(\"grams\", \"grams\", \"grams\", \"units\", \"liters\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n\nspices <- tibble(\n type = c(\"Paprika\", \"Turmeric\", \"Cumin\", \"Coriander\", \"Cinnamon\", \"Chili Powder\", \"Oregano\", \"Thyme\", \"Saffron\", \"Nutmeg\"),\n quantity = c(50, 40, 30, 25, 20, 15, 10, 8, 5, 12),\n unit = c(\"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\")\n)\n\ndairies <- tibble(\n type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n quantity = c(1, 200, 150, 100, 0.5, 250, 150, 100, 0.3, 500),\n unit = c(\"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\")\n)\n```\n:::\n\n\n## When colum headers are values, not variable names\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n liters = c(1, NA, NA, NA, 0.5, NA, NA, NA, 0.3, NA),\n grams = c(NA, 200, 150, 100, NA, 250, 150, 100, NA, 500)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 3\n type liters grams\n <chr> <dbl> <dbl>\n 1 Milk 1 NA\n 2 Butter NA 200\n 3 Yogurt NA 150\n 4 
Cheese NA 100\n 5 Cream 0.5 NA\n 6 Cottage Cheese NA 250\n 7 Sour Cream NA 150\n 8 Ghee NA 100\n 9 Whipping Cream 0.3 NA\n10 Ice Cream NA 500\n```\n\n\n:::\n:::\n\n\n## When multiple variables are stored in one column\n\nThe `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n quantity_and_unit = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\", \"10 grams\", \"0.2 liters\", \"300 grams\", \"400 grams\", \"250 grams\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 2\n type quantity_and_unit\n <chr> <chr> \n 1 Flour 500 grams \n 2 Sugar 200 grams \n 3 Butter 100 grams \n 4 Eggs 4 units \n 5 Milk 1 liter \n 6 Salt 10 grams \n 7 Olive Oil 0.2 liters \n 8 Tomatoes 300 grams \n 9 Chicken 400 grams \n10 Rice 250 grams \n```\n\n\n:::\n:::\n\n\n## When variables are stored in both rows and columns\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. 
This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n ingredient recipe1_quantity recipe2_quantity\n <chr> <chr> <chr> \n1 Flour 500 grams 300 grams \n2 Sugar 200 grams 150 grams \n3 Butter 100 grams 50 grams \n4 Eggs 4 units 3 \n5 Milk 1 liter 0.5 liters \n```\n\n\n:::\n:::\n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity.\n\n## When there are multiple types of data in the same column\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. 
There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Flour\", \"Butter\", \"Whisk\", \"Sugar\", \"Baking Time\"),\n quantity = c(\"500 grams\", \"100 grams\", \"1\", \"200 grams\", \"30 minutes\"),\n category = c(\"Ingredient\", \"Ingredient\", \"Utensil\", \"Ingredient\", \"Time\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n type quantity category \n <chr> <chr> <chr> \n1 Flour 500 grams Ingredient\n2 Butter 100 grams Ingredient\n3 Whisk 1 Utensil \n4 Sugar 200 grams Ingredient\n5 Baking Time 30 minutes Time \n```\n\n\n:::\n:::\n\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.\n\n## When some data is missing\n\nKey points:\n\n- Huge difference between NA and 0 (or any other value)\n- Are you sure that you don't have the ingredient or do you just don't know?\n- Missing are dropped in filters \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", NA),\n quantity = c(NA, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n unit = c(\"grams\", \"grams\", \"grams\", \"units\", NA, \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 3\n type quantity unit \n <chr> <dbl> <chr> \n 1 Flour NA grams \n 2 Sugar 200 grams \n 3 Butter 100 grams \n 4 Eggs 4 units \n 5 Milk 1 <NA> \n 6 Salt 10 grams \n 7 Olive Oil 0.2 liters\n 8 Tomatoes 300 grams \n 9 Chicken 400 grams \n10 <NA> 250 grams \n```\n\n\n:::\n:::",
"markdown": "---\ntitle: \"Tidy Data: A Recipe for Efficient Data Analysis\"\ndescription: \"On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen\"\nauthor: \"Christoph Scheuch\"\ndate: \"2023-11-24\" \nimage: thumbnail.png\n---\n\n\nImagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. \n\nTidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014].\n\nTidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.\n\n## Example for tidy data\n\nTo illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called `ingredient` that contains information about all the ingredients that we currently have in our kitchen. 
It might look as follows:\n\n| name | quantity | unit | category |\n|-----------|----------|--------|-----------|\n| flour | 500 | grams | baking |\n| sugar | 200 | grams | baking |\n| butter | 100 | grams | dairy |\n| eggs | 4 | units | dairy |\n| milk | 1 | liters | dairy |\n| salt | 10 | grams | seasoning |\n| olive oil | 0.2 | liters | oil |\n| tomatoes | 300 | grams | vegetable |\n| chicken | 400 | grams | meat |\n| rice | 250 | grams | grain |\n\nEach row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column `quantity` contains information about how much of the ingredient called `name` we currently have and which `unit` we use to measure it. \n\nSimilarly, we could have a table just for `dairy` that might look as follows:\n\n| name | quantity | unit |\n|----------------|----------|--------|\n| milk | 1 | liters |\n| butter | 200 | grams |\n| yogurt | 150 | grams |\n| cheese | 100 | grams |\n| cream | 0.5 | liters |\n| cottage cheese | 250 | grams |\n| sour cream | 150 | grams |\n| ghee | 100 | grams |\n| whipping cream | 0.3 | liters |\n| ice cream | 500 | grams |\n\nNotice that there is no `category` column in this table? It would actually be redundant to have this column because all rows in the `dairy`` table have the same category.\n\n## When colum headers are values, not variable names\n\nNow let us move to data structures that are untidy. Consider the following variant of our `dairy` table:\n\n| type | liters | grams |\n|----------------|--------|-------|\n| milk | 1 | |\n| butter | | 200 |\n| yogurt | | 150 |\n| cheese | | 100 |\n| cream | 0.5 | |\n| cottage cheese | | 250 |\n| sour cream | | 150 |\n| ghee | | 100 |\n| whipping cream | 0.3 | |\n| ice cream | | 500 |\n\nWhat is the issue here? Each row still refers to a specific dairy product. However, instead of dedicated `quantity` and `unit` columns, we have a `liters` and `grams` column. 
Since the units differ across dairy products, the table even contains missing values in the form of emtpy cells. So if you want to find out how much of ice cream you still have, you need to also check out the column name. In practice, we would create dedicated `quantity` and `unit` columns. we might even decide to have the same unit for all ingredients (e.g., measure everything in grams) and just keep a `quantity` column.\n\n## When multiple variables are stored in one column\n\nLet us consider the following untidy version of our `ingredient` table. \n\n| type | quantity_and_unit |\n|-----------|-------------------|\n| flour | 500 grams |\n| sugar | 200 grams |\n| butter | 100 grams |\n| eggs | 4 units |\n| milk | 1 liter |\n| salt | 10 grams |\n| olive oil | 0.2 liters |\n| tomatoes | 300 grams |\n| chicken | 400 grams |\n| rice | 250 grams |\n\nThis one is really annoying, since the `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? This format actually makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. So in practice, we would actually start our data analysis by splitting out the `quantity_and_unit` column into `quantity` and `unit`.\n\n## When variables are stored in both rows and columns\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. 
This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.2 ✔ readr 2.1.4\n✔ forcats 1.0.0 ✔ stringr 1.5.0\n✔ ggplot2 3.4.2 ✔ tibble 3.2.1\n✔ lubridate 1.9.2 ✔ tidyr 1.3.0\n✔ purrr 1.0.1 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code}\ntibble(\n ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n ingredient recipe1_quantity recipe2_quantity\n <chr> <chr> <chr> \n1 Flour 500 grams 300 grams \n2 Sugar 200 grams 150 grams \n3 Butter 100 grams 50 grams \n4 Eggs 4 units 3 \n5 Milk 1 liter 0.5 liters \n```\n\n\n:::\n:::\n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity.\n\n## When there are multiple types of data in the same column\n\n\n\n| type | quantity | category |\n|--------------|-------------|------------|\n| flour | 500 grams | ingredient |\n| butter | 100 grams | ingredient |\n| whisk | 1 | utensil |\n| sugar | 200 grams | ingredient |\n| baking time | 30 minutes | time |\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. 
There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.\n\n## When some data is missing\n\nKey points:\n\n- Huge difference between NA and 0 (or any other value)\n- Are you sure that you don't have the ingredient or do you just don't know?\n- Missing are dropped in filters \n\n| type | quantity | unit |\n|-----------|----------|--------|\n| flour | | grams |\n| sugar | 200 | grams |\n| butter | 100 | grams |\n| eggs | 4 | units |\n| milk | 1 | |\n| salt | 10 | grams |\n| olive oil | 0.2 | liters |\n| tomatoes | 300 | grams |\n| chicken | 400 | grams |\n| | 250 | grams |\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
2 changes: 1 addition & 1 deletion docs/index.html
@@ -137,7 +137,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1700780400000" data-listing-file-modified-sort="1700843651647" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="746">
<div class="g-col-1" data-index="0" data-listing-date-sort="1700780400000" data-listing-file-modified-sort="1700916735155" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="12" data-listing-word-count-sort="2248">
<a href="./posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top">