diff --git a/_freeze/posts/tidy-data-a-recipe-for-efficient-data-analysis/index/execute-results/html.json b/_freeze/posts/tidy-data-a-recipe-for-efficient-data-analysis/index/execute-results/html.json index e522e0e..ee318a7 100644 --- a/_freeze/posts/tidy-data-a-recipe-for-efficient-data-analysis/index/execute-results/html.json +++ b/_freeze/posts/tidy-data-a-recipe-for-efficient-data-analysis/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "4c62362bc0bfd7f3f9def2c3cf1de1e4", + "hash": "565760e0f3b63c95b458bdd01ef18118", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Tidy Data: A Recipe for Efficient Data Analysis\"\ndescription: \"On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen\"\nauthor: \"Christoph Scheuch\"\ndate: \"2023-11-24\" \nimage: thumbnail.png\n---\n\n\nImagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. \n\nTidy data are like a well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014].\n\nTidying data is about structuring datasets to facilitate analysis or report generation. 
By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.\n\n## Example for tidy data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\ningredients <- tibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n quantity = c(500, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n unit = c(\"grams\", \"grams\", \"grams\", \"units\", \"liters\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n\nspices <- tibble(\n type = c(\"Paprika\", \"Turmeric\", \"Cumin\", \"Coriander\", \"Cinnamon\", \"Chili Powder\", \"Oregano\", \"Thyme\", \"Saffron\", \"Nutmeg\"),\n quantity = c(50, 40, 30, 25, 20, 15, 10, 8, 5, 12),\n unit = c(\"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\")\n)\n\ndairies <- tibble(\n type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n quantity = c(1, 200, 150, 100, 0.5, 250, 150, 100, 0.3, 500),\n unit = c(\"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\")\n)\n```\n:::\n\n\n## When colum headers are values, not variable names\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n liters = c(1, NA, NA, NA, 0.5, NA, NA, NA, 0.3, NA),\n grams = c(NA, 200, 150, 100, NA, 250, 150, 100, NA, 500)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 3\n type liters grams\n \n 1 Milk 1 NA\n 2 Butter NA 200\n 3 Yogurt NA 150\n 4 Cheese NA 100\n 5 
Cream 0.5 NA\n 6 Cottage Cheese NA 250\n 7 Sour Cream NA 150\n 8 Ghee NA 100\n 9 Whipping Cream 0.3 NA\n10 Ice Cream NA 500\n```\n\n\n:::\n:::\n\n\n## When multiple variables are stored in one column\n\nThe `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n quantity_and_unit = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\", \"10 grams\", \"0.2 liters\", \"300 grams\", \"400 grams\", \"250 grams\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 2\n type quantity_and_unit\n \n 1 Flour 500 grams \n 2 Sugar 200 grams \n 3 Butter 100 grams \n 4 Eggs 4 units \n 5 Milk 1 liter \n 6 Salt 10 grams \n 7 Olive Oil 0.2 liters \n 8 Tomatoes 300 grams \n 9 Chicken 400 grams \n10 Rice 250 grams \n```\n\n\n:::\n:::\n\n\n## When variables are stored in both rows and columns\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. 
This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n ingredient recipe1_quantity recipe2_quantity\n \n1 Flour 500 grams 300 grams \n2 Sugar 200 grams 150 grams \n3 Butter 100 grams 50 grams \n4 Eggs 4 units 3 \n5 Milk 1 liter 0.5 liters \n```\n\n\n:::\n:::\n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity.\n\n## When there are multiple types of data in the same column\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. 
There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Flour\", \"Butter\", \"Whisk\", \"Sugar\", \"Baking Time\"),\n quantity = c(\"500 grams\", \"100 grams\", \"1\", \"200 grams\", \"30 minutes\"),\n category = c(\"Ingredient\", \"Ingredient\", \"Utensil\", \"Ingredient\", \"Time\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n type quantity category \n \n1 Flour 500 grams Ingredient\n2 Butter 100 grams Ingredient\n3 Whisk 1 Utensil \n4 Sugar 200 grams Ingredient\n5 Baking Time 30 minutes Time \n```\n\n\n:::\n:::\n\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.\n\n## When some data is missing\n\nKey points:\n\n- Huge difference between NA and 0 (or any other value)\n- Are you sure that you don't have the ingredient or do you just don't know?\n- Missing are dropped in filters \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", NA),\n quantity = c(NA, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n unit = c(\"grams\", \"grams\", \"grams\", \"units\", NA, \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 3\n type quantity unit \n \n 1 Flour NA grams \n 2 Sugar 200 grams \n 3 Butter 100 grams \n 4 Eggs 4 units \n 5 Milk 1 \n 6 Salt 10 grams \n 7 Olive Oil 0.2 liters\n 8 Tomatoes 300 grams \n 9 Chicken 400 grams \n10 250 grams \n```\n\n\n:::\n:::", + "markdown": "---\ntitle: \"Tidy Data: A Recipe for Efficient Data Analysis\"\ndescription: \"On the importance of tidy data for efficient analysis using the analogy of a well-organized kitchen\"\nauthor: 
\"Christoph Scheuch\"\ndate: \"2023-11-24\" \nimage: thumbnail.png\n---\n\n\nImagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. \n\nTidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014].\n\nTidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.\n\n## Example for tidy data\n\nTo illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called `ingredient` that contains information about all the ingredients that we currently have in our kitchen. 
It might look as follows:\n\n| name | quantity | unit | category |\n|-----------|----------|--------|-----------|\n| flour | 500 | grams | baking |\n| sugar | 200 | grams | baking |\n| butter | 100 | grams | dairy |\n| eggs | 4 | units | dairy |\n| milk | 1 | liters | dairy |\n| salt | 10 | grams | seasoning |\n| olive oil | 0.2 | liters | oil |\n| tomatoes | 300 | grams | vegetable |\n| chicken | 400 | grams | meat |\n| rice | 250 | grams | grain |\n\nEach row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column `quantity` contains information about how much of the ingredient called `name` we currently have and which `unit` we use to measure it. \n\nSimilarly, we could have a table just for `dairy` that might look as follows:\n\n| name | quantity | unit |\n|----------------|----------|--------|\n| milk | 1 | liters |\n| butter | 200 | grams |\n| yogurt | 150 | grams |\n| cheese | 100 | grams |\n| cream | 0.5 | liters |\n| cottage cheese | 250 | grams |\n| sour cream | 150 | grams |\n| ghee | 100 | grams |\n| whipping cream | 0.3 | liters |\n| ice cream | 500 | grams |\n\nNotice that there is no `category` column in this table? It would actually be redundant to have this column because all rows in the `dairy`` table have the same category.\n\n## When colum headers are values, not variable names\n\nNow let us move to data structures that are untidy. Consider the following variant of our `dairy` table:\n\n| type | liters | grams |\n|----------------|--------|-------|\n| milk | 1 | |\n| butter | | 200 |\n| yogurt | | 150 |\n| cheese | | 100 |\n| cream | 0.5 | |\n| cottage cheese | | 250 |\n| sour cream | | 150 |\n| ghee | | 100 |\n| whipping cream | 0.3 | |\n| ice cream | | 500 |\n\nWhat is the issue here? Each row still refers to a specific dairy product. However, instead of dedicated `quantity` and `unit` columns, we have a `liters` and `grams` column. 
Since the units differ across dairy products, the table even contains missing values in the form of emtpy cells. So if you want to find out how much of ice cream you still have, you need to also check out the column name. In practice, we would create dedicated `quantity` and `unit` columns. we might even decide to have the same unit for all ingredients (e.g., measure everything in grams) and just keep a `quantity` column.\n\n## When multiple variables are stored in one column\n\nLet us consider the following untidy version of our `ingredient` table. \n\n| type | quantity_and_unit |\n|-----------|-------------------|\n| flour | 500 grams |\n| sugar | 200 grams |\n| butter | 100 grams |\n| eggs | 4 units |\n| milk | 1 liter |\n| salt | 10 grams |\n| olive oil | 0.2 liters |\n| tomatoes | 300 grams |\n| chicken | 400 grams |\n| rice | 250 grams |\n\nThis one is really annoying, since the `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? This format actually makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. So in practice, we would actually start our data analysis by splitting out the `quantity_and_unit` column into `quantity` and `unit`.\n\n## When variables are stored in both rows and columns\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. 
This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.2 ✔ readr 2.1.4\n✔ forcats 1.0.0 ✔ stringr 1.5.0\n✔ ggplot2 3.4.2 ✔ tibble 3.2.1\n✔ lubridate 1.9.2 ✔ tidyr 1.3.0\n✔ purrr 1.0.1 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code}\ntibble(\n ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n ingredient recipe1_quantity recipe2_quantity\n \n1 Flour 500 grams 300 grams \n2 Sugar 200 grams 150 grams \n3 Butter 100 grams 50 grams \n4 Eggs 4 units 3 \n5 Milk 1 liter 0.5 liters \n```\n\n\n:::\n:::\n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity.\n\n## When there are multiple types of data in the same column\n\n\n\n| type | quantity | category |\n|--------------|-------------|------------|\n| flour | 500 grams | ingredient |\n| butter | 100 grams | ingredient |\n| whisk | 1 | utensil |\n| sugar | 200 grams | ingredient |\n| baking time | 30 minutes | time |\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. 
There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.\n\n## When some data is missing\n\nKey points:\n\n- Huge difference between NA and 0 (or any other value)\n- Are you sure that you don't have the ingredient or do you just don't know?\n- Missing are dropped in filters \n\n| type | quantity | unit |\n|-----------|----------|--------|\n| flour | | grams |\n| sugar | 200 | grams |\n| butter | 100 | grams |\n| eggs | 4 | units |\n| milk | 1 | |\n| salt | 10 | grams |\n| olive oil | 0.2 | liters |\n| tomatoes | 300 | grams |\n| chicken | 400 | grams |\n| | 250 | grams |\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/docs/index.html b/docs/index.html index dc233b8..ecb0801 100644 --- a/docs/index.html +++ b/docs/index.html @@ -137,7 +137,7 @@
-
+

diff --git a/docs/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html b/docs/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html index 8c49df3..72df2b8 100644 --- a/docs/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html +++ b/docs/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html @@ -23,40 +23,6 @@ margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ vertical-align: middle; } -/* CSS for syntax highlighting */ -pre > code.sourceCode { white-space: pre; position: relative; } -pre > code.sourceCode > span { line-height: 1.25; } -pre > code.sourceCode > span:empty { height: 1.2em; } -.sourceCode { overflow: visible; } -code.sourceCode > span { color: inherit; text-decoration: inherit; } -div.sourceCode { margin: 1em 0; } -pre.sourceCode { margin: 0; } -@media screen { -div.sourceCode { overflow: auto; } -} -@media print { -pre > code.sourceCode { white-space: pre-wrap; } -pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } -} -pre.numberSource code - { counter-reset: source-line 0; } -pre.numberSource code > span - { position: relative; left: -4em; counter-increment: source-line; } -pre.numberSource code > span > a:first-child::before - { content: counter(source-line); - position: relative; left: -1em; text-align: right; vertical-align: baseline; - border: none; display: inline-block; - -webkit-touch-callout: none; -webkit-user-select: none; - -khtml-user-select: none; -moz-user-select: none; - -ms-user-select: none; user-select: none; - padding: 0 4px; width: 4em; - } -pre.numberSource { margin-left: 3em; padding-left: 4px; } -div.sourceCode - { } -@media screen { -pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } -} /* CSS for citations */ div.csl-bib-body { } div.csl-entry { @@ -188,156 +154,431 @@

Tidy Data: A Recipe for Efficient Data Analysis

Imagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial and error involved, possibly ruining your planned meal.

-

Tidy data are like a well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside. In the same way, tidy data organizes information into a clear and consistent format, where each type of observational unit forms a table, each variable is in a column, and each observation is in a row (Wickham 2014).

-

Tidying data is about structuring datasets to facilitate analysis or report generation. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.

+

Tidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each type of observational unit forms a table, each variable is in a column, and each observation is in a row (Wickham 2014).

+

Tidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients.

Example for tidy data

-
-
library(tidyverse)
-
-ingredients <- tibble(
-  type = c("Flour", "Sugar", "Butter", "Eggs", "Milk", "Salt", "Olive Oil", "Tomatoes", "Chicken", "Rice"),
-  quantity = c(500, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),
-  unit = c("grams", "grams", "grams", "units", "liters", "grams", "liters", "grams", "grams", "grams")
-)
-
-spices <- tibble(
-  type = c("Paprika", "Turmeric", "Cumin", "Coriander", "Cinnamon", "Chili Powder", "Oregano", "Thyme", "Saffron", "Nutmeg"),
-  quantity = c(50, 40, 30, 25, 20, 15, 10, 8, 5, 12),
-  unit = c("grams", "grams", "grams", "grams", "grams", "grams", "grams", "grams", "grams", "grams")
-)
-
-dairies <- tibble(
-  type = c("Milk", "Butter", "Yogurt", "Cheese", "Cream", "Cottage Cheese", "Sour Cream", "Ghee", "Whipping Cream", "Ice Cream"),
-  quantity = c(1, 200, 150, 100, 0.5, 250, 150, 100, 0.3, 500),
-  unit = c("liters", "grams", "grams", "grams", "liters", "grams", "grams", "grams", "liters", "grams")
-)
-
+

To illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called ingredient that contains information about all the ingredients that we currently have in our kitchen. It might look as follows:

| name      | quantity | unit   | category  |
|-----------|----------|--------|-----------|
| flour     | 500      | grams  | baking    |
| sugar     | 200      | grams  | baking    |
| butter    | 100      | grams  | dairy     |
| eggs      | 4        | units  | dairy     |
| milk      | 1        | liters | dairy     |
| salt      | 10       | grams  | seasoning |
| olive oil | 0.2      | liters | oil       |
| tomatoes  | 300      | grams  | vegetable |
| chicken   | 400      | grams  | meat      |
| rice      | 250      | grams  | grain     |
+

Each row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column quantity records how much of the ingredient given in name we currently have, and the column unit records the unit we use to measure it.

+

Similarly, we could have a table just for dairy that might look as follows:

| name           | quantity | unit   |
|----------------|----------|--------|
| milk           | 1        | liters |
| butter         | 200      | grams  |
| yogurt         | 150      | grams  |
| cheese         | 100      | grams  |
| cream          | 0.5      | liters |
| cottage cheese | 250      | grams  |
| sour cream     | 150      | grams  |
| ghee           | 100      | grams  |
| whipping cream | 0.3      | liters |
| ice cream      | 500      | grams  |
+

Notice that there is no category column in this table? It would actually be redundant to have this column because all rows in the dairy table have the same category.
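
As a minimal sketch of why this layout is convenient, a shortened version of the ingredient table can be written as a tibble and summarized directly. The object names `ingredient` and `grams_by_category` and the chosen summary are illustrative assumptions, not something the post prescribes.

```r
library(dplyr)
library(tibble)

# Tidy table: one row per ingredient, one column per variable
# (a shortened, illustrative version of the table above)
ingredient <- tibble(
  name     = c("flour", "sugar", "butter", "eggs", "milk"),
  quantity = c(500, 200, 100, 4, 1),
  unit     = c("grams", "grams", "grams", "units", "liters"),
  category = c("baking", "baking", "dairy", "dairy", "dairy")
)

# Because each variable has its own column, filtering and
# aggregating need no string parsing or reshaping first
grams_by_category <- ingredient |>
  filter(unit == "grams") |>
  group_by(category) |>
  summarize(total_grams = sum(quantity))
```

The filter on `unit` matters here: adding up grams, liters, and units in a single sum would not be meaningful.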

When column headers are values, not variable names

-
-
tibble(
-  type = c("Milk", "Butter", "Yogurt", "Cheese", "Cream", "Cottage Cheese", "Sour Cream", "Ghee", "Whipping Cream", "Ice Cream"),
-  liters = c(1, NA, NA, NA, 0.5, NA, NA, NA, 0.3, NA),
-  grams = c(NA, 200, 150, 100, NA, 250, 150, 100, NA, 500)
-)
-
-
# A tibble: 10 × 3
-   type           liters grams
-   <chr>           <dbl> <dbl>
- 1 Milk              1      NA
- 2 Butter           NA     200
- 3 Yogurt           NA     150
- 4 Cheese           NA     100
- 5 Cream             0.5    NA
- 6 Cottage Cheese   NA     250
- 7 Sour Cream       NA     150
- 8 Ghee             NA     100
- 9 Whipping Cream    0.3    NA
-10 Ice Cream        NA     500
-
-
+

Now let us move to data structures that are untidy. Consider the following variant of our dairy table:

| type           | liters | grams |
|----------------|--------|-------|
| milk           | 1      |       |
| butter         |        | 200   |
| yogurt         |        | 150   |
| cheese         |        | 100   |
| cream          | 0.5    |       |
| cottage cheese |        | 250   |
| sour cream     |        | 150   |
| ghee           |        | 100   |
| whipping cream | 0.3    |       |
| ice cream      |        | 500   |
+

What is the issue here? Each row still refers to a specific dairy product. However, instead of dedicated quantity and unit columns, we have liters and grams columns. Since the units differ across dairy products, the table even contains missing values in the form of empty cells. So if you want to find out how much ice cream you still have, you also need to check the column name. In practice, we would create dedicated quantity and unit columns. We might even decide to use the same unit for all ingredients (e.g., measure everything in grams) and just keep a quantity column.
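
One possible way to create the dedicated quantity and unit columns is `tidyr::pivot_longer()`. The sketch below uses a shortened copy of the table; the object names `dairy_wide` and `dairy_tidy` are hypothetical.

```r
library(tidyr)
library(tibble)

# Untidy variant: the unit is hidden in the column headers
dairy_wide <- tibble(
  type   = c("milk", "butter", "cream", "ice cream"),
  liters = c(1, NA, 0.5, NA),
  grams  = c(NA, 200, NA, 500)
)

# Move the headers into a `unit` column and the cell values into
# `quantity`; the structurally empty cells are dropped along the way
dairy_tidy <- dairy_wide |>
  pivot_longer(
    cols = c(liters, grams),
    names_to = "unit",
    values_to = "quantity",
    values_drop_na = TRUE
  )
```

Setting `values_drop_na = TRUE` is reasonable here because the empty cells only exist as a side effect of the wide layout; they do not represent unknown quantities.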

When multiple variables are stored in one column

-

The quantity_and_unit column combines both the quantity and the unit of measurement into one string for each ingredient. This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement.

-
-
tibble(
-  type = c("Flour", "Sugar", "Butter", "Eggs", "Milk", "Salt", "Olive Oil", "Tomatoes", "Chicken", "Rice"),
-  quantity_and_unit = c("500 grams", "200 grams", "100 grams", "4 units", "1 liter", "10 grams", "0.2 liters", "300 grams", "400 grams", "250 grams")
-)
-
-
# A tibble: 10 × 2
-   type      quantity_and_unit
-   <chr>     <chr>            
- 1 Flour     500 grams        
- 2 Sugar     200 grams        
- 3 Butter    100 grams        
- 4 Eggs      4 units          
- 5 Milk      1 liter          
- 6 Salt      10 grams         
- 7 Olive Oil 0.2 liters       
- 8 Tomatoes  300 grams        
- 9 Chicken   400 grams        
-10 Rice      250 grams        
-
-
+

Let us consider the following untidy version of our ingredient table.

| type      | quantity_and_unit |
|-----------|-------------------|
| flour     | 500 grams         |
| sugar     | 200 grams         |
| butter    | 100 grams         |
| eggs      | 4 units           |
| milk      | 1 liter           |
| salt      | 10 grams          |
| olive oil | 0.2 liters        |
| tomatoes  | 300 grams         |
| chicken   | 400 grams         |
| rice      | 250 grams         |
+

This one is really annoying, since the quantity_and_unit column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. So in practice, we would start our data analysis by splitting the quantity_and_unit column into quantity and unit.
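
A minimal sketch of that split, assuming the quantity and the unit are always separated by a single space, could use `tidyr::separate()`. The object names `ingredient_untidy` and `ingredient_tidy` are made up for illustration.

```r
library(tidyr)
library(tibble)

# Untidy variant: two variables packed into one string column
ingredient_untidy <- tibble(
  type = c("flour", "eggs", "milk", "olive oil"),
  quantity_and_unit = c("500 grams", "4 units", "1 liter", "0.2 liters")
)

# Split at the space; convert = TRUE turns the quantity strings into numbers
ingredient_tidy <- ingredient_untidy |>
  separate(
    quantity_and_unit,
    into = c("quantity", "unit"),
    sep = " ",
    convert = TRUE
  )
```

After the split, the unit column still mixes "liter" and "liters", so a real cleanup would probably harmonize those labels as a next step.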

When variables are stored in both rows and columns

+

Let us extend our kitchen analogy by additionally considering recipes. For simplicity, a recipe just denotes how much of each ingredient is required. The following table contains two variants of a recipe for pancakes:

| ingredient | recipe1_quantity | recipe2_quantity |
|------------|------------------|------------------|
| flour      | 500 grams        | 300 grams        |
| sugar      | 200 grams        | 150 grams        |
| butter     | 100 grams        | 50 grams         |
| eggs       | 4 units          | 3 units          |
| milk       | 1 liters         | 0.5 liters       |

The quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.

-
-
tibble(
-  ingredient = c("Flour", "Sugar", "Butter", "Eggs", "Milk"),
-  recipe1_quantity = c("500 grams", "200 grams", "100 grams", "4 units", "1 liter"),
-  recipe2_quantity = c("300 grams", "150 grams", "50 grams", "3", "0.5 liters")
-)
-
-
# A tibble: 5 × 3
-  ingredient recipe1_quantity recipe2_quantity
-  <chr>      <chr>            <chr>           
-1 Flour      500 grams        300 grams       
-2 Sugar      200 grams        150 grams       
-3 Butter     100 grams        50 grams        
-4 Eggs       4 units          3               
-5 Milk       1 liter          0.5 liters      
-
-
-

To convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity.

+

To convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity. We can then filter or summarize the data by recipe or ingredient.
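
The conversion described above might look roughly as follows. This is a sketch under the assumption that every cell has the form "number unit"; the object names `recipes_wide`, `recipes_tidy`, and `recipe2` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Untidy variant: the recipe is encoded in the column names
recipes_wide <- tibble(
  ingredient       = c("flour", "sugar", "eggs"),
  recipe1_quantity = c("500 grams", "200 grams", "4 units"),
  recipe2_quantity = c("300 grams", "150 grams", "3 units")
)

recipes_tidy <- recipes_wide |>
  # Gather both quantity columns into one, keeping the header as `recipe`
  pivot_longer(
    cols = c(recipe1_quantity, recipe2_quantity),
    names_to = "recipe",
    values_to = "quantity_and_unit"
  ) |>
  # Strip the "_quantity" suffix so `recipe` holds only the recipe label
  mutate(recipe = sub("_quantity$", "", recipe)) |>
  # Split the combined string into a numeric quantity and a unit
  separate(quantity_and_unit, into = c("quantity", "unit"), sep = " ", convert = TRUE)

# Filtering by recipe is now a plain row filter
recipe2 <- recipes_tidy |>
  filter(recipe == "recipe2")
```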

When there are multiple types of data in the same column

+

A recipe typically contains information on the required utensils and how much time a step requires. Consider the following table with different types of data:

| type        | quantity   | category   |
|-------------|------------|------------|
| flour       | 500 grams  | ingredient |
| butter      | 100 grams  | ingredient |
| whisk       | 1 unit     | utensil    |
| sugar       | 200 grams  | ingredient |
| baking time | 30 minutes | time       |

The table is trying to describe a recipe but combines different types of data within the same columns. There are ingredients with their quantities, a utensil, and cooking time, all mixed together.

-
-
tibble(
-  type = c("Flour", "Butter", "Whisk", "Sugar", "Baking Time"),
-  quantity = c("500 grams", "100 grams", "1", "200 grams", "30 minutes"),
-  category = c("Ingredient", "Ingredient", "Utensil", "Ingredient", "Time")
-)
-
-
# A tibble: 5 × 3
-  type        quantity   category  
-  <chr>       <chr>      <chr>     
-1 Flour       500 grams  Ingredient
-2 Butter      100 grams  Ingredient
-3 Whisk       1          Utensil   
-4 Sugar       200 grams  Ingredient
-5 Baking Time 30 minutes Time      
-
-

A tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization.
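
One way to sketch the "separate tables" option is to filter on the category column and keep one table per type of observational unit. The table names `recipe_ingredients`, `recipe_utensils`, and `recipe_times` are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Mixed table: ingredients, utensils, and timings share the same columns
recipe_mixed <- tibble(
  type     = c("flour", "butter", "whisk", "sugar", "baking time"),
  quantity = c("500 grams", "100 grams", "1 unit", "200 grams", "30 minutes"),
  category = c("ingredient", "ingredient", "utensil", "ingredient", "time")
)

# One table per observational unit; `category` becomes redundant in each
recipe_ingredients <- recipe_mixed |>
  filter(category == "ingredient") |>
  select(-category)

recipe_utensils <- recipe_mixed |>
  filter(category == "utensil") |>
  select(-category)

recipe_times <- recipe_mixed |>
  filter(category == "time") |>
  select(-category)
```

Each resulting table could then get column names that fit its content, for example a duration column for the timings instead of a generic quantity.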

When some data is missing

+

As a last example for untidy data, let us consider the original ingredient table again, but with a few empty cells.

Key points:

  • Huge difference between NA and 0 (or any other value)
  • Are you sure that you don’t have the ingredient, or do you just not know?
  • Missing values are dropped in filters
-
-
tibble(
-  type = c("Flour", "Sugar", "Butter", "Eggs", "Milk", "Salt", "Olive Oil", "Tomatoes", "Chicken", NA),
-  quantity = c(NA, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),
-  unit = c("grams", "grams", "grams", "units", NA, "grams", "liters", "grams", "grams", "grams")
-)
-
-
# A tibble: 10 × 3
-   type      quantity unit  
-   <chr>        <dbl> <chr> 
- 1 Flour         NA   grams 
- 2 Sugar        200   grams 
- 3 Butter       100   grams 
- 4 Eggs           4   units 
- 5 Milk           1   <NA>  
- 6 Salt          10   grams 
- 7 Olive Oil      0.2 liters
- 8 Tomatoes     300   grams 
- 9 Chicken      400   grams 
-10 <NA>         250   grams 
-
-
| name      | quantity | unit   |
|-----------|----------|--------|
| flour     |          | grams  |
| sugar     | 200      | grams  |
| butter    | 100      | grams  |
| eggs      | 4        | units  |
| milk      | 10       |        |
| salt      | 10       | grams  |
| olive oil | 0.2      | liters |
| tomatoes  | 300      | grams  |
| chicken   | 400      | grams  |
|           | 250      | grams  |
+

What is the issue here? There are actually a couple of them:

+
    +
  • The flour row does not have any information about quantity, so we just don’t know how much we have.
  • +
  • The milk row does not contain a unit, so we might have 10 liters, 10 milliliters, or 10 cups of milk.
  • +
  • The last row does not have any name, so we have 250 grams of something that we just can’t identify.
  • +
+

Why is this important? It makes a huge difference how we treat the missing information. For instance, we might make an educated guess for milk: if we always record that information in liters, then the missing unit is very likely liters. For flour, we could play it safe and just say that the available quantity is zero. For the ingredient without a name, we might have to throw it away or ask somebody else to tell us what it is.
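
The difference between a missing value and zero can be illustrated with a short sketch. The table is a shortened copy of the one above, and the object names `ingredient_na`, `in_stock`, and `as_zero` are hypothetical.

```r
library(dplyr)
library(tibble)

# Shortened table with the three kinds of gaps discussed above
ingredient_na <- tibble(
  name     = c("flour", "sugar", "milk", NA),
  quantity = c(NA, 200, 10, 250),
  unit     = c("grams", "grams", NA, "grams")
)

# filter() keeps only rows where the condition is TRUE, so the flour
# row (quantity is NA) disappears without any warning
in_stock <- ingredient_na |>
  filter(quantity > 0)

# Treating "unknown" as "none" is an explicit decision, not a default
as_zero <- ingredient_na |>
  mutate(quantity = coalesce(quantity, 0))

# sum() propagates the missing value unless told to ignore it
total_unknown <- sum(ingredient_na$quantity)
total_known   <- sum(ingredient_na$quantity, na.rm = TRUE)
```

Here `total_unknown` is NA while `total_known` adds up only the recorded quantities, which is why it pays to decide deliberately how each gap should be handled.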

+

Overall, these examples highlight the most important issues that you might have to consider when preparing data for your analysis.

diff --git a/docs/search.json b/docs/search.json index 09aa8d7..4bf2bf3 100644 --- a/docs/search.json +++ b/docs/search.json @@ -11,48 +11,48 @@ "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "", - "text": "Imagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal.\nTidy data are like a well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside. In the same way, tidy data organizes information into a clear and consistent format, where each type of observational unit forms a table, each variable is in a column, and each observation is in a row (Wickham 2014).\nTidying data is about structuring datasets to facilitate analysis or report generation. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients." + "text": "Imagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal.\nTidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. 
Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each type of observational unit forms a table, each variable is in a column, and each observation is in a row (Wickham 2014).\nTidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients." }, { "objectID": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#example-for-tidy-data", "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#example-for-tidy-data", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "Example for tidy data", - "text": "Example for tidy data\n\nlibrary(tidyverse)\n\ningredients <- tibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n quantity = c(500, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n unit = c(\"grams\", \"grams\", \"grams\", \"units\", \"liters\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n\nspices <- tibble(\n type = c(\"Paprika\", \"Turmeric\", \"Cumin\", \"Coriander\", \"Cinnamon\", \"Chili Powder\", \"Oregano\", \"Thyme\", \"Saffron\", \"Nutmeg\"),\n quantity = c(50, 40, 30, 25, 20, 15, 10, 8, 5, 12),\n unit = c(\"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\", \"grams\")\n)\n\ndairies <- tibble(\n type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice 
Cream\"),\n quantity = c(1, 200, 150, 100, 0.5, 250, 150, 100, 0.3, 500),\n unit = c(\"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\", \"grams\", \"grams\", \"liters\", \"grams\")\n)" + "text": "Example for tidy data\nTo illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called ingredient that contains information about all the ingredients that we currently have in our kitchen. It might look as follows:\n\n\n\nname\nquantity\nunit\ncategory\n\n\n\n\nflour\n500\ngrams\nbaking\n\n\nsugar\n200\ngrams\nbaking\n\n\nbutter\n100\ngrams\ndairy\n\n\neggs\n4\nunits\ndairy\n\n\nmilk\n1\nliters\ndairy\n\n\nsalt\n10\ngrams\nseasoning\n\n\nolive oil\n0.2\nliters\noil\n\n\ntomatoes\n300\ngrams\nvegetable\n\n\nchicken\n400\ngrams\nmeat\n\n\nrice\n250\ngrams\ngrain\n\n\n\nEach row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column quantity contains information about how much of the ingredient called name we currently have and which unit we use to measure it.\nSimilarly, we could have a table just for dairy that might look as follows:\n\n\n\nname\nquantity\nunit\n\n\n\n\nmilk\n1\nliters\n\n\nbutter\n200\ngrams\n\n\nyogurt\n150\ngrams\n\n\ncheese\n100\ngrams\n\n\ncream\n0.5\nliters\n\n\ncottage cheese\n250\ngrams\n\n\nsour cream\n150\ngrams\n\n\nghee\n100\ngrams\n\n\nwhipping cream\n0.3\nliters\n\n\nice cream\n500\ngrams\n\n\n\nNotice that there is no category column in this table? It would actually be redundant to have this column because all rows in the `dairy`` table have the same category." 
}, { "objectID": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-colum-headers-are-values-not-variable-names", "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-colum-headers-are-values-not-variable-names", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "When colum headers are values, not variable names", - "text": "When colum headers are values, not variable names\n\ntibble(\n type = c(\"Milk\", \"Butter\", \"Yogurt\", \"Cheese\", \"Cream\", \"Cottage Cheese\", \"Sour Cream\", \"Ghee\", \"Whipping Cream\", \"Ice Cream\"),\n liters = c(1, NA, NA, NA, 0.5, NA, NA, NA, 0.3, NA),\n grams = c(NA, 200, 150, 100, NA, 250, 150, 100, NA, 500)\n)\n\n# A tibble: 10 × 3\n type liters grams\n <chr> <dbl> <dbl>\n 1 Milk 1 NA\n 2 Butter NA 200\n 3 Yogurt NA 150\n 4 Cheese NA 100\n 5 Cream 0.5 NA\n 6 Cottage Cheese NA 250\n 7 Sour Cream NA 150\n 8 Ghee NA 100\n 9 Whipping Cream 0.3 NA\n10 Ice Cream NA 500" + "text": "When colum headers are values, not variable names\nNow let us move to data structures that are untidy. Consider the following variant of our dairy table:\n\n\n\ntype\nliters\ngrams\n\n\n\n\nmilk\n1\n\n\n\nbutter\n\n200\n\n\nyogurt\n\n150\n\n\ncheese\n\n100\n\n\ncream\n0.5\n\n\n\ncottage cheese\n\n250\n\n\nsour cream\n\n150\n\n\nghee\n\n100\n\n\nwhipping cream\n0.3\n\n\n\nice cream\n\n500\n\n\n\nWhat is the issue here? Each row still refers to a specific dairy product. However, instead of dedicated quantity and unit columns, we have a liters and grams column. Since the units differ across dairy products, the table even contains missing values in the form of empty cells. So if you want to find out how much ice cream you still have, you also need to check the column name. In practice, we would create dedicated quantity and unit columns. We might even decide to have the same unit for all ingredients (e.g., measure everything in grams) and just keep a quantity column."
}, { "objectID": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-multiple-variables-are-stored-in-one-column", "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-multiple-variables-are-stored-in-one-column", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "When multiple variables are stored in one column", - "text": "When multiple variables are stored in one column\nThe quantity_and_unit column combines both the quantity and the unit of measurement into one string for each ingredient. This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement.\n\ntibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", \"Rice\"),\n quantity_and_unit = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\", \"10 grams\", \"0.2 liters\", \"300 grams\", \"400 grams\", \"250 grams\")\n)\n\n# A tibble: 10 × 2\n type quantity_and_unit\n <chr> <chr> \n 1 Flour 500 grams \n 2 Sugar 200 grams \n 3 Butter 100 grams \n 4 Eggs 4 units \n 5 Milk 1 liter \n 6 Salt 10 grams \n 7 Olive Oil 0.2 liters \n 8 Tomatoes 300 grams \n 9 Chicken 400 grams \n10 Rice 250 grams" + "text": "When multiple variables are stored in one column\nLet us consider the following untidy version of our ingredient table.\n\n\n\ntype\nquantity_and_unit\n\n\n\n\nflour\n500 grams\n\n\nsugar\n200 grams\n\n\nbutter\n100 grams\n\n\neggs\n4 units\n\n\nmilk\n1 liter\n\n\nsalt\n10 grams\n\n\nolive oil\n0.2 liters\n\n\ntomatoes\n300 grams\n\n\nchicken\n400 grams\n\n\nrice\n250 grams\n\n\n\nThis one is really annoying, since the quantity_and_unit column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? 
This format actually makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. So in practice, we would actually start our data analysis by splitting out the quantity_and_unit column into quantity and unit." }, { "objectID": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-variables-are-stored-in-both-rows-and-columns", "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-variables-are-stored-in-both-rows-and-columns", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "When variables are stored in both rows and columns", - "text": "When variables are stored in both rows and columns\nThe quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\n\ntibble(\n ingredient = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\"),\n recipe1_quantity = c(\"500 grams\", \"200 grams\", \"100 grams\", \"4 units\", \"1 liter\"),\n recipe2_quantity = c(\"300 grams\", \"150 grams\", \"50 grams\", \"3\", \"0.5 liters\")\n)\n\n# A tibble: 5 × 3\n ingredient recipe1_quantity recipe2_quantity\n <chr> <chr> <chr> \n1 Flour 500 grams 300 grams \n2 Sugar 200 grams 150 grams \n3 Butter 100 grams 50 grams \n4 Eggs 4 units 3 \n5 Milk 1 liter 0.5 liters \n\n\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity." + "text": "When variables are stored in both rows and columns\nLet us extend our kitchen analogy by additionally considering recipes. For simplicity, a recipe just denotes how much of each ingredient is required. 
The following table contains two variants of a recipe for pancakes:\n\n\n\ningredient\nrecipe1_quantity\nrecipe2_quantity\n\n\n\n\nflour\n500 grams\n300 grams\n\n\nsugar\n200 grams\n150 grams\n\n\nbutter\n100 grams\n50 grams\n\n\neggs\n4 units\n3 units\n\n\nmilk\n1 liter\n0.5 liters\n\n\n\nThe quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient.\nTo convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity. We can then filter or summarize the data by recipe or ingredient." }, { "objectID": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-there-are-multiple-types-of-data-in-the-same-column", "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-there-are-multiple-types-of-data-in-the-same-column", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "When there are multiple types of data in the same column", - "text": "When there are multiple types of data in the same column\nThe table is trying to describe a recipe but combines different types of data within the same columns. 
There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\n\ntibble(\n type = c(\"Flour\", \"Butter\", \"Whisk\", \"Sugar\", \"Baking Time\"),\n quantity = c(\"500 grams\", \"100 grams\", \"1\", \"200 grams\", \"30 minutes\"),\n category = c(\"Ingredient\", \"Ingredient\", \"Utensil\", \"Ingredient\", \"Time\")\n)\n\n# A tibble: 5 × 3\n type quantity category \n <chr> <chr> <chr> \n1 Flour 500 grams Ingredient\n2 Butter 100 grams Ingredient\n3 Whisk 1 Utensil \n4 Sugar 200 grams Ingredient\n5 Baking Time 30 minutes Time \n\n\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization." + "text": "When there are multiple types of data in the same column\nA recipe typically contains information on the required utensils and how much time a step requires. Consider the following table with different types of data:\n\n\n\ntype\nquantity\ncategory\n\n\n\n\nflour\n500 grams\ningredient\n\n\nbutter\n100 grams\ningredient\n\n\nwhisk\n1 unit\nutensil\n\n\nsugar\n200 grams\ningredient\n\n\nbaking time\n30 minutes\ntime\n\n\n\nThe table is trying to describe a recipe but combines different types of data within the same columns. There are ingredients with their quantities, a utensil, and cooking time, all mixed together.\nA tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization." 
}, { "objectID": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-some-data-is-missing", "href": "posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html#when-some-data-is-missing", "title": "Tidy Data: A Recipe for Efficient Data Analysis", "section": "When some data is missing", - "text": "When some data is missing\nKey points:\n\nHuge difference between NA and 0 (or any other value)\nAre you sure that you don’t have the ingredient or do you just don’t know?\nMissing are dropped in filters\n\n\ntibble(\n type = c(\"Flour\", \"Sugar\", \"Butter\", \"Eggs\", \"Milk\", \"Salt\", \"Olive Oil\", \"Tomatoes\", \"Chicken\", NA),\n quantity = c(NA, 200, 100, 4, 1, 10, 0.2, 300, 400, 250),\n unit = c(\"grams\", \"grams\", \"grams\", \"units\", NA, \"grams\", \"liters\", \"grams\", \"grams\", \"grams\")\n)\n\n# A tibble: 10 × 3\n type quantity unit \n <chr> <dbl> <chr> \n 1 Flour NA grams \n 2 Sugar 200 grams \n 3 Butter 100 grams \n 4 Eggs 4 units \n 5 Milk 1 <NA> \n 6 Salt 10 grams \n 7 Olive Oil 0.2 liters\n 8 Tomatoes 300 grams \n 9 Chicken 400 grams \n10 <NA> 250 grams" + "text": "When some data is missing\nAs a last example for untidy data, let us consider the original ingredient table again, but with a few empty cells.\nKey points:\n\nHuge difference between NA and 0 (or any other value)\nAre you sure that you don’t have the ingredient or do you just don’t know?\nMissing are dropped in filters\n\n\n\n\nname\nquantity\nunit\n\n\n\n\nflour\n\ngrams\n\n\nsugar\n200\ngrams\n\n\nbutter\n100\ngrams\n\n\neggs\n4\nunits\n\n\nmilk\n10\n\n\n\nsalt\n10\ngrams\n\n\nolive oil\n0.2\nliters\n\n\ntomatoes\n300\ngrams\n\n\nchicken\n400\ngrams\n\n\n\n250\ngrams\n\n\n\nWhat is the issue here? 
There are actually a couple of them:\n\nThe flour row does not have any information about quantity, so we just don’t know how much we have.\nThe milk row does not contain a unit, so we might have 10 liters, 10 milliliters, or 10 cups of milk.\nThe last row does not have any name, so we have 250 grams of something that we just can’t identify.\n\nWhy is this important? It makes a huge difference how we treat the missing information. For instance, we might make an educated guess for milk: if we always record that information in liters, then the missing unit is very likely liters. For flour, we could play it safe and just say that the available quantity is zero. For the ingredient without a name, we might have to throw it away or ask somebody else to tell us what it is.\nOverall, these examples highlight the most important issues that you might have to consider when preparing data for your analysis." } ] \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 2349c86..e86386f 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -6,6 +6,6 @@ https://www.tidy-intelligence.com/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.html - 2023-11-24T16:34:11.647Z + 2023-11-25T12:52:15.155Z diff --git a/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.qmd b/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.qmd index a79eb6f..081484a 100644 --- a/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.qmd +++ b/posts/tidy-data-a-recipe-for-efficient-data-analysis/index.qmd @@ -8,96 +8,145 @@ image: thumbnail.png Imagine trying to cook a meal in a disorganized kitchen where ingredients are mixed up and nothing is labeled. It would be chaotic and time-consuming to look for the right ingredients and there might be some trial error involved, possibly ruining your planned meal. -Tidy data are like a well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together. 
Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014]. +Tidy data are like well-organized shelves in your kitchen. Each shelf provides a collection of containers that semantically belong together, e.g., spices or dairies. Each container on the shelf holds one type of ingredient, and the labels on the containers clearly describe what is inside, e.g., pepper or milk. In the same way, tidy data organizes information into a clear and consistent format, where each **type of observational unit forms a table**, **each variable is in a column**, and **each observation is in a row** [@Wickham2014]. -Tidying data is about structuring datasets to facilitate analysis or report generation. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients. +Tidying data is about structuring datasets to facilitate analysis, visualization, report generation, or modelling. By following the principle that each variable forms a column, each observation forms a row, and each type of observational unit forms a table, data analysis becomes more intuitive, akin to cooking in a well-organized kitchen where everything has its place and you spend less time on searching for ingredients. 
## Example for tidy data -```{r} -#| message: false -library(tidyverse) - -ingredients <- tibble( - type = c("Flour", "Sugar", "Butter", "Eggs", "Milk", "Salt", "Olive Oil", "Tomatoes", "Chicken", "Rice"), - quantity = c(500, 200, 100, 4, 1, 10, 0.2, 300, 400, 250), - unit = c("grams", "grams", "grams", "units", "liters", "grams", "liters", "grams", "grams", "grams") -) - -spices <- tibble( - type = c("Paprika", "Turmeric", "Cumin", "Coriander", "Cinnamon", "Chili Powder", "Oregano", "Thyme", "Saffron", "Nutmeg"), - quantity = c(50, 40, 30, 25, 20, 15, 10, 8, 5, 12), - unit = c("grams", "grams", "grams", "grams", "grams", "grams", "grams", "grams", "grams", "grams") -) - -dairies <- tibble( - type = c("Milk", "Butter", "Yogurt", "Cheese", "Cream", "Cottage Cheese", "Sour Cream", "Ghee", "Whipping Cream", "Ice Cream"), - quantity = c(1, 200, 150, 100, 0.5, 250, 150, 100, 0.3, 500), - unit = c("liters", "grams", "grams", "grams", "liters", "grams", "grams", "grams", "liters", "grams") -) -``` +To illustrate the concept of tidy data in our tidy kitchen, suppose we have a table called `ingredient` that contains information about all the ingredients that we currently have in our kitchen. It might look as follows: + +| name | quantity | unit | category | +|-----------|----------|--------|-----------| +| flour | 500 | grams | baking | +| sugar | 200 | grams | baking | +| butter | 100 | grams | dairy | +| eggs | 4 | units | dairy | +| milk | 1 | liters | dairy | +| salt | 10 | grams | seasoning | +| olive oil | 0.2 | liters | oil | +| tomatoes | 300 | grams | vegetable | +| chicken | 400 | grams | meat | +| rice | 250 | grams | grain | + +Each row refers to a specific ingredient and each column has a dedicated type and meaning. For instance, the column `quantity` contains information about how much of the ingredient called `name` we currently have and which `unit` we use to measure it. 
+ +Similarly, we could have a table just for `dairy` that might look as follows: + +| name | quantity | unit | +|----------------|----------|--------| +| milk | 1 | liters | +| butter | 200 | grams | +| yogurt | 150 | grams | +| cheese | 100 | grams | +| cream | 0.5 | liters | +| cottage cheese | 250 | grams | +| sour cream | 150 | grams | +| ghee | 100 | grams | +| whipping cream | 0.3 | liters | +| ice cream | 500 | grams | + +Notice that there is no `category` column in this table? It would actually be redundant to have this column because all rows in the `dairy` table have the same category. ## When colum headers are values, not variable names -```{r} -tibble( - type = c("Milk", "Butter", "Yogurt", "Cheese", "Cream", "Cottage Cheese", "Sour Cream", "Ghee", "Whipping Cream", "Ice Cream"), - liters = c(1, NA, NA, NA, 0.5, NA, NA, NA, 0.3, NA), - grams = c(NA, 200, 150, 100, NA, 250, 150, 100, NA, 500) -) -``` +Now let us move to data structures that are untidy. Consider the following variant of our `dairy` table: + +| type | liters | grams | +|----------------|--------|-------| +| milk | 1 | | +| butter | | 200 | +| yogurt | | 150 | +| cheese | | 100 | +| cream | 0.5 | | +| cottage cheese | | 250 | +| sour cream | | 150 | +| ghee | | 100 | +| whipping cream | 0.3 | | +| ice cream | | 500 | + +What is the issue here? Each row still refers to a specific dairy product. However, instead of dedicated `quantity` and `unit` columns, we have a `liters` and `grams` column. Since the units differ across dairy products, the table even contains missing values in the form of empty cells. So if you want to find out how much ice cream you still have, you also need to check the column name. In practice, we would create dedicated `quantity` and `unit` columns. We might even decide to have the same unit for all ingredients (e.g., measure everything in grams) and just keep a `quantity` column. 
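
The reshaping described above, from unit-named columns to dedicated `quantity` and `unit` columns, could look roughly like this. It is a minimal sketch, assuming the `tidyr` and `tibble` packages are installed; the names `dairy_untidy` and `dairy_tidy` are hypothetical, and only the first five rows of the table are reproduced.

```r
library(tidyr)
library(tibble)

# Untidy variant: the column headers (liters, grams) are values of a unit variable
dairy_untidy <- tibble(
  type   = c("milk", "butter", "yogurt", "cheese", "cream"),
  liters = c(1, NA, NA, NA, 0.5),
  grams  = c(NA, 200, 150, 100, NA)
)

# Move the headers into a `unit` column and the cell values into `quantity`;
# values_drop_na = TRUE removes the rows that only exist because of empty cells
dairy_tidy <- pivot_longer(
  dairy_untidy,
  cols = c(liters, grams),
  names_to = "unit",
  values_to = "quantity",
  values_drop_na = TRUE
)

dairy_tidy
```

After this step each dairy product occupies exactly one row, so looking up a quantity no longer requires inspecting column names.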
## When multiple variables are stored in one column -The `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. This format makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. +Let us consider the following untidy version of our `ingredient` table. -```{r} -tibble( - type = c("Flour", "Sugar", "Butter", "Eggs", "Milk", "Salt", "Olive Oil", "Tomatoes", "Chicken", "Rice"), - quantity_and_unit = c("500 grams", "200 grams", "100 grams", "4 units", "1 liter", "10 grams", "0.2 liters", "300 grams", "400 grams", "250 grams") -) -``` +| type | quantity_and_unit | +|-----------|-------------------| +| flour | 500 grams | +| sugar | 200 grams | +| butter | 100 grams | +| eggs | 4 units | +| milk | 1 liter | +| salt | 10 grams | +| olive oil | 0.2 liters | +| tomatoes | 300 grams | +| chicken | 400 grams | +| rice | 250 grams | + +This one is really annoying, since the `quantity_and_unit` column combines both the quantity and the unit of measurement into one string for each ingredient. Why is this an issue? This format actually makes it harder to perform numerical operations on the quantities or to filter or aggregate the data based on the unit of measurement. So in practice, we would actually start our data analysis by splitting out the `quantity_and_unit` column into `quantity` and `unit`. ## When variables are stored in both rows and columns -The quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient. +Let us extend our kitchen analogy by additionally considering recipes. For simplicity, a recipe just denotes how much of each ingredient is required. 
The following table contains two variants of a recipe for pancakes: -```{r} -tibble( - ingredient = c("Flour", "Sugar", "Butter", "Eggs", "Milk"), - recipe1_quantity = c("500 grams", "200 grams", "100 grams", "4 units", "1 liter"), - recipe2_quantity = c("300 grams", "150 grams", "50 grams", "3", "0.5 liters") -) -``` +| ingredient | recipe1_quantity | recipe2_quantity | +|------------|------------------|------------------| +| flour | 500 grams | 300 grams | +| sugar | 200 grams | 150 grams | +| butter | 100 grams | 50 grams | +| eggs | 4 units | 3 units | +| milk | 1 liter | 0.5 liters | -To convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity. +The quantity for each ingredient for two different recipes is stored in separate columns. This structure makes it harder to perform operations like filtering or summarizing the data by recipe or ingredient. + +To convert this data to a tidy format, you would typically want to gather the quantities into a single column, and include additional columns to specify the recipe and unit of measurement for each quantity. We can then filter or summarize the data by recipe or ingredient. ## When there are multiple types of data in the same column -The table is trying to describe a recipe but combines different types of data within the same columns. There are ingredients with their quantities, a utensil, and cooking time, all mixed together. +A recipe typically contains information on the required utensils and how much time a step requires. 
Consider the following table with different types of data: -```{r} -tibble( - type = c("Flour", "Butter", "Whisk", "Sugar", "Baking Time"), - quantity = c("500 grams", "100 grams", "1", "200 grams", "30 minutes"), - category = c("Ingredient", "Ingredient", "Utensil", "Ingredient", "Time") -) -``` +| type | quantity | category | +|--------------|-------------|------------| +| flour | 500 grams | ingredient | +| butter | 100 grams | ingredient | +| whisk | 1 unit | utensil | +| sugar | 200 grams | ingredient | +| baking time | 30 minutes | time | + +The table is trying to describe a recipe but combines different types of data within the same columns. There are ingredients with their quantities, a utensil, and cooking time, all mixed together. A tidy approach would typically separate these different types of data into separate tables or at least into distinct sets of columns, making it clear what each part of the data represents and facilitating further analysis and visualization. ## When some data is missing +As a last example for untidy data, let us consider the original `ingredient` table again, but with a few empty cells. + Key points: - Huge difference between NA and 0 (or any other value) - Are you sure that you don't have the ingredient or do you just don't know? - Missing are dropped in filters -```{r} -tibble( - type = c("Flour", "Sugar", "Butter", "Eggs", "Milk", "Salt", "Olive Oil", "Tomatoes", "Chicken", NA), - quantity = c(NA, 200, 100, 4, 1, 10, 0.2, 300, 400, 250), - unit = c("grams", "grams", "grams", "units", NA, "grams", "liters", "grams", "grams", "grams") -) -``` \ No newline at end of file +| name | quantity | unit | +|-----------|----------|--------| +| flour | | grams | +| sugar | 200 | grams | +| butter | 100 | grams | +| eggs | 4 | units | +| milk | 10 | | +| salt | 10 | grams | +| olive oil | 0.2 | liters | +| tomatoes | 300 | grams | +| chicken | 400 | grams | +| | 250 | grams | + +What is the issue here? 
There are actually a couple of them: + +- The `flour` row does not have any information about `quantity`, so we just don't know how much we have. +- The `milk` row does not contain a `unit`, so we might have 10 liters, 10 milliliters, or 10 cups of milk. +- The last row does not have any `name`, so we have 250 grams of something that we just can't identify. + +Why is this important? It makes a huge difference how we treat the missing information. For instance, we might make an educated guess for milk: if we always record that information in liters, then the missing unit is very likely liters. For flour, we could play it safe and just say that the available quantity is zero. For the ingredient without a name, we might have to throw it away or ask somebody else to tell us what it is. + +Overall, these examples highlight the most important issues that you might have to consider when preparing data for your analysis. \ No newline at end of file
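
The three decisions about missing values can be made explicit in code. This is a minimal sketch, assuming the `dplyr` and `tibble` packages are installed; the four-row `ingredient` excerpt and the names `in_stock` and `ingredient_clean` are hypothetical, and the replacement rules simply mirror the choices discussed above rather than a general recommendation.

```r
library(dplyr)
library(tibble)

# Hypothetical excerpt of the ingredient table with the three kinds of gaps
ingredient <- tibble(
  name     = c("flour", "sugar", "milk", NA),
  quantity = c(NA, 200, 10, 250),
  unit     = c("grams", "grams", NA, "grams")
)

# A comparison with NA yields NA, and filter() keeps only rows where the
# condition is TRUE, so the flour row silently disappears here
in_stock <- filter(ingredient, quantity > 0)

# Spell out each decision instead of relying on that default behaviour
ingredient_clean <- ingredient %>%
  filter(!is.na(name)) %>%  # unidentifiable ingredient: drop it
  mutate(
    quantity = coalesce(quantity, 0),  # play it safe: unknown means none
    unit = if_else(name == "milk" & is.na(unit), "liters", unit)  # educated guess
  )

ingredient_clean
```

Comparing `in_stock` with `ingredient_clean` shows the difference between an `NA` and a zero: the first drops flour without any notice, the second keeps it with a quantity that was chosen deliberately.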