Commit a0b26ca: Reviewed posts/data-storage-comparison
christophscheuch committed Jan 20, 2024
1 parent b8135c3
Showing 14 changed files with 255 additions and 52 deletions.
66 changes: 66 additions & 0 deletions .Rhistory
@@ -426,3 +426,69 @@ JuliaCall::julia_setup()
JuliaCall::julia_setup("/Applications/Julia-1.9.app/Contents/Resources/julia/bin/julia")
JuliaCall::julia_setup("/Applications/Julia-1.9.app/Contents/Resources/julia/bin/julia/bin/")
JuliaCall::julia_setup("/Applications/Julia-1.9.app/Contents/Resources/julia/bin/")
#| message: false
library(dplyr)
data_r <- tibble(
  character_column = c("A", "B", "C", "D"),
  date_column = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01")),
  datetime_column = as.POSIXct(c("2023-01-01 10:00:00", "2023-02-01 11:00:00", "2023-03-01 12:00:00", "2023-04-01 13:00:00")),
  numeric_column = c(1.5, 2.5, 3.5, 4.5),
  integer_column = as.integer(c(1, 2, 3, 4)),
  logical_column = c(TRUE, FALSE, FALSE, TRUE)
)
extract_column_classes <- function(df) {
  sapply(sapply(df, class), function(x) paste(x, collapse = ", "))
}
tibble("data_r" = extract_column_classes(data_r))
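The Python chunks of the post are not part of this history file; a minimal pandas sketch of the same construction (hypothetical variable names, mirroring the R tibble's six column types) could look like this:

```python
import pandas as pd

# Mirror the R tibble: character, date, datetime, numeric, integer, logical
data_python = pd.DataFrame({
    "character_column": ["A", "B", "C", "D"],
    "date_column": pd.to_datetime(
        ["2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"]
    ).date,  # plain datetime.date objects -> object dtype
    "datetime_column": pd.to_datetime(
        ["2023-01-01 10:00:00", "2023-02-01 11:00:00",
         "2023-03-01 12:00:00", "2023-04-01 13:00:00"]
    ),
    "numeric_column": [1.5, 2.5, 3.5, 4.5],
    "integer_column": pd.Series([1, 2, 3, 4], dtype="int64"),
    "logical_column": [True, False, False, True],
})

# The pandas analogue of extract_column_classes(): dtype names per column
column_types = data_python.dtypes.astype(str).to_dict()
```

`DataFrame.dtypes` plays the role that `sapply(df, class)` plays in R: both are compared column by column throughout the post.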
reticulate::repl_python()
#| message: false
library(readr)
write_csv(data_r, file = "data_r.csv")
data_r_csv <- read_csv("data_r.csv")
glimpse(data_r_csv)
#| message: false
data_python_csv <- read_csv("data_python.csv")
reticulate::repl_python()
#| message: false
data_python_csv <- read_csv("data_python.csv")
tibble(
  "data_r_csv" = extract_column_classes(data_r_csv),
  "data_python_csv" = extract_column_classes(data_python_csv)
)
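The type loss this comparison checks for can be reproduced on the Python side alone (a hedged sketch using only pandas; names are illustrative): CSV stores everything as text, so date columns survive the round trip only if the reader re-parses them.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "date_column": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "numeric_column": [1.5, 2.5],
})

csv_text = df.to_csv(index=False)

# Without parse_dates, dates come back as plain strings (object dtype)
naive = pd.read_csv(io.StringIO(csv_text))

# Only an explicit parse_dates restores a datetime column
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_column"])
```

This is the same behaviour `read_csv()` from readr guesses its way around in R: the type information lives in the reader, not in the file.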
reticulate::repl_python()
library(RSQLite)
con_sqlite_r <- dbConnect(SQLite(), "data_r.sqlite")
dbWriteTable(con_sqlite_r, "data", data_r, overwrite = TRUE)
data_r_sqlite <- dbReadTable(con_sqlite_r, "data")
dbDisconnect(con_sqlite_r)
glimpse(data_r_sqlite)
con_sqlite_python <- dbConnect(SQLite(), "data_python.sqlite")
data_python_sqlite <- dbReadTable(con_sqlite_python, "data")
reticulate::repl_python()
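The SQLite behaviour being probed here can be illustrated with Python's standard library alone (a hypothetical sketch, no third-party packages): SQLite has no native date or boolean storage classes, so those columns round-trip as TEXT and INTEGER regardless of the declared column type.

```python
import sqlite3
import datetime as dt

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE data (date_column DATE, logical_column BOOLEAN, numeric_column REAL)"
)
# Store the date as an ISO string; sqlite3 adapts True to the integer 1
con.execute(
    "INSERT INTO data VALUES (?, ?, ?)",
    (dt.date(2023, 1, 1).isoformat(), True, 1.5),
)
row = con.execute("SELECT * FROM data").fetchone()
con.close()
```

The declared types `DATE` and `BOOLEAN` are only affinity hints, which is why both R and Python readers have to reconstruct these column types themselves.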
#| message: false
library(arrow)
write_parquet(data_r, "data_r.parquet")
data_r_parquet <- read_parquet("data_r.parquet")
glimpse(data_r_parquet)
?write_feather()
#| message: false
write_feather(data_r, "data_r.feather")
#| message: false
write_feather(data_r, "data_r.feather")
data_r_feather <- read_feather("data_r.feather")
glimpse(data_r_parquet)
glimpse(data_r_feather)
reticulate::repl_python()
data_python_feather <- read_feather("data_python.feather")
tibble(
  "data_r_feather" = extract_column_classes(data_r_feather),
  "data_python_feather" = extract_column_classes(data_python_feather)
)
data_python_feather
reticulate::repl_python()
#| message: false
library(arrow)
write_parquet(data_r, "data_r.parquet")
data_r_parquet <- read_parquet("data_r.parquet")
glimpse(data_r_parquet)
reticulate::repl_python()

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/index.html
@@ -206,7 +206,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
- <div class="g-col-1" data-index="0" data-listing-date-sort="1705618800000" data-listing-file-modified-sort="1705664169329" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="12" data-listing-word-count-sort="2280">
+ <div class="g-col-1" data-index="0" data-listing-date-sort="1705705200000" data-listing-file-modified-sort="1705750762210" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="13" data-listing-word-count-sort="2494">
<a href="./posts/data-storage-comparison/index.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top">
134 changes: 112 additions & 22 deletions docs/posts/data-storage-comparison/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/posts/ggplot2-vs-seaborn/index.html
@@ -1005,7 +1005,7 @@ <h1>Conclusion</h1>
</div>
</div>
</footer>
- <script>var lightboxQuarto = GLightbox({"selector":".lightbox","loop":false,"openEffect":"zoom","descPosition":"bottom","closeEffect":"zoom"});
+ <script>var lightboxQuarto = GLightbox({"loop":false,"openEffect":"zoom","closeEffect":"zoom","descPosition":"bottom","selector":".lightbox"});
window.onload = () => {
lightboxQuarto.on('slide_before_load', (data) => {
const { slideIndex, slideNode, slideConfig, player, trigger } = data;
2 changes: 1 addition & 1 deletion docs/search.json
@@ -599,7 +599,7 @@
"href": "posts/data-storage-comparison/index.html",
"title": "Tidy Data: Tabular Data Storage Comparison",
"section": "",
"text": "Sharing data between different collaborators, machines, or programming languages can be cumbersome. In this post, we look into the issue of column types and how different storage technologies handle them. We focus on self-contained technologies that are easy to install and run on your machine without setting up a separate backend server. This requirement typically arises in academic contexts, educational settings, or when you quickly want to prototype something without spending time on setting up a data backend.\nWe start with simple CSV, then move on to the popular SQLite database before we look at the rising star DuckDB. We close the comparison with a look at the Parquet file format. We always check how the column type depends on the language that is used to store the data in the corresponding storage technology."
"text": "Sharing data between different collaborators, machines, or programming languages can be cumbersome for many reasons. In this post, I look into the issue of column types and how different storage technologies handle them. I focus on self-contained technologies that are easy to install and run on your machine without setting up a separate backend server. This requirement typically arises in academic contexts, educational settings, or when you quickly want to prototype something without spending time on setting up a data backend.\nI start with simple CSV, then move on to the popular SQLite database before I look at the rising star DuckDB. We close the comparison with a look at the Parquet and Feather file formats. I always check how the column type depends on the language that is used to store the data in the corresponding storage technology."
},
{
"objectID": "posts/data-storage-comparison/index.html#footnotes",
2 changes: 1 addition & 1 deletion docs/sitemap.xml
@@ -50,6 +50,6 @@
</url>
<url>
<loc>https://blog.tidy-intelligence.com/posts/data-storage-comparison/index.html</loc>
<lastmod>2024-01-19T11:36:09.329Z</lastmod>
<lastmod>2024-01-20T11:39:22.210Z</lastmod>
</url>
</urlset>
Binary file modified posts/data-storage-comparison/data_python.duckdb
Binary file not shown.
Binary file added posts/data-storage-comparison/data_python.feather
Binary file not shown.
Binary file modified posts/data-storage-comparison/data_python.sqlite
Binary file not shown.
Binary file modified posts/data-storage-comparison/data_r.duckdb
Binary file not shown.
Binary file added posts/data-storage-comparison/data_r.feather
Binary file not shown.
Binary file modified posts/data-storage-comparison/data_r.sqlite
Binary file not shown.
