Commit

Update datasets. (#77)
I intend to move these out of the package before pushing an update to CRAN, but I did an update while working on {gutenbergr.data}, so we might as well save it. On my current PC, this took *8.25 hours.*
jonthegeek authored Sep 15, 2024
1 parent a39d2e5 commit 26a6154
Showing 5 changed files with 3 additions and 13 deletions.
16 changes: 3 additions & 13 deletions data-raw/parse_rdfs.R
@@ -1,10 +1,10 @@
library(fs)
library(dplyr)
library(gutenbergr)
library(here)
library(purrr)
library(stringr)
library(tibble)
# library(tictoc)
library(xml2)

source(here::here("data-raw", "parsers.R"))
@@ -14,6 +14,7 @@ source(here::here("data-raw", "parsers.R"))
# this timestamp yet, other than to inform users.
updated <- lubridate::date(lubridate::now(tzone = "UTC"))

# tictoc::tic()
cache_dir <- download_raw_data()
rdf_paths <- unname(fs::dir_ls(cache_dir, recurse = TRUE, glob = "*.rdf"))

@@ -40,18 +41,6 @@ new_gutenberg_subjects <- purrr::map_dfr(all_metadata, ~ .x$subjects) |>
  dplyr::distinct() |>
  dplyr::arrange(gutenberg_id)

# waldo::compare(nrow(gutenberg_authors), nrow(new_gutenberg_authors))
# waldo::compare(nrow(gutenberg_subjects), nrow(new_gutenberg_subjects))
# waldo::compare(nrow(gutenberg_languages), nrow(new_gutenberg_languages))
# waldo::compare(nrow(gutenberg_metadata), nrow(new_gutenberg_metadata))
# dplyr::distinct(new_gutenberg_metadata, gutenberg_id, has_text) |>
#   dplyr::left_join(
#     dplyr::distinct(gutenberg_metadata, gutenberg_id, has_text),
#     by = "gutenberg_id"
#   ) |>
#   dplyr::filter(has_text.x != has_text.y) |>
#   dplyr::filter(!has_text.x)

gutenberg_authors <- new_gutenberg_authors
gutenberg_subjects <- new_gutenberg_subjects
gutenberg_languages <- new_gutenberg_languages
@@ -87,3 +76,4 @@ rm(
  parse_subject,
  updated
)
# tictoc::toc()
Binary file modified data/gutenberg_authors.rda
Binary file not shown.
Binary file modified data/gutenberg_languages.rda
Binary file not shown.
Binary file modified data/gutenberg_metadata.rda
Binary file not shown.
Binary file modified data/gutenberg_subjects.rda
Binary file not shown.