From 890b4ae8e9bcf9d94ec0c431a7e32b74c1f9f74a Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 13 Nov 2024 17:05:07 -0500 Subject: [PATCH 001/114] DEV-458 end-to-end automated test for metadata workflow - README updates on standard directory locations - README updates on daily script inputs and outputs --- README.md | 75 ++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 63 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 1ec809f..9fb8065 100644 --- a/README.md +++ b/README.md @@ -29,13 +29,50 @@ docker compose run --rm test bundle exec standardrb docker compose run --rm test bundle exec rspec ``` -run_process_zephir_incremental.sh (daily) -========================================= +## Standard Locations + +Post-Zephir can read and write files in a number of locations, and it can become bewildering. +Many of the locations (all of them directories) show up again and again. Under Argo these +all come from the `ENV` provided to the workflow. Under Docker the locations are not so scattered, +and all orient themselves to `ENV[ROOTDIR]`. The shell scripts rely on `config/defaults` to fill +in many of these variables. The Ruby scripts orient off the `DATA_ROOT` in `Dockerfile` +but fill in the other locations in a more haphazard manner (see the `directory_for` method in +`lib/derivatives.rb` for an example of how this can go off the rails). + +TODO: can we use `dotenv` and `.env` in both the shell scripts and the Ruby code, and get rid of +`config/defaults`? Or can we translate `config/defaults` into Ruby and invoke it from the driver? 
+ +| `ENV` | Standard Location | Docker/Default Location | +| -------- | ------- | ----- | +| `CATALOG_ARCHIVE` | `/htapps/archive/catalog` | `DATA_ROOT/catalog_archive` | +| `CATALOG_PREP` | `/htsolr/catalog/prep` | `DATA_ROOT/catalog_prep` | +| `DATA_ROOT` | `/htprep/zephir` | `ROOTDIR/data` | +| `FEDDOCS_HOME` | `/htprep/govdocs` | `DATA_ROOT/govdocs` | +| `INGEST_BIBRECORDS` | `/htapps/babel/feed/var/bibrecords` | `DATA_ROOT/ingest_bibrecords` | +| `RIGHTS_DIR` | `/htapps/babel/feed/var/rights` | `DATA_ROOT/rights` | +| `ROOTDIR` | (not used) | `/usr/src/app` | + +Additional derivative paths are set by `config/defaults`, typically from the daily or monthly shell script. +As such they are not available to Ruby code. (Note: there may be some fuzziness between these two +sets since we may decide to let Argo handle one or more of these in future. Look to the Argo metadata +workflow config for authoritative values.) + +| `ENV` | Standard Location | Docker/Default Location | Note | +| -------- | ------- | ----- | ---- | +| `REPORTS` | `DATA_ROOT/reports` | `DATA_ROOT/reports` | *unused* | +| `RIGHTS_DBM` | `DATA_ROOT/rights_dbm` | `DATA_ROOT/rights_dbm` | *this is a file* | +| `TMPDIR` | `DATA_ROOT/work` | `/tmp` | | +| `ZEPHIR_DATA` | `DATA_ROOT/zephir` | `DATA_ROOT/zephir` | | + + + +## `run_process_zephir_incremental.sh` (daily) + * Process daily file of new/updated/deleted metadata provided by Zephir * Send deleted bib record IDs (provided by Zephir) to Bill * "Clean up" zephir records * (re)determine bibliographic rights - + Write new/updated bib rights to file for Aaron's process to pick up and update the rights db (Why: possibly because of limited permissions on the rights database) + + Write new/updated bib rights to file for `populate_rights_data.pl` to pick up and update the rights db * File of processed new/updated records is copied to an HT server for Bill to index in the catalog * Retrieves full bib metadata file from zephir and runs 
run_zephir_full_monthly.sh. (Why?) @@ -47,18 +84,32 @@ Data In ------- * `ht_bib_export_incr_YYYY-MM-DD.json.gz` (incremental updates from Zephir, `ftps_zephir_get`) * `vufind_removed_cids_YYYY-MM-DD.txt.gz` (CIDs that have gone away, `ftps_zephir_get`) -* `/tmp/rights_dbm` (taken from `ht_rights.rights_current` table in the rights database) -* `us_cities.db` (dependency for `bib_rights.pm`) -* `us_fed_pub_exception_file` (dependency for `bib_rights.pm`, `/htdata/govdocs/feddocs_oclc_filter/`) +* `DATA_ROOT/rights_dbm` (taken from `ht_rights.rights_current` table in the rights database) +* `ROOTDIR/data/us_cities.db` (dependency for `bib_rights.pm`) +* `ENV[us_fed_pub_exception_file]` (optional dependency for `bib_rights.pm`) Data Out -------- -* `debug_current.txt` (what and why for this?) -* `zephir_upd_YYYYMMDD.rights` - picked up hourly by https://github.com/hathitrust/feed_internal/blob/master/feed.hourly/populate_rights_data.pl and loaded into the `rights_current` table. Will be placed directly in /htapps/babel/feed/var/rights and will remove the scp logic from populate_rights_data.pl -* `zephir_upd_YYYYMMDD_delete.txt.gz` will be moved to /htsolr/catalog/prep. Used by the catalog to process deletes. -* `zephir_upd_YYYYMMDD_dollar_dup.txt `(generated by post_zephir_cleanup.pl, gets sent to Zephir, ftps_zephir_send, Zephir uningests these duplicate records) -* `zephir_upd_YYYYMMDD.json.gz` will be sent to /htsolr/catalog/prep for [catalog indexing](https://github.com/hathitrust/hathitrust_catalog_indexer) -* `zephir_full_monthly_rpt.txt` Does anyone need this? + +Many files are named based on the `BASENAME` variable which is "zephir_upd_YYYYMMDD." Files are typically created in +`TMPDIR` and moved/renamed from there. + +AFAICT, Verifier should only be interested in files outside `TMPDIR`, with the possible exception of +`TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz`. 
+ +| File | Notes | +| -------- | ----- | +| `CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz` | Created as `TMPDIR/BASENAME_all_delete.txt.gz` | +| `CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz` | From `postZephir.pm`: gzipped and copied (not moved) by shell script | +| `RIGHTS_DIR/zephir_upd_YYYYMMDD.rights` | From `postZephir.pm`: moved from `TMPDIR` | +| `ROOTDIR/data/zephir/debug_current.txt` | _Commented out at end of monthly script. Should be removed._ | +| `TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz` | Created as `TMPDIR/BASENAME_dollar_dup.txt`, renamed and sent to Zephir | +| `TMPDIR/zephir_upd_YYYYMMDD_delete.txt` | From `postZephir.pm`: usually empty list of 974-less CIDs, merged with `vufind_removed_cids` | +| `TMPDIR/zephir_upd_YYYYMMDD.rights.debug` | From `postZephir.pm`, _if no one is using this it should be removed_ | +| `TMPDIR/zephir_upd_YYYYMMDD_rpt.txt` | Log data from `postZephir.pm` | +| `TMPDIR/zephir_upd_YYYYMMDD_stderr` | `STDERR` from `postZephir.pm`, _if no one is using this it should be removed_ | +| `TMPDIR/zephir_upd_YYYYMMDD_zephir_delete.txt` | Intermediate file from `vufind_removed_cids_...` before merge with our deletes, _remove?_ | + Perl script dependencies ------------------------ From 066806b830f30ca7baa32b6aa785b17e696c1484 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 14 Nov 2024 12:17:57 -0500 Subject: [PATCH 002/114] Minor README adjustments. --- README.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 9fb8065..a0198ee 100644 --- a/README.md +++ b/README.md @@ -69,12 +69,12 @@ workflow config for authoritative values.) ## `run_process_zephir_incremental.sh` (daily) * Process daily file of new/updated/deleted metadata provided by Zephir -* Send deleted bib record IDs (provided by Zephir) to Bill -* "Clean up" zephir records +* Send deleted bib record IDs (provided by Zephir) to catalog indexer +* "Clean up" zephir records (what does this mean?) 
* (re)determine bibliographic rights + Write new/updated bib rights to file for `populate_rights_data.pl` to pick up and update the rights db -* File of processed new/updated records is copied to an HT server for Bill to index in the catalog -* Retrieves full bib metadata file from zephir and runs run_zephir_full_monthly.sh. (Why?) +* File of processed new/updated records is copied to a location for the catalog indexer to find it +* Retrieves full bib metadata file from zephir and runs `run_zephir_full_monthly.sh`. (It does?? I don't think so.) Why? ---- @@ -99,8 +99,9 @@ AFAICT, Verifier should only be interested in files outside `TMPDIR`, with the p | File | Notes | | -------- | ----- | -| `CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz` | Created as `TMPDIR/BASENAME_all_delete.txt.gz` | +| `CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz` | From `postZephir.pm`: gzipped and copied (not moved) by shell script | | `CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz` | From `postZephir.pm`: gzipped and copied (not moved) by shell script | +| `CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz` | Created as `TMPDIR/BASENAME_all_delete.txt.gz` combining two files (see below) | | `RIGHTS_DIR/zephir_upd_YYYYMMDD.rights` | From `postZephir.pm`: moved from `TMPDIR` | | `ROOTDIR/data/zephir/debug_current.txt` | _Commented out at end of monthly script. Should be removed._ | | `TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz` | Created as `TMPDIR/BASENAME_dollar_dup.txt`, renamed and sent to Zephir | From a6093f33ed07113b386884fba1fb69d8f8978c52 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 14 Nov 2024 12:18:43 -0500 Subject: [PATCH 003/114] Add log level and DATA_ROOT to docker-compose. 
--- docker-compose.yml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docker-compose.yml b/docker-compose.yml index 56c7850..d8c5fa4 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -19,7 +19,9 @@ services: mariadb: *healthy pushgateway: *healthy environment: + DATA_ROOT: "/usr/src/app/data" DB_CONNECTION_STRING: "mysql2://ht_rights:ht_rights@mariadb/ht" + POST_ZEPHIR_PROCESSING_LOGGER_LEVEL: "1" PUSHGATEWAY: "http://pushgateway:9091" command: - /bin/bash @@ -33,7 +35,9 @@ services: mariadb: *healthy pushgateway: *healthy environment: + - DATA_ROOT=/usr/src/app/data - DB_CONNECTION_STRING="mysql2://ht_rights:ht_rights@mariadb/ht" + - POST_ZEPHIR_PROCESSING_LOGGER_LEVEL=1 - PUSHGATEWAY="http://pushgateway:9091" # pass through info needed by coveralls uploader - GITHUB_TOKEN From 88cafaacea02eb41202fa006f3e69b6848805c4b Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 14 Nov 2024 12:19:11 -0500 Subject: [PATCH 004/114] Add Canister and Climate Control gems; add logger service. 
--- Gemfile | 3 +++ Gemfile.lock | 4 ++++ lib/services.rb | 12 ++++++++++++ 3 files changed, 19 insertions(+) create mode 100644 lib/services.rb diff --git a/Gemfile b/Gemfile index 3b5720a..58b520b 100644 --- a/Gemfile +++ b/Gemfile @@ -2,7 +2,10 @@ source "https://rubygems.org" +gem "canister" + group :development, :test do + gem "climate_control" gem "pry" gem "rspec" gem "simplecov" diff --git a/Gemfile.lock b/Gemfile.lock index 28f02b3..4655d69 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -2,6 +2,8 @@ GEM remote: https://rubygems.org/ specs: ast (2.4.2) + canister (0.9.2) + climate_control (1.2.0) coderay (1.1.3) diff-lcs (1.5.1) docile (1.4.1) @@ -78,6 +80,8 @@ PLATFORMS ruby DEPENDENCIES + canister + climate_control pry rspec simplecov diff --git a/lib/services.rb b/lib/services.rb new file mode 100644 index 0000000..38aa6bc --- /dev/null +++ b/lib/services.rb @@ -0,0 +1,12 @@ +# frozen_string_literal: true + +require "canister" +require "logger" + +module PostZephirProcessing + Services = Canister.new + + Services.register(:logger) do + Logger.new($stdout, level: ENV.fetch("POST_ZEPHIR_PROCESSING_LOGGER_LEVEL", Logger::WARN).to_i) + end +end From 270f4282a7d534134cb09e6f623c0317651d8c77 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 14 Nov 2024 12:46:30 -0500 Subject: [PATCH 005/114] Add Journal class and write it from post_zephir.rb. 
--- bin/post_zephir.rb | 17 +++++--- lib/journal.rb | 38 +++++++++++++++++ spec/spec_helper.rb | 1 + spec/unit/journal_spec.rb | 86 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 137 insertions(+), 5 deletions(-) create mode 100644 lib/journal.rb create mode 100644 spec/unit/journal_spec.rb diff --git a/bin/post_zephir.rb b/bin/post_zephir.rb index cec2ee4..bc2ad7e 100755 --- a/bin/post_zephir.rb +++ b/bin/post_zephir.rb @@ -9,6 +9,7 @@ require_relative "../lib/dates" require_relative "../lib/derivatives" +require_relative "../lib/journal" def run_system_command(command) LOGGER.info command @@ -22,18 +23,24 @@ def run_system_command(command) YESTERDAY = Date.today - 1 inventory = PostZephirProcessing::Derivatives.new(date: YESTERDAY) - -if inventory.earliest_missing_date.nil? - LOGGER.info "no Zephir files to process, exiting" - exit 0 +dates = [] +# Is there a missing date? Plug them into an array to process. +if !inventory.earliest_missing_date.nil? + dates = (inventory.earliest_missing_date..YESTERDAY) end -dates = (inventory.earliest_missing_date..YESTERDAY) LOGGER.info "Processing Zephir files from #{dates}" dates.each do |date| date_str = date.strftime("%Y%m%d") + LOGGER.info "Processing Zephir file for #{date_str}" if date.last_of_month? run_system_command "#{FULL_SCRIPT} #{date_str}" end run_system_command "#{INCREMENTAL_SCRIPT} #{date_str}" end + +# Record our work for the verifier +LOGGER.info "Writing journal for #{dates}" +# TODO: consider moving the `to_a` to the Journal initializer so it can take +# Ranges as well as Arrays +PostZephirProcessing::Journal.new(dates: dates.to_a).write! 
diff --git a/lib/journal.rb b/lib/journal.rb new file mode 100644 index 0000000..3cc3391 --- /dev/null +++ b/lib/journal.rb @@ -0,0 +1,38 @@ +# frozen_string_literal: true + +require "date" +require "yaml" + +module PostZephirProcessing + class Journal + JOURNAL_NAME = "journal.yml" + attr_reader :dates + + def self.from_yaml + new(dates: YAML.load_file(destination_path)) + end + + # It is okay to clobber last run's journal. The journal is not datestamped. + def self.destination_path + File.join(ENV["DATA_ROOT"], JOURNAL_NAME) + end + + # It is okay to initialize and write a journal with no dates. + # Can be called with a Range as long as it is bounded. + def initialize(dates: []) + @dates = dates.map do |date| + date.is_a?(String) ? Date.parse(date) : date + end.sort + end + + def write!(path: self.class.destination_path) + File.write(path, @dates.map(&method(:to_yyyymmdd)).to_yaml) + end + + private + + def to_yyyymmdd(date) + date.strftime "%Y%m%d" + end + end +end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 56950df..710e21f 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -18,6 +18,7 @@ require_relative "../lib/dates" require_relative "../lib/derivatives" +require_relative "../lib/journal" ENV["POST_ZEPHIR_LOGGER_LEVEL"] = Logger::WARN.to_s diff --git a/spec/unit/journal_spec.rb b/spec/unit/journal_spec.rb new file mode 100644 index 0000000..a163e68 --- /dev/null +++ b/spec/unit/journal_spec.rb @@ -0,0 +1,86 @@ +# frozen_string_literal: true + +require "climate_control" +require "tmpdir" + +module PostZephirProcessing + RSpec.describe(Journal) do + around(:each) do |example| + Dir.mktmpdir do |tmpdir| + ClimateControl.modify DATA_ROOT: tmpdir do + @tmpdir = tmpdir + example.run + end + end + end + + let(:with_no_dates) { described_class.new } + let(:unsorted_dates) { [Date.today, Date.today + 1, Date.today - 100] } + let(:range_of_dates) { (Date.today..Date.today + 1) } + let(:with_dates) { described_class.new(dates: 
unsorted_dates) } + let(:with_range) { described_class.new(dates: range_of_dates) } + let(:test_yaml) { + <<~TEST_YAML + --- + - '20500101' + - '20500102' + TEST_YAML + } + let(:test_yaml_dates) { [Date.new(2050, 1, 1), Date.new(2050, 1, 2)] } + + describe ".destination_path" do + it "contains the current DATA_ROOT" do + expect(described_class.destination_path).to match(@tmpdir) + end + end + + describe ".from_yaml" do + it "produces a Journal with the expected dates" do + File.write(described_class.destination_path, test_yaml) + expect(described_class.from_yaml).to be_an_instance_of(Journal) + end + end + + describe ".new" do + context "with default empty dates" do + it "creates a Journal" do + expect(with_no_dates).to be_an_instance_of(Journal) + end + end + + context "with explicit dates" do + it "creates a Journal" do + expect(with_dates).to be_an_instance_of(Journal) + end + end + + context "with a date range" do + it "creates a Journal" do + expect(with_range).to be_an_instance_of(Journal) + end + end + end + + describe "#dates" do + context "with default empty dates" do + it "returns an empty Array" do + expect(with_no_dates.dates).to eq([]) + end + end + + context "with explicit dates" do + it "returns a sorted array" do + expect(with_dates.dates).to eq(unsorted_dates.sort) + end + end + end + + describe "#write!" do + it "writes one YAML file to DATA_ROOT" do + with_dates.write! + expect(Dir.children(@tmpdir).count).to eq(1) + expect(Dir.children(@tmpdir)[0]).to match(Journal::JOURNAL_NAME) + end + end + end +end From cccf97a1323fd7b651bca3e75e2ed4e15478994a Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 14 Nov 2024 12:47:59 -0500 Subject: [PATCH 006/114] Reconcile README and updated default TMPDIR.
--- README.md | 21 +++++++++++---------- config/defaults | 2 +- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index a0198ee..0a17e8f 100644 --- a/README.md +++ b/README.md @@ -53,16 +53,17 @@ TODO: can we use `dotenv` and `.env` in both the shell scripts and the Ruby code | `ROOTDIR` | (not used) | `/usr/src/app` | Additional derivative paths are set by `config/defaults`, typically from the daily or monthly shell script. -As such they are not available to Ruby code. (Note: there may be some fuzziness between these two -sets since we may decide to let Argo handle one or more of these in future. Look to the Argo metadata -workflow config for authoritative values.) - -| `ENV` | Standard Location | Docker/Default Location | Note | -| -------- | ------- | ----- | ---- | -| `REPORTS` | `DATA_ROOT/reports` | `DATA_ROOT/reports` | *unused* | -| `RIGHTS_DBM` | `DATA_ROOT/rights_dbm` | `DATA_ROOT/rights_dbm` | *this is a file* | -| `TMPDIR` | `DATA_ROOT/work` | `/tmp` | | -| `ZEPHIR_DATA` | `DATA_ROOT/zephir` | `DATA_ROOT/zephir` | | +Another mechanism (`lib/derivatives.rb`) is being experimented with for the Ruby code. +(Note: there may be some fuzziness between these two sets since we may decide to let +Argo handle one or more of these in future. Look to the Argo metadata workflow config for +authoritative values.) 
+ +| `ENV` | Standard/Default/Docker Location | Note | +| -------- | ------- | ---- | +| `REPORTS` | `DATA_ROOT/reports` | *unused* | +| `RIGHTS_DBM` | `DATA_ROOT/rights_dbm` | *this is a file* | +| `TMPDIR` | `DATA_ROOT/work` | | +| `ZEPHIR_DATA` | `DATA_ROOT/zephir` | | diff --git a/config/defaults b/config/defaults index 78314d2..6aa316c 100644 --- a/config/defaults +++ b/config/defaults @@ -1,7 +1,7 @@ #!/bin/bash export DATA_ROOT=${DATA_ROOT:-$ROOTDIR/data} -export TMPDIR=${TMPDIR:-/tmp} +export TMPDIR=${TMPDIR:-$DATA_ROOT/work} # We write a lot of reports no one reads export REPORTS=${REPORTS:-$DATA_ROOT/reports} From 3f50009a4dbd44008f606d972af1a485a65e4adb Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 14 Nov 2024 12:49:41 -0500 Subject: [PATCH 007/114] Prototype Verifier. --- bin/verify.rb | 10 +++++ lib/derivatives.rb | 39 ++++++++++++------- lib/verifier.rb | 58 ++++++++++++++++++++++++++++ lib/verifier/post_zephir_verifier.rb | 44 +++++++++++++++++++++ spec/unit/derivatives_spec.rb | 18 ++++++++- 5 files changed, 155 insertions(+), 14 deletions(-) create mode 100755 bin/verify.rb create mode 100644 lib/verifier.rb create mode 100644 lib/verifier/post_zephir_verifier.rb diff --git a/bin/verify.rb b/bin/verify.rb new file mode 100755 index 0000000..5d5600d --- /dev/null +++ b/bin/verify.rb @@ -0,0 +1,10 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require_relative "../lib/verifier/post_zephir_verifier" + +[ + PostZephirProcessing::PostZephirVerifier +].each do |klass| + klass.new.run +end diff --git a/lib/derivatives.rb b/lib/derivatives.rb index 0701f45..6b4127b 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -7,7 +7,14 @@ module PostZephirProcessing # `earliest_missing_date` is the main entrypoint when constructing an agenda of Zephir # file dates to fetch for processing. 
class Derivatives + STANDARD_LOCATIONS = [ + :CATALOG_ARCHIVE, + :CATALOG_PREP, + :RIGHTS_ARCHIVE, + :TMPDIR + ].freeze # Location data for the derivatives we care about when constructing our list of missing dates. + DIR_DATA = { zephir_full: { location: :CATALOG_PREP, @@ -38,6 +45,24 @@ class Derivatives attr_reader :dates + # Translate a known file destination as an environment variable key + # into the path via ENV or a default. + # @return [String] path to the directory + def self.directory_for(location:) + case location.to_sym + when :CATALOG_ARCHIVE + ENV["CATALOG_ARCHIVE"] || "/htapps/archive/catalog" + when :CATALOG_PREP + ENV["CATALOG_PREP"] || "/htsolr/catalog/prep/" + when :RIGHTS_ARCHIVE + ENV["RIGHTS_ARCHIVE"] || "/htapps/babel/feed/var/rights/archive" + when :TMPDIR + ENV["TMPDIR"] || File.join(ENV["DATA_ROOT"], "work") + else + raise "Unknown location #{location}" + end + end + # @param date [Date] the file datestamp date, not the "run date" def initialize(date: (Date.today - 1)) @dates = Dates.new(date: date) @@ -56,23 +81,11 @@ def earliest_missing_date private - # Translate a known file destination as an environment variable key - # into the path via ENV or a default. - # @return [String] path to the directory - def directory_for(location:) - case location.to_sym - when :CATALOG_PREP - ENV["CATALOG_PREP"] || "/htsolr/catalog/prep/" - when :RIGHTS_ARCHIVE - ENV["RIGHTS_ARCHIVE"] || "/htapps/babel/feed/var/rights/archive" - end - end - # Run regexp against the contents of dir and store matching files # that have datestamps in the period of interest. # @return [Array] de-duped and sorted ASC def directory_inventory(name:) - dir = directory_for(location: DIR_DATA[name][:location]) + dir = self.class.directory_for(location: DIR_DATA[name][:location]) Dir.children(dir) .filter_map { |filename| (m = DIR_DATA[name][:pattern].match(filename)) && Date.parse(m[1]) } .select { |date| dates.all_dates.include? 
date } diff --git a/lib/verifier.rb b/lib/verifier.rb new file mode 100644 index 0000000..96d0636 --- /dev/null +++ b/lib/verifier.rb @@ -0,0 +1,58 @@ +# frozen_string_literal: true + +require_relative "journal" +require_relative "services" + +# Common superclass for all things Verifier. +# Right now the only thing I can think of to put here is shared +# code for writing whatever output file, logs, metrics, artifacts, etc. we decide on. + +module PostZephirProcessing + class Verifier + attr_reader :journal + + def self.datestamped_file(name:, date:) + name.sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) + .sub(/YYYY-MM-DD/i, date.strftime("%Y-%m-%d")) + end + + # Generally, needs a Journal in order to know what to look for. + def initialize + @journal = Journal.from_yaml + end + + # Main entrypoint + # What should it return? + # Do we want to bail out or keep going if we encounter a show-stopper? + # I'm inclined to just keep going. + def run + run_for_dates + journal.dates.each do |date| + run_for_date(date: date) + end + end + + # Subclasses can verify outputs that are not datestamped, in case we want to + # avoid running an expensive check multiple times. + # This may not be needed. + def run_for_dates(dates: journal.dates) + end + + # Verify outputs for one date in the journal. + # USeful for verifying datestamped files. + def run_for_date(date:) + end + + # Basic check(s) for the existence of the file at `path`. + # We should do whatever logging/warning we want to do if the file does + # not pass muster. + # At least call File.exist? + # What about permissions? + # Verifying contents is out of scope. + def verify_file(path:) + if !File.exist? 
path + Services[:logger].error "not found: #{path}" + end + end + end +end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb new file mode 100644 index 0000000..c8a2a6e --- /dev/null +++ b/lib/verifier/post_zephir_verifier.rb @@ -0,0 +1,44 @@ +# frozen_string_literal: true + +require_relative "../verifier" +require_relative "../derivatives" + +# Verifies that post_zephir workflow stage did what it was supposed to. + +# TODO: document and verify the files written by monthly process. +# They should be mostly the same but need to be accounted for. + +module PostZephirProcessing + class PostZephirVerifier < Verifier + # TODO: do we need to check any non-datestamped files for this date? + # Review README list of derivatives in TMPDIR + # def run_for_dates(dates: journal.dates) + # end + + def run_for_date(date:) + datestamped_derivatives(date).each do |path| + verify_file(path: path) + end + end + + private + + # TODO: see if we want to move this to Derivatives class + def datestamped_derivative(location:, name:, date:) + File.join( + Derivatives.directory_for(location: location), + self.class.datestamped_file(name: "zephir_upd_YYYYMMDD.json.gz", date: date) + ) + end + + def datestamped_derivatives(date) + [ + datestamped_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date), + datestamped_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD.json.gz", date: date), + datestamped_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date), + datestamped_derivative(location: :RIGHTS_DIR, name: "zephir_upd_YYYYMMDD.rights", date: date), + datestamped_derivative(location: :TMPDIR, name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz", date: date) + ] + end + end +end diff --git a/spec/unit/derivatives_spec.rb b/spec/unit/derivatives_spec.rb index f2db6f8..1cd9974 100644 --- a/spec/unit/derivatives_spec.rb +++ b/spec/unit/derivatives_spec.rb @@ -12,8 +12,24 @@ 
module PostZephirProcessing end end + describe ".directory_for" do + context "with known locations" do + Derivatives::STANDARD_LOCATIONS.each do |loc_name| + it "returns a string for #{loc_name}" do + expect(described_class.directory_for(location: loc_name)).to be_a(String) + end + end + end + + context "with an unknown location" do + it "raises" do + expect { described_class.directory_for(location: :NO_SUCH_LOC) }.to raise_error(StandardError) + end + end + end + describe ".new" do - it "creates a MonthlyInventory" do + it "creates a Derivatives" do expect(described_class.new).to be_an_instance_of(Derivatives) end From 9bdc2ee4388c08b60bf85e8818a36b101848f8dc Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Tue, 26 Nov 2024 16:18:56 -0500 Subject: [PATCH 008/114] Flesh out enough details for others to evaluate. --- Gemfile | 2 + Gemfile.lock | 4 + README.md | 30 ++++--- bin/verify.rb | 10 ++- config/env | 8 ++ lib/derivatives.rb | 19 +++-- lib/verifier.rb | 46 ++++++++--- lib/verifier/post_zephir_verifier.rb | 116 ++++++++++++++++++++++----- spec/spec_helper.rb | 3 + 9 files changed, 183 insertions(+), 55 deletions(-) create mode 100644 config/env diff --git a/Gemfile b/Gemfile index 58b520b..9a3e4da 100644 --- a/Gemfile +++ b/Gemfile @@ -3,6 +3,8 @@ source "https://rubygems.org" gem "canister" +gem "dotenv" +gem "zinzout" group :development, :test do gem "climate_control" diff --git a/Gemfile.lock b/Gemfile.lock index 4655d69..09d10f7 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -7,6 +7,7 @@ GEM coderay (1.1.3) diff-lcs (1.5.1) docile (1.4.1) + dotenv (3.1.4) json (2.7.2) language_server-protocol (3.17.0.3) lint_roller (1.1.0) @@ -74,6 +75,7 @@ GEM standardrb (1.0.1) standard unicode-display_width (2.6.0) + zinzout (0.1.1) PLATFORMS aarch64-linux @@ -82,11 +84,13 @@ PLATFORMS DEPENDENCIES canister climate_control + dotenv pry rspec simplecov simplecov-lcov standardrb + zinzout BUNDLED WITH 2.5.19 diff --git a/README.md b/README.md index 0a17e8f..ba29fbd 
100644 --- a/README.md +++ b/README.md @@ -85,7 +85,7 @@ Data In ------- * `ht_bib_export_incr_YYYY-MM-DD.json.gz` (incremental updates from Zephir, `ftps_zephir_get`) * `vufind_removed_cids_YYYY-MM-DD.txt.gz` (CIDs that have gone away, `ftps_zephir_get`) -* `DATA_ROOT/rights_dbm` (taken from `ht_rights.rights_current` table in the rights database) +* `DATA_ROOT/rights_dbm` (local copy of Rights DB `ht_rights.rights_current`) * `ROOTDIR/data/us_cities.db` (dependency for `bib_rights.pm`) * `ENV[us_fed_pub_exception_file]` (optional dependency for `bib_rights.pm`) @@ -101,7 +101,7 @@ AFAICT, Verifier should only be interested in files outside `TMPDIR`, with the p | File | Notes | | -------- | ----- | | `CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz` | From `postZephir.pm`: gzipped and copied (not moved) by shell script | -| `CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz` | From `postZephir.pm`: gzipped and copied (not moved) by shell script | +| `CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz` | Same file as above, removed from `TMPDIR` after being copied to the two destinations | | `CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz` | Created as `TMPDIR/BASENAME_all_delete.txt.gz` combining two files (see below) | | `RIGHTS_DIR/zephir_upd_YYYYMMDD.rights` | From `postZephir.pm`: moved from `TMPDIR` | | `ROOTDIR/data/zephir/debug_current.txt` | _Commented out at end of monthly script. Should be removed._ | @@ -140,20 +140,28 @@ Previously generated the HTRC datasets. All that remains is the zephir_ingested_ Data In ------- +* `ht_bib_export_full_YYYY-MM-DD.json.gz` (monthly updates from Zephir, `ftps_zephir_get`) + Note: this file is deleted by the `unpigz` command that splits it into smaller files to process in parallel. +* Note: there are no monthly "removed CIDs" or "deletes" files; these exist only in the daily updates.
* US Fed Doc exception list `/htdata/govdocs/feddocs_oclc_filter/oclcs_removed_from_registry.txt` -* `/tmp/rights_dbm` +* `DATA_ROOT/rights_dbm` (local copy of Rights DB `ht_rights.rights_current`) * `groove_export_YYYY-MM-DD.tsv.gz` (ftps from cdlib) -* `ht_bib_export_full_YYYY-MM-DD.json.gz` + Data Out -------- -* `groove_export_YYYY-MM-DD.tsv.gz` will be moved to /htapps/babel/feed/var/bibrecords/groove_full.tsv.gz -* `zephir_full_${YESTERDAY}_vufind.json.gz` catalog archive. Indexed into catalog via the same process as for `run_process_zephir_incremental.sh` -* `zephir_full_${YESTERDAY}.rights` moved to /htapps/babel/feed/var/rights/ -* `zephir_full_${YESTERDAY}.rights.debug`, doesn't appear to be used -* `zephir_full_monthly_rpt.txt`moved to ../data/full/ -* `zephir_full_${YESTERDAY}.rights_rpt.tsv moved to ./data/full/ -* `zephir_ingested_items.txt.gz` - copied to `/htapps/babel/feed/var/bibrecords`. Used by https://github.com/hathitrust/feed_internal/blob/master/feed.monthly/zephir_diff.pl to refresh the full `feed_zephir_items` table on a monthly basis. 
+| File | Notes | +| -------- | ----- | +| `INGEST_BIBRECORDS/groove_full.tsv.gz` | Downloaded as `groove_export_YYYY-MM-DD.tsv.gz` and moved, contents are not modified | +| `INGEST_BIBRECORDS/zephir_ingested_items.txt.gz` | From `postZephir.pm`, TSV of {htid, source, collection, digitization_source, ia_id} | +| `CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz` | Concatenated from parallel-processed files, gzipped and moved by shell script | +| `CATALOG_PREP/zephir_full_YYYYMMDD_vufind.json.gz` | Same file as above, copied to `CATALOG_PREP` before being moved to `CATALOG_ARCHIVE` | +| `RIGHTS_DIR/zephir_full_YYYYMMDD.rights` | From `postZephir.pm`: moved from `TMPDIR` | +| `TMPDIR/stderr.tmp.txt` | Concatenated from subfiles' STDERR | +| `TMPDIR/zephir_full_YYYYMMDD.rights.debug` | From `postZephir.pm`, _if no one is using this it should be removed_ | +| `ZEPHIR_DATA/full/zephir_full_monthly_rpt.txt` | Concatenated from subfiles and moved from `TMPDIR` | +| `ZEPHIR_DATA/full/zephir_full_YYYYMMDD.rights_rpt.tsv` | Concatenated from subfiles and moved from `TMPDIR` | + Perl script dependencies ------------------------ diff --git a/bin/verify.rb b/bin/verify.rb index 5d5600d..dbe855f 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -1,10 +1,18 @@ #!/usr/bin/env ruby # frozen_string_literal: true +require "dotenv" require_relative "../lib/verifier/post_zephir_verifier" +Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) + [ PostZephirProcessing::PostZephirVerifier ].each do |klass| - klass.new.run + begin + klass.new.run + # Very simple minded exception handler so we can in theory check subsequent workflow steps + rescue StandardError => e + PostZephirProcessing::Services[:logger].fatal e + end end diff --git a/config/env b/config/env new file mode 100644 index 0000000..2c291bb --- /dev/null +++ b/config/env @@ -0,0 +1,8 @@ +# This is just for running/testing the Ruby components under Docker +# Under Argo these ENV variables will all be set externally
+CATALOG_ARCHIVE=/usr/src/app/data/catalog_archive +CATALOG_PREP=/usr/src/app/data/catalog_prep +DATA_ROOT=/usr/src/app/data +INGEST_BIBRECORDS=/usr/src/app/data/ingest_bibrecords +RIGHTS_DIR=/usr/src/app/data/rights +ZEPHIR_DATA=/usr/src/app/data/zephir diff --git a/lib/derivatives.rb b/lib/derivatives.rb index 6b4127b..cc8ebc3 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -49,17 +49,16 @@ class Derivatives # into the path via ENV or a default. # @return [String] path to the directory def self.directory_for(location:) - case location.to_sym - when :CATALOG_ARCHIVE - ENV["CATALOG_ARCHIVE"] || "/htapps/archive/catalog" - when :CATALOG_PREP - ENV["CATALOG_PREP"] || "/htsolr/catalog/prep/" - when :RIGHTS_ARCHIVE - ENV["RIGHTS_ARCHIVE"] || "/htapps/babel/feed/var/rights/archive" - when :TMPDIR - ENV["TMPDIR"] || File.join(ENV["DATA_ROOT"], "work") + location = location.to_s + case location + when "CATALOG_ARCHIVE", "CATALOG_PREP", "INGEST_BIBRECORDS", "RIGHTS_DIR", "ZEPHIR_DATA" + ENV.fetch location + when "RIGHTS_ARCHIVE" + ENV["RIGHTS_ARCHIVE"] || File.join(ENV.fetch("RIGHTS_DIR"), "archive") + when "TMPDIR" + ENV["TMPDIR"] || File.join(ENV.fetch("DATA_ROOT"), "work") else - raise "Unknown location #{location}" + raise "Unknown location #{location.inspect}" end end diff --git a/lib/verifier.rb b/lib/verifier.rb index 96d0636..1499efa 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -1,5 +1,6 @@ # frozen_string_literal: true +require_relative "derivatives" require_relative "journal" require_relative "services" @@ -16,6 +17,19 @@ def self.datestamped_file(name:, date:) .sub(/YYYY-MM-DD/i, date.strftime("%Y-%m-%d")) end + # TODO: see if we want to move this to Derivatives class + def self.dated_derivative(location:, name:, date:) + File.join( + Derivatives.directory_for(location: location), + datestamped_file(name: name, date: date) + ) + end + + # TODO: see if we want to move this to Derivatives class + def self.derivative(location:, name:) + 
File.join(Derivatives.directory_for(location: location), name) + end + # Generally, needs a Journal in order to know what to look for. def initialize @journal = Journal.from_yaml @@ -26,33 +40,41 @@ def initialize # Do we want to bail out or keep going if we encounter a show-stopper? # I'm inclined to just keep going. def run - run_for_dates journal.dates.each do |date| run_for_date(date: date) end end - # Subclasses can verify outputs that are not datestamped, in case we want to - # avoid running an expensive check multiple times. - # This may not be needed. - def run_for_dates(dates: journal.dates) - end - # Verify outputs for one date in the journal. # USeful for verifying datestamped files. def run_for_date(date:) end - # Basic check(s) for the existence of the file at `path`. + # Basic checks for the existence and readability of the file at `path`. # We should do whatever logging/warning we want to do if the file does # not pass muster. - # At least call File.exist? - # What about permissions? # Verifying contents is out of scope. + # Returns `true` if verified. def verify_file(path:) - if !File.exist? path - Services[:logger].error "not found: #{path}" + verify_file_exists(path: path) && verify_file_readable(path: path) + end + + def verify_file_exists(path:) + File.exist?(path).tap do |exists| + error(message: "not found: #{path}") unless exists + end + end + + def verify_file_readable(path:) + File.readable?(path).tap do |readable| + error(message: "not readable: #{path}") unless readable end end + + # I'm not sure if we're going to try to distinguish errors and warnings. + # For now let's call everything an error. 
+ def error(message:) + Services[:logger].error message + end end end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index c8a2a6e..6afb2cf 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -10,35 +10,109 @@ module PostZephirProcessing class PostZephirVerifier < Verifier - # TODO: do we need to check any non-datestamped files for this date? - # Review README list of derivatives in TMPDIR - # def run_for_dates(dates: journal.dates) - # end + attr_reader :current_date def run_for_date(date:) - datestamped_derivatives(date).each do |path| - verify_file(path: path) + @current_date = date + verify_catalog_archive + verify_catalog_prep + verify_dollar_dup + verify_groove_export + verify_ingest_bibrecords + verify_rights + verify_zephir_data + end + + # Frequency: ALL + # Files: CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz + # and potentially CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz + # Contents: TODO + # Verify: + # readable + # TODO: line count must be the same as input JSON + def verify_catalog_archive(date: current_date) + verify_file(path: self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) + if date.last_of_month? + verify_file(path: self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date)) + end + end + + # Frequency: ALL + # Files: CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz and CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz + # and potentially CATALOG_PREP/zephir_full_YYYYMMDD_vufind.json.gz + # Contents: TODO + # Verify: + # readable + # TODO: deletes file is combination of two component files in TMPDIR? 
+ def verify_catalog_prep(date: current_date) + verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) + verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date)) + if date.last_of_month? + verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date)) end end - private + # Frequency: DAILY + # Files: TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz + # Contents: TODO + # Verify: + # readable + # empty + def verify_dollar_dup(date: current_date) + dollar_dup = self.class.dated_derivative(location: :TMPDIR, name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz", date: date) + if verify_file(path: dollar_dup) + Zinzout.zin(dollar_dup) do |infile| + if (line_count = infile.count).positive? + error message: "#{dollar_dup} has #{line_count} lines, should be 0" + end + end + end + end - # TODO: see if we want to move this to Derivatives class - def datestamped_derivative(location:, name:, date:) - File.join( - Derivatives.directory_for(location: location), - self.class.datestamped_file(name: "zephir_upd_YYYYMMDD.json.gz", date: date) - ) + # Frequency: MONTHLY + # Files: INGEST_BIBRECORDS/groove_full.tsv.gz + # Contents: TODO + # Verify: readable + def verify_groove_export(date: current_date) + if date.last_of_month? 
+ verify_file(path: self.class.derivative(location: :INGEST_BIBRECORDS, name: "groove_full.tsv.gz")) + end end - def datestamped_derivatives(date) - [ - datestamped_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date), - datestamped_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD.json.gz", date: date), - datestamped_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date), - datestamped_derivative(location: :RIGHTS_DIR, name: "zephir_upd_YYYYMMDD.rights", date: date), - datestamped_derivative(location: :TMPDIR, name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz", date: date) - ] + # Frequency: MONTHLY + # Files: INGEST_BIBRECORDS/groove_full.tsv.gz, INGEST_BIBRECORDS/zephir_ingested_items.txt.gz + # Contents: TODO + # Verify: readable + def verify_ingest_bibrecords(date: current_date) + if date.last_of_month? + verify_file(path: self.class.derivative(location: :INGEST_BIBRECORDS, name: "groove_full.tsv.gz")) + verify_file(path: self.class.derivative(location: :INGEST_BIBRECORDS, name: "zephir_ingested_items.txt.gz")) + end + end + + # Frequency: BOTH + # Files: RIGHTS_ARCHIVE/zephir_upd_YYYYMMDD.rights + # and potentially RIGHTS_ARCHIVE/zephir_full_YYYYMMDD.rights + # Contents: TODO + # Verify: + # readable + # TODO: compare each line against a basic regex + def verify_rights(date: current_date) + verify_file(path: self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_upd_YYYYMMDD.rights", date: date)) + if date.last_of_month? + verify_file(path: self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_full_YYYYMMDD.rights", date: date)) + end + end + + # Frequency: MONTHLY + # Files: ZEPHIR_DATA/full/zephir_full_monthly_rpt.txt, ZEPHIR_DATA/full/zephir_full_YYYYMMDD.rights_rpt.tsv + # Contents: TODO + # Verify: readable + def verify_zephir_data(date: current_date) + if date.last_of_month? 
+ verify_file(path: self.class.derivative(location: :ZEPHIR_DATA, name: "full/zephir_full_monthly_rpt.txt")) + verify_file(path: self.class.dated_derivative(location: :ZEPHIR_DATA, name: "full/zephir_full_YYYYMMDD.rights_rpt.tsv", date: date)) + end end end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 710e21f..fd46681 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -1,9 +1,12 @@ # frozen_string_literal: true +require "dotenv" require "logger" require "simplecov" require "simplecov-lcov" +Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) + SimpleCov.add_filter "spec" SimpleCov::Formatter::LcovFormatter.config do |c| From e0107ce5c521d0783f35f4e9dcc698e0cc60d73c Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Mon, 2 Dec 2024 14:06:18 -0500 Subject: [PATCH 009/114] DEV-1421 (WIP): verify delete files --- lib/verifier/post_zephir_verifier.rb | 14 ++++++++- spec/unit/post_zephir_verifier_spec.rb | 40 ++++++++++++++++++++++++++ 2 files changed, 53 insertions(+), 1 deletion(-) create mode 100644 spec/unit/post_zephir_verifier_spec.rb diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 6afb2cf..627ee29 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -1,5 +1,6 @@ # frozen_string_literal: true +require "zlib" require_relative "../verifier" require_relative "../derivatives" @@ -45,13 +46,24 @@ def verify_catalog_archive(date: current_date) # readable # TODO: deletes file is combination of two component files in TMPDIR? 
def verify_catalog_prep(date: current_date) + delete_file = self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date) verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) - verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date)) + verify_file(path: delete_file) + verify_deletes_contents(path: delete_file) if date.last_of_month? verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date)) end end + # Verify contents of the given file consists of catalog record IDs (9 digits) + def verify_deletes_contents(path:) + Zlib::GzipReader.open(path).each_line do |line| + if !line.match?(/^\d{9}$/) + error message: "Unexpected line in #{path} (was '#{line.strip}'); expecting catalog record ID (9 digits)" + end + end + end + # Frequency: DAILY # Files: TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz # Contents: TODO diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb new file mode 100644 index 0000000..f8151a5 --- /dev/null +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -0,0 +1,40 @@ +# frozen_string_literal: true + +require "climate_control" +require "zlib" +require "verifier/post_zephir_verifier" +require "tempfile" + +module PostZephirProcessing + + RSpec.describe(PostZephirVerifier) do + around(:each) do |example| + Dir.mktmpdir do |tmpdir| + ClimateControl.modify DATA_ROOT: tmpdir do + File.open(File.join(tmpdir,"journal.yml"),"w") do |f| + # minimal yaml -- empty array + f.puts("--- []") + end + example.run + end + end + end + + describe "#verify_deletes_contents" do + it "accepts a file with a newline and nothing else" do + Tempfile.create('pzp_test') do |tmpfile| + gz = Zlib::GzipWriter.new(tmpfile) + gz.write("\n") + gz.close + tmpfile.close + expect { 
described_class.new.verify_deletes_contents(path: tmpfile.path) }.not_to raise_exception + end + end + it "accepts a file with one catalog record ID" + it "accepts a file with multiple catalog record IDs" + it "rejects a file with a truncated catalog record ID" + it "rejects a file with a mix of catalog record IDs and whitespace" + it "rejects a file with a mix of catalog record IDs and gibberish" + end + end +end From 787d0566a61eeb2c104b7322bd6883d0ac2fa9cb Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Mon, 2 Dec 2024 14:26:34 -0500 Subject: [PATCH 010/114] DEV-1421: verify delete files * accepts empty lines * accepts 9 digit catalog record IDs * rejects all other non-whitespace lines * sets up empty journal * sets up logging to string for testing --- lib/verifier/post_zephir_verifier.rb | 5 +- spec/unit/post_zephir_verifier_spec.rb | 103 ++++++++++++++++++++++--- 2 files changed, 94 insertions(+), 14 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 627ee29..934096a 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -55,10 +55,11 @@ def verify_catalog_prep(date: current_date) end end - # Verify contents of the given file consists of catalog record IDs (9 digits) + # Verify contents of the given file consists of catalog record IDs (9 digits) + # or blank lines def verify_deletes_contents(path:) Zlib::GzipReader.open(path).each_line do |line| - if !line.match?(/^\d{9}$/) + if line != "\n" && !line.match?(/^\d{9}$/) error message: "Unexpected line in #{path} (was '#{line.strip}'); expecting catalog record ID (9 digits)" end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index f8151a5..849decf 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -4,10 +4,43 @@ require "zlib" require "verifier/post_zephir_verifier" require "tempfile" +require "logger" module 
PostZephirProcessing + RSpec.describe(PostZephirVerifier) do + def with_temp_deletefile(contents) + Tempfile.create('deletefile') do |tmpfile| + gz = Zlib::GzipWriter.new(tmpfile) + gz.write(contents) + gz.close + yield tmpfile.path + end + end + + def expect_not_ok(contents) + with_temp_deletefile(contents) do |tmpfile| + described_class.new.verify_deletes_contents(path: tmpfile) + expect(@log_str.string).to match(/ERROR.*deletefile.*expecting catalog record ID/) + end + end + + def expect_ok(contents) + with_temp_deletefile(contents) do |tmpfile| + described_class.new.verify_deletes_contents(path: tmpfile) + expect(@log_str.string).not_to match(/ERROR/) + end + end + + around(:each) do |example| + @log_str = StringIO.new + old_logger = Services.logger + Services.register(:logger) { Logger.new(@log_str, level: Logger::DEBUG) } + example.run + Services.register(:logger) { old_logger } + end + around(:each) do |example| Dir.mktmpdir do |tmpdir| ClimateControl.modify DATA_ROOT: tmpdir do @@ -22,19 +55,65 @@ module PostZephirProcessing describe "#verify_deletes_contents" do it "accepts a file with a newline and nothing else" do - Tempfile.create('pzp_test') do |tmpfile| - gz = Zlib::GzipWriter.new(tmpfile) - gz.write("\n") - gz.close - tmpfile.close - expect { described_class.new.verify_deletes_contents(path: tmpfile.path) }.not_to raise_exception - end + contents = "\n" + expect_ok(contents) + end + + it "accepts a file with one catalog record ID" do + contents = <<~EOT + 000123456 + EOT + + expect_ok(contents) + end + + it "accepts a file with multiple catalog record IDs" do + contents = <<~EOT + 000001234 + 000012345 + EOT + + expect_ok(contents) + end + + it "accepts a file with a mix of catalog record IDs and blank lines" do + contents = <<~EOT + 000000001 + + 212345678 + EOT + + expect_ok(contents) + end + + it "rejects a file with a truncated catalog record ID" do + contents = <<~EOT + 12345 + EOT + + expect_not_ok(contents) + end + + it "rejects a file with a 
mix of catalog record IDs and whitespace" do + contents = <<~EOT + 000001234 + 000012345 + + \t + 000112345 + EOT + + expect_not_ok(contents) + end + + it "rejects a file with a mix of catalog record IDs and gibberish" do + contents = <<~EOT + mashed potatoes + 000001234 + EOT + + expect_not_ok(contents) end - it "accepts a file with one catalog record ID" - it "accepts a file with multiple catalog record IDs" - it "rejects a file with a truncated catalog record ID" - it "rejects a file with a mix of catalog record IDs and whitespace" - it "rejects a file with a mix of catalog record IDs and gibberish" end end end From 8429ed7cfcb2398379490c9dad794efcbf67d79e Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Mon, 2 Dec 2024 14:44:45 -0500 Subject: [PATCH 011/114] DEV-1421: rubocop fixes --- lib/verifier/post_zephir_verifier.rb | 2 +- spec/unit/post_zephir_verifier_spec.rb | 6 ++---- 2 files changed, 3 insertions(+), 5 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 934096a..755aaf0 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -55,7 +55,7 @@ def verify_catalog_prep(date: current_date) end end - # Verify contents of the given file consists of catalog record IDs (9 digits) + # Verify contents of the given file consists of catalog record IDs (9 digits) # or blank lines def verify_deletes_contents(path:) Zlib::GzipReader.open(path).each_line do |line| diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 849decf..8d7232d 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -7,11 +7,9 @@ require "logger" module PostZephirProcessing - - RSpec.describe(PostZephirVerifier) do def with_temp_deletefile(contents) - Tempfile.create('deletefile') do |tmpfile| + Tempfile.create("deletefile") do |tmpfile| gz = Zlib::GzipWriter.new(tmpfile) gz.write(contents) gz.close @@ -44,7 +42,7 
@@ def expect_ok(contents) around(:each) do |example| Dir.mktmpdir do |tmpdir| ClimateControl.modify DATA_ROOT: tmpdir do - File.open(File.join(tmpdir,"journal.yml"),"w") do |f| + File.open(File.join(tmpdir, "journal.yml"), "w") do |f| # minimal yaml -- empty array f.puts("--- []") end From 93ddd2e152a7182dde6f0ffef3ad2c8add73c22b Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Mon, 2 Dec 2024 15:25:07 -0500 Subject: [PATCH 012/114] - spec for `Verifier` class. - `with_test_environment` helper. --- lib/verifier.rb | 5 +++- spec/spec_helper.rb | 27 ++++++++++++++++++ spec/unit/journal_spec.rb | 18 +++--------- spec/unit/verifier_spec.rb | 57 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 92 insertions(+), 15 deletions(-) create mode 100644 spec/unit/verifier_spec.rb diff --git a/lib/verifier.rb b/lib/verifier.rb index 1499efa..1328eef 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -10,7 +10,7 @@ module PostZephirProcessing class Verifier - attr_reader :journal + attr_reader :journal, :errors def self.datestamped_file(name:, date:) name.sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) @@ -33,6 +33,8 @@ def self.derivative(location:, name:) # Generally, needs a Journal in order to know what to look for. def initialize @journal = Journal.from_yaml + # Mainly for testing + @errors = [] end # Main entrypoint @@ -74,6 +76,7 @@ def verify_file_readable(path:) # I'm not sure if we're going to try to distinguish errors and warnings. # For now let's call everything an error. 
def error(message:) + @errors << message Services[:logger].error message end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index fd46681..c6d0aff 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -22,7 +22,34 @@ require_relative "../lib/dates" require_relative "../lib/derivatives" require_relative "../lib/journal" +require_relative "../lib/verifier" + +def test_journal + <<~TEST_YAML + --- + - '20500101' + - '20500102' + TEST_YAML +end + +def test_journal_dates + [Date.new(2050, 1, 1), Date.new(2050, 1, 2)] +end + +def with_test_environment + Dir.mktmpdir do |tmpdir| + ClimateControl.modify(DATA_ROOT: tmpdir) do + File.open(File.join(tmpdir, "journal.yml"), "w") { |f| f.puts test_journal } + # Maybe we don't need to yield `tmpdir` since we're also assigning it to an + # instance variable. Leaving it for now in case the ivar approach leads to funny business. + @tmpdir = tmpdir + yield tmpdir + end + end +end +# TODO: the following ENV juggling routines are for the integration tests, +# and should be integrated with the `with_test_environment` facility above. 
ENV["POST_ZEPHIR_LOGGER_LEVEL"] = Logger::WARN.to_s def catalog_prep_dir diff --git a/spec/unit/journal_spec.rb b/spec/unit/journal_spec.rb index a163e68..b587654 100644 --- a/spec/unit/journal_spec.rb +++ b/spec/unit/journal_spec.rb @@ -6,11 +6,8 @@ module PostZephirProcessing RSpec.describe(Journal) do around(:each) do |example| - Dir.mktmpdir do |tmpdir| - ClimateControl.modify DATA_ROOT: tmpdir do - @tmpdir = tmpdir - example.run - end + with_test_environment do |tmpdir| + example.run end end @@ -19,14 +16,6 @@ module PostZephirProcessing let(:range_of_dates) { (Date.today..Date.today + 1) } let(:with_dates) { described_class.new(dates: unsorted_dates) } let(:with_range) { described_class.new(dates: range_of_dates) } - let(:test_yaml) { - <<~TEST_YAML - --- - - '20500101' - - '20500102' - TEST_YAML - } - let(:test_yaml_dates) { [Date.new("2050", "1", "1"), Date.new("2050", "1", "2")] } describe ".destination_path" do it "contains the current DATA_ROOT" do @@ -36,8 +25,9 @@ module PostZephirProcessing describe ".from_yaml" do it "produces a Journal with the expected dates" do - File.write(described_class.destination_path, test_yaml) + File.write(described_class.destination_path, test_journal) expect(described_class.from_yaml).to be_an_instance_of(Journal) + expect(described_class.from_yaml.dates).to eq(test_journal_dates) end end diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb new file mode 100644 index 0000000..b6d1079 --- /dev/null +++ b/spec/unit/verifier_spec.rb @@ -0,0 +1,57 @@ +# frozen_string_literal: true + +require "climate_control" +require "tmpdir" + +module PostZephirProcessing + RSpec.describe(Verifier) do + around(:each) do |example| + with_test_environment do |tmpdir| + example.run + end + end + + let(:verifier) { described_class.new } + + describe ".new" do + it "creates a Verifier" do + expect(verifier).to be_an_instance_of(Verifier) + end + + context "with no Journal file" do + it "raises StandardError" do + 
FileUtils.rm(File.join(@tmpdir, "journal.yml")) + expect { verifier }.to raise_error(StandardError) + end + end + end + + describe "#run" do + it "runs to completion" do + verifier.run + end + end + + describe "#verify_file" do + # Note: since the tests currently run as root, no way to test unreadable file + context "with readable file" do + it "does not report an error" do + errors_before = verifier.errors.count + tmpfile = File.join(@tmpdir, "tmpfile.txt") + File.open(tmpfile, "w") { |f| f.puts "blah" } + verifier.verify_file(path: tmpfile) + expect(verifier.errors.count).to eq(errors_before) + end + end + + context "with nonexistent file" do + it "reports an error" do + errors_before = verifier.errors.count + tmpfile = File.join(@tmpdir, "no_such_tmpfile.txt") + verifier.verify_file(path: tmpfile) + expect(verifier.errors.count).to be > errors_before + end + end + end + end +end From 48c1b58e0caa453410b00e35f35a4d3b5b7dd2c6 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Mon, 2 Dec 2024 16:33:12 -0500 Subject: [PATCH 013/114] added verify_rights_file_format, some tests, and test helpers --- lib/verifier/post_zephir_verifier.rb | 33 +++++++- spec/unit/post_zephir_verifier_spec.rb | 113 +++++++++++++++++++------ 2 files changed, 120 insertions(+), 26 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 755aaf0..70fd7b6 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -111,9 +111,38 @@ def verify_ingest_bibrecords(date: current_date) # readable # TODO: compare each line against a basic regex def verify_rights(date: current_date) - verify_file(path: self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_upd_YYYYMMDD.rights", date: date)) + upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_upd_YYYYMMDD.rights", date: date) + verify_file(path: upd_path) + verify_rights_file_format(path: upd_path) + if date.last_of_month? 
- verify_file(path: self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_full_YYYYMMDD.rights", date: date)) + full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_full_YYYYMMDD.rights", date: date) + verify_file(path: full_path) + verify_rights_file_format(path: full_path) + end + end + + # Rights file must: + # * exist & be readable (both covered by verify_rights) + # * either be empty, or all its lines must match regex. + def verify_rights_file_format(path:) + # A more readable version of: + # /^\w+\.[\w:\/\$\.]+\t(ic|pd|pdus|und)\tbib\tbibrights\t\w+(-\w+)*$/ + regex = /^ \w+ \. [\w:\/\$\.]+ # col 1, namespace.objid + \t (ic|pd|pdus|und) # col 2, one of these + \t bib # col 3, exactly this + \t bibrights # col 4, exactly this + \t \w+(-\w+)* # col 5, digitizer, e.g. 'ia', 'cornell-ms' + $/x + + # This allows an empty file as well, which is possible. + File.open(path) do |f| + f.each_line do |line| + line.strip! + unless line.match?(regex) + error message: "Rights file #{path} contains malformed line: #{line}" + end + end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 8d7232d..990d7cd 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -8,29 +8,6 @@ module PostZephirProcessing RSpec.describe(PostZephirVerifier) do - def with_temp_deletefile(contents) - Tempfile.create("deletefile") do |tmpfile| - gz = Zlib::GzipWriter.new(tmpfile) - gz.write(contents) - gz.close - yield tmpfile.path - end - end - - def expect_not_ok(contents) - with_temp_deletefile(contents) do |tmpfile| - described_class.new.verify_deletes_contents(path: tmpfile) - expect(@log_str.string).to match(/ERROR.*deletefile.*expecting catalog record ID/) - end - end - - def expect_ok(contents) - with_temp_deletefile(contents) do |tmpfile| - described_class.new.verify_deletes_contents(path: tmpfile) - expect(@log_str.string).not_to match(/ERROR/) - 
end - end - around(:each) do |example| @log_str = StringIO.new old_logger = Services.logger @@ -51,7 +28,59 @@ def expect_ok(contents) end end + # These helpers are based on the ones from + # #verify_deletes_contents but are more general + + # overwrite with_temp_file if you need to treat temp files differently + def with_temp_file(contents) + tempfile = Tempfile.new("tempfile") + tempfile << contents + tempfile.close + yield tempfile.path + end + + # the expect-methods take a method arg for the method under test, + # a contents string that's written to a tempfile and passed to the method, + # and an optional errmsg arg (as a regexp) for specific error checking + + def expect_not_ok(method, contents, errmsg = /ERROR/) + with_temp_file(contents) do |tmpfile| + described_class.new.send(method, path: tmpfile) + expect(@log_str.string).to match(errmsg) + end + end + + def expect_ok(method, contents, errmsg = /ERROR/) + with_temp_file(contents) do |tmpfile| + described_class.new.send(method, path: tmpfile) + expect(@log_str.string).not_to match(errmsg) + end + end + describe "#verify_deletes_contents" do + def with_temp_deletefile(contents) + Tempfile.create("deletefile") do |tmpfile| + gz = Zlib::GzipWriter.new(tmpfile) + gz.write(contents) + gz.close + yield tmpfile.path + end + end + + def expect_not_ok(contents) + with_temp_deletefile(contents) do |tmpfile| + described_class.new.verify_deletes_contents(path: tmpfile) + expect(@log_str.string).to match(/ERROR.*deletefile.*expecting catalog record ID/) + end + end + + def expect_ok(contents) + with_temp_deletefile(contents) do |tmpfile| + described_class.new.verify_deletes_contents(path: tmpfile) + expect(@log_str.string).not_to match(/ERROR/) + end + end + it "accepts a file with a newline and nothing else" do contents = "\n" expect_ok(contents) @@ -96,7 +125,7 @@ def expect_ok(contents) contents = <<~EOT 000001234 000012345 - + \t 000112345 EOT @@ -113,5 +142,41 @@ def expect_ok(contents) expect_not_ok(contents) end 
end + + describe "#verify_rights_file_format" do + it "accepts an empty file" do + expect_ok(:verify_rights_file_format, "") + end + + it "accepts a well-formed file" do + contents = [ + ["a.1", "ic", "bib", "bibrights", "aa"].join("\t"), + ["a.2", "pd", "bib", "bibrights", "bb"].join("\t"), + ["a.3", "pdus", "bib", "bibrights", "aa-bb"].join("\t"), + ["a.4", "und", "bib", "bibrights", "aa-bb"].join("\t") + ].join("\n") + + expect_ok(:verify_rights_file_format, contents) + end + + it "rejects a file with malformed volume id" do + expect_not_ok( + :verify_rights_file_format, + ["", "ic", "bib", "bibrights", "aa"].join("\t") + ) + expect_not_ok( + :verify_rights_file_format, + ["x", "ic", "bib", "bibrights", "aa"].join("\t") + ) + expect_not_ok( + :verify_rights_file_format, + ["x.", "ic", "bib", "bibrights", "aa"].join("\t") + ) + expect_not_ok( + :verify_rights_file_format, + [".x", "ic", "bib", "bibrights", "aa"].join("\t") + ) + end + end end end From 48e41b3751e0f40be70196e2dc1c5e00d43cf877 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Mon, 2 Dec 2024 17:19:31 -0500 Subject: [PATCH 014/114] more tests for verify_rights_file_format --- lib/verifier/post_zephir_verifier.rb | 2 +- spec/unit/post_zephir_verifier_spec.rb | 66 ++++++++++++++++++++++++-- 2 files changed, 63 insertions(+), 5 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 70fd7b6..8bcc7fe 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -132,7 +132,7 @@ def verify_rights_file_format(path:) \t (ic|pd|pdus|und) # col 2, one of these \t bib # col 3, exactly this \t bibrights # col 4, exactly this - \t \w+(-\w+)* # col 5, digitizer, e.g. 'ia', 'cornell-ms' + \t \w+(-\w+)* # col 5, digitization source, e.g. 'ia', 'cornell-ms' $/x # This allows an empty file as well, which is possible. 
diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 990d7cd..e98326d 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -160,23 +160,81 @@ def expect_ok(contents) end it "rejects a file with malformed volume id" do + cols_2_to_5 = ["ic", "bib", "bibrights", "aa"].join("\t") expect_not_ok( :verify_rights_file_format, - ["", "ic", "bib", "bibrights", "aa"].join("\t") + ["", cols_2_to_5].join("\t"), + /Rights file .+ contains malformed line/ ) expect_not_ok( :verify_rights_file_format, - ["x", "ic", "bib", "bibrights", "aa"].join("\t") + ["x", cols_2_to_5].join("\t"), + /Rights file .+ contains malformed line/ ) expect_not_ok( :verify_rights_file_format, - ["x.", "ic", "bib", "bibrights", "aa"].join("\t") + ["x.", cols_2_to_5].join("\t"), + /Rights file .+ contains malformed line/ ) expect_not_ok( :verify_rights_file_format, - [".x", "ic", "bib", "bibrights", "aa"].join("\t") + [".x", cols_2_to_5].join("\t"), + /Rights file .+ contains malformed line/ ) end + + it "rejects a file with malformed rights" do + cols = ["a.1", "ic", "bib", "bibrights", "aa"] + expect_ok(:verify_rights_file_format, cols.join("\t")) + + cols[1] = "" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + + cols[1] = "icus" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + end + + it "rejects a file without bib in col 2" do + cols = ["a.1", "ic", "bib", "bibrights", "aa"] + expect_ok(:verify_rights_file_format, cols.join("\t")) + + cols[2] = "BIB" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + + cols[2] = "" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + end + + it "rejects a file without bibrights in col 3" do + cols = ["a.1", "ic", "bib", "bibrights", "aa"] + expect_ok(:verify_rights_file_format, cols.join("\t")) + + cols[3] = "BIBRIGHTS" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + + cols[3] = "" + 
expect_not_ok(:verify_rights_file_format, cols.join("\t")) + end + + it "rejects a file with malformed digitization source" do + cols = ["a.1", "ic", "bib", "bibrights", "aa"] + expect_ok(:verify_rights_file_format, cols.join("\t")) + + cols[4] = "aa-aa" + expect_ok(:verify_rights_file_format, cols.join("\t")) + + cols[4] = "-aa" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + + cols[4] = "aa-" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + + cols[4] = "AA" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + + cols[4] = "" + expect_not_ok(:verify_rights_file_format, cols.join("\t")) + end end end end From 8ae1a905b33ef3117d3a0eaa88b3eaf6d7c62cda Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 3 Dec 2024 10:56:04 -0500 Subject: [PATCH 015/114] Use verifier.errors and with_test_environment * Fix regex for verifying digitization source --- lib/verifier/post_zephir_verifier.rb | 2 +- spec/unit/post_zephir_verifier_spec.rb | 42 +++++++++----------------- 2 files changed, 16 insertions(+), 28 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 8bcc7fe..b157db6 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -132,7 +132,7 @@ def verify_rights_file_format(path:) \t (ic|pd|pdus|und) # col 2, one of these \t bib # col 3, exactly this \t bibrights # col 4, exactly this - \t \w+(-\w+)* # col 5, digitization source, e.g. 'ia', 'cornell-ms' + \t [a-z]+(-[a-z]+)* # col 5, digitization source, e.g. 'ia', 'cornell-ms' $/x # This allows an empty file as well, which is possible. 
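Note on the hunk above: the change from `\w+(-\w+)*` to `[a-z]+(-[a-z]+)*` is what lets the `cols[4] = "AA"` rejection case from the previous patch pass, since `\w` also matches uppercase letters, digits, and underscores. A minimal sketch of the difference follows. It isolates column 5 only; the constant names `OLD_SOURCE` and `NEW_SOURCE` and the helper `source_ok?` are hypothetical, and the `\A`/`\z` anchors are an assumption added so the patterns can be tried on a bare field rather than a whole tab-separated line.

```ruby
# Column 5 (digitization source) patterns, lifted out of the full line regex.
# Constant and helper names are illustrative only; they do not exist in the repo.
# \A and \z are added here (an assumption) to anchor a single field.
OLD_SOURCE = /\A\w+(-\w+)*\z/         # before the fix: \w allows A-Z, 0-9 and _
NEW_SOURCE = /\A[a-z]+(-[a-z]+)*\z/   # after the fix: lowercase letters only

# Returns true when the whole value matches the given pattern.
def source_ok?(pattern, value)
  pattern.match?(value)
end

# Values taken from the spec cases in the preceding patch.
["ia", "cornell-ms", "aa-aa", "AA", "aa-", "-aa", ""].each do |value|
  puts format("%-14s old=%-5s new=%s",
    value.inspect,
    source_ok?(OLD_SOURCE, value),
    source_ok?(NEW_SOURCE, value))
end
```

Only the `"AA"` case should differ between the two patterns; leading or trailing dashes and the empty string fail under both, because each dash must be followed by at least one character of the class.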
diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index e98326d..9d0a9b4 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -9,23 +9,7 @@ module PostZephirProcessing RSpec.describe(PostZephirVerifier) do around(:each) do |example| - @log_str = StringIO.new - old_logger = Services.logger - Services.register(:logger) { Logger.new(@log_str, level: Logger::DEBUG) } - example.run - Services.register(:logger) { old_logger } - end - - around(:each) do |example| - Dir.mktmpdir do |tmpdir| - ClimateControl.modify DATA_ROOT: tmpdir do - File.open(File.join(tmpdir, "journal.yml"), "w") do |f| - # minimal yaml -- empty array - f.puts("--- []") - end - example.run - end - end + with_test_environment { example.run } end # These helpers are based on the ones from @@ -43,17 +27,19 @@ def with_temp_file(contents) # a contents string that's written to a tempfile and passed to the method, # and an optional errmsg arg (as a regexp) for specific error checking - def expect_not_ok(method, contents, errmsg = /ERROR/) + def expect_not_ok(method, contents, errmsg = /.*/) with_temp_file(contents) do |tmpfile| - described_class.new.send(method, path: tmpfile) - expect(@log_str.string).to match(errmsg) + verifier = described_class.new + verifier.send(method, path: tmpfile) + expect(verifier.errors).to include(errmsg) end end - def expect_ok(method, contents, errmsg = /ERROR/) + def expect_ok(method, contents) with_temp_file(contents) do |tmpfile| - described_class.new.send(method, path: tmpfile) - expect(@log_str.string).not_to match(errmsg) + verifier = described_class.new + verifier.send(method, path: tmpfile) + expect(verifier.errors).to be_empty end end @@ -69,15 +55,17 @@ def with_temp_deletefile(contents) def expect_not_ok(contents) with_temp_deletefile(contents) do |tmpfile| - described_class.new.verify_deletes_contents(path: tmpfile) - expect(@log_str.string).to 
match(/ERROR.*deletefile.*expecting catalog record ID/) + verifier = described_class.new + verifier.verify_deletes_contents(path: tmpfile) + expect(verifier.errors).to include(/.*deletefile.*expecting catalog record ID/) end end def expect_ok(contents) with_temp_deletefile(contents) do |tmpfile| - described_class.new.verify_deletes_contents(path: tmpfile) - expect(@log_str.string).not_to match(/ERROR/) + verifier = described_class.new + verifier.verify_deletes_contents(path: tmpfile) + expect(verifier.errors).to be_empty end end From eadc0cdc2d0db473ef8e8f400bce1056c59d7420 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 3 Dec 2024 16:02:12 -0500 Subject: [PATCH 016/114] Use extracted methods for checking delete file * Add optional 'gzipped' argument * Make 'errmsg' a keyword argument --- spec/unit/post_zephir_verifier_spec.rb | 72 +++++++++++--------------- 1 file changed, 31 insertions(+), 41 deletions(-) diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 9d0a9b4..3f4051c 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -16,27 +16,31 @@ module PostZephirProcessing # #verify_deletes_contents but are more general # overwrite with_temp_file if you need to treat temp files differently - def with_temp_file(contents) - tempfile = Tempfile.new("tempfile") - tempfile << contents - tempfile.close - yield tempfile.path + def with_temp_file(contents, gzipped: false) + Tempfile.new("tempfile") do |tmpfile| + if gzipped + write_gzipped(tmpfile, contents) + else + tempfile.write(contents) + end + yield tempfile.path + end end # the expect-methods take a method arg for the method under test, # a contents string that's written to a tempfile and passed to the method, # and an optional errmsg arg (as a regexp) for specific error checking - def expect_not_ok(method, contents, errmsg = /.*/) - with_temp_file(contents) do |tmpfile| + def expect_not_ok(method, contents, errmsg: 
/.*/, gzipped: false) + with_temp_file(contents, gzipped: gzipped) do |tmpfile| verifier = described_class.new verifier.send(method, path: tmpfile) expect(verifier.errors).to include(errmsg) end end - def expect_ok(method, contents) - with_temp_file(contents) do |tmpfile| + def expect_ok(method, contents, gzipped: false) + with_temp_file(contents, gzipped: gzipped) do |tmpfile| verifier = described_class.new verifier.send(method, path: tmpfile) expect(verifier.errors).to be_empty @@ -44,34 +48,20 @@ def expect_ok(method, contents) end describe "#verify_deletes_contents" do - def with_temp_deletefile(contents) - Tempfile.create("deletefile") do |tmpfile| - gz = Zlib::GzipWriter.new(tmpfile) - gz.write(contents) - gz.close - yield tmpfile.path - end + def expect_deletefile_error(contents) + expect_not_ok(:verify_deletes_contents, + contents, + gzipped: true, + errmsg: /.*tempfile.*expecting catalog record ID/) end - def expect_not_ok(contents) - with_temp_deletefile(contents) do |tmpfile| - verifier = described_class.new - verifier.verify_deletes_contents(path: tmpfile) - expect(verifier.errors).to include(/.*deletefile.*expecting catalog record ID/) - end - end - - def expect_ok(contents) - with_temp_deletefile(contents) do |tmpfile| - verifier = described_class.new - verifier.verify_deletes_contents(path: tmpfile) - expect(verifier.errors).to be_empty - end + def expect_deletefile_ok(contents) + expect_ok(:verify_deletes_contents, contents, gzipped: true) end it "accepts a file with a newline and nothing else" do contents = "\n" - expect_ok(contents) + expect_deletefile_ok(contents) end it "accepts a file with one catalog record ID" do @@ -79,7 +69,7 @@ def expect_ok(contents) 000123456 EOT - expect_ok(contents) + expect_deletefile_ok(contents) end it "accepts a file with multiple catalog record IDs" do @@ -88,7 +78,7 @@ def expect_ok(contents) 000012345 EOT - expect_ok(contents) + expect_deletefile_ok(contents) end it "accepts a file with a mix of catalog record 
IDs and blank lines" do @@ -98,7 +88,7 @@ def expect_ok(contents) 212345678 EOT - expect_ok(contents) + expect_deletefile_ok(contents) end it "rejects a file with a truncated catalog record ID" do @@ -106,7 +96,7 @@ def expect_ok(contents) 12345 EOT - expect_not_ok(contents) + expect_deletefile_error(contents) end it "rejects a file with a mix of catalog record IDs and whitespace" do @@ -118,7 +108,7 @@ def expect_ok(contents) 000112345 EOT - expect_not_ok(contents) + expect_deletefile_error(contents) end it "rejects a file with a mix of catalog record IDs and gibberish" do @@ -127,7 +117,7 @@ def expect_ok(contents) 000001234 EOT - expect_not_ok(contents) + expect_deletefile_error(contents) end end @@ -152,22 +142,22 @@ def expect_ok(contents) expect_not_ok( :verify_rights_file_format, ["", cols_2_to_5].join("\t"), - /Rights file .+ contains malformed line/ + errmsg: /Rights file .+ contains malformed line/ ) expect_not_ok( :verify_rights_file_format, ["x", cols_2_to_5].join("\t"), - /Rights file .+ contains malformed line/ + errmsg: /Rights file .+ contains malformed line/ ) expect_not_ok( :verify_rights_file_format, ["x.", cols_2_to_5].join("\t"), - /Rights file .+ contains malformed line/ + errmsg: /Rights file .+ contains malformed line/ ) expect_not_ok( :verify_rights_file_format, [".x", cols_2_to_5].join("\t"), - /Rights file .+ contains malformed line/ + errmsg: /Rights file .+ contains malformed line/ ) end From 2e1c7d0462e5cf00fac3422bbf7aa9630ebb255c Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 3 Dec 2024 16:48:37 -0500 Subject: [PATCH 017/114] move with_temp_file to spec_helper and correct --- spec/spec_helper.rb | 19 +++++++++++++++++++ spec/unit/post_zephir_verifier_spec.rb | 12 ------------ 2 files changed, 19 insertions(+), 12 deletions(-) diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index c6d0aff..dbdd152 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -4,6 +4,7 @@ require "logger" require "simplecov" require 
"simplecov-lcov" +require "zlib" Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) @@ -48,6 +49,24 @@ def with_test_environment end end +def write_gzipped(tmpfile, contents) + gz = Zlib::GzipWriter.new(tmpfile) + gz.write(contents) + gz.close +end + +def with_temp_file(contents, gzipped: false) + Tempfile.create("tempfile") do |tmpfile| + if gzipped + write_gzipped(tmpfile, contents) + else + tmpfile.write(contents) + end + tmpfile.close() + yield tmpfile.path + end +end + # TODO: the following ENV juggling routines are for the integration tests, # and should be integrated with the `with_test_environment` facility above. ENV["POST_ZEPHIR_LOGGER_LEVEL"] = Logger::WARN.to_s diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 3f4051c..61e11eb 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -15,18 +15,6 @@ module PostZephirProcessing # These helpers are based on the ones from # #verify_deletes_contents but are more general - # overwrite with_temp_file if you need to treat temp files differently - def with_temp_file(contents, gzipped: false) - Tempfile.new("tempfile") do |tmpfile| - if gzipped - write_gzipped(tmpfile, contents) - else - tempfile.write(contents) - end - yield tempfile.path - end - end - # the expect-methods take a method arg for the method under test, # a contents string that's written to a tempfile and passed to the method, # and an optional errmsg arg (as a regexp) for specific error checking From 78d572df55e1d7a8a3e62c1a49bbef6bbfb5688a Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 4 Dec 2024 13:15:35 -0500 Subject: [PATCH 018/114] DEV-1414: stub out verification & tests for hathifiles --- lib/verifier/hathifiles_verifier.rb | 50 ++++++ spec/spec_helper.rb | 16 ++ spec/unit/hathifiles_verifier_spec.rb | 220 ++++++++++++++++++++++++++ 3 files changed, 286 insertions(+) create mode 100644 lib/verifier/hathifiles_verifier.rb create mode 
100644 spec/unit/hathifiles_verifier_spec.rb diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb new file mode 100644 index 0000000..2be06f3 --- /dev/null +++ b/lib/verifier/hathifiles_verifier.rb @@ -0,0 +1,50 @@ +# frozen_string_literal: true + +require "zlib" +require_relative "../verifier" +require_relative "../derivatives" + +# Verifies that post_hathi workflow stage did what it was supposed to. + +# TODO: document and verify the files written by monthly process. +# They should be mostly the same but need to be accounted for. + +module PostZephirProcessing + class HathifilesVerifier < Verifier + attr_reader :current_date + + def run_for_date(date:) + @current_date = date + verify_hathifile + end + + # /htapps/archive/hathifiles/hathi_upd_20240201.txt.gz or hathi_full_20241201.txt.gz + # + # Frequency: ALL + # Files: CATALOG_PREP/hathi_upd_YYYYMMDD.txt.gz + # and potentially HATHIFILE_ARCHIVE/hathi_full_YYYYMMDD.txt.gz + # Contents: TODO + # Verify: + # readable + # TODO: line count must be > than corresponding catalog file + # TODO: regex format + def verify_hathifile_presence(date: current_date) + update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: date) + verify_file(path: update_file) + verify_hathifile_contents(update_file) + + if date.first_of_month? 
+ full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: date) + verify_file(path: full_file) + verify_hathifile_contents(full_file) + end + end + + def verify_hathifile_contents(file) + # open file + # check each line against a regex + # count lines + # also check linecount against corresponding catalog - hathifile must be >= + end + end +end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index dbdd152..6b7d6cb 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -67,6 +67,22 @@ def with_temp_file(contents, gzipped: false) end end +def expect_not_ok(method, contents, errmsg: /.*/, gzipped: false) + with_temp_file(contents, gzipped: gzipped) do |tmpfile| + verifier = described_class.new + verifier.send(method, path: tmpfile) + expect(verifier.errors).to include(errmsg) + end +end + +def expect_ok(method, contents, gzipped: false) + with_temp_file(contents, gzipped: gzipped) do |tmpfile| + verifier = described_class.new + verifier.send(method, path: tmpfile) + expect(verifier.errors).to be_empty + end +end + # TODO: the following ENV juggling routines are for the integration tests, # and should be integrated with the `with_test_environment` facility above. 
ENV["POST_ZEPHIR_LOGGER_LEVEL"] = Logger::WARN.to_s diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb new file mode 100644 index 0000000..7f18018 --- /dev/null +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -0,0 +1,220 @@ +# frozen_string_literal: true + +require "climate_control" +require "zlib" +require "verifier/hathifiles_verifier" +require "tempfile" +require "logger" + +module PostZephirProcessing + RSpec.describe(HathifilesVerifier) do + around(:each) do |example| + with_test_environment { example.run } + end + + def nil + end + + describe "#verify_hathifiles_count" do + context "with a catalog json file with 5 records" do + it "accepts a hathifile with 5 records" + it "accepts a hathifile with 10 records" + it "rejects a hathifile with 4 records" + it "rejects a hathifile with no records" + end + end + + HATHIFILES_SAMPLE_LINE = "mdp.39015002678202 deny ic 000000371 MIU 990000003710106381 36054 69011382 Short fiction; a critical collection, edited by James R. Frakes [and] Isadore Traschen. Prentice-Hall [c1969] bib 2008-06-01 09:30:17 0 1969 nju eng BK MIU umich umich google google Frakes, James R." 
+ + HATHIFILES_FIELDS = [ + { + name: 'htid', + good: 'mdp.39015031446076', + bad: 'this is not an id', + optional: false, + }, + { + name: 'access', + good: 'deny', + bad: 'nope', + optional: false, + }, + { + name: 'rights', + good: 'ic', + bad: 'In Copyright', + optional: false, + }, + { + name: 'ht_bib_key', + good: '000000400', + bad: 'not a bib key', + optional: false, + }, + { + name: 'description', + good: 'Jun-Oct 1927', + bad: nil, + optional: true, + }, + { + name: 'source', + good: 'MIU', + bad: 'this is not a NUC code', + optional: false, + }, + { + name: 'source_bib_num', + good: '990000003710106381', + bad: 'this is not a source bib num', + # this can be empty if the record has an sdrnum like sdr-osu(OCoLC)6776655 which the regex at https://github.com/hathitrust/hathifiles/blob/af5e4ff682fb81165e6232a1151cfbeeacfdfd21/lib/bib_record.rb#L160C34-L160C50 doesn't match, probably a bug in hathifiles + optional: true, + }, + + { + name: 'oclc_num', + good: '217079596,55322', + bad: 'this is not an OCLC number', + optional: true, + }, + + # isbn, issn, lccn come straight from the record w/o additional + # validation in hathifiles, probably not worth doing add'l validation + # here + { + name: 'isbn', + good: '9789679430011,9679430014', + bad: nil, + optional: true, + }, + { + name: 'issn', + good: '0084-9499,00113344', + bad: nil, + optional: true, + }, + { + name: 'lccn', + good: '', + bad: nil, + optional: true, + }, + + { + name: 'title', + good: '', + bad: nil, + # this can be empty if the record only has a 245$k. that's probably a bug in the + # hathifiles which we should fix. 
+ optional: true + }, + { + name: 'imprint', + good: 'Pergamon Press [1969]', + bad: nil, + optional: true + }, + { + name: 'rights_reason_code', + good: 'bib', + bad: 'not a reason code', + optional: false + }, + { + name: 'rights_timestamp', + good: '2008-06-01 09:30:17', + bad: 'last thursday', + optional: false + }, + { + name: 'us_gov_doc_flag', + good: '0', + bad: 'not a gov doc flag', + optional: false + }, + { + name: 'rights_date_used', + good: '1987', + bad: 'this is not a year', + optional: false + }, + { + name: 'pub_place', + good: 'miu', + bad: 'not a publication place', + optional: true + }, + { + name: 'lang', + good: 'eng', + bad: 'not a language code', + optional: true + }, + { + name: 'bib_fmt', + good: 'BK', + bad: 'not a bib fmt', + optional: false + }, + { + name: 'collection_code', + good: 'MIU', + bad: 'not a collection code', + optional: false + }, + { + name: 'content_provider_code', + good: 'umich', + bad: 'not an inst id', + optional: false, + }, + { + name: 'responsible_entity_code', + good: 'umich', + bad: 'not an inst id', + optional: false + }, + { + name: 'digitization_agent_code', + good: 'google', + bad: 'not an inst id', + optional: false + }, + { + name: 'access_profile_code', + good: 'open', + bad: 'not an access profile', + optional: false + }, + { + name: 'author', + good: 'Chaucer, Geoffrey, -1400.', + bad: nil, + optional: true + } + ] + + describe "#verify_hathifiles_contents" do + it "accepts a file with #{HATHIFILES_FIELDS.count} columns per line" + it "rejects a file where some lines have less than #{HATHIFILES_FIELDS.count} tab-separated columns" + + HATHIFILES_FIELDS.each do |field| + it "accepts a file with #{field[:name]} matching the regex" + + it "rejects a file with #{field[:name]} not matching the regex" + + if(field[:optional]) + it "accepts a file with empty #{field[:name]}" + else + it "rejects a file with empty #{field[:name]}" + end + end + + end + + describe "#catalog_file_for" do + it "computes a source 
catalog file based on date - 1" + end + + end +end From 4a6b69369ab1d151c834f27e5c7de9b91e647879 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 4 Dec 2024 17:28:21 -0500 Subject: [PATCH 019/114] DEV-1414: hathifiles field verification * verify lines have the expected number of columns * verify fields by regex as best we can Still to do: test hathifiles line counts against catalog; directly test HathifileContentsVerifier --- lib/verifier/hathifiles_verifier.rb | 126 ++++++++++++++++++++++++-- spec/spec_helper.rb | 7 ++ spec/unit/hathifiles_verifier_spec.rb | 63 ++++++++----- 3 files changed, 167 insertions(+), 29 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 2be06f3..84b9487 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -10,6 +10,114 @@ # They should be mostly the same but need to be accounted for. module PostZephirProcessing + class HathifileContentsVerifier < Verifier + HATHIFILE_FIELDS_COUNT = 26 + HATHIFILE_FIELD_REGEXES = [ + # htid - required; lowercase alphanumeric namespace, period, non-whitespace ID + /^[a-z0-9]{2,4}\.\S+$/, + # access - required; allow or deny + /^(allow|deny)$/, + # rights - required; lowercase alphanumeric plus dash and period + /^[a-z0-9\-.]+$/, + # ht_bib_key - required; 9 digits + /^\d{9}$/, + # description (enumchron) - optional; anything goes + /^.*$/, + # source - required; NUC/MARC organization code, all upper-case + /^[A-Z]+$/, + # source_bib_num - optional (see note) - no whitespace, anything else + # allowed. 
Note that blank source bib nums are likely a bug in hathifiles + # generation + /^\S*$/, + # oclc_num - optional; zero or more comma-separated numbers + /^(\d+)?(,\d+)*$/, + # hathifiles doesn't validate/normalize what comes out of the record for + # isbn, issn, or lccn + # isbn - optional; no whitespace, anything else goes + /^\S*$/, + # issn - optional; no whitespace, anything else goes + /^\S*$/, + # lccn - optional; no whitespace, anything else goes + /^\S*$/, + # title - optional (see note); anything goes + # Note: currently blank for titles with only a 245$k; hathifiles + # generation should likely be changed to include the k subfield. + /^.*$/, + # imprint - optional; anything goes + /^.*$/, + # rights_reason_code - required; lowercase alphabetical + /^[a-z]+$/, + # rights_timestamp - required; %Y-%m-%d %H:%M:%S + /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/, + # us_gov_doc_flag - required; 0 or 1 + /^[01]$/, + # rights_date_used - required - numeric + /^\d+$/, + # publication place - required, 2 or 3 characters (but can be whitespace) + /^.{2,3}$/, + # lang - optional, at most 3 characters + /^.{0,3}$/, + # bib_fmt - required, uppercase characters + /^[A-Z]+$/, + # collection code - required, uppercase characters + /^[A-Z]+$/, + # content provider - required, lowercase characters + dash + /^[a-z\-]+$/, + # responsible entity code - required, lowercase characters + dash + /^[a-z\-]+$/, + # digitization agent code - required, lowercase characters + dash + /^[a-z\-]+$/, + # access profile code - required, lowercase characters + plus + /^[a-z+]+$/, + # author - optional, anything goes + /^.*$/, + ] + + attr_reader :file, :line_count + + def initialize(file) + super() + @line_count = 0 + @file = file + end + + def run + Zlib::GzipReader.open(file, encoding: 'utf-8').each_line do |line| + @line_count += 1 + # limit of -1 to ensure we don't drop trailing empty fields + fields = line.chomp.split("\t",-1) + + next unless verify_line_field_count(fields) + + 
verify_fields(fields) + end + # open file + # check each line against a regex + # count lines + # also check linecount against corresponding catalog - hathifile must be >= + end + + private + + def verify_fields(fields) + fields.each_with_index do |field,i| + regex = HATHIFILE_FIELD_REGEXES[i] + if !fields[i].match?(regex) + error(message: "Field #{i} at line #{line_count} in #{file} ('#{field}') does not match #{regex}") + end + end + end + + def verify_line_field_count(fields) + if fields.count == HATHIFILE_FIELDS_COUNT + true + else + error(message: "Line #{line_count} in #{file} has only #{fields.count} columns, expected #{HATHIFILE_FIELDS_COUNT}") + false + end + end + end + class HathifilesVerifier < Verifier attr_reader :current_date @@ -27,24 +135,26 @@ def run_for_date(date:) # Verify: # readable # TODO: line count must be > than corresponding catalog file - # TODO: regex format def verify_hathifile_presence(date: current_date) update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: date) verify_file(path: update_file) - verify_hathifile_contents(update_file) + verify_hathifile_contents(path: update_file) if date.first_of_month? full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: date) verify_file(path: full_file) - verify_hathifile_contents(full_file) + verify_hathifile_contents(path: full_file) end end - def verify_hathifile_contents(file) - # open file - # check each line against a regex - # count lines - # also check linecount against corresponding catalog - hathifile must be >= + def verify_hathifile_contents(path: ) + verifier = HathifileContentsVerifier.new(path) + verifier.run + # FIXME: could be inefficient if verifier.errors is very long; + # unnecessary except for testing. Would be better to test + # HathifilesContentsVerifier directly. 
+ @errors.append(*verifier.errors) end + end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 6b7d6cb..24211a5 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -152,6 +152,13 @@ def setup_test_files(date:) end end +# Returns the full path to the given fixture file. +# +# @param file [String] +def fixture(file) + File.join(File.dirname(__FILE__),"fixtures",file) +end + # The following RSpec boilerplate tends to recur across HathiTrust Ruby test suites. RSpec.configure do |config| # rspec-expectations config goes here. You can use an alternate diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index 7f18018..b812702 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -12,9 +12,6 @@ module PostZephirProcessing with_test_environment { example.run } end - def nil - end - describe "#verify_hathifiles_count" do context "with a catalog json file with 5 records" do it "accepts a hathifile with 5 records" @@ -24,8 +21,6 @@ def nil end end - HATHIFILES_SAMPLE_LINE = "mdp.39015002678202 deny ic 000000371 MIU 990000003710106381 36054 69011382 Short fiction; a critical collection, edited by James R. Frakes [and] Isadore Traschen. Prentice-Hall [c1969] bib 2008-06-01 09:30:17 0 1969 nju eng BK MIU umich umich google google Frakes, James R." - HATHIFILES_FIELDS = [ { name: 'htid', @@ -54,7 +49,6 @@ def nil { name: 'description', good: 'Jun-Oct 1927', - bad: nil, optional: true, }, { @@ -84,26 +78,22 @@ def nil { name: 'isbn', good: '9789679430011,9679430014', - bad: nil, optional: true, }, { name: 'issn', good: '0084-9499,00113344', - bad: nil, optional: true, }, { name: 'lccn', good: '', - bad: nil, optional: true, }, { name: 'title', good: '', - bad: nil, # this can be empty if the record only has a 245$k. that's probably a bug in the # hathifiles which we should fix. 
optional: true @@ -111,7 +101,6 @@ def nil { name: 'imprint', good: 'Pergamon Press [1969]', - bad: nil, optional: true }, { @@ -142,7 +131,7 @@ def nil name: 'pub_place', good: 'miu', bad: 'not a publication place', - optional: true + optional: false }, { name: 'lang', @@ -189,24 +178,56 @@ def nil { name: 'author', good: 'Chaucer, Geoffrey, -1400.', - bad: nil, optional: true } ] - describe "#verify_hathifiles_contents" do - it "accepts a file with #{HATHIFILES_FIELDS.count} columns per line" - it "rejects a file where some lines have less than #{HATHIFILES_FIELDS.count} tab-separated columns" + describe "#verify_hathifile_contents" do + let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } + let(:sample_fields) { sample_line.split("\t") } + + it "accepts a file with a single real hathifiles entry" do + expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) + end - HATHIFILES_FIELDS.each do |field| - it "accepts a file with #{field[:name]} matching the regex" + it "rejects a file where some lines have less than #{HATHIFILES_FIELDS.count} tab-separated columns" do + contents = sample_line + "mdp.35112100003484\tdeny\n" + expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) + end - it "rejects a file with #{field[:name]} not matching the regex" + HATHIFILES_FIELDS.each_with_index do |field,i| + it "accepts a file with #{field[:name]} matching the regex" do + sample_fields[i] = field[:good] + contents = sample_fields.join("\t") + + expect_ok(:verify_hathifile_contents, contents, gzipped: true) + end + + if(field.has_key?(:bad)) + it "rejects a file with #{field[:name]} not matching the regex" do + sample_fields[i] = field[:bad] + contents = sample_fields.join("\t") + + expect_not_ok(:verify_hathifile_contents, contents, + errmsg: /Field #{i}.*does not match/, gzipped: true) + end + end if(field[:optional]) - it "accepts a file with empty #{field[:name]}" + it "accepts a file with 
empty #{field[:name]}" do + sample_fields[i] = "" + contents = sample_fields.join("\t") + + expect_ok(:verify_hathifile_contents, contents, gzipped: true) + end else - it "rejects a file with empty #{field[:name]}" + it "rejects a file with empty #{field[:name]}" do + sample_fields[i] = "" + contents = sample_fields.join("\t") + + expect_not_ok(:verify_hathifile_contents, contents, + errmsg: /Field #{i}.*does not match/, gzipped: true) + end end end From d85c1df9e74a09dc7bbfead7617d75a8e15401fa Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 4 Dec 2024 17:32:24 -0500 Subject: [PATCH 020/114] fixup: run standardrb --- lib/verifier/hathifiles_verifier.rb | 31 +++-- spec/spec_helper.rb | 6 +- spec/unit/hathifiles_verifier_spec.rb | 186 +++++++++++++------------- 3 files changed, 110 insertions(+), 113 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 84b9487..d255aa7 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -14,25 +14,25 @@ class HathifileContentsVerifier < Verifier HATHIFILE_FIELDS_COUNT = 26 HATHIFILE_FIELD_REGEXES = [ # htid - required; lowercase alphanumeric namespace, period, non-whitespace ID - /^[a-z0-9]{2,4}\.\S+$/, + /^[a-z0-9]{2,4}\.\S+$/, # access - required; allow or deny - /^(allow|deny)$/, + /^(allow|deny)$/, # rights - required; lowercase alphanumeric plus dash and period - /^[a-z0-9\-.]+$/, + /^[a-z0-9\-.]+$/, # ht_bib_key - required; 9 digits - /^\d{9}$/, + /^\d{9}$/, # description (enumchron) - optional; anything goes - /^.*$/, + /^.*$/, # source - required; NUC/MARC organization code, all upper-case - /^[A-Z]+$/, + /^[A-Z]+$/, # source_bib_num - optional (see note) - no whitespace, anything else # allowed. 
Note that blank source bib nums are likely a bug in hathifiles # generation - /^\S*$/, + /^\S*$/, # oclc_num - optional; zero or more comma-separated numbers /^(\d+)?(,\d+)*$/, # hathifiles doesn't validate/normalize what comes out of the record for - # isbn, issn, or lccn + # isbn, issn, or lccn # isbn - optional; no whitespace, anything else goes /^\S*$/, # issn - optional; no whitespace, anything else goes @@ -70,8 +70,8 @@ class HathifileContentsVerifier < Verifier # access profile code - required, lowercase characters + plus /^[a-z+]+$/, # author - optional, anything goes - /^.*$/, - ] + /^.*$/ + ] attr_reader :file, :line_count @@ -82,10 +82,10 @@ def initialize(file) end def run - Zlib::GzipReader.open(file, encoding: 'utf-8').each_line do |line| + Zlib::GzipReader.open(file, encoding: "utf-8").each_line do |line| @line_count += 1 # limit of -1 to ensure we don't drop trailing empty fields - fields = line.chomp.split("\t",-1) + fields = line.chomp.split("\t", -1) next unless verify_line_field_count(fields) @@ -100,7 +100,7 @@ def run private def verify_fields(fields) - fields.each_with_index do |field,i| + fields.each_with_index do |field, i| regex = HATHIFILE_FIELD_REGEXES[i] if !fields[i].match?(regex) error(message: "Field #{i} at line #{line_count} in #{file} ('#{field}') does not match #{regex}") @@ -109,7 +109,7 @@ def verify_fields(fields) end def verify_line_field_count(fields) - if fields.count == HATHIFILE_FIELDS_COUNT + if fields.count == HATHIFILE_FIELDS_COUNT true else error(message: "Line #{line_count} in #{file} has only #{fields.count} columns, expected #{HATHIFILE_FIELDS_COUNT}") @@ -147,7 +147,7 @@ def verify_hathifile_presence(date: current_date) end end - def verify_hathifile_contents(path: ) + def verify_hathifile_contents(path:) verifier = HathifileContentsVerifier.new(path) verifier.run # FIXME: could be inefficient if verifier.errors is very long; @@ -155,6 +155,5 @@ def verify_hathifile_contents(path: ) # HathifilesContentsVerifier 
directly. @errors.append(*verifier.errors) end - end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 24211a5..e41d6b0 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -62,7 +62,7 @@ def with_temp_file(contents, gzipped: false) else tmpfile.write(contents) end - tmpfile.close() + tmpfile.close yield tmpfile.path end end @@ -154,9 +154,9 @@ def setup_test_files(date:) # Returns the full path to the given fixture file. # -# @param file [String] +# @param file [String] def fixture(file) - File.join(File.dirname(__FILE__),"fixtures",file) + File.join(File.dirname(__FILE__), "fixtures", file) end # The following RSpec boilerplate tends to recur across HathiTrust Ruby test suites. diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index b812702..2d95499 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -21,163 +21,163 @@ module PostZephirProcessing end end - HATHIFILES_FIELDS = [ + hathifiles_fields = [ { - name: 'htid', - good: 'mdp.39015031446076', - bad: 'this is not an id', - optional: false, + name: "htid", + good: "mdp.39015031446076", + bad: "this is not an id", + optional: false }, { - name: 'access', - good: 'deny', - bad: 'nope', - optional: false, + name: "access", + good: "deny", + bad: "nope", + optional: false }, { - name: 'rights', - good: 'ic', - bad: 'In Copyright', - optional: false, + name: "rights", + good: "ic", + bad: "In Copyright", + optional: false }, { - name: 'ht_bib_key', - good: '000000400', - bad: 'not a bib key', - optional: false, + name: "ht_bib_key", + good: "000000400", + bad: "not a bib key", + optional: false }, { - name: 'description', - good: 'Jun-Oct 1927', - optional: true, + name: "description", + good: "Jun-Oct 1927", + optional: true }, { - name: 'source', - good: 'MIU', - bad: 'this is not a NUC code', - optional: false, + name: "source", + good: "MIU", + bad: "this is not a NUC code", + optional: false }, { - 
name: 'source_bib_num', - good: '990000003710106381', - bad: 'this is not a source bib num', + name: "source_bib_num", + good: "990000003710106381", + bad: "this is not a source bib num", # this can be empty if the record has an sdrnum like sdr-osu(OCoLC)6776655 which the regex at https://github.com/hathitrust/hathifiles/blob/af5e4ff682fb81165e6232a1151cfbeeacfdfd21/lib/bib_record.rb#L160C34-L160C50 doesn't match, probably a bug in hathifiles - optional: true, + optional: true }, { - name: 'oclc_num', - good: '217079596,55322', - bad: 'this is not an OCLC number', - optional: true, + name: "oclc_num", + good: "217079596,55322", + bad: "this is not an OCLC number", + optional: true }, # isbn, issn, lccn come straight from the record w/o additional # validation in hathifiles, probably not worth doing add'l validation # here { - name: 'isbn', - good: '9789679430011,9679430014', - optional: true, + name: "isbn", + good: "9789679430011,9679430014", + optional: true }, { - name: 'issn', - good: '0084-9499,00113344', - optional: true, + name: "issn", + good: "0084-9499,00113344", + optional: true }, { - name: 'lccn', - good: '', - optional: true, + name: "lccn", + good: "", + optional: true }, { - name: 'title', - good: '', + name: "title", + good: "", # this can be empty if the record only has a 245$k. that's probably a bug in the # hathifiles which we should fix. 
optional: true }, { - name: 'imprint', - good: 'Pergamon Press [1969]', + name: "imprint", + good: "Pergamon Press [1969]", optional: true }, { - name: 'rights_reason_code', - good: 'bib', - bad: 'not a reason code', + name: "rights_reason_code", + good: "bib", + bad: "not a reason code", optional: false }, { - name: 'rights_timestamp', - good: '2008-06-01 09:30:17', - bad: 'last thursday', + name: "rights_timestamp", + good: "2008-06-01 09:30:17", + bad: "last thursday", optional: false }, { - name: 'us_gov_doc_flag', - good: '0', - bad: 'not a gov doc flag', + name: "us_gov_doc_flag", + good: "0", + bad: "not a gov doc flag", optional: false }, { - name: 'rights_date_used', - good: '1987', - bad: 'this is not a year', + name: "rights_date_used", + good: "1987", + bad: "this is not a year", optional: false }, { - name: 'pub_place', - good: 'miu', - bad: 'not a publication place', + name: "pub_place", + good: "miu", + bad: "not a publication place", optional: false }, { - name: 'lang', - good: 'eng', - bad: 'not a language code', + name: "lang", + good: "eng", + bad: "not a language code", optional: true }, { - name: 'bib_fmt', - good: 'BK', - bad: 'not a bib fmt', + name: "bib_fmt", + good: "BK", + bad: "not a bib fmt", optional: false }, { - name: 'collection_code', - good: 'MIU', - bad: 'not a collection code', + name: "collection_code", + good: "MIU", + bad: "not a collection code", optional: false }, { - name: 'content_provider_code', - good: 'umich', - bad: 'not an inst id', - optional: false, + name: "content_provider_code", + good: "umich", + bad: "not an inst id", + optional: false }, { - name: 'responsible_entity_code', - good: 'umich', - bad: 'not an inst id', + name: "responsible_entity_code", + good: "umich", + bad: "not an inst id", optional: false }, { - name: 'digitization_agent_code', - good: 'google', - bad: 'not an inst id', + name: "digitization_agent_code", + good: "google", + bad: "not an inst id", optional: false }, { - name: 
'access_profile_code', - good: 'open', - bad: 'not an access profile', + name: "access_profile_code", + good: "open", + bad: "not an access profile", optional: false }, { - name: 'author', - good: 'Chaucer, Geoffrey, -1400.', + name: "author", + good: "Chaucer, Geoffrey, -1400.", optional: true } ] @@ -190,12 +190,12 @@ module PostZephirProcessing expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) end - it "rejects a file where some lines have less than #{HATHIFILES_FIELDS.count} tab-separated columns" do + it "rejects a file where some lines have less than #{hathifiles_fields.count} tab-separated columns" do contents = sample_line + "mdp.35112100003484\tdeny\n" expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) end - HATHIFILES_FIELDS.each_with_index do |field,i| + hathifiles_fields.each_with_index do |field, i| it "accepts a file with #{field[:name]} matching the regex" do sample_fields[i] = field[:good] contents = sample_fields.join("\t") @@ -203,17 +203,17 @@ module PostZephirProcessing expect_ok(:verify_hathifile_contents, contents, gzipped: true) end - if(field.has_key?(:bad)) + if field.has_key?(:bad) it "rejects a file with #{field[:name]} not matching the regex" do sample_fields[i] = field[:bad] contents = sample_fields.join("\t") - expect_not_ok(:verify_hathifile_contents, contents, - errmsg: /Field #{i}.*does not match/, gzipped: true) + expect_not_ok(:verify_hathifile_contents, contents, + errmsg: /Field #{i}.*does not match/, gzipped: true) end end - if(field[:optional]) + if field[:optional] it "accepts a file with empty #{field[:name]}" do sample_fields[i] = "" contents = sample_fields.join("\t") @@ -225,17 +225,15 @@ module PostZephirProcessing sample_fields[i] = "" contents = sample_fields.join("\t") - expect_not_ok(:verify_hathifile_contents, contents, - errmsg: /Field #{i}.*does not match/, gzipped: true) + expect_not_ok(:verify_hathifile_contents, contents, + errmsg: /Field #{i}.*does not 
match/, gzipped: true) end end end - end describe "#catalog_file_for" do it "computes a source catalog file based on date - 1" end - end end From 2c231d6662e2805a2bb8ef7e8b447f54d0b83baf Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 5 Dec 2024 09:39:28 -0500 Subject: [PATCH 021/114] fixup: add hathifile fixture --- spec/fixtures/sample_hathifile_line.txt | 1 + 1 file changed, 1 insertion(+) create mode 100644 spec/fixtures/sample_hathifile_line.txt diff --git a/spec/fixtures/sample_hathifile_line.txt b/spec/fixtures/sample_hathifile_line.txt new file mode 100644 index 0000000..8099849 --- /dev/null +++ b/spec/fixtures/sample_hathifile_line.txt @@ -0,0 +1 @@ +mdp.39015076059727 allow pd 002252764 v.27 MIU 990022527640106381 57016855 Shi ji ping lin / [Ling Zhilong ji] Liu Hongnian jiao.,史記評林 [凌稚隆輯] 劉鴻年校. Pei lan tang cang ban, Guangxu jia shen [1884], 佩蘭堂藏板 光緖甲申 [1884] bib 2021-01-20 16:38:56 0 1884 cc chi BK MIU umich umich google google Sima, Qian, approximately 145 B.C.-approximately 86 B.C., 司馬遷, approximately 145-approximately 86 B. C. 
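Note on the field-count check in PATCH 020: `HathifileContentsVerifier#run` splits each line with `split("\t", -1)` so that trailing empty fields (for example a blank `author`) still count toward the 26 expected columns. The following is a minimal sketch of that behavior, not code from the repository. The constant `FIELD_COUNT`, the helper `split_fields`, and the four-column sample line are illustrative assumptions.

```ruby
# Sketch only: FIELD_COUNT, split_fields, and the sample line are hypothetical,
# chosen to keep the example short (real hathifiles have 26 columns).
FIELD_COUNT = 4

# Split a tab-separated line, keeping trailing empty fields.
# A negative limit tells String#split not to drop trailing empty strings.
def split_fields(line)
  line.chomp.split("\t", -1)
end

# Two populated fields followed by two empty ones.
line = "mdp.123\tallow\t\t\n"

default_split = line.chomp.split("\t") # trailing empty fields are dropped
full_split = split_fields(line)        # trailing empty fields are kept

puts default_split.length              # prints 2
puts full_split.length                 # prints 4
puts full_split.length == FIELD_COUNT  # prints true
```

Without the `-1` limit, a line whose last columns are empty would be reported as having too few columns even though it is well formed.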
From abcb0e4da06a85d47ad0d640d3ca75dfafb23063 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 5 Dec 2024 11:03:01 -0500 Subject: [PATCH 022/114] - Add Dockerfile/Gemfile support for Sequel - Add `database` to `Services` canister - Add `PopulateRightsVerifier` with unit tests --- Dockerfile | 1 + Gemfile | 2 + Gemfile.lock | 6 +++ bin/verify.rb | 8 +++- lib/services.rb | 18 ++++++++ lib/verifier/populate_rights_verifier.rb | 51 +++++++++++++++++++++ spec/unit/populate_rights_verifier_spec.rb | 52 ++++++++++++++++++++++ 7 files changed, 136 insertions(+), 2 deletions(-) create mode 100644 lib/verifier/populate_rights_verifier.rb create mode 100644 spec/unit/populate_rights_verifier_spec.rb diff --git a/Dockerfile b/Dockerfile index 277faa7..9d7bef1 100644 --- a/Dockerfile +++ b/Dockerfile @@ -20,6 +20,7 @@ RUN apt-get update && apt-get install -y \ libmarc-perl \ libmarc-record-perl \ libmarc-xml-perl \ + libmariadb-dev \ libnet-ssleay-perl \ libtest-output-perl \ libwww-perl \ diff --git a/Gemfile b/Gemfile index 9a3e4da..21421b8 100644 --- a/Gemfile +++ b/Gemfile @@ -4,6 +4,8 @@ source "https://rubygems.org" gem "canister" gem "dotenv" +gem "mysql2" +gem "sequel" gem "zinzout" group :development, :test do diff --git a/Gemfile.lock b/Gemfile.lock index 09d10f7..ca959ee 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -2,6 +2,7 @@ GEM remote: https://rubygems.org/ specs: ast (2.4.2) + bigdecimal (3.1.8) canister (0.9.2) climate_control (1.2.0) coderay (1.1.3) @@ -12,6 +13,7 @@ GEM language_server-protocol (3.17.0.3) lint_roller (1.1.0) method_source (1.1.0) + mysql2 (0.5.6) parallel (1.26.3) parser (3.3.5.0) ast (~> 2.4.1) @@ -53,6 +55,8 @@ GEM rubocop (>= 1.48.1, < 2.0) rubocop-ast (>= 1.31.1, < 2.0) ruby-progressbar (1.13.0) + sequel (5.87.0) + bigdecimal simplecov (0.22.0) docile (~> 1.1) simplecov-html (~> 0.11) @@ -85,8 +89,10 @@ DEPENDENCIES canister climate_control dotenv + mysql2 pry rspec + sequel simplecov simplecov-lcov standardrb diff --git 
a/bin/verify.rb b/bin/verify.rb index dbe855f..7dc60e8 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -2,17 +2,21 @@ # frozen_string_literal: true require "dotenv" +require_relative "../lib/verifier/hathifiles_verifier" +require_relative "../lib/verifier/populate_rights_verifier" require_relative "../lib/verifier/post_zephir_verifier" Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) [ - PostZephirProcessing::PostZephirVerifier + PostZephirProcessing::PostZephirVerifier, + PostZephirProcessing::PopulateRightsVerifier, + PostZephirProcessing::HathifilesVerifier ].each do |klass| begin klass.new.run # Very simple minded exception handler so we can in theory check subsequent workflow steps rescue StandardError => e - Services[:logger].fatal e + PostZephirProcessing::Services[:logger].fatal e end end diff --git a/lib/services.rb b/lib/services.rb index 38aa6bc..55764e0 100644 --- a/lib/services.rb +++ b/lib/services.rb @@ -2,6 +2,8 @@ require "canister" require "logger" +require "sequel" +require "yaml" module PostZephirProcessing Services = Canister.new @@ -9,4 +11,20 @@ module PostZephirProcessing Services.register(:logger) do Logger.new($stdout, level: ENV.fetch("POST_ZEPHIR_PROCESSING_LOGGER_LEVEL", Logger::WARN).to_i) end + + # Read-only connection to database for verifying rights DB vs .rights files + # Would prefer to populate these values from ENV for consistency with other Ruby + # code running in the workflow but this suffices for now. 
+ Services.register(:database) do + database_yaml = File.join(ENV.fetch("ROOTDIR"), "config", "database.yml") + yaml_data = YAML.load_file(database_yaml) + Sequel.connect( + adapter: "mysql2", + user: yaml_data["user"], + password: yaml_data["password"], + host: yaml_data["hostname"], + database: yaml_data["dbname"], + encoding: "utf8mb4" + ) + end end diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb new file mode 100644 index 0000000..c6b6c3b --- /dev/null +++ b/lib/verifier/populate_rights_verifier.rb @@ -0,0 +1,51 @@ +# frozen_string_literal: true + +require_relative "../verifier" +require_relative "../derivatives" + +module PostZephirProcessing + # The PostZephirVerifier checks for the existence and readability of the .rights files. + # This class is responsible for verifying that the content has been addressed properly. + + # Specifically, we can make sure each HTID in the rights file(s) for the reference date + # exists in ht_rights.rights_current. + + # CRMS, licensing, takedowns, etc can prevent .rights entries from being inserted; + # however, each HTID must exist in the database regardless of whether this particular + # run has made a change. + + # We may also look for errors in the output logs (postZephir.pm and/or populate_rights_data.pl?) + # but that is out of scope for now. + class PopulateRightsVerifier < Verifier + FULL_RIGHTS_TEMPLATE = "zephir_full_YYYYMMDD.rights" + UPD_RIGHTS_TEMPLATE = "zephir_upd_YYYYMMDD.rights" + + def run_for_date(date:) + upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: UPD_RIGHTS_TEMPLATE, date: date) + verify_rights_file(path: upd_path) + + if date.last_of_month? + full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: FULL_RIGHTS_TEMPLATE, date: date) + verify_rights_file(path: full_path) + end + end + + # Check each entry in the .rights file for an entry in `rights_current`. + # FIXME: this is likely to be very inefficient. 
+ # Should accumulate a batch of HTIDs to query all in one go. + # See HathifileWriter#batch_extract_rights for a usable Sequel construct. + def verify_rights_file(path:) + db = Services[:database] + File.open(path) do |infile| + infile.each_line do |line| + line.strip! + htid = line.split("\t").first + namespace, id = htid.split(".", 2) + if db[:rights_current].where(namespace: namespace, id: id).count.zero? + error message: "no entry in rights_current for #{htid}" + end + end + end + end + end +end diff --git a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/populate_rights_verifier_spec.rb new file mode 100644 index 0000000..25a020d --- /dev/null +++ b/spec/unit/populate_rights_verifier_spec.rb @@ -0,0 +1,52 @@ +# frozen_string_literal: true + +require "verifier/populate_rights_verifier" + +module PostZephirProcessing + RSpec.describe(PopulateRightsVerifier) do + around(:each) do |example| + with_test_environment { example.run } + end + + let(:test_rights) do + [ + ["a.123", "ic", "bib", "bibrights", "aa"].join("\t") + ].join("\n") + end + + # Temporarily add `htid` to `rights_current` with reasonable (and irrelevant) default values. 
+ def with_fake_rights_entry(htid:) + namespace, id = htid.split(".", 2) + Services[:database][:rights_current].where(namespace: namespace, id: id).delete + Services[:database][:rights_current].insert( + namespace: namespace, + id: id, + attr: 1, + reason: 1, + source: 1, + access_profile: 1 + ) + begin + yield + ensure + Services[:database][:rights_current].where(namespace: namespace, id: id).delete + end + end + + describe "#verify_rights_file" do + context "with HTID in the Rights Database" do + it "logs no error" do + with_fake_rights_entry(htid: "a.123") do + expect_ok(:verify_rights_file, test_rights) + end + end + end + + context "with HTID not in the Rights Database" do + it "logs an error" do + expect_not_ok(:verify_rights_file, test_rights, errmsg: /no entry/) + end + end + end + end +end From 55656297aa7b7c901af93a3c17328b6bf614f6c8 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 5 Dec 2024 13:56:49 -0500 Subject: [PATCH 023/114] DEV-1414: Test HathifilesVerifier methods directly --- lib/verifier/hathifiles_verifier.rb | 2 - spec/unit/hathifiles_verifier_spec.rb | 408 +++++++++++++------------- 2 files changed, 209 insertions(+), 201 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index d255aa7..563e41b 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -97,8 +97,6 @@ def run # also check linecount against corresponding catalog - hathifile must be >= end - private - def verify_fields(fields) fields.each_with_index do |field, i| regex = HATHIFILE_FIELD_REGEXES[i] diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index 2d95499..21ec67b 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -7,233 +7,243 @@ require "logger" module PostZephirProcessing - RSpec.describe(HathifilesVerifier) do + RSpec.describe "hathifiles verification" do around(:each) do |example| 
with_test_environment { example.run } end - describe "#verify_hathifiles_count" do - context "with a catalog json file with 5 records" do - it "accepts a hathifile with 5 records" - it "accepts a hathifile with 10 records" - it "rejects a hathifile with 4 records" - it "rejects a hathifile with no records" + let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } + + describe(HathifilesVerifier) do + describe "#verify_hathifiles_count" do + context "with a catalog json file with 5 records" do + it "accepts a hathifile with 5 records" + it "accepts a hathifile with 10 records" + it "rejects a hathifile with 4 records" + it "rejects a hathifile with no records" + end end - end - hathifiles_fields = [ - { - name: "htid", - good: "mdp.39015031446076", - bad: "this is not an id", - optional: false - }, - { - name: "access", - good: "deny", - bad: "nope", - optional: false - }, - { - name: "rights", - good: "ic", - bad: "In Copyright", - optional: false - }, - { - name: "ht_bib_key", - good: "000000400", - bad: "not a bib key", - optional: false - }, - { - name: "description", - good: "Jun-Oct 1927", - optional: true - }, - { - name: "source", - good: "MIU", - bad: "this is not a NUC code", - optional: false - }, - { - name: "source_bib_num", - good: "990000003710106381", - bad: "this is not a source bib num", - # this can be empty if the record has an sdrnum like sdr-osu(OCoLC)6776655 which the regex at https://github.com/hathitrust/hathifiles/blob/af5e4ff682fb81165e6232a1151cfbeeacfdfd21/lib/bib_record.rb#L160C34-L160C50 doesn't match, probably a bug in hathifiles - optional: true - }, - - { - name: "oclc_num", - good: "217079596,55322", - bad: "this is not an OCLC number", - optional: true - }, - - # isbn, issn, lccn come straight from the record w/o additional - # validation in hathifiles, probably not worth doing add'l validation - # here - { - name: "isbn", - good: "9789679430011,9679430014", - optional: true - }, - { - name: 
"issn", - good: "0084-9499,00113344", - optional: true - }, - { - name: "lccn", - good: "", - optional: true - }, - - { - name: "title", - good: "", - # this can be empty if the record only has a 245$k. that's probably a bug in the - # hathifiles which we should fix. - optional: true - }, - { - name: "imprint", - good: "Pergamon Press [1969]", - optional: true - }, - { - name: "rights_reason_code", - good: "bib", - bad: "not a reason code", - optional: false - }, - { - name: "rights_timestamp", - good: "2008-06-01 09:30:17", - bad: "last thursday", - optional: false - }, - { - name: "us_gov_doc_flag", - good: "0", - bad: "not a gov doc flag", - optional: false - }, - { - name: "rights_date_used", - good: "1987", - bad: "this is not a year", - optional: false - }, - { - name: "pub_place", - good: "miu", - bad: "not a publication place", - optional: false - }, - { - name: "lang", - good: "eng", - bad: "not a language code", - optional: true - }, - { - name: "bib_fmt", - good: "BK", - bad: "not a bib fmt", - optional: false - }, - { - name: "collection_code", - good: "MIU", - bad: "not a collection code", - optional: false - }, - { - name: "content_provider_code", - good: "umich", - bad: "not an inst id", - optional: false - }, - { - name: "responsible_entity_code", - good: "umich", - bad: "not an inst id", - optional: false - }, - { - name: "digitization_agent_code", - good: "google", - bad: "not an inst id", - optional: false - }, - { - name: "access_profile_code", - good: "open", - bad: "not an access profile", - optional: false - }, - { - name: "author", - good: "Chaucer, Geoffrey, -1400.", - optional: true - } - ] - - describe "#verify_hathifile_contents" do - let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } - let(:sample_fields) { sample_line.split("\t") } + describe "#verify_hathifile_contents" do + it "accepts a file with a single real hathifiles entry" do + expect_ok(:verify_hathifile_contents, sample_line, gzipped: 
true) + end + + it "rejects a file where some lines have less than 26 tab-separated columns" do + contents = sample_line + "mdp.35112100003484\tdeny\n" + expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) + end + end - it "accepts a file with a single real hathifiles entry" do - expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) + describe "#catalog_file_for" do + it "computes a source catalog file based on date - 1" end + end - it "rejects a file where some lines have less than #{hathifiles_fields.count} tab-separated columns" do - contents = sample_line + "mdp.35112100003484\tdeny\n" - expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) + describe(HathifileContentsVerifier) do + around(:each) do |example| + with_test_environment { example.run } end + hathifiles_fields = [ + { + name: "htid", + good: "mdp.39015031446076", + bad: "this is not an id", + optional: false + }, + { + name: "access", + good: "deny", + bad: "nope", + optional: false + }, + { + name: "rights", + good: "ic", + bad: "In Copyright", + optional: false + }, + { + name: "ht_bib_key", + good: "000000400", + bad: "not a bib key", + optional: false + }, + { + name: "description", + good: "Jun-Oct 1927", + optional: true + }, + { + name: "source", + good: "MIU", + bad: "this is not a NUC code", + optional: false + }, + { + name: "source_bib_num", + good: "990000003710106381", + bad: "this is not a source bib num", + # this can be empty if the record has an sdrnum like sdr-osu(OCoLC)6776655 which the regex at https://github.com/hathitrust/hathifiles/blob/af5e4ff682fb81165e6232a1151cfbeeacfdfd21/lib/bib_record.rb#L160C34-L160C50 doesn't match, probably a bug in hathifiles + optional: true + }, + + { + name: "oclc_num", + good: "217079596,55322", + bad: "this is not an OCLC number", + optional: true + }, + + # isbn, issn, lccn come straight from the record w/o additional + # validation in hathifiles, probably 
not worth doing add'l validation + # here + { + name: "isbn", + good: "9789679430011,9679430014", + optional: true + }, + { + name: "issn", + good: "0084-9499,00113344", + optional: true + }, + { + name: "lccn", + good: "", + optional: true + }, + + { + name: "title", + good: "", + # this can be empty if the record only has a 245$k. that's probably a bug in the + # hathifiles which we should fix. + optional: true + }, + { + name: "imprint", + good: "Pergamon Press [1969]", + optional: true + }, + { + name: "rights_reason_code", + good: "bib", + bad: "not a reason code", + optional: false + }, + { + name: "rights_timestamp", + good: "2008-06-01 09:30:17", + bad: "last thursday", + optional: false + }, + { + name: "us_gov_doc_flag", + good: "0", + bad: "not a gov doc flag", + optional: false + }, + { + name: "rights_date_used", + good: "1987", + bad: "this is not a year", + optional: false + }, + { + name: "pub_place", + good: "miu", + bad: "not a publication place", + optional: false + }, + { + name: "lang", + good: "eng", + bad: "not a language code", + optional: true + }, + { + name: "bib_fmt", + good: "BK", + bad: "not a bib fmt", + optional: false + }, + { + name: "collection_code", + good: "MIU", + bad: "not a collection code", + optional: false + }, + { + name: "content_provider_code", + good: "umich", + bad: "not an inst id", + optional: false + }, + { + name: "responsible_entity_code", + good: "umich", + bad: "not an inst id", + optional: false + }, + { + name: "digitization_agent_code", + good: "google", + bad: "not an inst id", + optional: false + }, + { + name: "access_profile_code", + good: "open", + bad: "not an access profile", + optional: false + }, + { + name: "author", + good: "Chaucer, Geoffrey, -1400.", + optional: true + } + ] + + let(:sample_fields) { sample_line.split("\t") } + # testing a method inside, need to provide a fake filename + # for the constructor + let(:verifier) { described_class.new("/tmp/nonexistent.txt") } + 
hathifiles_fields.each_with_index do |field, i| it "accepts a file with #{field[:name]} matching the regex" do sample_fields[i] = field[:good] - contents = sample_fields.join("\t") - expect_ok(:verify_hathifile_contents, contents, gzipped: true) + verifier.verify_fields(sample_fields) + expect(verifier.errors).to be_empty end if field.has_key?(:bad) it "rejects a file with #{field[:name]} not matching the regex" do sample_fields[i] = field[:bad] - contents = sample_fields.join("\t") - expect_not_ok(:verify_hathifile_contents, contents, - errmsg: /Field #{i}.*does not match/, gzipped: true) + verifier.verify_fields(sample_fields) + expect(verifier.errors).to include(/Field #{i}.*does not match/) end - end - if field[:optional] - it "accepts a file with empty #{field[:name]}" do - sample_fields[i] = "" - contents = sample_fields.join("\t") + if field[:optional] + it "accepts a file with empty #{field[:name]}" do + sample_fields[i] = "" - expect_ok(:verify_hathifile_contents, contents, gzipped: true) - end - else - it "rejects a file with empty #{field[:name]}" do - sample_fields[i] = "" - contents = sample_fields.join("\t") + verifier.verify_fields(sample_fields) + expect(verifier.errors).to be_empty + end + else + it "rejects a file with empty #{field[:name]}" do + sample_fields[i] = "" - expect_not_ok(:verify_hathifile_contents, contents, - errmsg: /Field #{i}.*does not match/, gzipped: true) + verifier.verify_fields(sample_fields) + expect(verifier.errors).to include(/Field #{i}.*does not match/) + end end end end end - - describe "#catalog_file_for" do - it "computes a source catalog file based on date - 1" - end end end From a2623bcf1a8fd288b05fa3f3a8d15f98b6e16b61 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 5 Dec 2024 14:59:58 -0500 Subject: [PATCH 024/114] DEV-1414: compute catalog file for given hathifile name --- lib/verifier/hathifiles_verifier.rb | 32 +++++++++++++++++++++------ spec/unit/hathifiles_verifier_spec.rb | 10 ++++++++- 2 files 
changed, 34 insertions(+), 8 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 563e41b..76405bc 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -133,25 +133,43 @@ def run_for_date(date:) # Verify: # readable # TODO: line count must be > than corresponding catalog file - def verify_hathifile_presence(date: current_date) + def verify_hathifile(date: current_date) update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: date) verify_file(path: update_file) - verify_hathifile_contents(path: update_file) + linecount = verify_hathifile_contents(path: update_file) + verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date)) if date.first_of_month? full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: date) verify_file(path: full_file) - verify_hathifile_contents(path: full_file) + linecount = verify_hathifile_contents(path: full_file) + verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date, full: true)) end end def verify_hathifile_contents(path:) verifier = HathifileContentsVerifier.new(path) verifier.run - # FIXME: could be inefficient if verifier.errors is very long; - # unnecessary except for testing. Would be better to test - # HathifilesContentsVerifier directly. - @errors.append(*verifier.errors) + @errors.append(verifier.errors) + return verifier.line_count end + + def verify_hathifile_linecount(linecount, catalog_path:) + catalog_linecount = Zlib::GzipReader.open(catalog_path).count + end + + def catalog_file_for(date, full: false) + filetype = full ? 
"full" : "upd" + self.class.dated_derivative( + location: :CATALOG_ARCHIVE, + name: "zephir_#{filetype}_YYYYMMDD.json.gz", + date: date - 1 + ) + end + + def errors + super.flatten + end + end end diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index 21ec67b..06ea71d 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -36,7 +36,15 @@ module PostZephirProcessing end describe "#catalog_file_for" do - it "computes a source catalog file based on date - 1" + it "computes a source catalog file based on date - 1" do + expect(described_class.new.catalog_file_for(Date.parse("2023-01-04"))) + .to eq("#{ENV['CATALOG_ARCHIVE']}/zephir_upd_20230103.json.gz") + end + + it "computes a full source catalog file based on date - 1" do + expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"),full: true)) + .to eq("#{ENV['CATALOG_ARCHIVE']}/zephir_full_20241130.json.gz") + end end end From 62c7b63c898574cd0021de2bd503530059d32867 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 5 Dec 2024 16:14:15 -0500 Subject: [PATCH 025/114] DEV-1414: Test hathifile line count * Squelch logging output from tests -- since we now have an 'errors' method to check this in tests To do: integration test for hathifiles --- lib/verifier/hathifiles_verifier.rb | 10 +++--- spec/spec_helper.rb | 5 +++ spec/unit/hathifiles_verifier_spec.rb | 48 +++++++++++++++++++++------ 3 files changed, 48 insertions(+), 15 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 76405bc..becc1a6 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -151,18 +151,21 @@ def verify_hathifile_contents(path:) verifier = HathifileContentsVerifier.new(path) verifier.run @errors.append(verifier.errors) - return verifier.line_count + verifier.line_count end def verify_hathifile_linecount(linecount, catalog_path:) catalog_linecount = 
Zlib::GzipReader.open(catalog_path).count + if linecount < catalog_linecount + error(message: "#{catalog_path} has #{catalog_linecount} records but corresponding hathifile only has #{linecount}") + end end def catalog_file_for(date, full: false) filetype = full ? "full" : "upd" self.class.dated_derivative( - location: :CATALOG_ARCHIVE, - name: "zephir_#{filetype}_YYYYMMDD.json.gz", + location: :CATALOG_ARCHIVE, + name: "zephir_#{filetype}_YYYYMMDD.json.gz", date: date - 1 ) end @@ -170,6 +173,5 @@ def catalog_file_for(date, full: false) def errors super.flatten end - end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index e41d6b0..9ba1366 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -25,6 +25,11 @@ require_relative "../lib/journal" require_relative "../lib/verifier" +# squelch log output from tests +PostZephirProcessing::Services.register(:logger) { + Logger.new(File.open("/dev/null", "w"), level: Logger::DEBUG) +} + def test_journal <<~TEST_YAML --- diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index 06ea71d..aeb7e48 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -15,15 +15,43 @@ module PostZephirProcessing let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } describe(HathifilesVerifier) do - describe "#verify_hathifiles_count" do + describe "#verify_hathifiles_linecount" do context "with a catalog json file with 5 records" do - it "accepts a hathifile with 5 records" - it "accepts a hathifile with 10 records" - it "rejects a hathifile with 4 records" - it "rejects a hathifile with no records" + let(:verifier) { described_class.new } + + around(:each) do |example| + contents = "{}\n" * 5 + with_temp_file(contents, gzipped: true) do |catalog_json_gz| + @catalog_json_gz = catalog_json_gz + example.run + end + end + + it "accepts a hathifile with 5 records" do + verifier.verify_hathifile_linecount(5, 
catalog_path: @catalog_json_gz) + expect(verifier.errors).to be_empty + end + + it "accepts a hathifile with 10 records" do + verifier.verify_hathifile_linecount(10, catalog_path: @catalog_json_gz) + expect(verifier.errors).to be_empty + end + + it "rejects a hathifile with 4 records" do + verifier.verify_hathifile_linecount(4, catalog_path: @catalog_json_gz) + expect(verifier.errors).not_to be_empty + end + + it "rejects a hathifile with no records" do + verifier.verify_hathifile_linecount(0, catalog_path: @catalog_json_gz) + expect(verifier.errors).not_to be_empty + end end end + # the whole enchilada + describe "#verify_hathifile" + describe "#verify_hathifile_contents" do it "accepts a file with a single real hathifiles entry" do expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) @@ -38,12 +66,12 @@ module PostZephirProcessing describe "#catalog_file_for" do it "computes a source catalog file based on date - 1" do expect(described_class.new.catalog_file_for(Date.parse("2023-01-04"))) - .to eq("#{ENV['CATALOG_ARCHIVE']}/zephir_upd_20230103.json.gz") + .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_upd_20230103.json.gz") end it "computes a full source catalog file based on date - 1" do - expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"),full: true)) - .to eq("#{ENV['CATALOG_ARCHIVE']}/zephir_full_20241130.json.gz") + expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"), full: true)) + .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_full_20241130.json.gz") end end end @@ -215,9 +243,7 @@ module PostZephirProcessing ] let(:sample_fields) { sample_line.split("\t") } - # testing a method inside, need to provide a fake filename - # for the constructor - let(:verifier) { described_class.new("/tmp/nonexistent.txt") } + let(:verifier) { described_class.new("/tmp/unused.txt") } hathifiles_fields.each_with_index do |field, i| it "accepts a file with #{field[:name]} matching the regex" do From 4b16ab5f64a687568b6b559be8cdf2f2b75a1601 Mon 
Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 5 Dec 2024 17:18:13 -0500 Subject: [PATCH 026/114] fix regex to disallow uppercase in volid --- lib/verifier/post_zephir_verifier.rb | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index b157db6..f1ed22c 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -126,13 +126,11 @@ def verify_rights(date: current_date) # * exist & be be readable (both covered by verify_rights) # * either be empty, or all its lines must match regex. def verify_rights_file_format(path:) - # A more readable version of: - # /^\w+\.[\w:\/\$\.]+\t(ic|pd|pdus|und)\tbib\tbibrights\t\w+(-\w+)*$/ - regex = /^ \w+ \. [\w:\/\$\.]+ # col 1, namespace.objid - \t (ic|pd|pdus|und) # col 2, one of these - \t bib # col 3, exactly this - \t bibrights # col 4, exactly this - \t [a-z]+(-[a-z]+)* # col 5, digitization source, e.g. 'ia', 'cornell-ms' + regex = /^ [a-z0-9]+ \. [a-z0-9:\/\$\.]+ # col 1, namespace.objid + \t (ic|pd|pdus|und) # col 2, one of these + \t bib # col 3, exactly this + \t bibrights # col 4, exactly this + \t [a-z]+(-[a-z]+)* # col 5, digitization source, e.g. 'ia', 'cornell-ms' $/x # This allows an empty file as well, which is possible. 
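Note on the tightened rights-file pattern: the following is a minimal sketch, not the project's code, of how the extended-mode regex from this patch treats individual lines. The constant `RIGHTS_LINE` and the helper `malformed_rights_lines` are hypothetical names introduced only for illustration; the real check lives in `verify_rights_file_format` and reports failures through the verifier's `error` method.

```ruby
# Sketch only: RIGHTS_LINE and malformed_rights_lines are hypothetical names,
# not part of PostZephirVerifier. The pattern mirrors the one in this patch.
RIGHTS_LINE = /^ [a-z0-9]+ \. [a-z0-9:\/\$\.]+  # col 1, namespace.objid, lowercase only
  \t (ic|pd|pdus|und)                           # col 2, one of these
  \t bib                                        # col 3, exactly this
  \t bibrights                                  # col 4, exactly this
  \t [a-z]+(-[a-z]+)*                           # col 5, digitization source
$/x

# Returns the lines that fail the pattern. An empty input yields an empty
# list, which corresponds to the "empty file is allowed" case.
def malformed_rights_lines(contents)
  contents.each_line.map(&:chomp).reject { |line| line.match?(RIGHTS_LINE) }
end
```

Under these assumptions, a line such as `X.X<TAB>ic<TAB>bib<TAB>bibrights<TAB>aa` is reported as malformed because column 1 no longer admits uppercase characters, while an empty file produces no complaints.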
From c59adda5dacbf3911bf173ab23cd06be882690a6 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 5 Dec 2024 17:19:21 -0500 Subject: [PATCH 027/114] dev-1420 dry out some tests --- spec/unit/post_zephir_verifier_spec.rb | 52 +++++++++----------------- 1 file changed, 18 insertions(+), 34 deletions(-) diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 61e11eb..73bade7 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -125,28 +125,16 @@ def expect_deletefile_ok(contents) expect_ok(:verify_rights_file_format, contents) end - it "rejects a file with malformed volume id" do - cols_2_to_5 = ["ic", "bib", "bibrights", "aa"].join("\t") - expect_not_ok( - :verify_rights_file_format, - ["", cols_2_to_5].join("\t"), - errmsg: /Rights file .+ contains malformed line/ - ) - expect_not_ok( - :verify_rights_file_format, - ["x", cols_2_to_5].join("\t"), - errmsg: /Rights file .+ contains malformed line/ - ) - expect_not_ok( - :verify_rights_file_format, - ["x.", cols_2_to_5].join("\t"), - errmsg: /Rights file .+ contains malformed line/ - ) - expect_not_ok( - :verify_rights_file_format, - [".x", cols_2_to_5].join("\t"), - errmsg: /Rights file .+ contains malformed line/ - ) + volids_not_ok = ["", "x", "x.", ".x", "X.X"] + line_end = ["ic", "bib", "bibrights", "aa"].join("\t") + volids_not_ok.each do |volid| + it "rejects a file with malformed volume id" do + expect_not_ok( + :verify_rights_file_format, + [volid, line_end].join("\t"), + errmsg: /Rights file .+ contains malformed line/ + ) + end end it "rejects a file with malformed rights" do @@ -182,24 +170,20 @@ def expect_deletefile_ok(contents) expect_not_ok(:verify_rights_file_format, cols.join("\t")) end - it "rejects a file with malformed digitization source" do + it "accepts a file with OK digitization source" do cols = ["a.1", "ic", "bib", "bibrights", "aa"] expect_ok(:verify_rights_file_format, cols.join("\t")) cols[4] = 
"aa-aa" expect_ok(:verify_rights_file_format, cols.join("\t")) + end - cols[4] = "-aa" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) - - cols[4] = "aa-" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) - - cols[4] = "AA" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) - - cols[4] = "" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) + not_ok_dig_src = ["", "-aa", "aa-", "AA"] + line_start = ["a.1", "ic", "bib", "bibrights"].join("\t") + not_ok_dig_src.each do |dig_src| + it "rejects a file with malformed digitization source (#{dig_src})" do + expect_not_ok(:verify_rights_file_format, [line_start, dig_src].join("\t")) + end end end end From 4c2bf0582349dd80e74720333a250558a5981356 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Fri, 6 Dec 2024 16:17:19 -0500 Subject: [PATCH 028/114] DEV-1414: end-to-end hathifile test * clean up some requires in spec * add fixtures for integration testing * allow whitespace in isbn/lccn/issn - empirically observed in e.g. 
https://catalog.hathitrust.org/Record/000000365.marc 010$a and resulting hathifile output --- lib/derivatives.rb | 2 +- lib/verifier/.hathifiles_verifier.rb.swp | Bin 0 -> 16384 bytes lib/verifier/hathifiles_verifier.rb | 29 ++++++++------ spec/fixtures/catalog_archive/README.md | 6 +++ .../zephir_full_20241130_vufind.json.gz | Bin 0 -> 4250 bytes .../zephir_upd_20241130.json.gz | Bin 0 -> 2385 bytes .../zephir_upd_20241201.json.gz | Bin 0 -> 2105 bytes .../zephir_upd_20241202.json.gz | Bin 0 -> 2282 bytes spec/fixtures/hathifile_archive/README.md | 6 +++ .../hathi_upd_20241202.txt.gz | Bin 0 -> 640 bytes .../hathi_upd_20241203.txt.gz | Bin 0 -> 314 bytes spec/integration/hathifiles_verifier_spec.rb | 37 ++++++++++++++++++ spec/spec_helper.rb | 2 + spec/unit/hathifiles_verifier_spec.rb | 1 - spec/unit/post_zephir_verifier_spec.rb | 4 -- spec/unit/verifier_spec.rb | 3 +- 16 files changed, 69 insertions(+), 21 deletions(-) create mode 100644 lib/verifier/.hathifiles_verifier.rb.swp create mode 100644 spec/fixtures/catalog_archive/README.md create mode 100644 spec/fixtures/catalog_archive/zephir_full_20241130_vufind.json.gz create mode 100644 spec/fixtures/catalog_archive/zephir_upd_20241130.json.gz create mode 100644 spec/fixtures/catalog_archive/zephir_upd_20241201.json.gz create mode 100644 spec/fixtures/catalog_archive/zephir_upd_20241202.json.gz create mode 100644 spec/fixtures/hathifile_archive/README.md create mode 100644 spec/fixtures/hathifile_archive/hathi_upd_20241202.txt.gz create mode 100644 spec/fixtures/hathifile_archive/hathi_upd_20241203.txt.gz create mode 100644 spec/integration/hathifiles_verifier_spec.rb diff --git a/lib/derivatives.rb b/lib/derivatives.rb index cc8ebc3..f3b1ed9 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -51,7 +51,7 @@ class Derivatives def self.directory_for(location:) location = location.to_s case location - when "CATALOG_ARCHIVE", "CATALOG_PREP", "INGEST_BIBRECORDS", "RIGHTS_DIR", "ZEPHIR_DATA" + when 
"CATALOG_ARCHIVE", "HATHIFILE_ARCHIVE", "CATALOG_PREP", "INGEST_BIBRECORDS", "RIGHTS_DIR", "ZEPHIR_DATA" ENV.fetch location when "RIGHTS_ARCHIVE" ENV["RIGHTS_ARCHIVE"] || File.join(ENV.fetch("RIGHTS_DIR"), "archive") diff --git a/lib/verifier/.hathifiles_verifier.rb.swp b/lib/verifier/.hathifiles_verifier.rb.swp new file mode 100644 index 0000000000000000000000000000000000000000..ff862d49a9d5c8b98aa94e049028708ff824e750 GIT binary patch literal 16384 zcmeHOO^h5z6>dVDgr5*ZI3Ng8dDh0W@l4Np?L_g|8{4th_DZ|4V>>2^cirjfu9+#e zr@PbDy}PsP0d9dy1j->f5q>28Zv?n-00gl>u|QlbggAhZARq;}AO#Naz3TtnS!Y2x zAn2C9otf^cSFgT$^si%PoTfIC4KHCpKsgr`S#@ZLh{+)^gQ`leN+ro3{(tM3{(tM3{(tM z3{(tM3{(tM3{(tM4Ez@|VB*dF3iQ$`0f6`a>Hhz>_h{O0fL{RL0zM0T2ABiR0(St{ z-=k@-0AB}2z!ES6{NZj*`zi1R;M2e(!2Q5&z+d03Y2O9D23Wvjz%Jn1@6xm{0?z}F z0S^JMy;IX}0AB`P1fBsb;3%L2|9FR{y$lQh19$+~2fTKdrriWS4?F{$1|9_N0^Yb2 zZ38a=0{AHK5#as6ukO&ap8%f%LZAVB0QmVXP5U0O25{gg&;llbIB~Oz!30&CEx*IH}Erz$93Qp;H$t0xB$>Ny`Cm# zwznttCYR{4-8fm{*z$@s#g`p4=4@A0)sEfuGJvKe;v#HF)yr|0Hp zjni|p^QVjpvu9?Xn7w#KEG0!_J#=nbePdOGeM7i5Ut_k%uBk^2Ca$TXHt#ZZ{9|au zb?WiQq^2Ho*G?o(Fw>D9v-;dxW{z;VWPVRIK#HqYvIb11I-iyEeC{$jzLMO@HD$Nz5X0s2wU9+F zPQJ*rDCrOrU?`=^EKsGG7I05Rb*t|Ud|mpE2 z)8>IrJV0BAH_(dZ5&dmeQzsK@ld%DiBsKK(B_H-6ayJk1aT0_+7I-b{F~g=3HDu)T zpdR;_Oh~vQ6s9BAxzZ)E9y1IBBT#y>cC2GN?^fJ+YO$@W`u%%aiFg=W6!yIU2ITpn z@LbbrFw-64#1i=Gd6~Vmu3j;l>-(4X?`bvix)vRHVmSDux&u-;1O@z=NCez&FvnZv zfn`b#Yc>PZ3V9&ee&#z-YKF1c#b#NqEqdtXx=9MbE;-D9r?PER_7nZa%Z?|H6dmvn z!2$;HmdUm}HzZ|2$cli??1YT6!Y7(fm8q~i$KjSDVvCxQ58K#8Z{GJUqB^2ubO&Kc zsauNBH>Mf~N5uhmOjt$9R6;hu5jb|RFv6J%`o{F8`hL`L1bh~atM8ar*<7X>46!JQ z)jJV}-gL>Gt@cF-$A?Pk{BK=dv`h9Fh+aRGhHZx2h$OcQUBy>78sJEtm8o}S>cCR5 zbCEQ9-il#+meF-g>St=nN7xjpmbwONcjM5giNDjyU#7md3xA7}6TLz);4(A^eo6G+ zC!2c*&Am3;d$zrIp}qHFt{`Sy*%}f-nCa4U;Xq6C?MPw1nU3E#J3Iv0P9~4$eLi`` zzz+nBH6P9++syoOJxoU)#6u|iUgX&D!OPqk(imZyyFKs@Pb^%^iJ)KD=WLnDsH6P4 zo+vZH9`!=rW>$n~=7v-owzbTPpO)w!lE>;a_p+QDR`emeLL0%%va7htv1;Ky} zzU6wsKm`WO6q6h#jE#AiKM0OA2muxnflUl&22scSnLJi+&_~HOJj<~R%$E`|>pbvC 
zqk)GA3-JT9DY*~79%95%xC8H=iC(-|5||rJ))&*E@L=o85}+Y;j%`4CQ2@^XcSTYq zGtpdPVkYZEJ($)wONgr(7Vx ze3=hXJ9k@$nF^)(uuu@6Y93x}>f;Z31N7sbyQ>17l z3N|q8ZZnS{Z0?k5a+~En7G_t*J-=w2Jooscm+*M{K|Hq{6RV%I>GAH9arBp$lY`&v zC`-R;1J91I;5d(1`II^>em*-aj^nKCQXd1CcvHAh3aRjWG;1Zg0;3PHmX7@@9)nHV z65ux`vrg5|ojTWsN3bwPv{k^s#9Iv|g4Skmr-2BuIANV3#Jgc1@{*<)5)$MU;Iigg2Tox6q%5v+O!wgCt}P8)0bh-z@6<7vL;7Q;K zpauLLd;Z@8KLEZ0yaap>Fo6#Pf5X0i2zWpb=mKrvUf?g_SA$G0Ir^ zb}j2!l0QU7&Outv3+g1Xp(?dm7l}x0wl{S>D;yV<>ji7od*$@tV>T9P!M z)RmqYP4Q~V$aKRxB z+GA=VqZfVL(pp(>WwpwZa{;iDq!u+7_itP|njNYtnz_~dD$LG&?~u;VbiylQ4J9(0 z=;2zZ=6S(aq*_8pnsu1WLpGBqpt1B70a9z&zN-a$JP$GF3L~(Ek&Lcw?Q%T-1>(DL^uQ-Nnc4B@$zL`8~^+Px**)jvGkL_LI zUz-b3r>CY5O-)Z7(BlE1J$_+E(9;9?)6LOkvqh_h89SaN3e~{CUdZg!BN|IMM2cPN zSv-!3^DVs=DM~=qO`G?zyLJ>yXIuK2-(k2^ghZ+MQvSLS4d=;VCN{`!Vt+rP)K$C= zh-0HlZ_*8VnHk5N5v<9H2Cl?wrZpd_f9$O%Hgad9-a{sdh*1^|3)w3c?~P-XFdK`) z$-j&jrBV69#867Bs^i|%jb|gjOTsp@ahf`bu80vss!vU(ZMmuMOGp{*n^c)j%IGrn NsumK|`$_8izW{Eq^uGWA literal 0 HcmV?d00001 diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index becc1a6..02f2b72 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -33,12 +33,12 @@ class HathifileContentsVerifier < Verifier /^(\d+)?(,\d+)*$/, # hathifiles doesn't validate/normalize what comes out of the record for # isbn, issn, or lccn - # isbn - optional; no whitespace, anything else goes - /^\S*$/, - # issn - optional; no whitespace, anything else goes - /^\S*$/, - # lccn - optional; no whitespace, anything else goes - /^\S*$/, + # isbn - optional; anything goes + /^.*$/, + # issn - optional; anything goes + /^.*$/, + # lccn - optional; anything goes + /^.*$/, # title - optional (see note); anything goes # Note: currently blank for titles with only a 245$k; hathifiles # generation should likely be changed to include the k subfield. 
@@ -135,15 +135,18 @@ def run_for_date(date:) # TODO: line count must be > than corresponding catalog file def verify_hathifile(date: current_date) update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: date) - verify_file(path: update_file) - linecount = verify_hathifile_contents(path: update_file) - verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date)) + if verify_file(path: update_file) + linecount = verify_hathifile_contents(path: update_file) + verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date)) + end - if date.first_of_month? + # first of month + if date.day == 1 full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: date) - verify_file(path: full_file) - linecount = verify_hathifile_contents(path: full_file) - verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date, full: true)) + if verify_file(path: full_file) + linecount = verify_hathifile_contents(path: full_file) + verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date, full: true)) + end end end diff --git a/spec/fixtures/catalog_archive/README.md b/spec/fixtures/catalog_archive/README.md new file mode 100644 index 0000000..dcb7f3f --- /dev/null +++ b/spec/fixtures/catalog_archive/README.md @@ -0,0 +1,6 @@ +Fixtures for testing -- samples from the real files for those dates + +* zephir_full_20241130_vufind.json.gz - 5 real catalog entries +* zephir_upd_20241130.json.gz - 3 real catalog entries +* zephir_upd_20241201.json.gz - 3 real catalog entries +* zephir_upd_20241202.json.gz - 3 real catalog entries diff --git a/spec/fixtures/catalog_archive/zephir_full_20241130_vufind.json.gz b/spec/fixtures/catalog_archive/zephir_full_20241130_vufind.json.gz new file mode 100644 index 0000000000000000000000000000000000000000..924137d6ab06d3f62e07ef05105de2fc00cc8597 GIT binary patch literal 4250 
zcmV;L5M}QliwFn{2vcVO1A1j}XlZg^W_4_AUotQMo&s2DJDUR8jvy1GB4o@hi8X?3^zg2Md%79l(l>JP%7uib_UOEAqcxW^lu^9ek zFD!Ur>4ac&No>QGKlgstb|QZPHwdH9^|!PL9*ln;2yZGmMuXuXj3)(27(6wxPAt^~ zF+L9Emo)kEm$bQ_HK&cuPakd!dqg$U*b*QG0u*-=u&_m_MRZF`SpE|{dhxJUkn!@7 z1Z~sGX_|d#s`{RiTFJk-6z^Z!c0T;@&rmI?zQ<4tr@?3`V&xEm^;EpTy;CMcRjF!d zR_$+00MIDJV?(>CxPQ{4kmL+_Mb-M+s#zIjP> z+aea;C;F%%y}meqE2aKU{(pCUelj|#s=vN{uC!y9ujYkCp=xJZx9y%?oF*1Kfotx~ zl53~%dn-o>gqqOPmI38N*N|K_i{^rNHi%Pl zVD<&|kS_I{z^zye(d&SDk@)p9UksBTxm?2i>k%&J}>-VKPx#8`u;xBsuiL@-_u=dLo2jaD^29i35zow z!uXi_=o^ZO@$EPt`%4x?!b9VHC@)zg9ENg9KNp9T3~Ksp7`y-og)UpO0`Z&UKMWD5 z=4R~&o-Y+R%#|k1XH?Ufio3VjweQ9-zKD#zF}_8=%V|;_Im?A1 zuI1;!lsUc^aWC2qhpL)~P>#JNp9&W7X-?lOQcvFQXj`3gJJrOFm{a1F1%Z(3MrESw z6(zUhl5F;k(W%SgDMu&F{qp(O{~4mRf25-q1a{q*qb={%4r*psyj-$igTNkNc{5jp zb2<09=fK~a74wn@%$uU`c(7!iQZujz%5mm9z^xaBvQKI943?WWu{UFZVBQeTF511MihO8~h^^Y?H>!fZ_XGax=ammkd5&P{0f_i(L@ZmaX`t0KK-(4; zu4$nw=v@E??Qyg~Q~s0izI8hfB$Q=TT>aD`pGiteEIK(aKT^!J>l+5 zIB>a6dUNI=5tZF(dbvBJw!8I1wrB&XFxp6VlLe3GjPRdOh!z3+iFr!_y*#ivFd+W- zFQV9GqD0NynMFP|d)eKk4|B%9L(BnN2*>Cg){wyCQBKdO>;e$pji)2kbusI;}4W<6%z zUL%Jvl(d{(shO|pp8dbp&tF<}3dq`4C>eCG&2DNIWF%N{F+dVDa zJ$r+h^nfLX~=N)Bl==W2%?VdJkt#>CSYeM4+Z|2Ap!X=ecjHZ2XC1}zF{r9xV^ z-MM#zg%TYTO9KyBGja}3;(36M5g|HNRXfL*YelcIv(xjlLpFn=Tbr~$JHGAYGYfrL z*wPKn&O%?4L0wGj2!ZYIme^yej;a1*jY0L5VT7#_F>i@E zR`D3qAB*hLOr0fmJJoq}c0Ia0f1i)hQpCAo9ba4?CpRwKtOCAb0asS;2xDLb=QH#5 z#fONj3c)mA!e zV=!}skS0Suzx>P3g|7(gYTouMYT1OPUk#h{jH&ip@{Hi;QS15gj21SD9Rh+SvWf81 z3}gBOT_7xjLnsXjHm85e$uin&H*cg?g(wFRd_v-jb|k*q(@T|{Rh>89EBjSSR9VJ} zsVAALLom6DWn6YB8L%z%B=yVQF9yofp!kf*+gvqEIddGtW3mH0gVLpGD zGVdcg3z&kgcYDyCw=JXohZ z^gW-%Q?_9;F8nU=H+&(#;K4*h@b)BL#OQ{FQNYl-H}yRp3YN4xV?1E7ENi6o6z?CDT_qZg&zngi%PC+RWFA)_O|<4`iDhRBt@I58YWt~D|Lj8 z5jAg#H74d*qjk}g(;m^3V%(!&tYcqO)v!lHtYe?OEbP%|VWVY!nZnPCFVBS25=x$CXMKL5BXA}=Ezs2I_gHgY8UdAo-}jp7lt{hHfAGu~s?IG8i`huG%pLI94JJ9H;Mu7izRxSBp8_+`VRoLM7WH@ z0)`DYApq_$ViN)HgshMN40t90gUdt{An}F2T*lBI7KS%G2tsGh?xI4y)f^y79M1TR 
zPvGmdFQLpt=Bh5avj7~0V!(CB-ZbDWME@=%0eD)6@G_^N20h6ocH#%cQmNMf!<_k- z|MGxX0pL4w^qK%uBiL`Xu(c3Fq!FtO{)Ou*Mf9%$`p?0(X5-GGdhJV9z#xIv9^wF9 z^I0-ZGcGmc%p;zWm#O=cse<%6Pc|!wd0ZXL3y??!M_a?@%Uw}ShxWd{Bhl*AZp|5m ziR%FNoGavZR?H}6}S@uR~mm*XD!yFM@O2Pi0 z=UqFJt@|Ulb3poe9s3}xr`M6DU~>>t@k4e7+2DtCg1=|TOsYL{L`6Q+B?KRucdfW2 zTPgNj5^QU@6pTlmE{Wddk`QbW^Lc~udMJxZWE1W!SRRZ2bCM;8XS2$opJo#^3`})h z#dYAm#}OOg3KL?2`_}3mwBYt~Uf>{KqfcI-mp62Xhhgf?zD1vie_ch>qfeK2<*M4s~=J%GLdyA54e&=Jl3;LO=LHc1A{y`a2%X*b=;QiW};KGrJmQhE|#}Bo8#ELN5Ib( zYpMdhj0Ccvl8N0xDeXRZKVO$m9>#oSL5#yHlOvCPm{gp5-Q7JBl z-FskMt4G-VIS;m%F|N0XcL&>BFmAI!8YW;|%`#e%S8s@F^?|7U;_FW)*j=w;mFiUv ztuo2ZY-`T}pDRe%dIV^#!e3zS8R4&0g}=RcYDdC*q4s7nuZUFu!k1bp@>2o6Eiord z6ip{bwsC~by6{mW{JrFY2Y`IzSsX^CJzO2pJOelAd5QxT7NQzD%@^IQ)mpGS1yDkA zaRl3^X`8xkJ`z^f=PGXJ94%QAQU{PWahbC~$4f4;^zU*V18`%TFd};D>efhSrf1vN7Q(g5wU0H(8(^Hx;)~(W!C(oHf@_$4;MX=@0O+FpL52Dnp7rR=yMOqej(UD0QYqK(ML@H06aGt;s5{u literal 0 HcmV?d00001 diff --git a/spec/fixtures/catalog_archive/zephir_upd_20241130.json.gz b/spec/fixtures/catalog_archive/zephir_upd_20241130.json.gz new file mode 100644 index 0000000000000000000000000000000000000000..2113eb47e2943788e26c4b379955423a2dfa554b GIT binary patch literal 2385 zcmV-X39j}ZiwFo~2vcVO1A1j}XlZg^b#P=~GB7eUF)=eRE^2dcZUDtu+j84D5`CYq zK?lcvB!I^0Mgu&D{jC#GKcu45?_fdfrwB5!n{F%W1Hqh7D) zui#JK;6j7zVux4>@oY~HiPx*m*a`gm50R4~8qW$_1b6!1IxG$)Vgzm^t4YNZAD-GS zwmj1{I^XZRvZtQ){GJUrtk&VLuG!sd&nu$=Q`&}0>a87`NJ7zDu$9(krMpE^Z}8s* zrt7OaOz`2)vU)SlXGfWC_Wb=RAD+H8V2K-Jr`KCJ_21v#4F}`b#PVF?V%sV#DX}zI z8V#@BNtZ9cLy4$&@(fF5>;@5C&MOIiXvy49tVVce!_zDugUWT7?Uy?+zOiUkEn=ma z%MadWw@u(iyRviy+rUu;-_=_~X4vG7=Di_~>DdYo2_Y3cy3)i|oTMyWr4;cPjc7uB z5zNsofAmF&t`2lP($F=F zA}YFwhAibmtA>l@;Or+Q#5?+k{sjQKC8~E>bO6^N$pFeb#HVPyLHiLp1J)ASXL$hL z7s+Gb_H&MS7%jKH={5=$nj*{=Hr2<}37bTW&xF66vj8AXIZtDr(nM=MODw}x8@%Qr zl`i^0K$E18uHux6n5HfEY^x^J>XaBv1;CBbU_rBsO$@(6OCH*Q$FQt2HMJt1+bQbp z=A#oTzA(t|EYoD7iz3lA#jbL87K*f`Ns)I&A{O|Zo%fpFnr%nrR%zY!yw+?oJWSxF 
z?wY1+IkqB=POr{djU1e?mGNhT+t%|CIEsy3>=N6w4KF7b6D8|tvERB9xkY$IVaLR~ zQ<*(NdIr8FZr?Eby8EdJVHKaDGxk|`Husiwx5Tkv$g?v^geqo9}FQMuXpVp+$h$Ud*$dlQ;U$8YBp zsNV`W6YRyQpU#0gNHUSr>Lg$^4roaxa$>A_VfaxB*ah8k;2YT9D;A>@&L0$1w;@c2 z1i1NaZn#x0OuU>6=LEP0@j#M2Ml)}Hk+L9y>nRgSik2(j#>t$92;fU{FQY-zbyuc^ z(X*X4j*QziF3@Aa!&M;JTNh2%=vNJ0YUqMTVLInxNwr>Y5W)$6PQ`4M>MH+!uUCQ8_#Hz z6aYr_E3{rxF`+Q^cqx-h)VcpQ%%tb-I^vr?FvAe?aSl&YXg`GuxoD7J={)OjtH}^R zJm?zQFDMr?XdgpbZd`l^5QD2Y&LZ@IMN>gTj&6WGrMwQsEnGk#Arn5(Nkq%Vf^ePq z@q@JVEyUu1PB&G!|<9xxwkj8T83H4J^D2_1;W4cBk>12}iyw=c1 zA2gH<_fV7GUGm9XD(a&7g2f4hat0knJd5dDP5SeYb%5(r7_*wqGl#GJuhRB5;2VH) z-Qlkv1`(T1Y41ev$I?rQFJ&?g@Hl2kc z8$0LO45$2&M$tBN;AFU*mKxG^tM^$q z$o*?*eJ4e%{Dc8Y5JEdyfyUoeZ}|VEWYsd0M|`#}W$PtN0N}c6I>+Z2H78f)M5=>? z_hnoCLDckK&LxlP^TUST?Xbm(l14V++e)fNt+uh~fmSnXwVLVlb?;N>@LSHIY-^iN zrQm90TcfF}f;&#N)kw7(eYQ_CkCtjxvU;Rc+ZAR8a81XtNuw~Uqr~;XOzrQ>2(ul? z&k<(b>_3#QmlTP8l~$}mrqb#ZnERI~u*2d?_LHTRIF{|TXaQX@T&|X> zSz7zVjXqKNXZl zbluIF%8r7wdpc}!)#v$RsvtLv0uMoNDus-U>E$yv*DuKOzCa+)OZQ-k0Vo+z2vN8I+16N}s5hSv z?`FoTS!-I`?+#Ng77$s`1yortJ1h?zLpj~@3rj7}gni?}`C{uBvcdfGd8P+U0eB|1 z>uUm5A3F^^^Eowv(v-ea9(V~x*|5KUmS@&RQ%O19pQN0yG&WE@tC?~hkc38_`DfPx zs*r!O7VsiW89Tb!4%;+r)3#jvVkL}{9Cs@P4D1?YTflG};%^EV-+%cZVsd>q!Xy9y D07s6( literal 0 HcmV?d00001 diff --git a/spec/fixtures/catalog_archive/zephir_upd_20241201.json.gz b/spec/fixtures/catalog_archive/zephir_upd_20241201.json.gz new file mode 100644 index 0000000000000000000000000000000000000000..9cc0d8b52d84e7ec8364362f4710d61694dedd8f GIT binary patch literal 2105 zcmV-92*&pxiwFp42vcVO1A1j}XlZg^b#P=~GB7eUF)}bQE^2dcZUD`h>vG#R7RSH$ zDKPrfWJDSO@9ON%$Z{IHD?6DuYG>NX2S|b?G!&@i#j%^s!|b#4#d<(d6iM+W$IY%X z)^T>rhFcqyeU|%!?4|=Z7a^KUq-h#U;9qvbg&VGm9b!eqvt9Lw5a{sv4|s+# z;&QdOMetzsc_8Fcd5i`_bx5O;e@?;Kbwwk zUl7Z4iEEn96Qle7-Sq0_#r4(fHAJSI)6Sd2YP9snKI>zjZ`XxNJho)w`UG7}CuxZ+ z3dhT|gl8T3vKlO2UU(L@^(&jE#^!M8hPn;LaApiTK;p*m@b7ngfdzM)fBq^0S zt+jH)Vi8E@qe+m+h44{uhsN8jFZdF@S_CrKhy{AbBhJFbTCW*|n#DT7wS!CpXuX(+ 
z8?+6B-}yof8KJu%L?R3%VseS1m~&q*Ru9SS+qkn1Z}=y4!Tq(!r|vhk5PYa9Mgw=7 zdl2_SWI7ma_1qve&|;(47_1~Bu|L!2FUIfdLnq#D zc`v;!=C!YpwytWlpPKQgK};BhmhGDwp&BOdi952a5jB5p7=x6?XzjVORRfPO{k5jE zCM8q#IlR4ir@^Y3FH_^`^*BB8;ba@~EepB6^)-x5O}!|ONBQx(I!f!LoℜTbFO9 zt(}BBURuMfA;c_CH$;;++&-$ULW1R1H#`gl;J_;YJI)R+qQcjd^S$1+o@2JUJuet7 z`mxL=j!;bU(N{sR@^_t)H2e9Omay%nCn21OMSA#eIGl@ldK|;C45T)@MQ{l%SfFaU+}^S0Uh-Z5hO9WrUt&2HEICW-WMFDe|5Md$n7}Yc{qltHPe$IObxnFXQzi zv($|aGOLfJj?B_BJWkN_lbJ~hQR1$YLnW>gnJ2sS!t8wb?IpAh$rm*J)R$$r{yhhf3Ww;YKmRL(XqYF7ZGpTV#0HiX{}I6gWop$+mer1T z`*B-I?{&yFU{Nd>x=iGXg@HDOQ^|(;+|3`+Rlua=5y~A#Z{b29MH1y4bl|>z?VbqFcR_OFlSjb;R?M{3XRtsO_^LJY{k*-LU0LicFd9$fNbs= z$1R~FEuv!~x2D%n>c@VgZ8G;Et0_6_XcjU!OWADf#|#c59ts`}(Y=l=s~v-^9ud3^ z`5lV`;6Ip6l_nw4E4lJTwEh7|{vlQu(wRk&zui=W%XLv&J&5}q--bfQ3XIiUsFwud zOC5a@@ftBTNBmO|E)mqu`*pBk5gP00md|0O-?-$4t%M%zjXL$qnV0mI#w2S3UNME_bzMkbbkPMG_4v{>2cCP;H zSpLZytq!bt2BGfRS?Cx!Gm&mcz2PrkzI+BTpwjtl>|SY?vh9=heUp2pCoUw$q^&`A zSlLKWh{Pm{J9_RJ=iClGq&Cqj9~G0PI;fYeng!OVGrdD$_K_{;KIb2mXqos7 z$nK{gSp)~=LR`<)zFwGIiU@F4FcflMtbn4E%@27J)&SD;Ao!Szfpz-rN5zNx??(rq zf3*}bPhoVT@7zUM3@MSn<^lReEMXN4xvbGM&GS0t>al}p@2YtSI|<{Ekqzipd!Q5(OX779Ho2Y=tz-6)V0Cj)EV3-GuIF|CHkA?-!4#XeY31d1uh%WendWFv z7u}2JvF12UXjJ&<$iFttv3u6W7CqJ+L^E)z4y+NaZtz!U=+wP_l}%uauiy7?QwsIz ze!{+g34MAhg}WofEo+BTaTT@0ANxWp>xg!=`|4i>h=IE$Mak!)pF4TskQ09)phbzX zY5KXKr(&A@JkG=0A7|kiL_v5%cnVG*Ai*ZFJ6>M z?cW~hH!Mu-8J24i@Bg4MpB`y1RSyd`rPRUSdiHvSH@m($v|x6;4-b?ZZEV4f)D6%( jK-#bUWX?3clv;8BJ@cR|?T>#d=lB;}J|%i)$&4JXy`y^?;o z5H~M;zjh4FKC(O|xG8OSTo$IoSQMsq=*FB~>TSyYYo*i0*$oKX`M^BG`y{#FoaIrL zzndiIlNYWJN_U$3!-T!NzBwOXy&#V75uZ5t+w3fbemFn7REEBW_)%&MkK4^^anm^B z^I0RT_v&QhPMuE9CuI}{jq9-7B{w?VoTfyqm3Zvg#I#F6Q}}A0TDDG-NA;d3w*SiP zFbv{EKlpT(#!N~By=B241<#C94X*ZT63@*RsbndY+RY^-TH~(rvH+=V^`Dt+nZwAxnSvoH`H*+@CN7KMVi&G(f&2Y1`> z>)?9E(LCiDmpOV!q9W#q&F7q@;CmsFQWmicF_g`DAXtP%mK7W&cjy-uag^mnD0pW6 zI?!0;VEd44axA8iNbXZMpDm8STJ(up{_T*CEOV7y9LO^VNW4hGQdJfNJj;&Inap`A 
zd0xvKIosR0j-$>{UCHo#zthhaWOi5&N-W}gp2k&^vlIQ|_smNt4p_xA0U{C1($*byPvwYP-YLqdm`UK8)g!bdjw*z^I>*+ARo*c657M@=f6^icQe z{93f@xu~*g`6Y!fA}BBafXk_vme)C4-&`U1ib9rig8umr$ZFKrtYKxr z^QcXULkW2@we@LWy#}38_(XQuzU_|ali2k=qsVa`*LSQZk0SUMaTfBl+-|<>OBN$W zDR2@d1R-0!mB0S&OXDe(F4s>{XZ4iP!d}F3Xcv3guuY$uy<(^ZeDD*I9Ie?4DUA+h zFL>`-gnF-DrLLs|Fw(*hm;B|+mxoeAZ+KIl)x%p~XnnVof;%!m{iBa1` zJl`8?3dCKS+*_@`5m}z3f}xa0;$9@Oh<~gtpD`IufzQw#fLaa^G%KH8C|p=QpRIxG zbV;K?pA&PbE~;z*I1SNjo}ixv;NX4AWla}rifj+NbFPN1(eV_-o9H+dQPd>Ikw0{a zsaYxu9~4FuJ*%>9ztXLr>$qE(n!Vv}td{;fq>dv)g;ir$_}CrRhcCY+0*IXz z`7B9wu5j&^(*(V@TatNooU%sTUVdHHSOimSk6_EL!tFsNp4%R{+sv4KarZFB-Kvjv zytgS|zfQH75q2ou!Cf-6aLK675$#~qUQYcsT-{>S4lvzz_ahi}qx3nHQR@!H-t)gp_S2xSAc3#yR)kzw}!ri2?p>tEuiaJOab;v{x8{|Qa@;{U><{ZU(D(nqX zH~${j=B=(P^DLr@@D}q_o%$IZY?G1idkxdb^O?4na-V|{?U7|G6Tgpn-#sv~Z#kYr z{QqHKK6>>?kJZD3Qd&X%|J&-_TwJdAU3I-O56G=GIDza;3m^wdY~nfaZnN|6Yu~Lp zx@b9;a7gE|1cf)H_jJb+Kxx#=%C-u8^ME66_-=Lz;~sw=A6y>~VKczrr~fr*;h>v?L? z`-EqGMBsuCT}++6R&CRKO)SIgWhkLfje5ruysv`zQWG_rBQF5&I%KFk(+h7dMfWDk6si>iXO2OADlUc+R;NPW5TvAbz0wqZnp@@O^**(vUWiL8N zMGitDRAftdjG}S?lYvl8)GR5|Kw=9L+-SreAii*rMP|W6; ziGEz4P1wCV+`6{*)f@;nNtZ`9%K66-W63=GJDMbMEb^T5dRY>AmnsiPk(b9mlUWJ` zFCb1%k|0loD4*=x4K zMam?p%AN_&a{aq-3T%XDOew+wq&icQOE`|EX{dnG;7RW*6T|qNZoL`dbcLtEcyC+0 zTGdNS>LY%nK=7__6FBuVXS~5{@0K>Uu3L?^W#Yv8A$`Et%`KiAU-hQ4cvU*HZCITj2=L?tl2uj?gTqk#~;N*?NVDtSIy!sd2ded;mTDUpf`99@U^_>0;%)GmjU$1ADYu zn#ptk_X4yxtG!5$6;f%%i?^Tjzdgse5F%%9p4)MGYkVSkp*R82D(5QgURvWWE-4~- a%F}|qdSZhb(!M@+r}Y!l!Vj*H2><}CdO11( literal 0 HcmV?d00001 diff --git a/spec/fixtures/hathifile_archive/hathi_upd_20241203.txt.gz b/spec/fixtures/hathifile_archive/hathi_upd_20241203.txt.gz new file mode 100644 index 0000000000000000000000000000000000000000..1562826792c6af45b6d0b58ec7a710ff5197dea2 GIT binary patch literal 314 zcmV-A0mc3wiwFqb3sYwR188A%XlY+{aAaRHFfueTGB7hPba-?C-Oo#I!ypg_;PvDb za{$X>z!;;f8mUqr>ZcS6VP-AE@Zspt8Ns}tQLT158zkUJV(0$}WGm?yP5~QmN zI^$<>EwJNgsR5quUqI`>fDtQ2MWFPu06=l3m0|?XDNb#gV!e0ckfG^ZG97-qWU=<> 
ze2SsVv6=CqZa*h8;;oHC?R^e6zO^CZHiY*vL(+=jYfHK;-=J{~0E0sYexAVi9=4Ci zKTU&c?dj}8==TQTbY*saBU=p@Moy9wUA3{$g(~@V8>Ny;OL5IcVw@FaA^#gmNDf~) zzhoY=GjoY5bzqi~3F6gy9hHDPH>MC>jnV8~?}DF(4cgjw&0@gaf=Sk(J+@{HyWigG M2g%4* Date: Fri, 6 Dec 2024 16:29:05 -0500 Subject: [PATCH 029/114] DEV-1414 - clean up hathifiles verification * split up hathifiles & hathifiles contents verifier to two different files * clean up includes --- lib/verifier/hathifiles_contents_verifier.rb | 114 +++++++ lib/verifier/hathifiles_verifier.rb | 112 +------ spec/spec_helper.rb | 1 + .../unit/hathifiles_contents_verifier_spec.rb | 215 +++++++++++++ spec/unit/hathifiles_verifier_spec.rb | 289 +++--------------- 5 files changed, 372 insertions(+), 359 deletions(-) create mode 100644 lib/verifier/hathifiles_contents_verifier.rb create mode 100644 spec/unit/hathifiles_contents_verifier_spec.rb diff --git a/lib/verifier/hathifiles_contents_verifier.rb b/lib/verifier/hathifiles_contents_verifier.rb new file mode 100644 index 0000000..428781e --- /dev/null +++ b/lib/verifier/hathifiles_contents_verifier.rb @@ -0,0 +1,114 @@ +# frozen_string_literal: true + +require "zlib" +require_relative "../verifier" + +# Verifies that hathifiles workflow stage did what it was supposed to. + +module PostZephirProcessing + class HathifileContentsVerifier < Verifier + HATHIFILE_FIELDS_COUNT = 26 + HATHIFILE_FIELD_REGEXES = [ + # htid - required; lowercase alphanumeric namespace, period, non-whitespace ID + /^[a-z0-9]{2,4}\.\S+$/, + # access - required; allow or deny + /^(allow|deny)$/, + # rights - required; lowercase alphanumeric plus dash and period + /^[a-z0-9\-.]+$/, + # ht_bib_key - required; 9 digits + /^\d{9}$/, + # description (enumchron) - optional; anything goes + /^.*$/, + # source - required; NUC/MARC organization code, all upper-case + /^[A-Z]+$/, + # source_bib_num - optional (see note) - no whitespace, anything else + # allowed. 
Note that blank source bib nums are likely a bug in hathifiles + # generation + /^\S*$/, + # oclc_num - optional; zero or more comma-separated numbers + /^(\d+)?(,\d+)*$/, + # hathifiles doesn't validate/normalize what comes out of the record for + # isbn, issn, or lccn + # isbn - optional; anything goes + /^.*$/, + # issn - optional; anything goes + /^.*$/, + # lccn - optional; anything goes + /^.*$/, + # title - optional (see note); anything goes + # Note: currently blank for titles with only a 245$k; hathifiles + # generation should likely be changed to include the k subfield. + /^.*$/, + # imprint - optional; anything goes + /^.*$/, + # rights_reason_code - required; lowercase alphabetical + /^[a-z]+$/, + # rights_timestamp - required; %Y-%m-%d %H:%M:%S + /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/, + # us_gov_doc_flag - required; 0 or 1 + /^[01]$/, + # rights_date_used - required - numeric + /^\d+$/, + # publication place - required, 2 or 3 characters (but can be whitespace) + /^.{2,3}$/, + # lang - optional, at most 3 characters + /^.{0,3}$/, + # bib_fmt - required, uppercase characters + /^[A-Z]+$/, + # collection code - required, uppercase characters + /^[A-Z]+$/, + # content provider - required, lowercase characters + dash + /^[a-z\-]+$/, + # responsible entity code - required, lowercase characters + dash + /^[a-z\-]+$/, + # digitization agent code - required, lowercase characters + dash + /^[a-z\-]+$/, + # access profile code - required, lowercase characters + plus + /^[a-z+]+$/, + # author - optional, anything goes + /^.*$/ + ] + + attr_reader :file, :line_count + + def initialize(file) + super() + @line_count = 0 + @file = file + end + + def run + Zlib::GzipReader.open(file, encoding: "utf-8").each_line do |line| + @line_count += 1 + # limit of -1 to ensure we don't drop trailing empty fields + fields = line.chomp.split("\t", -1) + + next unless verify_line_field_count(fields) + + verify_fields(fields) + end + # open file + # check each line against a regex 
+ # count lines + # also check linecount against corresponding catalog - hathifile must be >= + end + + def verify_fields(fields) + fields.each_with_index do |field, i| + regex = HATHIFILE_FIELD_REGEXES[i] + if !fields[i].match?(regex) + error(message: "Field #{i} at line #{line_count} in #{file} ('#{field}') does not match #{regex}") + end + end + end + + def verify_line_field_count(fields) + if fields.count == HATHIFILE_FIELDS_COUNT + true + else + error(message: "Line #{line_count} in #{file} has only #{fields.count} columns, expected #{HATHIFILE_FIELDS_COUNT}") + false + end + end + end +end diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 02f2b72..aaadc0b 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -1,121 +1,13 @@ # frozen_string_literal: true require "zlib" +require_relative "hathifiles_contents_verifier" require_relative "../verifier" require_relative "../derivatives" -# Verifies that post_hathi workflow stage did what it was supposed to. - -# TODO: document and verify the files written by monthly process. -# They should be mostly the same but need to be accounted for. +# Verifies that hathifiles workflow stage did what it was supposed to. module PostZephirProcessing - class HathifileContentsVerifier < Verifier - HATHIFILE_FIELDS_COUNT = 26 - HATHIFILE_FIELD_REGEXES = [ - # htid - required; lowercase alphanumeric namespace, period, non-whitespace ID - /^[a-z0-9]{2,4}\.\S+$/, - # access - required; allow or deny - /^(allow|deny)$/, - # rights - required; lowercase alphanumeric plus dash and period - /^[a-z0-9\-.]+$/, - # ht_bib_key - required; 9 digits - /^\d{9}$/, - # description (enumchron) - optional; anything goes - /^.*$/, - # source - required; NUC/MARC organization code, all upper-case - /^[A-Z]+$/, - # source_bib_num - optional (see note) - no whitespace, anything else - # allowed. 
Note that blank source bib nums are likely a bug in hathifiles - # generation - /^\S*$/, - # oclc_num - optional; zero or more comma-separated numbers - /^(\d+)?(,\d+)*$/, - # hathifiles doesn't validate/normalize what comes out of the record for - # isbn, issn, or lccn - # isbn - optional; anything goes - /^.*$/, - # issn - optional; anything goes - /^.*$/, - # lccn - optional; anything goes - /^.*$/, - # title - optional (see note); anything goes - # Note: currently blank for titles with only a 245$k; hathifiles - # generation should likely be changed to include the k subfield. - /^.*$/, - # imprint - optional; anything goes - /^.*$/, - # rights_reason_code - required; lowercase alphabetical - /^[a-z]+$/, - # rights_timestamp - required; %Y-%m-%d %H:%M:%S - /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/, - # us_gov_doc_flag - required; 0 or 1 - /^[01]$/, - # rights_date_used - required - numeric - /^\d+$/, - # publication place - required, 2 or 3 characters (but can be whitespace) - /^.{2,3}$/, - # lang - optional, at most 3 characters - /^.{0,3}$/, - # bib_fmt - required, uppercase characters - /^[A-Z]+$/, - # collection code - required, uppercase characters - /^[A-Z]+$/, - # content provider - required, lowercase characters + dash - /^[a-z\-]+$/, - # responsible entity code - required, lowercase characters + dash - /^[a-z\-]+$/, - # digitization agent code - required, lowercase characters + dash - /^[a-z\-]+$/, - # access profile code - required, lowercase characters + plus - /^[a-z+]+$/, - # author - optional, anything goes - /^.*$/ - ] - - attr_reader :file, :line_count - - def initialize(file) - super() - @line_count = 0 - @file = file - end - - def run - Zlib::GzipReader.open(file, encoding: "utf-8").each_line do |line| - @line_count += 1 - # limit of -1 to ensure we don't drop trailing empty fields - fields = line.chomp.split("\t", -1) - - next unless verify_line_field_count(fields) - - verify_fields(fields) - end - # open file - # check each line against a regex 
- # count lines - # also check linecount against corresponding catalog - hathifile must be >= - end - - def verify_fields(fields) - fields.each_with_index do |field, i| - regex = HATHIFILE_FIELD_REGEXES[i] - if !fields[i].match?(regex) - error(message: "Field #{i} at line #{line_count} in #{file} ('#{field}') does not match #{regex}") - end - end - end - - def verify_line_field_count(fields) - if fields.count == HATHIFILE_FIELDS_COUNT - true - else - error(message: "Line #{line_count} in #{file} has only #{fields.count} columns, expected #{HATHIFILE_FIELDS_COUNT}") - false - end - end - end - class HathifilesVerifier < Verifier attr_reader :current_date diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 64efb60..e634dc2 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -4,6 +4,7 @@ require "dotenv" require "logger" require "tmpdir" +require "tempfile" require "simplecov" require "simplecov-lcov" require "zlib" diff --git a/spec/unit/hathifiles_contents_verifier_spec.rb b/spec/unit/hathifiles_contents_verifier_spec.rb new file mode 100644 index 0000000..519878b --- /dev/null +++ b/spec/unit/hathifiles_contents_verifier_spec.rb @@ -0,0 +1,215 @@ +# frozen_string_literal: true + +require "zlib" +require "verifier/hathifiles_contents_verifier" + +module PostZephirProcessing + RSpec.describe(HathifileContentsVerifier) do + around(:each) do |example| + with_test_environment { example.run } + end + + let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } + around(:each) do |example| + with_test_environment { example.run } + end + + hathifiles_fields = [ + { + name: "htid", + good: "mdp.39015031446076", + bad: "this is not an id", + optional: false + }, + { + name: "access", + good: "deny", + bad: "nope", + optional: false + }, + { + name: "rights", + good: "ic", + bad: "In Copyright", + optional: false + }, + { + name: "ht_bib_key", + good: "000000400", + bad: "not a bib key", + optional: false + }, + { + name: 
"description", + good: "Jun-Oct 1927", + optional: true + }, + { + name: "source", + good: "MIU", + bad: "this is not a NUC code", + optional: false + }, + { + name: "source_bib_num", + good: "990000003710106381", + bad: "this is not a source bib num", + # this can be empty if the record has an sdrnum like sdr-osu(OCoLC)6776655 which the regex at https://github.com/hathitrust/hathifiles/blob/af5e4ff682fb81165e6232a1151cfbeeacfdfd21/lib/bib_record.rb#L160C34-L160C50 doesn't match, probably a bug in hathifiles + optional: true + }, + + { + name: "oclc_num", + good: "217079596,55322", + bad: "this is not an OCLC number", + optional: true + }, + + # isbn, issn, lccn come straight from the record w/o additional + # validation in hathifiles, probably not worth doing add'l validation + # here + { + name: "isbn", + good: "9789679430011,9679430014", + optional: true + }, + { + name: "issn", + good: "0084-9499,00113344", + optional: true + }, + { + name: "lccn", + good: "", + optional: true + }, + + { + name: "title", + good: "", + # this can be empty if the record only has a 245$k. that's probably a bug in the + # hathifiles which we should fix. 
+ optional: true + }, + { + name: "imprint", + good: "Pergamon Press [1969]", + optional: true + }, + { + name: "rights_reason_code", + good: "bib", + bad: "not a reason code", + optional: false + }, + { + name: "rights_timestamp", + good: "2008-06-01 09:30:17", + bad: "last thursday", + optional: false + }, + { + name: "us_gov_doc_flag", + good: "0", + bad: "not a gov doc flag", + optional: false + }, + { + name: "rights_date_used", + good: "1987", + bad: "this is not a year", + optional: false + }, + { + name: "pub_place", + good: "miu", + bad: "not a publication place", + optional: false + }, + { + name: "lang", + good: "eng", + bad: "not a language code", + optional: true + }, + { + name: "bib_fmt", + good: "BK", + bad: "not a bib fmt", + optional: false + }, + { + name: "collection_code", + good: "MIU", + bad: "not a collection code", + optional: false + }, + { + name: "content_provider_code", + good: "umich", + bad: "not an inst id", + optional: false + }, + { + name: "responsible_entity_code", + good: "umich", + bad: "not an inst id", + optional: false + }, + { + name: "digitization_agent_code", + good: "google", + bad: "not an inst id", + optional: false + }, + { + name: "access_profile_code", + good: "open", + bad: "not an access profile", + optional: false + }, + { + name: "author", + good: "Chaucer, Geoffrey, -1400.", + optional: true + } + ] + + let(:sample_fields) { sample_line.split("\t") } + let(:verifier) { described_class.new("/tmp/unused.txt") } + + hathifiles_fields.each_with_index do |field, i| + it "accepts a file with #{field[:name]} matching the regex" do + sample_fields[i] = field[:good] + + verifier.verify_fields(sample_fields) + expect(verifier.errors).to be_empty + end + + if field.has_key?(:bad) + it "rejects a file with #{field[:name]} not matching the regex" do + sample_fields[i] = field[:bad] + + verifier.verify_fields(sample_fields) + expect(verifier.errors).to include(/Field #{i}.*does not match/) + end + + if field[:optional] + it 
"accepts a file with empty #{field[:name]}" do + sample_fields[i] = "" + + verifier.verify_fields(sample_fields) + expect(verifier.errors).to be_empty + end + else + it "rejects a file with empty #{field[:name]}" do + sample_fields[i] = "" + + verifier.verify_fields(sample_fields) + expect(verifier.errors).to include(/Field #{i}.*does not match/) + end + end + end + end + end +end diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index b28b400..c0ee934 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -1,281 +1,72 @@ # frozen_string_literal: true -require "zlib" require "verifier/hathifiles_verifier" -require "tempfile" -require "logger" module PostZephirProcessing - RSpec.describe "hathifiles verification" do + RSpec.describe(HathifilesVerifier) do around(:each) do |example| with_test_environment { example.run } end let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } - describe(HathifilesVerifier) do - describe "#verify_hathifiles_linecount" do - context "with a catalog json file with 5 records" do - let(:verifier) { described_class.new } + describe "#verify_hathifiles_linecount" do + context "with a catalog json file with 5 records" do + let(:verifier) { described_class.new } - around(:each) do |example| - contents = "{}\n" * 5 - with_temp_file(contents, gzipped: true) do |catalog_json_gz| - @catalog_json_gz = catalog_json_gz - example.run - end - end - - it "accepts a hathifile with 5 records" do - verifier.verify_hathifile_linecount(5, catalog_path: @catalog_json_gz) - expect(verifier.errors).to be_empty - end - - it "accepts a hathifile with 10 records" do - verifier.verify_hathifile_linecount(10, catalog_path: @catalog_json_gz) - expect(verifier.errors).to be_empty - end - - it "rejects a hathifile with 4 records" do - verifier.verify_hathifile_linecount(4, catalog_path: @catalog_json_gz) - expect(verifier.errors).not_to be_empty - end 
- - it "rejects a hathifile with no records" do - verifier.verify_hathifile_linecount(0, catalog_path: @catalog_json_gz) - expect(verifier.errors).not_to be_empty + around(:each) do |example| + contents = "{}\n" * 5 + with_temp_file(contents, gzipped: true) do |catalog_json_gz| + @catalog_json_gz = catalog_json_gz + example.run end end - end - - # the whole enchilada - describe "#verify_hathifile" - describe "#verify_hathifile_contents" do - it "accepts a file with a single real hathifiles entry" do - expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) + it "accepts a hathifile with 5 records" do + verifier.verify_hathifile_linecount(5, catalog_path: @catalog_json_gz) + expect(verifier.errors).to be_empty end - it "rejects a file where some lines have less than 26 tab-separated columns" do - contents = sample_line + "mdp.35112100003484\tdeny\n" - expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) + it "accepts a hathifile with 10 records" do + verifier.verify_hathifile_linecount(10, catalog_path: @catalog_json_gz) + expect(verifier.errors).to be_empty end - end - describe "#catalog_file_for" do - it "computes a source catalog file based on date - 1" do - expect(described_class.new.catalog_file_for(Date.parse("2023-01-04"))) - .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_upd_20230103.json.gz") + it "rejects a hathifile with 4 records" do + verifier.verify_hathifile_linecount(4, catalog_path: @catalog_json_gz) + expect(verifier.errors).not_to be_empty end - it "computes a full source catalog file based on date - 1" do - expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"), full: true)) - .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_full_20241130.json.gz") + it "rejects a hathifile with no records" do + verifier.verify_hathifile_linecount(0, catalog_path: @catalog_json_gz) + expect(verifier.errors).not_to be_empty end end end - describe(HathifileContentsVerifier) do - around(:each) do |example| - 
with_test_environment { example.run } - end - - hathifiles_fields = [ - { - name: "htid", - good: "mdp.39015031446076", - bad: "this is not an id", - optional: false - }, - { - name: "access", - good: "deny", - bad: "nope", - optional: false - }, - { - name: "rights", - good: "ic", - bad: "In Copyright", - optional: false - }, - { - name: "ht_bib_key", - good: "000000400", - bad: "not a bib key", - optional: false - }, - { - name: "description", - good: "Jun-Oct 1927", - optional: true - }, - { - name: "source", - good: "MIU", - bad: "this is not a NUC code", - optional: false - }, - { - name: "source_bib_num", - good: "990000003710106381", - bad: "this is not a source bib num", - # this can be empty if the record has an sdrnum like sdr-osu(OCoLC)6776655 which the regex at https://github.com/hathitrust/hathifiles/blob/af5e4ff682fb81165e6232a1151cfbeeacfdfd21/lib/bib_record.rb#L160C34-L160C50 doesn't match, probably a bug in hathifiles - optional: true - }, - - { - name: "oclc_num", - good: "217079596,55322", - bad: "this is not an OCLC number", - optional: true - }, + # the whole enchilada + describe "#verify_hathifile" - # isbn, issn, lccn come straight from the record w/o additional - # validation in hathifiles, probably not worth doing add'l validation - # here - { - name: "isbn", - good: "9789679430011,9679430014", - optional: true - }, - { - name: "issn", - good: "0084-9499,00113344", - optional: true - }, - { - name: "lccn", - good: "", - optional: true - }, - - { - name: "title", - good: "", - # this can be empty if the record only has a 245$k. that's probably a bug in the - # hathifiles which we should fix. 
- optional: true - }, - { - name: "imprint", - good: "Pergamon Press [1969]", - optional: true - }, - { - name: "rights_reason_code", - good: "bib", - bad: "not a reason code", - optional: false - }, - { - name: "rights_timestamp", - good: "2008-06-01 09:30:17", - bad: "last thursday", - optional: false - }, - { - name: "us_gov_doc_flag", - good: "0", - bad: "not a gov doc flag", - optional: false - }, - { - name: "rights_date_used", - good: "1987", - bad: "this is not a year", - optional: false - }, - { - name: "pub_place", - good: "miu", - bad: "not a publication place", - optional: false - }, - { - name: "lang", - good: "eng", - bad: "not a language code", - optional: true - }, - { - name: "bib_fmt", - good: "BK", - bad: "not a bib fmt", - optional: false - }, - { - name: "collection_code", - good: "MIU", - bad: "not a collection code", - optional: false - }, - { - name: "content_provider_code", - good: "umich", - bad: "not an inst id", - optional: false - }, - { - name: "responsible_entity_code", - good: "umich", - bad: "not an inst id", - optional: false - }, - { - name: "digitization_agent_code", - good: "google", - bad: "not an inst id", - optional: false - }, - { - name: "access_profile_code", - good: "open", - bad: "not an access profile", - optional: false - }, - { - name: "author", - good: "Chaucer, Geoffrey, -1400.", - optional: true - } - ] - - let(:sample_fields) { sample_line.split("\t") } - let(:verifier) { described_class.new("/tmp/unused.txt") } - - hathifiles_fields.each_with_index do |field, i| - it "accepts a file with #{field[:name]} matching the regex" do - sample_fields[i] = field[:good] - - verifier.verify_fields(sample_fields) - expect(verifier.errors).to be_empty - end - - if field.has_key?(:bad) - it "rejects a file with #{field[:name]} not matching the regex" do - sample_fields[i] = field[:bad] - - verifier.verify_fields(sample_fields) - expect(verifier.errors).to include(/Field #{i}.*does not match/) - end + describe 
"#verify_hathifile_contents" do + it "accepts a file with a single real hathifiles entry" do + expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) + end - if field[:optional] - it "accepts a file with empty #{field[:name]}" do - sample_fields[i] = "" + it "rejects a file where some lines have less than 26 tab-separated columns" do + contents = sample_line + "mdp.35112100003484\tdeny\n" + expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) + end + end - verifier.verify_fields(sample_fields) - expect(verifier.errors).to be_empty - end - else - it "rejects a file with empty #{field[:name]}" do - sample_fields[i] = "" + describe "#catalog_file_for" do + it "computes a source catalog file based on date - 1" do + expect(described_class.new.catalog_file_for(Date.parse("2023-01-04"))) + .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_upd_20230103.json.gz") + end - verifier.verify_fields(sample_fields) - expect(verifier.errors).to include(/Field #{i}.*does not match/) - end - end - end + it "computes a full source catalog file based on date - 1" do + expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"), full: true)) + .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_full_20241130.json.gz") end end end From 1b161f4566b108b393f35bd00d59c23c16fc6586 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Fri, 6 Dec 2024 17:11:30 -0500 Subject: [PATCH 030/114] initial DEV-1415 commit, todo: check json file --- config/env | 1 + lib/dates.rb | 4 + lib/derivatives.rb | 6 +- lib/verifier/hathifiles_listing_verifier.rb | 45 ++++++++++ spec/unit/hathifiles_listing_verifier_spec.rb | 87 +++++++++++++++++++ 5 files changed, 141 insertions(+), 2 deletions(-) create mode 100644 lib/verifier/hathifiles_listing_verifier.rb create mode 100644 spec/unit/hathifiles_listing_verifier_spec.rb diff --git a/config/env b/config/env index 2c291bb..4629025 100644 --- a/config/env +++ b/config/env @@ -5,4 +5,5 @@ 
CATALOG_PREP=/usr/src/app/data/catalog_prep DATA_ROOT=/usr/src/app/data INGEST_BIBRECORDS=/usr/src/app/data/ingest_bibrecords RIGHTS_DIR=/usr/src/app/data/rights +WWW_DIR=/usr/src/app/data/www ZEPHIR_DATA=/usr/src/app/data/zephir diff --git a/lib/dates.rb b/lib/dates.rb index 78d2108..9ef1010 100644 --- a/lib/dates.rb +++ b/lib/dates.rb @@ -7,6 +7,10 @@ class Date def last_of_month? next_day.month != month end + + def first_of_month? + day == 1 + end end module PostZephirProcessing diff --git a/lib/derivatives.rb b/lib/derivatives.rb index f3b1ed9..0f0eeca 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -11,7 +11,8 @@ class Derivatives :CATALOG_ARCHIVE, :CATALOG_PREP, :RIGHTS_ARCHIVE, - :TMPDIR + :TMPDIR, + :WWW_DIR ].freeze # Location data for the derivatives we care about when constructing our list of missing dates. @@ -51,7 +52,8 @@ class Derivatives def self.directory_for(location:) location = location.to_s case location - when "CATALOG_ARCHIVE", "HATHIFILE_ARCHIVE", "CATALOG_PREP", "INGEST_BIBRECORDS", "RIGHTS_DIR", "ZEPHIR_DATA" + + when "CATALOG_ARCHIVE", "HATHIFILE_ARCHIVE", "CATALOG_PREP", "INGEST_BIBRECORDS", "RIGHTS_DIR", "WWW_DIR", "ZEPHIR_DATA" ENV.fetch location when "RIGHTS_ARCHIVE" ENV["RIGHTS_ARCHIVE"] || File.join(ENV.fetch("RIGHTS_DIR"), "archive") diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb new file mode 100644 index 0000000..5930e9a --- /dev/null +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -0,0 +1,45 @@ +# frozen_string_literal: true + +require_relative "../verifier" +require_relative "../derivatives" + +module PostZephirProcessing + class HathifilesListingVerifier < Verifier + attr_reader :current_date + + def run_for_date(date:) + @current_date = date + verify_hathifiles_listing + end + + def verify_hathifiles_listing(date: current_date) + derivatives_for_date(date: date).each do |derivative_path| + verify_listing(path: derivative_path) + end + end + + def 
derivatives_for_date(date:) + derivatives = [ + self.class.dated_derivative( + location: :WWW_DIR, + name: "hathi_upd_YYYYMMDD.txt.gz", + date: date + ) + ] + + if date.first_of_month? + derivatives << self.class.dated_derivative( + location: :WWW_DIR, + name: "hathi_full_YYYYMMDD.txt.gz", + date: date + ) + end + + derivatives + end + + def verify_listing(path:) + verify_file(path: path) + end + end +end diff --git a/spec/unit/hathifiles_listing_verifier_spec.rb b/spec/unit/hathifiles_listing_verifier_spec.rb new file mode 100644 index 0000000..d8c12f8 --- /dev/null +++ b/spec/unit/hathifiles_listing_verifier_spec.rb @@ -0,0 +1,87 @@ +# frozen_string_literal: true + +require "climate_control" +require "zlib" +require "verifier/hathifiles_listing_verifier" +require "tempfile" +require "logger" + +module PostZephirProcessing + RSpec.describe(HathifilesListingVerifier) do + around(:each) do |example| + with_test_environment { example.run } + end + + # Using midmonth here as a stand-in for "any day of the month that's not the 1st" + firstday = Date.parse("2023-01-01") + midmonth = Date.parse("2023-01-15") + firstday_ymd = firstday.strftime("%Y%m%d") + midmonth_ymd = midmonth.strftime("%Y%m%d") + + describe "#derivatives_for_date" do + it "expects one derivative midmonth" do + expect(described_class.new.derivatives_for_date(date: midmonth).size).to eq 1 + end + + it "expects two derivativess on the first of the month" do + expect(described_class.new.derivatives_for_date(date: firstday).size).to eq 2 + end + end + + describe "#verify_hathifiles_listing" do + dir_path = ENV["WWW_DIR"] + FileUtils.mkdir_p(dir_path) + + it "finds an update file midmonth" do + update_file = File.join(dir_path, "hathi_upd_#{midmonth_ymd}.txt.gz") + + FileUtils.mkdir_p(dir_path) + FileUtils.touch(update_file) + + verifier = described_class.new + verifier.verify_hathifiles_listing(date: midmonth) + expect(verifier.errors).to be_empty + end + + it "finds both update and full file on the first 
day of the month" do + update_file = File.join(dir_path, "hathi_upd_#{firstday_ymd}.txt.gz") + full_file = File.join(dir_path, "hathi_full_#{firstday_ymd}.txt.gz") + + FileUtils.touch(update_file) + FileUtils.touch(full_file) + + verifier = described_class.new + verifier.verify_hathifiles_listing(date: firstday) + expect(verifier.errors).to be_empty + end + + it "produces one error if upd file is missing mid month" do + # Make sure file does not exist + update_file = File.join(dir_path, "hathi_upd_#{midmonth_ymd}.txt.gz") + if File.exist?(update_file) + FileUtils.rm(update_file) + end + + verifier = described_class.new + verifier.verify_hathifiles_listing(date: midmonth) + expect(verifier.errors.size).to eq 1 + end + + it "produces two errors if upd and full file are missing on the first day of the month" do + # Make sure files do not exist + update_file = File.join(dir_path, "hathi_upd_#{firstday_ymd}.txt.gz") + full_file = File.join(dir_path, "hathi_full_#{firstday_ymd}.txt.gz") + + [update_file, full_file].each do |f| + if File.exist?(f) + FileUtils.rm(f) + end + end + + verifier = described_class.new + verifier.verify_hathifiles_listing(date: firstday) + expect(verifier.errors.size).to eq 2 + end + end + end +end From 7f6916483bd2ae207450a89462b0c002611051ad Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Fri, 6 Dec 2024 17:15:28 -0500 Subject: [PATCH 031/114] DEV-1418: WIP - verify catalog indexing * stub out tests * add webmock --- Gemfile | 1 + Gemfile.lock | 12 ++++ lib/verifier/catalog_index_verifier.rb | 12 ++++ spec/unit/catalog_indexing_verifier_spec.rb | 61 +++++++++++++++++++++ 4 files changed, 86 insertions(+) create mode 100644 lib/verifier/catalog_index_verifier.rb create mode 100644 spec/unit/catalog_indexing_verifier_spec.rb diff --git a/Gemfile b/Gemfile index 21421b8..6df094d 100644 --- a/Gemfile +++ b/Gemfile @@ -15,4 +15,5 @@ group :development, :test do gem "simplecov" gem "simplecov-lcov" gem "standardrb" + gem "webmock" end diff --git 
a/Gemfile.lock b/Gemfile.lock index ca959ee..99a1052 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -1,14 +1,20 @@ GEM remote: https://rubygems.org/ specs: + addressable (2.8.7) + public_suffix (>= 2.0.2, < 7.0) ast (2.4.2) bigdecimal (3.1.8) canister (0.9.2) climate_control (1.2.0) coderay (1.1.3) + crack (1.0.0) + bigdecimal + rexml diff-lcs (1.5.1) docile (1.4.1) dotenv (3.1.4) + hashdiff (1.1.2) json (2.7.2) language_server-protocol (3.17.0.3) lint_roller (1.1.0) @@ -21,6 +27,7 @@ GEM pry (0.14.2) coderay (~> 1.1) method_source (~> 1.0) + public_suffix (6.0.1) racc (1.8.1) rainbow (3.1.1) regexp_parser (2.9.2) @@ -79,6 +86,10 @@ GEM standardrb (1.0.1) standard unicode-display_width (2.6.0) + webmock (3.24.0) + addressable (>= 2.8.0) + crack (>= 0.3.2) + hashdiff (>= 0.4.0, < 2.0.0) zinzout (0.1.1) PLATFORMS @@ -96,6 +107,7 @@ DEPENDENCIES simplecov simplecov-lcov standardrb + webmock zinzout BUNDLED WITH diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb new file mode 100644 index 0000000..be57764 --- /dev/null +++ b/lib/verifier/catalog_index_verifier.rb @@ -0,0 +1,12 @@ +# frozen_string_literal: true + +require "zlib" +require_relative "../verifier" +require_relative "../derivatives" + +# Verifies that catalog indexing workflow stage did what it was supposed to. 
+ +module PostZephirProcessing + class CatalogIndexVerifier < Verifier + end +end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb new file mode 100644 index 0000000..83cea3f --- /dev/null +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -0,0 +1,61 @@ +# frozen_string_literal: true + +require "verifier/catalog_index_verifier" + +module PostZephirProcessing + RSpec.describe(CatalogIndexVerifier) do + around(:each) do |example| + with_test_environment { example.run } + end + + def stub_catalog_timerange(date, result_count) + # must be like YYYY-mm-ddTHH:MM:SSZ - iso8601 with a 'Z' for time zone - + # time zone offsets like DateTime.iso8601 produces by default aren't + # allowed for solr + datebegin = date.to_datetime.new_offset(0).strftime("%FT%TZ") + dateend = (date + 1).to_datetime.new_offset(0).strftime("%FT%TZ") + WebMock.enable! + + url = "http://solr-sdr-catalog:9033/solr/catalog/select?fq=time_of_index:[#{datebegin}%20TO%20#{dateend}]&indent=on&q=*:*&rows=0&wt=json" + + result = { + "responseHeader" => { + "status" => 0, + "QTime" => 0, + "params" => { + "q" => "*:*", + "fq" => "time_of_index:[#{datebegin} TO #{dateend}]", + "rows" => "0", + "wt" => "json" + } + }, + "response" => {"numFound" => result_count, "start" => 0, "docs" => []} + }.to_json + + WebMock::API.stub_request(:get, url) + .to_return(body: result, headers: {"Content-Type" => "application/json"}) + end + + describe "#verify_index_count" do + let(:verifier) { described_class.new } + context "with a catalog update file with 3 records" do + it "accepts a catalog with 3 recent updates" + it "accepts a catalog with 5 recent updates" + it "rejects a catalog with no recent updates" + it "rejects a catalog with 2 recent updates" + end + + context "with a catalog full file with 5 records" do + it "accepts a catalog with 5 records" + it "accepts a catalog with 6 records" + it "rejects a catalog with no records" + it "rejects a catalog with 2 records" +
end + end + end + + describe "#run" do + it "checks the full file on the last day of the month" + it "checks the file corresponding to today's date" + end +end From 97dabc956cf61feefbcfa82e4687c8e747fc1830 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Mon, 9 Dec 2024 15:29:26 -0500 Subject: [PATCH 032/114] DEV-1415: method, fixture & test for json listing --- lib/verifier/hathifiles_listing_verifier.rb | 23 +++++ spec/fixtures/hathi_file_list.json | 26 +++++ spec/unit/hathifiles_listing_verifier_spec.rb | 95 +++++++++++-------- 3 files changed, 104 insertions(+), 40 deletions(-) create mode 100644 spec/fixtures/hathi_file_list.json diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index 5930e9a..afc63d6 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -2,6 +2,7 @@ require_relative "../verifier" require_relative "../derivatives" +require "json" module PostZephirProcessing class HathifilesListingVerifier < Verifier @@ -40,6 +41,28 @@ def derivatives_for_date(date:) def verify_listing(path:) verify_file(path: path) + + filename = File.basename(path) + verify_file_in_json(filename: filename) + end + + # Verify that the derivatives for the date are included in + # "#{ENV['WWW_DIR']}/hathi_file_list.json" + def verify_file_in_json(filename:) + json_path = "#{ENV['WWW_DIR']}/hathi_file_list.json" + listings = JSON.load_file(json_path) + matches = [] + + listings.each do |listing| + if listing['filename'] == filename + matches << listing + break + end + end + + if matches.empty? 
+ error(message: "Did not find a listing with filename: #{filename} in JSON (#{json_path})") + end end end end diff --git a/spec/fixtures/hathi_file_list.json b/spec/fixtures/hathi_file_list.json new file mode 100644 index 0000000..68681a4 --- /dev/null +++ b/spec/fixtures/hathi_file_list.json @@ -0,0 +1,26 @@ +[ + { + "filename": "hathi_full_20230101.txt.gz", + "full": true, + "size": 12345, + "created": "2023-01-01 01:01:01 -0400", + "modified": "2023-01-01 01:01:01 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_full_20230101.txt.gz" + }, + { + "filename": "hathi_upd_20230101.txt.gz", + "full": false, + "size": 123, + "created": "2023-01-01 01:01:01 -0400", + "modified": "2023-01-01 01:01:01 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_upd_20230101.txt.gz" + }, + { + "filename": "hathi_upd_20230102.txt.gz", + "full": false, + "size": 456, + "created": "2023-01-02 02:02:02 -0400", + "modified": "2023-01-02 02:02:02 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_upd_20230102.txt.gz" + } +] diff --git a/spec/unit/hathifiles_listing_verifier_spec.rb b/spec/unit/hathifiles_listing_verifier_spec.rb index d8c12f8..c81f198 100644 --- a/spec/unit/hathifiles_listing_verifier_spec.rb +++ b/spec/unit/hathifiles_listing_verifier_spec.rb @@ -8,79 +8,94 @@ module PostZephirProcessing RSpec.describe(HathifilesListingVerifier) do - around(:each) do |example| - with_test_environment { example.run } - end + let(:verifier) { described_class.new } - # Using midmonth here as a stand-in for "any day of the month that's not the 1st" + # Using secondday here as a representative for + # "any day of the month that's not the 1st" + # missingday does not have files or listings firstday = Date.parse("2023-01-01") - midmonth = Date.parse("2023-01-15") + secondday = Date.parse("2023-01-02") + missingday = Date.parse("2023-01-13") + firstday_ymd = firstday.strftime("%Y%m%d") - midmonth_ymd = midmonth.strftime("%Y%m%d") + 
secondday_ymd = secondday.strftime("%Y%m%d") + missingday_ymd = missingday.strftime("%Y%m%d") + + dir_path = ENV["WWW_DIR"] + + before(:all) do + FileUtils.cp("/usr/src/app/spec/fixtures/hathi_file_list.json", dir_path) + end + + around(:each) do |example| + with_test_environment { example.run } + end describe "#derivatives_for_date" do - it "expects one derivative midmonth" do - expect(described_class.new.derivatives_for_date(date: midmonth).size).to eq 1 + it "expects two derivatives on firstday" do + expect(described_class.new.derivatives_for_date(date: firstday).size).to eq 2 end - it "expects two derivativess on the first of the month" do - expect(described_class.new.derivatives_for_date(date: firstday).size).to eq 2 + it "expects one derivative on secondday" do + expect(described_class.new.derivatives_for_date(date: secondday).size).to eq 1 end end describe "#verify_hathifiles_listing" do - dir_path = ENV["WWW_DIR"] FileUtils.mkdir_p(dir_path) - it "finds an update file midmonth" do - update_file = File.join(dir_path, "hathi_upd_#{midmonth_ymd}.txt.gz") - - FileUtils.mkdir_p(dir_path) - FileUtils.touch(update_file) - - verifier = described_class.new - verifier.verify_hathifiles_listing(date: midmonth) - expect(verifier.errors).to be_empty - end - - it "finds both update and full file on the first day of the month" do + it "finds update and full file on firstday" do update_file = File.join(dir_path, "hathi_upd_#{firstday_ymd}.txt.gz") full_file = File.join(dir_path, "hathi_full_#{firstday_ymd}.txt.gz") FileUtils.touch(update_file) FileUtils.touch(full_file) - verifier = described_class.new verifier.verify_hathifiles_listing(date: firstday) expect(verifier.errors).to be_empty end - it "produces one error if upd file is missing mid month" do - # Make sure file does not exist - update_file = File.join(dir_path, "hathi_upd_#{midmonth_ymd}.txt.gz") - if File.exist?(update_file) - FileUtils.rm(update_file) - end + it "finds just an update file on secondday" do +
update_file = File.join(dir_path, "hathi_upd_#{secondday_ymd}.txt.gz") - verifier = described_class.new - verifier.verify_hathifiles_listing(date: midmonth) - expect(verifier.errors.size).to eq 1 + FileUtils.mkdir_p(dir_path) + FileUtils.touch(update_file) + + verifier.verify_hathifiles_listing(date: secondday) + expect(verifier.errors).to be_empty end - it "produces two errors if upd and full file are missing on the first day of the month" do - # Make sure files do not exist - update_file = File.join(dir_path, "hathi_upd_#{firstday_ymd}.txt.gz") - full_file = File.join(dir_path, "hathi_full_#{firstday_ymd}.txt.gz") + it "produces 2 errors if upd file and its listing are missing midmonth" do + verifier.verify_hathifiles_listing(date: missingday) + expect(verifier.errors.size).to eq 2 + expect(verifier.errors).to include(/Did not find a listing with filename: hathi_upd_#{missingday_ymd}/) + expect(verifier.errors.first).to include(/not found:.+_upd_#{missingday_ymd}/) + end - [update_file, full_file].each do |f| + it "produces 2 errors if upd and full file are missing on the first day of the month" do + # Remove the 2 files for firstday so the verifier sees them as missing + verifier.derivatives_for_date(date: firstday).each do |f| if File.exist?(f) FileUtils.rm(f) end end - - verifier = described_class.new verifier.verify_hathifiles_listing(date: firstday) expect(verifier.errors.size).to eq 2 + expect(verifier.errors.first).to include(/not found:.+_upd_#{firstday_ymd}/) + expect(verifier.errors.last).to include(/not found:.+_full_#{firstday_ymd}/) + end + end + + describe "#verify_file_in_json" do + it "finds a matching listing" do + verifier.verify_file_in_json(filename: "hathi_full_#{firstday_ymd}.txt.gz") + verifier.verify_file_in_json(filename: "hathi_upd_#{firstday_ymd}.txt.gz") + verifier.verify_file_in_json(filename: "hathi_upd_#{secondday_ymd}.txt.gz") + end + it "produces 1 error when not finding a matching listing" do + verifier.verify_file_in_json(filename:
"hathi_upd_#{missingday_ymd}.txt.gz") + expect(verifier.errors.size).to eq 1 + expect(verifier.errors).to include(/Did not find a listing with filename: hathi_upd_#{missingday_ymd}/) end end end From db38f03f7d8d63ae1f167d5c1152ec6517bc4ef4 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Mon, 9 Dec 2024 15:32:36 -0500 Subject: [PATCH 033/114] standardrb fix --- lib/verifier/hathifiles_listing_verifier.rb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index afc63d6..5e5e1cc 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -49,12 +49,12 @@ def verify_listing(path:) # Verify that the derivatives for the date are included in # "#{ENV['WWW_DIR']}/hathi_file_list.json" def verify_file_in_json(filename:) - json_path = "#{ENV['WWW_DIR']}/hathi_file_list.json" + json_path = "#{ENV["WWW_DIR"]}/hathi_file_list.json" listings = JSON.load_file(json_path) matches = [] listings.each do |listing| - if listing['filename'] == filename + if listing["filename"] == filename matches << listing break end From fa69fff30e1568cd2e2b6bea9a699dc0ab8a0dc9 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Mon, 9 Dec 2024 16:05:42 -0500 Subject: [PATCH 034/114] using spec helper method fixture() --- spec/unit/hathifiles_listing_verifier_spec.rb | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/spec/unit/hathifiles_listing_verifier_spec.rb b/spec/unit/hathifiles_listing_verifier_spec.rb index c81f198..95d431b 100644 --- a/spec/unit/hathifiles_listing_verifier_spec.rb +++ b/spec/unit/hathifiles_listing_verifier_spec.rb @@ -1,10 +1,6 @@ # frozen_string_literal: true -require "climate_control" -require "zlib" require "verifier/hathifiles_listing_verifier" -require "tempfile" -require "logger" module PostZephirProcessing RSpec.describe(HathifilesListingVerifier) do @@ -16,19 +12,19 @@ module 
PostZephirProcessing firstday = Date.parse("2023-01-01") secondday = Date.parse("2023-01-02") missingday = Date.parse("2023-01-13") - firstday_ymd = firstday.strftime("%Y%m%d") secondday_ymd = secondday.strftime("%Y%m%d") missingday_ymd = missingday.strftime("%Y%m%d") - dir_path = ENV["WWW_DIR"] before(:all) do - FileUtils.cp("/usr/src/app/spec/fixtures/hathi_file_list.json", dir_path) + FileUtils.cp(fixture("hathi_file_list.json"), dir_path) end around(:each) do |example| - with_test_environment { example.run } + with_test_environment do + example.run + end end describe "#derivatives_for_date" do From a70d8c604b0e51c4c0296552553a81835ec5e2a5 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Mon, 9 Dec 2024 17:14:50 -0500 Subject: [PATCH 035/114] Initial implementation and unit tests for HathifilesDatabaseVerifier --- docker-compose.yml | 1 + lib/verifier/hathifiles_database_verifier.rb | 57 ++++++++++++++ .../unit/hathifiles_database_verifier_spec.rb | 76 +++++++++++++++++++ sql/hathifiles.sql | 47 ++++++++++++ 4 files changed, 181 insertions(+) create mode 100644 lib/verifier/hathifiles_database_verifier.rb create mode 100644 spec/unit/hathifiles_database_verifier_spec.rb create mode 100644 sql/hathifiles.sql diff --git a/docker-compose.yml b/docker-compose.yml index d8c5fa4..a7fbdd1 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -53,6 +53,7 @@ services: image: ghcr.io/hathitrust/db-image:latest volumes: - ./sql/ingest.sql:/docker-entrypoint-initdb.d/999-ingest.sql + - ./sql/hathifiles.sql:/docker-entrypoint-initdb.d/999-hathifiles.sql restart: always environment: # setting via MYSQL_ROOT_PASSWORD didn't work; this at least diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb new file mode 100644 index 0000000..87cdb52 --- /dev/null +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -0,0 +1,57 @@ +# frozen_string_literal: true + +require "zlib" + +require_relative "../verifier" 
+require_relative "../derivatives" + +module PostZephirProcessing + class HathifilesDatabaseVerifier < Verifier + attr_reader :current_date + + # Does an entry exist in hf_log for the hathifile? + # Can pass a path or just the filename. + def self.has_log?(hathifile:) + PostZephirProcessing::Services[:database][:hf_log] + .where(hathifile: File.basename(hathifile)) + .count + .positive? + end + + def self.gzip_linecount(path:) + Zlib::GzipReader.open(path, encoding: "utf-8") { |gz| gz.count } + end + + # Count the number of entries in hathifiles.hf + def self.db_count + PostZephirProcessing::Services[:database][:hf].count + end + + def run_for_date(date:) + @current_date = date + verify_hathifiles_database_log + verify_hathifiles_database_count + end + + private + + def verify_hathifiles_database_log + update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: current_date) + if !self.class.has_log?(hathifile: update_file) + error "no hf_log entry for #{update_file}" + end + end + + def verify_hathifiles_database_count + # first of month + if current_date.day == 1 + full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: current_date) + full_file_count = self.class.gzip_linecount(path: full_file) + db_count = self.class.db_count + if full_file_count < db_count + error "#{full_file} has #{full_file_count} rows but hathifiles.hf has #{db_count}" + end + end + end + end +end diff --git a/spec/unit/hathifiles_database_verifier_spec.rb b/spec/unit/hathifiles_database_verifier_spec.rb new file mode 100644 index 0000000..9db4b20 --- /dev/null +++ b/spec/unit/hathifiles_database_verifier_spec.rb @@ -0,0 +1,76 @@ +# frozen_string_literal: true + +require "verifier/hathifiles_database_verifier" + +module PostZephirProcessing + TEST_UPDATE_FILE = fixture(File.join("hathifile_archive", "hathi_upd_20241202.txt.gz")) + TEST_UPDATE_LINECOUNT = 8 + + 
RSpec.describe(HathifilesDatabaseVerifier) do + around(:each) do |example| + with_test_environment { example.run } + end + + # Temporarily add `hathifile` to `hf_log` with the current timestamp. + def with_fake_hf_log_entry(hathifile:) + Services[:database][:hf_log].where(hathifile: hathifile).delete + Services[:database][:hf_log].insert(hathifile: hathifile) + begin + yield + ensure + Services[:database][:hf_log].where(hathifile: hathifile).delete + end + end + + # Temporarily add `htid` to `hf` with reasonable (and irrelevant) defaults. + def with_fake_hf_entries(htids:) + Services[:database][:hf].where(htid: htids).delete + htids.each { |htid| Services[:database][:hf].insert(htid: htid) } + begin + yield + ensure + Services[:database][:hf].where(htid: htids).delete + end + end + + describe ".has_log?" do + context "with corresponding hf_log" do + it "returns `true`" do + with_fake_hf_log_entry(hathifile: "hathi_upd_20241202.txt.gz") do + expect(described_class.has_log?(hathifile: TEST_UPDATE_FILE)).to eq(true) + end + end + end + + context "without corresponding hf_log" do + it "returns `false`" do + expect(described_class.has_log?(hathifile: TEST_UPDATE_FILE)).to eq(false) + end + end + end + + describe ".db_count" do + context "with no `hf` contents" do + it "returns 0" do + expect(described_class.db_count).to eq(0) + end + end + + context "without corresponding hf_log" do + fake_htids = ["test.001", "test.002", "test.003", "test.004", "test.005"] + it "returns the correct count > 0" do + with_fake_hf_entries(htids: fake_htids) do + expect(described_class.db_count.positive?).to eq(true) + expect(described_class.db_count).to eq(fake_htids.count) + end + end + end + end + + describe ".gzip_linecount" do + it "returns the correct number of lines" do + expect(described_class.gzip_linecount(path: TEST_UPDATE_FILE)).to eq(TEST_UPDATE_LINECOUNT) + end + end + end +end diff --git a/sql/hathifiles.sql b/sql/hathifiles.sql new file mode 100644 index 0000000..549f21b --- 
/dev/null +++ b/sql/hathifiles.sql @@ -0,0 +1,47 @@ +-- Used by HathifilesDatabaseVerifier + +CREATE TABLE IF NOT EXISTS `hf` ( + `htid` varchar(255) NOT NULL, + `access` tinyint(1) DEFAULT NULL, + `rights_code` varchar(255) DEFAULT NULL, + `bib_num` bigint(20) DEFAULT NULL, + `description` varchar(255) DEFAULT NULL, + `source` varchar(255) DEFAULT NULL, + `source_bib_num` text DEFAULT NULL, + `oclc` varchar(255) DEFAULT NULL, + `isbn` text DEFAULT NULL, + `issn` text DEFAULT NULL, + `lccn` varchar(255) DEFAULT NULL, + `title` text DEFAULT NULL, + `imprint` text DEFAULT NULL, + `rights_reason` varchar(255) DEFAULT NULL, + `rights_timestamp` datetime DEFAULT NULL, + `us_gov_doc_flag` tinyint(1) DEFAULT NULL, + `rights_date_used` int(11) DEFAULT NULL, + `pub_place` varchar(255) DEFAULT NULL, + `lang_code` varchar(255) DEFAULT NULL, + `bib_fmt` varchar(255) DEFAULT NULL, + `collection_code` varchar(255) DEFAULT NULL, + `content_provider_code` varchar(255) DEFAULT NULL, + `responsible_entity_code` varchar(255) DEFAULT NULL, + `digitization_agent_code` varchar(255) DEFAULT NULL, + `access_profile_code` varchar(255) DEFAULT NULL, + `author` text DEFAULT NULL, + KEY `hf_htid_index` (`htid`), + KEY `hf_rights_code_index` (`rights_code`), + KEY `hf_bib_num_index` (`bib_num`), + KEY `hf_rights_reason_index` (`rights_reason`), + KEY `hf_rights_timestamp_index` (`rights_timestamp`), + KEY `hf_us_gov_doc_flag_index` (`us_gov_doc_flag`), + KEY `hf_rights_date_used_index` (`rights_date_used`), + KEY `hf_lang_code_index` (`lang_code`), + KEY `hf_bib_fmt_index` (`bib_fmt`), + KEY `hf_collection_code_index` (`collection_code`), + KEY `hf_content_provider_code_index` (`content_provider_code`) +) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci; + + +CREATE TABLE IF NOT EXISTS `hf_log` ( + `hathifile` varchar(255) NOT NULL, + `time` timestamp NOT NULL DEFAULT current_timestamp() +) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;; From 
6942c5798a0e3cff831002c0f1958ba9ca73d68f Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 10 Dec 2024 10:21:54 -0500 Subject: [PATCH 036/114] env vars for DEV-1417 --- config/env | 2 ++ 1 file changed, 2 insertions(+) diff --git a/config/env b/config/env index 4629025..d9571ac 100644 --- a/config/env +++ b/config/env @@ -4,6 +4,8 @@ CATALOG_ARCHIVE=/usr/src/app/data/catalog_archive CATALOG_PREP=/usr/src/app/data/catalog_prep DATA_ROOT=/usr/src/app/data INGEST_BIBRECORDS=/usr/src/app/data/ingest_bibrecords +REDIRECTS_DIR=/usr/src/app/data/redirects +REDIRECTS_HISTORY_DIR=/usr/src/app/data/redirects_history RIGHTS_DIR=/usr/src/app/data/rights WWW_DIR=/usr/src/app/data/www ZEPHIR_DATA=/usr/src/app/data/zephir From f061925530c30c85883b69632245fa3d0ae3d9ab Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Tue, 10 Dec 2024 17:02:00 -0500 Subject: [PATCH 037/114] Finish up DEV-1416 hathifiles database tests --- lib/verifier/hathifiles_database_verifier.rb | 38 +++++-- .../unit/hathifiles_database_verifier_spec.rb | 104 ++++++++++++++++-- 2 files changed, 125 insertions(+), 17 deletions(-) diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index 87cdb52..091d008 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -36,22 +36,40 @@ def run_for_date(date:) private def verify_hathifiles_database_log - update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: current_date) - if !self.class.has_log?(hathifile: update_file) - error "no hf_log entry for #{update_file}" + # File missing? Not our problem, should be caught by earlier verifier. + if File.exist?(update_file) + if !self.class.has_log?(hathifile: update_file) + error message: "missing hf_log: no entry for daily #{update_file}" + end + end + if current_date.first_of_month? 
+ full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: current_date) + if File.exist?(full_file) + if !self.class.has_log?(hathifile: full_file) + error message: "missing hf_log: no entry for monthly #{full_file}" + end + end end end def verify_hathifiles_database_count - # first of month - if current_date.day == 1 - full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: current_date) - full_file_count = self.class.gzip_linecount(path: full_file) - db_count = self.class.db_count - if full_file_count < db_count - error "#{full_file} has #{full_file_count} rows but hathifiles.hf has #{db_count}" + if current_date.first_of_month? + if File.exist?(full_file) + full_file_count = self.class.gzip_linecount(path: full_file) + db_count = self.class.db_count + if full_file_count > db_count + error message: "hf count mismatch: #{full_file} (#{full_file_count}) vs hathifiles.hf (#{db_count})" + end end end end + + def update_file + self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: current_date) + end + + def full_file + self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: current_date) + end end end diff --git a/spec/unit/hathifiles_database_verifier_spec.rb b/spec/unit/hathifiles_database_verifier_spec.rb index 9db4b20..1d4868b 100644 --- a/spec/unit/hathifiles_database_verifier_spec.rb +++ b/spec/unit/hathifiles_database_verifier_spec.rb @@ -3,22 +3,34 @@ require "verifier/hathifiles_database_verifier" module PostZephirProcessing - TEST_UPDATE_FILE = fixture(File.join("hathifile_archive", "hathi_upd_20241202.txt.gz")) + TEST_UPDATE_FILE = "hathi_upd_20241202.txt.gz" + TEST_UPDATE_FIXTURE = fixture(File.join("hathifile_archive", TEST_UPDATE_FILE)) + TEST_FULL_FILE = "hathi_full_20241201.txt.gz" TEST_UPDATE_LINECOUNT = 8 RSpec.describe(HathifilesDatabaseVerifier) do 
around(:each) do |example| - with_test_environment { example.run } + with_test_environment do + ClimateControl.modify(HATHIFILE_ARCHIVE: fixture("hathifile_archive")) do + example.run + end + end + end + + let(:verifier) { described_class.new } + + def delete_hf_logs + Services[:database][:hf_log].delete end # Temporarily add `hathifile` to `hf_log` with the current timestamp. def with_fake_hf_log_entry(hathifile:) - Services[:database][:hf_log].where(hathifile: hathifile).delete + delete_hf_logs Services[:database][:hf_log].insert(hathifile: hathifile) begin yield ensure - Services[:database][:hf_log].where(hathifile: hathifile).delete + delete_hf_logs end end @@ -33,18 +45,28 @@ def with_fake_hf_entries(htids:) end end + # Copies one of our fixtures into @tmpdir and renames it. + # Could just use an additional fixture if we're not worried about the proliferation + # of them. + def with_fake_full_hathifile + ClimateControl.modify(HATHIFILE_ARCHIVE: @tmpdir) do + FileUtils.cp(TEST_UPDATE_FIXTURE, File.join(@tmpdir, TEST_FULL_FILE)) + yield + end + end + describe ".has_log?" 
do context "with corresponding hf_log" do it "returns `true`" do with_fake_hf_log_entry(hathifile: "hathi_upd_20241202.txt.gz") do - expect(described_class.has_log?(hathifile: TEST_UPDATE_FILE)).to eq(true) + expect(described_class.has_log?(hathifile: TEST_UPDATE_FIXTURE)).to eq(true) end end end context "without corresponding hf_log" do it "returns `false`" do - expect(described_class.has_log?(hathifile: TEST_UPDATE_FILE)).to eq(false) + expect(described_class.has_log?(hathifile: TEST_UPDATE_FIXTURE)).to eq(false) end end end @@ -69,7 +91,75 @@ def with_fake_hf_entries(htids:) describe ".gzip_linecount" do it "returns the correct number of lines" do - expect(described_class.gzip_linecount(path: TEST_UPDATE_FILE)).to eq(TEST_UPDATE_LINECOUNT) + expect(described_class.gzip_linecount(path: TEST_UPDATE_FIXTURE)).to eq(TEST_UPDATE_LINECOUNT) + end + end + + describe "#run_for_date" do + context "with upd hathifile" do + context "with corresponding hf_log" do + it "reports no `missing hf_log` errors" do + with_fake_hf_log_entry(hathifile: "hathi_upd_20241202.txt.gz") do + verifier.run_for_date(date: Date.new(2024, 12, 2)) + expect(verifier.errors).not_to include(/missing hf_log/) + end + end + end + + context "with no corresponding hf_log" do + it "reports `missing hf_log` error" do + verifier.run_for_date(date: Date.new(2024, 12, 2)) + expect(verifier.errors).to include(/missing hf_log/) + end + end + end + + # Each of these must be run with `with_fake_full_hathifile` + context "with full hathifile" do + context "with corresponding hf_log" do + it "reports no `missing hf_log` errors" do + with_fake_hf_log_entry(hathifile: TEST_FULL_FILE) do + with_fake_full_hathifile do + verifier.run_for_date(date: Date.new(2024, 12, 1)) + expect(verifier.errors).not_to include(/missing hf_log/) + end + end + end + end + + context "with no corresponding hf_log" do + it "reports `missing hf_log` error" do + with_fake_full_hathifile do + verifier.run_for_date(date: Date.new(2024, 12, 
1)) + expect(verifier.errors).to include(/missing hf_log/) + end + end + end + + context "with the expected `hf` rows" do + it "reports no `hf count mismatch` errors" do + with_fake_hf_log_entry(hathifile: TEST_FULL_FILE) do + with_fake_full_hathifile do + fake_htids = ["test.001", "test.002", "test.003", "test.004", "test.005", "test.006", "test.007", "test.008"] + with_fake_hf_entries(htids: fake_htids) do + verifier.run_for_date(date: Date.new(2024, 12, 1)) + expect(verifier.errors).not_to include(/hf count mismatch/) + end + end + end + end + end + + context "without the expected `hf` rows" do + it "reports `hf count mismatch` error" do + with_fake_hf_log_entry(hathifile: TEST_FULL_FILE) do + with_fake_full_hathifile do + verifier.run_for_date(date: Date.new(2024, 12, 1)) + expect(verifier.errors).to include(/hf count mismatch/) + end + end + end + end end end end From 889ed916c05b723dd7bf8288d20ef9d6d11a3d1d Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 10 Dec 2024 17:05:32 -0500 Subject: [PATCH 038/114] DEV-1417, hathifiles redirects --- lib/verifier/hathifiles_redirects_verifier.rb | 63 +++++++++++ spec/fixtures/redirects/202301.ndj.gz | Bin 0 -> 185 bytes .../redirects/redirects_202301.txt.gz | Bin 0 -> 141 bytes .../hathifiles_redirects_verifier_spec.rb | 99 ++++++++++++++++++ 4 files changed, 162 insertions(+) create mode 100644 lib/verifier/hathifiles_redirects_verifier.rb create mode 100644 spec/fixtures/redirects/202301.ndj.gz create mode 100644 spec/fixtures/redirects/redirects_202301.txt.gz create mode 100644 spec/unit/hathifiles_redirects_verifier_spec.rb diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb new file mode 100644 index 0000000..2552dee --- /dev/null +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -0,0 +1,63 @@ +# frozen_string_literal: true + +require 'zinzout' +require_relative "../verifier" + +module PostZephirProcessing + class HathifileRedirectsVerifier < 
Verifier + attr_accessor :current_date + + REDIRECTS_REGEX = /^\d{9}\t\d{9}$/ + HISTORY_FILE_KEYS = ["recid", "mrs", "entries", "json_class"] + + def verify_redirects(date: Date.today) + @current_date = date + verify_redirects_file + verify_redirects_history_file + end + + def verify_redirects_file(path: redirects_file) + if verify_file(path: path) + # check that each line in the file matches regex + Zlib::GzipReader.open(path, encoding: "utf-8").each_line do |line| + unless REDIRECTS_REGEX.match?(line) + error(message: "#{redirects_file} contains malformed line: #{line}") + end + end + end + end + + def verify_redirects_history_file(path: redirects_history_file) + if verify_file(path: path) + Zlib::GzipReader.open(path, encoding: "utf-8").each_line do |line| + begin + parsed = JSON.parse(line) + # Check that the line parses to a hash + unless parsed.class == Hash + error(message: "#{redirects_history_file} contains malformed line: #{line}") + next + end + # Check that the outermost level of keys in the JSON line are what we expect + unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS + error(message: "#{redirects_history_file} contains malformed line: #{line}") + next + end + # could go further and verify deeper structure of json, + # but not sure it's worth it? 
+ rescue JSON::ParserError + error(message: "#{redirects_history_file} contains malformed line: #{line}") + end + end + end + end + + def redirects_file(date: current_date) + File.join(ENV["REDIRECTS_DIR"], "redirects_#{date.strftime("%Y%m")}.txt.gz") + end + + def redirects_history_file(date: current_date) + File.join(ENV["REDIRECTS_HISTORY_DIR"], "#{date.strftime("%Y%m")}.ndj.gz") + end + + end +end diff --git a/spec/fixtures/redirects/202301.ndj.gz b/spec/fixtures/redirects/202301.ndj.gz new file mode 100644 index 0000000000000000000000000000000000000000..c762c57247f8875e9f7d7b52dcfa2721778448aa GIT binary patch literal 185 zcmV;q07m~GiwFptv{+{V12Ql&GcYkOZe(fz#g9D;!Y~v?cYns4gS@m&lV(<^tGG*P zB8l{iBufeX@1`K?P|(fs-OD|^6T#`$Dv%V5l0fe~$PBX_DFHnMXLSl9^lIAY6;g&A z882v+t*Hs?a;qfJOw&q2fr3Ei%^Q37<8W=e#wRk=O)&OmJ6&5JMmI}YUrvYTaZC@) nuqaB^H7+!8dhY<-c?u3BLCPo&f*=@=H)d literal 0 HcmV?d00001 diff --git a/spec/fixtures/redirects/redirects_202301.txt.gz b/spec/fixtures/redirects/redirects_202301.txt.gz new file mode 100644 index 0000000000000000000000000000000000000000..cdbe6251334f1c8a509e17ee28903a5393b7db71 GIT binary patch literal 141 zcmV;80CN8yiwFo7tXO9N19D|#X>w&_baP)aFfubRF)nm?bO1e$K@xy42m{acH33p6 z|Noe&RXH)6%>baGh94m&TIkZXN8yRwT#sQ+2#vD%A~hrVd> Date: Tue, 10 Dec 2024 17:07:15 -0500 Subject: [PATCH 039/114] standardrb --- lib/verifier/hathifiles_redirects_verifier.rb | 37 +++++++++---------- .../hathifiles_redirects_verifier_spec.rb | 13 +++---- 2 files changed, 23 insertions(+), 27 deletions(-) diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index 2552dee..e4876eb 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -1,6 +1,6 @@ # frozen_string_literal: true -require 'zinzout' +require "zinzout" require_relative "../verifier" module PostZephirProcessing @@ -9,9 +9,9 @@ class HathifileRedirectsVerifier < Verifier REDIRECTS_REGEX = /^\d{9}\t\d{9}$/ 
HISTORY_FILE_KEYS = ["recid", "mrs", "entries", "json_class"] - + def verify_redirects(date: Date.today) - @current_date = date + @current_date = date verify_redirects_file verify_redirects_history_file end @@ -30,34 +30,31 @@ def verify_redirects_file(path: redirects_file) def verify_redirects_history_file(path: redirects_history_file) if verify_file(path: path) Zlib::GzipReader.open(path, encoding: "utf-8").each_line do |line| - begin - parsed = JSON.parse(line) - # Check that the line parses to a hash - unless parsed.class == Hash - error(message: "#{redirects_history_file} contains malformed line: #{line}") - next - end - # Check that the outermost level of keys in the JSON line are what we expect - unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS - error(message: "#{redirects_history_file} contains malformed line: #{line}") - next - end - # could go further and verify deeper structure of json, - # but not sure it's worth it? - rescue JSON::ParserError + parsed = JSON.parse(line) + # Check that the line parses to a hash + unless parsed.instance_of?(Hash) + error(message: "#{redirects_history_file} contains malformed line: #{line}") + next + end + # Check that the outermost level of keys in the JSON line are what we expect + unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS error(message: "#{redirects_history_file} contains malformed line: #{line}") + next end + # could go further and verify deeper structure of json, + # but not sure it's worth it? 
+ rescue JSON::ParserError + error(message: "#{redirects_history_file} contains malformed line: #{line}") end end end def redirects_file(date: current_date) - File.join(ENV["REDIRECTS_DIR"], "redirects_#{date.strftime("%Y%m")}.txt.gz") + File.join(ENV["REDIRECTS_DIR"], "redirects_#{date.strftime("%Y%m")}.txt.gz") end def redirects_history_file(date: current_date) File.join(ENV["REDIRECTS_HISTORY_DIR"], "#{date.strftime("%Y%m")}.ndj.gz") end - end end diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index 2fa1dd3..932b8dd 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -9,10 +9,10 @@ module PostZephirProcessing let(:redirects_file) { verifier.redirects_file(date: test_date) } let(:redirects_history_file) { verifier.redirects_history_file(date: test_date) } let(:mess) { "oops, messed up a line!" } - + # Clean dir before each test before(:each) do - [ENV['REDIRECTS_DIR'], ENV['REDIRECTS_HISTORY_DIR']].each do |dir| + [ENV["REDIRECTS_DIR"], ENV["REDIRECTS_HISTORY_DIR"]].each do |dir| FileUtils.rm_rf(dir) FileUtils.mkdir_p(dir) end @@ -23,13 +23,13 @@ module PostZephirProcessing end def stage_redirects_file - FileUtils.cp(fixture("redirects/redirects_202301.txt.gz"), ENV['REDIRECTS_DIR']) + FileUtils.cp(fixture("redirects/redirects_202301.txt.gz"), ENV["REDIRECTS_DIR"]) end def stage_redirects_history_file - FileUtils.cp(fixture("redirects/202301.ndj.gz"), ENV['REDIRECTS_HISTORY_DIR']) + FileUtils.cp(fixture("redirects/202301.ndj.gz"), ENV["REDIRECTS_HISTORY_DIR"]) end - + describe "#verify_redirects" do it "will warn twice if both files missing" do verifier.verify_redirects(date: test_date) @@ -54,7 +54,7 @@ def stage_redirects_history_file stage_redirects_history_file verifier.verify_redirects(date: test_date) expect(verifier.errors).to be_empty - end + end end describe "#verify_redirects_file" do @@ -94,6 +94,5 @@ def 
stage_redirects_history_file expect(verifier.errors).to include(/#{redirects_history_file} contains malformed line: #{mess}/) end end - end end From 28202974732ad8e73651fd4885d4fef4e8d3eb54 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 11 Dec 2024 10:34:25 -0500 Subject: [PATCH 040/114] [DEV-1417] more tests, setting date in initialize --- lib/verifier/hathifiles_redirects_verifier.rb | 25 +++++--- .../hathifiles_redirects_verifier_spec.rb | 57 ++++++++++++++----- 2 files changed, 60 insertions(+), 22 deletions(-) diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index e4876eb..caa5141 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -1,17 +1,20 @@ # frozen_string_literal: true -require "zinzout" require_relative "../verifier" module PostZephirProcessing class HathifileRedirectsVerifier < Verifier - attr_accessor :current_date + attr_reader :current_date REDIRECTS_REGEX = /^\d{9}\t\d{9}$/ HISTORY_FILE_KEYS = ["recid", "mrs", "entries", "json_class"] - def verify_redirects(date: Date.today) + def initialize(date: Date.today) + super() @current_date = date + end + + def verify_redirects(date: current_date) verify_redirects_file verify_redirects_history_file end @@ -21,7 +24,7 @@ def verify_redirects_file(path: redirects_file) # check that each line in the file matches regex Zlib::GzipReader.open(path, encoding: "utf-8").each_line do |line| unless REDIRECTS_REGEX.match?(line) - error(message: "#{redirects_file} contains malformed line: #{line}") + report_malformed(file: redirects_file, line: line) end end end @@ -33,18 +36,18 @@ def verify_redirects_history_file(path: redirects_history_file) parsed = JSON.parse(line) # Check that the line parses to a hash unless parsed.instance_of?(Hash) - error(message: "#{redirects_history_file} contains malformed line: #{line}") + report_malformed(file: redirects_history_file, line: line) next end # 
Check that the outermost level of keys in the JSON line are what we expect unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS - error(message: "#{redirects_history_file} contains malformed line: #{line}") + report_malformed(file: redirects_history_file, line: line) next end - # could go further and verify deeper structure of json, + # here we could go further and verify deeper structure of json, # but not sure it's worth it? rescue JSON::ParserError - error(message: "#{redirects_history_file} contains malformed line: #{line}") + report_malformed(file: redirects_history_file, line: line) end end end @@ -56,5 +59,11 @@ def redirects_file(date: current_date) def redirects_history_file(date: current_date) File.join(ENV["REDIRECTS_HISTORY_DIR"], "#{date.strftime("%Y%m")}.ndj.gz") end + + private + + def report_malformed(file:, line:) + error(message: "#{file} contains malformed line: #{line}") + end end end diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index 932b8dd..31b48fe 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -1,13 +1,15 @@ # frozen_string_literal: true require "verifier/hathifiles_redirects_verifier" +require "zinzout" module PostZephirProcessing RSpec.describe(HathifileRedirectsVerifier) do - let(:verifier) { described_class.new } let(:test_date) { Date.parse("2023-01-01") } + let(:verifier) { described_class.new(date: test_date) } let(:redirects_file) { verifier.redirects_file(date: test_date) } let(:redirects_history_file) { verifier.redirects_history_file(date: test_date) } + # Including this mess should invalidate either file let(:mess) { "oops, messed up a line!" 
} # Clean dir before each test @@ -30,6 +32,33 @@ def stage_redirects_history_file FileUtils.cp(fixture("redirects/202301.ndj.gz"), ENV["REDIRECTS_HISTORY_DIR"]) end + # Intentionally add mess to an otherwise well-formed file to trigger errors + def malform(file) + Zinzout.zout(file) do |outfile| + outfile.puts mess + end + end + + describe "#initialize" do + it "sets current_date (attr_reader) by default or by param" do + expect(described_class.new.current_date).to eq Date.today + expect(described_class.new(date: test_date).current_date).to eq test_date + end + end + + describe "#redirects_file" do + it "returns path to dated file, based on date param or verifier's default date" do + expect(verifier.redirects_file).to end_with("redirects_#{test_date.strftime("%Y%m")}.txt.gz") + expect(verifier.redirects_file(date: Date.today)).to end_with("redirects_#{Date.today.strftime("%Y%m")}.txt.gz") + end + end + describe "#redirects_history_file" do + it "returns path to dated file, based on date param or verifier's default date" do + expect(verifier.redirects_history_file).to end_with("#{test_date.strftime("%Y%m")}.ndj.gz") + expect(verifier.redirects_history_file(date: Date.today)).to end_with("#{Date.today.strftime("%Y%m")}.ndj.gz") + end + end + describe "#verify_redirects" do it "will warn twice if both files missing" do verifier.verify_redirects(date: test_date) @@ -49,7 +78,17 @@ def stage_redirects_history_file expect(verifier.errors.count).to eq 1 expect(verifier.errors).to include(/not found: #{redirects_file}/) end - it "will not warn if both files are there" do + it "will warn if files are there but malformed" do + stage_redirects_file + stage_redirects_history_file + malform(redirects_file) + malform(redirects_history_file) + verifier.verify_redirects(date: test_date) + expect(verifier.errors.count).to eq 2 + expect(verifier.errors).to include(/#{redirects_file} contains malformed line: #{mess}/) + expect(verifier.errors).to include(/#{redirects_history_file}
contains malformed line: #{mess}/) + end + it "will not warn if both files are there & valid" do stage_redirects_file stage_redirects_history_file verifier.verify_redirects(date: test_date) @@ -60,16 +99,11 @@ def stage_redirects_history_file describe "#verify_redirects_file" do it "accepts a well-formed file" do stage_redirects_file - verifier.current_date = test_date verifier.verify_redirects_file(path: redirects_file) end it "rejects a malformed file" do stage_redirects_file - # intentionally mess up the staged file - Zinzout.zout(redirects_file) do |outfile| - outfile.puts mess - end - verifier.current_date = test_date + malform(redirects_file) verifier.verify_redirects_file(path: redirects_file) expect(verifier.errors.count).to eq 1 expect(verifier.errors).to include(/#{redirects_file} contains malformed line: #{mess}/) @@ -79,16 +113,11 @@ def stage_redirects_history_file describe "#verify_redirects_history_file" do it "accepts a well-formed file" do stage_redirects_history_file - verifier.current_date = test_date verifier.verify_redirects_history_file(path: redirects_history_file) end it "rejects a malformed file" do stage_redirects_history_file - # intentionally mess up the staged file - Zinzout.zout(redirects_history_file) do |outfile| - outfile.puts mess - end - verifier.current_date = test_date + malform(redirects_history_file) verifier.verify_redirects_history_file(path: redirects_history_file) expect(verifier.errors.count).to eq 1 expect(verifier.errors).to include(/#{redirects_history_file} contains malformed line: #{mess}/) From 12d12e9e06a4d37d2ec4086057367774b3b605f5 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 11 Dec 2024 14:38:20 -0500 Subject: [PATCH 041/114] Finish spec for DEV-1413 PopulateRightsVerifier --- lib/verifier/populate_rights_verifier.rb | 12 ++- spec/unit/populate_rights_verifier_spec.rb | 106 ++++++++++++++++----- 2 files changed, 92 insertions(+), 26 deletions(-) diff --git
a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index c6b6c3b..924c56d 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -15,18 +15,22 @@ module PostZephirProcessing # run has made a change. # We may also look for errors in the output logs (postZephir.pm and/or populate_rights_data.pl?) - # but thsat is out of scope for now. + # but that is out of scope for now. class PopulateRightsVerifier < Verifier FULL_RIGHTS_TEMPLATE = "zephir_full_YYYYMMDD.rights" UPD_RIGHTS_TEMPLATE = "zephir_upd_YYYYMMDD.rights" def run_for_date(date:) upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: UPD_RIGHTS_TEMPLATE, date: date) - verify_rights_file(path: upd_path) + if File.exist? upd_path + verify_rights_file(path: upd_path) + end if date.last_of_month? full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: FULL_RIGHTS_TEMPLATE, date: date) - verify_rights_file(path: full_path) + if File.exist? full_path + verify_rights_file(path: full_path) + end end end @@ -42,7 +46,7 @@ def verify_rights_file(path:) htid = line.split("\t").first namespace, id = htid.split(".", 2) if db[:rights_current].where(namespace: namespace, id: id).count.zero? 
- error message: "no entry in rights_current for #{htid}" + error message: "missing rights_current for #{htid}" end end end diff --git a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/populate_rights_verifier_spec.rb index 25a020d..d0b999b 100644 --- a/spec/unit/populate_rights_verifier_spec.rb +++ b/spec/unit/populate_rights_verifier_spec.rb @@ -5,46 +5,108 @@ module PostZephirProcessing RSpec.describe(PopulateRightsVerifier) do around(:each) do |example| - with_test_environment { example.run } + with_test_environment do + ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + example.run + end + end + end + + let(:test_rights) { 10.times.collect { |n| "test.%03d" % n } } + let(:test_rights_file_contents) do + test_rights.map do |rights| + [rights, "ic", "bib", "bibrights", "aa"].join("\t") + end.join("\n") + end + let(:verifier) { described_class.new } + let(:db) { Services[:database][:rights_current] } + + # Creates a full or upd rights file in @tmpdir. + def with_fake_rights_file(date:, full: false) + rights_file = File.join(@tmpdir, full ? described_class::FULL_RIGHTS_TEMPLATE : described_class::UPD_RIGHTS_TEMPLATE) + .sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) + File.write(rights_file, test_rights_file_contents) + yield end - let(:test_rights) do - [ - ["a.123", "ic", "bib", "bibrights", "aa"].join("\t") - ].join("\n") + def insert_fake_rights(namespace:, id:) + db.insert(namespace: namespace, id: id, attr: 1, reason: 1, source: 1, access_profile: 1) end - # Temporarily add `htid` to `rights_current` with resonable (and irrelevant) default values. - def with_fake_rights_entry(htid:) - namespace, id = htid.split(".", 2) - Services[:database][:rights_current].where(namespace: namespace, id: id).delete - Services[:database][:rights_current].insert( - namespace: namespace, - id: id, - attr: 1, - reason: 1, - source: 1, - access_profile: 1 - ) + # Temporarily add each `htid` to `rights_current` with reasonable (and irrelevant) default values.
+ def with_fake_rights_entries(htids: test_rights) + split_htids = htids.map { |htid| htid.split(".", 2) } + Services[:database][:rights_current].where([:namespace, :id] => split_htids).delete + split_htids.each do |split_htid| + insert_fake_rights(namespace: split_htid[0], id: split_htid[1]) + end begin yield ensure - Services[:database][:rights_current].where(namespace: namespace, id: id).delete + Services[:database][:rights_current].where([:namespace, :id] => split_htids).delete + end + end + + describe "#run_for_date" do + context "monthly" do + date = Date.new(2024, 11, 30) + context "with HTID in the Rights Database" do + it "logs no `missing rights_current` error" do + with_fake_rights_entries do + with_fake_rights_file(date: date, full: true) do + verifier.run_for_date(date: date) + expect(verifier.errors).not_to include(/missing rights_current/) + end + end + end + end + + context "with HTID not in the Rights Database" do + it "logs `missing rights_current` error" do + with_fake_rights_file(date: date, full: true) do + verifier.run_for_date(date: date) + expect(verifier.errors).to include(/missing rights_current/) + end + end + end + end + + context "daily" do + date = Date.new(2024, 12, 2) + context "with HTID in the Rights Database" do + it "logs no `missing rights_current` error" do + with_fake_rights_entries do + with_fake_rights_file(date: date) do + verifier.run_for_date(date: date) + expect(verifier.errors).not_to include(/missing rights_current/) + end + end + end + end + + context "with HTID not in the Rights Database" do + it "logs `missing rights_current` error" do + with_fake_rights_file(date: date) do + verifier.run_for_date(date: date) + expect(verifier.errors).to include(/missing rights_current/) + end + end + end end end describe "#verify_rights_file" do context "with HTID in the Rights Database" do it "logs no error" do - with_fake_rights_entry(htid: "a.123") do - expect_ok(:verify_rights_file, test_rights) + with_fake_rights_entries do + 
expect_ok(:verify_rights_file, test_rights_file_contents) end end end context "with HTID not in the Rights Database" do - it "logs an error" do - expect_not_ok(:verify_rights_file, test_rights, errmsg: /no entry/) + it "logs `missing rights_current` error" do + expect_not_ok(:verify_rights_file, test_rights_file_contents, errmsg: /missing rights_current/) end end end From d4cbf41dd05b19d075fdfeb9e11562dac0890dae Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 11 Dec 2024 14:48:11 -0500 Subject: [PATCH 042/114] moved gzip_linecount from verifier/hathifiles_database_verifier.rb to verifier.rb for reusability --- lib/verifier.rb | 4 ++++ lib/verifier/hathifiles_database_verifier.rb | 6 +----- spec/unit/hathifiles_database_verifier_spec.rb | 6 ------ spec/unit/verifier_spec.rb | 6 ++++++ 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/lib/verifier.rb b/lib/verifier.rb index 1328eef..ebc499c 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -73,6 +73,10 @@ def verify_file_readable(path:) end end + def gzip_linecount(path:) + Zlib::GzipReader.open(path, encoding: "utf-8") { |gz| gz.count } + end + # I'm not sure if we're going to try to distinguish errors and warnings. # For now let's call everything an error. def error(message:) diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index 091d008..753d568 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -18,10 +18,6 @@ def self.has_log?(hathifile:) .positive? end - def self.gzip_linecount(path:) - Zlib::GzipReader.open(path, encoding: "utf-8") { |gz| gz.count } - end - # Count the number of entries in hathifiles.hf def self.db_count PostZephirProcessing::Services[:database][:hf].count @@ -55,7 +51,7 @@ def verify_hathifiles_database_log def verify_hathifiles_database_count if current_date.first_of_month? 
if File.exist?(full_file) - full_file_count = self.class.gzip_linecount(path: full_file) + full_file_count = gzip_linecount(path: full_file) db_count = self.class.db_count if full_file_count > db_count error message: "hf count mismatch: #{full_file} (#{full_file_count}) vs hathifiles.hf (#{db_count})" diff --git a/spec/unit/hathifiles_database_verifier_spec.rb b/spec/unit/hathifiles_database_verifier_spec.rb index 1d4868b..45eeb68 100644 --- a/spec/unit/hathifiles_database_verifier_spec.rb +++ b/spec/unit/hathifiles_database_verifier_spec.rb @@ -89,12 +89,6 @@ def with_fake_full_hathifile end end - describe ".gzip_linecount" do - it "returns the correct number of lines" do - expect(described_class.gzip_linecount(path: TEST_UPDATE_FIXTURE)).to eq(TEST_UPDATE_LINECOUNT) - end - end - describe "#run_for_date" do context "with upd hathifile" do context "with corresponding hf_log" do diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index a9a76f4..de46972 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -52,5 +52,11 @@ module PostZephirProcessing end end end + + describe "#gzip_linecount" do + it "returns the correct number of lines" do + expect(verifier.gzip_linecount(path: TEST_UPDATE_FIXTURE)).to eq(TEST_UPDATE_LINECOUNT) + end + end end end From ccb4abc9a2f7e3f71b43e3b5b0be076038cfaa4b Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 11 Dec 2024 17:14:05 -0500 Subject: [PATCH 043/114] [DEV-1422] added input/output line count check + test --- lib/verifier/post_zephir_verifier.rb | 10 +++- spec/unit/post_zephir_verifier_spec.rb | 69 +++++++++++++++++--------- 2 files changed, 55 insertions(+), 24 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index f1ed22c..89d8e03 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -34,7 +34,15 @@ def run_for_date(date:) def verify_catalog_archive(date: current_date) verify_file(path: 
self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) if date.last_of_month? - verify_file(path: self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date)) + output_path = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date) + verify_file(path: output_path) + output_linecount = gzip_linecount(path: output_path) + input_path = File.join([ENV["ZEPHIR_DATA"], "ht_bib_export_full_#{date.strftime("%Y-%m-%d")}.json.gz"]) + input_linecount = gzip_linecount(path: input_path) + + if output_linecount != input_linecount + error message: "output line count (#{output_path} = #{output_linecount}) != input line count (#{input_path} = #{input_linecount})" + end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index f06d9c2..4ea691d 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -1,6 +1,7 @@ # frozen_string_literal: true require "verifier/post_zephir_verifier" +require "zinzout" module PostZephirProcessing RSpec.describe(PostZephirVerifier) do @@ -8,29 +9,6 @@ module PostZephirProcessing with_test_environment { example.run } end - # These helpers are based on the ones from - # #verify_deletes_contents but are more general - - # the expect-methods take a method arg for the method under test, - # a contents string that's written to a tempfile and passed to the method, - # and an optional errmsg arg (as a regexp) for specific error checking - - def expect_not_ok(method, contents, errmsg: /.*/, gzipped: false) - with_temp_file(contents, gzipped: gzipped) do |tmpfile| - verifier = described_class.new - verifier.send(method, path: tmpfile) - expect(verifier.errors).to include(errmsg) - end - end - - def expect_ok(method, contents, gzipped: false) - with_temp_file(contents, gzipped: gzipped) do |tmpfile| - verifier = 
described_class.new - verifier.send(method, path: tmpfile) - expect(verifier.errors).to be_empty - end - end - describe "#verify_deletes_contents" do def expect_deletefile_error(contents) expect_not_ok(:verify_deletes_contents, @@ -105,6 +83,51 @@ def expect_deletefile_ok(contents) end end + describe "#verify_catalog_archive" do + it "requires input file to have same line count as output file" do + verifier = described_class.new + test_date = Date.parse("2023-01-31") + + # Make a fake input file + FileUtils.mkdir_p(ENV["ZEPHIR_DATA"]) + input_file_name = "ht_bib_export_full_#{test_date.strftime("%Y-%m-%d")}.json.gz" + input_file_path = File.join(ENV["ZEPHIR_DATA"], input_file_name) + Zinzout.zout(input_file_path) do |input_gz| + 1.upto(3) do |i| + input_gz.puts "{ \"i\": #{i} }" + end + end + + # Fake output files + FileUtils.mkdir_p(ENV["CATALOG_ARCHIVE"]) + output_file_names = [ + "zephir_full_#{test_date.strftime("%Y%m%d")}_vufind.json.gz", + "zephir_upd_#{test_date.strftime("%Y%m%d")}.json.gz" + ] + output_file_names.each do |output_file_name| + output_file_path = File.join(ENV["CATALOG_ARCHIVE"], output_file_name) + Zinzout.zout(output_file_path) do |output_gz| + 1.upto(3) do |i| + output_gz.puts "{ \"i\": #{i} }" + end + end + end + + # Expect no warnings when line counts match. 
+ verifier.verify_catalog_archive(date: test_date) + expect(verifier.errors).to be_empty + + # Change line count in input file and expect a warning + Zinzout.zout(input_file_path) do |input_gz| + input_gz.puts "{ \"i\": \"one line too many\" }" + end + + verifier.verify_catalog_archive(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/output line count .+ != input line count/) + end + end + describe "#verify_rights_file_format" do it "accepts an empty file" do expect_ok(:verify_rights_file_format, "") From eb8a28dc0636b5489fabf862c6830347f759ecdc Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 11 Dec 2024 17:30:42 -0500 Subject: [PATCH 044/114] DEV-1418: verify update files for catalog indexing * add faraday to get info from solr * remove extraneous swap file * update gitignore --- .gitignore | 2 + Gemfile | 1 + Gemfile.lock | 11 +++++ lib/verifier/.hathifiles_verifier.rb.swp | Bin 16384 -> 0 bytes lib/verifier/catalog_index_verifier.rb | 30 ++++++++++++- spec/unit/catalog_indexing_verifier_spec.rb | 44 ++++++++++++++++---- 6 files changed, 80 insertions(+), 8 deletions(-) delete mode 100644 lib/verifier/.hathifiles_verifier.rb.swp diff --git a/.gitignore b/.gitignore index b7f81fd..65eadf1 100644 --- a/.gitignore +++ b/.gitignore @@ -22,3 +22,5 @@ compare_* *.jsonl t/fixtures/rights_dbm cover_db +**/.*.sw? 
+*~ diff --git a/Gemfile b/Gemfile index 6df094d..2f51f48 100644 --- a/Gemfile +++ b/Gemfile @@ -4,6 +4,7 @@ source "https://rubygems.org" gem "canister" gem "dotenv" +gem "faraday" gem "mysql2" gem "sequel" gem "zinzout" diff --git a/Gemfile.lock b/Gemfile.lock index 99a1052..5948e84 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -14,12 +14,21 @@ GEM diff-lcs (1.5.1) docile (1.4.1) dotenv (3.1.4) + faraday (2.12.2) + faraday-net_http (>= 2.0, < 3.5) + json + logger + faraday-net_http (3.4.0) + net-http (>= 0.5.0) hashdiff (1.1.2) json (2.7.2) language_server-protocol (3.17.0.3) lint_roller (1.1.0) + logger (1.6.2) method_source (1.1.0) mysql2 (0.5.6) + net-http (0.6.0) + uri parallel (1.26.3) parser (3.3.5.0) ast (~> 2.4.1) @@ -86,6 +95,7 @@ GEM standardrb (1.0.1) standard unicode-display_width (2.6.0) + uri (1.0.2) webmock (3.24.0) addressable (>= 2.8.0) crack (>= 0.3.2) @@ -100,6 +110,7 @@ DEPENDENCIES canister climate_control dotenv + faraday mysql2 pry rspec diff --git a/lib/verifier/.hathifiles_verifier.rb.swp b/lib/verifier/.hathifiles_verifier.rb.swp deleted file mode 100644 index ff862d49a9d5c8b98aa94e049028708ff824e750..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 16384 zcmeHOO^h5z6>dVDgr5*ZI3Ng8dDh0W@l4Np?L_g|8{4th_DZ|4V>>2^cirjfu9+#e zr@PbDy}PsP0d9dy1j->f5q>28Zv?n-00gl>u|QlbggAhZARq;}AO#Naz3TtnS!Y2x zAn2C9otf^cSFgT$^si%PoTfIC4KHCpKsgr`S#@ZLh{+)^gQ`leN+ro3{(tM3{(tM3{(tM z3{(tM3{(tM3{(tM4Ez@|VB*dF3iQ$`0f6`a>Hhz>_h{O0fL{RL0zM0T2ABiR0(St{ z-=k@-0AB}2z!ES6{NZj*`zi1R;M2e(!2Q5&z+d03Y2O9D23Wvjz%Jn1@6xm{0?z}F z0S^JMy;IX}0AB`P1fBsb;3%L2|9FR{y$lQh19$+~2fTKdrriWS4?F{$1|9_N0^Yb2 zZ38a=0{AHK5#as6ukO&ap8%f%LZAVB0QmVXP5U0O25{gg&;llbIB~Oz!30&CEx*IH}Erz$93Qp;H$t0xB$>Ny`Cm# zwznttCYR{4-8fm{*z$@s#g`p4=4@A0)sEfuGJvKe;v#HF)yr|0Hp zjni|p^QVjpvu9?Xn7w#KEG0!_J#=nbePdOGeM7i5Ut_k%uBk^2Ca$TXHt#ZZ{9|au zb?WiQq^2Ho*G?o(Fw>D9v-;dxW{z;VWPVRIK#HqYvIb11I-iyEeC{$jzLMO@HD$Nz5X0s2wU9+F zPQJ*rDCrOrU?`=^EKsGG7I05Rb*t|Ud|mpE2 
z)8>IrJV0BAH_(dZ5&dmeQzsK@ld%DiBsKK(B_H-6ayJk1aT0_+7I-b{F~g=3HDu)T zpdR;_Oh~vQ6s9BAxzZ)E9y1IBBT#y>cC2GN?^fJ+YO$@W`u%%aiFg=W6!yIU2ITpn z@LbbrFw-64#1i=Gd6~Vmu3j;l>-(4X?`bvix)vRHVmSDux&u-;1O@z=NCez&FvnZv zfn`b#Yc>PZ3V9&ee&#z-YKF1c#b#NqEqdtXx=9MbE;-D9r?PER_7nZa%Z?|H6dmvn z!2$;HmdUm}HzZ|2$cli??1YT6!Y7(fm8q~i$KjSDVvCxQ58K#8Z{GJUqB^2ubO&Kc zsauNBH>Mf~N5uhmOjt$9R6;hu5jb|RFv6J%`o{F8`hL`L1bh~atM8ar*<7X>46!JQ z)jJV}-gL>Gt@cF-$A?Pk{BK=dv`h9Fh+aRGhHZx2h$OcQUBy>78sJEtm8o}S>cCR5 zbCEQ9-il#+meF-g>St=nN7xjpmbwONcjM5giNDjyU#7md3xA7}6TLz);4(A^eo6G+ zC!2c*&Am3;d$zrIp}qHFt{`Sy*%}f-nCa4U;Xq6C?MPw1nU3E#J3Iv0P9~4$eLi`` zzz+nBH6P9++syoOJxoU)#6u|iUgX&D!OPqk(imZyyFKs@Pb^%^iJ)KD=WLnDsH6P4 zo+vZH9`!=rW>$n~=7v-owzbTPpO)w!lE>;a_p+QDR`emeLL0%%va7htv1;Ky} zzU6wsKm`WO6q6h#jE#AiKM0OA2muxnflUl&22scSnLJi+&_~HOJj<~R%$E`|>pbvC zqk)GA3-JT9DY*~79%95%xC8H=iC(-|5||rJ))&*E@L=o85}+Y;j%`4CQ2@^XcSTYq zGtpdPVkYZEJ($)wONgr(7Vx ze3=hXJ9k@$nF^)(uuu@6Y93x}>f;Z31N7sbyQ>17l z3N|q8ZZnS{Z0?k5a+~En7G_t*J-=w2Jooscm+*M{K|Hq{6RV%I>GAH9arBp$lY`&v zC`-R;1J91I;5d(1`II^>em*-aj^nKCQXd1CcvHAh3aRjWG;1Zg0;3PHmX7@@9)nHV z65ux`vrg5|ojTWsN3bwPv{k^s#9Iv|g4Skmr-2BuIANV3#Jgc1@{*<)5)$MU;Iigg2Tox6q%5v+O!wgCt}P8)0bh-z@6<7vL;7Q;K zpauLLd;Z@8KLEZ0yaap>Fo6#Pf5X0i2zWpb=mKrvUf?g_SA$G0Ir^ zb}j2!l0QU7&Outv3+g1Xp(?dm7l}x0wl{S>D;yV<>ji7od*$@tV>T9P!M z)RmqYP4Q~V$aKRxB z+GA=VqZfVL(pp(>WwpwZa{;iDq!u+7_itP|njNYtnz_~dD$LG&?~u;VbiylQ4J9(0 z=;2zZ=6S(aq*_8pnsu1WLpGBqpt1B70a9z&zN-a$JP$GF3L~(Ek&Lcw?Q%T-1>(DL^uQ-Nnc4B@$zL`8~^+Px**)jvGkL_LI zUz-b3r>CY5O-)Z7(BlE1J$_+E(9;9?)6LOkvqh_h89SaN3e~{CUdZg!BN|IMM2cPN zSv-!3^DVs=DM~=qO`G?zyLJ>yXIuK2-(k2^ghZ+MQvSLS4d=;VCN{`!Vt+rP)K$C= zh-0HlZ_*8VnHk5N5v<9H2Cl?wrZpd_f9$O%Hgad9-a{sdh*1^|3)w3c?~P-XFdK`) z$-j&jrBV69#867Bs^i|%jb|gjOTsp@ahf`bu80vss!vU(ZMmuMOGp{*n^c)j%IGrn NsumK|`$_8izW{Eq^uGWA diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index be57764..2dae5cf 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -1,6 +1,6 @@ # 
frozen_string_literal: true -require "zlib" +require "faraday" require_relative "../verifier" require_relative "../derivatives" @@ -8,5 +8,33 @@ module PostZephirProcessing class CatalogIndexVerifier < Verifier + def verify_index_count(path:) + filename = File.basename(path) + if (m = filename.match(/^zephir_upd_(\d+).json.gz/)) + # in normal operation, we _should_ have indexed this the day after the + # date listed in the file. + # + # could potentially use the journal to determine when we actually + # indexed it? + + date_of_indexing = Date.parse(m[1]) + 1 + catalog_linecount = gzip_linecount(path: path) + solr_count = solr_count(date_of_indexing) + + if solr_count < catalog_linecount + error(message: "#{filename} had #{catalog_linecount} records, but only #{solr_count} had time_of_indexing on #{date_of_indexing} in solr") + end + + end + end + + def solr_count(date_of_indexing) + # get: + datebegin = date_of_indexing.to_datetime.new_offset(0).strftime("%FT%TZ") + dateend = (date_of_indexing + 1).to_datetime.new_offset(0).strftime("%FT%TZ") + url = "#{ENV["SOLR_URL"]}/select?fq=time_of_index:#{datebegin}%20TO%20#{dateend}]&q=*:*&rows=0&wt=json" + + JSON.parse(Faraday.get(url).body)["response"]["numFound"] + end end end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index 83cea3f..72b6966 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -1,22 +1,32 @@ # frozen_string_literal: true require "verifier/catalog_index_verifier" +require "webmock" module PostZephirProcessing RSpec.describe(CatalogIndexVerifier) do + let(:solr_url) { "http://solr-sdr-catalog:9033/solr/catalog" } + around(:each) do |example| - with_test_environment { example.run } + with_test_environment do + ClimateControl.modify(SOLR_URL: solr_url) do + example.run + end + end end def stub_catalog_timerange(date, result_count) # must be like YYYY-mm-ddTHH:MM:SSZ - iso8601 with a 'Z' for 
time zone - # time zone offsets like DateTime.iso8601 produce by default are't # allowed for solr + + # FIXME: don't love that we duplicate this logic & the URL between here & + # the verifier -- anything to do? datebegin = date.to_datetime.new_offset(0).strftime("%FT%TZ") dateend = (date + 1).to_datetime.new_offset(0).strftime("%FT%TZ") WebMock.enable! - url = "http://solr-sdr-catalog:9033/solr/catalog/select?fq=time_of_index:#{datebegin}%20TO%20#{dateend}]&indent=on&q=*:*&rows=0&wt=json" + url = "#{solr_url}/select?fq=time_of_index:#{datebegin}%20TO%20#{dateend}]&q=*:*&rows=0&wt=json" result = { "responseHeader" => { @@ -33,16 +43,36 @@ def stub_catalog_timerange(date, result_count) }.to_json WebMock::API.stub_request(:get, url) - .with(body: result, headers: {"Content-Type" => "application/json"}) + .to_return(body: result, headers: {"Content-Type" => "application/json"}) end describe "#verify_index_count" do let(:verifier) { described_class.new } context "with a catalog update file with 3 records" do - it "accepts a catalog with 3 recent updates" - it "accepts a catalog with 5 recent updates" - it "rejects a catalog with no recent updates" - it "rejects a catalog with 2 recent updates" + let(:catalog_update) { fixture("catalog_archive/zephir_upd_20241202.json.gz") } + # indexed the day after the date in the filename + let(:catalog_index_date) { Date.parse("2024-12-03") } + + it "accepts a catalog with 3 recent updates" do + stub_catalog_timerange(catalog_index_date, 3) + verifier.verify_index_count(path: catalog_update) + expect(verifier.errors).to be_empty + end + it "accepts a catalog with 5 recent updates" do + stub_catalog_timerange(catalog_index_date, 5) + verifier.verify_index_count(path: catalog_update) + expect(verifier.errors).to be_empty + end + it "rejects a catalog with no recent updates" do + stub_catalog_timerange(catalog_index_date, 0) + verifier.verify_index_count(path: catalog_update) + expect(verifier.errors).not_to be_empty + end + it 
"rejects a catalog with 2 recent updates" do + stub_catalog_timerange(catalog_index_date, 2) + verifier.verify_index_count(path: catalog_update) + expect(verifier.errors).not_to be_empty + end end context "with a catalog full file with 5 records" do From 7fa70ca75bbdfea8646c4e1420781a790444c26d Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 12 Dec 2024 10:06:26 -0500 Subject: [PATCH 045/114] De-constantize gzip_linecount tests relying on too-wide constant scope --- .../unit/hathifiles_database_verifier_spec.rb | 20 +++++++++---------- spec/unit/verifier_spec.rb | 5 ++++- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/spec/unit/hathifiles_database_verifier_spec.rb b/spec/unit/hathifiles_database_verifier_spec.rb index 45eeb68..11b81ee 100644 --- a/spec/unit/hathifiles_database_verifier_spec.rb +++ b/spec/unit/hathifiles_database_verifier_spec.rb @@ -3,11 +3,6 @@ require "verifier/hathifiles_database_verifier" module PostZephirProcessing - TEST_UPDATE_FILE = "hathi_upd_20241202.txt.gz" - TEST_UPDATE_FIXTURE = fixture(File.join("hathifile_archive", TEST_UPDATE_FILE)) - TEST_FULL_FILE = "hathi_full_20241201.txt.gz" - TEST_UPDATE_LINECOUNT = 8 - RSpec.describe(HathifilesDatabaseVerifier) do around(:each) do |example| with_test_environment do @@ -18,6 +13,9 @@ module PostZephirProcessing end let(:verifier) { described_class.new } + let(:test_update_file) { "hathi_upd_20241202.txt.gz" } + let(:test_update_fixture) { fixture(File.join("hathifile_archive", test_update_file)) } + let(:test_full_file) { "hathi_full_20241201.txt.gz" } def delete_hf_logs Services[:database][:hf_log].delete @@ -50,7 +48,7 @@ def with_fake_hf_entries(htids:) # of them. 
def with_fake_full_hathifile ClimateControl.modify(HATHIFILE_ARCHIVE: @tmpdir) do - FileUtils.cp(TEST_UPDATE_FIXTURE, File.join(@tmpdir, TEST_FULL_FILE)) + FileUtils.cp(test_update_fixture, File.join(@tmpdir, test_full_file)) yield end end @@ -59,14 +57,14 @@ def with_fake_full_hathifile context "with corresponding hf_log" do it "returns `true`" do with_fake_hf_log_entry(hathifile: "hathi_upd_20241202.txt.gz") do - expect(described_class.has_log?(hathifile: TEST_UPDATE_FIXTURE)).to eq(true) + expect(described_class.has_log?(hathifile: test_update_fixture)).to eq(true) end end end context "without corresponding hf_log" do it "returns `false`" do - expect(described_class.has_log?(hathifile: TEST_UPDATE_FIXTURE)).to eq(false) + expect(described_class.has_log?(hathifile: test_update_fixture)).to eq(false) end end end @@ -112,7 +110,7 @@ def with_fake_full_hathifile context "with full hathifile" do context "with corresponding hf_log" do it "reports no `missing hf_log` errors" do - with_fake_hf_log_entry(hathifile: TEST_FULL_FILE) do + with_fake_hf_log_entry(hathifile: test_full_file) do with_fake_full_hathifile do verifier.run_for_date(date: Date.new(2024, 12, 1)) expect(verifier.errors).not_to include(/missing hf_log/) @@ -132,7 +130,7 @@ def with_fake_full_hathifile context "with the expected `hf` rows" do it "reports no `hf count mismatch` errors" do - with_fake_hf_log_entry(hathifile: TEST_FULL_FILE) do + with_fake_hf_log_entry(hathifile: test_full_file) do with_fake_full_hathifile do fake_htids = ["test.001", "test.002", "test.003", "test.004", "test.005", "test.006", "test.007", "test.008"] with_fake_hf_entries(htids: fake_htids) do @@ -146,7 +144,7 @@ def with_fake_full_hathifile context "without the expected `hf` rows" do it "reports `hf count mismatch` error" do - with_fake_hf_log_entry(hathifile: TEST_FULL_FILE) do + with_fake_hf_log_entry(hathifile: test_full_file) do with_fake_full_hathifile do verifier.run_for_date(date: Date.new(2024, 12, 1)) 
expect(verifier.errors).to include(/hf count mismatch/) diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index de46972..6315670 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -11,6 +11,9 @@ module PostZephirProcessing end let(:verifier) { described_class.new } + let(:test_update_file) { "zephir_upd_20241202.json.gz" } + let(:test_update_fixture) { fixture(File.join("catalog_archive", test_update_file)) } + let(:test_update_linecount) { 3 } describe ".new" do it "creates a Verifier" do @@ -55,7 +58,7 @@ module PostZephirProcessing describe "#gzip_linecount" do it "returns the correct number of lines" do - expect(verifier.gzip_linecount(path: TEST_UPDATE_FIXTURE)).to eq(TEST_UPDATE_LINECOUNT) + expect(verifier.gzip_linecount(path: test_update_fixture)).to eq(test_update_linecount) end end end From 7c4d3a662e78267677bf8a9f9fd301fc01e9d19d Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 12 Dec 2024 14:00:54 -0500 Subject: [PATCH 046/114] [DEV-1422] made tests for PostZephirVerifier.verify_catalog_archive more fixture-based --- .../ht_bib_export_full_2024-11-30.json.gz | Bin 0 -> 3752 bytes spec/unit/post_zephir_verifier_spec.rb | 60 +++++++----------- 2 files changed, 23 insertions(+), 37 deletions(-) create mode 100644 spec/fixtures/zephir_data/ht_bib_export_full_2024-11-30.json.gz diff --git a/spec/fixtures/zephir_data/ht_bib_export_full_2024-11-30.json.gz b/spec/fixtures/zephir_data/ht_bib_export_full_2024-11-30.json.gz new file mode 100644 index 0000000000000000000000000000000000000000..9451c3b696e7aa7cfacb309e26042c5ba879e9da GIT binary patch literal 3752 zcmV;Z4p;FXiwFqs5L;&e188(#VrgPuWq5FJa&%v2b!=>3GB7eUEio}IGcYb{b8l_{ zRU7O|uM8Xm(iq!E%vEA!`zXd5# zlq?9ctfalEGD-M`-Nj#n^Uigpg%p+-kS(1dqb3 z)&2Nu3u7X`UHnMvx3+v{!m~?p%HEw_ES}i#gkowDLj3Am120}V@M?+{zi@f6cru$5qrlOaxF2#q}zF!YC{BcZAr!y6`EKEC7Q5Z)h7^l&lF(jr*O~*Qj z(&MtDm**cqo-IGp870Z+P#s?NlZ*aq>bkyz5AErzKD)a8aB(_#O>EyI9zLAX&Bgh9 
z+12ml-|L(6UZ?jGrJY86Hg0%=os=wv?VerqON$-CHIGN-b${ZK*URtabv9#73GfyP%Rl@JYL* z&!xjorZW-o)TkVLhM#=ny75O+xMJ{-D60@oN*8?YDgHLZtMV@q$ zS_*p}M9G}T*#zXl+(+|=EQd>z*le84S;T{Uo}iQ3qJB!U+}P*>s(OYPdIRgmO-YGN zlIpdgkZYkWs!~2NK$g(hJ#^Hup0<|DC^&7}hQV^}&Nmdq__kZy#3f5}5u zU%8JI$Qm=e6+p;#xLLLlz`)&f{E1+1*aLh^IUhEazcO zKWa10jVk8IMz090 z&aX8dN6BbVOs6b;KwzJ5;!z~Bv2+(a4&eX6jK$>x7Ki9No=#b;nH@?tXGs9k#(5?g zNxNs@mfjWdh^2zXZIq<@GEhu|^Eik=PZ=7DVI-1K%4TDjV-ZUE9hd!qSF#&Gj^Fc4 z%SPe5&XP&lj|T+a;w(Wa+{2_N9>89-dPxoW(C-l6^w-~*2KpgMdF82Q2}lzMf}uws zl*|$Fd~-`9OVJ1|TX}d(E89TVDHwRn(FBG37ZFz$G?iex;ep4A^ud*j$X7m?p)e`k zZXiFqV~^Q{fy3?bXe0u-oR{2#G4zy5_LjZQ-chojKQ)3l3(C+W zWxud^DuCM~8-p9h@)CZ z-b}?vI^`s;EN)ug8?}SDV_9A=x(mE{m*p7y*mHcW;OuHRR}(8M2}`*57K}!WViY7r zoP#&2-rU2RcOuJ^^r5_;%a%!z2I^^nmy8uStYR#fzJr8fQ$~xL4l4Lm9^+vzHa0Qo z%h=S~+Bu>xQrIzAs9;t-q&i($)P(p>rz-`Ug(dC7h7gB(yV~G6rh3g<5ivZe1H2ufCUp96VHNZ z+j43td&M|Kr-%@nnx_+4stsr}P#1KY`?gS`%MZ;9JAaTn7yQuL`hRip-IH=WD# z5B0!IMNwPS>BZ$~c@x3SEafwna$^>CFb3{e_I}vi@ZHtL(}r^D?#hKN z13mF!6@s1J)V*5v@UFF)f}P1ehcjb~`DTrVk4u}C(LP*1Hy`J_TVV6B!JsK+*f677 z5}8hms#vE`%UsL(1%#nkFO1`uwqsjvLzv%BGo>&VZfZsaNWPM09N%sTWB$IHv2Cj% zjCG(eANsW{?{6;Nt_g!}=_R%cQv}5L>(I4*hc*WY9cnH|kfUj}=uk^Jf*f8`j$KCb z>9@c9+NeVj@I>*l4n;kmFl`-*>grGcsOK&H#|x}yk{zGoAb=FKei znS^vXpYj-Z$Ds(tgvAILqaXON2-yRZsln?sdEgWI1y6?}hqt|AQlJ5ZnJGi(ahSwB z6Rd1^#(2uc%{;e9G-(`Vo*ULxG&36FPT8!P?k3uFw=x_0bL;wgaMH_9>=O?gKerAf za}KItIh*K`nO`gMRSJV%)k6}a8Bb;re_dxe21y`Tgd%ar%QQz&WO*{xrJa<~_rfCE zOB2;(0F`u<)1Y0=lp^0qOmCsB+O?oC-yRFt!q|6Mvavg?(nP}$@e-2|dp#QNA&41> z=c;mW`$oVT_qZc^B%mxR1@ujOJwn}B?hlQCB^jTZ7ADp@eY1m|4u!DQ>q6veQB4HA zSJ?W05vtPXhN|m<>6#@q1Jm_jbU);BNMK>&B|hF6SghpS8?;hl)}Ym-HqwJuQwLgQ zNJftaS^>2?1ZQ$@muxcn3?kN>r%1VZYQ$4i6(5y*jZB~-x3hSDjqb}wWcl!AyNzr) z>_G5nQ250JcR}F=@gwXJa)1_bT}|}57wB9Fb>M!S{vfaC&fs!DbEE<_ty#9IgKB8rSA zVz&UXYdBcGg2UTHssjcJH1}Af=!TEVu^MryrC^-%1t(ce?S%%i&E5FrMLQW@K|f6h znJw(TUTtgmqr?GAJ4RUv_rZ{B&GeIalme4IO3e!5mqjcBF$3zO1g1|sE%yd3mb7XQ z$*OSMEOS4Ui*r6k0kkL@weP)UQpOMUTxl<-&!LgXVkWj1GgCuGgZ~|3*#Lw!i<1`U 
zYceXDQWOO&L_NA$idu(AFDP{Zbz(Xe-BRkJ<}ES17Ia63~qb}x2!3`vytvNhT~vr+oo#Yi0<$Mh=K`mA>=yFNrxlG9!DX#2oT(ZyZ0p5 zelVpv*IxGPz)bu09QmQ)%JP`N=8*w??p%cn{&DJ?t4x=13X#H&o%2g-o&w|BCOI)o zPqXwXH{PG7087e!G$5 zo9jQ4EJr*vKIyjCE81V(-z90-uGgRU{4@G7Nf(D1Jw~vrn8P$ zHx~Fq@i4(Yg|&iumb)eKO`H-k-5Id#mlr3n2)e#&+wOs=_;$?E zl$Dum09C_?1ru~SAG!nFO%ywcF~ND z00JHUwCcRb#N&o;fvpEj{ literal 0 HcmV?d00001 diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 4ea691d..7d47af9 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -6,7 +6,12 @@ module PostZephirProcessing RSpec.describe(PostZephirVerifier) do around(:each) do |example| - with_test_environment { example.run } + ClimateControl.modify( + CATALOG_ARCHIVE: fixture("catalog_archive"), + ZEPHIR_DATA: fixture("zephir_data") + ) do + with_test_environment { example.run } + end end describe "#verify_deletes_contents" do @@ -84,47 +89,28 @@ def expect_deletefile_ok(contents) end describe "#verify_catalog_archive" do + let(:verifier) { described_class.new } + let(:test_date) { Date.parse("2024-11-30") } it "requires input file to have same line count as output file" do - verifier = described_class.new - test_date = Date.parse("2023-01-31") - - # Make a fake input file - FileUtils.mkdir_p(ENV["ZEPHIR_DATA"]) - input_file_name = "ht_bib_export_full_#{test_date.strftime("%Y-%m-%d")}.json.gz" - input_file_path = File.join(ENV["ZEPHIR_DATA"], input_file_name) - Zinzout.zout(input_file_path) do |input_gz| - 1.upto(3) do |i| - input_gz.puts "{ \"i\": #{i} }" - end - end - - # Fake output files - FileUtils.mkdir_p(ENV["CATALOG_ARCHIVE"]) - output_file_names = [ - "zephir_full_#{test_date.strftime("%Y%m%d")}_vufind.json.gz", - "zephir_upd_#{test_date.strftime("%Y%m%d")}.json.gz" - ] - output_file_names.each do |output_file_name| - output_file_path = File.join(ENV["CATALOG_ARCHIVE"], output_file_name) - Zinzout.zout(output_file_path) do |output_gz| - 1.upto(3) do |i| - 
output_gz.puts "{ \"i\": #{i} }" - end - end - end - - # Expect no warnings when line counts match. + # We have fixtures with matching line counts for test_date, + # so expect no warnings verifier.verify_catalog_archive(date: test_date) expect(verifier.errors).to be_empty + end - # Change line count in input file and expect a warning - Zinzout.zout(input_file_path) do |input_gz| - input_gz.puts "{ \"i\": \"one line too many\" }" + it "warns if there is an input/output line count mismatch" do + # Make a temporary ht_bib_export with just 1 line to trigger error + ClimateControl.modify(ZEPHIR_DATA: "/tmp/test/zephir_data") do + FileUtils.mkdir_p(ENV["ZEPHIR_DATA"]) + Zinzout.zout(File.join(ENV["ZEPHIR_DATA"], "ht_bib_export_full_2024-11-30.json.gz")) do |gz| + gz.puts "{ \"this file\": \"too short\" }" + end + # The other unmodified fixtures in CATALOG_ARCHIVE should + # no longer have matching line counts, so expect a warning + verifier.verify_catalog_archive(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/output line count .+ != input line count/) end - - verifier.verify_catalog_archive(date: test_date) - expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/output line count .+ != input line count/) end end From ea0d5da604297b07b1e7a79d0191360b9e8956be Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 12 Dec 2024 14:38:17 -0500 Subject: [PATCH 047/114] [DEV-1422] verify_catalog_archive: use dated_derivative, make more readable --- lib/verifier/post_zephir_verifier.rb | 37 ++++++++++++++++++++++++---- 1 file changed, 32 insertions(+), 5 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 89d8e03..c90bd72 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -30,18 +30,45 @@ def run_for_date(date:) # Contents: TODO # Verify: # readable - # TODO: line count must be the same as input JSON def 
verify_catalog_archive(date: current_date) - verify_file(path: self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) + zephir_update_derivative_params = { + location: :CATALOG_ARCHIVE, + name: "zephir_upd_YYYYMMDD.json.gz", + date: date + } + verify_file(path: self.class.dated_derivative(**zephir_update_derivative_params)) + if date.last_of_month? - output_path = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date) + zephir_full_derivative_params = { + location: :CATALOG_ARCHIVE, + name: "zephir_full_YYYYMMDD_vufind.json.gz", + date: date + } + + ht_bib_export_derivative_params = { + location: :ZEPHIR_DATA, + name: "ht_bib_export_full_YYYY-MM-DD.json.gz", + date: date + } + + output_path = self.class.dated_derivative(**zephir_full_derivative_params) verify_file(path: output_path) output_linecount = gzip_linecount(path: output_path) - input_path = File.join([ENV["ZEPHIR_DATA"], "ht_bib_export_full_#{date.strftime("%Y-%m-%d")}.json.gz"]) + + input_path = self.class.dated_derivative(**ht_bib_export_derivative_params) + verify_file(path: input_path) input_linecount = gzip_linecount(path: input_path) if output_linecount != input_linecount - error message: "output line count (#{output_path} = #{output_linecount}) != input line count (#{input_path} = #{input_linecount})" + error( + message: sprintf( + "output line count (%s = %s) != input line count (%s = %s)", + output_path, + output_linecount, + input_path, + input_linecount + ) + ) end end end From 9fbb65d0de58f2b9ace34ad3c8c0f1f975043790 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 12 Dec 2024 15:09:44 -0500 Subject: [PATCH 048/114] [DEV-1422] added Verifier.verify_parseable_ndj, use in post_zephir_verifier.rb --- lib/verifier.rb | 11 +++++++++++ lib/verifier/post_zephir_verifier.rb | 6 +++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/lib/verifier.rb b/lib/verifier.rb index 
ebc499c..d1fa773 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -77,6 +77,17 @@ def gzip_linecount(path:) Zlib::GzipReader.open(path, encoding: "utf-8") { |gz| gz.count } end + # Take a .ndj(.gz) file and check that each line is indeed parseable json + def verify_parseable_ndj(path:) + Zinzout.zin(path) do |infile| + infile.each do |line| + JSON.parse(line) + end + rescue + error(message: "File #{path} contains unparseable JSON: #{line}") + end + end + # I'm not sure if we're going to try to distinguish errors and warnings. # For now let's call everything an error. def error(message:) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index c90bd72..9eba7fb 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -36,7 +36,9 @@ def verify_catalog_archive(date: current_date) name: "zephir_upd_YYYYMMDD.json.gz", date: date } - verify_file(path: self.class.dated_derivative(**zephir_update_derivative_params)) + zephir_update_path = self.class.dated_derivative(**zephir_update_derivative_params) + verify_file(path: zephir_update_path) + verify_parseable_ndj(path: zephir_update_path) if date.last_of_month? 
zephir_full_derivative_params = { @@ -53,10 +55,12 @@ def verify_catalog_archive(date: current_date) output_path = self.class.dated_derivative(**zephir_full_derivative_params) verify_file(path: output_path) + verify_parseable_ndj(path: output_path) output_linecount = gzip_linecount(path: output_path) input_path = self.class.dated_derivative(**ht_bib_export_derivative_params) verify_file(path: input_path) + verify_parseable_ndj(path: input_path) input_linecount = gzip_linecount(path: input_path) if output_linecount != input_linecount From 1e61dec62bf788074f61db5550178170b8e6ad1b Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 12 Dec 2024 15:29:34 -0500 Subject: [PATCH 049/114] test for verify_parseable_ndj --- lib/verifier.rb | 5 +++-- spec/unit/verifier_spec.rb | 11 +++++++++++ 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/lib/verifier.rb b/lib/verifier.rb index d1fa773..59ee359 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -3,6 +3,7 @@ require_relative "derivatives" require_relative "journal" require_relative "services" +require "zinzout" # Common superclass for all things Verifier. 
# Right now the only thing I can think of to put here is shared @@ -83,8 +84,8 @@ def verify_parseable_ndj(path:) infile.each do |line| JSON.parse(line) end - rescue - error(message: "File #{path} contains unparseable JSON: #{line}") + rescue JSON::ParserError + error(message: "File #{path} contains unparseable JSON") end end diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index 6315670..2d266ee 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -61,5 +61,16 @@ module PostZephirProcessing expect(verifier.gzip_linecount(path: test_update_fixture)).to eq(test_update_linecount) end end + + describe "#verify_parseable_ndj" do + it "checks if a .ndj file contains only parseable lines" do + content = "{}\n[]" + expect_ok(:verify_parseable_ndj, content) + end + it "warns if it sees an unparseable line" do + content = "oops\n{}\n[]\n" + expect_not_ok(:verify_parseable_ndj, content, /unparseable JSON/) + end + end end end From 88990fa1a859424acb31e87efea9ace27b0b0fd7 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 12 Dec 2024 15:33:53 -0500 Subject: [PATCH 050/114] unbreak test --- spec/unit/verifier_spec.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index 2d266ee..0b3545b 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -69,7 +69,7 @@ module PostZephirProcessing end it "warns if it sees an unparseable line" do content = "oops\n{}\n[]\n" - expect_not_ok(:verify_parseable_ndj, content, /unparseable JSON/) + expect_not_ok(:verify_parseable_ndj, content, errmsg: /unparseable JSON/) end end end From 2d9701b8944e19f45bdbb14434a776a5e4268a14 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 12 Dec 2024 15:44:34 -0500 Subject: [PATCH 051/114] More PZP verifier tests (not done yet) --- lib/verifier/post_zephir_verifier.rb | 24 ++++---- spec/unit/post_zephir_verifier_spec.rb | 78 ++++++++++++++++++++++++++ 2 files 
changed, 91 insertions(+), 11 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 9eba7fb..3a007c5 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -25,11 +25,12 @@ def run_for_date(date:) end # Frequency: ALL - # Files: CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz + # Files: CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz # and potentially CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz - # Contents: TODO + # Contents: ndj file with one catalog record per line # Verify: # readable + # line count must be the same as input JSON def verify_catalog_archive(date: current_date) zephir_update_derivative_params = { location: :CATALOG_ARCHIVE, @@ -52,7 +53,6 @@ def verify_catalog_archive(date: current_date) name: "ht_bib_export_full_YYYY-MM-DD.json.gz", date: date } - output_path = self.class.dated_derivative(**zephir_full_derivative_params) verify_file(path: output_path) verify_parseable_ndj(path: output_path) @@ -80,15 +80,18 @@ def verify_catalog_archive(date: current_date) # Frequency: ALL # Files: CATALOG_PREP/zephir_upd_YYYYMMDD.json.gz and CATALOG_PREP/zephir_upd_YYYYMMDD_delete.txt.gz # and potentially CATALOG_PREP/zephir_full_YYYYMMDD_vufind.json.gz - # Contents: TODO + # Contents: + # json.gz files: ndj with one catalog record per line + # delete.txt.gz: see `#verify_deletes_contents` # Verify: # readable # TODO: deletes file is combination of two component files in TMPDIR? def verify_catalog_prep(date: current_date) delete_file = self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date) verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) - verify_file(path: delete_file) - verify_deletes_contents(path: delete_file) + if verify_file(path: delete_file) + verify_deletes_contents(path: delete_file) + end if date.last_of_month? 
verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date)) end @@ -106,17 +109,16 @@ def verify_deletes_contents(path:) # Frequency: DAILY # Files: TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz - # Contents: TODO + # Contents: historically undollarized uc1 HTIDs (e.g., uc1.b312920) one per line # Verify: # readable # empty def verify_dollar_dup(date: current_date) dollar_dup = self.class.dated_derivative(location: :TMPDIR, name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz", date: date) if verify_file(path: dollar_dup) - Zinzout.zin(dollar_dup) do |infile| - if infile.count.positive? - error "#{dollar_dup} has #{infile.count} lines, should be 0" - end + gz_count = gzip_linecount(path: dollar_dup) + if gz_count.positive? + error message: "spurious dollar_dup lines: #{dollar_dup} should be empty (found #{gz_count} lines)" end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 7d47af9..1f4e314 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -114,6 +114,84 @@ def expect_deletefile_ok(contents) end end + describe "#verify_catalog_prep" do + test_date = Date.parse("2024-11-30") + context "with all the expected files" do + it "reports no errors" do + # Create and test upd, full, and deletes in @tmpdir/catalog_prep + ClimateControl.modify(CATALOG_PREP: @tmpdir) do + FileUtils.cp(fixture(File.join("catalog_archive", "zephir_full_20241130_vufind.json.gz")), @tmpdir) + FileUtils.cp(fixture(File.join("catalog_archive", "zephir_upd_20241130.json.gz")), @tmpdir) + FileUtils.cp(fixture(File.join("catalog_prep", "zephir_upd_20241130_delete.txt.gz")), @tmpdir) + verifier = described_class.new + verifier.verify_catalog_prep(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end + + context "without any of the expected files" do + it "reports no errors" do
ClimateControl.modify(CATALOG_PREP: @tmpdir) do + verifier = described_class.new + verifier.verify_catalog_prep(date: test_date) + expect(verifier.errors.count).to eq 3 + end + end + end + end + + describe "#verify_dollar_dup" do + test_date = Date.parse("2024-12-01") + context "with empty file" do + it "reports no errors" do + ClimateControl.modify(TMPDIR: @tmpdir) do + dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") + Zinzout.zout(dollar_dup_path) { |output_gz| } + verifier = described_class.new + verifier.verify_dollar_dup(date: test_date) + expect(verifier.errors).to eq [] + end + end + end + + context "with nonempty file" do + it "reports one `spurious dollar_dup lines` error" do + ClimateControl.modify(TMPDIR: @tmpdir) do + dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") + Zinzout.zout(dollar_dup_path) do |output_gz| + output_gz.puts <<~GZ + uc1.b275234 + uc1.b85271 + uc1.b312920 + uc1.b257214 + uc1.b316327 + uc1.b23918 + uc1.b95355 + uc1.b183819 + uc1.b197217 + GZ + end + verifier = described_class.new + verifier.verify_dollar_dup(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/spurious dollar_dup lines/) + end + end + end + + context "with missing file" do + it "reports one `not found` error" do + ClimateControl.modify(TMPDIR: @tmpdir) do + verifier = described_class.new + verifier.verify_dollar_dup(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/^not found/) + end + end + end + end + describe "#verify_rights_file_format" do it "accepts an empty file" do expect_ok(:verify_rights_file_format, "") From c061899870c28e7d347a68abd4fb5dff1564b724 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 12 Dec 2024 15:49:19 -0500 Subject: [PATCH 052/114] Fix expectation string in PZP verifier spec --- spec/unit/post_zephir_verifier_spec.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) 
diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 1f4e314..1e1626f 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -131,7 +131,7 @@ def expect_deletefile_ok(contents) end context "without any of the expected files" do - it "reports no errors" do + it "reports an error for each of the three missing files" do ClimateControl.modify(CATALOG_PREP: @tmpdir) do verifier = described_class.new verifier.verify_catalog_prep(date: test_date) From a04d90f1cdfacf6017010adccefadbb0f18806d9 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 12 Dec 2024 15:54:37 -0500 Subject: [PATCH 053/114] Add missing deletes fixture --- .../catalog_prep/zephir_upd_20241130_delete.txt.gz | Bin 0 -> 71 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 spec/fixtures/catalog_prep/zephir_upd_20241130_delete.txt.gz diff --git a/spec/fixtures/catalog_prep/zephir_upd_20241130_delete.txt.gz b/spec/fixtures/catalog_prep/zephir_upd_20241130_delete.txt.gz new file mode 100644 index 0000000000000000000000000000000000000000..7b29616d0987e48915142931d0d976e85f22b636 GIT binary patch literal 71 zcmb2|=HU2d9-Yp>T$Ngoky#X9T96WNWME`sXlQH@pOTuBT9T?)Qc=QSY+__)XllHD b!;?*049ty94b6<6FfnwT&9W3>U|;|MKRy Date: Thu, 12 Dec 2024 16:45:04 -0500 Subject: [PATCH 054/114] DEV-1418: test catalog indexing for full file --- lib/verifier/catalog_index_verifier.rb | 24 ++++++--- spec/unit/catalog_indexing_verifier_spec.rb | 57 ++++++++++++++++++--- 2 files changed, 69 insertions(+), 12 deletions(-) diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 2dae5cf..e9f8814 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -10,7 +10,7 @@ module PostZephirProcessing class CatalogIndexVerifier < Verifier def verify_index_count(path:) filename = File.basename(path) - if (m = 
filename.match(/^zephir_upd_(\d+).json.gz/)) + if (m = filename.match(/^zephir_upd_(\d+)\.json\.gz/)) # in normal operation, we _should_ have indexed this the day after the # date listed in the file. # @@ -20,11 +20,15 @@ def verify_index_count(path:) date_of_indexing = Date.parse(m[1]) + 1 catalog_linecount = gzip_linecount(path: path) solr_count = solr_count(date_of_indexing) + elsif /^zephir_full_\d+_vufind\.json\.gz/.match?(filename) + catalog_linecount = gzip_linecount(path: path) + solr_count = solr_nondeleted_records + else + raise ArgumentError, "#{path} doesn't seem to be a catalog index file" + end - if solr_count < catalog_linecount - error(message: "#{filename} had #{catalog_linecount} records, but only #{solr_count} had time_of_indexing on #{date_of_indexing} in solr") - end - + if solr_count < catalog_linecount + error(message: "#{filename} had #{catalog_linecount} records, but only #{solr_count} had time_of_indexing on #{date_of_indexing} in solr") end end @@ -32,7 +36,15 @@ def solr_count(date_of_indexing) # get: datebegin = date_of_indexing.to_datetime.new_offset(0).strftime("%FT%TZ") dateend = (date_of_indexing + 1).to_datetime.new_offset(0).strftime("%FT%TZ") - url = "#{ENV["SOLR_URL"]}/select?fq=time_of_index:#{datebegin}%20TO%20#{dateend}]&q=*:*&rows=0&wt=json" + solr_result_count("time_of_index:#{datebegin}%20TO%20#{dateend}]") + end + + def solr_nondeleted_records + solr_result_count("deleted:false") + end + + def solr_result_count(filter_query) + url = "#{ENV["SOLR_URL"]}/select?fq=#{filter_query}&q=*:*&rows=0&wt=json" JSON.parse(Faraday.get(url).body)["response"]["numFound"] end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index 72b6966..e088d6c 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -15,6 +15,29 @@ module PostZephirProcessing end end + def stub_catalog_record_count(result_count) + WebMock.enable! 
+ + url = "#{solr_url}/select?fq=deleted:false&q=*:*&rows=0&wt=json" + + result = { + "responseHeader" => { + "status" => 0, + "QTime" => 0, + "params" => { + "q" => "*=>*", + "fq" => "deleted:false", + "rows" => "0", + "wt" => "json" + } + }, + "response" => {"numFound" => result_count, "start" => 0, "docs" => []} + }.to_json + + WebMock::API.stub_request(:get, url) + .to_return(body: result, headers: {"Content-Type" => "application/json"}) + end + def stub_catalog_timerange(date, result_count) # must be like YYYY-mm-ddTHH:MM:SSZ - iso8601 with a 'Z' for time zone - # time zone offsets like DateTime.iso8601 produce by default are't @@ -66,20 +89,42 @@ def stub_catalog_timerange(date, result_count) it "rejects a catalog with no recent updates" do stub_catalog_timerange(catalog_index_date, 0) verifier.verify_index_count(path: catalog_update) - expect(verifier.errors).not_to be_empty + expect(verifier.errors).to include(/only 0 .* in solr/) end it "rejects a catalog with 2 recent updates" do stub_catalog_timerange(catalog_index_date, 2) verifier.verify_index_count(path: catalog_update) - expect(verifier.errors).not_to be_empty + expect(verifier.errors).to include(/only 2 .* in solr/) end end context "with a catalog full file with 5 records" do - it "accepts a catalog with 5 records" - it "accepts a catalog with 6 records" - it "rejects a catalog with no records" - it "rejects a catalog with 2 records" + let(:catalog_full) { fixture("catalog_archive/zephir_full_20241130_vufind.json.gz") } + + it "accepts a catalog with 5 records" do + stub_catalog_record_count(5) + verifier.verify_index_count(path: catalog_full) + expect(verifier.errors).to be_empty + end + it "accepts a catalog with 6 records" do + stub_catalog_record_count(6) + verifier.verify_index_count(path: catalog_full) + expect(verifier.errors).to be_empty + end + it "rejects a catalog with no records" do + stub_catalog_record_count(0) + verifier.verify_index_count(path: catalog_full) + 
expect(verifier.errors).to include(/only 0 .* in solr/) + end + it "rejects a catalog with 2 records" do + stub_catalog_record_count(2) + verifier.verify_index_count(path: catalog_full) + expect(verifier.errors).to include(/only 2 .* in solr/) + end + end + + it "raises an exception when given some other file" do + expect { verifier.verify_index_count(path: fixture("zephir_data/ht_bib_export_full_2024-11-30.json.gz")) }.to raise_exception(ArgumentError) end end end From 54ab234f39a6fc7527361fa012d8b174f73d95c2 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 12 Dec 2024 17:11:04 -0500 Subject: [PATCH 055/114] verify_ingest_bibrecords and verify_rights tests --- lib/verifier/post_zephir_verifier.rb | 34 +++---- spec/unit/post_zephir_verifier_spec.rb | 129 +++++++++++++++++++++++-- 2 files changed, 135 insertions(+), 28 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 3a007c5..496b8e9 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -18,7 +18,6 @@ def run_for_date(date:) verify_catalog_archive verify_catalog_prep verify_dollar_dup - verify_groove_export verify_ingest_bibrecords verify_rights verify_zephir_data @@ -124,17 +123,9 @@ def verify_dollar_dup(date: current_date) end # Frequency: MONTHLY - # Files: INGEST_BIBRECORDS/groove_full.tsv.gz - # Contents: TODO - # Verify: readable - def verify_groove_export(date: current_date) - if date.last_of_month? 
- verify_file(path: self.class.derivative(location: :INGEST_BIBRECORDS, name: "groove_full.tsv.gz")) - end - end - - # Frequency: MONTHLY - # Files: INGEST_BIBRECORDS/groove_full.tsv.gz, INGEST_BIBRECORDS/zephir_ingested_items.txt.gz + # Files: + # INGEST_BIBRECORDS/groove_full.tsv.gz + # INGEST_BIBRECORDS/zephir_ingested_items.txt.gz # Contents: TODO # Verify: readable def verify_ingest_bibrecords(date: current_date) @@ -145,21 +136,24 @@ def verify_ingest_bibrecords(date: current_date) end # Frequency: BOTH - # Files: RIGHTS_ARCHIVE/zephir_upd_YYYYMMDD.rights - # and potentially RIGHTS_ARCHIVE/zephir_full_YYYYMMDD.rights - # Contents: TODO + # Files: + # RIGHTS_ARCHIVE/zephir_upd_YYYYMMDD.rights (daily) + # RIGHTS_ARCHIVE/zephir_full_YYYYMMDD.rights (monthly) + # Contents: see verify_rights_file_format # Verify: # readable - # TODO: compare each line against a basic regex + # accepted by verify_rights_file_format def verify_rights(date: current_date) upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_upd_YYYYMMDD.rights", date: date) - verify_file(path: upd_path) - verify_rights_file_format(path: upd_path) + if verify_file(path: upd_path) + verify_rights_file_format(path: upd_path) + end if date.last_of_month? 
full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_full_YYYYMMDD.rights", date: date) - verify_file(path: full_path) - verify_rights_file_format(path: full_path) + if verify_file(path: full_path) + verify_rights_file_format(path: full_path) + end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 1e1626f..da47ece 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -14,6 +14,15 @@ module PostZephirProcessing end end + let(:well_formed_rights_file_content) do + [ + ["a.1", "ic", "bib", "bibrights", "aa"].join("\t"), + ["a.2", "pd", "bib", "bibrights", "bb"].join("\t"), + ["a.3", "pdus", "bib", "bibrights", "aa-bb"].join("\t"), + ["a.4", "und", "bib", "bibrights", "aa-bb"].join("\t") + ].join("\n") + end + describe "#verify_deletes_contents" do def expect_deletefile_error(contents) expect_not_ok(:verify_deletes_contents, @@ -192,20 +201,124 @@ def expect_deletefile_ok(contents) end end + describe "#verify_ingest_bibrecords" do + context "last day of month" do + test_date = Date.parse("2024-11-30") + context "with expected groove_full and zephir_ingested_items files" do + it "reports no errors" do + ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do + FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) + FileUtils.touch(File.join(@tmpdir, "zephir_ingested_items.txt.gz")) + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end + + context "missing zephir_ingested_items" do + it "reports one `not found` error" do + ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do + FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/^not found/) + end + end + end + + context "missing 
groove_full" do + it "reports one `not found` error" do + ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do + FileUtils.touch(File.join(@tmpdir, "zephir_ingested_items.txt.gz")) + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/^not found/) + end + end + end + end + + context "non-last day of month" do + test_date = Date.parse("2024-12-01") + it "reports no errors" do + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end + + describe "#verify_rights" do + context "last day of month" do + test_date = Date.parse("2024-11-30") + context "with full and update rights files" do + it "reports no errors" do + ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + verifier = described_class.new + upd_rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + upd_rights_path = File.join(@tmpdir, upd_rights_file) + File.write(upd_rights_path, well_formed_rights_file_content) + full_rights_file = "zephir_full_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + full_rights_path = File.join(@tmpdir, full_rights_file) + File.write(full_rights_path, well_formed_rights_file_content) + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end + + context "with no rights files" do + it "reports two `not found` errors" do + ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + verifier = described_class.new + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 2 + verifier.errors.each do |err| + expect(err).to include(/^not found/) + end + end + end + end + end + + context "non-last day of month" do + test_date = Date.parse("2024-12-01") + context "with update rights file" do + it "reports no errors" do + ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + verifier = 
described_class.new + rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + rights_path = File.join(@tmpdir, rights_file) + File.write(rights_path, well_formed_rights_file_content) + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end + + context "missing update rights file" do + it "reports one `not found` error" do + ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + verifier = described_class.new + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/^not found/) + end + end + end + end + end + describe "#verify_rights_file_format" do it "accepts an empty file" do expect_ok(:verify_rights_file_format, "") end it "accepts a well-formed file" do - contents = [ - ["a.1", "ic", "bib", "bibrights", "aa"].join("\t"), - ["a.2", "pd", "bib", "bibrights", "bb"].join("\t"), - ["a.3", "pdus", "bib", "bibrights", "aa-bb"].join("\t"), - ["a.4", "und", "bib", "bibrights", "aa-bb"].join("\t") - ].join("\n") - - expect_ok(:verify_rights_file_format, contents) + expect_ok(:verify_rights_file_format, well_formed_rights_file_content) end volids_not_ok = ["", "x", "x.", ".x", "X.X"] From fec6aa44146ec0fdd32027ec3caaef83f6ec1a1a Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Fri, 13 Dec 2024 09:41:07 -0500 Subject: [PATCH 056/114] Finish PostZephirVerifier test coverage --- lib/verifier/post_zephir_verifier.rb | 4 +- spec/unit/post_zephir_verifier_spec.rb | 57 ++++++++++++++++++++++++++ 2 files changed, 60 insertions(+), 1 deletion(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 496b8e9..7bfbe75 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -180,7 +180,9 @@ def verify_rights_file_format(path:) end # Frequency: MONTHLY - # Files: ZEPHIR_DATA/full/zephir_full_monthly_rpt.txt, ZEPHIR_DATA/full/zephir_full_YYYYMMDD.rights_rpt.tsv + # 
Files: + # ZEPHIR_DATA/full/zephir_full_monthly_rpt.txt + # ZEPHIR_DATA/full/zephir_full_YYYYMMDD.rights_rpt.tsv # Contents: TODO # Verify: readable def verify_zephir_data(date: current_date) diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index da47ece..fd2a3d0 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -23,6 +23,22 @@ module PostZephirProcessing ].join("\n") end + describe "#run_for_date" do + context "last day of month" do + test_date = Date.parse("2024-11-30") + it "runs" do + described_class.new.run_for_date(date: test_date) + end + end + + context "non-last day of month" do + test_date = Date.parse("2024-12-01") + it "runs" do + described_class.new.run_for_date(date: test_date) + end + end + end + describe "#verify_deletes_contents" do def expect_deletefile_error(contents) expect_not_ok(:verify_deletes_contents, @@ -382,5 +398,46 @@ def expect_deletefile_ok(contents) end end end + + describe "verify_zephir_data" do + context "last day of month" do + test_date = Date.parse("2024-11-30") + context "with both files present" do + it "reports no errors" do + ClimateControl.modify(ZEPHIR_DATA: @tmpdir) do + FileUtils.mkdir(File.join(@tmpdir, "full")) + FileUtils.touch(File.join(@tmpdir, "full", "zephir_full_monthly_rpt.txt")) + zephir_full_file = "zephir_full_YYYYMMDD.rights_rpt.tsv".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + FileUtils.touch(File.join(@tmpdir, "full", zephir_full_file)) + verifier = described_class.new + verifier.verify_zephir_data(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end + + context "with both files absent" do + it "reports two `not found` errors" do + ClimateControl.modify(ZEPHIR_DATA: @tmpdir) do + verifier = described_class.new + verifier.verify_zephir_data(date: test_date) + expect(verifier.errors.count).to eq 2 + verifier.errors.each do |err| + expect(err).to include(/^not found/) + end + end + end + 
end + end + + context "non-last day of month" do + test_date = Date.parse("2024-12-01") + it "reports no errors" do + verifier = described_class.new + verifier.verify_zephir_data(date: test_date) + expect(verifier.errors.count).to eq 0 + end + end + end end end From 77eb63811a46c221393d2f9cf68be5e41ded5c08 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Fri, 13 Dec 2024 12:18:38 -0500 Subject: [PATCH 057/114] - Use zlib instead of zinzout in verify_parseable_ndj - Return Boolean from verify_parseable_ndj - Add option to expect_(not_)ok helpers to check return values --- lib/verifier.rb | 17 +++++++++-------- spec/spec_helper.rb | 14 ++++++++++---- spec/unit/verifier_spec.rb | 8 ++++---- 3 files changed, 23 insertions(+), 16 deletions(-) diff --git a/lib/verifier.rb b/lib/verifier.rb index 59ee359..3e871a4 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -3,7 +3,6 @@ require_relative "derivatives" require_relative "journal" require_relative "services" -require "zinzout" # Common superclass for all things Verifier. # Right now the only thing I can think of to put here is shared @@ -54,20 +53,19 @@ def run_for_date(date:) end # Basic checks for the existence and readability of the file at `path`. - # We should do whatever logging/warning we want to do if the file does - # not pass muster. - # Verifying contents is out of scope. - # Returns `true` if verified. + # @return [Boolean] `true` if verified, `false` if error was reported. def verify_file(path:) verify_file_exists(path: path) && verify_file_readable(path: path) end + # @return [Boolean] `true` if verified, `false` if error was reported. def verify_file_exists(path:) File.exist?(path).tap do |exists| error(message: "not found: #{path}") unless exists end end + # @return [Boolean] `true` if verified, `false` if error was reported. 
def verify_file_readable(path:) File.readable?(path).tap do |readable| error(message: "not readable: #{path}") unless readable @@ -78,15 +76,18 @@ def gzip_linecount(path:) Zlib::GzipReader.open(path, encoding: "utf-8") { |gz| gz.count } end - # Take a .ndj(.gz) file and check that each line is indeed parseable json + # Take a .ndj.gz file and check that each line is indeed parseable json + # @return [Boolean] `true` if verified, `false` if error was reported. def verify_parseable_ndj(path:) - Zinzout.zin(path) do |infile| - infile.each do |line| + Zlib::GzipReader.open(path, encoding: "utf-8") do |gz| + gz.each_line do |line| JSON.parse(line) end rescue JSON::ParserError error(message: "File #{path} contains unparseable JSON") + return false end + true end # I'm not sure if we're going to try to distinguish errors and warnings. diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index e634dc2..0c4199b 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -75,19 +75,25 @@ def with_temp_file(contents, gzipped: false) end end -def expect_not_ok(method, contents, errmsg: /.*/, gzipped: false) +def expect_not_ok(method, contents, errmsg: /.*/, gzipped: false, check_return: false) with_temp_file(contents, gzipped: gzipped) do |tmpfile| verifier = described_class.new - verifier.send(method, path: tmpfile) + result = verifier.send(method, path: tmpfile) expect(verifier.errors).to include(errmsg) + if check_return + expect(result).to be false + end end end -def expect_ok(method, contents, gzipped: false) +def expect_ok(method, contents, gzipped: false, check_return: false) with_temp_file(contents, gzipped: gzipped) do |tmpfile| verifier = described_class.new - verifier.send(method, path: tmpfile) + result = verifier.send(method, path: tmpfile) expect(verifier.errors).to be_empty + if check_return + expect(result).to be true + end end end diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index 0b3545b..784a600 100644 --- 
a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -63,13 +63,13 @@ module PostZephirProcessing end describe "#verify_parseable_ndj" do - it "checks if a .ndj file contains only parseable lines" do + it "returns `true` and no errors if a .ndj file contains only parseable lines" do content = "{}\n[]" - expect_ok(:verify_parseable_ndj, content) + expect_ok(:verify_parseable_ndj, content, gzipped: true, check_return: true) end - it "warns if it sees an unparseable line" do + it "warns and returns `false` if it sees an unparseable line" do content = "oops\n{}\n[]\n" - expect_not_ok(:verify_parseable_ndj, content, errmsg: /unparseable JSON/) + expect_not_ok(:verify_parseable_ndj, content, errmsg: /unparseable JSON/, gzipped: true, check_return: true) end end end From 4c187a55921fa7eeac68de434a7db03fe11abb50 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Mon, 16 Dec 2024 14:12:03 -0500 Subject: [PATCH 058/114] DEV-1418: Tests for catalog verifier run_for_date --- lib/verifier/catalog_index_verifier.rb | 20 ++++++++++++++++++++ spec/unit/catalog_indexing_verifier_spec.rb | 16 +++++++++++----- spec/unit/hathifiles_verifier_spec.rb | 3 --- 3 files changed, 31 insertions(+), 8 deletions(-) diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index e9f8814..4e9e5fb 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -9,6 +9,8 @@ module PostZephirProcessing class CatalogIndexVerifier < Verifier def verify_index_count(path:) + # TODO: we compute this path based on full/update in run_for_date -- avoid logic twice + # by using (todo) Derivative class filename = File.basename(path) if (m = filename.match(/^zephir_upd_(\d+)\.json\.gz/)) # in normal operation, we _should_ have indexed this the day after the @@ -48,5 +50,23 @@ def solr_result_count(filter_query) JSON.parse(Faraday.get(url).body)["response"]["numFound"] end + + def run_for_date(date:) + # TODO: The dates on the 
files are the previous day, but the indexing + # happens on the current day -- not sure the logic here makes sense? + @current_date = date + update_file = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date - 1) + if(verify_file(path: update_file)) + verify_index_count(path: update_file) + end + + # first of month + if date.first_of_month? + full_file = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date - 1) + if(verify_file(path: full_file)) + verify_index_count(path: full_file) + end + end + end end end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index e088d6c..ff28fbd 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -6,6 +6,7 @@ module PostZephirProcessing RSpec.describe(CatalogIndexVerifier) do let(:solr_url) { "http://solr-sdr-catalog:9033/solr/catalog" } + let(:verifier) { described_class.new } around(:each) do |example| with_test_environment do @@ -70,7 +71,6 @@ def stub_catalog_timerange(date, result_count) end describe "#verify_index_count" do - let(:verifier) { described_class.new } context "with a catalog update file with 3 records" do let(:catalog_update) { fixture("catalog_archive/zephir_upd_20241202.json.gz") } # indexed the day after the date in the filename @@ -127,10 +127,16 @@ def stub_catalog_timerange(date, result_count) expect { verifier.verify_index_count(path: fixture("zephir_data/ht_bib_export_full_2024-11-30.json.gz")) }.to raise_exception(ArgumentError) end end - end - describe "#run" do - it "checks the full file on the last day of the month" - it "checks the file corresponding to today's date" + describe "#run_for_date" do + it "checks the full file on the first day of the month" do + verifier.run_for_date(date: Date.parse("2024-03-01")) + expect(verifier.errors).to include(/.*not 
found.*zephir_full_20240229_vufind.json.gz.*/) + end + it "checks the update file corresponding to today's date" do + verifier.run_for_date(date: Date.parse("2024-03-02")) + expect(verifier.errors).to include(/.*not found.*zephir_upd_20240301.json.gz.*/) + end + end end end diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index c0ee934..0b2a2a4 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -44,9 +44,6 @@ module PostZephirProcessing end end - # the whole enchilada - describe "#verify_hathifile" - describe "#verify_hathifile_contents" do it "accepts a file with a single real hathifiles entry" do expect_ok(:verify_hathifile_contents, sample_line, gzipped: true) From b35c4dd1b80b42c8545ff441b846c1ce48c54de2 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Mon, 16 Dec 2024 14:19:41 -0500 Subject: [PATCH 059/114] Update bundler & dependencies --- Gemfile.lock | 49 +++++++++++++++++++++++++------------------------ 1 file changed, 25 insertions(+), 24 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index 5948e84..aa71435 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -13,7 +13,7 @@ GEM rexml diff-lcs (1.5.1) docile (1.4.1) - dotenv (3.1.4) + dotenv (3.1.6) faraday (2.12.2) faraday-net_http (>= 2.0, < 3.5) json @@ -21,53 +21,52 @@ GEM faraday-net_http (3.4.0) net-http (>= 0.5.0) hashdiff (1.1.2) - json (2.7.2) + json (2.9.0) language_server-protocol (3.17.0.3) lint_roller (1.1.0) - logger (1.6.2) + logger (1.6.3) method_source (1.1.0) mysql2 (0.5.6) net-http (0.6.0) uri parallel (1.26.3) - parser (3.3.5.0) + parser (3.3.6.0) ast (~> 2.4.1) racc - pry (0.14.2) + pry (0.15.0) coderay (~> 1.1) method_source (~> 1.0) public_suffix (6.0.1) racc (1.8.1) rainbow (3.1.1) - regexp_parser (2.9.2) - rexml (3.3.7) + regexp_parser (2.9.3) + rexml (3.4.0) rspec (3.13.0) rspec-core (~> 3.13.0) rspec-expectations (~> 3.13.0) rspec-mocks (~> 3.13.0) - rspec-core (3.13.1) + rspec-core 
(3.13.2) rspec-support (~> 3.13.0) rspec-expectations (3.13.3) diff-lcs (>= 1.2.0, < 2.0) rspec-support (~> 3.13.0) - rspec-mocks (3.13.1) + rspec-mocks (3.13.2) diff-lcs (>= 1.2.0, < 2.0) rspec-support (~> 3.13.0) - rspec-support (3.13.1) - rubocop (1.65.1) + rspec-support (3.13.2) + rubocop (1.69.2) json (~> 2.3) language_server-protocol (>= 3.17.0) parallel (~> 1.10) parser (>= 3.3.0.2) rainbow (>= 2.2.2, < 4.0) - regexp_parser (>= 2.4, < 3.0) - rexml (>= 3.2.5, < 4.0) - rubocop-ast (>= 1.31.1, < 2.0) + regexp_parser (>= 2.9.3, < 3.0) + rubocop-ast (>= 1.36.2, < 2.0) ruby-progressbar (~> 1.7) - unicode-display_width (>= 2.4.0, < 3.0) - rubocop-ast (1.32.3) + unicode-display_width (>= 2.4.0, < 4.0) + rubocop-ast (1.37.0) parser (>= 3.3.1.0) - rubocop-performance (1.21.1) + rubocop-performance (1.23.0) rubocop (>= 1.48.1, < 2.0) rubocop-ast (>= 1.31.1, < 2.0) ruby-progressbar (1.13.0) @@ -80,21 +79,23 @@ GEM simplecov-html (0.13.1) simplecov-lcov (0.8.0) simplecov_json_formatter (0.1.4) - standard (1.40.0) + standard (1.43.0) language_server-protocol (~> 3.17.0.2) lint_roller (~> 1.0) - rubocop (~> 1.65.0) + rubocop (~> 1.69.1) standard-custom (~> 1.0.0) - standard-performance (~> 1.4) + standard-performance (~> 1.6) standard-custom (1.0.2) lint_roller (~> 1.0) rubocop (~> 1.50) - standard-performance (1.4.0) + standard-performance (1.6.0) lint_roller (~> 1.1) - rubocop-performance (~> 1.21.0) + rubocop-performance (~> 1.23.0) standardrb (1.0.1) standard - unicode-display_width (2.6.0) + unicode-display_width (3.1.2) + unicode-emoji (~> 4.0, >= 4.0.4) + unicode-emoji (4.0.4) uri (1.0.2) webmock (3.24.0) addressable (>= 2.8.0) @@ -122,4 +123,4 @@ DEPENDENCIES zinzout BUNDLED WITH - 2.5.19 + 2.6.0 From 1731db883f104571255dee3e9cbe631043dea1e7 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Mon, 16 Dec 2024 14:44:19 -0500 Subject: [PATCH 060/114] Pin to bundler 2.5.23 bundler 2.6.0 appears to conflict with the debian bookworm ruby --- Dockerfile | 2 +- Gemfile.lock 
| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index 9d7bef1..96cc62e 100644 --- a/Dockerfile +++ b/Dockerfile @@ -39,7 +39,7 @@ RUN cpanm --notest \ # Ruby setup ENV BUNDLE_PATH /gems ENV RUBYLIB /usr/src/app/lib -RUN gem install bundler +RUN gem install bundler --version "~> 2.5.23" RUN bundle config --global silence_root_warning 1 RUN bundle install diff --git a/Gemfile.lock b/Gemfile.lock index aa71435..e473525 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -123,4 +123,4 @@ DEPENDENCIES zinzout BUNDLED WITH - 2.6.0 + 2.5.23 From c6bb369cded689a0e3df311eead17bdeafa8ed9b Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 17 Dec 2024 14:53:45 -0500 Subject: [PATCH 061/114] started implementing Derivative (sing.) class --- lib/derivative.rb | 62 ++++++++++++++++++ lib/verifier/catalog_index_verifier.rb | 4 +- lib/verifier/hathifiles_database_verifier.rb | 40 ++++-------- spec/spec_helper.rb | 2 +- spec/unit/derivative_spec.rb | 67 ++++++++++++++++++++ 5 files changed, 146 insertions(+), 29 deletions(-) create mode 100644 lib/derivative.rb create mode 100644 spec/unit/derivative_spec.rb diff --git a/lib/derivative.rb b/lib/derivative.rb new file mode 100644 index 0000000..0c2eb51 --- /dev/null +++ b/lib/derivative.rb @@ -0,0 +1,62 @@ +require "verifier" +require "derivatives" + +module PostZephirProcessing + class Derivative + attr_reader :date, :full, :derivative_type + + def initialize(date:, full:, derivative_type:) + @date = date + @full = full + @derivative_type = derivative_type + end + + def full? + full + end + + def path + Verifier.dated_derivative(**template, date: date) + end + + def self.derivatives_for_date(date:, derivative_type:) + raise unless derivative_type == :hathifile + + derivatives = [ + Derivative.new( + derivative_type: :hathifile, + full: false, + date: date + ) + ] + + if date.first_of_month? 
+ derivatives << Derivative.new( + derivative_type: :hathifile, + full: true, + date: date + ) + end + + derivatives + end + + private + + def fullness + if full + "full" + else + "upd" + end + end + + # given derivative type, knows how to construct params for Verifier.dated_derivative + def template + case derivative_type + when :hathifile + {location: :HATHIFILE_ARCHIVE, name: "hathi_#{fullness}_YYYYMMDD.txt.gz"} + end + end + end +end diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 4e9e5fb..6f259f7 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -56,14 +56,14 @@ def run_for_date(date:) # happens on the current day -- not sure the logic here makes sense? @current_date = date update_file = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date - 1) - if(verify_file(path: update_file)) + if verify_file(path: update_file) verify_index_count(path: update_file) end # first of month if date.first_of_month? full_file = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date - 1) - if(verify_file(path: full_file)) + if verify_file(path: full_file) verify_index_count(path: full_file) end end diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index 753d568..3886df6 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -4,6 +4,7 @@ require_relative "../verifier" require_relative "../derivatives" +require_relative "../derivative" module PostZephirProcessing class HathifilesDatabaseVerifier < Verifier @@ -33,39 +34,26 @@ def run_for_date(date:) def verify_hathifiles_database_log # File missing? Not our problem, should be caught by earlier verifier. 
- if File.exist?(update_file) - if !self.class.has_log?(hathifile: update_file) - error message: "missing hf_log: no entry for daily #{update_file}" - end - end - if current_date.first_of_month? - full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: current_date) - if File.exist?(full_file) - if !self.class.has_log?(hathifile: full_file) - error message: "missing hf_log: no entry for monthly #{full_file}" - end + + Derivative.derivatives_for_date(date: current_date, derivative_type: :hathifile).each do |d| + next unless File.exist?(d.path) + + if !self.class.has_log?(hathifile: d.path) + error message: "missing hf_log: no entry for #{d.path}" end end end def verify_hathifiles_database_count - if current_date.first_of_month? - if File.exist?(full_file) - full_file_count = gzip_linecount(path: full_file) - db_count = self.class.db_count - if full_file_count > db_count - error message: "hf count mismatch: #{full_file} (#{full_file_count}) vs hathifiles.hf (#{db_count})" - end + Derivative.derivatives_for_date(date: current_date, derivative_type: :hathifile).select { |d| d.full? 
}.each do |full_file| + next unless File.exist?(full_file.path) + + full_file_count = gzip_linecount(path: full_file.path) + db_count = self.class.db_count + if full_file_count > db_count + error message: "hf count mismatch: #{full_file.path} (#{full_file_count}) vs hathifiles.hf (#{db_count})" end end end - - def update_file - self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: current_date) - end - - def full_file - self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: current_date) - end end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 0c4199b..6be0711 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -30,7 +30,7 @@ # squelch log output from tests PostZephirProcessing::Services.register(:logger) { - Logger.new(File.open("/dev/null", "w"), level: Logger::DEBUG) + Logger.new(File.open(File::NULL, "w"), level: Logger::DEBUG) } def test_journal diff --git a/spec/unit/derivative_spec.rb b/spec/unit/derivative_spec.rb new file mode 100644 index 0000000..a314673 --- /dev/null +++ b/spec/unit/derivative_spec.rb @@ -0,0 +1,67 @@ +# frozen_string_literal: true + +require "derivative" + +module PostZephirProcessing + RSpec.describe(Derivative) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify(HATHIFILE_ARCHIVE: "/tmp") do + example.run + end + end + end + + let(:test_date_first_of_month) { Date.parse("2023-11-01") } + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + let(:derivative_type) { :hathifile } + + let(:params) do + { + date: test_date_first_of_month, + full: true, + derivative_type: :hathifile + } + end + let(:derivative) { described_class.new(**params) } + + describe "#initialize" do + it "requires a date and a fullness" do + expect(derivative).to be_an_instance_of(Derivative) + end + end + + describe "Derivative.derivatives_for_date" do + it "returns 2 derivatives (one full, one upd) on the 
first of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_first_of_month, + derivative_type: derivative_type + ) + expect(derivatives.count).to eq 2 + expect(derivatives.count { |d| d.full == true }).to eq 1 + expect(derivatives.count { |d| d.full == false }).to eq 1 + end + + it "returns 1 derivative on the last of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_last_of_month, + derivative_type: derivative_type + ) + expect(derivatives.count).to eq 1 + end + end + + it "reports back its fullness" do + expect(derivative.full?).to be true + end + + it "reports the expected file name for a full hathifile" do + expect(derivative.path).to eq "/tmp/hathi_full_20231101.txt.gz" + end + + it "reports the expected file name for a upd hathifile" do + params[:full] = false + expect(derivative.path).to eq "/tmp/hathi_upd_20231101.txt.gz" + end + end +end From 08afbf68c03ec65a345fa5b18238f499839be447 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 15:23:19 -0500 Subject: [PATCH 062/114] Misc cleanup / address feedback * remove extraneous comments * use first_of_month? helper * use verify_file where appropriate * use chomp! instead of strip! 
* fix full file pattern for zephir (in hathifiles verifier) * refactor rights format specs --- lib/verifier/hathifiles_contents_verifier.rb | 4 -- lib/verifier/hathifiles_verifier.rb | 18 +++-- lib/verifier/populate_rights_verifier.rb | 4 +- lib/verifier/post_zephir_verifier.rb | 5 +- spec/unit/hathifiles_verifier_spec.rb | 2 +- spec/unit/populate_rights_verifier_spec.rb | 2 +- spec/unit/post_zephir_verifier_spec.rb | 76 ++++++++++---------- spec/unit/verifier_spec.rb | 12 ++-- 8 files changed, 64 insertions(+), 59 deletions(-) diff --git a/lib/verifier/hathifiles_contents_verifier.rb b/lib/verifier/hathifiles_contents_verifier.rb index 428781e..3f00724 100644 --- a/lib/verifier/hathifiles_contents_verifier.rb +++ b/lib/verifier/hathifiles_contents_verifier.rb @@ -87,10 +87,6 @@ def run verify_fields(fields) end - # open file - # check each line against a regex - # count lines - # also check linecount against corresponding catalog - hathifile must be >= end def verify_fields(fields) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index aaadc0b..9d06f12 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -21,10 +21,9 @@ def run_for_date(date:) # Frequency: ALL # Files: CATALOG_PREP/hathi_upd_YYYYMMDD.txt.gz # and potentially HATHIFILE_ARCHIVE/hathi_full_YYYYMMDD.txt.gz - # Contents: TODO + # Contents: verified with HathifileContentsVerifier with regexes for each line/field # Verify: # readable - # TODO: line count must be > than corresponding catalog file def verify_hathifile(date: current_date) update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: date) if verify_file(path: update_file) @@ -33,7 +32,7 @@ def verify_hathifile(date: current_date) end # first of month - if date.day == 1 + if date.first_of_month? 
full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: date) if verify_file(path: full_file) linecount = verify_hathifile_contents(path: full_file) @@ -50,17 +49,24 @@ def verify_hathifile_contents(path:) end def verify_hathifile_linecount(linecount, catalog_path:) - catalog_linecount = Zlib::GzipReader.open(catalog_path).count + catalog_linecount = gzip_linecount(path: catalog_path) if linecount < catalog_linecount error(message: "#{catalog_path} has #{catalog_linecount} records but corresponding hathifile only has #{linecount}") end end def catalog_file_for(date, full: false) - filetype = full ? "full" : "upd" + # TODO address this somehow with Derivative. Maybe Derivative should know + # how to construct the filenames? + name = if full + "zephir_full_YYYYMMDD_vufind.json.gz" + else + "zephir_upd_YYYYMMDD.json.gz" + end + self.class.dated_derivative( location: :CATALOG_ARCHIVE, - name: "zephir_#{filetype}_YYYYMMDD.json.gz", + name: name, date: date - 1 ) end diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index 924c56d..e9eb7d2 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -22,13 +22,13 @@ class PopulateRightsVerifier < Verifier def run_for_date(date:) upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: UPD_RIGHTS_TEMPLATE, date: date) - if File.exist? upd_path + if verify_file(path: upd_path) verify_rights_file(path: upd_path) end if date.last_of_month? full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: FULL_RIGHTS_TEMPLATE, date: date) - if File.exist? 
full_path + if verify_file(path: full_path) verify_rights_file(path: full_path) end end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 7bfbe75..2797f89 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -6,9 +6,6 @@ # Verifies that post_zephir workflow stage did what it was supposed to. -# TODO: document and verify the files written by monthly process. -# They should be mostly the same but need to be accounted for. - module PostZephirProcessing class PostZephirVerifier < Verifier attr_reader :current_date @@ -171,7 +168,7 @@ def verify_rights_file_format(path:) # This allows an empty file as well, which is possible. File.open(path) do |f| f.each_line do |line| - line.strip! + line.chomp! unless line.match?(regex) error message: "Rights file #{path} contains malformed line: #{line}" end diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index 0b2a2a4..e9da2d7 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -63,7 +63,7 @@ module PostZephirProcessing it "computes a full source catalog file based on date - 1" do expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"), full: true)) - .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_full_20241130.json.gz") + .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_full_20241130_vufind.json.gz") end end end diff --git a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/populate_rights_verifier_spec.rb index d0b999b..a6918b2 100644 --- a/spec/unit/populate_rights_verifier_spec.rb +++ b/spec/unit/populate_rights_verifier_spec.rb @@ -12,7 +12,7 @@ module PostZephirProcessing end end - let(:test_rights) { 10.times.collect { |n| "test.%03d" % n } } + let(:test_rights) { (0..9).map { |n| "test.%03d" % n } } let(:test_rights_file_contents) do test_rights.map do |rights| [rights, "ic", "bib", "bibrights", "aa"].join("\t") diff --git 
a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index fd2a3d0..6d85605 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -329,6 +329,8 @@ def expect_deletefile_ok(contents) end describe "#verify_rights_file_format" do + let(:rights_cols) { ["a.1", "ic", "bib", "bibrights", "aa"] } + it "accepts an empty file" do expect_ok(:verify_rights_file_format, "") end @@ -337,64 +339,64 @@ def expect_deletefile_ok(contents) expect_ok(:verify_rights_file_format, well_formed_rights_file_content) end + it "accepts a well-formed line" do + expect_ok(:verify_rights_file_format, rights_cols.join("\t")) + end + volids_not_ok = ["", "x", "x.", ".x", "X.X"] - line_end = ["ic", "bib", "bibrights", "aa"].join("\t") - volids_not_ok.each do |volid| + volids_not_ok.each do |bad_volume_id| it "rejects a file with malformed volume id" do + rights_cols[0] = bad_volume_id + expect_not_ok( :verify_rights_file_format, - [volid, line_end].join("\t"), + rights_cols.join("\t"), errmsg: /Rights file .+ contains malformed line/ ) end end - it "rejects a file with malformed rights" do - cols = ["a.1", "ic", "bib", "bibrights", "aa"] - expect_ok(:verify_rights_file_format, cols.join("\t")) - - cols[1] = "" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) - - cols[1] = "icus" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) + it "rejects a file with no rights" do + rights_cols[1] = "" + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) end - it "rejects a file without bib in col 2" do - cols = ["a.1", "ic", "bib", "bibrights", "aa"] - expect_ok(:verify_rights_file_format, cols.join("\t")) - - cols[2] = "BIB" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) + it "rejects a file with unexpected (icus) rights" do + rights_cols[1] = "icus" + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + end - cols[2] = "" - expect_not_ok(:verify_rights_file_format, 
cols.join("\t")) + it "rejects a file without 'bib' (lowercase) in col 2" do + rights_cols[2] = "BIB" + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) end - it "rejects a file without bibrights in col 3" do - cols = ["a.1", "ic", "bib", "bibrights", "aa"] - expect_ok(:verify_rights_file_format, cols.join("\t")) + it "rejects a file with no reason in col 2" do + rights_cols[2] = "" + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + end - cols[3] = "BIBRIGHTS" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) + it "rejects a file without 'bibrights' (lowercase) in col 3" do + rights_cols[3] = "BIBRIGHTS" + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + end - cols[3] = "" - expect_not_ok(:verify_rights_file_format, cols.join("\t")) + it "rejects a file with no user in col 3" do + rights_cols[3] = "" + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) end it "accepts a file with OK digitization source" do - cols = ["a.1", "ic", "bib", "bibrights", "aa"] - expect_ok(:verify_rights_file_format, cols.join("\t")) - - cols[4] = "aa-aa" - expect_ok(:verify_rights_file_format, cols.join("\t")) + rights_cols[4] = "aa-aa" + expect_ok(:verify_rights_file_format, rights_cols.join("\t")) end - not_ok_dig_src = ["", "-aa", "aa-", "AA"] - line_start = ["a.1", "ic", "bib", "bibrights"].join("\t") - not_ok_dig_src.each do |dig_src| - it "rejects a file with malformed digitization source (#{dig_src})" do - expect_not_ok(:verify_rights_file_format, [line_start, dig_src].join("\t")) + not_ok_dig_sources = ["", "-aa", "aa-", "AA"] + not_ok_dig_sources.each do |bad_dig_source| + it "rejects a file with malformed digitization source (#{bad_dig_source})" do + rights_cols[4] = bad_dig_source + + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) end end end diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index 784a600..597bdcb 100644 --- a/spec/unit/verifier_spec.rb +++ 
b/spec/unit/verifier_spec.rb @@ -36,22 +36,26 @@ module PostZephirProcessing describe "#verify_file" do # Note: since the tests currently run as root, no way to test unreadable file + + it "starts with no errors" do + expect(verifier.errors).to be_empty + end + context "with readable file" do it "does not report an error" do - errors_before = verifier.errors.count tmpfile = File.join(@tmpdir, "tmpfile.txt") File.open(tmpfile, "w") { |f| f.puts "blah" } verifier.verify_file(path: tmpfile) - expect(verifier.errors.count).to eq(errors_before) + expect(verifier.errors).to be_empty end end context "with nonexistent file" do it "reports an error" do - errors_before = verifier.errors.count + verifier.errors.count tmpfile = File.join(@tmpdir, "no_such_tmpfile.txt") verifier.verify_file(path: tmpfile) - expect(verifier.errors.count).to be > errors_before + expect(verifier.errors).not_to be_empty end end end From fe2236a2450e1fa59b239a57db6907c892a9dec9 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 15:50:28 -0500 Subject: [PATCH 063/114] Include field names for hathifile verifier output --- lib/verifier/hathifiles_contents_verifier.rb | 68 +++++++++---------- .../unit/hathifiles_contents_verifier_spec.rb | 4 +- 2 files changed, 36 insertions(+), 36 deletions(-) diff --git a/lib/verifier/hathifiles_contents_verifier.rb b/lib/verifier/hathifiles_contents_verifier.rb index 3f00724..d9dff8b 100644 --- a/lib/verifier/hathifiles_contents_verifier.rb +++ b/lib/verifier/hathifiles_contents_verifier.rb @@ -7,66 +7,65 @@ module PostZephirProcessing class HathifileContentsVerifier < Verifier - HATHIFILE_FIELDS_COUNT = 26 - HATHIFILE_FIELD_REGEXES = [ + HATHIFILE_FIELD_SPECS = [ # htid - required; lowercase alphanumeric namespace, period, non-whitespace ID - /^[a-z0-9]{2,4}\.\S+$/, + {name: "htid", regex: /^[a-z0-9]{2,4}\.\S+$/}, # access - required; allow or deny - /^(allow|deny)$/, + {name: "access", regex: /^(allow|deny)$/}, # rights - required; lowercase 
alphanumeric plus dash and period - /^[a-z0-9\-.]+$/, + {name: "rights", regex: /^[a-z0-9\-.]+$/}, # ht_bib_key - required; 9 digits - /^\d{9}$/, + {name: "ht_bib_key", regex: /^\d{9}$/}, # description (enumchron) - optional; anything goes - /^.*$/, + {name: "description", regex: /^.*$/}, # source - required; NUC/MARC organization code, all upper-case - /^[A-Z]+$/, + {name: "source", regex: /^[A-Z]+$/}, # source_bib_num - optional (see note) - no whitespace, anything else # allowed. Note that blank source bib nums are likely a bug in hathifiles # generation - /^\S*$/, + {name: "source_bib_num", regex: /^\S*$/}, # oclc_num - optional; zero or more comma-separated numbers - /^(\d+)?(,\d+)*$/, + {name: "oclc_num", regex: /^(\d+)?(,\d+)*$/}, # hathifiles doesn't validate/normalize what comes out of the record for # isbn, issn, or lccn # isbn - optional; anything goes - /^.*$/, + {name: "isbn", regex: /^.*$/}, # issn - optional; anything goes - /^.*$/, + {name: "issn", regex: /^.*$/}, # lccn - optional; anything goes - /^.*$/, + {name: "lccn", regex: /^.*$/}, # title - optional (see note); anything goes # Note: currently blank for titles with only a 245$k; hathifiles # generation should likely be changed to include the k subfield. 
- /^.*$/, + {name: "title", regex: /^.*$/}, # imprint - optional; anything goes - /^.*$/, + {name: "imprint", regex: /^.*$/}, # rights_reason_code - required; lowercase alphabetical - /^[a-z]+$/, + {name: "rights_reason_code", regex: /^[a-z]+$/}, # rights_timestamp - required; %Y-%m-%d %H:%M:%S - /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/, + {name: "rights_timestamp", regex: /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/}, # us_gov_doc_flag - required; 0 or 1 - /^[01]$/, + {name: "us_gov_doc_flag", regex: /^[01]$/}, # rights_date_used - required - numeric - /^\d+$/, + {name: "rights_date_used", regex: /^\d+$/}, # publication place - required, 2 or 3 characters (but can be whitespace) - /^.{2,3}$/, + {name: "pub_place", regex: /^.{2,3}$/}, # lang - optional, at most 3 characters - /^.{0,3}$/, + {name: "lang", regex: /^.{0,3}$/}, # bib_fmt - required, uppercase characters - /^[A-Z]+$/, + {name: "bib_fmt", regex: /^[A-Z]+$/}, # collection code - required, uppercase characters - /^[A-Z]+$/, + {name: "collection_code", regex: /^[A-Z]+$/}, # content provider - required, lowercase characters + dash - /^[a-z\-]+$/, + {name: "content_provider_code", regex: /^[a-z\-]+$/}, # responsible entity code - required, lowercase characters + dash - /^[a-z\-]+$/, + {name: "responsible_entity_code", regex: /^[a-z\-]+$/}, # digitization agent code - required, lowercase characters + dash - /^[a-z\-]+$/, + {name: "digitization_agent_code", regex: /^[a-z\-]+$/}, # access profile code - required, lowercase characters + plus - /^[a-z+]+$/, + {name: "access_profile_code", regex: /^[a-z+]+$/}, # author - optional, anything goes - /^.*$/ + {name: "author", regex: /^.*$/} ] attr_reader :file, :line_count @@ -90,19 +89,20 @@ def run end def verify_fields(fields) - fields.each_with_index do |field, i| - regex = HATHIFILE_FIELD_REGEXES[i] - if !fields[i].match?(regex) - error(message: "Field #{i} at line #{line_count} in #{file} ('#{field}') does not match #{regex}") + fields.zip(HATHIFILE_FIELD_SPECS).each 
do |field_value, field_spec| + field_name = field_spec[:name] + regex = field_spec[:regex] + if !field_value.match?(regex) + error(message: "Field #{field_name} at line #{line_count} in #{file} ('#{field_value}') does not match #{regex}") end end end def verify_line_field_count(fields) - if fields.count == HATHIFILE_FIELDS_COUNT + if fields.count == HATHIFILE_FIELD_SPECS.count true else - error(message: "Line #{line_count} in #{file} has only #{fields.count} columns, expected #{HATHIFILE_FIELDS_COUNT}") + error(message: "Line #{line_count} in #{file} has only #{fields.count} columns, expected #{HATHIFILE_FIELD_SPECS.count}") false end end diff --git a/spec/unit/hathifiles_contents_verifier_spec.rb b/spec/unit/hathifiles_contents_verifier_spec.rb index 519878b..40e09f2 100644 --- a/spec/unit/hathifiles_contents_verifier_spec.rb +++ b/spec/unit/hathifiles_contents_verifier_spec.rb @@ -191,7 +191,7 @@ module PostZephirProcessing sample_fields[i] = field[:bad] verifier.verify_fields(sample_fields) - expect(verifier.errors).to include(/Field #{i}.*does not match/) + expect(verifier.errors).to include(/Field #{field[:name]}.*does not match/) end if field[:optional] @@ -206,7 +206,7 @@ module PostZephirProcessing sample_fields[i] = "" verifier.verify_fields(sample_fields) - expect(verifier.errors).to include(/Field #{i}.*does not match/) + expect(verifier.errors).to include(/Field #{field[:name]}.*does not match/) end end end From 2e3bc0c965b97572228a8fd1aa81a20c28fd6ce9 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 17 Dec 2024 15:50:49 -0500 Subject: [PATCH 064/114] started subclassing Derivative, added HathifileDerivative & implemented in HFDBVerifier --- lib/derivative.rb | 33 ++---------- lib/derivative/hathifile_derivative.rb | 30 +++++++++++ lib/verifier/hathifiles_database_verifier.rb | 6 +-- spec/unit/derivative_spec.rb | 41 +-------------- spec/unit/hathifile_derivative_spec.rb | 54 ++++++++++++++++++++ 5 files changed, 93 insertions(+), 71 deletions(-) 
create mode 100644 lib/derivative/hathifile_derivative.rb create mode 100644 spec/unit/hathifile_derivative_spec.rb diff --git a/lib/derivative.rb b/lib/derivative.rb index 0c2eb51..fce2dfb 100644 --- a/lib/derivative.rb +++ b/lib/derivative.rb @@ -3,12 +3,11 @@ module PostZephirProcessing class Derivative - attr_reader :date, :full, :derivative_type + attr_reader :date, :full - def initialize(date:, full:, derivative_type:) + def initialize(date:, full:) @date = date @full = full - @derivative_type = derivative_type end def full? @@ -19,26 +18,8 @@ def path Verifier.dated_derivative(**template, date: date) end - def self.derivatives_for_date(date:, derivative_type:) - raise unless derivative_type == :hathifile - - derivatives = [ - Derivative.new( - derivative_type: :hathifile, - full: false, - date: date - ) - ] - - if date.first_of_month? - derivatives << Derivative.new( - derivative_type: :hathifile, - full: true, - date: date - ) - end - - derivatives + def self.derivatives_for_date(date:) + # each subclass to return an array with all the derivatives for this date end private @@ -51,12 +32,8 @@ def fullness end end - # given derivative type, knows how to construct params for Verifier.dated_derivative def template - case derivative_type - when :hathifile - {location: :HATHIFILE_ARCHIVE, name: "hathi_#{fullness}_YYYYMMDD.txt.gz"} - end + # each subclass to return a hash with params for Verifier.dated_derivative end end end diff --git a/lib/derivative/hathifile_derivative.rb b/lib/derivative/hathifile_derivative.rb new file mode 100644 index 0000000..e116ed5 --- /dev/null +++ b/lib/derivative/hathifile_derivative.rb @@ -0,0 +1,30 @@ +require "derivative" + +module PostZephirProcessing + class HathifileDerivative < Derivative + def self.derivatives_for_date(date:) + derivatives = [ + HathifileDerivative.new( + full: false, + date: date + ) + ] + + if date.first_of_month? 
+ derivatives << HathifileDerivative.new( + full: true, + date: date + ) + end + + derivatives + end + + def template + { + location: :HATHIFILE_ARCHIVE, + name: "hathi_#{fullness}_YYYYMMDD.txt.gz" + } + end + end +end diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index 3886df6..b6756d5 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -4,7 +4,7 @@ require_relative "../verifier" require_relative "../derivatives" -require_relative "../derivative" +require_relative "../derivative/hathifile_derivative" module PostZephirProcessing class HathifilesDatabaseVerifier < Verifier @@ -35,7 +35,7 @@ def run_for_date(date:) def verify_hathifiles_database_log # File missing? Not our problem, should be caught by earlier verifier. - Derivative.derivatives_for_date(date: current_date, derivative_type: :hathifile).each do |d| + HathifileDerivative.derivatives_for_date(date: current_date).each do |d| next unless File.exist?(d.path) if !self.class.has_log?(hathifile: d.path) @@ -45,7 +45,7 @@ def verify_hathifiles_database_log end def verify_hathifiles_database_count - Derivative.derivatives_for_date(date: current_date, derivative_type: :hathifile).select { |d| d.full? }.each do |full_file| + HathifileDerivative.derivatives_for_date(date: current_date).select { |d| d.full? 
}.each do |full_file| next unless File.exist?(full_file.path) full_file_count = gzip_linecount(path: full_file.path) diff --git a/spec/unit/derivative_spec.rb b/spec/unit/derivative_spec.rb index a314673..7026641 100644 --- a/spec/unit/derivative_spec.rb +++ b/spec/unit/derivative_spec.rb @@ -4,23 +4,13 @@ module PostZephirProcessing RSpec.describe(Derivative) do - around(:each) do |example| - with_test_environment do - ClimateControl.modify(HATHIFILE_ARCHIVE: "/tmp") do - example.run - end - end - end - let(:test_date_first_of_month) { Date.parse("2023-11-01") } let(:test_date_last_of_month) { Date.parse("2023-11-30") } - let(:derivative_type) { :hathifile } let(:params) do { date: test_date_first_of_month, - full: true, - derivative_type: :hathifile + full: true } end let(:derivative) { described_class.new(**params) } @@ -31,37 +21,8 @@ module PostZephirProcessing end end - describe "Derivative.derivatives_for_date" do - it "returns 2 derivatives (one full, one upd) on the first of month" do - derivatives = described_class.derivatives_for_date( - date: test_date_first_of_month, - derivative_type: derivative_type - ) - expect(derivatives.count).to eq 2 - expect(derivatives.count { |d| d.full == true }).to eq 1 - expect(derivatives.count { |d| d.full == false }).to eq 1 - end - - it "returns 1 derivative on the last of month" do - derivatives = described_class.derivatives_for_date( - date: test_date_last_of_month, - derivative_type: derivative_type - ) - expect(derivatives.count).to eq 1 - end - end - it "reports back its fullness" do expect(derivative.full?).to be true end - - it "reports the expected file name for a full hathifile" do - expect(derivative.path).to eq "/tmp/hathi_full_20231101.txt.gz" - end - - it "reports the expected file name for a upd hathifile" do - params[:full] = false - expect(derivative.path).to eq "/tmp/hathi_upd_20231101.txt.gz" - end end end diff --git a/spec/unit/hathifile_derivative_spec.rb b/spec/unit/hathifile_derivative_spec.rb new 
file mode 100644 index 0000000..a107fd4 --- /dev/null +++ b/spec/unit/hathifile_derivative_spec.rb @@ -0,0 +1,54 @@ +# frozen_string_literal: true + +require "derivative" +require "derivative/hathifile_derivative" + +module PostZephirProcessing + RSpec.describe(HathifileDerivative) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify(HATHIFILE_ARCHIVE: "/tmp") do + example.run + end + end + end + + let(:test_date_first_of_month) { Date.parse("2023-11-01") } + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + + let(:params) do + { + date: test_date_first_of_month + } + end + let(:derivative) { described_class.new(**params) } + + describe "self.derivatives_for_date" do + it "returns 2 derivatives (one full, one upd) on the first of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_first_of_month + ) + expect(derivatives.count).to eq 2 + expect(derivatives.count { |d| d.full == true }).to eq 1 + expect(derivatives.count { |d| d.full == false }).to eq 1 + end + + it "returns 1 derivative on the last of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_last_of_month + ) + expect(derivatives.count).to eq 1 + end + end + + it "reports the expected file name for a full hathifile" do + params[:full] = true + expect(derivative.path).to eq "/tmp/hathi_full_20231101.txt.gz" + end + + it "reports the expected file name for a upd hathifile" do + params[:full] = false + expect(derivative.path).to eq "/tmp/hathi_upd_20231101.txt.gz" + end + end +end From dba6ce4d5d65b150940cb21fe1ced5faa348127b Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 16:03:13 -0500 Subject: [PATCH 065/114] Refactor rights verifier * use around block * truncate before & after each example * separate into contexts for with & without rights in db --- spec/unit/populate_rights_verifier_spec.rb | 103 ++++++++++----------- 1 file changed, 50 insertions(+), 53 deletions(-) diff --git 
a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/populate_rights_verifier_spec.rb index a6918b2..6bbc057 100644 --- a/spec/unit/populate_rights_verifier_spec.rb +++ b/spec/unit/populate_rights_verifier_spec.rb @@ -34,77 +34,74 @@ def insert_fake_rights(namespace:, id:) end # Temporarily add each `htid` to `rights_current` with resonable (and irrelevant) default values. - def with_fake_rights_entries(htids: test_rights) - split_htids = htids.map { |htid| htid.split(".", 2) } - Services[:database][:rights_current].where([:namespace, :id] => split_htids).delete - split_htids.each do |split_htid| - insert_fake_rights(namespace: split_htid[0], id: split_htid[1]) - end - begin - yield - ensure - Services[:database][:rights_current].where([:namespace, :id] => split_htids).delete + + context "with HTIDs in the rights database" do + around(:each) do |example| + Services[:database][:rights_current].truncate + + split_htids = test_rights.map { |htid| htid.split(".", 2) } + split_htids.each do |split_htid| + insert_fake_rights(namespace: split_htid[0], id: split_htid[1]) + end + + example.run + + Services[:database][:rights_current].truncate end - end - describe "#run_for_date" do - context "monthly" do - date = Date.new(2024, 11, 30) - context "with HTID in the Rights Database" do - it "logs no `missing rights_current` error" do - with_fake_rights_entries do - with_fake_rights_file(date: date, full: true) do - verifier.run_for_date(date: date) - expect(verifier.errors).not_to include(/missing rights_current/) - end - end + describe "#run_for_date" do + it "logs no `missing rights_current` error for full file" do + date = Date.new(2024, 11, 30) + with_fake_rights_file(date: date, full: true) do + verifier.run_for_date(date: date) + expect(verifier.errors).not_to include(/missing rights_current/) end end - context "with HTID not in the Rights Database" do - it "logs `missing rights_current` error" do - with_fake_rights_file(date: date, full: true) do - 
verifier.run_for_date(date: date) - expect(verifier.errors).to include(/missing rights_current/) - end + it "logs no `missing rights_current` error for update file" do + date = Date.new(2024, 12, 2) + with_fake_rights_file(date: date) do + verifier.run_for_date(date: date) + expect(verifier.errors).not_to include(/missing rights_current/) end end end - context "daily" do - date = Date.new(2024, 12, 2) - context "with HTID in the Rights Database" do - it "logs no `missing rights_current` error" do - with_fake_rights_entries do - with_fake_rights_file(date: date) do - verifier.run_for_date(date: date) - expect(verifier.errors).not_to include(/missing rights_current/) - end - end - end + describe "#verify_rights_file" do + it "logs no error" do + expect_ok(:verify_rights_file, test_rights_file_contents) end + end + end + + context "with no HTIDs in the rights database" do + around(:each) do |example| + Services[:database][:rights_current].truncate - context "with HTID not in the Rights Database" do - it "logs `missing rights_current` error" do - with_fake_rights_file(date: date) do - verifier.run_for_date(date: date) - expect(verifier.errors).to include(/missing rights_current/) - end + example.run + + Services[:database][:rights_current].truncate + end + + describe "#run_for_date" do + it "logs `missing rights_current` error for full file" do + date = Date.new(2024, 11, 30) + with_fake_rights_file(date: date, full: true) do + verifier.run_for_date(date: date) + expect(verifier.errors).to include(/missing rights_current/) end end - end - end - describe "#verify_rights_file" do - context "with HTID in the Rights Database" do - it "logs no error" do - with_fake_rights_entries do - expect_ok(:verify_rights_file, test_rights_file_contents) + it "logs `missing rights_current` error for update file" do + date = Date.new(2024, 12, 2) + with_fake_rights_file(date: date) do + verifier.run_for_date(date: date) + expect(verifier.errors).to include(/missing rights_current/) end 
end end - context "with HTID not in the Rights Database" do + describe "#verify_rights_file" do it "logs `missing rights_current` error" do expect_not_ok(:verify_rights_file, test_rights_file_contents, errmsg: /missing rights_current/) end From d1119d03076eee29b8b483104107e444cc607dd2 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 16:25:08 -0500 Subject: [PATCH 066/114] Add Derivative::Catalog --- lib/derivative/catalog.rb | 52 ++++++++++++++++++++ spec/unit/derivative/catalog_spec.rb | 71 ++++++++++++++++++++++++++++ 2 files changed, 123 insertions(+) create mode 100644 lib/derivative/catalog.rb create mode 100644 spec/unit/derivative/catalog_spec.rb diff --git a/lib/derivative/catalog.rb b/lib/derivative/catalog.rb new file mode 100644 index 0000000..9807fc5 --- /dev/null +++ b/lib/derivative/catalog.rb @@ -0,0 +1,52 @@ +require "derivative" + +module PostZephirProcessing + class Derivative::Catalog < Derivative + def self.derivatives_for_date(date:) + derivatives = [ + self.new( + full: false, + date: date + ) + ] + + if date.last_of_month? 
+ derivatives << self.new( + full: true, + date: date + ) + end + + derivatives + end + + def template + { + location: location, + name: filename_template + } + end + + private + + def filename_template + if(full) + "zephir_full_YYYYMMDD_vufind.json.gz" + else + "zephir_upd_YYYYMMDD.json.gz" + end + end + end + + class Derivative::CatalogArchive < Derivative::Catalog + def location + :CATALOG_ARCHIVE + end + end + + class Derivative::CatalogPrep < Derivative::Catalog + def location + :CATALOG_PREP + end + end +end diff --git a/spec/unit/derivative/catalog_spec.rb b/spec/unit/derivative/catalog_spec.rb new file mode 100644 index 0000000..52554c6 --- /dev/null +++ b/spec/unit/derivative/catalog_spec.rb @@ -0,0 +1,71 @@ +# frozen_string_literal: true + +require "derivative" +require "derivative/catalog" + +module PostZephirProcessing + RSpec.describe(Derivative::Catalog) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify( + CATALOG_ARCHIVE: "/tmp/archive", + CATALOG_PREP: "/tmp/prep" + ) do + example.run + end + end + end + + let(:test_date_first_of_month) { Date.parse("2023-11-01") } + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + + let(:params) do + { + date: test_date_last_of_month, + } + end + let(:derivative) { described_class.new(**params) } + + describe "self.derivatives_for_date" do + it "returns 2 derivatives (one full, one upd) on the last of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_last_of_month + ) + expect(derivatives.count).to eq 2 + expect(derivatives.count { |d| d.full == true }).to eq 1 + expect(derivatives.count { |d| d.full == false }).to eq 1 + end + + it "returns 1 derivative on the first of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_first_of_month + ) + expect(derivatives.count).to eq 1 + end + end + + describe(Derivative::CatalogArchive) do + it "reports the expected file name for a full catalog file" do + params[:full] 
= true + expect(derivative.path).to eq "/tmp/archive/zephir_full_20231130_vufind.json.gz" + end + + it "reports the expected file name for a upd hathifile" do + params[:full] = false + expect(derivative.path).to eq "/tmp/archive/zephir_upd_20231130.json.gz" + end + end + + describe(Derivative::CatalogPrep) do + it "reports the expected file name for a full catalog file" do + params[:full] = true + expect(derivative.path).to eq "/tmp/prep/zephir_full_20231130_vufind.json.gz" + end + + it "reports the expected file name for a upd hathifile" do + params[:full] = false + expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130.json.gz" + end + end + end +end From b2892691ce523f40343d256a53bce9a215871600 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 17 Dec 2024 16:43:00 -0500 Subject: [PATCH 067/114] updated naming convention for subclasses --- lib/derivative/{hathifile_derivative.rb => hathifile.rb} | 6 +++--- lib/verifier/hathifiles_database_verifier.rb | 6 +++--- spec/unit/hathifile_derivative_spec.rb | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) rename lib/derivative/{hathifile_derivative.rb => hathifile.rb} (77%) diff --git a/lib/derivative/hathifile_derivative.rb b/lib/derivative/hathifile.rb similarity index 77% rename from lib/derivative/hathifile_derivative.rb rename to lib/derivative/hathifile.rb index e116ed5..c97cbd5 100644 --- a/lib/derivative/hathifile_derivative.rb +++ b/lib/derivative/hathifile.rb @@ -1,17 +1,17 @@ require "derivative" module PostZephirProcessing - class HathifileDerivative < Derivative + class Derivative::Hathifile < Derivative def self.derivatives_for_date(date:) derivatives = [ - HathifileDerivative.new( + Derivative::Hathifile.new( full: false, date: date ) ] if date.first_of_month? 
- derivatives << HathifileDerivative.new( + derivatives << Derivative::Hathifile.new( full: true, date: date ) diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index b6756d5..38d4e6d 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -4,7 +4,7 @@ require_relative "../verifier" require_relative "../derivatives" -require_relative "../derivative/hathifile_derivative" +require_relative "../derivative/hathifile" module PostZephirProcessing class HathifilesDatabaseVerifier < Verifier @@ -35,7 +35,7 @@ def run_for_date(date:) def verify_hathifiles_database_log # File missing? Not our problem, should be caught by earlier verifier. - HathifileDerivative.derivatives_for_date(date: current_date).each do |d| + Derivative::Hathifile.derivatives_for_date(date: current_date).each do |d| next unless File.exist?(d.path) if !self.class.has_log?(hathifile: d.path) @@ -45,7 +45,7 @@ def verify_hathifiles_database_log end def verify_hathifiles_database_count - HathifileDerivative.derivatives_for_date(date: current_date).select { |d| d.full? }.each do |full_file| + Derivative::Hathifile.derivatives_for_date(date: current_date).select { |d| d.full? 
}.each do |full_file| next unless File.exist?(full_file.path) full_file_count = gzip_linecount(path: full_file.path) diff --git a/spec/unit/hathifile_derivative_spec.rb b/spec/unit/hathifile_derivative_spec.rb index a107fd4..9435940 100644 --- a/spec/unit/hathifile_derivative_spec.rb +++ b/spec/unit/hathifile_derivative_spec.rb @@ -1,10 +1,10 @@ # frozen_string_literal: true require "derivative" -require "derivative/hathifile_derivative" +require "derivative/hathifile" module PostZephirProcessing - RSpec.describe(HathifileDerivative) do + RSpec.describe(Derivative::Hathifile) do around(:each) do |example| with_test_environment do ClimateControl.modify(HATHIFILE_ARCHIVE: "/tmp") do From f0810156308bcab4d541b782cfef74fc32e52791 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 17 Dec 2024 16:45:02 -0500 Subject: [PATCH 068/114] standardrbrrbrb --- lib/derivative/catalog.rb | 6 +++--- spec/unit/derivative/catalog_spec.rb | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/derivative/catalog.rb b/lib/derivative/catalog.rb index 9807fc5..1f877b8 100644 --- a/lib/derivative/catalog.rb +++ b/lib/derivative/catalog.rb @@ -4,14 +4,14 @@ module PostZephirProcessing class Derivative::Catalog < Derivative def self.derivatives_for_date(date:) derivatives = [ - self.new( + new( full: false, date: date ) ] if date.last_of_month? 
- derivatives << self.new( + derivatives << new( full: true, date: date ) @@ -30,7 +30,7 @@ def template private def filename_template - if(full) + if full "zephir_full_YYYYMMDD_vufind.json.gz" else "zephir_upd_YYYYMMDD.json.gz" diff --git a/spec/unit/derivative/catalog_spec.rb b/spec/unit/derivative/catalog_spec.rb index 52554c6..b99cf53 100644 --- a/spec/unit/derivative/catalog_spec.rb +++ b/spec/unit/derivative/catalog_spec.rb @@ -21,7 +21,7 @@ module PostZephirProcessing let(:params) do { - date: test_date_last_of_month, + date: test_date_last_of_month } end let(:derivative) { described_class.new(**params) } From 1c99b43ed596b8ace436c24f467d29e6b578a39e Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 16:44:29 -0500 Subject: [PATCH 069/114] Use Derivative::Catalog for CatalogIndexVerifier --- lib/verifier/catalog_index_verifier.rb | 51 +++++++++------------ spec/unit/catalog_indexing_verifier_spec.rb | 30 ++++++------ 2 files changed, 37 insertions(+), 44 deletions(-) diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 6f259f7..dc4a66c 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -8,29 +8,25 @@ module PostZephirProcessing class CatalogIndexVerifier < Verifier - def verify_index_count(path:) - # TODO: we compute this path based on full/update in run_for_date -- avoid logic twice - # by using (todo) Derivative class - filename = File.basename(path) - if (m = filename.match(/^zephir_upd_(\d+)\.json\.gz/)) - # in normal operation, we _should_ have indexed this the day after the - # date listed in the file. - # - # could potentially use the journal to determine when we actually - # indexed it? 
- - date_of_indexing = Date.parse(m[1]) + 1 - catalog_linecount = gzip_linecount(path: path) - solr_count = solr_count(date_of_indexing) - elsif /^zephir_full_\d+_vufind\.json\.gz/.match?(filename) - catalog_linecount = gzip_linecount(path: path) + def verify_index_count(derivative:) + if derivative.full? + catalog_linecount = gzip_linecount(path: derivative.path) solr_count = solr_nondeleted_records + query_desc = "existed" else - raise ArgumentError, "#{path} doesn't seem to be a catalog index file" + date_of_indexing = derivative.date + 1 + catalog_linecount = gzip_linecount(path: derivative.path) + solr_count = solr_count(date_of_indexing) + query_desc = "had time_of_indexing on #{date_of_indexing}" end + # in normal operation, we _should_ have indexed this the day after the + # date listed in the file. + # + # could potentially use the journal to determine when we actually + # indexed it? if solr_count < catalog_linecount - error(message: "#{filename} had #{catalog_linecount} records, but only #{solr_count} had time_of_indexing on #{date_of_indexing} in solr") + error(message: "#{derivative.path} had #{catalog_linecount} records, but only #{solr_count} #{query_desc} in solr") end end @@ -52,19 +48,16 @@ def solr_result_count(filter_query) end def run_for_date(date:) - # TODO: The dates on the files are the previous day, but the indexing - # happens on the current day -- not sure the logic here makes sense? + # The dates on the files are the previous day, but the indexing + # happens on the current day. When we verify the current day, we are + # verifying that the file named for the _previous_ day was produced. 
+ @current_date = date - update_file = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_upd_YYYYMMDD.json.gz", date: date - 1) - if verify_file(path: update_file) - verify_index_count(path: update_file) - end + Derivative::CatalogArchive.derivatives_for_date(date: date - 1).each do |derivative| + path = derivative.path - # first of month - if date.first_of_month? - full_file = self.class.dated_derivative(location: :CATALOG_ARCHIVE, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date - 1) - if verify_file(path: full_file) - verify_index_count(path: full_file) + if verify_file(path: path) + verify_index_count(derivative: derivative) end end end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index ff28fbd..a672585 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -2,6 +2,7 @@ require "verifier/catalog_index_verifier" require "webmock" +require "derivative/catalog" module PostZephirProcessing RSpec.describe(CatalogIndexVerifier) do @@ -10,7 +11,10 @@ module PostZephirProcessing around(:each) do |example| with_test_environment do - ClimateControl.modify(SOLR_URL: solr_url) do + ClimateControl.modify( + SOLR_URL: solr_url, + CATALOG_ARCHIVE: fixture("catalog_archive") + ) do example.run end end @@ -72,60 +76,56 @@ def stub_catalog_timerange(date, result_count) describe "#verify_index_count" do context "with a catalog update file with 3 records" do - let(:catalog_update) { fixture("catalog_archive/zephir_upd_20241202.json.gz") } + let(:catalog_update) { Derivative::CatalogArchive.new(date: Date.parse("2024-12-02"), full: false) } # indexed the day after the date in the filename let(:catalog_index_date) { Date.parse("2024-12-03") } it "accepts a catalog with 3 recent updates" do stub_catalog_timerange(catalog_index_date, 3) - verifier.verify_index_count(path: catalog_update) + verifier.verify_index_count(derivative: catalog_update) 
expect(verifier.errors).to be_empty end it "accepts a catalog with 5 recent updates" do stub_catalog_timerange(catalog_index_date, 5) - verifier.verify_index_count(path: catalog_update) + verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to be_empty end it "rejects a catalog with no recent updates" do stub_catalog_timerange(catalog_index_date, 0) - verifier.verify_index_count(path: catalog_update) + verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to include(/only 0 .* in solr/) end it "rejects a catalog with 2 recent updates" do stub_catalog_timerange(catalog_index_date, 2) - verifier.verify_index_count(path: catalog_update) + verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to include(/only 2 .* in solr/) end end context "with a catalog full file with 5 records" do - let(:catalog_full) { fixture("catalog_archive/zephir_full_20241130_vufind.json.gz") } + let(:catalog_full) { Derivative::CatalogArchive.new(date: Date.parse("2024-11-30"), full: true) } it "accepts a catalog with 5 records" do stub_catalog_record_count(5) - verifier.verify_index_count(path: catalog_full) + verifier.verify_index_count(derivative: catalog_full) expect(verifier.errors).to be_empty end it "accepts a catalog with 6 records" do stub_catalog_record_count(6) - verifier.verify_index_count(path: catalog_full) + verifier.verify_index_count(derivative: catalog_full) expect(verifier.errors).to be_empty end it "rejects a catalog with no records" do stub_catalog_record_count(0) - verifier.verify_index_count(path: catalog_full) + verifier.verify_index_count(derivative: catalog_full) expect(verifier.errors).to include(/only 0 .* in solr/) end it "rejects a catalog with 2 records" do stub_catalog_record_count(2) - verifier.verify_index_count(path: catalog_full) + verifier.verify_index_count(derivative: catalog_full) expect(verifier.errors).to include(/only 2 .* in solr/) end end - - it "raises an exception when 
given some other file" do - expect { verifier.verify_index_count(path: fixture("zephir_data/ht_bib_export_full_2024-11-30.json.gz")) }.to raise_exception(ArgumentError) - end end describe "#run_for_date" do From 09434a30ecf11c5db7d5d9e7dd417fc19d9793c9 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 16:53:06 -0500 Subject: [PATCH 070/114] Use Derivative::Catalog in post_zephir_verifier --- lib/verifier/post_zephir_verifier.rb | 21 +++++---------------- 1 file changed, 5 insertions(+), 16 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 2797f89..4a76ec7 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -28,28 +28,17 @@ def run_for_date(date:) # readable # line count must be the same as input JSON def verify_catalog_archive(date: current_date) - zephir_update_derivative_params = { - location: :CATALOG_ARCHIVE, - name: "zephir_upd_YYYYMMDD.json.gz", - date: date - } - zephir_update_path = self.class.dated_derivative(**zephir_update_derivative_params) + zephir_update_path = Derivative::CatalogArchive.new(date: date, full: false).path verify_file(path: zephir_update_path) verify_parseable_ndj(path: zephir_update_path) if date.last_of_month? - zephir_full_derivative_params = { - location: :CATALOG_ARCHIVE, - name: "zephir_full_YYYYMMDD_vufind.json.gz", - date: date - } - ht_bib_export_derivative_params = { location: :ZEPHIR_DATA, name: "ht_bib_export_full_YYYY-MM-DD.json.gz", date: date } - output_path = self.class.dated_derivative(**zephir_full_derivative_params) + output_path = Derivative::CatalogArchive.new(date: date, full: true).path verify_file(path: output_path) verify_parseable_ndj(path: output_path) output_linecount = gzip_linecount(path: output_path) @@ -84,12 +73,12 @@ def verify_catalog_archive(date: current_date) # TODO: deletes file is combination of two component files in TMPDIR? 
def verify_catalog_prep(date: current_date) delete_file = self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date) - verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD.json.gz", date: date)) if verify_file(path: delete_file) verify_deletes_contents(path: delete_file) end - if date.last_of_month? - verify_file(path: self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_full_YYYYMMDD_vufind.json.gz", date: date)) + + Derivative::CatalogPrep.derivatives_for_date(date: date).each do |derivative| + verify_file(path: derivative.path) end end From f424f53fac1f17415915527ab0e7e22a34f6c2ad Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Tue, 17 Dec 2024 17:14:27 -0500 Subject: [PATCH 071/114] Use Derivative::Rights instead of :RIGHTS_ARCHIVE --- lib/derivative/rights.rb | 30 +++++++++++++ lib/verifier/catalog_index_verifier.rb | 4 +- lib/verifier/populate_rights_verifier.rb | 13 ++---- lib/verifier/post_zephir_verifier.rb | 13 ++---- spec/unit/derivative/catalog_spec.rb | 8 ++-- spec/unit/derivative/rights_spec.rb | 56 ++++++++++++++++++++++++ 6 files changed, 100 insertions(+), 24 deletions(-) create mode 100644 lib/derivative/rights.rb create mode 100644 spec/unit/derivative/rights_spec.rb diff --git a/lib/derivative/rights.rb b/lib/derivative/rights.rb new file mode 100644 index 0000000..1cec366 --- /dev/null +++ b/lib/derivative/rights.rb @@ -0,0 +1,30 @@ +require "derivative" + +module PostZephirProcessing + class Derivative::Rights < Derivative + def self.derivatives_for_date(date:) + derivatives = [ + new( + full: false, + date: date + ) + ] + + if date.last_of_month? 
+ derivatives << new( + full: true, + date: date + ) + end + + derivatives + end + + def template + { + location: :RIGHTS_ARCHIVE, + name: "zephir_#{fullness}_YYYYMMDD.rights" + } + end + end +end diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index dc4a66c..0c3e57a 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -9,13 +9,13 @@ module PostZephirProcessing class CatalogIndexVerifier < Verifier def verify_index_count(derivative:) + catalog_linecount = gzip_linecount(path: derivative.path) + if derivative.full? - catalog_linecount = gzip_linecount(path: derivative.path) solr_count = solr_nondeleted_records query_desc = "existed" else date_of_indexing = derivative.date + 1 - catalog_linecount = gzip_linecount(path: derivative.path) solr_count = solr_count(date_of_indexing) query_desc = "had time_of_indexing on #{date_of_indexing}" end diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index e9eb7d2..5c07f84 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -21,15 +21,10 @@ class PopulateRightsVerifier < Verifier UPD_RIGHTS_TEMPLATE = "zephir_upd_YYYYMMDD.rights" def run_for_date(date:) - upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: UPD_RIGHTS_TEMPLATE, date: date) - if verify_file(path: upd_path) - verify_rights_file(path: upd_path) - end - - if date.last_of_month? 
- full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: FULL_RIGHTS_TEMPLATE, date: date) - if verify_file(path: full_path) - verify_rights_file(path: full_path) + Derivative::Rights.derivatives_for_date(date: date).each do |derivative| + path = derivative.path + if verify_file(path: path) + verify_rights_file(path: path) end end end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 4a76ec7..7006ebb 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -130,15 +130,10 @@ def verify_ingest_bibrecords(date: current_date) # readable # accepted by verify_rights_file_format def verify_rights(date: current_date) - upd_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_upd_YYYYMMDD.rights", date: date) - if verify_file(path: upd_path) - verify_rights_file_format(path: upd_path) - end - - if date.last_of_month? - full_path = self.class.dated_derivative(location: :RIGHTS_ARCHIVE, name: "zephir_full_YYYYMMDD.rights", date: date) - if verify_file(path: full_path) - verify_rights_file_format(path: full_path) + Derivative::Rights.derivatives_for_date(date: date).each do |derivative| + path = derivative.path + if verify_file(path: path) + verify_rights_file_format(path: path) end end end diff --git a/spec/unit/derivative/catalog_spec.rb b/spec/unit/derivative/catalog_spec.rb index b99cf53..c802fed 100644 --- a/spec/unit/derivative/catalog_spec.rb +++ b/spec/unit/derivative/catalog_spec.rb @@ -45,24 +45,24 @@ module PostZephirProcessing end describe(Derivative::CatalogArchive) do - it "reports the expected file name for a full catalog file" do + it "reports the expected path for a full catalog file" do params[:full] = true expect(derivative.path).to eq "/tmp/archive/zephir_full_20231130_vufind.json.gz" end - it "reports the expected file name for a upd hathifile" do + it "reports the expected path for an update catalog file" do params[:full] = false 
expect(derivative.path).to eq "/tmp/archive/zephir_upd_20231130.json.gz" end end describe(Derivative::CatalogPrep) do - it "reports the expected file name for a full catalog file" do + it "reports the expected path for a full catalog file" do params[:full] = true expect(derivative.path).to eq "/tmp/prep/zephir_full_20231130_vufind.json.gz" end - it "reports the expected file name for a upd hathifile" do + it "reports the expected path for an update catalog file" do params[:full] = false expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130.json.gz" end diff --git a/spec/unit/derivative/rights_spec.rb b/spec/unit/derivative/rights_spec.rb new file mode 100644 index 0000000..e68d71f --- /dev/null +++ b/spec/unit/derivative/rights_spec.rb @@ -0,0 +1,56 @@ +# frozen_string_literal: true + +require "derivative" +require "derivative/rights" + +module PostZephirProcessing + RSpec.describe(Derivative::Rights) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify( + RIGHTS_ARCHIVE: "/tmp/rights" + ) do + example.run + end + end + end + + let(:test_date_first_of_month) { Date.parse("2023-11-01") } + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + + let(:params) do + { + date: test_date_last_of_month + } + end + let(:derivative) { described_class.new(**params) } + + describe "self.derivatives_for_date" do + it "returns 2 derivatives (one full, one upd) on the last of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_last_of_month + ) + expect(derivatives.count).to eq 2 + expect(derivatives.count { |d| d.full == true }).to eq 1 + expect(derivatives.count { |d| d.full == false }).to eq 1 + end + + it "returns 1 derivative on the first of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_first_of_month + ) + expect(derivatives.count).to eq 1 + end + + it "reports the expected path for a rights file derived from a full catalog file" do + params[:full] = true + 
expect(derivative.path).to eq "/tmp/rights/zephir_full_20231130.rights" + end + + it "reports the expected path for a rights file derived from an update catalog file" do + params[:full] = false + expect(derivative.path).to eq "/tmp/rights/zephir_upd_20231130.rights" + end + end + end +end From 6386eadd2b6b6e36a33cf176fc11673a8226b7d0 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Tue, 17 Dec 2024 17:19:32 -0500 Subject: [PATCH 072/114] using Derivative::HathifileWWW --- lib/derivative/hathifile_www.rb | 12 +++++++++ lib/verifier/hathifiles_listing_verifier.rb | 25 +++---------------- spec/unit/hathifiles_listing_verifier_spec.rb | 21 ++++++---------- 3 files changed, 22 insertions(+), 36 deletions(-) create mode 100644 lib/derivative/hathifile_www.rb diff --git a/lib/derivative/hathifile_www.rb b/lib/derivative/hathifile_www.rb new file mode 100644 index 0000000..a008415 --- /dev/null +++ b/lib/derivative/hathifile_www.rb @@ -0,0 +1,12 @@ +require_relative "hathifile" + +module PostZephirProcessing + class Derivative::HathifileWWW < Derivative::Hathifile + def template + { + location: :WWW_DIR, + name: "hathi_#{fullness}_YYYYMMDD.txt.gz" + } + end + end +end diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index 5e5e1cc..13e8252 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -2,6 +2,7 @@ require_relative "../verifier" require_relative "../derivatives" +require_relative "../derivative/hathifile_www" require "json" module PostZephirProcessing @@ -14,31 +15,11 @@ def run_for_date(date:) end def verify_hathifiles_listing(date: current_date) - derivatives_for_date(date: date).each do |derivative_path| - verify_listing(path: derivative_path) + Derivative::HathifileWWW.derivatives_for_date(date: date).each do |hathifile_derivative| + verify_listing(path: hathifile_derivative.path) end end - def derivatives_for_date(date:) - derivatives = [ - 
self.class.dated_derivative( - location: :WWW_DIR, - name: "hathi_upd_YYYYMMDD.txt.gz", - date: date - ) - ] - - if date.first_of_month? - derivatives << self.class.dated_derivative( - location: :WWW_DIR, - name: "hathi_full_YYYYMMDD.txt.gz", - date: date - ) - end - - derivatives - end - def verify_listing(path:) verify_file(path: path) diff --git a/spec/unit/hathifiles_listing_verifier_spec.rb b/spec/unit/hathifiles_listing_verifier_spec.rb index 95d431b..a807ce4 100644 --- a/spec/unit/hathifiles_listing_verifier_spec.rb +++ b/spec/unit/hathifiles_listing_verifier_spec.rb @@ -1,6 +1,7 @@ # frozen_string_literal: true require "verifier/hathifiles_listing_verifier" +require "derivative/hathifile_www" module PostZephirProcessing RSpec.describe(HathifilesListingVerifier) do @@ -23,17 +24,9 @@ module PostZephirProcessing around(:each) do |example| with_test_environment do - example.run - end - end - - describe "#derivatives_for_date" do - it "expects two derivativess on firstday" do - expect(described_class.new.derivatives_for_date(date: firstday).size).to eq 2 - end - - it "expects one derivative on secondday" do - expect(described_class.new.derivatives_for_date(date: secondday).size).to eq 1 + ClimateControl.modify(HATHIFILE_ARCHIVE: "data/www") do + example.run + end end end @@ -70,9 +63,9 @@ module PostZephirProcessing it "produces 2 errors if upd and full file are missing on the first day of the month" do # Need to remove the 2 files for the first to test - verifier.derivatives_for_date(date: firstday).each do |f| - if File.exist?(f) - FileUtils.rm(f) + Derivative::HathifileWWW.derivatives_for_date(date: firstday).each do |d| + if File.exist?(d.path) + FileUtils.rm(d.path) end end verifier.verify_hathifiles_listing(date: firstday) From d99469809bc6848e460ccac3b0d549d6bc8c643b Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 18 Dec 2024 10:52:17 -0500 Subject: [PATCH 073/114] Use Derivative in hathifiles_verifier * remove full vs. 
upd behavior * use Derivative::CatalogArchive instead of catalog_file_for --- lib/verifier/hathifiles_verifier.rb | 37 +++++---------------------- spec/unit/hathifiles_verifier_spec.rb | 12 --------- 2 files changed, 7 insertions(+), 42 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 9d06f12..67fc071 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -3,7 +3,7 @@ require "zlib" require_relative "hathifiles_contents_verifier" require_relative "../verifier" -require_relative "../derivatives" +require_relative "../derivative/hathifile" # Verifies that hathifiles workflow stage did what it was supposed to. @@ -25,19 +25,12 @@ def run_for_date(date:) # Verify: # readable def verify_hathifile(date: current_date) - update_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_upd_YYYYMMDD.txt.gz", date: date) - if verify_file(path: update_file) - linecount = verify_hathifile_contents(path: update_file) - verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date)) - end - - # first of month - if date.first_of_month? 
- full_file = self.class.dated_derivative(location: :HATHIFILE_ARCHIVE, name: "hathi_full_YYYYMMDD.txt.gz", date: date) - if verify_file(path: full_file) - linecount = verify_hathifile_contents(path: full_file) - verify_hathifile_linecount(linecount, catalog_path: catalog_file_for(date, full: true)) - end + Derivative::Hathifile.derivatives_for_date(date: date).each do |derivative| + path = derivative.path + next unless verify_file(path: path) + linecount = verify_hathifile_contents(path: path) + catalog_path = Derivative::CatalogArchive.new(date: date - 1, full: derivative.full).path + verify_hathifile_linecount(linecount, catalog_path: catalog_path) end end @@ -55,22 +48,6 @@ def verify_hathifile_linecount(linecount, catalog_path:) end end - def catalog_file_for(date, full: false) - # TODO address this somehow with Derivative. Maybe Derivative should know - # how to construct the filenames? - name = if full - "zephir_full_YYYYMMDD_vufind.json.gz" - else - "zephir_upd_YYYYMMDD.json.gz" - end - - self.class.dated_derivative( - location: :CATALOG_ARCHIVE, - name: name, - date: date - 1 - ) - end - def errors super.flatten end diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/hathifiles_verifier_spec.rb index e9da2d7..be787f4 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/hathifiles_verifier_spec.rb @@ -54,17 +54,5 @@ module PostZephirProcessing expect_not_ok(:verify_hathifile_contents, contents, errmsg: /.*columns.*/, gzipped: true) end end - - describe "#catalog_file_for" do - it "computes a source catalog file based on date - 1" do - expect(described_class.new.catalog_file_for(Date.parse("2023-01-04"))) - .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_upd_20230103.json.gz") - end - - it "computes a full source catalog file based on date - 1" do - expect(described_class.new.catalog_file_for(Date.parse("2024-12-01"), full: true)) - .to eq("#{ENV["CATALOG_ARCHIVE"]}/zephir_full_20241130_vufind.json.gz") - end - end end end From 
5c22657e56dc02523343ac68ca751202a77cc1b1 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 18 Dec 2024 11:01:06 -0500 Subject: [PATCH 074/114] Don't return line count from verify_hathifile_contents * Make method private * Return verifier & get line count from that --- lib/verifier/hathifiles_verifier.rb | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 67fc071..6a19fc5 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -28,19 +28,12 @@ def verify_hathifile(date: current_date) Derivative::Hathifile.derivatives_for_date(date: date).each do |derivative| path = derivative.path next unless verify_file(path: path) - linecount = verify_hathifile_contents(path: path) + contents_verifier = verify_hathifile_contents(path: path) catalog_path = Derivative::CatalogArchive.new(date: date - 1, full: derivative.full).path - verify_hathifile_linecount(linecount, catalog_path: catalog_path) + verify_hathifile_linecount(contents_verifier.line_count, catalog_path: catalog_path) end end - def verify_hathifile_contents(path:) - verifier = HathifileContentsVerifier.new(path) - verifier.run - @errors.append(verifier.errors) - verifier.line_count - end - def verify_hathifile_linecount(linecount, catalog_path:) catalog_linecount = gzip_linecount(path: catalog_path) if linecount < catalog_linecount @@ -51,5 +44,14 @@ def verify_hathifile_linecount(linecount, catalog_path:) def errors super.flatten end + + private + + def verify_hathifile_contents(path:) + HathifileContentsVerifier.new(path).tap do |contents_verifier| + contents_verifier.run + @errors.append(contents_verifier.errors) + end + end end end From 5da30f6e982aedd1d952be977b9987f46db3dde3 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 18 Dec 2024 11:45:43 -0500 Subject: [PATCH 075/114] fixes to hathifiles_listing_verifier & its spec, new hathifile_www derivative 
--- .../www}/hathi_file_list.json | 0 lib/derivative/hathifile_www.rb | 26 ++++++- lib/verifier/hathifiles_listing_verifier.rb | 30 ++++---- spec/fixtures/www/hathi_file_list.json | 26 +++++++ spec/unit/hathifiles_listing_verifier_spec.rb | 70 ++++++++----------- 5 files changed, 91 insertions(+), 61 deletions(-) rename {spec/fixtures => data/www}/hathi_file_list.json (100%) create mode 100644 spec/fixtures/www/hathi_file_list.json diff --git a/spec/fixtures/hathi_file_list.json b/data/www/hathi_file_list.json similarity index 100% rename from spec/fixtures/hathi_file_list.json rename to data/www/hathi_file_list.json diff --git a/lib/derivative/hathifile_www.rb b/lib/derivative/hathifile_www.rb index a008415..789d611 100644 --- a/lib/derivative/hathifile_www.rb +++ b/lib/derivative/hathifile_www.rb @@ -1,7 +1,29 @@ -require_relative "hathifile" +require "derivative" module PostZephirProcessing - class Derivative::HathifileWWW < Derivative::Hathifile + class Derivative::HathifileWWW < Derivative + def self.derivatives_for_date(date:) + derivatives = [ + Derivative::HathifileWWW.new( + full: false, + date: date + ) + ] + + if date.first_of_month? 
+ derivatives << Derivative::HathifileWWW.new( + full: true, + date: date + ) + end + + derivatives + end + + def self.json_path + File.join(ENV["WWW_DIR"], "hathi_file_list.json") + end + def template { location: :WWW_DIR, diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index 13e8252..25fd0f7 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -4,6 +4,7 @@ require_relative "../derivatives" require_relative "../derivative/hathifile_www" require "json" +require "set" module PostZephirProcessing class HathifilesListingVerifier < Verifier @@ -22,28 +23,23 @@ def verify_hathifiles_listing(date: current_date) def verify_listing(path:) verify_file(path: path) - - filename = File.basename(path) - verify_file_in_json(filename: filename) + verify_file_in_json(filename: File.basename(path)) end - # Verify that the derivatives for the date are included in - # "#{ENV['WWW_DIR']}/hathi_file_list.json" def verify_file_in_json(filename:) - json_path = "#{ENV["WWW_DIR"]}/hathi_file_list.json" - listings = JSON.load_file(json_path) - matches = [] - - listings.each do |listing| - if listing["filename"] == filename - matches << listing - break - end + unless listings.include?(filename) + error(message: "No listing with filename: #{filename} in #{Derivative::HathifileWWW.json_path}") end + end - if matches.empty? - error(message: "Did not find a listing with filename: #{filename} in JSON (#{json_path})") - end + private + + # Load json file and produce the set of "filename" values in that json, once. 
+ def listings + @listings ||= JSON + .load_file(Derivative::HathifileWWW.json_path) + .map { |listing| listing["filename"] } + .to_set end end end diff --git a/spec/fixtures/www/hathi_file_list.json b/spec/fixtures/www/hathi_file_list.json new file mode 100644 index 0000000..68681a4 --- /dev/null +++ b/spec/fixtures/www/hathi_file_list.json @@ -0,0 +1,26 @@ +[ + { + "filename": "hathi_full_20230101.txt.gz", + "full": true, + "size": 12345, + "created": "2023-01-01 01:01:01 -0400", + "modified": "2023-01-01 01:01:01 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_full_20230101.txt.gz" + }, + { + "filename": "hathi_upd_20230101.txt.gz", + "full": false, + "size": 123, + "created": "2023-01-01 01:01:01 -0400", + "modified": "2023-01-01 01:01:01 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_upd_20230101.txt.gz" + }, + { + "filename": "hathi_upd_20230102.txt.gz", + "full": false, + "size": 456, + "created": "2023-01-02 02:02:02 -0400", + "modified": "2023-01-02 02:02:02 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_upd_20230102.txt.gz" + } +] diff --git a/spec/unit/hathifiles_listing_verifier_spec.rb b/spec/unit/hathifiles_listing_verifier_spec.rb index a807ce4..3e7f130 100644 --- a/spec/unit/hathifiles_listing_verifier_spec.rb +++ b/spec/unit/hathifiles_listing_verifier_spec.rb @@ -5,73 +5,59 @@ module PostZephirProcessing RSpec.describe(HathifilesListingVerifier) do - let(:verifier) { described_class.new } - # Using secondday here as a representative for # "any day of the month that's not the 1st" # missingday does not have files or listings firstday = Date.parse("2023-01-01") secondday = Date.parse("2023-01-02") - missingday = Date.parse("2023-01-13") + missingday_first = Date.parse("2020-01-01") + missingday_second = Date.parse("2020-01-02") + firstday_ymd = firstday.strftime("%Y%m%d") secondday_ymd = secondday.strftime("%Y%m%d") - missingday_ymd = missingday.strftime("%Y%m%d") - dir_path = 
ENV["WWW_DIR"] + missingday_first_ymd = missingday_first.strftime("%Y%m%d") + missingday_second_ymd = missingday_second.strftime("%Y%m%d") - before(:all) do - FileUtils.cp(fixture("hathi_file_list.json"), dir_path) - end + fixture("hathi_file_list.json") + + let(:verifier) { described_class.new } around(:each) do |example| with_test_environment do - ClimateControl.modify(HATHIFILE_ARCHIVE: "data/www") do + ClimateControl.modify( + WWW_DIR: fixture("www"), + HATHIFILE_ARCHIVE: fixture("hathifile_archive") + ) do example.run end end end describe "#verify_hathifiles_listing" do - FileUtils.mkdir_p(dir_path) - it "finds update and full file on firstday" do - update_file = File.join(dir_path, "hathi_upd_#{firstday_ymd}.txt.gz") - full_file = File.join(dir_path, "hathi_full_#{firstday_ymd}.txt.gz") - - FileUtils.touch(update_file) - FileUtils.touch(full_file) - verifier.verify_hathifiles_listing(date: firstday) - expect(verifier.errors).to be_empty + expect(verifier.errors).to eq [] end it "finds just an update file on secondday" do - update_file = File.join(dir_path, "hathi_upd_#{secondday_ymd}.txt.gz") - - FileUtils.mkdir_p(dir_path) - FileUtils.touch(update_file) - verifier.verify_hathifiles_listing(date: secondday) - expect(verifier.errors).to be_empty + expect(verifier.errors).to eq [] end - it "produces 1 error if upd file is missing midmonth" do - verifier.verify_hathifiles_listing(date: missingday) - expect(verifier.errors.size).to eq 2 - expect(verifier.errors).to include(/Did not find a listing with filename: hathi_upd_#{missingday_ymd}/) - expect(verifier.errors.first).to include(/not found:.+_upd_#{missingday_ymd}/) + it "produces 4 errors if upd and full file are missing on the first day of the month + no listing" do + verifier.verify_hathifiles_listing(date: missingday_first) + expect(verifier.errors.count).to eq 4 # 2 files not found, 2 listings not found + expect(verifier.errors).to include(/No listing with filename: 
hathi_upd_#{missingday_first_ymd}.txt.gz .+/) + expect(verifier.errors).to include(/not found: .+hathi_upd_#{missingday_first_ymd}.txt.gz/) + expect(verifier.errors).to include(/No listing with filename: hathi_full_#{missingday_first_ymd}.txt.gz .+/) + expect(verifier.errors).to include(/not found: .+hathi_full_#{missingday_first_ymd}.txt.gz/) end - it "produces 2 errors if upd and full file are missing on the first day of the month" do - # Need to remove the 2 files for the first to test - Derivative::HathifileWWW.derivatives_for_date(date: firstday).each do |d| - if File.exist?(d.path) - FileUtils.rm(d.path) - end - end - verifier.verify_hathifiles_listing(date: firstday) - expect(verifier.errors.size).to eq 2 - expect(verifier.errors.first).to include(/not found:.+_upd_#{firstday_ymd}/) - expect(verifier.errors.last).to include(/not found:.+_full_#{firstday_ymd}/) + it "produces 2 errors if upd file is missing midmonth + no listing" do + verifier.verify_hathifiles_listing(date: missingday_second) + expect(verifier.errors.count).to eq 2 # 1 file not found, 1 listing not found + expect(verifier.errors).to include(/No listing with filename: hathi_upd_#{missingday_second_ymd}.txt.gz .+/) + expect(verifier.errors).to include(/not found: .+hathi_upd_#{missingday_second_ymd}.txt.gz/) end end @@ -82,9 +68,9 @@ module PostZephirProcessing verifier.verify_file_in_json(filename: "hathi_upd_#{secondday_ymd}.txt.gz") end it "produces 1 error when not finding a matching listing" do - verifier.verify_file_in_json(filename: "hathi_upd_#{missingday_ymd}.txt.gz") + verifier.verify_file_in_json(filename: "hathi_upd_#{missingday_first_ymd}.txt.gz") expect(verifier.errors.size).to eq 1 - expect(verifier.errors).to include(/Did not find a listing with filename: hathi_upd_#{missingday_ymd}/) + expect(verifier.errors).to include(/No listing with filename: hathi_upd_#{missingday_first_ymd}.txt.gz .+/) end end end From eec748baee76ae433b5c9dcf16ed0789be6854cf Mon Sep 17 00:00:00 2001 
From: Martin Warin Date: Wed, 18 Dec 2024 11:53:18 -0500 Subject: [PATCH 076/114] gitignore got me again --- spec/fixtures/www/hathi_full_20230101.txt.gz | 1 + spec/fixtures/www/hathi_upd_20230101.txt.gz | 1 + spec/fixtures/www/hathi_upd_20230102.txt.gz | 1 + 3 files changed, 3 insertions(+) create mode 100644 spec/fixtures/www/hathi_full_20230101.txt.gz create mode 100644 spec/fixtures/www/hathi_upd_20230101.txt.gz create mode 100644 spec/fixtures/www/hathi_upd_20230102.txt.gz diff --git a/spec/fixtures/www/hathi_full_20230101.txt.gz b/spec/fixtures/www/hathi_full_20230101.txt.gz new file mode 100644 index 0000000..ef8f769 --- /dev/null +++ b/spec/fixtures/www/hathi_full_20230101.txt.gz @@ -0,0 +1 @@ +we do not care about the contents of these diff --git a/spec/fixtures/www/hathi_upd_20230101.txt.gz b/spec/fixtures/www/hathi_upd_20230101.txt.gz new file mode 100644 index 0000000..ef8f769 --- /dev/null +++ b/spec/fixtures/www/hathi_upd_20230101.txt.gz @@ -0,0 +1 @@ +we do not care about the contents of these diff --git a/spec/fixtures/www/hathi_upd_20230102.txt.gz b/spec/fixtures/www/hathi_upd_20230102.txt.gz new file mode 100644 index 0000000..ef8f769 --- /dev/null +++ b/spec/fixtures/www/hathi_upd_20230102.txt.gz @@ -0,0 +1 @@ +we do not care about the contents of these From d39507dd27f8ed1104d6dd56d5f2a7b0efd162d9 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 18 Dec 2024 12:08:05 -0500 Subject: [PATCH 077/114] Add Derivative::Delete class --- lib/derivative/delete.rb | 21 +++++++++++ spec/unit/derivative/delete_spec.rb | 58 +++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+) create mode 100644 lib/derivative/delete.rb create mode 100644 spec/unit/derivative/delete_spec.rb diff --git a/lib/derivative/delete.rb b/lib/derivative/delete.rb new file mode 100644 index 0000000..1dacb30 --- /dev/null +++ b/lib/derivative/delete.rb @@ -0,0 +1,21 @@ +require "derivative" + +module PostZephirProcessing + class Derivative::Delete < Derivative + def 
self.derivatives_for_date(date:) + [ + new( + full: false, + date: date + ) + ] + end + + def template + { + location: :CATALOG_PREP, + name: "zephir_upd_YYYYMMDD_delete.txt.gz" + } + end + end +end diff --git a/spec/unit/derivative/delete_spec.rb b/spec/unit/derivative/delete_spec.rb new file mode 100644 index 0000000..33b87e0 --- /dev/null +++ b/spec/unit/derivative/delete_spec.rb @@ -0,0 +1,58 @@ +# frozen_string_literal: true + +require "derivative" +require "derivative/delete" + +module PostZephirProcessing + RSpec.describe(Derivative::Delete) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify( + CATALOG_ARCHIVE: "/tmp/archive", + CATALOG_PREP: "/tmp/prep" + ) do + example.run + end + end + end + + let(:test_date_first_of_month) { Date.parse("2023-11-01") } + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + + let(:params) do + { + date: test_date_last_of_month + } + end + let(:derivative) { described_class.new(**params) } + + describe "self.derivatives_for_date" do + it "returns 1 derivative (upd) on the last of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_last_of_month + ) + expect(derivatives.count).to eq 1 + expect(derivatives.first.full?).to be false + end + + it "returns 1 derivative (upd) on the first of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_first_of_month + ) + expect(derivatives.count).to eq 1 + expect(derivatives.first.full?).to be false + end + end + + it "reports the expected path for a delete file derived from an update catalog file" do + params[:full] = false + expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130_delete.txt.gz" + end + + # TODO: maybe this should raise since it's asking for a nonexistent derivative? 
+ it "reports the same (upd) path regardless of fullness" do + params[:full] = true + expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130_delete.txt.gz" + end + end +end From 33ec293ad30b4c41b60cf0991400810db33f3d83 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 18 Dec 2024 12:38:18 -0500 Subject: [PATCH 078/114] Appease standardrb --- spec/unit/derivative/delete_spec.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/unit/derivative/delete_spec.rb b/spec/unit/derivative/delete_spec.rb index 33b87e0..7497f5e 100644 --- a/spec/unit/derivative/delete_spec.rb +++ b/spec/unit/derivative/delete_spec.rb @@ -43,7 +43,7 @@ module PostZephirProcessing expect(derivatives.first.full?).to be false end end - + it "reports the expected path for a delete file derived from an update catalog file" do params[:full] = false expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130_delete.txt.gz" From 9f5beb33e04e60012dd5dc28162cce4ae6e4971f Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 18 Dec 2024 12:39:56 -0500 Subject: [PATCH 079/114] Remove DIR_DATA in favor of Derivative classes --- lib/derivatives.rb | 72 +++++++++++++++------------------------------- 1 file changed, 23 insertions(+), 49 deletions(-) diff --git a/lib/derivatives.rb b/lib/derivatives.rb index 0f0eeca..b850bcf 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -1,12 +1,20 @@ # frozen_string_literal: true require_relative "dates" +require "derivative" +require "derivative/catalog" +require "derivative/delete" +require "derivative/rights" module PostZephirProcessing # A class that knows the expected locations of standard Zephir derivative files. # `earliest_missing_date` is the main entrypoint when constructing an agenda of Zephir # file dates to fetch for processing. + # + # TODO: this class may be renamed PostZephirDerivatives once directory_for is updated, + # moved, or eliminated. 
class Derivatives + # TODO: STANDARD_LOCATIONS is only used for testing directory_for and may be eliminated. STANDARD_LOCATIONS = [ :CATALOG_ARCHIVE, :CATALOG_PREP, @@ -14,35 +22,6 @@ class Derivatives :TMPDIR, :WWW_DIR ].freeze - # Location data for the derivatives we care about when constructing our list of missing dates. - - DIR_DATA = { - zephir_full: { - location: :CATALOG_PREP, - pattern: /^zephir_full_(\d{8})_vufind\.json\.gz$/, - full: true - }, - zephir_full_rights: { - location: :RIGHTS_ARCHIVE, - pattern: /^zephir_full_(\d{8})\.rights$/, - full: true - }, - zephir_update: { - location: :CATALOG_PREP, - pattern: /^zephir_upd_(\d{8})\.json\.gz$/, - full: false - }, - zephir_update_rights: { - location: :RIGHTS_ARCHIVE, - pattern: /^zephir_upd_(\d{8})\.rights$/, - full: false - }, - zephir_update_delete: { - location: :CATALOG_PREP, - pattern: /^zephir_upd_(\d{8})_delete\.txt\.gz$/, - full: false - } - }.freeze attr_reader :dates @@ -71,27 +50,22 @@ def initialize(date: (Date.today - 1)) # @return [Date,nil] def earliest_missing_date - earliest = [] - DIR_DATA.each_pair do |name, data| - required_dates = data[:full] ? [dates.all_dates.min] : dates.all_dates - delta = required_dates - directory_inventory(name: name) - earliest << delta.min if delta.any? + derivative_classes = [ + Derivative::CatalogPrep, + Derivative::Rights, + Derivative::Delete + ] + earliest = nil + dates.all_dates.each do |date| + derivative_classes.each do |klass| + klass.derivatives_for_date(date: date).each do |derivative| + if !File.exist?(derivative.path) + earliest = [earliest, date].compact.min + end + end + end end - earliest.min - end - - private - - # Run regexp against the contents of dir and store matching files - # that have datestamps in the period of interest. 
- # @return [Array] de-duped and sorted ASC - def directory_inventory(name:) - dir = self.class.directory_for(location: DIR_DATA[name][:location]) - Dir.children(dir) - .filter_map { |filename| (m = DIR_DATA[name][:pattern].match(filename)) && Date.parse(m[1]) } - .select { |date| dates.all_dates.include? date } - .sort - .uniq + earliest end end end From f10458a73108bc9c93817d7937a98dd4c513219d Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 18 Dec 2024 13:25:30 -0500 Subject: [PATCH 080/114] Derivative subclass for dollar dup Tests mostly copied from deletes. --- lib/derivative/dollar_dup.rb | 26 ++++++++++++ lib/verifier/post_zephir_verifier.rb | 24 ++++++++++- spec/unit/derivative/dollar_dup_spec.rb | 56 +++++++++++++++++++++++++ 3 files changed, 104 insertions(+), 2 deletions(-) create mode 100644 lib/derivative/dollar_dup.rb create mode 100644 spec/unit/derivative/dollar_dup_spec.rb diff --git a/lib/derivative/dollar_dup.rb b/lib/derivative/dollar_dup.rb new file mode 100644 index 0000000..6e3447a --- /dev/null +++ b/lib/derivative/dollar_dup.rb @@ -0,0 +1,26 @@ +require "derivative" + +module PostZephirProcessing + class Derivative::DollarDup < Derivative + def initialize(date:, full: false) + raise ArgumentError, "'dollar dup' has no full version" if full + super + end + + def self.derivatives_for_date(date:) + [ + new( + full: false, + date: date + ) + ] + end + + def template + { + location: :TMPDIR, + name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz" + } + end + end +end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 7006ebb..233aa77 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -3,6 +3,9 @@ require "zlib" require_relative "../verifier" require_relative "../derivatives" +require_relative "../derivative/dollar_dup" +require_relative "../derivative/catalog" +require_relative "../derivative/rights" # Verifies that post_zephir workflow stage did 
what it was supposed to. @@ -94,12 +97,29 @@ def verify_deletes_contents(path:) # Frequency: DAILY # Files: TMPDIR/vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz - # Contents: historically undallarized uc1 HTIDs (e.g., uc1.b312920) one per line + # + # Contents: + # + # Historically, un-dollarized uc1 HTIDs (e.g., uc1.b312920) one per line. + # These files originally served as a way to report back to Zephir on items to + # "uningest" related to a change that University of California made + # regarding certain barcode ranges -- things like uc1.b312920 moved to + # uc1.$b312920, and we needed to 'uningest' uc1.b312920. + # + # Later, it served as a more general way to cause Zephir to mark items as not + # ingested and thereby no longer export them in full files. This + # functionality (as of 2024) has not been used in many years. If we at some + # point begin fully deleting material from the repository, this + # functionality could again be used. + # + # As of December 2024, these files are generated each day, but are expected + # to be empty. + # # Verify: # readable # empty def verify_dollar_dup(date: current_date) - dollar_dup = self.class.dated_derivative(location: :TMPDIR, name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz", date: date) + dollar_dup = Derivative::DollarDup.new(date: date).path if verify_file(path: dollar_dup) gz_count = gzip_linecount(path: dollar_dup) if gz_count.positive? 
diff --git a/spec/unit/derivative/dollar_dup_spec.rb b/spec/unit/derivative/dollar_dup_spec.rb new file mode 100644 index 0000000..97320c8 --- /dev/null +++ b/spec/unit/derivative/dollar_dup_spec.rb @@ -0,0 +1,56 @@ +# frozen_string_literal: true + +require "derivative" +require "derivative/dollar_dup" + +module PostZephirProcessing + RSpec.describe(Derivative::DollarDup) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify( + TMPDIR: "/tmp" + ) do + example.run + end + end + end + + let(:test_date_first_of_month) { Date.parse("2023-11-01") } + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + + let(:params) do + { + date: test_date_last_of_month, + full: false + } + end + let(:derivative) { described_class.new(**params) } + + describe "self.derivatives_for_date" do + it "returns 1 derivative (upd) on the last of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_last_of_month + ) + expect(derivatives.count).to eq 1 + expect(derivatives.first.full?).to be false + end + + it "returns 1 derivative (upd) on the first of month" do + derivatives = described_class.derivatives_for_date( + date: test_date_first_of_month + ) + expect(derivatives.count).to eq 1 + expect(derivatives.first.full?).to be false + end + end + + it "reports the expected path for a dollar dup file" do + expect(derivative.path).to eq "/tmp/vufind_incremental_2023-11-30_dollar_dup.txt.gz" + end + + it "raises if a full file is requested" do + params[:full] = true + expect { derivative }.to raise_exception(ArgumentError, /full/) + end + end +end From 7906dc1ddc34740caeaab87c38df403e0f1b1b97 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 18 Dec 2024 13:29:07 -0500 Subject: [PATCH 081/114] Remove verification for rights report Currently, nothing depends on that file being written. The data in it may be useful for debugging purposes in the future, but we are likely to want to track that data in some other way (i.e. 
prometheus). --- lib/verifier/post_zephir_verifier.rb | 14 --------- spec/unit/post_zephir_verifier_spec.rb | 41 -------------------------- 2 files changed, 55 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 233aa77..caad67d 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -20,7 +20,6 @@ def run_for_date(date:) verify_dollar_dup verify_ingest_bibrecords verify_rights - verify_zephir_data end # Frequency: ALL @@ -179,18 +178,5 @@ def verify_rights_file_format(path:) end end end - - # Frequency: MONTHLY - # Files: - # ZEPHIR_DATA/full/zephir_full_monthly_rpt.txt - # ZEPHIR_DATA/full/zephir_full_YYYYMMDD.rights_rpt.tsv - # Contents: TODO - # Verify: readable - def verify_zephir_data(date: current_date) - if date.last_of_month? - verify_file(path: self.class.derivative(location: :ZEPHIR_DATA, name: "full/zephir_full_monthly_rpt.txt")) - verify_file(path: self.class.dated_derivative(location: :ZEPHIR_DATA, name: "full/zephir_full_YYYYMMDD.rights_rpt.tsv", date: date)) - end - end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 6d85605..e8b4308 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -400,46 +400,5 @@ def expect_deletefile_ok(contents) end end end - - describe "verify_zephir_data" do - context "last day of month" do - test_date = Date.parse("2024-11-30") - context "with both files present" do - it "reports no errors" do - ClimateControl.modify(ZEPHIR_DATA: @tmpdir) do - FileUtils.mkdir(File.join(@tmpdir, "full")) - FileUtils.touch(File.join(@tmpdir, "full", "zephir_full_monthly_rpt.txt")) - zephir_full_file = "zephir_full_YYYYMMDD.rights_rpt.tsv".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) - FileUtils.touch(File.join(@tmpdir, "full", zephir_full_file)) - verifier = described_class.new - verifier.verify_zephir_data(date: test_date) - 
expect(verifier.errors.count).to eq 0 - end - end - end - - context "with both files absent" do - it "reports two `not found` errors" do - ClimateControl.modify(ZEPHIR_DATA: @tmpdir) do - verifier = described_class.new - verifier.verify_zephir_data(date: test_date) - expect(verifier.errors.count).to eq 2 - verifier.errors.each do |err| - expect(err).to include(/^not found/) - end - end - end - end - end - - context "non-last day of month" do - test_date = Date.parse("2024-12-01") - it "reports no errors" do - verifier = described_class.new - verifier.verify_zephir_data(date: test_date) - expect(verifier.errors.count).to eq 0 - end - end - end end end From d227b62a979dd7cfa6cf376d7bb24864081dc111 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 18 Dec 2024 14:04:12 -0500 Subject: [PATCH 082/114] PostZephirVerifier.verify_rights_file_format now checks individual cols --- lib/verifier/post_zephir_verifier.rb | 23 +++++++++++++++++++---- spec/unit/post_zephir_verifier_spec.rb | 16 ++++++++-------- 2 files changed, 27 insertions(+), 12 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index caad67d..75baa45 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -161,19 +161,34 @@ def verify_rights(date: current_date) # * exist & be be readable (both covered by verify_rights) # * either be empty, or all its lines must match regex. def verify_rights_file_format(path:) - regex = /^ [a-z0-9]+ \. [a-z0-9:\/\$\.]+ # col 1, namespace.objid + line_regex = /^ [a-z0-9]+ \. [a-z0-9:\/\$\.]+ # col 1, namespace.objid \t (ic|pd|pdus|und) # col 2, one of these \t bib # col 3, exactly this \t bibrights # col 4, exactly this \t [a-z]+(-[a-z]+)* # col 5, digitization source, e.g. 
'ia', 'cornell-ms' $/x + # A column-by-column version of line_regex + column_regexes = [ + {name: :id, regex: /^[a-z0-9]+\.[a-z0-9:\/\$\.]+$/}, + {name: :rights, regex: /^(ic|pd|pdus|und)$/}, + {name: :bib, regex: /^bib$/}, + {name: :bibrights, regex: /^bibrights$/}, + {name: :digitization_source, regex: /^[a-z]+(-[a-z]+)*$/} + ] + # This allows an empty file as well, which is possible. File.open(path) do |f| - f.each_line do |line| + f.each_line.with_index do |line, i| line.chomp! - unless line.match?(regex) - error message: "Rights file #{path} contains malformed line: #{line}" + unless line.match?(line_regex) + # If line_regex did not match the line, find the offending col(s) and report + cols = line.split("\t", -1) + cols.each_with_index do |col, j| + unless col.match?(column_regexes[j][:regex]) + error message: "Rights file #{path}:#{i + 1}, invalid column #{column_regexes[j][:name]} : #{col}" + end + end end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index e8b4308..b7c151d 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -351,39 +351,39 @@ def expect_deletefile_ok(contents) expect_not_ok( :verify_rights_file_format, rights_cols.join("\t"), - errmsg: /Rights file .+ contains malformed line/ + errmsg: /invalid column id/ ) end end it "rejects a file with no rights" do rights_cols[1] = "" - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column rights/) end it "rejects a file with unexpected (icus) rights" do rights_cols[1] = "icus" - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column rights/) end it "rejects a file without 'bib' (lowercase) in col 2" do rights_cols[2] = "BIB" - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) +
expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column bib/) end it "rejects a file with no reason in col 2" do rights_cols[2] = "" - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column bib/) end it "rejects a file without 'bibrights' (lowercase) in col 3" do rights_cols[3] = "BIBRIGHTS" - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column bibrights/) end it "rejects a file with no user in col 3" do rights_cols[3] = "" - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column bibrights/) end it "accepts a file with OK digitization source" do @@ -396,7 +396,7 @@ def expect_deletefile_ok(contents) it "rejects a file with malformed digitization source (#{bad_dig_source})" do rights_cols[4] = bad_dig_source - expect_not_ok(:verify_rights_file_format, rights_cols.join("\t")) + expect_not_ok(:verify_rights_file_format, rights_cols.join("\t"), errmsg: /invalid column digitization_source/) end end end From 0cce69e89b4feac1162900b70c10105fe6752405 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 18 Dec 2024 14:25:27 -0500 Subject: [PATCH 083/114] Respond to issues in HathifilesDatabaseVerifier spec --- spec/fixtures/hathifile_archive/README.md | 5 +- .../hathi_full_20241101.txt.gz | Bin 0 -> 963 bytes .../unit/hathifiles_database_verifier_spec.rb | 94 +++++++----------- 3 files changed, 39 insertions(+), 60 deletions(-) create mode 100644 spec/fixtures/hathifile_archive/hathi_full_20241101.txt.gz diff --git a/spec/fixtures/hathifile_archive/README.md b/spec/fixtures/hathifile_archive/README.md index 58591a6..541fa53 100644 --- a/spec/fixtures/hathifile_archive/README.md +++ b/spec/fixtures/hathifile_archive/README.md @@ 
-2,5 +2,6 @@ Fixtures for end-to-end testing. Derived from hathifiles v1.5.2 via data from ../catalog_archive -hathifile_upd_20241202.txt.gz - 8 real hathifile entries, with rights_timestamp and access_profile manually added. -hathifile_upd_20241203.txt.gz - 5 real hathifile entries, all missing rights_timestamp and access profile, and the last entry truncated +hathi_upd_20241202.txt.gz - 8 real hathifile entries, with rights_timestamp and access_profile manually added. +hathi_upd_20241203.txt.gz - 5 real hathifile entries, all missing rights_timestamp and access profile, and the last entry truncated +hathi_full_20241101.txt.gz - 11 real hathifile entries copied from a hathifiles v1.5.2 test fixture \ No newline at end of file diff --git a/spec/fixtures/hathifile_archive/hathi_full_20241101.txt.gz b/spec/fixtures/hathifile_archive/hathi_full_20241101.txt.gz new file mode 100644 index 0000000000000000000000000000000000000000..d72b3c0a42f8b5e25f2e23ba889ba89f67f10b68 GIT binary patch literal 963 zcmV;!13dg6iwFoewnAk919M?*aBO8?W_4_AUotQP@NOX0r1M%y)jA0Q%0C&RaLm zJOP#``IHo=r$nNuXctdB(`qSsaxkaltUmuq; zxwE>-lkSFGl888oWxACdd(h|~ zQY)qG1YUTz!o=VR?F-Xm@9xkHFjt}X3Jz@LtPV!4{q~ghsvZ-Ov1pSBF=ccrC@_If zKQlh+dh90#6Jc!i_ebj8%EM$F^Zd)oBW{t#BS{f&kS1tgX(|btl2Nx=j&SF_s(MWD zYt{Axf;Al+p~#$7ZE(LDOnB&qfV=~$R(GLSg+ZIq9!Y7~Ge|AqaAeoGt{c!#d-zU! 
zAIe&t^{DL{Sqx#J7rMYvOJ|W;VXTHjT{f677;FVQw(?p9_{ofpXpP7|=CrIY|FGCX zmLl-Mq-x*X~P*(L<7w|Ql48Ad;hTgOqP+W-NTX+4|;_w!J zz#n=Q;jXcT@fFIj0o>VzQsvb}dv(zuAgfl!z)m#3eiywQg_I^YvCC0NlmQhpp5nvw lT1aU^`T_7>$38CG4CDLrs6YPo 0" do - with_fake_hf_entries(htids: fake_htids) do - expect(described_class.db_count.positive?).to eq(true) - expect(described_class.db_count).to eq(fake_htids.count) + with_fake_hf_entries(htids: fake_upd_htids) do + expect(described_class.db_count.positive?).to be true + expect(described_class.db_count).to eq(fake_upd_htids.count) end end end @@ -92,28 +76,27 @@ def with_fake_full_hathifile context "with corresponding hf_log" do it "reports no `missing hf_log` errors" do with_fake_hf_log_entry(hathifile: "hathi_upd_20241202.txt.gz") do - verifier.run_for_date(date: Date.new(2024, 12, 2)) - expect(verifier.errors).not_to include(/missing hf_log/) + verifier.run_for_date(date: second_of_month) + expect(verifier.errors).to be_empty end end end context "with no corresponding hf_log" do it "reports `missing hf_log` error" do - verifier.run_for_date(date: Date.new(2024, 12, 2)) - expect(verifier.errors).to include(/missing hf_log/) + verifier.run_for_date(date: second_of_month) + expect(verifier.errors.count).to eq 1 end end end - # Each of these must be run with `with_fake_full_hathifile` context "with full hathifile" do context "with corresponding hf_log" do - it "reports no `missing hf_log` errors" do + it "reports no errors" do with_fake_hf_log_entry(hathifile: test_full_file) do - with_fake_full_hathifile do - verifier.run_for_date(date: Date.new(2024, 12, 1)) - expect(verifier.errors).not_to include(/missing hf_log/) + with_fake_hf_entries(htids: fake_full_htids) do + verifier.run_for_date(date: first_of_month) + expect(verifier.errors).to be_empty end end end @@ -121,9 +104,9 @@ def with_fake_full_hathifile context "with no corresponding hf_log" do it "reports `missing hf_log` error" do - with_fake_full_hathifile do - verifier.run_for_date(date: 
Date.new(2024, 12, 1)) - expect(verifier.errors).to include(/missing hf_log/) + with_fake_hf_entries(htids: fake_full_htids) do + verifier.run_for_date(date: first_of_month) + expect(verifier.errors.count).to eq 1 end end end @@ -131,24 +114,19 @@ def with_fake_full_hathifile context "with the expected `hf` rows" do it "reports no `hf count mismatch` errors" do with_fake_hf_log_entry(hathifile: test_full_file) do - with_fake_full_hathifile do - fake_htids = ["test.001", "test.002", "test.003", "test.004", "test.005", "test.006", "test.007", "test.008"] - with_fake_hf_entries(htids: fake_htids) do - verifier.run_for_date(date: Date.new(2024, 12, 1)) - expect(verifier.errors).not_to include(/hf count mismatch/) - end + with_fake_hf_entries(htids: fake_full_htids) do + verifier.run_for_date(date: first_of_month) + expect(verifier.errors).to be_empty end end end end context "without the expected `hf` rows" do - it "reports `hf count mismatch` error" do + it "reports one `hf count mismatch` error" do with_fake_hf_log_entry(hathifile: test_full_file) do - with_fake_full_hathifile do - verifier.run_for_date(date: Date.new(2024, 12, 1)) - expect(verifier.errors).to include(/hf count mismatch/) - end + verifier.run_for_date(date: first_of_month) + expect(verifier.errors.count).to eq 1 end end end From 3dccc269b580187358c90c55e17431f54d7a38be Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 18 Dec 2024 14:33:45 -0500 Subject: [PATCH 084/114] climatecontrol for hathifiles_redirects_verifier_spec --- .../unit/hathifiles_redirects_verifier_spec.rb | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index 31b48fe..c7e5487 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -12,18 +12,26 @@ module PostZephirProcessing # Including this mess should invalidate either file let(:mess) 
{ "oops, messed up a line!" } - # Clean dir before each test + # Clean temp subdir before each test before(:each) do - [ENV["REDIRECTS_DIR"], ENV["REDIRECTS_HISTORY_DIR"]].each do |dir| - FileUtils.rm_rf(dir) - FileUtils.mkdir_p(dir) + ["redirects", "redirects_history"].each do |subdir| + FileUtils.rm_rf(File.join(@tmpdir, subdir)) + FileUtils.mkdir_p(File.join(@tmpdir, subdir)) end end around(:each) do |example| - with_test_environment { example.run } + with_test_environment { + ClimateControl.modify( + REDIRECTS_DIR: File.join(@tmpdir, "redirects"), + REDIRECTS_HISTORY_DIR: File.join(@tmpdir, "redirects_history") + ) do + example.run + end + } end + # copy fixture to temporary subdir def stage_redirects_file FileUtils.cp(fixture("redirects/redirects_202301.txt.gz"), ENV["REDIRECTS_DIR"]) end From f8f1693b6d808372b9c0708209c5658c308f9532 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Wed, 18 Dec 2024 15:02:03 -0500 Subject: [PATCH 085/114] redirects_verifier now displays line number in errors --- lib/verifier/hathifiles_redirects_verifier.rb | 17 +++++++++-------- spec/unit/hathifiles_redirects_verifier_spec.rb | 8 ++++---- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index caa5141..5e61a49 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -22,9 +22,9 @@ def verify_redirects(date: current_date) def verify_redirects_file(path: redirects_file) if verify_file(path: path) # check that each line in the file matches regex - Zlib::GzipReader.open(path, encoding: "utf-8").each_line do |line| + Zlib::GzipReader.open(path, encoding: "utf-8").each_line.with_index(1) do |line, i| unless REDIRECTS_REGEX.match?(line) - report_malformed(file: redirects_file, line: line) + report_malformed(file: redirects_file, line: line, line_number: i) end end end @@ -32,26 +32,27 @@ def verify_redirects_file(path: 
redirects_file) def verify_redirects_history_file(path: redirects_history_file) if verify_file(path: path) - Zlib::GzipReader.open(path, encoding: "utf-8").each_line do |line| + Zlib::GzipReader.open(path, encoding: "utf-8").each_line.with_index(1) do |line, i| parsed = JSON.parse(line) # Check that the line parses to a hash unless parsed.instance_of?(Hash) - report_malformed(file: redirects_history_file, line: line) + report_malformed(file: redirects_history_file, line: line, line_number: i) next end # Check that the outermost level of keys in the JSON line are what we expect unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS - report_malformed(file: redirects_history_file, line: line) + report_malformed(file: redirects_history_file, line: line, line_number: i) next end # here we could go further and verify deeper structure of json, # but not sure it's worth it? rescue JSON::ParserError - report_malformed(file: redirects_history_file, line: line) + report_malformed(file: redirects_history_file, line: line, line_number: i) end end end + # These are simple enough that Derivative subclasses would be overkill def redirects_file(date: current_date) File.join(ENV["REDIRECTS_DIR"], "redirects_#{date.strftime("%Y%m")}.txt.gz") end @@ -62,8 +63,8 @@ def redirects_history_file(date: current_date) private - def report_malformed(file:, line:) - error(message: "#{file} contains malformed line: #{line}") + def report_malformed(file:, line:, line_number:) + error(message: "#{file}:#{line_number} contains malformed line: #{line}") end end end diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index c7e5487..8f42539 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -93,8 +93,8 @@ def malform(file) malform(redirects_history_file) verifier.verify_redirects(date: test_date) expect(verifier.errors.count).to eq 2 - expect(verifier.errors).to 
include(/#{redirects_file} contains malformed line: #{mess}/) - expect(verifier.errors).to include(/#{redirects_history_file} contains malformed line: #{mess}/) + expect(verifier.errors).to include(/#{redirects_file}:1 contains malformed line: #{mess}/) + expect(verifier.errors).to include(/#{redirects_history_file}:1 contains malformed line: #{mess}/) end it "will not warn if both files are there & valid)" do stage_redirects_file @@ -114,7 +114,7 @@ def malform(file) malform(redirects_file) verifier.verify_redirects_file(path: redirects_file) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/#{redirects_file} contains malformed line: #{mess}/) + expect(verifier.errors).to include(/#{redirects_file}:1 contains malformed line: #{mess}/) end end @@ -128,7 +128,7 @@ def malform(file) malform(redirects_history_file) verifier.verify_redirects_history_file(path: redirects_history_file) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/#{redirects_history_file} contains malformed line: #{mess}/) + expect(verifier.errors).to include(/#{redirects_history_file}:1 contains malformed line: #{mess}/) end end end From b728e9c33db4b59d992123a4c16baab4a2887320 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 18 Dec 2024 15:13:43 -0500 Subject: [PATCH 086/114] Catalog indexing verification time range improvements * Check that things were indexed since midnight EST on the day the file was supposed to have been processed * Directly specify expected solr parameter in tests * Use webmock/rspec (otherwise stubbed endpoints get leaked between tests) --- lib/verifier/catalog_index_verifier.rb | 6 ++-- spec/spec_helper.rb | 1 + spec/unit/catalog_indexing_verifier_spec.rb | 34 ++++++++------------- 3 files changed, 15 insertions(+), 26 deletions(-) diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 0c3e57a..0d1c5b3 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ 
b/lib/verifier/catalog_index_verifier.rb @@ -31,10 +31,8 @@ def verify_index_count(derivative:) end def solr_count(date_of_indexing) - # get: - datebegin = date_of_indexing.to_datetime.new_offset(0).strftime("%FT%TZ") - dateend = (date_of_indexing + 1).to_datetime.new_offset(0).strftime("%FT%TZ") - solr_result_count("time_of_index:#{datebegin}%20TO%20#{dateend}]") + datebegin = date_of_indexing.to_time.utc.strftime("%FT%TZ") + solr_result_count("time_of_index:[#{datebegin}%20TO%20NOW]") end def solr_nondeleted_records diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 6be0711..1b6bd8b 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -7,6 +7,7 @@ require "tempfile" require "simplecov" require "simplecov-lcov" +require "webmock/rspec" require "zlib" Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index a672585..e6d9736 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -13,7 +13,8 @@ module PostZephirProcessing with_test_environment do ClimateControl.modify( SOLR_URL: solr_url, - CATALOG_ARCHIVE: fixture("catalog_archive") + CATALOG_ARCHIVE: fixture("catalog_archive"), + TZ: "America/Detroit" ) do example.run end @@ -21,8 +22,6 @@ module PostZephirProcessing end def stub_catalog_record_count(result_count) - WebMock.enable! - url = "#{solr_url}/select?fq=deleted:false&q=*:*&rows=0&wt=json" result = { @@ -43,18 +42,8 @@ def stub_catalog_record_count(result_count) .to_return(body: result, headers: {"Content-Type" => "application/json"}) end - def stub_catalog_timerange(date, result_count) - # must be like YYYY-mm-ddTHH:MM:SSZ - iso8601 with a 'Z' for time zone - - # time zone offsets like DateTime.iso8601 produce by default are't - # allowed for solr - - # FIXME: don't love that we duplicate this logic & the URL between here & - # the verifier -- anything to do? 
- datebegin = date.to_datetime.new_offset(0).strftime("%FT%TZ") - dateend = (date + 1).to_datetime.new_offset(0).strftime("%FT%TZ") - WebMock.enable! - - url = "#{solr_url}/select?fq=time_of_index:#{datebegin}%20TO%20#{dateend}]&q=*:*&rows=0&wt=json" + def stub_catalog_timerange(datebegin, result_count) + url = "#{solr_url}/select?fq=time_of_index:[#{datebegin}%20TO%20NOW]&q=*:*&rows=0&wt=json" result = { "responseHeader" => { @@ -62,7 +51,7 @@ def stub_catalog_timerange(date, result_count) "QTime" => 0, "params" => { "q" => "*=>*", - "fq" => "time_of_index:[#{datebegin} TO #{dateend}]", + "fq" => "time_of_index:[#{datebegin} TO NOW]", "rows" => "0", "wt" => "json" } @@ -77,26 +66,27 @@ def stub_catalog_timerange(date, result_count) describe "#verify_index_count" do context "with a catalog update file with 3 records" do let(:catalog_update) { Derivative::CatalogArchive.new(date: Date.parse("2024-12-02"), full: false) } - # indexed the day after the date in the filename - let(:catalog_index_date) { Date.parse("2024-12-03") } + # indexed the day after the date in the filename starting at midnight + # EST + let(:catalog_index_begin) { "2024-12-03T05:00:00Z" } it "accepts a catalog with 3 recent updates" do - stub_catalog_timerange(catalog_index_date, 3) + stub_catalog_timerange(catalog_index_begin, 3) verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to be_empty end it "accepts a catalog with 5 recent updates" do - stub_catalog_timerange(catalog_index_date, 5) + stub_catalog_timerange(catalog_index_begin, 5) verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to be_empty end it "rejects a catalog with no recent updates" do - stub_catalog_timerange(catalog_index_date, 0) + stub_catalog_timerange(catalog_index_begin, 0) verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to include(/only 0 .* in solr/) end it "rejects a catalog with 2 recent updates" do - 
stub_catalog_timerange(catalog_index_date, 2) + stub_catalog_timerange(catalog_index_begin, 2) verifier.verify_index_count(derivative: catalog_update) expect(verifier.errors).to include(/only 2 .* in solr/) end From 62b92e048d368bf61feb330305bb9943fae676b4 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Wed, 18 Dec 2024 15:35:26 -0500 Subject: [PATCH 087/114] Prepend class name to error messages --- lib/verifier.rb | 5 +++-- spec/unit/post_zephir_verifier_spec.rb | 10 +++++----- spec/unit/verifier_spec.rb | 8 +++++++- 3 files changed, 15 insertions(+), 8 deletions(-) diff --git a/lib/verifier.rb b/lib/verifier.rb index 3e871a4..e94e661 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -93,8 +93,9 @@ def verify_parseable_ndj(path:) # I'm not sure if we're going to try to distinguish errors and warnings. # For now let's call everything an error. def error(message:) - @errors << message - Services[:logger].error message + output_msg = self.class.to_s + ": " + message + @errors << output_msg + Services[:logger].error output_msg end end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index b7c151d..4b6294c 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -211,7 +211,7 @@ def expect_deletefile_ok(contents) verifier = described_class.new verifier.verify_dollar_dup(date: test_date) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/^not found/) + expect(verifier.errors).to include(/.*not found.*dollar_dup.*/) end end end @@ -239,7 +239,7 @@ def expect_deletefile_ok(contents) verifier = described_class.new verifier.verify_ingest_bibrecords(date: test_date) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/^not found/) + expect(verifier.errors).to include(/.*not found.*zephir_ingested_items.*/) end end end @@ -251,7 +251,7 @@ def expect_deletefile_ok(contents) verifier = described_class.new 
verifier.verify_ingest_bibrecords(date: test_date) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/^not found/) + expect(verifier.errors).to include(/.*not found.*groove_full.*/) end end end @@ -293,7 +293,7 @@ def expect_deletefile_ok(contents) verifier.verify_rights(date: test_date) expect(verifier.errors.count).to eq 2 verifier.errors.each do |err| - expect(err).to include(/^not found/) + expect(err).to include(/not found.*rights/) end end end @@ -321,7 +321,7 @@ def expect_deletefile_ok(contents) verifier = described_class.new verifier.verify_rights(date: test_date) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/^not found/) + expect(verifier.errors).to include(/.*not found.*rights.*/) end end end diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index 597bdcb..c6a15de 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -52,11 +52,17 @@ module PostZephirProcessing context "with nonexistent file" do it "reports an error" do - verifier.errors.count tmpfile = File.join(@tmpdir, "no_such_tmpfile.txt") verifier.verify_file(path: tmpfile) expect(verifier.errors).not_to be_empty end + + it "includes the class name in the error" do + tmpfile = File.join(@tmpdir, "no_such_tmpfile.txt") + verifier.verify_file(path: tmpfile) + expect(verifier.errors).to include(/^PostZephirProcessing::Verifier: .*/) + end + end end From 912ffe40efa64409db0fcc18db1864e661c008c3 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 18 Dec 2024 16:23:05 -0500 Subject: [PATCH 088/114] Appease standardrb --- spec/unit/verifier_spec.rb | 1 - 1 file changed, 1 deletion(-) diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index c6a15de..349e313 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -62,7 +62,6 @@ module PostZephirProcessing verifier.verify_file(path: tmpfile) expect(verifier.errors).to include(/^PostZephirProcessing::Verifier: .*/) end - 
end end From fdeadcc903c6cdb0acfffdcd28e774197d4c978b Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Wed, 18 Dec 2024 16:25:38 -0500 Subject: [PATCH 089/114] Batch rights_current checks in PopulateRightsVerifier; refactor tests --- lib/verifier/populate_rights_verifier.rb | 44 +++++++++++++++----- spec/unit/populate_rights_verifier_spec.rb | 48 +++++++++++----------- 2 files changed, 57 insertions(+), 35 deletions(-) diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index 5c07f84..d0a273d 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -17,8 +17,14 @@ module PostZephirProcessing # We may also look for errors in the output logs (postZephir.pm and/or populate_rights_data.pl?) # but that is out of scope for now. class PopulateRightsVerifier < Verifier - FULL_RIGHTS_TEMPLATE = "zephir_full_YYYYMMDD.rights" - UPD_RIGHTS_TEMPLATE = "zephir_upd_YYYYMMDD.rights" + # This is an efficient slice size we adopted for hathifiles based on experimental evidence + DEFAULT_SLICE_SIZE = 10_000 + attr_reader :slice_size + + def initialize(slice_size: DEFAULT_SLICE_SIZE) + @slice_size = slice_size + super() + end def run_for_date(date:) Derivative::Rights.derivatives_for_date(date: date).each do |derivative| @@ -30,20 +36,38 @@ def run_for_date(date:) end # Check each entry in the .rights file for an entry in `rights_current`. - # FIXME: this is likely to be very inefficient. - # Should accumulate a batch of HTIDs to query all in one go. - # See HathifileWriter#batch_extract_rights for a usable Sequel construct. def verify_rights_file(path:) - db = Services[:database] File.open(path) do |infile| + slice = Set.new infile.each_line do |line| line.strip! - htid = line.split("\t").first - namespace, id = htid.split(".", 2) - if db[:rights_current].where(namespace: namespace, id: id).count.zero? 
- error message: "missing rights_current for #{htid}" + slice << line.split("\t").first + if slice.count >= slice_size + find_missing_rights(htids: slice) + slice.clear end end + if slice.count.positive? + find_missing_rights(htids: slice) + end + end + end + + private + + # @param htids [Set] a Set of HTID strings to check against the database + # @return (not defined) + def find_missing_rights(htids:) + db_htids = Set.new + split_htids = htids.map { |htid| htid.split(".", 2) } + Services[:database][:rights_current] + .select(:namespace, :id) + .where([:namespace, :id] => split_htids) + .each do |record| + db_htids << record[:namespace] + "." + record[:id] + end + (htids - db_htids).each do |htid| + error message: "missing rights_current for #{htid}" end end end diff --git a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/populate_rights_verifier_spec.rb index 6bbc057..774e2a0 100644 --- a/spec/unit/populate_rights_verifier_spec.rb +++ b/spec/unit/populate_rights_verifier_spec.rb @@ -1,13 +1,17 @@ # frozen_string_literal: true require "verifier/populate_rights_verifier" +require "derivative/rights" +require "pry" module PostZephirProcessing RSpec.describe(PopulateRightsVerifier) do around(:each) do |example| with_test_environment do ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + Services[:database][:rights_current].truncate example.run + Services[:database][:rights_current].truncate end end end @@ -18,13 +22,15 @@ module PostZephirProcessing [rights, "ic", "bib", "bibrights", "aa"].join("\t") end.join("\n") end - let(:verifier) { described_class.new } + # Choose a small slice size to make sure we have leftovers after the main rights fetch loop. + let(:verifier) { described_class.new(slice_size: 3) } let(:db) { Services[:database][:rights_current] } # Creates a full or upd rights file in @tmpdir. def with_fake_rights_file(date:, full: false) - rights_file = File.join(@tmpdir, full ? 
described_class::FULL_RIGHTS_TEMPLATE : described_class::UPD_RIGHTS_TEMPLATE) - .sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) + rights_file = Derivative::Rights.derivatives_for_date(date: date) + .find { |derivative| derivative.full? == full } + .path File.write(rights_file, test_rights_file_contents) yield end @@ -37,32 +43,30 @@ def insert_fake_rights(namespace:, id:) context "with HTIDs in the rights database" do around(:each) do |example| - Services[:database][:rights_current].truncate - split_htids = test_rights.map { |htid| htid.split(".", 2) } split_htids.each do |split_htid| insert_fake_rights(namespace: split_htid[0], id: split_htid[1]) end - example.run - - Services[:database][:rights_current].truncate end describe "#run_for_date" do it "logs no `missing rights_current` error for full file" do - date = Date.new(2024, 11, 30) + date = Date.parse("2024-11-30") with_fake_rights_file(date: date, full: true) do verifier.run_for_date(date: date) - expect(verifier.errors).not_to include(/missing rights_current/) + # The only error is for the missing upd file. + expect(verifier.errors.count).to eq 1 + missing_rights_errors = verifier.errors.select { |err| /missing rights_current/.match? 
err } + expect(missing_rights_errors).to be_empty end end it "logs no `missing rights_current` error for update file" do - date = Date.new(2024, 12, 2) + date = Date.parse("2024-12-02") with_fake_rights_file(date: date) do verifier.run_for_date(date: date) - expect(verifier.errors).not_to include(/missing rights_current/) + expect(verifier.errors).to be_empty end end end @@ -75,28 +79,22 @@ def insert_fake_rights(namespace:, id:) end context "with no HTIDs in the rights database" do - around(:each) do |example| - Services[:database][:rights_current].truncate - - example.run - - Services[:database][:rights_current].truncate - end - describe "#run_for_date" do it "logs `missing rights_current` error for full file" do - date = Date.new(2024, 11, 30) + date = Date.parse("2024-11-30") with_fake_rights_file(date: date, full: true) do verifier.run_for_date(date: date) - expect(verifier.errors).to include(/missing rights_current/) + # There will be an error for the missing upd file, ignore it. + missing_rights_errors = verifier.errors.select { |err| /missing rights_current/.match? 
err } + expect(missing_rights_errors.count).to eq test_rights.count end end - it "logs `missing rights_current` error for update file" do - date = Date.new(2024, 12, 2) + it "logs an error for each HTID in the update file" do + date = Date.parse("2024-12-02") with_fake_rights_file(date: date) do verifier.run_for_date(date: date) - expect(verifier.errors).to include(/missing rights_current/) + expect(verifier.errors.count).to eq test_rights.count end end end From 609468755684f41a359db82c927224873aa3a44b Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 19 Dec 2024 10:33:23 -0500 Subject: [PATCH 090/114] nested conditions --- lib/verifier/post_zephir_verifier.rb | 59 ++++++++++++++-------------- 1 file changed, 30 insertions(+), 29 deletions(-) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 75baa45..3bc3c00 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -31,35 +31,36 @@ def run_for_date(date:) # line count must be the same as input JSON def verify_catalog_archive(date: current_date) zephir_update_path = Derivative::CatalogArchive.new(date: date, full: false).path - verify_file(path: zephir_update_path) - verify_parseable_ndj(path: zephir_update_path) - - if date.last_of_month? 
- ht_bib_export_derivative_params = { - location: :ZEPHIR_DATA, - name: "ht_bib_export_full_YYYY-MM-DD.json.gz", - date: date - } - output_path = Derivative::CatalogArchive.new(date: date, full: true).path - verify_file(path: output_path) - verify_parseable_ndj(path: output_path) - output_linecount = gzip_linecount(path: output_path) - - input_path = self.class.dated_derivative(**ht_bib_export_derivative_params) - verify_file(path: input_path) - verify_parseable_ndj(path: input_path) - input_linecount = gzip_linecount(path: input_path) - - if output_linecount != input_linecount - error( - message: sprintf( - "output line count (%s = %s) != input line count (%s = %s)", - output_path, - output_linecount, - input_path, - input_linecount - ) - ) + if verify_file(path: zephir_update_path) + if verify_parseable_ndj(path: zephir_update_path) + if date.last_of_month? + ht_bib_export_derivative_params = { + location: :ZEPHIR_DATA, + name: "ht_bib_export_full_YYYY-MM-DD.json.gz", + date: date + } + output_path = Derivative::CatalogArchive.new(date: date, full: true).path + verify_file(path: output_path) + verify_parseable_ndj(path: output_path) + output_linecount = gzip_linecount(path: output_path) + + input_path = self.class.dated_derivative(**ht_bib_export_derivative_params) + verify_file(path: input_path) + verify_parseable_ndj(path: input_path) + input_linecount = gzip_linecount(path: input_path) + + if output_linecount != input_linecount + error( + message: sprintf( + "output line count (%s = %s) != input line count (%s = %s)", + output_path, + output_linecount, + input_path, + input_linecount + ) + ) + end + end end end end From 2970aa53b338db04d15dffe2e1117d8290a9d190 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 19 Dec 2024 10:40:56 -0500 Subject: [PATCH 091/114] PostZephirVerifier.verify_catalog_prep now using Derivative::Delete --- lib/verifier/post_zephir_verifier.rb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git 
a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 3bc3c00..3456243 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -75,9 +75,9 @@ def verify_catalog_archive(date: current_date) # readable # TODO: deletes file is combination of two component files in TMPDIR? def verify_catalog_prep(date: current_date) - delete_file = self.class.dated_derivative(location: :CATALOG_PREP, name: "zephir_upd_YYYYMMDD_delete.txt.gz", date: date) - if verify_file(path: delete_file) - verify_deletes_contents(path: delete_file) + delete_file = Derivative::Delete.new(date: date, full: false) + if verify_file(path: delete_file.path) + verify_deletes_contents(path: delete_file.path) end Derivative::CatalogPrep.derivatives_for_date(date: date).each do |derivative| From 5498a332cc98ba6227f66837e22f22fba18f2f7d Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 19 Dec 2024 12:03:28 -0500 Subject: [PATCH 092/114] added, implemented and tested Derivative::IngestBibrecord --- lib/derivative/ingest_bibrecord.rb | 26 ++++++++++++++ lib/derivatives.rb | 1 + lib/verifier/post_zephir_verifier.rb | 6 ++-- spec/unit/derivative/ingest_bibrecord_spec.rb | 35 +++++++++++++++++++ 4 files changed, 65 insertions(+), 3 deletions(-) create mode 100644 lib/derivative/ingest_bibrecord.rb create mode 100644 spec/unit/derivative/ingest_bibrecord_spec.rb diff --git a/lib/derivative/ingest_bibrecord.rb b/lib/derivative/ingest_bibrecord.rb new file mode 100644 index 0000000..dbf289b --- /dev/null +++ b/lib/derivative/ingest_bibrecord.rb @@ -0,0 +1,26 @@ +require "derivative" + +module PostZephirProcessing + class Derivative::IngestBibrecord < Derivative + attr_reader :name + + def initialize(name:) + @name = name + end + + def path + Verifier.derivative(location: :INGEST_BIBRECORDS, name: name) + end + + def self.derivatives_for_date(date:) + if date.last_of_month? 
+ [ + new(name: "groove_full.tsv.gz"), + new(name: "zephir_ingested_items.txt.gz") + ] + else + [] + end + end + end +end diff --git a/lib/derivatives.rb b/lib/derivatives.rb index b850bcf..9b20e71 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -18,6 +18,7 @@ class Derivatives STANDARD_LOCATIONS = [ :CATALOG_ARCHIVE, :CATALOG_PREP, + :INGEST_BIBRECORDS, :RIGHTS_ARCHIVE, :TMPDIR, :WWW_DIR diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 3456243..051b064 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -6,6 +6,7 @@ require_relative "../derivative/dollar_dup" require_relative "../derivative/catalog" require_relative "../derivative/rights" +require_relative "../derivative/ingest_bibrecord" # Verifies that post_zephir workflow stage did what it was supposed to. @@ -135,9 +136,8 @@ def verify_dollar_dup(date: current_date) # Contents: TODO # Verify: readable def verify_ingest_bibrecords(date: current_date) - if date.last_of_month? 
- verify_file(path: self.class.derivative(location: :INGEST_BIBRECORDS, name: "groove_full.tsv.gz")) - verify_file(path: self.class.derivative(location: :INGEST_BIBRECORDS, name: "zephir_ingested_items.txt.gz")) + Derivative::IngestBibrecord.derivatives_for_date(date: date).each do |derivative| + verify_file(path: derivative.path) end end diff --git a/spec/unit/derivative/ingest_bibrecord_spec.rb b/spec/unit/derivative/ingest_bibrecord_spec.rb new file mode 100644 index 0000000..6840d97 --- /dev/null +++ b/spec/unit/derivative/ingest_bibrecord_spec.rb @@ -0,0 +1,35 @@ +# frozen_string_literal: true + +require "derivative" +require "derivative/ingest_bibrecord" + +module PostZephirProcessing + RSpec.describe(Derivative::IngestBibrecord) do + around(:each) do |example| + with_test_environment do + ClimateControl.modify( + INGEST_BIBRECORDS: fixture("ingest_bibrecords") + ) do + example.run + end + end + end + + let(:test_date_last_of_month) { Date.parse("2023-11-30") } + + describe "#{described_class}.derivatives_for_date" do + it "returns 2 derivatives on the last of the month, otherwise 0" do + 1.upto(29) do |day| + date = Date.new(2023, 11, day) + expect(described_class.derivatives_for_date(date: date)).to be_empty + end + expect(described_class.derivatives_for_date(date: test_date_last_of_month).count).to eq 2 + end + it "reports the expected paths" do + derivative_paths = described_class.derivatives_for_date(date: test_date_last_of_month).map { |d| d.path } + expect(derivative_paths).to include(fixture("ingest_bibrecords/groove_full.tsv.gz")) + expect(derivative_paths).to include(fixture("ingest_bibrecords/zephir_ingested_items.txt.gz")) + end + end + end +end From c1aee80e775b64a8574ef48d52fbfaa3a236f89b Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 19 Dec 2024 12:17:58 -0500 Subject: [PATCH 093/114] made and implemented Derivative::HTBibExport --- lib/derivative/ht_bib_export.rb | 12 ++++++++++++ lib/verifier/post_zephir_verifier.rb | 9 ++------- 2 
files changed, 14 insertions(+), 7 deletions(-) create mode 100644 lib/derivative/ht_bib_export.rb diff --git a/lib/derivative/ht_bib_export.rb b/lib/derivative/ht_bib_export.rb new file mode 100644 index 0000000..2966f0b --- /dev/null +++ b/lib/derivative/ht_bib_export.rb @@ -0,0 +1,12 @@ +require "derivative" + +module PostZephirProcessing + class Derivative::HTBibExport < Derivative + def template + { + location: :ZEPHIR_DATA, + name: "ht_bib_export_full_YYYY-MM-DD.json.gz" + } + end + end +end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 051b064..068e547 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -7,6 +7,7 @@ require_relative "../derivative/catalog" require_relative "../derivative/rights" require_relative "../derivative/ingest_bibrecord" +require_relative "../derivative/ht_bib_export" # Verifies that post_zephir workflow stage did what it was supposed to. @@ -35,17 +36,11 @@ def verify_catalog_archive(date: current_date) if verify_file(path: zephir_update_path) if verify_parseable_ndj(path: zephir_update_path) if date.last_of_month? 
- ht_bib_export_derivative_params = { - location: :ZEPHIR_DATA, - name: "ht_bib_export_full_YYYY-MM-DD.json.gz", - date: date - } output_path = Derivative::CatalogArchive.new(date: date, full: true).path verify_file(path: output_path) verify_parseable_ndj(path: output_path) output_linecount = gzip_linecount(path: output_path) - - input_path = self.class.dated_derivative(**ht_bib_export_derivative_params) + input_path = Derivative::HTBibExport.new(date: date, full: true).path verify_file(path: input_path) verify_parseable_ndj(path: input_path) input_linecount = gzip_linecount(path: input_path) From 46a86d59f63170dcc0b31227dbffaa99a3e2f58a Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 19 Dec 2024 12:27:21 -0500 Subject: [PATCH 094/114] changed require_relative to require where possible --- lib/derivatives.rb | 2 +- lib/verifier.rb | 6 +++--- lib/verifier/catalog_index_verifier.rb | 4 ++-- lib/verifier/hathifiles_contents_verifier.rb | 2 +- lib/verifier/hathifiles_database_verifier.rb | 6 +++--- lib/verifier/hathifiles_listing_verifier.rb | 6 +++--- lib/verifier/hathifiles_redirects_verifier.rb | 2 +- lib/verifier/hathifiles_verifier.rb | 4 ++-- lib/verifier/populate_rights_verifier.rb | 4 ++-- lib/verifier/post_zephir_verifier.rb | 14 +++++++------- 10 files changed, 25 insertions(+), 25 deletions(-) diff --git a/lib/derivatives.rb b/lib/derivatives.rb index 9b20e71..e0d0e6b 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -1,6 +1,6 @@ # frozen_string_literal: true -require_relative "dates" +require "dates" require "derivative" require "derivative/catalog" require "derivative/delete" diff --git a/lib/verifier.rb b/lib/verifier.rb index e94e661..1c30ed8 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -1,8 +1,8 @@ # frozen_string_literal: true -require_relative "derivatives" -require_relative "journal" -require_relative "services" +require "derivatives" +require "journal" +require "services" # Common superclass for all things Verifier. 
# Right now the only thing I can think of to put here is shared diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 0d1c5b3..02b5cfc 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -1,8 +1,8 @@ # frozen_string_literal: true require "faraday" -require_relative "../verifier" -require_relative "../derivatives" +require "verifier" +require "derivatives" # Verifies that catalog indexing workflow stage did what it was supposed to. diff --git a/lib/verifier/hathifiles_contents_verifier.rb b/lib/verifier/hathifiles_contents_verifier.rb index d9dff8b..ee13df1 100644 --- a/lib/verifier/hathifiles_contents_verifier.rb +++ b/lib/verifier/hathifiles_contents_verifier.rb @@ -1,7 +1,7 @@ # frozen_string_literal: true require "zlib" -require_relative "../verifier" +require "verifier" # Verifies that hathifiles workflow stage did what it was supposed to. diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index 38d4e6d..1bf7469 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -2,9 +2,9 @@ require "zlib" -require_relative "../verifier" -require_relative "../derivatives" -require_relative "../derivative/hathifile" +require "verifier" +require "derivatives" +require "derivative/hathifile" module PostZephirProcessing class HathifilesDatabaseVerifier < Verifier diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index 25fd0f7..a41539b 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -1,8 +1,8 @@ # frozen_string_literal: true -require_relative "../verifier" -require_relative "../derivatives" -require_relative "../derivative/hathifile_www" +require "verifier" +require "derivatives" +require "derivative/hathifile_www" require "json" require "set" diff --git 
a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index 5e61a49..97f33da 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -1,6 +1,6 @@ # frozen_string_literal: true -require_relative "../verifier" +require "verifier" module PostZephirProcessing class HathifileRedirectsVerifier < Verifier diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 6a19fc5..0f7dc06 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -1,9 +1,9 @@ # frozen_string_literal: true require "zlib" +require "verifier" require_relative "hathifiles_contents_verifier" -require_relative "../verifier" -require_relative "../derivative/hathifile" +require "derivative/hathifile" # Verifies that hathifiles workflow stage did what it was supposed to. diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index d0a273d..16918a9 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -1,7 +1,7 @@ # frozen_string_literal: true -require_relative "../verifier" -require_relative "../derivatives" +require "verifier" +require "derivatives" module PostZephirProcessing # The PostZephirVerifier checks for the existence and readability of the .rights files. 
diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 068e547..2b4c3a8 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -1,13 +1,13 @@ # frozen_string_literal: true require "zlib" -require_relative "../verifier" -require_relative "../derivatives" -require_relative "../derivative/dollar_dup" -require_relative "../derivative/catalog" -require_relative "../derivative/rights" -require_relative "../derivative/ingest_bibrecord" -require_relative "../derivative/ht_bib_export" +require "verifier" +require "derivatives" +require "derivative/dollar_dup" +require "derivative/catalog" +require "derivative/rights" +require "derivative/ingest_bibrecord" +require "derivative/ht_bib_export" # Verifies that post_zephir workflow stage did what it was supposed to. From aa73c64e8e294f9df5d9643406079ee49dcdb633 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 14:23:14 -0500 Subject: [PATCH 095/114] Clean up some uses of ClimateControl * pull repeated uses to :around blocks * override TMPDIR by default --- spec/spec_helper.rb | 5 +- spec/unit/post_zephir_verifier_spec.rb | 189 ++++++++++++------------- 2 files changed, 95 insertions(+), 99 deletions(-) diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 1b6bd8b..a2b5e51 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -48,7 +48,10 @@ def test_journal_dates def with_test_environment Dir.mktmpdir do |tmpdir| - ClimateControl.modify(DATA_ROOT: tmpdir) do + ClimateControl.modify( + DATA_ROOT: tmpdir, + TMPDIR: tmpdir + ) do File.open(File.join(tmpdir, "journal.yml"), "w") { |f| f.puts test_journal } # Maybe we don't need to yield `tmpdir` since we're also assigning it to an # instance variable. Leaving it for now in case the ivar approach leads to funny business. 
diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 4b6294c..06466aa 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -140,28 +140,30 @@ def expect_deletefile_ok(contents) end describe "#verify_catalog_prep" do + around(:each) do |example| + ClimateControl.modify(CATALOG_PREP: @tmpdir) do + example.run + end + end + test_date = Date.parse("2024-11-30") context "with all the expected files" do it "reports no errors" do # Create and test upd, full, and deletes in @tmpdir/catalog_prep - ClimateControl.modify(CATALOG_PREP: @tmpdir) do - FileUtils.cp(fixture(File.join("catalog_archive", "zephir_full_20241130_vufind.json.gz")), @tmpdir) - FileUtils.cp(fixture(File.join("catalog_archive", "zephir_upd_20241130.json.gz")), @tmpdir) - FileUtils.cp(fixture(File.join("catalog_prep", "zephir_upd_20241130_delete.txt.gz")), @tmpdir) - verifier = described_class.new - verifier.verify_catalog_prep(date: test_date) - expect(verifier.errors.count).to eq 0 - end + FileUtils.cp(fixture(File.join("catalog_archive", "zephir_full_20241130_vufind.json.gz")), @tmpdir) + FileUtils.cp(fixture(File.join("catalog_archive", "zephir_upd_20241130.json.gz")), @tmpdir) + FileUtils.cp(fixture(File.join("catalog_prep", "zephir_upd_20241130_delete.txt.gz")), @tmpdir) + verifier = described_class.new + verifier.verify_catalog_prep(date: test_date) + expect(verifier.errors.count).to eq 0 end end context "without any of the expected files" do it "reports an error for each of the three missing files" do - ClimateControl.modify(CATALOG_PREP: @tmpdir) do - verifier = described_class.new - verifier.verify_catalog_prep(date: test_date) - expect(verifier.errors.count).to eq 3 - end + verifier = described_class.new + verifier.verify_catalog_prep(date: test_date) + expect(verifier.errors.count).to eq 3 end end end @@ -170,89 +172,83 @@ def expect_deletefile_ok(contents) test_date = Date.parse("2024-12-01") context 
"with empty file" do it "reports no errors" do - ClimateControl.modify(TMPDIR: @tmpdir) do - dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") - Zinzout.zout(dollar_dup_path) { |output_gz| } - verifier = described_class.new - verifier.verify_dollar_dup(date: test_date) - expect(verifier.errors).to eq [] - end + dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") + Zinzout.zout(dollar_dup_path) { |output_gz| } + verifier = described_class.new + verifier.verify_dollar_dup(date: test_date) + expect(verifier.errors).to eq [] end end context "with nonempty file" do it "reports one `spurious dollar_dup lines` error" do - ClimateControl.modify(TMPDIR: @tmpdir) do - dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") - Zinzout.zout(dollar_dup_path) do |output_gz| - output_gz.puts <<~GZ - uc1.b275234 - uc1.b85271 - uc1.b312920 - uc1.b257214 - uc1.b316327 - uc1.b23918 - uc1.b95355 - uc1.b183819 - uc1.b197217 - GZ - end - verifier = described_class.new - verifier.verify_dollar_dup(date: test_date) - expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/spurious dollar_dup lines/) + dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") + Zinzout.zout(dollar_dup_path) do |output_gz| + output_gz.puts <<~GZ + uc1.b275234 + uc1.b85271 + uc1.b312920 + uc1.b257214 + uc1.b316327 + uc1.b23918 + uc1.b95355 + uc1.b183819 + uc1.b197217 + GZ end + verifier = described_class.new + verifier.verify_dollar_dup(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/spurious dollar_dup lines/) end end context "with missing file" do it "reports one `not found` error" do - ClimateControl.modify(TMPDIR: @tmpdir) do - verifier = described_class.new - verifier.verify_dollar_dup(date: test_date) - expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/.*not 
found.*dollar_dup.*/) - end + verifier = described_class.new + verifier.verify_dollar_dup(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/.*not found.*dollar_dup.*/) end end end describe "#verify_ingest_bibrecords" do + around(:each) do |example| + ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do + example.run + end + end + context "last day of month" do test_date = Date.parse("2024-11-30") context "with expected groove_full and zephir_ingested_items files" do it "reports no errors" do - ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do - FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) - FileUtils.touch(File.join(@tmpdir, "zephir_ingested_items.txt.gz")) - verifier = described_class.new - verifier.verify_ingest_bibrecords(date: test_date) - expect(verifier.errors.count).to eq 0 - end + FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) + FileUtils.touch(File.join(@tmpdir, "zephir_ingested_items.txt.gz")) + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 0 end end context "missing zephir_ingested_items" do it "reports one `not found` error" do - ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do - FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) - verifier = described_class.new - verifier.verify_ingest_bibrecords(date: test_date) - expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/.*not found.*zephir_ingested_items.*/) - end + FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/.*not found.*zephir_ingested_items.*/) end end context "missing groove_full" do it "reports one `not found` error" do - ClimateControl.modify(INGEST_BIBRECORDS: @tmpdir) do - FileUtils.touch(File.join(@tmpdir, "zephir_ingested_items.txt.gz")) - 
verifier = described_class.new - verifier.verify_ingest_bibrecords(date: test_date) - expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/.*not found.*groove_full.*/) - end + FileUtils.touch(File.join(@tmpdir, "zephir_ingested_items.txt.gz")) + verifier = described_class.new + verifier.verify_ingest_bibrecords(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/.*not found.*groove_full.*/) end end end @@ -268,33 +264,34 @@ def expect_deletefile_ok(contents) end describe "#verify_rights" do + around(:each) do |example| + ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do + example.run + end + end context "last day of month" do test_date = Date.parse("2024-11-30") context "with full and update rights files" do it "reports no errors" do - ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do - verifier = described_class.new - upd_rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) - upd_rights_path = File.join(@tmpdir, upd_rights_file) - File.write(upd_rights_path, well_formed_rights_file_content) - full_rights_file = "zephir_full_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) - full_rights_path = File.join(@tmpdir, full_rights_file) - File.write(full_rights_path, well_formed_rights_file_content) - verifier.verify_rights(date: test_date) - expect(verifier.errors.count).to eq 0 - end + verifier = described_class.new + upd_rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + upd_rights_path = File.join(@tmpdir, upd_rights_file) + File.write(upd_rights_path, well_formed_rights_file_content) + full_rights_file = "zephir_full_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + full_rights_path = File.join(@tmpdir, full_rights_file) + File.write(full_rights_path, well_formed_rights_file_content) + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 0 end end context "with no rights 
files" do it "reports two `not found` errors" do - ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do - verifier = described_class.new - verifier.verify_rights(date: test_date) - expect(verifier.errors.count).to eq 2 - verifier.errors.each do |err| - expect(err).to include(/not found.*rights/) - end + verifier = described_class.new + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 2 + verifier.errors.each do |err| + expect(err).to include(/not found.*rights/) end end end @@ -304,25 +301,21 @@ def expect_deletefile_ok(contents) test_date = Date.parse("2024-12-01") context "with update rights file" do it "reports no errors" do - ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do - verifier = described_class.new - rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) - rights_path = File.join(@tmpdir, rights_file) - File.write(rights_path, well_formed_rights_file_content) - verifier.verify_rights(date: test_date) - expect(verifier.errors.count).to eq 0 - end + verifier = described_class.new + rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + rights_path = File.join(@tmpdir, rights_file) + File.write(rights_path, well_formed_rights_file_content) + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 0 end end context "missing update rights file" do it "reports one `not found` error" do - ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do - verifier = described_class.new - verifier.verify_rights(date: test_date) - expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/.*not found.*rights.*/) - end + verifier = described_class.new + verifier.verify_rights(date: test_date) + expect(verifier.errors.count).to eq 1 + expect(verifier.errors).to include(/.*not found.*rights.*/) end end end From 6deabe3e5f42802cbc0b1ac3faa7eaf0c81f4969 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 14:23:59 -0500 Subject: 
[PATCH 096/114] Remove directory_for; datestamped_derivative * Use env vars directly in derivative, or extract logic from directory_for where needed * Move filename logic from Verifier to Derivative --- lib/derivative.rb | 10 +++++++++- lib/derivative/catalog.rb | 4 ++-- lib/derivative/delete.rb | 2 +- lib/derivative/dollar_dup.rb | 2 +- lib/derivative/hathifile.rb | 2 +- lib/derivative/hathifile_www.rb | 2 +- lib/derivative/ht_bib_export.rb | 2 +- lib/derivative/ingest_bibrecord.rb | 2 +- lib/derivative/rights.rb | 2 +- lib/derivatives.rb | 28 ---------------------------- lib/verifier.rb | 18 ------------------ spec/unit/derivatives_spec.rb | 16 ---------------- 12 files changed, 18 insertions(+), 72 deletions(-) diff --git a/lib/derivative.rb b/lib/derivative.rb index fce2dfb..2168ee5 100644 --- a/lib/derivative.rb +++ b/lib/derivative.rb @@ -15,7 +15,15 @@ def full? end def path - Verifier.dated_derivative(**template, date: date) + File.join( + template[:location], + datestamped_file + ) + end + + def datestamped_file + template[:name].sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) + .sub(/YYYY-MM-DD/i, date.strftime("%Y-%m-%d")) end def self.derivatives_for_date(date:) diff --git a/lib/derivative/catalog.rb b/lib/derivative/catalog.rb index 1f877b8..bc47b98 100644 --- a/lib/derivative/catalog.rb +++ b/lib/derivative/catalog.rb @@ -40,13 +40,13 @@ def filename_template class Derivative::CatalogArchive < Derivative::Catalog def location - :CATALOG_ARCHIVE + ENV["CATALOG_ARCHIVE"] end end class Derivative::CatalogPrep < Derivative::Catalog def location - :CATALOG_PREP + ENV["CATALOG_PREP"] end end end diff --git a/lib/derivative/delete.rb b/lib/derivative/delete.rb index 1dacb30..1d47c43 100644 --- a/lib/derivative/delete.rb +++ b/lib/derivative/delete.rb @@ -13,7 +13,7 @@ def self.derivatives_for_date(date:) def template { - location: :CATALOG_PREP, + location: ENV["CATALOG_PREP"], name: "zephir_upd_YYYYMMDD_delete.txt.gz" } end diff --git 
a/lib/derivative/dollar_dup.rb b/lib/derivative/dollar_dup.rb index 6e3447a..2ac220f 100644 --- a/lib/derivative/dollar_dup.rb +++ b/lib/derivative/dollar_dup.rb @@ -18,7 +18,7 @@ def self.derivatives_for_date(date:) def template { - location: :TMPDIR, + location: ENV["TMPDIR"] || File.join(ENV["DATA_ROOT"], "work"), name: "vufind_incremental_YYYY-MM-DD_dollar_dup.txt.gz" } end diff --git a/lib/derivative/hathifile.rb b/lib/derivative/hathifile.rb index c97cbd5..0fe0ce7 100644 --- a/lib/derivative/hathifile.rb +++ b/lib/derivative/hathifile.rb @@ -22,7 +22,7 @@ def self.derivatives_for_date(date:) def template { - location: :HATHIFILE_ARCHIVE, + location: ENV["HATHIFILE_ARCHIVE"], name: "hathi_#{fullness}_YYYYMMDD.txt.gz" } end diff --git a/lib/derivative/hathifile_www.rb b/lib/derivative/hathifile_www.rb index 789d611..4967dce 100644 --- a/lib/derivative/hathifile_www.rb +++ b/lib/derivative/hathifile_www.rb @@ -26,7 +26,7 @@ def self.json_path def template { - location: :WWW_DIR, + location: ENV["WWW_DIR"], name: "hathi_#{fullness}_YYYYMMDD.txt.gz" } end diff --git a/lib/derivative/ht_bib_export.rb b/lib/derivative/ht_bib_export.rb index 2966f0b..813f97f 100644 --- a/lib/derivative/ht_bib_export.rb +++ b/lib/derivative/ht_bib_export.rb @@ -4,7 +4,7 @@ module PostZephirProcessing class Derivative::HTBibExport < Derivative def template { - location: :ZEPHIR_DATA, + location: ENV["ZEPHIR_DATA"], name: "ht_bib_export_full_YYYY-MM-DD.json.gz" } end diff --git a/lib/derivative/ingest_bibrecord.rb b/lib/derivative/ingest_bibrecord.rb index dbf289b..494835d 100644 --- a/lib/derivative/ingest_bibrecord.rb +++ b/lib/derivative/ingest_bibrecord.rb @@ -9,7 +9,7 @@ def initialize(name:) end def path - Verifier.derivative(location: :INGEST_BIBRECORDS, name: name) + File.join(ENV["INGEST_BIBRECORDS"], name) end def self.derivatives_for_date(date:) diff --git a/lib/derivative/rights.rb b/lib/derivative/rights.rb index 1cec366..2e83097 100644 --- a/lib/derivative/rights.rb +++ 
b/lib/derivative/rights.rb @@ -22,7 +22,7 @@ def self.derivatives_for_date(date:) def template { - location: :RIGHTS_ARCHIVE, + location: ENV["RIGHTS_ARCHIVE"] || File.join(ENV.fetch("RIGHTS_DIR"), "archive"), name: "zephir_#{fullness}_YYYYMMDD.rights" } end diff --git a/lib/derivatives.rb b/lib/derivatives.rb index e0d0e6b..e3a9778 100644 --- a/lib/derivatives.rb +++ b/lib/derivatives.rb @@ -14,36 +14,8 @@ module PostZephirProcessing # TODO: this class may be renamed PostZephirDerivatives once directory_for is updated, # moved, or elimminated. class Derivatives - # TODO: STANDARD_LOCATIONS is only used for testing directory_for and may be eliminated. - STANDARD_LOCATIONS = [ - :CATALOG_ARCHIVE, - :CATALOG_PREP, - :INGEST_BIBRECORDS, - :RIGHTS_ARCHIVE, - :TMPDIR, - :WWW_DIR - ].freeze - attr_reader :dates - # Translate a known file destination as an environment variable key - # into the path via ENV or a default. - # @return [String] path to the directory - def self.directory_for(location:) - location = location.to_s - case location - - when "CATALOG_ARCHIVE", "HATHIFILE_ARCHIVE", "CATALOG_PREP", "INGEST_BIBRECORDS", "RIGHTS_DIR", "WWW_DIR", "ZEPHIR_DATA" - ENV.fetch location - when "RIGHTS_ARCHIVE" - ENV["RIGHTS_ARCHIVE"] || File.join(ENV.fetch("RIGHTS_DIR"), "archive") - when "TMPDIR" - ENV["TMPDIR"] || File.join(ENV.fetch("DATA_ROOT"), "work") - else - raise "Unknown location #{location.inspect}" - end - end - # @param date [Date] the file datestamp date, not the "run date" def initialize(date: (Date.today - 1)) @dates = Dates.new(date: date) diff --git a/lib/verifier.rb b/lib/verifier.rb index 1c30ed8..7b99a3f 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -12,24 +12,6 @@ module PostZephirProcessing class Verifier attr_reader :journal, :errors - def self.datestamped_file(name:, date:) - name.sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) - .sub(/YYYY-MM-DD/i, date.strftime("%Y-%m-%d")) - end - - # TODO: see if we want to move this to Derivatives class - def 
self.dated_derivative(location:, name:, date:) - File.join( - Derivatives.directory_for(location: location), - datestamped_file(name: name, date: date) - ) - end - - # TODO: see if we want to move this to Derivatives class - def self.derivative(location:, name:) - File.join(Derivatives.directory_for(location: location), name) - end - # Generally, needs a Journal in order to know what to look for. def initialize @journal = Journal.from_yaml diff --git a/spec/unit/derivatives_spec.rb b/spec/unit/derivatives_spec.rb index 1cd9974..53f62db 100644 --- a/spec/unit/derivatives_spec.rb +++ b/spec/unit/derivatives_spec.rb @@ -12,22 +12,6 @@ module PostZephirProcessing end end - describe ".directory_for" do - context "with known locations" do - Derivatives::STANDARD_LOCATIONS.each do |loc_name| - it "returns a string for #{loc_name}" do - expect(described_class.directory_for(location: loc_name)).to be_a(String) - end - end - end - - context "with an unknown location" do - it "raises" do - expect { described_class.directory_for(location: :NO_SUCH_LOC) }.to raise_error(StandardError) - end - end - end - describe ".new" do it "creates a Derivatives" do expect(described_class.new).to be_an_instance_of(Derivatives) From 48d1a6471ccf2b66c08da375cdb09dcf4124ab75 Mon Sep 17 00:00:00 2001 From: Martin Warin Date: Thu, 19 Dec 2024 14:42:15 -0500 Subject: [PATCH 097/114] dropped zinzout, using zlib everywhere for consistency --- Gemfile | 1 - Gemfile.lock | 2 -- spec/unit/hathifiles_redirects_verifier_spec.rb | 4 ++-- spec/unit/post_zephir_verifier_spec.rb | 8 ++++---- 4 files changed, 6 insertions(+), 9 deletions(-) diff --git a/Gemfile b/Gemfile index 2f51f48..0fec320 100644 --- a/Gemfile +++ b/Gemfile @@ -7,7 +7,6 @@ gem "dotenv" gem "faraday" gem "mysql2" gem "sequel" -gem "zinzout" group :development, :test do gem "climate_control" diff --git a/Gemfile.lock b/Gemfile.lock index e473525..261d524 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -101,7 +101,6 @@ GEM addressable (>=
2.8.0) crack (>= 0.3.2) hashdiff (>= 0.4.0, < 2.0.0) - zinzout (0.1.1) PLATFORMS aarch64-linux @@ -120,7 +119,6 @@ DEPENDENCIES simplecov-lcov standardrb webmock - zinzout BUNDLED WITH 2.5.23 diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index 8f42539..1016564 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -1,7 +1,7 @@ # frozen_string_literal: true require "verifier/hathifiles_redirects_verifier" -require "zinzout" +require "zlib" module PostZephirProcessing RSpec.describe(HathifileRedirectsVerifier) do @@ -42,7 +42,7 @@ def stage_redirects_history_file # Intentionally add mess to an otherwise wellformed file to trigger errors def malform(file) - Zinzout.zout(file) do |outfile| + Zlib::GzipWriter.open(file) do |outfile| outfile.puts mess end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 06466aa..3efd13c 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -1,7 +1,7 @@ # frozen_string_literal: true require "verifier/post_zephir_verifier" -require "zinzout" +require "zlib" module PostZephirProcessing RSpec.describe(PostZephirVerifier) do @@ -127,7 +127,7 @@ def expect_deletefile_ok(contents) # Make a temporary ht_bib_export with just 1 line to trigger error ClimateControl.modify(ZEPHIR_DATA: "/tmp/test/zephir_data") do FileUtils.mkdir_p(ENV["ZEPHIR_DATA"]) - Zinzout.zout(File.join(ENV["ZEPHIR_DATA"], "ht_bib_export_full_2024-11-30.json.gz")) do |gz| + Zlib::GzipWriter.open(File.join(ENV["ZEPHIR_DATA"], "ht_bib_export_full_2024-11-30.json.gz")) do |gz| gz.puts "{ \"this file\": \"too short\" }" end # The other unmodified fixtures in CATALOG_ARCHIVE should @@ -173,7 +173,7 @@ def expect_deletefile_ok(contents) context "with empty file" do it "reports no errors" do dollar_dup_path = File.join(@tmpdir, 
"vufind_incremental_2024-12-01_dollar_dup.txt.gz") - Zinzout.zout(dollar_dup_path) { |output_gz| } + Zlib::GzipWriter.open(dollar_dup_path) { |output_gz| } verifier = described_class.new verifier.verify_dollar_dup(date: test_date) expect(verifier.errors).to eq [] @@ -183,7 +183,7 @@ def expect_deletefile_ok(contents) context "with nonempty file" do it "reports one `spurious dollar_dup lines` error" do dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") - Zinzout.zout(dollar_dup_path) do |output_gz| + Zlib::GzipWriter.open(dollar_dup_path) do |output_gz| output_gz.puts <<~GZ uc1.b275234 uc1.b85271 From 1435ea674cce55e928909864a2cf0ba5c8a7be81 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 15:22:15 -0500 Subject: [PATCH 098/114] Refactor derivatives integration spec * move helpers out of spec helper to derivatives * use ClimateControl * move test for all missing files to derivatives integration * raise exception if requesting nonexistent 'full' deletes file --- lib/derivative/delete.rb | 5 + lib/verifier/post_zephir_verifier.rb | 2 +- .../derivatives_integration_spec.rb | 153 +++++++++++++----- spec/spec_helper.rb | 79 +-------- spec/unit/derivative/delete_spec.rb | 5 +- spec/unit/derivatives_spec.rb | 23 --- 6 files changed, 122 insertions(+), 145 deletions(-) diff --git a/lib/derivative/delete.rb b/lib/derivative/delete.rb index 1d47c43..200b76b 100644 --- a/lib/derivative/delete.rb +++ b/lib/derivative/delete.rb @@ -2,6 +2,11 @@ module PostZephirProcessing class Derivative::Delete < Derivative + def initialize(date:, full: false) + raise ArgumentError, "'deletes' has no full version" if full + super + end + def self.derivatives_for_date(date:) [ new( diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 2b4c3a8..1e1a3aa 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -71,7 +71,7 @@ def verify_catalog_archive(date: 
current_date) # readable # TODO: deletes file is combination of two component files in TMPDIR? def verify_catalog_prep(date: current_date) - delete_file = Derivative::Delete.new(date: date, full: false) + delete_file = Derivative::Delete.new(date: date) if verify_file(path: delete_file.path) verify_deletes_contents(path: delete_file.path) end diff --git a/spec/integration/derivatives_integration_spec.rb b/spec/integration/derivatives_integration_spec.rb index 1485bc7..0b87bdd 100644 --- a/spec/integration/derivatives_integration_spec.rb +++ b/spec/integration/derivatives_integration_spec.rb @@ -3,63 +3,128 @@ require "date" require "fileutils" require "tmpdir" +require "derivative/catalog" +require "derivative/delete" +require "derivative/rights" -RSpec.describe "Derivatives Integration" do - around(:each) do |example| - Dir.mktmpdir do |tmpdir| - setup_test_dirs(parent_dir: tmpdir) - setup_test_files(date: Date.parse("2023-11-29")) - example.run +module PostZephirProcessing + RSpec.describe Derivatives do + def catalog_prep_dir + File.join(@tmpdir, "catalog_prep") end - end - describe "all files present" do - it "returns nil" do - mi = PostZephirProcessing::Derivatives.new(date: Date.parse("2023-11-29")) - expect(mi.earliest_missing_date).to be_nil + def rights_dir + File.join(@tmpdir, "rights") end - end - describe "one date missing" do - it "returns the earliest" do - date = Date.parse("2023-11-03") - FileUtils.rm update_rights_file_for_date(date: date) - mi = PostZephirProcessing::Derivatives.new(date: Date.parse("2023-11-29")) - expect(mi.earliest_missing_date).to eq date + def rights_archive_dir + File.join(@tmpdir, "rights_archive") end - end - describe "monthly missing" do - it "returns the last day of the last month" do - date = Date.parse("2023-10-31") - FileUtils.rm full_file_for_date(date: date) - mi = PostZephirProcessing::Derivatives.new(date: Date.parse("2023-11-29")) - expect(mi.earliest_missing_date).to eq date + # Set derivative env vars, and 
populate test dir with the appropriate + # directories. + def with_test_dirs(parent_dir:) + ClimateControl.modify( + CATALOG_PREP: catalog_prep_dir, + RIGHTS_ARCHIVE: rights_archive_dir + ) do + [catalog_prep_dir, rights_archive_dir].each do |loc| + Dir.mkdir loc + end + + yield + end + end + + def full_rights_file_for_date(date:) + Derivative::Rights.new(date: date, full: true).path end - end - describe "different date in each category missing" do - it "returns the earliest" do - [ - delete_file_for_date(date: Date.parse("2023-11-20")), - update_file_for_date(date: Date.parse("2023-11-11")), - update_rights_file_for_date(date: Date.parse("2023-11-18")) - ].each do |file| - FileUtils.rm file + def update_rights_file_for_date(date:) + Derivative::Rights.new(date: date, full: false).path + end + + def full_file_for_date(date:) + Derivative::CatalogPrep.new(date: date, full: true).path + end + + def update_file_for_date(date:) + Derivative::CatalogPrep.new(date: date, full: false).path + end + + def delete_file_for_date(date:) + Derivative::Delete.new(date: date).path + end + + # @param date [Date] determines the month and year for the file datestamps + def setup_test_files(date:) + start_date = Date.new(date.year, date.month - 1, -1) + `touch #{full_file_for_date(date: start_date)}` + `touch #{full_rights_file_for_date(date: start_date)}` + end_date = Date.new(date.year, date.month, -2) + (start_date..end_date).each do |d| + `touch #{update_file_for_date(date: d)}` + `touch #{delete_file_for_date(date: d)}` + `touch #{update_rights_file_for_date(date: d)}` end - mi = PostZephirProcessing::Derivatives.new(date: Date.parse("2023-11-29")) - expect(mi.earliest_missing_date).to eq Date.parse("2023-11-11") end - end - describe "multiple dates missing" do - it "returns the earliest" do - dates = (Date.parse("2023-11-24")..Date.parse("2023-11-29")) - dates.each do |date| - FileUtils.rm delete_file_for_date(date: date) + around(:each) do |example| + with_test_environment do 
|tmpdir| + with_test_dirs(parent_dir: tmpdir) do + example.run + end + end + end + + it "with no files present, returns the last day of last month" do + expect(described_class + .new(date: Date.parse("2023-01-15")) + .earliest_missing_date) + .to eq Date.parse("2022-12-31") + end + + context "with test files" do + let(:date_for_run) { Date.parse("2023-11-29") } + let(:verifier) { described_class.new(date: date_for_run) } + + before(:each) { setup_test_files(date: date_for_run) } + + it "with all files present, returns nil" do + expect(verifier.earliest_missing_date).to be_nil + end + + it "with one date missing, returns the earliest" do + date = Date.parse("2023-11-03") + FileUtils.rm update_rights_file_for_date(date: date) + + expect(verifier.earliest_missing_date).to eq date + end + + it "with monthly file missing, returns the last day of the last month" do + date = Date.parse("2023-10-31") + FileUtils.rm full_file_for_date(date: date) + expect(verifier.earliest_missing_date).to eq date + end + + it "with different dates in each category missing, returns the earliest" do + [ + delete_file_for_date(date: Date.parse("2023-11-20")), + update_file_for_date(date: Date.parse("2023-11-11")), + update_rights_file_for_date(date: Date.parse("2023-11-18")) + ].each do |file| + FileUtils.rm file + end + expect(verifier.earliest_missing_date).to eq Date.parse("2023-11-11") + end + + it "with multiple dates missing, returns the earliest" do + dates = (Date.parse("2023-11-24")..Date.parse("2023-11-29")) + dates.each do |date| + FileUtils.rm delete_file_for_date(date: date) + end + expect(verifier.earliest_missing_date).to eq dates.first end - mi = PostZephirProcessing::Derivatives.new(date: Date.parse("2023-11-29")) - expect(mi.earliest_missing_date).to eq dates.first end end end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index a2b5e51..69a6934 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -10,6 +10,11 @@ require "webmock/rspec" require "zlib" 
+require "dates" +require "derivatives" +require "journal" +require "verifier" + Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) SimpleCov.add_filter "spec" @@ -24,11 +29,6 @@ ]) SimpleCov.start -require_relative "../lib/dates" -require_relative "../lib/derivatives" -require_relative "../lib/journal" -require_relative "../lib/verifier" - # squelch log output from tests PostZephirProcessing::Services.register(:logger) { Logger.new(File.open(File::NULL, "w"), level: Logger::DEBUG) @@ -101,75 +101,6 @@ def expect_ok(method, contents, gzipped: false, check_return: false) end end -# TODO: the following ENV juggling routines are for the integration tests, -# and should be integrated with the `with_test_environment` facility above. -ENV["POST_ZEPHIR_LOGGER_LEVEL"] = Logger::WARN.to_s - -def catalog_prep_dir - File.join(ENV["SPEC_TMPDIR"], "catalog_prep") -end - -def rights_dir - File.join(ENV["SPEC_TMPDIR"], "rights") -end - -def rights_archive_dir - File.join(ENV["SPEC_TMPDIR"], "rights_archive") -end - -# Set the all-important SPEC_TMPDIR and derivative env vars, -# and populate test dir with the appropriate directories. -# FIXME: RIGHTS_DIR should no longer be needed for testing Derivatives, -# and may not be needed for testing Verifier and friends. -def setup_test_dirs(parent_dir:) - ENV["SPEC_TMPDIR"] = parent_dir - ENV["CATALOG_PREP"] = catalog_prep_dir - ENV["RIGHTS_DIR"] = rights_dir - ENV["RIGHTS_ARCHIVE"] = rights_archive_dir - [catalog_prep_dir, rights_dir, rights_archive_dir].each do |loc| - Dir.mkdir loc - end -end - -def full_file_for_date(date:) - File.join(catalog_prep_dir, "zephir_full_#{date.strftime("%Y%m%d")}_vufind.json.gz") -end - -def full_rights_file_for_date(date:, archive: true) - File.join( - archive ? 
rights_archive_dir : rights_dir, - "zephir_full_#{date.strftime("%Y%m%d")}.rights" - ) -end - -def update_file_for_date(date:) - File.join(catalog_prep_dir, "zephir_upd_#{date.strftime("%Y%m%d")}.json.gz") -end - -def delete_file_for_date(date:) - File.join(catalog_prep_dir, "zephir_upd_#{date.strftime("%Y%m%d")}_delete.txt.gz") -end - -def update_rights_file_for_date(date:, archive: true) - File.join( - archive ? rights_archive_dir : rights_dir, - "zephir_upd_#{date.strftime("%Y%m%d")}.rights" - ) -end - -# @param date [Date] determines the month and year for the file datestamps -def setup_test_files(date:) - start_date = Date.new(date.year, date.month - 1, -1) - `touch #{full_file_for_date(date: start_date)}` - `touch #{full_rights_file_for_date(date: start_date)}` - end_date = Date.new(date.year, date.month, -2) - (start_date..end_date).each do |d| - `touch #{update_file_for_date(date: d)}` - `touch #{delete_file_for_date(date: d)}` - `touch #{update_rights_file_for_date(date: d)}` - end -end - # Returns the full path to the given fixture file. # # @param file [String] diff --git a/spec/unit/derivative/delete_spec.rb b/spec/unit/derivative/delete_spec.rb index 7497f5e..4315774 100644 --- a/spec/unit/derivative/delete_spec.rb +++ b/spec/unit/derivative/delete_spec.rb @@ -49,10 +49,9 @@ module PostZephirProcessing expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130_delete.txt.gz" end - # TODO: maybe this should raise since it's asking for a nonexistent derivative? 
- it "reports the same (upd) path regardless of fullness" do + it "raises if a full file is requested" do params[:full] = true - expect(derivative.path).to eq "/tmp/prep/zephir_upd_20231130_delete.txt.gz" + expect { derivative }.to raise_exception(ArgumentError, /full/) end end end diff --git a/spec/unit/derivatives_spec.rb b/spec/unit/derivatives_spec.rb index 53f62db..05a8320 100644 --- a/spec/unit/derivatives_spec.rb +++ b/spec/unit/derivatives_spec.rb @@ -4,14 +4,6 @@ module PostZephirProcessing RSpec.describe(Derivatives) do - around(:each) do |example| - Dir.mktmpdir do |tmpdir| - setup_test_dirs(parent_dir: tmpdir) - setup_test_files(date: Date.parse("2023-10-30")) - example.run - end - end - describe ".new" do it "creates a Derivatives" do expect(described_class.new).to be_an_instance_of(Derivatives) @@ -27,20 +19,5 @@ module PostZephirProcessing expect(described_class.new.dates).to be_an_instance_of(Dates) end end - - describe "#earliest_missing_date" do - context "with no files" do - it "returns the last day of last month" do - expect(described_class.new(date: Date.parse("2023-01-15")).earliest_missing_date).to eq Date.parse("2022-12-31") - end - end - - context "with all files for the month present" do - date = Date.parse("2023-10-15") - it "returns nil" do - expect(described_class.new(date: date).earliest_missing_date).to eq nil - end - end - end end end From 77323541fd6e6604bb7fb13b7af04d78c4b22e79 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 15:39:31 -0500 Subject: [PATCH 099/114] Rename Derivatives -> PostZephirDerivatives * clean up requires --- lib/derivative.rb | 1 - lib/{derivatives.rb => post_zephir_derivatives.rb} | 5 +---- lib/verifier.rb | 1 - lib/verifier/catalog_index_verifier.rb | 2 +- lib/verifier/hathifiles_database_verifier.rb | 5 ++--- lib/verifier/hathifiles_listing_verifier.rb | 1 - lib/verifier/populate_rights_verifier.rb | 2 +- lib/verifier/post_zephir_verifier.rb | 2 +- 
spec/integration/derivatives_integration_spec.rb | 3 ++- spec/spec_helper.rb | 1 - ...derivatives_spec.rb => post_zephir_derivatives_spec.rb} | 7 ++++--- 11 files changed, 12 insertions(+), 18 deletions(-) rename lib/{derivatives.rb => post_zephir_derivatives.rb} (88%) rename spec/unit/{derivatives_spec.rb => post_zephir_derivatives_spec.rb} (66%) diff --git a/lib/derivative.rb b/lib/derivative.rb index 2168ee5..3655d86 100644 --- a/lib/derivative.rb +++ b/lib/derivative.rb @@ -1,5 +1,4 @@ require "verifier" -require "derivatives" module PostZephirProcessing class Derivative diff --git a/lib/derivatives.rb b/lib/post_zephir_derivatives.rb similarity index 88% rename from lib/derivatives.rb rename to lib/post_zephir_derivatives.rb index e3a9778..29b5e67 100644 --- a/lib/derivatives.rb +++ b/lib/post_zephir_derivatives.rb @@ -10,10 +10,7 @@ module PostZephirProcessing # A class that knows the expected locations of standard Zephir derivative files. # `earliest_missing_date` is the main entrypoint when constructing an agenda of Zephir # file dates to fetch for processing. - # - # TODO: this class may be renamed PostZephirDerivatives once directory_for is updated, - # moved, or elimminated. - class Derivatives + class PostZephirDerivatives attr_reader :dates # @param date [Date] the file datestamp date, not the "run date" diff --git a/lib/verifier.rb b/lib/verifier.rb index 7b99a3f..7a2143d 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -1,6 +1,5 @@ # frozen_string_literal: true -require "derivatives" require "journal" require "services" diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 02b5cfc..74f0c56 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -2,7 +2,7 @@ require "faraday" require "verifier" -require "derivatives" +require "derivative/catalog" # Verifies that catalog indexing workflow stage did what it was supposed to. 
diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index 1bf7469..d7f4662 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -3,7 +3,6 @@ require "zlib" require "verifier" -require "derivatives" require "derivative/hathifile" module PostZephirProcessing @@ -13,7 +12,7 @@ class HathifilesDatabaseVerifier < Verifier # Does an entry exist in hf_log for the hathifile? # Can pass a path or just the filename. def self.has_log?(hathifile:) - PostZephirProcessing::Services[:database][:hf_log] + Services[:database][:hf_log] .where(hathifile: File.basename(hathifile)) .count .positive? @@ -21,7 +20,7 @@ def self.has_log?(hathifile:) # Count the number of entries in hathifiles.hf def self.db_count - PostZephirProcessing::Services[:database][:hf].count + Services[:database][:hf].count end def run_for_date(date:) diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index a41539b..628492b 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -1,7 +1,6 @@ # frozen_string_literal: true require "verifier" -require "derivatives" require "derivative/hathifile_www" require "json" require "set" diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index 16918a9..5b0d2b8 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -1,7 +1,7 @@ # frozen_string_literal: true require "verifier" -require "derivatives" +require "derivative/rights" module PostZephirProcessing # The PostZephirVerifier checks for the existence and readability of the .rights files. 
diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 1e1a3aa..18007d4 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -2,7 +2,7 @@ require "zlib" require "verifier" -require "derivatives" +require "post_zephir_derivatives" require "derivative/dollar_dup" require "derivative/catalog" require "derivative/rights" diff --git a/spec/integration/derivatives_integration_spec.rb b/spec/integration/derivatives_integration_spec.rb index 0b87bdd..d449f09 100644 --- a/spec/integration/derivatives_integration_spec.rb +++ b/spec/integration/derivatives_integration_spec.rb @@ -3,12 +3,13 @@ require "date" require "fileutils" require "tmpdir" +require "post_zephir_derivatives" require "derivative/catalog" require "derivative/delete" require "derivative/rights" module PostZephirProcessing - RSpec.describe Derivatives do + RSpec.describe PostZephirDerivatives do def catalog_prep_dir File.join(@tmpdir, "catalog_prep") end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 69a6934..d538abd 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -11,7 +11,6 @@ require "zlib" require "dates" -require "derivatives" require "journal" require "verifier" diff --git a/spec/unit/derivatives_spec.rb b/spec/unit/post_zephir_derivatives_spec.rb similarity index 66% rename from spec/unit/derivatives_spec.rb rename to spec/unit/post_zephir_derivatives_spec.rb index 05a8320..055f899 100644 --- a/spec/unit/derivatives_spec.rb +++ b/spec/unit/post_zephir_derivatives_spec.rb @@ -1,12 +1,13 @@ # frozen_string_literal: true require "tmpdir" +require "post_zephir_derivatives" module PostZephirProcessing - RSpec.describe(Derivatives) do + RSpec.describe(PostZephirDerivatives) do describe ".new" do - it "creates a Derivatives" do - expect(described_class.new).to be_an_instance_of(Derivatives) + it "creates a PostZephirDerivatives" do + expect(described_class.new).to 
be_an_instance_of(PostZephirDerivatives) end it "has a default date of yesterday" do From 75fe457842919e77397ec8f8d07990fb64d76110 Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 16:51:37 -0500 Subject: [PATCH 100/114] Use guard clauses for verify_file, etc * Reduces level of nested ifs * Separate catalog full verification from catalog update -- rather different behavior * Also: update documentation regarding derivatives / paths --- README.md | 8 +- lib/verifier.rb | 9 +-- lib/verifier/catalog_index_verifier.rb | 5 +- lib/verifier/hathifiles_redirects_verifier.rb | 43 ++++++----- lib/verifier/populate_rights_verifier.rb | 5 +- lib/verifier/post_zephir_verifier.rb | 74 ++++++++++--------- spec/unit/post_zephir_verifier_spec.rb | 6 +- 7 files changed, 70 insertions(+), 80 deletions(-) diff --git a/README.md b/README.md index ba29fbd..c26726d 100644 --- a/README.md +++ b/README.md @@ -35,9 +35,7 @@ Post-Zephir can read and write files in a number of locations, and it can become Many of the locations (all of them directories) show up again and again. Under Argo these all come from the `ENV` provided to the workflow. Under Docker the locations are not so scattered, and all orient themselves to `ENV[ROOTDIR]`. The shell scripts rely on `config/defaults` to fill -in many of these variables. The Ruby scripts orient off the `DATA_ROOT` in `Dockerfile` -but fill in the other locations in a more haphazard manner (see the `directory_for` method in -`lib/derivatives.rb` for an example of how this can go off the rails). +in many of these variables; the Ruby scripts expect that the environment variables set by `config/defaults` are present. TODO: can we use `dotenv` and `.env` in both the shell scripts and the Ruby code, and get rid of `config/defaults`? Or can we translate `config/defaults` into Ruby and invoke it from the driver? 
@@ -53,10 +51,6 @@ TODO: can we use `dotenv` and `.env` in both the shell scripts and the Ruby code | `ROOTDIR` | (not used) | `/usr/src/app` | Additional derivative paths are set by `config/defaults`, typically from the daily or monthly shell script. -Another mechanism (`lib/derivatives.rb`) is being experimented with for the Ruby code. -(Note: there may be some fuzziness between these two sets since we may decide to let -Argo handle one or more of these in future. Look to the Argo metadata workflow config for -authoritative values.) | `ENV` | Standard/Default/Docker Location | Note | | -------- | ------- | ---- | diff --git a/lib/verifier.rb b/lib/verifier.rb index 7a2143d..89dc09c 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -4,8 +4,6 @@ require "services" # Common superclass for all things Verifier. -# Right now the only thing I can think of to put here is shared -# code for writing whatever output file, logs, metrics, artifacts, etc. we decide on. module PostZephirProcessing class Verifier @@ -20,8 +18,6 @@ def initialize # Main entrypoint # What should it return? - # Do we want to bail out or keep going if we encounter a show-stopper? - # I'm inclined to just keep going. def run journal.dates.each do |date| run_for_date(date: date) @@ -71,8 +67,9 @@ def verify_parseable_ndj(path:) true end - # I'm not sure if we're going to try to distinguish errors and warnings. - # For now let's call everything an error. + # Log an error -- something unexpected from the verifier. Generally, this + # indicates something unexpected that requires human intervention to + # correct, and which should be corrected ASAP. 
def error(message:) output_msg = self.class.to_s + ": " + message @errors << output_msg diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 74f0c56..2642ad6 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -54,9 +54,8 @@ def run_for_date(date:) Derivative::CatalogArchive.derivatives_for_date(date: date - 1).each do |derivative| path = derivative.path - if verify_file(path: path) - verify_index_count(derivative: derivative) - end + next unless verify_file(path: path) + verify_index_count(derivative: derivative) end end end diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index 97f33da..b8932d9 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -20,35 +20,34 @@ def verify_redirects(date: current_date) end def verify_redirects_file(path: redirects_file) - if verify_file(path: path) - # check that each line in the file matches regex - Zlib::GzipReader.open(path, encoding: "utf-8").each_line.with_index(1) do |line, i| - unless REDIRECTS_REGEX.match?(line) - report_malformed(file: redirects_file, line: line, line_number: i) - end + return unless verify_file(path: path) + # check that each line in the file matches regex + Zlib::GzipReader.open(path, encoding: "utf-8").each_line.with_index(1) do |line, i| + unless REDIRECTS_REGEX.match?(line) + report_malformed(file: redirects_file, line: line, line_number: i) end end end def verify_redirects_history_file(path: redirects_history_file) - if verify_file(path: path) - Zlib::GzipReader.open(path, encoding: "utf-8").each_line.with_index(1) do |line, i| - parsed = JSON.parse(line) - # Check that the line parses to a hash - unless parsed.instance_of?(Hash) - report_malformed(file: redirects_history_file, line: line, line_number: i) - next - end - # Check that the outermost level of keys in the JSON line are what we 
expect - unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS - report_malformed(file: redirects_history_file, line: line, line_number: i) - next - end - # here we could go further and verify deeper structure of json, - # but not sure it's worth it? - rescue JSON::ParserError + return unless verify_file(path: path) + + Zlib::GzipReader.open(path, encoding: "utf-8").each_line.with_index(1) do |line, i| + parsed = JSON.parse(line) + # Check that the line parses to a hash + unless parsed.instance_of?(Hash) + report_malformed(file: redirects_history_file, line: line, line_number: i) + next + end + # Check that the outermost level of keys in the JSON line are what we expect + unless HISTORY_FILE_KEYS & parsed.keys == HISTORY_FILE_KEYS report_malformed(file: redirects_history_file, line: line, line_number: i) + next end + # here we could go further and verify deeper structure of json, + # but not sure it's worth it? + rescue JSON::ParserError + report_malformed(file: redirects_history_file, line: line, line_number: i) end end diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index 5b0d2b8..db68070 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -29,9 +29,8 @@ def initialize(slice_size: DEFAULT_SLICE_SIZE) def run_for_date(date:) Derivative::Rights.derivatives_for_date(date: date).each do |derivative| path = derivative.path - if verify_file(path: path) - verify_rights_file(path: path) - end + next unless verify_file(path: path) + verify_rights_file(path: path) end end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 18007d4..4a5d968 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -17,13 +17,20 @@ class PostZephirVerifier < Verifier def run_for_date(date:) @current_date = date - verify_catalog_archive + verify_catalog_update_archive + verify_catalog_full_archive 
verify_catalog_prep verify_dollar_dup verify_ingest_bibrecords verify_rights end + def verify_catalog_update_archive(date: current_date) + zephir_update_path = Derivative::CatalogArchive.new(date: date, full: false).path + return unless verify_file(path: zephir_update_path) + verify_parseable_ndj(path: zephir_update_path) + end + # Frequency: ALL # Files: CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz # and potentially CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz @@ -31,33 +38,29 @@ def run_for_date(date:) # Verify: # readable # line count must be the same as input JSON - def verify_catalog_archive(date: current_date) - zephir_update_path = Derivative::CatalogArchive.new(date: date, full: false).path - if verify_file(path: zephir_update_path) - if verify_parseable_ndj(path: zephir_update_path) - if date.last_of_month? - output_path = Derivative::CatalogArchive.new(date: date, full: true).path - verify_file(path: output_path) - verify_parseable_ndj(path: output_path) - output_linecount = gzip_linecount(path: output_path) - input_path = Derivative::HTBibExport.new(date: date, full: true).path - verify_file(path: input_path) - verify_parseable_ndj(path: input_path) - input_linecount = gzip_linecount(path: input_path) - - if output_linecount != input_linecount - error( - message: sprintf( - "output line count (%s = %s) != input line count (%s = %s)", - output_path, - output_linecount, - input_path, - input_linecount - ) - ) - end - end - end + def verify_catalog_full_archive(date: current_date) + return unless date.last_of_month? + output_path = Derivative::CatalogArchive.new(date: date, full: true).path + input_path = Derivative::HTBibExport.new(date: date, full: true).path + + paths = [input_path, output_path] + return unless paths.all? 
{ |path| verify_file(path: path) } + + paths.each { |path| verify_parseable_ndj(path: path) } + + output_linecount = gzip_linecount(path: output_path) + input_linecount = gzip_linecount(path: input_path) + + if output_linecount != input_linecount + error( + message: sprintf( + "output line count (%s = %s) != input line count (%s = %s)", + output_path, + output_linecount, + input_path, + input_linecount + ) + ) end end @@ -72,6 +75,7 @@ def verify_catalog_archive(date: current_date) # TODO: deletes file is combination of two component files in TMPDIR? def verify_catalog_prep(date: current_date) delete_file = Derivative::Delete.new(date: date) + if verify_file(path: delete_file.path) verify_deletes_contents(path: delete_file.path) end @@ -116,11 +120,10 @@ def verify_deletes_contents(path:) # empty def verify_dollar_dup(date: current_date) dollar_dup = Derivative::DollarDup.new(date: date).path - if verify_file(path: dollar_dup) - gz_count = gzip_linecount(path: dollar_dup) - if gz_count.positive? - error message: "spurious dollar_dup lines: #{dollar_dup} should be empty (found #{gz_count} lines)" - end + return unless verify_file(path: dollar_dup) + gz_count = gzip_linecount(path: dollar_dup) + if gz_count.positive? 
+ error message: "spurious dollar_dup lines: #{dollar_dup} should be empty (found #{gz_count} lines)" end end @@ -147,9 +150,8 @@ def verify_ingest_bibrecords(date: current_date) def verify_rights(date: current_date) Derivative::Rights.derivatives_for_date(date: date).each do |derivative| path = derivative.path - if verify_file(path: path) - verify_rights_file_format(path: path) - end + next unless verify_file(path: path) + verify_rights_file_format(path: path) end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 3efd13c..7c79211 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -113,13 +113,13 @@ def expect_deletefile_ok(contents) end end - describe "#verify_catalog_archive" do + describe "#verify_catalog_full_archive" do let(:verifier) { described_class.new } let(:test_date) { Date.parse("2024-11-30") } it "requires input file to have same line count as output file" do # We have fixtures with matching line counts for test_date, # so expect no warnings - verifier.verify_catalog_archive(date: test_date) + verifier.verify_catalog_full_archive(date: test_date) expect(verifier.errors).to be_empty end @@ -132,7 +132,7 @@ def expect_deletefile_ok(contents) end # The other unmodified fixtures in CATALOG_ARCHIVE should # no longer have matching line counts, so expect a warning - verifier.verify_catalog_archive(date: test_date) + verifier.verify_catalog_full_archive(date: test_date) expect(verifier.errors.count).to eq 1 expect(verifier.errors).to include(/output line count .+ != input line count/) end From 77ce92d9e0eaa443afb89d73bffa764cfb90d3ae Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 17:14:13 -0500 Subject: [PATCH 101/114] Remove extra with_test_environment in hathifiles contents verifier --- spec/unit/hathifiles_contents_verifier_spec.rb | 3 --- 1 file changed, 3 deletions(-) diff --git a/spec/unit/hathifiles_contents_verifier_spec.rb 
b/spec/unit/hathifiles_contents_verifier_spec.rb index 40e09f2..fac4988 100644 --- a/spec/unit/hathifiles_contents_verifier_spec.rb +++ b/spec/unit/hathifiles_contents_verifier_spec.rb @@ -10,9 +10,6 @@ module PostZephirProcessing end let(:sample_line) { File.read(fixture("sample_hathifile_line.txt"), encoding: "utf-8") } - around(:each) do |example| - with_test_environment { example.run } - end hathifiles_fields = [ { From 3bdde11aca8a2d62dceb054c372652402338e39c Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Thu, 19 Dec 2024 17:37:30 -0500 Subject: [PATCH 102/114] Add verifier classes to verifier script * prep bin/verify.rb for testing * try running bin/verify.rb runs * use run_for_date by default; remove requirement for journal --- bin/verify.rb | 42 ++++++++++++++++++++--------- config/env | 1 + lib/verifier.rb | 13 ++++----- lib/verifier/hathifiles_verifier.rb | 2 +- spec/spec_helper.rb | 1 - spec/unit/verifier_spec.rb | 13 --------- 6 files changed, 36 insertions(+), 36 deletions(-) diff --git a/bin/verify.rb b/bin/verify.rb index 7dc60e8..9813a5a 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -1,22 +1,38 @@ #!/usr/bin/env ruby # frozen_string_literal: true +$LOAD_PATH.unshift File.expand_path("../lib/", __FILE__) require "dotenv" -require_relative "../lib/verifier/hathifiles_verifier" -require_relative "../lib/verifier/populate_rights_verifier" -require_relative "../lib/verifier/post_zephir_verifier" +require "verifier/post_zephir_verifier" +require "verifier/populate_rights_verifier" +require "verifier/hathifiles_verifier" +require "verifier/hathifiles_database_verifier" +require "verifier/hathifiles_listing_verifier" +require "verifier/hathifiles_redirects_verifier" +require "verifier/catalog_index_verifier" Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) -[ - PostZephirProcessing::PostZephirVerifier, - PostZephirProcessing::PopulateRightsVerifier, - PostZephirProcessing::HathifilesVerifier -].each do |klass| - begin - klass.new.run - # 
Very simple minded exception handler so we can in theory check subsequent workflow steps - rescue StandardError => e - PostZephirProcessing::Services[:logger].fatal e +module PostZephir + def self.run_verifiers(date_to_check) + [ + PostZephirProcessing::PostZephirVerifier, + PostZephirProcessing::PopulateRightsVerifier, + PostZephirProcessing::HathifilesVerifier, + PostZephirProcessing::HathifilesDatabaseVerifier, + PostZephirProcessing::HathifilesListingVerifier, + PostZephirProcessing::HathifileRedirectsVerifier, + PostZephirProcessing::CatalogIndexVerifier + ].each do |klass| + begin + klass.new.run_for_date(date: date_to_check) + # Very simple minded exception handler so we can in theory check subsequent workflow steps + rescue StandardError => e + PostZephirProcessing::Services[:logger].fatal e + end + end end end + +date_to_check = ARGV[0] || Date.today +PostZephir.run_verifiers(date_to_check) if __FILE__ == $0 diff --git a/config/env b/config/env index d9571ac..763d5a9 100644 --- a/config/env +++ b/config/env @@ -1,6 +1,7 @@ # This is just for running/testing the Ruby components under Docker # Under Argo these ENV variables will all be set externally CATALOG_ARCHIVE=/usr/src/app/data/catalog_archive +HATHIFILE_ARCHIVE=/usr/src/app/data/hathifile_archive CATALOG_PREP=/usr/src/app/data/catalog_prep DATA_ROOT=/usr/src/app/data INGEST_BIBRECORDS=/usr/src/app/data/ingest_bibrecords diff --git a/lib/verifier.rb b/lib/verifier.rb index 89dc09c..9f41e6a 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -7,25 +7,22 @@ module PostZephirProcessing class Verifier - attr_reader :journal, :errors + attr_reader :errors - # Generally, needs a Journal in order to know what to look for. def initialize - @journal = Journal.from_yaml # Mainly for testing @errors = [] end - # Main entrypoint - # What should it return? 
+ # Verify all dates listed in the journal def run - journal.dates.each do |date| + Journal.from_yaml.dates.each do |date| run_for_date(date: date) end end - # Verify outputs for one date in the journal. - # USeful for verifying datestamped files. + # Verify outputs for one date. + # Useful for verifying datestamped files. def run_for_date(date:) end diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index 0f7dc06..f1f3dea 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -2,7 +2,7 @@ require "zlib" require "verifier" -require_relative "hathifiles_contents_verifier" +require "verifier/hathifiles_contents_verifier" require "derivative/hathifile" # Verifies that hathifiles workflow stage did what it was supposed to. diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index d538abd..6ba536b 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -51,7 +51,6 @@ def with_test_environment DATA_ROOT: tmpdir, TMPDIR: tmpdir ) do - File.open(File.join(tmpdir, "journal.yml"), "w") { |f| f.puts test_journal } # Maybe we don't need to yield `tmpdir` since we're also assigning it to an # instance variable. Leaving it for now in case the ivar approach leads to funny business. 
@tmpdir = tmpdir diff --git a/spec/unit/verifier_spec.rb b/spec/unit/verifier_spec.rb index 349e313..23b99de 100644 --- a/spec/unit/verifier_spec.rb +++ b/spec/unit/verifier_spec.rb @@ -19,19 +19,6 @@ module PostZephirProcessing it "creates a Verifier" do expect(verifier).to be_an_instance_of(Verifier) end - - context "with no Journal file" do - it "raises StandardError" do - FileUtils.rm(File.join(@tmpdir, "journal.yml")) - expect { verifier }.to raise_error(StandardError) - end - end - end - - describe ".run" do - it "runs to completion" do - verifier.run - end end describe "#verify_file" do From d3e61355ea04a19a4eb249ad9f0812dc478f2edd Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Fri, 20 Dec 2024 14:48:36 -0500 Subject: [PATCH 103/114] Happy-path integration test * sufficient fixtures to run all verifiers (some fixtures have been moved/renamed and adjusted in tests appropriately) * verify.rb (for now) notes which date is needed for 'today' * extract solr catalog & hathifile database helpers to spec/support * info logging --- bin/verify.rb | 29 ++++++----- lib/verifier.rb | 8 +++ lib/verifier/catalog_index_verifier.rb | 6 ++- lib/verifier/hathifiles_contents_verifier.rb | 1 + lib/verifier/hathifiles_database_verifier.rb | 1 + lib/verifier/hathifiles_listing_verifier.rb | 1 + lib/verifier/hathifiles_redirects_verifier.rb | 6 +++ lib/verifier/hathifiles_verifier.rb | 1 + lib/verifier/populate_rights_verifier.rb | 1 + lib/verifier/post_zephir_verifier.rb | 3 ++ .../zephir_upd_20241203.json.gz | Bin 0 -> 1903 bytes .../catalog_prep/zephir_upd_20241202.json.gz | 1 + .../zephir_upd_20241202_delete.txt.gz | Bin 0 -> 71 bytes ...d_incremental_2024-12-02_dollar_dup.txt.gz | Bin 0 -> 65 bytes .../hathi_upd_20241203.txt.gz | Bin 314 -> 574 bytes .../hathi_upd_20241204.txt.gz | Bin 0 -> 229 bytes .../{202301.ndj.gz => 202412.ndj.gz} | Bin ..._202301.txt.gz => redirects_202412.txt.gz} | Bin .../rights_archive/zephir_upd_20241202.rights | 0 
spec/fixtures/www/hathi_file_list.json | 8 +++ spec/fixtures/www/hathi_upd_20241203.txt.gz | Bin 0 -> 43 bytes spec/integration/hathifiles_verifier_spec.rb | 2 +- spec/integration/verify_spec.rb | 49 ++++++++++++++++++ spec/spec_helper.rb | 3 ++ spec/support/hathifile_database.rb | 23 ++++++++ spec/support/solr_mock.rb | 32 ++++++++++++ spec/unit/catalog_indexing_verifier_spec.rb | 45 +--------------- .../unit/hathifiles_database_verifier_spec.rb | 18 +------ .../hathifiles_redirects_verifier_spec.rb | 6 +-- 29 files changed, 166 insertions(+), 78 deletions(-) create mode 100644 spec/fixtures/catalog_archive/zephir_upd_20241203.json.gz create mode 120000 spec/fixtures/catalog_prep/zephir_upd_20241202.json.gz create mode 100644 spec/fixtures/catalog_prep/zephir_upd_20241202_delete.txt.gz create mode 100644 spec/fixtures/dollar_dup/vufind_incremental_2024-12-02_dollar_dup.txt.gz create mode 100644 spec/fixtures/hathifile_archive/hathi_upd_20241204.txt.gz rename spec/fixtures/redirects/{202301.ndj.gz => 202412.ndj.gz} (100%) rename spec/fixtures/redirects/{redirects_202301.txt.gz => redirects_202412.txt.gz} (100%) create mode 100644 spec/fixtures/rights_archive/zephir_upd_20241202.rights create mode 100644 spec/fixtures/www/hathi_upd_20241203.txt.gz create mode 100644 spec/integration/verify_spec.rb create mode 100644 spec/support/hathifile_database.rb create mode 100644 spec/support/solr_mock.rb diff --git a/bin/verify.rb b/bin/verify.rb index 9813a5a..4ce7d37 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -13,26 +13,29 @@ Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) -module PostZephir +module PostZephirProcessing def self.run_verifiers(date_to_check) [ - PostZephirProcessing::PostZephirVerifier, - PostZephirProcessing::PopulateRightsVerifier, - PostZephirProcessing::HathifilesVerifier, - PostZephirProcessing::HathifilesDatabaseVerifier, - PostZephirProcessing::HathifilesListingVerifier, - PostZephirProcessing::HathifileRedirectsVerifier, - 
PostZephirProcessing::CatalogIndexVerifier - ].each do |klass| + # all outputs here are date-stamped with yesterday's date + -> { PostZephirVerifier.new.run_for_date(date: date_to_check - 1) }, + -> { PopulateRightsVerifier.new.run_for_date(date: date_to_check - 1) }, + + # these are today's date + -> { HathifilesVerifier.new.run_for_date(date: date_to_check) }, + -> { HathifilesDatabaseVerifier.new.run_for_date(date: date_to_check) }, + -> { HathifilesListingVerifier.new.run_for_date(date: date_to_check) }, + -> { HathifileRedirectsVerifier.new.run_for_date(date: date_to_check) }, + -> { CatalogIndexVerifier.new.run_for_date(date: date_to_check) }, + ].each do |verifier_lambda| begin - klass.new.run_for_date(date: date_to_check) - # Very simple minded exception handler so we can in theory check subsequent workflow steps + verifier_lambda.call + # Very simple minded exception handler so we can in theory check subsequent workflow steps rescue StandardError => e - PostZephirProcessing::Services[:logger].fatal e + Services[:logger].fatal e end end end end date_to_check = ARGV[0] || Date.today -PostZephir.run_verifiers(date_to_check) if __FILE__ == $0 +PostZephirProcessing.run_verifiers(date_to_check) if __FILE__ == $0 diff --git a/lib/verifier.rb b/lib/verifier.rb index 9f41e6a..eae6c62 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -24,11 +24,13 @@ def run # Verify outputs for one date. # Useful for verifying datestamped files. def run_for_date(date:) + info message: "running for #{date}" end # Basic checks for the existence and readability of the file at `path`. # @return [Boolean] `true` if verified, `false` if error was reported. 
def verify_file(path:) + info message: "verifying file exists & is readable: #{path}" verify_file_exists(path: path) && verify_file_readable(path: path) end @@ -47,12 +49,14 @@ def verify_file_readable(path:) end def gzip_linecount(path:) + info message: "getting line count: #{path}" Zlib::GzipReader.open(path, encoding: "utf-8") { |gz| gz.count } end # Take a .ndj.gz file and check that each line is indeed parseable json # @return [Boolean] `true` if verified, `false` if error was reported. def verify_parseable_ndj(path:) + info message: "verifying parseable newline-delimited json: #{path}" Zlib::GzipReader.open(path, encoding: "utf-8") do |gz| gz.each_line do |line| JSON.parse(line) @@ -72,5 +76,9 @@ def error(message:) @errors << output_msg Services[:logger].error output_msg end + + def info(message:) + Services[:logger].info "#{self.class}: #{message}" + end end end diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index 2642ad6..e169df4 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -3,6 +3,7 @@ require "faraday" require "verifier" require "derivative/catalog" +require "uri" # Verifies that catalog indexing workflow stage did what it was supposed to. 
@@ -32,7 +33,7 @@ def verify_index_count(derivative:) def solr_count(date_of_indexing) datebegin = date_of_indexing.to_time.utc.strftime("%FT%TZ") - solr_result_count("time_of_index:[#{datebegin}%20TO%20NOW]") + solr_result_count("time_of_index:[#{datebegin} TO NOW]") end def solr_nondeleted_records @@ -40,12 +41,13 @@ def solr_nondeleted_records end def solr_result_count(filter_query) - url = "#{ENV["SOLR_URL"]}/select?fq=#{filter_query}&q=*:*&rows=0&wt=json" + url = "#{ENV["SOLR_URL"]}/select?fq=#{URI.encode_www_form_component(filter_query)}&q=*:*&rows=0&wt=json" JSON.parse(Faraday.get(url).body)["response"]["numFound"] end def run_for_date(date:) + super # The dates on the files are the previous day, but the indexing # happens on the current day. When we verify the current day, we are # verifying that the file named for the _previous_ day was produced. diff --git a/lib/verifier/hathifiles_contents_verifier.rb b/lib/verifier/hathifiles_contents_verifier.rb index ee13df1..cdf838b 100644 --- a/lib/verifier/hathifiles_contents_verifier.rb +++ b/lib/verifier/hathifiles_contents_verifier.rb @@ -77,6 +77,7 @@ def initialize(file) end def run + info message: "verifying contents of #{file}" Zlib::GzipReader.open(file, encoding: "utf-8").each_line do |line| @line_count += 1 # limit of -1 to ensure we don't drop trailing empty fields diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database_verifier.rb index d7f4662..93f5257 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database_verifier.rb @@ -24,6 +24,7 @@ def self.db_count end def run_for_date(date:) + super @current_date = date verify_hathifiles_database_log verify_hathifiles_database_count diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing_verifier.rb index 628492b..f9292ef 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing_verifier.rb @@ -10,6 +10,7 @@ class 
HathifilesListingVerifier < Verifier attr_reader :current_date def run_for_date(date:) + super @current_date = date verify_hathifiles_listing end diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index b8932d9..c3e6a64 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -14,6 +14,12 @@ def initialize(date: Date.today) @current_date = date end + def run_for_date(date:) + super + @current_date = date + verify_redirects + end + def verify_redirects(date: current_date) verify_redirects_file verify_redirects_history_file diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles_verifier.rb index f1f3dea..b5f728e 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles_verifier.rb @@ -12,6 +12,7 @@ class HathifilesVerifier < Verifier attr_reader :current_date def run_for_date(date:) + super @current_date = date verify_hathifile end diff --git a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights_verifier.rb index db68070..9e95c3a 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights_verifier.rb @@ -27,6 +27,7 @@ def initialize(slice_size: DEFAULT_SLICE_SIZE) end def run_for_date(date:) + super Derivative::Rights.derivatives_for_date(date: date).each do |derivative| path = derivative.path next unless verify_file(path: path) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 4a5d968..b8656a1 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -16,6 +16,7 @@ class PostZephirVerifier < Verifier attr_reader :current_date def run_for_date(date:) + super @current_date = date verify_catalog_update_archive verify_catalog_full_archive @@ -88,6 +89,7 @@ def verify_catalog_prep(date: current_date) # Verify contents of the given file consists of catalog record IDs (9 digits) # or blank 
lines def verify_deletes_contents(path:) + info message: "verifying contents of #{path}" Zlib::GzipReader.open(path).each_line do |line| if line != "\n" && !line.match?(/^\d{9}$/) error message: "Unexpected line in #{path} (was '#{line.strip}'); expecting catalog record ID (9 digits)" @@ -175,6 +177,7 @@ def verify_rights_file_format(path:) {name: :digitization_source, regex: /^[a-z]+(-[a-z]+)*$/} ] + info message: "verifying contents of #{path}" # This allows an empty file as well, which is possible. File.open(path) do |f| f.each_line.with_index do |line, i| diff --git a/spec/fixtures/catalog_archive/zephir_upd_20241203.json.gz b/spec/fixtures/catalog_archive/zephir_upd_20241203.json.gz new file mode 100644 index 0000000000000000000000000000000000000000..cf935bddcb3803dd2e16fd41ae3c94d8193e7cc3 GIT binary patch literal 1903 zcmV-#2axz5iwFpw$z^8%1A1j}XlZg^b#P=~GB7eUF)}bSE^2dcZUDWRZByGg5P-k; zS7`F(%1mq|`8}PxnUIu1FSJZ4)0uYopu_^yI(E)>+NM4J`>hq*DDf*Ctr-T{vR1pR zmG<%Sa}aRm^LQ{CVC;G%WJ|;-#h5w{LCCZ)9t;Q9f(L#w7=8K)ZxQ)!b{P&C+zqp5 z20S|xV|Fr~J39p_y(bS+VS`9&k|bf9w)qj`ulO=D~L$va%=a9!*W zYe78QgEyY@I7Dyf@@cG+GG@Be4Hnz-O zBHE3eu($6oXXA?lVtFocP0IkGkCOCmcKYVv?DX;#7?~-j9qmcl9rL4b^*G?GTP5~d z*U!YA>o=WEWT-5Zwl72ZvThA6-0AZ3k3>IqZ7FFE_ZPQ@Y|FkkHYPg@fpftSi(RU^ z*wwLlNFA$2GlIJ{JC4~Go($1zw&V$Vr%SasbZ}o+bs}4l)T~(uCM}BM6kUr&Dx&a3 zX(&UBSfpaX00R!Nl>2DDDd$I9h0rG!`d8>QVSW^I^gd?E;+BVdDwPK3 z9-~&n-ljgV;+LxTsvVwP#XLx zJEgpX78Gw8r`@)7FESiW`;;^Q2+}Ovvdv4PIxSM zQWH;>;nutNEb=*ut`S=-c#@3JX$ZnXo_2WLwz~J?@iwBQ>EGf0L?mexZxDn>9$F$< zs=DHR*|gCU+nCMdfxUI^bb2gL5t_pyScl~n`W%Tc-4DEcBTY`HowiqwG{tw*@%v8I zh_}s*JBfB)n}I8bPcFY^D`jInYG4~X*xMCqYJ_Uocu(AsK}Xp9*sxtIHe~&@)`mT) zc!cT48h1>3xZ`5_PJ`@H9_QoP*;w8LaI=c}ip5-C1scYtmVrez9*)l%oO1Mcd!(Pd zo%KCia7W6|vxdw?MK+nfE02c4z*R<5{z^B1Y9Iip?YQ7d*_9}-WjfS?g;d*f%)V;R z6``sge!hl;Vxye>CW_$jl(*|wBYLQ+0Px2wT*j_wXIH&HF3eFzZGPY3}mw*EJaL?ofjaBt03DN@{VoTj_uaP zGPt{Q_;{Vzu4dcL6JVms#HO@GjkE_!+utB<0$X(2g0M+kV)ugchk>x`JOKzx6n`KrpL_kVVE29~J5%B|Ft)t^}9evQzR7ZvSa>Oh7^f-#^QT<^&d&Yx#v<7OsN7##X 
zh>oJ@OLC8|+0j%(ugzj9wF+|0MEZi=#ECvcDVqlpx(cMlD$=UZiOq(bE9QZSZeq5& z-Hc!k1jo8YDd~XD9AaC~y`DU)f@W37jLoAsUw&!k4>dfRw8T5)S;NDy+Cy|C@^6YP z7e0Rl&!Vsh<%9Mhz6Z}{`zz;$Vs9Vtg#79PBU%K~5Ahxao{IjB4C)XQJ8&X2F{o*e zNG&Z^ZFmfW?ty3X9(W{pDhlV|*>^8X9Bly4ZrW6n>7!`VAf^dl2X-rM?xgJfAghAs zub|BxJ#*T0yJ$0vLQNq~q7dhf6%sxhz|yiiE2JhiDE10C|ka6Zcz_K~QoNO71 zO%E@|t>n}b2*DwpMbd5xtV-#s;dbt(lBXcTKz8MVDYY!V9)X?IaXhDXxAzjVN1L|O zjObo82S_%{~&JmfnhhWCN8**B9Yh?euHboi(9ScDnG!nZ7d zE6V%Pcoo6zf@PJrKj5i;RjQgDQgtBMhc>@||0=`~nUtu|30$n>Y>2 z(hS>vFmbkncnf^y#7VkKo3TJM7Jgx6;q*LOtpk?TQXgCCRKjCe2FG&62f5P3KN{=Djw$kce>TV(2 zHiFK@*w{6U<|-=??tYcEp&r_sP_w-OS@i+5`qYT}pxUUK^n23kbp^zB`E@7wn)r1G pHhv?t?kJZ-tMg>gI#lYVkjLEnda3Ib_0qd%{{g4FdmA4h007@9vv2?a literal 0 HcmV?d00001 diff --git a/spec/fixtures/catalog_prep/zephir_upd_20241202.json.gz b/spec/fixtures/catalog_prep/zephir_upd_20241202.json.gz new file mode 120000 index 0000000..ca2ad02 --- /dev/null +++ b/spec/fixtures/catalog_prep/zephir_upd_20241202.json.gz @@ -0,0 +1 @@ +../catalog_archive/zephir_upd_20241202.json.gz \ No newline at end of file diff --git a/spec/fixtures/catalog_prep/zephir_upd_20241202_delete.txt.gz b/spec/fixtures/catalog_prep/zephir_upd_20241202_delete.txt.gz new file mode 100644 index 0000000000000000000000000000000000000000..7b29616d0987e48915142931d0d976e85f22b636 GIT binary patch literal 71 zcmb2|=HU2d9-Yp>T$Ngoky#X9T96WNWME`sXlQH@pOTuBT9T?)Qc=QSY+__)XllHD b!;?*049ty94b6<6FfnwT&9W3>U|;|MKRya>gxvVrTGcP4RGcUO)H8(Y{Brzx6$iT=%*U(7Uz$iW?KPM-#C_bgM NK(C~tgn^j>1_03r65Rj* literal 0 HcmV?d00001 diff --git a/spec/fixtures/hathifile_archive/hathi_upd_20241203.txt.gz b/spec/fixtures/hathifile_archive/hathi_upd_20241203.txt.gz index 1562826792c6af45b6d0b58ec7a710ff5197dea2..da31cb9604424ce97aa2188713f5e176a171fdb1 100644 GIT binary patch literal 574 zcmV-E0>S+siwFp&xMgPm188A%XlY+{aAaRHFfueTGB7hPba-?C?Uv1M+b|4(&+1d~ z0b-Zbk0oWNrUDnC57;Mq(fOZYcVp%WJ=C zCg4inyAr%%Sl`wf!#Jc3HdKvduNwC7+x%PvAKz zPITJv1YC^pf{qI>t<)Jmad^IkM{!(urIofy{FCs69Azfeuka)kXZI<8%0KSy_~Vw* 
zBY&n)5I5)NofhnkU-fP0bLd>{m%a@{^9rgwl#2)lw~?~198{aAHMF;najO*O55}UQ z%M`BKp=e=NJ9p>0D6oXru7yuN*0}6kls1nsDa<-w*Q2w{CYsW{FVqRwfH?Z%+mk{p zI5^DLHt_^5>pWjQR1?z}Ym&3!K2Lu{8&9l~s-F&=YgNa@wKWSB@6I2N|fMUwk zl#6FnjV$-RTl(Ph{mz!;;f8mUqr>ZcS6VP-AE@Zspt8Ns}tQLT158zkUJV(0$}WGm?yP5~QmN zI^$<>EwJNgsR5quUqI`>fDtQ2MWFPu06=l3m0|?XDNb#gV!e0ckfG^ZG97-qWU=<> ze2SsVv6=CqZa*h8;;oHC?R^e6zO^CZHiY*vL(+=jYfHK;-=J{~0E0sYexAVi9=4Ci zKTU&c?dj}8==TQTbY*saBU=p@Moy9wUA3{$g(~@V8>Ny;OL5IcVw@FaA^#gmNDf~) zzhoY=GjoY5bzqi~3F6gy9hHDPH>MC>jnV8~?}DF(4cgjw&0@gaf=Sk(J+@{HyWigG M2g%4*W&X6cw4X+;kMW*c@m z1HM3a8hAWDL+f9Ih5L38DE%b>RKr>+7ChxtOfG3Wxy@H(XP50vA1<#n267>%IM+1o zbU&K+VmG>5lOIENV!uMZr literal 0 HcmV?d00001 diff --git a/spec/fixtures/redirects/202301.ndj.gz b/spec/fixtures/redirects/202412.ndj.gz similarity index 100% rename from spec/fixtures/redirects/202301.ndj.gz rename to spec/fixtures/redirects/202412.ndj.gz diff --git a/spec/fixtures/redirects/redirects_202301.txt.gz b/spec/fixtures/redirects/redirects_202412.txt.gz similarity index 100% rename from spec/fixtures/redirects/redirects_202301.txt.gz rename to spec/fixtures/redirects/redirects_202412.txt.gz diff --git a/spec/fixtures/rights_archive/zephir_upd_20241202.rights b/spec/fixtures/rights_archive/zephir_upd_20241202.rights new file mode 100644 index 0000000..e69de29 diff --git a/spec/fixtures/www/hathi_file_list.json b/spec/fixtures/www/hathi_file_list.json index 68681a4..83e3a6d 100644 --- a/spec/fixtures/www/hathi_file_list.json +++ b/spec/fixtures/www/hathi_file_list.json @@ -22,5 +22,13 @@ "created": "2023-01-02 02:02:02 -0400", "modified": "2023-01-02 02:02:02 -0400", "url": "https://www.hathitrust.org/files/hathifiles/hathi_upd_20230102.txt.gz" + }, + { + "filename": "hathi_upd_20241203.txt.gz", + "full": false, + "size": 43, + "created": "2024-12-03 02:02:02 -0400", + "modified": "2024-12-03 02:02:02 -0400", + "url": "https://www.hathitrust.org/files/hathifiles/hathi_upd_20241203.txt.gz" } ] diff --git 
a/spec/fixtures/www/hathi_upd_20241203.txt.gz b/spec/fixtures/www/hathi_upd_20241203.txt.gz new file mode 100644 index 0000000000000000000000000000000000000000..7bfea5740ad64dfa84706d548df1d2abe69914c3 GIT binary patch literal 43 rcmb2|=HM{inVQbPoRL_Pkr`iFkP>fXU}R!wWMHgUQc=Rd%m4!b2I>i_ literal 0 HcmV?d00001 diff --git a/spec/integration/hathifiles_verifier_spec.rb b/spec/integration/hathifiles_verifier_spec.rb index 3b7fbb8..19c68c1 100644 --- a/spec/integration/hathifiles_verifier_spec.rb +++ b/spec/integration/hathifiles_verifier_spec.rb @@ -24,7 +24,7 @@ module PostZephirProcessing end it "rejects a file with fewer records than the corresponding catalog, some of which are malformed" do - verifier.run_for_date(date: Date.parse("2024-12-03")) + verifier.run_for_date(date: Date.parse("2024-12-04")) expect(verifier.errors).not_to be_empty end diff --git a/spec/integration/verify_spec.rb b/spec/integration/verify_spec.rb new file mode 100644 index 0000000..24a772d --- /dev/null +++ b/spec/integration/verify_spec.rb @@ -0,0 +1,49 @@ +require_relative "../../bin/verify" + +module PostZephirProcessing + RSpec.describe "#run_verify" do + include_context "with solr mocking" + include_context "with hathifile database" + + around(:each) do |example| + @test_log = StringIO.new + old_logger = Services.logger + Services.register(:logger) { Logger.new(@test_log, level: Logger::INFO) } + example.run + Services.register(:logger) { old_logger } + end + + it "runs without error for date with fixtures" do + ClimateControl.modify( + HATHIFILE_ARCHIVE: fixture("hathifile_archive"), + WWW_DIR: fixture("www"), + CATALOG_ARCHIVE: fixture("catalog_archive"), + RIGHTS_ARCHIVE: fixture("rights_archive"), + REDIRECTS_DIR: fixture("redirects"), + REDIRECTS_HISTORY_DIR: fixture("redirects"), + CATALOG_PREP: fixture("catalog_prep"), + TMPDIR: fixture("dollar_dup"), + SOLR_URL: "http://solr-sdr-catalog:9033/solr/catalog", + TZ: "America/Detroit" + ) do + 
stub_catalog_timerange("2024-12-03T05:00:00Z", 3) + with_fake_hf_log_entry(hathifile: "hathi_upd_20241203.txt.gz") do + PostZephirProcessing.run_verifiers(Date.parse("2024-12-03")) + end + end + + # TODO: dollar-dup, hf_log (database) + + %w[PostZephirVerifier + PopulateRightsVerifier + HathifilesVerifier + HathifilesDatabaseVerifier + HathifilesListingVerifier + HathifileRedirectsVerifier + CatalogIndexVerifier].each do |verifier| + expect(@test_log.string).to include(/.*INFO.*#{verifier}/) + end + expect(@test_log.string).not_to include(/.*ERROR.*/) + end + end +end diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index 6ba536b..d3679b0 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -14,6 +14,9 @@ require "journal" require "verifier" +require "support/solr_mock" +require "support/hathifile_database" + Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) SimpleCov.add_filter "spec" diff --git a/spec/support/hathifile_database.rb b/spec/support/hathifile_database.rb new file mode 100644 index 0000000..05f393e --- /dev/null +++ b/spec/support/hathifile_database.rb @@ -0,0 +1,23 @@ +module PostZephirProcessing + shared_context "with hathifile database" do + around(:each) do |example| + Services[:database][:hf].truncate + Services[:database][:hf_log].truncate + example.run + Services[:database][:hf].truncate + Services[:database][:hf_log].truncate + end + + # Temporarily add `hathifile` to `hf_log` with the current timestamp. + def with_fake_hf_log_entry(hathifile:) + Services[:database][:hf_log].insert(hathifile: hathifile) + yield + end + + # Temporarily add `htid` to `hf` with reasonable (and irrelevant) defaults. 
+ def with_fake_hf_entries(htids:) + htids.each { |htid| Services[:database][:hf].insert(htid: htid) } + yield + end + end +end diff --git a/spec/support/solr_mock.rb b/spec/support/solr_mock.rb new file mode 100644 index 0000000..2a7b05a --- /dev/null +++ b/spec/support/solr_mock.rb @@ -0,0 +1,32 @@ +RSpec.shared_context "with solr mocking" do + let(:solr_url) { "http://solr-sdr-catalog:9033/solr/catalog" } + + def stub_solr_count(fq:, result_count:) + url = "#{solr_url}/select?fq=#{URI.encode_www_form_component(fq)}&q=*:*&rows=0&wt=json" + + result = { + "responseHeader" => { + "status" => 0, + "QTime" => 0, + "params" => { + "q" => "*=>*", + "fq" => fq, + "rows" => "0", + "wt" => "json" + } + }, + "response" => {"numFound" => result_count, "start" => 0, "docs" => []} + }.to_json + + WebMock::API.stub_request(:get, url) + .to_return(body: result, headers: {"Content-Type" => "application/json"}) + end + + def stub_catalog_record_count(result_count) + stub_solr_count(fq: "deleted:false", result_count: result_count) + end + + def stub_catalog_timerange(datebegin, result_count) + stub_solr_count(fq: "time_of_index:[#{datebegin} TO NOW]", result_count: result_count) + end +end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index e6d9736..3281a70 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ b/spec/unit/catalog_indexing_verifier_spec.rb @@ -3,10 +3,11 @@ require "verifier/catalog_index_verifier" require "webmock" require "derivative/catalog" +require "uri" module PostZephirProcessing RSpec.describe(CatalogIndexVerifier) do - let(:solr_url) { "http://solr-sdr-catalog:9033/solr/catalog" } + include_context "with solr mocking" let(:verifier) { described_class.new } around(:each) do |example| @@ -21,48 +22,6 @@ module PostZephirProcessing end end - def stub_catalog_record_count(result_count) - url = "#{solr_url}/select?fq=deleted:false&q=*:*&rows=0&wt=json" - - result = { - "responseHeader" => { - 
"status" => 0, - "QTime" => 0, - "params" => { - "q" => "*=>*", - "fq" => "deleted:false", - "rows" => "0", - "wt" => "json" - } - }, - "response" => {"numFound" => result_count, "start" => 0, "docs" => []} - }.to_json - - WebMock::API.stub_request(:get, url) - .to_return(body: result, headers: {"Content-Type" => "application/json"}) - end - - def stub_catalog_timerange(datebegin, result_count) - url = "#{solr_url}/select?fq=time_of_index:[#{datebegin}%20TO%20NOW]&q=*:*&rows=0&wt=json" - - result = { - "responseHeader" => { - "status" => 0, - "QTime" => 0, - "params" => { - "q" => "*=>*", - "fq" => "time_of_index:[#{datebegin} TO NOW]", - "rows" => "0", - "wt" => "json" - } - }, - "response" => {"numFound" => result_count, "start" => 0, "docs" => []} - }.to_json - - WebMock::API.stub_request(:get, url) - .to_return(body: result, headers: {"Content-Type" => "application/json"}) - end - describe "#verify_index_count" do context "with a catalog update file with 3 records" do let(:catalog_update) { Derivative::CatalogArchive.new(date: Date.parse("2024-12-02"), full: false) } diff --git a/spec/unit/hathifiles_database_verifier_spec.rb b/spec/unit/hathifiles_database_verifier_spec.rb index abd80df..8eecc2c 100644 --- a/spec/unit/hathifiles_database_verifier_spec.rb +++ b/spec/unit/hathifiles_database_verifier_spec.rb @@ -4,14 +4,12 @@ module PostZephirProcessing RSpec.describe(HathifilesDatabaseVerifier) do + include_context "with hathifile database" + around(:each) do |example| with_test_environment do ClimateControl.modify(HATHIFILE_ARCHIVE: fixture("hathifile_archive")) do - Services[:database][:hf].truncate - Services[:database][:hf_log].truncate example.run - Services[:database][:hf].truncate - Services[:database][:hf_log].truncate end end end @@ -26,18 +24,6 @@ module PostZephirProcessing let(:fake_upd_htids) { (1..5).map { |n| "test.%03d" % n } } let(:fake_full_htids) { (1..11).map { |n| "test.%03d" % n } } - # Temporarily add `hathifile` to `hf_log` with the 
current timestamp. - def with_fake_hf_log_entry(hathifile:) - Services[:database][:hf_log].insert(hathifile: hathifile) - yield - end - - # Temporarily add `htid` to `hf` with reasonable (and irrelevant) defaults. - def with_fake_hf_entries(htids:) - htids.each { |htid| Services[:database][:hf].insert(htid: htid) } - yield - end - describe ".has_log?" do context "with corresponding hf_log" do it "returns `true`" do diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index 1016564..862fb47 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -5,7 +5,7 @@ module PostZephirProcessing RSpec.describe(HathifileRedirectsVerifier) do - let(:test_date) { Date.parse("2023-01-01") } + let(:test_date) { Date.parse("2024-12-01") } let(:verifier) { described_class.new(date: test_date) } let(:redirects_file) { verifier.redirects_file(date: test_date) } let(:redirects_history_file) { verifier.redirects_history_file(date: test_date) } @@ -33,11 +33,11 @@ module PostZephirProcessing # copy fixture to temporary subdir def stage_redirects_file - FileUtils.cp(fixture("redirects/redirects_202301.txt.gz"), ENV["REDIRECTS_DIR"]) + FileUtils.cp(fixture("redirects/redirects_202412.txt.gz"), ENV["REDIRECTS_DIR"]) end def stage_redirects_history_file - FileUtils.cp(fixture("redirects/202301.ndj.gz"), ENV["REDIRECTS_HISTORY_DIR"]) + FileUtils.cp(fixture("redirects/202412.ndj.gz"), ENV["REDIRECTS_HISTORY_DIR"]) end # Intentionally add mess to an otherwise wellformed file to trigger errors From bd141dd81c0dea440053525b3f341bcd5b5e5fea Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Fri, 20 Dec 2024 16:17:00 -0500 Subject: [PATCH 104/114] - Change the semantics of `run_for_date` parameter to always mean "run date" and not "datestamp" - `Derivative` subclasses can provide a timestamp delta - PostZephirDerivatives and Dates also changed to "run date" semantics. 
--- bin/post_zephir.rb | 8 +++--- bin/verify.rb | 5 ++-- lib/dates.rb | 25 ++++++++--------- lib/derivative.rb | 15 ++++++++-- lib/derivative/catalog.rb | 6 +++- lib/derivative/delete.rb | 4 +++ lib/derivative/dollar_dup.rb | 4 +++ lib/derivative/ht_bib_export.rb | 4 +++ lib/derivative/ingest_bibrecord.rb | 2 +- lib/derivative/rights.rb | 6 +++- lib/post_zephir_derivatives.rb | 4 +-- lib/verifier.rb | 5 ++++ lib/verifier/catalog_index_verifier.rb | 5 ++-- lib/verifier/post_zephir_verifier.rb | 2 +- .../derivatives_integration_spec.rb | 14 +++++----- spec/unit/catalog_indexing_verifier_spec.rb | 4 +-- spec/unit/dates_spec.rb | 10 +++---- spec/unit/derivative/catalog_spec.rb | 14 +++++----- spec/unit/derivative/delete_spec.rb | 10 +++---- spec/unit/derivative/dollar_dup_spec.rb | 14 +++++----- spec/unit/derivative/ingest_bibrecord_spec.rb | 12 ++++---- spec/unit/derivative/rights_spec.rb | 14 +++++----- spec/unit/derivative_spec.rb | 2 +- spec/unit/populate_rights_verifier_spec.rb | 12 ++++---- spec/unit/post_zephir_derivatives_spec.rb | 4 +-- spec/unit/post_zephir_verifier_spec.rb | 28 +++++++++---------- 26 files changed, 135 insertions(+), 98 deletions(-) diff --git a/bin/post_zephir.rb b/bin/post_zephir.rb index bc2ad7e..b2f6db1 100755 --- a/bin/post_zephir.rb +++ b/bin/post_zephir.rb @@ -8,7 +8,7 @@ require "logger" require_relative "../lib/dates" -require_relative "../lib/derivatives" +require_relative "../lib/post_zephir_derivatives" require_relative "../lib/journal" def run_system_command(command) @@ -22,11 +22,11 @@ def run_system_command(command) INCREMENTAL_SCRIPT = File.join(HOME, "run_process_zephir_incremental.sh") YESTERDAY = Date.today - 1 -inventory = PostZephirProcessing::Derivatives.new(date: YESTERDAY) +derivatives = PostZephirProcessing::PostZephirDerivatives.new dates = [] # Is there a missing date? Plug them into an array to process. -if !inventory.earliest_missing_date.nil? 
- dates = (inventory.earliest_missing_date..YESTERDAY) +if !derivatives.earliest_missing_date.nil? + dates = ((derivatives.earliest_missing_date - 1)..YESTERDAY) end LOGGER.info "Processing Zephir files from #{dates}" diff --git a/bin/verify.rb b/bin/verify.rb index 4ce7d37..adac79e 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -17,8 +17,9 @@ module PostZephirProcessing def self.run_verifiers(date_to_check) [ # all outputs here are date-stamped with yesterday's date - -> { PostZephirVerifier.new.run_for_date(date: date_to_check - 1) }, - -> { PopulateRightsVerifier.new.run_for_date(date: date_to_check - 1) }, + # This is taken care of by the `Derivative` subclasses. + -> { PostZephirVerifier.new.run_for_date(date: date_to_check) }, + -> { PopulateRightsVerifier.new.run_for_date(date: date_to_check) }, # these are today's date -> { HathifilesVerifier.new.run_for_date(date: date_to_check) }, diff --git a/lib/dates.rb b/lib/dates.rb index 9ef1010..f7ad0f5 100644 --- a/lib/dates.rb +++ b/lib/dates.rb @@ -14,32 +14,31 @@ def first_of_month? end module PostZephirProcessing - # A class that determines the filename "dates of interest" when looking at directories + # A class that determines the "dates of interest" when looking at directories # full of Zephir derivative files. - # `Dates.new.all_dates` calculates an array of Dates from the most recent - # last day of the month up to yesterday, inclusive. + # + # NOTE: these are "run dates" and not file datestamps. + # The Derivative subclasses will use appropriate deltas to derive datestamps. + # + # `Dates.new.all_dates` calculates an array of Dates from the first of the month up to today, inclusive. 
class Dates attr_reader :date - # @param date [Date] the file datestamp date, not the "run date", yesterday by default - def initialize(date: (Date.today - 1)) + # @param date [Date] the "run date" + def initialize(date: Date.today) @date = date end # The standard start date (last of month) to "present" # @return [Array] sorted ASC def all_dates - @all_dates ||= (start_date..date).to_a.sort + @all_dates ||= (first_of_month..date).to_a.sort end - # The most recent last day of the month (which may be today) + # The first of the month, relative to the "run date" # @return [Date] - def start_date - @start_date ||= if date.last_of_month? - date - else - Date.new(date.year, date.month, 1) - 1 - end + def first_of_month + Date.new(date.year, date.month, 1) end end end diff --git a/lib/derivative.rb b/lib/derivative.rb index 3655d86..1f3127b 100644 --- a/lib/derivative.rb +++ b/lib/derivative.rb @@ -13,6 +13,17 @@ def full? full end + # Difference in days between when the workflow/verifier is run, and the datestamp on the files. + # Post-zephir derivatives are datestamped yesterday relative to the run date. + # Everything else is same day (the default). + def datestamp_delta + 0 + end + + def datestamp + date + datestamp_delta + end + def path File.join( template[:location], @@ -21,8 +32,8 @@ def path end def datestamped_file - template[:name].sub(/YYYYMMDD/i, date.strftime("%Y%m%d")) - .sub(/YYYY-MM-DD/i, date.strftime("%Y-%m-%d")) + template[:name].sub(/YYYYMMDD/i, datestamp.strftime("%Y%m%d")) + .sub(/YYYY-MM-DD/i, datestamp.strftime("%Y-%m-%d")) end def self.derivatives_for_date(date:) diff --git a/lib/derivative/catalog.rb b/lib/derivative/catalog.rb index bc47b98..d8f1c68 100644 --- a/lib/derivative/catalog.rb +++ b/lib/derivative/catalog.rb @@ -10,7 +10,7 @@ def self.derivatives_for_date(date:) ) ] - if date.last_of_month? + if date.first_of_month? 
derivatives << new( full: true, date: date @@ -20,6 +20,10 @@ def self.derivatives_for_date(date:) derivatives end + def datestamp_delta + -1 + end + def template { location: location, diff --git a/lib/derivative/delete.rb b/lib/derivative/delete.rb index 200b76b..25fc2c8 100644 --- a/lib/derivative/delete.rb +++ b/lib/derivative/delete.rb @@ -16,6 +16,10 @@ def self.derivatives_for_date(date:) ] end + def datestamp_delta + -1 + end + def template { location: ENV["CATALOG_PREP"], diff --git a/lib/derivative/dollar_dup.rb b/lib/derivative/dollar_dup.rb index 2ac220f..7205758 100644 --- a/lib/derivative/dollar_dup.rb +++ b/lib/derivative/dollar_dup.rb @@ -16,6 +16,10 @@ def self.derivatives_for_date(date:) ] end + def datestamp_delta + -1 + end + def template { location: ENV["TMPDIR"] || File.join(ENV["DATA_ROOT"], "work"), diff --git a/lib/derivative/ht_bib_export.rb b/lib/derivative/ht_bib_export.rb index 813f97f..e5262bf 100644 --- a/lib/derivative/ht_bib_export.rb +++ b/lib/derivative/ht_bib_export.rb @@ -2,6 +2,10 @@ module PostZephirProcessing class Derivative::HTBibExport < Derivative + def datestamp_delta + -1 + end + def template { location: ENV["ZEPHIR_DATA"], diff --git a/lib/derivative/ingest_bibrecord.rb b/lib/derivative/ingest_bibrecord.rb index 494835d..151bc4a 100644 --- a/lib/derivative/ingest_bibrecord.rb +++ b/lib/derivative/ingest_bibrecord.rb @@ -13,7 +13,7 @@ def path end def self.derivatives_for_date(date:) - if date.last_of_month? + if date.first_of_month? [ new(name: "groove_full.tsv.gz"), new(name: "zephir_ingested_items.txt.gz") diff --git a/lib/derivative/rights.rb b/lib/derivative/rights.rb index 2e83097..8e39ad2 100644 --- a/lib/derivative/rights.rb +++ b/lib/derivative/rights.rb @@ -10,7 +10,7 @@ def self.derivatives_for_date(date:) ) ] - if date.last_of_month? + if date.first_of_month? 
derivatives << new( full: true, date: date @@ -20,6 +20,10 @@ def self.derivatives_for_date(date:) derivatives end + def datestamp_delta + -1 + end + def template { location: ENV["RIGHTS_ARCHIVE"] || File.join(ENV.fetch("RIGHTS_DIR"), "archive"), diff --git a/lib/post_zephir_derivatives.rb b/lib/post_zephir_derivatives.rb index 29b5e67..784ebc9 100644 --- a/lib/post_zephir_derivatives.rb +++ b/lib/post_zephir_derivatives.rb @@ -13,8 +13,8 @@ module PostZephirProcessing class PostZephirDerivatives attr_reader :dates - # @param date [Date] the file datestamp date, not the "run date" - def initialize(date: (Date.today - 1)) + # @param date [Date] the run date (the datestamp date will be one day earlier) + def initialize(date: Date.today) @dates = Dates.new(date: date) end diff --git a/lib/verifier.rb b/lib/verifier.rb index eae6c62..edad6fc 100644 --- a/lib/verifier.rb +++ b/lib/verifier.rb @@ -23,6 +23,11 @@ def run # Verify outputs for one date. # Useful for verifying datestamped files. + # @param date [Date] "today", or more accurately "the day on which the processes we are verifying ran". + # The names of derivative files can differ from this date + # depending on the type of process or workflow step. + # Post Zephir derivatives are stamped one day ago. + # Hathifiles are stamped today. 
def run_for_date(date:) info message: "running for #{date}" end diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index_verifier.rb index e169df4..31d0c82 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index_verifier.rb @@ -16,7 +16,7 @@ def verify_index_count(derivative:) solr_count = solr_nondeleted_records query_desc = "existed" else - date_of_indexing = derivative.date + 1 + date_of_indexing = derivative.date solr_count = solr_count(date_of_indexing) query_desc = "had time_of_indexing on #{date_of_indexing}" end @@ -51,9 +51,10 @@ def run_for_date(date:) # The dates on the files are the previous day, but the indexing # happens on the current day. When we verify the current day, we are # verifying that the file named for the _previous_ day was produced. + # This is handled by the derivative class `datestamp_delta` @current_date = date - Derivative::CatalogArchive.derivatives_for_date(date: date - 1).each do |derivative| + Derivative::CatalogArchive.derivatives_for_date(date: date).each do |derivative| path = derivative.path next unless verify_file(path: path) diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index b8656a1..7577c9d 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -40,7 +40,7 @@ def verify_catalog_update_archive(date: current_date) # readable # line count must be the same as input JSON def verify_catalog_full_archive(date: current_date) - return unless date.last_of_month? + return unless date.first_of_month? 
output_path = Derivative::CatalogArchive.new(date: date, full: true).path input_path = Derivative::HTBibExport.new(date: date, full: true).path diff --git a/spec/integration/derivatives_integration_spec.rb b/spec/integration/derivatives_integration_spec.rb index d449f09..7644072 100644 --- a/spec/integration/derivatives_integration_spec.rb +++ b/spec/integration/derivatives_integration_spec.rb @@ -59,10 +59,10 @@ def delete_file_for_date(date:) # @param date [Date] determines the month and year for the file datestamps def setup_test_files(date:) - start_date = Date.new(date.year, date.month - 1, -1) + start_date = Date.new(date.year, date.month, 1) `touch #{full_file_for_date(date: start_date)}` `touch #{full_rights_file_for_date(date: start_date)}` - end_date = Date.new(date.year, date.month, -2) + end_date = Date.new(date.year, date.month, -1) (start_date..end_date).each do |d| `touch #{update_file_for_date(date: d)}` `touch #{delete_file_for_date(date: d)}` @@ -78,11 +78,11 @@ def setup_test_files(date:) end end - it "with no files present, returns the last day of last month" do + it "with no files present, returns the first day the month" do expect(described_class - .new(date: Date.parse("2023-01-15")) + .new(date: Date.parse("2024-12-15")) .earliest_missing_date) - .to eq Date.parse("2022-12-31") + .to eq Date.parse("2024-12-01") end context "with test files" do @@ -102,8 +102,8 @@ def setup_test_files(date:) expect(verifier.earliest_missing_date).to eq date end - it "with monthly file missing, returns the last day of the last month" do - date = Date.parse("2023-10-31") + it "with monthly file missing, returns the first of the month" do + date = Date.parse("2023-11-01") FileUtils.rm full_file_for_date(date: date) expect(verifier.earliest_missing_date).to eq date end diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/catalog_indexing_verifier_spec.rb index 3281a70..345767f 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ 
b/spec/unit/catalog_indexing_verifier_spec.rb @@ -27,7 +27,7 @@ module PostZephirProcessing let(:catalog_update) { Derivative::CatalogArchive.new(date: Date.parse("2024-12-02"), full: false) } # indexed the day after the date in the filename starting at midnight # EST - let(:catalog_index_begin) { "2024-12-03T05:00:00Z" } + let(:catalog_index_begin) { "2024-12-02T05:00:00Z" } it "accepts a catalog with 3 recent updates" do stub_catalog_timerange(catalog_index_begin, 3) @@ -52,7 +52,7 @@ module PostZephirProcessing end context "with a catalog full file with 5 records" do - let(:catalog_full) { Derivative::CatalogArchive.new(date: Date.parse("2024-11-30"), full: true) } + let(:catalog_full) { Derivative::CatalogArchive.new(date: Date.parse("2024-12-01"), full: true) } it "accepts a catalog with 5 records" do stub_catalog_record_count(5) diff --git a/spec/unit/dates_spec.rb b/spec/unit/dates_spec.rb index 7a4450c..359d1de 100644 --- a/spec/unit/dates_spec.rb +++ b/spec/unit/dates_spec.rb @@ -3,18 +3,18 @@ module PostZephirProcessing RSpec.describe(Dates) do describe "#all_dates" do - context "with a date before the last of the month" do + context "with a date after the first of the month" do it "returns a range of more than one date" do - (1..30).each do |day| - date = Date.new(2023, 10, day) + (2..31).each do |day| + date = Date.new(2024, 12, day) expect(described_class.new(date: date).all_dates.count).to be > 1 end end end - context "with the last of the month" do + context "with the first of the month" do it "returns only the reference date" do - date = Date.new(2023, 10, 31) + date = Date.new(2024, 12, 1) expect(described_class.new(date: date).all_dates).to eq [date] end end diff --git a/spec/unit/derivative/catalog_spec.rb b/spec/unit/derivative/catalog_spec.rb index c802fed..bbdcc30 100644 --- a/spec/unit/derivative/catalog_spec.rb +++ b/spec/unit/derivative/catalog_spec.rb @@ -16,29 +16,29 @@ module PostZephirProcessing end end - let(:test_date_first_of_month) 
{ Date.parse("2023-11-01") } - let(:test_date_last_of_month) { Date.parse("2023-11-30") } + let(:test_date_first_of_month) { Date.parse("2023-12-01") } + let(:test_date_second_of_month) { Date.parse("2023-12-02") } let(:params) do { - date: test_date_last_of_month + date: test_date_first_of_month } end let(:derivative) { described_class.new(**params) } describe "self.derivatives_for_date" do - it "returns 2 derivatives (one full, one upd) on the last of month" do + it "returns 2 derivatives (one full, one upd) on the first of month" do derivatives = described_class.derivatives_for_date( - date: test_date_last_of_month + date: test_date_first_of_month ) expect(derivatives.count).to eq 2 expect(derivatives.count { |d| d.full == true }).to eq 1 expect(derivatives.count { |d| d.full == false }).to eq 1 end - it "returns 1 derivative on the first of month" do + it "returns 1 derivative after the first of month" do derivatives = described_class.derivatives_for_date( - date: test_date_first_of_month + date: test_date_second_of_month ) expect(derivatives.count).to eq 1 end diff --git a/spec/unit/derivative/delete_spec.rb b/spec/unit/derivative/delete_spec.rb index 4315774..69c54bd 100644 --- a/spec/unit/derivative/delete_spec.rb +++ b/spec/unit/derivative/delete_spec.rb @@ -16,20 +16,20 @@ module PostZephirProcessing end end - let(:test_date_first_of_month) { Date.parse("2023-11-01") } - let(:test_date_last_of_month) { Date.parse("2023-11-30") } + let(:test_date_first_of_month) { Date.parse("2023-12-01") } + let(:test_date_second_of_month) { Date.parse("2023-12-02") } let(:params) do { - date: test_date_last_of_month + date: test_date_first_of_month } end let(:derivative) { described_class.new(**params) } describe "self.derivatives_for_date" do - it "returns 1 derivative (upd) on the last of month" do + it "returns 1 derivative (upd) after the first of month" do derivatives = described_class.derivatives_for_date( - date: test_date_last_of_month + date: 
test_date_first_of_month ) expect(derivatives.count).to eq 1 expect(derivatives.first.full?).to be false diff --git a/spec/unit/derivative/dollar_dup_spec.rb b/spec/unit/derivative/dollar_dup_spec.rb index 97320c8..a925505 100644 --- a/spec/unit/derivative/dollar_dup_spec.rb +++ b/spec/unit/derivative/dollar_dup_spec.rb @@ -16,28 +16,28 @@ module PostZephirProcessing end let(:test_date_first_of_month) { Date.parse("2023-11-01") } - let(:test_date_last_of_month) { Date.parse("2023-11-30") } + let(:test_date_second_of_month) { Date.parse("2023-11-02") } let(:params) do { - date: test_date_last_of_month, + date: test_date_first_of_month, full: false } end let(:derivative) { described_class.new(**params) } describe "self.derivatives_for_date" do - it "returns 1 derivative (upd) on the last of month" do + it "returns 1 derivative (upd) on the first of month" do derivatives = described_class.derivatives_for_date( - date: test_date_last_of_month + date: test_date_first_of_month ) expect(derivatives.count).to eq 1 expect(derivatives.first.full?).to be false end - it "returns 1 derivative (upd) on the first of month" do + it "returns 1 derivative (upd) after the first of month" do derivatives = described_class.derivatives_for_date( - date: test_date_first_of_month + date: test_date_second_of_month ) expect(derivatives.count).to eq 1 expect(derivatives.first.full?).to be false @@ -45,7 +45,7 @@ module PostZephirProcessing end it "reports the expected path for a dollar dup file" do - expect(derivative.path).to eq "/tmp/vufind_incremental_2023-11-30_dollar_dup.txt.gz" + expect(derivative.path).to eq "/tmp/vufind_incremental_2023-10-31_dollar_dup.txt.gz" end it "raises if a full file is requested" do diff --git a/spec/unit/derivative/ingest_bibrecord_spec.rb b/spec/unit/derivative/ingest_bibrecord_spec.rb index 6840d97..376f096 100644 --- a/spec/unit/derivative/ingest_bibrecord_spec.rb +++ b/spec/unit/derivative/ingest_bibrecord_spec.rb @@ -15,18 +15,18 @@ module 
PostZephirProcessing end end - let(:test_date_last_of_month) { Date.parse("2023-11-30") } + let(:test_date_first_of_month) { Date.parse("2023-12-01") } describe "#{described_class}.derivatives_for_date" do - it "returns 2 derivatives on the last of the month, otherwise 0" do - 1.upto(29) do |day| - date = Date.new(2023, 11, day) + it "returns 2 derivatives on the first of the month, otherwise 0" do + 2.upto(31) do |day| + date = Date.new(2023, 12, day) expect(described_class.derivatives_for_date(date: date)).to be_empty end - expect(described_class.derivatives_for_date(date: test_date_last_of_month).count).to eq 2 + expect(described_class.derivatives_for_date(date: test_date_first_of_month).count).to eq 2 end it "reports the expected paths" do - derivative_paths = described_class.derivatives_for_date(date: test_date_last_of_month).map { |d| d.path } + derivative_paths = described_class.derivatives_for_date(date: test_date_first_of_month).map { |d| d.path } expect(derivative_paths).to include(fixture("ingest_bibrecords/groove_full.tsv.gz")) expect(derivative_paths).to include(fixture("ingest_bibrecords/zephir_ingested_items.txt.gz")) end diff --git a/spec/unit/derivative/rights_spec.rb b/spec/unit/derivative/rights_spec.rb index e68d71f..c1c7178 100644 --- a/spec/unit/derivative/rights_spec.rb +++ b/spec/unit/derivative/rights_spec.rb @@ -15,29 +15,29 @@ module PostZephirProcessing end end - let(:test_date_first_of_month) { Date.parse("2023-11-01") } - let(:test_date_last_of_month) { Date.parse("2023-11-30") } + let(:test_date_first_of_month) { Date.parse("2023-12-01") } + let(:test_date_second_of_month) { Date.parse("2023-12-02") } let(:params) do { - date: test_date_last_of_month + date: test_date_first_of_month } end let(:derivative) { described_class.new(**params) } describe "self.derivatives_for_date" do - it "returns 2 derivatives (one full, one upd) on the last of month" do + it "returns 2 derivatives (one full, one upd) on the first of the month" do 
derivatives = described_class.derivatives_for_date( - date: test_date_last_of_month + date: test_date_first_of_month ) expect(derivatives.count).to eq 2 expect(derivatives.count { |d| d.full == true }).to eq 1 expect(derivatives.count { |d| d.full == false }).to eq 1 end - it "returns 1 derivative on the first of month" do + it "returns 1 derivative on the second of month" do derivatives = described_class.derivatives_for_date( - date: test_date_first_of_month + date: test_date_second_of_month ) expect(derivatives.count).to eq 1 end diff --git a/spec/unit/derivative_spec.rb b/spec/unit/derivative_spec.rb index 7026641..44eb144 100644 --- a/spec/unit/derivative_spec.rb +++ b/spec/unit/derivative_spec.rb @@ -5,7 +5,7 @@ module PostZephirProcessing RSpec.describe(Derivative) do let(:test_date_first_of_month) { Date.parse("2023-11-01") } - let(:test_date_last_of_month) { Date.parse("2023-11-30") } + # let(:test_date_last_of_month) { Date.parse("2023-11-30") } let(:params) do { diff --git a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/populate_rights_verifier_spec.rb index 774e2a0..e7c7547 100644 --- a/spec/unit/populate_rights_verifier_spec.rb +++ b/spec/unit/populate_rights_verifier_spec.rb @@ -52,9 +52,9 @@ def insert_fake_rights(namespace:, id:) describe "#run_for_date" do it "logs no `missing rights_current` error for full file" do - date = Date.parse("2024-11-30") - with_fake_rights_file(date: date, full: true) do - verifier.run_for_date(date: date) + run_date = Date.parse("2024-12-01") + with_fake_rights_file(date: run_date, full: true) do + verifier.run_for_date(date: run_date) # The only error is for the missing upd file. expect(verifier.errors.count).to eq 1 missing_rights_errors = verifier.errors.select { |err| /missing rights_current/.match? 
err } @@ -81,9 +81,9 @@ def insert_fake_rights(namespace:, id:) context "with no HTIDs in the rights database" do describe "#run_for_date" do it "logs `missing rights_current` error for full file" do - date = Date.parse("2024-11-30") - with_fake_rights_file(date: date, full: true) do - verifier.run_for_date(date: date) + run_date = Date.parse("2024-12-01") + with_fake_rights_file(date: run_date, full: true) do + verifier.run_for_date(date: run_date) # There will be an error for the missing upd file, ignore it. missing_rights_errors = verifier.errors.select { |err| /missing rights_current/.match? err } expect(missing_rights_errors.count).to eq test_rights.count diff --git a/spec/unit/post_zephir_derivatives_spec.rb b/spec/unit/post_zephir_derivatives_spec.rb index 055f899..1f6598f 100644 --- a/spec/unit/post_zephir_derivatives_spec.rb +++ b/spec/unit/post_zephir_derivatives_spec.rb @@ -10,8 +10,8 @@ module PostZephirProcessing expect(described_class.new).to be_an_instance_of(PostZephirDerivatives) end - it "has a default date of yesterday" do - expect(described_class.new.dates.date).to eq(Date.today - 1) + it "has a default date of today" do + expect(described_class.new.dates.date).to eq Date.today end end diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 7c79211..1b2f011 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -115,7 +115,7 @@ def expect_deletefile_ok(contents) describe "#verify_catalog_full_archive" do let(:verifier) { described_class.new } - let(:test_date) { Date.parse("2024-11-30") } + let(:test_date) { Date.parse("2024-12-01") } it "requires input file to have same line count as output file" do # We have fixtures with matching line counts for test_date, # so expect no warnings @@ -146,7 +146,7 @@ def expect_deletefile_ok(contents) end end - test_date = Date.parse("2024-11-30") + test_date = Date.parse("2024-12-01") context "with all the expected files" 
do it "reports no errors" do # Create and test upd, full, and deletes in @tmpdir/catalog_prep @@ -172,7 +172,7 @@ def expect_deletefile_ok(contents) test_date = Date.parse("2024-12-01") context "with empty file" do it "reports no errors" do - dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") + dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-11-30_dollar_dup.txt.gz") Zlib::GzipWriter.open(dollar_dup_path) { |output_gz| } verifier = described_class.new verifier.verify_dollar_dup(date: test_date) @@ -182,7 +182,7 @@ def expect_deletefile_ok(contents) context "with nonempty file" do it "reports one `spurious dollar_dup lines` error" do - dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-12-01_dollar_dup.txt.gz") + dollar_dup_path = File.join(@tmpdir, "vufind_incremental_2024-11-30_dollar_dup.txt.gz") Zlib::GzipWriter.open(dollar_dup_path) do |output_gz| output_gz.puts <<~GZ uc1.b275234 @@ -220,8 +220,8 @@ def expect_deletefile_ok(contents) end end - context "last day of month" do - test_date = Date.parse("2024-11-30") + context "first day of month" do + test_date = Date.parse("2024-12-01") context "with expected groove_full and zephir_ingested_items files" do it "reports no errors" do FileUtils.touch(File.join(@tmpdir, "groove_full.tsv.gz")) @@ -254,7 +254,7 @@ def expect_deletefile_ok(contents) end context "non-last day of month" do - test_date = Date.parse("2024-12-01") + test_date = Date.parse("2024-12-02") it "reports no errors" do verifier = described_class.new verifier.verify_ingest_bibrecords(date: test_date) @@ -269,15 +269,15 @@ def expect_deletefile_ok(contents) example.run end end - context "last day of month" do - test_date = Date.parse("2024-11-30") + context "first day of month" do + test_date = Date.parse("2024-12-01") context "with full and update rights files" do it "reports no errors" do verifier = described_class.new - upd_rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", 
test_date.strftime("%Y%m%d")) + upd_rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", (test_date - 1).strftime("%Y%m%d")) upd_rights_path = File.join(@tmpdir, upd_rights_file) File.write(upd_rights_path, well_formed_rights_file_content) - full_rights_file = "zephir_full_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + full_rights_file = "zephir_full_YYYYMMDD.rights".gsub("YYYYMMDD", (test_date - 1).strftime("%Y%m%d")) full_rights_path = File.join(@tmpdir, full_rights_file) File.write(full_rights_path, well_formed_rights_file_content) verifier.verify_rights(date: test_date) @@ -297,12 +297,12 @@ def expect_deletefile_ok(contents) end end - context "non-last day of month" do - test_date = Date.parse("2024-12-01") + context "after first of month" do + test_date = Date.parse("2024-12-02") context "with update rights file" do it "reports no errors" do verifier = described_class.new - rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", test_date.strftime("%Y%m%d")) + rights_file = "zephir_upd_YYYYMMDD.rights".gsub("YYYYMMDD", (test_date - 1).strftime("%Y%m%d")) rights_path = File.join(@tmpdir, rights_file) File.write(rights_path, well_formed_rights_file_content) verifier.verify_rights(date: test_date) From 6fa3bcdf14557d0e71051455947b843232203bce Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Fri, 20 Dec 2024 16:33:40 -0500 Subject: [PATCH 105/114] - Move `derivatives_integration_spec.rb` to `post_zephir_derivatives_integration_spec.rb` - Move `hathifile_derivative_spec.rb` to `derivative/hathifile_spec.rb` - Rename `HathifileRedirectsVerifier` to `HathifilesRedirectsVerifier` --- bin/verify.rb | 2 +- lib/verifier/hathifiles_redirects_verifier.rb | 2 +- ...tion_spec.rb => post_zephir_derivatives_integration_spec.rb} | 0 spec/integration/verify_spec.rb | 2 +- .../hathifile_spec.rb} | 0 spec/unit/hathifiles_redirects_verifier_spec.rb | 2 +- 6 files changed, 4 insertions(+), 4 deletions(-) rename 
spec/integration/{derivatives_integration_spec.rb => post_zephir_derivatives_integration_spec.rb} (100%) rename spec/unit/{hathifile_derivative_spec.rb => derivative/hathifile_spec.rb} (100%) diff --git a/bin/verify.rb b/bin/verify.rb index adac79e..ec67393 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -25,7 +25,7 @@ def self.run_verifiers(date_to_check) -> { HathifilesVerifier.new.run_for_date(date: date_to_check) }, -> { HathifilesDatabaseVerifier.new.run_for_date(date: date_to_check) }, -> { HathifilesListingVerifier.new.run_for_date(date: date_to_check) }, - -> { HathifileRedirectsVerifier.new.run_for_date(date: date_to_check) }, + -> { HathifilesRedirectsVerifier.new.run_for_date(date: date_to_check) }, -> { CatalogIndexVerifier.new.run_for_date(date: date_to_check) }, ].each do |verifier_lambda| begin diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects_verifier.rb index c3e6a64..b6f815d 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects_verifier.rb @@ -3,7 +3,7 @@ require "verifier" module PostZephirProcessing - class HathifileRedirectsVerifier < Verifier + class HathifilesRedirectsVerifier < Verifier attr_reader :current_date REDIRECTS_REGEX = /^\d{9}\t\d{9}$/ diff --git a/spec/integration/derivatives_integration_spec.rb b/spec/integration/post_zephir_derivatives_integration_spec.rb similarity index 100% rename from spec/integration/derivatives_integration_spec.rb rename to spec/integration/post_zephir_derivatives_integration_spec.rb diff --git a/spec/integration/verify_spec.rb b/spec/integration/verify_spec.rb index 24a772d..908d94a 100644 --- a/spec/integration/verify_spec.rb +++ b/spec/integration/verify_spec.rb @@ -39,7 +39,7 @@ module PostZephirProcessing HathifilesVerifier HathifilesDatabaseVerifier HathifilesListingVerifier - HathifileRedirectsVerifier + HathifilesRedirectsVerifier CatalogIndexVerifier].each do |verifier| expect(@test_log.string).to 
include(/.*INFO.*#{verifier}/) end diff --git a/spec/unit/hathifile_derivative_spec.rb b/spec/unit/derivative/hathifile_spec.rb similarity index 100% rename from spec/unit/hathifile_derivative_spec.rb rename to spec/unit/derivative/hathifile_spec.rb diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/hathifiles_redirects_verifier_spec.rb index 862fb47..8eeb922 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/hathifiles_redirects_verifier_spec.rb @@ -4,7 +4,7 @@ require "zlib" module PostZephirProcessing - RSpec.describe(HathifileRedirectsVerifier) do + RSpec.describe(HathifilesRedirectsVerifier) do let(:test_date) { Date.parse("2024-12-01") } let(:verifier) { described_class.new(date: test_date) } let(:redirects_file) { verifier.redirects_file(date: test_date) } From 829d4fe7a4bc29fba811c6101bd87e1bc035090a Mon Sep 17 00:00:00 2001 From: Aaron Elkiss Date: Fri, 20 Dec 2024 16:51:30 -0500 Subject: [PATCH 106/114] run_verifiers: use class names instead of lambdas previous commit changed everything use current date instead of filename date, so we can call the verifiers as they were before. --- bin/verify.rb | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/bin/verify.rb b/bin/verify.rb index ec67393..72c5f30 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -16,23 +16,19 @@ module PostZephirProcessing def self.run_verifiers(date_to_check) [ - # all outputs here are date-stamped with yesterday's date - # This is taken care of by the `Derivative` subclasses. 
- -> { PostZephirVerifier.new.run_for_date(date: date_to_check) }, - -> { PopulateRightsVerifier.new.run_for_date(date: date_to_check) }, - - # these are today's date - -> { HathifilesVerifier.new.run_for_date(date: date_to_check) }, - -> { HathifilesDatabaseVerifier.new.run_for_date(date: date_to_check) }, - -> { HathifilesListingVerifier.new.run_for_date(date: date_to_check) }, - -> { HathifilesRedirectsVerifier.new.run_for_date(date: date_to_check) }, - -> { CatalogIndexVerifier.new.run_for_date(date: date_to_check) }, - ].each do |verifier_lambda| + PostZephirVerifier, + PopulateRightsVerifier, + HathifilesVerifier, + HathifilesDatabaseVerifier, + HathifilesListingVerifier, + HathifilesRedirectsVerifier, + CatalogIndexVerifier + ].each do |klass| begin - verifier_lambda.call - # Very simple minded exception handler so we can in theory check subsequent workflow steps + klass.new.run_for_date(date: date_to_check) + # Very simple minded exception handler so we can in theory check subsequent workflow steps rescue StandardError => e - Services[:logger].fatal e + PostZephirProcessing::Services[:logger].fatal e end end end From 0bbde912b66224209aa7f7a2b2177be9054f57aa Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Mon, 23 Dec 2024 12:28:56 -0500 Subject: [PATCH 107/114] - Extend CATALOG_ARCHIVE line count check to update files (in addition to monthlies) - Requires some wrangling of TMPDIR to find `ht_bib_export_incr...` and dollar dup files - Update .gitignore to just ignore data/ so we don't miss new fixtures --- .gitignore | 2 +- lib/derivative/ht_bib_export.rb | 25 +++++-- lib/verifier/post_zephir_verifier.rb | 65 +++++++++--------- .../ht_bib_export_incr_2024-11-30.json.gz | Bin 0 -> 2101 bytes .../ht_bib_export_incr_2024-12-02.json.gz | Bin 0 -> 1986 bytes ...d_incremental_2024-12-02_dollar_dup.txt.gz | Bin spec/integration/verify_spec.rb | 2 +- spec/spec_helper.rb | 9 +++ spec/unit/post_zephir_verifier_spec.rb | 21 ++++-- 9 files changed, 82 
insertions(+), 42 deletions(-) create mode 100644 spec/fixtures/work/ht_bib_export_incr_2024-11-30.json.gz create mode 100644 spec/fixtures/work/ht_bib_export_incr_2024-12-02.json.gz rename spec/fixtures/{dollar_dup => work}/vufind_incremental_2024-12-02_dollar_dup.txt.gz (100%) diff --git a/.gitignore b/.gitignore index 65eadf1..1007918 100644 --- a/.gitignore +++ b/.gitignore @@ -6,7 +6,7 @@ config/config.pl config/.netrc coverage/ zephir_full_daily_a* -*.gz +data/ local local* *_stderr diff --git a/lib/derivative/ht_bib_export.rb b/lib/derivative/ht_bib_export.rb index e5262bf..bfe6bd0 100644 --- a/lib/derivative/ht_bib_export.rb +++ b/lib/derivative/ht_bib_export.rb @@ -1,16 +1,33 @@ require "derivative" module PostZephirProcessing + # These derivatives are the files downloaded from Zephir. + # They are not checked explicitly for well-formedness + # (to do so, must define self.derivatives_for_date); + # however, catalog archive checks line counts against these as originals + # so this class allows the downloads to be located. + # + # We might want to reconsider keeping the "incr" files in TMPDIR + # and instead move them to the same location as the full files. class Derivative::HTBibExport < Derivative def datestamp_delta -1 end + # These files are unusual in that they live in two different locations: + # the monthlies get moved but the updates just get downloaded and left in place. def template - { - location: ENV["ZEPHIR_DATA"], - name: "ht_bib_export_full_YYYY-MM-DD.json.gz" - } + if full? 
+ { + location: ENV["ZEPHIR_DATA"], + name: "ht_bib_export_full_YYYY-MM-DD.json.gz" + } + else + { + location: ENV["TMPDIR"], + name: "ht_bib_export_incr_YYYY-MM-DD.json.gz" + } + end end end end diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir_verifier.rb index 7577c9d..a734264 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir_verifier.rb @@ -18,50 +18,42 @@ class PostZephirVerifier < Verifier def run_for_date(date:) super @current_date = date - verify_catalog_update_archive - verify_catalog_full_archive + verify_catalog_archive verify_catalog_prep verify_dollar_dup verify_ingest_bibrecords verify_rights end - def verify_catalog_update_archive(date: current_date) - zephir_update_path = Derivative::CatalogArchive.new(date: date, full: false).path - return unless verify_file(path: zephir_update_path) - verify_parseable_ndj(path: zephir_update_path) - end - # Frequency: ALL - # Files: CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz - # and potentially CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz + # Files: + # CATALOG_ARCHIVE/zephir_upd_YYYYMMDD.json.gz + # CATALOG_ARCHIVE/zephir_full_YYYYMMDD_vufind.json.gz [Monthly] # Contents: ndj file with one catalog record per line # Verify: # readable # line count must be the same as input JSON - def verify_catalog_full_archive(date: current_date) - return unless date.first_of_month? - output_path = Derivative::CatalogArchive.new(date: date, full: true).path - input_path = Derivative::HTBibExport.new(date: date, full: true).path - - paths = [input_path, output_path] - return unless paths.all? 
{ |path| verify_file(path: path) } - - paths.each { |path| verify_parseable_ndj(path: path) } - - output_linecount = gzip_linecount(path: output_path) - input_linecount = gzip_linecount(path: input_path) - - if output_linecount != input_linecount - error( - message: sprintf( - "output line count (%s = %s) != input line count (%s = %s)", - output_path, - output_linecount, - input_path, - input_linecount + def verify_catalog_archive(date: current_date) + Derivative::CatalogArchive.derivatives_for_date(date: date).each do |derivative| + next unless verify_file(path: derivative.path) + + verify_parseable_ndj(path: derivative.path) + archive_linecount = gzip_linecount(path: derivative.path) + bib_export_path = Derivative::HTBibExport.new(date: date, full: derivative.full?).path + next unless verify_file(path: bib_export_path) + + bib_export_linecount = gzip_linecount(path: bib_export_path) + if bib_export_linecount != archive_linecount + error( + message: sprintf( + "catalog archive line count (%s = %s) != bib export line count (%s = %s)", + derivative.path, + archive_linecount, + bib_export_path, + bib_export_linecount + ) ) - ) + end end end @@ -194,5 +186,14 @@ def verify_rights_file_format(path:) end end end + + private + + def verify_hathifile_contents(path:) + HathifileContentsVerifier.new(path).tap do |contents_verifier| + contents_verifier.run + @errors.append(contents_verifier.errors) + end + end end end diff --git a/spec/fixtures/work/ht_bib_export_incr_2024-11-30.json.gz b/spec/fixtures/work/ht_bib_export_incr_2024-11-30.json.gz new file mode 100644 index 0000000000000000000000000000000000000000..5bcb37670a72264ee5bd77c4a96573ee3a5f0ea6 GIT binary patch literal 2101 zcmV-52+H>#iwFob%V=i+188(#VrgPuWq5FJa&%v5Zewy^GB7eUEio}IGcYb{b8l_{ z?O9!Oqc|3QpI@Q!%XF#?E%A-_$#FU#Gj=*#+Ymj zcpUdkPt}m50!X?#_v%VI2Yu~?)C;KS^g5UjU5~v8d76f?PIQ8hVPf3rcIJ$RLDK1c z`r5&m$hw=$G&|qA@|_OvMy&5$Ukr;kCcM$GrV$OBo60IKiYzFKP;mv@o1)N$LSLX+ 
zj1)!jGWaP994K%cY!fpfuI0)m@jBbIox$J#i0lN>c(JubaHsdRlVr0Bd=IQx4mMrG z(L3KBx*Zk=(jnBzZ?*jPTi#qZ_cwntV)tjRTZRDov2=$VX)Ota60uA@iNXJF;f-$Y zF~Kh{(^M9mFOK#6{eTY#XF81Qu@(m2`|16#KRF|&>ktQ9=CPK>!<$>_=M}gm5h>@d z30Oa*t7Rpt8!Z_%iFGj?LO7<=^`%<;K@sP+$#f@aG!Z0 zdg9^{t*A&=)KA%Gn*0lDuDD3m64971s6b;CjaV2`(M2?1DHm!L9NB@x1vsl__-^SF z`Zx62Euq}zAptP`Bu6Em5TB#T2I5Dk3;;_!pX5^a&^%Z=uIZh)7_D}`X%_M$wb_=5 zjs3RH*(_vyA-vU+`2cjzc^dPSCT#|FXTQM_52$q9^L?5mJ#-VNRKzrGv1irIl>S@* zuo(3tn%`_<_ytOe(E2=vv6T5(*9C6rqA-+D@tHw>=b0uGU2GG*#jssBP}^kNW?xOV z6(O+e_O-)uOKqNLj_cN}9xf)ZlRJjtn0CwhF}S(dZ{)xg_M$)C-yVpGg&ph=%dm8} zU=*{xRHJE$Y_Ol3fqeec08$2t$+Il#DK|EK|=U}D&e{MoiQfVlm{!xDA+Iv>QS;c4-Sx^isFlb!%E{Rr1 z^T(_PXDxt-Z$EvlGIPt?F>}o_TbVhr+%{%TU>4aYX68a}SMaAtL6w>3o%jbZa~s?< z4Bb8jGlvkImYKiWU_qaWsq3fW&oGK7;Lkatlkw*~_kI-qJR+&td@w8y8V9aG2o|NG-VM_m+NlnC~-Ac$)Bj4Y&=@EaDiG{JY7=YXcUxj;dM6Q9x8r*+ni7G8Vqg(+YcbVqFOV0vdPM(3PAo`}C{x>Mz^#65(+*3m;r z+ZfGb;E9y^A>7ZINK&-QfZ8TY8X$l#8M}-TP1jwSEV^RZt$o&v0Z>}OgUpvatczxA z^oxqFRdmI}AYF2?qN<|w16bBCsaRx5QRxC|fd26Ou_r4`JQk;D0@2K54$GigE51>u9SmsWTyf4!BF6bM8G6&$-3;dAH=Ttct{HgR*;!Bx~ z1-y<~5_(TC(jklaQ}W2l9D>n?fsviufus^Fl7Z)sY-8s#?_t27Xc+FYCyuMZx(sEK2KBW1WbTrn|vGZ;EhDGlb&nzW~CxcGnG%8P9Dyn^j3;c`85Lc9fWq4fn0yC zab7O*f0`(j%8Q45u`b2vHA?{Cxh?oC@`Z|ff;sBbAK=e_BT*nRStWg>UyrRbl{z~ZIMQebtqH*jv8x+^H7b29i8kn zmTeQSO=CHKiN<<)`jKDq>|=Gv>4$A%+csM{y0)je91cS@kCq(I{%Tf`c6lfuhD1@0gYz9HI5W!13_s>*WH4DJQHS? 
z0_%c#tRA~#6etCHTRCaS`CHNBYY!OE?yzG^l_x&1As~}KOI8fbs-mMTw}sHuN-|IY zpR|WAatJ81%1MrkJdy@PdUZrW-3Td~qws4^Ag<`9Tz?Op_%F5Vx=gM&$r-|dt z4Gc(6&v9bwudA!8tNXJ)3?j}#o(_fsjD6RaEJln{j7>}kLY9s3KpWf%9);Opc>4jK zBIUE{8 z>pCmoue#yE4bQ_au><0}nD~w(ABtt$Dg6Igc7nK^t+WUp3_lFAV%jjDfr?9Eq@XbP z_)Z%L8L9!HL3v5b%a3Jq)85_us|#`S%=c>-KtGP@^;OHlRF6er97XVtm3bHEH=wfh z0daknB(uIFe?3kn<7ci9=Jk~rv)9)*lhM^PV*4KPiH-Xzy`7w2s)4?Qpivg-Cr|O% zDB_EGBdqr-SuK%uIvJOt7c{QJa%tRjDB3NGC!R$Ns}!2TSL4jIno4+E$)mfKh~>X9 zItB(Iv%B%xJdK%@8v2$6zZE>wN{hH!t4X{tnyO?=m0B%Rc$KF`kQXV7L>6nc=XfY` z9-`kxK1WOvujlp#KwL-yG`%s!S$S@ zManZSb9A3XMa&UfEI3QS_d+6NEMggAC|mGAun37PD>zE-(2p$QD9ejb@XYvm&}D(u z653)KiDZ_t#r$Ci<3hL8^xx^!F}ubD*=MKXMG}_wu^`}CHbmz#=c(j*hn~|GCmZqE zqEW?~=li=gOi1T236z+`_d2rX`22LY$bot3badUa_Wk9~xv>#Xcka;pa!adzCF2l+ zUF(s)r4y=SdQH5ciH8>XX@9cV*T<$5rW;ap+nsGvSEp}v*f{Ei1p9BDsz;NHQTZi; zFR(loupW#>q+@L9stDOT0pDD-^my{!{)Vq!PqrK4j`BdOqF&t8{)Um&qgsvdZmV8f zs4<5N<8f@lt*p^dnfj?^iSO=M?bC*6)3UIdRtOzh_pa)bh?ifB7CS$i0hHL3d`1sg zzU4afGa6vKw(HyG9s}Ne{Nh7{DO5&SGX=G~n1Yx#?P3ZGBJtQv5yN)p!!}JMM{A}) z%A&)V0yYtoQ18Wy)ZOQvQlq6Dy5!HFKR*JLC?6gbN{%B~+%u!_z(*E+ib^b1Ca|EK z?L!V7grr7#1VRG8dG3)2$@crYt4Tl$w>8Q+GX^5ix^`NMI2nV%L$hi!nA_S(Ku5m8`s=3N!Y{?LUS+wSA?B-`64d&?Z<#gxfY-bx3I{!Wyf>$F5$#N%$oJ z(nqYw0iZkHD7bCuWw+z^`?Ukz2&UL_U~#U(?cV4}e)}=5J9+lPxJN}%UyNJIIM}9i z3*(4x;*!1WA*CG{_djH>jVkC<*{iN=H`!|=xSH(s$12b^Iqcv{tRDmr6L&+vhRwB| zy^Oi}epDN?x~_~diK>%Zj8UcJdvJ$E9N+gEhI~FWvK-S=2L0W}95hAPH*L=*{t>3A z!!zUn&tO6+b@35Ab8~UIj+g3sWeh+;>ofs07$%g`P`we)h74-u2U_zOXQ}Od$`Ki{ zNhi)yNVeYba25)E>Sg5{XZXlp3N~G}Z{Td$fwKe%%XK`Yc-HF6A1fv$urjD+9*;?% zEVvH?EAb;>Ai==8z0bFQj&lA!FW@1W)ook=zF3^iG~QlriLtss`IaP$AHR3 zW@sp8i_Ab@uSwT>(uv>I0>~&ymoEU+Je)wdC5!A!G*04JZ{3?=d|Req5oFJC5% z=P5v-fRH&&f;<(Hcc1(n8Cc!6-RclyYu|2^wT$tXz5@9fFKcOv|_J|J2dpzZ_W~7fA;q>(=356@8BYbp@rLQi}b( Uhl5?K1WNh80C$U&749Yg0Ep$;bN~PV literal 0 HcmV?d00001 diff --git a/spec/fixtures/dollar_dup/vufind_incremental_2024-12-02_dollar_dup.txt.gz b/spec/fixtures/work/vufind_incremental_2024-12-02_dollar_dup.txt.gz similarity 
index 100% rename from spec/fixtures/dollar_dup/vufind_incremental_2024-12-02_dollar_dup.txt.gz rename to spec/fixtures/work/vufind_incremental_2024-12-02_dollar_dup.txt.gz diff --git a/spec/integration/verify_spec.rb b/spec/integration/verify_spec.rb index 908d94a..9a0fa99 100644 --- a/spec/integration/verify_spec.rb +++ b/spec/integration/verify_spec.rb @@ -22,7 +22,7 @@ module PostZephirProcessing REDIRECTS_DIR: fixture("redirects"), REDIRECTS_HISTORY_DIR: fixture("redirects"), CATALOG_PREP: fixture("catalog_prep"), - TMPDIR: fixture("dollar_dup"), + TMPDIR: fixture("work"), SOLR_URL: "http://solr-sdr-catalog:9033/solr/catalog", TZ: "America/Detroit" ) do diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb index d3679b0..76e2098 100644 --- a/spec/spec_helper.rb +++ b/spec/spec_helper.rb @@ -48,6 +48,15 @@ def test_journal_dates [Date.new(2050, 1, 1), Date.new(2050, 1, 2)] end +# Note potential pitfall: +# Setting ENV["TMPDIR"] has an effect on Ruby's choice of temporary directory locations. +# See https://github.com/ruby/ruby/blob/f4476f0d07c781c906ed1353d8e1be5a7314d6e7/lib/tmpdir.rb#L130 +# So if you see mktmpdir yielding a location in spec/fixtures then it's likely +# TMPDIR has been defined, maybe in an `around` block, before the call to `with_test_environment`. +# Currently it is not happening but it can when noodling around with test setups. +# It's not a critical problem, but might nudge us in the direction of moving away from using +# TMPDIR in the PZP internals. +# Could also try wrapping the mktmpdir in another Climate Control layer. 
def with_test_environment Dir.mktmpdir do |tmpdir| ClimateControl.modify( diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/post_zephir_verifier_spec.rb index 1b2f011..fd442a8 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/post_zephir_verifier_spec.rb @@ -24,6 +24,12 @@ module PostZephirProcessing end describe "#run_for_date" do + around(:each) do |example| + ClimateControl.modify(TMPDIR: fixture("work")) do + example.run + end + end + context "last day of month" do test_date = Date.parse("2024-11-30") it "runs" do @@ -113,13 +119,20 @@ def expect_deletefile_ok(contents) end end - describe "#verify_catalog_full_archive" do + describe "#verify_catalog_archive" do let(:verifier) { described_class.new } let(:test_date) { Date.parse("2024-12-01") } + + around(:each) do |example| + ClimateControl.modify(TMPDIR: fixture("work")) do + example.run + end + end + it "requires input file to have same line count as output file" do # We have fixtures with matching line counts for test_date, # so expect no warnings - verifier.verify_catalog_full_archive(date: test_date) + verifier.verify_catalog_archive(date: test_date) expect(verifier.errors).to be_empty end @@ -132,9 +145,9 @@ def expect_deletefile_ok(contents) end # The other unmodified fixtures in CATALOG_ARCHIVE should # no longer have matching line counts, so expect a warning - verifier.verify_catalog_full_archive(date: test_date) + verifier.verify_catalog_archive(date: test_date) expect(verifier.errors.count).to eq 1 - expect(verifier.errors).to include(/output line count .+ != input line count/) + expect(verifier.errors).to include(/catalog archive line count .+ != bib export line count/) end end end From a74a806c0771079bda5dcc4d9fc8e04cdb6fc415 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Tue, 31 Dec 2024 17:04:19 -0500 Subject: [PATCH 108/114] - Database connection uses ENV instead of `database.yml` - For now these ENV values are taken from hathifiles secret - `database.yml` 
appears to be read-write to the rights DB (in practice) so using it for Verifier is not appropriate. --- docker-compose.yml | 5 +++++ lib/services.rb | 15 ++++++--------- 2 files changed, 11 insertions(+), 9 deletions(-) diff --git a/docker-compose.yml b/docker-compose.yml index a7fbdd1..4726fcc 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -39,6 +39,11 @@ services: - DB_CONNECTION_STRING="mysql2://ht_rights:ht_rights@mariadb/ht" - POST_ZEPHIR_PROCESSING_LOGGER_LEVEL=1 - PUSHGATEWAY="http://pushgateway:9091" + - DB_HT_RO_USER=ht_rights + - DB_HT_RO_PASSWORD=ht_rights + - DB_HT_RO_HOST=mariadb + - DB_HT_RO_PORT=3306 + - DB_HT_RO_DATABASE=ht # pass through info needed by coveralls uploader - GITHUB_TOKEN - GITHUB_RUN_ID diff --git a/lib/services.rb b/lib/services.rb index 55764e0..b8e5d20 100644 --- a/lib/services.rb +++ b/lib/services.rb @@ -3,7 +3,6 @@ require "canister" require "logger" require "sequel" -require "yaml" module PostZephirProcessing Services = Canister.new @@ -13,17 +12,15 @@ module PostZephirProcessing end # Read-only connection to database for verifying rights DB vs .rights files - # Would prefer to populate these values from ENV for consistency with other Ruby - # code running in the workflow but this suffices for now. + # as well as hathifiles tables. 
Services.register(:database) do - database_yaml = File.join(ENV.fetch("ROOTDIR"), "config", "database.yml") - yaml_data = YAML.load_file(database_yaml) Sequel.connect( adapter: "mysql2", - user: yaml_data["user"], - password: yaml_data["password"], - host: yaml_data["hostname"], - database: yaml_data["dbname"], + user: ENV["DB_HT_RO_USER"], + password: ENV["DB_HT_RO_PASSWORD"], + host: ENV["DB_HT_RO_HOST"], + port: ENV["DB_HT_RO_PORT"], + database: ENV["DB_HT_RO_DATABASE"], encoding: "utf8mb4" ) end From b71f6ff3592fa864e21275f65d1fc3db5b9d6a82 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 2 Jan 2025 10:21:27 -0500 Subject: [PATCH 109/114] Address #55 Move WhateverVerifier to Verifier::Whatever --- bin/verify.rb | 28 +++++++++---------- ...log_index_verifier.rb => catalog_index.rb} | 2 +- .../{hathifiles_verifier.rb => hathifiles.rb} | 8 +++--- ...nts_verifier.rb => hathifiles_contents.rb} | 2 +- ...ase_verifier.rb => hathifiles_database.rb} | 2 +- ...ting_verifier.rb => hathifiles_listing.rb} | 2 +- ...ts_verifier.rb => hathifiles_redirects.rb} | 2 +- ..._rights_verifier.rb => populate_rights.rb} | 2 +- ...post_zephir_verifier.rb => post_zephir.rb} | 2 +- .../hathifiles_spec.rb} | 4 +-- spec/integration/verify_spec.rb | 14 +++++----- .../catalog_index_spec.rb} | 4 +-- .../hathifiles_contents_spec.rb} | 4 +-- .../hathifiles_database_spec.rb} | 4 +-- .../hathifiles_listing_spec.rb} | 4 +-- .../hathifiles_redirects_spec.rb} | 4 +-- .../hathifiles_spec.rb} | 4 +-- .../populate_rights_spec.rb} | 4 +-- .../post_zephir_spec.rb} | 4 +-- 19 files changed, 50 insertions(+), 50 deletions(-) rename lib/verifier/{catalog_index_verifier.rb => catalog_index.rb} (98%) rename lib/verifier/{hathifiles_verifier.rb => hathifiles.rb} (86%) rename lib/verifier/{hathifiles_contents_verifier.rb => hathifiles_contents.rb} (98%) rename lib/verifier/{hathifiles_database_verifier.rb => hathifiles_database.rb} (97%) rename lib/verifier/{hathifiles_listing_verifier.rb => 
hathifiles_listing.rb} (95%) rename lib/verifier/{hathifiles_redirects_verifier.rb => hathifiles_redirects.rb} (97%) rename lib/verifier/{populate_rights_verifier.rb => populate_rights.rb} (98%) rename lib/verifier/{post_zephir_verifier.rb => post_zephir.rb} (99%) rename spec/integration/{hathifiles_verifier_spec.rb => verifier/hathifiles_spec.rb} (93%) rename spec/unit/{catalog_indexing_verifier_spec.rb => verifier/catalog_index_spec.rb} (97%) rename spec/unit/{hathifiles_contents_verifier_spec.rb => verifier/hathifiles_contents_spec.rb} (98%) rename spec/unit/{hathifiles_database_verifier_spec.rb => verifier/hathifiles_database_spec.rb} (97%) rename spec/unit/{hathifiles_listing_verifier_spec.rb => verifier/hathifiles_listing_spec.rb} (97%) rename spec/unit/{hathifiles_redirects_verifier_spec.rb => verifier/hathifiles_redirects_spec.rb} (98%) rename spec/unit/{hathifiles_verifier_spec.rb => verifier/hathifiles_spec.rb} (95%) rename spec/unit/{populate_rights_verifier_spec.rb => verifier/populate_rights_spec.rb} (97%) rename spec/unit/{post_zephir_verifier_spec.rb => verifier/post_zephir_spec.rb} (99%) diff --git a/bin/verify.rb b/bin/verify.rb index 72c5f30..82d2dcd 100755 --- a/bin/verify.rb +++ b/bin/verify.rb @@ -3,26 +3,26 @@ $LOAD_PATH.unshift File.expand_path("../lib/", __FILE__) require "dotenv" -require "verifier/post_zephir_verifier" -require "verifier/populate_rights_verifier" -require "verifier/hathifiles_verifier" -require "verifier/hathifiles_database_verifier" -require "verifier/hathifiles_listing_verifier" -require "verifier/hathifiles_redirects_verifier" -require "verifier/catalog_index_verifier" +require "verifier/post_zephir" +require "verifier/populate_rights" +require "verifier/hathifiles" +require "verifier/hathifiles_database" +require "verifier/hathifiles_listing" +require "verifier/hathifiles_redirects" +require "verifier/catalog_index" Dotenv.load(File.join(ENV.fetch("ROOTDIR"), "config", "env")) module PostZephirProcessing def 
self.run_verifiers(date_to_check) [ - PostZephirVerifier, - PopulateRightsVerifier, - HathifilesVerifier, - HathifilesDatabaseVerifier, - HathifilesListingVerifier, - HathifilesRedirectsVerifier, - CatalogIndexVerifier + Verifier::PostZephir, + Verifier::PopulateRights, + Verifier::Hathifiles, + Verifier::HathifilesDatabase, + Verifier::HathifilesListing, + Verifier::HathifilesRedirects, + Verifier::CatalogIndex ].each do |klass| begin klass.new.run_for_date(date: date_to_check) diff --git a/lib/verifier/catalog_index_verifier.rb b/lib/verifier/catalog_index.rb similarity index 98% rename from lib/verifier/catalog_index_verifier.rb rename to lib/verifier/catalog_index.rb index 31d0c82..d832328 100644 --- a/lib/verifier/catalog_index_verifier.rb +++ b/lib/verifier/catalog_index.rb @@ -8,7 +8,7 @@ # Verifies that catalog indexing workflow stage did what it was supposed to. module PostZephirProcessing - class CatalogIndexVerifier < Verifier + class Verifier::CatalogIndex < Verifier def verify_index_count(derivative:) catalog_linecount = gzip_linecount(path: derivative.path) diff --git a/lib/verifier/hathifiles_verifier.rb b/lib/verifier/hathifiles.rb similarity index 86% rename from lib/verifier/hathifiles_verifier.rb rename to lib/verifier/hathifiles.rb index b5f728e..a2b1d81 100644 --- a/lib/verifier/hathifiles_verifier.rb +++ b/lib/verifier/hathifiles.rb @@ -2,13 +2,13 @@ require "zlib" require "verifier" -require "verifier/hathifiles_contents_verifier" +require "verifier/hathifiles_contents" require "derivative/hathifile" # Verifies that hathifiles workflow stage did what it was supposed to. 
module PostZephirProcessing - class HathifilesVerifier < Verifier + class Verifier::Hathifiles < Verifier attr_reader :current_date def run_for_date(date:) @@ -22,7 +22,7 @@ def run_for_date(date:) # Frequency: ALL # Files: CATALOG_PREP/hathi_upd_YYYYMMDD.txt.gz # and potentially HATHIFILE_ARCHIVE/hathi_full_YYYYMMDD.txt.gz - # Contents: verified with HathifileContentsVerifier with regexes for each line/field + # Contents: verified with Verifier::HathifileContents with regexes for each line/field # Verify: # readable def verify_hathifile(date: current_date) @@ -49,7 +49,7 @@ def errors private def verify_hathifile_contents(path:) - HathifileContentsVerifier.new(path).tap do |contents_verifier| + Verifier::HathifileContents.new(path).tap do |contents_verifier| contents_verifier.run @errors.append(contents_verifier.errors) end diff --git a/lib/verifier/hathifiles_contents_verifier.rb b/lib/verifier/hathifiles_contents.rb similarity index 98% rename from lib/verifier/hathifiles_contents_verifier.rb rename to lib/verifier/hathifiles_contents.rb index cdf838b..df14ecf 100644 --- a/lib/verifier/hathifiles_contents_verifier.rb +++ b/lib/verifier/hathifiles_contents.rb @@ -6,7 +6,7 @@ # Verifies that hathifiles workflow stage did what it was supposed to. 
module PostZephirProcessing - class HathifileContentsVerifier < Verifier + class Verifier::HathifileContents < Verifier HATHIFILE_FIELD_SPECS = [ # htid - required; lowercase alphanumeric namespace, period, non-whitespace ID {name: "htid", regex: /^[a-z0-9]{2,4}\.\S+$/}, diff --git a/lib/verifier/hathifiles_database_verifier.rb b/lib/verifier/hathifiles_database.rb similarity index 97% rename from lib/verifier/hathifiles_database_verifier.rb rename to lib/verifier/hathifiles_database.rb index 93f5257..6753c4a 100644 --- a/lib/verifier/hathifiles_database_verifier.rb +++ b/lib/verifier/hathifiles_database.rb @@ -6,7 +6,7 @@ require "derivative/hathifile" module PostZephirProcessing - class HathifilesDatabaseVerifier < Verifier + class Verifier::HathifilesDatabase < Verifier attr_reader :current_date # Does an entry exist in hf_log for the hathifile? diff --git a/lib/verifier/hathifiles_listing_verifier.rb b/lib/verifier/hathifiles_listing.rb similarity index 95% rename from lib/verifier/hathifiles_listing_verifier.rb rename to lib/verifier/hathifiles_listing.rb index f9292ef..b519a37 100644 --- a/lib/verifier/hathifiles_listing_verifier.rb +++ b/lib/verifier/hathifiles_listing.rb @@ -6,7 +6,7 @@ require "set" module PostZephirProcessing - class HathifilesListingVerifier < Verifier + class Verifier::HathifilesListing < Verifier attr_reader :current_date def run_for_date(date:) diff --git a/lib/verifier/hathifiles_redirects_verifier.rb b/lib/verifier/hathifiles_redirects.rb similarity index 97% rename from lib/verifier/hathifiles_redirects_verifier.rb rename to lib/verifier/hathifiles_redirects.rb index b6f815d..86bc6eb 100644 --- a/lib/verifier/hathifiles_redirects_verifier.rb +++ b/lib/verifier/hathifiles_redirects.rb @@ -3,7 +3,7 @@ require "verifier" module PostZephirProcessing - class HathifilesRedirectsVerifier < Verifier + class Verifier::HathifilesRedirects < Verifier attr_reader :current_date REDIRECTS_REGEX = /^\d{9}\t\d{9}$/ diff --git 
a/lib/verifier/populate_rights_verifier.rb b/lib/verifier/populate_rights.rb similarity index 98% rename from lib/verifier/populate_rights_verifier.rb rename to lib/verifier/populate_rights.rb index 9e95c3a..1619435 100644 --- a/lib/verifier/populate_rights_verifier.rb +++ b/lib/verifier/populate_rights.rb @@ -16,7 +16,7 @@ module PostZephirProcessing # We may also look for errors in the output logs (postZephir.pm and/or populate_rights_data.pl?) # but that is out of scope for now. - class PopulateRightsVerifier < Verifier + class Verifier::PopulateRights < Verifier # This is an efficient slice size we adopted for hathifiles based on experimental evidence DEFAULT_SLICE_SIZE = 10_000 attr_reader :slice_size diff --git a/lib/verifier/post_zephir_verifier.rb b/lib/verifier/post_zephir.rb similarity index 99% rename from lib/verifier/post_zephir_verifier.rb rename to lib/verifier/post_zephir.rb index a734264..b85ef77 100644 --- a/lib/verifier/post_zephir_verifier.rb +++ b/lib/verifier/post_zephir.rb @@ -12,7 +12,7 @@ # Verifies that post_zephir workflow stage did what it was supposed to. 
module PostZephirProcessing - class PostZephirVerifier < Verifier + class Verifier::PostZephir < Verifier attr_reader :current_date def run_for_date(date:) diff --git a/spec/integration/hathifiles_verifier_spec.rb b/spec/integration/verifier/hathifiles_spec.rb similarity index 93% rename from spec/integration/hathifiles_verifier_spec.rb rename to spec/integration/verifier/hathifiles_spec.rb index 19c68c1..829519f 100644 --- a/spec/integration/hathifiles_verifier_spec.rb +++ b/spec/integration/verifier/hathifiles_spec.rb @@ -1,10 +1,10 @@ # frozen_string_literal: true require "climate_control" -require "verifier/hathifiles_verifier" +require "verifier/hathifiles" module PostZephirProcessing - RSpec.describe HathifilesVerifier do + RSpec.describe Verifier::Hathifiles do let(:verifier) { described_class.new } around(:each) do |example| diff --git a/spec/integration/verify_spec.rb b/spec/integration/verify_spec.rb index 9a0fa99..779c745 100644 --- a/spec/integration/verify_spec.rb +++ b/spec/integration/verify_spec.rb @@ -34,13 +34,13 @@ module PostZephirProcessing # TODO: dollar-dup, hf_log (database) - %w[PostZephirVerifier - PopulateRightsVerifier - HathifilesVerifier - HathifilesDatabaseVerifier - HathifilesListingVerifier - HathifilesRedirectsVerifier - CatalogIndexVerifier].each do |verifier| + %w[Verifier::PostZephir + Verifier::PopulateRights + Verifier::Hathifiles + Verifier::HathifilesDatabase + Verifier::HathifilesListing + Verifier::HathifilesRedirects + Verifier::CatalogIndex].each do |verifier| expect(@test_log.string).to include(/.*INFO.*#{verifier}/) end expect(@test_log.string).not_to include(/.*ERROR.*/) diff --git a/spec/unit/catalog_indexing_verifier_spec.rb b/spec/unit/verifier/catalog_index_spec.rb similarity index 97% rename from spec/unit/catalog_indexing_verifier_spec.rb rename to spec/unit/verifier/catalog_index_spec.rb index 345767f..50227d6 100644 --- a/spec/unit/catalog_indexing_verifier_spec.rb +++ 
b/spec/unit/verifier/catalog_index_spec.rb @@ -1,12 +1,12 @@ # frozen_string_literal: true -require "verifier/catalog_index_verifier" +require "verifier/catalog_index" require "webmock" require "derivative/catalog" require "uri" module PostZephirProcessing - RSpec.describe(CatalogIndexVerifier) do + RSpec.describe(Verifier::CatalogIndex) do include_context "with solr mocking" let(:verifier) { described_class.new } diff --git a/spec/unit/hathifiles_contents_verifier_spec.rb b/spec/unit/verifier/hathifiles_contents_spec.rb similarity index 98% rename from spec/unit/hathifiles_contents_verifier_spec.rb rename to spec/unit/verifier/hathifiles_contents_spec.rb index fac4988..236e54b 100644 --- a/spec/unit/hathifiles_contents_verifier_spec.rb +++ b/spec/unit/verifier/hathifiles_contents_spec.rb @@ -1,10 +1,10 @@ # frozen_string_literal: true require "zlib" -require "verifier/hathifiles_contents_verifier" +require "verifier/hathifiles_contents" module PostZephirProcessing - RSpec.describe(HathifileContentsVerifier) do + RSpec.describe(Verifier::HathifileContents) do around(:each) do |example| with_test_environment { example.run } end diff --git a/spec/unit/hathifiles_database_verifier_spec.rb b/spec/unit/verifier/hathifiles_database_spec.rb similarity index 97% rename from spec/unit/hathifiles_database_verifier_spec.rb rename to spec/unit/verifier/hathifiles_database_spec.rb index 8eecc2c..b78f362 100644 --- a/spec/unit/hathifiles_database_verifier_spec.rb +++ b/spec/unit/verifier/hathifiles_database_spec.rb @@ -1,9 +1,9 @@ # frozen_string_literal: true -require "verifier/hathifiles_database_verifier" +require "verifier/hathifiles_database" module PostZephirProcessing - RSpec.describe(HathifilesDatabaseVerifier) do + RSpec.describe(Verifier::HathifilesDatabase) do include_context "with hathifile database" around(:each) do |example| diff --git a/spec/unit/hathifiles_listing_verifier_spec.rb b/spec/unit/verifier/hathifiles_listing_spec.rb similarity index 97% rename from 
spec/unit/hathifiles_listing_verifier_spec.rb rename to spec/unit/verifier/hathifiles_listing_spec.rb index 3e7f130..7f2410c 100644 --- a/spec/unit/hathifiles_listing_verifier_spec.rb +++ b/spec/unit/verifier/hathifiles_listing_spec.rb @@ -1,10 +1,10 @@ # frozen_string_literal: true -require "verifier/hathifiles_listing_verifier" +require "verifier/hathifiles_listing" require "derivative/hathifile_www" module PostZephirProcessing - RSpec.describe(HathifilesListingVerifier) do + RSpec.describe(Verifier::HathifilesListing) do # Using secondday here as a representative for # "any day of the month that's not the 1st" # missingday does not have files or listings diff --git a/spec/unit/hathifiles_redirects_verifier_spec.rb b/spec/unit/verifier/hathifiles_redirects_spec.rb similarity index 98% rename from spec/unit/hathifiles_redirects_verifier_spec.rb rename to spec/unit/verifier/hathifiles_redirects_spec.rb index 8eeb922..944c57f 100644 --- a/spec/unit/hathifiles_redirects_verifier_spec.rb +++ b/spec/unit/verifier/hathifiles_redirects_spec.rb @@ -1,10 +1,10 @@ # frozen_string_literal: true -require "verifier/hathifiles_redirects_verifier" +require "verifier/hathifiles_redirects" require "zlib" module PostZephirProcessing - RSpec.describe(HathifilesRedirectsVerifier) do + RSpec.describe(Verifier::HathifilesRedirects) do let(:test_date) { Date.parse("2024-12-01") } let(:verifier) { described_class.new(date: test_date) } let(:redirects_file) { verifier.redirects_file(date: test_date) } diff --git a/spec/unit/hathifiles_verifier_spec.rb b/spec/unit/verifier/hathifiles_spec.rb similarity index 95% rename from spec/unit/hathifiles_verifier_spec.rb rename to spec/unit/verifier/hathifiles_spec.rb index be787f4..4afb231 100644 --- a/spec/unit/hathifiles_verifier_spec.rb +++ b/spec/unit/verifier/hathifiles_spec.rb @@ -1,9 +1,9 @@ # frozen_string_literal: true -require "verifier/hathifiles_verifier" +require "verifier/hathifiles" module PostZephirProcessing - 
RSpec.describe(HathifilesVerifier) do + RSpec.describe(Verifier::Hathifiles) do around(:each) do |example| with_test_environment { example.run } end diff --git a/spec/unit/populate_rights_verifier_spec.rb b/spec/unit/verifier/populate_rights_spec.rb similarity index 97% rename from spec/unit/populate_rights_verifier_spec.rb rename to spec/unit/verifier/populate_rights_spec.rb index e7c7547..7c2a41f 100644 --- a/spec/unit/populate_rights_verifier_spec.rb +++ b/spec/unit/verifier/populate_rights_spec.rb @@ -1,11 +1,11 @@ # frozen_string_literal: true -require "verifier/populate_rights_verifier" +require "verifier/populate_rights" require "derivative/rights" require "pry" module PostZephirProcessing - RSpec.describe(PopulateRightsVerifier) do + RSpec.describe(Verifier::PopulateRights) do around(:each) do |example| with_test_environment do ClimateControl.modify(RIGHTS_ARCHIVE: @tmpdir) do diff --git a/spec/unit/post_zephir_verifier_spec.rb b/spec/unit/verifier/post_zephir_spec.rb similarity index 99% rename from spec/unit/post_zephir_verifier_spec.rb rename to spec/unit/verifier/post_zephir_spec.rb index fd442a8..d7e4ed8 100644 --- a/spec/unit/post_zephir_verifier_spec.rb +++ b/spec/unit/verifier/post_zephir_spec.rb @@ -1,10 +1,10 @@ # frozen_string_literal: true -require "verifier/post_zephir_verifier" +require "verifier/post_zephir" require "zlib" module PostZephirProcessing - RSpec.describe(PostZephirVerifier) do + RSpec.describe(Verifier::PostZephir) do around(:each) do |example| ClimateControl.modify( CATALOG_ARCHIVE: fixture("catalog_archive"), From 58cf99cd35a6f73946b165a92e9a9ab8e8e1ace8 Mon Sep 17 00:00:00 2001 From: Brian Moses Hall Date: Thu, 2 Jan 2025 11:05:38 -0500 Subject: [PATCH 110/114] Allow hathifiles `digitization_agent_code` to match `yale2` by allowing a trailing digit. 
---
 lib/verifier/hathifiles_contents.rb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/verifier/hathifiles_contents.rb b/lib/verifier/hathifiles_contents.rb
index df14ecf..ab8b2d1 100644
--- a/lib/verifier/hathifiles_contents.rb
+++ b/lib/verifier/hathifiles_contents.rb
@@ -60,8 +60,8 @@ class Verifier::HathifileContents < Verifier
       {name: "content_provider_code", regex: /^[a-z\-]+$/},
       # responsible entity code - required, lowercase characters + dash
       {name: "responsible_entity_code", regex: /^[a-z\-]+$/},
-      # digitization agent code - required, lowercase characters + dash
-      {name: "digitization_agent_code", regex: /^[a-z\-]+$/},
+      # digitization agent code - required, lowercase characters + dash and optional trailing digit (yale2)
+      {name: "digitization_agent_code", regex: /^[a-z\-]+\d?$/},
       # access profile code - required, lowercase characters + plus
       {name: "access_profile_code", regex: /^[a-z+]+$/},
       # author - optional, anything goes

From c8f19b64ceb2d6270f35f8a1860ca45632d81970 Mon Sep 17 00:00:00 2001
From: Brian Moses Hall
Date: Thu, 2 Jan 2025 16:24:08 -0500
Subject: [PATCH 111/114] Add exception handler around Solr results to diagnose testing issue

---
 lib/verifier/catalog_index.rb | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/verifier/catalog_index.rb b/lib/verifier/catalog_index.rb
index d832328..3bb63d9 100644
--- a/lib/verifier/catalog_index.rb
+++ b/lib/verifier/catalog_index.rb
@@ -42,8 +42,13 @@ def solr_nondeleted_records
 
     def solr_result_count(filter_query)
       url = "#{ENV["SOLR_URL"]}/select?fq=#{URI.encode_www_form_component(filter_query)}&q=*:*&rows=0&wt=json"
-
-      JSON.parse(Faraday.get(url).body)["response"]["numFound"]
+      body = Faraday.get(url).body
+      begin
+        JSON.parse(body)["response"]["numFound"]
+      rescue JSON::ParserError => e
+        error(message: "could not parse response from #{url}: #{body} (#{e})")
+        0
+      end
     end
 
     def run_for_date(date:)

From 2f7be27fc0444bc908741c6e234f27103ffce5dd Mon Sep 17 00:00:00 2001
From: Aaron Elkiss
Date: Mon, 6 Jan 2025 16:17:59 -0500
Subject: [PATCH 112/114] DEV-1418: Handle solr auth params correctly

Faraday doesn't parse them out of the URL, so we need to manually set
them.
---
 lib/verifier/catalog_index.rb            | 16 +++++++++++++---
 spec/integration/verify_spec.rb          |  2 +-
 spec/support/solr_mock.rb                |  1 +
 spec/unit/verifier/catalog_index_spec.rb |  2 +-
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/verifier/catalog_index.rb b/lib/verifier/catalog_index.rb
index 3bb63d9..b57ede3 100644
--- a/lib/verifier/catalog_index.rb
+++ b/lib/verifier/catalog_index.rb
@@ -41,12 +41,22 @@ def solr_nondeleted_records
     end
 
     def solr_result_count(filter_query)
-      url = "#{ENV["SOLR_URL"]}/select?fq=#{URI.encode_www_form_component(filter_query)}&q=*:*&rows=0&wt=json"
-      body = Faraday.get(url).body
+      url = URI.parse(ENV["SOLR_URL"])
+      # duplicate the URL since Faraday will mutate the passed URL to remove
+      # the username & password from it ... which probably makes sense for a security
+      # reason, but isn't what we want.
+      conn = Faraday.new(url: url.dup)
+      conn.set_basic_auth(url.user, url.password)
+      params = {fq: filter_query,
+                q: '*:*',
+                rows: '0',
+                wt: 'json'}
+      body = conn.get('select', params).body
+
       begin
         JSON.parse(body)["response"]["numFound"]
       rescue JSON::ParserError => e
-        error(message: "could not parse response from #{url}: #{body} (#{e})")
+        error(message: "could not parse response from #{conn.url_prefix}: #{body} (#{e})")
         0
       end
     end
diff --git a/spec/integration/verify_spec.rb b/spec/integration/verify_spec.rb
index 779c745..9a88525 100644
--- a/spec/integration/verify_spec.rb
+++ b/spec/integration/verify_spec.rb
@@ -23,7 +23,7 @@ module PostZephirProcessing
         REDIRECTS_HISTORY_DIR: fixture("redirects"),
         CATALOG_PREP: fixture("catalog_prep"),
         TMPDIR: fixture("work"),
-        SOLR_URL: "http://solr-sdr-catalog:9033/solr/catalog",
+        SOLR_URL: "http://solr:SolrRocks@solr-sdr-catalog:9033/solr/catalog",
         TZ: "America/Detroit"
       ) do
         stub_catalog_timerange("2024-12-03T05:00:00Z", 3)
diff --git a/spec/support/solr_mock.rb b/spec/support/solr_mock.rb
index 2a7b05a..15cc419 100644
--- a/spec/support/solr_mock.rb
+++ b/spec/support/solr_mock.rb
@@ -19,6 +19,7 @@ def stub_solr_count(fq:, result_count:)
     }.to_json
 
     WebMock::API.stub_request(:get, url)
+      .with(basic_auth: ['solr','SolrRocks'])
       .to_return(body: result, headers: {"Content-Type" => "application/json"})
   end
 
diff --git a/spec/unit/verifier/catalog_index_spec.rb b/spec/unit/verifier/catalog_index_spec.rb
index 50227d6..f8f7da1 100644
--- a/spec/unit/verifier/catalog_index_spec.rb
+++ b/spec/unit/verifier/catalog_index_spec.rb
@@ -13,7 +13,7 @@ module PostZephirProcessing
     around(:each) do |example|
       with_test_environment do
         ClimateControl.modify(
-          SOLR_URL: solr_url,
+          SOLR_URL: "http://solr:SolrRocks@solr-sdr-catalog:9033/solr/catalog",
           CATALOG_ARCHIVE: fixture("catalog_archive"),
           TZ: "America/Detroit"
         ) do

From 6499b4f541d4e08b6f40c134af443b2e7fdaa617 Mon Sep 17 00:00:00 2001
From: Aaron Elkiss
Date: Mon, 6 Jan 2025 16:24:36 -0500
Subject: [PATCH 113/114] standardrb fixes

---
 lib/verifier/catalog_index.rb | 10 +++++-----
 spec/support/solr_mock.rb     |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/verifier/catalog_index.rb b/lib/verifier/catalog_index.rb
index b57ede3..4e09872 100644
--- a/lib/verifier/catalog_index.rb
+++ b/lib/verifier/catalog_index.rb
@@ -47,11 +47,11 @@ def solr_result_count(filter_query)
       # reason, but isn't what we want.
       conn = Faraday.new(url: url.dup)
       conn.set_basic_auth(url.user, url.password)
-      params = {fq: filter_query,
-                q: '*:*',
-                rows: '0',
-                wt: 'json'}
-      body = conn.get('select', params).body
+      params = {fq: filter_query,
+                q: "*:*",
+                rows: "0",
+                wt: "json"}
+      body = conn.get("select", params).body
 
       begin
         JSON.parse(body)["response"]["numFound"]
diff --git a/spec/support/solr_mock.rb b/spec/support/solr_mock.rb
index 15cc419..041ceb7 100644
--- a/spec/support/solr_mock.rb
+++ b/spec/support/solr_mock.rb
@@ -19,7 +19,7 @@ def stub_solr_count(fq:, result_count:)
     }.to_json
 
     WebMock::API.stub_request(:get, url)
-      .with(basic_auth: ['solr','SolrRocks'])
+      .with(basic_auth: ["solr", "SolrRocks"])
       .to_return(body: result, headers: {"Content-Type" => "application/json"})
   end
 

From 516e28fd12f1634e3b4b3593661ac101616d7057 Mon Sep 17 00:00:00 2001
From: Aaron Elkiss
Date: Wed, 8 Jan 2025 13:40:44 -0500
Subject: [PATCH 114/114] DEV-1418: Add "deleted:false" clause to filter query

If everything went as expected, there should be exactly as many records
in the input json as there are non-deleted records indexed on that day.
---
 lib/verifier/catalog_index.rb | 2 +-
 spec/support/solr_mock.rb     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/verifier/catalog_index.rb b/lib/verifier/catalog_index.rb
index 4e09872..88de3a5 100644
--- a/lib/verifier/catalog_index.rb
+++ b/lib/verifier/catalog_index.rb
@@ -33,7 +33,7 @@ def verify_index_count(derivative:)
 
     def solr_count(date_of_indexing)
       datebegin = date_of_indexing.to_time.utc.strftime("%FT%TZ")
-      solr_result_count("time_of_index:[#{datebegin} TO NOW]")
+      solr_result_count("time_of_index:[#{datebegin} TO NOW] AND deleted:false")
     end
 
     def solr_nondeleted_records
diff --git a/spec/support/solr_mock.rb b/spec/support/solr_mock.rb
index 041ceb7..5e61f91 100644
--- a/spec/support/solr_mock.rb
+++ b/spec/support/solr_mock.rb
@@ -28,6 +28,6 @@ def stub_catalog_record_count(result_count)
   end
 
   def stub_catalog_timerange(datebegin, result_count)
-    stub_solr_count(fq: "time_of_index:[#{datebegin} TO NOW]", result_count: result_count)
+    stub_solr_count(fq: "time_of_index:[#{datebegin} TO NOW] AND deleted:false", result_count: result_count)
   end
 end
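
Reviewer note on patches 112 through 114: the sketch below isolates the two ideas those patches rely on, namely that credentials embedded in `SOLR_URL` must be pulled out explicitly, and that the filter query is restricted to non-deleted records. It is an illustration only. The helper names `solr_credentials` and `solr_filter_query` are hypothetical and do not appear in the patches, it uses only the Ruby standard library rather than Faraday, and it makes no network request.

```ruby
require "uri"
require "date" # provides Date#to_time and Time#to_time

# Hypothetical helper (not part of the patches): split the basic-auth
# credentials out of a Solr URL like the one used in the specs.
def solr_credentials(solr_url)
  uri = URI.parse(solr_url)
  [uri.user, uri.password]
end

# Hypothetical helper: build the filter query string that patch 114
# passes to solr_result_count, including the deleted:false clause.
def solr_filter_query(date_of_indexing)
  datebegin = date_of_indexing.to_time.utc.strftime("%FT%TZ")
  "time_of_index:[#{datebegin} TO NOW] AND deleted:false"
end

user, password = solr_credentials("http://solr:SolrRocks@solr-sdr-catalog:9033/solr/catalog")
puts "user=#{user} password=#{password}"
puts solr_filter_query(Time.utc(2024, 12, 3, 5, 0, 0))
```

Passing `Time.utc(2024, 12, 3, 5, 0, 0)` produces the same `2024-12-03T05:00:00Z` lower bound that `stub_catalog_timerange` is given in `spec/integration/verify_spec.rb`, so the stubbed query and the generated query line up.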