Merge pull request #32 from hathitrust/DEV-117
DEV-117: Remove code copied from post_zephir_processing
mwarin authored Mar 22, 2024
2 parents a507f72 + 6ea54bc commit 4dc3c43
Showing 6 changed files with 5 additions and 82 deletions.
2 changes: 0 additions & 2 deletions PossibleRuntimeErrors.md
@@ -13,8 +13,6 @@ Of course, other errors are possible, especially if substantive changes are made
* Unable to get ht_bib_export_incr_<date>.json.gz from zephir (reports)
* Unable to **find** ht_bib_export_incr_<date>.json.gz (exits)
* Unable to retrieve or find vufind_removed_cids_<date>.txt.gz (exits)
* Unable to retrieve groove_incremental_<date>.tsv.gz (reports)
* Unable to retrieve zephir_daily_touched (reports)
* Unable to cp zephir_upd_<date>.json.gz to /htsolr/catalog/prep (exits)
* Unable to cp zephir_upd_<date>_all_delete.txt.gz to /htsolr/catalog/prep (exits)
* Unable to send zephir_upd_dollar_up.txt.gz to Zephir (exits)
3 changes: 0 additions & 3 deletions README.md
@@ -13,7 +13,6 @@ run_process_zephir_incremental.sh (daily)
* (re)determine bibliographic rights
+ Write new/updated bib rights to file for Aaron's process to pick up and update the rights db (Why: possibly because of limited permissions on the rights database)
* File of processed new/updated records is copied to an HT server for Bill to index in the catalog
* Retrieve files with changed & new records daily_touched_YYYY-MM-DD.tsv.gz and groove_incremental_YYYY_MM-DD.tsv.gz from Zephir for pickup by ingest to be loaded to the `feed_zephir_items` table, which supports determining what items are newly available for ingest, what digitization source we expect to see for those items, and what their collection code (which maps to content provider and responsible source) is
* Retrieves full bib metadata file from zephir and runs run_zephir_full_monthly.sh. (Why?)

Why?
@@ -24,7 +23,6 @@ Data In
-------
* `ht_bib_export_incr_YYYY-MM-DD.json.gz` (incremental updates from Zephir, `ftps_zephir_get`)
* `vufind_removed_cids_YYYY-MM-DD.txt.gz` (CIDs that have gone away, `ftps_zephir_get`)
* `groove_incremental_YYYY-MM-DD.tsv.gz` (from Zephir - new items added to Zephir?)
* `/tmp/rights_dbm` (taken from `ht_rights.rights_current` table in the rights database)
* `us_cities.db` (dependency for `bib_rights.pm`)
* `us_fed_pub_exception_file` (dependency for `bib_rights.pm`, `/htdata/govdocs/feddocs_oclc_filter/`)
@@ -36,7 +34,6 @@ Data Out
* `zephir_upd_YYYYMMDD_delete.txt.gz` will be moved to /htsolr/catalog/prep. Used by the catalog to process deletes.
* `zephir_upd_YYYYMMDD_dollar_dup.txt `(generated by post_zephir_cleanup.pl, gets sent to Zephir, ftps_zephir_send, Zephir uningests these duplicate records)
* `zephir_upd_YYYYMMDD.json.gz` will be sent to /htsolr/catalog/prep for [catalog indexing](https://github.com/hathitrust/hathitrust_catalog_indexer)
* Updated bibliographic records - used by https://github.com/hathitrust/feed_internal/blob/master/feed.daily/02_get_bibrecords.pl to update the feed_zephir_items table on a daily basis. Could place directly in /htapps/babel/feed/var/bibrecords and remove the scp logic in `02_get_bibrecords.pl`, or just have `02_get_bibrecords.pl` call `ftps_zephir_get` directly: `daily_touched_YYYY-MM-DD.txt.gz` and `groove_incremental_YYYY-MM-DD.tsv.gz` (Retrieved with `ftps_zephir_get`.) `daily_touched*.txt.gz` and `groove_incremental*.tsv.gz` will be placed in /htapps/babel/feed/var/bibrecords
* `zephir_full_monthly_rpt.txt` Does anyone need this?

Perl script dependencies
16 changes: 0 additions & 16 deletions lib/monthly_inventory.rb
@@ -8,8 +8,6 @@ class MonthlyInventory
UPDATE_REGEXP = /^zephir_upd_(\d{8})\.json\.gz$/
DELETE_REGEXP = /^zephir_upd_(\d{8})_delete\.txt\.gz$/
RIGHTS_REGEXP = /^zephir_upd_(\d{8})\.rights$/
GROOVE_REGEXP = /^groove_incremental_(\d{4}-\d{2}-\d{2})\.tsv\.gz$/
TOUCHED_REGEXP = /^daily_touched_(\d{4}-\d{2}-\d{2})\.tsv\.gz$/
attr_reader :date, :logger, :inventory

# @param logger [Logger] defaults to STDOUT
@@ -28,8 +26,6 @@ def initialize(logger: nil, date: (Date.today - 1))
zephir_update_files: zephir_update_files,
zephir_delete_files: zephir_delete_files,
zephir_rights_files: zephir_rights_files,
zephir_groove_files: zephir_groove_files,
zephir_touched_files: zephir_touched_files
}
end

@@ -51,18 +47,6 @@ def zephir_rights_files
directory_inventory(directory: @rights_dir, archive_directory: @rights_archive_dir, regexp: RIGHTS_REGEXP)
end

# groove_incremental_YYYY-MM-DD.tsv.gz files for the current month.
# @return [Array<Date>] sorted ASC
def zephir_groove_files
directory_inventory(directory: @ingest_bibrecords_dir, archive_directory: @ingest_bibrecords_archive_dir, regexp: GROOVE_REGEXP)
end

# daily_touched_YYYY-MM-DD.tsv.gz files for the current month.
# @return [Array<Date>] sorted ASC
def zephir_touched_files
directory_inventory(directory: @ingest_bibrecords_dir, archive_directory: @ingest_bibrecords_archive_dir, regexp: TOUCHED_REGEXP)
end

# Iterate over the parts of the inventory separately.
# Find the earliest (min) date missing (if any) from each.
# If a date is missing in any one of them then it is a do-over candidate.
32 changes: 2 additions & 30 deletions run_process_zephir_incremental.sh
@@ -44,8 +44,6 @@ export us_fed_pub_exception_file="$FEDDOCS_HOME/feddocs_oclc_filter/oclcs_remove
DATADIR=$ROOTDIR/data/zephir
ZEPHIR_VUFIND_EXPORT=ht_bib_export_incr_${zephir_date}.json.gz
ZEPHIR_VUFIND_DELETE=vufind_removed_cids_${zephir_date}.txt.gz
ZEPHIR_GROOVE_INCREMENTAL=groove_incremental_${zephir_date}.tsv.gz
ZEPHIR_DAILY_TOUCHED=daily_touched_${zephir_date}.tsv.gz
ZEPHIR_VUFIND_DOLL_D=vufind_incremental_${zephir_date}_dollar_dup.txt
BASENAME=zephir_upd_${YESTERDAY}
REPORT_FILE=${BASENAME}_report.txt
@@ -71,34 +69,8 @@ if [ ! -e $ZEPHIR_VUFIND_DELETE ]; then
report_error_and_exit "file $ZEPHIR_VUFIND_DELETE not found, exiting"
fi

echo "`date`: retrieve $ZEPHIR_GROOVE_INCREMENTAL"

run_external_command $ROOTDIR/ftpslib/ftps_zephir_get exports/$ZEPHIR_GROOVE_INCREMENTAL $ZEPHIR_GROOVE_INCREMENTAL

cmdstatus=$?
if [ $cmdstatus == "0" ]; then
echo "`date`: copy $ZEPHIR_GROOVE_INCREMENTAL to rootdir/data/zephir"
# should go here:
mv $ZEPHIR_GROOVE_INCREMENTAL $INGEST_BIBRECORDS
else
echo "***"
echo "Problem getting file ${ZEPHIR_GROOVE_INCREMENTAL} from zephir: rc is $cmdstatus"
echo "***"
fi

echo "`date`: retrieve $ZEPHIR_DAILY_TOUCHED"

$ROOTDIR/ftpslib/ftps_zephir_get exports/$ZEPHIR_DAILY_TOUCHED $ZEPHIR_DAILY_TOUCHED

cmdstatus=$?
if [ $cmdstatus == "0" ]; then
echo "`date`: copy $ZEPHIR_DAILY_TOUCHED to $INGEST_BIBRECORDS"
mv $ZEPHIR_DAILY_TOUCHED $INGEST_BIBRECORDS
else
echo "***"
echo "Problem getting file ${ZEPHIR_DAILY_TOUCHED} from zephir: rc is $cmdstatus"
echo "***"
fi
# ZEPHIR_GROOVE_INCREMENTAL moved to repo feed_internal, 02_get_bibrecords.pl
# ZEPHIR_DAILY_TOUCHED moved to repo feed_internal, 02_get_bibrecords.pl

echo "`date`: dump the rights db to a dbm file"

16 changes: 3 additions & 13 deletions spec/integration/monthly_inventory_integration_spec.rb
@@ -33,15 +33,13 @@
it "returns the earliest" do
[
delete_file_for_date(date: Date.parse("2023-11-20")),
update_file_for_date(date: Date.parse("2023-11-19")),
rights_file_for_date(date: Date.parse("2023-11-18")),
groove_file_for_date(date: Date.parse("2023-11-17")),
touched_file_for_date(date: Date.parse("2023-11-16"))
update_file_for_date(date: Date.parse("2023-11-11")),
rights_file_for_date(date: Date.parse("2023-11-18"))
].each do |file|
FileUtils.rm file
end
mi = PostZephirProcessing::MonthlyInventory.new(date: Date.parse("2023-11-30"))
expect(mi.earliest_missing_date).to eq Date.parse("2023-11-16")
expect(mi.earliest_missing_date).to eq Date.parse("2023-11-11")
end
end

@@ -56,12 +54,4 @@
end
end

describe "find file not yet moved to archive directory" do
it "returns nil" do
date = Date.parse("2023-11-10")
FileUtils.mv groove_file_for_date(date: date), groove_file_for_date(date: date, archive: false)
mi = PostZephirProcessing::MonthlyInventory.new(date: Date.parse("2023-11-30"))
expect(mi.earliest_missing_date).to be_nil
end
end
end
18 changes: 0 additions & 18 deletions spec/spec_helper.rb
@@ -57,22 +57,6 @@ def rights_file_for_date(date:, archive: true)
)
end

# Note Zephir hyphenated date
def groove_file_for_date(date:, archive: true)
File.join(
archive ? ingest_bibrecords_archive_dir : ingest_bibrecords_dir,
"groove_incremental_#{date.strftime("%Y-%m-%d")}.tsv.gz"
)
end

# Note Zephir hyphenated date
def touched_file_for_date(date:, archive: true)
File.join(
archive ? ingest_bibrecords_archive_dir : ingest_bibrecords_dir,
"daily_touched_#{date.strftime("%Y-%m-%d")}.tsv.gz"
)
end

# @param date [Date] determines the month and year for the file datestamps
def setup_test_files(date:)
start_date = Date.new(date.year, date.month, 1)
@@ -81,8 +65,6 @@ def setup_test_files(date:)
`touch #{update_file_for_date(date: d)}`
`touch #{delete_file_for_date(date: d)}`
`touch #{rights_file_for_date(date: d)}`
`touch #{groove_file_for_date(date: d)}`
`touch #{touched_file_for_date(date: d)}`
end
end
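
Reviewer note: with the groove and touched inventories removed, the "earliest missing date" check covers only the update, delete, and rights files. The sketch below illustrates that idea in isolation. It is not the repository's `MonthlyInventory` code: `earliest_missing_date` and `dates_from_names` here are hypothetical stand-alone helpers, the inventory is passed in as a plain hash of dates, and the date range is assumed to be supplied by the caller. Only `UPDATE_REGEXP` is copied from the diff.

```ruby
require "date"

# Copied from the diff: update files carry an unhyphenated YYYYMMDD stamp.
UPDATE_REGEXP = /^zephir_upd_(\d{8})\.json\.gz$/

# Hypothetical helper: turn file names into sorted dates, skipping
# anything the regexp does not match.
def dates_from_names(names, regexp)
  names.filter_map { |name|
    match = regexp.match(name)
    Date.strptime(match[1], "%Y%m%d") if match
  }.sort
end

# Hypothetical helper: inventory maps a label to the dates for which a
# file exists. Each part is checked separately; the result is the
# earliest date missing from any part, or nil if nothing is missing.
def earliest_missing_date(inventory:, start_date:, end_date:)
  expected = (start_date..end_date).to_a
  inventory.values
    .map { |present| (expected - present).min }
    .compact
    .min
end
```

Fed with the dates from the updated integration spec (delete file missing for 2023-11-20, update file for 2023-11-11, rights file for 2023-11-18, over all of November 2023), this sketch picks 2023-11-11, which matches the new expectation in the spec.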
