
Log when total zip part size is less than the moab size #1993

Merged · 1 commit merged into main from check-part-size-audit#1742 · Nov 3, 2022

Conversation

mjgiarlo (Member)

Why was this change made? 🤔

Fixes #1742

This commit adds a new audit result to the catalog-to-archive audit service such that, when the total zip part size for a zipped moab version is less than the size of the moab on disk, this condition is reported. Reporting this information will help us find and remediate replication errors.

As part of this work, as a convenience, add a new `#total_part_size` method to the `ZippedMoabVersion` model.
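
For illustration only, here is a minimal sketch of what such a convenience method could look like, assuming `ZippedMoabVersion` has a `zip_parts` association whose records store an integer `size` column (those names are assumptions; the actual method is in the diff below):

```ruby
# Minimal sketch, not the actual implementation from this PR.
# Assumes a zip_parts association with an integer `size` column.
class ZippedMoabVersion < ApplicationRecord
  # Sum of the recorded sizes of this version's zip parts on the endpoint.
  def total_part_size
    zip_parts.sum(:size)
  end
end
```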

How was this change tested? 🤨

CI. I'll have to defer to @jmartin-sul for advice on whether this is sufficient, or whether poking about in a deployed environment may be warranted.

Comment on lines 18 to 19
total_part_size = zmv.total_part_size
moab_version_size = zmv.preserved_object.complete_moab&.size || 0
mjgiarlo (Member Author)

@jmartin-sul I'm not entirely confident I've got the right sizes here, but maybe I'm not far off?

jmartin-sul (Member)

@mjgiarlo I think we probably want a method like CompleteMoab#version_size_from_disk(version), that'll use the complete_moab.moab_storage_root.storage_location and the druid (from the parent PreservedObject) to get a druid tree (for which i think you can use DruidTools::Druid.new(druid, storage_location).path). And then this new method would get the size of e.g. the v0003 dir in the particular Moab. Sorry, when we were pairing on this yesterday I mentioned thinking that the sizes in the DB could be trusted, but a bit of nuance is that the CompleteMoab encompasses all versions, so its size field isn't at the right level of granularity for comparison with one ZippedMoabVersion, I don't think. (perhaps replacing CompleteMoab#size with a CompleteMoab#version_sizes array is worth ticketing and considering, since that might be useful info for querying here and in general)

mjgiarlo (Member Author)

@jmartin-sul Sounds good, and rings a bell, even. 😄 I'll tackle that next.

@@ -3,7 +3,7 @@
require 'rails_helper'

RSpec.describe Audit::CatalogToArchive do
let(:zmv) { create(:zipped_moab_version) }
let(:zmv) { create(:zipped_moab_version, preserved_object: create(:preserved_object_fixture, druid: 'bz514sm9647')) }
mjgiarlo (Member Author) · Oct 26, 2022

Without this, the zipped moab version's preserved object does not have a complete moab.

@@ -15,6 +15,15 @@ def self.check_child_zip_part_attributes(zmv, results)
return false # everything else relies on checking parts, nothing left to do
end

total_part_size = zmv.total_part_size
moab_version_size = zmv.preserved_object.complete_moab&.size || 0
if total_part_size < moab_version_size
jmartin-sul (Member)

And so, if moab_version_size's calculation changes as suggested above, then like the corresponding check when building the zip file from the Moab version, I think we'd want to confirm that total_part_size > moab_version_size, instead of less than. The assumption being that the zips are created without compression (presumably with a bit of additional zip file metadata), and that the (hopefully past 😅) failure mode we're worried about was the silent/uncaught omission of some files when zips were created before that size check was added.

mjgiarlo (Member Author) · Oct 28, 2022

@jmartin-sul Maybe I still have this backwards but instead of making the change you recommend without really grokking why, I'd like to dig into this more so I can hopefully do a better job of keeping it in my brain. Here's what I was thinking... let me know where I go off the rails:

  • total_part_size is the total size of zips on an endpoint for a given druid + version
  • moab_version_size is the size on disk for a given druid + version (once fixed per your suggestion above)
  • the error condition we want to flag now is when we have stuff missing in the cloud
  • the way we flag that is by comparing total_part_size to moab_version_size and if the former is smaller than the latter, then it's missing content
  • so isn't the conditional for adding a size inconsistency error result total_part_size < moab_version_size?

jmartin-sul (Member)

yes, you had this right! sorry, when i first read the PR, i must've (incorrectly) been thinking of the condition evaluating to true as the non-error state. sorry for the confusion!

jmartin-sul (Member) left a comment

whoops, left individual comments, but forgot to do the official "request changes" thing

Comment on lines +169 to +171
def moab_version_size
moab_version_files.sum { |f| File.size(f) }
end
mjgiarlo (Member Author)

Since DruidVersionZip already has a method to get the size on disk of a given druid + version + storage location, I decided to use that method instead of writing a new one. And as part of that, I moved this method into the public API (and added a test below).

jmartin-sul (Member)

good call!

Comment on lines +64 to +68
def total_size_of_moab_version(version)
return 0 unless complete_moab

DruidVersionZip.new(druid, version, complete_moab.moab_storage_root.storage_location).moab_version_size
end
mjgiarlo (Member Author)

OK to have this responsibility defined on the PreservedObject?

@@ -101,7 +101,7 @@
after { FileUtils.rm_rf('/tmp/bj') } # cleanup

context 'when zip size is less than the moab size' do
let(:moab_version_size) { dvz.send(:moab_version_size) }
let(:moab_version_size) { dvz.moab_version_size }
mjgiarlo (Member Author)

This method moved from private to public.

end

context 'when complete moab exists' do
let!(:cm1) { create(:complete_moab, preserved_object: preserved_object, version: current_version, moab_storage_root: msr) } # rubocop:disable RSpec/LetSetup
mjgiarlo (Member Author) · Oct 28, 2022

Maybe I failed to drink enough 🍵 before working on this, but I couldn't tell why Rubocop was dinging me for let! on this line when this file uses plenty of let! already. I found 0 other references to RSpec/LetSetup in the codebase, when I expected to see this cop disabled or excluded elsewhere in .rubocop*.yml. Alas. 🤷🏻‍♂️

Contributor

I think it might have been complaining because cm1 is never explicitly called. If this is done in a before block, without using let, I think it would work. but no biggie.
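
A minimal sketch of that alternative, assuming the point of the `RSpec/LetSetup` cop is that the record is only needed for its side effect (factory and attribute names are copied from the diff above):

```ruby
context 'when complete moab exists' do
  # Create the record for its side effect instead of binding it to an
  # unreferenced let!, which is what RSpec/LetSetup flags.
  before do
    create(:complete_moab, preserved_object: preserved_object, version: current_version, moab_storage_root: msr)
  end

  # ...examples that only need the complete moab to exist...
end
```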



@mjgiarlo changed the title from "[HOLD] Log when total zip part size is less than the moab size" to "Log when total zip part size is less than the moab size" on Nov 3, 2022
mjgiarlo (Member Author) commented on Nov 3, 2022

@jmartin-sul do you think we should test this in a deployed env or is it safe to merge? I've unheld the PR for now.

jmartin-sul (Member)

> How was this change tested? 🤨
>
> CI. I'll have to defer to @jmartin-sul for advice on whether this is sufficient, or whether poking about in a deployed environment may be warranted.

probably a good idea to deploy this to stage or QA and let the nightly replication audit run the new check, just as a basic smoke test. with data in our test envs being generally iffy, but much smaller in volume than prod, i'm not sure whether to expect the detection of inconsistencies.

alternatively, we could deploy to QA or stage, and manually trigger #audit_moab_version_replication! on a handful of random PreservedObjects (instead of waiting for the nightly run).

actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just PreservedObject.find_each(&:audit_moab_version_replication!) on one of those envs, and see what happens when everything is checked? those environments don't use ceph and so are unlikely to have the issue with silent omission of e.g. a random datastream file or two. but there are probably other size inconsistencies because those envs are such a mess. and if we really wanted to be thorough, maybe we could hand-edit a ZipPart or two to have a smaller size, as a test? but maybe that's overkill.

i think any sort of actual test in stage or QA, whether on known bad data or not, would suffice, just to make sure the happy path is working as expected. i'd be fine to merge with even cursory testing in stage or QA.
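
In a Rails console, the two options above amount to something like the following sketch (model and method names come from this thread; everything else is illustrative):

```ruby
# Audit a handful of objects by hand...
PreservedObject.limit(5).each(&:audit_moab_version_replication!)

# ...or queue the replication audit for everything (~39k druids on QA/stage)
# and watch the Resque queues and log/c2a.log for results.
PreservedObject.find_each(&:audit_moab_version_replication!)
```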

mjgiarlo (Member Author) commented on Nov 3, 2022

@jmartin-sul 💬

> actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just PreservedObject.find_each(&:audit_moab_version_replication!) on one of those envs, and see what happens when everything is checked?

OK, I did this on QA and Resque is slowly working down the queue of ~39K jobs. What should I be looking for when this finishes? Honeybadger exceptions? Nothing so far, anyway.

mjgiarlo (Member Author) commented on Nov 3, 2022


@jmartin-sul This is now complete. There were a handful of part replication audit job runs that raised exceptions: https://app.honeybadger.io/projects/54415/faults/83422358 But none of them appear to be due to size inconsistency at first blush.

jmartin-sul (Member) commented on Nov 3, 2022

> @jmartin-sul 💬
>
> > actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just PreservedObject.find_each(&:audit_moab_version_replication!) on one of those envs, and see what happens when everything is checked?
>
> OK, I did this on QA and Resque is slowly working down the queue of ~39K jobs. What should I be looking for when this finishes? Honeybadger exceptions? Nothing so far, anyway.

i think i'd mostly look for failed jobs. if any HB alerts come up, i'd probably just give one or two a look to confirm they're indeed pointing out something mis-replicated.

and... i just realized, thinking about this again after writing that last sentence, i overlooked something in my review: i think the new result code needs to be added to the applicable lists of handled codes in the specific reporter classes (under app/services/reporters). i'd just put it next to ZIP_PARTS_COUNT_INCONSISTENCY in those classes. so, handled_single_codes in honeybadger_reporter.rb, and handled_merge_codes in event_service_reporter.rb and audit_workflow_reporter.rb. for that last one, i notice also that all the zip part codes are commented out -- i think you could do the same here. iirc, those were commented out because they exceeded some field size limit in the WFS DB before it was migrated from oracle to PG? i think there's a ticket to revisit that, but if not, i'll file one.
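
A hypothetical sketch of that follow-up, using `ZIP_PARTS_SIZE_INCONSISTENCY` as a placeholder name for the new result code and assuming an `AuditResults` namespace and a method-style list (none of which is confirmed in this thread):

```ruby
# app/services/reporters/honeybadger_reporter.rb (sketch only)
def handled_single_codes
  [
    # ...existing codes...
    AuditResults::ZIP_PARTS_COUNT_INCONSISTENCY,
    AuditResults::ZIP_PARTS_SIZE_INCONSISTENCY, # placeholder name for the new code
    # ...
  ].freeze
end
```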

sorry for the lag and the confusion! not my finest review 😬

also: i'm gonna merge this, and we can deal with the reporter result codes in a separate issue/PR. i'm looking at c2a.log on the QA worker box (-02), and it seems to be recording the usual flurry of "not all ZippedMoabVersion parts are replicated yet" messages (presumably from failed deliveries that were never retried), and i saw endpoint-specific part audit jobs get queued, which wouldn't happen if the CatalogToArchive.check_child_zip_part_attributes method modified by this PR was blowing up.

and then i went ahead and did the test where i manually futzed with a part size, and the updated audit caught the issue 🎉 i grabbed a random zip part (happened to be for ry530ff4694 on QA), set its size to 1, ran the audit on the druid, and saw this result when i grepped the logs for the new message: `log/c2a.log:E, [2022-11-03T16:21:56.205400 #1205410] ERROR -- : PartReplicationAuditJob(ry530ff4694, aws_s3_east_1) 1 on aws_s3_east_1: Sum of ZippedMoabVersion child part sizes (1) is less than what is in the Moab: 51025` (i undid my size tweak when i was done testing)

so yeah, i think this works!
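
The hand-test described above, roughly, as a console sketch; the association names (`zipped_moab_versions`, `zip_parts`) are assumptions, while the druid and method names come from this thread:

```ruby
# Shrink one recorded part size, re-run the audit, then restore the value.
po = PreservedObject.find_by(druid: 'ry530ff4694')
part = po.zipped_moab_versions.first.zip_parts.first # assumed association names
original_size = part.size
part.update!(size: 1)
po.audit_moab_version_replication!
# grep log/c2a.log for "is less than what is in the Moab", then undo the tweak:
part.update!(size: original_size)
```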

@jmartin-sul merged commit 9c4eb15 into main on Nov 3, 2022
@jmartin-sul deleted the check-part-size-audit#1742 branch on November 3, 2022 23:26
mjgiarlo (Member Author) commented on Nov 3, 2022


Thank you, John! 🍻

jmartin-sul (Member)

existing issue about commented result codes: #1240


Successfully merging this pull request may close these issues: check total size of zip parts in CatalogToArchive ("C2A")
3 participants