
Log when total zip part size is less than the moab size #1993

Merged · 1 commit merged into main from check-part-size-audit#1742 · Nov 3, 2022

Conversation

mjgiarlo (Member)

Why was this change made? 🤔

Fixes #1742

This commit adds a new audit result to the catalog-to-archive audit service such that, when the total zip part size for a zipped moab version is less than the size of the moab on disk, this condition is reported. Reporting this information will help us find and remediate replication errors.

As part of this work, as a convenience, add a new `#total_part_size` method to the `ZippedMoabVersion` model.
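
For illustration only, here is a minimal sketch of what such a convenience method could look like, assuming `ZippedMoabVersion` has a `zip_parts` association whose records store an integer `size` column (those names are assumptions; the actual method is in the diff below):

```ruby
# Minimal sketch, not the actual implementation from this PR.
# Assumes a zip_parts association with an integer `size` column.
class ZippedMoabVersion < ApplicationRecord
  # Sum of the recorded sizes of this version's zip parts on the endpoint.
  def total_part_size
    zip_parts.sum(:size)
  end
end
```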

How was this change tested? 🤨

CI. I'll have to defer to @jmartin-sul for advice on whether this is sufficient, or whether poking about in a deployed environment may be warranted.

Comment on lines 18 to 19
total_part_size = zmv.total_part_size
moab_version_size = zmv.preserved_object.complete_moab&.size || 0
mjgiarlo (Member Author)

@jmartin-sul I'm not entirely confident I've got the right sizes here, but maybe I'm not far off?

jmartin-sul (Member)

@mjgiarlo I think we probably want a method like CompleteMoab#version_size_from_disk(version), that'll use the complete_moab.moab_storage_root.storage_location and the druid (from the parent PreservedObject) to get a druid tree (for which i think you can use DruidTools::Druid.new(druid, storage_location).path). And then this new method would get the size of e.g. the v0003 dir in the particular Moab. Sorry, when we were pairing on this yesterday I mentioned thinking that the sizes in the DB could be trusted, but a bit of nuance is that the CompleteMoab encompasses all versions, so its size field isn't at the right level of granularity for comparison with one ZippedMoabVersion, I don't think. (perhaps replacing CompleteMoab#size with a CompleteMoab#version_sizes array is worth ticketing and considering, since that might be useful info for querying here and in general)

mjgiarlo (Member Author)

@jmartin-sul Sounds good, and rings a bell, even. 😄 I'll tackle that next.

@@ -3,7 +3,7 @@
require 'rails_helper'

RSpec.describe Audit::CatalogToArchive do
let(:zmv) { create(:zipped_moab_version) }
let(:zmv) { create(:zipped_moab_version, preserved_object: create(:preserved_object_fixture, druid: 'bz514sm9647')) }
mjgiarlo (Member Author) · Oct 26, 2022

Without this, the zipped moab version's preserved object does not have a complete moab.

@@ -15,6 +15,15 @@ def self.check_child_zip_part_attributes(zmv, results)
return false # everything else relies on checking parts, nothing left to do
end

total_part_size = zmv.total_part_size
moab_version_size = zmv.preserved_object.complete_moab&.size || 0
if total_part_size < moab_version_size
jmartin-sul (Member)

And so, if moab_version_size's calculation changes as suggested above, then like the corresponding check when building the zip file from the Moab version, I think we'd want to confirm that total_part_size > moab_version_size, instead of less than. The assumption being that the zips are created without compression (presumably with a bit of additional zip file metadata), and that the (hopefully past 😅) failure mode we're worried about was the silent/uncaught omission of some files when zips were created before that size check was added.

mjgiarlo (Member Author) · Oct 28, 2022

@jmartin-sul Maybe I still have this backwards but instead of making the change you recommend without really grokking why, I'd like to dig into this more so I can hopefully do a better job of keeping it in my brain. Here's what I was thinking... let me know where I go off the rails:

  • total_part_size is the total size of zips on an endpoint for a given druid + version
  • moab_version_size is the size on disk for a given druid + version (once fixed per your suggestion above)
  • the error condition we want to flag now is when we have stuff missing in the cloud
  • the way we flag that is by comparing total_part_size to moab_version_size and if the former is smaller than the latter, then it's missing content
  • so isn't the conditional for adding a size inconsistency error result total_part_size < moab_version_size?

jmartin-sul (Member)

yes, you had this right! sorry, when i first read the PR, i must've (incorrectly) been thinking of the condition evaluating to true as the non-error state. sorry for the confusion!

jmartin-sul (Member) left a comment

whoops, left individual comments, but forgot to do the official "request changes" thing

Comment on lines +169 to +171
def moab_version_size
moab_version_files.sum { |f| File.size(f) }
end
mjgiarlo (Member Author)

Since DruidVersionZip already has a method to get the size on disk of a given druid + version + storage location, I decided to use that method instead of writing a new one. And as part of that, I moved this method into the public API (and added a test below).

jmartin-sul (Member)

good call!

Comment on lines +64 to +68
def total_size_of_moab_version(version)
return 0 unless complete_moab

DruidVersionZip.new(druid, version, complete_moab.moab_storage_root.storage_location).moab_version_size
end
mjgiarlo (Member Author)

OK to have this responsibility defined on the PreservedObject?

@@ -101,7 +101,7 @@
after { FileUtils.rm_rf('/tmp/bj') } # cleanup

context 'when zip size is less than the moab size' do
let(:moab_version_size) { dvz.send(:moab_version_size) }
let(:moab_version_size) { dvz.moab_version_size }
mjgiarlo (Member Author)

This method moved from private to public.

end

context 'when complete moab exists' do
let!(:cm1) { create(:complete_moab, preserved_object: preserved_object, version: current_version, moab_storage_root: msr) } # rubocop:disable RSpec/LetSetup
mjgiarlo (Member Author) · Oct 28, 2022

Maybe I failed to drink enough 🍵 before working on this, but I couldn't tell why Rubocop was dinging me for let! on this line when this file uses plenty of let! already. I found 0 other references to RSpec/LetSetup in the codebase, when I expected to see this cop disabled or excluded elsewhere in .rubocop*.yml. Alas. 🤷🏻‍♂️

Contributor

I think it might have been complaining because cm1 is never explicitly called. If this is done in a before block, without using let, I think it would work. but no biggie.
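
A minimal sketch of that alternative, assuming the point of the `RSpec/LetSetup` cop is that the record is only needed for its side effect (factory and attribute names are copied from the diff above):

```ruby
context 'when complete moab exists' do
  # Create the record for its side effect instead of binding it to an
  # unreferenced let!, which is what RSpec/LetSetup flags.
  before do
    create(:complete_moab, preserved_object: preserved_object, version: current_version, moab_storage_root: msr)
  end

  # ...examples that only need the complete moab to exist...
end
```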



@mjgiarlo changed the title from "[HOLD] Log when total zip part size is less than the moab size" to "Log when total zip part size is less than the moab size" on Nov 3, 2022
mjgiarlo (Member Author) commented on Nov 3, 2022

@jmartin-sul do you think we should test this in a deployed env or is it safe to merge? I've unheld the PR for now.

jmartin-sul (Member)

> How was this change tested? 🤨
>
> CI. I'll have to defer to @jmartin-sul for advice on whether this is sufficient, or whether poking about in a deployed environment may be warranted.

probably a good idea to deploy this to stage or QA and let the nightly replication audit run the new check, just as a basic smoke test. with data in our test envs being generally iffy, but much smaller in volume than prod, i'm not sure whether to expect the detection of inconsistencies.

alternatively, we could deploy to QA or stage, and manually trigger #audit_moab_version_replication! on a handful of random PreservedObjects (instead of waiting for the nightly run).

actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just PreservedObject.find_each(&:audit_moab_version_replication!) on one of those envs, and see what happens when everything is checked? those environments don't use ceph and so are unlikely to have the issue with silent omission of e.g. a random datastream file or two. but there are probably other size inconsistencies because those envs are such a mess. and if we really wanted to be thorough, maybe we could hand-edit a ZipPart or two to have a smaller size, as a test? but maybe that's overkill.

i think any sort of actual test in stage or QA, whether on known bad data or not, would suffice, just to make sure the happy path is working as expected. i'd be fine to merge with even cursory testing in stage or QA.
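
In a Rails console, the two options above amount to something like the following sketch (model and method names come from this thread; everything else is illustrative):

```ruby
# Audit a handful of objects by hand...
PreservedObject.limit(5).each(&:audit_moab_version_replication!)

# ...or queue the replication audit for everything (~39k druids on QA/stage)
# and watch the Resque queues and log/c2a.log for results.
PreservedObject.find_each(&:audit_moab_version_replication!)
```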

mjgiarlo (Member Author) commented on Nov 3, 2022

@jmartin-sul 💬

> actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just PreservedObject.find_each(&:audit_moab_version_replication!) on one of those envs, and see what happens when everything is checked?

OK, I did this on QA and Resque is slowly working down the queue of ~39K jobs. What should I be looking for when this finishes? Honeybadger exceptions? Nothing so far, anyway.

mjgiarlo (Member Author) commented on Nov 3, 2022


@jmartin-sul This is now complete. There were a handful of part replication audit job runs that raised exceptions: https://app.honeybadger.io/projects/54415/faults/83422358 But none of them appear to be due to size inconsistency at first blush.

jmartin-sul (Member) commented on Nov 3, 2022

> @jmartin-sul 💬
>
> > actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just PreservedObject.find_each(&:audit_moab_version_replication!) on one of those envs, and see what happens when everything is checked?
>
> OK, I did this on QA and Resque is slowly working down the queue of ~39K jobs. What should I be looking for when this finishes? Honeybadger exceptions? Nothing so far, anyway.

i think i'd mostly look for failed jobs. if any HB alerts come up, i'd probably just give one or two a look to confirm they're indeed pointing out something mis-replicated.

and... i just realized, thinking about this again after writing that last sentence, i overlooked something in my review: i think the new result code needs to be added to the applicable lists of handled codes in the specific reporter classes (under app/services/reporters). i'd just put it next to ZIP_PARTS_COUNT_INCONSISTENCY in those classes. so, handled_single_codes in honeybadger_reporter.rb, and handled_merge_codes in event_service_reporter.rb and audit_workflow_reporter.rb. for that last one, i notice also that all the zip part codes are commented out -- i think you could do the same here. iirc, those were commented out because they exceeded some field size limit in the WFS DB before it was migrated from oracle to PG? i think there's a ticket to revisit that, but if not, i'll file one.
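
A hypothetical sketch of that follow-up, using `ZIP_PARTS_SIZE_INCONSISTENCY` as a placeholder name for the new result code and assuming an `AuditResults` namespace and a method-style list (none of which is confirmed in this thread):

```ruby
# app/services/reporters/honeybadger_reporter.rb (sketch only)
def handled_single_codes
  [
    # ...existing codes...
    AuditResults::ZIP_PARTS_COUNT_INCONSISTENCY,
    AuditResults::ZIP_PARTS_SIZE_INCONSISTENCY, # placeholder name for the new code
    # ...
  ].freeze
end
```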

sorry for the lag and the confusion! not my finest review 😬

also: i'm gonna merge this, and we can deal with the reporter result codes in a separate issue/PR. i'm looking at c2a.log on the QA worker box (-02), and it seems to be recording the usual flurry of "not all ZippedMoabVersion parts are replicated yet" messages (presumably from failed deliveries that were never retried), and i saw endpoint-specific part audit jobs get queued, which wouldn't happen if the CatalogToArchive.check_child_zip_part_attributes method modified by this PR was blowing up.

and then i went ahead and did the test where i manually futzed with a part size, and the updated audit caught the issue 🎉 i grabbed a random zip part (happened to be for ry530ff4694 on QA), set its size to 1, ran the audit on the druid, and saw this result when i grepped the logs for the new message: `log/c2a.log:E, [2022-11-03T16:21:56.205400 #1205410] ERROR -- : PartReplicationAuditJob(ry530ff4694, aws_s3_east_1) 1 on aws_s3_east_1: Sum of ZippedMoabVersion child part sizes (1) is less than what is in the Moab: 51025` (i undid my size tweak when i was done testing)

so yeah, i think this works!
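
The hand-test described above, roughly, as a console sketch; the association names (`zipped_moab_versions`, `zip_parts`) are assumptions, while the druid and method names come from this thread:

```ruby
# Shrink one recorded part size, re-run the audit, then restore the value.
po = PreservedObject.find_by(druid: 'ry530ff4694')
part = po.zipped_moab_versions.first.zip_parts.first # assumed association names
original_size = part.size
part.update!(size: 1)
po.audit_moab_version_replication!
# grep log/c2a.log for "is less than what is in the Moab", then undo the tweak:
part.update!(size: original_size)
```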

@jmartin-sul merged commit 9c4eb15 into main on Nov 3, 2022
@jmartin-sul deleted the check-part-size-audit#1742 branch on November 3, 2022 23:26
mjgiarlo (Member Author) commented on Nov 3, 2022


Thank you, John! 🍻

jmartin-sul (Member)

existing issue about commented result codes: #1240


Successfully merging this pull request may close these issues: check total size of zip parts in CatalogToArchive ("C2A")
3 participants