Log when total zip part size is less than the moab size #1993
Conversation
```ruby
total_part_size = zmv.total_part_size
moab_version_size = zmv.preserved_object.complete_moab&.size || 0
```
@jmartin-sul I'm not entirely confident I've got the right sizes here, but maybe I'm not far off?
@mjgiarlo I think we probably want a method like CompleteMoab#version_size_from_disk(version)
, that'll use the complete_moab.moab_storage_root.storage_location
and the druid (from the parent PreservedObject
) to get a druid tree (for which i think you can use DruidTools::Druid.new(druid, storage_location).path
). And then this new method would get the size of e.g. the v0003
dir in the particular Moab. Sorry, when we were pairing on this yesterday I mentioned thinking that the sizes in the DB could be trusted, but a bit of nuance is that the CompleteMoab
encompasses all versions, so its size field isn't at the right level of granularity for comparison with one ZippedMoabVersion
, I don't think. (perhaps replacing CompleteMoab#size
with a CompleteMoab#version_sizes
array is worth ticketing and considering, since that might be useful info for querying here and in general)
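For intuition, a standalone plain-Ruby sketch of the suggested approach. The druid-tree computation here hand-rolls what `DruidTools::Druid.new(druid, storage_location).path` provides, and the method names are illustrative, not the project's actual API:

```ruby
require 'fileutils'
require 'tmpdir'

# Approximation of the druid-tree layout DruidTools::Druid computes,
# e.g. bz514sm9647 -> <storage_location>/bz/514/sm/9647/bz514sm9647
def druid_tree_path(storage_location, druid)
  parts = druid.match(/\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/).captures
  File.join(storage_location, *parts, druid)
end

# Size on disk of one version directory (e.g. v0003) within the Moab,
# summed over all regular files under that version dir.
def version_size_from_disk(storage_location, druid, version)
  version_dir = File.join(druid_tree_path(storage_location, druid), format('v%04d', version))
  Dir.glob(File.join(version_dir, '**', '*'))
     .select { |f| File.file?(f) }
     .sum { |f| File.size(f) }
end
```

This is the granularity point above in code form: the sum is taken over a single version directory, not the whole Moab, so it is comparable to one `ZippedMoabVersion`.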
@jmartin-sul Sounds good, and rings a bell, even. 😄 I'll tackle that next.
Force-pushed from 395f6d0 to 470285e
```diff
@@ -3,7 +3,7 @@
 require 'rails_helper'

 RSpec.describe Audit::CatalogToArchive do
-  let(:zmv) { create(:zipped_moab_version) }
+  let(:zmv) { create(:zipped_moab_version, preserved_object: create(:preserved_object_fixture, druid: 'bz514sm9647')) }
```
Without this, the zipped moab version's preserved object does not have a complete moab.
Force-pushed from 470285e to aa5dfbf
```diff
@@ -15,6 +15,15 @@ def self.check_child_zip_part_attributes(zmv, results)
   return false # everything else relies on checking parts, nothing left to do
 end

+total_part_size = zmv.total_part_size
+moab_version_size = zmv.preserved_object.complete_moab&.size || 0
+if total_part_size < moab_version_size
```
And so, if `moab_version_size`'s calculation changes as suggested above, then like the corresponding check when building the zip file from the Moab version, I think we'd want to confirm that `total_part_size > moab_version_size`, instead of less than. The assumption being that the zips are created without compression (presumably with a bit of additional zip file metadata), and that the (hopefully past 😅) failure mode we're worried about was the silent/uncaught omission of some files when zips were created before that size check was added.
@jmartin-sul Maybe I still have this backwards but instead of making the change you recommend without really grokking why, I'd like to dig into this more so I can hopefully do a better job of keeping it in my brain. Here's what I was thinking... let me know where I go off the rails:

- `total_part_size` is the total size of zips on an endpoint for a given druid + version
- `moab_version_size` is the size on disk for a given druid + version (once fixed per your suggestion above)
- the error condition we want to flag now is when we have stuff missing in the cloud
- the way we flag that is by comparing `total_part_size` to `moab_version_size`, and if the former is smaller than the latter, then it's missing content
- so isn't the conditional for adding a size inconsistency error result `total_part_size < moab_version_size`?
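That reading matches the check as merged: flag an error when the summed part sizes come up short. A minimal standalone sketch of the conditional, where the result symbol is illustrative and not the project's actual audit result code:

```ruby
# Flag a replication-size inconsistency when the sum of zip part sizes
# reported for the endpoint is smaller than the Moab version's size on disk.
# Since zips are created without compression, parts should be at least as
# large as the source files (plus a little zip metadata), so strictly
# smaller means content is missing in the cloud.
def size_inconsistency?(total_part_size, moab_version_size)
  total_part_size < moab_version_size
end

# Append an error result (hypothetical code name) when the sizes disagree.
def check_sizes(total_part_size, moab_version_size, results = [])
  results << :zip_parts_size_inconsistency if size_inconsistency?(total_part_size, moab_version_size)
  results
end
```

Note that equal or slightly larger part totals are the expected happy path, which is why only the strictly-less-than case is an error.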
yes, you had this right! sorry, when i first read the PR, i must've (incorrectly) been thinking of the condition evaluating to true as the non-error state. sorry for the confusion!
whoops, left individual comments, but forgot to do the official "request changes" thing
Force-pushed from aa5dfbf to f9a6ca2
Fixes #1742 This commit adds a new audit result to the catalog-to-archive audit service such that, when the total zip part size for a zipped moab version is less than the size of the moab on disk, this condition is reported. Reporting this information will help us find and remediate replication errors. As part of this work, as a convenience, add a new `#total_part_size` method to the `ZippedMoabVersion` model.
Force-pushed from f9a6ca2 to 840f486
```ruby
def moab_version_size
  moab_version_files.sum { |f| File.size(f) }
end
```
Since `DruidVersionZip` already has a method to get the size on disk of a given druid + version + storage location, I decided to use that method instead of writing a new one. And as part of that, I moved this method into the public API (and added a test below).
good call!
```ruby
def total_size_of_moab_version(version)
  return 0 unless complete_moab

  DruidVersionZip.new(druid, version, complete_moab.moab_storage_root.storage_location).moab_version_size
end
```
OK to have this responsibility defined on the PreservedObject?
```diff
@@ -101,7 +101,7 @@
 after { FileUtils.rm_rf('/tmp/bj') } # cleanup

 context 'when zip size is less than the moab size' do
-  let(:moab_version_size) { dvz.send(:moab_version_size) }
+  let(:moab_version_size) { dvz.moab_version_size }
```
This method moved from private to public.
```ruby
end

context 'when complete moab exists' do
  let!(:cm1) { create(:complete_moab, preserved_object: preserved_object, version: current_version, moab_storage_root: msr) } # rubocop:disable RSpec/LetSetup
```
Maybe I failed to drink enough 🍵 before working on this, but I couldn't tell why Rubocop was dinging me for `let!` on this line when this file uses plenty of `let!` already. I found 0 other references to `RSpec/LetSetup` in the codebase, when I expected to see this cop disabled or excluded elsewhere in `.rubocop*.yml`. Alas. 🤷🏻♂️
I think it might have been complaining because `cm1` is never explicitly called. If this is done in a before block, without using let, I think it would work. but no biggie.
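For intuition about why the cop fires (plain Ruby, not RSpec itself): `let` defines a lazily memoized helper, while `let!` forces evaluation before each example, so a `let!` that no example references exists purely for its side effect, which is what `RSpec/LetSetup` flags and what a `before` block would state directly. A toy model, under the assumption that it mirrors RSpec's semantics:

```ruby
# Toy model of RSpec's let vs let!: `let` registers a lazily memoized
# helper; `let!` additionally forces evaluation before each example,
# exactly like defining the helper plus a before hook that calls it.
class ToySpec
  def initialize
    @defs = {}
    @memo = {}
    @before_hooks = []
  end

  def let(name, &block)
    @defs[name] = block
  end

  def let!(name, &block)
    let(name, &block)
    @before_hooks << -> { fetch(name) } # runs eagerly, even if unused
  end

  def fetch(name)
    @memo[name] ||= @defs[name].call
  end

  def run_example
    @before_hooks.each(&:call) # eager setup happens first
    yield self
  end
end
```

Running an example that references neither helper shows the difference: the `let!` block still executes (its record still gets created), while the plain `let` block never runs.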
@jmartin-sul do you think we should test this in a deployed env or is it safe to merge? I've unheld the PR for now.
probably a good idea to deploy this to stage or QA and let the nightly replication audit run the new check, just as a basic smoke test. with data in our test envs being generally iffy, but much smaller in volume than prod, i'm not sure whether to expect the detection of inconsistencies.

alternatively, we could deploy to QA or stage, and manually trigger

actually, i just looked, and QA and stage each have just over 39k druids... maybe we should just

i think any sort of actual test in stage or QA, whether it was on known bad data or not, would suffice, just to make sure the happy path is working as expected. i'd be fine to merge with even cursory testing in stage or QA.
OK, I did this on QA and Resque is slowly working down the queue of ~39K jobs. What should I be looking for when this finishes? Honeybadger exceptions? Nothing so far, anyway.
@jmartin-sul This is now complete. There were a handful of part replication audit job runs that raised exceptions: https://app.honeybadger.io/projects/54415/faults/83422358

But none of them appear to be due to size inconsistency at first blush.
i think i'd mostly look for failed jobs. if any HB alerts come up, i'd probably just give one or two a look to confirm they're indeed pointing out something mis-replicated.

and... i just realized, thinking about this again after writing that last sentence, i overlooked something in my review: i think the new result code needs to be added to the applicable lists of handled codes in the specific reporter classes (under

sorry for the lag and the confusion! not my finest review 😬

also: i'm gonna merge this, and we can deal with the reporter result codes in a separate issue/PR. i'm looking at

and then i went ahead and did the test where i manually futzed with a part size, and the updated audit caught the issue 🎉 i grabbed a random zip part (happened to be for

so yeah, i think this works!
Thank you, John! 🍻
existing issue about commented result codes: #1240
Why was this change made? 🤔

Fixes #1742

This commit adds a new audit result to the catalog-to-archive audit service such that, when the total zip part size for a zipped moab version is less than the size of the moab on disk, this condition is reported. Reporting this information will help us find and remediate replication errors.

As part of this work, as a convenience, add a new `#total_part_size` method to the `ZippedMoabVersion` model.

How was this change tested? 🤨
CI. I'll have to defer to @jmartin-sul for advice on whether this is sufficient, or whether poking about in a deployed environment may be warranted.