TODO
Currently, we just return silently if the part already exists in the cloud:
preservation_catalog/app/services/zip_delivery_service.rb, lines 18 to 20 @ ffb18f6
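For readers who don't want to chase the permalink: the guard at those lines presumably looks roughly like the sketch below. This is an illustrative reconstruction, not the actual code; the `deliver` method, the `s3_part` collaborator, and `zip_path` are assumed names standing in for whatever the real service uses.

```ruby
# Illustrative sketch only -- not the actual zip_delivery_service.rb lines.
# Assumes s3_part quacks like an Aws::S3::Object (responds to #exists? and #upload_file).
class ZipDeliveryService
  def initialize(s3_part:, zip_path:)
    @s3_part = s3_part
    @zip_path = zip_path
  end

  def deliver
    # Current behavior: if the part is already in the cloud, bail out silently
    # with a falsey value, so nothing is uploaded and nothing is recorded.
    return if @s3_part.exists?

    @s3_part.upload_file(@zip_path)
  end
end
```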
However, we may want to update this behavior a bit: as far as I know, it'd be pretty unexpected to actually re-attempt delivery of something that'd been successfully uploaded, and I'm unaware of a situation where we queue multiple jobs to deliver the same part to the same endpoint. Additionally, we want to enable automated backfilling for missing ZippedMoabVersions (#2036). More on the relevance of that in "Context".
I'd suggest a helpful alert message like "WARNING: attempting to push a druid version zip part to an S3 location that already has content. Perhaps a replication failure was pruned from the database, but still needs to be cleaned up from the cloud. Prune the failure again and ask ops to delete the bad replicated content." And then I'd include the druid/version/endpoint in the Honeybadger (HB) alert context. However, I wouldn't raise or cause the job to fail, and would still be sure to return a falsey value in this case, because there's no use in retrying the job, and we also don't want the calling AbstractDeliveryJob to proceed to calling ResultsRecorderJob.perform_later, since nothing was delivered.
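In code terms, the proposal amounts to something like the sketch below (same caveats and same assumed names as the sketch above; `Honeybadger.notify` with a `context:` hash is a real API, but the `druid`/`version`/`endpoint_name` accessors are hypothetical):

```ruby
# Sketch of the proposed behavior: alert via Honeybadger, but don't raise,
# and still return falsey so AbstractDeliveryJob won't enqueue ResultsRecorderJob.
def deliver
  if @s3_part.exists?
    Honeybadger.notify(
      'WARNING: attempting to push a druid version zip part to an S3 location that already has content. ' \
      'Perhaps a replication failure was pruned from the database, but still needs to be cleaned up from ' \
      'the cloud. Prune the failure again and ask ops to delete the bad replicated content.',
      context: { druid: druid, version: version, endpoint: endpoint_name } # hypothetical accessors
    )
    return false
  end

  @s3_part.upload_file(@zip_path)
end
```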
Context
We'll also be trying to clean up known partial replications (see #1733), and we've just enabled a new audit error about sanity checks on size (see #1993) that will likely alert us to a few more replications that need re-doing.
Since we want to restrict deletion and overwrite of cloud archive content as much as possible, we have to request that ops handle deletion of partial replications.
So, we could occasionally run into a corner case where:
we've cleaned up the database records for a mis-replicated or partially replicated druid version on a given endpoint, using CatalogRemediator.prune_replication_failures or its rake task (prune_failed_replication)
we've given ops the druid/version/endpoint combos for deletion, based on the DB records that prune_replication_failures determined needed a re-push (it'll be fed from CatalogToArchive audit results).
ops hasn't had a chance to actually delete the bad cloud content
CatalogToArchive runs for a druid in this situation, tries to backfill the missing content, and can't push over the yet-to-be-removed bad replicated content.
Since replication audit only runs every 3 months on a given druid, it seems unlikely that we'll run into this situation much, if at all, so starting out with a simple HB alert to let us know this happened, and that we need to re-run prune_replication_failures on the druid version, seems like a better approach to me than trying to do something more automated.