ZipDeliveryService: complain loudly if the part already exists #2041

Closed
jmartin-sul opened this issue Nov 14, 2022 · 0 comments · Fixed by #2092
Labels: async jobs, enhancement, replication

jmartin-sul commented Nov 14, 2022

TODO

Currently, we just return silently if the part already exists in the cloud:

def deliver
  return if s3_part.exists?
  # ... (rest of the upload logic elided)
end

However, we may want to update this behavior a bit: as far as I know, it'd be pretty unexpected to actually re-attempt delivery of something that'd been successfully uploaded, and I'm unaware of a situation where we queue multiple jobs to deliver the same part to the same endpoint. Additionally, we want to enable automated backfilling for missing ZippedMoabVersions (#2036). More on the relevance of that in "Context".

I'd suggest a helpful alert message like "WARNING: attempting to push a druid version zip part to an S3 location that already has content. Perhaps a replication failure was pruned from the database, but still needs to be cleaned up from the cloud. Prune the failure again and ask ops to delete the bad replicated content." I'd also include the druid/version/endpoint in the HB alert context. However, I wouldn't raise or cause the job to fail, and I'd still be sure to return a falsey value in this case: there's no use in retrying the job, and we also don't want the calling AbstractDeliveryJob to proceed to calling ResultsRecorderJob.perform_later, since nothing was delivered.
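A minimal sketch of what that could look like in ZipDeliveryService#deliver, assuming Honeybadger.notify is used for the HB alert; the druid, version, and endpoint_name accessors shown here are hypothetical stand-ins for however the service actually exposes those values:

def deliver
  if s3_part.exists?
    # Alert loudly, but don't raise and don't fail the job.
    Honeybadger.notify(
      'WARNING: attempting to push a druid version zip part to an S3 location that already has content. ' \
      'Perhaps a replication failure was pruned from the database, but still needs to be cleaned up from the cloud. ' \
      'Prune the failure again and ask ops to delete the bad replicated content.',
      context: { druid: druid, version: version, endpoint: endpoint_name }
    )
    # Return falsey so the calling AbstractDeliveryJob doesn't enqueue ResultsRecorderJob.
    return false
  end

  # ... (existing upload logic, unchanged)
end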

Context

We'll also be trying to clean up known partial replications (see #1733), and we've just enabled a new audit error about sanity checks on size (see #1993) that will likely alert us to a few more replications that need re-doing.

Since we want to restrict deletion and overwrite of cloud archive content as much as possible, we have to request that ops handle deletion of partial replications.

So, we could occasionally run into a corner case where:

  • we've cleaned up the database records for a mis-replicated or partially replicated druid version on a given endpoint, using CatalogRemediator.prune_replication_failures or its rake task (prune_failed_replication)
  • we've given ops the druid/version/endpoint combos to delete, taken from the DB records that prune_replication_failures determined needed a re-push (it'll be fed from CatalogToArchive audit results).
  • ops hasn't had a chance to actually delete the bad cloud content
  • CatalogToArchive runs for a druid in this situation, tries to backfill the missing content, and can't push over the yet-to-be-removed bad replicated content.

Since replication audit only runs every 3 months on a given druid, it seems unlikely that we'll run into this situation much, if at all. So starting out with a simple HB alert, letting us know this happened and that we need to re-run prune_replication_failures on the druid version, seems like a better approach to me than trying to do something more automated.
