Skip to content
This repository has been archived by the owner on Apr 30, 2019. It is now read-only.

Add sanity check in GC to prevent active ledger deletion #15

Open
wants to merge 3 commits into
base: yahoo-4.3
Choose a base branch
from

Conversation

rdhabalia
Copy link

Motivation

Sometimes, due to unexpected bugs in GC, GC deletes active ledgers (ledger metadata still exist in ZK) from the bookie and it results into the data-loss.

We have seen this scenario multiple times:
(1) Lexicographical sorting at ledgerManager
(2) Recently we have seen that LongHierarchicalLedgerManager doesn't handle ledger directory traversing if we manually cleans up empty directory which doesn't contain any children.
eg:
We implemented a cron job to delete ledger directory from zookeeper, when it does not contain any children in zookeeper. e.g: /ledgers/000/0000/0042/0003
In the above sequence, after processing 0001 and 0002 nodes, when LongHierarchicalLedgerManager iterator sees the next node(0003) does not exist, it assumes no other nodes starting from 0003 till 9999 exists under /ledgers/000/0000/0042. This triggers bookie to delete any ledgers from 0004 onwards present in that bookie.

So, we have seen data loss multiple times which can be prevented if GC can perform a sanity check before cleaning ledger from the bookie.

Modifications

Add sanity check before cleaning up the ledger.

Result

It can prevent possible data loss due to any bug in GC.

@@ -350,6 +348,23 @@ public void readLedgerMetadata(final long ledgerId, final GenericCallback<Ledger
readLedgerMetadata(ledgerId, readCb, null);
}

@Override
public void existLedgerMetadata(final long ledgerId, final GenericCallback<Boolean> callback) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, 'exists' instead of 'exit'?

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good. I am a little confused about one of the design choices. I would also like to see a corresponding pull request go into open source before I merge it in here.

LOG.warn("Fail to check {} exists in zk {}", ledgerId, BKException.getMessage(rc));
}
latch.countDown();
semaphore.release();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a semaphore at all. Except during unit tests the gc method is called by scheduleAtFixedRate, which guarantees that later calls to gc will not happen until the previous ones finish.

if garbageCleaner.clean was inside the callback then the semaphore would make since, but with the CountDownLatch this is a blocking call, so it really doesn't.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i missed to remove it after adding latch. will fix it.

@rdhabalia
Copy link
Author

+@merlimat @sijie

@rdhabalia rdhabalia requested a review from merlimat January 10, 2018 22:09
@sijie
Copy link

sijie commented Jan 11, 2018

I think this is similar as the change Salesforce made in apache/bookkeeper@1f8b26d fyi

@saandrews
Copy link

@sijie That change still does not address cases where due to some other bug a valid ledger is deleted.

Copy link

@saandrews saandrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

private void gcLedgerSafely(GarbageCleaner garbageCleaner, long ledgerId, LedgerManager ledgerManager) throws InterruptedException {
CountDownLatch latch = new CountDownLatch(1);
AtomicBoolean ledgerDeleted = new AtomicBoolean(false);
ledgerManager.existsLedgerMetadata(ledgerId, (rc, exists) -> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the second param exists?

@merlimat
Copy link
Collaborator

@sijie That change still does not address cases where due to some other bug a valid ledger is deleted.

Actually, I think it's the same behavior, because it forces to read the ledger metadata, which means reading the znode

@saandrews
Copy link

I see. Did we merge that pull request? I see it got closed, so not sure if it got approved eventually.

@sijie
Copy link

sijie commented Jan 11, 2018

apache#876 was approved and merged. the gitsha I pointed out is the change that already exists in latest apache/master. I think the change in apache and the change here are almost same, either merge this one to yahoo or cherry-pick the apache change works for me.

@rdhabalia
Copy link
Author

thanks for pointing out the commit. I think checking just exists of znode might be more optimized approach than reading znode and transferring metadata over n/w for each ledger. But I am not sure if it would be a big deal. So, @revans2 should we cherry-pick apache/bookkeeper#876 commit to be in sync and we can close this PR?

@revans2
Copy link
Collaborator

revans2 commented Feb 7, 2018

@rdhabalia could you take a look at #22 and see if it is good enough to replace this? It is a backport of the change that went into the apache repo that you suggested we look at instead.

rdhabalia pushed a commit to rdhabalia/bookkeeper that referenced this pull request Sep 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants