Fix metadata synchronization issues in Snuba migrations for multi-replica/shard ClickHouse clusters #6826

Open
wants to merge 1 commit into
base: master

pyhp2017

Problem:
The Snuba migration system was unstable when running migrations against ClickHouse clusters with multiple replicas and shards. Synchronization errors, particularly those caused by a replica's metadata lagging behind the metadata stored in ZooKeeper, frequently caused migrations to fail or to require manual intervention.

Solution:

  1. Retry Mechanism: Added a RetryOnSyncError class that retries an operation for up to 30 seconds when it fails with a synchronization error ("Metadata on replica is not up to date with common metadata in Zookeeper"). The retry is integrated into the critical migration operations (AddColumn, DropColumn, ModifyColumn, etc.) so that transient synchronization issues are handled gracefully.

  2. DROP TABLE with SYNC: Updated DROP TABLE operations to include the SYNC keyword, so that the statement waits for the drop to complete instead of returning while the table may still exist.
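As a rough illustration of the retry idea in item 1, here is a minimal standalone sketch. It is not the PR's actual RetryOnSyncError class: the function name `retry_on_sync_error`, the one-second delay, and the injectable `clock`/`sleep` parameters are assumptions made for this example. Only the 30-second budget and the error text come from this PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error text quoted in this PR's review thread.
SYNC_ERROR_TEXT = (
    "Metadata on replica is not up to date with common metadata in Zookeeper"
)


def retry_on_sync_error(
    operation: Callable[[], T],
    timeout_seconds: float = 30.0,  # budget taken from the PR description
    delay_seconds: float = 1.0,  # assumption: pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying while it fails with the sync error.

    Any other exception is re-raised immediately. The sync error itself is
    re-raised once another wait would exceed the time budget.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            return operation()
        except Exception as exc:
            if SYNC_ERROR_TEXT not in str(exc):
                raise  # unrelated failure: do not mask it
            if clock() + delay_seconds > deadline:
                raise  # out of time: surface the last sync error
            sleep(delay_seconds)
```

A migration operation could wrap its own execute step in this helper, for example `retry_on_sync_error(lambda: op.execute())`, where `op` stands for any operation object. Using a deadline rather than a fixed attempt count keeps the total wait bounded even if individual attempts are slow.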

Impact:
These changes significantly enhance the robustness of the Snuba migration system when operating in distributed ClickHouse environments with complex replication setups. By automatically handling synchronization delays, the risk of migration failures is reduced, minimizing operational overhead and downtime.

Testing:

  • Verified migrations in a multi-replica, multi-shard ClickHouse cluster.

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@pyhp2017 pyhp2017 requested review from a team as code owners January 25, 2025 12:14
…d shards

This commit introduces a retry mechanism to handle synchronization errors in migrations caused by outdated metadata in ClickHouse replicas. Additionally, it ensures DROP TABLE operations use SYNC. These changes address one of the critical issues in the Snuba migration system, improving stability and reliability when working with ClickHouse clusters.
Member

@evanh evanh left a comment

Could you please fix the ci / pre-commit hooks (pull_request) check and the ci / mypy typing (pull_request) check? Those are required for the tests to run. It's ok for the build steps to fail for now.

        super().execute()
        break
    except Exception as e:
        if i and 'Metadata on replica is not up to date with common metadata in Zookeeper' in str(e):

I believe the exception that is returned here will have an error code in it. Could you change this to match on the code instead of the string?
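One way the suggestion above might look is sketched here. The exception class `FakeClickhouseError` and the helper `is_sync_error` are hypothetical stand-ins written for this example, not Snuba's real types. The value 517 is assumed to be ClickHouse's CANNOT_ASSIGN_ALTER code, which is believed to accompany this message; confirm it against the server version in use before relying on it.

```python
# Assumption: 517 (CANNOT_ASSIGN_ALTER) is the code ClickHouse attaches to the
# "Metadata on replica is not up to date..." error. Verify before relying on it.
CANNOT_ASSIGN_ALTER = 517


class FakeClickhouseError(Exception):
    """Hypothetical stand-in for a driver exception exposing a numeric code."""

    def __init__(self, message: str, code: int) -> None:
        super().__init__(message)
        self.code = code


def is_sync_error(exc: BaseException) -> bool:
    """Return True if the exception carries the retryable error code.

    getattr is used so that exceptions without a `code` attribute are
    treated as non-retryable instead of raising AttributeError.
    """
    return getattr(exc, "code", None) == CANNOT_ASSIGN_ALTER
```

Matching on the code avoids depending on the exact wording of the server message, which can change between ClickHouse releases.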
