Enhancements to MemQ Cluster Sensor #308

Merged
merged 3 commits into from
Nov 20, 2024


yisheng-zhou (Contributor)

This PR fixes an issue where the MemQ Cluster Sensor could fail after running for an extended period. Previously, once Orion had been running against MemQ for some time, the CuratorFramework client could start throwing exceptions on every call, and engineers had to restart Orion to restore normal operation. With this change, the client is recreated automatically when a ZK call throws an exception. The new client instance is attached to the MemQ cluster class, so the sensor can keep running without a restart.

Key Improvements

  • Refreshed ZooKeeper Client:

    • The CuratorFramework client is now recreated automatically when a ZK call throws an exception, which improves reliability and reduces manual intervention. The refresh behavior can be enabled or disabled.
  • Refactored ZooKeeper Logic:

    • ZooKeeper-related logic has been consolidated into a new class, MemqZookeeperClient, which simplifies the MemqClusterSensor code.
    • Note that MemqZookeeperClient does not share ZooKeeper logic with Kafka, because the ZK paths and commands differ significantly. For that reason, it does not extend classes such as OrionZookeeperClient.
  • Graceful Handling of Node Broker Clusters:

    • The system now handles node broker clusters more gracefully, contributing to overall stability.
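
For readers unfamiliar with the refresh-on-exception pattern described above, here is a minimal sketch of the idea. It is not the code from this PR. The class name `RefreshingClientHolder`, the `ClientCall` interface, and the `refreshOnError` flag are hypothetical names chosen for illustration, and the client type is kept generic so the example runs with the JDK alone. In a real setup the factory would presumably build and start a CuratorFramework instance.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch: holds a ZooKeeper-style client and replaces it
 * with a fresh instance whenever a call against it throws.
 */
final class RefreshingClientHolder<C extends AutoCloseable> {

    /** A call against the client that may throw, e.g. a ZK read. */
    interface ClientCall<K, T> {
        T apply(K client) throws Exception;
    }

    private final Supplier<C> factory;     // builds a new client on demand
    private final boolean refreshOnError;  // the on/off switch for the refresh behavior
    private C client;

    RefreshingClientHolder(Supplier<C> factory, boolean refreshOnError) {
        this.factory = factory;
        this.refreshOnError = refreshOnError;
        this.client = factory.get();
    }

    /** Returns the client currently in use. */
    synchronized C current() {
        return client;
    }

    /** Runs the call; on failure, optionally swaps in a new client, then rethrows. */
    synchronized <T> T call(ClientCall<C, T> op) throws Exception {
        try {
            return op.apply(client);
        } catch (Exception e) {
            if (refreshOnError) {
                refresh();
            }
            throw e; // the caller still sees the failure; the next call uses the new client
        }
    }

    private void refresh() {
        C old = client;
        client = factory.get();
        try {
            old.close(); // best effort: release the broken client's resources
        } catch (Exception ignored) {
            // the old client is already considered unusable
        }
    }
}
```

In this sketch the failed call is not retried. The exception still propagates, and only the next call benefits from the new client. Whether the actual change retries immediately is not stated in this description.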

Testing

The changes have been deployed at Pinterest and have run for more than one day without any reported sensor errors, which suggests improved stability and reliability.

@yisheng-zhou requested a review from a team as a code owner November 20, 2024 22:11
@yisheng-zhou merged commit 5267c45 into master Nov 20, 2024
1 check passed
@yisheng-zhou deleted the memq_zk_client branch December 2, 2024 19:03