[Inspection Service] Add simple consensus health check endpoint. #15512

Merged
JoshLind merged 1 commit into main from in_consensus_probe on Dec 10, 2024

JoshLind (Contributor) commented Dec 5, 2024

Description

This PR adds a simple consensus health check endpoint to the node inspection service. Specifically, the new endpoint /consensus_health_check will return a 200 status code iff the node is currently executing consensus.

To achieve this, I added a new metric gauge that is set iff consensus is executing on the validator node. The inspection service simply fetches the value of this gauge. The gauge can be seen here.
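
The gauge side of this design can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `ExecutingGauge`, `set_consensus_executing`, and `is_consensus_executing` are hypothetical names, and an `AtomicI64` stands in for the Prometheus gauge that the PR registers.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for the metric gauge described above.
/// A value of 1 means consensus is executing on this node, 0 means it is not.
pub struct ExecutingGauge {
    value: AtomicI64,
}

impl ExecutingGauge {
    /// Starts at 0: consensus is not executing until something says so.
    pub const fn new() -> Self {
        Self {
            value: AtomicI64::new(0),
        }
    }

    /// Write side: called whenever the executing component changes.
    pub fn set_consensus_executing(&self, executing: bool) {
        self.value.store(i64::from(executing), Ordering::Relaxed);
    }

    /// Read side: what a health check would fetch.
    pub fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// Returns true only while the gauge reports that consensus is executing.
pub fn is_consensus_executing(gauge: &ExecutingGauge) -> bool {
    gauge.get() == 1
}
```

Keeping the write side (whoever owns execution) separate from the read side (the inspection service) means the health check never has to call into consensus directly; it only observes a value that is already being exported.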

Testing Plan

Existing test infrastructure, and manual verification, e.g., I ran a local validator and pinged the endpoint:

% curl -I http://127.0.0.1:53822/consensus_health_check
HTTP/1.1 200 OK
content-type: text/plain
date: Thu, 05 Dec 2024 20:08:48 GMT
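
A caller that consumes this endpoint programmatically only needs the status code from the response head shown above. The sketch below is an illustration under assumptions, not part of this PR: `parse_status_code` and `probe_passed` are hypothetical helpers that operate on the raw response text, and no network access is involved.

```rust
/// Extracts the numeric status code from the first line of a raw HTTP
/// response, e.g. "HTTP/1.1 200 OK" yields Some(200).
pub fn parse_status_code(response: &str) -> Option<u16> {
    let status_line = response.lines().next()?;
    let mut parts = status_line.split_whitespace();

    // The status line must begin with the protocol version.
    let version = parts.next()?;
    if !version.starts_with("HTTP/") {
        return None;
    }

    // The second token is the status code.
    parts.next()?.parse().ok()
}

/// Treats the probe as passed only when the endpoint answered with 200.
pub fn probe_passed(response: &str) -> bool {
    parse_status_code(response) == Some(200)
}
```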

trunk-io bot commented Dec 5, 2024

⏱️ 1h 54m total CI duration on this PR
| Slowest 15 Jobs | Cumulative Duration | Recent Runs |
|---|---|---|
| test-target-determinator | 17m | 🟩🟩🟩🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 12m | 🟩 |
| rust-cargo-deny | 11m | 🟩🟩🟩🟩 (+2 more) |
| check-dynamic-deps | 6m | 🟩🟩🟩🟩🟩 (+2 more) |
| rust-doc-tests | 5m | 🟩 |
| execution-performance / test-target-determinator | 4m | 🟩 |
| check | 4m | 🟩 |
| general-lints | 3m | 🟩🟩🟩🟩🟩 (+2 more) |
| semgrep/ci | 3m | 🟩🟩🟩🟩🟩 (+2 more) |
| rust-move-tests | 2m | |
| fetch-last-released-docker-image-tag | 2m | 🟩 |
| rust-move-tests | 2m | |

🚨 1 job on the last run was significantly faster/slower than expected

| Job | Duration | vs 7d avg | Delta |
|---|---|---|---|
| execution-performance / single-node-performance | 10s | 15m | -99% |


Comment on lines 37 to 39
if gauge_value == "0" {
    return (
        StatusCode::OK,
        Body::from("Consensus health check passed!"),
        CONTENT_TYPE_TEXT.into(),
    );
}

The health check logic appears to be reversed. The code returns OK when gauge_value == "0", indicating consensus is not executing. However, based on the function's documented purpose of checking if "the node is currently participating in consensus", it should return OK when gauge_value == "1". This would align with the gauge's behavior set in update_executing_component_metrics() where 1 indicates active consensus participation.

Spotted by Graphite Reviewer
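
The direction the reviewer describes, returning OK only when the gauge reads "1", could look roughly like the sketch below. It is a simplified, std-only illustration: the `Status` enum stands in for hyper's `StatusCode`, and the non-OK status and the failure message are assumptions, since the thread does not show what the handler returns on failure.

```rust
/// Simplified stand-in for an HTTP status code (the PR uses hyper's `StatusCode`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    /// Assumed failure status; the actual code used by the PR is not shown here.
    InternalServerError,
}

/// Maps the gauge's text value to a health check result.
/// `None` models the gauge being absent from the collected metrics.
pub fn consensus_health_check(gauge_value: Option<&str>) -> (Status, &'static str) {
    match gauge_value {
        // 1 means consensus is executing, so the check passes.
        Some("1") => (Status::Ok, "Consensus health check passed!"),
        // 0, a missing gauge, or any unexpected value fails the check.
        _ => (
            Status::InternalServerError,
            "Consensus health check failed!", // hypothetical message
        ),
    }
}
```

Matching on the passing value and treating everything else as a failure keeps the check conservative: an absent or malformed gauge can never be reported as healthy.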


use hyper::{Body, StatusCode};
use prometheus::TextEncoder;

// The metric key for the consensus execution gauge
const CONSENSUS_EXECUTION_GAUGE: &str = "aptos_state_sync_consensus_executing_gauge{}";

The metric name aptos_state_sync_consensus_executing_gauge{} includes empty curly braces that should be removed since this gauge doesn't use any labels. To match the registered metric name in the code below, this should be aptos_state_sync_consensus_executing_gauge.

Spotted by Graphite Reviewer
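
Whether the `{}` suffix belongs in the key depends on how the map of collected metrics formats its keys, which this thread does not show. Because the lookup is an exact string match, the two forms are not interchangeable. As a hedged illustration only, a hypothetical helper (`lookup_unlabeled_gauge`, not part of this PR) could accept either form:

```rust
use std::collections::HashMap;

/// Looks up a label-free gauge under either key form: the bare metric name,
/// or the name followed by an empty label set (`name{}`).
/// The `metrics` map is an assumed name-to-text-value representation.
pub fn lookup_unlabeled_gauge<'a>(
    metrics: &'a HashMap<String, String>,
    name: &str,
) -> Option<&'a str> {
    // Normalize the requested name by dropping a trailing "{}" if present.
    let bare = name.trim_end_matches("{}");

    metrics
        .get(bare)
        // Fall back to the form with an empty label set appended.
        .or_else(|| metrics.get(&format!("{}{{}}", bare)))
        .map(String::as_str)
}
```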


JoshLind added the CICD:run-forge-e2e-perf (Run the e2e perf forge only) label on Dec 5, 2024

const CONSENSUS_EXECUTION_GAUGE: &str = "aptos_state_sync_consensus_executing_gauge{}";

/// Handles a consensus health check request. This method returns
/// 200 iff the node is currently participating in consensus.
Contributor

Suggested change
- /// 200 iff the node is currently participating in consensus.
+ /// 200 if the node is currently participating in consensus.

Contributor Author


Aah, this was meant to be "if and only if" 😄 https://www.merriam-webster.com/dictionary/iff

Contributor


> Aah, this was meant to be "if and only if" 😄 merriam-webster.com/dictionary/iff

I recommend simply writing "if and only if" in that case. I doubt that even half of our eng team knows the "iff" abbreviation.

Contributor Author


SGTM. Changed it to if. Don't feel as strongly as you do 😄

JoshLind enabled auto-merge (rebase) on December 10, 2024 20:38

Contributor

✅ Forge suite compat success on 3c6e693a27339e73520f41030dce8fc9cd504967 ==> bf4f44805c89edf7b9d4a15f23b93f758bf19a53

Compatibility test results for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> bf4f44805c89edf7b9d4a15f23b93f758bf19a53 (PR)
1. Check liveness of validators at old version: 3c6e693a27339e73520f41030dce8fc9cd504967
compatibility::simple-validator-upgrade::liveness-check : committed: 16660.58 txn/s, latency: 1953.78 ms, (p50: 1600 ms, p70: 1800, p90: 2200 ms, p99: 11200 ms), latency samples: 566100
2. Upgrading first Validator to new version: bf4f44805c89edf7b9d4a15f23b93f758bf19a53
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7502.90 txn/s, latency: 3816.06 ms, (p50: 4300 ms, p70: 4500, p90: 4600 ms, p99: 4800 ms), latency samples: 142020
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7496.33 txn/s, latency: 4317.63 ms, (p50: 4600 ms, p70: 4600, p90: 4700 ms, p99: 4800 ms), latency samples: 257880
3. Upgrading rest of first batch to new version: bf4f44805c89edf7b9d4a15f23b93f758bf19a53
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6507.21 txn/s, latency: 4428.73 ms, (p50: 5000 ms, p70: 5400, p90: 5600 ms, p99: 5700 ms), latency samples: 116660
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6464.62 txn/s, latency: 5108.34 ms, (p50: 5500 ms, p70: 5600, p90: 5800 ms, p99: 6000 ms), latency samples: 224200
4. upgrading second batch to new version: bf4f44805c89edf7b9d4a15f23b93f758bf19a53
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 13131.64 txn/s, latency: 2106.16 ms, (p50: 2300 ms, p70: 2400, p90: 2500 ms, p99: 2600 ms), latency samples: 224300
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 13712.55 txn/s, latency: 2311.72 ms, (p50: 2300 ms, p70: 2500, p90: 2600 ms, p99: 3000 ms), latency samples: 441500
5. check swarm health
Compatibility test for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> bf4f44805c89edf7b9d4a15f23b93f758bf19a53 passed
Test Ok

Contributor

✅ Forge suite realistic_env_max_load success on bf4f44805c89edf7b9d4a15f23b93f758bf19a53

two traffics test: inner traffic : committed: 14464.68 txn/s, submitted: 14464.73 txn/s, expired: 0.05 txn/s, latency: 2744.92 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3900 ms), latency samples: 5499840
two traffics test : committed: 100.07 txn/s, latency: 1484.13 ms, (p50: 1400 ms, p70: 1500, p90: 1600 ms, p99: 5500 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.676, avg: 1.522", "ConsensusProposalToOrdered: max: 0.335, avg: 0.296", "ConsensusOrderedToCommit: max: 0.379, avg: 0.367", "ConsensusProposalToCommit: max: 0.676, avg: 0.664"]
Max non-epoch-change gap was: 1 rounds at version 31209 (avg 0.00) [limit 4], 2.02s no progress at version 31209 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.66s no progress at version 2609004 (avg 0.66s) [limit 16].
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 3c6e693a27339e73520f41030dce8fc9cd504967 ==> bf4f44805c89edf7b9d4a15f23b93f758bf19a53

Compatibility test results for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> bf4f44805c89edf7b9d4a15f23b93f758bf19a53 (PR)
Upgrade the nodes to version: bf4f44805c89edf7b9d4a15f23b93f758bf19a53
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1491.36 txn/s, submitted: 1495.16 txn/s, failed submission: 3.81 txn/s, expired: 3.81 txn/s, latency: 1978.16 ms, (p50: 2100 ms, p70: 2100, p90: 2400 ms, p99: 3500 ms), latency samples: 133260
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1607.69 txn/s, submitted: 1612.45 txn/s, failed submission: 4.76 txn/s, expired: 4.76 txn/s, latency: 1873.08 ms, (p50: 1800 ms, p70: 2100, p90: 2400 ms, p99: 3600 ms), latency samples: 141780
5. check swarm health
Compatibility test for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> bf4f44805c89edf7b9d4a15f23b93f758bf19a53 passed
Upgrade the remaining nodes to version: bf4f44805c89edf7b9d4a15f23b93f758bf19a53
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1442.74 txn/s, submitted: 1446.03 txn/s, failed submission: 3.29 txn/s, expired: 3.29 txn/s, latency: 2024.76 ms, (p50: 2100 ms, p70: 2100, p90: 2700 ms, p99: 4200 ms), latency samples: 131640
Test Ok

JoshLind merged commit 1d194b8 into main on Dec 10, 2024
92 checks passed
JoshLind deleted the in_consensus_probe branch on December 10, 2024 21:06
Contributor

💚 All backports created successfully

Status Branch Result
aptos-release-v1.25

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

Labels
CICD:run-forge-e2e-perf Run the e2e perf forge only v1.25