[Consensus Observer] Increase safety catch thresholds. #15460

Closed
wants to merge 1 commit into from

Conversation

JoshLind
Contributor

@JoshLind JoshLind commented Dec 3, 2024

Description

How Has This Been Tested?

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation


trunk-io bot commented Dec 3, 2024

@JoshLind JoshLind added the CICD:run-e2e-tests label Dec 3, 2024

Contributor

github-actions bot commented Dec 3, 2024

✅ Forge suite realistic_env_max_load success on 397db3118acac846ad877d1afcd36e6f5120c9e5

two traffics test: inner traffic : committed: 14079.94 txn/s, latency: 2821.86 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3200 ms), latency samples: 5353580
two traffics test : committed: 100.02 txn/s, latency: 1945.33 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 17700 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.560, avg: 1.699", "ConsensusProposalToOrdered: max: 0.321, avg: 0.294", "ConsensusOrderedToCommit: max: 0.314, avg: 0.304", "ConsensusProposalToCommit: max: 0.610, avg: 0.598"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.45s no progress at version 29029 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 15.47s no progress at version 1987950 (avg 15.05s) [limit 16].
Test Ok
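
The two "Max ... gap" lines above compare an observed round gap and an observed no-progress duration against bracketed limits. A minimal sketch of that kind of threshold check follows, assuming an inclusive comparison; the `GapReport` type and its field names are hypothetical and are not taken from the Forge source.

```rust
/// Hypothetical record of one "Max ... gap" summary line.
/// Field names are illustrative, not Forge's own.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct GapReport {
    /// Largest observed gap between rounds.
    pub max_round_gap: u64,
    /// Bracketed limit for the round gap.
    pub round_gap_limit: u64,
    /// Longest observed stretch without progress, in seconds.
    pub max_no_progress_secs: f64,
    /// Bracketed limit for the no-progress stretch, in seconds.
    pub no_progress_limit_secs: f64,
}

impl GapReport {
    /// Returns true when both observed values are at or below their limits.
    /// Assumption: the limits are inclusive.
    pub fn within_limits(&self) -> bool {
        self.max_round_gap <= self.round_gap_limit
            && self.max_no_progress_secs <= self.no_progress_limit_secs
    }
}
```

With the values reported above, the epoch-change line (15.47s against a limit of 16) passes this check only narrowly, while the non-epoch-change line (1.45s against a limit of 15) has ample margin.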

Contributor

github-actions bot commented Dec 3, 2024

❌ Forge suite compat failure on 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> 397db3118acac846ad877d1afcd36e6f5120c9e5

Compatibility test results for 010570d3b7aa20889fb5ad0e5b23800aa33f5634 ==> 397db3118acac846ad877d1afcd36e6f5120c9e5 (PR)
1. Check liveness of validators at old version: 010570d3b7aa20889fb5ad0e5b23800aa33f5634
compatibility::simple-validator-upgrade::liveness-check : committed: 5218.32 txn/s, submitted: 5329.92 txn/s, expired: 111.60 txn/s, latency: 3162.14 ms, (p50: 1800 ms, p70: 1900, p90: 2700 ms, p99: 32400 ms), latency samples: 464330
2. Upgrading first Validator to new version: 397db3118acac846ad877d1afcd36e6f5120c9e5
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6898.22 txn/s, latency: 4104.56 ms, (p50: 4600 ms, p70: 4900, p90: 5100 ms, p99: 5200 ms), latency samples: 128920
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7039.85 txn/s, latency: 4604.02 ms, (p50: 4900 ms, p70: 5000, p90: 6100 ms, p99: 6400 ms), latency samples: 241660
3. Upgrading rest of first batch to new version: 397db3118acac846ad877d1afcd36e6f5120c9e5
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7094.16 txn/s, latency: 4021.85 ms, (p50: 4500 ms, p70: 4800, p90: 4900 ms, p99: 5000 ms), latency samples: 133060
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7174.94 txn/s, latency: 4542.64 ms, (p50: 5000 ms, p70: 5100, p90: 5200 ms, p99: 5400 ms), latency samples: 241200
4. upgrading second batch to new version: 397db3118acac846ad877d1afcd36e6f5120c9e5
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10905.17 txn/s, latency: 2398.78 ms, (p50: 2600 ms, p70: 2800, p90: 3100 ms, p99: 3200 ms), latency samples: 196260
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9663.24 txn/s, latency: 3185.16 ms, (p50: 2800 ms, p70: 3400, p90: 4900 ms, p99: 5800 ms), latency samples: 342880
5. check swarm health
Test Failed: Waiting for nodes to catch up to target version and epoch (None, Some(23)) timed out after 60 seconds, current status: Ok([("validator-0", 1771388, 22), ("validator-1", 1771388, 22), ("validator-2", 1771388, 22), ("validator-3", 1771388, 22)])

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.89/src/error.rs:85:36
   1: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup_to_target_version_or_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:463:24
   2: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup_to_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:396:6
   3: aptos_forge::interface::swarm::SwarmExt::wait_for_all_nodes_to_change_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:289:92
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/future/future.rs:123:9
   5: aptos_forge::interface::swarm::SwarmExt::fork_check::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:200:14
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/future/future.rs:123:9
   7: <aptos_testcases::compatibility_test::SimpleValidatorUpgrade as aptos_forge::interface::network::NetworkTest>::run::{{closure}}
             at ./testsuite/testcases/src/compatibility_test.rs:325:63
   8: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/future/future.rs:123:9
   9: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/park.rs:281:63
  10: tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:107:5
  11: tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:73:5
  12: tokio::runtime::park::CachedParkThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/park.rs:281:31
  13: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/blocking.rs:66:9
  14: tokio::runtime::handle::Handle::block_on_inner::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/handle.rs:324:22
  15: tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/runtime.rs:65:16
  16: tokio::runtime::handle::Handle::block_on_inner
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/handle.rs:323:9
  17: tokio::runtime::handle::Handle::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/handle.rs:302:18
  18: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:332:50
  19: forge::run_forge_with_changelog
             at ./testsuite/forge-cli/src/main.rs:426:24
  20: forge::main
             at ./testsuite/forge-cli/src/main.rs:329:21
  21: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  22: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  23: std::rt::lang_start::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
  24: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
  25: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  26: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  27: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  28: std::rt::lang_start_internal::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  29: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  30: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  31: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  32: std::rt::lang_start_internal
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  33: main
  34: __libc_start_main
  35: _start

=== BEGIN JUNIT ===
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="forge" tests="1" failures="1" errors="0" uuid="b9fc0480-e9d4-47ab-a1b2-a5154fe4e4a0">
    <testsuite name="local" tests="1" disabled="0" errors="0" failures="1">
        <testcase name="compatibility::simple-validator-upgrade">
            <failure message="Waiting for nodes to catch up to target version and epoch (None, Some(23)) timed out after 60 seconds, current status: Ok([(&quot;validator-0&quot;, 1771388, 22), (&quot;validator-1&quot;, 1771388, 22), (&quot;validator-2&quot;, 1771388, 22), (&quot;validator-3&quot;, 1771388, 22)])

Stack backtrace:
   0: anyhow::error::&lt;impl anyhow::Error&gt;::msg
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.89/src/error.rs:85:36
   1: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup_to_target_version_or_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:463:24
   2: aptos_forge::interface::swarm::wait_for_all_nodes_to_catchup_to_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:396:6
   3: aptos_forge::interface::swarm::SwarmExt::wait_for_all_nodes_to_change_epoch::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:289:92
   4: &lt;core::pin::Pin&lt;P&gt; as core::future::future::Future&gt;::poll
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/future/future.rs:123:9
   5: aptos_forge::interface::swarm::SwarmExt::fork_check::{{closure}}
             at ./testsuite/forge/src/interface/swarm.rs:200:14
   6: &lt;core::pin::Pin&lt;P&gt; as core::future::future::Future&gt;::poll
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/future/future.rs:123:9
   7: &lt;aptos_testcases::compatibility_test::SimpleValidatorUpgrade as aptos_forge::interface::network::NetworkTest&gt;::run::{{closure}}
             at ./testsuite/testcases/src/compatibility_test.rs:325:63
   8: &lt;core::pin::Pin&lt;P&gt; as core::future::future::Future&gt;::poll
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/future/future.rs:123:9
   9: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/park.rs:281:63
  10: tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:107:5
  11: tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:73:5
  12: tokio::runtime::park::CachedParkThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/park.rs:281:31
  13: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/blocking.rs:66:9
  14: tokio::runtime::handle::Handle::block_on_inner::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/handle.rs:324:22
  15: tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/runtime.rs:65:16
  16: tokio::runtime::handle::Handle::block_on_inner
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/handle.rs:323:9
  17: tokio::runtime::handle::Handle::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/handle.rs:302:18
  18: aptos_forge::runner::Forge&lt;F&gt;::run
             at ./testsuite/forge/src/runner.rs:332:50
  19: forge::run_forge_with_changelog
             at ./testsuite/forge-cli/src/main.rs:426:24
  20: forge::main
             at ./testsuite/forge-cli/src/main.rs:329:21
  21: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  22: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  23: std::rt::lang_start::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
  24: core::ops::function::impls::&lt;impl core::ops::function::FnOnce&lt;A&gt; for &amp;F&gt;::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
  25: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  26: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  27: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  28: std::rt::lang_start_internal::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  29: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  30: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  31: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  32: std::rt::lang_start_internal
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  33: main
  34: __libc_start_main
  35: _start"/>
        </testcase>
    </testsuite>
</testsuites>
=== END JUNIT ===

Swarm logs can be found here: See fgi output for more information.
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:381"},"thread_name":"main","hostname":"forge-compat-pr-15460-1733238989-010570d3b7aa20889fb5ad0e5b2380","timestamp":"2024-12-03T15:35:27.481296Z","message":"Deleting namespace forge-compat-pr-15460: Some(NamespaceStatus { conditions: None, phase: Some(\"Terminating\") })"}
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:398"},"thread_name":"main","hostname":"forge-compat-pr-15460-1733238989-010570d3b7aa20889fb5ad0e5b2380","timestamp":"2024-12-03T15:35:27.481332Z","message":"aptos-node resources for Forge removed in namespace: forge-compat-pr-15460"}

failures:
Failed to run tests:
Tests Failed
    compatibility::simple-validator-upgrade

test result: FAILED. 0 passed; 1 failed; 0 filtered out

Error: Tests Failed

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.89/src/error.rs:85:36
   1: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:358:13
   2: forge::run_forge_with_changelog
             at ./testsuite/forge-cli/src/main.rs:426:24
   3: forge::main
             at ./testsuite/forge-cli/src/main.rs:329:21
   4: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   5: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
   6: std::rt::lang_start::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
   7: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
   8: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
   9: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  10: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  11: std::rt::lang_start_internal::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  12: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  13: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  14: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  15: std::rt::lang_start_internal
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  16: main
  17: __libc_start_main
  18: _start
Debugging output:
NAME                                  READY   STATUS      RESTARTS   AGE
aptos-node-0-validator-0              1/1     Running     0          9m50s
aptos-node-1-validator-0              1/1     Running     0          13m
aptos-node-2-validator-0              1/1     Running     0          6m6s
aptos-node-3-validator-0              1/1     Running     0          4m58s
forge-testnet-deployer-8ncm2          0/1     Completed   0          18m
genesis-aptos-genesis-eforge9-qt5gq   0/1     Completed   0          17m
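
The failure above comes from a wait that polls every node until all of them reach a target epoch, and gives up after a timeout (here, epoch 23 and 60 seconds, with all four validators still at epoch 22). The pattern can be sketched as below. This is a synchronous, standard-library-only illustration, not the Forge implementation, which per the backtrace is async and runs on tokio; `wait_for_epoch`, `NodeStatus`, and the `fetch_status` callback are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// (node name, version, epoch), mirroring the tuple shape in the failure message.
pub type NodeStatus = (String, u64, u64);

/// Polls `fetch_status` until every node reports an epoch at or above
/// `target_epoch`, or until `timeout` has elapsed.
/// Hypothetical helper for illustration only.
pub fn wait_for_epoch<F>(
    mut fetch_status: F,
    target_epoch: u64,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<Vec<NodeStatus>, String>
where
    F: FnMut() -> Vec<NodeStatus>,
{
    let start = Instant::now();
    loop {
        let status = fetch_status();
        // Succeed only when at least one node is present and all have caught up.
        if !status.is_empty() && status.iter().all(|(_, _, epoch)| *epoch >= target_epoch) {
            return Ok(status);
        }
        // Report the last observed status on timeout, as the log above does.
        if start.elapsed() >= timeout {
            return Err(format!(
                "timed out after {:?} waiting for epoch {}, current status: {:?}",
                timeout, target_epoch, status
            ));
        }
        std::thread::sleep(poll_interval);
    }
}
```

Including the last observed status in the error is what makes a timeout like this one diagnosable: it shows that all validators agree on version and epoch and have simply not advanced, rather than having diverged.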

@JoshLind JoshLind closed this Dec 4, 2024
Labels
CICD:run-e2e-tests: when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR