Post mortem of mainnet outage on 2023-10-12
At around 12:23 UTC, mainnet went down due to a bug in the node software during the upgrade from version 1.5.2 to 1.5.3. Code that had migrated data in the global state database during the 1.5.2 upgrade ran again during the 1.5.3 upgrade and, as planned, applied no changes. However, a final sanity check meant to ensure that changes were made fired erroneously: it detected that nothing had changed, treated this as an error condition, and shut down the node as a safety measure.
This migration happens before nodes connect to the network, resulting in a crash-restart loop for every node running 1.5.3.
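To illustrate the failure mode, here is a minimal Python sketch. It is not the node's actual code: the state layout and the names `migrate`, `upgrade_strict`, `upgrade_tolerant`, and `MigrationError` are all hypothetical. It shows an idempotent migration paired with a check that wrongly treats "no changes" as an error, next to a variant that checks the resulting state instead.

```python
class MigrationError(Exception):
    """Raised when a post-migration check fails (hypothetical name)."""


def migrate(state: dict) -> int:
    """Move entries from a legacy key to a new key; return how many were moved.

    Idempotent: on already-migrated state there is nothing to move, so it returns 0.
    """
    legacy = state.pop("legacy_entries", [])
    state.setdefault("entries", []).extend(legacy)
    return len(legacy)


def upgrade_strict(state: dict) -> None:
    """Faulty variant: treats 'no changes were made' as an error."""
    if migrate(state) == 0:
        raise MigrationError("migration made no changes")


def upgrade_tolerant(state: dict) -> None:
    """Alternative: checks the post-condition rather than the change count."""
    migrate(state)
    if "legacy_entries" in state:
        raise MigrationError("legacy entries still present after migration")


state = {"legacy_entries": ["a", "b"]}
upgrade_strict(state)  # first upgrade: two entries moved, check passes

try:
    upgrade_strict(state)  # second upgrade: nothing left to move
    second_run_failed = False
except MigrationError:
    # A node failing here before it connects to the network would
    # hit the same error on every restart.
    second_run_failed = True

upgrade_tolerant(state)  # same state: the post-condition holds, no error
```

In this sketch the second call to `upgrade_strict` fails even though the state is exactly as intended, which mirrors the wrongly perceived error condition described above.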
The rolled out fix
We rolled out a new, patched 1.5.3 binary with the unnecessary migration code disabled entirely. Node operators should replace their existing binary to fix their node. The node will then restart on its own and resume normal operation.
Official binaries have also been updated, so anyone installing version 1.5.3 from this point forward will receive a version with the fix applied.
Why was this issue not caught in testnet?
All code ran on and passed testnet before going live. However, the state that triggers and fails the check did not occur on testnet naturally, as it requires a validator to unbond across an upgrade point. This scenario was not added for the 1.5.3 upgrade.
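One way to cover states that do not arise naturally is to run the upgrade step against hand-built state fixtures, including state that has already been migrated. The Python sketch below is an illustration only: `upgrade`, `FIXTURES`, and `check_upgrade_twice` are hypothetical names, and the fixture contents are invented and do not reflect the real global state layout.

```python
def upgrade(state: dict) -> dict:
    """Hypothetical upgrade step: returns a migrated copy of `state`.

    Stand-in for real migration logic; it renames one key and is idempotent.
    """
    migrated = dict(state)
    if "old_unbonding" in migrated:
        migrated["unbonding"] = migrated.pop("old_unbonding")
    return migrated


# Invented fixtures for illustration; each one is a starting state to test.
FIXTURES = {
    "fresh_network": {},
    "needs_migration": {"old_unbonding": ["validator-1"]},
    "already_migrated": {"unbonding": ["validator-1"]},
}


def check_upgrade_twice(fixtures: dict) -> list:
    """Run the upgrade twice per fixture.

    Returns the names of fixtures where a run raises or where the
    second run changes the result of the first.
    """
    failures = []
    for name, state in fixtures.items():
        try:
            once = upgrade(state)
            twice = upgrade(once)
        except Exception:
            failures.append(name)
            continue
        if once != twice:
            failures.append(name)
    return failures


failures = check_upgrade_twice(FIXTURES)
```

A fixture that models a validator unbonding across an upgrade point would be added to the same table, so the scenario is exercised on every upgrade instead of depending on it occurring on testnet.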
Measures taken to prevent these issues in the future
We strive to write idempotent migration code, and until now we have removed it only in a later release to keep changesets in patch-level versions small. From this point onward, this policy changes: migration code artifacts will be removed in the following version.