Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set default NB/SB election timer to 16s #81

Merged
merged 1 commit into from
Oct 20, 2023

Conversation

mkalcok
Copy link
Contributor

@mkalcok mkalcok commented Oct 17, 2023

Note: This change will affect only new deployments.

@mkalcok mkalcok requested a review from fnordahl October 17, 2023 08:58
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mkalcok
Copy link
Contributor Author

mkalcok commented Oct 17, 2023

TLDR
CI failed because increased election timer along with the unfortunate fact that we restarted OVN NB/SB services on scaling events caused clustered DB to be unavailable for a relatively long time. We decided to get rid of the service restart and it turned out to be pretty smooth #82

Why was DB unavailable for a long time?
As can be seen in Figure 4 in Raft's Whitepaper. After startup, cluster member goes into Follower state until a leader fails to send heartbeat within the election timer. Since we restarted every cluster member when new member was added, they all went into Follower state without there being a single Leader and stayed in this state for 16s (new election timer). After the timer expired a new leader was elected, but during the time that there was no leader, commands like microovn.ovn-nbctl would fail as they (by default) require talking to the cluster leader.

@fnordahl
Copy link
Member

@mkalcok I tested a version of this rebased onto #83, and tests are now passing, so I'd love to get an updated version!

@mkalcok
Copy link
Contributor Author

mkalcok commented Oct 20, 2023

rebasing now

Note: This change will affect only new deployments.
Signed-off-by: Martin Kalcok <[email protected]>
Copy link
Member

@fnordahl fnordahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@@ -10,3 +10,6 @@ export OVN_RUNDIR="${SNAP_COMMON}/run/ovn"
export OVN_PKGDATADIR="${SNAP}/share/ovn"
export OVN_SYSCONFDIR="${SNAP}/etc"
export OVN_PKIDIR="${SNAP_COMMON}/data/pki"

# Set more lenient election timer for NB/SB clusters (16s)
export ELECTION_TIMER=16000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline at end of file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

@fnordahl fnordahl merged commit 81247df into canonical:main Oct 20, 2023
15 checks passed
@tomponline
Copy link
Member

@mkalcok can you let us know when this lands in latest/stable please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants