fixed bugs with multiple Members with same Address crashing ClusterDaemon #7371

Merged
77 changes: 40 additions & 37 deletions src/core/Akka.Cluster/ClusterDaemon.cs
@@ -1635,22 +1635,22 @@ public void Welcome(Address joinWith, UniqueAddress from, Gossip gossip)
public void Leaving(Address address)
{
// only try to update if the node is available (in the member ring)
if (LatestGossip.Members.Any(m => m.Address.Equals(address) && m.Status is MemberStatus.Joining or MemberStatus.WeaklyUp or MemberStatus.Up))
foreach(var mem in LatestGossip.Members.Where(m => m.Address.Equals(address)))
TL;DR: iterate over each Member with the same Address and check them one at a time. The previous code checked whether any member matched this condition and then tried to state-transition all members with that Address at once, which is what produces the `IllegalOperation: invalid state transition` errors.
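To illustrate the difference outside of Akka.NET, here is a minimal sketch. `NodeStatus`, `Node`, and `LeavingSketch` are hypothetical stand-ins, not the real `Member` or `ClusterDaemon` types, and the sketch silently rewrites statuses where, per the comment above, the real code raises an invalid state transition error.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the cluster types; not the Akka.NET API.
public enum NodeStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down }

public sealed record Node(string Address, int Uid, NodeStatus Status);

public static class LeavingSketch
{
    // Assumption for this sketch: only these states may move to Leaving.
    private static bool CanLeave(NodeStatus s) =>
        s is NodeStatus.Joining or NodeStatus.WeaklyUp or NodeStatus.Up;

    // Old shape: one eligible match is enough to rewrite EVERY node
    // with that address, including ones that are not eligible.
    public static List<Node> MarkLeavingAll(IEnumerable<Node> nodes, string address)
    {
        var list = nodes.ToList();
        if (!list.Any(n => n.Address == address && CanLeave(n.Status)))
            return list;
        return list
            .Select(n => n.Address == address ? n with { Status = NodeStatus.Leaving } : n)
            .ToList();
    }

    // New shape: each node with that address is checked on its own,
    // so ineligible nodes are left untouched.
    public static List<Node> MarkLeavingEach(IEnumerable<Node> nodes, string address) =>
        nodes
            .Select(n => n.Address == address && CanLeave(n.Status)
                ? n with { Status = NodeStatus.Leaving }
                : n)
            .ToList();
}
```

With two nodes sharing an address, one `Down` and one `Up`, `MarkLeavingAll` attempts to move both to `Leaving`, while `MarkLeavingEach` only touches the `Up` one.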

{
// mark node as LEAVING
var newMembers = LatestGossip.Members.Select(m =>
if (mem.Status is MemberStatus.Joining or MemberStatus.WeaklyUp or MemberStatus.Up)
{
if (m.Address == address) return m.Copy(status: MemberStatus.Leaving);
return m;
}).ToImmutableSortedSet(); // mark node as LEAVING
var newGossip = LatestGossip.Copy(members: newMembers);

UpdateLatestGossip(newGossip);
// mark node as LEAVING
var newMembers = LatestGossip.Members
.Remove(mem).Add(mem.Copy(status: MemberStatus.Leaving));
var newGossip = LatestGossip.Copy(members: newMembers);
UpdateLatestGossip(newGossip);

_cluster.LogInfo("Marked address [{0}] as [{1}]", address, MemberStatus.Leaving);
PublishMembershipState();
// immediate gossip to speed up the leaving process
SendGossip();
_cluster.LogInfo("Marked address [{0}] as [{1}]", address, MemberStatus.Leaving);
PublishMembershipState();
// immediate gossip to speed up the leaving process
SendGossip();
}
}
}

@@ -1674,40 +1674,43 @@ public void Downing(Address address)
var localGossip = LatestGossip;
var localMembers = localGossip.Members;
var localOverview = localGossip.Overview;
var localSeen = localOverview.Seen;
var localReachability = _membershipState.DcReachability;

// check if the node to DOWN is in the 'members' set
var member = localMembers.FirstOrDefault(m => m.Address == address);
if (member != null && member.Status != MemberStatus.Down)
var found = false;
foreach (var member in localMembers.Where(m => m.Address == address))
Found the same type of issue with the Downing function too and made similar fixes there.
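As a rough sketch of the shape used in `Downing`: loop over every member with the address, record that at least one was seen, and only transition the ones that are not already down. `NodeStatus`, `ClusterNode`, and `DowningSketch` are hypothetical names for illustration, and logging and gossip are left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the cluster types; not the Akka.NET API.
public enum NodeStatus { Up, Leaving, Down }

public sealed record ClusterNode(string Address, int Uid, NodeStatus Status);

public static class DowningSketch
{
    // Returns the updated node list; 'found' reports whether the
    // address was known at all, so the caller can log an unknown node.
    public static List<ClusterNode> MarkDown(
        IReadOnlyList<ClusterNode> nodes, string address, out bool found)
    {
        found = false;
        var result = nodes.ToList();
        foreach (var node in nodes.Where(n => n.Address == address))
        {
            found = true;                       // at least one match exists
            if (node.Status == NodeStatus.Down)
                continue;                       // already down, nothing to do
            var index = result.IndexOf(node);   // records compare by value
            result[index] = node with { Status = NodeStatus.Down };
        }
        return result;
    }
}
```

The flag replaces the old three-way `if / else if / else`: a single `FirstOrDefault` could only ever report on one of the matching members.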

{
if (localReachability.IsReachable(member.UniqueAddress))
_cluster.LogInfo("Marking node [{0}] as [{1}]", member.Address, MemberStatus.Down);
else
_cluster.LogInfo("Marking unreachable node [{0}] as [{1}]", member.Address, MemberStatus.Down);
found = true;
if (member.Status != MemberStatus.Down)
{
if (localReachability.IsReachable(member.UniqueAddress))
_cluster.LogInfo("Marking node [{0}] as [{1}]", member.Address, MemberStatus.Down);
else
_cluster.LogInfo("Marking unreachable node [{0}] as [{1}]", member.Address, MemberStatus.Down);


var newGossip = localGossip.MarkAsDown(member); //update gossip
UpdateLatestGossip(newGossip);
var newGossip = localGossip.MarkAsDown(member); //update gossip
UpdateLatestGossip(newGossip);

PublishMembershipState();
PublishMembershipState();

if (address == _cluster.SelfAddress)
{
// spread the word quickly, without waiting for next gossip tick
SendGossipRandom(MaxGossipsBeforeShuttingDownMyself);
}
else
{
// try to gossip immediately to downed node, as a STONITH signal
GossipTo(member.UniqueAddress);
if (address == _cluster.SelfAddress)
{
// spread the word quickly, without waiting for next gossip tick
SendGossipRandom(MaxGossipsBeforeShuttingDownMyself);
}
else
{
// try to gossip immediately to downed node, as a STONITH signal
GossipTo(member.UniqueAddress);
}
}

// if the previous statement did not evaluate to true, then this node is already being downed

}
else if (member != null)
{
// already down
}
else

if (!found)
{
_cluster.LogInfo("Ignoring down of unknown node [{0}]", address);
}