Hi Team, we need to identify the time spent in each state by our Coyote state machines. We want to log this periodically, so that if a state machine stays in the same state for more than a pre-defined time, we know about it. This would let us detect high-latency states in real time. Our current implementation logs the duration after the state

Does the Coyote framework provide a way to efficiently log the time-in-state periodically? Could we use the liveness monitor to achieve this? We thought of the following two implementations and would appreciate your feedback on them.
Thanks
Hi @qkim0x01, it seems that something is missing in your description here **. I am afraid that none of the 3 approaches will work for what you have in mind (if I understood correctly without the missing bits):

(1) Yeah, this has the con that you mentioned: each ActorTimer is pretty much a wrapper over System.Threading.Timer, so creating tens of thousands of them and leaving them running might put a lot of pressure on your app. I have not evaluated the performance of this in the latest versions of .NET, so it's worth trying out a scale experiment, but I think it will probably not scale very well, as you are doubling the threads inside the process (unless .NET has done something very efficient to handle thousands of periodic timers).

(2) Using a single state machine will become a bottleneck very fast: all your other actors will race to send an event to a single inbox and will have to grab the inbox lock to do so, creating huge congestion that will really hurt perf and will not scale, given you say tens of thousands of actors. Perhaps one approach would be to use a pool of such state machines and load-balance your tens of thousands of actors across it. But this will still have a performance bottleneck (depending on how frequently you do state transitions), so it would need some experimentation if you went down this route.

(3) About using the liveness monitor: this is sadly only available during systematic testing (not in production), and it's geared towards finding liveness bugs, i.e., cases where the program is unable to make progress and is stuck forever. (Even during testing this would not help with performance metrics, as Coyote sequentializes the execution to explore interleavings, so real time becomes abstracted away.)

Coyote does not provide anything built-in for measuring this, but it does provide a logging infrastructure that is highly customizable, with callbacks you can override for every event happening in an actor (including state changes).
Most folks I have worked with have overridden IActorRuntimeLog to add debug logs and gather various metadata like this. But I think you were trying to say in your unfinished sentence ** that you are already logging this, and you want something that notifies you before the next state transition, once N time has passed? That log indeed only happens after the fact (after moving out of the state), so it won't work.

Thinking aloud about a potential solution: what about inverting the problem and, instead of pushing updates from state machines to a background monitoring entity, polling/scanning for updates? Here is an idea: you could implement a new IActorRuntimeLog that maintains a concurrent dictionary *** mapping ActorId.Id (ulong) to the timestamp taken the last time it called OnStateTransition(id, stateName, true) (check for a true value for entering a state; false means exiting). (The Coyote logging infra allows you to attach as many different types of logs as you want, each doing different things, so this can play along with your existing logger.) If a machine transitions to a new state you reset its timestamp, and if a machine halts you remove it from the map.

The log itself can be thought of as a latency monitor: it can spawn and manage a long-running background task (or timer) that periodically scans the data structure, which the state machines update via their logging callbacks. You probably don't need full accuracy here, so you can just read the latest snapshot of the map, even while other machines update it in the background, without creating a bottleneck. If you identify a machine being "stuck" in the same state for too long (based on your definition of "stuck"), pointing to some latency issue, you can take whatever action you care about. You could even spawn N scanning tasks if you wanted to make this faster.
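To make the idea above concrete, here is a minimal sketch of such a log. It subclasses Coyote's ActorRuntimeLogTextFormatter (which provides virtual implementations of all the IActorRuntimeLog callbacks, so you only override the ones you need); the class name `StateLatencyLog` and the `Entries` field are my own illustrative names. It keys the map on ActorId itself rather than its underlying ulong, just to keep the sketch self-contained:

```csharp
using System;
using System.Collections.Concurrent;
using Microsoft.Coyote.Actors;

// Illustrative latency-monitoring log; the class and member names are hypothetical.
public class StateLatencyLog : ActorRuntimeLogTextFormatter
{
    // Maps each actor to its current state and the time it entered that state.
    public readonly ConcurrentDictionary<ActorId, (string State, DateTime EnteredAt)> Entries =
        new ConcurrentDictionary<ActorId, (string State, DateTime EnteredAt)>();

    public override void OnStateTransition(ActorId id, string stateName, bool isEntry)
    {
        // true means the machine is entering the state; reset its timestamp.
        if (isEntry)
        {
            this.Entries[id] = (stateName, DateTime.UtcNow);
        }
    }

    public override void OnHalt(ActorId id, int inboxSize)
    {
        // A halted machine no longer needs monitoring; drop it from the map.
        this.Entries.TryRemove(id, out _);
    }
}
```

Because this log keeps its own state and writes nothing, it can be registered alongside your existing text logger without interfering with it.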
The benefit is that there is no per-actor timer as in (1), which is expensive, and no bottleneck on the queue of a single actor or a pool of actors as in (2). Would something like this work?

*** It is important to be thread safe and to avoid congestion on writes, which a concurrent dictionary gives you since it locks per hash bucket; perhaps other custom data structures could make this even more efficient.

You might know this already, but it is worth noting that each actor in Coyote (once you remove all the "fluff") is just a C# System.Threading.Tasks.Task that drains the actor's inbox and invokes its handlers. When the inbox becomes empty, the actor is deactivated: the task draining its inbox completes, and a fresh task is created when the inbox receives a new event. I am saying this because you need to be careful to distinguish actors that genuinely got deactivated and stay in one state for a long time because there is no work to do from actors that are stuck in a high-latency operation. (But if you try the approach I mentioned above, you know the state of each machine through the OnStateTransition callback, so you can filter those cases out.)

Hope this helps? Let me know if you have more thoughts, and I'm happy to discuss more.
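And a sketch of wiring it up: registering a custom log of the kind described above (here assumed to be a `StateLatencyLog` class exposing an `Entries` map, both hypothetical names) and running a single background scanning task. The threshold and scan period are arbitrary placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Coyote.Actors;

// Create the production actor runtime and attach the custom log
// alongside any other registered logs.
IActorRuntime runtime = RuntimeFactory.Create();
var latencyLog = new StateLatencyLog(); // hypothetical log from the idea above
runtime.RegisterLog(latencyLog);

TimeSpan threshold = TimeSpan.FromSeconds(30); // your definition of "stuck"
_ = Task.Run(async () =>
{
    while (true)
    {
        // A relaxed snapshot is fine here; writers keep updating the
        // concurrent dictionary in the background without blocking us.
        foreach (var entry in latencyLog.Entries)
        {
            TimeSpan inState = DateTime.UtcNow - entry.Value.EnteredAt;
            if (inState > threshold)
            {
                Console.WriteLine(
                    $"Actor {entry.Key} has been in state '{entry.Value.State}' " +
                    $"for {inState.TotalSeconds:F0}s.");
            }
        }
        await Task.Delay(TimeSpan.FromSeconds(5)); // scan period
    }
});
```

If one scanner cannot keep up at your scale, this is where you could partition the map and run N scanning tasks instead.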