Problem: Pipelines remaining idle even when there are many packages queued #618
Comments
If I understand this correctly, this issue arose from the changes introduced in #616, which effectively add a global queue that processes tasks in FIFO order without considering the availability of individual pipelines. If I remember correctly, we had to do this to address scalability challenges when running Temporal and Enduro workers in constrained environments, where a high number of concurrent sessions was causing timeout errors in Temporal. Users can mitigate the negative impact of the global queue by adjusting
An alternative worth exploring is to move the existing pipeline-specific queues outside the scope of worker sessions, assuming that worker sessions are resource-intensive and the root issue. For example, could we start a worker session only once a pipeline slot has been acquired? This approach could cap the number of concurrent worker sessions at the combined capacity of all pipelines. For instance, with 4 pipelines each having a capacity of 5, the system would run at most 20 concurrent sessions, because the queuing would happen beforehand. This idea is relatively simple to implement and could provide significant efficiency gains.
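To make the "acquire a slot first, then start the session" idea concrete, here is a minimal Go sketch. It is not Enduro or Temporal code: `sessionCounter` and `simulate` are hypothetical names, a buffered channel stands in for each pipeline's slots, and a short sleep stands in for the work done inside a worker session. It only illustrates that, when the waiting happens before the session is opened, the number of open sessions cannot exceed the combined pipeline capacity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sessionCounter is a hypothetical helper that records how many simulated
// worker sessions are open at the same time, and the highest value seen.
type sessionCounter struct {
	mu            sync.Mutex
	current, peak int
}

func (c *sessionCounter) open() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.current++
	if c.current > c.peak {
		c.peak = c.current
	}
}

func (c *sessionCounter) done() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.current--
}

// simulate queues perPipeline packages on each pipeline and returns the peak
// number of concurrently open sessions. Each package waits for a pipeline
// slot BEFORE opening its session, so the waiting costs no session.
func simulate(pipelines, capacity, perPipeline int) int {
	var (
		c  sessionCounter
		wg sync.WaitGroup
	)
	for i := 0; i < pipelines; i++ {
		// Buffered channel used as a counting semaphore for this pipeline.
		slots := make(chan struct{}, capacity)
		for j := 0; j < perPipeline; j++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				slots <- struct{}{}        // queue here, outside any session
				defer func() { <-slots }() // release the slot last
				c.open()                   // only now start the "session"
				defer c.done()             // close the session before releasing
				time.Sleep(time.Millisecond) // stand-in for the transfer work
			}()
		}
	}
	wg.Wait()
	return c.peak
}

func main() {
	peak := simulate(4, 5, 50)
	fmt.Println("peak concurrent sessions:", peak, "upper bound:", 4*5)
}
```

With 4 pipelines of capacity 5, the reported peak can never exceed 20, regardless of how many packages are queued. Whether this maps cleanly onto real worker sessions is an open question that this sketch does not answer.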
@sevein You are spot on in your understanding of the problem. My "solution" to the previous scalability problem created the problem described in this issue. Technically we could revert to using generous values for
I actually ran some tests with your proposed solution (a while ago). I ran into problems I could not explain, in which transfers would stop being sent to AM. I have high hopes that it's a small bug in Enduro that we could actually solve. Sadly, I did not look further. Let me know if it would be useful to set up a scalability testing environment for Enduro. By scalability testing in this particular case I mean:
Also, I was able to successfully reproduce this issue locally in my environment, emulating the client setup.
Enduro in the current DPS setup watches three directories:
It queues the packages as they appear in the directories. What seems to be an issue, though, is the way these packages are distributed to the transfer slots. For example, if there are many epj-s queued before a batch of dpj-s, the dpj-s aren't picked up, even if the dpj pipeline is completely idle.
The behaviour we're observing won't give us full utilization of all pipelines unless we somehow manage to distribute dpj-s and epj-s evenly in the queue, which is not really realistic. Consider the following scenario:
The result is that the epj and dpj submissions are prevented from running in parallel, and even if we manage to tweak it so that both epj and dpj SIPs are created, we have to make sure they get queued evenly.
Ideally, Enduro should trigger transfers for a certain type of package, as long as that type of pipeline has free transfer slots, ignoring the transfer's overall position in the queue.
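The difference between the behaviour we observe and the behaviour we'd like can be sketched in a few lines of Go. This is a simplified model, not Enduro's actual scheduler: the `pkg` type, both dispatch functions, and the slot counts are made up for illustration, and the strict FIFO variant assumes the queue stalls as soon as the package at its head cannot start.

```go
package main

import "fmt"

// pkg is a hypothetical queued package; kind names the pipeline type it
// needs, for example "epj" or "dpj".
type pkg struct{ name, kind string }

// copyFree returns a copy so the dispatch functions never mutate the
// caller's map of free slots.
func copyFree(free map[string]int) map[string]int {
	out := make(map[string]int, len(free))
	for k, v := range free {
		out[k] = v
	}
	return out
}

// dispatchFIFO models a strict global queue: it starts packages in order
// and stops at the first one whose pipeline has no free slot.
func dispatchFIFO(queue []pkg, free map[string]int) []string {
	free = copyFree(free)
	var started []string
	for _, p := range queue {
		if free[p.kind] == 0 {
			break // head-of-line blocking: everything behind it waits
		}
		free[p.kind]--
		started = append(started, p.name)
	}
	return started
}

// dispatchByType starts every package whose pipeline type still has a free
// slot, ignoring its overall position in the queue. Order within a type is
// preserved.
func dispatchByType(queue []pkg, free map[string]int) []string {
	free = copyFree(free)
	var started []string
	for _, p := range queue {
		if free[p.kind] > 0 {
			free[p.kind]--
			started = append(started, p.name)
		}
	}
	return started
}

func main() {
	queue := []pkg{
		{"epj-1", "epj"}, {"epj-2", "epj"}, {"epj-3", "epj"},
		{"dpj-1", "dpj"}, {"dpj-2", "dpj"},
	}
	free := map[string]int{"epj": 1, "dpj": 1} // assumed free slots per type
	fmt.Println("strict FIFO:", dispatchFIFO(queue, free))
	fmt.Println("by type:", dispatchByType(queue, free))
}
```

In this model the strict FIFO pass starts only `epj-1` and leaves the dpj slot idle, while the type-aware pass starts `epj-1` and `dpj-1` together.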