
Problem: Pipelines remaining idle even when there are many packages queued #618

Open
joel-simpson opened this issue Jun 27, 2024 · 2 comments

@joel-simpson

Enduro in the current DPS setup watches three directories:

dpj-ts
epj-ts
other-ts

It queues the packages as they appear in the directories. The issue seems to be the way these packages are distributed to the transfer slots: for example, if there are many epj-s queued before a batch of dpj-s, the dpj-s aren't picked up, even if the dpj pipeline is completely idle.

The behaviour we're observing won't give us full utilization of all pipelines unless we somehow manage to distribute dpj-s and epj-s evenly in the queue, which is not really realistic. Consider the following scenario:

  • a dpj submission is queued
  • an epj submission is sent to archiving
  • the sip containing the “avleveringsliste” is queued in Enduro, after all the dpj-s
  • epj sip creation does not start until the avl is archived, basically when the dpj submission is finished

The result is that the epj and dpj submissions are prevented from running in parallel, and even if we manage to tweak it so that both epj and dpj sips are created, we have to make sure they get queued evenly.

Ideally, Enduro should trigger transfers for a given package type as long as that type's pipeline has free transfer slots, regardless of the transfer's overall position in the queue.
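
For illustration, here is a minimal sketch of that kind of per-pipeline dispatch. The `Package` and `Dispatcher` types and their fields are hypothetical, not Enduro's actual internals:

```go
package main

import "fmt"

// Package is a queued transfer; Pipeline names its target pipeline
// (e.g. "dpj", "epj", "other"). Both types are illustrative only.
type Package struct {
	Name     string
	Pipeline string
}

// Dispatcher holds the global queue and the free transfer slots per pipeline.
type Dispatcher struct {
	queue []Package
	free  map[string]int
}

// Next returns the first queued package whose pipeline has a free slot,
// skipping packages whose pipeline is saturated so that an idle pipeline
// is never starved by the overall queue order.
func (d *Dispatcher) Next() (Package, bool) {
	for i, pkg := range d.queue {
		if d.free[pkg.Pipeline] > 0 {
			d.free[pkg.Pipeline]--
			d.queue = append(d.queue[:i], d.queue[i+1:]...)
			return pkg, true
		}
	}
	return Package{}, false
}

func main() {
	d := &Dispatcher{
		queue: []Package{{"epj-001", "epj"}, {"epj-002", "epj"}, {"dpj-001", "dpj"}},
		free:  map[string]int{"epj": 0, "dpj": 1}, // epj pipeline is full, dpj is idle
	}
	if pkg, ok := d.Next(); ok {
		fmt.Println("dispatching", pkg.Name) // dpj-001 runs even though epj packages are ahead of it
	}
}
```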

@sevein
Member

sevein commented Dec 15, 2024

If I understand this correctly, this issue arose from the changes introduced in #616, which effectively added a global queue that processes tasks in FIFO order without considering the availability of individual pipelines. If I remember correctly, we had to do this to address scalability challenges when running Temporal and Enduro workers in constrained environments, where a high number of concurrent sessions was causing timeout errors in Temporal.

Users can mitigate the negative impact of the global queue by adjusting MaxConcurrentSessionExecutionSize and MaxConcurrentWorkflowExecutionSize to higher values (the default used to be 5000). While this won't eliminate the issue, it can reduce its frequency, since more packages would be dispatched directly to pipeline-specific queues; however, doing so requires provisioning additional compute resources for Temporal and Enduro to handle the increased load.
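
As a rough sketch of that mitigation, assuming the settings map onto the Temporal Go SDK worker options (the task queue name below is made up, and I'm using the SDK's MaxConcurrentWorkflowTaskExecutionSize as the closest equivalent of MaxConcurrentWorkflowExecutionSize):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Raise the concurrency limits again (the old default was 5000). This lets
	// more packages be dispatched directly to pipeline-specific queues, but
	// Temporal and Enduro need extra resources to handle the increased load.
	w := worker.New(c, "global", worker.Options{
		EnableSessionWorker:                    true,
		MaxConcurrentSessionExecutionSize:      5000,
		MaxConcurrentWorkflowTaskExecutionSize: 5000,
	})

	// Workflows and activities would be registered here before running the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```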

An alternative worth exploring is to move the existing pipeline-specific queues outside the scope of worker sessions, assuming that worker sessions are resource-intensive and are the root of the issue. For example, could we start a worker session only once a pipeline slot has been acquired? This approach would cap the number of concurrent worker sessions at the combined capacity of all pipelines. For instance, with 4 pipelines each having a capacity of 5, the system would operate with at most 20 concurrent sessions, as the queuing would happen beforehand. This idea is relatively simple to implement and could provide significant efficiency gains.
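
A rough sketch of that idea as a Temporal workflow, with a hypothetical AcquirePipelineSlot activity standing in for the existing pipeline-specific queuing (this is not Enduro's actual code):

```go
package main

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// AcquirePipelineSlot is a hypothetical activity that blocks until the named
// pipeline has a free transfer slot, then reserves it.
func AcquirePipelineSlot(pipeline string) error {
	// In a real implementation this would wait on the pipeline's semaphore.
	return nil
}

// ProcessingWorkflow queues for a pipeline slot before opening a session, so
// the number of concurrent worker sessions can never exceed the combined
// capacity of all pipelines (e.g. 4 pipelines x 5 slots = 20 sessions).
func ProcessingWorkflow(ctx workflow.Context, pipeline string) error {
	actx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 24 * time.Hour, // we may wait a long time for a slot
	})
	if err := workflow.ExecuteActivity(actx, AcquirePipelineSlot, pipeline).Get(ctx, nil); err != nil {
		return err
	}

	// Only now start the resource-intensive worker session.
	sctx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		CreationTimeout:  5 * time.Minute,
		ExecutionTimeout: 24 * time.Hour,
	})
	if err != nil {
		return err
	}
	defer workflow.CompleteSession(sctx)

	// ...run the transfer/ingest activities for this package inside sctx...
	return nil
}
```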

@DanielCosme
Contributor

@sevein You are spot on in your understanding of the problem. My "solution" to the previous scalability problem created the problem described in this issue. Technically we could revert to using generous values for MaxConcurrentSessionExecutionSize and MaxConcurrentWorkflowExecutionSize if we give the Temporal service an isolated VM with generous resources (mainly to avoid it competing for disk IO with AM and Enduro). I say this because I tested that as a brute-force alternative, and it worked. Even pointing Temporal's DB to its own block device in the same VM (separate from AM's DB) could help immensely. I'm not saying it's a perfect solution, but it can temporarily mitigate this limitation at the operational level.

I actually did some tests with your proposed solution (a while ago). I faced some problems, unknown to me, in which transfers would stop being sent to AM, but I have high hopes that it's a small bug in Enduro we could actually solve. Sadly I did not look further. Let me know if it would be useful to set up a scalability testing environment for Enduro. By scalability testing in this particular case I mean:

  • 3 watched directories
  • 2+ AM pipelines (powered by Docker on the same host)
  • 15k transfer batches queued at any given time
  • 1 instance of Enduro

Also, I was able to successfully reproduce this issue locally in my environment, emulating the client setup.
