Cloud support #11233

Open · 12 tasks

innovate-invent opened this issue Jan 26, 2021 · 1 comment
Comments

innovate-invent (Contributor) commented Jan 26, 2021

Despite popular opinion, Galaxy is not yet compatible with Cloud infrastructure. I have spent several months working to get Galaxy running in the Cloud but have encountered some major barriers. Discussion at the last dev round table about work offloading and message queues has prompted me to compile all of the issues preventing Cloud deployment.

I think there are subtly varying definitions of "Cloud infrastructure" floating around, so I should begin by defining what I mean when I use that term. Cloud resources are ephemeral and not necessarily co-located. For an application to be Cloud compatible, it needs to access its resources in a way that is permissive of high latency and tolerant of racing access patterns. Application processes need to be able to losslessly accept being terminated with little to no warning, and replica processes need to be able to recover the work. I need to point out the distinction between Cloud infrastructure and deploying a fixed compute cluster in a "Cloud provider" such as AWS. Even if the cluster can autoscale, its nodes are expected to be long-lived, and this does not meet my definition of Cloud infrastructure.

Aspects of Galaxy do not meet these requirements:

Dependency on an NFS - While object store support has been added for user data, there are still many resources that require all replicas to share a network file system. An NFS is expected to support a range of access patterns that do not scale well or permit high latency. Replicating an NFS across data centers is a massive undertaking and usually requires unwelcome compromises. NFS is also strongly tied to the mounting device's permission domain, which requires using fixed uid/gids for all processes when not all hosts share the same domain. Job data, tool data, managed configuration, and data table loc files all have to be shared between application replicas via an NFS. Bridging the filesystem to a more Cloud compatible data store using something such as s3fs adds a layer of complexity and is prone to failure when the resource is accessed assuming a full filesystem feature set. Pulsar attempts to remedy the issue with the job data, and I am still evaluating it as a solution; I have yet to confirm whether it supports process replication and scaling. It also appears that support for removing the need for an NFS between the Pulsar daemon and the jobs is still in its infancy and needs to be reworked.
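
As an illustration of the kind of access pattern that would replace the shared mount, here is a minimal sketch (not existing Galaxy code) that reads a tool-data artifact from an S3-compatible object store instead of an NFS path; the endpoint, bucket, and key names are hypothetical:

```python
import boto3

# Endpoint, bucket, and key are placeholders for any S3-compatible store.
s3 = boto3.client("s3", endpoint_url="https://object-store.example.org")

def read_tool_data(key: str) -> bytes:
    """Fetch a tool-data artifact (e.g. a .loc file or an index) by key.

    Replicas in different data centers can call this without sharing a
    filesystem or a uid/gid permission domain.
    """
    resp = s3.get_object(Bucket="galaxy-tool-data", Key=key)
    return resp["Body"].read()

# What would otherwise live at /galaxy/tool-data/bwa_mem_index.loc on an NFS:
loc_contents = read_tool_data("tool-data/bwa_mem_index.loc")
```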

Intolerant of process termination - Currently, in 20.09, the Galaxy app process (apart from the workflow scheduler and job scheduler) attempts to do large operations such as history deletion "in process". A sudden termination of the replica serving the request loses the deletion request, leaving a partially deleted history. Schedulers attempt to synchronize their work by acquiring a lock on a database resource, but there is no mechanism to release this lock when a worker terminates: all tasks acquired by that scheduler are orphaned and need to be manually killed. These limitations prevent Galaxy from being auto-scalable.
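
One way to attach a lock to the lifetime of the process is a database-managed lock that is released automatically when the holder's connection drops. A minimal sketch, assuming PostgreSQL and SQLAlchemy, with a made-up lock key (this is not how the schedulers currently lock):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://galaxy:galaxy@db/galaxy")  # placeholder URL
WORKFLOW_SCHEDULER_LOCK = 42  # hypothetical lock key for this scheduler

def try_acquire_scheduler_lock(conn) -> bool:
    # pg_try_advisory_lock returns immediately; True means this connection now
    # owns the lock for as long as the connection (and thus the process) lives.
    return conn.execute(
        text("SELECT pg_try_advisory_lock(:key)"),
        {"key": WORKFLOW_SCHEDULER_LOCK},
    ).scalar()

with engine.connect() as conn:
    if try_acquire_scheduler_lock(conn):
        pass  # schedule work; if this process is killed, the lock is released
```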

CPU bound auto-scaling - This is the issue I was discussing with @dannon in the Gitter channel. Galaxy maintains internal queues that are opaque to the infrastructure. This means the only ways to trigger scaling of resources are to watch the CPU load of the processes serving the queues or to count the number of incoming web requests. These metrics do not allow preemptive scaling of resources, and there is no way to inform the infrastructure how much to scale. A queue can hold thousands of microsecond tasks or a few very large ones; the infrastructure needs to understand the weighted volume of pending work in order to scale appropriately. The job queue is able to do this because it leverages the queue provided by the underlying infrastructure. Ideally, all work tasks should have to pass through an underlying infrastructure interface to be executed.
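
A minimal sketch of what exposing that weighted volume could look like, assuming a Prometheus-style metrics endpoint scraped by the autoscaler (KEDA, an HPA with a custom metric, etc.); the metric name and weight function are placeholders, not an existing Galaxy feature:

```python
from prometheus_client import Gauge, start_http_server

# Weighted depth of Galaxy's internal queues, scraped by the autoscaler.
pending_work = Gauge(
    "galaxy_pending_work_weight",
    "Estimated cost of tasks waiting in Galaxy's internal queues",
)

def estimate_weight(task) -> float:
    # Hypothetical weighting: a thousand microsecond tasks should not scale
    # the cluster the same way a few multi-hour tasks do.
    return getattr(task, "estimated_cost", 1.0)

def report_queue(tasks) -> None:
    pending_work.set(sum(estimate_weight(t) for t in tasks))

if __name__ == "__main__":
    start_http_server(9200)  # autoscaler scrapes http://<replica>:9200/metrics
    report_queue([])         # call this whenever the internal queue changes
```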

Locked to quay.io - Currently there is no way to implement container caching. Galaxy is hardcoded to quay.io/biocontainers to resolve its dependencies. I attempted to spoof quay.io via a pull-through proxy, but Galaxy still uses v1 of the registry API, which made this impossible. Compute nodes scale to zero when no work is pending, meaning that every time a large job comes in, every spawned compute node has to re-download the tool containers from quay.io. This is very problematic when serving bursty work requests across over a hundred new compute nodes (which are then deleted when the queue drains).
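
A sketch of what configurable resolution could look like: a registry map that rewrites hardcoded quay.io/biocontainers references to a pull-through cache inside the cluster. The mirror address is a placeholder and this is not an existing Galaxy option:

```python
# Map of original registry prefixes to mirrors / pull-through caches.
REGISTRY_MAP = {
    "quay.io/biocontainers": "registry.cluster.local:5000/biocontainers",
}

def resolve_container(image_ref: str) -> str:
    """Rewrite a container reference to point at a local cache when one is
    configured for its registry prefix; otherwise return it unchanged."""
    for prefix, mirror in REGISTRY_MAP.items():
        if image_ref.startswith(prefix + "/"):
            return mirror + image_ref[len(prefix):]
    return image_ref

print(resolve_container("quay.io/biocontainers/bwa:0.7.17--h5bf99c6_8"))
# -> registry.cluster.local:5000/biocontainers/bwa:0.7.17--h5bf99c6_8
```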

Proposals

  • Replace loc files with SQLAlchemy backed by SQLite. Add configuration allowing the data to be moved into a larger-scale database that can be replicated (see the sketch after this list)
  • Enable, by default, passing data manager file outputs to Galaxy's object store. Allow disabling this for data managers where the data is already available to jobs (e.g. via CVMFS)
  • Find a non-flat-file solution for storing managed configuration
  • Pulsar needs more love and attention. Make it the sole job runner and move all runner plugins to it?
  • Move forward with passing task requests from the web app to an MQ as previously discussed, but use Kombu so that managed brokers provided by the Cloud provider can be used, or, if ZeroMQ is termination tolerant, go with it
  • Make all tasks Galaxy tools. Unify and reuse the existing mechanisms for work dispatch. The argument was made that passing tiny tasks through the infrastructure queue may be too heavyweight, but that is where job destinations can come into play and send smaller tasks to a Celery worker. The point is to provide the option to keep the infrastructure orchestrator in the loop.
  • Make all processes stateless, operations atomic/idempotent, and locks attached to the lifetime of the process. Any Galaxy process should be able to be terminated without throwing the application into an inconsistent state.
  • Make tool container resolution configurable and dynamic, i.e. map container repositories to tool repository owners, or provide tool container overrides outside of the tool XML
  • Update the container registry API version
  • Include the Galaxy tool wrapper in the mulled container
  • Provide health measures for the various processes so that the orchestrator can monitor them
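
To make the first proposal above concrete, here is a minimal sketch (not an existing Galaxy schema) of a loc file row modeled with SQLAlchemy, backed by SQLite by default and movable to a replicated database by changing only the connection URL; the table and column names are hypothetical:

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DataTableEntry(Base):
    """One row of what is currently a bwa_mem_index.loc line."""
    __tablename__ = "bwa_mem_index"  # hypothetical: one table per data table
    value = Column(String, primary_key=True)
    name = Column(String)
    dbkey = Column(String)
    path = Column(String)  # could hold an object-store URI instead of an NFS path

# SQLite by default; point the URL at a replicated PostgreSQL when needed.
engine = create_engine("sqlite:///tool_data.sqlite")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(DataTableEntry(
        value="hg38", name="Human (hg38)", dbkey="hg38",
        path="s3://galaxy-tool-data/bwa/hg38",
    ))
    session.commit()
```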

Stretch goals

  • Scale to Zero. Make Galaxy app initialization as close to instant as possible so that it can be started and stopped with the web request. This allows "serverless" deployment into AWS Lambda or competing offerings. Workers can scale to zero until the message queue triggers them.
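
A minimal sketch of a scale-to-zero worker, assuming a Kombu-compatible broker; the broker URL, queue name, and task handler are placeholders:

```python
from kombu import Connection

IDLE_TIMEOUT = 60  # seconds without work before the worker exits

def handle_task(payload):
    # Placeholder: dispatch to whatever performs the real work.
    print("processing", payload)

# Broker URL and queue name are placeholders.
with Connection("amqp://guest:guest@broker//") as conn:
    queue = conn.SimpleQueue("galaxy.tasks")
    while True:
        try:
            message = queue.get(block=True, timeout=IDLE_TIMEOUT)
        except queue.Empty:
            break  # queue drained; exit so the orchestrator can scale to zero
        handle_task(message.payload)
        message.ack()
    queue.close()
```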

References:
#10243
#10894
#10699
#10686
#10646
#10576
#10536
#10414
#8392
#11334
#10436
#11721

selten (Contributor) commented Jan 27, 2021

Regarding the replacement of loc files: while a different goal is stated for the PR, it does aid in solving the problem described here: #9875

It does not go as far as this issue hopes, since that would depend on how the people maintaining each data manager implement it, but it does take the first steps toward accomplishing this.
