deployment: extend "at scale" page #126

VMois opened this issue May 20, 2022 · 0 comments
VMois commented May 20, 2022

Take the notes below by @tiborsimko and use them to extend our "Deployment at scale" page.

For a test deployment you can start very small; even a single-node Kubernetes
deployment on one machine of the type m2.xlarge (8 CPU, 16 GB RAM) should be
more than enough.
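
For illustration, such a single-node test cluster could be created with kind (the tool choice here is an assumption, not part of the notes above; any single-node Kubernetes distribution would do):

```sh
# Minimal sketch: throw-away single-node test cluster using kind (assumed tool).
kind create cluster --name reana-test
kubectl get nodes   # should list exactly one node acting as control plane and worker
```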

However, if you are already thinking about creating a small multi-user,
multi-node cluster that could easily grow with time, then I'd recommend
starting with at least 5 such nodes, assigned the following responsibilities
(a kubectl labelling sketch follows the list):

- 1 node labelled reana.io/system=infrastructure that will run the REANA
  frontend and backend infrastructure services;

- 1 node labelled reana.io/system=infrastructuredb that will run the PostgreSQL
  DB service (unless you already have a DB service running outside of the
  cluster that can be reused, so that you do not have to host the DB yourself);

- 1 node labelled reana.io/system=infrastructuremq that will run the RabbitMQ
  service;

- 1 node labelled reana.io/system=runtimebatch that will run the user runtime
  batch workflow orchestration pods (CWL/Serial/Snakemake/Yadage);

- 1 node labelled reana.io/system=runtimejobs that will run the user runtime
  job pods (generated by those workflow orchestration pods).
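
As a sketch of how such a node layout could be applied, the labels above can be attached with kubectl; the node names node-1 … node-5 are placeholders for your actual node names:

```sh
# Placeholders node-1..node-5 stand for your real node names (see: kubectl get nodes).
kubectl label node node-1 reana.io/system=infrastructure
kubectl label node node-2 reana.io/system=infrastructuredb
kubectl label node node-3 reana.io/system=infrastructuremq
kubectl label node node-4 reana.io/system=runtimebatch
kubectl label node node-5 reana.io/system=runtimejobs
```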

With such a setup, you can keep the 3 infrastructure nodes fixed and scale the
2 runtime nodes (1 batch, 1 jobs) up to e.g. 50 runtime nodes (10 batch, 40
jobs) as your needs grow.

For example, 1 runtime batch node can comfortably run 8 concurrent user
workflows at full speed (since 1 node has 8 cores). So, if you need 80 users
running at full speed, then 10 such runtime batch nodes may be needed.

For example, if your physics workflows are typically such that 1 workflow
generates 4 parallel n-tupling jobs, then you may want to add 4 runtime job
nodes for each runtime batch node in the system, so that everything can run
optimally at a sustainable full speed. (Provided memory is sufficient; if not,
machine types with more RAM may be necessary.)
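
As a back-of-the-envelope sizing sketch using the figures from the two examples above (8 cores per node, 80 concurrent users, 4 parallel jobs per workflow):

```sh
# Rough node-count arithmetic; the inputs mirror the example figures above.
users=80            # concurrent users running at full speed
cores_per_node=8    # cores per runtime batch node
jobs_per_workflow=4 # parallel jobs spawned by one workflow

batch_nodes=$(( users / cores_per_node ))          # -> 10 runtime batch nodes
job_nodes=$(( batch_nodes * jobs_per_workflow ))   # -> 40 runtime job nodes
echo "runtime batch nodes: ${batch_nodes}, runtime job nodes: ${job_nodes}"
```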

We have tried the above three-infrastructure-node setup for clusters of O(1k)
cores and everything scaled very nicely.

If you'd like to aim even higher, say 5k cores, then using 8 CPU / 16 GB nodes
is not optimal: in our scalability tests we saw slowdowns and huge loads on the
Kubernetes master node. Using larger machine flavours (32 CPU or more) would be
preferable here. But I guess these types of considerations can wait for now.
@tiborsimko tiborsimko self-assigned this Mar 13, 2023
@mdonadoni mdonadoni added this to 0.95.0 Aug 8, 2024
@mdonadoni mdonadoni moved this to In work in 0.95.0 Aug 8, 2024