SOMA is an open-source and freely available toolbox for composing distributed, high-performance computing (HPC) monitoring, analysis, and visualization services out of microservice components. SOMA employs the principle of composability to design, build, maintain, deploy, and extend monitoring capabilities. SOMA is inspired by and reliant on the Mochi framework from Argonne National Laboratory (https://mochi.readthedocs.io/) to build portable HPC microservices.

SOMA's design is guided by the following principles ("CPAIR"):
- Composability: Distinct, new functionality should go into a new microservice or library component, which means no mixing of "related" functionality. We want to adhere to the principles of microservice design as far as possible, with few to no exceptions. The rationale is to give the user improved maintainability and scalability for monitoring service components while exploring the performance trade-offs that result from a clean separation into services.
- Performance: While SOMA can be deployed on commodity clusters, it is designed to truly come to life on RDMA-enabled HPC clusters. By relying on an HPC-optimized service framework like Mochi, SOMA can transparently use high-performance RDMA networks and high-end computing accelerators such as GPUs to move and process data efficiently. These capabilities become essential should your monitoring service lie on the critical path of your distributed workflow.
- Accessibility: We intend to make high-performance monitoring capabilities accessible to all, especially HPC application domain specialists. That means lowering the barrier to entry for new users and balancing the knobs that are hidden against the knobs exposed to the power user.
- Inclusion: SOMA intends to address the monitoring needs of existing traditional MPI applications as well as the ML- or AI-centric HPC workflows on the horizon. A service-based design allows SOMA's capabilities to be extended when required, while simultaneously increasing the degree of code reuse. Service-based designs also allow SOMA to be "cloud-ready" and deployable on your favorite cloud service provider (CSP).
- Reuse: It cannot be overstated that SOMA's goal is to not reinvent the wheel. Therefore, SOMA does not promise to provide all the functionality required for monitoring, especially when established, free-to-use software for such functionality is already available elsewhere. For example, there exist several state-of-the-art HPC performance tools such as TAU, Score-P, and HPCToolkit that offer sophisticated application performance measurement capabilities. SOMA's API would offer a thin adapter for the measurement APIs of these tools, allowing for reuse of existing instrumentation wherever possible. Similarly, SOMA does not intend to provide a full-blown monitoring dashboard solution; rather, SOMA's output would be converted to a file format that existing tools such as Grafana or Zipkin can ingest directly.
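
To make the Reuse principle concrete, the following is a minimal sketch of what converting measurements into a format an existing tool can ingest might look like. It is not SOMA's API: `TimerRecord`, `to_zipkin_span`, and `export_spans` are hypothetical names invented for this illustration, and the record layout is an assumption. Only the output field names (`traceId`, `id`, `name`, `timestamp`, `duration`, `localEndpoint`, `tags`) follow the Zipkin v2 JSON span format.

```python
import json
from dataclasses import dataclass


@dataclass
class TimerRecord:
    """Hypothetical measurement record; not SOMA's or TAU's actual schema."""
    name: str         # name of the timed region
    start_us: int     # start time, microseconds since the Unix epoch
    duration_us: int  # elapsed time in microseconds
    rank: int         # MPI rank that produced the measurement


def to_zipkin_span(rec: TimerRecord, trace_id: str, span_index: int) -> dict:
    """Map one record onto the fields of a Zipkin v2 JSON span."""
    return {
        "traceId": trace_id,               # 16 or 32 lowercase hex characters
        "id": format(span_index, "016x"),  # 16 lowercase hex characters
        "name": rec.name,
        "timestamp": rec.start_us,         # Zipkin expects epoch microseconds
        "duration": rec.duration_us,       # Zipkin expects microseconds
        "localEndpoint": {"serviceName": f"rank-{rec.rank}"},
        "tags": {"mpi.rank": str(rec.rank)},  # tag values must be strings
    }


def export_spans(records, trace_id: str) -> str:
    """Serialize records as a JSON array of Zipkin v2 spans."""
    # Span ids start at 1 so that no span gets an all-zero id.
    spans = [to_zipkin_span(r, trace_id, i + 1) for i, r in enumerate(records)]
    return json.dumps(spans, indent=2)


if __name__ == "__main__":
    records = [
        TimerRecord("MPI_Allreduce", 1_700_000_000_000_000, 250, rank=3),
        TimerRecord("compute_step", 1_700_000_000_000_300, 1200, rank=3),
    ]
    print(export_spans(records, trace_id="000000000000abcd"))
```

The point of the sketch is the division of labor: the measurement tool owns the instrumentation, the dashboard owns the visualization, and the layer in between is a thin, stateless mapping from one record shape to another.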