Merge pull request #3 from hmdc/metrics_help_and_configuration
Metrics help and configuration
abujeda authored Dec 4, 2024
2 parents 1fdd304 + 6e2a76a commit 890e0f3
Showing 21 changed files with 137 additions and 15 deletions.
3 changes: 2 additions & 1 deletion Makefile
@@ -45,9 +45,10 @@ build_latest_ood:
  build_system_demo build_user_demo: build_latest_ood
  # COPY DEMO CONFIGURATION
  cp -R config/demo/. ondemand/apps/dashboard
- # COPY METRICS WIDGET
+ # COPY CUSTOMIZATIONS
  mkdir -p ondemand/apps/dashboard/plugins
  cp -R dev/metrics ondemand/apps/dashboard/plugins/metrics
+ cp -R dev/session_metrics ondemand/apps/dashboard/plugins/session_metrics

  start_ood_installer:
  docker create --rm --name ood_installer --privileged -p 43000:443 ood_puppet:5.0.1
1 change: 0 additions & 1 deletion config/demo/app_overrides/ondemand.d/default.yml
@@ -1,4 +1,3 @@
- # File managed by Puppet - DO NOT EDIT
  ---
  pinned_apps:
  - sys/Jupyter
8 changes: 7 additions & 1 deletion config/demo/app_overrides/ondemand.d/demo.yml
@@ -4,4 +4,10 @@ host_based_profiles: false
  show_all_apps_link: false
  bc_saved_settings: true
  cancel_session_enabled: true
- metrics_enabled: true
+ session_metrics_enabled: true
+
+ help_menu:
+ - group: "Metrics"
+ - title: "Metrics"
+ icon: "fas://chart-bar"
+ page: "metrics"
5 changes: 4 additions & 1 deletion config/demo/app_overrides/ondemand.d/metrics.yml
@@ -2,6 +2,9 @@ custom_pages:
  metrics:
  rows:
  - columns:
- - width: 12
+ - width: 8
+ widgets:
+ - "metrics/metrics_help"
+ - width: 4
  widgets:
  - "metrics/metrics"
6 changes: 5 additions & 1 deletion config/local/ondemand.d/metrics.yml
@@ -1,7 +1,11 @@
+ session_metrics_enabled: true
  custom_pages:
  metrics:
  rows:
  - columns:
- - width: 12
+ - width: 8
+ widgets:
+ - "metrics/metrics_help"
+ - width: 4
  widgets:
  - "metrics/metrics"
4 changes: 4 additions & 0 deletions config/local/ondemand.d/root.yml
@@ -26,6 +26,10 @@ globus_endpoints:
  endpoint_path: "/demo/dataverse"

  help_menu:
+ - group: "Metrics"
+ - title: "Metrics"
+ icon: "fas://chart-bar"
+ page: "metrics"
  - group: "Docs"
  - title: "Documentation"
  icon: "fas://book"
6 changes: 4 additions & 2 deletions dev/metrics/README.md
@@ -7,13 +7,15 @@ The metrics widget consists of four panels:
  - GPU Jobs by state
  - Jobs Stats Summary

+ The metrics help widget is just a blurb of text.
+
  ### Implementation Summary
  The implementation consists of three components:
  - OOD Slurm adapter extension
  - ./initializers/slurm_extension.rb
  - Metrics calculations utility classes
  - ./lib/slurm_metrics
- - Metrics widget templates
+ - Metrics widget and help templates
  - ./views/widgets/metrics

  ### Deployment
@@ -36,6 +38,6 @@ In the Puppet control repo, we need to add the files for the three components to
  - site-modules/profile/files/openondemand/common/apps_config/dashboard/views/widgets

  To review and test the new widget, we can use a custom page to display it. This is a sample configuration:
- https://github.com/hmdc/ondemand_development/blob/main/config/local/ondemand.d/metrics.yml
+ https://github.com/hmdc/ondemand_development/blob/main/dev/metrics/metrics.yml

  This will create a custom page under: `/pun/sys/dashboard/custom/metrics`
4 changes: 2 additions & 2 deletions dev/metrics/lib/slurm_metrics/metrics_helper.rb
@@ -79,8 +79,8 @@ def metrics_waiting_elapsed(completed_time)
  Time.now.to_i - completed_time > 10
  end

- def metrics_enabled?(user_configuration)
- metrics_configuration = user_configuration.send(:fetch, :metrics_enabled, false)
+ def session_metrics_enabled?(user_configuration)
+ metrics_configuration = user_configuration.send(:fetch, :session_metrics_enabled, false)
  ::Configuration.send(:to_bool, metrics_configuration)
  end
  end
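
The rename in this hunk changes the configuration key that the helper reads. The following is a minimal sketch of that lookup, not the dashboard's implementation: a plain Hash stands in for the user configuration, and the `to_bool` method and its `TRUTHY_VALUES` list are hypothetical stand-ins for the dashboard's own `Configuration.to_bool`, whose accepted values are not shown in this commit.

```ruby
# Hypothetical stand-in for the dashboard's Configuration.to_bool helper.
# The accepted truthy values are an assumption for this sketch.
TRUTHY_VALUES = [true, 'true', '1', 'yes', 'on'].freeze

def to_bool(value)
  normalized = value.is_a?(String) ? value.strip.downcase : value
  TRUTHY_VALUES.include?(normalized)
end

# Mirrors the renamed helper: read the new key, default to false when absent.
# `config` is a plain Hash here, standing in for the user configuration object.
def session_metrics_enabled?(config)
  to_bool(config.fetch(:session_metrics_enabled, false))
end

new_style = { session_metrics_enabled: 'true' }
old_style = { metrics_enabled: true } # key used before this commit

puts session_metrics_enabled?(new_style) # the new key is honoured
puts session_metrics_enabled?(old_style) # the old key is ignored, so the default (false) applies
```

One consequence worth noting: because the lookup defaults to `false`, a configuration that still sets only the old `metrics_enabled` key would leave the session metrics switched off without raising an error.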
10 changes: 9 additions & 1 deletion dev/metrics/metrics.yml
@@ -1,7 +1,15 @@
+ help_menu:
+ - group: "Metrics"
+ - title: "Metrics"
+ icon: "fas://chart-bar"
+ page: "metrics"
  custom_pages:
  metrics:
  rows:
  - columns:
- - width: 12
+ - width: 8
+ widgets:
+ - "metrics/metrics_help"
+ - width: 4
  widgets:
  - "metrics/metrics"
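
The layout change above splits the single full-width column into an 8-wide help column and a 4-wide metrics column. The sketch below parses an equivalent layout and sums the column widths per row. The indentation of the embedded YAML is reconstructed, since the flattened diff does not preserve it, and treating 12 as the full row width assumes the usual 12-column grid.

```ruby
require 'yaml'

# Layout equivalent to the new custom page definition.
# Indentation is reconstructed for this sketch; the diff above lost it.
LAYOUT = <<~YAML
  custom_pages:
    metrics:
      rows:
        - columns:
            - width: 8
              widgets:
                - "metrics/metrics_help"
            - width: 4
              widgets:
                - "metrics/metrics"
YAML

config = YAML.safe_load(LAYOUT)
rows = config.dig('custom_pages', 'metrics', 'rows')

# Total width requested by each row.
row_widths = rows.map { |row| row['columns'].sum { |column| column['width'] } }

# Widgets in display order, left to right.
widgets = rows.flat_map { |row| row['columns'].flat_map { |column| column['widgets'] } }

puts row_widths.inspect # → [12]
puts widgets.inspect    # → ["metrics/metrics_help", "metrics/metrics"]
```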
55 changes: 55 additions & 0 deletions dev/metrics/views/widgets/metrics/_metrics_help.html.erb
@@ -0,0 +1,55 @@
<div class="metrics-help">
<h3>Metrics Help</h3>
<div class="introduction">
<p>When submitting jobs to a high-performance computing (HPC) cluster, it is essential to understand how efficiently your jobs utilize the available resources. Proper resource usage ensures optimal job performance and minimizes resource waste, which can help reduce job wait times and improve overall system efficiency.</p>
<p>To help users analyze their jobs, we have created a set of metrics widgets to surface job analytics from the Slurm Workload Manager.</p>
<p>The Slurm <code>sacct</code> command is used to retrieve information about job resource usage and performance. The <code>sacct</code> command provides details such as job exit state, CPU time, memory usage, GPU allocation, and job runtime, which are used to compute efficiency metrics.</p>
</div>

<div class="description">
<h5>FairShare</h5>
<p>FairShare is a scheduling policy used in HPC environments to ensure equitable access to resources among users and projects. It prioritizes job scheduling based on a user’s recent resource usage relative to their allocated share of the cluster. Users or groups that have consumed fewer resources recently are given higher priority, while those who have utilized more resources may experience lower priority until their usage balances out.</p>
<p>The FairShare system helps maintain a balanced workload across the cluster, preventing resource monopolization by any single user or group and promoting fair access for all users.</p>
<p><a target="_blank" href="https://docs.rc.fas.harvard.edu/kb/fairshare/">FASRC Fairshare info</a></p>

<h5>Completed Jobs by State</h5>
<p>This widget provides an overview of the outcomes of completed jobs submitted to the HPC cluster. It categorizes jobs based on their final state, allowing users to quickly assess the overall success and failure rates of their workloads. The widget summarizes the following job states:</p>
<ul>
<li><strong>Completed</strong>: Jobs that finished without errors and met all resource and runtime requirements.</li>
<li><strong>Time Out</strong>: Jobs that exceeded their allocated time limit and were terminated by the scheduler.</li>
<li><strong>Canceled</strong>: Jobs that were manually canceled by the user or an administrator before completion.</li>
<li><strong>OOM - Out of Memory</strong>: Jobs that were terminated due to exceeding their allocated memory limits.</li>
<li><strong>Failed</strong>: Jobs that encountered errors and did not complete successfully.</li>
</ul>
<p>The information helps users identify patterns and potential issues in their job submissions, enabling them to make informed adjustments to improve job success rates and optimize resource usage.</p>

<h5>Summary Job Stats</h5>
<p>The Summary Job Stats widget provides a high-level overview of job resource utilization and efficiency for jobs submitted over a period of time. It tracks key metrics such as CPU Efficiency, GPU Efficiency, Memory Efficiency, and Time Efficiency, helping users understand how well their jobs are utilizing the HPC cluster's resources.</p>
<p>For each metric, the widget displays the average resources used compared to the resources allocated, along with relevant job execution data. This allows users to quickly identify inefficiencies and make necessary adjustments to improve resource usage, reduce queue times, and enhance overall job performance.</p>
<p>By monitoring these metrics, users can ensure their jobs run efficiently, making better use of the cluster's computational resources and contributing to a fair and balanced workload distribution.</p>

<strong>Understanding Efficiency Metrics</strong>
<p>The efficiency metrics discussed in this guide are expressed as percentages, representing how effectively your job utilizes the allocated resources. A higher percentage indicates better resource utilization, while lower values suggest inefficiencies.</p>
<p>Small efficiency values often indicate that the requested resources are not being fully utilized. In such cases, it is recommended to adjust your job's resource requests — for example, reducing the number of CPU cores, GPUs, or memory — to better match the actual needs of your workload. Optimizing resource requests can help improve job efficiency, reduce wait times in the queue, and make more resources available for other users in the HPC cluster.</p>

<div class="metric-section" id="cpu-efficiency">
<p><strong>CPU Efficiency: </strong>CPU efficiency measures how effectively a job uses the allocated CPU cores. High CPU efficiency means that the job is utilizing most of the CPU resources during its runtime.</p>
<p>To improve your CPU efficiency, optimize code for parallel processing, minimize blocking I/O operations, and use monitoring tools like top or htop to adjust CPU core allocation for better resource utilization.</p>
</div>

<div class="metric-section" id="gpu-efficiency">
<p><strong>GPU Efficiency: </strong>GPU efficiency cannot currently be calculated from the data available from Slurm. Only GPU allocation and total walltime metrics are available.</p>
</div>

<div class="metric-section" id="memory-efficiency">
<p><strong>Memory Efficiency: </strong>Memory efficiency refers to how effectively a job uses the allocated RAM. Inefficient memory usage can lead to job failures or excessive swapping, reducing performance.</p>
<p>To improve your memory efficiency, optimize data structures to reduce memory usage, avoid loading unnecessary data, and request only the required memory when submitting jobs.</p>
</div>

<div class="metric-section" id="time-efficiency">
<p><strong>Time Efficiency: </strong>Time efficiency measures how quickly a job completes relative to the resources allocated. Poor time efficiency can indicate that resources are not being used effectively.</p>
<p>To improve your time efficiency, parallelize tasks to shorten runtime, use optimized libraries and algorithms, avoid over-allocating idle resources, and test with smaller datasets to fine-tune job parameters before scaling up.</p>
</div>
</div>

</div>
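
The help text describes each efficiency metric as a percentage of used versus allocated resources. A minimal worked sketch of that arithmetic follows. The formulas are common conventions and an assumption here, since the widget's own calculations live in `./lib/slurm_metrics` and are not part of this commit; the job record and its field names are hypothetical.

```ruby
# Percentage of `allocated` that was actually `used`, rounded to one decimal place.
def efficiency_percent(used, allocated)
  return 0.0 unless allocated.positive?

  (used.to_f / allocated * 100).round(1)
end

# Hypothetical job record; field names are illustrative, not the widget's data model.
job = {
  cpu_seconds_used: 5_400,    # total CPU time consumed across all cores
  elapsed_seconds: 3_600,     # wall-clock runtime
  allocated_cores: 4,
  max_memory_mb: 2_048,       # peak memory actually used
  requested_memory_mb: 8_192,
  time_limit_seconds: 14_400
}

# Assumed conventions: CPU time over (runtime x cores), peak memory over
# requested memory, and runtime over the requested time limit.
cpu = efficiency_percent(job[:cpu_seconds_used], job[:elapsed_seconds] * job[:allocated_cores])
memory = efficiency_percent(job[:max_memory_mb], job[:requested_memory_mb])
time = efficiency_percent(job[:elapsed_seconds], job[:time_limit_seconds])

puts "CPU: #{cpu}%, Memory: #{memory}%, Time: #{time}%"
# → CPU: 37.5%, Memory: 25.0%, Time: 25.0%
```

With these numbers, the job used roughly a third of its CPU allocation and a quarter of its memory and time, which is the situation the help text addresses when it recommends requesting fewer cores, less memory, or a shorter time limit.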
37 changes: 37 additions & 0 deletions dev/session_metrics/README.md
@@ -0,0 +1,37 @@
## IQSS Session Metrics Customizations For OnDemand
The session card changes display the job's CPU, Memory, and Time efficiency metrics.
When expanded, we display a table with more information about the job.

### Implementation Summary
The implementation consists of two components:
- OOD Session Helper extension
- ./initializers/session_helper_extension.rb
- Session card updates and required new templates
- ./views/batch_connect/sessions

For OODv3.x, the version that we currently have in production, we need to use 2 different templates:
- `./views/batch_connect/sessions/index.v3.html.erb` => `./views/batch_connect/sessions/index.html.erb`
- `./views/batch_connect/sessions/index.v3.js.erb` => `./views/batch_connect/sessions/index.js.erb`
- `./views/batch_connect/sessions/index.html.erb` => ignore this file for v3.x

### Deployment
Using the customization feature from OnDemand, with the default location under: `/etc/ood/config/apps/dashboard`
- copy `./initializers/session_helper_extension.rb` into `/etc/ood/config/apps/dashboard/initializers`
- copy `./views/batch_connect` into `/etc/ood/config/apps/dashboard/views`

Restart the OnDemand application for the customizations to take effect.

### Deployment With FASRC Puppet
The widget components need to be deployed using the FASRC Puppet control repo. We are already using the OOD Puppet module feature to add files to the OOD dashboard location to add/extend functionality:
`openondemand::apps_config_source:`

The folder that we are deploying is: `site-modules/profile/files/openondemand/common/apps_config`

In the Puppet control repo, we need to add the files for the two components to the following folders:
- site-modules/profile/files/openondemand/common/apps_config/dashboard/initializers
- site-modules/profile/files/openondemand/common/apps_config/dashboard/views

The session card updates are disabled by default. To enable them, add the following property to the root configuration or a profile:
https://github.com/hmdc/ondemand_development/blob/main/dev/session_metrics/metrics.yml

After the updates, the changes can be seen in the Interactive Sessions page: `/pun/sys/dashboard/batch_connect/sessions`
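
The manual deployment steps in this README can be sketched as a small script. This is an illustration under stated assumptions, not a supported installer: `deploy_session_metrics` is a hypothetical helper, the target subfolder is assumed to be the standard `initializers` directory, and the OOD v3.x template renames described above are not handled.

```ruby
require 'fileutils'

# Hypothetical helper: copies the initializer and the batch_connect views
# from the plugin source folder into the dashboard customization folder.
def deploy_session_metrics(source_root, target_root)
  initializers_dir = File.join(target_root, 'initializers')
  views_dir = File.join(target_root, 'views')
  FileUtils.mkdir_p([initializers_dir, views_dir])

  # Copy the single initializer file into the initializers folder.
  FileUtils.cp(File.join(source_root, 'initializers', 'session_helper_extension.rb'), initializers_dir)

  # Copy the whole batch_connect folder so it lands at views/batch_connect.
  FileUtils.cp_r(File.join(source_root, 'views', 'batch_connect'), views_dir)
end

# Example call, left commented out because it writes to the system location:
# deploy_session_metrics('dev/session_metrics', '/etc/ood/config/apps/dashboard')
```

As the README notes, the OnDemand application still needs a restart afterwards for the customizations to take effect.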
@@ -13,7 +13,7 @@ def session_view(session)
  concat render_session_time(session)
  concat id(session)
  concat support_ticket(session) unless @user_configuration.support_ticket.empty?
- if SlurmMetrics::MetricsHelper.new.metrics_enabled?(@user_configuration) && session.completed?
+ if SlurmMetrics::MetricsHelper.new.session_metrics_enabled?(@user_configuration) && session.completed?
  concat render(partial: 'batch_connect/sessions/card/session_job_metrics', locals: { session: session })
  end
  concat display_choices(session)
2 changes: 2 additions & 0 deletions dev/session_metrics/metrics.yml
@@ -0,0 +1,2 @@
session_metrics_enabled: true

@@ -7,7 +7,7 @@
  <%= render_card_partial('id', session) %>
  <%= render_card_partial('support_ticket', session) if Configuration.support_ticket_enabled? %>
  <%= render_card_partial('display_choices', session) %>
- <%= render_card_partial('session_job_metrics', session) if SlurmMetrics::MetricsHelper.new.metrics_enabled?(@user_configuration) && session.completed?%>
+ <%= render_card_partial('session_job_metrics', session) if SlurmMetrics::MetricsHelper.new.session_metrics_enabled?(@user_configuration) && session.completed?%>
  <%= render_card_partial('custom_info_view', session) if session.app.session_info_view %>
  <%= render_card_partial('completed_view', session) if session.app.session_completed_view && session.completed? %>
  <%= render_connection(session) %>
@@ -5,8 +5,8 @@

  <%= javascript_include_tag 'batch_connect_sessions', nonce: true %>

- <%= render partial: '/batch_connect/sessions/card/session_card_css' if SlurmMetrics::MetricsHelper.new.metrics_enabled?(@user_configuration) %>
- <%= render partial: '/batch_connect/sessions/card/session_card_js' if SlurmMetrics::MetricsHelper.new.metrics_enabled?(@user_configuration) %>
+ <%= render partial: '/batch_connect/sessions/card/session_card_css' if SlurmMetrics::MetricsHelper.new.session_metrics_enabled?(@user_configuration) %>
+ <%= render partial: '/batch_connect/sessions/card/session_card_js' if SlurmMetrics::MetricsHelper.new.session_metrics_enabled?(@user_configuration) %>

  <%= render partial: 'batch_connect/shared/breadcrumb',
  locals: {
@@ -5,7 +5,7 @@

  <%= javascript_include_tag 'batch_connect_sessions', nonce: true %>

- <%= render partial: '/batch_connect/sessions/card/session_card_css' if SlurmMetrics::MetricsHelper.new.metrics_enabled?(@user_configuration) %>
+ <%= render partial: '/batch_connect/sessions/card/session_card_css' if SlurmMetrics::MetricsHelper.new.session_metrics_enabled?(@user_configuration) %>

  <%= render partial: 'batch_connect/shared/breadcrumb',
  locals: {
1 change: 1 addition & 0 deletions docker-compose.yml
@@ -39,6 +39,7 @@ services:
  - ./config/local/ondemand.d:/etc/ood/config/ondemand.d
  - ./config/local/app_overrides:/etc/ood/config/apps/ood
  - ./dev/metrics:/var/www/ood/apps/plugins/metrics
+ - ./dev/session_metrics:/var/www/ood/apps/plugins/session_metrics
  ports:
  - "33000:443"
  expose:
