Merge pull request #3 from hmdc/metrics_help_and_configuration
Metrics help and configuration
Showing 21 changed files with 137 additions and 15 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,3 @@ | ||
# File managed by Puppet - DO NOT EDIT | ||
--- | ||
pinned_apps: | ||
- sys/Jupyter | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,11 @@ | ||
session_metrics_enabled: true | ||
custom_pages: | ||
metrics: | ||
rows: | ||
- columns: | ||
- width: 12 | ||
- width: 8 | ||
widgets: | ||
- "metrics/metrics_help" | ||
- width: 4 | ||
widgets: | ||
- "metrics/metrics" |
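The layout change above replaces a single full-width column with an 8-unit help column next to a 4-unit metrics column. A minimal sketch of a sanity check for such a layout follows. It assumes column widths are units of a 12-unit grid, and the `layout` dictionary, `row_widths`, and `overflowing_rows` are hypothetical names introduced only for illustration; they are not part of OnDemand.

```python
# Hypothetical in-memory copy of the `custom_pages.metrics` layout shown in the diff.
layout = {
    "rows": [
        {
            "columns": [
                {"width": 8, "widgets": ["metrics/metrics_help"]},
                {"width": 4, "widgets": ["metrics/metrics"]},
            ]
        }
    ]
}

GRID_UNITS = 12  # assumption: widths are units of a 12-unit grid


def row_widths(page):
    """Return the total column width of each row in a custom page layout."""
    return [sum(col["width"] for col in row["columns"]) for row in page["rows"]]


def overflowing_rows(page, limit=GRID_UNITS):
    """Return the indexes of rows whose columns add up to more than `limit` units."""
    return [i for i, total in enumerate(row_widths(page)) if total > limit]


print(row_widths(layout))
print(overflowing_rows(layout))
```

Under that assumption, a row whose widths total more than 12 would be reported by `overflowing_rows`, which could be a cheap check to run before the file is pushed through Puppet.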
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,15 @@ | ||
help_menu: | ||
- group: "Metrics" | ||
- title: "Metrics" | ||
icon: "fas://chart-bar" | ||
page: "metrics" | ||
custom_pages: | ||
metrics: | ||
rows: | ||
- columns: | ||
- width: 12 | ||
- width: 8 | ||
widgets: | ||
- "metrics/metrics_help" | ||
- width: 4 | ||
widgets: | ||
- "metrics/metrics" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
<div class="metrics-help"> | ||
<h3>Metrics Help</h3> | ||
<div class="introduction"> | ||
<p>When submitting jobs to a high-performance computing (HPC) cluster, it is essential to understand how efficiently your jobs utilize the available resources. Proper resource usage ensures optimal job performance and minimizes resource waste, which can help reduce job wait times and improve overall system efficiency.</p> | ||
<p>To help users analyze their jobs, we have created a set of metrics widgets to surface job analytics from the Slurm Workload Manager.</p> | ||
<p>The Slurm <code>sacct</code> command is used to retrieve information about job resource usage and performance. The <code>sacct</code> command provides details such as job exit state, CPU time, memory usage, GPU allocation, and job runtime, which are used to compute efficiency metrics.</p> | ||
</div> | ||
|
||
<div class="description"> | ||
<h5>FairShare</h5> | ||
<p>FairShare is a scheduling policy used in HPC environments to ensure equitable access to resources among users and projects. It prioritizes job scheduling based on a user’s recent resource usage relative to their allocated share of the cluster. Users or groups that have consumed fewer resources recently are given higher priority, while those who have utilized more resources may experience lower priority until their usage balances out.</p> | ||
<p>The FairShare system helps maintain a balanced workload across the cluster, preventing resource monopolization by any single user or group and promoting fair access for all users.</p> | ||
<p><a target="_blank" href="https://docs.rc.fas.harvard.edu/kb/fairshare/">FASRC Fairshare info</a></p> | ||
|
||
<h5>Completed Jobs by State</h5> | ||
<p>This widget provides an overview of the outcomes of completed jobs submitted to the HPC cluster. It categorizes jobs based on their final state, allowing users to quickly assess the overall success and failure rates of their workloads. The widget summarizes the following job states:</p> | ||
<ul> | ||
<li><strong>Completed</strong>: Jobs that finished without errors and met all resource and runtime requirements.</li> | ||
<li><strong>Time Out</strong>: Jobs that exceeded their allocated time limit and were terminated by the scheduler.</li> | ||
<li><strong>Canceled</strong>: Jobs that were manually canceled by the user or an administrator before completion.</li> | ||
<li><strong>OOM - Out of Memory</strong>: Jobs that were terminated due to exceeding their allocated memory limits.</li> | ||
<li><strong>Failed</strong>: Jobs that encountered errors and did not complete successfully.</li> | ||
</ul> | ||
<p>The information helps users identify patterns and potential issues in their job submissions, enabling them to make informed adjustments to improve job success rates and optimize resource usage.</p> | ||
|
||
<h5>Summary Job Stats</h5> | ||
<p>The Summary Job Stats widget provides a high-level overview of job resource utilization and efficiency for jobs submitted over a period of time. It tracks key metrics such as CPU Efficiency, GPU Efficiency, Memory Efficiency, and Time Efficiency, helping users understand how well their jobs are utilizing the HPC cluster's resources.</p> | ||
<p>For each metric, the widget displays the average resources used compared to the resources allocated, along with relevant job execution data. This allows users to quickly identify inefficiencies and make necessary adjustments to improve resource usage, reduce queue times, and enhance overall job performance.</p> | ||
<p>By monitoring these metrics, users can ensure their jobs run efficiently, making better use of the cluster's computational resources and contributing to a fair and balanced workload distribution.</p> | ||
|
||
<strong>Understanding Efficiency Metrics</strong> | ||
<p>The efficiency metrics discussed in this guide are expressed as percentages, representing how effectively your job utilizes the allocated resources. A higher percentage indicates better resource utilization, while lower values suggest inefficiencies.</p> | ||
<p>Small efficiency values often indicate that the requested resources are not being fully utilized. In such cases, it is recommended to adjust your job's resource requests — for example, reducing the number of CPU cores, GPUs, or memory — to better match the actual needs of your workload. Optimizing resource requests can help improve job efficiency, reduce wait times in the queue, and make more resources available for other users in the HPC cluster.</p> | ||
|
||
<div class="metric-section" id="cpu-efficiency"> | ||
<p><strong>CPU Efficiency: </strong>CPU efficiency measures how effectively a job uses the allocated CPU cores. High CPU efficiency means that the job is utilizing most of the CPU resources during its runtime.</p> | ||
<p>To improve your CPU efficiency, optimize code for parallel processing, minimize blocking I/O operations, and use monitoring tools like top or htop to adjust CPU core allocation for better resource utilization.</p> | ||
</div> | ||
|
||
<div class="metric-section" id="gpu-efficiency"> | ||
<p><strong>GPU Efficiency: </strong>GPU efficiency cannot currently be calculated from the data available from Slurm. Only the number of GPUs allocated and the total walltime are available.</p> | ||
</div> | ||
|
||
<div class="metric-section" id="memory-efficiency"> | ||
<p><strong>Memory Efficiency: </strong>Memory efficiency refers to how effectively a job uses the allocated RAM. Inefficient memory usage can lead to job failures or excessive swapping, reducing performance.</p> | ||
<p>To improve your memory efficiency, optimize data structures to reduce memory usage, avoid loading unnecessary data, and request only the required memory when submitting jobs.</p> | ||
</div> | ||
|
||
<div class="metric-section" id="time-efficiency"> | ||
<p><strong>Time Efficiency: </strong>Time efficiency measures how quickly a job completes relative to the resources allocated. Poor time efficiency can indicate that resources are not being used effectively.</p> | ||
<p>To improve your time efficiency, parallelize tasks to shorten runtime, use optimized libraries and algorithms, avoid over-allocating idle resources, and test with smaller datasets to fine-tune job parameters before scaling up.</p> | ||
</div> | ||
</div> | ||
|
||
</div> |
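The help text above describes each efficiency metric as resources used compared to resources allocated, expressed as a percentage. A minimal sketch of that arithmetic follows. It assumes the inputs come from the `sacct` fields `Elapsed`, `TotalCPU`, `AllocCPUS`, and `Timelimit`, and that memory values have already been converted to megabytes (the keys `max_rss_mb` and `req_mem_mb` are hypothetical). The exact formulas used by the widgets are not shown in this commit, so the ones below, in particular time efficiency as elapsed time over the time limit, are assumptions.

```python
def parse_duration(text):
    """Convert a Slurm duration such as '1-02:03:04', '02:03:04' or '03:04.500' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    while len(parts) < 3:  # pad missing hours (and minutes) with zeros
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def percent(used, allocated):
    """Return used/allocated as a percentage, or None when nothing was allocated."""
    if allocated <= 0:
        return None
    return round(100.0 * used / allocated, 1)


def efficiencies(job):
    """Compute assumed CPU, memory and time efficiency for one job record."""
    elapsed = parse_duration(job["Elapsed"])
    return {
        # CPU time actually consumed vs. core-seconds reserved
        "cpu": percent(parse_duration(job["TotalCPU"]), elapsed * job["AllocCPUS"]),
        # peak memory used vs. memory requested (both in MB, hypothetical keys)
        "memory": percent(job["max_rss_mb"], job["req_mem_mb"]),
        # wall time used vs. requested time limit
        "time": percent(elapsed, parse_duration(job["Timelimit"])),
    }


# Illustrative job record; the values are made up for the example.
job = {
    "Elapsed": "01:00:00",
    "TotalCPU": "02:00:00",
    "AllocCPUS": 4,
    "Timelimit": "04:00:00",
    "max_rss_mb": 2048,
    "req_mem_mb": 8192,
}
print(efficiencies(job))
```

In this made-up record the job used two CPU-hours out of four reserved core-hours, a quarter of its requested memory, and a quarter of its time limit, which is the kind of result the help text suggests answering with smaller resource requests.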
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
## IQSS Session Metrics Customizations For OnDemand | ||
The session card changes display the job's CPU, Memory, and Time efficiency metrics. | ||
When expanded, we display a table with more information about the job. | ||
|
||
### Implementation Summary | ||
The implementation consists of two components: | ||
- OOD Session Helper extension | ||
- ./initializers/session_helper_extension.rb | ||
- Session card updates and required new templates | ||
- ./views/batch_connect/sessions | ||
|
||
For OOD v3.x, the version that we currently have in production, we need to use two different templates: | ||
- `./views/batch_connect/sessions/index.v3.html.erb` => `./views/batch_connect/sessions/index.html.erb` | ||
- `./views/batch_connect/sessions/index.v3.js.erb` => `./views/batch_connect/sessions/index.js.erb` | ||
- `./views/batch_connect/sessions/index.html.erb` => ignore this file for v3.x | ||
|
||
### Deployment | ||
Deployment uses the OnDemand customization feature, whose default location is `/etc/ood/config/apps/dashboard`: | ||
- copy `./initializers/session_helper_extension.rb` into `/etc/ood/config/apps/dashboard/initializers` | ||
- copy `./views/batch_connect` into `/etc/ood/config/apps/dashboard/views` | ||
|
||
Restart the OnDemand application for the customizations to take effect. | ||
|
||
### Deployment With FASRC Puppet | ||
The widget components need to be deployed using the FASRC Puppet control repo. We already use the OOD Puppet module feature that adds files to the OOD dashboard location to add or extend functionality: | ||
`openondemand::apps_config_source:` | ||
|
||
The folder that we are deploying is: `site-modules/profile/files/openondemand/common/apps_config` | ||
|
||
In the Puppet control repo, we need to add the files for the two components to the following folders: | ||
- site-modules/profile/files/openondemand/common/apps_config/dashboard/initializers | ||
- site-modules/profile/files/openondemand/common/apps_config/dashboard/views | ||
|
||
The session card updates are disabled by default. To enable them, add the following property to the root configuration or a profile: | ||
https://github.com/hmdc/ondemand_development/blob/main/dev/session_metrics/metrics.yml | ||
|
||
After the updates, the changes can be seen in the Interactive Sessions page: `/pun/sys/dashboard/batch_connect/sessions` |
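The two copy steps in the Deployment section of the README above can be sketched as a small script. This is an illustration only: the `deploy` function is a hypothetical helper, the demo runs against a temporary directory with placeholder files rather than `/etc/ood/config/apps/dashboard`, and it does not handle the v3.x template selection or the OnDemand restart.

```python
import shutil
import tempfile
from pathlib import Path


def deploy(source_root, dashboard_config):
    """Copy the initializer and the batch_connect views into the dashboard config folder.

    Returns the deployed files as POSIX paths relative to `dashboard_config`.
    """
    source_root = Path(source_root)
    dashboard_config = Path(dashboard_config)

    # Step 1: copy the Session Helper extension into `initializers`.
    init_dir = dashboard_config / "initializers"
    init_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_root / "initializers" / "session_helper_extension.rb", init_dir)

    # Step 2: copy `views/batch_connect` into `views`, merging with existing files.
    shutil.copytree(
        source_root / "views" / "batch_connect",
        dashboard_config / "views" / "batch_connect",
        dirs_exist_ok=True,  # requires Python 3.8+
    )

    return sorted(
        p.relative_to(dashboard_config).as_posix()
        for p in dashboard_config.rglob("*")
        if p.is_file()
    )


# Demo against a throwaway directory tree with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "session_metrics"
    (src / "initializers").mkdir(parents=True)
    (src / "initializers" / "session_helper_extension.rb").write_text("# placeholder\n")
    sessions = src / "views" / "batch_connect" / "sessions"
    sessions.mkdir(parents=True)
    (sessions / "index.html.erb").write_text("<!-- placeholder -->\n")

    deployed = deploy(src, Path(tmp) / "dashboard")
    print(deployed)
```

Pointing `dashboard_config` at the real configuration folder would need write access to that location, and in the FASRC setup the same files are delivered through the Puppet control repo instead.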
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
session_metrics_enabled: true | ||
|
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.