Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and update scripts for generating grafana dashboards #4382

Merged

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Nov 8, 2023

Tracking issue

Closes #4346

Describe your changes

  • Rename dashboard generation files so that grafana lib's generate-dashboard command works correctly. Previously we were getting grafanalib._gen.DefinitionError: Definition /tmp/flyte/stats/flytepropeller_dashboard.py does not define a variable '/tmp/flyte/stats/flytepropeller_dashboard'
  • Update generation scripts where the metric names have changed.
  • Add a multiply by 100 on some graphs that where supposed to show percentage.
  • Ensure RefIds are unique. I was getting errors from grafana when trying to view these dashboards.
  • Update the checked in dashboard jsons
  • I based all these changes on a flyte-core deployment. The metric names might be difference on flyte-binary
  • Changed the aggregation from sum to max on the flyteadmin latencies. Sum would always return a blank graph when using more than one flyteadmin replica.

Check all the applicable boxes

  • I updated the documentation accordingly - I think the documentation was already correct
  • All new and existing tests passed - perhaps we could add a test to ensure make stats works but probably it will be difficult to check that all the metric names are correct
  • All commits are signed-off.

Screenshots

image

Note to reviewers

I went through all the graphs in the dashboards to confirm either they have data or there is a valid reason for them not to show anything and that the metrics they use are exposed by flyte.

@Tom-Newton Tom-Newton force-pushed the tomnewton/updated_grafana_dashboards branch from f436ffc to de37547 Compare November 8, 2023 12:17
@eapolinario
Copy link
Contributor

This is amazing! Thank you!

One thing that's still not clear is how these dashboards are published. I'm not sure if we were doing it manually or if there was some automation somewhere (I couldn't find anything in the flyte main repo).

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Nov 8, 2023

One thing that's still not clear is how these dashboards are published. I'm not sure if we were doing it manually or if there was some automation somewhere (I couldn't find anything in the flyte main repo).

According to this README they are published manually. https://github.com/flyteorg/flyte/blob/master/stats/README.md

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Nov 10, 2023

I've just discovered a problem with the flyteadmin dashboard. It seems like latency metrics are not aggregated correctly if you are running flyteadmin with more than one replica. I will try to push a fix.

Copy link
Contributor

@eapolinario eapolinario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you!

Copy link

codecov bot commented Nov 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (712ee8e) 59.81% compared to head (13e70f2) 60.28%.

❗ Current head 13e70f2 differs from pull request most recent head 54d2242. Consider uploading reports for the commit 54d2242 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4382      +/-   ##
==========================================
+ Coverage   59.81%   60.28%   +0.46%     
==========================================
  Files         632      562      -70     
  Lines       53644    40314   -13330     
==========================================
- Hits        32089    24303    -7786     
+ Misses      19029    13657    -5372     
+ Partials     2526     2354     -172     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@eapolinario eapolinario merged commit 4f2a164 into flyteorg:master Nov 14, 2023
44 checks passed
Copy link

welcome bot commented Nov 14, 2023

Congrats on merging your first pull request! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Grafana dashboards are outdated and make stats is failing
2 participants