
Fix VPA for latest prom-operator #1497

Merged
merged 4 commits into master from fix-VPA-for-latest-prom-operator on Feb 7, 2024

Conversation

@QuentinBisson (Contributor) commented Feb 1, 2024

This PR should fix VPA issues that arise with the latest kube-prometheus-stack 9.0.0 release.

The issues come from the new scale subresource that was added to the Prometheus CRD so that the prometheus-operator can scale the number of shards of a Prometheus CR using tools like KEDA or the prometheus-adapter.

VPA also relies on the scale subresource to find the resources it needs to manage (and recommend metrics for); to do that, the VPA controllers walk up the owner chain to the topmost controller that has the scale subresource.

Prior to the prometheus-operator upgrade, the topmost controller was the StatefulSet controller, but it is now the Prometheus CR.

To be able to support slow CAPI upgrades, this PR uses the version of the observability bundle running on the MC to decide whether PMO should point the VPA at the Prometheus CR as the topmost controller (observability bundle >= 1.2.0, which ships kube-prometheus-stack >= 9.0.0) or at the StatefulSet controller (observability bundle < 1.2.0, with kube-prometheus-stack 8.x or older), as sketched below.
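For illustration, here is a minimal Go sketch of such a version gate. It is not the actual PMO code: the helper name, the choice of the Masterminds semver library, and the object names are assumptions; only the two targetRef shapes (Prometheus CR vs. StatefulSet) reflect what this PR describes.

package monitoring

import (
    "fmt"

    "github.com/Masterminds/semver/v3"
    autoscalingv1 "k8s.io/api/autoscaling/v1"
)

// vpaTargetRef (hypothetical helper) picks the topmost scalable controller
// the VPA should act on, based on the observability bundle version running
// on the MC: the Prometheus CR for bundles >= 1.2.0 (kube-prometheus-stack
// >= 9.0.0), otherwise the StatefulSet created by prometheus-operator.
func vpaTargetRef(bundleVersion, cluster string) (*autoscalingv1.CrossVersionObjectReference, error) {
    v, err := semver.NewVersion(bundleVersion)
    if err != nil {
        return nil, err
    }
    if !v.LessThan(semver.MustParse("1.2.0")) {
        // New behavior: the Prometheus CR carries the scale subresource.
        return &autoscalingv1.CrossVersionObjectReference{
            APIVersion: "monitoring.coreos.com/v1",
            Kind:       "Prometheus",
            Name:       cluster,
        }, nil
    }
    // Old behavior: the StatefulSet is the topmost scalable controller.
    return &autoscalingv1.CrossVersionObjectReference{
        APIVersion: "apps/v1",
        Kind:       "StatefulSet",
        Name:       fmt.Sprintf("prometheus-%s", cluster),
    }, nil
}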

Checklist

I have:

  • Described why this change is being introduced
  • Separated out refactoring/reformatting in a dedicated PR
  • Updated changelog in CHANGELOG.md

@QuentinBisson QuentinBisson self-assigned this Feb 1, 2024
@QuentinBisson QuentinBisson requested a review from a team as a code owner February 1, 2024 20:15
@hervenicol (Contributor) commented:

As discussed, this will be hard to test without actually creating the observability-bundle release 1.2.0.
Maybe we should at least test a dev branch of this code that bypasses the release version check, so we can validate that the VPA is correctly created and works well when acting on the Prometheus resource?

@QuentinBisson (Contributor, Author) replied:

Oh, we can test on a dev branch for sure, or change the bundle version to 1.1.0 to check that it works, yes :)

@Gacko (Member) commented Feb 5, 2024

This is currently breaking cluster creation / deletion on gaia:

% kubectl logs --namespace monitoring prometheus-meta-operator-775bf76955-ldvhr
{"caller":"github.com/giantswarm/k8sclient/v7/pkg/k8srestconfig/rest_config.go:141","level":"debug","message":"creating in-cluster REST config","time":"2024-02-05T10:05:28.434109+00:00"}
{"caller":"github.com/giantswarm/k8sclient/v7/pkg/k8srestconfig/rest_config.go:148","level":"debug","message":"created in-cluster REST config","time":"2024-02-05T10:05:28.434249+00:00"}
panic: {"kind":"invalidConfigError","annotation":"verticalpodautoscaler.Config.Installation must not be empty","stack":[{"file":"/root/project/service/controller/resource/monitoring/verticalpodautoscaler/resource.go","line":60},{"file":"/root/project/service/controller/resource/resource.go","line":279},{"file":"/root/project/service/controller/clusterapi/controller.go","line":69},{"file":"/root/project/service/service.go","line":180}]}

goroutine 1 [running]:
main.mainE.func1(0xc0003a2c40?)
	/root/project/main.go:58 +0x185
github.com/giantswarm/microkit/command/daemon.(*command).Execute(0xc0000ada40, 0x0?, {0x0?, 0x0?, 0x0?})
	/go/pkg/mod/github.com/giantswarm/[email protected]/command/daemon/command.go:105 +0x13a
github.com/spf13/cobra.(*Command).execute(0xc000032000, {0xc00014faa0, 0x2, 0x2})
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc000032600)
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x2284c58?)
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039 +0x13
main.mainE({0x227e350?, 0x32de740})
	/root/project/main.go:147 +0xaab
main.main()
	/root/project/main.go:24 +0x25

Could you please not test such changes on gaia? I thought we had clusters like gauss for testing MC components. gaia is stable-testing, and as such its MC components should be stable so the WC lifecycle stays stable.
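For context: the panic above is a startup config-validation guard tripping, apparently because the new VPA code path was wired in without its Installation setting; the follow-up commit below ("fix forgotten verticalpodautoscaler.Config") addresses it. A hedged reconstruction of the kind of guard involved, following the usual giantswarm/microerror pattern (only the field name comes from the panic annotation; the rest is assumed, not the repository's exact code):

package verticalpodautoscaler

import "github.com/giantswarm/microerror"

var invalidConfigError = &microerror.Error{
    Kind: "invalidConfigError",
}

// Config is reduced here to the field named in the panic; the real struct
// carries more dependencies.
type Config struct {
    Installation string
}

type Resource struct {
    installation string
}

// New refuses to build the resource when Installation is unset, producing
// the "%T.Installation must not be empty" annotation seen in the logs.
func New(config Config) (*Resource, error) {
    if config.Installation == "" {
        return nil, microerror.Maskf(invalidConfigError, "%T.Installation must not be empty", config)
    }
    return &Resource{installation: config.Installation}, nil
}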

@hervenicol hervenicol force-pushed the fix-VPA-for-latest-prom-operator branch from 3b87d45 to aad9fa7 Compare February 5, 2024 14:59
QuentinBisson and others added 2 commits February 6, 2024 12:43
* fix forgotten verticalpodautoscaler.Config
* move a const to a more suitable place


Co-authored-by: Herve Nicol <[email protected]>
@hervenicol hervenicol force-pushed the fix-VPA-for-latest-prom-operator branch from 8d9b502 to f419f37 Compare February 6, 2024 11:43
testFunc := func(v interface{}) (interface{}, error) {
    c := Config{
        Logger:    logger,
        K8sClient: k8sClient,
        VpaClient: vpa_clientsetfake.NewSimpleClientset(),
        Provider: cluster.Provider{
            Kind:   "aws",
            Flavor: "vintage",
@QuentinBisson (Contributor, Author) commented on this diff:

I think this is incorrect. It should be in the flavor loop, right?

(Contributor) replied:

Looks like it's already in the flavor loop. So I removed the Provider part of the config.
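To make the "flavor loop" remark concrete, here is a hedged sketch of what such a table-driven test can look like; the type and names are local stand-ins for illustration, not the repository's actual test code:

package monitoring

import "testing"

// Provider is a local stand-in for cluster.Provider so this sketch is
// self-contained.
type Provider struct {
    Kind   string
    Flavor string
}

func TestVPAResourcePerFlavor(t *testing.T) {
    for _, flavor := range []string{"vintage", "capi"} {
        t.Run(flavor, func(t *testing.T) {
            // The Config is built inside the loop, so hard-coding a
            // Provider outside of it would be redundant.
            p := Provider{Kind: "aws", Flavor: flavor}
            _ = p // the real test would build the Config here
        })
    }
}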

@hervenicol (Contributor) left a review:

I did my best to test by changing the observability-bundle minimum release on a separate branch.
Seems to work 😁

* Fix VPA for latest prom-operator

* Update CHANGELOG.md
@QuentinBisson QuentinBisson enabled auto-merge (squash) February 7, 2024 09:05
@QuentinBisson QuentinBisson merged commit 7367dca into master Feb 7, 2024
4 of 5 checks passed
@QuentinBisson QuentinBisson deleted the fix-VPA-for-latest-prom-operator branch February 7, 2024 09:06