-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic switch to emergency mode when metrics unavailable #424
Conversation
pkg/hpa/service.go
Outdated
return true | ||
} | ||
|
||
if conditions[1].Type == "ScalingActive" && conditions[1].Status == "False" && conditions[1].Reason == "FailedGetResourceMetric" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FailedGetResourceMetric
What about other failures? e.g., the container resource metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sanposhiho should we only switch to emergency mode when main container resource metrics are missing? e.g. if istio-proxy metrics are unavailable for some reason but the main container metrics are still available, do we stay in auto mode since scaling is still active?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fallback to emergency only if all metrics are dead..? Maybe it's too aggressive to do a fallback if some, but not all metrics are dead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g., Let's say a user deploys a new version to the app container which contains a bug unfortunately, and all main containers are crash due to the bug. In this case, switching to emerge doesn't help, rather just become unnecessary cost increase.
So, ideally we have to detect which issue is solvable with increasing the replicas, and which isn't.
I know it's super difficult, though. But, if two container's metrics are missing (a metric server is dead, the service is overwhelmed and whole dead, etc), we can say increasing the replica might help solve the issue, with more confidence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must write a test.
You meant, at /before? I don't exactly remember whether it's intentional or I just forgot, but I guess it's intentional because the reconciliation should create the monitor VPA at the initialization phase, no? |
scalingActive := r.HpaService.CheckHpaMetricStatus(ctx, hpa) | ||
if scalingActive == false && tortoise.Spec.UpdateMode == autoscalingv1beta3.UpdateModeAuto && tortoise.Status.TortoisePhase == autoscalingv1beta3.TortoisePhaseWorking { | ||
//switch to emergency mode only when auto tortoise and already working | ||
tortoise.Spec.UpdateMode = autoscalingv1beta3.UpdateModeEmergency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sanposhiho somehow this line doesn't change the actual spec that k8sclient.get retrieves so in the test while tortoisephase changes to emergency, spec.updatemode is still auto. If this is the case, i think we need to create a new tortoisephase for auto scale up. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somehow this line doesn't change the actual spec
because UpdateTortoiseStatus
below only updates .status (literally), not .spec.
And, thinking about it, probably we shouldn't directly change the spec. It's breaking the boundary between the user's intent and the controller.
So,
If this is the case, i think we need to create a new tortoisephase for auto scale up.
yeah. I think, instead of changing the updatemode on spec, we should change something at .status. TortoisePhase
makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use the existing TortoisePhaseEmergency
, not a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use existing emergency phase, when tortoise is in auto mode it will reconcile and switch it back to working right? we can only use emergency phase if we disallow emergency to be input by user. If we want users to be able to manually switch to emergency, we need a new phase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually i am not sure what other scenarios will require emergency mode. If missing metrics is the only scenario then it would make sense to disallow manual switch. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use existing emergency phase, when tortoise is in auto mode it will reconcile and switch it back to working right?
No, it shouldn't. That depends on how you implement. The controller should keep putting TortoisePhaseEmergency
as long as HPA has a missing metric, even if the mode is auto.
Actually i am not sure what other scenarios will require emergency mode. If missing metrics is the only scenario then it would make sense to disallow manual switch. What do you think?
There could be many scenarios right now, e.g.,
- Mercari is going to have a Black Friday sale, and we want to pre-autoscale the number of replicas. (could be solved by init commit for scheduled scaling #370)
- A big upstream microservice is down, and the downstream microservice is getting traffic much less than usual now. The dynamic minReplicas feature keeps the replica number to some extent though, we want to make it bigger enough to prepare the big upstream microservice to come back fully.
- We're getting requests more than usual, unexpectedly (DDoS, etc), and the platform needs to scale up all the services.
I agree that we should enhance tortoise so that users don't have to use emergency mode at all eventually, but for now, we should have it, as a last resort.
api/v1beta3/tortoise_types.go
Outdated
@@ -256,7 +256,8 @@ const ( | |||
// TortoisePhaseBackToNormal means tortoise was in the emergency mode, and now it's coming back to the normal operation. | |||
// During TortoisePhaseBackToNormal, the number of replicas of workloads are gradually reduced to the usual value. | |||
// - Emergency → BackToNormal | |||
TortoisePhaseBackToNormal TortoisePhase = "BackToNormal" | |||
TortoisePhaseBackToNormal TortoisePhase = "BackToNormal" | |||
TortoisePhaseAutoEmergency TortoisePhase = "AutoEmergency" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I do not agree adding a new phase. It'll increase our maintenance effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will look at it again later i didnt see your comment earlier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sanposhiho Changed to use only emergency mode. Could you review the new test cases as well? I am also using currentMetric.ContainerResource.Current.Value.IsZero()
which is the default value for int32 though I am not sure what is supposed to be returned when metrics are down. I assume its the default value for the type but I'm not too familiar with golang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what is supposed to be returned when metrics are down.
I guess it's correct to assume HPA to return zero when something goes wrong happens with the metric. Or maybe they just keep the previous values (= they don't clean up the list with zero values when the HPA reconciliation fails).
You have to check HPA controller implementation to make 100% sure.
pkg/hpa/service.go
Outdated
@@ -765,3 +766,54 @@ func (c *Service) excludeExternalMetric(ctx context.Context, hpa *v2.HorizontalP | |||
|
|||
return newHPA | |||
} | |||
|
|||
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool { | |
func (c *Service) IsHpaMetricAvailable(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can we do like this instead? It would eliminate the logic from the controller layer.
func (c *Service) UpdateTortoisePhaseIfHPAIsUnhealthy(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler, tortoise *v1beta1.Tortoise) error {
...
if condition.Type == "ScalingActive" && condition.Status == "False" && condition.Reason == "FailedGetResourceMetric" {
//switch to Emergency mode since no metrics
logger.Info("HPA failed to get resource metrics, switch to emergency mode")
tortoise.Status.TortoisePhase = v1beta3.TortoisePhaseEmergency
return nil
}
}
Then, probably it'd make more sense to move inside the tortoise service, instead of hpa service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wont this add extra responsibilities to tortoise service if the service has to manage hpa conditions as well? Would it be better to add a UpdateTortoisePhaseIfHPAIsUnhealthy
function in tortoise service but still perform the check in hpa service? so we only pass in the scalingActive bool into tortoise service instead of the entire hpa. Im not sure which will be cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wont this add extra responsibilities to tortoise service if the service has to manage hpa conditions as well
Hmm, that's understandable. So,
Would it be better to add a UpdateTortoisePhaseIfHPAIsUnhealthy function in tortoise service but still perform the check in hpa service? so we only pass in the scalingActive bool into tortoise service instead of the entire hpa.
this direction looks good.
pkg/hpa/service.go
Outdated
if currenthpa == nil { | ||
logger.Info("empty HPA passed into status check, ignore") | ||
return true | ||
} | ||
|
||
if reflect.DeepEqual(currenthpa.Status, v2.HorizontalPodAutoscalerStatus{}) { | ||
logger.Info("HPA empty status, switch to emergency mode") | ||
return false | ||
} | ||
|
||
if currenthpa.Status.Conditions == nil { | ||
logger.Info("HPA empty conditions, switch to emergency mode") | ||
return false | ||
} | ||
|
||
if currenthpa.Status.CurrentMetrics == nil { | ||
logger.Info("HPA no metrics, switch to emergency mode") | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add error
in the return value and return err
in those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and, when err
, we should abort the reconciliation because HPA's object looks like not valid at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause an issue with the controller test that initializes hpa because in the test environment, creating a new hpa without manually updating the status will give an empty status. But we cannot force a status in the controller itself because that might affect the actual hpa in an actual environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, looks like it's expected that if HPA is just created, Status etc could be empty (since HPA controller might not yet notice a new HPA)
Can we early-return (or not run this function) when tortoise's phase is Initializing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i handled it to only run when tortoise is working. But there is a test case where tortoise is working but no hpa and tortoise reconcile should create a hpa. This will always fail and im not sure of a workaround for this specific test case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but because of that it will cause error since it generates hpa with empty status and tortoisephase is still working. A rather hackish way around it is if a new hpa is created from that function it returns a bool so we skip hpa status check on that reconcile loop and check it on future reconcile loops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, that's complicated.
Okay, I now agree that we don't return error from this function even when HPA has empty condition etc, and we just ignore, doing nothing at those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding the missing metrics concern as well, would skipping the check if theres a change in policy be a good workaround? as long as the metrics are populated by the next reconcile loop it should work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's two cases coming from UpdateHPASpecFromTortoiseAutoscalingPolicy
, IIUC:
- A new HPA is created: In this case, empty status and hence we skip. So, no issue.
- A new metric is added: In this case, a status doesn't have anything about new metric added at HPA spec, but this function doesn't check missing metrics by comparing spec and status, so no issue.
So, I don't see any issues. WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont see issues as well just that it may not be the cleanest way to code this? But I can't think of anything better at the moment
@sanposhiho Ready for review |
pkg/hpa/service.go
Outdated
@@ -497,32 +498,33 @@ func (c *Service) UpdateHPASpecFromTortoiseAutoscalingPolicy( | |||
givenHPA *v2.HorizontalPodAutoscaler, | |||
replicaNum int32, | |||
now time.Time, | |||
) (*autoscalingv1beta3.Tortoise, error) { | |||
) (*autoscalingv1beta3.Tortoise, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was suggesting at the end of the thread is a bit different.
Okay, I now agree that we don't return error from this function even when HPA has empty condition etc, and we just ignore, doing nothing at those cases.
#424 (comment)
So, we don't need hpaCreated
, and we can just ignore (= return true
) when HPA has empty condition at CheckHpaMetricStatus
. (I meant CheckHpaMetricStatus
doesn't have to have error
in a return value anymore)
pkg/hpa/service.go
Outdated
@@ -765,3 +768,51 @@ func (c *Service) excludeExternalMetric(ctx context.Context, hpa *v2.HorizontalP | |||
|
|||
return newHPA | |||
} | |||
|
|||
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I don't think we have to return error anymore.
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) (bool, error) { | |
// IsHPAScalingActive returns whether HPA is correctly working or not. | |
func (c *Service) IsHPAScalingActive(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool { |
pkg/hpa/service.go
Outdated
if currenthpa == nil { | ||
logger.Info("empty HPA passed into status check, ignore") | ||
return true, nil | ||
} | ||
|
||
if reflect.DeepEqual(currenthpa.Status, v2.HorizontalPodAutoscalerStatus{}) { | ||
return false, fmt.Errorf("HPA empty status, switch to emergency mode") | ||
} | ||
|
||
if currenthpa.Status.Conditions == nil { | ||
return false, fmt.Errorf("HPA empty conditions, switch to emergency mode") | ||
} | ||
|
||
if currenthpa.Status.CurrentMetrics == nil { | ||
return false, fmt.Errorf("HPA no metrics, switch to emergency mode") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, let's return true
for all those cases based on the assumption that HPA is just created and the HPA controller doesn't handle it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this might be a general issue that maybe HPA hasn't be reconciled yet when UpdateHPASpecFromTortoiseAutoscalingPolicy
just created the one.
In such cases, Tortoise controller cannot (or shouldn't) calculate the recommendation.
Can we varify those empty stuff at GetHPAOnTortoise
, and if empty, we abort the reconciliation? And, at the next reconciliation, HPA status is supposed to be filled by HPA controller, and we can reconcile.
func (c *Service) GetHPAOnTortoise(ctx context.Context, tortoise *autoscalingv1beta3.Tortoise) (*v2.HorizontalPodAutoscaler, bool, error) {
...
hpa := &v2.HorizontalPodAutoscaler{}
if err := c.c.Get(ctx, types.NamespacedName{Namespace: tortoise.Namespace, Name: tortoise.Status.Targets.HorizontalPodAutoscaler}, hpa); err != nil {
return nil, false, fmt.Errorf("failed to get hpa on tortoise: %w", err)
}
+ if hpa.Status == nil || hpa.Status.Conditions == nil || .... {
+ // Most likely, HPA is just created and not yet handled by HPA controller.
+ return nil, false, nil
+ }
...
}
hpa, isReady, err = r.HpaService.GetHPAOnTortoise(ctx, tortoise)
if err != nil {
logger.Error(err, "failed to get HPA", "tortoise", req.NamespacedName)
return ctrl.Result{}, err
}
+
+ if !isReady {
+ // HPA is correctly fetched, but looks like not ready yet. We won't be able to calculate things correctly, and hence stop the reconciliation here.
+ return ctrl.Result{}, nil
+ }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick question @sanposhiho How should we handle the test case where HPA is created? In the current test case, it is updated to the correct recommendation after one reconcile loop in tortoise conditions. Now it should just be the same tortoise since reconcile was aborted and tests only cover one reconcile loop? Would this be lacking from a testing perspective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we change the expectation of the reconciliation loop that creates HPA to stop the reconciliation at that point (if HPA hasn't yet updated immediately by HPA controller) and expect the next reconciliation would handle them appropriately.
So, we can just change the test case; change the test case's tortoise expectation not to get any update around the recommendation.
@sanposhiho fixed the above comments. Could you look through the create hpa test case too? in case i missed something |
pkg/hpa/service.go
Outdated
conditions := currenthpa.Status.Conditions | ||
currentMetrics := currenthpa.Status.CurrentMetrics | ||
|
||
if len(conditions) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to L777 above
if len(conditions) > 0 { |
pkg/hpa/service.go
Outdated
} | ||
} | ||
|
||
if len(currentMetrics) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to L777 above
if len(currentMetrics) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You overlooked this. L776 already checked this.
pkg/tortoise/tortoise.go
Outdated
|
||
func (c *Service) UpdateTortoisePhaseIfHPAIsUnhealthy(ctx context.Context, scalingActive bool, tortoise *v1beta3.Tortoise) (*v1beta3.Tortoise, error) { | ||
if !scalingActive && tortoise.Spec.UpdateMode == v1beta3.UpdateModeAuto && tortoise.Status.TortoisePhase == v1beta3.TortoisePhaseWorking { | ||
//switch to Emergency mode since no metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//switch to Emergency mode since no metrics | |
log.FromContext(ctx).Info("switching Tortoise to Emergency mode because looks like HPA isn't working properly") | |
s.recorder.Event(tortoise, corev1.EventTypeNormal, event.EmergencyModeEnabled, "Tortoise is in Emergency mode because it detected HPA isn't working appropriately. It will increase the number of replicas") |
@@ -232,13 +233,28 @@ func (r *TortoiseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ | |||
// VPA is ready, we mark all Vertical scaling resources as Running. | |||
tortoise = vpa.SetAllVerticalContainerResourcePhaseWorking(tortoise, now) | |||
|
|||
isReady := false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need to define this here?
pkg/hpa/service.go
Outdated
} | ||
} | ||
|
||
if len(currentMetrics) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You overlooked this. L776 already checked this.
}) | ||
It("HPA scalingactive no metrics", func() { | ||
runTest(filepath.Join("testdata", "reconcile-automatic-emergency-mode-hpa-no-metrics")) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add one more test case which tests that tortoise gets back to tortoisePhase: Working
when HPA is back to normal?
@sanposhiho made the changes PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one nit
Co-authored-by: Kensei Nakada <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty. Thanks!
What this PR does / why we need it:
This PR adds a check to HPA status condition and switches tortoise to emergency mode whenever hpa is unable to get resource metric
Which issue(s) this PR fixes:
#422
Fixes #
Special notes for your reviewer: