Improve the deletion process of the last_error_event from the error history of a machine #599

Open
simcod opened this issue Dec 19, 2024 · 1 comment

Comments

@simcod
Contributor

simcod commented Dec 19, 2024

The last_error_event of a machine should be cleared from the issue history after some time (about 6 days).

While deploying metal-stack on the new Supermicro nodes, we encountered the following problem: machines that were already allocated (integrated into a Kubernetes cluster) had a last_error_event of : unexpectedly received in state pxe booting.

The metal-api-liveliness is running in the metal-control-plane namespace. The logs do not show any errors for machines.

{... "msg":"machine liveliness was requested"}
{... "msg":"machine liveliness evaluated","alive":x,"dead":0,"unknown":0,"errors":0}

However, listing the machines with metalctl machine ls returns some allocated machines with a ⭕ crashloop issue.

@Gerrit91
Contributor

Gerrit91 commented Jan 7, 2025

The last event error and the crashloop flag do not depend on each other. To me it sounds like this issue is more about resetting the crashloop field, which should actually happen as soon as a machine reaches the phoned home state?

https://github.com/metal-stack/metal-api/blob/master/cmd/metal-api/internal/fsm/states/phoned-home.go#L41
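
As a rough illustration of the point about independence, the sketch below resets a crashloop flag when a phoned home event arrives and leaves the last error untouched. It is not the code behind the link above: `EventContainer`, `EventType` and `OnEvent` are hypothetical names, and the real state machine may be structured quite differently.

```go
package main

import "fmt"

// EventType is a hypothetical name for a provisioning event.
type EventType string

const (
	PXEBooting EventType = "PXE Booting"
	PhonedHome EventType = "Phoned Home"
)

// EventContainer is a hypothetical per-machine record holding the two
// fields discussed in this thread.
type EventContainer struct {
	CrashLoop      bool
	LastErrorEvent string
}

// OnEvent applies one incoming event. Only the crashloop flag reacts to
// phoned home; the last error event is deliberately left alone.
func (c *EventContainer) OnEvent(e EventType) {
	switch e {
	case PhonedHome:
		// The machine is up and reporting, so it is no longer crash looping.
		c.CrashLoop = false
	}
}

func main() {
	c := &EventContainer{
		CrashLoop:      true,
		LastErrorEvent: "unexpectedly received in state pxe booting",
	}
	c.OnEvent(PhonedHome)
	fmt.Println(c.CrashLoop, c.LastErrorEvent != "") // prints "false true"
}
```

Under this assumption, an allocated machine that keeps phoning home should lose the crashloop marker even while an older error event is still recorded.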

If a last error event is shown, metalctl indicates this with an exclamation mark, and there is a flag for defining how far into the past it looks.
