Improve the deletion process of the `last_error_event` from the error history of a machine #599

simcod · 2024-12-19T16:18:59Z

The `last_error_event` of a machine should be cleared from its error history after some time (about 6 days).

While deploying metal-stack on the new Supermicro nodes, we encountered the following problem: machines that were already allocated (integrated into a Kubernetes cluster) had the `last_error_event` of: unexpectedly received in state pxe booting.

The metal-api-liveliness is running in the `metal-control-plane` namespace. Its logs do not show any errors for machines:

{... "msg":"machine liveliness was requested"}
{... "msg":"machine liveliness evaluated","alive":x,"dead":0,"unknown":0,"errors":0}

However, listing the machines with `metalctl machine ls` returns some allocated machines with a ⭕ crashloop issue.

Gerrit91 · 2025-01-07T08:38:15Z

The last error event and the crashloop flag do not depend on each other. To me, this issue sounds like it is more about resetting the crashloop field, which should actually happen as soon as a machine reaches the phoned home state?

https://github.com/metal-stack/metal-api/blob/master/cmd/metal-api/internal/fsm/states/phoned-home.go#L41
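To make the distinction concrete, here is a minimal Go sketch of what I mean. It is not the code behind the link above; `eventType`, `eventContainer`, and `handleEvent` are hypothetical names for illustration only. It just shows the idea that reaching phoned home resets the crashloop field while leaving the last error event untouched.

```go
package main

import "fmt"

// eventType and the constants below are hypothetical names for this sketch.
type eventType string

const (
	eventPXEBooting eventType = "PXE Booting"
	eventPhonedHome eventType = "Phoned Home"
)

// eventContainer is an assumed, simplified view of a machine's provisioning
// state. It is not the metal-api type.
type eventContainer struct {
	CrashLoop      bool
	LastErrorEvent string
}

// handleEvent resets the crashloop field when the machine phones home.
// The last error event is deliberately left alone, since the two fields
// are independent of each other.
func handleEvent(c *eventContainer, e eventType) {
	if e == eventPhonedHome {
		c.CrashLoop = false
	}
}

func main() {
	c := &eventContainer{
		CrashLoop:      true,
		LastErrorEvent: "unexpectedly received in state pxe booting",
	}
	handleEvent(c, eventPhonedHome)
	fmt.Printf("crashloop=%v lastError=%q\n", c.CrashLoop, c.LastErrorEvent)
}
```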

If a last error event is present, metalctl indicates it with an exclamation mark, and there is a flag for defining how far into the past it looks.
