Recover controlled items after crash #26

stv0g · 2021-11-25T14:18:16Z

In GitLab by @skolen on Nov 25, 2021, 15:18

After a crash, a VILLAScontroller should be able to recover all the items that it controlled before (not only the default items).

Simple approach: Checkpoint the controlled items in a file or DB whenever the set of controlled items changes, e.g. if a new simulator is created. This checkpoint can then be used to recover from a crash.

Question: Where can the checkpoint be stored safely so that it is not lost in a crash? A separate checkpoint ConfigMap in k8s for each controller?

stv0g · 2021-11-25T14:23:38Z

Ideally we would make this dependent on the component, as the state of a component is usually already persisted in some form by the component itself.

Examples:

The state of our Kubernetes simulators is persisted by the Kubernetes API.
The state of running OPAL-RT / RTDS simulations can be retrieved via the APIs offered by OPAL-RT and RTDS.

This can be nicely implemented in form of a reconciliation loop where the controller attempts to synchronize its internal state with the state of the external components.

See also here: https://cloud.redhat.com/blog/kubernetes-operators-best-practices
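
A minimal sketch of one pass of such a reconciliation loop, assuming both the controller's internal view and the externally observed state can be represented as a mapping from component UUID to a state value. The names `reconcile`, `ReconcileResult`, `known` and `observed` are made up for illustration and are not taken from the VILLAScontroller code base:

```python
from dataclasses import dataclass, field


@dataclass
class ReconcileResult:
    """Summary of what a single reconciliation pass changed."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)


def reconcile(known: dict, observed: dict) -> ReconcileResult:
    """Synchronize `known` (internal view) with `observed` (external view).

    Both arguments map a component UUID to its state. `known` is updated
    in place so that it matches `observed` afterwards.
    """
    result = ReconcileResult()

    # Components that exist externally but are unknown internally,
    # e.g. because the controller was restarted: adopt them.
    for uuid in observed.keys() - known.keys():
        known[uuid] = observed[uuid]
        result.added.add(uuid)

    # Components that are remembered internally but no longer exist
    # externally, e.g. a Pod died while the controller was down: drop them.
    for uuid in known.keys() - observed.keys():
        del known[uuid]
        result.removed.add(uuid)

    # For everything else, the external state wins.
    for uuid in known.keys() & observed.keys():
        known[uuid] = observed[uuid]

    return result
```

After a crash the controller would start with an empty `known` mapping, so the first pass adopts every component reported by the external API. No separate checkpoint is needed in that case.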

Persisting the state in a dedicated database owned by the controller only asks for trouble as the state of this database can (and will) get out of sync with the real world (e.g. Kubernetes Pod dies while the controller is not running or a user stops OPAL-RT while the controller is not running).

stv0g · 2021-11-25T15:11:47Z

In GitLab by @skolen on Nov 25, 2021, 16:11

Thanks for the quick comment.

The question @iripiri and I had was rather: How does a restarted controller know or learn which "external" components it controlled before the crash? Currently this information is not saved persistently. If the controller knows which components it controlled before the crash, it can reconcile their status through the respective APIs.

stv0g · 2021-11-25T15:24:08Z

I assume this would be the task of the manager components.

In the current state we always hard-code the manager components via the configuration file (e.g. ConfigMap).

The reconciliation loop should then be implemented in the manager component which contacts the APIs to sync its managed components with the components seen by the API.

I have already implemented this for the VILLASnode and VILLASrelay manager components here with a periodically called reconcile() function:

Cheers,
Steffen

stv0g · 2021-11-25T15:26:47Z

For the Kubernetes simulators/manager:

We already store the following meta information in the Kubernetes metadata associated with the Kubernetes Job resources:

Simulator UUID
Manager UUID
Location
Realm
Owner

The manager would need to read in this metadata to properly reconstruct the component using the same UUIDs and other data.

If there is more information required, we could also add it to the Job metadata.
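
A rough sketch of how a manager could rebuild its simulators from that metadata, assuming the Job objects have already been fetched from the Kubernetes API and converted to plain dictionaries. The annotation keys, the `RecoveredSimulator` class and the `recover_simulators` function are hypothetical. The actual key names used by the controller are not given in this thread:

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical annotation keys, chosen for illustration only.
KEY_SIMULATOR_UUID = "example.org/simulator-uuid"
KEY_MANAGER_UUID = "example.org/manager-uuid"
KEY_LOCATION = "example.org/location"
KEY_REALM = "example.org/realm"
KEY_OWNER = "example.org/owner"

REQUIRED_KEYS = (
    KEY_SIMULATOR_UUID,
    KEY_MANAGER_UUID,
    KEY_LOCATION,
    KEY_REALM,
    KEY_OWNER,
)


@dataclass(frozen=True)
class RecoveredSimulator:
    uuid: str
    manager_uuid: str
    location: str
    realm: str
    owner: str


def recover_simulators(jobs: Iterable[dict], manager_uuid: str) -> List[RecoveredSimulator]:
    """Rebuild simulator descriptions from the metadata of existing Jobs.

    Jobs that lack any of the expected annotations, or that belong to a
    different manager, are skipped.
    """
    recovered = []
    for job in jobs:
        annotations = job.get("metadata", {}).get("annotations") or {}
        if any(key not in annotations for key in REQUIRED_KEYS):
            continue  # not a Job created by a manager
        if annotations[KEY_MANAGER_UUID] != manager_uuid:
            continue  # owned by another manager
        recovered.append(
            RecoveredSimulator(
                uuid=annotations[KEY_SIMULATOR_UUID],
                manager_uuid=annotations[KEY_MANAGER_UUID],
                location=annotations[KEY_LOCATION],
                realm=annotations[KEY_REALM],
                owner=annotations[KEY_OWNER],
            )
        )
    return recovered
```

Because the simulator UUID is read back from the Job rather than generated anew, a restarted manager would expose the recovered simulators under the same identity as before the crash.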

stv0g mentioned this issue Dec 2, 2022

add manager for simple kubernetes job creation #47

Open
