Recover controlled items after crash #26

Open
stv0g opened this issue Nov 25, 2021 · 4 comments

stv0g commented Nov 25, 2021

In GitLab by @skolen on Nov 25, 2021, 15:18

After a crash, a VILLAScontroller should be able to recover all the items that it controlled before (not only the default items).

Simple approach: Checkpoint the controlled items in a file or DB whenever the set of controlled items changes, e.g. when a new simulator is created. This checkpoint can then be used to recover after a crash.

Question: Where to safely store the Checkpoint so that it is not destroyed with a crash? Separate Checkpoint config map in k8s for each controller?
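A minimal sketch of the file-based checkpoint idea, assuming the controlled items can be identified by UUID strings. The names `save_checkpoint` and `load_checkpoint` and the JSON file format are hypothetical and not part of VILLAScontroller; where the file should live (e.g. a volume or ConfigMap) is left open, as in the question above.

```python
import json
import os
import tempfile


def save_checkpoint(path, items):
    """Atomically write the set of controlled item UUIDs to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first, so a crash
    # in the middle of writing never leaves a half-written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(sorted(items), f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the set of checkpointed UUIDs, or an empty set if none exists."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()
```

The controller would call `save_checkpoint()` each time an item is added or removed, and `load_checkpoint()` once at start-up.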

stv0g commented Nov 25, 2021

Ideally we would make this dependent on the component,
as the state of a component is usually already persisted in some way by the component itself.

Examples:

  • the state of our Kubernetes simulators is persisted by the Kubernetes API.
  • the state of running OPAL-RT / RTDS simulations can be retrieved by the APIs offered by OPAL-RT and RTDS.

This can be nicely implemented in the form of a reconciliation loop in which the controller attempts to synchronize its internal state with the state of the external components.

See also here: https://cloud.redhat.com/blog/kubernetes-operators-best-practices

Persisting the state in a dedicated database owned by the controller only asks for trouble as the state of this database can (and will) get out of sync with the real world (e.g. Kubernetes Pod dies while the controller is not running or a user stops OPAL-RT while the controller is not running).
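To illustrate the reconciliation idea, here is a rough sketch of a single reconciliation step. It assumes both the controller's internal view and the externally observed state can be expressed as a mapping from component UUID to a state string; `reconcile` and `sync_state` are hypothetical names, and fetching the observed state from Kubernetes, OPAL-RT or RTDS is left out.

```python
def reconcile(known, observed):
    """Compare the internal view with the externally observed state.

    Both arguments map component UUID -> state string.
    Returns (to_add, to_remove, to_update).
    """
    # Components that exist externally but are unknown to the controller.
    to_add = {u: s for u, s in observed.items() if u not in known}
    # Components the controller still tracks but which no longer exist.
    to_remove = [u for u in known if u not in observed]
    # Components whose state changed while the controller was not looking.
    to_update = {
        u: s for u, s in observed.items() if u in known and known[u] != s
    }
    return to_add, to_remove, to_update


def sync_state(known, observed):
    """Apply one reconciliation step to `known` in place and return it."""
    to_add, to_remove, to_update = reconcile(known, observed)
    for uuid in to_remove:
        del known[uuid]
    known.update(to_add)
    known.update(to_update)
    return known
```

Because the external state is always treated as the source of truth, running this step after a restart recovers the same view as running it during normal operation.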

stv0g commented Nov 25, 2021

In GitLab by @skolen on Nov 25, 2021, 16:11

Thanks for the quick comment.

@iripiri's and my question was rather: How does a restarted controller know or learn which "external" components it controlled before the crash? Currently this information is not saved persistently. If the controller knows which components it controlled before the crash, it can reconcile their status through the respective APIs.

stv0g commented Nov 25, 2021

I assume this would be the task of the manager components.

In the current state we always hard-code the manager components via the configuration file (e.g. ConfigMap).

The reconciliation loop should then be implemented in the manager component which contacts the APIs to sync its managed components with the components seen by the API.

I have already implemented this for the VILLASnode and VILLASrelay manager components here with a periodically called reconcile() function:

Cheers,
Steffen

stv0g commented Nov 25, 2021

For the Kubernetes simulators/manager:

We already store meta information in the Kubernetes metadata associated with the Kubernetes Job resources, such as:

  • Simulator UUID
  • Manager UUID
  • Location
  • Realm
  • Owner

The manager would need to read in this metadata to properly reconstruct the component using the same UUIDs and other data.

If there is more information required, we could also add it to the Job metadata.
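A rough sketch of how a manager could rebuild a simulator component from Job metadata. It works on plain dictionaries shaped like the `labels` and `annotations` of a Job's metadata. The key prefix `villas.example/`, the key names, and the `Simulator` class are assumptions for illustration only, not the keys actually used by VILLAScontroller.

```python
from dataclasses import dataclass

# Hypothetical metadata keys; the real key names may differ.
PREFIX = "villas.example/"
KEYS = {
    "uuid": PREFIX + "uuid",
    "manager_uuid": PREFIX + "manager-uuid",
    "location": PREFIX + "location",
    "realm": PREFIX + "realm",
    "owner": PREFIX + "owner",
}


@dataclass
class Simulator:
    uuid: str
    manager_uuid: str
    location: str
    realm: str
    owner: str


def simulator_from_job_metadata(metadata):
    """Reconstruct a Simulator from a Job metadata dictionary."""
    # Look in both labels and annotations; either may be missing or None.
    source = {}
    source.update(metadata.get("labels") or {})
    source.update(metadata.get("annotations") or {})

    missing = [key for key in KEYS.values() if key not in source]
    if missing:
        raise ValueError(f"Job metadata lacks required keys: {missing}")

    return Simulator(**{field: source[key] for field, key in KEYS.items()})
```

After a restart, the manager would list its Jobs, pass each Job's metadata through such a function, and register the resulting components under their original UUIDs.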
