Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(components): improve component error handling #138

Merged
merged 11 commits into from
Oct 8, 2024
Merged

Conversation

domire8
Copy link
Member

@domire8 domire8 commented Sep 10, 2024

Description

I wouldn't say that this PR solves the parent issue but at least it removes a feature that is not fully supported, desired and thought through.

Review guidelines

Estimated Time of Review: 5 minutes

Checklist before merging:

  • Confirm that the relevant changelog(s) are up-to-date in case of any user-facing changes

@domire8
Copy link
Member Author

domire8 commented Sep 10, 2024

We could also tackle this differently: We could keep the implementation of the in_error_state predicate in the Component (since it also has that is_finished predicate) and only remove it in the LifecycleComponent where we would trigger the error state transition instead. This would then look like this

@domire8 domire8 closed this Sep 10, 2024
@domire8 domire8 reopened this Sep 10, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 10, 2024
@eeberhard
Copy link
Member

We could also tackle this differently: We could keep the implementation of the in_error_state predicate in the Component (since it also has that is_finished predicate) and only remove it in the LifecycleComponent where we would trigger the error state transition instead. This would then look like this

I like the second suggestion, but I would add some thoughts. Normal components do not really have discrete states and being "in error state" carries no meaning outside of the predicate being set to true. So, for it to be meaningful, we could also stop the step timer or skip the step callback once an error has been raised. We might also consider if it's meaningful to "reset" an error but I think that's very implementation specific and not suited for a component (rather, use a lifecycle component for that). The error predicate should just be a way for components to stop themselves, and the predicate can correspondingly be used to unload them if we want. And by stopping the step callback, it has actual value in being defined here in modulo, since it would also stop publishers.

And regarding lifecycle components, raising an error should put the component into the ERROR_PROCESSING transition state, which in turn triggers the on_error callback. This can be entered either through transition callbacks (i.e. if we raise an exception during on_activate_callback, then the wrapper on_activate method returns an error code to the lifecycle node, which is already the case) or it can be entered from step when the component is active. If you can't directly enter the ERROR_PROCESSING state from step (I haven't looked deeply yet), then you could trigger the shutdown transition but set a property to force the on_shutdown callback to return an error code.

So overall, when it comes to the necessary change, I would:

  • keep the error predicate for normal components
  • remove the error predicate from lifecycle components
  • keep the raise_error method in the base component interface for both components
  • make the raise_error method for normal components:
    • set the error predicate to true
    • stop the step callback
  • make the raise_error method for lifecycle components:
    • set an internal flag
    • trigger the shutdown transition
  • the on_shutdown wrapper for lifecycle components should return an error status if the internal error flag was set, in order to force the component into the error_processing state

@domire8
Copy link
Member Author

domire8 commented Sep 18, 2024

So I tested that with lifecycle components and what happens is that

[component_container_mt-2] [ERROR] [1726646503.924903468] [timer_1]: Failed to execute step function: test
[component_container_mt-2] [DEBUG] [1726646503.925462848] [timer_1]: on_shutdown called from previous state active.
[component_container_mt-2] [ERROR] [1726646503.925865250] [timer_1]: Entering into the error processing transition state.
[component_container_mt-2] [DEBUG] [1726646503.926206065] [timer_1]: on_error called from previous state active.

The component is in state unconfigured after that. Is that what we want or do I also have to explicitly fail the on_error callback if an error has been raised?

@domire8
Copy link
Member Author

domire8 commented Sep 18, 2024

In general, subscriptions will survive these steps, especially in a Component. Do we want to clear those as well?

@eeberhard
Copy link
Member

The component is in state unconfigured after that. Is that what we want or do I also have to explicitly fail the on_error callback if an error has been raised?

That's what we want, yes. The on_error_callback for the derived component to override determines the error recovery behavior. If that method returns true, it means the error is handled and the component goes into the unconfigured state (however, the implementation is responsible for correctly resetting internal properties to unconfigured). If the error callback returns false, it should go straight into finalized state.

@eeberhard
Copy link
Member

eeberhard commented Sep 18, 2024

Actually the default virtual on_error_callback in the base class should return false. Only if it is overridden to implement actual error recovery should it return true

@domire8
Copy link
Member Author

domire8 commented Sep 18, 2024

however, the implementation is responsible for correctly resetting internal properties to unconfigured

That is quite important for us as well though, we need to carefully see what we should reset and what not.

@domire8 domire8 changed the title refactor(components): remove unused error predicate and raise_error refactor(components): improve component error handling Sep 18, 2024
Copy link
Member

@eeberhard eeberhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I kind of dropped this because I lost track of what was still to be discussed and what was being worked on.

In general, subscriptions will survive these steps, especially in a Component. Do we want to clear those as well?

Since we can add inputs on construction, so they shouldn't be cleared during error processing in case of recovery. On the finalized transition however, we could fully clear the map of inputs too, since they will never be used.

For the error transition itself, if there was someway to tell the SubscriptionInterface to temporarily suppress callbacks (like having a mute() / unmute() method) then we could do it like that during the error state / finalized. But overall I think having subscriptions remain active during the error processing and only deleting them on finalized is good with me.

To also quickly consider the error handled state, when going back to unconfigured, the componetn should be back in a "post-construction" state. It looks like the lifecycle component base class does not set up anything post-construction that would need to be cleared. Any additional things all happen in derived components. So, we are good there.

I think after my comments I would be quite happy to merge this.

Copy link
Member

@eeberhard eeberhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some last suggestions (same would apply to Python). Otherwise looks good and the parity between C++ and Python also looks correct

source/modulo_components/src/Component.cpp Outdated Show resolved Hide resolved
Copy link
Member

@eeberhard eeberhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get testing with that RC 👍

@domire8 domire8 merged commit 1b8d582 into main Oct 8, 2024
4 checks passed
@domire8 domire8 deleted the refactor/error branch October 8, 2024 12:44
@domire8 domire8 linked an issue Oct 15, 2024 that may be closed by this pull request
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resolve TODOs about putting components into error state
2 participants