Add status change event #3903
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.

/retest
the events are shown only when starting via the api
Yes, that is expected, as …
(force-pushed from a067a0f to c863072)
Tested locally and the functionality is working as expected, LGTM. It'd be nice to have more detailed commit messages explaining the changes in each commit.
Need to spend more time on the last commit, but here are some initial comments.
cmd/crc/cmd/status.go (Outdated)

```diff
@@ -81,7 +81,7 @@ func runWatchStatus(writer io.Writer, client *daemonclient.Client, cacheDir string
 	isPullInit := false

-	err := client.SSEClient.Status(func(loadResult *types.ClusterLoadResult) {
+	err := client.SSEClient.ClusterLoad(func(loadResult *types.ClusterLoadResult) {
```
Not obvious why the renaming is needed? The commit log should explain this so that we all have the same understanding of the need for it.
It is not actually a rename, but rather a 'prepare'. The way this will work is to enable the status to be sent when requested (it starts listening).
Yevhen can explain, as I asked him the same in one of our meetings.
Yes, I renamed it to avoid confusion between the names `Status` and `StatusChange`.
(force-pushed from a6a8fb4 to f63d585)
Another round of comments, still need to spend more time on the pkg/crc/machine/sync.go changes.
pkg/crc/api/events/log_stream.go (Outdated)

```diff
@@ -23,7 +25,7 @@ func newSSEStreamHook(server *sse.Server) *streamHook {
 	&logrus.JSONFormatter{
 		TimestampFormat:   "",
 		DisableTimestamp:  false,
-		DisableHTMLEscape: false,
+		DisableHTMLEscape: true,
```
Can it cause parsing issues on the other end if HTML is not escaped?
Yes, especially newline characters
I don't fully understand what this solves; we remove the newline characters with `TrimRight`, so `DisableHTMLEscape: true` must be solving a different problem?
True, delete this change
```go
}

var (
	StatusChanged = NewEvent[StatusChangedEvent]()
```
Not sure about the global variable here, but we can refine this later.
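For context, a minimal sketch of what such a generic event type could look like. The `Subscribe` method name and the internals are assumptions for illustration, not the PR's actual `pkg/crc/api/events/events.go` implementation:

```go
package events

import "sync"

// Event is a minimal generic publish/subscribe primitive (illustrative sketch).
type Event[T any] struct {
	mu        sync.Mutex
	listeners []func(T)
}

func NewEvent[T any]() *Event[T] {
	return &Event[T]{}
}

// Subscribe registers a callback invoked on every Fire (hypothetical name).
func (e *Event[T]) Subscribe(fn func(T)) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.listeners = append(e.listeners, fn)
}

// Fire synchronously delivers the payload to all current subscribers.
func (e *Event[T]) Fire(payload T) {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, fn := range e.listeners {
		fn(payload)
	}
}
```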
(force-pushed from 45a11dd to 7184f64)
```go
	events.StatusChanged.Fire(events.StatusChangedEvent{State: startResult.Status})
} else {
	events.StatusChanged.Fire(events.StatusChangedEvent{State: state.Error, Error: err})
}
```
(Commenting here, but this applies to most of the file.)

Duplicating the state logic in this file ("after calling `Start()`, the VM is in the `Start` state, unless there was an error, in which case it's in the `Error` state") will likely cause additional maintenance burden, as we have to make sure the state logic in this file stays in sync with the state logic in other parts of crc; the status events may otherwise report a different state from `Status()`.

Can we emit a `StatusChangedEvent` to tell other code the status changed, but not specify the current status in the event payload? This way the receiver of the event can check `Status()` if they need the status, and we don't have two sources of truth for the status to keep in sync (see the sketch after this comment).

Regarding errors, it seems that if `Start()` (or other commands) returns an error, you consider the cluster to be in an error state? This makes sense for some commands, but I don't think it matches what is reported by `Status()` at the moment. It would be nice to fix the reported `Status()` when there are start failures, for example, but doing it here would be papering over the real issue in my opinion.

And in some other cases, errors are less bad; for example in the power-off case, maybe the command failed early, and the cluster is still nicely running?
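A rough sketch of that suggestion, reusing the `Event`/`Subscribe` shape from the earlier sketch; `machineClient` and `sendToStream` are placeholders, not names from this PR:

```go
// watchStatus wires an SSE broadcaster to a payload-less event. The event only
// signals "something changed"; receivers query Status() themselves, so the
// status keeps a single source of truth.
func watchStatus() {
	events.StatusChanged.Subscribe(func(_ events.StatusChangedEvent) {
		status, err := machineClient.Status() // authoritative status lookup
		if err != nil {
			return // or log the error
		}
		sendToStream(status) // placeholder: push the fresh status to SSE clients
	})
}
```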
> Duplicating the state logic in this file ("after calling Start(), the VM is in the Start state, unless there was an error, in which case it's in the Error state") will likely cause additional maintenance burden, as we have to make sure to keep in sync the state logic in this file, and the state logic in other parts of crc - the status events may be reporting a different state from Status().

I'm not sure I understand how to resolve your concern; the problem is that we don't have a way to track state changes (at least I wasn't able to find one; it would be nice if you could point me to that place if it exists).

> Can we emit a StatusChangedEvent to tell other code the status changed, but not specify the current status in the event payload? This way the receiver of the event can check Status() if they need the status, and we don't have 2 sources for the status to keep in sync?

I did that at first, but I hit a race condition with updating the state. For example: the user calls start, we fire the state change event before actually calling `Start`, and if the listener is fast enough, it could get the old state (basically we could lose the `starting` state in the transition from `stopped` to `running`).

A possible solution may be to move the state change event firing from the `Synchronized` client into the `client` implementation of the `machine.Client` interface, and fire events only after the state has actually changed (a sketch of this idea follows).
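A hedged sketch of that alternative; the method signature and the inner `doStart` helper are approximations, not the actual crc code:

```go
// Fire the event from the concrete machine client, after the operation has
// completed, instead of from the Synchronized wrapper. By the time Fire runs,
// the VM state really has changed, so a fast listener can no longer observe
// the stale pre-start state.
func (c *client) Start(ctx context.Context, startConfig types.StartConfig) (*types.StartResult, error) {
	startResult, err := c.doStart(ctx, startConfig) // hypothetical inner implementation
	if err != nil {
		events.StatusChanged.Fire(events.StatusChangedEvent{State: state.Error, Error: err})
		return nil, err
	}
	events.StatusChanged.Fire(events.StatusChangedEvent{State: startResult.Status})
	return startResult, nil
}
```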
> Regarding errors, it seems if Start() (or other commands) return an error, you consider that the cluster is in an error state? This makes sense for some commands, but I don't think this matches what is reported by Status() at the moment? Would be nice to fix the reported Status() when there are start failures for examples, but doing it here would be papering over the real issue in my opinion.
> And in some other cases, errors are less bad, for example in the power off case, maybe the command failed early, and the cluster is still nicely running?

It does match: we send over SSE the status reported by `Status()`; in addition we also send the error message, which comes from the status change event.
My main concern is #3903 (comment), and I'm under the impression that if you have a terminal showing the stream of events, and another running `crc status` every second, then the state between the two will not always match; in particular, sometimes the event stream will contain `State: state.Error` while `crc status` will not report an error?
Guess I need to run both side by side and compare myself to understand better how this works.
(I understand the current code base is missing a centralized "setState" facility, and that this PR has to find a way of doing this, just trying to understand the limitations in the PR)
> My main concern is #3903 (comment), and I'm under the impression that if you have a terminal showing the stream of events, and another running crc status every second, then the state between the 2 will not always match, in particular sometimes the event stream will contain State: state.Error while crc status will not report an error?
> Guess I need to run both side by side and compare myself to understand better how this works.

No, the output should be exactly the same; to achieve that I added
https://github.com/crc-org/crc/pull/3903/files#diff-e5710ade03da511d4845f380cd63319bf7329e960c1afbbba7fbed4060fcf8edR37
and
https://github.com/crc-org/crc/pull/3903/files#diff-e5710ade03da511d4845f380cd63319bf7329e960c1afbbba7fbed4060fcf8edR44-R50
Otherwise we would need to move the event triggering deeper into the crc code, and fire the event after executing the piece of code which changes the `crc status` output.

And a question for everyone: should we integrate events deeper into the CRC core (which leads to spreading events over our codebase) or keep it as is (in the relatively high-level Client interface implementation)?
@gbraad @cfergeau @praveenkumar WDYT?
> And it is a question for all, should we integrate events in more depths of CRC core (which leads to spreading events over our codebase) or keep as is (in relative hi level Client interface implementation)?

I'd say we can go with this for now, with more details about the design choices, the implementation limitations, ...

One thing we could explore over time is some centralized tracking of the cluster state: a `StateTracker` type. We'd add some calls to `StateTracker.SetState(...)` in the appropriate places; the `StateTracker` would fire events on status changes; it could have a `CanChangeState(state)` helper if needed, to know whether calling `Stop` will result in an error or not, ... (a rough sketch follows below).

As long as not all `crc` commands go through the daemon, it could do some polling to reconcile the internal state with changes made through external `crc` calls, ...
(force-pushed from 7184f64 to a36a8f0)
/need-rebase
I've rebased it at https://github.com/cfergeau/crc/tree/stus-change-event and added a fixup commit to fix compilation of the first commit in the series.
I'm not sure step 3 is correct in the testing instructions: don't we need to start it with the REST API for now, as otherwise …
I was wondering if …
Yes, you are correct, it needs to be started with the REST API.
The daemon sends the response only after the start has finished (or an error, in case of any error), but calling …
A few more things I noticed during testing: if the VM exists and I then call …, which does not seem correct, as the VM existed before being deleted. Probably something that can be improved with the …
(force-pushed from a36a8f0 to 1fde404)
This avoids confusion with the upcoming status change event stream. Signed-off-by: Yevhen Vydolob <[email protected]>
It allows us to have a unified way to propagate state changes across the app. Co-Authored-By: Christophe Fergeau <[email protected]> Signed-off-by: Yevhen Vydolob <[email protected]>
(force-pushed from 1fde404 to 54df54b)
This will simplify status handling in the podman-desktop extension. Signed-off-by: Yevhen Vydolob <[email protected]>
(force-pushed from 54df54b to 8807ace)
That was actually a bug; now it should work properly.
```go
} else {
	// The event could be fired before the code that actually changes the state runs,
	// so the status could contain the 'old' state; replace it with the state received in the event.
	status.CrcStatus = changedEvent.State // override with the actually reported state
```
This is the part I find questionable in this PR. Instead of having a single source of truth for the status, we now have two: some of the status comes from `st.machineClient.Status()`, but some of it is hardcoded in `machine/sync.go`. If the two get out of sync, or if we want to improve status reporting in general, we'll have multiple places to change.
Spending more time on it, I understand better the problems this is trying to solve.
One example is when the VM is stopped: `crc status` will succeed and report that crc is stopped, however `crc stop` will return an error and say the machine is already stopped.
Without `status.CrcStatus = changedEvent.State`, when invoking `api/stop`, we would not be reporting an error in the stream, but instead we'd only say the machine is stopped. This would not match what was reported to the `api/stop` call. There are most likely other state transitions where this is needed.
The comment however seems to indicate `Status` could sometimes report a stale status. I haven't seen that in the little testing I did, but I'd be interested in understanding more about when this happens. If we don't record this knowledge now, while it's more or less fresh in our minds, it will be very difficult to find again 6 months from now :-/
Ah, there are more details in an older discussion:

> User call start, we fire state change event, before actuality call Start, and if listener is fast enough, it could get old state (basically we could lost starting state in transition from stoped to running states)
In my opinion, this is an indication that these events are not fired in the right place, and this is papered over with even more complexity :-/
> …instead we'd only say the machine is stopped. This would not match what was reported to the api/stop call.

The API matches the use case. Stop should, in my opinion, not return an error when already stopped, as you succeeded in doing what was requested. What more information do we actually give by saying this failed because it was already stopped?!
If you believe this should be fixed or recorded, make these states clear for the convergence. (A sketch of the idempotent-stop semantics follows below.)
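For illustration, a sketch of the idempotent-stop semantics suggested here; `vmState` and `stopVM` are hypothetical helpers, and the signature is an approximation of the machine client's `Stop`:

```go
// Stopping an already-stopped machine reports success, since the machine ends
// up in exactly the state the caller asked for.
func (c *client) Stop(ctx context.Context) (state.State, error) {
	current, err := c.vmState() // hypothetical state lookup
	if err != nil {
		return state.Error, err
	}
	if current == state.Stopped {
		return state.Stopped, nil // already stopped: nothing to do, not an error
	}
	return c.stopVM(ctx) // hypothetical actual power-off
}
```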
The part you quoted is a description of the behaviour in this PR. These are some notes I took during review/testing when I was trying to understand the design choices which led to the code being reviewed.
I tried to replace this code by firing the event at a more "proper" place, but I ran into the problem that the `starting` state is "artificial", i.e. it exists only in the `sync.go` file, and the VM doesn't have/report a `starting` state. The `stopping` state is the same.
So I have no other ideas than to accept the current PR and think about better state handling, like @cfergeau proposes in #3903 (comment), or to change the state handling first and then rebase this PR on top of it.
I think we can go forward with the current code, but we really need more details in the commit log, so that it records the design choices you made when writing the code, so that we know the current limitations, why this overwrites the state, ...
What is holding back this PR?
Some discussion around #3903 (comment) would have been useful. Apart from this, as said multiple times, the commit log must be a lot more detailed about how this works.
And #3903 (comment) is also still pending.
(force-pushed from 8807ace to e7445d4)
It will be propagated to clients via SSE in its own channel, to inform clients about CRC status changes. The status is modified in "status_change_stream.go", as the current status implementation has no single point of change, so tracking transitions between states is challenging. Signed-off-by: Yevhen Vydolob <[email protected]>
Having a '\n' character inside the message field breaks the SSE specification, and clients could not correctly parse such events. Signed-off-by: Yevhen Vydolob <[email protected]>
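(For context on that commit: in the SSE wire format, each event is a block of `field: value` lines terminated by a blank line, so a raw newline inside the payload prematurely ends the `data:` field. An illustrative well-formed frame might look like the following; the exact fields emitted by the daemon may differ:)

```
event: status_change
data: {"CrcStatus":"Running"}

```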
(force-pushed from e7445d4 to 66237fa)
@cfergeau Could you take a look at this?
@evidolob: The following test failed, say /retest to rerun all failed tests: …
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Solution/Idea
The PR adds an event API and a new SSE (server-sent events) channel, `status_change`.

Proposed changes
- Add the event API implementation in `pkg/crc/api/events/events.go`
- Modify the `sync` client implementation to trigger an event when the status is changed
- Fix existing log events; rename constants related to SSE

Testing
Subscribe to the `status_change` channel with:

```
curl -GET -i --unix-socket <path to socket>/crc-http.sock "http://localhost/events?stream=status_change"
```

Start the cluster via the REST API:

```
curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start"
```