-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ProcessTerminationResult, for better alignment with POSIX #244
base: main
Are you sure you want to change the base?
Add ProcessTerminationResult, for better alignment with POSIX #244
Conversation
25af301
to
e3cce78
Compare
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states the following: > There are two kinds of process termination: > > 1. Normal termination occurs by a return from main(), when requested > with the exit(), _exit(), or _Exit() functions; or when the last > thread in the process terminates by returning from its start > function, by calling the pthread_exit() function, or through > cancellation. > > 2. Abnormal termination occurs when requested by the abort() function > or when some signals are received. Both these cases can be detected by parent processes separately. Unfortunately, REv2 currently doesn't allow this, as we only provide an exit code. Implementations such as Buildbarn tend to set the exit code for abnormal termination to 128+signum, meaning that if a segmentation fault occurs, an action terminates with exit code 139. This tends to be fairly confusing for non-expert users. The goal of this change is to make all of this less confusing. It adds a new message type named ProcessTerminationResult, which contains the termination result in a more structured way. It can now either be an exit code or a signal number. If it turns out that other operating systems provide even more ways, we can add additional `oneof` cases. For the exit code in case of normal termination I decided to use an int64 instead of int32. The reason being that POSIX allows you to get the full exit code back, using APIs like waitid(). On ILP64 systems, the exit code may thus be 64 bits. For the signal in case of abnormal termination I decided to use a string. The reason being that signal numbers are not covered by POSIX. Everybody always assumes that 9 is SIGKILL, but POSIX only documents this as a convention for the command line flags of the `kill` utility. Not the the actual `SIG*` constants. Adding an enumeration would also be unwise, as operating systems are free to add their own signals that are not covered by POSIX (e.g., SIGINFO on BSD). Fixes: bazelbuild#240
e3cce78
to
19fd68b
Compare
Hi @EricBurnett @juergbi, Could you folks please take a look at the latest version of this PR? I think I have processed your feedback accordingly:
Thanks! |
@@ -1312,6 +1323,57 @@ message OutputSymlink { | |||
NodeProperties node_properties = 4; | |||
} | |||
|
|||
// The termination result of a process, as reported by the operating | |||
// system to the parent process (e.g., using wait*() on POSIX-like | |||
// systems, or GetExitCodeProcess() on Microsoft Windows). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest removing or weakening the statement "as reported by the operating system..." . This structure of exit_code + process_termination_result I think would actually extend to other concepts as well in the future.
For example, a pattern I expect to need at some point is "flake due to intermediate on-worker infrastructure", where there is a logical test that we care about the result of, but also some intermediate scaffolding that's been injected to run it where we'd like to be able to distinguish flakes due to the scaffolding itself (that doesn't necessarily imply a problem with the test) from flakes of the test.
(To be clear, don't want to extend your PR to cover this; that can be a future discussion. I'd just say that the "process termination result" need not be only OS-provided information.
string signal_name = 1; | ||
} | ||
|
||
oneof reason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about not using a oneof and simply having a set of optional fields? signal_name implying the lack of exit_code makes sense in this particular case, but I'm not sure every future result is going to follow the pattern of being mutually exclusive. E.g. if we also provided info from the syslog or Windows Event Log so you could know why the OS killed the process, say, where you'd get both a SIGKILL and some indication that it was killed due to OOM or too many handles or whatever.
So e.g.
message ProcessTerminationResult {
//...
int64 exit_code = 1;
//...
string signal_name = 2;
//...
string human_readable_explanation_for_signal = 3;
}
With comments detailing the relationship between the various fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that open the door to problems with default values? I'm thinking about exit_code
in particular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. For many the empty field is trivially uninteresting (e.g. string signal, empty==missing) but you're right about exit_code.
A few options:
- Do nothing, since exit_code can be distinguished fairly readily: the only case of concern is when it's 0, where we shouldn't actually need to distinguish if its set or not. (Do we really need to provide a ProcessTerminationResult just to provide a 0 exit_code? Given that the top-level field already exists, it seems just as reasonable to omit the whole message entirely in that case. So "0 means unset" for exit_code is just fine, since it introduces no real ambiguity in practice.)
- Do something special for exit_code, since so far it's the only one that has an ambiguous default. A second "has_exit_code" field would do, for example.
- If you know a set that will definitely be mutually exclusive, a oneof suffices, and then we can use optional fields for the rest. So e.g. exit_code and signal_name within the oneof, but with the intent for augmenting fields (like the human_readable_explanation_for_signal example) to be optionals outside of it. Which I guess is...what you already have, just with the understanding that maybe not everything will go in the oneof?
- Make these structs with a single field instead of raw fields, since you can check for the presence of a struct. E.g.
message ProcessTerminationResult {
message ExitCode {
//...
int64 code = 1;
}
ExitCode exit_code = 1;
message Signal {
//...
string name = 1;
string human_readable_explanation_for_signal = 2;
}
Signal signal = 2;
}
I think I vote for (1): doesn't really feel like there's "real" ambiguity being introduced in this case, even for exit_code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this change; our impl currently returns -1
for exit code in this case, mostly as a consequence of using exec.Command(...).Wait()
in Golang which returns -1
whenever the process doesn't exit()
. It would be nice to surface more details as to whether it's an abort()
or something else.
// Abnormal process termination occurred. | ||
// | ||
// On POSIX-like systems, this is achieved by calling abort(), or by | ||
// receiving a signal for which no signal handler is installed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: SIGKILL (e.g. OOM-kill) and SIGSTOP (probably less common in practice) in particular can't be handled, so maybe this wording is slightly clearer:
// receiving a signal for which no signal handler is installed. | |
// receiving a signal that is not handled by the process. |
@@ -1156,6 +1156,17 @@ message ActionResult { | |||
// The exit code of the command. | |||
int32 exit_code = 4; | |||
|
|||
// An optional reason the process launched by this action terminated, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is optional, it seems that checking ActionResult.exit_code == 0
is still how the client should check whether an execution was successful or not, rather than inspecting this new field.
To avoid confusion now that this additional field is being added (which also contains exit code information), should the ActionResult.exit_code
field be further specified so that it must be non-zero if the action did not succeed?
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:
Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.
The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional
oneof
cases.For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.
For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the
kill
utility.Not the the actual
SIG*
constants. Adding an enumeration would also beunwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).
Fixes: #240