RFC: uniform exit codes

Wed Sep 7 19:51:06 UTC 2016

> On 2 Sep 2016, at 16:53, Elias Pipping <pipping.elias at icloud.com> wrote:
> 
> Dear list,
> 
> I’d like to talk about exit codes for a bit. The uiop/run-program::%wait-process-result function, for example, currently waits for a process to terminate and then returns something: an exit code. What would you expect that to be? What would you like it to be?
> 
> (I’ve already had a long discussion about this with Robert but Faré asked me to take it to the mailing list, too).
> 
> An exit code should be a number between 0 and 255, with only 0 signalling success, as far as I understand. A ‘return -1’ in C ends up as 255 once I check for it in Lisp or the shell. Beyond that there are customs but not standards (there is sysexits.h, but it’s not used all that much).
> 
> Please consider the three shell scripts that each contain just one line:
> 
> (1) exit 15
> (2) kill $$
> (3) sh -c 'kill $$'
> 
> If you saved them in separate scripts and ran them from within a shell, the exit code would be 15/143/143. The take-away messages from that for me are that
> 
> - the shell uses 128+n if the process dies in response to signal n
> - there are cases where the exit code is greater than 128 even though the process itself did not die in response to a signal, thereby interfering with this logic
> - the shell cannot distinguish (2) and (3)
> 
> From within Lisp, (2) and (3) can often, but not always, be distinguished. Sometimes, a process-wait function will return something like 15/(0 15)/(143 0) for the above examples(*); sometimes a process-status function will report (:exited 15)/(:signaled 15)/(:exited 143).
> 
> But some implementations will behave like the shell and always return 15/143/143, e.g. ABCL, LispWorks <7, and Allegro CL with :wait t.
> 
> So the thing that we can reliably do is produce the sequence 15/143/143. Please note that even this baseline is already a proposal for a change: With today’s UIOP master branch, you could also get things such as 15/15/143, 15/0/143, or 15/:sigterm/143, if I’m not mistaken.
> 
> What we could not reliably do is e.g. return things like 15/-15/143 or (15 :exited)/(15 :signaled)/(143 :exited). What I’ve implemented so far is a compromise: some platforms might return 15/(143 15)/143 and others just 15/143/143. The (143 15) could easily be turned into (143 :signaled) instead; that’s a matter of taste. The take-away message remains, though: you couldn’t be sure that what you think is case (2) isn’t really case (3). So that also leaves the option of “let’s just not bother with distinguishing the two”.
> 
> Looking forward to your feedback,
> 
> 
> Elias

Dear list,

I’ve now tested what happens on OpenBSD rather than Linux. Not entirely surprisingly (I simply hadn’t thought about it), it turns out that the difference between (2) and (3), namely the additional shell layer, also enters here:

  (defun %normalize-command (command)
    ...
    (etypecase command
      #+os-unix (string `("/bin/sh" "-c" ,command))
      #+os-unix (list command)
      …

In other words, whether %run-program is passed something like "/bin/something arg" or (list "/bin/something" "arg") potentially makes a difference, and indeed it does for (2), because passing a string can turn it into (3). Again, it’s not surprising that it does, but my impression is that the user really should not have to worry about this difference; otherwise the entire abstraction that %run-program provides breaks down.
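
To make the string-versus-list difference concrete, here is a minimal sh sketch (not UIOP code). The helper names run_as_list and run_as_string are made up for illustration; they only mimic what the two branches of %normalize-command lead to, namely executing the program directly versus wrapping it in /bin/sh -c. The command used is case (2), which becomes case (3) once it is wrapped.

```shell
#!/bin/sh
# Hypothetical helper: the "list" form, where the program and its
# arguments are executed directly, without an extra shell layer.
run_as_list() {
    "$@" >/dev/null 2>&1
    echo $?
}

# Hypothetical helper: the "string" form, where the whole command line
# is handed to /bin/sh -c, adding one more shell layer.
run_as_string() {
    /bin/sh -c "$1" >/dev/null 2>&1
    echo $?
}

# Case (2): a shell that sends SIGTERM to itself.
list_status=$(run_as_list sh -c 'kill $$')

# The same command passed as a string: effectively case (3).
string_status=$(run_as_string "sh -c 'kill \$\$'")

# Seen from a shell, both report 128+15. Whether the direct child died
# from the signal or an intermediate sh exited with 143 is not visible here.
echo "list: $list_status, string: $string_status"
```

From a shell the two forms look identical; the difference only shows up for callers (such as a Lisp implementation) that inspect how the direct child terminated.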

So we’re now at a place where distinguishing (:exited 143) and (:signaled 15) may or may not work depending on whether you pass a string or a list, which operating system you’re on, which Lisp you’re on, and whether you call your script synchronously or asynchronously. I think it’s safe to say we should just give up on this undertaking and return 143 in both cases. We can do that reliably now, and I’m happy that we can. We should not return additional information if it’s not reliable.
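
For reference, the baseline can be reproduced from a plain shell with a small sketch like the following. It assumes a POSIX sh; status_of is a hypothetical helper introduced only for this illustration. It runs the three one-liners from the quoted message and then shows that an explicit exit 143 and a death by SIGTERM are reported identically.

```shell
#!/bin/sh
# Hypothetical helper: run a one-line script in a fresh sh and print
# the exit status that the calling shell observes.
status_of() {
    sh -c "$1" >/dev/null 2>&1
    echo $?
}

c1=$(status_of 'exit 15')             # case (1)
c2=$(status_of 'kill $$')             # case (2)
c3=$(status_of "sh -c 'kill \$\$'")   # case (3)
summary="$c1/$c2/$c3"
echo "$summary"

# An ordinary exit with code 143 versus termination by signal 15:
exited=$(status_of 'exit 143')
signaled=$(status_of 'kill $$')
echo "exit 143: $exited, kill \$\$: $signaled"
```

On a typical Unix sh this should print 15/143/143, and the last line should show the same number twice, which is the behaviour a uniform return value of 143 would match.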

Elias

PS: The additional shell layer was something that confused me quite a bit, too, when I wrote wrappers around process-status and process signalling functions: If you run `sleep 1`, send it SIGSTOP, sleep for 2 seconds, and send it SIGCONT, it will run for approximately another second. If you do the same with `sh -c 'sleep 1'`, you’ll get a very different result: The shell will stop but `sleep 1` will continue to run. Once you send SIGCONT, the process will immediately return. All of this makes perfect sense but it becomes confusing if someone turns your `sleep 1` into `sh -c "sleep 1"` without telling you.
One situation where the number of layers of shell is less relevant (but still not completely so) is when a process is terminated or when checking if a process is still alive (that’s why I made the corresponding functions public and the others I mentioned earlier private).
Even here, killing a process will not necessarily kill its children. Windows has taskkill /t for that, I believe (a so-called “tree kill”), but I don’t think such a thing is possible on Unix without additional requirements like cgroups on Linux (at least I think that’s something systemd uses them for, and why it requires them).