Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)

Tue, 4 Jun 2002 19:55:13 -0700 (PDT)

[ On Sunday, May 26, 2002 at 00:10:06 (-0700), Bryan Stansell wrote: ]
> Subject: Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)
>
> looking at #2, you see it's calling waitpid() from ConsChat().
> ConsChat() is part of your patch.  the problem, i'm guessing, is that
> the waitpid() inside the while loop has a little bad logic.
> specifically, what happens when the waitpid() returns an error that
> isn't EINTR?  it'll come around for another waitpid() and, i suppose,
> lock up like this.  at least, that's my guess - i haven't done any real
> testing - just scanned the code quickly.

Hmmm... but there's never been any errno value other than EINTR -- there
would be a "ConsChat: error waiting for chat process:" message in my log
if there had.....

I've added a whole lot more careful error checking, including blocking
SIGCHLD before calling waitpid(), setting an alarm(), and checking that
the process still exists when the alarm expires and EINTR is returned.
I've also added a break out of the loop if ECHILD is returned.  I don't
know what to do if either EFAULT or EINVAL is returned -- something's
drastically wrong in that case and it should probably abort()....

So far the deadlock hasn't occurred again, though perhaps the blocking of
SIGCHLD has prevented it.  The problem without blocking (or ignoring)
SIGCHLD is that its delivery (and catch) caused waitpid() to be
interrupted and to return EINTR.  I don't know why the
second call didn't work though -- perhaps there's a race condition in my
kernel that loses the status information if it's waitpid() itself that
is interrupted....

-- 
								Greg A. Woods

+1 416 218-0098;  <gwoods@acm.org>;  <g.a.woods@ieee.org>;  <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>