conserver eventually goes catatonic after SIGPIPE (on NetBSD)

Thu, 23 May 2002 13:45:51 -0700 (PDT) · Bryan Stansell

conserver eventually seems to go catatonic after SIGPIPE (on NetBSD)

I think this is the same problem I reported some time ago with
a somewhat older release, but now with 7.2.1 it's just far less common.
(Note that the code I'm running includes the patches I sent to the list,
though I don't see how any of those changes could affect any signal
processing....)

I suspect the SIGPIPE is triggered by an attempt to write to a socket
that's been closed (TCP RST) by the client.  The server should just
close the socket and do any per-client cleanup necessary, but I don't
see a signal handler for SIGPIPE anywhere....

Eventually I notice this when any long-running 'console' client dies, or
when I start getting warning e-mails from Cricket about some delay in
processing one of its collectors, which in this case usually turns out
to be the little script I use to ask each UPS what its status is....

No new 'console' connections work right either, which is why the Cricket
collector "fails".  Sometimes if I leave "console -u" or "console -x"
running long enough when it's in this state then I get a response, but
it takes many minutes....  I don't think I've ever managed to get a
successful connection to an actual console, though I may not have waited
long enough.

I can kill one of the 'conserver' processes with SIGTERM (I'm currently
assuming this is the parent, though I've not been careful enough to look
yet), and the other needs SIGQUIT or similar (something it's not caught
that will force it to exit).  Twice now I've forced it to dump core, but
unfortunately I've not been smart enough yet to realize that the
binaries I've been building and using were not compiled with '-g'.

I'm recompiling now.....  Hmm.... seems it was waiting for a PID that
didn't exist:

$ gdb ./conserver conserver-forced-2.sparc.core  
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (sparc-netbsd), Copyright 1996 Free Software Foundation, Inc...

warning: exec file is newer than core file.
Core was generated by `conserver'.
Program terminated with signal 6, Abort trap.
Reading symbols from /usr/libexec/ld.so...done.
Reading symbols from /usr/lib/libcrypt.so.0.0...done.
Reading symbols from /usr/lib/libwrap.so.0.0...done.
Reading symbols from /usr/lib/libc.so.12.20...done.
#0  0x100b7e5c in wait4 ()
(gdb) where
#0  0x100b7e5c in wait4 ()
#1  0x100b4e88 in waitpid ()
#2  0xa6cc in ConsChat (pCE=0x15400) at group.c:3016
#3  0x7a9c in Kiddie (pGE=0x47180, sfd=0x6c010) at group.c:1458
#4  0xa20c in Spawn (pGE=0x47180) at group.c:2907
#5  0xcad0 in FixKids () at master.c:143
#6  0xd2f4 in Master () at master.c:313
#7  0xc874 in main (argc=269482840, argv=0x14400) at main.c:724
(gdb) up
#1  0x100b4e88 in waitpid ()
(gdb) up
#2  0xa6cc in ConsChat (pCE=0x15400) at group.c:3016
3016                    while (waitpid(pid, &cstatus, 0) < 0) {
(gdb) print pid
$1 = 21006
(gdb) 

I'm fairly certain there was no PID 21006 at the time I killed it...
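
As a hedged aside on the waitpid() loop at group.c:3016: only the loop
condition is visible in the trace, not its body, so the following is a
sketch of a defensive pattern and not a description of what conserver
does.  The helper name reap_child() is hypothetical.  It retries only
when the call was interrupted (EINTR) and gives up on any other error,
notably ECHILD, which waitpid() returns when the PID is not, or is no
longer, a child of the caller.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: wait for child `pid` to terminate.
 * Returns 0 and fills *status once the child has been reaped.
 * Returns -1 with errno set if waitpid() fails for any reason other
 * than EINTR; errno == ECHILD means there is no such child to wait
 * for, so the caller should stop waiting. */
int reap_child(pid_t pid, int *status)
{
    for (;;) {
        pid_t r = waitpid(pid, status, 0);

        if (r == pid)
            return 0;           /* child reaped */
        if (r < 0 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        return -1;              /* ECHILD or another hard error */
    }
}
```

Retrying unconditionally on a negative return would spin forever once
the child has already been reaped elsewhere, which is why the sketch
distinguishes EINTR from the other errors.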

This has happened at least twice and I have two forced core dumps of the
stuck process.

The conserver log file contains entries that suggest it might be one of
my Cricket collector scripts causing the SIGPIPE as the failure occurs
in the middle of one of the runs (which happen every minute, with a
login and logout for each of my three UPS units).  From there on things
go really wonky until I stop it.  PID 14846 is the one that stopped on
its own with SIGTERM, and PID 15629 is the one that produced the above
core dump.  There is no record of PID 21006 in any of the log files
produced by this instantiation of conserver, and given the PIDs around
the time I killed it that must have been a very recently started process
(the new daemon after restarting was 21022).

conserver (14847): best-1.4: login cricket@becoming.weird.com [Thu May 23 06:21:20 2002]
conserver (14847): best-1.4: logout cricket@becoming.weird.com [Thu May 23 06:21:22 2002]
conserver (14847): best-3.1-0: login cricket@becoming.weird.com [Thu May 23 06:21:22 2002]
conserver (14847): best-3.1-0: logout cricket@becoming.weird.com [Thu May 23 06:21:24 2002]
conserver (14847): best-3.1-1: login cricket@becoming.weird.com [Thu May 23 06:21:24 2002]
conserver (14847): best-3.1-1: logout cricket@becoming.weird.com [Thu May 23 06:21:26 2002]
conserver (14847): best-1.4: login cricket@becoming.weird.com [Thu May 23 06:22:17 2002]
conserver (14847): best-1.4: logout cricket@becoming.weird.com [Thu May 23 06:22:19 2002]
conserver (14847): best-3.1-0: login cricket@becoming.weird.com [Thu May 23 06:22:20 2002]
conserver (14846): conserver(14847): signal(13), restarted [Thu May 23 06:22:22 2002]
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
warning: read() on stdin returned 0
Failed
conserver (15629): lost carrier on once (tserv/2012)! [Thu May 23 06:37:33 2002]
conserver (15629): once: automatic reinitialization [Thu May 23 06:37:33 2002]
Failed
Failed
Failed
conserver (15629): lost carrier on proven (tserv/2006)! [Thu May 23 06:42:05 2002]
conserver (15629): proven: automatic reinitialization [Thu May 23 06:42:05 2002]
Failed
conserver (15629): lost carrier on raid-00 (tserv/2004)! [Thu May 23 06:43:37 2002]
conserver (15629): raid-00: automatic reinitialization [Thu May 23 06:43:37 2002]
Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 06:45:08 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 06:45:08 2002]
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 07:00:15 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 07:00:15 2002]
Failed
Failed
Failed
Failed
Failed
conserver (15629): best-1.4: login cricket@becoming.weird.com [Thu May 23 07:07:48 2002]
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 07:22:52 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 07:22:52 2002]
conserver (15629): best-1.4: logout cricket@becoming.weird.com [Thu May 23 07:22:53 2002]
Failed
Failed
Failed
Failed
Failed
Failed
Failed
warning: read() on stdin returned 0
Failed
Failed
Failed
Failed
Failed
Failed
Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 07:42:28 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 07:42:28 2002]
Failed
Failed
Failed
Failed
Failed
Failed
warning: read() on stdin returned 0
Failed
conserver (15629): best-3.1-0: login cricket@becoming.weird.com [Thu May 23 07:51:32 2002]
Failed
Failed
Failed
Failed
[[ .... blah, blah, blah .... ]]
conserver (14846): Stopped at Thu May 23 10:25:27 2002

-- 
								Greg A. Woods

+1 416 218-0098;  <gwoods@acm.org>;  <g.a.woods@ieee.org>;  <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>