Greg A. Woods woods@weird.com
Thu, 23 May 2002 13:45:51 -0700 (PDT)
conserver eventually seems to go catatonic after SIGPIPE (on NetBSD)

I think this is the same problem as what I reported some time ago with a somewhat older release, but now with 7.2.1 it's just far less common. (Note the code I'm running includes the patches I sent to the list, though I don't see how any of those changes could affect any signal processing....)

I suspect the SIGPIPE is triggered by an attempt to write to a socket that's been closed (TCP RST) by the client. The server should just close the socket and do any per-client cleanup necessary, but I don't see a signal handler for SIGPIPE anywhere....

Eventually I notice this when any long-running 'console' client dies, or when I start getting warning e-mails from Cricket about some delay in processing one of its collectors, which in this case usually turns out to be the little script I use to ask each UPS what its status is....

No new 'console' connections work right either, which is why the Cricket collector "fails". Sometimes if I leave "console -u" or "console -x" running long enough when it's in this state then I get a response, but it takes many minutes.... I don't think I've ever managed to get a successful connection to an actual console, though I may not have waited long enough.

I can kill one of the 'conserver' processes with SIGTERM (I'm currently assuming this is the parent, though I've not been careful enough to look yet), and the other needs SIGQUIT or similar (something it hasn't caught that will force it to exit).

Twice now I've forced it to dump core, but unfortunately I'd not been smart enough to realize that the binaries I've been building and using were not compiled with '-g'. I'm recompiling now.....

Hmm.... seems it was waiting for a PID that didn't exist:

$ gdb ./conserver conserver-forced-2.sparc.core
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (sparc-netbsd), Copyright 1996 Free Software Foundation, Inc...
warning: exec file is newer than core file.
Core was generated by `conserver'.
Program terminated with signal 6, Abort trap.
Reading symbols from /usr/libexec/ld.so...done.
Reading symbols from /usr/lib/libcrypt.so.0.0...done.
Reading symbols from /usr/lib/libwrap.so.0.0...done.
Reading symbols from /usr/lib/libc.so.12.20...done.
#0  0x100b7e5c in wait4 ()
(gdb) where
#0  0x100b7e5c in wait4 ()
#1  0x100b4e88 in waitpid ()
#2  0xa6cc in ConsChat (pCE=0x15400) at group.c:3016
#3  0x7a9c in Kiddie (pGE=0x47180, sfd=0x6c010) at group.c:1458
#4  0xa20c in Spawn (pGE=0x47180) at group.c:2907
#5  0xcad0 in FixKids () at master.c:143
#6  0xd2f4 in Master () at master.c:313
#7  0xc874 in main (argc=269482840, argv=0x14400) at main.c:724
(gdb) up
#1  0x100b4e88 in waitpid ()
(gdb) up
#2  0xa6cc in ConsChat (pCE=0x15400) at group.c:3016
3016        while (waitpid(pid, &cstatus, 0) < 0) {
(gdb) print pid
$1 = 21006
(gdb)

I'm fairly certain there was no PID 21006 at the time I killed it...

This has happened at least twice and I have two forced core dumps of the stuck process.

The conserver log file contains entries that suggest it might be one of my Cricket collector scripts causing the SIGPIPE, as the failure occurs in the middle of one of the runs (which happen every minute, with a login and logout for each of my three UPS units). From there on things go really wonky until I stop it. PID 14846 is the one that stopped on its own with SIGTERM, and PID 15629 is the one that produced the above core dump. There is no record of PID 21006 in any of the log files produced by this instantiation of conserver, and given the PIDs around the time I killed it that must have been a very recently started process (the new daemon after restarting was 21022).
conserver (14847): best-1.4: login cricket@becoming.weird.com [Thu May 23 06:21:20 2002]
conserver (14847): best-1.4: logout cricket@becoming.weird.com [Thu May 23 06:21:22 2002]
conserver (14847): best-3.1-0: login cricket@becoming.weird.com [Thu May 23 06:21:22 2002]
conserver (14847): best-3.1-0: logout cricket@becoming.weird.com [Thu May 23 06:21:24 2002]
conserver (14847): best-3.1-1: login cricket@becoming.weird.com [Thu May 23 06:21:24 2002]
conserver (14847): best-3.1-1: logout cricket@becoming.weird.com [Thu May 23 06:21:26 2002]
conserver (14847): best-1.4: login cricket@becoming.weird.com [Thu May 23 06:22:17 2002]
conserver (14847): best-1.4: logout cricket@becoming.weird.com [Thu May 23 06:22:19 2002]
conserver (14847): best-3.1-0: login cricket@becoming.weird.com [Thu May 23 06:22:20 2002]
conserver (14846): conserver(14847): signal(13), restarted [Thu May 23 06:22:22 2002]
Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed
warning: read() on stdin returned 0
Failed
conserver (15629): lost carrier on once (tserv/2012)! [Thu May 23 06:37:33 2002]
conserver (15629): once: automatic reinitialization [Thu May 23 06:37:33 2002]
Failed Failed Failed
conserver (15629): lost carrier on proven (tserv/2006)! [Thu May 23 06:42:05 2002]
conserver (15629): proven: automatic reinitialization [Thu May 23 06:42:05 2002]
Failed
conserver (15629): lost carrier on raid-00 (tserv/2004)! [Thu May 23 06:43:37 2002]
conserver (15629): raid-00: automatic reinitialization [Thu May 23 06:43:37 2002]
Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 06:45:08 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 06:45:08 2002]
Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 07:00:15 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 07:00:15 2002]
Failed Failed Failed Failed Failed
conserver (15629): best-1.4: login cricket@becoming.weird.com [Thu May 23 07:07:48 2002]
Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 07:22:52 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 07:22:52 2002]
conserver (15629): best-1.4: logout cricket@becoming.weird.com [Thu May 23 07:22:53 2002]
Failed Failed Failed Failed Failed Failed Failed
warning: read() on stdin returned 0
Failed Failed Failed Failed Failed Failed Failed
conserver (15629): lost carrier on hubly (constantly/2001)! [Thu May 23 07:42:28 2002]
conserver (15629): hubly: automatic reinitialization [Thu May 23 07:42:28 2002]
Failed Failed Failed Failed Failed Failed
warning: read() on stdin returned 0
Failed
conserver (15629): best-3.1-0: login cricket@becoming.weird.com [Thu May 23 07:51:32 2002]
Failed Failed Failed Failed

[[ .... blah, blah, blah .... ]]

conserver (14846): Stopped at Thu May 23 10:25:27 2002

--
Greg A. Woods
+1 416 218-0098; <gwoods@acm.org>; <g.a.woods@ieee.org>; <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>