John R. Jackson jrj@gandalf.cc.purdue.edu
Sat, 10 Aug 2002 17:26:21 -0700 (PDT)
One of my (more annoying :-) cohorts noticed the timestamps were missing for one of the log files he was looking at. After poking around, I found about 1/3 of the child processes looked "normal":

    # psig 12550 | grep ^ALRM
    12550: ALRM caught 0

while the others looked "odd":

    # psig 12555 | grep ^ALRM
    12555: ALRM caught RESETHAND,NODEFER

and there was a direct correlation between "normal"/"odd" and whether the timestamps were working.

Sending an ALRM signal to an "odd" process by hand caused it to die, although at least the master restarted it. However, the flags on the restarted process were still "wrong".

Looking at the code, I noticed some sleep() and usleep() calls that could potentially disturb the ALRM handler and clearly needed to be protected. However, a debugging session showed the real culprit to be, of all things, TCP wrappers. It messes with the ALRM signal and makes no attempt whatsoever to save/restore it (grrrr). So every time a connection was made and hosts_access() was called, our ALRM handler was clobbered (of course, it only took one).

The following code adds a small function that must be called from any place that might mess up the handler. Future coding should be sure to call it if any sleep() or usleep() calls are added (maybe sleep/usleep should be wrapped with our own code and never be called directly?).

This was all tested on Solaris 2.8 with conserver 7.2.2, although the problem first showed up on my production system, which is Solaris 2.6 and conserver 7.1.3 (an upgrade of both is imminent).

John R. Jackson, Technical Software Specialist, jrj@purdue.edu
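
The patch itself is not reproduced in this copy of the message, so what follows is only a minimal sketch of the idea described above, not the actual conserver change. It assumes the fix amounts to re-installing the SIGALRM handler with sigaction() and plain flags (no RESETHAND or NODEFER) after any call that may have replaced it. The names on_alarm, reset_alarm_handler, clobbering_call and call_protected are hypothetical, and clobbering_call is a stand-in for hosts_access() so the example does not depend on libwrap.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the timestamp handler; the real one is
 * not shown in the post. It only records that SIGALRM arrived. */
static volatile sig_atomic_t alarm_seen = 0;

static void on_alarm(int sig)
{
    (void)sig;
    alarm_seen = 1;
}

/* Install (or re-install) the SIGALRM handler with sa_flags set to 0,
 * so the disposition is not reset to the default after the first
 * delivery. Returns 0 on success, -1 on failure. */
int reset_alarm_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGALRM, &sa, NULL);
}

/* Stand-in (assumption) for a library call such as hosts_access()
 * that changes the SIGALRM disposition and does not restore it. */
void clobbering_call(void)
{
    signal(SIGALRM, SIG_DFL);
}

/* Run a call that may disturb SIGALRM, then put our handler back. */
int call_protected(void (*fn)(void))
{
    fn();
    return reset_alarm_handler();
}
```

With something like this, a caller would use call_protected(clobbering_call) in place of the bare call. The same pattern could back private wrappers around sleep() and usleep(), as suggested above.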