ITEM: M9247L

srcmaster was killed. Having problems.


Env: AIX 3.2.5, Model 990
Desc:
The customer did a kill -9 on srcmstr. Is there a more graceful
way to recover when commands return error number 0513-053?

Killing srcmstr with kill -9 does not stop the services it manages,
but it does end SRC control of them. srcmstr is configured to
respawn, and it does not restart the services when it respawns.
Even though the services are still running, you will see
"inoperative" for all the services when you issue an lssrc -a.
This is because you killed srcmstr.  When you did that, the new
srcmstr lost track of the process IDs (PIDs) of the services, and
the services no longer have srcmstr as their parent (PPID).

The sockets that were corrupted were srcmstr sockets.  However,
the services themselves (TCP/IP and SNA, for example) were actually
running and operative. How do we determine which services of srcmstr
were corrupted, and how can the offending sockets be identified and
resolved without impacting the rest of the services?
Desc: The problem is that they get the above error when they do
an lssrc -a. This would mean that their srcmstr is corrupt, that
they do not have the /etc/objrepos/SRC* files, or that some small
service is corrupt and they cannot tell which one because lssrc does
not work. So what they want is a way to find out which service
is corrupted without using lssrc.
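
One way to approach this without lssrc is to look at the process table
directly. Services started by the current srcmstr have it as their
parent, while services orphaned by the kill are typically reparented to
init (PID 1). The sketch below is illustrative only: the helper names
pid_of and children_of are hypothetical, and it assumes that ps -ef
prints UID, PID, and PPID as its first three columns and that the
command of interest is shown without arguments.

```shell
#!/bin/sh
# Hypothetical helpers; both read "ps -ef" output on standard input.

# Print the PID of every process whose command basename equals $1.
# Assumes the command is the last field on the line (no arguments shown).
pid_of() {
    awk -v name="$1" 'NR > 1 {
        cmd = $NF
        sub(/.*\//, "", cmd)      # strip any leading directory
        if (cmd == name) print $2
    }'
}

# Print every ps line whose PPID (third column) equals $1.
children_of() {
    awk -v parent="$1" 'NR > 1 && $3 == parent'
}
```

For example, after saving a snapshot with "ps -ef > /tmp/ps.out",
running children_of "$(pid_of srcmstr < /tmp/ps.out)" < /tmp/ps.out
lists what the current srcmstr started, and children_of 1 < /tmp/ps.out
lists candidates for orphaned services.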

When a reboot was recommended, the customer did not want to reboot the 
system.  Therefore, the customer took the following actions:

   # kill -9 "pid of srcmstr"
   # lssrc -a  (showed all processes inoperative)

So he then killed and started all the services again.  (This step was
unnecessary since the services were really running, and were only
showing as inoperative because he had killed srcmstr.)

kill -9 causes a death-of-child signal to be sent for srcmstr, which
then respawns. None of the services it managed will be registered
with the new srcmstr (however, they will still be running). So when
he started the services again, there were multiple copies of the
processes running. This is not a desirable state.
Advised the customer that they will need to reboot their machine.
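
To see whether restarting services has left more than one copy running,
the process list can be checked for repeated command names. This is a
rough sketch only: duplicated_commands is a hypothetical helper, it
assumes ps -e prints PID, TTY, TIME, and CMD columns, and some daemons
normally run several copies, so the output is a list of candidates
rather than proof of a problem.

```shell
#!/bin/sh
# Hypothetical helper; reads "ps -e" output (PID TTY TIME CMD) on standard
# input and prints "command count" for each command seen more than once.
duplicated_commands() {
    awk 'NR > 1 { seen[$4]++ }
         END { for (cmd in seen) if (seen[cmd] > 1) print cmd, seen[cmd] }' | sort
}
```

It would be run as: ps -e | duplicated_commands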

I have been doing some research.  Basically, in every case I have found
with the same error message (0513-053), the cause was either comm
problems (such as bad Ethernet cables), a corrupted filesystem (so
corrupted that they had to reformat the hard drives and reinstall), or
a corrupted or missing /dev/.SRC-unix directory.  The way to clean up
/dev/.SRC-unix is to copy it somewhere else and remove it; it should
then be rebuilt at system reboot (all of these cases called for
reboots).  If it is not rebuilt, then there is something seriously wrong.
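
The copy-then-remove step could be sketched as below. This is an
illustration under assumptions, not a verified procedure:
save_and_remove is a hypothetical helper, the destination path in the
usage line is only an example, and not every cp implementation will
copy socket files. If the copy fails, the function stops and leaves the
original directory in place.

```shell
#!/bin/sh
# Hypothetical helper: copy a directory aside, then remove the original.
# Example use (paths are illustrative):
#   save_and_remove /dev/.SRC-unix /tmp/SRC-unix.saved
save_and_remove() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "$src is not a directory" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists" >&2
        return 1
    fi
    # -R copies recursively; -p keeps modes and timestamps for later inspection.
    if ! cp -Rp "$src" "$dest"; then
        echo "copy failed; $src left in place" >&2
        return 1
    fi
    rm -rf "$src"
}
```

The saved copy can then be examined for obvious corruption after the
reboot.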

Next: I want to make sure you are not experiencing any FS corruption.
I will have you run "fsck /dev/hd4" while the system is up and running.
This will not fix anything, but it will tell us if you have any corruption.

Action: You did do a shutdown last night and rebooted.  You are
checking out the machine.  Everything is working fine.

You are concerned that this might happen again.  Your systems need to
be running and available (that's why you are running HACMP), and you
very well might not be in a position to reboot if it does happen
again.  We talked, and I really feel that what happened to you is
very rare.

Next:  You will call in on this problem if you start to see those
errors again.  I also feel that if you see those errors again we
really need to look at the /dev/.SRC-unix directory.  Unfortunately I
will want you to copy it to a different location, remove it, and
reboot so the system will rebuild a new one.  I would want to look at
it though, to see if there is any obvious corruption.


Dated: November 1994 Category: N/A