So there is a small clustered system, and it appears one of the services is a touch unstable and the cluster isn't detecting the failure. The solution put in place: make the cluster fail over once in a while to restart the services.
Now I just can't figure this out. First, the admin in question had set it to restart the heartbeat (seriously bad idea) on the running active server, so that the non-active server starts up the service.
Where did he get that bright idea from? And why such a bludgeoning method? So I start looking round for whatever monitors and manages the cluster. It appears there isn't anything: the config runs and the heartbeat watches for the server to fail, but there is no actual management of the services. Poor, D- for you.
So I look to an old friend, Monit: a simple daemon on the server that can monitor services, files and free space, and either alert or restart the service (among other things). Simples.
So on it pops and off we go. Most Linux distributions have a copy; alternatively, the likes of RPMFusion will provide binaries.
Things you need to make sure you set up:
Set an email server
set mailserver <yourhostmailserver or localhost>
Set alert emails
set alert <youremailhere>
Set a config file directory, rather than putting your entire config in one file (because that's just plain confusing, or perhaps that's what you like)
include /etc/monit.d/*
I would recommend this location, because it's a sensible name and we all like being sensible, don't we, children?
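Pulled together, the global settings above might look something like the sketch below. This is only an illustration: the control file path varies by distribution (commonly /etc/monitrc or /etc/monit.conf), the 60 second poll interval on the set daemon line is my assumption rather than anything taken from the cluster in question, and admin@example.com is a placeholder address.

```
# Global Monit control file (path varies by distribution)
set daemon 60                  # assumed poll interval: check every 60 seconds
set mailserver localhost       # or your site's mail relay
set alert admin@example.com    # placeholder address for alert emails
include /etc/monit.d/*         # one file per monitored service
```

Running monit -t after editing should check the control file syntax before you reload anything.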
So in /etc/monit.d/myborkedservice I have the following:
check process borkedservice with pidfile /location/of/pidfile
start program = "/etc/init.d/borkedservice start"
stop program = "/etc/init.d/borkedservice stop"
Monit will automatically alert (via the email address set above) on either a failure or a changed pid (i.e. the process has restarted).
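For reference, a complete service file along these lines might read as follows. Treat it as a sketch under assumptions: Monit wants a unique name after check process, so I have used borkedservice here, and the restart limit on the last line is an optional extra that is not part of the setup described above.

```
# /etc/monit.d/myborkedservice
check process borkedservice with pidfile /location/of/pidfile
  start program = "/etc/init.d/borkedservice start"
  stop program  = "/etc/init.d/borkedservice stop"
  # Optional (not in the original setup): give up if the service keeps flapping
  if 5 restarts within 5 cycles then unmonitor
```

The restart limit is there to stop Monit restarting the service forever if it keeps falling over; the numbers are purely illustrative.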
On the non-active server I have the additional line:
mode passive
This alerts but does not actually restart the service. When I figure out how to get this to work with HA I'll post again. However, I am going to presume there is a better method (like getting the borkservice(tm) to work and/or getting HA working properly).