Tuesday 17 December 2013

Linux (RHEL) - Simple application watchdog script (aka how to make an application automatically restart)

To make sure some of our applications run non-stop and need no human intervention to be restarted after a failure, we usually rely on a cluster infrastructure (in our case Red Hat Cluster Suite). This works great when you have the hardware to run it, but sometimes we also need this behaviour on a standalone machine.

This is accomplished with a watchdog script that mimics the behaviour of the cluster (as far as application monitoring is concerned).

In the newer Red Hat EL versions (6.0 and later) a similar effect can be achieved with the upstart respawn option. But the watchdog script has some advantages, since it allows for a bit more fine-tuning. Besides, I'm still stuck on RHEL 5, which has no upstart, only sysvinit.

To make implementation and deployment fast and easy I have a template script which enables me to add monitoring to a script/service in a couple of minutes.

The template script can be downloaded here: app_wathdog_template.sh

I'll try to explain how it works:

First, you must replace all instances of the text <app> with the name of the script/service you are trying to monitor.

This can quickly be done in vi with:

Press ESC (to enter command mode)

:%s/<app>/myapp/g

(myapp is your script/service name :-) )
Press Enter
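
If you prefer to do the replacement from the command line, here is a minimal sketch using sed. The fill_template function and the output file name are my own placeholders, they are not part of the template. It assumes the service name contains no "/" or "&" characters, since those are special to sed.

```shell
#!/bin/sh
# Hypothetical helper: print the template with every literal "<app>"
# replaced by the given service name.
# Assumes the name contains no "/" or "&" (both are special to sed).
fill_template() {
    template=$1
    app_name=$2
    sed "s/<app>/${app_name}/g" "$template"
}

# Usage (the output file name is only an example):
#   fill_template app_wathdog_template.sh myapp > myapp_watchdog.sh
```

This leaves the downloaded template untouched, so you can reuse it for the next service.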

Next you need to alter the settings in the script:
#***************************************************
#APP specific Configurations
INTERVAL=30     #Interval between status checks
LOCKFILE=/var/run/<app>.pid #monitor app only if this file exists.
SERVICE_SCRIPT=/etc/init.d/<app>
SYSLOGNAME=<app>watchdog
#***************************************************

INTERVAL is the number of seconds between each status check

LOCKFILE is the file that switches the monitor cycle on. Usually it can be the application's pidfile or lockfile, or anything else you want. Just make sure it only exists when your application should be running.

SERVICE_SCRIPT is the complete path to the service's init script. The watchdog calls it with the status argument to check the application.

SYSLOGNAME is the tag name that is sent to syslog. This makes it easy to find the watchdog's actions in the system's logs.
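
Since a typo in these settings only shows up when the watchdog is already running, a quick sanity check can help. This is just a sketch: check_config is a hypothetical helper of mine, not something the template provides.

```shell
#!/bin/sh
# Hypothetical helper, not part of the template: sanity-check the settings.
# Returns 0 if they look usable, 1 otherwise (with a message on stderr).
check_config() {
    interval=$1
    service_script=$2
    # INTERVAL must be a whole number of seconds...
    case $interval in
        ''|*[!0-9]*)
            echo "INTERVAL must be a positive integer" >&2
            return 1 ;;
    esac
    # ...and greater than zero, otherwise the loop would never pause.
    if [ "$interval" -le 0 ]; then
        echo "INTERVAL must be greater than 0" >&2
        return 1
    fi
    # SERVICE_SCRIPT must be executable, since the watchdog calls it.
    if [ ! -x "$service_script" ]; then
        echo "$service_script is not executable" >&2
        return 1
    fi
    return 0
}

# Usage inside the watchdog, before the monitor loop starts:
#   check_config "$INTERVAL" "$SERVICE_SCRIPT" || exit 1
```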

The main action happens in the monitor function, which I'll explain below:

monitor()
{
   #infinite loop until stop is called
   while true
   do
      #no restarting if there is no lockfile, this ensures there are no unsolicited restarts
      #it also means that the application must delete its lockfile when stop is called.
      if [ -f "$LOCKFILE" ]
      then
         #discard both stdout and stderr, only the exit code matters
         "$SERVICE_SCRIPT" status >/dev/null 2>&1
         STATUS=$?
         if [ "$STATUS" -gt 0 ]
         then
            logger -t "$SYSLOGNAME" "Status returned ERROR!"
            doRestart
         fi
      fi
      sleep "$INTERVAL"
   done
}

First it enters an infinite loop which checks the app every $INTERVAL seconds.

The monitoring is only done if it finds the LOCKFILE. This ensures it only restarts the application if it is supposed to be running. It also implies that the application's script needs to remove the LOCKFILE when it stops. If the LOCKFILE is not removed once the application is stopped, the watchdog will continue to check the script status, which could lead to a restart loop.
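
To make the LOCKFILE contract concrete, here is a rough sketch of what the service script's start and stop could look like. The lockfile path is an assumption (RHEL init scripts conventionally use /var/lock/subsys/<name>), and the actual start/stop commands are left out because they depend on your application.

```shell
#!/bin/sh
# Sketch of the lockfile handling the watchdog expects from the service script.
# The default path is an assumption; use whatever LOCKFILE the watchdog checks.
LOCKFILE=${LOCKFILE:-/var/lock/subsys/myapp}

start() {
    # ... start the application here (application specific) ...
    # Create the lockfile: from now on the watchdog monitors the service.
    touch "$LOCKFILE"
}

stop() {
    # ... stop the application here (application specific) ...
    # Remove the lockfile: the watchdog stops checking, so a deliberate
    # stop is not followed by an unsolicited restart.
    rm -f "$LOCKFILE"
}
```

The important part is the order of events on a deliberate stop: once the lockfile is gone, the watchdog leaves the service alone.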

The watchdog relies on the exit code of the service/script status command to decide whether a restart is required.

By default, if the status returns a value other than 0 then the script/service is restarted. This also sends messages through syslog so you can debug later.

You can configure this loop to do whatever you want based on the status return value. :-)
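
As an example of such a customisation, the sketch below maps the exit codes that the LSB init script convention defines for status (0 running, 1 and 2 dead with a stale pidfile or lockfile, 3 not running, 4 unknown) to an action. decide_action and the "alert" action are hypothetical names of mine, and your service script may not follow the LSB codes, so check what its status really returns first.

```shell
#!/bin/sh
# Hypothetical helper: choose an action from the status exit code.
# It only prints the decision; the caller performs the action.
decide_action() {
    case $1 in
        0)     echo none ;;      # running, nothing to do
        1|2|3) echo restart ;;   # dead or not running
        4)     echo alert ;;     # status unknown: do not restart blindly
        *)     echo restart ;;   # anything else: same as the default behaviour
    esac
}

# Usage inside monitor(), after STATUS=$? :
#   case $(decide_action "$STATUS") in
#       restart) doRestart ;;
#       alert)   logger -t "$SYSLOGNAME" "Status unknown, not restarting" ;;
#   esac
```

Treating code 4 as "alert only" is a design choice, not a rule. If you would rather restart on any failure, the original -gt 0 test already does that.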












