Service Recovery Scripts & Error Page Tips.

A couple of weeks ago, I was proper ill with flu; the problem with looking after your own server is that only you can fix it - it's well and good having monitoring systems (nagios) telling you about faults, but if you can't read or see the alerts the fault won't get resolved.

During this time I was ill, for an unknown reason the mySQL process on my server died, as such my website (and others I look after) were down for 8 hours. The fix was simple, one command, restart the service and normal service was resumed (excuse the pun).

This led to me to the conclusion that there must be a way to get the server to fix it's self. after all, why do a job when you can get a computer to do it for you ! Fortunately I had a light bulb moment and realised that I could use the init scripts that are provided by redhat, the below code will restart apache (httpd) and mySQL on a redhat based system in the event that the service was not stopped cleanly. (In-fact this config has only be tested on CentOS, your mileage may vary on anything else)

#!/bin/bash

# taken from redhast default scripts - /etc/rc.d/init.d/functions

# Set up a default search path.
PATH="/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin"
export PATH

status() {
        local base=${1##*/}
        local pid

        # Test syntax.
        if [ "$#" = 0 ] ; then
                echo $"Usage: status {program}"
                return 1
        fi

        # First try "pidof"
        pid=`pidof -o $$ -o $PPID -o %PPID -x $1 ||
             pidof -o $$ -o $PPID -o %PPID -x ${base}`
        if [ -n "$pid" ]; then
# Uncomment this if you want OK messages
#               echo $"${base} (pid $pid) is running..."
                return 0
        fi

        # Next try "/var/run/*.pid" files
        if [ -f /var/run/${base}.pid ] ; then
                read pid < /var/run/${base}.pid
                if [ -n "$pid" ]; then
                        echo $"${base} dead but pid file exists"
                        /etc/init.d/${base} restart
                        return 1
                fi
        fi
        # See if /var/lock/subsys/${base} exists
        if [ -f /var/lock/subsys/${base} ]; then
                echo $"${base} dead but subsys locked"
                /etc/init.d/${base} restart
                return 2
        fi
        echo $"${base} is stopped"
        return 3
}

# found in /etc/init.d/httpd
httpd=${HTTPD-/usr/sbin/httpd}

status mysqld
status $httpd

If you save this, as /etc/cron.hourly/auto_recovery.sh , then do chmod +x /etc/cron.hourly/auto_recovery.sh , assuming you've not changed the default cron setup, every hour mySQL & httpd will be checked, if they have died the'll be restarted and root will get an e-mail about what happened.

Cool eh !

A final finishing touch: I wanted to change the default "Database Down" error messages on my two most popular applications.

Melvin Rivera has written a tutorial on how to customize the wordpress error page, note that it involves editing a file outside of wp-content, that means you'll have to re-do this "hack" every time you upgrade wordpress.

PHPBB: Setting a custom error page on that is really easy, first create a php page displaying your message. Then at the bottom of /path/to/phpbb-install/includes/db.php you'll see

// Make the database connection.
$db = new sql_db($dbhost, $dbuser, $dbpasswd, $dbname, false);
if(!$db->db_connect_id)
{
message_die(CRITICAL_ERROR, "Could not connect to the database");
}

change it to

 // Make the database connection.
$db = new sql_db($dbhost, $dbuser, $dbpasswd, $dbname, false);
if(!$db->db_connect_id)
{
 include("/path/to/my-custom-error-page.php");
        die();
}

Now if you database dies, for the time it's down (before cron fixes it) wordpress & phpbb sites would get a much prettier error message. Obviously there's no solution for apache as there's nothing to serve the pages, but hopefully this kind of thing doesn't happen to often :D