Monitoring the mail-database exchanger (mdx)

Date: Mon, 28 Aug 2006

The mail-database exchanger script periodically updates a few entries in the runtime_info table to report its status. These entries can be checked by an external program; this article demonstrates how.

The relevant keys (runtime_info.rt_key field) are last_alive, last_sent and last_import.

The time values are expressed as POSIX timestamps.

mail=> select * from runtime_info where rt_key in ('last_alive', 'last_sent', 'last_import');
   rt_key    |  rt_value
-------------+------------
 last_alive  | 1156772479
 last_sent   | 1156754648
 last_import | 1156771322
(3 rows)

The date can be converted to a human-readable form by a one-line script in Perl:

$ perl -e 'print scalar(localtime(1156772479));'
Mon Aug 28 15:41:19 2006

The last_alive entry gets updated only if the 'alive_interval' configuration parameter is set in the manitou-mdx config file, which is not the case by default.

To check whether manitou-mdx is running, we can create a simple script that connects to the database, reads one or several of these entries, and compares them to an expected result. For last_alive, that result is the easiest to define. The configuration parameter 'alive_interval' specifies how many seconds elapse between two updates of 'last_alive'. If the difference between the current time and the value of 'last_alive' is significantly higher than 'alive_interval', then it can be assumed that manitou-mdx is no longer running, or that something prevents it from updating the entry (it could be stuck waiting for a database lock, for example).
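
The comparison itself can be sketched as a small Perl function. This is only an illustration: the name is_stale and the tolerance factor of 2 are assumptions made for the example, not something provided by manitou-mdx.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of manitou-mdx): returns true when
# 'last_alive' is older than the allowed delay.
#   $last_alive     : POSIX timestamp read from runtime_info
#   $now            : current POSIX timestamp, normally time()
#   $alive_interval : the 'alive_interval' configuration value, in seconds
#   $slack          : tolerance factor (assumed default: 2)
sub is_stale {
  my ($last_alive, $now, $alive_interval, $slack) = @_;
  $slack = 2 unless defined $slack;
  return ($now - $last_alive) > $alive_interval * $slack;
}

# Example with alive_interval=300: an entry that is 400 seconds old is
# still tolerated, an entry that is 700 seconds old is not.
print is_stale(1156772479, 1156772479 + 400, 300) ? "stale\n" : "ok\n";
print is_stale(1156772479, 1156772479 + 700, 300) ? "stale\n" : "ok\n";
```

Allowing some slack over 'alive_interval' avoids false alarms when an update is merely a little late, for instance while the mdx is busy importing a large message.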

In addition, this script can be hosted on a different machine than manitou-mdx and the database, and so will still be able to report if one of those is down.

Below is an example of such a script, in Perl. It assumes the existence of an environment variable named MANITOU_CONNECT_STRING that contains a valid DBI connect string, for example: dbi:Pg:host=pgserver;dbname=mail;user=manitou

#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

# The maximum number of seconds allowed between the
# 'last_alive' value of the database and the current time.
# When the difference between these two becomes higher
# than ALIVE_INTERVAL_MAX, the alert is triggered.
my $ALIVE_INTERVAL_MAX=600;

# Change these for real addresses
my $ALERT_EMAIL="alert\@domain.tld";
my $FROM_EMAIL="alert-sender\@domain.tld";

# A file created when an alert is sent
# The alert won't be sent again until this file is removed, either
# by us when detecting that the mdx is up again, or
# by another program, for instance the mdx start script
my $ALERT_LCK="/var/tmp/manitou-alert.lck";

sub alert {
  my $msg=shift;
  if (! -f $ALERT_LCK) {
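    # If no lockfile, create one
    if (open(F, ">$ALERT_LCK")) {
      # scalar() yields a readable date instead of a list of numbers
      print F scalar(localtime(time)), "\n";
      close(F);
    }
    else {
      warn "cannot create $ALERT_LCK: $!";
    }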
    # and send the alert
    alert_mail($msg);
  }
}

sub alert_mail {
  my $msg=shift;
  open(F, "|/usr/sbin/sendmail -t -f $FROM_EMAIL") or die $!;
  print F "From: $FROM_EMAIL\n";
  print F "To: $ALERT_EMAIL\n";
  print F "Subject: alert about manitou-mdx\n";
  print F "\n";			# end of header
  print F "This is an automatically generated alert\n\n";
  print F "Error message:\n$msg\n";
  close(F);
}

my $cnx_string=$ENV{'MANITOU_CONNECT_STRING'};
if (!defined($cnx_string)) {
  die "Missing MANITOU_CONNECT_STRING environment variable";
}

my $dbh=DBI->connect($cnx_string);
if (!$dbh) {
  alert("unable to connect to database: $DBI::errstr");
  exit 1;
}

my $sth=$dbh->prepare("SELECT rt_value FROM runtime_info WHERE rt_key='last_alive'");
$sth->execute;
my @r=$sth->fetchrow_array;
# if there's no entry, we consider there's no error
if (@r) {
  if (time-$r[0] > $ALIVE_INTERVAL_MAX) {
    my $d=strftime("%d/%m/%Y %H:%M:%S", localtime($r[0]));
    alert("manitou-mdx appears to be down since $d");
  }
  else {
    # if the mdx is running and there's an alert lockfile, then remove it
    # in order not to block further alerts
    if (-f $ALERT_LCK) {
      unlink($ALERT_LCK);
    }
  }
}
$sth->finish;
$dbh->disconnect;

Similarly, the last_import entry could be used to detect a problem in the mail chain. For example, if a mail system that is generally busy hasn't processed a single incoming message for several hours, that could be considered suspicious enough to trigger an alert.
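
As a rough sketch of that idea, the function below checks several entries against per-key thresholds. The name check_entries and the 4-hour limit for last_import are assumptions chosen for illustration; a suitable limit depends on the usual traffic of the site.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical thresholds, in seconds; adjust to the traffic of the site.
my %MAX_AGE = (
  last_alive  => 600,        # same value as $ALIVE_INTERVAL_MAX above
  last_import => 4 * 3600,   # assumed: 4 hours without incoming mail is suspicious
);

# Returns one message per entry that is older than its threshold.
# $info is a reference to a hash that maps rt_key to rt_value, as read
# from runtime_info. Missing entries are ignored, as in the script above.
sub check_entries {
  my ($info, $now) = @_;
  my @problems;
  for my $key (sort keys %MAX_AGE) {
    next unless defined $info->{$key};
    if ($now - $info->{$key} > $MAX_AGE{$key}) {
      my $d = strftime("%d/%m/%Y %H:%M:%S", localtime($info->{$key}));
      push @problems, "$key has not been updated since $d";
    }
  }
  return @problems;
}

# Example with the values shown at the top of this article,
# evaluated 5 hours after the last import: both entries are reported.
my %info = (last_alive => 1156772479, last_import => 1156771322);
print "$_\n" for check_entries(\%info, 1156771322 + 5 * 3600);
```

In the monitoring script, each returned message could be passed to alert(). Since a single lockfile suppresses repeated alerts there, a separate lockfile per key may be preferable if the two conditions should be reported independently.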