Configuring Xymon Alerts

When something breaks, you want to know about it. Since you probably don't have the Xymon webpages in view all of the time, Xymon can generate alerts to draw your attention to problems. Alerts can go out as e-mail, or Xymon can run a script that takes care of activating a pager, sending an SMS, or however you prefer to get alerted.

A simple alert configuration

The configuration file for the Xymon alert module is ~/server/etc/alerts.cfg. This file consists of a number of rules that are matched against the name of the host that has a problem, the name of the service, the time of day and a number of other criteria. Each rule then has a number of recipients that receive the alert. For each recipient you can further refine the rules that need to be matched. An example:

	HOST=www.foo.com
		MAIL [email protected] SERVICE=http REPEAT=1h
		MAIL [email protected] SERVICE=cpu,disk,memory

The first line defines a rule for alerting when something breaks on the host "www.foo.com".
There are two recipients. The first is notified if the "http" service fails, and the notification is repeated once an hour until the problem is resolved.
The second is notified if the "cpu", "disk" or "memory" tests report a failure. Since there is no "REPEAT" setting for this recipient, the default applies, which is to repeat the alert every 30 minutes.

OK, suppose now that the webmaster complains about getting e-mails at 4 AM. The webserver is not supposed to be running between 9 PM and 8 AM, so even though there is a problem, he doesn't want to hear about it until 7:30 AM, which gives him just enough time to fix it. So you must modify the rule so that it doesn't send out alerts to him until 7:30 AM:

	HOST=www.foo.com
		MAIL [email protected] SERVICE=http REPEAT=1h TIME=*:0730:2100
		MAIL [email protected] SERVICE=cpu,disk,memory

Adding the TIME setting on the recipient causes the alerts for this recipient to be suppressed, unless the time of day is within the interval. So with this setup, the webmaster gets his sleep.

What would have happened if you had put the TIME setting on the rule instead of on the recipient? Like this:

	HOST=www.foo.com TIME=*:0730:2100
		MAIL [email protected] SERVICE=http REPEAT=1h
		MAIL [email protected] SERVICE=cpu,disk,memory

Well, the webmaster would still have his nights to himself, but the TIME setting would then also apply to the alerts that go out when there is a problem with the "cpu", "disk" or "memory" services. So no mail would go to the second recipient when a disk fills up during the night.
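
As a further illustration of a recipient-level TIME setting, the sketch below limits one recipient to weekday office hours while a second recipient is notified at any time. It assumes that the day field accepts "W" for Monday through Friday, as described for DOWNTIME in hosts.cfg(5). The two addresses are placeholders invented for this example.

```
HOST=www.foo.com
	MAIL webmaster@example.com SERVICE=http TIME=W:0800:1700
	MAIL oncall@example.com SERVICE=http
```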

Keywords in rules and recipients

These are the keywords for setting up rules:

PAGE: Rule matching an alert by the name of the page the host is displayed on. This is the name following the "page", "subpage" or "subparent" keyword in the hosts.cfg file.
EXPAGE: Rule excluding an alert if the page name matches.
HOST: Rule matching an alert by the hostname.
EXHOST: Rule excluding an alert by matching the hostname.
SERVICE: Rule matching an alert by the service name.
EXSERVICE: Rule excluding an alert by matching the service name.
COLOR: Rule matching an alert by color. Can be "red", "yellow", or "purple".
TIME: Rule matching an alert by the time of day. This uses the same syntax as the DOWNTIME time specification in the hosts.cfg file (see hosts.cfg(5)).
DURATION: Rule matching an alert if the event has lasted longer/shorter than the given duration. E.g. DURATION>10m (lasted longer than 10 minutes) or DURATION<2h (only sends alerts the first 2 hours). Unless explicitly stated, this is in minutes - you can use 'm', 'h', 'd' for 'minutes', 'hours' and 'days' respectively.
UNMATCHED: This keyword on a recipient means that the recipient only gets an alert if no other alerts have been sent. For example, after setting up alerts to specific people for some services, you can add a recipient with the UNMATCHED keyword who only gets the alerts that were not sent to anyone else. You can also use it to set up a "catch-all" recipient: put the UNMATCHED keyword on a recipient at the end of the alerts.cfg file.
RECOVERED: Rule matches if the alert has recovered from an alert state.
NOTICE: Rule matches if the message is a "notify" message. This type of message is sent when a host or test is disabled or enabled.

These are the keywords for specifying a recipient:

MAIL: Recipient who receives an e-mail alert. This takes one parameter, the e-mail address.
SCRIPT: Recipient that invokes a script. This takes two parameters: The script filename, and the recipient that gets passed to the script.
IGNORE: Recipient that does NOT send an alert, and will cause Xymon to stop looking for any more recipients. See the example below.
FORMAT: Format of the text message with the alert. Default is "TEXT" (suitable for e-mail alerts). "PLAIN" is the same as TEXT, except it does not include the URL linking to the status webpage. "SMS" is a short message with no subject for SMS alerts. "SCRIPT" is a brief message template for scripts.
REPEAT: How often an alert gets repeated. As with the DURATION setting, this is in minutes unless explicitly modified with 'm', 'h', 'd'.
STOP: By default, xymond_alert looks at all the possible recipients in the alerts.cfg file when handling an alert. If you would like it to stop after a specific recipient gets an alert, add the STOP keyword to this recipient. This terminates the search for more recipients.
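
The rule and recipient keywords above can be combined. The fragment below is a sketch with a made-up host and made-up addresses: the first recipient is only alerted once a disk problem has lasted more than 10 minutes and is then reminded every 2 hours, while the last recipient acts as a catch-all for red alerts that nobody else received.

```
HOST=db1.example.com SERVICE=disk
	MAIL dba@example.com DURATION>10m REPEAT=2h

HOST=* COLOR=red
	MAIL fallback@example.com UNMATCHED
```

Placing the UNMATCHED recipient at the end of the file matters, since it only makes sense after the more specific rules have had their chance to match.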

Wildcards - regular expressions

So now we can set up an alert. But using explicit hostnames is bothersome if you have many hosts. There is a smarter way:

	HOST=%(www|intranet|support|mail).foo.com
		MAIL [email protected] SERVICE=http REPEAT=1h
		MAIL [email protected] SERVICE=cpu,disk,memory

The percent-sign indicates that the hostname should not be taken literally - instead, (www|intranet|support|mail).foo.com is a Perl-compatible regular expression. This particular expression matches "www.foo.com", "intranet.foo.com", "support.foo.com" and "mail.foo.com". You can use regular expressions to match hostnames, service-names and page-names.
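
One detail of the expression above is that an unescaped dot matches any character, not only a literal ".". The shell sketch below illustrates this with grep -E, which uses POSIX extended regular expressions rather than the PCRE engine Xymon uses, so treat it as an approximation. The "strict" pattern and the hostnames "wwwxfoo.com" and "ftp.foo.com" are made up for the comparison, and whether Xymon anchors the expression is not something this sketch verifies.

```shell
#!/bin/sh
# Approximate the HOST= pattern with grep -E (POSIX ERE, not Xymon's PCRE).

matches() {
    # $1 = pattern, $2 = hostname; prints "match" or "no match"
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# Pattern as written in the example: dots match any character, no anchors.
loose='(www|intranet|support|mail).foo.com'
# Hypothetical stricter variant: escaped dots, anchored at both ends.
strict='^(www|intranet|support|mail)\.foo\.com$'

for host in www.foo.com mail.foo.com wwwxfoo.com ftp.foo.com; do
    echo "$host loose=$(matches "$loose" "$host") strict=$(matches "$strict" "$host")"
done
```

In an alerts.cfg file the stricter variant would be written after the percent-sign in the same way as the original expression.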

If you want to test how your alert configuration handles a specific host, you can run xymond_alert in test mode - you give it a hostname and servicename as input, and it will go through the configuration and tell you which rules match and who gets an alert.


	osiris:~ $ cd server/
	osiris:~/server $ ./bin/xymoncmd xymond_alert --test osiris.hswn.dk cpu
	Matching host:service:page 'osiris.hswn.dk:cpu:' against rule line 109:Matched
	    *** Match with 'HOST=*' ***
	Matching host:service:page 'osiris.hswn.dk:cpu:' against rule line 110:Matched
	    *** Match with 'MAIL [email protected] REPEAT=2 RECOVERED COLOR=red' ***
	Mail alert with command 'mail -s "XYmon [12345] osiris.hswn.dk:cpu is RED" [email protected]'

If e-mail is not enough

The MAIL keyword means that the alert is sent in an e-mail. Sometimes this ends up being an SMS to your cell-phone, since there are several "e-mail to SMS" gateways that perform this service, but that may not be what you want. Also, an e-mail is only delivered if the mail-server is working. So if you need full control over how alerts are handled, you can use the SCRIPT method instead. Here's how:

	HOST=%(www|intranet|support|mail).foo.com SERVICE=http
		SCRIPT /usr/local/bin/smsalert 4538761925 FORMAT=sms

This alert doesn't go out as e-mail. Instead, when an alert needs to be delivered, Xymon will run the script /usr/local/bin/smsalert. The script can use data from a series of environment variables to build the information it sends in the alert, depending on what the recipient can handle. E.g. for pagers you will typically just send a sequence of numbers - Xymon provides things like the IP-address of the server that has a problem and a numeric code for the service to the script. So a simple script to send an SMS alert with the "sendsms" tool could look like this:

	#!/bin/sh
	# Pass the recipient from alerts.cfg and the alert text to sendsms.

	/usr/local/bin/sendsms "$RCPT" "$BBALPHAMSG"

Here you can see the script use two environment variables that Xymon sets up for the script: $RCPT is the recipient, i.e. the phone-number "4538761925" that is in the alerts.cfg file. $BBALPHAMSG is the text of the status that triggers the alert.

Although $BBALPHAMSG is nice to have, not all recipients can handle the large messages that may be sent in the status message. The FORMAT=sms setting tells Xymon to change BBALPHAMSG into a form that is suitable for an SMS message, which has a maximum size of 160 bytes. So Xymon picks out the most important bits of the status message, and puts as much of that as possible into the BBALPHAMSG variable for the script.
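
A slightly more defensive version of the SMS script could look like the sketch below. It is only an illustration: the SENDSMS override variable and the build_sms_text helper are names invented here, and the extra truncation is a safety net for the case where FORMAT=sms was forgotten in alerts.cfg, not something Xymon requires.

```shell
#!/bin/sh
# Hypothetical wrapper around the "sendsms" tool from the example above.
# SENDSMS can be overridden, e.g. for testing; it is not a Xymon setting.
SENDSMS="${SENDSMS:-/usr/local/bin/sendsms}"

build_sms_text() {
    # Keep at most the first 160 bytes of the message
    # (some shells count characters instead of bytes here).
    printf '%.160s' "$1"
}

# Only send when Xymon has supplied a recipient.
if [ -n "${RCPT:-}" ]; then
    "$SENDSMS" "$RCPT" "$(build_sms_text "${BBALPHAMSG:-}")"
fi
```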

The full list of environment variables provided to scripts is as follows:

BBCOLORLEVEL: The current color of the status
BBALPHAMSG: The full text of the status log triggering the alert
ACKCODE: The "cookie" that can be used to acknowledge the alert
RCPT: The recipient, from the SCRIPT entry
BBHOSTNAME: The name of the host that the alert is about
MACHIP: The IP-address of the host that has a problem
BBSVCNAME: The name of the service that the alert is about
BBSVCNUM: The numeric code for the service. From SVCCODES definition.
BBHOSTSVC: HOSTNAME.SERVICE that the alert is about.
BBHOSTSVCCOMMAS: As BBHOSTSVC, but dots in the hostname replaced with commas
BBNUMERIC: A 22-digit number made by BBSVCNUM, MACHIP and ACKCODE.
RECOVERED: Is "1" if the service has recovered.
DOWNSECS: Number of seconds the service has been down.
DOWNSECSMSG: When recovered, holds the text "Event duration : N" where N is the DOWNSECS value.

This set of environment variables is the same as the one provided by Big Brother to custom paging scripts, so you should be able to re-use any paging scripts written for Big Brother with Xymon.
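
To show how several of these variables can work together, here is a sketch of a script that writes a one-line summary to syslog. The alert_summary function and the "xymon-alert" log tag are invented for this example, and the wording of the summary is arbitrary. Only the variable names come from the list above.

```shell
#!/bin/sh
# Hypothetical SCRIPT recipient that logs a short summary of each alert.

alert_summary() {
    # RECOVERED is "1" when the service has recovered.
    if [ "${RECOVERED:-0}" = "1" ]; then
        echo "RECOVERED ${BBHOSTSVC} after ${DOWNSECS}s"
    else
        echo "${BBCOLORLEVEL} ${BBHOSTSVC} (ack code ${ACKCODE})"
    fi
}

# Only log when Xymon has supplied a recipient.
if [ -n "${RCPT:-}" ]; then
    logger -t xymon-alert "to=$RCPT $(alert_summary)"
fi
```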

Save on the typing - use macros

Say you have a long list of hosts or e-mail addresses that you want to use several times throughout the alerts.cfg file. Do you have to write the full list every time? No:

	$WEBHOSTS=%(www|intranet|support|mail).foo.com 
	
	HOST=$WEBHOSTS SERVICE=http
		SCRIPT /usr/local/bin/smsalert 4538761925 FORMAT=sms

	HOST=$WEBHOSTS SERVICE=cpu,disk,memory
		MAIL [email protected]

The first line defines $WEBHOSTS as a macro. So everywhere else in the file, "$WEBHOSTS" is automatically replaced with "%(www|intranet|support|mail).foo.com" before the rule is processed. The same method can be used for recipients, e.g. e-mail addresses. In fact, you can put an entire line into a macro:

	$UNIXSUPPORT=MAIL [email protected] TIME=*:0800:1600 SERVICE=cpu,disk,memory

	HOST=%(www|intranet|support|mail).foo.com 
		$UNIXSUPPORT

	HOST=dns.bar.com
		$UNIXSUPPORT

This would be a perfectly valid way of specifying that the recipient defined in $UNIXSUPPORT gets e-mailed about cpu, disk or memory problems on the foo.com web-servers and on the bar.com DNS server.

Note: Nesting macros is possible, except that you must define a macro before you use it in a subsequent macro definition.
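
A sketch of what nesting could look like: $DAYTIME is defined first and then used inside the definition of $UNIXSUPPORT. The macro name $DAYTIME and the address are made up for this example.

```
$DAYTIME=*:0800:1600
$UNIXSUPPORT=MAIL unixsupport@example.com TIME=$DAYTIME SERVICE=cpu,disk,memory

HOST=dns.bar.com
	$UNIXSUPPORT
```

Swapping the first two lines would use $DAYTIME before it is defined, which is the case the note above warns about.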

There are rules ... and exceptions: IGNORE

A common scenario is where you handle most of the alerts with a wildcard rule, but there is just that one exception where you don't want any cpu alerts from the marketing server on Thursday afternoon. Then it is time for the IGNORE recipient:

	HOST=* COLOR=red
		IGNORE HOST=marketing.foo.com SERVICE=cpu TIME=4:1500:1800
		MAIL [email protected]

This defines a general catch-all alert: all red alerts go to the mailbox on the MAIL line. There is just one exception: when marketing.foo.com alerts on the "cpu" status on Thursdays between 3 PM and 6 PM, that alert is ignored. The IGNORE recipient implicitly has a STOP flag associated, so when the IGNORE recipient is matched, Xymon will stop looking for more recipients. The next line with the MAIL recipient is therefore never looked at when handling that busy marketing server on Thursdays.