Xymon config/installation/manipulation notes:

Lessons learned:

  • xymond listens on port 1984 - useful for firewall restrictions.

  • Acknowledging an alert from the CLI:

    xymon 127.0.0.1 'hobbitdack ${alert_id} ${time_in_minutes} ${alert_msg}'
    
  • To segregate alerts by filesystem:

    • In analysis.cfg, ensure the appropriate filesystems (and other alerts) are grouped:

      HOST=%client[1-3]
          DISK /opt/app GROUP=mw 90 95
          DISK * GROUP=infra 90 95
      HOST=client4
          DISK /opt/app GROUP=dol 90 95
          DISK * GROUP=infra 90 95
      
    • In alerts.cfg, use GROUP name as the host:

      GROUP=mw
          MAIL $Middleware
      GROUP=dol
          MAIL $Dkoleary
      GROUP=infra
          MAIL $Mpiunix
      
  • To disable a test for a period of time:

    xymon 127.0.0.1 'disable ${host}.[${test}|*] ${minutes} ${free_text}'
    

    Set ${minutes} to -1 to disable the test until it comes back green again.
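A tiny wrapper makes the disable incantation harder to fat-finger. This is a sketch, not stock Xymon: the `xy_disable` name and the XYMSRV variable are my own, and it echoes the command (dry run) instead of sending it - drop the `echo` to make it live.

```shell
# Hypothetical helper around `xymon ... disable`; xy_disable and XYMSRV
# are assumptions, not part of stock Xymon.
XYMSRV=${XYMSRV:-127.0.0.1}

xy_disable() {
    # usage: xy_disable host test minutes reason...
    # pass '*' as test to disable all tests on the host;
    # pass -1 as minutes to disable until the test reports green again
    host=$1; tst=$2; mins=$3; shift 3
    # dry run: drop the leading echo to actually send the command
    echo xymon "$XYMSRV" "disable ${host}.${tst} ${mins} $*"
}

xy_disable client4 disk -1 "until the filesystem is cleaned up"
```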

  • To ID the alert_id of a test (and, in fact, to obtain quite a bit of other info about it):

    • xymon localhost 'xymondlog ${host}.${test}' displays the test status. See the xymon man page, xymondlog section, for details. Note: you do NOT have to be root to run it.

      $ xymon localhost 'xymondlog client4.disk'
      client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
      red Sat Oct  4 13:56:11 CDT 2014 - Filesystems NOT ok
      &red /opt/app (100% used) has reached the PANIC level (95%)
      
      Filesystem            1024-blocks    Used Available Capacity Mounted on
      /dev/mapper/vg00-root     1032088  370652    609008      38% /
      /dev/vda1                  495844   67751    402493      15% /boot
      /dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
      /dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
      /dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
      /dev/mapper/vg00-var      2064208  439152   1520200      23% /var
      /dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app
      
    • To display the alert id, parse the above output:

      $ xymon localhost 'xymondlog client4.disk' | head -1 | \
      awk -F\| '{print $11}'
      1578903790
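That awk parse can be sanity-checked without a live xymond by feeding it a canned status line (copied from the session above): the alert id is field 11 of the pipe-delimited first line.

```shell
# Field 11 of the pipe-delimited first line of xymondlog output is the
# alert id. The sample line is copied from the session above so the
# parse can be verified offline.
sample='client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|'

alert_id=$(printf '%s\n' "$sample" | awk -F'|' '{print $11}')
echo "$alert_id"    # 1578903790
```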
      
  • To display the results of a test across the env:

    # xymon localhost 'xymondboard test=lntp'
    client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    

    See xymon man page for details.
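Since xymondboard emits the same pipe-delimited format, a quick awk tally gives a per-color summary of a test across the environment. A self-contained sketch: the here-doc stands in for the live output (in real use, pipe `xymon localhost 'xymondboard test=lntp'` into the awk), and a made-up red line is included so the tally shows more than one bucket.

```shell
# Count statuses by color (field 3 of the xymondboard lines).
board_summary=$(awk -F'|' '{count[$3]++} END {for (c in count) print c, count[c]}' <<'EOF'
client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green
client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green
ldapsvr|lntp|red||1412468348|1412478507|1412480307|0|0|127.0.0.1||red
EOF
)
echo "$board_summary"
```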

Items to learn:

  • How to set up different scripts. For instance, for ntp testing

Notes:

08/17/14

  • Got xymon and four clients running. Downloaded rpms for the same version we’re using at work from http://terabithia.org/rpms/xymon/. Server and clients installed but not configured and running.
  • Still need to:
    • Edit /etc/xymon-client/xymonclient.cfg updating XYMONSERVERS.
    • Figure out the server configuration.

09/01/14:

  • Scratch that and reverse. Got xymon installed on a new VM, called xymon.
  • Got xymon-client running on six other clients.
  • xymon.conf for http access put in place automatically. That’s nice.
  • xymond listens on port 1984 - useful for firewall restrictions.
  • Got my ghost clients. Nice!
  • Read through the hosts.cfg man page. Nothing too out of the ordinary.
  • One interesting bit, though, was the .default. tag, used for identifying default tests on otherwise unidentified hosts. That’s how you get the new hosts in the ghosts page.
  • OK: got my two groups, client and infra, got clients all green and got one host in infra red.
  • Next goals:
    • ack alerts
    • rewrite ntp reporting.

09/02/14:

  • Read through alerts.cfg. I think I found out, at least initially, how to configure disk alerts to go to other people. Specific lines:

    For some tests - e.g. "procs" or "msgs" - the right group of people
    to alert in case of a failure may be different, depending on which
    of the client rules actually detected a problem. E.g. if you have
    PROCS rules for a host checking both "httpd" and "sshd" processes,
    then the Web admins should handle httpd-failures, whereas "sshd"
    failures are handled by the Unix admins.
    
    To handle this, all rules can have a "GROUP=groupname" setting. When
    a rule with this setting triggers a yellow or red status, the groupname
    is passed on to the Xymon alerts module, so you can use it in the alert
    rule definitions in alerts.cfg(5) to direct alerts to the correct group
    of people.
    

    Need to experiment a bit with that one.

09/05/14:

  • Files:

    • hosts.cfg: IDs the hosts to monitor and tests to run on them.
    • analysis.cfg: IDs specific parameters for each host:
      • memphys
      • memswap
      • memact
      • load
      • up
      • disk
    • alerts.cfg: IDs who gets alerted for what.
  • Updated analysis.cfg and alerts.cfg to direct emails for specific filesystems to specific groups. Trick is as follows:

    • analysis.cfg:

      HOST=%client[1-3]
          DISK /opt/app GROUP=mw 90 95
          DISK * GROUP=infra 90 95
      HOST=client4
          DISK /opt/app GROUP=dol 90 95
          DISK * GROUP=infra 90 95
      HOST=%xymon|ldapsvr|syslog
          DISK * GROUP=infra 90 95
      
    • alerts.cfg:

      GROUP=mw
          MAIL $Middleware
      GROUP=dol
          MAIL $Dkoleary
      GROUP=infra
          MAIL $Mpiunix
      
  • Didn’t get duplicate alerts, though. When client[14] were already alerting due to disk issues, the alert didn’t go out for /tmp. That may be expected. Will have to check on that w/Justin at some point.

09/06/14:

Remaining goals:

  • How to ID the alert number if it’s not emailed out. Answer: /var/lib/xymon/histlogs/${host}/${test}: Nope; not it.
  • How to script an alert on a client. (ntp)
  • How to send alerts to scripts (for further redirection to OVO)

Well, didn’t find out how to acknowledge a specific alert but I did find out how to disable the damned thing for a bit. That, at least, makes it go away for the duration. I disabled caauth until it comes live again. At work, I disabled walvdevwapp062’s memory until 0800 monday morning, and I disabled nap-lvad-075’s memory until it goes green again. Damn thing’s been yellow for pushing 20 days now...

Still, remaining goals:

  • How to script an alert on a client. (ntp)
  • How to send alerts to scripts (for further redirection to OVO)

10/04/14:

Been a bit. Vacation, new role at work, and complete and utter task saturation.

Today’s work: figure out how to identify the alert_id from an alert that’s not mailed out. To do that, I’m going to kick off an alert, wait for the alert, then find the fucking alert_id.

OK: forgot the firewall update on xymon. That’s sorted now. Alert ID for the client4:disk is 1578903790

Found the fucker!

xymon "xymondlog ${host}.${test}"

example:

# xymon "xymondlog client4.disk"
2014-10-04 12:39:36 No recipient specified - assuming localhost
client4|disk|red||1412444137|1412444338|1412446138|0|0|192.168.122.25|1578903790|||Y|
red Sat Oct  4 12:38:57 CDT 2014 - Filesystems NOT ok
&red /opt/app (100% used) has reached the PANIC level (95%)

Filesystem            1024-blocks    Used Available Capacity Mounted on
/dev/mapper/vg00-root     1032088  370652    609008      38% /
/dev/vda1                  495844   67751    402493      15% /boot
/dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
/dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
/dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
/dev/mapper/vg00-var      2064208  438492   1520860      23% /var
/dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app

Or, more explicitly:

xymon localhost "xymondlog client4.disk" | head -1 | \
awk -F\| '{print $11}'

Combining that with our ack cli:

xymon localhost 'hobbitdack ${alert_id} ${time_in_minutes} ${alert_msg}'

xymon localhost 'hobbitdack 1578903790 5 testing cli alert ack'
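The two steps can be rolled into one helper. A sketch under assumptions: `xy_ack` is my own name, XYMONDLOG_CMD is parameterised only so the pipeline can be exercised against canned output, and the final command is echoed (dry run) rather than sent - drop the `echo` to really ack.

```shell
# Hypothetical "ack by host.test" helper: fetch the alert id with
# xymondlog, then build the hobbitdack command from it.
xy_ack() {
    # usage: xy_ack host test minutes message...
    host=$1; tst=$2; mins=$3; shift 3
    id=$(${XYMONDLOG_CMD:-xymon localhost} "xymondlog ${host}.${tst}" |
         head -1 | awk -F'|' '{print $11}')
    # dry run: drop the leading echo to actually acknowledge the alert
    echo xymon localhost "hobbitdack ${id} ${mins} $*"
}
```

e.g. `xy_ack client4 disk 5 testing cli alert ack` would print the hobbitdack line from the session above.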

Then, update xymonserver.cfg to not propagate acknowledged alerts and your non-green view becomes much clearer.

XYMONGENOPTS="--nopropack='*'...

OK; some excellent progress today. That was one of the main goals. If ntp’s still fucked up, I can probably live with that. I really wanted to be able to acknowledge those goddamned alerts, though.