Monitoring

Monitoring Nginx with ZenOSS

If you're using ZenOSS for network monitoring and you have a few load balancers (or servers of some sort) running Nginx, chances are pretty good that you want to see what those load balancers are up to inside of ZenOSS. It requires gluing some things together, but once you know what kind of glue to use, it's a pretty straightforward process. First, you need to enable Nginx's status page. This requires Nginx to be compiled with "--with-http_stub_status_module", but if you're using Nginx from a package provider like EPEL, this is already included. Enable nginx_status in your configuration by adding something like:

location /nginx_status {
    stub_status on;
    access_log   off;
    allow IP.OF.MONITORING.SERVER;
    deny all;
}

to your first server {} block. Reload nginx and visit http://IP.OF.NGINX.SERVER/nginx_status and you should get some stats like:

Active connections: 8 
server accepts handled requests
 455010 455010 781977 
Reading: 0 Writing: 2 Waiting: 6 
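
Before wiring this into ZenOSS, it's worth confirming that the monitoring server itself can reach the page, since the allow/deny rules above restrict access by IP. A quick sanity check, run from the ZenOSS host (same placeholder hostname as above):

# run this from the monitoring server; it should print the stats block shown above
wget -O- -q http://IP.OF.NGINX.SERVER/nginx_status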

Up next is getting this information into ZenOSS. We'll do this using a Nagios plugin grabbed from here: check_nginx. This plugin does most of the work, but it's not quite usable as-is: it bases its numbers on the difference between two samples of the status page taken one second apart rather than the raw counters, and its output format doesn't work with ZenOSS's input parser, so you'll need to apply this patch:

--- check_nginx.sh	2009-11-24 07:31:35.000000000 -0800
+++ check_nginx.sh.1	2009-11-24 06:45:56.000000000 -0800
@@ -181,17 +181,13 @@
     if [ "$secure" = 1 ]
     then
         wget_opts="-O- -q -t 3 -T 3 --no-check-certificate"
-        out1=`wget ${wget_opts} http://${hostname}:${port}/${status_page}`
-        sleep 1
-        out2=`wget ${wget_opts} http://${hostname}:${port}/${status_page}`
+        out=`wget ${wget_opts} http://${hostname}:${port}/${status_page}`
     else        
         wget_opts="-O- -q -t 3 -T 3"
-        out1=`wget ${wget_opts} http://${hostname}:${port}/${status_page}`
-        sleep 1
-        out2=`wget ${wget_opts} http://${hostname}:${port}/${status_page}`
+        out=`wget ${wget_opts} http://${hostname}:${port}/${status_page}`
     fi
 
-    if [ -z "$out1" -o -z "$out2" ]
+    if [ -z "$out" ]
     then
         echo "UNKNOWN - Local copy/copies of $status_page is empty."
         exit $ST_UK
@@ -199,13 +195,9 @@
 }
 
 get_vals() {
-    tmp1_reqpsec=`echo ${out1}|awk '{print $10}'`
-    tmp2_reqpsec=`echo ${out2}|awk '{print $10}'`
-    reqpsec=`expr $tmp2_reqpsec - $tmp1_reqpsec`
-
-    tmp1_conpsec=`echo ${out1}|awk '{print $9}'`
-    tmp2_conpsec=`echo ${out2}|awk '{print $9}'`
-    conpsec=`expr $tmp2_conpsec - $tmp1_conpsec`
+    reqpsec=`echo ${out}|awk '{print $10}'`
+
+    conpsec=`echo ${out}|awk '{print $9}'`
 
     reqpcon=`echo "scale=2; $reqpsec / $conpsec" | bc -l`
     if [ "$reqpcon" = ".99" ]
@@ -220,7 +212,7 @@
 }
 
 do_perfdata() {
-    perfdata="'reqpsec'=$reqpsec 'conpsec'=$conpsec 'conpreq'=$reqpcon"
+    perfdata="reqpsec=$reqpsec conpsec=$conpsec conpreq=$reqpcon"
 }
 
 # Here we go!
@@ -247,17 +239,17 @@
 then
     if [ "$reqpsec" -ge "$warning" -a "$reqpsec" -lt "$critical" ]
     then
-        echo "WARNING - ${output} | ${perfdata}"
+        echo "WARNING - ${output} | ${perfdata};"
 	exit $ST_WR
     elif [ "$reqpsec" -ge "$critical" ]
     then
-        echo "CRITICAL - ${output} | ${perfdata}"
+        echo "CRITICAL - ${output} | ${perfdata};"
 	exit $ST_CR
     else
-        echo "OK - ${output} | ${perfdata} ]"
+        echo "OK - ${output} | ${perfdata}; ]"
 	exit $ST_OK
     fi
 else
-    echo "OK - ${output} | ${perfdata}"
+    echo "OK - ${output} | ${perfdata};"
     exit $ST_OK
 fi

by saving that to "check_nginx.patch" and running "patch -p0 < check_nginx.patch" from the folder where you have the check_nginx script saved. Make sure that the ZenOSS user can run the script, and make sure it works. The output should look like:

OK - nginx is running. 782995 requests per second, 455656 connections per second (1.71 requests per connection) | reqpsec=782995 conpsec=455656 conpreq=1.71;
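
If you don't get output like that when running it as the ZenOSS user, the usual culprits are script permissions or the nginx allow rule. Something along these lines should confirm it (the path matches the one I use in the ZenOSS command below; adjust to wherever you keep the script):

# make the plugin executable and run it once as the zenoss user
chmod 755 /home/zenoss/scripts/check_nginx_ng.sh
su - zenoss -c "/home/zenoss/scripts/check_nginx_ng.sh -N -H IP.OF.NGINX.SERVER"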

With all this working, now you'll just need to create a new Template in ZenOSS, add a COMMAND data source to it with a command like:

/home/zenoss/scripts/check_nginx_ng.sh -N -H ${dev/manageIp}

Then add two datapoints, reqpsec and conpsec. Make sure to set both to the "COUNTER" type, because the patched check_nginx script reports constantly increasing numbers instead of the GAUGE values it reported before! Bind this new template to any devices or device classes where servers are running Nginx, and create any graphs you like. I have a graph on each device that shows reqpsec and conpsec, as well as a report that shows the aggregate reqpsec and conpsec for all of the load balancers. If your command is named "check_nginx" in the template, you can use the variables in any report by adding "check_nginx_reqpsec" as a data point.
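
If the COUNTER vs. GAUGE distinction seems fuzzy: the patched script now emits the raw, ever-growing totals from nginx_status, and a COUNTER datapoint is what turns two successive raw samples into a per-second rate. Roughly, assuming a 300-second collection cycle (ZenOSS's default for command datasources, if I remember right):

# two successive raw samples of the requests counter, 300 seconds apart (numbers are made up)
echo "scale=2; (783295 - 782995) / 300" | bc   # => 1.00 requests per second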

Without too much trouble, you now have fancy graphs in ZenOSS for your Nginx statistics, and you can set thresholds for these if you have conditions that should send out alarms.

Dell RAID status using OpenManage in Gentoo

I use Gentoo on the majority of my servers, and it's the OS on the desktop that I use every day. While there are occasional annoyances, I've been using it for almost 6 years and have the know-how to bend it to my will. However, a recent server issue stumped me a little bit.

One of my new Dell servers came with an LSI 1060E built-in RAID controller for the system SAS drives. I'm not quite sure how this happened, because the box also contained a PERC6/E for the new MD1000, and the 1060E is a substantial step down from the PERC6/i that I was expecting. No big deal, since this card only does RAID-1 mirroring on the two SAS drives for the OS, but I was unable to figure out how to monitor disk status to see what was going on, and I do need to know if and when one of the OS hard drives fails.

On other Dell systems that I have running Gentoo, the "MegaCli" utility from LSI does a great job reporting physical and virtual disk status. On this new server, it works with the PERC6/E just fine but doesn't detect the 1060E. I couldn't find any documentation on how to get this to work in Linux and Dell support's only suggestion was to install Dell's full-blown OpenManage package. Conveniently, this doesn't work natively in Gentoo yet. Watching the output of `ipmi-sel` may do the trick, but I'm not sure that the 1060E would log an event to the IPMI log if a drive failed.

After some Googling around, I stumbled across this howto, which ended up giving me the clues I needed to get started. I basically followed those instructions, but a few things didn't quite work. I got some errors at the end of `apt-get install dellomsa`, and it turns out that `/etc/init.d/dataeng` wouldn't start and didn't give me any usable error messages. I got around this by killing all dsm_sa_datamgr32d processes and running `/opt/dell/srvadmin/dataeng/bin/dsm_sa_datamgr32d` manually inside the Debian chroot.
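
For the record, the manual workaround boiled down to a couple of commands like these (the /var/debian chroot path is the one I use below; the daemon lives where the dellomsa package installed it):

# kill any half-started instances left over from the failed init script
pkill -f dsm_sa_datamgr32d
# start the data manager by hand inside the Debian chroot
chroot /var/debian /opt/dell/srvadmin/dataeng/bin/dsm_sa_datamgr32d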

After starting up that daemon, Dell's "omreport" tool started working! `omreport storage vdisk` and `omreport storage pdisk controller=0` gave me the information I was expecting, and I used some existing Perl scripts of mine to set up a cronjob that runs regularly and e-mails me about any problems. To keep things simple (by keeping scheduled jobs in one place), the cron runs from the Gentoo host outside of the chroot and does something along the lines of this:

# run omreport inside the Debian chroot and capture virtual and physical disk status
my $vout = `/bin/chroot /var/debian /usr/sbin/omreport storage vdisk`;
my $pout = `/bin/chroot /var/debian /usr/sbin/omreport storage pdisk controller=0`;

There are still some issues with what happens when the boxes reboot, mainly needing to mount a few things under /var/debian, chroot in, and start up dsm_sa_datamgr32d again, but the only time these boxes should ever reboot is when I'm working on them, so all should be well.
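
If I ever need those boxes to recover from a reboot unattended, the steps would look roughly like this; exactly which mounts the chroot needs is an assumption on my part, so treat it as a sketch rather than a tested recipe:

# re-create the mounts the Debian chroot needs (assuming proc and sys are enough)
mount -t proc proc /var/debian/proc
mount --bind /sys /var/debian/sys
# start the OpenManage data manager back up inside the chroot
chroot /var/debian /opt/dell/srvadmin/dataeng/bin/dsm_sa_datamgr32d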

Have a better way to run omreport in Gentoo or access 1060E drive status via IPMI natively? I'd love to hear it!

logrotate, munin, and configuration mistakes

I've got several servers, and on each of them I log things and graph things. Munin is one of my favorite graphing tools for keeping an eye on things, and I noticed something odd in one of my Munin graphs that was getting odder as the days went by:

[Munin load graph: the nightly 4am spike climbing higher each day]

System load on that server was usually low, but every night at 4am when the daily scheduled jobs ran, the CPU usage of those jobs was steadily getting higher. Not good! I poked around at each of the scheduled jobs on that machine, and one of them was taking an unusually long time to run: logrotate. This seemed strange, as my logs folder was still under 1GB for all the logs on that machine, which sees a pretty high volume of web and mail hits, but something needed to be fixed. I ran strace on the logrotate process and it seemed to just be calling localtime thousands of times in a row, which indicated either that it was broken or that it was checking lots of files. Hrm. All the logrotate configuration files looked fine and had been working for a long time. I poked around the logs folder to see if anything unusual was happening in there, and there it was. The munin log folder was only ~100MB but had over 15,000 files in it. Munin definitely didn't need that many log files... They had names like "munin_node.log.1.gz.2.gz.1.gz.3.gz.1.gz", which meant that logrotate was rotating files it had already rotated, un-gzipping and re-gzipping them each time, and it's pretty obvious how that could get bad fast.

It turns out I had made a typo in one of the logrotate config files, telling it to rotate "/var/log/munin/*" instead of "/var/log/munin/*.log". Doh! It took ~10 minutes to remove all of the extraneous files (rm complained about too many arguments, so I had to use find: `find ./ -name "*.log.*" -exec rm {} \;`), but now all seems to be well and logrotate is running in the expected amount of time. We'll see what happens at 4am tomorrow :)
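
For reference, the corrected stanza ended up looking something like this; the rotation options here are illustrative rather than copied verbatim from my config, the important part is the *.log glob:

/var/log/munin/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}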

monitoring a centrally managed system with Nagios

My research project at work, CPR, consists of several "networks" of machines that each run full-mesh end-to-end measurements between themselves and the other machines in their network. Our biggest network is approaching 100 machines, but there are several other active networks. Each of these has its configuration information in a centralized database and is managed by a tool called cpradmin. Initially there was just one network, so the tool was developed with one network in mind, but as the project has grown, I hackishly added support for multiple networks by allowing the network name to specify the MySQL database for that particular network. Cpradmin sets up new machines and does things like configuring Smokeping, iptables, etc. on them. Cpradmin is also responsible for generating Nagios configuration information for each node: for example, the campus CPR nodes each check the availability of campus services such as IMAP, DNS, DHCP, etc.

As the number of nodes grew, we could no longer depend on our eyes to make sure they were all working, so I set up a central installation of Nagios to monitor both reachability to the nodes and that all the monitoring processes were running. This used the existing Nagios configuration generation script that was written for the nodes and worked fine, but due to the different tests being run on each network of machines, it started to get out of hand. Additionally, the central install was only monitoring one network, and any hosts on other networks needed to be added individually. Obviously not good, because it required human intervention, so I started on a new tool.

It's still not quite finished, but it's looking pretty good. Instead of using specific configuration information for each host, the new tool uses the general configuration information about the types of tests running in each network. For example, all campus nodes should have arpwatch running. This new script generates a comprehensive Nagios configuration file based on the types of services running in each network on each host, as well as things like disk space, system load, latency, etc. Whenever a new node is added to a mesh, cpradmin no longer has to add specific files to a list; it just regenerates the configuration file and sends Nagios a signal to reload its configuration. When this is all finished, we will no longer have to manually intervene with Nagios at all, and it will monitor close to 150 hosts with a total of close to 1000 tests and email us whenever, say, the disk on one of them fills up.
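
The generated file is just plain Nagios object definitions, so a single node's entry comes out looking roughly like this (the host name, address, and check command here are illustrative stand-ins, not what cpradmin actually emits):

define host {
    use        generic-host
    host_name  cpr-campus-01
    address    192.0.2.11
}

define service {
    use                  generic-service
    host_name            cpr-campus-01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}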

Up next is giving real multi-network support to cpradmin so that it can do software upgrades on multiple networks all at once. (I've already added multithreading support, so upgrading Smokeping across all the machines in a network only takes about as long as installing it on one node, but cpradmin still has to be run once for each network.)