Troubleshooting
Software RAID Drive Replacement
Submitted by ckdake on Sat, 2010-08-21 12:37
Software RAID is such a finicky thing. I've blogged in the past on how to grow a software RAID array in Linux, but have since switched to using hardware RAID in all of my servers and a Drobo at home (which I wrote a little about here). My Drobo just got 2 new 2TB drives and all that took was sliding the drives in, but one of my old servers lost a software-RAID drive last week.
This crashed the server for some reason, and a matching replacement for the Hitachi Ultra320 SCSI 73GB drives was $400 or so, so I bought an IBM Ultra320 SCSI 73GB drive online for $50 shipped. It arrived this week and I headed to the datacenter today to install it. What should have been a stupid-simple process, like it is with the Drobo, was a lot more involved. It was made even more painful because, while the labeled size of the drives was identical, the new one had slightly fewer actual blocks than the old one. Should this happen to you with an ext2/3/4 data volume (or to me again in the future), these are the steps to take:
- Shut down the system
- Replace the failed drive
- Boot up from a recovery CD (I used a Gentoo install CD)
- Use fdisk to partition the new drive with a layout as close as you can get to the old drive's. Here are my two drives:
Disk /dev/sda: 73.4 GB, 73407868928 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xc36bc36b

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1           5       40131   fd  Linux raid autodetect
/dev/sda2               6         249     1959930   82  Linux swap / Solaris
/dev/sda3             250        8924    69681937+  fd  Linux raid autodetect

Disk /dev/sdb: 72.9 GB, 72892735488 bytes
255 heads, 63 sectors/track, 8862 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xf496138a

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1           5       40131   fd  Linux raid autodetect
/dev/sdb2               6         249     1959930   82  Linux swap / Solaris
/dev/sdb3             250        8862    69183922+  fd  Linux raid autodetect
- Note any block counts that differ. In my case, the new sdb3 is 69183922 blocks, which is smaller than the 69681937 blocks on the old drive.
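- Use resize2fs to shrink the existing filesystem on the good device to the size of the new partition on the new device. Yes, you are modifying it directly instead of going through the software RAID (one of the few nice things about software RAID). fdisk reports sizes in 1 KiB blocks, while resize2fs reads a bare number as a count of filesystem blocks, so the K suffix makes the unit explicit:
resize2fs -f /dev/sda3 69183922K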
- Use mdadm to shrink the raid device to the size of the new partition on the new device:
mdadm /dev/md127 --grow --size=69183922
- Just to be safe, run a filesystem check on the new RAID volume:
e2fsck -y /dev/md127
- Add in the new drive to the RAID array:
mdadm /dev/md127 --add /dev/sdb3
- And wait for it to finish resyncing:
watch cat /proc/mdstat
If you run into device busy errors, you may need to reboot, stop the RAID device (mdadm --stop /dev/md127), etc. And if you screw up the block sizes, all kinds of bad things happen. Also, if you just shrink the RAID volume but don't shrink the filesystem, or you shrink the RAID volume first, be prepared to spend far too much time trying to fix things! Sometimes it's faster (if your volume isn't close to full) to resize2fs down to a very small size, shrink the RAID volume to a very small size, let all the synchronizing happen, then grow the RAID volume with "--grow --size=max" and resize your filesystem up to the new size of the RAID volume.
There are worse things to spend part of a Saturday afternoon on, but I'd rather be outside!
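The size comparison in the steps above is easy to get wrong by hand, so here is a minimal Python sketch of it. It assumes the partition lines are copied from fdisk output in the format shown earlier; `parse_blocks` and `shrink_target` are hypothetical helper names, not part of fdisk or mdadm.

```python
def parse_blocks(fdisk_line):
    """Return the Blocks column from one fdisk partition line."""
    fields = fdisk_line.split()
    # The optional boot flag "*" shifts every later column by one.
    if fields[1] == "*":
        del fields[1]
    # fdisk may append "+" to the block count; strip it before converting.
    return int(fields[3].rstrip("+"))


def shrink_target(old_line, new_line):
    """Return the block count that fits on both partitions (the smaller one)."""
    return min(parse_blocks(old_line), parse_blocks(new_line))


old = "/dev/sda3 250 8924 69681937+ fd Linux raid autodetect"
new = "/dev/sdb3 250 8862 69183922+ fd Linux raid autodetect"
print(shrink_target(old, new))                 # size to shrink to
print(parse_blocks(old) - parse_blocks(new))   # blocks that no longer fit
```

The filesystem has to be shrunk to the target before the RAID device is, which is the ordering the steps above follow.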
OpenVZ, VLANs, Bridges, and Bonding
Submitted by ckdake on Tue, 2010-04-27 14:53
At SugarCRM we use OpenVZ on some beefy boxes to host a number of containers that take care of some things that we don't want to dedicate hardware to. An article I wrote in 2008 describes our OpenVZ VLAN'd network setup.
I recently provisioned a new OpenVZ server but it seemed to be possessed by some sort of evil spirit because networking just wouldn't work quite right on the containers hosted on it all the time. Most of the time things would be fine, but occasionally things would just stop working. I re-imaged the physical server several times which didn't help at all. Today, we used up capacity on all the other OpenVZ servers and needed this one, so I set out to expunge the demons. I set up a test vm and conveniently, networking was completely broken on it, so I set out to work.
Up first were the basics:
- the bridge, vzbr256 was up and running and tcpdump showed traffic in it
- the interface on the host, bond0.256 was up and running and tcpdump showed traffic in it
- the parent interface for the container, veth137.0, was up and running and tcpdump showed traffic
- the iptables configuration on vm2 was identical to that on all other OpenVZ servers
When vzbr256 was in discovery mode and passing all traffic, veth137.0 could see traffic between other systems on its subnet. Once the bridge figured out the topology and switched to forwarding mode, the only traffic going to veth137.0 was the ARP requests from the container side, with no ARP replies coming from the bridge. This was very strange because the bridge was still receiving the ARP replies and they weren't tagged with VLANs or anything funky. Full packet dumps of the ARP replies on vm2 showed that they were functionally identical to the ones on the other OpenVZ servers.
Basically the container was asking "where can I find my gateway?" and the gateway was responding with a MAC address. This reply made it all the way back to the bridge, but not back to the container for some reason. If I pinged the container from some other system on the vlan, the container would see that arp request and send a response, and all traffic would start to flow normally.
This sounded a little similar to this Problem with bonding, vlan, bridge, veth but the patch that fixes this was already applied to the Linux kernel that we're running. From this thread on Complex routing and bridging with OpenVZ it looked like some sysctl changes might be useful:
echo 0 > /proc/sys/net/bridge/bridge-nf-call-arptables
echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 0 > /proc/sys/net/bridge/bridge-nf-filter-vlan-tagged
A wiki article "Bridge doesn't forward packets" suggested that this might help, and changing 'bridge-nf-filter-vlan-tagged' from 1 to 0 seemed to let that first arp packet through, but resetting everything with that set to 0 from system boot led to the same symptoms and toggling it (and the others) around in different settings didn't help. Perhaps the first toggle was timed just right so it let through that one arp packet.
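When toggling these settings back and forth it helps to record what they actually are at each step. The following is a small Python sketch for that, not something from the original debugging session; `read_bridge_nf` is a hypothetical helper, and the `base` parameter exists only so the function can be pointed at a test directory.

```python
import os

# The three bridge netfilter switches discussed above.
BRIDGE_NF_KEYS = (
    "bridge-nf-call-arptables",
    "bridge-nf-call-iptables",
    "bridge-nf-filter-vlan-tagged",
)


def read_bridge_nf(base="/proc/sys/net/bridge"):
    """Return {setting: int value}, or None where the file cannot be read."""
    settings = {}
    for key in BRIDGE_NF_KEYS:
        try:
            with open(os.path.join(base, key)) as handle:
                settings[key] = int(handle.read().strip())
        except OSError:
            # Typically means the bridge module is not loaded on this host.
            settings[key] = None
    return settings


for key, value in read_bridge_nf().items():
    print(key, value)
```

Logging this output before and after each change makes it easier to tell whether a fix came from a setting or from lucky timing.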
I compared the configuration of bond0/eth0/eth1 on vm2 with other vm servers and noticed that the MAC addresses of all 3 were the same. On other vm servers, bond0 and eth0 were the same but eth1 was different. Perhaps this was the cause! Our provisioning system using Cobbler continues to evolve and perhaps something changed. I removed the MAC lines from /etc/sysconfig/conf.d/net.* files so that they now look like:
#eth0
DEVICE=eth0
ONBOOT=yes
SLAVE=yes
MASTER=bond0
HOTPLUG=no
BOOTPROTO=none

#eth1
DEVICE=eth1
ONBOOT=yes
SLAVE=yes
MASTER=bond0
HOTPLUG=no
BOOTPROTO=none

#bond0
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MODE=trunk
and reset everything, but the MAC addresses still showed up as the same. Given all the information above, and that others were having problems combining bonding ("port channeling") with VLANs and bridging, I decided to investigate how our bonding interfaces were set up. vm2's interfaces on the switch were still set to 'spanning-tree portfast' for some reason, so I disabled that, which had no effect. Digging a little deeper, the options that the bonding module was loaded into the kernel with in /etc/modprobe.conf were the culprit:
alias bond0 bonding
options bond0 mode=1
vm2 was using mode 1 bonding (active-backup in the bonding driver's numbering; balance-rr, the simple round-robin mode, is mode 0) instead of mode 5 (balance-tlb), which does adaptive transmit load balancing. I changed this on vm2 to:
alias bond0 bonding
options bond0 miimon=80 mode=5
and rebooted. Problem solved!
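Since the mode numbers are easy to mix up, here is a short Python sketch that turns an "options" line like the ones above into a mode name. The number-to-name table follows the Linux bonding driver documentation; `bonding_mode` is a hypothetical helper written for illustration.

```python
# Mode numbers as listed in the Linux bonding driver documentation.
BONDING_MODES = {
    0: "balance-rr",
    1: "active-backup",
    2: "balance-xor",
    3: "broadcast",
    4: "802.3ad",
    5: "balance-tlb",
    6: "balance-alb",
}


def bonding_mode(options_line):
    """Return (number, name) for the mode= token on an 'options bond0 ...' line."""
    for token in options_line.split():
        if not token.startswith("mode="):
            continue
        value = token.split("=", 1)[1]
        if value.isdigit():
            number = int(value)
            return number, BONDING_MODES.get(number, "unknown")
        # The driver also accepts the mode by name, e.g. mode=balance-tlb.
        for number, name in BONDING_MODES.items():
            if name == value:
                return number, name
        return None, value
    # With no mode= option the driver falls back to mode 0.
    return 0, BONDING_MODES[0]


print(bonding_mode("options bond0 mode=1"))
print(bonding_mode("options bond0 miimon=80 mode=5"))
```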
I can't be sure of the exact root cause here without understanding a bit more about kernel-level networking than I do, but it seems that Linux bridging doesn't know what to do with ARP reply packets that arrive in a VLAN on a bonding interface via a physical interface other than the one the ARP request was sent from. In mode 1 bonding, both physical interfaces get the same MAC address, so there is no way to ensure that replies will come in on the same interface the request was sent on, but with mode 5 bonding, each card uses its own MAC and replies always come to the same physical interface that the request was sent from.
After fixing this, I updated the provisioning system to configure all new systems with mode 5 bonding, and will be updating all existing systems to use mode 5 bonding the next time they reboot. Thankfully, our longest running system (at 1057 days and counting, last rebooted in mid 2007 before I was hired!) was set up manually before the provisioning system existed and was already using mode 5.
Long story short, if you're having networking problems with OpenVZ on top of a VLAN+bonding infrastructure, check out your bonding mode first and pick one that allows distinct MAC addresses for each physical interface and meets whatever needs you have for using bonding in the first place.
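To check the mode a running bond is actually using (as opposed to what the config file says), the kernel exposes a status file per bond under /proc/net/bonding/. A minimal Python sketch for reading it follows; `summarize_bond` is a hypothetical helper and the sample text is illustrative, not captured from vm2.

```python
def summarize_bond(proc_text):
    """Return (bonding mode string, list of slave interface names)."""
    mode = None
    slaves = []
    for line in proc_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            slaves.append(value)
    return mode, slaves


# Illustrative excerpt in the layout of /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: transmit load balancing
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: up
"""

print(summarize_bond(SAMPLE))
```

On a real host the text would come from open("/proc/net/bonding/bond0").read().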
SugarCRM and Caching with APC
Submitted by ckdake on Sat, 2010-03-27 08:46
At my day job, I'm responsible for everything that is Operations at SugarCRM, including our OnDemand environment where we host SugarCRM for a large number of our customers. The web clusters are a not too surprising Open Source stack including nginx, wackamole, Apache, PHP, MySQL, CentOS, memcache, etc, but because of how SugarCRM works, we do run into some interesting challenges from time to time. Our engineering team develops a single product that can be deployed on customer servers, in our OnDemand environment, and anywhere in the cloud, which means that in our environment each customer instance of SugarCRM lives in its own silo. This puts our own unique spin on the common SaaS challenge of scaling up while keeping response times as low as possible.
One of the key components in keeping response times low is PHP opcode caching. When I started at SugarCRM a few years ago, we were using Zend Platform for opcode caching, session clustering, and a few other things. It did a fine job of the opcode caching but one of our "standard procedures" was to restart the Zend session clustering daemon whenever a customer reported certain kinds of problems. Not good! To permanently resolve this, help move us towards a full Open Source stack (since we are an Open Source company!), and to simplify our architecture, we moved to using APC for opcode caching (and memcached for session clustering).
The APC performance gains were similar to using Zend, but we didn't have any issues with APC requiring service restarts for close to 2 years. Over these two years we have continued to add web servers to our OnDemand cluster as the number of customers increased, and a few weeks ago Apache started (seemingly randomly) getting backed up with all of its slots in use "Sending Reply". A simple `apachectl reload` would fix this, but it was eerily similar to the days of Zend. Due to the number of other ongoing projects, I didn't have time to investigate this much, so I set up an alarm in our monitoring system so that we would be alerted before the web server got completely backed up and could proactively fix the issue.
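One way such an alarm could be built is sketched below in Python. It assumes mod_status is enabled and that the machine-readable status page (the "?auto" variant, which contains a "Scoreboard:" line) has already been fetched; in the scoreboard, "W" means Sending Reply and "." is an open slot with no process. `sending_reply_ratio` and the 0.8 threshold are hypothetical, not what our monitoring system actually used.

```python
def sending_reply_ratio(status_text):
    """Fraction of active worker slots that are in state 'W' (Sending Reply)."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            # "." marks an open slot with no process, so leave it out.
            active = [slot for slot in board if slot != "."]
            if not active:
                return 0.0
            return active.count("W") / len(active)
    raise ValueError("no Scoreboard line found")


sample = "BusyWorkers: 3\nIdleWorkers: 1\nScoreboard: WW_K.."
ratio = sending_reply_ratio(sample)
print(ratio)
if ratio > 0.8:  # assumed alert threshold
    print("ALERT: Apache is backing up")
```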
Unrelated, yesterday I was looking at some of our metrics and noticed that our APC cache hit rate was a lot lower than it should be due to a high "Cache Full Count" (The Cache Full Count number was 10% of the size of the Cache Hit Count). We weren't directly monitoring this and I remember making sure the cache was big enough when this was initially set up, but that was from when we had a much smaller OnDemand customer base. To fix this, I bumped up the cache size from 128M to 512M and our configuration management system slowly started pushing out the changes. Cache Full Count was reset to 0 and stayed at 0, and the hit rate went from ~70% to ~90% across the cluster. Problem Solved! Or so I thought.
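Since we weren't monitoring these numbers directly, here is a small Python sketch of the arithmetic a check could do. The counter names mirror the APC status page; the function names and both thresholds are assumptions made for illustration.

```python
def hit_rate(hits, misses):
    """Cache hit rate as a fraction between 0 and 1."""
    total = hits + misses
    return hits / total if total else 0.0


def cache_too_small(hits, misses, full_count,
                    min_hit_rate=0.9, max_full_ratio=0.01):
    """Flag a cache that is evicting too often or missing too much.

    full_count is the "Cache Full Count"; the thresholds are illustrative.
    """
    full_ratio = full_count / hits if hits else 0.0
    return hit_rate(hits, misses) < min_hit_rate or full_ratio > max_full_ratio


# Roughly the "before" picture: ~70% hits, full count at 10% of hit count.
print(cache_too_small(hits=7000, misses=3000, full_count=700))
# Roughly the "after" picture: ~90% hits, full count back at 0.
print(cache_too_small(hits=9000, misses=1000, full_count=0))
```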
A little later on in the day, the web servers started getting backed up, much more quickly than before and all at the same time, and it was starting to cause timeouts for some customers. The proverbial monkeys in our application had gotten mad at the fan.
With a much bigger problem on my hands, I did some deeper debugging using strace, and the "Sending Reply" Apache processes all seemed to be spinlocked on a futex(FUTEX_WAKE) call. Could this be APC's fault? I disabled APC across the cluster and the problem went away, but I had to figure out a way to fix this because no APC means a significant increase in the minimum load time of every single page in our system. Digging through source code and Googling around, I came across How to Dismantle an APC Bomb, which explains a set of symptoms that seemed very familiar. The cause was explained as contention caused by apc_store calls, which are in the User Cache portion of APC and not the opcode cache, which gave me an idea: just disable the user cache.
With SugarCRM, this can be done by setting 'external_cache_disabled' or 'external_cache_disabled_apc' in config.php, but I didn't really want to touch thousands of config.php files, so I looked in our code and found:
elseif(function_exists("apc_store") && empty($GLOBALS['sugar_config']['external_cache_disabled_apc']))
SugarCRM checks if apc_store exists to see if it can use APC, so I simply added apc_store to disable_functions in our php.ini template, re-enabled APC, tested out the changes, and pushed this out to the cluster.
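The php.ini change itself is a one-line edit. As a rough illustration of making it repeatable (our real change went through the configuration management template), here is a Python sketch that adds a function name to the disable_functions directive without duplicating it; `disable_php_function` is a hypothetical helper.

```python
def disable_php_function(ini_text, name):
    """Return ini_text with `name` added to the disable_functions directive."""
    lines = ini_text.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and anything that is not a key = value line.
        if stripped.startswith(";") or "=" not in stripped:
            continue
        key, _, value = stripped.partition("=")
        if key.strip() != "disable_functions":
            continue
        current = [item.strip() for item in value.split(",") if item.strip()]
        if name not in current:
            current.append(name)
        lines[index] = "disable_functions = " + ",".join(current)
        return "\n".join(lines) + "\n"
    # Directive not present yet: append it.
    lines.append("disable_functions = " + name)
    return "\n".join(lines) + "\n"


print(disable_php_function("disable_functions =\n", "apc_store"), end="")
```

Running it twice on the same text gives the same result, which matters when a configuration system reapplies the change.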
This ended up being the problem and the solution. It turns out that increasing the memory available to APC meant that things were staying in the cache longer, which made the contention issue even worse. SugarCRM's use of a user cache simply doesn't scale well in a massive clustered environment, and with it disabled, the opcode cache can do a better job. Hit rates are now between 92% and 99% across the cluster, and response times are down. Below is a sanitized graph of response times to a test instance designed to represent the worst response times across the OnDemand cluster:

- APC's memory was increased from 128M to 512M a little after the dotted red line at 12:00. This improved the consistency of 'worst case' response times significantly.
- Sometime between 1pm and 2pm, things started to get Very Bad, and while response times were still down, web servers were crashing left and right.
- At 2:12, APC was disabled completely. Minimum response times went up but things got suspiciously consistent
- After over an hour of investigation and testing, at around 3:20, APC was re-enabled with apc_store disabled, response times went back down, and the response time consistency remained.
I'm glad this issue happened after peak hours on a Friday, and I'm looking forward to seeing how this solution affects response times during peak load periods next week. If you're running SugarCRM at any kind of scale, this may be something to consider.
Debugging a (particular) failing boot service on Linux
Submitted by ckdake on Thu, 2010-01-21 19:51
At work I recently rolled out a newer version of the Dell OpenManage tools which included, for the first time, a build of Openwsman. We didn't specifically need this functionality, but it's good to stay current with the OpenManage tools. To load the (unrelated) new kernel on a test machine, I rebooted the machine using Cobbler's power management functionality on our administrative system, but after 5 minutes the machine was still not responding to pings, so something was broken. I used remote desktop to hop on our one Windows server in the datacenter, which we use to get at the interactive consoles of our servers (Thankfully the new DRAC6 cards have console applets that work on Macs!), and pulled up the console for this machine.
The boot process was hung on "Starting openwsman" and didn't seem to be doing anything. Doh!
I restarted the machine again, and at the grub boot menu added an "S" to the boot string to start the system in single user mode. Once it was up, I ran "chkconfig openwsman off" to disable the service and rebooted once more to get the machine back up and running so I could troubleshoot a little better. I took a look in /etc/init.d/openwsman to see what might be hanging, and nothing immediately looked suspicious. It was a pretty standard init script, with the extra feature of generating OpenSSL certificates if they didn't exist already:
if [ ! -f "/etc/openwsman/serverkey.pem" ]; then
    if [ -f "/etc/ssl/servercerts/servercert.pem" \
         -a -f "/etc/ssl/servercerts/serverkey.pem" ]; then
        echo "Using common server certificate /etc/ssl/servercerts/servercert.pem"
        ln -s /etc/ssl/servercerts/server{cert,key}.pem /etc/openwsman/
    else
        echo "Generating Openwsman server public certificate and private key"
        FQDN=`hostname --fqdn`
        if [ "x${FQDN}" = "x" ]; then
            FQDN=localhost.localdomain
        fi
cat << EOF | sh /etc/openwsman/owsmangencert.sh > /dev/null 2>&1
--
SomeState
SomeCity
SomeOrganization
SomeOrganizationalUnit
${FQDN}
root@${FQDN}
EOF
    fi
fi
It's a little strange, but not unheard of practice to do this, and shouldn't cause any problems. (Puppet and Func, two other systems tools we use, generate their certs in the application which is a lot more common.)
I extracted the only possible culprit from the owsmangencert.sh script and tried running the openssl command manually:
openssl req -days 365 $@ -config /etc/openwsman/ssleay.cnf \
    -new -x509 -nodes -out cert.out \
    -keyout key.out
and it seemed that this was indeed the problem. It just sat there and didn't complete with the speediness I expect from OpenSSL. Time for strace!
cat << EOF | strace openssl req -days 365 -config ./ssleay.cnf.2 -new -x509 -nodes -out cert.out -keyout key.out
--
SomeState
SomeCity
SomeOrganization
SomeOrganizationalUnit
test
root@test
EOF
This ended up doing a long read with output like:
open("/dev/random", O_RDONLY) = 3
read(3, "\323K\372u_ya'\27\266\320\25\22\373\240\330~'\224\310\243\356\225\350.\245\362\3058\230Zb"..., 1024) = 128
read(3, "K\7:\273Zdr\274\25\227\263\366\260U\337Owp\6y\2333c\361\322\334\217\370.k\375]"..., 896) = 128
read(3, "dH\375V\327\230Bi\221\342\326\26R\301v^Qv5f\347\303g7\2747\345\360\207A!\227"..., 768) = 128
read(3, "X&\254r\331\353<:\36!\333\340\353", 640) = 13
read(3, "\357F\27\347\372atf", 627) = 8
read(3, "\231\347\232\362\345\215n\227", 619) = 8
read(3, "\324\304\323\30\325\10G\332", 611) = 8
Looks like /dev/random wasn't returning random data nearly fast enough, which makes a whole lot of sense! /dev/random is "good" random data because it is based on environmental entropy and the entropy data is only used once, but on a modern multi-core system doing lots of things, there usually isn't much entropy available. That means that while this command would eventually finish, it could take a very long time.
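To put a number on how slowly the data was trickling in, the strace lines can be totaled up. The Python sketch below does that; `summarize_reads` is a hypothetical helper, and the sample lines reproduce the request and return sizes from the trace above with the buffer contents shortened.

```python
import re

# Matches e.g.: read(3, "..."..., 1024) = 128
READ_CALL = re.compile(r"^read\(\d+, .*, (\d+)\)\s+=\s+(\d+)$")


def summarize_reads(strace_lines):
    """Return (size of first request, total bytes returned, number of reads)."""
    first_request = None
    total = 0
    calls = 0
    for line in strace_lines:
        match = READ_CALL.match(line.strip())
        if not match:
            continue
        requested, returned = int(match.group(1)), int(match.group(2))
        if first_request is None:
            first_request = requested
        total += returned
        calls += 1
    return first_request, total, calls


trace = [
    'open("/dev/random", O_RDONLY) = 3',
    'read(3, "..."..., 1024) = 128',
    'read(3, "..."..., 896) = 128',
    'read(3, "..."..., 768) = 128',
    'read(3, "...", 640) = 13',
    'read(3, "...", 627) = 8',
    'read(3, "...", 619) = 8',
    'read(3, "...", 611) = 8',
]
wanted, got, calls = summarize_reads(trace)
print(f"{got} of {wanted} bytes after {calls} reads")
```

For the excerpt above this comes to 421 of the 1024 requested bytes after seven reads, with each later read returning only a handful of bytes.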
The fix: using /dev/urandom instead. It is "not quite as good" random data because the output may have less entropy than /dev/random, and it uses internal entropy bits multiple times to generate its output, but it's "good enough" for generating cryptographic keys. And, it is non-blocking, which means that a caller will never have to wait inane amounts of time for enough "random" data. (See http://en.wikipedia.org/wiki//dev/random for a longer explanation.)
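Swapping the device path is a plain text substitution. The Python sketch below shows it on a string; `use_urandom` is a hypothetical helper, and the sample line is made up for illustration rather than copied from the Openwsman files.

```python
def use_urandom(text):
    """Return (new text, number of /dev/random references replaced).

    "/dev/urandom" does not contain the substring "/dev/random", so running
    this on already-fixed text changes nothing.
    """
    count = text.count("/dev/random")
    return text.replace("/dev/random", "/dev/urandom"), count


sample = "RANDFILE = /dev/random\n"  # hypothetical config line
fixed, replaced = use_urandom(sample)
print(fixed, end="")
print(replaced)
```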
I replaced the two occurrences of /dev/random, one in /etc/openwsman/ssleay.cnf and one in /etc/openwsman/owsmangencert.sh, and initial startup of openwsman (including key generation) became pretty much instantaneous. I ran "chkconfig --level 2345 openwsman on" to turn it back on, removed the generated keys and certs, and rebooted to confirm, and the machine booted up as expected. To make this work everywhere, I customized those two config files and added them to our Puppet system so that all Dell servers would get Openwsman set up properly when the update is run globally:
file {
    "ssleay.cnf":
        path => "/etc/openwsman/ssleay.cnf",
        source => "puppet://$server/dell/ssleay.cnf",
}
file {
    "owsmangencert.sh":
        path => "/etc/openwsman/owsmangencert.sh",
        source => "puppet://$server/dell/owsmangencert.sh",
}
Problem solved and all machines will automatically get the correct fix, so the next time a machine won't finish starting up, it will be a new and different problem to debug.
Disable Gateway Smart Packet Detection
Submitted by ckdake on Fri, 2008-12-05 14:25
Do you have Comcast business cable internet? Do you occasionally have crazy random problems connecting to remote machines/websites, and/or do you notice a very unusual number of TCP retransmits when looking at packet traces that go through your Comcast-provided SMC Networks cable modem? You are not alone! And there is one checkbox to fix all of this!
Head in to your modem admin page, go to Firewall Settings and check the box next to "Disable Gateway Smart Packet Detection" and all of your problems will be solved. (No guarantees, but seriously, it should work.)
How did I find this out? I was getting a lot of things delivered via UPS this week and was unable to get to their website after a few successful attempts. Things still worked through a proxy server at work, so I thought they must be blocking my IP address due to the number of requests I had made (possibly violating some AUP with them). tcpdump showed my packets going to their webserver (on Akamai) but nothing coming back after my SYN packets went out. After 15 minutes on the phone with Comcast support, they escalated me to second level support, which meant a callback in 2 days. The second level guy confirmed my ISP and told me to change this setting because they'd had a lot of problems with Comcast customers and, working with Comcast, they came up with this solution.
Um, ok. What's going on here? The best part about this is that nobody really knows! Searching around the internet for "Gateway Smart Packet Detection" doesn't lead to any documentation or any "good" answers, just lots of people having problems and this checkbox fixing all of them. I've gathered that it is some kind of anti-DoS feature for blocking multiple attempts at something, but chances are you are better off just turning it off. Hope this helps someone, as the problems that checking this box has solved for me had been frustrating me for months!
Adobe Fast Web View
Submitted by ckdake on Sun, 2008-03-16 11:40
Adobe Fast Web View is a very lightly documented but seemingly often used feature in Adobe Acrobat Reader. From the user's point of view, it does what it says and makes pages of a PDF show up in their browser before the entire PDF is completely downloaded, but it's a bit more complicated from a server operator's point of view. And, it is enabled by default when installing Adobe Acrobat Reader.
We recently moved sugarcrm.com and some other web properties from stand-alone web servers to a clustered solution involving NFS, load balancers, database replication, etc. It was a pretty complex migration and we're pretty sure that we're running a handful of applications on this cluster that nobody has ever clustered before, so needless to say we ran into our share of gotchas. (Other than one web server that seems to be cursed...) One of them was very strange, and involved PDFs: everything worked fine in all browsers on all platforms until Adobe Acrobat Reader entered the picture. Some number of PDFs would lock up the browser and never load, but only when the PDFs were served from the cluster through the load balancer. When served from one of the cluster web servers but not through the load balancer, everything would work perfectly! Also, with the Adobe plugin disabled, the PDFs would save perfectly and be viewable every time.
Using livehttpheaders it was apparent that 2 HTTP requests were being made, so my guess was that the browser would do a GET for the PDF, but when the Adobe plugin took over, it was sending a new HTTP request (all about HTTP). This shouldn't be an issue, but things weren't working! I installed Wireshark on my Windows test installation and dug deeper. Immediately, I noticed that all the data packets in the response coming from the server were fragmented. This typically means that there is an MTU mismatch somewhere. However, with some Googling around for PDF files I noticed the same behavior on 50% of the sites I hit, and those PDFs were working fine in the Adobe plugin. Regardless, Igor and I set out to tinker with the MTUs on the web servers and load balancer. Changing the MTU from 1500 down to 1400 did change which PDFs would load in the plugin, but did not fix all of them. Strange!
Again looking in the Wireshark traces, we saw what looked like a TCP reset loop (read all the details about TCP here). After the first part of the data came through successfully, every packet from the server was a RST and the Adobe plugin just sat there waiting for data that was never going to arrive. We poked around the load balancer looking for anything that could cause this, but no luck. Googling around for this PDF problem, the only solutions we found were recommendations to disable "Fast Web View." What's that? This gave us another thing to search for and led us to a server-side solution in this forum topic. For whatever reason, the load balancer was breaking HTTP requests with a "Request-Range" header, and Adobe Acrobat Reader was using this to attempt to make the PDF load faster. In retrospect, this makes sense, but it sure was a time-consuming thing to discover! If you run into this, the solution is to add the following to your Apache configuration file (or something equivalent if you use lighttpd or something else; we found examples of this happening with other server software):
LoadModule headers_module modules/mod_headers.so
...
<FilesMatch "\.(mp3|zip|pdf)$">
Header unset Accept-Ranges
RequestHeader unset Range
RequestHeader unset Unless-Modified-Since
RequestHeader unset If-Range
</FilesMatch>
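To make the effect of that configuration concrete, here is a Python sketch that mimics it on plain dictionaries: range-related request headers are dropped so the server answers with the whole file, and Accept-Ranges is removed from the response so the client is not invited to ask for ranges again. This is an illustration of the idea only, not how Apache implements mod_headers; the function names are hypothetical.

```python
# Request headers removed by the RequestHeader lines above (compared lowercase).
RANGE_REQUEST_HEADERS = {"range", "if-range", "unless-modified-since"}


def strip_range_request_headers(headers):
    """Return a copy of the request headers without range-related entries."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in RANGE_REQUEST_HEADERS
    }


def strip_accept_ranges(headers):
    """Return a copy of the response headers without Accept-Ranges."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "accept-ranges"
    }


request = {"Host": "example.com", "Range": "bytes=0-1023", "If-Range": "abc"}
response = {"Content-Type": "application/pdf", "Accept-Ranges": "bytes"}
print(strip_range_request_headers(request))
print(strip_accept_ranges(response))
```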

