Software RAID Drive Replacement

Software RAID is such a finicky thing. I've blogged in the past on how to grow a software RAID array in Linux, but have since switched to using hardware RAID in all of my servers and a Drobo at home (which I wrote a little about here). My Drobo just got 2 new 2TB drives and all that took was sliding the drives in, but one of my old servers lost a software-RAID drive last week.

This crashed the server for some reason, and a matching replacement for the Hitachi Ultra320 SCSI 73GB drives was $400 or so, so I bought an IBM Ultra320 SCSI 73GB drive online for $50 shipped. It arrived this week and I headed to the datacenter today to install it. What should have been a stupid-simple process, like it is with the Drobo, was a lot more involved. It was even more painful because while the labeled sizes of the drives were identical, the new one had slightly fewer actual blocks than the old one. Should this happen to you (or to me again in the future) with an ext2/3/4 data volume, these are the steps to take:

  1. Shut down the system
  2. Replace the failed drive
  3. Boot up from a recovery CD (I used a Gentoo install CD)
  4. Use fdisk to partition the new drive with a layout as close as you can get to the old drive's. Here are my two drives:
    Disk /dev/sda: 73.4 GB, 73407868928 bytes
    255 heads, 63 sectors/track, 8924 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0xc36bc36b
       Device Boot      Start         End      Blocks   Id  System
    /dev/sda1   *           1           5       40131   fd  Linux raid autodetect
    /dev/sda2               6         249     1959930   82  Linux swap / Solaris
    /dev/sda3             250        8924    69681937+  fd  Linux raid autodetect
    Disk /dev/sdb: 72.9 GB, 72892735488 bytes
    255 heads, 63 sectors/track, 8862 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0xf496138a
       Device Boot      Start         End      Blocks   Id  System
    /dev/sdb1   *           1           5       40131   fd  Linux raid autodetect
    /dev/sdb2               6         249     1959930   82  Linux swap / Solaris
    /dev/sdb3             250        8862    69183922+  fd  Linux raid autodetect
  5. Note any block counts that differ. In my case, the new sdb3 has 69183922 blocks, which is fewer than the 69681937 on the old drive.
  6. Use resize2fs to shrink the filesystem on the good device to the size of the new partition on the new device. Yes, you are modifying it directly instead of going through the software RAID (one of the few nice things about software RAID). Note that resize2fs reads a bare number as a count of filesystem blocks, so append K (as in 69183922K) unless your filesystem uses 1K blocks:
    resize2fs -f /dev/sda3 69183922
  7. Use mdadm to shrink the RAID device to the size of the new partition on the new device (--size is given in kibibytes, the same unit as the Blocks column in the fdisk output above):
    mdadm /dev/md127 --grow --size=69183922
  8. Just to be safe, run a filesystem check on the new RAID volume:
    e2fsck -y /dev/md127
  9. Add in the new drive to the RAID array:
    mdadm /dev/md127 --add /dev/sdb3
  10. And wait for it to finish resyncing:
    watch cat /proc/mdstat
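
The size arithmetic in steps 5 through 7 can be sketched as a small dry run. It only prints commands and never touches a disk. The device names and block counts are the ones from my drives above; the 128K margin is an assumption on my part, based on the mdadm man page's advice to leave about 128K at the end of each member for the RAID superblock, and the helper names are made up for this example.

```shell
#!/bin/sh
# Dry run only: prints the resize commands, runs none of them.

# fdisk appends "+" to odd-sized partitions; strip it so we can do arithmetic.
strip_plus() { printf '%s\n' "${1%+}"; }

# Print the smaller of two fdisk "Blocks" values (1K units).
smaller_blocks() {
    a=$(strip_plus "$1")
    b=$(strip_plus "$2")
    if [ "$a" -le "$b" ]; then echo "$a"; else echo "$b"; fi
}

OLD_BLOCKS="69681937+"   # /dev/sda3, surviving drive
NEW_BLOCKS="69183922+"   # /dev/sdb3, replacement drive
MARGIN_KB=128            # assumption: headroom for the RAID superblock

TARGET=$(smaller_blocks "$OLD_BLOCKS" "$NEW_BLOCKS")
SAFE=$((TARGET - MARGIN_KB))

# Filesystem first, then the array, in that order.
echo "resize2fs -f /dev/sda3 ${SAFE}K"
echo "mdadm /dev/md127 --grow --size=${SAFE}"
```

Printing the commands first makes it easy to double-check the numbers before committing to anything destructive.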

If you run into device busy errors, you may need to reboot, stop the RAID device (mdadm --stop /dev/md127), and so on. If you get the block sizes wrong, all kinds of bad things happen. Also, if you shrink the RAID volume without shrinking the filesystem, or you shrink the RAID volume first, be prepared to spend far too much time trying to fix things! Sometimes it's faster (if your volume isn't close to full) to resize2fs down to a very small size, shrink the RAID volume to a very small size, let all the synchronizing happen, then grow the RAID volume with "--grow --size=max" and resize your filesystem up to the new size of the RAID volume.
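
Here is a rough sketch of that "shrink small, sync, grow back" sequence, again as a dry run that only prints commands. The 20 GiB target is a hypothetical number (it has to be larger than the space actually in use on your volume), and the print_plan function is just a name picked for this example.

```shell
#!/bin/sh
# Dry run only: prints the shrink-then-grow sequence, runs none of it.

MD=/dev/md127        # RAID device from the steps above
GOOD=/dev/sda3       # surviving member
NEW=/dev/sdb3        # replacement member
SMALL_KB=20971520    # hypothetical 20 GiB target, in kibibytes

print_plan() {
    echo "resize2fs -f $GOOD ${SMALL_KB}K"      # shrink the filesystem first
    echo "mdadm $MD --grow --size=$SMALL_KB"    # then shrink the array
    echo "e2fsck -y $MD"                        # check before adding the new drive
    echo "mdadm $MD --add $NEW"                 # start the resync
    echo "watch cat /proc/mdstat"               # wait for the resync to finish
    echo "mdadm $MD --grow --size=max"          # grow the array back out
    echo "resize2fs $MD"                        # no size given: fill the device
}

print_plan
```

The order matters: filesystem before array on the way down, array before filesystem on the way back up.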

There are worse things to spend part of a Saturday afternoon on, but I'd rather be outside!

Velocity and DevOpsDay 2010

What a crazy week last week was! A lot happened in Silicon Valley at Velocity and DevOpsDay, and it was all pretty awesome.

I flew in Monday and picked up a rental car (my first week in California with a car, and I've spent 7+ months of my life there) and headed to Facebook for lunch with my friend Don from Gallery and some informal conversations about Facebook's operations with two people who manage things there. Dinner and some long conversations about technical teams followed at Bharat's (also from Gallery) place in Menlo Park, and then I was off to a super late check-in at the hotel.

Tuesday morning, Velocity 2010 began. Seth, Wyatt and I from SugarCRM were a small part of the sold-out crowd of 1,200 performance and operations engineers from around the world, including:

  • Everyone that wrote all the books (e.g. Web Operations, co-authored by John Allspaw of Flickr and now Etsy. He signed the pre-release copy I got for free!)
  • Everyone that wrote all the tools (e.g. Adam Jacob of Chef, and Luke Kanies of Puppet.)
  • "The Greatest SysAdmin in the world", Theo Schlossnagle, founder of OmniTI. A quote from him: "Ops super powers come from developing a uniform hatred and mistrust for all tech while maintaining a positive outlook."
  • The kind of people that are as "graph crazy" as I am and use this sticker in their presentations.
  • The developers behind web development and performance tools, including Jaime, a guy I did some grad school projects with who works for Google and did a presentation on SpeedTracer.

Tuesday, Wednesday and Thursday were full of entertaining and informative sessions, and I mostly stuck to the operational track. There is far too much for me to fit in here, but if you'd like to know more about the content of the conference, Ernst at The Agile Admin did great writeups of the sessions he went to (which mostly cover the ones I was in), Royans posted a pretty concise topic and idea summary, and Velocity has the slides and videos available. If you only have time for a few or don't know where to begin, start with these:

  • Scalable Internet Architectures - Theo Schlossnagle's opening session that set the tone for the entire conference.
  • Infrastructure Automation with Chef - I learned a lot about Chef from this talk and talked to Adam Jacob (presenter and Chef author) afterwards about some aspects of how Chef works and some problems I have with Puppet. He was super helpful about both, and once I look at switching out mod_passenger for unicorn in our Puppetmaster, I'll be giving Chef a shot.
  • Ops Meta-Metrics - John Allspaw's great talk about correlating data related to change and incidents with more standard monitoring data.
  • Always Ship Trunk: Managing Change In Complex Websites (PDF) - The way that development should be done and code should be deployed.
  • Facebook Operations – A Day In The Life - This was the only standing-room-only talk I was in, and Tom from Facebook (one of the guys I talked to Monday) explained all kinds of interesting behind-the-scenes details.

Some of the most fun sessions I went to were the evening activities on Tuesday and Wednesday. Tuesday was Ignite Talks, an hour of presentations, each consisting of 20 slides that automatically advance every 15 seconds, forcing the presenter to be quick and concise. This was the first one of these I have been to, and the topics ranged from completing a triathlon to cheap scalable storage using laptop drives. Ernst at The Agile Admin did a great writeup of Velocity 2010: Ignite.

Wednesday evening was "Birds of a Feather" sessions, which are small conversations around a table between people interested in the same things. Up first for me was a demo by ZenOSS/Opscode/Dyn about "Geographically Based Cloud Scaling". They demoed using ZenOSS to detect failure of servers and datacenters, which then updated DNS hosted at DynInc using their API so traffic was only sent to "live" servers. Once enough things failed that there wasn't enough capacity, ZenOSS used Chef (by Opscode) to spin up Amazon EC2 instances and add them to DNS to help handle the load. Very cool demo! I pointed out that ZenOSS was still a single point of failure, and we discussed strategies for using Chef to ensure that a ZenOSS install is always working properly somewhere. Update: ZenOSS Blog post about the demo.

After cloud scaling, I switched rooms to talk "Load Balancing Tips and Tricks" with another small room of people. It turns out that the vast majority of people use NetScalers for load balancing (we at SugarCRM used to use them but switched last December to nginx+wackamole on Dell hardware), and most of them have the same issues we experienced. It's pretty common to not use any of the advanced features in the Citrix load balancers and to do things like session handling, redirects, and serving error pages from upstream servers. In the early days of YouTube, they actually would have to reboot the entire site due to a "feature" in the NetScalers that couldn't be disabled: an "overflow" queue would accept more connections than upstream had resources to handle, with the goal of smoothing out small load spikes. It ended up backing up everything, and the only way to clear out the load and get response times back down was to restart all the web servers, which gave all open connections error 500s. YouTube doesn't serve video content through their NetScalers and neither should you!

To close out Wednesday evening, I headed to Dyntini at the hotel bar: free drinks on DynInc's tab! It's a bunch of nice people, and I'm slowly moving ithought.org DNS over to their infrastructure. Once SugarCRM is ready to do some global load balancing, we'll probably make the switch as well.

Some other interesting talks I have specific thoughts on:

  • Grendel is really neat. It's an HTTP-based document store with strong server-side PGP encryption: if someone gets a copy of your DB, they still don't have your data.
  • Hidden Scalability Gotchas in Memcached and Friends - Neil Gunther comes from an academic background and talked about performance and modeling: "Models are from God, data is from the devil." This talk had a bit of a confrontational response from the audience because ops people in the room want to use models to predict performance and capacity characteristics, but he claims that models are more useful to explain the data. The particular model he presented had things like a variable representing "contention factor," and the value of this variable on a curve fit tells you if cpu/disk/ram/lock/etc contention is limiting the performance of your application. This was used to show some changes that they made to Memcached to improve scalability on systems with large numbers of CPUs.
  • Drizzle - The momentum in Drizzle sounds really promising. It's based on MySQL but is getting a lot of features ripped out and/or converted to plugins, and comes with sane defaults. There isn't a reason to switch to using it yet, but if at some point it is no longer feasible to use MySQL, this will probably be the way to go.
  • Choose Your Own Adventure - Adam from Opscode gave a wonderful presentation that was a great close to Velocity. You had to be there, but he's a great public speaker and everything is either a magic unicorn or it isn't. UPDATE: videos of this are now online!

DevOpsDay was a completely different yet still pretty similar experience. A bunch of us met up at LinkedIn on Friday morning to listen to a round of panels discuss DevOps ideas and culture. The entrance was somewhat sneaky.

All of the panels were full of great speakers, including ones that stood out during their presentations at Velocity and some great people that didn't speak at Velocity like Gene Kim (wrote the first version of Tripwire), Michael Stahnke (created EPEL), and Israel Gat.

"DevOps" is sort of like the new "Cloud". In 3 years companies may be asking "What should our DevOps strategy be?" which seems pretty funny now, but "The Cloud" was the same way several years ago. It's not a technology, but a culture and processes that revolve around people and technologies. Terms like DevOps and "The Cloud" are bad terms, but they're nice because business people see them in magazines and online, so they are more receptive when technical people bring things up. One of the panels was on DevOps culture outside of Web Operations, and I hope to be able to take its content to heart: DevOps is about removing team barriers (Us vs Them), including those with Engineering and those with "them" being "the business". Everything needs to be "we", and companies that get this working correctly see 4x gains in efficiency and productivity.

Adam from Opscode, Theo from OmniTI, Luke from Puppet, and Eric from rPath were in a fantastic panel that ended up focusing much on the "sharing" aspects of the rising world of system automation. Puppet just launched Puppet Forge and Chef has their Cookbooks Library, but at some point configurations are going to be so specific to an enterprise that they'll need to be developed inside of the company. This panel had a lot of back and forth between the panelists, and a lot of audience interaction, with no real conclusion on how much things can truly be automated. Food and drinks at the end of the day with other "DevOps" people led to some great conversations about automation and scalability, and Saturday morning I flew back to Atlanta.

Based on conversations and presentations over the week, here are a few books that are worth checking out:

And lastly, some quotes and concepts to close out with:

  • "Hiring too many people to do things that should be automated gets you a "Meat Cloud". This is bad" (Adam Jacob?) - Focus should be on hiring people that can do the automation instead.
  • "us ops guys are like cicadas. Every 17 years we get to come out" - Velocity is the only real gathering like this, and each year it greatly increases in size. There was a similar sort of thing in the early years of computing, but for some time the sysadmin has been mostly forgotten.
  • "cloud security: It's always going to suck, but we'll always have jobs!" (Ward Spangenberg) - Running your code on someone else's infrastructure is always going to have risks, and smart people will always be needed to make sure everything is functional and secure.
  • "The limiting factor for any business is time it takes to restore from app data, code repo backup, and bare metal." (Theo Schlossnagle) - Operationally, a business can only recover from disaster as quickly as the application data can be restored. If there are other things that make recovery time longer, they need to be fixed and optimized out so that the restoration of data is the only limiting factor.
  • "The purpose of QA is to ensure quality." (John Allspaw) - So why do businesses have QA departments? Shouldn't engineering and operations be responsible for ensuring quality? A lot of the big-name businesses (Facebook for example) don't have QA departments any more and it's working out alright.

That's it for the 2010 version. I'm looking forward to continuing to interact with all the people I met at Velocity and DevOpsDay, and next year should be even bigger and better.

SugarCRM and Caching with APC

At my day job, I'm responsible for everything that is Operations at SugarCRM, including our OnDemand environment where we host SugarCRM for a large number of our customers. The web clusters are a not-too-surprising Open Source stack including nginx, wackamole, Apache, PHP, MySQL, CentOS, memcache, etc, but because of how SugarCRM works, we do run into some interesting challenges from time to time. Our engineering team develops a single product that can be deployed on customer servers, in our OnDemand environment, and anywhere in the cloud, which means that in our environment each customer instance of SugarCRM lives in its own silo. This puts our own unique spin on the common SaaS challenge of scaling up while keeping response times as low as possible.

One of the key components in keeping response times low is PHP opcode caching. When I started at SugarCRM a few years ago, we were using Zend Platform for opcode caching, session clustering, and a few other things. It did a fine job of the opcode caching but one of our "standard procedures" was to restart the Zend session clustering daemon whenever a customer reported certain kinds of problems. Not good! To permanently resolve this, help move us towards a full Open Source stack (since we are an Open Source company!), and to simplify our architecture, we moved to using APC for opcode caching (and memcached for session clustering).

The APC performance gains were similar to using Zend, and we didn't have any issues with APC requiring service restarts for close to 2 years. Over these two years we have continued to add web servers to our OnDemand cluster as the number of customers increased, and a few weeks ago Apache started (seemingly randomly) getting backed up with all of its slots in use "Sending Reply". A simple `apachectl reload` would fix this, but it was eerily similar to the days of Zend. Due to the number of other ongoing projects, I didn't have time to investigate this much, so I set up an alarm in our monitoring system to warn us before the web server got completely backed up so we could proactively fix the issue.

Unrelated, yesterday I was looking at some of our metrics and noticed that our APC cache hit rate was a lot lower than it should be due to a high "Cache Full Count" (The Cache Full Count number was 10% of the size of the Cache Hit Count). We weren't directly monitoring this and I remember making sure the cache was big enough when this was initially set up, but that was from when we had a much smaller OnDemand customer base. To fix this, I bumped up the cache size from 128M to 512M and our configuration management system slowly started pushing out the changes. Cache Full Count was reset to 0 and stayed at 0, and the hit rate went from ~70% to ~90% across the cluster. Problem Solved! Or so I thought.
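
Since we weren't monitoring this directly, here is a minimal sketch of the kind of check that would have caught it. The counter values are made-up examples (chosen so the full count is 10% of the hit count, as above), the function names and the 1% threshold are assumptions for illustration, and how you collect the counters from APC is left out entirely.

```shell
#!/bin/sh
# Integer hit rate in percent: hits / (hits + misses).
hit_rate_pct() {
    hits=$1
    misses=$2
    total=$((hits + misses))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((100 * hits / total))
    fi
}

# Succeeds (alarm) when the cache-full count exceeds max_pct percent of hits.
cache_full_alarm() {
    hits=$1
    full=$2
    max_pct=$3
    [ $((100 * full)) -gt $((max_pct * hits)) ]
}

# Hypothetical counters, not real measurements.
HITS=700000
MISSES=300000
FULL=70000

echo "hit rate: $(hit_rate_pct "$HITS" "$MISSES")%"
if cache_full_alarm "$HITS" "$FULL" 1; then
    echo "ALARM: cache full count is more than 1% of hits"
fi
```

A check like this could feed whatever monitoring system you already use; the point is to alarm on the ratio rather than wait for the hit rate to sag.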

A little later on in the day, the web servers started getting backed up, much more quickly than before and all at the same time, and it was starting to cause timeouts for some customers. The proverbial monkeys in our application had gotten mad at the fan.

With a much bigger problem on my hands, I did some deeper debugging using strace, and the "Sending Reply" Apache processes all seemed to be spinlocked on a futex(FUTEX_WAKE) call. Could this be APC's fault? I disabled APC across the cluster and the problem went away, but I had to figure out a way to fix this because no APC means a significant increase in the minimum load time of every single page in our system. Digging through source code and Googling around, I came across How to Dismantle an APC Bomb, which explains a set of symptoms that seemed very familiar. The cause was explained as contention caused by apc_store calls, which are in the User Cache portion of APC and not the opcode cache, which gave me an idea: just disable the user cache.

With SugarCRM, this can be done by setting 'external_cache_disabled' or 'external_cache_disabled_apc' in config.php, but I didn't really want to touch thousands of config.php files, so I looked in our code and found:

elseif(function_exists("apc_store") && empty($GLOBALS['sugar_config']['external_cache_disabled_apc']))

SugarCRM checks if apc_store exists to see if it can use APC, so I simply added apc_store to disable_functions in our php.ini template, re-enabled APC, tested out the changes, and pushed this out to the cluster.
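
If you push a change like this through configuration management, it helps to verify that it actually landed on each server. Below is a small sketch of such a check. disable_functions is the real php.ini directive; the function name and the throwaway file used for the demo are made up for this example, and your php.ini path will differ.

```shell
#!/bin/sh
# Succeeds if apc_store is listed in disable_functions in the given php.ini.
# Lines commented out with ";" are ignored because they don't start with the directive.
apc_store_disabled() {
    sed -n 's/^[[:space:]]*disable_functions[[:space:]]*=[[:space:]]*//p' "$1" |
        tr ',' '\n' |
        tr -d ' \t"' |
        grep -x 'apc_store' > /dev/null
}

# Demo against a throwaway file rather than a real php.ini.
demo_ini=$(mktemp)
printf 'disable_functions = exec, apc_store\n' > "$demo_ini"
if apc_store_disabled "$demo_ini"; then
    echo "apc_store is disabled, so the user cache is off"
else
    echo "apc_store is still enabled"
fi
rm -f "$demo_ini"
```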

This ended up being both the problem and the solution. It turns out that increasing the memory available to APC meant that things were staying in the cache longer, which made the contention issue even worse. SugarCRM's use of a user cache simply doesn't scale well in a massive clustered environment, and with it disabled, the opcode cache can do a better job. Hit rates have risen to between 92% and 99% across the cluster, and response times are down. Below is a sanitized graph of response times to a test instance designed to represent the worst response times across the OnDemand cluster:

  • APC's memory was increased from 128M to 512M a little after the dotted red line at 12:00. This improved consistency of 'worst case' response times significantly.
  • Sometime between 1pm and 2pm, things started to get Very Bad, and while response times were still down, web servers were crashing left and right.
  • At 2:12, APC was disabled completely. Minimum response times went up but things got suspiciously consistent.
  • After over an hour of investigation and testing, at around 3:20, APC was re-enabled with apc_store disabled, response times went back down, and the response time consistency remained.

I'm glad this issue happened after peak hours on a Friday, and I'm looking forward to seeing how this solution affects response times during peak load periods next week. If you're running SugarCRM at any kind of scale, this may be something to consider.