SugarCRM
A Change Of Pace
Submitted by ckdake on Thu, 2010-07-29 10:03
After 2.5 years at SugarCRM, it's time for me to move on. Sugar was a great first job out of graduate school, and my experiences there have been invaluable. It took me from the custom tools I wrote at Georgia Tech for system management and monitoring (for CPR) to in-depth, experience-based knowledge of much better tools other people have written, like Puppet, ZenOSS, Cobbler, and more. What began as a small, confusing mess where everyone had root on all the servers and no idea what other people were doing grew into a managed, monitored, and automated environment that made everything easy, so that people could focus on getting real work done instead of figuring out things like what changes need to be made to php.ini and who might change them to something else. Reimaging a cluster of web servers, database servers, mongodb servers, memcache servers, nginx servers, and mysql servers used to take quite a bit of time, but can now be done in an hour or so.
Each trip out to Silicon Valley was a new and exciting set of experiences, from Velocity last month, to views from the CTO's porch in the hills (like this one), to having beers with friends old and new that work at places like Yahoo, Facebook, Google, Digg, Meebo, etc.
SugarCRM is a great product, has a great future, and I look forward to seeing what happens next in the super-competitive world of CRM. If our growth numbers of requests/second and OnDemand customers over the past few years are any indication, the sky is the limit! It's been great working with everyone at SugarCRM, and there are lots of aspects of my job and the people at SugarCRM that I'll miss.
All that said, it's time for me to move on. Working on a remote team makes some things pretty difficult, travelling across the country all the time has lost its novelty, and the interrupt-driven world of operations prevents the kind of focus needed to solve the kinds of problems I would like to solve, so next up for me is a complete change of pace. Tomorrow, July 30th, is my last day at SugarCRM.
On Monday, August 9th I'm starting at Highgroove in Atlanta to try my hand at Ruby on Rails in a software engineering environment, working in-person at an office a bit more, and being able to manage my time a bit better without the constant flow of tickets and alarms.
Ruby is new to me but I've done plenty of things in Perl/PHP/Python, and Rails is new to me but I've worked on web applications using various frameworks for almost 10 years, so that shouldn't be too much trouble either. The skills I pick up will also help me put a dent in the backlog of personal project ideas I have. I have no idea where the future leads, but regardless of where I end up, the "sweet spot" for me has always been the overlap between different areas of computing (whether it's operations and research, or mobile devices and web services), and Ruby+Rails will be a great thing to have in my toolbox to glue things together. This should be fun!
Velocity and DevOpsDay 2010
Submitted by ckdake on Mon, 2010-06-28 10:43
What a crazy week last week was! A lot happened in Silicon Valley at Velocity and DevOpsDay, and it was all pretty awesome.
I flew in Monday and picked up a rental car (my first week in California with a car, and I've spent 7+ months of my life there) and headed to Facebook for lunch with my friend Don from Gallery and some informal conversations about Facebook's operations with two people that manage things there. Dinner and some long conversations about technical teams followed at Bharat's (also from Gallery) place in Menlo Park, and I was off to a super late check in at the Hotel.
Tuesday morning, Velocity 2010 began. Seth, Wyatt, and I from SugarCRM were a small part of the 1,200-person sold-out crowd of performance and operations engineers from around the world, including:
- Everyone that wrote all the books (e.g. Web Operations, co-authored by John Allspaw of Flickr and now Etsy. He signed the pre-release copy I got for free!)
- Everyone that wrote all the tools (e.g. Adam Jacob of Chef and Luke Kanies of Puppet.)
- "The Greatest SysAdmin in the world", Theo Schlossnagle, founder of OmniTI. A quote from him: "Ops super powers come from developing a uniform hatred and mistrust for all tech while maintaining a positive outlook."
- The kind of people that are as "graph crazy" as I am and use this sticker in their presentations.
- The developers behind web development and performance tools, including a guy I did some grad school projects with named Jaime that works for Google and did a presentation on SpeedTracer.
Tuesday, Wednesday and Thursday were full of entertaining and informative sessions, and I mostly stuck to the operational track. There is far too much for me to fit in here, but if you'd like to know more about the content of the conference, Ernst at The Agile Admin did great writeups of the sessions he went to, which mostly cover the ones I was in, Royans posted a pretty concise topic and idea summary, and Velocity has the slides and videos available. If you only have time for a few or don't know where to begin, start with these:
- Scalable Internet Architectures - Theo Schlossnagle's opening session that set the tone for the entire conference.
- Infrastructure Automation with Chef - I learned a lot about Chef from this talk and talked to Adam Jacob (presenter and Chef author) afterwards about some aspects of how Chef works and some problems I have with Puppet. He was super helpful about both, and once I look at switching out mod_passenger for unicorn for use in our Puppetmaster, I'll be giving Chef a shot.
- Ops Meta-Metrics - John Allspaw's great talk about correlating data related to change and incidents with more standard monitoring data.
- Always Ship Trunk: Managing Change In Complex Websites (PDF) - The way that development should be done and code should be deployed.
- Facebook Operations – A Day In The Life - This was the only standing room only talk I was in, and Tom from Facebook (one of the guys I talked to Monday) explained all kinds of interesting behind-the-scenes details.
Some of the most fun sessions I went to were the evening activities on Tuesday and Wednesday. Tuesday was Ignite Talks which is an hour of presentations, each consisting of 20 slides that automatically advance every 15 seconds forcing the presenter to be quick and concise. This was the first one of these I have been to, and the topics covered ranged from completing a triathlon to cheap scalable storage using laptop drives. Ernst at The Agile Admin did a great writeup of Velocity 2010: Ignite.
Wednesday evening was "Birds of a Feather" sessions which are small conversations around a table between people interested in the same things. Up first for me was a Demo by ZenOSS/Opscode/Dyn about "Geographically Based Cloud Scaling". They demoed using ZenOSS to detect failure of servers and datacenters, which then updated DNS hosted at Dyninc using their API so traffic was only sent to "live" servers. Once enough things failed that there wasn't enough capacity, ZenOSS used Chef (by Opscode) to spin up Amazon EC2 instances and add them to DNS to help handle the load. Very Cool demo! I pointed out that ZenOSS was still a single point of failure, and we discussed strategies for using Chef to ensure that a ZenOSS install is always working properly somewhere. update: ZenOSS Blog post about the demo.
After cloud scaling, I switched rooms to talk "Load Balancing Tips and Tricks" with another small room of people. It turns out that the vast majority of people use NetScalers for load balancing (We at SugarCRM used to use them but switched last December to nginx+wackamole on Dell hardware), and most of them have the same issues we experienced. It's pretty common to not use any of the advanced features in the Citrix load balancers and to do things like session handling, redirects, and serving error pages from upstream servers. In the early days of Youtube, they actually would have to reboot the entire site due to a "feature" in the NetScalers that couldn't be disabled: An "overflow" queue would accept more connections than upstream had resources to handle with the goal of smoothing out small load spikes, but it ended up backing up everything and the only way to clear out the load and get response times back down was to restart all the web servers which gave all open connections error 500s. Youtube doesn't serve video content through their NetScalers and neither should you!
To close out Wednesday evening, I headed to Dyntini at the hotel bar: free drinks on DynInc's tab! It's a bunch of nice people, and I'm slowly moving over ithought.org DNS to their infrastructure. Once SugarCRM is ready to do some global load balancing, we'll probably make the switch as well.
Some other interesting talks I have specific thoughts on:
- Grendel is really neat. It's an HTTP-based document store with strong server-side PGP encryption: if someone gets a copy of your DB, they still don't have your data.
- Hidden Scalability Gotchas in Memcached and Friends - Neil Gunther comes from an academic background and talked about performance and modeling: "Models are from God, data is from the devil." This talk had a bit of a confrontational response from the audience because ops people in the room want to use models to predict performance and capacity characteristics, but he claims that models are more useful to explain the data. The particular model he presented had things like a variable representing "contention factor," and the value of this variable on a curve fit tells you if cpu/disk/ram/lock/etc contention is limiting the performance of your application. This was used to show some changes that they made to Memcached to improve scalability on systems with large numbers of CPUs.
- Drizzle - The momentum in Drizzle sounds really promising. It's based on MySQL but is getting a lot of features ripped out and/or converted to plugins, and comes with sane defaults. There isn't a reason to switch to using it yet, but if at some point it is no longer feasible to use MySQL, this will probably be the way to go.
- Choose Your Own Adventure - Adam from Opscode gave a wonderful presentation that was a great close to Velocity. You had to be there, but he's a great public speaker and everything is either a magic unicorn or it isn't. UPDATE: videos of this are now online!
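The "contention factor" idea from the Memcached scalability talk above can be made concrete with a small sketch. The post does not name the model, so this assumes it was Gunther's Universal Scalability Law, in which relative capacity at concurrency N is N / (1 + α(N − 1) + βN(N − 1)), with α as the contention coefficient and β as the coherency coefficient. The function names and the coefficient values below are hypothetical and are not taken from the talk.

```python
def usl_capacity(n, contention, coherency):
    """Relative throughput at concurrency n under the Universal Scalability Law."""
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


def peak_concurrency(contention, coherency):
    """Concurrency at which modeled throughput peaks (unbounded if coherency is 0)."""
    if coherency == 0:
        return float("inf")
    return ((1.0 - contention) / coherency) ** 0.5


# Hypothetical fitted coefficients, purely for illustration.
for cpus in (1, 8, 32, 64):
    print(cpus, round(usl_capacity(cpus, contention=0.05, coherency=0.0005), 2))
```

In practice the two coefficients come from fitting the curve to measured throughput; a fit dominated by the contention term points at serialization (locks, queues), which matches the way the talk used the model to explain data rather than predict it.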
DevOpsDay was a completely different yet still pretty similar experience. A bunch of us met up at LinkedIn on Friday morning to listen to a round of panels discuss DevOps ideas and culture. The entrance was somewhat sneaky.
All of the panels were full with great speakers, including ones that stood out as being great during their presentations at Velocity and some great people that didn't speak at Velocity like Gene Kim (wrote first version of tripwire), Michael Stahnke (created EPEL), and Israel Gat.
"DevOps" is sort of like the new "Cloud". In 3 years companies may be asking "What should our DevOps strategy be?" which seems pretty funny now, but "The Cloud" was the same way several years ago. It's not a technology, but a culture and processes that revolve around people and technologies. Terms like DevOps and "The Cloud" are bad terms but they're nice because business people see them in magazines and online, so are more receptive when technical people bring things up. One of the panels was on DevOps culture outside of Web Operations, and I hope to be able to take its content to heart: DevOps is about removing team barriers (Us vs Them) including those with Engineering and those with "them" being "the business". Everything needs to be "we" and companies that get this working correctly see 4x gains in efficiency and productivity.
Adam from Opscode, Theo from OmniTI, Luke from Puppet, and Eric from rPath were in a fantastic panel that ended up focusing much on the "sharing" aspects of the rising world of system automation. Puppet just launched Puppet Forge and Chef has their Cookbooks Library, but at some point configurations are going to be so specific to an enterprise that they'll need to be developed inside of the company. This panel had a lot of back and forth between the panelists, and a lot of audience interaction, with no real conclusion on how much things can truly be automated. Food and drinks at the end of the day with other "DevOps" people led to some great conversations about automation and scalability, and Saturday morning I flew back to Atlanta.
Based on conversations and presentations over the week, here are a few books that are worth checking out:
And lastly, some quotes and concepts to close out with:
- "Hiring too many people to do things that should be automated gets you a "Meat Cloud". This is bad" (Adam Jacob?) - Focus should be on hiring people that can do the automation instead.
- "us ops guys are like cicadas. Every 17 years we get to come out" - Velocity is the only real gathering like this, and each year it greatly increases in size. There was a similar sort of thing in the early years of computing, but for some time the sysadmin has been mostly forgotten.
- "cloud security: It's always going to suck, but we'll always have jobs!" (Ward Spangenberg) - Running your code on someone else's infrastructure is always going to have risks, and smart people will always be needed to make sure everything is functional and secure.
- "The limiting factor for any business is time it takes to restore from app data, code repo backup, and bare metal." (Theo Schlossnagle) - Operationally, a business can only recover from disaster as quickly as the application data can be restored. If there are other things that make recovery time longer, they need to be fixed and optimized out so that the restoration of data is the only limiting factor.
- "The purpose of QA is to ensure quality." (John Allspaw) - So why do businesses have QA departments? Shouldn't engineering and operations be responsible for ensuring quality? A lot of the big-name businesses (Facebook for example) don't have QA departments any more and it's working out alright.
That's it for the 2010 version. I'm looking forward to continuing to interact with all the people I met at Velocity and DevOpsDay, and next year should be even bigger and better.
Some Quick Pointers on Improving SugarCRM Performance
Submitted by ckdake on Wed, 2010-06-16 14:19
Getting web applications performing optimally is a never-ending full-time job for a lot of people, including me! I manage the OnDemand environment for thousands of customers at SugarCRM. This post will walk through a few pretty simple steps that can be taken to improve the performance of SugarCRM on a Linux server with MySQL, but they mostly apply to any PHP application running on a LAMP stack.
Tuning a web application is a type of engineering, and as Rico Mariani at Microsoft puts it:
If you are not measuring, you are not engineering.
Monitoring is the key in all of the tips that follow because if you don't know how bad things are, you won't know which of these fixes help you out. Depending on the details of your situation, some can actually do more harm than good, so it's important to go through these one by one, understand their implications, and compare your data from before and after you make any changes.
MySQL Performance
Getting things right in MySQL is another full-time job, but there are a few quick things you can do. First of all, make sure that all of your tables use the same character set. If they do not, queries that join between tables of different character sets can be greatly slowed down. Take a look at the output of
SHOW TABLE STATUS FROM sugarcrm;
and make sure that everything is the same in your Collation column. If you are using utf8_general_ci in most places, and the table my_custom is a different type, just run:
ALTER TABLE my_custom CONVERT TO CHARACTER SET utf8;
Next up is making sure that you are properly utilizing the MySQL Query Cache. This cache saves the results of SELECT queries so that if the same query comes in again before the table changes, the results can be returned much more quickly. To see the configuration for this, run:
SHOW VARIABLES LIKE "query_cache%";
If query_cache_type is set to something other than "ON" or query_cache_size is set to 0, you will want to change these! 32MB is a fine start, but depending on your workload you may need (a lot) more to get the full benefit from this:
SET GLOBAL query_cache_type = 'ON';
SET GLOBAL query_cache_size = 32000000;
This one change can make a significant difference in performance. To tune your numbers, check the output of:
SHOW STATUS LIKE "Qcache%";
After some time running with the Query Cache on, you should see hits go up, and lowmem_prunes should stay pretty low with free_memory not reaching 0. If free_memory seems to run out, or lowmem_prunes continues to climb, increase query_cache_size. There will continue to be lots of inserts and not_cached due to constantly changing tables and queries other than SELECT queries, so don't worry about those numbers.
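The tuning rule above (watch hits, and grow the cache when free memory runs out or low-memory prunes keep climbing) can be sketched as a small check over two samples of the Qcache status counters. This is a rough sketch rather than an official formula: the hit ratio here is one common approximation (hits over hits plus inserts plus not_cached), the function names are made up for illustration, and the sample numbers are hypothetical.

```python
def hit_ratio(status):
    """Approximate query cache hit ratio from SHOW STATUS LIKE "Qcache%" values."""
    hits = int(status["Qcache_hits"])
    misses = int(status["Qcache_inserts"]) + int(status["Qcache_not_cached"])
    total = hits + misses
    return hits / total if total else 0.0


def should_grow_cache(previous, current):
    """True if free memory hit zero or low-memory prunes rose between two samples."""
    out_of_memory = int(current["Qcache_free_memory"]) == 0
    prunes_climbing = (int(current["Qcache_lowmem_prunes"])
                       > int(previous["Qcache_lowmem_prunes"]))
    return out_of_memory or prunes_climbing


# Hypothetical samples taken some minutes apart.
earlier = {"Qcache_hits": 500, "Qcache_inserts": 60, "Qcache_not_cached": 15,
           "Qcache_lowmem_prunes": 3, "Qcache_free_memory": 9000000}
later = {"Qcache_hits": 900, "Qcache_inserts": 80, "Qcache_not_cached": 20,
         "Qcache_lowmem_prunes": 40, "Qcache_free_memory": 120000}

print(hit_ratio(later), should_grow_cache(earlier, later))
```

Comparing two samples instead of reading a single value matters because these counters only ever increase; it is the rate of change that says whether the cache is under memory pressure.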
Generally, using the InnoDB storage engine instead of the default MyISAM storage engine will help out as well, but any tables that you want to be able to do full-text search on cannot be InnoDB, and there is a lot more to take into consideration here than any quick fix. If you are using all MyISAM tables, this would be a good thing to investigate. Switching to InnoDB and tuning your MySQL storage engine based on your workload are a lot more than I can fit here today, but there is a lot of information out there about these. Head over to the MySQL Performance Blog for more information than you ever wanted to know about MySQL performance (including an excellent article on the MySQL Query Cache).
All of your MySQL STATUS values should be included in your monitoring system, as graphs of these are very helpful over time for diagnosing problems and tuning cache sizes! Also, remember that the Query Cache uses up RAM so keep an eye on that as well. In this case, it will use 32MB and as you increase the query_cache_size, you are directly increasing RAM usage.
PHP Performance
Perhaps the single biggest performance enhancement that you can make for SugarCRM and many other PHP applications is OpCode Caching. I've written about some of the complexity involving this before at SugarCRM and Caching with APC, but the simple version is that OpCode caching will improve the performance of your PHP applications. At SugarCRM, we use APC, which is pretty straightforward to install. On CentOS and other rpm/yum Linux distributions, simply:
yum install php-pecl-apc
or
pecl install apc
Then, enable apc by adding its configuration to /etc/php.ini or /etc/php.d/apc.ini, and make sure that apc.enabled is set to 1 and apc.shm_size is set to something reasonable. For one instance of SugarCRM, the default of 32 (for 32MB) is generally plenty, but if you have other PHP things running on the server, a bigger number might help.
Copy /usr/share/pear/apc.php to somewhere in your webroot and visit it in your browser, and after a bit of usage of SugarCRM, you should see the "Hits" percentage climb while the "Misses" percentage gets smaller and smaller. Your hit ratio should be higher than 98%. If the "Cache full count" number is more than a very very small percentage of your hit counter, you should increase the size of memory available to APC. On some Linux systems, shm segment size is limited to 32M so instead of increasing shm_size, you will need to increase shm_segments.
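The two checks described above (a hit ratio above 98%, and a "Cache full count" that stays a tiny fraction of the hit counter) are easy to turn into a monitoring rule. The sketch below assumes you have already pulled the three counters out of apc.php or your monitoring system; the function name and the 0.1% threshold for "very very small" are assumptions chosen for illustration, not values from APC.

```python
def apc_health(num_hits, num_misses, cache_full_count,
               min_hit_ratio=0.98, max_full_fraction=0.001):
    """Evaluate APC opcode cache counters against the rules of thumb above.

    min_hit_ratio comes from the 98% guideline; max_full_fraction is a
    hypothetical stand-in for "a very small percentage of the hit counter".
    """
    total = num_hits + num_misses
    ratio = num_hits / total if total else 0.0
    full_fraction = cache_full_count / num_hits if num_hits else 0.0
    return {
        "hit_ratio": ratio,
        "hit_ratio_ok": ratio > min_hit_ratio,
        "grow_cache": full_fraction > max_full_fraction,
    }


# Hypothetical counters: a healthy server and one that needs more shared memory.
print(apc_health(990, 10, 0))
print(apc_health(700, 300, 70))
```

The second call mirrors a cache whose full count is 10% of its hits, which is the kind of reading that should trigger a larger apc.shm_size (or more shm_segments on systems with a 32M segment limit).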
Like MySQL, these numbers should be part of your monitoring system so that you know when it's time to change things. Also like MySQL, this directly uses RAM. This example will use 32MB, and increasing it will use more.
Client-Side Performance
Why bother serving files to clients that never (or seldom) change? A common industry best practice is to set a Cache-Control header for static content to 10 days in the future. Tools like YSlow will let you know if you are not doing this. (They will also give you a lot of other pointers on performance! We have internal bugs open at SugarCRM for most of the client side performance issues that tools like YSlow have identified.)
We use apache on our application servers, and to enable these cache-control headers we added this:
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf|xml|txt)$">
Header set Cache-Control max-age=36000,public
</FilesMatch>
For SugarCRM OnDemand, this cut our number of requests-per-second in half and made a several-hundred-millisecond improvement in the time it takes to load pages from SugarCRM OnDemand.
For this one, monitor your req/s and check your PHP Applications after any changes using a tool like YSlow.
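Since max-age in a Cache-Control header is expressed in seconds, it is worth computing the value rather than typing it by hand. The helper below is a small sketch with hypothetical function names; it shows that the 36000 used in the Apache snippet above works out to 10 hours, while a 10-day lifetime would be 864000.

```python
from datetime import timedelta


def max_age_seconds(lifetime):
    """Convert a cache lifetime into the integer seconds that max-age expects."""
    return int(lifetime.total_seconds())


def cache_control(lifetime, public=True):
    """Build a Cache-Control header value for static content."""
    parts = ["max-age=%d" % max_age_seconds(lifetime)]
    if public:
        parts.append("public")
    return ", ".join(parts)


print(cache_control(timedelta(hours=10)))
print(cache_control(timedelta(days=10)))
```

Which lifetime is right depends on how often static files change between releases; a longer value saves more requests but makes it more important to rename or version files when they do change.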
SugarCRM Specific
Lastly, a few specific changes just for SugarCRM:
- Add 'disable_count_query' => true to config.php. This replaces the "1-5 of 200" with a "1-5 of 6+" and makes a big difference when you have a lot of records. This configuration setting should be a default in a future release of SugarCRM.
- Disable explicit caching in the app by setting 'external_cache_disabled' => true in config.php. While SugarCRM can use external caches like APC's user cache and Memcached for some things, this shouldn't be enabled unless you have set up the infrastructure to handle these. This configuration setting should be a default in a future release of SugarCRM.
- If developer mode is enabled and you are not using it, turn this off! In SugarCRM 6.0, this is under Administration -> System Settings -> Advanced. This is off by default, and if it gets turned on and accidentally left on, it causes severe performance degradation.
- Turn logging down to the minimum that you need. If your SugarCRM installation is on an NFS-based document root with multiple servers, this becomes even more important. This is under Administration -> System Settings -> Logger Settings. "Error" or "Fatal" are good options here.
- Turn off tracker features you are not using. This is in Admin -> Tracker. Each enabled piece of functionality causes things to be written to the database, and will slow down page loads. Leave on only what you need for reports that you have written.
That's a start!
With those changes, you should be off to a pretty good start at improving the performance of your installation of SugarCRM! These things help us keep SugarCRM performing quickly in the SugarCRM OnDemand environment with a pretty high ratio of active users to servers, and a recent application of these tips that I did to a site that handles 200k+ unique users per month led to a 65% decrease in system utilization and much improved response times.
[Cross posted on developers.sugarcrm.com.]
SugarCRM and Caching with APC
Submitted by ckdake on Sat, 2010-03-27 08:46
At my day job, I'm responsible for everything that is Operations at SugarCRM, including our OnDemand environment where we host SugarCRM for a large number of our customers. The web clusters are a not-too-surprising Open Source stack including nginx, wackamole, Apache, PHP, MySQL, CentOS, memcache, etc, but because of how SugarCRM works, we do run into some interesting challenges from time to time. Our engineering team develops a single product that can be deployed on customer servers, in our OnDemand environment, and anywhere in the cloud, which means that in our environment each customer instance of SugarCRM lives in its own silo. This puts our own unique spin on the common SaaS challenge of scaling up while keeping response times as low as possible.
One of the key components in keeping response times low is PHP opcode caching. When I started at SugarCRM a few years ago, we were using Zend Platform for opcode caching, session clustering, and a few other things. It did a fine job of the opcode caching but one of our "standard procedures" was to restart the Zend session clustering daemon whenever a customer reported certain kinds of problems. Not good! To permanently resolve this, help move us towards a full Open Source stack (since we are an Open Source company!), and to simplify our architecture, we moved to using APC for opcode caching (and memcached for session clustering).
The APC performance gains were similar to using Zend, but we didn't have any issues with APC requiring service restarts for close to 2 years. Over these two years we have continued to add web servers to our OnDemand cluster as the number of customers increased, and a few weeks ago Apache started (seemingly randomly) getting backed up with all of its slots in use "Sending Reply". A simple `apachectl reload` would fix this, but it was eerily similar to the days of Zend. Due to the number of other ongoing projects, I didn't have time to investigate this much, so I set up an alarm in our monitoring system so that we would be alerted before the web server got completely backed up and could proactively fix the issue.
Unrelated, yesterday I was looking at some of our metrics and noticed that our APC cache hit rate was a lot lower than it should be due to a high "Cache Full Count" (The Cache Full Count number was 10% of the size of the Cache Hit Count). We weren't directly monitoring this and I remember making sure the cache was big enough when this was initially set up, but that was from when we had a much smaller OnDemand customer base. To fix this, I bumped up the cache size from 128M to 512M and our configuration management system slowly started pushing out the changes. Cache Full Count was reset to 0 and stayed at 0, and the hit rate went from ~70% to ~90% across the cluster. Problem Solved! Or so I thought.
A little later on in the day, the web servers started getting backed up, much more quickly than before and all at the same time, and it was starting to cause timeouts for some customers. The proverbial monkeys in our application had gotten mad at the fan.
With a much bigger problem on my hands, I did some deeper debugging using strace, and the "Sending Reply" apache processes all seemed to be spinlocked on a futex(FUTEX_WAKE) call. Could this be APC's fault? I disabled APC across the cluster and the problem went away, but I had to figure out a way to fix this because no APC means a significant increase in the minimum load time of every single page in our system. Digging through source code and Googling around, I came across How to Dismantle an APC Bomb, which explains a set of symptoms that seemed very familiar. The cause was explained as contention caused by apc_store calls, which are part of the User Cache portion of APC and not the opcode cache, which gave me an idea: just disable the user cache.
With SugarCRM, this can be done by setting 'external_cache_disabled' or 'external_cache_disabled_apc' in config.php, but I didn't really want to touch thousands of config.php files, so I looked in our code and found:
elseif(function_exists("apc_store") && empty($GLOBALS['sugar_config']['external_cache_disabled_apc']))
SugarCRM checks if apc_store exists to see if it can use APC, so I simply added apc_store to disable_functions in our php.ini template, re-enabled APC, tested out the changes, and pushed this out to the cluster.
This ended up being the problem and the solution. It turns out that increasing the memory available to APC meant that things were staying in the cache longer which made the contention issue even worse. SugarCRM's use of a user cache simply doesn't scale well in a massive clustered environment, and with it disabled, the opcode cache can do a better job. Hit rates are up to 92% to 99% across the cluster, and response times are down. Below is a sanitized graph of response times to a test instance designed to represent the worst response times across the OnDemand cluster:

- APC's memory was increased from 128M to 512M a little after the dotted red line at 12:00. This improved consistency of 'worst case' response times significantly.
- Sometime between 1pm and 2pm, things started to get Very Bad, and while response times were still down, web servers were crashing left and right.
- At 2:12, APC was disabled completely. Minimum response times went up but things got suspiciously consistent
- After over an hour of investigation and testing, at around 3:20, APC was re-enabled with apc_store disabled, response times went back down, and the response time consistency remained.
I'm glad this issue happened after peak hours on a Friday, and I'm looking forward to seeing how this solution affects response times during peak load periods next week. If you're running SugarCRM at any kind of scale, this may be something to consider.
Hairpinning with a Cisco ASA
Submitted by ckdake on Fri, 2009-11-13 12:01
What a long battle with Cisco IOS this has been, but after quite a bit of tinkering I've gotten things working the way that I would like. Here's a technical description of the details in hope that this helps someone else.
The Setup
- Load balancers with private IP address like 172.16.0.10 on a /24, running example.com
- Cisco ASA Firewalls running 7.2(1) or newer, that map public IP addresses (I'll use 192.168.0.193 on a /24 here instead of a real public IP)
- Internal DNS servers that map loadbalancer.private to 172.16.0.10
- External DNS servers that map example.com to 192.168.0.193
- Random application server behind the firewall with no public IP address and a private IP of 172.16.0.20
The Problem
Applications behind the firewall need to access other applications behind the firewall using the public DNS name (example.com) instead of the private one (loadbalancer.private).
Some possible solutions
As an easy-to-set-up solution, we currently have the internal dns servers set up to map example.com to 172.16.0.10 which works fine, except it requires updating DNS records in multiple places. Our naming scheme slowly got a bit more complex, and I've had to add explicit relay rules to our DNS server configuration files to relay certain lookups from the internal DNS servers to the external DNS server's internal IP address. Sending it to the DNS server's external IP address doesn't work because the Cisco ASA will not send traffic back out on the same interface that it came in on, even after network translations have been done. (For a different portion of our external IP space, I added some static routes to the core router but when we move those IPs behind this firewall, this ASA feature will break those routes as well)
The current mapping of public IPs to private IPs looks like:
static (inside,outside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255
One feature that Cisco suggests to solve our problem is using "DNS Doctoring" which is just simply adding the 'dns' keyword to the end of the mapping like:
static (inside,outside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255 dns
which modifies DNS queries going through the firewall from the inside interface to change the IP from 192.168.0.193 to 172.16.0.10. This would be great if your DNS server is outside of the firewall, which ours is not. Our internal DNS queries never travel through the ASA, so this didn't do anything for us.
Up next was trying out
same-security-traffic permit intra-interface
which "permits communication in and out of the same interface" which sounds like it's the exact right solution for the problem because that was the limitation that broke things. However, adding this in didn't seem to change anything and traffic still was not permitted in and out the same interface.
The Solution
After a lot of troubleshooting, which involves an ASA 5510 and a 3524-XL on the floor under my desk, downloading and installing new versions of IOS, a lot of Googling, a lot of cursing, and a lot of sketching possible things out on paper, I finally figured out the missing piece: Hairpinning which is "the process by which traffic is sent back out the same interface on which it arrived." Here is the configuration that finally got traffic flowing from 172.16.0.10 to 192.168.0.193 on the ASA back out to 172.16.0.10 on the same interface it started on:
!--- Output suppressed.
!
interface Ethernet0/0
 nameif outside
 security-level 0
 ip address 192.168.0.192 255.255.255.0
!
interface Ethernet0/1
 nameif inside
 security-level 100
 ip address 172.16.0.1 255.255.0.0
!
!--- Output suppressed.
!
same-security-traffic permit intra-interface
access-list outside_in extended permit icmp any any
access-list outside_in extended permit tcp any any
!
!--- Output suppressed.
!
global (outside) 1 interface
nat (inside) 1 172.16.0.0 255.255.0.0
alias (inside) 192.168.0.193 172.16.0.10 255.255.255.255
alias (inside) 10.0.0.20 172.16.0.20 255.255.255.255
static (inside,outside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255
access-group outside_in in interface outside
!
!--- Output suppressed.
The trick here, combined with "same-security-traffic permit intra-interface", was to add the alias lines. The first one:
alias (inside) 192.168.0.193 172.16.0.10 255.255.255.255
does something sensible and aliases 192.168.0.193 to 172.16.0.10 on the inside interface so any time traffic comes in here matching that IP, it gets rewritten. The second line is also required but doesn't make as much sense:
alias (inside) 10.0.0.20 172.16.0.20 255.255.255.255
This line tells the ASA to take any traffic coming in destined for 10.0.0.20 and map it to 172.16.0.20. However, we don't have any devices on 10.0.0.0/8 and there are no routes for it, so there will never be any traffic coming in to 10.0.0.20. That said, this line has to exist so that there is a mapping back to 172.16.0.20 in the alias table and the ASA knows it's alright to send traffic to it. Using a "real" public IP here would both use up our public IPs and perhaps pose some security risk, so it's safer to use these non-public IPs and add a rule to prevent incoming traffic from the outside from reaching them. If the alias command worked for an IP range instead of one host, this would be pretty much perfect.
The result
Things finally work! Here is a trace of a ping from 172.16.0.20 to 192.168.0.193 (which works now!):
ICMP echo request from VLAN100:172.16.0.20 to VLAN100:192.168.0.193 ID=12034 seq=0 len=56
ICMP echo request translating VLAN100:172.16.0.20 to VLAN100:10.0.0.20
ICMP echo request untranslating VLAN100:192.168.0.193 to VLAN100:172.16.0.10
So the ASA is doing the translation the proper way and not doing anything with 10.0.0.20. This is good news because it means that our naming and routing architecture can be greatly simplified:
- All relay rules for external facing domains that have previously required this "split-horizon" DNS can be removed, returning the DNS server configurations to a generic state
- All crazy static routes for external IP addresses can be removed from our core router
- All external facing domain zones can be removed from the internal DNS servers, and updates when things are moved only have to be done in one place
The only cost is adding an alias line to our ASA configuration for each existing static mapping, plus an alias line for each server that needs to reach the external IP addresses of things behind the same ASA. That should be limited to the internal DNS servers and a few application servers.
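Since the first kind of alias line mirrors an existing static mapping, it could be generated rather than typed. Here is a rough Python sketch of that idea. The helper name `alias_lines` is made up, and it assumes every mapping is a one-host `static` line written exactly in the form used in this post; anything else is skipped.

```python
import re

# Matches one-to-one mappings in the form used in this post, e.g.
#   static (inside,outside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255
STATIC_RE = re.compile(
    r"^static \((?P<real_if>[^,\s]+),(?P<mapped_if>[^)\s]+)\) "
    r"(?P<public>\S+) (?P<private>\S+) netmask 255\.255\.255\.255\b"
)


def alias_lines(config_text):
    """Build one alias line per host static mapping found in config_text."""
    lines = []
    for raw in config_text.splitlines():
        match = STATIC_RE.match(raw.strip())
        if match:
            # Reuse the real (inside) interface name and both addresses.
            lines.append(
                "alias ({real_if}) {public} {private} 255.255.255.255".format(
                    **match.groupdict()
                )
            )
    return lines


sample = """\
static (inside,outside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255
access-group outside_in in interface outside
"""
for line in alias_lines(sample):
    print(line)
```

The per-client lines (like the 10.0.0.20 one) still have to be picked by hand, since the placeholder address is not derived from anything in the config.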
References
- Configuring Interfaces for the Cisco ASA 5505 Adaptive Security Appliance - cisco.com
- same-security-traffic through show asdm sessions Commands - cisco.com
- PIX/ASA: Perform DNS Doctoring with the static Command and Two NAT Interfaces Configuration Example - cisco.com
EDIT: Another way to do this
After sharing this with some coworkers, it turns out that 'hairpinning' is definitely the key word and one of them stumbled across this article:
Setup U-Turn (Hairpinning) on Cisco ASA
It solves the same issue with a slightly more graceful solution because no alias entries are needed for non-public services. In fact, no aliases are needed at all. To get the exact same functionality as above, here is the working configuration for the same problem with this new methodology:
!--- Output suppressed.
!
interface Ethernet0/0
 nameif outside
 security-level 0
 ip address 192.168.0.192 255.255.255.0
!
interface Ethernet0/1
 nameif inside
 security-level 100
 ip address 172.16.0.1 255.255.0.0
!
!--- Output suppressed.
!
same-security-traffic permit intra-interface
access-list outside_in extended permit icmp any any
access-list outside_in extended permit tcp any any
!
!--- Output suppressed.
!
global (outside) 1 interface
global (inside) 1 interface
nat (inside) 1 172.16.0.0 255.255.0.0
static (inside,outside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255
static (inside,inside) 192.168.0.193 172.16.0.10 netmask 255.255.255.255
access-group outside_in in interface outside
!
!--- Output suppressed.
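Here is a rough way to picture what the two extra lines do, as a toy Python model. It rests on two assumptions of mine that are not spelled out above: that "global (inside) 1 interface" makes hairpinned traffic leave with the ASA's inside address (172.16.0.1) as its source, and that "static (inside,inside)" rewrites the destination back to the real server address. The `Packet` type and `hairpin` function are made up for illustration.

```python
from collections import namedtuple

Packet = namedtuple("Packet", ["src", "dst"])

ASA_INSIDE_IP = "172.16.0.1"                       # from interface Ethernet0/1
INSIDE_STATICS = {"192.168.0.193": "172.16.0.10"}  # static (inside,inside)


def hairpin(packet):
    """Toy model of an inside packet making a U-turn on the inside interface."""
    real_dst = INSIDE_STATICS.get(packet.dst)
    if real_dst is None:
        return packet  # not addressed to a mapped public IP; leave it alone
    # Destination: public address -> real server address.
    # Source: hidden behind the ASA (assumed) so the reply returns through it.
    return Packet(src=ASA_INSIDE_IP, dst=real_dst)


print(hairpin(Packet(src="172.16.0.20", dst="192.168.0.193")))
```

If that picture is right, the server sees the connection as coming from the firewall and answers it there, so the firewall can undo both translations on the way back.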
June California Trip
Submitted by ckdake on Sat, 2009-06-06 21:06
This past Monday, Seth and I headed to California for the week to get some work done. We didn't have hotel reservations that we knew of, and had a mess of things to clean up in the datacenter, so we drove the rental car straight from the airport to the office at around 11am on Monday and got started.
Monday
We spent Monday in the office, getting some face time with the new office IT guy ("chicks" is his username, which is the source of much hilarity) and meeting with some people that we have ongoing projects with. Lunch was at Dittmer's Gourmet Meats and dinner was 4x4s at In-N-Out. We ended up crashing at Jesse's house after sitting in his hot tub drinking Mickey's and watching Apocalypse Briggs (part 1 here, additional parts in "related videos"). It's nice sharing rooms with Seth because he likes sleeping on the floor, which means no complicated figuring out of beds/couches/etc. A pillow and a blanket, and he's set!
Tuesday
Pretty early on Tuesday, we headed directly to the datacenter, stopping at Le Boulanger on the way for tasty breakfast sandwiches. After getting our hands added to the biometrics system, we began sorting spare parts, getting rid of trash and server packaging, and removing wires that weren't plugged in to anything. 2 people from Virident Systems showed up with a box for us to install that we're doing some experimenting with, and things are looking pretty good so far with that. They took us out to eat at a Malaysian place that was pretty good, and our afternoon in the datacenter was more cleaning up. We drove to Thee Parkside in the city for beer and $2 tacos with some of the Gallery crew, and headed over to Digg with Robert for a few more beers. Afterwards, Seth and I drove Bharat back home and slept at his brand new house in Menlo Park. Digg HQ:
Wednesday
We started off Wednesday morning dropping Bharat off at work at Google, and getting a quick tour of Google HQ for Seth. After that was another datacenter day, interrupted by a trip to the office for some Japanese food for lunch. The grand total of trash we cleaned up filled a 48-gallon plastic bin, and we began fixing labels on machines, noting rack locations in our ZenOSS installation, and properly labeling all the outlets on our PDUs and what they are connected to. Aside from everything looking a _lot_ better, highlights of the day included finding a machine we didn't know about with 32G of RAM (now an OpenVZ box doing a lot of things). For the evening we headed up to Lila's place in the hills of Los Gatos where the SugarCRM IT crew enjoyed beers and pork ribs, and Seth and I slept in a spare room there after staying up long past the always amazing sunsets:

Thursday
After the crazy drive back down from Lila's, we headed to the datacenter for the morning. It took us about 4 hours to finish things up including rewiring all the cat5 in one rack and mostly wiring up a new rack of machines (still waiting on the switch and PDUs before that will be done). Back at the office, we had a very late lunch of more 4x4s at In-N-Out because they apparently couldn't make us 5x5s. I spent the rest of the afternoon catching up on some of the ticket backlog assigned to me since we'd been busy all week doing other things, and around 6:30 we drove up to Igor's place and got to see "mini beast", Igor's newborn. Several other SugarCRM people met up with us to head to Whiskey Thieves for some whiskey sampling. At some point, Julian and I put a few dollars into the Area51 machine there and ended up with 5th and 6th place on the high score list, and he told us "The Japanese Fan Story" which you should get him to share if you haven't heard it yet. Afterwards, we stopped by The Owl Tree and ended up at Cocobang for some super spicy Korean BBQ chicken to finish off the week. A week's work:

Friday
Friday morning was back to the airport to fly home. It was another crazy exhausting week in California and while we got a lot done, I'm definitely glad to be home. Delta helped us out because both our flight out and our flight back took ~45 minutes less than expected. All meals not described above were either not eaten, or consisted of cherry coke and taquitos from 7-Eleven. Now that I'm home, it's time to hunt down some people to pay their hosting bills (Eldon: while biking today I saw you on your bike, so I know you are alive!) and mow the grass. Pictures from the week are at http://ckdake.com/gallery/2009/june-california/.

