SugarCRM
reverse proxying with apache and mod_proxy_html
Submitted by ckdake on Mon, 2008-07-07 16:56
I've been fighting to get some reverse proxy things working today at work. Some Python application servers that speak HTTP live on servers with private IP addresses behind the firewall, but they need to be reachable from the outside world via an HTTPS portal that does authentication checking with mod_authnz_ldap. In short, https://example.com/app1/ needs to go to http://app1:8888/. I figured out much of what is below with the help of: http://www.apachetutor.org/admin/reverseproxies.
Apache's mod_proxy seemed like it would be simple enough to use, and two lines of config file changes later, the first page was working. However, redirects from the app servers were sending the client to internal addresses, which didn't work, and absolute URLs in HTML from the app server needed to be changed to include the /app1/ prefix on the externally facing server. Enter mod_proxy_html.
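The post does not show those two lines, but a mod_proxy setup of this kind typically looks something like the sketch below. The path and backend address come from the example above; the exact directives used here are an assumption.

```apache
# Assumed reconstruction, not the actual config from this setup.
# Requires mod_proxy and mod_proxy_http to be loaded.

# Forward requests for /app1/ to the internal app server
ProxyPass /app1/ http://app1:8888/
# Rewrite Location headers in backend redirects to the public path
ProxyPassReverse /app1/ http://app1:8888/
```

Note that ProxyPassReverse only adjusts HTTP response headers such as Location, and only when the redirect target matches the backend URL it was given, so links inside the HTML body are left untouched.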
mod_proxy_html is a third party module that allows content modification including replacing link addresses with different addresses. I downloaded and installed it on the proxy server but it wasn't working. Turning up debugging with
LogLevel debug
ProxyHTMLLogVerbose On
gave me the following message: "No links configured: nothing for proxy-html filter to do", and Google only had one result for this: mod_proxy_html.c, the source code for mod_proxy_html with the error message in it! It turns out that much of the documentation for mod_proxy_html is out of date, and in mod_proxy_html 3.0 the link tag definitions have been removed from the code and must be included in the configuration. Had I looked at the config file provided with the download (instead of the one I'd been writing from howtos), this wouldn't have happened, but it's surprising Google hasn't indexed anyone else running into this! The fix for this was to include the following in my config:
ProxyHTMLLinks a href
ProxyHTMLLinks area href
ProxyHTMLLinks link href
ProxyHTMLLinks img src longdesc usemap
ProxyHTMLLinks object classid codebase data usemap
ProxyHTMLLinks q cite
ProxyHTMLLinks blockquote cite
ProxyHTMLLinks ins cite
ProxyHTMLLinks del cite
ProxyHTMLLinks form action
ProxyHTMLLinks input src usemap
ProxyHTMLLinks head profile
ProxyHTMLLinks base href
ProxyHTMLLinks script src for
ProxyHTMLLinks iframe src
ProxyHTMLEvents onclick ondblclick onmousedown onmouseup \
onmouseover onmousemove onmouseout onkeypress \
onkeydown onkeyup onfocus onblur onload \
onunload onsubmit onreset onselect onchange
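The link definitions alone do not turn the filter on. As a rough sketch of how they fit with the rest of a mod_proxy_html 3.0 setup, following the pattern in the apachetutor guide linked above, the proxied location might look like this. The /app1/ path and backend address are taken from the earlier example; everything else is an assumption rather than the config actually used here.

```apache
# Hypothetical sketch; assumes mod_proxy, mod_proxy_http, mod_headers
# and mod_proxy_html are loaded.
<Location /app1/>
    ProxyPass http://app1:8888/
    ProxyPassReverse http://app1:8888/
    # Run HTML responses through mod_proxy_html
    SetOutputFilter proxy-html
    # Rewrite links that name the backend host, then root-relative links
    ProxyHTMLURLMap http://app1:8888/ /app1/
    ProxyHTMLURLMap / /app1/
    # Ask the backend for uncompressed output so the filter can parse it
    RequestHeader unset Accept-Encoding
</Location>
```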
An Apache restart later, and HTML links were getting rewritten. Neat! On to the next problem: the app servers in question have lots of hardcoded absolute URLs, many of them in CSS and JS files. The mod_proxy_html documentation has an initial solution to this in its technical guide, using a regular expression like:
ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe
However, this only works on inline CSS, because mod_proxy_html only processes HTML content types and not the text/css that CSS files are sent as. A workaround is to set the PROXY_HTML_FORCE environment variable, but in addition to forcing mod_proxy_html to look at CSS files, this forces it to process image files and so on, which uses too much CPU for our use case. Doh!
Setting up each application server as a vhost instead is a lot simpler (the two lines of config I started with here are enough), and while it's less than ideal, we have wildcard SSL certificates, so having https://app1.example.com/ isn't the end of the world and doesn't require any additional IP addresses.
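A per-application vhost along these lines might look like the sketch below. The hostname and backend address come from the examples above; the certificate paths and the listen address are placeholders, and the overall layout is an assumption rather than the production config.

```apache
# Hypothetical sketch; certificate paths are placeholders.
# Name-based SSL vhosts can share one IP address here because the
# wildcard certificate is valid for every *.example.com name.
# NameVirtualHost is needed on Apache 2.2; later versions do not need it.
NameVirtualHost *:443

<VirtualHost *:443>
    ServerName app1.example.com
    SSLEngine on
    SSLCertificateFile /path/to/wildcard.example.com.crt
    SSLCertificateKeyFile /path/to/wildcard.example.com.key

    # The app now lives at the root of its own hostname, so root-relative
    # URLs in HTML, CSS and JS resolve without any content rewriting
    ProxyPass / http://app1:8888/
    ProxyPassReverse / http://app1:8888/
</VirtualHost>
```

Because the public path and the backend path are now both /, no mod_proxy_html filter is involved. URLs that spell out the internal hostname would still need separate handling.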
Adobe Fast Web View
Submitted by ckdake on Sun, 2008-03-16 12:40
Adobe Fast Web View is a very lightly documented but seemingly often used feature in Adobe Acrobat Reader. From the user's point of view, it does what it says and makes pages of a PDF show up in the browser before the entire PDF is completely downloaded, but it's a bit more complicated from a server operator's point of view. And it is enabled by default when installing Adobe Acrobat Reader.
We recently moved sugarcrm.com and some other web properties from stand-alone web servers to a clustered solution involving NFS, load balancers, database replication, etc. It was a pretty complex migration and we're pretty sure that we're running a handful of applications on this cluster that nobody has ever clustered before, so needless to say we ran into our share of gotchas. (Other than one web server that seems to be cursed..) One of them was very strange, and involved PDFs: everything worked fine in all browsers on all platforms until Adobe Acrobat Reader entered the picture. Some number of PDFs would lock up the browser and never load, but only when the PDFs were served from the cluster through the load balancer. When served from one of the cluster web servers but not through the load balancer, everything would work perfectly! Also, with the Adobe plugin disabled, the PDFs would save perfectly and be viewable every time.
Using livehttpheaders, it was apparent that two HTTP requests were being made, so my guess was that the browser would do a GET for the PDF, but when the Adobe plugin took over, it sent a new HTTP request (all about HTTP). This shouldn't be an issue, but things weren't working! I installed Wireshark on my Windows test installation and dug deeper. Immediately, I noticed that all the data packets in the response coming from the server were fragmented. This typically means that there is an MTU problem somewhere. However, Googling around for PDF files, I noticed the same behavior on 50% of the sites I hit, and those PDFs were working fine in the Adobe plugin. Regardless, Igor and I set out to tinker with the MTUs on the web servers and load balancer. Changing the MTU from 1500 down to 1400 did change which PDFs would load in the plugin, but still not all of them would load. Strange!
Again looking in the Wireshark traces, we saw what looked like a TCP reset loop (read all the details about TCP here). After the first part of the data came through successfully, every packet from the server was a RST and the Adobe plugin just sat there waiting for data that was never going to arrive. We poked around the load balancer looking for anything that could cause this, but had no luck. Googling around for this PDF problem, the only solutions we found were recommendations to disable "Fast Web View." What's that? This gave us another thing to search for and led us to a server-side solution in this forum topic. For whatever reason, the load balancer was breaking HTTP requests with a "Request-Range" header, and Adobe Acrobat Reader was using this to attempt to make the PDF load faster. In retrospect, this makes sense, but it sure was a time-consuming thing to discover! If you run into this, the solution is to add the following to your Apache configuration file (or something equivalent if you use lighttpd or something else; we found examples of this happening with other server software):
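# mod_headers provides the Header and RequestHeader directives
LoadModule headers_module modules/mod_headers.so
...
<FilesMatch "\.(mp3|zip|pdf)$">
# Stop advertising byte-range support to clients
Header unset Accept-Ranges
# Strip range-related request headers so the whole file is always sent
RequestHeader unset Range
RequestHeader unset Unless-Modified-Since
RequestHeader unset If-Range
</FilesMatch>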
MySQL, PostgreSQL, and Clustering
Submitted by ckdake on Wed, 2008-01-16 10:37
So Sun Microsystems just bought MySQL. That could turn out to be pretty interesting.
In other news, Clustering and Replication. I've been setting up some highly available systems that need to use both MySQL and PostgreSQL. MySQL has some great options for this and I was able to pick between Master-Master replication (which was the final choice here) or a clustered storage engine, both of which are available from MySQL and are very straightforward to configure.
As for PostgreSQL, this is about the 2nd or 3rd time I've ever installed or used it, much less set it up to scale. (MySQL I've been using for ~10 years and with replication for at least a couple.) I assumed that replication would be as easy as MySQL and it turns out that it is, but there are a lot of 3rd party options that do different things and the documentation could be a lot more thorough. I stumbled across this article on PostgreSQL Replication and HA, which led me to try out PGCluster. Your needs may vary, but this particular situation requires Master-Master replication with a definite upgrade path to High Availability. PGCluster seemed to provide all this, but it is non-trivial to get going. This PGCluster example was the most helpful thing I came across, but it didn't really explain things too well; it just provided a working configuration for a setup very different from mine. Read on for the details.
Some Changes
Submitted by ckdake on Fri, 2008-01-04 10:34
It's been pretty busy the last couple of weeks. A lot's gone on, so here's the short version and I'll likely go a bit more in depth over the next few weeks as I get around to starting to post more regularly.
- I graduated from Georgia Tech with a Master's in Computer Science specializing in Networking.
- I bought a house in Reynoldstown/Cabbagetown in Atlanta, GA and moved in. Still need to get some furniture and fix some things.
- I started work as an Operations Engineer for SugarCRM where I'll be working on scalable infrastructure sorts of things.

