Apache
reverse proxying with apache and mod_proxy_html
Submitted by ckdake on Mon, 2008-07-07 15:56
I've been fighting to get some reverse proxy things working today at work. Basically, some Python application servers that speak HTTP live on servers with private IP addresses behind the firewall, but they need to be reachable from the outside world via an HTTPS portal that does authentication checking with mod_authnz_ldap. In short, https://example.com/app1/ needs to go to http://app1:8888/. I figured out much of what is below with the help of: http://www.apachetutor.org/admin/reverseproxies.
Apache's mod_proxy seemed like it would be simple enough to use, and 2 lines of config file changes later, the first page was working. However, redirects from the app servers were sending the client to internal addresses, which didn't work, and absolute URLs in HTML from the app server needed to be changed to include the /app1/ prefix on the externally facing server. Enter mod_proxy_html.
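The redirect half of that problem is easy to picture in a few lines of Python. This is only an illustration of the idea (map a backend Location header onto the public prefix), not how mod_proxy implements it; the function name and its parameters are made up for the example.

```python
def rewrite_location(location: str, backend_base: str, public_base: str) -> str:
    """Map a redirect target on the internal app server to the public URL.

    Hypothetical helper for illustration only: backend_base and public_base
    are the two ends of the mapping described above.
    """
    if location.startswith(backend_base):
        # Keep everything after the backend prefix and graft it onto
        # the externally visible prefix.
        return public_base + location[len(backend_base):]
    # Redirects to anywhere else are passed through untouched.
    return location


if __name__ == "__main__":
    print(rewrite_location(
        "http://app1:8888/login",
        "http://app1:8888/",
        "https://example.com/app1/",
    ))
```

Without a mapping like this, the browser is told to go to http://app1:8888/login, which it cannot reach from outside the firewall.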
mod_proxy_html is a third party module that allows content modification including replacing link addresses with different addresses. I downloaded and installed it on the proxy server but it wasn't working. Turning up debugging with
LogLevel debug
ProxyHTMLLogVerbose On
gave me the following message: "No links configured: nothing for proxy-html filter to do", and Google only had one result for this: mod_proxy_html.c - the source code for mod_proxy_html with the error message in it! It turns out that much of the documentation for mod_proxy_html is out of date, and in mod_proxy_html 3.0 the link tag definitions have been removed from the code and must be included in the configuration. Had I looked at the config file provided with the download (instead of the one I'd been writing from howtos), this wouldn't have happened, but it's surprising Google hasn't indexed anyone else running into this! The fix for this was to include the following in my config:
ProxyHTMLLinks a href
ProxyHTMLLinks area href
ProxyHTMLLinks link href
ProxyHTMLLinks img src longdesc usemap
ProxyHTMLLinks object classid codebase data usemap
ProxyHTMLLinks q cite
ProxyHTMLLinks blockquote cite
ProxyHTMLLinks ins cite
ProxyHTMLLinks del cite
ProxyHTMLLinks form action
ProxyHTMLLinks input src usemap
ProxyHTMLLinks head profile
ProxyHTMLLinks base href
ProxyHTMLLinks script src for
ProxyHTMLLinks iframe src
ProxyHTMLEvents onclick ondblclick onmousedown onmouseup \
onmouseover onmousemove onmouseout onkeypress \
onkeydown onkeyup onfocus onblur onload \
onunload onsubmit onreset onselect onchange
An Apache restart later, and HTML links were getting rewritten. Neat! On to the next problem.. the app servers in question have lots of hardcoded absolute URLs, many of them in CSS and JS files. The documentation has an initial solution to this in their technical guide, using a regular expression like:
ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe
However this only works on inline CSS because mod_proxy_html only works on html content types and not the text/css that CSS files are sent as. A workaround for this is setting the PROXY_HTML_FORCE environment variable, but in addition to forcing mod_proxy_html to look at css files, this forces it to process image files, etc, which uses up too much CPU for our use case. Doh!
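To see what that url(...) rule does to a chunk of CSS, here is a rough Python equivalent. It is a sketch for experimenting with the pattern, not a replacement for mod_proxy_html: the function name is made up, the dots in the hostname are escaped here (the directive above leaves them unescaped), and re.IGNORECASE stands in for the "i" flag.

```python
import re

# Same idea as the ProxyHTMLURLMap pattern above. The hostnames come from
# that example; escaping the dots is a small tightening added here.
INTERNAL_URL = re.compile(
    r"url\(http://internal\.example\.com([^\)]*)\)",
    re.IGNORECASE,
)


def rewrite_css_urls(css: str) -> str:
    """Point url(...) references at the proxy instead of the internal host."""
    # \1 is the path captured after the internal hostname.
    return INTERNAL_URL.sub(r"url(http://proxy.example.com\1)", css)


if __name__ == "__main__":
    sample = "body { background: url(http://internal.example.com/img/bg.png); }"
    print(rewrite_css_urls(sample))
```

Running the pattern over a stylesheet by hand like this is a cheap way to check it before wiring it into the proxy.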
Setting up each application server as a vhost instead is a lot simpler (the 2 lines of config I started with here are enough), and while it's less than ideal, we have wildcard SSL certificates so having https://app1.example.com/ isn't the end of the world and doesn't require any additional IP addresses.
PHP Security, Round 2
Submitted by ckdake on Sun, 2008-03-23 21:14
As I've noticed from watching hits on my site here, many of you have read my page on PHP security using mod_fastcgi and suexec. The logic on that page still holds, but Gentoo decided to make the switch from mod_fastcgi to mod_fcgid and it broke all sorts of things for me. I got things scratched back together without any security on my old server, and with the installation of my new server a few weeks ago, I set things up more securely again. I still think this way is the way to go for a server where many of the virtual hosts will seldom see traffic, but if you're running lots of high traffic sites and have a little bit of RAM overhead, you might want to check out this article on mpm-peruser.
For this setup, I decided to stick to some standards. This means no more changing the suexec directory, using /data/, or anything like that. Other than that, the key differences from last time:
- All configuration is now done with a setup script instead of using a mysql database. There was not really any point for the host names to be in a database, and it makes setup/teardown scripts easier to write as just a bash script.
- Some hosts have PHP, some don't, so no point in setting up all the overhead if a host isn't going to use PHP.
- Most hosts won't have any interest in having their own logs. Statistics can be done using client side things such as Google Analytics, and Apache is happier writing all the logs to 1 place instead of hundreds. I also have split-logs running when logs are rotated, so logs can easily be gathered per-site as needed, just not real time by one of my hosting customers. I've never known of one of my customers using live access to their logs.
- php.ini files are now stored with the wrapper script in the site's cgi-bin directory and file system extended attributes are used to protect it. This means no separate home for php.ini files, and it's easier for users to see what their PHP configuration is.
The script isn't quite ready for sharing yet, but here's what you can do to get a setup like this:
- on Gentoo, make sure your USE contains: suexec, apache2, cgi, fastcgi, session.
- on Gentoo, "emerge apache php mod_fcgid". On other platforms, consult your docs (or just download mod_fcgid and use apxs to install it; it should be pretty seamless).
- Set up your global configuration. On Gentoo, this is done for you, but make sure this gets loaded into your global apache configuration:
LoadModule fcgid_module modules/mod_fcgid.so
SocketPath /var/run/fcgidsock
SharememPath /var/run/fcgid_shm
<Location /fcgid>
    SetHandler fcgid-script
    Options ExecCGI
    allow from all
</Location>
- Add a user and group for your first virtual host, test.example.com. call em "example" if you like
- Set up the directory tree for the virtual host:
/var/www/test.example.com/
/var/www/test.example.com/tmp
/var/www/test.example.com/htdocs
/var/www/test.example.com/htdocs/cgi-bin
- Make some files:
hello HTML world!
<?
/* /var/www/test.example.com/test.php */
print("hello PHP world!");
?>

#!/bin/sh
# /var/www/test.example.com/htdocs/fcgi
PHPRC=/var/www/test.example.com/htdocs/cgi-bin/
export PHPRC
PHP_FCGI_CHILDREN=2
export PHP_FCGI_CHILDREN
PHP_FCGI_MAX_REQUESTS=25000
export PHP_FCGI_MAX_REQUESTS
exec /usr/bin/php-cgi
- Copy your php.ini to /var/www/test.example.com/htdocs/fcgi and edit it so that directories are right. All you'll likely need to change is upload_tmp_dir and session.save_path, but you may want to change others.
- Set fcgi to be executable, and make sure permissions are set on it so that it is owned by your test user/group and other users can't mess with it. If things don't work later, this is a frequent culprit
- Set the immutable bit on php.ini and fcgi (you'll need to be using extended file system attributes on your filesystem to do this, check your OS documentation for details) by running 'chattr +i /var/www/test.example.com/htdocs/*'. You'll need to undo this with chattr -i if you want to change these files in the future.
- Set up this host's configuration:
<VirtualHost *:80>
    DocumentRoot /var/www/test.example.com/htdocs/
    ServerName test.example.com
    SuexecUserGroup example example
    <Directory /var/www/test.example.com/htdocs/>
        Options +SymLinksIfOwnerMatch
        AllowOverride All
        Order allow,deny
        Allow from all
        DirectoryIndex index.html index.php
        AddType application/x-httpd-fastphp .php
        Action application/x-httpd-fastphp /cgi-bin/fphp
    </Directory>
    <Directory /var/www/test.example.com/htdocs/cgi-bin/>
        SetHandler fcgid-script
        FCGIWrapper /var/www/test.example.com/htdocs/cgi-bin/fphp .php
        Options +ExecCGI -Includes
        allow from all
    </Directory>
</VirtualHost>
<VirtualHost *:80>
    ServerName aerospace.com
    Redirect Permanent / http://test.example.com/
</VirtualHost>
- Give apache a restart and that should be it!
Check out the processes running on your server, and after you hit test.php you should see a php-cgi process running as the example user. If you have problems, error_log and suexec_log in /var/log/apache2/ (or /var/log/httpd/) tend to tell you everything you need to know.
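The directory and wrapper steps above lend themselves to scripting. What follows is not my setup script (that one is bash and still isn't ready), just a minimal Python sketch of the same steps. Assumptions: the wrapper is named fphp to match the FCGIWrapper line, 0750 is used as the mode, and the function name and base argument are made up so it can be tried in a scratch directory. Ownership (chown to the site's user) and chattr +i still need root and are left out.

```python
import stat
import tempfile
from pathlib import Path

# Wrapper contents follow the shell script from the steps above;
# {cgi_bin} is filled in per site.
WRAPPER_TEMPLATE = """#!/bin/sh
PHPRC={cgi_bin}/
export PHPRC
PHP_FCGI_CHILDREN=2
export PHP_FCGI_CHILDREN
PHP_FCGI_MAX_REQUESTS=25000
export PHP_FCGI_MAX_REQUESTS
exec /usr/bin/php-cgi
"""


def create_vhost_tree(base, hostname, wrapper_name="fphp"):
    """Create tmp/ and htdocs/cgi-bin/ for a site and write the wrapper.

    Returns the path of the wrapper script. chown and chattr are
    deliberately not done here because they need root.
    """
    root = Path(base) / hostname
    cgi_bin = root / "htdocs" / "cgi-bin"
    for directory in (root / "tmp", cgi_bin):
        directory.mkdir(parents=True, exist_ok=True)
    wrapper = cgi_bin / wrapper_name
    wrapper.write_text(WRAPPER_TEMPLATE.format(cgi_bin=cgi_bin))
    # Owner may read/write/execute, group may read/execute, others nothing.
    wrapper.chmod(0o750)
    return wrapper


if __name__ == "__main__":
    # Try it in a throwaway directory instead of /var/www.
    with tempfile.TemporaryDirectory() as scratch:
        path = create_vhost_tree(scratch, "test.example.com")
        print(path, oct(stat.S_IMODE(path.stat().st_mode)))
```

For a real host you would pass /var/www as the base and follow up with the chown and chattr steps by hand.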
And oh yeah, want to use APC to speed up your PHP applications significantly under this setup? Just install APC, then add the configuration for it to the bottom of the php.ini for any hosts that you want to enable this on. Given that APC isn't 100% perfect and crashes sometimes, the beauty of the fcgid setup is that a crash will only take out that php-cgi process, and the fcgid manager will just start a new one like nothing happened.
Adobe Fast Web View
Submitted by ckdake on Sun, 2008-03-16 11:40
Adobe Fast Web View is a very lightly documented but seemingly often used feature in Adobe Acrobat Reader. From the user's point of view, it does what it says and makes pages of a PDF show up in the browser before the entire PDF is completely downloaded, but it's a bit more complicated from a server operator's point of view. And it is enabled by default when installing Adobe Acrobat Reader.
We recently moved sugarcrm.com and some other web properties from stand-alone web servers to a clustered solution involving NFS, load balancers, database replication, etc. It was a pretty complex migration and we're pretty sure that we're running a handful of applications on this cluster that nobody has ever clustered before, so needless to say we ran into our share of gotchas. (Other than one web server that seems to be cursed..) One of them was very strange, and involved PDFs: Everything worked fine in all browsers on all platforms until Adobe Acrobat Reader entered the picture. Some number of PDFs would lock up the browser and never load, but only when the PDFs were served from the cluster through the load balancer. When served from one of the cluster web servers but not through the load balancer, everything would work perfectly! Also, with the Adobe plugin disabled, the PDFs would save perfectly and be viewable every time.
Using livehttpheaders it was apparent that 2 HTTP requests were being made, so my guess was that the browser would do a GET for the PDF, but when the Adobe plugin took over, it was sending a new HTTP request (all about HTTP). This shouldn't be an issue, but things weren't working! I installed Wireshark on my Windows test installation and dug deeper. Immediately, I noticed that all the data packets in the response coming from the server were fragmented. This typically means that there is an MTU problem somewhere. However, with some Googling around for PDF files I noticed the same behavior on 50% of the sites I hit, and those PDFs were working fine in the Adobe plugin. Regardless, Igor and I set out to tinker with the MTUs on the web servers and load balancer. Changing the MTU from 1500 down to 1400 did change which PDFs would load in the plugin, but it didn't fix all of them. Strange!
Again looking in the Wireshark traces, we saw what looked like a TCP reset loop (read all the details about TCP here). After the first part of data came through successfully, every packet from the server was a RST and the Adobe plugin just sat there waiting for data that was never going to arrive. We poked around the load balancer looking for anything that could cause this but no luck. Googling around for this PDF problem, the only solutions we found were recommendations to disable "Fast Web View." What's that? This gave us another thing to search for and led us to a server-side solution in this forum topic. For whatever reason, the load balancer was breaking HTTP requests with a "Request-Range" header, and Adobe Acrobat Reader was using this to attempt to make the PDF load faster. In retrospect, this makes sense, but it sure was a time consuming thing to discover! If you run into this, the solution is to add the following to your Apache configuration file (or something equivalent if you use lighttpd or something else, we found examples of this happening with other server software):
LoadModule headers_module modules/mod_headers.so
...
<FilesMatch "\.(mp3|zip|pdf)$">
Header unset Accept-Ranges
RequestHeader unset Range
RequestHeader unset Unless-Modified-Since
RequestHeader unset If-Range
</FilesMatch>
Logging
Submitted by ckdake on Sun, 2008-02-10 21:50
I've been logging my bike rides in the singletracks.com premium ride log. It's neat because I can see how many miles I'm doing and how fast, but there's not a way to plug in GPS data to it yet. I may end up making a module that does this so that other people can do it on FM but that does require work.
The real subject of this post is logging in a cluster environment. Say you have 10 web servers, each with a handful of apache virtual hosts. Some goals come to mind:
- Compliance - do logs need to be stored in such a way that they are "secure" and indelible?
- Troubleshooting - how can technicians analyze logs for some subset of servers/domains quickly and easily?
- Billing - which websites are using the most bandwidth?
In my example, all of these are important. Logs cannot be stored on the web servers in case one of them is compromised or fails (they are redundant after all and failure is not unexpected), problems do require immediate attention and resolution, and noticing bandwidth utilization changes of sites allows for better capacity planning (and I just like stats).
Several tools are available to help with this but not one of them is the magic bullet:
- syslog (specifically syslog-ng) allows collection of logs on a machine in a sensible way, and allows forwarding of them to a remote logging server which can aggregate logs from multiple machines into one place.
- Splunk can pull in logs from syslog, correlate timestamps, and provides an excellent search interface to collected data
- Perl. Is there anything Perl can't do?
In the test setup, initial logging was done over NFS, which turns out to be very bad with this particular NFS setup, which was optimized for read loads (as an NFS server for a web application cluster probably should be). The next step was getting Apache to just log to syslog locally. There's no one-push button for this, but Sending Apache logs to Syslog on the O'Reilly Network helps out greatly (much of below is a direct result of this article). Apache can be configured to use an external script for logging, and a little perl script to pipe it to syslog (which can then pass it along to another server) does the trick. We experimented with using the 'logger' binary to log to the central server directly from apache, but it has a nasty undocumented habit of splitting long log lines into multiple entries in syslog on the central server, which is unacceptable for using tools like Splunk as well as web server log analytics packages. (In addition to this, the caveats at the bottom of the O'Reilly article are very true and may matter significantly depending on your setup.)
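The piping script really is little. The one I'm running is Perl and follows the O'Reilly article; the version below is only a Python sketch of the same idea, with assumptions called out: the "apache" ident, the local1 facility, and the script path in the CustomLog comment are all placeholders, not what is on my servers.

```python
#!/usr/bin/env python3
import sys
import syslog

# Apache side (path and format name are placeholders):
#   CustomLog "|/usr/local/bin/apache_syslog.py" combined


def forward_lines(stream, emit):
    """Pass every non-empty line from stream to emit, one call per line.

    Each access log entry stays a single syslog message; nothing is split.
    Returns the number of lines forwarded.
    """
    count = 0
    for line in stream:
        line = line.rstrip("\n")
        if line:
            emit(line)
            count += 1
    return count


def main():
    # The ident and facility are assumptions; pick whatever your syslog-ng
    # config filters on.
    syslog.openlog("apache", 0, syslog.LOG_LOCAL1)
    forward_lines(sys.stdin, lambda line: syslog.syslog(syslog.LOG_INFO, line))


if __name__ == "__main__":
    main()
```

Apache starts the script once and keeps writing log lines to its stdin, so the loop simply runs until Apache closes the pipe.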
The last piece is taking care of the site separation for analytics. With the setup above, log lines were coming in with no information identifying what site they came from (and the syslog information is prepended to each line, *sigh*). The easy way to deal with this is by adding the virtual host name to the log format Apache is using. This howto explains the process, though they use %{Host}i instead of the perhaps easier to remember %v. With this last change, everything comes into the central server and can easily be split up into files per host with no syslog headers. However, if you're only interested in overall stats per host like bandwidth/hits, webalizer knows what to do without any configuration changes. A "sites" section appears and gives a breakdown of the hits, files, kbs, and visits per site. This is all running on my little cluster now and should make my life a lot easier when the new servers show up.
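Splitting the combined log back up per host can be sketched in a few lines of Python. This assumes the log format starts with %v, that syslog tags the lines with a plain "apache: " (a tag carrying a PID would need a looser match), and that you want the vhost column dropped again so ordinary log analyzers can read the result. The function name is made up for the example.

```python
from collections import defaultdict


def split_by_vhost(lines, tag="apache"):
    """Group central syslog lines by virtual host.

    Strips the syslog header (everything up to "<tag>: ") and the leading
    vhost field, and returns {vhost: [remaining log line, ...]}.
    """
    marker = f" {tag}: "
    per_host = defaultdict(list)
    for line in lines:
        _, found, payload = line.rstrip("\n").partition(marker)
        if not found:
            continue  # not one of the Apache lines
        host, _, rest = payload.partition(" ")
        if host and rest:
            per_host[host.lower()].append(rest)
    return dict(per_host)


if __name__ == "__main__":
    sample = [
        'Feb 10 21:50:01 web1 apache: test.example.com 10.0.0.1 - - '
        '[10/Feb/2008:21:50:01 -0500] "GET / HTTP/1.1" 200 512\n',
    ]
    for host, entries in split_by_vhost(sample).items():
        print(host, len(entries))
```

Writing each list out to its own file gives per-site logs that can be handed to a customer or fed to an analytics package.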

