12.5 Adding a Proxy Server in httpd Accelerator Mode
We have already presented a solution with two servers:
one plain Apache server, which is very light and configured to serve
static objects, and the other with mod_perl enabled (very heavy) and
configured to serve mod_perl scripts and handlers. We named them
httpd_docs and httpd_perl,
respectively.
In the dual-server setup presented earlier, the two servers coexist
at the same IP address by listening to different ports:
httpd_docs listens to port 80 (e.g.,
http://www.example.com/images/test.gif) and
httpd_perl listens to port 8000 (e.g.,
http://www.example.com:8000/perl/test.pl). Note
that we did not write http://www.example.com:80
for the first example, since port 80 is the default port for the HTTP
service. Later on, we will change the configuration of the
httpd_docs server to make it listen to port 81.
This section will attempt to convince you that you should really
deploy a proxy server in httpd accelerator mode.
This is a special mode that, in addition to providing the normal
caching mechanism, accelerates your CGI and mod_perl scripts by
taking the responsibility of pushing the produced content to the
client, thereby freeing your mod_perl processes. Figure 12-3 shows a configuration that uses a proxy
server, a standalone Apache server, and a mod_perl-enabled Apache
server.
The advantages of using the proxy server in conjunction with
mod_perl are:
You get all the benefits of the usual use of a proxy server that
serves static objects from the proxy's cache. You
get less I/O activity reading static objects from the disk (the proxy
serves the most "popular" objects
from RAM—of course you benefit more if you allow the proxy
server to consume more RAM), and since you do not wait for the I/O to
be completed, you can serve static objects much faster.
You get the extra functionality provided by
httpd accelerator mode, which makes the proxy
server act as a sort of output buffer for the dynamic content. The
mod_perl server sends the entire response to the proxy and is then
free to deal with other requests. The proxy server is responsible for
sending the response to the browser. This means that if the transfer
is over a slow link, the mod_perl server is not waiting around for
the data to move.
This technique allows you to hide the details of the
server's implementation. Users will never see ports
in the URLs (more on that topic later). You can have a few boxes
serving the requests and only one serving as a frontend, which
spreads the jobs between the servers in a way that you can control.
You can actually shut down a server without the user even noticing,
because the frontend server will dispatch the jobs to other servers.
This is called load
balancing—it's too big an issue to
cover here, but there is plenty of information available on the
Internet (refer to Section
12.16 at the end of this chapter).
For security reasons, using an httpd accelerator
(or a proxy in httpd accelerator mode) is
essential because it protects your internal server from being
directly attacked by arbitrary packets. The
httpd accelerator and internal server
communicate only expected HTTP requests, and usually only specific
URI namespaces get proxied. For example, you can ensure that only
URIs starting with /perl/ will be proxied to the
backend server. Assuming that there are no vulnerabilities that can
be triggered via some resource under /perl, this
means that only your public
"bastion" accelerating web server
can get hosed in a successful attack—your backend server will
be left intact. Of course, don't consider your web
server to be impenetrable because it's accessible
only through the proxy. Proxying it reduces the number of ways a
cracker can get to your backend server; it doesn't
eliminate them all. Your server will be effectively impenetrable if it listens only on
ports on your localhost (127.0.0.1), which makes
it impossible to connect to your backend machine from the outside.
But you don't need to connect from the outside
anymore, as you will see when you proceed to this
technique's implementation notes.
In addition, if you use some sort of access control, authentication,
and authorization at the frontend server, it's easy
to forget that users can still access the backend server directly,
bypassing the frontend protection. By making the backend server
directly inaccessible you prevent this possibility.
Of course, there are drawbacks. Luckily, these are not functionality
drawbacks—they are more administration hassles. The
disadvantages are:
You have another daemon to worry about, and while proxies are
generally stable, you have to make sure to prepare proper startup and
shutdown scripts, which are run at boot and reboot as appropriate.
This is something that you do once and never come back to again.
Also, you might want to set up the crontab to
run a watchdog script that will make sure that the proxy server is
running and restart it if it detects a problem, reporting the problem
to the administrator on the way. Chapter 5
explains how to develop and run such watchdogs.
Proxy servers can be configured to be light or heavy. The
administrator must decide what gives the highest performance for his
application. A proxy server such as Squid is light in the sense of
having only one process serving all requests, but it can consume a
lot of memory when it loads objects into memory for faster service.
If you use the default logging mechanism for all requests on the
front- and backend servers, the requests that will be proxied to the
backend server will be logged twice, which makes it tricky to merge
the two log files, should you want to. Therefore, if all accesses to
the backend server are done via the frontend server,
it's the best to turn off logging of the backend
server. If the backend server is also accessed directly, bypassing the
frontend server, you want to log only the requests that
don't go through the frontend server. One way to
tell whether a request was proxied or not is to use
mod_proxy_add_forward, presented later in this chapter, which sets
the HTTP header X-Forwarded-For for all proxied
requests. So if the default logging is turned off, you can add a
custom PerlLogHandler that logs only requests made
directly to the backend server.
If you still decide to log proxied requests at the backend server,
they might not contain all the information you need, since instead of
the real remote IP of the user, you will always get the IP of the
frontend server. Again, mod_proxy_add_forward, presented later,
provides a solution to this problem.
Let's look at a real-world scenario that shows
the importance of the
proxy httpd accelerator mode for mod_perl.
First let's explain an abbreviation used in the
networking world. If someone claims to have a 56-kbps connection, it
means that the connection is made at 56 kilobits per second (~56,000
bits/sec). It's not 56 kilobytes per second, but 7
kilobytes per second, because 1 byte equals 8 bits. So
don't let the merchants fool you—your modem
gives you a 7 kilobytes-per-second connection at most, not 56
kilobytes per second, as one might think.
Another convention used in computer literature is that 10 Kb usually
means 10 kilo-bits and 10 KB means 10 kilobytes. An uppercase B
generally refers to bytes, and a lowercase b refers to bits (K of
course means kilo and equals 1,024 or 1,000, depending on the field
in which it's used). Remember that the latter
convention is not followed everywhere, so use this knowledge with
care.
In the typical scenario (as of this writing), users connect to your
site with 56-kbps modems. This means that the speed of the
user's network link is 56/8 = 7 KB per second.
Let's assume an average generated HTML page to be of
42 KB and an average mod_perl script to generate this response in 0.5
seconds. How many responses could this script produce during the time
it took for the output to be delivered to the user? A simple
calculation reveals pretty scary numbers:
(42KB)/(0.5sx7KB/s)
= 12
Twelve other dynamic requests could be served at the same time, if we
could let mod_perl do only what it's best at:
generating responses.
This very simple example shows us that we need only one-twelfth the
number of children running, which means that we will need only
one-twelfth of the memory.
But you know that nowadays scripts often return pages that are blown
up with JavaScript and other code, which can easily make them 100 KB
in size. Can you calculate what the download time for a file that
size would be?
Furthermore, many users like to open multiple browser windows and do
several things at once (e.g., download files and browse graphically
heavy sites). So the speed of 7 KB/sec we assumed before may in
reality be 5-10 times slower. This is not good for your server.
Considering the last example and taking into account all the other
advantages that the proxy server provides, we hope that you are
convinced that despite a small administration overhead, using a proxy
is a good thing.
Of course, if you are on a very fast local area network (LAN) (which
means that all your users are connected from this network and not
from the outside), the big benefit of the proxy buffering the output
and feeding a slow client is gone. You are probably better off
sticking with a straight mod_perl server in this case.
Two proxy implementations are known to be widely used with mod_perl:
the Squid proxy server and the mod_proxy Apache module.
We'll discuss these in the next sections.
|