12.6 The Squid Server and mod_perl

To give you an idea of what Squid is, we will reproduce the following bullets from Squid's home page (http://www.squid-cache.org/):

Squid is...

A full-featured web proxy cache
Designed to run on Unix systems
Free, open source software
The result of many contributions by unpaid volunteers
Funded by the National Science Foundation

Squid supports...

Proxying and caching of HTTP, FTP, and other URLs
Proxying for SSL
Cache hierarchies
ICP, HTCP, CARP, and Cache Digests
Transparent caching
WCCP (Squid v2.3)
Extensive access controls
httpd server acceleration
SNMP
Caching of DNS lookups

12.6.1 Pros and Cons

The advantages of using Squid are:

Caching of static objects. These are served much faster, assuming that your cache size is big enough to keep the most frequently requested objects in the cache.
Buffering of dynamic content. This takes the burden of returning the content generated by mod_perl servers to slow clients, thus freeing mod_perl servers from waiting for the slow clients to download the data. Freed servers immediately switch to serve other requests; thus, your number of required servers goes down dramatically.
Nonlinear URL space/server setup. You can use Squid to play some tricks with the URL space and/or domain-based virtual server support.

The disadvantages are:

Buffering limit. By default, Squid buffers in only 16 KB chunks, so it will not allow mod_perl to complete immediately if the output is larger. (READ_AHEAD_GAP, which is 16 KB by default, can be enlarged in defines.h if your OS allows that.)
Speed. Squid is not very fast when compared with the plain file-based web servers available today. Only if you are using a lot of dynamic features, such as with mod_perl, is there a reason to use Squid, and then only if the application and the server are designed with caching in mind.
Memory usage. Squid uses quite a bit of memory. It can grow three times bigger than the limit provided in the configuration file.
HTTP protocol level. Squid is pretty much an HTTP/1.0 server, which seriously limits the deployment of HTTP/1.1 features, such as KeepAlives.
HTTP headers, dates, and freshness. The Squid server might give out stale pages, confusing downstream/client caches. This might happen when you update some documents on the site—Squid will continue serve the old ones until you explicitly tell it which documents are to be reloaded from disk.
Stability. Compared to plain web servers, Squid is not the most stable.

The pros and cons presented above indicate that you might want to use Squid for its dynamic content-buffering features, but only if your server serves mostly dynamic requests. So in this situation, when performance is the goal, it is better to have a plain Apache server serving static objects and Squid proxying only the mod_perl-enabled server. This means that you will have a triple server setup, with frontend Squid proxying the backend light Apache server and the backend heavy mod_perl server.

12.6.2 Light Apache, mod_perl, and Squid Setup Implementation Details

You will find the installation details for the Squid server on the Squid web site (http://www.squid-cache.org/). In our case it was preinstalled with Mandrake Linux. Once you have Squid installed, you just need to modify the default squid.conf file (which on our system was located at /etc/squid/squid.conf), as we will explain now, and you'll be ready to run it.

Before working on Squid's configuration, let's take a look at what we are already running and what we want from Squid.

Previously we had the httpd_docs and httpd_perl servers listening on ports 80 and 8000, respectively. Now we want Squid to listen on port 80 to forward requests for static objects (plain HTML pages, images, and so on) to the port to which the httpd_docs server listens, and dynamic requests to httpd_perl's port. We also want Squid to collect the generated responses and deliver them to the client. As mentioned before, this is known as httpd accelerator mode in proxy dialect.

We have to reconfigure the httpd_docs server to listen to port 81 instead, since port 80 will be taken by Squid. Remember that in our scenario both copies of Apache will reside on the same machine as Squid. The server configuration is illustrated in Figure 12-4.

Figure 12-4. A Squid proxy server, standalone Apache, and mod_perl-enabled Apache

A proxy server makes all the magic behind it transparent to users. Both Apache servers return the data to Squid (unless it was already cached by Squid). The client never sees the actual ports and never knows that there might be more than one server running. Do not confuse this scenario with mod_rewrite, where a server redirects the request somewhere according to the rewrite rules and forgets all about it (i.e., works as a one-way dispatcher, responsible for dispatching the jobs but not for collecting the results).

Squid can be used as a straightforward proxy server. ISPs and big companies generally use it to cut down the incoming traffic by caching the most popular requests. However, we want to run it in httpd accelerator mode. Two configuration directives, httpd_accel_host and httpd_accel_port, enable this mode. We will see more details shortly.

If you are currently using Squid in the regular proxy mode, you can extend its functionality by running both modes concurrently. To accomplish this, you can extend the existing Squid configuration with httpd accelerator mode's related directives or you can just create a new configuration from scratch.

Let's go through the changes we should make to the default configuration file. Since the file with default settings (/etc/squid/squid.conf) is huge (about 60 KB) and we will not alter 95% of its default settings, our suggestion is to write a new configuration file that includes the modified directives.^[1]

^[1] The configuration directives we use are correct for Squid Cache Version 2.4STABLE1. It's possible that the configuration directives might change in new versions of Squid.

First we want to enable the redirect feature, so we can serve requests using more than one server (in our case we have two: the httpd_docs and httpd_perl servers). So we specify httpd_accel_host as virtual. (This assumes that your server has multiple interfaces—Squid will bind to all of them.)

httpd_accel_host virtual

Then we define the default port to which the requests will be sent, unless they're redirected. We assume that most requests will be for static documents (also, it's easier to define redirect rules for the mod_perl server because of the URI that starts with /perl or similar). We have our httpd_docs listening on port 81:

httpd_accel_port 81

And Squid listens to port 80:

http_port 80

We do not use icp (icp is used for cache sharing between neighboring machines, which is more relevant in the proxy mode):

icp_port 0

hierarchy_stoplist defines a list of words that, if found in a URL, cause the object to be handled directly by the cache. Since we told Squid in the previous directive that we aren't going to share the cache between neighboring machines, this directive is irrelevant. In case you do use this feature, make sure to set this directive to something like:

hierarchy_stoplist /cgi-bin /perl

where /cgi-bin and /perl are aliases for the locations that handle the dynamic requests.

Now we tell Squid not to cache dynamically generated pages:

acl QUERY urlpath_regex /cgi-bin /perl
no_cache deny QUERY

Please note that the last two directives are controversial ones. If you want your scripts to be more compliant with the HTTP standards, according to the HTTP specification, the headers of your scripts should carry the caching directives: Last-Modified and Expires.

What are they for? If you set the headers correctly, there is no need to tell the Squid accelerator not to try to cache anything. Squid will not bother your mod_perl servers a second time if a request is (a) cacheable and (b) still in the cache. Many mod_perl applications will produce identical results on identical requests if not much time has elapsed between the requests. So your Squid proxy might have a hit ratio of 50%, which means that the mod_perl servers will have only half as much work to do as they did before you installed Squid (or mod_proxy).

But this is possible only if you set the headers correctly. Refer to Chapter 16 to learn more about generating the proper caching headers under mod_perl. In the case where only the scripts under /perl/caching-unfriendly are not caching-friendly, fix the above setting to be:

acl QUERY urlpath_regex /cgi-bin /perl/caching-unfriendly
no_cache deny QUERY

If you are lazy, or just have too many things to deal with, you can leave the above directives the way we described. Just keep in mind that one day you will want to reread this section to squeeze even more power from your servers without investing money in more memory and better hardware.

While testing, you might want to enable the debugging options and watch the log files in the directory /var/log/squid/. But make sure to turn debugging off in your production server. Below we show it commented out, which makes it disabled, since it's disabled by default. Debug option 28 enables the debugging of the access-control routes; for other debug codes, see the documentation embedded in the default configuration file that comes with Squid.

# debug_options 28

We need to provide a way for Squid to dispatch requests to the correct servers. Static object requests should be redirected to httpd_docs unless they are already cached, while requests for dynamic documents should go to the httpd_perl server. The configuration:

redirect_program /usr/lib/squid/redirect.pl
redirect_children 10
redirect_rewrites_host_header off

tells Squid to fire off 10 redirect daemons at the specified path of the redirect daemon and (as suggested by Squid's documentation) disables rewriting of any Host: headers in redirected requests. The redirection daemon script is shown later, in Example 12-1.

The maximum allowed request size is in kilobytes, which is mainly useful during PUT and POST requests. A user who attempts to send a request with a body larger than this limit receives an "Invalid Request" error message. If you set this parameter to 0, there will be no limit imposed. If you are using POST to upload files, then set this to the largest file's size plus a few extra kilobytes:

request_body_max_size 1000 KB

Then we have access permissions, which we will not explain here. You might want to read the documentation, so as to avoid any security problems.

acl all src 0.0.0.0/0.0.0.0
acl manager proto cache_object
acl localhost src 127.0.0.1/255.255.255.255
acl myserver src 127.0.0.1/255.255.255.255
acl SSL_ports port 443 563
acl Safe_ports port 80 81 8080 81 443 563
acl CONNECT method CONNECT

http_access allow manager localhost
http_access allow manager myserver
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
# http_access allow all

Since Squid should be run as a non-root user, you need these settings:

cache_effective_user squid
cache_effective_group squid

if you are invoking Squid as root. The user squid is usually created when the Squid server is installed.

Now configure a memory size to be used for caching:

cache_mem 20 MB

The Squid documentation warns that the actual size of Squid can grow to be three times larger than the value you set.

You should also keep pools of allocated (but unused) memory available for future use:

memory_pools on

(if you have the memory available, of course—otherwise, turn it off).

Now tighten the runtime permissions of the cache manager CGI script (cachemgr.cgi, which comes bundled with Squid) on your production server:

cachemgr_passwd disable shutdown

If you are not using this script to manage the Squid server remotely, you should disable it:

cachemgr_passwd disable all

Put the redirection daemon script at the location you specified in the redirect_program parameter in the configuration file, and make it executable by the web server (see Example 12-1).

Example 12-1. redirect.pl

#!/usr/bin/perl -p
BEGIN { $|=1 }
s|www.example.com(?::81)?/perl/|www.example.com:8000/perl/|;

The regular expression in this script matches all the URIs that include either the string "www.example.com/perl/" or the string "www.example.com:81/perl/" and replaces either of these strings with "www.example.com:8080/perl". No matter whether the regular expression worked or not, the $_ variable is automatically printed, thanks to the -p switch.

You must disable buffering in the redirector script. $|=1; does the job. If you do not disable buffering, STDOUT will be flushed only when its buffer becomes full—and its default size is about 4,096 characters. So if you have an average URL of 70 characters, only after about 59 (4,096/70) requests will the buffer be flushed and will the requests finally reach the server. Your users will not wait that long (unless you have hundreds of requests per second, in which case the buffer will be flushed very frequently because it'll get full very fast).

If you think that this is a very ineffective way to redirect, you should consider the following explanation. The redirector runs as a daemon; it fires up N redirect daemons, so there is no problem with Perl interpreter loading. As with mod_perl, the Perl interpreter is always present in memory and the code has already been compiled, so the redirect is very fast (not much slower than if the redirector was written in C). Squid keeps an open pipe to each redirect daemon; thus, the system calls have no overhead.

Now it is time to restart the server:

/etc/rc.d/init.d/squid restart

Now the Squid server setup is complete.

If on your setup you discover that port 81 is showing up in the URLs of the static objects, the solution is to make both the Squid and httpd_docs servers listen to the same port. This can be accomplished by binding each one to a specific interface (so they are listening to different sockets). Modify httpd_docs/conf/httpd.conf as follows:

Port 80
BindAddress 127.0.0.1
Listen 127.0.0.1:80

Now the httpd_docs server is listening only to requests coming from the local server. You cannot access it directly from the outside. Squid becomes a gateway that all the packets go through on the way to the httpd_docs server.

Modify squid.conf as follows:

http_port example.com:80
tcp_outgoing_address 127.0.0.1
httpd_accel_host 127.0.0.1
httpd_accel_port 80

It's important that http_port specifies the external hostname, which doesn't map to 127.0.0.1, because otherwise the httpd_docs and Squid server cannot listen to the same port on the same address.

Now restart the Squid and httpd_docs servers (it doesn't matter which one you start first), and voilà—the port number is gone.

You must also have the following entry in the file /etc/hosts (chances are that it's already there):

127.0.0.1 localhost.localdomain localhost

Now if your scripts are generating HTML including fully qualified self references, using 8000 or the other port, you should fix them to generate links to point to port 80 (which means not using the port at all in the URI). If you do not do this, users will bypass Squid and will make direct requests to the mod_perl server's port. As we will see later, just like with httpd_docs, the httpd_perl server can be configured to listen only to requests coming from localhost (with Squid forwarding these requests from the outside). Then users will not be able to bypass Squid.

The whole modified squid.conf file is shown in Example 12-2.

Example 12-2. squid.conf

http_port example.com:80
tcp_outgoing_address 127.0.0.1
httpd_accel_host 127.0.0.1
httpd_accel_port 80

icp_port 0

acl QUERY urlpath_regex /cgi-bin /perl
no_cache deny QUERY

# debug_options 28

redirect_program /usr/lib/squid/redirect.pl
redirect_children 10
redirect_rewrites_host_header off

request_body_max_size 1000 KB

acl all src 0.0.0.0/0.0.0.0
acl manager proto cache_object
acl localhost src 127.0.0.1/255.255.255.255
acl myserver src 127.0.0.1/255.255.255.255
acl SSL_ports port 443 563
acl Safe_ports port 80 81 8080 8081 443 563
acl CONNECT method CONNECT

http_access allow manager localhost
http_access allow manager myserver
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
# http_access allow all

cache_effective_user squid
cache_effective_group squid

cache_mem 20 MB

memory_pools on

cachemgr_passwd disable shutdown

12.6.3 mod_perl and Squid Setup Implementation Details

When one of the authors was first told about Squid, he thought: "Hey, now I can drop the httpd_docs server and have just Squid and the httpd_perl servers. Since all static objects will be cached by Squid, there is no more need for the light httpd_docs server."

But he was a wrong. Why? Because there is still the overhead of loading the objects into the Squid cache the first time. If a site has many static objects, unless a huge chunk of memory is devoted to Squid, they won't all be cached, and the heavy mod_perl server will still have the task of serving these objects.

How do we measure the overhead? The difference between the two servers is in memory consumption; everything else (e.g., I/O) should be equal. So you have to estimate the time needed to fetch each static object for the first time at a peak period, and thus the number of additional servers you need for serving the static objects. This will allow you to calculate the additional memory requirements. This amount can be significant in some installations.

So on our production servers we have decided to stick with the Squid, httpd_docs, and httpd_perl scenario, where we can optimize and fine-tune everything. But if in your case there are almost no static objects to serve, the httpd_docs server is definitely redundant; all you need are the mod_perl server and Squid to buffer the output from it.

If you want to proceed with this setup, install mod_perl-enabled Apache and Squid. Then use a configuration similar to that in the previous section, but without httpd_docs (see Figure 12-5). Also, you do not need the redirector any more, and you should specify httpd_accel_host as a name of the server instead of virtual. Because you do not redirect, there is no need to bind two servers on the same port, so you also don't need the Bind or Listen directives in httpd.conf.

Figure 12-5. A Squid proxy server and mod_perl-enabled Apache

The modified configuration for this simplified setup is given in Example 12-3 (see the explanations in the previous section).

Example 12-3. squid2.conf

httpd_accel_host example.com
httpd_accel_port 8000
http_port 80
icp_port 0

acl QUERY urlpath_regex /cgi-bin /perl
no_cache deny QUERY

# debug_options 28

# redirect_program /usr/lib/squid/redirect.pl
# redirect_children 10
# redirect_rewrites_host_header off

request_body_max_size 1000 KB

acl all src 0.0.0.0/0.0.0.0
acl manager proto cache_object
acl localhost src 127.0.0.1/255.255.255.255
acl myserver src 127.0.0.1/255.255.255.255
acl SSL_ports port 443 563
acl Safe_ports port 80 81 8080 8081 443 563
acl CONNECT method CONNECT

http_access allow manager localhost
http_access allow manager myserver
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
# http_access allow all

cache_effective_user squid
cache_effective_group squid

cache_mem 20 MB

memory_pools on

cachemgr_passwd disable shutdown

[ Team LiB ]