Chapter 4. URL Mapping and Dynamic Content

URL Mapping

This chapter explains how to configure Apache to map requests to files and directories or redirect them to specific pages or servers. This knowledge comes in handy to solve common problems such as maintaining working URLs when the site structure changes, dealing with case-sensitive websites, supporting multiple languages, and so on. It also explains how to use the CGI and server side include functionality present in Apache to provide dynamically generated content.

Mapping URLs to Files with Alias

Alias /icons/ /usr/local/apache2/icons/

The structure of your website does not need to match the layout of your files on disk. You can use the Alias directive to map directories on disk to specific URLs. For example, this directive will cause a request for http://www.example.com/icons/image.gif to make Apache look for the file in /usr/local/apache2/icons/image.gif instead of under the default document root, in /usr/local/apache2/htdocs/icons/image.gif.

The trailing slashes in the Alias directive are significant. If you include them, the client request must include the slash as well or the Alias directive won’t take effect. For example, if you use the following directive

Alias /icons/ /usr/local/apache2/icons/

and request http://www.example.com/icons, the server will return a 404 Document Not Found error response.
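If you want both forms of the URL to work, one option is to leave the trailing slash off both arguments. The following is a minimal sketch that reuses the icons directory from the example above. The <Directory> block uses the Apache 2.0 access control syntax and is only an assumption about your setup; it is needed only if access to that directory is not already granted elsewhere in your configuration.

```apache
# Without trailing slashes, /icons, /icons/ and /icons/image.gif all match,
# while an unrelated path such as /iconset does not
Alias /icons /usr/local/apache2/icons

# Assumed access rules for the aliased directory (Apache 2.0 syntax)
<Directory /usr/local/apache2/icons>
    Order allow,deny
    Allow from all
</Directory>
```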

Mapping URL Patterns to Files with AliasMatch

AliasMatch ^/(docs|help) /usr/local/apache/htdocs/manual

The AliasMatch directive provides a similar behavior to Alias, but enables you to specify a regular expression for the URL. The matches can be substituted in the destination path. For example, this directive will map any URL under /help or /docs to filesystem paths under the manual directory. Regular expressions are strings that describe or match a set of strings, according to certain syntax rules. You can learn more about regular expressions at http://en.wikipedia.org/wiki/Regular_expression.
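To illustrate how matches are substituted in the destination path, here is a hedged sketch that captures the rest of the request path in a second group and appends it to the target as $2. The second group and the $2 reference are additions for illustration and are not part of the example above.

```apache
# Group 1 matches "docs" or "help"; group 2 captures whatever follows.
# /docs/install.html and /help/install.html would both map to
# /usr/local/apache/htdocs/manual/install.html
AliasMatch ^/(docs|help)(.*) /usr/local/apache/htdocs/manual$2
```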

Redirecting a Page to Another Location

Redirect /news http://news.example.com
Redirect /latest /3.0

The structure of a typical website changes over time, and you can’t control how other sites link to you, such as search engines with stale links. To avoid errors when people access your website through old links, you can configure Apache with the Redirect directive to redirect those requests to the correct resource, whether it is in the current server or a different one. Although the Redirect directive can take optional arguments indicating the type of redirect (such as temporary or permanent), the most commonly used syntax is to provide an origin URL and a destination URL. The destination URL can be in the same web server or can point to a different web server altogether. In this example, a request for http://www.example.com/news/today/index.html will be redirected to http://news.example.com/today/index.html.
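The optional status argument mentioned above goes before the origin URL. The following sketch shows three of the keywords accepted by Redirect; the /beta path and the destination for /latest are hypothetical.

```apache
# 301: the resource has moved for good, so clients may update their links
Redirect permanent /news http://news.example.com

# 302: temporary redirect, which is also the default when no status is given
Redirect temp /latest http://www.example.com/3.0

# 410: the resource has been removed; no destination URL is given
Redirect gone /beta
```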

Redirecting to the Latest Version of a File

RedirectMatch myapp-(1|2)\.([0-9])(\.[0-9])?-(.*) /myapp-3.0-$4

The RedirectMatch directive is similar to Redirect, but allows the origin URL path to be a regular expression. This allows a great amount of flexibility. For example, imagine you are a software company distributing downloads from your website and release new versions of a particular product over time. You may find that a certain percentage of your users are still downloading older versions of your software through third-party websites that have not yet updated their links. Using RedirectMatch, users who request old versions of the file can be easily redirected to the latest version. For example, suppose the name of the latest version of your downloadable file is myapp-3.0. This example will redirect requests for http://www.example.com/myapp-2.5.1-demo.tgz to http://www.example.com/myapp-3.0-demo.tgz and requests for http://www.example.com/myapp-1.2-manual.pdf to http://www.example.com/myapp-3.0-manual.pdf.

The first three groups of the regular expression match a major number, a minor number, and an optional patch number. Those will be replaced by 3.0. The remaining part of the filename is captured in the fourth and final group and substituted into the destination URL as $4.
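A somewhat stricter variant of the same idea is sketched below. It anchors the pattern at both ends of the URL path, escapes the literal dots, marks the redirect as permanent, and uses an absolute destination URL. The host name is taken from the example URLs above, and the choice of a permanent redirect is an assumption.

```apache
# $4 is the fourth group, for example "demo.tgz" or "manual.pdf"
RedirectMatch permanent ^/myapp-(1|2)\.([0-9])(\.[0-9])?-(.*)$ http://www.example.com/myapp-3.0-$4
```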

Redirecting Failed or Unauthorized Requests

ErrorDocument 404 /search.html

If you maintain a popular or complex website, no matter how careful you are, you will receive a number of requests for invalid URLs or documents that no longer exist. Though many of them can be addressed with proper use of Redirect directives, there will always be a number of requests that end up with the dreaded 404 Document Not Found response. For that reason, it may be desirable to replace the default Apache error page and direct your users to a special location in your website, such as a search page or site map that can help your visitors find the resource they were looking for, as shown in the example. On a related note, Chapter 6 provides additional information on customizing access denied pages.
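ErrorDocument accepts a local URL path, a text message, or an external URL. The following sketch shows the three forms side by side; the message text and the denied.html page are hypothetical.

```apache
# Local URL path, handled inside the server
ErrorDocument 404 /search.html

# Plain text message (Apache 2.0 syntax: a quoted string)
ErrorDocument 500 "Sorry, the server encountered a problem."

# External URL: the client is redirected and does not see the original status code
ErrorDocument 403 http://www.example.com/denied.html
```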

Defining Content Handlers

AddHandler cgi-script .pl .cgi
<Location "/cgi-bin/*.pl">
Options +ExecCGI
SetHandler cgi-script
</Location>

Handlers are the way Apache determines which actions to perform on the requested content. Modules provide handlers, and you configure Apache to associate certain content with specific handlers. This functionality is commonly used with content-generation modules such as PHP and mod_cgi. The example shows how to associate the cgi-script handler with the files you want to run as CGIs.

The AddHandler directive associates a certain handler with filename extensions. RemoveHandler can be used to remove previous associations. In the example, AddHandler tells Apache to treat all documents with cgi or pl extensions as CGI scripts.

The SetHandler directive enables you to associate a handler with all files in a particular directory or location. The Action directive, described later in this chapter, enables you to associate a particular MIME type or handler with a CGI script.
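As a sketch of how AddHandler and RemoveHandler can work together, the following fragment enables CGI handling by extension for the whole server and then removes the .pl association for one directory, so that Perl files there are served as ordinary files. The downloads directory is a hypothetical path.

```apache
# Server-wide: treat .cgi and .pl files as CGI scripts
AddHandler cgi-script .cgi .pl

# Hypothetical directory in which .pl files should not be executed
<Directory /usr/local/apache2/htdocs/downloads>
    RemoveHandler .pl
</Directory>
```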

Understanding MIME Types

MIME (Multipurpose Internet Mail Extensions) is a set of standards that defines, among other things, a way to indicate the content type of a document. Examples of MIME types are text/html and audio/mpeg. The first component of the MIME type is the main category of the content (text, audio, image, video) and the second component is the specific type.

Apache uses MIME types to determine which modules or filters will process certain content, and to add HTTP headers to the response to identify its content type. These headers will be used by the client application to identify and correctly display the contents to the end user.

Configuring MIME Types

AddType text/xml .xml .schema
<Location /xml-schemas/>
ForceType text/xml
</Location>

As with content handlers, you can associate MIME types with specific file extensions or URLs. This example shows how to associate the text/xml MIME type with files ending in .xml and .schema and with all the content under the /xml-schemas/ URL. By default, Apache bundles a mime.types file that includes the most common MIME types and their associated extensions.
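The bundled mime.types file is loaded with the TypesConfig directive, and AddType lets you add or override individual mappings without editing that file. A minimal sketch follows, assuming the default file location relative to the server root; the SVG mapping is only an illustration.

```apache
# Load the bundled extension-to-type table (path relative to ServerRoot)
TypesConfig conf/mime.types

# Add or override individual mappings
AddType image/svg+xml .svg
AddType application/x-gzip .gz .tgz
```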

Basics of Running CGI Scripts

CGI stands for Common Gateway Interface. It is a standard protocol used by web servers to communicate with external programs. The web server provides all the necessary information about the request to an external program, which processes it and returns a response. The response is then transmitted back to the client. CGIs were the original mechanism to generate unique content for every request on-the-fly (“dynamic content”) and are supported by nearly every web server. Apache provides support for CGIs using the mod_cgi Apache module (mod_cgid when running a threaded Apache server).

Poorly written or sample CGI programs can be a security risk, so if you are not using this functionality, you may want to disable it altogether, as described in Chapter 6.

Marking Resources As Executable CGIs

ScriptAlias /cgi-bin/ /usr/local/apache2/cgi-bin/

This section shows a number of ways to tell Apache that the target file for a particular request is a CGI script. This is necessary so Apache will not serve the contents of the file directly to the client, but rather return the results of executing it.

The ScriptAlias directive is similar to the Alias directive described earlier in this chapter, but with the difference that Apache will treat every file in the target directory as a CGI script. Alternatively, you can use <Files>, <Location>, or <Directory> sections in combination with the SetHandler directive to tell Apache that the contents of these sections are CGI programs. In this case, you will also need to provide an Options +ExecCGI directive to tell Apache that CGI execution is allowed. The following example tells Apache to treat all URLs under /cgi-bin/ that end with a .pl file extension as CGI scripts.

<Location "/cgi-bin/*.pl">
Options +ExecCGI
SetHandler cgi-script
</Location>
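The same approach works with a <Directory> section and an extension-based handler. The following sketch allows CGI execution in a single directory outside cgi-bin; the tools directory is a hypothetical path.

```apache
# Only files ending in .cgi inside this directory are executed;
# everything else there is served as static content
<Directory /usr/local/apache2/htdocs/tools>
    Options +ExecCGI
    AddHandler cgi-script .cgi
</Directory>
```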

Associating Scripts with HTTP Methods and MIME Types

# Processing all GIF images through a CGI script
# before serving them
Action image/gif /cgi-bin/filter.cgi
# Associating specific HTTP methods with a CGI
# script
Script PUT /cgi-bin/upload.cgi

In addition to the directives mentioned in the previous section, Apache provides directives that simplify associating specific MIME types, file extensions, or even specific HTTP methods with a particular CGI. The mod_actions module, included in the base distribution and compiled by default, provides the Action and Script directives, shown in this example:

  • The Action directive accepts two arguments. The first argument is a handler or a MIME content type; the second points to the CGI program to handle the request.

  • The Script directive associates certain HTTP request methods with a CGI program.

The information about the original requested document is passed to the CGI via the PATH_INFO (document URL) and PATH_TRANSLATED (document path) environment variables.

As with the example from the previous section, the directory containing the destination CGI must be marked as allowing CGI execution with either a ScriptAlias directive or the ExecCGI parameter to the Options directive.
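Because the first argument of Action can be a handler name as well as a MIME type, you can define your own handler for a file extension and route it to a CGI. The sketch below assumes a hypothetical report-file handler name, a .report extension, and a render-report.cgi script.

```apache
# Give .report files a custom handler name...
AddHandler report-file .report

# ...and have that handler run a CGI, which receives the requested
# document through PATH_INFO and PATH_TRANSLATED
Action report-file /cgi-bin/render-report.cgi
```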

Troubleshooting the Execution of CGI Scripts

ScriptLog logs/cgi_log

In addition to the modules and techniques explained in Chapter 2 and Chapter 3, the mod_cgi module provides the ScriptLog directive to aid in the debugging of CGI scripts. If enabled, it will store information for each failed CGI execution, including HTTP headers, POST variables, and so on. This file can grow quickly, so you can cap its size with the ScriptLogLength directive and limit how much of each PUT or POST body is recorded with the ScriptLogBuffer directive.
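A sketch of the three directives used together follows. The numeric limits are arbitrary values chosen for illustration, not recommendations.

```apache
# Record details of failed CGI executions (path relative to ServerRoot)
ScriptLog logs/cgi_log

# Stop writing once the log file reaches about 1 MB
ScriptLogLength 1048576

# Record at most 512 bytes of each PUT or POST request body
ScriptLogBuffer 512
```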

Improving CGI Script Performance

One of the main drawbacks of CGI development is the performance cost of starting and stopping a program for every request.

mod_perl and FastCGI provide two solutions for this problem. Both require careful examination of existing code because you can no longer assume in your CGIs that all resources will be automatically freed by the operating system after the request is served.

mod_perl is a module available for Apache 1.3 and 2.0 that embeds a Perl interpreter inside the Apache web server. In addition to a powerful API to Apache internals, mod_perl includes a CGI compatibility mode that provides an environment that allows existing Perl CGIs to run with little or no modification. Since the scripts are run inside a persistent, in-process interpreter, there is no startup penalty.

FastCGI is a standard that allows the same instance of a CGI program to answer several requests over time. You can read the specs and download modules for Apache 1.3 and 2.x from http://www.fastcgi.com. FastCGI has regained some popularity through its use in web development frameworks such as Ruby on Rails.
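As an illustration of the mod_perl CGI compatibility mode, the following sketch runs existing Perl CGIs under ModPerl::Registry. It assumes mod_perl 2 is installed as modules/mod_perl.so; the /perl/ URL and the directory it points to are hypothetical.

```apache
LoadModule perl_module modules/mod_perl.so

# Hypothetical location for CGI-style Perl scripts
Alias /perl/ /usr/local/apache2/perl/

<Location /perl/>
    # Hand requests to the embedded Perl interpreter
    SetHandler perl-script
    # Run each script inside the persistent interpreter, CGI style
    PerlResponseHandler ModPerl::Registry
    # Let scripts print their own HTTP headers, as CGIs do
    PerlOptions +ParseHeaders
    Options +ExecCGI
</Location>
```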

Understanding Server Side Includes

Document on disk
This document, <!--#echo var="DOCUMENT_NAME" -->,
was last modified <!--#echo var="LAST_MODIFIED" -->

Content received by the browser
This document, sample.shtml,
was last modified Wednesday, 14-Sep-2005 12:03:20 PST

SSI is a simple, “old school” web technology and a predecessor to other HTML embedded languages such as PHP. SSI provides a simple and effective mechanism for adding simple pieces of dynamic content with very little overhead; for example, a common footer for each page that includes the date and time the page was served. As another example, the Apache 2.0 distribution uses SSI to provide a custom look and feel for error messages. It works by embedding special processing instructions inside web pages and evaluating them before the content is returned to the client. You can learn more about Apache SSI support at http://httpd.apache.org/docs/2.0/howto/ssi.html.

Configuring Server Side Includes

AddType text/html .shtml
AddHandler server-parsed .shtml

Server side includes functionality is provided by the mod_include module, distributed with Apache. The simplest way to configure it is to associate an extension with the server-parsed content handler, as shown in the example.
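In Apache 2.0, SSI processing is implemented as an output filter, and it must also be allowed through the Includes option. The following sketch shows that form of the configuration; the directory is the default document root used elsewhere in this chapter and may differ on your system.

```apache
# Allow SSI processing for documents under the document root
<Directory /usr/local/apache2/htdocs>
    Options +Includes
</Directory>

# Serve .shtml files as HTML and pass them through the INCLUDES filter
AddType text/html .shtml
AddOutputFilter INCLUDES .shtml
```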

Setting Environment Variables

SetEnv foo bar
UnsetEnv foo
PassEnv foo

Environment variables are variables that can be shared between modules and that are also available to external processes such as CGIs and server side include (SSI) documents. Environment variables also can be used for intermodule communication and to flag certain requests for special processing.

You can set environment variables with the SetEnv directive. This variable will be available to CGI scripts and SSI pages, and can be logged or added to a header. For example

SetEnv foo bar

will create the environment variable foo and assign it the value bar.

Conversely, you can remove specific variables using the UnsetEnv directive.

Finally, the PassEnv directive enables you to expose variables from the server process environment. For example

PassEnv LD_LIBRARY_PATH

will make the environment variable LD_LIBRARY_PATH available to CGI scripts and SSI pages. This variable contains the path to loadable dynamic libraries in some Unix systems, such as Linux. You can get a listing of standard environment variables in the appendix.
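To show how a variable can be logged or added to a header, here is a minimal sketch using mod_log_config and mod_headers. The with_foo format nickname and the X-Foo header name are hypothetical.

```apache
SetEnv foo bar

# %{foo}e expands to the value of the environment variable foo
LogFormat "%h %l %u %t \"%r\" %>s %b %{foo}e" with_foo
CustomLog logs/access_log with_foo

# Requires mod_headers: send the value back as a response header
Header set X-Foo "%{foo}e"
```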

Setting Environment Variables Dynamically

SetEnvIf User-Agent MSIE iexplorer
SetEnvIf User-Agent MSIE iexplorer=foo
SetEnvIf User-Agent MSIE !javascript

The SetEnvIf directive enables you to set environment variables based on request information, such as the username, the file being requested, or a specific HTTP header value.

This directive takes a request parameter, a regular expression, and a set of variables that will be modified if the parameter matches the expression. This example matches Microsoft Internet Explorer browsers and shows how you can just set a variable, assign it an arbitrary value foo, or even remove a variable by prefixing its name with an exclamation mark.

Later, you can check the existence and value of this variable to perform a variety of actions such as logging a specific request or serving different content based on the type of browser. For example, you could provide simplified HTML pages for text browsers such as Lynx, or for PDA and cell phone browsers.

In fact, checking for the client user agent is so common that mod_setenvif provides the BrowserMatch directive, allowing you to simply write

BrowserMatch MSIE iexplorer=1

Note

Both SetEnvIf and BrowserMatch have case-insensitive versions, SetEnvIfNoCase and BrowserMatchNoCase, that can be used to simplify the regular expressions in certain situations.
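Logging is one place where such variables are commonly checked. The following sketch flags requests for images and leaves them out of the access log. The image_request variable name is hypothetical, and the example assumes a log format nicknamed common is already defined, as it is in the default configuration.

```apache
# Case-insensitive match on the requested URL path
SetEnvIfNoCase Request_URI "\.(gif|jpg|png)$" image_request

# Log every request except those flagged above
CustomLog logs/access_log common env=!image_request
```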

Special Environment Variables

BrowserMatch "Mozilla/2" nokeepalive

Apache provides a set of special environment variables. If one of those variables is set, Apache will modify its behavior. They are commonly used to work around buggy clients. For example, the nokeepalive variable disables keepalive support in Apache. This reduces performance on the server, since multiple requests cannot be transmitted over the same connection. Hence, it should only be set when the request is made by a client that does not correctly support this functionality, typically using a BrowserMatch or SetEnvIf directive, as shown in the example.

In the appendix you can find a list of all the special environment variables. Chapters 7 and 8 include examples of special variables used to work around issues with SSL and DAV implementations.
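For reference, the default Apache 2.0 configuration file ships with a set of workarounds along these lines, combining several special variables in a single directive where needed:

```apache
# Disable keepalive for clients with broken implementations
BrowserMatch "Mozilla/2" nokeepalive
BrowserMatch "MSIE 4\.0b2;" nokeepalive downgrade-1.0 force-response-1.0

# Force an HTTP/1.0 response for clients that mishandle HTTP/1.1 replies
BrowserMatch "RealPlayer 4\.0" force-response-1.0
BrowserMatch "Java/1\.0" force-response-1.0
BrowserMatch "JDK/1\.0" force-response-1.0
```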

Understanding Content Negotiation

AddCharset UTF-8 .utf8
AddLanguage en .en
AddEncoding gzip .gzip .gz

The HTTP protocol provides mechanisms that enable you to maintain different versions of a certain resource and return the appropriate content based on the capabilities and preferences of the client. For example, a client may inform you that it is able to accept content that is compressed and that, while its preferred language is English, it will also understand pages written in Spanish. The three main aspects that are negotiated are

  • Encoding: This is the format in which a resource is stored or represented, and can usually be determined from the file extension. For example, the file listing.txt.gz has a MIME type of text/plain and a gzip encoding. The encoding of the resource will be appended to the Content-Encoding: header of the response.

  • Character Set: This property describes the particular character set used by a document. The character set of the resource will be appended to the Content-Type: header of the response, together with the MIME type.

  • Language: You can provide different versions of the same resource. For example, the Apache documentation provides index.html.en, index.html.es, index.html.de, and so on. The language of the resource will be appended to the Content-Language: header of the response.

The example explains how you can associate charsets, languages, and encodings with particular file extensions.
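A slightly fuller sketch of such mappings follows, covering the languages used in the index.html.en, index.html.es, and index.html.de example above. The .latin1 extension is a hypothetical choice.

```apache
# Language variants: index.html.en, index.html.es, index.html.de
AddLanguage en .en
AddLanguage es .es
AddLanguage de .de

# Character sets, selected by an additional extension
AddCharset UTF-8 .utf8
AddCharset ISO-8859-1 .latin1

# Compressed variants such as listing.txt.gz
AddEncoding gzip .gzip .gz
```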

Configuring Content Negotiation

Options +Multiviews
AddHandler type-map .var

There are two primary ways of configuring content negotiation in Apache: multiviews and type maps.

Multiviews can be enabled by adding an Options +Multiviews directive to your configuration. This method is not recommended (except for simple websites) because it is not very efficient: For every request, it scans the directory containing the file, looking for similar documents with additional extensions. It will then construct a list of such files and use the extensions to determine content encoding and character set, and return the appropriate content.

It is recommended that you use type maps instead, because they save filesystem lookups. These are special files that map filenames to information (metadata) about them. You can configure a type map for a certain resource by creating a file with the same name and the .var extension, and adding an AddHandler directive, as shown in the sample configuration.

The file can contain several entries. Each entry starts with a URI: line that names the document, followed by several attributes such as Content-Type:, Content-Language:, and Content-Encoding:. The following listing shows a sample type map file.

Example 4.1. Contents of Type Map File

URI: page.html.en
Content-type: text/html
Content-language: en

URI: page.html.fr
Content-type: text/html; charset=iso-8859-2
Content-language: fr

Tip

Bear in mind that using any kind of content-negotiation has an adverse impact on the performance of the web server, as it requires additional filesystem accesses.
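One common place for a type map is the directory index. The following sketch, modeled on the default Apache 2.0 configuration, registers the type-map handler and lets a type map named index.html.var act as the index when no plain index.html exists.

```apache
# Treat .var files as type maps
AddHandler type-map .var

# Try a plain index.html first, then the negotiated index.html.var
DirectoryIndex index.html index.html.var
```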

Assigning Default Charsets and Language Priorities

DefaultLanguage en
AddDefaultCharset iso-8859-1
LanguagePriority en es de

You can specify a default character set for documents without one already associated by using the AddDefaultCharset directive, as shown in the example. Another option is to specify AddDefaultCharset Off to disable adding a character set for documents without one associated.

You can specify a default language with the DefaultLanguage directive. For a website in English, that would be en, as shown in the example.

Finally, if the client does not provide a language preference, you can use LanguagePriority to determine the preferred language order. In this example, if a document in English is found, it will be served. Otherwise, Apache will look for a document in Spanish, and if that is not found, Apache will look for a document in German. You can learn more about this topic at http://httpd.apache.org/docs/2.0/mod/mod_negotiation.html and http://httpd.apache.org/docs/2.0/mod/mod_mime.html.
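mod_negotiation also provides the ForceLanguagePriority directive, which applies the LanguagePriority order in two further situations. The following sketch combines both of its keywords.

```apache
LanguagePriority en es de

# Prefer: if several variants are equally acceptable, pick one using
#         LanguagePriority instead of returning 300 Multiple Choices
# Fallback: if no variant is acceptable, pick one using
#           LanguagePriority instead of returning 406 Not Acceptable
ForceLanguagePriority Prefer Fallback
```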

Advanced URL Mapping with mod_rewrite

Apache provides a very powerful module, mod_rewrite, that allows virtually unlimited URL manipulation capabilities using regular expressions. Due to its complexity, it is outside the scope of this book other than specific references or examples in other chapters. It is mentioned here so you are aware of its existence if you reach the limits of what Redirect, ErrorDocument, and Alias directives can do.

You can learn more about mod_rewrite at http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html.

Understanding the “Trailing Slash” Problem

DirectorySlash On

Sometimes, certain URLs work only if they have a “/” at the end. This is likely because you have either not loaded mod_dir into the server or because the redirections made by mod_dir are not working correctly with the value specified in the ServerName directive, as explained in the “Redirections Do Not Work” section in Chapter 2.

When accessing certain URLs that map into directories, it is necessary to add a trailing slash (“/”) to the end of the URL to correctly access the content of the directory, which can be either an index file or a directory index. Forgetting to add this trailing slash is a common mistake, so when mod_dir realizes that may be happening, it issues the appropriate redirection.

For example, if mod_dir is enabled on the server, and you have a directory named foo under the document root, a request for http://example.com/foo will be redirected to http://example.com/foo/.

This is the default behavior in both Apache 1.3 and 2.0 when mod_dir is loaded into the server. In Apache 2, you can disable such redirections using a DirectorySlash directive:

DirectorySlash Off
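The directive can also be limited to part of the site. The following sketch turns the redirection off for a single, hypothetical directory. Be aware that when directory listings are enabled, a request without the trailing slash may then show the directory contents instead of the index file.

```apache
# Hypothetical directory; the rest of the site keeps the default behavior
<Directory /usr/local/apache2/htdocs/app>
    DirectorySlash Off
</Directory>
```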

Fixing Spelling Mistakes

CheckSpelling on

mod_speling is a useful Apache module that recognizes misspelled URLs and redirects the user to the correct location for the document. mod_speling is able to correct URLs with the wrong capitalization or with one letter missing or incorrect. This is most common when users misspell the URL while typing it in the browser.

For example, if a user requests the file file.html and it is not present, mod_speling will search for a similar document such as FILE.HTML, file.htm, and so on, and if it finds one, will return it. This has a performance impact, but can be quite useful and avoid unnecessary support requests due to broken links.

To enable spelling checks, you can add CheckSpelling on to your Apache configuration, as shown in the example.

Note

If there are several documents that may be a match for the misspelling, the module will return a list of those documents. This could have security implications because you may not want to make some of those files visible.
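To limit both the performance cost and the exposure described in the note, you can enable spelling correction only where it is needed. A minimal sketch follows, assuming the module was built as a shared object; the docs directory is a hypothetical path.

```apache
LoadModule speling_module modules/mod_speling.so

# Correct misspelled URLs only under this directory
<Directory /usr/local/apache2/htdocs/docs>
    CheckSpelling on
</Directory>
```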

Fixing Capitalization Problems

NoCase on

Windows has a case-insensitive file system, while Unix systems are case sensitive. This usually creates problems when migrating websites from Windows to Unix servers. All of a sudden, URLs such as http://www.example.com/images/icon.PNG that used to work fine on Windows start failing with Document Not Found errors, because the file on disk is named icon.png and is not equivalent on Unix to the icon.PNG file requested. This issue can be solved by manually checking and rewriting every link or by enabling the mod_speling module as described in the previous section.

There is also an alternative, single-purpose module that can be used to this end: mod_nocase. This module, originally based on mod_speling, makes GET requests for URLs case-insensitive. It checks for an exact URL match and, if it does not find one, tries a case-insensitive match. If multiple files match the case-insensitive search, the first one will automatically be selected. To enable mod_nocase, you should load it into the server and include a NoCase directive in your Apache configuration file, as shown in the example.

You can download mod_nocase from http://www.misterblue.com/Software/mod_nocase.htm.

Remember that enabling either mod_speling or mod_nocase has an impact on the performance of the server.

Validating Your Pages with Tidy

AddOutputFilterByType   TIDY    text/html application/xhtml+xml
TidyOption char-encoding utf8

Independently of whether you have dynamically generated or hand-coded your HTML pages, if they contain markup errors, they may not display correctly in all browsers. Tidy is a useful command-line tool that is able to process malformed HTML and XML, correct many common mistakes, and produce standards-compliant output. You can download it from http://tidy.sourceforge.net/.

You can run Tidy from the command line over static files or, thanks to mod_tidy and the Apache 2 filter architecture, process content being served on-the-fly. This example shows how to use the AddOutputFilterByType directive to associate a Tidy filter with HTML and XHTML content and how to use TidyOption to configure the behavior of the Tidy engine. Apache filter architecture and configuration is described in Chapter 11. You can download mod_tidy from http://home.snafu.de/tusk/mod_tidy/.

A related Apache 2 module is mod_validator, which can be downloaded from http://www.webthing.com/software/mod_validator/.
