Certain characters are special in web pages and must be encoded if you want to display them literally. Because database content often contains instances of these characters, scripts that include query results in web pages should encode those results to prevent browsers from misinterpreting the information.
HTML is a markup language: it uses certain characters as markers that have a special meaning. To include literal instances of these characters in a page, you must encode them so that they are not interpreted as having their special meanings. For example, < should be encoded as &lt; to keep a browser from interpreting it as the beginning of a tag. Furthermore, there are actually two kinds of encoding, depending on the context in which you use a character. One encoding is appropriate for HTML text; another is used for text that is part of a URL in a hyperlink.
The MySQL table-display scripts shown in other recipes are simple demonstrations of how to produce web pages using programs. But with one exception, the scripts have a common failing: they take no care to properly encode special characters that occur in the information retrieved from the MySQL server. (The exception is the JSP version of the script. The <c:out> tag used there handles encoding automatically, as we’ll discuss shortly.)
As it happens, I deliberately chose information to display that is unlikely to contain any special characters; the scripts should work properly even in the absence of any encoding. However, in the general case, it’s unsafe to assume that a query result will contain no special characters, so you must be prepared to encode it for display in a web page. Neglecting to do this often results in scripts that generate pages containing malformed HTML that displays incorrectly.
This recipe describes how to handle special characters, beginning with some general principles, and then discusses how each API implements encoding support. The API-specific examples show how to process information drawn from a database table, but they can be adapted to any content you include in a web page, no matter its source.
One form of encoding applies to characters that are used in writing HTML constructs; another applies to text that is included in URLs. It’s important to understand this distinction so that you don’t encode text inappropriately.
Encoding text for inclusion in a web page is an entirely different issue from encoding special characters in data values for inclusion in an SQL statement. Handling Special Characters and NULL Values in Statements discusses the latter issue.
Encoding characters that are special in HTML. HTML markup uses < and > characters to begin and end tags, & to begin special entity names (such as &nbsp; to signify a nonbreaking space), and " to quote attribute values in tags (such as <p align="left">). Consequently, to display literal instances of these characters, you should encode them as HTML entities so that browsers or other clients understand your intent. To do this, convert the special characters <, >, &, and " to the corresponding HTML entity designators shown in the following table.
Special character | HTML entity |
---|---|
< | &lt; |
> | &gt; |
& | &amp; |
" | &quot; |
Suppose that you want to display the following string literally in a web page:
Paragraphs begin and end with <p> & </p> tags.
If you send this text to the client browser exactly as shown, the browser will misinterpret it: the <p> and </p> tags will be taken as paragraph markers and the & may be taken as the beginning of an HTML entity designator. To display the string the way you intend, encode the special characters as the &lt;, &gt;, and &amp; entities:
Paragraphs begin and end with &lt;p&gt; &amp; &lt;/p&gt; tags.
The principle of encoding text this way is also useful within tags. For example, HTML tag attribute values usually are enclosed within double quotes, so it’s important to perform HTML-encoding on attribute values. Suppose that you want to include a text-input box in a form, and you want to provide an initial value of Rich "Goose" Gossage to be displayed in the box. You cannot write that value literally in the tag like this:
<input type="text" name="player_name" value="Rich "Goose" Gossage" />
The problem here is that the double-quoted value attribute includes internal double quotes, which makes the <input> tag malformed. The proper way to write it is to encode the double quotes:
<input type="text" name="player_name" value="Rich &quot;Goose&quot; Gossage" />
When a browser receives this text, it decodes the &quot; entities back to " characters and interprets the value attribute value properly.
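The HTML-encoding rules just described can be sketched in a few lines of Python 3. This is only an illustration, assuming the standard library's html.escape() function; the helper name html_encode is hypothetical and not part of any API discussed in this recipe.

```python
import html

def html_encode(text):
    """HTML-encode text for page content or a double-quoted attribute value."""
    # quote=True also converts " (and ') so the result is safe inside attributes
    return html.escape(text, quote=True)

# Text displayed in the body of a page
sentence = "Paragraphs begin and end with <p> & </p> tags."
print(html_encode(sentence))

# A value placed inside a double-quoted attribute
player = 'Rich "Goose" Gossage'
print('<input type="text" name="player_name" value="%s" />' % html_encode(player))
```

The first call encodes the tag delimiters and the ampersand; the second encodes the internal double quotes so that the value attribute stays well formed.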
Encoding characters that are special in URLs. URLs for hyperlinks that occur within HTML pages have their own syntax and their own encoding. This encoding applies to attributes within several tags:
<a href="URL">
<img src="URL">
<form action="URL">
<frame src="URL">
Many characters have special meaning within URLs, such as :, /, ?, =, &, and ;. The following URL contains some of these characters:
http://localhost/myscript.php?id=428&name=Gandalf
Here the : and / characters segment the URL into components, the ? character indicates that parameters are present, and the & character separates the parameters, each of which is specified as a name=value pair. (The ; character is not present in the URL just shown, but is also commonly used instead of & to separate parameters.) If you want to include any of these characters literally within a URL, you must encode them to prevent the browser from interpreting them with their usual special meaning. Other characters such as spaces require special treatment as well. Spaces are not allowed within a URL, so if you want to reference a page named my home page.html on the local host, the URL in the following hyperlink won’t work:
<a href="http://localhost/my home page.html">My Home Page</a>
URL-encoding for special and reserved characters is performed by converting each such character to % followed by two hexadecimal digits representing the character’s ASCII code. For example, the ASCII value of the space character is 32 decimal, or 20 hexadecimal, so you’d write the preceding hyperlink like this:
<a href="http://localhost/my%20home%20page.html">My Home Page</a>
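Sometimes you’ll see spaces encoded as + in URLs. That is legal, too, but only in the parameter part of the URL (the part after the ?), where it comes from the encoding used for form submissions. In the path part of a URL, + is an ordinary literal character.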
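As a rough illustration of the two space encodings, here is a short Python 3 sketch. It assumes the standard library's urllib.parse module, whose quote() function writes spaces as %20 and whose quote_plus() function writes them as +.

```python
from urllib.parse import quote, quote_plus

# Encode a page name for the path part of a URL; spaces become %20
page = "my home page.html"
print("http://localhost/" + quote(page))

# Encode a parameter value; safe="" makes quote() encode every reserved character
print(quote("cats & dogs", safe=""))

# quote_plus() is intended for parameter values and encodes spaces as +
print(quote_plus("cats & dogs"))
```

Both forms of the parameter value decode to the same string on the server side; which one appears depends only on the function used to produce it.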
Use the appropriate encoding method for the context. Be sure to encode information properly for the context in which you’re using it. Suppose that you want to create a hyperlink to trigger a search for items matching a search term, and you want the term itself to appear as the link label that is displayed in the page. In this case, the term appears as a parameter in the URL, and also as HTML text between the <a> and </a> tags. If the search term is “cats & dogs”, the unencoded hyperlink construct looks like this:
<a href="/cgi-bin/myscript?term=cats & dogs">cats & dogs</a>
That is incorrect because & is special in both contexts and the spaces are special in the URL. The link should be written like this instead:
<a href="/cgi-bin/myscript?term=cats%20%26%20dogs">cats &amp; dogs</a>
Here, & is HTML-encoded as &amp; for the link label, and is URL-encoded as %26 for the URL, which also includes spaces encoded as %20.
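The following Python 3 sketch shows one way to apply each encoding in its own context when building such a link. It is an illustration only: the function name search_link and its script parameter are hypothetical, and it assumes the standard library's html.escape() and urllib.parse.quote() functions.

```python
import html
from urllib.parse import quote

def search_link(term, script="/cgi-bin/myscript"):
    """Build a hyperlink that uses term both as a URL parameter and as the label."""
    # URL context: URL-encode the term before appending it as a parameter value
    url = script + "?term=" + quote(term, safe="")
    # HTML context: HTML-encode the label, and also the finished URL, because
    # the URL is itself placed inside a double-quoted attribute value
    return '<a href="%s">%s</a>' % (html.escape(url), html.escape(term))

print(search_link("cats & dogs"))
```

The same input string passes through two different functions, one per context; applying either encoding in the wrong place would produce a link that either points to the wrong URL or displays the wrong label.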
Granted, it’s a pain to encode text before writing it to a web page, and sometimes you know enough about a value that you can skip the encoding. (See the sidebar, “Do You Always Need to Encode Web Page Output?”) But encoding is the safe thing to do most of the time. Fortunately, most APIs provide functions to do the work for you. This means you need not know every character that is special in a given context. You just need to know which kind of encoding to perform, so that you can call the appropriate function to produce the intended result.
The following encoding examples show how to pull values out of MySQL and perform both HTML-encoding and URL-encoding on them to generate hyperlinks. Each example reads a table named phrase that contains short phrases and then uses its contents to construct hyperlinks that point to a (hypothetical) script that searches for instances of the phrases in some other table. The table contains the following rows:
mysql> SELECT phrase_val FROM phrase ORDER BY phrase_val;
+--------------------------+
| phrase_val               |
+--------------------------+
| are we "there" yet?      |
| cats & dogs              |
| rhinoceros               |
| the whole > sum of parts |
+--------------------------+
The goal here is to generate a list of hyperlinks using each phrase both as the hyperlink label (which requires HTML-encoding) and in the URL as a parameter to the search script (which requires URL-encoding). The resulting links look something like this:
<a href="/cgi-bin/mysearch.pl?phrase=are%20we%20%22there%22%20yet%3F">
are we &quot;there&quot; yet?</a>
<a href="/cgi-bin/mysearch.pl?phrase=cats%20%26%20dogs">
cats &amp; dogs</a>
<a href="/cgi-bin/mysearch.pl?phrase=rhinoceros">
rhinoceros</a>
<a href="/cgi-bin/mysearch.pl?phrase=the%20whole%20%3E%20sum%20of%20parts">
the whole &gt; sum of parts</a>
The initial part of the href attribute value will vary per API. Also, the links produced by some APIs will look slightly different because they encode spaces as + rather than as %20.
Perl. The Perl CGI.pm module provides two methods, escapeHTML() and escape(), that handle HTML-encoding and URL-encoding. There are three ways to use these methods to encode a string $str:
Invoke escapeHTML() and escape() as CGI class methods using a CGI:: prefix:
use CGI;
printf "%s\n%s\n", CGI::escape ($str), CGI::escapeHTML ($str);
Create a CGI object and invoke escapeHTML() and escape() as object methods:
use CGI;
my $cgi = CGI->new;
printf "%s\n%s\n", $cgi->escape ($str), $cgi->escapeHTML ($str);
Import the names explicitly into your script’s namespace. In this case, neither a CGI object nor the CGI:: prefix is necessary and you can invoke the methods as standalone functions. The following example imports the two method names in addition to the set of standard names:
use CGI qw(:standard escape escapeHTML);
printf "%s\n%s\n", escape ($str), escapeHTML ($str);
I prefer the last alternative because it is consistent with the CGI.pm function call interface that you use for other imported method names. Just remember to include the encoding method names in the use CGI statement for any Perl script that requires them, or you’ll get “undefined subroutine” errors when the script executes.
The following code reads the contents of the phrase table and produces hyperlinks from them using escapeHTML() and escape():
my $stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val";
my $sth = $dbh->prepare ($stmt);
$sth->execute ();
while (my ($phrase) = $sth->fetchrow_array ())
{
  # URL-encode the phrase value for use in the URL
  my $url = "/cgi-bin/mysearch.pl?phrase=" . escape ($phrase);
  # HTML-encode the phrase value for use in the link label
  my $label = escapeHTML ($phrase);
  print a ({-href => $url}, $label), br (), "\n";
}
Ruby. The Ruby cgi module contains two methods, CGI.escapeHTML() and CGI.escape(), that perform HTML-encoding and URL-encoding. However, both methods raise an exception unless the argument is a string. One way to deal with this is to apply the to_s method to any argument that might not be a string, to force it to string form and convert nil to the empty string. For example:
stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val"
dbh.execute(stmt) do |sth|
  sth.fetch do |row|
    # make sure that the value is a string
    phrase = row[0].to_s
    # URL-encode the phrase value for use in the URL
    url = "/cgi-bin/mysearch.rb?phrase=" + CGI.escape(phrase)
    # HTML-encode the phrase value for use in the link label
    label = CGI.escapeHTML(phrase)
    page << cgi.a("href" => url) { label } + cgi.br + "\n"
  end
end
page is used here as a variable that “accumulates” page content and that you eventually pass to cgi.out to display the page.
PHP. In PHP, the htmlspecialchars() and urlencode() functions perform HTML-encoding and URL-encoding. Use them as follows:
$stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val";
$result =& $conn->query ($stmt);
if (!PEAR::isError ($result))
{
  while (list ($phrase) = $result->fetchRow ())
  {
    # URL-encode the phrase value for use in the URL
    $url = "/mcb/mysearch.php?phrase=" . urlencode ($phrase);
    # HTML-encode the phrase value for use in the link label
    $label = htmlspecialchars ($phrase);
    printf ("<a href=\"%s\">%s</a><br />\n", $url, $label);
  }
  $result->free ();
}
Python. In Python, the cgi and urllib modules contain the relevant encoding methods. cgi.escape() and urllib.quote() perform HTML-encoding and URL-encoding. However, both methods raise an exception unless the argument is a string. One way to deal with this is to apply the str() function to any argument that might not be a string, to force it to string form and convert None to the string "None". (If you want None to convert to the empty string, you need to test for it explicitly.) For example:
import cgi
import urllib

stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val"
cursor = conn.cursor ()
cursor.execute (stmt)
for (phrase,) in cursor.fetchall ():
  # make sure that the value is a string
  phrase = str (phrase)
  # URL-encode the phrase value for use in the URL
  url = "/cgi-bin/mysearch.py?phrase=" + urllib.quote (phrase)
  # HTML-encode the phrase value for use in the link label
  label = cgi.escape (phrase, 1)
  print "<a href=\"%s\">%s</a><br />" % (url, label)
cursor.close ()
The first argument to cgi.escape() is the string to be HTML-encoded. By default, this function converts <, >, and & characters to their corresponding HTML entities. To tell cgi.escape() to also convert double quotes to the &quot; entity, pass a second argument of 1, as shown in the example. This is especially important if you’re encoding values to be placed into a double-quoted tag attribute.
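In Python 3, cgi.escape() is no longer available (it was removed in Python 3.8), and html.escape() serves the same purpose. The sketch below shows the two points made above using that function: handling None explicitly, and controlling whether double quotes are converted. The helper name to_text is hypothetical.

```python
import html

def to_text(value):
    """Convert a column value to a string, mapping None (SQL NULL) to ''."""
    # str(None) yields the string "None", so test for None explicitly
    return "" if value is None else str(value)

sample = 'say "hi" & <go>'

# quote=False converts only <, >, and &
print(html.escape(to_text(sample), quote=False))

# quote=True (the default) also converts quotes, as needed for attribute values
print(html.escape(to_text(sample)))

# A NULL column value becomes the empty string rather than "None"
print(repr(html.escape(to_text(None))))
```

Note that the default differs from cgi.escape(): html.escape() converts quotes unless you pass quote=False.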
Java. The <c:out> JSTL tag automatically performs HTML-encoding for JSP pages. (Strictly speaking, it performs XML-encoding, but the set of characters affected is <, >, &, ", and ', which includes all those needed for HTML-encoding.) By using <c:out> to display text in a web page, you need not even think about converting special characters to HTML entities. If for some reason you want to suppress encoding, invoke <c:out> with an escapeXml attribute value of false:
<c:out value="value to display" escapeXml="false"/>
To URL-encode parameters for inclusion in a URL, use the <c:url> tag. Specify the URL string in the tag’s value attribute, and include any parameter values and names in <c:param> tags in the body of the <c:url> tag. A parameter value can be given either in the value attribute of a <c:param> tag or in its body. Here’s an example that shows both uses:
<c:url var="urlStr" value="myscript.jsp">
  <c:param name="id" value="47"/>
  <c:param name="color">sky blue</c:param>
</c:url>
This will URL-encode the values of the id and color parameters and add them to the end of the URL. The result is placed in an object named urlStr, which you can display as follows:
<c:out value="${urlStr}"/>
The <c:url> tag does not encode special characters such as spaces in the string supplied in its value attribute. You must encode them yourself, so it’s probably best not to create pages with spaces in their names; that way you never need to encode them when you refer to those pages.
To display entries from the phrase table, use the <c:out> and <c:url> tags as follows:
<sql:query dataSource="${conn}" var="rs">
  SELECT phrase_val FROM phrase ORDER BY phrase_val
</sql:query>

<c:forEach items="${rs.rows}" var="row">
  <%-- URL-encode the phrase value for use in the URL --%>
  <c:url var="urlStr" value="/mcb/mysearch.jsp">
    <c:param name="phrase" value="${row.phrase_val}"/>
  </c:url>
  <a href="<c:out value="${urlStr}"/>">
    <%-- HTML-encode the phrase value for use in the link label --%>
    <c:out value="${row.phrase_val}"/>
  </a>
  <br />
</c:forEach>