One of the problems that the designers of the Web faced was differences between local operating systems. These differences can cause problems with URLs: for example, some operating systems allow spaces in filenames; some don’t. Most operating systems won’t complain about a # sign in a filename; in a URL, a # sign means that the filename has ended, and a named anchor follows. Similar problems are presented by other special characters, nonalphanumeric characters, etc., all of which may have a special meaning inside a URL or on another operating system. To solve these problems, characters used in URLs must come from a fixed subset of ASCII, in particular:
The capital letters A-Z
The lowercase letters a-z
The digits 0-9
The punctuation characters - _ . ! ~ * ' ( and )
The characters : / & ? @ # ; $ + = % and , may also be used, but only for their specified purposes. If these characters occur as part of a filename, then they and all other characters should be encoded.
The encoding used is very simple. Any characters that are not ASCII numerals, letters, or the punctuation marks specified earlier are represented by a percent sign followed by two hexadecimal digits giving the value for that character. Spaces are a special case because they’re so common. Besides being encoded as %20, they can be encoded as a plus sign (+). The plus sign itself is encoded as %2B. The / # = & and ? characters should be encoded when they are used as part of a name, and not as a separator between parts of the URL.
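The rule can be illustrated with a few lines of Java. This is only a sketch of the percent-escape rule itself, not the library implementation: the class name PercentEncodeSketch and the method percentEncode( ) are hypothetical, the set of unencoded punctuation is assumed to be - _ . ! ~ * ' ( and ), spaces are always written as %20 rather than +, and non-ASCII input is rejected outright.

```java
public class PercentEncodeSketch {

    // Assumption: the punctuation marks that may appear unencoded.
    private static final String SAFE_PUNCTUATION = "-_.!~*'()";

    // Hypothetical helper: replaces every ASCII character that is not a
    // letter, a digit, or safe punctuation with % plus two hex digits.
    public static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || SAFE_PUNCTUATION.indexOf(c) >= 0;
            if (safe) {
                out.append(c);
            } else if (c < 128) {
                // High nibble first, then low nibble, as uppercase hex.
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(c >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(c & 0xF, 16)));
            } else {
                // This sketch does not attempt to handle non-ASCII text.
                throw new IllegalArgumentException(
                        "Non-ASCII character at index " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints My%20File%20%231.html
        System.out.println(percentEncode("My File #1.html"));
    }
}
```

The space (hex 20) and the # sign (hex 23) are the only characters in this filename that fall outside the permitted set, so they are the only ones replaced.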
This scheme doesn’t work well (or really at all) for multibyte character sets. This is a distinct shortcoming of the current URI specification that should be addressed in the future.
Java 1.0 and later provides a URLEncoder class to encode strings in this format. Java 1.2 adds a URLDecoder class that can decode strings in this format. Neither of these classes is meant to be instantiated. Both provide a single static method to do their work:
public class URLDecoder extends Object
public class URLEncoder extends Object
The java.net.URLEncoder class contains a single static method called encode( ) that encodes a String according to these rules:
public static String encode(String s)
URLEncoder.encode( ) changes any nonalphanumeric characters except the space, underscore, hyphen, period, and asterisk characters into % sequences. The space is converted into a plus sign. This method is a little overly aggressive in that it also converts tildes, single quotes, exclamation points, and parentheses to percent escapes even though they don’t absolutely have to be. (In Java 1.0, URLEncoder was even more aggressive and also encoded asterisks and periods.) However, this isn’t forbidden by the URL specification, so web browsers will deal reasonably with these excessively encoded URLs. There’s no reason encode( ) couldn’t have been included in the URL class, but it wasn’t. The signature of encode( ) is:
public static String encode(String s)
It returns a new String suitably encoded. Example 7-8 uses this method to print various encoded strings.
Example 7-8. x-www-form-urlencoded Strings
import java.net.*;

public class EncodeTest {

  public static void main(String[] args) {

    System.out.println(URLEncoder.encode("This string has spaces"));
    System.out.println(URLEncoder.encode("This*string*has*asterisks"));
    System.out.println(URLEncoder.encode(
     "This%string%has%percent%signs"));
    System.out.println(URLEncoder.encode("This+string+has+pluses"));
    System.out.println(URLEncoder.encode("This/string/has/slashes"));
    // Double quotes inside a string literal must be escaped
    System.out.println(URLEncoder.encode(
     "This\"string\"has\"quote\"marks"));
    System.out.println(URLEncoder.encode("This:string:has:colons"));
    System.out.println(URLEncoder.encode("This~string~has~tildes"));
    System.out.println(URLEncoder.encode(
     "This(string)has(parentheses)"));
    System.out.println(URLEncoder.encode("This.string.has.periods"));
    System.out.println(URLEncoder.encode(
     "This=string=has=equals=signs"));
    System.out.println(URLEncoder.encode("This&string&has&ampersands"));

  }

}
Here is the output:
% java EncodeTest
This+string+has+spaces
This*string*has*asterisks
This%25string%25has%25percent%25signs
This%2Bstring%2Bhas%2Bpluses
This%2Fstring%2Fhas%2Fslashes
This%22string%22has%22quote%22marks
This%3Astring%3Ahas%3Acolons
This%7Estring%7Ehas%7Etildes
This%28string%29has%28parentheses%29
This.string.has.periods
This%3Dstring%3Dhas%3Dequals%3Dsigns
This%26string%26has%26ampersands
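To check which punctuation marks a particular Java runtime leaves alone, one option is to run every printable ASCII punctuation character through the encoder and keep those that come back unchanged. The following is a sketch under stated assumptions: the class and method names are hypothetical, and it calls the two-argument encode(String, String) overload, which was added in Java 1.4 and so is newer than the versions discussed here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UnencodedPunctuation {

    // Hypothetical helper: returns, in ASCII order, the printable ASCII
    // punctuation characters that URLEncoder.encode() leaves unchanged.
    public static String untouched() {
        StringBuilder kept = new StringBuilder();
        for (char c = '!'; c <= '~'; c++) {
            if (Character.isLetterOrDigit(c)) {
                continue; // letters and digits are never encoded
            }
            String s = String.valueOf(c);
            try {
                if (URLEncoder.encode(s, "UTF-8").equals(s)) {
                    kept.append(c);
                }
            } catch (UnsupportedEncodingException ex) {
                // Every Java platform is required to support UTF-8.
                throw new IllegalStateException(ex);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(untouched());
    }
}
```

On a runtime that follows the rules described above, this should print *-._ (the asterisk, hyphen, period, and underscore). The space is not in the list because it is converted to a plus sign rather than passed through.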
Notice in particular that encode( ) does encode the forward slash, the ampersand, the equals sign, and the colon. It does not attempt to determine how these characters are being used in a URL. Consequently, you have to encode your URLs piece by piece, rather than encoding an entire URL in one method call. This is an important point, because the primary use of URLEncoder is in preparing query strings for communicating with CGI programs that use GET. For example, suppose you want to encode this query string used for an AltaVista search:
pg=q&kl=XX&stype=stext&q=+"Java+I/O"&search.x=38&search.y=3
This code fragment encodes it:
String query = URLEncoder.encode(
 "pg=q&kl=XX&stype=stext&q=+\"Java+I/O\"&search.x=38&search.y=3");
System.out.println(query);
Unfortunately, the output is:
pg%3Dq%26kl%3DXX%26stype%3Dstext%26q%3D%2B%22Java%2BI%2FO%22%26search.x%3D38%26search.y%3D3
The problem is that URLEncoder.encode( ) encodes blindly. It can’t distinguish between special characters used as part of the URL or query string, like & and = in the previous string, and characters that need to be encoded. Consequently, URLs need to be encoded a piece at a time like this:
String query = URLEncoder.encode("pg");
query += "=";
query += URLEncoder.encode("q");
query += "&";
query += URLEncoder.encode("kl");
query += "=";
query += URLEncoder.encode("XX");
query += "&";
query += URLEncoder.encode("stype");
query += "=";
query += URLEncoder.encode("stext");
query += "&";
query += URLEncoder.encode("q");
query += "=";
// The value is +"Java I/O"; the inner quotes must be escaped in Java
query += URLEncoder.encode("+\"Java I/O\"");
query += "&";
query += URLEncoder.encode("search.x");
query += "=";
query += URLEncoder.encode("38");
query += "&";
query += URLEncoder.encode("search.y");
query += "=";
query += URLEncoder.encode("3");
System.out.println(query);
The output of this is what you actually want:
pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3
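The same piece-by-piece approach can be written as a loop over an ordered collection of names and values, so that only the names and values pass through the encoder while the = and & separators are appended untouched. This is a minimal sketch: the class PiecewiseEncode and its methods are hypothetical, and it assumes the two-argument encode(String, String) overload added in Java 1.4.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class PiecewiseEncode {

    // Hypothetical helper: encodes each name and value separately and
    // joins them with literal, unencoded = and & separators.
    public static String toQuery(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (query.length() > 0) {
                    query.append('&');
                }
                query.append(URLEncoder.encode(e.getKey(), "UTF-8"));
                query.append('=');
                query.append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(ex);
        }
        return query.toString();
    }

    // Builds the AltaVista query from the text; a LinkedHashMap keeps
    // the pairs in insertion order.
    public static String altaVistaQuery() {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("pg", "q");
        params.put("kl", "XX");
        params.put("stype", "stext");
        params.put("q", "+\"Java I/O\"");
        params.put("search.x", "38");
        params.put("search.y", "3");
        return toQuery(params);
    }

    public static void main(String[] args) {
        System.out.println(altaVistaQuery());
    }
}
```

A map cannot hold the same name twice, so a form that repeats a field name would need a list of pairs instead.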
Example 7-9 is a QueryString class that uses the URLEncoder to encode successive name and value pairs in a Java object, which will be used for sending data to CGI programs. When you create a QueryString, you can supply the first name-value pair to the constructor; the arguments are a pair of objects, which are converted to strings using their toString( ) methods and then encoded. To add further pairs, call the add( ) method, which also takes two objects as arguments, converts them to Strings, and encodes them. The QueryString class supplies its own toString( ) method, which simply returns the accumulated list of name-value pairs. toString( ) is called implicitly whenever you add a QueryString to another string or print it on an output stream.
Example 7-9. The QueryString Class
package com.macfaq.net;

import java.net.URLEncoder;

public class QueryString {

  private String query;

  // Starts the query string with one encoded name-value pair
  public QueryString(Object name, Object value) {
    query = URLEncoder.encode(name.toString( )) + "=" +
     URLEncoder.encode(value.toString( ));
  }

  // Starts with an empty query string
  public QueryString( ) {
    query = "";
  }

  // Appends another encoded pair, separated from the last by &
  public synchronized void add(Object name, Object value) {
    if (!query.trim( ).equals("")) query += "&" ;
    query += URLEncoder.encode(name.toString( )) + "=" +
     URLEncoder.encode(value.toString( ));
  }

  public String toString( ) {
    return query;
  }

}
Using this class, we can now encode the previous example like this:
QueryString qs = new QueryString("pg", "q");
qs.add("kl", "XX");
qs.add("stype", "stext");
qs.add("q", "+\"Java I/O\"");
qs.add("search.x", "38");
qs.add("search.y", "3");
String url = "http://www.altavista.com/cgi-bin/query?" + qs;
System.out.println(url);
Java 1.2 adds a corresponding URLDecoder class. This has a single static method that decodes any string encoded in x-www-form-urlencoded format. That is, it converts all plus signs to spaces and all percent escapes to their corresponding character. Its signature is:
public static String decode(String s) throws Exception
An IllegalArgumentException is thrown if the string contains a percent sign that isn’t followed by two hexadecimal digits. Since this method passes all non-escaped characters along as is, you can pass an entire URL to it, rather than splitting it into pieces first. For example:
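String input = "http://www.altavista.com/cgi-bin/"
 + "query?pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3";
try {
  String output = URLDecoder.decode(input);
  System.out.println(output);
}
catch (Exception e) {
  // decode( ) is declared to throw Exception, so it must be caught
  System.err.println(e);
}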
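The behavior on malformed input can be checked with a small test. This is a sketch rather than a definitive implementation: the class DecodeSketch and its isWellFormed( ) method are hypothetical, and it uses the two-argument decode(String, String) overload added in Java 1.4, which declares UnsupportedEncodingException instead of the bare Exception shown in the signature above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeSketch {

    // Hypothetical helper: decodes an x-www-form-urlencoded string,
    // turning + into a space and %XX escapes into characters.
    public static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(ex);
        }
    }

    // Hypothetical helper: reports whether the decoder accepts the
    // string, that is, whether every % starts a valid escape.
    public static boolean isWellFormed(String s) {
        try {
            decode(s);
            return true;
        } catch (IllegalArgumentException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("q=%2B%22Java+I%2FO%22"));
        System.out.println(isWellFormed("q=%2B%22Java+I%2FO%22"));
        System.out.println(isWellFormed("100%"));  // trailing, incomplete escape
        System.out.println(isWellFormed("%zz"));   // not hexadecimal digits
    }
}
```

The first line of output should be q=+"Java I/O", since unescaped characters such as = pass through unchanged. The two malformed strings are expected to be rejected, so checking input this way before decoding avoids an exception at the point of use.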