A subclass of
ContentHandler
overrides the getContent( )
method to return an object that’s the Java
equivalent of the content. This method can be quite simple or quite
complex, depending almost entirely on the complexity of the content
type you’re trying to parse. A text/plain
content handler is quite simple; a text/rtf
content handler would be very
complex.
The ContentHandler class has only a simple no-argument constructor:
public ContentHandler( )
Since ContentHandler
is an abstract class, you
never call its constructor directly, only from inside the
constructors of subclasses.
The primary method of the class, albeit an abstract one, is getContent( ):
public abstract Object getContent(URLConnection uc) throws IOException
This method is normally called only from inside the
getContent( )
method of a
URLConnection
object. It is overridden in a
subclass that is specific to the type of content being handled.
getContent( ) should use the URLConnection’s InputStream to create an
object. There are no rules about what type of object a content handler
should return. In general, this depends on what the application
requesting the content expects. Content handlers for text-like content
bundled with the JDK return some subclass of InputStream. Content
handlers for images return ImageProducer objects.
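As a minimal sketch of this pattern, the following hypothetical handler turns the body of a text/plain response into a String. The names PlainTextSketch, StringHandler, and CannedConnection are invented for illustration, the choice of String as the return type and UTF-8 as the encoding are assumptions, and the stub connection only stands in for a real URLConnection so the example can run without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.ContentHandler;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class PlainTextSketch {

    // Hypothetical handler: reads the connection's stream and returns a String.
    static class StringHandler extends ContentHandler {
        public Object getContent(URLConnection uc) throws IOException {
            // Assumption: the content is UTF-8 text
            Reader in = new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            in.close();
            return sb.toString();
        }
    }

    // Stand-in connection that serves canned bytes instead of talking to a server.
    static class CannedConnection extends URLConnection {
        private final byte[] body;

        CannedConnection(byte[] body) {
            super((URL) null); // no real URL is needed for this illustration
            this.body = body;
        }

        public void connect() {
            connected = true;
        }

        public InputStream getInputStream() {
            return new ByteArrayInputStream(body);
        }
    }

    // Runs the handler against canned content and returns whatever it produces.
    static Object handle(String text) {
        try {
            byte[] raw = text.getBytes(StandardCharsets.UTF_8);
            return new StringHandler().getContent(new CannedConnection(raw));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("Hello, content handler"));
    }
}
```

Note that the handler never looks at headers or at the URL itself; it consumes only the stream it is handed.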
The getContent( )
method of a content handler does
not get the full InputStream
that the
URLConnection
has access to. The
InputStream
that a content handler sees should
include only the content’s raw data. Any MIME headers or other
protocol-specific information that come from the server should be
stripped by the URLConnection before it passes the stream to the
ContentHandler. A ContentHandler is responsible only for content,
not for any protocol overhead that may be present. The
URLConnection
should have already performed any
necessary handshaking with the server and interpreted any headers it
sends.
To see how
content handlers work, let’s create a
ContentHandler
that handles the
text/tab-separated-values
content type. We
aren’t concerned with how the tab-separated values get to us.
That’s for a protocol handler to deal with. All a
ContentHandler
needs to know is the MIME type and
format of the data.
Tab-separated values are produced by many database and spreadsheet programs. A tab-separated file may look something like this. Tabs are indicated by arrows:
JPE Associates→341 Lafayette Street, Suite 1025→New York→NY→10012
O’Reilly & Associates→103 Morris Street, Suite A→Sebastopol→CA→95472
In database parlance, each line is a record, and the data before each tab is a field. It is usually (though not necessarily) true that each field has the same meaning in each record. In the previous example, the first field is the company name.
The first question to ask is: what kind of Java object should we
convert the tab-separated values to? The simplest and most general
way to store each record is as an array of Strings. Successive records
can be collected in a Vector. In many applications, however, you have a
great deal more knowledge about the exact format and meaning of the
data than we do here. The more you know about the data you’re
dealing with, the better a ContentHandler
you can
write. For example, if you know that the data you’re
downloading represents U.S. addresses, then you could define a class
like this:
public class Address {

  private String name;
  private String street;
  private String city;
  private String state;
  private String zip;

}
This class would also have appropriate constructors and other methods
to represent each record. In this example, we don’t know
anything about the data in advance, or how many records we’ll
have to store. Therefore, we will take the most general approach and
convert each record into an array of strings, using a
Vector
to store each array until there are no more
records. The getContent( )
method can return the
Vector
of String
arrays.
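To illustrate the more specific alternative, here is a sketch of how a handler that knew it was receiving U.S. addresses might turn one record into an Address. The constructor, the field order (taken from the sample data above), the five-field check, and the toString( ) format are all assumptions made for this example; they are not part of the class outlined earlier.

```java
class AddressSketch {

    // Hypothetical fleshed-out version of the Address class
    static class Address {
        private final String name;
        private final String street;
        private final String city;
        private final String state;
        private final String zip;

        // Builds an address from one tab-separated record that has already been
        // split into fields. Assumed order: name, street, city, state, zip.
        Address(String[] fields) {
            if (fields.length != 5) {
                throw new IllegalArgumentException(
                    "expected 5 fields but found " + fields.length);
            }
            this.name = fields[0];
            this.street = fields[1];
            this.city = fields[2];
            this.state = fields[3];
            this.zip = fields[4];
        }

        public String getCity() {
            return city;
        }

        public String toString() {
            return name + ", " + street + ", " + city + ", " + state + " " + zip;
        }
    }

    public static void main(String[] args) {
        String[] record = {
            "JPE Associates", "341 Lafayette Street, Suite 1025",
            "New York", "NY", "10012"
        };
        System.out.println(new Address(record));
    }
}
```

The trade-off is the one described above: a typed record is more convenient for the client, but the handler then works only for data that really has this shape.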
Example 17.1 shows the code for such a ContentHandler. The full
package-qualified name is
com.macfaq.net.www.content.text.tab_separated_values.
This unusual class name follows the naming convention for a content
handler for the MIME type text/tab-separated-values. Since MIME types
often contain hyphens, as in this example, a convention exists to
replace these with the underscore (_). Thus text/tab-separated-values
becomes text.tab_separated_values. To install this content
handler, all that’s needed is to put the compiled .class file
somewhere the class loader can find it and set the
java.content.handler.pkgs property to com.macfaq.net.www.content.
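The naming convention can be sketched as a small function. This is a simplified illustration that covers only the two substitutions described here (slash to period, hyphen to underscore); the class and method names are hypothetical, and the real lookup inside the JDK may treat other characters as well.

```java
class HandlerNameSketch {

    // Maps a MIME type to the class name a content handler for it would have
    // under the given package prefix. Simplified: only '/' and '-' are handled.
    static String classNameFor(String packagePrefix, String mimeType) {
        String relative = mimeType.replace('/', '.').replace('-', '_');
        return packagePrefix + "." + relative;
    }

    public static void main(String[] args) {
        System.out.println(
            classNameFor("com.macfaq.net.www.content", "text/tab-separated-values"));
    }
}
```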
Example 17-1. A ContentHandler for text/tab-separated-values
package com.macfaq.net.www.content.text;

import java.net.*;
import java.io.*;
import java.util.*;
import com.macfaq.io.SafeBufferedReader; // From Chapter 4

public class tab_separated_values extends ContentHandler {

  public Object getContent(URLConnection uc) throws IOException {

    String theLine;
    Vector v = new Vector( );
    InputStreamReader isr = new InputStreamReader(uc.getInputStream( ));
    SafeBufferedReader in = new SafeBufferedReader(isr);
    while ((theLine = in.readLine( )) != null) {
      String[] linearray = lineToArray(theLine);
      v.addElement(linearray);
    }
    return v;

  }

  private String[] lineToArray(String line) {

    // One more field than there are tabs
    int numFields = 1;
    for (int i = 0; i < line.length( ); i++) {
      if (line.charAt(i) == '\t') numFields++;
    }
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer buffer = new StringBuffer( );
      while (position < line.length( ) && line.charAt(position) != '\t') {
        buffer.append(line.charAt(position));
        position++;
      }
      fields[i] = buffer.toString( );
      position++; // skip the tab
    }
    return fields;

  }

}
Example 17.1 has two methods. The private utility method
lineToArray( ) converts a tab-separated string into an array of
strings. This method is for the private use of this subclass and is not
required by the ContentHandler class. The more complicated the
content you’re trying to parse, the more such methods your
class will need. The lineToArray( ) method begins by counting the
number of tabs in the string. This sets the numFields variable to one
more than the number of tabs. An array is created for the fields with
the length numFields. Then a for loop fills the array with the strings
between the tabs. Finally, the array is returned.
You may have expected a StringTokenizer
to split the line into parts. However, that class has unusual ideas about what makes up a token. In particular, it would interpret multiple tabs in a row as a single delimiter. That is, it never returns an empty string as a token.
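The difference is easy to see with a record that has an empty middle field. The sketch below compares StringTokenizer with String.split( ) called with a negative limit, which keeps empty fields; split( ) only arrived in Java 1.4, so it was not an option for the Java versions discussed here. The class and method names are made up for the example.

```java
import java.util.StringTokenizer;

class EmptyFieldSketch {

    // Number of tokens StringTokenizer reports when splitting on tabs
    static int tokenizerCount(String line) {
        return new StringTokenizer(line, "\t").countTokens();
    }

    // Number of fields String.split reports; the limit of -1 keeps
    // empty fields, including trailing ones
    static int splitCount(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        // Three fields, the middle one empty
        String line = "JPE Associates\t\tNew York";
        System.out.println("StringTokenizer: " + tokenizerCount(line)); // 2
        System.out.println("split:           " + splitCount(line));     // 3
    }
}
```

Losing the empty field would shift every later field one position to the left, which is why the hand-written lineToArray( ) loop counts tabs instead.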
The getContent( ) method starts by instantiating a Vector. Then it gets
the InputStream from the URLConnection uc and chains this to an
InputStreamReader, which is in turn chained to the SafeBufferedReader
introduced in Chapter 4, so it can read it one line at a time in
a while loop. Each line is fed to the lineToArray( ) method, which
splits it into a String array. This array is then added to the
Vector. When no more lines are left, the loop exits and the Vector
is returned.
Now that you’ve written your first ContentHandler, let’s see how to use
it in a program. Files of MIME type text/tab-separated-values can be
served by Gopher servers, HTTP servers, FTP servers, and more.
Let’s assume you’re retrieving a tab-separated-values file from an HTTP
server. The filename should end with the .tsv or .tab extension so that
the server knows it’s a text/tab-separated-values file.
Not all servers are configured to support this type out of the box. Consult your server documentation to see how to set up a MIME-type mapping for your server. For instance, to configure my Apache server, I added these lines to my .htaccess
file:
AddType text/tab-separated-values tab
AddType text/tab-separated-values tsv
You can test the web server configuration by connecting to port 80 of the web server with Telnet and requesting the file manually:
% telnet metalab.unc.edu 80
Trying 127.0.0.1...
Connected to metalab.unc.edu.
Escape character is `^]'.
GET /javafaq/addresses.tab HTTP/1.0

HTTP/1.0 200 OK
Date: Mon, 15 Nov 1999 18:36:51 GMT
Server: Apache/1.3.4 (Unix) PHP/3.0.6 mod_perl/1.17
Last-Modified: Thu, 04 Nov 1999 18:22:51 GMT
Content-type: text/tab-separated-values
Content-length: 163

JPE Associates 341 Lafayette Street, Suite 1025 New York NY 10012
O'Reilly & Associates 103 Morris Street, Suite A Sebastopol CA 95472
Connection closed by foreign host.
You’re looking for a line that says
Content-type: text/tab-separated-values. If you see a Content-type of
text/plain, application/octet-stream, or some other value, or
you don’t see any Content-type at all, the server is
misconfigured and must be fixed before you continue.
The application that uses the tab-separated-values content handler
does not need to know about it explicitly. It simply has to call the
getContent( ) method of URL or URLConnection on a URL with a matching
MIME type. Furthermore, the package where the content handler can be
found has to be listed in the java.content.handler.pkgs property.
Example 17.2 is a class that downloads and prints a
text/tab-separated-values file using the ContentHandler of Example 17.1.
However, note that it does not import com.macfaq.net.www.content.text
and that it never references the tab_separated_values class. It does
include lines, commented out in the listing, that explicitly add
com.macfaq.net.www.content to the java.content.handler.pkgs property,
because that’s the simplest way to make sure this standalone program
works. Those lines are unnecessary if the property is set in a
property file or from the command line, and indeed this is required in
Java 1.1, which has no System.setProperty( ) method.
Example 17-2. The tab-separated-values ContentTester Class
import java.io.*;
import java.net.*;
import java.util.*;

public class TSVContentTester {

  private static void test(URL u) throws IOException {

    Object content = u.getContent( );
    Vector v = (Vector) content;
    for (Enumeration e = v.elements( ); e.hasMoreElements( );) {
      String[] sa = (String[]) e.nextElement( );
      for (int i = 0; i < sa.length; i++) {
        System.out.print(sa[i] + "\t");
      }
      System.out.println( );
    }

  }

  public static void main (String[] args) {

    // If you uncomment these lines in Java 1.2, then you don't
    // have to set the java.content.handler.pkgs property from the
    // command line or your properties files.
    /* String pkgs = System.getProperty("java.content.handler.pkgs", "");
    if (!pkgs.equals("")) {
      pkgs = pkgs + "|";
    }
    pkgs += "com.macfaq.net.www.content";
    System.setProperty("java.content.handler.pkgs", pkgs); */

    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        test(u);
      }
      catch (MalformedURLException e) {
        System.err.println(args[i] + " is not a good URL");
      }
      catch (Exception e) {
        e.printStackTrace( );
      }
    }

  }

}
Here’s how you run this program in Java 1.1 and 1.3. The arrows indicate tabs:
% java -Djava.content.handler.pkgs=com.macfaq.net.www.content
TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
JPE Associates→341 Lafayette Street, Suite 1025→New York→NY→10012
O'Reilly & Associates→103 Morris Street, Suite A→Sebastopol→CA→95472
Java 1.2 is trickier because the new class-loading policy used in
Java 1.2 prevents content handlers from being loaded from the local
class path. This bug is fixed in Java 1.3. However, in Java 1.2,
simply running Example 17.2 will result in a ClassCastException. Since
the custom content handler isn’t found, getContent( ) returns
an InputStream (specifically a sun.net.www.MeteredStream) instead:
% java TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
java.lang.ClassCastException: sun.net.www.MeteredStream
at TSVContentTester.test(TSVContentTester.java:10)
at TSVContentTester.main(TSVContentTester.java:35)
There are a couple of ways around this. You can use the oldjava interpreter instead. This does load content handlers from the local class path. For example:
% oldjava TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
JPE Associates→341 Lafayette Street, Suite 1025→New York→NY→10012
O'Reilly & Associates→103 Morris Street, Suite A→Sebastopol→CA→95472
Alternatively, you can use the nonstandard
-Xbootclasspath
command-line switch to tell Java
where to find the content handlers. For example, this line tells Java
to look in the current directory for content handlers and other files
before looking in the rt.jar
file (the exact
location of the rt.jar
file will have to be
adjusted to match your system):
% java -Xbootclasspath:.:/usr/local/jdk1.3/jre/lib/rt.jar
TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
JPE Associates→341 Lafayette Street, Suite 1025→New York→NY→10012
O'Reilly & Associates→103 Morris Street, Suite A→Sebastopol→CA→95472
The bug that requires these workarounds is present in all versions of
Java 1.2, though it doesn’t manifest itself in applets because
of the different nature of the ClassLoader that a
web browser uses. Of course, the ultimate solution is to use a
ContentHandlerFactory, an option I’ll discuss later.
Java 1.3 adds one overloaded variant
of the getContent( )
method to the
ContentHandler
class:
public Object getContent(URLConnection uc, Class[] classes) throws IOException // Java 1.3
The difference is the array of java.lang.Class
objects passed as the second argument. This allows the caller to
request that the content be returned as one of the types in the array
and enables content handlers to support multiple types. For example,
the text/tab-separated-values content handler could return data as a
Vector, an array, a string, or an InputStream. One would be the default
used by the single-argument getContent( ) method, while the
others would be options that a client could request. If the client
doesn’t request any of the classes this ContentHandler knows how to
provide, then it returns null.
To call this method, the client invokes the method of the same name in
a URL or URLConnection object. It passes an array of Class objects in
the order in which it prefers to receive the data. Thus, if it prefers
to receive a String but is willing to accept an InputStream and will
take a Vector as a last resort, then it would put String.class in the
zeroth component of the array, InputStream.class in the first component
of the array, and Vector.class in the last component of
the array. Then it would use instanceof to test
what was actually returned and either process it or convert it into
the preferred type. For example:
Class[] requestedTypes = {String.class, InputStream.class, Vector.class};
Object content = url.getContent(requestedTypes);
if (content instanceof String) {
  String s = (String) content;
  System.out.println(s);
}
else if (content instanceof InputStream) {
  InputStream in = (InputStream) content;
  int c;
  while ((c = in.read( )) != -1) System.out.write(c);
}
else if (content instanceof Vector) {
  Vector v = (Vector) content;
  for (Enumeration e = v.elements( ); e.hasMoreElements( );) {
    String[] sa = (String[]) e.nextElement( );
    for (int i = 0; i < sa.length; i++) {
      System.out.print(sa[i] + "\t");
    }
    System.out.println( );
  }
}
else {
  System.out.println("Unrecognized content type " + content.getClass( ));
}
To demonstrate this, let’s write a content handler that can be
used in association with the time protocol. Recall that the time
protocol returns the current time at the server as a 4-byte,
big-endian unsigned integer giving the number of seconds since
midnight, January 1, 1900, Greenwich Mean Time. There are several
obvious candidates for storing this data in a Java content handler,
including java.lang.Long (java.lang.Integer won’t work since the
unsigned value may overflow the bounds of an int), java.util.Date,
java.util.Calendar, java.lang.String, and java.io.InputStream, which
often works as a last resort. Example 17.3 provides all five options.
Since there’s no standard MIME type for the time format, we’ll
use application for the type to indicate that this
is binary data, and x-time for the subtype to
indicate that this is a nonstandard extension type. It will be up to
the time protocol handler to return the right content type.
Example 17-3. A Time Content Handler
package com.macfaq.net.www.content.application;

import java.net.*;
import java.io.*;
import java.util.*;

public class x_time extends ContentHandler {

  public Object getContent(URLConnection uc) throws IOException {

    Class[] classes = new Class[1];
    classes[0] = Date.class;
    return this.getContent(uc, classes);

  }

  public Object getContent(URLConnection uc, Class[] classes)
   throws IOException {

    InputStream in = uc.getInputStream( );
    for (int i = 0; i < classes.length; i++) {
      if (classes[i] == InputStream.class) {
        return in;
      }
      else if (classes[i] == Long.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        return new Long(secondsSince1900);
      }
      else if (classes[i] == Date.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        Date time = shiftEpochs(secondsSince1900);
        return time;
      }
      else if (classes[i] == Calendar.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        Date time = shiftEpochs(secondsSince1900);
        Calendar c = Calendar.getInstance( );
        c.setTime(time);
        return c;
      }
      else if (classes[i] == String.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        Date time = shiftEpochs(secondsSince1900);
        return time.toString( );
      }
    }

    return null; // no requested type available

  }

  private long readSecondsSince1900(InputStream in)
   throws IOException {

    long secondsSince1900 = 0;
    for (int j = 0; j < 4; j++) {
      secondsSince1900 = (secondsSince1900 << 8) | in.read( );
    }
    return secondsSince1900;

  }

  private Date shiftEpochs(long secondsSince1900) {

    // The time protocol sets the epoch at 1900, the Java Date class
    // at 1970. This number converts between them.
    long differenceBetweenEpochs = 2208988800L;

    long secondsSince1970 = secondsSince1900 - differenceBetweenEpochs;
    long msSince1970 = secondsSince1970 * 1000;
    Date time = new Date(msSince1970);
    return time;

  }

}
Most of the work is performed by the second getContent( )
method. This checks to see whether it recognizes any of
the classes in the classes
array. If so, it
attempts to convert the content into an object of that type. The
for
loop is arranged so that classes earlier in
the array take precedence; that is, first we try to match the first
class in the array; next we try to match the second class in the
array; then the third class in the array; and so on. As soon as one
class is matched, the method returns so later classes won’t be
matched even if they’re an allowed choice.
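The first-match rule can be isolated in a few lines. The following sketch is not taken from the handler itself; it uses a hypothetical helper, firstSupported( ), and a made-up set of supported types to show that the order of the requested array, not the order in which the handler happens to list its types, decides the outcome.

```java
import java.util.Date;
import java.util.LinkedHashSet;
import java.util.Set;

class PrecedenceSketch {

    // Returns the first requested class the handler supports, or null if
    // none matches. Earlier entries in the requested array win.
    static Class<?> firstSupported(Class<?>[] requested, Set<Class<?>> supported) {
        for (Class<?> candidate : requested) {
            if (supported.contains(candidate)) {
                return candidate;
            }
        }
        return null; // no requested type available
    }

    // A made-up handler that lists Date before String
    static Set<Class<?>> demoSupported() {
        Set<Class<?>> supported = new LinkedHashSet<Class<?>>();
        supported.add(Date.class);
        supported.add(String.class);
        return supported;
    }

    public static void main(String[] args) {
        Class<?>[] requested = {Integer.class, String.class, Date.class};
        // Integer is unsupported, so String, the next request, is chosen
        System.out.println(firstSupported(requested, demoSupported()));
    }
}
```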
Once a type is matched, a simple algorithm converts the four bytes
that the time server sends into the right kind of object, either an
InputStream, a Long, a Date, a Calendar, or a String. The InputStream
conversion is trivial. The Long conversion is one
of those rare times when it seems a little inconvenient that
primitive data types aren’t objects. Although you can convert
to and return any object type, you can’t convert to and return
a primitive data type like long, so we return the
type wrapper class Long instead. The Date and Calendar conversions
require shifting the origin of the time from January 1, 1900 to
January 1, 1970 and changing the units from seconds to milliseconds
as discussed in Chapter 10. Finally, the conversion
to a String simply converts to a Date first, then invokes the
Date object’s toString( ) method.
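The arithmetic can be checked on a concrete value. In the sketch below, the four bytes 0x83 0xAA 0x7E 0x80 encode 2,208,988,800, which is exactly the number of seconds between the two epochs (70 years of 365 days plus 17 leap days, times 86,400 seconds), so they should decode to the start of the Java epoch. The class and method names are hypothetical, and working from a byte array with an explicit length check is an addition made for this illustration.

```java
import java.util.Date;

class EpochSketch {

    // Seconds from January 1, 1900 to January 1, 1970 (both midnight GMT):
    // (70 * 365 + 17) days * 86,400 seconds per day
    static final long SECONDS_1900_TO_1970 = 2208988800L;

    // Combines four big-endian bytes into an unsigned 32-bit value held in a long.
    static long toUnsignedSeconds(byte[] four) {
        if (four.length != 4) {
            throw new IllegalArgumentException("need exactly 4 bytes");
        }
        long value = 0;
        for (int i = 0; i < 4; i++) {
            // Mask with 0xFF so a byte such as (byte) 0x83 counts as 131, not -125
            value = (value << 8) | (four[i] & 0xFF);
        }
        return value;
    }

    // Shifts the epoch from 1900 to 1970 and the units from seconds to milliseconds.
    static Date toDate(long secondsSince1900) {
        return new Date((secondsSince1900 - SECONDS_1900_TO_1970) * 1000L);
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0x83, (byte) 0xAA, 0x7E, (byte) 0x80};
        long seconds = toUnsignedSeconds(wire);
        System.out.println(seconds);                   // 2208988800
        System.out.println((int) seconds);             // negative: the value does not fit in an int
        System.out.println(toDate(seconds).getTime()); // 0
    }
}
```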
While it would be possible to configure a web server to send data of
MIME type application/x-time, this class is really designed to be
used by a custom protocol handler. This handler would know not only
how to speak the time protocol, but also how to return
application/x-time from the getContentType( )
method. Example 17.4 and Example 17.5
demonstrate such a protocol handler. It assumes that time URLs look
like time://vision.poly.edu:3737/.
Example 17-4. The URLConnection for the Time Protocol Handler
package com.macfaq.net.www.protocol.time;

import java.net.*;
import java.io.*;
import com.macfaq.net.www.content.application.*;

public class TimeURLConnection extends URLConnection {

  private Socket connection = null;
  public final static int DEFAULT_PORT = 37;

  public TimeURLConnection (URL u) {
    super(u);
  }

  public String getContentType( ) {
    return "application/x-time";
  }

  public Object getContent( ) throws IOException {
    ContentHandler ch = new x_time( );
    return ch.getContent(this);
  }

  public Object getContent(Class[] classes) throws IOException {
    ContentHandler ch = new x_time( );
    return ch.getContent(this, classes);
  }

  public InputStream getInputStream( ) throws IOException {
    if (!connected) this.connect( );
    return this.connection.getInputStream( );
  }

  public synchronized void connect( ) throws IOException {

    if (!connected) {
      int port = url.getPort( );
      if (port < 0) {
        port = DEFAULT_PORT;
      }
      this.connection = new Socket(url.getHost( ), port);
      this.connected = true;
    }

  }

}
In general, it should be enough for the protocol handler to simply
know or be able to deduce the correct MIME content type. However, in
a case like this where both content and protocol handlers must be
provided, you can tie them a little more closely together by
overriding getContent( ) as well. This allows you
to avoid messing with the java.content.handler.pkgs property or
installing a ContentHandlerFactory. You will still need to set
the java.protocol.handler.pkgs property to point to
your package or install a URLStreamHandlerFactory, however. Example 17.5
is a simple URLStreamHandler for the time protocol handler.
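For comparison, here is a rough sketch of the factory route. It installs a URLStreamHandlerFactory that recognizes the time scheme and returns null for everything else so the built-in handlers still apply. To keep it runnable without a time server, it uses a hypothetical CannedTimeConnection that serves four fixed bytes rather than the socket-based connection of Example 17.4; TimeFactorySketch, TimeFactory, install( ), and contentTypeOf( ) are likewise names invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

class TimeFactorySketch {

    // Hypothetical stand-in for a real time connection: serves four canned bytes.
    static class CannedTimeConnection extends URLConnection {
        CannedTimeConnection(URL u) {
            super(u);
        }

        public void connect() {
            connected = true;
        }

        public String getContentType() {
            return "application/x-time";
        }

        public InputStream getInputStream() {
            return new ByteArrayInputStream(
                new byte[] {(byte) 0x83, (byte) 0xAA, 0x7E, (byte) 0x80});
        }
    }

    // Supplies a handler for the time scheme only.
    static class TimeFactory implements URLStreamHandlerFactory {
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (!"time".equals(protocol)) {
                return null; // let Java fall back to its usual lookup
            }
            return new URLStreamHandler() {
                protected URLConnection openConnection(URL u) {
                    return new CannedTimeConnection(u);
                }
            };
        }
    }

    private static boolean installed = false;

    // The factory can be set only once per virtual machine, so guard the call.
    static synchronized void install() {
        if (!installed) {
            URL.setURLStreamHandlerFactory(new TimeFactory());
            installed = true;
        }
    }

    // Opens the URL and reports the content type its connection claims.
    static String contentTypeOf(String spec) {
        try {
            return new URL(spec).openConnection().getContentType();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        install();
        System.out.println(contentTypeOf("time://localhost/"));
    }
}
```

Because the factory is a VM-wide, set-once setting, the property-based approach is usually less intrusive when other code in the same virtual machine may want to install a factory of its own.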
Example 17-5. The URLStreamHandler for the Time Protocol Handler
package com.macfaq.net.www.protocol.time;

import java.net.*;
import java.io.*;

public class Handler extends URLStreamHandler {

  protected URLConnection openConnection(URL u) throws IOException {
    return new TimeURLConnection(u);
  }

}
We could install the time protocol handler into HotJava, as in the
previous chapter. However, even if we place the time content handler
in HotJava’s class path, HotJava won’t use it.
Consequently, I’ve written a simple standalone application,
shown in Example 17.6, that uses these protocol and
content handlers to tell the time. Notice that it does not need to
import or directly refer to any of the classes involved. It simply
lets the URL
find the right content handler.
Example 17-6. URLTimeClient
import java.net.*;
import java.util.*;
import java.io.*;

public class URLTimeClient {

  public static void main(String[] args) {

    System.setProperty("java.protocol.handler.pkgs",
     "com.macfaq.net.www.protocol");

    try {
      // You can replace this with your own time server
      URL u = new URL("time://tock.usno.navy.mil/");
      Class[] types = {String.class, Date.class,
       Calendar.class, Long.class};
      Object o = u.getContent(types);
      System.out.println(o);
    }
    catch (IOException e) {
      // Let's see what went wrong
      e.printStackTrace( );
    }

  }

}
Here’s a sample run:
D:\JAVA\JNP2\examples\16>java URLTimeClient
Thu Nov 18 08:27:31 PST 1999
In this case, a String
object was returned. This
was the first choice of URLTimeClient
but the last
choice of the content handler. The client choice always takes
precedence.