The ContentHandler Class

A subclass of ContentHandler overrides the getContent( ) method to return an object that’s the Java equivalent of the content. This method can be quite simple or quite complex, depending almost entirely on the complexity of the content type you’re trying to parse. A text/plain content handler is quite simple; a text/rtf content handler would be very complex.

The ContentHandler class has only a simple noargs constructor:

public ContentHandler(  )

Since ContentHandler is an abstract class, you never call its constructor directly, only from inside the constructors of subclasses.

The primary method of the class, albeit an abstract one, is getContent( ) :

public abstract Object getContent(URLConnection uc) throws IOException

This method is normally called only from inside the getContent( ) method of a URLConnection object. It is overridden in a subclass that is specific to the type of content being handled. getContent( ) should use the URLConnection’s InputStream to create an object. There are no rules about what type of object a content handler should return. In general, this depends on what the application requesting the content expects. Content handlers for text-like content bundled with the JDK return some subclass of InputStream. Content handlers for images return ImageProducer objects.

The getContent( ) method of a content handler does not get the full InputStream that the URLConnection has access to. The InputStream that a content handler sees should include only the content’s raw data. Any MIME headers or other protocol-specific information that come from the server should be stripped by the URLConnection before it passes the stream to the ContentHandler. A ContentHandler is responsible only for content, not for any protocol overhead that may be present. The URLConnection should have already performed any necessary handshaking with the server and interpreted any headers it sends.
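This division of labor can be sketched in a few lines. The names ContentOnlyDemo, PlainTextHandler, and StubConnection are hypothetical; the stub simply stands in for a URLConnection that has already consumed the headers, and the handler is only an illustration, not the implementation of any handler bundled with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ContentHandler;
import java.net.URL;
import java.net.URLConnection;

class ContentOnlyDemo {

  // Hypothetical handler: it sees only the body and returns it as a String.
  static class PlainTextHandler extends ContentHandler {
    public Object getContent(URLConnection uc) throws IOException {
      InputStream in = uc.getInputStream();
      ByteArrayOutputStream body = new ByteArrayOutputStream();
      int b;
      while ((b = in.read()) != -1) body.write(b);
      // Assumes a Latin-1 body; a real handler would consult the charset
      return body.toString("ISO-8859-1");
    }
  }

  // Hypothetical stand-in for a connection whose headers are already stripped.
  static class StubConnection extends URLConnection {
    private final byte[] body;
    StubConnection(URL u, byte[] body) {
      super(u);
      this.body = body;
    }
    public void connect() {
      connected = true;
    }
    public InputStream getInputStream() {
      return new ByteArrayInputStream(body);
    }
  }

  // Feeds a body through the stub connection and the handler.
  static String read(String body) {
    try {
      URL u = new URL("http://www.example.com/");
      URLConnection uc = new StubConnection(u, body.getBytes("ISO-8859-1"));
      return (String) new PlainTextHandler().getContent(uc);
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(read("Only the content reaches the handler"));
  }
}
```

Nothing in the handler knows or cares which protocol delivered the bytes, which is the point of keeping header processing in the URLConnection.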

A Content Handler for Tab-Separated Values

To see how content handlers work, let’s create a ContentHandler that handles the text/tab-separated-values content type. We aren’t concerned with how the tab-separated values get to us. That’s for a protocol handler to deal with. All a ContentHandler needs to know is the MIME type and format of the data.

Tab-separated values are produced by many database and spreadsheet programs. A tab-separated file may look something like this. Tabs are indicated by arrows:

JPE Associates → 341 Lafayette Street, Suite 1025 → New York → NY → 10012
O’Reilly & Associates → 103 Morris Street, Suite A → Sebastopol → CA → 95472

In database parlance, each line is a record, and the data before each tab is a field. It is usually (though not necessarily) true that each field has the same meaning in each record. In the previous example, the first field is the company name.

The first question to ask is: what kind of Java object should we convert the tab-separated values to? The simplest and most general way to store each record is as an array of Strings. Successive records can be collected in a Vector. In many applications, however, you have a great deal more knowledge about the exact format and meaning of the data than we do here. The more you know about the data you’re dealing with, the better a ContentHandler you can write. For example, if you know that the data you’re downloading represents U.S. addresses, then you could define a class like this:

public class Address {

  private String name;
  private String street;
  private String city;
  private String state;
  private String zip;
  
}

This class would also have appropriate constructors and other methods to represent each record. In this example, we don’t know anything about the data in advance, or how many records we’ll have to store. Therefore, we will take the most general approach and convert each record into an array of strings, using a Vector to store each array until there are no more records. The getContent( ) method can return the Vector of String arrays.

Example 17.1 shows the code for such a ContentHandler. The full package-qualified name is com.macfaq.net.www.content.text.tab_separated_values. This unusual class name follows the naming convention for a content handler for the MIME type text/tab-separated-values. Since MIME types often contain hyphens, as in this example, a convention exists to replace these with the underscore (_). Thus text/tab-separated-values becomes text.tab_separated_values. To install this content handler, all that’s needed is to put the compiled .class file somewhere the class loader can find it and set the java.content.handler.pkgs property to com.macfaq.net.www.content.
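The naming convention just described can be written down as a small helper. This is only a sketch of the rule as stated here (slash becomes a dot, hyphen becomes an underscore); the class HandlerNames and its method are hypothetical, and the lookup code inside the JDK may normalize other characters as well.

```java
class HandlerNames {

  // Builds the class name a content handler for mimeType would have
  // inside the given package prefix, following the convention in the text.
  static String classNameFor(String packagePrefix, String mimeType) {
    String relative = mimeType.replace('/', '.').replace('-', '_');
    return packagePrefix + "." + relative;
  }

  public static void main(String[] args) {
    System.out.println(
     classNameFor("com.macfaq.net.www.content", "text/tab-separated-values"));
  }
}
```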

Example 17-1. A ContentHandler for text/tab-separated-values

package com.macfaq.net.www.content.text;

import java.net.*;
import java.io.*;
import java.util.*;
import com.macfaq.io.SafeBufferedReader;  // From Chapter 4


public class tab_separated_values extends ContentHandler {

  public Object getContent(URLConnection uc) throws IOException {

    String theLine;
    Vector v = new Vector(  );

    InputStreamReader isr = new InputStreamReader(uc.getInputStream(  ));
    SafeBufferedReader in = new SafeBufferedReader(isr);
    while ((theLine = in.readLine(  )) != null) {
      String[] linearray = lineToArray(theLine);
      v.addElement(linearray);
    }

    return v; 

  }

  private String[] lineToArray(String line)  {

    int numFields = 1;
    for (int i = 0; i < line.length(  ); i++) {
      if (line.charAt(i) == '\t') numFields++;
    }
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer buffer = new StringBuffer(  );
      while (position < line.length(  ) && line.charAt(position) != '\t') {
        buffer.append(line.charAt(position));
        position++;
      }
      fields[i] = buffer.toString(  );
      position++;
    }

    return fields;

  }
}

Example 17.1 has two methods. The private utility method lineToArray( ) converts a tab-separated string into an array of strings. This method is for the private use of this subclass and is not part of the ContentHandler API. The more complicated the content you’re trying to parse, the more such methods your class will need. The lineToArray( ) method begins by counting the number of tabs in the string. This sets the numFields variable to one more than the number of tabs. An array of length numFields is created for the fields. Then a for loop fills the array with the strings between the tabs, and the array is returned.

Note

You may have expected a StringTokenizer to split the line into parts. However, that class has unusual ideas about what makes up a token. In particular, it would interpret multiple tabs in a row as a single delimiter. That is, it never returns an empty string as a token.
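The difference is easy to see with a record whose middle field is empty. The following sketch compares StringTokenizer with String.split( ) called with a negative limit; note that String.split( ) only arrived in Java 1.4, later than the releases discussed here, so it is shown purely for comparison. The class name TokenizerComparison is hypothetical.

```java
import java.util.StringTokenizer;

class TokenizerComparison {

  // Number of tokens StringTokenizer reports; consecutive tabs collapse.
  static int tokenizerCount(String line) {
    return new StringTokenizer(line, "\t").countTokens();
  }

  // Number of fields when every tab is treated as a separator.
  // The negative limit keeps empty strings, including trailing ones.
  static int splitCount(String line) {
    return line.split("\t", -1).length;
  }

  public static void main(String[] args) {
    String line = "Sebastopol\t\t95472";  // the state field is empty
    System.out.println("StringTokenizer: " + tokenizerCount(line));
    System.out.println("split:           " + splitCount(line));
  }
}
```

With StringTokenizer the empty field silently disappears, so the zip code would land in the wrong column of the array.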

The getContent( ) method starts by instantiating a Vector. Then it gets the InputStream from the URLConnection uc and chains this to an InputStreamReader, which is in turn chained to the SafeBufferedReader introduced in Chapter 4, so it can read it one line at a time in a while loop. Each line is fed to the lineToArray( ) method, which splits it into a String array. This array is then added to the Vector. When no more lines are left, the loop exits and the Vector is returned.

Using Content Handlers

Now that you’ve written your first ContentHandler, let’s see how to use it in a program. Files of MIME type text/tab-separated-values can be served by gopher servers, HTTP servers, FTP servers, and more. Let’s assume you’re retrieving a tab-separated-values file from an HTTP server. The filename should end with the .tsv or .tab extension so that the server knows it’s a text/tab-separated-values file.

Note

Not all servers are configured to support this type out of the box. Consult your server documentation to see how to set up a MIME-type mapping for your server. For instance, to configure my Apache server, I added these lines to my .htaccess file:

AddType text/tab-separated-values tab
AddType text/tab-separated-values tsv

You can test the web server configuration by connecting to port 80 of the web server with Telnet and requesting the file manually:

% telnet metalab.unc.edu 80
Trying 127.0.0.1...
Connected to metalab.unc.edu.
Escape character is `^]'.
GET /javafaq/addresses.tab HTTP/1.0

HTTP/1.0 200 OK
Date: Mon, 15 Nov 1999 18:36:51 GMT
Server: Apache/1.3.4 (Unix) PHP/3.0.6 mod_perl/1.17
Last-Modified: Thu, 04 Nov 1999 18:22:51 GMT
Content-type: text/tab-separated-values
Content-length: 163

JPE Associates 341 Lafayette Street, Suite 1025 New York NY 10012
O'Reilly & Associates 103 Morris Street, Suite A Sebastopol CA 95472
Connection closed by foreign host.

You’re looking for a line that says Content-type: text/tab-separated-values. If you see a Content-type of text/plain, application/octet-stream, or some other value, or you don’t see any Content-type at all, the server is misconfigured and must be fixed before you continue.

The application that uses the tab-separated-values content handler does not need to know about it explicitly. It simply has to call the getContent( ) method of URL or URLConnection on a URL with a matching MIME type. Furthermore, the package where the content handler can be found has to be listed in the java.content.handler.pkgs property.

Example 17.2 is a class that downloads and prints a text/tab-separated-values file using the ContentHandler of Example 17.1. However, note that it does not import com.macfaq.net.www.content.text and that it never references the tab_separated_values class. It does explicitly add com.macfaq.net.www.content to the java.content.handler.pkgs property because that’s the simplest way to make sure this standalone program works. However, the lines that do that could be deleted if the property were set in a property file or from the command line, and indeed this is required in Java 1.1.

Example 17-2. The tab-separated-values ContentTester Class

import java.io.*;
import java.net.*;
import java.util.*;

public class TSVContentTester {

  private static void test(URL u) throws IOException {
  
    Object content = u.getContent(  );
    Vector v = (Vector) content;
    for (Enumeration e = v.elements() ; e.hasMoreElements(  ) ;) {
      String[] sa = (String[]) e.nextElement(  );
      for (int i = 0; i < sa.length; i++) {
        System.out.print(sa[i] + "\t");
      }
      System.out.println(  );
    } 

  }

  public static void main (String[] args) {  
    
    // If you uncomment these lines in Java 1.2, then you don't
    // have to set the java.content.handler.pkgs property from the
    // command line or your properties files.

/*    String pkgs = System.getProperty("java.content.handler.pkgs", "");
    if (!pkgs.equals("")) {
      pkgs = pkgs + "|";
    }
    pkgs += "com.macfaq.net.www.content";      
    System.setProperty("java.content.handler.pkgs", pkgs);  */  

    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        test(u);
      }
      catch (MalformedURLException e) {
        System.err.println(args[i] + " is not a good URL"); 
      }
      catch (Exception e) {
        e.printStackTrace(  );
      }
    }
  }
}

Here’s how you run this program in Java 1.1 and 1.3. The arrows indicate tabs:

% java -Djava.content.handler.pkgs=com.macfaq.net.www.content 
                TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
JPE Associates → 341 Lafayette Street, Suite 1025 → New York → NY → 10012
O'Reilly & Associates → 103 Morris Street, Suite A → Sebastopol → CA → 95472

Java 1.2 is trickier because the new class-loading policy used in Java 1.2 prevents content handlers from being loaded from the local class path. This bug is fixed in Java 1.3. However, in Java 1.2, simply running Example 17.2 will result in a ClassCastException. Since the custom content handler isn’t found, getContent( ) returns an InputStream (specifically a sun.net.www.MeteredStream) instead:

% java TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
java.lang.ClassCastException: sun.net.www.MeteredStream
        at TSVContentTester.test(TSVContentTester.java:10)
        at TSVContentTester.main(TSVContentTester.java:35)

There are a couple of ways around this. You can use the oldjava interpreter instead. This does load content handlers from the local class path. For example:

% oldjava TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
JPE Associates → 341 Lafayette Street, Suite 1025 → New York → NY → 10012
O'Reilly & Associates → 103 Morris Street, Suite A → Sebastopol → CA → 95472

Alternatively, you can use the nonstandard -Xbootclasspath command-line switch to tell Java where to find the content handlers. For example, this line tells Java to look in the current directory for content handlers and other files before looking in the rt.jar file (the exact location of the rt.jar file will have to be adjusted to match your system):

% java -Xbootclasspath:.:/usr/local/jdk1.3/jre/lib/rt.jar 
TSVContentTester http://metalab.unc.edu/javafaq/addresses.tab
JPE Associates → 341 Lafayette Street, Suite 1025 → New York → NY → 10012
O'Reilly & Associates → 103 Morris Street, Suite A → Sebastopol → CA → 95472

The bug that requires these workarounds is present in all versions of Java 1.2, though it doesn’t manifest itself in applets because of the different nature of the ClassLoader that a web browser uses. Of course, the ultimate solution is to use a ContentHandlerFactory, an option I’ll discuss later.

Choosing Return Types

Java 1.3 adds one overloaded variant of the getContent( ) method to the ContentHandler class:

public Object getContent(URLConnection uc, Class[] classes) // Java 1.3
 throws IOException

The difference is the array of java.lang.Class objects passed as the second argument. This allows the caller to request that the content be returned as one of the types in the array and enables content handlers to support multiple types. For example, the text/tab-separated-values content handler could return data as a Vector, an array, a string, or an InputStream. One would be the default used by the single argument getContent( ) method, while the others would be options that a client could request. If the client doesn’t request any of the classes this ContentHandler knows how to provide, then it returns null.

To call this method, the client invokes the getContent(Class[] classes) method of a URL or URLConnection object, which passes the array along to the content handler. The client lists the Class objects in the order in which it prefers to receive the data. Thus, if it prefers to receive a String but is willing to accept an InputStream and will take a Vector as a last resort, then it would put String.class in the zeroth component of the array, InputStream.class in the first component of the array, and Vector.class in the last component of the array. Then it would use instanceof to test what was actually returned and either process it or convert it into the preferred type. For example:

Class[] requestedTypes = {String.class, InputStream.class, 
 Vector.class};
Object content = url.getContent(requestedTypes);
if (content instanceof String) {
  String s = (String) content;
  System.out.println(s);
}
else if (content instanceof InputStream) {
  InputStream in = (InputStream) content;
  int c;
  while ((c = in.read(  )) != -1) System.out.write(c);
}
else if (content instanceof Vector) {
  Vector v = (Vector) content;
  for (Enumeration e = v.elements() ; e.hasMoreElements(  ) ;) {
    String[] sa = (String[]) e.nextElement(  );
    for (int i = 0; i < sa.length; i++) {
      System.out.print(sa[i] + "\t");
    }
    System.out.println(  );
  }
}
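else if (content == null) {
  // The handler could not provide any of the requested types
  System.out.println("None of the requested types is available");
}
else {
  System.out.println("Unrecognized content type " + content.getClass(  ));
}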

To demonstrate this, let’s write a content handler that can be used in association with the time protocol. Recall that the time protocol returns the current time at the server as a 4-byte, big-endian unsigned integer giving the number of seconds since midnight, January 1, 1900, Greenwich Mean Time. There are several obvious candidates for storing this data in a Java content handler, including java.lang.Long (java.lang.Integer won’t work since the unsigned value may overflow the bounds of an int), java.util.Date, java.util.Calendar, java.lang.String, and java.io.InputStream, which often works as a last resort. Example 17.3 provides all five options. Since there’s no standard MIME type for the time format, we’ll use application for the type to indicate that this is binary data, and x-time for the subtype to indicate that this is a nonstandard extension type. It will be up to the time protocol handler to return the right content type.

Example 17-3. A Time Content Handler

package com.macfaq.net.www.content.application;

import java.net.*;
import java.io.*;
import java.util.*;


public class x_time extends ContentHandler {

  public Object getContent(URLConnection uc) throws IOException {

    Class[] classes = new Class[1];
    classes[0] = Date.class;
    return this.getContent(uc, classes); 

  }

  public Object getContent(URLConnection uc, Class[] classes)
   throws IOException {
    
    InputStream in = uc.getInputStream(  );
    for (int i = 0; i < classes.length; i++) {
      if (classes[i] == InputStream.class) {
        return in;  
      } 
      else if (classes[i] == Long.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        return new Long(secondsSince1900);
      }
      else if (classes[i] == Date.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        Date time = shiftEpochs(secondsSince1900);
        return time;
      }
      else if (classes[i] == Calendar.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        Date time = shiftEpochs(secondsSince1900);
        Calendar c = Calendar.getInstance(  );
        c.setTime(time);
        return c;
      }
      else if (classes[i] == String.class) {
        long secondsSince1900 = readSecondsSince1900(in);
        Date time = shiftEpochs(secondsSince1900);
        return time.toString(  );
      }      
    }
    
    return null; // no requested type available
    
  }
  
  private long readSecondsSince1900(InputStream in) 
   throws IOException {
    
    long secondsSince1900 = 0;
    for (int j = 0; j < 4; j++) {
      secondsSince1900 = (secondsSince1900 << 8) | in.read(  );
    }
    return secondsSince1900;
    
  }
  
  private Date shiftEpochs(long secondsSince1900) {
  
    // The time protocol sets the epoch at 1900, the Java Date class
    //  at 1970. This number converts between them.
    long differenceBetweenEpochs = 2208988800L;
    
    long secondsSince1970 = secondsSince1900 - differenceBetweenEpochs;       
    long msSince1970 = secondsSince1970 * 1000;
    Date time = new Date(msSince1970);
    return time;
    
  }
}

Most of the work is performed by the second getContent( ) method. This checks to see whether it recognizes any of the classes in the classes array. If so, it attempts to convert the content into an object of that type. The for loop is arranged so that classes earlier in the array take precedence; that is, first we try to match the first class in the array; next we try to match the second class in the array; then the third class in the array; and so on. As soon as one class is matched, the method returns so later classes won’t be matched even if they’re an allowed choice.
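The precedence rule can be isolated from the conversion code. The following sketch, with the hypothetical class TypeNegotiation, returns the first requested class that a handler supports, which is the decision the for loop in Example 17.3 makes implicitly. It illustrates the ordering only and is not how the JDK itself is implemented.

```java
import java.io.InputStream;
import java.util.Calendar;
import java.util.Date;

class TypeNegotiation {

  // Returns the first class in requested that also appears in supported,
  // or null if the two arrays have nothing in common.
  static Class<?> firstSupported(Class<?>[] requested, Class<?>[] supported) {
    for (Class<?> r : requested) {
      for (Class<?> s : supported) {
        if (r == s) return r;  // the caller's order decides
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // The handler's own order of preference plays no role in the outcome
    Class<?>[] supported = {InputStream.class, Long.class, Date.class,
     Calendar.class, String.class};
    Class<?>[] requested = {String.class, Date.class};
    System.out.println(firstSupported(requested, supported));
  }
}
```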

Once a type is matched, a simple algorithm converts the four bytes that the time server sends into the right kind of object, either an InputStream, a Long, a Date, a Calendar, or a String. The InputStream conversion is trivial. The Long conversion is one of those rare times when it seems a little inconvenient that primitive data types aren’t objects. Although you can convert to and return any object type, you can’t convert to and return a primitive data type like long, so we return the type wrapper class Long instead. The Date and Calendar conversions require shifting the origin of the time from January 1, 1900 to January 1, 1970 and changing the units from seconds to milliseconds as discussed in Chapter 10. Finally, the conversion to a String simply converts to a Date first, then invokes the Date object’s toString( ) method.
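To make the arithmetic concrete, here is a sketch that decodes a canned 4-byte response instead of talking to a server. The class name TimeConversion and the decode( ) helper are hypothetical. Unlike Example 17.3, this version also checks for a premature end of stream, which is an addition of mine rather than something the time protocol requires.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Date;

class TimeConversion {

  // Seconds between January 1, 1900 and January 1, 1970 (GMT)
  static final long DIFFERENCE_BETWEEN_EPOCHS = 2208988800L;

  // Reads a 4-byte, big-endian unsigned integer into a long.
  static long readSecondsSince1900(InputStream in) throws IOException {
    long seconds = 0;
    for (int i = 0; i < 4; i++) {
      int b = in.read();
      if (b == -1) throw new EOFException("Fewer than 4 bytes received");
      seconds = (seconds << 8) | b;
    }
    return seconds;
  }

  // Shifts the epoch from 1900 to 1970 and converts seconds to milliseconds.
  static Date toDate(long secondsSince1900) {
    return new Date((secondsSince1900 - DIFFERENCE_BETWEEN_EPOCHS) * 1000L);
  }

  // Convenience wrapper for decoding a response held in memory.
  static Date decode(byte[] response) {
    try {
      return toDate(readSecondsSince1900(new ByteArrayInputStream(response)));
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // 0x83AA7E80 is 2208988800, which is exactly midnight, January 1, 1970
    byte[] response = {(byte) 0x83, (byte) 0xAA, (byte) 0x7E, (byte) 0x80};
    System.out.println(decode(response).getTime());  // prints 0
  }
}
```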

While it would be possible to configure a web server to send data of MIME type application/x-time, this class is really designed to be used by a custom protocol handler. This handler would know not only how to speak the time protocol, but also how to return application/x-time from the getContentType( ) method. Example 17.4 and Example 17.5 demonstrate such a protocol handler. It assumes that time URLs look like time://vision.poly.edu:3737/.

Example 17-4. The URLConnection for the Time Protocol Handler

package com.macfaq.net.www.protocol.time;

import java.net.*;
import java.io.*;
import com.macfaq.net.www.content.application.*;

public class TimeURLConnection extends URLConnection {

  private Socket connection = null;
  public final static int DEFAULT_PORT = 37;

  public TimeURLConnection (URL u) {
    super(u);
  }

  public String getContentType(  ) {
    return "application/x-time";
  }

  public Object getContent(  ) throws IOException {
    ContentHandler ch = new x_time(  );
    return ch.getContent(this);
  }

  public Object getContent(Class[] classes) throws IOException { 
    ContentHandler ch = new x_time(  );
    return ch.getContent(this, classes);
  }

  public InputStream getInputStream(  ) throws IOException {
    if (!connected) this.connect(  );
    return this.connection.getInputStream(  );
  }

  public synchronized void connect(  ) throws IOException {
  
    if (!connected) {
      int port = url.getPort(  );
      if ( port < 0) {
        port = DEFAULT_PORT;
      }
      this.connection = new Socket(url.getHost(  ), port);
      this.connected = true;
    } 
  }
}

In general, it should be enough for the protocol handler to simply know or be able to deduce the correct MIME content type. However, in a case like this where both content and protocol handlers must be provided, you can tie them a little more closely together by overriding getContent( ) as well. This allows you to avoid messing with the java.content.handler.pkgs property or installing a ContentHandlerFactory. You will still need to set the java.protocol.handler.pkgs property to point to your package or install a URLStreamHandlerFactory, however. Example 17.5 is a simple URLStreamHandler for the time protocol handler.

Example 17-5. The URLStreamHandler for the Time Protocol Handler

package com.macfaq.net.www.protocol.time;

import java.net.*;
import java.io.*;
public class Handler extends URLStreamHandler {

  protected URLConnection openConnection(URL u) throws IOException {
    return new TimeURLConnection(u);
  }
}
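If you would rather install a URLStreamHandlerFactory than set a system property, the factory might look something like the following sketch. The class names here are hypothetical, and the placeholder handler exists only so that the example compiles on its own; in real use the factory would return the Handler class shown above instead. Keep in mind that URL.setURLStreamHandlerFactory( ) can be called at most once per virtual machine.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

class TimeHandlerFactoryDemo {

  // Placeholder standing in for the real time protocol Handler.
  static class PlaceholderTimeHandler extends URLStreamHandler {
    protected int getDefaultPort() {
      return 37;  // the standard port for the time protocol
    }
    protected URLConnection openConnection(URL u) throws IOException {
      throw new IOException("Placeholder handler; no connection support");
    }
  }

  // Hands out a handler for time URLs and defers everything else.
  static class TimeHandlerFactory implements URLStreamHandlerFactory {
    public URLStreamHandler createURLStreamHandler(String protocol) {
      if ("time".equalsIgnoreCase(protocol)) {
        return new PlaceholderTimeHandler();
      }
      return null;  // null lets Java fall back to its usual lookup
    }
  }

  public static void main(String[] args) throws Exception {
    URL.setURLStreamHandlerFactory(new TimeHandlerFactory());
    URL u = new URL("time://vision.poly.edu:3737/");
    System.out.println(u.getProtocol() + " on port " + u.getPort());
  }
}
```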

We could install the time protocol handler into HotJava, as in the previous chapter. However, even if we place the time content handler in HotJava’s class path, HotJava won’t use it. Consequently, I’ve written a simple standalone application, shown in Example 17.6, that uses these protocol and content handlers to tell the time. Notice that it does not need to import or directly refer to any of the classes involved. It simply lets the URL find the right content handler.

Example 17-6. URLTimeClient

import java.net.*;
import java.util.*;
import java.io.*;

public class URLTimeClient {

  public static void main(String[] args) {
  
    System.setProperty("java.protocol.handler.pkgs", 
     "com.macfaq.net.www.protocol");
  
    try {
      // You can replace this with your own time server
      URL u = new URL("time://tock.usno.navy.mil/");
      Class[] types = {String.class, Date.class, 
       Calendar.class, Long.class};
      Object o = u.getContent(types);
      System.out.println(o);
    }
    catch (IOException e) {
     // Let's see what went wrong
     e.printStackTrace(  ); 
    }
  }
}

                  

Here’s a sample run:

D:\JAVA\JNP2\examples\16>java URLTimeClient
Thu Nov 18 08:27:31 PST 1999

In this case, a String object was returned. This was the first choice of URLTimeClient but the last choice of the content handler. The client choice always takes precedence.
