Extracting HTML from a URL

Problem

You need to extract all the HTML tags from the document at a given URL.

Solution

Use this simple HTML tag extractor.

Discussion

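A simple HTML tag extractor can be made by reading one character at a time and looking for the < and > characters that delimit each tag. This is reasonably efficient if a BufferedReader is used, because most single-character reads are then served from an in-memory buffer.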

The ReadTag program shown in Example 17-5 implements this; given a URL, it opens the file (similar to TextBrowser in Section 17.7) and extracts the HTML tags. Each tag is printed to the standard output.

Example 17-5. ReadTag.java

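import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/** A simple but reusable HTML tag extractor.
 */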
public class ReadTag {
    /** The URL that this ReadTag object is reading */
    protected URL myURL = null;
    /** The Reader for this object */
    protected BufferedReader inrdr = null;
  
    /** Simple main showing one way of using the ReadTag class. */
    public static void main(String[] args) throws MalformedURLException, IOException {
        if (args.length == 0) {
            System.err.println("Usage: ReadTag URL [...]");
            return;
        }

        // Process each URL given on the command line
        for (int i = 0; i < args.length; i++) {
            ReadTag rt = new ReadTag(args[i]);
            String tag;
            while ((tag = rt.nextTag(  )) != null) {
                System.out.println(tag);
            }
            rt.close(  );
        }
    }
  
    /** Construct a ReadTag given a URL String */
    public ReadTag(String theURLString) throws 
            IOException, MalformedURLException {

        this(new URL(theURLString));
    }

    /** Construct a ReadTag given a URL */
    public ReadTag(URL theURL) throws IOException {
        myURL = theURL;
        // Open the URL for reading
        inrdr = new BufferedReader(new InputStreamReader(myURL.openStream(  )));
    }

    /** Read the next tag, or return null at end of input. */
    public String nextTag(  ) throws IOException {
        int i;
        while ((i = inrdr.read(  )) != -1) {
            char thisChar = (char)i;
            if (thisChar == '<') {
                String tag = readTag(  );
                return tag;
            }
        }
        return null;
    }

    /** Close the underlying Reader. */
    public void close(  ) throws IOException {
        inrdr.close(  );
    }

    /** Read one tag. Adapted from code by Elliotte Rusty Harold */
    protected String readTag(  ) throws IOException {
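        // StringBuilder suffices here; the buffer never leaves this method
        StringBuilder theTag = new StringBuilder("<");
        int i = '<';

        // Append characters up to and including the closing '>'
        while (i != '>' && (i = inrdr.read(  )) != -1) {
            theTag.append((char)i);
        }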
        return theTag.toString(  );
    }

    /** Return a String representation of this object */
    @Override
    public String toString(  ) {
        return "ReadTag[" + myURL.toString(  ) + "]";
    }
}

When I ran it on one system, I got the following output:

darian$ java ReadTag http://localhost/
<HTML>
<HEAD>
<TITLE>
</TITLE>
</HEAD>
<FRAMESET BORDER="0" ROWS="110, *" FRAMESPACING="0">
<FRAME NAME="header" SRC="header.html" SCROLLING="NO" MARGINHEIGHT="0" FRAMEBORDER="0">
<FRAMESET COLS="130, *" FRAMESPACING="0">
<FRAME NAME="menu" SRC="menu.html" SCROLLING="NO" MARGINHEIGHT="0" FRAMEBORDER="0">
<FRAME NAME="main" SRC="main.html" MARGINHEIGHT="15" MARGINWIDTH="15" FRAMEBORDER="0">
</FRAMESET>
</FRAMESET>
</HTML>
darian$
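
The character-at-a-time scan described in the Discussion can also be tried without a network connection. The following is a minimal sketch, not part of the original example: the class name TagScanSketch and the method extractTags are hypothetical, and it assumes the HTML is already available through a java.io.Reader (here a StringReader over a made-up string). Like ReadTag, it does no real HTML parsing, so a > inside a comment or an attribute value ends the tag early.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the original ReadTag example. */
public class TagScanSketch {

    /** Collect every "<...>" run found in the given character stream. */
    public static List<String> extractTags(Reader in) throws IOException {
        List<String> tags = new ArrayList<>();
        BufferedReader rdr = new BufferedReader(in);
        int c;
        while ((c = rdr.read()) != -1) {
            if (c != '<') {
                continue;               // skip text between tags
            }
            StringBuilder tag = new StringBuilder("<");
            // Copy characters up to and including the closing '>'
            while ((c = rdr.read()) != -1) {
                tag.append((char) c);
                if (c == '>') {
                    break;
                }
            }
            tags.add(tag.toString());   // an unterminated tag at EOF is kept as-is
        }
        return tags;
    }

    public static void main(String[] args) throws IOException {
        // Sample input made up for this sketch
        String html = "<HTML><HEAD><TITLE>Hi</TITLE></HEAD></HTML>";
        for (String tag : extractTags(new StringReader(html))) {
            System.out.println(tag);
        }
    }
}
```

Run as written, main prints the six tags in the sample string, one per line, and skips the text "Hi" between them. Taking a Reader rather than a URL is a design choice made for this sketch so that the scanning logic can be exercised on strings and files as well as on network streams.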