Extracting HTML from a URL

Problem

You need to extract all the HTML tags from a URL.

Solution

Use this simple HTML tag extractor.

Discussion

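A simple HTML tag extractor can be made by reading one character at a time and looking for the < and > characters that delimit each tag. This is reasonably efficient if a BufferedReader is used, because the reader fetches the underlying stream in blocks and serves each single-character read from its in-memory buffer.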

The ReadTag program shown in Example 17-5 implements this; given a URL, it opens the URL for reading (similar to TextBrowser in Section 17.7) and extracts the HTML tags. Each tag is printed to the standard output.

Example 17-5. ReadTag.java

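import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/** A simple but reusable HTML tag extractor.
 */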
public class ReadTag {
    /** The URL that this ReadTag object is reading */
    protected URL myURL = null;
    /** The Reader for this object */
    protected BufferedReader inrdr = null;
  
    /** Simple main showing one way of using the ReadTag class. */
    public static void main(String[] args) throws MalformedURLException, IOException {
        if (args.length == 0) {
            System.err.println("Usage: ReadTag URL [...]");
            return;
        }

        for (int i=0; i<args.length; i++) {
            // Process each URL argument in turn, not just the first one
            ReadTag rt = new ReadTag(args[i]);
            String tag;
            while ((tag = rt.nextTag(  )) != null) {
                System.out.println(tag);
            }
            rt.close(  );
        }
    }
  
    /** Construct a ReadTag given a URL String */
    public ReadTag(String theURLString) throws 
            IOException, MalformedURLException {

        this(new URL(theURLString));
    }

    /** Construct a ReadTag given a URL */
    public ReadTag(URL theURL) throws IOException {
        myURL = theURL;
        // Open the URL for reading
        inrdr = new BufferedReader(new InputStreamReader(myURL.openStream(  )));
    }

    /** Read the next tag.  */
    public String nextTag(  ) throws IOException {
        int i;
        while ((i = inrdr.read(  )) != -1) {
            char thisChar = (char)i;
            if (thisChar == '<') {
                String tag = readTag(  );
                return tag;
            }
        }
        return null;
    }

    public void close(  ) throws IOException {
        inrdr.close(  );
    }

    /** Read one tag. Adapted from code by Elliotte Rusty Harold */
    protected String readTag(  ) throws IOException {
        // StringBuilder suffices here: the buffer never leaves this method
        StringBuilder theTag = new StringBuilder("<");
        int i = '<';
      
        while (i != '>' && (i = inrdr.read(  )) != -1) {
                theTag.append((char)i);
        }     
        return theTag.toString(  );
    }

    /** Return a String representation of this object */
    public String toString(  ) {
        return "ReadTag[" + myURL.toString(  ) + "]";
    }
}

When I ran it on one system, I got the following output:

darian$ java ReadTag http://localhost/
<HTML>
<HEAD>
<TITLE>
</TITLE>
</HEAD>
<FRAMESET BORDER="0" ROWS="110, *" FRAMESPACING="0">
<FRAME NAME="header" SRC="header.html" SCROLLING="NO" MARGINHEIGHT="0" FRAMEBORDER="0">
<FRAMESET COLS="130, *" FRAMESPACING="0">
<FRAME NAME="menu" SRC="menu.html" SCROLLING="NO" MARGINHEIGHT="0" FRAMEBORDER="0">
<FRAME NAME="main" SRC="main.html" MARGINHEIGHT="15" MARGINWIDTH="15" FRAMEBORDER="0">
</FRAMESET>
</FRAMESET>
</HTML>
darian$
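
The same character-at-a-time scan can be tried without a web server by feeding it an in-memory string. The following is only a minimal sketch under stated assumptions: the class name TagScanDemo and its extractTags methods are hypothetical names introduced for illustration and are not part of Example 17-5. It reads from any Reader rather than a URL and collects the tags into a List instead of returning them one at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class; illustrates the scan used by ReadTag. */
class TagScanDemo {

    /** Collect every "<...>" run found in the given Reader, in order. */
    static List<String> extractTags(Reader source) throws IOException {
        List<String> tags = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '<') {
                    continue;               // skip text between tags
                }
                StringBuilder tag = new StringBuilder("<");
                // Copy characters up to and including the closing '>'
                while ((c = in.read()) != -1) {
                    tag.append((char) c);
                    if (c == '>') {
                        break;
                    }
                }
                tags.add(tag.toString());
            }
        }
        return tags;
    }

    /** Convenience overload (an assumption of this sketch) for in-memory HTML. */
    static List<String> extractTags(String html) {
        try {
            return extractTags(new StringReader(html));
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked just in case
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<HTML><HEAD><TITLE>Demo</TITLE></HEAD><BODY>Hi</BODY></HTML>";
        for (String tag : extractTags(html)) {
            System.out.println(tag);
        }
    }
}
```

Like ReadTag, this sketch treats everything between a < and the next > as one tag, so it makes no attempt to handle comments, scripts, or a > character inside a quoted attribute value.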
