Extracting URLs from a File

Problem

You need to extract just the URLs from a file.

Solution

Use ReadTag from Section 17.8, and just look for tags that might contain URLs.

Discussion

The program in Example 17-6 uses ReadTag from the previous recipe and checks each tag against the “wanted tags” listed in the array wantTags: A (anchor), APPLET, IMG (image), and FRAME. Matching is by prefix, so a tag qualifies only if it begins with the all-lowercase or all-uppercase tag name followed by a space. Each wanted tag is added to a list, and main prints the whole tag, not just the URL inside it.

Example 17-6. GetURLs.java

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

public class GetURLs {
    /** The tag reader (ReadTag comes from the previous recipe) */
    ReadTag reader;

    public GetURLs(URL theURL) throws IOException {
        reader = new ReadTag(theURL);
    }

    public GetURLs(String theURL) throws MalformedURLException, IOException {
        reader = new ReadTag(theURL);
    }

    /** The tags we want to look at, in lowercase and uppercase forms */
    public final static String[] wantTags = {
        "<a ", "<A ",
        "<applet ", "<APPLET ",
        "<img ", "<IMG ",
        "<frame ", "<FRAME ",
    };

    /** Collect every tag that starts with one of the wantTags prefixes. */
    public ArrayList<String> getURLs() throws IOException {
        ArrayList<String> al = new ArrayList<String>();
        String tag;
        while ((tag = reader.nextTag()) != null) {
            for (int i = 0; i < wantTags.length; i++) {
                if (tag.startsWith(wantTags[i])) {
                    al.add(tag);
                    break;        // optimization: skip the remaining prefixes
                }
            }
        }
        return al;
    }

    public void close() throws IOException {
        if (reader != null)
            reader.close();
    }

    public static void main(String[] argv) throws
            MalformedURLException, IOException {
        String theURL = argv.length == 0 ?
            "http://localhost/" : argv[0];
        GetURLs gu = new GetURLs(theURL);
        try {
            // Print each wanted tag in full, one per line
            for (String tag : gu.getURLs()) {
                System.out.println(tag);
            }
        } finally {
            gu.close();          // release the underlying reader
        }
    }
}

The GetURLs program prints the tags that contain URLs in a given web page:

darian$ java GetURLs http://daroad
<IMG SRC="ian.gif">
<A ID="LinkLocal" HREF="webserver/index.html">
<A HREF="quizzes/">
<A HREF="servlets/IsItWorking">
<A HREF="demo.jsp">
<A ID=LinkRemote HREF="http://java.sun.com">
<A ID=LinkRemote HREF="http://www.openbsd.org">
<A ID=LinkRemote HREF="http://www.cpg.com">
<A ID=LinkRemote HREF="http://www.ssc.com">
<A ID=LinkRemote HREF="http://www.learningtree.com">
<A ID=LinkLocal HREF="javacook.html">
<A ID=LinkRemote HREF="http://java.oreilly.com">
<A ID=LinkLocal HREF="lookup/index.htm">
<A ID=LinkLocal HREF="readings/index.html">
<A ID=LinkLocal HREF="download.html">
<IMG SRC="miniduke.gif" BORDER=0>
darian$
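
Every tag in the run above is written in uppercase, which the prefix list handles. A mixed-case tag such as <Img SRC="x.gif"> would not match either form in wantTags. The following is a minimal sketch of a case-insensitive variant of the check, not part of Example 17-6: the class TagFilter and its methods are hypothetical names, and hardcoded strings stand in for the tags that ReadTag would supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the recipe: a case-insensitive tag check.
class TagFilter {
    /** Wanted tag names in lowercase, without the angle bracket (mirrors wantTags). */
    static final String[] WANTED = { "a", "applet", "img", "frame" };

    /** True if tag is '<', then a wanted name in any letter case, then whitespace. */
    static boolean isWanted(String tag) {
        for (String name : WANTED) {
            int end = 1 + name.length();   // index of the char after the name
            if (tag.length() > end
                    && tag.charAt(0) == '<'
                    && tag.regionMatches(true, 1, name, 0, name.length())
                    && Character.isWhitespace(tag.charAt(end))) {
                return true;
            }
        }
        return false;
    }

    /** Keep only the wanted tags, preserving their order. */
    static List<String> filter(List<String> tags) {
        List<String> kept = new ArrayList<String>();
        for (String tag : tags) {
            if (isWanted(tag)) {
                kept.add(tag);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample tags standing in for ReadTag output (assumed data).
        List<String> sample = Arrays.asList(
            "<A HREF=\"quizzes/\">",
            "<Img SRC=\"miniduke.gif\">",
            "<address>",
            "<P>");
        for (String tag : filter(sample)) {
            System.out.println(tag);
        }
    }
}
```

Requiring whitespace after the name keeps <address> from being mistaken for an anchor, which is the same job the trailing space in each wantTags entry does.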

The LinkChecker program in Section 17.12 extracts the HREF or SRC attributes from these tags and validates them.
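
LinkChecker itself is not shown in this recipe. As a rough illustration of the extraction step only, here is a minimal sketch that pulls the HREF or SRC value out of a tag string like those printed above. The class UrlAttribute, its extract method, and the regular expression are assumptions made for this example, not the code from Section 17.12, and the sketch does no validation.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the book's LinkChecker: extracts an HREF/SRC value.
class UrlAttribute {
    // HREF or SRC, '=', then a double-quoted, single-quoted, or unquoted value.
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first HREF or SRC value in the tag, if there is one. */
    static Optional<String> extract(String tag) {
        Matcher m = URL_ATTR.matcher(tag);
        if (!m.find()) {
            return Optional.empty();
        }
        // Exactly one of the three capture groups holds the value.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return Optional.of(m.group(g));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Tags copied from the sample run above.
        String[] tags = {
            "<IMG SRC=\"ian.gif\">",
            "<A ID=LinkRemote HREF=\"http://java.sun.com\">",
            "<A ID=\"LinkLocal\" HREF=\"webserver/index.html\">",
        };
        for (String tag : tags) {
            System.out.println(extract(tag).orElse("(no URL attribute)"));
        }
    }
}
```

A regular expression is a shortcut here. It ignores attributes such as CODE and CODEBASE on APPLET tags, and it can be fooled by the text href= appearing inside another attribute's value, so a real checker would want a proper attribute parser.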
