Extracting URLs from a File

Problem

You need to extract just the URLs from a file.

Solution

Use ReadTag from Section 17.8 and keep only the tags that might contain URLs.

Discussion

The program in Example 17-6 uses ReadTag from the previous recipe and checks each tag against the “wanted tags” listed in the array wantTags: A (anchor), APPLET, IMG (image), and FRAME. Each wanted tag is added to a list, and main prints that list, one complete tag per line. The program does not pull the URL out of the tag; it prints the whole tag that contains it.

Example 17-6. GetURLs.java

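import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;

/** Lists the tags in a web page that are likely to contain URLs. */
public class GetURLs {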
    /** The tag reader */
    ReadTag reader;

    public GetURLs(URL theURL) throws IOException {
        reader = new ReadTag(theURL);
    }

    public GetURLs(String theURL) throws MalformedURLException, IOException {
        reader = new ReadTag(theURL);
    }

    /* The tags we want to look at */
    public final static String[] wantTags = {
        "<a ", "<A ",
        "<applet ", "<APPLET ",
        "<img ", "<IMG ",
        "<frame ", "<FRAME ",
    };

    /** Collect every wanted tag found in the input, in document order. */
    public ArrayList<String> getURLs() throws IOException {
        ArrayList<String> al = new ArrayList<String>();
        String tag;
        while ((tag = reader.nextTag()) != null) {
            for (int i = 0; i < wantTags.length; i++) {
                if (tag.startsWith(wantTags[i])) {
                    al.add(tag);
                    break;        // matched; skip the remaining prefixes
                }
            }
        }
        return al;
    }

    /** Close the underlying tag reader. */
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }

    public static void main(String[] argv) throws
            MalformedURLException, IOException {
        String theURL = argv.length == 0 ?
            "http://localhost/" : argv[0];
        GetURLs gu = new GetURLs(theURL);
        try {
            ArrayList<String> urls = gu.getURLs();
            Iterator<String> urlIterator = urls.iterator();
            while (urlIterator.hasNext()) {
                System.out.println(urlIterator.next());
            }
        } finally {
            gu.close();    // release the underlying reader
        }
    }
}

The GetURLs program prints the tags that contain URLs in a given web page:

darian$ java GetURLs http://daroad
<IMG SRC="ian.gif">
<A ID="LinkLocal" HREF="webserver/index.html">
<A HREF="quizzes/">
<A HREF="servlets/IsItWorking">
<A HREF="demo.jsp">
<A ID=LinkRemote HREF="http://java.sun.com">
<A ID=LinkRemote HREF="http://www.openbsd.org">
<A ID=LinkRemote HREF="http://www.cpg.com">
<A ID=LinkRemote HREF="http://www.ssc.com">
<A ID=LinkRemote HREF="http://www.learningtree.com">
<A ID=LinkLocal HREF="javacook.html">
<A ID=LinkRemote HREF="http://java.oreilly.com">
<A ID=LinkLocal HREF="lookup/index.htm">
<A ID=LinkLocal HREF="readings/index.html">
<A ID=LinkLocal HREF="download.html">
<IMG SRC="miniduke.gif" BORDER=0>
darian$
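
Because wantTags lists each prefix only in all-lowercase and all-uppercase form, a tag written as <Img ...> or <a> followed by a tab would be missed. The following is a minimal sketch of a case-insensitive check, not part of the original program. The class name TagMatch, the array WANTED, and the method isWanted are hypothetical names introduced for illustration; it assumes, as in the sample output, that each tag string begins with the < character.

```java
class TagMatch {
    /** Tag names of interest, without the angle bracket (same set as wantTags). */
    static final String[] WANTED = { "a", "applet", "img", "frame" };

    /**
     * Returns true if the tag is '<', then a wanted name in any letter case,
     * then a whitespace character.
     */
    static boolean isWanted(String tag) {
        for (String name : WANTED) {
            int end = 1 + name.length();   // index of the char after the name
            if (tag.length() > end
                    && tag.charAt(0) == '<'
                    && tag.regionMatches(true, 1, name, 0, name.length())
                    && Character.isWhitespace(tag.charAt(end))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWanted("<Img SRC=\"ian.gif\">"));   // true
        System.out.println(isWanted("<A HREF=\"quizzes/\">"));   // true
        System.out.println(isWanted("<address>"));               // false
    }
}
```

Requiring whitespace after the name keeps <address> from being mistaken for an anchor, which is the same job the trailing space does in each wantTags prefix.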

The LinkChecker program in Section 17.12 goes a step further: it extracts the HREF or SRC attribute values and validates them.
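
To give a rough idea of what extracting an attribute value involves, here is a minimal sketch using java.util.regex. It is not the LinkChecker code from Section 17.12; the class name AttrExtract and the method extractURL are hypothetical, and the pattern is an assumption that handles only simple HREF= and SRC= attributes (quoted or bare). It is not a full HTML parser, and it ignores APPLET tags, which use other attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttrExtract {
    // HREF or SRC, '=', then a double-quoted, single-quoted, or bare value.
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first HREF or SRC value in the tag, or null if there is none. */
    static String extractURL(String tag) {
        Matcher m = URL_ATTR.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three alternatives captured the value.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints http://java.sun.com
        System.out.println(extractURL("<A ID=LinkRemote HREF=\"http://java.sun.com\">"));
        // prints miniduke.gif
        System.out.println(extractURL("<IMG SRC=\"miniduke.gif\" BORDER=0>"));
    }
}
```

Applied to each line of the GetURLs output above, a helper like this would yield the bare URL strings, which could then be resolved against the page's base URL before being checked.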
