Extracting URLs from a File


You need to extract just the URLs from a file.


Use ReadTag from Section 17.8, and just look for tags that might contain URLs.


The program in Example 17-6 uses ReadTag from the previous recipe and checks each tag to see if it is a “wanted tag,” that is, one that starts with a prefix listed in the array wantTags. These prefixes cover the A (anchor), APPLET, IMG (image), and FRAME tags, in all-lowercase and all-uppercase forms. If a tag is wanted, the whole tag (not just the URL inside it) is added to a list, and the main program prints each collected tag.
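The wanted-tag test described above is a plain prefix comparison, so a mixed-case tag such as <Img ...> would slip past a list that holds only lowercase and uppercase forms. As a minimal sketch of one alternative, the check can ignore case by using String.regionMatches. The class name WantedTagCheck, the WANTED array, and the isWanted method are hypothetical names used for illustration only; they are not part of ReadTag or GetURLs.

```java
public class WantedTagCheck {
    /** Tag names of interest, lowercase, without the leading '<' (hypothetical list). */
    static final String[] WANTED = { "a", "applet", "img", "frame" };

    /** Returns true if the tag starts with '<', a wanted name, and a space, ignoring case. */
    static boolean isWanted(String tag) {
        for (String name : WANTED) {
            String prefix = "<" + name + " ";
            // ignoreCase=true compares the prefix against the start of the tag
            // without building an upper- or lowercased copy of the tag
            if (tag.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWanted("<Img SRC=\"ian.gif\">"));  // prints true
        System.out.println(isWanted("<address>"));              // prints false
    }
}
```

The trailing space in the prefix keeps tags such as <address> from being mistaken for an anchor; the cost is that a tag written as a bare <a> with no attributes is skipped, which is harmless here because such a tag carries no URL.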

Example 17-6. GetURLs.java

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class GetURLs {
    /** The tag reader */
    ReadTag reader;

    public GetURLs(URL theURL) throws IOException {
        reader = new ReadTag(theURL);
    }

    public GetURLs(String theURL) throws MalformedURLException, IOException {
        reader = new ReadTag(theURL);
    }

    /* The tags we want to look at */
    public final static String[] wantTags = {
        "<a ", "<A ",
        "<applet ", "<APPLET ",
        "<img ", "<IMG ",
        "<frame ", "<FRAME ",
    };

    /** Collect every tag that starts with one of the wanted prefixes. */
    public List<String> getURLs() throws IOException {
        List<String> al = new ArrayList<>();
        String tag;
        while ((tag = reader.nextTag()) != null) {
            for (String want : wantTags) {
                if (tag.startsWith(want)) {
                    al.add(tag);
                    break;        // optimization: skip the remaining prefixes
                }
            }
        }
        return al;
    }

    public void close() throws IOException {
        if (reader != null)
            reader.close();
    }

    public static void main(String[] argv) throws
            MalformedURLException, IOException {
        String theURL = argv.length == 0 ?
            "http://localhost/" : argv[0];
        GetURLs gu = new GetURLs(theURL);
        // Print each collected tag on its own line
        for (String tag : gu.getURLs()) {
            System.out.println(tag);
        }
        gu.close();
    }
}

The GetURLs program prints the tags that contain URLs in a given web page:

darian$ java GetURLs http://daroad
<IMG SRC="ian.gif">
<A ID="LinkLocal" HREF="webserver/index.html">
<A HREF="quizzes/">
<A HREF="servlets/IsItWorking">
<A HREF="demo.jsp">
<A ID=LinkRemote HREF="http://java.sun.com">
<A ID=LinkRemote HREF="http://www.openbsd.org">
<A ID=LinkRemote HREF="http://www.cpg.com">
<A ID=LinkRemote HREF="http://www.ssc.com">
<A ID=LinkRemote HREF="http://www.learningtree.com">
<A ID=LinkLocal HREF="javacook.html">
<A ID=LinkRemote HREF="http://java.oreilly.com">
<A ID=LinkLocal HREF="lookup/index.htm">
<A ID=LinkLocal HREF="readings/index.html">
<A ID=LinkLocal HREF="download.html">
<IMG SRC="miniduke.gif" BORDER=0>

The LinkChecker program in Section 17.12 extracts the HREF or SRC attributes and validates them.
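To show what pulling the HREF or SRC value out of a tag might look like, here is a minimal sketch based on a regular expression. It is not the LinkChecker code; the class name TagAttribute and the method urlOf are hypothetical, and the pattern assumes the attribute value is double-quoted, single-quoted, or an unquoted run of non-space characters. It looks only at HREF and SRC, so it does not handle APPLET tags, which name their class in other attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAttribute {
    /** HREF= or SRC= followed by a double-quoted, single-quoted, or bare value. */
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first HREF or SRC value in the tag, or null if there is none. */
    static String urlOf(String tag) {
        Matcher m = URL_ATTR.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three capture groups holds the value
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints http://java.sun.com
        System.out.println(urlOf("<A ID=LinkRemote HREF=\"http://java.sun.com\">"));
        // prints miniduke.gif
        System.out.println(urlOf("<IMG SRC=\"miniduke.gif\" BORDER=0>"));
    }
}
```

A relative value such as miniduke.gif would still need to be resolved against the page's own URL before it could be fetched, for example with the two-argument java.net.URL constructor.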

