Program: LinkChecker

One of the hard parts of maintaining a large web site is ensuring that all the hypertext links, images, applets, and so forth remain valid as the site grows and changes. It’s easy to make a change somewhere that breaks a link somewhere else, exposing your users to those “Doh!”-producing 404 errors. What’s needed is a program to automate checking the links. This turns out to be surprisingly complex due to the variety of link types. But we can certainly make a start.

Since we already created a program that reads a web page and extracts the URL-containing tags (Section 17.9), we can use that here. The basic approach of our new LinkChecker program is this: given a starting URL, create a GetURLs object for it. If that succeeds, read the list of URLs and go from there. The program also displays the structure of the site, using simple indentation in a graphical window, as shown in Figure 17-3.

Figure 17-3. LinkChecker in action

With the GetURLs class from Section 17.9 in hand, the rest is largely a matter of elaboration. Much of this code has to do with the GUI (see Chapter 13). The code uses recursion: the routine checkOut( ) calls itself each time a new page or directory is started.
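To make the recursion concrete before reading the full listing, here is a minimal sketch of the same idea run against an in-memory "site" instead of real pages. The class RecursiveWalkSketch, its walk( ) method, and the page names are hypothetical and are not part of LinkChecker. The sketch also passes the indent level as a parameter and keeps a set of visited pages; the listing does neither (it uses an indent field and does not track visited pages), so treat those as assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a recursive walk over an in-memory "site". */
class RecursiveWalkSketch {

    /** Append the page, then everything it links to, one tab per level. */
    static void walk(String page, Map<String, List<String>> links,
            int indent, Set<String> visited, StringBuilder out) {
        for (int i = 0; i < indent; i++) {
            out.append('\t');
        }
        out.append(page);
        // Assumption: remember visited pages so two pages that link to
        // each other cannot send the walk around in circles.
        if (!visited.add(page)) {
            out.append(" (already seen)\n");
            return;
        }
        out.append('\n');
        for (String child : links.getOrDefault(page, List.of())) {
            walk(child, links, indent + 1, visited, out);    // RECURSE
        }
    }

    public static void main(String[] args) {
        // Made-up pages; index.html and a.html link to each other.
        Map<String, List<String>> site = Map.of(
            "index.html", List.of("a.html", "b.html"),
            "a.html", List.of("index.html"));
        StringBuilder out = new StringBuilder();
        walk("index.html", site, 0, new HashSet<>(), out);
        System.out.print(out);
    }
}
```

Running main( ) prints index.html at the left margin, a.html and b.html one tab in, and the back-link from a.html to index.html two tabs in, marked as already seen.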

Example 17-8 shows the code for the LinkChecker program.

Example 17-8. LinkChecker.java

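import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.net.*;
import java.util.*;

/** A simple HTML Link Checker.
 * Need a Properties file to set depth, URLs to check, etc.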
 * Responses not adequate; need to check at least for 404-type errors!
 * When all that is (said and) done, display in a Tree instead of a TextArea.
 * Then use Color coding to indicate errors.
 */
public class LinkChecker extends Frame implements Runnable {
    protected Thread t = null;
    /** The "global" activation flag: set true to halt.
     * Volatile because the GUI thread sets it and the checker thread reads it.
     */
    volatile boolean done = false;
    protected Panel p;
    /** The textfield for the starting URL.
     * Should have a Properties file and a JComboBox instead.
     */
    protected TextField textFldURL;
    protected Button checkButton;
    protected Button killButton;
    protected TextArea textWindow;
    protected int indent = 0;
  
    public static void main(String[] args) {
        LinkChecker lc = new LinkChecker(  );
        lc.setSize(500, 400);
        lc.setLocation(150, 150);
        lc.setVisible(true);
        if (args.length == 0)
            return;
        lc.textFldURL.setText(args[0]);
    }
  
    public void startChecking(  ) {
        done = false;
        checkButton.setEnabled(false);
        killButton.setEnabled(true);
        textWindow.setText("");
        doCheck(  );
    }

    public void stopChecking(  ) {
        done = true;
        checkButton.setEnabled(true);
        killButton.setEnabled(false);
    }

    /** Construct a LinkChecker */
    public LinkChecker(  ) {
        super("LinkChecker");
        addWindowListener(new WindowAdapter(  ) {
            public void windowClosing(WindowEvent e) {
                setVisible(false);
                dispose(  );
                System.exit(0);
            }
        });
        setLayout(new BorderLayout(  ));
        p = new Panel(  );
        p.setLayout(new FlowLayout(  ));
        p.add(new Label("URL"));
        p.add(textFldURL = new TextField(40));
        p.add(checkButton = new Button("Check URL"));
        // Make a single action listener for both the text field (when
        // you hit return) and the explicit "Check URL" button.
        ActionListener starter = new ActionListener(  ) {
            public void actionPerformed(ActionEvent e) {
                startChecking(  );
            }
        };
        textFldURL.addActionListener(starter);
        checkButton.addActionListener(starter);
        p.add(killButton = new Button("Stop"));
        killButton.setEnabled(false);    // until startChecking is called.
        killButton.addActionListener(new ActionListener(  ) {
            public void actionPerformed(ActionEvent e) {
                if (t == null || !t.isAlive(  ))
                    return;
                stopChecking(  );
            }
        });
        // Now lay out the main GUI - URL & buttons on top, text larger
        add(p, BorderLayout.NORTH);
        textWindow = new TextArea(80, 40);
        add(textWindow, BorderLayout.CENTER);
    }

    public void doCheck(  ) {
        if (t!=null && t.isAlive(  ))
            return;
        t = new Thread(this);
        t.start(  );
    }

    public synchronized void run(  ) {
        textWindow.setText("");
        checkOut(textFldURL.getText(  ));
        textWindow.append("-- All done --");
    }
  
    /** Start checking, given a URL by name.
     * Calls checkLink to check each link.
     */
    public void checkOut(String rootURLString) {
        URL rootURL = null;
        GetURLs urlGetter = null;

        if (done)
            return;
        if (rootURLString == null) {
            textWindow.append("checkOut(null) isn't very useful");
            return;
        }

        // Open the root URL for reading
        try {
            rootURL = new URL(rootURLString);
            urlGetter = new GetURLs(rootURL);
        } catch (MalformedURLException e) {
            textWindow.append("Can't parse " + rootURLString + "\n");
            return;
        } catch (FileNotFoundException e) {
            textWindow.append("Can't open file " + rootURLString + "\n");
            return;
        } catch (IOException e) {
            textWindow.append("openStream " + rootURLString + " " + e + "\n");
            return;
        }

        // If we're still here, the root URL given is OK.
        // Next we make up a "directory" URL from it.
        String rootURLdirString;
        if (rootURLString.endsWith("/") ||
            rootURLString.endsWith("\\"))
                rootURLdirString = rootURLString;
        else {
            rootURLdirString = rootURLString.substring(0, 
                rootURLString.lastIndexOf('/'));    // XXX or '\\'
        }

        try {
            ArrayList urlTags = urlGetter.getURLs(  );
            Iterator urlIterator = urlTags.iterator(  );
            while (urlIterator.hasNext(  )) {
                if (done)
                    return;
                String tag = (String)urlIterator.next(  );
                System.out.println(tag);
                        
                String href = extractHREF(tag);

                for (int j=0; j<indent; j++)
                    textWindow.append("\t");
                textWindow.append(href + " -- ");

                // Can't really validate these!
                if (href.startsWith("mailto:")) {
                    textWindow.append("not checking\n");
                    continue;
                }

                if (href.startsWith("..") || href.startsWith("#")) {
                    textWindow.append("not checking\n");
                    // nothing doing!
                    continue; 
                }

                URL hrefURL = new URL(rootURL, href);

                // TRY THE URL.
                // (don't combine previous textWindow.append with this one,
                // since this one can throw an exception)
                textWindow.append(checkLink(hrefURL));

                // There should be an option to control whether to
                // "try the url" first and then see if off-site, or
                // vice versa, for the case when checking a site you're
                // working on on your notebook on a train in the Rockies
                // with no web access available.

                // Now see if the URL is off-site.
                if (!hrefURL.getHost().equals(rootURL.getHost(  ))) {
                    textWindow.append("-- OFFSITE -- not following");
                    textWindow.append("\n");
                    continue;
                }
                textWindow.append("\n");

                // If HTML, check it recursively. No point checking
                // PHP, CGI, JSP, etc., since these usually need forms input.
                // If a directory, assume HTML or something under it will work.
                if (href.endsWith(".htm") ||
                    href.endsWith(".html") ||
                    href.endsWith("/")) {
                        ++indent;
                        if (href.indexOf(':') != -1)
                            checkOut(href);            // RECURSE
                        else {
                            String newRef = 
                                 rootURLdirString + '/' + href;
                            checkOut(newRef);        // RECURSE
                        }
                        --indent;
                }
            }
            urlGetter.close(  );
        } catch (IOException e) {
            System.err.println("Error " + ":(" + e +")");
        }
    }

    /** Check one link, given its fully resolved URL; returns a status string. */
    public String checkLink(URL linkURL) {

        try { 
            // Open it; if the open fails we'll likely throw an exception
            URLConnection luf = linkURL.openConnection(  );
            String protocol = linkURL.getProtocol(  );
            if (protocol.equals("http") || protocol.equals("https")) {
                HttpURLConnection huf = (HttpURLConnection)luf;
                int code = huf.getResponseCode(  );
                if (code == -1)
                    return "Server error: bad HTTP response";
                return code + " " + huf.getResponseMessage(  );
            } else if (linkURL.getProtocol(  ).equals("file")) {
                InputStream is = luf.getInputStream(  );
                is.close(  );
                // If that didn't throw an exception, the file is probably OK
                return "(File)";
            } else
                return "(non-HTTP)";
        }
        catch (SocketException e) {
            return "DEAD: " + e.toString(  );
        }
        catch (IOException e) {
            return "DEAD";
        }
    }
 
    /** Read one tag. Adapted from code by Elliotte Rusty Harold */
    public String readTag(BufferedReader is) {
        StringBuffer theTag = new StringBuffer("<");
        int i = '<';
      
        try {
            while (i != '>' && (i = is.read(  )) != -1)
                theTag.append((char)i);
        }
        catch (IOException e) {
           System.err.println("IO Error: " + e);
        }     
        catch (Exception e) {
           System.err.println(e);
        }     

        return theTag.toString(  );
    }

    /** Extract the URL from <sometag attrs HREF="http://foo/bar" attrs ...> 
     * We presume that the HREF is correctly quoted!!!!!
     * TODO: Handle Applets.
     */
    public String extractHREF(String tag) throws MalformedURLException {
        String caseTag = tag.toLowerCase(  ), attrib;
        int p1, p2, p3, p4;

        if (caseTag.startsWith("<a "))
            attrib = "href";        // A
        else
            attrib = "src";            // image, frame
        p1 = caseTag.indexOf(attrib);
        if (p1 < 0) {
            throw new MalformedURLException("Can't find " + attrib + " in " + tag);
        }
        p2 = tag.indexOf ("=", p1);
        p3 = tag.indexOf("\"", p2);
        p4 = tag.indexOf("\"", p3+1);
        if (p3 < 0 || p4 < 0) {
            throw new MalformedURLException("Invalid " + attrib + " in " + tag);
        }
        String href = tag.substring(p3+1, p4);
        return href;
    }
}
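
The class comment at the top of the listing notes that the responses are not adequate and that the program should at least check for 404-type errors. One possible way to approach that is sketched below, assuming a hypothetical StatusSketch class with describe( ) and check( ) methods (neither is part of LinkChecker). It sends a HEAD request so that only the headers are transferred, turns off redirect following so that 3xx responses show up in the report, and sorts the numeric status into a few broad labels. The label names and the five-second timeouts are arbitrary choices made for this sketch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical sketch: classify the HTTP status of a link. */
class StatusSketch {

    /** Turn an HTTP status code into a short label for the report. */
    static String describe(int code) {
        if (code == -1) return "BAD HTTP RESPONSE";
        if (code >= 200 && code < 300) return "OK";
        if (code >= 300 && code < 400) return "REDIRECT";
        if (code == HttpURLConnection.HTTP_NOT_FOUND) return "NOT FOUND";
        if (code >= 400 && code < 500) return "CLIENT ERROR";
        if (code >= 500 && code < 600) return "SERVER ERROR";
        return "UNKNOWN";
    }

    /** Ask for the headers only (HEAD) and report the status. */
    static String check(URL url) {
        HttpURLConnection conn = null;
        try {
            URLConnection c = url.openConnection();
            if (!(c instanceof HttpURLConnection)) {
                return "(non-HTTP)";
            }
            conn = (HttpURLConnection) c;
            conn.setRequestMethod("HEAD");
            conn.setInstanceFollowRedirects(false);  // report 3xx as-is
            conn.setConnectTimeout(5000);            // arbitrary: 5 seconds
            conn.setReadTimeout(5000);
            int code = conn.getResponseCode();
            return code + " " + describe(code);
        } catch (IOException e) {
            return "DEAD: " + e;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java StatusSketch url [url ...]
        for (String arg : args) {
            System.out.println(arg + " -- " + check(new URL(arg)));
        }
    }
}
```

Some servers do not answer HEAD requests properly, so a fuller version might fall back to GET when HEAD produces an error status.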

Downloading an Entire Web Site

It would also be useful to have a program that reads the entire contents of a web site and saves it on your local hard disk. Sounds wasteful, but disk space is quite inexpensive nowadays, and this would allow you to peruse a web site when not connected to the Internet. Of course you couldn’t run most of the CGI scripts that you downloaded, but at least you could navigate around the text and view the images. The LinkChecker program contains all the seeds of such a program: you need only to download the contents of each non-dynamic URL (see the test for HTML and directories near the end of routine checkOut( ) and the code in Section 17.7), create the requisite directories (Section 10.10), and create and write to a file on disk (see Chapter 9). This final step is left as an exercise for the reader.
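
As a starting point for that exercise, the directory-and-file part might be sketched as follows. The MirrorSketch class, its localPathFor( ) and save( ) methods, the "mirror" directory name, and the choice of index.html for directory URLs are all assumptions made for illustration; none of them comes from LinkChecker. The sketch uses java.nio.file rather than the older file APIs, and it deliberately leaves out deciding which URLs to fetch, which the checkOut( ) logic already covers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: save one URL under a local directory tree. */
class MirrorSketch {

    /** Map a URL to a file under saveRoot, such as host/dir/page.html. */
    static Path localPathFor(URL url, Path saveRoot) {
        String path = url.getPath();
        if (path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";    // assumed name for directory URLs
        }
        Path root = saveRoot.toAbsolutePath().normalize();
        Path target = root.resolve(url.getHost())
                          .resolve(path.substring(1))   // drop leading '/'
                          .normalize();
        // Refuse paths that use ".." to climb out of the save directory.
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("Path escapes " + root);
        }
        return target;
    }

    /** Create any missing directories, then copy the URL's bytes to disk. */
    static Path save(URL url, Path saveRoot) throws IOException {
        Path target = localPathFor(url, saveRoot);
        Files.createDirectories(target.getParent());
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MirrorSketch url [url ...]
        Path root = Paths.get("mirror");
        for (String arg : args) {
            System.out.println(save(new URL(arg), root));
        }
    }
}
```

Under these assumptions, a URL such as http://example.com/docs/ would be written to mirror/example.com/docs/index.html.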
