You now know everything there is to know about the
java.net.InetAddress class. The tools in this class alone let you
write some genuinely useful programs. Here we’ll look at two: one
that queries your domain name server interactively and another that
can improve the performance of your web server by processing log
files offline.
nslookup is a Unix utility that converts hostnames to IP addresses
and IP addresses to hostnames. It has two modes: interactive and
command line. If you enter a hostname on the command line, nslookup
prints the IP address of that host. If you enter an IP address on the
command line, nslookup prints the hostname. If no hostname or IP
address is entered on the command line, nslookup enters interactive
mode, in which it reads hostnames and IP addresses from standard
input and echoes back the corresponding IP addresses and hostnames
until you type “exit”. Example 6-9 is a simple character mode
application called HostLookup, which emulates nslookup. It doesn’t
implement any of nslookup’s more complex features, but it does enough
to be useful.
Example 6-9. An nslookup Clone
import java.net.*;
import java.io.*;

public class HostLookup {

  public static void main (String[] args) {

    if (args.length > 0) { // use command line
      for (int i = 0; i < args.length; i++) {
        System.out.println(lookup(args[i]));
      }
    }
    else {
      BufferedReader in = new BufferedReader(
       new InputStreamReader(System.in));
      System.out.println(
       "Enter names and IP addresses. Enter \"exit\" to quit.");
      try {
        while (true) {
          String host = in.readLine( );
          // readLine( ) returns null at end of input
          if (host == null || host.equals("exit")) break;
          System.out.println(lookup(host));
        }
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } /* end main */

  private static String lookup(String host) {

    InetAddress thisComputer;
    byte[] address;

    // get the bytes of the IP address
    try {
      thisComputer = InetAddress.getByName(host);
      address = thisComputer.getAddress( );
    }
    catch (UnknownHostException e) {
      return "Cannot find host " + host;
    }

    if (isHostName(host)) {
      // Print the IP address
      String dottedQuad = "";
      for (int i = 0; i < address.length; i++) {
        // bytes are signed in Java, so map -128..-1 onto 128..255
        int unsignedByte = address[i] < 0 ? address[i] + 256 : address[i];
        dottedQuad += unsignedByte;
        if (i != address.length-1) dottedQuad += ".";
      }
      return dottedQuad;
    }
    else { // this is an IP address
      return thisComputer.getHostName( );
    }

  } // end lookup

  private static boolean isHostName(String host) {

    char[] ca = host.toCharArray( );
    // if we see a character that is neither a digit nor a period
    // then host is probably a host name
    for (int i = 0; i < ca.length; i++) {
      if (!Character.isDigit(ca[i])) {
        if (ca[i] != '.') return true;
      }
    }

    // Everything was either a digit or a period
    // so host looks like an IP address in dotted quad format
    return false;

  } // end isHostName

} // end HostLookup
Here’s some sample output. In the interactive session, each line typed by the user is followed by the program’s response:
% java HostLookup utopia.poly.edu
128.238.3.21
% java HostLookup 128.238.3.21
utopia.poly.edu
% java HostLookup
Enter names and IP addresses. Enter "exit" to quit.
cs.nyu.edu
128.122.80.78
199.1.32.90
star.blackstar.com
localhost
127.0.0.1
cs.cmu.edu
128.2.222.173
rtfm.mit.edu
18.181.0.29
star.blackstar.com
199.1.32.90
cs.med.edu
Cannot find host cs.med.edu
exit
The HostLookup program is built using three methods: main( ),
lookup( ), and isHostName( ). The main( ) method determines whether
there are command-line arguments. If there are command-line
arguments, main( ) calls lookup( ) to process each one. If there are
no command-line arguments, it chains a BufferedReader to an
InputStreamReader chained to System.in and reads input from the user
with the readLine( ) method. (The warning in Chapter 4 about this
method doesn’t apply here because we’re reading from the console, not
a network connection.) If the line is “exit”, then the program exits.
Otherwise, the line is assumed to be a hostname or IP address, and is
passed to the lookup( ) method.
The lookup( ) method uses InetAddress.getByName( ) to find the
requested host, regardless of the input’s format; remember that
getByName( ) doesn’t care if its argument is a name or a dotted quad
address. If getByName( ) fails, then lookup( ) returns a failure
message. Otherwise, it gets the address of the requested system. Then
lookup( ) calls isHostName( ) to determine whether the input string
host is a hostname like cs.nyu.edu or a dotted quad format IP address
like 128.122.153.70. isHostName( ) looks at each character of the
string; if all the characters are digits or periods, isHostName( )
guesses that the string is a numeric IP address and returns false.
Otherwise, isHostName( ) guesses that the string is a hostname and
returns true. What if the string is neither? That is very unlikely,
since if the string is neither a hostname nor an address,
getByName( ) won’t be able to do a lookup and will throw an
exception. However, it would not be difficult to add a test making
sure that the string looks valid; this is left as an exercise for the
reader. If the user types a hostname, lookup( ) returns the
corresponding dotted quad address; we have already saved the address
in the byte array address[], and the only complication is making sure
that we don’t treat byte values from 128 to 255 as negative numbers.
If the user types an IP address, then we use the getHostName( )
method to look up the hostname corresponding to the address, and
return it.
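The byte conversion in lookup( ) can also be written with a bit mask, and the validity test mentioned above takes only a few lines. What follows is a minimal sketch, not part of HostLookup: the class name DottedQuad and both method names are invented for illustration, and the check covers only the four-field dotted quad form. It also assumes a more recent Java release than the listings in this chapter, since it uses StringBuilder, the enhanced for loop, and InetAddress.getByAddress( ).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DottedQuad {

    // Format raw address bytes as dot-separated decimal numbers.
    // Masking with 0xFF maps the signed range -128..-1 onto 128..255.
    static String format(byte[] address) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < address.length; i++) {
            if (i > 0) sb.append('.');
            sb.append(address[i] & 0xFF);
        }
        return sb.toString();
    }

    // Hypothetical stricter check than isHostName( ): exactly four
    // decimal fields, each one to three digits and no larger than 255.
    static boolean looksLikeDottedQuad(String s) {
        String[] parts = s.split("\\.", -1);
        if (parts.length != 4) return false;
        for (String part : parts) {
            if (part.isEmpty() || part.length() > 3) return false;
            for (int i = 0; i < part.length(); i++) {
                char c = part.charAt(i);
                if (c < '0' || c > '9') return false;
            }
            if (Integer.parseInt(part) > 255) return false;
        }
        return true;
    }

    public static void main(String[] args) throws UnknownHostException {
        byte[] raw = {(byte) 128, (byte) 238, 3, 21};
        System.out.println(format(raw));                       // 128.238.3.21
        // getByAddress( ) wraps raw bytes without a DNS lookup
        System.out.println(InetAddress.getByAddress(raw).getHostAddress());
        System.out.println(looksLikeDottedQuad("128.238.3.21")); // true
        System.out.println(looksLikeDottedQuad("256.1.1.1"));    // false
    }
}
```

If you only need the string form of an address you already hold as an InetAddress, getHostAddress( ) does the formatting for you; the mask is mainly useful when you work with the raw bytes yourself.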
Web server logs track the hosts that access a web site. By default,
the log reports the IP addresses of the sites that connect to the
server. However, you can often get more information from the names of
those sites than from their IP addresses. Most web servers have an
option to store hostnames instead of IP addresses, but this can hurt
performance because the server needs to make a DNS request for each
hit. It is much more efficient to log the IP addresses and convert
them to hostnames at a later time. This task can be done when the
server isn’t busy or even on another machine completely. Example 6-10
is a program called Weblog that reads a web server log file and
prints each line with IP addresses converted to hostnames.
Most web servers have standardized on the common log file format, although there are exceptions; if your web server is one of those exceptions, you’ll have to modify this program. A typical line in the common log file format looks like this:
205.160.186.76 unknown - [17/Jun/1999:22:53:58 -0500] "GET /bgs/greenbg.gif HTTP 1.0" 200 50
This means that a web browser at IP address 205.160.186.76 requested
the file /bgs/greenbg.gif from this web server at 10:53 P.M. (and 58
seconds) on June 17, 1999. The file was found (response code 200),
and 50 bytes of data were successfully transferred to the browser.
The first field is the IP address or, if DNS resolution is turned on, the hostname from which the connection was made. This is followed by a space. Therefore, for our purposes, parsing the log file is easy: everything before the first space is the IP address, and everything after it does not need to be changed.
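The split described above can be sketched in isolation. The class and method names here (LogEntrySplit, split) are hypothetical, and the handling of a line with no space at all is an assumption about what you would want, not something the log format specifies.

```java
class LogEntrySplit {

    // Returns {host, rest}. The rest keeps its leading space, so
    // host + rest reproduces the original line exactly.
    static String[] split(String entry) {
        int index = entry.indexOf(' ');
        if (index < 0) {
            // Assumption: a line with no space is treated as all host field
            return new String[] {entry, ""};
        }
        return new String[] {entry.substring(0, index), entry.substring(index)};
    }

    public static void main(String[] args) {
        String entry = "205.160.186.76 unknown - [17/Jun/1999:22:53:58 -0500] "
            + "\"GET /bgs/greenbg.gif HTTP 1.0\" 200 50";
        String[] parts = split(entry);
        System.out.println(parts[0]);  // 205.160.186.76
        System.out.println(parts[1]);  // the rest of the line, starting with a space
    }
}
```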
The dotted quad format IP address is converted into a hostname using
the usual methods of java.net.InetAddress. Example 6-10 shows the
code.
Example 6-10. Process Web Server Log Files
import java.net.*;
import java.io.*;
import java.util.*;
import com.macfaq.io.SafeBufferedReader;

public class Weblog {

  public static void main(String[] args) {

    Date start = new Date( );
    try {
      FileInputStream fin = new FileInputStream(args[0]);
      Reader in = new InputStreamReader(fin);
      SafeBufferedReader bin = new SafeBufferedReader(in);

      String entry = null;
      while ((entry = bin.readLine( )) != null) {

        // separate out the IP address
        int index = entry.indexOf(' ', 0);
        if (index < 0) { // no space, so no host field to convert
          System.out.println(entry);
          continue;
        }
        String ip = entry.substring(0, index);
        String theRest = entry.substring(index, entry.length( ));

        // find the host name and print it out
        try {
          InetAddress address = InetAddress.getByName(ip);
          System.out.println(address.getHostName( ) + theRest);
        }
        catch (UnknownHostException e) {
          System.out.println(entry);
        }

      } // end while
    }
    catch (IOException e) {
      System.out.println("Exception: " + e);
    }

    Date end = new Date( );
    long elapsedTime = (end.getTime( )-start.getTime( ))/1000;
    System.out.println("Elapsed time: " + elapsedTime + " seconds");

  } // end main

}
The name of the file to be processed is passed to Weblog as the first
argument on the command line. A FileInputStream fin is opened from
this file, and an InputStreamReader is chained to fin. This
InputStreamReader is buffered by chaining it to an instance of the
SafeBufferedReader class developed in Chapter 4. The file is
processed line by line in a while loop. Each pass through the loop
places one line in the String variable entry. entry is then split
into two substrings: ip, which contains everything before the first
space, and theRest, which is everything from the first space onward.
The position of the first space is determined by
entry.indexOf(' ', 0). ip is converted to an InetAddress object using
getByName( ). The hostname is then looked up by getHostName( ).
Finally, the hostname and the rest of the line (theRest, which begins
with the space) are printed on System.out. Output can be sent to a
new file through the standard means for redirecting output.
Weblog is more efficient than you might expect. Most web browsers
generate multiple log file entries per page served, since there’s an
entry in the log not just for the page itself but for each graphic on
the page. And many web browsers request multiple pages while visiting
a site. DNS lookups are expensive, and it simply doesn’t make sense
to look up each of those sites every time it appears in the log file.
The InetAddress class caches requested addresses. If the same address
is requested again, it can be retrieved from the cache much more
quickly than from DNS.
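InetAddress does this caching for you, so Weblog needs no extra code. Purely to illustrate why repeated addresses are cheap, here is a sketch of the general idea using a map in front of a pluggable lookup function. It is not how InetAddress is implemented; the class name CachingResolver is hypothetical, and the stub resolver stands in for a real DNS query so that the example runs without a network.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class CachingResolver {

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> resolver;

    CachingResolver(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Calls the underlying resolver only the first time an address is seen.
    String resolve(String ip) {
        return cache.computeIfAbsent(ip, resolver);
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        // Stub resolver: counts calls instead of querying DNS
        CachingResolver r = new CachingResolver(ip -> {
            lookups[0]++;
            return "host-for-" + ip;
        });
        r.resolve("205.160.186.76");
        r.resolve("205.160.186.76");  // served from the cache
        r.resolve("128.238.3.21");
        System.out.println("Resolver calls: " + lookups[0]);  // Resolver calls: 2
    }
}
```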
Nonetheless, this program could certainly be faster. In my initial tests, it took more than a second per log entry. (Exact numbers depend on the speed of your network connection, the speed of both local and remote DNS servers you access, and network congestion when the program is run.) It spends a huge amount of time just sitting and waiting for DNS requests to return. Of course, this is exactly the problem multithreading is designed to solve. One main thread can read the log file and pass off individual entries to other threads for processing.
A thread pool is absolutely necessary here. Over the space of a few
days, even low volume web servers can easily generate a log file with
hundreds of thousands of lines. Trying to process such a log file by
spawning a new thread for each entry would rapidly bring even the
strongest virtual machine to its knees, especially since the main
thread can read log file entries much faster than individual threads
can resolve domain names and die. Consequently, reusing threads is
essential here. The number of threads is stored in a tunable
parameter, numberOfThreads, so that it can be adjusted to fit the VM
and network stack. (Launching too many simultaneous DNS requests can
also cause problems.)
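On Java releases that include java.util.concurrent, a fixed-size pool is available off the shelf. The sketch below swaps that library in for a hand-built pool, so it is a different technique from the one this chapter develops. The class name ExecutorWeblog and its methods are hypothetical, and the demonstration uses a stub resolver so it runs without a network.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ExecutorWeblog {

    // Converts the host field of each entry using numberOfThreads workers.
    static List<String> convert(List<String> entries,
            Function<String, String> resolver, int numberOfThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (String entry : entries) {
                results.add(pool.submit(() -> {
                    int index = entry.indexOf(' ');
                    if (index < 0) return entry;  // malformed line: unchanged
                    return resolver.apply(entry.substring(0, index))
                        + entry.substring(index);
                }));
            }
            // Collecting the futures in submission order keeps input order
            List<String> converted = new ArrayList<>();
            for (Future<String> result : results) {
                try {
                    converted.add(result.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return converted;
        } finally {
            pool.shutdown();
        }
    }

    // A real resolver; falls back to the dotted quad if the lookup fails.
    static String dnsLookup(String ip) {
        try {
            return InetAddress.getByName(ip).getHostName();
        } catch (UnknownHostException e) {
            return ip;
        }
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
            "205.160.186.76 unknown - [17/Jun/1999:22:53:58 -0500] \"GET / HTTP 1.0\" 200 50",
            "128.238.3.21 unknown - [17/Jun/1999:22:54:01 -0500] \"GET /a.gif HTTP 1.0\" 200 10");
        // Stub resolver for the demo; pass ExecutorWeblog::dnsLookup for real lookups
        for (String line : convert(entries, ip -> "host-" + ip, 4)) {
            System.out.println(line);
        }
    }
}
```

Note the trade-offs in this sketch: it holds every entry and every result in memory at once, so a very large log would need to be fed to the pool in batches, and the right value for numberOfThreads still depends on your VM and network.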
This program is now divided into two classes. The first class,
PooledWeblog, shown in Example 6-11, contains the main( ) method and
the processLogFile( ) method. It also holds the resources that need
to be shared among the threads. These are the pool, implemented as a
synchronized LinkedList from the Java Collections API, and the output
log, implemented as a BufferedWriter named out. Individual threads
will have direct access to the pool but will have to pass through
PooledWeblog’s log( ) method to write output.
The key method is processLogFile( ). As before, this method reads
from the underlying log file. However, each entry is placed in the
entries pool rather than being immediately processed. Because this
method is likely to run much more quickly than the threads that have
to access DNS, it yields after reading each entry. Furthermore, it
goes to sleep if there are more entries in the pool than threads
available to process them. The amount of time it sleeps depends on
the number of threads. This will avoid using excessive amounts of
memory for very large log files. When the last entry is read, the
finished flag is set to true to tell the threads that they can die
once they’ve completed their work.
Example 6-11. PooledWeblog
import java.io.*;
import java.util.*;

public class PooledWeblog {

  private BufferedReader in;
  private BufferedWriter out;
  private int numberOfThreads;
  private List entries = Collections.synchronizedList(new LinkedList( ));
  // volatile so that lookup threads see the change made by the main thread
  private volatile boolean finished = false;

  public PooledWeblog(InputStream in, OutputStream out,
   int numberOfThreads) {
    this.in = new BufferedReader(new InputStreamReader(in));
    this.out = new BufferedWriter(new OutputStreamWriter(out));
    this.numberOfThreads = numberOfThreads;
  }

  public boolean isFinished( ) {
    return this.finished;
  }

  public int getNumberOfThreads( ) {
    return numberOfThreads;
  }

  public void processLogFile( ) {

    for (int i = 0; i < numberOfThreads; i++) {
      Thread t = new LookupThread(entries, this);
      t.start( );
    }

    try {
      String entry = null;
      while ((entry = in.readLine( )) != null) {

        // Sleep until the lookup threads catch up. Looping here,
        // rather than skipping to the next readLine( ), keeps the
        // entry that was just read from being dropped.
        while (entries.size( ) > numberOfThreads) {
          try {
            Thread.sleep((long) (1000.0/numberOfThreads));
          }
          catch (InterruptedException e) {}
        }

        synchronized (entries) {
          entries.add(0, entry);
          entries.notifyAll( );
        }

        Thread.yield( );

      } // end while
    }
    catch (IOException e) {
      System.out.println("Exception: " + e);
    }

    this.finished = true;

    // finish any threads that are still waiting
    synchronized (entries) {
      entries.notifyAll( );
    }

  }

  public void log(String entry) throws IOException {
    out.write(entry + System.getProperty("line.separator", "\r\n"));
    out.flush( );
  }

  public static void main(String[] args) {

    try {
      PooledWeblog tw = new PooledWeblog(new FileInputStream(args[0]),
       System.out, 100);
      tw.processLogFile( );
    }
    catch (FileNotFoundException e) {
      System.err.println("Usage: java PooledWeblog logfile_name");
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.err.println("Usage: java PooledWeblog logfile_name");
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace( );
    }

  } // end main

}
The detailed work of converting IP addresses to hostnames in the log
entries is handled by the LookupThread class, shown in Example 6-12.
The constructor provides each thread with a reference to the entries
pool it will retrieve work from and a reference to the PooledWeblog
object it’s working for. The latter reference allows callbacks to the
PooledWeblog so that the thread can log converted entries and check
to see when the last entry has been processed. It does so by calling
the isFinished( ) method in PooledWeblog when the entries pool is
empty (has size 0). Neither an empty pool nor isFinished( ) returning
true is sufficient by itself. isFinished( ) returns true after the
last entry is placed in the pool, which is, at least for a small
amount of time, before the last entry is removed from the pool. And
entries may be empty while there are still many entries remaining to
be read, if the lookup threads outrun the main thread reading the log
file.
Example 6-12. LookupThread
import java.net.*;
import java.io.*;
import java.util.*;

public class LookupThread extends Thread {

  private List entries;
  PooledWeblog log; // used for callbacks

  public LookupThread(List entries, PooledWeblog log) {
    this.entries = entries;
    this.log = log;
  }

  public void run( ) {

    String entry;

    while (true) {

      synchronized (entries) {
        while (entries.size( ) == 0) {
          if (log.isFinished( )) return;
          try {
            entries.wait( );
          }
          catch (InterruptedException e) {
          }
        }
        entry = (String) entries.remove(entries.size( )-1);
      }

      String converted = entry; // malformed lines pass through unchanged
      int index = entry.indexOf(' ', 0);
      if (index > 0) {
        String remoteHost = entry.substring(0, index);
        String theRest = entry.substring(index, entry.length( ));
        try {
          remoteHost = InetAddress.getByName(remoteHost).getHostName( );
        }
        catch (Exception e) {
          // remoteHost remains in dotted quad format
        }
        converted = remoteHost + theRest;
      }

      try {
        log.log(converted);
      }
      catch (IOException e) {
      }

      Thread.yield( ); // yield( ) is static, so call it on the class

    }

  }

}
Using threads like this lets the same log files be processed in parallel. This is a huge time savings. In my unscientific tests, the threaded version is 10 to 50 times faster than the sequential version.
The biggest disadvantage to the multithreaded approach is that it
reorders the log file. The output statistics aren’t necessarily in
the same order as the input statistics. For simple hit counting, this
doesn’t matter. However, there are some log analysis tools that can
mine a log file to determine paths users followed through a site.
These could well get confused if the log is out of sequence. If
that’s an issue, you’d need to attach a sequence number to each log
entry. As the individual threads returned log entries to the main
program, the log( ) method in the main program would store any that
arrived out of order until their predecessors appeared. This is in
some ways reminiscent of how network software reorders TCP packets
that arrive out of order.
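The sequence-number idea can be sketched as a small reordering buffer. This is one possible design under stated assumptions, not code from PooledWeblog: the class name OrderedLog is hypothetical, sequence numbers are assumed to start at 0 with no gaps, and released entries are collected in a list where a real log( ) method would write them to the output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrderedLog {

    private final Map<Long, String> pending = new HashMap<>();
    private final List<String> output = new ArrayList<>();
    private long next = 0;  // sequence number of the next entry to release

    // Called by worker threads in any order. An entry is held back
    // until every entry with a smaller sequence number has been released.
    synchronized void log(long sequence, String entry) {
        pending.put(sequence, entry);
        String ready;
        while ((ready = pending.remove(next)) != null) {
            output.add(ready);  // a real version would write to the log here
            next++;
        }
    }

    // Snapshot of the entries released so far, in sequence order.
    synchronized List<String> written() {
        return new ArrayList<>(output);
    }

    public static void main(String[] args) {
        OrderedLog log = new OrderedLog();
        log.log(1, "second");
        log.log(2, "third");
        System.out.println(log.written());  // [] since entry 0 has not arrived
        log.log(0, "first");
        System.out.println(log.written());  // [first, second, third]
    }
}
```

The cost of this approach is memory: one slow lookup near the start of the file holds back everything behind it until it completes.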