Chapter 3. Basic Web Concepts

By the time you finish this book, I hope you will realize that Java can do a lot more than create flashy web pages. Nonetheless, many of your programs will be applets on web pages or will need to talk to web servers to retrieve files or post data. Therefore, it’s important to have a solid understanding of the interaction between web servers and web browsers.

The Hypertext Transfer Protocol (HTTP) is a standard that defines how a web client talks to a server and how data is transferred from the server back to the client. HTTP relies heavily on two other standards: the Multipurpose Internet Mail Extensions (MIME) and the Hypertext Markup Language (HTML). MIME is a way to encode different kinds of data, such as sound and text, to be transmitted over a 7-bit ASCII connection; it also lets the recipient know what kind of data has been sent, so that it can be displayed properly. As its name implies, MIME was originally designed to facilitate multimedia email and to provide an encoding that could get binary data past the most brain-damaged mail transfer programs. However, it is now used much more broadly. HTML is a simple standard for describing the semantic value of textual data. This means that you can say “this is a header”, “this is a list item”, “this deserves emphasis”, and so on, but you can’t specify how headers, lists, and other items are formatted: formatting is up to the browser. HTML is a “hypertext markup language” because it includes a way to specify links to other documents identified by URLs. A URL is a way to unambiguously identify the location of a resource on the Internet. To understand network programming, you’ll need to understand URLs, HTML, MIME, and HTTP in somewhat more detail than the average web page designer.

URIs

A Uniform Resource Identifier (URI) is a string of characters in a particular syntax that identifies a resource. The resource identified may be a file on a server, but it may also be an email address, a news message, a book, a person’s name, an Internet host, the current stock price of Sun Microsystems, or something else. An absolute URI is made up of a scheme for the URI and a scheme-specific part, separated by a colon like this:

scheme:scheme-specific-part

The syntax of the scheme-specific part depends on the scheme being used. Many different schemes will eventually be defined, but current ones include:

data

Base64-encoded data included directly in a link; see RFC 2397

file

A file on a local disk

ftp

An FTP server

http

A World Wide Web server using the Hypertext Transfer Protocol

gopher

A Gopher server

mailto

An email address

news

A Usenet newsgroup

telnet

A connection to a Telnet-based service

urn

A Uniform Resource Name

In addition, Java makes heavy use of nonstandard, custom schemes such as rmi, jndi, and doc for various purposes. We’ll look at the mechanism behind this in Chapter 15, when we discuss protocol handlers.

There is no specific syntax that applies to the scheme-specific parts of all URIs. However, many follow this form:

//authority/path?query

The authority part of the URI names the authority responsible for resolving the rest of the URI. For instance, the URI http://www.ietf.org/rfc/rfc2396.txt has the scheme http and the authority www.ietf.org. This means that the server at www.ietf.org is responsible for mapping the path /rfc/rfc2396.txt to an actual resource. This URI does not have a query part. The URI http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=1565924851 has the scheme http, the authority www1.fatbrain.com, the path /asp/bookinfo/bookinfo.asp, and the query theisbn=1565924851. The URI urn:isbn:1565924851 has the scheme urn but doesn’t follow the //authority/path?query form for scheme-specific parts.
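The decomposition just described can be checked in code. What follows is a minimal sketch, assuming a JDK that includes the java.net.URI class (Java 1.4 or later); the class name UriParts and the describe() helper are made up for this illustration.

```java
import java.net.URI;

class UriParts {
    // Returns a one-line summary of the pieces java.net.URI finds in a URI string.
    static String describe(String text) {
        URI uri = URI.create(text); // throws IllegalArgumentException on bad syntax
        if (uri.isOpaque()) {
            // Opaque URIs such as urn:isbn:... do not follow //authority/path?query
            return "scheme=" + uri.getScheme()
                 + " scheme-specific-part=" + uri.getSchemeSpecificPart();
        }
        return "scheme=" + uri.getScheme()
             + " authority=" + uri.getAuthority()
             + " path=" + uri.getPath()
             + " query=" + uri.getQuery();
    }

    public static void main(String[] args) {
        // prints: scheme=http authority=www.ietf.org path=/rfc/rfc2396.txt query=null
        System.out.println(describe("http://www.ietf.org/rfc/rfc2396.txt"));
        System.out.println(describe(
            "http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=1565924851"));
        // prints: scheme=urn scheme-specific-part=isbn:1565924851
        System.out.println(describe("urn:isbn:1565924851"));
    }
}
```

A missing part is reported as null rather than as an empty string, which is why the first URI shows query=null.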

Although current examples of URIs use an Internet host as an authority, this may not be true of all future schemes. However, if the authority is an Internet host, then optional usernames and ports may also be provided to make the authority more specific. For example, the URI ftp://mp3:mp3@ci43198-a.ashvil1.nc.home.com:33/VanHalen-Jump.mp3 has the authority mp3:mp3@ci43198-a.ashvil1.nc.home.com:33. This authority has the username mp3, the password mp3, the host ci43198-a.ashvil1.nc.home.com, and the port 33. It has the scheme ftp and the path /VanHalen-Jump.mp3. (In most cases, including the password in the URI is a big security hole unless, as here, you really do want everyone in the universe to know the password.)
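As a rough illustration, the same java.net.URI class (again assuming Java 1.4 or later) can take a host-based authority apart. The FTP URI below is assembled from the username, password, host, and port named in the text; the class AuthorityParts and its helpers are invented for this sketch.

```java
import java.net.URI;

class AuthorityParts {
    // The user information is everything before the @ in the authority, or null.
    static String userInfo(String text) {
        return URI.create(text).getUserInfo();
    }

    static String host(String text) {
        return URI.create(text).getHost();
    }

    // getPort() returns -1 when the URI does not name a port explicitly.
    static int port(String text) {
        return URI.create(text).getPort();
    }

    public static void main(String[] args) {
        String ftp = "ftp://mp3:mp3@ci43198-a.ashvil1.nc.home.com:33/VanHalen-Jump.mp3";
        String info = userInfo(ftp);               // "mp3:mp3"
        int colon = info.indexOf(':');
        System.out.println("username: " + info.substring(0, colon));
        System.out.println("password: " + info.substring(colon + 1));
        System.out.println("host:     " + host(ftp));
        System.out.println("port:     " + port(ftp));
    }
}
```

The URI class hands back the user information as a single string, so splitting it into a username and a password at the first colon is left to the caller.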

The path (which includes its initial /) is a string that the authority can use to determine which resource is identified. Different authorities may interpret the same path to refer to different resources. For instance, the path /index.html means one thing when the authority is www.georgewbush.com and something very different when the authority is www.gore2000.com. The path may be hierarchical, in which case the individual parts are separated by forward slashes, and the . and .. operators are used to navigate the hierarchy. These are derived from the pathname syntax on the Unix operating systems where the Web and URLs were invented. They conveniently map to a filesystem stored on a Unix web server. However, there is no guarantee that the components of any particular path actually correspond to files or directories on any particular filesystem. For example, in the URI http://www.amazon.com/exec/obidos/ISBN%3D1565924851/cafeaulaitA/002-3777605-3043449 all the pieces of the hierarchy are just used to pull information out of a database that’s never stored in a filesystem. ISBN%3D1565924851 selects the particular book from the database by its ISBN number. cafeaulaitA specifies who gets the referral fee if a purchase is made from this link. And 002-3777605-3043449 is a session key used to track this visitor’s path through the site.
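To see how the . and .. operators collapse, here is a small sketch that relies on the normalize() method of java.net.URI (assumed available, Java 1.4 or later). The URIs with dot segments are made-up variations on this chapter's examples, and the class name DotSegments is hypothetical.

```java
import java.net.URI;

class DotSegments {
    // Removes "." segments and cancels each ".." against the segment before it.
    static String normalize(String text) {
        return URI.create(text).normalize().toString();
    }

    public static void main(String[] args) {
        // Both lines print http://metalab.unc.edu/javafaq/javatutorial.html
        System.out.println(
            normalize("http://metalab.unc.edu/javafaq/books/../javatutorial.html"));
        System.out.println(
            normalize("http://metalab.unc.edu/javafaq/./javatutorial.html"));
    }
}
```

Normalization is purely textual. It never consults the server, so it says nothing about whether the resulting path corresponds to a real file or directory.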

Of course, some URIs aren’t at all hierarchical, at least in the filesystem sense. For example, snews://secnews.netscape.com/netscape.devs-java has a path of /netscape.devs-java. Although there’s some hierarchy to the newsgroup names indicated by the . between netscape and devs-java, it’s not visible as part of the URI.

The scheme part is composed of lowercase letters, digits, and the plus sign, period, and hyphen. It must begin with a lowercase letter. The other three parts of a typical URI (authority, path, and query) should each be composed of the ASCII alphanumeric characters; that is, the letters A-Z, a-z, and the digits 0-9. In addition, the punctuation characters - _ . ! ~ * ' ( and ) may also be used. All other characters, including non-ASCII alphanumerics such as á and π, should be escaped by a percent sign (%) followed by the hexadecimal code for the character. For instance, á would be encoded as %E1 and π would be encoded as %3C0. The latter assumes the underlying character set is 2-byte Unicode. The current draft of the URI specification does not yet provide a means of specifying the character set to be used. This is a deficiency that will be corrected in a future draft. A URL so transformed is said to have been “x-www-form-urlencoded”.
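The dependence on the character set is easy to demonstrate. The sketch below assumes the two-argument java.net.URLEncoder.encode(String, String) method (Java 1.4 or later); the class FormEncoding and its wrapper method are invented for the example. Note that URLEncoder follows the form-encoding rules, so it writes a space as + rather than %20, and that it has its own list of characters it leaves alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncoding {
    // Wraps URLEncoder.encode() so callers need not handle the checked exception.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unknown charset: " + charset, ex);
        }
    }

    public static void main(String[] args) {
        String aAcute = "\u00e1"; // the character á
        String pi = "\u03c0";     // the character π
        System.out.println(encode(aAcute, "ISO-8859-1")); // %E1
        System.out.println(encode(aAcute, "UTF-8"));      // %C3%A1
        System.out.println(encode(pi, "UTF-8"));          // %CF%80
        System.out.println(encode("Java I/O", "UTF-8"));  // Java+I%2FO
    }
}
```

The same character produces different escapes under different character sets, which is exactly why the sender and the receiver have to agree on one.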

Punctuation characters such as / and @ must also be encoded using percent escapes if they’re used in any role other than what’s specified for them in the scheme-specific part of a particular URI. For example, the forward slashes in the URI http://metalab.unc.edu/javafaq/books/javaio/ do not need to be encoded as %2F because they serve to delimit the hierarchy as specified for the http URI scheme. However, if a filename included a / character (for instance, if the last directory were named Java I/O instead of javaio to more closely match the name of the book), then the URI would have to be written as http://metalab.unc.edu/javafaq/books/Java%20I%2FO/. This is not as farfetched as it might sound to Unix or Windows users. Mac filenames often include a forward slash. Filenames on many platforms often contain other characters that need to be encoded, including @, $, +, =, and many more.
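One way to produce such a URI is to escape each path component separately before joining the components with slashes. The following is only a sketch: the PathSegmentEncoder class is hypothetical, the list of safe characters is the one given earlier in this section, and the choice of UTF-8 for turning characters into bytes is an assumption of the example.

```java
import java.nio.charset.StandardCharsets;

class PathSegmentEncoder {
    // ASCII letters and digits plus the punctuation that may appear unescaped.
    private static final String SAFE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.!~*'()";

    // Percent-escapes every byte of one path component that is not in SAFE.
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (SAFE.indexOf((char) value) >= 0) {
                out.append((char) value);
            } else {
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String uri = "http://metalab.unc.edu/javafaq/books/"
                   + encodeSegment("Java I/O") + "/";
        // prints: http://metalab.unc.edu/javafaq/books/Java%20I%2FO/
        System.out.println(uri);
    }
}
```

Because the slash inside the directory name is escaped before the components are joined, it can no longer be mistaken for a hierarchy delimiter.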

URNs

There are two types of URIs: Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). A URL is a pointer to a particular resource on the Internet at a particular location. For example, http://www.oreilly.com/catalog/javanp2/ is one of several URLs for the book Java Network Programming, 2nd edition. A URN is a name for a particular resource but without reference to a particular location. For instance, urn:isbn:1565928709 is a URN referring to the same book. As this example shows, URNs, unlike URLs, are not limited to Internet resources.

The goal of URNs is to handle resources that are mirrored in many different locations or that have moved from one site to another; they identify the resource itself, not the place where the resource lives. For instance, when given a URN for a particular piece of software, an FTP program should get the file from the nearest mirror site. Given a URN for a book, a browser might reserve the book for you at the local library or order a copy from a bookstore.

A URN has the general form:

urn:namespace:resource_name

The namespace is the name of a collection of certain kinds of resources maintained by some authority. The resource_name is the name of a resource within that collection. For instance, the URN urn:isbn:1565924851 identifies a resource in the isbn namespace with the identifier 1565924851. Of all the books published, this one selects the first edition of Java I/O.

The exact syntax of resource names depends on the namespace. The ISBN namespace expects to see strings composed of 10 characters, all of which are digits with the single exception that the last character may be a capital letter X instead. Other namespaces will use very different syntaxes for resource names. The IANA is responsible for handing out namespaces to different organizations, but the procedure isn’t really in place yet. URNs are still an area of active research and are not much used by current software. ISBN numbers are pretty much the only example established so far, and even those haven’t been officially standardized as URNs. Consequently, the rest of this book will use URLs exclusively.
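For illustration, a program could split a URN into its three parts and check the resource name against the ISBN syntax just described. This sketch checks only the shape of the string (10 characters, digits except for a possible final X); it does not verify the ISBN check digit. The class and method names are invented.

```java
class IsbnUrn {
    // Returns the ISBN from a urn:isbn: URN, or null if the string does not fit.
    static String isbnOf(String urn) {
        String[] parts = urn.split(":", 3); // urn, namespace, resource name
        if (parts.length != 3) return null;
        if (!parts[0].equalsIgnoreCase("urn")) return null;
        if (!parts[1].equalsIgnoreCase("isbn")) return null;
        // Nine digits followed by a digit or a capital X
        return parts[2].matches("[0-9]{9}[0-9X]") ? parts[2] : null;
    }

    public static void main(String[] args) {
        System.out.println(isbnOf("urn:isbn:1565924851"));      // 1565924851
        System.out.println(isbnOf("urn:isbn:12345"));           // null
        System.out.println(isbnOf("http://www.oreilly.com/"));  // null
    }
}
```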

URLs

A URL identifies the location of a resource on the Internet. It specifies the protocol used to access a server (e.g., FTP, HTTP), the name of the server, and the location of a file on that server. A typical URL looks like http://metalab.unc.edu/javafaq/javatutorial.html. This specifies that there is a file called javatutorial.html in a directory called javafaq on the server metalab.unc.edu, and that this file can be accessed via the HTTP protocol. The syntax of a URL is:

protocol://username@hostname:port/path/filename?query#fragment

Here the protocol is another word for what was called the scheme of the URI. (Scheme is the word used in the URI RFC. Protocol is the word used in the Java documentation.) In a URL, the protocol part can be file, ftp, http, https, gopher, news, telnet, wais, or various other strings (though not urn).

The hostname part of a URL is the name of the server that provides the resource you want, like www.oreilly.com or utopia.poly.edu. It can also be the server’s IP address, like 204.148.40.9 or 128.238.3.21. The username is an optional username for the server. The port number is also optional. It’s not necessary if the service is running on its default port (port 80 for HTTP servers).
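The java.net.URL class exposes these pieces through getter methods. The sketch below assumes a JDK recent enough to have URL.getDefaultPort() and URI.toURL() (Java 1.4 or later); the UrlParts class and its effectivePort() helper are hypothetical, and the port 8080 in the second URL is made up.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlParts {
    // Builds a URL, turning the checked exception into an unchecked one.
    static URL parse(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    // The explicit port if the URL names one, otherwise the protocol's default.
    static int effectivePort(String text) {
        URL url = parse(text);
        return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    }

    public static void main(String[] args) {
        URL url = parse("http://metalab.unc.edu/javafaq/javatutorial.html");
        System.out.println("protocol: " + url.getProtocol()); // http
        System.out.println("host:     " + url.getHost());     // metalab.unc.edu
        System.out.println("port:     " + url.getPort());     // -1, none given
        System.out.println("username: " + url.getUserInfo()); // null, none given
        System.out.println(effectivePort("http://utopia.poly.edu:8080/")); // 8080
    }
}
```

A return value of -1 from getPort() means only that the URL did not spell out a port; the connection itself would still use the default for the protocol.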

The path points to a particular directory on the specified server. The path is relative to the document root of the server, not necessarily to the root of the filesystem on the server. As a rule, servers that are open to the public do not show their entire filesystem to clients. Rather, they show only the contents of a specified directory. This directory is called the document root, and all paths and filenames are relative to it. Thus on a Unix workstation all files that are available to the public may be in /var/public/html, but to somebody connecting from a remote machine this directory looks like the root of the filesystem.
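A server might map the path of a URL onto its document root along the following lines. This is a simplified sketch rather than a description of how any real web server does it: the DocumentRoot class is invented, it assumes a Unix-style filesystem and the java.nio.file classes (Java 7 or later), and it ignores percent escapes and symbolic links.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class DocumentRoot {
    // Maps a URL path to a file below docRoot; returns null if it would escape.
    static Path toFile(Path docRoot, String urlPath) {
        // Drop leading slashes so the path is resolved against the document root
        String relative = urlPath.replaceFirst("^/+", "");
        Path candidate = docRoot.resolve(relative).normalize();
        // Refuse anything that ".." segments have pushed outside the root
        return candidate.startsWith(docRoot) ? candidate : null;
    }

    public static void main(String[] args) {
        Path root = Paths.get("/var/public/html");
        // /var/public/html/javafaq/javatutorial.html on a Unix system
        System.out.println(toFile(root, "/javafaq/javatutorial.html"));
        // null: the request tries to climb above the document root
        System.out.println(toFile(root, "/../../etc/passwd"));
    }
}
```

The final check is what keeps a client from seeing anything outside the directory the server has chosen to publish.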

The filename points to a particular file in the directory specified by the path. It is often omitted, in which case it is left to the server’s discretion what file, if any, to send. Many servers send an index file for that directory, often called index.html or Welcome.html. Others send a list of the files and folders in the directory, as shown in Figure 3-1. Others may send a 403 Forbidden error message, as shown in Figure 3-2.

Figure 3-1. A web server configured to send a directory list when no index file exists

Figure 3-2. A web server configured to send a 403 error when no index file exists

The fragment is used to reference a named anchor in an HTML document. Some documents refer to the fragment part of the URL as a “section”; Java documents rather unaccountably refer to the section as a “Ref”. A named anchor is created in an HTML document with a tag like this:

<A NAME="xtocid1902914">Comments</A>

This tag identifies a particular point in a document. To refer to this point, a URL includes not only the document’s filename, but also the named anchor separated from the rest of the URL by a #:

http://metalab.unc.edu/javafaq/javafaq.html#xtocid1902914

Finally, the query string provides additional arguments for the server. It’s commonly used only in http URLs, where it contains form data for input to CGI programs. This will be discussed further later on.
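In Java, the fragment and the query string are available from the URL class as getRef() and getQuery(). A brief sketch, using URLs that appear earlier in this chapter and assuming Java 1.4 or later for URI.toURL(); the UrlTail class and its helpers are invented for the example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlTail {
    static URL parse(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    // The named anchor after the #, or null if there is none.
    static String ref(String text) {
        return parse(text).getRef();
    }

    // The query string after the ?, or null if there is none.
    static String query(String text) {
        return parse(text).getQuery();
    }

    public static void main(String[] args) {
        String anchored = "http://metalab.unc.edu/javafaq/javafaq.html#xtocid1902914";
        String form =
            "http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=1565924851";
        System.out.println(ref(anchored));  // xtocid1902914
        System.out.println(query(form));    // theisbn=1565924851
        // getFile() returns the path together with the query string
        System.out.println(parse(form).getFile());
    }
}
```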

Relative URLs

A URL tells the web browser a lot about a document: the protocol used to retrieve the document, the name of the host where the document lives, and the path to that document on the host. Most of this information is likely to be the same for other URLs that are referenced in the document. Therefore, rather than requiring each URL to be specified in its entirety, a URL may inherit the protocol, hostname, and path of its parent document (i.e., the document in which it appears). URLs that aren’t complete but inherit pieces from their parent are called relative URLs. In contrast, a completely specified URL is called an absolute URL. In a relative URL, any pieces that are missing are assumed to be the same as the corresponding pieces from the URL of the document in which the URL is found. For example, suppose that while browsing http://metalab.unc.edu/javafaq/javatutorial.html you click on this hyperlink:

<a href="javafaq.html">

Your browser cuts javatutorial.html off the end of http://metalab.unc.edu/javafaq/javatutorial.html to get http://metalab.unc.edu/javafaq/. Then it attaches javafaq.html onto the end of http://metalab.unc.edu/javafaq/ to get http://metalab.unc.edu/javafaq/javafaq.html. Finally, it loads that document.

If the relative link begins with a /, then it is relative to the document root instead of relative to the current file. Thus, if you click on the following link while browsing http://metalab.unc.edu/javafaq/javatutorial.html:

<a href="/boutell/faq/www_faq.html">

your browser would throw away /javafaq/javatutorial.html and attach /boutell/faq/www_faq.html to the end of http://metalab.unc.edu to get http://metalab.unc.edu/boutell/faq/www_faq.html.
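Both resolutions can be reproduced in a few lines. The sketch below uses the resolve() method of java.net.URI (assumed available, Java 1.4 or later); the RelativeLinks class is hypothetical.

```java
import java.net.URI;

class RelativeLinks {
    // Resolves a possibly relative link against the URL of the document it is in.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://metalab.unc.edu/javafaq/javatutorial.html";
        // prints: http://metalab.unc.edu/javafaq/javafaq.html
        System.out.println(resolve(base, "javafaq.html"));
        // prints: http://metalab.unc.edu/boutell/faq/www_faq.html
        System.out.println(resolve(base, "/boutell/faq/www_faq.html"));
    }
}
```

The URL class offers a similar facility through its constructor that takes a context URL and a string.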

Relative URLs have a number of advantages. First and least, they save a little typing. More importantly, relative URLs allow a single document tree to be served by multiple protocols; for instance, both FTP and HTTP. The HTTP might be used for direct surfing while the FTP could be used for mirroring the site. Most importantly of all, relative URLs allow entire trees of HTML documents to be moved or copied from one site to another without breaking all the internal links.
