The Hypertext Transfer Protocol (HTTP) is the protocol used on the World Wide Web. This section presents a procedure to fetch pages or images from a server on the Web. Items on the Web are identified with a Uniform Resource Locator (URL) that specifies a host, port, and location on the host. The basic outline of HTTP is that a client sends a URL to a server, and the server responds with some header information and some content data. The header information describes the content, which can be hypertext, images, PostScript, and more.
proc Http_Open {url} {
   global http
   # Split the URL into server, optional port, and path.
   if {![regexp -nocase {^(http://)?([^:/]+)(:([0-9]+))?(/.*)} $url \
         x protocol server y port path]} {
      error "bogus URL: $url"
   }
   if {[string length $port] == 0} {
      set port 80
   }
   set sock [socket $server $port]
   puts $sock "GET $path HTTP/1.0"
   puts $sock "Host: $server"
   puts $sock "User-Agent: Tcl/Tk Http_Open"
   puts $sock ""
   flush $sock
   return $sock
}
The Http_Open procedure uses regexp to pick out the server and port from the URL. This regular expression is described in detail on page 149. The leading http:// is optional, and so is the port number. If the port is left off, then the standard port 80 is used. If the regular expression matches, then a socket command opens the network connection.
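The URL matching can be tried on its own, without opening a socket. The following is a minimal sketch; UrlSplit is a hypothetical helper that is not part of the examples in this chapter, and www.example.com is a placeholder host.

```tcl
# UrlSplit is a hypothetical helper used only for illustration.
# It returns a list of server, port, and path.
proc UrlSplit {url} {
    if {![regexp -nocase {^(http://)?([^:/]+)(:([0-9]+))?(/.*)$} $url \
            match protocol server portSpec port path]} {
        error "bogus URL: $url"
    }
    # An omitted port leaves the variable empty, so supply the default.
    if {[string length $port] == 0} {
        set port 80
    }
    return [list $server $port $path]
}

puts [UrlSplit http://www.example.com/index.html]
# prints: www.example.com 80 /index.html
puts [UrlSplit www.example.com:8080/a/b.gif]
# prints: www.example.com 8080 /a/b.gif
```

Note that the pattern requires a path, so a URL with nothing after the server name is rejected as bogus.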
The protocol begins with the client sending a line that identifies the command (GET), the path, and the protocol version. The path is the part of the URL after the server and port specification. The rest of the request is lines in the following format:
key: value
The Host line names the server being addressed, which lets a single server respond to more than one server name. The User-Agent identifies the client program, which is often a browser like Netscape Navigator or Internet Explorer. The key-value lines are terminated with a blank line. This data is flushed out of the Tcl buffering system with the flush command. The server will respond by sending the URL contents back over the socket. This is described shortly, but first we consider proxies.
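To see the shape of a complete request, it can help to build it as a string instead of writing it to a socket. This is only a sketch: HttpRequestText is a hypothetical name, and the host and path are placeholders.

```tcl
# HttpRequestText is a hypothetical helper used only for illustration.
# It returns the request line, the key-value lines, and the
# terminating blank line as one string.
proc HttpRequestText {server path} {
    set lines [list \
        "GET $path HTTP/1.0" \
        "Host: $server" \
        "User-Agent: Tcl/Tk Http_Open" \
        ""]
    # The empty last element plus the final newline yields the
    # blank line that ends the request header.
    return [join $lines \n]\n
}

puts -nonewline [HttpRequestText www.example.com /index.html]
```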
A proxy is used to get through firewalls that many organizations set up to isolate their network from the Internet. The proxy accepts HTTP requests from clients inside the firewall and then forwards the requests outside the firewall. It also relays the server's response back to the client. The protocol is nearly the same when using the proxy. The difference is that the complete URL is passed to the GET command so that the proxy can locate the server. Example 17-6 uses a proxy if one is defined:
# Http_Proxy sets or queries the proxy
proc Http_Proxy {{new {}}} {
   global http
   if {[string length $new] == 0} {
      # Query: return host:port, or {} if no proxy is defined.
      if {![info exists http(proxy)]} {
         return {}
      }
      return $http(proxy):$http(proxyPort)
   } else {
      regexp {^([^:]+):([0-9]+)$} $new x http(proxy) http(proxyPort)
   }
}
proc Http_Open {url {cmd GET} {query {}}} {
   global http
   if {![regexp -nocase {^(http://)?([^:/]+)(:([0-9]+))?(/.*)} $url \
         x protocol server y port path]} {
      error "bogus URL: $url"
   }
   if {[string length $port] == 0} {
      set port 80
   }
   if {[info exists http(proxy)] && [string length $http(proxy)]} {
      # Through a proxy, send the complete URL.
      set sock [socket $http(proxy) $http(proxyPort)]
      puts $sock "$cmd http://$server:$port$path HTTP/1.0"
   } else {
      set sock [socket $server $port]
      puts $sock "$cmd $path HTTP/1.0"
   }
   puts $sock "User-Agent: Tcl/Tk Http_Open"
   puts $sock "Host: $server"
   if {[string length $query] > 0} {
      puts $sock "Content-Length: [string length $query]"
      puts $sock ""
      puts $sock $query
   }
   puts $sock ""
   flush $sock
   fconfigure $sock -blocking 0
   return $sock
}
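The only difference a proxy makes to the request is in its first line. As a small sketch of that difference, the hypothetical HttpRequestLine procedure below (not part of the examples in this chapter) returns the line that would be sent in each case.

```tcl
# HttpRequestLine is a hypothetical helper used only for illustration.
# viaProxy is a boolean that selects the proxy form of the request line.
proc HttpRequestLine {cmd server port path {viaProxy 0}} {
    if {$viaProxy} {
        # A proxy needs the complete URL to locate the server.
        return "$cmd http://$server:$port$path HTTP/1.0"
    }
    # A direct connection sends only the path.
    return "$cmd $path HTTP/1.0"
}

puts [HttpRequestLine GET www.example.com 80 /index.html]
# prints: GET /index.html HTTP/1.0
puts [HttpRequestLine GET www.example.com 80 /index.html 1]
# prints: GET http://www.example.com:80/index.html HTTP/1.0
```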
In Example 17-6, the Http_Open procedure takes a cmd parameter so that the user of Http_Open can perform different operations. The GET operation fetches the contents of a URL. The HEAD operation just fetches the description of a URL, which is useful to validate a URL. The POST operation transmits query data to the server (e.g., values from a form) and also fetches the contents of the URL. All of these operations follow a similar protocol. The reply from the server is a status line followed by lines that have key-value pairs. This format is similar to the client's request. The reply header is followed by content data with GET and POST operations. Example 17-7 implements the HEAD command, which does not involve any reply data:
The Http_Head procedure uses Http_Open to contact the server. The HttpHeader procedure is registered as a fileevent handler to read the server's reply. A global array keeps state about each operation. The URL is used in the array name, and upvar is used to create an alias to the name (upvar is described on page 86):
upvar #0 $url state
You cannot use the upvar alias as the variable specified to vwait. Instead, you must use the actual name. The backslash turns off the array reference in order to pass the name of the array element to vwait; otherwise, Tcl tries to reference url as an array:
vwait $url\(status)
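The alias and the wait can be exercised without a network connection. In the sketch below, DemoWait is a hypothetical procedure, and an after event stands in for the fileevent handler that would normally set the status; the URL is a placeholder.

```tcl
# DemoWait is a hypothetical procedure used only for illustration.
proc DemoWait {url} {
    # Alias the global array named by the URL as the local name "state".
    upvar #0 $url state
    set state(status) pending
    # Stand-in for a fileevent handler: set the status a little later.
    after 10 [list set ${url}(status) ok]
    # vwait needs the real global name, not the alias.
    vwait $url\(status)
    return $state(status)
}

puts [DemoWait http://www.example.com/]
# prints: ok
```

The after script runs at the global level, so it also has to use the real array name.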
The HttpHeader procedure checks for special cases: end of file, an error on the gets, or a short read on a nonblocking socket. The very first reply line contains a status code from the server that is in a different format than the rest of the header lines:
version code message
The version is the protocol version (e.g., HTTP/1.0). The code is a three-digit numeric code. 200 is OK. Codes in the 400s and 500s indicate an error. The codes are explained fully in RFC 1945, which specifies HTTP 1.0. The first line is saved with the key http:
set state(headers) [list http $line]
The rest of the header lines are parsed into key-value pairs and appended onto state(headers). This format can be used to initialize an array:
array set header $state(headers)
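As a sketch of how such a list might be built, the hypothetical HeaderList procedure below takes reply lines that have already been read and returns the key-value list. Lowercasing the keys is an assumption made here so that the array elements have predictable names; the sample reply lines are made up for illustration.

```tcl
# HeaderList is a hypothetical helper used only for illustration.
proc HeaderList {lines} {
    # The status line is stored under the key "http".
    set headers [list http [lindex $lines 0]]
    foreach line [lrange $lines 1 end] {
        # Split "Key: value" at the first colon.
        if {[regexp {^([^:]+): *(.*)$} $line match key value]} {
            lappend headers [string tolower $key] $value
        }
    }
    return $headers
}

set reply [list "HTTP/1.0 200 OK" "Server: Apache/1.1.1" \
    "Content-Type: text/html"]
array set header [HeaderList $reply]
puts $header(content-type)
# prints: text/html
```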
When HttpHeader gets an empty line, the header is complete and it sets the state(status) variable, which signals Http_Head. Finally, Http_Head returns the status to its caller. The complete information about the request is still in the global array named by the URL. Example 17-8 illustrates the use of Http_Head:
set url http://www.sun.com/
set status [Http_Head $url]
=> eof
upvar #0 $url state
array set info $state(headers)
parray info
info(http)           HTTP/1.0 200 OK
info(server)         Apache/1.1.1
info(last-modified)  Nov ...
info(content-type)   text/html
Example 17-9 shows Http_Get, which implements the GET and POST requests. The difference between these is that POST sends query data to the server after the request header. Both operations get a reply from the server that is divided into a descriptive header and the content data. The Http_Open procedure sends the request and the query, if present, and reads the reply header. Http_Get reads the content.
The descriptive header returned by the server is in the same format as the client's request. One of the key-value pairs returned by the server specifies the Content-Type of the URL. The types come from the MIME standard, which is described in RFC 1521. Typical content types are:
text/html — HyperText Markup Language (HTML), which is introduced in Chapter 3.
application/x-tcl — a Tcl program! This type is discussed in Chapter 20.
Http_Get uses Http_Open to initiate the request, and then it looks for errors. It handles redirection errors that occur if a URL has changed. These have error codes that begin with 3. A common case of this error is when a user omits the trailing slash on a URL (e.g., http://www.scriptics.com). Most servers respond with:
302 Document has moved
Location: http://www.scriptics.com/
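One way to detect a redirection is to pull the code out of the saved status line and look up the Location value. This is only a sketch: HttpStatusCode and HttpRedirectTarget are hypothetical names, and the headers list is assumed to have the format used for state(headers), with lowercase keys.

```tcl
# Both procedures are hypothetical helpers used only for illustration.
proc HttpStatusCode {line} {
    # Expect a line such as "HTTP/1.0 302 Document has moved".
    if {[regexp {^HTTP/[0-9.]+ +([0-9][0-9][0-9])} $line match code]} {
        return $code
    }
    error "bad status line: $line"
}
proc HttpRedirectTarget {headers} {
    array set h $headers
    set code [HttpStatusCode $h(http)]
    # Codes that begin with 3 indicate a redirection.
    if {[string match 3?? $code] && [info exists h(location)]} {
        return $h(location)
    }
    return {}
}

set headers [list http "HTTP/1.0 302 Document has moved" \
    location http://www.scriptics.com/]
puts [HttpRedirectTarget $headers]
# prints: http://www.scriptics.com/
```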
If the content-type is text, then Http_Get sets up a fileevent handler to read this data into memory. The socket is in nonblocking mode, so the read handler can read as much data as possible each time it is called. This is more efficient than using gets to read a line at a time. The text will be stored in the state(body) variable for use by the caller of Http_Get. Example 17-10 shows the HttpGetText fileevent handler:
The content may be in binary format. This poses a problem for Tcl 7.6 and earlier. A null character will terminate the value, so values with embedded nulls cannot be processed safely by Tcl scripts. Tcl 8.0 supports strings and variable values with arbitrary binary data. Example 17-9 uses fcopy to copy data from the socket to a file without storing it in Tcl variables. This command was introduced in Tcl 7.5 as unsupported0, and became fcopy in Tcl 8.0. It takes a callback argument that is invoked when the copy is complete. The callback gets additional arguments that are the bytes transferred and an optional error string. In this case, these arguments are added to the url argument specified in the fcopy command. Example 17-11 shows the HttpCopyDone callback:
The user of Http_Get uses the information in the state array to determine the status of the fetch and where to find the content. There are four cases to deal with:
There was an error, which is indicated by the state(error) element.
There was a redirection, in which case, the new URL is in state(link). The client of Http_Get should change the URL and look at its state instead. You can use upvar to redefine the alias for the state array:
upvar #0 $state(link) state
There was text content. The content is in state(body).
There was another content type that was copied to state(filename).
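A caller might sort out these four cases with a chain of tests. The sketch below assumes that each element exists only when its case applies, which is an assumption about Http_Get rather than something shown here; HttpSummary is a hypothetical name and the URL is a placeholder.

```tcl
# HttpSummary is a hypothetical helper used only for illustration.
# It assumes an element is set only when the matching case occurred.
proc HttpSummary {url} {
    upvar #0 $url state
    if {[info exists state(error)]} {
        return "error: $state(error)"
    } elseif {[info exists state(link)]} {
        return "moved to $state(link)"
    } elseif {[info exists state(body)]} {
        return "text, [string length $state(body)] characters"
    } elseif {[info exists state(filename)]} {
        return "saved in $state(filename)"
    }
    return "no result yet"
}

# Fake the result of a text fetch to try the procedure.
set url http://www.example.com/
set ${url}(body) "<html></html>"
puts [HttpSummary $url]
# prints: text, 13 characters
```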
The fcopy command can do a complete copy in the background. It automatically sets up fileevent handlers, so you do not have to use fileevent yourself. It also manages its buffers efficiently. The general form of the command is:
fcopy input output ?-size size? ?-command callback?
The -command argument makes fcopy work in the background. When the copy is complete or an error occurs, the callback is invoked with one or two additional arguments: the number of bytes copied and, if an error occurred, an error string:
fcopy $in $out -command [list CopyDone $in $out]
proc CopyDone {in out bytes {error {}}} {
   close $in ; close $out
}
With a background copy, the fcopy command transfers data from input until end of file or size bytes have been transferred. If no -size argument is given, then the copy goes until end of file. It is not safe to do other I/O operations with input or output during a background fcopy. If either input or output gets closed while the copy is in progress, the current copy is stopped. If the input is closed, then all data already queued for output is written out.
Without a -command argument, the fcopy command reads as much as possible depending on the blocking mode of input and the optional size parameter. Everything it reads is queued for output before fcopy returns. If output is blocking, then fcopy returns after the data is written out. If input is blocking, then fcopy can block attempting to read size bytes or until end of file.
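A background copy can be tried with two ordinary files in place of a socket. The following is a minimal sketch: CopyFile, CopyFileDone, the copyDone variable, and the file names are all hypothetical, and vwait is used only so that the script waits for the callback.

```tcl
# CopyFile and CopyFileDone are hypothetical names used for illustration.
proc CopyFile {src dst} {
    global copyDone
    set in [open $src r]
    set out [open $dst w]
    # Copy the bytes unchanged.
    fconfigure $in -translation binary
    fconfigure $out -translation binary
    fcopy $in $out -command [list CopyFileDone $in $out]
    # Enter the event loop until the callback reports a result.
    vwait copyDone
    return $copyDone
}
proc CopyFileDone {in out bytes {error {}}} {
    global copyDone
    close $in
    close $out
    if {[string length $error] != 0} {
        set copyDone "error: $error"
    } else {
        set copyDone $bytes
    }
}

set f [open fcopy-demo.in w]
puts -nonewline $f "hello, world"
close $f
puts [CopyFile fcopy-demo.in fcopy-demo.out]
# prints: 12
file delete fcopy-demo.in fcopy-demo.out
```

The callback closes both channels itself because nothing else should touch them while the copy is in progress.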