Connecting with HTTP

While urllib is suitable for retrieving files from the Internet, you may still need to carry out more intricate communication with an HTTP server. For example, if you are writing a Python program to communicate between two web sites, you may need to adjust the headers to include any cookies the site may require. You may need to emulate a certain browser type (by placing its name in your User-Agent header) if the site requires the latest version of Internet Explorer. Working with httplib rather than urllib in cases such as these allows for finer control.

HTTP Conversations

HTTP conversations between browsers and servers involve headers and data. The interaction between a web browser and a web server reveals a great deal of information about both parties. The HTTP headers that precede content from the server and precede requests from the browser contain a lot of metadata about both client and server. For example, when you type a URL into your browser and press return, a complete HTTP request is sent to the remote server that can look something like this:

GET /c7/favquote.cgi HTTP/1.1
Host: www.python.org
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
Connection: Keep-Alive

The headers tell the web server a great deal about the capabilities of the client browser. From the first line of the headers (GET /c7/favquote.cgi HTTP/1.1), we can tell that the request type is a GET, the target file is /c7/favquote.cgi, and the HTTP version in use is 1.1. Beyond this essential information is data telling the server what file types the browser can accept, what the browser is, and what type of HTTP connection to use. Notice that three lines in the HTTP headers start with Accept. The first tells the server that your browser can handle GIF and JPEG images, along with the other types listed. The other two show that the browser accepts en-us as its language, and both gzip and deflate as encodings:

Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg
Accept-Language: en-us
Accept-Encoding: gzip, deflate

The User-Agent informs the server which browser or HTTP client you’re using. Every browser (and every HTTP library) populates this field with one thing or another, letting web sites know how people are visiting. Some web sites are designed specifically to use the features of either Netscape Navigator or Internet Explorer and may redirect browsers to one set of pages or another based on what they see in the User-Agent string:

User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)

This User-Agent is Internet Explorer 5.5 running on Microsoft’s desktop platform. The last header in the previous example tells the server what type of connection to use. In this case, it’s Keep-Alive:

Connection: Keep-Alive

The Keep-Alive connection type tells the server to keep the socket between the client and the server open for additional resources. Typically, when downloading a web page, an initial request is made to retrieve the HTML page itself, and then a series of subsequent requests are made to retrieve images referenced with the <img> tag, as well as linked stylesheets and framesets. Initiating a new connection for each one of these resources would be very time-consuming, especially considering the graphics-laden web pages in use today. The Keep-Alive option lets the browser use the same channel it has already established to bring down all of the additional resources.
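The anatomy described in this section can be sketched in a few lines of Python. The following is a minimal illustration written for Python 3, not a full HTTP parser; the function name parse_request is hypothetical, and the sample text is the request shown earlier with its line endings written out explicitly.

```python
def parse_request(raw):
    """Split a raw HTTP request into request line, headers, and body."""
    # A blank line (CRLF CRLF) separates the headers from any data.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # First line: request type, target file, HTTP version.
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body


RAW = (
    "GET /c7/favquote.cgi HTTP/1.1\r\n"
    "Host: www.python.org\r\n"
    "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg\r\n"
    "Accept-Language: en-us\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)

method, target, version, headers, body = parse_request(RAW)
print(method, target, version)
print(headers["user-agent"])
```

Header names are lowercased here because HTTP treats them as case-insensitive; that choice belongs to this sketch, not to any particular library.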

Request Types

In addition to the GET request, three other request types are commonly encountered. A browser can GET a file. It can also POST data to a file, such as sending a form to a CGI script. The two lesser-known methods are HEAD and PUT. HEAD is used just to retrieve web server headers, and PUT is used to send a file to the server.

GET

Requests a file, and optionally contains a query string used by the file (if it’s a CGI or other executable) to generate dynamic data.

POST

Sends URL-encoded data to the server in a large chunk. Frequently used to send form fields to the server. Anything POSTed can also be sent via GET, but the difference is the query string becomes large and may look unsightly or be unmanageable in the browser when doing a GET. A POST is not carried on the query string, and so is not visible to an end user. Some servers may allow less data on a query string, and accept bigger chunks of data in a POST request.

HEAD

Similar to GET, but returns only the headers.

PUT

A seldom-used HTTP method to place files on the server. When using this method, the filename that normally goes after a GET or a POST request is used as the filename for the content being delivered to the server. In other words, instead of telling the server you want to GET a page called /index.html, tell it you are going to PUT a file named /index.html on the server. During a PUT operation, the contents of the file are sent after the headers, just like in a POST operation.
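The practical difference between GET and POST is where the data travels. The sketch below, written for Python 3, assembles the raw text of a request either way so the two layouts can be compared. The helper build_request, the host name, and the field values are assumptions made for illustration; real code would normally let an HTTP library produce these bytes.

```python
from urllib.parse import urlencode


def build_request(method, path, host, fields=None):
    """Return the raw text of a minimal HTTP/1.1 request (illustrative only)."""
    encoded = urlencode(fields or {})
    headers = [("Host", host)]
    body = ""
    if method == "POST":
        # POST: the data follows the blank line that ends the headers.
        body = encoded
        headers.append(("Content-Type", "application/x-www-form-urlencoded"))
        headers.append(("Content-Length", str(len(body.encode("ascii")))))
    elif encoded:
        # GET and HEAD: the data rides on the query string.
        path = path + "?" + encoded
    lines = ["%s %s HTTP/1.1" % (method, path)]
    lines += ["%s: %s" % pair for pair in headers]
    return "\r\n".join(lines) + "\r\n\r\n" + body


get_text = build_request("GET", "/handler.cgi", "www.example.com", {"id": "12345"})
post_text = build_request("POST", "/handler.cgi", "www.example.com", {"id": "12345"})
print(get_text)
print(post_text)
```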

Getting a Document with Python

To manually use HTTP from Python, use httplib. The httplib module is standard and ships with Python 2.x. The HTTP class within httplib features several important methods for connecting to the server:

Req = HTTP( address, [port] )

Returns an instance of the HTTP class for use as a request object in this connection. The connection is also made to the given address and optional port. Most web servers run on port 80, and you would not need to supply this argument. However, some web sites are kept on different ports and this option then gives you the ability to select a specific port.

Req.putrequest( method, file )

Performs the initial HTTP request method with its accompanying filename and HTTP version indicator. This is the first line of the headers, as in:

GET /c7/favquote.cgi HTTP/1.1

Req.putheader( header-type, value )

Adds a new header to the request. This method would be used to add your custom User-Agent string, Accept-Types, or cookies.

Req.endheaders( )

Instructs the module to finish off the headers sent to the server. In HTTP, the headers are separated from data with a blank line. That is, when the server is sending back an HTML page, it gives the browser its headers, followed by a blank line, followed by the HTML document. Conversely, when the browser is making a form POST to a server, it does the same thing and separates its request headers from its post data with a blank line.

Req.send( data )

Sends data after your request. The send method must be called after endheaders to speak proper HTTP. You use this method when making a POST or PUT.

ErrorCode, ErrorMessage, Headers = Req.getreply( )

Gives you the server’s headers and response code in one swoop. If everything is going well, ErrorCode should be 200 and ErrorMessage should be OK. The Headers object is actually an instance of mimetools.Message.

fp = Req.getfile( )

Returns a file-like object that you can use to access the actual HTML (or other) document.

Using the HTTP class from httplib is simple. Example 8-2 shows how to connect to a server and retrieve the index.html page.

Example 8-2. Making an HTTP Request
>>> from httplib import HTTP
>>> req = HTTP("www.example.com")
>>> req.putrequest("GET", "/index.html")
>>> req.putheader("Accept", "text/html")
>>> req.putheader("User-Agent", "MyPythonScript")
>>> req.endheaders(  )
>>> ec, em, h = req.getreply(  )
>>> print ec, em
200 OK
>>> fd = req.getfile(  )
>>> textlines = fd.read(  )
>>> fd.close(  )

The steps taken in Example 8-2 are straightforward. The HTTP class is called with an argument indicating the server, www.example.com. Next, headers are added describing some minimal information for the server including the type of data expected and the name of your user agent as MyPythonScript. The error code and error message are retrieved with a call to getreply and the result is printed to the console:

>>> print ec, em
200 OK

Next, getfile is used to retrieve the file-like object containing the document contents. You read from this object just as you would from an open file. After a call to read, the result is assigned to the variable textlines, which now contains the actual document. Calling close finishes off the request. You can print textlines to see what you have retrieved:

>>> print textlines
<html>
<head>
<link rel="stylesheet" type="text/css" href="/zpath.css">
</head>
<body BGCOLOR="#CDFF00">
<p>
<table WIDTH=100% height=100%>
    <tr>
        <td VALIGN="top" ALIGN="left">
            <a HREF="/cgi-bin/start.cgi?page=top">
                <img SRC="images/zplogo.gif" WIDTH=457 
                      HEIGHT=144 BORDER=0>
            </a>
        </td>
    </tr>
</table>
</p>
</body>
</html>

Of special note in this request is the User-Agent string. Most web site administrators run access reports and generate neat sets of statistics detailing the browser types in use. By writing your own Python Internet programs, you can add to the statistics. In Example 8-2, we set the User-Agent string to MyPythonScript by calling:

req.putheader("User-Agent", "MyPythonScript")

This is captured in the server logs, and most likely shows up in the less-than-one-percent category of the site administrator’s browser statistics.
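For readers on Python 3, the same request can be sketched with http.client, the renamed successor of httplib. The HTTP class no longer exists there; HTTPConnection offers putrequest, putheader, and endheaders, and a single getresponse call takes the place of getreply and getfile. So that the sketch runs without Internet access, it starts a throwaway server on localhost; that server, its page, and the seen_agents list are assumptions of this example and stand in for www.example.com.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Hello from the stand-in server</body></html>"
seen_agents = []  # User-Agent strings the stand-in server has received


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        seen_agents.append(self.headers.get("User-Agent"))
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", server.server_address[1])
conn.putrequest("GET", "/index.html")
conn.putheader("Accept", "text/html")
conn.putheader("User-Agent", "MyPythonScript")
conn.endheaders()

resp = conn.getresponse()  # status, reason, headers, and body in one object
status, reason = resp.status, resp.reason
text = resp.read().decode("ascii")
conn.close()
server.shutdown()
server.server_close()

print(status, reason)
print(text)
print(seen_agents)
```

The last line prints what the server recorded, which is the custom User-Agent string an administrator would find in the logs.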

Building a Query String with httplib

Example 8-2 shows how to request a specific file. Say you’d like to also add a query string to your GET request. The second argument to HTTP.putrequest is the filename you’re after. To add a query string to the HTTP request, you could couple the filename with your data, as shown here, properly URL-encoded:

req.putrequest("GET", "/handler.cgi?id=12345")

If you need to encode your data because it contains special characters, you could use urllib’s quote function, described earlier in this chapter in Section 8.2.2. Quote only the value, so that the = separating it from the key is left intact:

req.putrequest("GET", "/handler.cgi?numbers=" + quote("1/2/3/4/5"))
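As a rough Python 3 illustration of building query strings, the sketch below uses quote and urlencode, which live in urllib.parse in Python 3 rather than in urllib. The field names and values are made up for the example. Note that quote leaves the slash alone unless told otherwise, and that urlencode escapes keys and values while leaving the = and & separators intact.

```python
from urllib.parse import quote, urlencode

# quote() escapes a single value; "/" is escaped only when safe="" is passed.
value = quote("1/2/3/4/5", safe="")
path_one = "/handler.cgi?numbers=" + value

# urlencode() builds the whole query string from key/value pairs.
path_two = "/handler.cgi?" + urlencode({"id": 12345, "name": "Tom & Jerry"})

print(path_one)
print(path_two)
```

Either path could then be passed as the second argument to putrequest.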

Baking Cookies for the Server

Any hungry server administrator may be disappointed to learn that the cookies your browser sends to his web site are electronic. Cookies are frequently delivered to web servers by browsers to indicate a special identification for your browser. Your browser keeps the cookie and returns it whenever the same web site or document is requested. This lets the web server personalize site content for you, or connect you with some specific data that may be held in a database, such as your profile information or virtual shopping cart. If you are writing Python scripts to go between web sites, you may need to send cookies in your headers. You use the putheader method of the HTTP class to do so, as shown here:

req.putheader("Cookie", "key=value")

Conversely, when the server sends cookies to your browser, a Set-Cookie header is thrown into the mix with the other headers and digested by your browser.
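The round trip can be sketched with the standard http.cookies module in Python 3 (the Cookie module in Python 2). The cookie name and value below are hypothetical. The sketch parses what a server might send in Set-Cookie and then builds the value a client would hand to putheader("Cookie", ...).

```python
from http.cookies import SimpleCookie

# A Set-Cookie value as a server might send it (hypothetical name and value).
set_cookie_value = "sessionid=abc123; Path=/"

jar = SimpleCookie()
jar.load(set_cookie_value)

# Only the name=value pairs go back to the server; attributes such as
# Path stay with the client.
cookie_header = "; ".join(
    "%s=%s" % (name, morsel.value) for name, morsel in jar.items()
)
print(cookie_header)
```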

Performing a POST Operation

If you are manually using HTTP from Python, odds are you’re moving documents around. You may be hitting one URL to get information from a database, constructing a form, and submitting that data to another web site via the POST operation. Creating a POST with httplib is straightforward, but more intricate than the examples shown thus far in this chapter. This method is detailed in the following sections.

Creating a POST catcher

Any example illustrating a POST is of no value without something to post to. So, for the purpose of this example, you can create a simple CGI script that echoes back your posted data. To use it, place this ten-line file, shown in Example 8-3, in a CGI-capable directory of your web server as favquote.cgi.

Example 8-3. favquote.cgi
#!/usr/bin/python

import cgi
form = cgi.FieldStorage(  )
favquote = form["favquote"].value

print "Content-type: text/html"
print ""
print "<html><body>"
print "Your quote is: "
print favquote
print "</body></html>"

This simple CGI uses the cgi module to retrieve the data sent in the post. We make a post to this CGI in the next section.
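To make it clearer what the script receives, here is a rough Python 3 sketch of the same logic done by hand: read CONTENT_LENGTH bytes from standard input and decode the form fields with urllib.parse.parse_qs. It is not a replacement for the cgi module's full behavior. The function name handle_post is hypothetical, and the html.escape call is an addition of this sketch so that posted markup is not echoed verbatim.

```python
import html
import io
from urllib.parse import parse_qs


def handle_post(environ, stdin):
    """Build the CGI response for a form POST carrying a favquote field."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = stdin.read(length).decode("ascii")
    fields = parse_qs(raw)  # maps each key to a list of decoded values
    favquote = fields.get("favquote", [""])[0]
    return (
        "Content-type: text/html\n"
        "\n"
        "<html><body>\n"
        "Your quote is: %s\n"
        "</body></html>\n" % html.escape(favquote)
    )


# Simulate the web server handing the script a posted body.
body = b"favquote=I%20think%20therefore%20I%20am."
response = handle_post({"CONTENT_LENGTH": str(len(body))}, io.BytesIO(body))
print(response)
```

In a real CGI script, the two arguments would be os.environ and sys.stdin.buffer.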

Ensuring proper URL encoding

One of the more interesting points of making a POST is ensuring that your data is properly URL-encoded. Only the value should be encoded: if your CGI script is looking for a variable named favquote, the favquote key and the = sign that follows it must be left as they are. For example, a proper key=value pair should be:

favquote=This%20is%20my%20quote%3a%20%22I%20think%20therefore%20I%20am.%22

However, if in your enthusiasm you quote the entire string and not just the value portion, you wind up with:

favquote%3dThis%20is%20my%20favorite%20quote%3a%20...

Unfortunately, the server will not know what to do with this second, flawed string. Because favquote= has been transformed into favquote%3d, there is no key to associate the value with.
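The difference is easy to see by decoding both strings the way a server-side form parser would. This small Python 3 sketch uses urllib.parse.parse_qs as a stand-in for that parser; the variable names good and bad are only labels for the two cases.

```python
from urllib.parse import parse_qs, quote

myquote = 'This is my quote: "I think therefore I am."'

good = "favquote=" + quote(myquote)  # only the value is escaped
bad = quote("favquote=" + myquote)   # the "=" separator is escaped too

print(good)
print(parse_qs(good))  # the favquote key is found
print(bad)
print(parse_qs(bad))   # no key at all, so nothing is returned
```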

Performing a POST with httplib

In Example 8-2, we created a GET request using the methods of httplib. Performing a POST requires a couple of extra method calls and a very precise order of events. The sequence of the HTTP calls is important, and making a POST requires extra headers. The start of a POST request is similar to a GET, as shown here:

req = HTTP("192.168.1.23")
req.putrequest("POST", "/c7/favquote.cgi")
req.putheader("Accept", "text/html")
req.putheader("User-Agent", "MyPythonScript")

Note that this time, the first argument to putrequest is POST. Beyond the change from GET to POST, the call to putrequest looks the same. When posting data to the server, it’s important that the server know exactly how many bytes of data the HTTP client is sending. While the HTTP headers rely on line breaks and a blank line as field delimiters, posted data may contain all sorts of special characters, binary data, or other nonprintable characters. Therefore, instead of relying on line breaks, the server requires that you specify how many bytes you’re sending, and then reads that number of bytes from your request. Specify the content length using putheader (note that you must know the number of bytes):

myquote = 'This is my quote: "I think therefore I am."'
postdata = "favquote=" + quote(myquote)
req.putheader("Content-Length", str(len(postdata)))

In these calls, you assemble the post data by concatenating the key portion (favquote=) with the quoted value. Use the len function to size up your URL-encoded string named postdata. Finally, since putheader expects a string as a second argument and not a number, convert the length with the str function.
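One detail worth a quick check is that Content-Length counts bytes, not characters. With URL-encoded data the two agree, because quote produces plain ASCII, but for raw text they can differ. The following Python 3 sketch shows both cases; the helper content_length and the sample word are assumptions made for the illustration.

```python
from urllib.parse import quote


def content_length(body):
    """Return the Content-Length header value for a str or bytes body."""
    if isinstance(body, str):
        body = body.encode("utf-8")
    return str(len(body))


word = "caf\u00e9"  # four characters, the last one non-ASCII

postdata = "favquote=" + quote(word)  # quote() escapes it as UTF-8 bytes
print(postdata, content_length(postdata))
print(len(word), content_length(word))  # characters versus bytes
```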

The HTTP.send method is used to submit the data after ending the headers:

req.endheaders(  )
req.send(postdata)

Now, you can get the reply and read the results as you did with your GET request earlier in Example 8-2. The result of the POST may be dynamically generated data such as search results, or it could be an HTML page detailing a problem associated with the POST.

Illustrating a complete POST operation

As you can see, performing a POST (rather than a GET) requires learning a few more steps. The file post.py, shown in Example 8-4, pulls these ideas together and illustrates a complete POST operation. If you copied Example 8-3, favquote.cgi, to a CGI-capable directory on your web server, you should be able to run post.py from the command line. Be sure to put the appropriate IP address or localhost in the call to the HTTP constructor!

Example 8-4. post.py
"""
post.py
"""
from httplib import HTTP
from urllib  import quote

# establish POST data
myquote = 'This is my quote: "I think therefore I am."'

# be sure not to quote the key= sequence...
postdata = "favquote=" + quote(myquote)

print "Will POST ", len(postdata), "bytes:"
print postdata

# begin HTTP request
req = HTTP("127.0.0.1") # change to a different IP if needed
req.putrequest("POST", "/c7/favquote.cgi")
req.putheader("Accept", "text/html")
req.putheader("User-Agent", "MyPythonScript")

# Set Content-length to length of postdata
req.putheader("Content-Length", str(len(postdata)))
req.endheaders(  )

# send post data after ending headers,
# CGI script will receive it on STDIN
req.send(postdata)

ec, em, h = req.getreply(  )
print "HTTP RESPONSE: ", ec, em

# get file-like object from HTTP response
# and print received HTML to screen
fd = req.getfile(  )
textlines = fd.read(  )
fd.close(  )
print "\nReceived following HTML:\n"
print textlines

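The raw HTTP headers and post data for a POST of this kind are shown below. This capture comes from a web browser submitting the same form field rather than from post.py itself, as its User-Agent and Accept headers show, but the layout is the same: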

POST /c7/favquote.cgi HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg
Accept-Language: en-us
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
Host: 192.168.1.45
Content-Length: 70
Connection: Keep-Alive

favquote=This+is+my+favorite+quote%3A+%22I+think%2C+therefore+I+am.%22

As you can see, the content length is specified and the data follows exactly one blank line after the headers.
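For comparison, the same sequence of calls can be sketched in Python 3, where httplib has become http.client and quote has moved to urllib.parse. To keep the sketch runnable without a CGI-capable web server, a small handler on localhost stands in for favquote.cgi; that handler and the explicit Content-Type header are assumptions of this example, not part of post.py.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, quote


class EchoHandler(BaseHTTPRequestHandler):
    """Stand-in for favquote.cgi: echoes the posted favquote field."""

    def do_POST(self):
        # Read exactly as many bytes as the client announced.
        length = int(self.headers.get("Content-Length", 0))
        fields = parse_qs(self.rfile.read(length).decode("ascii"))
        quote_text = fields.get("favquote", [""])[0]
        page = "<html><body>Your quote is: %s</body></html>" % quote_text
        data = page.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

myquote = 'This is my quote: "I think therefore I am."'
postdata = ("favquote=" + quote(myquote)).encode("ascii")  # key left unquoted

conn = HTTPConnection("127.0.0.1", server.server_address[1])
conn.putrequest("POST", "/c7/favquote.cgi")
conn.putheader("Content-Type", "application/x-www-form-urlencoded")
conn.putheader("Content-Length", str(len(postdata)))
conn.endheaders()
conn.send(postdata)  # data goes out after the headers have ended

resp = conn.getresponse()
status = resp.status
html_text = resp.read().decode("utf-8")
conn.close()
server.shutdown()
server.server_close()

print("HTTP RESPONSE:", status)
print(html_text)
```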

Thus far in this chapter, we’ve encountered urllib and httplib, and have retrieved generic URLs and created custom HTTP requests. Now we are going to take a look at Python’s support for implementing the server side of the connection.
