Working with URLs

The URL contains a great deal of Internet information in a single string. It tells you the name of the server, the name of the file on the server, any data that you are supplying to generate a dynamic response, and even the protocol to use to retrieve the information. In basic form, URLs look like this:

http://www.oreilly.com/oreilly/about.html

This URL has three elements. The first section tells you (or your software) the protocol in use for this resource. In this case, it is HTTP, shown by http:. The next section indicates the server name and its corresponding domain. In this case the server is named www, and the domain is oreilly.com, coming together as //www.oreilly.com. The third section is the path on the server, made up of a pathname (/oreilly/) and a filename (about.html). Your browser uses this information as it comes to the brilliant conclusion to use HTTP to connect with www in oreilly.com and retrieve the /oreilly/about.html file.
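The same three-part breakdown can be reproduced in code. The following is a minimal sketch, assuming Python 3, where the URL helpers live in urllib.parse; the variable names are illustrative only.

```python
# Assumes Python 3, where the URL helpers live in urllib.parse.
from urllib.parse import urlsplit

url = "http://www.oreilly.com/oreilly/about.html"
parts = urlsplit(url)

print(parts.scheme)    # the protocol
print(parts.hostname)  # the server name plus its domain
print(parts.path)      # the pathname plus the filename

# Separate the pathname from the filename at the last slash.
directory, _, filename = parts.path.rpartition("/")
print(directory, filename)
```

Splitting on the last slash is just one convenient way to separate the pathname from the filename; urlsplit itself returns the path as a single string.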

Of course, URLs can become more complicated. If you type “Python” into a search box and click Submit, your browser may go after a URL similar to the following:

http://search.oreilly.com/cgi-bin/search?term=Python&category=All&pref=all

Now there are several more items to examine. First, the server has changed from www to search. Second, the path has changed from /oreilly/ to /cgi-bin/. The filename about.html has been replaced with a target named search. But most interesting are the question mark and the data that follows it:

?term=Python&category=All&pref=all

This portion of the URL is known as the query string. If search is a CGI program (or something similar inside an application server) the query string is supplied to it in the form of an environment variable. The CGI program can pick the string apart to realize that a variable named term is set to Python, and that category and pref are equal to All and all respectively. As you can imagine, this information is relevant to the O’Reilly database and appropriate product information is returned to your browser.
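To illustrate how a program might pick the query string apart, here is a small sketch, assuming Python 3 and its urllib.parse module. It works on the URL string directly rather than on a CGI environment variable, and the variable names are illustrative.

```python
# Assumes Python 3 (urllib.parse).
from urllib.parse import urlsplit, parse_qs

url = "http://search.oreilly.com/cgi-bin/search?term=Python&category=All&pref=all"

# Everything after the question mark is the query string.
query = urlsplit(url).query

# parse_qs maps each variable name to a list of values, because a
# name may legally appear more than once in a query string.
params = parse_qs(query)

print(query)
print(params["term"][0], params["category"][0], params["pref"][0])
```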

However, suppose that instead of searching the O’Reilly site for “Python”, you searched it for “Python!”. What does the URL look like now? The only difference is that the exclamation point is URL-encoded. That is, only a limited set of characters is allowed within a URL; every other character is escaped by replacing it with a percent (%) sign followed by its two-digit hexadecimal code. This time, the query string looks slightly different:

?term=Python%21&category=All&pref=all

The exclamation point is now replaced with %21, which is its URL-encoded cousin.
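The escape is simply the character's code written in hexadecimal after a percent sign. As a rough sketch, assuming Python 3, the hypothetical helper percent_encode_char below does this by hand for a single ASCII character, and the library function quote does the same for a whole string:

```python
# Assumes Python 3 (urllib.parse).
from urllib.parse import quote

def percent_encode_char(ch):
    """Hypothetical helper: percent-encode one ASCII character by hand."""
    return "%{:02X}".format(ord(ch))

manual = percent_encode_char("!")

# safe="" asks quote to escape every reserved character, including "/".
encoded_term = quote("Python!", safe="")

print(manual)
print(encoded_term)
```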

Encoding URLs

If you are constructing a URL programmatically for submission to a web site, for example when integrating your Python program with a dynamic web site, you will find yourself needing to supply parameters in the query string, as shown in the previous section.

The Python urllib module features the method urlencode. This method accepts a dictionary of key/value pairs and returns a properly formatted query string that you could tack onto a URL. For example, if you have an arbitrarily sized dictionary, you could call urlencode with the dictionary as a parameter, as shown here:

>>> from urllib import urlencode
>>> myDict = {
... "Name" : "Chris Jones",
... "Address" : "Woodinville, WA",
... "Favorite Characters" : "#, @, $, and %"
... }
>>> strUrl = urlencode(myDict)
>>> print strUrl
Address=Woodinville%2c+WA&Name=Chris+Jones&Favorite+Characters=%23%2c+%40%2c+%24%2c+and+%25

The value of strUrl here is not a complete URL. It’s just the query data that comes at the end of the URL. The first part of the URL needs to include the protocol, the server and domain pairing, and the path to the program that receives the query:

http://www.example.com/search.cgi?Address=Woodinville%2c+WA&Name=Chris+Jones&Favorite+Characters=%23%2c+%40%2c+%24%2c+and+%25

The urlencode method takes care of escaping special characters as it translates #, @, $, and % into %23%2c+%40%2c+%24%2c+and+%25. Not only are the special characters translated, but the commas have also been converted to their hexadecimal value (%2c), and the spaces have been replaced with plus signs (+).
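For readers on Python 3, the same steps might look like the following sketch. It assumes urlencode is imported from urllib.parse (its home in Python 3), and the base_url value is the illustrative example.com address used above.

```python
# Assumes Python 3, where urlencode lives in urllib.parse.
from urllib.parse import urlencode

params = {
    "Name": "Chris Jones",
    "Address": "Woodinville, WA",
    "Favorite Characters": "#, @, $, and %",
}

# Build the query string, then attach it to the front part of the URL.
query = urlencode(params)
base_url = "http://www.example.com/search.cgi"  # illustrative address
full_url = base_url + "?" + query

print(query)
print(full_url)
```

Python 3 writes the hexadecimal digits in uppercase (%2C rather than %2c); the two forms are equivalent. On Python 3.7 and later the pairs also appear in the order in which they were added to the dictionary.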

Quoting URLs

The quote method of urllib takes a single string of data and performs the same kind of encoding as urlencode, which takes a dictionary as a parameter. The primary difference is that quote does not automatically generate key/value pairings based on a dictionary. The quote method exists to convert a single string into URL-compliant syntax. For example, if a URL you are constructing consists of http://www.example.com/addQuotation.cgi?myquote=, but you need to add a URL-compliant value to it, you could use the quote method to encode it:

>>> from urllib import quote
>>> quote('Famous Quote: "I think, therefore I am."')
'Famous%20Quote%3a%20%22I%20think%2c%20therefore%20I%20am.%22'

Perhaps the most important thing to remember is that quote should be used to encode a single string, not a key/value pair. In any key=value combination, only the value should be encoded with the quote method. If you were to include the myquote= (or the key) portion of the query string when calling the quote method, the equal sign would also be encoded, rendering the URL worthless.
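The difference between quoting only the value and quoting the whole pair can be seen in a short sketch. It assumes Python 3, where quote is imported from urllib.parse; the variable names are illustrative.

```python
# Assumes Python 3, where quote lives in urllib.parse.
from urllib.parse import quote

base = "http://www.example.com/addQuotation.cgi?myquote="
value = 'Famous Quote: "I think, therefore I am."'

# Correct: quote only the value, then append it to the key.
good_url = base + quote(value)

# Mistake: quoting the key together with the value escapes the
# equal sign too, so the server no longer sees a key=value pair.
bad_query = quote("myquote=" + value)

print(good_url)
print(bad_query)
```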

Unquoting URLs

What goes up must come down. If you are encoding URLs programmatically, the odds are that you are going to need to decode one at some point or another. The unquote method of urllib takes an encoded string (such as that generated by quote) and returns the decoded version of it:

>>> from urllib import unquote
>>> unquote("Famous%20Quote%3a%20%22I%20think%2c%20therefore%20I%20am.%22")
'Famous Quote: "I think, therefore I am."'
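One detail worth checking when decoding is how spaces were encoded. The sketch below assumes Python 3 (urllib.parse). It first confirms that unquote reverses quote, then shows that a plus sign, as produced by urlencode, is only turned back into a space by the companion function unquote_plus.

```python
# Assumes Python 3 (urllib.parse).
from urllib.parse import quote, unquote, unquote_plus

original = 'Famous Quote: "I think, therefore I am."'

# quote followed by unquote returns the original string.
round_trip = unquote(quote(original))

# unquote leaves "+" alone; unquote_plus treats it as a space.
plus_left_alone = unquote("Chris+Jones")
plus_decoded = unquote_plus("Chris+Jones")

print(round_trip)
print(plus_left_alone)
print(plus_decoded)
```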

If you are constructing and deconstructing URLs programmatically, it follows that actually connecting to them and retrieving their content is of value.
