URLs

Uniform Resource Locators, or URLs, are fundamental to the way in which the web operates, and they are formally described in RFC 3986. A URL represents a resource on a given host. How URLs map to the resources on the remote system is entirely at the discretion of the system administrator. URLs can point to files on the server, or the resources may be dynamically generated when a request is received. What a URL maps to doesn't matter to us, though, as long as the URL works when we request it.

A URL is made up of several components. Python provides the urllib.parse module for working with URLs. Let's use it to break a URL into its component parts:

>>> from urllib.parse import urlparse
>>> result = urlparse('http://www.python.org/dev/peps')
>>> result
ParseResult(scheme='http', netloc='www.python.org', path='/dev/peps', params='', query='', fragment='')

The urllib.parse.urlparse() function interprets our URL and recognizes http as the scheme, www.python.org as the network location, and /dev/peps as the path. We can access these components as attributes of the ParseResult:

>>> result.netloc
'www.python.org'
>>> result.path
'/dev/peps'

For almost all resources on the web, we'll be using the http or https schemes. In these schemes, to locate a specific resource, we need to know the host that it resides on and the TCP port that we should connect to (together these are the netloc component), and we also need to know the path to the resource on the host (the path component).

Port numbers can be specified explicitly in a URL by appending them to the host. They are separated from the host by a colon. Let's see what happens when we try this with urlparse.

>>> urlparse('http://www.python.org:8080/')
ParseResult(scheme='http', netloc='www.python.org:8080', path='/', params='', query='', fragment='')

The urlparse() function simply treats the port as part of the netloc. This is fine, because this is the format that handlers such as urllib.request.urlopen() expect.

If we don't supply a port (as is usually the case), then the default port 80 is used for http, and the default port 443 is used for https. This is usually what we want, as these are the standard ports for the HTTP and HTTPS protocols respectively.
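Although the port stays inside netloc, the parsed result also exposes it separately through the hostname and port attributes. The following is a minimal sketch of falling back to the default port when none is given. The DEFAULT_PORTS table and the effective_port() helper are our own illustrative names, not part of urllib:

```python
from urllib.parse import urlparse

# Assumed lookup table of default ports; this is our own helper data,
# not something provided by urllib.parse.
DEFAULT_PORTS = {'http': 80, 'https': 443}

def effective_port(url):
    """Return the explicit port in url, or the default for its scheme."""
    parts = urlparse(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

explicit = urlparse('http://www.python.org:8080/')
print(explicit.hostname, explicit.port)           # www.python.org 8080
print(effective_port('http://www.python.org/'))   # 80
print(effective_port('https://www.python.org/'))  # 443
```

Note that port is None when the URL doesn't specify one, which is why the helper checks for None explicitly.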

Paths and relative URLs

The path in a URL is anything that comes after the host and the port. Paths always start with a forward-slash (/), and when just a slash appears on its own, it's called the root. We can see this by performing the following:

>>> urlparse('http://www.python.org/')
ParseResult(scheme='http', netloc='www.python.org', path='/', params='', query='', fragment='')

If no path is supplied in a request, then by default urllib will send a request for the root.

When a scheme and a host are included in a URL (as in the previous example), the URL is called an absolute URL. Conversely, it's possible to have relative URLs, which contain just a path component, as shown here:

>>> urlparse('../images/tux.png')
ParseResult(scheme='', netloc='', path='../images/tux.png', params='', query='', fragment='')

We can see that ParseResult only contains a path. If we want to use a relative URL to request a resource, then we need to supply the missing scheme, the host, and the base path.

Usually, we encounter relative URLs in a resource that we've already retrieved from a URL. So, we can just use this resource's URL to fill in the missing components. Let's look at an example.

Suppose that we've retrieved the http://www.debian.org URL, and within the webpage source code we found the relative URL for the 'About' page. We found that it's a relative URL for intro/about.

We can create an absolute URL by using the URL for the original page and the urllib.parse.urljoin() function. Let's see how we can do this:

>>> from urllib.parse import urljoin
>>> urljoin('http://www.debian.org', 'intro/about')
'http://www.debian.org/intro/about'

By supplying urljoin with a base URL, and a relative URL, we've created a new absolute URL.

Here, notice how urljoin has filled in the slash between the host and the path. The only time that urljoin will fill in a slash for us is when the base URL does not have a path, as shown in the preceding example. Let's see what happens if the base URL does have a path.

>>> urljoin('http://www.debian.org/intro/', 'about')
'http://www.debian.org/intro/about'
>>> urljoin('http://www.debian.org/intro', 'about')
'http://www.debian.org/about'

This will give us varying results. Notice how urljoin appends to the path if the base URL ends in a slash, but it replaces the last path element in the base URL if the base URL doesn't end in a slash.

We can force a path to replace all the elements of a base URL by prefixing it with a slash. Do the following:

>>> urljoin('http://www.debian.org/intro/about', '/News')
'http://www.debian.org/News'

How about navigating to parent directories? Let's try the standard dot syntax, as shown here:

>>> urljoin('http://www.debian.org/intro/about/', '../News')
'http://www.debian.org/intro/News'
>>> urljoin('http://www.debian.org/intro/about/', '../../News')
'http://www.debian.org/News'
>>> urljoin('http://www.debian.org/intro/about', '../News')
'http://www.debian.org/News'

It works as we'd expect it to. Note the difference between the base URL having and not having a trailing slash.

Lastly, what if the 'relative' URL is actually an absolute URL?

>>> urljoin('http://www.debian.org/about', 'http://www.python.org')
'http://www.python.org'

The relative URL completely replaces the base URL. This is handy, as it means that we don't need to worry about testing whether a URL is relative or not before using it with urljoin.
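To pull these cases together, here is a small sketch that resolves a mixed batch of links against the URL of the page they were found on. The list of links is made up for illustration; it is not taken from the real Debian site:

```python
from urllib.parse import urljoin

# URL of the page we are assumed to have retrieved.
page_url = 'http://www.debian.org/intro/about'

# Hypothetical links found in that page: parent-relative, root-relative,
# sibling-relative, and already absolute.
links = ['../News', '/distrib/', 'help', 'http://www.python.org']

# urljoin handles every case, so no relative/absolute test is needed.
absolute_links = [urljoin(page_url, link) for link in links]

for link, resolved in zip(links, absolute_links):
    print('{:25} -> {}'.format(link, resolved))
```

Because the base URL has no trailing slash, 'help' replaces the last path element and resolves to http://www.debian.org/intro/help.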

Query strings

RFC 3986 defines another property of URLs. They can contain additional parameters in the form of key/value pairs that appear after the path. They are separated from the path by a question mark, as shown here:

http://docs.python.org/3/search.html?q=urlparse&area=default

This string of parameters is called a query string. Multiple parameters are separated by ampersands (&). Let's see how urlparse handles it:

>>> urlparse('http://docs.python.org/3/search.html?q=urlparse&area=default')
ParseResult(scheme='http', netloc='docs.python.org', path='/3/search.html', params='', query='q=urlparse&area=default', fragment='')

So, urlparse recognizes the query string as the query component.

Query strings are used for supplying parameters to the resource that we wish to retrieve, and this usually customizes the resource in some way. In the aforementioned example, our query string tells the Python docs search page that we want to run a search for the term urlparse.

The urllib.parse module has a function that helps us turn the query component returned by urlparse into something more useful:

>>> from urllib.parse import parse_qs
>>> result = urlparse('http://docs.python.org/3/search.html?q=urlparse&area=default')
>>> parse_qs(result.query)
{'q': ['urlparse'], 'area': ['default']}

The parse_qs() function reads the query string and converts it into a dictionary. See how the dictionary values are actually lists? This is because parameters can appear more than once in a query string. Try it with a repeated parameter:

>>> result = urlparse('http://docs.python.org/3/search.html?q=urlparse&q=urljoin')
>>> parse_qs(result.query)
{'q': ['urlparse', 'urljoin']}

See how both of the values have been added to the list? It's up to the server to decide how it interprets this. If we send this query string, then the server may just pick one of the values and use that, while ignoring the repeat. The only way to find out is to try it and see what happens.
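If we need to preserve the repeats ourselves, urllib.parse offers two related tools: parse_qsl() returns the parameters as a list of pairs, and urlencode() accepts doseq=True to expand list values back into repeated parameters. The following sketch uses an illustrative query string:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'q=urlparse&q=urljoin&area=default'

# A dict of lists: repeated keys are grouped together.
as_dict = parse_qs(query)
print(as_dict)    # {'q': ['urlparse', 'urljoin'], 'area': ['default']}

# A list of (key, value) pairs: the original order is kept.
as_pairs = parse_qsl(query)
print(as_pairs)   # [('q', 'urlparse'), ('q', 'urljoin'), ('area', 'default')]

# doseq=True turns each list value back into repeated parameters.
rebuilt = urlencode(as_dict, doseq=True)
print(rebuilt)    # q=urlparse&q=urljoin&area=default
```

Without doseq=True, urlencode() would encode the string representation of each list, which is rarely what we want.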

You can usually figure out what you need to put in a query string for a given page by submitting a query through the web interface using your web browser, and inspecting the URL of the results page. You should be able to spot the text of your search and consequently deduce the corresponding key for the search text. Quite often, many of the other parameters in the query string aren't actually needed for getting a basic result. Try requesting the page using only the search text parameter and see what happens. Then, add the other parameters, if it does not work as expected.
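One way to experiment with dropping parameters is to rebuild a copied URL with only the keys we care about. This is a rough sketch; keep_params() is a hypothetical helper of our own, built from functions in urllib.parse:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def keep_params(url, wanted):
    """Return url with only the query parameters named in wanted."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    kept = {key: values for key, values in params.items() if key in wanted}
    # ParseResult is a named tuple, so _replace() gives us a modified copy.
    trimmed = parts._replace(query=urlencode(kept, doseq=True))
    return urlunparse(trimmed)

copied = 'http://docs.python.org/3/search.html?q=urlparse&area=default'
minimal = keep_params(copied, {'q'})
print(minimal)   # http://docs.python.org/3/search.html?q=urlparse
```

We can then request the trimmed URL, and add parameters back to the wanted set one at a time if the result isn't what we expect.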

If you submit a form to a page and the resulting page's URL doesn't have a query string, then the page will have used a different method for sending the form data. We'll look at this later, in the HTTP methods section, when we discuss the POST method.

URL encoding

URLs are restricted to the ASCII characters and within this set, a number of characters are reserved and need to be escaped in different components of a URL. We escape them by using something called URL encoding. It is often called percent encoding, because it uses the percent sign as an escape character. Let's URL-encode a string:

>>> from urllib.parse import quote
>>> quote('A duck?')
'A%20duck%3F'

The special characters ' ' and ? have been replaced by escape sequences. The numbers in the escape sequences are the characters' ASCII codes in hexadecimal.
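We can check the relationship between a character and its escape sequence for ourselves, and reverse the encoding with unquote(). This is a small sketch; the non-ASCII sample string is our own illustration, and it shows that quote() encodes such characters as their UTF-8 bytes by default:

```python
from urllib.parse import quote, unquote

# Build each escape sequence by hand from the character's code point
# and compare it with what quote() produces.
for ch in (' ', '?'):
    by_hand = '%{:02X}'.format(ord(ch))
    print(repr(ch), by_hand, quote(ch))

# unquote() reverses the encoding.
decoded = unquote('A%20duck%3F')
print(decoded)          # A duck?

# Non-ASCII characters become one escape sequence per UTF-8 byte.
encoded = quote('café')
print(encoded)          # caf%C3%A9
```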

The full rules for where the reserved characters need to be escaped are given in RFC 3986. Fortunately, urllib provides us with a couple of functions that help us construct URLs, so we don't need to memorize all of these rules!

We just need to:

  • URL-encode the path
  • URL-encode the query string
  • Combine them by using the urllib.parse.urlunparse() function

Let's see how to use the aforementioned steps in code. First, we encode the path:

>>> path = 'pypi'
>>> path_enc = quote(path)

Then, we encode the query string:

>>> from urllib.parse import urlencode
>>> query_dict = {':action': 'search', 'term': 'Are you quite sure this is a cheese shop?'}
>>> query_enc = urlencode(query_dict)
>>> query_enc
'%3Aaction=search&term=Are+you+quite+sure+this+is+a+cheese+shop%3F'

Lastly, we compose everything into a URL:

>>> from urllib.parse import urlunparse
>>> netloc = 'pypi.python.org'
>>> urlunparse(('http', netloc, path_enc, '', query_enc, ''))
'http://pypi.python.org/pypi?%3Aaction=search&term=Are+you+quite+sure+this+is+a+cheese+shop%3F'

The quote() function has been set up specifically for encoding paths. By default, it ignores slash characters and doesn't encode them. This isn't obvious in the preceding example, so try the following to see how it works:

>>> from urllib.parse import quote
>>> path = '/images/users/+Zoot+/'
>>> quote(path)
'/images/users/%2BZoot%2B/'

Notice that it ignores the slashes, but it escapes the +. That is perfect for paths.

The urlencode() function is similarly intended for encoding query strings directly from dicts. Notice how it correctly percent-encodes our keys and values and then joins them with &, so as to construct the query string.
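The two functions also differ in how they treat spaces: quote() produces %20, whereas urlencode() produces +, because it encodes with quote_plus() by default. The following sketch compares them on an illustrative search term, and shows the quote_via argument of urlencode() for the case where we want %20 in a query string instead:

```python
from urllib.parse import quote, quote_plus, urlencode

term = 'cheese shop?'

print(quote(term))                 # cheese%20shop%3F
print(quote_plus(term))            # cheese+shop%3F

# urlencode() uses quote_plus() unless told otherwise.
default_query = urlencode({'term': term})
print(default_query)               # term=cheese+shop%3F

# quote_via swaps in a different quoting function.
percent_query = urlencode({'term': term}, quote_via=quote)
print(percent_query)               # term=cheese%20shop%3F
```

Servers generally accept both forms in a query string, but this is something to verify against the particular server being used.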

Lastly, the urlunparse() function expects a 6-tuple containing elements that match those of the result of urlparse(), hence the two empty strings.

There is a caveat for path encoding. If the elements of a path themselves contain slashes, then we may run into problems. Here is an example:

>>> username = '+Zoot/Dingo+'
>>> path = 'images/users/{}'.format(username)
>>> quote(path)
'images/users/%2BZoot/Dingo%2B'

Notice how the slash in the username doesn't get escaped? This will be incorrectly interpreted as an extra level of directory structure, which is not what we want. In order to get around this, first we need to individually escape any path elements that may contain slashes, and then join them manually:

>>> username = '+Zoot/Dingo+'
>>> user_encoded = quote(username, safe='')
>>> path = '/'.join(('', 'images', 'users', user_encoded))
>>> path
'/images/users/%2BZoot%2FDingo%2B'

Notice how the username slash is now percent-encoded? We encode the username separately, telling quote() not to ignore slashes by supplying the safe='' argument, which overrides its default ignore list of /. Then, we combine the path elements by using a simple join() call.
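If we build paths like this in more than one place, the pattern can be wrapped in a small function. This is only a sketch; build_path() is a hypothetical helper of our own, not something provided by urllib:

```python
from urllib.parse import quote

def build_path(*segments):
    """Percent-encode each segment fully, then join them with slashes."""
    # safe='' makes quote() escape slashes inside a segment too.
    return '/' + '/'.join(quote(segment, safe='') for segment in segments)

path = build_path('images', 'users', '+Zoot/Dingo+')
print(path)   # /images/users/%2BZoot%2FDingo%2B
```

Every slash that appears in the result was added by the join, so each one really does mark a path boundary.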

It's worth mentioning here that hostnames sent over the wire must be strictly ASCII. However, the socket and http modules support transparent encoding of Unicode hostnames to an ASCII-compatible encoding, so in practice we don't need to worry about encoding hostnames ourselves. There are more details about this process in the encodings.idna section of the codecs module documentation.
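For the curious, the conversion can be seen directly by using the built-in idna codec. This is just an illustration with a made-up hostname; as noted, we rarely need to do this by hand:

```python
# 'bücher.example' is an illustrative hostname, not a real site.
hostname = 'bücher.example'

# The idna codec converts each non-ASCII label to its ASCII-compatible
# form, which starts with the 'xn--' prefix.
ascii_hostname = hostname.encode('idna')
print(ascii_hostname)                   # b'xn--bcher-kva.example'

# Decoding reverses the conversion.
print(ascii_hostname.decode('idna'))    # bücher.example
```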

URLs in summary

There are quite a few functions that we've used in the preceding sections. Let's just review what we have used each function for. All of these functions can be found in the urllib.parse module. They are as follows:

  • Splitting a URL into its components: urlparse
  • Combining an absolute URL with a relative URL: urljoin
  • Parsing a query string into a dict: parse_qs
  • URL-encoding a path: quote
  • Creating a URL-encoded query string from a dict: urlencode
  • Creating a URL from components (reverse of urlparse): urlunparse
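As a brief recap, the following sketch runs a single URL through each of these functions in turn. The relative link used with urljoin is chosen only for illustration:

```python
from urllib.parse import (parse_qs, quote, urlencode, urljoin, urlparse,
                          urlunparse)

# Build a URL from an encoded path and an encoded query string.
path = quote('/3/search.html')
query = urlencode({'q': 'urlparse', 'area': 'default'})
url = urlunparse(('http', 'docs.python.org', path, '', query, ''))
print(url)   # http://docs.python.org/3/search.html?q=urlparse&area=default

# Split it again and recover the query parameters.
parts = urlparse(url)
params = parse_qs(parts.query)
print(parts.netloc, parts.path, params)

# Resolve an illustrative relative link against it.
joined = urljoin(url, 'library/urllib.parse.html')
print(joined)   # http://docs.python.org/3/library/urllib.parse.html
```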