User agents

Another request header worth knowing about is the User-Agent header. Any client that communicates using HTTP can be referred to as a user agent. RFC 7231 suggests that user agents should use the User-Agent header to identify themselves in every request. What goes in there is up to the software that makes the request, though it usually comprises a string that identifies the program and version, and possibly the operating system and the hardware that it's running on. For example, the user agent for my current version of Firefox is shown here:

Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140722 Firefox/24.0 Iceweasel/24.7.0

The user agent is a single long string. As you can probably decipher, I'm running Iceweasel (Debian's version of Firefox) version 24 on a 64-bit Linux system. User agent strings aren't intended for identifying individual users; they only identify the product that was used to make the request.

We can view the user agent that urllib uses. Perform the following steps:

>>> from urllib.request import Request, urlopen
>>> req = Request('http://www.python.org')
>>> response = urlopen(req)
>>> req.get_header('User-agent')
'Python-urllib/3.4'

Here, we have created a request and submitted it using urlopen, and urlopen added the user agent header to the request. We can examine this header by using the get_header() method. This header and its value are included in every request made by urllib, so every server we make a request to can see that we are using Python 3.4 and the urllib library.
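
The default value can also be inspected without making a network request. The following is a minimal sketch, assuming a standard CPython installation: it builds a default opener with build_opener() and reads its addheaders list, which holds the headers that urllib adds to requests that don't already set them. The version suffix matches whichever Python version runs the code, so it will not necessarily be 3.4.

```python
from urllib.request import build_opener

# A default opener carries the headers that urllib adds to each request.
opener = build_opener()
default_headers = dict(opener.addheaders)

# The value has the form 'Python-urllib/<major>.<minor>'.
default_user_agent = default_headers['User-agent']
print(default_user_agent)
```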

Webmasters can inspect the user agents of requests and then use the information for various things, including the following:

  • Classifying visits for their website statistics
  • Blocking clients with certain user agent strings
  • Sending alternative versions of resources for user agents with known problems, such as bugs when interpreting certain languages like CSS, or not supporting some languages at all, such as JavaScript
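
To make the list above concrete, here is a rough sketch of the kind of check a server might apply. The function name, the blocked substrings, and the category labels are all hypothetical and chosen only for illustration; real sites tend to use far more elaborate rules.

```python
# Hypothetical rule set: substrings that a site has chosen to block.
BLOCKED_SUBSTRINGS = ('python-urllib', 'curl')


def classify_user_agent(user_agent):
    """Return 'blocked', 'browser' or 'unknown' for a User-Agent string."""
    ua = (user_agent or '').lower()
    if any(token in ua for token in BLOCKED_SUBSTRINGS):
        return 'blocked'
    # Most mainstream browsers send a string that starts with 'Mozilla/'.
    if ua.startswith('mozilla/'):
        return 'browser'
    return 'unknown'


print(classify_user_agent('Python-urllib/3.4'))
print(classify_user_agent(
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) '
    'Gecko/20140722 Firefox/24.0 Iceweasel/24.7.0'))
```

With rules like these, the default urllib user agent would be refused while a browser-like string would be accepted.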

The last two can cause problems for us because they can stop or interfere with us accessing the content that we're after. To work around this, we can try to set our user agent so that it mimics a well-known browser. This is known as spoofing, and is shown here:

>>> req = Request('http://www.debian.org')
>>> req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140722 Firefox/24.0 Iceweasel/24.7.0')
>>> response = urlopen(req)

The server will respond as if our application is a regular Firefox client. User agent strings for different browsers are available on the web. I have yet to come across a comprehensive resource for them, but Googling for a browser and version number will usually turn something up. Alternatively, you can use Wireshark to capture an HTTP request made by the browser you want to emulate and look at the captured request's user agent header.
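
Before sending a spoofed request, it can be useful to confirm that the header is set as expected. The sketch below does this without touching the network. It relies on the fact that Request stores header names in capitalized form, so 'User-Agent' is stored as 'User-agent', which is why get_header() is called with that spelling. FIREFOX_UA is just an illustrative variable name.

```python
from urllib.request import Request

FIREFOX_UA = ('Mozilla/5.0 (X11; Linux x86_64; rv:24.0) '
              'Gecko/20140722 Firefox/24.0 Iceweasel/24.7.0')

# Headers can also be supplied when the request is constructed.
req = Request('http://www.debian.org', headers={'User-Agent': FIREFOX_UA})

# Request normalizes header names with str.capitalize(), so the
# stored key is 'User-agent' rather than 'User-Agent'.
print(req.has_header('User-agent'))
print(req.get_header('User-agent') == FIREFOX_UA)
print(req.get_header('User-Agent'))
```

The first two lines print True, and the last prints None because the lookup is case-sensitive. Since the request already carries a user agent header, urlopen leaves it alone instead of adding the default one.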
