Content negotiation

Content compression with the Accept-Encoding header and language selection with the Accept-Language header are examples of content negotiation, where the client specifies its preferences regarding the format and the content of the requested resource. The following headers can also be used for this:

  • Accept: For requesting a preferred file format
  • Accept-Charset: For requesting the resource in a preferred character set
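
As a minimal sketch of how a client might state these preferences with urllib, the following builds a request that carries both headers. The URL and the preference values are placeholders chosen for illustration, and a server is free to ignore them.

```python
from urllib.request import Request

# Hypothetical URL and preference values, chosen only for illustration.
req = Request(
    'http://www.example.com/',
    headers={
        'Accept': 'application/json, text/html;q=0.8',
        'Accept-Charset': 'utf-8',
    },
)

# Request normalizes header names with str.capitalize(), so
# 'Accept-Charset' is stored as 'Accept-charset'.
for name, value in req.header_items():
    print(name, value, sep=': ')
```

Passing req to urlopen() would send the request with these headers attached.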

There are additional aspects to the content negotiation mechanism, but because it's inconsistently supported and can become quite involved, we won't be covering it in this chapter. RFC 7231 contains all the details that you need. Take a look at sections such as 3.4, 5.3, 6.4.1, and 6.5.6 if you find that your application requires this.

Content types

HTTP can be used as a transport for any type of file or data. The server can use the Content-Type header in a response to inform the client about the type of data that it has sent in the body. This is the primary means by which an HTTP client determines how it should handle the body data that the server returns to it.

To view the content type, we inspect the value of the response header, as shown here:

>>> response = urlopen('http://www.debian.org')
>>> response.getheader('Content-Type')
'text/html'

The values in this header are taken from a list which is maintained by IANA. These values are variously called content types, Internet media types, or MIME types (MIME stands for Multipurpose Internet Mail Extensions, the specification in which the convention was first established). The full list can be found at http://www.iana.org/assignments/media-types.

There are registered media types for many of the types of data that are transmitted across the Internet. Some common ones are:

Media type               Description
text/html                HTML document
text/plain               Plain text document
image/jpeg               JPEG image
application/pdf          PDF document
application/json         JSON data
application/xhtml+xml    XHTML document

Another media type of interest is application/octet-stream, which in practice is used for files that don't have an applicable media type. An example of this would be a pickled Python object. It is also used for files whose format is not known by the server. In order to handle responses with this media type correctly, we need to discover the format in some other way. Possible approaches are as follows:

  • Examine the filename extension of the downloaded resource, if it has one. The mimetypes module can then be used for determining the media type (see Chapter 3, APIs in Action, for an example of this).
  • Download the data and then use a file type analysis tool. The Python standard library imghdr module can be used for images (note that imghdr was deprecated in Python 3.11 and removed in Python 3.13), and the third-party python-magic package, or the GNU file command, can be used for other types.
  • Check the website that we're downloading from to see if the file type has been documented anywhere.
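
The first of these approaches can be sketched roughly as follows. The guess_media_type() helper and the example URLs are hypothetical, and the result is only a guess based on the filename extension, not on the actual content of the file.

```python
import mimetypes
from urllib.parse import urlparse

def guess_media_type(url, default='application/octet-stream'):
    """Guess a media type from the filename extension in a URL's path."""
    path = urlparse(url).path
    # guess_type() returns a (type, encoding) pair; type is None
    # when the extension is missing or not recognized.
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type or default

print(guess_media_type('http://www.example.com/files/report.pdf'))
print(guess_media_type('http://www.example.com/files/data'))
```

When the extension tells us nothing, the helper falls back to application/octet-stream, and one of the other approaches is needed.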

Content type values can contain optional additional parameters that provide further information about the type. This is usually used to supply the character set that the data uses. For example:

Content-Type: text/html; charset=UTF-8

In this case, we're being told that the character set of the document is UTF-8. The parameter is included after a semicolon, and it always takes the form of a key/value pair.
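
One way to separate the media type from its parameters, sketched here on the assumption that the standard library's email.message.Message class is an acceptable parser for the header value, is shown next. The parse_content_type() helper is a name invented for this illustration.

```python
from email.message import Message

def parse_content_type(value):
    """Split a Content-Type value into (media_type, params_dict)."""
    msg = Message()
    msg['Content-Type'] = value
    # get_params() returns (name, value) pairs; the first pair is the
    # media type itself, so it is skipped here.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params

media_type, params = parse_content_type('text/html; charset=UTF-8')
print(media_type)
print(params)
```

This copes with values that carry no parameters, or more than one, which a plain split on the semicolon does not.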

Let's work through an example: downloading the Python home page and using the Content-Type value that it returns. First, we submit our request:

>>> response = urlopen('http://www.python.org')

Then, we check the Content-Type value of our response, and extract the character set:

>>> format, params = response.getheader('Content-Type').split(';')
>>> params
' charset=utf-8'
>>> charset = params.split('=')[1]
>>> charset
'utf-8'

Lastly, we decode our response content by using the supplied character set:

>>> content = response.read().decode(charset)

Note that quite often, the server either doesn't supply a charset in the Content-Type header, or it supplies the wrong charset. So, this value should be taken as a suggestion. This is one of the reasons that we look at the Requests library later in this chapter. It will automatically gather all the hints that it can find about what character set should be used for decoding a response body and make a best guess for us.
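
Until then, a defensive approach can be sketched as follows. This is not how Requests makes its guess; it is a simplified illustration that assumes UTF-8 is a reasonable fallback, and decode_body() is a hypothetical helper name.

```python
from email.message import Message

def decode_body(raw, content_type, fallback='utf-8'):
    """Decode raw bytes using the declared charset, else a fallback."""
    msg = Message()
    if content_type:
        msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset is declared.
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode leniently with the fallback,
        # replacing any bytes that cannot be decoded.
        return raw.decode(fallback, errors='replace')
```

With a response object from urlopen(), this could be called as decode_body(response.read(), response.getheader('Content-Type')).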
