An App Engine application can connect to other sites on the Internet to retrieve data and communicate with web services. It does this not by opening a connection to the remote host from the application server, but through a scalable service called the URL Fetch service. This takes the burden of maintaining connections away from the app servers, and ensures that resource fetching performs well regardless of how many request handlers are fetching resources simultaneously. As with other parts of the App Engine infrastructure, the URL Fetch service is used by other Google applications to fetch web pages.
The URL Fetch service supports fetching URLs by using the HTTP protocol, and using HTTP with SSL (HTTPS). Other methods sometimes associated with URLs (such as FTP) are not supported.
Because the URL Fetch service is based on Google infrastructure, the service inherits a few restrictions that were put in place in the original design of the underlying HTTP proxy. The service supports the five most common HTTP actions (GET, POST, PUT, HEAD, and DELETE) but does not allow for others or for using a nonstandard action. Also, it can only connect to TCP ports in several allowed ranges: 80–90, 440–450, and 1024–65535. By default, it uses port 80 for HTTP, and port 443 for HTTPS. The proxy uses HTTP 1.1 to connect to the remote host.
The outgoing request can contain URL parameters, a request body, and HTTP headers. A few headers cannot be modified for security reasons, which mostly means that an app cannot issue a malformed request, such as a request whose Content-Length header does not accurately reflect the actual content length of the request body. In these cases, the service uses the correct values, or does not include the header.
Request and response sizes are limited, but generous. A request can be up to 5 megabytes in size (including headers), and a response can be up to 32 megabytes in size.
The service waits for a response up to a time limit, or “deadline.” The default fetch deadline is 5 seconds, but you can increase this on a per-request basis. The maximum deadline is 60 seconds during a user request, or 10 minutes for a fetch made from a task queue task, a scheduled task, or a backend. That is, the fetch deadline can be as long as the request handler’s own deadline, except on backends (whose request handlers have no deadline).
Both the Python and Java runtime environments offer implementations of standard libraries used for fetching URLs that call the URL Fetch service behind the scenes. For Python, these are the urllib, httplib, and urllib2 modules. For Java, this is the java.net.URLConnection set of APIs, including java.net.URL. These implementations give you a reasonable degree of portability and interoperability with other libraries.
Naturally, the standard interfaces do not give you complete access to the service’s features. When using the standard libraries, the service uses the following default behaviors:
If the remote host doesn’t respond within five seconds, the request is canceled and a service exception is raised.
The service follows HTTP redirects up to five times before returning the response to the application.
Responses from remote hosts that exceed 32 megabytes in size are truncated to 32 megabytes. The application is not told whether the response is truncated.
HTTP over SSL (HTTPS) URLs will use SSL to make the connection, but the service will not validate the server’s security certificate. (The App Engine team has said certificate validation will become the default for the standard libraries in a future release, so check the App Engine website.)
All of these behaviors can be customized when calling the service APIs directly. You can increase the fetch response deadline, disable the automatic following of redirects, cause an exception to be thrown for responses that exceed the maximum size, and enable validation of certificates for HTTPS connections.
The development server simulates the URL Fetch service by making HTTP connections directly from your computer. If the remote host might behave differently when your app connects from your computer rather than from Google’s proxy servers, be sure to test your URL Fetch calls on App Engine.
In this chapter, we introduce the standard-library and direct interfaces to the URL Fetch service, in Python and in Java. We also examine several features of the service, and how to use them from the direct APIs.
Fetching resources from remote hosts can take quite a bit of time. Like several other services, the URL Fetch service offers a way to call the service asynchronously, so your application can issue fetch requests and do other things while remote servers take their time to respond. See Chapter 17 for more information.
In Python, you can call the URL Fetch service by using the google.appengine.api.urlfetch module, or you can use Python standard libraries such as urllib2.
The Python runtime environment overrides portions of the urllib, urllib2, and httplib modules in the Python standard library so that HTTP and HTTPS connections made with these modules use the URL Fetch service. This allows existing software that depends on these libraries to function on App Engine, as long as the requests function within certain limitations. urllib2 has rich extensible support for features of remote web servers such as HTTP authentication and cookies. We won’t go into the details of this module here, but Example 13-1 shows a brief example using the module’s urlopen() convenience function.
Example 13-1. A simple example of using the urllib2 module to access the URL Fetch service
    import urllib2
    from google.appengine.api import urlfetch

    # ...

    try:
        newsfeed = urllib2.urlopen('http://ae-book.appspot.com/blog/atom.xml/')
        newsfeed_xml = newsfeed.read()
    except urllib2.URLError, e:
        # Handle urllib2 error...
        pass
    except urlfetch.Error, e:
        # Handle urlfetch error...
        pass
In this example, we catch both exceptions raised by urllib2 and exceptions raised from the URL Fetch Python API, google.appengine.api.urlfetch. The service may throw one of its own exceptions for conditions that urllib2 doesn’t catch, such as a request exceeding its deadline.
Because the service follows redirect responses by default (up to five times) when using urllib2, a urllib2 redirect handler will not see all redirects, only the final response.
If you use the service API directly, you can customize these behaviors. Example 13-2 shows a similar example using the urlfetch module, with several options changed.
Example 13-2. Customizing URL Fetch behaviors, using the urlfetch module
    from google.appengine.api import urlfetch

    # ...

    try:
        newsfeed = urlfetch.fetch('http://ae-book.appspot.com/blog/atom.xml/',
                                  allow_truncated=False,
                                  follow_redirects=False,
                                  deadline=10)
        newsfeed_xml = newsfeed.content
    except urlfetch.Error, e:
        # Handle urlfetch error...
        pass
We’ll consider the direct URL Fetch API for the rest of this chapter.
In Java, the direct URL Fetch service API is provided by the com.google.appengine.api.urlfetch package. You can also use standard java.net calls to fetch URLs. The Java runtime includes a custom implementation of the URLConnection class in the java.net package that calls the URL Fetch service instead of making a direct socket connection. As with the other standard interfaces, you can use this interface and rest assured that you can port your app to another platform easily.
Example 13-3 shows a simple example of using a convenience method in the URL class, which in turn uses the URLConnection class to fetch the contents of a web page. The openStream() method of the URL object returns an input stream of bytes. As shown, you can use an InputStreamReader (from java.io) to process the byte stream as a character stream. The BufferedReader class makes it easy to read lines of text from the InputStreamReader.
Example 13-3. Using java.net.URL to call the URL Fetch service
    import java.net.URL;
    import java.net.MalformedURLException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.BufferedReader;

    // ...

    try {
        URL url = new URL("http://ae-book.appspot.com/blog/atom.xml/");
        InputStream inStream = url.openStream();
        InputStreamReader inStreamReader = new InputStreamReader(inStream);
        BufferedReader reader = new BufferedReader(inStreamReader);

        // ... read characters or lines with reader ...

        reader.close();
    } catch (MalformedURLException e) {
        // ...
    } catch (IOException e) {
        // ...
    }
Note that the URL Fetch service has already buffered the entire response into the application’s memory by the time the app begins to read. The app reads the response data from memory, not from a network stream from the socket or the service.
You can use other features of the URLConnection interface, as long as they operate within the functionality of the service API. Notably, the URL Fetch service does not maintain a persistent connection with the remote host, so features that require such a connection will not work.
By default, the URL Fetch service waits up to five seconds for a response from the remote server. If the server does not respond by the deadline, the service throws an IOException. You can adjust the amount of time to wait using the setConnectTimeout() method of the URLConnection. (The setReadTimeout() method has the same effect; the service uses the greater of the two values.) The deadline can be up to 60 seconds during user requests, or up to 10 minutes (600 seconds) for task queue and scheduled tasks and when running on a backend.
When using the URLConnection interface, the URL Fetch service follows HTTP redirects automatically, up to five consecutive redirects. The app does not see the intermediate redirect responses, only the last one. If there are more than five redirects, the service returns the fifth redirect response to the app.
The low-level API for the URL Fetch service lets you customize several behaviors of the service. Example 13-4 demonstrates how to fetch a URL with this API with options specified. As shown, the FetchOptions object tells the service not to follow any redirects, and to throw a ResponseTooLargeException if the response exceeds the maximum size of 32 megabytes instead of truncating the data.
Example 13-4. Using the low-level API to call the URL Fetch service, with options
    import java.io.IOException;
    import java.net.URL;
    import java.net.MalformedURLException;
    import com.google.appengine.api.urlfetch.FetchOptions;
    import com.google.appengine.api.urlfetch.HTTPMethod;
    import com.google.appengine.api.urlfetch.HTTPRequest;
    import com.google.appengine.api.urlfetch.HTTPResponse;
    import com.google.appengine.api.urlfetch.ResponseTooLargeException;
    import com.google.appengine.api.urlfetch.URLFetchService;
    import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

    // ...

    try {
        URL url = new URL("http://ae-book.appspot.com/blog/atom.xml/");

        // Do not follow redirects, and fail rather than truncate.
        FetchOptions options = FetchOptions.Builder
            .doNotFollowRedirects()
            .disallowTruncate();
        HTTPRequest request = new HTTPRequest(url, HTTPMethod.GET, options);

        URLFetchService urlfetch = URLFetchServiceFactory.getURLFetchService();
        HTTPResponse response = urlfetch.fetch(request);

        // ... process response.getContent() ...

    } catch (ResponseTooLargeException e) {
        // ...
    } catch (MalformedURLException e) {
        // ...
    } catch (IOException e) {
        // ...
    }
You use the FetchOptions to adjust many of the service’s features. You get an instance of this class by calling a static method of FetchOptions.Builder, and then set options by calling methods on the instance. For convenience, there is a static method for each option, and every method returns the instance, so your code can build the full set of options with a single statement of chained method calls.
We will use the direct urlfetch API for the remainder of this chapter.
An HTTP request can consist of a URL, an HTTP method, request headers, and a payload. Only the URL and HTTP method are required, and the API assumes you mean the HTTP GET method if you only provide a URL.
In Python, you fetch a URL using HTTP GET by passing the URL to the fetch() function in the google.appengine.api.urlfetch module:
    from google.appengine.api import urlfetch

    # ...

    response = urlfetch.fetch('http://www.example.com/feed.xml')
In Java, you prepare an instance of the HTTPRequest class from the com.google.appengine.api.urlfetch package with the URL as a java.net.URL instance, then you pass the request object to the service’s fetch() method. (Notice that this HTTPRequest class is different from the J2EE class you use with your request handler servlets.)
    import java.net.URL;
    import com.google.appengine.api.urlfetch.HTTPRequest;
    import com.google.appengine.api.urlfetch.HTTPResponse;
    import com.google.appengine.api.urlfetch.URLFetchService;
    import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

    // ...

    HTTPRequest outRequest =
        new HTTPRequest(new URL("http://www.example.com/feed.xml"));
    URLFetchService urlfetch = URLFetchServiceFactory.getURLFetchService();
    HTTPResponse response = urlfetch.fetch(outRequest);
The URL consists of a scheme, a domain, an optional port, and a path. For example:
https://www.example.com:8081/private/feed.xml
In this example, https is the scheme, www.example.com is the domain, 8081 is the port, and /private/feed.xml is the path.
The URL Fetch service supports the http and https schemes. Other schemes, such as ftp, are not supported.
If no port is specified, the service will use the default port for the scheme: port 80 for HTTP, and port 443 for HTTPS. If you specify a port, it must be within 80–90, 440–450, or 1024–65535.
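The scheme and port rules above can be checked before a fetch is attempted. The following is a minimal sketch in plain Python 3, not part of the App Engine API: the helper name effective_port() is hypothetical, and the port ranges are copied from the text rather than queried from the service.

```python
from urllib.parse import urlsplit

# Schemes and port ranges as described in the text above.
DEFAULT_PORTS = {"http": 80, "https": 443}
ALLOWED_PORT_RANGES = [(80, 90), (440, 450), (1024, 65535)]


def effective_port(url):
    """Return the port a fetch of `url` would use, or raise ValueError.

    Hypothetical helper: it only mirrors the documented rules locally.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError("unsupported scheme: %r" % scheme)
    # urlsplit() reports the port as None when the URL does not name one.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[scheme]
    if not any(low <= port <= high for low, high in ALLOWED_PORT_RANGES):
        raise ValueError("port %d is outside the allowed ranges" % port)
    return port


def is_fetchable(url):
    """Return True if `url` passes the scheme and port checks."""
    try:
        effective_port(url)
    except ValueError:
        return False
    return True
```

A check like this lets a handler reject a user-supplied URL with a clear message instead of waiting for the service to refuse it.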
As a safety measure against accidental request loops in an application, the URL Fetch service will refuse to fetch the URL that maps to the request handler doing the fetching. An app can make connections to other URLs of its own, so request loops are still possible, but this restriction provides a simple sanity check.
As shown above, the Python API takes the URL as a string passed to the fetch() function as its first positional argument. The Java API accepts a java.net.URL object as an argument to the HTTPRequest constructor.
The HTTP method describes the general nature of the request, as codified by the HTTP standard. For example, the GET method asks for the data associated with the resource identified by the URL (such as a document or database record). The server is expected to verify that the request is allowed, then return the data in the response, without making changes to the resource. The POST method asks the server to modify records or perform an action, and the client usually includes a payload of data with the request.
The URL Fetch service can send requests using the GET, POST, PUT, HEAD, and DELETE methods. No other methods are supported.
In Python, you set the method by providing the method keyword argument to the fetch() function. The possible values are provided as constants by the urlfetch module. If the argument is omitted, it defaults to urlfetch.GET. To provide a payload, you set the payload keyword argument:
    profile_data = profile.get_field_data()
    response = urlfetch.fetch('http://www.example.com/profile/126542',
                              method=urlfetch.POST,
                              payload=profile_data)
In Java, the method is an optional second argument to the HTTPRequest constructor. Its value is from the enum HTTPMethod, whose values are named GET, POST, PUT, HEAD, and DELETE. To add a payload, you call the setPayload() method of the HTTPRequest, passing in a byte[]:
    import com.google.appengine.api.urlfetch.HTTPMethod;

    // ...

    byte[] profileData = profile.getFieldData();
    HTTPRequest request = new HTTPRequest(url, HTTPMethod.POST);
    request.setPayload(profileData);
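The payload is a raw byte string, so the app decides how to encode it. As one illustration, the sketch below builds a form-encoded body in plain Python 3. The helper name build_form_post() and the field names are hypothetical, and the choice of form encoding is an assumption about what the remote server expects, not something the URL Fetch service requires.

```python
from urllib.parse import urlencode


def build_form_post(fields):
    """Return (payload, headers) for a form-encoded POST body.

    Hypothetical helper: the remote server is assumed to accept
    application/x-www-form-urlencoded data.
    """
    payload = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return payload, headers


payload, headers = build_form_post({"name": "Ada Lovelace", "city": "London"})

# On App Engine, the result would be passed along the lines of:
#   urlfetch.fetch(url, method=urlfetch.POST,
#                  payload=payload, headers=headers)
```

The service sets Content-Length itself, so the helper supplies only the content type.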
Requests can include headers, a set of key-value pairs distinct from the payload that describe the client, the request, and the expected response. App Engine sets several headers automatically, such as Content-Length. Your app can provide additional headers that may be expected by the server.
In Python, the fetch() function accepts additional headers as the headers keyword argument. Its value is a mapping of header names to values:
    response = urlfetch.fetch('http://www.example.com/article/roof_on_fire',
                              headers={'Accept-Charset': 'utf-8'})
In Java, you set a request header by calling the setHeader() method on the HTTPRequest. Its sole argument is an instance of the HTTPHeader class, whose constructor takes the header name and value as strings:
    import com.google.appengine.api.urlfetch.HTTPHeader;

    // ...

    HTTPRequest request = new HTTPRequest(
        new URL("http://www.example.com/article/roof_on_fire"));
    request.setHeader(new HTTPHeader("Accept-Charset", "utf-8"));
Some headers cannot be set directly by the application. This is primarily to discourage request forgery or invalid requests that could be used as an attack on some servers. Disallowed headers include Content-Length (which is set by App Engine automatically to the actual size of the request), Host, Vary, Via, X-Forwarded-For, and X-ProxyUser-IP.
The User-Agent header, which most servers use to identify the software of the client, can be set by the app. However, App Engine will append a string to this value identifying the request as coming from App Engine. This string includes your application ID. This is usually enough to allow an app to coax a server into serving content intended for a specific type of client (such as a specific brand or version of web browser), but it won’t be a complete impersonation of such a client.
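When headers come from configuration or from another library, it can help to separate out the ones the service will not let the app set. The sketch below does this in plain Python 3. The helper name split_headers() is hypothetical, and the set of names is only the list given above, which the text describes as a partial list.

```python
# Header names the text above lists as disallowed. The text says the
# disallowed headers "include" these, so the set may not be exhaustive.
DISALLOWED_HEADERS = frozenset(
    name.lower()
    for name in ("Content-Length", "Host", "Vary", "Via",
                 "X-Forwarded-For", "X-ProxyUser-IP")
)


def split_headers(headers):
    """Separate headers the app may set from those the service controls.

    Header names are compared case-insensitively. Returns a pair of
    dicts: (allowed, rejected).
    """
    allowed, rejected = {}, {}
    for name, value in headers.items():
        target = rejected if name.lower() in DISALLOWED_HEADERS else allowed
        target[name] = value
    return allowed, rejected


allowed, rejected = split_headers({
    "Accept-Charset": "utf-8",
    "content-length": "42",
    "Host": "www.example.com",
})
```

The allowed mapping can then be passed as the headers argument, and the rejected one logged so a misconfiguration is visible.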
When the scheme of a URL is https, the URL Fetch service uses HTTP over SSL to connect to the remote server, encrypting both the request and the response.
The SSL protocol also allows the client to verify the identity of the remote host, to ensure it is talking directly to the host and traffic is not being intercepted by a malicious host (a “man in the middle” attack). This protocol involves security certificates and a process for clients to validate certificates.
By default, the URL Fetch service does not validate SSL certificates. With validation disabled, traffic is still encrypted, but the remote host’s certificates are not validated before sending the request data. You can tell the URL Fetch service to enable validation of security certificates.
To enable certificate validation in Python, you provide the validate_certificate=True argument to fetch():
    response = urlfetch.fetch('https://secure.example.com/profile/126542',
                              validate_certificate=True)
In Java, you use a FetchOptions instance with the request, and call its validateCertificate() option. Its antonym is doNotValidateCertificate(), which is the default:
    FetchOptions options = FetchOptions.Builder
        .validateCertificate();
    HTTPRequest request = new HTTPRequest(
        new URL("https://secure.example.com/profile/126542"),
        HTTPMethod.GET, options);
The standard libraries use the default behavior and do not validate certificates. The App Engine team has said they will change this default for the standard libraries in a future release. See the official App Engine website for updates.
The request can be up to 5 megabytes in size, including the headers and payload. The response can be up to 32 megabytes in size.
The URL Fetch service can do one of two things if the remote host returns a response larger than 32 megabytes: it can truncate the response (delete everything after the first 32 megabytes), or it can raise an exception in your app. You control this behavior with an option.
In Python, the fetch() function accepts an allow_truncated=True keyword argument. The default is False, which tells the service to raise a urlfetch.ResponseTooLargeError if the response is too large:
    response = urlfetch.fetch('http://www.example.com/firehose.dat',
                              allow_truncated=True)
In Java, the FetchOptions method allowTruncate() enables truncation, and disallowTruncate() tells the service to throw a ResponseTooLargeException if the response is too large:
    FetchOptions options = FetchOptions.Builder
        .allowTruncate();
    HTTPRequest request = new HTTPRequest(
        new URL("http://www.example.com/firehose.dat"),
        HTTPMethod.GET, options);
The standard libraries tell the URL Fetch service to allow truncation. This ensures that the standard libraries won’t raise an unfamiliar exception when third-party code fetches a URL, at the expense of returning unexpectedly truncated data when responses are too large.
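The two outcomes for an oversized response can be modeled locally, which is handy when unit-testing code that has to cope with both. This is a sketch in plain Python 3 and not the service's implementation: apply_response_limit() and ResponseTooLarge are hypothetical stand-ins, and the limit assumes that 32 megabytes means 32 × 1024 × 1024 bytes.

```python
# Assumed byte value for the 32-megabyte limit described above.
MAX_RESPONSE_BYTES = 32 * 1024 * 1024


class ResponseTooLarge(Exception):
    """Local stand-in for urlfetch.ResponseTooLargeError."""


def apply_response_limit(body, allow_truncated, limit=MAX_RESPONSE_BYTES):
    """Model the documented choice between truncating and raising.

    Returns (content, was_truncated). Raises ResponseTooLarge when the
    body exceeds `limit` and truncation is not allowed.
    """
    if len(body) <= limit:
        return body, False
    if allow_truncated:
        # Keep the first `limit` bytes and report the truncation.
        return body[:limit], True
    raise ResponseTooLarge(
        "response of %d bytes exceeds the %d-byte limit" % (len(body), limit))


def try_limit(body, allow_truncated, limit):
    """Return the result, or None if the model raises ResponseTooLarge."""
    try:
        return apply_response_limit(body, allow_truncated, limit)
    except ResponseTooLarge:
        return None
```

Passing a small limit in tests, such as 4 bytes, exercises both branches without allocating 32 megabytes of data.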
The URL Fetch service issues a request, waits for the remote host to respond, and then makes the response available to the app. But the service won’t wait on the remote host forever. By default, the service will wait 5 seconds before terminating the connection and raising an exception with your app.
You can adjust the amount of time the service will wait (the “deadline”) as an option to the fetch call. You can set a deadline up to 60 seconds for fetches made during user requests, and up to 10 minutes (600 seconds) for requests made during tasks. That is, you can wait up to the maximum amount of time your request handler can run. Typically, you’ll want to set a fetch deadline shorter than your request handler’s deadline, so it can react to a failed fetch.
To set the fetch deadline in Python, provide the deadline keyword argument, whose value is a number of seconds. If a fetch exceeds its deadline, the service raises a urlfetch.DeadlineExceededError:
    response = urlfetch.fetch('http://www.example.com/users/ackermann',
                              deadline=30)
In Java, the FetchOptions class provides a setDeadline() method, which takes a java.lang.Double. The Builder static method is slightly different, named withDeadline() and taking a double. The value is a number of seconds:
    FetchOptions options = FetchOptions.Builder
        .withDeadline(30);
    HTTPRequest request = new HTTPRequest(
        new URL("http://www.example.com/users/ackermann"),
        HTTPMethod.GET, options);
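The deadline limits described in this section can be captured in a small helper that picks a value to pass as the deadline option. This is a sketch in plain Python 3 under stated assumptions: choose_deadline() is a hypothetical name, and clamping an oversized value to the maximum is a client-side choice made here for illustration. The text does not say how the service itself treats a deadline above the maximum.

```python
DEFAULT_DEADLINE = 5.0            # seconds, the documented default
MAX_DEADLINE_USER_REQUEST = 60.0  # seconds, during a user request
MAX_DEADLINE_TASK = 600.0         # seconds (10 minutes), during a task


def choose_deadline(requested=None, in_task=False):
    """Return a fetch deadline in seconds that respects the limits.

    `requested` is the deadline the caller would like, or None for the
    default. `in_task` selects the longer limit that applies to task
    queue and scheduled tasks.
    """
    if requested is None:
        return DEFAULT_DEADLINE
    if requested <= 0:
        raise ValueError("deadline must be a positive number of seconds")
    limit = MAX_DEADLINE_TASK if in_task else MAX_DEADLINE_USER_REQUEST
    # Clamp rather than fail, so callers always get a usable value.
    return min(float(requested), limit)
```

As the text notes, a handler will usually want a value below the limit so it still has time to react when a fetch fails.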
You can tell the service to follow redirects automatically, if HTTP redirect responses are returned by the remote server. The service will follow up to five redirects, then return the last response to the app (regardless of whether the last response is a redirect or not).
In Python, urlfetch.fetch() accepts a follow_redirects=True keyword argument. The default is False, which means to return the first response even if it’s a redirect. When using urllib2, redirects are followed automatically, up to five times:
    response = urlfetch.fetch('http://www.example.com/bounce',
                              follow_redirects=True)
In Java, the FetchOptions.Builder has a followRedirects() method, and its opposite doNotFollowRedirects(). The default is to not follow redirects. When using java.net.URLConnection, redirects are followed automatically, up to five times:
    FetchOptions options = FetchOptions.Builder
        .followRedirects();
    HTTPRequest request = new HTTPRequest(
        new URL("http://www.example.com/bounce"),
        HTTPMethod.GET, options);
When following redirects, the service does not retain or use cookies set in the responses of the intermediate steps. If you need requests to honor cookies during a redirect chain, you must disable the automatic redirect feature, and process redirects manually in your application code.
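Processing redirects by hand might look like the following sketch, written in plain Python 3 so it can run outside App Engine. Several things here are assumptions: the fetch argument is a hypothetical callable that makes one request with redirects disabled and returns an object with status_code and headers (lowercase keys assumed), and the cookie handling ignores domain, path, and expiry rules that a real client would honor.

```python
from http.cookies import SimpleCookie
from types import SimpleNamespace
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307)
MAX_REDIRECTS = 5


def fetch_with_cookies(fetch, url, max_redirects=MAX_REDIRECTS):
    """Follow redirects manually, replaying cookies set along the way.

    Returns (response, url) for the last request made.
    """
    cookies = {}
    for hops in range(max_redirects + 1):
        headers = {}
        if cookies:
            headers["Cookie"] = "; ".join(
                "%s=%s" % item for item in sorted(cookies.items()))
        response = fetch(url, headers)
        # Remember any cookie the server set on this hop.
        jar = SimpleCookie()
        jar.load(response.headers.get("set-cookie", ""))
        for name, morsel in jar.items():
            cookies[name] = morsel.value
        location = response.headers.get("location")
        is_redirect = response.status_code in REDIRECT_CODES and location
        if not is_redirect or hops == max_redirects:
            return response, url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)


# A fake fetch function standing in for the real service call.
calls = []


def fake_fetch(url, headers):
    calls.append((url, dict(headers)))
    if url.endswith("/bounce"):
        return SimpleNamespace(
            status_code=302,
            headers={"location": "/landing",
                     "set-cookie": "session=abc123; Path=/"})
    return SimpleNamespace(status_code=200, headers={})


final_response, final_url = fetch_with_cookies(
    fake_fetch, "http://www.example.com/bounce")
```

On App Engine, the callable would wrap a call to urlfetch.fetch() with follow_redirects=False and the headers argument, adapting the header lookup to however the response exposes header names.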
In Python, the fetch() function returns an object with response data available on several named properties. (The class name for response objects is _URLFetchResult, which implies that only the fetch() function should be constructing these objects, or relying on the class name.)
In Java, the fetch() service method returns an HTTPResponse instance, with getter methods for the response data.
The response fields are as follows:
content / getContent()
The response body. A Python str or Java byte[].
status_code / getResponseCode()
The HTTP status code. An int.
headers / getHeaders()
The response headers. In Python, this value is a mapping of names to values. In Java, this is a List<HTTPHeader>, where each header has getName() and getValue() methods (returning strings).
final_url / getFinalUrl()
The URL that corresponds to the response data. If automatic redirects were enabled and the server issued one or more redirects, this is the URL of the final destination, which may differ from the request URL. A Python str or a Java java.net.URL.
content_was_truncated (Python only)
True if truncation was enabled and the response data was larger than 32 megabytes. (There is no Java equivalent.)
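A handler typically inspects several of these fields together before using the content. The sketch below shows one way to do that in plain Python 3. The helper name describe_response() is hypothetical, the response is a SimpleNamespace stand-in rather than a real fetch result, and the lowercase header key is an assumption about how the headers mapping is keyed.

```python
from types import SimpleNamespace


def describe_response(response, requested_url):
    """Summarize a fetch result using the fields listed above."""
    return {
        # Treat any 2xx status as success.
        "ok": 200 <= response.status_code < 300,
        # final_url differs from the request URL after a redirect.
        "redirected": response.final_url not in (None, requested_url),
        # content_was_truncated is available in Python only.
        "truncated": bool(getattr(response, "content_was_truncated", False)),
        "content_type": response.headers.get("content-type", "unknown"),
        "length": len(response.content),
    }


# A stand-in for the object returned by fetch().
sample = SimpleNamespace(
    content=b"<feed/>",
    status_code=200,
    headers={"content-type": "application/atom+xml"},
    final_url="http://www.example.com/feed.xml",
    content_was_truncated=False,
)

summary = describe_response(sample, "http://www.example.com/bounce")
```

Checking the truncated flag before parsing matters most when truncation is allowed, because the service does not otherwise signal that data is missing.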