Chapter 16
Files are not limited to those on your hard drive. The following program retrieves a web page from the Internet, and then uses Python’s string methods to display specific information from it.
Listing 16.1: GCS Menu
 1  # menu.py
 2
 3  import urllib.request
 4
 5  URL = "http://www.central.edu/go/gcsmenu"
 6
 7  def getpage(url):
 8      with urllib.request.urlopen(url) as f:
 9          return str(f.read())
10
11  def gettag(page, tag, start=0):
12      opentag = "<" + tag + ">"
13      closetag = "</" + tag + ">"
14      i = page.find(opentag, start)
15      if i == -1:
16          return None, i
17      j = page.find(closetag, i)
18      return page[i + len(opentag):j], j
19
20  def process(page):
21      heading, i = gettag(page, "h2")
22      result = "\n" + heading.center(60) + "\n"
23      day, i = gettag(page, "h3", i)
24      while day is not None:
25          result += day + "\n"
26          meal, i = gettag(page, "p", i)
27          result += " " + meal.strip("<>p") + "\n"
28          day, i = gettag(page, "h3", i)
29      return result
30
31  def main():
32      page = getpage(URL)
33      print(process(page))
34
35  main()
Visit the URL in a web browser and use “View Page Source” to see the raw contents of the page. Listing 16.1 searches for data contained in HTML tags such as <h2>Grand Central Station Menu</h2>.
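To experiment with gettag() without a network connection, it can be run on a small HTML string. The following is a minimal sketch: SAMPLE is a hypothetical page made up for illustration, and the real menu page may be structured differently.

```python
# SAMPLE is a made-up page for illustration, not the real menu.
SAMPLE = "<h2>Grand Central Station Menu</h2><h3>Monday</h3><p>Soup</p>"

def gettag(page, tag, start=0):
    """Return (contents, index of closing tag) for the first <tag> at or after start."""
    opentag = "<" + tag + ">"
    closetag = "</" + tag + ">"
    i = page.find(opentag, start)
    if i == -1:
        return None, i
    j = page.find(closetag, i)
    return page[i + len(opentag):j], j

heading, i = gettag(SAMPLE, "h2")     # search from the beginning
day, i = gettag(SAMPLE, "h3", i)      # continue from where the last search ended
missing, k = gettag(SAMPLE, "table")  # this tag does not appear in SAMPLE

print(heading)     # Grand Central Station Menu
print(day)         # Monday
print(missing, k)  # None -1
```

Passing the returned index back in as start is what lets each search pick up where the previous one stopped.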
In addition to the new string methods, this example uses the is comparison, multiple assignment and return, constants, a different version of the import statement, and the urllib.request module.
Python provides many string methods. A few are highlighted here, but see the documentation for a complete list. These search within a string s:
s.find(t[, start[, end]]) | First index where t is a substring in s[start:end].
s.rfind(t[, start[, end]]) | Last index where t is a substring in s[start:end].
Both return -1 if not found. Optional start, end work like slices.
s.startswith(t) | True if t is a prefix of s.
s.endswith(t) | True if t is a suffix of s.
s.count(t) | Number of occurrences of substring t in s.
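A short sketch of the search methods in action; the string "banana bread" is an arbitrary example chosen here for illustration.

```python
s = "banana bread"

print(s.find("an"))          # 1   (first occurrence)
print(s.rfind("an"))         # 3   (last occurrence)
print(s.find("x"))           # -1  (not found)
print(s.startswith("ban"))   # True
print(s.endswith("bread"))   # True
print(s.count("a"))          # 4
```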
Square brackets in syntax descriptions, such as those used above, indicate optional elements. With the .find() methods, because there are two sets of brackets, there are actually three options:
s.find(t) | # Search s
s.find(t, i) | # Search s[i:]
s.find(t, i, j) | # Search s[i:j]
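The three forms can be compared on one string. This sketch also shows a common pattern, used by Listing 16.1 as well: repeating a search from just past the previous match. The example string is arbitrary.

```python
s = "banana bread"

print(s.find("a"))          # 1   search all of s
print(s.find("a", 2))       # 3   search s[2:]
print(s.find("a", 6, 10))   # -1  search s[6:10], which is " bre"

# The index returned is always a position in s itself, not in the slice.
# Collect every position of "a" by restarting just past each match.
positions = []
i = s.find("a")
while i != -1:
    positions.append(i)
    i = s.find("a", i + 1)
print(positions)            # [1, 3, 5, 10]
```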
These test the contents of s:
s.isalpha() | True if all characters in s are alphabetic.
s.isupper() | True if all letters in s are uppercase (and s has at least one letter).
s.islower() | True if all letters in s are lowercase (and s has at least one letter).
s.isdigit() | True if all characters in s are digits.
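A few calls illustrate these tests, including some edge cases that are easy to guess wrong. The sample strings are arbitrary.

```python
print("Python".isalpha())    # True
print("Python3".isalpha())   # False: "3" is not alphabetic
print("ABC".isupper())       # True
print("ABC 123".isupper())   # True: only the letters are considered
print("123".isupper())       # False: there are no letters at all
print("2024".isdigit())      # True
print("3.14".isdigit())      # False: "." is not a digit
print("".isalpha())          # False: the empty string fails these tests
```

Because isdigit() rejects signs and decimal points, it is suited to checking for non-negative whole numbers rather than numbers in general.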
These each return a modified copy of s:
s.upper() | All uppercase.
s.lower() | All lowercase.
s.capitalize() | First letter capitalized and the rest lowercase.
s.title() | Each word capitalized, rest lowercase.
s.replace(old, new) | All occurrences of substring old replaced by new.
s.center(width) | Centered in a string of width width.
s.strip([chars]) | All chars removed from both ends.
s.lstrip([chars]) | All chars removed from left end.
s.rstrip([chars]) | All chars removed from right end.
If the optional chars is omitted, whitespace is removed.
⇒ Caution: None of these methods changes the original string; they modify and return a copy.
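The sketch below applies several of these methods and confirms that the original string is left untouched. The sample strings are arbitrary. The last example shows a detail of the strip methods worth knowing: chars is treated as a set of individual characters, not as a prefix or suffix to match.

```python
s = "  the quick brown fox  "
t = s.strip()                        # whitespace removed from both ends

print(t)                             # the quick brown fox
print(t.upper())                     # THE QUICK BROWN FOX
print(t.capitalize())                # The quick brown fox
print(t.title())                     # The Quick Brown Fox
print(t.replace("quick", "slow"))    # the slow brown fox
print("[" + "menu".center(10) + "]") # [   menu   ]

# The original string is unchanged; each call returned a new string.
print(s == "  the quick brown fox  ")  # True

# chars is a set of characters: any of them is removed from either end,
# so the final "p" of "Soup" is stripped too.
print("<p>Soup".strip("<>p"))        # Sou
```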
Recall that Python has the special value None to represent “nothing” or “no object.” The gettag() function in Listing 16.1 uses None to indicate that a tag was not found. The proper way to test for None in the calling function on line 24 is with an is comparison:
obj1 is obj2 | True if obj1 and obj2 refer to the same object.
obj1 is not obj2 | Opposite of is.
The is comparison never checks the value of its object references; it only checks whether or not the references refer to precisely the same object.
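The difference between is and == shows up clearly with two lists that hold equal values. This is a small illustrative sketch; the variable names are arbitrary.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with the same contents
c = a           # a second name for the same list object

print(a == b)   # True:  the values are equal
print(a is b)   # False: they are two distinct objects
print(a is c)   # True:  both names refer to one object

# There is only one None object, so "is" is the natural test for it.
result = None
print(result is None)       # True
print(result is not None)   # False
```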
Python allows you to assign values to more than one variable at a time, as long as the number of values on the right side of the equals sign is the same as the number of variables on the left:
<var1>, <var2>, ..., <varN> = <expr1>, <expr2>, ..., <exprN>
In a multiple assignment, all expressions on the right-hand side are evaluated before assigning values to the variables on the left side, and the assignments are considered to happen simultaneously.
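Because the right-hand side is evaluated first, multiple assignment can swap two variables without a temporary. The sketch below also shows what happens when the counts do not match; the variable names are arbitrary.

```python
x, y = 1, 2
x, y = y, x          # both right-hand values are read before either is assigned
print(x, y)          # 2 1

a, b = 0, 1
a, b = b, a + b      # a + b uses the old value of a
print(a, b)          # 1 1

# A mismatch between the two sides raises ValueError.
try:
    p, q = 1, 2, 3
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)   # True
```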
Recall from Chapter 6 that a function may return more than one value by separating the values with commas, as on line 18 of Listing 16.1. Each call to gettag() in the process() function of Listing 16.1 shows how multiple assignment can be used to store multiple return values. Technically, multiple values are returned in a tuple, defined later in Chapter 19.
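As a small sketch of the same idea outside Listing 16.1, the hypothetical function minmax() below returns two values, which the caller can either unpack with multiple assignment or keep together as one tuple.

```python
def minmax(values):
    """Return the smallest and largest items of a non-empty list (hypothetical helper)."""
    return min(values), max(values)

low, high = minmax([3, 1, 4, 1, 5])   # unpack the two return values
print(low, high)                      # 1 5

pair = minmax([3, 1, 4, 1, 5])        # or keep them together
print(pair)                           # (1, 5)
```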
Occasionally, it is helpful to create a variable for a value that will never change. Such variables are called constants, and in Python they are usually written in all capitals with underscores between words. Listing 16.1 defines the constant URL in line 5.
Modules may include names that represent constants in addition to functions. For example, the math module includes the constant pi. These module constants do not use the all-caps naming convention.
The string module provides several constants, including:
punctuation | String of all punctuation characters.
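The sketch below uses one constant defined in the program and two module constants. MAX_WIDTH and the sample message are made up for illustration.

```python
import math
import string

MAX_WIDTH = 60   # a program constant, written in all capitals by convention

msg = "hello, world!"

print("!" in string.punctuation)        # True
print(msg.strip(string.punctuation))    # hello, world  (only the ends are stripped)

# Remove punctuation anywhere in the message by keeping the other characters.
cleaned = "".join(ch for ch in msg if ch not in string.punctuation)
print(cleaned)                          # hello world

print(math.pi > 3.14)                   # True
```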
A second form of import is used in Listing 16.1:
import <module>
After this statement, any name from the module may be referred to with dot notation:
<module>.<name>
One advantage of this form is that every name in the module is made available without having to list each one. A second is that every use of a name from the module is easy to find because of the dot notation; for example, see line 8. Finally, this syntax prevents a module name from accidentally hiding the same name defined as a built-in function or in another module. For these reasons, production code usually uses this form, and we will use it frequently throughout the remainder of the text.
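A brief sketch of dot notation keeping names apart: here the program defines its own pi (a deliberately rough value chosen for illustration), and the module's version stays reachable as math.pi.

```python
import math

pi = 3   # the program's own name; it does not disturb math.pi

rough_area = pi * 2 ** 2        # uses the local pi
area = math.pi * 2 ** 2         # uses the constant from the math module

print(rough_area)               # 12
print(math.sqrt(16))            # 4.0
print(area > rough_area)        # True
```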
The urllib.request module provides support for reading web pages. Every page on the web is described by its URL or uniform resource locator, essentially its address on the web. Listing 16.1 uses the following function from urllib.request:
urllib.request.urlopen(url) | File-like object accessing url.
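The file-like object that is returned supports a read() method, but this read() method returns raw bytes rather than text. The str() type conversion in line 9 converts the bytes to a string. Note that str() applied to a bytes object produces its printable representation, with a leading b' and with characters such as newlines spelled out as escape sequences; the bytes method decode() is the usual way to obtain the text itself.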
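The difference between bytes and text can be seen without a network connection by starting from a bytes literal. In this sketch, raw is a made-up stand-in for what read() might return, and UTF-8 is assumed as the encoding; a real page may use a different one.

```python
raw = b"<h2>Menu</h2>\n"          # hypothetical bytes, standing in for f.read()

as_repr = str(raw)                # printable representation of the bytes object
as_text = raw.decode("utf-8")     # the text itself, assuming UTF-8 encoding

print(as_repr)                    # b'<h2>Menu</h2>\n'
print(as_repr.startswith("b'"))   # True
print(as_text.endswith("\n"))     # True: a real newline character
print(as_repr.find("<h2>"))       # 2: tags can still be found, offset by the b' prefix
```

Either string can be searched with find(), which is why the tag search in Listing 16.1 still works on the result of str().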
Write encode(msg, n) and decode(msg, n) functions, and test your program by decoding encoded messages and checking that the results are the same as the originals. Maintain case within messages (so that uppercase stays upper and lowercase stays lower), and handle punctuation appropriately.