Computing often works with persistent data. There may be source data to be analyzed, or output to be created using Python input and output operations. The map of the dungeon that's explored in a game is data that will be input to the game application. Images, sounds, and movies are data output by some applications and input by other applications. Even a request through a network will involve input and output operations. The common aspect to all of these is the concept of a file of data. The term file is overloaded with many meanings:
Python gives us two common modes for working with a file's content: text mode and binary mode. For binary files such as images, we rely on libraries such as pillow to handle the details of image file encoding into bytes. Additionally, Python modules like shelve and pickle have unique ways of representing more complex Python objects than simple strings. There are a number of pickle protocols available; all of them are based on binary-mode file operations.
Throughout this chapter, we'll talk about how Python objects are serialized. When an object is written to a file, a representation of the Python object's state is transformed into a series of bytes. Often, the translation involves text objects as an intermediate notation. Deserialization is the reverse process: it recovers a Python object's state from the bytes of a file. Saving and transferring a representation of the object state is the foundational concept behind REST web services.
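As a concrete illustration of this round trip, here is a minimal sketch using the standard pickle module (the dictionary contents are made up for the example):

```python
import pickle

# Serialization: the object's state is transformed into a series of bytes.
waypoint = {"lat": 32.832, "lon": -79.934, "name": "waypoint-1"}
serialized: bytes = pickle.dumps(waypoint)

# Deserialization: the object's state is recovered from those bytes.
recovered = pickle.loads(serialized)
assert recovered == waypoint
```

The same dumps()/loads() pair underlies file-based pickling with dump() and load().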
When we process data from files, we have two common concerns: the physical format of the file and the logical layout of the data. Common physical formats are handled by modules such as csv, json, and pickle, among many others. Both the physical format and the logical layout are essential to interpreting the data in a file. We'll look at a number of recipes for working with different physical formats. We'll also look at ways to divorce our program from some aspects of the logical layout.
In this chapter, we'll look at the following recipes:

- Using pathlib to work with filenames
- Using dataclasses to simplify working with CSV files
- Using a csv.DictReader as a dataclass reader

In order to start doing input and output with files, we'll start by working with the OS filesystem. The common features of the directory structure of files and devices are described by Python's pathlib module. This module has consistent behavior across a number of operating systems, allowing a Python program to work similarly on Linux, macOS, and Windows.
Most operating systems use a hierarchical path to identify a file. Here's an example filename, including the entire path:
/Users/slott/Documents/Writing/Python Cookbook/code
This full pathname has the following elements:

- The leading / means the name is absolute. It starts from the root of the directory of files. In Windows, there can be an extra letter in front of the name, such as C:, to distinguish between the directories on individual storage devices. Linux and macOS treat all the devices as a unified hierarchy.
- Users, slott, Documents, Writing, Python Cookbook, and code represent the directories (or "folders," as a visual metaphor) of the filesystem. The path names a top-level Users directory. This directory is expected to contain the slott subdirectory. This is true for each name in the path.
- / is a separator between directory names. The Windows OS uses \ to separate items on the path. Python running on Windows, however, can use /, because it gracefully converts the more common / into the Windows path separator character; in Python, we can generally ignore the Windows use of \.
- There is no way to tell what kind of filesystem object the name at the end of the path, code, represents. The name code might be a directory that contains the names of other files. It could be an ordinary data file, or a link to a stream-oriented device. The operating system retains additional directory information that shows what kind of filesystem object this is.
A path without the leading /
is relative to the current working directory. In macOS and Linux, the cd
command sets the current working directory. In Windows, the chdir
command does this job. The current working directory is a feature of the login session with the OS. It's made visible by the shell.
This recipe will show you how we can work with pathlib.Path
objects to get access to files in any OS directory structure.
It's important to separate two concepts:
The path provides two things: an optional sequence of directory names and a mandatory filename. An OS directory includes each file's name, information about when each file was created, who owns the files, what the permissions are for each file, how many bytes the files use, and other details. The contents of the files are independent of the directory information: multiple directory entries can be linked to the same content.
Often, the filename has a suffix (or extension) as a hint as to what the physical format is. A file ending in .csv
is likely a text file that can be interpreted as rows and columns of data. This binding between name and physical format is not absolute. File suffixes are only a hint and can be wrong.
In Python, the pathlib module handles all path-related processing. The module makes a distinction between two kinds of paths: a pure path is an abstract name that doesn't have to correspond to any actual file, while a concrete path refers to a resource in the filesystem. This distinction allows us to create pure paths for files that our application will likely create or refer to. We can also create concrete paths for those files that actually exist on the OS. An application can resolve a pure path to a concrete path.
The pathlib
module also makes a distinction between Linux path objects and Windows path objects. This distinction is rarely needed; most of the time, we don't want to care about the OS-level details of the path. An important reason for using pathlib
is because we want processing that is isolated from details of the underlying OS. The cases where we might want to work with a PureLinuxPath
object are rare.
All of the mini recipes in this section will leverage the following:
>>> from pathlib import Path
We rarely need any of the other definitions from pathlib.
We'll also presume the argparse
module is used to gather the file or directory names. For more information on argparse
, see the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs. We'll use the options
variable as a namespace that contains the input
filename or directory name that the recipe works with.
For demonstration purposes, a mock argument parsing is shown by providing the following Namespace
object:
>>> from argparse import Namespace
>>> options = Namespace(
... input='/path/to/some/file.csv',
... file1='data/ch10_file1.yaml',
... file2='data/ch10_file2.yaml',
... )
This options
object has three mock argument values. The input
value is an absolute path. The file1
and file2
values are relative paths.
We'll show a number of common pathname manipulations as separate mini recipes. These will include the following manipulations:
We'll start by creating an output filename based on an input filename. This reflects a common kind of application pattern where a source file in one physical format is transformed into a file in a distinct physical format.
Perform the following steps to make the output filename by changing the input suffix:

1. Create a Path object from the input filename string. In this example, the PosixPath class is displayed because the author is using macOS. On a Windows machine, the class would be WindowsPath. The Path class will properly parse the string to determine the elements of the path. Here's how we create a path from a string:

>>> input_path = Path(options.input)
>>> input_path
PosixPath('/path/to/some/file.csv')

2. Create the output Path object using the with_suffix() method:

>>> output_path = input_path.with_suffix('.out')
>>> output_path
PosixPath('/path/to/some/file.out')

All of the filename parsing is handled seamlessly by the Path class. The with_suffix() method saves us from manually parsing the text of the filename.
Perform the following steps to make a number of sibling output files with distinct names:

1. Create a Path object from the input filename string. In this example, the PosixPath class is displayed because the author uses Linux. On a Windows machine, the class would be WindowsPath. The Path class will properly parse the string to determine the elements of the path:

>>> input_path = Path(options.input)
>>> input_path
PosixPath('/path/to/some/file.csv')

2. Extract the parent directory and the stem from the input path:

>>> input_directory = input_path.parent
>>> input_stem = input_path.stem

3. Append _pass to the filename. An input file of file.csv will produce an output of file_pass.csv:

>>> output_stem_pass = f"{input_stem}_pass"
>>> output_stem_pass
'file_pass'

4. Assemble the complete output Path object:

>>> output_path = (
...     input_directory / output_stem_pass
... ).with_suffix('.csv')
>>> output_path
PosixPath('/path/to/some/file_pass.csv')
The /
operator assembles a new path from path
components. We need to put the /
operation in parentheses to be sure that it's performed first to create a new Path
object. The input_directory
variable has the parent Path
object, and output_stem_pass
is a simple string. After assembling a new path with the /
operator, the with_suffix()
method ensures a specific suffix is used.
The following steps are for creating a directory and a number of files in the newly created directory:

1. Create a Path object from the input filename string. In this example, the PosixPath class is displayed because the author uses Linux. On a Windows machine, the class would be WindowsPath. The Path class will properly parse the string to determine the elements of the path:

>>> input_path = Path(options.input)
>>> input_path
PosixPath('/path/to/some/file.csv')

2. Create a Path object for the output directory. In this case, we'll create an output directory as a subdirectory with the same parent directory as the source file:

>>> output_parent = input_path.parent / "output"
>>> output_parent
PosixPath('/path/to/some/output')

3. Create the output Path object. In this example, the output directory will contain a file that has the same name as the input with a different suffix:

>>> input_stem = input_path.stem
>>> output_path = (
...     output_parent / input_stem).with_suffix('.src')
We've used the /
operator to assemble a new Path
object from the parent Path
and a string based on the stem of a filename. Once a Path
object has been created, we can use the with_suffix()
method to set the desired suffix for the path.
The following are the steps to compare file dates to see which is newer:

1. Create the Path objects from the input filename strings. The Path class will properly parse the strings to determine the elements of the paths:

>>> file1_path = Path(options.file1)
>>> file2_path = Path(options.file2)

2. Use the stat() method of each Path object to get timestamps for the file. This method returns a stat object; within that stat object, the st_mtime attribute provides the most recent modification time for the file:

>>> file1_path.stat().st_mtime
1464460057.0
>>> file2_path.stat().st_mtime
1464527877.0
The values are timestamps measured in seconds. We can compare the two values to see which is newer.
If we want a timestamp that's sensible to people, we can use the datetime
module to create a proper datetime
object from this:
>>> import datetime
>>> mtime_1 = file1_path.stat().st_mtime
>>> datetime.datetime.fromtimestamp(mtime_1)
datetime.datetime(2016, 5, 28, 14, 27, 37)
We can use the strftime()
method to format the datetime
object or we can use the isoformat()
method to provide a standardized display. Note that the time will have the local time zone offset implicitly applied to the OS timestamp; depending on the OS configuration(s), a laptop may not show the same time as the server that created it because they're in different time zones.
The Linux term for removing a file is unlinking. Since a file may have many links, the actual data isn't removed until all links are removed. Here's how we can unlink files:

1. Create the Path object from the input filename string. The Path class will properly parse the string to determine the elements of the path:

>>> input_path = Path(options.input)
>>> input_path
PosixPath('/path/to/some/file.csv')

2. Use the unlink() method of this Path object to remove the directory entry. If this was the last directory entry for the data, then the space can be reclaimed by the OS:

>>> try:
...     input_path.unlink()
... except FileNotFoundError as ex:
...     print("File already deleted")
File already deleted
If the file does not exist, a FileNotFoundError
is raised. In some cases, this exception needs to be silenced with the pass
statement. In other cases, a warning message might be important. It's also possible that a missing file represents a serious error.
Additionally, we can rename a file using the rename()
method of a Path
object. We can create new soft links using the symlink_to()
method. To create OS-level hard links, we need to use the os.link()
function.
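Here's a sketch of rename() and symlink_to() working together in a throwaway directory (the filenames are made up; note that creating symlinks on Windows may require extra privileges):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "data.csv"
    original.write_text("numerator,denominator\n87,32\n")

    # rename() moves the directory entry to the new name.
    renamed = original.rename(Path(tmp) / "data_v1.csv")

    # symlink_to() creates a new soft link pointing at the renamed file.
    link = Path(tmp) / "latest.csv"
    link.symlink_to(renamed)

    assert not original.exists()
    assert link.read_text() == renamed.read_text()
```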
The following are the steps to find all the files that match a given pattern:

1. Create the Path object from the input directory name. The Path class will properly parse the string to determine the elements of the path:

>>> Path(options.file1)
PosixPath('data/ch10_file1.yaml')
>>> directory_path = Path(options.file1).parent
>>> directory_path
PosixPath('data')

2. Use the glob() method of the Path object to locate all files in that directory that match a given pattern. By default, this will not recursively walk the entire directory tree; use the rglob() method, or prefix the pattern with "**/", to walk the whole tree:

>>> list(directory_path.glob("*.csv"))
[PosixPath('data/wc1.csv'), PosixPath('data/ex2_r12.csv'),
PosixPath('data/wc.csv'), PosixPath('data/ch07_r13.csv'),
PosixPath('data/sample.csv'),
PosixPath('data/craps.csv'), PosixPath('data/output.csv'),
PosixPath('data/fuel.csv'), PosixPath('data/waypoints.csv'),
PosixPath('data/quotient.csv'),
PosixPath('data/summary_log.csv'), PosixPath('data/fuel2.csv')]
With this, we've seen a number of mini recipes for using pathlib.Path
objects for managing the file resources. This abstraction is helpful for simplifying access to the filesystem, as well as providing a uniform abstraction that works for Linux, macOS, and Windows.
Inside the OS, a path is a sequence of directories (a folder is a visual depiction of a directory). In a name such as /Users/slott/Documents/writing
, the root directory, /
, contains a directory named Users
. This contains a subdirectory, slott
, which contains Documents
, which contains writing
.
In some cases, a simple string representation can be used to summarize the navigation from root to directory, through to the final target directory. The string representation, however, makes many kinds of path operations into complex string parsing problems.
The Path class definition simplifies operations on paths. These operations on Path include converting a Path into a string (many OS functions and parts of Python prefer to use filename strings), and building a new Path object from an existing Path joined with a string using the / operator.
A concrete Path represents an actual filesystem resource. For concrete Paths, we can do a number of additional manipulations of the directory information.
Just about anything we might want to do with directory entries for files can be done with the pathlib
module. The few exceptions are part of the os
or os.path
module.
When we look at other file-related recipes in the rest of this chapter, we'll use Path
objects to name the files. The objective is to avoid trying to use strings to represent paths.
The pathlib module makes a small distinction between Linux pure Path objects and Windows pure Path objects. Most of the time, we don't care about the OS-level details of the path.

There are two cases where it can help to produce pure paths for a specific operating system:

- When developing on Windows for deployment on a Linux server, it can help to use PureLinuxPath in unit test cases. This allows us to write test cases on the Windows development machine that reflect actual intended use on a Linux server.
- Conversely, when developing for a Windows server on another OS, it can help to use PureWindowsPath.

The following snippet shows how to create Windows-specific Path objects:
>>> from pathlib import PureWindowsPath
>>> home_path = PureWindowsPath(r'C:\Users\slott')
>>> name_path = home_path / 'filename.ini'
>>> name_path
PureWindowsPath('C:/Users/slott/filename.ini')
>>> str(name_path)
'C:\\Users\\slott\\filename.ini'
Note that the Windows \ separator characters are displayed as / when showing the PureWindowsPath object. Using the str() function retrieves a path string appropriate for the Windows OS.
When we use the generic Path
class, we always get a subclass appropriate to the user's environment, which may or may not be Windows. By using PureWindowsPath
, we've bypassed the mapping to the user's actual OS.
We can use a Path to create a temporary file and then rename the temporary file to replace the original file, working entirely through the Path object.
object.We can leverage the power of pathlib
to support a variety of filename manipulations. In the Using pathlib to work with filenames recipe, earlier in this chapter, we looked at a few of the most common techniques for managing directories, filenames, and file suffixes.
One common file processing requirement is to create output files in a fail-safe manner. That is, the application should preserve any previous output file, no matter how or where the application fails.
Consider the following scenario:

- At time t0, there's a valid output.csv file from the previous run of the long_complex.py application.
- At time t1, we start running the long_complex.py application. It begins overwriting the output.csv file. Until the program finishes, the bytes are unusable.
- At time t2, the application crashes. The partial output.csv file is useless. Worse, the valid file from time t0 is not available either, since it was overwritten.

We need a way to preserve the previous state of the file, and only replace the file when the new content is complete and correct. In this recipe, we'll look at an approach to creating output files that's safe in the event of a failure.
For files that don't span across physical devices, fail-safe file output generally means creating a new copy of the file using a temporary name. If the new file can be created successfully, then the old file should be replaced using a single, atomic rename operation.
The goal is to create files in such a way that at any time prior to the final rename, a crash will leave the original file in place. Subsequent to the final rename, the new file should be in place.
We can add capabilities to preserve the old file as well. This provides a recovery strategy. In case of a catastrophic problem, the old file can be renamed manually to make it available as the original file.
There are several ways to approach this. The fileinput module has an inplace=True option that permits reading a file while redirecting standard output to write a replacement of the input file. We'll show a more general approach that works for any file. This uses three separate files and does two renames:

- The final output file, for example, output.csv.
- A temporary working version of the file: output.csv.new. There are a variety of conventions for naming this file. Sometimes, extra characters such as ~ or # are placed on the filename to indicate that it's a temporary, working file; for example, output.csv~. Sometimes, it will be in the /tmp filesystem.
- A previous version of the file, named name.out.old. Any previous .old file will be removed as part of finalizing the output. Sometimes, the previous version is .bak, meaning "backup."
, meaning "backup."To create a concrete example, we'll work with a file that has a very small but precious piece of data: a Quotient
. Here's the definition for this Quotient
object:
from dataclasses import dataclass, asdict, fields

@dataclass
class Quotient:
    numerator: int
    denominator: int
The following function will write this object to a file in CSV notation:
import csv
from pathlib import Path
from typing import Iterable

def save_data(
        output_path: Path, data: Iterable[Quotient]) -> None:
    with output_path.open("w", newline="") as output_file:
        headers = [f.name for f in fields(Quotient)]
        writer = csv.DictWriter(output_file, headers)
        writer.writeheader()
        for q in data:
            writer.writerow(asdict(q))
We've opened a file with a context manager to be guaranteed the file will be closed. The headers
variable is the list of attribute names in the Quotient
dataclass. We can use these headers to create a CSV writer, and then emit all the given instances of the Quotient
dataclass.
Some typical contents of the file are shown here:
numerator,denominator
87,32
Yes. This is a silly little file. We can imagine that it might be an important part of the security configuration for a web server, and changes must be managed carefully by the administrators.
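A quick usage sketch, repeating the definitions so the snippet stands alone (the temporary directory is illustrative):

```python
import csv
import tempfile
from dataclasses import dataclass, asdict, fields
from pathlib import Path
from typing import Iterable

@dataclass
class Quotient:
    numerator: int
    denominator: int

def save_data(output_path: Path, data: Iterable[Quotient]) -> None:
    with output_path.open("w", newline="") as output_file:
        headers = [f.name for f in fields(Quotient)]
        writer = csv.DictWriter(output_file, headers)
        writer.writeheader()
        for q in data:
            writer.writerow(asdict(q))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "quotient.csv"
    save_data(path, [Quotient(numerator=87, denominator=32)])
    content = path.read_text()

print(content)
# numerator,denominator
# 87,32
```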
In the unlikely event of a problem when writing the data
object to the file, we could be left with a corrupted, unusable output file. We'll wrap this function with another to provide a reliable write.
We start creating our wrapper by importing the classes we need:

1. Import the Path class from the pathlib module:

from pathlib import Path

2. Define a function that wraps the save_data() function with some extra features. The function signature is the same as it is for the save_data() function:

def safe_write(
        output_path: Path, data: Iterable[Quotient]
) -> None:

3. Save the data to a temporary file whose name has .new at the end. If it is written properly, with no exceptions, then we can rename it so that it's the target file:

    ext = output_path.suffix
    output_new_path = output_path.with_suffix(f'{ext}.new')
    save_data(output_new_path, data)

4. Remove any previous .old file, if one exists. If there's no .old file, we can use the missing_ok option to ignore the FileNotFoundError exception:

    output_old_path = output_path.with_suffix(f'{ext}.old')
    output_old_path.unlink(missing_ok=True)

5. Rename the current file so that it ends in .old to save it in case of problems:

    try:
        output_path.rename(output_old_path)
    except FileNotFoundError as ex:
        # No previous file. That's okay.
        pass

6. Finally, rename the .new file to make it the official output:

    try:
        output_new_path.rename(output_path)
    except IOError as ex:
        # Possible recovery...
        output_old_path.rename(output_path)
This multi-step process uses two rename operations:

- Rename the current file so that it has .old appended to the suffix.
- Rename the new file, which has .new appended to the suffix, to the current version of the file.
appended to the suffix, to the current version of the file.A Path
object has a replace()
method. This always overwrites the target file, with no warning if there's a problem. The choice depends on how our application needs to handle cases where old versions of files may be left in the filesystem. We've used rename()
in this recipe to try and avoid overwriting files in the case of multiple problems. A variation could use replace()
to always replace a file.
This process involves the following three separate OS operations: an unlink and two renames. This leads to a situation in which the .old
file is preserved and can be used to recover the previously good state.
Here's a timeline that shows the state of the various files. We've labeled the contents as version 0 (the oldest), version 1 (the previous contents), and version 2 (the revised contents):

Time | Operation              | .csv.old  | .csv      | .csv.new
t0   |                        | version 0 | version 1 |
t1   | Mid-creation           | version 0 | version 1 | Will appear corrupt if used
t2   | Post-creation, closed  | version 0 | version 1 | version 2
t3   | After unlinking .old   |           | version 1 | version 2
t4   | After renaming .csv    | version 1 |           | version 2
t5   | After renaming .new    | version 1 | version 2 |

Timeline of file operations
While there are several opportunities for failure, there's no ambiguity about which file is valid:

- If there's a .csv file, it's the current, valid file.
- If there's no .csv file, then the .csv.old file is a backup copy, which can be used for recovery.

Since none of these operations involve actually copying the files, the operations are all extremely fast and reliable.
In some enterprise applications, output files are organized into directories with names based on timestamps. This can be handled gracefully by the pathlib
module. We might, for example, have an archive directory for old files:
archive_path = Path("/path/to/archive")
We may want to create date-stamped subdirectories for keeping temporary or working files:
import datetime
today = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
We can then do the following to define a working directory:
working_path = archive_path / today
working_path.mkdir(parents=True, exist_ok=True)

The mkdir() method will create the expected directory. The parents=True argument ensures that all parent directories will also be created. This can be handy the very first time an application is executed. exist_ok=True is handy so that if a directory already exists, it can be reused without raising an exception.
parents=True is not the default. With the default of parents=False, when a parent directory doesn't exist, the method will raise a FileNotFoundError exception because the required parent directory doesn't exist.
Similarly, exist_ok=True is not the default. By default, if the directory exists, a FileExistsError exception is raised. Including options to make the operation silent when the directory already exists can be helpful.
Also, it's sometimes appropriate to use the tempfile
module to create temporary files. This module can create filenames that are guaranteed to be unique. This allows a complex server process to create temporary files without regard to filename conflicts.
This recipe works entirely through the Path class. The with statement ensures file operations complete properly, and that all of the OS resources are released.

One commonly used data format is comma-separated values (CSV). We can generalize this to think of the comma character as simply one of many candidate separator characters. For example, a CSV file can use the | character as the separator between columns of data. This generalization to separators other than the comma makes CSV files particularly powerful.
character as the separator between columns of data. This generalization for separators other than the comma makes CSV files particularly powerful.
How can we process data in one of the wide varieties of CSV formats?
A summary of a file's content is called a schema. It's essential to distinguish between two aspects of the schema:

The physical format of the file: For CSV, this means the file's bytes encode text. The text is organized into rows and columns using a row separator character (or characters) and a column separator character. Many spreadsheet products will use ',' (comma) as the column separator and the '\r\n' sequence of characters as the row separator. The specific combination of punctuation characters in use is called the CSV dialect.

The logical layout of the data in the file: This is the sequence of data columns that are present. There are several common cases for handling the logical layout in CSV files.
We'll look at a CSV file that has some real-time data recorded from the log of a sailboat. This is the waypoints.csv
file. The data looks as follows:
lat,lon,date,time
32.8321666666667,-79.9338333333333,2012-11-27,09:15:00
31.6714833333333,-80.93325,2012-11-28,00:00:00
30.7171666666667,-81.5525,2012-11-28,11:35:00
This data contains four columns named in the first line of the file: lat
, lon
, date
, and time
. These describe a waypoint and need to be reformatted to create more useful information.
Follow these steps to read the CSV data:

1. Import the csv module and the Path class:

import csv
from pathlib import Path

2. Examine the data to confirm the physical format. The column separator is ',', which is the default. The row separator is '\r\n', also widely used in both Windows and Linux. Python's universal newlines feature means that the Linux standard '\n' will work just as well as a row separator.

3. Define a raw() function to read raw data from a Path that refers to the file:

def raw(data_path: Path) -> None:

4. Use the Path object to open the file in a with statement:

    with data_path.open() as data_file:

5. Create a csv.DictReader around the open file, inside the with statement:

        data_reader = csv.DictReader(data_file)

6. Consume the rows, still inside the with statement. For this example, we'll only print the rows:

        for row in data_reader:
            print(row)

Here's the function that we created:

def raw(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        for row in data_reader:
            print(row)
The output from the raw()
function is a series of dictionaries that looks as follows:
{'date': '2012-11-27',
'lat': '32.8321666666667',
'lon': '-79.9338333333333',
'time': '09:15:00'}
We can now process the data by referring to the columns as dictionary items, using syntax like, for example, row['date']
. Using the column names is more descriptive than referring to the column by position; for example, row[0]
is hard to understand.
To be sure that we're using the column names correctly, the typing.TypedDict
type hint can be used to provide the expected column names.
The csv module handles the physical format work of separating the rows from each other, and also separating the columns within each row. The default rules ensure that each input line is treated as a separate row and that the columns are separated by ','.
What happens when we need to use the column separator character as part of data? We might have data like this:

lat,lon,date,time,notes
32.832,-79.934,2012-11-27,09:15:00,"breezy, rainy"
31.671,-80.933,2012-11-28,00:00:00,"blowing ""like stink"""
The notes column has data in the first row, which includes the ',' column separator character. The rules for CSV allow a column's value to be surrounded by quotes. By default, the quoting character is '"'. Within these quoting characters, the column and row separator characters are ignored.
In order to embed the quote character within a quoted string, it is doubled. The second example row shows how the value blowing "like stink"
is encoded by doubling the quote characters when they are part of the value of a column. These quoting rules mean that a CSV file can represent any combination of characters, including the row and column separator characters.
The values in a CSV file are always strings. A string value like 7331
may look like a number to us, but it's always text when processed by the csv
module. This makes the processing simple and uniform, but it can be awkward for our Python application programs.
When data is saved from a manually prepared spreadsheet, the data may reveal the quirks of the desktop software's internal rules for data display. It's surprisingly common, for example, to have a column of data that is displayed as a date on the desktop software but shows up as a floating-point number in the CSV file.
There are two solutions to the date-as-number problem. One is to add a column in the source spreadsheet to properly format the date as a string. Ideally, this is done using ISO rules so that the date is represented in YYYY-MM-DD format. The other solution is to recognize the spreadsheet date as a number of days (with a fractional part for the time of day) past some epochal date. The epochal dates vary slightly, but they're generally either Jan 1, 1900 or Jan 1, 1904.
As we saw in the Combining map and reduce transformations recipe in Chapter 9, Functional Programming Features, there's often a pipeline of processing that includes cleaning and transforming the source data. In this specific example, there are no extra rows that need to be eliminated. However, each column needs to be converted into something more useful.
To transform the data into a more useful form, we'll use a two-part design. First, we'll define a row-level cleansing function. In this case, we'll create a dictionary object by adding additional values that are derived from the input data. A clean_row()
function can look like this:
import datetime
from typing import Any, Dict

Raw = Dict[str, Any]
Waypoint = Dict[str, Any]

def clean_row(source_row: Raw) -> Waypoint:
    ts_date = datetime.datetime.strptime(
        source_row["date"], "%Y-%m-%d"
    ).date()
    ts_time = datetime.datetime.strptime(
        source_row["time"], "%H:%M:%S"
    ).time()
    return dict(
        date=source_row["date"],
        time=source_row["time"],
        lat=source_row["lat"],
        lon=source_row["lon"],
        lat_lon=(
            float(source_row["lat"]),
            float(source_row["lon"])
        ),
        ts_date=ts_date,
        ts_time=ts_time,
        timestamp=datetime.datetime.combine(
            ts_date, ts_time
        )
    )
Here, we've created some new column values. The column named lat_lon
has a two-tuple with proper floating-point values instead of strings. We've also parsed the date and time values to create datetime.date
and datetime.time
objects. We've combined the date and time into a single, useful value, which is the value of the timestamp
column.
Once we have a row-level function for cleaning and enriching our data, we can map this function to each row in the source of data. We can use map(clean_row, reader)
or we can write a function that embodies this processing loop:
from typing import Iterator, cast

def cleanse(reader: csv.DictReader) -> Iterator[Waypoint]:
    for row in reader:
        yield clean_row(cast(Raw, row))
This can be used to provide more useful data from each row:
from pprint import pprint

def display_clean(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        clean_data_reader = cleanse(data_reader)
        for row in clean_data_reader:
            pprint(row)
We've used the cleanse()
function to create a very small stack of transformation rules. The stack starts with data_reader
, and only has one other item in it. This is a good beginning. As the application software is expanded to do more computations, the stack will expand.
These cleansed and enriched rows look as follows:
{'date': '2012-11-27',
 'lat': '32.8321666666667',
 'lat_lon': (32.8321666666667, -79.9338333333333),
 'lon': '-79.9338333333333',
 'time': '09:15:00',
 'timestamp': datetime.datetime(2012, 11, 27, 9, 15),
 'ts_date': datetime.date(2012, 11, 27),
 'ts_time': datetime.time(9, 15)}
We've added columns such as lat_lon
, which have proper numeric values instead of strings. We've also added timestamp
, which has a full date-time value that can be used for simple computations of elapsed time between waypoints.
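For example, the elapsed time between the first two waypoints in the sample data can be computed by simple subtraction of the timestamp values; the variable names here are illustrative:

```python
import datetime

# Timestamps of the first two waypoints from the sample data.
start = datetime.datetime(2012, 11, 27, 9, 15)
end = datetime.datetime(2012, 11, 28, 0, 0)

# Subtraction yields a datetime.timedelta object.
elapsed = end - start
hours = elapsed.total_seconds() / 3600
```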
We can leverage the typing.TypedDict
type hint to make a stronger statement about the structure of the dictionary data that will be processed. The initial data has known column names, each of which has string data values. We can define the raw data as follows:
from typing import TypedDict

class Raw_TD(TypedDict):
    date: str
    time: str
    lat: str
    lon: str
The cleaned data has a more complex structure. We can define the output from a clean_row_td()
function as follows:
class Waypoint_TD(Raw_TD):
    lat_lon: Tuple[float, float]
    ts_date: datetime.date
    ts_time: datetime.time
    timestamp: datetime.datetime
The Waypoint_TD TypedDict
definition extends Raw_TD TypedDict
to make the outputs from the cleanse()
function explicit. This lets us use the mypy
tool to confirm that the cleanse()
function – and any other processing – adheres to the expected keys and value types in the dictionary.
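Note that a TypedDict imposes no runtime behavior: an instance is an ordinary dict, and the key and value types are enforced only by static tools like mypy. A small sketch showing this:

```python
from typing import TypedDict

class Raw_TD(TypedDict):
    date: str
    time: str
    lat: str
    lon: str

# The annotation documents the structure for mypy; at runtime
# this is a plain dictionary with no extra checking.
row: Raw_TD = {
    "date": "2012-11-27", "time": "09:15:00",
    "lat": "32.8321666666667", "lon": "-79.9338333333333",
}
```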
For more information on the with statement, see the Reading and writing files with context managers recipe in Chapter 7, Basics of Classes and Objects.
One commonly used data format is known as CSV. Python's csv
module has a very handy DictReader
class definition. When a file contains a one-row header, the header row's values become keys that are used for all the subsequent rows. This allows a great deal of flexibility in the logical layout of the data. For example, the column ordering doesn't matter, since each column's data is identified by a name taken from the header row.
This leads to dictionary-based references to a column's data. We're forced to write, for example, row['lat']
or row['date']
to refer to data in specific columns. While this isn't horrible, it would be much nicer to use syntax like row.lat
or row.date
to refer to column values.
Additionally, we often have derived values that should – perhaps – be properties of a class definition instead of a separate function. This can properly encapsulate important attributes and operations into a single class definition.
The dictionary data structure has awkward-looking syntax for the column references. If we use dataclasses for each row, we can change references from row['name']
to row.name
.
We'll look at a CSV file that has some real-time data recorded from the log of a sailboat. This file is the waypoints.csv
file. The data looks as follows:
lat,lon,date,time
32.8321666666667,-79.9338333333333,2012-11-27,09:15:00
31.6714833333333,-80.93325,2012-11-28,00:00:00
30.7171666666667,-81.5525,2012-11-28,11:35:00
The first line contains a header that names the four columns, lat
, lon
, date
, and time
. The data can be read by a csv.DictReader
object. However, we'd like to do more sophisticated work, so we'll create a @dataclass
class definition that can encapsulate the data and the processing we need to do.
We need to start with a dataclass that reflects the available data, and then we can use this dataclass with a dictionary reader:
from dataclasses import dataclass, field
import datetime
from typing import Tuple, Iterator
We'll start with a dataclass narrowly focused on the input, precisely as it appears in the source file. We've called the class RawRow. In a complex application, a more descriptive name than RawRow would be appropriate. This definition of the attributes may change as the source file organization changes:
@dataclass
class RawRow:
    date: str
    time: str
    lat: str
    lon: str
Next, we'll define a dataclass for the derived data. This includes the raw input data as an attribute, named raw, in this example. Fields computed from this source data are all initialized with field(init=False) because they'll be computed after initialization:
@dataclass
class Waypoint:
    raw: RawRow
    lat_lon: Tuple[float, float] = field(init=False)
    ts_date: datetime.date = field(init=False)
    ts_time: datetime.time = field(init=False)
    timestamp: datetime.datetime = field(init=False)
We'll define a __post_init__() method to initialize all of the derived fields:
    def __post_init__(self):
        self.ts_date = datetime.datetime.strptime(
            self.raw.date, "%Y-%m-%d"
        ).date()
        self.ts_time = datetime.datetime.strptime(
            self.raw.time, "%H:%M:%S"
        ).time()
        self.lat_lon = (
            float(self.raw.lat),
            float(self.raw.lon)
        )
        self.timestamp = datetime.datetime.combine(
            self.ts_date, self.ts_time
        )
Finally, we'll define a function to iterate through the rows of a csv.DictReader object and create the needed Waypoint objects. The intermediate representation, RawRow, is a convenience so that we can assign attribute names to the source data columns:
def waypoint_iter(reader: csv.DictReader) -> Iterator[Waypoint]:
    for row in reader:
        raw = RawRow(**row)
        yield Waypoint(raw)
The waypoint_iter()
function creates RawRow
objects from the input dictionary, then creates the final Waypoint
objects from the RawRow
instances. This two-step processing is helpful for managing changes to the source or the processing.
We can use the following function to read and display the CSV data:
def display(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        for waypoint in waypoint_iter(data_reader):
            pprint(waypoint)
This function uses the waypoint_iter()
function to create Waypoint
objects from the dictionaries read by the csv.DictReader
object. Each Waypoint
object contains a reference to the original raw data:
Waypoint(
    raw=RawRow(
        date='2012-11-27',
        time='09:15:00',
        lat='32.8321666666667',
        lon='-79.9338333333333'),
    lat_lon=(32.8321666666667, -79.9338333333333),
    ts_date=datetime.date(2012, 11, 27),
    ts_time=datetime.time(9, 15),
    timestamp=datetime.datetime(2012, 11, 27, 9, 15)
)
Having the original input object can sometimes be helpful when diagnosing problems with the source data or the processing.
The source dataclass, the RawRow
class in this example, is designed to match the input document. The dataclass definition has attribute names that are exact matches for the source column names, and the attribute types are all strings to match the CSV input types.
Because the names match, the RawRow(**row)
expression will work to create an instance of the RawRow
class from the DictReader
dictionary.
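The ** expansion relies only on matching names, not on ordering. A minimal sketch, using the same RawRow definition and a hand-built dictionary in place of a DictReader row:

```python
from dataclasses import dataclass

@dataclass
class RawRow:
    date: str
    time: str
    lat: str
    lon: str

# The dictionary's keys match the attribute names exactly,
# so keyword expansion builds the dataclass instance;
# the order of the keys doesn't matter.
row = {"lat": "32.83", "lon": "-79.93", "date": "2012-11-27", "time": "09:15:00"}
raw = RawRow(**row)
```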
From this initial, or raw, data, we can derive the more useful data, as shown in the Waypoint
class definition. The __post_init__()
method transforms the initial value in the self.raw
attribute into a number of more useful attribute values.
We've separated the Waypoint object's creation from reading raw data. This lets us manage two common kinds of change to application software independently: changes to the layout of the source data, and changes to the processing applied to that data.
It's helpful to disentangle the various aspects of a program so that we can let them evolve independently. Gathering, cleaning, and filtering source data is one aspect of this. The resulting computations are a separate aspect, unrelated to the format of the source data.
In many cases, the source CSV file will have headers that do not map directly to valid Python attribute names. In these cases, the keys present in the source dictionary must be mapped to the column names. This can be managed by expanding the RawRow
class definition to include a __post_init__()
method. This will help build the RawRow
dataclass object from a dictionary built from CSV row headers that aren't useful Python key names.
The following example defines a class called RawRow_HeaderV2
. This reflects a variant spreadsheet with different row names. We've defined the attributes with field(init=False)
and provided a __post_init__()
method, as shown in this code block:
@dataclass
class RawRow_HeaderV2:
    source: Dict[str, str]
    date: str = field(init=False)
    time: str = field(init=False)
    lat: str = field(init=False)
    lon: str = field(init=False)

    def __post_init__(self):
        self.date = self.source['Date of Travel (YYYY-MM-DD)']
        self.time = self.source['Arrival Time (HH:MM:SS)']
        self.lat = self.source['Latitude (degrees N)']
        self.lon = self.source['Logitude (degrees W)']
This RawRow_HeaderV2
class creates objects that are compatible with the RawRow
class. Either of these classes of objects can also be transformed into Waypoint
instances.
For an application that works with a variety of data sources, these kinds of "raw data transformation" dataclasses can be handy for mapping the minor variations in a logical layout to a consistent internal structure for further processing.
As the number of input transformation classes grows, additional type hints are required. For example, the following type hint provides a common name for the variations in input format:
Raw = Union[RawRow, RawRow_HeaderV2]
This type hint helps to unify the original RawRow
and the alternative RawRow_HeaderV2
as alternative type definitions with compatible features. This can also be done with a Protocol
type hint that spells out the common attributes.
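A sketch of such a Protocol definition might look like the following; the name RawLike and the describe() function are illustrative assumptions, not part of the recipe:

```python
from dataclasses import dataclass
from typing import Protocol

class RawLike(Protocol):
    # The attributes every "raw data" class must provide.
    date: str
    time: str
    lat: str
    lon: str

def describe(raw: RawLike) -> str:
    # Works with RawRow, RawRow_HeaderV2, or any other conforming class.
    return f"{raw.date} {raw.time} @ ({raw.lat}, {raw.lon})"

# Any class with the right attributes conforms to the protocol.
@dataclass
class RawRow:
    date: str
    time: str
    lat: str
    lon: str

text = describe(RawRow("2012-11-27", "09:15:00", "32.83", "-79.93"))
```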
Many file formats lack the elegant regularity of a CSV file. One common file format that's rather difficult to parse is a web server log file. These files tend to have complex data without a single, uniform separator character or consistent quoting rules.
When we looked at a simplified log file in the Writing generator functions with the yield statement recipe in the online chapter, Chapter 9, Functional Programming Features (link provided in the Preface), we saw that the rows look as follows:
[2016-05-08 11:08:18,651] INFO in ch09_r09: Sample Message One
[2016-05-08 11:08:18,651] DEBUG in ch09_r09: Debugging
[2016-05-08 11:08:18,652] WARNING in ch09_r09: Something might have gone wrong
There are a variety of punctuation marks being used in this file. The csv
module can't handle this complexity.
We'd like to write programs with the elegant simplicity of CSV processing. This means we'll need to encapsulate the complexities of log file parsing and keep this aspect separate from analysis and summary processing.
Parsing a file with a complex structure generally involves writing a function that behaves somewhat like the reader()
function in the csv
module. In some cases, it can be easier to create a small class that behaves like the DictReader
class.
The core feature of reading a complex file is a function that will transform one line of text into a dictionary or tuple of individual field values. This job can often be done by the re
package.
Before we can start, we'll need to develop (and debug) the regular expression that properly parses each line of the input file. For more information on this, see the String parsing with regular expressions recipe in Chapter 1, Numbers, Strings, and Tuples.
For this example, we'll use the following code. We'll define a pattern string with a series of regular expressions for the various elements of the line:
import re
pattern_text = (
    r"\[ (?P<date>.*?) \]\s+"
    r"   (?P<level>\w+) \s+"
    r"in\s+(?P<module>\S+?)"
    r":\s+ (?P<message>.+)"
)
pattern = re.compile(pattern_text, re.X)
We've used the re.X
option so that we can include extra whitespace in the regular expression. This can help to make it more readable by separating prefix and suffix characters.
There are four fields that are captured and a number of characters that are part of the template, but they never vary. Here's how this regular expression works:
The date is enclosed in [ and ]. We've had to use \[ and \] to escape the normal meaning of [ and ] in a regular expression. This is saved as date in the resulting group dictionary.
The severity level is a run of word characters, matched by \w. This is saved as level in the dictionary of groups created by parsing text with this regular expression.
The module name is preceded by the characters in; these are not captured. After this preface, the name is a sequence of non-whitespace characters, matched by \S. The module is saved as a group named module.
Next, there's a ':' character we can ignore, then more spaces we can also ignore. Finally, the message itself starts and extends to the end of the line.
When we write a regular expression, we can wrap the interesting sub-strings to capture in ()
. After performing a match()
or search()
operation, the resulting Match
object will have values for the matched substrings. The groups()
method of a Match
object and the groupdict()
method of a Match
object will provide the captured strings.
Note that we've used the \s+ sequence to quietly skip one or more space-like characters. The sample data appears to always use a single space as the separator. However, when absorbing whitespace, using \s+ seems to be a slightly more general approach because it permits extra spaces.
Here's how this pattern works:
>>> sample_data = '[2016-05-08 11:08:18,651] INFO in ch10_r09: Sample Message One'
>>> match = pattern.match(sample_data)
>>> match.groups()
('2016-05-08 11:08:18,651', 'INFO', 'ch10_r09', 'Sample Message One')
>>> match.groupdict()
{'date': '2016-05-08 11:08:18,651',
'level': 'INFO',
'module': 'ch10_r09',
'message': 'Sample Message One'}
We've provided a line of sample data in the sample_data variable. The match object has a groups() method that returns each of the interesting fields. The value of the groupdict() method of a match object is a dictionary with the name provided in the ?P<name> preface to the regular expression in parentheses, ().
This recipe is split into two parts. The first part defines a log_parser()
function to parse a single line, while the second part uses the log_parser()
function for each line of input.
Perform the following steps to define the log_parser()
function:
Define the pattern, using the (?P<name>...) regular expression construct to create a dictionary key for each piece of data that's captured. The resulting dictionary will then contain useful, meaningful keys:
import re
pattern_text = (
    r"\[ (?P<date>.*?) \]\s+"
    r"   (?P<level>\w+) \s+"
    r"in\s+(?P<module>.+?)"
    r":\s+ (?P<message>.+)"
)
pattern = re.compile(pattern_text, re.X)
A class based on NamedTuple must define the fields that are extracted by the parser. The field names should match the regular expression capture names in the ?P<name> prefix:
from typing import NamedTuple

class LogLine(NamedTuple):
    date: str
    level: str
    module: str
    message: str
Start the definition of the log_parser() function:
def log_parser(source_line: str) -> LogLine:
Apply the compiled pattern to try to create a match object. We've assigned it to the match variable and also checked to see if it is not None:
    if match := pattern.match(source_line):
When the match object is not None, return a useful data structure with the various pieces of data from this input line. The cast(Match, match) expression is necessary to help mypy; it states that the match object will not be None, but will always be a valid instance of the Match class. It's likely that a future release of mypy will not need this:
        data = cast(Match, match).groupdict()
        return LogLine(**data)
When the match object is None, either log the problem or raise an exception to stop processing because there's a problem:
    raise ValueError(f"Unexpected input {source_line=}")
Here's the log_parser()
function, all gathered together:
from re import Match
from typing import cast

def log_parser(source_line: str) -> LogLine:
    if match := pattern.match(source_line):
        data = cast(Match, match).groupdict()
        return LogLine(**data)
    raise ValueError(f"Unexpected input {source_line=}")
The log_parser()
function can be used to parse each line of input. The text is transformed into a NamedTuple
instance with field names and values based on the fields found by the regular expression parser. These field names must match the field names in the NamedTuple
class definition.
This portion of the recipe will apply the log_parser()
function to each line of the input file:
From pathlib, import the Path class definition:
from pathlib import Path
Create a Path object that identifies the file:
data_path = Path("data") / "sample.log"
Use the Path object to open the file in a with statement:
with data_path.open() as data_file:
Read from the open file object, data_file. In this case, we'll use the built-in map() function to apply the log_parser() function to each line from the source file:
    data_reader = map(log_parser, data_file)
    for row in data_reader:
        pprint(row)
The output is a series of LogLine
tuples that looks as follows:
LogLine(date='2016-06-15 17:57:54,715', level='INFO', module='ch10_r10', message='Sample Message One')
LogLine(date='2016-06-15 17:57:54,715', level='DEBUG', module='ch10_r10', message='Debugging')
LogLine(date='2016-06-15 17:57:54,715', level='WARNING', module='ch10_r10', message='Something might have gone wrong')
We can do more meaningful processing on these tuple instances than we can on a line of raw text. These allow us to filter the data by severity level, or create a Counter
based on the module providing the message.
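For example, a Counter keyed by severity level is a one-liner over the parsed tuples. The sample rows here are hard-coded for illustration:

```python
from collections import Counter
from typing import NamedTuple

class LogLine(NamedTuple):
    date: str
    level: str
    module: str
    message: str

rows = [
    LogLine("2016-05-08 11:08:18,651", "INFO", "ch09_r09", "Sample Message One"),
    LogLine("2016-05-08 11:08:18,651", "DEBUG", "ch09_r09", "Debugging"),
    LogLine("2016-05-08 11:08:18,652", "WARNING", "ch09_r09", "Something might have gone wrong"),
]

# Count messages by severity level; the same idea works for row.module.
by_level = Counter(row.level for row in rows)
```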
This log file is in First Normal Form (1NF): the data is organized into lines that represent independent entities or events. Each row has a consistent number of attributes or columns, and each column has data that is atomic or can't be meaningfully decomposed further. Unlike CSV files, however, this particular format requires a complex regular expression to parse.
In our log file example, the timestamp contains a number of individual elements – year, month, day, hour, minute, second, and millisecond – but there's little value in further decomposing the timestamp. It's more helpful to use it as a single datetime
object and derive details (like hour of the day) from this object, rather than assembling individual fields into a new piece of composite data.
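The date field of a parsed log line can be converted into such a datetime object with strptime(); the comma before the milliseconds is simply part of the format string:

```python
import datetime

# The %f directive absorbs the fractional seconds after the comma.
ts = datetime.datetime.strptime(
    "2016-05-08 11:08:18,651", "%Y-%m-%d %H:%M:%S,%f"
)

# Details such as the hour of the day are derived attributes.
hour = ts.hour
```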
In a complex log processing application, there may be several varieties of message fields. It may be necessary to parse these message types using separate patterns. When we need to do this, it reveals that the various lines in the log aren't consistent in terms of the format and number of attributes, breaking one of the 1NF assumptions.
We've generally followed the design pattern from the Reading delimited files with the CSV module recipe, so that reading a complex log is nearly identical to reading a simple CSV file. Indeed, we can see that the primary difference lies in one line of code:
data_reader = csv.DictReader(data_file)
As compared to the following:
data_reader = map(log_parser, data_file)
This parallel construct allows us to reuse analysis functions across many input file formats. This allows us to create a library of tools that can be used on a number of data sources.
One of the most common operations when reading very complex files is to rewrite them into an easier-to-process format. We'll often want to save the data in CSV format for later processing.
Some of this is similar to the Using multiple contexts for reading and writing files recipe in Chapter 7, Basics of Classes and Objects, which also shows multiple open contexts. We'll read from one file and write to another file.
The file writing process looks as follows:
import csv

def copy(data_path: Path) -> None:
    target_path = data_path.with_suffix(".csv")
    with target_path.open("w", newline="") as target_file:
        writer = csv.DictWriter(target_file, LogLine._fields)
        writer.writeheader()
        with data_path.open() as data_file:
            reader = map(log_parser, data_file)
            writer.writerows(row._asdict() for row in reader)
The first portion of this script defines a CSV writer for the target file. The path for the output file, target_path
, is based on the input name, data_path
. The suffix changed from the original filename's suffix to .csv
.
The target file is opened with newline processing turned off by the newline='' option. This allows the csv.DictWriter class to insert newline characters appropriate for the desired CSV dialect.
A DictWriter object is created to write to the given file. The sequence of column headings is provided by the LogLine class definition. This makes sure the output CSV file will contain column names matching the fields of the LogLine subclass of the typing.NamedTuple class.
The writeheader()
method writes the column names as the first line of output. This makes reading the file slightly easier because the column names are provided. The first row of a CSV file can be a kind of explicit schema definition that shows what data is present.
The source file is opened, as shown in the preceding recipe. Because of the way the csv
module writers work, we can provide the reader
generator expression to the writerows()
method of the writer. The writerows()
method will consume all of the data produced by the reader
generator. This will, in turn, consume all the rows produced by the open file.
We don't need to write any explicit for
statements to ensure that all of the input rows are processed. The writerows()
function makes this a guarantee.
The output file looks as follows:
date,level,module,message
"2016-05-08 11:08:18,651",INFO,ch10_r10,Sample Message One
"2016-05-08 11:08:18,651",DEBUG,ch10_r10,Debugging
"2016-05-08 11:08:18,652",WARNING,ch10_r10,Something might have gone wrong
The file has been transformed from the rather complex input format into a simpler CSV format, suitable for further analysis and processing.
For more information on the with statement, see the Reading and writing files with context managers recipe in Chapter 7, Basics of Classes and Objects.
JavaScript Object Notation (JSON) is a popular syntax for serializing data. For details, see http://json.org. Python includes the json
module in order to serialize and deserialize data in this notation.
JSON documents are used widely by web applications. It's common to exchange data between RESTful web clients and servers using documents in JSON notation. These two tiers of the application stack communicate via JSON documents sent via the HTTP protocol.
The YAML format is a more sophisticated and flexible extension to JSON notation. For details, see https://yaml.org. Any JSON document is also a valid YAML document. The reverse is not true: YAML syntax is more complex and includes constructs that are not valid JSON.
To use YAML, an additional module has to be installed. The PyYAML project offers a yaml
module that is popular and works well. See https://pypi.org/project/PyYAML/.
In this recipe, we'll use the json
or yaml
module to parse JSON format data in Python.
We've gathered some sailboat racing results in race_result.json
. This file contains information on teams, legs of the race, and the order in which the various teams finished each individual leg of the race. JSON handles this complex data elegantly.
An overall score can be computed by summing the finish position in each leg: the lowest score is the overall winner. In some cases, there are null values when a boat did not start, did not finish, or was disqualified from the race.
When computing the team's overall score, the null values are assigned a score of one more than the number of boats in the competition. If there are seven boats, then the team is given eight points for their failure to finish, a hefty penalty.
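That scoring rule can be sketched as a small function; the names overall_score and boat_count are illustrative, not from the recipe:

```python
from typing import List, Optional

def overall_score(positions: List[Optional[int]], boat_count: int) -> int:
    # A null (None) finish is penalized as boat_count + 1 points.
    return sum(
        boat_count + 1 if p is None else p
        for p in positions
    )

# Abu Dhabi Ocean Racing's positions from the sample data.
score = overall_score([1, 3, 2, 2, 1, 2, 5, 3, 5], boat_count=7)

# A hypothetical record with one leg not finished: 1 + 3 + 8 + 2.
score_with_dnf = overall_score([1, 3, None, 2], boat_count=7)
```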
The data has the following schema. There are two fields within the overall document:
legs: An array of strings that shows the starting port and ending port of each leg.
teams: An array of objects with details about each team. Within each teams object, there are two fields of data: the team's name, and position, an array of finish positions for each leg of the race.
The data looks as follows:
{
    "teams": [
        {
            "name": "Abu Dhabi Ocean Racing",
            "position": [
                1,
                3,
                2,
                2,
                1,
                2,
                5,
                3,
                5
            ]
        },
        ...
    ],
    "legs": [
        "ALICANTE - CAPE TOWN",
        "CAPE TOWN - ABU DHABI",
        "ABU DHABI - SANYA",
        "SANYA - AUCKLAND",
        "AUCKLAND - ITAJA\u00cd",
        "ITAJA\u00cd - NEWPORT",
        "NEWPORT - LISBON",
        "LISBON - LORIENT",
        "LORIENT - GOTHENBURG"
    ]
}
We've only shown the first team. There was a total of seven teams in this particular race. Each team is represented by a Python dictionary, with the team's name and their history of finish positions on each leg. For the team shown here, Abu Dhabi Ocean Racing, they finished in first place in the first leg, and then third place in the next leg. Their worst performance was fifth place in both the seventh and ninth legs of the race, which were the legs from Newport, Rhode Island, USA to Lisbon, Portugal, and from Lorient in France to Gothenburg in Sweden.
The JSON-formatted data can look like a Python dictionary that has lists within it. This overlap between Python syntax and JSON syntax can be thought of as a happy coincidence: it makes it easier to visualize the Python data structure that will be built from the JSON source document.
JSON has a small set of data structures: null, Boolean, number, string, list, and object. These map to objects of Python types in a very direct way. The json
module makes the conversions from source text into Python objects for us.
One of the strings contains a Unicode escape sequence, \u00cd, instead of the actual Unicode character Í. This is a common technique used to encode characters beyond the 128 ASCII characters. The parser in the json module handles this for us.
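We can see this behavior directly with a tiny document:

```python
import json

# The raw JSON text uses the \u00cd escape for the character Í.
text = r'"ITAJA\u00cd - NEWPORT"'
value = json.loads(text)
```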
In this example, we'll write a function to disentangle this document and show the team finishes for each leg.
This recipe will start with importing the necessary modules. We'll then use these modules to transform the contents of the file into a useful Python object:
Import the json module to parse the text. We'll also need a Path object to refer to the file:
import json
from pathlib import Path
Define a race_summary() function to read the JSON document from a given Path instance:
def race_summary(source_path: Path) -> None:
Use source_path.read_text() to read the file named by the Path object. We provide this string to the json.loads() function for parsing. For very large files, an open file can be passed to the json.load() function; this can be more efficient than reading the entire document into a string object and parsing the in-memory text:
    document = json.loads(source_path.read_text())
The resulting document is a dictionary with two keys, teams and legs. Here's how we can iterate through each leg, showing each team's position in the leg:
    for n, leg in enumerate(document['legs']):
        print(leg)
        for team_finishes in document['teams']:
            print(
                team_finishes['name'],
                team_finishes['position'][n])
The data for each team will be a dictionary with two keys: name and position. We can navigate down into the team details to get the name of a particular team:
>>> document['teams'][6]['name']
'Team Vestas Wind'
We can look inside the legs
field to see the names of each leg of the race:
>>> document['legs'][5]
'ITAJAÍ - NEWPORT'
Note that the JSON source file included a \u00cd Unicode escape sequence. This was parsed properly, and the Unicode output shows the proper Í character.
A JSON document is a data structure in JavaScript Object Notation. JavaScript programs can parse the document trivially. Other languages must do a little more work to translate the JSON to a native data structure.
A JSON document contains three kinds of structures:
Objects, which become Python dictionaries: JSON has a syntax similar to Python's: {"key": "value"}. Unlike Python, JSON only uses " for string quotation marks. JSON notation is intolerant of an extra , at the end of the dictionary value. Other than this, the two notations are similar.
Arrays, which become Python lists: JSON syntax uses [item, ...], which looks like Python. JSON is intolerant of an extra , at the end of the array value.
Primitive values: There are five classes of values: string, number, true, false, and null. Strings are enclosed in " and use a variety of escape sequences, which are similar to Python's. Numbers follow the rules for floating-point values. The other three values are simple literals; these parallel Python's True, False, and None literals. As a special case, numbers with no decimal point become Python int objects. This is an extension of the JSON standard.
There is no provision for any other kinds of data. This means that Python programs must convert complex Python objects into a simpler representation so that they can be serialized in JSON notation.
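The mapping between the two notations can be seen with a small example:

```python
import json

source = '{"flag": true, "missing": null, "count": 3, "ratio": 2.5, "items": ["a", "b"]}'
document = json.loads(source)
# true -> True, null -> None, 3 -> int, 2.5 -> float,
# [...] -> list, and the whole object -> dict.
```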
Conversely, we often apply additional conversions to reconstruct complex Python objects from the simplified JSON representation. The json
module has places where we can apply additional processing to the simple structures to create more sophisticated Python objects.
A file, generally, contains a single JSON document. The JSON standard doesn't provide an easy way to encode multiple documents in a single file. If we want to analyze a web log, for example, the original JSON standard may not be the best notation for preserving a huge volume of information.
There are common extensions, like Newline Delimited JSON, http://ndjson.org, and JSON Lines, http://jsonlines.org, to define a way to encode multiple JSON documents into a single file.
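The idea behind these extensions is simple: one complete, compact JSON document per line of text. A minimal sketch of writing and reading this layout:

```python
import json

events = [{"event": "start", "n": 1}, {"event": "stop", "n": 2}]

# Serialize: one compact JSON document per line of text.
text = "\n".join(json.dumps(event) for event in events)

# Deserialize: each line is parsed independently.
recovered = [json.loads(line) for line in text.splitlines()]
```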
There are two additional problems that we often have to tackle: serializing complex Python objects into JSON-friendly structures, and deserializing complex Python objects from the JSON representation.
When we represent a Python object's state as a string of text characters, we've serialized the object. Many Python objects need to be saved in a file or transmitted to another process. These kinds of transfers require a representation of the object state. We'll look at serializing and deserializing separately.
Many common Python data structures can be serialized into JSON. Because Python is extremely sophisticated and flexible, we can also create Python data structures that cannot be directly represented in JSON.
The serialization to JSON works out the best if we create Python objects that are limited to values of the built-in dict
, list
, str
, int
, float
, bool
, and None
types. This subset of Python types can be used to build objects that the json
module can serialize and can be used widely by a number of programs written in different languages.
One commonly used data structure that doesn't serialize easily is the datetime.datetime
object. Here's what happens when we try:
>>> import datetime
>>> example_date = datetime.datetime(2014, 6, 7, 8, 9, 10)
>>> document = {'date': example_date}
Here, we've created a simple document with a dictionary mapping a string to a datetime
instance. What happens when we try to serialize this in JSON?
>>> json.dumps(document)
Traceback (most recent call last):
...
TypeError: datetime.datetime(2014, 6, 7, 8, 9, 10) is not JSON serializable
This shows that objects will raise a TypeError exception when they cannot be serialized. Avoiding this exception can be done in one of two ways. We can either convert the data into a JSON-friendly structure before building the document, or we can add a default type handler to the JSON serialization process that gives us a way to provide a serializable version of the data.
To convert the datetime
object into a string prior to serializing it as JSON, we need to make a change to the underlying data. In the following example, we replaced the datetime.datetime
object with a string:
>>> document_converted = {'date': example_date.isoformat()}
>>> json.dumps(document_converted)
'{"date": "2014-06-07T08:09:10"}'
This uses the standardized ISO format for dates to create a string that can be serialized. An application that reads this data can then convert the string back into a datetime
object. This kind of transformation can be difficult for a complex document.
The other technique for serializing complex data is to provide a function that's used by the json
module during serialization. This function must convert a complex object into something that can be safely serialized. In the following example, we'll convert a datetime
object into a simple string value:
from typing import Any, Dict, Union

def default_date(object: Any) -> Union[Any, Dict[str, Any]]:
    if isinstance(object, datetime.datetime):
        return {"$date": object.isoformat()}
    return object
We've defined a function, default_date()
, which will apply a special conversion rule to datetime
objects. Any datetime instance will be massaged into a dictionary with an obvious key – "$date"
– and a string value. This dictionary of strings can be serialized by functions in the json
module.
We provide this function to the json.dumps()
function, assigning the default_date()
function to the default
parameter, as follows:
>>> example_date = datetime.datetime(2014, 6, 7, 8, 9, 10)
>>> document = {'date': example_date}
>>> print(
... json.dumps(document, default=default_date, indent=2))
{
  "date": {
    "$date": "2014-06-07T08:09:10"
  }
}
When the json
module can't serialize an object, it passes the object to the given default
function. In any given application, we'll need to expand this function to handle a number of Python object types that we might want to serialize in JSON notation. If there is no default function provided, an exception is raised when an object can't be serialized.
When deserializing JSON to create Python objects, there's a hook that can be used to convert data from a JSON dictionary into a more complex Python object. This is called object_hook
and it is used during processing by the json.loads()
function. This hook is used to examine each JSON dictionary to see if something else should be created from the dictionary instance.
The function we provide will either create a more complex Python object, or it will simply return the original dictionary object unmodified:
def as_date(object: Dict[str, Any]) -> Union[Any, Dict[str, Any]]:
    if {'$date'} == set(object.keys()):
        return datetime.datetime.fromisoformat(object['$date'])
    return object
This function will check each object that's decoded to see if the object has a single field, and that single field is named $date
. If that is the case, the value of the entire object is replaced with a datetime
object. The return type is a union of Any
and Dict[str, Any]
to reflect the two possible results: either some object or the original dictionary.
We provide a function to the json.loads()
function using the object_hook
parameter, as follows:
>>> source = '''{"date": {"$date": "2014-06-07T08:09:10"}}'''
>>> json.loads(source, object_hook=as_date)
{'date': datetime.datetime(2014, 6, 7, 8, 9, 10)}
This parses a very small JSON document that meets the criteria for containing a date. The resulting Python object is built from the string value found in the JSON serialization.
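Putting the two hooks together gives a complete round trip; here's a sketch reusing the default_date() and as_date() functions shown in this recipe:

```python
import datetime
import json
from typing import Any, Dict, Union

def default_date(object: Any) -> Union[Any, Dict[str, Any]]:
    # Serialization hook: datetime -> {"$date": "..."} dictionary.
    if isinstance(object, datetime.datetime):
        return {"$date": object.isoformat()}
    return object

def as_date(object: Dict[str, Any]) -> Union[Any, Dict[str, Any]]:
    # Deserialization hook: {"$date": "..."} dictionary -> datetime.
    if {'$date'} == set(object.keys()):
        return datetime.datetime.fromisoformat(object['$date'])
    return object

document = {"date": datetime.datetime(2014, 6, 7, 8, 9, 10)}
text = json.dumps(document, default=default_date)
round_trip = json.loads(text, object_hook=as_date)
```

The deserialized dictionary compares equal to the original, showing that no information was lost.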
We may also want to design our application classes to provide additional methods to help with serialization. A class might include a to_json()
method, which will serialize the objects in a uniform way. This method might provide class information. It can avoid serializing any derived attributes or computed properties. Similarly, we might need to provide a static from_json()
method that can be used to determine if a given dictionary object is actually an instance of the given class.
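Here's a minimal sketch of this pattern. The Point class and the "$class" marker key are our own illustration, not part of the recipe's data:

```python
import json
from typing import Any, Dict, Union

class Point:
    """A hypothetical class with uniform JSON serialization support."""
    def __init__(self, lat: float, lon: float) -> None:
        self.lat = lat
        self.lon = lon

    def to_json(self) -> Dict[str, Any]:
        # Include class information so from_json() can recognize instances.
        return {"$class": "Point", "lat": self.lat, "lon": self.lon}

    @staticmethod
    def from_json(document: Dict[str, Any]) -> Union["Point", Dict[str, Any]]:
        # Rebuild a Point only when the marker key is present.
        if document.get("$class") == "Point":
            return Point(document["lat"], document["lon"])
        return document

text = json.dumps(Point(32.832, -79.934).to_json())
point = json.loads(text, object_hook=Point.from_json)
```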
The XML markup language is widely used to represent the state of objects in a serialized form. For details, see http://www.w3.org/TR/REC-xml/. Python includes a number of libraries for parsing XML documents.
XML is called a markup language because the content of interest is marked with tags: each item is written with a start <tag>
and an end </tag>
to clarify the structure of the data. The overall file text includes both the content and the XML markup.
Because the markup is intermingled with the text, there are some additional syntax rules that must be used to distinguish markup from text. In order to include the <
character in our data, we must use XML character entity references. We must use &lt;
to include <
in our text. Similarly, &gt;
must be used instead of >
, and &amp;
is used instead of &
. Additionally, &quot;
is used to embed a "
character in an attribute value delimited by "
characters. For the most part, XML parsers will handle this transformation when consuming XML.
A document, then, will have items as follows:
<team><name>Team SCA</name><position>...</position></team>
The <team>
tag contains the <name>
tag, which contains the text of the team's name. The <position>
tag contains more data about the team's finish position in each leg of a race.
Most XML processing allows additional newline and space characters in the XML to make the structure more obvious:
<team>
<name>Team SCA</name>
<position>...</position>
</team>
As shown in the preceding example, content (like team name) is surrounded by the tags. The overall document forms a large, nested collection of containers. We can think of a document as a tree with a root tag that contains all the other tags and their embedded content. Between tags, there is some additional content. In some applications, the additional content between the ends of tags is entirely whitespace.
Here's the beginning of the document we'll be looking at:
<?xml version="1.0"?>
<results>
<teams>
<team>
<name>
Abu Dhabi Ocean Racing
</name>
<position>
<leg n="1">
1
</leg>
...
</position>
...
</team>
...
</teams>
</results>
The top-level container is the <results>
tag. Within this is a <teams>
tag. Within the <teams>
tag are many repetitions of data for each individual team, enclosed in the <team>
tag. We've used … to show where parts of the document were elided.
It's very, very difficult to parse XML with regular expressions. We need more sophisticated parsers to handle the syntax of nested tags.
There are two binary libraries available in Python for parsing XML: SAX and Expat. Python includes the modules xml.sax
and xml.parsers.expat
to exploit these two libraries directly.
In addition to these, there's a very sophisticated set of tools in the xml.etree
package. We'll focus on using the ElementTree
module in this package to parse and analyze XML documents.
In this recipe, we'll use the xml.etree
module to parse XML data.
We've gathered some sailboat racing results in race_result.xml
. This file contains information on teams, legs, and the order in which the various teams finished each leg.
A team's overall score is the sum of the finish positions. Finishing first, or nearly first, in each leg will give a very low score. In many cases, there are empty values where a boat did not start, did not finish, or was disqualified from the race. In those cases, the team's score will be one more than the number of boats. If there are seven boats, then the team is given eight points for the leg. The inability to compete creates a hefty penalty.
The root tag for this data is a <results>
document. This has the following schema:
The <legs>
tag contains individual <leg>
tags that name each leg of the race. The leg names contain both a starting port and an ending port in the text.
The <teams>
tag contains a number of <team>
tags with details of each team. Each team has data structured with internal tags: the <name>
tag contains the team name, and the <position>
tag contains a number of <leg>
tags with the finish position for the given leg. Each leg is numbered, and the numbering matches the leg definitions in the <legs>
tag.
The data for all the finish positions for a single team looks as follows:
<?xml version="1.0"?>
<results>
<teams>
<team>
<name>
Abu Dhabi Ocean Racing
</name>
<position>
<leg n="1">
1
</leg>
<leg n="2">
3
</leg>
<leg n="3">
2
</leg>
<leg n="4">
2
</leg>
<leg n="5">
1
</leg>
<leg n="6">
2
</leg>
<leg n="7">
5
</leg>
<leg n="8">
3
</leg>
<leg n="9">
5
</leg>
</position>
</team>
...
</teams>
<legs>
...
</legs>
</results>
We've only shown the first team. There was a total of seven teams in this particular race around the world.
In XML notation, the application data shows up in two kinds of places. The first is between the start and the end of a tag – for example, <name>Abu Dhabi Ocean Racing</name>
. The tag is <name>
, while the text between <name>
and </name>
is the value of this tag, Abu Dhabi Ocean Racing
.
Also, data shows up as an attribute of a tag; for example, in <leg n="1">
. The tag is <leg>
; the tag has an attribute, n
, with a value of 1
. A tag can have an indefinite number of attributes.
The <leg>
tags point out an interesting problem with XML. These tags include the leg number given as an attribute, n
, and the position in the leg given as the text inside the tag. The general approach is to put essential data inside the tags and supplemental, or clarifying, data in the attributes. The line between essential and supplemental is blurry.
XML permits a mixed content model. This reflects the case where XML is mixed in with text, where there is text inside and outside XML tags. Here's an example of mixed content:
<p>This has <strong>mixed</strong> content.</p>
The content of the <p>
tag is a mixture of text and a tag. The data we're working with does not rely on this kind of mixed content model, meaning all the data is within a single tag or an attribute of a tag. The whitespace between tags can be ignored.
We'll use the xml.etree
module to parse the data. This involves reading the data from a file and providing it to the parser. The resulting document will be rather complex.
We have not provided a formal schema definition for our sample data, nor have we provided a Document Type Definition (DTD). This means that the XML defaults to mixed content mode. Furthermore, the XML structure can't be validated against the schema or DTD.
Parsing XML data requires importing the ElementTree
module. We'll use this to write a race_summary()
function that parses the XML data and produces a useful Python object:
We'll need the xml.etree.ElementTree
class to parse the XML text. We'll also need a Path
object to refer to the file. We've used the import… as…
syntax to assign a shorter name of XML
to the ElementTree
class:
import xml.etree.ElementTree as XML
from pathlib import Path
from typing import cast
Define a function to summarize the race data from a given Path
instance:
def race_summary(source_path: Path) -> None:
Create an ElementTree
document object by parsing the XML text. It's often easiest to use source_path.read_text()
to read the file named by the Path
. We provide this string to the XML.fromstring()
method for parsing. For very large files, an incremental parser is sometimes helpful:
    source_text = source_path.read_text(encoding='UTF-8')
    document = XML.fromstring(source_text)
"teams"
and "legs"
. Here's how we can iterate through each leg, showing the team's position in the leg:
    legs = cast(XML.Element, document.find('legs'))
    teams = cast(XML.Element, document.find('teams'))
    for leg in legs.findall('leg'):
        print(cast(str, leg.text).strip())
        n = leg.attrib['n']
        for team in teams.findall('team'):
            position_leg = cast(XML.Element,
                team.find(f"position/leg[@n='{n}']"))
            name = cast(XML.Element, team.find('name'))
            print(
                cast(str, name.text).strip(),
                cast(str, position_leg.text).strip()
            )
Once we have the document object, we can then search the object for the relevant pieces of data. In this example, we used the find()
method to locate the two tags containing legs and teams.
Within the legs
tag, there are a number of leg
tags. Each of those tags has the following structure:
<leg n="1">
ALICANTE - CAPE TOWN
</leg>
The expression leg.attrib['n']
extracts the value of the attribute named n
. The expression leg.text.strip()
is all the text within the <leg>
tag, stripped of extra whitespace.
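We can confirm this behavior with a small standalone fragment parsed from a string:

```python
import xml.etree.ElementTree as XML

# A single <leg> element of the same shape as the race data.
leg = XML.fromstring('<leg n="1">\n    ALICANTE - CAPE TOWN\n</leg>')
n = leg.attrib['n']       # the attribute value; always a string in XML
name = leg.text.strip()   # the enclosed text, with whitespace removed
```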
It's essential to note that the results of the find()
function have a type hint of Optional[XML.Element]
. We have two choices to handle this: use an explicit if
statement to determine whether the result is not None
, or use cast(XML.Element, tag.find(…))
to claim that the result will never be None
. For some kinds of output from automated systems, the tags are always going to be present, and the overhead of numerous if
statements is excessive.
For each leg of the race, we need to print the finish positions, which are represented in the data contained within the <teams>
tag. Within this tag, we need to locate a tag containing the name of the team. We also need to find the proper leg
tag with the finish position for this team on the given leg.
We need to use a complex XPath search, f"position/leg[@n='{n}']"
, to locate a specific instance of the position
tag. The value of n
is the leg number. For the ninth leg, this search will be the string "position/leg[@n='9']"
. This will locate the position
tag containing a leg
tag that has an attribute n
equal to 9
.
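A minimal sketch of this XPath search, using an inline fragment of the same shape as the race data:

```python
import xml.etree.ElementTree as XML

team = XML.fromstring(
    "<team><name>Team SCA</name>"
    "<position><leg n='8'>1</leg><leg n='9'>7</leg></position></team>"
)
# The predicate [@n='9'] selects the <leg> tag whose n attribute is 9.
tag = team.find("position/leg[@n='9']")
assert tag is not None
position = tag.text
```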
Because XML is a mixed content model, all of the newline, tab, and space characters in the content are perfectly preserved in the data. We rarely want any of this whitespace, and it makes sense to use the strip()
method to remove all extraneous characters before and after the meaningful content.
The XML parser modules transform XML documents into fairly complex objects based on a standardized document object model. In the case of the etree
module, the document will be built from Element
objects, which generally represent tags and text.
XML can also include processing instructions and comments. We'll ignore them and focus on the document structure and content here.
Each Element
instance has the text of the tag, the text within the tag, attributes that are part of the tag, and a tail. The tag is the name inside <tag>
. The attributes are the fields that follow the tag name. For example, the <leg n="1">
tag has a tag name of leg
and an attribute named n
. Values are always strings in XML; any conversion to a different data type is the responsibility of the application using the data.
The text is contained between the start and end of a tag. Therefore, a tag such as <name>Team SCA</name>
has "Team SCA"
for the value of the text
attribute of the Element
that represents the <name>
tag.
Note that a tag also has a tail attribute. Consider this sequence of two tags:
<name>Team SCA</name>
<position>...</position>
There's a newline whitespace character after the closing </name>
tag and before the opening of the <position>
tag. This extra text is collected by the parser and put into the tail of the <name>
tag. The tail values can be important when working with a mixed content model. The tail values are generally whitespace when working in an element content model.
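A small sketch makes the tail visible:

```python
import xml.etree.ElementTree as XML

fragment = XML.fromstring(
    "<team><name>Team SCA</name>\n<position>2</position></team>"
)
name = fragment.find("name")
# The whitespace between </name> and <position> is the tail of <name>.
tail = name.tail
```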
Because we can't trivially translate an XML document into a Python dictionary, we need a handy way to search through the document's content. The ElementTree
module provides a search technique that's a partial implementation of the XML Path Language (XPath) for specifying a location in an XML document. The XPath notation gives us considerable flexibility.
The XPath queries are used with the find()
and findall()
methods. Here's how we can find all of the team names:
>>> for tag in document.findall('teams/team/name'):
... print(tag.text.strip())
Abu Dhabi Ocean Racing
Team Brunel
Dongfeng Race Team
MAPFRE
Team Alvimedica
Team SCA
Team Vestas Wind
Here, we've looked for the top-level <teams>
tag. Within that tag, we want <team>
tags. Within those tags, we want the <name>
tags. This will search for all the instances of this nested tag structure.
Note that we've omitted the type hints from this example and assumed that all the tags will contain text values that are not None
. If we use this in an application, we may have to add checks for None
, or use the cast()
function to convince mypy
that the tags or the text attribute value is not None
.
We can search for attribute values as well. This can make it handy to find how all the teams did on a particular leg of the race. The data for this can be found in the <leg>
tag, within the <position>
tag for each team.
Furthermore, each <leg>
has an attribute named n
that shows which of the race legs it represents. Here's how we can use this to extract specific data from the XML document:
>>> for tag in document.findall("teams/team/position/leg[@n='8']"):
... print(tag.text.strip())
3
5
7
4
6
1
2
This shows us the finishing positions of each team on leg 8 of the race. We're looking for all tags with <leg n="8">
and displaying the text within that tag. We have to match these values with the team names to see that the sixth team, Team SCA, finished first, and that the third team, Dongfeng Race Team, finished last on this leg.
A great deal of content on the web is presented using HTML markup. A browser renders the data very nicely. How can we parse this data to extract the meaningful content from the displayed web page?
We can use the standard library html.parser
module, but it's not as helpful as we'd like. It only provides low-level lexical scanning information; it doesn't provide a high-level data structure that describes the original web page.
Instead, we'll use the Beautiful Soup module to parse HTML pages into more useful data structures. This is available from the Python Package Index (PyPI). See https://pypi.python.org/pypi/beautifulsoup4.
This must be downloaded and installed. Often, this is as simple as doing the following:
python -m pip install beautifulsoup4
Using the python -m pip
command ensures that we will use the pip
command that goes with the currently active virtual environment.
We've gathered some sailboat racing results in Volvo Ocean Race.html
. This file contains information on teams, legs, and the order in which the various teams finished each leg. It's been scraped from the Volvo Ocean Race website, and it looks wonderful when opened in a browser.
Except for very old websites, most HTML notation is an extension of XML notation. The content is surrounded by <tag>
marks, which show the structure and presentation of the data. HTML predates XML, and an XHTML standard reconciles the two. Note that browser applications must be tolerant of older HTML and improperly structured HTML. The presence of damaged HTML can make it difficult to analyze some data from the World Wide Web.
HTML pages can include a great deal of overhead. There are often vast code and style sheet sections, as well as invisible metadata. The content may be surrounded by advertising and other information.
Generally, an HTML page has the following overall structure:
<html>
<head>...</head>
<body>...</body>
</html>
Within the <head>
tag, there will be links to JavaScript libraries and links to Cascading Style Sheet (CSS) documents. These are used to provide interactive features and to define the presentation of the content.
The bulk of the content is in the <body>
tag. It can be difficult to track down the relevant data on a web page. This is because the focus of the design effort is on how people see it more than how automated tools can process it.
In this case, the race results are in an HTML <table>
tag, making them easy to find. What we can see here is the overall structure for the relevant content in the page:
<table>
<thead>
<tr>
<th>...</th>
...
</tr>
</thead>
<tbody>
<tr>
<td>...</td>
...
</tr>
...
</tbody>
</table>
The <thead>
tag includes the column titles for the table. There's a single table row tag, <tr>
, with table heading, <th>
, tags that include the content. Each of the <th>
tags contains two parts. It looks like this:
<th tooltipster data-title="<strong>ALICANTE - CAPE TOWN</strong>" data-theme="tooltipster-shadow" data-htmlcontent="true" data-position="top">LEG 1</th>
The essential display is a number for each leg of the race, LEG 1
, in this example. This is the content of the tag. In addition to the displayed content, there's also an attribute value, data-title
, that's used by a JavaScript function. This attribute value is the name of the leg, and it is displayed when the cursor hovers over a column heading. The JavaScript function pops up the leg's name.
The <tbody>
tag includes the team name and the results for each race. The table row, <tr>
, tag, contains the details for each team. The team name (and graphic and overall finish rank) is shown in the first three columns of the table data, <td>
. The remaining columns of table data contain the finishing position for a given leg of the race.
Because of the relative complexity of sailboat racing, there are additional notes in some of the table data cells. These are included as attributes to provide supplemental data regarding the reason why a cell has a particular value. In some cases, teams did not start a leg, did not finish a leg, or retired from a leg.
Here's a typical <tr>
row from the HTML:
<tr class="ranking-item">
<td class="ranking-position">3</td>
<td class="ranking-avatar">
<img src="..."> </td>
<td class="ranking-team">Dongfeng Race Team</td>
<td class="ranking-number">2</td>
<td class="ranking-number">2</td>
<td class="ranking-number">1</td>
<td class="ranking-number">3</td>
<td class="ranking-number" tooltipster data-title="<center><strong>RETIRED</strong><br>Click for more info</center>" data-theme="tooltipster-3" data-position="bottom" data-htmlcontent="true"><a href="/en/news/8674_Dongfeng-Race-Team-breaks-mast-crew-safe.html" target="_blank">8</a><div class="status-dot dot-3"></div></td>
<td class="ranking-number">1</td>
<td class="ranking-number">4</td>
<td class="ranking-number">7</td>
<td class="ranking-number">4</td>
<td class="ranking-number total">33<span class="asterix">*</span></td>
</tr>
The <tr>
tag has a class attribute that defines the style for this row. The class
attribute on this tag helps our data gathering application locate the relevant content. It also chooses the CSS style for this class of content.
The <td>
tags also have class attributes that define the style for the individual cells of data. Because the CSS styles reflect the content, the class also clarifies what the content of the cell means. Not all CSS class names are as well defined as these.
One of the cells has no text content. Instead, the cell has an <a>
tag and an empty <div>
tag. That cell also contains several attributes, including data-title
, data-theme
, data-position
, and others. These additional tags are used by a JavaScript function to display additional information in the cell. Instead of text stating the finish position, there is additional data on what happened to the racing team and their boat.
An essential complexity here is that the data-title
attribute contains HTML content. This is unusual, and an HTML parser cannot detect that an attribute contains HTML markup. We'll set this aside for the There's more… section of this recipe.
We'll start by importing the necessary modules. The function for parsing the data will have two important sections: the list of legs and the team results for each leg. These are represented as headings and rows of an HTML table.
We'll need the BeautifulSoup
class from the bs4
module to parse the text. We'll also need a Path
object to refer to the file. We've used a # type: ignore
comment because the bs4 module didn't have complete type hints at the time of publication:
from bs4 import BeautifulSoup  # type: ignore
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
Define a function to extract the race results from a given Path
instance:
def race_extract(source_path: Path) -> Dict[str, Any]:
Parse the source document, assigning the resulting BeautifulSoup object to the soup
variable. We've used a context manager to access the file. As an alternative, we could also read the content using the Path.read_text()
method:
    with source_path.open(encoding="utf8") as source_file:
        soup = BeautifulSoup(source_file, "html.parser")
Once we have the soup
object, we can navigate to the first <table>
tag. Within that, we need to find the first <thead>
tag. Within that heading, we need to find the <tr>
tag. This row contains the individual heading cells. Navigating to the first instance of a tag means using the tag name as an attribute; each tag's children are available as attributes of the tag:
    thead_row = soup.table.thead.tr
Collect the content of each <th>
cell within the row. There are three variations of the heading cells. Some have no text, some have text, and some have text and a data-title
attribute value. For this first version, we'll capture all three variations. The tag's string
attribute contains the text content. The tag's attrs
attribute contains the various attribute values:
    legs: List[Tuple[str, Optional[str]]] = []
    for tag in thead_row.find_all("th"):
        leg_description = (
            tag.string, tag.attrs.get("data-title")
        )
        legs.append(leg_description)
From the soup
object, we can also navigate to the first <tbody>
tag within the first <table>
tag. We can leverage the way Beautiful Soup makes the first instance of any tag into an attribute of the parent's tag:
    tbody = soup.table.tbody
Iterate through the <tr>
tags in order to visit all of the rows of the table. Within the <tr>
tags for a row, each cell is in a <td>
tag. We want to convert the content of the <td>
tags into team names and a collection of team positions, depending on the attributes available:
    teams: List[Dict[str, Any]] = []
    for row in tbody.find_all("tr"):
        team: Dict[str, Any] = {
            "name": None,
            "position": []}
        for col in row.find_all("td"):
            if "ranking-team" in col.attrs.get("class"):
                team["name"] = col.string
            elif (
                "ranking-number" in col.attrs.get("class")
            ):
                team["position"].append(col.string)
            elif "data-title" in col.attrs:
                # Complicated explanation with nested HTML
                # print(col.attrs, col.string)
                pass
        teams.append(team)
    document = {
        "legs": legs,
        "teams": teams,
    }
    return document
We've created a list of legs showing the order and names for each leg, and we parsed the body of the table to create a dict-of-list structure with each leg's results for a given team. The resulting object looks like this:
{'legs': [(None, None),
('LEG 1', '<strong>ALICANTE - CAPE TOWN</strong>'),
('LEG 2', '<strong>CAPE TOWN - ABU DHABI</strong>'),
('LEG 3', '<strong>ABU DHABI - SANYA</strong>'),
('LEG 4', '<strong>SANYA - AUCKLAND</strong>'),
('LEG 5', '<strong>AUCKLAND - ITAJAÍ</strong>'),
('LEG 6', '<strong>ITAJAÍ - NEWPORT</strong>'),
('LEG 7', '<strong>NEWPORT - LISBON</strong>'),
('LEG 8', '<strong>LISBON - LORIENT</strong>'),
('LEG 9', '<strong>LORIENT - GOTHENBURG</strong>'),
('TOTAL', None)],
'teams': [{'name': 'Abu Dhabi Ocean Racing',
'position': ['1', '3', '2', '2', '1', '2', '5', '3', '5', '24']},
{'name': 'Team Brunel',
'position': ['3', '1', '5', '5', '4', '3', '1', '5', '2', '29']},
{'name': 'Dongfeng Race Team',
'position': ['2', '2', '1', '3', None, '1', '4', '7', '4', None]},
{'name': 'MAPFRE',
'position': ['7', '4', '4', '1', '2', '4', '2', '4', '3', None]},
{'name': 'Team Alvimedica',
'position': ['5', None, '3', '4', '3', '5', '3', '6', '1', '34']},
{'name': 'Team SCA',
'position': ['6', '6', '6', '6', '5', '6', '6', '1', '7', None]},
{'name': 'Team Vestas Wind',
'position': ['4',
None,
None,
None,
None,
None,
None,
'2',
'6',
'60']}]}
This structure is the content of the source HTML table, unpacked into a Python dictionary we can work with. Note that the titles for the legs include embedded HTML within the attribute's value.
Within the body of the table, many cells have None
for the final race position, and a complex value in data-title
for the specific <td>
tag. We've avoided trying to capture the additional results data in this initial part of the recipe.
The BeautifulSoup
class transforms HTML documents into fairly complex objects based on a document object model (DOM). The resulting structure will be built from instances of the Tag
, NavigableString
, and Comment
classes.
Generally, we're interested in the tags that contain the string content of the web page. These are instances of the Tag
class, as well as the NavigableString
class.
Each Tag
instance has a name, string, and attributes. The name is the word inside <
and >
. The attributes are the fields that follow the tag name. For example, <td class="ranking-number">1</td>
has a tag name of td
and an attribute named class
. Values are often strings, but in a few cases, the value can be a list of strings. The string attribute of the Tag
object is the content enclosed by the tag; in this case, it's a very short string, 1
.
HTML is a mixed content model. This means that a tag can contain child tags, in addition to navigable text. When looking at the children of a given tag, there will be a sequence of Tag
and NavigableText
objects freely intermixed.
One of the most common features of HTML is small blocks of navigable text that contain only newline characters. When we have HTML like this:
<tr>
<td>Data</td>
</tr>
there are three children within the <tr>
tag. Here's a display of the children of this tag:
>>> example = BeautifulSoup('''
... <tr>
... <td>data</td>
... </tr>
... ''', 'html.parser')
>>> list(example.tr.children)
['\n', <td>data</td>, '\n']
The two newline characters are peers of the <td>
tag. These are preserved by the parser. This shows how NavigableText
objects often surround a child Tag
object.
The BeautifulSoup
parser depends on another, lower-level library to do some of the parsing. It's easy to use the built-in html.parser
module for this. There are alternatives that can be installed as well. These may offer some advantages, like better performance or better handling of damaged HTML.
The Tag
objects of Beautiful Soup represent the hierarchy of the document's structure. There are several kinds of navigation among tags:
Every tag, except for the special root [document] container, will have a parent. The top <html>
tag will often be the only child of the root container.
The parents
attribute is a generator for all parents of a tag. It's a path "upward" through the hierarchy from a given tag.
Most Tag
objects can have children. A few tags such as <img/>
and <hr/>
have no children. The children
attribute is a generator that yields the children of a tag.
A tag with children also has descendants: the children of its children. The <html>
tag, for example, contains the entire document as descendants. The children
attribute contains the immediate children; the descendants
attribute generates all children of children, recursively.
Tags also have siblings within their container. We can use the next_sibling
and previous_sibling
attributes to help us step through the peers of a tag.
In some cases, a document will have a straightforward organization, and a simple search by the id
attribute or class
attribute will find the relevant data. Here's a typical search for a given structure:
>>> ranking_table = soup.find('table', class_="ranking-list")
Note that we have to use class_
in our Python query to search for the attribute named class
. The token class
is a reserved word in Python and cannot be used as a parameter name. Given the overall document, we're searching for any <table class="ranking-list">
tag. This will find the first such table in a web page. Since we know there will only be one of these, this attribute-based search helps distinguish between what we are trying to find and any other tabular data on a web page.
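Here's a self-contained sketch of this search; the HTML below is our own illustration, not the race page:

```python
from bs4 import BeautifulSoup  # type: ignore

html = """
<html><body>
<table class="other-data"><tr><td>skip</td></tr></table>
<table class="ranking-list"><tr><td>keep</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
# class_ (with a trailing underscore) searches the HTML class attribute.
ranking_table = soup.find("table", class_="ranking-list")
cell = ranking_table.td.string
```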
Here's the list of parents of this <table>
tag:
>>> list(tag.name for tag in ranking_table.parents)
['section', 'div', 'div', 'div', 'div', 'body', 'html', '[document]']
We've displayed just the tag name for each parent above the given <table>
. Note that there are four nested <div>
tags that wrap the <section>
that contains <table>
. Each of these <div>
tags likely has a different class attribute to properly define the content and the style for the content.
[document]
is the overall BeautifulSoup
container that holds the various tags that were parsed. This is displayed distinctively to emphasize that it's not a real tag, but a container for the top-level <html>
tag.
When we read data from a CSV format file, the csv
module offers two general choices for the kind of reader to create:
When we use csv.reader()
, each row becomes a list of column values.
When we use csv.DictReader
, each row becomes a dictionary. By default, the contents of the first row become the keys for the row dictionary. An alternative is to provide a list of values that will be used as the keys.
function, we must use syntax like row[2]
to refer to a cell; the semantics of index 2 are completely obscure.
When we use csv.DictReader
, we can use row['date']
, which is less obscure, but this is still a lot of extra syntax. While this approach has a number of advantages, it requires a CSV file with a single-row header of unique column names, something that is far from ubiquitous in practice.
In some real-world spreadsheets, the column names are impossibly long strings. It's hard to work with row['Total of all locations excluding franchisees']
.
We can use a dataclass to replace this complex list or dictionary syntax with something simpler. This lets us replace an opaque index position or column names with a useful name.
One way to improve the readability of programs that work with spreadsheets is to replace a list of columns with a typing.NamedTuple
or dataclass
object. These two definitions provide easy-to-use names defined by the class instead of the possibly haphazard column names in the .csv
file.
More importantly, it permits much nicer syntax for referring to the various columns; for example, we can use row.date
instead of row['date']
or row[2]
.
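As a preview of the idea, here's a sketch using typing.NamedTuple with hypothetical lat/lon/date/time columns; the Waypoint name is ours, not part of any file:

```python
import csv
import io
from typing import NamedTuple

class Waypoint(NamedTuple):
    # Attribute names chosen by us, independent of the file's header.
    lat: str
    lon: str
    date: str
    time: str

# io.StringIO stands in for an open CSV file.
data = io.StringIO(
    "lat,lon,date,time\n"
    "32.8321666666667,-79.9338333333333,2012-11-27,09:15:00\n"
)
reader = csv.reader(data)
next(reader)  # skip the header row
rows = [Waypoint._make(row) for row in reader]
```

Each row can now be referenced as row.date rather than row[2] or row['date'].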
The column names (and the data types for each column) are part of the schema for a given file of data. In some CSV files, the first line of the column titles provides part of the schema for the file. The schema that gets built from the first line of the file is incomplete because it can only provide attribute names; the target data types aren't known and have to be managed by the application program.
This points to two reasons for imposing an external schema on the rows of a spreadsheet: the schema can supply useful, valid Python attribute names independent of the file's header row, and it can define the conversions needed to produce useful Python objects from the text of each column.
We'll look at a CSV file that contains some real-time data that's been recorded from the log of a sailboat. This is the waypoints.csv
file, and the data looks as follows:
lat,lon,date,time
32.8321666666667,-79.9338333333333,2012-11-27,09:15:00
31.6714833333333,-80.93325,2012-11-28,00:00:00
30.7171666666667,-81.5525,2012-11-28,11:35:00
The data contains four columns: lat, lon, date, and time. Two of the columns are the latitude and longitude of the waypoint. The date and the time are kept as separate values. This isn't ideal, and we'll look at various data cleansing steps separately.
In this case, the column titles happen to be valid Python variable names. This is rare, but it can lead to a slight simplification. The more general solution involves mapping the given column names to valid Python attribute names.
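One way to derive valid attribute names from arbitrary column titles is a small normalization helper. This is a sketch; the function name and the exact rules are assumptions, not part of the recipe:

```python
import re

def to_identifier(title: str) -> str:
    # Lower-case the title and replace runs of non-word characters with "_".
    name = re.sub(r"\W+", "_", title.strip()).strip("_").lower()
    # Identifiers can't start with a digit; add a prefix when needed.
    return name if not name[:1].isdigit() else f"c_{name}"

print(to_identifier("Total of all locations excluding franchisees"))
# total_of_all_locations_excluding_franchisees
```

A helper like this can populate the name-to-attribute mapping shown later in this recipe.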
A program can use a dictionary-based reader that looks like the following function:
import csv
import datetime
from pathlib import Path

def show_waypoints_raw(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        for row in data_reader:
            ts_date = datetime.datetime.strptime(
                row['date'], "%Y-%m-%d"
            ).date()
            ts_time = datetime.datetime.strptime(
                row['time'], "%H:%M:%S"
            ).time()
            timestamp = datetime.datetime.combine(
                ts_date, ts_time)
            lat_lon = (
                float(row['lat']),
                float(row['lon'])
            )
            print(
                f"{timestamp:%m-%d %H:%M}, "
                f"{lat_lon[0]:.3f} {lat_lon[1]:.3f}"
            )
This function embodies a number of assumptions about the available data. It combines the physical format, logical layout, and processing into a single operation. A small change to the layout – for example, a column name change – can be difficult to manage.
In this recipe, we'll isolate the various layers of processing to create some kind of immunity from change. This separation of concerns can create a much more flexible application.
We'll start by defining a useful dataclass. We'll create functions to read raw data, and create dataclass instances from this cleaned data. We'll include some of the data conversion functions in the dataclass definition to properly encapsulate it. We'll start with the target class definition, Waypoint_1:
import datetime
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Waypoint_1:
    arrival_date: str
    arrival_time: str
    lat: str
    lon: str
The cached timestamp attribute has a type of Optional, and uses the field() function to define it with init=False and a default value of None:

    _timestamp: Optional[datetime.datetime] = field(
        init=False, default=None
    )
Computing the datetime object is a relatively expensive operation. The result is cached in an attribute value, _timestamp, and returned after the initial computation:
    @property
    def arrival(self):
        if self._timestamp is None:
            ts_date = datetime.datetime.strptime(
                self.arrival_date, "%Y-%m-%d"
            ).date()
            ts_time = datetime.datetime.strptime(
                self.arrival_time, "%H:%M:%S"
            ).time()
            self._timestamp = datetime.datetime.combine(
                ts_date, ts_time)
        return self._timestamp

    @property
    def lat_lon(self):
        return float(self.lat), float(self.lon)
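As an aside, on Python 3.8 and later, the functools.cached_property decorator can replace a manual cache attribute entirely. This is a sketch of that alternative, not the recipe's implementation; the trimmed class here is hypothetical:

```python
import datetime
from dataclasses import dataclass
from functools import cached_property

@dataclass
class Waypoint:
    arrival_date: str
    arrival_time: str

    @cached_property
    def arrival(self) -> datetime.datetime:
        # Computed on first access, then cached in the instance's __dict__.
        ts_date = datetime.datetime.strptime(
            self.arrival_date, "%Y-%m-%d").date()
        ts_time = datetime.datetime.strptime(
            self.arrival_time, "%H:%M:%S").time()
        return datetime.datetime.combine(ts_date, ts_time)

wp = Waypoint("2012-11-27", "09:15:00")
print(wp.arrival)  # 2012-11-27 09:15:00
```

This avoids the Optional field and the None check, at the cost of requiring a newer Python.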
The return type is written as the quoted string 'Waypoint_1' because the class is not fully defined when the method is created. When the body of the method is evaluated, the class will exist, and a quoted name is not needed there:
    @staticmethod
    def from_source(row: Dict[str, str]) -> 'Waypoint_1':
        name_map = {
            'date': 'arrival_date',
            'time': 'arrival_time',
            'lat': 'lat',
            'lon': 'lon',
        }
        return Waypoint_1(
            **{name_map[header]: value
               for header, value in row.items()}
        )
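The **{name_map[header]: value ...} expression in from_source() renames the keys of the source row before unpacking them as keyword arguments. In isolation, with a sample row dictionary, the renaming step looks like this:

```python
name_map = {
    'date': 'arrival_date',
    'time': 'arrival_time',
    'lat': 'lat',
    'lon': 'lon',
}
row = {'date': '2012-11-27', 'time': '09:15:00',
       'lat': '32.83', 'lon': '-79.93'}

# Build a new dictionary whose keys are the dataclass attribute names.
renamed = {name_map[header]: value for header, value in row.items()}
print(renamed)
# {'arrival_date': '2012-11-27', 'arrival_time': '09:15:00', 'lat': '32.83', 'lon': '-79.93'}
```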
def show_waypoints_1(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        waypoint_iter = (
            Waypoint_1.from_source(row)
            for row in data_reader
        )
        for row in waypoint_iter:
            print(
                f"{row.arrival:%m-%d %H:%M}, "
                f"{row.lat_lon[0]:.3f} "
                f"{row.lat_lon[1]:.3f}"
            )
Looking back at the Reading delimited files with the CSV module recipe, earlier in this chapter, the structure of this function is similar to the functions in that recipe. Similarly, this design also echoes the processing shown in the Using dataclasses to simplify working with CSV files recipe, earlier in this chapter.
The expression (Waypoint_1.from_source(row) for row in data_reader) applies the Waypoint_1.from_source() static method to each raw dictionary. This method maps the source column names to class attributes, and then creates an instance of the Waypoint_1 class.
The remaining processing has to be rewritten so that it uses the attributes of the new dataclass that was defined. This often leads to simplification because row-level computations of derived values have been refactored into the class definition, removing them from the overall processing. The remaining overall processing is a display of the detailed values from each row of the source CSV file.
There are several parts to this recipe. Firstly, we've used the csv
module for the essential parsing of rows and columns of data. We've also leveraged the Reading delimited files with the CSV module recipe to process the physical format of the data.
Secondly, we've defined a dataclass that provides a minimal schema for our data. The minimal schema is supplemented with the from_source()
function to convert the raw data into instances of the dataclass. This provides a more complete schema definition because it has a mapping from the source columns to the dataclass attributes.
Finally, we've wrapped the csv
reader in a generator expression to build dataclass objects for each row. This change permits the revision of the remaining code in order to focus on the object defined by the dataclass, separate from CSV file complications.
Instead of row[2]
or row['date']
, we can now use row.arrival_date
to refer to a specific column. This is a profound change; it can simplify the presentation of complex algorithms.
A common problem with CSV files is blank rows that contain no useful data. Discarding empty rows requires some additional processing when attempting to create the row object. We need to make two changes:

- Modify the from_source() method so that it has a slightly different return value. It often works out well to change the return type from 'Waypoint_1' to Optional['Waypoint_1'] and return a None object instead of an empty or invalid Waypoint_1 instance.
- Modify the waypoint_iter expression to include a filter that rejects the None objects.

We'll look at each of these separately, starting with the revision to the from_source() method.
Each source of data has unique rules for what constitutes valid data. In this example, we'll use the rule that all four fields must be present and have data that fits the expected patterns: dates, times, or floating-point numbers.
This definition of valid data leads to a profound rethinking of the way the dataclass is defined. The application only uses two attributes: the arrival time, a datetime.datetime object, and the latitude and longitude pair, a Tuple[float, float] object.
A more useful class definition, then, is this:
@dataclass
class Waypoint_2:
    arrival: datetime.datetime
    lat_lon: Tuple[float, float]
Given these two attributes, we can redefine the from_source()
method to build this from a row of raw data:
    @staticmethod
    def from_source(row: Dict[str, str]) -> Optional['Waypoint_2']:
        try:
            ts_date = datetime.datetime.strptime(
                row['date'], "%Y-%m-%d"
            ).date()
            ts_time = datetime.datetime.strptime(
                row['time'], "%H:%M:%S"
            ).time()
            arrival = datetime.datetime.combine(
                ts_date, ts_time)
            return Waypoint_2(
                arrival=arrival,
                lat_lon=(
                    float(row['lat']),
                    float(row['lon'])
                )
            )
        except (ValueError, KeyError):
            return None
This function will locate the source values, row['date']
, row['time']
, row['lat']
, and row['lon']
. It assumes the fields are all valid and attempts to do a number of complex conversions, including the date-time combination and float conversion of the latitude and longitude values. If any of these conversions fail, an exception will be raised and a None
value will be returned. If all the conversions are successful, then an instance of the Waypoint_2
class can be built and returned from this function.
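We can verify both outcomes with a condensed, self-contained copy of the class; the sample dictionaries here are illustrative, not taken from the waypoints.csv file:

```python
import datetime
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Waypoint_2:
    arrival: datetime.datetime
    lat_lon: Tuple[float, float]

    @staticmethod
    def from_source(row: Dict[str, str]) -> Optional['Waypoint_2']:
        try:
            ts_date = datetime.datetime.strptime(
                row['date'], "%Y-%m-%d").date()
            ts_time = datetime.datetime.strptime(
                row['time'], "%H:%M:%S").time()
            return Waypoint_2(
                arrival=datetime.datetime.combine(ts_date, ts_time),
                lat_lon=(float(row['lat']), float(row['lon'])),
            )
        except (ValueError, KeyError):
            return None

good = {'lat': '32.83', 'lon': '-79.93',
        'date': '2012-11-27', 'time': '09:15:00'}
blank = {'lat': '', 'lon': '', 'date': '', 'time': ''}

print(Waypoint_2.from_source(good))   # a Waypoint_2 instance
print(Waypoint_2.from_source(blank))  # None
```

The blank row fails the date conversion with a ValueError, so the method quietly returns None instead of raising.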
Once this change is in place, we can make one more change to the main application:
def show_waypoints_2(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        waypoint_iter = (
            Waypoint_2.from_source(row)
            for row in data_reader
        )
        for row in filter(None, waypoint_iter):
            print(
                f"{row.arrival:%m-%d %H:%M}, "
                f"{row.lat_lon[0]:.3f} "
                f"{row.lat_lon[1]:.3f}"
            )
We've changed the for
statement that consumes values from the waypoint_iter
generator expression. This change introduces the filter()
function in order to exclude None
values from the source of data. Combined with the change to the from_source()
method, we can now exclude bad data and tolerate source file changes without complex rewrites.
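The filter(None, iterable) idiom works because passing None as the first argument tells filter() to keep only truthy items. Dataclass instances are truthy by default, while None is falsy, so the rejected rows simply disappear. A minimal sketch:

```python
from dataclasses import dataclass

@dataclass
class Row:
    value: int

# filter(None, ...) keeps only truthy items: the None entries are
# dropped, while the dataclass instances pass through unchanged.
rows = [Row(1), None, Row(2), None]
print(list(filter(None, rows)))  # [Row(value=1), Row(value=2)]
```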
See the Reading delimited files with the CSV module recipe, earlier in this chapter, for more details on using the csv module to parse files.