10

Input/Output, Physical Format, and Logical Layout

Computing often works with persistent data. There may be source data to be analyzed, or output to be created using Python input and output operations. The map of the dungeon that's explored in a game is data that will be input to the game application. Images, sounds, and movies are data output by some applications and input by other applications. Even a request through a network will involve input and output operations. The common aspect to all of these is the concept of a file of data. The term file is overloaded with many meanings:

  • The operating system (OS) uses a file as a way to organize bytes of data on a device. All of the wildly different kinds of content are reduced to a collection of bytes. It's the responsibility of application software to make sense of the bytes. Two common kinds of devices offer variations in terms of the features of OS files:
      • Block devices such as disks or solid-state drives (SSDs): The bytes tend to be persistent. A file on this kind of device can seek any specific byte, making these devices particularly good for databases, where any row can be processed at any time.
      • Character devices such as a network connection, a keyboard, or a GPS antenna: A file on this kind of device is viewed as a stream of individual bytes in transit to or from the device. There's no way to seek forward or backward; the bytes must be captured and processed as they arrive.
  • The word file also defines a data structure used by the Python runtime. A uniform Python file abstraction wraps the various OS file implementations. When we open a Python file, there is a binding between the Python abstraction, an OS implementation, and the underlying collection of bytes on a block device or stream of bytes from a character device.

Python gives us two common modes for working with a file's content:

  • In "b" (binary) mode, our application sees the bytes, without further interpretation. This can be helpful for processing media data like images, audio, and movies, which have complex encodings. These file formats can be rather complex, and difficult to work with. We'll often import libraries like pillow to handle the details of image file encoding into bytes.
  • In "t" (text) mode, the bytes of the file are used to decode string values. Python strings are made of Unicode characters, and there are a variety of encoding schemes for translating between text and bytes. Generally, the OS has a preferred encoding and Python respects this. The UTF-8 encoding is popular. Files can have any of the available Unicode encodings, and it may not be obvious which encoding was used to create a file.

Additionally, Python modules like shelve and pickle have unique ways of representing more complex Python objects than simple strings. There are a number of pickle protocols available; all of them are based on binary mode file operations.

Throughout this chapter, we'll talk about how Python objects are serialized. When an object is written to a file, a representation of the Python object's state is transformed into a series of bytes. Often, the translation involves text objects as an intermediate notation. Deserialization is the reverse process: it recovers a Python object's state from the bytes of a file. Saving and transferring a representation of the object state is the foundational concept behind REST web services.
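As a minimal illustration of this round trip, the json module can serialize a simple object's state to text and recover it; the dictionary here is purely illustrative:

>>> import json
>>> state = {"name": "waypoint", "lat": 32.8322, "lon": -79.9338}
>>> text = json.dumps(state)          # serialize: object state -> str
>>> text
'{"name": "waypoint", "lat": 32.8322, "lon": -79.9338}'
>>> json.loads(text) == state         # deserialize: str -> object state
True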

When we process data from files, we have two common concerns:

  • The physical format of the data: This is the fundamental concern of how the bytes on the file can be interpreted to reconstruct a Python object. The bytes could represent a JPEG-encoded image or an MPEG-encoded movie. A common example is that the bytes of the file represent Unicode text, organized into lines. The bytes could encode text for comma-separated values (CSV) or the bytes could encode the text for a JSON document. All physical format concerns are commonly handled by Python libraries like csv, json, and pickle, among many others.
  • The logical layout of the data: A given layout may have flexible positions or ordering for the data. The arrangement of CSV columns or JSON fields may be variable. In cases where the data includes labels, the logical layout can be handled with tremendous flexibility. Without labels, the layout is positional, and some additional schema information is required to identify what data items are in each position.

Both the physical format and logical layout are essential to interpreting the data on a file. We'll look at a number of recipes for working with different physical formats. We'll also look at ways to divorce our program from some aspects of the logical layout.

In this chapter, we'll look at the following recipes:

  • Using pathlib to work with filenames
  • Replacing a file while preserving the previous version
  • Reading delimited files with the CSV module
  • Using dataclasses to simplify working with CSV files
  • Reading complex formats using regular expressions
  • Reading JSON and YAML documents
  • Reading XML documents
  • Reading HTML documents
  • Refactoring a .csv DictReader as a dataclass reader

In order to start doing input and output with files, we'll start by working with the OS filesystem. The common features of the directory structure of files and devices are described by Python's pathlib module. This module has consistent behavior across a number of operating systems, allowing a Python program to work similarly on Linux, macOS, and Windows.

Using pathlib to work with filenames

Most operating systems use a hierarchical path to identify a file. Here's an example filename, including the entire path:

/Users/slott/Documents/Writing/Python Cookbook/code

This full pathname has the following elements:

  • The leading / means the name is absolute. It starts from the root of the directory of files. In Windows, there can be an extra letter in front of the name, such as C:, to distinguish between the directories on individual storage devices. Linux and macOS treat all the devices as a unified hierarchy.
  • The names Users, slott, Documents, Writing, Python Cookbook, and code represent the directories (or "folders," as a visual metaphor) of the filesystem. The path names a top-level Users directory. This directory is expected to contain the slott subdirectory, and so on for each name in the path.
  • / is a separator between directory names. The Windows OS uses \ to separate items on the path. Python running on Windows, however, can use / because it automatically converts the more common / into the Windows path separator character gracefully; in Python, we can generally ignore the Windows use of \.

There is no way to tell what kind of filesystem object the name at the end of the path, "code", represents. The name code might be a directory that contains the names of other files. It could be an ordinary data file, or a link to a stream-oriented device. The operating system retains additional directory information that shows what kind of filesystem object this is.

A path without the leading / is relative to the current working directory. In macOS and Linux, the cd command sets the current working directory. In Windows, the chdir command does this job. The current working directory is a feature of the login session with the OS. It's made visible by the shell.
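Here's a small sketch showing how a relative path is interpreted against the current working directory; the exact output depends on where the interpreter was started:

from pathlib import Path

print(Path.cwd())                      # the current working directory
relative = Path("data") / "ch10_file1.yaml"
print(relative.resolve())              # the absolute path, based on the cwd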

This recipe will show you how we can work with pathlib.Path objects to get access to files in any OS directory structure.

Getting ready

It's important to separate two concepts:

  • The path that identifies a file, including the name and metadata like creation timestamps and ownership
  • The contents of the file

The path provides two things: an optional sequence of directory names and a mandatory filename. An OS directory includes each file's name, information about when each file was created, who owns the files, what the permissions are for each file, how many bytes the files use, and other details. The contents of the files are independent of the directory information: multiple directory entries can be linked to the same content.

Often, the filename has a suffix (or extension) as a hint as to what the physical format is. A file ending in .csv is likely a text file that can be interpreted as rows and columns of data. This binding between name and physical format is not absolute. File suffixes are only a hint and can be wrong.

In Python, the pathlib module handles all path-related processing. The module makes several distinctions among paths:

  • Pure paths that may or may not refer to an actual file
  • Concrete paths that are resolved; these refer to an actual file

This distinction allows us to create pure paths for files that our application will likely create or refer to. We can also create concrete paths for those files that actually exist on the OS. An application can resolve a pure path to a concrete path.
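A minimal sketch of this distinction: a Path object can be built for a file that doesn't exist yet, and filesystem-touching methods such as exists() and resolve() relate it to actual files. The filename here is hypothetical:

from pathlib import Path

planned = Path("reports") / "summary.csv"  # no file needs to exist yet
print(planned.exists())                    # False until something creates it
print(planned.resolve())                   # the absolute form, based on the cwd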

The pathlib module also makes a distinction between POSIX path objects (used by Linux and macOS) and Windows path objects. This distinction is rarely needed; most of the time, we don't want to care about the OS-level details of the path. An important reason for using pathlib is that we want processing that is isolated from details of the underlying OS. The cases where we might want to work with a PurePosixPath object directly are rare.

All of the mini recipes in this section will leverage the following:

>>> from pathlib import Path

We rarely need any of the other definitions from pathlib.

We'll also presume the argparse module is used to gather the file or directory names. For more information on argparse, see the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs. We'll use the options variable as a namespace that contains the input filename or directory name that the recipe works with.

For demonstration purposes, a mock argument parsing is shown by providing the following Namespace object:

>>> from argparse import Namespace 
>>> options = Namespace(
...     input='/path/to/some/file.csv',
...     file1='data/ch10_file1.yaml',
...     file2='data/ch10_file2.yaml',
... )

This options object has three mock argument values. The input value is an absolute path. The file1 and file2 values are relative paths.

How to do it...

We'll show a number of common pathname manipulations as separate mini recipes. These will include the following manipulations:

  • Making the output filename by changing the input filename's suffix
  • Making a number of sibling output files
  • Creating a directory and a number of files in the directory
  • Comparing file dates to see which is newer
  • Removing a file
  • Finding all files that match a given pattern

We'll start by creating an output filename based on an input filename. This reflects a common kind of application pattern where a source file in one physical format is transformed into a file in a distinct physical format.

Making the output filename by changing the input filename's suffix

Perform the following steps to make the output filename by changing the input suffix:

  1. Create the Path object from the input filename string. In this example, the PosixPath class is displayed because the author is using macOS. On a Windows machine, the class would be WindowsPath. The Path class will properly parse the string to determine the elements of the path. Here's how we create a path from a string:
    >>> input_path = Path(options.input) 
    >>> input_path 
    PosixPath('/path/to/some/file.csv') 
    
  2. Create the output Path object using the with_suffix() method:
    >>> output_path = input_path.with_suffix('.out') 
    >>> output_path 
    PosixPath('/path/to/some/file.out') 
    

All of the filename parsing is handled seamlessly by the Path class. The with_suffix() method saves us from manually parsing the text of the filename.

Making a number of sibling output files with distinct names

Perform the following steps to make a number of sibling output files with distinct names:

  1. Create a Path object from the input filename string. In this example, the PosixPath class is displayed because the author uses Linux. On a Windows machine, the class would be WindowsPath. The Path class will properly parse the string to determine the elements of the path:
    >>> input_path = Path(options.input) 
    >>> input_path 
    PosixPath('/path/to/some/file.csv') 
    
  2. Extract the parent directory and the stem from the filename. The stem is the name without the suffix:
    >>> input_directory = input_path.parent 
    >>> input_stem = input_path.stem 
    
  3. Build the desired output name. For this example, we'll append _pass to the filename. An input file of file.csv will produce an output of file_pass.csv:
    >>> output_stem_pass = f"{input_stem}_pass"
    >>> output_stem_pass 
    'file_pass' 
    
  4. Build the complete Path object:
    >>> output_path = (
    ...     input_directory / output_stem_pass
    ...     ).with_suffix('.csv') 
    >>> output_path 
    PosixPath('/path/to/some/file_pass.csv') 
    

The / operator assembles a new path from path components. We need to put the / operation in parentheses to be sure that it's performed first to create a new Path object. The input_directory variable has the parent Path object, and output_stem_pass is a simple string. After assembling a new path with the / operator, the with_suffix() method ensures a specific suffix is used.

Creating a directory and a number of files in the directory

The following steps are for creating a directory and a number of files in the newly created directory:

  1. Create the Path object from the input filename string. In this example, the PosixPath class is displayed because the author uses Linux. On a Windows machine, the class would be WindowsPath. The Path class will properly parse the string to determine the elements of the path:
    >>> input_path = Path(options.input) 
    >>> input_path 
    PosixPath('/path/to/some/file.csv') 
    
  2. Create the Path object for the output directory. In this case, we'll create an output directory as a subdirectory with the same parent directory as the source file:
    >>> output_parent = input_path.parent / "output" 
    >>> output_parent 
    PosixPath('/path/to/some/output')
    
  3. Create the output filename using the output Path object. In this example, the output directory will contain a file that has the same name as the input with a different suffix:
    >>> input_stem = input_path.stem 
    >>> output_path = (
    ...     output_parent / input_stem).with_suffix('.src') 
    

We've used the / operator to assemble a new Path object from the parent Path and a string based on the stem of a filename. Once a Path object has been created, we can use the with_suffix() method to set the desired suffix for the path.
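The steps above only build the Path objects. A sketch of the remaining work creates the directory and writes a file into it; the file content here is illustrative:

# Create the directory (and any missing parents); reuse it if it already exists.
output_parent.mkdir(parents=True, exist_ok=True)

# Write an illustrative file into the newly created directory.
output_path.write_text("lat,lon,date,time\n")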

Comparing file dates to see which is newer

The following are the steps to see newer file dates by comparing them:

  1. Create the Path objects from the input filename strings. The Path class will properly parse the string to determine the elements of the path:
    >>> file1_path = Path(options.file1) 
    >>> file2_path = Path(options.file2) 
    
  2. Use the stat() method of each Path object to get timestamps for the file. This method returns a stat object; within that stat object, the st_mtime attribute of that object provides the most recent modification time for the file:
    >>> file1_path.stat().st_mtime 
    1464460057.0 
    >>> file2_path.stat().st_mtime 
    1464527877.0 
    

The values are timestamps measured in seconds. We can compare the two values to see which is newer.

If we want a timestamp that's sensible to people, we can use the datetime module to create a proper datetime object from this:

>>> import datetime 
>>> mtime_1 = file1_path.stat().st_mtime 
>>> datetime.datetime.fromtimestamp(mtime_1) 
datetime.datetime(2016, 5, 28, 14, 27, 37) 

We can use the strftime() method to format the datetime object, or we can use the isoformat() method to produce a standardized display. Note that the local time zone offset is implicitly applied to the OS timestamp; depending on the OS configuration, a laptop may not show the same time as the server that created the file, because they're in different time zones.
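Continuing the example, here's a small sketch of both formatting options; the values shown assume the same timestamp and local time zone as above:

>>> datetime.datetime.fromtimestamp(mtime_1).isoformat()
'2016-05-28T14:27:37'
>>> datetime.datetime.fromtimestamp(mtime_1).strftime("%Y-%m-%d %H:%M")
'2016-05-28 14:27'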

Removing a file

The Linux term for removing a file is unlinking. Since a file may have many links, the actual data isn't removed until all links are removed. Here's how we can unlink files:

  1. Create the Path object from the input filename string. The Path class will properly parse the string to determine the elements of the path:
    >>> input_path = Path(options.input) 
    >>> input_path 
    PosixPath('/path/to/some/file.csv') 
    
  2. Use the unlink() method of this Path object to remove the directory entry. If this was the last directory entry for the data, then the space can be reclaimed by the OS:
    >>> try: 
    ...     input_path.unlink() 
    ... except FileNotFoundError as ex: 
    ...     print("File already deleted") 
    File already deleted 
    

If the file does not exist, a FileNotFoundError is raised. In some cases, this exception needs to be silenced with the pass statement. In other cases, a warning message might be important. It's also possible that a missing file represents a serious error.

Additionally, we can rename a file using the rename() method of a Path object. We can create new soft links using the symlink_to() method. To create OS-level hard links, we need to use the os.link() function.
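Here's a brief sketch of these operations; the filenames are hypothetical, and creating a symbolic link may require extra privileges on Windows:

import os
from pathlib import Path

source = Path("data/quotient.csv")            # a hypothetical existing file
backup = Path("data/quotient_backup.csv")

source.rename(backup)                         # rename the directory entry
Path("data/latest.csv").symlink_to(backup)    # create a soft link to the backup
os.link(backup, "data/quotient_hard.csv")     # create an OS-level hard link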

Finding all files that match a given pattern

The following are the steps to find all the files that match a given pattern:

  1. Create the Path object from the input directory name. The Path class will properly parse the string to determine the elements of the path:
    >>> Path(options.file1)
    PosixPath('data/ch10_file1.yaml')
    >>> directory_path = Path(options.file1).parent
    >>> directory_path
    PosixPath('data')
    
  2. Use the glob() method of the Path object to locate all files that match a given pattern. This will not recursively walk the entire directory tree; to search the whole tree, use the rglob() method or a pattern that starts with "**/":
    >>> list(directory_path.glob("*.csv"))
    [PosixPath('data/wc1.csv'), PosixPath('data/ex2_r12.csv'),
     PosixPath('data/wc.csv'), PosixPath('data/ch07_r13.csv'),
     PosixPath('data/sample.csv'),
     PosixPath('data/craps.csv'), PosixPath('data/output.csv'),
     PosixPath('data/fuel.csv'), PosixPath('data/waypoints.csv'),
     PosixPath('data/quotient.csv'),
     PosixPath('data/summary_log.csv'), PosixPath('data/fuel2.csv')]
    

With this, we've seen a number of mini recipes for using pathlib.Path objects for managing the file resources. This abstraction is helpful for simplifying access to the filesystem, as well as providing a uniform abstraction that works for Linux, macOS, and Windows.

How it works...

Inside the OS, a path is a sequence of directories (a folder is a visual depiction of a directory). In a name such as /Users/slott/Documents/writing, the root directory, /, contains a directory named Users. This contains a subdirectory, slott, which contains Documents, which contains writing.

In some cases, a simple string representation can be used to summarize the navigation from root to directory, through to the final target directory. The string representation, however, makes many kinds of path operations into complex string parsing problems.

The Path class definition simplifies operations on paths. These operations on Path include the following examples:

  • Extract the parent directory, as well as a sequence of all enclosing directory names.
  • Extract the final name, the stem of the final name, and the suffix of the final name.
  • Replace the suffix with a new suffix or replace the entire name with a new name.
  • Convert a string into a Path. Also, convert a Path into a string. Many OS functions and parts of Python prefer to use filename strings.
  • Build a new Path object from an existing Path joined with a string using the / operator.

A concrete Path represents an actual filesystem resource. For concrete Paths, we can do a number of additional manipulations of the directory information:

  • Determine what kind of directory entry this is; that is, an ordinary file, a directory, a link, a socket, a named pipe (or FIFO), a block device, or a character device.
  • Get the directory details, including information such as timestamps, permissions, ownership, size, and so on. We can also modify many of these things.
  • Unlink (that is, remove) the directory entry.

Just about anything we might want to do with directory entries for files can be done with the pathlib module. The few exceptions are part of the os or os.path module.
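Here's a short demonstration of several of these operations on a path; the filename is illustrative, and the PosixPath class appears because this was run on a POSIX-style OS (on Windows, it would be WindowsPath):

>>> from pathlib import Path
>>> p = Path("/Users/slott/Documents/Writing/report.csv")
>>> p.parent
PosixPath('/Users/slott/Documents/Writing')
>>> p.name, p.stem, p.suffix
('report.csv', 'report', '.csv')
>>> p.with_suffix('.json')
PosixPath('/Users/slott/Documents/Writing/report.json')
>>> p.parts
('/', 'Users', 'slott', 'Documents', 'Writing', 'report.csv')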

There's more...

When we look at other file-related recipes in the rest of this chapter, we'll use Path objects to name the files. The objective is to avoid trying to use strings to represent paths.

The pathlib module makes a small distinction between POSIX pure path objects and Windows pure path objects. Most of the time, we don't care about the OS-level details of the path.

There are two cases where it can help to produce pure paths for a specific operating system:

  • If we do development on a Windows laptop, but deploy web services on a Linux server, it may be necessary to use PurePosixPath in unit test cases. This allows us to write test cases on the Windows development machine that reflect actual intended use on a Linux server.
  • If we do development on a macOS (or Linux) laptop, but deploy exclusively to Windows servers, it may be necessary to use PureWindowsPath.

The following snippet shows how to create Windows-specific Path objects:

>>> from pathlib import PureWindowsPath 
>>> home_path = PureWindowsPath(r'C:\Users\slott') 
>>> name_path = home_path / 'filename.ini' 
>>> name_path 
PureWindowsPath('C:/Users/slott/filename.ini') 
>>> str(name_path) 
'C:\\Users\\slott\\filename.ini' 

Note that the \ characters are normalized into / when displaying the PureWindowsPath object. Using the str() function retrieves a path string with the separators appropriate for the Windows OS.

When we use the generic Path class, we always get a subclass appropriate to the user's environment, which may or may not be Windows. By using PureWindowsPath, we've bypassed the mapping to the user's actual OS.

See also

  • In the Replacing a file while preserving the previous version recipe, later in this chapter, we'll look at how to leverage the features of a Path to create a temporary file and then rename the temporary file to replace the original file.
  • In the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs, we looked at one very common way to get the initial string that will be used to create a Path object.

Replacing a file while preserving the previous version

We can leverage the power of pathlib to support a variety of filename manipulations. In the Using pathlib to work with filenames recipe, earlier in this chapter, we looked at a few of the most common techniques for managing directories, filenames, and file suffixes.

One common file processing requirement is to create output files in a fail-safe manner. That is, the application should preserve any previous output file, no matter how or where the application fails.

Consider the following scenario:

  1. At time t0, there's a valid output.csv file from the previous run of the long_complex.py application.
  2. At time t1, we start running the long_complex.py application. It begins overwriting the output.csv file. Until the program finishes, the bytes are unusable.
  3. At time t2, the application crashes. The partial output.csv file is useless. Worse, the valid file from time t0 is not available either since it was overwritten.

We need a way to preserve the previous state of the file, and only replace the file when the new content is complete and correct. In this recipe, we'll look at an approach to creating output files that's safe in the event of a failure.

Getting ready

For files that don't span across physical devices, fail-safe file output generally means creating a new copy of the file using a temporary name. If the new file can be created successfully, then the old file should be replaced using a single, atomic rename operation.

The goal is to create files in such a way that at any time prior to the final rename, a crash will leave the original file in place. Subsequent to the final rename, the new file should be in place.

We can add capabilities to preserve the old file as well. This provides a recovery strategy. In case of a catastrophic problem, the old file can be renamed manually to make it available as the original file.

There are several ways to approach this. The fileinput module has an inplace=True option that permits reading a file while redirecting standard output to write a replacement of the input file. We'll show a more general approach that works for any file. This uses three separate files and does two renames:

  • The important output file we want to preserve in a valid state at all times, for example, output.csv.
  • A temporary version of the file: output.csv.new. There are a variety of conventions for naming this file. Sometimes, extra characters such as ~ or # are placed on the filename to indicate that it's a temporary, working file; for example, output.csv~. Sometimes, it will be in the /tmp filesystem.
  • The previous version of the file: output.csv.old. Any previous .old file will be removed as part of finalizing the output. Sometimes, the previous version is given a suffix of .bak, meaning "backup."

To create a concrete example, we'll work with a file that has a very small but precious piece of data: a Quotient. Here's the definition for this Quotient object:

from dataclasses import dataclass, asdict, fields
@dataclass
class Quotient:
    numerator: int
    denominator: int

The following function will write this object to a file in CSV notation:

import csv
from pathlib import Path
from typing import Iterable

def save_data(
        output_path: Path, data: Iterable[Quotient]) -> None:
    with output_path.open("w", newline="") as output_file:
        headers = [f.name for f in fields(Quotient)]
        writer = csv.DictWriter(output_file, headers)
        writer.writeheader()
        for q in data:
            writer.writerow(asdict(q))

We've opened a file with a context manager to be guaranteed the file will be closed. The headers variable is the list of attribute names in the Quotient dataclass. We can use these headers to create a CSV writer, and then emit all the given instances of the Quotient dataclass.

Some typical contents of the file are shown here:

numerator,denominator
87,32

Yes. This is a silly little file. We can imagine that it might be an important part of the security configuration for a web server, and changes must be managed carefully by the administrators.

In the unlikely event of a problem when writing the data object to the file, we could be left with a corrupted, unusable output file. We'll wrap this function with another to provide a reliable write.

How to do it...

We start creating our wrapper by importing the classes we need:

  1. Import the Path class from the pathlib module:
    from pathlib import Path 
    
  2. Define a "wrapper" function to encapsulate the save_data() function with some extra features. The function signature is the same as it is for the save_data() function:
    def safe_write(
            output_path: Path, data: Iterable[Quotient]
            ) -> None:
    
  3. Save the original suffix and create a new suffix with .new at the end. This is a temporary file. If it is written properly, with no exceptions, then we can rename it so that it's the target file:
        ext = output_path.suffix
        output_new_path = output_path.with_suffix(f'{ext}.new')
        save_data(output_new_path, data)
    
  4. Before saving the current file, remove any previous backup copy. We'll remove an .old file, if one exists. If there's no .old file, we can use the missing_ok option to ignore the FileNotFound exception:
        output_old_path = output_path.with_suffix(f'{ext}.old')
        output_old_path.unlink(missing_ok=True)
    
  5. Now, we can preserve the current file with the name of .old to save it in case of problems:
        try:
            output_path.rename(output_old_path)
        except FileNotFoundError as ex:
            # No previous file. That's okay.
            pass
    
  6. The final step is to make the temporary .new file the official output:
        try:
            output_new_path.rename(output_path)
        except IOError as ex:
            # Possible recovery...
            output_old_path.rename(output_path)
    

This multi-step process uses two rename operations:

  • Rename the current version to a version with .old appended to the suffix.
  • Rename the new version, with .new appended to the suffix, to the current version of the file.

A Path object also has a replace() method; this always overwrites the target file, with no warning if the target already exists. The choice between rename() and replace() depends on how our application needs to handle cases where old versions of files may be left in the filesystem. We've used rename() in this recipe to avoid silently overwriting files when there are multiple problems. A variation could use replace() to always replace a file.
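A sketch of that variation replaces the final try/except with a single call; any existing target is silently overwritten:

# Variation on the final step: always overwrite the official output.
output_new_path.replace(output_path)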

How it works...

This process involves three separate OS operations: an unlink and two renames. This leads to a situation in which the .old file is preserved and can be used to recover the previously good state.

Here's a timeline that shows the state of the various files. We've labeled the content as version 0 (an older backup), version 1 (the previous contents), and version 2 (the revised contents):

Time   Operation                          .csv.old    .csv        .csv.new
t0                                        version 0   version 1
t1     Mid-creation                       version 0   version 1   will appear corrupt if used
t2     Post-creation, closed              version 0   version 1   version 2
t3     After unlinking .csv.old                       version 1   version 2
t4     After renaming .csv to .csv.old    version 1               version 2
t5     After renaming .csv.new to .csv    version 1   version 2

Timeline of file operations

While there are several opportunities for failure, there's no ambiguity about which file is valid:

  • If there's a .csv file, it's the current, valid file.
  • If there's no .csv file, then the .csv.old file is a backup copy, which can be used for recovery.

Since none of these operations involve actually copying the files, the operations are all extremely fast and reliable.

There's more...

In some enterprise applications, output files are organized into directories with names based on timestamps. This can be handled gracefully by the pathlib module. We might, for example, have an archive directory for old files:

archive_path = Path("/path/to/archive") 

We may want to create date-stamped subdirectories for keeping temporary or working files:

import datetime 
today = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") 

We can then do the following to define a working directory:

working_path = archive_path / today 
working_path.mkdir(parents=True, exist_ok=True) 

The mkdir() method will create the expected directory. The parents=True argument ensures that all missing parent directories will also be created. This can be handy the very first time an application is executed. exist_ok=True is handy so that if a directory already exists, it can be reused without raising an exception.

parents=True is not the default. With the default of parents=False, the method will raise a FileNotFoundError exception when a required parent directory doesn't exist.

Similarly, exist_ok=True is not the default. By default, if the directory already exists, a FileExistsError exception is raised. Including these options to make the operation silent when the directory already exists can be helpful.

Also, it's sometimes appropriate to use the tempfile module to create temporary files. This module can create filenames that are guaranteed to be unique. This allows a complex server process to create temporary files without regard to filename conflicts.
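As a small sketch, a uniquely named temporary file can be created next to the final output and then renamed into place. The details here are illustrative rather than part of the recipe above:

import csv
import tempfile
from pathlib import Path

target = Path("data/output.csv")
with tempfile.NamedTemporaryFile(
        mode="w", dir=target.parent, suffix=".new",
        delete=False, newline="") as temp_file:
    writer = csv.writer(temp_file)
    writer.writerow(["numerator", "denominator"])
    writer.writerow([87, 32])

# The unique temporary name avoids conflicts; a final rename puts it in place.
Path(temp_file.name).replace(target)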

See also

  • In the Using pathlib to work with filenames recipe, earlier in this chapter, we looked at the fundamentals of the Path class.
  • In Chapter 11, Testing, we'll look at some techniques for writing unit tests that can ensure that parts of this will behave properly.
  • In Chapter 6, User Inputs and Outputs, the Using contexts and context managers recipe shows additional details regarding working with the with statement to ensure file operations complete properly, and that all of the OS resources are released.

Reading delimited files with the CSV module

One commonly used data format is comma-separated values (CSV). We can generalize this to think of the comma character as simply one of many candidate separator characters. For example, a CSV file can use the | character as the separator between columns of data. This generalization for separators other than the comma makes CSV files particularly powerful.

How can we process data in one of the wide varieties of CSV formats?

Getting ready

A summary of a file's content is called a schema. It's essential to distinguish between two aspects of the schema.

The physical format of the file: For CSV, this means the file's bytes encode text. The text is organized into rows and columns using a row separator character (or characters) and a column separator character. Many spreadsheet products will use the comma (",") as the column separator and the \r\n sequence of characters as the row separator. The specific combination of punctuation characters in use is called the CSV dialect.

The logical layout of the data in the file: This is the sequence of data columns that are present. There are several common cases for handling the logical layout in CSV files:

  • The file has one line of headings. This is ideal and fits nicely with the way the CSV module works. It can be helpful when the headings are also proper Python variable names.
  • The file has no headings, but the column positions are fixed. In this case, we can impose headings on the file when we open it.
  • If the file has no headings and the column positions aren't fixed, additional schema information is required to interpret the columns of data.
  • Some files can have multiple lines of headings. In this case, we have to write special processing to skip past these lines. We will also have to replace complex headings with something more useful in Python.
  • An even more difficult case is where the file is not in proper First Normal Form (1NF). In 1NF, each row is independent of all other rows. When a file is not in this normal form, we'll need to add a generator function to rearrange the data into 1NF. See the Slicing and dicing a list recipe in Chapter 4, Built-In Data Structures Part 1: Lists and Sets, and the Using stacked generator expressions recipe in the online chapter, Chapter 9, Functional Programming Features (link provided in the Preface), for other recipes that work on normalizing data structures.

We'll look at a CSV file that has some real-time data recorded from the log of a sailboat. This is the waypoints.csv file. The data looks as follows:

lat,lon,date,time 
32.8321666666667,-79.9338333333333,2012-11-27,09:15:00 
31.6714833333333,-80.93325,2012-11-28,00:00:00 
30.7171666666667,-81.5525,2012-11-28,11:35:00 

This data contains four columns named in the first line of the file: lat, lon, date, and time. These describe a waypoint and need to be reformatted to create more useful information.

How to do it...

  1. Import the csv module and the Path class:
    import csv 
    from pathlib import Path
    
  2. Examine the data file to confirm the following features:
    • The column separator character is ',', which is the default.
    • The row separator characters are '\r\n', also widely used in both Windows and Linux. Python's universal newlines feature means that the Linux standard '\n' will work just as well as a row separator.
    • There is a single-row heading. If it isn't present, the headings should be provided separately when the reader object is created.
  3. Define the raw() function to read raw data from a Path that refers to the file:
    def raw(data_path: Path) -> None:
    
  4. Use the Path object to open the file in a with statement:
        with data_path.open() as data_file:
    
  5. Create the CSV reader from the open file object. This is indented inside the with statement:
            data_reader = csv.DictReader(data_file)
    
  6. Read (and process) the various rows of data. This is properly indented inside the with statement. For this example, we'll only print the rows:
            for row in data_reader: 
                print(row)
    

Here's the function that we created:

def raw(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        for row in data_reader:
            print(row)

The output from the raw() function is a series of dictionaries that looks as follows:

    {'date': '2012-11-27', 
     'lat': '32.8321666666667', 
     'lon': '-79.9338333333333', 
     'time': '09:15:00'} 

We can now process the data by referring to the columns as dictionary items, using syntax like row['date']. Using the column names is more descriptive than referring to a column by position; row[0], for example, is hard to understand.

To be sure that we're using the column names correctly, the typing.TypedDict type hint can be used to provide the expected column names.

How it works...

The csv module handles the physical format work of separating the rows from each other, and also separating the columns within each row. The default rules ensure that each input line is treated as a separate row and that the columns are separated by ",".

What happens when we need to use the column separator character as part of data? We might have data like this:

    lat,lon,date,time,notes 
    32.832,-79.934,2012-11-27,09:15:00,"breezy, rainy" 
    31.671,-80.933,2012-11-28,00:00:00,"blowing ""like stink""" 

The notes column has data in the first row, which includes the "," column separator character. The rules for CSV allow a column's value to be surrounded by quotes. By default, the quoting characters are ". Within these quoting characters, the column and row separator characters are ignored.

In order to embed the quote character within a quoted string, it is doubled. The second example row shows how the value blowing "like stink" is encoded by doubling the quote characters when they are part of the value of a column. These quoting rules mean that a CSV file can represent any combination of characters, including the row and column separator characters.
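The csv module applies these quoting rules automatically on both output and input. Here's a small round-trip sketch using an in-memory io.StringIO buffer:

>>> import csv, io
>>> buffer = io.StringIO()
>>> writer = csv.writer(buffer)
>>> _ = writer.writerow(["2012-11-28", 'blowing "like stink"'])
>>> buffer.getvalue()
'2012-11-28,"blowing ""like stink"""\r\n'
>>> next(csv.reader(io.StringIO(buffer.getvalue())))
['2012-11-28', 'blowing "like stink"']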

The values in a CSV file are always strings. A string value like 7331 may look like a number to us, but it's always text when processed by the csv module. This makes the processing simple and uniform, but it can be awkward for our Python application programs.

When data is saved from a manually prepared spreadsheet, the data may reveal the quirks of the desktop software's internal rules for data display. It's surprisingly common, for example, to have a column of data that is displayed as a date on the desktop software but shows up as a floating-point number in the CSV file.

There are two solutions to the date-as-number problem. One is to add a column in the source spreadsheet to properly format the date as a string. Ideally, this is done using ISO rules so that the date is represented in YYYY-MM-DD format. The other solution is to recognize the spreadsheet date as a number of days past some epochal date. The epochal dates vary slightly, but they're generally either Jan 1, 1900 or Jan 1, 1904.
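The second approach can be sketched with datetime arithmetic. This assumes the widely used 1900 date system, for which the effective conversion epoch for modern dates is December 30, 1899; the serial number shown is illustrative:

>>> import datetime
>>> epoch = datetime.date(1899, 12, 30)
>>> serial = 41240                     # a hypothetical spreadsheet date value
>>> epoch + datetime.timedelta(days=serial)
datetime.date(2012, 11, 27)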

There's more...

As we saw in the Combining map and reduce transformations recipe in Chapter 9, Functional Programming Features, there's often a pipeline of processing that includes cleaning and transforming the source data. In this specific example, there are no extra rows that need to be eliminated. However, each column needs to be converted into something more useful.

To transform the data into a more useful form, we'll use a two-part design. First, we'll define a row-level cleansing function. In this case, we'll create a dictionary object by adding additional values that are derived from the input data. A clean_row() function can look like this:

import datetime 
from typing import Any, Dict, Iterator, cast

Raw = Dict[str, Any]
Waypoint = Dict[str, Any]
def clean_row(source_row: Raw) -> Waypoint:
    ts_date = datetime.datetime.strptime(
        source_row["date"], "%Y-%m-%d"
    ).date()
    ts_time = datetime.datetime.strptime(
        source_row["time"], "%H:%M:%S"
    ).time()
    return dict(
        date=source_row["date"],
        time=source_row["time"],
        lat=source_row["lat"],
        lon=source_row["lon"],
        lat_lon=(
            float(source_row["lat"]),
            float(source_row["lon"])
        ),
        ts_date=ts_date,
        ts_time=ts_time,
        timestamp = datetime.datetime.combine(
            ts_date, ts_time
        )
    )

Here, we've created some new column values. The column named lat_lon has a two-tuple with proper floating-point values instead of strings. We've also parsed the date and time values to create datetime.date and datetime.time objects. We've combined the date and time into a single, useful value, which is the value of the timestamp column.

Once we have a row-level function for cleaning and enriching our data, we can map this function to each row in the source of data. We can use map(clean_row, reader) or we can write a function that embodies this processing loop:

def cleanse(reader: csv.DictReader) -> Iterator[Waypoint]:
    for row in reader:
        yield clean_row(cast(Raw, row))

This can be used to provide more useful data from each row:

def display_clean(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        clean_data_reader = cleanse(data_reader)
        for row in clean_data_reader:
            pprint(row)

We've used the cleanse() function to create a very small stack of transformation rules. The stack starts with data_reader, and only has one other item in it. This is a good beginning. As the application software is expanded to do more computations, the stack will expand.

These cleansed and enriched rows look as follows:

{'date': '2012-11-27',
 'lat': '32.8321666666667',
 'lat_lon': (32.8321666666667, -79.9338333333333),
 'lon': '-79.9338333333333',
 'time': '09:15:00',
 'timestamp': datetime.datetime(2012, 11, 27, 9, 15),
 'ts_date': datetime.date(2012, 11, 27),
 'ts_time': datetime.time(9, 15)}

We've added columns such as lat_lon, which have proper numeric values instead of strings. We've also added timestamp, which has a full date-time value that can be used for simple computations of elapsed time between waypoints.

We can leverage the typing.TypedDict type hint to make a stronger statement about the structure of the dictionary data that will be processed. The initial data has known column names, each of which has string data values. We can define the raw data as follows:

from typing import TypedDict
class Raw_TD(TypedDict):
    date: str
    time: str
    lat: str
    lon: str

The cleaned data has a more complex structure. We can define the output from a clean_row_td() function as follows:

class Waypoint_TD(Raw_TD):
    lat_lon: Tuple[float, float]
    ts_date: datetime.date
    ts_time: datetime.time
    timestamp: datetime.datetime

The Waypoint_TD TypedDict definition extends Raw_TD TypedDict to make the outputs from the cleanse() function explicit. This lets us use the mypy tool to confirm that the cleanse() function – and any other processing – adheres to the expected keys and value types in the dictionary.
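A clean_row_td() function using these definitions could look like the following sketch; it mirrors the earlier clean_row() function, with the TypedDict classes as its type hints:

def clean_row_td(source_row: Raw_TD) -> Waypoint_TD:
    ts_date = datetime.datetime.strptime(
        source_row["date"], "%Y-%m-%d").date()
    ts_time = datetime.datetime.strptime(
        source_row["time"], "%H:%M:%S").time()
    return Waypoint_TD(
        date=source_row["date"],
        time=source_row["time"],
        lat=source_row["lat"],
        lon=source_row["lon"],
        lat_lon=(
            float(source_row["lat"]),
            float(source_row["lon"])
        ),
        ts_date=ts_date,
        ts_time=ts_time,
        timestamp=datetime.datetime.combine(ts_date, ts_time),
    )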

See also

  • See the Combining map and reduce transformations recipe in Chapter 9, Functional Programming Features, for more information on the idea of a processing pipeline or stack.
  • See the Slicing and dicing a list recipe in Chapter 4, Built-In Data Structures Part 1: Lists and Sets, and the Using stacked generator expressions recipe in the online chapter, Chapter 9, Functional Programming Features (link provided in the Preface), for more information on processing a CSV file that isn't in a proper 1NF.
  • For more information on the with statement, see the Reading and writing files with context managers recipe in Chapter 7, Basics of Classes and Objects.

Using dataclasses to simplify working with CSV files

One commonly used data format is known as CSV. Python's csv module has a very handy DictReader class definition. When a file contains a one-row header, the header row's values become keys that are used for all the subsequent rows. This allows a great deal of flexibility in the logical layout of the data. For example, the column ordering doesn't matter, since each column's data is identified by a name taken from the header row.

This leads to dictionary-based references to a column's data. We're forced to write, for example, row['lat'] or row['date'] to refer to data in specific columns. While this isn't horrible, it would be much nicer to use syntax like row.lat or row.date to refer to column values.

Additionally, we often have derived values that should – perhaps – be properties of a class definition instead of a separate function. This can properly encapsulate important attributes and operations into a single class definition.

The dictionary data structure has awkward-looking syntax for the column references. If we use dataclasses for each row, we can change references from row['name'] to row.name.

Getting ready

We'll look at a CSV file that has some real-time data recorded from the log of a sailboat. This file is the waypoints.csv file. The data looks as follows:

lat,lon,date,time 
32.8321666666667,-79.9338333333333,2012-11-27,09:15:00 
31.6714833333333,-80.93325,2012-11-28,00:00:00 
30.7171666666667,-81.5525,2012-11-28,11:35:00 

The first line contains a header that names the four columns, lat, lon, date, and time. The data can be read by a csv.DictReader object. However, we'd like to do more sophisticated work, so we'll create a @dataclass class definition that can encapsulate the data and the processing we need to do.

How to do it…

We need to start with a dataclass that reflects the available data, and then we can use this dataclass with a dictionary reader:

  1. Import the definitions from the various libraries that are needed:
    from dataclasses import dataclass, field
    import csv
    import datetime
    from pathlib import Path
    from pprint import pprint
    from typing import Tuple, Iterator
    
  2. Define a dataclass narrowly focused on the input, precisely as it appears in the source file. We've called the class RawRow. In a complex application, a more descriptive name than RawRow would be appropriate. This definition of the attributes may change as the source file organization changes:
    @dataclass
    class RawRow:
        date: str
        time: str
        lat: str
        lon: str
    
  3. Define a second dataclass where objects are built from the source dataclass attributes. This second class is focused on the real work of the application. The source data is in a single attribute, raw, in this example. Fields computed from this source data are all initialized with field(init=False) because they'll be computed after initialization:
    @dataclass
    class Waypoint:
        raw: RawRow
        lat_lon: Tuple[float, float] = field(init=False)
        ts_date: datetime.date = field(init=False)
        ts_time: datetime.time = field(init=False)
        timestamp: datetime.datetime = field(init=False)
    
  4. Add the __post_init__() method to initialize all of the derived fields:
    def __post_init__(self):
        self.ts_date = datetime.datetime.strptime(
            self.raw.date, "%Y-%m-%d"
        ).date()
        self.ts_time = datetime.datetime.strptime(
            self.raw.time, "%H:%M:%S"
        ).time()
        self.lat_lon = (
            float(self.raw.lat),
            float(self.raw.lon)
        )
        self.timestamp = datetime.datetime.combine(
            self.ts_date, self.ts_time
        )
    
  5. Given these two dataclass definitions, we can create an iterator that will accept individual dictionaries from a csv.DictReader object and create the needed Waypoint objects. The intermediate representation, RawRow, is a convenience so that we can assign attribute names to the source data columns:
    def waypoint_iter(reader: csv.DictReader) -> Iterator[Waypoint]:
        for row in reader:
            raw = RawRow(**row)
            yield Waypoint(raw)
    

The waypoint_iter() function creates RawRow objects from the input dictionary, then creates the final Waypoint objects from the RawRow instances. This two-step processing is helpful for managing changes to the source or the processing.

We can use the following function to read and display the CSV data:

def display(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        for waypoint in waypoint_iter(data_reader):
            pprint(waypoint)

This function uses the waypoint_iter() function to create Waypoint objects from the dictionaries read by the csv.DictReader object. Each Waypoint object contains a reference to the original raw data:

Waypoint(
    raw=RawRow(
        date='2012-11-27', 
        time='09:15:00',
        lat='32.8321666666667',
        lon='-79.9338333333333'),
    lat_lon=(32.8321666666667, -79.9338333333333),
    ts_date=datetime.date(2012, 11, 27), 
    ts_time=datetime.time(9, 15),
    timestamp=datetime.datetime(2012, 11, 27, 9, 15)
)

Having the original input object can sometimes be helpful when diagnosing problems with the source data or the processing.

How it works…

The source dataclass, the RawRow class in this example, is designed to match the input document. The dataclass definition has attribute names that are exact matches for the source column names, and the attribute types are all strings to match the CSV input types.

Because the names match, the RawRow(**row) expression will work to create an instance of the RawRow class from the DictReader dictionary.

From this initial, or raw, data, we can derive the more useful data, as shown in the Waypoint class definition. The __post_init__() method transforms the initial value in the self.raw attribute into a number of more useful attribute values.

We've separated the Waypoint object's creation from reading raw data. This lets us manage the following two kinds of common changes to application software:

  1. The source data can change because the spreadsheet was adjusted manually. This is common: a person may change column names or change the order of the columns.
  2. The required computations may change as the application's focus expands or shifts. More derived columns may be added, or the algorithms may change. We want to isolate this application-specific processing from reading the raw data.

It's helpful to disentangle the various aspects of a program so that we can let them evolve independently. Gathering, cleaning, and filtering source data is one aspect of this. The resulting computations are a separate aspect, unrelated to the format of the source data.

There's more…

In many cases, the source CSV file will have headers that do not map directly to valid Python attribute names. In these cases, the keys present in the source dictionary must be mapped to the column names. This can be managed by expanding the RawRow class definition to include a __post_init__() method. This will help build the RawRow dataclass object from a dictionary built from CSV row headers that aren't useful Python key names.

The following example defines a class called RawRow_HeaderV2. This reflects a variant spreadsheet with different row names. We've defined the attributes with field(init=False) and provided a __post_init__() method, as shown in this code block:

@dataclass
class RawRow_HeaderV2:
    source: Dict[str, str]
    date: str = field(init=False)
    time: str = field(init=False)
    lat: str = field(init=False)
    lon: str = field(init=False)
    def __post_init__(self):
        self.date = self.source['Date of Travel (YYYY-MM-DD)']
        self.time = self.source['Arrival Time (HH:MM:SS)']
        self.lat = self.source['Latitude (degrees N)']
        self.lon = self.source['Longitude (degrees W)']

This RawRow_HeaderV2 class creates objects that are compatible with the RawRow class. Either of these classes of objects can also be transformed into Waypoint instances.

For an application that works with a variety of data sources, these kinds of "raw data transformation" dataclasses can be handy for mapping the minor variations in a logical layout to a consistent internal structure for further processing.

As the number of input transformation classes grows, additional type hints are required. For example, the following type hint provides a common name for the variations in input format:

Raw = Union[RawRow, RawRow_HeaderV2]

This type hint helps to unify the original RawRow and the alternative RawRow_HeaderV2 as alternative type definitions with compatible features. This can also be done with a Protocol type hint that spells out the common attributes.
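A Protocol version of this idea could look like the following sketch; the RawProtocol name is an invention for this example, and any class providing these four string attributes would satisfy it:

from typing import Protocol

class RawProtocol(Protocol):
    """The attributes shared by all of the raw-row variants."""
    date: str
    time: str
    lat: str
    lon: str

Functions that only need these four attributes can then accept any of the raw-row classes by using RawProtocol as a parameter's type hint.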

See also

  • The Reading delimited files with the CSV module recipe, earlier in this chapter, also covers CSV file reading.
  • In Chapter 6, User Inputs and Outputs, the Using dataclasses for mutable objects recipe also covers ways to use Python's dataclasses.

Reading complex formats using regular expressions

Many file formats lack the elegant regularity of a CSV file. One common file format that's rather difficult to parse is a web server log file. These files tend to have complex data without a single, uniform separator character or consistent quoting rules.

When we looked at a simplified log file in the Writing generator functions with the yield statement recipe in the online chapter, Chapter 9, Functional Programming Features (link provided in the Preface), we saw that the rows look as follows:

[2016-05-08 11:08:18,651] INFO in ch09_r09: Sample Message One 
[2016-05-08 11:08:18,651] DEBUG in ch09_r09: Debugging 
[2016-05-08 11:08:18,652] WARNING in ch09_r09: Something might have gone wrong 

There are a variety of punctuation marks being used in this file. The csv module can't handle this complexity.

We'd like to write programs with the elegant simplicity of CSV processing. This means we'll need to encapsulate the complexities of log file parsing and keep this aspect separate from analysis and summary processing.

Getting ready

Parsing a file with a complex structure generally involves writing a function that behaves somewhat like the reader() function in the csv module. In some cases, it can be easier to create a small class that behaves like the DictReader class.

The core feature of reading a complex file is a function that will transform one line of text into a dictionary or tuple of individual field values. This job can often be done by the re package.

Before we can start, we'll need to develop (and debug) the regular expression that properly parses each line of the input file. For more information on this, see the String parsing with regular expressions recipe in Chapter 1, Numbers, Strings, and Tuples.

For this example, we'll use the following code. We'll define a pattern string with a series of regular expressions for the various elements of the line:

import re 
pattern_text = (
    r"\[   (?P<date>.*?)  \]\s+"
    r"     (?P<level>\w+)   \s+"
    r"in\s+(?P<module>\S+?)"
    r":\s+ (?P<message>.+)"
)
pattern = re.compile(pattern_text, re.X)

We've used the re.X option so that we can include extra whitespace in the regular expression. This can help to make it more readable by separating prefix and suffix characters.

There are four fields that are captured and a number of characters that are part of the template, but they never vary. Here's how this regular expression works:

  • The date-time stamp contains digits, hyphens, colons, and a comma; it's surrounded by [ and ]. We've had to use \[ and \] to escape the normal meaning of [ and ] in a regular expression. This is saved as date in the resulting group dictionary.
  • The severity level is a single run of "word" characters, \w. This is saved as level in the dictionary of groups created by parsing text with this regular expression.
  • The module name has a preface of the characters in; these are not captured. After this preface, the name is a sequence of non-whitespace characters, matched by \S. The module is saved as a group named module.
  • Finally, there's a message. This has a preface of an extra ':' character we can ignore, then more spaces we can also ignore. Finally, the message itself starts and extends to the end of the line.

When we write a regular expression, we can wrap the interesting sub-strings to capture in (). After performing a match() or search() operation, the resulting Match object will have values for the matched substrings. The groups() method of a Match object and the groupdict() method of a Match object will provide the captured strings.

Note that we've used the \s+ sequence to quietly skip one or more space-like characters. The sample data appears to always use a single space as the separator. However, when absorbing whitespace, using \s+ seems to be a slightly more general approach because it permits extra spaces.

Here's how this pattern works:

>>> sample_data = '[2016-05-08 11:08:18,651] INFO in ch10_r09: Sample Message One' 
>>> match = pattern.match(sample_data) 
>>> match.groups() 
('2016-05-08 11:08:18,651', 'INFO', 'ch10_r09', 'Sample Message One')
>>> match.groupdict()
{'date': '2016-05-08 11:08:18,651',
'level': 'INFO',
'module': 'ch10_r09',
'message': 'Sample Message One'}

We've provided a line of sample data in the sample_data variable. The match object has a groups() method that returns each of the interesting fields. The groupdict() method of a match object returns a dictionary whose keys are the names provided in the (?P<name>...) prefixes of the capturing groups.

How to do it...

This recipe is split into two parts. The first part defines a log_parser() function to parse a single line, while the second part uses the log_parser() function for each line of input.

Defining the parse function

Perform the following steps to define the log_parser() function:

  1. Define the compiled regular expression object. It helps us use the (?P<name>...) regular expression construct to create a dictionary key for each piece of data that's captured. The resulting dictionary will then contain useful, meaningful keys:
    import re 
    pattern_text = (
        r"[   (?P<date>.*?)  ]s+"
        r"     (?P<level>w+)   s+"
        r"ins+(?P<module>.+?)"
        r":s+ (?P<message>.+)"
    )
    pattern = re.compile(pattern_text, re.X)
    
  2. Define a class to model the resulting complex data object. This can have additional derived properties or other complex computations. Minimally, a NamedTuple must define the fields that are extracted by the parser. The field names should match the regular expression capture names in the ?P<name> prefix:
    class LogLine(NamedTuple):
        date: str
        level: str
        module: str
        message: str
    
  3. Define a function that accepts a line of text as an argument:
    def log_parser(source_line: str) -> LogLine:
    
  4. Apply the regular expression to create a match object. We've assigned it to the match variable and also checked to see if it is not None:
        if match := pattern.match(source_line):
    
  5. When the match is not None, return a useful data structure with the various pieces of data from this input line. The cast(Match, match) expression is necessary to help mypy; it states that the match object will not be None, but will always be a valid instance of the Match class. It's likely that a future release of mypy will not need this:
            data = cast(Match, match).groupdict()
            return LogLine(**data)
    
  6. When the match is None, either log the problem or raise an exception to stop processing because there's a problem:
        raise ValueError(f"Unexpected input {source_line=}")
    

Here's the log_parser() function, gathered together with the typing imports it relies on (pattern and LogLine come from steps 1 and 2):

from typing import Match, cast

def log_parser(source_line: str) -> LogLine:
    if match := pattern.match(source_line):
        data = cast(Match, match).groupdict()
        return LogLine(**data)
    raise ValueError(f"Unexpected input {source_line=}")

The log_parser() function can be used to parse each line of input. The text is transformed into a NamedTuple instance with field names and values based on the fields found by the regular expression parser. These field names must match the field names in the NamedTuple class definition.

Using the log_parser() function

This portion of the recipe will apply the log_parser() function to each line of the input file:

  1. From pathlib, import the Path class definition:
    from pathlib import Path
    
  2. Create the Path object that identifies the file:
    data_path = Path("data") / "sample.log"
    
  3. Use the Path object to open the file in a with statement:
    with data_path.open() as data_file:
    
  4. Create the log file reader from the open file object, data_file. In this case, we'll use the built-in map() function to apply the log_parser() function to each line from the source file:
        data_reader = map(log_parser, data_file)
    
  5. Read (and process) the various rows of data. For this example, we'll just print each row:
        for row in data_reader:
            pprint(row)
    

The output is a series of LogLine tuples that looks as follows:

LogLine(date='2016-06-15 17:57:54,715', level='INFO', module='ch10_r10', message='Sample Message One')
LogLine(date='2016-06-15 17:57:54,715', level='DEBUG', module='ch10_r10', message='Debugging')
LogLine(date='2016-06-15 17:57:54,715', level='WARNING', module='ch10_r10', message='Something might have gone wrong')

We can do more meaningful processing on these tuple instances than we can on a line of raw text. These allow us to filter the data by severity level, or create a Counter based on the module providing the message.
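
For example, here's a hedged sketch of both ideas, reusing the data_path and log_parser names from the preceding steps:

from collections import Counter

with data_path.open() as data_file:
    # Keep only the WARNING-level rows.
    warnings = [
        row for row in map(log_parser, data_file)
        if row.level == "WARNING"
    ]

with data_path.open() as data_file:
    # Count messages per module.
    messages_by_module = Counter(
        row.module for row in map(log_parser, data_file)
    )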

How it works...

This log file is in First Normal Form (1NF): the data is organized into lines that represent independent entities or events. Each row has a consistent number of attributes or columns, and each column has data that is atomic or can't be meaningfully decomposed further. Unlike CSV files, however, this particular format requires a complex regular expression to parse.

In our log file example, the timestamp contains a number of individual elements – year, month, day, hour, minute, second, and millisecond – but there's little value in further decomposing the timestamp. It's more helpful to use it as a single datetime object and derive details (like hour of the day) from this object, rather than assembling individual fields into a new piece of composite data.

In a complex log processing application, there may be several varieties of message fields. It may be necessary to parse these message types using separate patterns. When we need to do this, it reveals that the various lines in the log aren't consistent in terms of the format and number of attributes, breaking one of the 1NF assumptions.

We've generally followed the design pattern from the Reading delimited files with the CSV module recipe, so that reading a complex log is nearly identical to reading a simple CSV file. Indeed, we can see that the primary difference lies in one line of code:

data_reader = csv.DictReader(data_file) 

As compared to the following:

data_reader = map(log_parser, data_file) 

This parallel construct allows us to reuse analysis functions across many input file formats. This allows us to create a library of tools that can be used on a number of data sources.

There's more...

One of the most common operations when reading very complex files is to rewrite them into an easier-to-process format. We'll often want to save the data in CSV format for later processing.

Some of this is similar to the Using multiple contexts for reading and writing files recipe in Chapter 7, Basics of Classes and Objects, which also shows multiple open contexts. We'll read from one file and write to another file.

The file writing process looks as follows:

import csv 
def copy(data_path: Path) -> None:
    target_path = data_path.with_suffix(".csv")
    with target_path.open("w", newline="") as target_file:
        writer = csv.DictWriter(target_file, LogLine._fields)
        writer.writeheader()
        with data_path.open() as data_file:
            reader = map(log_parser, data_file)
            writer.writerows(row._asdict() for row in reader)

The first portion of this script defines a CSV writer for the target file. The path for the output file, target_path, is based on the input name, data_path. The suffix changed from the original filename's suffix to .csv.

The target file is opened with newline processing disabled via the newline='' option. This allows the csv.DictWriter class to insert the newline characters appropriate for the desired CSV dialect.

A DictWriter object is created to write to the given file. The sequence of column headings is provided by the LogLine class definition. This makes sure the output CSV file will contain column names matching the fields of LogLine, our subclass of typing.NamedTuple.

The writeheader() method writes the column names as the first line of output. This makes reading the file slightly easier because the column names are provided. The first row of a CSV file can be a kind of explicit schema definition that shows what data is present.

The source file is opened, as shown in the preceding recipe. Because of the way the csv module writers work, we can provide the reader generator expression to the writerows() method of the writer. The writerows() method will consume all of the data produced by the reader generator. This will, in turn, consume all the rows produced by the open file.

We don't need to write any explicit for statements to ensure that all of the input rows are processed. The writerows() function makes this a guarantee.
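
Using the same sample path as earlier in this recipe, the rewrite can be invoked like this, producing data/sample.csv:

copy(Path("data") / "sample.log")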

The output file looks as follows:

    date,level,module,message 
    "2016-05-08 11:08:18,651",INFO,ch10_r10,Sample Message One 
    "2016-05-08 11:08:18,651",DEBUG,ch10_r10,Debugging 
    "2016-05-08 11:08:18,652",WARNING,ch10_r10,Something might have gone wrong 

The file has been transformed from the rather complex input format into a simpler CSV format, suitable for further analysis and processing.

See also

  • For more information on the with statement, see the Reading and writing files with context managers recipe in Chapter 7, Basics of Classes and Objects.
  • The Writing generator functions with the yield statement recipe in the online chapter, Chapter 9, Functional Programming Features (link provided in the Preface), shows other processing of this log format.
  • In the Reading delimited files with the CSV module recipe, earlier in this chapter, we looked at other applications of this general design pattern.
  • In the Using dataclasses to simplify working with CSV files recipe, earlier in this chapter, we looked at other sophisticated CSV processing techniques.

Reading JSON and YAML documents

JavaScript Object Notation (JSON) is a popular syntax for serializing data. For details, see http://json.org. Python includes the json module in order to serialize and deserialize data in this notation.

JSON documents are used widely by web applications. It's common to exchange data between RESTful web clients and servers using documents in JSON notation. These two tiers of the application stack communicate via JSON documents sent over HTTP.

The YAML format is a more sophisticated and flexible extension to JSON notation. For details, see https://yaml.org. Any JSON document is also a valid YAML document. The reverse is not true: YAML syntax is more complex and includes constructs that are not valid JSON.

To use YAML, an additional module has to be installed. The PyYAML project offers a yaml module that is popular and works well. See https://pypi.org/project/PyYAML/.
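
Because every JSON document is also a YAML document, a hedged sketch of parsing the same kind of text with PyYAML (assuming the package is installed) looks like this:

import yaml  # provided by the PyYAML package

text = '{"teams": [], "legs": []}'
document = yaml.safe_load(text)  # JSON is a subset of YAML, so this parses cleanly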

In this recipe, we'll use the json or yaml module to parse JSON format data in Python.

Getting ready

We've gathered some sailboat racing results in race_result.json. This file contains information on teams, legs of the race, and the order in which the various teams finished each individual leg of the race. JSON handles this complex data elegantly.

An overall score can be computed by summing the finish position in each leg: the lowest score is the overall winner. In some cases, there are null values when a boat did not start, did not finish, or was disqualified from the race.

When computing the team's overall score, the null values are assigned a score of one more than the number of boats in the competition. If there are seven boats, then the team is given eight points for their failure to finish, a hefty penalty.

The data has the following schema. There are two fields within the overall document:

  • legs: An array of strings that shows the starting port and ending port.
  • teams: An array of objects with details about each team. Within each teams object, there are several fields of data:
    • name: String team name.
    • position: Array of integers and nulls with position. The order of the items in this array matches the order of the items in the legs array.

The data looks as follows:

    { 
      "teams": [ 
        { 
          "name": "Abu Dhabi Ocean Racing", 
          "position": [ 
            1, 
            3, 
            2, 
            2, 
            1, 
            2, 
            5, 
            3, 
            5 
          ] 
        }, 
        ... 
      ], 
      "legs": [ 
        "ALICANTE - CAPE TOWN", 
        "CAPE TOWN - ABU DHABI", 
        "ABU DHABI - SANYA", 
        "SANYA - AUCKLAND", 
        "AUCKLAND - ITAJAu00cd", 
        "ITAJAu00cd - NEWPORT", 
        "NEWPORT - LISBON", 
        "LISBON - LORIENT", 
        "LORIENT - GOTHENBURG" 
      ] 
    } 

We've only shown the first team. There was a total of seven teams in this particular race. Each team is represented by a Python dictionary, with the team's name and their history of finish positions on each leg. For the team shown here, Abu Dhabi Ocean Racing, they finished in first place in the first leg, and then third place in the next leg. Their worst performance was fifth place in both the seventh and ninth legs of the race, which were the legs from Newport, Rhode Island, USA to Lisbon, Portugal, and from Lorient in France to Gothenburg in Sweden.

The JSON-formatted data can look like a Python dictionary with lists nested inside it. This overlap between Python syntax and JSON syntax can be thought of as a happy coincidence: it makes it easier to visualize the Python data structure that will be built from the JSON source document.

JSON has a small set of data structures: null, Boolean, number, string, list, and object. These map to objects of Python types in a very direct way. The json module makes the conversions from source text into Python objects for us.

One of the strings contains a Unicode escape sequence, \u00cd, instead of the actual Unicode character Í. This is a common technique used to encode characters beyond the 128 ASCII characters. The parser in the json module handles this for us.
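
For example:

>>> import json
>>> json.loads(r'"ITAJA\u00cd"')
'ITAJAÍ'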

In this example, we'll write a function to disentangle this document and show the team finishes for each leg.

How to do it...

This recipe will start with importing the necessary modules. We'll then use these modules to transform the contents of the file into a useful Python object:

  1. We'll need the json module to parse the text. We'll also need a Path object to refer to the file:
    import json
    from pathlib import Path
    
  2. Define a race_summary() function to read the JSON document from a given Path instance:
    def race_summary(source_path: Path) -> None:
    
  3. Create a Python object by parsing the JSON document. It's often easiest to use source_path.read_text() to read the file named by Path. We provided this string to the json.loads() function for parsing. For very large files, an open file can be passed to the json.load() function; this can be more efficient than reading the entire document into a string object and loading the in-memory text:
        document = json.loads(source_path.read_text())
    
  4. Display the data: This document creates a dictionary with two keys, teams and legs. Here's how we can iterate through each leg, showing the team's position in the leg:
    for n, leg in enumerate(document['legs']):
        print(leg)
        for team_finishes in document['teams']:
            print(
                team_finishes['name'],
                team_finishes['position'][n])
    

The data for each team will be a dictionary with two keys: name and position. We can navigate down into the team details to get the name of the last of the seven teams:

>>> document['teams'][6]['name']
'Team Vestas Wind'

We can look inside the legs field to see the names of each leg of the race:

>>> document['legs'][5]
'ITAJAÍ - NEWPORT'

Note that the JSON source file included a '\u00cd' Unicode escape sequence. This was parsed properly, and the Unicode output shows the proper Í character.
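
As an example of more useful processing, here's a hedged sketch of the overall-score computation described in Getting ready; the function name is illustrative and not part of the recipe:

from typing import Any, Dict

def overall_scores(document: Dict[str, Any]) -> Dict[str, int]:
    # Scoring rule: a null (None) finish is worth one more point
    # than the number of competing boats.
    penalty = len(document['teams']) + 1
    return {
        team['name']: sum(
            pos if pos is not None else penalty
            for pos in team['position']
        )
        for team in document['teams']
    }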

How it works...

A JSON document is a data structure in JavaScript Object Notation. JavaScript programs can parse the document trivially. Other languages must do a little more work to translate the JSON to a native data structure.

A JSON document contains three kinds of structures:

  • Objects that map to Python dictionaries: JSON has a syntax similar to Python: {"key": "value"}. Unlike Python, JSON only uses " for string quotation marks. JSON notation is intolerant of an extra , at the end of the dictionary value. Other than this, the two notations are similar.
  • Arrays that map to Python lists: JSON syntax uses [item, ...], which looks like Python. JSON is intolerant of an extra , at the end of the array value.
  • Primitive values: There are five classes of values: string, number, true, false, and null. Strings are enclosed in " and use a variety of escape sequences, which are similar to Python's. Numbers follow the rules for floating-point values. The other three values are simple literals; these parallel Python's True, False, and None literals. As a special case, numbers with no decimal point become Python int objects. This is an extension of the JSON standard.

There is no provision for any other kinds of data. This means that Python programs must convert complex Python objects into a simpler representation so that they can be serialized in JSON notation.

Conversely, we often apply additional conversions to reconstruct complex Python objects from the simplified JSON representation. The json module has places where we can apply additional processing to the simple structures to create more sophisticated Python objects.

There's more...

A file, generally, contains a single JSON document. The JSON standard doesn't provide an easy way to encode multiple documents in a single file. If we want to analyze a web log, for example, the original JSON standard may not be the best notation for preserving a huge volume of information.

There are common extensions, like Newline Delimited JSON, http://ndjson.org, and JSON Lines, http://jsonlines.org, to define a way to encode multiple JSON documents into a single file.
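
A minimal sketch of reading newline-delimited JSON, assuming one document per line (the helper name is illustrative):

import json
from pathlib import Path
from typing import Any, Iterator

def ndjson_iter(source_path: Path) -> Iterator[Any]:
    # Each non-blank line of the file is an independent JSON document.
    with source_path.open() as source_file:
        for line in source_file:
            if line.strip():
                yield json.loads(line)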

There are two additional problems that we often have to tackle:

  • Serializing complex objects so that we can write them to files, for example, a datetime object
  • Deserializing complex objects from the text that's read from a file

When we represent a Python object's state as a string of text characters, we've serialized the object. Many Python objects need to be saved in a file or transmitted to another process. These kinds of transfers require a representation of the object state. We'll look at serializing and deserializing separately.

Serializing a complex data structure

Many common Python data structures can be serialized into JSON. Because Python is extremely sophisticated and flexible, we can also create Python data structures that cannot be directly represented in JSON.

The serialization to JSON works out the best if we create Python objects that are limited to values of the built-in dict, list, str, int, float, bool, and None types. This subset of Python types can be used to build objects that the json module can serialize and can be used widely by a number of programs written in different languages.

One commonly used data structure that doesn't serialize easily is the datetime.datetime object. Here's what happens when we try:

>>> import datetime 
>>> example_date = datetime.datetime(2014, 6, 7, 8, 9, 10) 
>>> document = {'date': example_date} 

Here, we've created a simple document with a dictionary mapping a string to a datetime instance. What happens when we try to serialize this in JSON?

>>> json.dumps(document)  
Traceback (most recent call last): 
  ... 
TypeError: datetime.datetime(2014, 6, 7, 8, 9, 10) is not JSON serializable 

This shows that objects will raise a TypeError exception when they cannot be serialized. Avoiding this exception can be done in one of two ways. We can either convert the data into a JSON-friendly structure before building the document, or we can add a default type handler to the JSON serialization process that gives us a way to provide a serializable version of the data.

To convert the datetime object into a string prior to serializing it as JSON, we need to make a change to the underlying data. In the following example, we replaced the datetime.datetime object with a string:

>>> document_converted = {'date': example_date.isoformat()} 
>>> json.dumps(document_converted) 
'{"date": "2014-06-07T08:09:10"}' 

This uses the standardized ISO format for dates to create a string that can be serialized. An application that reads this data can then convert the string back into a datetime object. This kind of transformation can be difficult for a complex document.
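
For example, the reading application might reverse the conversion with datetime.datetime.fromisoformat():

>>> datetime.datetime.fromisoformat("2014-06-07T08:09:10")
datetime.datetime(2014, 6, 7, 8, 9, 10)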

The other technique for serializing complex data is to provide a function that's used by the json module during serialization. This function must convert a complex object into something that can be safely serialized. In the following example, we'll convert a datetime object into a simple string value:

def default_date(object: Any) -> Union[Any, Dict[str, Any]]:
    if isinstance(object, datetime.datetime):
        return {"$date": object.isoformat()}
    return object

We've defined a function, default_date(), which will apply a special conversion rule to datetime objects. Any datetime instance will be massaged into a dictionary with an obvious key – "$date" – and a string value. This dictionary of strings can be serialized by functions in the json module.

We provide this function to the json.dumps() function, assigning the default_date() function to the default parameter, as follows:

>>> example_date = datetime.datetime(2014, 6, 7, 8, 9, 10)
>>> document = {'date': example_date}
>>> print(
...     json.dumps(document, default=default_date, indent=2))
{
  "date": {
    "$date": "2014-06-07T08:09:10"
  }
}

When the json module can't serialize an object, it passes the object to the given default function. In any given application, we'll need to expand this function to handle a number of Python object types that we might want to serialize in JSON notation. If there is no default function provided, an exception is raised when an object can't be serialized.

Deserializing a complex data structure

When deserializing JSON to create Python objects, there's a hook that can be used to convert data from a JSON dictionary into a more complex Python object. This is called object_hook and it is used during processing by the json.loads() function. This hook is used to examine each JSON dictionary to see if something else should be created from the dictionary instance.

The function we provide will either create a more complex Python object, or it will simply return the original dictionary object unmodified:

def as_date(object: Dict[str, Any]) -> Union[Any, Dict[str, Any]]:
    if {'$date'} == set(object.keys()):
        return datetime.datetime.fromisoformat(object['$date'])
    return object

This function will check each object that's decoded to see if the object has a single field, and that single field is named $date. If that is the case, the value of the entire object is replaced with a datetime object. The return type is a union of Any and Dict[str, Any] to reflect the two possible results: either some object or the original dictionary.

We provide a function to the json.loads() function using the object_hook parameter, as follows:

>>> source = '''{"date": {"$date": "2014-06-07T08:09:10"}}'''
>>> json.loads(source, object_hook=as_date)
{'date': datetime.datetime(2014, 6, 7, 8, 9, 10)}

This parses a very small JSON document that meets the criteria for containing a date. The resulting Python object is built from the string value found in the JSON serialization.

We may also want to design our application classes to provide additional methods to help with serialization. A class might include a to_json() method, which will serialize the objects in a uniform way. This method might provide class information. It can avoid serializing any derived attributes or computed properties. Similarly, we might need to provide a static from_json() method that examines a dictionary, determines whether it represents an instance of the class, and rebuilds the object when it does.
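
Here's a minimal sketch of that idea; the class and its attribute names are purely illustrative and not part of the recipe's data set:

import datetime
from typing import Any, Dict

class Event:
    def __init__(self, name: str, timestamp: datetime.datetime) -> None:
        self.name = name
        self.timestamp = timestamp

    def to_json(self) -> Dict[str, Any]:
        # A "$class" marker lets a reader recognize this structure later.
        return {
            "$class": "Event",
            "name": self.name,
            "timestamp": self.timestamp.isoformat(),
        }

    @staticmethod
    def from_json(document: Dict[str, Any]) -> "Event":
        return Event(
            document["name"],
            datetime.datetime.fromisoformat(document["timestamp"]),
        )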

See also

  • The Reading HTML documents recipe, later in this chapter, will show how we prepared this data from an HTML source.

Reading XML documents

The XML markup language is widely used to represent the state of objects in a serialized form. For details, see http://www.w3.org/TR/REC-xml/. Python includes a number of libraries for parsing XML documents.

XML is called a markup language because the content of interest is marked with tags, and also written with a start <tag> and an end </tag> to clarify the structure of the data. The overall file text includes both the content and the XML markup.

Because the markup is intermingled with the text, there are some additional syntax rules that must be used to distinguish markup from text. In order to include the < character in our data, we must use XML character entity references. We must use &lt; to include < in our text. Similarly, &gt; must be used instead of >, and &amp; instead of &. Additionally, &quot; is used to embed a " character in an attribute value delimited by " characters. For the most part, XML parsers will handle this transformation when consuming XML.

A document, then, will have items as follows:

<team><name>Team SCA</name><position>...</position></team> 

The <team> tag contains the <name> tag, which contains the text of the team's name. The <position> tag contains more data about the team's finish position in each leg of a race.

Most XML processing allows additional newline and space characters in the XML to make the structure more obvious:

<team> 
    <name>Team SCA</name> 
    <position>...</position> 
</team> 

As shown in the preceding example, content (like team name) is surrounded by the tags. The overall document forms a large, nested collection of containers. We can think of a document as a tree with a root tag that contains all the other tags and their embedded content. Between tags, there is some additional content. In some applications, the additional content between the ends of tags is entirely whitespace.

Here's the beginning of the document we'll be looking at:

<?xml version="1.0"?> 
  <results> 
    <teams> 
      <team> 
        <name> 
          Abu Dhabi Ocean Racing 
        </name> 
        <position> 
          <leg n="1"> 
            1 
          </leg> 
          ...
        </position>
        ...
      </team>
      ...
    </teams>
  </results>

The top-level container is the <results> tag. Within this is a <teams> tag. Within the <teams> tag are many repetitions of data for each individual team, enclosed in the <team> tag. We've used … to show where parts of the document were elided.

It's very, very difficult to parse XML with regular expressions. We need more sophisticated parsers to handle the syntax of nested tags.

There are two lower-level parsers available in Python: SAX and Expat. Python includes the modules xml.sax and xml.parsers.expat to use these directly.

In addition to these, there's a very sophisticated set of tools in the xml.etree package. We'll focus on using the ElementTree module in this package to parse and analyze XML documents.

In this recipe, we'll use the xml.etree module to parse XML data.

Getting ready

We've gathered some sailboat racing results in race_result.xml. This file contains information on teams, legs, and the order in which the various teams finished each leg.

A team's overall score is the sum of the finish positions. Finishing first, or nearly first, in each leg will give a very low score. In many cases, there are empty values where a boat did not start, did not finish, or was disqualified from the race. In those cases, the team's score will be one more than the number of boats. If there are seven boats, then the team is given eight points for the leg. The inability to compete creates a hefty penalty.

The root tag for this data is a <results> document. This has the following schema:

  • The <legs> tag contains individual <leg> tags that name each leg of the race. The leg names contain both a starting port and an ending port in the text.
  • The <teams> tag contains a number of <team> tags with details of each team. Each team has data structured with internal tags:
    • The <name> tag contains the team name.
    • The <position> tag contains a number of <leg> tags with the finish position for the given leg. Each leg is numbered, and the numbering matches the leg definitions in the <legs> tag.

The data for all the finish positions for a single team looks as follows:

<?xml version="1.0"?> 
  <results> 
    <teams> 
      <team> 
        <name> 
          Abu Dhabi Ocean Racing 
        </name> 
        <position> 
          <leg n="1"> 
            1 
          </leg> 
          <leg n="2"> 
            3 
          </leg> 
          <leg n="3"> 
            2 
          </leg> 
          <leg n="4"> 
            2 
          </leg> 
          <leg n="5"> 
            1 
          </leg> 
          <leg n="6"> 
            2 
          </leg> 
          <leg n="7"> 
            5 
          </leg> 
          <leg n="8"> 
            3 
          </leg> 
          <leg n="9"> 
            5 
          </leg> 
        </position> 
      </team> 
        ... 
    </teams> 
    <legs> 
    ... 
    </legs> 
  </results> 

We've only shown the first team. There was a total of seven teams in this particular race around the world.

In XML notation, the application data shows up in two kinds of places. The first is between the start and the end of a tag – for example, <name>Abu Dhabi Ocean Racing</name>. The tag is <name>, while the text between <name> and </name> is the value of this tag, Abu Dhabi Ocean Racing.

Also, data shows up as an attribute of a tag; for example, in <leg n="1">. The tag is <leg>; the tag has an attribute, n, with a value of 1. A tag can have an indefinite number of attributes.

The <leg> tags point out an interesting problem with XML. These tags include the leg number given as an attribute, n, and the position in the leg given as the text inside the tag. The general approach is to put essential data inside the tags and supplemental, or clarifying, data in the attributes. The line between essential and supplemental is blurry.

XML permits a mixed content model. This reflects the case where XML is mixed in with text, where there is text inside and outside XML tags. Here's an example of mixed content:

<p>This has <strong>mixed</strong> content.</p> 

The content of the <p> tag is a mixture of text and a tag. The data we're working with does not rely on this kind of mixed content model, meaning all the data is within a single tag or an attribute of a tag. The whitespace between tags can be ignored.

We'll use the xml.etree module to parse the data. This involves reading the data from a file and providing it to the parser. The resulting document will be rather complex.

We have not provided a formal schema definition for our sample data, nor have we provided a Document Type Definition (DTD). This means that the XML defaults to mixed content mode. Furthermore, the XML structure can't be validated against the schema or DTD.

How to do it...

Parsing XML data requires importing the ElementTree module. We'll use this to write a race_summary() function that parses the XML data and produces a useful Python object:

  1. We'll need the xml.etree.ElementTree module to parse the XML text, a Path object to refer to the file, and the cast() function from typing for the type hints used later. We've used the import… as… syntax to assign the shorter name XML to the ElementTree module:
    import xml.etree.ElementTree as XML
    from pathlib import Path
    from typing import cast
    
  2. Define a function to read the XML document from a given Path instance:
    def race_summary(source_path: Path) -> None:
    
  3. Create a Python ElementTree object by parsing the XML text. It's often easiest to use source_path.read_text() to read the file named by Path. We provided this string to the XML.fromstring() method for parsing. For very large files, an incremental parser is sometimes helpful:
        source_text = source_path.read_text(encoding='UTF-8')
        document = XML.fromstring(source_text)
    
  4. Display the data. The parsed document is a tree of Element objects with two child tags, <teams> and <legs>. Here's how we can iterate through each leg, showing the team's position in the leg:
        legs = cast(XML.Element, document.find('legs'))
        teams = cast(XML.Element, document.find('teams'))
        for leg in legs.findall('leg'):
            print(cast(str, leg.text).strip())
            n = leg.attrib['n']
            for team in teams.findall('team'):
                position_leg = cast(XML.Element, 
                    team.find(f"position/leg[@n='{n}']"))
                name = cast(XML.Element, team.find('name'))
                print(
                    cast(str, name.text).strip(),
                    cast(str, position_leg.text).strip()
                )
    

Once we have the document object, we can then search the object for the relevant pieces of data. In this example, we used the find() method to locate the two tags containing legs and teams.

Within the legs tag, there are a number of leg tags. Each of those tags has the following structure:

<leg n="1">
   ALICANTE - CAPE TOWN
</leg>

The expression leg.attrib['n'] extracts the value of the attribute named n. The expression leg.text.strip() is all the text within the <leg> tag, stripped of extra whitespace.

It's important to note that the results of the find() function have a type hint of Optional[XML.Element]. We have two choices to handle this:

  • Use an if statement to determine if the result is not None; a short sketch of this approach follows this list.
  • Use cast(XML.Element, tag.find(…)) to claim that the result is never going to be None. For some kinds of output from automated systems, the tags are always going to be present, and the overhead of numerous if statements is excessive.
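
Here's a small sketch of the first approach, assuming document was produced by XML.fromstring() as shown in the recipe:

legs = document.find('legs')
if legs is None:
    raise ValueError("document has no <legs> tag")
for leg in legs.findall('leg'):
    print(leg.attrib['n'], (leg.text or "").strip())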

For each leg of the race, we need to print the finish positions, which are represented in the data contained within the <teams> tag. Within this tag, we need to locate a tag containing the name of the team. We also need to find the proper leg tag with the finish position for this team on the given leg.

We need to use a complex XPath search, f"position/leg[@n='{n}']", to locate a specific instance of the position tag. The value of n is the leg number. For the ninth leg, this search will be the string "position/leg[@n='9']". This will locate the position tag containing a leg tag that has an attribute n equal to 9.

Because XML is a mixed content model, all of the newline, tab, and space characters in the content are perfectly preserved in the data. We rarely want any of this whitespace, and it makes sense to use the strip() method to remove all extraneous characters before and after the meaningful content.

How it works...

The XML parser modules transform XML documents into fairly complex objects based on a standardized document object model. In the case of the etree module, the document will be built from Element objects, which generally represent tags and text.

XML can also include processing instructions and comments. We'll ignore them and focus on the document structure and content here.

Each Element instance has the name of the tag, the text within the tag, attributes that are part of the tag, and a tail. The tag is the name inside <tag>. The attributes are the fields that follow the tag name. For example, the <leg n="1"> tag has a tag name of leg and an attribute named n. Values are always strings in XML; any conversion to a different data type is the responsibility of the application using the data.

The text is contained between the start and end of a tag. Therefore, a tag such as <name>Team SCA</name> has "Team SCA" for the value of the text attribute of the Element that represents the <name> tag.

Note that a tag also has a tail attribute. Consider this sequence of two tags:

<name>Team SCA</name> 
<position>...</position> 

There's a whitespace character after the closing </name> tag and before the opening of the <position> tag. This extra text is collected by the parser and put into the tail of the <name> tag. The tail values can be important when working with a mixed content model. The tail values are generally whitespace when working in an element content model.
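
A short sketch, using the XML alias from the recipe, shows where that whitespace ends up:

>>> fragment = XML.fromstring(
...     "<team><name>Team SCA</name>\n<position>...</position></team>")
>>> fragment.find('name').text
'Team SCA'
>>> fragment.find('name').tail
'\n'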

There's more...

Because we can't trivially translate an XML document into a Python dictionary, we need a handy way to search through the document's content. The ElementTree module provides a search technique that's a partial implementation of the XML Path Language (XPath) for specifying a location in an XML document. The XPath notation gives us considerable flexibility.

The XPath queries are used with the find() and findall() methods. Here's how we can find all of the team names:

>>> for tag in document.findall('teams/team/name'): 
...      print(tag.text.strip()) 
Abu Dhabi Ocean Racing 
Team Brunel 
Dongfeng Race Team 
MAPFRE 
Team Alvimedica 
Team SCA 
Team Vestas Wind 

Here, we've looked for the top-level <teams> tag. Within that tag, we want <team> tags. Within those tags, we want the <name> tags. This will search for all the instances of this nested tag structure.

Note that we've omitted the type hints from this example and assumed that all the tags will contain text values that are not None. If we use this in an application, we may have to add checks for None, or use the cast() function to convince mypy that the tags or the text attribute value is not None.

We can search for attribute values as well. This can make it handy to find how all the teams did on a particular leg of the race. The data for this can be found in the <leg> tag, within the <position> tag for each team.

Furthermore, each <leg> has an attribute named n that shows which of the race legs it represents. Here's how we can use this to extract specific data from the XML document:

>>> for tag in document.findall("teams/team/position/leg[@n='8']"): 
...     print(tag.text.strip()) 
3 
5 
7 
4 
6 
1 
2 

This shows us the finishing positions of each team on leg 8 of the race. We're looking for all tags with <leg n="8"> and displaying the text within that tag. We have to match these values with the team names, in document order, to see that the sixth team, Team SCA, finished first, and that the third team, Dongfeng Race Team, finished last on this leg.

See also

  • The Reading HTML documents recipe, later in this chapter, shows how we prepared this data from an HTML source.

Reading HTML documents

A great deal of content on the web is presented using HTML markup. A browser renders the data very nicely. How can we parse this data to extract the meaningful content from the displayed web page?

We can use the standard library html.parser module, but it's not as helpful as we'd like. It only provides low-level lexical scanning information; it doesn't provide a high-level data structure that describes the original web page.

Instead, we'll use the Beautiful Soup module to parse HTML pages into more useful data structures. This is available from the Python Package Index (PyPI). See https://pypi.python.org/pypi/beautifulsoup4.

This must be downloaded and installed. Often, this is as simple as doing the following:

python -m pip install beautifulsoup4

Using the python -m pip command ensures that we will use the pip command that goes with the currently active virtual environment.

Getting ready

We've gathered some sailboat racing results in Volvo Ocean Race.html. This file contains information on teams, legs, and the order in which the various teams finished each leg. It's been scraped from the Volvo Ocean Race website, and it looks wonderful when opened in a browser.

Except for very old websites, most HTML notation is an extension of XML notation. The content is surrounded by <tag> marks, which show the structure and presentation of the data. HTML predates XML, and an XHTML standard reconciles the two. Note that browser applications must be tolerant of older HTML and improperly structured HTML. The presence of damaged HTML can make it difficult to analyze some data from the World Wide Web.

HTML pages can include a great deal of overhead. There are often vast code and style sheet sections, as well as invisible metadata. The content may be surrounded by advertising and other information.

Generally, an HTML page has the following overall structure:

<html>
    <head>...</head> 
    <body>...</body> 
</html>

Within the <head> tag, there will be links to JavaScript libraries and links to Cascading Style Sheet (CSS) documents. These are used to provide interactive features and to define the presentation of the content.

The bulk of the content is in the <body> tag. It can be difficult to track down the relevant data on a web page. This is because the focus of the design effort is on how people see it more than how automated tools can process it.

In this case, the race results are in an HTML <table> tag, making them easy to find. What we can see here is the overall structure for the relevant content in the page:

<table> 
    <thead> 
        <tr> 
            <th>...</th> 
            ... 
        </tr> 
    </thead> 
    <tbody> 
        <tr> 
            <td>...</td> 
            ... 
        </tr> 
        ... 
    </tbody> 
</table> 

The <thead> tag includes the column titles for the table. There's a single table row tag, <tr>, with table heading, <th>, tags that include the content. Each of the <th> tags contains two parts. It looks like this:

<th tooltipster data-title="<strong>ALICANTE - CAPE TOWN</strong>" data-theme="tooltipster-shadow" data-htmlcontent="true" data-position="top">LEG 1</th>

The essential display is a number for each leg of the race, LEG 1, in this example. This is the content of the tag. In addition to the displayed content, there's also an attribute value, data-title, that's used by a JavaScript function. This attribute value is the name of the leg, and it is displayed when the cursor hovers over a column heading. The JavaScript function pops up the leg's name.

The <tbody> tag includes the team name and the results for each race. The table row tag, <tr>, contains the details for each team. The team name (along with a graphic and the overall finish rank) is shown in the first three columns of table data, <td>. The remaining columns of table data contain the finishing position for a given leg of the race.

Because of the relative complexity of sailboat racing, there are additional notes in some of the table data cells. These are included as attributes to provide supplemental data regarding the reason why a cell has a particular value. In some cases, teams did not start a leg, did not finish a leg, or retired from a leg.

Here's a typical <tr> row from the HTML:

<tr class="ranking-item"> 
    <td class="ranking-position">3</td> 
    <td class="ranking-avatar"> 
        <img src="..."> </td> 
    <td class="ranking-team">Dongfeng Race Team</td> 
    <td class="ranking-number">2</td> 
    <td class="ranking-number">2</td> 
    <td class="ranking-number">1</td> 
    <td class="ranking-number">3</td> 
    <td class="ranking-number" tooltipster data-title="<center><strong>RETIRED</strong><br>Click for more info</center>" data-theme="tooltipster-3" data-position="bottom" data-htmlcontent="true"><a href="/en/news/8674_Dongfeng-Race-Team-breaks-mast-crew-safe.html" target="_blank">8</a><div  class="status-dot dot-3"></div></td>
    <td class="ranking-number">1</td> 
    <td class="ranking-number">4</td> 
    <td class="ranking-number">7</td> 
    <td class="ranking-number">4</td> 
    <td class="ranking-number total">33<span class="asterix">*</span></td> 
</tr> 

The <tr> tag has a class attribute that defines the style for this row. The class attribute on this tag helps our data gathering application locate the relevant content. It also chooses the CSS style for this class of content.

The <td> tags also have class attributes that define the style for the individual cells of data. Because the CSS styles reflect the content, the class also clarifies what the content of the cell means. Not all CSS class names are as well defined as these.

One of the cells has no text content. Instead, the cell has an <a> tag and an empty <div> tag. That cell also contains several attributes, including data-title, data-theme, data-position, and others. These additional attributes are used by a JavaScript function to display additional information in the cell. Instead of text stating the finish position, there is additional data on what happened to the racing team and their boat.

An essential complexity here is that the data-title attribute contains HTML content. This is unusual, and an HTML parser cannot detect that an attribute contains HTML markup. We'll set this aside for the There's more… section of this recipe.

How to do it...

We'll start by importing the necessary modules. The function for parsing down the data will have two important sections: the list of legs and the team results for each leg. These are represented as headings and rows of an HTML table:

  1. We'll need the BeautifulSoup class from the bs4 module to parse the text, a Path object to refer to the file, and a few names from typing for the hints. We've used a # type: ignore comment because the bs4 module didn't have complete type hints at the time of publication:
    from bs4 import BeautifulSoup  # type: ignore
    from pathlib import Path
    from typing import Any, Dict, List, Optional, Tuple
    
  2. Define a function to read the HTML document from a given Path instance:
    def race_extract(source_path: Path) -> Dict[str, Any]:
    
  3. Create the soup structure from the HTML content. We'll assign it to a variable, soup. We've used a context manager to access the file. As an alternative, we could also read the content using the Path.read_text() method:
        with source_path.open(encoding="utf8") as source_file:
            soup = BeautifulSoup(source_file, "html.parser")
    
  4. From the Soup object, we can navigate to the first <table> tag. Within that, we need to find the first <thead> tag. Within that heading, we need to find the <tr> tag. This row contains the individual heading cells. Navigating to the first instance of a tag means using the tag name as an attribute; each tag's children are available as attributes of the tag:
        thead_row = soup.table.thead.tr
    
  5. We can accumulate data from each <th> cell within the row. There are three variations of the heading cells. Some have no text, some have text, and some have text and a data-title attribute value. For this first version, we'll capture all three variations. The tag's text attribute contains this content. The tag's attrs attribute contains the various attribute values:
        legs: List[Tuple[str, Optional[str]]] = []
        for tag in thead_row.find_all("th"):
            leg_description = (
                tag.string, tag.attrs.get("data-title")
            )
            legs.append(leg_description)
    
  6. From the Soup object, we can navigate to the first <table> tag. Within that, we need to find the first <tbody> tag. We can leverage the way Beautiful Soup makes the first instance of any tag into an attribute of the parent's tag:
        tbody = soup.table.tbody
    
  7. Within that body, we need to find all the <tr> tags in order to visit all of the rows of the table. Within the <tr> tags for a row, each cell is in a <td> tag. We want to convert the content of the <td> tags into team names and a collection of team positions, depending on the attributes available:
        teams: List[Dict[str, Any]] = []
        for row in tbody.find_all("tr"):
            team: Dict[str, Any] = {
                "name": None, 
                "position": []}
            for col in row.find_all("td"):
                if "ranking-team" in col.attrs.get("class"):
                    team["name"] = col.string
                elif (
                    "ranking-number" in col.attrs.get("class")
                ):
                    team["position"].append(col.string)
                elif "data-title" in col.attrs:
                    # Complicated explanation with nested HTML
                    # print(col.attrs, col.string)
                    pass
            teams.append(team)
    
  8. Once the legs and teams have been extracted, we can create a useful dictionary that will contain the two collections:
        document = {
            "legs": legs,
            "teams": teams,
        }
        return document
    

We've created a list of legs showing the order and names for each leg, and we parsed the body of the table to create a dict-of-list structure with each leg's results for a given team. The resulting object looks like this:

{'legs': [(None, None),
          ('LEG 1', '<strong>ALICANTE - CAPE TOWN</strong>'),
          ('LEG 2', '<strong>CAPE TOWN - ABU DHABI</strong>'),
          ('LEG 3', '<strong>ABU DHABI - SANYA</strong>'),
          ('LEG 4', '<strong>SANYA - AUCKLAND</strong>'),
          ('LEG 5', '<strong>AUCKLAND - ITAJAÍ</strong>'),
          ('LEG 6', '<strong>ITAJAÍ - NEWPORT</strong>'),
          ('LEG 7', '<strong>NEWPORT - LISBON</strong>'),
          ('LEG 8', '<strong>LISBON - LORIENT</strong>'),
          ('LEG 9', '<strong>LORIENT - GOTHENBURG</strong>'),
          ('TOTAL', None)],
 'teams': [{'name': 'Abu Dhabi Ocean Racing',
            'position': ['1', '3', '2', '2', '1', '2', '5', '3', '5', '24']},
           {'name': 'Team Brunel',
            'position': ['3', '1', '5', '5', '4', '3', '1', '5', '2', '29']},
           {'name': 'Dongfeng Race Team',
            'position': ['2', '2', '1', '3', None, '1', '4', '7', '4', None]},
           {'name': 'MAPFRE',
            'position': ['7', '4', '4', '1', '2', '4', '2', '4', '3', None]},
           {'name': 'Team Alvimedica',
            'position': ['5', None, '3', '4', '3', '5', '3', '6', '1', '34']},
           {'name': 'Team SCA',
            'position': ['6', '6', '6', '6', '5', '6', '6', '1', '7', None]},
           {'name': 'Team Vestas Wind',
            'position': ['4',
                         None,
                         None,
                         None,
                         None,
                         None,
                         None,
                         '2',
                         '6',
                         '60']}]}

This structure is the content of the source HTML table, unpacked into a Python dictionary we can work with. Note that the titles for the legs include embedded HTML within the attribute's value.

Within the body of the table, many cells have None for the final race position, and a complex value in data-title for the specific <TD> tag. We've avoided trying to capture the additional results data in this initial part of the recipe.
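
To produce a structure like the one shown above, a call might look like this; the filename comes from Getting ready, and the exact location is an assumption:

from pprint import pprint

results = race_extract(Path("Volvo Ocean Race.html"))
pprint(results)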

How it works...

The BeautifulSoup class transforms HTML documents into fairly complex objects based on a document object model (DOM). The resulting structure will be built from instances of the Tag, NavigableString, and Comment classes.

Generally, we're interested in the tags that contain the string content of the web page. These are instances of the Tag class, as well as the NavigableString class.

Each Tag instance has a name, string, and attributes. The name is the word inside < and >. The attributes are the fields that follow the tag name. For example, <td class="ranking-number">1</td> has a tag name of td and an attribute named class. Values are often strings, but in a few cases, the value can be a list of strings. The string attribute of the Tag object is the content enclosed by the tag; in this case, it's a very short string, 1.

HTML is a mixed content model. This means that a tag can contain child tags, in addition to navigable text. When looking at the children of a given tag, there will be a sequence of Tag and NavigableString objects freely intermixed.

One of the most common features of HTML is small blocks of navigable text that contain only newline characters. When we have HTML like this:

<tr> 
    <td>Data</td> 
</tr> 

There are three children within the <tr> tag. Here's a display of the children of this tag:

>>> example = BeautifulSoup(''' 
...     <tr> 
...         <td>data</td> 
...     </tr> 
... ''', 'html.parser') 
>>> list(example.tr.children) 
['\n', <td>data</td>, '\n']

The two newline characters are peers of the <td> tag. These are preserved by the parser. This shows how NavigableString objects often surround a child Tag object.

The BeautifulSoup parser depends on another, lower-level library to do some of the parsing. It's easy to use the built-in html.parser module for this. There are alternatives that can be installed as well. These may offer some advantages, like better performance or better handling of damaged HTML.

There's more...

The Tag objects of Beautiful Soup represent the hierarchy of the document's structure. There are several kinds of navigation among tags:

  • All tags except a special root container will have a parent. The top <html> tag will often be the only child of the root container.
  • The parents attribute is a generator for all parents of a tag. It's a path "upward" through the hierarchy from a given tag.
  • All Tag objects can have children. A few tags such as <img/> and <hr/> have no children. The children attribute is a generator that yields the children of a tag.
  • A tag with children may have multiple levels of tags under it. The overall <html> tag, for example, contains the entire document as descendants. The children attribute contains the immediate children; the descendants attribute generates all children of children, recursively.
  • A tag can also have siblings, which are other tags within the same container. Since the tags have a defined order, there's the next_sibling and previous_sibling attributes to help us step through the peers of a tag.

In some cases, a document will have a straightforward organization, and a simple search by the id attribute or class attribute will find the relevant data. Here's a typical search for a given structure:

>>> ranking_table = soup.find('table', class_="ranking-list") 

Note that we have to use class_ in our Python query to search for the attribute named class. The token class is a reserved word in Python and cannot be used as a parameter name. Given the overall document, we're searching for any <table class="ranking-list"> tag. This will find the first such table in a web page. Since we know there will only be one of these, this attribute-based search helps distinguish between what we are trying to find and any other tabular data on a web page.

Here's the list of parents of this <table> tag:

>>> list(tag.name for tag in ranking_table.parents) 
['section', 'div', 'div', 'div', 'div', 'body', 'html', '[document]'] 

We've displayed just the tag name for each parent above the given <table>. Note that there are four nested <div> tags that wrap the <section> that contains <table>. Each of these <div> tags likely has a different class attribute to properly define the content and the style for the content.

[document] is the overall BeautifulSoup container that holds the various tags that were parsed. This is displayed distinctively to emphasize that it's not a real tag, but a container for the top-level <html> tag.

See also

  • The Reading JSON documents and Reading XML documents recipes, shown earlier in this chapter, both use similar data. The example data was created for them by scraping the original HTML page using these techniques.

Refactoring a .csv DictReader as a dataclass reader

When we read data from a CSV format file, the csv module offers two general choices for the kind of reader to create:

  • When we use csv.reader(), each row becomes a list of column values.
  • When we use csv.DictReader, each row becomes a dictionary. By default, the contents of the first row become the keys for the row dictionary. An alternative is to provide a list of values that will be used as the keys.

In both cases, referring to data within the row is awkward because it involves rather complex-looking syntax. When we use the csv.reader() function, we must use syntax like row[2] to refer to a cell; the semantics of index 2 are completely obscure.

When we use csv.DictReader, we can use row['date'], which is less obscure, but it's still a lot of extra syntax. While this approach has a number of advantages, it requires a CSV file with a single-row header of unique column names, something that is far from universal in practice.

In some real-world spreadsheets, the column names are impossibly long strings. It's hard to work with row['Total of all locations excluding franchisees'].
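
To make the contrast concrete, here's a minimal, self-contained sketch of both reader styles; the text value mimics the waypoints.csv layout shown later in this recipe, and the variable names are illustrative only:

import csv
import io

text = "lat,lon,date,time\n32.83,-79.93,2012-11-27,09:15:00\n"

# csv.reader(): each row is a list, and the header is just another row.
list_rows = list(csv.reader(io.StringIO(text)))
print(list_rows[1][2])       # prints 2012-11-27 -- index 2 is opaque

# csv.DictReader: each row is a dict keyed by the header line.
dict_rows = list(csv.DictReader(io.StringIO(text)))
print(dict_rows[0]['date'])  # prints 2012-11-27 -- clearer, but wordy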

We can use a dataclass to replace this complex list or dictionary syntax with something simpler. This lets us replace an opaque index position or column names with a useful name.

Getting ready

One way to improve the readability of programs that work with spreadsheets is to replace a list of columns with a typing.NamedTuple or dataclass object. These two definitions provide easy-to-use names defined by the class instead of the possibly haphazard column names in the .csv file.

More importantly, it permits much nicer syntax for referring to the various columns; for example, we can use row.date instead of row['date'] or row[2].
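
For comparison, here's a minimal sketch of the typing.NamedTuple alternative mentioned above. The RawWaypoint class and show_names() function are hypothetical names, and the field order is assumed to match the CSV columns:

import csv
from pathlib import Path
from typing import NamedTuple

class RawWaypoint(NamedTuple):
    lat: str
    lon: str
    date: str
    time: str

def show_names(data_path: Path) -> None:
    with data_path.open() as data_file:
        reader = csv.reader(data_file)
        next(reader)  # skip the header row
        for waypoint in map(RawWaypoint._make, reader):
            # Attribute access replaces positional indexing.
            print(waypoint.date, waypoint.lat, waypoint.lon)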

The column names (and the data types for each column) are part of the schema for a given file of data. In some CSV files, the first line holds column titles, which provide part of the schema for the file. The schema built from this first line is incomplete because it can only provide attribute names; the target data types aren't known and have to be managed by the application program.

This points to two reasons for imposing an external schema on the rows of a spreadsheet:

  • We can supply meaningful names
  • We can perform data conversions where necessary

We'll look at a CSV file that contains some real-time data that's been recorded from the log of a sailboat. This is the waypoints.csv file, and the data looks as follows:

    lat,lon,date,time 
    32.8321666666667,-79.9338333333333,2012-11-27,09:15:00 
    31.6714833333333,-80.93325,2012-11-28,00:00:00 
    30.7171666666667,-81.5525,2012-11-28,11:35:00 

The data contains four columns: lat, lon, date, and time. Two of the columns are the latitude and longitude of the waypoint. The date and the time are stored as separate values. This isn't ideal, and we'll look at various data cleansing steps separately.

In this case, the column titles happen to be valid Python variable names. This is rare, but it can lead to a slight simplification. The more general solution involves mapping the given column names to valid Python attribute names.

A program can use a dictionary-based reader that looks like the following function:

import csv
import datetime
from pathlib import Path

def show_waypoints_raw(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        for row in data_reader:
            # Parse the text date and time columns separately...
            ts_date = datetime.datetime.strptime(
                row['date'], "%Y-%m-%d"
            ).date()
            ts_time = datetime.datetime.strptime(
                row['time'], "%H:%M:%S"
            ).time()
            # ...then combine them into a single timestamp.
            timestamp = datetime.datetime.combine(
                ts_date, ts_time)
            lat_lon = (
                float(row['lat']),
                float(row['lon'])
            )
            print(
                f"{timestamp:%m-%d %H:%M}, "
                f"{lat_lon[0]:.3f} {lat_lon[1]:.3f}"
            )

This function embodies a number of assumptions about the available data. It combines the physical format, logical layout, and processing into a single operation. A small change to the layout – for example, a column name change – can be difficult to manage.

In this recipe, we'll isolate the various layers of processing to insulate the application from changes like these. This separation of concerns can create a much more flexible application.

How to do it...

We'll start by defining a useful dataclass. We'll create functions to read raw data, and create dataclass instances from this cleaned data. We'll include some of the data conversion functions in the dataclass definition to properly encapsulate it. We'll start with the target class definition, Waypoint_1:

  1. Import the modules and definitions required. We'll be introducing a dataclass with an optional cached attribute, and we'll also need the datetime module for the derived timestamp plus Dict for the type hint on the source rows:
    import datetime
    from dataclasses import dataclass, field
    from typing import Dict, Optional
    
  2. Define a base dataclass that matches the raw input from the CSV file. The names here should be defined clearly, irrespective of the actual headers in the CSV file. This is an initial set of attributes, to which we'll add any derived values later:
    @dataclass
    class Waypoint_1:
        arrival_date: str
        arrival_time: str
        lat: str
        lon: str
    
  3. Add any derived attributes to this dataclass. This should focus on attributes with values that are expensive to compute. These attributes will be a cache for the derived value to avoid computing it more than once. These will have a type hint of Optional, and will use the field() function to define them as init=False and provide a default value, which is usually None:
        _timestamp: Optional[datetime.datetime] = field(
            init=False, default=None
        )
    
  4. Write properties to compute the expensive derived values. The results are cached as an attribute value of the dataclass. In this example, the conversion of text date and time fields into a more useful datetime object is a relatively expensive operation. The result is cached in an attribute value, _timestamp, and returned after the initial computation:
        @property
        def arrival(self):
            if self._timestamp is None:
                ts_date = datetime.datetime.strptime(
                    self.arrival_date, "%Y-%m-%d"
                ).date()
                ts_time = datetime.datetime.strptime(
                    self.arrival_time, "%H:%M:%S"
                ).time()
                self._timestamp = datetime.datetime.combine(
                    ts_date, ts_time)
            return self._timestamp
    
  5. Refactor any relatively inexpensive computed values as properties of the dataclass. These values don't need to be cached because the performance gains from a cache are so small:
        @property
        def lat_lon(self):
            return float(self.lat), float(self.lon)
    
  6. Define a static method to create instances of this class from source data. This method contains the mapping from the source column names in the CSV file to more meaningful attribute names in the application. Note the type hint for this method must use the class name with quotes – 'Waypoint_1' – because the class is not fully defined when the method is created. When the body of the method is evaluated, the class will exist, and a quoted name is not used:
        @staticmethod
        def from_source(row: Dict[str, str]) -> 'Waypoint_1':
            name_map = {
                'date': 'arrival_date',
                'time': 'arrival_time',
                'lat': 'lat',
                'lon': 'lon',
            }
            return Waypoint_1(
                **{name_map[header]: value
                   for header, value in row.items()}
            )
    
  7. Write a generator expression to use the new dataclass when reading the source data:
    def show_waypoints_1(data_path: Path) -> None:
        with data_path.open() as data_file:
            data_reader = csv.DictReader(data_file)
            waypoint_iter = (
                Waypoint_1.from_source(row)
                    for row in data_reader
            )
            for row in waypoint_iter:
                print(
                    f"{row.arrival:%m-%d %H:%M}, "
                    f"{row.lat_lon[0]:.3f} "
                    f"{row.lat_lon[1]:.3f}"
                )
    

Looking back at the Reading delimited files with the CSV module recipe, earlier in this chapter, the structure of this function is similar to the functions in that recipe. Similarly, this design also echoes the processing shown in the Using dataclasses to simplify working with CSV files recipe, earlier in this chapter.

The expression (Waypoint_1.from_source(row) for row in data_reader) applies the Waypoint_1.from_source() static method to each raw dictionary. This method maps the source column names to class attributes, and then creates an instance of the Waypoint_1 class.

The remaining processing has to be rewritten so that it uses the attributes of the new dataclass that was defined. This often leads to simplification because row-level computations of derived values have been refactored into the class definition, removing them from the overall processing. The remaining overall processing is a display of the detailed values from each row of the source CSV file.
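
As a usage sketch, assuming a waypoints.csv file containing the sample rows shown earlier, the refactored function is called like any other; the path here is illustrative:

from pathlib import Path

show_waypoints_1(Path("waypoints.csv"))

For the sample data, the first line of output would be 11-27 09:15, 32.832 -79.934, the same result the original show_waypoints_raw() function produces.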

How it works...

There are several parts to this recipe. Firstly, we've used the csv module for the essential parsing of rows and columns of data. We've also leveraged the Reading delimited files with the CSV module recipe to process the physical format of the data.

Secondly, we've defined a dataclass that provides a minimal schema for our data. The minimal schema is supplemented with the from_source() function to convert the raw data into instances of the dataclass. This provides a more complete schema definition because it has a mapping from the source columns to the dataclass attributes.

Finally, we've wrapped the csv reader in a generator expression to build dataclass objects for each row. This change permits the revision of the remaining code in order to focus on the object defined by the dataclass, separate from CSV file complications.

Instead of row[2] or row['date'], we can now use row.arrival_date to refer to a specific column. This is a profound change; it can simplify the presentation of complex algorithms.

There's more...

A common problem that CSV files have is blank rows that contain no useful data. Discarding empty rows requires some additional processing when attempting to create the row object. We need to make two changes:

  1. Expand the from_source() method so that it has a slightly different return value. It often works out well to change the return type from 'Waypoint_1' to Optional['Waypoint_1'] and return a None object instead of an empty or invalid Waypoint_1 instance.
  2. Expand the waypoint_iter expression in order to include a filter to reject the None objects.

We'll look at each of these separately, starting with the revision to the from_source() method.

Each source of data has unique rules for what constitutes valid data. In this example, we'll use the rule that all four fields must be present and have data that fits the expected patterns: dates, times, or floating-point numbers.

This definition of valid data leads to a profound rethinking of the way the dataclass is defined. The application really only uses two attributes: the arrival time, which is a datetime.datetime object, and the latitude and longitude, which together form a Tuple[float, float] object.

A more useful class definition, then, is this:

from typing import Tuple

@dataclass
class Waypoint_2:
    arrival: datetime.datetime
    lat_lon: Tuple[float, float]

Given these two attributes, we can redefine the from_source() method to build this from a row of raw data:

@staticmethod
def from_source(row: Dict[str, str]) -> Optional['Waypoint_2']:
    try:
        ts_date = datetime.datetime.strptime(
            row['date'], "%Y-%m-%d"
        ).date()
        ts_time = datetime.datetime.strptime(
            row['time'], "%H:%M:%S"
        ).time()
        arrival = datetime.datetime.combine(
            ts_date, ts_time)
        return Waypoint_2(
            arrival=arrival,
            lat_lon=(
                float(row['lat']),
                float(row['lon'])
            )
        )
    except (ValueError, KeyError):
        return None

This function will locate the source values: row['date'], row['time'], row['lat'], and row['lon']. It optimistically attempts a number of conversions, including the date-time combination and the float conversion of the latitude and longitude values. If any of these conversions fail, the resulting ValueError or KeyError exception is caught, and a None value is returned. If all of the conversions succeed, an instance of the Waypoint_2 class is built and returned from this function.

Once this change is in place, we can make one more change to the main application:

def show_waypoints_2(data_path: Path) -> None:
    with data_path.open() as data_file:
        data_reader = csv.DictReader(data_file)
        waypoint_iter = (
            Waypoint_2.from_source(row)
                for row in data_reader
        )
        for row in filter(None, waypoint_iter):
            print(
                f"{row.arrival:%m-%d %H:%M}, "
                f"{row.lat_lon[0]:.3f} "
                f"{row.lat_lon[1]:.3f}"
            )

We've changed the for statement that consumes values from the waypoint_iter generator expression. This change introduces the filter() function in order to exclude None values from the source of data. Combined with the change to the from_source() method, we can now exclude bad data and tolerate source file changes without complex rewrites.
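
The filter(None, waypoint_iter) idiom keeps only the items that are truthy, so every None returned by from_source() is silently dropped. A quick illustration of the built-in behavior:

>>> list(filter(None, [1, None, 2, None, 3]))
[1, 2, 3]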

See also

  • In the Reading delimited files with the csv module recipe, earlier in this chapter, we looked at the basics of using the csv module to parse files.
  • The Using dataclasses to simplify working with CSV files recipe, earlier in this chapter, covered a different approach to working with complex data mappings.